Currently we skip looking through truncates if the sign flag is used. But that's overly restrictive.
It's safe to look through the truncate as long as we ensure one of the 3 things when we shrink. Either the MSB of the mask at the shrunken size isn't set. If the mask bit is set then either the shrunk size needs to be equal to the compare size or the sign flag needs to be unused.
There are still missed opportunities to shrink a load and fold it in here. This will be fixed in a future patch.
Differential Revision: https://reviews.llvm.org/D52669
llvm-svn: 343498
This patch adds another variant class to identify zero-idiom VPERM2F128rr
instructions.
On Jaguar, a VPERM wih bit 3 and 7 of the mask set, is a zero-idiom.
Differential Revision: https://reviews.llvm.org/D52663
llvm-svn: 343452
Summary:
While looking at PR35606, I found out that the scheduling info is incorrect.
One can check that it's really a P5+P6 and not a 2*P56 with:
echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=-
(vandps executes on P5 only)
Reviewers: craig.topper, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52541
llvm-svn: 343447
We can only copy between a k-register and a GR32/GR64 register.
This patch detects that the copy will be illegal and prevents the domain reassignment from happening for that closure.
This probably isn't the best fix, and we should probably figure out how to handle this correctly.
Fixes PR38803.
llvm-svn: 343443
There's a conditional report_fatal_error just above this llvm_unreachable. The optimizer when seeing the unreachable removes the conditional and just makes any other error trigger the existing report_fatal_error.
llvm-svn: 343428
Summary:
This function turns (X >> C1) & C2 into a BMI BEXTR or TBM BEXTRI instruction. For BMI BEXTR we have to materialize an immediate into a register to feed to the BEXTR instruction.
The BMI BEXTR instruction is 2 uops on Intel CPUs. It looks like on SKL its one port 0/6 uop and one port 1/5 uop. Despite what Agner's tables say. I know one of the uops is a regular shift uop so it would have to go through the port 0/6 shifter unit. So that's the same or worse execution wise than the shift+and which is one 0/6 uop and one 0/1/5/6 uop. The move immediate into register is an additional 0/1/5/6 uop.
For now I've limited this transform to AMD CPUs which have a single uop BEXTR. If may also might make sense if we can fold a load or if the and immediate is larger than 32-bits and can't be encoded as a sign extended 32-bit value or if LICM or CSE can hoist the move immediate and share it. But we'd need to look more carefully at that. In the regression I looked at it doesn't look load folding or large immediates were occurring so the regression isn't caused by the loss of those. So we could try to be smarter here if we find a compelling case.
Reviewers: RKSimon, spatel, lebedev.ri, andreadb
Reviewed By: RKSimon
Subscribers: llvm-commits, andreadb, RKSimon
Differential Revision: https://reviews.llvm.org/D52570
llvm-svn: 343399
By removing demanded target shuffles that simplify to zero/undef/identity before simplifying its inputs we improve chances of further simplification, as only the immediate parent user of the combined is added back to the work list - this still doesn't help us if its passed through other ops though (bitcasts....).
llvm-svn: 343390
The shift amount might have peeked through a extract_subvector, altering the number of vector elements in the 'Amt' variable - so we were incorrectly calculating the ratio when peeking through bitcasts, resulting in incorrectly detecting splats.
llvm-svn: 343373
Similar to the existing ISD::SRL constant vector shifts from D49562, this patch adds ISD::SRA support with ISD::MULHS.
As we're dealing with signed values, we have to handle shift by zero and shift by one special cases, so XOP+AVX2/AVX512 splitting/extension is still a better solution - really we should still use ISD::MULHS if one of the special cases are used but for now I've just left a TODO and filtered by isKnownNeverZero.
Differential Revision: https://reviews.llvm.org/D52171
llvm-svn: 343093
This removes an int->fp bitcast between the surrounding code and the movmsk. I had already added a hack to combineMOVMSK to try to look through this bitcast to improve the SimplifyDemandedBits there.
But I found an additional issue where the bitcast was preventing combineMOVMSK from being called again after earlier nodes in the DAG are optimized. The bitcast gets revisted, but not the user of the bitcast. By using integer types throughout, the bitcast doesn't get in the way.
llvm-svn: 343046
This is the final (I hope!) problem pattern mentioned in PR37749:
https://bugs.llvm.org/show_bug.cgi?id=37749
We are trying to avoid an AVX1 sinkhole caused by having 256-bit bitwise logic ops but no other 256-bit integer ops.
We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like
extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches
that can over-reach. Ie, splitting to 128-bit does not look like a win in most cases with >1 256-bit op.
The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test,
we have this vector-type-legalized sequence:
t29: v8i32 = concat_vectors t27, t28
t30: v4i64 = bitcast t29
t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, ...
t31: v4i64 = bitcast t18
t32: v4i64 = xor t30, t31
t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, ...
t34: v4i64 = bitcast t9
t35: v4i64 = and t32, t34
t36: v8i32 = bitcast t35
t37: v4i32 = extract_subvector t36, Constant:i64<0>
t38: v4i32 = extract_subvector t36, Constant:i64<4>
Differential Revision: https://reviews.llvm.org/D52318
llvm-svn: 343008
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.
The uops are slightly different to the register variant, so requires a +1uop tweak
llvm-svn: 342969
We're missing quite a bit of data for these instruction, removing the overrides makes this obvious - inconsistent reg/mem variants is a concern as well.
Also, we have Divider resources (HWDivider etc.) but they aren't actually used consistently.
llvm-svn: 342904
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.
This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.
llvm-svn: 342892
Variable Shifts/Rotates using the CL register have different behaviours to the immediate instructions - split accordingly to help remove yet more repeated overrides from the schedule models.
llvm-svn: 342852
Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models.
Yet another step on the scheduler model cleanup marathon......
llvm-svn: 342846
This is an alternative to https://reviews.llvm.org/D37896. We can't decompose
multiplies generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some existing code that overlaps with
this transform.
This extends D52195 and may resolve PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
(still an open question about transforming legal vector multiplies, but we
could open another bug report for those)
llvm-svn: 342844
The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.
I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.
This is necessary to allow WriteRotate to be updated to remove other rotate overrides.
It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions.
llvm-svn: 342842
Despite being rotates, these more modern instructions avoid many of the quirks of the regular x86 rotate instructions and consistently have a schedule closer to shifts.
llvm-svn: 342839
NFCI for now, but it should make it easier to remove a lot of unnecessary overrides in a future commit.
Now that funnel shift intrinsics are coming online we need to get this cleaned up to make vectorization costs from scalar rotate patterns more straightforward.
llvm-svn: 342837
Our lowering that tries to avoid this sign extend can be defeated by the DAG combine folding it with a truncate.
The pattern needs to extend to an v8i32 then truncate back down to v8i16.
llvm-svn: 342830
Summary: Similar to D51893 which was for memcpy
Reviewers: efriedma
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52063
llvm-svn: 342796
We don't have a vXi8 shift left so we need to bitcast to a vXi16 vector to perform the shift. If we let lowering legalize the vXi8 shift we get an extra and that we don't need and fail to remove.
llvm-svn: 342795
Previously we used SUBREG_TO_REG+MOV32ri. But regular isel was changed recently to use the MOV32ri64 pseudo. Fast isel now does the same.
llvm-svn: 342788
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.
This is a follow-up of r342555.
On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.
Differential Revision: https://reviews.llvm.org/D52347
llvm-svn: 342728
x86 had 2 versions of peekThroughBitcast. DAGCombiner had 1. Plus, it had a 1-off implementation for the one-use variant.
Move the x86 versions of the code to SelectionDAG, so we don't have different copies of the code.
No functional change intended.
I'm putting this next to isBitwiseNot() because I am planning to use it in there. Another option is next to the
helpers in the ISD namespace (eg, ISD::isConstantSplatVector()). But if there's no good reason for those to be
there, I'd prefer to pull other helpers over to SelectionDAG in follow-up steps.
Differential Revision: https://reviews.llvm.org/D52285
llvm-svn: 342669
Enable enableMultipleCopyHints() on X86.
Original Patch by @jonpa:
While enabling the mischeduler for SystemZ, it was discovered that for some reason a test needed one extra seemingly needless COPY (test/CodeGen/SystemZ/call-03.ll). The handling for that is resulted in this patch, which improves the register coalescing by providing not just one copy hint, but a sorted list of copy hints. On SystemZ, this gives ~12500 less register moves on SPEC, as well as marginally less spilling.
Instead of improving just the SystemZ backend, the improvement has been implemented in common-code (calculateSpillWeightAndHint(). This gives a lot of test failures, but since this should be a general improvement I hope that the involved targets will help and review the test updates.
Differential Revision: https://reviews.llvm.org/D38128
llvm-svn: 342578
As the code comments suggest, these are about splitting, and they
are not necessarily limited to lowering, so that misled me.
There's nothing that's actually x86-specific in these either, so
they might be better placed in a common header so any target can
use them.
llvm-svn: 342575
This patch adds an initial x86 SimplifyDemandedVectorEltsForTargetNode implementation to handle target shuffles.
Currently the patch only decodes a target shuffle, calls SimplifyDemandedVectorElts on its input operands and removes any shuffle that reduces to undef/zero/identity.
Future work will need to integrate this with combineX86ShufflesRecursively, add support for other x86 ops, etc.
NOTE: There is a minor regression that appears to be affecting further (extractelement?) combines which I haven't been able to solve yet - possibly something to do with how nodes are added to the worklist after simplification.
Differential Revision: https://reviews.llvm.org/D52140
llvm-svn: 342564
This patch adds the ability for processor models to describe dependency breaking
instructions.
Different processors may specify a different set of dependency-breaking
instructions.
That means, we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.
The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tabegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).
```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
return false;
}
virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
return isZeroIdiom(MI);
}
```
An instruction MI is a dependency-breaking instruction if a call to method
isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to
true. Similarly, an instruction MI is a special case of zero-idiom dependency
breaking instruction if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used for those targets that may want to select which machine
operands have their dependency broken (see comments in code).
Note that by default, subtargets don't know about the existence of
dependency-breaking. In the absence of external information, those method calls
would always return false.
A new tablegen class named STIPredicate has been added by this patch to let
processor models classify instructions that have properties in common. The idea
is that, a MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, with the idea that instructions of a same class all have a
property in common.
STIPredicate definitions are essentially a collection of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.
This patch introduces two special kind of STIPredicate classes named
IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a
definition for those in the BtVer2 scheduling model only.
This patch supersedes the one committed at r338372 (phabricator review: D49310).
The main advantages are:
- We can describe subtarget predicates via tablegen using STIPredicates.
- We can describe zero-idioms / dep-breaking instructions directly via
tablegen in the scheduling models.
In future, the STIPredicates framework can be used for solving other problems.
Examples of future developments are:
- Teach how to identify optimizable register-register moves
- Teach how to identify slow LEA instructions (each subtarget defining its own
concept of "slow" LEA).
- Teach how to identify instructions that have undocumented false dependencies
on the output registers on some processors only.
It is also (in my opinion) an elegant way to expose knowledge to both external
tools like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.
This new design feature is also an "opt-in" feature. Processor models don't have
to use the new STIPredicates. It has all been designed to be as unintrusive as
possible.
Differential Revision: https://reviews.llvm.org/D52174
llvm-svn: 342555
This is an alternative to D37896. I don't see a way to decompose multiplies
generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some duplicate code that overlaps with
this transform.
As a first step, we're only getting the most clear wins on the vector examples
requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
As noted in the code comment, it's likely that the x86 constraints are tighter
than necessary, but it may not always be a win to replace a pmullw/pmulld.
Differential Revision: https://reviews.llvm.org/D52195
llvm-svn: 342554
The 0x800 bit in @feat.00 needs to be set in order to make LLD pick up
the .gfid$y table. I believe this is fine to set even if we don't emit
the instrumentation.
We haven't emitted @feat.00 on 64-bit before. I see that MSVC does emit
it, but I'm not entirely sure what the default value should be. I went
with zero since that seems as safe as not emitting the symbol in the
first place.
Differential Revision: https://reviews.llvm.org/D52235
llvm-svn: 342532
Summary:
The IR reference for the `byval` attribute states:
```
This indicates that the pointer parameter should really be passed by value
to the function. The attribute implies that a hidden copy of the pointee is
made between the caller and the callee, so the callee is unable to modify
the value in the caller. This attribute is only valid on LLVM pointer arguments.
```
However, on Win64, this attribute is unimplemented and the raw pointer is
passed to the callee instead. This is problematic, because frontend authors
relying on the implicit hidden copy (as happens for every other calling
convention) will see the passed value silently (if mutable memory) or
loudly (by means of a crash) modified because the callee treats the
location as scratch memory space it is allowed to mutate.
At this point, it's worth taking a step back to understand the context.
In most calling conventions, aggregates that are too large to be passed
in registers, instead get *copied* to the stack at a fixed (computable
from the signature) offset of the stack pointer. At the LLVM, we hide
this hidden copy behind the byval attribute. The caller passes a pointer
to the desired data and the callee receives a pointer, but these pointers
are not the same. In particular, the pointer that the callee receives
points to temporary stack memory allocated as part of the call lowering.
In most calling conventions, this pointer is never realized in registers
or memory. The temporary memory is simply defined by an implicit
offset from the stack pointer at function entry.
Win64, uniquely, works differently. The structure is still passed in
memory, but instead of being stored at an implicit memory offset, the
caller computes a pointer to the temporary memory and passes it to
the callee as a regular pointer (taking up a register, or if all
registers are taken up, an additional stack slot). Presumably, this
was done to allow eliding the copy when passing aggregates through
several functions on the stack.
This explains why ignoring the `byval` attribute mostly works on Win64.
The argument simply gets passed as a pointer and as long as we're ok
with the callee trampling all over that memory, there are no ill effects.
However, it does contradict the documentation of the `byval` attribute
which specifies that there is to be an implicit copy.
Frontends can of course work around this by never emitting the `byval`
attribute for Win64 and creating `alloca`s for the requisite temporary
stack slots (and that does appear to be what frontends are doing).
However, the presence of the `byval` attribute is not a trap for
frontend authors, since it seems to work, but silently modifies the
passed memory contrary to documentation.
I see two solutions:
- Disallow the `byval` attribute in the verifier if using the Win64
calling convention.
- Make it work by simply emitting a temporary stack copy as we would
with any other calling convention (frontends can of course always
not use the attribute if they want to elide the copy).
This patch implements the second option (make it work), though I would
be fine with the first also.
Ref: https://github.com/JuliaLang/julia/issues/28338
Reviewers: rnk
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51842
llvm-svn: 342402
For constant non-uniform cases we'll never introduce more and/andn/or selects than already occur in generic pre-SSE41 ISD::SRL lowering.
llvm-svn: 342352
https://bugs.llvm.org/show_bug.cgi?id=38949
It's not clear to me that we even need a one-use check in this fold.
Ie, 2 independent loads might be better than a load+dependent shuffle.
Note that the existing re-use tests are not affected. We actually do form a
broadcast node in those tests now because there's no extra use of the
insert_subvector node in those cases. But something later in isel pattern
matching decides that it is not worth using a broadcast for the full load in
those tests:
Legalized selection DAG: %bb.0 'test_broadcast_2f64_4f64_reuse:'
t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
t18: v4f64 = insert_subvector undef:v4f64, t7, Constant:i64<0>
t20: v4f64 = insert_subvector t18, t7, Constant:i64<2>
Becomes:
t7: v2f64,ch = load<(load 16 from %ir.p0)> t0, t2, undef:i64
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t10: ch = store<(store 16 into %ir.p1)> t7:1, t7, t4, undef:i64
t21: v4f64 = X86ISD::SUBV_BROADCAST t7
ISEL: Starting selection on root node: t21: v4f64 = X86ISD::SUBV_BROADCAST t7
...
Created node: t27: v4f64 = INSERT_SUBREG IMPLICIT_DEF:v4f64, t7, TargetConstant:i32<7>
Morphed node: t21: v4f64 = VINSERTF128rr t27, t7, TargetConstant:i8<1>
llvm-svn: 342347
Summary: This unfortunately adds a move, but isn't that better than going to the int domain and back?
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D52134
llvm-svn: 342327
Summary:
MOVMSK only care about the sign bit so we don't need the setcc to fill the whole element with 0s/1s. We can just shift the bit we're looking for into the sign bit. This saves a constant pool load.
Inspired by PR38840.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: lebedev.ri, llvm-commits
Differential Revision: https://reviews.llvm.org/D52121
llvm-svn: 342326
Attempt to lower a shuffle as an unpack of elements from two inputs followed by a single-input (wider) permutation.
As long as the permutation is wider this is a win - there may be some circumstances where same size permutations would also be useful but I've left that for future work.
Differential Revision: https://reviews.llvm.org/D52043
llvm-svn: 342257
When replacing a named register input to the appropriately sized
sub/super-register. In the case of a 64-bit value being assigned to a
register in 32-bit mode, match GCC's assignment.
Reviewers: eli.friedman, craig.topper
Subscribers: nickdesaulniers, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D51502
llvm-svn: 342175
Summary:
Previously we type legalized v2i32 div/rem by promoting to v2i64. But we don't support div/rem of vectors so op legalization would then scalarize it using i64 scalar ops since it doesn't know about the original promotion. 64-bit scalar divides on Intel hardware are known to be slow and in 32-bit mode they require a libcall.
This patch switches type legalization to do the scalarizing itself using i32.
It looks like the division by power of 2 optimization is still kicking in and leaving the code as a vector. The division by other constant optimization doesn't kick in pre type legalization since it ignores illegal types. And previously, after type legalization we scalarized the v2i64 since we don't have v2i64 MULHS/MULHU support.
Another option might be to widen v2i32 to v4i32 so we could do division by constant optimizations, but we'd have to be careful to only do that for constant divisors or we risk scalaring to 4 scalar divides.
Reviewers: RKSimon, spatel
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51325
llvm-svn: 342114
There's no advantage to this instruction unless you need to avoid touching other flag bits. It's encoding is longer, it can't fold an immediate, it doesn't write all the flags.
I don't think gcc will generate this instruction either.
Fixes PR38852.
Differential Revision: https://reviews.llvm.org/D51754
llvm-svn: 342059
Summary:
In GNUX23, is64BitMode returns true, but pointers are 32-bits. So we shouldn't copy pointer values into RSI/RDI since the widths don't match.
Fixes PR38865 despite what the title says. I think the llvm_unreachable in the copyPhysReg code tricked the optimizer and made the fatal error trigger.
Reviewers: rnk, efriedma, MatzeB, echristo
Reviewed By: efriedma
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51893
llvm-svn: 342015
In r337348, I changed lowering to prefer X86ISD::UNPCKL/UNPCKH opcodes over MOVLHPS/MOVHLPS for v2f64 {0,0} and {1,1} shuffles when we have SSE2. This enabled the removal of a bunch of weirdly bitcasted isel patterns in r337349. To avoid changing the tests I placed a gross hack in isel to still emit movhlps instructions for fake unary unpckh nodes. A similar hack was not needed for unpckl and movlhps because we do execution domain switching for those. But unpckh and movhlps have swapped operand order.
This patch removes the hack.
This is a code size increase since unpckhpd requires a 0x66 prefix and movhlps does not. But if that's a big concern we should be using movhlps for all unpckhpd opcodes and let commuteInstruction turnit into unpckhpd when its an advantage.
Differential Revision: https://reviews.llvm.org/D49499
llvm-svn: 341973
GNUX32 uses 32-bit pointers despite is64BitMode being true. So we should use EAX to return the value.
Fixes ones of the failures from PR38865.
Differential Revision: https://reviews.llvm.org/D51940
llvm-svn: 341972
MOVMSKPS and MOVMSKPD both take FP types, but likely the operations before it are on integer types with just a int->fp bitcast between them. If the bitcast isn't used by anything else and doesn't change the element width we can look through it to simplify the integer ops.
llvm-svn: 341915
I'm having a hard time finding a test case for this, but we should be consistent here. The fact that we canonicalize all zeros and all ones constants to vXi32 and all other constants to loads makes this hard to hit the easy DAG combine infinite loop we get for some of the other types.
llvm-svn: 341859
We have isel patterns for v4i32/v4f64 that artificially widen to v8i32/v8f64 so just use that.
If x86-experimental-vector-widening-legalization is enabled, we don't need any custom legalization and can just return. I've modified the test RUN lines to cover this case.
llvm-svn: 341765
Summary:
This patch allows vectors with a power of 2 number of elements and i8/i16 element type to select paddus/psubus instructions. ReplaceNodeResults has been updated to custom widen these operations up to 128 bits like we already do for PAVG.
Another step towards fixing PR38691
Reviewers: RKSimon, spatel
Reviewed By: RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51818
llvm-svn: 341753
The generic type legalizer will scalarize vXi1 instructions getting rid of the vector entirely. Creating wider vector instructions is just going to prevent that.
llvm-svn: 341705
The type legalizer will try to scalarize this and fail.
It looks like there's some other v1iX oddities out there too since we still generated some vector instructions.
llvm-svn: 341704
Similar to what was recently done for addcarry/subborrow and has been done for rdrand/rdseed for a while. It's better to use two results and an explicit store in IR when the store isn't part of the semantics of the instruction. This allows store->load forwarding to happen in the middle end. Or the store to be removed if its never loaded.
Differential Revision: https://reviews.llvm.org/D51803
llvm-svn: 341698
We should represent the store directly in IR instead. This gives the middle end a chance to remove it if it can see a load from the same address.
Differential Revision: https://reviews.llvm.org/D51769
llvm-svn: 341677
Previously we only handled loads in operand 0, but nothing guarantees the load will be operand 0 for commutable operations.
Differential Revision: https://reviews.llvm.org/D51768
llvm-svn: 341675
ADC is commutable and the load could be in either operand, but we were only checking operand 0.
Ideally we'd mark X86adc_flag as commutable and tablegen would automatically do this, but the EFLAGS register mention is preventing it.
llvm-svn: 341606
The peephole pass likely gets this normally, but we should be doing it during isel.
Ideally we'd just make the X86adc_flag pattern SDNPCommutable, but the tablegen doesn't handle that when one of the operands is a register reference.
llvm-svn: 341596
This basically reverts a change made in r336217, but improves the text of the error message for not allowing IP-relative addressing in 32-bit mode.
Fixes PR38826.
Patch by Iain Sandoe.
llvm-svn: 341512
This removes the FrameAccess struct that was added to the interface
in D51537, since the PseudoValue from the MachineMemoryOperand
can be safely casted to a FixedStackPseudoSourceValue.
Reviewers: MatzeB, thegameg, javed.absar
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D51617
llvm-svn: 341454
On Windows, if shouldAssumeDSOLocal returns false, it's either a
dllimport reference, or a reference that we should treat as non-local
and create a stub for.
Clean up AArch64Subtarget::ClassifyGlobalReference a little while
touching the flag handling relating to dllimport.
Differential Revision: https://reviews.llvm.org/D51590
llvm-svn: 341402
Load Hardening.
Wires up the existing pass to work with a proper IR attribute rather
than just a hidden/internal flag. The internal flag continues to work
for now, but I'll likely remove it soon.
Most of the churn here is adding the IR attribute. I talked about this
Kristof Beyls and he seemed at least initially OK with this direction.
The idea of using a full attribute here is that we *do* expect at least
some forms of this for other architectures. There isn't anything
*inherently* x86-specific about this technique, just that we only have
an implementation for x86 at the moment.
While we could potentially expose this as a Clang-level attribute as
well, that seems like a good question to defer for the moment as it
isn't 100% clear whether that or some other programmer interface (or
both?) would be best. We'll defer the programmer interface side of this
for now, but at least get to the point where the feature can be enabled
without relying on implementation details.
This also allows us to do something that was really hard before: we can
enable *just* the indirect call retpolines when using SLH. For x86, we
don't have any other way to mitigate indirect calls. Other architectures
may take a different approach of course, and none of this is surfaced to
user-level flags.
Differential Revision: https://reviews.llvm.org/D51157
llvm-svn: 341363
implementing the proposed mitigation technique described in the original
design document.
The idea is to check after calls that the return address used to arrive
at that location is in fact the correct address. In the event of
a mis-predicted return which reaches a *valid* return but not the
*correct* return, this will detect the mismatch much like it would
a mispredicted conditional branch.
This is the last published attack vector that I am aware of in the
Spectre v1 space which is not mitigated by SLH+retpolines. However,
don't read *too* much into that: this is an area of ongoing research
where we expect more issues to be discovered in the future, and it also
makes no attempt to mitigate Spectre v4. Still, this is an important
completeness bar for SLH.
The change here is of course delightfully simple. It was predicated on
cutting support for post-instruction symbols into LLVM which was not at
all simple. Many thanks to Hal Finkel, Reid Kleckner, and Justin Bogner
who helped me figure out how to do a bunch of the complex changes
involved there.
Differential Revision: https://reviews.llvm.org/D50837
llvm-svn: 341358
retpolines.
This implements the core design of tracing the intended target into the
target, checking it, and using that to update the predicate state. It
takes advantage of a few interesting aspects of SLH to make it a bit
easier to implement:
- We already split critical edges with conditional branches, so we can
assume those are gone.
- We already unfolded any memory access in the indirect branch
instruction itself.
I've left hard errors in place to catch if any of these somewhat subtle
invariants get violated.
There is some code that I can factor out and share with D50837 when it
lands, but I didn't want to couple landing the two patches, so I'll do
that in a follow-up cleanup commit if alright.
Factoring out the code to handle different scenarios of materializing an
address remains frustratingly hard. In a bunch of cases you want to fold
one of the cases into an immediate operand of some other instruction,
and you also have both symbols and basic blocks being used which require
different methods on the MI builder (and different operand kinds).
Still, I'll take a stab at sharing at least some of this code in
a follow-up if I can figure out how.
Differential Revision: https://reviews.llvm.org/D51083
llvm-svn: 341356
A ReadAdvance was incorrectly added to the SchedReadWrite list associated with
the following SSE instructions:
sqrtss
sqrtsd
rsqrtss
rcpss
As a consequence, a wrong operand latency was computed for the register operand
used as the base address of the folded load operand.
This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases.
There is still a problem with correctly modeling partial register writes on XMM
registers This other problem is currently tracked here:
https://bugs.llvm.org/show_bug.cgi?id=38813
Differential Revision: https://reviews.llvm.org/D51542
llvm-svn: 341326
Also adjust some of dsymutil's headers to put the header guards at the top,
otherwise the compiler will not recognize them as header guards.
llvm-svn: 341323
For instructions that spill/fill to and from multiple frame-indices
in a single instruction, hasStoreToStackSlot and hasLoadFromStackSlot
should return an array of accesses, rather than just the first encounter
of such an access.
This better describes FI accesses for AArch64 (paired) LDP/STP
instructions.
Reviewers: t.p.northover, gberry, thegameg, rengolin, javed.absar, MatzeB
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D51537
llvm-svn: 341301
These intrinsics use the same implementation as PTEST intrinsics, but use vXi1 vectors.
New clang builtins will be accompanying them shortly.
llvm-svn: 341259
This patch recognizes shuffles that shift elements and fill with zeros. I've copied and modified the shift matching code we use for normal vector registers to do this. I'm not sure if there's a good way to share more of this code without making the existing function more complex than it already is.
This will be used to enable kshift intrinsics in clang.
Differential Revision: https://reviews.llvm.org/D51401
llvm-svn: 341227
The presence of a ReadAdvance for input operand #0 is problematic
because it changes the input latency of the register used as the base address
for the folded load.
A broadcast cannot start executing if the load address hasn't been computed yet.
In the llvm-mca example, the VBROADCASTSS is dependent on the address generated
by the LEAQ. That means, it cannot start until LEAQ reaches the write-back
stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3
cycles in advance.
Differential Revision: https://reviews.llvm.org/D51534
llvm-svn: 341222
This patch fixes the number of micro opcodes, and processor resource cycles for
the following AVX instructions:
vinsertf128rr/rm
vperm2f128rr/rm
vbroadcastf128
Tests have been regenerated using the usual scripts in the llvm/utils directory.
Differential Revision: https://reviews.llvm.org/D51492
llvm-svn: 341185
These stubs should never be emitted for internal symbols, and
nothing in AsmPrinter ever actually use this value when producing
the stubs for COFF anyway.
llvm-svn: 341177
This assert tried to check that AND constants are only on the RHS. But its possible for both operands to be constants if one is opaque which will prevent the AND from being constant folded.
Fixes PR38771
llvm-svn: 341102
..Move all target-dependent checks into new isCopyInstrImpl method.
This change allows us to treat MoveReg-type instructions and generic
COPY instruction in the same way
Differential Revision: https://reviews.llvm.org/D49913
llvm-svn: 341072
We now only add +64bit to the CPU string for "generic" CPU. All other CPU names are assumed to have the feature flag already set if they support 64-bit. I've remove the implies from CMPXCHG8 so that Feature64Bit only comes in via CPUs or user passing -mattr=+64bit.
I've changed the assert to a report_fatal_error so it's not lost in Release builds.
The test updates are to fix things that tripped the new error.
Differential Revision: https://reviews.llvm.org/D51231
llvm-svn: 341022
Variables declared with the dllimport attribute are accessed via a
stub variable named __imp_<var>. In MinGW configurations, variables that
aren't declared with a dllimport attribute might still end up imported
from another DLL with runtime pseudo relocs.
For x86_64, this avoids the risk that the target is out of range
for a 32 bit PC relative reference, in case the target DLL is loaded
further than 4 GB from the reference. It also avoids having to make the
text section writable at runtime when doing the runtime fixups, which
makes it worthwhile to do for i386 as well.
Add stub variables for all dso local data references where a definition
of the variable isn't visible within the module, since the DLL data
autoimporting might make them imported even though they are marked as
dso local within LLVM.
Don't do this for variables that actually are defined within the same
module, since we then know for sure that it actually is dso local.
Don't do this for references to functions, since there's no need for
runtime pseudo relocations for autoimporting them; if a function from
a different DLL is called without the appropriate dllimport attribute,
the call just gets routed via a thunk instead.
GCC does something similar since 4.9 (when compiling with -mcmodel=medium
or large; from that version, medium is the default code model for x86_64
mingw), but only for x86_64.
Differential Revision: https://reviews.llvm.org/D51288
llvm-svn: 340942
Noticed while looking at D49562 codegen - we can avoid a large constant mask load and a slow VPBLENDVB select op by using VPBLENDW+VPBLENDD instead.
TODO: As discussed on the patch, we should investigate adding VPBLENDVB handling to target shuffle combining as well, that will allow us to extend this to VPBLENDW+VPBLENDW+VPBLENDD.
Differential Revision: https://reviews.llvm.org/D50074
llvm-svn: 340913
These are intrinsics for supporting kadd builtins in clang. These builtins are already in gcc to implement intrinsics from icc. Though they are missing from the Intel Intrinsics Guide.
This instruction adds two mask registers together as if they were scalar rather than a vXi1. We might be able to get away with a bitcast to scalar and a normal add instruction, but that would require DAG combine smarts in the backend to recoqnize add+bitcast. For now I'd prefer to go with the easiest implementation so we can get these builtins in to clang with good codegen.
Differential Revision: https://reviews.llvm.org/D51370
llvm-svn: 340869
These instructions were added on the PentiumPro along with CMOV.
This was already comprehended by the lowering process which should emit an alternate sequence using FCOM and FNSTW. This just makes it an explicit error if that doesn't work for some reason.
llvm-svn: 340844
This patch creates the shift mask and actual shift using the vXi16 vector shift ops.
Differential Revision: https://reviews.llvm.org/D51263
llvm-svn: 340813
We're using a 256-bit PACKUS to do the truncation, but that instruction operates on 128-bit lanes. So previously we shuffled first to rearrange the lanes. But that requires 2 shuffles. Instead we can shuffle after the PACKUS using a single VPERMQ. This matches what our normal LowerTRUNCATE code does when it uses PACKUS.
Differential Revision: https://reviews.llvm.org/D51284
llvm-svn: 340757
InstCombine mucks these up a bit. So we need to do some additional pattern matching to fix it. There are a still a few special cases not handled, but this covers the general case.
Differential Revision: https://reviews.llvm.org/D50952
llvm-svn: 340756
vXi32 support was recently moved from LowerMUL_LOHI to LowerMULH.
This commit shares the getOperand calls, switches both to use common IsSigned flag, and hoists the NumElems/NumElts variable.
llvm-svn: 340720
Summary: This was inheriting the cost from the AVX table, but should be legal under AVX512.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51267
llvm-svn: 340708
Summary:
Previously most CPUs inherited cmov support through Feature64Bit(or FeatureCMPXCHG16HB implying Feature64Bit) or FeatureSSE1.
This has the surprising side effect that -mattr=-cmov causes an assert to fire in 64-bit mode because it clears the Feature64Bit. Or in 32-bit mode, -mattr=-cmov disables any sse/avx features which seems surprising.
This patch removes the implication and instead updates hasCMOV in X86Subtarget to check SSE1 or is64Bit in addition to the regular cmov flag. This should keep most things working the way they did before. I don't believe there is a way to specific "-cmov" directly from clang so this should only effect our lower level tools.
This does stop -mattr=cx16(cmpxchg16b) from implying cmov is enabled via the 64bit flag as you can see from one of the changed tests. But that was a 32-bit test so I don't know why it enabled cx16 anyway.
For the other test I had to add -sse to override the new sse check in hasCMOV.
Reviewers: RKSimon, DavidKreitzer, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits, jfb
Differential Revision: https://reviews.llvm.org/D51228
llvm-svn: 340707
Summary: This matches gcc and one cpuid dump I found online. Given that these are considered 7th generation x86 CPU it seems likely they support cmov since cmov was added by Intel in their 6th generation.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51264
llvm-svn: 340706
I noticed this along with the patterns in D51125, but when the index is variable,
we don't convert insertelement into a build_vector.
For x86, that means these get expanded at legalization time into the loading/spilling
code that we see in the tests. I think it's always better to avoid going to memory on
these, and we get the optimal 'broadcast' if it's available.
I suspect other targets may want to look at enabling the hook. AArch64 and AMDGPU have
regression tests that would be affected (although I did not check what would happen in
those cases). In the most basic cases shown here, AArch64 would probably do much
better with a splat.
Differential Revision: https://reviews.llvm.org/D51186
llvm-svn: 340705
Summary:
The only time vector SMUL_LOHI/UMUL_LOHI nodes are created is during division/remainder lowering. If its created before op legalization, generic DAGCombine immediately turns that SMUL_LOHI/UMUL_LOHI into a MULHS/MULHU since only the upper half is used. That node will stick around through vector op legalization and will be turned back into UMUL_LOHI/SMUL_LOHI during op legalization. It will then be custom lowered by the X86 backend. Due to this two step lowering the vector shuffles created by the custom lowering get legalized after their inputs rather than before. This prevents the shuffles from being combined with any build_vector of constants.
This patch uses changes vXi32 to use MULHS/MULHU instead. This is what the later DAG combine did anyway. But by skipping the change back to UMUL_LOHI/SMUL_LOHI we lower it before any constant BUILD_VECTORS. This allows the vector_shuffle creation to constant fold with the build_vectors. This accounts for the test changes here.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D51254
llvm-svn: 340690
Summary:
Previously the value being stored is the last operand in SDNode. This causes the type legalizer to visit the mask operand before the value operand. The type legalizer was more complicated because of this since we want the type of the value to drive the decisions.
This patch moves the value to be the first operand so we visit it first during type legalization. It also simplifies the type legalization code accordingly.
X86 is currently the only in tree target that uses this SDNode. Not sure if there are any users out of tree.
Reviewers: RKSimon, delena, hfinkel, eli.friedman
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D50402
llvm-svn: 340689
This is a preliminary step for a preliminary step for D50992.
I noticed that x86 often misses chances to load a scalar directly
into a vector register.
So this patch is just allowing more of those cases to match a
broadcast op in lowerBuildVectorAsBroadcast(). The old code comment
said it doesn't make sense to use a broadcast when we're loading a
single element and everything else is undef, but I think that's the
best case in the improved tests in insert-loaded-scalar.ll. We avoid
scalar-to-vector-register move and/or less efficient shuffling.
Note that there are some existing types that were already producing
a broadcast, but that happens semi-accidentally. Ie, it's not
happening as part of lowerBuildVectorAsBroadcast(). The build vector
gets expanded into load + shuffle, and then shuffle lowering produces
the broadcast.
Description of the other test diffs:
1. avx-basic.ll - replacing load+shufle is a win.
2. sse3-avx-addsub-2.ll - vmovddup vs. vbroadcastss is neutral
3. sse41.ll - don't care - we convert that intrinsic to generic IR now, so this test is deprecated
4. vector-shuffle-128-v8.ll / vector-shuffle-256-v16.ll - pshufb alternatives with an extra instruction are not obviously bad
Differential Revision: https://reviews.llvm.org/D51125
llvm-svn: 340685
This adds a new method to ELFObjectFileBase that returns the symbols and addresses of PLT entries.
This design was suggested by pcc and eugenis in https://reviews.llvm.org/D49383.
Differential Revision: https://reviews.llvm.org/D50203
llvm-svn: 340610
The commit that added this functionality:
rL322957
may be causing/exposing a miscompile in PR38648:
https://bugs.llvm.org/show_bug.cgi?id=38648
so allow enabling/disabling to make debugging easier.
llvm-svn: 340540
subtarget features for indirect calls and indirect branches.
This is in preparation for enabling *only* the call retpolines when
using speculative load hardening.
I've continued to use subtarget features for now as they continue to
seem the best fit given the lack of other retpoline like constructs so
far.
The LLVM side is pretty simple. I'd like to eventually get rid of the
old feature, but not sure what backwards compatibility issues that will
cause.
This does remove the "implies" from requesting an external thunk. This
always seemed somewhat questionable and is now clearly not desirable --
you specify a thunk the same way no matter which set of things are
getting retpolines.
I really want to keep this nicely isolated from end users and just an
LLVM implementation detail, so I've moved the `-mretpoline` flag in
Clang to no longer rely on a specific subtarget feature by that name and
instead to be directly handled. In some ways this is simpler, but in
order to preserve existing behavior I've had to add some fallback code
so that users who relied on merely passing -mretpoline-external-thunk
continue to get the same behavior. We should eventually remove this
I suspect (we have never tested that it works!) but I've not done that
in this patch.
Differential Revision: https://reviews.llvm.org/D51150
llvm-svn: 340515
Previously we asumed a vector reduction add is part of a loop and one of the input is a phi. But the code in SelectionDAGBuilder that sets vector reduction flag handles more cases than that. It just requires that the use chain ends in a horizontal reduction. And there are no other uses. This means it can handle unrolled reduction loops.
If the initial value of the reduction was 0, an unrolled loop would begin with a vector reduction add that has two sad inputs. Previously we would only transform one side of the add, but for this case we need to transform both sides.
I've created a lambda to reuse some of the code for both sides. And fixed the variables names to remove reference to "phi".
Differential Revision: https://reviews.llvm.org/D50817
llvm-svn: 340478
Inspired by what AArch64 does for shifts, this patch attempts to replace shift amounts with neg if we can.
This is done directly as part of isel so its as late as possible to avoid breaking some BZHI patterns since those patterns need an unmasked (32-n) to be correct.
To avoid manual load folding and custom instruction selection for the negate. I've inserted new nodes in the DAG above the shift node in topological order.
Differential Revision: https://reviews.llvm.org/D48789
llvm-svn: 340441
When the key is not already in the map, the access operator[] creates an empty value and grows the map.
Resizing a map is very slow, so this needs to be avoided.
Found with csmith + asserts.
May help with
https://bugs.llvm.org/show_bug.cgi?id=25843
Patch by Tom Rix.
Differential Revision: https://reviews.llvm.org/D50780
llvm-svn: 340434
This adds the plumbing for the Tiny code model for the AArch64 backend. This,
instead of loading addresses through the normal ADRP;ADD pair used in the Small
model, uses a single ADR. The 21 bit range of an ADR means that the code and
its statically defined symbols need to be within 1MB of each other.
This makes it mostly interesting for embedded applications where we want to fit
as much as we can in as small a space as possible.
Differential Revision: https://reviews.llvm.org/D49673
llvm-svn: 340397
Summary:
So far, `isReturn` property is used to mean both a return instruction
from a functon and the end of an EH scope, a scope that starts with a EH
scope entry BB and ends with a catchret or a cleanupret instruction.
Because WinEH uses funclets, all EH-scope-ending instructions are also
real return instruction from a function. But for wasm, they only serve
as the end marker of an EH scope but not a return instruction that
exits a function. This mismatch caused incorrect prolog and epilog
generation in wasm EH scopes. This patch fixes this.
This patch is in the same vein with rL333045, which splits
`MachineBasicBlock::isEHFuncletEntry` into `isEHFuncletEntry` and
`isEHScopeEntry`.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D50653
llvm-svn: 340325
Most of these shifts are extended to vXi16 so we don't gain anything from forcing another round of generic shift lowering - we know these extended cases are legal constant splat shifts.
llvm-svn: 340307
Due to some splat handling code in getVectorShuffle, its possible for NewV1/NewV2 to have their mask modified from what is requested. This can lead to cycles being created in the DAG.
This patch examines the returned mask and makes sure its different. Long term we may need to look closer at that splat code in getVectorShuffle, or add more splat awareness to getVectorShuffle.
Fixes PR38639
Differential Revision: https://reviews.llvm.org/D50981
llvm-svn: 340214
We can safely avoid interfering with the subus combine if both inputs are freely truncatable. Either both extends, or an extend and a constant vector.
Differential Revision: https://reviews.llvm.org/D50878
llvm-svn: 340212
We were basically assuming only one operand of the compare could be an ADD node and using that to swap operands. But we can have a normal add followed by a saturing add.
This rewrites the canonicalization to just be based on the condition code.
llvm-svn: 340134
The code already support 128 and 256 and even knows to split 256 for AVX1. So we really just needed to stop looking for specific VTs and subtarget features and just look for legal VTs with i8/i16 elements.
While there, add some curly braces around outer if statement bodies that contain only another if. It makes all the closing curly braces look more regular.
llvm-svn: 340128
Extending the concept introduced in D49562, this patch lowers constant vXi8 ISD::SRL/ISD::SRA by zero/sign extending to vXi16 and using PMULLW and then truncating the high 8 bits of the result.
Differential Revision: https://reviews.llvm.org/D50781
llvm-svn: 340062
isOnlyUserOf is a little heavier because it allows the node to be used multiple times by the other node. In this case we are looking at a truncate which only has one operand so we know it can only use it once. Thus hasOneUse is better.
llvm-svn: 340059
test/CodeGen/X86/shadow-stack.ll has the following machine verifier
errors:
```
*** Bad machine code: Using a killed virtual register ***
- function: bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: %3:gr64 = MOV64rm killed %2:gr64, 1, $noreg, 8, $noreg
- operand 1: killed %2:gr64
*** Bad machine code: Using a killed virtual register ***
- function: bar
- basic block: %bb.6 entry (0x7fdc81857818)
- instruction: $rsp = MOV64rm killed %2:gr64, 1, $noreg, 16, $noreg
- operand 1: killed %2:gr64
*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function: bar
- basic block: %bb.2 entry (0x7fdc818574f8)
Virtual register %2 is used after the block.
```
The fix here is to only copy the machine operand's register without the
kill flags for all the instructions except the very last one of the
sequence.
I had to insert dummy PHIs in the test case to force the NoPHI function
property to be set to false. More on this here: https://llvm.org/PR38439
Differential Revision: https://reviews.llvm.org/D50260
llvm-svn: 340033
Normally the peephole pass converts EXTRACT_SUBREG to COPY instructions. But we're after peephole so we can't rely on it to clean these up.
To fix this, the eflags pass now emits a COPY with a subreg input.
I also noticed that in 32-bit mode we need to constrain the input to the copy to ensure the subreg is valid. Otherwise we'll fail verify-machineinstrs
Differential Revision: https://reviews.llvm.org/D50656
llvm-svn: 339945
a generically extensible collection of extra info attached to
a `MachineInstr`.
The primary change here is cleaning up the APIs used for setting and
manipulating the `MachineMemOperand` pointer arrays so chat we can
change how they are allocated.
Then we introduce an extra info object that using the trailing object
pattern to attach some number of MMOs but also other extra info. The
design of this is specifically so that this extra info has a fixed
necessary cost (the header tracking what extra info is included) and
everything else can be tail allocated. This pattern works especially
well with a `BumpPtrAllocator` which we use here.
I've also added the basic scaffolding for putting interesting pointers
into this, namely pre- and post-instruction symbols. These aren't used
anywhere yet, they're just there to ensure I've actually gotten the data
structure types correct. I'll flesh out support for these in
a subsequent patch (MIR dumping, parsing, the works).
Finally, I've included an optimization where we store any single pointer
inline in the `MachineInstr` to avoid the allocation overhead. This is
expected to be the overwhelmingly most common case and so should avoid
any memory usage growth due to slightly less clever / dense allocation
when dealing with >1 MMO. This did require several ergonomic
improvements to the `PointerSumType` to reasonably support the various
usage models.
This also has a side effect of freeing up 8 bits within the
`MachineInstr` which could be repurposed for something else.
The suggested direction here came largely from Hal Finkel. I hope it was
worth it. ;] It does hopefully clear a path for subsequent extensions
w/o nearly as much leg work. Lots of thanks to Reid and Justin for
careful reviews and ideas about how to do all of this.
Differential Revision: https://reviews.llvm.org/D50701
llvm-svn: 339940
Summary:
This prefix was added in r333421, and it changed our dumper output to
say things like "CVRegEAX" instead of just "EAX". That's a functional
change that I'd rather avoid.
I tested GCC, Clang, and MSVC, and all of them support #pragma
push_macro. They don't issue warnings whem the macro is not defined
either.
I don't have a Mac so I can't test the real termios.h header, but I
looked at the termios.h sources online and looked for other conflicts.
I saw only the CR* macros, so those are the ones we work around.
Reviewers: zturner, JDevlieghere
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D50851
llvm-svn: 339907
Allow the comparison of x86 registers in the evaluation of assembler
directives. This generalizes and simplifies the extension from r334022
to catch another case found in the Linux kernel.
Reviewers: rnk, void
Reviewed By: rnk
Subscribers: hiraditya, nickdesaulniers, llvm-commits
Differential Revision: https://reviews.llvm.org/D50795
llvm-svn: 339895
When compiling with /arch:AVX512 and optimizations turned on,
we could crash while emitting debug info because we did not
have CodeView register constants for the AVX 512 register
set defined. This patch defines them.
Differential Revision: https://reviews.llvm.org/D50819
llvm-svn: 339893
a shorter name ('x86-slh') for the internal flags and pass name.
Without this, you can't use the -stop-after or -stop-before
infrastructure. I seem to have just missed this when originally adding
the pass.
The shorter name solves two problems. First, the flag names were ...
really long and hard to type/manage. Second, the pass name can't be the
exact same as the flag name used to enable this, and there are already
some users of that flag name so I'm avoiding changing it unnecessarily.
llvm-svn: 339836
To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.
This fixes PR35833
Differential Revison: https://reviews.llvm.org/D41794
llvm-svn: 339818
AVX512 added new versions of these intrinsics that take a rounding mode. If the rounding mode is 4 the new intrinsics are equivalent to the old intrinsics.
The AVX512 intrinsics were being lowered to ISD opcodes, but the legacy SSE intrinsics were left as intrinsics. This resulted in the AVX512 instructions needing separate patterns for the ISD opcodes and the legacy SSE intrinsics.
Now we convert SSE intrinsics and AVX512 intrinsics with rounding mode 4 to the same ISD opcode so we can share the isel patterns.
llvm-svn: 339749
`MachineMemOperand` pointers attached to `MachineSDNodes` and instead
have the `SelectionDAG` fully manage the memory for this array.
Prior to this change, the memory management was deeply confusing here --
The way the MI was built relied on the `SelectionDAG` allocating memory
for these arrays of pointers using the `MachineFunction`'s allocator so
that the raw pointer to the array could be blindly copied into an
eventual `MachineInstr`. This creates a hard coupling between how
`MachineInstr`s allocate their array of `MachineMemOperand` pointers and
how the `MachineSDNode` does.
This change is motivated in large part by a change I am making to how
`MachineFunction` allocates these pointers, but it seems like a layering
improvement as well.
This would run the risk of increasing allocations overall, but I've
implemented an optimization that should avoid that by storing a single
`MachineMemOperand` pointer directly instead of allocating anything.
This is expected to be a net win because the vast majority of uses of
these only need a single pointer.
As a side-effect, this makes the API for updating a `MachineSDNode` and
a `MachineInstr` reasonably different which seems nice to avoid
unexpected coupling of these two layers. We can map between them, but we
shouldn't be *surprised* at where that occurs. =]
Differential Revision: https://reviews.llvm.org/D50680
llvm-svn: 339740
This patch removes redundant template argument `TargetName` from TIIPredicate.
Tablegen can always infer the target name from the context. So we don't need to
force users of TIIPredicate to always specify it.
This allows us to better modularize the tablegen class hierarchy for the
so-called "function predicates". class FunctionPredicateBase has been added; it
is currently used as a building block for TIIPredicates. However, I plan to
reuse that class to model other function predicate classes too (i.e. not just
TIIPredicates). For example, this can be a first step towards implementing
proper support for dependency breaking instructions in tablegen.
This patch also adds a verification step on TIIPredicates in tablegen.
We cannot have multiple TIIPredicates with the same name. Otherwise, this will
cause build errors later on, when tablegen'd .inc files are included by cpp
files and then compiled.
Differential Revision: https://reviews.llvm.org/D50708
llvm-svn: 339706
rL339686 added the case where a faux shuffle might have repeated shuffle inputs coming from either side of the OR().
This patch improves the insertion of the inputs into the source ops lists to account for this, as well as making it trivial to add support for shuffles with more than 2 inputs in the future.
llvm-svn: 339696
Summary: This revision improves previous version (rL330322) which has been reverted due to crashes.
This is the patch that lowers x86 intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that IR emits the same
machine instructions as the intrinsics.
Reviewers: craig.topper, spatel, RKSimon
Reviewed By: craig.topper
Subscribers: mike.dvoretsky, DavidKreitzer, sroland, llvm-commits
Differential Revision: https://reviews.llvm.org/D46179
llvm-svn: 339650
The behavior in 64-bit mode is different between Intel and AMD CPUs. Intel ignores the 0x66 prefix. AMD does not. objump doesn't ignore the 0x66 prefix. Since LLVM aims to match objdump behavior, we should do the same.
While I was trying to fix this I had change brtarget16/32 to use ENCODING_IW/ID instead of ENCODING_Iv to get the 0x66+REX.W case to act sort of sanely. It's still wrong, but that's a problem for another day.
The change in encoding exposed the fact that 16-bit mode disassembly of relative jumps was creating JMP_4 with a 2 byte immediate. It should have been JMP_2. From just printing you can't tell the difference, but if you dumped the encoding it wouldn't have matched what we started with.
While fixing that, it exposed that jo/jno opcodes were missing from the switch that this patch deleted and there were no test cases for them.
Fixes PR38537.
llvm-svn: 339622