FeatureSlowUAMem32.
The idea was to mark things that are slow on widely available processors
as slow in the generic CPU so that the code generated for that CPU would
be fast across those processors. However, for this feature that doesn't
work out very well at all.
The problem here is that you can very easily enable AVX or AVX2 on top
of this generic CPU. For example, this can happen just by using AVX2
intrinsics from Clang within a region of code guarded by a dynamic CPU
feature test. When you do that, the generated code with SlowUAMem32 set
is ... amazingly slower. The problem is that there really aren't very
good alternatives to the unaligned loads, and so our vector codegen
regresses significantly.
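As a rough illustration (function name and attribute contents are mine,
written in current IR syntax), this is approximately what Clang produces
for AVX2 code guarded by a dynamic CPU feature test: AVX2 is enabled per
function on top of the generic CPU, and with SlowUAMem32 inherited from
that CPU the unaligned 32-byte loads below tend to get split into
16-byte halves instead of staying single unaligned vector loads:

  define <8 x float> @sum8(ptr %p, ptr %q) #0 {
    %a = load <8 x float>, ptr %p, align 4    ; unaligned 32-byte load
    %b = load <8 x float>, ptr %q, align 4    ; unaligned 32-byte load
    %r = fadd <8 x float> %a, %b
    ret <8 x float> %r
  }

  attributes #0 = { "target-features"="+avx2" }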
The other issue is that there are plenty of AMD CPUs with AVX1 that
don't set FeatureSlowUAMem32 and so we shouldn't just check for AVX2
instead of this special feature. =/
It would be nice to have the target attribute logic be able to
enable/disable more than just one feature at a time and control this in
a more fine-grained and useful way, but that doesn't seem easy. Given
that it is only Sandybridge and Ivybridge that set this feature, for now
I'm just backing it out of the generic CPU. That has the additional
advantage of going back to the previous state that people seemed vaguely
happy with.
llvm-svn: 311740
The comment for this code indicated that it should work similarly to our
handling of add lowering above: if we see uses of an instruction other
than flag usage and store usage, it tries to avoid the specialized
X86ISD::* nodes that are designed for flag+op modeling and emits an
explicit test.
Problem is, only the add case actually did this. In all the other cases,
the logic was incomplete and inverted. Any time the value was used by
a store, we bailed on the specialized X86ISD node. All of this appears
to have been historical where we had different logic here. =/
Turns out, we have quite a few patterns designed around these nodes. We
should actually form them. I fixed the code to match what we do for add,
and it has quite a positive effect just within some of our test cases.
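As a rough sketch (not taken from the patch), the interesting shape is
a value with both a store use and a flag use; previously the store use
alone made us bail to an explicit test instead of reusing the flags
produced by the arithmetic itself:

  define i1 @and_store_and_flags(i32 %x, i32 %y, ptr %p) {
    %a = and i32 %x, %y       ; value has both a store use and a flag use
    store i32 %a, ptr %p
    %z = icmp eq i32 %a, 0    ; could reuse the flags set by the 'and'
    ret i1 %z
  }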
The only thing close to a regression I see is using:
notl %r
testl %r, %r
instead of:
xorl $-1, %r
But we can add a pattern or something to fold that back out. The
improvements seem more than worth this.
I've also worked with Craig to update the comments to no longer be
actively contradicted by the code. =[ Some of this still remains
a mystery to both Craig and myself, but this seems like a large step in
the direction of consistency and slightly more accurate comments.
Many thanks to Craig for help figuring out this nasty stuff.
Differential Revision: https://reviews.llvm.org/D37096
llvm-svn: 311737
This goes back to a discussion about IR canonicalization. We'd like to preserve and convert
more IR to 'select' than we currently do because that's likely the best choice in IR:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105335.html
...but that's often not true for codegen, so we need to account for this pattern coming into
the backend and transform it into better DAG ops.
Steps in this patch:
1. Add an EVT param to the existing convertSelectOfConstantsToMath() TLI hook to more finely
enable this transform. Other targets will probably want that anyway to distinguish scalars
from vectors. We're using that here to exclude AVX512 targets, but it may not be necessary.
2. Convert a vselect to ext+add. This eliminates a constant load/materialization, and the
vector ext is often free.
Implementing a more general fold using xor+and can be a follow-up for targets that don't have
a legal vselect. It's also possible that we can remove the TLI hook for the special case fold
implemented here because we're eliminating a constant, but it needs to be tested on other
targets.
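A minimal sketch of the kind of pattern targeted here, with constants
chosen purely for illustration: a vselect between constant splats that
differ by one can become an extend of the condition plus the smaller
constant:

  define <4 x i32> @sel_consts(<4 x i1> %c) {
    %r = select <4 x i1> %c, <4 x i32> <i32 42, i32 42, i32 42, i32 42>,
                             <4 x i32> <i32 41, i32 41, i32 41, i32 41>
    ; roughly: add (zext <4 x i1> %c to <4 x i32>), splat of 41
    ret <4 x i32> %r
  }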
Differential Revision: https://reviews.llvm.org/D36840
llvm-svn: 311731
Mostly this involved giving unnamed values names and running the IR
through `opt` to re-format it, while merging in any important comments
from the original. I then deleted pointless comments and inlined the
function attributes for ease of reading and editing.
All of this is to make it much easier to see the instructions being
generated here and evaluate any updates to the tests.
llvm-svn: 311634
When one operand is a user of another in a promoted binary operation,
we may replace and delete the returned value before returning,
triggering an assertion. Reorder node replacements to prevent this.
Fixes PR34137.
Landing on behalf of Nirav.
Differential Revision: https://reviews.llvm.org/D36581
llvm-svn: 311623
Summary:
Most DIExpressions are empty or very simple. When they are complex, they
tend to be unique, so checking them inline is reasonable.
This also avoids the need for CodeGen passes to append to the
llvm.dbg.mir named md node.
See also PR22780, for making DIExpression not be an MDNode.
Reviewers: aprantl, dexonsmith, dblaikie
Subscribers: qcolombet, javed.absar, eraman, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D37075
llvm-svn: 311594
There are no 512-bit blend instructions so we shouldn't create SHRUNKBLEND for them.
On a side note, it looks like there may be a missed opportunity for constant folding TESTM when LHS and RHS are equal.
This fixes PR34139.
Differential Revision: https://reviews.llvm.org/D36992
llvm-svn: 311572
Summary:
This change achieves two things:
- Redefine the Custom Event handling instrumentation points emitted by
the compiler to not require dynamic relocation of references to the
__xray_CustomEvent trampoline.
- Remove the synthetic reference we emit at the end of a function that
we used to keep auxiliary sections alive in favour of SHF_LINK_ORDER
associated with the section where the function is defined.
To achieve the custom event handling change, we've had to introduce the
concept of sled versioning -- this will need to be supported by the
runtime to allow us to understand how to turn on/off the new version of
the custom event handling sleds. That change has to land first before we
change the way we write the sleds.
To remove the synthetic reference, we rely on a relatively new linker
feature that preserves the sections that are associated with each other.
This allows us to limit the effects on the .text section of ELF
binaries.
Because we're still using absolute references that are resolved at
runtime for the instrumentation map (and function index map), we mark
these sections writable. In the future we can re-define the entries in
the map to use relative relocations instead that can be statically
determined by the linker. That change will be a bit more invasive so we
defer this for later.
Depends on D36816.
Reviewers: dblaikie, echristo, pcc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D36615
llvm-svn: 311525
The output of this test changed after the fix in r311520 to have
-run-pass=block-placement behave like it does in a normal pipeline.
Adjust the test.
llvm-svn: 311521
I've replaced the two OS-specific runs with a generic run because
there's no functional difference in the resulting output that
we're checking. Also, the script still doesn't work with a Win
target.
llvm-svn: 311463
ISD::isConstantSplatVector can shrink the reported splat to the smallest repeating width, but we don't check the size of the resulting APInt at all, so we can misinterpret the results.
This patch just adds a flag to prevent the APInt from changing width.
Fixes PR34271.
Differential Revision: https://reviews.llvm.org/D36996
llvm-svn: 311429
Summary: With masked operations, it's possible for an operation node like fadd, fsub, etc. to be used by multiple different vselects. Since the pattern matching will start at the vselect, we need to make sure the operation node itself is only used once before we can fold a load. Otherwise we'll end up folding the same load into multiple instructions.
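A rough IR sketch of the hazard (shapes and names are illustrative, not
from the patch): one fadd with a loaded operand feeds two different
selects, and matching from each select without a one-use check on the
fadd would fold the same load twice:

  define void @two_masks(<8 x float> %x, ptr %p,
                         <8 x i1> %m1, <8 x i1> %m2,
                         <8 x float> %pt1, <8 x float> %pt2,
                         ptr %out1, ptr %out2) {
    %ld  = load <8 x float>, ptr %p
    %add = fadd <8 x float> %x, %ld            ; one op, two select users
    %r1  = select <8 x i1> %m1, <8 x float> %add, <8 x float> %pt1
    %r2  = select <8 x i1> %m2, <8 x float> %add, <8 x float> %pt2
    store <8 x float> %r1, ptr %out1
    store <8 x float> %r2, ptr %out2
    ret void
  }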
Reviewers: RKSimon, spatel, zvi, igorb
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D36938
llvm-svn: 311342
widely used processors.
This occurred to me when I saw that we were generating 'inc' and 'dec'
for Haswell and newer, where we shouldn't. However, there were a few "X
is slow" things that we should probably just set.
I've avoided any of the "X is fast" features because most of those would
be pretty serious regressions on processors where X isn't actually fast.
The slow things are likely to be negligible costs on processors where
these aren't slow and a significant win when they are slow.
In retrospect this seems somewhat obvious. Not sure why we didn't do
this a long time ago.
Differential Revision: https://reviews.llvm.org/D36947
llvm-svn: 311318
rather than doing a separate comparison.
This both saves an explicit comparison and avoids the use of `xadd`
which introduces register constraints and other challenges to the
generated code.
The motivating case is from atomic reference counts where `1` is the
sentinel rather than `0` for whatever reason. This can and should be
lowered efficiently on x86 by just using a different flag; however, the
x86 code only handled the `0` case.
There remains some further opportunities here that are currently hidden
due to canonicalization. I've included test cases that show these and
FIXMEs. However, I don't at the moment have any production use cases and
they seem substantially harder to address.
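For reference, a minimal sketch of the motivating shape (names are
mine): the flags of the decrement already tell us whether the old value
was the sentinel `1`, i.e. the count just hit zero, so no `xadd` plus
separate compare is needed:

  define i1 @release_ref(ptr %refcount) {
    %old = atomicrmw sub ptr %refcount, i64 1 seq_cst
    %was_last = icmp eq i64 %old, 1   ; old == 1 => new count is 0 (ZF set)
    ret i1 %was_last
  }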
Differential Revision: https://reviews.llvm.org/D36945
llvm-svn: 311317
There's no functional difference between the AVX512DQ instructions if we're not masking.
This change unifies test checks and removes extra isel entries. Something similar was done recently for subvector inserts and extracts.
llvm-svn: 311308
Summary: Support the call ABI. For now, only the Linux C and X86_64_SysV calling conventions are supported. Variadic functions are not supported.
Reviewers: zvi, guyblank, oren_ben_simhon
Reviewed By: oren_ben_simhon
Subscribers: rovka, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D34602
llvm-svn: 311279
Summary:
If all the operands of a BUILD_VECTOR extract elements from the same vector, then split the
vector efficiently based on the maximum vector access index.
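An illustrative IR-level sketch (the transform itself runs on the
BUILD_VECTOR node in the DAG): every lane comes from indices 0-3 of the
wide source, so the maximum access index shows that only the low
<4 x i32> subvector is actually needed:

  define <4 x i32> @low_lanes(<16 x i32> %v) {
    %e0 = extractelement <16 x i32> %v, i32 0
    %e1 = extractelement <16 x i32> %v, i32 1
    %e2 = extractelement <16 x i32> %v, i32 2
    %e3 = extractelement <16 x i32> %v, i32 3
    %b0 = insertelement <4 x i32> undef, i32 %e0, i32 0
    %b1 = insertelement <4 x i32> %b0, i32 %e1, i32 1
    %b2 = insertelement <4 x i32> %b1, i32 %e2, i32 2
    %b3 = insertelement <4 x i32> %b2, i32 %e3, i32 3
    ret <4 x i32> %b3
  }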
Reviewers: zvi, delena, RKSimon, thakis
Reviewed By: RKSimon
Subscribers: chandlerc, eladcohen, llvm-commits
Differential Revision: https://reviews.llvm.org/D35788
llvm-svn: 311255
We see a modest performance improvement from this slightly higher tail dup threshold.
Differential Revision: https://reviews.llvm.org/D36775
llvm-svn: 311139
There's really no reason to do this; we should just let isel pick the integer version and let the execution dependency fixing pass take care of moving to FP if necessary.
It's not very reliable to look for bitcasts at the edges of patterns. If for some reason one input was bitcasted and the other wasn't, or if one was a v4f32 bitcast and one was a v2f64 bitcast, we would have fallen back to the integer pattern anyway.
llvm-svn: 311138
Two issues identified by buildbots were addressed:
- The pass no longer forwards COPYs to physical register uses, since
doing so can break code that implicitly relies on the physical
register number of the use.
- The pass no longer forwards COPYs to undef uses, since doing so
can break the machine verifier by creating LiveRanges that don't
end on a use (since the undef operand is not considered a use).
[MachineCopyPropagation] Extend pass to do COPY source forwarding
This change extends MachineCopyPropagation to do COPY source forwarding.
It also extends the pass so that it can be run during register
allocation, after physical registers have been assigned but before the
virtual registers have been rewritten, which allows it to remove virtual
register COPY LiveIntervals that become dead once all of their uses have
been forwarded.
Reviewers: qcolombet, javed.absar, MatzeB, jonpa
Subscribers: jyknight, nemanjai, llvm-commits, nhaehnle, mcrosier, mgorny
Differential Revision: https://reviews.llvm.org/D30751
llvm-svn: 311135
We've discussed canonicalizing to this form in IR, so the backend
should be prepared to lower these in ways better than what we see
here in most cases.
llvm-svn: 311103
The SelectionDAGBuilder translates various conditional branches into
CaseBlocks which are then translated into SDNodes. If a conditional
branch results in multiple CaseBlocks only the first CaseBlock is
translated into SDNodes immediately, the rest of the CaseBlocks are
put in a queue and processed when all LLVM IR instructions in the
basic block have been processed.
When a CaseBlock is transformed into SDNodes the SelectionDAGBuilder
is queried for the current LLVM IR instruction and the resulting
SDNodes are annotated with the debug info of the current
instruction (if it exists and has debug metadata).
When the deferred CaseBlocks are processed, the SelectionDAGBuilder
does not have a current LLVM IR instruction, and the resulting SDNodes
will not have any debug info. As DwarfDebug::beginInstruction() outputs
a .loc directive for the first instruction in a labeled
block (typically the case for something coming from a CaseBlock), this
tends to produce a line-0 directive.
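As an illustration of one such input (no debug metadata shown, shapes
are mine): a branch on an 'or' of two compares is split by the
SelectionDAGBuilder into two CaseBlocks, and before this patch only the
first one inherited the branch's debug location:

  define void @split_branch(i32 %a, i32 %b) {
  entry:
    %c1 = icmp eq i32 %a, 0
    %c2 = icmp eq i32 %b, 0
    %c = or i1 %c1, %c2                  ; lowered as two CaseBlocks
    br i1 %c, label %then, label %exit
  then:
    call void @callee()
    br label %exit
  exit:
    ret void
  }

  declare void @callee()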
This patch changes the handling of CaseBlocks to store the current
instruction's debug info into the CaseBlock when it is created (and the
SelectionDAGBuilder knows the current instruction) and to always use
the stored debug info when translating a CaseBlock to SDNodes.
Patch by Frej Drejhammar!
Differential Revision: https://reviews.llvm.org/D36671
llvm-svn: 311097
There's no reason to switch instructions with and without DQI. It just creates extra isel patterns and test divergences.
There is however value in enabling the masked version of the instructions with DQI.
This required introducing some new multiclasses to enable this splitting.
Differential Revision: https://reviews.llvm.org/D36661
llvm-svn: 311091