This reverts commit b4b97ec813.
As discussed in post-commit feedback at:
https://reviews.llvm.org/D118376
...there's a stage 2 failure on a Mac running a clang-refactor tool test.
As noted in D116804, we want to effectively invert that patch
for CPUs (intel) that don't break the false dependency on
sbb %eax, %eax
So we will likely want to create that here in the
X86DAGToDAGISel::Select() case for X86::SETCC_CARRY.
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a 1st
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actual touch these (they're purely used internally down the recursive call) - its trivial to add another wrapper if anything ever does want to track known elements.
Based on the output of include-what-you-use.
This is a big chunk of changes. It is very likely to break downstream code
unless they took a lot of care in avoiding hidden ehader dependencies, something
the LLVM codebase doesn't do that well :-/
I've tried to summarize the biggest change below:
- llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h
- llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h
- llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h
- llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h
- llvm/IR/LegacyPassManager.h no longer include llvm/Pass.h
- llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h
- llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h
And the usual count of preprocessed lines:
$ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
before: 6400831
after: 6189948
200k lines less to process is no that bad ;-)
Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D118652
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
Without AVX512 (which can efficiently extend/truncate to vXi16/vXi32), unpacking/packing to vXi16 is more efficient that relying on the (uops-heavy) PBLENDV shift expansion
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes#50310
Differential Revision: https://reviews.llvm.org/D118376
Previous folds by combineSetCCMOVMSK might have converted these to CMP when changing the bitwidth, and the CMP->SUB fold might not have happened (or will happen)
rG9103b73fe052 was assuming that we could OR/AND with the source vector, but that will fail on float/double vectors without bitcasting - it also missed the case that any_of checks might be testing less than all the source elements
InstCombine performs this more generally with SimplifyUsingDistributiveLaws, but we don't need anything that complex here - this is mainly to fix up cases where logic ops get created late on during lowering, often in conjunction with sext/zext ops for type legalization.
https://alive2.llvm.org/ce/z/gGpY5v
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
There's no further reasons to limit this to cmpeq-with-zero, the outstanding regressions with lowering to PTEST have now been addressed
Improves codegen for Issue #53379
vXi16 patterns allof(cmp()) reduction patterns will have to be pack the comparison results to vXi8 to use PMOVMSKB.
If we're reducing cmpeq(), then we can compare the vXi8 halves directly - similar to what we already do for vXi64 -> vXi32 for cases without PCMPEQQ.
As suggested on PR53379, for all-of icmp-eq patterns, we can use ptest(sub(x,y)) on SSE41+ targets
This is a generalization of the existing allof(cmpeq(x,0)) -> ptest(x) pattern
We can probably extend this further, in particularly to handle 256-bit cases on pre-AVX2 targets, but this part of the generalization is pretty trivial
Fixes Issue #53379
MSVC currently doesn't support 80 bits long double. ICC supports it when
the option `/Qlong-double` is specified. Changing the alignment of f80
to 16 bytes so that we can be compatible with ICC's option.
Reviewed By: rnk, craig.topper
Differential Revision: https://reviews.llvm.org/D115942
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
Intel's CET/IBT requires every indirect branch target to be an ENDBR instruction. Because of that, the compiler needs to correctly emit these instruction on function's prologues. Because this is a security feature, it is desirable that only actual indirect-branch-targeted functions are emitted with ENDBRs. While it is possible to identify address-taken functions through LTO, minimizing these ENDBR instructions remains a hard task for user-space binaries because exported functions may end being reachable through PLT entries, that will use an indirect branch for such. Because this cannot be determined during compilation-time, the compiler currently emits ENDBRs to every non-local-linkage function.
Despite the challenge presented for user-space, the kernel landscape is different as no PLTs are used. With the intent of providing the most fit ENDBR emission for the kernel, kernel developers proposed an optimization named "ibt-seal" which replaces the ENDBRs for NOPs directly in the binary. The discussion of this feature can be seen in [1].
This diff brings the enablement of the flag -mibt-seal, which in combination with LTO enforces a different policy for ENDBR placement in when the code-model is set to "kernel". In this scenario, the compiler will only emit ENDBRs to address taken functions, ignoring non-address taken functions that are don't have local linkage.
A comparison between an LTO-compiled kernel binaries without and with the -mibt-seal feature enabled shows that when -mibt-seal was used, the number of ENDBRs in the vmlinux.o binary patched by objtool decreased from 44383 to 33192, and that the number of superfluous ENDBR instructions nopped-out decreased from 11730 to 540.
The 540 missed superfluous ENDBRs need to be investigated further, but hypotheses are: assembly code not being taken care of by the compiler, kernel exported symbols mechanisms creating bogus address taken situations or even these being removed due to other binary optimizations like kernel's static_calls. For now, I assume that the large drop in the number of ENDBR instructions already justifies the feature being merged.
[1] - https://lkml.org/lkml/2021/11/22/591
Reviewed By: xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D116070
AVX512 doesn't provide a ADDSUB instruction, but if we've built this from a build vector of scalar fsub/fadd elements we can still lower to blend(fsub,fadd)