Re-applying this patch after bots failures. Should be fine now.
The function __multi3() is undefined on 32-bit ARM, so a call to it should
never be emitted. Instead, plain instructions need to be generated to
perform 128-bit multiplications.
Differential Revision: https://reviews.llvm.org/D103906
There's an existing optimization for x != C, but somehow it was missing
a special case for 0.
While I'm here, also cleaned up the code/comments a bit: the second
value produced by the MERGE_VALUES was actually dead, since a CMOV only
produces one result.
Differential Revision: https://reviews.llvm.org/D59616
llvm-svn: 357437
This takes sequences like "mov r4, sp; str r0, [r4]", and optimizes them
to something like "str r0, [sp]".
For regular stack variables, this optimization was already implemented:
we lower loads and stores using frame indexes, which are expanded later.
However, when constructing a call frame for a call with more than four
arguments, the existing optimization doesn't apply. We need to use
stores which are actually relative to the current value of sp, and don't
have an associated frame index.
This patch adds a special case to handle that construct. At the DAG
level, this is an ISD::STORE where the address is a CopyFromReg from SP
(plus a small constant offset).
This applies only to Thumb1: in Thumb2 or ARM mode, a regular store
instruction can access SP directly, so the COPY gets eliminated by
existing code.
The change to ARMDAGToDAGISel::SelectThumbAddrModeSP is a related
cleanup: we shouldn't pretend that it can select anything other than
frame indexes.
Differential Revision: https://reviews.llvm.org/D59568
llvm-svn: 356601
This adds a few extra Thumb1 opcodes to improve the peephole opimisers
ability to remove redundant cmp instructions. tADC and tSBC require
a small fixup to prevent MOVS being moved past the instruction, giving
the wrong flags.
Differential Revision: https://reviews.llvm.org/D58281
llvm-svn: 354791
The "dead" markings allow existing target-independent optimizations,
like MachineSink, to trigger more frequently. The CPSR defs would have
eventually been marked dead by LiveVariables, so this only affects
optimizations before regalloc.
The ARMBaseInstrInfo.cpp change is fixing a bug which is only visible
with this change: the transform adds a use to an otherwise dead def
of CPSR. This is covered by existing regression tests.
thumb2-tbh.ll breaks for Thumb1 due to MachineLICM changing the
generated code; I'll fix it in D53452.
Differential Revision: https://reviews.llvm.org/D53453
llvm-svn: 345420
There is no way in the universe, that doing a full-width division in
software will be faster than doing overflowing multiplication in
software in the first place, especially given that this same full-width
multiplication needs to be done anyway.
This patch replaces the previous implementation with a direct lowering
into an overflowing multiplication algorithm based on half-width
operations.
Correctness of the algorithm was verified by exhaustively checking the
output of this algorithm for overflowing multiplication of 16 bit
integers against an obviously correct widening multiplication. Baring
any oversights introduced by porting the algorithm to DAG, confidence in
correctness of this algorithm is extremely high.
Following table shows the change in both t = runtime and s = space. The
change is expressed as a multiplier of original, so anything under 1 is
“better” and anything above 1 is worse.
+-------+-----------+-----------+-------------+-------------+
| Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s |
+-------+-----------+-----------+-------------+-------------+
| X64 | - | - | ~0.5 | ~0.64 |
| i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 |
| armv7 | - | ~0.75 | - | ~1.4 |
+-------+-----------+-----------+-------------+-------------+
Performance numbers have been collected by running overflowing
multiplication in a loop under `perf` on two x86_64 (one Intel Haswell,
other AMD Ryzen) based machines. Size numbers have been collected by
looking at the size of function containing an overflowing multiply in
a loop.
All in all, it can be seen that both performance and size has improved
except in the case of armv7 where code size has regressed for 128-bit
multiply. u128*u128 overflowing multiply on 32-bit platforms seem to
benefit from this change a lot, taking only 5% of the time compared to
original algorithm to calculate the same thing.
The final benefit of this change is that LLVM is now capable of lowering
the overflowing unsigned multiply for integers of any bit-width as long
as the target is capable of lowering regular multiplication for the same
bit-width. Previously, 128-bit overflowing multiply was the widest
possible.
Patch by Simonas Kazlauskas!
Differential Revision: https://reviews.llvm.org/D50310
llvm-svn: 339922