llvm-project/llvm/test/CodeGen/PowerPC/umulo-128-legalisation-lowe...

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

171 lines
5.8 KiB
LLVM
Raw Normal View History

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
[SelectionDAG] Improve the legalisation lowering of UMULO. There is no way in the universe, that doing a full-width division in software will be faster than doing overflowing multiplication in software in the first place, especially given that this same full-width multiplication needs to be done anyway. This patch replaces the previous implementation with a direct lowering into an overflowing multiplication algorithm based on half-width operations. Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication. Baring any oversights introduced by porting the algorithm to DAG, confidence in correctness of this algorithm is extremely high. Following table shows the change in both t = runtime and s = space. The change is expressed as a multiplier of original, so anything under 1 is “better” and anything above 1 is worse. +-------+-----------+-----------+-------------+-------------+ | Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s | +-------+-----------+-----------+-------------+-------------+ | X64 | - | - | ~0.5 | ~0.64 | | i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 | | armv7 | - | ~0.75 | - | ~1.4 | +-------+-----------+-----------+-------------+-------------+ Performance numbers have been collected by running overflowing multiplication in a loop under `perf` on two x86_64 (one Intel Haswell, other AMD Ryzen) based machines. Size numbers have been collected by looking at the size of function containing an overflowing multiply in a loop. All in all, it can be seen that both performance and size has improved except in the case of armv7 where code size has regressed for 128-bit multiply. u128*u128 overflowing multiply on 32-bit platforms seem to benefit from this change a lot, taking only 5% of the time compared to original algorithm to calculate the same thing. The final benefit of this change is that LLVM is now capable of lowering the overflowing unsigned multiply for integers of any bit-width as long as the target is capable of lowering regular multiplication for the same bit-width. Previously, 128-bit overflowing multiply was the widest possible. Patch by Simonas Kazlauskas! Differential Revision: https://reviews.llvm.org/D50310 llvm-svn: 339922
2018-08-17 02:39:39 +08:00
; RUN: llc < %s -mtriple=powerpc64-unknown-linux-gnu | FileCheck %s --check-prefixes=PPC64
; RUN: llc < %s -mtriple=powerpc-unknown-linux-gnu | FileCheck %s --check-prefixes=PPC32
define { i128, i8 } @muloti_test(i128 %l, i128 %r) unnamed_addr #0 {
; PPC64-LABEL muloti_test:
; PPC64-LABEL: muloti_test:
; PPC64: # %bb.0: # %start
; PPC64-NEXT: mulld 8, 5, 4
; PPC64-NEXT: cmpdi 5, 3, 0
; PPC64-NEXT: mulhdu. 9, 3, 6
; PPC64-NEXT: mulld 3, 3, 6
; PPC64-NEXT: mcrf 1, 0
; PPC64-NEXT: add 3, 3, 8
; PPC64-NEXT: cmpdi 5, 0
; PPC64-NEXT: crnor 20, 2, 22
; PPC64-NEXT: cmpldi 3, 0
; PPC64-NEXT: mulhdu 8, 4, 6
; PPC64-NEXT: add 3, 8, 3
; PPC64-NEXT: cmpld 6, 3, 8
; PPC64-NEXT: crandc 21, 24, 2
; PPC64-NEXT: crorc 20, 20, 6
; PPC64-NEXT: li 7, 1
; PPC64-NEXT: mulhdu. 5, 5, 4
; PPC64-NEXT: crorc 20, 20, 2
; PPC64-NEXT: crnor 20, 20, 21
; PPC64-NEXT: mulld 4, 4, 6
; PPC64-NEXT: bc 12, 20, .LBB0_2
; PPC64-NEXT: # %bb.1: # %start
; PPC64-NEXT: ori 5, 7, 0
; PPC64-NEXT: blr
; PPC64-NEXT: .LBB0_2: # %start
; PPC64-NEXT: addi 5, 0, 0
; PPC64-NEXT: blr
[SelectionDAG] Improve the legalisation lowering of UMULO. There is no way in the universe, that doing a full-width division in software will be faster than doing overflowing multiplication in software in the first place, especially given that this same full-width multiplication needs to be done anyway. This patch replaces the previous implementation with a direct lowering into an overflowing multiplication algorithm based on half-width operations. Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication. Baring any oversights introduced by porting the algorithm to DAG, confidence in correctness of this algorithm is extremely high. Following table shows the change in both t = runtime and s = space. The change is expressed as a multiplier of original, so anything under 1 is “better” and anything above 1 is worse. +-------+-----------+-----------+-------------+-------------+ | Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s | +-------+-----------+-----------+-------------+-------------+ | X64 | - | - | ~0.5 | ~0.64 | | i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 | | armv7 | - | ~0.75 | - | ~1.4 | +-------+-----------+-----------+-------------+-------------+ Performance numbers have been collected by running overflowing multiplication in a loop under `perf` on two x86_64 (one Intel Haswell, other AMD Ryzen) based machines. Size numbers have been collected by looking at the size of function containing an overflowing multiply in a loop. All in all, it can be seen that both performance and size has improved except in the case of armv7 where code size has regressed for 128-bit multiply. u128*u128 overflowing multiply on 32-bit platforms seem to benefit from this change a lot, taking only 5% of the time compared to original algorithm to calculate the same thing. The final benefit of this change is that LLVM is now capable of lowering the overflowing unsigned multiply for integers of any bit-width as long as the target is capable of lowering regular multiplication for the same bit-width. Previously, 128-bit overflowing multiply was the widest possible. Patch by Simonas Kazlauskas! Differential Revision: https://reviews.llvm.org/D50310 llvm-svn: 339922
2018-08-17 02:39:39 +08:00
;
; PPC32-LABEL: muloti_test:
; PPC32: # %bb.0: # %start
; PPC32-NEXT: mflr 0
; PPC32-NEXT: stw 0, 4(1)
; PPC32-NEXT: stwu 1, -80(1)
; PPC32-NEXT: stw 26, 56(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 27, 60(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 29, 68(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 30, 72(1) # 4-byte Folded Spill
; PPC32-NEXT: mfcr 12
; PPC32-NEXT: mr 30, 8
; PPC32-NEXT: mr 29, 7
; PPC32-NEXT: mr 27, 4
; PPC32-NEXT: mr 26, 3
; PPC32-NEXT: li 3, 0
; PPC32-NEXT: li 4, 0
; PPC32-NEXT: li 7, 0
; PPC32-NEXT: li 8, 0
; PPC32-NEXT: stw 20, 32(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 21, 36(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 22, 40(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 23, 44(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 24, 48(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 25, 52(1) # 4-byte Folded Spill
; PPC32-NEXT: stw 28, 64(1) # 4-byte Folded Spill
; PPC32-NEXT: mr 25, 10
; PPC32-NEXT: stw 12, 28(1)
; PPC32-NEXT: mr 28, 9
; PPC32-NEXT: mr 23, 6
; PPC32-NEXT: mr 24, 5
; PPC32-NEXT: bl __multi3@PLT
; PPC32-NEXT: mr 7, 4
; PPC32-NEXT: mullw 4, 24, 30
; PPC32-NEXT: mullw 8, 29, 23
; PPC32-NEXT: mullw 10, 28, 27
; PPC32-NEXT: mullw 11, 26, 25
; PPC32-NEXT: mulhwu 9, 30, 23
; PPC32-NEXT: mulhwu 12, 27, 25
; PPC32-NEXT: mullw 0, 30, 23
; PPC32-NEXT: mullw 22, 27, 25
; PPC32-NEXT: add 21, 8, 4
; PPC32-NEXT: add 10, 11, 10
; PPC32-NEXT: addc 4, 22, 0
; PPC32-NEXT: add 11, 9, 21
; PPC32-NEXT: add 0, 12, 10
; PPC32-NEXT: adde 8, 0, 11
; PPC32-NEXT: addc 4, 7, 4
; PPC32-NEXT: adde 8, 3, 8
; PPC32-NEXT: xor 22, 4, 7
; PPC32-NEXT: xor 20, 8, 3
; PPC32-NEXT: or. 22, 22, 20
; PPC32-NEXT: mcrf 1, 0
; PPC32-NEXT: cmpwi 29, 0
; PPC32-NEXT: cmpwi 5, 24, 0
; PPC32-NEXT: cmpwi 6, 26, 0
; PPC32-NEXT: cmpwi 7, 28, 0
; PPC32-NEXT: crnor 8, 22, 2
; PPC32-NEXT: mulhwu. 23, 29, 23
; PPC32-NEXT: crnor 9, 30, 26
; PPC32-NEXT: mcrf 5, 0
; PPC32-NEXT: cmplwi 21, 0
; PPC32-NEXT: cmplw 6, 11, 9
; PPC32-NEXT: cmplwi 7, 10, 0
; PPC32-NEXT: crandc 10, 24, 2
; PPC32-NEXT: cmplw 3, 0, 12
; PPC32-NEXT: mulhwu. 9, 24, 30
; PPC32-NEXT: mcrf 6, 0
; PPC32-NEXT: crandc 11, 12, 30
; PPC32-NEXT: cmplw 4, 7
; PPC32-NEXT: cmplw 7, 8, 3
; PPC32-NEXT: crand 12, 30, 0
; PPC32-NEXT: crandc 13, 28, 30
; PPC32-NEXT: mulhwu. 3, 26, 25
; PPC32-NEXT: mcrf 7, 0
; PPC32-NEXT: cror 0, 12, 13
; PPC32-NEXT: crandc 12, 0, 6
; PPC32-NEXT: crorc 20, 8, 22
; PPC32-NEXT: crorc 20, 20, 26
; PPC32-NEXT: mulhwu. 3, 28, 27
; PPC32-NEXT: mcrf 1, 0
; PPC32-NEXT: crorc 25, 9, 30
; PPC32-NEXT: or. 3, 27, 26
; PPC32-NEXT: cror 24, 20, 10
; PPC32-NEXT: mcrf 5, 0
; PPC32-NEXT: crorc 25, 25, 6
; PPC32-NEXT: or. 3, 30, 29
; PPC32-NEXT: cror 25, 25, 11
; PPC32-NEXT: crnor 20, 2, 22
; PPC32-NEXT: lwz 12, 28(1)
; PPC32-NEXT: cror 20, 20, 25
; PPC32-NEXT: cror 20, 20, 24
; PPC32-NEXT: crnor 20, 20, 12
; PPC32-NEXT: li 3, 1
; PPC32-NEXT: bc 12, 20, .LBB0_2
; PPC32-NEXT: # %bb.1: # %start
; PPC32-NEXT: ori 7, 3, 0
; PPC32-NEXT: b .LBB0_3
; PPC32-NEXT: .LBB0_2: # %start
; PPC32-NEXT: addi 7, 0, 0
; PPC32-NEXT: .LBB0_3: # %start
; PPC32-NEXT: mr 3, 8
; PPC32-NEXT: mtcrf 32, 12 # cr2
; PPC32-NEXT: mtcrf 16, 12 # cr3
; PPC32-NEXT: lwz 30, 72(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 29, 68(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 28, 64(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 27, 60(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 26, 56(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 25, 52(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 24, 48(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 23, 44(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 22, 40(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 21, 36(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 20, 32(1) # 4-byte Folded Reload
; PPC32-NEXT: lwz 0, 84(1)
; PPC32-NEXT: addi 1, 1, 80
; PPC32-NEXT: mtlr 0
; PPC32-NEXT: blr
[SelectionDAG] Improve the legalisation lowering of UMULO. There is no way in the universe, that doing a full-width division in software will be faster than doing overflowing multiplication in software in the first place, especially given that this same full-width multiplication needs to be done anyway. This patch replaces the previous implementation with a direct lowering into an overflowing multiplication algorithm based on half-width operations. Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication. Baring any oversights introduced by porting the algorithm to DAG, confidence in correctness of this algorithm is extremely high. Following table shows the change in both t = runtime and s = space. The change is expressed as a multiplier of original, so anything under 1 is “better” and anything above 1 is worse. +-------+-----------+-----------+-------------+-------------+ | Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s | +-------+-----------+-----------+-------------+-------------+ | X64 | - | - | ~0.5 | ~0.64 | | i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 | | armv7 | - | ~0.75 | - | ~1.4 | +-------+-----------+-----------+-------------+-------------+ Performance numbers have been collected by running overflowing multiplication in a loop under `perf` on two x86_64 (one Intel Haswell, other AMD Ryzen) based machines. Size numbers have been collected by looking at the size of function containing an overflowing multiply in a loop. All in all, it can be seen that both performance and size has improved except in the case of armv7 where code size has regressed for 128-bit multiply. u128*u128 overflowing multiply on 32-bit platforms seem to benefit from this change a lot, taking only 5% of the time compared to original algorithm to calculate the same thing. The final benefit of this change is that LLVM is now capable of lowering the overflowing unsigned multiply for integers of any bit-width as long as the target is capable of lowering regular multiplication for the same bit-width. Previously, 128-bit overflowing multiply was the widest possible. Patch by Simonas Kazlauskas! Differential Revision: https://reviews.llvm.org/D50310 llvm-svn: 339922
2018-08-17 02:39:39 +08:00
; PPC32-LABEL muloti_test:
start:
%0 = tail call { i128, i1 } @llvm.umul.with.overflow.i128(i128 %l, i128 %r) #2
%1 = extractvalue { i128, i1 } %0, 0
%2 = extractvalue { i128, i1 } %0, 1
%3 = zext i1 %2 to i8
%4 = insertvalue { i128, i8 } undef, i128 %1, 0
%5 = insertvalue { i128, i8 } %4, i8 %3, 1
ret { i128, i8 } %5
}
; Function Attrs: nounwind readnone speculatable
declare { i128, i1 } @llvm.umul.with.overflow.i128(i128, i128) #1
attributes #0 = { nounwind readnone }
[SelectionDAG] Improve the legalisation lowering of UMULO. There is no way in the universe, that doing a full-width division in software will be faster than doing overflowing multiplication in software in the first place, especially given that this same full-width multiplication needs to be done anyway. This patch replaces the previous implementation with a direct lowering into an overflowing multiplication algorithm based on half-width operations. Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication. Baring any oversights introduced by porting the algorithm to DAG, confidence in correctness of this algorithm is extremely high. Following table shows the change in both t = runtime and s = space. The change is expressed as a multiplier of original, so anything under 1 is “better” and anything above 1 is worse. +-------+-----------+-----------+-------------+-------------+ | Arch | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s | +-------+-----------+-----------+-------------+-------------+ | X64 | - | - | ~0.5 | ~0.64 | | i686 | ~0.5 | ~0.6666 | ~0.05 | ~0.9 | | armv7 | - | ~0.75 | - | ~1.4 | +-------+-----------+-----------+-------------+-------------+ Performance numbers have been collected by running overflowing multiplication in a loop under `perf` on two x86_64 (one Intel Haswell, other AMD Ryzen) based machines. Size numbers have been collected by looking at the size of function containing an overflowing multiply in a loop. All in all, it can be seen that both performance and size has improved except in the case of armv7 where code size has regressed for 128-bit multiply. u128*u128 overflowing multiply on 32-bit platforms seem to benefit from this change a lot, taking only 5% of the time compared to original algorithm to calculate the same thing. The final benefit of this change is that LLVM is now capable of lowering the overflowing unsigned multiply for integers of any bit-width as long as the target is capable of lowering regular multiplication for the same bit-width. Previously, 128-bit overflowing multiply was the widest possible. Patch by Simonas Kazlauskas! Differential Revision: https://reviews.llvm.org/D50310 llvm-svn: 339922
2018-08-17 02:39:39 +08:00
attributes #1 = { nounwind readnone speculatable }
attributes #2 = { nounwind }