llvm-project/llvm/lib/Target/X86/X86SchedPredicates.td

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

144 lines
4.2 KiB
TableGen
Raw Normal View History

[X86][BtVer2] correctly model the latency/throughput of LEA instructions. This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model. On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e. RThrouhgput of 2.0). This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc): ``` static bool isThreeOperandsLEA(const MachineInstr &MI) { return ( ( MI.getOpcode() == X86::LEA32r || MI.getOpcode() == X86::LEA64r || MI.getOpcode() == X86::LEA64_32r || MI.getOpcode() == X86::LEA16r ) && MI.getOperand(1).isReg() && MI.getOperand(1).getReg() != 0 && MI.getOperand(3).isReg() && MI.getOperand(3).getReg() != 0 && ( ( MI.getOperand(4).isImm() && MI.getOperand(4).getImm() != 0 ) || (MI.getOperand(4).isGlobal()) ) ); } ``` A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h). Back to the BtVer2 scheduling model: A new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operands LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile. Differential Revision: https://reviews.llvm.org/D49436 llvm-svn: 337469
2018-07-20 00:42:15 +08:00
//===-- X86SchedPredicates.td - X86 Scheduling Predicates --*- tablegen -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
[X86][BtVer2] correctly model the latency/throughput of LEA instructions. This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model. On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e. RThrouhgput of 2.0). This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc): ``` static bool isThreeOperandsLEA(const MachineInstr &MI) { return ( ( MI.getOpcode() == X86::LEA32r || MI.getOpcode() == X86::LEA64r || MI.getOpcode() == X86::LEA64_32r || MI.getOpcode() == X86::LEA16r ) && MI.getOperand(1).isReg() && MI.getOperand(1).getReg() != 0 && MI.getOperand(3).isReg() && MI.getOperand(3).getReg() != 0 && ( ( MI.getOperand(4).isImm() && MI.getOperand(4).getImm() != 0 ) || (MI.getOperand(4).isGlobal()) ) ); } ``` A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h). Back to the BtVer2 scheduling model: A new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operands LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile. Differential Revision: https://reviews.llvm.org/D49436 llvm-svn: 337469
2018-07-20 00:42:15 +08:00
//
//===----------------------------------------------------------------------===//
//
// This file defines scheduling predicate definitions that are common to
// all X86 subtargets.
//
//===----------------------------------------------------------------------===//
// A predicate used to identify dependency-breaking instructions that clear the
// content of the destination register. Note that this predicate only checks if
// input registers are the same. This predicate doesn't make any assumptions on
// the expected instruction opcodes, because different processors may implement
// different zero-idioms.
def ZeroIdiomPredicate : CheckSameRegOperand<1, 2>;
// A predicate used to identify VPERM that have bits 3 and 7 of their mask set.
// On some processors, these VPERM instructions are zero-idioms.
def ZeroIdiomVPERMPredicate : CheckAll<[
ZeroIdiomPredicate,
CheckImmOperand<3, 0x88>
]>;
// A predicate used to check if a LEA instruction uses all three source
// operands: base, index, and offset.
[X86][BtVer2] correctly model the latency/throughput of LEA instructions. This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model. On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e. RThrouhgput of 2.0). This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc): ``` static bool isThreeOperandsLEA(const MachineInstr &MI) { return ( ( MI.getOpcode() == X86::LEA32r || MI.getOpcode() == X86::LEA64r || MI.getOpcode() == X86::LEA64_32r || MI.getOpcode() == X86::LEA16r ) && MI.getOperand(1).isReg() && MI.getOperand(1).getReg() != 0 && MI.getOperand(3).isReg() && MI.getOperand(3).getReg() != 0 && ( ( MI.getOperand(4).isImm() && MI.getOperand(4).getImm() != 0 ) || (MI.getOperand(4).isGlobal()) ) ); } ``` A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h). Back to the BtVer2 scheduling model: A new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operands LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile. Differential Revision: https://reviews.llvm.org/D49436 llvm-svn: 337469
2018-07-20 00:42:15 +08:00
def IsThreeOperandsLEAPredicate: CheckAll<[
// isRegOperand(Base)
CheckIsRegOperand<1>,
CheckNot<CheckInvalidRegOperand<1>>,
// isRegOperand(Index)
CheckIsRegOperand<3>,
CheckNot<CheckInvalidRegOperand<3>>,
// hasLEAOffset(Offset)
CheckAny<[
CheckAll<[
CheckIsImmOperand<4>,
CheckNot<CheckZeroOperand<4>>
]>,
CheckNonPortable<"MI.getOperand(4).isGlobal()">
]>
]>;
def LEACases : MCOpcodeSwitchCase<
[LEA32r, LEA64r, LEA64_32r, LEA16r],
MCReturnStatement<IsThreeOperandsLEAPredicate>
>;
// Used to generate the body of a TII member function.
def IsThreeOperandsLEABody :
MCOpcodeSwitchStatement<[LEACases], MCReturnStatement<FalsePred>>;
[X86][BtVer2] correctly model the latency/throughput of LEA instructions. This patch fixes the latency/throughput of LEA instructions in the BtVer2 scheduling model. On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of 1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA with a "Scale" operand is also slow, and it has the same latency profile as the 3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e. RThrouhgput of 2.0). This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td. The tablegen backend (for instruction-info) expands that definition into this (file X86GenInstrInfo.inc): ``` static bool isThreeOperandsLEA(const MachineInstr &MI) { return ( ( MI.getOpcode() == X86::LEA32r || MI.getOpcode() == X86::LEA64r || MI.getOpcode() == X86::LEA64_32r || MI.getOpcode() == X86::LEA16r ) && MI.getOperand(1).isReg() && MI.getOperand(1).getReg() != 0 && MI.getOperand(3).isReg() && MI.getOperand(3).getReg() != 0 && ( ( MI.getOperand(4).isImm() && MI.getOperand(4).getImm() != 0 ) || (MI.getOperand(4).isGlobal()) ) ); } ``` A similar method is generated in the X86_MC namespace, and included into X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h). Back to the BtVer2 scheduling model: A new scheduling predicate named JSlowLEAPredicate now checks if either the instruction is a three-operands LEA, or it is an LEA with a Scale value different than 1. A variant scheduling class uses that new predicate to correctly select the appropriate latency profile. Differential Revision: https://reviews.llvm.org/D49436 llvm-svn: 337469
2018-07-20 00:42:15 +08:00
// This predicate evaluates to true only if the input machine instruction is a
// 3-operands LEA. Tablegen automatically generates a new method for it in
// X86GenInstrInfo.
def IsThreeOperandsLEAFn :
TIIPredicate<"isThreeOperandsLEA", IsThreeOperandsLEABody>;
// A predicate to check for COND_A and COND_BE CMOVs which have an extra uop
// on recent Intel CPUs.
def IsCMOVArr_Or_CMOVBErr : CheckAny<[
CheckImmOperand_s<3, "X86::COND_A">,
CheckImmOperand_s<3, "X86::COND_BE">
]>;
def IsCMOVArm_Or_CMOVBErm : CheckAny<[
CheckImmOperand_s<7, "X86::COND_A">,
CheckImmOperand_s<7, "X86::COND_BE">
]>;
// A predicate to check for COND_A and COND_BE SETCCs which have an extra uop
// on recent Intel CPUs.
def IsSETAr_Or_SETBEr : CheckAny<[
CheckImmOperand_s<1, "X86::COND_A">,
CheckImmOperand_s<1, "X86::COND_BE">
]>;
def IsSETAm_Or_SETBEm : CheckAny<[
CheckImmOperand_s<5, "X86::COND_A">,
CheckImmOperand_s<5, "X86::COND_BE">
]>;
[X86][Btver2] Fix latency and throughput of CMPXCHG instructions. On Jaguar, CMPXCHG has a latency of 11cy, and a maximum throughput of 0.33 IPC. Throughput is superiorly limited to 0.33 because of the implicit in/out dependency on register EAX. In the case of repeated non-atomic CMPXCHG with the same memory location, store-to-load forwarding occurs and values for sequent loads are quickly forwarded from the store buffer. Interestingly, the functionality in LLVM that computes the reciprocal throughput doesn't seem to know about RMW instructions. That functionality only looks at the "consumed resource cycles" for the throughput computation. It should be fixed/improved by a future patch. In particular, for RMW instructions, that logic should also take into account for the write latency of in/out register operands. An atomic CMPXCHG has a latency of ~17cy. Throughput is also limited to ~17cy/inst due to cache locking, which prevents other memory uOPs to start executing before the "lock releasing" store uOP. CMPXCHG8rr and CMPXCHG8rm are treated specially because they decode to one less macro opcode. Their latency tend to be the same as the other RR/RM variants. RR variants are relatively fast 3cy (but still microcoded - 5 macro opcodes). CMPXCHG8B is 11cy and unfortunately doesn't seem to benefit from store-to-load forwarding. That means, throughput is clearly limited by the in/out dependency on GPR registers. The uOP composition is sadly unknown (due to the lack of PMCs for the Integer pipes). I have reused the same mix of consumed resource from the other CMPXCHG instructions for CMPXCHG8B too. LOCK CMPXCHG8B is instead 18cycles. CMPXCHG16B is 32cycles. Up to 38cycles when the LOCK prefix is specified. Due to the in/out dependencies, throughput is limited to 1 instruction every 32 (or 38) cycles dependeing on whether the LOCK prefix is specified or not. I wouldn't be surprised if the microcode for CMPXCHG16B is similar to 2x microcode from CMPXCHG8B. So, I have speculatively set the JALU01 consumption to 2x the resource cycles used for CMPXCHG8B. The two new hasLockPrefix() functions are used by the btver2 scheduling model check if a MCInst/MachineInst has a LOCK prefix. Calls to hasLockPrefix() have been encoded in predicates of variant scheduling classes that describe lat/thr of CMPXCHG. Differential Revision: https://reviews.llvm.org/D66424 llvm-svn: 369365
2019-08-20 18:23:55 +08:00
// A predicate used to check if an instruction has a LOCK prefix.
def CheckLockPrefix : CheckFunctionPredicate<
"X86_MC::hasLockPrefix",
"X86InstrInfo::hasLockPrefix"
>;
def IsRegRegCompareAndSwap_8 : CheckOpcode<[ CMPXCHG8rr ]>;
def IsRegMemCompareAndSwap_8 : CheckOpcode<[
LCMPXCHG8, CMPXCHG8rm
]>;
def IsRegRegCompareAndSwap_16_32_64 : CheckOpcode<[
CMPXCHG16rr, CMPXCHG32rr, CMPXCHG64rr
]>;
def IsRegMemCompareAndSwap_16_32_64 : CheckOpcode<[
CMPXCHG16rm, CMPXCHG32rm, CMPXCHG64rm,
LCMPXCHG16, LCMPXCHG32, LCMPXCHG64,
LCMPXCHG8B, LCMPXCHG16B
]>;
def IsCompareAndSwap8B : CheckOpcode<[ CMPXCHG8B, LCMPXCHG8B ]>;
def IsCompareAndSwap16B : CheckOpcode<[ CMPXCHG16B, LCMPXCHG16B ]>;
def IsRegMemCompareAndSwap : CheckOpcode<
!listconcat(
IsRegMemCompareAndSwap_8.ValidOpcodes,
IsRegMemCompareAndSwap_16_32_64.ValidOpcodes
)>;
def IsRegRegCompareAndSwap : CheckOpcode<
!listconcat(
IsRegRegCompareAndSwap_8.ValidOpcodes,
IsRegRegCompareAndSwap_16_32_64.ValidOpcodes
)>;
def IsAtomicCompareAndSwap_8 : CheckAll<[
CheckLockPrefix,
IsRegMemCompareAndSwap_8
]>;
def IsAtomicCompareAndSwap : CheckAll<[
CheckLockPrefix,
IsRegMemCompareAndSwap
]>;
def IsAtomicCompareAndSwap8B : CheckAll<[
CheckLockPrefix,
IsCompareAndSwap8B
]>;
def IsAtomicCompareAndSwap16B : CheckAll<[
CheckLockPrefix,
IsCompareAndSwap16B
]>;