2018-04-03 18:04:37 +08:00
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -o - -mtriple=i686-unknown-unknown %s | FileCheck %s --check-prefixes=ALL,X32
; RUN: llc -o - -mtriple=x86_64-unknown-unknown %s | FileCheck %s --check-prefixes=ALL,X64
;
; Test patterns that require preserving and restoring flags.
@b = common global i8 0, align 1
@c = common global i32 0, align 4
@a = common global i8 0, align 1
@d = common global i8 0, align 1
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
declare void @external(i32)
In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled.
Recommitting with compile-time improvements
Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as the separation of non-interfering loads/stores from the
store-merging logic.
When merging stores, search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads, but then promote them to float operations which appear
but require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 297695
2017-03-14 08:34:14 +08:00
; A test that re-uses flags in interesting ways due to volatile accesses.
; Specifically, the first increment's flags are reused for the branch despite
; being clobbered by the second increment.
define i32 @test1() nounwind {
; X32-LABEL: test1:
; X32: # %bb.0: # %entry
; X32-NEXT: movb b, %cl
[x86] Introduce a pass to begin more systematically fixing PR36028 and similar issues.
The key idea is to lower COPY nodes populating EFLAGS by scanning the
uses of EFLAGS and introducing dedicated code to preserve the necessary
state in a GPR. In the vast majority of cases, these uses are cmovCC and
jCC instructions. For such cases, we can very easily save and restore
the necessary information by simply inserting a setCC into a GPR where
the original flags are live, and then testing that GPR directly to feed
the cmov or conditional branch.
However, things are a bit more tricky if arithmetic is using the flags.
This patch handles the vast majority of cases that seem to come up in
practice: adc, adcx, adox, rcl, and rcr; all without taking advantage of
partially preserved EFLAGS as LLVM doesn't currently model that at all.
There are a large number of operations that technically observe EFLAGS
currently but shouldn't in this case -- they typically are using DF.
Currently, they will not be handled by this approach. However, I have
never seen this issue come up in practice. It is already pretty rare to
have these patterns come up in practical code with LLVM. I had to resort
to writing MIR tests to cover most of the logic in this pass already.
I suspect even with its current amount of coverage of arithmetic users
of EFLAGS it will be a significant improvement over the current use of
pushf/popf. It will also produce substantially faster code in most of
the common patterns.
This patch also removes all of the old lowering for EFLAGS copies, and
the hack that forced us to use a frame pointer when EFLAGS copies were
found anywhere in a function so that the dynamic stack adjustment wasn't
a problem. None of this is needed as we now lower all of these copies
directly in MI and without requiring stack adjustments.
Lots of thanks to Reid who came up with several aspects of this
approach, and Craig who helped me work out a couple of things tripping
me up while working on this.
Differential Revision: https://reviews.llvm.org/D45146
llvm-svn: 329657
2018-04-10 09:41:17 +08:00
; X32-NEXT: movl %ecx, %eax
; X32-NEXT: incb %al
; X32-NEXT: movb %al, b
; X32-NEXT: incl c
; X32-NEXT: sete %dl
; X32-NEXT: movb a, %ah
; X32-NEXT: movb %ah, %ch
; X32-NEXT: incb %ch
; X32-NEXT: cmpb %cl, %ah
; X32-NEXT: sete d
; X32-NEXT: movb %ch, a
; X32-NEXT: testb %dl, %dl
; X32-NEXT: jne .LBB0_2
; X32-NEXT: # %bb.1: # %if.then
; X32-NEXT: movsbl %al, %eax
; X32-NEXT: pushl %eax
; X32-NEXT: calll external
; X32-NEXT: addl $4, %esp
; X32-NEXT: .LBB0_2: # %if.end
; X32-NEXT: xorl %eax, %eax
; X32-NEXT: retl
;
; X64-LABEL: test1:
; X64: # %bb.0: # %entry
; X64-NEXT: movb {{.*}}(%rip), %dil
; X64-NEXT: movl %edi, %eax
; X64-NEXT: incb %al
; X64-NEXT: movb %al, {{.*}}(%rip)
; X64-NEXT: incl {{.*}}(%rip)
; X64-NEXT: sete %sil
; X64-NEXT: movb {{.*}}(%rip), %cl
; X64-NEXT: movl %ecx, %edx
; X64-NEXT: incb %dl
; X64-NEXT: cmpb %dil, %cl
; X64-NEXT: sete {{.*}}(%rip)
; X64-NEXT: movb %dl, {{.*}}(%rip)
; X64-NEXT: testb %sil, %sil
; X64-NEXT: jne .LBB0_2
; X64-NEXT: # %bb.1: # %if.then
; X64-NEXT: pushq %rax
; X64-NEXT: movsbl %al, %edi
; X64-NEXT: callq external
; X64-NEXT: addq $8, %rsp
; X64-NEXT: .LBB0_2: # %if.end
; X64-NEXT: xorl %eax, %eax
; X64-NEXT: retq
entry:
%bval = load i8, i8* @b
%inc = add i8 %bval, 1
store volatile i8 %inc, i8* @b
%cval = load volatile i32, i32* @c
%inc1 = add nsw i32 %cval, 1
store volatile i32 %inc1, i32* @c
%aval = load volatile i8, i8* @a
%inc2 = add i8 %aval, 1
store volatile i8 %inc2, i8* @a
%cmp = icmp eq i8 %aval, %bval
%conv5 = zext i1 %cmp to i8
store i8 %conv5, i8* @d
%tobool = icmp eq i32 %inc1, 0
br i1 %tobool, label %if.end, label %if.then
if.then:
%conv6 = sext i8 %inc to i32
call void @external(i32 %conv6)
br label %if.end
if.end:
ret i32 0
}
; Preserve increment flags across a call.
define i32 @test2(i32* %ptr) nounwind {
; X32-LABEL: test2:
; X32: # %bb.0: # %entry
; X32-NEXT: pushl %ebx
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
; X32-NEXT: incl (%eax)
; X32-NEXT: setne %bl
|
2018-04-03 18:04:37 +08:00
|
|
|
; X32-NEXT: pushl $42
|
|
|
|
; X32-NEXT: calll external
|
|
|
|
; X32-NEXT: addl $4, %esp
|
2018-04-18 23:52:50 +08:00
|
|
|
; X32-NEXT: testb %bl, %bl
|
2019-02-04 00:16:48 +08:00
|
|
|
; X32-NEXT: jne .LBB1_2
|
|
|
|
; X32-NEXT: # %bb.1: # %then
|
|
|
|
; X32-NEXT: movl $64, %eax
|
2018-04-10 09:41:17 +08:00
|
|
|
; X32-NEXT: popl %ebx
|
|
|
|
; X32-NEXT: retl
|
2019-02-04 00:16:48 +08:00
|
|
|
; X32-NEXT: .LBB1_2: # %else
|
|
|
|
; X32-NEXT: xorl %eax, %eax
|
2018-04-10 09:41:17 +08:00
|
|
|
; X32-NEXT: popl %ebx
|
2018-04-03 18:04:37 +08:00
|
|
|
; X32-NEXT: retl
|
|
|
|
;
|
|
|
|
; X64-LABEL: test2:
|
|
|
|
; X64: # %bb.0: # %entry
|
|
|
|
; X64-NEXT: pushq %rbx
|
|
|
|
; X64-NEXT: incl (%rdi)
|
2018-04-10 09:41:17 +08:00
|
|
|
; X64-NEXT: setne %bl
|
2018-04-03 18:04:37 +08:00
|
|
|
; X64-NEXT: movl $42, %edi
|
|
|
|
; X64-NEXT: callq external
|
2018-04-18 23:52:50 +08:00
|
|
|
; X64-NEXT: testb %bl, %bl
|
2019-02-04 00:16:48 +08:00
|
|
|
; X64-NEXT: jne .LBB1_2
|
|
|
|
; X64-NEXT: # %bb.1: # %then
|
|
|
|
; X64-NEXT: movl $64, %eax
|
2018-04-10 09:41:17 +08:00
|
|
|
; X64-NEXT: popq %rbx
|
|
|
|
; X64-NEXT: retq
|
2019-02-04 00:16:48 +08:00
|
|
|
; X64-NEXT: .LBB1_2: # %else
|
|
|
|
; X64-NEXT: xorl %eax, %eax
|
2018-04-03 18:04:37 +08:00
|
|
|
; X64-NEXT: popq %rbx
|
|
|
|
; X64-NEXT: retq
|
|
|
|
entry:
|
|
|
|
%val = load i32, i32* %ptr
|
|
|
|
%inc = add i32 %val, 1
|
|
|
|
store i32 %inc, i32* %ptr
|
|
|
|
%cmp = icmp eq i32 %inc, 0
|
|
|
|
call void @external(i32 42)
|
|
|
|
br i1 %cmp, label %then, label %else
|
|
|
|
|
|
|
|
then:
|
|
|
|
ret i32 64
|
|
|
|
|
|
|
|
else:
|
|
|
|
ret i32 0
|
|
|
|
}
|
|
|
|
|
|
|
|
declare void @external_a()
|
|
|
|
declare void @external_b()
|
|
|
|
|
|
|
|
; This lowers to a conditional tail call instead of a conditional branch. This
|
|
|
|
; is tricky because we can only do this from a leaf function, and so we have to
|
|
|
|
; use volatile stores similar to test1 to force the save and restore of
|
|
|
|
; a condition without calling another function. We then set up subsequent calls
|
|
|
|
; in tail position.
|
|
|
|
define void @test_tail_call(i32* %ptr) nounwind optsize {
|
|
|
|
; X32-LABEL: test_tail_call:
|
|
|
|
; X32: # %bb.0: # %entry
|
2018-04-10 09:41:17 +08:00
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
|
2018-04-03 18:04:37 +08:00
|
|
|
; X32-NEXT: incl (%eax)
|
2018-04-10 09:41:17 +08:00
|
|
|
; X32-NEXT: setne %al
|
2018-04-03 18:04:37 +08:00
|
|
|
; X32-NEXT: incb a
|
|
|
|
; X32-NEXT: sete d
|
2018-04-18 23:52:50 +08:00
|
|
|
; X32-NEXT: testb %al, %al
|
2018-04-10 09:41:17 +08:00
|
|
|
; X32-NEXT: jne external_b # TAILCALL
|
|
|
|
; X32-NEXT: # %bb.1: # %then
|
2018-04-03 18:04:37 +08:00
|
|
|
; X32-NEXT: jmp external_a # TAILCALL
|
|
|
|
;
|
|
|
|
; X64-LABEL: test_tail_call:
|
|
|
|
; X64: # %bb.0: # %entry
|
|
|
|
; X64-NEXT: incl (%rdi)
|
2018-04-10 09:41:17 +08:00
|
|
|
; X64-NEXT: setne %al
|
2018-04-03 18:04:37 +08:00
|
|
|
; X64-NEXT: incb {{.*}}(%rip)
|
|
|
|
; X64-NEXT: sete {{.*}}(%rip)
|
2018-04-18 23:52:50 +08:00
|
|
|
; X64-NEXT: testb %al, %al
|
2018-04-10 09:41:17 +08:00
|
|
|
; X64-NEXT: jne external_b # TAILCALL
|
|
|
|
; X64-NEXT: # %bb.1: # %then
|
2018-04-03 18:04:37 +08:00
|
|
|
; X64-NEXT: jmp external_a # TAILCALL
|
|
|
|
entry:
|
|
|
|
%val = load i32, i32* %ptr
|
|
|
|
%inc = add i32 %val, 1
|
|
|
|
store i32 %inc, i32* %ptr
|
|
|
|
%cmp = icmp eq i32 %inc, 0
|
|
|
|
%aval = load volatile i8, i8* @a
|
|
|
|
%inc2 = add i8 %aval, 1
|
|
|
|
store volatile i8 %inc2, i8* @a
|
|
|
|
%cmp2 = icmp eq i8 %inc2, 0
|
|
|
|
%conv5 = zext i1 %cmp2 to i8
|
|
|
|
store i8 %conv5, i8* @d
|
|
|
|
br i1 %cmp, label %then, label %else
|
|
|
|
|
|
|
|
then:
|
|
|
|
tail call void @external_a()
|
|
|
|
ret void
|
|
|
|
|
|
|
|
else:
|
|
|
|
tail call void @external_b()
|
|
|
|
ret void
|
|
|
|
}
|
2018-04-18 23:13:16 +08:00
|
|
|
|
|
|
|
; Test a function that gets special select lowering into CFG with copied EFLAGS
|
|
|
|
; threaded across the CFG. This requires our EFLAGS copy rewriting to handle
|
|
|
|
; cross-block rewrites in at least some narrow cases.
|
2018-10-31 04:46:23 +08:00
|
|
|
define void @PR37100(i8 %arg1, i16 %arg2, i64 %arg3, i8 %arg4, i8* %ptr1, i32* %ptr2, i32 %x) nounwind {
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-LABEL: PR37100:
|
|
|
|
; X32: # %bb.0: # %bb
|
|
|
|
; X32-NEXT: pushl %ebp
|
|
|
|
; X32-NEXT: pushl %ebx
|
|
|
|
; X32-NEXT: pushl %edi
|
|
|
|
; X32-NEXT: pushl %esi
|
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %esi
|
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %ebx
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %ebp
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: movb {{[0-9]+}}(%esp), %ch
|
|
|
|
; X32-NEXT: movb {{[0-9]+}}(%esp), %cl
|
|
|
|
; X32-NEXT: jmp .LBB3_1
|
|
|
|
; X32-NEXT: .p2align 4, 0x90
|
|
|
|
; X32-NEXT: .LBB3_5: # %bb1
|
|
|
|
; X32-NEXT: # in Loop: Header=BB3_1 Depth=1
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: movl %esi, %eax
|
|
|
|
; X32-NEXT: cltd
|
|
|
|
; X32-NEXT: idivl %edi
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: .LBB3_1: # %bb1
|
|
|
|
; X32-NEXT: # =>This Inner Loop Header: Depth=1
|
|
|
|
; X32-NEXT: movsbl %cl, %eax
|
|
|
|
; X32-NEXT: movl %eax, %edx
|
|
|
|
; X32-NEXT: sarl $31, %edx
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: cmpl %eax, {{[0-9]+}}(%esp)
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
|
|
|
|
; X32-NEXT: sbbl %edx, %eax
|
|
|
|
; X32-NEXT: setl %al
|
|
|
|
; X32-NEXT: setl %dl
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: movzbl %dl, %edi
|
|
|
|
; X32-NEXT: negl %edi
|
2018-04-18 23:52:50 +08:00
|
|
|
; X32-NEXT: testb %al, %al
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: jne .LBB3_3
|
|
|
|
; X32-NEXT: # %bb.2: # %bb1
|
|
|
|
; X32-NEXT: # in Loop: Header=BB3_1 Depth=1
|
|
|
|
; X32-NEXT: movb %ch, %cl
|
|
|
|
; X32-NEXT: .LBB3_3: # %bb1
|
|
|
|
; X32-NEXT: # in Loop: Header=BB3_1 Depth=1
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: movb %cl, (%ebp)
|
|
|
|
; X32-NEXT: movl (%ebx), %edx
|
2018-04-18 23:52:50 +08:00
|
|
|
; X32-NEXT: testb %al, %al
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: jne .LBB3_5
|
|
|
|
; X32-NEXT: # %bb.4: # %bb1
|
|
|
|
; X32-NEXT: # in Loop: Header=BB3_1 Depth=1
|
2018-10-31 04:46:23 +08:00
|
|
|
; X32-NEXT: movl %edx, %edi
|
2018-04-18 23:13:16 +08:00
|
|
|
; X32-NEXT: jmp .LBB3_5
|
|
|
|
;
|
|
|
|
; X64-LABEL: PR37100:
|
|
|
|
; X64: # %bb.0: # %bb
|
2018-10-31 04:46:23 +08:00
|
|
|
; X64-NEXT: movq %rdx, %r11
|
|
|
|
; X64-NEXT: movl {{[0-9]+}}(%rsp), %r10d
|
2018-04-18 23:13:16 +08:00
|
|
|
; X64-NEXT: jmp .LBB3_1
|
|
|
|
; X64-NEXT: .p2align 4, 0x90
|
|
|
|
; X64-NEXT: .LBB3_5: # %bb1
|
|
|
|
; X64-NEXT: # in Loop: Header=BB3_1 Depth=1
|
2018-10-31 04:46:23 +08:00
|
|
|
; X64-NEXT: movl %r10d, %eax
|
|
|
|
; X64-NEXT: cltd
|
2018-04-18 23:13:16 +08:00
|
|
|
; X64-NEXT: idivl %esi
|
|
|
|
; X64-NEXT: .LBB3_1: # %bb1
|
|
|
|
; X64-NEXT: # =>This Inner Loop Header: Depth=1
|
|
|
|
; X64-NEXT: movsbq %dil, %rax
|
|
|
|
; X64-NEXT: xorl %esi, %esi
|
2018-10-31 04:46:23 +08:00
|
|
|
; X64-NEXT: cmpq %rax, %r11
|
2018-04-18 23:13:16 +08:00
|
|
|
; X64-NEXT: setl %sil
|
|
|
|
; X64-NEXT: negl %esi
|
2018-10-31 04:46:23 +08:00
|
|
|
; X64-NEXT: cmpq %rax, %r11
|
2018-04-18 23:13:16 +08:00
|
|
|
; X64-NEXT: jl .LBB3_3
|
|
|
|
; X64-NEXT: # %bb.2: # %bb1
|
|
|
|
; X64-NEXT: # in Loop: Header=BB3_1 Depth=1
|
|
|
|
; X64-NEXT: movl %ecx, %edi
|
|
|
|
; X64-NEXT: .LBB3_3: # %bb1
|
|
|
|
; X64-NEXT: # in Loop: Header=BB3_1 Depth=1
|
|
|
|
; X64-NEXT: movb %dil, (%r8)
|
|
|
|
; X64-NEXT: jl .LBB3_5
|
|
|
|
; X64-NEXT: # %bb.4: # %bb1
|
|
|
|
; X64-NEXT: # in Loop: Header=BB3_1 Depth=1
|
|
|
|
; X64-NEXT: movl (%r9), %esi
|
|
|
|
; X64-NEXT: jmp .LBB3_5
|
|
|
|
bb:
|
|
|
|
br label %bb1
|
|
|
|
|
|
|
|
bb1:
|
|
|
|
%tmp = phi i8 [ %tmp8, %bb1 ], [ %arg1, %bb ]
|
|
|
|
%tmp2 = phi i16 [ %tmp12, %bb1 ], [ %arg2, %bb ]
|
|
|
|
%tmp3 = icmp sgt i16 %tmp2, 7
|
|
|
|
%tmp4 = select i1 %tmp3, i16 %tmp2, i16 7
|
|
|
|
%tmp5 = sext i8 %tmp to i64
|
|
|
|
%tmp6 = icmp slt i64 %arg3, %tmp5
|
|
|
|
%tmp7 = sext i1 %tmp6 to i32
|
|
|
|
%tmp8 = select i1 %tmp6, i8 %tmp, i8 %arg4
|
|
|
|
store volatile i8 %tmp8, i8* %ptr1
|
|
|
|
%tmp9 = load volatile i32, i32* %ptr2
|
|
|
|
%tmp10 = select i1 %tmp6, i32 %tmp7, i32 %tmp9
|
2018-10-31 04:46:23 +08:00
|
|
|
%tmp11 = srem i32 %x, %tmp10
|
2018-04-18 23:13:16 +08:00
|
|
|
%tmp12 = trunc i32 %tmp11 to i16
|
|
|
|
br label %bb1
|
|
|
|
}
|
2018-05-16 04:16:57 +08:00
|
|
|
|
|
|
|
; Use a specific instruction pattern that lowers to the post-RA pseudo used to
|
|
|
|
; lower SETB into an SBB pattern, to make sure that this kind of use of a
|
|
|
|
; copied EFLAGS continues to work.
|
2018-10-31 04:44:54 +08:00
|
|
|
define void @PR37431(i32* %arg1, i8* %arg2, i8* %arg3, i32 %x) nounwind {
|
2018-05-16 04:16:57 +08:00
|
|
|
; X32-LABEL: PR37431:
|
|
|
|
; X32: # %bb.0: # %entry
|
2018-10-31 04:44:54 +08:00
|
|
|
; X32-NEXT: pushl %edi
|
2018-05-16 04:16:57 +08:00
|
|
|
; X32-NEXT: pushl %esi
|
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
|
|
|
|
; X32-NEXT: movl (%eax), %eax
|
|
|
|
; X32-NEXT: movl %eax, %ecx
|
|
|
|
; X32-NEXT: sarl $31, %ecx
|
|
|
|
; X32-NEXT: cmpl %eax, %eax
|
|
|
|
; X32-NEXT: sbbl %ecx, %eax
|
2018-10-31 04:44:54 +08:00
|
|
|
; X32-NEXT: setb %cl
|
|
|
|
; X32-NEXT: sbbb %dl, %dl
|
2018-05-16 04:16:57 +08:00
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %esi
|
2018-10-31 04:44:54 +08:00
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
|
|
|
|
; X32-NEXT: movl {{[0-9]+}}(%esp), %edi
|
|
|
|
; X32-NEXT: movb %dl, (%edi)
|
|
|
|
; X32-NEXT: movzbl %cl, %ecx
|
|
|
|
; X32-NEXT: xorl %edi, %edi
|
|
|
|
; X32-NEXT: subl %ecx, %edi
|
|
|
|
; X32-NEXT: cltd
|
|
|
|
; X32-NEXT: idivl %edi
|
2018-05-16 04:16:57 +08:00
|
|
|
; X32-NEXT: movb %dl, (%esi)
|
|
|
|
; X32-NEXT: popl %esi
|
2018-10-31 04:44:54 +08:00
|
|
|
; X32-NEXT: popl %edi
|
2018-05-16 04:16:57 +08:00
|
|
|
; X32-NEXT: retl
|
|
|
|
;
|
|
|
|
; X64-LABEL: PR37431:
|
|
|
|
; X64: # %bb.0: # %entry
|
2018-10-31 04:44:54 +08:00
|
|
|
; X64-NEXT: movl %ecx, %eax
|
|
|
|
; X64-NEXT: movq %rdx, %r8
|
|
|
|
; X64-NEXT: movslq (%rdi), %rdx
|
|
|
|
; X64-NEXT: cmpq %rdx, %rax
|
|
|
|
; X64-NEXT: sbbb %cl, %cl
|
|
|
|
; X64-NEXT: cmpq %rdx, %rax
|
|
|
|
; X64-NEXT: movb %cl, (%rsi)
|
|
|
|
; X64-NEXT: sbbl %ecx, %ecx
|
|
|
|
; X64-NEXT: cltd
|
|
|
|
; X64-NEXT: idivl %ecx
|
|
|
|
; X64-NEXT: movb %dl, (%r8)
|
2018-05-16 04:16:57 +08:00
|
|
|
; X64-NEXT: retq
|
|
|
|
entry:
|
|
|
|
%tmp = load i32, i32* %arg1
|
|
|
|
%tmp1 = sext i32 %tmp to i64
|
|
|
|
%tmp2 = icmp ugt i64 %tmp1, undef
|
|
|
|
%tmp3 = zext i1 %tmp2 to i8
|
|
|
|
%tmp4 = sub i8 0, %tmp3
|
|
|
|
store i8 %tmp4, i8* %arg2
|
|
|
|
%tmp5 = sext i8 %tmp4 to i32
|
2018-10-31 04:44:54 +08:00
|
|
|
%tmp6 = srem i32 %x, %tmp5
|
2018-05-16 04:16:57 +08:00
|
|
|
%tmp7 = trunc i32 %tmp6 to i8
|
|
|
|
store i8 %tmp7, i8* %arg3
|
|
|
|
ret void
|
|
|
|
}
|