Commit Graph

14 Commits

Author SHA1 Message Date
Christudasan Devadasan 375cec4b6c [AMDGPU] Introduce more scratch registers in the ABI.
The AMDGPU target has a convention that defines all VGPRs
(except the initial 32 argument registers) as callee-saved.
This convention is not always efficient: a callee that needs
many registers ends up emitting a large number of spills,
even though its caller may require only a few.

This patch revises the ABI by introducing more scratch registers
that a callee can freely use.
The 256 VGPRs are now split into:
  32 argument registers,
  112 scratch registers, and
  112 callee-saved registers.
The scratch registers and the CSRs are intermixed at regular
intervals (a split boundary of 8) to obtain better occupancy.

Reviewers: arsenm, t-tye, rampitec, b-sumner, mjbedy, tpr

Reviewed By: arsenm, t-tye

Differential Revision: https://reviews.llvm.org/D76356
2020-05-05 23:02:58 +05:30
Matt Arsenault 5660bb6bc9 AMDGPU: Remove denormal subtarget features
Switch to using the denormal-fp-math/denormal-fp-math-f32 attributes.
2020-04-02 17:17:12 -04:00
Matt Arsenault 75cf30918f AMDGPU: Assume f32 denormals are enabled by default
This will likely introduce catastrophic performance regressions on
older subtargets, but should be correct. A follow up change will
remove the old fp32-denormals subtarget features, and switch to using
the new denormal-fp-math/denormal-fp-math-f32 attributes. Frontends
should make sure to add the denormal-fp-math-f32 attribute when
appropriate to avoid performance regressions.
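
For illustration only (not from this commit): with the subtarget
feature gone, the denormal behaviour is expressed as a plain function
attribute in IR. The function below is hypothetical; only the
attribute string matters here.

    ; Hypothetical function; the attribute requests f32 denormals
    ; flushed to zero (sign preserved) for both results and inputs.
    ; Omitting the attribute now means f32 denormals are assumed
    ; enabled.
    define float @scale(float %a, float %b) #0 {
      %r = fmul float %a, %b
      ret float %r
    }

    attributes #0 = { "denormal-fp-math-f32"="preserve-sign,preserve-sign" }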
2020-04-02 17:17:12 -04:00
Scott Linder 0e9368cc8c [AMDGPU] Move frame pointer from s34 to s33
Remove the gap left between the stack pointer (s32) and frame pointer
(s34) now that the scratch wave offset is no longer a part of the
calling convention ABI.

Update llvm/docs/AMDGPUUsage.rst to reflect the change.

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75657
2020-03-19 15:35:16 -04:00
Stanislav Mekhanoshin c262b69dcc [AMDGPU] Fix endcf collapse
Only collapse the inner endcf if the outer one belongs to SI_IF.
If it instead belongs to SI_ELSE, then the mask being restored is
in fact a partial inverse of what we need.

Differential Revision: https://reviews.llvm.org/D76154
2020-03-13 13:50:21 -07:00
Stanislav Mekhanoshin 32e90cbcd1 [AMDGPU] Disable endcf collapse
There are some functional regressions and I suspect our
scopes are not as perfectly enclosed as I expected.
Disable it for now.

Differential Revision: https://reviews.llvm.org/D76148
2020-03-13 12:33:22 -07:00
Stanislav Mekhanoshin 360aff0493 [AMDGPU] Simplify nested SI_END_CF
This replaces the optimization from SIOptimizeExecMaskingPreRA.
We have fewer opportunities during control flow lowering because many
VGPR copies are still in place and will only be removed later, but we
know for sure that an instruction is SI_END_CF and not just an
arbitrary S_OR_B64 with EXEC.

A subsequent change needs to convert s_and_saveexec into s_and and
address the new TODO lines in the tests; then the code block guarded
by the -amdgpu-remove-redundant-endcf option in the pre-RA exec mask
optimizer will be removed.

Differential Revision: https://reviews.llvm.org/D76033
2020-03-12 11:25:07 -07:00
Changpeng Fang 2531535984 AMDGPU: Implement FDIV optimizations in AMDGPUCodeGenPrepare
Summary:
RCP has an accuracy limit. If the FDIV's !fpmath requires high
accuracy, rcp may not meet the requirement. However, in DAG lowering
the fpmath information gets lost, so we may generate either an
inaccurate rcp-based computation or slow code for fdiv.

This patch implements the fdiv optimizations in AMDGPUCodeGenPrepare,
which has exact knowledge of !fpmath.

FastUnsafeRcpLegal: we determine whether it is legal to use rcp based
on unsafe-fp-math, fast-math flags, denormal mode and the fpmath
accuracy request.

RCP optimizations:
  1/x -> rcp(x)   when fast unsafe rcp is legal, or fpmath >= 2.5 ULP
                  with denormals flushed.
  a/b -> a*rcp(b) when fast unsafe rcp is legal.

Use fdiv.fast:
  a/b -> fdiv.fast(a, b) when the RCP optimization is not performed
                         and fpmath >= 2.5 ULP with denormals flushed.
  1/x -> fdiv.fast(1, x) when the RCP optimization is not performed
                         and fpmath >= 2.5 ULP with denormals.
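
For illustration only (not part of this patch): the !fpmath accuracy
request that AMDGPUCodeGenPrepare reads is ordinary instruction
metadata, as in the hypothetical IR below; the comments just restate
the transforms listed above.

    ; 2.5 ULP accuracy request attached to the divides. With denormals
    ; flushed, 1/x may be turned into an rcp call
    ; (llvm.amdgcn.rcp.f32), and a/b into a*rcp(b) or fdiv.fast,
    ; per the rules above.
    define float @recip(float %x) {
      %r = fdiv float 1.000000e+00, %x, !fpmath !0
      ret float %r
    }

    define float @div(float %a, float %b) {
      %q = fdiv float %a, %b, !fpmath !0
      ret float %q
    }

    !0 = !{float 2.500000e+00}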

Reviewers: arsenm

Differential Revision: https://reviews.llvm.org/D71293
2020-01-23 16:57:43 -08:00
cdevadas e53a9d96e6 Resubmit: [AMDGPU] Invert the handling of skip insertion.
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a side effect and the skip branch is
really necessary.

This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.

Differential revision: https://reviews.llvm.org/D68092
2020-01-22 13:18:32 +09:00
Nicolai Hähnle a80291ce10 Revert "[AMDGPU] Invert the handling of skip insertion."
This reverts commit 0dc6c249bf.

The commit is reported to cause a regression in piglit/bin/glsl-vs-loop for
Mesa.
2020-01-21 09:17:25 +01:00
cdevadas 0dc6c249bf [AMDGPU] Invert the handling of skip insertion.
The current implementation of skip insertion (SIInsertSkip) makes it a
mandatory pass required for correctness. Initially, the idea was to
have an optional pass. This patch inserts the s_cbranch_execz upfront
during SILowerControlFlow to skip over the sections of code when no
lanes are active. Later, SIRemoveShortExecBranches removes the skips
for short branches, unless there is a side effect and the skip branch is
really necessary.

This new pass will replace the handling of skip insertion in the
existing SIInsertSkip Pass.

Differential revision: https://reviews.llvm.org/D68092
2020-01-15 15:18:16 +05:30
Stanislav Mekhanoshin cd69e4c74c [AMDGPU] Fix bundle scheduling
Bundles coming to the scheduler were considered free, i.e. zero
latency. Fixed.

Differential Revision: https://reviews.llvm.org/D72487
2020-01-09 15:56:36 -08:00
Matt Arsenault 3b95986a32 AMDGPU: Run AMDGPUCodeGenPrepare after scalar opts
The mul24 matching could interfere with SLSR and the other addressing
mode related passes. This probably is not the optimal placement, but
is an intermediate step. This should probably be moved after all the
generic IR passes, particularly LSR. Moving this after LSR seems to
help in some cases, and hurts others.

As-is in this patch, in idiv-licm, it saves 1-2 instructions inside
some of the loop bodies, but increases the number in others. Moving
this later helps these loops. In the new lsr tests in
mul24-pass-ordering, the intrinsic prevents introducing more
instructions in the loop preheader, so moving this later ends up
hurting them. This shouldn't be any worse than before the intrinsics
were introduced in r366094, and LSR should probably be smarter. I
think this is because LSR does not know that the 'and' inside the loop
will be folded away.
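
As a rough, illustrative sketch (not from this commit): the mul24
matching targets multiplies whose operands are known to fit in 24
bits, e.g. the hypothetical masked pattern below, which could be
turned into a 24-bit multiply intrinsic such as llvm.amdgcn.mul.u24.
Inside a loop, LSR does not see that the 'and' instructions will later
fold away.

    ; Operands are masked to 24 bits, so the multiply is a mul24
    ; candidate.
    define i32 @mul24_candidate(i32 %a, i32 %b) {
      %a24 = and i32 %a, 16777215   ; 0xFFFFFF
      %b24 = and i32 %b, 16777215
      %m = mul i32 %a24, %b24
      ret i32 %m
    }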

llvm-svn: 369991
2019-08-27 00:08:31 +00:00
Matt Arsenault 74115ef791 AMDGPU: Add baseline test for mul24 ordering issues
llvm-svn: 369858
2019-08-24 22:22:38 +00:00