Commit Graph

4150 Commits

Author SHA1 Message Date
Evgeniy Brevnov 9fb074e7bb [BPI] Improve static heuristics for "cold" paths.
The current approach doesn't work well when multiple paths are predicted to be "cold". By "cold" paths I mean those containing an "unreachable" instruction, a call marked with the 'cold' attribute, or the 'unwind' handler of an 'invoke' instruction. The issue is that heuristics are applied one by one until the first match, which essentially ignores the relative hotness/coldness of other paths.

The new approach unifies the processing of "cold" paths by assigning a predefined absolute weight to each block estimated to be "cold". We then propagate these weights up/down the IR, similarly to the existing approach, and finally set edge probabilities based on the estimated block weights.

One important difference is how we propagate weight up. The existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless, at least because it always gives a 50/50 distribution, which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting a more accurate probability. The new algorithm propagates the weight up only to blocks that both dominate and are post-dominated by the block with a "known" weight. In other words, it goes to blocks that either always execute together with that block or never do.

In addition, the new approach processes loops uniformly as well: loop exit edges are estimated as "cold" paths relative to back edges and are considered together with other coldness/hotness markers.
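
As a rough standalone illustration of the final step (turning estimated block weights into edge probabilities): all names below are invented for the sketch, and none of this is the patch itself.

```cpp
#include <cstdint>
#include <vector>

// Illustrative model only: each successor of a terminator has an estimated
// weight (e.g. a small predefined constant if it was marked "cold"); the
// probability of each outgoing edge is that successor's share of the total.
struct EdgeProb { unsigned SuccIdx; double Prob; };

std::vector<EdgeProb>
probabilitiesFromWeights(const std::vector<uint32_t> &SuccWeights) {
  uint64_t Total = 0;
  for (uint32_t W : SuccWeights)
    Total += W;

  std::vector<EdgeProb> Result;
  for (unsigned I = 0, E = SuccWeights.size(); I != E; ++I) {
    // With no weight information at all, fall back to an even split,
    // matching the default 50/50 assumption mentioned above.
    double P = Total ? double(SuccWeights[I]) / double(Total)
                     : 1.0 / double(E);
    Result.push_back({I, P});
  }
  return Result;
}
```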

Reviewed By: yrouban

Differential Revision: https://reviews.llvm.org/D79485
2020-12-23 22:47:36 +07:00
Sebastian Neubauer 221fdedc69 [AMDGPU][GlobalISel] Fold flat vgpr + constant addresses
Use getPtrBaseWithConstantOffset in selectFlatOffsetImpl to fold more
vgpr+constant addresses.

Differential Revision: https://reviews.llvm.org/D93692
2020-12-23 10:40:30 +01:00
Matt Arsenault bac54639c7 AMDGPU: Add spilled CSR SGPRs to entry block live ins 2020-12-22 21:55:59 -05:00
Matt Arsenault 29ed846d67 AMDGPU: Fix assert when checking for implicit operand legality 2020-12-22 20:56:24 -05:00
Stanislav Mekhanoshin d15119a02d [AMDGPU][GlobalISel] GlobalISel for flat scratch
It does not seem to fold offsets, but this is not specific to
flat scratch: getPtrBaseWithConstantOffset() does not return the
split for these tests, unlike its SDag counterpart.

Differential Revision: https://reviews.llvm.org/D93670
2020-12-22 16:33:06 -08:00
Stanislav Mekhanoshin ca4bf58e4e [AMDGPU] Support unaligned flat scratch in TLI
Adjust SITargetLowering::allowsMisalignedMemoryAccessesImpl for
unaligned flat scratch support. Mostly needed for global isel.

Differential Revision: https://reviews.llvm.org/D93669
2020-12-22 16:12:31 -08:00
Stanislav Mekhanoshin ae8f4b2178 [AMDGPU] Folding of FI operand with flat scratch
Differential Revision: https://reviews.llvm.org/D93501
2020-12-22 10:48:04 -08:00
Fangrui Song 8ffda237a6 MCContext::reportError: don't call report_fatal_error
Errors from MCAssembler, MCObjectStreamer and *ObjectWriter typically cause a crash:

```
% cat c.c
int bar;
extern int foo __attribute__((alias("bar")));
% clang -c -fcommon c.c
fatal error: error in backend: Common symbol 'bar' cannot be used in assignment expr
PLEASE submit a bug report to ...
Stack dump:
...
```

`LLVMTargetMachine::addPassesToEmitFile` constructs `MachineModuleInfoWrapperPass`,
which creates an MCContext without a SourceMgr. `MCContext::reportError` calls
`report_fatal_error`, which gets captured by Clang's `LLVMErrorHandler` and translated
to the output above.

Since `MCContext::reportError` errors indicate user errors, such a crash-style
report is inappropriate. So this patch changes `report_fatal_error` to
`SourceMgr().PrintMessage`.
```
% clang -c -fcommon c.c
<unknown>:0: error: Common symbol 'bar' cannot be used in assignment expr
```

Ideally we should at least recover the original filename (the line information
is generally lost).  That requires general improvement to MC diagnostics,
because currently in many cases SMLoc information is lost.
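
A minimal sketch of the direction of the change (hedged: the member names are
assumptions about MCContext's internals, and the real patch involves more
diagnostic plumbing than shown here):

```cpp
// Sketch only -- assumes MCContext holds an optional `const SourceMgr *SrcMgr`
// and a `bool HadError` flag.
void MCContext::reportError(SMLoc Loc, const Twine &Msg) {
  HadError = true;
  if (SrcMgr)
    SrcMgr->PrintMessage(Loc, SourceMgr::DK_Error, Msg);
  else
    // Previously: report_fatal_error(Msg), i.e. a crash with a stack dump.
    // Now: print an ordinary error via a default-constructed SourceMgr.
    SourceMgr().PrintMessage(Loc, SourceMgr::DK_Error, Msg);
}
```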
2020-12-20 23:23:12 -08:00
Pushpinder Singh e2303a448e [FastRA] Fix handling of bundled MIs
The fast register allocator skips bundled MIs, as the main assignment
loop uses MachineBasicBlock::iterator (= MachineInstrBundleIterator).
This was causing SIInsertWaitcnts, which expects all instructions to
have registers assigned, to crash.

This patch makes sure everything inside a bundle gets the same
assignments that were done on the BUNDLE header.
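
Conceptually the fix looks something like this hedged sketch (the helper name
and the map parameter are assumptions for illustration, not the actual D90369
code):

```cpp
// After register allocation has rewritten the BUNDLE header, walk the
// bundled instructions (which the main loop skipped) and mirror the same
// virtual->physical assignments onto their operands.
static void spreadBundleAssignments(
    llvm::MachineInstr &Header,
    const llvm::DenseMap<llvm::Register, llvm::MCPhysReg> &Assigned) {
  assert(Header.isBundle() && "expected a BUNDLE header");
  auto I = std::next(Header.getIterator());
  auto E = Header.getParent()->instr_end();
  for (; I != E && I->isBundledWithPred(); ++I)
    for (llvm::MachineOperand &MO : I->operands())
      if (MO.isReg() && MO.getReg().isVirtual()) {
        auto It = Assigned.find(MO.getReg());
        if (It != Assigned.end())
          MO.setReg(It->second); // same assignment as on the header
      }
}
```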

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D90369
2020-12-21 02:10:55 -05:00
Whitney Tsang 2a814cd9e1 Ensure SplitEdge returns the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs when Succ has a single predecessor and the edge between
BB and Succ is not critical; splitting then produced the result
'BB->Succ->New'. The new function splitBasicBlockBefore is called from
splitBlockBefore to handle this case and now produces the correct result
'BB->New->Succ'.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
  br bb1
bb1:
  br bb1.split
bb1.split:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
  br bb1.split
bb1.split:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-18 17:37:17 +00:00
Bangtian Liu 511cfe9441 Revert "Ensure SplitEdge returns the new block between the two given blocks"
This reverts commit d20e0c3444.
2020-12-17 21:00:37 +00:00
Bangtian Liu d20e0c3444 Ensure SplitEdge returns the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs when Succ has a single predecessor and the edge between
BB and Succ is not critical; splitting then produced the result
'BB->Succ->New'. The new function splitBasicBlockBefore is called from
splitBlockBefore to handle this case and now produces the correct result
'BB->New->Succ'.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
  br bb1
bb1:
  br bb1.split
bb1.split:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
  br bb1.split
bb1.split:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-17 16:00:15 +00:00
Matt Arsenault f333736757 AMDGPU: Remove SGPRSpillVGPRDefinedSet hack
These VGPRs should be reserved and therefore do not need "correct"
liveness. They should not have undef uses, which can still cause
issues.
2020-12-16 21:33:35 -05:00
Bangtian Liu c10757200d Revert "Ensure SplitEdge returns the new block between the two given blocks"
This reverts commit cf638d793c.
2020-12-16 11:52:30 +00:00
Stanislav Mekhanoshin eb66bf0802 [AMDGPU] Print SCRATCH_EN field after the kernel
Differential Revision: https://reviews.llvm.org/D93353
2020-12-15 22:44:30 -08:00
Bangtian Liu cf638d793c Ensure SplitEdge returns the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs when Succ has a single predecessor and the edge between
BB and Succ is not critical; splitting then produced the result
'BB->Succ->New'. The new function splitBasicBlockBefore is called from
splitBlockBefore to handle this case and now produces the correct result
'BB->New->Succ'.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
  br bb1
bb1:
  br bb1.split
bb1.split:
  %0 = mul i32 1, 2
  br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
  br bb1.split
bb1.split:
  br bb1
bb1:
  %0 = mul i32 1, 2
  br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-15 23:32:29 +00:00
Matt Arsenault 60eba8161b RegisterCoalescer: Remove phi-only subranges when erasing identity copies
Undef subranges are not present in the live range values, except when
they cross block boundaries. In this situation, an identity copy is
inside a loop, and one of the lanes is undefined. It only appears
alive inside the loop due to the copy. Once the copy was erased, it
would leave behind a segment inside the loop body with no
corresponding def anywhere in the program.

When RenameIndependentSubregs processed this dummy interval, it would
introduce a "Multiple connected components in live interval" verifier
error when IMPLICIT_DEFs were added to the other two blocks. I believe
there is a missing verifier check for this type of dummy interval.

I have found additional cases of the same fundamental problem in
other areas that I haven't managed to fix yet (e.g. the commented-out
prune_subrange_phi_value_* cases).
2020-12-15 17:36:32 -05:00
Changpeng Fang ce0c0013d8 AMDGPU: If a store defines (must alias) a load, it clobbers the load.
Summary: If a store defines (must alias) a load, it clobbers the load.

Fixes: SWDEV-258915

Reviewers: arsenm

Differential Revision: https://reviews.llvm.org/D92951
2020-12-14 16:34:32 -08:00
Stanislav Mekhanoshin cf5845d6c4 [AMDGPU] Use multi-dword flat scratch for spilling
Differential Revision: https://reviews.llvm.org/D93067
2020-12-14 14:19:29 -08:00
Michael Liao 1fd1f638b6 [amdgpu] Fix a crash case when `V_CNDMASK` could be simplified.
- Once an instruction is simplified, foldable candidates from it should
  be invalidated or skipped as the operand index is no longer valid.
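
In other words (a generic standalone sketch, not the actual pass code; the
types here are stand-ins):

```cpp
#include <algorithm>
#include <vector>

struct MachineInstrStub; // stand-in for llvm::MachineInstr

// A fold candidate remembers which operand of which instruction to fold.
struct FoldCandidate {
  MachineInstrStub *UseMI;
  unsigned UseOpIdx; // stale once UseMI has been simplified/rewritten
};

// When an instruction such as V_CNDMASK gets simplified (e.g. turned into a
// plain copy), pending candidates that still point at it must be dropped:
// UseOpIdx would index the wrong (or a no-longer-existing) operand.
void pruneCandidates(std::vector<FoldCandidate> &Candidates,
                     const MachineInstrStub *Simplified) {
  Candidates.erase(
      std::remove_if(Candidates.begin(), Candidates.end(),
                     [&](const FoldCandidate &C) {
                       return C.UseMI == Simplified;
                     }),
      Candidates.end());
}
```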

Differential Revision: https://reviews.llvm.org/D93174
2020-12-14 13:08:13 -05:00
Sebastian Neubauer 5733167f54 [AMDGPU] Mark amdgpu_gfx functions as module entry function
- Allows lds allocations
- Writes resource usage into COMPUTE_PGM_RSRC1 registers in PAL metadata

Differential Revision: https://reviews.llvm.org/D92946
2020-12-14 10:43:39 +01:00
Georgii Rymar 98a4289810 [llvm-readobj] - For SHT_REL relocations, don't display an addend.
This is https://bugs.llvm.org/show_bug.cgi?id=44257.

In LLVM style we always print `0` as the addend when dumping
SHT_REL relocations. This is confusing; this patch stops
printing it, as the first comment on the bug page suggests.

Differential revision: https://reviews.llvm.org/D93033
2020-12-14 12:03:00 +03:00
Mirko Brkusanin 0c7cce54eb [AMDGPU] Resolve issues when picking between ds_read/write and ds_read2/write2
Both ds_read_b128 and ds_read2_b64 are valid for 128-bit, 16-byte-aligned
loads, but the one that will be selected is determined either by the order in
tablegen or by the AddedComplexity attribute. Currently ds_read_b128 has
priority.

While ds_read2_b64 has lower alignment requirements, we cannot always
restrict ds_read_b128 to 16-byte alignment because of the
unaligned-access-mode option. This was causing ds_read_b128 to be selected
for 8-byte-aligned loads regardless of the chosen access mode.

To resolve this we use two patterns for selecting ds_read_b128: one
requires 16-byte alignment, and the other requires the
unaligned-access-mode option.

The same goes for ds_write2_b64 and ds_write_b128.
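
Rendered as plain C++ for illustration (the real logic is two TableGen
selection patterns; the function and parameter names here are invented):

```cpp
// ds_read_b128 / ds_write_b128 are eligible in exactly two situations:
// the access is known to be 16-byte aligned, or the subtarget runs with
// the unaligned-access-mode option enabled. Otherwise the 2x64-bit form
// (ds_read2_b64 / ds_write2_b64) is used.
bool mayUseDS128(unsigned AlignInBytes, bool UnalignedAccessMode) {
  return AlignInBytes >= 16 || UnalignedAccessMode;
}
```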

Differential Revision: https://reviews.llvm.org/D92767
2020-12-10 12:40:49 +01:00
Stanislav Mekhanoshin 4617cc68f6 [AMDGPU] Fix expansion of 192 bit spills in PEI
Differential Revision: https://reviews.llvm.org/D92979
2020-12-09 16:36:29 -08:00
Austin Kerbow 4aa842a800 [AMDGPU] Add new pseudos for indirect addressing with VGPR Indexing
It is possible for copies or spills to be inserted in the middle of indirect
addressing sequences that use VGPR indexing. Spills to accvgprs could be
affected by the indexing mode.

Add new pseudo instructions that are expanded after register allocation to avoid
the problematic spill or copy placement.

Differential Revision: https://reviews.llvm.org/D91048
2020-12-08 12:24:12 -08:00
Jay Foad 03663e4130 [AMDGPU] Add occupancy level tests for GFX10.3. NFC.
getMaxWavesPerEU and getVGPRAllocGranule both changed in GFX10.3 and
they both affect the occupancy calculation.

Differential Revision: https://reviews.llvm.org/D92839
2020-12-08 14:15:01 +00:00
Stanislav Mekhanoshin dd89249498 [AMDGPU] Annotate vgpr<->agpr spills in asm
Differential Revision: https://reviews.llvm.org/D92125
2020-12-07 11:25:25 -08:00
Scott Linder f6b9afae00 [AMDGPU] Extend and reorganize memory legalizer tests
* Rename some tests to try to establish a convention (where all
  components are optional) of:

    <addrspace>_<syncscope>_<memory-orders>_<operation>

* Split up at a level of granularity appropriate for the different RUN
  lines (i.e. split on addrspace so GFX6 can avoid FLAT), which makes
  running a specific test reasonable in terms of wall time. This also
  means that, when run as part of the test suite, the tests are not a
  single serial bottleneck.

* Auto-generate check lines with `update_llc_test_checks.py` to make
  future maintenance more tractable.

Reviewed By: rampitec, t-tye

Differential Revision: https://reviews.llvm.org/D91545
2020-12-03 19:36:33 +00:00
Jay Foad d28624a209 [AMDGPU] Stop adding an implicit def of vcc_hi for wave32
This doesn't seem to be needed for anything.

Differential Revision: https://reviews.llvm.org/D92400
2020-12-02 10:11:42 +00:00
Juneyoung Lee 53040a968d [ConstantFold] Fold more operations to poison
This patch folds more operations to poison.

Alive2 proof: https://alive2.llvm.org/ce/z/mxcb9G (it does not contain tests about div/rem because they fold to poison when raising UB)

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D92270
2020-11-29 21:19:48 +09:00
Nikita Popov 4df8efce80 [AA] Split up LocationSize::unknown()
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).

This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
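
For example, a caller that previously passed LocationSize::unknown() now has
to pick the variant matching its intent; a small sketch using the new API:

```cpp
#include "llvm/Analysis/MemoryLocation.h"
using namespace llvm;

// The access starts at Ptr and may extend only after it.
MemoryLocation forwardOnly(const Value *Ptr) {
  return MemoryLocation(Ptr, LocationSize::afterPointer());
}

// The access may also touch bytes before Ptr (within the underlying object).
MemoryLocation eitherSide(const Value *Ptr) {
  return MemoryLocation(Ptr, LocationSize::beforeOrAfterPointer());
}
```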

The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.

Differential Revision: https://reviews.llvm.org/D91649
2020-11-26 18:39:55 +01:00
Roman Lebedev ba74fa244f [AMDGPU] Actually fully update opt-pipeline.ll test to account for -loop-idiom vs -indvars switch 2020-11-25 19:39:32 +03:00
Roman Lebedev a8d74517dc [PassManager] Run Induction Variable Simplification pass *after* Recognize loop idioms pass, not before
Currently, `-indvars` runs first, and `-loop-idiom` runs immediately after.
I'm not really sure if `-loop-idiom` requires `-indvars` to run beforehand,
but I'm *very* sure that `-indvars` requires `-loop-idiom` to run afterwards,
as can be seen in the phase-ordering test.

LoopIdiom runs on two types of loops: countable ones and uncountable ones.
For uncountable ones, IndVars obviously didn't make any change to them,
since they are uncountable, so for them the order should be irrelevant.
For countable ones, they must already have been countable before IndVars
for IndVars to make any change to them, and since SCEV is used on them,
it shouldn't matter whether IndVars has already canonicalized them.
So I don't really see why we'd want the current ordering.

Should this cause issues, it will give us a reproducer test case
that shows flaws in this logic, and we could then adjust accordingly.

While this is quite likely beneficial in the wild already,
it's a required part of the full motivational pattern
behind the `left-shift-until-bittest` loop idiom (D91038).

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D91800
2020-11-25 19:20:07 +03:00
Sebastian Neubauer edd675643d [AMDGPU] Emit stack frame size in metadata
Add .shader_functions to pal metadata, which contains the stack frame
size for all non-entry-point functions.

Differential Revision: https://reviews.llvm.org/D90036
2020-11-25 16:30:02 +01:00
Matt Arsenault 79f75468b4 AMDGPU: Fix counting kernel arguments towards register usage
Also use DataLayout to get type size. Relying on the IR type size is
also pretty broken here, since this won't perfectly capture how types
are legalized.
2020-11-20 21:23:33 -05:00
Matt Arsenault 1d1234b2a4 OpaquePtr: Update more tests to use typed sret 2020-11-20 20:08:43 -05:00
Matt Arsenault 20c43d6bd5 OpaquePtr: Bulk update tests to use typed sret 2020-11-20 17:58:26 -05:00
Matt Arsenault 06c192d454 OpaquePtr: Bulk update tests to use typed byval
Upgrading the IR text tests should be the only thing blocking making
typed byval mandatory. Done partially through regex and partially by
hand.
2020-11-20 14:00:46 -05:00
Sebastian Neubauer 7a18bdb350 [AMDGPU] Implement flat scratch init for pal
Extract the scratch offset from the scratch buffer descriptor that is
stored in the global table.

Differential Revision: https://reviews.llvm.org/D91701
2020-11-20 11:14:30 +01:00
Scott Linder 0fe4b8e4b5 [NFC][AMDGPU] Remove some generic pointers in memory-legalizer tests
These tests implicitly depend on the target supporting generic pointers,
so to prepare for testing them on GFX6 (which lacks FLAT) remove the
dependency where possible.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D91666
2020-11-18 20:52:18 +00:00
Sebastian Neubauer 72ccec1bbc [AMDGPU] Fix v3f16 interaction with image store workaround
In some cases, the wrong number of registers was reserved.

Also enable more v3f16 tests.

Differential Revision: https://reviews.llvm.org/D90847
2020-11-18 18:21:04 +01:00
Jay Foad 7ecf19697e [AMDGPU] Fix and extend vccz workarounds
We have workarounds for two different cases where vccz can get out of
sync with the value in vcc. This fixes them in two ways:

1. Fix the case where the def of vcc was in a previous basic block, by
pessimistically assuming that vccz might be incorrect at a basic block
boundary.

2. Fix the handling of pre-existing waitcnt instructions by calling
generateWaitcntInstBefore before examining ScoreBrackets to determine
whether there's an outstanding smem read operation.

Differential Revision: https://reviews.llvm.org/D91636
2020-11-18 15:26:06 +00:00
Jay Foad e67d8859f2 [AMDGPU] Precommit more vccz workaround tests 2020-11-17 15:55:40 +00:00
Michael Liao f375885ab8 [InferAddrSpace] Teach to handle assumed address space.
- In certain cases, a generic pointer can be assumed to be a pointer to
  the global memory space or other spaces. With a dedicated target hook
  to query that address space from a given value, the infer-address-space
  pass can infer and propagate it to all its users.

Differential Revision: https://reviews.llvm.org/D91121
2020-11-16 17:06:33 -05:00
Matt Arsenault d2e52eec51 AMDGPU: Select global saddr mode from SGPR pointer
Use the 64-bit SGPR base with a 0 offset, since materializing the 0
takes one fewer instruction than the 64-bit copy.
2020-11-16 11:51:06 -05:00
Mirko Brkusanin 4cf6dd518e [AMDGPU][GlobalISel] Fix lowerShlSat
RegBankSelect would crash on G_SELECT when the type is not s1.

Differential Revision: https://reviews.llvm.org/D91437
2020-11-16 17:43:31 +01:00
Matt Arsenault a6e353b1d0 AMDGPU: Split large offsets when selecting global saddr mode
When the offset doesn't fit in the immediate field, move some to
voffset.
2020-11-16 11:36:01 -05:00
Florian Hahn 8dbe44cb29 Add pass to add !annotate metadata from @llvm.global.annotations.
This patch adds a new pass to add !annotation metadata for entries in
@llvm.global.annotations, which is generated using
__attribute__((annotate("_name"))) on functions in Clang.
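
For reference, a minimal source-level example of what ends up in that global
(the annotation string is made up):

```cpp
// Clang records this function in @llvm.global.annotations; the new pass then
// turns the entry into !annotation metadata on the function's instructions.
__attribute__((annotate("my_marker")))
void foo() {}
```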

This has been discussed on llvm-dev as part of
    RFC: Combining Annotation Metadata and Remarks
    http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html

Reviewed By: thegameg

Differential Revision: https://reviews.llvm.org/D91195
2020-11-16 14:57:11 +00:00
Matt Arsenault e7eb2ac53f AMDGPU/GlobalISel: Regenerate some checks
Fixes indentation that was confusing the diff in a future patch.
2020-11-13 11:29:15 -05:00
Florian Hahn 8bb6347939 Add !annotation metadata and remarks pass.
This patch adds a new !annotation metadata kind which can be used to
attach annotation strings to instructions.

It also adds a new pass that emits summary remarks per function with the
counts for each annotation kind.

The intended use case for this new metadata is annotating
'interesting' instructions, and the remarks should provide additional
insight into the transformations applied to a program.

To motivate this, consider these specific questions we would like to get answered:

* How many stores added for automatic variable initialization remain after optimizations? Where are they?
* How many runtime checks inserted by a frontend could be eliminated? Where are the ones that did not get eliminated?

Discussed on llvm-dev as part of 'RFC: Combining Annotation Metadata and Remarks'
(http://lists.llvm.org/pipermail/llvm-dev/2020-November/146393.html)

Reviewed By: thegameg, jdoerfert

Differential Revision: https://reviews.llvm.org/D91188
2020-11-13 13:24:10 +00:00