The only sched models for CPUs that support AVX2 but not AVX512 are:
Haswell, Broadwell, Skylake, and Zen 1-3.
For this tuple, measurement becomes problematic since there is a lot of
spilling going on, but apparently all of these memory ops do not affect
the worst-case estimate here at all.
For load we have:
https://godbolt.org/z/zP4hd8MT6 - for the Intel models, `Block RThroughput: =150.0`; for the Ryzen models, `Block RThroughput: <=59`
So pick a cost of `150`.
For store we have:
https://godbolt.org/z/vKb8zTK8E - for the Intel models, `Block RThroughput: =32.0`; for the Ryzen models, `Block RThroughput: <=24.0`
So pick a cost of `64`.
I'm directly using the shuffle assembly that llc produced, without any
manual fixups that may be needed to ensure sequential execution.
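For context, here is a minimal C++ sketch of the kind of strided
(interleaved) access being costed; the stride of 2 and the int element
type are illustrative assumptions, not the exact tuple this patch covers:

// The vectorizer turns this loop into one wide vector load per chunk
// plus a deinterleaving shuffle sequence; the cost-table entry is the
// measured worst-case Block RThroughput of that shuffle sequence (the
// Intel number above).
void deinterleave2(const int *in, int *even, int *odd, int n) {
  for (int i = 0; i < n; ++i) {
    even[i] = in[2 * i];     // lane 0 of each pair
    odd[i]  = in[2 * i + 1]; // lane 1 of each pair
  }
}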
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110548
The only sched models for CPUs that support AVX2 but not AVX512 are:
Haswell, Broadwell, Skylake, and Zen 1-3.
For load we have:
https://godbolt.org/z/Wd9cKab83 - for the Intel models, `Block RThroughput: =75.0`; for the Ryzen models, `Block RThroughput: <=29.5`
So pick a cost of `75`. (Note that the `# 32-byte Reload` does not affect throughput there.)
For store we have:
https://godbolt.org/z/Wd9cKab83 - for the Intel models, `Block RThroughput: =32.0`; for the Ryzen models, `Block RThroughput: <=12.0`
So pick a cost of `32`.
I'm directly using the shuffle assembly that llc produced, without any
manual fixups that may be needed to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110543
The only sched models for CPUs that support AVX2 but not AVX512 are:
Haswell, Broadwell, Skylake, and Zen 1-3.
For load we have:
https://godbolt.org/z/dd8T5P471 - for the Intel models, `Block RThroughput: =33.0`; for the Ryzen models, `Block RThroughput: <=14.5`
So pick a cost of `33`.
For store we have:
https://godbolt.org/z/zPxcKWhn4 - for the Intel models, `Block RThroughput: =10.0`; for the Ryzen models, `Block RThroughput: <=6.0`
So pick a cost of `10`.
I'm directly using the shuffle assembly that llc produced, without any
manual fixups that may be needed to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110541
The only sched models for CPUs that support AVX2 but not AVX512 are:
Haswell, Broadwell, Skylake, and Zen 1-3.
For load we have:
https://godbolt.org/z/rnsf639Wh - for the Intel models, `Block RThroughput: =17.0`; for the Ryzen models, `Block RThroughput: <=7.5`
So pick a cost of `17`.
For store we have:
https://godbolt.org/z/565KKrcY6 - for the Intel models, `Block RThroughput: =6.0`; for the Ryzen models, `Block RThroughput: =2.0`
So pick a cost of `6`.
I'm directly using the shuffle assembly that llc produced, without any
manual fixups that may be needed to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110537
The only sched models for CPUs that support AVX2 but not AVX512 are:
Haswell, Broadwell, Skylake, and Zen 1-3.
For load we have:
https://godbolt.org/z/5EYc6r9nh - for the Intel models, `Block RThroughput: =6.0`; for the Ryzen models, `Block RThroughput: <=3.0`
So pick a cost of `6`.
For store we have:
https://godbolt.org/z/z61e5d6GE - for the Intel models, `Block RThroughput: =2.0`; for the Ryzen models, `Block RThroughput: <=1.0`
So pick a cost of `2`.
I'm directly using the shuffle assembly that llc produced, without any
manual fixups that may be needed to ensure sequential execution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D110536
This patch adds a new preprocessor extension, ``#pragma clang final``,
which enables warning on undefinition and redefinition of macros.
The intent of this warning is to extend beyond ``-Wmacro-redefined`` and
warn against any and all alterations to macros that are marked `final`.
This warning is part of the ``-Wpedantic-macros`` diagnostics group.
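As an illustration, a hypothetical use of the pragma; the
`#pragma clang final(NAME)` spelling is an assumption here, and the
macro name FOO is made up:

#define FOO 1
#pragma clang final(FOO)  // FOO may no longer be undefined or redefined

#undef FOO                // diagnosed via the -Wpedantic-macros group
#define FOO 2             // diagnosed via the -Wpedantic-macros group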
Reviewed By: aaron.ballman
Differential Revision: https://reviews.llvm.org/D108567
This integration test runs fused and non-fused versions of
sampled matrix multiplication. Both should eventually have the
same performance!
NOTE: relies on pending tensor.init fix!
Reviewed By: bixia
Differential Revision: https://reviews.llvm.org/D110444
Fixes 51982. Minor refactor to remove the `return x = y` construct.
Test case derived from
https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c
by deleting parts while checking that the assertion failure still occurred.
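For reference, a minimal C++ sketch of the construct being removed; the
function names and compute() helper are hypothetical:

int compute(); // hypothetical helper, for illustration only

// Before: the assignment is embedded in the return statement.
int before(int &x) { return x = compute(); }

// After: the assignment and the return are separated.
int after(int &x) {
  x = compute();
  return x;
}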
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110556
This revision makes sure that when the output buffer materializes locally
(in contrast with passing in output tensors, either in-place or not
in-place), the zero-initialization assumption is preserved. This also adds
a bit more documentation on our sparse kernel assumptions (viz. the TACO
assumptions).
Reviewed By: bixia
Differential Revision: https://reviews.llvm.org/D110442
Without this fix, we see that it might generate:
%10 = getelementptr {}**, <2 x {}***> %9, <2 x i32> <i32 10, i32 4>
%11 = bitcast <2 x {}***> %10 to <2 x i64*>
...
%27 = extractelement <2 x i64*> %11, i32 0
%28 = bitcast i64* %27 to <2 x i64>*
store <2 x i64> %22, <2 x i64>* %28, align 4, !tbaa !2
This is an out-of-bounds store: the extractelement picked up the pointer
at offset 10 instead of offset 4 as intended. With the fix, we correctly
extract element `i32 1` (the pointer at offset 4) and generate correct
code.
Differential Revision: https://reviews.llvm.org/D106613
Fixes https://bugs.llvm.org/show_bug.cgi?id=51790. The check triggers
incorrectly with non-type template parameters.
A bisect determined that the bug was introduced here:
ea2225a10b
Unfortunately that patch can no longer be reverted on top of the main
branch, so add a fix instead. Add a unit test to avoid regression in
the future.
Enforce constraints C1034 & C1038, which disallow the use
of otherwise valid statements as branch targets when they
appear in FORALL and/or WHERE constructs. (And make the
diagnostic message somewhat more user-friendly.)
Differential Revision: https://reviews.llvm.org/D109936
The HSA runtime fails to find the symbols for the Init and Fini kernels
because they are marked with internal linkage; change the linkage to
external to fix those errors.
Differential Revision: https://reviews.llvm.org/D110054
Add a new mode option that allows either running the default fusion kind that happens today, or performing only producer-consumer or only sibling fusion. This will also help minimize the compile time of the fusion tests.
Reviewed By: bondhugula, dcaballe
Differential Revision: https://reviews.llvm.org/D110102
This patch fixes the pattern for the P10 instructions Vector Shift Left
Double by Bit Immediate VN-form and Vector Shift Right Double by Bit
Immediate VN-form. The third argument should be a target constant (`timm`)
instead of an `i32` because an immediate is expected.
Reviewed By: lei
Differential Revision: https://reviews.llvm.org/D109920
HIP currently uses -mlink-builtin-bitcode to link all bitcode libraries, which
changes the linkage of functions to internal once they are linked in. This
works for the common bitcode libraries, since those functions are not intended
to be exposed to external callers.
However, the functions in the sanitizer bitcode library are intended to be
called by instructions generated by the sanitizer pass. If their linkage is
changed to internal, their parameters may be altered by optimizations before
the sanitizer pass, which renders them unusable by the sanitizer pass.
To fix this issue, the HIP toolchain now links the sanitizer bitcode library
with -mlink-bitcode-file, which does not change the linkage.
A struct BitCodeLibraryInfo is introduced in ToolChain as a generic
approach to pass the bitcode library information between ToolChain and Tool.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D110304
A defined assignment subroutine invoked in the context of a WHERE
statement or construct must necessarily be elemental (C1032).
Differential Revision: https://reviews.llvm.org/D109932
This can avoid a loss of decoupling with the scalar unit on cores
with decoupled scalar and vector units.
We should support FP too, but those use extract_element and not a
custom ISD node so it is a little different. I also left a FIXME
in the test for i64 extract and store on RV32.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109482
The sparse constant provides a constant tensor in coordinate format. We first split the sparse constant into a constant tensor for indices and a constant tensor for values. We then generate a loop to fill a sparse tensor in coordinate format using the tensors for the indices and the values. Finally, we convert the sparse tensor in coordinate format to the destination sparse tensor format.
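A conceptual C++ sketch of that expansion; all types and helper names
here are invented for illustration and do not mirror the generated MLIR:

#include <cstddef>
#include <cstdint>
#include <vector>

struct CooTensor {}; // mock: sparse tensor in coordinate format
struct DstTensor {}; // mock: sparse tensor in the destination format

// Step 1: the sparse constant is split into a constant tensor of
// indices and a constant tensor of values.
struct SplitConstant {
  std::vector<std::vector<std::uint64_t>> indices; // one coordinate per entry
  std::vector<double> values;                      // one value per entry
};

// Hypothetical helpers standing in for the generated ops.
void cooInsert(CooTensor &t, const std::vector<std::uint64_t> &ix, double v);
DstTensor convertCooTo(const CooTensor &coo);

// Step 2: a generated loop fills a COO tensor from the two constants.
// Step 3: the COO tensor is converted to the destination format.
DstTensor expandSparseConstant(const SplitConstant &c) {
  CooTensor coo;
  for (std::size_t i = 0; i < c.values.size(); ++i)
    cooInsert(coo, c.indices[i], c.values[i]);
  return convertCooTo(coo);
}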
Add tests.
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D110373
This minor code refactoring introduces the TASK_TIED constant inside
kmp_gsupport.cpp as a replacement for the literal value 1.
The mentioned constant is now used in both the kmp_tasking.cpp and
kmp_gsupport.cpp files.
Differential Revision: https://reviews.llvm.org/D110441
If one input of a fixed vector multiply is a sign/zero extend and
the other operand is a splat of a scalar, we can use a widening
multiply if the scalar value has sufficient sign/zero bits.
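A small C++ illustration of the pattern; the function is hypothetical,
and vwmul.vx is one RVV instruction such a multiply could map to:

#include <cstdint>

// a[i] is sign-extended from i16 and s is splatted across the vector.
// If the compiler can prove s has sufficient sign bits (i.e. it fits
// in i16), the multiply can become a 16x16->32 widening multiply
// (vwmul.vx) instead of extending both operands to 32 bits first.
void mulWiden(const std::int16_t *a, std::int32_t *out, std::int32_t s,
              int n) {
  for (int i = 0; i < n; ++i)
    out[i] = static_cast<std::int32_t>(a[i]) * s;
}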
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110028
This is no-functional-change-intended, but it hopefully makes things
slightly clearer and more efficient by having the transforms that require
'shl' be called only from visitShl(). Further cleanup is possible.
Note that getTheLanaiTarget is declared in
TargetInfo/LanaiTargetInfo.h, which LanaiDisassembler.cpp includes.
Identified with readability-redundant-declaration.
This patch defines the newly added `__kmpc_distribute_static_init`
functions in the device runtime library. These functions are currently
exact copies of the current worksharing method but can be tuned later.
Depends on D110429
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D110430
This patch adds a new RTL function for worksharing. Currently we use
`__kmpc_for_static_init` for both the `distribute` and `parallel`
portion of the loop clause. This patch replaces the `distribute` portion
with a new runtime call `__kmpc_distribute_static_init`. Currently this
will be used exactly the same way, but will make it easier in the future
to fine-tune the distribute and parallel portions of the loop.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D110429
Apparently macOS pads the name result with several zero bytes at
the end. Just strip them all so the result behaves like a C string.
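A minimal sketch of the stripping, with hypothetical names (the actual
code operates on whatever buffer the name query returns):

#include <algorithm>
#include <string>

// Drop everything from the first NUL onward so the zero-padded buffer
// behaves like a C string when compared or printed.
void stripPadding(std::string &name) {
  name.erase(std::find(name.begin(), name.end(), '\0'), name.end());
}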
Thanks to Pavel for suggesting this fix.
The tests only specify -march, so when the tests are run on AIX the target OS defaults to AIX, which causes the tests to misbehave.
This patch constrains the tests by specifying -mtriple instead of -march.
Reviewed By: daltenty, jsji, MaskRay
Differential Revision: https://reviews.llvm.org/D110186
This is another step towards trying to re-apply D110170
by eliminating conflicting transforms that cause infinite loops.
a47c8e40c7 was a previous patch in this direction.
The diffs here are mostly cosmetic, but intentional:
1. The existing code that would handle this pattern in FoldShiftByConstant()
is limited to 'shl' only now. The formatting change to IsLeftShift shows
that we could move several transforms into visitShl() directly for
efficiency because they are not common shift transforms.
2. The tests are regenerated to show new instruction names to prove that
we are getting (almost) identical logic results.
3. The one case where we differ ("trunc_sandwich_small_shift1") shows that
we now use a narrow 'and' instruction. Previously, we relied on another
transform to do that, but it is limited to legal types. That seems to
be a legacy constraint from when IR analysis and codegen were less robust.
https://alive2.llvm.org/ce/z/JxyGA4
declare void @llvm.assume(i1)
define i8 @src(i32 %x, i32 %c0, i8 %c1) {
; The sum of the shifts must not overflow the source width.
%z1 = zext i8 %c1 to i32
%sum = add i32 %c0, %z1
%ov = icmp ult i32 %sum, 32
call void @llvm.assume(i1 %ov)
%sh1 = lshr i32 %x, %c0
%tr = trunc i32 %sh1 to i8
%sh2 = lshr i8 %tr, %c1
ret i8 %sh2
}
define i8 @tgt(i32 %x, i32 %c0, i8 %c1) {
%z1 = zext i8 %c1 to i32
%sum = add i32 %c0, %z1
%maskc = lshr i8 -1, %c1
%s = lshr i32 %x, %sum
%t = trunc i32 %s to i8
%a = and i8 %t, %maskc
ret i8 %a
}
It appears that this test assumes that the toolchain uses the integrated
assembler by default, since the expected output in the CHECK lines is
compilation_database.o.
However, this test fails on AIX, as AIX does not use the integrated assembler.
On AIX, the output is instead of the form /tmp/compilation_database-*.s.
Thus, this patch explicitly adds the -fintegrated-as option to match the
assumption that the integrated assembler is used by default.
Differential Revision: https://reviews.llvm.org/D110431