llvm-project/clang/test/OpenMP/tile_codegen.cpp

// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --function-signature --include-generated-funcs --replace-value-regex "__omp_offloading_[0-9a-z]+_[0-9a-z]+" "reduction_size[.].+[.]" "pl_cond[.].+[.|,]" --prefix-filecheck-ir-name _
// Check code generation
// RUN: %clang_cc1 -verify -triple x86_64-pc-linux-gnu -fopenmp -fopenmp-version=51 -emit-llvm %s -o - | FileCheck %s --check-prefix=CHECK1
// Check same results after serialization round-trip
// RUN: %clang_cc1 -verify -triple x86_64-pc-linux-gnu -fopenmp -fopenmp-version=51 -emit-pch -o %t %s
// RUN: %clang_cc1 -verify -triple x86_64-pc-linux-gnu -fopenmp -fopenmp-version=51 -include-pch %t -emit-llvm %s -o - | FileCheck %s --check-prefix=CHECK2
// expected-no-diagnostics
#ifndef HEADER
#define HEADER
// placeholder for loop body code.
extern "C" void body(...) {}
struct S {
  int i;
  S() {
#pragma omp tile sizes(5)
    for (i = 7; i < 17; i += 3)
      body(i);
  }
} s;
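// With runtime bounds, the transformation captures start/end/step into
// temporaries and precomputes the logical trip count
// ((end - start - 1 + step) / step); both generated loops iterate that
// logical space and the body rematerializes i = start + .tile.0.iv.i * step.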
extern "C" void foo1(int start, int end, int step) {
int i;
#pragma omp tile sizes(5)
for (i = start; i < end; i += step)
body(i);
}
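// Tiling two nested loops with sizes(5,5) makes both floor loops the outer
// loops and both tile loops the inner ones (floor.0.iv.i, floor.1.iv.j,
// tile.0.iv.i, tile.1.iv.j in the IR below).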
extern "C" void foo2(int start, int end, int step) {
#pragma omp tile sizes(5,5)
for (int i = 7; i < 17; i+=3)
for (int j = 7; j < 17; j+=3)
body(i,j);
}
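// The de-sugared nest can be consumed by a loop-associated directive: here
// "#pragma omp for" workshares the outermost generated floor loop, while the
// remaining floor and tile loops stay as regular serial loops in the body.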
extern "C" void foo3() {
#pragma omp for
#pragma omp tile sizes(5,5)
for (int i = 7; i < 17; i += 3)
for (int j = 7; j < 17; j += 3)
body(i, j);
}
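// collapse(2) combines the user-written k loop with the outermost floor loop
// produced by the tile directive (4 x 1 logical iterations, hence the
// .omp.ub of 3 in the IR below).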
extern "C" void foo4() {
#pragma omp for collapse(2)
for (int k = 7; k < 17; k += 3)
#pragma omp tile sizes(5,5)
for (int i = 7; i < 17; i += 3)
for (int j = 7; j < 17; j += 3)
body(i, j);
}
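// collapse(3) spans all three loops of the de-sugared nest (the floor loop,
// the tile loop, and the inner j loop); their combined logical iteration
// space is linearized into a 64-bit induction variable below.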
extern "C" void foo5() {
#pragma omp for collapse(3)
#pragma omp tile sizes(5)
for (int i = 7; i < 17; i += 3)
for (int j = 7; j < 17; j += 3)
body(i, j);
}
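// The combined construct outlines the loop into .omp_outlined. via
// __kmpc_fork_call; the floor loop is workshared across the team and each
// thread runs its tile loop serially.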
extern "C" void foo6() {
#pragma omp parallel for
#pragma omp tile sizes(5)
for (int i = 7; i < 17; i += 3)
body(i);
}
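// The tile size may be a (template-)dependent expression; tfoo7 instantiates
// foo7 with Step=3 and Tile=5.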
template <typename T, T Step, T Tile>
void foo7(T start, T end) {
#pragma omp tile sizes(Tile)
  for (T i = start; i < end; i += Step)
    body(i);
}

extern "C" void tfoo7() {
  foo7<int, 3, 5>(0, 42);
}
#endif /* HEADER */
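// For orientation when reading the checks below: a sketch (with simplified
// names) of the loop structure foo1 lowers to, where
// tripcount = (end - start - 1 + step) / step:
//
//   for (unsigned floor = 0; floor < tripcount; floor += 5)
//     for (unsigned tile = floor;
//          tile < (tripcount < floor + 5 ? tripcount : floor + 5); ++tile) {
//       i = start + tile * step;
//       body(i);
//     }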
// CHECK1-LABEL: define {{[^@]+}}@body
// CHECK1-SAME: (...) #[[ATTR0:[0-9]+]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@__cxx_global_var_init
// CHECK1-SAME: () #[[ATTR1:[0-9]+]] section ".text.startup" {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: call void @_ZN1SC1Ev(%struct.S* nonnull align 4 dereferenceable(4) @s)
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@_ZN1SC1Ev
// CHECK1-SAME: (%struct.S* nonnull align 4 dereferenceable(4) [[THIS:%.*]]) unnamed_addr #[[ATTR2:[0-9]+]] comdat align 2 {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[THIS_ADDR:%.*]] = alloca %struct.S*, align 8
// CHECK1-NEXT: store %struct.S* [[THIS]], %struct.S** [[THIS_ADDR]], align 8
// CHECK1-NEXT: [[THIS1:%.*]] = load %struct.S*, %struct.S** [[THIS_ADDR]], align 8
// CHECK1-NEXT: call void @_ZN1SC2Ev(%struct.S* nonnull align 4 dereferenceable(4) [[THIS1]])
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@_ZN1SC2Ev
// CHECK1-SAME: (%struct.S* nonnull align 4 dereferenceable(4) [[THIS:%.*]]) unnamed_addr #[[ATTR2]] comdat align 2 {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[THIS_ADDR:%.*]] = alloca %struct.S*, align 8
// CHECK1-NEXT: [[I:%.*]] = alloca i32*, align 8
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: store %struct.S* [[THIS]], %struct.S** [[THIS_ADDR]], align 8
// CHECK1-NEXT: [[THIS1:%.*]] = load %struct.S*, %struct.S** [[THIS_ADDR]], align 8
// CHECK1-NEXT: [[I2:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], %struct.S* [[THIS1]], i32 0, i32 0
// CHECK1-NEXT: store i32* [[I2]], i32** [[I]], align 8
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP0:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[CMP:%.*]] = icmp slt i32 [[TMP0]], 4
// CHECK1-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END11:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP1]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND3:%.*]]
// CHECK1: for.cond3:
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP3]], 5
// CHECK1-NEXT: [[CMP4:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK1-NEXT: br i1 [[CMP4]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD5:%.*]] = add nsw i32 [[TMP4]], 5
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD5]], [[COND_FALSE]] ]
// CHECK1-NEXT: [[CMP6:%.*]] = icmp slt i32 [[TMP2]], [[COND]]
// CHECK1-NEXT: br i1 [[CMP6]], label [[FOR_BODY7:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body7:
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP5]], 3
// CHECK1-NEXT: [[ADD8:%.*]] = add nsw i32 7, [[MUL]]
// CHECK1-NEXT: [[TMP6:%.*]] = load i32*, i32** [[I]], align 8
// CHECK1-NEXT: store i32 [[ADD8]], i32* [[TMP6]], align 4
// CHECK1-NEXT: [[TMP7:%.*]] = load i32*, i32** [[I]], align 8
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP8]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP9]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND3]], !llvm.loop [[LOOP2:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC9:%.*]]
// CHECK1: for.inc9:
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD10:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK1-NEXT: store i32 [[ADD10]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP4:![0-9]+]]
// CHECK1: for.end11:
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo1
// CHECK1-SAME: (i32 [[START:%.*]], i32 [[END:%.*]], i32 [[STEP:%.*]]) #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[STEP_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_1:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_2:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_3:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK1-NEXT: store i32 [[STEP]], i32* [[STEP_ADDR]], align 4
// CHECK1-NEXT: [[TMP0:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP0]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[END_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[STEP_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP2]], i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB:%.*]] = sub i32 [[TMP3]], [[TMP4]]
// CHECK1-NEXT: [[SUB4:%.*]] = sub i32 [[SUB]], 1
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[ADD:%.*]] = add i32 [[SUB4]], [[TMP5]]
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[DIV:%.*]] = udiv i32 [[ADD]], [[TMP6]]
// CHECK1-NEXT: [[SUB5:%.*]] = sub i32 [[DIV]], 1
// CHECK1-NEXT: store i32 [[SUB5]], i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[ADD6:%.*]] = add i32 [[TMP8]], 1
// CHECK1-NEXT: [[CMP:%.*]] = icmp ult i32 [[TMP7]], [[ADD6]]
// CHECK1-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END18:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP9]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND7:%.*]]
// CHECK1: for.cond7:
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[ADD8:%.*]] = add i32 [[TMP11]], 1
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD9:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK1-NEXT: [[CMP10:%.*]] = icmp ult i32 [[ADD8]], [[ADD9]]
// CHECK1-NEXT: br i1 [[CMP10]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[ADD11:%.*]] = add i32 [[TMP13]], 1
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD12:%.*]] = add nsw i32 [[TMP14]], 5
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ [[ADD11]], [[COND_TRUE]] ], [ [[ADD12]], [[COND_FALSE]] ]
// CHECK1-NEXT: [[CMP13:%.*]] = icmp ult i32 [[TMP10]], [[COND]]
// CHECK1-NEXT: br i1 [[CMP13]], label [[FOR_BODY14:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body14:
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul i32 [[TMP16]], [[TMP17]]
// CHECK1-NEXT: [[ADD15:%.*]] = add i32 [[TMP15]], [[MUL]]
// CHECK1-NEXT: store i32 [[ADD15]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP18:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP18]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP19]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND7]], !llvm.loop [[LOOP5:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC16:%.*]]
// CHECK1: for.inc16:
// CHECK1-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD17:%.*]] = add nsw i32 [[TMP20]], 5
// CHECK1-NEXT: store i32 [[ADD17]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP6:![0-9]+]]
// CHECK1: for.end18:
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo2
// CHECK1-SAME: (i32 [[START:%.*]], i32 [[END:%.*]], i32 [[STEP:%.*]]) #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[STEP_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK1-NEXT: store i32 [[STEP]], i32* [[STEP_ADDR]], align 4
// CHECK1-NEXT: store i32 7, i32* [[I]], align 4
// CHECK1-NEXT: store i32 7, i32* [[J]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP0:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[CMP:%.*]] = icmp slt i32 [[TMP0]], 4
// CHECK1-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END30:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND1:%.*]]
// CHECK1: for.cond1:
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[CMP2:%.*]] = icmp slt i32 [[TMP1]], 4
// CHECK1-NEXT: br i1 [[CMP2]], label [[FOR_BODY3:%.*]], label [[FOR_END27:%.*]]
// CHECK1: for.body3:
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP2]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND4:%.*]]
// CHECK1: for.cond4:
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], 5
// CHECK1-NEXT: [[CMP5:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK1-NEXT: br i1 [[CMP5]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD6:%.*]] = add nsw i32 [[TMP5]], 5
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD6]], [[COND_FALSE]] ]
// CHECK1-NEXT: [[CMP7:%.*]] = icmp slt i32 [[TMP3]], [[COND]]
// CHECK1-NEXT: br i1 [[CMP7]], label [[FOR_BODY8:%.*]], label [[FOR_END24:%.*]]
// CHECK1: for.body8:
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP6]], 3
// CHECK1-NEXT: [[ADD9:%.*]] = add nsw i32 7, [[MUL]]
// CHECK1-NEXT: store i32 [[ADD9]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: store i32 [[TMP7]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND10:%.*]]
// CHECK1: for.cond10:
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD11:%.*]] = add nsw i32 [[TMP9]], 5
// CHECK1-NEXT: [[CMP12:%.*]] = icmp slt i32 4, [[ADD11]]
// CHECK1-NEXT: br i1 [[CMP12]], label [[COND_TRUE13:%.*]], label [[COND_FALSE14:%.*]]
// CHECK1: cond.true13:
// CHECK1-NEXT: br label [[COND_END16:%.*]]
// CHECK1: cond.false14:
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD15:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK1-NEXT: br label [[COND_END16]]
// CHECK1: cond.end16:
// CHECK1-NEXT: [[COND17:%.*]] = phi i32 [ 4, [[COND_TRUE13]] ], [ [[ADD15]], [[COND_FALSE14]] ]
// CHECK1-NEXT: [[CMP18:%.*]] = icmp slt i32 [[TMP8]], [[COND17]]
// CHECK1-NEXT: br i1 [[CMP18]], label [[FOR_BODY19:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body19:
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[MUL20:%.*]] = mul nsw i32 [[TMP11]], 3
// CHECK1-NEXT: [[ADD21:%.*]] = add nsw i32 7, [[MUL20]]
// CHECK1-NEXT: store i32 [[ADD21]], i32* [[J]], align 4
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[J]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP12]], i32 [[TMP13]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP14]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND10]], !llvm.loop [[LOOP7:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC22:%.*]]
// CHECK1: for.inc22:
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC23:%.*]] = add nsw i32 [[TMP15]], 1
// CHECK1-NEXT: store i32 [[INC23]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND4]], !llvm.loop [[LOOP8:![0-9]+]]
// CHECK1: for.end24:
// CHECK1-NEXT: br label [[FOR_INC25:%.*]]
// CHECK1: for.inc25:
// CHECK1-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD26:%.*]] = add nsw i32 [[TMP16]], 5
// CHECK1-NEXT: store i32 [[ADD26]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND1]], !llvm.loop [[LOOP9:![0-9]+]]
// CHECK1: for.end27:
// CHECK1-NEXT: br label [[FOR_INC28:%.*]]
// CHECK1: for.inc28:
// CHECK1-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD29:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK1-NEXT: store i32 [[ADD29]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP10:![0-9]+]]
// CHECK1: for.end30:
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo3
// CHECK1-SAME: () #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2:[0-9]+]])
// CHECK1-NEXT: store i32 7, i32* [[I]], align 4
// CHECK1-NEXT: store i32 7, i32* [[J]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK1-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1:[0-9]+]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP1]], 0
// CHECK1-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 0, [[COND_TRUE]] ], [ [[TMP2]], [[COND_FALSE]] ]
// CHECK1-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 [[TMP3]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK1: omp.inner.for.cond:
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP1:%.*]] = icmp sle i32 [[TMP4]], [[TMP5]]
// CHECK1-NEXT: br i1 [[CMP1]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK1: omp.inner.for.body:
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP6]], 5
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 0, [[MUL]]
// CHECK1-NEXT: store i32 [[ADD]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[CMP2:%.*]] = icmp slt i32 [[TMP7]], 4
// CHECK1-NEXT: br i1 [[CMP2]], label [[FOR_BODY:%.*]], label [[FOR_END32:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP8]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND3:%.*]]
// CHECK1: for.cond3:
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK1-NEXT: [[CMP5:%.*]] = icmp slt i32 4, [[ADD4]]
// CHECK1-NEXT: br i1 [[CMP5]], label [[COND_TRUE6:%.*]], label [[COND_FALSE7:%.*]]
// CHECK1: cond.true6:
// CHECK1-NEXT: br label [[COND_END9:%.*]]
// CHECK1: cond.false7:
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD8:%.*]] = add nsw i32 [[TMP11]], 5
// CHECK1-NEXT: br label [[COND_END9]]
// CHECK1: cond.end9:
// CHECK1-NEXT: [[COND10:%.*]] = phi i32 [ 4, [[COND_TRUE6]] ], [ [[ADD8]], [[COND_FALSE7]] ]
// CHECK1-NEXT: [[CMP11:%.*]] = icmp slt i32 [[TMP9]], [[COND10]]
// CHECK1-NEXT: br i1 [[CMP11]], label [[FOR_BODY12:%.*]], label [[FOR_END29:%.*]]
// CHECK1: for.body12:
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL13:%.*]] = mul nsw i32 [[TMP12]], 3
// CHECK1-NEXT: [[ADD14:%.*]] = add nsw i32 7, [[MUL13]]
// CHECK1-NEXT: store i32 [[ADD14]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: store i32 [[TMP13]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND15:%.*]]
// CHECK1: for.cond15:
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD16:%.*]] = add nsw i32 [[TMP15]], 5
// CHECK1-NEXT: [[CMP17:%.*]] = icmp slt i32 4, [[ADD16]]
// CHECK1-NEXT: br i1 [[CMP17]], label [[COND_TRUE18:%.*]], label [[COND_FALSE19:%.*]]
// CHECK1: cond.true18:
// CHECK1-NEXT: br label [[COND_END21:%.*]]
// CHECK1: cond.false19:
// CHECK1-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD20:%.*]] = add nsw i32 [[TMP16]], 5
// CHECK1-NEXT: br label [[COND_END21]]
// CHECK1: cond.end21:
// CHECK1-NEXT: [[COND22:%.*]] = phi i32 [ 4, [[COND_TRUE18]] ], [ [[ADD20]], [[COND_FALSE19]] ]
// CHECK1-NEXT: [[CMP23:%.*]] = icmp slt i32 [[TMP14]], [[COND22]]
// CHECK1-NEXT: br i1 [[CMP23]], label [[FOR_BODY24:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body24:
// CHECK1-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[MUL25:%.*]] = mul nsw i32 [[TMP17]], 3
// CHECK1-NEXT: [[ADD26:%.*]] = add nsw i32 7, [[MUL25]]
// CHECK1-NEXT: store i32 [[ADD26]], i32* [[J]], align 4
// CHECK1-NEXT: [[TMP18:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: [[TMP19:%.*]] = load i32, i32* [[J]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP18]], i32 [[TMP19]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP20]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND15]], !llvm.loop [[LOOP11:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC27:%.*]]
// CHECK1: for.inc27:
// CHECK1-NEXT: [[TMP21:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC28:%.*]] = add nsw i32 [[TMP21]], 1
// CHECK1-NEXT: store i32 [[INC28]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND3]], !llvm.loop [[LOOP12:![0-9]+]]
// CHECK1: for.end29:
// CHECK1-NEXT: br label [[FOR_INC30:%.*]]
// CHECK1: for.inc30:
// CHECK1-NEXT: [[TMP22:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD31:%.*]] = add nsw i32 [[TMP22]], 5
// CHECK1-NEXT: store i32 [[ADD31]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP13:![0-9]+]]
// CHECK1: for.end32:
// CHECK1-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK1: omp.body.continue:
// CHECK1-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK1: omp.inner.for.inc:
// CHECK1-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[ADD33:%.*]] = add nsw i32 [[TMP23]], 1
// CHECK1-NEXT: store i32 [[ADD33]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK1: omp.inner.for.end:
// CHECK1-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK1: omp.loop.exit:
// CHECK1-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK1-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3:[0-9]+]], i32 [[TMP0]])
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo4
// CHECK1-SAME: () #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[_TMP1:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[K:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2]])
// CHECK1-NEXT: store i32 7, i32* [[I]], align 4
// CHECK1-NEXT: store i32 7, i32* [[J]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 3, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK1-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP1]], 3
// CHECK1-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 3, [[COND_TRUE]] ], [ [[TMP2]], [[COND_FALSE]] ]
// CHECK1-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 [[TMP3]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK1: omp.inner.for.cond:
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP2:%.*]] = icmp sle i32 [[TMP4]], [[TMP5]]
// CHECK1-NEXT: br i1 [[CMP2]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK1: omp.inner.for.body:
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[DIV:%.*]] = sdiv i32 [[TMP6]], 1
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[DIV]], 3
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 7, [[MUL]]
// CHECK1-NEXT: store i32 [[ADD]], i32* [[K]], align 4
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[DIV3:%.*]] = sdiv i32 [[TMP8]], 1
// CHECK1-NEXT: [[MUL4:%.*]] = mul nsw i32 [[DIV3]], 1
// CHECK1-NEXT: [[SUB:%.*]] = sub nsw i32 [[TMP7]], [[MUL4]]
// CHECK1-NEXT: [[MUL5:%.*]] = mul nsw i32 [[SUB]], 5
// CHECK1-NEXT: [[ADD6:%.*]] = add nsw i32 0, [[MUL5]]
// CHECK1-NEXT: store i32 [[ADD6]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[CMP7:%.*]] = icmp slt i32 [[TMP9]], 4
// CHECK1-NEXT: br i1 [[CMP7]], label [[FOR_BODY:%.*]], label [[FOR_END37:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP10]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND8:%.*]]
// CHECK1: for.cond8:
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD9:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK1-NEXT: [[CMP10:%.*]] = icmp slt i32 4, [[ADD9]]
// CHECK1-NEXT: br i1 [[CMP10]], label [[COND_TRUE11:%.*]], label [[COND_FALSE12:%.*]]
// CHECK1: cond.true11:
// CHECK1-NEXT: br label [[COND_END14:%.*]]
// CHECK1: cond.false12:
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD13:%.*]] = add nsw i32 [[TMP13]], 5
// CHECK1-NEXT: br label [[COND_END14]]
// CHECK1: cond.end14:
// CHECK1-NEXT: [[COND15:%.*]] = phi i32 [ 4, [[COND_TRUE11]] ], [ [[ADD13]], [[COND_FALSE12]] ]
// CHECK1-NEXT: [[CMP16:%.*]] = icmp slt i32 [[TMP11]], [[COND15]]
// CHECK1-NEXT: br i1 [[CMP16]], label [[FOR_BODY17:%.*]], label [[FOR_END34:%.*]]
// CHECK1: for.body17:
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL18:%.*]] = mul nsw i32 [[TMP14]], 3
// CHECK1-NEXT: [[ADD19:%.*]] = add nsw i32 7, [[MUL18]]
// CHECK1-NEXT: store i32 [[ADD19]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: store i32 [[TMP15]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND20:%.*]]
// CHECK1: for.cond20:
// CHECK1-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD21:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK1-NEXT: [[CMP22:%.*]] = icmp slt i32 4, [[ADD21]]
// CHECK1-NEXT: br i1 [[CMP22]], label [[COND_TRUE23:%.*]], label [[COND_FALSE24:%.*]]
// CHECK1: cond.true23:
// CHECK1-NEXT: br label [[COND_END26:%.*]]
// CHECK1: cond.false24:
// CHECK1-NEXT: [[TMP18:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD25:%.*]] = add nsw i32 [[TMP18]], 5
// CHECK1-NEXT: br label [[COND_END26]]
// CHECK1: cond.end26:
// CHECK1-NEXT: [[COND27:%.*]] = phi i32 [ 4, [[COND_TRUE23]] ], [ [[ADD25]], [[COND_FALSE24]] ]
// CHECK1-NEXT: [[CMP28:%.*]] = icmp slt i32 [[TMP16]], [[COND27]]
// CHECK1-NEXT: br i1 [[CMP28]], label [[FOR_BODY29:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body29:
// CHECK1-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[MUL30:%.*]] = mul nsw i32 [[TMP19]], 3
// CHECK1-NEXT: [[ADD31:%.*]] = add nsw i32 7, [[MUL30]]
// CHECK1-NEXT: store i32 [[ADD31]], i32* [[J]], align 4
// CHECK1-NEXT: [[TMP20:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: [[TMP21:%.*]] = load i32, i32* [[J]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP20]], i32 [[TMP21]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP22:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP22]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND20]], !llvm.loop [[LOOP14:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC32:%.*]]
// CHECK1: for.inc32:
// CHECK1-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC33:%.*]] = add nsw i32 [[TMP23]], 1
// CHECK1-NEXT: store i32 [[INC33]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND8]], !llvm.loop [[LOOP15:![0-9]+]]
// CHECK1: for.end34:
// CHECK1-NEXT: br label [[FOR_INC35:%.*]]
// CHECK1: for.inc35:
// CHECK1-NEXT: [[TMP24:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: [[ADD36:%.*]] = add nsw i32 [[TMP24]], 5
// CHECK1-NEXT: store i32 [[ADD36]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP16:![0-9]+]]
// CHECK1: for.end37:
// CHECK1-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK1: omp.body.continue:
// CHECK1-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK1: omp.inner.for.inc:
// CHECK1-NEXT: [[TMP25:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[ADD38:%.*]] = add nsw i32 [[TMP25]], 1
// CHECK1-NEXT: store i32 [[ADD38]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK1: omp.inner.for.end:
// CHECK1-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK1: omp.loop.exit:
// CHECK1-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK1-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3]], i32 [[TMP0]])
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo5
// CHECK1-SAME: () #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[DOTOMP_IV:%.*]] = alloca i64, align 8
// CHECK1-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[_TMP1:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[_TMP2:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_3:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_5:%.*]] = alloca i64, align 8
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_LB:%.*]] = alloca i64, align 8
// CHECK1-NEXT: [[DOTOMP_UB:%.*]] = alloca i64, align 8
// CHECK1-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i64, align 8
// CHECK1-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I11:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I12:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[J13:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2]])
// CHECK1-NEXT: store i32 7, i32* [[I]], align 4
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK1-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP2]], 5
// CHECK1-NEXT: [[CMP:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK1-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK1-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP3]], 5
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD4]], [[COND_FALSE]] ]
// CHECK1-NEXT: store i32 [[COND]], i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB:%.*]] = sub i32 [[TMP4]], [[TMP5]]
// CHECK1-NEXT: [[SUB6:%.*]] = sub i32 [[SUB]], 1
// CHECK1-NEXT: [[ADD7:%.*]] = add i32 [[SUB6]], 1
// CHECK1-NEXT: [[DIV:%.*]] = udiv i32 [[ADD7]], 1
// CHECK1-NEXT: [[CONV:%.*]] = zext i32 [[DIV]] to i64
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i64 1, [[CONV]]
// CHECK1-NEXT: [[MUL8:%.*]] = mul nsw i64 [[MUL]], 4
// CHECK1-NEXT: [[SUB9:%.*]] = sub nsw i64 [[MUL8]], 1
// CHECK1-NEXT: store i64 [[SUB9]], i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: store i32 [[TMP6]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: store i32 7, i32* [[J]], align 4
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[CMP10:%.*]] = icmp slt i32 [[TMP7]], [[TMP8]]
// CHECK1-NEXT: br i1 [[CMP10]], label [[OMP_PRECOND_THEN:%.*]], label [[OMP_PRECOND_END:%.*]]
// CHECK1: omp.precond.then:
// CHECK1-NEXT: store i64 0, i64* [[DOTOMP_LB]], align 8
// CHECK1-NEXT: [[TMP9:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK1-NEXT: store i64 [[TMP9]], i64* [[DOTOMP_UB]], align 8
// CHECK1-NEXT: store i64 1, i64* [[DOTOMP_STRIDE]], align 8
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK1-NEXT: call void @__kmpc_for_static_init_8(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i64* [[DOTOMP_LB]], i64* [[DOTOMP_UB]], i64* [[DOTOMP_STRIDE]], i64 1, i64 1)
// CHECK1-NEXT: [[TMP10:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK1-NEXT: [[TMP11:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK1-NEXT: [[CMP14:%.*]] = icmp sgt i64 [[TMP10]], [[TMP11]]
// CHECK1-NEXT: br i1 [[CMP14]], label [[COND_TRUE15:%.*]], label [[COND_FALSE16:%.*]]
// CHECK1: cond.true15:
// CHECK1-NEXT: [[TMP12:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK1-NEXT: br label [[COND_END17:%.*]]
// CHECK1: cond.false16:
// CHECK1-NEXT: [[TMP13:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK1-NEXT: br label [[COND_END17]]
// CHECK1: cond.end17:
// CHECK1-NEXT: [[COND18:%.*]] = phi i64 [ [[TMP12]], [[COND_TRUE15]] ], [ [[TMP13]], [[COND_FALSE16]] ]
// CHECK1-NEXT: store i64 [[COND18]], i64* [[DOTOMP_UB]], align 8
// CHECK1-NEXT: [[TMP14:%.*]] = load i64, i64* [[DOTOMP_LB]], align 8
// CHECK1-NEXT: store i64 [[TMP14]], i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK1: omp.inner.for.cond:
// CHECK1-NEXT: [[TMP15:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP16:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK1-NEXT: [[CMP19:%.*]] = icmp sle i64 [[TMP15]], [[TMP16]]
// CHECK1-NEXT: br i1 [[CMP19]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK1: omp.inner.for.body:
// CHECK1-NEXT: [[TMP17:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP18:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB20:%.*]] = sub i32 [[TMP18]], [[TMP19]]
// CHECK1-NEXT: [[SUB21:%.*]] = sub i32 [[SUB20]], 1
// CHECK1-NEXT: [[ADD22:%.*]] = add i32 [[SUB21]], 1
// CHECK1-NEXT: [[DIV23:%.*]] = udiv i32 [[ADD22]], 1
// CHECK1-NEXT: [[MUL24:%.*]] = mul i32 1, [[DIV23]]
// CHECK1-NEXT: [[MUL25:%.*]] = mul i32 [[MUL24]], 4
// CHECK1-NEXT: [[CONV26:%.*]] = zext i32 [[MUL25]] to i64
// CHECK1-NEXT: [[DIV27:%.*]] = sdiv i64 [[TMP17]], [[CONV26]]
// CHECK1-NEXT: [[MUL28:%.*]] = mul nsw i64 [[DIV27]], 5
// CHECK1-NEXT: [[ADD29:%.*]] = add nsw i64 0, [[MUL28]]
// CHECK1-NEXT: [[CONV30:%.*]] = trunc i64 [[ADD29]] to i32
// CHECK1-NEXT: store i32 [[CONV30]], i32* [[DOTFLOOR_0_IV_I11]], align 4
// CHECK1-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[CONV31:%.*]] = sext i32 [[TMP20]] to i64
// CHECK1-NEXT: [[TMP21:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP22:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP24:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB32:%.*]] = sub i32 [[TMP23]], [[TMP24]]
// CHECK1-NEXT: [[SUB33:%.*]] = sub i32 [[SUB32]], 1
// CHECK1-NEXT: [[ADD34:%.*]] = add i32 [[SUB33]], 1
// CHECK1-NEXT: [[DIV35:%.*]] = udiv i32 [[ADD34]], 1
// CHECK1-NEXT: [[MUL36:%.*]] = mul i32 1, [[DIV35]]
// CHECK1-NEXT: [[MUL37:%.*]] = mul i32 [[MUL36]], 4
// CHECK1-NEXT: [[CONV38:%.*]] = zext i32 [[MUL37]] to i64
// CHECK1-NEXT: [[DIV39:%.*]] = sdiv i64 [[TMP22]], [[CONV38]]
// CHECK1-NEXT: [[TMP25:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP26:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB40:%.*]] = sub i32 [[TMP25]], [[TMP26]]
// CHECK1-NEXT: [[SUB41:%.*]] = sub i32 [[SUB40]], 1
// CHECK1-NEXT: [[ADD42:%.*]] = add i32 [[SUB41]], 1
// CHECK1-NEXT: [[DIV43:%.*]] = udiv i32 [[ADD42]], 1
// CHECK1-NEXT: [[MUL44:%.*]] = mul i32 1, [[DIV43]]
// CHECK1-NEXT: [[MUL45:%.*]] = mul i32 [[MUL44]], 4
// CHECK1-NEXT: [[CONV46:%.*]] = zext i32 [[MUL45]] to i64
// CHECK1-NEXT: [[MUL47:%.*]] = mul nsw i64 [[DIV39]], [[CONV46]]
// CHECK1-NEXT: [[SUB48:%.*]] = sub nsw i64 [[TMP21]], [[MUL47]]
// CHECK1-NEXT: [[DIV49:%.*]] = sdiv i64 [[SUB48]], 4
// CHECK1-NEXT: [[MUL50:%.*]] = mul nsw i64 [[DIV49]], 1
// CHECK1-NEXT: [[ADD51:%.*]] = add nsw i64 [[CONV31]], [[MUL50]]
// CHECK1-NEXT: [[CONV52:%.*]] = trunc i64 [[ADD51]] to i32
// CHECK1-NEXT: store i32 [[CONV52]], i32* [[DOTTILE_0_IV_I12]], align 4
// CHECK1-NEXT: [[TMP27:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP28:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP29:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP30:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB53:%.*]] = sub i32 [[TMP29]], [[TMP30]]
// CHECK1-NEXT: [[SUB54:%.*]] = sub i32 [[SUB53]], 1
// CHECK1-NEXT: [[ADD55:%.*]] = add i32 [[SUB54]], 1
// CHECK1-NEXT: [[DIV56:%.*]] = udiv i32 [[ADD55]], 1
// CHECK1-NEXT: [[MUL57:%.*]] = mul i32 1, [[DIV56]]
// CHECK1-NEXT: [[MUL58:%.*]] = mul i32 [[MUL57]], 4
// CHECK1-NEXT: [[CONV59:%.*]] = zext i32 [[MUL58]] to i64
// CHECK1-NEXT: [[DIV60:%.*]] = sdiv i64 [[TMP28]], [[CONV59]]
// CHECK1-NEXT: [[TMP31:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP32:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB61:%.*]] = sub i32 [[TMP31]], [[TMP32]]
// CHECK1-NEXT: [[SUB62:%.*]] = sub i32 [[SUB61]], 1
// CHECK1-NEXT: [[ADD63:%.*]] = add i32 [[SUB62]], 1
// CHECK1-NEXT: [[DIV64:%.*]] = udiv i32 [[ADD63]], 1
// CHECK1-NEXT: [[MUL65:%.*]] = mul i32 1, [[DIV64]]
// CHECK1-NEXT: [[MUL66:%.*]] = mul i32 [[MUL65]], 4
// CHECK1-NEXT: [[CONV67:%.*]] = zext i32 [[MUL66]] to i64
// CHECK1-NEXT: [[MUL68:%.*]] = mul nsw i64 [[DIV60]], [[CONV67]]
// CHECK1-NEXT: [[SUB69:%.*]] = sub nsw i64 [[TMP27]], [[MUL68]]
// CHECK1-NEXT: [[TMP33:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP34:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[TMP35:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP36:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB70:%.*]] = sub i32 [[TMP35]], [[TMP36]]
// CHECK1-NEXT: [[SUB71:%.*]] = sub i32 [[SUB70]], 1
// CHECK1-NEXT: [[ADD72:%.*]] = add i32 [[SUB71]], 1
// CHECK1-NEXT: [[DIV73:%.*]] = udiv i32 [[ADD72]], 1
// CHECK1-NEXT: [[MUL74:%.*]] = mul i32 1, [[DIV73]]
// CHECK1-NEXT: [[MUL75:%.*]] = mul i32 [[MUL74]], 4
// CHECK1-NEXT: [[CONV76:%.*]] = zext i32 [[MUL75]] to i64
// CHECK1-NEXT: [[DIV77:%.*]] = sdiv i64 [[TMP34]], [[CONV76]]
// CHECK1-NEXT: [[TMP37:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK1-NEXT: [[TMP38:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB78:%.*]] = sub i32 [[TMP37]], [[TMP38]]
// CHECK1-NEXT: [[SUB79:%.*]] = sub i32 [[SUB78]], 1
// CHECK1-NEXT: [[ADD80:%.*]] = add i32 [[SUB79]], 1
// CHECK1-NEXT: [[DIV81:%.*]] = udiv i32 [[ADD80]], 1
// CHECK1-NEXT: [[MUL82:%.*]] = mul i32 1, [[DIV81]]
// CHECK1-NEXT: [[MUL83:%.*]] = mul i32 [[MUL82]], 4
// CHECK1-NEXT: [[CONV84:%.*]] = zext i32 [[MUL83]] to i64
// CHECK1-NEXT: [[MUL85:%.*]] = mul nsw i64 [[DIV77]], [[CONV84]]
// CHECK1-NEXT: [[SUB86:%.*]] = sub nsw i64 [[TMP33]], [[MUL85]]
// CHECK1-NEXT: [[DIV87:%.*]] = sdiv i64 [[SUB86]], 4
// CHECK1-NEXT: [[MUL88:%.*]] = mul nsw i64 [[DIV87]], 4
// CHECK1-NEXT: [[SUB89:%.*]] = sub nsw i64 [[SUB69]], [[MUL88]]
// CHECK1-NEXT: [[MUL90:%.*]] = mul nsw i64 [[SUB89]], 3
// CHECK1-NEXT: [[ADD91:%.*]] = add nsw i64 7, [[MUL90]]
// CHECK1-NEXT: [[CONV92:%.*]] = trunc i64 [[ADD91]] to i32
// CHECK1-NEXT: store i32 [[CONV92]], i32* [[J13]], align 4
// CHECK1-NEXT: [[TMP39:%.*]] = load i32, i32* [[DOTTILE_0_IV_I12]], align 4
// CHECK1-NEXT: [[MUL93:%.*]] = mul nsw i32 [[TMP39]], 3
// CHECK1-NEXT: [[ADD94:%.*]] = add nsw i32 7, [[MUL93]]
// CHECK1-NEXT: store i32 [[ADD94]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP40:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: [[TMP41:%.*]] = load i32, i32* [[J13]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP40]], i32 [[TMP41]])
// CHECK1-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK1: omp.body.continue:
// CHECK1-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK1: omp.inner.for.inc:
// CHECK1-NEXT: [[TMP42:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: [[ADD95:%.*]] = add nsw i64 [[TMP42]], 1
// CHECK1-NEXT: store i64 [[ADD95]], i64* [[DOTOMP_IV]], align 8
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK1: omp.inner.for.end:
// CHECK1-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK1: omp.loop.exit:
// CHECK1-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK1-NEXT: br label [[OMP_PRECOND_END]]
// CHECK1: omp.precond.end:
// CHECK1-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3]], i32 [[TMP0]])
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@foo6
// CHECK1-SAME: () #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: call void (%struct.ident_t*, i32, void (i32*, i32*, ...)*, ...) @__kmpc_fork_call(%struct.ident_t* @[[GLOB2]], i32 0, void (i32*, i32*, ...)* bitcast (void (i32*, i32*)* @.omp_outlined. to void (i32*, i32*, ...)*))
// CHECK1-NEXT: ret void
//
//
// CHECK1-LABEL: define {{[^@]+}}@.omp_outlined.
// CHECK1-SAME: (i32* noalias [[DOTGLOBAL_TID_:%.*]], i32* noalias [[DOTBOUND_TID_:%.*]]) #[[ATTR5:[0-9]+]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[DOTGLOBAL_TID__ADDR:%.*]] = alloca i32*, align 8
// CHECK1-NEXT: [[DOTBOUND_TID__ADDR:%.*]] = alloca i32*, align 8
// CHECK1-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: store i32* [[DOTGLOBAL_TID_]], i32** [[DOTGLOBAL_TID__ADDR]], align 8
// CHECK1-NEXT: store i32* [[DOTBOUND_TID_]], i32** [[DOTBOUND_TID__ADDR]], align 8
// CHECK1-NEXT: store i32 7, i32* [[I]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK1-NEXT: [[TMP0:%.*]] = load i32*, i32** [[DOTGLOBAL_TID__ADDR]], align 8
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[TMP0]], align 4
// CHECK1-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1]], i32 [[TMP1]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP2]], 0
// CHECK1-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ 0, [[COND_TRUE]] ], [ [[TMP3]], [[COND_FALSE]] ]
// CHECK1-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK1-NEXT: store i32 [[TMP4]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK1: omp.inner.for.cond:
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK1-NEXT: [[CMP1:%.*]] = icmp sle i32 [[TMP5]], [[TMP6]]
// CHECK1-NEXT: br i1 [[CMP1]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK1: omp.inner.for.body:
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP7]], 5
// CHECK1-NEXT: [[ADD:%.*]] = add nsw i32 0, [[MUL]]
// CHECK1-NEXT: store i32 [[ADD]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP8]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD2:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK1-NEXT: [[CMP3:%.*]] = icmp slt i32 4, [[ADD2]]
// CHECK1-NEXT: br i1 [[CMP3]], label [[COND_TRUE4:%.*]], label [[COND_FALSE5:%.*]]
// CHECK1: cond.true4:
// CHECK1-NEXT: br label [[COND_END7:%.*]]
// CHECK1: cond.false5:
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD6:%.*]] = add nsw i32 [[TMP11]], 5
// CHECK1-NEXT: br label [[COND_END7]]
// CHECK1: cond.end7:
// CHECK1-NEXT: [[COND8:%.*]] = phi i32 [ 4, [[COND_TRUE4]] ], [ [[ADD6]], [[COND_FALSE5]] ]
// CHECK1-NEXT: [[CMP9:%.*]] = icmp slt i32 [[TMP9]], [[COND8]]
// CHECK1-NEXT: br i1 [[CMP9]], label [[FOR_BODY:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL10:%.*]] = mul nsw i32 [[TMP12]], 3
// CHECK1-NEXT: [[ADD11:%.*]] = add nsw i32 7, [[MUL10]]
// CHECK1-NEXT: store i32 [[ADD11]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP13]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP14]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP17:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK1: omp.body.continue:
// CHECK1-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK1: omp.inner.for.inc:
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: [[ADD12:%.*]] = add nsw i32 [[TMP15]], 1
// CHECK1-NEXT: store i32 [[ADD12]], i32* [[DOTOMP_IV]], align 4
// CHECK1-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK1: omp.inner.for.end:
// CHECK1-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK1: omp.loop.exit:
// CHECK1-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP1]])
// CHECK1-NEXT: ret void
//
//
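// tfoo7 instantiates the template: it forwards to the foo7<int, 3, 5>
// specialization (mangled _Z4foo7IiLi3ELi5EEvT_S0_) with arguments (0, 42).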
// CHECK1-LABEL: define {{[^@]+}}@tfoo7
// CHECK1-SAME: () #[[ATTR0]] {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: call void @_Z4foo7IiLi3ELi5EEvT_S0_(i32 0, i32 42)
// CHECK1-NEXT: ret void
//
//
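// Body of the foo7<int, 3, 5> instantiation: the trip count
// (end-start-1+3)/3 is computed in unsigned arithmetic, cached minus one in
// .capture_expr.2, and the loop is tiled by 5 via the usual
// .floor_0.iv.i / .tile_0.iv.i pair.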
// CHECK1-LABEL: define {{[^@]+}}@_Z4foo7IiLi3ELi5EEvT_S0_
// CHECK1-SAME: (i32 [[START:%.*]], i32 [[END:%.*]]) #[[ATTR0]] comdat {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_1:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTCAPTURE_EXPR_2:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK1-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK1-NEXT: [[TMP0:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP0]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP1:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP2:%.*]] = load i32, i32* [[END_ADDR]], align 4
// CHECK1-NEXT: store i32 [[TMP2]], i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK1-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK1-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[SUB:%.*]] = sub i32 [[TMP3]], [[TMP4]]
// CHECK1-NEXT: [[SUB3:%.*]] = sub i32 [[SUB]], 1
// CHECK1-NEXT: [[ADD:%.*]] = add i32 [[SUB3]], 3
// CHECK1-NEXT: [[DIV:%.*]] = udiv i32 [[ADD]], 3
// CHECK1-NEXT: [[SUB4:%.*]] = sub i32 [[DIV]], 1
// CHECK1-NEXT: store i32 [[SUB4]], i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND:%.*]]
// CHECK1: for.cond:
// CHECK1-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[ADD5:%.*]] = add i32 [[TMP6]], 1
// CHECK1-NEXT: [[CMP:%.*]] = icmp ult i32 [[TMP5]], [[ADD5]]
// CHECK1-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END17:%.*]]
// CHECK1: for.body:
// CHECK1-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: store i32 [[TMP7]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND6:%.*]]
// CHECK1: for.cond6:
// CHECK1-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[ADD7:%.*]] = add i32 [[TMP9]], 1
// CHECK1-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD8:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK1-NEXT: [[CMP9:%.*]] = icmp ult i32 [[ADD7]], [[ADD8]]
// CHECK1-NEXT: br i1 [[CMP9]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK1: cond.true:
// CHECK1-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK1-NEXT: [[ADD10:%.*]] = add i32 [[TMP11]], 1
// CHECK1-NEXT: br label [[COND_END:%.*]]
// CHECK1: cond.false:
// CHECK1-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD11:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK1-NEXT: br label [[COND_END]]
// CHECK1: cond.end:
// CHECK1-NEXT: [[COND:%.*]] = phi i32 [ [[ADD10]], [[COND_TRUE]] ], [ [[ADD11]], [[COND_FALSE]] ]
// CHECK1-NEXT: [[CMP12:%.*]] = icmp ult i32 [[TMP8]], [[COND]]
// CHECK1-NEXT: br i1 [[CMP12]], label [[FOR_BODY13:%.*]], label [[FOR_END:%.*]]
// CHECK1: for.body13:
// CHECK1-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK1-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[MUL:%.*]] = mul i32 [[TMP14]], 3
// CHECK1-NEXT: [[ADD14:%.*]] = add i32 [[TMP13]], [[MUL]]
// CHECK1-NEXT: store i32 [[ADD14]], i32* [[I]], align 4
// CHECK1-NEXT: [[TMP15:%.*]] = load i32, i32* [[I]], align 4
// CHECK1-NEXT: call void (...) @body(i32 [[TMP15]])
// CHECK1-NEXT: br label [[FOR_INC:%.*]]
// CHECK1: for.inc:
// CHECK1-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: [[INC:%.*]] = add nsw i32 [[TMP16]], 1
// CHECK1-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND6]], !llvm.loop [[LOOP20:![0-9]+]]
// CHECK1: for.end:
// CHECK1-NEXT: br label [[FOR_INC15:%.*]]
// CHECK1: for.inc15:
// CHECK1-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: [[ADD16:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK1-NEXT: store i32 [[ADD16]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK1-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP21:![0-9]+]]
// CHECK1: for.end17:
// CHECK1-NEXT: ret void
//
//
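// Startup glue: _GLOBAL__sub_I_tile_codegen.cpp only chains to
// __cxx_global_var_init.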
// CHECK1-LABEL: define {{[^@]+}}@_GLOBAL__sub_I_tile_codegen.cpp
// CHECK1-SAME: () #[[ATTR1]] section ".text.startup" {
// CHECK1-NEXT: entry:
// CHECK1-NEXT: call void @__cxx_global_var_init()
// CHECK1-NEXT: ret void
//
//
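// The CHECK2 prefix belongs to a second RUN configuration over the same
// source; the IR below mirrors the CHECK1 output up to value numbering.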
// CHECK2-LABEL: define {{[^@]+}}@__cxx_global_var_init
// CHECK2-SAME: () #[[ATTR0:[0-9]+]] section ".text.startup" {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: call void @_ZN1SC1Ev(%struct.S* nonnull align 4 dereferenceable(4) @s)
// CHECK2-NEXT: ret void
//
//
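// The complete-object constructor C1 just forwards to the base-object
// constructor C2, which holds the actual loop.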
// CHECK2-LABEL: define {{[^@]+}}@_ZN1SC1Ev
// CHECK2-SAME: (%struct.S* nonnull align 4 dereferenceable(4) [[THIS:%.*]]) unnamed_addr #[[ATTR1:[0-9]+]] comdat align 2 {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[THIS_ADDR:%.*]] = alloca %struct.S*, align 8
// CHECK2-NEXT: store %struct.S* [[THIS]], %struct.S** [[THIS_ADDR]], align 8
// CHECK2-NEXT: [[THIS1:%.*]] = load %struct.S*, %struct.S** [[THIS_ADDR]], align 8
// CHECK2-NEXT: call void @_ZN1SC2Ev(%struct.S* nonnull align 4 dereferenceable(4) [[THIS1]])
// CHECK2-NEXT: ret void
//
//
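// The tiled loop lives in the S constructor; its induction variable is the
// member i, so every store goes through the captured i32* rather than a
// local alloca.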
// CHECK2-LABEL: define {{[^@]+}}@_ZN1SC2Ev
// CHECK2-SAME: (%struct.S* nonnull align 4 dereferenceable(4) [[THIS:%.*]]) unnamed_addr #[[ATTR1]] comdat align 2 {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[THIS_ADDR:%.*]] = alloca %struct.S*, align 8
// CHECK2-NEXT: [[I:%.*]] = alloca i32*, align 8
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: store %struct.S* [[THIS]], %struct.S** [[THIS_ADDR]], align 8
// CHECK2-NEXT: [[THIS1:%.*]] = load %struct.S*, %struct.S** [[THIS_ADDR]], align 8
// CHECK2-NEXT: [[I2:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], %struct.S* [[THIS1]], i32 0, i32 0
// CHECK2-NEXT: store i32* [[I2]], i32** [[I]], align 8
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP0:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[CMP:%.*]] = icmp slt i32 [[TMP0]], 4
// CHECK2-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END11:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP1]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND3:%.*]]
// CHECK2: for.cond3:
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP3]], 5
// CHECK2-NEXT: [[CMP4:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK2-NEXT: br i1 [[CMP4]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD5:%.*]] = add nsw i32 [[TMP4]], 5
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD5]], [[COND_FALSE]] ]
// CHECK2-NEXT: [[CMP6:%.*]] = icmp slt i32 [[TMP2]], [[COND]]
// CHECK2-NEXT: br i1 [[CMP6]], label [[FOR_BODY7:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body7:
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP5]], 3
// CHECK2-NEXT: [[ADD8:%.*]] = add nsw i32 7, [[MUL]]
// CHECK2-NEXT: [[TMP6:%.*]] = load i32*, i32** [[I]], align 8
// CHECK2-NEXT: store i32 [[ADD8]], i32* [[TMP6]], align 4
// CHECK2-NEXT: [[TMP7:%.*]] = load i32*, i32** [[I]], align 8
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP8]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP9]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND3]], !llvm.loop [[LOOP2:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC9:%.*]]
// CHECK2: for.inc9:
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD10:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK2-NEXT: store i32 [[ADD10]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP4:![0-9]+]]
// CHECK2: for.end11:
// CHECK2-NEXT: ret void
//
//
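// body(...) is declared varargs and serves as a no-op sink for the loop
// bodies.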
// CHECK2-LABEL: define {{[^@]+}}@body
// CHECK2-SAME: (...) #[[ATTR2:[0-9]+]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: ret void
//
//
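// foo1: one loop with runtime start/end/step. The trip count
// (end-start-1+step)/step is precomputed (minus one, in .capture_expr.3) and
// the loop is tiled by 5: the floor IV advances in steps of 5 and the tile
// IV covers min(floor+5, count) logical iterations.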
// CHECK2-LABEL: define {{[^@]+}}@foo1
// CHECK2-SAME: (i32 [[START:%.*]], i32 [[END:%.*]], i32 [[STEP:%.*]]) #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[STEP_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_1:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_2:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_3:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK2-NEXT: store i32 [[STEP]], i32* [[STEP_ADDR]], align 4
// CHECK2-NEXT: [[TMP0:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP0]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[END_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[STEP_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP2]], i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB:%.*]] = sub i32 [[TMP3]], [[TMP4]]
// CHECK2-NEXT: [[SUB4:%.*]] = sub i32 [[SUB]], 1
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[ADD:%.*]] = add i32 [[SUB4]], [[TMP5]]
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[DIV:%.*]] = udiv i32 [[ADD]], [[TMP6]]
// CHECK2-NEXT: [[SUB5:%.*]] = sub i32 [[DIV]], 1
// CHECK2-NEXT: store i32 [[SUB5]], i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[ADD6:%.*]] = add i32 [[TMP8]], 1
// CHECK2-NEXT: [[CMP:%.*]] = icmp ult i32 [[TMP7]], [[ADD6]]
// CHECK2-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END18:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP9]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND7:%.*]]
// CHECK2: for.cond7:
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[ADD8:%.*]] = add i32 [[TMP11]], 1
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD9:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK2-NEXT: [[CMP10:%.*]] = icmp ult i32 [[ADD8]], [[ADD9]]
// CHECK2-NEXT: br i1 [[CMP10]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[ADD11:%.*]] = add i32 [[TMP13]], 1
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD12:%.*]] = add nsw i32 [[TMP14]], 5
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ [[ADD11]], [[COND_TRUE]] ], [ [[ADD12]], [[COND_FALSE]] ]
// CHECK2-NEXT: [[CMP13:%.*]] = icmp ult i32 [[TMP10]], [[COND]]
// CHECK2-NEXT: br i1 [[CMP13]], label [[FOR_BODY14:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body14:
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul i32 [[TMP16]], [[TMP17]]
// CHECK2-NEXT: [[ADD15:%.*]] = add i32 [[TMP15]], [[MUL]]
// CHECK2-NEXT: store i32 [[ADD15]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP18:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP18]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP19]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND7]], !llvm.loop [[LOOP5:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC16:%.*]]
// CHECK2: for.inc16:
// CHECK2-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD17:%.*]] = add nsw i32 [[TMP20]], 5
// CHECK2-NEXT: store i32 [[ADD17]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP6:![0-9]+]]
// CHECK2: for.end18:
// CHECK2-NEXT: ret void
//
//
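// foo2: two nested fixed-bound loops (i and j, four logical iterations each,
// with i/j = 7 + 3*iv); both dimensions are tiled by 5, giving the
// floor_0/floor_1 outer pair and the tile_0/tile_1 inner pair below.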
// CHECK2-LABEL: define {{[^@]+}}@foo2
// CHECK2-SAME: (i32 [[START:%.*]], i32 [[END:%.*]], i32 [[STEP:%.*]]) #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[STEP_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK2-NEXT: store i32 [[STEP]], i32* [[STEP_ADDR]], align 4
// CHECK2-NEXT: store i32 7, i32* [[I]], align 4
// CHECK2-NEXT: store i32 7, i32* [[J]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP0:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[CMP:%.*]] = icmp slt i32 [[TMP0]], 4
// CHECK2-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END30:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND1:%.*]]
// CHECK2: for.cond1:
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[CMP2:%.*]] = icmp slt i32 [[TMP1]], 4
// CHECK2-NEXT: br i1 [[CMP2]], label [[FOR_BODY3:%.*]], label [[FOR_END27:%.*]]
// CHECK2: for.body3:
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP2]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND4:%.*]]
// CHECK2: for.cond4:
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], 5
// CHECK2-NEXT: [[CMP5:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK2-NEXT: br i1 [[CMP5]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD6:%.*]] = add nsw i32 [[TMP5]], 5
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD6]], [[COND_FALSE]] ]
// CHECK2-NEXT: [[CMP7:%.*]] = icmp slt i32 [[TMP3]], [[COND]]
// CHECK2-NEXT: br i1 [[CMP7]], label [[FOR_BODY8:%.*]], label [[FOR_END24:%.*]]
// CHECK2: for.body8:
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP6]], 3
// CHECK2-NEXT: [[ADD9:%.*]] = add nsw i32 7, [[MUL]]
// CHECK2-NEXT: store i32 [[ADD9]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: store i32 [[TMP7]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND10:%.*]]
// CHECK2: for.cond10:
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD11:%.*]] = add nsw i32 [[TMP9]], 5
// CHECK2-NEXT: [[CMP12:%.*]] = icmp slt i32 4, [[ADD11]]
// CHECK2-NEXT: br i1 [[CMP12]], label [[COND_TRUE13:%.*]], label [[COND_FALSE14:%.*]]
// CHECK2: cond.true13:
// CHECK2-NEXT: br label [[COND_END16:%.*]]
// CHECK2: cond.false14:
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD15:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK2-NEXT: br label [[COND_END16]]
// CHECK2: cond.end16:
// CHECK2-NEXT: [[COND17:%.*]] = phi i32 [ 4, [[COND_TRUE13]] ], [ [[ADD15]], [[COND_FALSE14]] ]
// CHECK2-NEXT: [[CMP18:%.*]] = icmp slt i32 [[TMP8]], [[COND17]]
// CHECK2-NEXT: br i1 [[CMP18]], label [[FOR_BODY19:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body19:
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[MUL20:%.*]] = mul nsw i32 [[TMP11]], 3
// CHECK2-NEXT: [[ADD21:%.*]] = add nsw i32 7, [[MUL20]]
// CHECK2-NEXT: store i32 [[ADD21]], i32* [[J]], align 4
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[J]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP12]], i32 [[TMP13]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP14]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND10]], !llvm.loop [[LOOP7:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC22:%.*]]
// CHECK2: for.inc22:
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC23:%.*]] = add nsw i32 [[TMP15]], 1
// CHECK2-NEXT: store i32 [[INC23]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND4]], !llvm.loop [[LOOP8:![0-9]+]]
// CHECK2: for.end24:
// CHECK2-NEXT: br label [[FOR_INC25:%.*]]
// CHECK2: for.inc25:
// CHECK2-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD26:%.*]] = add nsw i32 [[TMP16]], 5
// CHECK2-NEXT: store i32 [[ADD26]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND1]], !llvm.loop [[LOOP9:![0-9]+]]
// CHECK2: for.end27:
// CHECK2-NEXT: br label [[FOR_INC28:%.*]]
// CHECK2: for.inc28:
// CHECK2-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD29:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK2-NEXT: store i32 [[ADD29]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP10:![0-9]+]]
// CHECK2: for.end30:
// CHECK2-NEXT: ret void
//
//
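// foo3: the tiled nest sits under a worksharing construct, so the outermost
// floor loop is distributed via __kmpc_for_static_init_4 /
// __kmpc_for_static_fini while the remaining floor/tile loops stay
// sequential; an implicit barrier follows the loop exit.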
// CHECK2-LABEL: define {{[^@]+}}@foo3
// CHECK2-SAME: () #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2:[0-9]+]])
// CHECK2-NEXT: store i32 7, i32* [[I]], align 4
// CHECK2-NEXT: store i32 7, i32* [[J]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK2-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1:[0-9]+]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP1]], 0
// CHECK2-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 0, [[COND_TRUE]] ], [ [[TMP2]], [[COND_FALSE]] ]
// CHECK2-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 [[TMP3]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK2: omp.inner.for.cond:
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP1:%.*]] = icmp sle i32 [[TMP4]], [[TMP5]]
// CHECK2-NEXT: br i1 [[CMP1]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK2: omp.inner.for.body:
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP6]], 5
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 0, [[MUL]]
// CHECK2-NEXT: store i32 [[ADD]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[CMP2:%.*]] = icmp slt i32 [[TMP7]], 4
// CHECK2-NEXT: br i1 [[CMP2]], label [[FOR_BODY:%.*]], label [[FOR_END32:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP8]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND3:%.*]]
// CHECK2: for.cond3:
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK2-NEXT: [[CMP5:%.*]] = icmp slt i32 4, [[ADD4]]
// CHECK2-NEXT: br i1 [[CMP5]], label [[COND_TRUE6:%.*]], label [[COND_FALSE7:%.*]]
// CHECK2: cond.true6:
// CHECK2-NEXT: br label [[COND_END9:%.*]]
// CHECK2: cond.false7:
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD8:%.*]] = add nsw i32 [[TMP11]], 5
// CHECK2-NEXT: br label [[COND_END9]]
// CHECK2: cond.end9:
// CHECK2-NEXT: [[COND10:%.*]] = phi i32 [ 4, [[COND_TRUE6]] ], [ [[ADD8]], [[COND_FALSE7]] ]
// CHECK2-NEXT: [[CMP11:%.*]] = icmp slt i32 [[TMP9]], [[COND10]]
// CHECK2-NEXT: br i1 [[CMP11]], label [[FOR_BODY12:%.*]], label [[FOR_END29:%.*]]
// CHECK2: for.body12:
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL13:%.*]] = mul nsw i32 [[TMP12]], 3
// CHECK2-NEXT: [[ADD14:%.*]] = add nsw i32 7, [[MUL13]]
// CHECK2-NEXT: store i32 [[ADD14]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: store i32 [[TMP13]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND15:%.*]]
// CHECK2: for.cond15:
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD16:%.*]] = add nsw i32 [[TMP15]], 5
// CHECK2-NEXT: [[CMP17:%.*]] = icmp slt i32 4, [[ADD16]]
// CHECK2-NEXT: br i1 [[CMP17]], label [[COND_TRUE18:%.*]], label [[COND_FALSE19:%.*]]
// CHECK2: cond.true18:
// CHECK2-NEXT: br label [[COND_END21:%.*]]
// CHECK2: cond.false19:
// CHECK2-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD20:%.*]] = add nsw i32 [[TMP16]], 5
// CHECK2-NEXT: br label [[COND_END21]]
// CHECK2: cond.end21:
// CHECK2-NEXT: [[COND22:%.*]] = phi i32 [ 4, [[COND_TRUE18]] ], [ [[ADD20]], [[COND_FALSE19]] ]
// CHECK2-NEXT: [[CMP23:%.*]] = icmp slt i32 [[TMP14]], [[COND22]]
// CHECK2-NEXT: br i1 [[CMP23]], label [[FOR_BODY24:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body24:
// CHECK2-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[MUL25:%.*]] = mul nsw i32 [[TMP17]], 3
// CHECK2-NEXT: [[ADD26:%.*]] = add nsw i32 7, [[MUL25]]
// CHECK2-NEXT: store i32 [[ADD26]], i32* [[J]], align 4
// CHECK2-NEXT: [[TMP18:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: [[TMP19:%.*]] = load i32, i32* [[J]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP18]], i32 [[TMP19]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP20]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND15]], !llvm.loop [[LOOP11:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC27:%.*]]
// CHECK2: for.inc27:
// CHECK2-NEXT: [[TMP21:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC28:%.*]] = add nsw i32 [[TMP21]], 1
// CHECK2-NEXT: store i32 [[INC28]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND3]], !llvm.loop [[LOOP12:![0-9]+]]
// CHECK2: for.end29:
// CHECK2-NEXT: br label [[FOR_INC30:%.*]]
// CHECK2: for.inc30:
// CHECK2-NEXT: [[TMP22:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD31:%.*]] = add nsw i32 [[TMP22]], 5
// CHECK2-NEXT: store i32 [[ADD31]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP13:![0-9]+]]
// CHECK2: for.end32:
// CHECK2-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK2: omp.body.continue:
// CHECK2-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK2: omp.inner.for.inc:
// CHECK2-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[ADD33:%.*]] = add nsw i32 [[TMP23]], 1
// CHECK2-NEXT: store i32 [[ADD33]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK2: omp.inner.for.end:
// CHECK2-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK2: omp.loop.exit:
// CHECK2-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK2-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3:[0-9]+]], i32 [[TMP0]])
// CHECK2-NEXT: ret void
//
//
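// foo4: the logical IV is split into two components (k = 7 + 3*(iv/1) and
// the floor_0 IV), consistent with a collapse over the k loop and the
// single-iteration floor loop; the divisions by the constant 1 come out of
// the frontend unfolded.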
// CHECK2-LABEL: define {{[^@]+}}@foo4
// CHECK2-SAME: () #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[_TMP1:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[K:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_1_IV_J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2]])
// CHECK2-NEXT: store i32 7, i32* [[I]], align 4
// CHECK2-NEXT: store i32 7, i32* [[J]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 3, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK2-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP1]], 3
// CHECK2-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 3, [[COND_TRUE]] ], [ [[TMP2]], [[COND_FALSE]] ]
// CHECK2-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 [[TMP3]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK2: omp.inner.for.cond:
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP2:%.*]] = icmp sle i32 [[TMP4]], [[TMP5]]
// CHECK2-NEXT: br i1 [[CMP2]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK2: omp.inner.for.body:
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[DIV:%.*]] = sdiv i32 [[TMP6]], 1
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i32 [[DIV]], 3
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 7, [[MUL]]
// CHECK2-NEXT: store i32 [[ADD]], i32* [[K]], align 4
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[DIV3:%.*]] = sdiv i32 [[TMP8]], 1
// CHECK2-NEXT: [[MUL4:%.*]] = mul nsw i32 [[DIV3]], 1
// CHECK2-NEXT: [[SUB:%.*]] = sub nsw i32 [[TMP7]], [[MUL4]]
// CHECK2-NEXT: [[MUL5:%.*]] = mul nsw i32 [[SUB]], 5
// CHECK2-NEXT: [[ADD6:%.*]] = add nsw i32 0, [[MUL5]]
// CHECK2-NEXT: store i32 [[ADD6]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[CMP7:%.*]] = icmp slt i32 [[TMP9]], 4
// CHECK2-NEXT: br i1 [[CMP7]], label [[FOR_BODY:%.*]], label [[FOR_END37:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP10]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND8:%.*]]
// CHECK2: for.cond8:
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD9:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK2-NEXT: [[CMP10:%.*]] = icmp slt i32 4, [[ADD9]]
// CHECK2-NEXT: br i1 [[CMP10]], label [[COND_TRUE11:%.*]], label [[COND_FALSE12:%.*]]
// CHECK2: cond.true11:
// CHECK2-NEXT: br label [[COND_END14:%.*]]
// CHECK2: cond.false12:
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD13:%.*]] = add nsw i32 [[TMP13]], 5
// CHECK2-NEXT: br label [[COND_END14]]
// CHECK2: cond.end14:
// CHECK2-NEXT: [[COND15:%.*]] = phi i32 [ 4, [[COND_TRUE11]] ], [ [[ADD13]], [[COND_FALSE12]] ]
// CHECK2-NEXT: [[CMP16:%.*]] = icmp slt i32 [[TMP11]], [[COND15]]
// CHECK2-NEXT: br i1 [[CMP16]], label [[FOR_BODY17:%.*]], label [[FOR_END34:%.*]]
// CHECK2: for.body17:
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL18:%.*]] = mul nsw i32 [[TMP14]], 3
// CHECK2-NEXT: [[ADD19:%.*]] = add nsw i32 7, [[MUL18]]
// CHECK2-NEXT: store i32 [[ADD19]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: store i32 [[TMP15]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND20:%.*]]
// CHECK2: for.cond20:
// CHECK2-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD21:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK2-NEXT: [[CMP22:%.*]] = icmp slt i32 4, [[ADD21]]
// CHECK2-NEXT: br i1 [[CMP22]], label [[COND_TRUE23:%.*]], label [[COND_FALSE24:%.*]]
// CHECK2: cond.true23:
// CHECK2-NEXT: br label [[COND_END26:%.*]]
// CHECK2: cond.false24:
// CHECK2-NEXT: [[TMP18:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD25:%.*]] = add nsw i32 [[TMP18]], 5
// CHECK2-NEXT: br label [[COND_END26]]
// CHECK2: cond.end26:
// CHECK2-NEXT: [[COND27:%.*]] = phi i32 [ 4, [[COND_TRUE23]] ], [ [[ADD25]], [[COND_FALSE24]] ]
// CHECK2-NEXT: [[CMP28:%.*]] = icmp slt i32 [[TMP16]], [[COND27]]
// CHECK2-NEXT: br i1 [[CMP28]], label [[FOR_BODY29:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body29:
// CHECK2-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[MUL30:%.*]] = mul nsw i32 [[TMP19]], 3
// CHECK2-NEXT: [[ADD31:%.*]] = add nsw i32 7, [[MUL30]]
// CHECK2-NEXT: store i32 [[ADD31]], i32* [[J]], align 4
// CHECK2-NEXT: [[TMP20:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: [[TMP21:%.*]] = load i32, i32* [[J]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP20]], i32 [[TMP21]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP22:%.*]] = load i32, i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP22]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND20]], !llvm.loop [[LOOP14:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC32:%.*]]
// CHECK2: for.inc32:
// CHECK2-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC33:%.*]] = add nsw i32 [[TMP23]], 1
// CHECK2-NEXT: store i32 [[INC33]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND8]], !llvm.loop [[LOOP15:![0-9]+]]
// CHECK2: for.end34:
// CHECK2-NEXT: br label [[FOR_INC35:%.*]]
// CHECK2: for.inc35:
// CHECK2-NEXT: [[TMP24:%.*]] = load i32, i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: [[ADD36:%.*]] = add nsw i32 [[TMP24]], 5
// CHECK2-NEXT: store i32 [[ADD36]], i32* [[DOTFLOOR_1_IV_J]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP16:![0-9]+]]
// CHECK2: for.end37:
// CHECK2-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK2: omp.body.continue:
// CHECK2-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK2: omp.inner.for.inc:
// CHECK2-NEXT: [[TMP25:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[ADD38:%.*]] = add nsw i32 [[TMP25]], 1
// CHECK2-NEXT: store i32 [[ADD38]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK2: omp.inner.for.end:
// CHECK2-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK2: omp.loop.exit:
// CHECK2-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK2-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3]], i32 [[TMP0]])
// CHECK2-NEXT: ret void
//
//
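// foo5: this appears to collapse the floor loop, the tile loop, and an inner
// j loop whose trip count depends on the tile bounds, so the logical IV is
// widened to i64 and guarded by an omp.precond check before
// __kmpc_for_static_init_8.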
// CHECK2-LABEL: define {{[^@]+}}@foo5
// CHECK2-SAME: () #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[DOTOMP_IV:%.*]] = alloca i64, align 8
// CHECK2-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[_TMP1:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[_TMP2:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_3:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_5:%.*]] = alloca i64, align 8
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[J:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_LB:%.*]] = alloca i64, align 8
// CHECK2-NEXT: [[DOTOMP_UB:%.*]] = alloca i64, align 8
// CHECK2-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i64, align 8
// CHECK2-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I11:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I12:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[J13:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP0:%.*]] = call i32 @__kmpc_global_thread_num(%struct.ident_t* @[[GLOB2]])
// CHECK2-NEXT: store i32 7, i32* [[I]], align 4
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK2-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP2]], 5
// CHECK2-NEXT: [[CMP:%.*]] = icmp slt i32 4, [[ADD]]
// CHECK2-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP]], align 4
// CHECK2-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP3]], 5
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 4, [[COND_TRUE]] ], [ [[ADD4]], [[COND_FALSE]] ]
// CHECK2-NEXT: store i32 [[COND]], i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB:%.*]] = sub i32 [[TMP4]], [[TMP5]]
// CHECK2-NEXT: [[SUB6:%.*]] = sub i32 [[SUB]], 1
// CHECK2-NEXT: [[ADD7:%.*]] = add i32 [[SUB6]], 1
// CHECK2-NEXT: [[DIV:%.*]] = udiv i32 [[ADD7]], 1
// CHECK2-NEXT: [[CONV:%.*]] = zext i32 [[DIV]] to i64
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i64 1, [[CONV]]
// CHECK2-NEXT: [[MUL8:%.*]] = mul nsw i64 [[MUL]], 4
// CHECK2-NEXT: [[SUB9:%.*]] = sub nsw i64 [[MUL8]], 1
// CHECK2-NEXT: store i64 [[SUB9]], i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: store i32 [[TMP6]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: store i32 7, i32* [[J]], align 4
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[CMP10:%.*]] = icmp slt i32 [[TMP7]], [[TMP8]]
// CHECK2-NEXT: br i1 [[CMP10]], label [[OMP_PRECOND_THEN:%.*]], label [[OMP_PRECOND_END:%.*]]
// CHECK2: omp.precond.then:
// CHECK2-NEXT: store i64 0, i64* [[DOTOMP_LB]], align 8
// CHECK2-NEXT: [[TMP9:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK2-NEXT: store i64 [[TMP9]], i64* [[DOTOMP_UB]], align 8
// CHECK2-NEXT: store i64 1, i64* [[DOTOMP_STRIDE]], align 8
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK2-NEXT: call void @__kmpc_for_static_init_8(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]], i32 34, i32* [[DOTOMP_IS_LAST]], i64* [[DOTOMP_LB]], i64* [[DOTOMP_UB]], i64* [[DOTOMP_STRIDE]], i64 1, i64 1)
// CHECK2-NEXT: [[TMP10:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK2-NEXT: [[TMP11:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK2-NEXT: [[CMP14:%.*]] = icmp sgt i64 [[TMP10]], [[TMP11]]
// CHECK2-NEXT: br i1 [[CMP14]], label [[COND_TRUE15:%.*]], label [[COND_FALSE16:%.*]]
// CHECK2: cond.true15:
// CHECK2-NEXT: [[TMP12:%.*]] = load i64, i64* [[DOTCAPTURE_EXPR_5]], align 8
// CHECK2-NEXT: br label [[COND_END17:%.*]]
// CHECK2: cond.false16:
// CHECK2-NEXT: [[TMP13:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK2-NEXT: br label [[COND_END17]]
// CHECK2: cond.end17:
// CHECK2-NEXT: [[COND18:%.*]] = phi i64 [ [[TMP12]], [[COND_TRUE15]] ], [ [[TMP13]], [[COND_FALSE16]] ]
// CHECK2-NEXT: store i64 [[COND18]], i64* [[DOTOMP_UB]], align 8
// CHECK2-NEXT: [[TMP14:%.*]] = load i64, i64* [[DOTOMP_LB]], align 8
// CHECK2-NEXT: store i64 [[TMP14]], i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK2: omp.inner.for.cond:
// CHECK2-NEXT: [[TMP15:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP16:%.*]] = load i64, i64* [[DOTOMP_UB]], align 8
// CHECK2-NEXT: [[CMP19:%.*]] = icmp sle i64 [[TMP15]], [[TMP16]]
// CHECK2-NEXT: br i1 [[CMP19]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK2: omp.inner.for.body:
// CHECK2-NEXT: [[TMP17:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP18:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP19:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB20:%.*]] = sub i32 [[TMP18]], [[TMP19]]
// CHECK2-NEXT: [[SUB21:%.*]] = sub i32 [[SUB20]], 1
// CHECK2-NEXT: [[ADD22:%.*]] = add i32 [[SUB21]], 1
// CHECK2-NEXT: [[DIV23:%.*]] = udiv i32 [[ADD22]], 1
// CHECK2-NEXT: [[MUL24:%.*]] = mul i32 1, [[DIV23]]
// CHECK2-NEXT: [[MUL25:%.*]] = mul i32 [[MUL24]], 4
// CHECK2-NEXT: [[CONV26:%.*]] = zext i32 [[MUL25]] to i64
// CHECK2-NEXT: [[DIV27:%.*]] = sdiv i64 [[TMP17]], [[CONV26]]
// CHECK2-NEXT: [[MUL28:%.*]] = mul nsw i64 [[DIV27]], 5
// CHECK2-NEXT: [[ADD29:%.*]] = add nsw i64 0, [[MUL28]]
// CHECK2-NEXT: [[CONV30:%.*]] = trunc i64 [[ADD29]] to i32
// CHECK2-NEXT: store i32 [[CONV30]], i32* [[DOTFLOOR_0_IV_I11]], align 4
// CHECK2-NEXT: [[TMP20:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[CONV31:%.*]] = sext i32 [[TMP20]] to i64
// CHECK2-NEXT: [[TMP21:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP22:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP23:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP24:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB32:%.*]] = sub i32 [[TMP23]], [[TMP24]]
// CHECK2-NEXT: [[SUB33:%.*]] = sub i32 [[SUB32]], 1
// CHECK2-NEXT: [[ADD34:%.*]] = add i32 [[SUB33]], 1
// CHECK2-NEXT: [[DIV35:%.*]] = udiv i32 [[ADD34]], 1
// CHECK2-NEXT: [[MUL36:%.*]] = mul i32 1, [[DIV35]]
// CHECK2-NEXT: [[MUL37:%.*]] = mul i32 [[MUL36]], 4
// CHECK2-NEXT: [[CONV38:%.*]] = zext i32 [[MUL37]] to i64
// CHECK2-NEXT: [[DIV39:%.*]] = sdiv i64 [[TMP22]], [[CONV38]]
// CHECK2-NEXT: [[TMP25:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP26:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB40:%.*]] = sub i32 [[TMP25]], [[TMP26]]
// CHECK2-NEXT: [[SUB41:%.*]] = sub i32 [[SUB40]], 1
// CHECK2-NEXT: [[ADD42:%.*]] = add i32 [[SUB41]], 1
// CHECK2-NEXT: [[DIV43:%.*]] = udiv i32 [[ADD42]], 1
// CHECK2-NEXT: [[MUL44:%.*]] = mul i32 1, [[DIV43]]
// CHECK2-NEXT: [[MUL45:%.*]] = mul i32 [[MUL44]], 4
// CHECK2-NEXT: [[CONV46:%.*]] = zext i32 [[MUL45]] to i64
// CHECK2-NEXT: [[MUL47:%.*]] = mul nsw i64 [[DIV39]], [[CONV46]]
// CHECK2-NEXT: [[SUB48:%.*]] = sub nsw i64 [[TMP21]], [[MUL47]]
// CHECK2-NEXT: [[DIV49:%.*]] = sdiv i64 [[SUB48]], 4
// CHECK2-NEXT: [[MUL50:%.*]] = mul nsw i64 [[DIV49]], 1
// CHECK2-NEXT: [[ADD51:%.*]] = add nsw i64 [[CONV31]], [[MUL50]]
// CHECK2-NEXT: [[CONV52:%.*]] = trunc i64 [[ADD51]] to i32
// CHECK2-NEXT: store i32 [[CONV52]], i32* [[DOTTILE_0_IV_I12]], align 4
// CHECK2-NEXT: [[TMP27:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP28:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP29:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP30:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB53:%.*]] = sub i32 [[TMP29]], [[TMP30]]
// CHECK2-NEXT: [[SUB54:%.*]] = sub i32 [[SUB53]], 1
// CHECK2-NEXT: [[ADD55:%.*]] = add i32 [[SUB54]], 1
// CHECK2-NEXT: [[DIV56:%.*]] = udiv i32 [[ADD55]], 1
// CHECK2-NEXT: [[MUL57:%.*]] = mul i32 1, [[DIV56]]
// CHECK2-NEXT: [[MUL58:%.*]] = mul i32 [[MUL57]], 4
// CHECK2-NEXT: [[CONV59:%.*]] = zext i32 [[MUL58]] to i64
// CHECK2-NEXT: [[DIV60:%.*]] = sdiv i64 [[TMP28]], [[CONV59]]
// CHECK2-NEXT: [[TMP31:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP32:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB61:%.*]] = sub i32 [[TMP31]], [[TMP32]]
// CHECK2-NEXT: [[SUB62:%.*]] = sub i32 [[SUB61]], 1
// CHECK2-NEXT: [[ADD63:%.*]] = add i32 [[SUB62]], 1
// CHECK2-NEXT: [[DIV64:%.*]] = udiv i32 [[ADD63]], 1
// CHECK2-NEXT: [[MUL65:%.*]] = mul i32 1, [[DIV64]]
// CHECK2-NEXT: [[MUL66:%.*]] = mul i32 [[MUL65]], 4
// CHECK2-NEXT: [[CONV67:%.*]] = zext i32 [[MUL66]] to i64
// CHECK2-NEXT: [[MUL68:%.*]] = mul nsw i64 [[DIV60]], [[CONV67]]
// CHECK2-NEXT: [[SUB69:%.*]] = sub nsw i64 [[TMP27]], [[MUL68]]
// CHECK2-NEXT: [[TMP33:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP34:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[TMP35:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP36:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB70:%.*]] = sub i32 [[TMP35]], [[TMP36]]
// CHECK2-NEXT: [[SUB71:%.*]] = sub i32 [[SUB70]], 1
// CHECK2-NEXT: [[ADD72:%.*]] = add i32 [[SUB71]], 1
// CHECK2-NEXT: [[DIV73:%.*]] = udiv i32 [[ADD72]], 1
// CHECK2-NEXT: [[MUL74:%.*]] = mul i32 1, [[DIV73]]
// CHECK2-NEXT: [[MUL75:%.*]] = mul i32 [[MUL74]], 4
// CHECK2-NEXT: [[CONV76:%.*]] = zext i32 [[MUL75]] to i64
// CHECK2-NEXT: [[DIV77:%.*]] = sdiv i64 [[TMP34]], [[CONV76]]
// CHECK2-NEXT: [[TMP37:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_3]], align 4
// CHECK2-NEXT: [[TMP38:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB78:%.*]] = sub i32 [[TMP37]], [[TMP38]]
// CHECK2-NEXT: [[SUB79:%.*]] = sub i32 [[SUB78]], 1
// CHECK2-NEXT: [[ADD80:%.*]] = add i32 [[SUB79]], 1
// CHECK2-NEXT: [[DIV81:%.*]] = udiv i32 [[ADD80]], 1
// CHECK2-NEXT: [[MUL82:%.*]] = mul i32 1, [[DIV81]]
// CHECK2-NEXT: [[MUL83:%.*]] = mul i32 [[MUL82]], 4
// CHECK2-NEXT: [[CONV84:%.*]] = zext i32 [[MUL83]] to i64
// CHECK2-NEXT: [[MUL85:%.*]] = mul nsw i64 [[DIV77]], [[CONV84]]
// CHECK2-NEXT: [[SUB86:%.*]] = sub nsw i64 [[TMP33]], [[MUL85]]
// CHECK2-NEXT: [[DIV87:%.*]] = sdiv i64 [[SUB86]], 4
// CHECK2-NEXT: [[MUL88:%.*]] = mul nsw i64 [[DIV87]], 4
// CHECK2-NEXT: [[SUB89:%.*]] = sub nsw i64 [[SUB69]], [[MUL88]]
// CHECK2-NEXT: [[MUL90:%.*]] = mul nsw i64 [[SUB89]], 3
// CHECK2-NEXT: [[ADD91:%.*]] = add nsw i64 7, [[MUL90]]
// CHECK2-NEXT: [[CONV92:%.*]] = trunc i64 [[ADD91]] to i32
// CHECK2-NEXT: store i32 [[CONV92]], i32* [[J13]], align 4
// CHECK2-NEXT: [[TMP39:%.*]] = load i32, i32* [[DOTTILE_0_IV_I12]], align 4
// CHECK2-NEXT: [[MUL93:%.*]] = mul nsw i32 [[TMP39]], 3
// CHECK2-NEXT: [[ADD94:%.*]] = add nsw i32 7, [[MUL93]]
// CHECK2-NEXT: store i32 [[ADD94]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP40:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: [[TMP41:%.*]] = load i32, i32* [[J13]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP40]], i32 [[TMP41]])
// CHECK2-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK2: omp.body.continue:
// CHECK2-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK2: omp.inner.for.inc:
// CHECK2-NEXT: [[TMP42:%.*]] = load i64, i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: [[ADD95:%.*]] = add nsw i64 [[TMP42]], 1
// CHECK2-NEXT: store i64 [[ADD95]], i64* [[DOTOMP_IV]], align 8
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK2: omp.inner.for.end:
// CHECK2-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK2: omp.loop.exit:
// CHECK2-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP0]])
// CHECK2-NEXT: br label [[OMP_PRECOND_END]]
// CHECK2: omp.precond.end:
// CHECK2-NEXT: call void @__kmpc_barrier(%struct.ident_t* @[[GLOB3]], i32 [[TMP0]])
// CHECK2-NEXT: ret void
//
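// The block above exercises a worksharing loop over an already-tiled nest:
// '#pragma omp tile sizes(5)' rewrites the outer loop into a floor loop
// ('.floor.0.iv.i') plus an intra-tile loop ('.tile.0.iv.i'), and an
// enclosing '#pragma omp for collapse(3)' linearizes floor, tile, and 'j'
// into the single i64 '.omp.iv'. The tile loop's lower bound is the current
// floor value (captured in '.capture_expr.') and its upper bound is clamped
// to 'min(4, floor + 5)', which is why the per-dimension counts are
// recomputed on every logical iteration. A hedged, commented-out sketch of
// the kind of source that produces this shape (the test's actual definition
// appears earlier in this file; the name 'foo5' and the loop bounds are
// inferred from the IR, where 'i = 7 + 3*tile' and 'j = 7 + 3*idx' each
// cover 4 iterations):
//
//   void foo5() {
//   #pragma omp for collapse(3)
//   #pragma omp tile sizes(5)
//     for (int i = 7; i < 17; i += 3)   // 7, 10, 13, 16
//       for (int j = 7; j < 17; j += 3) // 7, 10, 13, 16
//         body(i, j);
//   }
//
// The closing '__kmpc_barrier' is the implicit barrier of an 'omp for'
// without a 'nowait' clause.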
//
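// foo6 below wraps the same kind of tiled loop in '#pragma omp parallel for':
// the wrapper only forks '.omp_outlined.', and the outlined function performs
// the static worksharing. In the outlined body the floor loop has a single
// logical iteration (UB = 0), the tile loop runs from the floor value to
// 'min(4, floor + 5)', and the user variable is recovered as 'i = 7 + 3*tile'.
// A hedged sketch of the presumed source (the actual definition is earlier in
// this file; bounds are inferred from the IR):
//
//   void foo6() {
//   #pragma omp parallel for
//   #pragma omp tile sizes(5)
//     for (int i = 7; i < 17; i += 3)
//       body(i);
//   }
//
// Worked example for the last iteration: tile IV 3 yields i = 7 + 3*3 = 16,
// i.e. the final call is 'body(16)'.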
// CHECK2-LABEL: define {{[^@]+}}@foo6
// CHECK2-SAME: () #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: call void (%struct.ident_t*, i32, void (i32*, i32*, ...)*, ...) @__kmpc_fork_call(%struct.ident_t* @[[GLOB2]], i32 0, void (i32*, i32*, ...)* bitcast (void (i32*, i32*)* @.omp_outlined. to void (i32*, i32*, ...)*))
// CHECK2-NEXT: ret void
//
//
// CHECK2-LABEL: define {{[^@]+}}@.omp_outlined.
// CHECK2-SAME: (i32* noalias [[DOTGLOBAL_TID_:%.*]], i32* noalias [[DOTBOUND_TID_:%.*]]) #[[ATTR5:[0-9]+]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[DOTGLOBAL_TID__ADDR:%.*]] = alloca i32*, align 8
// CHECK2-NEXT: [[DOTBOUND_TID__ADDR:%.*]] = alloca i32*, align 8
// CHECK2-NEXT: [[DOTOMP_IV:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[TMP:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_LB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_UB:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_STRIDE:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTOMP_IS_LAST:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: store i32* [[DOTGLOBAL_TID_]], i32** [[DOTGLOBAL_TID__ADDR]], align 8
// CHECK2-NEXT: store i32* [[DOTBOUND_TID_]], i32** [[DOTBOUND_TID__ADDR]], align 8
// CHECK2-NEXT: store i32 7, i32* [[I]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: store i32 1, i32* [[DOTOMP_STRIDE]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTOMP_IS_LAST]], align 4
// CHECK2-NEXT: [[TMP0:%.*]] = load i32*, i32** [[DOTGLOBAL_TID__ADDR]], align 8
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[TMP0]], align 4
// CHECK2-NEXT: call void @__kmpc_for_static_init_4(%struct.ident_t* @[[GLOB1]], i32 [[TMP1]], i32 34, i32* [[DOTOMP_IS_LAST]], i32* [[DOTOMP_LB]], i32* [[DOTOMP_UB]], i32* [[DOTOMP_STRIDE]], i32 1, i32 1)
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP:%.*]] = icmp sgt i32 [[TMP2]], 0
// CHECK2-NEXT: br i1 [[CMP]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ 0, [[COND_TRUE]] ], [ [[TMP3]], [[COND_FALSE]] ]
// CHECK2-NEXT: store i32 [[COND]], i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTOMP_LB]], align 4
// CHECK2-NEXT: store i32 [[TMP4]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND:%.*]]
// CHECK2: omp.inner.for.cond:
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTOMP_UB]], align 4
// CHECK2-NEXT: [[CMP1:%.*]] = icmp sle i32 [[TMP5]], [[TMP6]]
// CHECK2-NEXT: br i1 [[CMP1]], label [[OMP_INNER_FOR_BODY:%.*]], label [[OMP_INNER_FOR_END:%.*]]
// CHECK2: omp.inner.for.body:
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP7]], 5
// CHECK2-NEXT: [[ADD:%.*]] = add nsw i32 0, [[MUL]]
// CHECK2-NEXT: store i32 [[ADD]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP8]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD2:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK2-NEXT: [[CMP3:%.*]] = icmp slt i32 4, [[ADD2]]
// CHECK2-NEXT: br i1 [[CMP3]], label [[COND_TRUE4:%.*]], label [[COND_FALSE5:%.*]]
// CHECK2: cond.true4:
// CHECK2-NEXT: br label [[COND_END7:%.*]]
// CHECK2: cond.false5:
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD6:%.*]] = add nsw i32 [[TMP11]], 5
// CHECK2-NEXT: br label [[COND_END7]]
// CHECK2: cond.end7:
// CHECK2-NEXT: [[COND8:%.*]] = phi i32 [ 4, [[COND_TRUE4]] ], [ [[ADD6]], [[COND_FALSE5]] ]
// CHECK2-NEXT: [[CMP9:%.*]] = icmp slt i32 [[TMP9]], [[COND8]]
// CHECK2-NEXT: br i1 [[CMP9]], label [[FOR_BODY:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL10:%.*]] = mul nsw i32 [[TMP12]], 3
// CHECK2-NEXT: [[ADD11:%.*]] = add nsw i32 7, [[MUL10]]
// CHECK2-NEXT: store i32 [[ADD11]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP13]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP14]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP17:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[OMP_BODY_CONTINUE:%.*]]
// CHECK2: omp.body.continue:
// CHECK2-NEXT: br label [[OMP_INNER_FOR_INC:%.*]]
// CHECK2: omp.inner.for.inc:
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: [[ADD12:%.*]] = add nsw i32 [[TMP15]], 1
// CHECK2-NEXT: store i32 [[ADD12]], i32* [[DOTOMP_IV]], align 4
// CHECK2-NEXT: br label [[OMP_INNER_FOR_COND]]
// CHECK2: omp.inner.for.end:
// CHECK2-NEXT: br label [[OMP_LOOP_EXIT:%.*]]
// CHECK2: omp.loop.exit:
// CHECK2-NEXT: call void @__kmpc_for_static_fini(%struct.ident_t* @[[GLOB1]], i32 [[TMP1]])
// CHECK2-NEXT: ret void
//
//
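// tfoo7 and the comdat definition below cover the dependent case: the step
// and tile size are non-type template parameters, so the tile size is only
// known at instantiation. '_Z4foo7IiLi3ELi5EEvT_S0_' demangles to
// 'void foo7<int, 3, 5>(int, int)', called with start = 0 and end = 42, and
// the emitted trip count is '((end - start) - 1 + 3) / 3'. A hedged sketch
// consistent with that mangling (the actual template is defined earlier in
// this file):
//
//   template <typename T, T Step, T TileSize>
//   void foo7(T start, T end) {
//   #pragma omp tile sizes(TileSize)
//     for (T i = start; i < end; i += Step)
//       body(i);
//   }
//
//   void tfoo7() { foo7<int, 3, 5>(0, 42); }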
// CHECK2-LABEL: define {{[^@]+}}@tfoo7
// CHECK2-SAME: () #[[ATTR2]] {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: call void @_Z4foo7IiLi3ELi5EEvT_S0_(i32 0, i32 42)
// CHECK2-NEXT: ret void
//
//
// CHECK2-LABEL: define {{[^@]+}}@_Z4foo7IiLi3ELi5EEvT_S0_
// CHECK2-SAME: (i32 [[START:%.*]], i32 [[END:%.*]]) #[[ATTR2]] comdat {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: [[START_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[END_ADDR:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_1:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTCAPTURE_EXPR_2:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTFLOOR_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: [[DOTTILE_0_IV_I:%.*]] = alloca i32, align 4
// CHECK2-NEXT: store i32 [[START]], i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[END]], i32* [[END_ADDR]], align 4
// CHECK2-NEXT: [[TMP0:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP0]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP1:%.*]] = load i32, i32* [[START_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP1]], i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP2:%.*]] = load i32, i32* [[END_ADDR]], align 4
// CHECK2-NEXT: store i32 [[TMP2]], i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK2-NEXT: [[TMP3:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_1]], align 4
// CHECK2-NEXT: [[TMP4:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[SUB:%.*]] = sub i32 [[TMP3]], [[TMP4]]
// CHECK2-NEXT: [[SUB3:%.*]] = sub i32 [[SUB]], 1
// CHECK2-NEXT: [[ADD:%.*]] = add i32 [[SUB3]], 3
// CHECK2-NEXT: [[DIV:%.*]] = udiv i32 [[ADD]], 3
// CHECK2-NEXT: [[SUB4:%.*]] = sub i32 [[DIV]], 1
// CHECK2-NEXT: store i32 [[SUB4]], i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: store i32 0, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND:%.*]]
// CHECK2: for.cond:
// CHECK2-NEXT: [[TMP5:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP6:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[ADD5:%.*]] = add i32 [[TMP6]], 1
// CHECK2-NEXT: [[CMP:%.*]] = icmp ult i32 [[TMP5]], [[ADD5]]
// CHECK2-NEXT: br i1 [[CMP]], label [[FOR_BODY:%.*]], label [[FOR_END17:%.*]]
// CHECK2: for.body:
// CHECK2-NEXT: [[TMP7:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: store i32 [[TMP7]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND6:%.*]]
// CHECK2: for.cond6:
// CHECK2-NEXT: [[TMP8:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[TMP9:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[ADD7:%.*]] = add i32 [[TMP9]], 1
// CHECK2-NEXT: [[TMP10:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD8:%.*]] = add nsw i32 [[TMP10]], 5
// CHECK2-NEXT: [[CMP9:%.*]] = icmp ult i32 [[ADD7]], [[ADD8]]
// CHECK2-NEXT: br i1 [[CMP9]], label [[COND_TRUE:%.*]], label [[COND_FALSE:%.*]]
// CHECK2: cond.true:
// CHECK2-NEXT: [[TMP11:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_2]], align 4
// CHECK2-NEXT: [[ADD10:%.*]] = add i32 [[TMP11]], 1
// CHECK2-NEXT: br label [[COND_END:%.*]]
// CHECK2: cond.false:
// CHECK2-NEXT: [[TMP12:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD11:%.*]] = add nsw i32 [[TMP12]], 5
// CHECK2-NEXT: br label [[COND_END]]
// CHECK2: cond.end:
// CHECK2-NEXT: [[COND:%.*]] = phi i32 [ [[ADD10]], [[COND_TRUE]] ], [ [[ADD11]], [[COND_FALSE]] ]
// CHECK2-NEXT: [[CMP12:%.*]] = icmp ult i32 [[TMP8]], [[COND]]
// CHECK2-NEXT: br i1 [[CMP12]], label [[FOR_BODY13:%.*]], label [[FOR_END:%.*]]
// CHECK2: for.body13:
// CHECK2-NEXT: [[TMP13:%.*]] = load i32, i32* [[DOTCAPTURE_EXPR_]], align 4
// CHECK2-NEXT: [[TMP14:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[MUL:%.*]] = mul i32 [[TMP14]], 3
// CHECK2-NEXT: [[ADD14:%.*]] = add i32 [[TMP13]], [[MUL]]
// CHECK2-NEXT: store i32 [[ADD14]], i32* [[I]], align 4
// CHECK2-NEXT: [[TMP15:%.*]] = load i32, i32* [[I]], align 4
// CHECK2-NEXT: call void (...) @body(i32 [[TMP15]])
// CHECK2-NEXT: br label [[FOR_INC:%.*]]
// CHECK2: for.inc:
// CHECK2-NEXT: [[TMP16:%.*]] = load i32, i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: [[INC:%.*]] = add nsw i32 [[TMP16]], 1
// CHECK2-NEXT: store i32 [[INC]], i32* [[DOTTILE_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND6]], !llvm.loop [[LOOP20:![0-9]+]]
// CHECK2: for.end:
// CHECK2-NEXT: br label [[FOR_INC15:%.*]]
// CHECK2: for.inc15:
// CHECK2-NEXT: [[TMP17:%.*]] = load i32, i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: [[ADD16:%.*]] = add nsw i32 [[TMP17]], 5
// CHECK2-NEXT: store i32 [[ADD16]], i32* [[DOTFLOOR_0_IV_I]], align 4
// CHECK2-NEXT: br label [[FOR_COND]], !llvm.loop [[LOOP21:![0-9]+]]
// CHECK2: for.end17:
// CHECK2-NEXT: ret void
//
//
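// _GLOBAL__sub_I_tile_codegen.cpp below is the translation unit's static
// initialization stub: it only dispatches to '__cxx_global_var_init', which
// runs the dynamic initializer of a global object defined earlier in this
// test, plausibly one whose constructor itself contains a tiled loop so that
// tiling inside a constructor is covered. A hedged sketch of that assumed
// pattern (hypothetical; the real definition is earlier in this file):
//
//   struct S {
//     int i;
//     S() {
//   #pragma omp tile sizes(5)
//       for (i = 7; i < 17; i += 3)
//         body(i);
//     }
//   } s;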
// CHECK2-LABEL: define {{[^@]+}}@_GLOBAL__sub_I_tile_codegen.cpp
// CHECK2-SAME: () #[[ATTR0]] section ".text.startup" {
// CHECK2-NEXT: entry:
// CHECK2-NEXT: call void @__cxx_global_var_init()
// CHECK2-NEXT: ret void
//