forked from mindspore-Ecosystem/mindspore
!34671 Fix Ge Interface for AllReduce fusion
Merge pull request !34671 from huangxinjing/fix_ge_error
commit 9cdb283781
@@ -3,7 +3,7 @@
 A multilayer perceptron with two linear layers, with Dropout applied to the final output. The first linear layer projects the input dimension from hidden_size to ffn_hidden_size and applies the activation layer in between. The second linear layer projects the dimension from ffn_hidden_size back to hidden_size. After parallel_config is configured,
 the weight of the first linear layer is sharded along the input dimension, and the second linear layer is sharded along the output dimension. The overall process is as follows:

-.. math:
+.. math::

     Dropout((xW_1+b_1)W_2 + b_2)

 where :math:`W_1, W_2, b_1` and :math:`b_2` are trainable parameters.
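For illustration only, a minimal NumPy sketch of the feed-forward computation described in the docstring above; the ReLU activation, the dropout keep probability and all argument names are assumptions for the example, not values taken from this diff.

import numpy as np

def feed_forward(x, w1, b1, w2, b2, keep_prob=0.9):
    # x:  [batch * seq_length, hidden_size]
    # w1: [hidden_size, ffn_hidden_size], b1: [ffn_hidden_size]
    # w2: [ffn_hidden_size, hidden_size], b2: [hidden_size]
    hidden = np.maximum(x @ w1 + b1, 0.0)          # first projection, ReLU assumed as the activation
    out = hidden @ w2 + b2                          # second projection back to hidden_size
    mask = np.random.rand(*out.shape) < keep_prob   # inverted dropout, training only
    return out * mask / keep_prob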
@@ -5,7 +5,7 @@
 .. math::

     MultiHeadAttention(query, key, vector) = Dropout(Concat(head_1, \dots, head_h)W^O)

-where `head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)`. Note that the projection in the output layer has a bias parameter.
+where :math:`head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)`. Note that the projection in the output layer has a bias parameter.

 If the query tensor, key tensor and value tensor are the same, the above is the computation of self-attention.
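As a hedged illustration of the formula above, a small single-batch NumPy sketch of multi-head attention; masking, dropout and the output-layer bias are omitted, and the helper names are assumptions rather than the library's API.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, wq, wk, wv, wo, num_heads):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); output = Concat(head_1, ..., head_h) W^O
    seq_q, hidden = q.shape
    size_per_head = hidden // num_heads

    def project(x, w):
        # project, then reshape to [num_heads, seq, size_per_head]
        return (x @ w).reshape(-1, num_heads, size_per_head).transpose(1, 0, 2)

    qh, kh, vh = project(q, wq), project(k, wk), project(v, wv)
    scores = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(size_per_head))  # [heads, seq_q, seq_k]
    heads = scores @ vh                                                    # [heads, seq_q, size_per_head]
    concat = heads.transpose(1, 0, 2).reshape(seq_q, hidden)               # Concat(head_1, ..., head_h)
    return concat @ wo                                                     # output projection (bias omitted here)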
@@ -44,5 +44,5 @@
 - **output** (Tensor) - If only an encoder is used, this is the output logit of the encoder layer, with shape [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size]. If both an encoder and a decoder are used, the output comes from the decoder layer, with shape [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].
 - **encoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head).
- - **decoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self attention, with shape (batch_size, num_heads, size_per_head, tgt_seq_length) or (batch_size, num_heads, tgt_seq_length, size_per_head), or a tuple of the tensors of the projected key and value vectors in cross-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.
+ - **decoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self-attention, with shape (batch_size, num_heads, size_per_head, tgt_seq_length) or (batch_size, num_heads, tgt_seq_length, size_per_head), or a tuple of the tensors of the projected key and value vectors in cross-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.
 - **accum_loss** (Tensor) - An auxiliary loss that minimizes the mean square of the fraction of data routed to each expert; it is returned only when the number of experts is greater than 1.
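A hedged sketch of the accum_loss idea described above, read literally as a penalty on the fraction of tokens routed to each expert; this is not the repository's actual loss, and the function and argument names are made up for the example.

import numpy as np

def aux_balance_loss(expert_index, num_experts):
    # expert_index: [num_tokens] expert chosen for each token by the router.
    # Fraction of tokens dispatched to each expert.
    dispatch_frac = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    # Mean square of those fractions; perfectly even routing (1 / num_experts each) minimizes it.
    return float(np.mean(dispatch_frac ** 2))

# Example: 8 tokens over 4 experts, slightly unbalanced routing.
print(aux_balance_loss(np.array([0, 0, 0, 1, 1, 2, 2, 3]), num_experts=4))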
@@ -34,7 +34,6 @@ def _init_allreduce_operators(length, split_indices, group=GlobalComm.WORLD_COMM
     if indices >= length:
         logger.warning(f"AllReduce's split index {indices} is greater than or equal to "
                        f"the total gradient's number of {length}")

     fusion_type = 2 ** 10
     split = 0
     fusion = ()
@@ -50,7 +49,11 @@ def _init_allreduce_operators(length, split_indices, group=GlobalComm.WORLD_COMM
     op_list = ()
     for i in range(length):
         op = AllReduce('sum', group)
-        op.add_prim_attr('fusion', fusion[i])
+        op_fusion_id = fusion[i]
+        # When running in ge and enabled all_reduce_fusion_config, hccl will check the allreduce's fusion id to be -1
+        if context.get_context("enable_ge") and context.get_auto_parallel_context("all_reduce_fusion_config"):
+            op_fusion_id = -1
+        op.add_prim_attr('fusion', op_fusion_id)
         op.add_prim_attr('index', index[i])
         op_list = op_list + (op,)
     return op_list
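For context on the change above, a hypothetical helper that mirrors the mapping this code relies on: each gradient receives a fusion id, gradients sharing an id are fused into one AllReduce, and, as the new branch shows, the id is forced to -1 when GE is enabled together with all_reduce_fusion_config. This is only a sketch of the idea under those assumptions, not the repository's implementation.

def fusion_ids(length, split_indices, ge_with_fusion_config=False):
    # split_indices: ascending gradient indices where a new fusion group starts.
    ids, group = [], 1
    bounds = list(split_indices) + [length]
    for i in range(length):
        if i >= bounds[group - 1]:
            group += 1                     # start the next fusion group
        ids.append(-1 if ge_with_fusion_config else group)
    return ids

# 6 gradients split at indices [2, 4] -> fusion ids [1, 1, 2, 2, 3, 3]
print(fusion_ids(6, [2, 4]))
# Under GE with all_reduce_fusion_config, every id is forced to -1
print(fusion_ids(6, [2, 4], ge_with_fusion_config=True))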