forked from mindspore-Ecosystem/mindspore
!34671 Fix Ge Interface for AllReduce fusion
Merge pull request !34671 from huangxinjing/fix_ge_error
commit 9cdb283781
@@ -3,7 +3,7 @@
 A multilayer perceptron with two linear layers, with Dropout applied to the final output. The first linear layer projects the input dimension from hidden_size to ffn_hidden_size and applies the activation layer in between. The second linear layer projects the dimension from ffn_hidden_size back to hidden_size. After parallel_config is configured,
 the weight of the first linear layer is sharded along the input dimension, and the second linear layer is sharded along the output dimension. The overall process is as follows:

-.. math:
+.. math::

     Dropout((xW_1+b_1)W_2 + b_2)

 where :math:`W_1, W_2, b_1` and :math:`b_2` are trainable parameters.
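For illustration only, a minimal NumPy sketch of the feed-forward computation described in the docstring above; the ReLU activation, the dropout keep probability and all argument names are assumptions for the example, not values taken from this diff.

import numpy as np

def feed_forward(x, w1, b1, w2, b2, keep_prob=0.9):
    # x:  [batch * seq_length, hidden_size]
    # w1: [hidden_size, ffn_hidden_size], b1: [ffn_hidden_size]
    # w2: [ffn_hidden_size, hidden_size], b2: [hidden_size]
    hidden = np.maximum(x @ w1 + b1, 0.0)          # first projection, ReLU assumed as the activation
    out = hidden @ w2 + b2                          # second projection back to hidden_size
    mask = np.random.rand(*out.shape) < keep_prob   # inverted dropout, training only
    return out * mask / keep_prob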
@@ -5,7 +5,7 @@
 .. math::

     MultiHeadAttention(query, key, vector) = Dropout(Concat(head_1, \dots, head_h)W^O)

-where `head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)`. Note that the projection in the output layer has a bias parameter.
+where :math:`head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)`. Note that the projection in the output layer has a bias parameter.

 If the query tensor, key tensor and value tensor are the same, the above is the computation of self-attention.
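As a hedged illustration of the formula above, a small single-batch NumPy sketch of multi-head attention; masking, dropout and the output-layer bias are omitted, and the helper names are assumptions rather than the library's API.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, wq, wk, wv, wo, num_heads):
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V); output = Concat(head_1, ..., head_h) W^O
    seq_q, hidden = q.shape
    size_per_head = hidden // num_heads

    def project(x, w):
        # project, then reshape to [num_heads, seq, size_per_head]
        return (x @ w).reshape(-1, num_heads, size_per_head).transpose(1, 0, 2)

    qh, kh, vh = project(q, wq), project(k, wk), project(v, wv)
    scores = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(size_per_head))  # [heads, seq_q, seq_k]
    heads = scores @ vh                                                    # [heads, seq_q, size_per_head]
    concat = heads.transpose(1, 0, 2).reshape(seq_q, hidden)               # Concat(head_1, ..., head_h)
    return concat @ wo                                                     # output projection (bias omitted here)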
@@ -44,5 +44,5 @@
 - **output** (Tensor) - If only an encoder is used, this is the output logit of the encoder layer, with shape [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size]. If both an encoder and a decoder are used, the output comes from the decoder layer, with shape [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].
 - **encoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head).
- - **decoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self attention, with shape (batch_size, num_heads, size_per_head, tgt_seq_length) or (batch_size, num_heads, tgt_seq_length, size_per_head), or a tuple of the tensors of the projected key and value vectors in cross-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.
+ - **decoder_layer_present** (Tuple) - A tuple with num_layers elements, where each element is a tuple of the tensors of the projected key and value vectors in self-attention, with shape (batch_size, num_heads, size_per_head, tgt_seq_length) or (batch_size, num_heads, tgt_seq_length, size_per_head), or a tuple of the tensors of the projected key and value vectors in cross-attention, with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.
 - **accum_loss** (Tensor) - An auxiliary loss that minimizes the mean square of the fraction of data routed to each expert; it is returned only when the number of experts is greater than 1.
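A hedged sketch of the accum_loss idea described above, read literally as a penalty on the fraction of tokens routed to each expert; this is not the repository's actual loss, and the function and argument names are made up for the example.

import numpy as np

def aux_balance_loss(expert_index, num_experts):
    # expert_index: [num_tokens] expert chosen for each token by the router.
    # Fraction of tokens dispatched to each expert.
    dispatch_frac = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    # Mean square of those fractions; perfectly even routing (1 / num_experts each) minimizes it.
    return float(np.mean(dispatch_frac ** 2))

# Example: 8 tokens over 4 experts, slightly unbalanced routing.
print(aux_balance_loss(np.array([0, 0, 0, 1, 1, 2, 2, 3]), num_experts=4))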
@@ -34,7 +34,6 @@ def _init_allreduce_operators(length, split_indices, group=GlobalComm.WORLD_COMM
     if indices >= length:
         logger.warning(f"AllReduce's split index {indices} is greater than or equal to "
                        f"the total gradient's number of {length}")

     fusion_type = 2 ** 10
     split = 0
     fusion = ()
@@ -50,7 +49,11 @@ def _init_allreduce_operators(length, split_indices, group=GlobalComm.WORLD_COMM
     op_list = ()
     for i in range(length):
         op = AllReduce('sum', group)
-        op.add_prim_attr('fusion', fusion[i])
+        op_fusion_id = fusion[i]
+        # When running in ge and enabled all_reduce_fusion_config, hccl will check the allreduce's fusion id to be -1
+        if context.get_context("enable_ge") and context.get_auto_parallel_context("all_reduce_fusion_config"):
+            op_fusion_id = -1
+        op.add_prim_attr('fusion', op_fusion_id)
         op.add_prim_attr('index', index[i])
         op_list = op_list + (op,)
     return op_list
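For context on the change above, a hypothetical helper that mirrors the mapping this code relies on: each gradient receives a fusion id, gradients sharing an id are fused into one AllReduce, and, as the new branch shows, the id is forced to -1 when GE is enabled together with all_reduce_fusion_config. This is only a sketch of the idea under those assumptions, not the repository's implementation.

def fusion_ids(length, split_indices, ge_with_fusion_config=False):
    # split_indices: ascending gradient indices where a new fusion group starts.
    ids, group = [], 1
    bounds = list(split_indices) + [length]
    for i in range(length):
        if i >= bounds[group - 1]:
            group += 1                     # start the next fusion group
        ids.append(-1 if ge_with_fusion_config else group)
    return ids

# 6 gradients split at indices [2, 4] -> fusion ids [1, 1, 2, 2, 3, 3]
print(fusion_ids(6, [2, 4]))
# Under GE with all_reduce_fusion_config, every id is forced to -1
print(fusion_ids(6, [2, 4], ge_with_fusion_config=True))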