diff --git a/docs/api/api_python/transformer/mindspore.nn.transformer.Transformer.rst b/docs/api/api_python/transformer/mindspore.nn.transformer.Transformer.rst
index 842a9aa9611..ab5d1327863 100644
--- a/docs/api/api_python/transformer/mindspore.nn.transformer.Transformer.rst
+++ b/docs/api/api_python/transformer/mindspore.nn.transformer.Transformer.rst
@@ -40,8 +40,9 @@
 
     **Outputs:**
 
-    Tuple, a tuple containing (`output`, `encoder_layer_present`, `encoder_layer_present`).
+    Tuple, a tuple containing (`output`, `encoder_layer_present`, `decoder_layer_present`, `accum_loss`).
 
     - **output** (Tensor) - If there is only an encoder, the output logit of the encoder layer. The shape is [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size]. If there are both an encoder and a decoder, the output comes from the decoder layer. The shape is [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size].
     - **encoder_layer_present** (Tuple) - A tuple of size num_layers, where each element is a tuple of tensors of the projected key and value vectors in self-attention with shape ((batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head)).
     - **decoder_layer_present** (Tuple) - A tuple of size num_layers, where each element is a tuple of tensors of the projected key and value vectors in self-attention with shape ((batch_size, num_heads, size_per_head, tgt_seq_length) or (batch_size, num_heads, tgt_seq_length, size_per_head)), or a tuple of tensors of the projected key and value vectors in cross-attention with shape (batch_size, num_heads, size_per_head, src_seq_length) or (batch_size, num_heads, src_seq_length, size_per_head). If the decoder is not set, the returned value will be None.
+    - **accum_loss** (Tensor) - An auxiliary loss that minimizes the mean square of the fraction of data routed to each expert; it is only returned when the number of experts is greater than 1.
diff --git a/mindspore/python/mindspore/nn/transformer/transformer.py b/mindspore/python/mindspore/nn/transformer/transformer.py
index c905ef003a6..efc46687c90 100644
--- a/mindspore/python/mindspore/nn/transformer/transformer.py
+++ b/mindspore/python/mindspore/nn/transformer/transformer.py
@@ -2335,7 +2335,7 @@ class Transformer(Cell):
                 Used for incremental prediction when the use_past is True. Default None.
 
         Outputs:
-            Tuple, a tuple contains(`output`, `encoder_layer_present`, `decoder_layer_present`)
+            Tuple, a tuple contains(`output`, `encoder_layer_present`, `decoder_layer_present`, `accum_loss`)
 
             - **output** (Tensor) - If there is only encoder, the output logit of the encoder layer. The shape is
               [batch, src_seq_length, hidden_size] or [batch * src_seq_length, hidden_size], if there are encoder and
@@ -2351,6 +2351,8 @@ class Transformer(Cell):
               (batch_size, num_heads, size_per_head, src_seq_length),
               (batch_size, num_heads, src_seq_length, size_per_head)). If the decoder is not set, the returned value
               will be None.
+            - **accum_loss** (Tensor) - A Tensor indicating an auxiliary loss that minimizes the mean square of
+              the data fraction routed to each expert; it is only returned when the number of experts is greater than 1.
 
         Supported Platforms:
             ``Ascend`` ``GPU``
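
For reference, a minimal sketch of how a caller would consume the extended return tuple. This is untested: the hyperparameters and tensor shapes are illustrative only, and it assumes the 1.x `mindspore.nn.transformer` API where `MoEConfig` is exported alongside `Transformer`:

```python
import numpy as np
from mindspore import Tensor
import mindspore.common.dtype as mstype
from mindspore.nn.transformer import Transformer, MoEConfig

# With expert_num > 1, the forward pass also returns `accum_loss`,
# the auxiliary load-balancing loss described in the docstring above.
model = Transformer(batch_size=2,
                    hidden_size=64,
                    ffn_hidden_size=64,
                    src_seq_length=20,
                    tgt_seq_length=10,
                    encoder_layers=1,
                    decoder_layers=1,
                    num_heads=2,
                    moe_config=MoEConfig(expert_num=4))

encoder_input = Tensor(np.ones((2, 20, 64)), mstype.float32)
encoder_mask = Tensor(np.ones((2, 20, 20)), mstype.float16)
decoder_input = Tensor(np.ones((2, 10, 64)), mstype.float32)
decoder_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)

# Four values instead of three once the MoE branch is active.
output, encoder_present, decoder_present, accum_loss = model(
    encoder_input, encoder_mask, decoder_input, decoder_mask, memory_mask)

# `accum_loss` is typically scaled and added to the task loss during training.
```

With the default `expert_num=1`, the same call still returns only the three-element tuple, so existing callers that unpack three values are unaffected.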