forked from mindspore-Ecosystem/mindspore
commit
82715410e9
|
@ -0,0 +1,417 @@
|
||||||
|
|
||||||
|
# 目录
|
||||||
|
|
||||||
|
<!-- TOC -->
|
||||||
|
|
||||||
|
- [目录](#目录)
|
||||||
|
- [概述](#概述)
|
||||||
|
- [模型架构](#模型架构)
|
||||||
|
- [数据准备](#数据准备)
|
||||||
|
- [环境要求](#环境要求)
|
||||||
|
- [快速入门](#快速入门)
|
||||||
|
- [脚本说明](#脚本说明)
|
||||||
|
- [脚本和样例代码](#脚本和样例代码)
|
||||||
|
- [脚本参数](#脚本参数)
|
||||||
|
- [预训练](#预训练)
|
||||||
|
- [微调与评估](#微调与评估)
|
||||||
|
- [选项及参数](#选项及参数)
|
||||||
|
- [选项](#选项)
|
||||||
|
- [参数](#参数)
|
||||||
|
- [训练过程](#训练过程)
|
||||||
|
- [用法](#用法)
|
||||||
|
- [Ascend处理器上运行](#ascend处理器上运行)
|
||||||
|
- [GPU上运行](#GPU上运行)
|
||||||
|
- [评估过程](#评估过程)
|
||||||
|
- [用法](#用法-1)
|
||||||
|
- [Ascend处理器上运行后评估各个任务的模型](#Ascend处理器上运行后评估各个任务的模型)
|
||||||
|
- [GPU上运行后评估各个任务的模型](#GPU上运行后评估各个任务的模型)
|
||||||
|
- [模型描述](#模型描述)
|
||||||
|
- [性能](#性能)
|
||||||
|
- [预训练性能](#预训练性能)
|
||||||
|
- [推理性能](#推理性能)
|
||||||
|
- [随机情况说明](#随机情况说明)
|
||||||
|
- [ModelZoo主页](#modelzoo主页)
|
||||||
|
|
||||||
|
<!-- /TOC -->
|
||||||
|
|
||||||
|
# 概述
|
||||||
|
|
||||||
|
对话系统 (Dialogue System) 常常需要根据应用场景的变化去解决多种多样的任务。任务的多样性(意图识别、槽填充、行为识别、状态追踪等等),以及领域训练数据的稀少,给Dialogue System的研究和应用带来了巨大的困难和挑战,要使得Dialogue System得到更好的发展,基于BERT的对话通用理解模型 (DGU: Dialogue General Understanding),通过实验表明,使用base-model (BERT)并结合常见的学习范式,可以实现一个通用的对话理解模型。
|
||||||
|
|
||||||
|
DGU模型内共包含4个任务,全部基于公开数据集在mindspore1.1.1上完成训练及评估,详细说明如下:
|
||||||
|
|
||||||
|
udc: 使用UDC (Ubuntu Corpus V1) 数据集完成对话匹配 (Dialogue Response Selection) 任务;
|
||||||
|
atis_intent: 使用ATIS (Airline Travel Information System) 数据集完成对话意图识别 (Dialogue Intent Detection) 任务;
|
||||||
|
mrda: 使用MRDAC (Meeting Recorder Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务;
|
||||||
|
swda: 使用SwDAC (Switchboard Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务;
|
||||||
|
|
||||||
|
# 模型架构
|
||||||
|
|
||||||
|
BERT的主干结构为Transformer。对于BERT_base,Transformer包含12个编码器模块,每个模块包含一个自注意模块,每个自注意模块包含一个注意模块。
|
||||||
|
|
||||||
|
# 数据准备
|
||||||
|
|
||||||
|
- 下载数据集压缩包并解压后,DGU_datasets目录下共存在6个目录,分别对应每个任务的训练集train.txt、评估集dev.txt和测试集test.txt。
|
||||||
|
wget https://paddlenlp.bj.bcebos.com/datasets/DGU_datasets.tar.gz
|
||||||
|
tar -zxf DGU_datasets.tar.gz
|
||||||
|
- 下载数据集进行微调和评估,如udc、atis_intent、mrda、swda等。将数据集文件从JSON格式转换为MindRecord格式。详见src/dataconvert.py文件。
|
||||||
|
- BERT模型训练的词汇表bert-base-uncased-vocab.txt 下载地址:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
|
||||||
|
- bert-base-uncased预训练模型原始权重 下载地址:https://paddlenlp.bj.bcebos.com/models/transformers/bert-base-uncased.pdparams
|
||||||
|
|
||||||
|
# 环境要求
|
||||||
|
|
||||||
|
- 硬件(GPU处理器)
|
||||||
|
- 准备GPU处理器搭建硬件环境。
|
||||||
|
- 框架
|
||||||
|
- [MindSpore](https://gitee.com/mindspore/mindspore)
|
||||||
|
- 更多关于Mindspore的信息,请查看以下资源:
|
||||||
|
- [MindSpore教程](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
|
||||||
|
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
|
||||||
|
|
||||||
|
# 快速入门
|
||||||
|
|
||||||
|
从官网下载安装MindSpore之后,您可以按照如下步骤进行训练和评估:
|
||||||
|
|
||||||
|
- 在GPU上运行
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 运行微调和评估示例
|
||||||
|
- 如需运行微调任务,请先准备预训练生成的权重文件(ckpt)。
|
||||||
|
- 在`finetune_eval_config.py`中设置BERT网络配置和优化器超参。
|
||||||
|
- 运行下载数据脚本:
|
||||||
|
|
||||||
|
bash scripts/download_data.sh
|
||||||
|
- 运行数据预处理脚本:
|
||||||
|
|
||||||
|
bash scripts/run_data_preprocess.sh
|
||||||
|
- 运行下载及转换预训练模型脚本(转换需要paddle环境):
|
||||||
|
|
||||||
|
bash scripts/download_pretrain_model.sh
|
||||||
|
|
||||||
|
- dgu:在scripts/run_dgu.sh中设置任务相关的超参,可完成进行针对不同任务的微调。
|
||||||
|
- 运行`bash scripts/run_dgu_gpu.sh`,对BERT-base模型进行微调。
|
||||||
|
|
||||||
|
bash scripts/run_dgu_gpu.sh
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
在Ascend设备上做分布式训练时,请提前创建JSON格式的HCCL配置文件。
|
||||||
|
|
||||||
|
在Ascend设备上做单机分布式训练时,请参考[here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_single_machine_multi_rank.json)创建HCCL配置文件。
|
||||||
|
|
||||||
|
在Ascend设备上做多机分布式训练时,训练命令需要在很短的时间间隔内在各台设备上执行。因此,每台设备上都需要准备HCCL配置文件。请参考[here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_multi_machine_multi_rank.json)创建多机的HCCL配置文件。
|
||||||
|
|
||||||
|
如需设置数据集格式和参数,请创建JSON格式的模式配置文件,详见[TFRecord](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/dataset_loading.html#tfrecord)格式。
|
||||||
|
|
||||||
|
```text
|
||||||
|
For pretraining, schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
|
||||||
|
|
||||||
|
For ner or classification task, schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
|
||||||
|
|
||||||
|
For squad task, training: schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], evaluation: schema file contains ["input_ids", "input_mask", "segment_ids"].
|
||||||
|
|
||||||
|
`numRows` is the only option which could be set by user, other values must be set according to the dataset.
|
||||||
|
|
||||||
|
For example, the schema file of cn-wiki-128 dataset for pretraining shows as follows:
|
||||||
|
{
|
||||||
|
"datasetType": "TF",
|
||||||
|
"numRows": 7680,
|
||||||
|
"columns": {
|
||||||
|
"input_ids": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [128]
|
||||||
|
},
|
||||||
|
"input_mask": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [128]
|
||||||
|
},
|
||||||
|
"segment_ids": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [128]
|
||||||
|
},
|
||||||
|
"next_sentence_labels": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [1]
|
||||||
|
},
|
||||||
|
"masked_lm_positions": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [20]
|
||||||
|
},
|
||||||
|
"masked_lm_ids": {
|
||||||
|
"type": "int64",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [20]
|
||||||
|
},
|
||||||
|
"masked_lm_weights": {
|
||||||
|
"type": "float32",
|
||||||
|
"rank": 1,
|
||||||
|
"shape": [20]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## 脚本说明
|
||||||
|
|
||||||
|
## 脚本和样例代码
|
||||||
|
|
||||||
|
```shell
|
||||||
|
.
|
||||||
|
└─dgu
|
||||||
|
├─README_CN.md
|
||||||
|
├─scripts
|
||||||
|
├─run_dgu.sh # Ascend上单机DGU任务shell脚本
|
||||||
|
├─run_dgu_gpu.sh # GPU上单机DGU任务shell脚本
|
||||||
|
├─download_data.sh # 下载数据集shell脚本
|
||||||
|
├─download_pretrain_model.sh # 下载预训练模型权重shell脚本
|
||||||
|
├─export.sh # export脚本
|
||||||
|
├─eval.sh # Ascend上单机DGU任务评估shell脚本
|
||||||
|
└─run_data_preprocess.sh # 数据集预处理shell脚本
|
||||||
|
├─src
|
||||||
|
├─__init__.py
|
||||||
|
├─adam.py # 优化器
|
||||||
|
├─args.py # 代码运行参数设置
|
||||||
|
├─bert_for_finetune.py # 网络骨干编码
|
||||||
|
├─bert_for_pre_training.py # 网络骨干编码
|
||||||
|
├─bert_model.py # 网络骨干编码
|
||||||
|
├─config.py # 预训练参数配置
|
||||||
|
├─data_util.py # 数据预处理util函数
|
||||||
|
├─dataset.py # 数据预处理
|
||||||
|
├─dataconvert.py # 数据转换
|
||||||
|
├─finetune_eval_config.py # 微调参数配置
|
||||||
|
├─finetune_eval_model.py # 网络骨干编码
|
||||||
|
├─metric.py # 评估过程的测评方法
|
||||||
|
├─pretrainmodel_convert.py # 预训练模型权重转换
|
||||||
|
├─tokenizer.py # tokenizer函数
|
||||||
|
└─utils.py # util函数
|
||||||
|
└─run_dgu.py # DGU模型的微调和评估网络
|
||||||
|
```
|
||||||
|
|
||||||
|
## 脚本参数
|
||||||
|
|
||||||
|
### 微调与评估
|
||||||
|
|
||||||
|
```shell
|
||||||
|
用法:dataconvert.py [--task_name TASK_NAME]
|
||||||
|
[--data_dir DATA_DIR]
|
||||||
|
[--vocab_file_path VOCAB_FILE_PATH]
|
||||||
|
[--output_dir OUTPUT_DIR]
|
||||||
|
[--max_seq_len N]
|
||||||
|
[--eval_max_seq_len N]
|
||||||
|
选项:
|
||||||
|
--task_name 训练任务的名称
|
||||||
|
--data_dir 原始数据集路径
|
||||||
|
--vocab_file_path BERT模型训练的词汇表
|
||||||
|
--output_dir 保存生成mindRecord格式数据的路径
|
||||||
|
--max_seq_len train数据集的max_seq_len
|
||||||
|
--eval_max_seq_len dev或test数据集的max_seq_len
|
||||||
|
|
||||||
|
用法:run_dgu.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [----do_eval DO_EVAL]
|
||||||
|
[--device_id N] [--epoch_num N]
|
||||||
|
[--train_data_shuffle TRAIN_DATA_SHUFFLE]
|
||||||
|
[--eval_data_shuffle EVAL_DATA_SHUFFLE]
|
||||||
|
[--checkpoint_path CHECKPOINT_PATH]
|
||||||
|
[--model_name_or_path MODEL_NAME_OR_PATH]
|
||||||
|
[--train_data_file_path TRAIN_DATA_FILE_PATH]
|
||||||
|
[--eval_data_file_path EVAL_DATA_FILE_PATH]
|
||||||
|
[--eval_ckpt_path EVAL_CKPT_PATH]
|
||||||
|
[--is_modelarts_work IS_MODELARTS_WORK]
|
||||||
|
选项:
|
||||||
|
--task_name 训练任务的名称
|
||||||
|
--device_target 代码实现设备,可选项为Ascend或CPU。默认为Ascend
|
||||||
|
--do_train 是否基于训练集开始训练,可选项为true或false
|
||||||
|
--do_eval 是否基于开发集开始评估,可选项为true或false
|
||||||
|
--epoch_num 训练轮次总数
|
||||||
|
--train_data_shuffle 是否使能训练数据集轮换,默认为true
|
||||||
|
--eval_data_shuffle 是否使能评估数据集轮换,默认为false
|
||||||
|
--checkpoint_path 保存生成微调检查点的路径
|
||||||
|
--model_name_or_path 初始检查点的文件路径(通常来自预训练BERT模型
|
||||||
|
--train_data_file_path 用于保存训练数据的mindRecord文件,如train1.1.mindrecord
|
||||||
|
--eval_data_file_path 用于保存预测数据的mindRecord文件,如dev1.1.mindrecord
|
||||||
|
--eval_ckpt_path 如仅执行评估,提供用于评估的微调检查点的路径
|
||||||
|
--is_modelarts_work 是否使用ModelArts线上训练环境,默认为false
|
||||||
|
```
|
||||||
|
|
||||||
|
## 选项及参数
|
||||||
|
|
||||||
|
可以在`config.py`和`finetune_eval_config.py`文件中分别配置训练和评估参数。
|
||||||
|
|
||||||
|
### 选项
|
||||||
|
|
||||||
|
```text
|
||||||
|
config for lossscale and etc.
|
||||||
|
bert_network BERT模型版本,可选项为base或nezha,默认为base
|
||||||
|
batch_size 输入数据集的批次大小,默认为16
|
||||||
|
loss_scale_value 损失放大初始值,默认为2^32
|
||||||
|
scale_factor 损失放大的更新因子,默认为2
|
||||||
|
scale_window 损失放大的一次更新步数,默认为1000
|
||||||
|
optimizer 网络中采用的优化器,可选项为AdamWerigtDecayDynamicLR、Lamb、或Momentum,默认为Lamb
|
||||||
|
```
|
||||||
|
|
||||||
|
### 参数
|
||||||
|
|
||||||
|
```text
|
||||||
|
数据集和网络参数(预训练/微调/评估):
|
||||||
|
seq_length 输入序列的长度,默认为128
|
||||||
|
vocab_size 各内嵌向量大小,需与所采用的数据集相同。默认为21136
|
||||||
|
hidden_size BERT的encoder层数,默认为768
|
||||||
|
num_hidden_layers 隐藏层数,默认为12
|
||||||
|
num_attention_heads 注意头的数量,默认为12
|
||||||
|
intermediate_size 中间层数,默认为3072
|
||||||
|
hidden_act 所采用的激活函数,默认为gelu
|
||||||
|
hidden_dropout_prob BERT输出的随机失活可能性,默认为0.1
|
||||||
|
attention_probs_dropout_prob BERT注意的随机失活可能性,默认为0.1
|
||||||
|
max_position_embeddings 序列最大长度,默认为512
|
||||||
|
type_vocab_size 标记类型的词汇表大小,默认为16
|
||||||
|
initializer_range TruncatedNormal的初始值,默认为0.02
|
||||||
|
use_relative_positions 是否采用相对位置,可选项为true或false,默认为False
|
||||||
|
dtype 输入的数据类型,可选项为mstype.float16或mstype.float32,默认为mstype.float32
|
||||||
|
compute_type Bert Transformer的计算类型,可选项为mstype.float16或mstype.float32,默认为mstype.float16
|
||||||
|
|
||||||
|
Parameters for optimizer:
|
||||||
|
AdamWeightDecay:
|
||||||
|
decay_steps 学习率开始衰减的步数
|
||||||
|
learning_rate 学习率
|
||||||
|
end_learning_rate 结束学习率,取值需为正数
|
||||||
|
power 幂
|
||||||
|
warmup_steps 热身学习率步数
|
||||||
|
weight_decay 权重衰减
|
||||||
|
eps 增加分母,提高小数稳定性
|
||||||
|
|
||||||
|
Lamb:
|
||||||
|
decay_steps 学习率开始衰减的步数
|
||||||
|
learning_rate 学习率
|
||||||
|
end_learning_rate 结束学习率
|
||||||
|
power 幂
|
||||||
|
warmup_steps 热身学习率步数
|
||||||
|
weight_decay 权重衰减
|
||||||
|
|
||||||
|
Momentum:
|
||||||
|
learning_rate 学习率
|
||||||
|
momentum 平均移动动量
|
||||||
|
```
|
||||||
|
|
||||||
|
## 训练过程
|
||||||
|
|
||||||
|
### 用法
|
||||||
|
|
||||||
|
#### Ascend处理器上运行
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash scripts/run_dgu.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
以上命令后台运行,您可以在task_name.log中查看训练日志。训练结束后,您可以在默认脚本路径下脚本文件夹中找到检查点文件,得到如下损失值:
|
||||||
|
|
||||||
|
```text
|
||||||
|
# grep "epoch" task_name.log
|
||||||
|
epoch: 0.0, current epoch percent: 0.000, step: 1, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
|
||||||
|
epoch: 0.0, current epoch percent: 0.000, step: 2, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
> **注意**如果所运行的数据集较大,建议添加一个外部环境变量,确保HCCL不会超时。
|
||||||
|
>
|
||||||
|
> ```bash
|
||||||
|
> export HCCL_CONNECT_TIMEOUT=600
|
||||||
|
> ```
|
||||||
|
>
|
||||||
|
> 将HCCL的超时时间从默认的120秒延长到600秒。
|
||||||
|
> **注意**若使用的BERT模型较大,保存检查点时可能会出现protobuf错误,可尝试使用下面的环境集。
|
||||||
|
>
|
||||||
|
> ```bash
|
||||||
|
> export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
|
||||||
|
> ```
|
||||||
|
|
||||||
|
#### GPU上运行
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash scripts/run_dgu_gpu.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
以上命令后台运行,您可以在task_name.log中查看训练日志。训练结束后,您可以在默认脚本路径下脚本文件夹中找到检查点文件,得到如下损失值:
|
||||||
|
|
||||||
|
```text
|
||||||
|
# grep "epoch" task_name.log
|
||||||
|
epoch: 0, current epoch percent: 1.000, step: 6094, outputs are (Tensor(shape=[], dtype=Float32, value= 0.714172), Tensor(shape=[], dtype=Bool, value= False))
|
||||||
|
epoch time: 1702423.561 ms, per step time: 279.361 ms
|
||||||
|
epoch: 1, current epoch percent: 1.000, step: 12188, outputs are (Tensor(shape=[], dtype=Float32, value= 0.788653), Tensor(shape=[], dtype=Bool, value= False))
|
||||||
|
epoch time: 1684662.219 ms, per step time: 276.446 ms
|
||||||
|
epoch: 2, current epoch percent: 1.000, step: 18282, outputs are (Tensor(shape=[], dtype=Float32, value= 0.618005), Tensor(shape=[], dtype=Bool, value= False))
|
||||||
|
epoch time: 1711860.908 ms, per step time: 280.909 ms
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
> **注意**如果所运行的数据集较大,建议添加一个外部环境变量,确保HCCL不会超时。
|
||||||
|
>
|
||||||
|
> ```bash
|
||||||
|
> export HCCL_CONNECT_TIMEOUT=600
|
||||||
|
> ```
|
||||||
|
>
|
||||||
|
> 将HCCL的超时时间从默认的120秒延长到600秒。
|
||||||
|
> **注意**若使用的BERT模型较大,保存检查点时可能会出现protobuf错误,可尝试使用下面的环境集。
|
||||||
|
>
|
||||||
|
> ```bash
|
||||||
|
> export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
|
||||||
|
> ```
|
||||||
|
|
||||||
|
## 评估过程
|
||||||
|
|
||||||
|
### 用法
|
||||||
|
|
||||||
|
#### Ascend处理器上运行后评估各个任务的模型
|
||||||
|
|
||||||
|
运行以下命令前,确保已设置加载与训练检查点路径。若将检查点路径设置为绝对全路径,例如,/username/pretrain/checkpoint_100_300.ckpt,则评估指定的检查点;若将检查点路径设置为文件夹路径,则评估文件夹中所有检查点。
|
||||||
|
修改eval.sh中task_name为将要评估的任务名以及修改相应的测试数据路径,修改device_target为"Ascend"。
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash scripts/eval.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
可得到如下结果:
|
||||||
|
|
||||||
|
```text
|
||||||
|
eval model: /home/dgu/checkpoints/swda/swda_3-2_6094.ckpt
|
||||||
|
loading...
|
||||||
|
evaling...
|
||||||
|
==============================================================
|
||||||
|
(w/o first and last) elapsed time: 2.3705036640167236, per step time : 0.017053983194364918
|
||||||
|
==============================================================
|
||||||
|
Accuracy : 0.8092150215136715
|
||||||
|
```
|
||||||
|
|
||||||
|
#### GPU上运行后评估各个任务的模型
|
||||||
|
|
||||||
|
运行以下命令前,确保已设置加载与训练检查点路径。请将检查点路径设置为绝对全路径,例如,/username/pretrain/checkpoint_100_300.ckpt,则评估指定的检查点;若将检查点路径设置为文件夹路径,则评估文件夹中所有检查点。
|
||||||
|
修改eval.sh中task_name为将要评估的任务名以及修改相应的测试数据路径,修改device_target为"GPU"。
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash scripts/eval.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
可得到如下结果:
|
||||||
|
|
||||||
|
```text
|
||||||
|
eval model: /home/dgu/checkpoints/swda/swda-2_6094.ckpt
|
||||||
|
loading...
|
||||||
|
evaling...
|
||||||
|
==============================================================
|
||||||
|
(w/o first and last) elapsed time: 10.98917531967163, per step time : 0.0790588152494362
|
||||||
|
==============================================================
|
||||||
|
Accuracy : 0.8082890070921985
|
||||||
|
```
|
||||||
|
|
||||||
|
# 随机情况说明
|
||||||
|
|
||||||
|
run_dgu.sh中设置train_data_shuffle为true,eval_data_shuffle为false,默认对数据集进行轮换操作。
|
||||||
|
|
||||||
|
config.py中,默认将hidden_dropout_prob和note_pros_dropout_prob设置为0.1,丢弃部分网络节点。
|
||||||
|
|
||||||
|
# ModelZoo主页
|
||||||
|
|
||||||
|
请浏览官网[主页](https://gitee.com/mindspore/mindspore/tree/master/model_zoo)。
|
|
@ -0,0 +1,52 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
"""export checkpoint file into models"""
|
||||||
|
import argparse
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
import mindspore.common.dtype as mstype
|
||||||
|
from mindspore import Tensor, context, load_checkpoint, export
|
||||||
|
|
||||||
|
from src.finetune_eval_config import bert_net_cfg
|
||||||
|
from src.finetune_eval_model import BertCLSModel
|
||||||
|
parser = argparse.ArgumentParser(description="Bert export")
|
||||||
|
parser.add_argument("--device_id", type=int, default=0, help="Device id")
|
||||||
|
parser.add_argument("--batch_size", type=int, default=16, help="batch size")
|
||||||
|
parser.add_argument("--number_labels", type=int, default=16, help="batch size")
|
||||||
|
parser.add_argument("--ckpt_file", type=str, required=True, help="Bert ckpt file.")
|
||||||
|
parser.add_argument("--file_name", type=str, default="Bert", help="bert output air name.")
|
||||||
|
parser.add_argument("--file_format", type=str, choices=["AIR", "ONNX", "MINDIR"], default="AIR", help="file format")
|
||||||
|
parser.add_argument("--device_target", type=str, default="Ascend",
|
||||||
|
choices=["Ascend", "GPU", "CPU"], help="device target (default: Ascend)")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
|
||||||
|
if args.device_target == "Ascend":
|
||||||
|
context.set_context(device_id=args.device_id)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
net = BertCLSModel(bert_net_cfg, False, num_labels=args.number_labels)
|
||||||
|
|
||||||
|
load_checkpoint(args.ckpt_file, net=net)
|
||||||
|
net.set_train(False)
|
||||||
|
|
||||||
|
input_ids = Tensor(np.zeros([args.batch_size, bert_net_cfg.seq_length]), mstype.int32)
|
||||||
|
input_mask = Tensor(np.zeros([args.batch_size, bert_net_cfg.seq_length]), mstype.int32)
|
||||||
|
token_type_id = Tensor(np.zeros([args.batch_size, bert_net_cfg.seq_length]), mstype.int32)
|
||||||
|
label_ids = Tensor(np.zeros([args.batch_size, bert_net_cfg.seq_length]), mstype.int32)
|
||||||
|
|
||||||
|
input_data = [input_ids, input_mask, token_type_id]
|
||||||
|
export(net, *input_data, file_name=args.file_name, file_format=args.file_format)
|
|
@ -0,0 +1,226 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
'''
|
||||||
|
Bert finetune and evaluation script.
|
||||||
|
'''
|
||||||
|
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
import mindspore.common.dtype as mstype
|
||||||
|
import mindspore.ops as P
|
||||||
|
|
||||||
|
from mindspore import context
|
||||||
|
from mindspore import log as logger
|
||||||
|
from mindspore.nn import Accuracy
|
||||||
|
from mindspore.nn.optim import AdamWeightDecay
|
||||||
|
from mindspore.nn.wrap.loss_scale import DynamicLossScaleUpdateCell
|
||||||
|
from mindspore.train.callback import (CheckpointConfig, ModelCheckpoint,
|
||||||
|
TimeMonitor)
|
||||||
|
from mindspore.train.model import Model
|
||||||
|
from mindspore.train.serialization import load_checkpoint, load_param_into_net
|
||||||
|
|
||||||
|
import src.dataset as data
|
||||||
|
import src.metric as metric
|
||||||
|
from src.args import parse_args, set_default_args
|
||||||
|
from src.bert_for_finetune import BertCLS, BertFinetuneCell
|
||||||
|
from src.finetune_eval_config import (bert_net_cfg, bert_net_udc_cfg,
|
||||||
|
optimizer_cfg)
|
||||||
|
from src.utils import (CustomWarmUpLR, GetAllCkptPath, LossCallBack,
|
||||||
|
create_classification_dataset, make_directory)
|
||||||
|
|
||||||
|
|
||||||
|
def do_train(dataset=None, network=None, load_checkpoint_path="base-BertCLS-111.ckpt",
|
||||||
|
save_checkpoint_path="", epoch_num=1):
|
||||||
|
""" do train """
|
||||||
|
if load_checkpoint_path == "":
|
||||||
|
raise ValueError("Pretrain model missed, finetune task must load pretrain model!")
|
||||||
|
print("load pretrain model: ", load_checkpoint_path)
|
||||||
|
steps_per_epoch = args_opt.save_steps
|
||||||
|
num_examples = dataset.get_dataset_size() * args_opt.train_batch_size
|
||||||
|
max_train_steps = epoch_num * dataset.get_dataset_size()
|
||||||
|
warmup_steps = int(max_train_steps * args_opt.warmup_proportion)
|
||||||
|
print("Num train examples: %d" % num_examples)
|
||||||
|
print("Max train steps: %d" % max_train_steps)
|
||||||
|
print("Num warmup steps: %d" % warmup_steps)
|
||||||
|
#warmup and optimizer
|
||||||
|
lr_schedule = CustomWarmUpLR(learning_rate=args_opt.learning_rate, \
|
||||||
|
warmup_steps=warmup_steps, max_train_steps=max_train_steps)
|
||||||
|
params = network.trainable_params()
|
||||||
|
decay_params = list(filter(optimizer_cfg.AdamWeightDecay.decay_filter, params))
|
||||||
|
other_params = list(filter(lambda x: not optimizer_cfg.AdamWeightDecay.decay_filter(x), params))
|
||||||
|
group_params = [{'params': decay_params, 'weight_decay': optimizer_cfg.AdamWeightDecay.weight_decay},
|
||||||
|
{'params': other_params, 'weight_decay': 0.0}]
|
||||||
|
optimizer = AdamWeightDecay(group_params, lr_schedule, eps=optimizer_cfg.AdamWeightDecay.eps)
|
||||||
|
update_cell = DynamicLossScaleUpdateCell(loss_scale_value=2**32, scale_factor=2, scale_window=1000)
|
||||||
|
#ckpt config
|
||||||
|
ckpt_config = CheckpointConfig(save_checkpoint_steps=steps_per_epoch, keep_checkpoint_max=10)
|
||||||
|
ckpoint_cb = ModelCheckpoint(prefix=args_opt.task_name,
|
||||||
|
directory=None if save_checkpoint_path == "" else save_checkpoint_path,
|
||||||
|
config=ckpt_config)
|
||||||
|
# load checkpoint into network
|
||||||
|
param_dict = load_checkpoint(load_checkpoint_path)
|
||||||
|
load_param_into_net(network, param_dict)
|
||||||
|
|
||||||
|
netwithgrads = BertFinetuneCell(network, optimizer=optimizer, scale_update_cell=update_cell)
|
||||||
|
model = Model(netwithgrads)
|
||||||
|
callbacks = [TimeMonitor(dataset.get_dataset_size()), LossCallBack(dataset.get_dataset_size()), ckpoint_cb]
|
||||||
|
model.train(epoch_num, dataset, callbacks=callbacks)
|
||||||
|
|
||||||
|
def eval_result_print(eval_metric, result):
|
||||||
|
if args_opt.task_name.lower() in ['atis_intent', 'mrda', 'swda']:
|
||||||
|
metric_name = "Accuracy"
|
||||||
|
else:
|
||||||
|
metric_name = eval_metric.name()
|
||||||
|
print(metric_name, " :", result)
|
||||||
|
if args_opt.task_name.lower() == "udc":
|
||||||
|
print("R1@10: ", result[0])
|
||||||
|
print("R2@10: ", result[1])
|
||||||
|
print("R5@10: ", result[2])
|
||||||
|
|
||||||
|
def do_eval(dataset=None, network=None, num_class=5, eval_metric=None, load_checkpoint_path=""):
|
||||||
|
""" do eval """
|
||||||
|
if load_checkpoint_path == "":
|
||||||
|
raise ValueError("Finetune model missed, evaluation task must load finetune model!")
|
||||||
|
print("eval model: ", load_checkpoint_path)
|
||||||
|
print("loading... ")
|
||||||
|
net_for_pretraining = network(eval_net_cfg, False, num_class)
|
||||||
|
net_for_pretraining.set_train(False)
|
||||||
|
param_dict = load_checkpoint(load_checkpoint_path)
|
||||||
|
load_param_into_net(net_for_pretraining, param_dict)
|
||||||
|
model = Model(net_for_pretraining)
|
||||||
|
|
||||||
|
print("evaling... ")
|
||||||
|
columns_list = ["input_ids", "input_mask", "segment_ids", "label_ids"]
|
||||||
|
eval_metric.clear()
|
||||||
|
evaluate_times = []
|
||||||
|
for data_item in dataset.create_dict_iterator(num_epochs=1):
|
||||||
|
input_data = []
|
||||||
|
for i in columns_list:
|
||||||
|
input_data.append(data_item[i])
|
||||||
|
input_ids, input_mask, token_type_id, label_ids = input_data
|
||||||
|
squeeze = P.Squeeze(-1)
|
||||||
|
label_ids = squeeze(label_ids)
|
||||||
|
time_begin = time.time()
|
||||||
|
logits = model.predict(input_ids, input_mask, token_type_id, label_ids)
|
||||||
|
time_end = time.time()
|
||||||
|
evaluate_times.append(time_end - time_begin)
|
||||||
|
eval_metric.update(logits, label_ids)
|
||||||
|
print("==============================================================")
|
||||||
|
print("(w/o first and last) elapsed time: {}, per step time : {}".format(
|
||||||
|
sum(evaluate_times[1:-1]), sum(evaluate_times[1:-1])/(len(evaluate_times) - 2)))
|
||||||
|
print("==============================================================")
|
||||||
|
result = eval_metric.eval()
|
||||||
|
eval_result_print(eval_metric, result)
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def run_dgu(args_input):
|
||||||
|
"""run_dgu main function """
|
||||||
|
dataset_class, metric_class = TASK_CLASSES[args_input.task_name]
|
||||||
|
epoch_num = args_input.epochs
|
||||||
|
num_class = dataset_class.num_classes()
|
||||||
|
|
||||||
|
target = args_input.device_target
|
||||||
|
if target == "Ascend":
|
||||||
|
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_input.device_id)
|
||||||
|
elif target == "GPU":
|
||||||
|
context.set_context(mode=context.GRAPH_MODE, device_target="GPU", device_id=args_input.device_id)
|
||||||
|
if net_cfg.compute_type != mstype.float32:
|
||||||
|
logger.warning('GPU only support fp32 temporarily, run with fp32.')
|
||||||
|
net_cfg.compute_type = mstype.float32
|
||||||
|
else:
|
||||||
|
raise Exception("Target error, GPU or Ascend is supported.")
|
||||||
|
|
||||||
|
if args_input.do_train.lower() == "true":
|
||||||
|
netwithloss = BertCLS(net_cfg, True, num_labels=num_class, dropout_prob=0.1)
|
||||||
|
train_ds = create_classification_dataset(batch_size=args_input.train_batch_size, repeat_count=1, \
|
||||||
|
data_file_path=args_input.train_data_file_path, \
|
||||||
|
do_shuffle=(args_input.train_data_shuffle.lower() == "true"))
|
||||||
|
do_train(train_ds, netwithloss, load_pretrain_checkpoint_path, save_finetune_checkpoint_path, epoch_num)
|
||||||
|
|
||||||
|
if args_input.do_eval.lower() == "true":
|
||||||
|
eval_ds = create_classification_dataset(batch_size=args_input.eval_batch_size, repeat_count=1, \
|
||||||
|
data_file_path=args_input.eval_data_file_path, \
|
||||||
|
do_shuffle=(args_input.eval_data_shuffle.lower() == "true"))
|
||||||
|
if args_input.task_name in ['atis_intent', 'mrda', 'swda']:
|
||||||
|
eval_metric = metric_class("classification")
|
||||||
|
else:
|
||||||
|
eval_metric = metric_class()
|
||||||
|
#load model from path and eval
|
||||||
|
if args_input.eval_ckpt_path:
|
||||||
|
do_eval(eval_ds, BertCLS, num_class, eval_metric, args_input.eval_ckpt_path)
|
||||||
|
#eval all saved models
|
||||||
|
else:
|
||||||
|
ckpt_list = GetAllCkptPath(save_finetune_checkpoint_path)
|
||||||
|
print("saved models:", ckpt_list)
|
||||||
|
for filepath in ckpt_list:
|
||||||
|
eval_result = do_eval(eval_ds, BertCLS, num_class, eval_metric, filepath)
|
||||||
|
eval_file_dict[filepath] = str(eval_result)
|
||||||
|
print(eval_file_dict)
|
||||||
|
if args_input.is_modelarts_work == 'true':
|
||||||
|
for filename in eval_file_dict:
|
||||||
|
ckpt_result = eval_file_dict[filename].replace('[', '').replace(']', '').replace(', ', '_', 2)
|
||||||
|
save_file_name = args_input.train_url + ckpt_result + "_" + filename.split('/')[-1]
|
||||||
|
mox.file.copy_parallel(filename, save_file_name)
|
||||||
|
print("upload model " + filename + " to " + save_file_name)
|
||||||
|
|
||||||
|
def print_args_input(args_input):
|
||||||
|
print('----------- Configuration Arguments -----------')
|
||||||
|
for arg, value in sorted(vars(args_input).items()):
|
||||||
|
print('%s: %s' % (arg, value))
|
||||||
|
print('------------------------------------------------')
|
||||||
|
|
||||||
|
def set_bert_cfg():
|
||||||
|
"""set bert cfg"""
|
||||||
|
global net_cfg
|
||||||
|
global eval_net_cfg
|
||||||
|
if args_opt.task_name == 'udc':
|
||||||
|
net_cfg = bert_net_udc_cfg
|
||||||
|
eval_net_cfg = bert_net_udc_cfg
|
||||||
|
print("use udc_bert_cfg")
|
||||||
|
else:
|
||||||
|
net_cfg = bert_net_cfg
|
||||||
|
eval_net_cfg = bert_net_cfg
|
||||||
|
return net_cfg, eval_net_cfg
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
TASK_CLASSES = {
|
||||||
|
'udc': (data.UDCv1, metric.RecallAtK),
|
||||||
|
'atis_intent': (data.ATIS_DID, Accuracy),
|
||||||
|
'mrda': (data.MRDA, Accuracy),
|
||||||
|
'swda': (data.SwDA, Accuracy)
|
||||||
|
}
|
||||||
|
os.environ['GLOG_v'] = '3'
|
||||||
|
eval_file_dict = {}
|
||||||
|
args_opt = parse_args()
|
||||||
|
set_default_args(args_opt)
|
||||||
|
net_cfg, eval_net_cfg = set_bert_cfg()
|
||||||
|
load_pretrain_checkpoint_path = args_opt.model_name_or_path
|
||||||
|
save_finetune_checkpoint_path = args_opt.checkpoints_path + args_opt.task_name
|
||||||
|
save_finetune_checkpoint_path = make_directory(save_finetune_checkpoint_path)
|
||||||
|
if args_opt.is_modelarts_work == 'true':
|
||||||
|
import moxing as mox
|
||||||
|
local_load_pretrain_checkpoint_path = args_opt.local_model_name_or_path
|
||||||
|
local_data_path = '/cache/data/' + args_opt.task_name
|
||||||
|
mox.file.copy_parallel(args_opt.data_url + args_opt.task_name, local_data_path)
|
||||||
|
mox.file.copy_parallel('obs:/' + load_pretrain_checkpoint_path, local_load_pretrain_checkpoint_path)
|
||||||
|
load_pretrain_checkpoint_path = local_load_pretrain_checkpoint_path
|
||||||
|
if not args_opt.train_data_file_path:
|
||||||
|
args_opt.train_data_file_path = local_data_path + '/' + args_opt.task_name + '_train.mindrecord'
|
||||||
|
if not args_opt.eval_data_file_path:
|
||||||
|
args_opt.eval_data_file_path = local_data_path + '/' + args_opt.task_name + '_test.mindrecord'
|
||||||
|
print_args_input(args_opt)
|
||||||
|
run_dgu(args_opt)
|
|
@ -0,0 +1,26 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
# download dataset file to ./
|
||||||
|
DATA_URL=https://paddlenlp.bj.bcebos.com/datasets/DGU_datasets.tar.gz
|
||||||
|
wget --no-check-certificate ${DATA_URL}
|
||||||
|
# unzip dataset file to ./DGU_datasets
|
||||||
|
tar -zxvf DGU_datasets.tar.gz
|
||||||
|
|
||||||
|
cd src
|
||||||
|
# download vocab file to ./src/
|
||||||
|
VOCAB_URL=https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
|
||||||
|
wget --no-check-certificate ${VOCAB_URL}
|
|
@ -0,0 +1,24 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
mkdir -p pretrainModel
|
||||||
|
cd pretrainModel
|
||||||
|
|
||||||
|
# download pretrain model file to ./pretrainModel/
|
||||||
|
MODEL_BERT_BASE="https://paddlenlp.bj.bcebos.com/models/transformers/bert-base-uncased.pdparams"
|
||||||
|
wget --no-check-certificate ${MODEL_BERT_BASE}
|
||||||
|
# convert pdparams to mindspore ckpt
|
||||||
|
python ../src/pretrainmodel_convert.py
|
|
@ -0,0 +1,78 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
export GLOG_v=3
|
||||||
|
|
||||||
|
python3 ./run_dgu.py \
|
||||||
|
--task_name=udc \
|
||||||
|
--do_train="false" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/udc/udc_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_batch_size=100 \
|
||||||
|
--eval_data_file_path=./data/udc/udc_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=2 \
|
||||||
|
--is_modelarts_work="false" \
|
||||||
|
--eval_ckpt_path=./checkpoints/udc/udc-2_31250.ckpt
|
||||||
|
|
||||||
|
python3 ./run_dgu.py \
|
||||||
|
--task_name=atis_intent \
|
||||||
|
--do_train="false" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/atis_intent/atis_intent_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/atis_intent/atis_intent_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=20 \
|
||||||
|
--is_modelarts_work="false" \
|
||||||
|
--eval_ckpt_path=./checkpoints/atis_intent/atis_intent-17_155.ckpt
|
||||||
|
|
||||||
|
python3 ./run_dgu.py \
|
||||||
|
--task_name=mrda \
|
||||||
|
--do_train="false" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/mrda/mrda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/mrda/mrda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=7 \
|
||||||
|
--is_modelarts_work="false" \
|
||||||
|
--eval_ckpt_path=./checkpoints/mrda/mrda-3_2364.ckpt
|
||||||
|
|
||||||
|
python3 ./run_dgu.py \
|
||||||
|
--task_name=swda \
|
||||||
|
--do_train="false" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/swda/swda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/swda/swda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=3 \
|
||||||
|
--is_modelarts_work="false" \
|
||||||
|
--eval_ckpt_path=./checkpoints/swda/swda-3_6094.ckpt
|
|
@ -0,0 +1,23 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
python export.py --device_id=0 \
|
||||||
|
--batch_size=32 \
|
||||||
|
--number_labels=26 \
|
||||||
|
--ckpt_file=/home/ma-user/work/ckpt/atis_intent/0.9791666666666666_atis_intent-11_155.ckpt \
|
||||||
|
--file_name=atis_intent.mindir \
|
||||||
|
--file_format=MINDIR \
|
||||||
|
--device_target=Ascend
|
|
@ -0,0 +1,50 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
CUR_DIR=`pwd`
|
||||||
|
|
||||||
|
#udc
|
||||||
|
python3 ${CUR_DIR}/src/dataconvert.py \
|
||||||
|
--data_dir=${CUR_DIR}/DGU_datasets/ \
|
||||||
|
--output_dir=${CUR_DIR}/data/ \
|
||||||
|
--vocab_file_dir=${CUR_DIR}/src/bert-base-uncased-vocab.txt \
|
||||||
|
--task_name=udc \
|
||||||
|
--max_seq_len=224 \
|
||||||
|
--eval_max_seq_len=224
|
||||||
|
|
||||||
|
#atis_intent
|
||||||
|
python3 ${CUR_DIR}/src/dataconvert.py \
|
||||||
|
--data_dir=${CUR_DIR}/DGU_datasets/ \
|
||||||
|
--output_dir=${CUR_DIR}/data/ \
|
||||||
|
--vocab_file_dir=${CUR_DIR}/src/bert-base-uncased-vocab.txt \
|
||||||
|
--task_name=atis_intent \
|
||||||
|
--max_seq_len=128
|
||||||
|
|
||||||
|
#mrda
|
||||||
|
python3 ${CUR_DIR}/src/dataconvert.py \
|
||||||
|
--data_dir=${CUR_DIR}/DGU_datasets/ \
|
||||||
|
--output_dir=${CUR_DIR}/data/ \
|
||||||
|
--vocab_file_dir=${CUR_DIR}/src/bert-base-uncased-vocab.txt \
|
||||||
|
--task_name=mrda \
|
||||||
|
--max_seq_len=128
|
||||||
|
|
||||||
|
#swda
|
||||||
|
python3 ${CUR_DIR}/src/dataconvert.py \
|
||||||
|
--data_dir=${CUR_DIR}/DGU_datasets/ \
|
||||||
|
--output_dir=${CUR_DIR}/data/ \
|
||||||
|
--vocab_file_dir=${CUR_DIR}/src/bert-base-uncased-vocab.txt \
|
||||||
|
--task_name=swda \
|
||||||
|
--max_seq_len=128
|
|
@ -0,0 +1,73 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
export GLOG_v=3
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=udc \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="Ascend" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/udc/udc_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/udc/udc_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=2 \
|
||||||
|
--is_modelarts_work="false" >udc_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=atis_intent \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="Ascend" \
|
||||||
|
--device_id=1 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/atis_intent/atis_intent_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/atis_intent/atis_intent_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=20 \
|
||||||
|
--is_modelarts_work="false" >atisintent_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=mrda \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="Ascend" \
|
||||||
|
--device_id=2 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/mrda/mrda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/mrda/mrda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=7 \
|
||||||
|
--is_modelarts_work="false" >mrda_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=swda \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="Ascend" \
|
||||||
|
--device_id=3 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/swda/swda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/swda/swda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=3 \
|
||||||
|
--is_modelarts_work="false" >swda_output.log 2>&1 &
|
|
@ -0,0 +1,73 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
export GLOG_v=3
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=udc \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=0 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/udc/udc_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/udc/udc_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=2 \
|
||||||
|
--is_modelarts_work="false" >udc_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=atis_intent \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=1 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/atis_intent/atis_intent_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/atis_intent/atis_intent_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=20 \
|
||||||
|
--is_modelarts_work="false" >atisintent_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=mrda \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=2 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/mrda/mrda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/mrda/mrda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=7 \
|
||||||
|
--is_modelarts_work="false" >mrda_output.log 2>&1 &
|
||||||
|
|
||||||
|
nohup python3 ./run_dgu.py \
|
||||||
|
--task_name=swda \
|
||||||
|
--do_train="true" \
|
||||||
|
--do_eval="true" \
|
||||||
|
--device_target="GPU" \
|
||||||
|
--device_id=3 \
|
||||||
|
--model_name_or_path=./pretrainModel/base-BertCLS-111.ckpt \
|
||||||
|
--train_data_file_path=./data/swda/swda_train.mindrecord \
|
||||||
|
--train_batch_size=32 \
|
||||||
|
--eval_data_file_path=./data/swda/swda_test.mindrecord \
|
||||||
|
--checkpoints_path=./checkpoints/ \
|
||||||
|
--epochs=3 \
|
||||||
|
--is_modelarts_work="false" >swda_output.log 2>&1 &
|
|
@ -0,0 +1,34 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
"""Bert Init."""
|
||||||
|
from .bert_for_pre_training import BertNetworkWithLoss, BertPreTraining, \
|
||||||
|
BertPretrainingLoss, GetMaskedLMOutput, GetNextSentenceOutput, \
|
||||||
|
BertTrainOneStepCell, BertTrainOneStepWithLossScaleCell, \
|
||||||
|
BertTrainOneStepWithLossScaleCellForAdam
|
||||||
|
from .bert_model import BertAttention, BertConfig, BertEncoderCell, BertModel, \
|
||||||
|
BertOutput, BertSelfAttention, BertTransformer, EmbeddingLookup, \
|
||||||
|
EmbeddingPostprocessor, RelaPosEmbeddingsGenerator, RelaPosMatrixGenerator, \
|
||||||
|
SaturateCast, CreateAttentionMaskFromInputMask
|
||||||
|
from .adam import AdamWeightDecayForBert
|
||||||
|
__all__ = [
|
||||||
|
"BertNetworkWithLoss", "BertPreTraining", "BertPretrainingLoss",
|
||||||
|
"GetMaskedLMOutput", "GetNextSentenceOutput", "BertTrainOneStepCell",
|
||||||
|
"BertTrainOneStepWithLossScaleCell",
|
||||||
|
"BertAttention", "BertConfig", "BertEncoderCell", "BertModel", "BertOutput",
|
||||||
|
"BertSelfAttention", "BertTransformer", "EmbeddingLookup",
|
||||||
|
"EmbeddingPostprocessor", "RelaPosEmbeddingsGenerator", "AdamWeightDecayForBert",
|
||||||
|
"RelaPosMatrixGenerator", "SaturateCast", "CreateAttentionMaskFromInputMask",
|
||||||
|
"BertTrainOneStepWithLossScaleCellForAdam"
|
||||||
|
]
|
|
@ -0,0 +1,307 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
"""AdamWeightDecayForBert, a customized Adam for bert. Input: gradient, overflow flag."""
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from mindspore.common import dtype as mstype
|
||||||
|
from mindspore.ops import operations as P
|
||||||
|
from mindspore.ops import composite as C
|
||||||
|
from mindspore.ops import functional as F
|
||||||
|
from mindspore.common.tensor import Tensor
|
||||||
|
from mindspore._checkparam import Validator as validator
|
||||||
|
from mindspore._checkparam import Rel
|
||||||
|
from mindspore.nn.optim.optimizer import Optimizer
|
||||||
|
|
||||||
|
_adam_opt = C.MultitypeFuncGraph("adam_opt")
|
||||||
|
_scaler_one = Tensor(1, mstype.int32)
|
||||||
|
_scaler_ten = Tensor(10, mstype.float32)
|
||||||
|
|
||||||
|
|
||||||
|
@_adam_opt.register("Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Number", "Tensor", "Tensor", "Tensor",
|
||||||
|
"Tensor", "Bool", "Bool")
|
||||||
|
def _update_run_op(beta1, beta2, eps, lr, overflow, weight_decay, param, m, v, gradient, decay_flag, optim_filter):
|
||||||
|
"""
|
||||||
|
Update parameters.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
|
||||||
|
beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
|
||||||
|
eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
|
||||||
|
lr (Tensor): Learning rate.
|
||||||
|
overflow (Tensor): Whether overflow occurs.
|
||||||
|
weight_decay (Number): Weight decay. Should be equal to or greater than 0.
|
||||||
|
param (Tensor): Parameters.
|
||||||
|
m (Tensor): m value of parameters.
|
||||||
|
v (Tensor): v value of parameters.
|
||||||
|
gradient (Tensor): Gradient of parameters.
|
||||||
|
decay_flag (bool): Applies weight decay or not.
|
||||||
|
optim_filter (bool): Applies parameter update or not.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, the new value of v after updating.
|
||||||
|
"""
|
||||||
|
if optim_filter:
|
||||||
|
op_mul = P.Mul()
|
||||||
|
op_square = P.Square()
|
||||||
|
op_sqrt = P.Sqrt()
|
||||||
|
op_cast = P.Cast()
|
||||||
|
op_reshape = P.Reshape()
|
||||||
|
op_shape = P.Shape()
|
||||||
|
op_select = P.Select()
|
||||||
|
|
||||||
|
param_fp32 = op_cast(param, mstype.float32)
|
||||||
|
m_fp32 = op_cast(m, mstype.float32)
|
||||||
|
v_fp32 = op_cast(v, mstype.float32)
|
||||||
|
gradient_fp32 = op_cast(gradient, mstype.float32)
|
||||||
|
|
||||||
|
cond = op_cast(F.fill(mstype.int32, op_shape(m_fp32), 1) * op_reshape(overflow, (())), mstype.bool_)
|
||||||
|
next_m = op_mul(beta1, m_fp32) + op_select(cond, m_fp32,\
|
||||||
|
op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32) - beta1, gradient_fp32))
|
||||||
|
|
||||||
|
next_v = op_mul(beta2, v_fp32) + op_select(cond, v_fp32,\
|
||||||
|
op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32) - beta2, op_square(gradient_fp32)))
|
||||||
|
|
||||||
|
update = next_m / (eps + op_sqrt(next_v))
|
||||||
|
if decay_flag:
|
||||||
|
update = op_mul(weight_decay, param_fp32) + update
|
||||||
|
|
||||||
|
update_with_lr = op_mul(lr, update)
|
||||||
|
zeros = F.fill(mstype.float32, op_shape(param_fp32), 0)
|
||||||
|
next_param = param_fp32 - op_select(cond, zeros, op_reshape(update_with_lr, op_shape(param_fp32)))
|
||||||
|
|
||||||
|
next_param = F.depend(next_param, F.assign(param, op_cast(next_param, F.dtype(param))))
|
||||||
|
next_param = F.depend(next_param, F.assign(m, op_cast(next_m, F.dtype(m))))
|
||||||
|
next_param = F.depend(next_param, F.assign(v, op_cast(next_v, F.dtype(v))))
|
||||||
|
|
||||||
|
return op_cast(next_param, F.dtype(param))
|
||||||
|
return gradient
|
||||||
|
|
||||||
|
|
||||||
|
@_adam_opt.register("Function", "Function", "Function", "Function", "Bool", "Bool", "Bool", "Tensor", "Tensor",
|
||||||
|
"Tensor", "Tensor", "Tensor", "Tensor", "RowTensor", "Tensor", "Tensor", "Tensor", "Bool", "Bool")
|
||||||
|
def _run_opt_with_sparse(opt, sparse_opt, push, pull, use_locking, use_nesterov, target, beta1_power,
|
||||||
|
beta2_power, beta1, beta2, eps, lr, gradient, param, m, v, ps_parameter, cache_enable):
|
||||||
|
"""Apply sparse adam optimizer to the weight parameter when the gradient is sparse."""
|
||||||
|
success = True
|
||||||
|
indices = gradient.indices
|
||||||
|
values = gradient.values
|
||||||
|
if ps_parameter and not cache_enable:
|
||||||
|
op_shape = P.Shape()
|
||||||
|
shapes = (op_shape(param), op_shape(m), op_shape(v),
|
||||||
|
op_shape(beta1_power), op_shape(beta2_power), op_shape(lr), op_shape(beta1),
|
||||||
|
op_shape(beta2), op_shape(eps), op_shape(values), op_shape(indices))
|
||||||
|
success = F.depend(success, pull(push((beta1_power, beta2_power, lr, beta1, beta2,
|
||||||
|
eps, values, indices), shapes), param))
|
||||||
|
return success
|
||||||
|
|
||||||
|
if not target:
|
||||||
|
success = F.depend(success, sparse_opt(param, m, v, beta1_power, beta2_power, lr, beta1, beta2,
|
||||||
|
eps, values, indices))
|
||||||
|
else:
|
||||||
|
op_mul = P.Mul()
|
||||||
|
op_square = P.Square()
|
||||||
|
op_sqrt = P.Sqrt()
|
||||||
|
scatter_add = P.ScatterAdd(use_locking)
|
||||||
|
|
||||||
|
assign_m = F.assign(m, op_mul(beta1, m))
|
||||||
|
assign_v = F.assign(v, op_mul(beta2, v))
|
||||||
|
|
||||||
|
grad_indices = gradient.indices
|
||||||
|
grad_value = gradient.values
|
||||||
|
|
||||||
|
next_m = scatter_add(m,
|
||||||
|
grad_indices,
|
||||||
|
op_mul(F.tuple_to_array((1.0,)) - beta1, grad_value))
|
||||||
|
|
||||||
|
next_v = scatter_add(v,
|
||||||
|
grad_indices,
|
||||||
|
op_mul(F.tuple_to_array((1.0,)) - beta2, op_square(grad_value)))
|
||||||
|
|
||||||
|
if use_nesterov:
|
||||||
|
m_temp = next_m * _scaler_ten
|
||||||
|
assign_m_nesterov = F.assign(m, op_mul(beta1, next_m))
|
||||||
|
div_value = scatter_add(m,
|
||||||
|
op_mul(grad_indices, _scaler_one),
|
||||||
|
op_mul(F.tuple_to_array((1.0,)) - beta1, grad_value))
|
||||||
|
param_update = div_value / (op_sqrt(next_v) + eps)
|
||||||
|
|
||||||
|
m_recover = F.assign(m, m_temp / _scaler_ten)
|
||||||
|
|
||||||
|
F.control_depend(m_temp, assign_m_nesterov)
|
||||||
|
F.control_depend(assign_m_nesterov, div_value)
|
||||||
|
F.control_depend(param_update, m_recover)
|
||||||
|
else:
|
||||||
|
param_update = next_m / (op_sqrt(next_v) + eps)
|
||||||
|
|
||||||
|
lr_t = lr * op_sqrt(1 - beta2_power) / (1 - beta1_power)
|
||||||
|
|
||||||
|
next_param = param - lr_t * param_update
|
||||||
|
|
||||||
|
F.control_depend(assign_m, next_m)
|
||||||
|
F.control_depend(assign_v, next_v)
|
||||||
|
|
||||||
|
success = F.depend(success, F.assign(param, next_param))
|
||||||
|
success = F.depend(success, F.assign(m, next_m))
|
||||||
|
success = F.depend(success, F.assign(v, next_v))
|
||||||
|
|
||||||
|
return success
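# Note (added): in the dense-accumulation branch above, F.assign(m, beta1 * m)
# and F.assign(v, beta2 * v) decay every row of the moment tensors, while
# ScatterAdd only touches the rows selected by gradient.indices, which is what
# makes this path suitable for sparse RowTensor gradients such as
# embedding-table updates.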
|
||||||
|
|
||||||
|
|
||||||
|
@_adam_opt.register("Function", "Function", "Function", "Function", "Bool", "Bool", "Bool", "Tensor", "Tensor",
|
||||||
|
"Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Bool", "Bool")
|
||||||
|
def _run_opt_with_one_number(opt, sparse_opt, push, pull, use_locking, use_nesterov, target,
|
||||||
|
beta1_power, beta2_power, beta1, beta2, eps, lr, gradient, param,
|
||||||
|
moment1, moment2, ps_parameter, cache_enable):
|
||||||
|
"""Apply adam optimizer to the weight parameter using Tensor."""
|
||||||
|
success = True
|
||||||
|
if ps_parameter and not cache_enable:
|
||||||
|
op_shape = P.Shape()
|
||||||
|
success = F.depend(success, pull(push((beta1_power, beta2_power, lr, beta1, beta2, eps, gradient),
|
||||||
|
(op_shape(param), op_shape(moment1), op_shape(moment2))), param))
|
||||||
|
else:
|
||||||
|
success = F.depend(success, opt(param, moment1, moment2, beta1_power, beta2_power, lr, beta1, beta2,
|
||||||
|
eps, gradient))
|
||||||
|
return success
|
||||||
|
|
||||||
|
|
||||||
|
@_adam_opt.register("Function", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor",
|
||||||
|
"Tensor", "Tensor")
|
||||||
|
def _run_off_load_opt(opt, beta1_power, beta2_power, beta1, beta2, eps, lr, gradient, param, moment1, moment2):
|
||||||
|
"""Apply AdamOffload optimizer to the weight parameter using Tensor."""
|
||||||
|
success = True
|
||||||
|
delat_param = opt(moment1, moment2, beta1_power, beta2_power, lr, beta1, beta2, eps, gradient)
|
||||||
|
success = F.depend(success, F.assign_add(param, delat_param))
|
||||||
|
return success
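# Note (added): the fused AdamOffload primitive above returns a parameter
# delta instead of writing the parameter in place, so the weight itself is
# only updated by the F.assign_add call on the device side.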
|
||||||
|
|
||||||
|
|
||||||
|
def _check_param_value(beta1, beta2, eps, prim_name):
    """Check the type of inputs."""
    validator.check_value_type("beta1", beta1, [float], prim_name)
    validator.check_value_type("beta2", beta2, [float], prim_name)
    validator.check_value_type("eps", eps, [float], prim_name)
    validator.check_float_range(beta1, 0.0, 1.0, Rel.INC_NEITHER, "beta1", prim_name)
    validator.check_float_range(beta2, 0.0, 1.0, Rel.INC_NEITHER, "beta2", prim_name)
    validator.check_positive_float(eps, "eps", prim_name)
|
||||||
|
|
||||||
|
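# A minimal sketch (added, not part of the original file) of the decoupled
# weight-decay Adam step that AdamWeightDecayForBert below realises in graph
# mode; the names here are illustrative only:
#
#     m = beta1 * m + (1 - beta1) * grad
#     v = beta2 * v + (1 - beta2) * grad * grad
#     update = m / (sqrt(v) + eps)
#     if decay_flag:
#         update = update + weight_decay * param
#     param = param - lr * update
#
# When the overflow flag passed to construct() is set, beta1/beta2 are
# selected as 1.0 and the final update is replaced with zeros, so the whole
# step is effectively skipped.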
class AdamWeightDecayForBert(Optimizer):
|
||||||
|
"""
|
||||||
|
Implements the Adam algorithm with fixed weight decay (AdamWeightDecay), extended to take an overflow flag.
|
||||||
|
|
||||||
|
Note:
|
||||||
|
When separating parameter groups, the weight decay in each group will be applied on the parameters if the
|
||||||
|
weight decay is positive. When not separating parameter groups, the `weight_decay` in the API will be applied
|
||||||
|
on the parameters without 'beta' or 'gamma' in their names if `weight_decay` is positive.
|
||||||
|
|
||||||
|
To improve the performance of parameter groups, a customized order of parameters is supported.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
params (Union[list[Parameter], list[dict]]): When the `params` is a list of `Parameter` which will be updated,
|
||||||
|
the element in `params` must be class `Parameter`. When the `params` is a list of `dict`, the "params",
|
||||||
|
"lr", "weight_decay" and "order_params" are the keys can be parsed.
|
||||||
|
|
||||||
|
- params: Required. The value must be a list of `Parameter`.
|
||||||
|
|
||||||
|
- lr: Optional. If "lr" is in the keys, the value of the corresponding learning rate will be used.
|
||||||
|
If not, the `learning_rate` in the API will be used.
|
||||||
|
|
||||||
|
- weight_decay: Optional. If "weight_decay" is in the keys, the value of the corresponding weight decay
|
||||||
|
will be used. If not, the `weight_decay` in the API will be used.
|
||||||
|
|
||||||
|
- order_params: Optional. If "order_params" is in the keys, the value must be the order of parameters and
|
||||||
|
the order will be followed in the optimizer. There are no other keys in the `dict` and the parameters
|
||||||
|
in 'order_params' must be included in one of the groups of parameters.
|
||||||
|
|
||||||
|
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
|
||||||
|
When the learning_rate is an Iterable or a one-dimensional Tensor, the dynamic learning rate is used, and
|
||||||
|
the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
|
||||||
|
use dynamic learning rate, the i-th learning rate will be calculated during the process of training
|
||||||
|
according to the formula of LearningRateSchedule. When the learning_rate is a float or a zero-dimensional
Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be
|
||||||
|
equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
|
||||||
|
Default: 1e-3.
|
||||||
|
beta1 (float): The exponential decay rate for the 1st moment estimations. Default: 0.9.
|
||||||
|
Should be in range (0.0, 1.0).
|
||||||
|
beta2 (float): The exponential decay rate for the 2nd moment estimations. Default: 0.999.
|
||||||
|
Should be in range (0.0, 1.0).
|
||||||
|
eps (float): Term added to the denominator to improve numerical stability. Default: 1e-6.
|
||||||
|
Should be greater than 0.
|
||||||
|
weight_decay (float): Weight decay (L2 penalty). It must be equal to or greater than 0. Default: 0.0.
|
||||||
|
|
||||||
|
Inputs:
|
||||||
|
- **gradients** (tuple[Tensor]) - The gradients of `params`, the shape is the same as `params`.
|
||||||
|
- **overflow** (Tensor) - The overflow flag produced by dynamic loss scaling.
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
tuple[bool], all elements are True.
|
||||||
|
|
||||||
|
Supported Platforms:
|
||||||
|
``Ascend`` ``GPU``
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
>>> net = Net()
|
||||||
|
>>> #1) All parameters use the same learning rate and weight decay
|
||||||
|
>>> optim = AdamWeightDecayForBert(params=net.trainable_params())
|
||||||
|
>>>
|
||||||
|
>>> #2) Use parameter groups and set different values
|
||||||
|
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||||
|
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||||
|
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
|
||||||
|
... {'params': no_conv_params, 'lr': 0.01},
|
||||||
|
... {'order_params': net.trainable_params()}]
|
||||||
|
>>> optim = AdamWeightDecayForBert(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||||
|
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
|
||||||
|
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
|
||||||
|
>>> # The final order of parameters followed by the optimizer is the value of 'order_params'.
|
||||||
|
>>>
|
||||||
|
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||||
|
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||||
|
"""
|
||||||
|
def __init__(self, params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):
|
||||||
|
super(AdamWeightDecayForBert, self).__init__(learning_rate, params, weight_decay)
|
||||||
|
_check_param_value(beta1, beta2, eps, self.cls_name)
|
||||||
|
self.beta1 = Tensor(np.array([beta1]).astype(np.float32))
|
||||||
|
self.beta2 = Tensor(np.array([beta2]).astype(np.float32))
|
||||||
|
self.eps = Tensor(np.array([eps]).astype(np.float32))
|
||||||
|
self.moments1 = self.parameters.clone(prefix="adam_m", init='zeros')
|
||||||
|
self.moments2 = self.parameters.clone(prefix="adam_v", init='zeros')
|
||||||
|
self.hyper_map = C.HyperMap()
|
||||||
|
self.op_select = P.Select()
|
||||||
|
self.op_cast = P.Cast()
|
||||||
|
self.op_reshape = P.Reshape()
|
||||||
|
self.op_shape = P.Shape()
|
||||||
|
|
||||||
|
def construct(self, gradients, overflow):
|
||||||
|
"""AdamWeightDecayForBert"""
|
||||||
|
lr = self.get_lr()
|
||||||
|
cond = self.op_cast(F.fill(mstype.int32, self.op_shape(self.beta1), 1) *\
|
||||||
|
self.op_reshape(overflow, (())), mstype.bool_)
|
||||||
|
beta1 = self.op_select(cond, self.op_cast(F.tuple_to_array((1.0,)), mstype.float32), self.beta1)
|
||||||
|
beta2 = self.op_select(cond, self.op_cast(F.tuple_to_array((1.0,)), mstype.float32), self.beta2)
|
||||||
|
if self.is_group:
|
||||||
|
if self.is_group_lr:
|
||||||
|
optim_result = self.hyper_map(F.partial(_adam_opt, self.beta1, self.beta2, self.eps),
|
||||||
|
lr, self.weight_decay, self.parameters, self.moments1, self.moments2,
|
||||||
|
gradients, self.decay_flags, self.optim_filter)
|
||||||
|
else:
|
||||||
|
optim_result = self.hyper_map(F.partial(_adam_opt, beta1, beta2, self.eps, lr, overflow),
|
||||||
|
self.weight_decay, self.parameters, self.moments1, self.moments2,
|
||||||
|
gradients, self.decay_flags, self.optim_filter)
|
||||||
|
else:
|
||||||
|
optim_result = self.hyper_map(F.partial(_adam_opt, self.beta1, self.beta2, self.eps, lr, self.weight_decay),
|
||||||
|
self.parameters, self.moments1, self.moments2,
|
||||||
|
gradients, self.decay_flags, self.optim_filter)
|
||||||
|
if self.use_parallel:
|
||||||
|
self.broadcast_params(optim_result)
|
||||||
|
return optim_result
@@ -0,0 +1,165 @@
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
Args used in Bert finetune and evaluation.
|
||||||
|
"""
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
def parse_args():
|
||||||
|
"""Parse args."""
|
||||||
|
parser = argparse.ArgumentParser(__doc__)
|
||||||
|
parser.add_argument(
|
||||||
|
"--task_name",
|
||||||
|
default="udc",
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help="The name of the task to train.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--device_target",
|
||||||
|
default="GPU",
|
||||||
|
type=str,
|
||||||
|
help="The device to train.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--device_id",
|
||||||
|
default=0,
|
||||||
|
type=int,
|
||||||
|
help="The device id to use.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--model_name_or_path",
|
||||||
|
default='bert-base-uncased.ckpt',
|
||||||
|
type=str,
|
||||||
|
help="Path to pre-trained bert model or shortcut name.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--local_model_name_or_path",
|
||||||
|
default='/cache/pretrainModel/bert-BertCLS-111.ckpt',
|
||||||
|
type=str,
|
||||||
|
help="local Path to pre-trained bert model or shortcut name, for online work.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--checkpoints_path",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="The output directory where the checkpoints will be saved.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_ckpt_path",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="The path of checkpoint to be loaded.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--max_seq_len",
|
||||||
|
default=None,
|
||||||
|
type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization for trainng.\
|
||||||
|
Sequences longer than this will be truncated, sequences shorter will be padded.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_max_seq_len",
|
||||||
|
default=None,
|
||||||
|
type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization for evaling.\
|
||||||
|
Sequences longer than this will be truncated, sequences shorter will be padded.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--learning_rate",
|
||||||
|
default=None,
|
||||||
|
type=float,
|
||||||
|
help="The initial learning rate for Adam.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--epochs",
|
||||||
|
default=None,
|
||||||
|
type=int,
|
||||||
|
help="Total number of training epochs to perform.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--save_steps",
|
||||||
|
default=None,
|
||||||
|
type=int,
|
||||||
|
help="Save checkpoint every X updates steps.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--warmup_proportion",
|
||||||
|
default=0.1,
|
||||||
|
type=float,
|
||||||
|
help="The proportion of warmup.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--do_train", default="true", type=str, help="Whether training.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--do_eval", default="true", type=str, help="Whether evaluation.")
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--train_data_shuffle", type=str, default="true", choices=["true", "false"],
|
||||||
|
help="Enable train data shuffle, default is true")
|
||||||
|
parser.add_argument(
|
||||||
|
"--train_data_file_path", type=str, default="",
|
||||||
|
help="Data path, it is better to use absolute path")
|
||||||
|
parser.add_argument(
|
||||||
|
"--train_batch_size", type=int, default=32, help="Train batch size, default is 32")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_batch_size", type=int, default=None,
|
||||||
|
help="Eval batch size, default is None. if the eval_batch_size parameter is not passed in,\
|
||||||
|
It will be assigned the same value as train_batch_size")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_data_file_path", type=str, default="", help="Data path, it is better to use absolute path")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_data_shuffle", type=str, default="false", choices=["true", "false"],
|
||||||
|
help="Enable eval data shuffle, default is false")
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--is_modelarts_work", type=str, default="false", help="Whether modelarts online work.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--train_url", type=str, default="",
|
||||||
|
help="save_model path, it is better to use absolute path, for modelarts online work.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--data_url", type=str, default="", help="data path, for modelarts online work")
|
||||||
|
args = parser.parse_args()
|
||||||
|
return args
|
||||||
|
|
||||||
|
|
||||||
|
def set_default_args(args):
|
||||||
|
"""set default args."""
|
||||||
|
args.task_name = args.task_name.lower()
|
||||||
|
if args.task_name == 'udc':
|
||||||
|
if not args.save_steps:
|
||||||
|
args.save_steps = 1000
|
||||||
|
if not args.epochs:
|
||||||
|
args.epochs = 2
|
||||||
|
if not args.max_seq_len:
|
||||||
|
args.max_seq_len = 224
|
||||||
|
if not args.eval_batch_size:
|
||||||
|
args.eval_batch_size = 100
|
||||||
|
elif args.task_name == 'atis_intent':
|
||||||
|
if not args.save_steps:
|
||||||
|
args.save_steps = 100
|
||||||
|
if not args.epochs:
|
||||||
|
args.epochs = 20
|
||||||
|
elif args.task_name == 'mrda':
|
||||||
|
if not args.save_steps:
|
||||||
|
args.save_steps = 500
|
||||||
|
if not args.epochs:
|
||||||
|
args.epochs = 7
|
||||||
|
elif args.task_name == 'swda':
|
||||||
|
if not args.save_steps:
|
||||||
|
args.save_steps = 500
|
||||||
|
if not args.epochs:
|
||||||
|
args.epochs = 3
|
||||||
|
else:
|
||||||
|
raise ValueError('Unsupported task: %s.' % args.task_name)
|
||||||
|
|
||||||
|
if not args.checkpoints_path:
|
||||||
|
args.checkpoints_path = './checkpoints/' + args.task_name
|
||||||
|
if not args.learning_rate:
|
||||||
|
args.learning_rate = 2e-5
|
||||||
|
if not args.max_seq_len:
|
||||||
|
args.max_seq_len = 128
|
||||||
|
if not args.eval_max_seq_len:
|
||||||
|
args.eval_max_seq_len = args.max_seq_len
|
||||||
|
if not args.eval_batch_size:
|
||||||
|
args.eval_batch_size = args.train_batch_size
|
@@ -0,0 +1,339 @@
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
'''
|
||||||
|
Bert for finetune script.
|
||||||
|
'''
|
||||||
|
|
||||||
|
import mindspore.nn as nn
|
||||||
|
from mindspore.ops import operations as P
|
||||||
|
from mindspore.ops import functional as F
|
||||||
|
from mindspore.ops import composite as C
|
||||||
|
from mindspore.common.tensor import Tensor
|
||||||
|
from mindspore.common.parameter import Parameter
|
||||||
|
from mindspore.common import dtype as mstype
|
||||||
|
from mindspore.nn.wrap.grad_reducer import DistributedGradReducer
|
||||||
|
from mindspore.context import ParallelMode
|
||||||
|
from mindspore.communication.management import get_group_size
|
||||||
|
from mindspore import context
|
||||||
|
from .bert_for_pre_training import clip_grad
|
||||||
|
from .finetune_eval_model import BertCLSModel, BertNERModel, BertSquadModel
|
||||||
|
from .utils import CrossEntropyCalculation
|
||||||
|
|
||||||
|
|
||||||
|
GRADIENT_CLIP_TYPE = 1
|
||||||
|
GRADIENT_CLIP_VALUE = 1.0
|
||||||
|
grad_scale = C.MultitypeFuncGraph("grad_scale")
|
||||||
|
reciprocal = P.Reciprocal()
|
||||||
|
@grad_scale.register("Tensor", "Tensor")
|
||||||
|
def tensor_grad_scale(scale, grad):
|
||||||
|
return grad * reciprocal(scale)
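# Note (added): tensor_grad_scale multiplies each gradient by 1/scale, undoing
# the loss-scale factor that was injected through the `sens` input of the
# backward graph, so the optimizer always receives unscaled gradients.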
|
||||||
|
|
||||||
|
_grad_overflow = C.MultitypeFuncGraph("_grad_overflow")
|
||||||
|
grad_overflow = P.FloatStatus()
|
||||||
|
@_grad_overflow.register("Tensor")
|
||||||
|
def _tensor_grad_overflow(grad):
|
||||||
|
return grad_overflow(grad)
|
||||||
|
|
||||||
|
class BertFinetuneCell(nn.Cell):
|
||||||
|
"""
|
||||||
|
Specifically defined for finetuning, where only four input tensors are needed.
|
||||||
|
|
||||||
|
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
|
||||||
|
|
||||||
|
Different from the built-in loss_scale wrapper cell, we apply grad_clip before the optimization step.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
network (Cell): The training network. Note that loss function should have been added.
|
||||||
|
optimizer (Optimizer): Optimizer for updating the weights.
|
||||||
|
scale_update_cell (Cell): Cell to do the loss scale. Default: None.
|
||||||
|
"""
|
||||||
|
def __init__(self, network, optimizer, scale_update_cell=None):
|
||||||
|
|
||||||
|
super(BertFinetuneCell, self).__init__(auto_prefix=False)
|
||||||
|
self.network = network
|
||||||
|
self.network.set_grad()
|
||||||
|
self.weights = optimizer.parameters
|
||||||
|
self.optimizer = optimizer
|
||||||
|
self.grad = C.GradOperation(get_by_list=True,
|
||||||
|
sens_param=True)
|
||||||
|
self.reducer_flag = False
|
||||||
|
self.allreduce = P.AllReduce()
|
||||||
|
self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
|
||||||
|
if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
|
||||||
|
self.reducer_flag = True
|
||||||
|
self.grad_reducer = None
|
||||||
|
if self.reducer_flag:
|
||||||
|
mean = context.get_auto_parallel_context("gradients_mean")
|
||||||
|
degree = get_group_size()
|
||||||
|
self.grad_reducer = DistributedGradReducer(optimizer.parameters, mean, degree)
|
||||||
|
self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.gpu_target = False
|
||||||
|
if context.get_context("device_target") == "GPU":
|
||||||
|
self.gpu_target = True
|
||||||
|
self.float_status = P.FloatStatus()
|
||||||
|
self.addn = P.AddN()
|
||||||
|
self.reshape = P.Reshape()
|
||||||
|
else:
|
||||||
|
self.alloc_status = P.NPUAllocFloatStatus()
|
||||||
|
self.get_status = P.NPUGetFloatStatus()
|
||||||
|
self.clear_status = P.NPUClearFloatStatus()
|
||||||
|
self.reduce_sum = P.ReduceSum(keep_dims=False)
|
||||||
|
self.base = Tensor(1, mstype.float32)
|
||||||
|
self.less_equal = P.LessEqual()
|
||||||
|
self.hyper_map = C.HyperMap()
|
||||||
|
self.loss_scale = None
|
||||||
|
self.loss_scaling_manager = scale_update_cell
|
||||||
|
if scale_update_cell:
|
||||||
|
self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
label_ids,
|
||||||
|
sens=None):
|
||||||
|
"""Bert Finetune"""
|
||||||
|
|
||||||
|
weights = self.weights
|
||||||
|
init = False
|
||||||
|
loss = self.network(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
label_ids)
|
||||||
|
if sens is None:
|
||||||
|
scaling_sens = self.loss_scale
|
||||||
|
else:
|
||||||
|
scaling_sens = sens
|
||||||
|
|
||||||
|
if not self.gpu_target:
|
||||||
|
init = self.alloc_status()
|
||||||
|
init = F.depend(init, loss)
|
||||||
|
clear_status = self.clear_status(init)
|
||||||
|
scaling_sens = F.depend(scaling_sens, clear_status)
|
||||||
|
grads = self.grad(self.network, weights)(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
label_ids,
|
||||||
|
self.cast(scaling_sens,
|
||||||
|
mstype.float32))
|
||||||
|
grads = self.hyper_map(F.partial(grad_scale, scaling_sens), grads)
|
||||||
|
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
|
||||||
|
if self.reducer_flag:
|
||||||
|
grads = self.grad_reducer(grads)
|
||||||
|
if not self.gpu_target:
|
||||||
|
init = F.depend(init, grads)
|
||||||
|
get_status = self.get_status(init)
|
||||||
|
init = F.depend(init, get_status)
|
||||||
|
flag_sum = self.reduce_sum(init, (0,))
|
||||||
|
else:
|
||||||
|
flag_sum = self.hyper_map(F.partial(_grad_overflow), grads)
|
||||||
|
flag_sum = self.addn(flag_sum)
|
||||||
|
flag_sum = self.reshape(flag_sum, (()))
|
||||||
|
if self.is_distributed:
|
||||||
|
flag_reduce = self.allreduce(flag_sum)
|
||||||
|
cond = self.less_equal(self.base, flag_reduce)
|
||||||
|
else:
|
||||||
|
cond = self.less_equal(self.base, flag_sum)
|
||||||
|
overflow = cond
|
||||||
|
if sens is None:
|
||||||
|
overflow = self.loss_scaling_manager(self.loss_scale, cond)
|
||||||
|
if overflow:
|
||||||
|
succ = False
|
||||||
|
else:
|
||||||
|
succ = self.optimizer(grads)
|
||||||
|
ret = (loss, cond)
|
||||||
|
return F.depend(ret, succ)
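# Hypothetical wiring sketch (added, not part of the original file):
# BertFinetuneCell is typically wrapped around a task network (for example the
# BertCLS defined later in this file) together with an update cell from a
# dynamic loss-scale manager:
#
#     netwithloss = BertCLS(bert_net_cfg, True, num_labels=num_classes)
#     manager = DynamicLossScaleManager(init_loss_scale=2 ** 32, scale_factor=2, scale_window=1000)
#     netwithgrads = BertFinetuneCell(netwithloss, optimizer=optimizer,
#                                     scale_update_cell=manager.get_update_cell())
#
# The loss-scale values here are illustrative; the real ones come from the
# task configuration.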
|
||||||
|
|
||||||
|
class BertSquadCell(nn.Cell):
|
||||||
|
"""
|
||||||
|
Specifically defined for SQuAD finetuning.
|
||||||
|
"""
|
||||||
|
def __init__(self, network, optimizer, scale_update_cell=None):
|
||||||
|
super(BertSquadCell, self).__init__(auto_prefix=False)
|
||||||
|
self.network = network
|
||||||
|
self.network.set_grad()
|
||||||
|
self.weights = optimizer.parameters
|
||||||
|
self.optimizer = optimizer
|
||||||
|
self.grad = C.GradOperation(get_by_list=True, sens_param=True)
|
||||||
|
self.reducer_flag = False
|
||||||
|
self.allreduce = P.AllReduce()
|
||||||
|
self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
|
||||||
|
if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
|
||||||
|
self.reducer_flag = True
|
||||||
|
self.grad_reducer = None
|
||||||
|
if self.reducer_flag:
|
||||||
|
mean = context.get_auto_parallel_context("gradients_mean")
|
||||||
|
degree = get_group_size()
|
||||||
|
self.grad_reducer = DistributedGradReducer(optimizer.parameters, mean, degree)
|
||||||
|
self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.alloc_status = P.NPUAllocFloatStatus()
|
||||||
|
self.get_status = P.NPUGetFloatStatus()
|
||||||
|
self.clear_status = P.NPUClearFloatStatus()
|
||||||
|
self.reduce_sum = P.ReduceSum(keep_dims=False)
|
||||||
|
self.base = Tensor(1, mstype.float32)
|
||||||
|
self.less_equal = P.LessEqual()
|
||||||
|
self.hyper_map = C.HyperMap()
|
||||||
|
self.loss_scale = None
|
||||||
|
self.loss_scaling_manager = scale_update_cell
|
||||||
|
if scale_update_cell:
|
||||||
|
self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
start_position,
|
||||||
|
end_position,
|
||||||
|
unique_id,
|
||||||
|
is_impossible,
|
||||||
|
sens=None):
|
||||||
|
"""BertSquad"""
|
||||||
|
weights = self.weights
|
||||||
|
init = self.alloc_status()
|
||||||
|
loss = self.network(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
start_position,
|
||||||
|
end_position,
|
||||||
|
unique_id,
|
||||||
|
is_impossible)
|
||||||
|
if sens is None:
|
||||||
|
scaling_sens = self.loss_scale
|
||||||
|
else:
|
||||||
|
scaling_sens = sens
|
||||||
|
init = F.depend(init, loss)
|
||||||
|
clear_status = self.clear_status(init)
|
||||||
|
scaling_sens = F.depend(scaling_sens, clear_status)
|
||||||
|
grads = self.grad(self.network, weights)(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
start_position,
|
||||||
|
end_position,
|
||||||
|
unique_id,
|
||||||
|
is_impossible,
|
||||||
|
self.cast(scaling_sens,
|
||||||
|
mstype.float32))
|
||||||
|
grads = self.hyper_map(F.partial(grad_scale, scaling_sens), grads)
|
||||||
|
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
|
||||||
|
if self.reducer_flag:
|
||||||
|
grads = self.grad_reducer(grads)
|
||||||
|
init = F.depend(init, grads)
|
||||||
|
get_status = self.get_status(init)
|
||||||
|
init = F.depend(init, get_status)
|
||||||
|
flag_sum = self.reduce_sum(init, (0,))
|
||||||
|
if self.is_distributed:
|
||||||
|
flag_reduce = self.allreduce(flag_sum)
|
||||||
|
cond = self.less_equal(self.base, flag_reduce)
|
||||||
|
else:
|
||||||
|
cond = self.less_equal(self.base, flag_sum)
|
||||||
|
overflow = cond
|
||||||
|
if sens is None:
|
||||||
|
overflow = self.loss_scaling_manager(self.loss_scale, cond)
|
||||||
|
if overflow:
|
||||||
|
succ = False
|
||||||
|
else:
|
||||||
|
succ = self.optimizer(grads)
|
||||||
|
ret = (loss, cond)
|
||||||
|
return F.depend(ret, succ)
|
||||||
|
|
||||||
|
class BertCLS(nn.Cell):
|
||||||
|
"""
|
||||||
|
Train interface for classification finetuning task.
|
||||||
|
"""
|
||||||
|
def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0, use_one_hot_embeddings=False,
|
||||||
|
assessment_method=""):
|
||||||
|
super(BertCLS, self).__init__()
|
||||||
|
self.bert = BertCLSModel(config, is_training, num_labels, dropout_prob, use_one_hot_embeddings,
|
||||||
|
assessment_method)
|
||||||
|
self.loss = CrossEntropyCalculation(is_training)
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.assessment_method = assessment_method
|
||||||
|
self.is_training = is_training
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id, label_ids):
|
||||||
|
logits = self.bert(input_ids, input_mask, token_type_id)
|
||||||
|
if self.assessment_method == "spearman_correlation":
|
||||||
|
if self.is_training:
|
||||||
|
loss = self.loss(logits, label_ids)
|
||||||
|
else:
|
||||||
|
loss = logits
|
||||||
|
else:
|
||||||
|
loss = self.loss(logits, label_ids, self.num_labels)
|
||||||
|
return loss
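# Hypothetical usage sketch (added, not part of the original file):
#
#     net = BertCLS(bert_net_cfg, is_training=True, num_labels=26, dropout_prob=0.1,
#                   assessment_method="accuracy")
#     loss = net(input_ids, input_mask, token_type_id, label_ids)
#
# num_labels=26 is only an example value; the real number of classes comes
# from the task configuration.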
|
||||||
|
|
||||||
|
|
||||||
|
class BertNER(nn.Cell):
|
||||||
|
"""
|
||||||
|
Train interface for sequence labeling finetuning task.
|
||||||
|
"""
|
||||||
|
def __init__(self, config, batch_size, is_training, num_labels=11, use_crf=False,
|
||||||
|
tag_to_index=None, dropout_prob=0.0, use_one_hot_embeddings=False):
|
||||||
|
super(BertNER, self).__init__()
|
||||||
|
self.bert = BertNERModel(config, is_training, num_labels, use_crf, dropout_prob, use_one_hot_embeddings)
|
||||||
|
if use_crf:
|
||||||
|
if not tag_to_index:
|
||||||
|
raise Exception("The dict for tag-index mapping should be provided for CRF.")
|
||||||
|
from src.CRF import CRF
|
||||||
|
self.loss = CRF(tag_to_index, batch_size, config.seq_length, is_training)
|
||||||
|
else:
|
||||||
|
self.loss = CrossEntropyCalculation(is_training)
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.use_crf = use_crf
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id, label_ids):
|
||||||
|
logits = self.bert(input_ids, input_mask, token_type_id)
|
||||||
|
if self.use_crf:
|
||||||
|
loss = self.loss(logits, label_ids)
|
||||||
|
else:
|
||||||
|
loss = self.loss(logits, label_ids, self.num_labels)
|
||||||
|
return loss
|
||||||
|
|
||||||
|
class BertSquad(nn.Cell):
|
||||||
|
'''
|
||||||
|
Train interface for SQuAD finetuning task.
|
||||||
|
'''
|
||||||
|
def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0, use_one_hot_embeddings=False):
|
||||||
|
super(BertSquad, self).__init__()
|
||||||
|
self.bert = BertSquadModel(config, is_training, num_labels, dropout_prob, use_one_hot_embeddings)
|
||||||
|
self.loss = CrossEntropyCalculation(is_training)
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.seq_length = config.seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.total_num = Parameter(Tensor([0], mstype.float32))
|
||||||
|
self.start_num = Parameter(Tensor([0], mstype.float32))
|
||||||
|
self.end_num = Parameter(Tensor([0], mstype.float32))
|
||||||
|
self.sum = P.ReduceSum()
|
||||||
|
self.equal = P.Equal()
|
||||||
|
self.argmax = P.ArgMaxWithValue(axis=1)
|
||||||
|
self.squeeze = P.Squeeze(axis=-1)
|
||||||
|
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id, start_position, end_position, unique_id, is_impossible):
|
||||||
|
"""interface for SQuAD finetuning task"""
|
||||||
|
logits = self.bert(input_ids, input_mask, token_type_id)
|
||||||
|
if self.is_training:
|
||||||
|
unstacked_logits_0 = self.squeeze(logits[:, :, 0:1])
|
||||||
|
unstacked_logits_1 = self.squeeze(logits[:, :, 1:2])
|
||||||
|
start_loss = self.loss(unstacked_logits_0, start_position, self.seq_length)
|
||||||
|
end_loss = self.loss(unstacked_logits_1, end_position, self.seq_length)
|
||||||
|
total_loss = (start_loss + end_loss) / 2.0
|
||||||
|
else:
|
||||||
|
start_logits = self.squeeze(logits[:, :, 0:1])
|
||||||
|
start_logits = start_logits + 100 * input_mask
|
||||||
|
end_logits = self.squeeze(logits[:, :, 1:2])
|
||||||
|
end_logits = end_logits + 100 * input_mask
|
||||||
|
total_loss = (unique_id, start_logits, end_logits)
|
||||||
|
return total_loss
@@ -0,0 +1,807 @@
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
"""Bert for pretraining."""
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
import mindspore.nn as nn
|
||||||
|
from mindspore.common.initializer import initializer, TruncatedNormal
|
||||||
|
from mindspore.ops import operations as P
|
||||||
|
from mindspore.ops import functional as F
|
||||||
|
from mindspore.ops import composite as C
|
||||||
|
from mindspore.common.tensor import Tensor
|
||||||
|
from mindspore.common.parameter import Parameter
|
||||||
|
from mindspore.common import dtype as mstype
|
||||||
|
from mindspore.nn.wrap.grad_reducer import DistributedGradReducer
|
||||||
|
from mindspore.context import ParallelMode
|
||||||
|
from mindspore.communication.management import get_group_size
|
||||||
|
from mindspore import context
|
||||||
|
from .bert_model import BertModel
|
||||||
|
|
||||||
|
GRADIENT_CLIP_TYPE = 1
|
||||||
|
GRADIENT_CLIP_VALUE = 1.0
|
||||||
|
|
||||||
|
clip_grad = C.MultitypeFuncGraph("clip_grad")
|
||||||
|
|
||||||
|
|
||||||
|
@clip_grad.register("Number", "Number", "Tensor")
|
||||||
|
def _clip_grad(clip_type, clip_value, grad):
|
||||||
|
"""
|
||||||
|
Clip gradients.
|
||||||
|
|
||||||
|
Inputs:
|
||||||
|
clip_type (int): The way to clip, 0 for 'value', 1 for 'norm'.
|
||||||
|
clip_value (float): Specifies how much to clip.
|
||||||
|
grad (tuple[Tensor]): Gradients.
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
tuple[Tensor], clipped gradients.
|
||||||
|
"""
|
||||||
|
if clip_type not in (0, 1):
|
||||||
|
return grad
|
||||||
|
dt = F.dtype(grad)
|
||||||
|
if clip_type == 0:
|
||||||
|
new_grad = C.clip_by_value(grad, F.cast(F.tuple_to_array((-clip_value,)), dt),
|
||||||
|
F.cast(F.tuple_to_array((clip_value,)), dt))
|
||||||
|
else:
|
||||||
|
new_grad = nn.ClipByNorm()(grad, F.cast(F.tuple_to_array((clip_value,)), dt))
|
||||||
|
return new_grad
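# Illustrative numbers (added, not part of the original file) for the two clip
# modes handled by _clip_grad above, applied conceptually to a single gradient
# tensor g = [3.0, -4.0] with clip_value = 1.0:
#
#     clip_type == 0 (value clip): g -> [ 1.0, -1.0]
#     clip_type == 1 (norm clip):  g -> [ 0.6, -0.8]   (L2 norm 5.0 rescaled to 1.0)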
|
||||||
|
|
||||||
|
|
||||||
|
class GetMaskedLMOutput(nn.Cell):
|
||||||
|
"""
|
||||||
|
Get masked lm output.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config (BertConfig): The config of BertModel.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, masked lm output.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(GetMaskedLMOutput, self).__init__()
|
||||||
|
self.width = config.hidden_size
|
||||||
|
self.reshape = P.Reshape()
|
||||||
|
self.gather = P.Gather()
|
||||||
|
|
||||||
|
weight_init = TruncatedNormal(config.initializer_range)
|
||||||
|
self.dense = nn.Dense(self.width,
|
||||||
|
config.hidden_size,
|
||||||
|
weight_init=weight_init,
|
||||||
|
activation=config.hidden_act).to_float(config.compute_type)
|
||||||
|
self.layernorm = nn.LayerNorm((config.hidden_size,)).to_float(config.compute_type)
|
||||||
|
self.output_bias = Parameter(
|
||||||
|
initializer(
|
||||||
|
'zero',
|
||||||
|
config.vocab_size))
|
||||||
|
self.matmul = P.MatMul(transpose_b=True)
|
||||||
|
self.log_softmax = nn.LogSoftmax(axis=-1)
|
||||||
|
self.shape_flat_offsets = (-1, 1)
|
||||||
|
self.last_idx = (-1,)
|
||||||
|
self.shape_flat_sequence_tensor = (-1, self.width)
|
||||||
|
self.seq_length_tensor = Tensor(np.array((config.seq_length,)).astype(np.int32))
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.compute_type = config.compute_type
|
||||||
|
self.dtype = config.dtype
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_tensor,
|
||||||
|
output_weights,
|
||||||
|
positions):
|
||||||
|
"""Get output log_probs"""
|
||||||
|
rng = F.tuple_to_array(F.make_range(P.Shape()(input_tensor)[0]))
|
||||||
|
flat_offsets = self.reshape(rng * self.seq_length_tensor, self.shape_flat_offsets)
|
||||||
|
flat_position = self.reshape(positions + flat_offsets, self.last_idx)
|
||||||
|
flat_sequence_tensor = self.reshape(input_tensor, self.shape_flat_sequence_tensor)
|
||||||
|
input_tensor = self.gather(flat_sequence_tensor, flat_position, 0)
|
||||||
|
input_tensor = self.cast(input_tensor, self.compute_type)
|
||||||
|
output_weights = self.cast(output_weights, self.compute_type)
|
||||||
|
input_tensor = self.dense(input_tensor)
|
||||||
|
input_tensor = self.layernorm(input_tensor)
|
||||||
|
logits = self.matmul(input_tensor, output_weights)
|
||||||
|
logits = self.cast(logits, self.dtype)
|
||||||
|
logits = logits + self.output_bias
|
||||||
|
log_probs = self.log_softmax(logits)
|
||||||
|
return log_probs
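# Worked example (added, not part of the original file) of the position
# flattening in construct above: with batch_size=2, seq_length=128 and
# positions=[[3, 7], [5, 9]], flat_offsets is [[0], [128]] and flat_position
# becomes [3, 7, 133, 137], i.e. row-major indices into the flattened
# [batch * seq, hidden] sequence tensor.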
|
||||||
|
|
||||||
|
|
||||||
|
class GetNextSentenceOutput(nn.Cell):
|
||||||
|
"""
|
||||||
|
Get next sentence output.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config (BertConfig): The config of Bert.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, next sentence output.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(GetNextSentenceOutput, self).__init__()
|
||||||
|
self.log_softmax = P.LogSoftmax()
|
||||||
|
weight_init = TruncatedNormal(config.initializer_range)
|
||||||
|
self.dense = nn.Dense(config.hidden_size, 2,
|
||||||
|
weight_init=weight_init, has_bias=True).to_float(config.compute_type)
|
||||||
|
self.dtype = config.dtype
|
||||||
|
self.cast = P.Cast()
|
||||||
|
|
||||||
|
def construct(self, input_tensor):
|
||||||
|
logits = self.dense(input_tensor)
|
||||||
|
logits = self.cast(logits, self.dtype)
|
||||||
|
log_prob = self.log_softmax(logits)
|
||||||
|
return log_prob
|
||||||
|
|
||||||
|
|
||||||
|
class BertPreTraining(nn.Cell):
|
||||||
|
"""
|
||||||
|
Bert pretraining network.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config (BertConfig): The config of BertModel.
|
||||||
|
is_training (bool): Specifies whether to use the training mode.
|
||||||
|
use_one_hot_embeddings (bool): Specifies whether to use one-hot for embeddings.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, prediction_scores, seq_relationship_score.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, is_training, use_one_hot_embeddings):
|
||||||
|
super(BertPreTraining, self).__init__()
|
||||||
|
self.bert = BertModel(config, is_training, use_one_hot_embeddings)
|
||||||
|
self.cls1 = GetMaskedLMOutput(config)
|
||||||
|
self.cls2 = GetNextSentenceOutput(config)
|
||||||
|
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id,
|
||||||
|
masked_lm_positions):
|
||||||
|
sequence_output, pooled_output, embedding_table = \
|
||||||
|
self.bert(input_ids, token_type_id, input_mask)
|
||||||
|
prediction_scores = self.cls1(sequence_output,
|
||||||
|
embedding_table,
|
||||||
|
masked_lm_positions)
|
||||||
|
seq_relationship_score = self.cls2(pooled_output)
|
||||||
|
return prediction_scores, seq_relationship_score
|
||||||
|
|
||||||
|
|
||||||
|
class BertPretrainingLoss(nn.Cell):
|
||||||
|
"""
|
||||||
|
Provide bert pre-training loss.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config (BertConfig): The config of BertModel.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, total loss.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super(BertPretrainingLoss, self).__init__()
|
||||||
|
self.vocab_size = config.vocab_size
|
||||||
|
self.onehot = P.OneHot()
|
||||||
|
self.on_value = Tensor(1.0, mstype.float32)
|
||||||
|
self.off_value = Tensor(0.0, mstype.float32)
|
||||||
|
self.reduce_sum = P.ReduceSum()
|
||||||
|
self.reduce_mean = P.ReduceMean()
|
||||||
|
self.reshape = P.Reshape()
|
||||||
|
self.last_idx = (-1,)
|
||||||
|
self.neg = P.Neg()
|
||||||
|
self.cast = P.Cast()
|
||||||
|
|
||||||
|
def construct(self, prediction_scores, seq_relationship_score, masked_lm_ids,
|
||||||
|
masked_lm_weights, next_sentence_labels):
|
||||||
|
"""Defines the computation performed."""
|
||||||
|
label_ids = self.reshape(masked_lm_ids, self.last_idx)
|
||||||
|
label_weights = self.cast(self.reshape(masked_lm_weights, self.last_idx), mstype.float32)
|
||||||
|
one_hot_labels = self.onehot(label_ids, self.vocab_size, self.on_value, self.off_value)
|
||||||
|
|
||||||
|
per_example_loss = self.neg(self.reduce_sum(prediction_scores * one_hot_labels, self.last_idx))
|
||||||
|
numerator = self.reduce_sum(label_weights * per_example_loss, ())
|
||||||
|
denominator = self.reduce_sum(label_weights, ()) + self.cast(F.tuple_to_array((1e-5,)), mstype.float32)
|
||||||
|
masked_lm_loss = numerator / denominator
|
||||||
|
|
||||||
|
# next_sentence_loss
|
||||||
|
labels = self.reshape(next_sentence_labels, self.last_idx)
|
||||||
|
one_hot_labels = self.onehot(labels, 2, self.on_value, self.off_value)
|
||||||
|
per_example_loss = self.neg(self.reduce_sum(
|
||||||
|
one_hot_labels * seq_relationship_score, self.last_idx))
|
||||||
|
next_sentence_loss = self.reduce_mean(per_example_loss, self.last_idx)
|
||||||
|
|
||||||
|
# total_loss
|
||||||
|
total_loss = masked_lm_loss + next_sentence_loss
|
||||||
|
|
||||||
|
return total_loss
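# Note (added): masked_lm_loss above is a weighted average over masked
# positions -- label_weights is 1.0 for real masked tokens and 0.0 for padded
# ones -- and the 1e-5 term only guards against division by zero when a batch
# happens to contain no masked tokens.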
|
||||||
|
|
||||||
|
|
||||||
|
class BertNetworkWithLoss(nn.Cell):
|
||||||
|
"""
|
||||||
|
Provide bert pre-training loss through network.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config (BertConfig): The config of BertModel.
|
||||||
|
is_training (bool): Specifies whether to use the training mode.
|
||||||
|
use_one_hot_embeddings (bool): Specifies whether to use one-hot for embeddings. Default: False.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tensor, the loss of the network.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config, is_training, use_one_hot_embeddings=False):
|
||||||
|
super(BertNetworkWithLoss, self).__init__()
|
||||||
|
self.bert = BertPreTraining(config, is_training, use_one_hot_embeddings)
|
||||||
|
self.loss = BertPretrainingLoss(config)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights):
|
||||||
|
"""Get pre-training loss"""
|
||||||
|
prediction_scores, seq_relationship_score = \
|
||||||
|
self.bert(input_ids, input_mask, token_type_id, masked_lm_positions)
|
||||||
|
total_loss = self.loss(prediction_scores, seq_relationship_score,
|
||||||
|
masked_lm_ids, masked_lm_weights, next_sentence_labels)
|
||||||
|
return self.cast(total_loss, mstype.float32)
|
||||||
|
|
||||||
|
|
||||||
|
class BertTrainOneStepCell(nn.TrainOneStepCell):
|
||||||
|
"""
|
||||||
|
Encapsulation class of bert network training.
|
||||||
|
|
||||||
|
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
network (Cell): The training network. Note that loss function should have been added.
|
||||||
|
optimizer (Optimizer): Optimizer for updating the weights.
|
||||||
|
sens (Number): The adjust parameter. Default: 1.0.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, network, optimizer, sens=1.0):
|
||||||
|
super(BertTrainOneStepCell, self).__init__(network, optimizer, sens)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.hyper_map = C.HyperMap()
|
||||||
|
|
||||||
|
def set_sens(self, value):
|
||||||
|
self.sens = value
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights):
|
||||||
|
"""Defines the computation performed."""
|
||||||
|
weights = self.weights
|
||||||
|
|
||||||
|
loss = self.network(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights)
|
||||||
|
grads = self.grad(self.network, weights)(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights,
|
||||||
|
self.cast(F.tuple_to_array((self.sens,)),
|
||||||
|
mstype.float32))
|
||||||
|
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
|
||||||
|
grads = self.grad_reducer(grads)
|
||||||
|
succ = self.optimizer(grads)
|
||||||
|
return F.depend(loss, succ)
|
||||||
|
|
||||||
|
|
||||||
|
grad_scale = C.MultitypeFuncGraph("grad_scale")
|
||||||
|
reciprocal = P.Reciprocal()
|
||||||
|
|
||||||
|
|
||||||
|
@grad_scale.register("Tensor", "Tensor")
|
||||||
|
def tensor_grad_scale(scale, grad):
|
||||||
|
return grad * reciprocal(scale)
|
||||||
|
|
||||||
|
|
||||||
|
_grad_overflow = C.MultitypeFuncGraph("_grad_overflow")
|
||||||
|
grad_overflow = P.FloatStatus()
|
||||||
|
|
||||||
|
|
||||||
|
@_grad_overflow.register("Tensor")
|
||||||
|
def _tensor_grad_overflow(grad):
|
||||||
|
return grad_overflow(grad)
|
||||||
|
|
||||||
|
|
||||||
|
class BertTrainOneStepWithLossScaleCell(nn.TrainOneStepWithLossScaleCell):
|
||||||
|
"""
|
||||||
|
Encapsulation class of bert network training.
|
||||||
|
|
||||||
|
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
network (Cell): The training network. Note that loss function should have been added.
|
||||||
|
optimizer (Optimizer): Optimizer for updating the weights.
|
||||||
|
scale_update_cell (Cell): Cell to do the loss scale. Default: None.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, network, optimizer, scale_update_cell=None):
|
||||||
|
super(BertTrainOneStepWithLossScaleCell, self).__init__(network, optimizer, scale_update_cell)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.degree = 1
|
||||||
|
if self.reducer_flag:
|
||||||
|
self.degree = get_group_size()
|
||||||
|
self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
|
||||||
|
|
||||||
|
self.loss_scale = None
|
||||||
|
self.loss_scaling_manager = scale_update_cell
|
||||||
|
if scale_update_cell:
|
||||||
|
self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights,
|
||||||
|
sens=None):
|
||||||
|
"""Defines the computation performed."""
|
||||||
|
weights = self.weights
|
||||||
|
loss = self.network(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights)
|
||||||
|
if sens is None:
|
||||||
|
scaling_sens = self.loss_scale
|
||||||
|
else:
|
||||||
|
scaling_sens = sens
|
||||||
|
status, scaling_sens = self.start_overflow_check(loss, scaling_sens)
|
||||||
|
grads = self.grad(self.network, weights)(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights,
|
||||||
|
self.cast(scaling_sens,
|
||||||
|
mstype.float32))
|
||||||
|
# apply grad reducer on grads
|
||||||
|
grads = self.grad_reducer(grads)
|
||||||
|
grads = self.hyper_map(F.partial(grad_scale, scaling_sens * self.degree), grads)
|
||||||
|
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
|
||||||
|
|
||||||
|
cond = self.get_overflow_status(status, grads)
|
||||||
|
overflow = cond
|
||||||
|
if sens is None:
|
||||||
|
overflow = self.loss_scaling_manager(self.loss_scale, cond)
|
||||||
|
if overflow:
|
||||||
|
succ = False
|
||||||
|
else:
|
||||||
|
succ = self.optimizer(grads)
|
||||||
|
ret = (loss, cond, scaling_sens)
|
||||||
|
return F.depend(ret, succ)
|
||||||
|
|
||||||
|
|
||||||
|
class BertTrainOneStepWithLossScaleCellForAdam(nn.TrainOneStepWithLossScaleCell):
|
||||||
|
"""
|
||||||
|
Encapsulation class of bert network training.
|
||||||
|
|
||||||
|
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
|
||||||
|
Different from BertTrainOneStepWithLossScaleCell, the optimizer takes the overflow
|
||||||
|
condition as input.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
network (Cell): The training network. Note that loss function should have been added.
|
||||||
|
optimizer (Optimizer): Optimizer for updating the weights.
|
||||||
|
scale_update_cell (Cell): Cell to do the loss scale. Default: None.
|
||||||
|
"""
|
||||||
|
def __init__(self, network, optimizer, scale_update_cell=None):
|
||||||
|
super(BertTrainOneStepWithLossScaleCellForAdam, self).__init__(network, optimizer, scale_update_cell)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.degree = 1
|
||||||
|
if self.reducer_flag:
|
||||||
|
self.degree = get_group_size()
|
||||||
|
self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
|
||||||
|
self.loss_scale = None
|
||||||
|
self.loss_scaling_manager = scale_update_cell
|
||||||
|
if scale_update_cell:
|
||||||
|
self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))
|
||||||
|
|
||||||
|
def construct(self,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights,
|
||||||
|
sens=None):
|
||||||
|
"""Defines the computation performed."""
|
||||||
|
weights = self.weights
|
||||||
|
loss = self.network(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights)
|
||||||
|
if sens is None:
|
||||||
|
scaling_sens = self.loss_scale
|
||||||
|
else:
|
||||||
|
scaling_sens = sens
|
||||||
|
|
||||||
|
status, scaling_sens = self.start_overflow_check(loss, scaling_sens)
|
||||||
|
grads = self.grad(self.network, weights)(input_ids,
|
||||||
|
input_mask,
|
||||||
|
token_type_id,
|
||||||
|
next_sentence_labels,
|
||||||
|
masked_lm_positions,
|
||||||
|
masked_lm_ids,
|
||||||
|
masked_lm_weights,
|
||||||
|
self.cast(scaling_sens,
|
||||||
|
mstype.float32))
|
||||||
|
# apply grad reducer on grads
|
||||||
|
grads = self.grad_reducer(grads)
|
||||||
|
grads = self.hyper_map(F.partial(grad_scale, scaling_sens * self.degree), grads)
|
||||||
|
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
|
||||||
|
cond = self.get_overflow_status(status, grads)
|
||||||
|
overflow = cond
|
||||||
|
if self.loss_scaling_manager is not None:
|
||||||
|
overflow = self.loss_scaling_manager(scaling_sens, cond)
|
||||||
|
succ = self.optimizer(grads, overflow)
|
||||||
|
ret = (loss, cond, scaling_sens)
|
||||||
|
return F.depend(ret, succ)
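# Note (added): unlike BertTrainOneStepWithLossScaleCell, the cell above always
# calls self.optimizer(grads, overflow) and leaves the decision of whether to
# skip the update to the optimizer itself (e.g. AdamWeightDecayForBert, whose
# construct takes the overflow flag as its second input).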
|
||||||
|
|
||||||
|
cast = P.Cast()
|
||||||
|
add_grads = C.MultitypeFuncGraph("add_grads")
|
||||||
|
|
||||||
|
|
||||||
|
@add_grads.register("Tensor", "Tensor")
|
||||||
|
def _add_grads(accu_grad, grad):
|
||||||
|
return accu_grad + cast(grad, mstype.float32)
|
||||||
|
|
||||||
|
update_accu_grads = C.MultitypeFuncGraph("update_accu_grads")
|
||||||
|
|
||||||
|
@update_accu_grads.register("Tensor", "Tensor")
|
||||||
|
def _update_accu_grads(accu_grad, grad):
|
||||||
|
succ = True
|
||||||
|
return F.depend(succ, F.assign(accu_grad, cast(grad, mstype.float32)))
|
||||||
|
|
||||||
|
accumulate_accu_grads = C.MultitypeFuncGraph("accumulate_accu_grads")
|
||||||
|
|
||||||
|
@accumulate_accu_grads.register("Tensor", "Tensor")
|
||||||
|
def _accumulate_accu_grads(accu_grad, grad):
|
||||||
|
succ = True
|
||||||
|
return F.depend(succ, F.assign_add(accu_grad, cast(grad, mstype.float32)))
|
||||||
|
|
||||||
|
|
||||||
|
zeroslike = P.ZerosLike()
|
||||||
|
reset_accu_grads = C.MultitypeFuncGraph("reset_accu_grads")
|
||||||
|
|
||||||
|
|
||||||
|
@reset_accu_grads.register("Tensor")
|
||||||
|
def _reset_accu_grads(accu_grad):
|
||||||
|
succ = True
|
||||||
|
return F.depend(succ, F.assign(accu_grad, zeroslike(accu_grad)))
|
||||||
|
|
||||||
|
|
||||||
|
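# Illustrative arithmetic (added, not part of the original file): with
# batch_size=32 and accumulation_steps=4, the cell below only runs the
# optimizer on every 4th sub-step, so the effective global batch size is
# 32 * 4 = 128, and the accumulated gradients are rescaled by
# scaling_sens * degree * accumulation_steps before clipping.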
class BertTrainAccumulationAllReducePostWithLossScaleCell(nn.Cell):
|
||||||
|
"""
|
||||||
|
Encapsulation class of bert network training.
|
||||||
|
|
||||||
|
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
|
||||||
|
|
||||||
|
To mimic higher batch size, gradients are accumulated N times before weight update.
|
||||||
|
|
||||||
|
For distribution mode, allreduce will only be applied in the weight update step,
i.e. the sub-step after gradients have been accumulated N times.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
network (Cell): The training network. Note that loss function should have been added.
|
||||||
|
optimizer (Optimizer): Optimizer for updating the weights.
|
||||||
|
scale_update_cell (Cell): Cell to do the loss scale. Default: None.
|
||||||
|
accumulation_steps (int): Number of accumulation steps before gradient update. The global batch size =
|
||||||
|
batch_size * accumulation_steps. Default: 1.
|
||||||
|
"""
|
||||||
|
|
||||||
|
    def __init__(self, network, optimizer, scale_update_cell=None, accumulation_steps=1, enable_global_norm=False):
        super(BertTrainAccumulationAllReducePostWithLossScaleCell, self).__init__(auto_prefix=False)
        self.network = network
        self.network.set_grad()
        self.weights = optimizer.parameters
        self.optimizer = optimizer
        self.accumulation_steps = accumulation_steps
        self.enable_global_norm = enable_global_norm
        self.one = Tensor(np.array([1]).astype(np.int32))
        self.zero = Tensor(np.array([0]).astype(np.int32))
        self.local_step = Parameter(initializer(0, [1], mstype.int32))
        self.accu_grads = self.weights.clone(prefix="accu_grads", init='zeros')
        self.accu_overflow = Parameter(initializer(0, [1], mstype.int32))
        self.accu_loss = Parameter(initializer(0, [1], mstype.float32))

        self.grad = C.GradOperation(get_by_list=True, sens_param=True)
        self.reducer_flag = False
        self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
        if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
            self.reducer_flag = True
        self.grad_reducer = F.identity
        self.degree = 1
        if self.reducer_flag:
            self.degree = get_group_size()
            self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
        self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
        self.overflow_reducer = F.identity
        if self.is_distributed:
            self.overflow_reducer = P.AllReduce()
        self.cast = P.Cast()
        self.alloc_status = P.NPUAllocFloatStatus()
        self.get_status = P.NPUGetFloatStatus()
        self.clear_status = P.NPUClearFloatStatus()
        self.reduce_sum = P.ReduceSum(keep_dims=False)
        self.base = Tensor(1, mstype.float32)
        self.less_equal = P.LessEqual()
        self.logical_or = P.LogicalOr()
        self.not_equal = P.NotEqual()
        self.select = P.Select()
        self.reshape = P.Reshape()
        self.hyper_map = C.HyperMap()
        self.loss_scale = None
        self.loss_scaling_manager = scale_update_cell
        if scale_update_cell:
            self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))

    def construct(self,
                  input_ids,
                  input_mask,
                  token_type_id,
                  next_sentence_labels,
                  masked_lm_positions,
                  masked_lm_ids,
                  masked_lm_weights,
                  sens=None):
        """Defines the computation performed."""
        weights = self.weights
        loss = self.network(input_ids,
                            input_mask,
                            token_type_id,
                            next_sentence_labels,
                            masked_lm_positions,
                            masked_lm_ids,
                            masked_lm_weights)
        if sens is None:
            scaling_sens = self.loss_scale
        else:
            scaling_sens = sens
        # alloc status and clear should be right before gradoperation
        init = self.alloc_status()
        init = F.depend(init, loss)
        clear_status = self.clear_status(init)
        scaling_sens = F.depend(scaling_sens, clear_status)
        # update accumulation parameters
        is_accu_step = self.not_equal(self.local_step, self.accumulation_steps)
        self.local_step = self.select(is_accu_step, self.local_step + self.one, self.one)
        self.accu_loss = self.select(is_accu_step, self.accu_loss + loss, loss)
        mean_loss = self.accu_loss / self.local_step
        is_accu_step = self.not_equal(self.local_step, self.accumulation_steps)

        grads = self.grad(self.network, weights)(input_ids,
                                                 input_mask,
                                                 token_type_id,
                                                 next_sentence_labels,
                                                 masked_lm_positions,
                                                 masked_lm_ids,
                                                 masked_lm_weights,
                                                 self.cast(scaling_sens,
                                                           mstype.float32))

        accu_succ = self.hyper_map(accumulate_accu_grads, self.accu_grads, grads)
        mean_loss = F.depend(mean_loss, accu_succ)

        init = F.depend(init, mean_loss)
        get_status = self.get_status(init)
        init = F.depend(init, get_status)
        flag_sum = self.reduce_sum(init, (0,))
        overflow = self.less_equal(self.base, flag_sum)
        overflow = self.logical_or(self.not_equal(self.accu_overflow, self.zero), overflow)
        accu_overflow = self.select(overflow, self.one, self.zero)
        self.accu_overflow = self.select(is_accu_step, accu_overflow, self.zero)

        if is_accu_step:
            succ = False
        else:
            # apply grad reducer on grads
            grads = self.grad_reducer(self.accu_grads)
            scaling = scaling_sens * self.degree * self.accumulation_steps
            grads = self.hyper_map(F.partial(grad_scale, scaling), grads)
            if self.enable_global_norm:
                grads = C.clip_by_global_norm(grads, 1.0, None)
            else:
                grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
            accu_overflow = F.depend(accu_overflow, grads)
            accu_overflow = self.overflow_reducer(accu_overflow)
            overflow = self.less_equal(self.base, accu_overflow)
            accu_succ = self.hyper_map(reset_accu_grads, self.accu_grads)
            overflow = F.depend(overflow, accu_succ)
            overflow = self.reshape(overflow, (()))
            if sens is None:
                overflow = self.loss_scaling_manager(self.loss_scale, overflow)
            if overflow:
                succ = False
            else:
                succ = self.optimizer(grads)

        ret = (mean_loss, overflow, scaling_sens)
        return F.depend(ret, succ)

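# Note (added for clarity, not part of the original source): with accumulation_steps=4,
# local_step cycles 1 -> 2 -> 3 -> 4 and the optimizer is only invoked on the step where
# local_step == accumulation_steps; mean_loss is the running average accu_loss / local_step
# over the current accumulation window, and the returned overflow flag is the logical OR
# of the float-status overflow of every sub-step in that window.
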
class BertTrainAccumulationAllReduceEachWithLossScaleCell(nn.Cell):
    """
    Encapsulation class of bert network training.

    Append an optimizer to the training network. After that, the construct
    function can be called to create the backward graph.

    To mimic a higher batch size, gradients are accumulated N times before each weight update.

    For distribution mode, allreduce is issued after each sub-step and its trailing time
    is hidden by a backend optimization pass.

    Args:
        network (Cell): The training network. Note that loss function should have been added.
        optimizer (Optimizer): Optimizer for updating the weights.
        scale_update_cell (Cell): Cell to do the loss scale. Default: None.
        accumulation_steps (int): Number of accumulation steps before gradient update. The global batch size =
            batch_size * accumulation_steps. Default: 1.
    """
    def __init__(self, network, optimizer, scale_update_cell=None, accumulation_steps=1, enable_global_norm=False):
        super(BertTrainAccumulationAllReduceEachWithLossScaleCell, self).__init__(auto_prefix=False)
        self.network = network
        self.network.set_grad()
        self.weights = optimizer.parameters
        self.optimizer = optimizer
        self.accumulation_steps = accumulation_steps
        self.enable_global_norm = enable_global_norm
        self.one = Tensor(np.array([1]).astype(np.int32))
        self.zero = Tensor(np.array([0]).astype(np.int32))
        self.local_step = Parameter(initializer(0, [1], mstype.int32))
        self.accu_grads = self.weights.clone(prefix="accu_grads", init='zeros')
        self.accu_overflow = Parameter(initializer(0, [1], mstype.int32))
        self.accu_loss = Parameter(initializer(0, [1], mstype.float32))

        self.grad = C.GradOperation(get_by_list=True, sens_param=True)
        self.reducer_flag = False
        self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
        if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
            self.reducer_flag = True
        self.grad_reducer = F.identity
        self.degree = 1
        if self.reducer_flag:
            self.degree = get_group_size()
            self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
        self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
        self.overflow_reducer = F.identity
        if self.is_distributed:
            self.overflow_reducer = P.AllReduce()
        self.cast = P.Cast()
        self.alloc_status = P.NPUAllocFloatStatus()
        self.get_status = P.NPUGetFloatStatus()
        self.clear_before_grad = P.NPUClearFloatStatus()
        self.reduce_sum = P.ReduceSum(keep_dims=False)
        self.base = Tensor(1, mstype.float32)
        self.less_equal = P.LessEqual()
        self.logical_or = P.LogicalOr()
        self.not_equal = P.NotEqual()
        self.select = P.Select()
        self.reshape = P.Reshape()
        self.hyper_map = C.HyperMap()
        self.loss_scale = None
        self.loss_scaling_manager = scale_update_cell
        if scale_update_cell:
            self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))

    @C.add_flags(has_effect=True)
    def construct(self,
                  input_ids,
                  input_mask,
                  token_type_id,
                  next_sentence_labels,
                  masked_lm_positions,
                  masked_lm_ids,
                  masked_lm_weights,
                  sens=None):
        """Defines the computation performed."""
        weights = self.weights
        loss = self.network(input_ids,
                            input_mask,
                            token_type_id,
                            next_sentence_labels,
                            masked_lm_positions,
                            masked_lm_ids,
                            masked_lm_weights)
        if sens is None:
            scaling_sens = self.loss_scale
        else:
            scaling_sens = sens

        # update accumulation parameters
        is_accu_step = self.not_equal(self.local_step, self.accumulation_steps)
        self.local_step = self.select(is_accu_step, self.local_step + self.one, self.one)
        self.accu_loss = self.select(is_accu_step, self.accu_loss + loss, loss)
        mean_loss = self.accu_loss / self.local_step
        is_accu_step = self.not_equal(self.local_step, self.accumulation_steps)

        # alloc status and clear should be right before gradoperation
        init = self.alloc_status()
        self.clear_before_grad(init)
        grads = self.grad(self.network, weights)(input_ids,
                                                 input_mask,
                                                 token_type_id,
                                                 next_sentence_labels,
                                                 masked_lm_positions,
                                                 masked_lm_ids,
                                                 masked_lm_weights,
                                                 self.cast(scaling_sens,
                                                           mstype.float32))

        accu_grads = self.hyper_map(add_grads, self.accu_grads, grads)
        scaling = scaling_sens * self.degree * self.accumulation_steps
        grads = self.hyper_map(F.partial(grad_scale, scaling), accu_grads)
        grads = self.grad_reducer(grads)

        self.get_status(init)
        flag_sum = self.reduce_sum(init, (0,))
        flag_reduce = self.overflow_reducer(flag_sum)
        overflow = self.less_equal(self.base, flag_reduce)
        overflow = self.logical_or(self.not_equal(self.accu_overflow, self.zero), overflow)
        accu_overflow = self.select(overflow, self.one, self.zero)
        self.accu_overflow = self.select(is_accu_step, accu_overflow, self.zero)
        overflow = self.reshape(overflow, (()))

        if is_accu_step:
            succ = False
            accu_succ = self.hyper_map(update_accu_grads, self.accu_grads, accu_grads)
            succ = F.depend(succ, accu_succ)
        else:
            if sens is None:
                overflow = self.loss_scaling_manager(self.loss_scale, overflow)
            if overflow:
                succ = False
            else:
                if self.enable_global_norm:
                    grads = C.clip_by_global_norm(grads, 1.0, None)
                else:
                    grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)

                succ = self.optimizer(grads)

            accu_succ = self.hyper_map(reset_accu_grads, self.accu_grads)
            succ = F.depend(succ, accu_succ)

        ret = (mean_loss, overflow, scaling_sens)
        return F.depend(ret, succ)

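# Illustrative sketch (not part of the original source): wrapping a pre-training network
# with the accumulation cell above. `BertNetworkWithLoss` and all hyper-parameter values
# are assumptions used only for this example; nn.Lamb and nn.DynamicLossScaleUpdateCell
# are standard MindSpore APIs.
#
#   net_with_loss = BertNetworkWithLoss(bert_net_cfg, True)
#   optimizer = nn.Lamb(net_with_loss.trainable_params(), learning_rate=3e-5)
#   update_cell = nn.DynamicLossScaleUpdateCell(loss_scale_value=65536,
#                                               scale_factor=2, scale_window=1000)
#   train_cell = BertTrainAccumulationAllReduceEachWithLossScaleCell(
#       net_with_loss, optimizer, scale_update_cell=update_cell, accumulation_steps=4)
#   # each call on a batch returns (mean_loss, overflow_flag, current_loss_scale)
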
@ -0,0 +1,881 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Bert model."""

import math
import copy
import numpy as np
import mindspore.common.dtype as mstype
import mindspore.nn as nn
import mindspore.ops.functional as F
from mindspore.common.initializer import TruncatedNormal, initializer
from mindspore.ops import operations as P
from mindspore.ops import composite as C
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter

class BertConfig:
    """
    Configuration for `BertModel`.

    Args:
        seq_length (int): Length of input sequence. Default: 128.
        vocab_size (int): Size of the vocabulary, i.e. the number of token embeddings. Default: 32000.
        hidden_size (int): Size of the bert encoder layers. Default: 768.
        num_hidden_layers (int): Number of hidden layers in the BertTransformer encoder
                           cell. Default: 12.
        num_attention_heads (int): Number of attention heads in the BertTransformer
                             encoder cell. Default: 12.
        intermediate_size (int): Size of intermediate layer in the BertTransformer
                           encoder cell. Default: 3072.
        hidden_act (str): Activation function used in the BertTransformer encoder
                    cell. Default: "gelu".
        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
        attention_probs_dropout_prob (float): The dropout probability for
                                      BertAttention. Default: 0.1.
        max_position_embeddings (int): Maximum length of sequences used in this
                                 model. Default: 512.
        type_vocab_size (int): Size of token type vocab. Default: 16.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
        dtype (:class:`mindspore.dtype`): Data type of the input. Default: mstype.float32.
        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
    """
    def __init__(self,
                 seq_length=128,
                 vocab_size=32000,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=16,
                 initializer_range=0.02,
                 use_relative_positions=False,
                 dtype=mstype.float32,
                 compute_type=mstype.float32):
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.use_relative_positions = use_relative_positions
        self.dtype = dtype
        self.compute_type = compute_type

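# Illustrative sketch (not from the original sources): a BERT-base style configuration.
# The values below are the standard BERT-base hyper-parameters; adjust as needed.
#   base_cfg = BertConfig(seq_length=128, vocab_size=30522, hidden_size=768,
#                         num_hidden_layers=12, num_attention_heads=12,
#                         intermediate_size=3072, compute_type=mstype.float16)
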
class EmbeddingLookup(nn.Cell):
    """
    An embedding lookup table with a fixed dictionary and size.

    Args:
        vocab_size (int): Size of the dictionary of embeddings.
        embedding_size (int): The size of each embedding vector.
        embedding_shape (list): [batch_size, seq_length, embedding_size], the shape of
                         each embedding vector.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
    """
    def __init__(self,
                 vocab_size,
                 embedding_size,
                 embedding_shape,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02):
        super(EmbeddingLookup, self).__init__()
        self.vocab_size = vocab_size
        self.use_one_hot_embeddings = use_one_hot_embeddings
        self.embedding_table = Parameter(initializer
                                         (TruncatedNormal(initializer_range),
                                          [vocab_size, embedding_size]))
        self.expand = P.ExpandDims()
        self.shape_flat = (-1,)
        self.gather = P.Gather()
        self.one_hot = P.OneHot()
        self.on_value = Tensor(1.0, mstype.float32)
        self.off_value = Tensor(0.0, mstype.float32)
        self.array_mul = P.MatMul()
        self.reshape = P.Reshape()
        self.shape = tuple(embedding_shape)

    def construct(self, input_ids):
        """Get output and embeddings lookup table"""
        extended_ids = self.expand(input_ids, -1)
        flat_ids = self.reshape(extended_ids, self.shape_flat)
        if self.use_one_hot_embeddings:
            one_hot_ids = self.one_hot(flat_ids, self.vocab_size, self.on_value, self.off_value)
            output_for_reshape = self.array_mul(
                one_hot_ids, self.embedding_table)
        else:
            output_for_reshape = self.gather(self.embedding_table, flat_ids, 0)
        output = self.reshape(output_for_reshape, self.shape)
        return output, self.embedding_table

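# Note (added for clarity): for input_ids of shape [batch_size, seq_length], the lookup
# returns an output of shape `embedding_shape` (e.g. [batch_size, seq_length, embedding_size])
# together with the full embedding table, which callers such as a masked-LM output head
# can reuse as a tied output projection.
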
class EmbeddingPostprocessor(nn.Cell):
    """
    Postprocessors apply positional and token type embeddings to word embeddings.

    Args:
        embedding_size (int): The size of each embedding vector.
        embedding_shape (list): [batch_size, seq_length, embedding_size], the shape of
                         each embedding vector.
        use_token_type (bool): Specifies whether to use token type embeddings. Default: False.
        token_type_vocab_size (int): Size of token type vocab. Default: 16.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        max_position_embeddings (int): Maximum length of sequences used in this
                                 model. Default: 512.
        dropout_prob (float): The dropout probability. Default: 0.1.
    """
    def __init__(self,
                 embedding_size,
                 embedding_shape,
                 use_relative_positions=False,
                 use_token_type=False,
                 token_type_vocab_size=16,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02,
                 max_position_embeddings=512,
                 dropout_prob=0.1):
        super(EmbeddingPostprocessor, self).__init__()
        self.use_token_type = use_token_type
        self.token_type_vocab_size = token_type_vocab_size
        self.use_one_hot_embeddings = use_one_hot_embeddings
        self.max_position_embeddings = max_position_embeddings
        self.embedding_table = Parameter(initializer
                                         (TruncatedNormal(initializer_range),
                                          [token_type_vocab_size,
                                           embedding_size]))

        self.shape_flat = (-1,)
        self.one_hot = P.OneHot()
        self.on_value = Tensor(1.0, mstype.float32)
        self.off_value = Tensor(0.1, mstype.float32)
        self.array_mul = P.MatMul()
        self.reshape = P.Reshape()
        self.shape = tuple(embedding_shape)
        self.layernorm = nn.LayerNorm((embedding_size,))
        self.dropout = nn.Dropout(1 - dropout_prob)
        self.gather = P.Gather()
        self.use_relative_positions = use_relative_positions
        self.slice = P.StridedSlice()
        self.full_position_embeddings = Parameter(initializer
                                                  (TruncatedNormal(initializer_range),
                                                   [max_position_embeddings,
                                                    embedding_size]))

    def construct(self, token_type_ids, word_embeddings):
        """Postprocessors apply positional and token type embeddings to word embeddings."""
        output = word_embeddings
        if self.use_token_type:
            flat_ids = self.reshape(token_type_ids, self.shape_flat)
            if self.use_one_hot_embeddings:
                one_hot_ids = self.one_hot(flat_ids,
                                           self.token_type_vocab_size, self.on_value, self.off_value)
                token_type_embeddings = self.array_mul(one_hot_ids,
                                                       self.embedding_table)
            else:
                token_type_embeddings = self.gather(self.embedding_table, flat_ids, 0)
            token_type_embeddings = self.reshape(token_type_embeddings, self.shape)
            output += token_type_embeddings
        if not self.use_relative_positions:
            _, seq, width = self.shape
            position_embeddings = self.slice(self.full_position_embeddings, (0, 0), (seq, width), (1, 1))
            position_embeddings = self.reshape(position_embeddings, (1, seq, width))
            output += position_embeddings
        output = self.layernorm(output)
        output = self.dropout(output)
        return output

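# Note (added for clarity): when relative positions are not used, the first `seq_length`
# rows of the learned [max_position_embeddings, embedding_size] table are sliced out,
# broadcast over the batch as [1, seq_length, embedding_size], and added to the token and
# token-type embeddings before LayerNorm and dropout.
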
class BertOutput(nn.Cell):
    """
    Apply a linear computation to hidden status and a residual computation to input.

    Args:
        in_channels (int): Input channels.
        out_channels (int): Output channels.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        dropout_prob (float): The dropout probability. Default: 0.1.
        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
    """
    def __init__(self,
                 in_channels,
                 out_channels,
                 initializer_range=0.02,
                 dropout_prob=0.1,
                 compute_type=mstype.float32):
        super(BertOutput, self).__init__()
        self.dense = nn.Dense(in_channels, out_channels,
                              weight_init=TruncatedNormal(initializer_range)).to_float(compute_type)
        self.dropout = nn.Dropout(1 - dropout_prob)
        self.dropout_prob = dropout_prob
        self.add = P.Add()
        self.layernorm = nn.LayerNorm((out_channels,)).to_float(compute_type)
        self.cast = P.Cast()

    def construct(self, hidden_status, input_tensor):
        output = self.dense(hidden_status)
        output = self.dropout(output)
        output = self.add(input_tensor, output)
        output = self.layernorm(output)
        return output

class RelaPosMatrixGenerator(nn.Cell):
    """
    Generates matrix of relative positions between inputs.

    Args:
        length (int): Length of one dim for the matrix to be generated.
        max_relative_position (int): Max value of relative position.
    """
    def __init__(self, length, max_relative_position):
        super(RelaPosMatrixGenerator, self).__init__()
        self._length = length
        self._max_relative_position = max_relative_position
        self._min_relative_position = -max_relative_position
        self.range_length = -length + 1

        self.tile = P.Tile()
        self.range_mat = P.Reshape()
        self.sub = P.Sub()
        self.expanddims = P.ExpandDims()
        self.cast = P.Cast()

    def construct(self):
        """Generates matrix of relative positions between inputs."""
        range_vec_row_out = self.cast(F.tuple_to_array(F.make_range(self._length)), mstype.int32)
        range_vec_col_out = self.range_mat(range_vec_row_out, (self._length, -1))
        tile_row_out = self.tile(range_vec_row_out, (self._length,))
        tile_col_out = self.tile(range_vec_col_out, (1, self._length))
        range_mat_out = self.range_mat(tile_row_out, (self._length, self._length))
        transpose_out = self.range_mat(tile_col_out, (self._length, self._length))
        distance_mat = self.sub(range_mat_out, transpose_out)

        distance_mat_clipped = C.clip_by_value(distance_mat,
                                               self._min_relative_position,
                                               self._max_relative_position)

        # Shift values to be >=0. Each integer still uniquely identifies a
        # relative position difference.
        final_mat = distance_mat_clipped + self._max_relative_position
        return final_mat

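# Worked example (added for clarity): with length=3 and max_relative_position=2, the raw
# distance matrix has entry (i, j) = j - i:
#     [[ 0,  1,  2],
#      [-1,  0,  1],
#      [-2, -1,  0]]
# After clipping to [-2, 2] and shifting by +2, every relative offset becomes a
# non-negative index into a vocabulary of 2 * max_relative_position + 1 = 5 embeddings:
#     [[2, 3, 4],
#      [1, 2, 3],
#      [0, 1, 2]]
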
class RelaPosEmbeddingsGenerator(nn.Cell):
    """
    Generates tensor of size [length, length, depth].

    Args:
        length (int): Length of one dim for the matrix to be generated.
        depth (int): Size of each attention head.
        max_relative_position (int): Maximum value of relative position.
        initializer_range (float): Initialization value of TruncatedNormal.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
    """
    def __init__(self,
                 length,
                 depth,
                 max_relative_position,
                 initializer_range,
                 use_one_hot_embeddings=False):
        super(RelaPosEmbeddingsGenerator, self).__init__()
        self.depth = depth
        self.vocab_size = max_relative_position * 2 + 1
        self.use_one_hot_embeddings = use_one_hot_embeddings

        self.embeddings_table = Parameter(
            initializer(TruncatedNormal(initializer_range),
                        [self.vocab_size, self.depth]))

        self.relative_positions_matrix = RelaPosMatrixGenerator(length=length,
                                                                max_relative_position=max_relative_position)
        self.reshape = P.Reshape()
        self.one_hot = nn.OneHot(depth=self.vocab_size)
        self.shape = P.Shape()
        self.gather = P.Gather()  # index_select
        self.matmul = P.BatchMatMul()

    def construct(self):
        """Generate embedding for each relative position of dimension depth."""
        relative_positions_matrix_out = self.relative_positions_matrix()

        if self.use_one_hot_embeddings:
            flat_relative_positions_matrix = self.reshape(relative_positions_matrix_out, (-1,))
            one_hot_relative_positions_matrix = self.one_hot(
                flat_relative_positions_matrix)
            embeddings = self.matmul(one_hot_relative_positions_matrix, self.embeddings_table)
            my_shape = self.shape(relative_positions_matrix_out) + (self.depth,)
            embeddings = self.reshape(embeddings, my_shape)
        else:
            embeddings = self.gather(self.embeddings_table,
                                     relative_positions_matrix_out, 0)
        return embeddings

class SaturateCast(nn.Cell):
    """
    Performs a safe saturating cast. The input is clamped to the representable range of the
    destination type before casting, so the cast cannot overflow or underflow.

    Args:
        src_type (:class:`mindspore.dtype`): The type of the elements of the input tensor. Default: mstype.float32.
        dst_type (:class:`mindspore.dtype`): The type of the elements of the output tensor. Default: mstype.float32.
    """
    def __init__(self, src_type=mstype.float32, dst_type=mstype.float32):
        super(SaturateCast, self).__init__()
        np_type = mstype.dtype_to_nptype(dst_type)

        self.tensor_min_type = float(np.finfo(np_type).min)
        self.tensor_max_type = float(np.finfo(np_type).max)

        self.min_op = P.Minimum()
        self.max_op = P.Maximum()
        self.cast = P.Cast()
        self.dst_type = dst_type

    def construct(self, x):
        out = self.max_op(x, self.tensor_min_type)
        out = self.min_op(out, self.tensor_max_type)
        return self.cast(out, self.dst_type)

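# Note (added for clarity): for dst_type=mstype.float16 the clamp range is roughly
# [-65504.0, 65504.0] (np.finfo(np.float16).min/max), so large float32 activations are
# saturated instead of becoming +/-inf when the model computes in float16.
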
class BertAttention(nn.Cell):
    """
    Apply multi-headed attention from "from_tensor" to "to_tensor".

    Args:
        from_tensor_width (int): Size of last dim of from_tensor.
        to_tensor_width (int): Size of last dim of to_tensor.
        from_seq_length (int): Length of from_tensor sequence.
        to_seq_length (int): Length of to_tensor sequence.
        num_attention_heads (int): Number of attention heads. Default: 1.
        size_per_head (int): Size of each attention head. Default: 512.
        query_act (str): Activation function for the query transform. Default: None.
        key_act (str): Activation function for the key transform. Default: None.
        value_act (str): Activation function for the value transform. Default: None.
        has_attention_mask (bool): Specifies whether to use attention mask. Default: False.
        attention_probs_dropout_prob (float): The dropout probability for
                                      BertAttention. Default: 0.0.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        do_return_2d_tensor (bool): True for return 2d tensor. False for return 3d
                             tensor. Default: False.
        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
        compute_type (:class:`mindspore.dtype`): Compute type in BertAttention. Default: mstype.float32.
    """
    def __init__(self,
                 from_tensor_width,
                 to_tensor_width,
                 from_seq_length,
                 to_seq_length,
                 num_attention_heads=1,
                 size_per_head=512,
                 query_act=None,
                 key_act=None,
                 value_act=None,
                 has_attention_mask=False,
                 attention_probs_dropout_prob=0.0,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02,
                 do_return_2d_tensor=False,
                 use_relative_positions=False,
                 compute_type=mstype.float32):

        super(BertAttention, self).__init__()
        self.from_seq_length = from_seq_length
        self.to_seq_length = to_seq_length
        self.num_attention_heads = num_attention_heads
        self.size_per_head = size_per_head
        self.has_attention_mask = has_attention_mask
        self.use_relative_positions = use_relative_positions

        self.scores_mul = 1.0 / math.sqrt(float(self.size_per_head))
        self.reshape = P.Reshape()
        self.shape_from_2d = (-1, from_tensor_width)
        self.shape_to_2d = (-1, to_tensor_width)
        weight = TruncatedNormal(initializer_range)
        units = num_attention_heads * size_per_head
        self.query_layer = nn.Dense(from_tensor_width,
                                    units,
                                    activation=query_act,
                                    weight_init=weight).to_float(compute_type)
        self.key_layer = nn.Dense(to_tensor_width,
                                  units,
                                  activation=key_act,
                                  weight_init=weight).to_float(compute_type)
        self.value_layer = nn.Dense(to_tensor_width,
                                    units,
                                    activation=value_act,
                                    weight_init=weight).to_float(compute_type)

        self.shape_from = (-1, from_seq_length, num_attention_heads, size_per_head)
        self.shape_to = (-1, to_seq_length, num_attention_heads, size_per_head)

        self.matmul_trans_b = P.BatchMatMul(transpose_b=True)
        self.multiply = P.Mul()
        self.transpose = P.Transpose()
        self.trans_shape = (0, 2, 1, 3)
        self.trans_shape_relative = (2, 0, 1, 3)
        self.trans_shape_position = (1, 2, 0, 3)
        self.multiply_data = -10000.0
        self.matmul = P.BatchMatMul()

        self.softmax = nn.Softmax()
        self.dropout = nn.Dropout(1 - attention_probs_dropout_prob)

        if self.has_attention_mask:
            self.expand_dims = P.ExpandDims()
            self.sub = P.Sub()
            self.add = P.Add()
            self.cast = P.Cast()
            self.get_dtype = P.DType()
        if do_return_2d_tensor:
            self.shape_return = (-1, num_attention_heads * size_per_head)
        else:
            self.shape_return = (-1, from_seq_length, num_attention_heads * size_per_head)

        self.cast_compute_type = SaturateCast(dst_type=compute_type)
        if self.use_relative_positions:
            self._generate_relative_positions_embeddings = \
                RelaPosEmbeddingsGenerator(length=to_seq_length,
                                           depth=size_per_head,
                                           max_relative_position=16,
                                           initializer_range=initializer_range,
                                           use_one_hot_embeddings=use_one_hot_embeddings)

    def construct(self, from_tensor, to_tensor, attention_mask):
        """reshape 2d/3d input tensors to 2d"""
        from_tensor_2d = self.reshape(from_tensor, self.shape_from_2d)
        to_tensor_2d = self.reshape(to_tensor, self.shape_to_2d)
        query_out = self.query_layer(from_tensor_2d)
        key_out = self.key_layer(to_tensor_2d)
        value_out = self.value_layer(to_tensor_2d)

        query_layer = self.reshape(query_out, self.shape_from)
        query_layer = self.transpose(query_layer, self.trans_shape)
        key_layer = self.reshape(key_out, self.shape_to)
        key_layer = self.transpose(key_layer, self.trans_shape)

        attention_scores = self.matmul_trans_b(query_layer, key_layer)

        # use_relative_position, supplementary logic
        if self.use_relative_positions:
            # relations_keys is [F|T, F|T, H]
            relations_keys = self._generate_relative_positions_embeddings()
            relations_keys = self.cast_compute_type(relations_keys)
            # query_layer_t is [F, B, N, H]
            query_layer_t = self.transpose(query_layer, self.trans_shape_relative)
            # query_layer_r is [F, B * N, H]
            query_layer_r = self.reshape(query_layer_t,
                                         (self.from_seq_length,
                                          -1,
                                          self.size_per_head))
            # key_position_scores is [F, B * N, F|T]
            key_position_scores = self.matmul_trans_b(query_layer_r,
                                                      relations_keys)
            # key_position_scores_r is [F, B, N, F|T]
            key_position_scores_r = self.reshape(key_position_scores,
                                                 (self.from_seq_length,
                                                  -1,
                                                  self.num_attention_heads,
                                                  self.from_seq_length))
            # key_position_scores_r_t is [B, N, F, F|T]
            key_position_scores_r_t = self.transpose(key_position_scores_r,
                                                     self.trans_shape_position)
            attention_scores = attention_scores + key_position_scores_r_t

        attention_scores = self.multiply(self.scores_mul, attention_scores)

        if self.has_attention_mask:
            attention_mask = self.expand_dims(attention_mask, 1)
            multiply_out = self.sub(self.cast(F.tuple_to_array((1.0,)), self.get_dtype(attention_scores)),
                                    self.cast(attention_mask, self.get_dtype(attention_scores)))

            adder = self.multiply(multiply_out, self.multiply_data)
            attention_scores = self.add(adder, attention_scores)

        attention_probs = self.softmax(attention_scores)
        attention_probs = self.dropout(attention_probs)

        value_layer = self.reshape(value_out, self.shape_to)
        value_layer = self.transpose(value_layer, self.trans_shape)
        context_layer = self.matmul(attention_probs, value_layer)

        # use_relative_position, supplementary logic
        if self.use_relative_positions:
            # relations_values is [F|T, F|T, H]
            relations_values = self._generate_relative_positions_embeddings()
            relations_values = self.cast_compute_type(relations_values)
            # attention_probs_t is [F, B, N, T]
            attention_probs_t = self.transpose(attention_probs, self.trans_shape_relative)
            # attention_probs_r is [F, B * N, T]
            attention_probs_r = self.reshape(
                attention_probs_t,
                (self.from_seq_length,
                 -1,
                 self.to_seq_length))
            # value_position_scores is [F, B * N, H]
            value_position_scores = self.matmul(attention_probs_r,
                                                relations_values)
            # value_position_scores_r is [F, B, N, H]
            value_position_scores_r = self.reshape(value_position_scores,
                                                   (self.from_seq_length,
                                                    -1,
                                                    self.num_attention_heads,
                                                    self.size_per_head))
            # value_position_scores_r_t is [B, N, F, H]
            value_position_scores_r_t = self.transpose(value_position_scores_r,
                                                       self.trans_shape_position)
            context_layer = context_layer + value_position_scores_r_t

        context_layer = self.transpose(context_layer, self.trans_shape)
        context_layer = self.reshape(context_layer, self.shape_return)

        return context_layer

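# Note (added for clarity): ignoring the relative-position terms, the cell above computes
# standard scaled dot-product attention,
#     scores  = (Q K^T) / sqrt(size_per_head) + (1 - attention_mask) * (-10000.0)
#     context = softmax(scores) V
# where the -10000.0 adder drives masked positions to near-zero probability after softmax.
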
class BertSelfAttention(nn.Cell):
    """
    Apply self-attention.

    Args:
        seq_length (int): Length of input sequence.
        hidden_size (int): Size of the bert encoder layers.
        num_attention_heads (int): Number of attention heads. Default: 12.
        attention_probs_dropout_prob (float): The dropout probability for
                                      BertAttention. Default: 0.1.
        use_one_hot_embeddings (bool): Specifies whether to use one_hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
        compute_type (:class:`mindspore.dtype`): Compute type in BertSelfAttention. Default: mstype.float32.
    """
    def __init__(self,
                 seq_length,
                 hidden_size,
                 num_attention_heads=12,
                 attention_probs_dropout_prob=0.1,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02,
                 hidden_dropout_prob=0.1,
                 use_relative_positions=False,
                 compute_type=mstype.float32):
        super(BertSelfAttention, self).__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError("The hidden size (%d) is not a multiple of the number "
                             "of attention heads (%d)" % (hidden_size, num_attention_heads))

        self.size_per_head = int(hidden_size / num_attention_heads)

        self.attention = BertAttention(
            from_tensor_width=hidden_size,
            to_tensor_width=hidden_size,
            from_seq_length=seq_length,
            to_seq_length=seq_length,
            num_attention_heads=num_attention_heads,
            size_per_head=self.size_per_head,
            attention_probs_dropout_prob=attention_probs_dropout_prob,
            use_one_hot_embeddings=use_one_hot_embeddings,
            initializer_range=initializer_range,
            use_relative_positions=use_relative_positions,
            has_attention_mask=True,
            do_return_2d_tensor=True,
            compute_type=compute_type)

        self.output = BertOutput(in_channels=hidden_size,
                                 out_channels=hidden_size,
                                 initializer_range=initializer_range,
                                 dropout_prob=hidden_dropout_prob,
                                 compute_type=compute_type)
        self.reshape = P.Reshape()
        self.shape = (-1, hidden_size)

    def construct(self, input_tensor, attention_mask):
        input_tensor = self.reshape(input_tensor, self.shape)
        attention_output = self.attention(input_tensor, input_tensor, attention_mask)
        output = self.output(attention_output, input_tensor)
        return output

class BertEncoderCell(nn.Cell):
    """
    Encoder cells used in BertTransformer.

    Args:
        hidden_size (int): Size of the bert encoder layers. Default: 768.
        seq_length (int): Length of input sequence. Default: 512.
        num_attention_heads (int): Number of attention heads. Default: 12.
        intermediate_size (int): Size of intermediate layer. Default: 3072.
        attention_probs_dropout_prob (float): The dropout probability for
                                      BertAttention. Default: 0.02.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
        hidden_act (str): Activation function. Default: "gelu".
        compute_type (:class:`mindspore.dtype`): Compute type in attention. Default: mstype.float32.
    """
    def __init__(self,
                 hidden_size=768,
                 seq_length=512,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 attention_probs_dropout_prob=0.02,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02,
                 hidden_dropout_prob=0.1,
                 use_relative_positions=False,
                 hidden_act="gelu",
                 compute_type=mstype.float32):
        super(BertEncoderCell, self).__init__()
        self.attention = BertSelfAttention(
            hidden_size=hidden_size,
            seq_length=seq_length,
            num_attention_heads=num_attention_heads,
            attention_probs_dropout_prob=attention_probs_dropout_prob,
            use_one_hot_embeddings=use_one_hot_embeddings,
            initializer_range=initializer_range,
            hidden_dropout_prob=hidden_dropout_prob,
            use_relative_positions=use_relative_positions,
            compute_type=compute_type)
        self.intermediate = nn.Dense(in_channels=hidden_size,
                                     out_channels=intermediate_size,
                                     activation=hidden_act,
                                     weight_init=TruncatedNormal(initializer_range)).to_float(compute_type)
        self.output = BertOutput(in_channels=intermediate_size,
                                 out_channels=hidden_size,
                                 initializer_range=initializer_range,
                                 dropout_prob=hidden_dropout_prob,
                                 compute_type=compute_type)

    def construct(self, hidden_states, attention_mask):
        # self-attention
        attention_output = self.attention(hidden_states, attention_mask)
        # feed-forward
        intermediate_output = self.intermediate(attention_output)
        # add and normalize
        output = self.output(intermediate_output, attention_output)
        return output

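# Note (added for clarity): each encoder cell is a standard post-LayerNorm Transformer block:
# multi-head self-attention with a residual connection and LayerNorm (BertSelfAttention +
# BertOutput), followed by a position-wise feed-forward layer (hidden_size -> intermediate_size
# -> hidden_size) with its own residual connection and LayerNorm.
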
class BertTransformer(nn.Cell):
    """
    Multi-layer bert transformer.

    Args:
        hidden_size (int): Size of the encoder layers.
        seq_length (int): Length of input sequence.
        num_hidden_layers (int): Number of hidden layers in encoder cells.
        num_attention_heads (int): Number of attention heads in encoder cells. Default: 12.
        intermediate_size (int): Size of intermediate layer in encoder cells. Default: 3072.
        attention_probs_dropout_prob (float): The dropout probability for
                                      BertAttention. Default: 0.1.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
        initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
        hidden_act (str): Activation function used in the encoder cells. Default: "gelu".
        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
        return_all_encoders (bool): Specifies whether to return all encoders. Default: False.
    """
    def __init__(self,
                 hidden_size,
                 seq_length,
                 num_hidden_layers,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 attention_probs_dropout_prob=0.1,
                 use_one_hot_embeddings=False,
                 initializer_range=0.02,
                 hidden_dropout_prob=0.1,
                 use_relative_positions=False,
                 hidden_act="gelu",
                 compute_type=mstype.float32,
                 return_all_encoders=False):
        super(BertTransformer, self).__init__()
        self.return_all_encoders = return_all_encoders

        layers = []
        for _ in range(num_hidden_layers):
            layer = BertEncoderCell(hidden_size=hidden_size,
                                    seq_length=seq_length,
                                    num_attention_heads=num_attention_heads,
                                    intermediate_size=intermediate_size,
                                    attention_probs_dropout_prob=attention_probs_dropout_prob,
                                    use_one_hot_embeddings=use_one_hot_embeddings,
                                    initializer_range=initializer_range,
                                    hidden_dropout_prob=hidden_dropout_prob,
                                    use_relative_positions=use_relative_positions,
                                    hidden_act=hidden_act,
                                    compute_type=compute_type)
            layers.append(layer)

        self.layers = nn.CellList(layers)

        self.reshape = P.Reshape()
        self.shape = (-1, hidden_size)
        self.out_shape = (-1, seq_length, hidden_size)

    def construct(self, input_tensor, attention_mask):
        """Multi-layer bert transformer."""
        prev_output = self.reshape(input_tensor, self.shape)

        all_encoder_layers = ()
        for layer_module in self.layers:
            layer_output = layer_module(prev_output, attention_mask)
            prev_output = layer_output

            if self.return_all_encoders:
                layer_output = self.reshape(layer_output, self.out_shape)
                all_encoder_layers = all_encoder_layers + (layer_output,)

        if not self.return_all_encoders:
            prev_output = self.reshape(prev_output, self.out_shape)
            all_encoder_layers = all_encoder_layers + (prev_output,)
        return all_encoder_layers

class CreateAttentionMaskFromInputMask(nn.Cell):
    """
    Create attention mask according to input mask.

    Args:
        config (Class): Configuration for BertModel.
    """
    def __init__(self, config):
        super(CreateAttentionMaskFromInputMask, self).__init__()
        self.input_mask = None
        self.cast = P.Cast()
        self.reshape = P.Reshape()
        self.shape = (-1, 1, config.seq_length)

    def construct(self, input_mask):
        attention_mask = self.cast(self.reshape(input_mask, self.shape), mstype.float32)
        return attention_mask

class BertModel(nn.Cell):
    """
    Bidirectional Encoder Representations from Transformers.

    Args:
        config (Class): Configuration for BertModel.
        is_training (bool): True for training mode. False for eval mode.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
    """
    def __init__(self,
                 config,
                 is_training,
                 use_one_hot_embeddings=False):
        super(BertModel, self).__init__()
        config = copy.deepcopy(config)
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0

        self.seq_length = config.seq_length
        self.hidden_size = config.hidden_size
        self.num_hidden_layers = config.num_hidden_layers
        self.embedding_size = config.hidden_size
        self.token_type_ids = None

        self.last_idx = self.num_hidden_layers - 1
        output_embedding_shape = [-1, self.seq_length, self.embedding_size]

        self.bert_embedding_lookup = EmbeddingLookup(
            vocab_size=config.vocab_size,
            embedding_size=self.embedding_size,
            embedding_shape=output_embedding_shape,
            use_one_hot_embeddings=use_one_hot_embeddings,
            initializer_range=config.initializer_range)

        self.bert_embedding_postprocessor = EmbeddingPostprocessor(
            embedding_size=self.embedding_size,
            embedding_shape=output_embedding_shape,
            use_relative_positions=config.use_relative_positions,
            use_token_type=True,
            token_type_vocab_size=config.type_vocab_size,
            use_one_hot_embeddings=use_one_hot_embeddings,
            initializer_range=0.02,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

        self.bert_encoder = BertTransformer(
            hidden_size=self.hidden_size,
            seq_length=self.seq_length,
            num_attention_heads=config.num_attention_heads,
            num_hidden_layers=self.num_hidden_layers,
            intermediate_size=config.intermediate_size,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            use_one_hot_embeddings=use_one_hot_embeddings,
            initializer_range=config.initializer_range,
            hidden_dropout_prob=config.hidden_dropout_prob,
            use_relative_positions=config.use_relative_positions,
            hidden_act=config.hidden_act,
            compute_type=config.compute_type,
            return_all_encoders=True)

        self.cast = P.Cast()
        self.dtype = config.dtype
        self.cast_compute_type = SaturateCast(dst_type=config.compute_type)
        self.slice = P.StridedSlice()

        self.squeeze_1 = P.Squeeze(axis=1)
        self.dense = nn.Dense(self.hidden_size, self.hidden_size,
                              activation="tanh",
                              weight_init=TruncatedNormal(config.initializer_range)).to_float(config.compute_type)
        self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config)

    def construct(self, input_ids, token_type_ids, input_mask):
        """Bidirectional Encoder Representations from Transformers."""
        # embedding
        word_embeddings, embedding_tables = self.bert_embedding_lookup(input_ids)
        embedding_output = self.bert_embedding_postprocessor(token_type_ids,
                                                             word_embeddings)

        # attention mask [batch_size, seq_length, seq_length]
        attention_mask = self._create_attention_mask_from_input_mask(input_mask)

        # bert encoder
        encoder_output = self.bert_encoder(self.cast_compute_type(embedding_output),
                                           attention_mask)

        sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)

        # pooler
        batch_size = P.Shape()(input_ids)[0]
        sequence_slice = self.slice(sequence_output,
                                    (0, 0, 0),
                                    (batch_size, 1, self.hidden_size),
                                    (1, 1, 1))
        first_token = self.squeeze_1(sequence_slice)
        pooled_output = self.dense(first_token)
        pooled_output = self.cast(pooled_output, self.dtype)

        return sequence_output, pooled_output, embedding_tables

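# Illustrative sketch (not part of the original file): running the backbone on a dummy batch.
# `bert_net_cfg` comes from the pre-training config below; the inputs are assumptions.
#   import numpy as np
#   from mindspore import Tensor, context
#   context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
#   model = BertModel(bert_net_cfg, is_training=False)
#   input_ids = Tensor(np.ones((2, bert_net_cfg.seq_length), np.int32))
#   token_type_ids = Tensor(np.zeros((2, bert_net_cfg.seq_length), np.int32))
#   input_mask = Tensor(np.ones((2, bert_net_cfg.seq_length), np.int32))
#   sequence_output, pooled_output, embedding_table = model(input_ids, token_type_ids, input_mask)
#   # sequence_output: [2, seq_length, hidden_size]; pooled_output: [2, hidden_size]
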
@ -0,0 +1,129 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
Network config setting, used in dataset.py and run_pretrain.py.
"""
from easydict import EasyDict as edict
import mindspore.common.dtype as mstype
from bert_model import BertConfig
cfg = edict({
    'batch_size': 32,
    'bert_network': 'base',
    'loss_scale_value': 65536,
    'scale_factor': 2,
    'scale_window': 1000,
    'optimizer': 'Lamb',
    'enable_global_norm': False,
    'AdamWeightDecay': edict({
        'learning_rate': 3e-5,
        'end_learning_rate': 0.0,
        'power': 5.0,
        'weight_decay': 1e-5,
        'decay_filter': lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
        'eps': 1e-6,
        'warmup_steps': 10000,
    }),
    'Lamb': edict({
        'learning_rate': 3e-5,
        'end_learning_rate': 0.0,
        'power': 5.0,
        'warmup_steps': 10000,
        'weight_decay': 0.01,
        'decay_filter': lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
        'eps': 1e-8,
    }),
    'Momentum': edict({
        'learning_rate': 2e-5,
        'momentum': 0.9,
    }),
    'Thor': edict({
        'lr_max': 0.0034,
        'lr_min': 3.244e-5,
        'lr_power': 1.0,
        'lr_total_steps': 30000,
        'damping_max': 5e-2,
        'damping_min': 1e-6,
        'damping_power': 1.0,
        'damping_total_steps': 30000,
        'momentum': 0.9,
        'weight_decay': 5e-4,
        'loss_scale': 1.0,
        'frequency': 100,
    }),
})

'''
Three network configurations are provided below:
base: Google BERT-base (the base version of the BERT model).
nezha: BERT-NEZHA (a Chinese pretrained language model developed by Huawei, which introduces
Functional Relative Positional Encoding as an effective positional encoding scheme).
large: BERT-large (the large version of the BERT model).
'''
if cfg.bert_network == 'base':
    cfg.batch_size = 64
    bert_net_cfg = BertConfig(
        seq_length=128,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        use_relative_positions=False,
        dtype=mstype.float32,
        compute_type=mstype.float16
    )
if cfg.bert_network == 'nezha':
    cfg.batch_size = 96
    bert_net_cfg = BertConfig(
        seq_length=128,
        vocab_size=21128,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        use_relative_positions=True,
        dtype=mstype.float32,
        compute_type=mstype.float16
    )
if cfg.bert_network == 'large':
    cfg.batch_size = 24
    bert_net_cfg = BertConfig(
        seq_length=512,
        vocab_size=30522,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        use_relative_positions=False,
        dtype=mstype.float32,
        compute_type=mstype.float16
    )

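# Note (added for clarity): only one branch above takes effect per run. With the default
# cfg.bert_network == 'base', pre-training uses batch_size 64 and a 12-layer, 768-hidden,
# 12-head model (roughly 110M parameters, the standard BERT-base size); computation runs
# in float16 (compute_type) while parameters and inputs stay in float32 (dtype).
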
@ -0,0 +1,115 @@
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
Data utils used in Bert finetune and evaluation.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
class Tuple():
|
||||||
|
"""
|
||||||
|
apply the functions to the corresponding input fields.
|
||||||
|
"""
|
||||||
|
def __init__(self, fn, *args):
|
||||||
|
if isinstance(fn, (list, tuple)):
|
||||||
|
assert args, 'Input pattern not understood. The input of Tuple can be ' \
|
||||||
|
'Tuple(A, B, C) or Tuple([A, B, C]) or Tuple((A, B, C)). ' \
|
||||||
|
'Received fn=%s, args=%s' % (str(fn), str(args))
|
||||||
|
self._fn = fn
|
||||||
|
else:
|
||||||
|
self._fn = (fn,) + args
|
||||||
|
for i, ele_fn in enumerate(self._fn):
|
||||||
|
assert callable(
|
||||||
|
ele_fn
|
||||||
|
), 'Batchify functions must be callable! type(fn[%d]) = %s' % (
|
||||||
|
i, str(type(ele_fn)))
|
||||||
|
|
||||||
|
def __call__(self, data):
|
||||||
|
|
||||||
|
assert len(data[0]) == len(self._fn),\
|
||||||
|
'The number of attributes in each data sample should contain' \
|
||||||
|
' {} elements'.format(len(self._fn))
|
||||||
|
ret = []
|
||||||
|
for i, ele_fn in enumerate(self._fn):
|
||||||
|
result = ele_fn([ele[i] for ele in data])
|
||||||
|
if isinstance(result, (tuple, list)):
|
||||||
|
ret.extend(result)
|
||||||
|
else:
|
||||||
|
ret.append(result)
|
||||||
|
return tuple(ret)
|
||||||
|
|
||||||
|
class Pad():
|
||||||
|
"""
|
||||||
|
pad the data with given value
|
||||||
|
"""
|
||||||
|
def __init__(self,
|
||||||
|
pad_val=0,
|
||||||
|
axis=0,
|
||||||
|
ret_length=None,
|
||||||
|
dtype=None,
|
||||||
|
pad_right=True):
|
||||||
|
self._pad_val = pad_val
|
||||||
|
self._axis = axis
|
||||||
|
self._ret_length = ret_length
|
||||||
|
self._dtype = dtype
|
||||||
|
self._pad_right = pad_right
|
||||||
|
|
||||||
|
def __call__(self, data):
|
||||||
|
arrs = [np.asarray(ele) for ele in data]
|
||||||
|
original_length = [ele.shape[self._axis] for ele in arrs]
|
||||||
|
max_size = max(original_length)
|
||||||
|
ret_shape = list(arrs[0].shape)
|
||||||
|
ret_shape[self._axis] = max_size
|
||||||
|
ret_shape = (len(arrs),) + tuple(ret_shape)
|
||||||
|
ret = np.full(
|
||||||
|
shape=ret_shape,
|
||||||
|
fill_value=self._pad_val,
|
||||||
|
dtype=arrs[0].dtype if self._dtype is None else self._dtype)
|
||||||
|
for i, arr in enumerate(arrs):
|
||||||
|
if arr.shape[self._axis] == max_size:
|
||||||
|
ret[i] = arr
|
||||||
|
else:
|
||||||
|
slices = [slice(None) for _ in range(arr.ndim)]
|
||||||
|
if self._pad_right:
|
||||||
|
slices[self._axis] = slice(0, arr.shape[self._axis])
|
||||||
|
else:
|
||||||
|
slices[self._axis] = slice(max_size - arr.shape[self._axis],
|
||||||
|
max_size)
|
||||||
|
|
||||||
|
if slices[self._axis].start != slices[self._axis].stop:
|
||||||
|
slices = [slice(i, i + 1)] + slices
|
||||||
|
ret[tuple(slices)] = arr
|
||||||
|
if self._ret_length:
|
||||||
|
return ret, np.asarray(
|
||||||
|
original_length,
|
||||||
|
dtype="int32") if self._ret_length else np.asarray(
|
||||||
|
original_length, self._ret_length)
|
||||||
|
return ret
|
||||||
|
|
||||||
|
class Stack():
|
||||||
|
"""
|
||||||
|
Stack the input data
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, axis=0, dtype=None):
|
||||||
|
self._axis = axis
|
||||||
|
self._dtype = dtype
|
||||||
|
|
||||||
|
def __call__(self, data):
|
||||||
|
data = np.stack(
|
||||||
|
data,
|
||||||
|
axis=self._axis).astype(self._dtype) if self._dtype else np.stack(
|
||||||
|
data, axis=self._axis)
|
||||||
|
return data
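# Usage sketch (a minimal example added for illustration; the sample data below is
# made up and is not part of the repository). It shows how Tuple, Pad and Stack are
# typically combined into a batchify function, mirroring how dataconvert.py uses them.
if __name__ == "__main__":
    batchify_fn = Tuple(
        Pad(axis=0, pad_val=0),   # pad the variable-length token id lists to the longest sample
        Stack(dtype='int64'),     # stack the per-sample scalar labels into one array
    )
    samples = [([1, 2, 3], 0), ([4, 5], 1)]
    input_ids, labels = batchify_fn(samples)
    print(input_ids.shape, labels.shape)  # expected: (2, 3) (2,)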
|
|
@ -0,0 +1,142 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
data convert to mindrecord file.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import argparse
|
||||||
|
import numpy as np
|
||||||
|
import dataset as data
|
||||||
|
from tokenizer import FullTokenizer
|
||||||
|
from data_util import Tuple, Pad, Stack
|
||||||
|
from mindspore.mindrecord import FileWriter
|
||||||
|
TASK_CLASSES = {
|
||||||
|
'udc': data.UDCv1,
|
||||||
|
'dstc2': data.DSTC2,
|
||||||
|
'atis_slot': data.ATIS_DSF,
|
||||||
|
'atis_intent': data.ATIS_DID,
|
||||||
|
'mrda': data.MRDA,
|
||||||
|
'swda': data.SwDA,
|
||||||
|
}
|
||||||
|
|
||||||
|
def data_save_to_file(data_file_path=None, vocab_file_path='bert-base-uncased-vocab.txt', \
|
||||||
|
output_path=None, task_name=None, mode="train", max_seq_length=128):
|
||||||
|
"""data save to mindrecord file."""
|
||||||
|
MINDRECORD_FILE_PATH = output_path + task_name + "/" + task_name + "_" + mode + ".mindrecord"
|
||||||
|
if not os.path.exists(output_path + task_name):
|
||||||
|
os.makedirs(output_path + task_name)
|
||||||
|
if os.path.exists(MINDRECORD_FILE_PATH):
|
||||||
|
os.remove(MINDRECORD_FILE_PATH)
|
||||||
|
os.remove(MINDRECORD_FILE_PATH + ".db")
|
||||||
|
dataset_class = TASK_CLASSES[task_name]
|
||||||
|
tokenizer = FullTokenizer(vocab_file=vocab_file_path, do_lower_case=True)
|
||||||
|
dataset = dataset_class(data_file_path+task_name, mode=mode)
|
||||||
|
applied_data = []
|
||||||
|
datalist = []
|
||||||
|
print(task_name + " " + mode + " data process begin")
|
||||||
|
dataset_len = len(dataset)
|
||||||
|
if task_name == 'atis_slot':
|
||||||
|
batchify_fn = lambda samples, fn=Tuple(
|
||||||
|
Pad(axis=0, pad_val=0), # input
|
||||||
|
Pad(axis=0, pad_val=0), # mask
|
||||||
|
Pad(axis=0, pad_val=0), # segment
|
||||||
|
Pad(axis=0, pad_val=0, dtype='int64') # label
|
||||||
|
): fn(samples)
|
||||||
|
else:
|
||||||
|
batchify_fn = lambda samples, fn=Tuple(
|
||||||
|
Pad(axis=0, pad_val=0), # input
|
||||||
|
Pad(axis=0, pad_val=0), # mask
|
||||||
|
Pad(axis=0, pad_val=0), # segment
|
||||||
|
Stack(dtype='int64') # label
|
||||||
|
): fn(samples)
|
||||||
|
for idx, example in enumerate(dataset):
|
||||||
|
if idx % 1000 == 0:
|
||||||
|
print("Reading example %d of %d" % (idx, dataset_len))
|
||||||
|
data_example = dataset_class.convert_example(example=example, \
|
||||||
|
tokenizer=tokenizer, max_seq_length=max_seq_length)
|
||||||
|
applied_data.append(data_example)
|
||||||
|
|
||||||
|
applied_data = batchify_fn(applied_data)
|
||||||
|
input_ids, input_mask, segment_ids, label_ids = applied_data
|
||||||
|
|
||||||
|
for idx in range(dataset_len):
|
||||||
|
if idx % 1000 == 0:
|
||||||
|
print("Processing example %d of %d" % (idx, dataset_len))
|
||||||
|
sample = {
|
||||||
|
"input_ids": np.array(input_ids[idx], dtype=np.int64),
|
||||||
|
"input_mask": np.array(input_mask[idx], dtype=np.int64),
|
||||||
|
"segment_ids": np.array(segment_ids[idx], dtype=np.int64),
|
||||||
|
"label_ids": np.array([label_ids[idx]], dtype=np.int64),
|
||||||
|
}
|
||||||
|
datalist.append(sample)
|
||||||
|
|
||||||
|
print(task_name + " " + mode + " data process end")
|
||||||
|
writer = FileWriter(file_name=MINDRECORD_FILE_PATH, shard_num=1)
|
||||||
|
nlp_schema = {
|
||||||
|
"input_ids": {"type": "int64", "shape": [-1]},
|
||||||
|
"input_mask": {"type": "int64", "shape": [-1]},
|
||||||
|
"segment_ids": {"type": "int64", "shape": [-1]},
|
||||||
|
"label_ids": {"type": "int64", "shape": [-1]},
|
||||||
|
}
|
||||||
|
writer.add_schema(nlp_schema, "preprocessed classification dataset")
|
||||||
|
writer.write_raw_data(datalist)
|
||||||
|
writer.commit()
|
||||||
|
print("write success")
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser(description="run classifier")
|
||||||
|
parser.add_argument(
|
||||||
|
"--task_name",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help="The name of the task to train.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--data_dir",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="The directory where the dataset will be load.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--vocab_file_dir",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="The directory where the vocab will be load.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--output_dir",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="The directory where the mindrecord dataset file will be save.")
|
||||||
|
parser.add_argument(
|
||||||
|
"--max_seq_len",
|
||||||
|
default=128,
|
||||||
|
type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization for trainng. ")
|
||||||
|
parser.add_argument(
|
||||||
|
"--eval_max_seq_len",
|
||||||
|
default=None,
|
||||||
|
type=int,
|
||||||
|
help="The maximum total input sequence length after tokenization for trainng. ")
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
if args.eval_max_seq_len is None:
|
||||||
|
args.eval_max_seq_len = args.max_seq_len
|
||||||
|
data_save_to_file(data_file_path=args.data_dir, vocab_file_path=args.vocab_file_dir, output_path=args.output_dir, \
|
||||||
|
task_name=args.task_name, mode="train", max_seq_length=args.max_seq_len)
|
||||||
|
data_save_to_file(data_file_path=args.data_dir, vocab_file_path=args.vocab_file_dir, output_path=args.output_dir, \
|
||||||
|
task_name=args.task_name, mode="dev", max_seq_length=args.eval_max_seq_len)
|
||||||
|
data_save_to_file(data_file_path=args.data_dir, vocab_file_path=args.vocab_file_dir, output_path=args.output_dir, \
|
||||||
|
task_name=args.task_name, mode="test", max_seq_length=args.eval_max_seq_len)
|
|
@ -0,0 +1,608 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
dataset used in Bert finetune and evaluation.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
import os
|
||||||
|
from typing import List
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
# The input data begins with '[CLS]' and uses '[SEP]' to split the conversation content
# (previous part, current part, following part, etc.). If a split part contains multiple
# conversation turns, 'INNER_SEP' is used to further split them.
|
||||||
|
INNER_SEP = '[unused0]'
|
||||||
|
|
||||||
|
class Dataset():
|
||||||
|
""" Dataset base class """
|
||||||
|
def __init__(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def __getitem__(self, idx):
|
||||||
|
raise NotImplementedError("'{}' not implement in class " \
|
||||||
|
"{}".format('__getitem__', self.__class__.__name__))
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
raise NotImplementedError("'{}' not implement in class " \
|
||||||
|
"{}".format('__len__', self.__class__.__name__))
|
||||||
|
|
||||||
|
def get_label_map(label_list):
|
||||||
|
""" Create label maps """
|
||||||
|
label_map = {}
|
||||||
|
for (i, l) in enumerate(label_list):
|
||||||
|
label_map[l] = i
|
||||||
|
return label_map
|
||||||
|
|
||||||
|
|
||||||
|
class UDCv1(Dataset):
|
||||||
|
"""
|
||||||
|
The UDCv1 dataset is used in the Dialogue Response Selection task.
The source dataset is UDCv1 (Ubuntu Dialogue Corpus v1.0). See details at
|
||||||
|
http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/
|
||||||
|
"""
|
||||||
|
MAX_LEN_OF_RESPONSE = 60
|
||||||
|
LABEL_MAP = get_label_map(['0', '1'])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train', label_map_config=None):
|
||||||
|
super(UDCv1, self).__init__()
|
||||||
|
self._data_dir = data_dir
|
||||||
|
self._mode = mode
|
||||||
|
self.read_data()
|
||||||
|
self.label_map = None
|
||||||
|
if label_map_config:
|
||||||
|
with open(label_map_config) as f:
|
||||||
|
self.label_map = json.load(f)
|
||||||
|
else:
|
||||||
|
self.label_map = None
|
||||||
|
#read data from file
|
||||||
|
def read_data(self):
|
||||||
|
"""read data from file"""
|
||||||
|
if self._mode == 'train':
|
||||||
|
data_path = os.path.join(self._data_dir, 'train.txt')
|
||||||
|
elif self._mode == 'dev':
|
||||||
|
data_path = os.path.join(self._data_dir, 'dev.txt-small')
|
||||||
|
elif self._mode == 'test':
|
||||||
|
data_path = os.path.join(self._data_dir, 'test.txt')
|
||||||
|
self.data = []
|
||||||
|
with open(data_path, 'r', encoding='utf8') as fin:
|
||||||
|
for line in fin:
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
arr = line.rstrip('\n').split('\t')
|
||||||
|
if len(arr) < 3:
|
||||||
|
print('Data format error: %s' % '\t'.join(arr))
|
||||||
|
print(
|
||||||
|
'Data row contains at least three parts: label\tconversation1\t.....\tresponse.'
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
label = arr[0]
|
||||||
|
text_a = arr[1:-1]
|
||||||
|
text_b = arr[-1]
|
||||||
|
self.data.append([label, text_a, text_b])
|
||||||
|
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
def _truncate_and_concat(text_a: List[str], text_b: str, tokenizer, max_seq_length):
|
||||||
|
tokens_b = tokenizer.tokenize(text_b)
|
||||||
|
tokens_b = tokens_b[:min(cls.MAX_LEN_OF_RESPONSE, len(tokens_b))]
|
||||||
|
tokens_a = []
|
||||||
|
for text in text_a:
|
||||||
|
tokens_a.extend(tokenizer.tokenize(text))
|
||||||
|
tokens_a.append(INNER_SEP)
|
||||||
|
tokens_a = tokens_a[:-1]
|
||||||
|
if len(tokens_a) > max_seq_length - len(tokens_b) - 3:
|
||||||
|
tokens_a = tokens_a[len(tokens_a) - max_seq_length + len(tokens_b) + 3:]
|
||||||
|
tokens, segment_ids = [], []
|
||||||
|
tokens.append("[CLS]")
|
||||||
|
segment_ids.append(0)
|
||||||
|
for token in tokens_a:
|
||||||
|
tokens.append(token)
|
||||||
|
segment_ids.append(0)
|
||||||
|
tokens.append("[SEP]")
|
||||||
|
segment_ids.append(0)
|
||||||
|
|
||||||
|
if tokens_b:
|
||||||
|
for token in tokens_b:
|
||||||
|
tokens.append(token)
|
||||||
|
segment_ids.append(1)
|
||||||
|
tokens.append("[SEP]")
|
||||||
|
segment_ids.append(1)
|
||||||
|
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
input_mask = [1] * len(input_ids)
|
||||||
|
while len(input_ids) < max_seq_length:
|
||||||
|
input_ids.append(0)
|
||||||
|
input_mask.append(0)
|
||||||
|
segment_ids.append(0)
|
||||||
|
return input_ids, input_mask, segment_ids
|
||||||
|
|
||||||
|
label, text_a, text_b = example
|
||||||
|
label = np.array([cls.get_label(label)], dtype='int64')
|
||||||
|
input_ids, input_mask, segment_ids = _truncate_and_concat(text_a, text_b, tokenizer, max_seq_length)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
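# Illustrative note (added for clarity, not part of the original file): each line of
# the UDC txt files is expected to be tab-separated as
#
#   label<TAB>context_turn_1<TAB>...<TAB>context_turn_n<TAB>candidate_response
#
# convert_example() then emits fixed-length (max_seq_length) input_ids / input_mask /
# segment_ids, with segment id 0 for the context turns and 1 for the candidate response.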
|
||||||
|
|
||||||
|
|
||||||
|
class DSTC2(Dataset):
|
||||||
|
"""
|
||||||
|
The DSTC2 dataset is used in the Dialogue State Tracking task.
The source dataset is DSTC2 (Dialog State Tracking Challenge 2). See details at
|
||||||
|
https://github.com/matthen/dstc
|
||||||
|
"""
|
||||||
|
LABEL_MAP = get_label_map([str(i) for i in range(217)])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train'):
|
||||||
|
super(DSTC2, self).__init__()
|
||||||
|
self._data_dir = data_dir
|
||||||
|
self._mode = mode
|
||||||
|
self.read_data()
|
||||||
|
|
||||||
|
def read_data(self):
|
||||||
|
"""read data from file"""
|
||||||
|
def _concat_dialogues(examples):
|
||||||
|
"""concat multi turns dialogues"""
|
||||||
|
new_examples = []
|
||||||
|
max_turns = 20
|
||||||
|
example_len = len(examples)
|
||||||
|
for i in range(example_len):
|
||||||
|
multi_turns = examples[max(i - max_turns, 0):i + 1]
|
||||||
|
new_qa = '\1'.join([example[0] for example in multi_turns])
|
||||||
|
new_examples.append((new_qa.split('\1'), examples[i][1]))
|
||||||
|
return new_examples
|
||||||
|
|
||||||
|
if self._mode == 'train':
|
||||||
|
data_path = os.path.join(self._data_dir, 'train.txt')
|
||||||
|
elif self._mode == 'dev':
|
||||||
|
data_path = os.path.join(self._data_dir, 'dev.txt')
|
||||||
|
elif self._mode == 'test':
|
||||||
|
data_path = os.path.join(self._data_dir, 'test.txt')
|
||||||
|
self.data = []
|
||||||
|
with open(data_path, 'r', encoding='utf8') as fin:
|
||||||
|
pre_idx = -1
|
||||||
|
examples = []
|
||||||
|
for line in fin:
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
arr = line.rstrip('\n').split('\t')
|
||||||
|
if len(arr) != 3:
|
||||||
|
print('Data format error: %s' % '\t'.join(arr))
|
||||||
|
print(
|
||||||
|
'Data row should contain three parts: id\tquestion\1answer\tlabel1 label2 ...'
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
idx = arr[0]
|
||||||
|
qa = arr[1]
|
||||||
|
label_list = arr[2].split()
|
||||||
|
if idx != pre_idx:
|
||||||
|
if idx != 0:
|
||||||
|
examples = _concat_dialogues(examples)
|
||||||
|
self.data.extend(examples)
|
||||||
|
examples = []
|
||||||
|
pre_idx = idx
|
||||||
|
examples.append((qa, label_list))
|
||||||
|
if examples:
|
||||||
|
examples = _concat_dialogues(examples)
|
||||||
|
self.data.extend(examples)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
|
||||||
|
def _truncate_and_concat(texts: List[str], tokenizer, max_seq_length):
|
||||||
|
tokens = []
|
||||||
|
for text in texts:
|
||||||
|
tokens.extend(tokenizer.tokenize(text))
|
||||||
|
tokens.append(INNER_SEP)
|
||||||
|
tokens = tokens[:-1]
|
||||||
|
if len(tokens) > max_seq_length - 2:
|
||||||
|
tokens = tokens[len(tokens) - max_seq_length + 2:]
|
||||||
|
tokens_, segment_ids = [], []
|
||||||
|
tokens_.append("[CLS]")
|
||||||
|
segment_ids.append(0)
|
||||||
|
for token in tokens:
|
||||||
|
tokens_.append(token)
|
||||||
|
segment_ids.append(0)
|
||||||
|
tokens_.append("[SEP]")
|
||||||
|
segment_ids.append(0)
|
||||||
|
tokens = tokens_
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
return input_ids, segment_ids
|
||||||
|
|
||||||
|
texts, labels = example
|
||||||
|
input_ids, segment_ids = _truncate_and_concat(texts, tokenizer,
|
||||||
|
max_seq_length)
|
||||||
|
labels = [cls.get_label(l) for l in labels]
|
||||||
|
label = np.zeros(cls.num_classes(), dtype='int64')
|
||||||
|
for l in labels:
|
||||||
|
label[l] = 1
|
||||||
|
input_mask = [1] * len(input_ids)
|
||||||
|
while len(input_ids) < max_seq_length:
|
||||||
|
input_ids.append(0)
|
||||||
|
input_mask.append(0)
|
||||||
|
segment_ids.append(0)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
|
||||||
|
|
||||||
|
class ATIS_DSF(Dataset):
|
||||||
|
"""
|
||||||
|
The ATIS_DSF dataset is used in the Dialogue Slot Filling task.
The source dataset is ATIS (Airline Travel Information System). See details at
|
||||||
|
https://www.kaggle.com/siddhadev/ms-cntk-atis
|
||||||
|
"""
|
||||||
|
LABEL_MAP = get_label_map([str(i) for i in range(130)])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train'):
|
||||||
|
super(ATIS_DSF, self).__init__()
|
||||||
|
self._data_dir = data_dir
|
||||||
|
self._mode = mode
|
||||||
|
self.read_data()
|
||||||
|
|
||||||
|
def read_data(self):
|
||||||
|
"""read data from file"""
|
||||||
|
if self._mode == 'train':
|
||||||
|
data_path = os.path.join(self._data_dir, 'train.txt')
|
||||||
|
elif self._mode == 'dev':
|
||||||
|
data_path = os.path.join(self._data_dir, 'dev.txt')
|
||||||
|
elif self._mode == 'test':
|
||||||
|
data_path = os.path.join(self._data_dir, 'test.txt')
|
||||||
|
self.data = []
|
||||||
|
with open(data_path, 'r', encoding='utf8') as fin:
|
||||||
|
for line in fin:
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
arr = line.rstrip('\n').split('\t')
|
||||||
|
if len(arr) != 2:
|
||||||
|
print('Data format error: %s' % '\t'.join(arr))
|
||||||
|
print(
|
||||||
|
'Data row should contain two parts: conversation_content\tlabel1 label2 label3.'
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
text = arr[0]
|
||||||
|
label_list = arr[1].split()
|
||||||
|
self.data.append([text, label_list])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
text, labels = example
|
||||||
|
tokens, label_list = [], []
|
||||||
|
words = text.split()
|
||||||
|
assert len(words) == len(labels)
|
||||||
|
for word, label in zip(words, labels):
|
||||||
|
piece_words = tokenizer.tokenize(word)
|
||||||
|
tokens.extend(piece_words)
|
||||||
|
label = cls.get_label(label)
|
||||||
|
label_list.extend([label] * len(piece_words))
|
||||||
|
if len(tokens) > max_seq_length - 2:
|
||||||
|
tokens = tokens[len(tokens) - max_seq_length + 2:]
|
||||||
|
label_list = label_list[len(label_list) - max_seq_length + 2:]
|
||||||
|
tokens_, segment_ids = [], []
|
||||||
|
tokens_.append("[CLS]")
|
||||||
|
for token in tokens:
|
||||||
|
tokens_.append(token)
|
||||||
|
tokens_.append("[SEP]")
|
||||||
|
tokens = tokens_
|
||||||
|
label_list = [0] + label_list + [0]
|
||||||
|
segment_ids = [0] * len(tokens)
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
label = np.array(label_list, dtype='int64')
|
||||||
|
input_mask = [1] * len(input_ids)
|
||||||
|
while len(input_ids) < max_seq_length:
|
||||||
|
input_ids.append(0)
|
||||||
|
input_mask.append(0)
|
||||||
|
segment_ids.append(0)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
|
||||||
|
|
||||||
|
|
||||||
|
class ATIS_DID(Dataset):
|
||||||
|
"""
|
||||||
|
The ATIS_DID dataset is used in the Dialogue Intent Detection task.
The source dataset is ATIS (Airline Travel Information System). See details at
|
||||||
|
https://www.kaggle.com/siddhadev/ms-cntk-atis
|
||||||
|
"""
|
||||||
|
LABEL_MAP = get_label_map([str(i) for i in range(26)])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train'):
|
||||||
|
super(ATIS_DID, self).__init__()
|
||||||
|
self._data_dir = data_dir
|
||||||
|
self._mode = mode
|
||||||
|
self.read_data()
|
||||||
|
|
||||||
|
def read_data(self):
|
||||||
|
"""read data from file"""
|
||||||
|
if self._mode == 'train':
|
||||||
|
data_path = os.path.join(self._data_dir, 'train.txt')
|
||||||
|
elif self._mode == 'dev':
|
||||||
|
data_path = os.path.join(self._data_dir, 'dev.txt')
|
||||||
|
elif self._mode == 'test':
|
||||||
|
data_path = os.path.join(self._data_dir, 'test.txt')
|
||||||
|
self.data = []
|
||||||
|
with open(data_path, 'r', encoding='utf8') as fin:
|
||||||
|
for line in fin:
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
arr = line.rstrip('\n').split('\t')
|
||||||
|
if len(arr) != 2:
|
||||||
|
print('Data format error: %s' % '\t'.join(arr))
|
||||||
|
print(
|
||||||
|
'Data row should contain two parts: label\tconversation_content.'
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
label = arr[0]
|
||||||
|
text = arr[1]
|
||||||
|
self.data.append([label, text])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
label, text = example
|
||||||
|
tokens = tokenizer.tokenize(text)
|
||||||
|
if len(tokens) > max_seq_length - 2:
|
||||||
|
tokens = tokens[len(tokens) - max_seq_length + 2:]
|
||||||
|
tokens_, segment_ids = [], []
|
||||||
|
tokens_.append("[CLS]")
|
||||||
|
for token in tokens:
|
||||||
|
tokens_.append(token)
|
||||||
|
tokens_.append("[SEP]")
|
||||||
|
tokens = tokens_
|
||||||
|
segment_ids = [0] * len(tokens)
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
label = np.array([cls.get_label(label)], dtype='int64')
|
||||||
|
input_mask = [1] * len(input_ids)
|
||||||
|
while len(input_ids) < max_seq_length:
|
||||||
|
input_ids.append(0)
|
||||||
|
input_mask.append(0)
|
||||||
|
segment_ids.append(0)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
|
||||||
|
|
||||||
|
|
||||||
|
def read_da_data(data_dir, mode):
|
||||||
|
"""read data from file"""
|
||||||
|
def _concat_dialogues(examples):
|
||||||
|
"""concat multi turns dialogues"""
|
||||||
|
new_examples = []
|
||||||
|
example_len = len(examples)
|
||||||
|
for i in range(example_len):
|
||||||
|
label, caller, text = examples[i]
|
||||||
|
cur_txt = "%s : %s" % (caller, text)
|
||||||
|
pre_txt = [
|
||||||
|
"%s : %s" % (item[1], item[2])
|
||||||
|
for item in examples[max(0, i - 5):i]
|
||||||
|
]
|
||||||
|
suf_txt = [
|
||||||
|
"%s : %s" % (item[1], item[2])
|
||||||
|
for item in examples[i + 1:min(len(examples), i + 3)]
|
||||||
|
]
|
||||||
|
sample = [label, pre_txt, cur_txt, suf_txt]
|
||||||
|
new_examples.append(sample)
|
||||||
|
return new_examples
|
||||||
|
|
||||||
|
if mode == 'train':
|
||||||
|
data_path = os.path.join(data_dir, 'train.txt')
|
||||||
|
elif mode == 'dev':
|
||||||
|
data_path = os.path.join(data_dir, 'dev.txt')
|
||||||
|
elif mode == 'test':
|
||||||
|
data_path = os.path.join(data_dir, 'test.txt')
|
||||||
|
data = []
|
||||||
|
with open(data_path, 'r', encoding='utf8') as fin:
|
||||||
|
pre_idx = -1
|
||||||
|
examples = []
|
||||||
|
for line in fin:
|
||||||
|
if not line:
|
||||||
|
continue
|
||||||
|
arr = line.rstrip('\n').split('\t')
|
||||||
|
if len(arr) != 4:
|
||||||
|
print('Data format error: %s' % '\t'.join(arr))
|
||||||
|
print(
|
||||||
|
'Data row should contain four parts: id\tlabel\tcaller\tconversation_content.'
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
idx, label, caller, text = arr
|
||||||
|
if idx != pre_idx:
|
||||||
|
if idx != 0:
|
||||||
|
examples = _concat_dialogues(examples)
|
||||||
|
data.extend(examples)
|
||||||
|
examples = []
|
||||||
|
pre_idx = idx
|
||||||
|
examples.append((label, caller, text))
|
||||||
|
if examples:
|
||||||
|
examples = _concat_dialogues(examples)
|
||||||
|
data.extend(examples)
|
||||||
|
return data
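# Illustrative note (added for clarity, not part of the original file): read_da_data()
# expects tab-separated rows of the form
#
#   dialogue_id<TAB>label<TAB>caller<TAB>utterance
#
# and, for every utterance, builds a sample
#   [label, up_to_5_previous_turns, "caller : utterance", up_to_2_following_turns]
# so that the MRDA and SwDA datasets below can classify the current turn in its local context.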
|
||||||
|
|
||||||
|
|
||||||
|
def truncate_and_concat(pre_txt: List[str],
|
||||||
|
cur_txt: str,
|
||||||
|
suf_txt: List[str],
|
||||||
|
tokenizer,
|
||||||
|
max_seq_length,
|
||||||
|
max_len_of_cur_text):
|
||||||
|
"""concat data"""
|
||||||
|
cur_tokens = tokenizer.tokenize(cur_txt)
|
||||||
|
cur_tokens = cur_tokens[:min(max_len_of_cur_text, len(cur_tokens))]
|
||||||
|
pre_tokens = []
|
||||||
|
for text in pre_txt:
|
||||||
|
pre_tokens.extend(tokenizer.tokenize(text))
|
||||||
|
pre_tokens.append(INNER_SEP)
|
||||||
|
pre_tokens = pre_tokens[:-1]
|
||||||
|
suf_tokens = []
|
||||||
|
for text in suf_txt:
|
||||||
|
suf_tokens.extend(tokenizer.tokenize(text))
|
||||||
|
suf_tokens.append(INNER_SEP)
|
||||||
|
suf_tokens = suf_tokens[:-1]
|
||||||
|
if len(cur_tokens) + len(pre_tokens) + len(suf_tokens) > max_seq_length - 4:
|
||||||
|
left_num = max_seq_length - 4 - len(cur_tokens)
|
||||||
|
if len(pre_tokens) > len(suf_tokens):
|
||||||
|
suf_num = int(left_num / 2)
|
||||||
|
suf_tokens = suf_tokens[:suf_num]
|
||||||
|
pre_num = left_num - len(suf_tokens)
|
||||||
|
pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num):]
|
||||||
|
else:
|
||||||
|
pre_num = int(left_num / 2)
|
||||||
|
pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num):]
|
||||||
|
suf_num = left_num - len(pre_tokens)
|
||||||
|
suf_tokens = suf_tokens[:suf_num]
|
||||||
|
tokens, segment_ids = [], []
|
||||||
|
tokens.append("[CLS]")
|
||||||
|
for token in pre_tokens:
|
||||||
|
tokens.append(token)
|
||||||
|
tokens.append("[SEP]")
|
||||||
|
segment_ids.extend([0] * len(tokens))
|
||||||
|
for token in cur_tokens:
|
||||||
|
tokens.append(token)
|
||||||
|
tokens.append("[SEP]")
|
||||||
|
segment_ids.extend([1] * (len(cur_tokens) + 1))
|
||||||
|
if suf_tokens:
|
||||||
|
for token in suf_tokens:
|
||||||
|
tokens.append(token)
|
||||||
|
tokens.append("[SEP]")
|
||||||
|
segment_ids.extend([0] * (len(suf_tokens) + 1))
|
||||||
|
input_ids = tokenizer.convert_tokens_to_ids(tokens)
|
||||||
|
input_mask = [1] * len(input_ids)
|
||||||
|
while len(input_ids) < max_seq_length:
|
||||||
|
input_ids.append(0)
|
||||||
|
input_mask.append(0)
|
||||||
|
segment_ids.append(0)
|
||||||
|
return input_ids, input_mask, segment_ids
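# Illustrative note (added for clarity, not part of the original file): the sequence
# produced by truncate_and_concat() is laid out as
#
#   [CLS] previous_context [SEP] current_utterance [SEP] following_context [SEP]
#
# with segment ids 0 for the [CLS]/previous-context span, 1 for the current-utterance
# span and 0 again for the following-context span; the 4 positions reserved in the
# length budget are [CLS] and the three [SEP] tokens.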
|
||||||
|
|
||||||
|
|
||||||
|
class MRDA(Dataset):
|
||||||
|
"""
|
||||||
|
The MRDA dataset is used in the Dialogue Act Detection task.
The source dataset is MRDA (Meeting Recorder Dialogue Act). See details at
|
||||||
|
https://www.aclweb.org/anthology/W04-2319.pdf
|
||||||
|
"""
|
||||||
|
MAX_LEN_OF_CUR_TEXT = 50
|
||||||
|
LABEL_MAP = get_label_map([str(i) for i in range(5)])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train'):
|
||||||
|
super(MRDA, self).__init__()
|
||||||
|
self.data = read_da_data(data_dir, mode)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
label, pre_txt, cur_txt, suf_txt = example
|
||||||
|
label = np.array([cls.get_label(label)], dtype='int64')
|
||||||
|
input_ids, input_mask, segment_ids = truncate_and_concat(pre_txt, cur_txt, suf_txt, \
|
||||||
|
tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
|
||||||
|
|
||||||
|
|
||||||
|
class SwDA(Dataset):
|
||||||
|
"""
|
||||||
|
The SwDA dataset is used in the Dialogue Act Detection task.
The source dataset is SwDA (Switchboard Dialog Act). See details at
|
||||||
|
http://compprag.christopherpotts.net/swda.html
|
||||||
|
"""
|
||||||
|
MAX_LEN_OF_CUR_TEXT = 50
|
||||||
|
LABEL_MAP = get_label_map([str(i) for i in range(42)])
|
||||||
|
|
||||||
|
def __init__(self, data_dir, mode='train'):
|
||||||
|
super(SwDA, self).__init__()
|
||||||
|
self.data = read_da_data(data_dir, mode)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_label(cls, label):
|
||||||
|
return cls.LABEL_MAP[label]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def num_classes(cls):
|
||||||
|
return len(cls.LABEL_MAP)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def convert_example(cls, example, tokenizer, max_seq_length=512):
|
||||||
|
""" Convert a glue example into necessary features. """
|
||||||
|
label, pre_txt, cur_txt, suf_txt = example
|
||||||
|
label = np.array([cls.get_label(label)], dtype='int64')
|
||||||
|
input_ids, input_mask, segment_ids = truncate_and_concat(pre_txt, cur_txt, suf_txt, \
|
||||||
|
tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT)
|
||||||
|
return input_ids, input_mask, segment_ids, label
|
||||||
|
|
||||||
|
def __getitem__(self, index):
|
||||||
|
return self.data[index]
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self.data)
|
|
@ -0,0 +1,81 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
Config settings to be used in finetune.py.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from easydict import EasyDict as edict
|
||||||
|
import mindspore.common.dtype as mstype
|
||||||
|
from .bert_model import BertConfig
|
||||||
|
|
||||||
|
optimizer_cfg = edict({
|
||||||
|
'optimizer': 'Lamb',
|
||||||
|
'AdamWeightDecay': edict({
|
||||||
|
'learning_rate': 2e-5,
|
||||||
|
'end_learning_rate': 1e-7,
|
||||||
|
'power': 1.0,
|
||||||
|
'weight_decay': 1e-5,
|
||||||
|
'decay_filter': lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
|
||||||
|
'eps': 1e-6,
|
||||||
|
}),
|
||||||
|
'Lamb': edict({
|
||||||
|
'learning_rate': 2e-5,
|
||||||
|
'end_learning_rate': 1e-7,
|
||||||
|
'power': 1.0,
|
||||||
|
'weight_decay': 0.01,
|
||||||
|
'decay_filter': lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
|
||||||
|
}),
|
||||||
|
'Momentum': edict({
|
||||||
|
'learning_rate': 2e-5,
|
||||||
|
'momentum': 0.9,
|
||||||
|
}),
|
||||||
|
})
|
||||||
|
|
||||||
|
bert_net_cfg = BertConfig(
|
||||||
|
seq_length=128,
|
||||||
|
vocab_size=30522,
|
||||||
|
hidden_size=768,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_attention_heads=12,
|
||||||
|
intermediate_size=3072,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
use_relative_positions=False,
|
||||||
|
dtype=mstype.float32,
|
||||||
|
compute_type=mstype.float16,
|
||||||
|
)
|
||||||
|
|
||||||
|
bert_net_udc_cfg = BertConfig(
|
||||||
|
seq_length=224,
|
||||||
|
vocab_size=30522,
|
||||||
|
hidden_size=768,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_attention_heads=12,
|
||||||
|
intermediate_size=3072,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
use_relative_positions=False,
|
||||||
|
dtype=mstype.float32,
|
||||||
|
compute_type=mstype.float16,
|
||||||
|
)
|
|
@ -0,0 +1,124 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
'''
|
||||||
|
Bert finetune and evaluation model script.
|
||||||
|
'''
|
||||||
|
|
||||||
|
import mindspore.nn as nn
|
||||||
|
from mindspore.common.initializer import TruncatedNormal
|
||||||
|
from mindspore.ops import operations as P
|
||||||
|
from .bert_model import BertModel
|
||||||
|
|
||||||
|
class BertCLSModel(nn.Cell):
|
||||||
|
"""
|
||||||
|
This class is responsible for classification task evaluation, i.e. XNLI(num_labels=3),
LCQMC(num_labels=2), Chnsenti(num_labels=2). The returned output represents the final
logits, since the result of log_softmax is monotonically related to that of softmax.
|
||||||
|
"""
|
||||||
|
def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0, use_one_hot_embeddings=False,
|
||||||
|
assessment_method=""):
|
||||||
|
super(BertCLSModel, self).__init__()
|
||||||
|
if not is_training:
|
||||||
|
config.hidden_dropout_prob = 0.0
|
||||||
|
config.hidden_probs_dropout_prob = 0.0
|
||||||
|
self.bert = BertModel(config, is_training, use_one_hot_embeddings)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.weight_init = TruncatedNormal(config.initializer_range)
|
||||||
|
self.log_softmax = P.LogSoftmax(axis=-1)
|
||||||
|
self.dtype = config.dtype
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.dense_1 = nn.Dense(config.hidden_size, self.num_labels, weight_init=self.weight_init,
|
||||||
|
has_bias=True).to_float(config.compute_type)
|
||||||
|
self.dropout = nn.Dropout(1 - dropout_prob)
|
||||||
|
self.assessment_method = assessment_method
|
||||||
|
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id):
|
||||||
|
_, pooled_output, _ = \
|
||||||
|
self.bert(input_ids, token_type_id, input_mask)
|
||||||
|
cls = self.cast(pooled_output, self.dtype)
|
||||||
|
cls = self.dropout(cls)
|
||||||
|
logits = self.dense_1(cls)
|
||||||
|
logits = self.cast(logits, self.dtype)
|
||||||
|
if self.assessment_method != "spearman_correlation":
|
||||||
|
logits = self.log_softmax(logits)
|
||||||
|
return logits
|
||||||
|
|
||||||
|
class BertSquadModel(nn.Cell):
|
||||||
|
'''
|
||||||
|
This class is responsible for the SQuAD task.
|
||||||
|
'''
|
||||||
|
def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0, use_one_hot_embeddings=False):
|
||||||
|
super(BertSquadModel, self).__init__()
|
||||||
|
if not is_training:
|
||||||
|
config.hidden_dropout_prob = 0.0
|
||||||
|
config.hidden_probs_dropout_prob = 0.0
|
||||||
|
self.bert = BertModel(config, is_training, use_one_hot_embeddings)
|
||||||
|
self.weight_init = TruncatedNormal(config.initializer_range)
|
||||||
|
self.dense1 = nn.Dense(config.hidden_size, num_labels, weight_init=self.weight_init,
|
||||||
|
has_bias=True).to_float(config.compute_type)
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.dtype = config.dtype
|
||||||
|
self.log_softmax = P.LogSoftmax(axis=1)
|
||||||
|
self.is_training = is_training
|
||||||
|
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id):
|
||||||
|
sequence_output, _, _ = self.bert(input_ids, token_type_id, input_mask)
|
||||||
|
batch_size, seq_length, hidden_size = P.Shape()(sequence_output)
|
||||||
|
sequence = P.Reshape()(sequence_output, (-1, hidden_size))
|
||||||
|
logits = self.dense1(sequence)
|
||||||
|
logits = P.Cast()(logits, self.dtype)
|
||||||
|
logits = P.Reshape()(logits, (batch_size, seq_length, self.num_labels))
|
||||||
|
logits = self.log_softmax(logits)
|
||||||
|
return logits
|
||||||
|
|
||||||
|
class BertNERModel(nn.Cell):
|
||||||
|
"""
|
||||||
|
This class is responsible for sequence labeling task evaluation, i.e. NER(num_labels=11).
The returned output represents the final logits, since the result of log_softmax is monotonically related to that of softmax.
|
||||||
|
"""
|
||||||
|
def __init__(self, config, is_training, num_labels=11, use_crf=False, dropout_prob=0.0,
|
||||||
|
use_one_hot_embeddings=False):
|
||||||
|
super(BertNERModel, self).__init__()
|
||||||
|
if not is_training:
|
||||||
|
config.hidden_dropout_prob = 0.0
|
||||||
|
config.hidden_probs_dropout_prob = 0.0
|
||||||
|
self.bert = BertModel(config, is_training, use_one_hot_embeddings)
|
||||||
|
self.cast = P.Cast()
|
||||||
|
self.weight_init = TruncatedNormal(config.initializer_range)
|
||||||
|
self.log_softmax = P.LogSoftmax(axis=-1)
|
||||||
|
self.dtype = config.dtype
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.dense_1 = nn.Dense(config.hidden_size, self.num_labels, weight_init=self.weight_init,
|
||||||
|
has_bias=True).to_float(config.compute_type)
|
||||||
|
self.dropout = nn.Dropout(1 - dropout_prob)
|
||||||
|
self.reshape = P.Reshape()
|
||||||
|
self.shape = (-1, config.hidden_size)
|
||||||
|
self.use_crf = use_crf
|
||||||
|
self.origin_shape = (-1, config.seq_length, self.num_labels)
|
||||||
|
|
||||||
|
def construct(self, input_ids, input_mask, token_type_id):
|
||||||
|
"""Return the final logits as the results of log_softmax."""
|
||||||
|
sequence_output, _, _ = \
|
||||||
|
self.bert(input_ids, token_type_id, input_mask)
|
||||||
|
seq = self.dropout(sequence_output)
|
||||||
|
seq = self.reshape(seq, self.shape)
|
||||||
|
logits = self.dense_1(seq)
|
||||||
|
logits = self.cast(logits, self.dtype)
|
||||||
|
if self.use_crf:
|
||||||
|
return_value = self.reshape(logits, self.origin_shape)
|
||||||
|
else:
|
||||||
|
return_value = self.log_softmax(logits)
|
||||||
|
return return_value
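# Construction sketch (a hedged example added for illustration, not part of the original
# file): the heads above are built from a BertConfig such as bert_net_cfg in
# finetune_eval_config.py, e.g.
#
#   model = BertCLSModel(bert_net_cfg, is_training=True, num_labels=26, dropout_prob=0.1)
#   logits = model(input_ids, input_mask, token_type_id)  # shape [batch_size, num_labels]
#
# where the three inputs are integer Tensors of shape [batch_size, seq_length] and the
# returned logits are log-probabilities unless assessment_method == "spearman_correlation".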
|
|
@ -0,0 +1,230 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
Metric used in Bert finetune and evaluation.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import numpy as np

import mindspore.nn as nn
|
||||||
|
from mindspore.nn.metrics.metric import Metric
|
||||||
|
|
||||||
|
class F1Score(Metric):
|
||||||
|
"""
|
||||||
|
F1-score is the harmonic mean of precision and recall. Micro-averaging builds
a single global confusion matrix over all examples and then computes the
F1-score from it. This class is used to evaluate the performance of Dialogue
Slot Filling.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
super(F1Score, self).__init__(*args, **kwargs)
|
||||||
|
self._name = 'F1Score'
|
||||||
|
self.clear()
|
||||||
|
|
||||||
|
def clear(self):
|
||||||
|
"""
|
||||||
|
Resets all of the metric state.
|
||||||
|
"""
|
||||||
|
self.tp = {}
|
||||||
|
self.fn = {}
|
||||||
|
self.fp = {}
|
||||||
|
|
||||||
|
def update(self, logits, labels):
|
||||||
|
"""
|
||||||
|
Update the states based on the current mini-batch prediction results.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
logits (Tensor): The predicted value is a Tensor with
|
||||||
|
shape [batch_size, seq_len, num_classes] and type float32 or
|
||||||
|
float64.
|
||||||
|
labels (Tensor): The ground truth value is a 2D Tensor,
|
||||||
|
its shape is [batch_size, seq_len] and type is int64.
|
||||||
|
"""
|
||||||
|
output = logits.asnumpy()
|
||||||
|
probs = output.argmax(axis=-1)
|
||||||
|
labels = labels.asnumpy()
|
||||||
|
assert probs.shape[0] == labels.shape[0]
|
||||||
|
assert probs.shape[1] == labels.shape[1]
|
||||||
|
for i in range(probs.shape[0]):
|
||||||
|
start, end = 1, probs.shape[1]
|
||||||
|
while end > start:
|
||||||
|
if labels[i][end - 1] != 0:
|
||||||
|
break
|
||||||
|
end -= 1
|
||||||
|
prob, label = probs[i][start:end], labels[i][start:end]
|
||||||
|
for y_pred, y in zip(prob, label):
|
||||||
|
if y_pred == y:
|
||||||
|
self.tp[y] = self.tp.get(y, 0) + 1
|
||||||
|
else:
|
||||||
|
self.fp[y_pred] = self.fp.get(y_pred, 0) + 1
|
||||||
|
self.fn[y] = self.fn.get(y, 0) + 1
|
||||||
|
|
||||||
|
def eval(self):
|
||||||
|
"""
|
||||||
|
Calculate the final micro F1 score.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A scalar float: the calculated micro F1 score.
|
||||||
|
"""
|
||||||
|
tp_total = sum(self.tp.values())
|
||||||
|
fn_total = sum(self.fn.values())
|
||||||
|
fp_total = sum(self.fp.values())
|
||||||
|
if tp_total + fp_total == 0 or tp_total + fn_total == 0:
    return 0
p_total = float(tp_total) / (tp_total + fp_total)
r_total = float(tp_total) / (tp_total + fn_total)
if p_total + r_total == 0:
    return 0
|
||||||
|
f1_micro = 2 * p_total * r_total / (p_total + r_total)
|
||||||
|
return f1_micro
|
||||||
|
|
||||||
|
def name(self):
|
||||||
|
"""
|
||||||
|
Returns metric name
|
||||||
|
"""
|
||||||
|
return self._name
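# Worked example (a hedged sketch added for illustration, not part of the original file):
# with tp = 8, fp = 2 and fn = 4 summed over all slot labels, eval() computes
# precision = 8 / 10 = 0.8, recall = 8 / 12 ~= 0.667 and
# micro-F1 = 2 * 0.8 * 0.667 / (0.8 + 0.667) ~= 0.727.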
|
||||||
|
|
||||||
|
class JointAccuracy(Metric):
|
||||||
|
"""
|
||||||
|
The joint accuracy rate is used to evaluate the performance of multi-turn
Dialogue State Tracking. For each turn, the dialogue state prediction is
considered correct if and only if every state in state_list is predicted
correctly; such a turn contributes 1 to the joint accuracy, otherwise 0.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
super(JointAccuracy, self).__init__(*args, **kwargs)
|
||||||
|
self._name = 'JointAccuracy'
|
||||||
|
self.sigmoid = nn.Sigmoid()
|
||||||
|
self.clear()
|
||||||
|
|
||||||
|
def clear(self):
|
||||||
|
"""
|
||||||
|
Resets all of the metric state.
|
||||||
|
"""
|
||||||
|
self.num_samples = 0
|
||||||
|
self.correct_joint = 0.0
|
||||||
|
|
||||||
|
def update(self, logits, labels):
|
||||||
|
"""
|
||||||
|
Update the states based on the current mini-batch prediction results.
|
||||||
|
|
||||||
|
"""
|
||||||
|
probs = self.sigmoid(logits)
|
||||||
|
probs = probs.asnumpy()
|
||||||
|
labels = labels.asnumpy()
|
||||||
|
assert probs.shape[0] == labels.shape[0]
|
||||||
|
assert probs.shape[1] == labels.shape[1]
|
||||||
|
for i in range(probs.shape[0]):
|
||||||
|
pred, refer = [], []
|
||||||
|
for j in range(probs.shape[1]):
|
||||||
|
if probs[i][j] >= 0.5:
|
||||||
|
pred.append(j)
|
||||||
|
if labels[i][j] == 1:
|
||||||
|
refer.append(j)
|
||||||
|
if not pred:
|
||||||
|
pred = [np.argmax(probs[i])]
|
||||||
|
if pred == refer:
|
||||||
|
self.correct_joint += 1
|
||||||
|
self.num_samples += probs.shape[0]
|
||||||
|
|
||||||
|
def eval(self):
|
||||||
|
"""
|
||||||
|
Returns the results of the calculated JointAccuracy.
|
||||||
|
"""
|
||||||
|
joint_acc = self.correct_joint / self.num_samples
|
||||||
|
return joint_acc
|
||||||
|
|
||||||
|
def name(self):
|
||||||
|
"""
|
||||||
|
Returns metric name
|
||||||
|
"""
|
||||||
|
return self._name
|
||||||
|
|
||||||
|
class RecallAtK(Metric):
|
||||||
|
"""
|
||||||
|
Recall@K is the fraction of relevant results among the retrieved Top K
|
||||||
|
results, using to evaluate the performance of Dialogue Response Selection.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
super(RecallAtK, self).__init__(*args, **kwargs)
|
||||||
|
self._name = 'Recall@K'
|
||||||
|
self.softmax = nn.Softmax()
|
||||||
|
self.clear()
|
||||||
|
|
||||||
|
def clear(self):
|
||||||
|
"""
|
||||||
|
Resets all of the metric state.
|
||||||
|
"""
|
||||||
|
self.num_samples = 0
|
||||||
|
self.p_at_1_in_10 = 0.0
|
||||||
|
self.p_at_2_in_10 = 0.0
|
||||||
|
self.p_at_5_in_10 = 0.0
|
||||||
|
|
||||||
|
def get_p_at_n_in_m(self, data, n, m, idx):
|
||||||
|
"""
|
||||||
|
check whether the positive candidate is ranked within the top n of the m candidates
|
||||||
|
"""
|
||||||
|
pos_score = data[idx][0]
|
||||||
|
curr = data[idx:idx + m]
|
||||||
|
curr = sorted(curr, key=lambda x: x[0], reverse=True)
|
||||||
|
if curr[n - 1][0] <= pos_score:
|
||||||
|
return 1
|
||||||
|
return 0
|
||||||
|
|
||||||
|
def update(self, logits, labels):
|
||||||
|
"""
|
||||||
|
Update the states based on the current mini-batch prediction results.
|
||||||
|
Args:
|
||||||
|
logits (Tensor): The predicted value is a Tensor with
|
||||||
|
shape [batch_size, 2] and type float32 or float64.
|
||||||
|
labels (Tensor): The ground truth value is a 2D Tensor,
|
||||||
|
its shape is [batch_size, 1] and type is int64.
|
||||||
|
"""
|
||||||
|
probs = self.softmax(logits)
|
||||||
|
probs = probs.asnumpy()
|
||||||
|
labels = labels.asnumpy()
|
||||||
|
assert probs.shape[0] == labels.shape[0]
|
||||||
|
data = []
|
||||||
|
for prob, label in zip(probs, labels):
|
||||||
|
data.append((prob[1], label))
|
||||||
|
assert len(data) % 10 == 0
|
||||||
|
|
||||||
|
length = int(len(data) / 10)
|
||||||
|
self.num_samples += length
|
||||||
|
for i in range(length):
|
||||||
|
idx = i * 10
|
||||||
|
assert data[idx][1] == 1
|
||||||
|
self.p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, idx)
|
||||||
|
self.p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, idx)
|
||||||
|
self.p_at_5_in_10 += self.get_p_at_n_in_m(data, 5, 10, idx)
|
||||||
|
|
||||||
|
def eval(self):
|
||||||
|
"""
|
||||||
|
Calculate the final Recall@K.
|
||||||
|
Returns a list of scalar floats: results of the calculated R1@K, R2@K, R5@K.
|
||||||
|
"""
|
||||||
|
metrics_out = [
|
||||||
|
self.p_at_1_in_10 / self.num_samples,
self.p_at_2_in_10 / self.num_samples,
self.p_at_5_in_10 / self.num_samples
|
||||||
|
]
|
||||||
|
return metrics_out
|
||||||
|
|
||||||
|
def name(self):
|
||||||
|
"""
|
||||||
|
Returns metric name
|
||||||
|
"""
|
||||||
|
return self._name
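# Illustrative note (added for clarity, not part of the original file): update() assumes
# the UDC evaluation layout of 10 consecutive candidates per context, with the positive
# response always first in each group of 10 (hence the assertion data[idx][1] == 1).
# R1@10, R2@10 and R5@10 then count how often that positive response is ranked within
# the top 1, 2 or 5 candidates by its softmax score for the positive class.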
|
|
@ -0,0 +1,94 @@
|
||||||
|
# Copyright 2021 Huawei Technologies Co., Ltd
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
"""
|
||||||
|
Convert the pretrained model from PaddlePaddle pdparams to a MindSpore ckpt.
|
||||||
|
"""
|
||||||
|
import collections
|
||||||
|
import os
|
||||||
|
import paddle.fluid.dygraph as D
|
||||||
|
from paddle import fluid
|
||||||
|
from mindspore import Tensor
|
||||||
|
from mindspore.train.serialization import save_checkpoint
|
||||||
|
|
||||||
|
def build_params_map(attention_num=12):
|
||||||
|
"""
|
||||||
|
build params map from paddle-paddle's BERT to mindspore's BERT
|
||||||
|
:return:
|
||||||
|
"""
|
||||||
|
weight_map = collections.OrderedDict({
|
||||||
|
'bert.embeddings.word_embeddings.weight': "bert.bert.bert_embedding_lookup.embedding_table",
|
||||||
|
'bert.embeddings.token_type_embeddings.weight': "bert.bert.bert_embedding_postprocessor.embedding_table",
|
||||||
|
'bert.embeddings.position_embeddings.weight': "bert.bert.bert_embedding_postprocessor.full_position_embeddings",
|
||||||
|
'bert.embeddings.layer_norm.weight': 'bert.bert.bert_embedding_postprocessor.layernorm.gamma',
|
||||||
|
'bert.embeddings.layer_norm.bias': 'bert.bert.bert_embedding_postprocessor.layernorm.beta',
|
||||||
|
})
|
||||||
|
# add attention layers
|
||||||
|
for i in range(attention_num):
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.q_proj.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.query_layer.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.q_proj.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.query_layer.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.k_proj.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.key_layer.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.k_proj.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.key_layer.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.v_proj.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.value_layer.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.v_proj.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.attention.value_layer.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.out_proj.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.output.dense.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.self_attn.out_proj.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.output.dense.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.linear1.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.intermediate.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.linear1.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.intermediate.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.linear2.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.output.dense.weight'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.linear2.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.output.dense.bias'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.norm1.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.output.layernorm.gamma'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.norm1.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.attention.output.layernorm.beta'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.norm2.weight'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.output.layernorm.gamma'
|
||||||
|
weight_map[f'bert.encoder.layers.{i}.norm2.bias'] = \
|
||||||
|
f'bert.bert.bert_encoder.layers.{i}.output.layernorm.beta'
|
||||||
|
# add pooler
|
||||||
|
weight_map.update(
|
||||||
|
{
|
||||||
|
'bert.pooler.dense.weight': 'bert.bert.dense.weight',
|
||||||
|
'bert.pooler.dense.bias': 'bert.bert.dense.bias'
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return weight_map
|
||||||
|
|
||||||
|
input_dir = '.'
|
||||||
|
state_dict = []
|
||||||
|
bert_weight_map = build_params_map(attention_num=12)
|
||||||
|
with fluid.dygraph.guard():
|
||||||
|
paddle_paddle_params, _ = D.load_dygraph(os.path.join(input_dir, 'bert-base-uncased'))
|
||||||
|
for weight_name, weight_value in paddle_paddle_params.items():
|
||||||
|
if 'weight' in weight_name:
|
||||||
|
if 'encoder' in weight_name or 'pooler' in weight_name or \
|
||||||
|
'predictions' in weight_name or 'seq_relationship' in weight_name:
|
||||||
|
weight_value = weight_value.transpose()
|
||||||
|
if weight_name in bert_weight_map.keys():
|
||||||
|
state_dict.append({'name': bert_weight_map[weight_name], 'data': Tensor(weight_value)})
|
||||||
|
print(weight_name, '->', bert_weight_map[weight_name], weight_value.shape)
|
||||||
|
save_checkpoint(state_dict, 'base-BertCLS-111.ckpt')
|
|
@ -0,0 +1,302 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
Tokenizer used in Bert finetune and evaluation.
"""

import collections
import io
import unicodedata

import six

class FullTokenizer():
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)

        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

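# A minimal usage sketch (assumption: a local copy of bert-base-uncased-vocab.txt as referenced
# in the data-preparation section; the tokenized output shown is illustrative only):
#
#   tokenizer = FullTokenizer(vocab_file='bert-base-uncased-vocab.txt', do_lower_case=True)
#   tokens = tokenizer.tokenize("Book a flight to Boston")
#   # e.g. ['book', 'a', 'flight', 'to', 'boston']
#   ids = tokenizer.convert_tokens_to_ids(tokens)
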
class BasicTokenizer():
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.

        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and are handled
        # like all of the other languages.
        if ((0x4E00 <= cp <= 0x9FFF) or
                (0x3400 <= cp <= 0x4DBF) or
                (0x20000 <= cp <= 0x2A6DF) or
                (0x2A700 <= cp <= 0x2B73F) or
                (0x2B740 <= cp <= 0x2B81F) or
                (0x2B820 <= cp <= 0x2CEAF) or
                (0xF900 <= cp <= 0xFAFF) or
                (0x2F800 <= cp <= 0x2FA1F)):
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

class WordpieceTokenizer():
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.

        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]

        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.

        Returns:
            A list of wordpiece tokens.
        """

        text = convert_to_unicode(text)

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens

def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        if isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        raise ValueError("Unsupported string type: %s" % (type(text)))
    if six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        if isinstance(text, unicode):
            return text
        raise ValueError("Unsupported string type: %s" % (type(text)))
    raise ValueError("Not running on Python2 or Python 3?")

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    fin = io.open(vocab_file, encoding="utf8")
    for num, line in enumerate(fin):
        items = convert_to_unicode(line.strip()).split("\t")
        if len(items) > 2:
            break
        token = items[0]
        index = items[1] if len(items) == 2 else num
        token = token.strip()
        vocab[token] = int(index)
    return vocab

def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output

def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens

def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char in (' ', '\t', '\n', '\r'):
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False

def _is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char in ('\t', '\n', '\r'):
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False

def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((33 <= cp <= 47) or (58 <= cp <= 64) or
            (91 <= cp <= 96) or (123 <= cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
@ -0,0 +1,284 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
Functional Cells used in Bert finetune and evaluation.
"""

import collections
import math
import os
import numpy as np

import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C
import mindspore.nn as nn
import mindspore.ops as P

from mindspore import dtype as mstype
from mindspore import log as logger
from mindspore._checkparam import Validator as validator
from mindspore.common.tensor import Tensor
from mindspore.nn.learning_rate_schedule import (LearningRateSchedule,
                                                 PolynomialDecayLR, WarmUpLR)
from mindspore.train.callback import Callback

def create_classification_dataset(batch_size=32, repeat_count=1,
                                  data_file_path=None, schema_file_path=None, do_shuffle=True):
    """create finetune or evaluation dataset from mindrecord file"""
    type_cast_op = C.TypeCast(mstype.int32)
    data_set = ds.MindDataset([data_file_path], \
        columns_list=["input_ids", "input_mask", "segment_ids", "label_ids"], shuffle=do_shuffle)

    data_set = data_set.map(operations=type_cast_op, input_columns="label_ids")
    data_set = data_set.map(operations=type_cast_op, input_columns="input_mask")
    data_set = data_set.map(operations=type_cast_op, input_columns="segment_ids")
    data_set = data_set.map(operations=type_cast_op, input_columns="input_ids")
    #data_set = data_set.repeat(repeat_count)
    data_set = data_set.batch(batch_size, drop_remainder=True)
    return data_set

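# A minimal usage sketch (assumption: 'atis_intent_train.mindrecord' is a placeholder for a
# MindRecord file produced by the data preprocessing step described in the README):
#
#   train_dataset = create_classification_dataset(batch_size=32,
#                                                 data_file_path='atis_intent_train.mindrecord',
#                                                 do_shuffle=True)
#   print(train_dataset.get_dataset_size())   # number of batches per epoch
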
class CustomWarmUpLR(LearningRateSchedule):
    """
    Learning rate schedule with linear warmup followed by linear decay.
    """
    def __init__(self, learning_rate, warmup_steps, max_train_steps):
        super(CustomWarmUpLR, self).__init__()
        if not isinstance(learning_rate, float):
            raise TypeError("learning_rate must be float.")
        validator.check_non_negative_float(learning_rate, "learning_rate", self.cls_name)
        validator.check_positive_int(warmup_steps, 'warmup_steps', self.cls_name)
        self.warmup_steps = warmup_steps
        self.learning_rate = learning_rate
        self.max_train_steps = max_train_steps
        self.cast = P.Cast()

    def construct(self, current_step):
        if current_step < self.warmup_steps:
            warmup_percent = self.cast(current_step, mstype.float32) / self.warmup_steps
        else:
            warmup_percent = 1 - self.cast(current_step, mstype.float32) / self.max_train_steps

        return self.learning_rate * warmup_percent

class CrossEntropyCalculation(nn.Cell):
    """
    Cross Entropy loss
    """
    def __init__(self, is_training=True):
        super(CrossEntropyCalculation, self).__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0, mstype.float32)
        self.off_value = Tensor(0.0, mstype.float32)
        self.reduce_sum = P.ReduceSum()
        self.reduce_mean = P.ReduceMean()
        self.reshape = P.Reshape()
        self.last_idx = (-1,)
        self.neg = P.Neg()
        self.cast = P.Cast()
        self.is_training = is_training

    def construct(self, logits, label_ids, num_labels):
        if self.is_training:
            label_ids = self.reshape(label_ids, self.last_idx)
            one_hot_labels = self.onehot(label_ids, num_labels, self.on_value, self.off_value)
            per_example_loss = self.neg(self.reduce_sum(one_hot_labels * logits, self.last_idx))
            loss = self.reduce_mean(per_example_loss, self.last_idx)
            return_value = self.cast(loss, mstype.float32)
        else:
            return_value = logits * 1.0
        return return_value

def make_directory(path: str):
    """Make directory."""
    if path is None or not isinstance(path, str) or path.strip() == "":
        logger.error("The path(%r) is invalid type.", path)
        raise TypeError("Input path is invalid type")

    # convert relative paths to absolute paths
    path = os.path.realpath(path)
    logger.debug("The abs path is %r", path)

    # check whether the path exists and is writable
    if os.path.exists(path):
        real_path = path
    else:
        # All exceptions need to be caught because directory creation may hit limits (e.g. permissions)
        logger.debug("The directory(%s) doesn't exist, will create it", path)
        try:
            os.makedirs(path, exist_ok=True)
            real_path = path
        except PermissionError as e:
            logger.error("No write permission on the directory(%r), error = %r", path, e)
            raise TypeError("No write permission on the directory.")
    return real_path

class LossCallBack(Callback):
    """
    Monitor the loss in training and print it after each step.
    Args:
        dataset_size (int): Number of batches in one epoch, used to report epoch progress.
            If it is not positive, only the epoch and step numbers are printed. Default: -1.
    """
    def __init__(self, dataset_size=-1):
        super(LossCallBack, self).__init__()
        self._dataset_size = dataset_size

    def step_end(self, run_context):
        """
        Print loss after each step
        """
        cb_params = run_context.original_args()
        if self._dataset_size > 0:
            percent, epoch_num = math.modf(cb_params.cur_step_num / self._dataset_size)
            if percent == 0:
                percent = 1
                epoch_num -= 1
            print("epoch: {}, current epoch percent: {}, step: {}, outputs are {}"
                  .format(int(epoch_num), "%.3f" % percent, cb_params.cur_step_num, str(cb_params.net_outputs)),
                  flush=True)
        else:
            print("epoch: {}, step: {}, outputs are {}".format(cb_params.cur_epoch_num, cb_params.cur_step_num,
                                                               str(cb_params.net_outputs)), flush=True)

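# A minimal sketch (hypothetical names) of how LossCallBack is typically attached to training;
# `model` and `train_dataset` are assumed to be a mindspore.Model and the dataset returned by
# create_classification_dataset, and TimeMonitor would need to be imported from
# mindspore.train.callback:
#
#   callbacks = [TimeMonitor(train_dataset.get_dataset_size()),
#                LossCallBack(train_dataset.get_dataset_size())]
#   model.train(epoch_num, train_dataset, callbacks=callbacks)
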
def LoadNewestCkpt(load_finetune_checkpoint_dir, steps_per_epoch, epoch_num, prefix):
    """
    Find the newest checkpoint generated by finetuning and return its path for the eval network.
    """
    files = os.listdir(load_finetune_checkpoint_dir)
    pre_len = len(prefix)
    max_num = 0
    for filename in files:
        name_ext = os.path.splitext(filename)
        if name_ext[-1] != ".ckpt":
            continue
        if filename.find(prefix) == 0 and not filename[pre_len].isalpha():
            index = filename[pre_len:].find("-")
            if index == 0 and max_num == 0:
                load_finetune_checkpoint_path = os.path.join(load_finetune_checkpoint_dir, filename)
            elif index not in (0, -1):
                name_split = name_ext[-2].split('_')
                if (steps_per_epoch != int(name_split[len(name_split)-1])) \
                        or (epoch_num != int(filename[pre_len + index + 1:pre_len + index + 2])):
                    continue
                num = filename[pre_len + 1:pre_len + index]
                if int(num) > max_num:
                    max_num = int(num)
                    load_finetune_checkpoint_path = os.path.join(load_finetune_checkpoint_dir, filename)
    return load_finetune_checkpoint_path

def GetAllCkptPath(save_finetune_checkpoint_path):
    """Collect the paths of all checkpoint files under the given directory."""
    files_list = os.listdir(save_finetune_checkpoint_path)
    ckpt_list = []
    for filename in files_list:
        if '.ckpt' in filename:
            load_finetune_checkpoint_dir = os.path.join(save_finetune_checkpoint_path, filename)
            ckpt_list.append(load_finetune_checkpoint_dir)
            #print(load_finetune_checkpoint_dir)
    return ckpt_list

class BertLearningRate(LearningRateSchedule):
    """
    Warmup-decay learning rate for Bert network.
    """
    def __init__(self, learning_rate, end_learning_rate, warmup_steps, decay_steps, power):
        super(BertLearningRate, self).__init__()
        self.warmup_flag = False
        if warmup_steps > 0:
            self.warmup_flag = True
            self.warmup_lr = WarmUpLR(learning_rate, warmup_steps)
        self.decay_lr = PolynomialDecayLR(learning_rate, end_learning_rate, decay_steps, power)
        self.warmup_steps = Tensor(np.array([warmup_steps]).astype(np.float32))

        self.greater = P.Greater()
        self.one = Tensor(np.array([1.0]).astype(np.float32))
        self.cast = P.Cast()

    def construct(self, global_step):
        decay_lr = self.decay_lr(global_step)
        if self.warmup_flag:
            is_warmup = self.cast(self.greater(self.warmup_steps, global_step), mstype.float32)
            warmup_lr = self.warmup_lr(global_step)
            lr = (self.one - is_warmup) * decay_lr + is_warmup * warmup_lr
        else:
            lr = decay_lr
        return lr

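# A minimal sketch (assumed hyper-parameter values) of building the warmup-decay schedule and
# handing it to an optimizer; nn.AdamWeightDecay is commonly used for BERT finetuning in
# MindSpore, and `network`, `steps_per_epoch`, `epoch_num` are placeholders:
#
#   lr_schedule = BertLearningRate(learning_rate=2e-5, end_learning_rate=1e-7,
#                                  warmup_steps=int(steps_per_epoch * 0.1),
#                                  decay_steps=steps_per_epoch * epoch_num, power=1.0)
#   optimizer = nn.AdamWeightDecay(network.trainable_params(), learning_rate=lr_schedule)
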
def convert_labels_to_index(label_list):
    """
    Convert label_list to indices for NER task.
    """
    label2id = collections.OrderedDict()
    label2id["O"] = 0
    prefix = ["S_", "B_", "M_", "E_"]
    index = 0
    for label in label_list:
        for pre in prefix:
            index += 1
            sub_label = pre + label
            label2id[sub_label] = index
    return label2id

def _get_poly_lr(global_step, lr_init, lr_end, lr_max, warmup_steps, total_steps, poly_power):
    """
    generate learning rate array

    Args:
        global_step(int): current step
        lr_init(float): initial learning rate
        lr_end(float): end learning rate
        lr_max(float): max learning rate
        warmup_steps(int): number of warmup steps
        total_steps(int): total number of training steps
        poly_power(int): poly learning rate power

    Returns:
        np.array, learning rate array
    """
    lr_each_step = []
    if warmup_steps != 0:
        inc_each_step = (float(lr_max) - float(lr_init)) / float(warmup_steps)
    else:
        inc_each_step = 0
    for i in range(total_steps):
        if i < warmup_steps:
            lr = float(lr_init) + inc_each_step * float(i)
        else:
            base = (1.0 - (float(i) - float(warmup_steps)) / (float(total_steps) - float(warmup_steps)))
            lr = float(lr_max - lr_end) * (base ** poly_power)
            lr = lr + lr_end
            if lr < 0.0:
                lr = 0.0
        lr_each_step.append(lr)

    learning_rate = np.array(lr_each_step).astype(np.float32)
    current_step = global_step
    learning_rate = learning_rate[current_step:]
    return learning_rate

def get_bert_thor_lr(lr_max=0.0034, lr_min=3.244e-05, lr_power=1.0, lr_total_steps=30000):
    learning_rate = _get_poly_lr(global_step=0, lr_init=0.0, lr_end=lr_min, lr_max=lr_max, warmup_steps=0,
                                 total_steps=lr_total_steps, poly_power=lr_power)
    return Tensor(learning_rate)


def get_bert_thor_damping(damping_max=5e-2, damping_min=1e-6, damping_power=1.0, damping_total_steps=30000):
    damping = _get_poly_lr(global_step=0, lr_init=0.0, lr_end=damping_min, lr_max=damping_max, warmup_steps=0,
                           total_steps=damping_total_steps, poly_power=damping_power)
    return Tensor(damping)