!17291  Inception_ResNet_v2 for master

Merge pull request !17291 from wittlu/master
This commit is contained in:
i-robot 2021-07-30 01:22:03 +00:00 committed by Gitee
commit b99b3833cb
12 changed files with 1291 additions and 0 deletions


@ -0,0 +1,224 @@
# Inception_ResNet_v2 for Ascend
- [Inception_ResNet_v2 Description](#inception_resnet_v2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [Inception_ResNet_v2 Description](#contents)
Inception_ResNet_v2 is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3. This idea was proposed in the paper Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, published in 2016.
[Paper](https://arxiv.org/pdf/1602.07261.pdf) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Computer Vision and Pattern Recognition[J]. 2016.
# [Model architecture](#contents)
The overall network architecture of Inception_ResNet_v2 is shown below:
[Link](https://arxiv.org/pdf/1602.07261.pdf)
# [Dataset](#contents)
The dataset used is described in the paper.
- Dataset size: 125G, 1.25 million color images in 1000 classes
- Train: 120G, 1.2 million images
- Test: 5G, 50k images
- Data format: RGB images.
- Note: Data will be processed in src/dataset.py
# [Features](#contents)
## [Mixed Precision(Ascend)](#contents)
The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.
For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for "reduce precision".
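The numeric effect of dropping from FP32 to FP16 can be illustrated with a small NumPy sketch (this only demonstrates the precision loss, not MindSpore's internal handling):

```python
import numpy as np

# FP32 carries about 7 significant decimal digits, FP16 only about 3,
# so casting down introduces a small rounding error.
x32 = np.float32(1.0) / np.float32(3.0)   # one third in single precision
x16 = np.float16(x32)                     # the same value in half precision
err = abs(float(x16) - float(x32))        # rounding error introduced by FP16
print(x32, float(x16), err)
```

Loss scaling (the `loss_scale` parameter in config.py) exists to keep small gradient values representable despite exactly this kind of rounding.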
# [Environment Requirements](#contents)
- Hardware (Ascend)
    - Prepare a hardware environment with an Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script description](#contents)
## [Script and sample code](#contents)
```shell
.
└─inception_resnet_v2
  ├─README.md
  ├─scripts
    ├─run_standalone_train_ascend.sh  # launch standalone training with ascend platform(1p)
    ├─run_distribute_train_ascend.sh  # launch distributed training with ascend platform(8p)
    └─run_eval_ascend.sh              # launch evaluating with ascend platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─inception_resnet_v2.py          # network definition
    └─callback.py                     # eval callback function
  ├─eval.py                           # eval net
  ├─export.py                         # export checkpoint, support .onnx, .air, .mindir convert
  └─train.py                          # train net
```
## [Script Parameters](#contents)
```python
Major parameters in train.py and config.py are:
'is_save_on_master'          # save checkpoint only on the master device
'batch_size'                 # input batch size
'epoch_size'                 # total number of epochs
'num_classes'                # number of dataset classes
'work_nums'                  # number of workers to read data
'loss_scale'                 # loss scale
'smooth_factor'              # label smoothing factor
'weight_decay'               # weight decay
'momentum'                   # momentum
'amp_level'                  # mixed precision level, supports [O0, O2, O3]
'decay'                      # decay used in the optimizer
'epsilon'                    # epsilon used in the optimizer
'keep_checkpoint_max'        # max number of checkpoints to keep
'save_checkpoint_epochs'     # save checkpoints every n epochs
'lr_init'                    # initial learning rate
'lr_end'                     # final learning rate
'lr_max'                     # maximum learning rate
'warmup_epochs'              # number of warmup epochs
'start_epoch'                # start epoch, range [1, epoch_size]
```
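A plausible way the three `lr_*` parameters combine is linear warmup from `lr_init` to `lr_max` over `warmup_epochs`, followed by decay toward `lr_end`. The actual schedule is generated in train.py and may differ, so treat this as a hypothetical sketch:

```python
def warmup_decay_lr(lr_init, lr_end, lr_max, warmup_epochs, total_epochs, steps_per_epoch):
    """Hypothetical schedule: linear warmup to lr_max, then linear decay to lr_end."""
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * step / warmup_steps
        else:
            frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
            lr = lr_max - (lr_max - lr_end) * frac
        lrs.append(lr)
    return lrs

# values from src/config.py; 1251 steps per epoch as seen in the training log
schedule = warmup_decay_lr(0.00004, 0.000004, 0.4, 1, 250, 1251)
```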
## [Training process](#contents)
### Usage
You can start training using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
> Notes:
> RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html), and the device_ip can be obtained as in [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools). For large models like Inception_ResNet_v2, it is better to export the environment variable `export HCCL_CONNECT_TIMEOUT=600` to extend the HCCL connection-checking time from the default 120 seconds to 600 seconds. Otherwise, the connection could time out, since compile time increases with model size.
>
> The script binds processor cores to each rank based on `device_num` and the total number of processor cores. If you do not want core binding, remove the `taskset` invocation in `scripts/run_distribute_train_ascend.sh`.
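The core-binding arithmetic in `scripts/run_distribute_train_ascend.sh` can be reproduced in a few lines of Python:

```python
def core_ranges(total_cores, rank_size):
    """Each rank gets total_cores // rank_size contiguous logical cores,
    expressed as a taskset-style range string like "0-23"."""
    avg = total_cores // rank_size
    return ["{}-{}".format(i * avg, i * avg + avg - 1) for i in range(rank_size)]

# e.g. a 192-core host split across 8 ranks
ranges = core_ranges(192, 8)
print(ranges)
```

Each range string is what the script passes to `taskset -c` so that ranks do not compete for the same cores.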
### Launch
```bash
# training example
shell:
Ascend:
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
### Result
Training results will be stored in the example path. Checkpoints will be stored at `ckpt_path` by default, and the training log will be redirected to `./log.txt` as follows.
```python
epoch: 1 step: 1251, loss is 5.4833196
Epoch time: 520274.060, per step time: 415.887
epoch: 2 step: 1251, loss is 4.093194
Epoch time: 288520.628, per step time: 230.632
epoch: 3 step: 1251, loss is 3.6242008
Epoch time: 288507.506, per step time: 230.622
```
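The per-step figure in the log is just the epoch wall time divided by the step count, e.g. for epoch 2:

```python
epoch_time_ms = 288520.628    # epoch 2 wall time from the log above
steps = 1251                  # steps per epoch
per_step_ms = epoch_time_ms / steps
print(round(per_step_ms, 3))  # matches the 230.632 printed in the log
```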
## [Evaluation process](#contents)
### Usage
You can start evaluation using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
```
> Checkpoints are produced during the training process.
### Result
Evaluation results will be stored in the example path; you can find results like the following in `eval.log`.
```python
metric: {'Loss': 1.0413, 'Top1-Acc':0.79955, 'Top5-Acc':0.9439}
```
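Top-1/Top-5 accuracy count a sample as correct when its true label is among the 1 or 5 highest logits. A minimal NumPy sketch of the computation (the actual run uses MindSpore's nn.Top1CategoricalAccuracy / nn.Top5CategoricalAccuracy):

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k largest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]          # indices of the k largest entries
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 2, 2])
print(topk_accuracy(logits, labels, 1))  # 2 of 3 samples have the top logit right
```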
# [Model description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | Ascend |
| -------------------------- | ------------------------------------------------------------ |
| Model Version | Inception ResNet v2 |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB |
| Uploaded Date              | 11/04/2020 |
| MindSpore Version | 1.2.0 |
| Dataset | 1200k images |
| Batch_size | 128 |
| Training Parameters | src/config.py |
| Optimizer | RMSProp |
| Loss Function | SoftmaxCrossEntropyWithLogits |
| Outputs | probability |
| Total time (8p) | 24h |
#### Inference Performance
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | Inception ResNet v2 |
| Resource            | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB |
| Uploaded Date | 11/04/2020 |
| MindSpore Version | 1.2.0 |
| Dataset | 50k images |
| Batch_size | 128 |
| Outputs | probability |
| Accuracy | ACC1[79.96%] ACC5[94.40%] |
#### Training performance results
| **Ascend** | train performance |
| :--------: | :---------------: |
| 1p | 556 img/s |
| 8p | 4430 img/s |
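The throughput figures follow directly from the per-step times in the training log: with batch size 128 and roughly 230.632 ms per step on one device,

```python
batch_size = 128
per_step_s = 0.230632        # steady-state per-step time from the training log
throughput_1p = batch_size / per_step_s
print(round(throughput_1p))  # ~555 img/s, consistent with the ~556 img/s in the table
```

The 8p figure (4430 img/s) is close to, but slightly below, 8x the 1p figure, which is the usual cost of gradient synchronization.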
# [Description of Random Situation](#contents)
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@ -0,0 +1,219 @@
# Contents
<!-- TOC -->
- [Contents](#contents)
- [Inception_ResNet_v2 Description](#inception_resnet_v2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision (Ascend)](#mixed-precision-ascend)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Usage](#usage)
        - [Launch](#launch)
        - [Result](#result)
    - [Evaluation Process](#evaluation-process)
        - [Usage](#usage-1)
        - [Launch](#launch-1)
        - [Result](#result-1)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->
# Inception_ResNet_v2 Description
Inception_ResNet_v2 is one of Google's deep convolutional architectures in the Inception family. It mainly reduces computational cost by revising the earlier Inception architectures. The method was proposed in the paper Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, published in 2016.
[Paper](https://arxiv.org/pdf/1602.07261.pdf) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Computer Vision and Pattern Recognition[J]. 2016.
# Model Architecture
The overall network architecture of Inception_ResNet_v2 is shown below:
[Link](https://arxiv.org/pdf/1602.07261.pdf)
# Dataset
The dataset used is described in the paper.
- Dataset size: 125G, 1.25 million color images in 1000 classes
- Train: 120G, 1.2 million images
- Test: 5G, 50k images
- Data format: RGB
- Note: data will be processed in src/dataset.py.
# Features
## Mixed Precision (Ascend)
The [mixed precision](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/enable_mixed_precision.html) training method accelerates deep neural network training by using both single-precision and half-precision data formats while preserving the accuracy reached by single-precision training. It speeds up computation and reduces memory usage, allowing larger models or batch sizes to be trained on specific hardware.
Taking FP16 operators as an example: if the input data type is FP32, the MindSpore backend automatically lowers the precision to process the data. Users can enable the INFO log and search for "reduce precision" to see which operators run at reduced precision.
# Environment Requirements
- Hardware (Ascend)
    - Set up a hardware environment with an Ascend processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
# Script Description
## Script and Sample Code
```shell
.
└─inception_resnet_v2
  ├─README.md
  ├─scripts
    ├─run_standalone_train_ascend.sh  # launch standalone training with ascend platform(1p)
    ├─run_distribute_train_ascend.sh  # launch distributed training with ascend platform(8p)
    └─run_eval_ascend.sh              # launch evaluating with ascend platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─inception_resnet_v2.py          # network definition
    └─callback.py                     # eval callback function
  ├─eval.py                           # eval net
  ├─export.py                         # export checkpoint, support .onnx, .air, .mindir convert
  └─train.py                          # train net
```
## Script Parameters
```python
Major parameters in train.py and config.py are:
'is_save_on_master'          # save checkpoint only on the master device
'batch_size'                 # input batch size
'epoch_size'                 # total number of epochs
'num_classes'                # number of dataset classes
'work_nums'                  # number of workers to read data
'loss_scale'                 # loss scale
'smooth_factor'              # label smoothing factor
'weight_decay'               # weight decay
'momentum'                   # momentum
'amp_level'                  # mixed precision level, supports [O0, O2, O3]
'decay'                      # decay used in the optimizer
'epsilon'                    # epsilon used in the optimizer
'keep_checkpoint_max'        # max number of checkpoints to keep
'save_checkpoint_epochs'     # save checkpoints every n epochs
'lr_init'                    # initial learning rate
'lr_end'                     # final learning rate
'lr_max'                     # maximum learning rate
'warmup_epochs'              # number of warmup epochs
'start_epoch'                # start epoch, range [1, epoch_size]
```
## Training Process
### Usage
You can start training using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
> Note: RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/distributed_training_ascend.html), and the device_ip can be obtained at [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
### Result
Training results are stored in the example path. Checkpoints are stored in `checkpoint` by default, and the training log is redirected to `./log.txt`, as follows:
#### Ascend
```python
epoch: 1 step: 1251, loss is 5.4833196
Epoch time: 520274.060, per step time: 415.887
epoch: 2 step: 1251, loss is 4.093194
Epoch time: 288520.628, per step time: 230.632
epoch: 3 step: 1251, loss is 3.6242008
Epoch time: 288507.506, per step time: 230.622
```
## Evaluation Process
### Usage
You can start evaluation using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
```
> Checkpoints are produced during the training process.
### Result
Evaluation results are stored in the example path; you can find results like the following in `eval.log`.
```log
metric: {'Loss': 1.0413, 'Top1-Acc':0.79955, 'Top5-Acc':0.9439}
```
## Model Export
```shell
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT]
```
`EXPORT_FORMAT` should be in ["AIR", "MINDIR"]
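For context, `export.py` traces the graph by feeding an all-ones dummy tensor in NCHW layout through the network; a sketch of the shape it builds (note that NCHW puts height before width):

```python
def dummy_input_shape(batch_size, height, width, channels=3):
    """NCHW shape for the tracing tensor that export.py passes to export()."""
    return [batch_size, channels, height, width]

print(dummy_input_shape(128, 299, 299))  # [128, 3, 299, 299]
```

With the default 299x299 input the order of height and width does not matter, but it would for a non-square input.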
# Model Description
## Performance
### Training Performance
| Parameters                 | Ascend                                         |
| -------------------------- | ---------------------------------------------- |
| Model Version              | Inception ResNet v2                            |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| MindSpore Version          | 0.6.0-beta                                     |
| Dataset                    | 1.2 million images                             |
| Batch_size                 | 128                                            |
| Training Parameters        | src/config.py                                  |
| Optimizer                  | RMSProp                                        |
| Loss Function              | SoftmaxCrossEntropyWithLogits                  |
| Outputs                    | probability                                    |
| Loss                       | 1.98                                           |
| Total time (8p)            | 24h                                            |
#### Inference Performance
| Parameters          | Ascend                      |
| ------------------- | --------------------------- |
| Model Version       | Inception ResNet v2         |
| Resource            | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| MindSpore Version   | 1.2.0                       |
| Dataset             | 50k images                  |
| Batch_size          | 128                         |
| Accuracy            | ACC1[79.96%] ACC5[94.40%]   |
# Description of Random Situation
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@ -0,0 +1,59 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""evaluate_imagenet"""
import argparse
import os
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from src.dataset import create_dataset
from src.inception_resnet_v2 import Inception_resnet_v2
from src.config import config_ascend as config
def parse_args():
    '''parse_args'''
    parser = argparse.ArgumentParser(description='image classification evaluation')
    parser.add_argument('--platform', type=str, default='Ascend', choices=('Ascend', 'GPU'), help='run platform')
    parser.add_argument('--dataset_path', type=str, default='', help='Dataset path')
    parser.add_argument('--checkpoint_path', type=str, default='', help='checkpoint of inception_resnet_v2')
    args_opt = parser.parse_args()
    return args_opt

if __name__ == '__main__':
    args = parse_args()
    if args.platform == 'Ascend':
        device_id = int(os.getenv('DEVICE_ID', '0'))
        context.set_context(device_id=device_id)
    context.set_context(mode=context.GRAPH_MODE, device_target=args.platform)
    net = Inception_resnet_v2(classes=config.num_classes, is_train=False)
    ckpt = load_checkpoint(args.checkpoint_path)
    load_param_into_net(net, ckpt)
    net.set_train(False)
    dataset = create_dataset(dataset_path=args.dataset_path, do_train=False,
                             repeat_num=1, batch_size=config.batch_size)
    loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    eval_metrics = {'Loss': nn.Loss(),
                    'Top1-Acc': nn.Top1CategoricalAccuracy(),
                    'Top5-Acc': nn.Top5CategoricalAccuracy()}
    model = Model(net, loss, optimizer=None, metrics=eval_metrics)
    print('=' * 20, 'Evaluate start', '=' * 20)
    metrics = model.eval(dataset)
    print("metric: ", metrics)


@ -0,0 +1,47 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""export checkpoint file into air, onnx, mindir models"""
import argparse
import numpy as np
from mindspore import Tensor, dtype
from mindspore.train.serialization import load_checkpoint, load_param_into_net, export, context
from src.config import config_ascend as config
from src.inception_resnet_v2 import Inception_resnet_v2
parser = argparse.ArgumentParser(description='inception_resnet_v2 export')
parser.add_argument("--device_id", type=int, default=0, help="Device id")
parser.add_argument('--ckpt_file', type=str, required=True, help='inception_resnet_v2 ckpt file.')
parser.add_argument('--file_name', type=str, default='inception_resnet_v2', help='inception_resnet_v2 output air name.')
parser.add_argument('--file_format', type=str, choices=["AIR", "ONNX", "MINDIR"], default='AIR', help='file format')
parser.add_argument('--width', type=int, default=299, help='input width')
parser.add_argument('--height', type=int, default=299, help='input height')
parser.add_argument("--device_target", type=str, choices=["Ascend", "GPU", "CPU"], default="Ascend",
                    help="device target")
args = parser.parse_args()

context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
if args.device_target == "Ascend":
    context.set_context(device_id=args.device_id)

if __name__ == '__main__':
    net = Inception_resnet_v2(classes=config.num_classes)
    param_dict = load_checkpoint(args.ckpt_file)
    load_param_into_net(net, param_dict)
    # NCHW layout: height comes before width
    input_arr = Tensor(np.ones([config.batch_size, 3, args.height, args.width]), dtype.float32)
    export(net, input_arr, file_name=args.file_name, file_format=args.file_format)


@ -0,0 +1,49 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export RANK_TABLE_FILE=$1
DATA_DIR=$2
export RANK_SIZE=8
cores=`cat /proc/cpuinfo|grep "processor" |wc -l`
echo "the number of logical core" $cores
avg_core_per_rank=`expr $cores \/ $RANK_SIZE`
core_gap=`expr $avg_core_per_rank \- 1`
echo "avg_core_per_rank" $avg_core_per_rank
echo "core_gap" $core_gap
for((i=0;i<RANK_SIZE;i++))
do
start=`expr $i \* $avg_core_per_rank`
export DEVICE_ID=$i
export RANK_ID=$i
export DEPLOY_MODE=0
export GE_USE_STATIC_MEMORY=1
end=`expr $start \+ $core_gap`
cmdopt=$start"-"$end
rm -rf train_parallel$i
mkdir ./train_parallel$i
cp *.py ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $i, device $DEVICE_ID rank_id $RANK_ID"
env > env.log
taskset -c $cmdopt python -u ../train.py \
--device_id $i \
--dataset_path=$DATA_DIR > log.txt 2>&1 &
cd ../
done


@ -0,0 +1,28 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export DEVICE_ID=$1
DATA_DIR=$2
CHECKPOINT_PATH=$3
export RANK_SIZE=1
rm -rf evaluation_ascend
mkdir ./evaluation_ascend
cd ./evaluation_ascend || exit
echo "start evaluating for device id $DEVICE_ID"
env > env.log
python ../eval.py --platform=Ascend --dataset_path=$DATA_DIR --checkpoint_path=$CHECKPOINT_PATH > eval.log 2>&1 &
cd ../


@ -0,0 +1,29 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export RANK_SIZE="1"
export DEVICE_ID=$1
DATA_DIR=$2
rm -rf train_standalone
mkdir ./train_standalone
cd ./train_standalone || exit
echo "start training for device id $DEVICE_ID"
env > env.log
python -u ../train.py \
--device_id=$DEVICE_ID \
--dataset_path=$DATA_DIR > log.txt 2>&1 &
cd ../


@ -0,0 +1,42 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""callback function"""
from mindspore.train.callback import Callback
class EvaluateCallBack(Callback):
    """EvaluateCallBack"""
    def __init__(self, model, eval_dataset, per_print_time=1000):
        super(EvaluateCallBack, self).__init__()
        self.model = model
        self.per_print_time = per_print_time
        self.eval_dataset = eval_dataset

    def step_end(self, run_context):
        cb_params = run_context.original_args()
        if cb_params.cur_step_num % self.per_print_time == 0:
            result = self.model.eval(self.eval_dataset, dataset_sink_mode=False)
            print('cur epoch {}, cur_step {}, top1 accuracy {}, top5 accuracy {}.'.format(
                cb_params.cur_epoch_num, cb_params.cur_step_num,
                result['top_1_accuracy'], result['top_5_accuracy']))

    def epoch_end(self, run_context):
        cb_params = run_context.original_args()
        result = self.model.eval(self.eval_dataset, dataset_sink_mode=False)
        print('cur epoch {}, cur_step {}, top1 accuracy {}, top5 accuracy {}.'.format(
            cb_params.cur_epoch_num, cb_params.cur_step_num,
            result['top_1_accuracy'], result['top_5_accuracy']))


@ -0,0 +1,48 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting, will be used in main.py
"""
from easydict import EasyDict as edict
config_ascend = edict({
    'is_save_on_master': False,
    'batch_size': 128,
    'epoch_size': 250,
    'num_classes': 1000,
    'work_nums': 8,
    'loss_scale': 1024,
    'smooth_factor': 0.1,
    'weight_decay': 0.00004,
    'momentum': 0.9,
    'amp_level': 'O3',
    'decay': 0.9,
    'epsilon': 1.0,
    'keep_checkpoint_max': 10,
    'save_checkpoint_epochs': 10,
    'lr_init': 0.00004,
    'lr_end': 0.000004,
    'lr_max': 0.4,
    'warmup_epochs': 1,
    'start_epoch': 1,
})


@ -0,0 +1,80 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Create train or eval dataset."""
import os
import mindspore.common.dtype as mstype
import mindspore.dataset as de
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2
from src.config import config_ascend as config
device_id = int(os.getenv('DEVICE_ID', '0'))
device_num = int(os.getenv('RANK_SIZE', '1'))

def create_dataset(dataset_path, do_train, repeat_num=1, batch_size=32):
    """
    Create a train or eval dataset.

    Args:
        dataset_path (str): The path of dataset.
        do_train (bool): Whether dataset is used for train or eval.
        repeat_num (int): The repeat times of dataset. Default: 1.
        batch_size (int): The batch size of dataset. Default: 32.

    Returns:
        Dataset.
    """
    do_shuffle = bool(do_train)
    if device_num == 1 or not do_train:
        ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=config.work_nums, shuffle=do_shuffle)
    else:
        ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=config.work_nums,
                                   shuffle=do_shuffle, num_shards=device_num, shard_id=device_id)
    image_length = 299
    if do_train:
        trans = [
            C.RandomCropDecodeResize(image_length, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
            C.RandomHorizontalFlip(prob=0.5),
            C.RandomColorAdjust(brightness=0.4, contrast=0.4, saturation=0.4)
        ]
    else:
        trans = [
            C.Decode(),
            C.Resize((int(image_length / 0.875), int(image_length / 0.875))),
            C.CenterCrop(image_length)
        ]
    trans += [
        C.Rescale(1.0 / 255.0, 0.0),
        # C.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        C.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        C.HWC2CHW()
    ]
    type_cast_op = C2.TypeCast(mstype.int32)
    ds = ds.map(input_columns="label", operations=type_cast_op, num_parallel_workers=config.work_nums)
    ds = ds.map(input_columns="image", operations=trans, num_parallel_workers=config.work_nums)
    # apply batch operations
    ds = ds.batch(batch_size, drop_remainder=True)
    # apply dataset repeat operation
    ds = ds.repeat(repeat_num)
    return ds


@ -0,0 +1,297 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Inception_ResNet_v2"""
import mindspore.nn as nn
from mindspore.ops import operations as P
class Avgpool(nn.Cell):
    """Avgpool"""
    def __init__(self, kernel_size, stride=1, pad_mode='same'):
        super(Avgpool, self).__init__()
        self.avg_pool = nn.AvgPool2d(kernel_size=kernel_size, stride=stride, pad_mode=pad_mode)

    def construct(self, x):
        x = self.avg_pool(x)
        return x


class Conv2d(nn.Cell):
    """
    Set the default configuration for Conv2dBnAct
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, pad_mode='valid', padding=0,
                 has_bias=False, weight_init="XavierUniform", bias_init='zeros'):
        super(Conv2d, self).__init__()
        self.conv = nn.Conv2dBnAct(in_channels, out_channels, kernel_size, stride=stride, pad_mode=pad_mode,
                                   padding=padding, weight_init=weight_init, bias_init=bias_init, has_bias=has_bias,
                                   has_bn=True, eps=0.001, momentum=0.9, activation="relu")

    def construct(self, x):
        x = self.conv(x)
        return x


class Mixed_5b(nn.Cell):
    """
    Mixed_5b
    """
    def __init__(self):
        super(Mixed_5b, self).__init__()
        self.branch0 = Conv2d(192, 96, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(192, 48, kernel_size=1, stride=1),
            Conv2d(48, 64, kernel_size=5, stride=1, padding=2, pad_mode='pad')
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(192, 64, kernel_size=1, stride=1),
            Conv2d(64, 96, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(96, 96, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.branch3 = nn.SequentialCell(
            nn.AvgPool2d(3, stride=1, pad_mode='same'),
            Conv2d(192, 64, kernel_size=1, stride=1)
        )
        self.concat = P.Concat(1)

    def construct(self, x):
        '''
        construct
        '''
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        x3 = self.branch3(x)
        out = self.concat((x0, x1, x2, x3))
        return out


class Stem(nn.Cell):
    """
    Inception_ResNet_v2 stem
    """
    def __init__(self, in_channels):
        super(Stem, self).__init__()
        self.conv2d_1a = Conv2d(in_channels, 32, kernel_size=3, stride=2)
        self.conv2d_2a = Conv2d(32, 32, kernel_size=3, stride=1)
        self.conv2d_2b = Conv2d(32, 64, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        self.maxpool_3a = nn.MaxPool2d(3, stride=2)
        self.conv2d_3b = Conv2d(64, 80, kernel_size=1, stride=1)
        self.conv2d_4a = Conv2d(80, 192, kernel_size=3, stride=1)
        self.maxpool_5a = nn.MaxPool2d(3, stride=2)
        self.mixed_5b = Mixed_5b()

    def construct(self, x):
        """construct"""
        x = self.conv2d_1a(x)
        x = self.conv2d_2a(x)
        x = self.conv2d_2b(x)
        x = self.maxpool_3a(x)
        x = self.conv2d_3b(x)
        x = self.conv2d_4a(x)
        x = self.maxpool_5a(x)
        x = self.mixed_5b(x)
        return x


class InceptionA(nn.Cell):
    """InceptionA"""
    def __init__(self, scale):
        super(InceptionA, self).__init__()
        self.scale = scale
        self.branch0 = Conv2d(320, 32, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(320, 32, kernel_size=1, stride=1),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(320, 32, kernel_size=1, stride=1),
            Conv2d(32, 48, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(48, 64, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.conv2d = nn.Conv2d(128, 320, kernel_size=1, stride=1)
        self.relu = nn.ReLU()
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        out = self.concat((x0, x1, x2))
        out = self.conv2d(out)
        out = out * self.scale + x  # scaled residual connection
        out = self.relu(out)
        return out


class ReductionA(nn.Cell):
    '''
    ReductionA
    '''
    def __init__(self):
        super(ReductionA, self).__init__()
        self.branch0 = Conv2d(320, 384, kernel_size=3, stride=2)
        self.branch1 = nn.SequentialCell(
            Conv2d(320, 256, kernel_size=1, stride=1),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(256, 384, kernel_size=3, stride=2)
        )
        self.branch2 = nn.MaxPool2d(3, stride=2)
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        out = self.concat((x0, x1, x2))
        return out


class InceptionB(nn.Cell):
    """
    InceptionB
    """
    def __init__(self, scale=1.0):
        super(InceptionB, self).__init__()
        self.scale = scale
        self.branch0 = Conv2d(1088, 192, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(1088, 128, kernel_size=1, stride=1),
            Conv2d(128, 160, kernel_size=(1, 7), stride=1, pad_mode='same'),
            Conv2d(160, 192, kernel_size=(7, 1), stride=1, pad_mode='same')
        )
        self.conv2d = nn.Conv2d(384, 1088, kernel_size=1, stride=1)
        self.relu = nn.ReLU()
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        out = self.concat((x0, x1))
        out = self.conv2d(out)
        out = out * self.scale + x  # scaled residual connection
        out = self.relu(out)
        return out


class ReductionB(nn.Cell):
    """
    ReductionB
    """
    def __init__(self):
        super(ReductionB, self).__init__()
        self.branch0 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 384, kernel_size=3, stride=2)
        )
        self.branch1 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 288, kernel_size=3, stride=2)
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 288, kernel_size=3, stride=1, pad_mode='pad', padding=1),
            Conv2d(288, 320, kernel_size=3, stride=2)
        )
        self.branch3 = nn.MaxPool2d(3, stride=2)
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        x3 = self.branch3(x)
        out = self.concat((x0, x1, x2, x3))
        return out


class InceptionC(nn.Cell):
    """
    InceptionC
"""
def __init__(self, scale=1.0, noReLU=False):
super(InceptionC, self).__init__()
self.scale = scale
self.noReLU = noReLU
self.branch0 = Conv2d(2080, 192, kernel_size=1, stride=1)
self.branch1 = nn.SequentialCell(
Conv2d(2080, 192, kernel_size=1, stride=1),
Conv2d(192, 224, kernel_size=(1, 3), stride=1, pad_mode='same'),
Conv2d(224, 256, kernel_size=(3, 1), stride=1, pad_mode='same')
)
self.conv2d = nn.Conv2d(448, 2080, kernel_size=1, stride=1)
self.concat = P.Concat(1)
if not self.noReLU:
self.relu = nn.ReLU()
self.print = P.Print()
def construct(self, x):
x0 = self.branch0(x)
x1 = self.branch1(x)
out = self.concat((x0, x1))
out = self.conv2d(out)
out = out * self.scale + x
if not self.noReLU:
out = self.relu(out)
return out
class Inception_resnet_v2(nn.Cell):
"""
Inception_resnet_v2 architecture
Args.
is_train : in train mode, turn on the dropout.
"""
def __init__(self, in_channels=3, classes=1000, k=192, l=224, m=256, n=384, is_train=True):
super(Inception_resnet_v2, self).__init__()
blocks = []
blocks.append(Stem(in_channels))
for _ in range(10):
blocks.append(InceptionA(scale=0.17))
blocks.append(ReductionA())
for _ in range(20):
blocks.append(InceptionB(scale=0.10))
blocks.append(ReductionB())
for _ in range(9):
blocks.append(InceptionC(scale=0.20))
self.features = nn.SequentialCell(blocks)
self.block8 = InceptionC(noReLU=True)
self.conv2d_7b = Conv2d(2080, 1536, kernel_size=1, stride=1)
self.avgpool = P.ReduceMean(keep_dims=False)
self.softmax = nn.DenseBnAct(
1536, classes, weight_init="XavierUniform", has_bias=True, has_bn=True, activation="logsoftmax")
if is_train:
self.dropout = nn.Dropout(0.8)
else:
self.dropout = nn.Dropout(1.0)
def construct(self, x):
x = self.features(x)
x = self.block8(x)
x = self.conv2d_7b(x)
x = self.avgpool(x, (2, 3))
x = self.dropout(x)
x = self.softmax(x)
return x
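For reference, the residual scaling used by the InceptionA/B/C blocks above (`out = out * self.scale + x`, followed by an optional ReLU) can be illustrated framework-free. The helper below is a hypothetical pure-Python sketch of the formula only, not part of the network code:

```python
def scaled_residual(x, branch_out, scale, apply_relu=True):
    # Combine a block's branch output with its input, as in
    # InceptionA/B/C: out = branch_out * scale + x, then optional ReLU.
    out = [b * scale + xi for b, xi in zip(branch_out, x)]
    if apply_relu:
        out = [max(0.0, v) for v in out]
    return out

# Small scales (0.17 / 0.10 / 0.20 in the model above) keep the residual
# update gentle, which the paper reports stabilizes very deep networks.
example = scaled_residual([1.0, -2.0], [10.0, 10.0], scale=0.17)
```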


@@ -0,0 +1,169 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""train imagenet"""
import os
import argparse
import math
import numpy as np
from mindspore.communication import init, get_rank
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, TimeMonitor, LossMonitor
from mindspore.train.model import ParallelMode
from mindspore.train.loss_scale_manager import FixedLossScaleManager
from mindspore import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn import RMSProp
from mindspore import Tensor
from mindspore import context
from mindspore.common import set_seed
from mindspore.common.initializer import XavierUniform, initializer
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from src.inception_resnet_v2 import Inception_resnet_v2
from src.dataset import create_dataset, device_num
from src.config import config_ascend as config
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'
set_seed(1)
def generate_cosine_lr(steps_per_epoch, total_epochs,
lr_init=config.lr_init,
lr_end=config.lr_end,
lr_max=config.lr_max,
warmup_epochs=config.warmup_epochs):
"""
Applies cosine decay to generate learning rate array.
    Args:
        steps_per_epoch(int): number of steps per epoch.
        total_epochs(int): total number of training epochs.
        lr_init(float): initial learning rate.
        lr_end(float): final learning rate.
        lr_max(float): maximum learning rate.
        warmup_epochs(int): number of warmup epochs.
Returns:
np.array, learning rate array.
"""
total_steps = steps_per_epoch * total_epochs
warmup_steps = steps_per_epoch * warmup_epochs
decay_steps = total_steps - warmup_steps
lr_each_step = []
for i in range(total_steps):
if i < warmup_steps:
lr_inc = (float(lr_max) - float(lr_init)) / float(warmup_steps)
lr = float(lr_init) + lr_inc * (i + 1)
else:
cosine_decay = 0.5 * (1 + math.cos(math.pi * (i - warmup_steps) / decay_steps))
lr = (lr_max - lr_end) * cosine_decay + lr_end
lr_each_step.append(lr)
learning_rate = np.array(lr_each_step).astype(np.float32)
current_step = steps_per_epoch * (config.start_epoch - 1)
learning_rate = learning_rate[current_step:]
return learning_rate
def inception_resnet_v2_train():
"""
Train inception_resnet_v2 in data parallelism
"""
print('epoch_size: {} batch_size: {} class_num {}'.format(config.epoch_size, config.batch_size, config.num_classes))
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
context.set_context(device_id=args.device_id)
context.set_context(enable_graph_kernel=False)
rank = 0
if device_num > 1:
init(backend_name='hccl')
rank = get_rank()
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True,
all_reduce_fusion_config=[200, 400])
print("creating dataset....")
# create dataset
train_dataset = create_dataset(dataset_path=args.dataset_path, do_train=True,
repeat_num=1, batch_size=config.batch_size)
train_step_size = train_dataset.get_dataset_size()
# create model
print("creating model.....")
net = Inception_resnet_v2(classes=config.num_classes)
# loss
print("creating loss.....")
loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
lr = Tensor(generate_cosine_lr(steps_per_epoch=train_step_size, total_epochs=config.epoch_size))
decayed_params = []
no_decayed_params = []
    for param in net.trainable_params():
        if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
            decayed_params.append(param)
            param.set_data(initializer(XavierUniform(), param.data.shape, param.data.dtype))
        else:
            no_decayed_params.append(param)
group_params = [{'params': decayed_params, 'weight_decay': config.weight_decay},
{'params': no_decayed_params},
{'order_params': net.trainable_params()}]
opt = RMSProp(group_params, lr, decay=config.decay, epsilon=config.epsilon, weight_decay=config.weight_decay,
momentum=config.momentum, loss_scale=config.loss_scale)
if args.device_id == 0:
print(lr)
print(train_step_size)
if args.resume:
ckpt = load_checkpoint(args.resume)
load_param_into_net(net, ckpt)
loss_scale_manager = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={
'acc', 'top_1_accuracy', 'top_5_accuracy'}, loss_scale_manager=loss_scale_manager, amp_level=config.amp_level)
# define callbacks
performance_cb = TimeMonitor(data_size=train_step_size)
loss_cb = LossMonitor(per_print_times=train_step_size)
ckp_save_step = config.save_checkpoint_epochs * train_step_size
config_ck = CheckpointConfig(save_checkpoint_steps=ckp_save_step, keep_checkpoint_max=config.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix=f"ince_res-train-rank{rank}",
directory='ckpts_rank_' + str(rank), config=config_ck)
callbacks = [performance_cb, loss_cb]
if device_num > 1 and config.is_save_on_master:
if args.device_id == 0:
callbacks.append(ckpoint_cb)
else:
callbacks.append(ckpoint_cb)
# train model
print("start training....")
model.train(config.epoch_size, train_dataset, callbacks=callbacks, dataset_sink_mode=True)
def parse_args():
'''parse_args'''
arg_parser = argparse.ArgumentParser(description='inception resnet v2 image classification training')
arg_parser.add_argument('--dataset_path', type=str, default='/data/imagenet2012/train', help='Dataset path')
arg_parser.add_argument('--device_id', type=int, default=0, help='device id')
    arg_parser.add_argument('--resume', type=str, default='', help='resume training from an existing checkpoint')
args_opt = arg_parser.parse_args()
return args_opt
if __name__ == '__main__':
args = parse_args()
print("start training....")
inception_resnet_v2_train()
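The warmup-plus-cosine schedule built by `generate_cosine_lr` above can be checked in isolation. The snippet below is a minimal framework-free re-implementation of the same piecewise formula (linear warmup from `lr_init` to `lr_max`, then cosine decay toward `lr_end`), using hypothetical toy values rather than the config defaults:

```python
import math

def cosine_lr(total_steps, warmup_steps, lr_init, lr_max, lr_end):
    # Linear warmup from lr_init to lr_max, then cosine decay to lr_end,
    # mirroring the loop body of generate_cosine_lr above.
    decay_steps = total_steps - warmup_steps
    lrs = []
    for i in range(total_steps):
        if i < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * (i + 1) / warmup_steps
        else:
            cosine = 0.5 * (1 + math.cos(math.pi * (i - warmup_steps) / decay_steps))
            lr = (lr_max - lr_end) * cosine + lr_end
        lrs.append(lr)
    return lrs

# Toy values: the rate climbs to lr_max over 2 steps, then decays.
lrs = cosine_lr(total_steps=10, warmup_steps=2, lr_init=0.0, lr_max=0.1, lr_end=0.001)
```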