!17291  Inception_ResNet_v2 for master

Merge pull request !17291 from wittlu/master
This commit is contained in:
i-robot 2021-07-30 01:22:03 +00:00 committed by Gitee
commit b99b3833cb
12 changed files with 1291 additions and 0 deletions


@ -0,0 +1,224 @@
# Inception_ResNet_v2 for Ascend
- [Inception_ResNet_v2 Description](#inception_resnet_v2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [Inception_ResNet_v2 Description](#contents)
Inception_ResNet_v2 is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than Inception-v3. This idea was proposed in the paper Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, published in 2016.
[Paper](https://arxiv.org/pdf/1602.07261.pdf) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Computer Vision and Pattern Recognition[J]. 2016.
# [Model architecture](#contents)
The overall network architecture of Inception_ResNet_v2 is shown below:
[Link](https://arxiv.org/pdf/1602.07261.pdf)
# [Dataset](#contents)
The dataset used is described in the paper.
- Dataset size: 125G, 1.25 million color images in 1000 classes
- Train: 120G, 1.2 million images
- Test: 5G, 50k images
- Data format: RGB images.
- Note: Data will be processed in src/dataset.py
# [Features](#contents)
## [Mixed Precision(Ascend)](#contents)
The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.
For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for "reduce precision".
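The numeric effect of dropping from FP32 to FP16 can be illustrated with a small NumPy sketch (this only demonstrates the precision loss, not MindSpore's internal handling):

```python
import numpy as np

# FP32 carries about 7 significant decimal digits, FP16 only about 3,
# so casting down introduces a small rounding error.
x32 = np.float32(1.0) / np.float32(3.0)   # one third in single precision
x16 = np.float16(x32)                     # the same value in half precision
err = abs(float(x16) - float(x32))        # rounding error introduced by FP16
print(x32, float(x16), err)
```

Loss scaling (the `loss_scale` parameter in config.py) exists to keep small gradient values representable despite exactly this kind of rounding.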
# [Environment Requirements](#contents)
- Hardware (Ascend)
    - Prepare a hardware environment with an Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script description](#contents)
## [Script and sample code](#contents)
```shell
.
└─inception_resnet_v2
  ├─README.md
  ├─scripts
    ├─run_standalone_train_ascend.sh  # launch standalone training with ascend platform(1p)
    ├─run_distribute_train_ascend.sh  # launch distributed training with ascend platform(8p)
    └─run_eval_ascend.sh              # launch evaluating with ascend platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─inception_resnet_v2.py          # network definition
    └─callback.py                     # eval callback function
  ├─eval.py                           # eval net
  ├─export.py                         # export checkpoint, support .onnx, .air, .mindir convert
  └─train.py                          # train net
```
## [Script Parameters](#contents)
```python
Major parameters in train.py and config.py are:
'is_save_on_master'          # save checkpoint only on the master device
'batch_size'                 # input batch size
'epoch_size'                 # total number of epochs
'num_classes'                # number of dataset classes
'work_nums'                  # number of workers to read data
'loss_scale'                 # loss scale
'smooth_factor'              # label smoothing factor
'weight_decay'               # weight decay
'momentum'                   # momentum
'amp_level'                  # mixed precision level, supports [O0, O2, O3]
'decay'                      # decay used in the optimizer
'epsilon'                    # epsilon used in the optimizer
'keep_checkpoint_max'        # max number of checkpoints to keep
'save_checkpoint_epochs'     # save checkpoints every n epochs
'lr_init'                    # initial learning rate
'lr_end'                     # final learning rate
'lr_max'                     # maximum learning rate
'warmup_epochs'              # number of warmup epochs
'start_epoch'                # start epoch, range [1, epoch_size]
```
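A plausible way the three `lr_*` parameters combine is linear warmup from `lr_init` to `lr_max` over `warmup_epochs`, followed by decay toward `lr_end`. The actual schedule is generated in train.py and may differ, so treat this as a hypothetical sketch:

```python
def warmup_decay_lr(lr_init, lr_end, lr_max, warmup_epochs, total_epochs, steps_per_epoch):
    """Hypothetical schedule: linear warmup to lr_max, then linear decay to lr_end."""
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * step / warmup_steps
        else:
            frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
            lr = lr_max - (lr_max - lr_end) * frac
        lrs.append(lr)
    return lrs

# values from src/config.py; 1251 steps per epoch as seen in the training log
schedule = warmup_decay_lr(0.00004, 0.000004, 0.4, 1, 250, 1251)
```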
## [Training process](#contents)
### Usage
You can start training using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
> Notes:
> RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html), and the device_ip can be obtained as in [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools). For large models like Inception_ResNet_v2, it is better to export the environment variable `export HCCL_CONNECT_TIMEOUT=600` to extend the HCCL connection-checking time from the default 120 seconds to 600 seconds. Otherwise, the connection could time out, since compile time increases with model size.
>
> The script binds processor cores to each rank based on `device_num` and the total number of processor cores. If you do not want core binding, remove the `taskset` invocation in `scripts/run_distribute_train_ascend.sh`.
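The core-binding arithmetic in `scripts/run_distribute_train_ascend.sh` can be reproduced in a few lines of Python:

```python
def core_ranges(total_cores, rank_size):
    """Each rank gets total_cores // rank_size contiguous logical cores,
    expressed as a taskset-style range string like "0-23"."""
    avg = total_cores // rank_size
    return ["{}-{}".format(i * avg, i * avg + avg - 1) for i in range(rank_size)]

# e.g. a 192-core host split across 8 ranks
ranges = core_ranges(192, 8)
print(ranges)
```

Each range string is what the script passes to `taskset -c` so that ranks do not compete for the same cores.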
### Launch
```bash
# training example
shell:
Ascend:
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
### Result
Training results will be stored in the example path. Checkpoints will be stored at `ckpt_path` by default, and the training log will be redirected to `./log.txt` as follows.
```python
epoch: 1 step: 1251, loss is 5.4833196
Epoch time: 520274.060, per step time: 415.887
epoch: 2 step: 1251, loss is 4.093194
Epoch time: 288520.628, per step time: 230.632
epoch: 3 step: 1251, loss is 3.6242008
Epoch time: 288507.506, per step time: 230.622
```
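The per-step figure in the log is just the epoch wall time divided by the step count, e.g. for epoch 2:

```python
epoch_time_ms = 288520.628    # epoch 2 wall time from the log above
steps = 1251                  # steps per epoch
per_step_ms = epoch_time_ms / steps
print(round(per_step_ms, 3))  # matches the 230.632 printed in the log
```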
## [Evaluation process](#contents)
### Usage
You can start evaluation using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
```
> Checkpoints are produced during the training process.
### Result
Evaluation results will be stored in the example path; you can find results like the following in `eval.log`.
```python
metric: {'Loss': 1.0413, 'Top1-Acc':0.79955, 'Top5-Acc':0.9439}
```
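Top-1/Top-5 accuracy count a sample as correct when its true label is among the 1 or 5 highest logits. A minimal NumPy sketch of the computation (the actual run uses MindSpore's nn.Top1CategoricalAccuracy / nn.Top5CategoricalAccuracy):

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k largest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]          # indices of the k largest entries
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 2, 2])
print(topk_accuracy(logits, labels, 1))  # 2 of 3 samples have the top logit right
```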
# [Model description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | Ascend |
| -------------------------- | ------------------------------------------------------------ |
| Model Version | Inception ResNet v2 |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB |
| Uploaded Date              | 11/04/2020 |
| MindSpore Version | 1.2.0 |
| Dataset | 1200k images |
| Batch_size | 128 |
| Training Parameters | src/config.py |
| Optimizer | RMSProp |
| Loss Function | SoftmaxCrossEntropyWithLogits |
| Outputs | probability |
| Total time (8p) | 24h |
#### Inference Performance
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | Inception ResNet v2 |
| Resource            | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB |
| Uploaded Date | 11/04/2020 |
| MindSpore Version | 1.2.0 |
| Dataset | 50k images |
| Batch_size | 128 |
| Outputs | probability |
| Accuracy | ACC1[79.96%] ACC5[94.40%] |
#### Training performance results
| **Ascend** | train performance |
| :--------: | :---------------: |
| 1p | 556 img/s |
| 8p | 4430 img/s |
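The throughput figures follow directly from the per-step times in the training log: with batch size 128 and roughly 230.632 ms per step on one device,

```python
batch_size = 128
per_step_s = 0.230632        # steady-state per-step time from the training log
throughput_1p = batch_size / per_step_s
print(round(throughput_1p))  # ~555 img/s, consistent with the ~556 img/s in the table
```

The 8p figure (4430 img/s) is close to, but slightly below, 8x the 1p figure, which is the usual cost of gradient synchronization.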
# [Description of Random Situation](#contents)
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@ -0,0 +1,219 @@
# Contents
<!-- TOC -->
- [Contents](#contents)
- [Inception_ResNet_v2 Description](#inception_resnet_v2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision (Ascend)](#mixed-precision-ascend)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Usage](#usage)
        - [Launch](#launch)
        - [Result](#result)
    - [Evaluation Process](#evaluation-process)
        - [Usage](#usage-1)
        - [Launch](#launch-1)
        - [Result](#result-1)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->
# Inception_ResNet_v2 Description
Inception_ResNet_v2 is one of Google's deep convolutional architectures in the Inception family. It mainly reduces computational cost by revising the earlier Inception architectures. The method was proposed in the paper Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, published in 2016.
[Paper](https://arxiv.org/pdf/1602.07261.pdf) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Computer Vision and Pattern Recognition[J]. 2016.
# Model Architecture
The overall network architecture of Inception_ResNet_v2 is shown below:
[Link](https://arxiv.org/pdf/1602.07261.pdf)
# Dataset
The dataset used is described in the paper.
- Dataset size: 125G, 1.25 million color images in 1000 classes
- Train: 120G, 1.2 million images
- Test: 5G, 50k images
- Data format: RGB
- Note: data will be processed in src/dataset.py.
# Features
## Mixed Precision (Ascend)
The [mixed precision](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/enable_mixed_precision.html) training method accelerates deep neural network training by using both single-precision and half-precision data formats while preserving the accuracy reached by single-precision training. It speeds up computation and reduces memory usage, allowing larger models or batch sizes to be trained on specific hardware.
Taking FP16 operators as an example: if the input data type is FP32, the MindSpore backend automatically lowers the precision to process the data. Users can enable the INFO log and search for "reduce precision" to see which operators run at reduced precision.
# Environment Requirements
- Hardware (Ascend)
    - Set up a hardware environment with an Ascend processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
# Script Description
## Script and Sample Code
```shell
.
└─inception_resnet_v2
  ├─README.md
  ├─scripts
    ├─run_standalone_train_ascend.sh  # launch standalone training with ascend platform(1p)
    ├─run_distribute_train_ascend.sh  # launch distributed training with ascend platform(8p)
    └─run_eval_ascend.sh              # launch evaluating with ascend platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─inception_resnet_v2.py          # network definition
    └─callback.py                     # eval callback function
  ├─eval.py                           # eval net
  ├─export.py                         # export checkpoint, support .onnx, .air, .mindir convert
  └─train.py                          # train net
```
## Script Parameters
```python
Major parameters in train.py and config.py are:
'is_save_on_master'          # save checkpoint only on the master device
'batch_size'                 # input batch size
'epoch_size'                 # total number of epochs
'num_classes'                # number of dataset classes
'work_nums'                  # number of workers to read data
'loss_scale'                 # loss scale
'smooth_factor'              # label smoothing factor
'weight_decay'               # weight decay
'momentum'                   # momentum
'amp_level'                  # mixed precision level, supports [O0, O2, O3]
'decay'                      # decay used in the optimizer
'epsilon'                    # epsilon used in the optimizer
'keep_checkpoint_max'        # max number of checkpoints to keep
'save_checkpoint_epochs'     # save checkpoints every n epochs
'lr_init'                    # initial learning rate
'lr_end'                     # final learning rate
'lr_max'                     # maximum learning rate
'warmup_epochs'              # number of warmup epochs
'start_epoch'                # start epoch, range [1, epoch_size]
```
## Training Process
### Usage
You can start training using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
# distribute training example(8p)
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
```
> Note: RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/distributed_training_ascend.html), and the device_ip can be obtained at [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
### Result
Training results are stored in the example path. Checkpoints are stored in `checkpoint` by default, and the training log is redirected to `./log.txt`, as follows:
#### Ascend
```python
epoch: 1 step: 1251, loss is 5.4833196
Epoch time: 520274.060, per step time: 415.887
epoch: 2 step: 1251, loss is 4.093194
Epoch time: 288520.628, per step time: 230.632
epoch: 3 step: 1251, loss is 3.6242008
Epoch time: 288507.506, per step time: 230.622
```
## Evaluation Process
### Usage
You can start evaluation using python or shell scripts. The usage of the shell scripts is as follows:
- Ascend:
```bash
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
```
> Checkpoints are produced during the training process.
### Result
Evaluation results are stored in the example path; you can find results like the following in `eval.log`.
```log
metric: {'Loss': 1.0413, 'Top1-Acc':0.79955, 'Top5-Acc':0.9439}
```
## Model Export
```shell
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT]
```
`EXPORT_FORMAT` should be in ["AIR", "MINDIR"]
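For context, `export.py` traces the graph by feeding an all-ones dummy tensor in NCHW layout through the network; a sketch of the shape it builds (note that NCHW puts height before width):

```python
def dummy_input_shape(batch_size, height, width, channels=3):
    """NCHW shape for the tracing tensor that export.py passes to export()."""
    return [batch_size, channels, height, width]

print(dummy_input_shape(128, 299, 299))  # [128, 3, 299, 299]
```

With the default 299x299 input the order of height and width does not matter, but it would for a non-square input.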
# Model Description
## Performance
### Training Performance
| Parameters                 | Ascend                                         |
| -------------------------- | ---------------------------------------------- |
| Model Version              | Inception ResNet v2                            |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| MindSpore Version          | 0.6.0-beta                                     |
| Dataset                    | 1.2 million images                             |
| Batch_size                 | 128                                            |
| Training Parameters        | src/config.py                                  |
| Optimizer                  | RMSProp                                        |
| Loss Function              | SoftmaxCrossEntropyWithLogits                  |
| Outputs                    | probability                                    |
| Loss                       | 1.98                                           |
| Total time (8p)            | 24h                                            |
#### Inference Performance
| Parameters          | Ascend                      |
| ------------------- | --------------------------- |
| Model Version       | Inception ResNet v2         |
| Resource            | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB; OS Euler2.8 |
| MindSpore Version   | 1.2.0                       |
| Dataset             | 50k images                  |
| Batch_size          | 128                         |
| Accuracy            | ACC1[79.96%] ACC5[94.40%]   |
# Description of Random Situation
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@ -0,0 +1,59 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""evaluate_imagenet"""
import argparse
import os
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from src.dataset import create_dataset
from src.inception_resnet_v2 import Inception_resnet_v2
from src.config import config_ascend as config
def parse_args():
    '''parse_args'''
    parser = argparse.ArgumentParser(description='image classification evaluation')
    parser.add_argument('--platform', type=str, default='Ascend', choices=('Ascend', 'GPU'), help='run platform')
    parser.add_argument('--dataset_path', type=str, default='', help='Dataset path')
    parser.add_argument('--checkpoint_path', type=str, default='', help='checkpoint of inception_resnet_v2')
    args_opt = parser.parse_args()
    return args_opt

if __name__ == '__main__':
    args = parse_args()
    if args.platform == 'Ascend':
        device_id = int(os.getenv('DEVICE_ID', '0'))
        context.set_context(device_id=device_id)
    context.set_context(mode=context.GRAPH_MODE, device_target=args.platform)
    net = Inception_resnet_v2(classes=config.num_classes, is_train=False)
    ckpt = load_checkpoint(args.checkpoint_path)
    load_param_into_net(net, ckpt)
    net.set_train(False)
    dataset = create_dataset(dataset_path=args.dataset_path, do_train=False,
                             repeat_num=1, batch_size=config.batch_size)
    loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    eval_metrics = {'Loss': nn.Loss(),
                    'Top1-Acc': nn.Top1CategoricalAccuracy(),
                    'Top5-Acc': nn.Top5CategoricalAccuracy()}
    model = Model(net, loss, optimizer=None, metrics=eval_metrics)
    print('=' * 20, 'Evaluate start', '=' * 20)
    metrics = model.eval(dataset)
    print("metric: ", metrics)


@ -0,0 +1,47 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""export checkpoint file into air, onnx, mindir models"""
import argparse
import numpy as np
from mindspore import Tensor, dtype
from mindspore.train.serialization import load_checkpoint, load_param_into_net, export, context
from src.config import config_ascend as config
from src.inception_resnet_v2 import Inception_resnet_v2
parser = argparse.ArgumentParser(description='inception_resnet_v2 export')
parser.add_argument("--device_id", type=int, default=0, help="Device id")
parser.add_argument('--ckpt_file', type=str, required=True, help='inception_resnet_v2 ckpt file.')
parser.add_argument('--file_name', type=str, default='inception_resnet_v2', help='inception_resnet_v2 output air name.')
parser.add_argument('--file_format', type=str, choices=["AIR", "ONNX", "MINDIR"], default='AIR', help='file format')
parser.add_argument('--width', type=int, default=299, help='input width')
parser.add_argument('--height', type=int, default=299, help='input height')
parser.add_argument("--device_target", type=str, choices=["Ascend", "GPU", "CPU"], default="Ascend",
                    help="device target")
args = parser.parse_args()

context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
if args.device_target == "Ascend":
    context.set_context(device_id=args.device_id)

if __name__ == '__main__':
    net = Inception_resnet_v2(classes=config.num_classes)
    param_dict = load_checkpoint(args.ckpt_file)
    load_param_into_net(net, param_dict)
    # NCHW layout: height comes before width
    input_arr = Tensor(np.ones([config.batch_size, 3, args.height, args.width]), dtype.float32)
    export(net, input_arr, file_name=args.file_name, file_format=args.file_format)


@ -0,0 +1,49 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export RANK_TABLE_FILE=$1
DATA_DIR=$2
export RANK_SIZE=8
cores=`cat /proc/cpuinfo|grep "processor" |wc -l`
echo "the number of logical core" $cores
avg_core_per_rank=`expr $cores \/ $RANK_SIZE`
core_gap=`expr $avg_core_per_rank \- 1`
echo "avg_core_per_rank" $avg_core_per_rank
echo "core_gap" $core_gap
for((i=0;i<RANK_SIZE;i++))
do
start=`expr $i \* $avg_core_per_rank`
export DEVICE_ID=$i
export RANK_ID=$i
export DEPLOY_MODE=0
export GE_USE_STATIC_MEMORY=1
end=`expr $start \+ $core_gap`
cmdopt=$start"-"$end
rm -rf train_parallel$i
mkdir ./train_parallel$i
cp *.py ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $i, device $DEVICE_ID rank_id $RANK_ID"
env > env.log
taskset -c $cmdopt python -u ../train.py \
--device_id $i \
--dataset_path=$DATA_DIR > log.txt 2>&1 &
cd ../
done


@ -0,0 +1,28 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export DEVICE_ID=$1
DATA_DIR=$2
CHECKPOINT_PATH=$3
export RANK_SIZE=1
rm -rf evaluation_ascend
mkdir ./evaluation_ascend
cd ./evaluation_ascend || exit
echo "start evaluating for device id $DEVICE_ID"
env > env.log
python ../eval.py --platform=Ascend --dataset_path=$DATA_DIR --checkpoint_path=$CHECKPOINT_PATH > eval.log 2>&1 &
cd ../


@ -0,0 +1,29 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export RANK_SIZE="1"
export DEVICE_ID=$1
DATA_DIR=$2
rm -rf train_standalone
mkdir ./train_standalone
cd ./train_standalone || exit
echo "start training for device id $DEVICE_ID"
env > env.log
python -u ../train.py \
--device_id=$DEVICE_ID \
--dataset_path=$DATA_DIR > log.txt 2>&1 &
cd ../


@ -0,0 +1,42 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""callback function"""
from mindspore.train.callback import Callback
class EvaluateCallBack(Callback):
    """EvaluateCallBack"""
    def __init__(self, model, eval_dataset, per_print_time=1000):
        super(EvaluateCallBack, self).__init__()
        self.model = model
        self.per_print_time = per_print_time
        self.eval_dataset = eval_dataset

    def step_end(self, run_context):
        cb_params = run_context.original_args()
        if cb_params.cur_step_num % self.per_print_time == 0:
            result = self.model.eval(self.eval_dataset, dataset_sink_mode=False)
            print('cur epoch {}, cur_step {}, top1 accuracy {}, top5 accuracy {}.'.format(
                cb_params.cur_epoch_num, cb_params.cur_step_num,
                result['top_1_accuracy'], result['top_5_accuracy']))

    def epoch_end(self, run_context):
        cb_params = run_context.original_args()
        result = self.model.eval(self.eval_dataset, dataset_sink_mode=False)
        print('cur epoch {}, cur_step {}, top1 accuracy {}, top5 accuracy {}.'.format(
            cb_params.cur_epoch_num, cb_params.cur_step_num,
            result['top_1_accuracy'], result['top_5_accuracy']))


@ -0,0 +1,48 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting, will be used in main.py
"""
from easydict import EasyDict as edict
config_ascend = edict({
    'is_save_on_master': False,
    'batch_size': 128,
    'epoch_size': 250,
    'num_classes': 1000,
    'work_nums': 8,
    'loss_scale': 1024,
    'smooth_factor': 0.1,
    'weight_decay': 0.00004,
    'momentum': 0.9,
    'amp_level': 'O3',
    'decay': 0.9,
    'epsilon': 1.0,
    'keep_checkpoint_max': 10,
    'save_checkpoint_epochs': 10,
    'lr_init': 0.00004,
    'lr_end': 0.000004,
    'lr_max': 0.4,
    'warmup_epochs': 1,
    'start_epoch': 1,
})


@ -0,0 +1,80 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Create train or eval dataset."""
import os
import mindspore.common.dtype as mstype
import mindspore.dataset as de
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2
from src.config import config_ascend as config
device_id = int(os.getenv('DEVICE_ID', '0'))
device_num = int(os.getenv('RANK_SIZE', '1'))

def create_dataset(dataset_path, do_train, repeat_num=1, batch_size=32):
    """
    Create a train or eval dataset.

    Args:
        dataset_path (str): The path of dataset.
        do_train (bool): Whether dataset is used for train or eval.
        repeat_num (int): The repeat times of dataset. Default: 1.
        batch_size (int): The batch size of dataset. Default: 32.

    Returns:
        Dataset.
    """
    do_shuffle = bool(do_train)
    if device_num == 1 or not do_train:
        ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=config.work_nums, shuffle=do_shuffle)
    else:
        ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=config.work_nums,
                                   shuffle=do_shuffle, num_shards=device_num, shard_id=device_id)
    image_length = 299
    if do_train:
        trans = [
            C.RandomCropDecodeResize(image_length, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
            C.RandomHorizontalFlip(prob=0.5),
            C.RandomColorAdjust(brightness=0.4, contrast=0.4, saturation=0.4)
        ]
    else:
        trans = [
            C.Decode(),
            C.Resize((int(image_length / 0.875), int(image_length / 0.875))),
            C.CenterCrop(image_length)
        ]
    trans += [
        C.Rescale(1.0 / 255.0, 0.0),
        # C.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        C.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        C.HWC2CHW()
    ]
    type_cast_op = C2.TypeCast(mstype.int32)
    ds = ds.map(input_columns="label", operations=type_cast_op, num_parallel_workers=config.work_nums)
    ds = ds.map(input_columns="image", operations=trans, num_parallel_workers=config.work_nums)
    # apply batch operations
    ds = ds.batch(batch_size, drop_remainder=True)
    # apply dataset repeat operation
    ds = ds.repeat(repeat_num)
    return ds


@ -0,0 +1,297 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Inception_ResNet_v2"""
import mindspore.nn as nn
from mindspore.ops import operations as P
class Avgpool(nn.Cell):
    """Avgpool"""
    def __init__(self, kernel_size, stride=1, pad_mode='same'):
        super(Avgpool, self).__init__()
        self.avg_pool = nn.AvgPool2d(kernel_size=kernel_size, stride=stride, pad_mode=pad_mode)

    def construct(self, x):
        x = self.avg_pool(x)
        return x


class Conv2d(nn.Cell):
    """
    Set the default configuration for Conv2dBnAct
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, pad_mode='valid', padding=0,
                 has_bias=False, weight_init="XavierUniform", bias_init='zeros'):
        super(Conv2d, self).__init__()
        self.conv = nn.Conv2dBnAct(in_channels, out_channels, kernel_size, stride=stride, pad_mode=pad_mode,
                                   padding=padding, weight_init=weight_init, bias_init=bias_init, has_bias=has_bias,
                                   has_bn=True, eps=0.001, momentum=0.9, activation="relu")

    def construct(self, x):
        x = self.conv(x)
        return x


class Mixed_5b(nn.Cell):
    """
    Mixed_5b
    """
    def __init__(self):
        super(Mixed_5b, self).__init__()
        self.branch0 = Conv2d(192, 96, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(192, 48, kernel_size=1, stride=1),
            Conv2d(48, 64, kernel_size=5, stride=1, padding=2, pad_mode='pad')
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(192, 64, kernel_size=1, stride=1),
            Conv2d(64, 96, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(96, 96, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.branch3 = nn.SequentialCell(
            nn.AvgPool2d(3, stride=1, pad_mode='same'),
            Conv2d(192, 64, kernel_size=1, stride=1)
        )
        self.concat = P.Concat(1)

    def construct(self, x):
        '''
        construct
        '''
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        x3 = self.branch3(x)
        out = self.concat((x0, x1, x2, x3))
        return out


class Stem(nn.Cell):
    """
    Inception_ResNet_v2 stem
    """
    def __init__(self, in_channels):
        super(Stem, self).__init__()
        self.conv2d_1a = Conv2d(in_channels, 32, kernel_size=3, stride=2)
        self.conv2d_2a = Conv2d(32, 32, kernel_size=3, stride=1)
        self.conv2d_2b = Conv2d(32, 64, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        self.maxpool_3a = nn.MaxPool2d(3, stride=2)
        self.conv2d_3b = Conv2d(64, 80, kernel_size=1, stride=1)
        self.conv2d_4a = Conv2d(80, 192, kernel_size=3, stride=1)
        self.maxpool_5a = nn.MaxPool2d(3, stride=2)
        self.mixed_5b = Mixed_5b()

    def construct(self, x):
        """construct"""
        x = self.conv2d_1a(x)
        x = self.conv2d_2a(x)
        x = self.conv2d_2b(x)
        x = self.maxpool_3a(x)
        x = self.conv2d_3b(x)
        x = self.conv2d_4a(x)
        x = self.maxpool_5a(x)
        x = self.mixed_5b(x)
        return x


class InceptionA(nn.Cell):
    """InceptionA"""
    def __init__(self, scale):
        super(InceptionA, self).__init__()
        self.scale = scale
        self.branch0 = Conv2d(320, 32, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(320, 32, kernel_size=1, stride=1),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(320, 32, kernel_size=1, stride=1),
            Conv2d(32, 48, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(48, 64, kernel_size=3, stride=1, padding=1, pad_mode='pad')
        )
        self.conv2d = nn.Conv2d(128, 320, kernel_size=1, stride=1)
        self.relu = nn.ReLU()
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        out = self.concat((x0, x1, x2))
        out = self.conv2d(out)
        out = out * self.scale + x  # scaled residual connection
        out = self.relu(out)
        return out


class ReductionA(nn.Cell):
    '''
    ReductionA
    '''
    def __init__(self):
        super(ReductionA, self).__init__()
        self.branch0 = Conv2d(320, 384, kernel_size=3, stride=2)
        self.branch1 = nn.SequentialCell(
            Conv2d(320, 256, kernel_size=1, stride=1),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, pad_mode='pad'),
            Conv2d(256, 384, kernel_size=3, stride=2)
        )
        self.branch2 = nn.MaxPool2d(3, stride=2)
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        out = self.concat((x0, x1, x2))
        return out


class InceptionB(nn.Cell):
    """
    InceptionB
    """
    def __init__(self, scale=1.0):
        super(InceptionB, self).__init__()
        self.scale = scale
        self.branch0 = Conv2d(1088, 192, kernel_size=1, stride=1)
        self.branch1 = nn.SequentialCell(
            Conv2d(1088, 128, kernel_size=1, stride=1),
            Conv2d(128, 160, kernel_size=(1, 7), stride=1, pad_mode='same'),
            Conv2d(160, 192, kernel_size=(7, 1), stride=1, pad_mode='same')
        )
        self.conv2d = nn.Conv2d(384, 1088, kernel_size=1, stride=1)
        self.relu = nn.ReLU()
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        out = self.concat((x0, x1))
        out = self.conv2d(out)
        out = out * self.scale + x  # scaled residual connection
        out = self.relu(out)
        return out


class ReductionB(nn.Cell):
    """
    ReductionB
    """
    def __init__(self):
        super(ReductionB, self).__init__()
        self.branch0 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 384, kernel_size=3, stride=2)
        )
        self.branch1 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 288, kernel_size=3, stride=2)
        )
        self.branch2 = nn.SequentialCell(
            Conv2d(1088, 256, kernel_size=1, stride=1),
            Conv2d(256, 288, kernel_size=3, stride=1, pad_mode='pad', padding=1),
            Conv2d(288, 320, kernel_size=3, stride=2)
        )
        self.branch3 = nn.MaxPool2d(3, stride=2)
        self.concat = P.Concat(1)

    def construct(self, x):
        x0 = self.branch0(x)
        x1 = self.branch1(x)
        x2 = self.branch2(x)
        x3 = self.branch3(x)
        out = self.concat((x0, x1, x2, x3))
        return out


class InceptionC(nn.Cell):
    """
    InceptionC
"""
def __init__(self, scale=1.0, noReLU=False):
super(InceptionC, self).__init__()
self.scale = scale
self.noReLU = noReLU
self.branch0 = Conv2d(2080, 192, kernel_size=1, stride=1)
self.branch1 = nn.SequentialCell(
Conv2d(2080, 192, kernel_size=1, stride=1),
Conv2d(192, 224, kernel_size=(1, 3), stride=1, pad_mode='same'),
Conv2d(224, 256, kernel_size=(3, 1), stride=1, pad_mode='same')
)
self.conv2d = nn.Conv2d(448, 2080, kernel_size=1, stride=1)
self.concat = P.Concat(1)
if not self.noReLU:
self.relu = nn.ReLU()
self.print = P.Print()
def construct(self, x):
x0 = self.branch0(x)
x1 = self.branch1(x)
out = self.concat((x0, x1))
out = self.conv2d(out)
out = out * self.scale + x
if not self.noReLU:
out = self.relu(out)
return out
class Inception_resnet_v2(nn.Cell):
"""
Inception_resnet_v2 architecture
Args.
is_train : in train mode, turn on the dropout.
"""
def __init__(self, in_channels=3, classes=1000, k=192, l=224, m=256, n=384, is_train=True):
super(Inception_resnet_v2, self).__init__()
blocks = []
blocks.append(Stem(in_channels))
for _ in range(10):
blocks.append(InceptionA(scale=0.17))
blocks.append(ReductionA())
for _ in range(20):
blocks.append(InceptionB(scale=0.10))
blocks.append(ReductionB())
for _ in range(9):
blocks.append(InceptionC(scale=0.20))
self.features = nn.SequentialCell(blocks)
self.block8 = InceptionC(noReLU=True)
self.conv2d_7b = Conv2d(2080, 1536, kernel_size=1, stride=1)
self.avgpool = P.ReduceMean(keep_dims=False)
self.softmax = nn.DenseBnAct(
1536, classes, weight_init="XavierUniform", has_bias=True, has_bn=True, activation="logsoftmax")
if is_train:
self.dropout = nn.Dropout(0.8)
else:
self.dropout = nn.Dropout(1.0)
def construct(self, x):
x = self.features(x)
x = self.block8(x)
x = self.conv2d_7b(x)
x = self.avgpool(x, (2, 3))
x = self.dropout(x)
x = self.softmax(x)
return x
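For reference, the residual scaling used by the InceptionA/B/C blocks above (`out = out * self.scale + x`, followed by an optional ReLU) can be illustrated framework-free. The helper below is a hypothetical pure-Python sketch of the formula only, not part of the network code:

```python
def scaled_residual(x, branch_out, scale, apply_relu=True):
    # Combine a block's branch output with its input, as in
    # InceptionA/B/C: out = branch_out * scale + x, then optional ReLU.
    out = [b * scale + xi for b, xi in zip(branch_out, x)]
    if apply_relu:
        out = [max(0.0, v) for v in out]
    return out

# Small scales (0.17 / 0.10 / 0.20 in the model above) keep the residual
# update gentle, which the paper reports stabilizes very deep networks.
example = scaled_residual([1.0, -2.0], [10.0, 10.0], scale=0.17)
```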


@@ -0,0 +1,169 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""train imagenet"""
import os
import argparse
import math
import numpy as np
from mindspore.communication import init, get_rank
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, TimeMonitor, LossMonitor
from mindspore.train.model import ParallelMode
from mindspore.train.loss_scale_manager import FixedLossScaleManager
from mindspore import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn import RMSProp
from mindspore import Tensor
from mindspore import context
from mindspore.common import set_seed
from mindspore.common.initializer import XavierUniform, initializer
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from src.inception_resnet_v2 import Inception_resnet_v2
from src.dataset import create_dataset, device_num
from src.config import config_ascend as config
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'
set_seed(1)
def generate_cosine_lr(steps_per_epoch, total_epochs,
lr_init=config.lr_init,
lr_end=config.lr_end,
lr_max=config.lr_max,
warmup_epochs=config.warmup_epochs):
"""
Applies cosine decay to generate learning rate array.
    Args:
        steps_per_epoch(int): number of steps per epoch.
        total_epochs(int): total number of training epochs.
        lr_init(float): initial learning rate.
        lr_end(float): final learning rate.
        lr_max(float): maximum learning rate.
        warmup_epochs(int): number of warmup epochs.
Returns:
np.array, learning rate array.
"""
total_steps = steps_per_epoch * total_epochs
warmup_steps = steps_per_epoch * warmup_epochs
decay_steps = total_steps - warmup_steps
lr_each_step = []
for i in range(total_steps):
if i < warmup_steps:
lr_inc = (float(lr_max) - float(lr_init)) / float(warmup_steps)
lr = float(lr_init) + lr_inc * (i + 1)
else:
cosine_decay = 0.5 * (1 + math.cos(math.pi * (i - warmup_steps) / decay_steps))
lr = (lr_max - lr_end) * cosine_decay + lr_end
lr_each_step.append(lr)
learning_rate = np.array(lr_each_step).astype(np.float32)
current_step = steps_per_epoch * (config.start_epoch - 1)
learning_rate = learning_rate[current_step:]
return learning_rate
def inception_resnet_v2_train():
"""
Train inception_resnet_v2 in data parallelism
"""
print('epoch_size: {} batch_size: {} class_num {}'.format(config.epoch_size, config.batch_size, config.num_classes))
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
context.set_context(device_id=args.device_id)
context.set_context(enable_graph_kernel=False)
rank = 0
if device_num > 1:
init(backend_name='hccl')
rank = get_rank()
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True,
all_reduce_fusion_config=[200, 400])
print("creating dataset....")
# create dataset
train_dataset = create_dataset(dataset_path=args.dataset_path, do_train=True,
repeat_num=1, batch_size=config.batch_size)
train_step_size = train_dataset.get_dataset_size()
# create model
print("creating model.....")
net = Inception_resnet_v2(classes=config.num_classes)
# loss
print("creating loss.....")
loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
lr = Tensor(generate_cosine_lr(steps_per_epoch=train_step_size, total_epochs=config.epoch_size))
decayed_params = []
no_decayed_params = []
    for param in net.trainable_params():
        if 'beta' not in param.name and 'gamma' not in param.name and 'bias' not in param.name:
            decayed_params.append(param)
            param.set_data(initializer(XavierUniform(), param.data.shape, param.data.dtype))
        else:
            no_decayed_params.append(param)
group_params = [{'params': decayed_params, 'weight_decay': config.weight_decay},
{'params': no_decayed_params},
{'order_params': net.trainable_params()}]
opt = RMSProp(group_params, lr, decay=config.decay, epsilon=config.epsilon, weight_decay=config.weight_decay,
momentum=config.momentum, loss_scale=config.loss_scale)
if args.device_id == 0:
print(lr)
print(train_step_size)
if args.resume:
ckpt = load_checkpoint(args.resume)
load_param_into_net(net, ckpt)
loss_scale_manager = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={
'acc', 'top_1_accuracy', 'top_5_accuracy'}, loss_scale_manager=loss_scale_manager, amp_level=config.amp_level)
# define callbacks
performance_cb = TimeMonitor(data_size=train_step_size)
loss_cb = LossMonitor(per_print_times=train_step_size)
ckp_save_step = config.save_checkpoint_epochs * train_step_size
config_ck = CheckpointConfig(save_checkpoint_steps=ckp_save_step, keep_checkpoint_max=config.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix=f"ince_res-train-rank{rank}",
directory='ckpts_rank_' + str(rank), config=config_ck)
callbacks = [performance_cb, loss_cb]
if device_num > 1 and config.is_save_on_master:
if args.device_id == 0:
callbacks.append(ckpoint_cb)
else:
callbacks.append(ckpoint_cb)
# train model
print("start training....")
model.train(config.epoch_size, train_dataset, callbacks=callbacks, dataset_sink_mode=True)
def parse_args():
'''parse_args'''
arg_parser = argparse.ArgumentParser(description='inception resnet v2 image classification training')
arg_parser.add_argument('--dataset_path', type=str, default='/data/imagenet2012/train', help='Dataset path')
arg_parser.add_argument('--device_id', type=int, default=0, help='device id')
    arg_parser.add_argument('--resume', type=str, default='', help='resume training from an existing checkpoint')
args_opt = arg_parser.parse_args()
return args_opt
if __name__ == '__main__':
args = parse_args()
print("start training....")
inception_resnet_v2_train()
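The warmup-plus-cosine schedule built by `generate_cosine_lr` above can be checked in isolation. The snippet below is a minimal framework-free re-implementation of the same piecewise formula (linear warmup from `lr_init` to `lr_max`, then cosine decay toward `lr_end`), using hypothetical toy values rather than the config defaults:

```python
import math

def cosine_lr(total_steps, warmup_steps, lr_init, lr_max, lr_end):
    # Linear warmup from lr_init to lr_max, then cosine decay to lr_end,
    # mirroring the loop body of generate_cosine_lr above.
    decay_steps = total_steps - warmup_steps
    lrs = []
    for i in range(total_steps):
        if i < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * (i + 1) / warmup_steps
        else:
            cosine = 0.5 * (1 + math.cos(math.pi * (i - warmup_steps) / decay_steps))
            lr = (lr_max - lr_end) * cosine + lr_end
        lrs.append(lr)
    return lrs

# Toy values: the rate climbs to lr_max over 2 steps, then decays.
lrs = cosine_lr(total_steps=10, warmup_steps=2, lr_init=0.0, lr_max=0.1, lr_end=0.001)
```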