update efficientnet scripts & nasnet cn readme

This commit is contained in:
panfengfeng 2020-11-19 16:30:16 +08:00
parent 5e3b135130
commit 148fc597f6
8 changed files with 359 additions and 89 deletions

View File

@ -1,24 +1,66 @@
# Contents
- [EfficientNet-B0 Description](#efficientnet-b0-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)
[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.
# [Model Architecture](#contents)
The overall network architecture of EfficientNet-B0 is shown below:
[Link](https://arxiv.org/abs/1905.11946)
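
For readers who want the scaling rule in concrete terms, the snippet below sketches the compound scaling described in the paper. The coefficients alpha=1.2, beta=1.1, gamma=1.15 are the values reported in the paper, and phi=0 corresponds to the B0 baseline used here; this is an illustration, not code from this repository.

```python
# Illustrative sketch of the compound scaling rule from the EfficientNet paper
# (not code from this repository). phi = 0 corresponds to EfficientNet-B0.
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth = alpha ** phi        # multiplier for the number of layers
    width = beta ** phi         # multiplier for the number of channels
    resolution = gamma ** phi   # multiplier for the input resolution
    return depth, width, resolution

for phi in range(3):
    d, w, r = compound_scaling(phi)
    print("phi=%d: depth x%.2f, width x%.2f, resolution x%.2f" % (phi, d, w, r))
```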
# [Dataset](#contents)
Dataset used: [ImageNet](http://www.image-net.org/)
- Dataset size: ~125G, 1.2 million colorful images in 1000 classes
    - Train: 120G, 1.2 million images
    - Test: 5G, 50,000 images
- Data format: RGB images
    - Note: Data will be processed in src/dataset.py
# [Environment Requirements](#contents)
- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script Description](#contents)
## [Script and Sample Code](#contents)
```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation with GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                            # customized loss function
    ├─transform_utils.py                 # random augment utils
    ├─transform.py                       # random augment class
  ├─eval.py                              # eval net
  └─train.py                             # train net
```
## [Script Parameters](#contents)
Parameters for both training and evaluating can be set in config.py.
```
'random_seed': 1, # fix random seed
'model': 'efficientnet_b0', # model name
'drop': 0.2, # dropout rate
@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128, # batch size
'decay_epochs': 2.4, # epoch interval to decay LR
'warmup_epochs': 5, # epochs to warmup LR
'decay_rate': 0.97, # LR decay rate
'weight_decay': 1e-5, # weight decay
'epochs': 600, # number of epochs to train
'workers': 8, # number of data processing processes
'amp_level': 'O0', # amp level
'opt': 'rmsprop', # optimizer
@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0, # resume start epoch
```
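
To see how 'lr', 'warmup_epochs', 'decay_epochs' and 'decay_rate' interact, the sketch below builds a per-step learning-rate list with linear warmup followed by staircase exponential decay. This is a minimal sketch of the usual pattern these keys suggest; the actual get_lr used by train.py may differ in details, and generate_lr is a hypothetical helper name.

```python
# Minimal sketch of a warmup + staircase exponential decay schedule, matching the
# intent of the 'warmup_epochs'/'decay_epochs'/'decay_rate' keys above.
# The real get_lr used by train.py may differ; generate_lr is a hypothetical helper.
def generate_lr(base_lr, total_epochs, steps_per_epoch,
                warmup_epochs=5, decay_epochs=2.4, decay_rate=0.97):
    lr_each_step = []
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    for step in range(total_steps):
        if step < warmup_steps:
            lr = base_lr * (step + 1) / warmup_steps              # linear warmup
        else:
            epoch = step / steps_per_epoch
            num_decays = (epoch - warmup_epochs) // decay_epochs  # staircase decay
            lr = base_lr * decay_rate ** num_decays
        lr_each_step.append(lr)
    return lr_each_step

# Example: base_lr comes from cfg['lr'] in config.py; the 8p run below has
# 1251 steps per epoch (see the checkpoint name efficientnet_b0-600_1251.ckpt).
# lrs = generate_lr(base_lr=cfg['lr'], total_epochs=600, steps_per_epoch=1251)
```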
## [Training Process](#contents)
#### Usage
```
GPU:
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIBLE_DEVICES(0,1,2,3,4,5,6,7) DATA_DIR [PRETRAINED_CKPT_PATH](optional)
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR [PRETRAINED_CKPT_PATH](optional)
```
#### Launch
```bash
# distributed training example (8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
#### Result
You can find the checkpoint files together with the training results in the log.
## [Evaluation Process](#contents)
#### Usage
```
# Evaluation
sh run_eval_for_gpu.sh DATA_DIR PATH_CHECKPOINT
```
@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```
> Checkpoints are produced during the training process.
#### Result
Evaluation results are stored in the eval directory created by the script. There you can find results like the following in the log.
```
acc=76.96%(TOP1)
```
# [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters                 | efficientnet_b0            |
| -------------------------- | -------------------------- |
| Resource                   | NV SMX2 V100-32G           |
| Uploaded Date              | 10/26/2020                 |
| MindSpore Version          | 1.0.0                      |
| Dataset                    | ImageNet                   |
| Training Parameters        | src/config.py              |
| Optimizer                  | rmsprop                    |
| Loss Function              | LabelSmoothingCrossEntropy |
| Loss                       | 1.8886                     |
| Accuracy                   | 76.96% (TOP1)              |
| Total time                 | 132 h (8p)                 |
| Checkpoint for Fine tuning | 64 M (.ckpt file)          |
### Inference Performance
| Parameters        | efficientnet_b0  |
| ----------------- | ---------------- |
| Resource          | NV SMX2 V100-32G |
| Uploaded Date     | 10/26/2020       |
| MindSpore Version | 1.0.0            |
| Dataset           | ImageNet         |
| batch_size        | 128              |
| outputs           | probability      |
| Accuracy          | 76.96% (TOP1)    |
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

View File

@ -49,7 +49,7 @@ if __name__ == '__main__':
ckpt = load_checkpoint(args_opt.checkpoint)
load_param_into_net(net, ckpt)
net.set_train(False)
val_data_url = args_opt.data_path
dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
eval_metrics = {'Loss': nn.Loss(),
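
For context, the full evaluation flow around this change looks roughly like the sketch below. It is a simplified reconstruction based only on the lines visible in this diff; the config import path, the efficientnet_b0 constructor arguments, the checkpoint path and the Top1 metric name are assumptions.

```python
# Simplified sketch of the evaluation flow in eval.py (reconstruction, not verbatim;
# the src import paths and the placeholder checkpoint path are assumptions).
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.config import cfg                      # assumed config object name
from src.dataset import create_dataset_val
from src.efficientnet import efficientnet_b0    # assumed module path
from src.loss import LabelSmoothingCrossEntropy

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')

net = efficientnet_b0(num_classes=cfg.num_classes)          # simplified arguments
load_param_into_net(net, load_checkpoint('/path/to/efficientnet_b0.ckpt'))
net.set_train(False)

# After this commit the data path is used as-is (no 'val' subfolder is appended).
dataset = create_dataset_val(cfg.batch_size, '/dataset/eval',
                             workers=cfg.workers, distributed=False)

loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
eval_metrics = {'Loss': nn.Loss(), 'Top1-Acc': nn.Top1CategoricalAccuracy()}
model = Model(net, loss_fn=loss, metrics=eval_metrics)
print(model.eval(dataset))
```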

View File

@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
    exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
    exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi
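
The mpirun command above launches one train.py process per GPU; train.py is then expected to initialize NCCL-based data parallelism when --distributed is set. The sketch below shows the usual MindSpore initialization for that setup; it is an assumption about what train.py does, not code copied from it.

```python
# Minimal sketch of NCCL data-parallel initialization on GPU (assumed behaviour of
# train.py when --distributed is set; not copied from this repository).
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')

init('nccl')                     # one process per GPU, launched by mpirun
rank_id = get_rank()
group_size = get_group_size()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  device_num=group_size,
                                  gradients_mean=True)
# Each rank then builds its own shard of the dataset and saves to ./ckpt_<rank_id>/.
```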

View File

@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
    exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
    exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &

View File

@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi
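
When the optional [PRETRAINED_CKPT_PATH] is supplied, it reaches train.py as --resume. A minimal sketch of how such a path is typically consumed in MindSpore is shown below; it is an assumption, since only part of the resume logic is visible in this diff (which also shrinks the remaining epoch count via cfg.resume_start_epoch).

```python
# Minimal sketch of loading a --resume checkpoint before training
# (assumed handling; maybe_resume is a hypothetical helper, not from train.py).
from mindspore.train.serialization import load_checkpoint, load_param_into_net

def maybe_resume(net, resume_path):
    """Load pretrained parameters into the network when a checkpoint path is given."""
    if resume_path:
        param_dict = load_checkpoint(resume_path)
        load_param_into_net(net, param_dict)
    return net
```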

View File

@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
input_columns=["image", "label"],
num_parallel_workers=2,
drop_remainder=True)
return ds_train
@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=F
dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
return dataset
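
With the trailing repeat(1) calls removed, the epoch count is controlled by model.train/model.eval rather than by the dataset object. For orientation, a self-contained pipeline in the spirit of create_dataset_val is sketched below; the exact transforms, image size and normalization constants in src/dataset.py may differ, so treat this as an assumption-laden illustration rather than the repository's code.

```python
# Rough sketch of a create_dataset_val-style pipeline (reconstruction; the exact
# transforms, image size and normalization constants in src/dataset.py may differ).
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.dataset.vision.c_transforms as C

def create_dataset_val_sketch(batch_size=128, val_data_url='', workers=8):
    dataset = ds.ImageFolderDataset(val_data_url, num_parallel_workers=workers,
                                    shuffle=False)
    ctrans = [C.Decode(),
              C.Resize(256),
              C.CenterCrop(224),
              C.Normalize(mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                          std=[0.229 * 255, 0.224 * 255, 0.225 * 255]),
              C.HWC2CHW()]
    type_cast_op = C2.TypeCast(mstype.int32)
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op,
                          num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=ctrans,
                          num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True,
                            num_parallel_workers=workers)
    return dataset
```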

View File

@ -17,7 +17,6 @@ import argparse
import math
import os
import random
import numpy as np
import mindspore
@ -115,8 +114,6 @@ def main():
if args.GPU:
context.set_context(device_target='GPU')
net = efficientnet_b0(num_classes=cfg.num_classes,
drop_rate=cfg.drop,
drop_connect_rate=cfg.drop_connect,
@ -124,18 +121,7 @@ def main():
bn_tf=cfg.bn_tf,
)
train_data_url = args.data_path
train_dataset = create_dataset(
cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
batches_per_epoch = train_dataset.get_dataset_size()
@ -152,7 +138,7 @@ def main():
config_ck = CheckpointConfig(
save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(
prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
callbacks += [ckpoint_cb]
lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
@ -180,7 +166,7 @@ def main():
amp_level=cfg.amp_level
)
# callbacks = callbacks if is_master else []
if args.resume:
real_epoch = cfg.epochs - cfg.resume_start_epoch
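
After this change each rank saves its checkpoints under its own ./ckpt_<rank_id>/ directory instead of a shared, timestamped output directory. Pulled out of train.py, the callback wiring amounts to roughly the following sketch; build_callbacks is a hypothetical helper, and the LossMonitor/TimeMonitor callbacks are assumptions, since only the checkpoint part is visible in this diff.

```python
# Condensed sketch of the per-rank checkpoint callbacks set up in train.py
# (rank_id, batches_per_epoch and cfg are defined elsewhere in train.py;
# LossMonitor/TimeMonitor are assumed, build_callbacks is hypothetical).
from mindspore.train.callback import CheckpointConfig, LossMonitor, ModelCheckpoint, TimeMonitor

def build_callbacks(cfg, rank_id, batches_per_epoch):
    callbacks = [LossMonitor(), TimeMonitor(data_size=batches_per_epoch)]
    config_ck = CheckpointConfig(save_checkpoint_steps=batches_per_epoch,
                                 keep_checkpoint_max=cfg.keep_checkpoint_max)
    ckpoint_cb = ModelCheckpoint(prefix=cfg.model,
                                 directory='./ckpt_' + str(rank_id) + '/',
                                 config=config_ck)
    callbacks.append(ckpoint_cb)
    return callbacks
```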

View File

@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->

## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.

## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation with GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
    ├─loss.py                            # customized cross-entropy loss function
    ├─lr_generator.py                    # learning rate generator
    ├─nasnet_a_mobile.py                 # network definition
  ├─eval.py                              # evaluate network
  ├─export.py                            # convert checkpoint
  └─train.py                             # train network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed': 1,                # fix random seed
'rank': 0,                       # local rank of distributed training
'group_size': 1,                 # group size of distributed training
'work_nums': 8,                  # number of data-loading workers
'epoch_size': 500,               # total number of epochs
'keep_checkpoint_max': 100,      # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',    # path to save checkpoints
'is_save_on_master': 1,          # save checkpoint on rank 0, distributed parameter
'batch_size': 32,                # input batch size
'num_classes': 1000,             # number of dataset classes
'label_smooth_factor': 0.1,      # label smoothing factor
'aux_factor': 0.4,               # loss coefficient of the aux logits head
'lr_init': 0.04,                 # initial learning rate
'lr_decay_rate': 0.97,           # learning rate decay rate
'num_epoch_per_decay': 2.4,      # number of epochs per decay
'weight_decay': 0.00004,         # weight decay
'momentum': 0.9,                 # momentum
'opt_eps': 1.0,                  # epsilon
'rmsprop_decay': 0.9,            # rmsprop decay
'loss_scale': 1,                 # loss scale
```
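
Most of the keys above feed the RMSProp optimizer. As an illustration of how they might map onto MindSpore's optimizer API, see the sketch below; the wiring is an assumption (in particular, train.py normally passes a per-step learning-rate schedule rather than the scalar lr_init, and build_optimizer is a hypothetical helper).

```python
# Illustrative mapping from the config keys above to MindSpore's RMSProp optimizer
# (assumed wiring; the exact code in train.py may differ, e.g. a per-step LR
# schedule is normally passed as learning_rate instead of the scalar lr_init).
import mindspore.nn as nn

def build_optimizer(net, cfg):
    return nn.RMSProp(net.trainable_params(),
                      learning_rate=cfg['lr_init'],
                      decay=cfg['rmsprop_decay'],
                      momentum=cfg['momentum'],
                      epsilon=cfg['opt_eps'],
                      weight_decay=cfg['weight_decay'],
                      loss_scale=cfg['loss_scale'])
```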
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find the checkpoint files and results in the log.

### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints can be produced during the training process.

#### Result

The evaluation result is saved in the script path. In the log under that path, you can find results like the following: