update efficientnet scripts & nasnet cn readme

This commit is contained in:
panfengfeng 2020-11-19 16:30:16 +08:00
parent 5e3b135130
commit 148fc597f6
8 changed files with 359 additions and 89 deletions

View File

@ -1,24 +1,66 @@
# Contents
- [EfficientNet-B0 Description](#efficientnet-b0-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)
[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.
# [Model Architecture](#contents)
The overall network architecture of EfficientNet-B0 is shown below:
[Link](https://arxiv.org/abs/1905.11946)
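
For readers who want the scaling rule in concrete terms, the snippet below sketches the compound scaling described in the paper. The coefficients alpha=1.2, beta=1.1, gamma=1.15 are the values reported in the paper, and phi=0 corresponds to the B0 baseline used here; this is an illustration, not code from this repository.

```python
# Illustrative sketch of the compound scaling rule from the EfficientNet paper
# (not code from this repository). phi = 0 corresponds to EfficientNet-B0.
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth = alpha ** phi        # multiplier for the number of layers
    width = beta ** phi         # multiplier for the number of channels
    resolution = gamma ** phi   # multiplier for the input resolution
    return depth, width, resolution

for phi in range(3):
    d, w, r = compound_scaling(phi)
    print("phi=%d: depth x%.2f, width x%.2f, resolution x%.2f" % (phi, d, w, r))
```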
# [Dataset](#contents)
Dataset used: [ImageNet](http://www.image-net.org/)
- Dataset size: ~125G, 1.2 million colorful images in 1000 classes
    - Train: 120G, 1.2 million images
    - Test: 5G, 50,000 images
- Data format: RGB images
    - Note: Data will be processed in src/dataset.py
# [Environment Requirements](#contents)
- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script Description](#contents)
## [Script and Sample Code](#contents)
```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation with GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                            # customized loss function
    ├─transform_utils.py                 # random augment utils
    ├─transform.py                       # random augment class
  ├─eval.py                              # eval net
  └─train.py                             # train net
```
## [Script Parameters](#contents)
Parameters for both training and evaluating can be set in config.py.
```
'random_seed': 1, # fix random seed
'model': 'efficientnet_b0', # model name
'drop': 0.2, # dropout rate
@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128, # batch size
'decay_epochs': 2.4, # epoch interval to decay LR
'warmup_epochs': 5, # epochs to warmup LR
'decay_rate': 0.97, # LR decay rate
'weight_decay': 1e-5, # weight decay
'epochs': 600, # number of epochs to train
'workers': 8, # number of data processing processes
'amp_level': 'O0', # amp level
'opt': 'rmsprop', # optimizer
@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0, # resume start epoch
```
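
To see how 'lr', 'warmup_epochs', 'decay_epochs' and 'decay_rate' interact, the sketch below builds a per-step learning-rate list with linear warmup followed by staircase exponential decay. This is a minimal sketch of the usual pattern these keys suggest; the actual get_lr used by train.py may differ in details, and generate_lr is a hypothetical helper name.

```python
# Minimal sketch of a warmup + staircase exponential decay schedule, matching the
# intent of the 'warmup_epochs'/'decay_epochs'/'decay_rate' keys above.
# The real get_lr used by train.py may differ; generate_lr is a hypothetical helper.
def generate_lr(base_lr, total_epochs, steps_per_epoch,
                warmup_epochs=5, decay_epochs=2.4, decay_rate=0.97):
    lr_each_step = []
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    for step in range(total_steps):
        if step < warmup_steps:
            lr = base_lr * (step + 1) / warmup_steps              # linear warmup
        else:
            epoch = step / steps_per_epoch
            num_decays = (epoch - warmup_epochs) // decay_epochs  # staircase decay
            lr = base_lr * decay_rate ** num_decays
        lr_each_step.append(lr)
    return lr_each_step

# Example: base_lr comes from cfg['lr'] in config.py; the 8p run below has
# 1251 steps per epoch (see the checkpoint name efficientnet_b0-600_1251.ckpt).
# lrs = generate_lr(base_lr=cfg['lr'], total_epochs=600, steps_per_epoch=1251)
```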
## [Training Process](#contents)
#### Usage
```
GPU:
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIBLE_DEVICES(0,1,2,3,4,5,6,7) DATA_DIR [PRETRAINED_CKPT_PATH](optional)
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR [PRETRAINED_CKPT_PATH](optional)
```
#### Launch
```bash
# distributed training example (8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
#### Result
You can find the checkpoint files together with the training results in the log.
## [Evaluation Process](#contents)
#### Usage
```
# Evaluation
sh run_eval_for_gpu.sh DATA_DIR PATH_CHECKPOINT
```
@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```
> Checkpoints are produced during the training process.
#### Result
Evaluation results are stored in the eval directory created by the script. There you can find results like the following in the log.
```
acc=76.96%(TOP1)
```
# [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters                 | efficientnet_b0            |
| -------------------------- | -------------------------- |
| Resource                   | NV SMX2 V100-32G           |
| Uploaded Date              | 10/26/2020                 |
| MindSpore Version          | 1.0.0                      |
| Dataset                    | ImageNet                   |
| Training Parameters        | src/config.py              |
| Optimizer                  | rmsprop                    |
| Loss Function              | LabelSmoothingCrossEntropy |
| Loss                       | 1.8886                     |
| Accuracy                   | 76.96% (TOP1)              |
| Total time                 | 132 h (8p)                 |
| Checkpoint for Fine tuning | 64 M (.ckpt file)          |
### Inference Performance
| Parameters        | efficientnet_b0  |
| ----------------- | ---------------- |
| Resource          | NV SMX2 V100-32G |
| Uploaded Date     | 10/26/2020       |
| MindSpore Version | 1.0.0            |
| Dataset           | ImageNet         |
| batch_size        | 128              |
| outputs           | probability      |
| Accuracy          | 76.96% (TOP1)    |
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

View File

@ -49,7 +49,7 @@ if __name__ == '__main__':
ckpt = load_checkpoint(args_opt.checkpoint)
load_param_into_net(net, ckpt)
net.set_train(False)
val_data_url = args_opt.data_path
dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
eval_metrics = {'Loss': nn.Loss(),
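
For context, the full evaluation flow around this change looks roughly like the sketch below. It is a simplified reconstruction based only on the lines visible in this diff; the config import path, the efficientnet_b0 constructor arguments, the checkpoint path and the Top1 metric name are assumptions.

```python
# Simplified sketch of the evaluation flow in eval.py (reconstruction, not verbatim;
# the src import paths and the placeholder checkpoint path are assumptions).
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.config import cfg                      # assumed config object name
from src.dataset import create_dataset_val
from src.efficientnet import efficientnet_b0    # assumed module path
from src.loss import LabelSmoothingCrossEntropy

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')

net = efficientnet_b0(num_classes=cfg.num_classes)          # simplified arguments
load_param_into_net(net, load_checkpoint('/path/to/efficientnet_b0.ckpt'))
net.set_train(False)

# After this commit the data path is used as-is (no 'val' subfolder is appended).
dataset = create_dataset_val(cfg.batch_size, '/dataset/eval',
                             workers=cfg.workers, distributed=False)

loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
eval_metrics = {'Loss': nn.Loss(), 'Top1-Acc': nn.Top1CategoricalAccuracy()}
model = Model(net, loss_fn=loss, metrics=eval_metrics)
print(model.eval(dataset))
```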

View File

@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
    exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
    exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi
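
The mpirun command above launches one train.py process per GPU; train.py is then expected to initialize NCCL-based data parallelism when --distributed is set. The sketch below shows the usual MindSpore initialization for that setup; it is an assumption about what train.py does, not code copied from it.

```python
# Minimal sketch of NCCL data-parallel initialization on GPU (assumed behaviour of
# train.py when --distributed is set; not copied from this repository).
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')

init('nccl')                     # one process per GPU, launched by mpirun
rank_id = get_rank()
group_size = get_group_size()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  device_num=group_size,
                                  gradients_mean=True)
# Each rank then builds its own shard of the dataset and saves to ./ckpt_<rank_id>/.
```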

View File

@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
    exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
    exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &

View File

@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi
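
When the optional [PRETRAINED_CKPT_PATH] is supplied, it reaches train.py as --resume. A minimal sketch of how such a path is typically consumed in MindSpore is shown below; it is an assumption, since only part of the resume logic is visible in this diff (which also shrinks the remaining epoch count via cfg.resume_start_epoch).

```python
# Minimal sketch of loading a --resume checkpoint before training
# (assumed handling; maybe_resume is a hypothetical helper, not from train.py).
from mindspore.train.serialization import load_checkpoint, load_param_into_net

def maybe_resume(net, resume_path):
    """Load pretrained parameters into the network when a checkpoint path is given."""
    if resume_path:
        param_dict = load_checkpoint(resume_path)
        load_param_into_net(net, param_dict)
    return net
```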

View File

@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
input_columns=["image", "label"],
num_parallel_workers=2,
drop_remainder=True)
return ds_train
@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=F
dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
return dataset
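
With the trailing repeat(1) calls removed, the epoch count is controlled by model.train/model.eval rather than by the dataset object. For orientation, a self-contained pipeline in the spirit of create_dataset_val is sketched below; the exact transforms, image size and normalization constants in src/dataset.py may differ, so treat this as an assumption-laden illustration rather than the repository's code.

```python
# Rough sketch of a create_dataset_val-style pipeline (reconstruction; the exact
# transforms, image size and normalization constants in src/dataset.py may differ).
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.dataset.vision.c_transforms as C

def create_dataset_val_sketch(batch_size=128, val_data_url='', workers=8):
    dataset = ds.ImageFolderDataset(val_data_url, num_parallel_workers=workers,
                                    shuffle=False)
    ctrans = [C.Decode(),
              C.Resize(256),
              C.CenterCrop(224),
              C.Normalize(mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                          std=[0.229 * 255, 0.224 * 255, 0.225 * 255]),
              C.HWC2CHW()]
    type_cast_op = C2.TypeCast(mstype.int32)
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op,
                          num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=ctrans,
                          num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True,
                            num_parallel_workers=workers)
    return dataset
```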

View File

@ -17,7 +17,6 @@ import argparse
import math
import os
import random
import numpy as np
import mindspore
@ -115,8 +114,6 @@ def main():
if args.GPU:
context.set_context(device_target='GPU')
net = efficientnet_b0(num_classes=cfg.num_classes,
drop_rate=cfg.drop,
drop_connect_rate=cfg.drop_connect,
@ -124,18 +121,7 @@ def main():
bn_tf=cfg.bn_tf,
)
train_data_url = args.data_path
train_dataset = create_dataset(
cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
batches_per_epoch = train_dataset.get_dataset_size()
@ -152,7 +138,7 @@ def main():
config_ck = CheckpointConfig(
save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(
prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
callbacks += [ckpoint_cb]
lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
@ -180,7 +166,7 @@ def main():
amp_level=cfg.amp_level
)
# callbacks = callbacks if is_master else []
if args.resume:
real_epoch = cfg.epochs - cfg.resume_start_epoch
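
After this change each rank saves its checkpoints under its own ./ckpt_<rank_id>/ directory instead of a shared, timestamped output directory. Pulled out of train.py, the callback wiring amounts to roughly the following sketch; build_callbacks is a hypothetical helper, and the LossMonitor/TimeMonitor callbacks are assumptions, since only the checkpoint part is visible in this diff.

```python
# Condensed sketch of the per-rank checkpoint callbacks set up in train.py
# (rank_id, batches_per_epoch and cfg are defined elsewhere in train.py;
# LossMonitor/TimeMonitor are assumed, build_callbacks is hypothetical).
from mindspore.train.callback import CheckpointConfig, LossMonitor, ModelCheckpoint, TimeMonitor

def build_callbacks(cfg, rank_id, batches_per_epoch):
    callbacks = [LossMonitor(), TimeMonitor(data_size=batches_per_epoch)]
    config_ck = CheckpointConfig(save_checkpoint_steps=batches_per_epoch,
                                 keep_checkpoint_max=cfg.keep_checkpoint_max)
    ckpoint_cb = ModelCheckpoint(prefix=cfg.model,
                                 directory='./ckpt_' + str(rank_id) + '/',
                                 config=config_ck)
    callbacks.append(ckpoint_cb)
    return callbacks
```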

View File

@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->

## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.

## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation with GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
    ├─loss.py                            # customized cross-entropy loss function
    ├─lr_generator.py                    # learning rate generator
    ├─nasnet_a_mobile.py                 # network definition
  ├─eval.py                              # evaluate network
  ├─export.py                            # convert checkpoint
  └─train.py                             # train network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed': 1,                # fix random seed
'rank': 0,                       # local rank of distributed training
'group_size': 1,                 # group size of distributed training
'work_nums': 8,                  # number of data-loading workers
'epoch_size': 500,               # total number of epochs
'keep_checkpoint_max': 100,      # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',    # path to save checkpoints
'is_save_on_master': 1,          # save checkpoint on rank 0, distributed parameter
'batch_size': 32,                # input batch size
'num_classes': 1000,             # number of dataset classes
'label_smooth_factor': 0.1,      # label smoothing factor
'aux_factor': 0.4,               # loss coefficient of the aux logits head
'lr_init': 0.04,                 # initial learning rate
'lr_decay_rate': 0.97,           # learning rate decay rate
'num_epoch_per_decay': 2.4,      # number of epochs per decay
'weight_decay': 0.00004,         # weight decay
'momentum': 0.9,                 # momentum
'opt_eps': 1.0,                  # epsilon
'rmsprop_decay': 0.9,            # rmsprop decay
'loss_scale': 1,                 # loss scale
```
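
Most of the keys above feed the RMSProp optimizer. As an illustration of how they might map onto MindSpore's optimizer API, see the sketch below; the wiring is an assumption (in particular, train.py normally passes a per-step learning-rate schedule rather than the scalar lr_init, and build_optimizer is a hypothetical helper).

```python
# Illustrative mapping from the config keys above to MindSpore's RMSProp optimizer
# (assumed wiring; the exact code in train.py may differ, e.g. a per-step LR
# schedule is normally passed as learning_rate instead of the scalar lr_init).
import mindspore.nn as nn

def build_optimizer(net, cfg):
    return nn.RMSProp(net.trainable_params(),
                      learning_rate=cfg['lr_init'],
                      decay=cfg['rmsprop_decay'],
                      momentum=cfg['momentum'],
                      epsilon=cfg['opt_eps'],
                      weight_decay=cfg['weight_decay'],
                      loss_scale=cfg['loss_scale'])
```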
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find the checkpoint files and results in the log.

### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints can be produced during the training process.

#### Result

The evaluation result is saved in the script path. In the log under that path, you can find results like the following: