forked from mindspore-Ecosystem/mindspore
update efficientnet scripts & nasnet cn readme
This commit is contained in: parent 5e3b135130 · commit 148fc597f6
model_zoo/official/cv
@@ -1,24 +1,66 @@
# EfficientNet-B0 Example
# Contents

## Description

- [EfficientNet-B0 Description](#efficientnet-b0-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training Process](#training-process)
- [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
- [Performance](#performance)
- [Training Performance](#training-performance)
- [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

This is an example of training EfficientNet-B0 in MindSpore.

# [EfficientNet-B0 Description](#contents)

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.

[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.

## Structure

# [Model architecture](#contents)

The overall network architecture of EfficientNet-B0 is shown below:

[Link](https://arxiv.org/abs/1905.11946)
# [Dataset](#contents)

Dataset used: [ImageNet](http://www.image-net.org/)

- Dataset size: ~125G, about 1.28 million colorful images in 1000 classes
- Train: 120G, 1.28 million images
- Test: 5G, 50,000 images
- Data format: RGB images.
- Note: Data will be processed in src/dataset.py
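Below is a minimal usage sketch of the two helpers defined in src/dataset.py. The signatures come from the dataset.py hunks further down in this diff; the dataset paths are placeholders for illustration only.

```python
# Sketch: building the training and evaluation datasets with src/dataset.py.
from src.dataset import create_dataset, create_dataset_val

train_ds = create_dataset(batch_size=128, train_data_url='/dataset/train',
                          workers=8, distributed=False)
val_ds = create_dataset_val(batch_size=128, val_data_url='/dataset/eval',
                            workers=8, distributed=False)
print(train_ds.get_dataset_size())  # number of batches per epoch (used for LR scheduling)
```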
# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script description](#contents)

## [Script and sample code](#contents)

```python
.
└─nasnet
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with gpu platform(1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with gpu platform(8p)
    └─run_eval_for_gpu.sh # launch evaluating with gpu platform
  ├─src
    ├─config.py # parameter configuration
    ├─dataset.py # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py # Customized loss function
    ├─transform_utils.py # random augment utils
    ├─transform.py # random augment class
  ├─eval.py # eval net
  └─train.py # train net

```

## Parameter Configuration

## [Script Parameters](#contents)

Parameters for both training and evaluating can be set in config.py.

```
'random_seed': 1, # fix random seed
'model': 'efficientnet_b0', # model name
'drop': 0.2, # dropout rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128, # batch size
'decay_epochs': 2.4, # epoch interval to decay LR
'warmup_epochs': 5, # epochs to warmup LR
'decay_rate': 0.97, # LR decay rate
'weight_decay': 1e-5, # weight decay
'epochs': 600, # number of epochs to train
'workers': 8, # number of data processing processes
'amp_level': 'O0', # amp level
'opt': 'rmsprop', # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0, # resume start epoch
```
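For reference, the snippet below is a rough sketch of how such a configuration is often expressed as an EasyDict and consumed through attribute access (cfg.batch_size, cfg.epochs, ...), matching the usage visible in the train.py and eval.py hunks below; the actual contents and structure of src/config.py may differ.

```python
# Sketch only: a dict-style config with the keys listed above.
from easydict import EasyDict as edict

cfg = edict({
    'random_seed': 1,
    'model': 'efficientnet_b0',
    'drop': 0.2,
    'batch_size': 128,
    'decay_epochs': 2.4,
    'warmup_epochs': 5,
    'decay_rate': 0.97,
    'weight_decay': 1e-5,
    'epochs': 600,
    'workers': 8,
    'amp_level': 'O0',
    'opt': 'rmsprop',
    'resume_start_epoch': 0,
})

print(cfg.model, cfg.batch_size)  # attribute-style access as used in train.py / eval.py
```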
## Running the example

### Train

## [Training Process](#contents)

#### Usage

```
# distribute training example(8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DATA_DIR DEVICE_ID
GPU:
# distribute training example(8p)
sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIBLE_DEVICES DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example(8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh /dataset 0
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
#### Result

You can find checkpoint files together with the result in the log.

### Evaluation

## [Evaluation Process](#contents)

#### Usage

### Usage

```
# Evaluation
```
@@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh /dataset 0 ./checkpoint/efficientnet_b0-600_1251.ckpt
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```

> The checkpoint can be produced during the training process.

#### Result

The evaluation result will be stored in the scripts path. There you can find results like the following in the log.

```
acc=76.96%(TOP1)
```
# [Model description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | efficientnet_b0            |
| -------------------------- | -------------------------- |
| Resource                   | NV SMX2 V100-32G           |
| Uploaded Date              | 10/26/2020                 |
| MindSpore Version          | 1.0.0                      |
| Dataset                    | ImageNet                   |
| Training Parameters        | src/config.py              |
| Optimizer                  | rmsprop                    |
| Loss Function              | LabelSmoothingCrossEntropy |
| Loss                       | 1.8886                     |
| Accuracy                   | 76.96% (TOP1)              |
| Total time                 | 132 h (8p)                 |
| Checkpoint for Fine tuning | 64 M (.ckpt file)          |

### Inference Performance

| Parameters        | efficientnet_b0         |
| ----------------- | ----------------------- |
| Resource          | NV SMX2 V100-32G        |
| Uploaded Date     | 10/26/2020              |
| MindSpore Version | 1.0.0                   |
| Dataset           | ImageNet, 50,000 images |
| batch_size        | 128                     |
| outputs           | probability             |
| Accuracy          | acc=76.96% (TOP1)       |
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -49,7 +49,7 @@ if __name__ == '__main__':
    ckpt = load_checkpoint(args_opt.checkpoint)
    load_param_into_net(net, ckpt)
    net.set_train(False)
    val_data_url = os.path.join(args_opt.data_path, 'val')  # old: expected a 'val' subfolder under data_path
    val_data_url = args_opt.data_path                        # new: data_path points directly at the eval folder
    dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
    loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
    eval_metrics = {'Loss': nn.Loss(),
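The hunk above ends midway through eval.py. As a point of reference, a typical MindSpore evaluation built from these objects would finish roughly as follows; this is an assumed continuation, not part of this diff.

```python
# Assumed tail of eval.py: wrap network, loss and metrics in a Model
# and evaluate over the validation dataset created above.
from mindspore.train.model import Model

model = Model(net, loss_fn=loss, metrics=eval_metrics)
results = model.eval(dataset)
print("Evaluation result:", results)
```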
@@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
DATA_DIR=$1
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

current_exec_path=$(pwd)
echo ${current_exec_path}
if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
    exit 1
fi

curtime=`date '+%Y%m%d-%H%M%S'`
RANK_SIZE=8
# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
    exit 1
fi

rm ${current_exec_path}/device_parallel/ -rf
mkdir ${current_exec_path}/device_parallel
echo ${curtime} > ${current_exec_path}/device_parallel/starttime
export DEVICE_NUM=$1
export RANK_SIZE=$1

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi

mpirun --allow-run-as-root -n $RANK_SIZE python ${current_exec_path}/train.py \
    --GPU \
    --distributed \
    --data_path ${DATA_DIR} \
    --cur_time ${curtime} > ${current_exec_path}/device_parallel/efficientnet_b0.log 2>&1 &
@@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
DATA_DIR=$1
DEVICE_ID=$2
PATH_CHECKPOINT=$3
if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
    exit 1
fi

current_exec_path=$(pwd)
echo ${current_exec_path}
# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
    exit 1
fi

curtime=`date '+%Y%m%d-%H%M%S'`
# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
    exit 1
fi

echo ${curtime} > ${current_exec_path}/eval_starttime
BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

CUDA_VISIBLE_DEVICES=${DEVICE_ID} python ./eval.py --platform 'GPU' --data_path ${DATA_DIR} --checkpoint ${PATH_CHECKPOINT} > ${current_exec_path}/eval.log 2>&1 &
if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &
@@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
DATA_DIR=$1
DEVICE_ID=$2
if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

current_exec_path=$(pwd)
echo ${current_exec_path}
# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
    exit 1
fi

curtime=`date '+%Y%m%d-%H%M%S'`
BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

rm ${current_exec_path}/device_${DEVICE_ID}/ -rf
mkdir ${current_exec_path}/device_${DEVICE_ID}
echo ${curtime} > ${current_exec_path}/device_${DEVICE_ID}/starttime
export CUDA_VISIBLE_DEVICES=$1

CUDA_VISIBLE_DEVICES=${DEVICE_ID} python ${current_exec_path}/train.py \
    --GPU \
    --data_path ${DATA_DIR} \
    --cur_time ${curtime} > ${current_exec_path}/device_${DEVICE_ID}/efficientnet_b0.log 2>&1 &
if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi
@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                               input_columns=["image", "label"],
                               num_parallel_workers=2,
                               drop_remainder=True)
    ds_train = ds_train.repeat(1)  # removed by this commit
    return ds_train
@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=F
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
    dataset = dataset.repeat(1)  # removed by this commit
    return dataset
@@ -17,7 +17,6 @@ import argparse
import math
import os
import random
import time

import numpy as np
import mindspore
@@ -115,8 +114,6 @@ def main():
    if args.GPU:
        context.set_context(device_target='GPU')

    is_master = not args.distributed or (rank_id == 0)

    net = efficientnet_b0(num_classes=cfg.num_classes,
                          drop_rate=cfg.drop,
                          drop_connect_rate=cfg.drop_connect,
@@ -124,18 +121,7 @@ def main():
                          bn_tf=cfg.bn_tf,
                          )

    cur_time = args.cur_time
    output_base = './output'

    exp_name = '-'.join([
        cur_time,
        cfg.model,
        str(224)
    ])
    time.sleep(rank_id)
    output_dir = get_outdir(output_base, exp_name)

    train_data_url = os.path.join(args.data_path, 'train')  # old: expected a 'train' subfolder under data_path
    train_data_url = args.data_path                          # new: data_path points directly at the train folder
    train_dataset = create_dataset(
        cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
    batches_per_epoch = train_dataset.get_dataset_size()
@@ -152,7 +138,7 @@ def main():
    config_ck = CheckpointConfig(
        save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
    ckpoint_cb = ModelCheckpoint(
        prefix=cfg.model, directory=output_dir, config=config_ck)  # old: save under the time-stamped output dir
        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)  # new: save under per-rank ./ckpt_<rank_id>/
    callbacks += [ckpoint_cb]

    lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
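The get_lr helper referenced above is not shown in this diff. Purely as an illustration, an exponential-decay schedule with linear warmup that is consistent with the config values listed earlier (decay_epochs=2.4, decay_rate=0.97, warmup_epochs=5) could be sketched as follows; this is an assumption, not the repository's exact implementation.

```python
import numpy as np

def get_lr_sketch(base_lr, total_epochs, steps_per_epoch,
                  decay_epochs=2.4, decay_rate=0.97, warmup_epochs=5):
    """Hypothetical per-step LR schedule: linear warmup, then exponential decay."""
    total_steps = int(total_epochs * steps_per_epoch)
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    lr_each_step = []
    for step in range(total_steps):
        if step < warmup_steps:
            lr = base_lr * (step + 1) / warmup_steps
        else:
            epoch = step / steps_per_epoch
            lr = base_lr * decay_rate ** ((epoch - warmup_epochs) / decay_epochs)
        lr_each_step.append(lr)
    return np.array(lr_each_step, dtype=np.float32)
```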
@@ -180,7 +166,7 @@ def main():
                  amp_level=cfg.amp_level
                  )

    callbacks = callbacks if is_master else []    # old: only the master rank kept callbacks
    # callbacks = callbacks if is_master else []  # new: commented out, every rank keeps its callbacks

    if args.resume:
        real_epoch = cfg.epochs - cfg.resume_start_epoch
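The remainder of train.py is not shown in this diff; for orientation only, training would typically be launched from here with the callbacks assembled above. The sketch below is an assumed continuation, where `model` stands for the Model object built just before this hunk.

```python
# Assumed continuation of train.py (not part of this diff): launch training
# with the callbacks assembled above; real_epoch falls back to cfg.epochs
# when not resuming from a checkpoint.
epochs_to_run = real_epoch if args.resume else cfg.epochs
model.train(epochs_to_run, train_dataset, callbacks=callbacks, dataset_sink_mode=True)
```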
@@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
- [Overview](#overview)
- [Requirements](#requirements)
- [Structure](#structure)
- [Parameter Configuration](#parameter-configuration)
- [Running the Example](#running-the-example)
- [Training](#training)
- [Usage](#usage)
- [Launch](#launch)
- [Result](#result)
- [Evaluation](#evaluation)
- [Usage](#usage-1)
- [Launch](#launch-1)
- [Result](#result-1)

<!-- /TOC -->

## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.

## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with gpu platform(1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with gpu platform(8p)
    └─run_eval_for_gpu.sh # launch evaluating with gpu platform
  ├─src
    ├─config.py # parameter configuration
    ├─dataset.py # data preprocessing
    ├─loss.py # customized cross-entropy loss function
    ├─lr_generator.py # learning rate generator
    ├─nasnet_a_mobile.py # network definition
  ├─eval.py # evaluate network
  ├─export.py # convert checkpoint
  └─train.py # train network

```

## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed':1, # fix random seed
'rank':0, # local rank of distributed training
'group_size':1, # world size of distributed training
'work_nums':8, # number of data-loading workers
'epoch_size':500, # total epochs
'keep_checkpoint_max':100, # max number of checkpoints to keep
'ckpt_path':'./checkpoint/', # path to save checkpoints
'is_save_on_master':1, # save checkpoint on rank 0, distributed parameter
'batch_size':32, # batch size of input
'num_classes':1000, # number of dataset classes
'label_smooth_factor':0.1, # label smoothing factor
'aux_factor':0.4, # loss factor of aux logits
'lr_init':0.04, # initial learning rate
'lr_decay_rate':0.97, # learning rate decay rate
'num_epoch_per_decay':2.4, # number of epochs per decay
'weight_decay':0.00004, # weight decay
'momentum':0.9, # momentum
'opt_eps':1.0, # epsilon
'rmsprop_decay':0.9, # rmsprop decay
'loss_scale':1, # loss scale

```
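The 'label_smooth_factor' entry above is consumed by the custom cross-entropy loss in src/loss.py, which is not shown in this commit. Purely as an illustration of the idea, a label-smoothing cross-entropy in MindSpore can be sketched as follows; this is an assumption, not the repository's implementation.

```python
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.common import dtype as mstype
from mindspore.ops import operations as P
from mindspore.ops import functional as F

class LabelSmoothingCrossEntropySketch(nn.Cell):
    """Illustrative label-smoothing cross-entropy (smooth_factor ~ label_smooth_factor)."""
    def __init__(self, smooth_factor=0.1, num_classes=1000):
        super(LabelSmoothingCrossEntropySketch, self).__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction='mean')

    def construct(self, logits, label):
        one_hot_label = self.onehot(label, F.shape(logits)[1], self.on_value, self.off_value)
        return self.ce(logits, one_hot_label)
```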
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find the checkpoint files together with the result in the log.

### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> The checkpoint can be produced during the training process.

#### Result

The evaluation result will be stored in the script path. You can find results like the following in the log: