NASNet Example

Description

This is an example of training NASNet-A-Mobile in MindSpore.

Requirements

Install MindSpore and prepare an ImageNet-style classification dataset (1000 classes); the scripts below expect a training split (e.g. /dataset/train) and a validation split (e.g. /dataset/val).

Structure

.
└─nasnet      
  ├─README.md
  ├─scripts      
    ├─run_standalone_train_for_gpu.sh         # launch standalone training with gpu platform(1p)
    ├─run_distribute_train_for_gpu.sh         # launch distributed training with gpu platform(8p)
    └─run_eval_for_gpu.sh                     # launch evaluating with gpu platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─loss.py                         # Customized CrossEntropy loss function
    ├─lr_generator.py                 # learning rate generator
    ├─nasnet_a_mobile.py              # network definition
  ├─eval.py                           # eval net
  ├─export.py                         # convert checkpoint (see the export sketch below)
  └─train.py                          # train net
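
export.py converts a trained checkpoint into a file suitable for inference deployment. Below is a minimal sketch of what such a conversion typically looks like with MindSpore's export API; the network builder, input shape, file name, and output format are assumptions, so check export.py for the exact arguments.

# Sketch of a checkpoint-to-inference-model conversion; builder name and argument values are assumptions.
import numpy as np
from mindspore import Tensor
from mindspore.train.serialization import load_checkpoint, load_param_into_net, export

net = create_nasnet_a_mobile(num_classes=1000)   # hypothetical builder for the network in src/nasnet_a_mobile.py
load_param_into_net(net, load_checkpoint("./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt"))
dummy_input = Tensor(np.zeros([1, 3, 224, 224], np.float32))  # NASNet-A-Mobile's usual 224x224 RGB input
export(net, dummy_input, file_name="nasnet_a_mobile", file_format="AIR")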
  

Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

'random_seed': 1,                # fix the random seed
'rank': 0,                       # local rank in distributed training
'group_size': 1,                 # world size in distributed training
'work_nums': 8,                  # number of workers reading the data
'epoch_size': 250,               # total number of training epochs
'keep_checkpoint_max': 100,      # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',    # path to save checkpoints
'is_save_on_master': 1,          # save checkpoints only on rank 0 in distributed training
'batch_size': 32,                # input batch size
'num_classes': 1000,             # number of dataset classes
'label_smooth_factor': 0.1,      # label smoothing factor
'aux_factor': 0.4,               # loss weight of the auxiliary logits
'lr_init': 0.04,                 # initial learning rate
'lr_decay_rate': 0.97,           # learning rate decay rate
'num_epoch_per_decay': 2.4,      # number of epochs per decay step
'weight_decay': 0.00004,         # weight decay
'momentum': 0.9,                 # momentum
'opt_eps': 1.0,                  # optimizer epsilon
'rmsprop_decay': 0.9,            # RMSProp decay
'loss_scale': 1,                 # loss scale
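
The learning rate parameters define an exponential decay schedule: training starts at lr_init and the rate is multiplied by lr_decay_rate every num_epoch_per_decay epochs. The following is a minimal sketch of what lr_generator.py presumably computes; the function name, signature, and per-step granularity are assumptions.

# Sketch of an exponential-decay learning rate schedule; name and signature are assumptions.
def get_lr(lr_init, lr_decay_rate, num_epoch_per_decay, total_epochs, steps_per_epoch):
    """Return a per-step learning rate list following exponential decay."""
    lr_each_step = []
    total_steps = steps_per_epoch * total_epochs
    decay_steps = steps_per_epoch * num_epoch_per_decay
    for step in range(total_steps):
        # lr = lr_init * lr_decay_rate ^ (current_step / decay_steps)
        lr_each_step.append(lr_init * lr_decay_rate ** (step / decay_steps))
    return lr_each_step

With the default values above, the learning rate roughly halves about every 55 epochs (0.97 ** (55 / 2.4) is approximately 0.5).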

Running the example

Train

Usage

# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training example
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR

Launch

# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
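
The distributed script is expected to launch train.py with one process per GPU (for example via mpirun) and rely on MindSpore's NCCL backend for gradient synchronization. A rough sketch of the initialization train.py performs is shown below, assuming a MindSpore 1.x-style API; exact module paths and argument names may differ between versions.

# Sketch of the multi-GPU setup assumed to happen inside train.py (MindSpore 1.x-style API).
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init("nccl")                   # initialize NCCL collective communication
rank = get_rank()              # rank of this process (0..7 for the 8p case)
group_size = get_group_size()  # total number of processes
context.set_auto_parallel_context(device_num=group_size,
                                  parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
# Each rank then reads its own shard of the dataset (rank and group_size are passed
# to the dataset helper in src/dataset.py) and gradients are averaged across ranks.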

Result

You can find checkpoint files together with training results in the log.

Evaluation

Usage

# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT

Launch

# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt

Checkpoints are produced during the training process.
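
Internally, eval.py is expected to restore the trained weights from the checkpoint and run inference over the validation set. A hedged sketch of that flow is given below, assuming standard MindSpore serialization APIs; the network and dataset helpers are hypothetical placeholders, and the loss shown is a stand-in for the customized loss in src/loss.py.

# Sketch of the evaluation flow in eval.py; helper names below are hypothetical placeholders.
from mindspore import context, nn
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
net = create_nasnet_a_mobile(num_classes=1000)                # placeholder for src/nasnet_a_mobile.py
dataset = create_val_dataset("/dataset/val", batch_size=32)   # placeholder for src/dataset.py
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')  # stand-in for src/loss.py
param_dict = load_checkpoint("./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt")
load_param_into_net(net, param_dict)                          # restore trained weights
model = Model(net, loss_fn=loss, metrics={'acc'})
print(model.eval(dataset))                                    # prints a dict such as {'acc': ...}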

Result

Evaluation results will be stored in the scripts path, where you can find results like the following in the log.