ResNet-50-THOR Example

Description

This is an example of training ResNet-50 V1.5 on the ImageNet2012 dataset with THOR, a second-order optimizer. THOR is a novel approximate second-order optimization method implemented in MindSpore. Because it needs fewer iterations, THOR can train ResNet-50 V1.5 to a top-1 accuracy of 75.9% in 72 minutes on 8 Ascend 910 devices, which is much faster than SGD with Momentum.

Requirements

  • Install MindSpore.

  • Download the ImageNet2012 dataset.

Unzip the ImageNet2012 dataset to any path you like; the folder structure should contain the training dataset and the evaluation dataset as follows:

.  
├── ilsvrc                  # train dataset
└── ilsvrc_eval             # infer dataset
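
As a rough illustration of how the training split under ilsvrc can be consumed, the sketch below builds an ImageNet pipeline with MindSpore's dataset API. It is a minimal sketch only: the operator and module names assume a recent MindSpore release, and the repository's authoritative preprocessing lives in src/dataset_imagenet.py.

# Minimal sketch: load the ImageNet2012 training split laid out as above.
# Assumes a recent MindSpore release; the repo's own pipeline is src/dataset_imagenet.py.
import mindspore.dataset as ds
import mindspore.dataset.vision as vision
import mindspore.dataset.transforms as transforms
from mindspore import dtype as mstype

def create_train_dataset(data_path="dataset/ilsvrc", batch_size=32):
    data_set = ds.ImageFolderDataset(data_path, num_parallel_workers=8, shuffle=True)
    image_ops = [
        vision.RandomCropDecodeResize(224, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
        vision.RandomHorizontalFlip(prob=0.5),
        vision.Normalize(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375]),
        vision.HWC2CHW(),
    ]
    data_set = data_set.map(operations=image_ops, input_columns="image")
    data_set = data_set.map(operations=transforms.TypeCast(mstype.int32), input_columns="label")
    return data_set.batch(batch_size, drop_remainder=True)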

Example structure

.
├── resnet_thor
    ├── README.md
    ├── src
        ├── crossentropy.py                 # CrossEntropy loss function
        ├── config.py                       # parameter configuration
        ├── resnet50.py                     # resnet50 backbone
        ├── dataset_helper.py               # dataset helper for MindData dataset
        ├── grad_reducer_thor.py            # grad reducer for thor
        ├── model_thor.py                   # model
        ├── resnet_thor.py                  # resnet50_thor backbone
        ├── thor.py                         # thor
        ├── thor_layer.py                   # thor layer
        └── dataset_imagenet.py             # data preprocessing
    ├── scripts
        ├── run_distribute_train.sh         # launch distributed training (8 pcs)
        └── run_eval.sh                     # launch inference
    ├── eval.py                             # inference script
    └── train.py                            # train script

Parameter configuration

Parameters for both training and inference can be set in config.py.

"class_num": 1000,                # dataset class number
"batch_size": 32,                 # batch size of input tensor
"loss_scale": 128,                # loss scale
"momentum": 0.9,                  # momentum of THOR optimizer
"weight_decay": 5e-4,             # weight decay
"epoch_size": 45,                 # only valid for training, which is always 1 for inference
"buffer_size": 1000,              # number of queue size in data preprocessing
"image_height": 224,              # image height
"image_width": 224,               # image width
"save_checkpoint": True,          # whether save checkpoint or not
"save_checkpoint_steps": 5004,    # the step interval between two checkpoints. By default, the checkpoint will be saved every epoch
"keep_checkpoint_max": 20,        # only keep the last keep_checkpoint_max checkpoint
"save_checkpoint_path": "./",     # path to save checkpoint relative to the executed path
"label_smooth": True,             # label smooth
"label_smooth_factor": 0.1,       # label smooth factor
"frequency": 834,                 # the step interval to update second-order information matrix

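In the repository these values are gathered into a single config object. The sketch below shows one common way to define them with easydict; using easydict here is an assumption for illustration, and src/config.py is the authoritative definition.

# Minimal sketch of how the parameters above might be grouped in src/config.py.
# The use of easydict is an assumption; only the keys and values mirror the listing above.
from easydict import EasyDict as edict

config = edict({
    "class_num": 1000,
    "batch_size": 32,
    "loss_scale": 128,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "epoch_size": 45,
    "buffer_size": 1000,
    "image_height": 224,
    "image_width": 224,
    "save_checkpoint": True,
    "save_checkpoint_steps": 5004,
    "keep_checkpoint_max": 20,
    "save_checkpoint_path": "./",
    "label_smooth": True,
    "label_smooth_factor": 0.1,
    "frequency": 834,
})

# Consumers read the values as attributes, e.g.
# from src.config import config; batch_size = config.batch_size
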
Running the example

Train

Usage

# distributed training
Usage: sh run_distribute_train.sh [MINDSPORE_HCCL_CONFIG_PATH] [DATASET_PATH] [DEVICE_NUM]

Launch

# distributed training example(8 pcs)
sh run_distribute_train.sh rank_table_8p.json dataset/ilsvrc 8

For details about rank_table.json (the HCCL configuration file), refer to the MindSpore distributed training tutorial.
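
Behind the launch script, each of the 8 training processes selects its Ascend device and initializes the data-parallel context before the network is built. The sketch below shows that setup; the environment variable names and the gradients_mean argument assume a recent MindSpore release, and train.py is the authoritative implementation.

# Minimal sketch of the per-process distributed setup performed in train.py.
# API and environment variable names assume a recent MindSpore release.
import os
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank

device_id = int(os.getenv("DEVICE_ID", "0"))
device_num = int(os.getenv("RANK_SIZE", "8"))

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)
init()  # reads the HCCL rank table referenced by the launch script
context.set_auto_parallel_context(device_num=device_num,
                                  parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
rank_id = get_rank()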

Result

Training results are stored in the example path, in folders whose names begin with "train_parallel". There you can find checkpoint files together with results such as the following in the log.

# distribute training result(8 pcs)
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
epoch: 3 step: 5004, loss is 4.0546017
epoch: 4 step: 5004, loss is 3.7598825
epoch: 5 step: 5004, loss is 3.3744206
......

Infer

Usage

# infer
Usage: sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]

Launch

# infer with checkpoint
sh run_eval.sh dataset/ilsvrc_eval train_parallel0/resnet-42_5004.ckpt

The checkpoint file is produced during the training process.
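
Inference amounts to restoring the trained weights into the network and calling model.eval on the evaluation dataset. The sketch below outlines that flow with the standard MindSpore checkpoint APIs; the network and dataset constructors are placeholders for the repository's own helpers in src/, and eval.py is the authoritative script.

# Minimal sketch of checkpoint loading and evaluation, as done in eval.py.
# The network and eval dataset below are placeholders for the repo's helpers in src/.
from mindspore import Model
from mindspore.nn import SoftmaxCrossEntropyWithLogits
from mindspore.train.serialization import load_checkpoint, load_param_into_net

net = ...            # placeholder: build the ResNet-50 network from src/resnet50.py
eval_dataset = ...   # placeholder: build the eval dataset from dataset/ilsvrc_eval

param_dict = load_checkpoint("train_parallel0/resnet-42_5004.ckpt")
load_param_into_net(net, param_dict)
net.set_train(False)

loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(net, loss_fn=loss, metrics={"acc"})
print("result:", model.eval(eval_dataset))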

Result

Inference results are stored in the example path, in a folder named "infer". There you can find results such as the following in the log.

result: {'acc': 0.759503041} ckpt=train_parallel0/resnet-42_5004.ckpt