

VGG16 Example


This example is for VGG16 model training and evaluation.


  • Install MindSpore.

  • Download the CIFAR-10 binary version dataset.

Unzip the CIFAR-10 dataset to any path you like. The folder structure should be as follows:

├── cifar-10-batches-bin  # train dataset
└── cifar-10-verify-bin   # infer dataset
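Before starting a long training run, it can help to confirm that both folders are in place. The helper below is a minimal sketch, not part of this example's scripts; the function name `missing_dataset_dirs` is hypothetical, and it assumes the path you pass is the directory that directly contains the two folders shown above.

```python
from pathlib import Path

# Folder names taken from the layout shown above.
EXPECTED_DIRS = ("cifar-10-batches-bin", "cifar-10-verify-bin")


def missing_dataset_dirs(data_path):
    """Return the expected sub-folders that are absent under data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list means both folders were found; otherwise the list names the folders that still need to be unzipped.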

Running the Example


python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 & 

The Python command above runs in the background; you can view the results in the file out.train.log.

After training, you'll get some checkpoint files under the script folder by default.
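To pick the most recent checkpoint for evaluation, you could scan the folder for `.ckpt` files. The sketch below assumes the `<prefix>-<epoch>-<step>.ckpt` naming seen in the evaluation command in this README (`train_vgg_cifar10-70-781.ckpt`); the function name `latest_checkpoint` is hypothetical, and the pattern also accepts an underscore between epoch and step in case your installed version writes names that way.

```python
import re
from pathlib import Path

# Assumed naming: <prefix>-<epoch>-<step>.ckpt, e.g. train_vgg_cifar10-70-781.ckpt.
# An underscore between epoch and step is tolerated as well (assumption).
CKPT_PATTERN = re.compile(r"-(\d+)[-_](\d+)\.ckpt$")


def latest_checkpoint(folder="."):
    """Return the checkpoint path with the highest (epoch, step), or None."""
    best_key, best_path = None, None
    for path in Path(folder).glob("*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

Comparing the numbers as integers avoids the trap of lexical sorting, where epoch 9 would rank above epoch 70.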

You will get loss values like the following:

# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
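If you want the loss values as numbers rather than grep output, for example to plot them, the log lines can be parsed with a regular expression. This is a minimal sketch that assumes the line format shown above; `parse_losses` is a hypothetical helper, not something the example scripts provide.

```python
import re

# Matches "<label>: <epoch> step: <step>, loss is <value>" as printed in the log.
LOSS_PATTERN = re.compile(r":\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)")


def parse_losses(lines):
    """Yield (epoch, step, loss) tuples from training log lines."""
    for line in lines:
        match = LOSS_PATTERN.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2)), float(match.group(3))


sample = [
    "epoch: 1 step: 781, loss is 2.093086",
    "epoch: 2 step: 781, loss is 1.827582",
]
print(list(parse_losses(sample)))  # [(1, 781, 2.093086), (2, 781, 1.827582)]
```

For a real run, pass an open file instead of the sample list, for example `with open("out.train.log") as f: losses = list(parse_losses(f))`.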


python eval.py --data_path=your_data_path --device_id=6 --checkpoint_path=./train_vgg_cifar10-70-781.ckpt > out.eval.log 2>&1 & 

The Python command above runs in the background; you can view the results in the file out.eval.log.

You will get an accuracy like the following:

# grep "result: " out.eval.log
result: {'acc': 0.92}
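The result line is printed as a Python dict literal, so it can be read back without a custom parser. The sketch below assumes the `result: {...}` format shown above; `parse_results` is a hypothetical helper name.

```python
import ast


def parse_results(lines):
    """Return the dict printed after "result: " on each matching log line."""
    results = []
    for line in lines:
        _, marker, payload = line.partition("result: ")
        if marker:
            # literal_eval only evaluates Python literals; it does not run code.
            results.append(ast.literal_eval(payload.strip()))
    return results


print(parse_results(["result: {'acc': 0.92}"]))  # [{'acc': 0.92}]
```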

Distributed Training

sh run_distribute_train.sh rank_table.json your_data_path

The shell script above runs distributed training in the background; you can view the results in the file train_parallel[X]/log.

You will get loss values like the following:

# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
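Each train_parallel[X] log reports its own loss, so a single per-epoch figure can be easier to compare across runs. The sketch below averages the logged losses over all logs for each epoch, assuming the grep output format shown above; `mean_loss_per_epoch` is a hypothetical helper, and a plain average of the logged values is only one reasonable way to summarize them.

```python
import re
from collections import defaultdict

# Matches grep output such as
# "train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308".
LINE_PATTERN = re.compile(
    r"train_parallel(\d+)/log:.*?:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)"
)


def mean_loss_per_epoch(grep_lines):
    """Average the logged loss over all train_parallel[X] logs, per epoch."""
    per_epoch = defaultdict(list)
    for line in grep_lines:
        match = LINE_PATTERN.search(line)
        if match:
            _index, epoch, _step, loss = match.groups()
            per_epoch[int(epoch)].append(float(loss))
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(per_epoch.items())}
```

The function takes any iterable of lines, so the output of the grep command can be piped into a small script that passes `sys.stdin` to it.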

For details on rank_table.json, refer to the distributed training tutorial.



usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]

  --device_target       the training backend type, default is Ascend.
  --data_path           the storage path of the dataset.
  --device_id           the device used to train the model.
  --pre_trained         the pretrained checkpoint file path.
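For readers who want to see how options like these are typically declared, here is a minimal `argparse` sketch that mirrors the usage text above. It is not the actual contents of train.py: only the `Ascend` default comes from the usage text, and the argument types and the `None` defaults are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented train.py options."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--device_target", type=str, default="Ascend",
                        help="the training backend type, default is Ascend.")
    # The types and defaults below are placeholders; the usage text does not state them.
    parser.add_argument("--data_path", type=str, default=None,
                        help="the storage path of the dataset.")
    parser.add_argument("--device_id", type=int, default=None,
                        help="the device used to train the model.")
    parser.add_argument("--pre_trained", type=str, default=None,
                        help="the pretrained checkpoint file path.")
    return parser


args = build_parser().parse_args(["--data_path", "your_data_path", "--device_id", "6"])
print(args.device_target, args.data_path, args.device_id)  # Ascend your_data_path 6
```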


usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--device_id DEVICE_ID][--checkpoint_path CKPT_PATH]

  --device_target       the evaluation backend type, default is Ascend.
  --data_path           the storage path of the dataset.
  --device_id           the device used to evaluate the model.
  --checkpoint_path     the checkpoint file path used to evaluate model.

Distributed Training

Usage: sh scripts/run_distribute_train.sh [MINDSPORE_HCCL_CONFIG_PATH] [DATA_PATH]

  MINDSPORE_HCCL_CONFIG_PATH   HCCL configuration file path.
  DATA_PATH                    the storage path of the dataset.