forked from mindspore-Ecosystem/mindspore
!9667 fix readme file of resnet_thor
From: @wangmin0104 Reviewed-by: Signed-off-by:
This commit is contained in: commit 55bf63a8a4
@@ -13,6 +13,7 @@

- [Evaluation Process](#Evaluation-Process)
- [Model Description](#Model-Description)
- [Evaluation Performance](#Evaluation-Performance)
- [Inference Performance](#Inference-Performance)
- [Description of Random Situation](#Description-of-Random-Situation)
- [ModelZoo Homepage](#ModelZoo-Homepage)
@@ -21,10 +22,13 @@

This is an example of training ResNet-50 V1.5 with the ImageNet2012 dataset using the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore. With fewer iterations, THOR can finish ResNet-50 V1.5 training in 72 minutes to a top-1 accuracy of 75.9% using 8 Ascend 910 chips, which is much faster than SGD with Momentum.

## Model Architecture

The overall network architecture of ResNet-50 is shown below: [link](https://arxiv.org/pdf/1512.03385.pdf)

## Dataset

Dataset used: ImageNet2012

- Dataset size: 224*224 color images in 1000 classes
  - Train: 1,281,167 images
  - Test: 50,000 images
@@ -35,18 +39,21 @@ Dataset used: ImageNet2012

- Download the dataset ImageNet2012

> Unzip the ImageNet2012 dataset to any path you want; the folder structure should include the train and eval datasets as follows:

```shell
├── ilsvrc                  # train dataset
└── ilsvrc_eval             # infer dataset
```

## Features

Classical first-order optimization algorithms, such as SGD, have a small per-iteration computational cost, but they converge slowly and require many iterations. Second-order optimization algorithms use the second-order derivative of the target function to accelerate convergence; they can converge faster to the optimum of the model and require fewer iterations. However, second-order optimizers are rarely used in deep neural network training because of their high computational cost. The main cost lies in inverting the second-order information matrix (Hessian matrix, Fisher information matrix, etc.), whose time complexity is about $O(n^3)$. Building on the existing natural gradient algorithm, we developed the second-order optimizer THOR in MindSpore, which uses approximation and shearing of the Fisher information matrix to reduce the computational complexity of the matrix inversion. With eight Ascend 910 chips, THOR can complete ResNet50-v1.5 ImageNet training in 72 minutes.
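For reference, THOR can be read as an approximation of the natural-gradient update below, where $\theta_t$ are the network weights, $\alpha$ is the learning rate, $L$ is the loss, and $F$ is the Fisher information matrix; THOR approximates $F^{-1}$ layer by layer and refreshes it only every `frequency` steps (this formula is illustrative and is not the exact THOR update rule):

$$\theta_{t+1} = \theta_t - \alpha \hat{F}^{-1} \nabla_\theta L(\theta_t)$$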
## Environment Requirements

- Hardware (Ascend/GPU)
  - Prepare hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
  - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
@@ -54,8 +61,11 @@ The classical first-order optimization algorithm, such as SGD, has a small amoun

- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

## Quick Start

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

```shell
# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
@@ -63,9 +73,11 @@ sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]

# run evaluation example
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

> For distributed training, an HCCL configuration file in JSON format needs to be created in advance. For details about the configuration file, please refer to [HCCL_TOOL](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).

- Running on GPU

```shell
# run distributed training example
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
@@ -107,7 +119,8 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]

Parameters for both training and inference can be set in config.py.

- Parameters for Ascend 910

```shell
"class_num": 1001,                # dataset class number
"batch_size": 32,                 # batch size of the input tensor (only 32 is supported)
"loss_scale": 128,                # loss scale
@@ -127,8 +140,10 @@ Parameters for both training and inference can be set in config.py.

"damping_decay": 0.87,            # damping decay rate
"frequency": 834,                 # step interval for updating the second-order information matrix (must be a divisor of the number of steps per epoch)
```

- Parameters for GPU

```shell
"class_num": 1001,                # dataset class number
"batch_size": 32,                 # batch size of the input tensor
"loss_scale": 128,                # loss scale
@@ -148,22 +163,26 @@ Parameters for both training and inference can be set in config.py.

"damping_decay": 0.5467,          # damping decay rate
"frequency": 834,                 # step interval for updating the second-order information matrix (must be a divisor of the number of steps per epoch)
```

> Due to operator limitations, the batch size currently only supports 32 on Ascend. The update frequency of the second-order information matrix must be set to a divisor of the number of steps per epoch (for example, 834 is a divisor of 5004). In short, our algorithm is not very flexible in setting these parameters because of framework and operator limitations, but we will address these issues in future versions.
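As a quick sanity check of that divisibility requirement (using the defaults above; the 5004 steps per epoch also matches the training logs shown below):

```shell
# steps per epoch with 1,281,167 training images, batch_size=32 per device and 8 devices
echo $(( 1281167 / (32 * 8) ))   # -> 5004
# frequency must divide the steps per epoch exactly
echo $(( 5004 % 834 ))           # -> 0, so frequency=834 is a valid setting
```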
### Training Process

#### Ascend 910

```shell
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
```

This script requires three parameters:

- `RANK_TABLE_FILE`: the path of rank_table.json
- `DATASET_PATH`: the path of the training dataset
- `DEVICE_NUM`: the number of devices for distributed training
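For example, a hypothetical 8-device launch could look like this (both paths are placeholders, not paths from the repository):

```shell
# rank_table_8pcs.json is the HCCL configuration file created in advance (see the note above);
# /path/to/imagenet/train is an illustrative dataset location
sh run_distribute_train.sh /path/to/rank_table_8pcs.json /path/to/imagenet/train 8
```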
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files together with results like the following in the log.

```shell
...
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
@@ -176,12 +195,16 @@ epoch: 41 step: 5004, loss is 1.8217756

epoch: 42 step: 5004, loss is 1.6453942
...
```

#### GPU

```shell
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
```
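A hypothetical 8-GPU launch might look like this (the dataset path is a placeholder):

```shell
sh run_distribute_train_gpu.sh /path/to/imagenet/train 8
```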
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files together with results like the following in the log.

```shell
...
epoch: 1 step: 5004, loss is 4.2546034
epoch: 2 step: 5004, loss is 4.0819564
@@ -193,16 +216,18 @@ epoch: 36 step: 5004, loss is 1.645802

...
```

### Evaluation Process

Before running the command below, please check the checkpoint path used for evaluation. Set the checkpoint path to an absolute full path, e.g., "username/resnet_thor/train_parallel0/resnet-42_5004.ckpt".

#### Ascend 910

```shell
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

This script requires two parameters:

- `DATASET_PATH`: the path of the evaluation dataset
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file
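For instance, a hypothetical evaluation run (both arguments are placeholder paths; as noted above, the checkpoint path should be absolute):

```shell
sh run_eval.sh /path/to/imagenet/val /home/username/resnet_thor/train_parallel0/resnet-42_5004.ckpt
```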
@@ -210,16 +235,19 @@ We need two parameters for this scripts.

Inference results are stored in the example path, in a folder named "eval". There you can find results like the following in the log.

```shell
result: {'top_5_accuracy': 0.9295574583866837, 'top_1_accuracy': 0.761443661971831} ckpt=train_parallel0/resnet-42_5004.ckpt
```

#### GPU

```shell
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

Inference results are stored in the example path, in a folder named "eval". There you can find results like the following in the log.

```shell
result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt
```
@@ -245,12 +273,24 @@ Inference result will be stored in the example path, whose folder name is "eval"

| Checkpoint for Fine tuning | 491M (.ckpt file) | 380M (.ckpt file) |
| Scripts | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) |

### Inference Performance

| Parameters          | Ascend 910                  | GPU                         |
| ------------------- | --------------------------- | --------------------------- |
| Model Version       | ResNet50-v1.5               | ResNet50-v1.5               |
| Resource            | Ascend 910                  | GPU                         |
| Uploaded Date       | 06/01/2020 (month/day/year) | 09/23/2020 (month/day/year) |
| MindSpore Version   | 0.3.0-alpha                 | 1.0.0                       |
| Dataset             | ImageNet2012                | ImageNet2012                |
| batch_size          | 32                          | 32                          |
| outputs             | probability                 | probability                 |
| Accuracy            | 76.14%                      | 75.97%                      |
| Model for inference | 98M (.air file)             |                             |

## Description of Random Situation

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.

## ModelZoo Homepage

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -1,4 +1,5 @@

# ResNet-50-THOR Example

<!-- TOC -->

- [ResNet-50-THOR Example](#resnet-50-thor-示例)
@@ -8,6 +9,16 @@

- [Features](#特性)
- [Environment Requirements](#环境要求)
- [Quick Start](#快速入门)
- [Script Description](#脚本描述)
- [Script Code Structure](#脚本代码结构)
- [Script Parameters](#脚本参数)
- [Training Process](#训练过程)
- [Inference Process](#推理过程)
- [Model Description](#模型描述)
- [Training Performance](#训练性能)
- [Inference Performance](#推理性能)
- [Description of Random Situation](#随机情况说明)
- [ModelZoo Homepage](#ModelZoo首页)

<!-- /TOC -->
@@ -16,10 +27,13 @@

This example shows how to train the ResNet-50 V1.5 network on the ImageNet2012 dataset with the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore that needs fewer iterations. With 8 Ascend 910 chips, THOR can finish ResNet-50 V1.5 training in 72 minutes and reach a top-1 accuracy of 75.9%, which is much faster than training with SGD plus Momentum.

## Model Architecture

The overall network architecture of ResNet-50 is shown here: [link](https://arxiv.org/pdf/1512.03385.pdf)

## Dataset

Dataset used: ImageNet2012

- Dataset size: 224*224 color images in 1000 classes
  - Train: 1,281,167 images
  - Test: 50,000 images
@@ -30,19 +44,20 @@ ResNet-50的总体网络架构如下:[链接](https://arxiv.org/pdf/1512.03385

- Download the ImageNet2012 dataset.

> Unzip the ImageNet2012 dataset to any path you want; the directory structure should contain the train dataset and the evaluation dataset, as follows:

```shell
├── ilsvrc                  # train dataset
└── ilsvrc_eval             # evaluation dataset
```

## Features

Classical first-order optimization algorithms such as SGD have a small per-step computational cost, but they converge slowly and need many iterations. Second-order optimization algorithms use the second-order derivatives of the objective function to accelerate convergence; they converge faster and need fewer iterations. However, second-order optimization is rarely used in deep neural network training because of its high computational cost: the main cost lies in inverting the second-order information matrix (Hessian matrix, Fisher information matrix, etc.), whose time complexity is about $O(n^3)$. Based on the existing natural gradient algorithm, the second-order optimizer THOR in MindSpore reduces the computational complexity of the matrix inversion by approximating and shearing the Fisher information matrix. With 8 Ascend 910 chips, THOR can complete ResNet50-v1.5 ImageNet training within 72 minutes.

## Environment Requirements

- Hardware (Ascend/GPU)
  - Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
  - [MindSpore](https://www.mindspore.cn/install)
@@ -51,8 +66,11 @@ ResNet-50的总体网络架构如下:[链接](https://arxiv.org/pdf/1512.03385

- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)

## Quick Start

After installing MindSpore from the official website, you can start training and evaluation as follows:

- Running in the Ascend environment

```shell
# distributed training example
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
@@ -60,9 +78,12 @@ sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]

# run evaluation example
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

> For distributed training, an HCCL configuration file in JSON format needs to be created in advance. For details about the configuration file, please refer to [HCCL_TOOL](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).

- Running in the GPU environment

```shell
# distributed training example
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
@@ -71,7 +92,7 @@ sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]

sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

## Script Description

### Script Code Structure
@@ -104,7 +125,8 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]

Parameters for both training and inference can be configured in config.py.

- Parameters for Ascend 910

```shell
"class_num": 1001,                # number of dataset classes
"batch_size": 32,                 # batch size of the input tensor (only 32 is supported)
"loss_scale": 128,                # loss scale factor
@@ -124,13 +146,15 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]

"damping_decay": 0.87,            # damping decay rate
"frequency": 834,                 # step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```

- Parameters for GPU

```shell
"class_num": 1001,                # number of dataset classes
"batch_size": 32,                 # batch size of the input tensor
"loss_scale": 128,                # loss scale factor
"momentum": 0.9,                  # momentum of the THOR optimizer
"weight_decay": 5e-4,             # weight decay coefficient
"epoch_size": 40,                 # only valid for training; fixed to 1 for inference
"save_checkpoint": True,          # whether to save checkpoints
"save_checkpoint_epochs": 1,      # epoch interval between two checkpoints; by default, a checkpoint is saved every epoch
@@ -145,22 +169,26 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]

"damping_decay": 0.5467,          # damping decay rate
"frequency": 834,                 # step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```

> Due to operator limitations, the batch size currently only supports 32 on Ascend. The update frequency of the second-order information matrix must be set to a divisor of the number of steps per epoch (for example, 834 is a divisor of 5004). In short, because of framework and operator limitations, our algorithm is not very flexible in setting these parameters, but these issues will be addressed in later versions.

### Training Process

#### Ascend 910

```shell
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
```

This script requires three parameters:

- `RANK_TABLE_FILE`: the path of the rank_table.json file
- `DATASET_PATH`: the path of the training dataset
- `DEVICE_NUM`: the number of devices for distributed training

Training results are saved in the current path, in a folder whose name begins with "train_parallel". You can find the checkpoint files together with results like the following in the log.

```shell
...
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
@@ -173,12 +201,16 @@ epoch:41 step: 5004,loss is 1.8217756

epoch: 42 step: 5004, loss is 1.6453942
...
```

#### GPU

```shell
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
```

Training results are saved in the current path, in a folder whose name begins with "train_parallel". You can find the checkpoint files together with results like the following in the log.

```shell
...
epoch: 1 step: 5004, loss is 4.2546034
epoch: 2 step: 5004, loss is 4.0819564
@@ -190,16 +222,18 @@ epoch: 36 step: 5004,loss is 1.645802

...
```

### Inference Process

Before running the command below, please check the checkpoint path used for inference. Set the checkpoint path to an absolute path, e.g., `username/resnet_thor/train_parallel0/resnet-42_5004.ckpt`.

#### Ascend 910

```shell
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

This script requires two parameters:

- `DATASET_PATH`: the path of the evaluation dataset.
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
@@ -207,22 +241,25 @@ epoch: 36 step: 5004,loss is 1.645802

Inference results are saved in the example path, in a folder named `eval`. You can find results like the following in the log.

```shell
result: {'top_5_accuracy': 0.9295574583866837, 'top_1_accuracy': 0.761443661971831} ckpt=train_parallel0/resnet-42_5004.ckpt
```

#### GPU

```shell
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```

Inference results are saved in the example path, in a folder named `eval`. You can find results like the following in the log.

```shell
result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt
```

## Model Description

### Training Performance

| Parameters                 | Ascend 910                             | GPU                                |
| -------------------------- | -------------------------------------- | ---------------------------------- |
@@ -242,14 +279,25 @@ epoch: 36 step: 5004,loss is 1.645802

| Checkpoint                 | 491M (.ckpt file)                      | 380M (.ckpt file)                  |
| Scripts                    | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) |

### Inference Performance

| Parameters          | Ascend 910      | GPU             |
| ------------------- | --------------- | --------------- |
| Model Version       | ResNet50-v1.5   | ResNet50-v1.5   |
| Resource            | Ascend 910      | GPU             |
| Uploaded Date       | 2020-06-01      | 2020-09-23      |
| MindSpore Version   | 0.3.0-alpha     | 1.0.0           |
| Dataset             | ImageNet2012    | ImageNet2012    |
| Batch size          | 32              | 32              |
| Outputs             | probability     | probability     |
| Accuracy            | 76.14%          | 75.97%          |
| Model for inference | 98M (.air file) |                 |

## Description of Random Situation

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.

## ModelZoo Homepage

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).