!9667 fix readme file of resnet_thor

From: @wangmin0104
Reviewed-by: 
Signed-off-by:
mindspore-ci-bot 2020-12-09 14:32:16 +08:00 committed by Gitee
commit 55bf63a8a4
2 changed files with 200 additions and 112 deletions


@@ -13,6 +13,7 @@
- [Evaluation Process](#Evaluation-Process)
- [Model Description](#Model-Description)
- [Evaluation Performance](#Evaluation-Performance)
- [Inference Performance](#Inference-Performance)
- [Description of Random Situation](#Description-of-Random-Situation)
- [ModelZoo Homepage](#ModelZoo-Homepage)
@@ -21,10 +22,13 @@
This is an example of training ResNet-50 V1.5 with the ImageNet2012 dataset using the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore. With fewer iterations, THOR can finish ResNet-50 V1.5 training to a top-1 accuracy of 75.9% in 72 minutes using 8 Ascend 910 devices, which is much faster than SGD with Momentum.
## Model Architecture
The overall network architecture of ResNet-50 is shown below: [link](https://arxiv.org/pdf/1512.03385.pdf)
## Dataset
Dataset used: ImageNet2012
- Dataset size: 224*224 color images in 1000 classes
- Train: 1,281,167 images
- Test: 50,000 images
@@ -35,18 +39,21 @@ Dataset used: ImageNet2012
- Download the dataset ImageNet2012
> Unzip the ImageNet2012 dataset to any path you want and the folder structure should include train and eval dataset as follows:
```shell
├── ilsvrc          # train dataset
└── ilsvrc_eval     # infer dataset
```
## Features
Classical first-order optimization algorithms such as SGD have a low per-iteration cost, but they converge slowly and require many iterations. Second-order optimization algorithms use the second-order derivatives of the objective function to accelerate convergence; they reach the optimum faster and require fewer iterations. However, second-order methods are rarely used in deep neural network training because of their high computational cost: the main cost lies in inverting the second-order information matrix (Hessian matrix, Fisher information matrix, etc.), whose time complexity is about $O(n^3)$. Building on the existing natural gradient algorithm, we developed the second-order optimizer THOR in MindSpore by adopting approximation and trimming of the Fisher information matrix to reduce the cost of the matrix inversion. With eight Ascend 910 chips, THOR can complete ResNet50-v1.5 ImageNet training in 72 minutes.
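For orientation, the natural-gradient update that THOR approximates can be written schematically as follows (notation introduced here for illustration; it is not taken verbatim from the THOR paper or this repository):

$$\theta_{t+1} = \theta_t - \alpha\, F^{-1} \nabla_{\theta} L(\theta_t)$$

where $F$ is the Fisher information matrix and $\alpha$ the learning rate. The expensive term is $F^{-1}$; THOR keeps it tractable by approximating the Fisher information layer by layer and refreshing that approximation only every `frequency` steps (see the parameters below).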
## Environment Requirements
- Hardware (Ascend/GPU)
- Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
@@ -54,8 +61,11 @@ The classical first-order optimization algorithm, such as SGD, has a small amoun
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
## Quick Start
After installing MindSpore via the official website, you can start training and evaluation as follows:
- Running on Ascend
```python
# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
@@ -63,9 +73,11 @@ sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
# run evaluation example
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
> For distributed training, an HCCL configuration file in JSON format needs to be created in advance. For details about the configuration file, please refer to [HCCL_TOOL](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
- Running on GPU
```python
# run distributed training example
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
@@ -107,7 +119,8 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
Parameters for both training and inference can be set in config.py.
- Parameters for Ascend 910
```shell
"class_num": 1001, # dataset class number
"batch_size": 32, # batch size of input tensor (only supports 32)
"loss_scale": 128, # loss scale
@@ -127,8 +140,10 @@ Parameters for both training and inference can be set in config.py.
"damping_decay": 0.87, # damping decay rate
"frequency": 834, # the step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```
- Parameters for GPU
```shell
"class_num": 1001, # dataset class number
"batch_size": 32, # batch size of input tensor
"loss_scale": 128, # loss scale
@@ -148,22 +163,26 @@ Parameters for both training and inference can be set in config.py.
"damping_decay": 0.5467, # damping decay rate
"frequency": 834, # the step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```
> Due to operator limitations, only a batch size of 32 is currently supported on Ascend. In addition, the update frequency of the second-order information matrix must be set to a divisor of the number of steps per epoch (for example, 834 is a divisor of 5004). In short, our algorithm is not very flexible in setting these parameters because of framework and operator limitations, but we will solve these problems in future versions.
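As a quick illustration of the divisor constraint above, the numbers in this README work out as follows (the snippet is illustrative only and not part of the repository):

```python
# Check that the THOR update frequency divides the number of steps per epoch.
# Values are taken from this README: ImageNet2012, batch_size=32, 8 devices.
train_images = 1281167
batch_size = 32
device_num = 8

steps_per_epoch = train_images // (batch_size * device_num)   # 5004
frequency = 834                                                # value used in config.py

assert steps_per_epoch % frequency == 0, "frequency must divide steps per epoch"
print(steps_per_epoch, steps_per_epoch // frequency)           # 5004 6 -> 6 updates per epoch
```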
### Training Process
#### Ascend 910
```shell
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
```
This script requires three parameters:
- `RANK_TABLE_FILE`: the path of rank_table.json
- `DATASET_PATH`: the path of the training dataset
- `DEVICE_NUM`: the number of devices for distributed training
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files, together with results like the following in the log.
```shell
...
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
@@ -176,12 +195,16 @@ epoch: 41 step: 5004, loss is 1.8217756
epoch: 42 step: 5004, loss is 1.6453942
...
```
#### GPU
```shell
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
```
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files, together with results like the following in the log.
```shell
...
epoch: 1 step: 5004, loss is 4.2546034
epoch: 2 step: 5004, loss is 4.0819564
@@ -193,16 +216,18 @@ epoch: 36 step: 5004, loss is 1.645802
...
```
### Evaluation Process
Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute full path, e.g., "username/resnet_thor/train_parallel0/resnet-42_5004.ckpt".
#### Ascend 910
```shell
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
This script requires two parameters:
- `DATASET_PATH`: the path of the evaluation dataset.
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
@@ -210,16 +235,19 @@ We need two parameters for this scripts.
Inference results are stored in the example path, in a folder named "eval". There you can find results like the following in the log.
```shell
result: {'top_5_accuracy': 0.9295574583866837, 'top_1_accuracy': 0.761443661971831} ckpt=train_parallel0/resnet-42_5004.ckpt
```
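The `top_1_accuracy`/`top_5_accuracy` values above are standard top-k metrics. A minimal NumPy sketch of the computation (for illustration only; this is not the repository's eval.py) is:

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scored classes."""
    # Indices of the k largest logits per sample; order inside the top-k does not matter.
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Tiny usage example: 4 samples, 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = np.array([3, 1, 7, 0])
print(top_k_accuracy(logits, labels, k=1), top_k_accuracy(logits, labels, k=5))
```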
#### GPU
```shell
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
Inference results are stored in the example path, in a folder named "eval". There you can find results like the following in the log.
```shell
result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt
```
@@ -245,12 +273,24 @@ Inference result will be stored in the example path, whose folder name is "eval"
| Checkpoint for Fine tuning | 491M (.ckpt file) | 380M (.ckpt file) |
| Scripts | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) |
### Inference Performance
| Parameters | Ascend 910 | GPU |
| ------------------- | --------------------------- | --------------------------- |
| Model Version | ResNet50-v1.5 | ResNet50-v1.5 |
| Resource | Ascend 910 | GPU |
| Uploaded Date | 06/01/2020 (month/day/year) | 09/23/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| batch_size | 32 | 32 |
| outputs | probability | probability |
| Accuracy | 76.14% | 75.97% |
| Model for inference | 98M (.air file) | |
## Description of Random Situation
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
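For reference, a seed setup of this kind typically looks like the following (a minimal sketch; the exact seed value and calls used in dataset.py/train.py may differ):

```python
import random
import numpy as np
import mindspore.dataset as ds

SEED = 1  # hypothetical value; the repository may use a different one

random.seed(SEED)         # Python RNG
np.random.seed(SEED)      # NumPy RNG (e.g., for augmentation parameters)
ds.config.set_seed(SEED)  # seed for MindSpore dataset shuffling/augmentation
```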
## ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@@ -1,4 +1,5 @@
# ResNet-50-THOR Example
<!-- TOC -->
- [ResNet-50-THOR Example](#resnet-50-thor-示例)
@@ -8,6 +9,16 @@
- [Features](#特性)
- [Environment Requirements](#环境要求)
- [Quick Start](#快速入门)
- [Script Description](#脚本描述)
- [Script Code Structure](#脚本代码结构)
- [Script Parameters](#脚本参数)
- [Training Process](#训练过程)
- [Inference Process](#推理过程)
- [Model Description](#模型描述)
- [Training Performance](#训练性能)
- [Inference Performance](#推理性能)
- [Description of Random Situation](#随机情况说明)
- [ModelZoo Homepage](#ModelZoo首页)
<!-- /TOC -->
@@ -16,10 +27,13 @@
This example shows how to train the ResNet-50 V1.5 network on the ImageNet2012 dataset with the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore that needs fewer iterations. Using 8 Ascend 910 devices, THOR reaches a top-1 accuracy of 75.9% and completes ResNet-50 V1.5 training within 72 minutes, which is much faster than SGD with Momentum.
## Model Architecture
The overall network architecture of ResNet-50 is shown here: [link](https://arxiv.org/pdf/1512.03385.pdf)
## Dataset
Dataset used: ImageNet2012
- Dataset size: 224*224 color images in 1000 classes
- Train: 1,281,167 images
- Test: 50,000 images
@@ -30,19 +44,20 @@ The overall network architecture of ResNet-50 is shown here: [link](https://arxiv.org/pdf/1512.03385
- Download the dataset ImageNet2012.
> Unzip the ImageNet2012 dataset to any path you want; the folder structure should include the train and eval datasets, as follows:
```shell
├── ilsvrc          # train dataset
└── ilsvrc_eval     # eval dataset
```
## Features
Classical first-order optimization algorithms such as SGD have a low per-iteration cost, but they converge slowly and require many iterations. Second-order optimization algorithms use the second-order derivatives of the objective function to accelerate convergence; they converge faster and need fewer iterations. However, because of their high computational cost, second-order methods are not widely used in deep neural network training: the main cost lies in inverting the second-order information matrix (Hessian matrix, Fisher information matrix, etc.), whose time complexity is about $O(n^3)$. Building on the existing natural gradient algorithm, we implemented the MindSpore-based second-order optimizer THOR by approximating and trimming the Fisher information matrix to reduce the complexity of the matrix inversion. With 8 Ascend 910 chips, THOR can complete ResNet50-v1.5 + ImageNet training in 72 minutes.
## Environment Requirements
- Hardware (Ascend/GPU)
- Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
- [MindSpore](https://www.mindspore.cn/install)
@@ -51,8 +66,11 @@ The overall network architecture of ResNet-50 is shown here: [link](https://arxiv.org/pdf/1512.03385
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
## Quick Start
After installing MindSpore via the official website, you can start training and evaluation as follows:
- Running on Ascend
```python
# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
@@ -60,9 +78,12 @@ sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
# run evaluation example
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
> For distributed training, an HCCL configuration file in JSON format needs to be created in advance. For details about the configuration file, please refer to [HCCL_TOOL](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
- Running on GPU
```python
# run distributed training example
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
@@ -71,7 +92,7 @@ sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
## Script Description
### Script Code Structure
@@ -104,7 +125,8 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
Parameters for both training and inference can be configured in config.py.
- Parameters for Ascend 910
```shell
"class_num": 1001, # dataset class number
"batch_size": 32, # batch size of input tensor (only 32 supported)
"loss_scale": 128, # loss scale
@@ -124,13 +146,15 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
"damping_decay": 0.87, # damping decay rate
"frequency": 834, # step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```
- Parameters for GPU
```shell
"class_num": 1001, # dataset class number
"batch_size": 32, # batch size of input tensor
"loss_scale": 128, # loss scale
"momentum": 0.9, # momentum of the THOR optimizer
"weight_decay": 5e-4, # weight decay coefficient
"epoch_size": 40, # only valid for training; fixed to 1 for inference
"save_checkpoint": True, # whether to save checkpoints
"save_checkpoint_epochs": 1, # epoch interval between two checkpoints; by default a checkpoint is saved after every epoch
@@ -145,22 +169,26 @@ sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
"damping_decay": 0.5467, # damping decay rate
"frequency": 834, # step interval for updating the second-order information matrix (should be a divisor of the number of steps per epoch)
```
> Due to operator limitations, only a batch size of 32 is currently supported on Ascend. The update frequency of the second-order information matrix must be set to a divisor of the number of steps per epoch (for example, 834 is a divisor of 5004). In short, our algorithm is not very flexible in setting these parameters because of framework and operator limitations, but this will be addressed in future versions.
### Training Process
#### Ascend 910
```shell
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]
```
This script requires three parameters:
- `RANK_TABLE_FILE`: the path of rank_table.json
- `DATASET_PATH`: the path of the training dataset
- `DEVICE_NUM`: the number of devices for distributed training
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files, together with results like the following in the log.
```shell
...
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
@@ -173,12 +201,16 @@ epoch: 41 step: 5004, loss is 1.8217756
epoch: 42 step: 5004, loss is 1.6453942
...
```
#### GPU
```shell
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]
```
Training results are stored in the current path, in a folder whose name begins with "train_parallel". There you can find the checkpoint files, together with results like the following in the log.
```shell
...
epoch: 1 step: 5004, loss is 4.2546034
epoch: 2 step: 5004, loss is 4.0819564
@@ -190,16 +222,18 @@ epoch: 36 step: 5004, loss is 1.645802
...
```
### Inference Process
Before running the command below, please check the checkpoint path used for inference. Please set the checkpoint path to an absolute path, e.g., `username/resnet_thor/train_parallel0/resnet-42_5004.ckpt`.
#### Ascend 910
```shell
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
This script requires two parameters:
- `DATASET_PATH`: the path of the evaluation dataset.
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
@@ -207,22 +241,25 @@ epoch: 36 step: 5004, loss is 1.645802
Inference results are stored in the example path, in a folder named `eval`. You can find results like the following in the log.
```shell
result: {'top_5_accuracy': 0.9295574583866837, 'top_1_accuracy': 0.761443661971831} ckpt=train_parallel0/resnet-42_5004.ckpt
```
#### GPU
```shell
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
```
Inference results are stored in the example path, in a folder named `eval`. You can find results like the following in the log.
```shell
result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt
```
## Model Description
### Training Performance
| Parameters | Ascend 910 | GPU |
| -------------------------- | -------------------------------------- | ---------------------------------- |
@@ -242,14 +279,25 @@ epoch: 36 step: 5004, loss is 1.645802
| Checkpoint | 491M (.ckpt file) | 380M (.ckpt file) |
| Scripts | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/resnet_thor) |
### Inference Performance
| Parameters | Ascend 910 | GPU |
| ------------------- | --------------------------- | --------------------------- |
| Model Version | ResNet50-v1.5 | ResNet50-v1.5 |
| Resource | Ascend 910 | GPU |
| Uploaded Date | 2020-06-01 | 2020-09-23 |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| Batch Size | 32 | 32 |
| Outputs | probability | probability |
| Accuracy | 76.14% | 75.97% |
| Model for inference | 98M (.air file) | |
## Description of Random Situation
In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
## ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).