modify README by adding GPU usage

dessyang 2021-02-18 16:45:14 -05:00
parent 191b3f0c8c
commit 71a7102573
3 changed files with 205 additions and 57 deletions


@ -47,7 +47,7 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)
# Environment Requirements
- Hardware (Ascend)
- Hardware (Ascend/GPU)
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Docker base image
@ -96,9 +96,13 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)
After installing MindSpore via the official website, you can start training and evaluation as follows:
Note: 1. The first run generates the MindRecord file, which takes a long time.
2. The pretrained model is a ResNet-50 checkpoint trained on ImageNet2012. You can train it with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo, then use src/convert_checkpoint.py to obtain the pretrained model.
3. BACKBONE_MODEL is a checkpoint file trained with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo. PRETRAINED_MODEL is the checkpoint file after conversion. VALIDATION_JSON_FILE is the label file. CHECKPOINT_PATH is the checkpoint file produced by training.
Note:
1. The first run generates the MindRecord file, which takes a long time.
2. The pretrained model is a ResNet-50 checkpoint trained on ImageNet2012. You can train it with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo, then use src/convert_checkpoint.py to obtain the pretrained model.
3. BACKBONE_MODEL is a checkpoint file trained with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo. PRETRAINED_MODEL is the checkpoint file after conversion. VALIDATION_JSON_FILE is the label file. CHECKPOINT_PATH is the checkpoint file produced by training.
## Run on Ascend
```shell
@ -118,7 +122,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
```
# Run in docker
## Run on GPU
```shell
# convert checkpoint
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
# distributed training
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
# eval
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
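The GPU scripts take their inputs as positional arguments, so a mistyped path or device count only surfaces after the script has started. A minimal pre-flight sketch is given here; the helper names `require_file` and `require_device_num` are hypothetical and are not part of the repository's scripts.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers (assumptions, not shipped with the model).

# Succeeds only if the given path is an existing regular file.
require_file() {
    if [ ! -f "$1" ]; then
        echo "error: $1 is not a file" >&2
        return 1
    fi
}

# Succeeds only if the argument is an integer >= 1 (intended for DEVICE_NUM).
require_device_num() {
    case "$1" in
        ''|*[!0-9]*)
            echo "error: DEVICE_NUM=$1 is not a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$1" -lt 1 ]; then
        echo "error: DEVICE_NUM must be at least 1" >&2
        return 1
    fi
}
```

With these defined, a distributed run could be guarded as `require_device_num 8 && require_file ./pretrained.ckpt && sh run_distribute_train_gpu.sh 8 ./pretrained.ckpt`, where `./pretrained.ckpt` is a placeholder path.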
## Run in docker
1. Build docker images
@ -169,9 +191,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
├─ascend310_infer //application for 310 inference
├─scripts
├─run_standalone_train_ascend.sh // shell script for standalone on ascend
├─run_standalone_train_gpu.sh // shell script for standalone on GPU
├─run_distribute_train_ascend.sh // shell script for distributed on ascend
├─run_distribute_train_gpu.sh // shell script for distributed on GPU
├─run_infer_310.sh // shell script for 310 inference
└─run_eval_ascend.sh // shell script for eval on ascend
└─run_eval_gpu.sh // shell script for eval on GPU
├─src
├─FasterRcnn
├─__init__.py // init file
@ -201,6 +226,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
### Usage
#### on Ascend
```shell
# standalone training on ascend
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@ -209,6 +236,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
#### on GPU
```shell
# standalone training on gpu
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
# distributed training on gpu
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
```
Notes:
1. A rank_table.json file, specified by RANK_TABLE_FILE, is required for distributed training on Ascend. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@ -259,11 +296,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005
### Usage
#### on Ascend
```shell
# eval on ascend
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
#### on GPU
```shell
# eval on GPU
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
> The checkpoint is produced during training.
>
> The number of images in the dataset should match the number of annotated images in VALIDATION_JSON_FILE; otherwise the evaluation result may not be displayed properly.
@ -331,34 +377,34 @@ Inference result is saved in current path, you can find result like this in acc.
### Evaluation Performance
| Parameters | Ascend |
| -------------------------- | ----------------------------------------------------------- |
| Model Version | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G |
| Uploaded Date | 08/31/2020 (month/day/year) |
| MindSpore Version | 1.0.0 |
| Dataset | COCO2017 |
| Training Parameters | epoch=12, batch_size=2 |
| Optimizer | SGD |
| Loss Function | Softmax Cross Entropy ,Sigmoid Cross Entropy,SmoothL1Loss |
| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step |
| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours |
| Parameters (M) | 250 |
| Scripts | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| Model Version | V1 | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G | V100-PCIE 32G |
| Uploaded Date | 08/31/2020 (month/day/year) | 02/10/2021 (month/day/year) |
| MindSpore Version | 1.0.0 | 1.2.0 |
| Dataset | COCO2017 | COCO2017 |
| Training Parameters | epoch=12, batch_size=2 | epoch=12, batch_size=2 |
| Optimizer | SGD | SGD |
| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step | 1pc: 320 ms/step; 8pcs: 335 ms/step |
| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours | 1pc: 63.09 hours; 8pcs: 8.25 hours |
| Parameters (M) | 250 | 250 |
| Scripts | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
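The "Total time" row can be roughly cross-checked against the "Speed" row as epochs × steps per epoch × time per step. The sketch below makes two assumptions that the README does not state: that the 7393 steps per epoch seen in the training log excerpt belong to the 8-device run, and that a single device therefore runs 8 × 7393 = 59144 steps per epoch. The function name `train_hours` is hypothetical.

```shell
#!/bin/bash
# train_hours EPOCHS STEPS_PER_EPOCH MS_PER_STEP
# Prints the estimated wall-clock training time in hours, to two decimals.
# The step counts passed in are assumptions, not values given by the README.
train_hours() {
    awk -v e="$1" -v s="$2" -v ms="$3" \
        'BEGIN { printf "%.2f\n", e * s * ms / 3600000 }'
}
```

Under those assumptions, `train_hours 12 59144 320` prints `63.09`, matching the single-GPU entry, and `train_hours 12 7393 335` prints `8.26`, close to the 8.25 hours listed for 8 GPUs.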
### Inference Performance
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | V1 |
| Resource | Ascend 910 |
| Uploaded Date | 08/31/2020 (month/day/year) |
| MindSpore Version | 1.0.0 |
| Dataset | COCO2017 |
| batch_size | 2 |
| outputs | mAP |
| Accuracy | IoU=0.50: 57.6% |
| Model for inference | 250M (.ckpt file) |
| Parameters | Ascend |GPU |
| ------------------- | --------------------------- |--------------------------- |
| Model Version | V1 | V1 |
| Resource | Ascend 910 |GPU |
| Uploaded Date | 08/31/2020 (month/day/year) |02/10/2021 (month/day/year) |
| MindSpore Version | 1.0.0 | 1.2.0 |
| Dataset | COCO2017 |COCO2017 |
| batch_size | 2 |2 |
| outputs | mAP |mAP |
| Accuracy | IoU=0.50: 58.6% | IoU=0.50: 59.1% |
| Model for inference | 250M (.ckpt file) |250M (.ckpt file) |
# [ModelZoo Homepage](#contents)


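@ -48,7 +48,7 @@ Faster R-CNN is a two-stage object detection network. The network uses an RPN, which can
# Environment Requirements
- Hardware (Ascend)
- Hardware (Ascend/GPU)
- Prepare the hardware environment with Ascend processors. To try Ascend processors, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Get the base image
@ -103,6 +103,8 @@ Faster R-CNN is a two-stage object detection network. The network uses an RPN, which can
2. The pretrained model is a ResNet-50 checkpoint trained on ImageNet2012. You can train it with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo, then use src/convert_checkpoint.py to convert the trained resnet50 weight file into a loadable weight file.
3. BACKBONE_MODEL is trained with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo. PRETRAINED_MODEL is the weight file after conversion. VALIDATION_JSON_FILE is the label file. CHECKPOINT_PATH is the checkpoint file produced by training.
## Run on Ascend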
```shell
# convert checkpoint
@ -121,7 +123,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
```
# Run in docker
## Run on GPU
```shell
# convert checkpoint
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
# distributed training
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
# eval
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
## Run in docker
1. Build the image
@ -172,9 +192,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
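├─ascend310_infer //source code for 310 inference
├─scripts
├─run_standalone_train_ascend.sh // shell script for standalone training on Ascend
├─run_standalone_train_gpu.sh // shell script for standalone training on GPU
├─run_distribute_train_ascend.sh // shell script for distributed training on Ascend
├─run_distribute_train_gpu.sh // shell script for distributed training on GPU
├─run_infer_310.sh // shell script for inference on Ascend
└─run_eval_ascend.sh // shell script for eval on Ascend
└─run_eval_gpu.sh // shell script for eval on GPU
├─src
├─FasterRcnn
├─__init__.py // init file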
@ -204,6 +227,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
### Usage
#### Run on Ascend
```shell
# standalone training on Ascend
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@ -212,6 +237,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
#### Run on GPU
```shell
# standalone training on GPU
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
# distributed training on GPU
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
```
Notes:
1. The rank_table.json file specified by RANK_TABLE_FILE is required when running a distributed task. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@ -262,11 +297,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005
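### Usage
#### Run on Ascend
```shell
# eval on Ascend
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
#### Run on GPU
```shell
# eval on GPU
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
> Checkpoints are generated during training.
>
> The number of images in the dataset must match the number of annotations in VALIDATION_JSON_FILE; otherwise the accuracy results may be displayed in an abnormal format.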
@ -334,34 +378,34 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
### Training Performance
| Parameters | Ascend |
| -------------------------- | ----------------------------------------------------------- |
| Model Version | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G |
| Uploaded Date | 2020/8/31 |
| MindSpore Version | 1.0.0 |
| Dataset | COCO 2017 |
| Training Parameters | epoch=12, batch_size=2 |
| Optimizer | SGD |
| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step |
| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours |
| Parameters (M) | 250 |
| Scripts | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| Model Version | V1 | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G | V100-PCIE 32G |
| Uploaded Date | 2020/8/31 | 2021/2/10 |
| MindSpore Version | 1.0.0 | 1.2.0 |
| Dataset | COCO 2017 | COCO 2017 |
| Training Parameters | epoch=12, batch_size=2 | epoch=12, batch_size=2 |
| Optimizer | SGD | SGD |
| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step | 1pc: 320 ms/step; 8pcs: 335 ms/step |
| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours | 1pc: 63.09 hours; 8pcs: 8.25 hours |
| Parameters (M) | 250 | 250 |
| Scripts | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
### Evaluation Performance
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | V1 |
| Resource | Ascend 910 |
| Uploaded Date | 2020/8/31 |
| MindSpore Version | 1.0.0 |
| Dataset | COCO2017 |
| batch_size | 2 |
| outputs | mAP |
| Accuracy | IoU=0.50: 57.6% |
| Model for inference | 250M (.ckpt file) |
| Parameters | Ascend | GPU |
| ------------------- | --------------------------- | --------------------------- |
| Model Version | V1 | V1 |
| Resource | Ascend 910 | V100-PCIE 32G |
| Uploaded Date | 2020/8/31 | 2021/2/10 |
| MindSpore Version | 1.0.0 | 1.2.0 |
| Dataset | COCO2017 | COCO2017 |
| batch_size | 2 | 2 |
| outputs | mAP | mAP |
| Accuracy | IoU=0.50: 58.6% | IoU=0.50: 59.1% |
| Model for inference | 250M (.ckpt file) | 250M (.ckpt file) |
# ModelZoo Homepage


@ -0,0 +1,58 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 1 ]
then
    echo "Usage: sh run_standalone_train_gpu.sh [PRETRAINED_PATH]"
    exit 1
fi

# Resolve a possibly relative path to an absolute one.
get_real_path(){
    if [ "${1:0:1}" == "/" ]; then
        echo "$1"
    else
        realpath -m "$PWD/$1"
    fi
}

PATH1=$(get_real_path "$1")
echo "$PATH1"
if [ ! -f "$PATH1" ]
then
    echo "error: PRETRAINED_PATH=$PATH1 is not a file"
    exit 1
fi

ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=0
export RANK_ID=0
export RANK_SIZE=1

# Start from a clean ./train directory holding a copy of the code.
if [ -d "train" ];
then
    rm -rf ./train
fi
mkdir ./train
cp ../*.py ./train
cp *.sh ./train
cp -r ../src ./train
cd ./train || exit
echo "start training for device $DEVICE_ID"
env > env.log
# Train in the background; stdout and stderr both go to ./train/log.
python train.py --device_id=$DEVICE_ID --pre_trained="$PATH1" --device_target="GPU" &> log &
cd ..
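The README invokes this script with `sh`, while `get_real_path` relies on the `${1:0:1}` substring expansion, which is a bash extension rather than POSIX shell syntax. On systems where `sh` is not bash, that helper may fail. A portable variant could look like the following sketch; the name `get_real_path_posix` is hypothetical, and `realpath -m` is still assumed to be available from GNU coreutils, as in the original script.

```shell
#!/bin/sh
# Hypothetical POSIX-friendly variant of get_real_path (an assumption,
# not the script's own helper). Absolute paths are returned unchanged;
# relative paths are resolved against the current directory.
get_real_path_posix() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  realpath -m "$PWD/$1" ;;
    esac
}
```

The `case` pattern `/*` tests for a leading slash without any bash-only syntax, so the function behaves the same under bash and under a plain POSIX `sh`.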