modify README by adding GPU usage

2021-02-18 16:45:14 -05:00 · 2021-02-18 16:45:14 -05:00 · 71a7102573
parent 191b3f0c8c
commit 71a7102573
3 changed files with 205 additions and 57 deletions
--- a/model_zoo/official/cv/faster_rcnn/README.md
+++ b/model_zoo/official/cv/faster_rcnn/README.md
@@ -47,7 +47,7 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)

 # Environment Requirements

-- Hardware（Ascend）
+- Hardware（Ascend/GPU）
    - Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.

 - Docker base image
@@ -96,9 +96,13 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)

 After installing MindSpore via the official website, you can start training and evaluation as follows:

-Note: 1.the first run will generate the mindeocrd file, which will take a long time.
-      2.pretrained model is a resnet50 checkpoint that trained over ImageNet2012.you can train it with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo, and use src/convert_checkpoint.py to get the pretrain model.
-      3.BACKBONE_MODEL is a checkpoint file trained with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo.PRETRAINED_MODEL is a checkpoint file after convert.VALIDATION_JSON_FILE is label file. CHECKPOINT_PATH is a checkpoint file after trained.
+Note:
+
+1. The first run generates the MindRecord file, which takes a long time.
+2. The pretrained model is a ResNet-50 checkpoint trained on ImageNet2012. You can train it with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo, then use src/convert_checkpoint.py to produce the pretrained model.
+3. BACKBONE_MODEL is a checkpoint file trained with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo. PRETRAINED_MODEL is the checkpoint file after conversion. VALIDATION_JSON_FILE is the label file. CHECKPOINT_PATH is the checkpoint file produced by training.
+
+## Run on Ascend

 ```shell

@@ -118,7 +122,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
 ```

-# Run in docker
+## Run on GPU
+
+```shell
+
+# convert checkpoint
+python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
+
+# standalone training
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+
+# eval
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+
+```
+
+## Run in docker

 1. Build docker images

@@ -169,9 +191,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
  ├─ascend310_infer //application for 310 inference
  ├─scripts
    ├─run_standalone_train_ascend.sh    // shell script for standalone on ascend
+    ├─run_standalone_train_gpu.sh    // shell script for standalone on GPU
    ├─run_distribute_train_ascend.sh    // shell script for distributed on ascend
+    ├─run_distribute_train_gpu.sh    // shell script for distributed on GPU
    ├─run_infer_310.sh    // shell script for 310 inference
    └─run_eval_ascend.sh    // shell script for eval on ascend
+    └─run_eval_gpu.sh    // shell script for eval on GPU
  ├─src
    ├─FasterRcnn
      ├─__init__.py    // init file
@@ -201,6 +226,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]

 ### Usage

+#### on Ascend
+
 ```shell
 # standalone training on ascend
 sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@@ -209,6 +236,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
 sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
 ```

+#### on GPU
+
+```shell
+# standalone training on gpu
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training on gpu
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+```
+
 Notes:

 1. Rank_table.json which is specified by RANK_TABLE_FILE is needed when you are running a distribute task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@@ -259,11 +296,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005

 ### Usage

+#### on Ascend
+
 ```shell
 # eval on ascend
 sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 ```

+#### on GPU
+
+```shell
+# eval on GPU
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+```
+
 > checkpoint can be produced in training process.
 >
 > Images size in dataset should be equal to the annotation size in VALIDATION_JSON_FILE, otherwise the evaluation result cannot be displayed properly.
@@ -331,34 +377,34 @@ Inference result is saved in current path, you can find result like this in acc.

 ### Evaluation Performance

-| Parameters                 | Ascend                                                   |
-| -------------------------- | ----------------------------------------------------------- |
-| Model Version              | V1                                                |
-| Resource                   | Ascend 910 ；CPU 2.60GHz，192cores；Memory，755G             |
-| uploaded Date              | 08/31/2020 (month/day/year)                                 |
-| MindSpore Version          | 1.0.0                                                       |
-| Dataset                    | COCO2017                                                   |
-| Training Parameters        | epoch=12,  batch_size=2          |
-| Optimizer                  | SGD                                                         |
-| Loss Function              | Softmax Cross Entropy ,Sigmoid Cross Entropy,SmoothL1Loss                                      |
-| Speed                      | 1pc: 190 ms/step;  8pcs: 200 ms/step                          |
-| Total time                 | 1pc: 37.17 hours;  8pcs: 4.89 hours                          |
-| Parameters (M)             | 250                                                         |
-| Scripts                    | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G | V100-PCIE 32G |
+| Uploaded Date | 08/31/2020 (month/day/year) | 02/10/2021 (month/day/year) |
+| MindSpore Version | 1.0.0 | 1.2.0 |
+| Dataset | COCO2017 | COCO2017 |
+| Training Parameters | epoch=12, batch_size=2 | epoch=12, batch_size=2 |
+| Optimizer | SGD | SGD |
+| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
+| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step | 1pc: 320 ms/step; 8pcs: 335 ms/step |
+| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours | 1pc: 63.09 hours; 8pcs: 8.25 hours |
+| Parameters (M) | 250 | 250 |
+| Scripts | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |

 ### Inference Performance

-| Parameters          | Ascend                |
-| ------------------- | --------------------------- |
-| Model Version       | V1                |
-| Resource            | Ascend 910                  |
-| Uploaded Date       | 08/31/2020 (month/day/year) |
-| MindSpore Version   | 1.0.0                       |
-| Dataset             | COCO2017    |
-| batch_size          | 2                         |
-| outputs             | mAP                 |
-| Accuracy            |  IoU=0.50: 57.6%  |
-| Model for inference | 250M (.ckpt file)         |
+| Parameters | Ascend | GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910 | V100-PCIE 32G |
+| Uploaded Date | 08/31/2020 (month/day/year) | 02/10/2021 (month/day/year) |
+| MindSpore Version | 1.0.0 | 1.2.0 |
+| Dataset | COCO2017 | COCO2017 |
+| batch_size | 2 | 2 |
+| outputs | mAP | mAP |
+| Accuracy | IoU=0.50: 58.6% | IoU=0.50: 59.1% |
+| Model for inference | 250M (.ckpt file) | 250M (.ckpt file) |

 # [ModelZoo Homepage](#contents)  

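As a sanity check on the GPU column of the training table, total training time should be roughly steps per epoch × epochs × time per step. The sketch below rests on two assumptions that the diff does not state outright: that the 7393 steps per epoch visible in the training log line quoted in a hunk header belong to the 8-device run, and that a single device processes eight times as many steps per epoch. `train_hours` is a hypothetical helper, not part of the repository.

```shell
# Sketch: estimate total training hours from step count and step time.
# train_hours is a hypothetical helper.
# Usage: train_hours STEPS_PER_EPOCH EPOCHS MS_PER_STEP
train_hours() {
    LC_ALL=C awk -v s="$1" -v e="$2" -v ms="$3" \
        'BEGIN { printf "%.2f\n", s * e * ms / 1000 / 3600 }'
}

train_hours 59144 12 320   # 1 GPU: 8 x 7393 steps per epoch (assumed)
train_hours 7393 12 335    # 8 GPUs: 7393 steps per epoch (assumed)
```

Under these assumptions the estimates come out at about 63.09 and 8.26 hours, in line with the 63.09 and 8.25 hours reported for the GPU.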
--- a/model_zoo/official/cv/faster_rcnn/README_CN.md
+++ b/model_zoo/official/cv/faster_rcnn/README_CN.md
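@@ -48,7 +48,7 @@ Faster R-CNN is a two-stage object detection network that uses an RPN, which can

 # Environment Requirements

-- Hardware (Ascend)
+- Hardware (Ascend/GPU)
     - Set up the hardware environment with Ascend processors. To try Ascend processors, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Resources are granted once the application is approved.

 - Obtain the base image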
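@@ -103,6 +103,8 @@ Faster R-CNN is a two-stage object detection network that uses an RPN, which can
 2. The pretrained model is a ResNet-50 checkpoint trained on ImageNet2012. You can train it with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in ModelZoo, then use src/convert_checkpoint.py to convert the trained resnet50 weights into a loadable checkpoint file.
 3. BACKBONE_MODEL is trained with the [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo. PRETRAINED_MODEL is the checkpoint file after conversion. VALIDATION_JSON_FILE is the label file. CHECKPOINT_PATH is the checkpoint file produced by training.

+## Run on Ascend
+
 ```shell

 # convert checkpoint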
@@ -121,7 +123,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
 ```

-# Run in docker
+## Run on GPU
+
+```shell
+
+# convert checkpoint
+python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
+
+# standalone training
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+
+# eval
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+
+```
+
+## Run in docker

 1. Build the image

@@ -172,9 +192,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
  ├─ascend310_infer  // source code for 310 inference
  ├─scripts
    ├─run_standalone_train_ascend.sh    // shell script for standalone training on Ascend
+    ├─run_standalone_train_gpu.sh    // shell script for standalone training on GPU
    ├─run_distribute_train_ascend.sh    // shell script for distributed training on Ascend
+    ├─run_distribute_train_gpu.sh    // shell script for distributed training on GPU
    ├─run_infer_310.sh    // shell script for inference on Ascend
    └─run_eval_ascend.sh    // shell script for eval on Ascend
+    └─run_eval_gpu.sh    // shell script for eval on GPU
  ├─src
    ├─FasterRcnn
      ├─__init__.py    // init file
@@ -204,6 +227,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]

 ### Usage

+#### Run on Ascend
+
 ```shell
 # standalone training on Ascend
 sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@@ -212,6 +237,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
 sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
 ```

+#### Run on GPU
+
+```shell
+# standalone training on GPU
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training on GPU
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+```
+
 Notes:

 1. Distributed tasks need the rank_table.json specified by RANK_TABLE_FILE. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@@ -262,11 +297,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005

 ### Usage

+#### Run on Ascend
+
 ```shell
 # eval on Ascend
 sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 ```

+#### Run on GPU
+
+```shell
+# eval on GPU
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+```
+
 > Checkpoints are generated during training.
 >
 > The number of images in the dataset must match the number of annotations in VALIDATION_JSON_FILE; otherwise the accuracy results may not be displayed properly.
@@ -334,34 +378,34 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]

 ### Training Performance

-| Parameters | Ascend |
-| -------------------------- | ----------------------------------------------------------- |
-| Model Version | V1 |
-| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory: 755G |
-| Uploaded Date | 2020/8/31 |
-| MindSpore Version | 1.0.0 |
-| Dataset | COCO 2017 |
-| Training Parameters | epoch=12, batch_size=2 |
-| Optimizer | SGD |
-| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
-| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step |
-| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours |
-| Parameters (M) | 250 |
-| Scripts | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910; CPU 2.60GHz, 192 cores; Memory: 755G | V100-PCIE 32G |
+| Uploaded Date | 2020/8/31 | 2021/2/10 |
+| MindSpore Version | 1.0.0 | 1.2.0 |
+| Dataset | COCO 2017 | COCO 2017 |
+| Training Parameters | epoch=12, batch_size=2 | epoch=12, batch_size=2 |
+| Optimizer | SGD | SGD |
+| Loss Function | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss | Softmax Cross Entropy, Sigmoid Cross Entropy, SmoothL1Loss |
+| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step | 1pc: 320 ms/step; 8pcs: 335 ms/step |
+| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours | 1pc: 63.09 hours; 8pcs: 8.25 hours |
+| Parameters (M) | 250 | 250 |
+| Scripts | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [Faster R-CNN script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |

 ### Evaluation Performance

-| Parameters | Ascend |
-| ------------------- | --------------------------- |
-| Model Version | V1 |
-| Resource | Ascend 910 |
-| Uploaded Date | 2020/8/31 |
-| MindSpore Version | 1.0.0 |
-| Dataset | COCO2017 |
-| batch_size | 2 |
-| outputs | mAP |
-| Accuracy | IoU=0.50: 57.6% |
-| Model for inference | 250M (.ckpt file) |
+| Parameters | Ascend | GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910 | V100-PCIE 32G |
+| Uploaded Date | 2020/8/31 | 2021/2/10 |
+| MindSpore Version | 1.0.0 | 1.2.0 |
+| Dataset | COCO2017 | COCO2017 |
+| batch_size | 2 | 2 |
+| outputs | mAP | mAP |
+| Accuracy | IoU=0.50: 58.6% | IoU=0.50: 59.1% |
+| Model for inference | 250M (.ckpt file) | 250M (.ckpt file) |

 # ModelZoo Homepage

--- a/model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh
+++ b/model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh
@@ -0,0 +1,58 @@
+#!/bin/bash
+# Copyright 2021 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# -ne 1 ]
+then
+    echo "Usage: sh run_standalone_train_gpu.sh [PRETRAINED_PATH]"
+    exit 1
+fi
+
+get_real_path(){
+  if [ "${1:0:1}" == "/" ]; then
+    echo "$1"
+  else
+    realpath -m "$PWD/$1"
+  fi
+}
+
+PATH1=$(get_real_path "$1")
+echo "$PATH1"
+
+if [ ! -f "$PATH1" ]
+then
+    echo "error: PRETRAINED_PATH=$PATH1 is not a file"
+    exit 1
+fi
+
+ulimit -u unlimited
+export DEVICE_NUM=1
+export DEVICE_ID=0
+export RANK_ID=0
+export RANK_SIZE=1
+
+if [ -d "train" ];
+then
+    rm -rf ./train
+fi
+mkdir ./train
+cp ../*.py ./train
+cp *.sh ./train
+cp -r ../src ./train
+cd ./train || exit
+echo "start training for device $DEVICE_ID"
+env > env.log
+python train.py --device_id=$DEVICE_ID --pre_trained="$PATH1" --device_target="GPU"  &> log &
+cd ..
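
A note on portability: the README invokes this script with `sh`, while the script itself relies on bash features (`${1:0:1}` substring expansion and `&>` redirection). On systems where `sh` is not bash, those constructs may not behave as intended. The following is a minimal sketch, not the committed implementation, of a POSIX-style path resolver. `resolve_path` is a hypothetical name, and `realpath -m` is assumed to be the GNU coreutils version, as in the script above.

```shell
#!/bin/sh
# Sketch only: POSIX-style variant of get_real_path from the script above.
# resolve_path is a hypothetical name, not part of the commit.
resolve_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;       # already absolute: return unchanged
        *)  realpath -m "$PWD/$1" ;;    # relative: resolve against the current directory (GNU realpath assumed)
    esac
}

# Example: an absolute path passes through untouched.
resolve_path /data/resnet50.ckpt
```

In the same spirit, the portable spelling of the `&> log &` redirect would be `> log 2>&1 &`, which sends both stdout and stderr to `log`.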