From 71a710257359a82cc01f7c5e047daf1537a03630 Mon Sep 17 00:00:00 2001
From: dessyang
Date: Thu, 18 Feb 2021 16:45:14 -0500
Subject: [PATCH] modify README by adding GPU usage

---
 model_zoo/official/cv/faster_rcnn/README.md   | 106 +++++++++++++-----
 .../official/cv/faster_rcnn/README_CN.md      |  98 +++++++++++-----
 .../scripts/run_standalone_train_gpu.sh       |  58 ++++++++++
 3 files changed, 205 insertions(+), 57 deletions(-)
 create mode 100755 model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh

diff --git a/model_zoo/official/cv/faster_rcnn/README.md b/model_zoo/official/cv/faster_rcnn/README.md
index c4cfa5d01cb..78226698bcd 100644
--- a/model_zoo/official/cv/faster_rcnn/README.md
+++ b/model_zoo/official/cv/faster_rcnn/README.md
@@ -47,7 +47,7 @@ Dataset used: [COCO2017]()

 # Environment Requirements

-- Hardware(Ascend)
+- Hardware(Ascend/GPU)
     - Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
 - Docker base image

@@ -96,9 +96,13 @@ Dataset used: [COCO2017]()

 After installing MindSpore via the official website, you can start training and evaluation as follows:

-Note: 1.the first run will generate the mindeocrd file, which will take a long time.
-      2.pretrained model is a resnet50 checkpoint that trained over ImageNet2012.you can train it with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo, and use src/convert_checkpoint.py to get the pretrain model.
-      3.BACKBONE_MODEL is a checkpoint file trained with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo.PRETRAINED_MODEL is a checkpoint file after convert.VALIDATION_JSON_FILE is label file.
CHECKPOINT_PATH is a checkpoint file after trained.
+Note:
+
+1. the first run will generate the mindeocrd file, which will take a long time.
+2. pretrained model is a resnet50 checkpoint that trained over ImageNet2012.you can train it with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo, and use src/convert_checkpoint.py to get the pretrain model.
+3. BACKBONE_MODEL is a checkpoint file trained with [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) scripts in modelzoo.PRETRAINED_MODEL is a checkpoint file after convert.VALIDATION_JSON_FILE is label file. CHECKPOINT_PATH is a checkpoint file after trained.
+
+## Run on Ascend

 ```shell

@@ -118,7 +122,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
 ```

-# Run in docker
+## Run on GPU
+
+```shell
+
+# convert checkpoint
+python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
+
+# standalone training
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+
+# eval
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+
+```
+
+## Run in docker

 1.
Build docker images

@@ -169,9 +191,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
   ├─ascend310_infer    //application for 310 inference
   ├─scripts
     ├─run_standalone_train_ascend.sh    // shell script for standalone on ascend
+    ├─run_standalone_train_gpu.sh    // shell script for standalone on GPU
     ├─run_distribute_train_ascend.sh    // shell script for distributed on ascend
+    ├─run_distribute_train_gpu.sh    // shell script for distributed on GPU
     ├─run_infer_310.sh    // shell script for 310 inference
     └─run_eval_ascend.sh    // shell script for eval on ascend
+    └─run_eval_gpu.sh    // shell script for eval on GPU
   ├─src
     ├─FasterRcnn
       ├─__init__.py    // init file
@@ -201,6 +226,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]

 ### Usage

+#### on Ascend
+
 ```shell
 # standalone training on ascend
 sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@@ -209,6 +236,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
 sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
 ```

+#### on GPU
+
+```shell
+# standalone training on gpu
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# distributed training on gpu
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+```
+
 Notes:

 1. Rank_table.json which is specified by RANK_TABLE_FILE is needed when you are running a distribute task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@@ -259,11 +296,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005

 ### Usage

+#### on Ascend
+
 ```shell
 # eval on ascend
 sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 ```

+#### on GPU
+
+```shell
+# eval on GPU
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+```
+
 > checkpoint can be produced in training process.
 >
 > Images size in dataset should be equal to the annotation size in VALIDATION_JSON_FILE, otherwise the evaluation result cannot be displayed properly.
@@ -331,34 +377,34 @@ Inference result is saved in current path, you can find result like this in acc.

 ### Evaluation Performance

-| Parameters | Ascend |
-| -------------------------- | ----------------------------------------------------------- |
-| Model Version | V1 |
-| Resource | Ascend 910 ;CPU 2.60GHz,192cores;Memory,755G |
-| uploaded Date | 08/31/2020 (month/day/year) |
-| MindSpore Version | 1.0.0 |
-| Dataset | COCO2017 |
-| Training Parameters | epoch=12, batch_size=2 |
-| Optimizer | SGD |
-| Loss Function | Softmax Cross Entropy ,Sigmoid Cross Entropy,SmoothL1Loss |
-| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step |
-| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours |
-| Parameters (M) | 250 |
-| Scripts | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- |----------------------------------------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910 ;CPU 2.60GHz,192cores;Memory,755G |V100-PCIE 32G |
+| uploaded Date | 08/31/2020 (month/day/year) |02/10/2021 (month/day/year) |
+| MindSpore Version | 1.0.0 |1.2.0 |
+| Dataset | COCO2017 |COCO2017 |
+| Training Parameters | epoch=12, batch_size=2 |epoch=12, batch_size=2 |
+| Optimizer | SGD |SGD |
+| Loss Function | Softmax Cross Entropy ,Sigmoid Cross Entropy,SmoothL1Loss|Softmax Cross Entropy ,Sigmoid Cross Entropy,SmoothL1Loss|
+| Speed | 1pc: 190 ms/step; 8pcs: 200 ms/step | 1pc: 320 ms/step; 8pcs: 335 ms/step |
+| Total time | 1pc: 37.17 hours; 8pcs: 4.89 hours |1pc: 63.09 hours; 8pcs: 8.25 hours |
+| Parameters (M) | 250 |250 |
+| Scripts | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [fasterrcnn script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |

 ### Inference Performance

-| Parameters | Ascend |
-| ------------------- | --------------------------- |
-| Model Version | V1 |
-| Resource | Ascend 910 |
-| Uploaded Date | 08/31/2020 (month/day/year) |
-| MindSpore Version | 1.0.0 |
-| Dataset | COCO2017 |
-| batch_size | 2 |
-| outputs | mAP |
-| Accuracy | IoU=0.50: 57.6% |
-| Model for inference | 250M (.ckpt file) |
+| Parameters | Ascend |GPU |
+| ------------------- | --------------------------- |--------------------------- |
+| Model Version | V1 | V1 |
+| Resource | Ascend 910 |GPU |
+| Uploaded Date | 08/31/2020 (month/day/year) |02/10/2021 (month/day/year) |
+| MindSpore Version | 1.0.0 | 1.2.0 |
+| Dataset | COCO2017 |COCO2017 |
+| batch_size | 2 |2 |
+| outputs | mAP |mAP |
+| Accuracy | IoU=0.50: 58.6% | IoU=0.50: 59.1% |
+| Model for inference | 250M (.ckpt file) |250M (.ckpt file) |

 # [ModelZoo Homepage](#contents)

diff --git a/model_zoo/official/cv/faster_rcnn/README_CN.md b/model_zoo/official/cv/faster_rcnn/README_CN.md
index 9e79ed46549..a9d01444b34 100644
--- a/model_zoo/official/cv/faster_rcnn/README_CN.md
+++ b/model_zoo/official/cv/faster_rcnn/README_CN.md
@@ -48,7 +48,7 @@ Faster R-CNN是一个两阶段目标检测网络,该网络采用RPN,可以

 # 环境要求

-- 硬件(Ascend)
+- 硬件(Ascend/GPU)
     - 使用Ascend处理器来搭建硬件环境。如需试用Ascend处理器,请发送[申请表](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx)至ascend@huawei.com,审核通过即可获得资源。
 - 获取基础镜像

@@ -103,6 +103,8 @@ Faster R-CNN是一个两阶段目标检测网络,该网络采用RPN,可以
 2. 预训练模型是在ImageNet2012上训练的ResNet-50检查点。你可以使用ModelZoo中 [resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet) 脚本来训练, 然后使用src/convert_checkpoint.py把训练好的resnet50的权重文件转换为可加载的权重文件。
 3.
BACKBONE_MODEL是通过modelzoo中的[resnet50](https://gitee.com/qujianwei/mindspore/tree/master/model_zoo/official/cv/resnet)脚本训练的。PRETRAINED_MODEL是经过转换后的权重文件。VALIDATION_JSON_FILE为标签文件。CHECKPOINT_PATH是训练后的检查点文件。

+## 在Ascend上运行
+
 ```shell

 # 权重文件转换
@@ -121,7 +123,25 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
 ```

-# 在docker上运行
+## 在GPU上运行
+
+```shell
+
+# 权重文件转换
+python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
+
+# 单机训练
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# 分布式训练
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+
+# 评估
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+
+```
+
+## 在docker上运行

 1. 编译镜像

@@ -172,9 +192,12 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
   ├─ascend310_infer    //实现310推理源代码
   ├─scripts
     ├─run_standalone_train_ascend.sh    // Ascend单机shell脚本
+    ├─run_standalone_train_gpu.sh    // GPU单机shell脚本
     ├─run_distribute_train_ascend.sh    // Ascend分布式shell脚本
+    ├─run_distribute_train_gpu.sh    // GPU分布式shell脚本
     ├─run_infer_310.sh    // Ascend推理shell脚本
     └─run_eval_ascend.sh    // Ascend评估shell脚本
+    └─run_eval_gpu.sh    // GPU评估shell脚本
   ├─src
     ├─FasterRcnn
       ├─__init__.py    // init文件
@@ -204,6 +227,8 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]

 ### 用法

+#### 在Ascend上运行
+
 ```shell
 # Ascend单机训练
 sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
@@ -212,6 +237,16 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
 sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
 ```

+#### 在GPU上运行
+
+```shell
+# GPU单机训练
+sh run_standalone_train_gpu.sh [PRETRAINED_MODEL]
+
+# GPU分布式训练
+sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL]
+```
+
 Notes:

 1.
运行分布式任务时需要用到RANK_TABLE_FILE指定的rank_table.json。您可以使用[hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools)生成该文件。
@@ -262,11 +297,20 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005

 ### 用法

+#### 在Ascend上运行
+
 ```shell
 # Ascend评估
 sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
 ```

+#### 在GPU上运行
+
+```shell
+# GPU评估
+sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+```
+
 > 在训练过程中生成检查点。
 >
 > 数据集中图片的数量要和VALIDATION_JSON_FILE文件中标记数量一致,否则精度结果展示格式可能出现异常。
@@ -334,34 +378,34 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]

 ### 训练性能

-| 参数 |Ascend |
-| -------------------------- | ----------------------------------------------------------- |
-| 模型版本 | V1 |
-| 资源 | Ascend 910;CPU 2.60GHz,192核;内存:755G |
-| 上传日期 | 2020/8/31 |
-| MindSpore版本 | 1.0.0 |
-| 数据集 | COCO 2017 |
-| 训练参数 | epoch=12, batch_size=2 |
-| 优化器 | SGD |
-| 损失函数 | Softmax交叉熵,Sigmoid交叉熵,SmoothL1Loss |
-| 速度 | 1卡:190毫秒/步;8卡:200毫秒/步 |
-| 总时间 | 1卡:37.17小时;8卡:4.89小时 |
-| 参数(M) | 250 |
-| 脚本 | [Faster R-CNN脚本](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |
+| 参数 |Ascend |GPU |
+| -------------------------- | ----------------------------------------------------------- |----------------------------------------------------------- |
+| 模型版本 | V1 |V1 |
+| 资源 | Ascend 910;CPU 2.60GHz,192核;内存:755G |V100-PCIE 32G |
+| 上传日期 | 2020/8/31 | 2021/2/10 |
+| MindSpore版本 | 1.0.0 |1.2.0 |
+| 数据集 | COCO 2017 |COCO 2017 |
+| 训练参数 | epoch=12, batch_size=2 |epoch=12, batch_size=2 |
+| 优化器 | SGD |SGD |
+| 损失函数 | Softmax交叉熵,Sigmoid交叉熵,SmoothL1Loss |Softmax交叉熵,Sigmoid交叉熵,SmoothL1Loss |
+| 速度 | 1卡:190毫秒/步;8卡:200毫秒/步 | 1卡:320毫秒/步;8卡:335毫秒/步 |
+| 总时间 | 1卡:37.17小时;8卡:4.89小时 |1卡:63.09小时;8卡:8.25小时 |
+| 参数(M) | 250 |250 |
+| 脚本 | [Faster R-CNN脚本](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) | [Faster
R-CNN脚本](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/faster_rcnn) |

 ### 评估性能

-| 参数 | Ascend |
-| ------------------- | --------------------------- |
-| 模型版本 | V1 |
-| 资源 | Ascend 910 |
-| 上传日期 | 2020/8/31 |
-| MindSpore版本 | 1.0.0 |
-| 数据集 | COCO2017 |
-| batch_size | 2 |
-| 输出 | mAP |
-| 准确率 | IoU=0.50:57.6% |
-| 推理模型 | 250M(.ckpt文件) |
+| 参数 | Ascend |GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| 模型版本 | V1 |V1 |
+| 资源 | Ascend 910 |V100-PCIE 32G |
+| 上传日期 | 2020/8/31 |2021/2/10 |
+| MindSpore版本 | 1.0.0 |1.2.0 |
+| 数据集 | COCO2017 |COCO2017 |
+| batch_size | 2 | 2 |
+| 输出 | mAP |mAP |
+| 准确率 | IoU=0.50:58.6% |IoU=0.50:59.1% |
+| 推理模型 | 250M(.ckpt文件) |250M(.ckpt文件) |

 # ModelZoo主页

diff --git a/model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh b/model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh
new file mode 100755
index 00000000000..61bd7153fdb
--- /dev/null
+++ b/model_zoo/official/cv/faster_rcnn/scripts/run_standalone_train_gpu.sh
@@ -0,0 +1,58 @@
+#!/bin/bash
+# Copyright 2021 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# -ne 1 ]
+then
+    echo "Usage: sh run_standalone_train_gpu.sh [PRETRAINED_PATH]"
+exit 1
+fi
+
+get_real_path(){
+  if [ "${1:0:1}" == "/" ]; then
+    echo "$1"
+  else
+    echo "$(realpath -m $PWD/$1)"
+  fi
+}
+
+PATH1=$(get_real_path $1)
+echo $PATH1
+
+if [ ! -f $PATH1 ]
+then
+    echo "error: PRETRAINED_PATH=$PATH1 is not a file"
+exit 1
+fi
+
+ulimit -u unlimited
+export DEVICE_NUM=1
+export DEVICE_ID=0
+export RANK_ID=0
+export RANK_SIZE=1
+
+if [ -d "train" ];
+then
+    rm -rf ./train
+fi
+mkdir ./train
+cp ../*.py ./train
+cp *.sh ./train
+cp -r ../src ./train
+cd ./train || exit
+echo "start training for device $DEVICE_ID"
+env > env.log
+python train.py --device_id=$DEVICE_ID --pre_trained=$PATH1 --device_target="GPU" &> log &
+cd ..
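
Note on `get_real_path` in the new script: the usage string suggests launching it with `sh`, but the helper relies on `${1:0:1}` and on `==` inside `[ ]`, which are bash extensions and may fail under a strictly POSIX `sh` such as dash. The following is a minimal sketch of a portable variant, not part of the patch. It assumes GNU `realpath` with the `-m` flag is available, as the original script already does, and the checkpoint path in the example call is purely illustrative.

```shell
#!/bin/sh
# Hypothetical POSIX-sh variant of the patch's get_real_path helper.
# Assumption: GNU realpath (with -m, which tolerates missing path components) is installed.
get_real_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;          # already absolute: return it unchanged
        *)  realpath -m -- "$PWD/$1" ;;    # relative: resolve against the current directory
    esac
}

# Illustrative call with a made-up absolute path; prints /data/ckpt/resnet50.ckpt
get_real_path /data/ckpt/resnet50.ckpt
```

Running the script as `bash run_standalone_train_gpu.sh [PRETRAINED_MODEL]` sidesteps the issue without any code change. The `&> log` redirection on the training line is likewise bash-specific, so a POSIX rewrite would also need `> log 2>&1` there.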