mobilenetv2+ssd gpu

This commit is contained in:
root 2020-09-23 15:40:23 +08:00
parent b993ea0288
commit 0d16d52d61
6 changed files with 290 additions and 55 deletions

README.md

@ -82,7 +82,8 @@ Dataset used: [COCO2017](<http://images.cocodataset.org/>)
# [Quick Start](#contents)
After installing MindSpore via the official website, you can start training and evaluation as follows:
- running on Ascend
```
# distributed training on Ascend
@ -91,6 +92,14 @@ sh run_distribute_train.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [RANK_TABLE_
# run eval on Ascend
sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
- running on GPU
```
# distributed training on GPU
sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET]
# run eval on GPU
sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
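For a quick single-GPU run without the distributed launcher, `train.py` can also be invoked directly; a minimal sketch (the hyper-parameter values here are only illustrative, not tuned):
```
python train.py --run_platform="GPU" --device_id=0 --dataset=coco --lr=0.05 --epoch_size=500
```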
# [Script Description](#contents)
@ -100,22 +109,24 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
.
└─ cv
  └─ ssd
    ├─ README.md                      ## descriptions about SSD
    ├─ scripts
      ├─ run_distribute_train.sh      ## shell script for distributed training on Ascend
      ├─ run_distribute_train_gpu.sh  ## shell script for distributed training on GPU
      ├─ run_eval.sh                  ## shell script for eval on Ascend
      └─ run_eval_gpu.sh              ## shell script for eval on GPU
    ├─ src
      ├─ __init__.py                  ## init file
      ├─ box_util.py                  ## bbox utils
      ├─ coco_eval.py                 ## coco metrics utils
      ├─ config.py                    ## total config
      ├─ dataset.py                   ## create and process dataset
      ├─ init_params.py               ## parameters utils
      ├─ lr_schedule.py               ## learning rate generator
      └─ ssd.py                       ## ssd architecture
    ├─ eval.py                        ## eval script
    ├─ train.py                       ## train script
    └─ mindspore_hub_conf.py          ## mindspore hub interface
```
## [Script Parameters](#contents)
@ -146,10 +157,9 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
## [Training Process](#contents)
To train the model, run `train.py`. If `mindrecord_dir` is empty, it will generate [mindrecord](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/convert_dataset.html) files from `coco_root` (COCO dataset) or from `image_dir` and `anno_path` (your own dataset). **Note: if `mindrecord_dir` is not empty, it will be used instead of the raw images.**
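To generate the MindRecord files ahead of time without starting training, you can run the same command the GPU launch script uses as its first step:
```
python train.py --only_create_dataset=True --run_platform="GPU"
```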
### Training on Ascend
- Distribute mode
@ -184,6 +194,34 @@ epoch: 500 step: 458, loss is 0.5548882
epoch time: 39064.8467540741, per step time: 85.29442522723602
```
### Training on GPU
- Distribute mode
```
sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)
```
We need four or six parameters for this script.
- `DEVICE_NUM`: the number of devices for distributed training.
- `EPOCH_SIZE`: the number of epochs for distributed training.
- `LR`: the initial learning rate for distributed training.
- `DATASET`: the dataset used for distributed training.
- `PRE_TRAINED`: the path of the pretrained checkpoint file; it is better to use an absolute path.
- `PRE_TRAINED_EPOCH_SIZE`: the number of epochs already trained in the pretrained checkpoint.
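For example, to train from scratch on 8 GPUs, or to resume from a pretrained checkpoint (the checkpoint path below is only a placeholder):
```
sh run_distribute_train_gpu.sh 8 800 0.2 coco
sh run_distribute_train_gpu.sh 8 800 0.2 coco /path/to/ssd-300.ckpt 300
```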
Training results will be stored in the current path, in a folder named "LOG". There you can find checkpoint files together with results like the following in the log:
```
epoch: 1 step: 1, loss is 420.11783
epoch: 1 step: 2, loss is 434.11032
epoch: 1 step: 3, loss is 476.802
...
epoch: 1 step: 458, loss is 3.1283689
epoch time: 150753.701, per step time: 329.157
...
```
## [Evaluation Process](#contents)
### Evaluation on Ascend
@ -219,41 +257,73 @@ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.697
mAP: 0.23808886505483504
```
### Evaluation on GPU
```
sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
We need three parameters for this script.
- `DATASET`: the dataset used for evaluation.
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
- `DEVICE_ID`: the device id for evaluation.
> The checkpoint can be produced in the training process.
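For example (the checkpoint path is only a placeholder):
```
sh run_eval_gpu.sh coco /path/to/ssd.ckpt 0
```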
Inference results will be stored in the example path, in a folder whose name begins with "eval". There you can find results like the following in the log.
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.224
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.375
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.228
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.034
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.189
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.407
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.243
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.417
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.120
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.425
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.686
========================================
mAP: 0.2244936111705981
```
# [Model Description](#contents)
## [Performance](#contents)
### Evaluation Performance
| Parameters                 | Ascend                                                       | GPU                                                          |
| -------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Model Version              | SSD V1                                                       | SSD V1                                                       |
| Resource                   | Ascend 910; CPU 2.60GHz, 192 cores; Memory 755G              | NV SMX2 V100-16G                                             |
| Uploaded Date              | 06/01/2020 (month/day/year)                                  | 09/24/2020 (month/day/year)                                  |
| MindSpore Version          | 0.3.0-alpha                                                  | 1.0.0                                                        |
| Dataset                    | COCO2017                                                     | COCO2017                                                     |
| Training Parameters        | epoch = 500, batch_size = 32                                 | epoch = 800, batch_size = 32                                 |
| Optimizer                  | Momentum                                                     | Momentum                                                     |
| Loss Function              | Sigmoid Cross Entropy, SmoothL1Loss                          | Sigmoid Cross Entropy, SmoothL1Loss                          |
| Speed                      | 8pcs: 90 ms/step                                             | 8pcs: 121 ms/step                                            |
| Total time                 | 8pcs: 4.81 hours                                             | 8pcs: 12.31 hours                                            |
| Parameters (M)             | 34                                                           | 34                                                           |
| Scripts                    | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd |
### Inference Performance
| Parameters          | Ascend                      | GPU                         |
| ------------------- | ---------------------------- | ---------------------------- |
| Model Version       | SSD V1                      | SSD V1                      |
| Resource            | Ascend 910                  | GPU                         |
| Uploaded Date       | 06/01/2020 (month/day/year) | 09/24/2020 (month/day/year) |
| MindSpore Version   | 0.3.0-alpha                 | 1.0.0                       |
| Dataset             | COCO2017                    | COCO2017                    |
| batch_size          | 1                           | 1                           |
| outputs             | mAP                         | mAP                         |
| Accuracy            | IoU=0.50: 23.8%             | IoU=0.50: 22.4%             |
| Model for inference | 34M (.ckpt file)            | 34M (.ckpt file)            |
# [Description of Random Situation](#contents)

eval.py

@ -71,9 +71,11 @@ if __name__ == '__main__':
parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
parser.add_argument("--dataset", type=str, default="coco", help="Dataset, default is coco.")
parser.add_argument("--checkpoint_path", type=str, required=True, help="Checkpoint file path.")
parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
help="run platform, only support Ascend and GPU.")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.run_platform, device_id=args_opt.device_id)
prefix = "ssd_eval.mindrecord"
mindrecord_dir = config.mindrecord_dir
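With the new `run_platform` argument, `eval.py` can also be launched directly on GPU; a hedged example invocation (the checkpoint path is only a placeholder):
```
python eval.py --dataset=coco --checkpoint_path=/path/to/ssd.ckpt --run_platform=GPU --device_id=0
```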

scripts/run_distribute_train_gpu.sh

@ -0,0 +1,77 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the scipt as: "
echo "sh run_distribute_train_gpu.sh DEVICE_NUM EPOCH_SIZE LR DATASET PRE_TRAINED PRE_TRAINED_EPOCH_SIZE"
echo "for example: sh run_distribute_train_gpu.sh 8 500 0.2 coco /opt/ssd-300.ckpt(optional) 200(optional)"
echo "It is better to use absolute path."
echo "================================================================================================================="
if [ $# != 4 ] && [ $# != 6 ]
then
echo "Usage: sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] \
[PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)"
exit 1
fi
# Before starting distributed training, create the mindrecord files first.
BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
cd $BASE_PATH/../ || exit
python train.py --only_create_dataset=True --run_platform="GPU"
echo "After running the scipt, the network runs in the background. The log will be generated in LOG/log.txt"
export RANK_SIZE=$1
EPOCH_SIZE=$2
LR=$3
DATASET=$4
PRE_TRAINED=$5
PRE_TRAINED_EPOCH_SIZE=$6
rm -rf LOG
mkdir ./LOG
cp ./*.py ./LOG
cp -r ./src ./LOG
cd ./LOG || exit
if [ $# == 4 ]
then
mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
python train.py \
--distribute=True \
--lr=$LR \
--dataset=$DATASET \
--device_num=$RANK_SIZE \
--loss_scale=1 \
--run_platform="GPU" \
--epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
fi
if [ $# == 6 ]
then
mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
python train.py \
--distribute=True \
--lr=$LR \
--dataset=$DATASET \
--device_num=$RANK_SIZE \
--pre_trained=$PRE_TRAINED \
--pre_trained_epoch_size=$PRE_TRAINED_EPOCH_SIZE \
--loss_scale=1 \
--run_platform="GPU" \
--epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
fi
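The script above relies on OpenMPI to spawn the training processes; each process then discovers its rank through MindSpore's communication API. A minimal sketch of that mechanism, separate from the repository's code (assumes NCCL-enabled GPUs and launch via `mpirun`):
```
# save as check_rank.py and run with: mpirun -n 2 python check_rank.py
from mindspore import context
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init()  # initializes NCCL using the environment set up by mpirun
print("rank:", get_rank(), "of", get_group_size())
```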

scripts/run_eval_gpu.sh

@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ]
then
echo "Usage: sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]"
exit 1
fi
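# get_real_path resolves a possibly relative argument into an absolute path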
get_real_path(){
  if [ "${1:0:1}" == "/" ]; then
    echo "$1"
  else
    echo "$(realpath -m $PWD/$1)"
  fi
}
DATASET=$1
CHECKPOINT_PATH=$(get_real_path $2)
echo $DATASET
echo $CHECKPOINT_PATH
if [ ! -f $CHECKPOINT_PATH ]
then
  echo "error: CHECKPOINT_PATH=$CHECKPOINT_PATH is not a file"
  exit 1
fi
export DEVICE_NUM=1
export DEVICE_ID=$3
export RANK_SIZE=$DEVICE_NUM
export RANK_ID=0
BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
cd $BASE_PATH/../ || exit
if [ -d "eval$3" ];
then
rm -rf ./eval$3
fi
mkdir ./eval$3
cp ./*.py ./eval$3
cp -r ./src ./eval$3
cd ./eval$3 || exit
env > env.log
echo "start infering for device $DEVICE_ID"
python eval.py \
--dataset=$DATASET \
--checkpoint_path=$CHECKPOINT_PATH \
--run_platform="GPU" \
--device_id=$3 > log.txt 2>&1 &
cd ..

src/ssd.py

@ -250,6 +250,8 @@ class SSD300(nn.Cell):
pred_loc, pred_label = self.multi_box(multi_feature)
if not self.is_training:
pred_label = self.activation(pred_label)
pred_loc = F.cast(pred_loc, mstype.float32)
pred_label = F.cast(pred_label, mstype.float32)
return pred_loc, pred_label

train.py

@ -20,12 +20,12 @@ import argparse
import ast
import mindspore.nn as nn
from mindspore import context, Tensor
from mindspore.communication.management import init
from mindspore.communication.management import init, get_rank
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, LossMonitor, TimeMonitor
from mindspore.train import Model
from mindspore.context import ParallelMode
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.common import set_seed
from mindspore.common import set_seed, dtype
from src.ssd import SSD300, SSDWithLossCell, TrainingWrapper, ssd_mobilenet_v2
from src.config import config
from src.dataset import create_ssd_dataset, data_to_mindrecord_byte_image, voc_data_to_mindrecord
@ -53,20 +53,36 @@ def main():
parser.add_argument("--loss_scale", type=int, default=1024, help="Loss scale, default is 1024.")
parser.add_argument("--filter_weight", type=ast.literal_eval, default=False,
help="Filter weight parameters, default is False.")
parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
help="run platform, only support Ascend and GPU.")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
if args_opt.run_platform == "Ascend":
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
init()
rank = args_opt.device_id % device_num
else:
rank = 0
device_num = 1
elif args_opt.run_platform == "GPU":
context.set_context(mode=context.GRAPH_MODE, device_target="GPU", device_id=args_opt.device_id)
init()
rank = args_opt.device_id % device_num
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
rank = get_rank()
else:
rank = 0
device_num = 1
else:
rank = 0
device_num = 1
raise ValueError("Unsupported platform.")
print("Start create dataset!")
@ -113,6 +129,8 @@ def main():
backbone = ssd_mobilenet_v2()
ssd = SSD300(backbone=backbone, config=config)
if args_opt.run_platform == "GPU":
ssd.to_float(dtype.float16)
net = SSDWithLossCell(ssd, config)
init_net_param(net)
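The GPU path runs the whole network in float16 (`ssd.to_float(dtype.float16)` above), while the change in `src/ssd.py` casts the predictions back to float32 before returning them, keeping the loss and post-processing in full precision. A minimal sketch of this pattern with an illustrative `TinyHead` cell rather than the real SSD300 (assumes a backend with float16 support, such as GPU):
```
import numpy as np
import mindspore.nn as nn
import mindspore.ops.functional as F
from mindspore import Tensor
from mindspore.common import dtype as mstype

class TinyHead(nn.Cell):
    """Stand-in for a prediction head; computes in whatever dtype the cell was cast to."""
    def __init__(self):
        super(TinyHead, self).__init__()
        self.dense = nn.Dense(8, 4)

    def construct(self, x):
        y = self.dense(x)
        # cast back to float32 so downstream ops (loss, NMS, metrics) see full precision
        return F.cast(y, mstype.float32)

net = TinyHead()
net.to_float(mstype.float16)  # internal compute in float16, mirroring ssd.to_float(dtype.float16)
out = net(Tensor(np.ones((2, 8), np.float32)))
print(out.dtype)  # Float32
```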