!6869 mobilenetv2+ssd gpu

Merge pull request !6869 from TuDouNi/r1.0
This commit is contained in:
mindspore-ci-bot 2020-09-25 11:19:11 +08:00 committed by Gitee
commit 2e40ac6465
6 changed files with 290 additions and 54 deletions


@@ -82,7 +82,8 @@ Dataset used: [COCO2017](<http://images.cocodataset.org/>)
# [Quick Start](#contents)
After installing MindSpore via the official website, you can start training and evaluation on Ascend as follows:
After installing MindSpore via the official website, you can start training and evaluation as follows:
- running on Ascend
```
# distributed training on Ascend
@@ -91,6 +92,14 @@ sh run_distribute_train.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [RANK_TABLE_
# run eval on Ascend
sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
- running on GPU
```
# distributed training on GPU
sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET]
# run eval on GPU
sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
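For a concrete starting point, an 8-GPU run on COCO might look like the sketch below; the device count, epoch count, and learning rate are illustrative values only (they mirror examples used later in this README), and `coco_root` is assumed to be configured in `src/config.py`.
```
# illustrative 8-GPU distributed training run on the COCO dataset
sh run_distribute_train_gpu.sh 8 800 0.2 coco
```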
# [Script Description](#contents)
@@ -100,21 +109,24 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
.
└─ cv
└─ ssd
├─ README.md ## descriptions about SSD
├─ README.md ## descriptions about SSD
├─ scripts
└─ run_distribute_train.sh ## shell script for distributed on ascend
└─ run_eval.sh ## shell script for eval on ascend
├─ run_distribute_train.sh ## shell script for distributed on ascend
├─ run_distribute_train_gpu.sh ## shell script for distributed on gpu
├─ run_eval.sh ## shell script for eval on ascend
└─ run_eval_gpu.sh ## shell script for eval on gpu
├─ src
├─ __init__.py ## init file
├─ box_util.py ## bbox utils
├─ coco_eval.py ## coco metrics utils
├─ config.py ## total config
├─ dataset.py ## create dataset and process dataset
├─ init_params.py ## parameters utils
├─ lr_schedule.py ## learning rate generator
└─ ssd.py ## ssd architecture
├─ eval.py ## eval scripts
└─ train.py ## train scripts
├─ __init__.py ## init file
├─ box_util.py ## bbox utils
├─ coco_eval.py ## coco metrics utils
├─ config.py ## total config
├─ dataset.py ## create dataset and process dataset
├─ init_params.py ## parameters utils
├─ lr_schedule.py ## learning rate generator
└─ ssd.py ## ssd architecture
├─ eval.py ## eval scripts
├─ train.py ## train scripts
└─ mindspore_hub_conf.py ## mindspore hub interface
```
## [Script Parameters](#contents)
@@ -145,10 +157,9 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
## [Training Process](#contents)
### Training on Ascend
To train the model, run `train.py`. If `mindrecord_dir` is empty, it will generate [mindrecord](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/convert_dataset.html) files from `coco_root` (COCO dataset) or from `image_dir` and `anno_path` (your own dataset). **Note: if `mindrecord_dir` isn't empty, the script will use `mindrecord_dir` instead of the raw images.**
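If you only want to pre-generate the MindRecord files, the training entry point can be invoked directly with the dataset-only flag; this is a minimal sketch, assuming `coco_root` (or `image_dir` and `anno_path`) is already set in `src/config.py` (the GPU launch script below issues the same call with `--run_platform="GPU"`).
```
# create the MindRecord files only, without starting training
python train.py --only_create_dataset=True --run_platform="Ascend"
```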
### Training on Ascend
- Distribute mode
@@ -183,6 +194,34 @@ epoch: 500 step: 458, loss is 0.5548882
epoch time: 39064.8467540741, per step time: 85.29442522723602
```
### Training on GPU
- Distribute mode
```
sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)
```
We need four or six parameters for this script (see the example after this list).
- `DEVICE_NUM`: the number of devices used for distributed training.
- `EPOCH_SIZE`: the number of epochs for distributed training.
- `LR`: the initial learning rate for distributed training.
- `DATASET`: the dataset mode for distributed training.
- `PRE_TRAINED`: the path of the pretrained checkpoint file; an absolute path is recommended.
- `PRE_TRAINED_EPOCH_SIZE`: the number of epochs the pretrained checkpoint was trained for.
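For example, to resume distributed GPU training from a pretrained checkpoint, the six-parameter form would be used; the checkpoint path and epoch values below are placeholders.
```
# resume distributed training on 8 GPUs from a pretrained checkpoint
sh run_distribute_train_gpu.sh 8 800 0.2 coco /opt/ssd-300.ckpt 200
```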
The training result will be stored in the current path, in a folder named "LOG". Under it you can find checkpoint files together with results like the following in the log.
```
epoch: 1 step: 1, loss is 420.11783
epoch: 1 step: 2, loss is 434.11032
epoch: 1 step: 3, loss is 476.802
...
epoch: 1 step: 458, loss is 3.1283689
epoch time: 150753.701, per step time: 329.157
...
```
## [Evaluation Process](#contents)
### Evaluation on Ascend
@@ -218,41 +257,73 @@ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.697
mAP: 0.23808886505483504
```
### Evaluation on GPU
```
sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
```
We need three parameters for this script (see the example after this list).
- `DATASET`: the dataset mode of the evaluation dataset.
- `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
- `DEVICE_ID`: the device id for evaluation.
> The checkpoint can be produced during the training process.
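For example, evaluating a trained checkpoint on GPU device 0 could look like this; the checkpoint path is a placeholder.
```
# evaluate on the COCO dataset with GPU device 0
sh run_eval_gpu.sh coco /absolute/path/to/ssd.ckpt 0
```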
The inference result will be stored in the example path, in a folder whose name begins with "eval". Under it you can find results like the following in the log.
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.224
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.375
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.228
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.034
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.189
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.407
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.243
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.382
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.417
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.120
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.425
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.686
========================================
mAP: 0.2244936111705981
```
# [Model Description](#contents)
## [Performance](#contents)
### Evaluation Performance
| Parameters | Ascend |
| -------------------------- | -------------------------------------------------------------|
| Model Version | SSD V1 |
| Resource | Ascend 910; CPU 2.60GHz, 56 cores; Memory 314G |
| uploaded Date | 06/01/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha |
| Dataset | COCO2017 |
| Training Parameters | epoch = 500, batch_size = 32 |
| Optimizer | Momentum |
| Loss Function | Sigmoid Cross Entropy, SmoothL1Loss |
| Speed | 8pcs: 90ms/step |
| Total time | 8pcs: 4.81 hours |
| Parameters (M) | 34 |
| Scripts | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd |
| Parameters | Ascend | GPU |
| -------------------------- | -------------------------------------------------------------| -------------------------------------------------------------|
| Model Version | SSD V1 | SSD V1 |
| Resource | Ascend 910; CPU 2.60GHz, 56 cores; Memory 314G | NV SXM2 V100-16G |
| uploaded Date | 06/01/2020 (month/day/year) | 09/24/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | COCO2017 | COCO2017 |
| Training Parameters | epoch = 500, batch_size = 32 | epoch = 800, batch_size = 32 |
| Optimizer | Momentum | Momentum |
| Loss Function | Sigmoid Cross Entropy, SmoothL1Loss | Sigmoid Cross Entropy, SmoothL1Loss |
| Speed | 8pcs: 90ms/step | 8pcs: 121ms/step |
| Total time | 8pcs: 4.81 hours | 8pcs: 12.31 hours |
| Parameters (M) | 34 | 34 |
| Scripts | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd |
### Inference Performance
| Parameters | Ascend |
| ------------------- | ----------------------------|
| Model Version | SSD V1 |
| Resource | Ascend 910 |
| Uploaded Date | 06/01/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha |
| Dataset | COCO2017 |
| batch_size | 1 |
| outputs | mAP |
| Accuracy | IoU=0.50: 23.8% |
| Model for inference | 34M(.ckpt file) |
| Parameters | Ascend | GPU |
| ------------------- | ----------------------------| ----------------------------|
| Model Version | SSD V1 | SSD V1 |
| Resource | Ascend 910 | GPU |
| Uploaded Date | 06/01/2020 (month/day/year) | 09/24/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | COCO2017 | COCO2017 |
| batch_size | 1 | 1 |
| outputs | mAP | mAP |
| Accuracy | IoU=0.50: 23.8% | IoU=0.50: 22.4% |
| Model for inference | 34M(.ckpt file) | 34M(.ckpt file) |
# [Description of Random Situation](#contents)


@@ -71,9 +71,11 @@ if __name__ == '__main__':
parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
parser.add_argument("--dataset", type=str, default="coco", help="Dataset, default is coco.")
parser.add_argument("--checkpoint_path", type=str, required=True, help="Checkpoint file path.")
parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
help="run platform, only support Ascend and GPU.")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.run_platform, device_id=args_opt.device_id)
prefix = "ssd_eval.mindrecord"
mindrecord_dir = config.mindrecord_dir


@@ -0,0 +1,77 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the scipt as: "
echo "sh run_distribute_train_gpu.sh DEVICE_NUM EPOCH_SIZE LR DATASET PRE_TRAINED PRE_TRAINED_EPOCH_SIZE"
echo "for example: sh run_distribute_train_gpu.sh 8 500 0.2 coco /opt/ssd-300.ckpt(optional) 200(optional)"
echo "It is better to use absolute path."
echo "================================================================================================================="
if [ $# != 4 ] && [ $# != 6 ]
then
echo "Usage: sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] \
[PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)"
exit 1
fi
# Before starting distributed training, first create the MindRecord files.
BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
cd $BASE_PATH/../ || exit
python train.py --only_create_dataset=True --run_platform="GPU"
echo "After running the scipt, the network runs in the background. The log will be generated in LOG/log.txt"
export RANK_SIZE=$1
EPOCH_SIZE=$2
LR=$3
DATASET=$4
PRE_TRAINED=$5
PRE_TRAINED_EPOCH_SIZE=$6
rm -rf LOG
mkdir ./LOG
cp ./*.py ./LOG
cp -r ./src ./LOG
cd ./LOG || exit
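# launch distributed training without a pretrained checkpoint (4-argument form);
# training runs in the background and writes its log to LOG/log.txt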
if [ $# == 4 ]
then
mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
python train.py \
--distribute=True \
--lr=$LR \
--dataset=$DATASET \
--device_num=$RANK_SIZE \
--loss_scale=1 \
--run_platform="GPU" \
--epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
fi
if [ $# == 6 ]
then
mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
python train.py \
--distribute=True \
--lr=$LR \
--dataset=$DATASET \
--device_num=$RANK_SIZE \
--pre_trained=$PRE_TRAINED \
--pre_trained_epoch_size=$PRE_TRAINED_EPOCH_SIZE \
--loss_scale=1 \
--run_platform="GPU" \
--epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
fi


@@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ]
then
echo "Usage: sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]"
exit 1
fi
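# resolve a possibly relative path argument to an absolute path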
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATASET=$1
CHECKPOINT_PATH=$(get_real_path $2)
echo $DATASET
echo $CHECKPOINT_PATH
if [ ! -f $CHECKPOINT_PATH ]
then
echo "error: CHECKPOINT_PATH=$PATH2 is not a file"
exit 1
fi
export DEVICE_NUM=1
export DEVICE_ID=$3
export RANK_SIZE=$DEVICE_NUM
export RANK_ID=0
BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
cd $BASE_PATH/../ || exit
if [ -d "eval$3" ];
then
rm -rf ./eval$3
fi
mkdir ./eval$3
cp ./*.py ./eval$3
cp -r ./src ./eval$3
cd ./eval$3 || exit
env > env.log
echo "start infering for device $DEVICE_ID"
python eval.py \
--dataset=$DATASET \
--checkpoint_path=$CHECKPOINT_PATH \
--run_platform="GPU" \
--device_id=$3 > log.txt 2>&1 &
cd ..


@@ -250,6 +250,8 @@ class SSD300(nn.Cell):
pred_loc, pred_label = self.multi_box(multi_feature)
if not self.is_training:
pred_label = self.activation(pred_label)
pred_loc = F.cast(pred_loc, mstype.float32)
pred_label = F.cast(pred_label, mstype.float32)
return pred_loc, pred_label


@@ -20,12 +20,12 @@ import argparse
import ast
import mindspore.nn as nn
from mindspore import context, Tensor
from mindspore.communication.management import init
from mindspore.communication.management import init, get_rank
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, LossMonitor, TimeMonitor
from mindspore.train import Model
from mindspore.context import ParallelMode
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.common import set_seed
from mindspore.common import set_seed, dtype
from src.ssd import SSD300, SSDWithLossCell, TrainingWrapper, ssd_mobilenet_v2
from src.config import config
from src.dataset import create_ssd_dataset, data_to_mindrecord_byte_image, voc_data_to_mindrecord
@@ -53,20 +53,36 @@ def main():
parser.add_argument("--loss_scale", type=int, default=1024, help="Loss scale, default is 1024.")
parser.add_argument("--filter_weight", type=ast.literal_eval, default=False,
help="Filter weight parameters, default is False.")
parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
help="run platform, only support Ascend and GPU.")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
if args_opt.run_platform == "Ascend":
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
init()
rank = args_opt.device_id % device_num
else:
rank = 0
device_num = 1
elif args_opt.run_platform == "GPU":
context.set_context(mode=context.GRAPH_MODE, device_target="GPU", device_id=args_opt.device_id)
init()
rank = args_opt.device_id % device_num
if args_opt.distribute:
device_num = args_opt.device_num
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
rank = get_rank()
else:
rank = 0
device_num = 1
else:
rank = 0
device_num = 1
raise ValueError("Unsupported platform.")
print("Start create dataset!")
@@ -113,6 +129,8 @@ def main():
backbone = ssd_mobilenet_v2()
ssd = SSD300(backbone=backbone, config=config)
if args_opt.run_platform == "GPU":
ssd.to_float(dtype.float16)
net = SSDWithLossCell(ssd, config)
init_net_param(net)