!9059 modify timemonitor and ckpt info

From: @changzherui
Reviewed-by: 
Signed-off-by:
mindspore-ci-bot 2020-11-30 16:04:28 +08:00 committed by Gitee
commit 0c153b586a
4 changed files with 189 additions and 140 deletions


@@ -302,13 +302,13 @@ def check_version_and_env_config():
 def _set_pb_env():
     """Set env variable `PROTOCOL_BUFFERS` to prevent memory overflow."""
     if os.getenv("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION") == "cpp":
-        logger.warning("Current env variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp`. "
+        logger.info("Current env variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp`. "
                     "When the checkpoint file is too large, "
                     "it may cause memory limit error durning load checkpoint file. "
                     "This can be solved by set env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`.")
     elif os.getenv("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION") is None:
-        logger.warning("Setting the env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` to prevent memory overflow "
+        logger.info("Setting the env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` to prevent memory overflow "
                     "during save or load checkpoint file.")
         os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
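As context for this change, a minimal, hedged sketch of pinning the pure-Python protobuf implementation before loading a large checkpoint; the checkpoint path is only illustrative, and `load_checkpoint` is MindSpore's standard loader:

```python
import os

# Must happen before protobuf (and MindSpore's serialization code) is imported;
# otherwise the cpp implementation may already be active.
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")

from mindspore.train.serialization import load_checkpoint

# The python implementation is slower but avoids the protobuf message size limit
# that very large .ckpt files can hit with the cpp implementation.
param_dict = load_checkpoint("/PATH/TO/LARGE_NET.ckpt")  # path is illustrative
print("loaded {} parameters".format(len(param_dict)))
```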


@@ -49,4 +49,4 @@ class TimeMonitor(Callback):
             return
         step_seconds = epoch_seconds / step_size
-        print("Epoch time: {:5.3f}, per step time: {:5.3f}".format(epoch_seconds, step_seconds), flush=True)
+        print("epoch time: {:5.3f} ms, per step time: {:5.3f} ms".format(epoch_seconds, step_seconds), flush=True)


@@ -18,10 +18,11 @@
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepLabV3 Description](#contents)

## Description

DeepLab is a series of image semantic segmentation models, and DeepLabV3 improves significantly over previous versions. Two key points of DeepLabV3: its multi-grid atrous convolution handles segmenting objects at multiple scales better, and the augmented ASPP module makes image-level features available to capture long-range information.

This repository provides scripts and a recipe to train the DeepLabV3 model and achieve state-of-the-art performance.

Refer to [this paper][1] for network details.
@@ -30,31 +31,34 @@ Refer to [this paper][1] for network details.

[1]: https://arxiv.org/abs/1706.05587

# [Model Architecture](#contents)

ResNet-101 is used as the backbone, with atrous convolution for dense feature extraction.
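To illustrate what atrous (dilated) convolution means here, a minimal, hedged sketch using MindSpore's `nn.Conv2d` with a `dilation` argument; the channel sizes and rates below are illustrative, not the ones used in this repository:

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

# Atrous (dilated) convolution applies the 3x3 kernel with gaps ("rates"),
# enlarging the receptive field without reducing spatial resolution.
# The rates are only an example of an ASPP-style multi-rate setup.
rates = (1, 6, 12, 18)
branches = [nn.Conv2d(2048, 256, kernel_size=3, dilation=r, pad_mode="same") for r in rates]

x = Tensor(np.random.randn(1, 2048, 65, 65).astype(np.float32))
outs = [branch(x) for branch in branches]
print([tuple(o.shape) for o in outs])  # each branch keeps the 65x65 spatial size
```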
# [Dataset](#contents)

Pascal VOC datasets and Semantic Boundaries Dataset

- Download segmentation dataset.
- Prepare the training data list file. The list file saves the relative paths of image and annotation pairs. Lines are like the following (a hedged sketch that generates such a file appears after this section):

```shell
JPEGImages/00001.jpg SegmentationClassGray/00001.png
JPEGImages/00002.jpg SegmentationClassGray/00002.png
JPEGImages/00003.jpg SegmentationClassGray/00003.png
JPEGImages/00004.jpg SegmentationClassGray/00004.png
......
```

- Configure and run build_data.sh to convert the dataset to mindrecords. Arguments in scripts/build_data.sh:

```shell
--data_root       root path of training data
--data_lst        list of training data (prepared above)
--dst_path        where mindrecords are saved
--num_shards      number of shards of the mindrecords
--shuffle         shuffle or not
```
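For convenience, a minimal sketch of a script that could produce such a list file by pairing images with their grayscale annotations; the directory names follow the example above, and `voc_train_lst.txt` is just an illustrative output name:

```python
import os

def build_list(data_root, img_dir="JPEGImages", ann_dir="SegmentationClassGray",
               out_file="voc_train_lst.txt"):
    """Write 'relative/image.jpg relative/annotation.png' pairs, one per line."""
    lines = []
    for name in sorted(os.listdir(os.path.join(data_root, img_dir))):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".jpg":
            continue
        ann = os.path.join(ann_dir, stem + ".png")
        if os.path.exists(os.path.join(data_root, ann)):      # skip images without labels
            lines.append("{} {}".format(os.path.join(img_dir, name), ann))
    with open(os.path.join(data_root, out_file), "w") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)

# Example: build_list("/PATH/TO/VOCdevkit/VOC2012") writes the list next to the data.
```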
# [Features](#contents)
@@ -66,15 +70,15 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare hardware environment with Ascend. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
- Install the python packages in requirements.txt.
- Generate the config json file for 8pcs training.

```
# From the root of this project
cd src/tools/
@@ -85,47 +89,67 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

Based on the original DeepLabV3 paper, we reproduce two training experiments on the vocaug (also called trainaug) dataset and evaluate on the voc val dataset.

For single device training, please configure the parameters; the training script is:

```shell
run_standalone_train.sh
```

For 8 devices training, the training steps are as follows:

1. Train s16 with vocaug dataset, finetuning from the resnet101 pretrained model; the script is:

```shell
run_distribute_train_s16_r1.sh
```

2. Train s8 with vocaug dataset, finetuning from the model in the previous step; the training script is:

```shell
run_distribute_train_s8_r1.sh
```

3. Train s8 with voctrain dataset, finetuning from the model in the previous step; the training script is:

```shell
run_distribute_train_s8_r2.sh
```

For evaluation, the steps are as follows:

1. Eval s16 with voc val dataset; the eval script is:

```shell
run_eval_s16.sh
```

2. Eval s8 with voc val dataset; the eval script is:

```shell
run_eval_s8.sh
```

3. Eval s8 multiscale with voc val dataset; the eval script is:

```shell
run_eval_s8_multiscale.sh
```

4. Eval s8 multiscale and flip with voc val dataset; the eval script is:

```shell
run_eval_s8_multiscale_flip.sh
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└──deeplabv3
@@ -141,19 +165,19 @@ run_eval_s8_multiscale_flip.sh
    ├── run_eval_s8_multiscale_flip.sh    # launch ascend evaluation with multiscale and flip in s8 structure
    ├── run_standalone_train.sh           # launch ascend standalone training (1 pc)
  ├── src
    ├── data
      ├── dataset.py                      # mindrecord data generator
      ├── build_seg_data.py               # data preprocessing
    ├── loss
      ├── loss.py                         # loss definition for deeplabv3
    ├── nets
      ├── deeplab_v3
        ├── deeplab_v3.py                 # DeepLabV3 network structure
      ├── net_factory.py                  # set S16 and S8 structures
    ├── tools
      ├── get_multicards_json.py          # get rank table file
    └── utils
      └── learning_rates.py               # generate learning rate
  ├── eval.py                             # eval net
  ├── train.py                            # train net
  └── requirements.txt                    # requirements file
@@ -162,7 +186,8 @@ run_eval_s8_multiscale_flip.sh

## [Script Parameters](#contents)

Default configuration:

```shell
"data_file":"/PATH/TO/MINDRECORD_NAME"            # dataset path
"train_epochs":300                                # total epochs
"batch_size":32                                   # batch size of input tensor
@@ -183,11 +208,14 @@ Default configuration

## [Training Process](#contents)

### Usage

#### Running on Ascend

Based on the original DeepLabV3 paper, we reproduce two training experiments on the vocaug (also called trainaug) dataset and evaluate on the voc val dataset.

For single device training, please configure the parameters; the training script is as follows:

```shell
# run_standalone_train.sh
python ${train_code_path}/train.py --data_file=/PATH/TO/MINDRECORD_NAME \
                                   --train_dir=${train_path}/ckpt \
@@ -205,11 +233,12 @@ python ${train_code_path}/train.py --data_file=/PATH/TO/MINDRECORD_NAME \
                                   --save_steps=1500 \
                                   --keep_checkpoint_max=200 >log 2>&1 &
```

For 8 devices training, the training steps are as follows:

1. Train s16 with vocaug dataset, finetuning from the resnet101 pretrained model; the script is as follows:

```python
# run_distribute_train_s16_r1.sh
for((i=0;i<=$RANK_SIZE-1;i++));
do
@@ -236,8 +265,10 @@ do
    --keep_checkpoint_max=200 >log 2>&1 &
done
```

2. Train s8 with vocaug dataset, finetuning from the model in the previous step; the training script is as follows:

```shell
# run_distribute_train_s8_r1.sh
for((i=0;i<=$RANK_SIZE-1;i++));
do
@@ -265,8 +296,10 @@ do
    --keep_checkpoint_max=200 >log 2>&1 &
done
```

3. Train s8 with voctrain dataset, finetuning from the model in the previous step; the training script is as follows:

```shell
# run_distribute_train_s8_r2.sh
for((i=0;i<=$RANK_SIZE-1;i++));
do
@@ -294,73 +327,83 @@ do
    --keep_checkpoint_max=200 >log 2>&1 &
done
```

### Result

- Training vocaug in s16 structure

```shell
# distribute training result(8p)
epoch: 1 step: 41, loss is 0.8319108
epoch time: 213856.477 ms, per step time: 5216.012 ms
epoch: 2 step: 41, loss is 0.46052963
epoch time: 21233.183 ms, per step time: 517.883 ms
epoch: 3 step: 41, loss is 0.45012417
epoch time: 21231.951 ms, per step time: 517.852 ms
epoch: 4 step: 41, loss is 0.30687785
epoch time: 21199.911 ms, per step time: 517.071 ms
epoch: 5 step: 41, loss is 0.22769661
epoch time: 21240.281 ms, per step time: 518.056 ms
epoch: 6 step: 41, loss is 0.25470978
...
```

- Training vocaug in s8 structure

```shell
# distribute training result(8p)
epoch: 1 step: 82, loss is 0.024167
epoch time: 322663.456 ms, per step time: 3934.920 ms
epoch: 2 step: 82, loss is 0.019832281
epoch time: 43107.238 ms, per step time: 525.698 ms
epoch: 3 step: 82, loss is 0.021008959
epoch time: 43109.519 ms, per step time: 525.726 ms
epoch: 4 step: 82, loss is 0.01912349
epoch time: 43177.287 ms, per step time: 526.552 ms
epoch: 5 step: 82, loss is 0.022886964
epoch time: 43095.915 ms, per step time: 525.560 ms
epoch: 6 step: 82, loss is 0.018708453
epoch time: 43107.458 ms, per step time: 525.701 ms
...
```

- Training voctrain in s8 structure

```shell
# distribute training result(8p)
epoch: 1 step: 11, loss is 0.00554624
epoch time: 199412.913 ms, per step time: 18128.447 ms
epoch: 2 step: 11, loss is 0.007181881
epoch time: 6119.375 ms, per step time: 556.307 ms
epoch: 3 step: 11, loss is 0.004980865
epoch time: 5996.978 ms, per step time: 545.180 ms
epoch: 4 step: 11, loss is 0.0047651967
epoch time: 5987.412 ms, per step time: 544.310 ms
epoch: 5 step: 11, loss is 0.006262637
epoch time: 5956.682 ms, per step time: 541.517 ms
epoch: 6 step: 11, loss is 0.0060750707
epoch time: 5962.164 ms, per step time: 542.015 ms
...
```
## [Evaluation Process](#contents)

### Usage

#### Running on Ascend

Configure the checkpoint with --ckpt_path and set the dataset path. Then run the script; mIOU will be printed in eval_path/eval_log (a hedged sketch of the mIoU computation follows the script list below).

```shell
./run_eval_s16.sh                     # test s16
./run_eval_s8.sh                      # test s8
./run_eval_s8_multiscale.sh           # test s8 + multiscale
./run_eval_s8_multiscale_flip.sh      # test s8 + multiscale + flip
```
An example of the test script is as follows:

```shell
python ${train_code_path}/eval.py --data_root=/PATH/TO/DATA \
                                  --data_lst=/PATH/TO/DATA_lst.txt \
                                  --batch_size=16 \
@@ -383,6 +426,7 @@ python ${train_code_path}/eval.py --data_root=/PATH/TO/DATA \

Our results were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.

#### Training accuracy

| **Network** | OS=16 | OS=8 | MS | Flip | mIOU | mIOU in paper |
| :----------: | :-----: | :----: | :----: | :-----: | :-----: | :-------------: |
| deeplab_v3 | √ |      |      |      | 77.37 | 77.21 |
@@ -393,29 +437,31 @@ Our result were obtained by running the applicable training script. To achieve t

Note: Here OS is output stride, and MS is multiscale.

# [Model Description](#contents)

## [Performance](#contents)

### Evaluation Performance

| Parameters | Ascend 910 |
| -------------------------- | -------------------------------------- |
| Model Version | DeepLabV3 |
| Resource | Ascend 910 |
| Uploaded Date | 09/04/2020 (month/day/year) |
| MindSpore Version | 0.7.0-alpha |
| Dataset | PASCAL VOC2012 + SBD |
| Training Parameters | epoch = 300, batch_size = 32 (s16_r1) <br> epoch = 800, batch_size = 16 (s8_r1) <br> epoch = 300, batch_size = 16 (s8_r2) |
| Optimizer | Momentum |
| Loss Function | Softmax Cross Entropy |
| Outputs | probability |
| Loss | 0.0065883575 |
| Speed | 60 ms/step (1pc, s16) <br> 480 ms/step (8pcs, s16) <br> 244 ms/step (8pcs, s8) |
| Total time | 8pcs: 706 mins |
| Parameters (M) | 58.2 |
| Checkpoint for Fine tuning | 443M (.ckpt file) |
| Model for inference | 223M (.air file) |
| Scripts | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/deeplabv3) |

### Inference Performance

| Parameters | Ascend |
| ------------------- | --------------------------- |
@@ -429,10 +475,10 @@ Note: There OS is output stride, and MS is multiscale.
| Accuracy | 8pcs: <br> s16: 77.37% <br> s8: 78.84% <br> s8_multiscale: 79.70% <br> s8_Flip: 79.89% |
| Model for inference | 443M (.ckpt file) |

# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).


@@ -30,34 +30,33 @@ The overall network architecture of InceptionV3 is show below:

[Link](https://arxiv.org/pdf/1512.00567.pdf)

# [Dataset](#contents)

The dataset used is described in the paper.

- Dataset size: 125G, 1250k color images in 1000 classes
    - Train: 120G, 1200k images
    - Test: 5G, 50k images
- Data format: RGB images
    - Note: Data will be processed in src/dataset.py

# [Features](#contents)

## [Mixed Precision (Ascend)](#contents)

The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the backend of MindSpore will automatically handle it with reduced precision. Users could check the reduced-precision operators by enabling INFO log and then searching "reduce precision".
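For illustration, a hedged sketch of how mixed precision is typically requested through the `amp_level` argument of `Model`; the tiny network, optimizer, and learning rate here are placeholders, not the InceptionV3 training setup:

```python
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")  # or "GPU"

net = nn.Dense(1000, 10)                                  # stand-in network, illustrative only
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.045, momentum=0.9)

# amp_level="O3" casts the network to FP16; "O0" keeps FP32, "O2" keeps BatchNorm in FP32.
model = Model(net, loss_fn=loss, optimizer=opt, metrics={"acc"}, amp_level="O3")
# model.train(epoch, dataset, callbacks=[...]) would then run with mixed precision.
```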
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare hardware environment with Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script description](#contents)
@@ -65,14 +64,14 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

```shell
.
└─Inception-v3
  ├─README.md
  ├─scripts
    ├─run_standalone_train.sh             # launch standalone training with ascend platform(1p)
    ├─run_standalone_train_gpu.sh         # launch standalone training with gpu platform(1p)
    ├─run_distribute_train.sh             # launch distributed training with ascend platform(8p)
    ├─run_distribute_train_gpu.sh         # launch distributed training with gpu platform(8p)
    ├─run_eval.sh                         # launch evaluating with ascend platform
    └─run_eval_gpu.sh                     # launch evaluating with gpu platform
  ├─src
    ├─config.py                           # parameter configuration
@@ -83,12 +82,13 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
  ├─eval.py                               # eval net
  ├─export.py                             # convert checkpoint
  └─train.py                              # train net
```

## [Script Parameters](#contents)

```python
Major parameters in train.py and config.py are:
'random_seed'                # fix random seed
'rank'                       # local rank of distributed
'group_size'                 # world size of distributed
@@ -111,8 +111,8 @@ Major parameters in train.py and config.py are:
'ckpt_path'                  # save checkpoint path
'is_save_on_master'          # save checkpoint on rank0, distributed parameters
'dropout_keep_prob'          # the keep rate, between 0 and 1, e.g. keep_prob = 0.9 means dropping out 10% of input units
'has_bias'                   # specifies whether the layer uses a bias vector
'amp_level'                  # option for argument `level` in `mindspore.amp.build_train_network`, level for mixed
                             # precision training. Supports [O0, O2, O3].
```
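To make the `dropout_keep_prob` convention concrete, a small, hedged sketch; in the MindSpore versions this README targets, `nn.Dropout` takes the keep probability rather than the drop probability:

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

# keep_prob=0.9 keeps 90% of activations and zeroes out roughly 10% during training.
dropout = nn.Dropout(keep_prob=0.9)
dropout.set_train(True)   # dropout is only active in training mode

x = Tensor(np.ones((2, 8)).astype(np.float32))
print(dropout(x))         # ~10% of entries are zero, the rest scaled by 1/0.9
```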
@@ -121,33 +121,33 @@ Major parameters in train.py and config.py are:

### Usage

You can start training using python or shell scripts. The usage of the shell scripts is as follows:

- Ascend:

```shell
# distribute training example(8p)
sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
# standalone training
sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
```

> Notes: RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html), and the device_ip can be got as [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools). For large models like InceptionV3, it's better to export an external environment variable `export HCCL_CONNECT_TIMEOUT=600` to extend the hccl connection checking time from the default 120 seconds to 600 seconds. Otherwise, the connection could time out since compiling time increases with the growth of the model size.
>
> This is a processor core binding operation regarding the `device_num` and the total number of processors. If you do not expect to do it, remove the `taskset` operations in `scripts/run_distribute_train.sh`.

- GPU:

```python
# distribute training example(8p)
sh scripts/run_distribute_train_gpu.sh DATA_DIR
# standalone training
sh scripts/run_standalone_train_gpu.sh DEVICE_ID DATA_DIR
```
### Launch

```python
# training example
python:
    Ascend: python train.py --dataset_path /dataset/train --platform Ascend
@@ -159,23 +159,24 @@ sh scripts/run_standalone_train_gpu.sh DEVICE_ID DATA_DIR
    sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
    # standalone training
    sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH

GPU:
    # distributed training example(8p)
    sh scripts/run_distribute_train_gpu.sh /dataset/train
    # standalone training example
    sh scripts/run_standalone_train_gpu.sh 0 /dataset/train
```

### Result

Training results will be stored in the example path. Checkpoints will be stored at `./checkpoint` by default, and the training log will be redirected to `./log.txt` like the following (a hedged sketch of the checkpoint callbacks appears after the log excerpt).

```python
epoch: 0 step: 1251, loss is 5.7787247
epoch time: 360760.985 ms, per step time: 288.378 ms
epoch: 1 step: 1251, loss is 4.392868
epoch time: 160917.911 ms, per step time: 128.631 ms
```
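As a hedged illustration of where those checkpoints come from (not the exact code in train.py), saving is usually configured with `CheckpointConfig` and `ModelCheckpoint`; the prefix, directory, and numbers below are only examples:

```python
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig

# Save every 1562 steps and keep at most 10 .ckpt files on disk.
ckpt_config = CheckpointConfig(save_checkpoint_steps=1562, keep_checkpoint_max=10)
ckpt_cb = ModelCheckpoint(prefix="inceptionv3", directory="./checkpoint", config=ckpt_config)

# model and dataset are assumed to be built elsewhere; the callback writes files such as
# ./checkpoint/inceptionv3-1_1562.ckpt as training proceeds.
model.train(250, dataset, callbacks=[ckpt_cb])
```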
## [Eval process](#contents)

### Usage
@@ -183,17 +184,20 @@ Epoch time: 160917.911, per step time: 128.631

You can start evaluation using python or shell scripts. The usage of the shell scripts is as follows:

- Ascend:

```python
sh scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

- GPU:

```python
sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```
### Launch

```python
# eval example
python:
    Ascend: python eval.py --dataset_path DATA_DIR --checkpoint PATH_CHECKPOINT --platform Ascend
@@ -204,13 +208,13 @@ You can start training using python or shell scripts. The usage of shell scripts
    GPU: sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

> The checkpoint can be produced during the training process.

### Result

Evaluation results will be stored in the example path; you can find results like the following in `eval.log`.

```python
metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}
```
@@ -239,12 +243,11 @@ metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}
| Checkpoint for Fine tuning | 313M | 312M |
| Scripts | [inceptionv3 script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/inceptionv3) | [inceptionv3 script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/inceptionv3) |

#### Inference Performance

| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | InceptionV3 |
| Resource | Ascend 910, cpu: 2.60GHz, 192 cores, memory: 755G |
| Uploaded Date | 08/22/2020 |
| MindSpore Version | 0.6.0-beta |
@@ -260,5 +263,5 @@ metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).