forked from mindspore-Ecosystem/mindspore
!9059 modify timemonitor and ckpt info
From: @changzherui
commit 0c153b586a
@@ -302,13 +302,13 @@ def check_version_and_env_config():
 def _set_pb_env():
     """Set env variable `PROTOCOL_BUFFERS` to prevent memory overflow."""
     if os.getenv("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION") == "cpp":
-        logger.warning("Current env variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp`. "
-                       "When the checkpoint file is too large, "
-                       "it may cause memory limit error durning load checkpoint file. "
-                       "This can be solved by set env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`.")
+        logger.info("Current env variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp`. "
+                    "When the checkpoint file is too large, "
+                    "it may cause a memory limit error when loading the checkpoint file. "
+                    "This can be solved by setting env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`.")
     elif os.getenv("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION") is None:
-        logger.warning("Setting the env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` to prevent memory overflow "
-                       "during save or load checkpoint file.")
+        logger.info("Setting the env `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` to prevent memory overflow "
+                    "when saving or loading the checkpoint file.")
         os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
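For context, protobuf only honors this variable if it is set before the protobuf package is first imported, which is why `_set_pb_env()` runs during package initialization. A minimal standalone sketch of the same idea (the import order is the point; this is not the MindSpore source itself):

```python
import os

# Choose the pure-Python protobuf backend before protobuf is imported;
# the cpp backend can hit hard memory limits on very large checkpoint files.
if os.getenv("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION") is None:
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import mindspore  # protobuf gets imported (and the backend fixed) from here on
```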
@@ -49,4 +49,4 @@ class TimeMonitor(Callback):
             return
 
         step_seconds = epoch_seconds / step_size
-        print("Epoch time: {:5.3f}, per step time: {:5.3f}".format(epoch_seconds, step_seconds), flush=True)
+        print("epoch time: {:5.3f} ms, per step time: {:5.3f} ms".format(epoch_seconds, step_seconds), flush=True)
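For reference, the log lines changed above come from attaching `TimeMonitor` to `Model.train`; a minimal usage sketch, assuming the MindSpore 1.x callback API (`net`, `loss`, `opt`, and `ds_train` are placeholders built elsewhere):

```python
from mindspore import Model
from mindspore.train.callback import LossMonitor, TimeMonitor

# net, loss, opt, and ds_train are assumed to be defined elsewhere.
model = Model(net, loss_fn=loss, optimizer=opt)

# data_size is the number of steps per epoch; TimeMonitor divides the
# measured epoch time by it to get the per-step time printed above.
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
model.train(300, ds_train, callbacks=[time_cb, LossMonitor()])
```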
@@ -18,10 +18,11 @@
 - [Description of Random Situation](#description-of-random-situation)
 - [ModelZoo Homepage](#modelzoo-homepage)
 
 # [DeepLabV3 Description](#contents)
 
-## Description
-DeepLab is a series of image semantic segmentation models, DeepLabV3 improves significantly over previous versions. Two keypoints of DeepLabV3: Its multi-grid atrous convolution makes it better to deal with segmenting objects at multiple scales, and augmented ASPP makes image-level features available to capture long range information.
+DeepLab is a series of image semantic segmentation models; DeepLabV3 improves significantly over previous versions. Two keypoints of DeepLabV3: its multi-grid atrous convolution makes it better at segmenting objects at multiple scales, and the augmented ASPP makes image-level features available to capture long-range information.
+
 This repository provides a script and recipe to train the DeepLabV3 model and achieve state-of-the-art performance.
 
 Refer to [this paper][1] for network details.
@@ -30,31 +31,34 @@ Refer to [this paper][1] for network details.
 [1]: https://arxiv.org/abs/1706.05587
 
 # [Model Architecture](#contents)
 
 Resnet101 as backbone, atrous convolution for dense feature extraction.
 
 # [Dataset](#contents)
 
 Pascal VOC datasets and Semantic Boundaries Dataset
+
 - Download segmentation dataset.
 
 - Prepare the training data list file. The list file saves the relative path to image and annotation pairs. Lines are like:
+
-```
+```shell
 JPEGImages/00001.jpg SegmentationClassGray/00001.png
 JPEGImages/00002.jpg SegmentationClassGray/00002.png
 JPEGImages/00003.jpg SegmentationClassGray/00003.png
 JPEGImages/00004.jpg SegmentationClassGray/00004.png
 ......
 ```
 
 - Configure and run build_data.sh to convert the dataset to mindrecords. Arguments in scripts/build_data.sh:
+
-```
+```shell
 --data_root                 root path of training data
 --data_lst                  list of training data (prepared above)
 --dst_path                  where mindrecords are saved
 --num_shards                number of shards of the mindrecords
 --shuffle                   shuffle or not
 ```
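As an aside, one plausible way to produce the list file described above (a sketch; `write_data_list`, its defaults, and the directory layout are illustrative assumptions based on the sample lines, not part of the repository scripts):

```python
import os

def write_data_list(data_root, img_dir="JPEGImages",
                    ann_dir="SegmentationClassGray", out_file="train.lst"):
    """Pair each JPEG image with its gray-label PNG and write one
    'image_relpath annotation_relpath' line per pair."""
    with open(os.path.join(data_root, out_file), "w") as lst:
        for name in sorted(os.listdir(os.path.join(data_root, img_dir))):
            stem, ext = os.path.splitext(name)
            if ext.lower() != ".jpg":
                continue
            ann = os.path.join(ann_dir, stem + ".png")
            if os.path.exists(os.path.join(data_root, ann)):
                lst.write("{} {}\n".format(os.path.join(img_dir, name), ann))

write_data_list("/PATH/TO/DATA")  # placeholder root, as in the README
```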
 
 # [Features](#contents)
@@ -66,15 +70,15 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
 # [Environment Requirements](#contents)
 
 - Hardware(Ascend)
     - Prepare hardware environment with Ascend. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
 - Framework
     - [MindSpore](https://www.mindspore.cn/install/en)
 - For more information, please check the resources below:
     - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
     - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
 - Install python packages in requirements.txt
 - Generate config json file for 8pcs training
 
 ```
 # From the root of this project
 cd src/tools/
@@ -85,47 +89,67 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
 # [Quick Start](#contents)
 
 After installing MindSpore via the official website, you can start training and evaluation as follows:
 
 - Running on Ascend
 
 Based on the original DeepLabV3 paper, we reproduce two training experiments on the vocaug (also as trainaug) dataset and evaluate on the voc val dataset.
 
 For single device training, please configure the parameters; the training script is:
+
-```
+```shell
 run_standalone_train.sh
 ```
 
 For 8 devices training, the training steps are as follows:
 
 1. Train s16 with the vocaug dataset, finetuning from the resnet101 pretrained model; the script is:
+
-```
+```shell
 run_distribute_train_s16_r1.sh
 ```
 
 2. Train s8 with the vocaug dataset, finetuning from the model in the previous step; the training script is:
+
-```
+```shell
 run_distribute_train_s8_r1.sh
 ```
 
 3. Train s8 with the voctrain dataset, finetuning from the model in the previous step; the training script is:
+
-```
+```shell
 run_distribute_train_s8_r2.sh
 ```
 
 For evaluation, the steps are as follows:
 
 1. Eval s16 with the voc val dataset; the eval script is:
+
-```
+```shell
 run_eval_s16.sh
 ```
 
 2. Eval s8 with the voc val dataset; the eval script is:
+
-```
+```shell
 run_eval_s8.sh
 ```
 
 3. Eval s8 multiscale with the voc val dataset; the eval script is:
+
-```
+```shell
 run_eval_s8_multiscale.sh
 ```
 
 4. Eval s8 multiscale and flip with the voc val dataset; the eval script is:
+
-```
+```shell
 run_eval_s8_multiscale_flip.sh
 ```
 
 # [Script Description](#contents)
 
 ## [Script and Sample Code](#contents)
 
 ```shell
 .
 └──deeplabv3
@@ -141,19 +165,19 @@ run_eval_s8_multiscale_flip.sh
   ├── run_eval_s8_multiscale_filp.sh     # launch ascend evaluation with multiscale and flip in s8 structure
   ├── run_standalone_train.sh            # launch ascend standalone training(1 pc)
 ├── src
   ├── data
     ├── dataset.py                       # mindrecord data generator
    ├── build_seg_data.py                # data preprocessing
   ├── loss
     ├── loss.py                          # loss definition for deeplabv3
   ├── nets
     ├── deeplab_v3
       ├── deeplab_v3.py                  # DeepLabV3 network structure
     ├── net_factory.py                   # set S16 and S8 structures
   ├── tools
     ├── get_multicards_json.py           # get rank table file
   └── utils
     └── learning_rates.py                # generate learning rate
 ├── eval.py                              # eval net
 ├── train.py                             # train net
 └── requirements.txt                     # requirements file
@@ -162,7 +186,8 @@ run_eval_s8_multiscale_flip.sh
 ## [Script Parameters](#contents)
 
 Default configuration
+
-```
+```shell
 "data_file":"/PATH/TO/MINDRECORD_NAME"            # dataset path
 "train_epochs":300                                # total epochs
 "batch_size":32                                   # batch size of input tensor
@@ -183,11 +208,14 @@ Default configuration
 ## [Training Process](#contents)
 
 ### Usage
 
 #### Running on Ascend
 
 Based on the original DeepLabV3 paper, we reproduce two training experiments on the vocaug (also as trainaug) dataset and evaluate on the voc val dataset.
 
 For single device training, please configure the parameters; the training script is as follows:
+
-```
+```shell
 # run_standalone_train.sh
 python ${train_code_path}/train.py --data_file=/PATH/TO/MINDRECORD_NAME \
                     --train_dir=${train_path}/ckpt \
@@ -205,11 +233,12 @@ python ${train_code_path}/train.py --data_file=/PATH/TO/MINDRECORD_NAME \
                     --save_steps=1500 \
                     --keep_checkpoint_max=200 >log 2>&1 &
 ```
 
 For 8 devices training, the training steps are as follows:
 
 1. Train s16 with the vocaug dataset, finetuning from the resnet101 pretrained model; the script is as follows:
+
-```
+```shell
 # run_distribute_train_s16_r1.sh
 for((i=0;i<=$RANK_SIZE-1;i++));
 do
@@ -236,8 +265,10 @@ do
     --keep_checkpoint_max=200 >log 2>&1 &
 done
 ```
 
 2. Train s8 with the vocaug dataset, finetuning from the model in the previous step; the training script is as follows:
+
-```
+```shell
 # run_distribute_train_s8_r1.sh
 for((i=0;i<=$RANK_SIZE-1;i++));
 do
@@ -265,8 +296,10 @@ do
     --keep_checkpoint_max=200 >log 2>&1 &
 done
 ```
 
 3. Train s8 with the voctrain dataset, finetuning from the model in the previous step; the training script is as follows:
+
-```
+```shell
 # run_distribute_train_s8_r2.sh
 for((i=0;i<=$RANK_SIZE-1;i++));
 do
@@ -294,73 +327,83 @@ do
     --keep_checkpoint_max=200 >log 2>&1 &
 done
 ```
 
 ### Result
 
 - Training vocaug in s16 structure
+
-```
+```shell
 # distribute training result(8p)
 epoch: 1 step: 41, loss is 0.8319108
-Epoch time: 213856.477, per step time: 5216.012
+epoch time: 213856.477 ms, per step time: 5216.012 ms
 epoch: 2 step: 41, loss is 0.46052963
-Epoch time: 21233.183, per step time: 517.883
+epoch time: 21233.183 ms, per step time: 517.883 ms
 epoch: 3 step: 41, loss is 0.45012417
-Epoch time: 21231.951, per step time: 517.852
+epoch time: 21231.951 ms, per step time: 517.852 ms
 epoch: 4 step: 41, loss is 0.30687785
-Epoch time: 21199.911, per step time: 517.071
+epoch time: 21199.911 ms, per step time: 517.071 ms
 epoch: 5 step: 41, loss is 0.22769661
-Epoch time: 21240.281, per step time: 518.056
+epoch time: 21240.281 ms, per step time: 518.056 ms
 epoch: 6 step: 41, loss is 0.25470978
 ...
 ```
 
 - Training vocaug in s8 structure
+
-```
+```shell
 # distribute training result(8p)
 epoch: 1 step: 82, loss is 0.024167
-Epoch time: 322663.456, per step time: 3934.920
+epoch time: 322663.456 ms, per step time: 3934.920 ms
 epoch: 2 step: 82, loss is 0.019832281
-Epoch time: 43107.238, per step time: 525.698
+epoch time: 43107.238 ms, per step time: 525.698 ms
 epoch: 3 step: 82, loss is 0.021008959
-Epoch time: 43109.519, per step time: 525.726
+epoch time: 43109.519 ms, per step time: 525.726 ms
 epoch: 4 step: 82, loss is 0.01912349
-Epoch time: 43177.287, per step time: 526.552
+epoch time: 43177.287 ms, per step time: 526.552 ms
 epoch: 5 step: 82, loss is 0.022886964
-Epoch time: 43095.915, per step time: 525.560
+epoch time: 43095.915 ms, per step time: 525.560 ms
 epoch: 6 step: 82, loss is 0.018708453
-Epoch time: 43107.458, per step time: 525.701
+epoch time: 43107.458 ms, per step time: 525.701 ms
 ...
 ```
 
 - Training voctrain in s8 structure
+
-```
+```shell
 # distribute training result(8p)
 epoch: 1 step: 11, loss is 0.00554624
-Epoch time: 199412.913, per step time: 18128.447
+epoch time: 199412.913 ms, per step time: 18128.447 ms
 epoch: 2 step: 11, loss is 0.007181881
-Epoch time: 6119.375, per step time: 556.307
+epoch time: 6119.375 ms, per step time: 556.307 ms
 epoch: 3 step: 11, loss is 0.004980865
-Epoch time: 5996.978, per step time: 545.180
+epoch time: 5996.978 ms, per step time: 545.180 ms
 epoch: 4 step: 11, loss is 0.0047651967
-Epoch time: 5987.412, per step time: 544.310
+epoch time: 5987.412 ms, per step time: 544.310 ms
 epoch: 5 step: 11, loss is 0.006262637
-Epoch time: 5956.682, per step time: 541.517
+epoch time: 5956.682 ms, per step time: 541.517 ms
 epoch: 6 step: 11, loss is 0.0060750707
-Epoch time: 5962.164, per step time: 542.015
+epoch time: 5962.164 ms, per step time: 542.015 ms
 ...
 ```
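Since the renamed log lines above have a fixed shape, the steady-state step time can be checked with a short script (a sketch; `log` is the file the training scripts above redirect into):

```python
import re

# Matches the new "per step time: Y ms" fields printed by TimeMonitor.
pattern = re.compile(r"per step time: ([0-9.]+) ms")

with open("log") as f:  # written via ">log 2>&1 &" in the scripts above
    times = [float(m.group(1)) for m in map(pattern.search, f) if m]

# Drop epoch 1: it includes graph compilation, so it is much slower.
steady = times[1:]
if steady:
    print("mean per-step time: {:.3f} ms over {} epochs".format(
        sum(steady) / len(steady), len(steady)))
```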
 
 ## [Evaluation Process](#contents)
 
 ### Usage
 
 #### Running on Ascend
 
 Configure the checkpoint with --ckpt_path and the dataset path, then run the script; mIOU will be printed in eval_path/eval_log.
+
-```
+```shell
 ./run_eval_s16.sh                     # test s16
 ./run_eval_s8.sh                      # test s8
 ./run_eval_s8_multiscale.sh           # test s8 + multiscale
 ./run_eval_s8_multiscale_flip.sh      # test s8 + multiscale + flip
 ```
 
 Example of the test script is as follows:
+
-```
+```shell
 python ${train_code_path}/eval.py --data_root=/PATH/TO/DATA \
                     --data_lst=/PATH/TO/DATA_lst.txt \
                     --batch_size=16 \
@@ -383,6 +426,7 @@ python ${train_code_path}/eval.py --data_root=/PATH/TO/DATA \
 Our results were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.
 
 #### Training accuracy
 
 | **Network** | OS=16 | OS=8 | MS | Flip | mIOU | mIOU in paper |
 | :----------: | :-----: | :----: | :----: | :-----: | :-----: | :-------------: |
 | deeplab_v3 | √ |      |      |       | 77.37 | 77.21 |
@@ -393,29 +437,31 @@ Our results were obtained by running the applicable training script. To achieve t
 Note: Here OS is output stride, and MS is multiscale.
 
 # [Model Description](#contents)
+
 ## [Performance](#contents)
 
 ### Evaluation Performance
+
 | Parameters | Ascend 910 |
 | -------------------------- | -------------------------------------- |
 | Model Version | DeepLabV3 |
 | Resource | Ascend 910 |
 | Uploaded Date | 09/04/2020 (month/day/year) |
 | MindSpore Version | 0.7.0-alpha |
 | Dataset | PASCAL VOC2012 + SBD |
 | Training Parameters | epoch = 300, batch_size = 32 (s16_r1) <br> epoch = 800, batch_size = 16 (s8_r1) <br> epoch = 300, batch_size = 16 (s8_r2) |
 | Optimizer | Momentum |
 | Loss Function | Softmax Cross Entropy |
 | Outputs | probability |
 | Loss | 0.0065883575 |
 | Speed | 60 ms/step (1pc, s16)<br> 480 ms/step (8pcs, s16) <br> 244 ms/step (8pcs, s8) |
 | Total time | 8pcs: 706 mins |
 | Parameters (M) | 58.2 |
 | Checkpoint for Fine tuning | 443M (.ckpt file) |
 | Model for inference | 223M (.air file) |
 | Scripts | [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/deeplabv3) |
 
-### Inference Performance
+## Inference Performance
 
 | Parameters | Ascend |
 | ------------------- | --------------------------- |
@@ -429,10 +475,10 @@ Note: Here OS is output stride, and MS is multiscale.
 | Accuracy | 8pcs: <br> s16: 77.37 <br> s8: 78.84% <br> s8_multiscale: 79.70% <br> s8_Flip: 79.89% |
 | Model for inference | 443M (.ckpt file) |
 
 # [Description of Random Situation](#contents)
 
 In dataset.py, we set the seed inside the "create_dataset" function. We also use random seed in train.py.
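The seeding described above typically combines a global seed with the dataset seed; a hedged sketch (the exact helpers depend on the MindSpore version; `set_seed` is assumed to live under `mindspore.common`):

```python
import mindspore.dataset as de
from mindspore.common import set_seed

set_seed(1)            # global seed used by weight initializers in train.py
de.config.set_seed(1)  # dataset seed driving shuffle order in create_dataset
```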
 
 # [ModelZoo Homepage](#contents)
 
 Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -30,34 +30,33 @@ The overall network architecture of InceptionV3 is shown below:
 
 [Link](https://arxiv.org/pdf/1512.00567.pdf)
 
 # [Dataset](#contents)
 
 Dataset used can refer to paper.
 
 - Dataset size: 125G, 1250k colorful images in 1000 classes
     - Train: 120G, 1200k images
     - Test: 5G, 50k images
 - Data format: RGB images.
     - Note: Data will be processed in src/dataset.py
 
 # [Features](#contents)
 
 ## [Mixed Precision(Ascend)](#contents)
 
 The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.
 
 For FP16 operators, if the input data type is FP32, the backend of MindSpore will automatically handle it with reduced precision. Users could check the reduced-precision operators by enabling INFO log and then searching ‘reduce precision’.
 
 # [Environment Requirements](#contents)
 
 - Hardware(Ascend/GPU)
-    - Prepare hardware environment with Ascend or GPU processor. If you want to try Ascend , please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
+    - Prepare hardware environment with Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
 - Framework
     - [MindSpore](https://www.mindspore.cn/install/en)
 - For more information, please check the resources below:
     - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
     - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
 
 # [Script description](#contents)
@@ -65,14 +64,14 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
 
 ```shell
 .
 └─Inception-v3
   ├─README.md
   ├─scripts
     ├─run_standalone_train.sh        # launch standalone training with ascend platform(1p)
     ├─run_standalone_train_gpu.sh    # launch standalone training with gpu platform(1p)
     ├─run_distribute_train.sh        # launch distributed training with ascend platform(8p)
     ├─run_distribute_train_gpu.sh    # launch distributed training with gpu platform(8p)
     ├─run_eval.sh                    # launch evaluating with ascend platform
     └─run_eval_gpu.sh                # launch evaluating with gpu platform
   ├─src
     ├─config.py                      # parameter configuration
@@ -83,12 +82,13 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
     ├─eval.py                        # eval net
     ├─export.py                      # convert checkpoint
     └─train.py                       # train net
 
 ```
 
 ## [Script Parameters](#contents)
 
 ```python
 Major parameters in train.py and config.py are:
 'random_seed'                    # fix random seed
 'rank'                           # local rank of distributed
 'group_size'                     # world size of distributed
@@ -111,8 +111,8 @@ Major parameters in train.py and config.py are:
 'ckpt_path'                      # save checkpoint path
 'is_save_on_master'              # save checkpoint on rank0, distributed parameters
 'dropout_keep_prob'              # the keep rate, between 0 and 1, e.g. keep_prob = 0.9, means dropping out 10% of input units
 'has_bias'                       # specifies whether the layer uses a bias vector
 'amp_level'                      # option for argument `level` in `mindspore.amp.build_train_network`, level for mixed
                                  # precision training. Supports [O0, O2, O3].
 
 ```
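For context, the `amp_level` listed above is usually forwarded to `Model`, which calls `build_train_network` internally; a sketch of that wiring (`net` is a placeholder for the InceptionV3 network built in train.py, and the optimizer settings are illustrative):

```python
import mindspore.nn as nn
from mindspore import Model

# net is assumed to be the InceptionV3 network constructed elsewhere.
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)

# amp_level="O0" keeps FP32; "O2"/"O3" enable progressively more FP16.
model = Model(net, loss_fn=loss, optimizer=opt, amp_level="O3")
```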
@@ -121,33 +121,33 @@ Major parameters in train.py and config.py are:
 
 ### Usage
 
 You can start training using python or shell scripts. The usage of shell scripts is as follows:
 
 - Ascend:
+
-```
+```shell
 # distribute training example(8p)
 sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
 # standalone training
 sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
 ```
 
-> Notes:
-RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html) , and the device_ip can be got as [Link]https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
-For large models like InceptionV3, it's better to export an external environment variable `export HCCL_CONNECT_TIMEOUT=600` to extend hccl connection checking time from the default 120 seconds to 600 seconds. Otherwise, the connection could be timeout since compiling time increases with the growth of model size.
+> Notes: RANK_TABLE_FILE can refer to [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html), and the device_ip can be got as [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools). For large models like InceptionV3, it's better to export an external environment variable `export HCCL_CONNECT_TIMEOUT=600` to extend the hccl connection checking time from the default 120 seconds to 600 seconds. Otherwise, the connection could time out since compiling time increases with the growth of model size.
 >
 > This is a processor core binding operation regarding `device_num` and total processor numbers. If you do not expect to do it, remove the `taskset` operations in `scripts/run_distribute_train.sh`.
 
 - GPU:
+
-```
+```shell
 # distribute training example(8p)
 sh scripts/run_distribute_train_gpu.sh DATA_DIR
 # standalone training
 sh scripts/run_standalone_train_gpu.sh DEVICE_ID DATA_DIR
 ```
 
 ### Launch
 
-```
+```python
 # training example
 python:
     Ascend: python train.py --dataset_path /dataset/train --platform Ascend
@@ -159,23 +159,24 @@ sh scripts/run_standalone_train_gpu.sh DEVICE_ID DATA_DIR
     sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
     # standalone training
     sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
     GPU:
     # distributed training example(8p)
     sh scripts/run_distribute_train_gpu.sh /dataset/train
     # standalone training example
     sh scripts/run_standalone_train_gpu.sh 0 /dataset/train
 ```
 
 ### Result
 
 Training result will be stored in the example path. Checkpoints will be stored at `./checkpoint` by default, and the training log will be redirected to `./log.txt` like the following.
 
-```
+```python
 epoch: 0 step: 1251, loss is 5.7787247
-Epoch time: 360760.985, per step time: 288.378
+epoch time: 360760.985 ms, per step time: 288.378 ms
 epoch: 1 step: 1251, loss is 4.392868
-Epoch time: 160917.911, per step time: 128.631
+epoch time: 160917.911 ms, per step time: 128.631 ms
 ```
 
 ## [Eval process](#contents)
 
 ### Usage
@@ -183,17 +184,20 @@ Epoch time: 160917.911, per step time: 128.631
 You can start evaluation using python or shell scripts. The usage of shell scripts is as follows:
 
 - Ascend:
+
-```
+```shell
 sh scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
 ```
 
 - GPU:
+
-```
+```shell
 sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
 ```
 
 ### Launch
 
-```
+```python
 # eval example
 python:
     Ascend: python eval.py --dataset_path DATA_DIR --checkpoint PATH_CHECKPOINT --platform Ascend
@@ -204,13 +208,13 @@ You can start training using python or shell scripts. The usage of shell scripts
     GPU: sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
 ```
 
 > The checkpoint can be produced during the training process.
 
 ### Result
 
 Evaluation result will be stored in the example path; you can find results like the following in `eval.log`.
 
-```
+```python
 metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}
 ```
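The `metric` line above is what `Model.eval` returns when the metrics are registered under these names; a sketch (`net` and `ds_eval` are placeholders for what eval.py builds from its command-line arguments):

```python
import mindspore.nn as nn
from mindspore import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# net and ds_eval are assumed to be created from --checkpoint/--dataset_path.
load_param_into_net(net, load_checkpoint("PATH_CHECKPOINT"))

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(net, loss_fn=loss,
              metrics={'Loss': nn.Loss(),
                       'Top1-Acc': nn.Top1CategoricalAccuracy(),
                       'Top5-Acc': nn.Top5CategoricalAccuracy()})
print("metric:", model.eval(ds_eval))
```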
@@ -239,12 +243,11 @@ metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}
 | Checkpoint for Fine tuning | 313M | 312M |
 | Scripts | [inceptionv3 script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/inceptionv3) | [inceptionv3 script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/inceptionv3) |
 
 #### Inference Performance
 
 | Parameters | Ascend |
 | ------------------- | --------------------------- |
 | Model Version | InceptionV3 |
 | Resource | Ascend 910, cpu:2.60GHz 192cores, memory:755G |
 | Uploaded Date | 08/22/2020 |
 | MindSpore Version | 0.6.0-beta |
@@ -260,5 +263,5 @@ metric: {'Loss': 1.778, 'Top1-Acc':0.788, 'Top5-Acc':0.942}
 In dataset.py, we set the seed inside the "create_dataset" function. We also use random seed in train.py.
 
 # [ModelZoo Homepage](#contents)
 
 Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).