This commit is contained in:
huchunmei 2021-07-02 18:17:45 +08:00
parent fd262eb3b8
commit c08ec63080
27 changed files with 1568 additions and 387 deletions

@ -105,7 +105,7 @@ step1: prepare pretrained model: train a mobilenet_v2 model by mindspore or use
# The key/cell/module names must be as follows; otherwise you need to modify the "name_map" function:
# --mindspore: the same as in mobilenet_v2_key.ckpt
# --pytorch: the same as in the official PyTorch model (e.g., the official mobilenet_v2-b0353104.pth)
python convert_weight_mobilenetv2.py --ckpt_fn=./mobilenet_v2_key.ckpt --pt_fn=./mobilenet_v2-b0353104.pth --out_ckpt_fn=./mobilenet_v2.ckpt
python convert_weight_centerface.py --ckpt_fn=./mobilenet_v2_key.ckpt --pt_fn=./mobilenet_v2-b0353104.pth --out_ckpt_fn=./mobilenet_v2.ckpt
```
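The comment above refers to a `name_map` function in the conversion scripts, which this diff does not show. As a hedged sketch only, a hypothetical mapping of this kind could rename PyTorch BatchNorm parameters (`weight`, `bias`, `running_mean`, `running_var`) to MindSpore's BatchNorm names (`gamma`, `beta`, `moving_mean`, `moving_variance`) while leaving convolution weights unchanged. Module prefixes usually differ between the two frameworks too, which is why the real function may need editing:

```python
from typing import Optional

# PyTorch BatchNorm parameter suffix -> MindSpore BatchNorm parameter suffix
_BN_SUFFIX_MAP = {
    "weight": "gamma",
    "bias": "beta",
    "running_mean": "moving_mean",
    "running_var": "moving_variance",
}


def name_map(pt_key: str, is_batchnorm: bool) -> Optional[str]:
    """Hypothetical helper: map one PyTorch state_dict key to a MindSpore parameter name.

    Only the parameter suffix is renamed; module prefixes are kept as-is.
    Returns None for keys with no MindSpore counterpart (BatchNorm's step counter).
    """
    prefix, _, suffix = pt_key.rpartition(".")
    if suffix == "num_batches_tracked":
        return None
    if is_batchnorm:
        suffix = _BN_SUFFIX_MAP.get(suffix, suffix)
    return f"{prefix}.{suffix}" if prefix else suffix


print(name_map("features.0.1.running_var", is_batchnorm=True))  # features.0.1.moving_variance
print(name_map("features.0.0.weight", is_batchnorm=False))      # features.0.0.weight
```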
step2: prepare dataset
@ -178,6 +178,109 @@ step6: eval
sh eval_all.sh [ground_truth_path]
```
- Running on [ModelArts](https://support.huaweicloud.com/modelarts/)
```bash
# Train 8p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "distribute=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "lr: 0.004" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "distribute=True" on the website UI.
#          Add "dataset_path=/cache/data" on the website UI.
#          Add "lr: 0.004" on the website UI.
#          (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set the code directory to "/path/centerface" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Train 1p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "lr: 0.004" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "dataset_path='/cache/data'" on the website UI.
#          Add "lr: 0.004" on the website UI.
#          (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set the code directory to "/path/centerface" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Eval 1p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "checkpoint_url='s3://dir_to_your_trained_model/'" in the default_config.yaml file.
#          Set "checkpoint='./centerface/centerface_trained.ckpt'" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "checkpoint_url='s3://dir_to_your_trained_model/'" on the website UI.
#          Add "checkpoint='./centerface/centerface_trained.ckpt'" on the website UI.
#          Add "dataset_path='/cache/data'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) Upload or copy your trained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens while the job runs, takes a long time, and is repeated on every run.)
# (5) Set the code directory to "/path/centerface" on the website UI.
# (6) Set the startup file to "eval.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
```
- Export on ModelArts (if you want to run on ModelArts, see the official [ModelArts documentation](https://support.huaweicloud.com/modelarts/); you can start exporting as follows)
1. Export the model on ModelArts. The steps are as follows:
```bash
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the base_config.yaml file.
#          Set "file_name='centerface'" in the base_config.yaml file.
#          Set "file_format='AIR'" in the base_config.yaml file.
#          Set "checkpoint_url='/The path of checkpoint in S3/'" in the base_config.yaml file.
#          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" in the base_config.yaml file.
#          Set any other parameters you need in the base_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "file_name='centerface'" on the website UI.
#          Add "file_format='AIR'" on the website UI.
#          Add "checkpoint_url='/The path of checkpoint in S3/'" on the website UI.
#          Add "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Upload or copy your trained model to the S3 bucket.
# (3) Set the code directory to "/path/centerface" on the website UI.
# (4) Set the startup file to "export.py" on the website UI.
# (5) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (6) Create your job.
```
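Step (4) a of the ModelArts training and evaluation steps recommends packing the MindRecord dataset into a single zip file before uploading it to S3. A minimal sketch using Python's standard-library `shutil.make_archive`; the helper name and the directory/archive paths are placeholders and are not part of the CenterFace scripts:

```python
import shutil
from pathlib import Path


def zip_mindrecord_dir(src_dir: str, archive_base: str) -> str:
    """Pack every file under src_dir into <archive_base>.zip and return the archive path."""
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"MindRecord directory not found: {src}")
    # root_dir=src stores entries relative to the dataset directory inside the archive
    return shutil.make_archive(archive_base, "zip", root_dir=str(src))


# Example (placeholder paths): zip_mindrecord_dir("./mindrecord", "./centerface_mindrecord")
```

The archive is written next to `archive_base`, not inside `src_dir`, so the zip never includes itself.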
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@ -192,7 +295,7 @@ sh eval_all.sh [ground_truth_path]
├── postprocess.py // 310infer postprocess scripts
├── README.md // descriptions about CenterFace
├── ascend310_infer // application for 310 inference
├─default_config.yaml // Training parameter profile
├─default_config.yaml // Training parameter profile
├── scripts
│ ├──run_infer_310.sh // shell script for infer on ascend310
│ ├──eval.sh // evaluate a single testing result
@ -211,13 +314,13 @@ sh eval_all.sh [ground_truth_path]
│ ├──mobile_v2.py // modified mobilenet_v2 backbone
│ ├──utils.py // auxiliary functions for train, to log and preload
│ ├──var_init.py // weight initialization
│ ├──convert_weight_mobilenetv2.py // convert pretrained backbone to mindspore
│ ├──convert_weight_centerface.py // convert pretrained backbone to mindspore
│ ├──convert_weight.py // CenterFace model convert to mindspore
│ └──model_utils
│     ├──config.py // Processing configuration parameters
│     ├──device_adapter.py // Get cloud ID
│     ├──local_adapter.py // Get local ID
│     └──moxing_adapter.py // Parameter processing
│     ├──config.py // Processing configuration parameters
│     ├──device_adapter.py // Get cloud ID
│     ├──local_adapter.py // Get local ID
│     └──moxing_adapter.py // Parameter processing
└── dependency // third party codes: MIT License
├──extd // training dependency: data augmentation
│ ├──utils
@ -318,7 +421,7 @@ Major parameters eval.py as follows:
step1: train a mobilenet_v2 model with MindSpore, or use the scripts below:
```bash
python torch_to_ms_mobilenetv2.py --ckpt_fn=./mobilenet_v2_key.ckpt --pt_fn=./mobilenet_v2-b0353104.pth --out_ckpt_fn=./mobilenet_v2.ckpt
python torch_to_ms_centerface.py --ckpt_fn=./mobilenet_v2_key.ckpt --pt_fn=./mobilenet_v2-b0353104.pth --out_ckpt_fn=./mobilenet_v2.ckpt
```
step2: prepare user rank_table

@ -70,7 +70,7 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)
pip install mmcv==0.2.14
```
Then change COCO_ROOT and any other settings you need in `config_50.yaml`, `config_101.yaml`, or `config_152.yaml`. The directory structure is as follows:
Then change COCO_ROOT and any other settings you need in `default_config.yaml`, `default_config_101.yaml`, or `default_config_152.yaml`. The directory structure is as follows:
```path
.
@ -90,7 +90,7 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)
train2017/0000001.jpg 0,259,401,459,7 35,28,324,201,2 0,30,59,80,2
```
Each row is one image annotation, split by spaces: the first column is the relative path of the image, and the remaining columns are box and class information in the format [xmin,ymin,xmax,ymax,class]. Each image is read from the path formed by joining `IMAGE_DIR` (the dataset directory) with the relative path listed in the `ANNO_PATH` file (the TXT file path); `IMAGE_DIR` and `ANNO_PATH` are set in `config_50.yaml`, `config_101.yaml`, or `config_152.yaml`.
Each row is one image annotation, split by spaces: the first column is the relative path of the image, and the remaining columns are box and class information in the format [xmin,ymin,xmax,ymax,class]. Each image is read from the path formed by joining `IMAGE_DIR` (the dataset directory) with the relative path listed in the `ANNO_PATH` file (the TXT file path); `IMAGE_DIR` and `ANNO_PATH` are set in `default_config.yaml`, `default_config_101.yaml`, or `default_config_152.yaml`.
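As an illustration of the annotation format described above, a minimal parser for one TXT row might look like the following. The function name and return shape are hypothetical; the repository's `src/dataset.py` may read the file differently.

```python
import os
from typing import List, Tuple

Box = Tuple[int, int, int, int, int]  # (xmin, ymin, xmax, ymax, class)


def parse_annotation_line(line: str, image_dir: str) -> Tuple[str, List[Box]]:
    """Split one space-separated annotation row into an image path and its boxes."""
    fields = line.split()
    if not fields:
        raise ValueError("empty annotation line")
    # First column: image path relative to IMAGE_DIR
    image_path = os.path.join(image_dir, fields[0])
    boxes: List[Box] = []
    for field in fields[1:]:
        values = [int(v) for v in field.split(",")]
        if len(values) != 5:
            raise ValueError(f"expected xmin,ymin,xmax,ymax,class but got {field!r}")
        boxes.append(tuple(values))
    return image_path, boxes


path, boxes = parse_annotation_line(
    "train2017/0000001.jpg 0,259,401,459,7 35,28,324,201,2 0,30,59,80,2", "/data/coco"
)
print(boxes)  # [(0, 259, 401, 459, 7), (35, 28, 324, 201, 2), (0, 30, 59, 80, 2)]
```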
# Quick Start
@ -110,13 +110,13 @@ Note:
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# eval
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# inference
sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
@ -130,13 +130,13 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH]
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# eval
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
@ -160,17 +160,17 @@ bash scripts/docker_start.sh fasterrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
```shell
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
4. Eval
```shell
# eval
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
5. Inference
@ -180,6 +180,109 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
```
- Running on [ModelArts](https://support.huaweicloud.com/modelarts/)
```bash
# Train 8p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "distribute=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "epoch_size: 20" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "distribute=True" on the website UI.
#          Add "dataset_path=/cache/data" on the website UI.
#          Add "epoch_size: 20" on the website UI.
#          (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set the code directory to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Train 1p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "epoch_size: 20" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "dataset_path='/cache/data'" on the website UI.
#          Add "epoch_size: 20" on the website UI.
#          (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set the code directory to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Eval 1p with Ascend
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "checkpoint_url='s3://dir_to_your_trained_model/'" in the default_config.yaml file.
#          Set "checkpoint='./faster_rcnn/faster_rcnn_trained.ckpt'" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "checkpoint_url='s3://dir_to_your_trained_model/'" on the website UI.
#          Add "checkpoint='./faster_rcnn/faster_rcnn_trained.ckpt'" on the website UI.
#          Add "dataset_path='/cache/data'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) Upload or copy your trained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, zip the MindRecord dataset into one zip file.
#          Second, upload the zip file to the S3 bucket. (You can also upload the original MindRecord dataset, but that can be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens while the job runs, takes a long time, and is repeated on every run.)
# (5) Set the code directory to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "eval.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
```
- Export on ModelArts (if you want to run on ModelArts, see the official [ModelArts documentation](https://support.huaweicloud.com/modelarts/); you can start exporting as follows)
1. Export the model on ModelArts. The steps are as follows:
```bash
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the base_config.yaml file.
#          Set "file_name='faster_rcnn'" in the base_config.yaml file.
#          Set "file_format='AIR'" in the base_config.yaml file.
#          Set "checkpoint_url='/The path of checkpoint in S3/'" in the base_config.yaml file.
#          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" in the base_config.yaml file.
#          Set any other parameters you need in the base_config.yaml file.
#       b. Add "enable_modelarts=True" on the website UI.
#          Add "file_name='faster_rcnn'" on the website UI.
#          Add "file_format='AIR'" on the website UI.
#          Add "checkpoint_url='/The path of checkpoint in S3/'" on the website UI.
#          Add "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the website UI.
#          Add any other parameters you need on the website UI.
# (2) Upload or copy your trained model to the S3 bucket.
# (3) Set the code directory to "/path/faster_rcnn" on the website UI.
# (4) Set the startup file to "export.py" on the website UI.
# (5) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (6) Create your job.
```
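The ModelArts steps above set the same keys in two places: the YAML file (option a) or the website UI (option b). A minimal, hypothetical sketch of how `key=value` overrides could be merged over YAML-style defaults, casting each value to the type of its default. It is not the repository's `model_utils/config.py`; it only handles the `key=value` form (not the `key: value` form written for `epoch_size`), and the default values below are placeholders:

```python
import ast
from typing import Any, Dict, Iterable


def apply_overrides(defaults: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of defaults with each "key=value" (or "--key=value") override applied."""
    merged = dict(defaults)
    for item in overrides:
        key, sep, raw = item.partition("=")
        key = key.strip().lstrip("-")
        if not sep or key not in merged:
            raise KeyError(f"unknown or malformed override: {item!r}")
        default = merged[key]
        raw = raw.strip()
        if isinstance(default, bool):
            # Check bool before int: bool is a subclass of int
            value: Any = raw.lower() in ("true", "1", "yes")
        elif isinstance(default, (int, float)):
            value = type(default)(raw)
        elif isinstance(default, str):
            # Accept both dataset_path=/cache/data and dataset_path='/cache/data'
            value = raw.strip("'\"")
        else:
            value = ast.literal_eval(raw)
        merged[key] = value
    return merged


defaults = {"enable_modelarts": False, "distribute": False,
            "dataset_path": "/cache/data", "epoch_size": 12}  # placeholder defaults
print(apply_overrides(defaults, ["enable_modelarts=True", "--epoch_size=20"]))
# {'enable_modelarts': True, 'distribute': False, 'dataset_path': '/cache/data', 'epoch_size': 20}
```

Casting by the default's type keeps `"True"` from silently becoming a non-empty (truthy) string and rejects misspelled keys instead of ignoring them.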
# Script Description
## Script and Sample Code
@ -187,43 +290,59 @@ sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
```shell
.
└─faster_rcnn
├─README.md // descriptions about fasterrcnn
├─ascend310_infer //application for 310 inference
├─README.md // descriptions about fasterrcnn
├─ascend310_infer //application for 310 inference
├─scripts
├─run_standalone_train_ascend.sh // shell script for standalone on ascend
├─run_standalone_train_gpu.sh // shell script for standalone on GPU
├─run_distribute_train_ascend.sh // shell script for distributed on ascend
├─run_distribute_train_gpu.sh // shell script for distributed on GPU
├─run_infer_310.sh // shell script for 310 inference
├─run_eval_ascend.sh // shell script for eval on ascend
└─run_eval_gpu.sh // shell script for eval on GPU
├─run_standalone_train_ascend.sh // shell script for standalone on ascend
├─run_standalone_train_gpu.sh // shell script for standalone on GPU
├─run_distribute_train_ascend.sh // shell script for distributed on ascend
├─run_distribute_train_gpu.sh // shell script for distributed on GPU
├─run_infer_310.sh // shell script for 310 inference
├─run_eval_ascend.sh // shell script for eval on ascend
└─run_eval_gpu.sh // shell script for eval on GPU
├─src
├─FasterRcnn
├─__init__.py // init file
├─anchor_generator.py // anchor generator
├─bbox_assign_sample.py // first stage sampler
├─bbox_assign_sample_stage2.py // second stage sampler
├─faster_rcnn_resnet.py // fasterrcnn network
├─faster_rcnn_resnet50v1.py //fasterrcnn network for ResNet50v1.0
├─fpn_neck.py //feature pyramid network
├─proposal_generator.py // proposal generator
├─rcnn.py // rcnn network
├─resnet.py // backbone network
├─resnet50v1.py // backbone network for ResNet50v1.0
├─roi_align.py // roi align network
└─rpn.py // region proposal network
├─config.py // config for yaml parsing
├─config_50.yaml // config for ResNet50
├─config_101.yaml // config for ResNet101
├─config_152.yaml // config for ResNet152
├─dataset.py // create dataset and process dataset
├─lr_schedule.py // learning rate generator
├─network_define.py // network define for fasterrcnn
└─util.py // routine operation
├─export.py // script to export AIR,MINDIR,ONNX model
├─eval.py //eval scripts
├─postprogress.py // post process for 310 inference
└─train.py // train scripts
├─__init__.py // init file
├─anchor_generator.py // anchor generator
├─bbox_assign_sample.py // first stage sampler
├─bbox_assign_sample_stage2.py // second stage sampler
├─faster_rcnn_resnet.py // fasterrcnn network
├─faster_rcnn_resnet50v1.py //fasterrcnn network for ResNet50v1.0
├─fpn_neck.py //feature pyramid network
├─proposal_generator.py // proposal generator
├─rcnn.py // rcnn network
├─resnet.py // backbone network
├─resnet50v1.py // backbone network for ResNet50v1.0
├─roi_align.py // roi align network
└─rpn.py // region proposal network
├─dataset.py // create dataset and process dataset
├─lr_schedule.py // learning rate generator
├─network_define.py // network define for fasterrcnn
├─util.py // routine operation
└─model_utils
├─config.py // Processing configuration parameters
├─device_adapter.py // Get cloud ID
├─local_adapter.py // Get local ID
└─moxing_adapter.py // Parameter processing
├─default_config.yaml // config for ResNet50
├─default_config_101.yaml // config for ResNet101
├─default_config_152.yaml // config for ResNet152
├─export.py // script to export AIR,MINDIR,ONNX model
├─eval.py //eval scripts
├─postprogress.py // post process for 310 inference
└─train.py // train scripts
```
```python
# Select the Faster R-CNN implementation and its config file from the backbone name
if backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
    from src.FasterRcnn.faster_rcnn_resnet import Faster_Rcnn_Resnet
    config_path = {
        "resnet_v1.5_50": "./default_config.yaml",
        "resnet_v1_101": "./default_config_101.yaml",
        "resnet_v1_152": "./default_config_152.yaml",
    }[backbone]
elif backbone == "resnet_v1_50":
    from src.FasterRcnn.faster_rcnn_resnet50v1 import Faster_Rcnn_Resnet
    config_path = "./default_config.yaml"
```
## Training Process
@ -234,20 +353,20 @@ sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
```shell
# standalone training on ascend
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training on ascend
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
#### on GPU
```shell
# standalone training on gpu
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training on gpu
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
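In the updated usage lines, `[MINDRECORD_DIR](option)` marks a trailing optional argument. A small, hypothetical Python helper that assembles the positional arguments in the order shown above and appends `MINDRECORD_DIR` only when it is given (the helper and the paths are illustrative; the scripts themselves are unchanged):

```python
import shlex
from typing import List, Optional


def build_standalone_train_cmd(script: str, pretrained_model: str, backbone: str,
                               coco_root: str, mindrecord_dir: Optional[str] = None) -> List[str]:
    """Assemble: sh <script> [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)."""
    cmd = ["sh", script, pretrained_model, backbone, coco_root]
    if mindrecord_dir:  # optional trailing argument
        cmd.append(mindrecord_dir)
    return cmd


cmd = build_standalone_train_cmd("run_standalone_train_gpu.sh", "/ckpt/resnet.ckpt",
                                 "resnet_v1_50", "/data/coco2017")
print(shlex.join(cmd))
# sh run_standalone_train_gpu.sh /ckpt/resnet.ckpt resnet_v1_50 /data/coco2017
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.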
Notes:
@ -279,7 +398,7 @@ Notes:
load_param_into_net(net, param_dict)
```
3. The original dataset path must be set in config_50.yaml, config_101.yaml, or config_152.yaml; you can set either "coco_root" or "image_dir".
3. The original dataset path must be set in default_config.yaml, default_config_101.yaml, or default_config_152.yaml; you can set either "coco_root" or "image_dir".
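Note 3 says the original dataset can be located through either `coco_root` or `image_dir`. The selection logic is not part of this diff; the following is a hypothetical sketch of such a decision, assuming (not confirmed by the source) that `coco_root` takes priority when both are set and that `image_dir` is used together with `anno_path`, as described for `IMAGE_DIR` and `ANNO_PATH` earlier:

```python
import os
from typing import Dict, Optional


def resolve_dataset_source(coco_root: Optional[str], image_dir: Optional[str],
                           anno_path: Optional[str]) -> Dict[str, str]:
    """Hypothetical: choose a COCO-layout root or an IMAGE_DIR + ANNO_PATH pair."""
    # Assumption: a valid coco_root wins when both options are configured
    if coco_root and os.path.isdir(coco_root):
        return {"source": "coco", "root": coco_root}
    if image_dir and anno_path and os.path.isdir(image_dir) and os.path.isfile(anno_path):
        return {"source": "txt", "image_dir": image_dir, "anno_path": anno_path}
    raise FileNotFoundError("set a valid coco_root, or a valid image_dir together with anno_path")
```

Failing early with a clear message is usually cheaper than discovering a wrong path after dataset conversion has started.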
### Result
@ -304,14 +423,14 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005
```shell
# eval on ascend
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
#### on GPU
```shell
# eval on GPU
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
> The checkpoint is produced during training.
@ -340,7 +459,7 @@ Eval result will be stored in the example path, whose folder name is "eval". Und
## Model Export
```shell
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT] --backbone [BACKBONE]
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT] --backbone [BACKBONE] --coco_root [COCO_ROOT] --mindrecord_dir [MINDRECORD_DIR](option)
```
`EXPORT_FORMAT` should be in ["AIR", "MINDIR"]

@ -111,13 +111,13 @@ Faster R-CNN is a two-stage object detection network. The network uses an RPN, which can
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# eval
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# inference
sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
@ -131,13 +131,13 @@ sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
python convert_checkpoint.py --ckpt_file=[BACKBONE_MODEL]
# standalone training
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# eval
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
@ -161,17 +161,17 @@ bash scripts/docker_start.sh fasterrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
```shell
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
4. Eval
```shell
# eval
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
5. Inference
@ -181,6 +181,109 @@ sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
```
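- Training on ModelArts (if you want to run on ModelArts, see the [ModelArts](https://support.huaweicloud.com/modelarts/) documentation)
```bash
# Train with 8 devices on ModelArts
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "distribute=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "epoch_size: 20" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters in the default_config.yaml file.
#       b. Set "enable_modelarts=True" on the website UI.
#          Set "distribute=True" on the website UI.
#          Set "dataset_path=/cache/data" on the website UI.
#          Set "epoch_size: 20" on the website UI.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Set any other parameters on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune your model, upload your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, compress the dataset into a single ".zip" file.
#          Second, upload the compressed dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set your code path to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Training dataset", "Training output file path", "Job log path", etc. on the website UI.
# (8) Create the training job.
#
# Train with 1 device on ModelArts
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set "epoch_size: 20" in the default_config.yaml file.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in the default_config.yaml file.
#          Set any other parameters in the default_config.yaml file.
#       b. Set "enable_modelarts=True" on the website UI.
#          Set "dataset_path='/cache/data'" on the website UI.
#          Set "epoch_size: 20" on the website UI.
#          (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
#          Set any other parameters on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune your model, upload your pretrained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, compress the dataset into a single ".zip" file.
#          Second, upload the compressed dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens during training, takes a long time, and is repeated every time you train.)
# (5) Set your code path to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Training dataset", "Training output file path", "Job log path", etc. on the website UI.
# (8) Create the training job.
#
# Evaluate with 1 device on ModelArts
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "checkpoint_url='s3://dir_to_your_trained_model/'" in the default_config.yaml file.
#          Set "checkpoint='./faster_rcnn/faster_rcnn_trained.ckpt'" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set any other parameters in the default_config.yaml file.
#       b. Set "enable_modelarts=True" on the website UI.
#          Set "checkpoint_url='s3://dir_to_your_trained_model/'" on the website UI.
#          Set "checkpoint='./faster_rcnn/faster_rcnn_trained.ckpt'" on the website UI.
#          Set "dataset_path='/cache/data'" on the website UI.
#          Set any other parameters on the website UI.
# (2) Prepare the model code.
# (3) Upload your trained model to the S3 bucket.
# (4) Perform a or b. (Option a is recommended.)
#       a. First, compress the dataset into a single ".zip" file.
#          Second, upload the compressed dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
#       b. Upload the original dataset to the S3 bucket.
#          (Dataset conversion then happens while the job runs, takes a long time, and is repeated on every run.)
# (5) Set your code path to "/path/faster_rcnn" on the website UI.
# (6) Set the startup file to "eval.py" on the website UI.
# (7) Set "Training dataset", "Training output file path", "Job log path", etc. on the website UI.
# (8) Create the job.
```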
- Export on ModelArts (if you want to run on ModelArts, see the [ModelArts](https://support.huaweicloud.com/modelarts/) documentation)
1. Export the model on ModelArts. The steps are as follows:
```bash
# (1) Perform a or b.
#       a. Set "enable_modelarts=True" in the base_config.yaml file.
#          Set "file_name='faster_rcnn'" in the base_config.yaml file.
#          Set "file_format='AIR'" in the base_config.yaml file.
#          Set "checkpoint_url='/The path of checkpoint in S3/'" in the base_config.yaml file.
#          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" in the base_config.yaml file.
#          Set any other parameters in the base_config.yaml file.
#       b. Set "enable_modelarts=True" on the website UI.
#          Set "file_name='faster_rcnn'" on the website UI.
#          Set "file_format='AIR'" on the website UI.
#          Set "checkpoint_url='/The path of checkpoint in S3/'" on the website UI.
#          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the website UI.
#          Set any other parameters on the website UI.
# (2) Upload your trained model to the S3 bucket.
# (3) Set your code path to "/path/faster_rcnn" on the website UI.
# (4) Set the startup file to "export.py" on the website UI.
# (5) Set "Training dataset", "Training output file path", "Job log path", etc. on the website UI.
# (6) Create the job.
```
# Script Description
## Script and Sample Code
@ -188,43 +291,47 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
```shell
.
└─faster_rcnn
├─README.md // descriptions about Faster R-CNN
├─ascend310_infer // source code for Ascend 310 inference
├─README.md // descriptions about Faster R-CNN
├─ascend310_infer // source code for Ascend 310 inference
├─scripts
├─run_standalone_train_ascend.sh // shell script for standalone training on Ascend
├─run_standalone_train_ascend.sh // shell script for standalone training on Ascend
├─run_standalone_train_gpu.sh // shell script for standalone training on GPU
├─run_distribute_train_ascend.sh // shell script for distributed training on Ascend
├─run_distribute_train_ascend.sh // shell script for distributed training on Ascend
├─run_distribute_train_gpu.sh // shell script for distributed training on GPU
├─run_infer_310.sh // shell script for Ascend inference
├─run_eval_ascend.sh // shell script for evaluation on Ascend
└─run_eval_gpu.sh // shell script for evaluation on GPU
├─run_infer_310.sh // shell script for Ascend inference
├─run_eval_ascend.sh // shell script for evaluation on Ascend
└─run_eval_gpu.sh // shell script for evaluation on GPU
├─src
├─FasterRcnn
├─__init__.py // init file
├─anchor_generator.py // anchor generator
├─bbox_assign_sample.py // first-stage sampler
├─bbox_assign_sample_stage2.py // second-stage sampler
├─faster_rcnn_resnet.py // Faster R-CNN network
├─faster_rcnn_resnet50v1.py // Faster R-CNN network with ResNet50v1.0 as the backbone
├─fpn_neck.py // feature pyramid network
├─proposal_generator.py // proposal generator
├─rcnn.py // R-CNN network
├─resnet.py // backbone network
├─resnet50v1.py // ResNet50v1.0 backbone network
├─roi_align.py // ROI align network
└─rpn.py // region proposal network
├─config.py // config class that reads the yaml configuration
├─config_50.yaml // ResNet50 configuration
├─config_101.yaml // ResNet101 configuration
├─config_152.yaml // ResNet152 configuration
├─dataset.py // create and process the dataset
├─lr_schedule.py // learning rate generator
├─network_define.py // Faster R-CNN network definition
└─util.py // routine operations
├─export.py // script to export AIR and MINDIR models
├─eval.py // evaluation script
├─postprogress.py // post-processing script for 310 inference
└─train.py // training script
├─__init__.py // init file
├─anchor_generator.py // anchor generator
├─bbox_assign_sample.py // first-stage sampler
├─bbox_assign_sample_stage2.py // second-stage sampler
├─faster_rcnn_resnet.py // Faster R-CNN network
├─faster_rcnn_resnet50v1.py // Faster R-CNN network with ResNet50v1.0 as the backbone
├─fpn_neck.py // feature pyramid network
├─proposal_generator.py // proposal generator
├─rcnn.py // R-CNN network
├─resnet.py // backbone network
├─resnet50v1.py // ResNet50v1.0 backbone network
├─roi_align.py // ROI align network
└─rpn.py // region proposal network
├─dataset.py // create and process the dataset
├─lr_schedule.py // learning rate generator
├─network_define.py // Faster R-CNN network definition
├─util.py // routine operations
└─model_utils
├─config.py // read the .yaml configuration parameters
├─device_adapter.py // get the cloud ID
├─local_adapter.py // get the local ID
└─moxing_adapter.py // data preparation on the cloud
├─default_config.yaml // ResNet50 configuration
├─default_config_101.yaml // ResNet101 configuration
├─default_config_152.yaml // ResNet152 configuration
├─export.py // script to export AIR and MINDIR models
├─eval.py // evaluation script
├─postprogress.py // post-processing script for 310 inference
└─train.py // training script
```
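The launch scripts pick the yaml file from the `BACKBONE` argument: `resnet_v1_50` and `resnet_v1.5_50` both use `default_config.yaml`, while the 101- and 152-layer backbones have their own files. A minimal Python sketch of that lookup, mirroring the `if`/`elif` chains in the shell scripts; `config_for_backbone` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# Backbone -> yaml file, as selected by the scripts/*.sh launchers.
BACKBONE_CONFIGS = {
    "resnet_v1_50": "default_config.yaml",
    "resnet_v1.5_50": "default_config.yaml",
    "resnet_v1_101": "default_config_101.yaml",
    "resnet_v1_152": "default_config_152.yaml",
}


def config_for_backbone(backbone, repo_root="."):
    """Return the yaml path a launch script would pass as --config_path."""
    try:
        name = BACKBONE_CONFIGS[backbone]
    except KeyError:
        # The shell scripts print "Unrecognized parameter" and exit here.
        raise ValueError(f"Unrecognized parameter: {backbone}") from None
    return Path(repo_root) / name


if __name__ == "__main__":
    for bb in BACKBONE_CONFIGS:
        print(bb, "->", config_for_backbone(bb))
```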
## Training Process
@ -235,20 +342,20 @@ sh run_infer_310.sh [AIR_PATH] [DATA_PATH] [ANN_FILE_PATH] [DEVICE_ID]
```shell
# Standalone training on Ascend
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# Distributed training on Ascend
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
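In the new signatures `[COCO_ROOT]` is required and `[MINDRECORD_DIR]` is optional. Judging from the launch scripts in this commit, `COCO_ROOT` must be an existing directory, and when `MINDRECORD_DIR` is omitted it defaults to `$COCO_ROOT/MindRecord_COCO_TRAIN/`; a user-supplied `MINDRECORD_DIR` must already exist. A sketch of that resolution in Python; `resolve_mindrecord_dir` is a hypothetical helper, not part of the repo.

```python
import os


def resolve_mindrecord_dir(coco_root, mindrecord_dir=None):
    """Mirror the scripts' checks and return the MindRecord directory to use."""
    coco_root = os.path.abspath(coco_root)
    if not os.path.isdir(coco_root):
        raise NotADirectoryError(f"COCO_ROOT={coco_root} is not a dir")
    if mindrecord_dir is None:
        # Default location; the scripts do not require it to exist yet.
        return os.path.join(coco_root, "MindRecord_COCO_TRAIN", "")
    mindrecord_dir = os.path.abspath(mindrecord_dir)
    if not os.path.isdir(mindrecord_dir):
        raise NotADirectoryError(f"mindrecord_dir={mindrecord_dir} is not a dir")
    return mindrecord_dir


if __name__ == "__main__":
    print(resolve_mindrecord_dir("."))
```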
#### Running on GPU
```shell
# Standalone training on GPU
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE]
sh run_standalone_train_gpu.sh [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
# Distributed training on GPU
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE]
sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_MODEL] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
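On GPU, `run_distribute_train_gpu.sh` starts one `train.py` process per device through `mpirun -n DEVICE_NUM` and forwards the resolved paths as flags. The sketch below only assembles that argument list so the flags can be inspected before launching; `gpu_train_command` is a hypothetical helper and the paths in the example are placeholders.

```python
import shlex


def gpu_train_command(device_num, pretrained, backbone, coco_root,
                      mindrecord_dir, config_file):
    """Return the argv list that run_distribute_train_gpu.sh hands to mpirun."""
    return [
        "mpirun", "-n", str(device_num),
        "python", "train.py",
        f"--config_path={config_file}",
        "--run_distribute=True",
        "--device_target=GPU",
        f"--device_num={device_num}",
        f"--pre_trained={pretrained}",
        f"--backbone={backbone}",
        f"--coco_root={coco_root}",
        f"--mindrecord_dir={mindrecord_dir}",
    ]


if __name__ == "__main__":
    # Placeholder paths; absolute paths are recommended by the script itself.
    cmd = gpu_train_command(8, "/path/pretrain.ckpt", "resnet_v1_50", "/path/coco",
                            "/path/coco/MindRecord_COCO_TRAIN/",
                            "/path/faster_rcnn/default_config.yaml")
    print(shlex.join(cmd))
```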
Notes:
@ -305,14 +412,14 @@ epoch: 12 step: 7393, rpn_loss: 0.00691, rcnn_loss: 0.10168, rpn_cls_loss: 0.005
```shell
# Evaluation on Ascend
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
#### Running on GPU
```shell
# Evaluation on GPU
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)
```
> Checkpoints are generated during training.
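The evaluation scripts forward `COCO_ROOT` and `MINDRECORD_DIR` to `eval.py` (by default the same `MindRecord_COCO_TRAIN` directory used for training). `eval.py` looks for `FasterRcnn_eval.mindrecord` in that directory and only converts the COCO data when the file is missing. A sketch of that decision that writes nothing to disk; the helper name is hypothetical, the real script also creates the directory and performs the conversion, and raising when `coco_root` is absent is an assumption of this sketch.

```python
import os

EVAL_PREFIX = "FasterRcnn_eval.mindrecord"  # prefix used by eval.py


def eval_mindrecord_plan(mindrecord_dir, coco_root):
    """Return (mindrecord_file, needs_conversion) without writing anything."""
    mindrecord_file = os.path.join(mindrecord_dir, EVAL_PREFIX)
    if os.path.exists(mindrecord_file):
        return mindrecord_file, False  # reuse the existing MindRecord file
    if not os.path.isdir(coco_root):
        # Assumption: the sketch simply fails when no source data is available.
        raise FileNotFoundError(f"coco_root {coco_root} does not exist")
    return mindrecord_file, True  # COCO must be converted to MindRecord first


if __name__ == "__main__":
    print(eval_mindrecord_plan("./MindRecord_COCO_TRAIN", "."))
```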
@ -341,7 +448,7 @@ sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]
## Model Export
```shell
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT] --backbone [BACKBONE]
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT] --backbone [BACKBONE] --coco_root [COCO_ROOT] --mindrecord_dir [MINDRECORD_DIR](option)
```
`EXPORT_FORMAT` options: ["AIR", "MINDIR"]
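`export.py` traces the network with placeholder inputs: an all-zero image batch of shape `[test_batch_size, 3, img_height, img_width]` and random `img_metas` of shape `[test_batch_size, 4]`, both float32. The NumPy sketch below reproduces only those shapes, without MindSpore; the 768 and 1280 defaults come from `img_height`/`img_width` in the yaml config, while the batch size in the example is an arbitrary assumption and `export_dummy_inputs` is a hypothetical helper.

```python
import numpy as np


def export_dummy_inputs(test_batch_size, img_height=768, img_width=1280):
    """Build placeholder inputs shaped like the ones export.py passes to export()."""
    # All-zero image batch in NCHW layout.
    img = np.zeros([test_batch_size, 3, img_height, img_width], dtype=np.float32)
    # Random per-image metadata, 4 values per image, drawn from [0, 1).
    img_metas = np.random.uniform(0.0, 1.0, size=[test_batch_size, 4]).astype(np.float32)
    return img, img_metas


if __name__ == "__main__":
    img, img_metas = export_dummy_inputs(test_batch_size=1)  # batch size is an assumption
    print(img.shape, img_metas.shape)  # (1, 3, 768, 1280) (1, 4)
```

Only the shapes and dtypes matter for tracing; the values themselves are never used as real data.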

@ -1,18 +1,16 @@
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===========================================================================
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
enable_modelarts: False
data_url: ""
train_url: ""
checkpoint_url: ""
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path"
device_target: Ascend
enable_profiling: False
# ==============================================================================
# config
img_width: 1280
img_height: 768
keep_ratio: True
@ -20,7 +18,7 @@ flip_ratio: 0.5
expand_ratio: 1.0
# anchor
feature_shapes:
feature_shapes:
- [192, 320]
- [96, 160]
- [48, 80]
@ -153,4 +151,65 @@ coco_classes: ['background', 'person', 'bicycle', 'car', 'motorcycle', 'airplane
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
num_classes: 81
num_classes: 81
# train.py FasterRcnn training
run_distribute: False
dataset: "coco"
pre_trained: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
device_id: 0
device_num: 1
rank_id: 0
image_dir: ''
anno_path: ''
backbone: 'resnet_v1_50'
# eval.py FasterRcnn evaluation
ann_file: '/cache/data/annotations/instances_val2017.json'
checkpoint_path: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# export.py fasterrcnn_export
file_name: "faster_rcnn"
file_format: "AIR"
ckpt_file: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# postprocess ("./src/config_50.yaml")
#ann_file: ''
result_path: ''
---
# Config description for each option
enable_modelarts: 'Whether to train on ModelArts, default: False'
data_url: 'Dataset URL on OBS'
train_url: 'Training output URL on OBS'
data_path: 'Local dataset path'
output_path: 'Local training output path'
result_dir: "result files path."
label_dir: "image file path."
device_target: "Device where the code will run, default is Ascend"
file_name: "output file name."
dataset: "Dataset, default is coco."
parameter_server: 'Whether to run parameter server training'
width: 'input width'
height: 'input height'
enable_profiling: 'Whether to enable profiling during training, default: False'
only_create_dataset: 'If set to true, only create MindRecord, default is false.'
run_distribute: 'Whether to run distributed training, default is false.'
do_train: 'Whether to train, default is true.'
do_eval: 'Whether to evaluate, default is false.'
pre_trained: 'Pretrained checkpoint path'
device_id: 'Device id, default is 0.'
device_num: 'Number of devices to use, default is 1.'
rank_id: 'Rank id, default is 0.'
file_format: 'file format'
ann_file: "Ann file, default is val.json."
checkpoint_path: "Checkpoint file path."
ckpt_file: 'fasterrcnn ckpt file.'
result_path: "result file path."
backbone: "backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152"
---
device_target: ['Ascend', 'GPU', 'CPU']
file_format: ["AIR", "ONNX", "MINDIR"]
dataset_name: ["cifar10", "imagenet2012"]

@ -1,17 +1,16 @@
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===========================================================================
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
enable_modelarts: False
data_url: ""
train_url: ""
checkpoint_url: ""
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path"
device_target: Ascend
enable_profiling: False
# ==============================================================================
# config
img_width: 1280
img_height: 768
@ -20,7 +19,7 @@ flip_ratio: 0.5
expand_ratio: 1.0
# anchor
feature_shapes:
feature_shapes:
- [192, 320]
- [96, 160]
- [48, 80]
@ -154,3 +153,64 @@ coco_classes: ['background', 'person', 'bicycle', 'car', 'motorcycle', 'airplane
'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
num_classes: 81
# train.py FasterRcnn training
run_distribute: False
dataset: "coco"
pre_trained: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
device_id: 0
device_num: 1
rank_id: 0
image_dir: ''
anno_path: ''
backbone: 'resnet_v1_50'
# eval.py FasterRcnn evaluation
ann_file: '/cache/data/annotations/instances_val2017.json'
checkpoint_path: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# export.py fasterrcnn_export
file_name: "faster_rcnn"
file_format: "AIR"
ckpt_file: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# postprocess ("./src/config_50.yaml")
#ann_file: ''
result_path: ''
---
# Config description for each option
enable_modelarts: 'Whether to train on ModelArts, default: False'
data_url: 'Dataset URL on OBS'
train_url: 'Training output URL on OBS'
data_path: 'Local dataset path'
output_path: 'Local training output path'
result_dir: "result files path."
label_dir: "image file path."
device_target: "Device where the code will run, default is Ascend"
file_name: "output file name."
dataset: "Dataset, default is coco."
parameter_server: 'Whether to run parameter server training'
width: 'input width'
height: 'input height'
enable_profiling: 'Whether to enable profiling during training, default: False'
only_create_dataset: 'If set to true, only create MindRecord, default is false.'
run_distribute: 'Whether to run distributed training, default is false.'
do_train: 'Whether to train, default is true.'
do_eval: 'Whether to evaluate, default is false.'
pre_trained: 'Pretrained checkpoint path'
device_id: 'Device id, default is 0.'
device_num: 'Number of devices to use, default is 1.'
rank_id: 'Rank id, default is 0.'
file_format: 'file format'
ann_file: "Ann file, default is val.json."
checkpoint_path: "Checkpoint file path."
ckpt_file: 'fasterrcnn ckpt file.'
result_path: "result file path."
backbone: "backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152"
---
device_target: ['Ascend', 'GPU', 'CPU']
file_format: ["AIR", "ONNX", "MINDIR"]
dataset_name: ["cifar10", "imagenet2012"]

@ -1,17 +1,16 @@
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===========================================================================
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
enable_modelarts: False
data_url: ""
train_url: ""
checkpoint_url: ""
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path"
device_target: Ascend
enable_profiling: False
# ==============================================================================
# config
img_width: 1280
img_height: 768
@ -20,7 +19,7 @@ flip_ratio: 0.5
expand_ratio: 1.0
# anchor
feature_shapes:
feature_shapes:
- [192, 320]
- [96, 160]
- [48, 80]
@ -153,4 +152,65 @@ coco_classes: ['background', 'person', 'bicycle', 'car', 'motorcycle', 'airplane
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']
num_classes: 81
num_classes: 81
# train.py FasterRcnn training
run_distribute: False
dataset: "coco"
pre_trained: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
device_id: 0
device_num: 1
rank_id: 0
image_dir: ''
anno_path: ''
backbone: 'resnet_v1_50'
# eval.py FasterRcnn evaluation
ann_file: '/cache/data/annotations/instances_val2017.json'
checkpoint_path: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# export.py fasterrcnn_export
file_name: "faster_rcnn"
file_format: "AIR"
ckpt_file: "/cache/train/fasterrcnn/faster_rcnn-12_7393.ckpt"
# postprocess ("./src/config_50.yaml")
#ann_file: ''
result_path: ''
---
# Config description for each option
enable_modelarts: 'Whether to train on ModelArts, default: False'
data_url: 'Dataset URL on OBS'
train_url: 'Training output URL on OBS'
data_path: 'Local dataset path'
output_path: 'Local training output path'
result_dir: "result files path."
label_dir: "image file path."
device_target: "Device where the code will run, default is Ascend"
file_name: "output file name."
dataset: "Dataset, default is coco."
parameter_server: 'Whether to run parameter server training'
width: 'input width'
height: 'input height'
enable_profiling: 'Whether to enable profiling during training, default: False'
only_create_dataset: 'If set to true, only create MindRecord, default is false.'
run_distribute: 'Whether to run distributed training, default is false.'
do_train: 'Whether to train, default is true.'
do_eval: 'Whether to evaluate, default is false.'
pre_trained: 'Pretrained checkpoint path'
device_id: 'Device id, default is 0.'
device_num: 'Number of devices to use, default is 1.'
rank_id: 'Rank id, default is 0.'
file_format: 'file format'
ann_file: "Ann file, default is val.json."
checkpoint_path: "Checkpoint file path."
ckpt_file: 'fasterrcnn ckpt file.'
result_path: "result file path."
backbone: "backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152"
---
device_target: ['Ascend', 'GPU', 'CPU']
file_format: ["AIR", "ONNX", "MINDIR"]
dataset_name: ["cifar10", "imagenet2012"]

@ -15,7 +15,6 @@
"""Evaluation for FasterRcnn"""
import os
import argparse
import time
import numpy as np
from pycocotools.coco import COCO
@ -26,34 +25,17 @@ from mindspore.common import set_seed, Parameter
from src.dataset import data_to_mindrecord_byte_image, create_fasterrcnn_dataset
from src.util import coco_eval, bbox2result_1image, results2json
import src.config as cfg
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
from src.model_utils.device_adapter import get_device_id
set_seed(1)
context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=get_device_id())
parser = argparse.ArgumentParser(description="FasterRcnn evaluation")
parser.add_argument("--dataset", type=str, default="coco", help="Dataset, default is coco.")
parser.add_argument("--ann_file", type=str, default="val.json", help="Ann file, default is val.json.")
parser.add_argument("--checkpoint_path", type=str, required=True, help="Checkpoint file path.")
parser.add_argument("--device_target", type=str, default="Ascend",
help="device where the code will be implemented, default is Ascend")
parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
parser.add_argument("--backbone", type=str, required=True, \
help="backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=args_opt.device_id)
if args_opt.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
if config.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
from src.FasterRcnn.faster_rcnn_resnet import Faster_Rcnn_Resnet
if args_opt.backbone == "resnet_v1.5_50":
config = cfg.get_config("./src/config_50.yaml")
elif args_opt.backbone == "resnet_v1_101":
config = cfg.get_config("./src/config_101.yaml")
elif args_opt.backbone == "resnet_v1_152":
config = cfg.get_config("./src/config_152.yaml")
elif args_opt.backbone == "resnet_v1_50":
config = cfg.get_config("./src/config_50.yaml")
elif config.backbone == "resnet_v1_50":
from src.FasterRcnn.faster_rcnn_resnet50v1 import Faster_Rcnn_Resnet
def fasterrcnn_eval(dataset_path, ckpt_path, ann_file):
@ -61,7 +43,7 @@ def fasterrcnn_eval(dataset_path, ckpt_path, ann_file):
ds = create_fasterrcnn_dataset(config, dataset_path, batch_size=config.test_batch_size, is_training=False)
net = Faster_Rcnn_Resnet(config)
param_dict = load_checkpoint(ckpt_path)
if args_opt.device_target == "GPU":
if config.device_target == "GPU":
for key, value in param_dict.items():
tensor = value.asnumpy().astype(np.float32)
param_dict[key] = Parameter(tensor, key)
@ -125,7 +107,13 @@ def fasterrcnn_eval(dataset_path, ckpt_path, ann_file):
coco_eval(result_files, eval_types, dataset_coco, single_result=True)
if __name__ == '__main__':
def modelarts_pre_process():
pass
# config.ckpt_path = os.path.join(config.output_path, str(get_rank_id()), config.checkpoint_path)
@moxing_wrapper(pre_process=modelarts_pre_process)
def eval_fasterrcnn():
""" eval_fasterrcnn """
prefix = "FasterRcnn_eval.mindrecord"
mindrecord_dir = config.mindrecord_dir
mindrecord_file = os.path.join(mindrecord_dir, prefix)
@ -134,7 +122,7 @@ if __name__ == '__main__':
if not os.path.exists(mindrecord_file):
if not os.path.isdir(mindrecord_dir):
os.makedirs(mindrecord_dir)
if args_opt.dataset == "coco":
if config.dataset == "coco":
if os.path.isdir(config.coco_root):
print("Create Mindrecord. It may take some time.")
data_to_mindrecord_byte_image(config, "coco", False, prefix, file_num=1)
@ -151,4 +139,7 @@ if __name__ == '__main__':
print("CHECKING MINDRECORD FILES DONE!")
print("Start Eval!")
fasterrcnn_eval(mindrecord_file, args_opt.checkpoint_path, args_opt.ann_file)
fasterrcnn_eval(mindrecord_file, config.checkpoint_path, config.ann_file)
if __name__ == '__main__':
eval_fasterrcnn()

@ -13,45 +13,33 @@
# limitations under the License.
# ============================================================================
"""export checkpoint file into air, onnx, mindir models"""
import argparse
import numpy as np
import mindspore.common.dtype as mstype
from mindspore import Tensor, load_checkpoint, load_param_into_net, export, context
import src.config as cfg
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
from src.model_utils.device_adapter import get_device_id
parser = argparse.ArgumentParser(description='fasterrcnn_export')
parser.add_argument("--device_id", type=int, default=0, help="Device id")
parser.add_argument("--file_name", type=str, default="faster_rcnn", help="output file name.")
parser.add_argument("--file_format", type=str, choices=["AIR", "ONNX", "MINDIR"], default="AIR", help="file format")
parser.add_argument("--device_target", type=str, choices=["Ascend", "GPU", "CPU"], default="Ascend",
help="device target")
parser.add_argument('--ckpt_file', type=str, default='', help='fasterrcnn ckpt file.')
parser.add_argument("--backbone", type=str, required=True, \
help="backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152")
args = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
if args.device_target == "Ascend":
context.set_context(device_id=args.device_id)
context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target)
if config.device_target == "Ascend":
context.set_context(device_id=get_device_id())
if args.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
if config.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
from src.FasterRcnn.faster_rcnn_resnet import FasterRcnn_Infer
if args.backbone == "resnet_v1.5_50":
config = cfg.get_config("./src/config_50.yaml")
elif args.backbone == "resnet_v1_101":
config = cfg.get_config("./src/config_101.yaml")
elif args.backbone == "resnet_v1_152":
config = cfg.get_config("./src/config_152.yaml")
elif args.backbone == "resnet_v1_50":
config = cfg.get_config("./src/config_50.yaml")
elif config.backbone == "resnet_v1_50":
from src.FasterRcnn.faster_rcnn_resnet50v1 import FasterRcnn_Infer
if __name__ == '__main__':
def modelarts_pre_process():
pass
@moxing_wrapper(pre_process=modelarts_pre_process)
def export_fasterrcnn():
""" export_fasterrcnn """
net = FasterRcnn_Infer(config=config)
param_dict = load_checkpoint(args.ckpt_file)
param_dict = load_checkpoint(config.ckpt_file)
param_dict_new = {}
for key, value in param_dict.items():
@ -66,4 +54,7 @@ if __name__ == '__main__':
img = Tensor(np.zeros([config.test_batch_size, 3, config.img_height, config.img_width]), mstype.float32)
img_metas = Tensor(np.random.uniform(0.0, 1.0, size=[config.test_batch_size, 4]), mstype.float32)
export(net, img, img_metas, file_name=args.file_name, file_format=args.file_format)
export(net, img, img_metas, file_name=config.file_name, file_format=config.file_format)
if __name__ == '__main__':
export_fasterrcnn()

@ -13,25 +13,13 @@
# limitations under the License.
# ============================================================================
"""hub config."""
import argparse
import src.config as cfg
parser = argparse.ArgumentParser(description="FasterRcnn")
parser.add_argument("--backbone", type=str, required=True, \
help="backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152")
args_opt = parser.parse_args()
from src.model_utils.config import config
if args_opt.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
if config.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
from src.FasterRcnn.faster_rcnn_resnet import Faster_Rcnn_Resnet
if args_opt.backbone == "resnet_v1.5_50":
config = cfg.get_config("./src/config_50.yaml")
elif args_opt.backbone == "resnet_v1_101":
config = cfg.get_config("./src/config_101.yaml")
elif args_opt.backbone == "resnet_v1_152":
config = cfg.get_config("./src/config_152.yaml")
elif args_opt.backbone == "resnet_v1_50":
config = cfg.get_config("./src/config_50.yaml")
elif config.backbone == "resnet_v1_50":
from src.FasterRcnn.faster_rcnn_resnet50v1 import Faster_Rcnn_Resnet
def create_network(name, *args, **kwargs):

@ -14,22 +14,21 @@
# ============================================================================
"""post process for 310 inference"""
import os
import argparse
import numpy as np
from pycocotools.coco import COCO
from src.util import coco_eval, bbox2result_1image, results2json
import src.config as cfg
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
dst_width = 1280
dst_height = 768
parser = argparse.ArgumentParser(description="FasterRcnn inference")
parser.add_argument("--ann_file", type=str, required=True, help="ann file.")
parser.add_argument("--result_path", type=str, required=True, help="result file path.")
args = parser.parse_args()
config = cfg.get_config("./src/config_50.yaml")
def modelarts_pre_process():
pass
@moxing_wrapper(pre_process=modelarts_pre_process)
def get_eval_result(ann_file, result_path):
""" get evaluation result of faster rcnn"""
max_num = 128
@ -72,4 +71,4 @@ def get_eval_result(ann_file, result_path):
coco_eval(result_files, eval_types, dataset_coco, single_result=False)
if __name__ == '__main__':
get_eval_result(args.ann_file, args.result_path)
get_eval_result(config.ann_file, config.result_path)

@ -14,9 +14,9 @@
# limitations under the License.
# ============================================================================
if [ $# -ne 3 ]
if [ $# -le 3 ]
then
echo "Usage: sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_PATH] [BACKBONE]"
echo "Usage: sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -35,7 +35,11 @@ get_real_path(){
}
PATH1=$(get_real_path $1)
PATH2=$(get_real_path $2)
PATH3=$(get_real_path $4)
echo $PATH1
echo $PATH2
echo $PATH3
if [ ! -f $PATH1 ]
then
@ -43,14 +47,48 @@ then
exit 1
fi
PATH2=$(get_real_path $2)
echo $PATH2
if [ ! -f $PATH2 ]
then
echo "error: PRETRAINED_PATH=$PATH2 is not a file"
exit 1
fi
if [ ! -d $PATH3 ]
then
echo "error: COCO_ROOT=$PATH3 is not a dir"
exit 1
fi
mindrecord_dir=$PATH3/MindRecord_COCO_TRAIN/
if [ $# -eq 5 ]
then
mindrecord_dir=$(get_real_path $5)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ $3 == 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ $3 == 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ $3 == 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ $3 == 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
ulimit -u unlimited
export HCCL_CONNECT_TIMEOUT=600
export DEVICE_NUM=8
@ -64,11 +102,13 @@ do
rm -rf ./train_parallel$i
mkdir ./train_parallel$i
cp ../*.py ./train_parallel$i
cp ../*.yaml ./train_parallel$i
cp *.sh ./train_parallel$i
cp -r ../src ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train.py --device_id=$i --rank_id=$i --run_distribute=True --device_num=$DEVICE_NUM --pre_trained=$PATH2 --backbone=$3 &> log &
python train.py --config_path=$CONFIG_FILE --coco_root=$PATH3 --mindrecord_dir=$mindrecord_dir --device_id=$i \
--rank_id=$i --run_distribute=True --device_num=$DEVICE_NUM --pre_trained=$PATH2 --backbone=$3 &> log &
cd ..
done

@ -16,14 +16,14 @@
echo "=============================================================================================================="
echo "Please run the script as: "
echo "sh run_distribute_train_gpu.sh DEVICE_NUM PRETRAINED_PATH BACKBONE"
echo "for example: sh run_distribute_train_gpu.sh 8 /path/pretrain.ckpt resnet_v1_50"
echo "sh run_distribute_train_gpu.sh DEVICE_NUM PRETRAINED_PATH BACKBONE COCO_ROOT MINDRECORD_DIR(option)"
echo "for example: sh run_distribute_train_gpu.sh 8 /path/pretrain.ckpt resnet_v1_50 cocodataset mindrecord_dir(option)"
echo "It is better to use absolute path."
echo "=============================================================================================================="
if [ $# != 3 ]
if [ $# -le 3 ]
then
echo "Usage: sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_PATH] [BACKBONE]"
echo "Usage: sh run_distribute_train_gpu.sh [DEVICE_NUM] [PRETRAINED_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -33,19 +33,62 @@ then
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
rm -rf run_distribute_train
mkdir run_distribute_train
cp -rf ../src/ ../train.py ./run_distribute_train
cp -rf ../src/ ../train.py ../*.yaml ./run_distribute_train
cd run_distribute_train || exit
export RANK_SIZE=$1
PRETRAINED_PATH=$2
PATH3=$4
mindrecord_dir=$PATH3/MindRecord_COCO_TRAIN/
if [ $# -eq 5 ]
then
mindrecord_dir=$(get_real_path $5)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ $3 == 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ $3 == 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ $3 == 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ $3 == 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
echo "start training on $RANK_SIZE devices"
mpirun -n $RANK_SIZE \
python train.py \
--config_path=$CONFIG_FILE \
--run_distribute=True \
--device_target="GPU" \
--device_num=$RANK_SIZE \
--pre_trained=$PRETRAINED_PATH \
--backbone=$3 > log 2>&1 &
--backbone=$3 \
--coco_root=$PATH3 \
--mindrecord_dir=$mindrecord_dir > log 2>&1 &

@ -14,9 +14,9 @@
# limitations under the License.
# ============================================================================
if [ $# != 3 ]
if [ $# -le 3 ]
then
echo "Usage: sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]"
echo "Usage: sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -35,6 +35,8 @@ get_real_path(){
}
PATH1=$(get_real_path $1)
PATH2=$(get_real_path $2)
PATH3=$(get_real_path $4)
echo $PATH3
echo $PATH1
echo $PATH2
@ -50,6 +52,42 @@ then
exit 1
fi
if [ ! -d $PATH3 ]
then
echo "error: COCO_ROOT=$PATH3 is not a dir"
exit 1
fi
mindrecord_dir=$PATH3/MindRecord_COCO_TRAIN/
if [ $# -eq 5 ]
then
mindrecord_dir=$(get_real_path $5)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ $3 == 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ $3 == 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ $3 == 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ $3 == 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
ulimit -u unlimited
export DEVICE_NUM=1
export RANK_SIZE=$DEVICE_NUM
@ -62,10 +100,12 @@ then
fi
mkdir ./eval
cp ../*.py ./eval
cp ../*.yaml ./eval
cp *.sh ./eval
cp -r ../src ./eval
cd ./eval || exit
env > env.log
echo "start eval for device $DEVICE_ID"
python eval.py --device_id=$DEVICE_ID --ann_file=$PATH1 --checkpoint_path=$PATH2 --backbone=$3 &> log &
python eval.py --config_path=$CONFIG_FILE --device_id=$DEVICE_ID --ann_file=$PATH1 --checkpoint_path=$PATH2 \
--backbone=$3 --coco_root=$PATH3 --mindrecord_dir=$mindrecord_dir &> log &
cd ..

@ -14,9 +14,9 @@
# limitations under the License.
# ============================================================================
if [ $# != 3 ]
if [ $# -le 3 ]
then
echo "Usage: sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE]"
echo "Usage: sh run_eval_gpu.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -35,8 +35,10 @@ get_real_path(){
}
PATH1=$(get_real_path $1)
PATH2=$(get_real_path $2)
PATH3=$(get_real_path $4)
echo $PATH1
echo $PATH2
echo $PATH3
if [ ! -f $PATH1 ]
then
@ -50,6 +52,42 @@ then
exit 1
fi
if [ ! -d $PATH3 ]
then
echo "error: COCO_ROOT=$PATH3 is not a dir"
exit 1
fi
mindrecord_dir=$PATH3/MindRecord_COCO_TRAIN/
if [ $# -eq 5 ]
then
mindrecord_dir=$(get_real_path $5)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ $3 == 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ $3 == 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ $3 == 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ $3 == 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
export DEVICE_NUM=1
export RANK_SIZE=$DEVICE_NUM
export DEVICE_ID=0
@ -61,10 +99,12 @@ then
fi
mkdir ./eval
cp ../*.py ./eval
cp ../*.yaml ./eval
cp *.sh ./eval
cp -r ../src ./eval
cd ./eval || exit
env > env.log
echo "start eval for device $DEVICE_ID"
python eval.py --device_target="GPU" --device_id=$DEVICE_ID --ann_file=$PATH1 --checkpoint_path=$PATH2 --backbone=$3 &> log &
python eval.py --config_path=$CONFIG_FILE --coco_root=$PATH3 --mindrecord_dir=$mindrecord_dir \
--device_target="GPU" --device_id=$DEVICE_ID --ann_file=$PATH1 --checkpoint_path=$PATH2 --backbone=$3 &> log &
cd ..
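
The backbone dispatch above is repeated in each launcher script and maps a backbone name to a YAML config. Below is a minimal Python sketch of the same table. It assumes the config file names used in the scripts, and `resolve_config` and `BACKBONE_CONFIGS` are hypothetical names, not part of the repository:

```python
import os

# Backbone name -> config file, mirroring the if/elif chain in the launcher scripts.
# Both ResNet-50 variants share default_config.yaml.
BACKBONE_CONFIGS = {
    "resnet_v1.5_50": "default_config.yaml",
    "resnet_v1_50": "default_config.yaml",
    "resnet_v1_101": "default_config_101.yaml",
    "resnet_v1_152": "default_config_152.yaml",
}


def resolve_config(backbone, scripts_dir):
    """Return the config path for `backbone`, relative to the scripts/ directory.

    Raises ValueError where the shell scripts print "Unrecognized parameter" and exit 1.
    """
    if backbone not in BACKBONE_CONFIGS:
        raise ValueError("Unrecognized parameter: {}".format(backbone))
    return os.path.join(scripts_dir, "..", BACKBONE_CONFIGS[backbone])
```

For example, `resolve_config("resnet_v1_101", "/work/scripts")` returns `/work/scripts/../default_config_101.yaml` on POSIX systems. A lookup table keeps the four backbone names in one place instead of repeating the chain in every script.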


@ -14,9 +14,9 @@
# limitations under the License.
# ============================================================================
if [ $# -ne 2 ]
if [ $# -le 2 ]
then
echo "Usage: sh run_standalone_train_ascend.sh [PRETRAINED_PATH] [BACKBONE]"
echo "Usage: sh run_standalone_train_ascend.sh [PRETRAINED_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -35,7 +35,9 @@ get_real_path(){
}
PATH1=$(get_real_path $1)
PATH2=$(get_real_path $3)
echo $PATH1
echo $PATH2
if [ ! -f $PATH1 ]
then
@ -43,6 +45,42 @@ then
exit 1
fi
if [ ! -d $PATH2 ]
then
echo "error: COCO_ROOT=$PATH2 is not a dir"
exit 1
fi
mindrecord_dir=$PATH2/MindRecord_COCO_TRAIN/
if [ $# -eq 4 ]
then
mindrecord_dir=$(get_real_path $4)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ "$2" = 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ "$2" = 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ "$2" = 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ "$2" = 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=0
@ -55,10 +93,12 @@ then
fi
mkdir ./train
cp ../*.py ./train
cp ../*.yaml ./train
cp *.sh ./train
cp -r ../src ./train
cd ./train || exit
echo "start training for device $DEVICE_ID"
env > env.log
python train.py --device_id=$DEVICE_ID --pre_trained=$PATH1 --backbone=$2 &> log &
python train.py --config_path=$CONFIG_FILE --coco_root=$PATH2 --mindrecord_dir=$mindrecord_dir --device_id=$DEVICE_ID \
--pre_trained=$PATH1 --backbone=$2 &> log &
cd ..
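
The standalone training scripts now take PRETRAINED_PATH, BACKBONE and COCO_ROOT as required positional arguments. MINDRECORD_DIR is optional and defaults to COCO_ROOT/MindRecord_COCO_TRAIN/. The Python sketch below reproduces that argument contract. It assumes the same positional order and skips the scripts' realpath resolution and directory-existence checks. `parse_train_args` is a hypothetical helper:

```python
import os

USAGE = "Usage: [PRETRAINED_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"


def parse_train_args(argv):
    """Mirror the positional arguments of run_standalone_train_*.sh.

    argv excludes the script name. Unlike the shell scripts, which do not
    check for extra arguments, this sketch rejects more than four.
    """
    if not 3 <= len(argv) <= 4:
        raise SystemExit(USAGE)
    pretrained, backbone, coco_root = argv[:3]
    # Default location, as in: mindrecord_dir=$PATH2/MindRecord_COCO_TRAIN/
    mindrecord_dir = os.path.join(coco_root, "MindRecord_COCO_TRAIN", "")
    if len(argv) == 4:
        mindrecord_dir = argv[3]
    return {
        "pre_trained": pretrained,
        "backbone": backbone,
        "coco_root": coco_root,
        "mindrecord_dir": mindrecord_dir,
    }
```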


@ -14,9 +14,9 @@
# limitations under the License.
# ============================================================================
if [ $# -ne 2 ]
if [ $# -le 2 ]
then
echo "Usage: sh run_standalone_train_gpu.sh [PRETRAINED_PATH] [BACKBONE]"
echo "Usage: sh run_standalone_train_gpu.sh [PRETRAINED_PATH] [BACKBONE] [COCO_ROOT] [MINDRECORD_DIR](option)"
exit 1
fi
@ -35,7 +35,9 @@ get_real_path(){
}
PATH1=$(get_real_path $1)
PATH2=$(get_real_path $3)
echo $PATH1
echo $PATH2
if [ ! -f $PATH1 ]
then
@ -43,6 +45,42 @@ then
exit 1
fi
if [ ! -d $PATH2 ]
then
echo "error: COCO_ROOT=$PATH2 is not a dir"
exit 1
fi
mindrecord_dir=$PATH2/MindRecord_COCO_TRAIN/
if [ $# -eq 4 ]
then
mindrecord_dir=$(get_real_path $4)
if [ ! -d $mindrecord_dir ]
then
echo "error: mindrecord_dir=$mindrecord_dir is not a dir"
exit 1
fi
fi
echo $mindrecord_dir
BASE_PATH=$(cd ./"`dirname $0`" || exit; pwd)
if [ $# -ge 1 ]; then
if [ "$2" = 'resnet_v1.5_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
elif [ "$2" = 'resnet_v1_101' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_101.yaml"
elif [ "$2" = 'resnet_v1_152' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config_152.yaml"
elif [ "$2" = 'resnet_v1_50' ]; then
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
else
echo "Unrecognized parameter"
exit 1
fi
else
CONFIG_FILE="${BASE_PATH}/../default_config.yaml"
fi
ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=0
@ -55,10 +93,12 @@ then
fi
mkdir ./train
cp ../*.py ./train
cp ../*.yaml ./train
cp *.sh ./train
cp -r ../src ./train
cd ./train || exit
echo "start training for device $DEVICE_ID"
env > env.log
python train.py --device_id=$DEVICE_ID --pre_trained=$PATH1 --device_target="GPU" --backbone=$2 &> log &
python train.py --config_path=$CONFIG_FILE --coco_root=$PATH2 --mindrecord_dir=$mindrecord_dir \
--device_id=$DEVICE_ID --pre_trained=$PATH1 --device_target="GPU" --backbone=$2 &> log &
cd ..


@ -1,61 +0,0 @@
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Parse arguments"""
from pprint import pprint, pformat
import yaml
class Config:
"""
Configuration namespace. Convert dictionary to members.
"""
def __init__(self, cfg_dict):
for k, v in cfg_dict.items():
if isinstance(v, (list, tuple)):
setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
else:
setattr(self, k, Config(v) if isinstance(v, dict) else v)
def __str__(self):
return pformat(self.__dict__)
def __repr__(self):
return self.__str__()
def parse_yaml(yaml_path):
"""
Parse the yaml config file.
Args:
yaml_path: Path to the yaml config.
"""
with open(yaml_path, 'r') as fin:
try:
cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
cfgs = [x for x in cfgs]
if len(cfgs) == 1:
cfg = cfgs[0]
except:
raise ValueError("Failed to parse yaml")
return cfg
def get_config(config_path):
"""
Get Config according to the yaml file and cli arguments.
"""
default = parse_yaml(config_path)
pprint(default)
return Config(default)


@ -15,15 +15,13 @@
"""
convert resnet pretrain model to faster_rcnn backbone pretrain model
"""
import argparse
from mindspore.train.serialization import load_checkpoint, save_checkpoint
from mindspore.common.parameter import Parameter
from mindspore.common.tensor import Tensor
import mindspore.common.dtype as mstype
from .model_utils.config import config
parser = argparse.ArgumentParser(description='load_ckpt')
parser.add_argument('--ckpt_file', type=str, default='', help='ckpt file path')
args_opt = parser.parse_args()
def load_weights(model_path, use_fp16_weight):
"""
load resnet pretrain checkpoint file.
@ -60,5 +58,5 @@ def load_weights(model_path, use_fp16_weight):
return param_list
if __name__ == "__main__":
parameter_list = load_weights(args_opt.ckpt_file, use_fp16_weight=False)
parameter_list = load_weights(config.ckpt_file, use_fp16_weight=False)
save_checkpoint(parameter_list, "resnet_backbone.ckpt")


@ -0,0 +1,127 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Parse arguments"""
import os
import ast
import argparse
from pprint import pprint, pformat
import yaml
class Config:
"""
Configuration namespace. Convert dictionary to members.
"""
def __init__(self, cfg_dict):
for k, v in cfg_dict.items():
if isinstance(v, (list, tuple)):
setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
else:
setattr(self, k, Config(v) if isinstance(v, dict) else v)
def __str__(self):
return pformat(self.__dict__)
def __repr__(self):
return self.__str__()
def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="default_config.yaml"):
"""
Parse command line arguments to the configuration according to the default yaml.
Args:
parser: Parent parser.
cfg: Base configuration.
helper: Helper description.
choices: Allowed values for each option.
cfg_path: Path to the default yaml config.
"""
parser = argparse.ArgumentParser(description="[REPLACE THIS at config.py]",
parents=[parser])
helper = {} if helper is None else helper
choices = {} if choices is None else choices
for item in cfg:
if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
help_description = helper[item] if item in helper else "Please refer to {}".format(cfg_path)
choice = choices[item] if item in choices else None
if isinstance(cfg[item], bool):
parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice,
help=help_description)
else:
parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice,
help=help_description)
args = parser.parse_args()
return args
def parse_yaml(yaml_path):
"""
Parse the yaml config file.
Args:
yaml_path: Path to the yaml config.
"""
with open(yaml_path, 'r') as fin:
try:
cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
cfgs = [x for x in cfgs]
if len(cfgs) == 1:
cfg_helper = {}
cfg = cfgs[0]
cfg_choices = {}
elif len(cfgs) == 2:
cfg, cfg_helper = cfgs
cfg_choices = {}
elif len(cfgs) == 3:
cfg, cfg_helper, cfg_choices = cfgs
else:
raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml")
print(cfg_helper)
except yaml.YAMLError as err:
raise ValueError("Failed to parse yaml") from err
return cfg, cfg_helper, cfg_choices
def merge(args, cfg):
"""
Merge the base config from yaml file and command line arguments.
Args:
args: Command line arguments.
cfg: Base configuration.
"""
args_var = vars(args)
for item in args_var:
cfg[item] = args_var[item]
return cfg
def get_config():
"""
Get Config according to the yaml file and cli arguments.
"""
parser = argparse.ArgumentParser(description="default name", add_help=False)
current_dir = os.path.dirname(os.path.abspath(__file__))
parser.add_argument("--config_path", type=str, default=os.path.join(current_dir, "../../default_config.yaml"),
help="Config file path")
path_args, _ = parser.parse_known_args()
default, helper, choices = parse_yaml(path_args.config_path)
pprint(default)
args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path)
final_config = merge(args, default)
return Config(final_config)
config = get_config()
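
`get_config` creates one `--<key>` flag for each scalar YAML entry. It uses the default value's type to convert the flag and `ast.literal_eval` for booleans, then lets command-line values override the YAML. The self-contained sketch below shows that mechanism. A plain dict stands in for the parsed YAML, so PyYAML is not needed. `build_parser`, `merge_args` and the example keys are illustrative:

```python
import argparse
import ast


def build_parser(cfg):
    """Create one --<key> option per scalar entry in cfg, as parse_cli_to_yaml does."""
    parser = argparse.ArgumentParser()
    for key, default in cfg.items():
        if isinstance(default, (list, dict)):
            continue  # nested entries stay YAML-only
        if isinstance(default, bool):
            # type=bool would turn the string "False" into True; literal_eval parses it correctly.
            parser.add_argument("--" + key, type=ast.literal_eval, default=default)
        else:
            parser.add_argument("--" + key, type=type(default), default=default)
    return parser


def merge_args(args, cfg):
    """Return a copy of cfg with the parsed command-line values written over it."""
    merged = dict(cfg)
    merged.update(vars(args))
    return merged


defaults = {"enable_modelarts": False, "lr": 0.004, "backbone": "resnet_v1_50", "anchor_scales": [8]}
args = build_parser(defaults).parse_args(["--enable_modelarts=True", "--lr=0.002"])
config = merge_args(args, defaults)
print(config["enable_modelarts"], config["lr"], config["backbone"])  # True 0.002 resnet_v1_50
```

Because `ast.literal_eval` expects Python literals, boolean flags must be written as `True`/`False`. A lowercase `true` is rejected by argparse as an invalid value. Unlike the repository's `merge`, this sketch copies the dict instead of mutating it.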


@ -0,0 +1,27 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Device adapter for ModelArts"""
from .config import config
if config.enable_modelarts:
from .moxing_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
else:
from .local_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
__all__ = [
"get_device_id", "get_device_num", "get_rank_id", "get_job_id"
]


@ -0,0 +1,36 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Local adapter"""
import os
def get_device_id():
device_id = os.getenv('DEVICE_ID', '0')
return int(device_id)
def get_device_num():
device_num = os.getenv('RANK_SIZE', '1')
return int(device_num)
def get_rank_id():
global_rank_id = os.getenv('RANK_ID', '0')
return int(global_rank_id)
def get_job_id():
return "Local Job"
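
Outside ModelArts, `local_adapter` reads the device layout from environment variables (`DEVICE_ID`, `RANK_SIZE`, `RANK_ID`). When they are unset, it falls back to a single-device setup. The device count comes from `RANK_SIZE`, not `DEVICE_NUM`, which is presumably why `run_eval_gpu.sh` exports `RANK_SIZE=$DEVICE_NUM`. The sketch below performs the same lookups on an environment mapping, so it can be exercised without touching `os.environ`. `device_layout` is a hypothetical name:

```python
import os


def device_layout(env=None):
    """Return (device_id, device_num, rank_id) the way local_adapter computes them."""
    env = os.environ if env is None else env
    device_id = int(env.get("DEVICE_ID", "0"))
    device_num = int(env.get("RANK_SIZE", "1"))
    rank_id = int(env.get("RANK_ID", "0"))
    return device_id, device_num, rank_id


print(device_layout({}))  # (0, 1, 0)
print(device_layout({"DEVICE_ID": "3", "RANK_SIZE": "8", "RANK_ID": "3"}))  # (3, 8, 3)
```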


@ -0,0 +1,122 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Moxing adapter for ModelArts"""
import os
import functools
from mindspore import context
from mindspore.profiler import Profiler
from .config import config
_global_sync_count = 0
def get_device_id():
device_id = os.getenv('DEVICE_ID', '0')
return int(device_id)
def get_device_num():
device_num = os.getenv('RANK_SIZE', '1')
return int(device_num)
def get_rank_id():
global_rank_id = os.getenv('RANK_ID', '0')
return int(global_rank_id)
def get_job_id():
job_id = os.getenv('JOB_ID', '')
job_id = job_id if job_id != "" else "default"
return job_id
def sync_data(from_path, to_path):
"""
Download data from remote OBS to a local directory if the first path is a remote URL and the second is a local path.
Conversely, upload data from a local directory to remote OBS.
"""
import moxing as mox
import time
global _global_sync_count
sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
_global_sync_count += 1
# Each server contains at most 8 devices.
if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
print("from path: ", from_path)
print("to path: ", to_path)
mox.file.copy_parallel(from_path, to_path)
print("===finish data synchronization===")
try:
os.mknod(sync_lock)
except IOError:
pass
print("===save flag===")
while True:
if os.path.exists(sync_lock):
break
time.sleep(1)
print("Finish sync data from {} to {}.".format(from_path, to_path))
def moxing_wrapper(pre_process=None, post_process=None):
"""
Moxing wrapper to download dataset and upload outputs.
"""
def wrapper(run_func):
@functools.wraps(run_func)
def wrapped_func(*args, **kwargs):
# Download data from data_url
if config.enable_modelarts:
if config.data_url:
sync_data(config.data_url, config.data_path)
print("Dataset downloaded: ", os.listdir(config.data_path))
if config.checkpoint_url:
sync_data(config.checkpoint_url, config.load_path)
print("Preload downloaded: ", os.listdir(config.load_path))
if config.train_url:
sync_data(config.train_url, config.output_path)
print("Workspace downloaded: ", os.listdir(config.output_path))
context.set_context(save_graphs_path=os.path.join(config.output_path, str(get_rank_id())))
config.device_num = get_device_num()
config.device_id = get_device_id()
if not os.path.exists(config.output_path):
os.makedirs(config.output_path)
if pre_process:
pre_process()
if config.enable_profiling:
profiler = Profiler()
run_func(*args, **kwargs)
if config.enable_profiling:
profiler.analyse()
# Upload data to train_url
if config.enable_modelarts:
if post_process:
post_process()
if config.train_url:
print("Start to copy output directory")
sync_data(config.output_path, config.train_url)
return wrapped_func
return wrapper
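
In `sync_data`, only the first device on each server (at most 8 devices per server) performs the OBS copy. Every process then polls for a lock file that the copying process creates when it finishes. The self-contained sketch below shows that handshake. A plain callable stands in for `mox.file.copy_parallel`, and `open(...).close()` replaces `os.mknod`. A timeout is added, whereas the original waits indefinitely. `sync_once`, `copy_fn` and `timeout` are hypothetical:

```python
import os
import tempfile
import time


def sync_once(copy_fn, lock_path, device_id, device_num, timeout=60.0):
    """Run copy_fn on the first device of each server, then wait for the lock file.

    Mirrors sync_data: device_id % min(device_num, 8) == 0 selects the copier.
    Returns True if this process did the copy.
    """
    is_copier = device_id % min(device_num, 8) == 0
    if is_copier and not os.path.exists(lock_path):
        copy_fn()
        # Create the flag only after the copy has finished.
        open(lock_path, "a").close()
    deadline = time.monotonic() + timeout
    while not os.path.exists(lock_path):
        if time.monotonic() > deadline:
            raise TimeoutError("no sync lock at {}".format(lock_path))
        time.sleep(0.1)
    return is_copier


copied = []
lock = os.path.join(tempfile.mkdtemp(), "copy_sync.lock0")
print(sync_once(lambda: copied.append("done"), lock, device_id=0, device_num=8))  # True
print(sync_once(lambda: copied.append("again"), lock, device_id=1, device_num=8))  # False
print(copied)  # ['done']
```

The lock name carries a counter (`_global_sync_count` in the original), so each successive sync in one job gets its own flag file.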


@ -17,8 +17,6 @@
import os
import time
import argparse
import ast
import numpy as np
import mindspore.common.dtype as mstype
@ -34,62 +32,45 @@ from mindspore.common import set_seed
from src.network_define import LossCallBack, WithLossCell, TrainOneStepCell, LossNet
from src.dataset import data_to_mindrecord_byte_image, create_fasterrcnn_dataset
from src.lr_schedule import dynamic_lr
import src.config as cfg
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
from src.model_utils.device_adapter import get_device_id, get_device_num, get_rank_id
set_seed(1)
context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=get_device_id())
parser = argparse.ArgumentParser(description="FasterRcnn training")
parser.add_argument("--run_distribute", type=ast.literal_eval, default=False, help="Run distribute, default: false.")
parser.add_argument("--dataset", type=str, default="coco", help="Dataset name, default: coco.")
parser.add_argument("--pre_trained", type=str, default="", help="Pretrained file path.")
parser.add_argument("--device_target", type=str, default="Ascend",
help="device where the code will be implemented, default is Ascend")
parser.add_argument("--device_id", type=int, default=0, help="Device id, default: 0.")
parser.add_argument("--device_num", type=int, default=1, help="Use device nums, default: 1.")
parser.add_argument("--rank_id", type=int, default=0, help="Rank id, default: 0.")
parser.add_argument("--backbone", type=str, required=True, \
help="backbone network name, options:resnet_v1_50, resnet_v1.5_50, resnet_v1_101, resnet_v1_152")
args_opt = parser.parse_args()
context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=args_opt.device_id)
if args_opt.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
if config.backbone in ("resnet_v1.5_50", "resnet_v1_101", "resnet_v1_152"):
from src.FasterRcnn.faster_rcnn_resnet import Faster_Rcnn_Resnet
if args_opt.backbone == "resnet_v1.5_50":
config = cfg.get_config("./src/config_50.yaml")
elif args_opt.backbone == "resnet_v1_101":
config = cfg.get_config("./src/config_101.yaml")
elif args_opt.backbone == "resnet_v1_152":
config = cfg.get_config("./src/config_152.yaml")
elif args_opt.backbone == "resnet_v1_50":
config = cfg.get_config("./src/config_50.yaml")
elif config.backbone == "resnet_v1_50":
from src.FasterRcnn.faster_rcnn_resnet50v1 import Faster_Rcnn_Resnet
if __name__ == '__main__':
if args_opt.device_target == "GPU":
context.set_context(enable_graph_kernel=True)
if args_opt.run_distribute:
if args_opt.device_target == "Ascend":
rank = args_opt.rank_id
device_num = args_opt.device_num
context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
init()
else:
init("nccl")
context.reset_auto_parallel_context()
rank = get_rank()
device_num = get_group_size()
context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
if config.device_target == "GPU":
context.set_context(enable_graph_kernel=True)
if config.run_distribute:
if config.device_target == "Ascend":
rank = get_rank_id()
device_num = get_device_num()
context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
init()
else:
rank = 0
device_num = 1
init("nccl")
context.reset_auto_parallel_context()
rank = get_rank()
device_num = get_group_size()
context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
else:
rank = 0
device_num = 1
def train_fasterrcnn_():
""" train_fasterrcnn_ """
print("Start create dataset!")
# It will generate mindrecord file in args_opt.mindrecord_dir,
# It will generate mindrecord file in config.mindrecord_dir,
# and the file name is FasterRcnn.mindrecord0, 1, ... file_num.
prefix = "FasterRcnn.mindrecord"
mindrecord_dir = config.mindrecord_dir
@ -99,7 +80,7 @@ if __name__ == '__main__':
if rank == 0 and not os.path.exists(mindrecord_file):
if not os.path.isdir(mindrecord_dir):
os.makedirs(mindrecord_dir)
if args_opt.dataset == "coco":
if config.dataset == "coco":
if os.path.isdir(config.coco_root):
if not os.path.exists(config.coco_root):
print("Please make sure config:coco_root is valid.")
@ -125,8 +106,6 @@ if __name__ == '__main__':
print("CHECKING MINDRECORD FILES DONE!")
loss_scale = float(config.loss_scale)
# When creating the MindDataset, use the first mindrecord file, such as FasterRcnn.mindrecord0.
dataset = create_fasterrcnn_dataset(config, mindrecord_file, batch_size=config.batch_size,
device_num=device_num, rank_id=rank,
@ -136,10 +115,20 @@ if __name__ == '__main__':
dataset_size = dataset.get_dataset_size()
print("Create dataset done!")
return dataset_size, dataset
def modelarts_pre_process():
config.save_checkpoint_path = config.output_path
@moxing_wrapper(pre_process=modelarts_pre_process)
def train_fasterrcnn():
""" train_fasterrcnn """
dataset_size, dataset = train_fasterrcnn_()
net = Faster_Rcnn_Resnet(config=config)
net = net.set_train()
load_path = args_opt.pre_trained
load_path = config.pre_trained
if load_path != "":
param_dict = load_checkpoint(load_path)
@ -180,7 +169,7 @@ if __name__ == '__main__':
opt = SGD(params=net.trainable_params(), learning_rate=lr, momentum=config.momentum,
weight_decay=config.weight_decay, loss_scale=config.loss_scale)
net_with_loss = WithLossCell(net, loss)
if args_opt.run_distribute:
if config.run_distribute:
net = TrainOneStepCell(net_with_loss, opt, sens=config.loss_scale, reduce_flag=True,
mean=True, degree=device_num)
else:
@ -198,3 +187,6 @@ if __name__ == '__main__':
model = Model(net)
model.train(config.epoch_size, dataset, callbacks=cb)
if __name__ == '__main__':
train_fasterrcnn()
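
`train_fasterrcnn` is now wrapped by `moxing_wrapper`. On ModelArts, the wrapper downloads the inputs and runs the optional `pre_process` hook, which here redirects `save_checkpoint_path` to `output_path`. It then calls the function, runs the optional `post_process` hook and uploads the outputs. Below is a stripped-down sketch of that decorator shape. The OBS transfers are replaced by log entries and a `SimpleNamespace` stands in for the global `config`. All names are illustrative. Unlike the original, this sketch also returns the wrapped function's result:

```python
import functools
from types import SimpleNamespace

config = SimpleNamespace(enable_modelarts=True, output_path="/cache/train",
                         save_checkpoint_path="./ckpt")
events = []


def moxing_like_wrapper(pre_process=None, post_process=None):
    """Decorator factory: hooks and transfers run only when config.enable_modelarts is set."""
    def wrapper(run_func):
        @functools.wraps(run_func)
        def wrapped_func(*args, **kwargs):
            if config.enable_modelarts:
                events.append("download")  # stands in for sync_data(data_url, data_path)
                if pre_process:
                    pre_process()
            result = run_func(*args, **kwargs)
            if config.enable_modelarts:
                if post_process:
                    post_process()
                events.append("upload")  # stands in for sync_data(output_path, train_url)
            return result
        return wrapped_func
    return wrapper


def modelarts_pre_process():
    config.save_checkpoint_path = config.output_path


@moxing_like_wrapper(pre_process=modelarts_pre_process)
def train():
    """Pretend training step that records where checkpoints would go."""
    events.append("train -> " + config.save_checkpoint_path)


train()
print(events)  # ['download', 'train -> /cache/train', 'upload']
print(train.__name__)  # train
```

`functools.wraps` keeps the decorated function's name and docstring, which is why `train.__name__` is still `train`.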


@ -60,12 +60,115 @@ After dataset preparation, you can start training and evaluation as follows:
sh scripts/run_standalone_train_ascend.sh 0 52 /path/ende-l128-mindrecord
# run distributed training example
sh scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json
sh scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json ./default_config.yaml
# run evaluation example
python eval.py > eval.log 2>&1 &
```
- Running on [ModelArts](https://support.huaweicloud.com/modelarts/)
```bash
# Train 8p with Ascend
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "distribute=True" in default_config.yaml.
# Set "dataset_path='/cache/data'" in default_config.yaml.
# Set "epoch_size: 52" in default_config.yaml.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in default_config.yaml.
# Set any other parameters you need in default_config.yaml.
# b. Add "enable_modelarts=True" on the website UI.
# Add "distribute=True" on the website UI.
# Add "dataset_path=/cache/data" on the website UI.
# Add "epoch_size: 52" on the website UI.
# (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
# Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b (option a is suggested).
# a. First, zip the MindRecord dataset into one zip file.
# Second, upload the zipped dataset to the S3 bucket. (You could also upload the original MindRecord dataset, but that can be very slow.)
# b. Upload the original dataset to the S3 bucket.
# (Dataset conversion occurs during training and takes a lot of time. It happens every time you train.)
# (5) Set the code directory to "/path/transformer" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Train 1p with Ascend
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "dataset_path='/cache/data'" in default_config.yaml.
# Set "epoch_size: 52" in default_config.yaml.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in default_config.yaml.
# Set any other parameters you need in default_config.yaml.
# b. Add "enable_modelarts=True" on the website UI.
# Add "dataset_path='/cache/data'" on the website UI.
# Add "epoch_size: 52" on the website UI.
# (optional) Add "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
# Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune, upload or copy your pretrained model to the S3 bucket.
# (4) Perform a or b (option a is suggested).
# a. First, zip the MindRecord dataset into one zip file.
# Second, upload the zipped dataset to the S3 bucket. (You could also upload the original MindRecord dataset, but that can be very slow.)
# b. Upload the original dataset to the S3 bucket.
# (Dataset conversion occurs during training and takes a lot of time. It happens every time you train.)
# (5) Set the code directory to "/path/transformer" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
#
# Eval 1p with Ascend
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "checkpoint_url='s3://dir_to_your_trained_model/'" in default_config.yaml.
# Set "checkpoint='./transformer/transformer_trained.ckpt'" in default_config.yaml.
# Set "dataset_path='/cache/data'" in default_config.yaml.
# Set any other parameters you need in default_config.yaml.
# b. Add "enable_modelarts=True" on the website UI.
# Add "checkpoint_url='s3://dir_to_your_trained_model/'" on the website UI.
# Add "checkpoint='./transformer/transformer_trained.ckpt'" on the website UI.
# Add "dataset_path='/cache/data'" on the website UI.
# Add any other parameters you need on the website UI.
# (2) Prepare the model code.
# (3) Upload or copy your trained model to the S3 bucket.
# (4) Perform a or b (option a is suggested).
# a. First, zip the MindRecord dataset into one zip file.
# Second, upload the zipped dataset to the S3 bucket. (You could also upload the original MindRecord dataset, but that can be very slow.)
# b. Upload the original dataset to the S3 bucket.
# (Dataset conversion occurs at run time and takes a lot of time. It happens every time you run the job.)
# (5) Set the code directory to "/path/transformer" on the website UI.
# (6) Set the startup file to "eval.py" on the website UI.
# (7) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (8) Create your job.
```
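Step (4)a asks you to pack the MindRecord dataset into a single zip file before uploading it. One way to do that locally with the Python standard library is shown below. It assumes the MindRecord files sit in one directory. The directory and file names in the demo are placeholders, and `zip_dataset` is a hypothetical helper:

```python
import os
import shutil
import tempfile
import zipfile


def zip_dataset(dataset_dir, archive_base):
    """Pack every file under dataset_dir into <archive_base>.zip and return its path.

    Entries are stored relative to dataset_dir, so unzipping recreates its contents.
    """
    return shutil.make_archive(archive_base, "zip", root_dir=dataset_dir)


# Demo with throwaway files standing in for the MindRecord shards.
work = tempfile.mkdtemp()
data_dir = os.path.join(work, "ende-l128-mindrecord")
os.makedirs(data_dir)
for name in ("train.mindrecord", "train.mindrecord.db"):
    with open(os.path.join(data_dir, name), "wb") as f:
        f.write(b"placeholder")
archive = zip_dataset(data_dir, os.path.join(work, "dataset"))
with zipfile.ZipFile(archive) as zf:
    print(sorted(zf.namelist()))
```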
- Export on ModelArts (if you want to run on ModelArts, see the official [modelarts](https://support.huaweicloud.com/modelarts/) documentation). You can export as follows:
1. Export the trained model on ModelArts with the following steps:
```bash
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "file_name='transformer'" in default_config.yaml.
# Set "file_format='AIR'" in default_config.yaml.
# Set "checkpoint_url='/The path of checkpoint in S3/'" in default_config.yaml.
# Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" in default_config.yaml.
# Set any other parameters you need in default_config.yaml.
# b. Add "enable_modelarts=True" on the website UI.
# Add "file_name='transformer'" on the website UI.
# Add "file_format='AIR'" on the website UI.
# Add "checkpoint_url='/The path of checkpoint in S3/'" on the website UI.
# Add "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the website UI.
# Add any other parameters you need on the website UI.
# (2) Upload or copy your trained model to the S3 bucket.
# (3) Set the code directory to "/path/transformer" on the website UI.
# (4) Set the startup file to "export.py" on the website UI.
# (5) Set "Dataset path", "Output file path", and "Job log path" to your paths on the website UI.
# (6) Create your job.
```
## [Script Description](#contents)
### [Script and Sample Code](#contents)
@ -85,7 +188,6 @@ python eval.py > eval.log 2>&1 &
├─src
├─__init__.py
├─beam_search.py
├─config.py
├─dataset.py
├─eval_config.py
├─lr_schedule.py
@ -93,7 +195,15 @@ python eval.py > eval.log 2>&1 &
├─tokenization.py
├─transformer_for_train.py
├─transformer_model.py
└─weight_init.py
├─weight_init.py
└─model_utils
├─config.py
├─device_adapter.py
├─local_adapter.py
└─moxing_adapter.py
├─default_config.yaml
├─default_config_large.yaml
├─default_config_large_gpu.yaml
├─create_data.py
├─eval.py
├─export.py
@ -214,7 +324,7 @@ Parameters for learning rate:
- Run `run_distribute_train_ascend.sh` for distributed training of Transformer model.
``` bash
sh scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE
sh scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH
```
**Attention**: data sink mode cannot be used in Transformer because the input data have different sequence lengths.


@ -66,12 +66,115 @@ Transformer consists of six encoder modules and six decoder modules. Each encoder mod
sh scripts/run_standalone_train_ascend.sh 0 52 /path/ende-l128-mindrecord
# run distributed training example
sh scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json
sh scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json ./default_config.yaml
# run evaluation example
python eval.py > eval.log 2>&1 &
```
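- Train on ModelArts (if you want to run on ModelArts, refer to the [modelarts](https://support.huaweicloud.com/modelarts/) documentation)
```bash
# Train with 8 devices on ModelArts
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "distribute=True" in default_config.yaml.
# Set "dataset_path='/cache/data'" in default_config.yaml.
# Set "epoch_size: 52" in default_config.yaml.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in default_config.yaml.
# Set any other parameters in default_config.yaml.
# b. Set "enable_modelarts=True" on the website UI.
# Set "distribute=True" on the website UI.
# Set "dataset_path=/cache/data" on the website UI.
# Set "epoch_size: 52" on the website UI.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
# Set any other parameters on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune your model, upload your pretrained model to the S3 bucket.
# (4) Perform a or b (option a is recommended).
# a. First, compress the dataset into a single ".zip" file.
# Second, upload the compressed dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
# b. Upload the original dataset to the S3 bucket.
# (Dataset conversion happens during training and takes a long time. It is redone every time you train.)
# (5) Set the code directory to "/path/transformer" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set the "training dataset", "training output file path", "job log path", etc. on the website UI.
# (8) Create the training job.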
#
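# Train with a single device on ModelArts
# (1) Perform a or b.
# a. Set "enable_modelarts=True" in default_config.yaml.
# Set "dataset_path='/cache/data'" in default_config.yaml.
# Set "epoch_size: 52" in default_config.yaml.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" in default_config.yaml.
# Set any other parameters in default_config.yaml.
# b. Set "enable_modelarts=True" on the website UI.
# Set "dataset_path='/cache/data'" on the website UI.
# Set "epoch_size: 52" on the website UI.
# (optional) Set "checkpoint_url='s3://dir_to_your_pretrained/'" on the website UI.
# Set any other parameters on the website UI.
# (2) Prepare the model code.
# (3) If you want to fine-tune your model, upload your pretrained model to the S3 bucket.
# (4) Perform a or b (option a is recommended).
# a. First, compress the dataset into a single ".zip" file.
# Second, upload the compressed dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
# b. Upload the original dataset to the S3 bucket.
# (Dataset conversion happens during training and takes a long time. It is redone every time you train.)
# (5) Set the code directory to "/path/transformer" on the website UI.
# (6) Set the startup file to "train.py" on the website UI.
# (7) Set the "training dataset", "training output file path", "job log path", etc. on the website UI.
# (8) Create the training job.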
#
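# Evaluate 1p on ModelArts
# (1) Perform a or b.
#     a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "checkpoint_url='s3://dir_to_your_trained_model/'" in the default_config.yaml file.
#          Set "checkpoint='./transformer/transformer_trained.ckpt'" in the default_config.yaml file.
#          Set "dataset_path='/cache/data'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#     b. Add "enable_modelarts=True" on the web page.
#          Add "checkpoint_url='s3://dir_to_your_trained_model/'" on the web page.
#          Add "checkpoint='./transformer/transformer_trained.ckpt'" on the web page.
#          Add "dataset_path='/cache/data'" on the web page.
#          Add any other parameters you need on the web page.
# (2) Prepare the model code.
# (3) Upload your trained model to the S3 bucket.
# (4) Perform a or b (option a is recommended).
#     a. First, compress the dataset into a single ".zip" file.
#        Second, upload the zipped dataset to the S3 bucket. (You can also upload the uncompressed dataset, but that may be very slow.)
#     b. Upload the original dataset to the S3 bucket.
#        (The dataset conversion then happens while the job runs, takes a long time, and is repeated on every run.)
# (5) Set the code directory to "/path/transformer" on the web page.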
# (6) Set the start file to "eval.py" on the web page.
# (7) Set "Training dataset", "Training output file path", "Job log path", etc. on the web page.
# (8) Create the training job.
```
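Step (4)a in each of the procedures above asks you to compress the dataset into one ".zip" file before uploading it. A minimal sketch using only Python's standard library (`shutil.make_archive`) follows; the function name `zip_dataset` and any paths you pass to it are hypothetical, and the upload to the S3 bucket itself is not shown.

```python
import shutil
from pathlib import Path


def zip_dataset(dataset_dir: str, archive_base: str) -> str:
    """Pack every file under dataset_dir into <archive_base>.zip.

    Hypothetical helper: entries are stored relative to dataset_dir,
    so unpacking the archive reproduces the dataset's original layout.
    """
    src = Path(dataset_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {src}")
    # make_archive appends the ".zip" suffix itself and returns the
    # path of the archive it created.
    return shutil.make_archive(archive_base, "zip", root_dir=str(src))
```

For example, `zip_dataset("./data/mindrecord", "./transformer_dataset")` would produce `transformer_dataset.zip` (both paths are placeholders for your own directories). Keeping entries relative to the dataset directory means the unpacked data lands directly under the configured `dataset_path` rather than inside an extra parent folder.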
- Export on ModelArts (if you want to run on ModelArts, refer to the [modelarts](https://support.huaweicloud.com/modelarts/) documentation)
1. Export the trained Transformer checkpoint. The steps are as follows:
```bash
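# (1) Perform a or b.
#     a. Set "enable_modelarts=True" in the default_config.yaml file.
#          Set "file_name='transformer'" in the default_config.yaml file.
#          Set "file_format='AIR'" in the default_config.yaml file.
#          Set "checkpoint_url='/The path of checkpoint in S3/'" in the default_config.yaml file.
#          Set "ckpt_file='/cache/checkpoint_path/model.ckpt'" in the default_config.yaml file.
#          Set any other parameters you need in the default_config.yaml file.
#     b. Add "enable_modelarts=True" on the web page.
#          Add "file_name='transformer'" on the web page.
#          Add "file_format='AIR'" on the web page.
#          Add "checkpoint_url='/The path of checkpoint in S3/'" on the web page.
#          Add "ckpt_file='/cache/checkpoint_path/model.ckpt'" on the web page.
#          Add any other parameters you need on the web page.
# (2) Upload your trained model to the S3 bucket.
# (3) Set the code directory to "/path/transformer" on the web page.
# (4) Set the start file to "export.py" on the web page.
# (5) Set "Training dataset", "Training output file path", "Job log path", etc. on the web page.
# (6) Create the training job.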
```
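Each procedure above offers two ways to set the same parameter: edit default_config.yaml (option a) or enter it on the ModelArts web page (option b). The sketch below shows one way such values could be resolved, assuming that web-page parameters reach the start file as `--key=value` command-line arguments and that they take precedence over the YAML defaults. `apply_overrides` and `_coerce` are hypothetical helpers, not the code in `model_utils/config.py`.

```python
from typing import Any, Dict, List


def _coerce(raw: str, default: Any) -> Any:
    """Convert a command-line string to the type of the YAML default."""
    if isinstance(default, bool):  # test bool first: bool is a subclass of int
        lowered = raw.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError(f"expected True/False, got {raw!r}")
        return lowered == "true"
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    # Strings (and None defaults): drop optional quotes, e.g. file_format='AIR'.
    return raw.strip().strip("'\"")


def apply_overrides(defaults: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Return a copy of defaults with every --key=value argument applied.

    Assumption: option b (web page) delivers parameters as --key=value
    arguments, and these override the option a (YAML) values.
    """
    config = dict(defaults)  # leave the loaded YAML defaults untouched
    for arg in argv:
        key, sep, raw = arg.lstrip("-").partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        if key not in config:
            raise KeyError(f"unknown parameter: {key}")
        config[key] = _coerce(raw, config[key])
    return config
```

For instance, with a hypothetical default of `"file_format": "MINDIR"`, passing `--file_format='AIR'` yields `"AIR"`. Coercing to the type of the YAML default keeps `enable_modelarts=True` a boolean rather than the string `"True"`.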
## Script Description
### Script and Sample Code
@@ -91,7 +194,6 @@ python eval.py > eval.log 2>&1 &
├─src
├─__init__.py
├─beam_search.py
├─config.py
├─dataset.py
├─eval_config.py
├─lr_schedule.py
@@ -99,7 +201,15 @@ python eval.py > eval.log 2>&1 &
├─tokenization.py
├─transformer_for_train.py
├─transformer_model.py
└─weight_init.py
├─weight_init.py
└─model_utils
├─config.py
├─device_adapter.py
├─local_adapter.py
└─moxing_adapter.py
├─default_config.yaml
├─default_config_large.yaml
├─default_config_large_gpu.yaml
├─create_data.py
├─eval.py
├─export.py
@@ -221,7 +331,7 @@ Parameters for learning rate:
- Run `run_distribute_train_ascend.sh` for distributed training of the Transformer model.
``` bash
sh scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE
sh scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH
```
**Note**: Because the network input contains sentences of different lengths, data sink mode cannot be used.
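The note above can be illustrated with a small example. Assuming that sentences are grouped into length buckets and each batch is padded only up to the smallest boundary that fits its longest sentence, successive batches end up with different shapes, whereas data sink mode (an assumption here about the MindSpore version this README targets) expects every step to feed an identically shaped input. The bucket boundaries and helper names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def pad_to_bucket(batch: Sequence[Sequence[int]], boundaries: Sequence[int],
                  pad_id: int = 0) -> List[List[int]]:
    """Pad every sentence to the smallest boundary >= the batch's longest sentence."""
    longest = max(len(s) for s in batch)
    fitting = [b for b in sorted(boundaries) if b >= longest]
    if not fitting:
        raise ValueError(f"sentence of length {longest} exceeds every bucket")
    target = fitting[0]
    return [list(s) + [pad_id] * (target - len(s)) for s in batch]


def batch_shape(padded: List[List[int]]) -> Tuple[int, int]:
    """Return (batch_size, sequence_length) of a padded batch."""
    return (len(padded), len(padded[0]))
```

With hypothetical boundaries `[16, 32, 48]`, a batch whose longest sentence has 12 tokens pads to shape `(2, 16)`, while one whose longest sentence has 30 tokens pads to `(2, 32)`. Because the shape changes between steps, training is run without sink mode; in MindSpore this is controlled by the `dataset_sink_mode` argument of `Model.train`.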