forked from mindspore-Ecosystem/mindspore
!22924 add network run demo
Merge pull request !22924 from Maige/code_docs_week2
commit
65842802a6
@ -97,15 +97,18 @@ After installing MindSpore via the official website, you can start training and

```python
# run training example. densenet121 is trained by default; to train densenet100, modify _config_path in /src/model_utils/config.py
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --train_pretrained /PATH/TO/PRETRAINED_CKPT --is_distributed 0 > train.log 2>&1 &
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --is_distributed 0 > train.log 2>&1 &
# example: python train.py --net densenet121 --dataset imagenet --train_data_dir /home/DataSet/ImageNet_Original/train/

# run distributed training example
bash scripts/run_distribute_train.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/PRETRAINED_CKPT
bash scripts/run_distribute_train.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [TRAIN_DATA_DIR]
# example: bash scripts/run_distribute_train.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/

# run evaluation example
python eval.py --net [NET_NAME] --dataset [DATASET_NAME] --eval_data_dir /PATH/TO/DATASET --ckpt_files /PATH/TO/CHECKPOINT > eval.log 2>&1 &
OR
bash scripts/run_distribute_eval.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/CHECKPOINT
bash scripts/run_distribute_eval.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [EVAL_DATA_DIR] [CKPT_PATH]
# example: bash scripts/run_distribute_eval.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/validation_preprocess/ /home/model/densenet/ckpt/0-120_500.ckpt
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
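The rank table file can be generated with the `hccl_tools` utility from `model_zoo/utils` (linked later in this document); a minimal sketch, assuming all 8 devices of the local host take part:

```bash
# Generate a rank table covering devices 0-7 of the local host.
# The output file name encodes the device range and host IP,
# e.g. hccl_8p_01234567_10.155.170.71.json
python hccl_tools.py --device_num "[0,8)"
```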
@ -118,6 +121,7 @@ After installing MindSpore via the official website, you can start training and

- If you want to train the model on ModelArts, you can refer to the [official guidance document of ModelArts](https://support.huaweicloud.com/modelarts/)

```python
# Example of using distributed training densenet121 on ModelArts:
# Dataset storage method

@ -164,6 +168,7 @@ After installing MindSpore via the official website, you can start training and
# (6) On the ModelArts interface, set the data path of the model ".../ImageNet_Original" (choose the ImageNet_Original folder path),
#     the output path of the model "Output file path", and the log path of the model "Job log path".
# (7) Start model inference.

```

- running on GPU
@ -171,6 +176,7 @@ After installing MindSpore via the official website, you can start training and

- For running on GPU, please change `platform` from `Ascend` to `GPU`

```python
# run training example
export CUDA_VISIBLE_DEVICES=0
python train.py --net=[NET_NAME] --dataset=[DATASET_NAME] --train_data_dir=[DATASET_PATH] --is_distributed=0 --device_target='GPU' > train.log 2>&1 &

@ -182,13 +188,15 @@ After installing MindSpore via the official website, you can start training and
python eval.py --net=[NET_NAME] --dataset=[DATASET_NAME] --eval_data_dir=[DATASET_PATH] --device_target='GPU' --ckpt_files=[CHECKPOINT_PATH] > eval.log 2>&1 &
OR
bash run_distribute_eval_gpu.sh 1 0 [NET_NAME] [DATASET_NAME] [DATASET_PATH] [CHECKPOINT_PATH]
```

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```text
```densenet
├── model_zoo
    ├── README.md                          // descriptions about all the models
    ├── densenet
@ -227,6 +235,7 @@ After installing MindSpore via the official website, you can start training and
        ├── export.py                      // Export script
        ├── densenet100_config.yaml        // config file

```

## [Script Parameters](#contents)
@ -234,6 +243,7 @@ After installing MindSpore via the official website, you can start training and

You can modify the training behaviour through the various flags in the `densenet100.yaml`/`densenet121.yaml` config file. The flags are as follows:

```densenet100.yaml/densenet121.yaml
--train_data_dir     train data dir
--num_classes        num of classes in dataset (default: 1000 for densenet121; 10 for densenet100)
--image_size         image size of the dataset

@ -258,6 +268,7 @@ You can modify the training behaviour through the various flags in the `densenet
--is_distributed     if multi device (default: 1)
--rank               local rank in distributed training (default: 0)
--group_size         world size of distributed training (default: 1)
```

## [Training Process](#contents)
@ -267,26 +278,31 @@ You can modify the training behaviour through the various flags in the `densenet

- running on Ascend

```python
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --train_pretrained /PATH/TO/PRETRAINED_CKPT --is_distributed 0 > train.log 2>&1 &
```

The Python command above will run in the background; the log and model checkpoint will be generated in `output/202x-xx-xx_time_xx_xx_xx/`. The loss values of training DenseNet121 on ImageNet look as follows:

```shell
```log
2020-08-22 16:58:56,617:INFO:epoch[0], iter[5003], loss:4.367, mean_fps:0.00 imgs/sec
2020-08-22 16:58:56,619:INFO:local passed
2020-08-22 17:02:19,920:INFO:epoch[1], iter[10007], loss:3.193, mean_fps:6301.11 imgs/sec
2020-08-22 17:02:19,921:INFO:local passed
2020-08-22 17:05:43,112:INFO:epoch[2], iter[15011], loss:3.096, mean_fps:6304.53 imgs/sec
2020-08-22 17:05:43,113:INFO:local passed
...
```

- running on GPU

```python
export CUDA_VISIBLE_DEVICES=0
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir=[DATASET_PATH] --is_distributed=0 --device_target='GPU' > train.log 2>&1 &
```

The Python command above will run in the background; you can view the results in the file `train.log`.
@ -296,7 +312,9 @@ You can modify the training behaviour through the various flags in the `densenet

- running on CPU

```python
python train.py --net=[NET_NAME] --dataset=[DATASET_NAME] --train_data_dir=[DATASET_PATH] --is_distributed=0 --device_target='CPU' > train.log 2>&1 &
```

The Python command above will run in the background; the log and model checkpoint will be generated in `output/202x-xx-xx_time_xx_xx_xx/`.
@ -306,27 +324,32 @@ You can modify the training behaviour through the various flags in the `densenet

- running on Ascend

```bash
bash scripts/run_distribute_train.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/PRETRAINED_CKPT

bash scripts/run_distribute_train.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [TRAIN_DATA_DIR]
# example: bash scripts/run_distribute_train.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/
```

The above shell script will run distributed training in the background. You can view the result logs and model checkpoints under `train[X]/output/202x-xx-xx_time_xx_xx_xx/`. The loss values of training DenseNet121 on ImageNet look as follows:

```log
2020-08-22 16:58:54,556:INFO:epoch[0], iter[5003], loss:3.857, mean_fps:0.00 imgs/sec
2020-08-22 17:02:19,188:INFO:epoch[1], iter[10007], loss:3.18, mean_fps:6260.18 imgs/sec
2020-08-22 17:05:42,490:INFO:epoch[2], iter[15011], loss:2.621, mean_fps:6301.11 imgs/sec
2020-08-22 17:09:05,686:INFO:epoch[3], iter[20015], loss:3.113, mean_fps:6304.37 imgs/sec
2020-08-22 17:12:28,925:INFO:epoch[4], iter[25019], loss:3.29, mean_fps:6303.07 imgs/sec
2020-08-22 17:15:52,167:INFO:epoch[5], iter[30023], loss:2.865, mean_fps:6302.98 imgs/sec
...
...
```

- running on GPU

```bash
cd scripts
bash run_distribute_train_gpu.sh 8 0,1,2,3,4,5,6,7 [NET_NAME] [DATASET_NAME] [DATASET_PATH]
```

The above shell script will run distributed training in the background. You can view the results in the file `train/train.log`.
@ -340,16 +363,21 @@ You can modify the training behaviour through the various flags in the `densenet

Run the command below for evaluation.

```python
python eval.py --net [NET_NAME] --dataset [DATASET_NAME] --eval_data_dir /PATH/TO/DATASET --ckpt_files /PATH/TO/CHECKPOINT > eval.log 2>&1 &
OR
bash scripts/run_distribute_eval.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/CHECKPOINT
bash scripts/run_distribute_eval.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [EVAL_DATA_DIR] [CKPT_PATH]
# example: bash scripts/run_distribute_eval.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/validation_preprocess/ /home/model/densenet/ckpt/0-120_500.ckpt
```

The above Python command will run in the background. You can view the results in the file "output/202x-xx-xx_time_xx_xx_xx/202x_xxxx.log". The accuracy of evaluating DenseNet121 on the ImageNet test dataset is as follows:

```log
2020-08-24 09:21:50,551:INFO:after allreduce eval: top1_correct=37657, tot=49920, acc=75.43%
2020-08-24 09:21:50,551:INFO:after allreduce eval: top5_correct=46224, tot=49920, acc=92.60%
```

- evaluation on GPU
@ -357,22 +385,28 @@ You can modify the training behaviour through the various flags in the `densenet

Run the command below for evaluation.

```python
python eval.py --net=[NET_NAME] --dataset=[DATASET_NAME] --eval_data_dir=[DATASET_PATH] --device_target='GPU' --ckpt_files=[CHECKPOINT_PATH] > eval.log 2>&1 &
OR
bash run_distribute_eval_gpu.sh 1 0 [NET_NAME] [DATASET_NAME] [DATASET_PATH] [CHECKPOINT_PATH]
```

The above Python command will run in the background. You can view the results in the file "eval/eval.log". The accuracy of evaluating DenseNet121 on the ImageNet test dataset is as follows:

```log
2021-02-04 14:20:50,551:INFO:after allreduce eval: top1_correct=37637, tot=49984, acc=75.30%
2021-02-04 14:20:50,551:INFO:after allreduce eval: top5_correct=46370, tot=49984, acc=92.77%
```

The accuracy of evaluating DenseNet100 on the CIFAR-10 test dataset is as follows:

```log
2021-03-12 18:04:07,893:INFO:after allreduce eval: top1_correct=9536, tot=9984, acc=95.51%
```

- evaluation on CPU
@ -380,13 +414,17 @@ You can modify the training behaviour through the various flags in the `densenet

Run the command below for evaluation.

```python
python eval.py --net=[NET_NAME] --dataset=[DATASET_NAME] --eval_data_dir=[DATASET_PATH] --device_target='CPU' --ckpt_files=[CHECKPOINT_PATH] > eval.log 2>&1 &
```

The above Python command will run in the background. You can view the results in the file "eval/eval.log". The accuracy of evaluating DenseNet100 on the CIFAR-10 test dataset is as follows:

```log
2021-03-18 09:06:43,247:INFO:after allreduce eval: top1_correct=9492, tot=9984, acc=95.07%
```

## [Export Process](#contents)
@ -394,7 +432,9 @@ You can modify the training behaviour through the various flags in the `densenet

### export

```shell
python export.py --net [NET_NAME] --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --file_format [EXPORT_FORMAT] --batch_size [BATCH_SIZE]
```

`EXPORT_FORMAT` should be in ["AIR", "MINDIR"].
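As an illustration, a DenseNet121 checkpoint could be exported to MindIR as sketched below; the checkpoint path and batch size are assumed values, not taken from this diff:

```bash
# Illustrative export invocation; substitute your own checkpoint path
python export.py --net densenet121 --ckpt_file /home/model/densenet/ckpt/0-120_500.ckpt --device_target Ascend --file_format MINDIR --batch_size 32
```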
@ -402,6 +442,7 @@ python export.py --net [NET_NAME] --ckpt_file [CKPT_PATH] --device_target [DEVIC

- Export MindIR on ModelArts

```Modelarts
Export MindIR example on ModelArts
Data storage method is the same as training
# (1) Choose either a (modify yaml file parameters) or b (create a ModelArts training job to modify parameters).

@ -418,6 +459,7 @@ Data storage method is the same as training
# (4) Set the startup file of the model on the ModelArts interface to "export.py".
# (5) Set the data path of the model on the ModelArts interface ".../ImageNet_Original/checkpoint" (choose the ImageNet_Original/checkpoint folder path),
#     the output path of the model "Output file path", and the log path of the model "Job log path".

```

## [Inference Process](#contents)
@ -427,8 +469,10 @@ Data storage method is the same as training

Before performing inference, we need to export the model first. An AIR model can only be exported in the Ascend 910 environment; a MindIR model can be exported in any environment.

```shell
# Ascend310 inference
bash run_infer_310.sh [MINDIR_PATH] [DATASET] [DATA_PATH] [LABEL_FILE] [DEVICE_ID]
```

- NOTE: Ascend 310 inference uses the ImageNet dataset. The label of each image is the index of its folder, numbered from 0 after sorting.
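A minimal sketch of how such a label file could be built, assuming one sub-folder per class under the dataset root and a `<image_name> <class_index>` line format (both assumptions, not taken from this diff):

```bash
# Number the class sub-folders from 0 in sorted order and emit one
# "<image_name> <class_index>" line per image (assumed label format;
# paths containing spaces are not handled in this sketch)
idx=0
for dir in $(ls -d /PATH/TO/DATASET/*/ | sort); do
    for img in "$dir"*; do
        echo "$(basename "$img") $idx"
    done
    idx=$((idx + 1))
done > label_file.txt
```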
@ -437,8 +481,10 @@ Inference result is saved in current path, you can find result like this in acc.

The accuracy of evaluating DenseNet121 on the ImageNet test dataset is as follows:

```log
2020-08-24 09:21:50,551:INFO:after allreduce eval: top1_correct=37657, tot=49920, acc=75.56%
2020-08-24 09:21:50,551:INFO:after allreduce eval: top5_correct=46224, tot=49920, acc=92.74%
```

# [Model Description](#contents)
@ -99,18 +99,22 @@ Dataset used by DenseNet-100: CIFAR-10

- running on Ascend

```densenet121 is trained by default; to train densenet100, modify the _config_path value in src/model_utils/config.py
# standalone training example
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --train_pretrained /PATH/TO/PRETRAINED_CKPT --is_distributed 0 > train.log 2>&1 &
```densenet121 is trained by default; to train densenet100, modify the _config_path value in src/model_utils/config.py

# distributed training example
bash scripts/run_distribute_train.sh 8 /PATH/TO/RANK_TABLE.JSON [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/PRETRAINED_CKPT
# standalone training example
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --train_pretrained /PATH/TO/PRETRAINED_CKPT --is_distributed 0 > train.log 2>&1 &
# example: python train.py --net densenet121 --dataset imagenet --train_data_dir /home/DataSet/ImageNet_Original/train/

# standalone evaluation example
python eval.py --net [NET_NAME] --dataset [DATASET_NAME] --eval_data_dir /PATH/TO/DATASET --ckpt_files /PATH/TO/CHECKPOINT > eval.log 2>&1 &
# distributed training example
bash scripts/run_distribute_train.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [TRAIN_DATA_DIR]
# example: bash scripts/run_distribute_train.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/

bash scripts/run_distribute_eval.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/CHECKPOINT
```
# standalone evaluation example
python eval.py --net [NET_NAME] --dataset [DATASET_NAME] --eval_data_dir /PATH/TO/DATASET --ckpt_files /PATH/TO/CHECKPOINT > eval.log 2>&1 &

bash scripts/run_distribute_eval.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [EVAL_DATA_DIR] [CKPT_PATH]
# example: bash scripts/run_distribute_eval.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/validation_preprocess/ /home/model/densenet/ckpt/0-120_500.ckpt
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
@ -267,66 +271,67 @@ Dataset used by DenseNet-100: CIFAR-10

- running on Ascend

```python
python train.py --net [NET_NAME] --dataset [DATASET_NAME] --train_data_dir /PATH/TO/DATASET --train_pretrained /PATH/TO/PRETRAINED_CKPT --is_distributed 0 > train.log 2>&1 &
```

The above Python command runs in the background; logs and model checkpoints are generated under the `output/202x-xx-xx_time_xx_xx/` directory. The loss values of training DenseNet-121 on ImageNet are as follows:

```log
2020-08-22 16:58:56,617:INFO:epoch[0], iter[5003], loss:4.367, mean_fps:0.00 imgs/sec
2020-08-22 16:58:56,619:INFO:local passed
2020-08-22 17:02:19,920:INFO:epoch[1], iter[10007], loss:3.193, mean_fps:6301.11 imgs/sec
2020-08-22 17:02:19,921:INFO:local passed
2020-08-22 17:05:43,112:INFO:epoch[2], iter[15011], loss:3.096, mean_fps:6304.53 imgs/sec
2020-08-22 17:05:43,113:INFO:local passed
...
```

- running on GPU

```python
export CUDA_VISIBLE_DEVICES=0
python train.py --net=[NET_NAME] --dataset=[DATASET_NAME] --train_data_dir=[DATASET_PATH] --is_distributed=0 --device_target='GPU' > train.log 2>&1 &
```

The above Python command runs in the background; logs and model checkpoints are generated under the `output/202x-xx-xx_time_xx_xx/` directory.

- running on CPU

```python
python train.py --net=[NET_NAME] --dataset=[DATASET_NAME] --train_data_dir=[DATASET_PATH] --is_distributed=0 --device_target='CPU' > train.log 2>&1 &
```

The above Python command runs in the background; logs and model checkpoints are generated under the `output/202x-xx-xx_time_xx_xx/` directory.

### Distributed Training

- running on Ascend

```shell
bash scripts/run_distribute_train.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/PRETRAINED_CKPT
```
```shell
bash scripts/run_distribute_train.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [TRAIN_DATA_DIR]
# example: bash scripts/run_distribute_train.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/
```

The above shell script runs distributed training in the background. Result logs and model checkpoints can be viewed under `train[X]/output/202x-xx-xx_time_xx_xx_xx/`. The loss values of training DenseNet-121 on ImageNet are as follows:

```log
2020-08-22 16:58:54,556:INFO:epoch[0], iter[5003], loss:3.857, mean_fps:0.00 imgs/sec
2020-08-22 17:02:19,188:INFO:epoch[1], iter[10007], loss:3.18, mean_fps:6260.18 imgs/sec
2020-08-22 17:05:42,490:INFO:epoch[2], iter[15011], loss:2.621, mean_fps:6301.11 imgs/sec
2020-08-22 17:09:05,686:INFO:epoch[3], iter[20015], loss:3.113, mean_fps:6304.37 imgs/sec
2020-08-22 17:12:28,925:INFO:epoch[4], iter[25019], loss:3.29, mean_fps:6303.07 imgs/sec
2020-08-22 17:15:52,167:INFO:epoch[5], iter[30023], loss:2.865, mean_fps:6302.98 imgs/sec
...
...
```

- running on GPU

```bash
cd scripts
bash run_distribute_train_gpu.sh 8 0,1,2,3,4,5,6,7 [NET_NAME] [DATASET_NAME] [DATASET_PATH]
```

The above shell script runs distributed training in the background. Result logs and model checkpoints can be viewed under `train[X]/output/202x-xx-xx_time_xx_xx_xx/`.
@ -341,7 +346,8 @@ Dataset used by DenseNet-100: CIFAR-10

```eval
python eval.py --net [NET_NAME] --dataset [DATASET_NAME] --eval_data_dir /PATH/TO/DATASET --ckpt_files /PATH/TO/CHECKPOINT > eval.log 2>&1 &
OR
bash scripts/run_distribute_eval.sh 8 rank_table.json [NET_NAME] [DATASET_NAME] /PATH/TO/DATASET /PATH/TO/CHECKPOINT
bash scripts/run_distribute_eval.sh [DEVICE_NUM] [RANK_TABLE_FILE] [NET_NAME] [DATASET_NAME] [EVAL_DATA_DIR] [CKPT_PATH]
# example: bash scripts/run_distribute_eval.sh 8 /root/hccl_8p_01234567_10.155.170.71.json densenet121 imagenet /home/DataSet/ImageNet_Original/train/validation_preprocess/ /home/model/densenet/ckpt/0-120_500.ckpt
```

The above Python command runs in the background. Results can be viewed in the file "output/202x-xx-xx_time_xx_xx_xx/202x_xxxx.log". The accuracy of DenseNet-121 on the ImageNet test dataset is as follows:
@ -94,12 +94,14 @@ To train the DPNs, run the shell script `scripts/train_standalone.sh` with the f

```shell
bash scripts/train_standalone.sh [device_id] [train_data_dir] [ckpt_path_to_save] [eval_each_epoch] [pretrained_ckpt(optional)]
# example: bash scripts/train_standalone.sh 0 /home/DataSet/ImageNet_Original/train/ ./ckpt 0
```

To validate the DPNs, run the shell script `scripts/eval.sh` with the format below:

```shell
bash scripts/eval.sh [device_id] [eval_data_dir] [checkpoint_path]
# example: bash scripts/eval.sh 0 /home/DataSet/ImageNet_Original/validation_preprocess/ /home/model/dpn/ckpt/dpn-100_40036.ckpt
```

# [Script Description](#contents)
@ -184,12 +186,7 @@ Run `scripts/train_standalone.sh` to train the model standalone. The usage of th

```shell
bash scripts/train_standalone.sh [device_id] [train_data_dir] [ckpt_path_to_save] [eval_each_epoch] [pretrained_ckpt(optional)]
```

For example, you can run the shell command below to launch the training procedure.

```shell
bash scripts/train_standalone.sh 0 /data/dataset/imagenet/ scripts/pretrian/ 0
# example: bash scripts/train_standalone.sh 0 /home/DataSet/ImageNet_Original/train/ ./ckpt 0
```

If `eval_each_epoch` is 1, evaluation runs after each epoch and the parameters with the best accuracy are saved; in that case, each epoch takes longer.
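For example, to evaluate after every epoch, set the fourth argument to 1 (paths as in the example above):

```bash
# Same command as the example above, with eval_each_epoch set to 1
bash scripts/train_standalone.sh 0 /home/DataSet/ImageNet_Original/train/ ./ckpt 1
```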
@ -231,12 +228,7 @@ Run `scripts/train_distributed.sh` to train the model distributed. The usage of

```text
bash scripts/train_distributed.sh [rank_table] [train_data_dir] [ckpt_path_to_save] [rank_size] [eval_each_epoch] [pretrained_ckpt(optional)]
```

For example, you can run the shell command below to launch the training procedure.

```shell
bash scripts/train_distributed.sh /home/rank_table.json /data/dataset/imagenet/ ../scripts 8 0 ../pretrain/dpn92.ckpt
# example: bash scripts/train_distributed.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/train/ ./ckpt/ 8 0
```

The above shell script will run distributed training in the background. You can view the results through the file `train_parallel[X]/log.txt` as follows:
@ -259,12 +251,7 @@ Run `scripts/eval.sh` to evaluate the model with one Ascend processor. The usage

```text
bash scripts/eval.sh [device_id] [eval_data_dir] [checkpoint_path]
```

For example, you can run the shell command below to launch the validation procedure.

```text
bash scripts/eval.sh 0 /data/dataset/imagenet/ pretrain/dpn-180_5004.ckpt
# example: bash scripts/eval.sh 0 /home/DataSet/ImageNet_Original/validation_preprocess/ /home/model/dpn/ckpt/dpn-100_40036.ckpt
```

The above shell script will run evaluation in the background. You can view the results through the file `eval_log.txt`. The results look as follows:
@ -80,11 +80,16 @@ Dataset used [ICDAR 2015](https://rrc.cvc.uab.es/?ch=4&com=downloads)

```bash
# distributed training example (8p)
sh run_distribute_train.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [RANK_TABLE_FILE]
bash run_distribute_train.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [RANK_TABLE_FILE]
# example: bash run_distribute_train.sh /home/DataSet/ICDAR2015/ic15/ /home/model/east/pretrained/0-150_5004.ckpt /root/hccl_8p_01234567_10.155.170.71.json

# standalone training
sh run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [DEVICE_ID]
bash run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [DEVICE_ID]
# example: bash run_standalone_train_ascend.sh /home/DataSet/ICDAR2015/ic15/ /home/model/east/pretrained/0-150_5004.ckpt 0

# evaluation
sh run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
bash run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
# example: bash run_eval_ascend.sh /home/DataSet/ICDAR2015/ch4_test_images/ /home/model/east/ckpt/checkpoint_east-600_15.ckpt
```

> Notes:
@ -100,9 +105,12 @@ sh run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
shell:
Ascend:
# distributed training example (8p)
sh run_distribute_train.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [RANK_TABLE_FILE]
bash run_distribute_train.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [RANK_TABLE_FILE]
# example: bash run_distribute_train.sh /home/DataSet/ICDAR2015/ic15/ /home/model/east/pretrained/0-150_5004.ckpt /root/hccl_8p_01234567_10.155.170.71.json

# standalone training
sh run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [DEVICE_ID]
bash run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_BACKBONE] [DEVICE_ID]
# example: bash run_standalone_train_ascend.sh /home/DataSet/ICDAR2015/ic15/ /home/model/east/pretrained/0-150_5004.ckpt 0
```

### Result
@ -199,7 +207,8 @@ You can start training using python or shell scripts. The usage of shell scripts
- Ascend:

```bash
sh run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
bash run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
# example: bash run_eval_ascend.sh /home/DataSet/ICDAR2015/ch4_test_images/ /home/model/east/ckpt/checkpoint_east-600_15.ckpt
```

### Launch
@ -224,7 +233,8 @@ You can start training using python or shell scripts. The usage of shell scripts
# eval example
shell:
Ascend:
sh run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
bash run_eval_ascend.sh [DATASET_PATH] [CKPT_PATH] [DEVICE_ID]
# example: bash run_eval_ascend.sh /home/DataSet/ICDAR2015/ch4_test_images/ /home/model/east/ckpt/checkpoint_east-600_15.ckpt
```

> A checkpoint can be produced during the training process.
@ -90,17 +90,28 @@ After installing MindSpore via the official website, you can start training and

- running on Ascend

```yaml
# Add the dataset path; take training cifar10 as an example
train_data_path:/home/DataSet/cifar10/
val_data_path:/home/DataSet/cifar10/

# Add the checkpoint path parameter before inference
checkpoint_path:/home/model/googlenet/ckpt/train_googlenet_cifar10-125_390.ckpt
```

```python
# run training example
python train.py > train.log 2>&1 &

# run distributed training example
bash scripts/run_train.sh rank_table.json
bash scripts/run_train.sh [RANK_TABLE_FILE] [DATASET_NAME]
# example: bash scripts/run_train.sh /root/hccl_8p_01234567_10.155.170.71.json cifar10

# run evaluation example
python eval.py > eval.log 2>&1 &
OR
bash run_eval.sh
bash run_eval.sh [DATASET_NAME]
# example: bash run_eval.sh cifar10

# run inference example
bash run_infer_310.sh [MINDIR_PATH] [DATASET] [DATA_PATH] [LABEL_FILE] [DEVICE_ID]
@ -390,7 +401,7 @@ For more configuration details, please refer the script `config.py`.
- running on Ascend

```bash
bash scripts/run_train.sh rank_table.json
bash scripts/run_train.sh /root/hccl_8p_01234567_10.155.170.71.json cifar10
```

The above shell script will run distributed training in the background. You can view the results through the file `train_parallel[X]/log`. The loss values look as follows:
@ -425,7 +436,7 @@ For more configuration details, please refer the script `config.py`.
```python
python eval.py > eval.log 2>&1 &
OR
bash scripts/run_eval.sh
bash run_eval.sh cifar10
```

The above Python command will run in the background. You can view the results through the file "eval.log". The accuracy on the test dataset will be as follows:
@ -460,7 +471,7 @@ For more configuration details, please refer the script `config.py`.
OR,

```bash
bash scripts/run_eval_gpu.sh [CHECKPOINT_PATH]
bash run_eval_gpu.sh [CHECKPOINT_PATH]
```

The above Python command will run in the background. You can view the results through the file "eval/eval.log". The accuracy on the test dataset will be as follows:
@ -92,17 +92,28 @@ GoogleNet chains multiple inception modules together, so the network can go deeper. The dimension-…

- running on Ascend

```yaml
# Add the dataset path; take training cifar10 as an example
train_data_path:/home/DataSet/cifar10/
val_data_path:/home/DataSet/cifar10/

# Add the checkpoint path parameter before inference
checkpoint_path:/home/model/googlenet/ckpt/train_googlenet_cifar10-125_390.ckpt
```

```python
# run training example
python train.py > train.log 2>&1 &

# run distributed training example
bash scripts/run_train.sh rank_table.json
bash scripts/run_train.sh [RANK_TABLE_FILE] [DATASET_NAME]
# example: bash scripts/run_train.sh /root/hccl_8p_01234567_10.155.170.71.json cifar10

# run evaluation example
python eval.py > eval.log 2>&1 &
OR
bash run_eval.sh
bash run_eval.sh [DATASET_NAME]
# example: bash run_eval.sh cifar10

# run inference example
bash run_infer_310.sh [MINDIR_PATH] [DATASET] [DATA_PATH] [LABEL_FILE] [DEVICE_ID]
@ -360,7 +371,7 @@ GoogleNet chains multiple inception modules together, so the network can go deeper. The dimension-…
- running on Ascend

```bash
bash scripts/run_train.sh rank_table.json
bash scripts/run_train.sh /root/hccl_8p_01234567_10.155.170.71.json cifar10
```

The above shell script will run distributed training in the background. You can view the results through the train_parallel[X]/log file. The loss values are as follows:
@ -395,7 +406,7 @@ GoogleNet chains multiple inception modules together, so the network can go deeper. The dimension-…
```bash
python eval.py > eval.log 2>&1 &
OR
sh scripts/run_eval.sh
bash run_eval.sh cifar10
```

The above Python command will run in the background; you can view the results through the eval.log file. The accuracy on the test dataset is as follows:
@ -430,7 +441,7 @@ GoogleNet chains multiple inception modules together, so the network can go deeper. The dimension-…
Alternatively,

```bash
bash scripts/run_eval_gpu.sh [CHECKPOINT_PATH]
bash run_eval_gpu.sh [CHECKPOINT_PATH]
```

The above Python command will run in the background; you can view the results through the eval/eval.log file. The accuracy on the test dataset is as follows:
@ -81,7 +81,7 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

- Running on [ModelArts](https://support.huaweicloud.com/modelarts/)

```bash
```inceptionv3
# Train 8p with Ascend
# (1) Perform a or b.
#     a. Set "enable_modelarts=True" in the default_config.yaml file.
@ -274,11 +274,21 @@ You can start training using python or shell scripts. The usage of shell scripts

- Ascend:

```yaml
ds_type:imagenet
or
ds_type:cifar10
# To train on cifar10, for example, set the ds_type parameter to cifar10
```

```shell
# distributed training (8p)
bash scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [CKPT_PATH]
# example: bash run_distribute_train.sh /root/hccl_8p_012345467_10.155.170.71.json /home/DataSet/cifar10/ ./ckpt/

# standalone training
bash scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
bash scripts/run_standalone_train.sh [DEVICE_ID] [DATA_PATH] [CKPT_PATH]
# example: bash scripts/run_standalone_train.sh 0 /home/DataSet/cifar10/ ./ckpt/
```

- CPU:
@ -302,10 +312,12 @@ bash scripts/run_standalone_train_cpu.sh DATA_PATH

shell:
Ascend:
# distributed training example (8p)
bash scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [CKPT_PATH]
# example: bash run_distribute_train.sh /root/hccl_8p_012345467_10.155.170.71.json /home/DataSet/cifar10/ ./ckpt/

# standalone training example
bash scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
bash scripts/run_standalone_train.sh [DEVICE_ID] [DATA_PATH] [CKPT_PATH]
# example: bash scripts/run_standalone_train.sh 0 /home/DataSet/cifar10/ ./ckpt/

CPU:
bash script/run_standalone_train_cpu.sh DATA_PATH
@ -344,13 +356,14 @@ You can start training using python or shell scripts. The usage of shell scripts

- Ascend:

```python
bash scripts/run_eval.sh DEVICE_ID DATA_PATH PATH_CHECKPOINT
```shell
bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
# example: bash run_eval.sh 0 /home/DataSet/cifar10/ /home/model/inceptionv3/ckpt/inception_v3-rank0-2_1251.ckpt
```

- CPU:

```python
```shell
bash scripts/run_eval_cpu.sh DATA_PATH PATH_CHECKPOINT
```
@ -280,12 +280,22 @@ The main parameters in train.py and config.py are as follows:

- Ascend:

```shell
# distributed training example (8 devices)
bash scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
# standalone training
bash scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
```
```yaml
ds_type:imagenet
or
ds_type:cifar10
# To train on cifar10, for example, set the ds_type parameter to cifar10
```

```shell
# distributed training example (8 devices)
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [CKPT_PATH]
# example: bash run_distribute_train.sh /root/hccl_8p_012345467_10.155.170.71.json /home/DataSet/cifar10/ ./ckpt/

# standalone training
bash scripts/run_standalone_train.sh [DEVICE_ID] [DATA_PATH] [CKPT_PATH]
# example: bash scripts/run_standalone_train.sh 0 /home/DataSet/cifar10/ ./ckpt/
```

> Note: For RANK_TABLE_FILE, refer to this [link](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/distributed_training_ascend.html). device_ip can be obtained via this [link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
> This is a CPU core-binding operation based on device_num and the total number of processors. If it is not needed, remove the taskset operation in scripts/run_distribute_train.sh.
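A minimal sketch of what such taskset-based core binding looks like; the per-rank core count is an assumed value, not taken from the script:

```bash
# Pin rank 0's training process to its own slice of CPU cores
# (cores_per_rank=12 is an assumption for illustration)
cores_per_rank=12
rank=0
taskset -c $((rank * cores_per_rank))-$(((rank + 1) * cores_per_rank - 1)) python train.py > train.log 2>&1 &
```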
@ -301,9 +311,12 @@ The main parameters in train.py and config.py are as follows:

shell:
Ascend:
# distributed training example (8 devices)
bash scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [CKPT_PATH]
# example: bash run_distribute_train.sh /root/hccl_8p_012345467_10.155.170.71.json /home/DataSet/cifar10/ ./ckpt/

# standalone training
bash scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
bash scripts/run_standalone_train.sh [DEVICE_ID] [DATA_PATH] [CKPT_PATH]
# example: bash scripts/run_standalone_train.sh 0 /home/DataSet/cifar10/ ./ckpt/

CPU:
bash script/run_standalone_train_cpu.sh DATA_PATH
@ -343,7 +356,8 @@ epoch time: 6358482.104 ms, per step time: 16303.800 ms
- Ascend:

```shell
bash scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
# example: bash run_eval.sh 0 /home/DataSet/cifar10/ /home/model/inceptionv3/ckpt/inception_v3-rank0-2_1251.ckpt
```

- CPU:
|
|||
CPU: python eval.py --config_path CONFIG_FILE --dataset_path DATA_PATH --checkpoint PATH_CHECKPOINT --platform CPU
|
||||
|
||||
shell:
|
||||
Ascend: bash scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
|
||||
Ascend: bash run_eval.sh [DEVICE_ID] [DATA_DIR] [PATH_CHECKPOINT]
|
||||
CPU: bash scripts/run_eval_cpu.sh DATA_PATH PATH_CHECKPOINT
|
||||
```
|
||||
|
||||
|
|
|
@ -245,11 +245,21 @@ You can start training using python or shell scripts. The usage of shell scripts

- Ascend:

```yaml
ds_type:imagenet
or
ds_type:cifar10
# To train on cifar10, for example, set the ds_type parameter to cifar10
```

```bash
# distributed training example (8p)
bash scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_PATH DATA_DIR
bash scripts/run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATA_DIR]
# example: bash scripts/run_distribute_train_ascend.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cifar10/

# standalone training
bash scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
bash scripts/run_standalone_train_ascend.sh [DEVICE_ID] [DATA_DIR]
# example: bash scripts/run_standalone_train_ascend.sh 0 /home/DataSet/cifar10/
```

> Notes:
@ -278,9 +288,13 @@ bash scripts/run_standalone_train_cpu.sh DATA_PATH
shell:
Ascend:
# distributed training example (8p)
bash scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_PATH DATA_DIR
bash scripts/run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATA_DIR]
# example: bash scripts/run_distribute_train_ascend.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cifar10/

# standalone training
bash scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
bash scripts/run_standalone_train_ascend.sh [DEVICE_ID] [DATA_DIR]
# example: bash scripts/run_standalone_train_ascend.sh 0 /home/DataSet/cifar10/

GPU:
# distributed training example (8p)
bash scripts/run_distribute_train_gpu.sh DATA_PATH
@ -324,7 +338,8 @@ You can start training using python or shell scripts. The usage of shell scripts
- Ascend:

```bash
bash scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
bash scripts/run_eval_ascend.sh [DEVICE_ID] [DATA_DIR] [CHECKPOINT_PATH]
# example: bash scripts/run_eval_ascend.sh 0 /home/DataSet/cifar10/ /home/model/inceptionv4/ckpt/inceptionv4-train-250_1251
```

- GPU
@ -339,7 +354,7 @@ You can start training using python or shell scripts. The usage of shell scripts
# eval example
shell:
Ascend:
bash scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
bash scripts/run_eval_ascend.sh [DEVICE_ID] [DATA_DIR] [CHECKPOINT_PATH]
GPU:
bash scripts/run_eval_gpu.sh DATA_DIR CHECKPOINT_PATH
```
@ -73,11 +73,14 @@ Dataset used: [MNIST](<http://yann.lecun.com/exdb/mnist/>)

After installing MindSpore via the official website, you can start training and evaluation as follows:

```python
```bash
# enter the script dir and train LeNet
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_SAVE_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/

# enter the script dir and evaluate LeNet
bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]
# example: bash run_standalone_eval_ascend.sh /home/DataSet/MNIST/ /home/model/lenet/ckpt/checkpoint_lenet-1_1875.ckpt
```

- Running on [ModelArts](https://support.huaweicloud.com/modelarts/)
@ -147,7 +150,7 @@ bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]

- Export on ModelArts (if you want to run on ModelArts, please check the official [ModelArts documentation](https://support.huaweicloud.com/modelarts/); you can start evaluating as follows)

1. Export s8 multiscale and flip with voc val dataset on modelarts, evaluating steps are as follows:
1. The evaluation steps using ModelArts are as follows:

```python
# (1) Perform a or b.
@ -206,8 +209,8 @@ bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]

## [Script Parameters](#contents)

```python
Major parameters in train.py and default_config.yaml as follows:
```default_config.yaml
Major parameters in default_config.yaml as follows:

--data_path: The absolute full path to the train and evaluation datasets.
--epoch_size: Total training epochs.
@ -228,7 +231,8 @@ Major parameters in train.py and default_config.yaml as follows:

```bash
python train.py --data_path Data --ckpt_path ckpt > log.txt 2>&1 &
# or enter the script dir and run the script
bash run_standalone_train_ascend.sh Data ckpt
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_SAVE_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/
```

After training, the loss values look as follows:
@ -254,7 +258,8 @@ Before running the command below, please check the checkpoint path used for eval

```bash
python eval.py --data_path Data --ckpt_path ckpt/checkpoint_lenet-1_1875.ckpt > log.txt 2>&1 &
# or enter the script dir and run the script
bash run_standalone_eval_ascend.sh Data ckpt/checkpoint_lenet-1_1875.ckpt
bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]
# example: bash run_standalone_eval_ascend.sh /home/DataSet/MNIST/ /home/model/lenet/ckpt/checkpoint_lenet-1_1875.ckpt
```

You can view the results through the file "log.txt". The accuracy on the test dataset will be as follows:
@ -75,11 +75,14 @@ LeNet is very simple, containing 5 layers: 2 convolutional layers and 3 fully connected layers.

After installing MindSpore via the official website, you can start training and evaluation as follows:

```python
```bash
# enter the script dir and train LeNet
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_SAVE_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/

# enter the script dir and evaluate LeNet
bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]
# example: bash run_standalone_eval_ascend.sh /home/DataSet/MNIST/ /home/model/lenet/ckpt/checkpoint_lenet-1_1875.ckpt
```

- Training on ModelArts (if you want to run on ModelArts, refer to the [ModelArts documentation](https://support.huaweicloud.com/modelarts/))
@ -147,7 +150,7 @@ bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]

- Exporting on ModelArts (if you want to run on ModelArts, refer to the [ModelArts documentation](https://support.huaweicloud.com/modelarts/))

1. Export s8 multiscale and flip with the voc val dataset on ModelArts; the evaluation steps are as follows:
1. The evaluation steps using ModelArts are as follows:

```python
# (1) Perform a or b.
@ -206,8 +209,8 @@ bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]

## Script Parameters

```python
The main parameters in train.py and default_config.yaml are as follows:
```default_config.yaml
The main parameters in default_config.yaml are as follows:

--data_path: absolute full path of the training and evaluation datasets
--epoch_size: total number of training epochs
@ -226,7 +229,8 @@ The main parameters in train.py and default_config.yaml are as follows:

```bash
python train.py --data_path Data --ckpt_path ckpt > log.txt 2>&1 &
# or enter the script dir and run the script
bash run_standalone_train_ascend.sh Data ckpt
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_SAVE_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/
```

After training, the loss values are as follows:
@ -252,7 +256,8 @@ epoch:1 step:1538, loss is 1.0221305

```bash
python eval.py --data_path Data --ckpt_path ckpt/checkpoint_lenet-1_1875.ckpt > log.txt 2>&1 &
# or enter the script dir and run the script
bash run_standalone_eval_ascend.sh Data ckpt/checkpoint_lenet-1_1875.ckpt
bash run_standalone_eval_ascend.sh [DATA_PATH] [CKPT_NAME]
# example: bash run_standalone_eval_ascend.sh /home/DataSet/MNIST/ /home/model/lenet/ckpt/checkpoint_lenet-1_1875.ckpt
```

You can view the results through the log.txt file. The accuracy on the test dataset is as follows:
@ -72,18 +72,23 @@ After installing MindSpore via the official website, you can start training and

```python
# enter the ../lenet directory and train the lenet network; a '.ckpt' file will be generated.
bash run_standalone_train_ascend.sh [DATA_PATH]
# enter the lenet dir, train LeNet-Quant
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/

# enter the lenet_quant dir, train lenet_quant
python train.py --device_target=Ascend --data_path=[DATA_PATH] --ckpt_path=[CKPT_PATH] --dataset_sink_mode=True
# example: python train.py --device_target=Ascend --data_path=/home/DataSet/MNIST/ --ckpt_path=/home/model/lenet/checkpoint_lenet-10_1875.ckpt --dataset_sink_mode=True

# evaluate LeNet-Quant
python eval.py --device_target=Ascend --data_path=[DATA_PATH] --ckpt_path=[CKPT_PATH] --dataset_sink_mode=True
# example: python eval.py --device_target=Ascend --data_path=/home/DataSet/MNIST/ --ckpt_path=/home/model/lenet_quant/checkpoint_lenet-10_937.ckpt --dataset_sink_mode=True
```

## [Script Description](#contents)

## [Script and Sample Code](#contents)

```bash
```lenet_quant
├── model_zoo
    ├── README.md               // descriptions about all the models
    ├── lenet_quant
@ -75,19 +75,24 @@ LeNet is very simple, containing 5 layers: 2 convolutional layers and 3 fully connected layers.

After installing MindSpore via the official website, you can start training and evaluation as follows:

```python
# enter the ../lenet directory and train the lenet network to generate a '.ckpt' file
bash run_standalone_train_ascend.sh [DATA_PATH]
# enter the lenet directory and train LeNet-Quant
# enter the ../lenet directory and train the lenet network; the generated '.ckpt' file serves as the pretrained file for lenet-quant
bash run_standalone_train_ascend.sh [DATA_PATH] [CKPT_PATH]
# example: bash run_standalone_train_ascend.sh /home/DataSet/MNIST/ ./ckpt/

# enter the lenet-quant directory and train lenet-quant
python train.py --device_target=Ascend --data_path=[DATA_PATH] --ckpt_path=[CKPT_PATH] --dataset_sink_mode=True
# evaluate LeNet-Quant
# example: python train.py --device_target=Ascend --data_path=/home/DataSet/MNIST/ --ckpt_path=/home/model/lenet/checkpoint_lenet-10_1875.ckpt --dataset_sink_mode=True

# evaluate lenet-quant
python eval.py --device_target=Ascend --data_path=[DATA_PATH] --ckpt_path=[CKPT_PATH] --dataset_sink_mode=True
# example: python eval.py --device_target=Ascend --data_path=/home/DataSet/MNIST/ --ckpt_path=/home/model/lenet_quant/checkpoint_lenet-10_937.ckpt --dataset_sink_mode=True
```

## Script Description

### Script and Sample Code

```bash
```lenet_quant
├── model_zoo
    ├── README.md               // descriptions about all the models
    ├── lenet_quant
@ -106,10 +106,12 @@ pip install mmcv=0.2.14
On Ascend:

# distributed training
bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_CKPT]
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [PRETRAINED_CKPT(optional)]
# example: bash run_distribute_train.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cocodataset/

# standalone training
bash run_standalone_train.sh [PRETRAINED_CKPT]
bash run_standalone_train.sh [DATA_PATH] [PRETRAINED_CKPT(optional)]
# example: bash run_standalone_train.sh /home/DataSet/cocodataset/

On CPU:
@ -128,7 +130,8 @@ pip install mmcv=0.2.14

```bash
# Evaluation on Ascend
bash run_eval.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
bash run_eval.sh [ANN_FILE] [CHECKPOINT_PATH] [DATA_PATH]
# example: bash run_eval.sh /home/DataSet/cocodataset/annotations/instances_val2017.json /home/model/maskrcnn_mobilenetv1/ckpt/mask_rcnn-5_7393.ckpt /home/DataSet/cocodataset/

# Evaluation on CPU
bash run_eval_cpu.sh [ANN_FILE] [CHECKPOINT_PATH]
@ -347,10 +350,12 @@ pip install mmcv=0.2.14
On Ascend:

# distributed training
Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [PRETRAINED_CKPT(optional)]
# example: bash run_distribute_train.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cocodataset/

# standalone training
Usage: bash run_standalone_train.sh [PRETRAINED_MODEL]
Usage: bash run_standalone_train.sh [DATA_PATH] [PRETRAINED_CKPT(optional)]
# example: bash run_standalone_train.sh /home/DataSet/cocodataset/

On CPU:
@ -360,7 +365,7 @@ Usage: bash run_standalone_train_cpu.sh [PRETRAINED_MODEL](optional)

### [Parameters Configuration](#contents)

```bash
```default_config.yaml
"img_width": 1280,          # width of the input images
"img_height": 768,          # height of the input images
@ -510,7 +515,8 @@ Usage: bash run_standalone_train_cpu.sh [PRETRAINED_MODEL](optional)

```bash
# standalone training
bash run_standalone_train.sh [PRETRAINED_MODEL]
bash run_standalone_train.sh [DATA_PATH] [PRETRAINED_CKPT(optional)]
# example: bash run_standalone_train.sh /home/DataSet/cocodataset/
```

- Run `run_standalone_train_cpu.sh` for non-distributed training of maskrcnn_mobilenetv1 model on CPU.
@ -525,7 +531,8 @@ bash run_standalone_train_cpu.sh [PRETRAINED_MODEL](optional)
- Run `run_distribute_train.sh` for distributed training of the Mask model on Ascend.

```bash
bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [PRETRAINED_MODEL(optional)]
# example: bash run_distribute_train.sh /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cocodataset/
```

> hccl.json, which is specified by RANK_TABLE_FILE, is needed when you are running a distributed task. You can generate it with the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
@ -536,7 +543,7 @@ bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]

Training results will be stored in the example path, in a folder whose name begins with "train" or "train_parallel". You can find the checkpoint files together with results like the following in loss_rankid.log.

```bash
```log
# distribute training result(8p)
2123 epoch: 1 step: 7393 ,rpn_loss: 0.24854, rcnn_loss: 1.04492, rpn_cls_loss: 0.19238, rpn_reg_loss: 0.05603, rcnn_cls_loss: 0.47510, rcnn_reg_loss: 0.16919, rcnn_mask_loss: 0.39990, total_loss: 1.29346
3973 epoch: 2 step: 7393 ,rpn_loss: 0.02769, rcnn_loss: 0.51367, rpn_cls_loss: 0.01746, rpn_reg_loss: 0.01023, rcnn_cls_loss: 0.24255, rcnn_reg_loss: 0.05630, rcnn_mask_loss: 0.21484, total_loss: 0.54137
@ -557,7 +564,8 @@ Training result will be stored in the example path, whose folder name begins wit

```bash
# infer
bash run_eval.sh [VALIDATION_ANN_FILE_JSON] [CHECKPOINT_PATH]
bash run_eval.sh [VALIDATION_ANN_FILE_JSON] [CHECKPOINT_PATH] [DATA_PATH]
# example: bash run_eval.sh /home/DataSet/cocodataset/annotations/instances_val2017.json /home/model/maskrcnn_mobilenetv1/ckpt/mask_rcnn-5_7393.ckpt /home/DataSet/cocodataset/
```

> For the COCO2017 dataset, VALIDATION_ANN_FILE_JSON refers to annotations/instances_val2017.json in the dataset directory.
@ -567,7 +575,7 @@ bash run_eval.sh [VALIDATION_ANN_FILE_JSON] [CHECKPOINT_PATH]

Inference results will be stored in the example path, in a folder named "eval". Under it, you can find results like the following in the log.

```bash
```log
Evaluate annotation type *bbox*
Accumulating evaluation results...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.227
@ -625,7 +633,7 @@ bash run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]

Inference results are saved in the current path; you can find results like the following in the acc.log file.

```bash
```log
Evaluate annotation type *bbox*
Accumulating evaluation results...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.227
@ -46,14 +46,29 @@ Dataset used: [ImageNet2012](http://www.image-net.org/)
- Test: 50,000 images
- Data format: JPEG
- Note: Data will be processed in dataset.py

Dataset used: [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html)

- Dataset size: 175 MB, 60,000 32*32 color images in 10 classes
- Train: 146 MB, 50,000 images
- Test: 29 MB, 10,000 images
- Data format: binary files
- Note: Data will be processed in dataset.py

- Download the dataset; the directory structure is as follows:

```bash
└─dataset
    ├─ilsvrc                    # train dataset
```ImageNet2012
└─ImageNet_Original
    ├─train                     # train dataset
    └─validation_preprocess     # evaluate dataset
```

```cifar10
└─cifar10
    ├─cifar-10-batches-bin      # train dataset
    └─cifar-10-verify-bin       # evaluate dataset
```

## Features

### Mixed Precision(Ascend)
@ -223,6 +238,9 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

You can start training using Python or shell scripts (a sketch of the per-process setup these scripts perform follows this list). The usage of the shell scripts is as follows:

- Ascend: bash run_distribute_train.sh [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
  # example: bash run_distribute_train.sh cifar10 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cifar10/cifar-10-batches-bin/
  # example: bash run_distribute_train.sh imagenet2012 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/
- CPU: bash run_train_CPU.sh [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
- GPU (single device): bash run_standalone_train_gpu.sh [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
- GPU (distributed training): bash run_distribute_train_gpu.sh [cifar10|imagenet2012] [CONFIG_PATH] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
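
The distributed launcher ultimately starts one training process per device, each of which initializes the communication backend before building the dataset and model. A hedged sketch of that setup using standard MindSpore calls (the exact context flags used by these scripts may differ):

```python
# a minimal sketch of the per-process setup behind run_distribute_train.sh
from mindspore import context
from mindspore.communication import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # picks up the rank table / rank id exported by the launch script
context.set_auto_parallel_context(
    parallel_mode=context.ParallelMode.DATA_PARALLEL,
    gradients_mean=True,
    device_num=get_group_size(),
)
print("rank", get_rank(), "of", get_group_size())
```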

@@ -246,6 +264,9 @@ Please follow the instructions in the link [hccn_tools](https://gitee.com/mindsp

```shell
shell:
    Ascend: bash run_distribute_train.sh [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
    # example: bash run_distribute_train.sh cifar10 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/cifar10/cifar-10-batches-bin/
    # example: bash run_distribute_train.sh imagenet2012 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/

    CPU: bash run_train_CPU.sh [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
    GPU (single device): bash run_standalone_train_gpu.sh [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
    GPU (distributed training): bash run_distribute_train_gpu.sh [cifar10|imagenet2012] [CONFIG_PATH] [DATASET_PATH] [PRETRAINED_CKPT_PATH] (optional)
```

@@ -278,6 +299,9 @@ Epoch time: 320744.265, per step time: 256.390

You can start evaluation using Python or shell scripts (a sketch of the evaluation flow these scripts wrap follows the Launch block below). If the training method is train or fine-tune, do not pass `[CHECKPOINT_PATH]`. The usage of the shell scripts is as follows:

- Ascend: bash run_eval.sh [cifar10|imagenet2012] [DATASET_PATH] [CHECKPOINT_PATH]
  # example: bash run_eval.sh cifar10 /home/DataSet/cifar10/cifar-10-verify-bin/ /home/model/mobilenetv1/ckpt/cifar10/mobilenetv1-90_1562.ckpt
  # example: bash run_eval.sh imagenet2012 /home/DataSet/ImageNet_Original/ /home/model/mobilenetv1/ckpt/imagenet2012/mobilenetv1-90_625.ckpt
- CPU: bash run_eval_CPU.sh [cifar10|imagenet2012] [DATASET_PATH] [CHECKPOINT_PATH]

### Launch

@@ -291,6 +315,9 @@ You can start training using python or shell scripts.If the train method is trai

```shell
shell:
    Ascend: bash run_eval.sh [cifar10|imagenet2012] [DATASET_PATH] [CHECKPOINT_PATH]
    # example: bash run_eval.sh cifar10 /home/DataSet/cifar10/cifar-10-verify-bin/ /home/model/mobilenetv1/ckpt/cifar10/mobilenetv1-90_1562.ckpt
    # example: bash run_eval.sh imagenet2012 /home/DataSet/ImageNet_Original/ /home/model/mobilenetv1/ckpt/imagenet2012/mobilenetv1-90_625.ckpt

    CPU: bash run_eval_CPU.sh [cifar10|imagenet2012] [DATASET_PATH] [CHECKPOINT_PATH]
```
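
As referenced above, these evaluation scripts restore the checkpoint into the network and run MindSpore's standard metric loop. A hedged sketch of that flow (the network and dataset constructors are placeholders for the ones defined in this repository):

```python
# a minimal sketch of the evaluation flow run_eval.sh wraps
import mindspore.nn as nn
from mindspore import Model, load_checkpoint, load_param_into_net

def evaluate(net, eval_dataset, ckpt_path):
    # restore trained weights into the freshly built network
    load_param_into_net(net, load_checkpoint(ckpt_path))
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    model = Model(net, loss_fn=loss, metrics={"acc"})
    return model.eval(eval_dataset)  # e.g. {'acc': 0.71...}
```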

@@ -22,7 +22,7 @@ pretrain_epoch_size: 0

```yaml
save_checkpoint: True
save_checkpoint_epochs: 5
keep_checkpoint_max: 10
save_checkpoint_path: "./"
warmup_epochs: 5
lr_decay_mode: "poly"
lr_init: 0.01
```
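
These fields map onto MindSpore's checkpoint callbacks. A rough sketch of how they are typically consumed (the `steps_per_epoch` value is a placeholder derived from dataset size and batch size):

```python
# a minimal sketch: turning the YAML fields above into checkpoint callbacks
from mindspore.train.callback import CheckpointConfig, ModelCheckpoint

steps_per_epoch = 625  # placeholder
ckpt_config = CheckpointConfig(
    save_checkpoint_steps=5 * steps_per_epoch,  # save_checkpoint_epochs: 5
    keep_checkpoint_max=10,                     # keep_checkpoint_max: 10
)
# save_checkpoint_path: "./" becomes the callback's target directory
ckpt_cb = ModelCheckpoint(prefix="mobilenet", directory="./", config=ckpt_config)
# pass [ckpt_cb] to model.train(...) when save_checkpoint is True
```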

@@ -225,7 +225,7 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil

You can start training using Python or shell scripts. The usage of the shell scripts is as follows:

- Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH(optional)] [FREEZE_LAYER(optional)] [FILTER_HEAD(optional)]
- GPU: bash run_train.sh GPU [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]
- CPU: bash run_train.sh CPU [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]

@@ -273,29 +273,35 @@ You can start training using python or shell scripts. The usage of shell scripts

```shell
CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH]

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH]
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH]

# finetune whole network example
python:
    Ascend: python train.py --platform Ascend --config_path [CONFIG_PATH] --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer none --filter_head True
    GPU: python train.py --platform GPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer none --filter_head True
    CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer none --filter_head True

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/ /home/model/mobilenetv2/predtrain/mobilenet-200_625.ckpt none True

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH] [CKPT_PATH] none True
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH] [CKPT_PATH] none True

# finetune fully connected layers example
python:
    Ascend: python train.py --platform Ascend --config_path default_config.yaml --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer backbone
    GPU: python train.py --platform GPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer backbone
    CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer backbone

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/ /home/model/mobilenetv2/backbone/mobilenet-200_625.ckpt backbone

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH] [CKPT_PATH] backbone
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH] [CKPT_PATH] backbone
```
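
The `--filter_head` and `--freeze_layer` options above control how the pretrained checkpoint is reused. A hedged sketch of the idea (the helper below is illustrative, not the repository's actual function):

```python
# a minimal sketch: drop the classifier head and optionally freeze the backbone
from mindspore import load_checkpoint, load_param_into_net

def load_pretrained(net, ckpt_path, filter_head=True, freeze_backbone=False):
    param_dict = load_checkpoint(ckpt_path)
    if filter_head:
        # filter_head True: skip head weights so a new classifier can be trained
        param_dict = {k: v for k, v in param_dict.items() if "head" not in k}
    load_param_into_net(net, param_dict)
    if freeze_backbone:
        # freeze_layer backbone: stop gradient updates for non-head parameters
        for p in net.trainable_params():
            if "head" not in p.name:
                p.requires_grad = False
    return net
```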

@@ -304,7 +310,7 @@ You can start training using python or shell scripts. The usage of shell scripts

Training results will be stored in the example path. Checkpoints are stored at `./checkpoint` by default. On CPU and GPU, the training log is redirected to `./train.log` as shown below; on Ascend, it is written to `./train/rank*/log*.log`.

```log
epoch: [ 0/200], step:[ 624/ 625], loss:[5.258/5.258], time:[140412.236], lr:[0.100]
epoch time: 140522.500, per step time: 224.836, avg loss: 5.258
epoch: [ 1/200], step:[ 624/ 625], loss:[3.917/3.917], time:[138221.250], lr:[0.200]
```

@@ -331,7 +337,9 @@ You can start training using python or shell scripts.If the train method is trai

```shell
CPU: python eval.py --platform CPU --dataset_path [VAL_DATASET_PATH] --pretrain_ckpt ./ckpt_0/mobilenetv2_15.ckpt

shell:
    Ascend: bash run_eval.sh Ascend [DATASET_PATH] [CHECKPOINT_PATH]
    # example: bash run_eval.sh Ascend /home/DataSet/ImageNet_Original/ /home/model/mobilenetV2/ckpt/mobilenet-200_625.ckpt

    GPU: bash run_eval.sh GPU [VAL_DATASET_PATH] ./checkpoint/mobilenetv2_head_15.ckpt
    CPU: bash run_eval.sh CPU [VAL_DATASET_PATH] ./checkpoint/mobilenetv2_head_15.ckpt
```

@@ -342,7 +350,7 @@ You can start training using python or shell scripts.If the train method is trai

Inference results will be stored in the example path; you can find results like the following in `eval.log`.

```log
result: {'acc': 0.71976314102564111} ckpt=./ckpt_0/mobilenet-200_625.ckpt
```

@@ -227,7 +227,7 @@ The overall MobileNetV2 architecture is as follows:

You can start training using Python or shell scripts. The usage of the shell scripts is as follows:

- Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH(optional)] [FREEZE_LAYER(optional)] [FILTER_HEAD(optional)]
- GPU: bash run_train.sh GPU [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]
- CPU: bash run_train.sh CPU [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]

@@ -275,7 +275,9 @@ The overall MobileNetV2 architecture is as follows:

```shell
CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH]

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH]
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH]
```

@@ -286,7 +288,9 @@ The overall MobileNetV2 architecture is as follows:

```shell
CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer none --filter_head True

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER] [FILTER_HEAD]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/ /home/model/mobilenetv2/predtrain/mobilenet-200_625.ckpt none True

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH] [CKPT_PATH] none True
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH] [CKPT_PATH] none True
```

@@ -297,7 +301,8 @@ The overall MobileNetV2 architecture is as follows:

```shell
CPU: python train.py --platform CPU --dataset_path [TRAIN_DATASET_PATH] --pretrain_ckpt [CKPT_PATH] --freeze_layer backbone

shell:
    Ascend: bash run_train.sh Ascend [CONFIG_PATH] [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [RANK_TABLE_FILE] [DATASET_PATH] [CKPT_PATH] [FREEZE_LAYER]
    # example: bash run_train.sh Ascend default_config.yaml 8 0,1,2,3,4,5,6,7 /root/hccl_8p_01234567_10.155.170.71.json /home/DataSet/ImageNet_Original/ /home/model/mobilenetv2/backbone/mobilenet-200_625.ckpt backbone

    GPU: bash run_train.sh GPU 8 0,1,2,3,4,5,6,7 [TRAIN_DATASET_PATH] [CKPT_PATH] backbone
    CPU: bash run_train.sh CPU [TRAIN_DATASET_PATH] [CKPT_PATH] backbone
```

@@ -306,7 +311,7 @@ The overall MobileNetV2 architecture is as follows:

Training results are stored in the example path. Checkpoints are saved to `./checkpoint` by default; on CPU and GPU, the training log is redirected to `./train.log`, and on Ascend, it is written to `./train/rank*/log*.log`.

```log
epoch:[ 0/200], step:[ 624/ 625], loss:[5.258/5.258], time:[140412.236], lr:[0.100]
epoch time:140522.500, per step time:224.836, avg loss:5.258
epoch:[ 1/200], step:[ 624/ 625], loss:[3.917/3.917], time:[138221.250], lr:[0.200]
```

@@ -317,7 +322,7 @@ epoch time:138331.250, per step time:221.330, avg loss:3.917

### Usage

You can start training using Python or shell scripts. When the training method is train or finetune, passing `[CHECKPOINT_PATH]` is not recommended. The usage of the shell scripts is as follows:

- Ascend: bash run_eval.sh Ascend [DATASET_PATH] [CHECKPOINT_PATH]
- GPU: bash run_eval.sh GPU [DATASET_PATH] [CHECKPOINT_PATH]

@@ -333,7 +338,9 @@ epoch time:138331.250, per step time:221.330, avg loss:3.917

```shell
CPU: python eval.py --platform CPU --dataset_path [VAL_DATASET_PATH] --pretrain_ckpt ./ckpt_0/mobilenetv2_15.ckpt

shell:
    Ascend: bash run_eval.sh Ascend [DATASET_PATH] [CHECKPOINT_PATH]
    # example: bash run_eval.sh Ascend /home/DataSet/ImageNet_Original/ /home/model/mobilenetV2/ckpt/mobilenet-200_625.ckpt

    GPU: bash run_eval.sh GPU [VAL_DATASET_PATH] ./checkpoint/mobilenetv2_head_15.ckpt
    CPU: bash run_eval.sh CPU [VAL_DATASET_PATH] ./checkpoint/mobilenetv2_head_15.ckpt
```

@@ -344,7 +351,7 @@ epoch time:138331.250, per step time:221.330, avg loss:3.917

Inference results are stored in the example path; you can find results like the following in `eval.log`.

```log
result:{'acc':0.71976314102564111} ckpt=./ckpt_0/mobilenet-200_625.ckpt
```