forked from mindspore-Ecosystem/mindspore
!10488 modify tinydarknet
From: @wukesong Reviewed-by: @liangchenghui,@oacjiewen Signed-off-by: @liangchenghui
This commit is contained in: commit 40f98b3464

@@ -0,0 +1,268 @@
# Contents

- [Contents](#contents)
- [Tiny-DarkNet Description](#tiny-darknet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [Tiny-DarkNet Description](#contents)

Tiny-DarkNet is a 16-layer image classification network proposed by Joseph Chet Redmon et al. for the classic ImageNet dataset. As a simplified version of DarkNet designed to minimize model size for users who need smaller models, Tiny-DarkNet offers better image classification capability than AlexNet and SqueezeNet while using fewer model parameters than either of them. To reduce the model scale, the Tiny-DarkNet network uses no fully connected layers and consists only of convolutional layers, max pooling layers, and an average pooling layer.

For more detailed information on Tiny-DarkNet, please refer to the [official introduction](https://pjreddie.com/darknet/tiny-darknet/).
# [Model Architecture](#contents)

Specifically, the Tiny-DarkNet network consists of 1×1 convolutions, 3×3 convolutions, 2×2 max pooling, and a global average pooling layer. These modules are stacked to transform the input image into a 1×1000 vector.
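Sketched below with mindspore.nn is the layer pattern just described. This is an illustrative sketch only, not the repository's src/tinydarknet.py; the channel sizes and the number of blocks are assumptions chosen for brevity.

```python
# Illustrative sketch of the Tiny-DarkNet layer pattern (1x1/3x3 convolutions,
# 2x2 max pooling, global average pooling); channel sizes are examples only.
import mindspore.nn as nn
import mindspore.ops as ops


def conv_block(in_channels, out_channels, kernel_size):
    """Basic unit: convolution followed by batch norm and LeakyReLU."""
    return nn.SequentialCell([
        nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='same'),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(),
    ])


class TinyDarkNetSketch(nn.Cell):
    """Not the actual src/tinydarknet.py network, just the structural idea."""

    def __init__(self, num_classes=1000):
        super(TinyDarkNetSketch, self).__init__()
        self.features = nn.SequentialCell([
            conv_block(3, 16, 3),
            nn.MaxPool2d(kernel_size=2, stride=2),
            conv_block(16, 32, 3),
            nn.MaxPool2d(kernel_size=2, stride=2),
            conv_block(32, 16, 1),              # 1x1 bottleneck
            conv_block(16, 128, 3),
            conv_block(128, num_classes, 1),    # 1x1 projection to 1000 channels
        ])
        self.mean = ops.ReduceMean(keep_dims=False)

    def construct(self, x):
        x = self.features(x)
        # global average pooling over H and W yields a (batch, 1000) output
        return self.mean(x, (2, 3))
```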
# [Dataset](#contents)

In the following sections, we will introduce how to run the scripts using the related dataset below:

<!-- Note that you can run the scripts based on the dataset mentioned in original paper or widely used in relevant domain/network architecture. In the following sections, we will introduce how to run the scripts using the related dataset below. -->

<!-- Dataset used: [CIFAR-10](<http://www.cs.toronto.edu/~kriz/cifar.html>) -->

<!-- Dataset used ImageNet can refer to [paper](<https://ieeexplore.ieee.org/abstract/document/5206848>)

- Dataset size: 125G, 1250k colorful images in 1000 classes
    - Train: 120G, 1200k images
    - Test: 5G, 50k images
- Data format: RGB images.
- Note: Data will be processed in src/dataset.py -->

The dataset used can be found in the [paper](<https://ieeexplore.ieee.org/abstract/document/5206848>).

- Dataset size: 125G, 1250k colorful images in 1000 classes
    - Train: 120G, 1200k images
    - Test: 5G, 50k images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py (an illustrative loading sketch follows this list)
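As a rough illustration of how such an ImageNet-style folder tree (one sub-folder per class) is typically read with MindSpore; the actual pipeline, including augmentation, lives in src/dataset.py, and the function name and transform list below are assumptions:

```python
# Hedged sketch only -- see src/dataset.py for the real pipeline.
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as CV


def create_dataset_sketch(data_path, batch_size=128, training=True):
    """Read an ImageNet-style directory (one sub-folder per class) and batch it."""
    dataset = ds.ImageFolderDataset(data_path, shuffle=training)
    transforms = [
        CV.Decode(),
        CV.Resize((224, 224)),   # matches image_height / image_width in config.py
        CV.HWC2CHW(),
    ]
    dataset = dataset.map(operations=transforms, input_columns="image")
    return dataset.batch(batch_size, drop_remainder=training)
```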
# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare a hardware environment with Ascend or GPU processors. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- running on Ascend:

```bash
# run the training example
bash ./scripts/run_train_single.sh

# run the distributed training example
bash ./scripts/run_train.sh rank_table.json

# run the evaluation example
python eval.py > eval.log 2>&1 &
OR
bash ./scripts/run_eval.sh
```

For distributed training, an hccl configuration file in JSON format needs to be created in advance.

Please follow the instructions in the link below:

<https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>

For more details, please refer to the specific script.
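For reference, generating the rank table with the hccl_tools helper linked above typically looks like the following; the flag name and output file name here are assumptions, so check the script's own help before relying on them.

```bash
# Assumed invocation of model_zoo/utils/hccl_tools/hccl_tools.py (verify with --help):
python hccl_tools.py --device_num "[0,8)"
# pass the generated rank table JSON to the distributed training script:
bash ./scripts/run_train.sh ./hccl_8p.json
```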
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```bash
├── Tiny-DarkNet
    ├── README.md                  // descriptions about Tiny-DarkNet
    ├── scripts
    │   ├── run_train_single.sh    // shell script for single-device training on Ascend
    │   ├── run_train.sh           // shell script for distributed training on Ascend
    │   ├── run_eval.sh            // shell script for evaluation on Ascend
    ├── src
    │   ├── dataset.py             // creating dataset
    │   ├── tinydarknet.py         // Tiny-DarkNet architecture
    │   ├── config.py              // parameter configuration
    ├── train.py                   // training script
    ├── eval.py                    // evaluation script
    ├── export.py                  // export checkpoint file into air/onnx
```
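export.py in the tree above converts a trained checkpoint into AIR/ONNX. A hedged sketch of what such a script typically does is shown below; the class name TinyDarkNet imported from src/tinydarknet.py is an assumption, and the exact arguments in this repository may differ.

```python
# Illustrative export flow only -- see export.py for the actual implementation.
import numpy as np
from mindspore import Tensor, context
from mindspore.train.serialization import export, load_checkpoint, load_param_into_net
from src.tinydarknet import TinyDarkNet  # assumed class name

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
net = TinyDarkNet(num_classes=1000)
load_param_into_net(net, load_checkpoint("/train_tinydarknet.ckpt"))
# a dummy input fixes the exported graph's input shape (NCHW, 224x224)
dummy_input = Tensor(np.zeros([1, 3, 224, 224], dtype=np.float32))
export(net, dummy_input, file_name="tinydarknet", file_format="AIR")
```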
## [Script Parameters](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for Tiny-DarkNet

```python
'pre_trained': 'False'                  # whether to train based on a pre-trained model
'num_classes': 1000                     # the number of classes in the dataset
'lr_init': 0.1                          # initial learning rate
'batch_size': 128                       # training batch size
'epoch_size': 500                       # total number of training epochs
'momentum': 0.9                         # momentum
'weight_decay': 1e-4                    # weight decay value
'image_height': 224                     # image height used as input to the model
'image_width': 224                      # image width used as input to the model
'data_path': './ImageNet_Original/train/'      # absolute full path to the training dataset
'val_data_path': './ImageNet_Original/val/'    # absolute full path to the evaluation dataset
'device_target': 'Ascend'               # device running the program
'device_id': 0                          # device ID used to train or evaluate the dataset; ignore it when you use run_train.sh for distributed training
'keep_checkpoint_max': 10               # keep only the last keep_checkpoint_max checkpoints
'checkpoint_path': '/train_tinydarknet.ckpt'   # absolute full path for saving the checkpoint file
'onnx_filename': 'tinydarknet.onnx'     # file name of the ONNX model used in export.py
'air_filename': 'tinydarknet.air'       # file name of the AIR model used in export.py
'lr_scheduler': 'exponential'           # learning rate scheduler
'lr_epochs': [70, 140, 210, 280]        # epochs at which the learning rate changes
'lr_gamma': 0.3                         # lr decay factor for the exponential lr_scheduler
'eta_min': 0.0                          # eta_min in the cosine_annealing scheduler
'T_max': 150                            # T_max in the cosine_annealing scheduler
'warmup_epochs': 0                      # number of warmup epochs
'is_dynamic_loss_scale': 0              # whether to use dynamic loss scale
'loss_scale': 1024                      # loss scale
'label_smooth_factor': 0.1              # label smoothing factor
'use_label_smooth': True                # whether to use label smoothing
```

For more configuration details, please refer to the script config.py.
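These values are collected in an EasyDict inside src/config.py (as visible in the config.py diff at the bottom of this commit) and are read by attribute in train.py and eval.py; a minimal sketch:

```python
# Minimal sketch of the EasyDict-style configuration used by the scripts.
from easydict import EasyDict as edict

imagenet_cfg = edict({
    'num_classes': 1000,
    'lr_init': 0.1,
    'batch_size': 128,
    'epoch_size': 500,
})

print(imagenet_cfg.batch_size)  # attribute-style access -> 128
```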
## [Training Process](#contents)

### [Training](#contents)

- running on Ascend:

```bash
sh scripts/run_train_single.sh
```

The command above will run in the background; you can view the results through the file train.log.

After training, you will get some checkpoint files under the script folder by default. The loss values will be reported as follows:
<!-- After training, you'll get some checkpoint files under the script folder by default. The loss value will be achieved as follows: -->

```text
# grep "loss is " train.log
epoch: 498 step: 1251, loss is 2.7798953
Epoch time: 130690.544, per step time: 104.469
epoch: 499 step: 1251, loss is 2.9261637
Epoch time: 130511.081, per step time: 104.325
epoch: 500 step: 1251, loss is 2.69412
Epoch time: 127067.548, per step time: 101.573
...
```

The model checkpoint file will be saved in the current folder.
<!-- The model checkpoint will be saved in the current directory. -->
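For orientation, the training setup implied by the parameters above (Momentum optimizer, softmax cross-entropy loss) roughly corresponds to the following sketch; the helper shown here is illustrative and train.py is the authoritative implementation.

```python
# Hedged sketch of the Model/optimizer wiring implied by config.py; not a copy of train.py.
from mindspore import nn
from mindspore.train.model import Model


def build_model(net):
    """Wrap a Tiny-DarkNet Cell with the loss and optimizer suggested by config.py."""
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
    opt = nn.Momentum(net.trainable_params(), learning_rate=0.1,
                      momentum=0.9, weight_decay=1e-4)
    return Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})

# usage (net and train_dataset come from src/tinydarknet.py and src/dataset.py):
# model = build_model(net)
# model.train(500, train_dataset)
```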
### [Distributed Training](#contents)

- running on Ascend:

```bash
sh scripts/run_train.sh rank_table.json
```

The above shell script will run distributed training in the background. You can view the results through the file train_parallel[X]/log. The loss values will be reported as follows:

```text
# grep "result: " train_parallel*/log
epoch: 498 step: 1251, loss is 2.7798953
Epoch time: 130690.544, per step time: 104.469
epoch: 499 step: 1251, loss is 2.9261637
Epoch time: 130511.081, per step time: 104.325
epoch: 500 step: 1251, loss is 2.69412
Epoch time: 127067.548, per step time: 101.573
...
```
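The distributed launch works by exporting DEVICE_NUM and RANK_ID per process in scripts/run_train.sh and then enabling data-parallel mode in train.py (see the train.py diff at the bottom of this commit). A condensed sketch of that setup, with the parallel-mode arguments assumed from common MindSpore practice rather than copied from this repository:

```python
# Condensed sketch of the data-parallel context setup; argument values are illustrative.
import os
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init

device_num = int(os.environ.get("DEVICE_NUM", 1))   # exported by scripts/run_train.sh
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
if device_num > 1:
    context.reset_auto_parallel_context()
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
    init()  # initialize HCCL communication across the devices
```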
## [Evaluation Process](#contents)

### [Evaluation](#contents)

- evaluation on the ImageNet dataset when running on Ascend:

Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute full path, e.g., "/username/tinydarknet/train_tinydarknet.ckpt".

```bash
python eval.py > eval.log 2>&1 &
OR
sh scripts/run_eval.sh
```

The above python command will run in the background. You can view the results through the file "eval.log". The accuracy on the test dataset will be reported as follows:

```text
# grep "accuracy: " eval.log
accuracy: {'top_1_accuracy': 0.5871979166666667, 'top_5_accuracy': 0.8175280448717949}
```

Note that for evaluation after distributed training, please set checkpoint_path to the last saved checkpoint file. The accuracy on the test dataset will be reported as follows:

```text
# grep "accuracy: " eval.log
accuracy: {'top_1_accuracy': 0.5871979166666667, 'top_5_accuracy': 0.8175280448717949}
```
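For reference, restoring the checkpoint configured above inside eval.py follows the load_checkpoint / load_param_into_net pattern that is also visible in the eval.py diff at the bottom of this commit; a minimal sketch:

```python
# Minimal sketch of checkpoint restoration before evaluation.
from mindspore.train.serialization import load_checkpoint, load_param_into_net


def restore_checkpoint(net, checkpoint_path):
    """Load trained weights into a Tiny-DarkNet Cell before running eval."""
    param_dict = load_checkpoint(checkpoint_path)
    load_param_into_net(net, param_dict)
    return net

# usage: restore_checkpoint(net, "/username/tinydarknet/train_tinydarknet.ckpt")
```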
# [Model Description](#contents)

## [Performance](#contents)

### [Training Performance](#contents)

| Parameters          | Ascend                                                       |
| ------------------- | ------------------------------------------------------------ |
| Model Version       | V1                                                           |
| Resource            | Ascend 910, CPU 2.60GHz, 56 cores, Memory 314G               |
| Uploaded Date       | 2020/12/22                                                   |
| MindSpore Version   | 1.1.0                                                        |
| Dataset             | 1200k images                                                 |
| Training Parameters | epoch=500, steps=1251, batch_size=128, lr=0.1                |
| Optimizer           | Momentum                                                     |
| Loss Function       | Softmax Cross Entropy                                        |
| Speed               | 8 pc: 104 ms/step                                            |
| Total Time          | 8 pc: 17.8 hours                                             |
| Parameters (M)      | 4.0                                                          |
| Scripts             | [Tiny-DarkNet Scripts](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/tinydarknet) |
### [Inference Performance](#contents)

| Parameters          | Ascend                      |
| ------------------- | --------------------------- |
| Model Version       | V1                          |
| Resource            | Ascend 910                  |
| Uploaded Date       | 2020/12/22                  |
| MindSpore Version   | 1.1.0                       |
| Dataset             | 200k images                 |
| batch_size          | 128                         |
| Outputs             | probability                 |
| Accuracy            | 8 pc Top-5: 81.7%           |
| Model for inference | 11.6M (.ckpt file)          |

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -78,16 +78,16 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...
 - running on Ascend:

 ```python
-# standalone training
-python train.py > train.log 2>&1 &
+# single-device training
+bash ./scripts/run_train_single.sh

 # distributed training
-bash scripts/run_train.sh rank_table.json
+bash ./scripts/run_train.sh rank_table.json

 # evaluation
 python eval.py > eval.log 2>&1 &
 OR
-bash run_eval.sh
+bash ./script/run_eval.sh
 ```

 For distributed training, an hccl configuration file in JSON format needs to be created in advance.
@@ -105,9 +105,10 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...
 ```bash

 ├── Tiny-DarkNet
     ├── README.md                  // descriptions about Tiny-DarkNet
     ├── scripts
-        │ ├──run_train.sh          // shell script for distributed training on Ascend
+        │ ├──run_train_single.sh   // shell script for single-device training on Ascend
+        │ ├──run_train.sh          // shell script for distributed training on Ascend
         │ ├──run_eval.sh           // shell script for evaluation on Ascend
     ├── src
         │ ├──dataset.py            // creating dataset
@@ -140,7 +141,7 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...
 'device_target': 'Ascend'                    # device running the program
 'device_id': 0                               # device ID used for training and evaluation
 'keep_checkpoint_max': 10                    # keep only the latest keep_checkpoint_max checkpoint files
-'checkpoint_path': './train_tinydarknet_imagenet-125_390.ckpt'  # absolute path for saving the checkpoint file
+'checkpoint_path': '/train_tinydarknet.ckpt' # absolute path for saving the checkpoint file
 'onnx_filename': 'tinydarknet.onnx'          # file name of the ONNX model used in export.py
 'air_filename': 'tinydarknet.air'            # file name of the AIR model used in export.py
 'lr_scheduler': 'exponential'                # learning rate scheduler
@@ -164,10 +165,10 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...
 - running on Ascend:

 ```python
-python train.py > train.log 2>&1 &
+sh scripts/run_train_single.sh
 ```

-The above python command will run in the background; the results can be viewed in the train.log file.
+The above command will run in the background; the results can be viewed in the train.log file.

 After training, you will get some checkpoint files under the script folder by default. The loss values will be displayed as follows:
 <!-- After training, you'll get some checkpoint files under the script folder by default. The loss value will be achieved as follows: -->
@@ -213,7 +214,7 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...

 - evaluation on Ascend:

-Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g. "username/imagenet/train_tiny-darknet_imagenet-125_390.ckpt".
+Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g. "/username/imagenet/train_tinydarknet.ckpt"

 ```python
 python eval.py > eval.log 2>&1 &
@@ -254,7 +255,7 @@ Tiny-DarkNet is a 16-layer network proposed by Joseph Chet Redmon et al. for the classic ...
 | Speed          | 8 pc: 104 ms/step |
 | Total Time     | 8 pc: 17.8 hours |
 | Parameters (M) | 4.0 |
-| Scripts        | [Tiny-DarkNet Scripts](https://gitee.com/mindspore/mindspore/tree/r0.7/model_zoo/official/cv/googlenet) |
+| Scripts        | [Tiny-DarkNet Scripts](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/tinydarknet) |

 ### [Evaluation Performance](#目录)
@@ -56,8 +56,6 @@ if __name__ == '__main__':

     device_target = cfg.device_target
     context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target)
-    if device_target == "Ascend":
-        context.set_context(device_id=cfg.device_id)

     if args_opt.checkpoint_path is not None:
         param_dict = load_checkpoint(args_opt.checkpoint_path)
@@ -14,4 +14,6 @@
 # limitations under the License.
 # ============================================================================

-python ../eval.py > ./eval.log 2>&1 &
+rm -rf ./eval
+mkdir ./eval
+python ./eval.py > ./eval/eval.log 2>&1 &
@@ -29,7 +29,7 @@ exit 1
 fi


-dataset_type='cifar10'
+dataset_type='imagenet'
 if [ $# == 2 ]
 then
     if [ $2 != "cifar10" ] && [ $2 != "imagenet" ]
@@ -56,8 +56,8 @@ do
     export RANK_ID=$((rank_start + i))
     rm -rf ./train_parallel$i
     mkdir ./train_parallel$i
-    cp -r ../src ./train_parallel$i
-    cp ../train.py ./train_parallel$i
+    cp -r ./src ./train_parallel$i
+    cp ./train.py ./train_parallel$i
     echo "start training for rank $RANK_ID, device $DEVICE_ID, $dataset_type"
     cd ./train_parallel$i || exit
     env > env.log
@@ -0,0 +1,22 @@
+#!/usr/bin/env bash
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+rm -rf ./train_single
+mkdir ./train_single
+cp -r ./src ./train_single
+cp ./train.py ./train_single
+cd ./train_single
+python ./train.py > ./train.log 2>&1 &
@@ -31,8 +31,6 @@ imagenet_cfg = edict({
     'data_path': './dataset/imagenet_original/train/',
     'val_data_path': './dataset/imagenet_original/val/',
     'device_target': 'Ascend',
     'device_id': 0,
     'device_num': 8,
     'keep_checkpoint_max': 1,
     'checkpoint_path': './scripts/train_parallel4/ckpt_4/train_tinydarknet_imagenet-300_1251.ckpt',
     'onnx_filename': 'tinydarknet.onnx',
@@ -16,6 +16,7 @@
 #################train tinydarknet example on cifar10########################
+python train.py
 """
 import os
 import argparse

 from mindspore import Tensor
@@ -78,14 +79,11 @@ if __name__ == '__main__':
     device_target = cfg.device_target

     context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target)
-    device_num = cfg.device_num
+    device_num = int(os.environ.get("DEVICE_NUM", 1))

     rank = 0
     if device_target == "Ascend":
-        if args_opt.device_id is not None:
-            context.set_context(device_id=args_opt.device_id)
-        else:
-            context.set_context(device_id=cfg.device_id)
+        context.set_context(device_id=args_opt.device_id)

         if device_num > 1:
             context.reset_auto_parallel_context()