update vit

This commit is contained in:
gengdongjie 2021-08-26 19:43:31 +08:00
parent 8c6d4a05fc
commit 33e6608ba7
40 changed files with 4662 additions and 0 deletions

View File

@ -0,0 +1,526 @@
# Contents
[View Chinese version](./README_CN.md)
- [Vit Description](#vit-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
- [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training Process](#training-process)
- [Training](#training)
- [Distributed Training](#distributed-training)
- [Evaluation Process](#evaluation-process)
- [Evaluation](#evaluation)
- [Export Process](#export-process)
- [Export](#export)
- [Inference Process](#inference-process)
- [Inference](#inference)
- [Model Description](#model-description)
- [Performance](#performance)
- [Evaluation Performance](#evaluation-performance)
- [Inference Performance](#inference-performance)
- [How to use](#how-to-use)
- [Inference](#inference)
- [Continue Training on the Pretrained Model](#continue-training-on-the-pretrained-model)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [Vit Description](#contents)
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
[Paper](https://arxiv.org/abs/2010.11929): Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 2021.
# [Model Architecture](#contents)
Specifically, ViT consists of a transformer encoder. The structure is patch_embedding + n transformer layers + head (an FC layer for classification).
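As a rough illustration of this data flow, the following minimal numpy sketch walks one image through patch embedding to the classification head (shapes assume the vit_base_patch32 defaults of a 224x224 input, 32x32 patches, hidden size 768 and 1001 classes; this is only a sketch, not the repo's implementation in src/vit.py):

```python
import numpy as np

# One 224x224 RGB image split into 32x32 patches: (224/32)^2 = 49 patches.
img = np.random.rand(224, 224, 3).astype(np.float32)
patches = img.reshape(7, 32, 7, 32, 3).transpose(0, 2, 1, 3, 4).reshape(49, 32 * 32 * 3)

# Patch embedding: a linear projection of each flattened patch to the hidden size (768).
w_embed = np.random.rand(32 * 32 * 3, 768).astype(np.float32)
tokens = patches @ w_embed                       # (49, 768)

# Prepend a [CLS] token and add position embeddings -> encoder input of shape (50, 768).
cls_token = np.zeros((1, 768), np.float32)
pos_embed = np.zeros((50, 768), np.float32)
x = np.concatenate([cls_token, tokens]) + pos_embed

# ... n transformer encoder layers operate on x here ...
# The head is an FC layer applied to the [CLS] token.
w_head = np.random.rand(768, 1001).astype(np.float32)
logits = x[0] @ w_head                           # (1001,) class scores
print(logits.shape)
```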
# [Dataset](#contents)
Dataset used: [ImageNet2012](http://www.image-net.org/)
- Dataset size: 224*224 color images in 1,000 classes
    - Train: 1,281,167 images
    - Test: 50,000 images
- Data format: JPEG
- Note: Data will be processed in dataset.py
- Download the dataset, the directory structure is as follows:
```bash
└─dataset
    ├─train  # train dataset; should be packed as a .tar file when running on the cloud
    └─val    # evaluation dataset
```
- Data format: RGB images.
- Note: Data will be processed in src/dataset.py; a minimal loading sketch is shown below.
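The following is a minimal sketch of how such a directory can be read with MindSpore's dataset API; the real pipeline (interpolation choice, autoaugment, mixup, etc.) lives in src/dataset.py, and the transform parameters below are illustrative assumptions:

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C

def create_dataset_sketch(data_dir, image_size=224, batch_size=256, training=True):
    # Each class sits in its own sub-folder of train/ or val/, as in the tree above.
    data_set = ds.ImageFolderDataset(data_dir, num_parallel_workers=8, shuffle=training)
    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
    if training:
        trans = [C.RandomCropDecodeResize(image_size, scale=(0.05, 1.0)),  # crop_min=0.05 in the config
                 C.RandomHorizontalFlip(),
                 C.Normalize(mean=mean, std=std),
                 C.HWC2CHW()]
    else:
        trans = [C.Decode(), C.Resize(256), C.CenterCrop(image_size),
                 C.Normalize(mean=mean, std=std), C.HWC2CHW()]
    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    return data_set.batch(batch_size, drop_remainder=training)
```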
# [Features](#contents)
## Mixed Precision
The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.
For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for "reduce precision".
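For reference, eval.py in this repository builds its Model with amp_level="O3". Below is a minimal, hedged sketch of enabling mixed precision at the Model level; the toy network and hyper-parameter values are placeholders, not the repo's real ViT:

```python
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

# A toy network stands in for the real ViT (built in src/vit.py) just to show the API surface.
net = nn.Dense(3072, 1001)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.AdamWeightDecay(net.trainable_params(), learning_rate=0.00355, weight_decay=0.05)

# amp_level="O3" runs the network in float16 (see the MindSpore amp documentation for the exact policy);
# the fixed loss scale mirrors the loss_scale: 1024 entry in the config shown later in this README.
loss_scale_manager = FixedLossScaleManager(1024, drop_overflow_update=False)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'},
              amp_level="O3", loss_scale_manager=loss_scale_manager)
```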
# [Environment Requirements](#contents)
- Hardware: Ascend/GPU/CPU
- Prepare hardware environment with Ascend/GPU/CPU processor.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below
- [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)
After installing MindSpore via the official website, you can start training and evaluation as follows:
- running on Ascend
```python
# run training example; CONFIG_PATH is one of the files under ./config/ (*.yml or *.yaml)
python train.py --config_path=[CONFIG_PATH] > train.log 2>&1 &
# run distributed training example
cd scripts;
bash run_train_distribute.sh [RANK_TABLE_FILE] [CONFIG_PATH]
# run evaluation example
cd scripts;
bash run_eval.sh [RANK_TABLE_FILE] [CONFIG_PATH]
# run inference example
cd scripts;
bash run_infer_310.sh [MINDIR_PATH] [NET_TYPE] [DATASET] [DATA_PATH] [DEVICE_ID]
```
For distributed training, an HCCL configuration file (RANK_TABLE_FILE) in JSON format needs to be created in advance.
Please follow the instructions in the link below:
<https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>.
- ModelArts (if you want to run on ModelArts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/); you can then start training as follows)
- Train imagenet 8p on ModelArts
```python
# (1) Add "config_path='/path_to_code/config/vit_patch32_imagenet2012_config_cloud.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=1" on yml file.
# Set "output_path" on yml file.
# Set "data_path='/cache/data/ImageNet/'" on yml file.
# Set other parameters on yml file you need.
# b. Add "enable_modelarts=1" on the website UI interface.
# Set "output_path" on yml file.
# Set "data_path='/cache/data/ImageNet/'" on yml file.
# Add other parameters on the website UI interface.
# (3) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it may be slow.)
# (4) Set the code directory to "/path/vit" on the website UI interface.
# (5) Set the startup file to "train.py" on the website UI interface.
# (6) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
# (7) Create your job.
```
- Eval imagenet on ModelArts
```python
# (1) Add "config_path='/path_to_code/config/vit_eval.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=1" on yml file.
# Set "output_path" on yml file.
# Set "data_path='/cache/data/ImageNet/'" on yml file.
# Set "checkpoint_url='s3://dir_to_trained_ckpt/'" on yml file.
# Set "load_path='/cache/checkpoint_path/model.ckpt'" on yml file.
# Set other parameters on yml file you need.
# b. Add "enable_modelarts=1" on the website UI interface.
# Add "dataset_name=imagenet" on the website UI interface.
# Add "val_data_path=/cache/data/ImageNet/val/" on the website UI interface.
# Add "checkpoint_url='s3://dir_to_trained_ckpt/'" on the website UI interface.
# Add "load_path='/cache/checkpoint_path/model.ckpt'" on the website UI interface.
# Add other parameters on the website UI interface.
# (3) Upload or copy your pretrained model to S3 bucket.
# (4) Upload a zip dataset to the S3 bucket. (You could also upload the original dataset, but it may be slow.)
# (5) Set the code directory to "/path/vit" on the website UI interface.
# (6) Set the startup file to "eval.py" on the website UI interface.
# (7) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
# (8) Create your job.
```
- Export on ModelArts
```python
# (1) Add "config_path='/path_to_code/config/vit_export.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=1" on yml file.
# Set "checkpoint_url='s3://dir_to_trained_ckpt/'" on yml file.
# Set "load_path='/cache/checkpoint_path/model.ckpt'" on yml file.
# Set other parameters on yml file you need.
# b. Add "enable_modelarts=1" on the website UI interface.
# Add "checkpoint_url=s3://dir_to_trained_ckpt/" on the website UI interface.
# Add "load_path=/cache/checkpoint_path/model.ckpt" on the website UI interface.
# Add other parameters on the website UI interface.
# (3) Upload or copy your trained model to S3 bucket.
# (4) Set the code directory to "/path/vit" on the website UI interface.
# (5) Set the startup file to "export.py" on the website UI interface.
# (6) Set the "Dataset path" and "Output file path" and "Job log path" to your path on the website UI interface.
# (7) Create your job.
```
# [Script Description](#contents)
## [Script and Sample Code](#contents)
```text
├── model_zoo
├── README.md // descriptions about all the models
├── vit
├── README.md // descriptions about vit
├── ascend310_infer // application for 310 inference
├── scripts
│ ├──run_train_distribute.sh // shell script for distributed on Ascend
│ ├──run_train_standalone.sh // shell script for single node on Ascend
│ ├──run_eval.sh // shell script for evaluation on Ascend
│ ├──run_infer_310.sh // shell script for 310 inference
├── src
│ ├──autoaugment.py // autoaugment for data processing
│ ├──callback.py // logging callback
│ ├──cross_entropy.py // ce loss
│ ├──dataset.py // creating dataset
│ ├──eval_engine.py // eval code
│ ├──logging.py // logging engine
│ ├──lr_generator.py // lr schedule
│ ├──metric.py // metric for eval
│ ├──optimizer.py // user defined optimizer
│ ├──vit.py // model architecture
│ ├──model_utils // cloud-related helper files shared by all model zoo models; users are not recommended to change them
├── config
│ ├──vit_eval.yml // parameter configuration for eval
│ ├──vit_export.yml // parameter configuration for export
│ ├──vit_patch32_imagenet2012_config.yml // parameter configuration for 8P training
│ ├──vit_patch32_imagenet2012_config_cloud.yml // parameter configuration for 8P training on cloud
│ ├──vit_patch32_imagenet2012_config_standalone.yml // parameter configuration for 1P training
├── train.py // training script
├── eval.py // evaluation script
├── postprogress.py // post process for 310 inference
├── export.py // export checkpoint files into air/mindir
├── create_imagenet2012_label.py // create label for 310 inference
├── requirements.txt // requirements pip list
├── mindspore_hub_conf.py // hub configuration file required by MindSpore Hub
```
## [Script Parameters](#contents)
Parameters for both training and evaluation can be set in the .yml files under ./config/.
- config for vit, ImageNet dataset
```python
enable_modelarts: 1 # train on cloud or not
# Url for modelarts
data_url: "" # S3 dataset path
train_url: "" # S3 output path
checkpoint_url: "" # S3 pretrain model path
output_path: "/cache/train" # output cache, copy to train_url
data_path: "/cache/datasets/imagenet" # dataset cache(real path on cloud), copy from data_url
load_path: "/cache/model/vit_base_patch32.ckpt" # model cache, copy from checkpoint_url
# train datasets
dataset_path: '/cache/datasets/imagenet/train' # training dataset
train_image_size: 224 # image height and width used as input to the model
interpolation: 'BILINEAR' # dataset interpolation
crop_min: 0.05 # random crop min value
batch_size: 256 # batch size for train
train_num_workers: 14 # number of parallel workers
# eval datasets
eval_path: '/cache/datasets/imagenet/val' # eval dataset
eval_image_size: 224 # image height and width used as input to the model
eval_batch_size: 256 # batch size for eval
eval_interval: 1 # eval interval
eval_offset: -1 # eval offset
eval_num_workers: 12 # number of parallel workers
# network
backbone: 'vit_base_patch32' # backbone type
class_num: 1001 # class number, imagenet is 1000+1
vit_config_path: 'src.vit.VitConfig' # vit config path; advanced users can follow this class to design new transformer-based architectures
pretrained: '' # pre-trained model path, '' means not use pre-trained model
# lr
lr_decay_mode: 'cosine' # lr decay type; supports cosine, exp, etc., see lr_generator.py for details
lr_init: 0.0 # start lr(epoch 0)
lr_max: 0.00355 # max lr
lr_min: 0.0 # min lr (max epoch)
max_epoch: 300 # max epoch
warmup_epochs: 40 # warmup epoch
# optimizer
opt: 'adamw' # optimizer type
beta1: 0.9 # adam beta
beta2: 0.999 # adam beta
weight_decay: 0.05 # weight decay
no_weight_decay_filter: "beta,bias" # parameter name keywords excluded from weight decay
gc_flag: 0 # whether to use gc; not supported for user-defined optimizers, only for the built-in ones
# loss, some parameter also used in datasets
loss_scale: 1024 # amp loss scale
use_label_smooth: 1 # use label smooth or not
label_smooth_factor: 0.1 #label smooth factor
mixup: 0.2 # use mixup or not
autoaugment: 1 # use autoaugment or not
loss_name: "ce_smooth_mixup" #loss type, detail see cross_entropy.py
# ckpt
save_checkpoint: 1 # save .ckpt(training result) or not
save_checkpoint_epochs: 8 # when to save .ckpt
keep_checkpoint_max: 3 # max keep ckpt
save_checkpoint_path: './outputs' # save path
# profiler
open_profiler: 0 # whether to run the profiler; if enabled, use a small training dataset and set max_epoch=1
```
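The lr_init / lr_max / lr_min / warmup_epochs / max_epoch fields above describe a linear warmup followed by cosine decay. A minimal sketch of such a schedule is shown below; the exact curve used for training is generated in src/lr_generator.py, so treat this as an approximation:

```python
import math

def warmup_cosine_lr(lr_init=0.0, lr_max=0.00355, lr_min=0.0,
                     warmup_epochs=40, max_epoch=300, steps_per_epoch=625):
    """Per-step learning rates: linear warmup, then cosine decay (approximation of src/lr_generator.py)."""
    total_steps = max_epoch * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * step / warmup_steps
        else:
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
        lrs.append(lr)
    return lrs

lrs = warmup_cosine_lr()
print(lrs[0], lrs[40 * 625], lrs[-1])   # ~lr_init, ~lr_max, ~lr_min
```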
For more configuration details, please refer to the scripts `train.py`, `eval.py`, `export.py` and the files under `config/*.yml`.
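The --config_path argument points at one of these .yml files. The repository's own parser lives in src/model_utils/config.py; conceptually it behaves roughly like the hedged sketch below (the yaml dependency and attribute-style access are illustrative assumptions):

```python
import argparse
from types import SimpleNamespace
import yaml  # assumed available; the repo's requirements.txt lists its own dependencies

def load_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", type=str, required=True)
    path = parser.parse_known_args()[0].config_path
    with open(path, "r") as f:
        cfg = yaml.safe_load(f)
    # expose keys as attributes, e.g. config.batch_size, config.lr_max
    return SimpleNamespace(**cfg)

# config = load_config()
```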
## [Training Process](#contents)
### Training
- running on Ascend
```bash
python train.py --config_path=[CONFIG_PATH] > train.log 2>&1 &
```
The python command above will run in the background; you can view the results through the file `train.log`.
After training, you will get some checkpoint files under the script folder by default. Loss values similar to the following will be printed:
```bash
# vim log
2021-08-05 15:17:12:INFO:compile time used=143.16s
2021-08-05 15:34:41:INFO:epoch[0], epoch time: 1048.72s, per step time: 0.2096s, loss=6.738676, lr=0.000011, fps=1221.51
2021-08-05 15:52:03:INFO:epoch[1], epoch time: 1041.90s, per step time: 0.2082s, loss=6.381927, lr=0.000022, fps=1229.51
...
```
The model checkpoint will be saved in the train directory.
### Distributed Training
- running on Ascend
```bash
cd scripts
bash run_train_distribute.sh [RANK_TABLE_FILE] [CONFIG_PATH]
```
The above shell script will run distributed training in the background. You can view the results through the file `train_parallel[X]/log`; loss values similar to the following will be printed (a sketch of the per-device parallel setup performed by the script is shown after the log):
```bash
# vim train_parallel0/log
# fps depends on CPU processing ability; data processing takes time
2021-08-05 20:15:16:INFO:compile time used=191.77s
2021-08-05 20:17:46:INFO:epoch[0], epoch time: 149.10s, per step time: 0.2386s, loss=6.729037, lr=0.000089, fps=8584.97, accuracy=0.014940, eval_cost=1.58
2021-08-05 20:20:11:INFO:epoch[1], epoch time: 143.44s, per step time: 0.2295s, loss=6.786729, lr=0.000177, fps=8923.72, accuracy=0.047000, eval_cost=1.27
...
2021-08-06 08:18:18:INFO:epoch[299], epoch time: 143.19s, per step time: 0.2291s, loss=2.718115, lr=0.000000, fps=8939.29, accuracy=0.741800, eval_cost=1.28
2021-08-06 08:18:20:INFO:training time used=43384.70s
2021-08-06 08:18:20:INFO:last_metric[0.74206]
2021-08-06 08:18:20:INFO:ip[*.*.*.*], mean_fps[8930.40]
```
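Under the hood, each of the processes launched by run_train_distribute.sh configures data-parallel training roughly as follows (a sketch following the pattern used by eval.py in this repository; the RANK_TABLE_FILE wiring itself is done by the launch script):

```python
import os
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init

device_id = int(os.getenv('DEVICE_ID', '0'))
device_num = int(os.getenv('RANK_SIZE', '1'))

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)
if device_num > 1:
    context.set_auto_parallel_context(device_num=device_num,
                                      parallel_mode=ParallelMode.DATA_PARALLEL,
                                      gradients_mean=True)
    init()   # initialize HCCL; requires RANK_TABLE_FILE to be set by the launch script
```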
## [Evaluation Process](#contents)
### Evaluation
- evaluation on imagenet dataset when running on Ascend
Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to be the absolute full path, e.g., "username/vit/vit_base_patch32.ckpt".
```bash
cd scripts
bash run_eval.sh [RANK_TABLE_FILE] [CONFIG_PATH]
```
The above command will run in the background. You can view the results through the file "eval.log". The accuracy on the test dataset will be as follows:
```bash
# grep "accuracy=" eval0/log
accuracy=0.741260
```
Note that for evaluation after distributed training, please set checkpoint_path to the saved checkpoint file, such as "username/vit/train_parallel0/outputs/vit_base_patch32-288_625.ckpt". The accuracy on the test dataset will be as follows:
```bash
# grep "accuracy=" eval0/log
accuracy=0.741260
```
## [Export Process](#contents)
### [Export](#content)
Before exporting the model, you must modify the config file, config/vit_export.yml.
The config items you should modify are batch_size and pretrained (the checkpoint to export); a minimal Python sketch of the export step follows the command below.
```bash
python export.py --config_path=[CONFIG_PATH]
```
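The command above essentially performs the following steps (abridged from export.py in this repository; the 224 input size is assumed here, export.py takes it from the config):

```python
import numpy as np
from mindspore import Tensor, load_checkpoint, load_param_into_net, export, context
from src.vit import get_network
from src.model_utils.config import config

context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target)
net = get_network(backbone_name=config.backbone, args=config)   # e.g. vit_base_patch32
param_dict = load_checkpoint(config.pretrained)                  # checkpoint set in config/vit_export.yml
load_param_into_net(net, param_dict)
input_arr = Tensor(np.zeros([config.batch_size, 3, 224, 224], np.float32))
export(net, input_arr, file_name=config.file_name, file_format=config.file_format)  # AIR or MINDIR
```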
## [Inference Process](#contents)
### [Inference](#content)
Before performing inference, we need to export the model first. The AIR model can only be exported in the Ascend 910 environment, while the MindIR model can be exported in any environment.
Currently batch_size can only be set to 1.
- inference on the ImageNet dataset when running on Ascend
Before running the command below, you should modify the config file; the items to modify are batch_size and val_data_path.
Inference results will be stored under the script path, and you can find results like the following in acc.log (a hedged post-processing sketch is shown after the command block below).
```shell
# Ascend310 inference
cd scripts
bash run_infer_310.sh [MINDIR_PATH] [NET_TYPE] [DATASET] [DATA_PATH] [DEVICE_ID]
Total data: 50000, top1 accuracy: 0.74084, top5 accuracy: 0.91026
```
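After the script finishes, each image has a raw output file under result_Files/ named `<image>_0.bin` (see WriteResult in ascend310_infer), and create_imagenet2012_label.py produces imagenet_label.json. The authoritative accuracy computation is in postprogress.py; the following is only a hedged sketch of the idea, and the file-name matching and possible class-index offset for the 1001-class head are assumptions:

```python
import os
import json
import numpy as np

def top1_from_bins(result_dir="./result_Files", label_file="imagenet_label.json", num_classes=1001):
    with open(label_file) as f:
        labels = json.load(f)                      # original image file name -> class index
    correct, total = 0, 0
    for name, gt in labels.items():
        bin_path = os.path.join(result_dir, os.path.splitext(name)[0] + "_0.bin")
        if not os.path.exists(bin_path):
            continue
        logits = np.fromfile(bin_path, dtype=np.float32).reshape(-1, num_classes)
        # note: with class_num=1001 the repo may offset predictions by one background class;
        # see postprogress.py for the exact handling.
        correct += int(np.argmax(logits[0]) == gt)
        total += 1
    print("Total data: {}, top1 accuracy: {:.5f}".format(total, correct / total))
```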
# [Model Description](#contents)
## [Performance](#contents)
### Evaluation Performance
#### Vit on ImageNet (1.2 million images)
| Parameters | Ascend |
| -------------------------- | ----------------------------------------------------------- |
| Model Version | Vit |
| Resource | Ascend 910; CPU 2.60GHz, 56cores; Memory 314G; OS Euler2.8 |
| Uploaded Date | 08/30/2021 (month/day/year) |
| MindSpore Version | 1.3.0 |
| Dataset | 1200k images |
| Training Parameters | epoch=300, steps=625*300, batch_size=256, lr=0.00355 |
| Optimizer | Adamw |
| Loss Function | Softmax Cross Entropy |
| outputs | probability |
| Loss | 1.0 |
| Speed | 1pc: 180 ms/step; 8pcs: 185 ms/step |
| Total time | 8pcs: 11 hours |
| Parameters (M) | 86.0 |
| Checkpoint for Fine tuning | 1000M (.ckpt file) |
| Scripts | [vit script](https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/cv/vit) |
### Inference Performance
#### Vit on 1.2 million images
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | Vit |
| Resource | Ascend 910; OS Euler2.8 |
| Uploaded Date | 08/30/2021 (month/day/year) |
| MindSpore Version | 1.3.0 |
| Dataset | 1200k images |
| batch_size | 256 |
| outputs | probability |
| Accuracy | 8pcs: 73.5%-74.6% |
## [How to use](#contents)
### Inference
If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html). The following is a simple example:
- Running on Ascend
```python
# get args from cfg and get parameter by args
args.loss_scale = ...
lrs = ...
...
# Set context
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
context.set_context(device_id=args.device_id)
# Load unseen dataset for inference
dataset = dataset.create_dataset(args.data_path, 1, False)
# Define model
net = ViT(args.vit_config)
opt = AdamW(filter(lambda x: x.requires_grad, net.get_parameters()), lrs, args.beta1, args.beta2, loss_scale=args.loss_scale, weight_decay=cfg.weight_decay)
loss = CrossEntropySmoothMixup(smooth_factor=args.label_smooth_factor, num_classes=args.class_num)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
# Load pre-trained model
param_dict = load_checkpoint(args.pretrained)
load_param_into_net(net, param_dict)
net.set_train(False)
# Make predictions on the unseen dataset
acc = model.eval(dataset)
print("accuracy: ", acc)
```
### Continue Training on the Pretrained Model
- running on Ascend
```python
# get args from cfg and get parameter by args
args.loss_scale = ...
lrs = ...
...
# Load dataset
dataset = create_dataset(cfg.data_path, 1)
batch_num = dataset.get_dataset_size()
# Define model
net = ViT(args.vit_config)
# Continue training if 'pretrained' points to a checkpoint
if cfg.pretrained != '':
    param_dict = load_checkpoint(cfg.pretrained)
    load_param_into_net(net, param_dict)
# Define model
opt = AdamW(filter(lambda x: x.requires_grad, net.get_parameters()), lrs, args.beta1, args.beta2, loss_scale=args.loss_scale, weight_decay=cfg.weight_decay)
loss = CrossEntropySmoothMixup(smooth_factor=args.label_smooth_factor, num_classes=args.class_num)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
# Start training
epoch_size = args.max_epoch
step_size = dataset.get_dataset_size()
# Set callbacks
state_cb = StateMonitor(data_size=step_size,
tot_batch_size=args.batch_size * device_num,
lrs=lrs,
eval_interval=args.eval_interval,
eval_offset=args.eval_offset,
eval_engine=eval_engine,
logger=args.logger.info)
cb = [state_cb, ]
model.train(epoch_size, dataset, callbacks=cb, sink_size=step_size)
print("train success")
```
# [Description of Random Situation](#contents)
In dataset.py, we set the seed inside the "create_dataset" function. A random seed is also used in train.py.
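A minimal sketch of the kind of seeding involved is shown below; the exact calls live in src/dataset.py and train.py, and the specific APIs used there may differ:

```python
import numpy as np
import mindspore.dataset as ds
from mindspore.common import set_seed

set_seed(1)            # global MindSpore seed (weight init, dropout, ...)
ds.config.set_seed(1)  # dataset pipeline seed (shuffle, random augmentations)
np.random.seed(1)      # numpy-side randomness (e.g. augmentation coefficients)
```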
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

View File

@ -0,0 +1,532 @@
# Contents
[View English](./README.md)
<!-- TOC -->
- [Contents](#contents)
- [Vit Description](#vit-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
- [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training Process](#training-process)
- [Training](#training)
- [Distributed Training](#distributed-training)
- [Evaluation Process](#evaluation-process)
- [Evaluation](#evaluation)
- [Export Process](#export-process)
- [Export](#export)
- [Inference Process](#inference-process)
- [Inference](#inference)
- [Model Description](#model-description)
- [Performance](#performance)
- [Evaluation Performance](#evaluation-performance)
- [Vit on ImageNet (1.2 million images)](#vit-on-imagenet-12-million-images)
- [Inference Performance](#inference-performance)
- [Vit on 1.2 million images](#vit-on-12-million-images)
- [How to use](#how-to-use)
- [Inference](#inference)
- [Continue Training on the Pretrained Model](#continue-training-on-the-pretrained-model)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->
# Vit Description
ViT (Vision Transformer) differs from traditional CNN-based networks in that it is a computer-vision network built on the transformer architecture. Published by Google Research in 2021, it shows very strong generalization when trained on large datasets, and large-data tasks such as CLIP achieve good results on top of this structure.
[论文](https://arxiv.org/abs/2010.11929): Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 2021.
# Model Architecture
Vit is built by stacking multiple transformer encoder blocks. The basic structure is patch_embedding + n transformer layers + head (an FC layer in the classification network).
# Dataset
Dataset used: [ImageNet2012](http://www.image-net.org/)
- Dataset size: 125G, 1.25 million color images in 1,000 classes
    - Training set: 120G, 1.2 million images
    - Test set: 5G, 50,000 images
- Data format: RGB
- Note: Data will be processed in src/dataset.py.
```bash
└─dataset
    ├─train  # training set; must be packed as a .tar file when training on the cloud
    └─val    # evaluation set
```
# Features
## Mixed Precision
The [mixed precision](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/enable_mixed_precision.html) training method uses both single-precision and half-precision data to speed up the training of deep neural networks while preserving the accuracy achievable with single-precision training. Mixed precision training speeds up computation and reduces memory usage, and makes it possible to train larger models or use larger batch sizes on specific hardware.
Taking FP16 operators as an example, if the input data type is FP32, the MindSpore backend automatically reduces the precision to process the data. Users can enable the INFO log and search for "reduce precision" to view operators whose precision was reduced.
# Environment Requirements
- Hardware: Ascend/GPU/CPU
    - Prepare a hardware environment with Ascend/GPU/CPU processors.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# Quick Start
After installing MindSpore via the official website, you can start training and evaluation as follows:
- running on Ascend
```python
# run training example; CONFIG_PATH is one of the files under ./config/
python train.py --config_path=[CONFIG_PATH] > train.log 2>&1 &
# run distributed training example
cd scripts;
bash run_train_distribute.sh [RANK_TABLE_FILE] [CONFIG_PATH]
# run evaluation example
cd scripts;
bash run_eval.sh [RANK_TABLE_FILE] [CONFIG_PATH]
# run inference example
cd scripts;
bash run_infer_310.sh [MINDIR_PATH] [NET_TYPE] [DATASET] [DATA_PATH] [DEVICE_ID]
```
For distributed training, an HCCL configuration file (RANK_TABLE_FILE) in JSON format needs to be created in advance.
Please follow the instructions in the link below:
<https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>.
- Training on ModelArts (if you want to run on ModelArts, please refer to the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/))
- Train ImageNet with 8 cards on ModelArts
```python
# (1) Set "config_path='/path_to_code/config/vit_patch32_imagenet2012_config_cloud.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=True" in the .yml file.
# Set "output_path" in the .yml file.
# Set "data_path='/cache/data/ImageNet/'" in the .yml file.
# Set other required parameters in the .yml file.
# b. Set "enable_modelarts=True" on the website UI interface.
# Set "output_path" on the website UI interface.
# Set "data_path='/cache/data/ImageNet/'" on the website UI interface.
# Set other parameters on the website UI interface.
# (3) Upload your zip dataset to the S3 bucket. (You could also upload the original dataset, but it may be slow.)
# (4) Set the code directory to "/path/vit" on the website UI interface.
# (5) Set the startup file to "train.py" on the website UI interface.
# (6) Set the "Dataset path", "Output file path" and "Job log path" on the website UI interface.
# (7) Create the training job.
```
- Evaluate ImageNet with a single card on ModelArts
```python
# (1) Set "config_path='/path_to_code/config/vit_eval.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=True" in the .yml file.
# Set "dataset_name='imagenet'" in the .yml file.
# Set "val_data_path='/cache/data/ImageNet/val/'" in the .yml file.
# Set "checkpoint_url='s3://dir_to_trained_ckpt/'" in the .yml file.
# Set "checkpoint_path='/cache/checkpoint_path/model.ckpt'" in the .yml file.
# Set other required parameters in the .yml file.
# b. Set "enable_modelarts=True" on the website UI interface.
# Set "dataset_name=imagenet" on the website UI interface.
# Set "val_data_path=/cache/data/ImageNet/val/" on the website UI interface.
# Set "checkpoint_url='s3://dir_to_trained_ckpt/'" on the website UI interface.
# Set "checkpoint_path='/cache/checkpoint_path/model.ckpt'" on the website UI interface.
# Set other parameters on the website UI interface.
# (3) Upload your pretrained model to the S3 bucket.
# (4) Upload your zip dataset to the S3 bucket. (You could also upload the original dataset, but it may be slow.)
# (5) Set the code directory to "/path/vit" on the website UI interface.
# (6) Set the startup file to "eval.py" on the website UI interface.
# (7) Set the "Dataset path", "Output file path" and "Job log path" on the website UI interface.
# (8) Create the training job.
```
- Export the model on ModelArts
```python
# (1) Set "config_path='/path_to_code/config/vit_export.yml'" on the website UI interface.
# (2) Perform a or b.
# a. Set "enable_modelarts=True" in the .yml file.
# Set "checkpoint_url='s3://dir_to_trained_ckpt/'" in the .yml file.
# Set "load_path='/cache/checkpoint_path/model.ckpt'" in the .yml file.
# Set other required parameters in the .yml file.
# b. Set "enable_modelarts=True" on the website UI interface.
# Set "checkpoint_url=s3://dir_to_trained_ckpt/" on the website UI interface.
# Set "load_path=/cache/checkpoint_path/model.ckpt" on the website UI interface.
# Set other parameters on the website UI interface.
# (3) Upload your trained model to the S3 bucket.
# (4) Upload your zip dataset to the S3 bucket. (You could also upload the original dataset, but it may be slow.)
# (5) Set the code directory to "/path/vit" on the website UI interface.
# (6) Set the startup file to "export.py" on the website UI interface.
# (7) Set the "Dataset path", "Output file path" and "Job log path" on the website UI interface.
# (8) Create the training job.
```
# Script Description
## Script and Sample Code
```text
├── model_zoo
├── README.md // descriptions about all the models
├── vit
├── README.md // descriptions about vit
├── ascend310_infer // application for 310 inference
├── scripts
│ ├──run_train_distribute.sh // shell script for distributed training on Ascend
│ ├──run_train_standalone.sh // shell script for single-card training on Ascend
│ ├──run_eval.sh // shell script for evaluation on Ascend
│ ├──run_infer_310.sh // shell script for 310 inference
├── src
│ ├──autoaugment.py // autoaugment data-augmentation policy
│ ├──callback.py // callback for printing results
│ ├──cross_entropy.py // ce loss functions
│ ├──dataset.py // dataset creation
│ ├──eval_engine.py // evaluation engine
│ ├──logging.py // custom logging
│ ├──lr_generator.py // lr schedule
│ ├──metric.py // metric computation for evaluation
│ ├──optimizer.py // optimizer
│ ├──vit.py // model architecture
│ ├──model_utils // cloud training dependencies
├── config
│ ├──vit_eval.yml // evaluation configuration
│ ├──vit_export.yml // export configuration
│ ├──vit_patch32_imagenet2012_config.yml // 8P training configuration
│ ├──vit_patch32_imagenet2012_config_cloud.yml // 8P cloud training configuration
│ ├──vit_patch32_imagenet2012_config_standalone.yml // 1P training configuration
├── train.py // training script
├── eval.py // evaluation script
├── postprogress.py // post processing for 310 inference
├── export.py // export checkpoint files into air/mindir
├── create_imagenet2012_label.py // create labels for 310 inference
├── requirements.txt // required pip packages
├── mindspore_hub_conf.py // mindspore_hub_conf file required by MindSpore Hub
```
## Script Parameters
Parameters for both training and evaluation can be configured in the .yml files under ./config/.
- config for vit and the ImageNet dataset
```python
enable_modelarts: 1 # whether to train on the cloud
# modelarts (cloud) parameters
data_url: "" # S3 dataset path
train_url: "" # S3 output path
checkpoint_url: "" # S3 pretrained model path
output_path: "/cache/train" # real path on the cloud machine; copied to train_url
data_path: "/cache/datasets/imagenet" # real path on the cloud machine; copied from data_url
load_path: "/cache/model/vit_base_patch32.ckpt" # real path on the cloud machine; copied from checkpoint_url
# train datasets
dataset_path: '/cache/datasets/imagenet/train' # training dataset path
train_image_size: 224 # input image height and width
interpolation: 'BILINEAR' # interpolation used in image preprocessing
crop_min: 0.05 # random crop min value
batch_size: 256 # training batch size
train_num_workers: 14 # number of parallel workers
# eval datasets
eval_path: '/cache/datasets/imagenet/val' # eval dataset
eval_image_size: 224 # input image height and width
eval_batch_size: 256 # eval batch size
eval_interval: 1 # eval interval
eval_offset: -1 # eval offset
eval_num_workers: 12 # number of parallel workers
# network
backbone: 'vit_base_patch32' # backbone; vit_base_patch32 and vit_base_patch16 are currently supported, more can be added in vit.py
class_num: 1001 # number of classes in the training dataset
vit_config_path: 'src.vit.VitConfig' # vit config path; advanced users can follow this class to customize transformer-based cv networks
pretrained: '' # pretrained model path; '' means training from scratch
# lr
lr_decay_mode: 'cosine' # lr decay type; supports cosine, exp, etc., see lr_generator.py for details
lr_init: 0.0 # initial lr (epoch 0)
lr_max: 0.00355 # max lr
lr_min: 0.0 # lr of the last step
max_epoch: 300 # total epochs
warmup_epochs: 40 # warmup epochs
# optimizer
opt: 'adamw' # optimizer type
beta1: 0.9 # adam beta parameter
beta2: 0.999 # adam beta parameter
weight_decay: 0.05 # weight decay value
no_weight_decay_filter: "beta,bias" # which weights are excluded from weight decay
gc_flag: 0 # whether to use gc
# loss, some parameters are also used in dataset preprocessing
loss_scale: 1024 # static loss scale value for amp
use_label_smooth: 1 # whether to use label smoothing
label_smooth_factor: 0.1 # label smoothing factor
mixup: 0.2 # whether to use mixup
autoaugment: 1 # whether to use autoaugment
loss_name: "ce_smooth_mixup" # loss type; see cross_entropy.py for details
# ckpt
save_checkpoint: 1 # whether to save the training results
save_checkpoint_epochs: 8 # save a checkpoint every N epochs
keep_checkpoint_max: 3 # maximum number of checkpoints to keep
save_checkpoint_path: './outputs' # directory to save training results
# profiler
open_profiler: 0 # whether to enable profiling; if enabled, use a small dataset and set max_epoch=1
```
For more configuration details, please refer to the scripts `train.py`, `eval.py`, `export.py` and the files under `config/*.yml`.
## Training Process
### Training
- running on Ascend
```bash
python train.py --config_path=[CONFIG_PATH] > train.log 2>&1 &
```
The above python command runs in the background; you can view the results through the file train.log.
After training, you can find the checkpoint files under the default script folder. Loss values similar to the following will be printed:
```bash
# vim log
2021-08-05 15:17:12:INFO:compile time used=143.16s
2021-08-05 15:34:41:INFO:epoch[0], epoch time: 1048.72s, per step time: 0.2096s, loss=6.738676, lr=0.000011, fps=1221.51
2021-08-05 15:52:03:INFO:epoch[1], epoch time: 1041.90s, per step time: 0.2082s, loss=6.381927, lr=0.000022, fps=1229.51
...
```
The model checkpoint is saved in the current directory.
### Distributed Training
- running on Ascend
```bash
cd scripts;
bash run_train_distribute.sh [RANK_TABLE_FILE] [CONFIG_PATH]
```
The above shell script runs distributed training in the background. You can view the results through the file train_parallel[X]/log. Loss values similar to the following will be printed:
```bash
# vim train_parallel0/log
# fps depends on CPU processing ability; since autoaugment is used, data processing is the speed bottleneck for the patch32 vit
2021-08-05 20:15:16:INFO:compile time used=191.77s
2021-08-05 20:17:46:INFO:epoch[0], epoch time: 149.10s, per step time: 0.2386s, loss=6.729037, lr=0.000089, fps=8584.97, accuracy=0.014940, eval_cost=1.58
2021-08-05 20:20:11:INFO:epoch[1], epoch time: 143.44s, per step time: 0.2295s, loss=6.786729, lr=0.000177, fps=8923.72, accuracy=0.047000, eval_cost=1.27
...
2021-08-06 08:18:18:INFO:epoch[299], epoch time: 143.19s, per step time: 0.2291s, loss=2.718115, lr=0.000000, fps=8939.29, accuracy=0.741800, eval_cost=1.28
2021-08-06 08:18:20:INFO:training time used=43384.70s
2021-08-06 08:18:20:INFO:last_metric[0.74206]
2021-08-06 08:18:20:INFO:ip[*.*.*.*], mean_fps[8930.40]
```
## Evaluation Process
### Evaluation
- evaluating the ImageNet dataset on Ascend
Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g., "username/vit/vit_base_patch32.ckpt".
```bash
cd scripts;
bash run_eval.sh [RANK_TABLE_FILE] [CONFIG_PATH]
```
The above command runs in the background; you can view the results through the file eval.log. The accuracy on the test dataset is as follows:
```bash
# grep "accuracy=" eval0/log
accuracy=0.741260
```
For evaluation after distributed training, please set checkpoint_path to the saved checkpoint file, e.g., "username/vit/train_parallel0/outputs/vit_base_patch32-288_625.ckpt". The accuracy on the test dataset is as follows:
```bash
# grep "accuracy=" eval0/log
accuracy=0.741260
```
## Export Process
### Export
Before exporting, you need to modify the corresponding config file, config/vit_export.yml. The config items to modify are batch_size and pretrained (the checkpoint to export).
```shell
python export.py --config_path=[CONFIG_PATH]
```
## Inference Process
### Inference
Before performing inference, we need to export the model first. The AIR model can only be exported in the Ascend 910 environment, while the MindIR model can be exported in any environment. batch_size only supports 1.
- inference with the ImageNet dataset on Ascend 310
Before running the command below, we need to modify the config file; the items to modify are batch_size and val_data_path.
Inference results are stored in the current directory, and results like the following can be found in the acc.log file.
```bash
# Ascend310 inference
cd scripts;
bash run_infer_310.sh [MINDIR_PATH] [NET_TYPE] [DATASET] [DATA_PATH] [DEVICE_ID]
Total data: 50000, top1 accuracy: 0.74084, top5 accuracy: 0.91026
```
- `NET_TYPE` can be chosen from: [vit].
- `DATASET` can be chosen from: [imagenet].
- `DEVICE_ID` is optional; the default value is 0.
# Model Description
## Performance
### Evaluation Performance
#### Vit on ImageNet (1.2 million images)
| Parameters | Ascend |
| -------------------------- | ----------------------------------------------------------- |
| Model Version | Vit |
| Resource | Ascend 910; CPU 2.60GHz, 56 cores; Memory 314G; OS Euler2.8 |
| Uploaded Date | 08/30/2021 |
| MindSpore Version | 1.3.0 |
| Dataset | 1.2 million images |
| Training Parameters | epoch=300, steps=625*300, batch_size=256, lr=0.00355 |
| Optimizer | Adamw |
| Loss Function | Softmax Cross Entropy |
| Outputs | probability |
| Loss | 1.0 |
| Speed | 1 card: 180 ms/step; 8 cards: 185 ms/step |
| Total time | 8 cards: 11 hours |
| Parameters (M) | 86.0 |
| Checkpoint for Fine tuning | 1000M (.ckpt file) |
| Scripts | [vit script](https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/cv/vit) |
### Inference Performance
#### Vit on 1.2 million images
| Parameters | Ascend |
| ------------------- | --------------------------- |
| Model Version | Vit |
| Resource | Ascend 910; OS Euler2.8 |
| Uploaded Date | 08/30/2021 |
| MindSpore Version | 1.3.0 |
| Dataset | 1.2 million images |
| batch_size | 256 |
| Outputs | probability |
| Accuracy | 8 cards: 73.5%-74.6% |
## How to use
### Inference
If you need to use the trained model for inference on multiple hardware platforms such as GPU, Ascend 910 or Ascend 310, you can refer to this [link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html). The following is a simple example:
- running on Ascend
```python
# read the config file and build the parameters needed for training from it
args.loss_scale = ...
lrs = ...
...
# set context
context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
context.set_context(device_id=args.device_id)
# load unseen dataset for inference
dataset = dataset.create_dataset(args.data_path, 1, False)
# define model
net = ViT(args.vit_config)
opt = AdamW(filter(lambda x: x.requires_grad, net.get_parameters()), lrs, args.beta1, args.beta2, loss_scale=args.loss_scale, weight_decay=cfg.weight_decay)
loss = CrossEntropySmoothMixup(smooth_factor=args.label_smooth_factor, num_classes=args.class_num)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
# load the pretrained model
param_dict = load_checkpoint(args.pretrained)
load_param_into_net(net, param_dict)
net.set_train(False)
# run evaluation
acc = model.eval(dataset)
print("accuracy: ", acc)
```
### Continue Training on the Pretrained Model
- running on Ascend
```python
# read the config file and build the parameters needed for training from it
args.loss_scale = ...
lrs = ...
...
# load dataset
dataset = create_dataset(cfg.data_path, 1)
batch_num = dataset.get_dataset_size()
# define model
net = ViT(args.vit_config)
# continue training if 'pretrained' points to a checkpoint
if cfg.pretrained != '':
    param_dict = load_checkpoint(cfg.pretrained)
    load_param_into_net(net, param_dict)
# define the training model
opt = AdamW(filter(lambda x: x.requires_grad, net.get_parameters()), lrs, args.beta1, args.beta2, loss_scale=args.loss_scale, weight_decay=cfg.weight_decay)
loss = CrossEntropySmoothMixup(smooth_factor=args.label_smooth_factor, num_classes=args.class_num)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
# start training
epoch_size = args.max_epoch
step_size = dataset.get_dataset_size()
# set callbacks
state_cb = StateMonitor(data_size=step_size,
                        tot_batch_size=args.batch_size * device_num,
                        lrs=lrs,
                        eval_interval=args.eval_interval,
                        eval_offset=args.eval_offset,
                        eval_engine=eval_engine,
                        logger=args.logger.info)
cb = [state_cb, ]
model.train(epoch_size, dataset, callbacks=cb, sink_size=step_size)
print("train success")
```
# Description of Random Situation
In dataset.py, we set the seed inside the "create_dataset" function; a random seed is also used in train.py.
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

View File

@ -0,0 +1,33 @@
/**
* Copyright 2021 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef MINDSPORE_INFERENCE_UTILS_H_
#define MINDSPORE_INFERENCE_UTILS_H_
#include <sys/stat.h>
#include <dirent.h>
#include <vector>
#include <string>
#include <memory>
#include "include/api/types.h"
DIR *OpenDir(std::string_view dirName);
std::string RealPath(std::string_view path);
mindspore::MSTensor ReadFileToTensor(const std::string &file);
int WriteResult(const std::string& imageFile, const std::vector<mindspore::MSTensor> &outputs);
std::vector<std::string> GetAllFiles(std::string dir_name);
#endif

View File

@ -0,0 +1,14 @@
cmake_minimum_required(VERSION 3.14.1)
project(MindSporeCxxTestcase[CXX])
add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O0 -g -std=c++17 -Werror -Wall -fPIE -Wl,--allow-shlib-undefined")
set(PROJECT_SRC_ROOT ${CMAKE_CURRENT_LIST_DIR}/)
option(MINDSPORE_PATH "mindspore install path" "")
include_directories(${MINDSPORE_PATH})
include_directories(${MINDSPORE_PATH}/include)
include_directories(${PROJECT_SRC_ROOT}/../)
find_library(MS_LIB libmindspore.so ${MINDSPORE_PATH}/lib)
file(GLOB_RECURSE MD_LIB ${MINDSPORE_PATH}/_c_dataengine*)
add_executable(main main.cc utils.cc)
target_link_libraries(main ${MS_LIB} ${MD_LIB} gflags)

View File

@ -0,0 +1,18 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
cmake . -DMINDSPORE_PATH="`pip3.7 show mindspore-ascend | grep Location | awk '{print $2"/mindspore"}' | xargs realpath`"
make

View File

@ -0,0 +1,161 @@
/**
* Copyright 2021 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <sys/time.h>
#include <gflags/gflags.h>
#include <dirent.h>
#include <iostream>
#include <string>
#include <algorithm>
#include <iosfwd>
#include <vector>
#include <fstream>
#include <sstream>
#include <map>
#include "include/api/model.h"
#include "include/api/context.h"
#include "include/api/types.h"
#include "include/api/serialization.h"
#include "include/dataset/vision_ascend.h"
#include "include/dataset/execute.h"
#include "include/dataset/transforms.h"
#include "include/dataset/vision.h"
#include "inc/utils.h"
using mindspore::dataset::vision::Decode;
using mindspore::dataset::vision::Resize;
using mindspore::dataset::vision::CenterCrop;
using mindspore::dataset::vision::Normalize;
using mindspore::dataset::vision::HWC2CHW;
using mindspore::dataset::TensorTransform;
using mindspore::Context;
using mindspore::Serialization;
using mindspore::Model;
using mindspore::Status;
using mindspore::ModelType;
using mindspore::GraphCell;
using mindspore::kSuccess;
using mindspore::MSTensor;
using mindspore::dataset::Execute;
DEFINE_string(mindir_path, "", "mindir path");
DEFINE_string(dataset_path, ".", "dataset path");
DEFINE_string(network, "vit", "networktype");
DEFINE_string(dataset, "imagenet", "dataset");
DEFINE_int32(device_id, 0, "device id");
int main(int argc, char **argv) {
gflags::ParseCommandLineFlags(&argc, &argv, true);
if (RealPath(FLAGS_mindir_path).empty()) {
std::cout << "Invalid mindir" << std::endl;
return 1;
}
auto context = std::make_shared<Context>();
auto ascend310 = std::make_shared<mindspore::Ascend310DeviceInfo>();
ascend310->SetDeviceID(FLAGS_device_id);
context->MutableDeviceInfo().push_back(ascend310);
mindspore::Graph graph;
Serialization::Load(FLAGS_mindir_path, ModelType::kMindIR, &graph);
Model model;
Status ret = model.Build(GraphCell(graph), context);
if (ret != kSuccess) {
std::cout << "ERROR: Build failed." << std::endl;
return 1;
}
auto all_files = GetAllFiles(FLAGS_dataset_path);
if (all_files.empty()) {
std::cout << "ERROR: no input data." << std::endl;
return 1;
}
std::vector<MSTensor> modelInputs = model.GetInputs();
std::map<double, double> costTime_map;
size_t size = all_files.size();
std::shared_ptr<TensorTransform> decode = std::make_shared<Decode>();
std::shared_ptr<TensorTransform> hwc2chw = std::make_shared<HWC2CHW>();
std::shared_ptr<TensorTransform> resize = std::make_shared<Resize>(std::vector<int>{256});
std::shared_ptr<TensorTransform> centercrop = std::make_shared<CenterCrop>(std::vector<int>{224});
std::shared_ptr<TensorTransform> normalize = std::make_shared<Normalize>(
std::vector<float>{123.675, 116.28, 103.53}, std::vector<float>{58.395, 57.12, 57.375});
std::shared_ptr<TensorTransform> normalizeResnet101 = std::make_shared<Normalize>(
std::vector<float>{121.125, 115.005, 99.96}, std::vector<float>{70.125, 68.085, 70.89});
std::shared_ptr<TensorTransform> sr_resize = std::make_shared<Resize>(std::vector<int>{292});
std::shared_ptr<TensorTransform> sr_centercrop = std::make_shared<CenterCrop>(std::vector<int>{256});
std::shared_ptr<TensorTransform> sr_normalize = std::make_shared<Normalize>(
std::vector<float>{123.68, 116.78, 103.94}, std::vector<float>{1.0, 1.0, 1.0});
std::vector<std::shared_ptr<TensorTransform>> trans_list;
if (FLAGS_network == "se-resnet50") {
trans_list = {decode, sr_resize, sr_centercrop, sr_normalize, hwc2chw};
} else if (FLAGS_network == "resnet101") {
trans_list = {decode, resize, centercrop, normalizeResnet101, hwc2chw};
} else {
trans_list = {decode, resize, centercrop, normalize, hwc2chw};
}
mindspore::dataset::Execute SingleOp(trans_list);
for (size_t i = 0; i < size; ++i) {
struct timeval start = {0};
struct timeval end = {0};
double startTimeMs;
double endTimeMs;
std::vector<MSTensor> inputs;
std::vector<MSTensor> outputs;
std::cout << "Start predict input files:" << all_files[i] <<std::endl;
MSTensor image = ReadFileToTensor(all_files[i]);
if (FLAGS_dataset == "imagenet") {
SingleOp(image, &image);
}
inputs.emplace_back(modelInputs[0].Name(), modelInputs[0].DataType(), modelInputs[0].Shape(),
image.Data().get(), image.DataSize());
gettimeofday(&start, nullptr);
ret = model.Predict(inputs, &outputs);
gettimeofday(&end, nullptr);
if (ret != kSuccess) {
std::cout << "Predict " << all_files[i] << " failed." << std::endl;
return 1;
}
startTimeMs = (1.0 * start.tv_sec * 1000000 + start.tv_usec) / 1000;
endTimeMs = (1.0 * end.tv_sec * 1000000 + end.tv_usec) / 1000;
costTime_map.insert(std::pair<double, double>(startTimeMs, endTimeMs));
WriteResult(all_files[i], outputs);
}
double average = 0.0;
int inferCount = 0;
for (auto iter = costTime_map.begin(); iter != costTime_map.end(); iter++) {
average += iter->second - iter->first;
inferCount++;
}
average = average / inferCount;
std::stringstream timeCost;
timeCost << "NN inference cost average time: "<< average << " ms of infer_count " << inferCount << std::endl;
std::cout << "NN inference cost average time: "<< average << "ms of infer_count " << inferCount << std::endl;
std::string fileName = "./time_Result" + std::string("/test_perform_static.txt");
std::ofstream fileStream(fileName.c_str(), std::ios::trunc);
fileStream << timeCost.str();
fileStream.close();
costTime_map.clear();
return 0;
}

View File

@ -0,0 +1,145 @@
/**
* Copyright 2021 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <fstream>
#include <algorithm>
#include <iostream>
#include <climits>
#include <cstdlib>
#include "inc/utils.h"
using mindspore::MSTensor;
using mindspore::DataType;
std::vector<std::string> GetAllFiles(std::string dirName) {
struct dirent *filename;
DIR *dir = OpenDir(dirName);
if (dir == nullptr) {
return {};
}
std::vector<std::string> dirs;
std::vector<std::string> files;
while ((filename = readdir(dir)) != nullptr) {
std::string dName = std::string(filename->d_name);
if (dName == "." || dName == "..") {
continue;
} else if (filename->d_type == DT_DIR) {
dirs.emplace_back(std::string(dirName) + "/" + filename->d_name);
} else if (filename->d_type == DT_REG) {
files.emplace_back(std::string(dirName) + "/" + filename->d_name);
} else {
continue;
}
}
for (auto d : dirs) {
dir = OpenDir(d);
while ((filename = readdir(dir)) != nullptr) {
std::string dName = std::string(filename->d_name);
if (dName == "." || dName == ".." || filename->d_type != DT_REG) {
continue;
}
files.emplace_back(std::string(d) + "/" + filename->d_name);
}
}
std::sort(files.begin(), files.end());
for (auto &f : files) {
std::cout << "image file: " << f << std::endl;
}
return files;
}
int WriteResult(const std::string& imageFile, const std::vector<MSTensor> &outputs) {
std::string homePath = "./result_Files";
for (size_t i = 0; i < outputs.size(); ++i) {
size_t outputSize;
std::shared_ptr<const void> netOutput;
netOutput = outputs[i].Data();
outputSize = outputs[i].DataSize();
int pos = imageFile.rfind('/');
std::string fileName(imageFile, pos + 1);
fileName.replace(fileName.find('.'), fileName.size() - fileName.find('.'), '_' + std::to_string(i) + ".bin");
std::string outFileName = homePath + "/" + fileName;
FILE *outputFile = fopen(outFileName.c_str(), "wb");
fwrite(netOutput.get(), outputSize, sizeof(char), outputFile);
fclose(outputFile);
outputFile = nullptr;
}
return 0;
}
mindspore::MSTensor ReadFileToTensor(const std::string &file) {
if (file.empty()) {
std::cout << "Pointer file is nullptr" << std::endl;
return mindspore::MSTensor();
}
std::ifstream ifs(file);
if (!ifs.good()) {
std::cout << "File: " << file << " is not exist" << std::endl;
return mindspore::MSTensor();
}
if (!ifs.is_open()) {
std::cout << "File: " << file << "open failed" << std::endl;
return mindspore::MSTensor();
}
ifs.seekg(0, std::ios::end);
size_t size = ifs.tellg();
mindspore::MSTensor buffer(file, mindspore::DataType::kNumberTypeUInt8, {static_cast<int64_t>(size)}, nullptr, size);
ifs.seekg(0, std::ios::beg);
ifs.read(reinterpret_cast<char *>(buffer.MutableData()), size);
ifs.close();
return buffer;
}
DIR *OpenDir(std::string_view dirName) {
if (dirName.empty()) {
std::cout << " dirName is null ! " << std::endl;
return nullptr;
}
std::string realPath = RealPath(dirName);
struct stat s;
lstat(realPath.c_str(), &s);
if (!S_ISDIR(s.st_mode)) {
std::cout << "dirName is not a valid directory !" << std::endl;
return nullptr;
}
DIR *dir;
dir = opendir(realPath.c_str());
if (dir == nullptr) {
std::cout << "Can not open dir " << dirName << std::endl;
return nullptr;
}
std::cout << "Successfully opened the dir " << dirName << std::endl;
return dir;
}
std::string RealPath(std::string_view path) {
char realPathMem[PATH_MAX] = {0};
char *realPathRet = nullptr;
realPathRet = realpath(path.data(), realPathMem);
if (realPathRet == nullptr) {
std::cout << "File: " << path << " is not exist.";
return "";
}
std::string realPath(realPathMem);
std::cout << path << " realpath is: " << realPath << std::endl;
return realPath;
}

View File

@ -0,0 +1,20 @@
enable_modelarts: 0
# eval datasets
interpolation: 'BILINEAR'
eval_path: '/opt/npu/datasets/imagenet/val'
eval_image_size: 224
eval_batch_size: 256
eval_interval: 1
eval_offset: -1
eval_num_workers: 12
# load model
pretrained: '../vit_base_patch32.ckpt'
# network
backbone: 'vit_base_patch32'
class_num: 1001
vit_config_path: 'src.vit.VitConfig'
open_profiler: 0

View File

@ -0,0 +1,22 @@
enable_modelarts: 0
device_target: 'Ascend'
device_id: 0
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
output_path: "/cache/train"
file_name: 'vit_base_patch32.mindir'
file_format: 'MINDIR'
backbone: 'vit_base_patch32'
train_image_size: 224
class_num: 1001
batch_size: 1
vit_config_path: 'src.vit.VitConfig'
# load model
pretrained: './vit_base_patch32.ckpt'

View File

@ -0,0 +1,62 @@
enable_modelarts: 0
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
output_path: "/cache/train"
# train datasets
dataset_path: '/opt/npu/datasets/imagenet/train'
train_image_size: 224
interpolation: 'BILINEAR'
crop_min: 0.05
batch_size: 256
train_num_workers: 14
# eval datasets
eval_path: '/opt/npu/datasets/imagenet/val'
eval_image_size: 224
eval_batch_size: 256
eval_interval: 1
eval_offset: -1
eval_num_workers: 12
# network
backbone: 'vit_base_patch32'
class_num: 1001
vit_config_path: 'src.vit.VitConfig'
pretrained: ''
# lr
lr_decay_mode: 'cosine'
lr_init: 0.0
lr_max: 0.00355
lr_min: 0.0
max_epoch: 300
warmup_epochs: 40
# optimizer
opt: 'adamw'
beta1: 0.9
beta2: 0.999
weight_decay: 0.05
no_weight_decay_filter: "beta,bias"
gc_flag: 0
# loss
loss_scale: 1024
use_label_smooth: 1
label_smooth_factor: 0.1
mixup: 0.2
autoaugment: 1
loss_name: "ce_smooth_mixup"
# ckpt
save_checkpoint: 1
save_checkpoint_epochs: 8
keep_checkpoint_max: 3
save_checkpoint_path: './outputs'
# profiler
open_profiler: 0

View File

@ -0,0 +1,64 @@
enable_modelarts: 1
# Url for modelarts
data_url: "s3://bucket-d/datasets/imagenet"
train_url: "s3://bucket-d/train"
checkpoint_url: "s3://bucket-d/model/vit_base_patch32.ckpt"
output_path: "/cache/train"
data_path: "/cache/datasets/imagenet"
load_path: "/cache/model/vit_base_patch32.ckpt"
# train datasets
dataset_path: '/cache/datasets/imagenet/train'
train_image_size: 224
interpolation: 'BILINEAR'
crop_min: 0.05
batch_size: 256
train_num_workers: 14
# eval datasets
eval_path: '/cache/datasets/imagenet/val'
eval_image_size: 224
eval_batch_size: 256
eval_interval: 1
eval_offset: -1
eval_num_workers: 12
# network
backbone: 'vit_base_patch32'
class_num: 1001
vit_config_path: 'src.vit.VitConfig'
pretrained: ''
# lr
lr_decay_mode: 'cosine'
lr_init: 0.0
lr_max: 0.00355
lr_min: 0.0
max_epoch: 300
warmup_epochs: 40
# optimizer
opt: 'adamw'
beta1: 0.9
beta2: 0.999
weight_decay: 0.05
no_weight_decay_filter: "beta,bias"
gc_flag: 0
# loss
loss_scale: 1024
use_label_smooth: 1
label_smooth_factor: 0.1
mixup: 0.2
autoaugment: 1
loss_name: "ce_smooth_mixup"
# ckpt
save_checkpoint: 1
save_checkpoint_epochs: 8
keep_checkpoint_max: 3
save_checkpoint_path: './outputs'
# profiler
open_profiler: 0

View File

@ -0,0 +1,56 @@
enable_modelarts: 0
# train datasets
dataset_path: '/opt/npu/datasets/imagenet/train'
train_image_size: 224
interpolation: 'BILINEAR'
crop_min: 0.05
batch_size: 256
train_num_workers: 14
# eval datasets
eval_path: '/opt/npu/datasets/imagenet/val'
eval_image_size: 224
eval_batch_size: 256
eval_interval: 1
eval_offset: -1
eval_num_workers: 12
# network
backbone: 'vit_base_patch32'
class_num: 1001
vit_config_path: 'src.vit.VitConfig'
pretrained: ''
# lr
lr_decay_mode: 'cosine'
lr_init: 0.0
lr_max: 0.00044375
lr_min: 0.0
max_epoch: 300
warmup_epochs: 40
# optimizer
opt: 'adamw'
beta1: 0.9
beta2: 0.999
weight_decay: 0.05
no_weight_decay_filter: "beta,bias"
gc_flag: 0
# loss
loss_scale: 1024
use_label_smooth: 1
label_smooth_factor: 0.1
mixup: 0.2
autoaugment: 1
loss_name: "ce_smooth_mixup"
# ckpt
save_checkpoint: 1
save_checkpoint_epochs: 8
keep_checkpoint_max: 3
save_checkpoint_path: './outputs'
# profiler
open_profiler: 0

View File

@ -0,0 +1,49 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""create_imagenet2012_label"""
import os
import json
import argparse
parser = argparse.ArgumentParser(description="resnet imagenet2012 label")
parser.add_argument("--img_path", type=str, required=True, help="imagenet2012 file path.")
args = parser.parse_args()
def create_label(file_path):
"""create_imagenet2012_label"""
print("[WARNING] Create imagenet label. Currently only use for Imagenet2012!")
dirs = os.listdir(file_path)
file_list = []
for file in dirs:
file_list.append(file)
file_list = sorted(file_list)
total = 0
img_label = {}
for i, file_dir in enumerate(file_list):
files = os.listdir(os.path.join(file_path, file_dir))
for f in files:
img_label[f] = i
total += len(files)
with open("imagenet_label.json", "w+") as label:
json.dump(img_label, label)
print("[INFO] Completed! Total {} data.".format(total))
if __name__ == '__main__':
create_label(args.img_path)

View File

@ -0,0 +1,137 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""eval script"""
import os
import numpy as np
from mindspore import context
from mindspore.train.model import Model, ParallelMode
from mindspore.communication.management import init
from mindspore.profiler.profiling import Profiler
from mindspore.train.serialization import load_checkpoint
from src.vit import get_network
from src.dataset import get_dataset
from src.optimizer import get_optimizer
from src.eval_engine import get_eval_engine
from src.logging import get_logger
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
try:
os.environ['MINDSPORE_HCCL_CONFIG_PATH'] = os.getenv('RANK_TABLE_FILE')
device_id = int(os.getenv('DEVICE_ID')) # 0 ~ 7
local_rank = int(os.getenv('RANK_ID')) # local_rank
device_num = int(os.getenv('RANK_SIZE')) # world_size
print("distribute")
except TypeError:
device_id = 0 # 0 ~ 7
local_rank = 0 # local_rank
device_num = 1 # world_size
print("standalone")
def add_static_args(args):
"""add_static_args"""
args.train_image_size = args.eval_image_size
args.weight_decay = 0.05
args.no_weight_decay_filter = ""
args.gc_flag = 0
args.beta1 = 0.9
args.beta2 = 0.999
args.loss_scale = 1024
args.dataset_name = 'imagenet'
args.save_checkpoint_path = './outputs'
args.eval_engine = 'imagenet'
args.auto_tune = 0
args.seed = 1
args.device_id = device_id
args.local_rank = local_rank
args.device_num = device_num
return args
@moxing_wrapper()
def eval_net():
"""eval_net"""
args = add_static_args(config)
np.random.seed(args.seed)
args.logger = get_logger(args.save_checkpoint_path, rank=local_rank)
context.set_context(device_id=device_id,
mode=context.GRAPH_MODE,
device_target="Ascend",
save_graphs=False)
if args.auto_tune:
context.set_context(auto_tune_mode='GA')
elif args.device_num == 1:
pass
else:
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
if args.open_profiler:
profiler = Profiler(output_path="data_{}".format(local_rank))
# init the distribute env
if not args.auto_tune and args.device_num > 1:
init()
# network
net = get_network(backbone_name=args.backbone, args=args)
if os.path.isfile(args.pretrained):
load_checkpoint(args.pretrained, net, strict_load=False)
# evaluation dataset
eval_dataset = get_dataset(dataset_name=args.dataset_name,
do_train=False,
dataset_path=args.eval_path,
args=args)
opt, _ = get_optimizer(optimizer_name='adamw',
network=net,
lrs=1.0,
args=args)
# evaluation engine
if args.auto_tune or args.open_profiler or eval_dataset is None:
args.eval_engine = ''
eval_engine = get_eval_engine(args.eval_engine, net, eval_dataset, args)
# model
model = Model(net, loss_fn=None, optimizer=opt,
metrics=eval_engine.metric, eval_network=eval_engine.eval_network,
loss_scale_manager=None, amp_level="O3")
eval_engine.set_model(model)
args.logger.save_args(args)
eval_engine.compile(sink_size=625) #step_size
eval_engine.eval()
output = eval_engine.get_result()
print_str = 'accuracy={:.6f}'.format(float(output))
print(print_str)
if args.open_profiler:
profiler.analyse()
if __name__ == '__main__':
eval_net()

View File

@ -0,0 +1,52 @@
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
##############export checkpoint file into air, mindir and onnx models#################
python export.py
"""
import os
import numpy as np
from mindspore import Tensor, load_checkpoint, load_param_into_net, export, context
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
from src.vit import get_network
context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target)
if config.device_target == "Ascend":
context.set_context(device_id=config.device_id)
def modelarts_pre_process():
'''modelarts pre process function.'''
config.file_name = os.path.join(config.output_path, config.file_name)
@moxing_wrapper(pre_process=modelarts_pre_process)
def run_export():
"""run export."""
net = get_network(backbone_name=config.backbone, args=config)
    assert config.pretrained is not None, "config.pretrained is None."
param_dict = load_checkpoint(config.pretrained)
load_param_into_net(net, param_dict)
config.height = config.train_image_size
config.width = config.train_image_size
input_arr = Tensor(np.zeros([config.batch_size, 3, config.height, config.width], np.float32))
export(net, input_arr, file_name=config.file_name, file_format=config.file_format)
if __name__ == '__main__':
run_export()

@@ -0,0 +1,24 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""hub config."""
from src.vit import vit_base_patch16, vit_base_patch32
def create_network(name, *args, **kwargs):
"""create_network about resnet"""
if name == 'vit_base_patch16':
return vit_base_patch16(*args, **kwargs)
if name == 'vit_base_patch32':
return vit_base_patch32(*args, **kwargs)
raise NotImplementedError(f"{name} is not implemented in the repo")
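For reference, a minimal hedged sketch of how the hub entry point above behaves; the import path is assumed from the usual ModelZoo layout, and whether the ViT constructors need extra configuration arguments depends on `src/vit.py`, which is not part of this file.

```python
from mindspore_hub_conf import create_network  # assumed import path for this hub config

# Known names ('vit_base_patch16', 'vit_base_patch32') are forwarded, together with
# any extra positional/keyword arguments, to the constructors in src/vit.py.
# Unknown names raise NotImplementedError, as in the dispatch above:
try:
    create_network("resnet50")
except NotImplementedError as err:
    print(err)  # "resnet50 is not implemented in the repo"
```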

@@ -0,0 +1,52 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""post process for 310 inference"""
import os
import json
import argparse
import numpy as np
batch_size = 1
parser = argparse.ArgumentParser(description="vit inference")
parser.add_argument("--dataset", type=str, required=True, help="dataset type.")
parser.add_argument("--result_path", type=str, required=True, help="result files path.")
parser.add_argument("--label_path", type=str, required=True, help="image file path.")
args = parser.parse_args()
def cal_acc_imagenet(result_path, label_path):
"""cal_acc_imagenet"""
files = os.listdir(result_path)
with open(label_path, "r") as label:
labels = json.load(label)
result_shape = (1, 1001)
top1 = 0
top5 = 0
total_data = len(files)
for file in files:
img_ids_name = file.split('_0.')[0]
data_path = os.path.join(result_path, img_ids_name + "_0.bin")
result = np.fromfile(data_path, dtype=np.float32).reshape(result_shape)
for batch in range(batch_size):
predict = np.argsort(-result[batch], axis=-1)
if labels[img_ids_name+".JPEG"] == predict[0]:
top1 += 1
if labels[img_ids_name+".JPEG"] in predict[:5]:
top5 += 1
print(f"Total data: {total_data}, top1 accuracy: {top1/total_data}, top5 accuracy: {top5/total_data}.")
if __name__ == '__main__':
cal_acc_imagenet(args.result_path, args.label_path)
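For clarity, a self-contained sketch of the top-1/top-5 check performed in `cal_acc_imagenet` above, using a synthetic logits array in place of a `_0.bin` result file; the label value is made up for illustration.

```python
import numpy as np

logits = np.random.randn(1, 1001).astype(np.float32)  # same shape as result_shape above
label = 42                                             # hypothetical ground-truth class index

pred = np.argsort(-logits[0], axis=-1)                 # class ids sorted by descending score
top1_hit = label == pred[0]
top5_hit = label in pred[:5]
print(top1_hit, top5_hit)
```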

@@ -0,0 +1 @@
easydict

@@ -0,0 +1,81 @@
#!/bin/bash
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
echo "Usage: bash run_eval.sh [RANK_TABLE_FILE] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
PATH1=$(get_real_path $1)
CONFIG_FILE=$(get_real_path $2)
if [ ! -f $PATH1 ]
then
echo "error: RANK_TABLE_FILE=$PATH1 is not a directory"
exit 1
fi
if [ ! -f $CONFIG_FILE ]
then
echo "error: config_path=$CONFIG_PATH is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
export RANK_TABLE_FILE=$PATH1
export SERVER_ID=0
rank_start=$((DEVICE_NUM * SERVER_ID))
cpus=`cat /proc/cpuinfo| grep "processor"| wc -l`
avg=`expr $cpus \/ $DEVICE_NUM`
gap=`expr $avg \- 1`
for((i=0; i<${DEVICE_NUM}; i++))
do
start=`expr $i \* $avg`
end=`expr $start \+ $gap`
cmdopt=$start"-"$end
export DEVICE_ID=${i}
export RANK_ID=$((rank_start + i))
rm -rf ./eval$i
mkdir ./eval$i
cp ../*.py ./eval$i
cp *.sh ./eval$i
cp -r ../config/*.yml ./eval$i
cp -r ../src ./eval$i
cd ./eval$i || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
if [ $# == 2 ]
then
taskset -c $cmdopt python eval.py --config_path=$CONFIG_FILE &> log &
fi
cd ..
done

@@ -0,0 +1,115 @@
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [[ $# -lt 4 || $# -gt 5 ]]; then
echo "Usage: bash run_infer_310.sh [MINDIR_PATH] [NET_TYPE] [DATASET] [DATA_PATH] [DEVICE_ID]
NET_TYPE can choose from [vit]
DEVICE_ID is optional, it can be set by environment variable device_id, otherwise the value is zero"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
model=$(get_real_path $1)
if [ $2 == 'vit' ]; then
network=$2
else
echo "NET_TYPE can choose from [vit]"
exit 1
fi
dataset=$3
data_path=$(get_real_path $4)
device_id=0
if [ $# == 5 ]; then
device_id=$5
fi
echo "mindir name: "$model
echo "dataset path: "$data_path
echo "network: "$network
echo "dataset: "$dataset
echo "device id: "$device_id
export ASCEND_HOME=/usr/local/Ascend/
if [ -d ${ASCEND_HOME}/ascend-toolkit ]; then
export PATH=$ASCEND_HOME/fwkacllib/bin:$ASCEND_HOME/fwkacllib/ccec_compiler/bin:$ASCEND_HOME/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin:$ASCEND_HOME/ascend-toolkit/latest/atc/bin:$PATH
export LD_LIBRARY_PATH=$ASCEND_HOME/fwkacllib/lib64:/usr/local/lib:$ASCEND_HOME/ascend-toolkit/latest/atc/lib64:$ASCEND_HOME/ascend-toolkit/latest/fwkacllib/lib64:$ASCEND_HOME/driver/lib64:$ASCEND_HOME/add-ons:$LD_LIBRARY_PATH
export TBE_IMPL_PATH=$ASCEND_HOME/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe
export PYTHONPATH=$ASCEND_HOME/fwkacllib/python/site-packages:${TBE_IMPL_PATH}:$ASCEND_HOME/ascend-toolkit/latest/fwkacllib/python/site-packages:$PYTHONPATH
export ASCEND_OPP_PATH=$ASCEND_HOME/ascend-toolkit/latest/opp
else
export PATH=$ASCEND_HOME/fwkacllib/bin:$ASCEND_HOME/fwkacllib/ccec_compiler/bin:$ASCEND_HOME/atc/ccec_compiler/bin:$ASCEND_HOME/atc/bin:$PATH
export LD_LIBRARY_PATH=$ASCEND_HOME/fwkacllib/lib64:/usr/local/lib:$ASCEND_HOME/atc/lib64:$ASCEND_HOME/acllib/lib64:$ASCEND_HOME/driver/lib64:$ASCEND_HOME/add-ons:$LD_LIBRARY_PATH
export PYTHONPATH=$ASCEND_HOME/fwkacllib/python/site-packages:$ASCEND_HOME/atc/python/site-packages:$PYTHONPATH
export ASCEND_OPP_PATH=$ASCEND_HOME/opp
fi
function compile_app()
{
cd ../ascend310_infer/src/ || exit
if [ -f "Makefile" ]; then
make clean
fi
bash build.sh &> build.log
}
function infer()
{
cd - || exit
if [ -d result_Files ]; then
rm -rf ./result_Files
fi
if [ -d time_Result ]; then
rm -rf ./time_Result
fi
mkdir result_Files
mkdir time_Result
../ascend310_infer/src/main --mindir_path=$model --dataset_path=$data_path --network=$network --dataset=$dataset --device_id=$device_id &> infer.log
}
function cal_acc()
{
python3.7 ../create_imagenet2012_label.py --img_path=$data_path
python3.7 ../postprocess.py --dataset=$dataset --result_path=./result_Files --label_path=./imagenet_label.json &> acc.log
if [ $? -ne 0 ]; then
echo "calculate accuracy failed"
exit 1
fi
}
compile_app
if [ $? -ne 0 ]; then
echo "compile app code failed"
exit 1
fi
infer
if [ $? -ne 0 ]; then
echo " execute inference failed"
exit 1
fi
cal_acc
if [ $? -ne 0 ]; then
echo "calculate accuracy failed"
exit 1
fi

@@ -0,0 +1,81 @@
#!/bin/bash
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
PATH1=$(get_real_path $1)
CONFIG_FILE=$(get_real_path $2)
if [ ! -f $PATH1 ]
then
echo "error: RANK_TABLE_FILE=$PATH1 is not a file"
exit 1
fi
if [ ! -f $CONFIG_FILE ]
then
echo "error: config_path=$CONFIG_FILE is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
export RANK_TABLE_FILE=$PATH1
export SERVER_ID=0
rank_start=$((DEVICE_NUM * SERVER_ID))
cpus=`cat /proc/cpuinfo| grep "processor"| wc -l`
avg=`expr $cpus \/ $DEVICE_NUM`
gap=`expr $avg \- 1`
for((i=0; i<${DEVICE_NUM}; i++))
do
start=`expr $i \* $avg`
end=`expr $start \+ $gap`
cmdopt=$start"-"$end
export DEVICE_ID=${i}
export RANK_ID=$((rank_start + i))
rm -rf ./train_parallel$i
mkdir ./train_parallel$i
cp ../*.py ./train_parallel$i
cp *.sh ./train_parallel$i
cp -r ../config/*.yml ./train_parallel$i
cp -r ../src ./train_parallel$i
cd ./train_parallel$i || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
if [ $# == 2 ]
then
taskset -c $cmdopt python train.py --config_path=$CONFIG_FILE &> log &
fi
cd ..
done

@@ -0,0 +1,61 @@
#!/bin/bash
# Copyright 2020-2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 1 ]
then
echo "Usage: bash run_standalone_train.sh [CONFIG_PATH] "
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
CONFIG_FILE=$(get_real_path $1)
if [ ! -f $CONFIG_FILE ]
then
echo "error: config_path=$CONFIG_FILE is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=1
export RANK_ID=0
export RANK_SIZE=1
if [ -d "train" ];
then
rm -rf ./train
fi
mkdir ./train
cp ../config/*.yml ./train
cp ../*.py ./train
cp *.sh ./train
cp -r ../src ./train
cd ./train || exit
echo "start training for device $DEVICE_ID"
env > env.log
if [ $# == 1 ]
then
python train.py --config_path=$CONFIG_FILE &> log &
fi
cd ..

@@ -0,0 +1,264 @@
# MIT License
# Copyright (c) 2018 Philip Popien
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ============================================================================
"""
This code is based on https://github.com/DeepVoltaire/AutoAugment/blob/master/autoaugment.py
"""
import random
from PIL import Image, ImageEnhance, ImageOps
import numpy as np
class ImageNetPolicy():
""" Randomly choose one of the best 24 Sub-policies on ImageNet.
Example:
>>> policy = ImageNetPolicy()
>>> transformed = policy(image)
>>> transform=transforms.Compose([
>>> transforms.Resize(256),
>>> ImageNetPolicy(),
>>> transforms.ToTensor()])
"""
def __init__(self, fillcolor=(128, 128, 128)):
self.policies = [
SubPolicy(0.4, "posterize", 8, 0.6, "rotate", 9, fillcolor),
SubPolicy(0.6, "solarize", 5, 0.6, "autocontrast", 5, fillcolor),
SubPolicy(0.8, "equalize", 8, 0.6, "equalize", 3, fillcolor),
SubPolicy(0.6, "posterize", 7, 0.6, "posterize", 6, fillcolor),
SubPolicy(0.4, "equalize", 7, 0.2, "solarize", 4, fillcolor),
SubPolicy(0.4, "equalize", 4, 0.8, "rotate", 8, fillcolor),
SubPolicy(0.6, "solarize", 3, 0.6, "equalize", 7, fillcolor),
SubPolicy(0.8, "posterize", 5, 1.0, "equalize", 2, fillcolor),
SubPolicy(0.2, "rotate", 3, 0.6, "solarize", 8, fillcolor),
SubPolicy(0.6, "equalize", 8, 0.4, "posterize", 6, fillcolor),
SubPolicy(0.8, "rotate", 8, 0.4, "color", 0, fillcolor),
SubPolicy(0.4, "rotate", 9, 0.6, "equalize", 2, fillcolor),
SubPolicy(0.0, "equalize", 7, 0.8, "equalize", 8, fillcolor),
SubPolicy(0.6, "invert", 4, 1.0, "equalize", 8, fillcolor),
SubPolicy(0.6, "color", 4, 1.0, "contrast", 8, fillcolor),
SubPolicy(0.8, "rotate", 8, 1.0, "color", 2, fillcolor),
SubPolicy(0.8, "color", 8, 0.8, "solarize", 7, fillcolor),
SubPolicy(0.4, "sharpness", 7, 0.6, "invert", 8, fillcolor),
SubPolicy(0.6, "shearX", 5, 1.0, "equalize", 9, fillcolor),
SubPolicy(0.4, "color", 0, 0.6, "equalize", 3, fillcolor),
SubPolicy(0.4, "equalize", 7, 0.2, "solarize", 4, fillcolor),
SubPolicy(0.6, "solarize", 5, 0.6, "autocontrast", 5, fillcolor),
SubPolicy(0.6, "invert", 4, 1.0, "equalize", 8, fillcolor),
SubPolicy(0.6, "color", 4, 1.0, "contrast", 8, fillcolor),
SubPolicy(0.8, "equalize", 8, 0.6, "equalize", 3, fillcolor)
]
def __call__(self, img, policy_idx=None):
if policy_idx is None or not isinstance(policy_idx, int):
policy_idx = random.randint(0, len(self.policies) - 1)
else:
policy_idx = policy_idx % len(self.policies)
return self.policies[policy_idx](img)
def __repr__(self):
return "AutoAugment ImageNet Policy"
class CIFAR10Policy():
""" Randomly choose one of the best 25 Sub-policies on CIFAR10.
Example:
>>> policy = CIFAR10Policy()
>>> transformed = policy(image)
Example as a PyTorch Transform:
>>> transform=transforms.Compose([
>>> transforms.Resize(256),
>>> CIFAR10Policy(),
>>> transforms.ToTensor()])
"""
def __init__(self, fillcolor=(128, 128, 128)):
self.policies = [
SubPolicy(0.1, "invert", 7, 0.2, "contrast", 6, fillcolor),
SubPolicy(0.7, "rotate", 2, 0.3, "translateX", 9, fillcolor),
SubPolicy(0.8, "sharpness", 1, 0.9, "sharpness", 3, fillcolor),
SubPolicy(0.5, "shearY", 8, 0.7, "translateY", 9, fillcolor),
SubPolicy(0.5, "autocontrast", 8, 0.9, "equalize", 2, fillcolor),
SubPolicy(0.2, "shearY", 7, 0.3, "posterize", 7, fillcolor),
SubPolicy(0.4, "color", 3, 0.6, "brightness", 7, fillcolor),
SubPolicy(0.3, "sharpness", 9, 0.7, "brightness", 9, fillcolor),
SubPolicy(0.6, "equalize", 5, 0.5, "equalize", 1, fillcolor),
SubPolicy(0.6, "contrast", 7, 0.6, "sharpness", 5, fillcolor),
SubPolicy(0.7, "color", 7, 0.5, "translateX", 8, fillcolor),
SubPolicy(0.3, "equalize", 7, 0.4, "autocontrast", 8, fillcolor),
SubPolicy(0.4, "translateY", 3, 0.2, "sharpness", 6, fillcolor),
SubPolicy(0.9, "brightness", 6, 0.2, "color", 8, fillcolor),
SubPolicy(0.5, "solarize", 2, 0.0, "invert", 3, fillcolor),
SubPolicy(0.2, "equalize", 0, 0.6, "autocontrast", 0, fillcolor),
SubPolicy(0.2, "equalize", 8, 0.8, "equalize", 4, fillcolor),
SubPolicy(0.9, "color", 9, 0.6, "equalize", 6, fillcolor),
SubPolicy(0.8, "autocontrast", 4, 0.2, "solarize", 8, fillcolor),
SubPolicy(0.1, "brightness", 3, 0.7, "color", 0, fillcolor),
SubPolicy(0.4, "solarize", 5, 0.9, "autocontrast", 3, fillcolor),
SubPolicy(0.9, "translateY", 9, 0.7, "translateY", 9, fillcolor),
SubPolicy(0.9, "autocontrast", 2, 0.8, "solarize", 3, fillcolor),
SubPolicy(0.8, "equalize", 8, 0.1, "invert", 3, fillcolor),
SubPolicy(0.7, "translateY", 9, 0.9, "autocontrast", 1, fillcolor)
]
def __call__(self, img, policy_idx=None):
if policy_idx is None or not isinstance(policy_idx, int):
policy_idx = random.randint(0, len(self.policies) - 1)
else:
policy_idx = policy_idx % len(self.policies)
return self.policies[policy_idx](img)
def __repr__(self):
return "AutoAugment CIFAR10 Policy"
class SVHNPolicy():
""" Randomly choose one of the best 25 Sub-policies on SVHN.
Example:
>>> policy = SVHNPolicy()
>>> transformed = policy(image)
Example as a PyTorch Transform:
>>> transform=transforms.Compose([
>>> transforms.Resize(256),
>>> SVHNPolicy(),
>>> transforms.ToTensor()])
"""
def __init__(self, fillcolor=(128, 128, 128)):
self.policies = [
SubPolicy(0.9, "shearX", 4, 0.2, "invert", 3, fillcolor),
SubPolicy(0.9, "shearY", 8, 0.7, "invert", 5, fillcolor),
SubPolicy(0.6, "equalize", 5, 0.6, "solarize", 6, fillcolor),
SubPolicy(0.9, "invert", 3, 0.6, "equalize", 3, fillcolor),
SubPolicy(0.6, "equalize", 1, 0.9, "rotate", 3, fillcolor),
SubPolicy(0.9, "shearX", 4, 0.8, "autocontrast", 3, fillcolor),
SubPolicy(0.9, "shearY", 8, 0.4, "invert", 5, fillcolor),
SubPolicy(0.9, "shearY", 5, 0.2, "solarize", 6, fillcolor),
SubPolicy(0.9, "invert", 6, 0.8, "autocontrast", 1, fillcolor),
SubPolicy(0.6, "equalize", 3, 0.9, "rotate", 3, fillcolor),
SubPolicy(0.9, "shearX", 4, 0.3, "solarize", 3, fillcolor),
SubPolicy(0.8, "shearY", 8, 0.7, "invert", 4, fillcolor),
SubPolicy(0.9, "equalize", 5, 0.6, "translateY", 6, fillcolor),
SubPolicy(0.9, "invert", 4, 0.6, "equalize", 7, fillcolor),
SubPolicy(0.3, "contrast", 3, 0.8, "rotate", 4, fillcolor),
SubPolicy(0.8, "invert", 5, 0.0, "translateY", 2, fillcolor),
SubPolicy(0.7, "shearY", 6, 0.4, "solarize", 8, fillcolor),
SubPolicy(0.6, "invert", 4, 0.8, "rotate", 4, fillcolor),
SubPolicy(0.3, "shearY", 7, 0.9, "translateX", 3, fillcolor),
SubPolicy(0.1, "shearX", 6, 0.6, "invert", 5, fillcolor),
SubPolicy(0.7, "solarize", 2, 0.6, "translateY", 7, fillcolor),
SubPolicy(0.8, "shearY", 4, 0.8, "invert", 8, fillcolor),
SubPolicy(0.7, "shearX", 9, 0.8, "translateY", 3, fillcolor),
SubPolicy(0.8, "shearY", 5, 0.7, "autocontrast", 3, fillcolor),
SubPolicy(0.7, "shearX", 2, 0.1, "invert", 5, fillcolor)
]
def __call__(self, img, policy_idx=None):
if policy_idx is None or not isinstance(policy_idx, int):
policy_idx = random.randint(0, len(self.policies) - 1)
else:
policy_idx = policy_idx % len(self.policies)
return self.policies[policy_idx](img)
def __repr__(self):
return "AutoAugment SVHN Policy"
class SubPolicy():
"""
    A sub-policy that applies up to two image operations, each with its own probability and magnitude.
"""
def __init__(self, p1, operation1, magnitude_idx1, p2, operation2, magnitude_idx2, fillcolor=(128, 128, 128)):
ranges = {
"shearX": np.linspace(0, 0.3, 10),
"shearY": np.linspace(0, 0.3, 10),
"translateX": np.linspace(0, 150 / 331, 10),
"translateY": np.linspace(0, 150 / 331, 10),
"rotate": np.linspace(0, 30, 10),
"color": np.linspace(0.0, 0.9, 10),
"posterize": np.round(np.linspace(8, 4, 10), 0).astype(np.int),
"solarize": np.linspace(256, 0, 10),
"contrast": np.linspace(0.0, 0.9, 10),
"sharpness": np.linspace(0.0, 0.9, 10),
"brightness": np.linspace(0.0, 0.9, 10),
"autocontrast": [0] * 10,
"equalize": [0] * 10,
"invert": [0] * 10
}
# from https://stackoverflow.com/questions/5252170/specify-image-filling-color-when-rotating-in-python-with-pil-and-setting-expand
def rotate_with_fill(img, magnitude):
rot = img.convert("RGBA").rotate(magnitude)
return Image.composite(rot, Image.new("RGBA", rot.size, (128,) * 4), rot).convert(img.mode)
# pylint: disable = unnecessary-lambda
func = {
"shearX": lambda img, magnitude: img.transform(
img.size, Image.AFFINE, (1, magnitude * random.choice([-1, 1]), 0, 0, 1, 0),
Image.BICUBIC, fillcolor=fillcolor),
"shearY": lambda img, magnitude: img.transform(
img.size, Image.AFFINE, (1, 0, 0, magnitude * random.choice([-1, 1]), 1, 0),
Image.BICUBIC, fillcolor=fillcolor),
"translateX": lambda img, magnitude: img.transform(
img.size, Image.AFFINE, (1, 0, magnitude * img.size[0] * random.choice([-1, 1]), 0, 1, 0),
fillcolor=fillcolor),
"translateY": lambda img, magnitude: img.transform(
img.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude * img.size[1] * random.choice([-1, 1])),
fillcolor=fillcolor),
"rotate": lambda img, magnitude: rotate_with_fill(img, magnitude),
"color": lambda img, magnitude: ImageEnhance.Color(img).enhance(1 + magnitude * random.choice([-1, 1])),
"posterize": lambda img, magnitude: ImageOps.posterize(img, magnitude),
"solarize": lambda img, magnitude: ImageOps.solarize(img, magnitude),
"contrast": lambda img, magnitude: ImageEnhance.Contrast(img).enhance(
1 + magnitude * random.choice([-1, 1])),
"sharpness": lambda img, magnitude: ImageEnhance.Sharpness(img).enhance(
1 + magnitude * random.choice([-1, 1])),
"brightness": lambda img, magnitude: ImageEnhance.Brightness(img).enhance(
1 + magnitude * random.choice([-1, 1])),
"autocontrast": lambda img, magnitude: ImageOps.autocontrast(img),
"equalize": lambda img, magnitude: ImageOps.equalize(img),
"invert": lambda img, magnitude: ImageOps.invert(img)
}
self.p1 = p1
self.operation1 = func[operation1]
self.magnitude1 = ranges[operation1][magnitude_idx1]
self.p2 = p2
self.operation2 = func[operation2]
self.magnitude2 = ranges[operation2][magnitude_idx2]
def __call__(self, img):
if random.random() < self.p1: img = self.operation1(img, self.magnitude1)
if random.random() < self.p2: img = self.operation2(img, self.magnitude2)
return img
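A small sketch of applying the policy above to a PIL image, mirroring the `ToPIL()`/`ToNumpy()` wrapping used in `src/dataset.py`; the input image here is synthetic.

```python
import numpy as np
from PIL import Image
from src.autoaugment import ImageNetPolicy

policy = ImageNetPolicy()
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

augmented = policy(img)       # randomly picks one of the 25 sub-policies
augmented_3 = policy(img, 3)  # or force a specific sub-policy index
print(np.asarray(augmented).shape)
```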

@@ -0,0 +1,108 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""callbacks"""
import time
import numpy as np
from mindspore.train.callback import Callback
from mindspore.common.tensor import Tensor
class StateMonitor(Callback):
"""StateMonitor"""
def __init__(self, data_size, tot_batch_size=None, lrs=None,
eval_interval=None, eval_offset=None, eval_engine=None, logger=None):
super(StateMonitor, self).__init__()
self.data_size = data_size
self.tot_batch_size = tot_batch_size
self.lrs = lrs
self.epoch_num = 0
self.loss = 0
self.eval_interval = eval_interval
self.eval_offset = eval_offset
self.eval_engine = eval_engine
self.best_acc = -1
self.best_acc_top5 = -1
self.best_i2t_recall = -1
self.best_t2i_recall = -1
self.mean_fps = 0.0
self.print = print
if logger is not None:
self.print = logger
def step_end(self, run_context):
cb_params = run_context.original_args()
loss = cb_params.net_outputs
if isinstance(loss, (tuple, list)):
if isinstance(loss[0], Tensor) and isinstance(loss[0].asnumpy(), np.ndarray):
loss = loss[0]
if isinstance(loss, Tensor) and isinstance(loss.asnumpy(), np.ndarray):
loss = np.mean(loss.asnumpy())
self.loss = loss
def epoch_begin(self, run_context):
self.epoch_time = time.time()
def epoch_end(self, run_context):
epoch_seconds = (time.time() - self.epoch_time)
per_step_seconds = epoch_seconds / self.data_size
print_str = "epoch[{}]".format(self.epoch_num)
print_str += ', epoch time: {:.2f}s'.format(epoch_seconds)
print_str += ', per step time: {:.4f}s'.format(per_step_seconds)
print_str += ', loss={:.6f}'.format(self.loss)
if self.lrs is not None:
lr = self.lrs[(self.epoch_num + 1) * self.data_size - 1]
print_str += ', lr={:.6f}'.format(lr)
if self.tot_batch_size is not None:
fps = self.tot_batch_size * self.data_size / epoch_seconds
self.mean_fps = (self.mean_fps * self.epoch_num + fps) / (self.epoch_num + 1)
print_str += ', fps={:.2f}'.format(fps)
if (self.epoch_num + 1) % self.eval_interval == self.eval_offset:
eval_start = time.time()
self.eval_engine.eval()
output = self.eval_engine.get_result()
eval_seconds = time.time() - eval_start
if output is not None:
if isinstance(output, list):
print_str += ', top1 accuracy={:.6f}'.format(float(output[0]))
print_str += ', top5 accuracy={:.6f}'.format(float(output[1]))
print_str += ', i2t_recall={:.6f}'.format(float(output[2]))
print_str += ', t2i_recall={:.6f}'.format(float(output[3]))
print_str += ', eval_cost={:.2f}'.format(eval_seconds)
if float(output[0]) > self.best_acc:
self.best_acc = float(output[0])
if float(output[1]) > self.best_acc_top5:
self.best_acc_top5 = float(output[1])
if float(output[2]) > self.best_i2t_recall:
self.best_i2t_recall = float(output[2])
if float(output[3]) > self.best_t2i_recall:
self.best_t2i_recall = float(output[3])
else:
print_str += ', accuracy={:.6f}'.format(float(output))
print_str += ', eval_cost={:.2f}'.format(eval_seconds)
if float(output) > self.best_acc:
self.best_acc = float(output)
self.print(print_str)
self.epoch_num += 1
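A hedged sketch of how a monitor like the one above is typically attached to `Model.train`; the model, dataset, learning-rate array and eval engine names below are placeholders for objects built in `train.py`, not definitions from this file.

```python
from src.callback import StateMonitor  # assumed module path

# All names below (model, train_dataset, lr_array, eval_engine, args) are placeholders.
state_cb = StateMonitor(data_size=train_dataset.get_dataset_size(),
                        tot_batch_size=args.batch_size * args.device_num,
                        lrs=lr_array,
                        eval_interval=args.eval_interval,
                        eval_offset=args.eval_offset,
                        eval_engine=eval_engine,
                        logger=args.logger.info)

model.train(args.max_epoch, train_dataset, callbacks=[state_cb], dataset_sink_mode=True)
```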

@@ -0,0 +1,124 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""loss functions"""
from mindspore import nn
from mindspore import Tensor
from mindspore.common import dtype as mstype
try:
from mindspore.nn.loss.loss import Loss
except ImportError:
try:
from mindspore.nn.loss.loss import LossBase as Loss
except ImportError:
from mindspore.nn.loss.loss import _Loss as Loss
from mindspore.ops import functional as F
from mindspore.ops import operations as P
class CrossEntropySmooth(Loss):
"""CrossEntropy"""
def __init__(self, sparse=True, reduction='mean', smooth_factor=0., num_classes=1000, aux_factor=0.4):
super().__init__()
self.aux_factor = aux_factor
self.onehot = P.OneHot()
self.sparse = sparse
self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)
def construct(self, logits, label):
if isinstance(logits, tuple):
logit, aux_logit = logits
else:
logit, aux_logit = logits, None
if self.sparse:
label = self.onehot(label, F.shape(logit)[1], self.on_value, self.off_value)
loss = self.ce(logit, label)
if aux_logit is not None:
loss = loss + self.aux_factor * self.ce(aux_logit, label)
return loss
class CrossEntropySmoothMixup(Loss):
"""CrossEntropy"""
def __init__(self, reduction='mean', smooth_factor=0., num_classes=1000):
super().__init__()
self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
self.off_value = 1.0 * smooth_factor / (num_classes - 2)
self.cross_entropy = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)
def construct(self, logit, label):
off_label = P.Select()(P.Equal()(label, 0.0), \
P.Fill()(mstype.float32, P.Shape()(label), self.off_value), \
P.Fill()(mstype.float32, P.Shape()(label), 0.0))
label = self.on_value * label + off_label
loss = self.cross_entropy(logit, label)
return loss
class CrossEntropyIgnore(Loss):
"""CrossEntropyIgnore"""
def __init__(self, num_classes=21, ignore_label=255):
super().__init__()
self.one_hot = P.OneHot(axis=-1)
self.on_value = Tensor(1.0, mstype.float32)
self.off_value = Tensor(0.0, mstype.float32)
self.cast = P.Cast()
self.ce = nn.SoftmaxCrossEntropyWithLogits()
self.not_equal = P.NotEqual()
self.num_cls = num_classes
self.ignore_label = ignore_label
self.mul = P.Mul()
self.sum = P.ReduceSum(False)
self.div = P.RealDiv()
self.transpose = P.Transpose()
self.reshape = P.Reshape()
def construct(self, logits, labels):
labels_int = self.cast(labels, mstype.int32)
labels_int = self.reshape(labels_int, (-1,))
logits_ = self.transpose(logits, (0, 2, 3, 1))
logits_ = self.reshape(logits_, (-1, self.num_cls))
weights = self.not_equal(labels_int, self.ignore_label)
weights = self.cast(weights, mstype.float32)
one_hot_labels = self.one_hot(labels_int, self.num_cls, self.on_value, self.off_value)
loss = self.ce(logits_, one_hot_labels)
loss = self.mul(weights, loss)
loss = self.div(self.sum(loss), self.sum(weights))
return loss
def get_loss(loss_name, args):
"""get_loss"""
loss = None
if loss_name == 'ce_smooth':
loss = CrossEntropySmooth(smooth_factor=args.label_smooth_factor,
num_classes=args.class_num,
aux_factor=args.aux_factor)
elif loss_name == 'ce_smooth_mixup':
loss = CrossEntropySmoothMixup(smooth_factor=args.label_smooth_factor,
num_classes=args.class_num)
elif loss_name == 'ce_ignore':
loss = CrossEntropyIgnore(num_classes=args.class_num,
ignore_label=args.ignore_label)
else:
raise NotImplementedError
return loss
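As a quick numeric check of the label smoothing used above: with `smooth_factor=0.1` and `num_classes=1000`, the true class gets probability 0.9 and every other class 0.1/999, so the target still sums to 1. A NumPy-only sketch of the vector `CrossEntropySmooth` builds through `OneHot`:

```python
import numpy as np

smooth_factor, num_classes = 0.1, 1000
on_value = 1.0 - smooth_factor                  # 0.9 for the true class
off_value = smooth_factor / (num_classes - 1)   # ~0.0001 for every other class

label = 7                                       # hypothetical sparse label
target = np.full(num_classes, off_value, dtype=np.float32)
target[label] = on_value
print(target.sum())                             # ~1.0, a valid distribution
```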

@@ -0,0 +1,171 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""create train or eval dataset."""
import os
import warnings
from io import BytesIO
from PIL import Image
import numpy as np
import mindspore.common.dtype as mstype
import mindspore.dataset.engine as de
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.dataset.vision.py_transforms as P
from mindspore.dataset.vision.utils import Inter
from .autoaugment import ImageNetPolicy
warnings.filterwarnings("ignore", "(Possibly )?corrupt EXIF data", UserWarning)
class ToNumpy:
def __init__(self):
pass
def __call__(self, img):
return np.asarray(img)
def create_dataset(dataset_path,
do_train,
image_size=224,
interpolation='BILINEAR',
crop_min=0.05,
repeat_num=1,
batch_size=32,
num_workers=12,
autoaugment=False,
mixup=0.0,
num_classes=1001):
"""create_dataset"""
if hasattr(Inter, interpolation):
interpolation = getattr(Inter, interpolation)
else:
        print('cannot find interpolation_type: {}, use {} instead'.format(interpolation, 'BILINEAR'))
        interpolation = Inter.BILINEAR
device_num = int(os.getenv("RANK_SIZE", '1'))
rank_id = int(os.getenv('RANK_ID', '0'))
if do_train:
ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=num_workers, shuffle=True,
num_shards=device_num, shard_id=rank_id)
else:
batch_per_step = batch_size * device_num
print("eval batch per step: {}".format(batch_per_step))
if batch_per_step < 50000:
if 50000 % batch_per_step == 0:
num_padded = 0
else:
num_padded = batch_per_step - (50000 % batch_per_step)
else:
num_padded = batch_per_step - 50000
print("eval dataset num_padded: {}".format(num_padded))
if num_padded != 0:
# padded_with_decode
white_io = BytesIO()
Image.new('RGB', (image_size, image_size), (255, 255, 255)).save(white_io, 'JPEG')
padded_sample = {
'image': np.array(bytearray(white_io.getvalue()), dtype='uint8'),
'label': np.array(-1, np.int32)
}
sample = [padded_sample for x in range(num_padded)]
ds_pad = de.PaddedDataset(sample)
ds_imagefolder = de.ImageFolderDataset(dataset_path, num_parallel_workers=num_workers)
ds = ds_pad + ds_imagefolder
distribute_sampler = de.DistributedSampler(num_shards=device_num, shard_id=rank_id, \
shuffle=False, num_samples=None)
ds.use_sampler(distribute_sampler)
else:
ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=num_workers, \
shuffle=False, num_shards=device_num, shard_id=rank_id)
print("eval dataset size: {}".format(ds.get_dataset_size()))
mean = [0.485*255, 0.456*255, 0.406*255]
std = [0.229*255, 0.224*255, 0.225*255]
# define map operations
if do_train:
trans = [
C.RandomCropDecodeResize(image_size, scale=(crop_min, 1.0), \
ratio=(0.75, 1.333), interpolation=interpolation),
C.RandomHorizontalFlip(prob=0.5),
]
if autoaugment:
trans += [
P.ToPIL(),
ImageNetPolicy(),
ToNumpy(),
]
trans += [
C.Normalize(mean=mean, std=std),
C.HWC2CHW(),
]
else:
resize = int(int(image_size / 0.875 / 16 + 0.5) * 16)
print('eval, resize:{}'.format(resize))
trans = [
C.Decode(),
C.Resize(resize, interpolation=interpolation),
C.CenterCrop(image_size),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
type_cast_op = C2.TypeCast(mstype.int32)
ds = ds.repeat(repeat_num)
ds = ds.map(input_columns="image", num_parallel_workers=num_workers, operations=trans, python_multiprocessing=True)
ds = ds.map(input_columns="label", num_parallel_workers=num_workers, operations=type_cast_op)
if do_train and mixup > 0:
one_hot_encode = C2.OneHot(num_classes)
ds = ds.map(operations=one_hot_encode, input_columns=["label"])
ds = ds.batch(batch_size, drop_remainder=True)
if do_train and mixup > 0:
trans_mixup = C.MixUpBatch(alpha=mixup)
ds = ds.map(input_columns=["image", "label"], num_parallel_workers=num_workers, operations=trans_mixup)
return ds
def get_dataset(dataset_name, do_train, dataset_path, args):
"""get_dataset"""
if dataset_name == "imagenet":
if do_train:
data = create_dataset(dataset_path=dataset_path,
do_train=True,
image_size=args.train_image_size,
interpolation=args.interpolation,
autoaugment=args.autoaugment,
mixup=args.mixup,
crop_min=args.crop_min,
batch_size=args.batch_size,
num_workers=args.train_num_workers)
else:
data = create_dataset(dataset_path=dataset_path,
do_train=False,
image_size=args.eval_image_size,
interpolation=args.interpolation,
batch_size=args.eval_batch_size,
num_workers=args.eval_num_workers)
else:
raise NotImplementedError
return data
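A hedged sketch of building the evaluation pipeline with `create_dataset` above; the dataset path is a placeholder and the keyword values only illustrate typical settings, not the shipped yml configs.

```python
from src.dataset import create_dataset

eval_ds = create_dataset(dataset_path="/path/to/imagenet/val",  # placeholder path
                         do_train=False,
                         image_size=224,
                         interpolation="BICUBIC",
                         batch_size=32)
print(eval_ds.get_dataset_size())

for batch in eval_ds.create_dict_iterator(output_numpy=True, num_epochs=1):
    print(batch["image"].shape, batch["label"].shape)  # (32, 3, 224, 224), (32,)
    break
```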

@@ -0,0 +1,105 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""eval engine"""
from mindspore import Tensor
import mindspore.common.dtype as mstype
from src.metric import ClassifyCorrectWithCache, ClassifyCorrectCell, DistAccuracy
class BasicEvalEngine():
"""BasicEvalEngine"""
def __init__(self):
pass
@property
def metric(self):
return None
@property
def eval_network(self):
return None
def compile(self, sink_size=-1):
pass
def eval(self):
pass
def set_model(self, model):
self.model = model
def get_result(self):
return None
class ImageNetCacheEvelEngine(BasicEvalEngine):
"""ImageNetCacheEvelEngine"""
def __init__(self, net, eval_dataset, args):
super().__init__()
self.dist_eval_network = ClassifyCorrectWithCache(net, eval_dataset)
self.outputs = None
self.args = args
def compile(self, sink_size=-1):
index = Tensor(0, mstype.int32)
self.dist_eval_network.set_train(False)
self.dist_eval_network.compile(index)
def eval(self):
index = Tensor(0, mstype.int32)
output = self.dist_eval_network(index)
output = output.asnumpy() / 50000
self.outputs = {"acc": output}
def get_result(self):
return self.outputs["acc"]
class ImageNetEvelEngine(BasicEvalEngine):
"""ImageNetEvelEngine"""
def __init__(self, net, eval_dataset, args):
super().__init__()
self.eval_dataset = eval_dataset
self.dist_eval_network = ClassifyCorrectCell(net)
self.args = args
self.outputs = None
self.model = None
@property
def metric(self):
return {'acc': DistAccuracy(batch_size=self.args.eval_batch_size, device_num=self.args.device_num)}
@property
def eval_network(self):
return self.dist_eval_network
def eval(self):
self.outputs = self.model.eval(self.eval_dataset)
def get_result(self):
return self.outputs["acc"]
def get_eval_engine(engine_name, net, eval_dataset, args):
"""get_eval_engine"""
if engine_name == '':
eval_engine = BasicEvalEngine()
elif engine_name == "imagenet":
eval_engine = ImageNetEvelEngine(net, eval_dataset, args)
elif engine_name == "imagenet_cache":
eval_engine = ImageNetCacheEvelEngine(net, eval_dataset, args)
else:
raise NotImplementedError
return eval_engine

@@ -0,0 +1,80 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""logging"""
import logging
import os
import sys
from datetime import datetime
logger_name = 'mindspore-benchmark'
class LOGGER(logging.Logger):
"""
LOGGER
"""
def __init__(self, logger_name_local, rank=0):
super().__init__(logger_name_local)
self.log_fn = None
if rank % 8 == 0:
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s', "%Y-%m-%d %H:%M:%S")
console.setFormatter(formatter)
self.addHandler(console)
def setup_logging_file(self, log_dir, rank=0):
"""setup_logging_file"""
self.rank = rank
if not os.path.exists(log_dir):
os.makedirs(log_dir, exist_ok=True)
log_name = datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S') + '_rank_{}.log'.format(rank)
log_fn = os.path.join(log_dir, log_name)
fh = logging.FileHandler(log_fn)
fh.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s')
fh.setFormatter(formatter)
self.addHandler(fh)
self.log_fn = log_fn
def info(self, msg, *args, **kwargs):
"""info"""
if self.isEnabledFor(logging.INFO):
self._log(logging.INFO, msg, args, **kwargs)
def save_args(self, args):
"""save_args"""
self.info('Args:')
if isinstance(args, (list, tuple)):
for value in args:
message = '--> {}'.format(value)
self.info(message)
else:
if isinstance(args, dict):
args_dict = args
else:
args_dict = vars(args)
for key in args_dict.keys():
message = '--> {}: {}'.format(key, args_dict[key])
self.info(message)
self.info('')
def get_logger(path, rank=0):
"""get_logger"""
logger = LOGGER(logger_name, rank)
logger.setup_logging_file(path, rank)
return logger

@@ -0,0 +1,93 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""learning rate generator"""
import math
import numpy as np
def linear_warmup_lr(current_step, warmup_steps, base_lr, init_lr):
lr_inc = (float(base_lr) - float(init_lr)) / float(warmup_steps)
lr = float(init_lr) + lr_inc * current_step
return lr
def get_lr(global_step, lr_init, lr_end, lr_max, warmup_epochs, \
total_epochs, steps_per_epoch, lr_decay_mode, poly_power=2.0):
"""
generate learning rate array
Args:
        global_step(int): current step at which training (re)starts; the returned lr array is sliced from this step
lr_init(float): init learning rate
lr_end(float): end learning rate
lr_max(float): max learning rate
warmup_epochs(int): number of warmup epochs
total_epochs(int): total epoch of training
steps_per_epoch(int): steps of one epoch
        lr_decay_mode(string): learning rate decay mode, including steps, poly, cosine or default
        poly_power(float): exponent used when lr_decay_mode is 'poly'
Returns:
np.array, learning rate array
"""
lr_each_step = []
total_steps = steps_per_epoch * total_epochs
warmup_steps = int(steps_per_epoch * warmup_epochs)
if lr_decay_mode == 'steps':
decay_epoch_index = [0.3 * total_steps, 0.6 * total_steps, 0.8 * total_steps]
for i in range(total_steps):
if i < decay_epoch_index[0]:
lr = lr_max
elif i < decay_epoch_index[1]:
lr = lr_max * 0.1
elif i < decay_epoch_index[2]:
lr = lr_max * 0.01
else:
lr = lr_max * 0.001
lr_each_step.append(lr)
elif lr_decay_mode == 'poly':
if warmup_steps != 0:
inc_each_step = (float(lr_max) - float(lr_init)) / float(warmup_steps)
else:
inc_each_step = 0
for i in range(total_steps):
if i < warmup_steps:
lr = float(lr_init) + inc_each_step * float(i)
else:
base = (1.0 - (float(i) - float(warmup_steps)) / (float(total_steps) - float(warmup_steps)))
lr = float(lr_max - lr_end) * base ** poly_power + lr_end
lr = max(lr, 0.0)
lr_each_step.append(lr)
elif lr_decay_mode == 'cosine':
decay_steps = total_steps - warmup_steps
for i in range(total_steps):
if i < warmup_steps:
lr_inc = (float(lr_max) - float(lr_init)) / float(warmup_steps)
lr = float(lr_init) + lr_inc * (i + 1)
else:
cur_step = i + 1 - warmup_steps
lr = lr_max * (1 + math.cos(math.pi * cur_step / decay_steps)) / 2
lr_each_step.append(lr)
else:
for i in range(total_steps):
if i < warmup_steps:
lr = lr_init + (lr_max - lr_init) * i / warmup_steps
else:
lr = lr_max - (lr_max - lr_end) * (i - warmup_steps) / (total_steps - warmup_steps)
lr_each_step.append(lr)
current_step = global_step
lr_each_step = np.array(lr_each_step).astype(np.float32)
learning_rate = lr_each_step[current_step:]
return learning_rate
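A small usage sketch of `get_lr` above with cosine decay; the schedule values are illustrative, not the ones used in the shipped configs, and the import path is assumed.

```python
from src.lr_generator import get_lr  # assumed module path

# Hypothetical schedule: 5 warmup epochs, then cosine decay, 300 epochs total,
# 1251 steps per epoch (ImageNet with a global batch size of 1024 gives ~1251 steps).
lr = get_lr(global_step=0, lr_init=0.0, lr_end=0.0, lr_max=0.00355,
            warmup_epochs=5, total_epochs=300, steps_per_epoch=1251,
            lr_decay_mode='cosine')
print(lr.shape)         # (375300,) = 300 * 1251
print(lr[:3], lr[-3:])  # ramps up from ~0, decays back toward 0
```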

@@ -0,0 +1,115 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""metric"""
import numpy as np
from mindspore.communication.management import GlobalComm
from mindspore.ops import operations as P
import mindspore.nn as nn
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter
class ClassifyCorrectWithCache(nn.Cell):
"""ClassifyCorrectWithCache"""
def __init__(self, network, eval_dataset):
super(ClassifyCorrectWithCache, self).__init__(auto_prefix=False)
self._network = network
self.argmax = P.Argmax()
self.equal = P.Equal()
self.cast = P.Cast()
self.reduce_sum = P.ReduceSum()
self.allreduce = P.AllReduce(P.ReduceOp.SUM, GlobalComm.WORLD_COMM_GROUP)
self.assign_add = P.AssignAdd()
self.assign = P.Assign()
self._correct_num = Parameter(Tensor(0.0, mstype.float32), name="correct_num", requires_grad=False)
# save data to parameter
pdata = []
plabel = []
step_num = 0
for batch in eval_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
pdata.append(batch["image"])
plabel.append(batch["label"])
step_num = step_num + 1
pdata = Tensor(np.array(pdata), mstype.float32)
plabel = Tensor(np.array(plabel), mstype.int32)
self._data = Parameter(pdata, name="pdata", requires_grad=False)
self._label = Parameter(plabel, name="plabel", requires_grad=False)
self._step_num = Tensor(step_num, mstype.int32)
def construct(self, index):
self._correct_num = 0
while index < self._step_num:
data = self._data[index]
label = self._label[index]
outputs = self._network(data)
y_pred = self.argmax(outputs)
y_pred = self.cast(y_pred, mstype.int32)
y_correct = self.equal(y_pred, label)
y_correct = self.cast(y_correct, mstype.float32)
y_correct_sum = self.reduce_sum(y_correct)
self._correct_num += y_correct_sum #self.assign(self._correct_num, y_correct_sum)
index = index + 1
total_correct = self.allreduce(self._correct_num)
return total_correct
class ClassifyCorrectCell(nn.Cell):
"""ClassifyCorrectCell"""
def __init__(self, network):
super(ClassifyCorrectCell, self).__init__(auto_prefix=False)
self._network = network
self.argmax = P.Argmax()
self.equal = P.Equal()
self.cast = P.Cast()
self.reduce_sum = P.ReduceSum()
self.allreduce = P.AllReduce(P.ReduceOp.SUM, GlobalComm.WORLD_COMM_GROUP)
def construct(self, data, label):
outputs = self._network(data)
y_pred = self.argmax(outputs)
y_pred = self.cast(y_pred, mstype.int32)
y_correct = self.equal(y_pred, label)
y_correct = self.cast(y_correct, mstype.float32)
y_correct = self.reduce_sum(y_correct)
total_correct = self.allreduce(y_correct)
return (total_correct,)
class DistAccuracy(nn.Metric):
"""DistAccuracy"""
def __init__(self, batch_size, device_num):
super(DistAccuracy, self).__init__()
self.clear()
self.batch_size = batch_size
self.device_num = device_num
def clear(self):
self._correct_num = 0
self._total_num = 0
def update(self, *inputs):
if len(inputs) != 1:
raise ValueError('Distribute accuracy needs 1 input (y_correct), but got {}'.format(len(inputs)))
y_correct = self._convert_data(inputs[0])
self._correct_num += y_correct
self._total_num += self.batch_size * self.device_num
def eval(self):
if self._total_num == 0:
raise RuntimeError('Accuracy can not be calculated, because the number of samples is 0.')
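        # Note: the denominator is fixed to the 50,000 ImageNet val images rather than
        # self._total_num, so samples padded for even sharding (label -1, never predicted)
        # do not distort the accuracy.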
return self._correct_num / 50000

@@ -0,0 +1,129 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Parse arguments"""
import os
import ast
import argparse
from pprint import pprint, pformat
import yaml
class Config:
"""
Configuration namespace. Convert dictionary to members.
"""
def __init__(self, cfg_dict):
for k, v in cfg_dict.items():
if isinstance(v, (list, tuple)):
setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
else:
setattr(self, k, Config(v) if isinstance(v, dict) else v)
def __str__(self):
return pformat(self.__dict__)
def __repr__(self):
return self.__str__()
def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="default_config.yaml"):
"""
Parse command line arguments to the configuration according to the default yaml.
Args:
parser: Parent parser.
cfg: Base configuration.
helper: Helper description.
cfg_path: Path to the default yaml config.
"""
parser = argparse.ArgumentParser(description="[REPLACE THIS at config.py]",
parents=[parser])
helper = {} if helper is None else helper
choices = {} if choices is None else choices
for item in cfg:
if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
help_description = helper[item] if item in helper else "Please reference to {}".format(cfg_path)
choice = choices[item] if item in choices else None
if isinstance(cfg[item], bool):
parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice,
help=help_description)
else:
parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice,
help=help_description)
args = parser.parse_args()
return args
def parse_yaml(yaml_path):
"""
Parse the yaml config file.
Args:
yaml_path: Path to the yaml config.
"""
with open(yaml_path, 'r') as fin:
try:
cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
cfgs = [x for x in cfgs]
if len(cfgs) == 1:
cfg_helper = {}
cfg = cfgs[0]
cfg_choices = {}
elif len(cfgs) == 2:
cfg, cfg_helper = cfgs
cfg_choices = {}
elif len(cfgs) == 3:
cfg, cfg_helper, cfg_choices = cfgs
else:
raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml")
print(cfg_helper)
except:
raise ValueError("Failed to parse yaml")
return cfg, cfg_helper, cfg_choices
def merge(args, cfg):
"""
Merge the base config from yaml file and command line arguments.
Args:
args: Command line arguments.
cfg: Base configuration.
"""
args_var = vars(args)
for item in args_var:
cfg[item] = args_var[item]
return cfg
def get_config():
"""
Get Config according to the yaml file and cli arguments.
"""
parser = argparse.ArgumentParser(description="default name", add_help=False)
current_dir = os.path.dirname(os.path.abspath(__file__))
parser.add_argument("--config_path", type=str,
default=os.path.join(current_dir, "../../vit_patch32_imagenet2012_config.yml"),
help="Config file path")
path_args, _ = parser.parse_known_args()
default, helper, choices = parse_yaml(path_args.config_path)
pprint(default)
config_path = path_args.config_path
args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=config_path)
final_config = merge(args, default)
return Config(final_config)
config = get_config()
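For reference, a sketch of the multi-document yml layout that `parse_yaml` above accepts (base config, per-option help text, per-option choices); the option names are placeholders.

```python
import yaml

example_yaml = """
batch_size: 256
device_target: "Ascend"
---
batch_size: "batch size per device"
device_target: "target device to run on"
---
device_target: ["Ascend", "GPU"]
"""

cfg, cfg_helper, cfg_choices = list(yaml.load_all(example_yaml, Loader=yaml.FullLoader))
print(cfg["batch_size"], cfg_choices["device_target"])
```

Each scalar option in the first document becomes a command-line flag via `parse_cli_to_yaml`, so a value such as `batch_size` can be overridden with `--batch_size=128` at launch.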

@@ -0,0 +1,27 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Device adapter for ModelArts"""
from .config import config
if config.enable_modelarts:
from .moxing_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
else:
from .local_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
__all__ = [
"get_device_id", "get_device_num", "get_rank_id", "get_job_id"
]

@@ -0,0 +1,36 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Local adapter"""
import os
def get_device_id():
device_id = os.getenv('DEVICE_ID', '0')
return int(device_id)
def get_device_num():
device_num = os.getenv('RANK_SIZE', '1')
return int(device_num)
def get_rank_id():
global_rank_id = os.getenv('RANK_ID', '0')
return int(global_rank_id)
def get_job_id():
return "Local Job"

@@ -0,0 +1,115 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Moxing adapter for ModelArts"""
import os
import functools
from mindspore import context
from .config import config
_global_sync_count = 0
def get_device_id():
device_id = os.getenv('DEVICE_ID', '0')
return int(device_id)
def get_device_num():
device_num = os.getenv('RANK_SIZE', '1')
return int(device_num)
def get_rank_id():
global_rank_id = os.getenv('RANK_ID', '0')
return int(global_rank_id)
def get_job_id():
job_id = os.getenv('JOB_ID')
job_id = job_id if job_id != "" else "default"
return job_id
def sync_data(from_path, to_path):
"""
Download data from remote obs to local directory if the first url is remote url and the second one is local path
Upload data from local directory to remote obs in contrast.
"""
import moxing as mox
import time
global _global_sync_count
sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
_global_sync_count += 1
    # Each server contains 8 devices at most.
if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
print("from path: ", from_path)
print("to path: ", to_path)
mox.file.copy_parallel(from_path, to_path)
print("===finish data synchronization===")
try:
os.mknod(sync_lock)
except IOError:
pass
print("===save flag===")
while True:
if os.path.exists(sync_lock):
break
time.sleep(1)
print("Finish sync data from {} to {}.".format(from_path, to_path))
def moxing_wrapper(pre_process=None, post_process=None):
"""
Moxing wrapper to download dataset and upload outputs.
"""
def wrapper(run_func):
@functools.wraps(run_func)
def wrapped_func(*args, **kwargs):
# Download data from data_url
if config.enable_modelarts:
if config.data_url:
sync_data(config.data_url, config.data_path)
print("Dataset downloaded: ", os.listdir(config.data_path))
if config.checkpoint_url:
sync_data(config.checkpoint_url, config.load_path)
print("Preload downloaded: ", os.listdir(config.load_path))
if config.train_url:
sync_data(config.train_url, config.output_path)
print("Workspace downloaded: ", os.listdir(config.output_path))
context.set_context(save_graphs_path=os.path.join(config.output_path, str(get_rank_id())))
config.device_num = get_device_num()
config.device_id = get_device_id()
if not os.path.exists(config.output_path):
os.makedirs(config.output_path)
if pre_process:
pre_process()
run_func(*args, **kwargs)
# Upload data to train_url
if config.enable_modelarts:
if post_process:
post_process()
if config.train_url:
print("Start to copy output directory")
sync_data(config.output_path, config.train_url)
return wrapped_func
return wrapper

View File

@ -0,0 +1,214 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Gradient clipping wrapper for optimizers."""
import numpy as np
from mindspore._checkparam import Validator as validator
from mindspore.ops import functional as F
from mindspore.ops import operations as P
from mindspore.ops import composite as C
from mindspore.common import dtype as mstype
from mindspore.common.initializer import initializer
from mindspore.common.parameter import Parameter
from mindspore.common.tensor import Tensor
from mindspore._checkparam import Rel
from mindspore.nn.optim import Optimizer
from mindspore.nn.optim.optimizer import opt_init_args_register
def _check_param_value(beta1, beta2, eps, prim_name):
"""Check the type of inputs."""
validator.check_value_type("beta1", beta1, [float], prim_name)
validator.check_value_type("beta2", beta2, [float], prim_name)
validator.check_value_type("eps", eps, [float], prim_name)
validator.check_float_range(beta1, 0.0, 1.0, Rel.INC_NEITHER, "beta1", prim_name)
validator.check_float_range(beta2, 0.0, 1.0, Rel.INC_NEITHER, "beta2", prim_name)
validator.check_positive_float(eps, "eps", prim_name)
_grad_scale = C.MultitypeFuncGraph("grad_scale")
op_mul = P.Mul()
map_ = C.Map()
@_grad_scale.register("Number", "Tensor")
def tensor_grad_scale(scale, grad):
"""Get grad with scale."""
if scale == 1.0:
return grad
return op_mul(grad, F.cast(scale, F.dtype(grad)))
@_grad_scale.register("Tensor", "Tensor")
def tensor_grad_scale_with_tensor(scale, grad):
"""Get grad with scale."""
return op_mul(grad, F.cast(scale, F.dtype(grad)))
def scale_grad(gradients, reciprocal_scale):
gradients = map_(F.partial(_grad_scale, reciprocal_scale), gradients)
return gradients
_adam_opt = C.MultitypeFuncGraph("adam_opt")
_scaler_one = Tensor(1, mstype.int32)
@_adam_opt.register("Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Tensor", "Number", "Tensor", "Tensor", "Tensor",
"Tensor", "Bool", "Bool")
def _update_run_op(beta1_power, beta2_power, beta1, beta2, eps, lr, weight_decay, param, \
m, v, gradient, decay_flag, optim_filter):
"""
Update parameters.
Args:
beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
lr (Tensor): Learning rate.
weight_decay (Number): Weight decay. Should be equal to or greater than 0.
param (Tensor): Parameters.
m (Tensor): m value of parameters.
v (Tensor): v value of parameters.
gradient (Tensor): Gradient of parameters.
decay_flag (bool): Applies weight decay or not.
optim_filter (bool): Applies parameter update or not.
Returns:
Tensor, the new value of v after updating.
"""
if optim_filter:
# op_mul = P.Mul() is already defined at module level
op_square = P.Square()
op_sqrt = P.Sqrt()
op_cast = P.Cast()
op_reshape = P.Reshape()
op_shape = P.Shape()
param_fp32 = op_cast(param, mstype.float32)
m_fp32 = op_cast(m, mstype.float32)
v_fp32 = op_cast(v, mstype.float32)
gradient_fp32 = op_cast(gradient, mstype.float32)
next_m = op_mul(beta1, m_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
- beta1, gradient_fp32)
next_v = op_mul(beta2, v_fp32) + op_mul(op_cast(F.tuple_to_array((1.0,)), mstype.float32)
- beta2, op_square(gradient_fp32))
regulate_m = next_m / (_scaler_one - beta1_power)
regulate_v = next_v / (_scaler_one - beta2_power)
update = regulate_m / (eps + op_sqrt(regulate_v))
if decay_flag:
update = op_mul(weight_decay, param_fp32) + update
update_with_lr = op_mul(lr, update)
next_param = param_fp32 - op_reshape(update_with_lr, op_shape(param_fp32))
next_param = F.depend(next_param, F.assign(param, op_cast(next_param, F.dtype(param))))
next_param = F.depend(next_param, F.assign(m, op_cast(next_m, F.dtype(m))))
next_param = F.depend(next_param, F.assign(v, op_cast(next_v, F.dtype(v))))
return op_cast(next_param, F.dtype(param))
return gradient
class AdamW(Optimizer):
"""
Implements the AdamWeightDecay optimizer with optional gradient clipping by global norm.
"""
@opt_init_args_register
def __init__(self, params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, \
weight_decay=0.0, loss_scale=1.0, clip=False):
super(AdamW, self).__init__(learning_rate, params, weight_decay)
_check_param_value(beta1, beta2, eps, self.cls_name)
self.beta1 = Tensor(np.array([beta1]).astype(np.float32))
self.beta2 = Tensor(np.array([beta2]).astype(np.float32))
self.eps = Tensor(np.array([eps]).astype(np.float32))
self.moments1 = self.parameters.clone(prefix="adam_m", init='zeros')
self.moments2 = self.parameters.clone(prefix="adam_v", init='zeros')
self.hyper_map = C.HyperMap()
self.beta1_power = Parameter(initializer(1, [1], mstype.float32), name="beta1_power")
self.beta2_power = Parameter(initializer(1, [1], mstype.float32), name="beta2_power")
self.reciprocal_scale = Tensor(1.0 / loss_scale, mstype.float32)
self.clip = clip
def construct(self, gradients):
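# One optimizer step: unscale the gradients, optionally clip them by global norm,
# advance the running beta powers, then apply the fused AdamW update to every
# parameter group (or to all parameters when no grouping is used).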
lr = self.get_lr()
gradients = scale_grad(gradients, self.reciprocal_scale)
if self.clip:
gradients = C.clip_by_global_norm(gradients, 5.0, None)
beta1_power = self.beta1_power * self.beta1
self.beta1_power = beta1_power
beta2_power = self.beta2_power * self.beta2
self.beta2_power = beta2_power
if self.is_group:
if self.is_group_lr:
optim_result = self.hyper_map(F.partial(_adam_opt, beta1_power, beta2_power, \
self.beta1, self.beta2, self.eps),
lr, self.weight_decay, self.parameters, self.moments1, self.moments2,
gradients, self.decay_flags, self.optim_filter)
else:
optim_result = self.hyper_map(F.partial(_adam_opt, beta1_power, beta2_power, \
self.beta1, self.beta2, self.eps, lr),
self.weight_decay, self.parameters, self.moments1, self.moments2,
gradients, self.decay_flags, self.optim_filter)
else:
optim_result = self.hyper_map(F.partial(_adam_opt, beta1_power, beta2_power, self.beta1, self.beta2, \
self.eps, lr, self.weight_decay),
self.parameters, self.moments1, self.moments2,
gradients, self.decay_flags, self.optim_filter)
if self.use_parallel:
self.broadcast_params(optim_result)
return optim_result
def parameter_group(network, weight_decay, no_weight_decay_filter, gc_flag):
"""Group trainable parameters by weight-decay policy."""
filter_len = len(no_weight_decay_filter)
if filter_len > 0:
decayed_params = []
no_decayed_params = []
for param in network.trainable_params():
if all([key not in param.name for key in no_weight_decay_filter]):
decayed_params.append(param)
else:
no_decayed_params.append(param)
group_params = [{'params': decayed_params, 'weight_decay': weight_decay, 'grad_centralization': gc_flag},
{'params': no_decayed_params},
{'order_params': network.trainable_params()}]
else:
group_params = [{'params': network.trainable_params(), \
'weight_decay': weight_decay, 'grad_centralization': gc_flag},
{'order_params': network.trainable_params()}]
return group_params
def get_optimizer(optimizer_name, network, lrs, args):
no_weight_decay_filter = [x for x in args.no_weight_decay_filter.split(",") if len(x) > 0]
group_params = parameter_group(network, args.weight_decay, no_weight_decay_filter, bool(args.gc_flag))
if optimizer_name == 'adamw':
opt = AdamW(group_params, lrs, args.beta1, args.beta2, loss_scale=args.loss_scale)
else:
raise NotImplementedError
return opt, group_params

View File

@ -0,0 +1,26 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""set_loglevel"""
import os
def set_loglevel(level='info'):
print('set device global log level to {}'.format(level))
os.system('/usr/local/Ascend/driver/tools/msnpureport -g {}'.format(level))
os.system('/usr/local/Ascend/driver/tools/msnpureport -g {} -d 4'.format(level))
event_log_level = 'enable' if level in ['info', 'debug'] else 'disable'
print('set device event log level to {}'.format(event_log_level))
os.system('/usr/local/Ascend/driver/tools/msnpureport -e {}'.format(event_log_level))
os.system('/usr/local/Ascend/driver/tools/msnpureport -e {} -d 4'.format(event_log_level))

View File

@ -0,0 +1,507 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Vision Transformer implementation."""
from importlib import import_module
from easydict import EasyDict as edict
import numpy as np
import mindspore
from mindspore.common.initializer import initializer
from mindspore.common.parameter import Parameter
from mindspore.nn import Cell, Dense, Dropout, SequentialCell
from mindspore.ops import operations as P
import mindspore.common.dtype as mstype
from mindspore import Tensor
MIN_NUM_PATCHES = 4
class VitConfig:
"""
VitConfig
"""
def __init__(self, configs):
self.configs = configs
# network init
self.network_norm = mindspore.nn.LayerNorm((configs.normalized_shape,))
self.network_init = mindspore.common.initializer.Normal(sigma=1.0)
self.network_dropout_rate = 0.1
self.network_pool = 'cls'
self.network = ViT
# stem
self.stem_init = mindspore.common.initializer.XavierUniform()
self.stem = VitStem
# body
self.body_norm = mindspore.nn.LayerNorm
self.body_drop_path_rate = 0.1
self.body = Transformer
# body attention
self.attention_init = mindspore.common.initializer.XavierUniform()
self.attention_activation = mindspore.nn.Softmax()
self.attention_dropout_rate = 0.1
self.attention = Attention
# body feedforward
self.feedforward_init = mindspore.common.initializer.XavierUniform()
self.feedforward_activation = mindspore.nn.GELU()
self.feedforward_dropout_rate = 0.1
self.feedforward = FeedForward
# head
self.head = origin_head
self.head_init = mindspore.common.initializer.XavierUniform()
self.head_dropout_rate = 0.1
self.head_norm = mindspore.nn.LayerNorm((configs.normalized_shape,))
self.head_activation = mindspore.nn.GELU()
class DropPath(Cell):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
"""
def __init__(self, drop_prob=None, seed=0):
super(DropPath, self).__init__()
self.keep_prob = 1 - drop_prob
seed = min(seed, 0)  # force the seed to 0
self.rand = P.UniformReal(seed=seed)  # seed must be 0; any other value makes the op repeat the same values across calls
self.shape = P.Shape()
self.floor = P.Floor()
self.print = P.Print()
def construct(self, x):
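# Stochastic depth: during training keep each sample's residual branch with
# probability keep_prob (floor(uniform + keep_prob) equals 1 with that probability)
# and rescale kept activations by 1 / keep_prob so the expected value is unchanged.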
if self.training:
x_shape = self.shape(x) # B N C
random_tensor = self.rand((x_shape[0], 1, 1))
random_tensor = random_tensor + self.keep_prob
random_tensor = self.floor(random_tensor)
x = x / self.keep_prob
x = x * random_tensor
return x
class BatchDense(Cell):
"""BatchDense module."""
def __init__(self, in_features, out_features, initialization, has_bias=True):
super().__init__()
self.out_features = out_features
self.dense = Dense(in_features, out_features, has_bias=has_bias)
self.dense.weight.set_data(initializer(initialization, [out_features, in_features]))
self.reshape = P.Reshape()
def construct(self, x):
bs, seq_len, d_model = x.shape
out = self.reshape(x, (bs * seq_len, d_model))
out = self.dense(out)
out = self.reshape(out, (bs, seq_len, self.out_features))
return out
class ResidualCell(Cell):
"""Cell which implements x + f(x) function."""
def __init__(self, cell):
super().__init__()
self.cell = cell
def construct(self, x, **kwargs):
return self.cell(x, **kwargs) + x
def pretrain_head(vit_config):
"""Head for ViT pretraining."""
d_model = vit_config.configs.d_model
mlp_dim = vit_config.configs.mlp_dim
num_classes = vit_config.configs.num_classes
dropout_rate = vit_config.head_dropout_rate
initialization = vit_config.head_init
normalization = vit_config.head_norm
activation = vit_config.head_activation
dense1 = Dense(d_model, mlp_dim)
dense1.weight.set_data(initializer(initialization, [mlp_dim, d_model]))
dense2 = Dense(mlp_dim, num_classes)
dense2.weight.set_data(initializer(initialization, [num_classes, mlp_dim]))
return SequentialCell([
normalization,
dense1,
activation,
Dropout(keep_prob=(1. - dropout_rate)),
dense2])
def origin_head(vit_config):
"""Head for ViT pretraining."""
d_model = vit_config.configs.d_model
num_classes = vit_config.configs.num_classes
initialization = vit_config.head_init
dense = Dense(d_model, num_classes)
dense.weight.set_data(initializer(initialization, [num_classes, d_model]))
return SequentialCell([dense])
class VitStem(Cell):
"""Stem layer for ViT."""
def __init__(self, vit_config):
super().__init__()
d_model = vit_config.configs.d_model
patch_size = vit_config.configs.patch_size
image_size = vit_config.configs.image_size
initialization = vit_config.stem_init
channels = 3
assert image_size % patch_size == 0, 'Image dimensions must be divisible by the patch size.'
num_patches = (image_size // patch_size) ** 2
assert num_patches > MIN_NUM_PATCHES, f'your number of patches {num_patches} is too small'
patch_dim = channels * patch_size ** 2
self.patch_size = patch_size
self.reshape = P.Reshape()
self.transpose = P.Transpose()
self.patch_to_embedding = BatchDense(patch_dim, d_model, initialization, has_bias=True)
def construct(self, img):
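# Rearrange the image into non-overlapping p x p patches:
# (bs, c, h, w) -> (bs, (h/p)*(w/p), c*p*p), then project every flattened patch
# to d_model with a shared Dense layer.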
p = self.patch_size
bs, channels, h, w = img.shape
x = self.reshape(img, (bs, channels, h // p, p, w // p, p))
x = self.transpose(x, (0, 2, 4, 1, 3, 5))
x = self.reshape(x, (bs, (h//p)*(w//p), channels*p*p))
x = self.patch_to_embedding(x)
return x
class ViT(Cell):
"""Vision Transformer implementation."""
def __init__(self, vit_config):
super().__init__()
d_model = vit_config.configs.d_model
patch_size = vit_config.configs.patch_size
image_size = vit_config.configs.image_size
initialization = vit_config.network_init
pool = vit_config.network_pool
dropout_rate = vit_config.network_dropout_rate
norm = vit_config.network_norm
stem = vit_config.stem(vit_config)
body = vit_config.body(vit_config)
head = vit_config.head(vit_config)
assert pool in {'cls', 'mean'}, 'pool type must be either cls or mean'
num_patches = (image_size // patch_size) ** 2
if pool == "cls":
self.cls_token = Parameter(initializer(initialization, (1, 1, d_model)),
name='cls', requires_grad=True)
self.pos_embedding = Parameter(initializer(initialization, (1, num_patches + 1, d_model)),
name='pos_embedding', requires_grad=True)
self.tile = P.Tile()
self.cat_1 = P.Concat(axis=1)
else:
self.pos_embedding = Parameter(initializer(initialization, (1, num_patches, d_model)),
name='pos_embedding', requires_grad=True)
self.mean = P.ReduceMean(keep_dims=False)
self.pool = pool
self.cast = P.Cast()
self.dropout = Dropout(keep_prob=(1. - dropout_rate))
self.stem = stem
self.body = body
self.head = head
self.norm = norm
def construct(self, img):
x = self.stem(img)
bs, seq_len, _ = x.shape
if self.pool == "cls":
cls_tokens = self.tile(self.cls_token, (bs, 1, 1))
x = self.cat_1((cls_tokens, x)) # now x has shape = (bs, seq_len+1, d)
x += self.pos_embedding[:, :(seq_len + 1)]
else:
x += self.pos_embedding[:, :seq_len]
y = self.cast(x, mstype.float32)
y = self.dropout(y)
x = self.cast(y, x.dtype)
x = self.body(x)
if self.norm is not None:
x = self.norm(x)
if self.pool == "cls":
x = x[:, 0]
else:
x = self.mean(x, (-2,))
return self.head(x)
class Attention(Cell):
"""Attention layer implementation."""
def __init__(self, vit_config):
super().__init__()
d_model = vit_config.configs.d_model
dim_head = vit_config.configs.dim_head
heads = vit_config.configs.heads
initialization = vit_config.attention_init
activation = vit_config.attention_activation
dropout_rate = vit_config.attention_dropout_rate
inner_dim = heads * dim_head
self.dim_head = dim_head
self.heads = heads
self.scale = Tensor([dim_head ** -0.5])
self.to_q = Dense(d_model, inner_dim, has_bias=True)
self.to_q.weight.set_data(initializer(initialization, [inner_dim, d_model]))
self.to_k = Dense(d_model, inner_dim, has_bias=True)
self.to_k.weight.set_data(initializer(initialization, [inner_dim, d_model]))
self.to_v = Dense(d_model, inner_dim, has_bias=True)
self.to_v.weight.set_data(initializer(initialization, [inner_dim, d_model]))
self.to_out = Dense(inner_dim, d_model, has_bias=True)
self.to_out.weight.set_data(initializer(initialization, [d_model, inner_dim]))  # Dense weight shape is (out_features, in_features)
self.dropout = Dropout(1 - dropout_rate)
self.activation = activation
#auxiliary functions
self.reshape = P.Reshape()
self.transpose = P.Transpose()
self.cast = P.Cast()
self.mul = P.Mul()
self.q_matmul_k = P.BatchMatMul(transpose_b=True)
self.attn_matmul_v = P.BatchMatMul()
self.softmax_nz = True
def construct(self, x):
'''x shape: (bs, seq_len, d_model)'''
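# Multi-head scaled dot-product attention: softmax(q @ k^T * dim_head**-0.5) @ v
# per head. In the softmax_nz branch q is scaled (in float32) before the matmul;
# otherwise the attention scores are scaled after it.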
bs, seq_len, d_model, h, d = x.shape[0], x.shape[1], x.shape[2], self.heads, self.dim_head
x_2d = self.reshape(x, (-1, d_model))
q, k, v = self.to_q(x_2d), self.to_k(x_2d), self.to_v(x_2d)
if self.softmax_nz:
q = self.reshape(q, (bs, seq_len, h, d))
q = self.transpose(q, (0, 2, 1, 3))
q = self.cast(q, mstype.float32)
q = self.mul(q, self.scale)
k = self.reshape(k, (bs, seq_len, h, d))
k = self.transpose(k, (0, 2, 1, 3))
v = self.reshape(v, (bs, seq_len, h, d))
v = self.transpose(v, (0, 2, 1, 3))
q = self.cast(q, k.dtype)
attn_scores = self.q_matmul_k(q, k) #bs x h x seq_len x seq_len
attn_scores = self.cast(attn_scores, x.dtype)
attn_scores = self.activation(attn_scores)
else:
q = self.reshape(q, (bs, seq_len, h, d))
q = self.transpose(q, (0, 2, 1, 3))
k = self.reshape(k, (bs, seq_len, h, d))
k = self.transpose(k, (0, 2, 1, 3))
v = self.reshape(v, (bs, seq_len, h, d))
v = self.transpose(v, (0, 2, 1, 3))
attn_scores = self.q_matmul_k(q, k) #bs x h x seq_len x seq_len
attn_scores = self.cast(attn_scores, mstype.float32)
attn_scores = self.mul(attn_scores, self.scale)
attn_scores = self.cast(attn_scores, x.dtype)
attn_scores = self.activation(attn_scores)
out = self.attn_matmul_v(attn_scores, v) #bs x h x seq_len x dim_head
out = self.transpose(out, (0, 2, 1, 3))
out = self.reshape(out, (bs*seq_len, h*d))
out = self.to_out(out)
out = self.reshape(out, (bs, seq_len, d_model))
#out = self.dropout(out)
y = self.cast(out, mstype.float32)
y = self.dropout(y)
out = self.cast(y, out.dtype)
#out = self.reshape(out, (bs, seq_len, d_model))
return out
class FeedForward(Cell):
"""FeedForward layer implementation."""
def __init__(self, vit_config):
super().__init__()
d_model = vit_config.configs.d_model
hidden_dim = vit_config.configs.mlp_dim
initialization = vit_config.feedforward_init
activation = vit_config.feedforward_activation
dropout_rate = vit_config.feedforward_dropout_rate
self.ff1 = BatchDense(d_model, hidden_dim, initialization)
self.activation = activation
self.dropout = Dropout(keep_prob=1.-dropout_rate)
self.ff2 = BatchDense(hidden_dim, d_model, initialization)
self.cast = P.Cast()
def construct(self, x):
y = self.ff1(x)
y = self.cast(y, mstype.float32)
y = self.activation(y)
y = self.dropout(y)
y = self.cast(y, x.dtype)
y = self.ff2(y)
y = self.cast(y, mstype.float32)
y = self.dropout(y)
y = self.cast(y, x.dtype)
return y
class Transformer(Cell):
"""Transformer implementation."""
def __init__(self, vit_config):
super().__init__()
depth = vit_config.configs.depth
drop_path_rate = vit_config.body_drop_path_rate
dpr = [x.item() for x in np.linspace(0, drop_path_rate, depth)]
att_seeds = [np.random.randint(1024) for _ in range(depth)]
mlp_seeds = [np.random.randint(1024) for _ in range(depth)]
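# Drop-path probability grows linearly from 0 in the first block to
# body_drop_path_rate in the last one; each DropPath cell gets its own random seed.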
layers = []
for i in range(depth):
normalization = vit_config.body_norm((vit_config.configs.normalized_shape,))
normalization2 = vit_config.body_norm((vit_config.configs.normalized_shape,))
attention = vit_config.attention(vit_config)
feedforward = vit_config.feedforward(vit_config)
if drop_path_rate > 0:
layers.append(
SequentialCell([
ResidualCell(SequentialCell([normalization,
attention,
DropPath(dpr[i], att_seeds[i])])),
ResidualCell(SequentialCell([normalization2,
feedforward,
DropPath(dpr[i], mlp_seeds[i])]))
])
)
else:
layers.append(
SequentialCell([
ResidualCell(SequentialCell([normalization,
attention])),
ResidualCell(SequentialCell([normalization2,
feedforward]))
])
)
self.layers = SequentialCell(layers)
def construct(self, x):
return self.layers(x)
def load_function(func_name):
"""Load function using its name."""
modules = func_name.split(".")
if len(modules) > 1:
module_path = ".".join(modules[:-1])
name = modules[-1]
module = import_module(module_path)
return getattr(module, name)
return func_name
vit_cfg = edict({
'd_model': 768,
'depth': 12,
'heads': 12,
'mlp_dim': 3072,
'dim_head': 64,
'patch_size': 32,
'normalized_shape': 768,
'image_size': 224,
'num_classes': 1001,
})
def vit_base_patch16(args):
"""vit_base_patch16"""
vit_cfg.d_model = 768
vit_cfg.depth = 12
vit_cfg.heads = 12
vit_cfg.mlp_dim = 3072
vit_cfg.dim_head = vit_cfg.d_model // vit_cfg.heads
vit_cfg.patch_size = 16
vit_cfg.normalized_shape = vit_cfg.d_model
vit_cfg.image_size = args.train_image_size
vit_cfg.num_classes = args.class_num
if args.vit_config_path != '':
print("get vit_config_path")
vit_config = load_function(args.vit_config_path)(vit_cfg)
else:
print("get default_vit_cfg")
vit_config = VitConfig(vit_cfg)
model = vit_config.network(vit_config)
return model
def vit_base_patch32(args):
"""vit_base_patch32"""
vit_cfg.d_model = 768
vit_cfg.depth = 12
vit_cfg.heads = 12
vit_cfg.mlp_dim = 3072
vit_cfg.dim_head = vit_cfg.d_model // vit_cfg.heads
vit_cfg.patch_size = 32
vit_cfg.normalized_shape = vit_cfg.d_model
vit_cfg.image_size = args.train_image_size
vit_cfg.num_classes = args.class_num
if args.vit_config_path != '':
print("get vit_config_path")
vit_config = load_function(args.vit_config_path)(vit_cfg)
else:
print("get default_vit_cfg")
vit_config = VitConfig(vit_cfg)
model = vit_config.network(vit_config)
return model
def get_network(backbone_name, args):
"""get_network"""
if backbone_name == 'vit_base_patch32':
backbone = vit_base_patch32(args=args)
elif backbone_name == 'vit_base_patch16':
backbone = vit_base_patch16(args=args)
else:
raise NotImplementedError
return backbone

View File

@ -0,0 +1,242 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""training script"""
import os
import time
import socket
import numpy as np
from mindspore import context
from mindspore import Tensor
from mindspore.train.model import Model, ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
from mindspore.train.loss_scale_manager import FixedLossScaleManager
from mindspore.communication.management import init
from mindspore.profiler.profiling import Profiler
from mindspore.train.serialization import load_checkpoint
import mindspore.dataset as ds
from src.vit import get_network
from src.dataset import get_dataset
from src.cross_entropy import get_loss
from src.optimizer import get_optimizer
from src.lr_generator import get_lr
from src.eval_engine import get_eval_engine
from src.callback import StateMonitor
from src.logging import get_logger
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper
try:
os.environ['MINDSPORE_HCCL_CONFIG_PATH'] = os.getenv('RANK_TABLE_FILE')
device_id = int(os.getenv('DEVICE_ID')) # 0 ~ 7
local_rank = int(os.getenv('RANK_ID')) # local_rank
device_num = int(os.getenv('RANK_SIZE')) # world_size
print("distribute training")
except TypeError:
device_id = 0 # 0 ~ 7
local_rank = 0 # local_rank
device_num = 1 # world_size
print("standalone training")
def add_static_args(args):
"""add_static_args"""
args.weight_decay = float(args.weight_decay)
args.eval_engine = 'imagenet'
args.split_point = 0.4
args.poly_power = 2
args.aux_factor = 0.4
args.seed = 1
args.auto_tune = 0
if args.eval_offset < 0:
args.eval_offset = args.max_epoch % args.eval_interval
args.device_id = device_id
args.local_rank = local_rank
args.device_num = device_num
args.dataset_name = 'imagenet'
return args
def modelarts_pre_process():
'''modelarts pre process function.'''
start_t = time.time()
val_file = os.path.join(config.data_path, 'val/imagenet_val.tar')
train_file = os.path.join(config.data_path, 'train/imagenet_train.tar')
tar_files = [val_file, train_file]
print('tar_files:{}'.format(tar_files))
for tar_file in tar_files:
if os.path.exists(tar_file):
t1 = time.time()
tar_dir = os.path.dirname(tar_file)
print('cd {}; tar -xvf {} > /dev/null 2>&1'.format(tar_dir, tar_file))
os.system('cd {}; tar -xvf {} > /dev/null 2>&1'.format(tar_dir, tar_file))
t2 = time.time()
print('uncompress, time used={:.2f}s'.format(t2 - t1))
os.system('cd {}; rm -rf {}'.format(tar_dir, tar_file))
else:
print('file does not exist:', tar_file)
end_t = time.time()
print('tar cost time {:.2f} sec'.format(end_t-start_t))
@moxing_wrapper(pre_process=modelarts_pre_process)
def train_net():
"""train_net"""
args = add_static_args(config)
np.random.seed(args.seed)
args.logger = get_logger(args.save_checkpoint_path, rank=local_rank)
context.set_context(device_id=device_id,
mode=context.GRAPH_MODE,
device_target="Ascend",
save_graphs=False)
if args.auto_tune:
context.set_context(auto_tune_mode='GA')
elif args.device_num == 1:
pass
else:
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
if args.open_profiler:
profiler = Profiler(output_path="data_{}".format(local_rank))
# init the distribute env
if not args.auto_tune and args.device_num > 1:
init()
# network
net = get_network(backbone_name=args.backbone, args=args)
# set grad allreduce split point
parameters = [param for param in net.trainable_params()]
parameter_len = len(parameters)
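# Parameters before the split point go to communication-fusion group 1 and the rest
# to group 2, so their gradient all-reduce operations are fused in two separate groups.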
if args.split_point > 0:
print("split_point={}".format(args.split_point))
split_parameter_index = [int(args.split_point*parameter_len),]
parameter_indices = 1
for i in range(parameter_len):
if i in split_parameter_index:
parameter_indices += 1
parameters[i].comm_fusion = parameter_indices
else:
print("warning!!!, no split point")
if os.path.isfile(args.pretrained):
load_checkpoint(args.pretrained, net, strict_load=False)
# loss
if not args.use_label_smooth:
args.label_smooth_factor = 0.0
loss = get_loss(loss_name=args.loss_name, args=args)
# train dataset
epoch_size = args.max_epoch
dataset = get_dataset(dataset_name=args.dataset_name,
do_train=True,
dataset_path=args.dataset_path,
args=args)
ds.config.set_seed(args.seed)
step_size = dataset.get_dataset_size()
args.steps_per_epoch = step_size
# evaluation dataset
eval_dataset = get_dataset(dataset_name=args.dataset_name,
do_train=False,
dataset_path=args.eval_path,
args=args)
# evaluation engine
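# Online evaluation is disabled (empty engine name) when auto-tuning, profiling,
# running on a single device, or when the eval dataset is unavailable.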
if args.auto_tune or args.open_profiler or eval_dataset is None or args.device_num == 1:
args.eval_engine = ''
eval_engine = get_eval_engine(args.eval_engine, net, eval_dataset, args)
# loss scale
loss_scale = FixedLossScaleManager(args.loss_scale, drop_overflow_update=False)
# learning rate
lr_array = get_lr(global_step=0, lr_init=args.lr_init, lr_end=args.lr_min, lr_max=args.lr_max,
warmup_epochs=args.warmup_epochs, total_epochs=epoch_size, steps_per_epoch=step_size,
lr_decay_mode=args.lr_decay_mode, poly_power=args.poly_power)
lr = Tensor(lr_array)
# optimizer, group_params used in grad freeze
opt, _ = get_optimizer(optimizer_name=args.opt,
network=net,
lrs=lr,
args=args)
# model
model = Model(net, loss_fn=loss, optimizer=opt,
metrics=eval_engine.metric, eval_network=eval_engine.eval_network,
loss_scale_manager=loss_scale, amp_level="O3")
eval_engine.set_model(model)
args.logger.save_args(args)
t0 = time.time()
# equal to model._init(dataset, sink_size=step_size)
eval_engine.compile(sink_size=step_size)
t1 = time.time()
args.logger.info('compile time used={:.2f}s'.format(t1 - t0))
# callbacks
state_cb = StateMonitor(data_size=step_size,
tot_batch_size=args.batch_size * device_num,
lrs=lr_array,
eval_interval=args.eval_interval,
eval_offset=args.eval_offset,
eval_engine=eval_engine,
logger=args.logger.info)
cb = [state_cb,]
if args.save_checkpoint and local_rank == 0:
config_ck = CheckpointConfig(save_checkpoint_steps=args.save_checkpoint_epochs*step_size,
keep_checkpoint_max=args.keep_checkpoint_max,
async_save=True)
ckpt_cb = ModelCheckpoint(prefix=args.backbone, directory=args.save_checkpoint_path, config=config_ck)
cb += [ckpt_cb]
t0 = time.time()
model.train(epoch_size, dataset, callbacks=cb, sink_size=step_size)
t1 = time.time()
args.logger.info('training time used={:.2f}s'.format(t1 - t0))
last_metric = 'last_metric[{}]'.format(state_cb.best_acc)
args.logger.info(last_metric)
is_cloud = args.enable_modelarts
if is_cloud:
ip = os.getenv("BATCH_TASK_CURRENT_HOST_IP")
else:
ip = socket.gethostbyname(socket.gethostname())
args.logger.info('ip[{}], mean_fps[{:.2f}]'.format(ip, state_cb.mean_fps))
if args.open_profiler:
profiler.analyse()
if __name__ == '__main__':
train_net()