add parameter 'config_path' in tinybert on r1.3

This commit is contained in:
郑彬 2021-07-12 16:02:38 +08:00
parent cd2d2bef0a
commit da2b9779e0
11 changed files with 353 additions and 138 deletions

View File

@ -71,65 +71,122 @@ The backbone structure of TinyBERT is transformer, the transformer contains four
# [Quick Start](#contents)
After installing MindSpore via the official website, you can start general distill, task distill and evaluation as follows:

- running on local

```text
# run standalone general distill example
bash scripts/run_standalone_gd.sh

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_standalone_gd.sh file first. If running on GPU, please set the `device_target=GPU`.

# For Ascend device, run distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_distributed_gd_ascend.sh file first.

# For GPU device, run distributed general distill example
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt

# run task distill and evaluation example
bash scripts/run_standalone_td.sh {path}/*.yaml

Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir`, `schema_dir` and `dataset_type` in the run_standalone_td.sh file first.
If running on GPU, please set the `device_target=GPU`.
```

For distributed training on Ascend, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/doc/programming_guide/en/master/dataset_loading.html#tfrecord) format.

```text
For general task, schema file contains ["input_ids", "input_mask", "segment_ids"].
For task distill and eval phase, schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

`numRows` is the only option which can be set by the user; the other values must be set according to the dataset.

For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
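If it helps to see how a schema file like the one above is consumed, the sketch below is a minimal illustration (not part of the repo; the file paths are placeholders) of passing the schema to MindSpore's TFRecord loader. The repo's own loading logic lives in src/dataset.py.

```python
# Minimal sketch with placeholder paths: read a TFRecord dataset using a schema
# file such as the one shown above, then batch it for the general distill phase.
import mindspore.dataset as ds

dataset = ds.TFRecordDataset(dataset_files=["/path/data/part0.tfrecord"],   # files under 'data_dir'
                             schema="/path/schema.json",                    # the schema file ('schema_dir')
                             columns_list=["input_ids", "input_mask", "segment_ids"],
                             shuffle=ds.Shuffle.FILES)
dataset = dataset.batch(32, drop_remainder=True)  # the batch size ultimately comes from the *.yaml config

for batch in dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
    print({name: value.shape for name, value in batch.items()})
    break
```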
- running on ModelArts
If you want to run in modelarts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/), and you can start training as follows
- general_distill with 8 cards on ModelArts
```python
# (1) Upload the code folder to S3 bucket.
# (2) Click to "create training task" on the website UI interface.
# (3) Set the code directory to "/{path}/tinybert" on the website UI interface.
# (4) Set the startup file to "/{path}/tinybert/run_general_distill.py" on the website UI interface.
# (5) Perform a or b.
# a. setting parameters in /{path}/tinybert/gd_config.yaml.
# 1. Set "enable_modelarts=True"
# 2. Set other parameters ('config_path' cannot be set here); for the other parameters, refer to `./scripts/run_distributed_gd_ascend.sh`
# b. adding on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add other parameters; for the other parameters, refer to `./scripts/run_distributed_gd_ascend.sh`
# Note that 'data_dir' and 'schema_dir' should be relative paths, relative to the path selected in step (7).
# Add "config_path=../../gd_config.yaml" on the webpage ('config_path' is the path of the '*.yaml' file relative to {path}/tinybert/src/model_utils/config.py, and the '*.yaml' file must be inside {path}/bert/).
# (6) Upload the dataset to S3 bucket.
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path (there is only data or zip package under this path).
# (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
# (9) Under the item "resource pool selection", select the specification of 8 cards.
# (10) Create your job.
# After training, the '*.ckpt' file will be saved under the 'training output file path'.
```
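Both ways of passing parameters (editing gd_config.yaml vs. adding key=value pairs on the web page) come down to the same thing: values from the yaml file are loaded first and then overridden by the extra arguments. A rough, hypothetical sketch of that merge follows; the real logic lives in src/model_utils/config.py, and the helper name below is made up.

```python
# Illustrative only: load a *.yaml config such as gd_config.yaml and let
# key=value overrides (e.g. the ones added on the ModelArts web page) win.
import yaml  # assumes PyYAML is available

def load_with_overrides(config_path, overrides):
    with open(config_path, "r") as f:
        cfg = yaml.safe_load(f)   # defaults from the yaml file
    cfg.update(overrides)         # UI/CLI parameters override the yaml defaults
    return cfg

cfg = load_with_overrides("gd_config.yaml", {"enable_modelarts": True, "device_target": "Ascend"})
print(cfg["enable_modelarts"], cfg["device_target"])
```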
- Running task_distill with single card on ModelArts
```python
# (1) Upload the code folder to S3 bucket.
# (2) Click to "create training task" on the website UI interface.
# (3) Set the code directory to "/{path}/tinybert" on the website UI interface.
# (4) Set the startup file to "/{path}/tinybert/run_task_distill.py" on the website UI interface.
# (5) Perform a or b.
# Add "config_path=../../td_config/td_config_sst2.yaml" on the web page (select the *.yaml configuration file according to the distill task)
# a. setting parameters in the chosen '*.yaml' file under the folder `/{path}/tinybert/td_config/`.
# 1. Set "enable_modelarts=True"
# 2. Set "task_name=SST-2" (depending on the task, select from ["SST-2", "QNLI", "MNLI", "TNEWS", "CLUENER"])
# 3. Set other parameters; for the other parameters, refer to './scripts/run_standalone_td.sh'.
# b. adding on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add "task_name=SST-2" (depending on the task, select from ["SST-2", "QNLI", "MNLI", "TNEWS", "CLUENER"])
# 3. Add other parameters; for the other parameters, refer to './scripts/run_standalone_td.sh'.
# Note that 'load_teacher_ckpt_path', 'train_data_dir', 'eval_data_dir' and 'schema_dir' should be relative paths, relative to the path selected in step (7).
# Note that 'load_gd_ckpt_path' should be a relative path, relative to the code directory selected in step (3).
# (6) Upload the dataset to S3 bucket.
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path.
# (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
# (9) Under the item "resource pool selection", select the specification of a single card.
# (10) Create your job.
# After training, the '*.ckpt' file will be saved under the 'training output file path'.
```
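For orientation, `task_name` mainly selects per-task settings such as the number of labels and the sequence length. The sketch below mirrors the `task_params` table used by postprocess.py later in this commit; only the SST-2 entry and the two defaults are taken from the diff, the lookup helper itself is hypothetical.

```python
# Sketch of the task_name -> settings lookup; not the repo's exact table.
DEFAULT_NUM_LABELS = 2
DEFAULT_SEQ_LENGTH = 128
task_params = {"SST-2": {"num_labels": 2, "seq_length": 64}}  # other tasks omitted here

def task_settings(task_name):
    params = task_params.get(task_name, {})
    return (params.get("num_labels", DEFAULT_NUM_LABELS),
            params.get("seq_length", DEFAULT_SEQ_LENGTH))

print(task_settings("SST-2"))  # (2, 64)
print(task_settings("QNLI"))   # falls back to the defaults
```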
# [Script Description](#contents)
@ -139,23 +196,39 @@ For example, the dataset is cn-wiki-128, the schema file for general distill pha
.
└─bert
  ├─README.md
  ├─README_CN.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_gpu.sh          # shell script for distributed general distill phase on GPU
    ├─run_infer_310.sh                   # shell script for 310 infer
    ├─run_standalone_gd.sh               # shell script for standalone general distill phase
    ├─run_standalone_td.sh               # shell script for standalone task distill phase
  ├─src
    ├─model_utils
      ├── config.py                      # parse *.yaml parameter configuration file
      ├── device_adapter.py              # distinguish local/ModelArts training
      ├── local_adapter.py               # get related environment variables in local training
      └── moxing_adapter.py              # get related environment variables in ModelArts training
    ├─__init__.py
    ├─assessment_method.py               # assessment method for evaluation
    ├─dataset.py                         # data processing
    ├─tinybert_for_gd_td.py              # backbone code of network
    ├─tinybert_model.py                  # backbone code of network
    ├─utils.py                           # util function
  ├─td_config                            # folder where *.yaml files of different distillation tasks are located
    ├── td_config_15cls.yaml
    ├── td_config_mnli.yaml
    ├── td_config_ner.yaml
    ├── td_config_qnli.yaml
    └── td_config_sst2.yaml
  ├─__init__.py
  ├─export.py                            # export scripts
  ├─gd_config.yaml                       # parameter configuration for general_distill
  ├─mindspore_hub_conf.py                # Mindspore Hub interface
  ├─postprocess.py                       # scripts for 310 postprocess
  ├─preprocess.py                        # scripts for 310 preprocess
  ├─run_general_distill.py               # train net for general distillation
  ├─run_task_distill.py                  # train and eval net for task distillation
```
## [Script Parameters](#contents)
@ -231,7 +304,7 @@ options:
## Options and Parameters
`gd_config.yaml` and `td_config/*.yaml` contain parameters of the BERT model and options for the optimizer and loss scale.
### Options
@ -358,7 +431,7 @@ If you want to after running and continue to eval, please set `do_train=true` an
#### evaluation on SST-2 dataset
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -378,7 +451,7 @@ The best acc is 0.902777
Before running the command below, please check that the checkpoint path for loading the pretrained model has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -398,7 +471,7 @@ The best acc is 0.813929
Before running the command below, please check that the checkpoint path for loading the pretrained model has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above will run in the background; you can view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -417,6 +490,8 @@ The best acc is 0.891176
### [Export MindIR](#contents)
- Export on local
```shell
python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --file_format [FILE_FORMAT]
```
@ -424,6 +499,32 @@ python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --file_format [
The ckpt_file parameter is required, and `EXPORT_FORMAT` should be in ["AIR", "MINDIR"].
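As a rough illustration of what these flags mean, the toy example below exports a small stand-in network with `mindspore.export`. It is not the repo's export.py: the `StandInNet` layer replaces the real TinyBERT student network, and loading `--ckpt_file` would normally happen via `load_checkpoint`/`load_param_into_net`.

```python
# Toy sketch of --file_name / --file_format with a stand-in network.
import numpy as np
import mindspore as ms
from mindspore import nn, Tensor

class StandInNet(nn.Cell):
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(128, 2)

    def construct(self, x):
        return self.dense(x)

net = StandInNet()
dummy_input = Tensor(np.zeros((32, 128), np.float32))   # shape stands in for the real model inputs
ms.export(net, dummy_input, file_name="tinybert_sst2", file_format="MINDIR")
```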
- Export on ModelArts (If you want to run in modelarts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/), and you can start as follows)
```python
# (1) Upload the code folder to S3 bucket.
# (2) Click to "create training task" on the website UI interface.
# (3) Set the code directory to "/{path}/tinybert" on the website UI interface.
# (4) Set the startup file to "/{path}/tinybert/export.py" on the website UI interface.
# (5) Perform a or b.
# a. Set parameters in a *.yaml file under /path/tinybert/td_config/
# 1. Set "enable_modelarts: True"
# 2. Set "ckpt_file: ./{path}/*.ckpt" ('ckpt_file' is the path of the weight file to be exported, relative to the file `export.py`; the weight file must be included in the code directory.)
# 3. Set "file_name: tinybert_sst2"
# 4. Set "file_format: MINDIR"
# b. Adding on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add "ckpt_file=./{path}/*.ckpt" ('ckpt_file' is the path of the weight file to be exported, relative to the file `export.py`; the weight file must be included in the code directory.)
# 3. Add "file_name=tinybert_sst2"
# 4. Add "file_format=MINDIR"
# Finally, "config_path=../../td_config/*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task).
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path". (It is not actually used, but it still has to be set.)
# (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
# (9) Under the item "resource pool selection", select the specification of a single card.
# (10) Create your job.
# You will see tinybert_sst2.mindir under {Output file path}.
```
### Infer on Ascend310
Before performing inference, the mindir file must be exported by the `export.py` script. We only provide an example of inference using the MINDIR model.
@ -459,7 +560,7 @@ Inference result is saved in current path, you can find result like this in acc.
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 1.0.0 | 1.0.0 |
| Dataset | en-wiki-128 | en-wiki-128 |
| Training Parameters | src/gd_config.yaml | src/gd_config.yaml |
| Optimizer | AdamWeightDecay | AdamWeightDecay |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | probability |
@ -489,7 +590,7 @@ Inference result is saved in current path, you can find result like this in acc.
In run_standalone_td.sh, we set do_shuffle to shuffle the dataset.

In gd_config.yaml and td_config/*.yaml, we set the hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.
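A tiny, self-contained illustration of why the fixed seed matters (the seed value below is arbitrary, not the one used by the scripts):

```python
# Sketch: with the same seed, two identically defined layers initialize to identical weights.
from mindspore import set_seed, nn

set_seed(1)
layer_a = nn.Dense(8, 2)
set_seed(1)
layer_b = nn.Dense(8, 2)
print((layer_a.weight.asnumpy() == layer_b.weight.asnumpy()).all())  # True
```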

View File

@ -78,63 +78,118 @@ The backbone structure of the TinyBERT model is transformer; the transformer contains four encoder modules
After downloading and installing MindSpore from the official website, you can start general distill, task distill and evaluation as follows:

- running on local

```bash
# run standalone general distill example
bash scripts/run_standalone_gd.sh

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_standalone_gd.sh file first. If running on GPU, please set the `device_target=GPU`.

# For Ascend device, run distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_distributed_gd_ascend.sh file first.

# For GPU device, run distributed general distill example
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt

# run task distill and evaluation example
bash scripts/run_standalone_td.sh {path}/*.yaml

Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir`, `schema_dir` and `dataset_type` in the run_standalone_td.sh file first.
If running on GPU, please set the `device_target=GPU`.
```

For distributed training on Ascend devices, please create an HCCL configuration file in JSON format in advance.
For details, see the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

If you want to set the dataset format and parameters, please create a schema configuration file in JSON format; see the [TFRecord](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/dataset_loading.html#tfrecord) format for details.

```text
For general task, schema file contains ["input_ids", "input_mask", "segment_ids"].
For task distill and eval phase, schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

`numRows` is the only option which can be set by the user; the other values must be set according to the dataset.

For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
- running on ModelArts (if you want to run on ModelArts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/), and you can start training as follows)
- general_distill with 8 cards on ModelArts

```python
# (1) Upload the code folder to the S3 bucket.
# (2) Click "create training task" on the website UI interface.
# (3) Set the code directory to /{path}/tinybert on the website UI interface.
# (4) Set the startup file to /{path}/tinybert/run_general_distill.py on the website UI interface.
# (5) Perform a or b.
# a. Set parameters in the /{path}/tinybert/gd_config.yaml file.
# 1. Set "enable_modelarts=True"
# 2. Set other parameters ('config_path' cannot be set here); for the other parameters, refer to `./scripts/run_distributed_gd_ascend.sh`
# b. Set parameters on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add other parameters; for the other parameters, refer to `./scripts/run_distributed_gd_ascend.sh`
# Note that 'data_dir' and 'schema_dir' should be relative paths, relative to the path selected in step (7).
# Add "config_path=../../gd_config.yaml" on the webpage ('config_path' is the path of the '*.yaml' file relative to {path}/tinybert/src/model_utils/config.py, and the '*.yaml' file must be inside {path}/bert/).
# (6) Upload the dataset to the S3 bucket.
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path".
# (8) Set the "Output file path" and "Job log path" on the website UI interface.
# (9) Under "resource pool selection", select the 8-card specification.
# (10) Create your training job.
# After training, the trained weights ('*.ckpt') will be saved under the 'training output file path'.
```

- Running task_distill with a single card on ModelArts

```python
# (1) Upload the code folder to the S3 bucket.
# (2) Click "create training task" on the website UI interface.
# (3) Set the code directory to /{path}/tinybert on the website UI interface.
# (4) Set the startup file to /{path}/tinybert/run_task_distill.py on the website UI interface.
# (5) Add "config_path=../../td_config/td_config_sst2.yaml" on the web page (select the *.yaml configuration file according to the distill task).
# Perform a or b.
# a. Set parameters in the chosen '*.yaml' file.
# 1. Set "enable_modelarts=True"
# 2. Set "task_name=SST-2" (depending on the task, select from ["SST-2", "QNLI", "MNLI", "TNEWS", "CLUENER"])
# 3. Set other parameters; for the other parameters, refer to `run_standalone_td.sh` under './scripts/'.
# b. Set parameters on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add "task_name=SST-2" (depending on the task, select from ["SST-2", "QNLI", "MNLI", "TNEWS", "CLUENER"])
# 3. Add other parameters; for the other parameters, refer to `run_standalone_td.sh` under './scripts/'.
# Note that 'load_teacher_ckpt_path', 'train_data_dir', 'eval_data_dir' and 'schema_dir' should be relative paths, relative to the path selected in step (7).
# Note that 'load_gd_ckpt_path' should be a relative path, relative to the code directory selected in step (3).
# (6) Upload the dataset to the S3 bucket.
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path".
# (8) Set the "Output file path" and "Job log path" on the website UI interface.
# (9) Under "resource pool selection", select the single-card specification.
# (10) Create your training job.
# After training, the trained weights ('*.ckpt') will be saved under the 'training output file path'.
```
# Script Description
@ -142,25 +197,41 @@ For example, the dataset is cn-wiki-128, the schema file for general distill pha
```shell
.
└─tinybert
  ├─README.md
  ├─README_CN.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_gpu.sh          # shell script for distributed general distill phase on GPU
    ├─run_infer_310.sh                   # shell script for 310 inference
    ├─run_standalone_gd.sh               # shell script for standalone general distill phase
    ├─run_standalone_td.sh               # shell script for standalone task distill phase
  ├─src
    ├─model_utils
      ├── config.py                      # parse *.yaml parameter configuration file
      ├── device_adapter.py              # distinguish local/ModelArts training
      ├── local_adapter.py               # get related environment variables in local training
      └── moxing_adapter.py              # get related environment variables and exchange data in ModelArts training
    ├─__init__.py
    ├─assessment_method.py               # assessment method for evaluation
    ├─dataset.py                         # data processing
    ├─tinybert_for_gd_td.py              # backbone code of network
    ├─tinybert_model.py                  # backbone code of network
    ├─utils.py                           # util functions
  ├─td_config                            # folder with the *.yaml files of the different distillation tasks
    ├── td_config_15cls.yaml
    ├── td_config_mnli.yaml
    ├── td_config_ner.yaml
    ├── td_config_qnli.yaml
    └── td_config_sst2.yaml
  ├─__init__.py
  ├─export.py                            # export scripts
  ├─gd_config.yaml                       # parameter configuration for general distill
  ├─mindspore_hub_conf.py                # MindSpore Hub interface
  ├─postprocess.py                       # 310 inference postprocess script
  ├─preprocess.py                        # 310 inference preprocess script
  ├─run_general_distill.py               # train net for general distillation
  ├─run_task_distill.py                  # train and eval net for task distillation
```
## Script Parameters
@ -233,7 +304,7 @@ options:
## Options and Parameters

`gd_config.yaml` and `td_config/*.yaml` contain parameters of the BERT model and options for the optimizer and loss scale.
### Options
@ -321,7 +392,7 @@ epoch: 1, step: 100, outputs are 28.2093
Before running the command below, please make sure `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set them to absolute full paths, e.g. /username/checkpoint_100_300.ckpt.
```bash
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json /path/gd_config.json
```
The command above runs in the background; you can view the results in the file log.txt. After training, you can find the checkpoint files under the default log* folder path. The loss values are as follows:
@ -339,7 +410,7 @@ epoch: 1, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, 30.5901), Ten
Please enter an absolute full path, e.g. "/username/checkpoint_100_300.ckpt".
```bash
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt /path/gd_config.json
```
The command above runs in the background; you can view the results in the file log.txt. After training, you can find the checkpoint files under the default LOG* folder. The loss values are as follows:
@ -359,7 +430,7 @@ epoch: 1, step: 1, outputs are 63.4098
#### evaluation on SST-2 dataset
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above runs in the background; you can view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -379,7 +450,7 @@ The best acc is 0.902777
Before running the command below, please make sure that the checkpoint path for loading the pretrained model has been set. Please set the checkpoint path to an absolute full path, e.g. /username/pretrain/checkpoint_100_300.ckpt.
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above runs in the background; please view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -399,7 +470,7 @@ The best acc is 0.813929
Before running the command below, please make sure that the checkpoint path for loading the pretrained model has been set. Please set the checkpoint path to an absolute full path, e.g. /username/pretrain/checkpoint_100_300.ckpt.
```bash
bash scripts/run_standalone_td.sh {path}/*.yaml
```
The command above runs in the background; you can view the results in the file log.txt. The accuracy of the test dataset will be as follows:
@ -418,8 +489,36 @@ The best acc is 0.891176
### [Export MindIR](#contents)
- Export on local

```shell
python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --file_format [FILE_FORMAT]
```
- Export on ModelArts

```python
# (1) Upload the code folder to the S3 bucket.
# (2) Click "create training task" on the website UI interface.
# (3) Set the code directory to /{path}/tinybert on the website UI interface.
# (4) Set the startup file to /{path}/tinybert/export.py on the website UI interface.
# (5) Perform a or b.
# a. Set parameters in one of the *.yaml files under /path/tinybert/td_config/.
# 1. Set "enable_modelarts: True"
# 2. Set "ckpt_file: ./{path}/*.ckpt" ('ckpt_file' is the path of the '*.ckpt' weight file to be exported, relative to `export.py`; the weight file must be included in the code directory.)
# 3. Set "file_name: tinybert_sst2"
# 4. Set "file_format: MINDIR"
# b. Set parameters on the website UI interface.
# 1. Add "enable_modelarts=True"
# 2. Add "ckpt_file=./{path}/*.ckpt" ('ckpt_file' is the path of the '*.ckpt' weight file to be exported, relative to `export.py`; the weight file must be included in the code directory.)
# 3. Add "file_name=tinybert_sst2"
# 4. Add "file_format=MINDIR"
# Finally, "config_path=../../td_config/*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task).
# (7) Check the "data storage location" on the website UI interface and set the "Dataset path". (It is not actually used, but it still has to be set.)
# (8) Set the "Output file path" and "Job log path" on the website UI interface.
# (9) Under "resource pool selection", select the single-card specification.
# (10) Create your training job.
# You will see the 'tinybert_sst2.mindir' file under {Output file path}.
```
The parameter ckpt_file is required.
@ -460,7 +559,7 @@ bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [SCHEMA_DIR] [DATASET_TYPE] [
| uploaded Date | 2020-08-20 | 2020-08-24 |
| MindSpore Version | 0.6.0 | 0.7.0 |
| Dataset | en-wiki-128 | en-wiki-128 |
| Training Parameters | src/gd_config.yaml | src/gd_config.yaml |
| Optimizer | AdamWeightDecay | AdamWeightDecay |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | probability |
@ -489,7 +588,7 @@ bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [SCHEMA_DIR] [DATASET_TYPE] [
In run_standalone_td.sh, we set do_shuffle to shuffle the dataset.

In gd_config.yaml and td_config/*.yaml, we set the hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.

View File

@ -16,12 +16,22 @@
"""postprocess"""
import os
import argparse
import numpy as np
from mindspore import Tensor
from src.assessment_method import Accuracy, F1
from src.model_utils.config import eval_cfg, config as args_opt
parser = argparse.ArgumentParser(description='postprocess')
parser.add_argument("--task_name", type=str, default="", choices=["SST-2", "QNLI", "MNLI", "TNEWS", "CLUENER"],
                    help="The name of the task to train.")
parser.add_argument("--assessment_method", type=str, default="accuracy", choices=["accuracy", "bf1", "mf1"],
                    help="assessment_method include: [accuracy, bf1, mf1], default is accuracy")
parser.add_argument("--result_path", type=str, default="./result_Files", help="result path")
parser.add_argument("--label_path", type=str, default="./preprocess_Result/label_ids.npy", help="label path")
args_opt = parser.parse_args()
BATCH_SIZE = 32
DEFAULT_NUM_LABELS = 2
DEFAULT_SEQ_LENGTH = 128
task_params = {"SST-2": {"num_labels": 2, "seq_length": 64},
@ -49,8 +59,11 @@ class Task:
        if self.task_name in task_params and "seq_length" in task_params[self.task_name]:
            return task_params[self.task_name]["seq_length"]
        return DEFAULT_SEQ_LENGTH
task = Task(args_opt.task_name)
def eval_result_print(assessment_method="accuracy", callback=None):
"""print eval result"""
if assessment_method == "accuracy":
@ -79,9 +92,9 @@ def get_acc():
    labels = np.load(args_opt.label_path)
    file_num = len(os.listdir(args_opt.result_path))
    for i in range(file_num):
        f_name = "tinybert_bs" + str(BATCH_SIZE) + "_" + str(i) + "_0.bin"
        logits = np.fromfile(os.path.join(args_opt.result_path, f_name), np.float32)
        logits = logits.reshape(BATCH_SIZE, task.num_labels)
        label_ids = labels[i]
        callback.update(Tensor(logits), Tensor(label_ids))
    print("==============================================================")

View File

@ -16,11 +16,21 @@
"""preprocess"""
import os
import argparse
import numpy as np
from src.model_utils.config import eval_cfg, config as args_opt
from src.dataset import create_tinybert_dataset, DataType
parser = argparse.ArgumentParser(description='preprocess')
parser.add_argument("--eval_data_dir", type=str, default="", help="Data path, it is better to use absolute path")
parser.add_argument("--schema_dir", type=str, default="", help="Schema path, it is better to use absolute path")
parser.add_argument("--dataset_type", type=str, default="tfrecord",
help="dataset type tfrecord/mindrecord, default is tfrecord")
parser.add_argument("--result_path", type=str, default="./preprocess_Result/", help="result path")
args_opt = parser.parse_args()
BATCH_SIZE = 32
if args_opt.dataset_type == "tfrecord":
    dataset_type = DataType.TFRECORD
elif args_opt.dataset_type == "mindrecord":
@ -28,6 +38,7 @@ elif args_opt.dataset_type == "mindrecord":
else:
    raise Exception("dataset format is not supported yet")
def get_bin():
"""
generate bin files.
@ -41,7 +52,7 @@ def get_bin():
    os.makedirs(token_type_id_path)
    os.makedirs(input_mask_path)
    eval_dataset = create_tinybert_dataset('td', batch_size=BATCH_SIZE,
                                           device_num=1, rank=0, do_shuffle="false",
                                           data_dir=args_opt.eval_data_dir,
                                           schema_dir=args_opt.schema_dir,
@ -49,7 +60,7 @@ def get_bin():
columns_list = ["input_ids", "input_mask", "segment_ids", "label_ids"]
label_list = []
for j, data in enumerate(eval_dataset.create_dict_iterator(output_numpy=True, num_epochs=1)):
file_name = "tinybert_bs" + str(eval_cfg.batch_size) + "_" + str(j) + ".bin"
file_name = "tinybert_bs" + str(BATCH_SIZE) + "_" + str(j) + ".bin"
input_data = []
for i in columns_list:
input_data.append(data[i])

View File

@ -16,17 +16,21 @@
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_standalone_td.sh"
echo "for example: bash scripts/run_standalone_td.sh"
echo "bash scripts/run_standalone_td.sh [config_path]"
echo "for example: bash scripts/run_standalone_td.sh /home/data1/td_config_sst2.yaml"
echo "=============================================================================================================="
if [ $# != 1 ]; then
    echo "bash scripts/run_standalone_td.sh [config_path]"
    exit 1
fi
mkdir -p ms_log
PROJECT_DIR=$(cd "$(dirname "$0")" || exit; pwd)
CUR_DIR=`pwd`
export GLOG_log_dir=${CUR_DIR}/ms_log
export GLOG_logtostderr=0
python ${PROJECT_DIR}/../run_task_distill.py \
    --config_path=$1 \
    --device_target="Ascend" \
    --device_id=0 \
    --do_train="true" \

View File

@ -152,9 +152,11 @@ def get_config():
"""
Get Config according to the yaml file and cli arguments.
"""
    def get_abs_path(path_input):
        if os.path.isabs(path_input):
            return path_input
        current_dir = os.path.dirname(os.path.abspath(__file__))
        return os.path.join(current_dir, path_input)
parser = argparse.ArgumentParser(description="default name", add_help=False)
parser.add_argument("--config_path", type=get_abs_path, default="../../gd_config.yaml",
help="Config file path")

View File

@ -40,9 +40,6 @@ dataset_type: "tfrecord"
ckpt_file: ''
file_name: "tinybert"
file_format: "AIR"
# postprocess related
result_path: "./result_Files"
label_path: "./preprocess_Result/label_ids.npy"
phase1_cfg:
batch_size: 32
loss_scale_value: 256

View File

@ -40,9 +40,6 @@ dataset_type: "tfrecord"
ckpt_file: ''
file_name: "tinybert"
file_format: "AIR"
# postprocess related
result_path: "./result_Files"
label_path: "./preprocess_Result/label_ids.npy"
phase1_cfg:
batch_size: 32
loss_scale_value: 256

View File

@ -40,9 +40,6 @@ dataset_type: "tfrecord"
ckpt_file: ''
file_name: "tinybert"
file_format: "AIR"
# postprocess related
result_path: "./result_Files"
label_path: "./preprocess_Result/label_ids.npy"
phase1_cfg:
batch_size: 32
loss_scale_value: 256

View File

@ -40,9 +40,6 @@ dataset_type: "tfrecord"
ckpt_file: ''
file_name: "tinybert"
file_format: "AIR"
# postprocess related
result_path: "./result_Files"
label_path: "./preprocess_Result/label_ids.npy"
phase1_cfg:
batch_size: 32
loss_scale_value: 256

View File

@ -40,9 +40,6 @@ dataset_type: "tfrecord"
ckpt_file: ''
file_name: "tinybert"
file_format: "AIR"
# postprocess related
result_path: "./result_Files"
label_path: "./preprocess_Result/label_ids.npy"
phase1_cfg:
batch_size: 32
loss_scale_value: 256