add schema for BERT and TinyBERT

wanghua 2020-08-28 17:17:21 +08:00
parent cc491b6442
commit f347d1c9e7
4 changed files with 123 additions and 35 deletions

@ -73,6 +73,60 @@ For distributed training, a hccl configuration file with JSON format needs to be
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
To set the dataset format and parameters, create a schema configuration file in JSON format; see the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format for details.
```
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the NER or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the SQuAD task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option the user can set; all other values must match the dataset.
For example, for the cn-wiki-128 dataset, the pretraining schema file is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [32]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [32]
        }
    }
}
```
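The pretraining schema above can also be generated rather than written by hand. The following is a minimal sketch using only Python's standard `json` module; the helper names `make_column` and `build_pretrain_schema` and the parameter names `seq_length` and `max_predictions` are hypothetical and are not part of the repository. It assumes every column is rank 1, as in the example above.

```python
import json


def make_column(dtype, length):
    """Describe one rank-1 column of the TFRecord schema."""
    return {"type": dtype, "rank": 1, "shape": [length]}


def build_pretrain_schema(num_rows, seq_length, max_predictions):
    """Build the pretraining schema as a plain dict (hypothetical helper)."""
    columns = {
        "input_ids": make_column("int64", seq_length),
        "input_mask": make_column("int64", seq_length),
        "segment_ids": make_column("int64", seq_length),
        "next_sentence_labels": make_column("int64", 1),
        "masked_lm_positions": make_column("int64", max_predictions),
        "masked_lm_ids": make_column("int64", max_predictions),
        "masked_lm_weights": make_column("float32", max_predictions),
    }
    return {"datasetType": "TF", "numRows": num_rows, "columns": columns}


# Values taken from the example schema above.
schema = build_pretrain_schema(num_rows=7680, seq_length=256, max_predictions=32)

# Serialize to JSON text; write this string to a file to use it as a schema file.
schema_text = json.dumps(schema, indent=4)
print(schema_text)
```

Only `num_rows` is a free choice here; the sequence length and the number of masked positions have to match how the TFRecord files were produced.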
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@ -87,11 +141,12 @@ https:gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
├─hyper_parameter_config.ini # hyper parameter for distributed pretraining
├─run_distribute_pretrain.py # script for distributed pretraining
├─README.md
├─run_classifier.sh # shell script for standalone classifier task
├─run_ner.sh # shell script for standalone NER task
├─run_squad.sh # shell script for standalone SQUAD task
├─run_classifier.sh # shell script for standalone classifier task on ascend or gpu
├─run_ner.sh # shell script for standalone NER task on ascend or gpu
├─run_squad.sh # shell script for standalone SQUAD task on ascend or gpu
├─run_standalone_pretrain_ascend.sh # shell script for standalone pretrain on ascend
├─run_distributed_pretrain_ascend.sh # shell script for distributed pretrain on ascend
├─run_distributed_pretrain_gpu.sh # shell script for distributed pretrain on gpu
└─run_standaloned_pretrain_gpu.sh # shell script for standalone pretrain on gpu
├─src
├─__init__.py
@ -363,55 +418,59 @@ The result will be as follows:
## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | base | base |
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SXM2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | 116.5 ms/step | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | 1.913 |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | BERT | BERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | NEZHA | NEZHA |
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SXM2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128 | ImageNet |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Loss | | 1.913 |
| Speed | | 1.913 |
| Total time | | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | 1.913 |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
#### Inference Performance
| Parameters | | | |
| -------------------------- | ----------------------------- | ------------------------- | -------------------- |
| Model Version | V1 | | |
| Resource | Huawei 910 | NV SXM2 V100-32G | Huawei 310 |
| uploaded Date | 08/22/2020 | 05/22/2020 | |
| MindSpore Version | 0.6.0 | 0.2.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) | |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] | |
| Speed | 59.25ms/step | | |
| Total time | | | |
| Model for inference | 1.2G(.ckpt file) | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Ascend 910 | NV SXM2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
# [Description of Random Situation](#contents)

@ -122,7 +122,7 @@ def distribute_pretrain():
print("core_nums:", cmdopt)
print("epoch_size:", str(cfg['epoch_size']))
print("data_dir:", data_dir)
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/log.txt")
print("log_file_dir: " + cur_dir + "/LOG" + str(device_id) + "/pretraining_log.txt")
os.chdir(cur_dir + "/LOG" + str(device_id))
cmd = 'taskset -c ' + cmdopt + ' nohup python ' + run_script + " "

@ -112,9 +112,6 @@ def create_squad_dataset(batch_size=1, repeat_count=1, data_file_path=None, sche
else:
ds = de.TFRecordDataset([data_file_path], schema_file_path if schema_file_path != "" else None,
columns_list=["input_ids", "input_mask", "segment_ids", "unique_ids"])
ds = ds.map(input_columns="input_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="segment_ids", operations=type_cast_op)
ds = ds.map(input_columns="input_mask", operations=type_cast_op)
ds = ds.map(input_columns="input_ids", operations=type_cast_op)

@ -65,6 +65,38 @@ For distributed training on Ascend, a hccl configuration file with JSON format n
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
To set the dataset format and parameters, create a schema configuration file in JSON format; see the [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format for details.
```
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
`numRows` is the only option the user can set; all other values must match the dataset.
For example, for the cn-wiki-128 dataset, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
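Because the required columns differ by phase, it can help to check a schema before starting a run. The sketch below, using only the standard library, reports which columns a schema lacks for a given phase. The phase keys (`general_distill`, `task_distill`, `eval`) and the function `missing_columns` are hypothetical names chosen for illustration; the column lists are the ones stated above.

```python
import json

# Required columns per phase, as listed in the text above.
# The phase keys are illustrative names, not identifiers from the repository.
REQUIRED_COLUMNS = {
    "general_distill": ["input_ids", "input_mask", "segment_ids"],
    "task_distill": ["input_ids", "input_mask", "segment_ids", "label_ids"],
    "eval": ["input_ids", "input_mask", "segment_ids", "label_ids"],
}


def missing_columns(schema, phase):
    """Return the required columns for `phase` that the schema does not define."""
    present = schema.get("columns", {})
    return [name for name in REQUIRED_COLUMNS[phase] if name not in present]


# The general distill schema from the example above, parsed from JSON text.
general_schema = json.loads("""
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {"type": "int64", "rank": 1, "shape": [256]},
        "input_mask": {"type": "int64", "rank": 1, "shape": [256]},
        "segment_ids": {"type": "int64", "rank": 1, "shape": [256]}
    }
}
""")

print(missing_columns(general_schema, "general_distill"))
print(missing_columns(general_schema, "task_distill"))
```

An empty list means the schema covers the phase; a non-empty list names the columns that still need to be added before the schema can be used for that phase.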
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@ -304,9 +336,9 @@ The best acc is 0.891176
## [Model Description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | TinyBERT | TinyBERT |
| Parameters | Ascend | GPU |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version | | |
| Model Version | TinyBERT | TinyBERT |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SXM2 V100-32G, cpu:2.10GHz 64cores, memory:251G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
@ -323,10 +355,10 @@ The best acc is 0.891176
#### Inference Performance
| Parameters | | |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
| Resource | Huawei 910 | NV SXM2 V100-32G |
| Resource | Ascend 910 | NV SXM2 V100-32G |
| uploaded Date | 08/20/2020 | 08/24/2020 |
| MindSpore Version | 0.6.0 | 0.7.0 |
| Dataset | SST-2, | SST-2 |