add prophetnet into model zoo

This commit is contained in:
y00488820 2020-10-23 19:01:45 +08:00
parent 17764803ef
commit 1138b08a1e
41 changed files with 3943 additions and 0 deletions

View File

@ -0,0 +1,643 @@
![](https://www.mindspore.cn/static/img/logo.a3e472c9.png)
<!-- TOC -->
- [MASS: Masked Sequence to Sequence Pre-training for Language Generation Description](#mass-masked-sequence-to-sequence-pre-training-for-language-generation-description)
- [Model architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
- [Script description](#script-description)
- [Data Preparation](#Data-Preparation)
- [Tokenization](#Tokenization)
- [Byte Pair Encoding](#Byte-Pair-Encoding)
- [Build Vocabulary](#Build-Vocabulary)
- [Generate Dataset](#Generate-Dataset)
- [News Crawl Corpus](#News-Crawl-Corpus)
- [Gigaword Corpus](#Gigaword-Corpus)
- [Cornell Movie Dialog Corpus](#Cornell-Movie-Dialog-Corpus)
- [Configuration](#Configuration)
- [Training & Evaluation process](#Training-&-Evaluation-process)
- [Weights average](#Weights-average)
- [Learning rate scheduler](#Learning-rate-scheduler)
- [Model description](#model-description)
- [Performance](#performance)
- [Results](#results)
- [Training Performance](#training-performance)
- [Inference Performance](#inference-performance)
- [Environment Requirements](#environment-requirements)
- [Platform](#Platform)
- [Requirements](#Requirements)
- [Get started](#get-started)
- [Pre-training](#Pre-training)
- [Fine-tuning](#Fine-tuning)
- [Inference](#Inference)
- [Description of random situation](#description-of-random-situation)
- [Others](#others)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->
# MASS: Masked Sequence to Sequence Pre-training for Language Generation Description
[MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf) was released by Microsoft in June 2019.
BERT (Devlin et al., 2018) achieved state-of-the-art results in natural language understanding by pre-training the encoder part of Transformer (Vaswani et al., 2017) on masked rich-resource text. Likewise, GPT (Radford et al., 2018) pre-trains the decoder part of Transformer on rich-resource text (with the encoder inputs masked). Both build a robust language model by pre-training on masked rich-resource text.
Inspired by BERT, GPT and other language models, Microsoft proposed [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf), which combines the ideas of BERT and GPT. MASS has an important parameter k, which controls the length of the masked fragment. BERT and GPT are special cases of MASS when k equals 1 and the sentence length, respectively.
[Introducing MASS A pre-training method that outperforms BERT and GPT in sequence to sequence language generation tasks](https://www.microsoft.com/en-us/research/blog/introducing-mass-a-pre-training-method-that-outperforms-bert-and-gpt-in-sequence-to-sequence-language-generation-tasks/)
[Paper](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf): Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. “MASS: Masked Sequence to Sequence Pre-training for Language Generation.” ICML (2019).
# Model architecture
The overall network architecture of MASS is shown below, which is Transformer (Vaswani et al., 2017):
MASS consists of a 6-layer encoder and a 6-layer decoder with 1024 embedding/hidden size, and a 4096 intermediate size for the feed-forward network, which has two fully connected layers.
# Dataset
Datasets used:
- monolingual English data from the News Crawl dataset (WMT 2019) for pre-training.
- the Gigaword corpus (Graff et al., 2003) for text summarization.
- the Cornell Movie Dialog corpus (Danescu-Niculescu-Mizil & Lee, 2011) for conversational response generation.
Details about these datasets can be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
# Features
MASS is designed to jointly pre-train the encoder and the decoder to complete language generation tasks.
First, through a sequence to sequence framework, MASS only predicts the masked tokens, which forces the encoder to understand the meaning of the unmasked tokens and encourages the decoder to extract useful information from the encoder.
Second, by predicting consecutive tokens on the decoder side, the decoder builds better language modeling ability than it would by predicting discrete tokens only.
Third, by further masking those decoder input tokens that are not masked on the encoder side, the decoder is encouraged to extract more useful information from the encoder side rather than relying on the rich information in the preceding tokens. A toy sketch of this masking strategy is shown below.
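The masking strategy can be illustrated with a toy example. This is a sketch of the idea only (the fragment position, the mask token name and the exact decoder-side masking are simplified assumptions); the real implementation lives in `src/language_model/`:
```python
MASK = "[MASK]"  # placeholder mask token, name chosen for illustration

def mass_masking(tokens, start, k):
    """Build encoder input, decoder input and target for one sentence (toy sketch).

    tokens: list of tokens; start: index where the masked fragment begins;
    k: length of the masked fragment (the MASS parameter k).
    """
    fragment = tokens[start:start + k]
    # The encoder sees the sentence with the fragment replaced by [MASK].
    encoder_input = tokens[:start] + [MASK] * len(fragment) + tokens[start + k:]
    # The decoder predicts the fragment; its input is the fragment shifted right,
    # so tokens that the encoder already sees are masked on the decoder side.
    decoder_input = [MASK] + fragment[:-1]
    target = fragment
    return encoder_input, decoder_input, target

enc, dec, tgt = mass_masking(["x1", "x2", "x3", "x4", "x5", "x6"], start=2, k=3)
# enc: ['x1', 'x2', '[MASK]', '[MASK]', '[MASK]', 'x6']
# dec: ['[MASK]', 'x3', 'x4']
# tgt: ['x3', 'x4', 'x5']
```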
# Script description
MASS script and code structure are as follow:
```text
├── mass
├── README.md // Introduction of MASS model.
├── config
│ ├──config.py // Configuration instance definition.
│ ├──config.json // Configuration file.
├── src
│ ├──dataset
│ ├──bi_data_loader.py // Dataset loader for fine-tune or inferring.
│ ├──mono_data_loader.py // Dataset loader for pre-training.
│ ├──language_model
│ ├──noise_channel_language_model.py // Noisy channel language model for dataset generation.
│ ├──mass_language_model.py // MASS language model according to MASS paper.
│ ├──loose_masked_language_model.py // MASS language model according to MASS released code.
│ ├──masked_language_model.py // Masked language model according to MASS paper.
│ ├──transformer
│ ├──create_attn_mask.py // Generate mask matrix to remove padding positions.
│ ├──transformer.py // Transformer model architecture.
│ ├──encoder.py // Transformer encoder component.
│ ├──decoder.py // Transformer decoder component.
│ ├──self_attention.py // Self-Attention block component.
│ ├──multi_head_attention.py // Multi-Head Self-Attention component.
│ ├──embedding.py // Embedding component.
│ ├──positional_embedding.py // Positional embedding component.
│ ├──feed_forward_network.py // Feed forward network.
│ ├──residual_conn.py // Residual block.
│ ├──beam_search.py // Beam search decoder for inferring.
│ ├──transformer_for_infer.py // Use Transformer to infer.
│ ├──transformer_for_train.py // Use Transformer to train.
│ ├──utils
│ ├──byte_pair_encoding.py // Apply BPE with subword-nmt.
│ ├──dictionary.py // Dictionary.
│ ├──loss_moniter.py // Callback of monitoring loss during training step.
│ ├──lr_scheduler.py // Learning rate scheduler.
│ ├──ppl_score.py // Perplexity score based on N-gram.
│ ├──rouge_score.py // Calculate ROUGE score.
│ ├──load_weights.py // Load weights from a checkpoint or NPZ file.
│ ├──initializer.py // Parameters initializer.
├── vocab
│ ├──all.bpe.codes // BPE codes table(this file should be generated by user).
│ ├──all_en.dict.bin // Learned vocabulary file(this file should be generated by user).
├── scripts
│ ├──run_ascend.sh // Ascend train & evaluate model script.
│ ├──run_gpu.sh // GPU train & evaluate model script.
│ ├──learn_subword.sh // Learn BPE codes.
│ ├──stop_training.sh // Stop training.
├── requirements.txt // Requirements of third party package.
├── train.py // Train API entry.
├── eval.py // Infer API entry.
├── tokenize_corpus.py // Corpus tokenization.
├── apply_bpe_encoding.py // Applying bpe encoding.
├── weights_average.py // Average multi model checkpoints to NPZ format.
├── news_crawl.py // Create News Crawl dataset for pre-training.
├── gigaword.py // Create Gigaword Corpus.
├── cornell_dialog.py // Create Cornell Movie Dialog dataset for conversation response.
```
## Data Preparation
The data preparation of a natural language processing task contains data cleaning, tokenization, encoding and vocabulary generation steps.
In our experiments, using [Byte Pair Encoding(BPE)](https://arxiv.org/abs/1508.07909) could reduce size of vocabulary, and relieve the OOV influence effectively.
Vocabulary could be created using `src/utils/dictionary.py` with text dictionary which is learnt from BPE.
For more detail about BPE, please refer to [Subword-nmt lib](https://www.cnpython.com/pypi/subword-nmt) or [paper](https://arxiv.org/abs/1508.07909).
In our experiments, vocabulary was learned based on 1.9M sentences from News Crawl Dataset, size of vocabulary is 45755.
Here, we have a brief introduction of data preparation scripts.
### Tokenization
`tokenize_corpus.py` tokenizes a corpus whose text files are in `.txt` format.
Major parameters in `tokenize_corpus.py`:
```bash
--corpus_folder: Corpus folder path. If multiple folders are provided, separate them with ','.
--output_folder: Output folder path.
--tokenizer: Tokenizer to be used, nltk or jieba. If nltk is not fully installed, use jieba instead.
--pool_size: Process pool size.
```
Sample code:
```bash
python tokenize_corpus.py --corpus_folder /{path}/corpus --output_folder /{path}/tokenized_corpus --tokenizer {nltk|jieba} --pool_size 16
```
### Byte Pair Encoding
After tokenization, BPE is applied to the tokenized corpus with the provided `all.bpe.codes`.
The script for applying BPE can be found in `apply_bpe_encoding.py`.
Major parameters in `apply_bpe_encoding.py`:
```bash
--codes: BPE codes file.
--src_folder: Corpus folders.
--output_folder: Output files folder.
--prefix: Prefix of text file in `src_folder`.
--vocab_path: Generated vocabulary output path.
--threshold: Filter out words whose frequency is lower than the threshold.
--processes: Size of process pool (to accelerate). Default: 2.
```
Sample code:
```bash
python apply_bpe_encoding.py --codes /{path}/all.bpe.codes \
--src_folder /{path}/tokenized_corpus \
--output_folder /{path}/tokenized_corpus/bpe \
--prefix tokenized \
--vocab_path /{path}/vocab_en.dict.bin \
--processes 32
```
### Build Vocabulary
Suppose you want to create a new vocabulary; there are two options:
1. Learn BPE codes from scratch, and create the vocabulary from the multiple vocabulary files produced by `subword-nmt`.
2. Create it from an existing vocabulary file whose lines are in the format `word frequency`.
3. *Optional*: create a smaller vocabulary based on `vocab/all_en.dict.bin` with the `shrink` method from `src/utils/dictionary.py`.
4. Persist the vocabulary to the `vocab` folder with the `persistence()` method.
The major interfaces of `src/utils/dictionary.py` are as follows:
1. `shrink(self, threshold=50)`: Shrink the vocabulary by filtering out words whose frequency is lower than the threshold. It returns a new vocabulary.
2. `load_from_text(cls, filepaths: List[str])`: Load an existing text vocabulary whose lines are in the format `word frequency`.
3. `load_from_persisted_dict(cls, filepath)`: Load a persisted binary vocabulary which was saved by calling the `persistence()` method.
4. `persistence(self, path)`: Save vocabulary object to binary file.
Sample code:
```python
from src.utils import Dictionary
vocabulary = Dictionary.load_from_persisted_dict("vocab/all_en.dict.bin")
tokens = [1, 2, 3, 4, 5]
# Convert ids to symbols.
print([vocabulary[t] for t in tokens])
sentence = ["Hello", "world"]
# Convert symbols to ids.
print([vocabulary.index(s) for s in sentence])
```
For more detail, please refer to the source file.
### Generate Dataset
As mentioned above, three corpora are used in the MASS model; dataset generation scripts are provided for each of them.
#### News Crawl Corpus
Script can be found in `news_crawl.py`.
Major parameters in `news_crawl.py`:
```bash
Note: please provide at least one of `--existed_vocab` or `--dict_folder`.
A new vocabulary will be created in `output_folder` when `--dict_folder` is passed.
--src_folder: Corpus folders.
--existed_vocab: Optional, persisted vocabulary file.
--mask_ratio: Ratio of mask.
--output_folder: Output dataset files folder path.
--max_len: Maximum sentence length. Sentences longer than `max_len` are dropped.
--suffix: Optional, suffix of generated dataset files.
--processes: Optional, size of process pool (to accelerate). Default: 2.
```
Sample code:
```bash
python news_crawl.py --src_folder /{path}/news_crawl \
--existed_vocab /{path}/mass/vocab/all_en.dict.bin \
--mask_ratio 0.5 \
--output_folder /{path}/news_crawl_dataset \
--max_len 32 \
--processes 32
```
#### Gigaword Corpus
Script can be found in `gigaword.py`.
Major parameters in `gigaword.py`:
```bash
--train_src: Train source file path.
--train_ref: Train reference file path.
--test_src: Test source file path.
--test_ref: Test reference file path.
--existed_vocab: Persisted vocabulary file.
--output_folder: Output dataset files folder path.
--noise_prob: Optional, add noise prob. Default: 0.
--max_len: Optional, maximum sentence length. Sentences longer than `max_len` are dropped. Default: 64.
--format: Optional, dataset format, "mindrecord" or "tfrecord". Default: "tfrecord".
```
Sample code:
```bash
python gigaword.py --train_src /{path}/gigaword/train_src.txt \
--train_ref /{path}/gigaword/train_ref.txt \
--test_src /{path}/gigaword/test_src.txt \
--test_ref /{path}/gigaword/test_ref.txt \
--existed_vocab /{path}/mass/vocab/all_en.dict.bin \
--noise_prob 0.1 \
--output_folder /{path}/gigaword_dataset \
--max_len 64
```
#### Cornell Movie Dialog Corpus
Script can be found in `cornell_dialog.py`.
Major parameters in `cornell_dialog.py`:
```bash
--src_folder: Corpus folders.
--existed_vocab: Persisted vocabulary file.
--train_prefix: Train source and target file prefix. Default: train.
--test_prefix: Test source and target file prefix. Default: test.
--output_folder: Output dataset files folder path.
--max_len: Maximum sentence length. Sentences longer than `max_len` are dropped.
--valid_prefix: Optional, Valid source and target file prefix. Default: valid.
```
Sample code:
```bash
python cornell_dialog.py --src_folder /{path}/cornell_dialog \
--existed_vocab /{path}/mass/vocab/all_en.dict.bin \
--train_prefix train \
--test_prefix test \
--output_folder /{path}/cornell_dialog_dataset \
--max_len 64
```
## Configuration
The JSON file under the path `config/` is the template configuration file.
Almost all of the options and arguments needed can be assigned conveniently, including the training platform, dataset and model configurations, optimizer arguments, etc. Optional features such as loss scale and checkpointing are also available by setting the corresponding options.
For more detailed information about the attributes, refer to the file `config/config.py`.
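For example, a configuration instance can be built from the JSON file through `TransformerConfig.from_json_file` (defined in `config/config.py`); a minimal usage sketch:
```python
from config import TransformerConfig

# Load all option nodes (dataset, model, loss scale, learning rate, checkpoint)
# from the template JSON into one flat configuration object.
config = TransformerConfig.from_json_file("config/config.json")
print(config.batch_size, config.hidden_size, config.optimizer)
```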
## Training & Evaluation process
For training a model, the shell script `run_ascend.sh` or `run_gpu.sh` is all you need. In these scripts, the environment variables are set and the training script `train.py` under `mass` is executed.
You can start a training task with a single device or multiple devices by assigning the options and running the command in bash:
Ascend:
```ascend
sh run_ascend.sh [--options]
```
GPU:
```gpu
sh run_gpu.sh [--options]
```
The usage of `run_ascend.sh` is shown below:
```text
Usage: run_ascend.sh [-h, --help] [-t, --task <CHAR>] [-n, --device_num <N>]
[-i, --device_id <N>] [-j, --hccl_json <FILE>]
[-c, --config <FILE>] [-o, --output <FILE>]
[-v, --vocab <FILE>]
options:
-h, --help show usage
-t, --task select task: CHAR, 't' for train and 'i' for inference.
-n, --device_num device number used for training: N, default is 1.
-i, --device_id device id used for training with single device: N, 0<=N<=7, default is 0.
-j, --hccl_json rank table file used for training with multiple devices: FILE.
-c, --config configuration file as shown in the path 'mass/config': FILE.
-o, --output assign output file of inference: FILE.
-v, --vocab set the vocabulary.
-m, --metric set the metric.
```
Note: be sure to assign the `hccl_json` file when running distributed training.
The usage of `run_gpu.sh` is shown below:
```text
Usage: run_gpu.sh [-h, --help] [-t, --task <CHAR>] [-n, --device_num <N>]
[-i, --device_id <N>] [-c, --config <FILE>]
[-o, --output <FILE>] [-v, --vocab <FILE>]
options:
-h, --help show usage
-t, --task select task: CHAR, 't' for train and 'i' for inference.
-n, --device_num device number used for training: N, default is 1.
-i, --device_id device id used for training with single device: N, 0<=N<=7, default is 0.
-c, --config configuration file as shown in the path 'mass/config': FILE.
-o, --output assign output file of inference: FILE.
-v, --vocab set the vocabulary.
-m, --metric set the metric.
```
The following command shows an example of training with 2 devices.
Ascend:
```ascend
sh run_ascend.sh --task t --device_num 2 --hccl_json /{path}/rank_table.json --config /{path}/config.json
```
Note: discontinuous device IDs are not supported in `run_ascend.sh` at present; the device IDs in `rank_table.json` must start from 0.
GPU:
```gpu
sh run_gpu.sh --task t --device_num 2 --config /{path}/config.json
```
If you use a single device, the command looks like this:
Ascend:
```ascend
sh run_ascend.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
GPU:
```gpu
sh run_gpu.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
## Weights average
```python
python weights_average.py --input_files your_checkpoint_list --output_file model.npz
```
`--input_files` is a list of your checkpoint files. To use `model.npz` as the weights, set its path in `config.json` at `existed_ckpt`.
```json
{
...
"checkpoint_options": {
"existed_ckpt": "/xxx/xxx/model.npz",
"save_ckpt_steps": 1000,
...
},
...
}
```
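Conceptually, checkpoint averaging takes the element-wise mean of every parameter across the selected checkpoints and saves the result. A minimal NumPy sketch of that idea (not the exact `weights_average.py` implementation, which reads MindSpore checkpoint files) is shown below:
```python
import numpy as np

def average_checkpoints(param_dicts):
    """Element-wise average of parameters; each dict maps name -> np.ndarray."""
    return {name: np.mean([params[name] for params in param_dicts], axis=0)
            for name in param_dicts[0]}

# Hypothetical usage: `ckpts` is a list of parameter dicts loaded from checkpoints.
# averaged = average_checkpoints(ckpts)
# np.savez("model.npz", **averaged)
```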
## Learning rate scheduler
Two learning rate schedulers are provided in our model:
1. [Polynomial decay scheduler](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1).
2. [Inverse square root scheduler](https://ece.uwaterloo.ca/~dwharder/aads/Algorithms/Inverse_square_root/).
The LR scheduler can be configured in `config/config.json`.
For the polynomial decay scheduler, the config could be like:
```json
{
...
"learn_rate_config": {
"optimizer": "adam",
"lr": 1e-4,
"lr_scheduler": "poly",
"poly_lr_scheduler_power": 0.5,
"decay_steps": 10000,
"warmup_steps": 2000,
"min_lr": 1e-6
},
...
}
```
For the inverse square root scheduler, the config could be like:
```json
{
...
"learn_rate_config": {
"optimizer": "adam",
"lr": 1e-4,
"lr_scheduler": "isr",
"decay_start_step": 12000,
"warmup_steps": 2000,
"min_lr": 1e-6
},
...
}
```
More details about the LR schedulers can be found in `src/utils/lr_scheduler.py`.
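The two schedules roughly follow the formulas sketched below. This is a simplified illustration using the `config.json` field names; the exact warmup and decay behavior is defined in `src/utils/lr_scheduler.py` and may differ in detail:
```python
def poly_decay_lr(step, lr=1e-4, min_lr=1e-6, decay_steps=10000,
                  warmup_steps=2000, poly_lr_scheduler_power=0.5):
    """Polynomial decay with linear warmup (simplified sketch)."""
    if step < warmup_steps:
        return lr * step / max(warmup_steps, 1)
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (lr - min_lr) * (1.0 - progress) ** poly_lr_scheduler_power + min_lr

def isr_lr(step, lr=1e-4, min_lr=1e-6, warmup_steps=2000, decay_start_step=12000):
    """Inverse square root decay with linear warmup (simplified sketch)."""
    if step < warmup_steps:
        return lr * step / max(warmup_steps, 1)
    # Hold the base rate until `decay_start_step`, then decay as 1/sqrt(step).
    return max(lr * (decay_start_step / max(step, decay_start_step)) ** 0.5, min_lr)
```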
# Model description
The MASS network is implemented with Transformer, which has multiple encoder layers and multiple decoder layers.
For pre-training, we use the Adam optimizer and loss scaling to obtain the pre-trained model.
During fine-tuning, we fine-tune the pre-trained model with different datasets according to the different tasks.
During testing, we use the fine-tuned model to predict the result, and adopt a beam search algorithm to
get the most probable prediction results.
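As a rough illustration of the decoding step, beam search keeps the `beam_width` best partial hypotheses at each step and ranks them with a length penalty. The sketch below is a toy, generic version; the actual decoder is `src/transformer/beam_search.py` and its scoring details may differ:
```python
import numpy as np

def beam_search_step(beams, next_log_probs, beam_width, length_penalty_weight=1.0):
    """One expansion step of beam search (illustrative only).

    beams: list of (token_ids, sum_log_prob) pairs.
    next_log_probs: array [num_beams, vocab_size] with log P(next token | beam).
    """
    candidates = []
    for i, (tokens, score) in enumerate(beams):
        # Expand each beam with its top `beam_width` continuations.
        top_ids = np.argsort(next_log_probs[i])[-beam_width:]
        for tok in top_ids:
            new_tokens = tokens + [int(tok)]
            new_score = score + float(next_log_probs[i][tok])
            # Length-normalize so longer hypotheses are not unfairly penalized.
            normalized = new_score / (len(new_tokens) ** length_penalty_weight)
            candidates.append((new_tokens, new_score, normalized))
    # Keep the globally best `beam_width` hypotheses by normalized score.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return [(tokens, score) for tokens, score, _ in candidates[:beam_width]]
```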
## Performance
### Results
#### Fine-Tuning on Text Summarization
The comparisons between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task
with 3.8M training data are as follows:
| Method | RG-1(F) | RG-2(F) | RG-L(F) |
|:---------------|:--------------|:-------------|:-------------|
| MASS | Ongoing | Ongoing | Ongoing |
#### Fine-Tuning on Conversational Response Generation
The comparisons between MASS and other baseline methods in terms of PPL on Cornell Movie Dialog corpus are as follows:
| Method | Data = 10K | Data = 110K |
|--------------------|------------------|-----------------|
| MASS | Ongoing | Ongoing |
#### Training Performance
| Parameters | Masked Sequence to Sequence Pre-training for Language Generation |
|:---------------------------|:--------------------------------------------------------------------------|
| Model Version | v1 |
| Resource | Ascend 910; CPU 2.60 GHz, 56 cores; memory 314 GB |
| uploaded Date | 05/24/2020 |
| MindSpore Version | 0.2.0 |
| Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus, Cornell Movie Dialog corpus |
| Training Parameters | Epoch=50, steps=XXX, batch_size=192, lr=1e-4 |
| Optimizer | Adam |
| Loss Function | Label smoothed cross-entropy criterion |
| outputs | Sentence and probability |
| Loss | Lower than 2 |
| Accuracy | For conversation response, ppl=23.52, for text summarization, RG-1=29.79. |
| Speed | 611.45 sentences/s |
| Total time | --/-- |
| Params (M) | 44.6M |
| Checkpoint for Fine tuning | ---Mb, --, [A link]() |
| Model for inference | ---Mb, --, [A link]() |
| Scripts | [A link]() |
#### Inference Performance
| Parameters | Masked Sequence to Sequence Pre-training for Language Generation |
|:---------------------------|:-----------------------------------------------------------|
| Model Version | V1 |
| Resource | Ascend 910 |
| uploaded Date | 05/24/2020 |
| MindSpore Version | 0.2.0 |
| Dataset | Gigaword corpus, Cornell Movie Dialog corpus |
| batch_size | --- |
| outputs | Sentence and probability |
| Accuracy | ppl=23.52 for conversation response, RG-1=29.79 for text summarization. |
| Speed | ---- sentences/s |
| Total time | --/-- |
| Model for inference | ---Mb, --, [A link]() |
# Environment Requirements
## Platform
- Hardware(Ascend)
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you could get the resources for trial.
- Framework
- [MindSpore](http://10.90.67.50/mindspore/archive/20200506/OpenSource/me_vm_x86/)
- For more information, please check the resources below
- [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
- [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)
## Requirements
```txt
nltk
numpy
subword-nmt
rouge
```
# Get started
MASS pre-trains a sequence to sequence model by predicting the masked fragments in an input sequence. After that, downstream tasks including text summarization and conversational response generation are candidates for fine-tuning the model and for inference.
Here we provide a practical example to demonstrate the basic usage of MASS for pre-training, fine-tuning a model, and the inference process. The overall process is as follows:
1. Download and process the dataset.
2. Modify the `config.json` to config the network.
3. Run a task for pre-training and fine-tuning.
4. Perform inference and validation.
## Pre-training
For pre-training a model, first configure the options in `config.json`:
- Assign the `pre_train_dataset` under the `dataset_config` node to the dataset path.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_options` node to save the model files.
- Set other arguments including dataset configurations and network configurations.
- If you already have a trained model, assign the `existed_ckpt` to the checkpoint file.
If you use the Ascend chip, run the shell script `run_ascend.sh` as follows:
```ascend
sh run_ascend.sh -t t -n 1 -i 1 -c /mass/config/config.json
```
You can also run the shell script `run_gpu.sh` on GPU as follows:
```gpu
sh run_gpu.sh -t t -n 1 -i 1 -c /mass/config/config.json
```
Get the log and output files under the path `./train_mass_*/`, and the model file under the path assigned in the `config/config.json` file.
## Fine-tuning
For fine-tuning a model, first configure the options in `config.json`:
- Assign the `fine_tune_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_options` node to the model file generated by pre-training.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_options` node to save the model files.
- Set other arguments including dataset configurations and network configurations.
If you use the Ascend chip, run the shell script `run_ascend.sh` as follows:
```ascend
sh run_ascend.sh -t t -n 1 -i 1 -c config/config.json
```
You can also run the shell script `run_gpu.sh` on GPU as follows:
```gpu
sh run_gpu.sh -t t -n 1 -i 1 -c config/config.json
```
Get the log and output files under the path `./train_mass_*/`, and the model file under the path assigned in the `config/config.json` file.
## Inference
If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/network_migration.html).
For inference, first configure the options in `config.json`:
- Assign the `test_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_options` node to the model file produced by fine-tuning.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_options` node to save the model files.
- Set other arguments including dataset configurations and network configurations.
If you use the Ascend chip, run the shell script `run_ascend.sh` as follows:
```bash
sh run_ascend.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}
```
You can also run the shell script `run_gpu.sh` on GPU as follows:
```gpu
sh run_gpu.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}
```
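The file assigned with `-o {outputfile}` is written by `eval.py` as a Python pickle. It can be loaded afterwards for further analysis, e.g. (assuming the output file is named `output`):
```python
import pickle

with open("output", "rb") as file:
    result = pickle.load(file)  # inference results produced by eval.py
print(type(result))
```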
# Description of random situation
The MASS model contains dropout operations. If you want to disable dropout, set the related dropout probabilities to 0 in `config/config.json`.
# Others
The model has been validated on the Ascend environment, and has not been validated on CPU or GPU.
# ModelZoo Homepage
[Link](https://gitee.com/mindspore/mindspore/tree/master/mindspore/model_zoo)

View File

@ -0,0 +1,84 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Apply bpe script."""
import os
import argparse
from multiprocessing import Pool, cpu_count
from src.utils import Dictionary
from src.utils import bpe_encode
parser = argparse.ArgumentParser(description='Apply BPE.')
parser.add_argument("--codes", type=str, default="", required=True,
help="bpe codes path.")
parser.add_argument("--src_folder", type=str, default="", required=True,
help="raw corpus folder.")
parser.add_argument("--output_folder", type=str, default="", required=True,
help="encoded corpus output path.")
parser.add_argument("--prefix", type=str, default="", required=False,
help="Prefix of text file.")
parser.add_argument("--vocab_path", type=str, default="", required=True,
help="Generated vocabulary output path.")
parser.add_argument("--threshold", type=int, default=None, required=False,
help="Filter out words that frequency is lower than threshold.")
parser.add_argument("--processes", type=int, default=2, required=False,
help="Number of processes to use.")
if __name__ == '__main__':
args, _ = parser.parse_known_args()
if not (args.codes and args.src_folder and args.output_folder):
raise ValueError("Please enter required params.")
source_folder = args.src_folder
output_folder = args.output_folder
codes = args.codes
if not os.path.exists(codes):
raise FileNotFoundError("`--codes` does not exist.")
if not os.path.exists(source_folder) or not os.path.isdir(source_folder):
raise ValueError("`--src_folder` must be an existing directory.")
if not os.path.exists(output_folder) or not os.path.isdir(output_folder):
raise ValueError("`--output_folder` must be an existing directory.")
if not isinstance(args.prefix, str) or len(args.prefix) > 128:
raise ValueError("`--prefix` must be a str and len <= 128.")
if not isinstance(args.processes, int):
raise TypeError("`--processes` must be an integer.")
available_dict = []
args_groups = []
for file in os.listdir(source_folder):
if args.prefix and not file.startswith(args.prefix):
continue
if file.endswith(".txt"):
output_path = os.path.join(output_folder, file.replace(".txt", "_bpe.txt"))
dict_path = os.path.join(output_folder, file.replace(".txt", ".dict"))
available_dict.append(dict_path)
args_groups.append((codes, os.path.join(source_folder, file),
output_path, dict_path))
kernel_size = 1 if args.processes <= 0 else args.processes
kernel_size = min(kernel_size, cpu_count())
pool = Pool(kernel_size)
for arg in args_groups:
pool.apply_async(bpe_encode, args=arg)
pool.close()
pool.join()
vocab = Dictionary.load_from_text(available_dict)
if args.threshold is not None:
vocab = vocab.shrink(args.threshold)
vocab.persistence(args.vocab_path)
print(f" | Vocabulary Size: {len(vocab)}")

View File

@ -0,0 +1,20 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""MASS model configuration."""
from .config import TransformerConfig
__all__ = [
"TransformerConfig"
]

View File

@ -0,0 +1,243 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Configuration class for Transformer."""
import os
import json
import copy
from typing import List
import mindspore.common.dtype as mstype
def _is_dataset_file(file: str):
return "tfrecord" in file.lower() or "mindrecord" in file.lower()
def _get_files_from_dir(folder: str):
_files = []
for file in os.listdir(folder):
if _is_dataset_file(file):
_files.append(os.path.join(folder, file))
return _files
def get_source_list(folder: str) -> List:
"""
Get file list from a folder.
Returns:
list, file list.
"""
_list = []
if not folder:
return _list
if os.path.isdir(folder):
_list = _get_files_from_dir(folder)
else:
if _is_dataset_file(folder):
_list.append(folder)
return _list
PARAM_NODES = {"dataset_config",
"model_config",
"loss_scale_config",
"learn_rate_config",
"checkpoint_options"}
class TransformerConfig:
"""
Configuration for `Transformer`.
Args:
random_seed (int): Random seed.
batch_size (int): Batch size of input dataset.
epochs (int): Epoch number.
dataset_sink_mode (bool): Whether enable dataset sink mode.
dataset_sink_step (int): Dataset sink step.
lr_scheduler (str): Learning rate scheduler to use; "poly" and "isr" are supported.
lr (float): Initial learning rate.
min_lr (float): Minimum learning rate.
decay_start_step (int): Step to decay.
warmup_steps (int): Warm up steps.
dataset_schema (str): Path of dataset schema file.
pre_train_dataset (str): Path of pre-training dataset file or folder.
fine_tune_dataset (str): Path of fine-tune dataset file or folder.
test_dataset (str): Path of test dataset file or folder.
valid_dataset (str): Path of validation dataset file or folder.
ckpt_path (str): Checkpoints save path.
save_ckpt_steps (int): Interval of saving ckpt.
ckpt_prefix (str): Prefix of ckpt file.
keep_ckpt_max (int): Max ckpt files number.
seq_length (int): Length of input sequence. Default: 64.
vocab_size (int): The shape of each embedding vector. Default: 46192.
hidden_size (int): Size of embedding, attention, dim. Default: 512.
num_hidden_layers (int): Encoder, Decoder layers.
ngram (int): Number of tokens to predict ahead. Default: 2.
accumulation_steps (int): Number of steps to hold until next gradient optimization. Default: 1.
num_attention_heads (int): Number of attention heads in the Transformer encoder/decoder
cell. Default: 8.
intermediate_size (int): Size of intermediate layer in the Transformer
encoder/decoder cell. Default: 4096.
hidden_act (str): Activation function used in the Transformer encoder/decoder
cell. Default: "relu".
loss_scale_mode (str): Loss scale mode. Default: "dynamic".
init_loss_scale (int): Initialized loss scale.
loss_scale_factor (int): Loss scale factor.
scale_window (int): Window size of loss scale.
beam_width (int): Beam width for beam search in inferring. Default: 4.
length_penalty_weight (float): Penalty for sentence length. Default: 1.0.
label_smoothing (float): Label smoothing setting. Default: 0.1.
input_mask_from_dataset (bool): Specifies whether to use the input mask that loaded from
dataset. Default: True.
save_graphs (bool): Whether to save graphs, please set to True if mindinsight
is wanted.
dtype (mstype): Data type of the input. Default: mstype.float32.
max_decode_length (int): Max decode length for inferring. Default: 64.
hidden_dropout_prob (float): The dropout probability for hidden outputs. Default: 0.1.
attention_dropout_prob (float): The dropout probability for
Multi-head Self-Attention. Default: 0.1.
max_position_embeddings (int): Maximum length of sequences used in this
model. Default: 512.
initializer_range (float): Initialization value of TruncatedNormal. Default: 0.02.
"""
def __init__(self,
random_seed=74,
batch_size=64, epochs=1,
dataset_sink_mode=True, dataset_sink_step=1,
lr_scheduler="", optimizer="adam",
lr=1e-4, min_lr=1e-6,
decay_steps=10000, poly_lr_scheduler_power=1,
decay_start_step=-1, warmup_steps=2000,
pre_train_dataset: str = None,
fine_tune_dataset: str = None,
test_dataset: str = None,
valid_dataset: str = None,
ckpt_path: str = None,
save_ckpt_steps=2000,
ckpt_prefix="CKPT",
existed_ckpt="",
keep_ckpt_max=20,
seq_length=128,
vocab_size=46192,
hidden_size=512,
num_hidden_layers=6,
ngram=2,
accumulation_steps=1,
disable_ngram_loss=False,
num_attention_heads=8,
intermediate_size=4096,
hidden_act="relu",
hidden_dropout_prob=0.1,
attention_dropout_prob=0.1,
max_position_embeddings=64,
initializer_range=0.02,
loss_scale_mode="dynamic",
init_loss_scale=2 ** 10,
loss_scale_factor=2, scale_window=2000,
beam_width=5,
length_penalty_weight=1.0,
label_smoothing=0.1,
input_mask_from_dataset=True,
save_graphs=False,
dtype=mstype.float32,
max_decode_length=64):
self.save_graphs = save_graphs
self.random_seed = random_seed
self.pre_train_dataset = get_source_list(pre_train_dataset) # type: List[str]
self.fine_tune_dataset = get_source_list(fine_tune_dataset) # type: List[str]
self.valid_dataset = get_source_list(valid_dataset) # type: List[str]
self.test_dataset = get_source_list(test_dataset) # type: List[str]
if not isinstance(epochs, int) or epochs < 0:
raise ValueError("`epochs` must be a non-negative integer.")
self.epochs = epochs
self.dataset_sink_mode = dataset_sink_mode
self.dataset_sink_step = dataset_sink_step
self.ckpt_path = ckpt_path
self.keep_ckpt_max = keep_ckpt_max
self.save_ckpt_steps = save_ckpt_steps
self.ckpt_prefix = ckpt_prefix
self.existed_ckpt = existed_ckpt
self.batch_size = batch_size
self.seq_length = seq_length
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.ngram = ngram
self.accumulation_steps = accumulation_steps
self.disable_ngram_loss = disable_ngram_loss
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_dropout_prob = attention_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.label_smoothing = label_smoothing
self.beam_width = beam_width
self.length_penalty_weight = length_penalty_weight
self.max_decode_length = max_decode_length
self.input_mask_from_dataset = input_mask_from_dataset
self.compute_type = mstype.float32
self.dtype = dtype
self.loss_scale_mode = loss_scale_mode
self.scale_window = scale_window
self.loss_scale_factor = loss_scale_factor
self.init_loss_scale = init_loss_scale
self.optimizer = optimizer
self.lr = lr
self.lr_scheduler = lr_scheduler
self.min_lr = min_lr
self.poly_lr_scheduler_power = poly_lr_scheduler_power
self.decay_steps = decay_steps
self.decay_start_step = decay_start_step
self.warmup_steps = warmup_steps
self.train_url = ""
@classmethod
def from_dict(cls, json_object: dict):
"""Constructs a `TransformerConfig` from a Python dictionary of parameters."""
_params = {}
for node in PARAM_NODES:
for key in json_object[node]:
_params[key] = json_object[node][key]
return cls(**_params)
@classmethod
def from_json_file(cls, json_file):
"""Constructs a `TransformerConfig` from a json file of parameters."""
with open(json_file, "r") as reader:
return cls.from_dict(json.load(reader))
def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output
def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

View File

@ -0,0 +1,59 @@
{
"dataset_config": {
"epochs": 5,
"batch_size": 1,
"pre_train_dataset": "",
"fine_tune_dataset": "../cnndm_data_prophetnet/dataset_hugging_face_tokenized/train",
"test_dataset": "",
"valid_dataset": "",
"dataset_sink_mode": false,
"dataset_sink_step": 100
},
"model_config": {
"random_seed": 1,
"save_graphs": false,
"seq_length": 512,
"vocab_size": 30522,
"hidden_size": 512,
"num_hidden_layers": 3,
"ngram": 2,
"accumulation_steps": 1,
"disable_ngram_loss": false,
"num_attention_heads": 8,
"intermediate_size": 2048,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_dropout_prob": 0.1,
"max_position_embeddings": 512,
"initializer_range": 0.02,
"label_smoothing": 0.1,
"beam_width": 5,
"length_penalty_weight": 1.0,
"max_decode_length": 64,
"input_mask_from_dataset": true
},
"loss_scale_config": {
"loss_scale_mode":"static",
"init_loss_scale": 1,
"loss_scale_factor": 2,
"scale_window": 200
},
"learn_rate_config": {
"optimizer": "adam",
"lr": 1e-4,
"lr_scheduler": "isr",
"poly_lr_scheduler_power": 0.5,
"decay_steps": 10000,
"decay_start_step": 1000,
"warmup_steps": 1000,
"min_lr": 1e-7
},
"checkpoint_options": {
"existed_ckpt": "",
"save_ckpt_steps": 20000,
"keep_ckpt_max": 50,
"ckpt_prefix": "ckpt",
"ckpt_path": "checkpoints"
}
}

View File

@ -0,0 +1,58 @@
{
"dataset_config": {
"epochs": 2,
"batch_size": 1,
"pre_train_dataset": "../news_crawl/dataset/tf_small_pretrain",
"fine_tune_dataset": "",
"test_dataset": "",
"valid_dataset": "",
"dataset_sink_mode": false,
"dataset_sink_step": 100
},
"model_config": {
"random_seed": 100,
"save_graphs": false,
"seq_length": 128,
"vocab_size": 44000,
"hidden_size": 768,
"num_hidden_layers": 3,
"ngram": 2,
"disable_ngram_loss": false,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "relu",
"hidden_dropout_prob": 0.1,
"attention_dropout_prob": 0.1,
"max_position_embeddings": 64,
"initializer_range": 0.02,
"label_smoothing": 0.1,
"beam_width": 4,
"length_penalty_weight": 1.0,
"max_decode_length": 64,
"input_mask_from_dataset": true
},
"loss_scale_config": {
"loss_scale_mode":"static",
"init_loss_scale": 32,
"loss_scale_factor": 2,
"scale_window": 200
},
"learn_rate_config": {
"optimizer": "adam",
"lr": 1e-4,
"lr_scheduler": "poly",
"poly_lr_scheduler_power": 0.5,
"decay_steps": 10000,
"decay_start_step": 12000,
"warmup_steps": 4000,
"min_lr": 1e-6
},
"checkpoint_options": {
"existed_ckpt": "/home/yanglinfeng/ProphetNet/training_result/checkpoints/ckpt_1_0.ckpt",
"save_ckpt_steps": 10,
"keep_ckpt_max": 50,
"ckpt_prefix": "ckpt",
"ckpt_path": "checkpoints"
}
}

View File

@ -0,0 +1,57 @@
{
"dataset_config": {
"epochs": 2,
"batch_size": 1,
"pre_train_dataset": "",
"fine_tune_dataset": "",
"test_dataset": "../cnndm_data_prophetnet/dataset_hugging_face_tokenized",
"valid_dataset": "",
"dataset_sink_mode": false,
"dataset_sink_step": 100
},
"model_config": {
"random_seed": 100,
"save_graphs": false,
"seq_length": 512,
"vocab_size": 30522,
"hidden_size": 512,
"num_hidden_layers": 3,
"ngram": 2,
"disable_ngram_loss": false,
"num_attention_heads": 8,
"intermediate_size": 2048,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_dropout_prob": 0.1,
"max_position_embeddings": 512,
"initializer_range": 0.02,
"label_smoothing": 0.1,
"beam_width": 5,
"length_penalty_weight": 1.2,
"max_decode_length": 110,
"input_mask_from_dataset": true
},
"loss_scale_config": {
"loss_scale_mode":"static",
"init_loss_scale": 32,
"loss_scale_factor": 2,
"scale_window": 200
},
"learn_rate_config": {
"optimizer": "adam",
"lr": 1e-4,
"lr_scheduler": "poly",
"poly_lr_scheduler_power": 0.5,
"decay_steps": 10000,
"decay_start_step": 12000,
"warmup_steps": 4000,
"min_lr": 1e-6
},
"checkpoint_options": {
"existed_ckpt": "../training_weight/ckpt-1_20000.ckpt",
"save_ckpt_steps": 500,
"keep_ckpt_max": 50,
"ckpt_prefix": "ckpt",
"ckpt_path": "checkpoints"
}
}

View File

@ -0,0 +1,77 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Evaluation api."""
import os
import argparse
import pickle
from mindspore.common import dtype as mstype
from mindspore import context
from config import TransformerConfig
from src.transformer import infer, infer_ppl
from src.utils import Dictionary
from src.utils import get_score
parser = argparse.ArgumentParser(description='Evaluation MASS.')
parser.add_argument("--config", type=str, required=True,
help="Model config json file path.")
parser.add_argument("--vocab", type=str, required=True,
help="Vocabulary to use.")
parser.add_argument("--output", type=str, required=True,
help="Result file path.")
parser.add_argument("--metric", type=str, default='rouge',
help='Set eval method.')
parser.add_argument("--platform", type=str, required=True,
help="model working platform.")
def get_config(config):
config = TransformerConfig.from_json_file(config)
config.compute_type = mstype.float32
config.dtype = mstype.float32
return config
if __name__ == '__main__':
args, _ = parser.parse_known_args()
if args.vocab.endswith("bin"):
vocab = Dictionary.load_from_persisted_dict(args.vocab)
else:
vocab = Dictionary.load_from_text([args.vocab])
_config = get_config(args.config)
device_id = os.getenv('DEVICE_ID', None)
if device_id is None:
device_id = 0
device_id = int(device_id)
context.set_context(
#mode=context.GRAPH_MODE,
mode=context.PYNATIVE_MODE,
device_target=args.platform,
reserve_class_name_in_scope=False,
device_id=device_id)
if args.metric == 'rouge':
result = infer(_config)
else:
result = infer_ppl(_config)
with open(args.output, "wb") as f:
pickle.dump(result, f, 1)
# get score by given metric
score = get_score(result, vocab, metric=args.metric)
print(score)

View File

@ -0,0 +1,84 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Generate Gigaword dataset."""
import os
import argparse
from src.dataset import BiLingualDataLoader
from src.language_model import NoiseChannelLanguageModel
from src.utils import Dictionary
parser = argparse.ArgumentParser(description='Create Gigaword fine-tune Dataset.')
parser.add_argument("--train_src", type=str, default="", required=False,
help="train dataset source file path.")
parser.add_argument("--train_ref", type=str, default="", required=False,
help="train dataset reference file path.")
parser.add_argument("--test_src", type=str, default="", required=False,
help="test dataset source file path.")
parser.add_argument("--test_ref", type=str, default="", required=False,
help="test dataset reference file path.")
parser.add_argument("--noise_prob", type=float, default=0., required=False,
help="add noise prob.")
parser.add_argument("--existed_vocab", type=str, default="", required=False,
help="existed vocab path.")
parser.add_argument("--max_len", type=int, default=64, required=False,
help="max length of sentences.")
parser.add_argument("--output_folder", type=str, default="", required=True,
help="dataset output path.")
parser.add_argument("--format", type=str, default="tfrecord", required=False,
help="dataset format.")
if __name__ == '__main__':
args, _ = parser.parse_known_args()
vocab = Dictionary.load_from_persisted_dict(args.existed_vocab)
if args.train_src and args.train_ref:
train = BiLingualDataLoader(
src_filepath=args.train_src,
tgt_filepath=args.train_ref,
src_dict=vocab, tgt_dict=vocab,
src_lang="en", tgt_lang="en",
language_model=NoiseChannelLanguageModel(add_noise_prob=args.noise_prob),
max_sen_len=args.max_len
)
if "tf" in args.format.lower():
train.write_to_tfrecord(
path=os.path.join(args.output_folder, "gigaword_train_dataset.tfrecord")
)
else:
train.write_to_mindrecord(
path=os.path.join(args.output_folder, "gigaword_train_dataset.mindrecord")
)
if args.test_src and args.test_ref:
test = BiLingualDataLoader(
src_filepath=args.test_src,
tgt_filepath=args.test_ref,
src_dict=vocab, tgt_dict=vocab,
src_lang="en", tgt_lang="en",
language_model=NoiseChannelLanguageModel(add_noise_prob=0),
max_sen_len=args.max_len
)
if "tf" in args.format.lower():
test.write_to_tfrecord(
path=os.path.join(args.output_folder, "gigaword_test_dataset.tfrecord")
)
else:
test.write_to_mindrecord(
path=os.path.join(args.output_folder, "gigaword_test_dataset.mindrecord")
)
print(f" | Vocabulary size: {vocab.size}.")

View File

@ -0,0 +1,209 @@
python tokenize_corpus.py --corpus_folder /{path}/corpus --output_folder /{path}/tokenized_corpus --tokenizer nltk --pool_size 16
cd tokenized_corpus/
# build bpe codes
cat *.txt | subword-nmt learn-bpe -s 46000 -o all.bpe.codes
# build bpe dict
"subword-nmt get-vocab -i tokenized.txt -o vocab_en.dict.bin"
# apply bpe encoding
python apply_bpe_encoding.py --codes ~/Mindspore/mindspore/model_zoo/official/nlp/mass/tokenized_corpus/all.bpe.codes \
--src_folder ~/Mindspore/mindspore/model_zoo/official/nlp/mass/tokenized_corpus/ \
--output_folder ~/Mindspore/mindspore/model_zoo/official/nlp/mass/tokenized_corpus/bpe \
--vocab_path ~/Mindspore/mindspore/model_zoo/official/nlp/mass/tokenized_corpus/vocab_en.dict.bin \
--processes 32
# build dataset news crawl
python news_crawl.py --src_folder ./news_crawl \
--dict_folder ./news_crawl \
--existed_vocab ./tokenized_corpus/vocab_en.dict.bin \
--mask_ratio 0.5 \
--output_folder ./news_crawl/dataset/tf_small_pretrain \
--max_len 128 \
--processes 32 \
--ngram 2
# build dataset cnndm
python cnn_dm.py --test_src ./cnndm_data_prophetnet/prophetnet_tokenized/test.src.txt --test_ref ./cnndm_data_prophetnet/prophetnet_tokenized/test.tgt.txt --existed_vocab ./cnndm_data_prophetnet/cnndm_torch_prophetnet_30522.bin --noise_prob 0.0 --output_folder ./cnndm_data_prophetnet/dataset_hugging_face_tokenized/ --max_len 512
# train
bash run_gpu.sh --task t --device_num 1 --device_id 3 --config ./config/config.json
# inference
bash run_gpu.sh --task i \
--device_num 1 \
--device_id 3 \
--config ./config/test.json \
--output output \
--metric rouge \
--vocab ./cnndm_data_prophetnet/cnndm_torch_prophetnet_30522.bin
# pytorch model structure
NgramTransformerProphetModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(30522, 512, padding_idx=0)
(embed_positions): LearnedPositionalEmbedding(513, 512, padding_idx=0)
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(emb_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(decoder): NgramTransformerDecoder(
(embed_tokens): Embedding(30522, 512, padding_idx=0)
(embed_positions): LearnedPositionalEmbedding(514, 512, padding_idx=0)
(ngram_input_embed): Embedding(2, 512)
(layers): ModuleList(
(0): NgramTransformerDecoderLayer(
(ngram_self_attn): NgramMultiheadAttention(
(relative_linear): Linear(in_features=512, out_features=256, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): NgramTransformerDecoderLayer(
(ngram_self_attn): NgramMultiheadAttention(
(relative_linear): Linear(in_features=512, out_features=256, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): NgramTransformerDecoderLayer(
(ngram_self_attn): NgramMultiheadAttention(
(relative_linear): Linear(in_features=512, out_features=256, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=True)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(emb_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
data example:
src_tokens
tensor([[ 1996, 11555, 18172, 7042, 2055, 1037, 18147, 5913, 3756, 6982,
1999, 1996, 4120, 1012, 2007, 1996, 4022, 2000, 2022, 3621,
2062, 4795, 1010, 2021, 2074, 2004, 26102, 1010, 1996, 7726,
3212, 2038, 2042, 27696, 1996, 6745, 2804, 2000, 2049, 4170,
1011, 1037, 8235, 4408, 28653, 2630, 6982, 1012, 11216, 1997,
1996, 27143, 1011, 2550, 21905, 2442, 2031, 2245, 2008, 1996,
13576, 8703, 2052, 2191, 1996, 7477, 12586, 1999, 2007, 1996,
2784, 5380, 1997, 1996, 2152, 11915, 1012, 17186, 2091, 2005,
2678, 1012, 3239, 1011, 9105, 1024, 7726, 3212, 9058, 2020,
4760, 2125, 2037, 4408, 28653, 12622, 2006, 2110, 2547, 1012,
18783, 1024, 7726, 3212, 3738, 3233, 2006, 2327, 1997, 1996,
8254, 2050, 1021, 6982, 2328, 27143, 1012, 2021, 2009, 1005,
1055, 2524, 2000, 2903, 2008, 1996, 4099, 2180, 1005, 1056,
2156, 2023, 2028, 2746, 2007, 1996, 6120, 2437, 2009, 3233,
2041, 2066, 1037, 14699, 7639, 2114, 1996, 2300, 1005, 1055,
3302, 1012, 1996, 3212, 2001, 4760, 2125, 1996, 3239, 1011,
9105, 4325, 1010, 2029, 2003, 2105, 1996, 2946, 1997, 1037,
15437, 1010, 2006, 4238, 2110, 2547, 7483, 1012, 3212, 4584,
1010, 2738, 4603, 2135, 5102, 1999, 5810, 2601, 11408, 4102,
2000, 2037, 28190, 2911, 1010, 3427, 2004, 1996, 8254, 2050,
1011, 1021, 1010, 6055, 2007, 3424, 1011, 2911, 10815, 1010,
2001, 3390, 2012, 24112, 2099, 17532, 1010, 2379, 1996, 6143,
11195, 1997, 7570, 10867, 17040, 1012, 2048, 2047, 7726, 1011,
2328, 1043, 16102, 4313, 4942, 2015, 1998, 2048, 13671, 25215,
11890, 27528, 2102, 2020, 2036, 5359, 2000, 1996, 3212, 1012,
8235, 2630, 1024, 4238, 1005, 1055, 4397, 3390, 1043, 16102,
4313, 6982, 5829, 1999, 2392, 1997, 1037, 4049, 1999, 1996,
2670, 3417, 1997, 24112, 2099, 17532, 1999, 1996, 4723, 6084,
1012, 19194, 1024, 1996, 12622, 3233, 2041, 2066, 1037, 14699,
1011, 7639, 2114, 1996, 3302, 1997, 1996, 2712, 1012, 3212,
2708, 4373, 5902, 5292, 28065, 14511, 4430, 2360, 13380, 2072,
2001, 9339, 2006, 7726, 2547, 2004, 3038, 2008, 1996, 3842,
2442, 10295, 1996, 1005, 14751, 2974, 1998, 2327, 1011, 3694,
4128, 2000, 4047, 2049, 6645, 1012, 1005, 1043, 16102, 4313,
2465, 12622, 2064, 2543, 10815, 1998, 18544, 2012, 1996, 2168,
2051, 1010, 1998, 2064, 5452, 1999, 1996, 4723, 6084, 1005,
1055, 8467, 5380, 1012, 4238, 2038, 4912, 2000, 12200, 2049,
2250, 3639, 1998, 3987, 9859, 1010, 3038, 2151, 2825, 2925,
4491, 2006, 2009, 2052, 2272, 2013, 1996, 2250, 1998, 2712,
1012, 1996, 2406, 2085, 4447, 2000, 2022, 1005, 2969, 7182,
1005, 1999, 3408, 1997, 17731, 3941, 2000, 3113, 2049, 2510,
3791, 1012, 14430, 1024, 1996, 7726, 6982, 1005, 1055, 2453,
2022, 2062, 9252, 2084, 1996, 11555, 1005, 21864, 15952, 3756,
6982, 1010, 15885, 1010, 2021, 2027, 2024, 8053, 14224, 11401,
1012, 102]], device='cuda:0')
prev_output_tokens
tensor([[ 102, 7726, 2110, 2547, 3662, 8333, 1997, 1996, 2047, 3719,
1011, 1037, 8254, 2050, 1021, 6982, 1010, 2048, 1043, 16102,
4313, 4942, 2015, 1998, 1037, 3940, 1997, 25215, 11890, 27528,
2102, 1012, 2, 3212, 4584, 2360, 2008, 1996, 4170, 2442,
10295, 1005, 1996, 14751, 2974, 1005, 2000, 4047, 2049, 6645,
1012]], device='cuda:0')
target_tokens:
tensor([[ 7726, 2110, 2547, 3662, 8333, 1997, 1996, 2047, 3719, 1011,
1037, 8254, 2050, 1021, 6982, 1010, 2048, 1043, 16102, 4313,
4942, 2015, 1998, 1037, 3940, 1997, 25215, 11890, 27528, 2102,
1012, 2, 3212, 4584, 2360, 2008, 1996, 4170, 2442, 10295,
1005, 1996, 14751, 2974, 1005, 2000, 4047, 2049, 6645, 1012,
102]], device='cuda:0')

View File

@ -0,0 +1,61 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Generate News Crawl corpus dataset."""
import argparse
from src.utils import Dictionary
from src.utils.preprocess import create_pre_training_dataset
parser = argparse.ArgumentParser(description='Create News Crawl Pre-Training Dataset.')
parser.add_argument("--src_folder", type=str, default="", required=True,
help="Raw corpus folder.")
parser.add_argument("--existed_vocab", type=str, default="", required=True,
help="Existed vocab path.")
parser.add_argument("--mask_ratio", type=float, default=0.4, required=True,
help="Mask ratio.")
parser.add_argument("--output_folder", type=str, default="", required=True,
help="Dataset output path.")
parser.add_argument("--max_len", type=int, default=32, required=False,
help="Max length of sentences.")
parser.add_argument("--ngram", type=int, default=3, required=True,
help="Number of tokens to predict ahead.")
parser.add_argument("--suffix", type=str, default="", required=False,
help="Add suffix to output file.")
parser.add_argument("--processes", type=int, default=2, required=False,
help="Size of processes pool.")
if __name__ == '__main__':
args, _ = parser.parse_known_args()
if not (args.src_folder and args.output_folder):
raise ValueError("Please enter required params.")
if not args.existed_vocab:
raise ValueError("`--existed_vocab` is required.")
vocab = Dictionary.load_from_persisted_dict(args.existed_vocab)
create_pre_training_dataset(
folder_path=args.src_folder,
output_folder_path=args.output_folder,
vocabulary=vocab,
prefix="news.20", suffix=args.suffix,
mask_ratio=args.mask_ratio,
ngram=args.ngram,
min_sen_len=10,
max_sen_len=args.max_len,
dataset_type="tfrecord",
cores=args.processes
)
print(f" | Vocabulary size: {vocab.size}.")

View File

@ -0,0 +1,20 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
src_folder_path=$1 # source text folder path.
cd $src_folder_path || exit
cat *.txt | subword-nmt learn-bpe -s 46000 -o all.bpe.codes

View File

@ -0,0 +1,179 @@
#!/usr/bin/env bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export DEVICE_ID=0
export RANK_ID=0
export RANK_SIZE=1
options=`getopt -u -o ht:n:i:j:c:o:v:m: -l help,task:,device_num:,device_id:,hccl_json:,config:,output:,vocab:,metric: -- "$@"`
eval set -- "$options"
echo $options
echo_help()
{
echo "Usage:"
echo "bash train.sh [-h] [-t t|i] [-n N] [-i N] [-j FILE] [-c FILE] [-o FILE] [-v FILE]"
echo "options:"
echo " -h --help show usage"
echo " -t --task select task, 't' for training and 'i' for inference"
echo " -n --device_num training with N devices"
echo " -i --device_id training with device i"
echo " -j --hccl_json set the rank table file"
echo " -c --config set the configuration file"
echo " -o --output set the output file of inference"
echo " -v --vocab set the vocabulary"
echo " -m --metric set the metric"
}
set_hccl_json()
{
while [ -n "$1" ]
do
if [[ "$1" == "-j" || "$1" == "--hccl_json" ]]
then
export RANK_TABLE_FILE=$2
break
fi
shift
done
}
set_device_id()
{
while [ -n "$1" ]
do
if [[ "$1" == "-i" || "$1" == "--device_id" ]]
then
if [[ $2 -ge 0 && $2 -le 7 ]]
then
export DEVICE_ID=$2
fi
break
fi
shift
done
}
while [ -n "$1" ]
do
case "$1" in
-h|--help)
echo_help
shift
;;
-t|--task)
echo "task:"
if [ "$2" == "t" ]
then
task=train
elif [ "$2" == "i" ]
then
task=infer
fi
shift 2
;;
-n|--device_num)
echo "device_num"
if [ $2 -eq 1 ]
then
set_device_id $options
elif [ $2 -gt 1 ]
then
export HCCL_FLAG=1
export DEPLOY_MODE=0
export RANK_SIZE=$2
set_hccl_json $options
fi
shift 2
;;
-i|--device_id)
echo "set device id"
export DEVICE_ID=$2
shift 2
;;
-c|--config)
echo "config";
configurations=$2
shift 2
;;
-o|--output)
echo "output";
output=$2
shift 2
;;
-v|--vocab)
echo "vocab";
vocab=$2
shift 2
;;
-m|--metric)
echo "metric";
metric=$2
shift 2
;;
--)
shift
break
;;
*)
shift
;;
esac
done
file_path=$(cd "$(dirname $0)" || exit; pwd)
for((i=0; i < $RANK_SIZE; i++))
do
if [ $RANK_SIZE -gt 1 ]
then
echo $RANK_SIZE
export RANK_ID=$i
export DEVICE_ID=$[i]
fi
echo "Working on device $i"
cd $file_path || exit
cd ../ || exit
rm -rf ./${task}_prophetnet_$DEVICE_ID
mkdir ./${task}_prophetnet_$DEVICE_ID
cp train_gradient_accumulation.py ./${task}_prophetnet_$DEVICE_ID
cp train.py ./${task}_prophetnet_$DEVICE_ID
cp eval.py ./${task}_prophetnet_$DEVICE_ID
cp -r src ./${task}_prophetnet_$DEVICE_ID
cp -r config ./${task}_prophetnet_$DEVICE_ID
cp $configurations ./${task}_prophetnet_$DEVICE_ID
if [ $vocab ]
then
cp $vocab ./${task}_prophetnet_$DEVICE_ID
fi
cd ./${task}_prophetnet_$DEVICE_ID || exit
env > log.log
echo $task
if [ "$task" == "train" ]
then
#python train.py --config ${configurations##*/} --platform Ascend >>log.log 2>&1 &
python train.py --config ${configurations##*/} --platform Ascend
elif [ "$task" == "infer" ]
then
#python eval.py --config ${configurations##*/} --output ${output} --vocab ${vocab##*/} --metric ${metric} --platform Ascend >>log_infer.log 2>&1 &
python eval.py --config ${configurations##*/} --output ${output} --vocab ${vocab##*/} --metric ${metric} --platform Ascend
fi
cd ../
done

View File

@ -0,0 +1,162 @@
#!/usr/bin/env bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
export DEVICE_ID=0
export RANK_ID=0
export RANK_SIZE=1
options=`getopt -u -o ht:n:i:c:o:v:m: -l help,task:,device_num:,device_id:,config:,output:,vocab:,metric: -- "$@"`
eval set -- "$options"
echo $options
echo_help()
{
echo "Usage:"
echo "bash train.sh [-h] [-t t|i] [-n N] [-i N] [-j FILE] [-c FILE] [-o FILE] [-v FILE]"
echo "options:"
echo " -h --help show usage"
echo " -t --task select task, 't' for training and 'i' for inference"
echo " -n --device_num training with N devices"
echo " -i --device_id training with device i"
echo " -c --config set the configuration file"
echo " -o --output set the output file of inference"
echo " -v --vocab set the vocabulary"
echo " -m --metric set the metric"
}
set_device_id()
{
while [ -n "$1" ]
do
if [[ "$1" == "-i" || "$1" == "--device_id" ]]
then
if [[ $2 -ge 0 && $2 -le 7 ]]
then
export DEVICE_ID=$2
fi
break
fi
shift
done
}
while [ -n "$1" ]
do
case "$1" in
-h|--help)
echo_help
shift
;;
-t|--task)
echo "task:"
if [ "$2" == "t" ]
then
task=train
elif [ "$2" == "i" ]
then
task=infer
fi
shift 2
;;
-n|--device_num)
echo "device_num"
if [ $2 -eq 1 ]
then
set_device_id $options
elif [ $2 -gt 1 ]
then
export RANK_SIZE=$2
fi
shift 2
;;
-i|--device_id)
echo "set device id"
export DEVICE_ID=$2
shift 2
;;
-c|--config)
echo "config";
configurations=$2
shift 2
;;
-o|--output)
echo "output";
output=$2
shift 2
;;
-v|--vocab)
echo "vocab";
vocab=$2
shift 2
;;
-m|--metric)
echo "metric";
metric=$2
shift 2
;;
--)
shift
break
;;
*)
shift
;;
esac
done
file_path=$(cd "$(dirname $0)" || exit; pwd)
if [ $RANK_SIZE -gt 1 ]
then
echo "Working on $RANK_SIZE device"
fi
echo "Working on file ${task}_prophetnet_$DEVICE_ID"
cd $file_path || exit
cd ../ || exit
rm -rf ./${task}_prophetnet_$DEVICE_ID
mkdir ./${task}_prophetnet_$DEVICE_ID
cp train_gradient_accumulation.py ./${task}_prophetnet_$DEVICE_ID
cp train.py ./${task}_prophetnet_$DEVICE_ID
cp eval.py ./${task}_prophetnet_$DEVICE_ID
cp -r src ./${task}_prophetnet_$DEVICE_ID
cp -r config ./${task}_prophetnet_$DEVICE_ID
cp $configurations ./${task}_prophetnet_$DEVICE_ID
if [ $vocab ]
then
cp $vocab ./${task}_prophetnet_$DEVICE_ID
fi
cd ./${task}_prophetnet_$DEVICE_ID || exit
env > log.log
echo $task
if [ "$task" == "train" ]
then
if [ $RANK_SIZE -gt 1 ]
then
mpirun -n $RANK_SIZE python train.py --config ${configurations##*/} --platform GPU >>log.log 2>&1 &
else
python train.py --config ${configurations##*/} --platform GPU
fi
elif [ "$task" == "infer" ]
then
#python eval.py --config ${configurations##*/} --output ${output} --vocab ${vocab##*/} --metric ${metric} --platform GPU >>log_infer.log 2>&1 &
python eval.py --config ${configurations##*/} --output ${output} --vocab ${vocab##*/} --metric ${metric} --platform GPU
fi
cd ../

View File

@ -0,0 +1,44 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Source of mass model."""
from .dataset import load_dataset
from .dataset import bi_data_loader
from .dataset import mono_data_loader
from .transformer import TransformerDecoder
from .transformer import TransformerEncoder
from .transformer import Transformer
from .transformer import TransformerNetworkWithLoss
from .transformer import LabelSmoothedCrossEntropyCriterion
from .transformer import TransformerTrainOneStepWithLossScaleCell
from .transformer import TransformerTraining
from .transformer import infer
from .language_model import LooseMaskedLanguageModel
from .language_model import MaskedLanguageModel
from .language_model import NoiseChannelLanguageModel
__all__ = [
"load_dataset",
"bi_data_loader",
"mono_data_loader",
"Transformer",
"infer",
"TransformerTraining",
"TransformerNetworkWithLoss",
"TransformerTrainOneStepWithLossScaleCell",
"LabelSmoothedCrossEntropyCriterion",
"LooseMaskedLanguageModel",
"MaskedLanguageModel",
"NoiseChannelLanguageModel"
]

View File

@ -0,0 +1,24 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Dataset module."""
from .bi_data_loader import BiLingualDataLoader
from .mono_data_loader import MonoLingualDataLoader
from .load_dataset import load_dataset
__all__ = [
"load_dataset",
"BiLingualDataLoader",
"MonoLingualDataLoader"
]

View File

@ -0,0 +1,111 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Dataset loader to feed into model."""
import mindspore.common.dtype as mstype
import mindspore.dataset.engine as de
import mindspore.dataset.transforms.c_transforms as deC
def _load_dataset(input_files, batch_size, epoch_count=1,
sink_mode=False, sink_step=1, rank_size=1, rank_id=0, shuffle=True):
"""
Load dataset according to passed in params.
Args:
input_files (list): Data files.
batch_size (int): Batch size.
epoch_count (int): Epoch count.
sink_mode (bool): Whether enable sink mode.
sink_step (int): Step to sink.
rank_size (int): Rank size.
rank_id (int): Rank id.
shuffle (bool): Whether shuffle dataset.
Returns:
Dataset, dataset instance.
"""
if not input_files:
raise FileNotFoundError("Require at least one dataset.")
if not isinstance(sink_mode, bool):
raise ValueError("`sink` must be type of bool.")
for datafile in input_files:
print(f" | Loading {datafile}.")
ds = de.TFRecordDataset(
input_files,
columns_list=[
"src", "src_padding",
"prev_opt", "prev_padding",
"target", "tgt_padding"
],
shuffle=shuffle, num_shards=rank_size, shard_id=rank_id,
shard_equal_rows=True, num_parallel_workers=8)
ori_dataset_size = ds.get_dataset_size()
print(f" | Dataset size: {ori_dataset_size}.")
repeat_count = epoch_count
type_cast_op = deC.TypeCast(mstype.int32)
ds = ds.map(input_columns="src", operations=type_cast_op)
ds = ds.map(input_columns="src_padding", operations=type_cast_op)
ds = ds.map(input_columns="prev_opt", operations=type_cast_op)
ds = ds.map(input_columns="prev_padding", operations=type_cast_op)
ds = ds.map(input_columns="target", operations=type_cast_op)
ds = ds.map(input_columns="tgt_padding", operations=type_cast_op)
ds = ds.rename(
input_columns=["src",
"src_padding",
"prev_opt",
"prev_padding",
"target",
"tgt_padding"],
output_columns=["source_eos_ids",
"source_eos_mask",
"target_sos_ids",
"target_sos_mask",
"target_eos_ids",
"target_eos_mask"]
)
ds = ds.batch(batch_size, drop_remainder=True)
ds = ds.repeat(repeat_count)
ds.channel_name = 'transformer'
return ds
def load_dataset(data_files: list, batch_size: int, epoch_count: int,
sink_mode: bool, sink_step: int = 1, rank_size: int = 1, rank_id: int = 0, shuffle=True):
"""
Load dataset.
Args:
data_files (list): Data files.
batch_size (int): Batch size.
epoch_count (int): Epoch count.
sink_mode (bool): Whether enable sink mode.
sink_step (int): Step to sink.
rank_size (int): Rank size.
rank_id (int): Rank id.
shuffle (bool): Whether shuffle dataset.
Returns:
Dataset, dataset instance.
"""
return _load_dataset(data_files, batch_size, epoch_count, sink_mode,
sink_step, rank_size, rank_id, shuffle=shuffle)

View File

@ -0,0 +1,24 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Define schema of mindrecord."""
SCHEMA = {
"src": {"type": "int64", "shape": [-1]},
"src_padding": {"type": "int64", "shape": [-1]},
"prev_opt": {"type": "int64", "shape": [-1]},
"prev_padding": {"type": "int64", "shape": [-1]},
"target": {"type": "int64", "shape": [-1]},
"tgt_padding": {"type": "int64", "shape": [-1]},
}

View File

@ -0,0 +1,29 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Language model."""
from .noise_channel_language_model import NoiseChannelLanguageModel
from .masked_language_model import MaskedLanguageModel
from .loose_masked_language_model import LooseMaskedLanguageModel
from .mass_language_model import MassLanguageModel
from .prophetnet_language_model import ProphetNetLanguageModel, NgramNoiseChannelLanguageModel
__all__ = [
"LooseMaskedLanguageModel",
"MassLanguageModel",
"MaskedLanguageModel",
"NoiseChannelLanguageModel",
"ProphetNetLanguageModel",
"NgramNoiseChannelLanguageModel"
]

View File

@ -0,0 +1,25 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Base language model."""
class LanguageModel:
"""Define base language model."""
def __init__(self):
pass
def emit(self, **kwargs):
raise NotImplementedError

View File

@ -0,0 +1,129 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Modified masked language model."""
import numpy as np
from src.utils import Dictionary
from .base import LanguageModel
class LooseMaskedLanguageModel(LanguageModel):
"""
Modified mask operation on sentence.
If k is assigned, a fragment of length k is masked.
Otherwise, the fragment length is determined by mask_ratio.
Args:
k (int): Length of fragment.
mask_ratio (float): Mask ratio.
"""
def __init__(self, k: int = None, mask_ratio=0.5,
mask_all_prob=None):
super(LooseMaskedLanguageModel, self).__init__()
self.mask_ratio = mask_ratio
self._k = k
self._threshold = mask_all_prob
def emit(self, sentence: np.ndarray, vocabulary: Dictionary):
"""
Mask mono source sentence.
A sample used to train the model is processed with the following steps:
encoder input (source): [x1, x2, x3, x4, x5, x6, x7, x8, </eos>]
masked encoder input: [x1, x2, x3, _, _, _, x7, x8, </eos>]
decoder input: [ -, x3, x4, x5]
| | | |
V V V V
decoder output: [x3, x4, x5, x6]
Notes:
A simple rule is applied: the source sentence starts without <BOS>
but ends with <EOS>.
Args:
vocabulary (Dictionary): Vocabulary.
sentence (np.ndarray): Raw sentence instance.
Returns:
dict, an example.
"""
# If v == 0, then u must equal 0. The interval is [u, v).
u, v = self._get_masked_interval(sentence.shape[0],
self._k, self._threshold)
encoder_input = sentence.copy()
right_shifted_sentence = np.concatenate(([vocabulary.bos_index], sentence[:-1]))
if u == 0:
_len = v - u if v - u != 0 else sentence.shape[0]
decoder_input = right_shifted_sentence[:_len]
decoder_input[0] = vocabulary.mask_index
decoder_output = sentence[:_len].copy()
else:
decoder_input = right_shifted_sentence[u - 1:v]
decoder_input[0] = vocabulary.mask_index
decoder_output = sentence[u - 1:v].copy()
if v == 0:
decoder_input[:] = vocabulary.mask_index
else:
encoder_input[np.arange(start=u, stop=v)] = vocabulary.mask_index
if u != v and u > 1:
padding = np.array([vocabulary.padding_index] * (u - 1), dtype=np.int32)
decoder_input = np.concatenate((padding, decoder_input))
decoder_output = np.concatenate((padding, decoder_output))
if decoder_input.shape[0] != decoder_output.shape[0]:
raise ValueError("seq len must equal.")
return {
"sentence_length": sentence.shape[0],
"tgt_sen_length": decoder_output.shape[0],
"encoder_input": encoder_input, # end with </eos>
"decoder_input": decoder_input,
"decoder_output": decoder_output # end with </eos>
}
def _get_masked_interval(self, length, fix_length=None,
threshold_to_mask_all=None):
"""
Generate a masked interval according to the sequence length and mask_ratio.
Args:
length (int): Sequence length.
Returns:
Tuple[int, int], [start position, end position].
"""
# Cannot be larger than the sequence length.
# Mask_length belongs to [0, length].
if fix_length is not None:
interval_length = min(length, fix_length)
else:
interval_length = min(length, round(self.mask_ratio * length))
_magic = np.random.random()
if threshold_to_mask_all is not None and _magic <= threshold_to_mask_all:
return 0, length
# If no tokens are to be masked, return (0, 0).
if interval_length == 0:
return 0, 0
# Otherwise, return the start and end positions of the masked interval.
start_pos = np.random.randint(low=0, high=length - interval_length + 1)
return start_pos, start_pos + interval_length
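
To make the loose-mask layout above concrete, here is a minimal stand-alone sketch (plain NumPy, not the repository class) that reproduces the docstring example with toy token ids; the special-token indices <bos>=1, </eos>=2, MASK=4, PAD=0 and the interval [3, 6) are assumptions chosen only for illustration.

import numpy as np

BOS, EOS, MASK, PAD = 1, 2, 4, 0
sentence = np.array([11, 12, 13, 14, 15, 16, 17, 18, EOS])  # x1..x8, </eos>
u, v = 3, 6                                                 # mask x4, x5, x6

encoder_input = sentence.copy()
encoder_input[u:v] = MASK                  # [11 12 13 4 4 4 17 18 2]

# Decoder input is the right-shifted window over the masked span, with the
# first position replaced by MASK; decoder output is the original fragment.
right_shifted = np.concatenate(([BOS], sentence[:-1]))
decoder_input = right_shifted[u - 1:v].copy()
decoder_input[0] = MASK                    # [4 13 14 15]
decoder_output = sentence[u - 1:v].copy()  # [13 14 15 16]

# Positions before the fragment are padded so lengths stay aligned.
padding = np.full(u - 1, PAD, dtype=sentence.dtype)
decoder_input = np.concatenate((padding, decoder_input))
decoder_output = np.concatenate((padding, decoder_output))
print(encoder_input, decoder_input, decoder_output, sep="\n")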

View File

@ -0,0 +1,128 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Masked language model."""
import numpy as np
from .base import LanguageModel
class MaskedLanguageModel(LanguageModel):
"""
Do mask operation on sentence.
If k is assigned, a fragment of length k is masked.
Otherwise, the fragment length is determined by mask_ratio.
Args:
k (int): Length of fragment.
mask_ratio (float): Mask ratio.
"""
def __init__(self, k: int = None, mask_ratio=0.5,
mask_all_prob=None):
super(MaskedLanguageModel, self).__init__()
self.mask_ratio = mask_ratio
self._k = k
self._threshold = mask_all_prob
def emit(self, sentence: np.ndarray, vocabulary):
"""
Mask mono source sentence.
A sample used to train the model is processed with the following steps:
encoder input (source): [x1, x2, x3, x4, x5, x6, x7, x8, </eos>]
masked encoder input: [x1, x2, _, _, _, x6, x7, x8, </eos>]
decoder input: [ _, x3, x4]
| | |
V V V
decoder output: [ x3, x4, x5]
Notes:
A simple rule is applied: the source sentence starts without <BOS>
but ends with <EOS>.
Args:
vocabulary (Dictionary): Vocabulary.
sentence (np.ndarray): Raw sentence instance.
Returns:
dict, an example.
"""
encoder_input = sentence.copy()
seq_len = encoder_input.shape[0]
# If v == 0, then u must equal 0. The interval is [u, v).
u, v = self._get_masked_interval(len(encoder_input),
self._k, self._threshold)
if u == 0:
_len = v - u if v - u != 0 else seq_len
decoder_input = np.array([vocabulary.mask_index] * _len, dtype=np.int32)
decoder_input[1:] = encoder_input[:_len - 1].copy()
else:
decoder_input = np.array([vocabulary.mask_index] * (v - u), dtype=np.int32)
decoder_input[1:] = encoder_input[u:v - 1].copy()
if v == 0:
decoder_output = encoder_input.copy()
encoder_input[:] = vocabulary.mask_index
else:
decoder_output = encoder_input[u:v].copy()
encoder_input[np.arange(start=u, stop=v)] = vocabulary.mask_index
if u != v and u > 0:
padding = np.array([vocabulary.padding_index] * u, dtype=np.int32)
decoder_input = np.concatenate((padding, decoder_input))
decoder_output = np.concatenate((padding, decoder_output))
assert decoder_input.shape[0] == decoder_output.shape[0], "seq len must equal."
return {
"sentence_length": seq_len,
"tgt_sen_length": decoder_output.shape[0],
"encoder_input": encoder_input, # end with </eos>
"decoder_input": decoder_input,
"decoder_output": decoder_output # end with </eos>
}
def _get_masked_interval(self, length, fix_length=None,
threshold_to_mask_all=None):
"""
Generate a masked interval according to the sequence length and mask_ratio.
Args:
length (int): Sequence length.
Returns:
Tuple[int, int], [start position, end position].
"""
# Cannot be larger than the sequence length.
# Mask_length belongs to [0, length].
if fix_length is not None:
interval_length = min(length, fix_length)
else:
interval_length = min(length, round(self.mask_ratio * length))
_magic = np.random.random()
if threshold_to_mask_all is not None and _magic <= threshold_to_mask_all:
return 0, length
# If no tokens are to be masked, return (0, 0).
if interval_length == 0:
return 0, 0
# Otherwise, return the start and end positions of the masked interval.
start_pos = np.random.randint(low=0, high=length - interval_length + 1)
return start_pos, start_pos + interval_length
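
The interval selection shared by both masked language models can also be looked at on its own; below is a stand-alone sketch (a re-implementation with assumed parameter values, not the repository helper) of how the [start, end) span is drawn from the sentence length and mask_ratio.

import numpy as np

def sample_interval(length, mask_ratio=0.5, fix_length=None, mask_all_prob=None):
    """Draw a [start, end) interval covering round(mask_ratio * length) tokens."""
    interval = fix_length if fix_length is not None else round(mask_ratio * length)
    interval = min(length, interval)
    if mask_all_prob is not None and np.random.random() <= mask_all_prob:
        return 0, length                  # occasionally mask the whole sentence
    if interval == 0:
        return 0, 0                       # nothing to mask
    start = np.random.randint(0, length - interval + 1)
    return start, start + interval

np.random.seed(0)
print([sample_interval(10) for _ in range(3)])  # three random spans of 5 tokens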

View File

@ -0,0 +1,72 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Noise channel language model."""
import numpy as np
from .base import LanguageModel
class NoiseChannelLanguageModel(LanguageModel):
"""Do mask on bilingual data."""
def __init__(self, add_noise_prob: float = 0.1):
super(NoiseChannelLanguageModel, self).__init__()
self._noisy_prob = add_noise_prob
def emit(self, sentence: np.ndarray, target: np.ndarray,
mask_symbol_idx: int,
bos_symbol_idx: int):
"""
Add noise to sentence randomly.
For example, given a sentence pair:
source sentence: [x1, x2, x3, x4, x5, x6, </eos>]
target sentence: [y1, y2, y3, y4, </eos>]
After random masking, the data looks like:
encoder input (source): [x1, x2, _, x4, x5, _, </eos>]
decoder input: [<bos>, y1, y2, y3, y4]
| | | | |
V V V V V
decoder output: [ y1, y2, y3, y4, </eos>]
Args:
sentence (np.ndarray): Raw sentence.
target (np.ndarray): Target output (prediction).
mask_symbol_idx (int): Index of MASK symbol.
bos_symbol_idx (int): Index of bos symbol.
Returns:
dict, an example.
"""
encoder_input = sentence.copy()
tgt_seq_len = target.shape[0]
if self._noisy_prob > 0:
for i, _ in enumerate(encoder_input):
_prob = np.random.random()
if _prob < self._noisy_prob:
encoder_input[i] = mask_symbol_idx
decoder_input = np.empty(shape=tgt_seq_len, dtype=np.int64)
decoder_input[1:] = target[:-1]
decoder_input[0] = bos_symbol_idx
return {
"sentence_length": encoder_input.shape[0],
"tgt_sen_length": tgt_seq_len,
"encoder_input": encoder_input, # end with </eos>
"decoder_input": decoder_input, # start with <bos>
"decoder_output": target # end with </eos>
}
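
As a quick illustration of the noise-channel preparation above, here is a toy NumPy sketch (the token ids and the indices MASK=4, <bos>=1, </eos>=2 are made up for the example): the source is randomly masked, the target is shifted right to build the decoder input, and the untouched target is the decoder output.

import numpy as np

MASK, BOS, EOS = 4, 1, 2
source = np.array([11, 12, 13, 14, 15, 16, EOS])
target = np.array([21, 22, 23, 24, EOS])

noisy_prob = 0.1
noise = np.random.random(source.shape[0]) < noisy_prob
encoder_input = np.where(noise, MASK, source)          # ~10% of tokens masked

decoder_input = np.concatenate(([BOS], target[:-1]))   # starts with <bos>
decoder_output = target                                # ends with </eos>
print(encoder_input, decoder_input, decoder_output, sep="\n")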

View File

@ -0,0 +1,37 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Transformer model module."""
from .transformer import Transformer
from .encoder import TransformerEncoder
from .decoder import TransformerDecoder
from .beam_search import BeamSearchDecoder
from .transformer_for_train import TransformerTraining, LabelSmoothedCrossEntropyCriterion, \
TransformerNetworkWithLoss, TransformerTrainOneStepWithLossScaleCell, \
TransformerTrainAccumulateStepsWithLossScaleCell
from .infer_mass import infer, infer_ppl
__all__ = [
"infer",
"infer_ppl",
"TransformerTraining",
"LabelSmoothedCrossEntropyCriterion",
"TransformerTrainOneStepWithLossScaleCell",
"TransformerTrainAccumulateStepsWithLossScaleCell",
"TransformerNetworkWithLoss",
"Transformer",
"TransformerEncoder",
"TransformerDecoder",
"BeamSearchDecoder"
]

View File

@ -0,0 +1,81 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Embedding."""
import numpy as np
import mindspore.common.dtype as mstype
from mindspore import nn
from mindspore.ops import operations as P
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter
class EmbeddingLookup(nn.Cell):
"""Embeddings lookup table with a fixed dictionary and size."""
def __init__(self,
vocab_size,
embed_dim,
use_one_hot_embeddings=False):
"""
Embeddings lookup table with a fixed dictionary and size.
Args:
vocab_size (int): Size of the dictionary of embeddings.
embed_dim (int): The size of word embedding.
use_one_hot_embeddings (bool): Whether use one-hot embedding. Default: False.
"""
super(EmbeddingLookup, self).__init__()
self.embedding_dim = embed_dim
self.vocab_size = vocab_size
self.use_one_hot_embeddings = use_one_hot_embeddings
init_weight = np.random.normal(0, embed_dim ** -0.5, size=[vocab_size, embed_dim]).astype(np.float32)
# Index 0 is the padding index, so its embedding row is initialized to zeros.
init_weight[0, :] = 0
self.embedding_table = Parameter(Tensor(init_weight),
name='embedding_table')
self.expand = P.ExpandDims()
self.gather = P.GatherV2()
self.one_hot = P.OneHot()
self.on_value = Tensor(1.0, mstype.float32)
self.off_value = Tensor(0.0, mstype.float32)
self.array_mul = P.MatMul()
self.reshape = P.Reshape()
self.get_shape = P.Shape()
def construct(self, input_ids):
"""
Construct network.
Args:
input_ids (Tensor): A batch of sentences with shape (N, T).
Returns:
Tensor, word embeddings with shape (N, T, D)
"""
_shape = self.get_shape(input_ids) # (N, T).
_batch_size = _shape[0]
_max_len = _shape[1]
flat_ids = self.reshape(input_ids, (_batch_size * _max_len,))
if self.use_one_hot_embeddings:
one_hot_ids = self.one_hot(flat_ids, self.vocab_size, self.on_value, self.off_value)
output_for_reshape = self.array_mul(
one_hot_ids, self.embedding_table)
else:
output_for_reshape = self.gather(self.embedding_table, flat_ids, 0)
output = self.reshape(output_for_reshape, (_batch_size, _max_len, self.embedding_dim))
return output, self.embedding_table
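
The lookup itself is framework-independent; the short NumPy sketch below (toy shapes, no MindSpore) shows the flatten-gather-reshape flow and checks that the one-hot matmul path gives the same rows as a direct gather.

import numpy as np

vocab_size, embed_dim = 8, 4
table = np.random.normal(0, embed_dim ** -0.5, (vocab_size, embed_dim)).astype(np.float32)
table[0, :] = 0                                 # padding row stays zero

input_ids = np.array([[3, 0, 5], [1, 2, 0]])    # (N, T) = (2, 3)
flat_ids = input_ids.reshape(-1)

gathered = table[flat_ids]                      # gather path
one_hot = np.eye(vocab_size, dtype=np.float32)[flat_ids]
via_matmul = one_hot @ table                    # one-hot path

assert np.allclose(gathered, via_matmul)
print(gathered.reshape(2, 3, embed_dim).shape)  # (2, 3, 4)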

View File

@ -0,0 +1,67 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Gradient clip."""
import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.ops import composite as C
GRADIENT_CLIP_TYPE = 1
GRADIENT_CLIP_VALUE = 8
class ClipGradients(nn.Cell):
"""
Clip gradients.
Returns:
List, a list of clipped_grad tuples.
"""
def __init__(self):
super(ClipGradients, self).__init__()
self.clip_by_norm = nn.ClipByNorm()
self.cast = P.Cast()
self.dtype = P.DType()
def construct(self,
grads,
clip_type,
clip_value):
"""
Construct gradient clip network.
Args:
grads (list): List of gradient tuples.
clip_type (Tensor): The way to clip: 0 for clip-by-value, 1 for clip-by-norm.
clip_value (Tensor): Specifies how much to clip.
Returns:
List, a list of clipped_grad tuples.
"""
if clip_type != 0 and clip_type != 1: # pylint: disable=R1714
return grads
new_grads = ()
for grad in grads:
dt = self.dtype(grad)
if clip_type == 0:
t = C.clip_by_value(grad, self.cast(F.tuple_to_array((-clip_value,)), dt),
self.cast(F.tuple_to_array((clip_value,)), dt))
else:
t = self.clip_by_norm(grad, self.cast(F.tuple_to_array((clip_value,)), dt))
new_grads = new_grads + (t,)
return new_grads
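
For reference, a toy NumPy sketch of the two clipping modes: clip_type 0 clips each element to [-8, 8], while clip_type 1 rescales the whole gradient so its L2 norm is at most 8 (matching GRADIENT_CLIP_VALUE). It mirrors the usual definitions rather than calling the MindSpore ops.

import numpy as np

clip_value = 8.0
grad = np.array([3.0, -12.0, 20.0, -1.0])

by_value = np.clip(grad, -clip_value, clip_value)   # [ 3. -8.  8. -1.]

norm = np.linalg.norm(grad)
by_norm = grad * (clip_value / norm) if norm > clip_value else grad

print(by_value)
print(by_norm)  # same direction, L2 norm scaled down to 8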

View File

@ -0,0 +1,288 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Infer api."""
import time
import mindspore.nn as nn
import mindspore.common.dtype as mstype
from mindspore.ops import operations as P
from mindspore.common.tensor import Tensor
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore import context
from src.dataset import load_dataset
from .transformer_for_infer import TransformerInferModel
from .transformer_for_train import TransformerTraining
from ..utils.load_weights import load_infer_weights
context.set_context(
mode=context.GRAPH_MODE,
#mode=context.PYNATIVE_MODE,
save_graphs=False,
device_target="GPU",
reserve_class_name_in_scope=False)
class TransformerInferCell(nn.Cell):
"""
Encapsulation class of transformer network infer.
Args:
network (nn.Cell): Transformer model.
Returns:
Tuple[Tensor, Tensor], predicted_ids and predicted_probs.
"""
def __init__(self, network):
super(TransformerInferCell, self).__init__(auto_prefix=False)
self.network = network
def construct(self,
source_ids,
source_mask):
"""Defines the computation performed."""
predicted_ids, predicted_probs = self.network(source_ids,
source_mask)
return predicted_ids, predicted_probs
def transformer_infer(config, dataset):
"""
Run infer with Transformer.
Args:
config (TransformerConfig): Config.
dataset (Dataset): Dataset.
Returns:
List[Dict], prediction, each example has 4 keys, "source",
"target", "prediction" and "prediction_prob".
"""
tfm_model = TransformerInferModel(config=config, use_one_hot_embeddings=False)
tfm_model.init_parameters_data()
params = tfm_model.trainable_params()
weights = load_infer_weights(config)
for param in params:
value = param.data
name = param.name
if name not in weights:
raise ValueError(f"{name} is not found in weights.")
with open("weight_after_deal.txt", "a+") as f:
weights_name = name
f.write(weights_name + "\n")
if isinstance(value, Tensor):
print(name, value.asnumpy().shape)
if weights_name in weights:
assert weights_name in weights
param.set_data(Tensor(weights[weights_name], mstype.float32))
else:
raise ValueError(f"{weights_name} is not found in checkpoint.")
else:
raise TypeError(f"Type of {weights_name} is not Tensor.")
print(" | Load weights successfully.")
tfm_infer = TransformerInferCell(tfm_model)
model = Model(tfm_infer)
predictions = []
probs = []
source_sentences = []
target_sentences = []
for batch in dataset.create_dict_iterator():
source_sentences.append(batch["source_eos_ids"])
target_sentences.append(batch["target_eos_ids"])
source_ids = Tensor(batch["source_eos_ids"], mstype.int32)
source_mask = Tensor(batch["source_eos_mask"], mstype.int32)
start_time = time.time()
predicted_ids, entire_probs = model.predict(source_ids, source_mask)
print(f" | Batch size: {config.batch_size}, "
f"Time cost: {time.time() - start_time}.")
predictions.append(predicted_ids.asnumpy())
probs.append(entire_probs.asnumpy())
output = []
for inputs, ref, batch_out, batch_probs in zip(source_sentences,
target_sentences,
predictions,
probs):
for i in range(config.batch_size):
if batch_out.ndim == 3:
batch_out = batch_out[:, 0]
example = {
"source": inputs[i].asnumpy().tolist(),
"target": ref[i].asnumpy().tolist(),
"prediction": batch_out[i].tolist(),
"prediction_prob": batch_probs[i].tolist()
}
output.append(example)
return output
def infer(config):
"""
Transformer infer api.
Args:
config (TransformerConfig): Config.
Returns:
List[Dict], prediction results.
"""
eval_dataset = load_dataset(data_files=config.test_dataset,
batch_size=config.batch_size,
epoch_count=1,
sink_mode=config.dataset_sink_mode,
shuffle=False) if config.test_dataset else None
prediction = transformer_infer(config, eval_dataset)
return prediction
class TransformerInferPPLCell(nn.Cell):
"""
Encapsulation class of transformer network infer for PPL.
Args:
config(TransformerConfig): Config.
Returns:
Tuple[Tensor, Tensor], predicted log prob and label lengths.
"""
def __init__(self, config):
super(TransformerInferPPLCell, self).__init__()
self.transformer = TransformerTraining(config, is_training=False, use_one_hot_embeddings=False)
self.batch_size = config.batch_size
self.vocab_size = config.vocab_size
self.one_hot = P.OneHot()
self.on_value = Tensor(float(1), mstype.float32)
self.off_value = Tensor(float(0), mstype.float32)
self.reduce_sum = P.ReduceSum()
self.reshape = P.Reshape()
self.cast = P.Cast()
self.flat_shape = (config.batch_size * config.seq_length,)
self.batch_shape = (config.batch_size, config.seq_length)
self.last_idx = (-1,)
def construct(self,
source_ids,
source_mask,
target_ids,
target_mask,
label_ids,
label_mask):
"""Defines the computation performed."""
predicted_log_probs = self.transformer(source_ids, source_mask, target_ids, target_mask)
label_ids = self.reshape(label_ids, self.flat_shape)
label_mask = self.cast(label_mask, mstype.float32)
one_hot_labels = self.one_hot(label_ids, self.vocab_size, self.on_value, self.off_value)
label_log_probs = self.reduce_sum(predicted_log_probs * one_hot_labels, self.last_idx)
label_log_probs = self.reshape(label_log_probs, self.batch_shape)
log_probs = label_log_probs * label_mask
lengths = self.reduce_sum(label_mask, self.last_idx)
return log_probs, lengths
def transformer_infer_ppl(config, dataset):
"""
Run infer with Transformer for PPL.
Args:
config (TransformerConfig): Config.
dataset (Dataset): Dataset.
Returns:
List[Dict], prediction, each example has 4 keys, "source",
"target", "log_prob" and "length".
"""
tfm_infer = TransformerInferPPLCell(config=config)
tfm_infer.init_parameters_data()
parameter_dict = load_checkpoint(config.existed_ckpt)
load_param_into_net(tfm_infer, parameter_dict)
model = Model(tfm_infer)
log_probs = []
lengths = []
source_sentences = []
target_sentences = []
for batch in dataset.create_dict_iterator():
source_sentences.append(batch["source_eos_ids"])
target_sentences.append(batch["target_eos_ids"])
source_ids = Tensor(batch["source_eos_ids"], mstype.int32)
source_mask = Tensor(batch["source_eos_mask"], mstype.int32)
target_ids = Tensor(batch["target_sos_ids"], mstype.int32)
target_mask = Tensor(batch["target_sos_mask"], mstype.int32)
label_ids = Tensor(batch["target_eos_ids"], mstype.int32)
label_mask = Tensor(batch["target_eos_mask"], mstype.int32)
start_time = time.time()
log_prob, length = model.predict(source_ids, source_mask, target_ids, target_mask, label_ids, label_mask)
print(f" | Batch size: {config.batch_size}, "
f"Time cost: {time.time() - start_time}.")
log_probs.append(log_prob.asnumpy())
lengths.append(length.asnumpy())
output = []
for inputs, ref, log_prob, length in zip(source_sentences,
target_sentences,
log_probs,
lengths):
for i in range(config.batch_size):
example = {
"source": inputs[i].tolist(),
"target": ref[i].tolist(),
"log_prob": log_prob[i].tolist(),
"length": length[i]
}
output.append(example)
return output
def infer_ppl(config):
"""
Transformer infer PPL api.
Args:
config (TransformerConfig): Config.
Returns:
List[Dict], prediction results with per-token log probabilities and lengths.
"""
eval_dataset = load_dataset(data_files=config.test_dataset,
batch_size=config.batch_size,
epoch_count=1,
sink_mode=config.dataset_sink_mode,
shuffle=False) if config.test_dataset else None
prediction = transformer_infer_ppl(config, eval_dataset)
return prediction
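
The log-prob extraction inside TransformerInferPPLCell.construct can be followed in plain NumPy; the sketch below (random logits, made-up labels and mask) selects the reference-token log-probabilities with a one-hot product and zeroes out padded positions, just as the cell does before the PPL metric is computed.

import numpy as np

batch, seq_len, vocab = 1, 4, 6
logits = np.random.randn(batch * seq_len, vocab)
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax

label_ids = np.array([3, 1, 2, 0])             # flattened (batch * seq_len,)
label_mask = np.array([[1., 1., 1., 0.]])      # last position is padding

one_hot = np.eye(vocab)[label_ids]
label_log_probs = (log_probs * one_hot).sum(-1).reshape(batch, seq_len)
masked_log_probs = label_log_probs * label_mask
lengths = label_mask.sum(-1)
print(masked_log_probs, lengths, sep="\n")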

View File

@ -0,0 +1,82 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Positional Embedding."""
import numpy as np
from mindspore import nn
from mindspore import Tensor
import mindspore.common.dtype as mstype
from mindspore.ops import operations as P
def position_encoding(length, depth,
min_timescale=1,
max_timescale=1e4):
"""
Create Tensor of sinusoids of different frequencies.
Args:
length (int): Length of the Tensor to create, i.e. Number of steps.
depth (int): Dimensions of embedding.
min_timescale (float): Minimum time scale.
max_timescale (float): Maximum time scale.
Returns:
np.ndarray, position encoding table of shape (T, D).
"""
depth = depth // 2
positions = np.arange(length, dtype=np.float32)
log_timescale_increment = (np.log(max_timescale / min_timescale) / (depth - 1))
inv_timescales = min_timescale * np.exp(
np.arange(depth, dtype=np.float32) * -log_timescale_increment)
scaled_time = np.expand_dims(positions, 1) * np.expand_dims(inv_timescales, 0)
# Instead of interleaving SIN and COS, applying SIN to the first half of the
# depth and COS to the second half is equivalent (up to a permutation of
# dimensions), since both halves use the same positions.
x = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
return x
class PositionalEmbedding(nn.Cell):
"""
Add positional info to word embeddings.
Args:
embedding_size (int): Size of word embedding.
max_position_embeddings (int): Maximum step in this model.
Returns:
Tensor, shape of (N, T, D).
"""
def __init__(self,
embedding_size,
max_position_embeddings=512):
super(PositionalEmbedding, self).__init__()
self.add = P.TensorAdd()
self.expand_dims = P.ExpandDims()
self.position_embedding_table = Tensor(
position_encoding(max_position_embeddings, embedding_size),
mstype.float32
)
self.gather = P.GatherV2()
self.get_shape = P.Shape()
def construct(self, word_embeddings):
input_shape = self.get_shape(word_embeddings)
input_len = input_shape[1]
position_embeddings = self.position_embedding_table[0:input_len:1, ::]
position_embeddings = self.expand_dims(position_embeddings, 0)
output = self.add(word_embeddings, position_embeddings)
return output
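
The table built by position_encoding can be reproduced in a few lines; this stand-alone sketch (length 4 and depth 6 chosen arbitrarily) follows the same formula and shows the sin half followed by the cos half along the depth axis.

import numpy as np

length, depth = 4, 6
half = depth // 2
positions = np.arange(length, dtype=np.float32)
log_timescale_increment = np.log(1e4 / 1.0) / (half - 1)
inv_timescales = np.exp(np.arange(half, dtype=np.float32) * -log_timescale_increment)
scaled_time = positions[:, None] * inv_timescales[None, :]
table = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
print(table.shape)  # (4, 6): columns 0-2 are sin terms, columns 3-5 are cos terms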

View File

@ -0,0 +1,49 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Residual block."""
import mindspore.nn as nn
from mindspore.ops import operations as P
class ResidualConnection(nn.Cell):
"""
Add residual to output.
Args:
dropout_prob (float): Dropout rate.
Returns:
Tensor, with the same shape as hidden_tensor.
"""
def __init__(self, dropout_prob=0.1):
super(ResidualConnection, self).__init__()
self.add = P.TensorAdd()
self.dropout = nn.Dropout(1 - dropout_prob)
def construct(self, hidden_tensor, residual):
"""
Construct network.
Args:
hidden_tensor (Tensor): Hidden tensor.
residual (Tensor): Input tensor.
Returns:
Tensor, which has the same shape with hidden_tensor and residual.
"""
output = self.dropout(hidden_tensor)
output = self.add(output, residual)
return output

View File

@ -0,0 +1,37 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Utils for mass model."""
from .dictionary import Dictionary
from .ppl_score import ngram_ppl
from .lr_scheduler import square_root_schedule
from .loss_monitor import LossCallBack
from .byte_pair_encoding import bpe_encode
from .initializer import zero_weight, one_weight, normal_weight, weight_variable
from .rouge_score import rouge
from .eval_score import get_score
__all__ = [
"Dictionary",
"rouge",
"bpe_encode",
"ngram_ppl",
"square_root_schedule",
"LossCallBack",
"one_weight",
"zero_weight",
"normal_weight",
"weight_variable",
"get_score"
]

View File

@ -0,0 +1,52 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""BPE."""
import os
import subprocess
ENCODER = "subword-nmt apply-bpe -c"
LEARN_DICT = "subword-nmt get-vocab -i"
def bpe_encode(codes_path, src_path, output_path, dict_path):
"""
Apply BPE encoding and build a vocabulary from the encoded output.
Args:
codes_path (str): BPE codes file.
src_path (str): Source text file path.
output_path (str): Output path.
dict_path (str): Dict path.
"""
if not (os.path.isabs(codes_path)
and os.path.isabs(src_path)
and os.path.isabs(output_path)
and os.path.isabs(dict_path)):
raise ValueError("Absolute path is required.")
if not (os.path.exists(os.path.dirname(codes_path))
and os.path.exists(os.path.dirname(src_path))
and os.path.exists(os.path.dirname(output_path))
and os.path.exists(os.path.dirname(dict_path))):
raise FileNotFoundError("Dir not found.")
# Encoding.
print(" | Applying BPE encoding.")
commands = ENCODER.split() + [codes_path] + ["-i"] + [src_path] + ["-o"] + [output_path]
subprocess.call(commands)
print(" | Fetching vocabulary from single file.")
# Learn vocab.
commands = LEARN_DICT.split() + [output_path] + ["-o"] + [dict_path]
subprocess.call(commands)

View File

@ -0,0 +1,92 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Get score by given metric."""
from .ppl_score import ngram_ppl
from .rouge_score import rouge
def get_ppl_score(result):
"""
Calculate Perplexity(PPL) score.
Args:
result (List[Dict]): Prediction, where each example has 4 keys: "source",
"target", "log_prob" and "length".
Returns:
Float, ppl score.
"""
log_probs = []
total_length = 0
for sample in result:
log_prob = sample['log_prob']
length = sample['length']
log_probs.extend(log_prob)
total_length += length
print(f" | log_prob:{log_prob}")
print(f" | length:{length}")
ppl = ngram_ppl(log_probs, total_length, log_softmax=True)
print(f" | final PPL={ppl}.")
return ppl
def get_rouge_score(result, vocab):
"""
Calculate ROUGE score.
Args:
result (List[Dict]): Prediction, where each example has 4 keys: "source",
"target", "prediction" and "prediction_prob".
vocab (Dictionary): Vocabulary instance.
Returns:
str, ROUGE score.
"""
predictions = []
targets = []
for sample in result:
predictions.append(' '.join([vocab[t] for t in sample['prediction']]))
targets.append(' '.join([vocab[t] for t in sample['target']]))
print(f" | source: {' '.join([vocab[t] for t in sample['source']])}")
print(f" | target: {targets[-1]}")
return rouge(predictions, targets)
def get_score(result, vocab=None, metric='rouge'):
"""
Get eval score.
Args:
result (List[Dict]): Prediction.
vocab (Dictionary): Vocabulary instance.
metric (str): Metric to use, 'rouge' or 'ppl'. Default: 'rouge'.
Returns:
str or float, evaluation score.
"""
score = None
if metric == 'rouge':
score = get_rouge_score(result, vocab)
elif metric == 'ppl':
score = get_ppl_score(result)
else:
print(f" |metric not in (rouge, ppl)")
return score
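
As a worked example of the perplexity path, the sketch below aggregates per-sample log-probabilities and lengths the way get_ppl_score collects them and applies the standard corpus-level definition PPL = exp(-(sum of log-probs) / total length); the exact internals of ngram_ppl are not reproduced here, so treat the numbers as illustrative only.

import math

result = [
    {"log_prob": [-2.1, -0.7, -1.3], "length": 3},
    {"log_prob": [-0.9, -1.8], "length": 2},
]
total_log_prob = sum(sum(sample["log_prob"]) for sample in result)
total_length = sum(sample["length"] for sample in result)
ppl = math.exp(-total_log_prob / total_length)   # exp(6.8 / 5)
print(f"PPL = {ppl:.3f}")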

View File

@ -0,0 +1,108 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Initializer."""
import math
import numpy as np
from mindspore import Tensor
def _compute_fans(shape):
"""
Computes the number of input and output units for a weight shape.
Args:
shape (tuple): Integer shape tuple.
Returns:
tuple, integer scalars (fan_in, fan_out).
"""
if not shape:
fan_in = fan_out = 1
elif len(shape) == 1:
fan_in = fan_out = shape[0]
elif len(shape) == 2:
fan_in = shape[0]
fan_out = shape[1]
else:
# Assuming convolution kernels (2D, 3D, or more).
# kernel shape: (..., input_depth, depth)
receptive_field_size = 1
for dim in shape[:-2]:
receptive_field_size *= dim
fan_in = shape[-2] * receptive_field_size
fan_out = shape[-1] * receptive_field_size
return int(fan_in), int(fan_out)
def weight_variable(shape):
"""
Generate weight var.
Args:
shape (tuple): Shape.
Returns:
Tensor, var.
"""
scale_shape = shape
fan_in, fan_out = _compute_fans(scale_shape)
scale = 1.0 / max(1., (fan_in + fan_out) / 2.)
limit = math.sqrt(3.0 * scale)
values = np.random.uniform(-limit, limit, shape).astype(np.float32)
return Tensor(values)
def one_weight(shape):
"""
Generate weight with ones.
Args:
shape (tuple): Shape.
Returns:
Tensor, var.
"""
ones = np.ones(shape).astype(np.float32)
return Tensor(ones)
def zero_weight(shape):
"""
Generate weight with zeros.
Args:
shape (tuple): Shape.
Returns:
Tensor, var.
"""
zeros = np.zeros(shape).astype(np.float32)
return Tensor(zeros)
def normal_weight(shape, num_units):
"""
Generate weight with normal dist.
Args:
shape (tuple): Shape.
num_units (int): Dimension.
Returns:
Tensor, var.
"""
norm = np.random.normal(0.0, num_units ** -0.5, shape).astype(np.float32)
return Tensor(norm)
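
To show what weight_variable actually produces, here is a small worked sketch (shape (512, 2048) chosen as an example) of the scaled-uniform bound it uses: scale = 1 / max(1, (fan_in + fan_out) / 2) and limit = sqrt(3 * scale), a Glorot/Xavier-style rule.

import math
import numpy as np

shape = (512, 2048)
fan_in, fan_out = shape
scale = 1.0 / max(1.0, (fan_in + fan_out) / 2.0)
limit = math.sqrt(3.0 * scale)              # ~0.0484 for (512, 2048)
values = np.random.uniform(-limit, limit, shape).astype(np.float32)
print(limit, values.std())                  # empirical std is close to limit / sqrt(3)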

View File

@ -0,0 +1,68 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Loss monitor."""
import time
from mindspore.train.callback import Callback
from config import TransformerConfig
class LossCallBack(Callback):
"""
Monitor the loss in training.
If the loss is NAN or INF, training is terminated.
Note:
If per_print_times is 0, the loss is not printed.
Args:
per_print_times (int): Print the loss every `per_print_times` steps. Default: 1.
"""
time_stamp_init = False
time_stamp_first = 0
def __init__(self, config: TransformerConfig, per_print_times: int = 1):
super(LossCallBack, self).__init__()
if not isinstance(per_print_times, int) or per_print_times < 0:
raise ValueError("print_step must be int and >= 0.")
self.config = config
self._per_print_times = per_print_times
if not self.time_stamp_init:
self.time_stamp_first = self._get_ms_timestamp()
self.time_stamp_init = True
def step_end(self, run_context):
cb_params = run_context.original_args()
file_name = "./loss.log"
with open(file_name, "a+") as f:
time_stamp_current = self._get_ms_timestamp()
is_accu_step = cb_params.net_outputs[3]
accu_length = cb_params.net_outputs[4]
# Only update at non-accumulation steps
if not is_accu_step:
f.write("time: {}, epoch: {}, step: {}, outputs are {},{},{}.\n".format(
time_stamp_current - self.time_stamp_first,
cb_params.cur_epoch_num,
cb_params.cur_step_num // accu_length,
str(cb_params.net_outputs[0].asnumpy()),
str(cb_params.net_outputs[1].asnumpy()),
str(cb_params.net_outputs[2].asnumpy())
))
@staticmethod
def _get_ms_timestamp():
t = time.time()
return int(round(t * 1000))
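# Attachment sketch (illustrative; `model`, `dataset`, the config path and the
# `from_json_file` loader are assumptions, not names verified in this repository):
#
#   cfg = TransformerConfig.from_json_file("config/config.json")
#   loss_monitor = LossCallBack(cfg, per_print_times=1)
#   model.train(epoch=cfg.epochs, train_dataset=dataset,
#               callbacks=[loss_monitor], dataset_sink_mode=False)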

View File

@ -0,0 +1,140 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Learning scheduler."""
from math import ceil
import numpy as np
import mindspore.nn.learning_rate_schedule as lr_schedules
def square_root_schedule(lr, update_num, decay_start_step,
warmup_steps=2000,
min_lr=1e-7):
"""
    Decay the learning rate based on the inverse square root (ISR) schedule.
    During warm-up:
        lrs = np.linspace(warmup_init_lr, lr, warmup_steps)
    After warm-up:
        decay_factor = lr * sqrt(warmup_steps)
        lr = decay_factor / sqrt(step) if step >= decay_start_step else lr
Args:
lr (float): Init learning rate.
update_num (int): Total steps.
decay_start_step (int): Decay begins after `decay_start_step` steps.
warmup_steps (int): Warm up steps.
min_lr (float): Min learning rate.
Returns:
np.ndarray, learning rate array.
"""
warmup_end_lr = lr
warmup_init_lr = 1e-7 if warmup_steps > 0 else warmup_end_lr
# If warmup_init_lr > lr, then lr_step is negative.
# Otherwise, it's positive.
lr_step = (warmup_end_lr - warmup_init_lr) / warmup_steps
decay_factor = lr * warmup_steps ** 0.5
lrs = np.empty(shape=update_num, dtype=np.float32)
_start_step = 0
if 0 < warmup_steps < update_num:
lrs[:warmup_steps] = np.linspace(warmup_init_lr, warmup_end_lr, warmup_steps)
_start_step = warmup_steps
for step in range(_start_step, update_num):
if step < warmup_steps:
_lr = warmup_init_lr + step * lr_step
elif step < decay_start_step:
_lr = lr
else:
_lr = decay_factor * step ** -0.5
if _lr < min_lr:
_lr = min_lr
lrs[step] = _lr
return lrs
def polynomial_decay_scheduler(lr, min_lr, decay_steps, total_update_num, warmup_steps=1000, power=1.0):
"""
    Implementation of a polynomial-decay learning rate scheduler that cycles by default.
    Args:
        lr (float): Initial learning rate.
        warmup_steps (int): Warmup steps.
        decay_steps (int): Decay steps.
        total_update_num (int): Total update steps.
        min_lr (float): Min learning rate.
power (float): Power factor.
Returns:
np.ndarray, learning rate of each step.
"""
lrs = np.zeros(shape=total_update_num, dtype=np.float32)
if decay_steps <= 0:
raise ValueError("`decay_steps` must larger than 1.")
_start_step = 0
if 0 < warmup_steps < total_update_num:
warmup_end_lr = lr
warmup_init_lr = 0 if warmup_steps > 0 else warmup_end_lr
lrs[:warmup_steps] = np.linspace(warmup_init_lr, warmup_end_lr, warmup_steps)
_start_step = warmup_steps
    for step in range(_start_step, total_update_num):
        _step = step - _start_step
        # Cycling: enlarge the decay window by an integer factor once the
        # current step passes the end of the previous window.
        ratio = ceil(_step / decay_steps)
        ratio = 1 if ratio < 1 else ratio
        _decay_steps = decay_steps * ratio
lrs[step] = (lr - min_lr) * pow(1 - _step / _decay_steps, power) + min_lr
return lrs
class BertLearningRate(lr_schedules.LearningRateSchedule):
"""
    Implementation of a warmup + polynomial-decay learning rate scheduler.
Args:
learning_rate (float): The initial value of learning rate.
end_learning_rate (float): The end value of learning rate.
warmup_steps (int): The warm up steps of learning rate.
decay_steps (int): A value used to calculate decayed learning rate.
power (float): A value used to calculate decayed learning rate.
Returns:
Tensor. The learning rate value for the current step.
"""
def __init__(self, learning_rate, end_learning_rate, warmup_steps, decay_steps, power):
super(BertLearningRate, self).__init__()
self.warmup_lr = lr_schedules.WarmUpLR(learning_rate, warmup_steps)
self.decay_lr = lr_schedules.PolynomialDecayLR(learning_rate, end_learning_rate, decay_steps, power)
self.warmup_steps = Tensor(np.array([warmup_steps]).astype(np.float32))
self.greater = P.Greater()
self.one = Tensor(np.array([1.0]).astype(np.float32))
self.cast = P.Cast()
def construct(self, global_step):
is_warmup = self.cast(self.greater(self.warmup_steps, global_step), mstype.float32)
warmup_lr = self.warmup_lr(global_step)
decay_lr = self.decay_lr(global_step)
lr = (self.one - is_warmup) * decay_lr + is_warmup * warmup_lr
return lr
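# A small, self-contained sketch (hyper-parameters below are illustrative, not
# values taken from this repository): build both numpy schedules and inspect them.
if __name__ == "__main__":
    isr = square_root_schedule(lr=2e-4, update_num=20000, decay_start_step=8000,
                               warmup_steps=2000, min_lr=1e-7)
    poly = polynomial_decay_scheduler(lr=2e-4, min_lr=1e-6, decay_steps=10000,
                                      total_update_num=20000, warmup_steps=2000)
    print("ISR  lr at steps 0/2000/19999:", isr[0], isr[2000], isr[-1])
    print("Poly lr at steps 0/2000/19999:", poly[0], poly[2000], poly[-1])
    # Either array can be passed to a MindSpore optimizer via
    # `learning_rate=Tensor(lrs)`; `BertLearningRate` can be used directly as a
    # dynamic learning-rate schedule.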

View File

@ -0,0 +1,61 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Calculate Perplexity score under N-gram language model."""
from typing import Union
import numpy as np
def ngram_ppl(prob: Union[np.ndarray, list], length: int, log_softmax=False, index: float = np.e):
"""
    Calculate the perplexity (PPL) score under an N-gram language model.
    If `prob` holds log probabilities, set `log_softmax=True`;
    otherwise `prob` is expected to hold raw token probabilities.
    The order N is determined by the model.
    Args:
        prob (Union[list, np.ndarray]): Prediction probabilities
            (or log probabilities) of the sentence tokens.
        length (int): Sentence length used for normalization.
        log_softmax (bool): Whether `prob` holds log probabilities. Default: False.
        index (float): Base of the log probabilities. Default: np.e.
Returns:
float, ppl score.
"""
if not length:
return np.inf
if not isinstance(prob, (np.ndarray, list)):
raise TypeError("`prob` must be type of list or np.ndarray.")
if not isinstance(prob, np.ndarray):
prob = np.array(prob)
if prob.shape[0] == 0:
raise ValueError("`prob` length must greater than 0.")
print(f'length:{length}, log_prob:{prob}')
if log_softmax:
prob = np.sum(prob) / length
ppl = 1. / np.power(index, prob)
print(f'avg log prob:{prob}')
else:
p = 1.
for i in range(prob.shape[0]):
p *= (1. / prob[i])
ppl = pow(p, 1 / length)
print(f'ppl val:{ppl}')
return ppl
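# A quick self-check (illustrative numbers only): for four tokens that each
# receive probability 0.25, the perplexity should be 4, and the same value is
# recovered from the natural-log probabilities with `log_softmax=True`.
if __name__ == "__main__":
    probs = [0.25, 0.25, 0.25, 0.25]
    print(ngram_ppl(probs, length=4))                          # ~4.0
    log_probs = np.log(probs)
    print(ngram_ppl(log_probs, length=4, log_softmax=True))    # ~4.0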

View File

@ -0,0 +1,61 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Calculate ROUGE score."""
from typing import List
from rouge import Rouge
H_PATH = "summaries.txt"
R_PATH = "references.txt"
def rouge(hypothesis: List[str], target: List[str]):
"""
Calculate ROUGE score.
Args:
hypothesis (List[str]): Inference result.
target (List[str]): Reference.
"""
def cut(s):
idx = s.find("[SEP]")
if idx != -1:
s = s[:idx]
return s
    if not hypothesis or not target:
        raise ValueError("`hypothesis` and `target` cannot be empty.")
edited_hyp = []
edited_ref = []
for h, r in zip(hypothesis, target):
h = "[BOS]" + h[5:]
h = cut(h).replace("[BOS]", "").strip()
r = cut(r).replace("[SEP]", "").strip()
edited_hyp.append(h + "\n")
edited_ref.append(r + "\n")
_rouge = Rouge()
scores = _rouge.get_scores(edited_hyp, edited_ref, avg=True)
print(" | ROUGE Score:")
print(f" | RG-1(F): {scores['rouge-1']['f'] * 100:8.2f}")
print(f" | RG-2(F): {scores['rouge-2']['f'] * 100:8.2f}")
print(f" | RG-L(F): {scores['rouge-l']['f'] * 100:8.2f}")
with open(H_PATH, "w") as f:
f.writelines(edited_hyp)
with open(R_PATH, "w") as f:
f.writelines(edited_ref)
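# Usage sketch (toy strings; the leading token and "[SEP]" markers only mimic
# the decoding format this function expects, they are not real model outputs):
if __name__ == "__main__":
    hyps = ["[BOS] the cat sat on the mat [SEP] padding"]
    refs = ["the cat sat on the mat [SEP]"]
    rouge(hyps, refs)   # prints RG-1/RG-2/RG-L F-scores and writes the two files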

View File

@ -0,0 +1,97 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Tokenizer."""
import os
import argparse
from typing import Callable
from multiprocessing import Pool
parser = argparse.ArgumentParser(description='Corpus tokenizer; input text files must end with `.txt`.')
parser.add_argument("--corpus_folder", type=str, default="", required=True,
                    help="Corpus folder path; if multiple folders are provided, separate them with ','.")
parser.add_argument("--output_folder", type=str, default="", required=True,
                    help="Output folder path.")
parser.add_argument("--tokenizer", type=str, default="nltk", required=False,
                    help="Tokenizer to use, 'nltk' or 'jieba'. If NLTK is not fully installed, "
                         "jieba is used instead.")
parser.add_argument("--pool_size", type=int, default=2, required=False,
help="Processes pool size.")
TOKENIZER = Callable  # Placeholder; replaced with the selected tokenizer function in tokenize().
def create_tokenized_sentences(file_path, tokenized_file):
"""
Create tokenized sentences.
Args:
file_path (str): Text file.
tokenized_file (str): Output file.
"""
global TOKENIZER
print(f" | Processing {file_path}.")
tokenized_sen = []
with open(file_path, "r") as file:
for sen in file:
tokens = TOKENIZER(sen)
tokens = [t for t in tokens if t != " "]
            # Skip overly long sentences (more than 175 tokens).
            if len(tokens) > 175:
                continue
tokenized_sen.append(" ".join(tokens) + "\n")
with open(tokenized_file, "w") as file:
file.writelines(tokenized_sen)
print(f" | Wrote to {tokenized_file}.")
def tokenize():
"""Tokenizer."""
global TOKENIZER
args, _ = parser.parse_known_args()
src_folder = args.corpus_folder.split(",")
    try:
        from nltk.tokenize import word_tokenize
        TOKENIZER = word_tokenize
    except (ImportError, ModuleNotFoundError, LookupError):
        # Fall back to jieba; if jieba is missing as well, the ImportError propagates.
        print(" | NLTK not found (or its tokenizer data is missing); using jieba instead.")
        import jieba
        TOKENIZER = jieba.cut
if args.tokenizer == "jieba":
import jieba
TOKENIZER = jieba.cut
pool = Pool(args.pool_size)
for folder in src_folder:
for file in os.listdir(folder):
if not file.endswith(".txt"):
continue
file_path = os.path.join(folder, file)
out_path = os.path.join(args.output_folder, file.replace(".txt", "_tokenized.txt"))
pool.apply_async(create_tokenized_sentences, (file_path, out_path,))
pool.close()
pool.join()
if __name__ == '__main__':
tokenize()
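# Example invocation (script name and data paths are placeholders):
#
#   python tokenize_corpus.py --corpus_folder /data/news_crawl \
#       --output_folder /data/news_crawl_tokenized --tokenizer nltk --pool_size 4
#
# Every `*.txt` file found in the corpus folders is written out as
# `*_tokenized.txt`, one space-separated token sequence per line.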

View File

@ -0,0 +1,81 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Weight average."""
import os
import argparse
import numpy as np
from mindspore.train.serialization import load_checkpoint
parser = argparse.ArgumentParser(description='Average the weights of multiple checkpoints.')
parser.add_argument("--input_files", type=str, default=None, required=False,
                    help="Comma-separated checkpoint (.ckpt) file paths.")
parser.add_argument("--input_folder", type=str, default=None, required=False,
                    help="Folder containing checkpoint (.ckpt) files.")
parser.add_argument("--output_file", type=str, default=None, required=True,
help="Output model file path.")
def average_me_models(ckpt_list):
"""
Average multi ckpt params.
Args:
ckpt_list (list): Ckpt paths.
Returns:
dict, params dict.
"""
avg_model = {}
# load all checkpoint
for ckpt in ckpt_list:
if not ckpt.endswith(".ckpt"):
continue
if not os.path.exists(ckpt):
raise FileNotFoundError(f"Checkpoint file is not existed.")
print(f" | Loading ckpt from {ckpt}.")
ms_ckpt = load_checkpoint(ckpt)
for param_name in ms_ckpt:
if param_name not in avg_model:
avg_model[param_name] = []
avg_model[param_name].append(ms_ckpt[param_name].data.asnumpy())
    for name in avg_model:
        # Average over the checkpoints that were actually loaded
        # (entries without a `.ckpt` suffix are skipped above).
        avg_model[name] = sum(avg_model[name]) / float(len(avg_model[name]))
return avg_model
def main():
"""Entry point."""
args, _ = parser.parse_known_args()
if not args.input_files and not args.input_folder:
raise ValueError("`--input_files` or `--input_folder` must be provided one as least.")
ckpt_list = []
if args.input_files:
ckpt_list.extend(args.input_files.split(","))
if args.input_folder and os.path.exists(args.input_folder) and os.path.isdir(args.input_folder):
for file in os.listdir(args.input_folder):
ckpt_list.append(os.path.join(args.input_folder, file))
avg_weights = average_me_models(ckpt_list)
np.savez(args.output_file, **avg_weights)
if __name__ == '__main__':
main()
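# Example invocation (file names are placeholders):
#
#   python weights_average.py --input_folder ./checkpoints --output_file avg_weights
#
# `np.savez` writes the averaged parameters to `avg_weights.npz`; they can be
# inspected or re-loaded with numpy:
#
#   params = np.load("avg_weights.npz")
#   print(list(params.keys())[:5])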