- [Description of random situation](#description-of-random-situation)
- [others](#others)
- [ModelZoo Homepage](#modelzoo-homepage)
# MASS: Masked Sequence to Sequence Pre-training for Language Generation Description
[MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf) was released by Microsoft in June 2019.
BERT (Devlin et al., 2018) has achieved SOTA in natural language understanding by pre-training the encoder part of the Transformer (Vaswani et al., 2017) with masked rich-resource text. Likewise, GPT (Radford et al., 2018) pre-trains the decoder part of the Transformer with masked (the encoder inputs are masked) rich-resource text. Both of them build a robust language model by pre-training with masked rich-resource text.
Inspired by BERT, GPT and other language models, Microsoft proposed [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf), which combines the ideas of BERT and GPT. MASS has an important parameter k, which controls the length of the masked fragment. BERT and GPT are special cases of MASS when k equals 1 and the sentence length, respectively.
[Introducing MASS – A pre-training method that outperforms BERT and GPT in sequence to sequence language generation tasks](https://www.microsoft.com/en-us/research/blog/introducing-mass-a-pre-training-method-that-outperforms-bert-and-gpt-in-sequence-to-sequence-language-generation-tasks/)
[Paper](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf): Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. “MASS: Masked Sequence to Sequence Pre-training for Language Generation.” ICML (2019).
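The role of the parameter k can be illustrated with a small sketch. This is not the repository's implementation: the function name `mass_mask`, the `[MASK]` placeholder, and the toy sentence are assumptions made for illustration. It masks one contiguous fragment of length k in the encoder input and returns that fragment as the prediction target.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; the real vocabulary entry may differ


def mass_mask(tokens, k, start=None, rng=random):
    """Mask one contiguous fragment of length k.

    Returns (encoder_input, target_fragment, start).
    """
    m = len(tokens)
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= len(tokens)")
    if start is None:
        start = rng.randint(0, m - k)  # randint bounds are inclusive
    encoder_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    target = tokens[start:start + k]
    return encoder_input, target, start


sentence = "x1 x2 x3 x4 x5 x6 x7 x8".split()

# A fragment of length 4 starting at the third token.
enc, tgt, _ = mass_mask(sentence, k=4, start=2)
print(enc)
print(tgt)

# k = 1 masks a single token (the BERT-like extreme).
print(mass_mask(sentence, k=1, start=4)[0])

# k = sentence length masks the whole encoder input (the GPT-like extreme).
print(mass_mask(sentence, k=len(sentence))[0])
```

With k between the two extremes, the encoder sees the context around the fragment and the decoder has to reconstruct the fragment itself.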
# Model architecture
The overall network architecture of MASS is the Transformer (Vaswani et al., 2017).
MASS consists of a 6-layer encoder and a 6-layer decoder with an embedding/hidden size of 1024 and an intermediate size of 4096 in the feed-forward network, which has two fully connected layers.
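To make these sizes concrete, the sketch below records them in a small data class and counts the parameters of one feed-forward block. The class and field names are hypothetical and do not come from the repository's configuration, and the count assumes each of the two fully connected layers has a bias term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MassSize:
    # Sizes quoted in the text above; the field names are illustrative only.
    encoder_layers: int = 6
    decoder_layers: int = 6
    hidden_size: int = 1024
    intermediate_size: int = 4096

    def ffn_parameters(self) -> int:
        """Weights plus biases of the two fully connected layers in one block."""
        up = self.hidden_size * self.intermediate_size + self.intermediate_size
        down = self.intermediate_size * self.hidden_size + self.hidden_size
        return up + down


size = MassSize()
print(size.ffn_parameters())  # 8393728 under the bias assumption above
```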
# Dataset
Dataset used:
- Monolingual English data from the News Crawl dataset (WMT 2019) for pre-training.
- Gigaword Corpus (Graff et al., 2003) for text summarization.
- Cornell movie dialog corpus (Danescu-Niculescu-Mizil & Lee, 2011).
Details about these datasets can be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
# Features
MASS is designed to jointly pre-train the encoder and decoder for language generation tasks.
First, within a sequence to sequence framework, MASS predicts only the masked tokens, which forces the encoder to understand the meaning of the unmasked tokens and encourages the decoder to extract useful information from the encoder.
Second, by predicting consecutive tokens on the decoder side, the decoder builds better language modeling ability than it would by predicting only discrete tokens.
Third, by further masking the decoder input tokens that are not masked in the encoder, the decoder is encouraged to extract more useful information from the encoder side rather than relying on the rich information in the previous tokens.
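The third point can be sketched as follows. The helper name `mass_decoder_io` and the `[MASK]` placeholder are assumptions for illustration, and the layout follows the scheme described in the MASS paper as understood here, not code taken from this repository. Inside the fragment, the decoder input at each position is the previous fragment token; every other decoder input position is masked.

```python
MASK = "[MASK]"  # hypothetical placeholder symbol


def mass_decoder_io(tokens, start, k):
    """Build decoder inputs and targets for a masked fragment [start, start + k).

    Positions outside the fragment are masked on the decoder side. Inside the
    fragment, position i receives token i - 1 as input (shifted right), so the
    first fragment position has no visible previous token.
    """
    end = start + k
    decoder_input = [MASK] * len(tokens)
    for i in range(start + 1, end):
        decoder_input[i] = tokens[i - 1]
    targets = {i: tokens[i] for i in range(start, end)}
    return decoder_input, targets


sentence = "x1 x2 x3 x4 x5 x6 x7 x8".split()
dec_in, targets = mass_decoder_io(sentence, start=2, k=4)
print(dec_in)
print(targets)
```

Because the tokens visible to the encoder are hidden from the decoder, the decoder can only recover the fragment's context through the encoder.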
```text
├── weights_average.py // Average multi model checkpoints to NPZ format.
├── news_crawl.py // Create News Crawl dataset for pre-training.
├── gigaword.py // Create Gigaword Corpus.
├── cornell_dialog.py // Create Cornell Movie Dialog dataset for conversation response.
```
## Data Preparation
Data preparation for a natural language processing task consists of data cleaning, tokenization, encoding, and vocabulary generation.
In our experiments, [Byte Pair Encoding (BPE)](https://arxiv.org/abs/1508.07909) reduced the vocabulary size and effectively relieved the out-of-vocabulary (OOV) problem.
The vocabulary can be created with `src/utils/dictionary.py` from the text dictionary learned by BPE.
For more details about BPE, please refer to the [Subword-nmt lib](https://www.cnpython.com/pypi/subword-nmt) or the [paper](https://arxiv.org/abs/1508.07909).
In our experiments, the vocabulary was learned from 1.9M sentences of the News Crawl dataset, and its size is 45755.
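As a rough illustration of how BPE learns merge rules, the sketch below follows the merge-learning procedure described in the BPE paper on a toy word-frequency table. It is not `src/utils/dictionary.py` and not the subword-nmt implementation; the function names and the toy data are assumptions made for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    # The lookarounds keep the match aligned to whole symbols.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are pre-split into characters, with </w> marking the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, final_vocab = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(final_vocab)
```

Each merge adds one symbol to the subword vocabulary, so the number of merges directly controls the vocabulary size.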
The data preparation scripts are briefly introduced below.
### Tokenization
`tokenize_corpus.py` tokenizes a corpus whose text files are in `.txt` format.
Major parameters in `tokenize_corpus.py`:
```bash
--corpus_folder: Corpus folder path. If multiple folders are provided, separate them with ','.
--output_folder: Output folder path.
--tokenizer: Tokenizer to use, nltk or jieba. If nltk is not fully installed, use jieba instead.
```
The JSON file under the path `config/` is the template configuration file.
Almost all of the required options and arguments can be set there, including the training platform, the dataset and model configurations, and the optimizer arguments. Optional features such as loss scale and checkpointing can also be enabled by setting the corresponding options.
For more detailed information about the attributes, refer to the file `config/config.py`.
## Training & Evaluation process
To train a model, the shell script `run_ascend.sh` or `run_gpu.sh` is all you need. These scripts set the environment variables and execute the training script `train.py` under `mass`.
You can start a training task on a single device or on multiple devices by setting the options and running the script in bash.
MASS pre-trains a sequence to sequence model by predicting the masked fragments in an input sequence. After that, downstream tasks such as text summarization and conversation response are candidates for fine-tuning the model and for inference.
Here we provide a practice example to demonstrate the basic usage of MASS for pre-training, fine-tuning a model, and the inference process. The overall process is as follows:
1. Download and process the dataset.
2. Modify `config.json` to configure the network.
3. Run a task for pre-training and fine-tuning.
4. Perform inference and validation.
## Pre-training
To pre-train a model, first configure the options in `config.json`:
- Assign the `pre_train_dataset` under the `dataset_config` node to the dataset path.
- Choose the optimizer (`momentum`, `adam`, or `lamb` are available).
- Assign the `ckpt_prefix` and `ckpt_path` under `checkpoint_path` to save the model files.
- Set other arguments including dataset configurations and network configurations.
- If you already have a trained model, assign the `existed_ckpt` to the checkpoint file.
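The options listed above could also be filled in programmatically. The sketch below is a hedged illustration: the key names `dataset_config`, `pre_train_dataset`, `checkpoint_path`, `ckpt_prefix`, `ckpt_path`, and `existed_ckpt` come from this README, while the function name, the empty template, and all paths are hypothetical. The optimizer is left out because the name of its key is not given here; check `config/config.py` for the actual layout.

```python
import json


def set_pretrain_options(config, dataset_path, ckpt_prefix, ckpt_path, existed_ckpt=""):
    """Return a copy of config with the pre-training options filled in."""
    updated = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    updated.setdefault("dataset_config", {})["pre_train_dataset"] = dataset_path
    ckpt = updated.setdefault("checkpoint_path", {})
    ckpt["ckpt_prefix"] = ckpt_prefix
    ckpt["ckpt_path"] = ckpt_path
    ckpt["existed_ckpt"] = existed_ckpt  # empty when starting from scratch
    return updated


# Stand-in for the template under config/; the real file has more fields.
template = {"dataset_config": {}, "checkpoint_path": {}}
config = set_pretrain_options(
    template,
    dataset_path="/path/to/pre_train_dataset",  # hypothetical path
    ckpt_prefix="mass_pretrain",                # hypothetical prefix
    ckpt_path="./checkpoints",                  # hypothetical directory
)
print(json.dumps(config, indent=2))
```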
If you use an Ascend chip, run the shell script `run_ascend.sh` as follows:
```bash
bash run_ascend.sh -t t -n 1 -i 1 -c /mass/config/config.json
```
You can also run the shell script `run_gpu.sh` on GPU as follows:
```bash
bash run_gpu.sh -t t -n 1 -i 1 -c /mass/config/config.json
```
The log and output files are written under the path `./train_mass_*/`, and the model file is saved under the path assigned in the `config/config.json` file.
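To locate the outputs of a run, a small helper can list the directories that match the `./train_mass_*/` pattern mentioned above. This is only a convenience sketch: the function name is hypothetical, and the names of the files inside each directory are not specified in this README, so the sketch simply lists whatever is there.

```python
from pathlib import Path


def find_run_dirs(root="."):
    """Return directories matching train_mass_* under root, newest first."""
    dirs = [p for p in Path(root).glob("train_mass_*") if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)


for run_dir in find_run_dirs():
    files = sorted(f.name for f in run_dir.iterdir() if f.is_file())
    print(run_dir, files)
```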
## Fine-tuning
To fine-tune a model, first configure the options in `config.json`:
- Assign the `fine_tune_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the existing model file generated by pre-training.
- Choose the optimizer (`momentum`, `adam`, or `lamb` are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_path` node to save the model files.
- Set other arguments including dataset configurations and network configurations.
If you use an Ascend chip, run the shell script `run_ascend.sh` as follows:
```bash
bash run_ascend.sh -t t -n 1 -i 1 -c config/config.json
```
You can also run the shell script `run_gpu.sh` on GPU as follows:
```bash
bash run_gpu.sh -t t -n 1 -i 1 -c config/config.json
```
The log and output files are written under the path `./train_mass_*/`, and the model file is saved under the path assigned in the `config/config.json` file.
## Inference
If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/docs/programming_guide/en/master/multi_platform_inference.html).
For inference, first configure the options in `config.json`:
- Assign the `test_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the model file produced by fine-tuning.
- Choose the optimizer (`momentum`, `adam`, or `lamb` are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_path` node to save the model files.
- Set other arguments including dataset configurations and network configurations.
If you use an Ascend chip, run the shell script `run_ascend.sh` as follows: