Transformer Example
Description
This example implements training and evaluation of the Transformer model, which was introduced in the following paper:
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 5998–6008.
Requirements
- Install MindSpore.
- Download and preprocess the WMT English-German dataset for training and evaluation.
Note: If you are running an evaluation task, prepare the corresponding checkpoint file.
Example structure
.
└─Transformer
  ├─README.md
  ├─scripts
    ├─process_output.sh
    ├─replace-quote.perl
    ├─run_distribute_train.sh
    └─run_standalone_train.sh
  ├─src
    ├─__init__.py
    ├─beam_search.py
    ├─config.py
    ├─dataset.py
    ├─eval_config.py
    ├─lr_schedule.py
    ├─process_output.py
    ├─tokenization.py
    ├─transformer_for_train.py
    ├─transformer_model.py
    └─weight_init.py
  ├─create_data.py
  ├─eval.py
  └─train.py
Prepare the dataset
- You may use this shell script to download and preprocess the WMT English-German dataset. It should produce the following files:
  - train.tok.clean.bpe.32000.en
  - train.tok.clean.bpe.32000.de
  - vocab.bpe.32000
  - newstest2014.tok.bpe.32000.en
  - newstest2014.tok.bpe.32000.de
  - newstest2014.tok.de
- Convert the original data to mindrecord for training:
    paste train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.all
    python create_data.py --input_file train.all --vocab_file vocab.bpe.32000 --output_file /path/ende-l128-mindrecord --max_seq_length 128
- Convert the original data to mindrecord for evaluation:
    paste newstest2014.tok.bpe.32000.en newstest2014.tok.bpe.32000.de > test.all
    python create_data.py --input_file test.all --vocab_file vocab.bpe.32000 --output_file /path/newstest2014-l128-mindrecord --num_splits 1 --max_seq_length 128 --clip_to_max_len True
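The `paste` step above joins line i of the source file with line i of the target file, separated by a tab (the default `paste` delimiter). A minimal Python sketch of that pairing is shown here for illustration. The function name `paste_pairs` and the sample sentences are hypothetical, and, unlike `paste`, this sketch rejects inputs of different lengths instead of padding the shorter one.

```python
def paste_pairs(src_lines, tgt_lines):
    """Join parallel source/target lines with a tab, similar to `paste a b`."""
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    # Assumption: a parallel corpus must have the same number of lines on both sides.
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return [f"{s}\t{t}" for s, t in zip(src, tgt)]


# Hypothetical sample sentences, not taken from the WMT data.
pairs = paste_pairs(
    ["Hello world .\n", "Thank you .\n"],
    ["Hallo Welt .\n", "Danke .\n"],
)
for line in pairs:
    print(line)
```

Checking the line counts before conversion is a cheap way to catch a misaligned corpus early.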
Running the example
Training
- Set options in config.py, including loss_scale, learning rate and network hyperparameters. Click here for more information about dataset.
- Run run_standalone_train.sh for non-distributed training of the Transformer model.
    sh scripts/run_standalone_train.sh DEVICE_ID EPOCH_SIZE DATA_PATH
- Run run_distribute_train.sh for distributed training of the Transformer model.
    sh scripts/run_distribute_train.sh DEVICE_NUM EPOCH_SIZE DATA_PATH MINDSPORE_HCCL_CONFIG_PATH
Evaluation
- Set options in eval_config.py. Make sure 'data_file', 'model_file' and 'output_file' are set to your own paths.
- Run eval.py to evaluate the Transformer model.
    python eval.py
- Run process_output.sh to convert the output token ids into the actual translation results.
    sh scripts/process_output.sh REF_DATA EVAL_OUTPUT VOCAB_FILE
  You will get two files, REF_DATA.forbleu and EVAL_OUTPUT.forbleu, for BLEU score calculation.
- To calculate the BLEU score, you may use this perl script and run the following command:
    perl multi-bleu.perl REF_DATA.forbleu < EVAL_OUTPUT.forbleu
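To make the post-processing step more concrete, here is a rough sketch of turning output token ids back into text. It is not the code in process_output.py. It assumes that the vocabulary is a list indexed by token id, that BPE continuation pieces end with "@@" (the subword-nmt convention), and that "</s>" marks the end of a sentence. The function name and the tiny vocabulary are hypothetical.

```python
def ids_to_text(token_ids, id_to_token, eos_token="</s>"):
    """Map token ids to tokens, stop at end-of-sentence, and undo BPE splits."""
    tokens = []
    for idx in token_ids:
        token = id_to_token[idx]
        if token == eos_token:  # assumption: everything after </s> is padding
            break
        tokens.append(token)
    text = " ".join(tokens)
    # Assumption: "Mor@@ gen" is a BPE split of "Morgen".
    text = text.replace("@@ ", "")
    if text.endswith("@@"):
        text = text[:-2]
    return text


# Hypothetical vocabulary and decoder output.
vocab = ["<pad>", "</s>", "Guten", "Mor@@", "gen", "."]
sentence = ids_to_text([2, 3, 4, 5, 1, 0], vocab)
print(sentence)
```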
Usage
Training
usage: train.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                [--enable_save_ckpt ENABLE_SAVE_CKPT]
                [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                [--enable_data_sink ENABLE_DATA_SINK] [--save_checkpoint_steps N]
                [--save_checkpoint_num N] [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                [--data_path DATA_PATH]
options:
    --distribute              training by several devices: "true" (training by more than 1 device) | "false", default is "false"
    --epoch_size              epoch size: N, default is 52
    --device_num              number of used devices: N, default is 1
    --device_id               device id: N, default is 0
    --enable_save_ckpt        enable save checkpoint: "true" | "false", default is "true"
    --enable_lossscale        enable lossscale: "true" | "false", default is "true"
    --do_shuffle              enable shuffle: "true" | "false", default is "true"
    --enable_data_sink        enable data sink: "true" | "false", default is "false"
    --checkpoint_path         path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps   steps for saving checkpoint files: N, default is 2500
    --save_checkpoint_num     number for saving checkpoint files: N, default is 30
    --save_checkpoint_path    path to save checkpoint files: PATH, default is "./checkpoint/"
    --data_path               path to dataset file: PATH, default is ""
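As an illustration of how the options above fit together, the following sketch builds an `argparse` parser with the same flag names and defaults. It is not the parser in train.py. The string-valued "true"/"false" flags and the `choices` restriction are assumptions based on the option descriptions.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented train.py options."""
    parser = argparse.ArgumentParser(description="Transformer training (sketch)")
    flag = {"type": str, "choices": ["true", "false"]}  # assumed string booleans
    parser.add_argument("--distribute", default="false", **flag)
    parser.add_argument("--epoch_size", type=int, default=52)
    parser.add_argument("--device_num", type=int, default=1)
    parser.add_argument("--device_id", type=int, default=0)
    parser.add_argument("--enable_save_ckpt", default="true", **flag)
    parser.add_argument("--enable_lossscale", default="true", **flag)
    parser.add_argument("--do_shuffle", default="true", **flag)
    parser.add_argument("--enable_data_sink", default="false", **flag)
    parser.add_argument("--checkpoint_path", type=str, default="")
    parser.add_argument("--save_checkpoint_steps", type=int, default=2500)
    parser.add_argument("--save_checkpoint_num", type=int, default=30)
    parser.add_argument("--save_checkpoint_path", type=str, default="./checkpoint/")
    parser.add_argument("--data_path", type=str, default="")
    return parser


# Example: a distributed run on 8 devices; unspecified options keep their defaults.
args = build_parser().parse_args(
    ["--distribute", "true", "--device_num", "8",
     "--data_path", "/path/ende-l128-mindrecord"]
)
print(args.distribute, args.device_num, args.epoch_size)
```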
Options and Parameters
This section lists the parameters of the Transformer model and the options for training and evaluation, which are set in config.py and eval_config.py respectively.
Options:
config.py:
    transformer_network      version of Transformer model: base | large, default is large
    init_loss_scale_value    initial value of loss scale: N, default is 2^10
    scale_factor             factor used to update loss scale: N, default is 2
    scale_window             number of steps between loss scale updates: N, default is 2000
    optimizer                optimizer used in the network: Adam, default is "Adam"
eval_config.py:
    transformer_network      version of Transformer model: base | large, default is large
    data_file                data file: PATH
    model_file               checkpoint file to be loaded: PATH
    output_file              output file of evaluation: PATH
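The three loss-scale options in config.py (init_loss_scale_value, scale_factor, scale_window) are the usual knobs of dynamic loss scaling. The sketch below shows one common form of that scheme: shrink the scale when an overflow is detected, and grow it after scale_window consecutive overflow-free steps. The class name is hypothetical, and whether the example's training code follows exactly this rule is an assumption.

```python
class DynamicLossScale:
    """Sketch of a common dynamic loss-scaling rule (not MindSpore's code)."""

    def __init__(self, init_loss_scale_value=2 ** 10, scale_factor=2, scale_window=2000):
        self.scale = float(init_loss_scale_value)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow):
        """Update the scale after one training step and return the new value."""
        if overflow:
            # Gradients overflowed: reduce the scale (never below 1) and restart the count.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # A full window without overflow: try a larger scale.
                self.scale *= self.scale_factor
                self.good_steps = 0
        return self.scale


# Small window so the behaviour is visible in a few steps.
scaler = DynamicLossScale(scale_window=3)
history = [scaler.update(flag) for flag in (True, False, False, False)]
print(history)
```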
Parameters:
Parameters for dataset and network (Training/Evaluation):
    batch_size                      batch size of input dataset: N, default is 96
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, default is 36560
    hidden_size                     size of Transformer encoder layers: N, default is 1024
    num_hidden_layers               number of hidden layers: N, default is 6
    num_attention_heads             number of attention heads: N, default is 16
    intermediate_size               size of intermediate layer: N, default is 4096
    hidden_act                      activation function used: ACTIVATION, default is "relu"
    hidden_dropout_prob             dropout probability for TransformerOutput: Q, default is 0.3
    attention_probs_dropout_prob    dropout probability for TransformerAttention: Q, default is 0.3
    max_position_embeddings         maximum length of sequences: N, default is 128
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    label_smoothing                 label smoothing setting: Q, default is 0.1
    input_mask_from_dataset         use the input mask loaded from dataset or not: True | False, default is True
    beam_width                      beam width setting: N, default is 4
    max_decode_length               max decode length in evaluation: N, default is 80
    length_penalty_weight           normalize scores of translations according to their length: Q, default is 1.0
    compute_type                    compute type in Transformer: mstype.float16 | mstype.float32, default is mstype.float16
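As an example of what one of these parameters does, the sketch below builds a label-smoothed target distribution for a single token. It uses one common formulation, where the true token keeps probability 1 - label_smoothing and the remainder is spread evenly over the other vocabulary entries. The function name is hypothetical, and the exact formula used in this example's loss is an assumption.

```python
def smoothed_targets(label, vocab_size, label_smoothing=0.1):
    """Return a smoothed one-hot distribution over the vocabulary."""
    if not 0 <= label < vocab_size:
        raise ValueError("label out of range")
    # Mass taken from the true token is shared by the other vocab_size - 1 tokens.
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[label] = 1.0 - label_smoothing
    return dist


# Tiny vocabulary of 5 tokens for readability; the documented default is 36560.
dist = smoothed_targets(label=2, vocab_size=5, label_smoothing=0.1)
print(dist)
```

With label_smoothing set to 0 this reduces to an ordinary one-hot target.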
Parameters for learning rate:
    learning_rate       value of learning rate: Q
    warmup_steps        steps of the learning rate warm up: N
    start_decay_step    step of the learning rate to decay: N
    min_lr              minimal learning rate: Q
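One plausible way these four parameters could combine is sketched below: a linear warm-up to learning_rate, a constant phase until start_decay_step, then an inverse-square-root decay that never drops below min_lr. This is an assumption made for illustration, not the schedule implemented in lr_schedule.py, and the numeric values in the example are hypothetical.

```python
def learning_rate_at(step, learning_rate, warmup_steps, start_decay_step, min_lr):
    """Sketch of a warm-up / hold / inverse-sqrt-decay schedule (assumed shape)."""
    if step < warmup_steps:
        # Linear warm-up: reaches learning_rate at the last warm-up step.
        return learning_rate * (step + 1) / warmup_steps
    if step < start_decay_step:
        # Hold the peak value until decay starts.
        return learning_rate
    # Inverse-square-root decay relative to the step where decay starts.
    decay = (max(start_decay_step, 1) / max(step, 1)) ** 0.5
    return max(learning_rate * decay, min_lr)


# Hypothetical settings chosen so each phase is visible within a few steps.
settings = dict(learning_rate=1e-3, warmup_steps=4, start_decay_step=8, min_lr=1e-4)
for step in (0, 3, 5, 32, 10 ** 6):
    print(step, learning_rate_at(step, **settings))
```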