Transformer Example

Description

This example implements training and evaluation of the Transformer model, which was introduced in the following paper:

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 5998–6008.

Requirements

  • Install MindSpore.
  • Download and preprocess the WMT English-German dataset for training and evaluation.

Note: If you are running an evaluation task, prepare the corresponding checkpoint file.

Example structure

.
└─Transformer
  ├─README.md
  ├─scripts
    ├─process_output.sh
    ├─replace-quote.perl
    ├─run_distribute_train.sh
    └─run_standalone_train.sh
  ├─src
    ├─__init__.py
    ├─beam_search.py
    ├─config.py
    ├─dataset.py
    ├─eval_config.py
    ├─lr_schedule.py
    ├─process_output.py
    ├─tokenization.py
    ├─transformer_for_train.py
    ├─transformer_model.py
    └─weight_init.py
  ├─create_data.py
  ├─eval.py
  └─train.py

Prepare the dataset

  • You may use this shell script to download and preprocess the WMT English-German dataset. The steps below assume that you obtain the following files:

    • train.tok.clean.bpe.32000.en
    • train.tok.clean.bpe.32000.de
    • vocab.bpe.32000
    • newstest2014.tok.bpe.32000.en
    • newstest2014.tok.bpe.32000.de
    • newstest2014.tok.de
  • Convert the original data to mindrecord for training:

    paste train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.all
    python create_data.py --input_file train.all --vocab_file vocab.bpe.32000 --output_file /path/ende-l128-mindrecord --max_seq_length 128
    
  • Convert the original data to mindrecord for evaluation:

    paste newstest2014.tok.bpe.32000.en newstest2014.tok.bpe.32000.de > test.all
    python create_data.py --input_file test.all --vocab_file vocab.bpe.32000 --output_file /path/newstest2014-l128-mindrecord --num_splits 1 --max_seq_length 128 --clip_to_max_len True
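The `paste` commands above join the two parallel files line by line, separating source and target with a tab. On systems without `paste`, the same merge can be sketched in Python. `merge_parallel` is a hypothetical helper, not part of this example, and it assumes both files contain the same number of lines.

```python
from itertools import zip_longest


def merge_parallel(src_path, tgt_path, out_path):
    """Write one 'source<TAB>target' line per sentence pair, like `paste`.

    Hypothetical helper for illustration; returns the number of pairs written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which signals
            # that the two files are not aligned sentence by sentence.
            if s is None or t is None:
                raise ValueError("input files have different numbers of lines")
            out.write(s.rstrip("\n") + "\t" + t.rstrip("\n") + "\n")
            count += 1
    return count
```

For example, `merge_parallel("train.tok.clean.bpe.32000.en", "train.tok.clean.bpe.32000.de", "train.all")` would produce the same `train.all` as the first `paste` command.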
    

Running the example

Training

  • Set options in config.py, including loss_scale, learning rate, and network hyperparameters. Click here for more information about the dataset.

  • Run run_standalone_train.sh for non-distributed training of Transformer model.

    sh scripts/run_standalone_train.sh DEVICE_ID EPOCH_SIZE DATA_PATH
    
  • Run run_distribute_train.sh for distributed training of Transformer model.

    sh scripts/run_distribute_train.sh DEVICE_NUM EPOCH_SIZE DATA_PATH MINDSPORE_HCCL_CONFIG_PATH
    

Evaluation

  • Set options in eval_config.py. Make sure the 'data_file', 'model_file' and 'output_file' are set to your own path.

  • Run eval.py for evaluation of Transformer model.

    python eval.py
    
  • Run process_output.sh to process the output token ids to get the real translation results.

    sh scripts/process_output.sh REF_DATA EVAL_OUTPUT VOCAB_FILE
    

    You will get two files, REF_DATA.forbleu and EVAL_OUTPUT.forbleu, for BLEU score calculation.

  • Calculate the BLEU score. You may use this perl script and run the following command:

    perl multi-bleu.perl REF_DATA.forbleu < EVAL_OUTPUT.forbleu
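To make the metric concrete, here is a minimal sketch of a corpus-level BLEU computation over whitespace-tokenized lines (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). `corpus_bleu` is a hypothetical function written for illustration. It is not a replacement for multi-bleu.perl, whose handling of edge cases may differ, so use the perl script for reported scores.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Illustrative corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = 0
    hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tok, n)
            ref_counts = ngram_counts(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

The inputs would be the lines of REF_DATA.forbleu and EVAL_OUTPUT.forbleu, read in the same order.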
    

Usage

Training

usage: train.py  [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                 [--enable_save_ckpt ENABLE_SAVE_CKPT]
                 [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                 [--enable_data_sink ENABLE_DATA_SINK] [--save_checkpoint_steps N]
                 [--save_checkpoint_num N] [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                 [--data_path DATA_PATH]

options:
    --distribute               training on several devices: "true" (training on more than 1 device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 52
    --device_num               number of used devices: N, default is 1
    --device_id                device id: N, default is 0
    --enable_save_ckpt         enable save checkpoint: "true" | "false", default is "true"
    --enable_lossscale         enable lossscale: "true" | "false", default is "true"
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "false"
    --checkpoint_path          path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps    steps for saving checkpoint files: N, default is 2500
    --save_checkpoint_num      number for saving checkpoint files: N, default is 30
    --save_checkpoint_path     path to save checkpoint files: PATH, default is "./checkpoint/"
    --data_path                path to dataset file: PATH, default is ""
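As a quick reference, the documented options and defaults can be mirrored in an `argparse` parser. This is only a sketch based on the table above; `build_parser` is a hypothetical name, and the real train.py may declare its arguments differently.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Transformer training (sketch)")
    flag = {"type": str, "choices": ["true", "false"]}
    parser.add_argument("--distribute", default="false", **flag)
    parser.add_argument("--epoch_size", type=int, default=52)
    parser.add_argument("--device_num", type=int, default=1)
    parser.add_argument("--device_id", type=int, default=0)
    parser.add_argument("--enable_save_ckpt", default="true", **flag)
    parser.add_argument("--enable_lossscale", default="true", **flag)
    parser.add_argument("--do_shuffle", default="true", **flag)
    parser.add_argument("--enable_data_sink", default="false", **flag)
    parser.add_argument("--checkpoint_path", type=str, default="")
    parser.add_argument("--save_checkpoint_steps", type=int, default=2500)
    parser.add_argument("--save_checkpoint_num", type=int, default=30)
    parser.add_argument("--save_checkpoint_path", type=str, default="./checkpoint/")
    parser.add_argument("--data_path", type=str, default="")
    return parser


# Example: override two options and keep every other default.
args = build_parser().parse_args(["--distribute", "true", "--device_num", "8"])
```

Note that the boolean-like options are passed as the strings "true" and "false", as documented above, rather than as bare flags.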

Options and Parameters

This section lists the parameters of the Transformer model and the options for training and evaluation, which are set in config.py and eval_config.py respectively.

Options:

config.py:
    transformer_network             version of Transformer model: base | large, default is large
    init_loss_scale_value           initial value of loss scale: N, default is 2^10
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    number of steps between loss scale updates: N, default is 2000
    optimizer                       optimizer used in the network: Adam, default is "Adam"

eval_config.py:
    transformer_network             version of Transformer model: base | large, default is large
    data_file                       data file: PATH
    model_file                      checkpoint file to be loaded: PATH
    output_file                     output file of evaluation: PATH
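The three loss scale options in config.py interact as in the common dynamic loss scaling scheme, sketched below. This is an assumption about the general technique, not a copy of MindSpore's implementation: the scale is divided by `scale_factor` when an overflow is detected and multiplied by `scale_factor` after `scale_window` consecutive overflow-free steps. The class name `DynamicLossScale` and the floor of 1.0 are choices made for this illustration.

```python
class DynamicLossScale:
    """Illustrative dynamic loss scale tracker (not MindSpore's class)."""

    def __init__(self, init_loss_scale_value=2.0 ** 10, scale_factor=2.0, scale_window=2000):
        self.scale = init_loss_scale_value
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow):
        """Adjust the scale after one training step and return the new value."""
        if overflow:
            # Gradients overflowed: shrink the scale and restart the window.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # A full window without overflow: try a larger scale.
                self.scale *= self.scale_factor
                self.good_steps = 0
        return self.scale
```

With the defaults listed above, the scale starts at 1024 and doubles to 2048 after 2000 clean steps, then halves again on the first overflow.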

Parameters:

Parameters for dataset and network (Training/Evaluation):
    batch_size                      batch size of input dataset: N, default is 96
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, default is 36560
    hidden_size                     size of Transformer encoder layers: N, default is 1024
    num_hidden_layers               number of hidden layers: N, default is 6
    num_attention_heads             number of attention heads: N, default is 16
    intermediate_size               size of intermediate layer: N, default is 4096
    hidden_act                      activation function used: ACTIVATION, default is "relu"
    hidden_dropout_prob             dropout probability for TransformerOutput: Q, default is 0.3
    attention_probs_dropout_prob    dropout probability for TransformerAttention: Q, default is 0.3
    max_position_embeddings         maximum length of sequences: N, default is 128
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    label_smoothing                 label smoothing setting: Q, default is 0.1
    input_mask_from_dataset         use the input mask loaded from the dataset or not: True | False, default is True
    beam_width                      beam width setting: N, default is 4
    max_decode_length               max decode length in evaluation: N, default is 80
    length_penalty_weight           normalize scores of translations according to their length: Q, default is 1.0
    compute_type                    compute type in Transformer: mstype.float16 | mstype.float32, default is mstype.float16
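To illustrate what the `label_smoothing` value does, the sketch below builds a smoothed target distribution for one token: the correct class receives `1 - label_smoothing` and the remainder is spread evenly over the other classes. `smooth_labels` is a hypothetical helper, and the exact split of the remaining mass (here over `vocab_size - 1` classes) is an assumption, since implementations vary on this detail.

```python
def smooth_labels(target_index, vocab_size, label_smoothing=0.1):
    """Return a smoothed target distribution as a list of probabilities.

    Illustrative only; assumes vocab_size >= 2.
    """
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - label_smoothing
    return dist


# Example with a toy vocabulary of 5 tokens and the correct token at index 2.
example = smooth_labels(2, 5, label_smoothing=0.1)
```

Training against this distribution instead of a one-hot target discourages the model from becoming overconfident in a single token.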

Parameters for learning rate:
    learning_rate                   value of learning rate: Q
    warmup_steps                    steps of the learning rate warm up: N
    start_decay_step                step at which the learning rate starts to decay: N
    min_lr                          minimal learning rate: Q
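One way these four parameters can fit together is a linear warm-up followed by inverse square root decay with a lower bound. The function below is a sketch under that assumption and is not taken from lr_schedule.py, which may combine the terms differently (for example, the schedule in the original paper also scales by the inverse square root of the hidden size). `learning_rate_at` is a hypothetical name.

```python
def learning_rate_at(step, learning_rate, warmup_steps, start_decay_step=0, min_lr=0.0):
    """Illustrative schedule: linear warm-up, optional plateau, rsqrt decay.

    `step` is 1-based. Decay never starts before the warm-up has finished.
    """
    start_decay_step = max(start_decay_step, warmup_steps)
    if step <= warmup_steps:
        # Ramp linearly from learning_rate / warmup_steps up to learning_rate.
        return learning_rate * step / warmup_steps
    if step <= start_decay_step:
        # Hold the peak value until decay is allowed to begin.
        return learning_rate
    # Decay proportionally to 1 / sqrt(step), never dropping below min_lr.
    decayed = learning_rate * (start_decay_step / step) ** 0.5
    return max(decayed, min_lr)
```

For instance, with `learning_rate=2.0` and `warmup_steps=4`, step 2 gives 1.0, step 4 gives the peak 2.0, and step 16 gives 1.0 again unless `min_lr` is higher.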