TinyBERT Example

Description

TinyBERT is 7.5x smaller and 9.4x faster at inference than BERT-base (the base version of the BERT model) and achieves competitive performance on natural language understanding tasks. It performs a novel transformer distillation at both the pre-training and task-specific learning stages.

Requirements

  • Install MindSpore.
  • Download datasets for general distillation and for task distillation, such as GLUE.
  • Prepare a pre-trained BERT model and a BERT model fine-tuned on the specific task, such as a GLUE task.

Running the Example

General Distill

  • Set options in src/gd_config.py, including loss scale, optimizer and network settings.

  • Set options in scripts/run_standalone_gd.sh, including device target, data sink config, checkpoint config and dataset. Refer to the dataset documentation for more information about the dataset and the JSON schema file.

  • Run run_standalone_gd.sh for non-distributed general distill of BERT-base model.

    bash scripts/run_standalone_gd.sh
    
  • Run run_distribute_gd.sh for distributed general distill of BERT-base model.

    bash scripts/run_distribute_gd.sh DEVICE_NUM EPOCH_SIZE RANK_TABLE_FILE
    
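For example, an illustrative launch on 8 devices for 3 epochs might look like the following (the rank table path is a placeholder, not a file shipped with this example):

    bash scripts/run_distribute_gd.sh 8 3 /path/to/hccl_rank_table_8p.json
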

Task Distill

Task distill has two phases: pre-distill (phase 1) and task distill (phase 2).

  • Set options in src/td_config.py, including loss scale, optimizer config of phases 1 and 2, as well as network config.

  • Run run_standalone_td.sh for task distill of the BERT-base model.

    bash scripts/run_standalone_td.sh
    
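Put together, an illustrative end-to-end workflow is sketched below. Feeding the general-distilled student checkpoint into task distillation is configured through load_student_ckpt_path in the task-specific config (see Options and Parameters); the comments describe the intent, not exact repository behavior.

    # 1. General distillation: produces a general TinyBERT student checkpoint
    bash scripts/run_standalone_gd.sh

    # 2. Task distillation: pre-distill (phase 1) followed by task distill (phase 2),
    #    using the checkpoint from step 1 as the student checkpoint
    bash scripts/run_standalone_td.sh
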

Usage

General Distill

usage: run_standalone_gd.py  [--distribute DISTRIBUTE] [--device_target DEVICE_TARGET]
                             [--epoch_size N] [--device_id N]
                             [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                             [--save_checkpoint_steps N] [--max_ckpt_num N]
                             [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                             [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR]

options:
    --distribute               whether to run in distributed mode: "true" | "false"
    --device_target            targeted device to run task: "Ascend" | "GPU"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_checkpoint_steps    steps between checkpoint saves: N
    --max_ckpt_num             maximum number of checkpoint files to keep: N
    --load_teacher_ckpt_path   path of teacher checkpoint to load: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file, PATH, default is ""
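
For example, a non-distributed general distill run might be launched as shown below; the flag values and paths are placeholders, not recommended defaults:

    python run_standalone_gd.py --device_target Ascend --epoch_size 3 --device_id 0 \
        --enable_data_sink true --data_sink_steps 100 \
        --load_teacher_ckpt_path /path/to/teacher_bert_base.ckpt \
        --data_dir /path/to/general_distill_dataset --schema_dir /path/to/schema.json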

usage: run_distribute_gd.py  [--distribute DISTRIBUTE] [--device_target DEVICE_TARGET]
                             [--epoch_size N] [--device_id N] [--device_num N]
                             [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                             [--save_ckpt_steps N] [--max_ckpt_num N]
                             [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                             [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR]

options:
    --distribute               whether to run in distributed mode: "true" | "false"
    --device_target            targeted device to run task: "Ascend" | "GPU"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --device_num               number of devices used to run the task: N
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_ckpt_steps          steps between checkpoint saves: N
    --max_ckpt_num             maximum number of checkpoint files to keep: N
    --load_teacher_ckpt_path   path of teacher checkpoint to load: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file, PATH, default is ""

Options and Parameters

gd_config.py and td_config.py contain parameters of the BERT model and options for the optimizer and loss scale.

Options:

Parameters for loss scale:
    loss_scale_value                initial value of loss scale: N, default is 2^8
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps between updates of loss scale: N, default is 50
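
As a rough illustration of how these three values interact, the sketch below shows a generic dynamic loss scale update rule (plain Python, assumed behavior; it is not the repository's actual loss scale implementation):

    class DynamicLossScale:
        """Generic dynamic loss scale: shrink on overflow, grow after a clean window."""

        def __init__(self, loss_scale_value=2 ** 8, scale_factor=2, scale_window=50):
            self.scale = loss_scale_value
            self.scale_factor = scale_factor
            self.scale_window = scale_window
            self.good_steps = 0  # consecutive steps without gradient overflow

        def update(self, overflow):
            if overflow:
                # Overflow detected: shrink the scale and restart the window.
                self.scale = max(self.scale / self.scale_factor, 1)
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps >= self.scale_window:
                    # scale_window clean steps in a row: grow the scale again.
                    self.scale *= self.scale_factor
                    self.good_steps = 0
            return self.scale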

Parameters for task-specific config:
    load_teacher_ckpt_path          teacher checkpoint to load
    load_student_ckpt_path          student checkpoint to load
    data_dir                        training data dir
    eval_data_dir                   evaluation data dir
    schema_dir                      data schema path

Parameters:

Parameters for bert network:
    batch_size                      batch size of input dataset: N, default is 16
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of vocabulary: N, must be consistent with the dataset you use. Default is 30522
    hidden_size                     size of bert encoder layers: N
    num_hidden_layers               number of hidden layers: N
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q
    attention_probs_dropout_prob    dropout probability for BertAttention: Q
    max_position_embeddings         maximum length of sequences: N, default is 512
    save_ckpt_step                  number of steps between checkpoint saves: N, default is 100
    max_ckpt_num                    maximum number of checkpoint files to keep: N, default is 1
    type_vocab_size                 size of token type vocab: N, default is 2
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    input_mask_from_dataset         use the input mask loaded from dataset or not: True | False, default is True
    token_type_ids_from_dataset     use the token type ids loaded from dataset or not: True | False, default is True
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
    enable_fused_layernorm          use the fused layernorm implementation (based on batchnorm) to improve performance: True | False, default is False
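
To make the parameter list concrete, the sketch below shows how these fields might be collected into a single configuration object. The class name, the student-sized defaults and the dropout values are illustrative assumptions; refer to the config files under src/ for the actual definitions.

    from dataclasses import dataclass
    from mindspore.common import dtype as mstype

    @dataclass
    class TinyBertNetConfig:
        """Illustrative container for the parameters above; not the repository's actual class."""
        batch_size: int = 16
        seq_length: int = 128
        vocab_size: int = 30522
        hidden_size: int = 312                      # illustrative student size
        num_hidden_layers: int = 4                  # illustrative student depth
        num_attention_heads: int = 12
        intermediate_size: int = 1200               # illustrative student size
        hidden_act: str = "gelu"
        hidden_dropout_prob: float = 0.1            # illustrative value
        attention_probs_dropout_prob: float = 0.1   # illustrative value
        max_position_embeddings: int = 512
        save_ckpt_step: int = 100
        max_ckpt_num: int = 1
        type_vocab_size: int = 2
        initializer_range: float = 0.02
        use_relative_positions: bool = False
        input_mask_from_dataset: bool = True
        token_type_ids_from_dataset: bool = True
        dtype: object = mstype.float32
        compute_type: object = mstype.float16
        enable_fused_layernorm: bool = False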

Parameters for optimizer:
    optimizer                       optimizer used in the network: AdamWeightDecay
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q, must be positive
    power                           power: Q
    weight_decay                    weight decay: Q
    eps                             term added to the denominator to improve numerical stability: Q
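
The learning_rate, end_learning_rate and power entries suggest a polynomial-decay schedule feeding AdamWeightDecay. Below is a minimal sketch of that schedule under that assumption (check src/gd_config.py and src/td_config.py for the exact schedule the scripts use):

    def polynomial_decay_lr(step, total_steps, learning_rate, end_learning_rate, power):
        """Assumed polynomial decay: start at learning_rate and decay to end_learning_rate."""
        progress = min(step, total_steps) / total_steps
        return (learning_rate - end_learning_rate) * (1 - progress) ** power + end_learning_rate

    # Placeholder values: halfway through 10000 steps with power 1.0 gives the midpoint rate.
    print(polynomial_decay_lr(step=5000, total_steps=10000,
                              learning_rate=5e-5, end_learning_rate=1e-14, power=1.0))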