Contents

DeepSpeech2 Description

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a wide variety of speech, including noisy environments, accents and different languages. We support training and evaluation on CPU and GPU.

Paper: Amodei, Dario, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin.

Model Architecture

The current reproduced model consists of:

  • two convolutional layers:
    • number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    • number of channels is 32, kernel size is [41, 11], stride is [2, 1]
  • five bidirectional LSTM layers (size is 1024)
  • one projection layer (output size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)
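
The stack above maps spectrogram frames to per-frame character scores that are trained with CTC. Below is a minimal sketch of that layer stack, assuming 161-bin spectrogram inputs (16 kHz audio, 20 ms windows); the authoritative implementation is src/DeepSpeech.py, which may differ in details such as padding, normalization placement and how the bidirectional outputs are combined.

import mindspore as ms
import mindspore.nn as nn
import mindspore.numpy as mnp

class DeepSpeech2Sketch(nn.Cell):
    """Illustrative layer stack only; not the repository's DeepSpeech cell."""
    def __init__(self, num_classes=29, hidden_size=1024, num_layers=5):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # two 2-D convolutions over the (frequency, time) spectrogram
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.BatchNorm2d(32), nn.ReLU(),
        ])
        rnn_input = 32 * 41  # 161 frequency bins halved twice by the stride-2 convolutions
        # five bidirectional LSTM layers
        self.rnn = nn.LSTM(rnn_input, hidden_size, num_layers,
                           batch_first=True, bidirectional=True)
        # projection to characters + CTC blank (29 outputs)
        self.fc = nn.Dense(2 * hidden_size, num_classes)

    def construct(self, x):                                  # x: (batch, 1, freq, time)
        x = self.conv(x)
        b, c, f, t = x.shape
        x = x.view(b, c * f, t).transpose(0, 2, 1)           # -> (batch, time, features)
        h0 = mnp.zeros((2 * self.num_layers, b, self.hidden_size), ms.float32)
        x, _ = self.rnn(x, (h0, mnp.zeros_like(h0)))
        return self.fc(x)                                    # per-frame character scores for CTC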

Dataset

Note that you can run the scripts with the dataset mentioned in the original paper or with other datasets widely used in this domain. The following sections describe how to run the scripts using the dataset below.

Dataset used: LibriSpeech

  • Train Data
    • train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech)
    • train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech)
    • train-other-500.tar.gz [30G] (training set of 500 hours "other" speech)
  • Val Data
    • dev-clean.tar.gz [337M] (development set, "clean" speech)
    • dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
  • Test Data:
    • test-clean.tar.gz [346M] (test set, "clean" speech)
    • test-other.tar.gz [328M] (test set, "other" speech)
  • Data format: wav and txt files
    • Note: data will be processed in librispeech.py

Environment Requirements

Script Description

Script and Sample Code

.
├── audio
    ├── deepspeech2
        ├── scripts
           ├──run_distribute_train_gpu.sh // launch distributed training with gpu platform(8p)
           ├──run_eval_cpu.sh             // launch evaluation with cpu platform
           ├──run_eval_gpu.sh             // launch evaluation with gpu platform
           ├──run_standalone_train_cpu.sh // launch standalone training with cpu platform
           └──run_standalone_train_gpu.sh // launch standalone training with gpu platform(1p)
        ├── train.py                       // training scripts
        ├── eval.py                        // evaluation script
        ├── export.py                      // convert mindspore model to mindir model
        ├── labels.json                    // possible characters to map to
        ├── README.md                      // descriptions about DeepSpeech
        ├── deepspeech_pytorch             // third-party decoder package (downloaded separately, see Training and Eval process)
            ├──decoder.py                  // decoder from third party codes(MIT License)
        ├── src
            ├──__init__.py
            ├──DeepSpeech.py               // DeepSpeech networks
            ├──dataset.py                  // generate dataloader and data processing entry
            ├──config.py                   // DeepSpeech configs
            ├──lr_generator.py             // learning rate generator
            ├──greedydecoder.py            // modified greedydecoder for mindspore code
            └──callback.py                 // callbacks to monitor the training

Script Parameters

Training

usage: train.py  [--use_pretrained USE_PRETRAINED]
                 [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                 [--is_distributed IS_DISTRIBUTED]
                 [--bidirectional BIDIRECTIONAL]
                 [--device_target DEVICE_TARGET]
options:
    --use_pretrained            whether to start training from the pretrained checkpoint
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --device_target             device where the code will run: "GPU" | "CPU", default is "GPU"
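
For reference, train.py can also be invoked directly; an illustrative call that keeps the defaults from src/config.py (the shell scripts under scripts/ are the recommended entry points):

python train.py --device_target CPU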

Evaluation

usage: eval.py  [--bidirectional BIDIRECTIONAL]
                [--pretrain_ckpt PRETRAIN_CKPT]
                [--device_target DEVICE_TARGET]

options:
    --bidirectional              whether to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt              saved checkpoint path, default is ''
    --device_target              device where the code will run: "GPU" | "CPU", default is "GPU"

Options and Parameters

Parameters for training and evaluation can be set in src/config.py.

config for training.
    epochs                       number of training epoch, default is 70
config for dataloader.
    train_manifest               train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest                 dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size                   batch size for training, default is 8
    labels_path                  tokens json path for model output, default is "./labels.json"
    sample_rate                  sample rate for the data/model features, default is 16000
    window_size                  window size for spectrogram generation (seconds), default is 0.02
    window_stride                window stride for spectrogram generation (seconds), default is 0.01
    window                       window type for spectrogram generation, default is 'hamming' (see the spectrogram sketch below this list)
    speed_volume_perturb         use random tempo and gain perturbations, default is False, not used in current model
    spec_augment                 use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir                    directory from which to inject noise into the audio; if left at the default, no noise is injected, default is '', not used in current model
    noise_prob                   probability of noise being added per sample, default is 0.4, not used in current model
    noise_min                    minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max                    maximum noise level to sample from, at most 1.0, default is 0.5, not used in current model
config for model.
    rnn_type                     type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size                  hidden size of RNN Layer, default is 1024
    hidden_layers                number of RNN layers, default is 5
    lookahead_context            look ahead context, default is 20, not used in current model
config for optimizer.
    learning_rate                initial learning rate, default is 3e-4
    learning_anneal              annealing applied to learning rate after each epoch, default is 1.1
    weight_decay                 weight decay, default is 1e-5
    momentum                     momentum, default is 0.9
    eps                          Adam eps, default is 1e-8
    betas                        Adam betas, default is (0.9, 0.999)
    loss_scale                   loss scale, default is 1024
config for checkpoint.
    ckpt_file_name_prefix        prefix of the saved checkpoint file names, default is 'DeepSpeech'
    ckpt_path                    path to save ckpt, default is 'checkpoints'
    keep_checkpoint_max          maximum number of checkpoints to keep; older checkpoints are deleted, default is 10
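
The sample_rate, window_size, window_stride and window options above determine how waveforms are turned into spectrogram features. Below is an illustrative mapping to STFT parameters using librosa; the repository's own feature pipeline lives in src/dataset.py, and 'example.wav' is a placeholder path.

import numpy as np
import librosa

sample_rate = 16000
window_size = 0.02        # seconds per analysis window
window_stride = 0.01      # seconds between window starts

n_fft = int(sample_rate * window_size)          # 320 samples per window
hop_length = int(sample_rate * window_stride)   # 160 samples between frames

audio, _ = librosa.load("example.wav", sr=sample_rate)   # placeholder file
spect = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length,
                            win_length=n_fft, window="hamming"))
# spect has n_fft // 2 + 1 = 161 frequency bins per time frame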

Training and Eval process

Before training, the dataset must be processed. We use the scripts provided by SeanNaren, which automatically download the dataset and process it. After processing, the dataset directory structure is as follows:

    .
    ├─ LibriSpeech_dataset
      ├── train
         ├─ wav
         └─ txt
      ├── val
         ├─ wav
         └─ txt
      ├── test_clean
         ├─ wav
         └─ txt
      └── test_other
         ├─ wav
         └─ txt
    └─ libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv

The *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in src/config.py. For the training config, train_manifest should be set to the path of libri_train_manifest.csv; for the eval config, it should be set to libri_test_clean_manifest.csv or libri_test_other_manifest.csv, depending on which dataset is evaluated.
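
In the SeanNaren manifest format, each row is expected to pair a wav file with its transcript txt file; an illustrative row (paths and file names are placeholders):

/abs/path/LibriSpeech_dataset/train/wav/1001-134707-0000.wav,/abs/path/LibriSpeech_dataset/train/txt/1001-134707-0000.txt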

...
for the training configuration:
"DataConfig": {
    train_manifest: 'path_to_csv/libri_train_manifest.csv'
}

for the evaluation configuration:
"DataConfig": {
    train_manifest: 'path_to_csv/libri_test_clean_manifest.csv'
}

Before training, some extra Python requirements should be installed, including librosa and Levenshtein.
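
One possible install step (the exact package names, e.g. python-Levenshtein, may differ in your environment):

pip install librosa python-Levenshtein

After installing MindSpore via the official website and finishing the dataset processing, you can start training as follows: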


# standalone training gpu
sh ./scripts/run_standalone_train_gpu.sh [DEVICE_ID]

# standalone training cpu
sh ./scripts/run_standalone_train_cpu.sh

# distributed training gpu
sh ./scripts/run_distribute_train_gpu.sh

The following scripts are used to evaluate the model. Note that only the greedy decoder is supported for now; before running the scripts, you should download the decoder code from SeanNaren and place deepspeech_pytorch into the deepspeech2 directory. After that, the file directory will look as shown in Script and Sample Code.
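
One illustrative way to fetch the third-party decoder (the upstream repository layout is assumed here, so adjust paths as needed):

git clone https://github.com/SeanNaren/deepspeech.pytorch.git
cp -r deepspeech.pytorch/deepspeech_pytorch ./deepspeech_pytorch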


# eval on cpu
sh ./scripts/run_eval_cpu.sh [PATH_CHECKPOINT]

# eval on gpu
sh ./scripts/run_eval_gpu.sh [DEVICE_ID] [PATH_CHECKPOINT]

Export MindIR

python export.py --pre_trained_model_path='ckpt_path'
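
For reference, the export boils down to a mindspore.export call on the network restored from the checkpoint. Below is a minimal sketch with a stand-in cell; export.py builds the real network and restores the trained weights with ms.load_checkpoint / ms.load_param_into_net before exporting.

import numpy as np
import mindspore as ms
import mindspore.nn as nn

net = nn.Dense(161, 29)                             # stand-in cell, not the real DeepSpeech2 network
spect = ms.Tensor(np.zeros([1, 161], np.float32))   # placeholder input with a matching shape
ms.export(net, spect, file_name="deepspeech2", file_format="MINDIR")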

Model Description

Performance

Training Performance

Parameters                  DeepSpeech
Resource                    NV SMX2 V100-32G
Uploaded Date               12/29/2020 (month/day/year)
MindSpore Version           1.0.0
Dataset                     LibriSpeech
Training Parameters         2p, epoch=70, steps=5144 * epoch, batch_size=20, lr=3e-4
Optimizer                   Adam
Loss Function               CTCLoss
Outputs                     probability
Loss                        0.2-0.7
Speed                       2p: 2.139 s/step
Total time                  2p: around 1 week
Checkpoint                  991M (.ckpt file)
Scripts                     DeepSpeech script

Inference Performance

Parameters                  DeepSpeech
Resource                    NV SMX2 V100-32G
Uploaded Date               12/29/2020 (month/day/year)
MindSpore Version           1.0.0
Dataset                     LibriSpeech
batch_size                  20
Outputs                     probability
Accuracy (test-clean)       2p: WER 9.902, CER 3.317; 8p: WER 11.593, CER 3.907
Accuracy (test-other)       2p: WER 28.693, CER 12.473; 8p: WER 31.397, CER 13.696
Model for inference         330M (.mindir file)

ModelZoo Homepage

Please check the official homepage.