Contents
- DeepSpeech2 Description
- Model Architecture
- Dataset
- Environment Requirements
- Script Description
- Model Description
- ModelZoo Homepage
DeepSpeech2 Description
DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a diverse variety of speech, including noisy environments, accents, and different languages. We support training and evaluation on CPU and GPU.
Paper: Amodei, Dario, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin.
Model Architecture
The current reproduced model consists of:
- two convolutional layers:
- number of channels is 32, kernel size is [41, 11], stride is [2, 2]
- number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (size is 1024)
- one projection layer (output size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)
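To make the listed layers concrete, below is a minimal structural sketch of this stack in MindSpore. It is an illustration only, written against a recent MindSpore 2.x API: the class name, the assumed 161 input frequency bins, and the shape handling are assumptions, and the repository's actual implementation lives in src/DeepSpeech.py.

```python
import mindspore as ms
import mindspore.nn as nn
import mindspore.ops as ops

class DeepSpeech2Sketch(nn.Cell):
    """Structural sketch only; the real network is defined in src/DeepSpeech.py."""
    def __init__(self, num_classes=29, hidden_size=1024, num_layers=5, freq_bins=161):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # two convolutional layers: 32 channels, kernel [41, 11], strides [2, 2] and [2, 1]
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), pad_mode='same')
        self.conv2 = nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1), pad_mode='same')
        f = (freq_bins + 1) // 2          # frequency bins after the first stride-2 conv
        f = (f + 1) // 2                  # and after the second stride-2 conv
        # five bidirectional LSTM layers with hidden size 1024
        self.rnn = nn.LSTM(32 * f, hidden_size, num_layers=num_layers, bidirectional=True)
        # projection layer: number of characters plus the CTC blank symbol (29 outputs)
        self.fc = nn.Dense(hidden_size * 2, num_classes)

    def construct(self, spect):
        # spect: (batch, 1, freq, time) spectrogram
        x = self.conv2(self.conv1(spect))
        b, c, f, t = x.shape
        x = x.reshape(b, c * f, t).transpose(2, 0, 1)             # (time, batch, features)
        h0 = ops.zeros((2 * self.num_layers, b, self.hidden_size), ms.float32)
        c0 = ops.zeros((2 * self.num_layers, b, self.hidden_size), ms.float32)
        x, _ = self.rnn(x, (h0, c0))                              # (time, batch, 2 * hidden)
        t_out, b_out, h = x.shape
        out = self.fc(x.reshape(t_out * b_out, h)).reshape(t_out, b_out, -1)
        return out                                                # (time, batch, num_classes)
```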
Dataset
Note that you can run the scripts with the dataset mentioned in the original paper or with datasets widely used in this domain. In the following sections, we will introduce how to run the scripts using the dataset below.
Dataset used: LibriSpeech
- Train Data:
- train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech)
- train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech)
- train-other-500.tar.gz [30G] (training set of 500 hours "other" speech)
- Val Data:
- dev-clean.tar.gz [337M] (development set, "clean" speech)
- dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
- test-clean.tar.gz [346M] (test set, "clean" speech)
- test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
- Note: Data will be processed in librispeech.py
Environment Requirements
- Hardware (GPU)
  - Prepare the hardware environment with a GPU processor.
- Framework
  - MindSpore
- For more information, please check the resources on the MindSpore official website.
Script Description
Script and Sample Code
.
└── audio
    └── deepspeech2
        ├── scripts
        │   ├── run_distribute_train_gpu.sh    // launch distributed training on GPU (8p)
        │   ├── run_eval_cpu.sh                // launch evaluation on CPU
        │   ├── run_eval_gpu.sh                // launch evaluation on GPU
        │   ├── run_standalone_train_cpu.sh    // launch standalone training on CPU
        │   └── run_standalone_train_gpu.sh    // launch standalone training on GPU (1p)
        ├── train.py                           // training script
        ├── eval.py                            // evaluation script
        ├── export.py                          // convert MindSpore model to MindIR model
        ├── labels.json                        // possible characters to map to
        ├── README.md                          // descriptions about DeepSpeech2
        ├── deepspeech_pytorch                 // third-party decoder code, downloaded separately (see below)
        │   └── decoder.py                     // decoder from third-party code (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py                  // DeepSpeech2 network
            ├── dataset.py                     // generate dataloader and data processing entry
            ├── config.py                      // DeepSpeech2 configs
            ├── lr_generator.py                // learning rate generator
            ├── greedydecoder.py               // modified greedy decoder for MindSpore
            └── callback.py                    // callbacks to monitor the training
Script Parameters
Training
usage: train.py [--use_pretrained USE_PRETRAINED]
[--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
[--is_distributed IS_DISTRIBUTED]
[--bidirectional BIDIRECTIONAL]
[--device_target DEVICE_TARGET]
options:
    --use_pretrained          use pretrained checkpoint or not
    --pre_trained_model_path  pretrained checkpoint path, default is ''
    --is_distributed          distributed training, default is False
    --bidirectional           whether to use a bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --device_target           device where the code will be run: "GPU" | "CPU", default is "GPU"
Evaluation
usage: eval.py [--bidirectional BIDIRECTIONAL]
[--pretrain_ckpt PRETRAIN_CKPT]
[--device_target DEVICE_TARGET]
options:
    --bidirectional    whether to use a bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt    saved checkpoint path, default is ''
    --device_target    device where the code will be run: "GPU" | "CPU", default is "GPU"
Options and Parameters
Parameters for training and evaluation can be set in src/config.py.
config for training.
epochs number of training epochs, default is 70
config for dataloader.
train_manifest train manifest path, default is 'data/libri_train_manifest.csv'
val_manifest dev manifest path, default is 'data/libri_val_manifest.csv'
batch_size batch size for training, default is 8
labels_path tokens json path for model output, default is "./labels.json"
sample_rate sample rate for the data/model features, default is 16000
window_size window size for spectrogram generation (seconds), default is 0.02
window_stride window stride for spectrogram generation (seconds), default is 0.01
window window type for spectrogram generation, default is 'hamming'
speed_volume_perturb use random tempo and gain perturbations, default is False, not used in current model
spec_augment use simple spectral augmentation on mel spectrograms, default is False, not used in current model
noise_dir directory from which to inject noise into the audio. If left at the default '', noise injection is not added, not used in current model
noise_prob probability of noise being added per sample, default is 0.4, not used in current model
noise_min minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
noise_max maximum noise level to sample from, maximum is 1.0, default is 0.5, not used in current model
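As an illustration of how the sample_rate, window_size, window_stride, and window settings above typically translate into features, the following sketch computes a log magnitude spectrogram with librosa (one of the listed requirements). The file name and the log1p scaling are assumptions for the example; the actual feature pipeline is implemented in src/dataset.py.

```python
import numpy as np
import librosa

sample_rate = 16000     # sample_rate
window_size = 0.02      # window_size in seconds  -> 320-sample FFT window
window_stride = 0.01    # window_stride in seconds -> 160-sample hop

n_fft = int(sample_rate * window_size)
hop_length = int(sample_rate * window_stride)

# 'example.wav' is a placeholder path used only for illustration
audio, _ = librosa.load('example.wav', sr=sample_rate)
stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length,
                    win_length=n_fft, window='hamming')
spect = np.log1p(np.abs(stft))  # log magnitude spectrogram, shape (n_fft // 2 + 1, frames)
```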
config for model.
rnn_type type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
hidden_size hidden size of RNN Layer, default is 1024
hidden_layers number of RNN layers, default is 5
lookahead_context look ahead context, default is 20, not used in current model
config for optimizer.
learning_rate initial learning rate, default is 3e-4
learning_anneal annealing applied to learning rate after each epoch, default is 1.1
weight_decay weight decay, default is 1e-5
momentum momentum, default is 0.9
eps Adam eps, default is 1e-8
betas Adam betas, default is (0.9, 0.999)
loss_scale loss scale, default is 1024
config for checkpoint.
ckpt_file_name_prefix prefix of the saved checkpoint file names, default is 'DeepSpeech'
ckpt_path path to save ckpt, default is 'checkpoints'
keep_checkpoint_max maximum number of checkpoints to keep (older checkpoints are deleted), default is 10
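As a side note on the optimizer settings, one common reading of the learning_rate / learning_anneal pair is that the learning rate is divided by the anneal factor after every epoch. The sketch below illustrates such a schedule; this interpretation is an assumption made for illustration, and the project's actual schedule is produced by src/lr_generator.py.

```python
def annealed_lr(initial_lr=3e-4, learning_anneal=1.1, epochs=70):
    """Per-epoch learning rates, assuming division by the anneal factor each epoch."""
    return [initial_lr / (learning_anneal ** epoch) for epoch in range(epochs)]

lrs = annealed_lr()
print(lrs[0], lrs[1], lrs[-1])  # 3e-4, ~2.7e-4, ~4.2e-7
```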
Training and Eval process
Before training, the dataset should be processed. We use the data processing scripts provided by SeanNaren, which automatically download the dataset and process it. After processing, the dataset directory structure is as follows:
.
├─ LibriSpeech_dataset
│ ├── train
│ │ ├─ wav
│ │ └─ txt
│ ├── val
│ │ ├─ wav
│ │ └─ txt
│ ├── test_clean
│ │ ├─ wav
│ │ └─ txt
│ └── test_other
│ ├─ wav
│ └─ txt
└─ libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
These *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in src/config.py. For the training config, train_manifest should be set to the path of libri_train_manifest.csv; for the eval config, it should be set to libri_test_clean_manifest.csv or libri_test_other_manifest.csv, depending on which dataset is evaluated.
...
for training configuration
"DataConfig":{
train_manifest:'path_to_csv/libri_train_manifest.csv'
}
for evaluation configuration
"DataConfig":{
train_manifest:'path_to_csv/libri_test_clean_manifest.csv'
}
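As an optional sanity check before training, the small sketch below verifies that the files referenced by the manifests exist. It assumes each manifest row is a comma-separated pair of a wav path and a txt path (the layout produced by the SeanNaren processing scripts); adjust it if your manifests differ.

```python
import csv
import os

def check_manifest(manifest_path):
    """Count manifest entries whose referenced files are missing on disk."""
    missing = 0
    with open(manifest_path, newline='') as f:
        for row in csv.reader(f):
            missing += sum(not os.path.isfile(path) for path in row)
    print(manifest_path, '- missing files:', missing)

for name in ['libri_train_manifest.csv', 'libri_val_manifest.csv',
             'libri_test_clean_manifest.csv', 'libri_test_other_manifest.csv']:
    check_manifest(name)
```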
Before training, some requirements should be installed, including librosa and Levenshtein.
After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:
# standalone training gpu
sh ./scripts/run_standalone_train_gpu.sh [DEVICE_ID]
# standalone training cpu
sh ./scripts/run_standalone_train_cpu.sh
# distributed training gpu
sh ./scripts/run_distribute_train_gpu.sh
The following scripts are used to evaluate the model. Note that we only support the greedy decoder for now. Before running the script, you should download the decoder code from SeanNaren and place deepspeech_pytorch into the deepspeech2 directory. After that, the file directory will be as shown in [Script and Sample Code].
# eval on cpu
sh ./scripts/run_eval_cpu.sh [PATH_CHECKPOINT]
# eval on gpu
sh ./scripts/run_eval_gpu.sh [DEVICE_ID] [PATH_CHECKPOINT]
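For reference, the idea behind greedy CTC decoding is to take the most likely class at every time step, collapse consecutive repeats, and drop the blank symbol. The sketch below is a generic illustration of that idea, not the project's src/greedydecoder.py or the SeanNaren decoder; the blank index and the labels list are assumptions.

```python
import numpy as np

def greedy_ctc_decode(probs, labels, blank_index=0):
    """probs: (time, num_classes) per-step class probabilities; labels: list of characters."""
    best_path = np.argmax(probs, axis=1)   # most likely class at every time step
    decoded, prev = [], None
    for idx in best_path:
        # collapse consecutive repeats, then drop the CTC blank symbol
        if idx != prev and idx != blank_index:
            decoded.append(labels[idx])
        prev = idx
    return ''.join(decoded)

# toy example with a 3-symbol label set: [blank, 'a', 'b']
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(probs, ['_', 'a', 'b']))  # -> "ab"
```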
Export MindIR
python export.py --pre_trained_model_path='ckpt_path'
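export.py builds on MindSpore's export function to produce a MindIR file. The self-contained snippet below only illustrates that call on a toy network; the real script instead constructs the DeepSpeech2 network, loads the checkpoint passed via --pre_trained_model_path, and exports with an appropriately shaped dummy input.

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, export

class ToyNet(nn.Cell):
    """Stand-in network used only to illustrate the MindIR export call."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Dense(161, 29)

    def construct(self, x):
        return self.fc(x)

net = ToyNet()
dummy_input = Tensor(np.zeros((1, 161), np.float32))
# writes toy_net.mindir to the working directory
export(net, dummy_input, file_name='toy_net', file_format='MINDIR')
```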
Model Description
Performance
Training Performance
| Parameters            | DeepSpeech2                                              |
| --------------------- | -------------------------------------------------------- |
| Resource              | NV SMX2 V100-32G                                         |
| Uploaded Date         | 12/29/2020 (month/day/year)                              |
| MindSpore Version     | 1.0.0                                                    |
| Dataset               | LibriSpeech                                              |
| Training Parameters   | 2p, epoch=70, steps=5144 * epoch, batch_size = 20, lr=3e-4 |
| Optimizer             | Adam                                                     |
| Loss Function         | CTCLoss                                                  |
| Outputs               | probability                                              |
| Loss                  | 0.2-0.7                                                  |
| Speed                 | 2p: 2.139 s/step                                         |
| Total time (training) | 2p: around 1 week                                        |
| Checkpoint            | 991M (.ckpt file)                                        |
| Scripts               | DeepSpeech2 script                                       |
Inference Performance
| Parameters            | DeepSpeech2                                              |
| --------------------- | -------------------------------------------------------- |
| Resource              | NV SMX2 V100-32G                                         |
| Uploaded Date         | 12/29/2020 (month/day/year)                              |
| MindSpore Version     | 1.0.0                                                    |
| Dataset               | LibriSpeech                                              |
| batch_size            | 20                                                       |
| Outputs               | probability                                              |
| Accuracy (test-clean) | 2p: WER 9.902, CER 3.317; 8p: WER 11.593, CER 3.907      |
| Accuracy (test-other) | 2p: WER 28.693, CER 12.473; 8p: WER 31.397, CER 13.696   |
| Model for inference   | 330M (.mindir file)                                      |
ModelZoo Homepage
Please check the official homepage.