Installation

pip install -r requirements.txt

If you want to use apex for AMP (mixed-precision) training, please clone the apex source code from its GitHub repository and install it from source.
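
A minimal sketch of the usual from-source install; check the apex README for the CUDA-extension build flags that match your environment:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir ./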

Pre-Training

To pretrain our model from scratch, please first download our processed pretraining dataset, Spotify-100k.

Warning: the spotify.tgz file is 96 GB and contains 960 hours of audio (*.npy). Use a download tool such as aria2 that can pause and resume the download.
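
For instance, a resumable multi-connection download (the URL below is a placeholder for the actual download link):

aria2c -c -x 16 <spotify.tgz-download-URL>

Here -c resumes a partially completed download and -x 16 opens up to 16 connections to the server.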

Then download the pre-trained WavLM and RoBERTa models from huggingface.co (optional), and run scripts/train-960.sh.

PT dataset description

The transForV4.pkl file (inside spotify.tgz) contains 358,268 audio-text pairs, each less than 10 seconds in duration, together with precise timestamp alignment details. Here is what an example entry looks like:

['/<directory>/spotify-960/0/0/show_002B8PbILr169CdsS9ySTH/0.npy',
 [0, 405, 18, 627, 19456, 1644, 102, 43264, 4, 3056, 6025, 7325, 3479, 254, 5652, 10162, 4, 2678, 4783, 2],
 ["it's", 1, 3, 0, 8000],
 ['the', 3, 4, 6400, 9600],
 ['mother', 4, 5, 8000, 14400],
 ['back', 5, 6, 12800, 17600],
 ['a', 6, 7, 16000, 19200],
 ['podcast.', 7, 9, 17600, 35200],
 ['well', 9, 10, 88000, 99200],
 ['that', 10, 11, 100800, 105600],
 ['was', 11, 12, 104000, 110400],
 ['longer', 12, 14, 108800, 115200],
 ['than', 14, 15, 113600, 118400],
 ['expected.', 15, 17, 116800, 132800],
 ['oh', 17, 18, 145600, 148800],
 ['my', 18, 19, 147200, 152000],
 -1]

Each dataset entry consists of four parts, in the order they appear above (a loading sketch follows this list):

  1. Audio File Path: the location of the audio waveform in .npy format.
  2. Text Token IDs: the RoBERTa token IDs corresponding to the transcript.
  3. Audio-Text Alignment: one tuple per word, [word, token start, token end, sample start, sample end], describing how the text aligns with the audio. For instance, the word "it's" spans token IDs [1:3] and audio samples [0:8000]; at the 16000 Hz sample rate, this segment is approximately 0.5 seconds of audio.
  4. Previous Turn Index: the final element, which references the previous turn's entry; -1 indicates no prior turn.
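
A minimal sketch of reading one entry, assuming the pickle holds a top-level Python list with the layout shown above (the file path is illustrative):

import pickle
import numpy as np

with open("transForV4.pkl", "rb") as f:
    data = pickle.load(f)

entry = data[0]
audio_path = entry[0]        # path to the .npy waveform
token_ids = entry[1]         # RoBERTa token IDs
alignments = entry[2:-1]     # [word, tok_start, tok_end, sample_start, sample_end]
prev_turn = entry[-1]        # previous-turn index, or -1 if none

audio = np.load(audio_path)  # waveform sampled at 16 kHz
for word, tok_s, tok_e, smp_s, smp_e in alignments:
    seconds = (smp_e - smp_s) / 16000  # sample span to duration
    print(f"{word}: tokens [{tok_s}:{tok_e}], {seconds:.2f}s")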

Fine-tuning

We provide the pre-trained checkpoint of our model at huggingface.co. To reproduce the results in the paper, please first download the pre-processed fine-tuning data: MOSI, MOSEI, IEMOCAP, and MINTREC. These are all packaged as pickle files and can be used directly. Then run

scripts/finetune.sh

Since the SpokenWOZ data is large and constantly updated, please obtain it from the original source.

FT dataset description

To access the training, validation, and test files in the datasets, you can use the following command to extract the mosi.tgz file:

tar -xzvf mosi.tgz

Once extracted, you'll find .pkl files for training, validation, and testing. Each pickle file contains a list of samples, and each sample includes the following components (a loading sketch follows this list):

  1. Audio Features: the audio feature data for the sample.
  2. Text Token IDs: the IDs corresponding to the sample's text tokens.
  3. Label: the label assigned to the sample.
  4. History Audio Features (if applicable): audio features from the dialogue history, when present.
  5. History Text Token IDs (if applicable): text token IDs from the dialogue history, when present.
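
A minimal sketch of unpacking one sample, assuming each split is a top-level Python list ordered as above; the filename and the five-element layout are illustrative, so consult dataset.py for the exact format:

import pickle

with open("train.pkl", "rb") as f:  # filename is illustrative
    samples = pickle.load(f)

for sample in samples:
    audio_feats = sample[0]   # audio features
    text_ids = sample[1]      # text token IDs
    label = sample[2]         # task label
    if len(sample) > 3:       # dialogue-style sets may also carry history
        hist_audio, hist_text = sample[3], sample[4]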

We hope this information helps you use the dataset effectively. Should you have any questions or need further assistance, please feel free to reach out.