update: README.md

This commit is contained in:
言斯 2024-03-13 15:09:23 +08:00
parent 80ee93c17c
commit 34690a20a5
1 changed file with 45 additions and 23 deletions


@@ -4,19 +4,56 @@ pip install -r requirements.txt
```
If you want to use apex for AMP training, please clone the apex source code from its GitHub repository and install it from source.
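For reference, here is a minimal sketch of how AMP training is usually wired up with the generic `apex.amp` API; the model, optimizer, and `opt_level` below are placeholders, and our training scripts may integrate apex differently:
```
# Generic apex AMP sketch (placeholders only; see scripts/ for the actual setup).
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()                # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# opt_level "O1" enables mixed precision with automatic casting.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 10).cuda()).sum()        # placeholder loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()                           # backward on the scaled loss
optimizer.step()
```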
## Pre-Training
To pretrain our model from scratch, please first download our processed pretraining dataset [Spotify-100k](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/spotify.tgz)
> Warning: The spotify.tgz file is 96 GB and contains 960 hours of audio (*.npy). Use a download tool that supports pause and resume, such as aria2, to download it efficiently.
Then download the pre-trained WavLM and RoBERTa models from huggingface.co (optional) and run `scripts/train-960.sh`.
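If you use `transformers`, the backbone checkpoints can be fetched in a couple of lines; the identifiers `microsoft/wavlm-base-plus` and `roberta-base` below are assumptions, so substitute whichever variants `scripts/train-960.sh` expects:
```
# Download and cache the speech/text backbones from huggingface.co.
# The model names are assumptions; adjust them to match the training script.
from transformers import WavLMModel, RobertaModel, RobertaTokenizer

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
roberta = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

print(wavlm.config.hidden_size, roberta.config.hidden_size)
```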
#### PT dataset description
The `transForV4.pkl` file includes a dataset of 358,268 audio-text pairs, each less than 10 seconds in duration, complete with precise timestamp alignment details. Here's what an example entry looks like:
```
['/<directory>/spotify-960/0/0/show_002B8PbILr169CdsS9ySTH/0.npy',
[0, 405, 18, 627, 19456, 1644, 102, 43264, 4, 3056, 6025, 7325, 3479, 254, 5652, 10162, 4, 2678, 4783, 2],
["it's", 1, 3, 0, 8000],
['the', 3, 4, 6400, 9600],
['mother', 4, 5, 8000, 14400],
['back', 5, 6, 12800, 17600],
['a', 6, 7, 16000, 19200],
['podcast.', 7, 9, 17600, 35200],
['well', 9, 10, 88000, 99200],
['that', 10, 11, 100800, 105600],
['was', 11, 12, 104000, 110400],
['longer', 12, 14, 108800, 115200],
['than', 14, 15, 113600, 118400],
['expected.', 15, 17, 116800, 132800],
['oh', 17, 18, 145600, 148800],
['my', 18, 19, 147200, 152000],
-1]
```
Each dataset entry consists of four parts:
1. **Audio File Path**: The location of the audio file in .npy format.
2. **Text IDs**: Tokens generated by RoBERTa corresponding to the text.
3. **Previous Turn Index**: The last element of the entry, which references the ID of the previous turn; `-1` indicates that there is no prior turn.
4. **Audio-Text Alignment**: Detailed information on how text aligns with audio segments. For instance, the word "it's" aligns with text IDs [1:3] and audio segment [0:8000], where the sample rate of 16000 Hz means this segment represents approximately 0.5 seconds of audio.
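To make the layout concrete, here is a minimal sketch of unpacking such an entry; it follows the field order described above and is not an official loader from this repository:
```
# Sketch: read transForV4.pkl and unpack one pre-training entry.
# Field layout follows the description above; not an official loader.
import pickle
import numpy as np

SAMPLE_RATE = 16000  # Hz, as stated above

with open("transForV4.pkl", "rb") as f:
    data = pickle.load(f)

entry = data[0]
audio_path = entry[0]     # path to the .npy audio file
token_ids = entry[1]      # RoBERTa token ids for the text
alignments = entry[2:-1]  # [word, tok_start, tok_end, sample_start, sample_end]
prev_turn = entry[-1]     # index of the previous turn, -1 if none

audio = np.load(audio_path)
print(audio.shape)

for word, tok_start, tok_end, smp_start, smp_end in alignments:
    seconds = (smp_end - smp_start) / SAMPLE_RATE
    print(f"{word!r}: tokens [{tok_start}:{tok_end}], {seconds:.2f}s of audio")
```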
## Fine-tuning
We provide the pre-trained checkpoint of our model at [huggingface.co](https://huggingface.co/publicstaticvo/SPECTRA-base). To reproduce our results in the paper, please first download the pre-processed fine-tuning datasets:
[**MOSI**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mosi.tgz),
[**MOSEI**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mosei.tgz),
[**IEMOCAP**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/iemocap.tgz), and
[**MINTREC**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mintrec.tgz).
These all consist of pickle files and can be used directly. Then run
```
scripts/finetune.sh
```
> Since the SpokenWOZ data is large and constantly updated, please obtain it from the [source](https://spokenwoz.github.io/SpokenWOZ-github.io/).
#### FT dataset description
To access the training, validation, and test files in the datasets, you can use the following command to extract the mosi.tgz file:
```
@@ -30,19 +67,4 @@ Once extracted, you'll find .pkl files for training, validation, and testing. Ea
4. History Audio Features (if applicable): If present, this field contains historical audio feature data.
5. History Text Token IDs (if applicable): Similar to the above, this includes historical text token IDs, if available.
We hope this information helps you in utilizing the dataset effectively. Should you have any questions or need further assistance, please feel free to reach out.
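As a quick sanity check after extraction, the pickles can be inspected directly; this is a minimal sketch that assumes the extracted archive contains `train.pkl`, `valid.pkl`, and `test.pkl` (the file names and layout are assumptions, so adjust them to your extraction):
```
# Sketch: inspect one fine-tuning pickle after extracting mosi.tgz.
# File names and directory layout are assumptions; adjust to your extraction.
import pickle

with open("mosi/train.pkl", "rb") as f:
    train = pickle.load(f)

print(f"{len(train)} examples, {len(train[0])} fields per example")
# The last two fields, when present, are the history audio features and the
# history text token ids described above.
for i, field in enumerate(train[0], start=1):
    print(f"field {i}: {type(field).__name__}")
```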
## Pre-Train
To pretrain our model from scratch, please first download our processed pretraining dataset (will be available soon), then download the pre-trained WavLM and RoBERTa models from huggingface.co (optional), and run `scripts/train-960.sh`.
<!--
```commandline
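# Fine-tune a pretrained SPECTRA checkpoint for dialogue state tracking on SpokenWOZ.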
python run_dst.py --model spectra --model_type roberta \
--data_dir ./data \
--model_dir /PATH/OF/YOUR/PRETRAINED/SPECTRA/MODEL \
--output_dir ./result \
--dataset_config ./data/spokenwoz_config.json \
--per_gpu_train_batch_size 2 \
--accum 4
```
-->