pip install -r requirements.txt
```
If you want to use apex for AMP training, please clone the apex source code from its GitHub repository and install it from source.
## Pre-Training
To pretrain our model from scratch, please first download our processed pretraining dataset [Spotify-100k](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/spotify.tgz).
> **Warning**: The `spotify.tgz` file is 96 GB and contains 960 hours of audio stored as `*.npy` files. Use a download tool that supports pause and resume, such as aria2, to download it efficiently.
Then download the pre-trained WavLM and RoBERTa models from huggingface.co (optional), and run `scripts/train-960.sh`.
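
If you fetch the two encoders with the `transformers` library, they can be cached ahead of training. This is a minimal sketch; the checkpoint variants (`microsoft/wavlm-base-plus`, `roberta-base`) are assumptions, since the README does not pin specific ones:

```python
# Pre-download encoder checkpoints into the local Hugging Face cache.
# NOTE: the checkpoint names are assumptions; substitute the variants
# your configuration actually uses if they differ.
from transformers import RobertaModel, WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
roberta = RobertaModel.from_pretrained("roberta-base")
print(type(wavlm).__name__, type(roberta).__name__)
```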
#### PT dataset description
The `transForV4.pkl` file includes a dataset of 358,268 audio-text pairs, each less than 10 seconds in duration, complete with precise timestamp alignment details. Here's what an example entry looks like:
```
['/<directory>/spotify-960/0/0/show_002B8PbILr169CdsS9ySTH/0.npy',
 [0, 405, 18, 627, 19456, 1644, 102, 43264, 4, 3056, 6025, 7325, 3479, 254, 5652, 10162, 4, 2678, 4783, 2],
 ["it's", 1, 3, 0, 8000],
 ['the', 3, 4, 6400, 9600],
 ['mother', 4, 5, 8000, 14400],
 ['back', 5, 6, 12800, 17600],
 ['a', 6, 7, 16000, 19200],
 ['podcast.', 7, 9, 17600, 35200],
 ['well', 9, 10, 88000, 99200],
 ['that', 10, 11, 100800, 105600],
 ['was', 11, 12, 104000, 110400],
 ['longer', 12, 14, 108800, 115200],
 ['than', 14, 15, 113600, 118400],
 ['expected.', 15, 17, 116800, 132800],
 ['oh', 17, 18, 145600, 148800],
 ['my', 18, 19, 147200, 152000],
 -1]
```
Each dataset entry consists of four parts:
1. **Audio File Path**: The location of the audio file in .npy format.
2. **Text IDs**: Tokens generated by RoBERTa corresponding to the text.
3. **Previous Turn Index**: The last element, which references the ID of the previous turn; `-1` means there is no prior turn.
4. **Audio-Text Alignment**: Detailed information on how the text aligns with the audio. For instance, the word "it's" aligns with text IDs [1:3] and audio samples [0:8000]; at a sample rate of 16000 Hz, this segment spans 0.5 seconds of audio (see the loading sketch below).
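
As a minimal sketch of reading this layout in Python (it follows the four-part description above and is illustrative, not the repository's official loader; the top-level pickle is assumed to be a list of such entries):

```python
import pickle

SAMPLE_RATE = 16000  # Hz, per the alignment description above

# Load the pre-training index (path is illustrative).
with open("transForV4.pkl", "rb") as f:
    dataset = pickle.load(f)

entry = dataset[0]
audio_path = entry[0]     # path to the .npy audio file
text_ids = entry[1]       # RoBERTa token IDs for the text
alignments = entry[2:-1]  # [word, id_start, id_end, sample_start, sample_end]
prev_turn = entry[-1]     # index of the previous turn, or -1 if none

for word, id_start, id_end, s_start, s_end in alignments:
    # e.g. "it's": (8000 - 0) / 16000 = 0.5 seconds
    seconds = (s_end - s_start) / SAMPLE_RATE
    print(f"{word}: token IDs [{id_start}:{id_end}], {seconds:.2f} s")
```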
## Fine-tuning
We provide the pre-trained checkpoint of our model at [huggingface.co](https://huggingface.co/publicstaticvo/SPECTRA-base). To reproduce our results in the paper, please first download the pre-processed fine-tuning data: [**MOSI**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mosi.tgz),
[**MOSEI**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mosei.tgz),
[**IEMOCAP**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/iemocap.tgz), and
[**MINTREC**](https://space-mm-data.oss-cn-wulanchabu.aliyuncs.com/downstreamv2/mintrec.tgz).
These are all pickle files and can be used directly. Then run:
```
scripts/finetune.sh
```
> Since SpokenWOZ data is large and constantly updated, please obtain it from the [source](https://spokenwoz.github.io/SpokenWOZ-github.io/).
#### FT dataset description
To access the training, validation, and test files in the datasets, you can use the following command to extract the mosi.tgz file:
```
tar -zxvf mosi.tgz
```

Once extracted, you'll find .pkl files for training, validation, and testing. Each dataset entry consists of five parts:

4. **History Audio Features** (if applicable): If present, this field contains historical audio feature data.
5. **History Text Token IDs** (if applicable): Similarly, this field contains historical text token IDs when they are available (see the inspection sketch below).
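
To check the exact field layout of a given pickle before writing a loader, you can open it and inspect the types directly. A minimal sketch, where `train.pkl` is a hypothetical file name standing in for any of the extracted files:

```python
import pickle

# Inspect a fine-tuning pickle without assuming its exact layout.
# NOTE: "train.pkl" is a placeholder name for illustration.
with open("train.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data), len(data))
if isinstance(data, (list, tuple)) and data:
    # Print the position and type of each field of the first entry.
    for i, field in enumerate(data[0], start=1):
        print(i, type(field))
```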
We hope this information helps you use the dataset effectively. Should you have any questions or need further assistance, please feel free to reach out.