
🌟 STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing

🤗 transformers support

This is the official project containing the source code for the EMNLP 2022 paper "STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing".

You can use our checkpoints for evaluation directly, or train from scratch by following our instructions.

  1. The data_systhesis directory contains the code for generating conversational text-to-SQL data.
  2. The pretrain directory contains the code for pre-training the STAR model.
  3. The LGESQL directory contains the fine-tuning and evaluation code.

The models and data used in the paper can be downloaded from Baidu Netdisk, or from Google Drive via the links in the corresponding folders.

Citation

@article{cai2022star,
  title={STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing},
  author={Cai, Zefeng and Li, Xiangyu and Hui, Binyuan and Yang, Min and Li, Bowen and Li, Binhua and Cao, Zheng and Li, Weijie and Huang, Fei and Si, Luo and others},
  journal={arXiv preprint arXiv:2210.11888},
  year={2022}
}

🪜 Pretrain

Create conda environment

Create the conda environment star with the following commands:

  • In our experiments, we use torch==1.7.0 with CUDA version 11.0.

  • We use four NVIDIA A100 GPUs for our pre-training experiments.

    conda create -n star python=3.6
    conda activate star
    pip install torch==1.7.0+cu110 -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt
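
After installation, a quick sanity check (a minimal sketch, not part of the repo) confirms that PyTorch sees the GPUs:

    # hypothetical helper, not part of this repo
    import torch

    print("torch version:", torch.__version__)      # expect 1.7.0+cu110
    print("CUDA available:", torch.cuda.is_available())
    print("GPU count:", torch.cuda.device_count())  # pre-training assumes four GPUs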

Unzip pretraining dataset

Download the pre-training data file pretrain_data.txt and move it into the datasets directory.
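
To confirm the file is in place, you can preview it with a short script (a minimal sketch; the record format itself is produced by the data_systhesis step):

    # hypothetical helper, not part of this repo
    total = 0
    with open("datasets/pretrain_data.txt", encoding="utf-8") as f:
        for total, line in enumerate(f, 1):
            if total <= 3:
                print(line.rstrip()[:120])  # first three records, truncated
    print("total records:", total)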

Training

python pretain_inbatch.py

It may take about two days on four Tesla V100-PCIE-32GB GPUs.

Saving STAR model

python save_model.py

Then you can find the trained model and its configuration (at least model.bin and config.json) under the pretrained/sss directory.
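
Given the 🤗 transformers support noted above, the saved checkpoint should be loadable with transformers (a minimal sketch, assuming the weight file follows the usual pytorch_model.bin naming shown in the fine-tuning layout below and that config.json records the model type):

    from transformers import AutoConfig, AutoModel, AutoTokenizer

    # "pretrained/sss" is the output directory of save_model.py
    config = AutoConfig.from_pretrained("pretrained/sss")
    tokenizer = AutoTokenizer.from_pretrained("pretrained/sss")  # reads vocab.txt
    model = AutoModel.from_pretrained("pretrained/sss", config=config)
    print("loaded:", config.model_type)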

🚪 Fine-tuning and Evaluation

This section presents the results on the CoSQL and SParC datasets, with STAR fine-tuned using LGESQL.

Create conda environment

Create the conda environment lgesql with the following commands:

  • In our experiments, we use torch==1.8.0 with CUDA version 11.1:
    conda create -n lgesql python=3.6
    source activate lgesql
    pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt
    
  • Next, download dependencies:
    python -c "import nltk; nltk.download('punkt')"
    python -c "import stanza; stanza.download('en')"
    python -c "import nltk; nltk.download('stopwords')"
    
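A quick check (a minimal sketch, not part of the repo) that the NLTK data and the stanza English pipeline downloaded correctly:

    import nltk
    import stanza
    from nltk.corpus import stopwords

    print(nltk.word_tokenize("Show me all flights from Dallas."))  # needs punkt
    print(len(stopwords.words("english")), "stopwords loaded")
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")
    print(nlp("List the names of all singers.").sentences[0].words[0].lemma)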

Using our checkpoints for evaluation:

  • Download our processed datasets CoSQL or SParC and unzip them into cosql/data and sparc/data, respectively. Make sure the datasets are correctly located as follows (see the layout check after this list):
    data
    ├── database
    ├── dev_electra.json
    ├── dev_electra.bin
    ├── dev_electra.lgesql.bin
    ├── dev_gold.txt
    ├── label.json
    ├── tables_electra.bin
    ├── tables.json
    ├── train_electra.bin
    ├── train_electra.json
    └── train_electra.lgesql.bin
    
  • Download our processed checkpoints CoSQL or SParC and unzip them into cosql/checkpoints and sparc/checkpoints, respectively. Make sure the checkpoints are correctly located as:
    checkpoints
    ├── model_IM.bin
    └── params.json
    
  • Execute the following command; the results are recorded in result_XXX.txt (it will take 10 to 30 minutes on one Tesla V100-PCIE-32GB GPU):
    sh run/run_evaluation.sh
    
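Before running the evaluation, a small script (a hypothetical helper, not part of the repo; run it from inside cosql/ or sparc/) can verify that both layouts above are complete:

    import os

    data_files = [
        "database", "dev_electra.json", "dev_electra.bin", "dev_electra.lgesql.bin",
        "dev_gold.txt", "label.json", "tables_electra.bin", "tables.json",
        "train_electra.bin", "train_electra.json", "train_electra.lgesql.bin",
    ]
    ckpt_files = ["model_IM.bin", "params.json"]

    missing = [os.path.join(d, p)
               for d, names in [("data", data_files), ("checkpoints", ckpt_files)]
               for p in names if not os.path.exists(os.path.join(d, p))]
    print("missing files:", missing or "none")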

Train from scratch

  • You can train STAR yourself by following the process in the pretrain directory, or download our pre-trained STAR and unzip it into the pretrained_models/sss directory. Make sure the STAR model files are correctly located as follows (see the sanity check after this list):
    pretrained_models
    └── sss
        ├── config.json
        ├── pytorch_model.bin
        └── vocab.txt
    
  • You can preprocess the data with the process_data&&label.py script, referring to the methods in LGESQL, or directly download our processed data as described above.
  • Training (it will take about 4 days on one Tesla V100-PCIE-32GB GPU):
    sh run/run_lgesql_plm.sh
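
Before launching the multi-day run above, a quick sanity check (a minimal sketch, not part of the repo) can confirm the STAR files are present and loadable:

    import os
    from transformers import AutoConfig

    star_dir = "pretrained_models/sss"
    for name in ("config.json", "pytorch_model.bin", "vocab.txt"):
        assert os.path.exists(os.path.join(star_dir, name)), name + " is missing"
    print("model_type:", AutoConfig.from_pretrained(star_dir).model_type)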