History

出蛰 e42cda6d18 add: proton		2022-08-06 19:43:17 +08:00
..
asdl	add: proton	2022-08-06 19:43:17 +08:00
model	add: proton	2022-08-06 19:43:17 +08:00
preprocess	add: proton	2022-08-06 19:43:17 +08:00
results	add: proton	2022-08-06 19:43:17 +08:00
run	add: proton	2022-08-06 19:43:17 +08:00
scripts	add: proton	2022-08-06 19:43:17 +08:00
static	add: proton	2022-08-06 19:43:17 +08:00
utils	add: proton	2022-08-06 19:43:17 +08:00
README.md	add: proton	2022-08-06 19:43:17 +08:00
eval.py	add: proton	2022-08-06 19:43:17 +08:00
evaluation.py	add: proton	2022-08-06 19:43:17 +08:00
process_sql.py	add: proton	2022-08-06 19:43:17 +08:00
requirements.txt	add: proton	2022-08-06 19:43:17 +08:00
setup.sh	add: proton	2022-08-06 19:43:17 +08:00

README.md

Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing

This repository contains code for the KDD 2022 paper Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing.

If you use Proton in your work, please cite it as follows:

@inproceedings{wang2022proton,
    title={Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing},
    author={Wang, Lihan and Qin, Bowen and Hui, Binyuan and Li, Bowen and Yang, Min and Wang, Bailin and Li, Binhua and Huang, Fei and Si, Luo and Li, Yongbin},
    booktitle={KDD},
    year={2022}
}

Overview

In this work, we propose a novel framework, called Proton, which first probes the underlying relational schema-linking structures between a NL query and its database schema from a pre-trained language model, and then effectively injects it into the downstream text-to-SQL parsing models.

Codebase

Prepare Environment

The setup of the environment is exactly the same as that of LGESQL:

The environment-related commands are provided in setup.sh.

sh setup.sh

Download dataset.

Download, unzip and rename the spider.zip into the directory data. Merge the data/train_spider.json and data/train_others.json into one single dataset data/train.json.

Preprocess dataset.

Preprocess the train and dev dataset, including input normalization, schema linking, graph construction and output actions generation.

./run/run_preprocessing.sh

Training

Training Proton with:

./run/run_lgesql_plm.sh msde electra-large-discriminator

Evaluation

For evaluation, see run/run_evaluation.sh and run/run_submission.sh (eval from scratch) for reference.

Result

Model	DK	SYN	Spider
LGESQL	48.4	64.6	75.3
LGESQL + Proton	51.0	65.6	76.3

Acknowledgements

This implementation is based on RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. and LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations. Thanks to the author for releasing the code.