DAMO-ConvAI/oltqa
Latest commit 33c94b51b5 by Yi Dai: replace invalid links (2023-11-13 13:18:12 +08:00)

OLTQA

The PyTorch implementation of the paper Long-Tailed Question Answering in an Open World.

Requirements and data preparation

cd LongTailQA
pip install -r r.txt

The raw dataset is available here and should be placed under data_process/data.

Construct a Pareto long-tail subset of the raw data:

python gen_lt.py
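As a rough illustration of what a Pareto long-tail split does (the actual sampling logic lives in gen_lt.py and may differ; pareto_sizes, subsample, head_size, and alpha are illustrative names, not from the repo):

```python
import random

def pareto_sizes(n_tasks, head_size, alpha=1.0):
    # Hypothetical power-law schedule: task i keeps ~ head_size / (i+1)^alpha
    # examples, so a few head tasks stay large and the tail decays polynomially.
    return [max(1, int(head_size / (i + 1) ** alpha)) for i in range(n_tasks)]

def subsample(task_examples, head_size, alpha=1.0, seed=0):
    # task_examples: dict mapping task name -> list of examples.
    # Tasks are ranked by original size, then subsampled to the Pareto schedule.
    rng = random.Random(seed)
    tasks = sorted(task_examples, key=lambda t: len(task_examples[t]), reverse=True)
    sizes = pareto_sizes(len(tasks), head_size, alpha)
    return {t: rng.sample(task_examples[t], min(n, len(task_examples[t])))
            for t, n in zip(tasks, sizes)}
```

With alpha=1.0 the head task keeps head_size examples and task i keeps roughly head_size/(i+1), giving the heavy-tailed task distribution the paper studies.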

Preprocessing: large PLM inference

BM25 candidates

Use This Repo to select BM25 examples for PLM inference (to construct a candidate pool for further selection):

python find_bm25.py output_path=$PWD/data/{compute_bm25_outfile} \
    dataset_split=train setup_type={bm25_setup_type} task_name={dataset} +ds_size={ds_size} L={finder_L}
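find_bm25.py comes from the external repo linked above; as a self-contained sketch of what BM25 example selection computes, here is a minimal Okapi BM25 scorer (standard k1/b defaults and whitespace tokenization; the real script's tokenizer and parameters may differ):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Score each doc against the query with Okapi BM25.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

Ranking the training pool by these scores for each query yields the candidate examples that are then fed to the PLM.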

PLM inference

Install GLM-10B or GLM-130B under ./plm and run inference to generate hint candidates for each input example:

cd plm
bash ./install_glm.sh
bash ./scripts/generate_block.sh \
     config_tasks/model_blocklm_10B_chinese.sh
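Conceptually, each input question is paired with retrieved examples to form a prompt that the PLM completes into a hint. A sketch of such a prompt builder (the layout below is hypothetical, not the repo's exact template; build_hint_prompt is an illustrative name):

```python
def build_hint_prompt(question, examples):
    # examples: list of (question, hint) pairs used as in-context demonstrations.
    # The PLM is asked to complete the final "Hint:" line for the target question.
    lines = []
    for q, hint in examples:
        lines.append(f"Question: {q}\nHint: {hint}\n")
    lines.append(f"Question: {question}\nHint:")
    return "\n".join(lines)
```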

Two-stage Training

Generate datasets for example selection:

python gen_sel.py
python gen_seldev.py
python gen_seltest.py

Pre-train the bi-encoder and cross-encoder with PLM scores:

bash ./train_stage1.sh ${train batch size}
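A common way to distill PLM scores into a retriever/reranker, which stage 1 resembles, is a listwise KL loss between the softmax of the encoder's candidate scores and the softmax of the PLM's scores. A minimal sketch in plain Python (listwise_kd_loss is an illustrative name; the repo's exact objective may differ):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of raw scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_kd_loss(encoder_scores, plm_scores):
    # KL(P_plm || P_encoder) over one query's candidate list: the encoder is
    # pushed to rank candidates the way the large PLM scored them.
    p = softmax(plm_scores)
    q = softmax(encoder_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the encoder reproduces the PLM's candidate distribution and grows as the two rankings diverge.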

For a quick start, pre-trained bi-encoder and cross-encoder checkpoints are available.

Train and evaluate the framework:

bash ./cyclekd.sh ./ll ${train batch size} ${eval batch size} ${epoch}
bash ./testseen.sh  ${eval batch size}
bash ./testunseen.sh  ${eval batch size}

Ablation variants (removing knowledge sharing and knowledge mining, respectively):

#w/o knowledge sharing
bash ./cycleablationmeta.sh  ${train batch size} ${eval batch size} ${epoch}
bash ./testseen-nometa.sh ${eval batch size}
bash ./testunseen-nometa.sh ${eval batch size}

and

#w/o knowledge mining
bash ./ablationknowledge.sh ${train batch size} ${eval batch size} ${epoch}