OLTQA
The PyTorch implementation of the paper "Long-Tailed Question Answering in an Open World".
Requirements and data preparation
cd LongTailQA
pip install -r r.txt
The raw dataset is available here; download it and place it under data_process/data.
Construct the Pareto long-tail subset of the raw data:
python gen_lt.py
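The long-tail construction imposes a power-law (Pareto) profile over per-task sizes. Below is a minimal sketch of the idea, not the repo's gen_lt.py; the head size and decay exponent `alpha` are illustrative assumptions:

```python
# Sketch: subsample a dict of {task_name: examples} so that task sizes
# follow a power law: size_k ~ head_size / (k + 1) ** alpha.
# `head_size` and `alpha` are illustrative, not the paper's values.
import random

def pareto_long_tail(datasets, head_size=10000, alpha=1.5, seed=0):
    rng = random.Random(seed)
    subset = {}
    # Rank tasks by size, largest first, and shrink each to its power-law target.
    for rank, (task, examples) in enumerate(
            sorted(datasets.items(), key=lambda kv: -len(kv[1]))):
        target = max(1, int(head_size / (rank + 1) ** alpha))
        subset[task] = rng.sample(examples, min(target, len(examples)))
    return subset
```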
Preprocessing: large PLM inference
BM25 candidates
Use this repo to select BM25 examples for PLM inference (to construct a candidate pool for further selection):
python find_bm25.py output_path=$PWD/data/{compute_bm25_outfile} \
dataset_split=train setup_type={bm25_setup_type} task_name={dataset} +ds_size={ds_size} L={finder_L}
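For orientation, here is a hedged sketch of what this selection step computes, using the rank_bm25 package rather than the linked repo's find_bm25.py; the field name `question` and the function name are assumptions:

```python
# Sketch of BM25 candidate selection with the rank_bm25 package;
# the actual pipeline uses find_bm25.py from the linked repo.
from rank_bm25 import BM25Okapi

def top_l_candidates(query, pool, L=50):
    """Return the L pool examples whose questions best match `query` under BM25."""
    tokenized = [ex["question"].lower().split() for ex in pool]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:L]]
```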
PLM inference
Install GLM-10B or GLM-130B in ./plm and run inference to generate hint candidates for each input example.
cd plm
bash ./install_glm.sh
bash ./scripts/generate_block.sh \
config_tasks/model_blocklm_10B_chinese.sh
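Conceptually, this step prompts the PLM with the BM25-retrieved demonstrations and generates a hint for each input. The sketch below substitutes GPT-2 via Hugging Face transformers for GLM purely for illustration; the prompt format and field names are assumptions:

```python
# Illustrative stand-in for the GLM inference step: prompt a causal LM with
# retrieved demonstrations and generate a hint. GPT-2 is a placeholder here;
# the paper uses GLM-10B/130B via the scripts above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_hint(question, demos, max_new_tokens=32):
    prompt = "".join(f"Question: {d['question']}\nHint: {d['hint']}\n\n" for d in demos)
    prompt += f"Question: {question}\nHint:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=True, top_p=0.9,
                         pad_token_id=tok.eos_token_id)
    # Decode only the newly generated tokens.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()
```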
Two-stage Training
Generate the datasets for example selection:
python gen_sel.py
python gen_seldev.py
python gen_seltest.py
Pre-train the bi-encoder and cross-encoder with PLM scores:
bash ./train_stage1.sh ${train batch size}
For a quick start, pre-trained bi-encoder and cross-encoder checkpoints are available.
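Stage-1 pre-training distills the PLM scores into the encoders. Below is a minimal sketch of one plausible objective, assuming dot-product retriever logits over K candidates and a distillation temperature `tau`; these are assumptions, not the repo's exact setup:

```python
# Sketch of a stage-1 distillation objective: align the bi-encoder's
# distribution over hint candidates with soft labels from PLM scores.
import torch.nn.functional as F

def stage1_loss(query_emb, cand_embs, plm_scores, tau=1.0):
    """query_emb: (d,); cand_embs: (K, d); plm_scores: (K,) from the PLM."""
    retriever_logits = cand_embs @ query_emb          # dot-product similarity
    log_p = F.log_softmax(retriever_logits / tau, dim=-1)
    target = F.softmax(plm_scores / tau, dim=-1)      # soft labels from PLM
    return F.kl_div(log_p, target, reduction="sum")   # KL(target || p)
```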
Train and evaluate the framework:
bash ./cyclekd.sh ./ll ${train batch size} ${eval batch size} ${epoch}
bash ./testseen.sh ${eval batch size}
bash ./testunseen.sh ${eval batch size}
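The cycle knowledge distillation in cyclekd.sh couples the bi-encoder (retriever) and cross-encoder (reranker). The sketch below shows one way mutual distillation between the two can be written; it is an assumption-laden illustration, not the repo's trainer:

```python
# Sketch of mutual distillation: retriever and reranker each match the
# other's (detached) candidate distribution. Temperature is an assumption.
import torch.nn.functional as F

def cycle_kd_loss(retriever_logits, reranker_logits, tau=1.0):
    """Both tensors: (K,) logits over the same K candidates."""
    p_ret = F.log_softmax(retriever_logits / tau, dim=-1)
    p_rr = F.log_softmax(reranker_logits / tau, dim=-1)
    # Each model distills from a frozen copy of the other.
    loss_ret = F.kl_div(p_ret, p_rr.detach().exp(), reduction="sum")
    loss_rr = F.kl_div(p_rr, p_ret.detach().exp(), reduction="sum")
    return loss_ret + loss_rr
```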
Ablation variants remove knowledge sharing or knowledge mining:
#w/o knowledge sharing
bash ./cycleablationmeta.sh ${train batch size} ${eval batch size} ${epoch}
bash ./testseen-nometa.sh ${eval batch size}
bash ./testunseen-nometa.sh ${eval batch size}
#w/o knowledge mining
bash ./ablationknowledge.sh ${train batch size} ${eval batch size} ${epoch}