diff --git a/PRO/README.md b/PRO/README.md index 331eaac..b0e355d 100644 --- a/PRO/README.md +++ b/PRO/README.md @@ -4,14 +4,14 @@ Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, H arXiv: [Abstract](https://arxiv.org/abs/2306.17492) / [PDF](https://arxiv.org/pdf/2306.17492.pdf) ## Abstract -Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF exhibits complexity, instability, and sensitivity to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihood of generating responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms existing alignment algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences can consistently enhance the performance of human alignment. +Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it suffers from two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations. ## The pipeline of PRO
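The ranking objective described in the abstract can be made concrete: given n candidate responses sorted from best to worst, PRO applies a softmax contrast at every rank k between the k-th response and all responses ranked below it, and adds an SFT term on the top-ranked response with a small weight (cf. the `--sft_weight` and `--temperature` flags in the training scripts). The snippet below is a minimal, illustrative PyTorch sketch of that objective, not the repository's exact batched implementation; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pro_ranking_loss(seq_logprobs: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # seq_logprobs: shape [n], the policy's log-likelihood of each of the n
    # candidate responses, already sorted best-to-worst according to the
    # human (or reward-model) preference ranking.
    scores = seq_logprobs / temperature
    loss = scores.new_zeros(())
    for k in range(scores.size(0) - 1):
        # -log softmax(scores[k:])[0]: pushes the k-th response to be the
        # most likely among itself and every response ranked below it.
        loss = loss - F.log_softmax(scores[k:], dim=0)[0]
    return loss

# Total training loss (sketch): pro_ranking_loss(scores) plus
# sft_weight * the usual SFT negative log-likelihood of the best response.
```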
## Results -### Automatic Evaluation -
+### Automatic Evaluation on *HH-RLHF* +
### GPT-4 Evaluation
@@ -19,32 +19,63 @@ Large language models (LLMs) often contain misleading content, emphasizing the n ### Human Evaluation
+### Automatic Evaluation on *Summarize From Feedback* +
+ ## Running! ### Data Preparation -1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/pro_data.zip) and unzip it. -2. Place the unzipped ```data/``` folder in the root directory of the project. -3. You can also get the raw data from [this repo](https://github.com/anthropics/hh-rlhf), and run the following command to preprocess it to get the same data as ```train_len2/``` in ```data.zip```: +We provide the preprocessed data for training and testing, which can be obtained with the following steps: +1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/data.zip) and unzip it. +2. Place the unzipped ```data``` folder in the root directory of the project. + +We also provide scripts for preprocessing the raw data. Please follow the steps below to prepare the data: +1. Create a directory named ```data``` in the root directory of this project. +2. Create a directory named ```data/raw_data``` in the ```data``` directory. +3. Download the raw data from [*HH-RLHF*](https://github.com/anthropics/hh-rlhf) or [*Summarize From Feedback*](https://github.com/openai/summarize-from-feedback), which should be named ```hhrlhf``` or ```summarize_from_feedback``` respectively, and put it in the ```data/raw_data``` directory. +4. Run the following commands to preprocess the data (a snippet for inspecting the resulting data format follows the Test section below): + ``` -cd train/preprocess_data +# For HH-RLHF +cd train/hh_preprocess_data +python step_1_process.py +python step_2_gen_train_data.py +python step_3_gen_test_data.py + +# For Summarize From Feedback +cd ../summarize_preprocess_data python step_1_process.py python step_2_gen_train_data.py python step_3_gen_test_data.py ``` + ### Train -We provide the training script for training the model. For example, you can run the following command to train the model: +We provide the training scripts for training the model. For example, you can run the following commands to train the model: ``` cd train -./train.sh [id_of_exp] train_len2 2 + +# Train LLMs with HH-RLHF +./train_hh.sh [id_of_exp] hh_train_len2 2 + +# Train LLMs with Summarize From Feedback +./train_summarize.sh [id_of_exp] summarize_train_len2 2 +# Length 3 +./train3_summarize.sh [id_of_exp] summarize_train_len3_alpaca 3 ``` -You can modify the ```train.sh``` to train the model with different dataset. + +The scripts can easily be modified to train LLMs with different datasets. ### Test -You can run the following command to test the model: +The following commands can be used to test the model: ``` -cd eval +# Test LLMs with HH-RLHF +cd eval_hh +./run_infer_main_dist.sh + +# Test LLMs with Summarize From Feedback +cd ../eval_summarize ./run_infer_main_dist.sh ``` -> **Note:** Before run this script, you should modify the ```infer_main_dist.sh``` to specify ```id_of_exp``` and corresponding ranking length in training. +> **Note:** ```run_infer_main_dist.sh``` expects ```id_of_exp``` and the ranking length used during training as its two arguments; specify them when launching the script. 
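For reference, each line of the preprocessed files is a standalone JSON object whose fields come from the preprocessing scripts above: `prefix`, `suffix`, and `reward` are parallel lists over the ranked candidates (for the training split, sorted best-to-worst by the reward model), and `sft_index` is set to 0, i.e. the top-ranked candidate. A minimal sketch for sanity-checking the generated data, assuming the *Summarize From Feedback* preprocessing has been run and the script is launched from the project root:

```python
import json

# Peek at the first preprocessed training sample.
with open("data/summarize_train_len2/train.json", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print(sorted(sample))         # ['meta', 'prefix', 'reward', 'sft_index', 'suffix']
print(len(sample["prefix"]))  # ranking length, e.g. 2 for the *_len2 datasets
print(sample["reward"])       # reward-model scores, in descending order
```

The *HH-RLHF* output in `data/hh_train_len2/` is produced by the analogous scripts.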
## Citation If this work is helpful to you, please cite our paper as: diff --git a/PRO/eval/dp_config.yaml b/PRO/eval_hh/dp_config.yaml similarity index 100% rename from PRO/eval/dp_config.yaml rename to PRO/eval_hh/dp_config.yaml diff --git a/PRO/eval/infer_and_eval_main_generate.py b/PRO/eval_hh/infer_and_eval_main_generate.py similarity index 97% rename from PRO/eval/infer_and_eval_main_generate.py rename to PRO/eval_hh/infer_and_eval_main_generate.py index 76137b5..ee02f46 100644 --- a/PRO/eval/infer_and_eval_main_generate.py +++ b/PRO/eval_hh/infer_and_eval_main_generate.py @@ -76,7 +76,7 @@ if __name__ == "__main__": "helpful_online.json", "helpful_rejection.json" ]: - file_path = os.path.join("..", "data", "test", file_name) + file_path = os.path.join("..", "data", "hh_test", file_name) with open(file_path, "r", encoding='utf-8') as f: infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0} diff --git a/PRO/eval/infer_and_eval_main_reward.py b/PRO/eval_hh/infer_and_eval_main_reward.py similarity index 100% rename from PRO/eval/infer_and_eval_main_reward.py rename to PRO/eval_hh/infer_and_eval_main_reward.py diff --git a/PRO/eval/infer_and_eval_main_score.py b/PRO/eval_hh/infer_and_eval_main_score.py similarity index 100% rename from PRO/eval/infer_and_eval_main_score.py rename to PRO/eval_hh/infer_and_eval_main_score.py diff --git a/PRO/eval/infer_func_now.py b/PRO/eval_hh/infer_func_now.py similarity index 96% rename from PRO/eval/infer_func_now.py rename to PRO/eval_hh/infer_func_now.py index 58a295b..89f4ac8 100644 --- a/PRO/eval/infer_func_now.py +++ b/PRO/eval_hh/infer_func_now.py @@ -62,7 +62,6 @@ def generate_pipeline(model, tokenizer, prompts, add_special_tokens=False, gen_k text = text_res[index] assert truncated_prompts[index].rstrip() in text text = text.replace(truncated_prompts[index].rstrip(), "").strip() - # text = text[prompts_size[index]:].strip() for stop in ["Human:", "human:", "Assistant:", "assistant:"]: stop_ix = text.find(stop) if stop_ix >= 0: diff --git a/PRO/eval/metrics2.py b/PRO/eval_hh/metrics2.py similarity index 63% rename from PRO/eval/metrics2.py rename to PRO/eval_hh/metrics2.py index 4d23249..72c2c6f 100644 --- a/PRO/eval/metrics2.py +++ b/PRO/eval_hh/metrics2.py @@ -21,47 +21,7 @@ def get_bleu(hyp, ref): hyp = hyp.strip() ref = ref.strip() return nltk.translate.bleu_score.sentence_bleu([ref], hyp) -# Thank trlx for their helpful code: -# https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L115 -def create_reward_fn_1(): - reward_tokenizer = AutoTokenizer.from_pretrained("gpt2") - reward_tokenizer.pad_token = reward_tokenizer.eos_token - reward_tokenizer.truncation_side = "left" - reward_model = TrainRewardModel("EleutherAI/gpt-j-6B", reward_tokenizer.eos_token_id) - checkpoint = os.path.join("..", "rm", "gptj-rm-static", "hf_ckpt.pt") - - reward_model.load_state_dict(torch.load(checkpoint)) - reward_device = "cuda:{}".format(rank) - reward_model = reward_model.half().to(reward_device) - reward_model.eval() - - def get_score(prefixes, suffixes): - # prefixes = [[p1, p1, p1], [p2, p2, p2]] - # suffixes = [s1, s2] - texts = [] - for p, s in zip(prefixes,suffixes): - p = "".join(p) - p = p.replace("<|prompter|>", "\n\nHuman: ").replace("<|assistant|>", "\n\nAssistant: ") - texts.append(p + s + reward_tokenizer.eos_token) - - input = reward_tokenizer( - texts, - padding=True, - truncation=True, - max_length=reward_tokenizer.max_len_single_sentence, - return_tensors="pt", - 
).to(reward_device) - - with torch.no_grad(): - rewards = reward_model(input['input_ids']) # [batch] - - return rewards.view(-1) - # return torch.sigmoid(rewards.view(-1)) - - return get_score, 16 - def create_reward_fn_2(): - # model_name = "OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5" model_name = "OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1" model_device = "cuda:{}".format(rank) tokenizer = AutoTokenizer.from_pretrained(model_name) diff --git a/PRO/eval/reward_model.py b/PRO/eval_hh/reward_model.py similarity index 100% rename from PRO/eval/reward_model.py rename to PRO/eval_hh/reward_model.py diff --git a/PRO/eval/run_infer_main_dist.sh b/PRO/eval_hh/run_infer_main_dist.sh similarity index 90% rename from PRO/eval/run_infer_main_dist.sh rename to PRO/eval_hh/run_infer_main_dist.sh index 08858a2..d33fe06 100755 --- a/PRO/eval/run_infer_main_dist.sh +++ b/PRO/eval_hh/run_infer_main_dist.sh @@ -3,19 +3,14 @@ export OMP_NUM_THREADS=16 id=$1 ranking_len=$2 -# 30 min accelerate launch --config_file dp_config.yaml infer_and_eval_main_generate.py \ --index $id \ --stage $ranking_len > logs/generate_infer_main_${id}_${ranking_len}.log 2>&1 -#10 min accelerate launch --config_file dp_config.yaml infer_and_eval_main_reward.py \ --index $id \ --stage $ranking_len > logs/reward_infer_main_${id}_${ranking_len}.log 2>&1 -#1 second python -u infer_and_eval_main_score.py \ --index $id \ - --stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1 - -# total 40 min \ No newline at end of file + --stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1 \ No newline at end of file diff --git a/PRO/eval_summarize/dp_config.yaml b/PRO/eval_summarize/dp_config.yaml new file mode 100644 index 0000000..31c7264 --- /dev/null +++ b/PRO/eval_summarize/dp_config.yaml @@ -0,0 +1,16 @@ +compute_environment: LOCAL_MACHINE +deepspeed_config: {} +distributed_type: MULTI_GPU +downcast_bf16: 'no' +dynamo_backend: 'NO' +fsdp_config: {} +gpu_ids: all +machine_rank: 0 +main_training_function: main +megatron_lm_config: {} +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +use_cpu: false diff --git a/PRO/eval_summarize/infer_and_eval_main_generate.py b/PRO/eval_summarize/infer_and_eval_main_generate.py new file mode 100644 index 0000000..b2daeba --- /dev/null +++ b/PRO/eval_summarize/infer_and_eval_main_generate.py @@ -0,0 +1,95 @@ +#import some packages and reward funcs +import os +import argparse +import json +import tqdm +import torch +import torch.nn.functional as F +import metrics2 +from transformers import ( + AutoConfig, + AutoTokenizer, + LlamaTokenizer, + AutoModelForCausalLM +) +from infer_func_now import setup_seed, generate_pipeline +from accelerate import Accelerator +from accelerate.utils import InitProcessGroupKwargs +from datetime import timedelta + +def get_args(): + parser = argparse.ArgumentParser(description="") + parser.add_argument('--index', type=str) + parser.add_argument('--stage', type=int) + parser.add_argument('--directory', default="best_checkpoint", type=str) + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = get_args() + kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400)) + accelerator = Accelerator(kwargs_handlers=[kwargs])# **accelerator_log_kwargs) + rank = int(os.environ['RANK']) + rank_sum = accelerator.num_processes + model_name_or_path = os.path.join("..", "checkpoints", f"index_{args.index}", f"stage_{args.stage}", f"{args.directory}") + model_device = 
"cuda:{}".format(rank) + + model_config = AutoConfig.from_pretrained(model_name_or_path) + model = AutoModelForCausalLM.from_pretrained(model_name_or_path, config=model_config, torch_dtype=torch.bfloat16).to(model_device) + if accelerator.is_main_process: + print(type(model)) + print(model.config) + if model.config.architectures[0].lower() == "llamaforcausallm": + tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path) + tokenizer.unk_token = "" + tokenizer.bos_token = "" + tokenizer.eos_token = "" + else: + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + tokenizer.pad_token=tokenizer.eos_token, + tokenizer.pad_token_id=tokenizer.eos_token_id, + tokenizer.sep_token = "" + model.resize_token_embeddings(len(tokenizer)) + + print(model.dtype) + torch.cuda.empty_cache() + model.eval() + print(f"Rank {rank} is activated...") + if accelerator.is_main_process: + file_name = "test.json" + save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name)) + if os.path.exists(save_path): + os.remove(save_path) + accelerator.wait_for_everyone() + + file_name = "test.json" + file_path = os.path.join("..", "data", "summarize_test", file_name) + with open(file_path, "r", encoding='utf-8') as f: + infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0} + + for line_index in infer_data: + infer_data[line_index]["line_index"] = line_index + infer_data = [infer_data[line_index] for line_index in infer_data] + + prompts = [l['prefix'][0] for l in infer_data] + + setup_seed() + generated_suffixes, truncated_prompts = generate_pipeline(model, tokenizer, prompts, add_special_tokens=True) + setup_seed() + save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name)) + + for index in range(len(infer_data)): + infer_data[index]['infer'] = {"t": generated_suffixes[index]} + with open(save_path, 'a', encoding='utf-8') as f: + for line in infer_data: + content = json.dumps(line, ensure_ascii=False) + f.write(content+'\n') + + accelerator.wait_for_everyone() + + print("") + if accelerator.is_main_process: + print("Eval on {}".format(file_name)) + torch.cuda.empty_cache() + accelerator.wait_for_everyone() \ No newline at end of file diff --git a/PRO/eval_summarize/infer_and_eval_main_reward.py b/PRO/eval_summarize/infer_and_eval_main_reward.py new file mode 100644 index 0000000..4848430 --- /dev/null +++ b/PRO/eval_summarize/infer_and_eval_main_reward.py @@ -0,0 +1,84 @@ +#import some packages and reward funcs +import os +import argparse +import json +import tqdm +import torch +import torch.nn.functional as F +import metrics2 +from transformers import ( + AutoConfig, + AutoTokenizer, + LlamaTokenizer, + AutoModelForCausalLM +) +from peft import PeftConfig, PeftModel +from infer_func_now import setup_seed +from accelerate import Accelerator +from accelerate.utils import InitProcessGroupKwargs +from datetime import timedelta + +def get_args(): + parser = argparse.ArgumentParser(description="") + parser.add_argument('--index', type=str) + parser.add_argument('--stage', type=int) + parser.add_argument('--directory', default="best_checkpoint", type=str) + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = get_args() + setup_seed() + kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400)) + accelerator = Accelerator(kwargs_handlers=[kwargs]) + rank = int(os.environ['RANK']) + 
rank_sum = accelerator.num_processes + torch.cuda.empty_cache() + print(f"Rank {rank} is activated...") + if accelerator.is_main_process: + file_name = "test.json" + save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name)) + if os.path.exists(save_path): + os.remove(save_path) + + save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name)) + with open(save_path, 'r', encoding='utf-8') as f: + infer_data = [json.loads(l) for l in f.readlines()] + if "line_index" in infer_data[0]: + infer_data = {l["line_index"]: l for l in infer_data} + with open(save_path, 'w', encoding='utf-8') as f: + infer_data = [infer_data[line_index] for line_index in range(len(infer_data))] + for line in infer_data: + content = json.dumps(line, ensure_ascii=False) + f.write(content+'\n') + + accelerator.wait_for_everyone() + + get_score, reward_batch_size = metrics2.create_reward_fn() + + file_name = "test.json" + save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name)) + with open(save_path, 'r', encoding='utf-8') as f: + infer_data = [json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index - rank) % rank_sum == 0] + raw_prefixes = [l['prefix'][0].strip() + " " for l in infer_data] + generated_suffixes = [l['infer']["t"].strip() for l in infer_data] + + setup_seed() + rewards = [] + batch_size = reward_batch_size + for index in tqdm.tqdm(range(0,len(raw_prefixes), batch_size), desc=f"Rank {rank} rewarding..."): + if len(raw_prefixes) - index < batch_size: + batch_size = len(raw_prefixes) - index + rewards.extend(torch.sigmoid(get_score(raw_prefixes[index:index+batch_size], generated_suffixes[index:index+batch_size])).cpu().detach().numpy().tolist()) + assert len(rewards) == len(generated_suffixes) and len(rewards) == len(infer_data), (len(rewards), len(generated_suffixes), len(infer_data)) + + for index in range(len(infer_data)): + infer_data[index]["infer"]["score"] = rewards[index] + infer_data[index]["infer"]["bleu"] = metrics2.get_bleu(infer_data[index]['infer']['t'], infer_data[index]['suffix'][0]) + + save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name)) + with open(save_path, 'a', encoding='utf-8') as f: + for line in infer_data: + content = json.dumps(line, ensure_ascii=False) + f.write(content+'\n') + print(f"Rank {rank} completed!") \ No newline at end of file diff --git a/PRO/eval_summarize/infer_and_eval_main_score.py b/PRO/eval_summarize/infer_and_eval_main_score.py new file mode 100644 index 0000000..7fddc11 --- /dev/null +++ b/PRO/eval_summarize/infer_and_eval_main_score.py @@ -0,0 +1,47 @@ +import os +import argparse +import json +import tqdm +import evaluate + +def get_args(): + parser = argparse.ArgumentParser(description="") + parser.add_argument('--index', type=str) + parser.add_argument('--stage', type=int) + parser.add_argument('--directory', default="best_checkpoint", type=str) + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = get_args() + + file_name = "test.json" + save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name)) + with open(save_path, 'r', encoding='utf-8') as f: + infer_data = [json.loads(l) for line_index, l in enumerate(f.readlines())] + + bleu = 0 + avg_reward = 0 + predictions = [] + references = [] + + for line in infer_data: + avg_reward += 
line['infer']['score'] + bleu += line['infer']['bleu'] + predictions.append( + line['infer']["t"].strip() + ) + references.append( + line["suffix"][0].strip() + ) + + rouge = evaluate.load('rouge') + results = rouge.compute(predictions=predictions, references=references) + bleu = bleu / len(infer_data) + avg_reward = avg_reward / len(infer_data) + + print("Eval on {}".format(file_name)) + print("BLEU: {}".format(bleu)) + print("Avg Reward: {}".format(avg_reward)) + for key in results: + print("{}: {}".format(key, results[key])) \ No newline at end of file diff --git a/PRO/eval_summarize/infer_func_now.py b/PRO/eval_summarize/infer_func_now.py new file mode 100644 index 0000000..cf7e63f --- /dev/null +++ b/PRO/eval_summarize/infer_func_now.py @@ -0,0 +1,76 @@ +import torch +import torch.nn.functional as F +import tqdm +import numpy as np +import random + +def setup_seed(seed=42): + torch.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + np.random.seed(seed) + random.seed(seed) + torch.backends.cudnn.benchmark=False + torch.backends.cudnn.deterministic=True + +def generate_pipeline(model, tokenizer, prompts, add_special_tokens=False, gen_kwarg={"max_new_tokens": 64, "num_beams": 1, "do_sample": False,}, batch_size = 28): + def pipeline(prompts): + tokenizer.padding_side = "left" + tokenizer.truncation_side = "right" + + new_prompts = [] + for p in prompts: + assert p[-7:] == "\nTL;DR:", p[-7:] + p = p[:-7] + new_prompts.append(p) + + model_inputs = tokenizer( + new_prompts, + max_length=512, + truncation=True, + add_special_tokens=add_special_tokens, + ) + truncated_prompts = tokenizer.batch_decode(model_inputs['input_ids'], skip_special_tokens=True) + truncated_prompts = [p + "\nTL;DR:" for p in truncated_prompts] + model_inputs = tokenizer( + truncated_prompts, + add_special_tokens=add_special_tokens, + padding=True, + return_tensors="pt" + ) + truncated_prompts = tokenizer.batch_decode(model_inputs['input_ids'], skip_special_tokens=True) + prompts_size = [len(s) for s in truncated_prompts] + return model_inputs, prompts_size, truncated_prompts + + model_inputs, prompts_size, truncated_prompts = pipeline(prompts) + text_res = [] + for index in tqdm.tqdm(range(0, len(model_inputs["input_ids"]), batch_size)): + if len(model_inputs["input_ids"]) - index < batch_size: + batch_size = len(model_inputs["input_ids"]) - index + + batch = {key: model_inputs[key][index:index+batch_size].to(model.device) for key in model_inputs} + with torch.no_grad(): + ts = model.generate( + **batch, + **gen_kwarg, + pad_token_id=tokenizer.pad_token_id, + ).cpu().detach() + text_res.append(ts) + + for index in range(len(text_res)): + text_res[index] = tokenizer.batch_decode( + text_res[index], + skip_special_tokens=True + ) + + text_res = sum(text_res, []) + for index in range(len(text_res)): + text = text_res[index] + assert truncated_prompts[index].rstrip() in text + text = text.replace(truncated_prompts[index].rstrip(), "").strip() + for stop in ["\n\n"]: + stop_ix = text.find(stop) + if stop_ix >= 0: + text = text[:stop_ix].rstrip() + text_res[index] = text + + return text_res, truncated_prompts \ No newline at end of file diff --git a/PRO/eval_summarize/metrics2.py b/PRO/eval_summarize/metrics2.py new file mode 100644 index 0000000..a6528c7 --- /dev/null +++ b/PRO/eval_summarize/metrics2.py @@ -0,0 +1,70 @@ +import sys +sys.path.append("..") +import os +import math +import torch +import torch.nn as nn +from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification +from 
transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXConfig, GPTNeoXModel, GPTNeoXPreTrainedModel +from transformers.utils import ModelOutput +from dataclasses import dataclass +from typing import Literal, Optional +import tqdm +import nltk + +rank = int(os.environ['RANK']) + +def get_bleu(hyp, ref): + hyp = hyp.strip() + ref = ref.strip() + return nltk.translate.bleu_score.sentence_bleu([ref], hyp) + +def create_reward_fn_2(): + model_name = "OpenAssistant/reward-model-deberta-v3-large-v2" + model_device = "cuda:{}".format(rank) + tokenizer = AutoTokenizer.from_pretrained(model_name) + tokenizer.truncation_side = "right" + reward_model = AutoModelForSequenceClassification.from_pretrained(model_name).to(model_device) + reward_model.eval() + + def get_score(prefixes, suffixes): + input_content = tokenizer( + prefixes, + suffixes, + padding=True, + truncation=True, + max_length=1024, + return_tensors="pt", + ).to(model_device) + with torch.no_grad(): + rewards = reward_model(**input_content).logits + + return rewards.view(-1) + + return get_score, 140 + +def create_reward_fn_3(): + model_name = "OpenAssistant/reward-model-deberta-v3-large" + model_device = "cuda:{}".format(rank) + tokenizer = AutoTokenizer.from_pretrained(model_name) + tokenizer.truncation_side = "right" + reward_model = AutoModelForSequenceClassification.from_pretrained(model_name).to(model_device) + reward_model.eval() + + def get_score(prefixes, suffixes): + input_content = tokenizer( + prefixes, + suffixes, + padding=True, + truncation=True, + max_length=1024, + return_tensors="pt", + ).to(model_device) + with torch.no_grad(): + rewards = reward_model(**input_content).logits + + return rewards.view(-1) + + return get_score, 140 + +create_reward_fn = create_reward_fn_2 \ No newline at end of file diff --git a/PRO/eval_summarize/run_infer_main_dist.sh b/PRO/eval_summarize/run_infer_main_dist.sh new file mode 100755 index 0000000..d33fe06 --- /dev/null +++ b/PRO/eval_summarize/run_infer_main_dist.sh @@ -0,0 +1,16 @@ +export PYTHONIOENCODING=utf-8 +export OMP_NUM_THREADS=16 + +id=$1 +ranking_len=$2 +accelerate launch --config_file dp_config.yaml infer_and_eval_main_generate.py \ + --index $id \ + --stage $ranking_len > logs/generate_infer_main_${id}_${ranking_len}.log 2>&1 + +accelerate launch --config_file dp_config.yaml infer_and_eval_main_reward.py \ + --index $id \ + --stage $ranking_len > logs/reward_infer_main_${id}_${ranking_len}.log 2>&1 + +python -u infer_and_eval_main_score.py \ + --index $id \ + --stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1 \ No newline at end of file diff --git a/PRO/requirements.txt b/PRO/requirements.txt index d520bbc..edce707 100644 --- a/PRO/requirements.txt +++ b/PRO/requirements.txt @@ -8,3 +8,6 @@ torch==1.13.1+cu117 tqdm==4.64.1 transformers==4.28.1 deepspeed==0.8.1 +evaluate +rouge_score +tensorboard \ No newline at end of file diff --git a/PRO/resources/automatic.jpg b/PRO/resources/automatic_hh.jpg similarity index 100% rename from PRO/resources/automatic.jpg rename to PRO/resources/automatic_hh.jpg diff --git a/PRO/resources/automatic_summarize.jpg b/PRO/resources/automatic_summarize.jpg new file mode 100644 index 0000000..1646f25 Binary files /dev/null and b/PRO/resources/automatic_summarize.jpg differ diff --git a/PRO/train/ds_config2.yaml b/PRO/train/ds_config2.yaml new file mode 100644 index 0000000..bc5d938 --- /dev/null +++ b/PRO/train/ds_config2.yaml @@ -0,0 +1,22 @@ +compute_environment: LOCAL_MACHINE +deepspeed_config: + 
gradient_accumulation_steps: 16 + gradient_clipping: 1.0 + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: true + zero3_save_16bit_model: true + zero_stage: 2 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +dynamo_backend: 'NO' +fsdp_config: {} +machine_rank: 0 +main_training_function: main +megatron_lm_config: {} +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +use_cpu: false diff --git a/PRO/train/preprocess_data/step_1_process.py b/PRO/train/hh_preprocess_data/step_1_process.py similarity index 74% rename from PRO/train/preprocess_data/step_1_process.py rename to PRO/train/hh_preprocess_data/step_1_process.py index 04b1752..a71ed04 100644 --- a/PRO/train/preprocess_data/step_1_process.py +++ b/PRO/train/hh_preprocess_data/step_1_process.py @@ -31,8 +31,6 @@ def hhrlhf_preprocess(path,filename,index_generator,split='train'): dialogue: list[tuple[str, str]] = [] - # go over messages and combine consecutive messages from the - # same speaker (OA v1 expects alternating roles) role = None messages = [] for line in lines: @@ -72,7 +70,7 @@ def hhrlhf_preprocess(path,filename,index_generator,split='train'): if len(rejected) == len(chosen): assert chosen[-1][0] == rejected[-1][0] - prefix = [role+text for role, text in chosen[:-1]] # need to be concated with [EOS] in practice + prefix = [role+text for role, text in chosen[:-1]] prefix.append(chosen[-1][0]) good_reply = chosen[-1][1] # last part of dialog, the text bad_reply = rejected[-1][1] # last part of dialog, the text @@ -112,7 +110,7 @@ def hhrlhf_preprocess(path,filename,index_generator,split='train'): prefix.append([role, text]) elif role == "<|assistant|>": temp_prefix = [temp_role+temp_text for temp_role, temp_text in prefix] - temp_prefix.append(role) # need to be concated with [EOS] in practice + temp_prefix.append(role) temp_reply = text # last part of dialog, the text chosen_sample = { 'extended':[ @@ -148,7 +146,7 @@ def hhrlhf_preprocess(path,filename,index_generator,split='train'): prefix.append([role, text]) elif role == "<|assistant|>": temp_prefix = [temp_role+temp_text for temp_role, temp_text in prefix] - temp_prefix.append(role) # need to be concated with [EOS] in practice + temp_prefix.append(role) temp_reply = text # last part of dialog, the text rejected_sample = { 'extended':[ @@ -184,56 +182,13 @@ def hhrlhf_preprocess(path,filename,index_generator,split='train'): return samples if __name__ == "__main__": - # get a global index generator global_index_generator = gen_global_index() - - # prepare to post-processing - res = { - 'hhrlhf':[], - } - - prompts = { - 'hhrlhf': '', - 'summarize':'', - 'webgpt':'', - 'tldr':'', - } - # process raw datasets - # hhrlhf - res['hhrlhf'] = [ - hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','harmless-base'),'train.jsonl',global_index_generator,split='train'), - # hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','harmless-base'),'test.jsonl',global_index_generator,split='test'), - hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-base'),'train.jsonl',global_index_generator,split='train'), - # hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-base'),'test.jsonl',global_index_generator,split='test'), - hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-online'),'train.jsonl',global_index_generator,split='train'), - # 
hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-online'),'test.jsonl',global_index_generator,split='test'), - hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-rejection'),'train.jsonl',global_index_generator,split='train'), - # hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-rejection'),'test.jsonl',global_index_generator,split='test'), - ] + + hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','harmless-base'),'train.jsonl',global_index_generator,split='train'), + hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-base'),'train.jsonl',global_index_generator,split='train'), + hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-online'),'train.jsonl',global_index_generator,split='train'), + hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-rejection'),'train.jsonl',global_index_generator,split='train'), hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','harmless-base'),'test.jsonl',global_index_generator,split='dev') hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-base'),'test.jsonl',global_index_generator,split='dev') hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-online'),'test.jsonl',global_index_generator,split='dev') - hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-rejection'),'test.jsonl',global_index_generator,split='dev') - - global_prefixes = [] - global_extended_samples = 0 - for key in res: - for dataset in res[key]: - for sample in dataset: - for sub_sample in sample['extended']: - prefix = "".join(sub_sample['prefix']) - prefix = prefix.replace("<|prompter|>", "\n\nHuman: ").replace("<|assistant|>", "\n\nAssistant: ") - prefix = prompts[key].replace('', prefix) - global_prefixes.append( - { - 'id': sub_sample['id'], - 'prefix': prefix, - 'target_num': sub_sample['target_num'], - 'target': [] - } - ) - global_extended_samples += sub_sample['target_num'] - - - print('Total Num: {}'.format(len(global_prefixes))) - print('Total Extended Num: {}'.format(global_extended_samples)) \ No newline at end of file + hhrlhf_preprocess(os.path.join('..','..','data','raw_data','hhrlhf','helpful-rejection'),'test.jsonl',global_index_generator,split='dev') \ No newline at end of file diff --git a/PRO/train/preprocess_data/step_2_gen_train_data.py b/PRO/train/hh_preprocess_data/step_2_gen_train_data.py similarity index 92% rename from PRO/train/preprocess_data/step_2_gen_train_data.py rename to PRO/train/hh_preprocess_data/step_2_gen_train_data.py index a64605f..bbeb4be 100644 --- a/PRO/train/preprocess_data/step_2_gen_train_data.py +++ b/PRO/train/hh_preprocess_data/step_2_gen_train_data.py @@ -5,9 +5,8 @@ import json import random import numpy as np import tqdm -from utils.metrics import create_reward_fn_3 -get_score, reward_batch_size = create_reward_fn_3() -# get_score = None +from utils.metrics_hh import create_reward_fn +get_score, reward_batch_size = create_reward_fn() def split_trans(split): if split == 'train' or split == 'test' or split == 'dev': @@ -48,7 +47,6 @@ def reward_model_ranker(prefixes, suffixes): def extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, split='train'): file_list = [] - # for root,dirs,files in os.walk('refilled_data'): for root,dirs,files in os.walk(root_dir): for file in files: if not file.endswith("json"): @@ -103,16 +101,19 @@ def 
extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, for l in training_data: l['reward'] = [1.0] * len(l['suffix']) + for l in training_data: + l['sft_index'] = 0 + return training_data if __name__ == '__main__': root_dir = os.path.join('..','..','data',"preprocessed_data") data_aug = False - os.makedirs(os.path.join('..','..','data','train_len2'), exist_ok=True) + os.makedirs(os.path.join('..','..','data','hh_train_len2'), exist_ok=True) random.seed(42) training_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf"), if_score = True, if_rerank=True, split = 'train') random.shuffle(training_data) - with open(os.path.join('..','..','data','train_len2','train.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_train_len2','train.json'),'w', encoding='utf-8') as f: for sample in training_data: f.write(json.dumps(sample,ensure_ascii=False)+'\n') @@ -124,7 +125,6 @@ if __name__ == '__main__': random.seed(42) helpful_base_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "helpful-base"), if_score = True, if_rerank=False, split = 'dev') random.shuffle(helpful_base_dev_data) - random.seed(42) helpful_online_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "helpful-online"), if_score = True, if_rerank=False, split = 'dev') @@ -137,6 +137,6 @@ if __name__ == '__main__': total_dev_data = harmless_base_dev_data + helpful_base_dev_data + helpful_online_dev_data + helpful_rejection_dev_data random.shuffle(total_dev_data) total_dev_data = total_dev_data[:280] - with open(os.path.join('..','..','data','dev','sampled_dev.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_dev','sampled_dev.json'),'w', encoding='utf-8') as f: for sample in total_dev_data: f.write(json.dumps(sample,ensure_ascii=False)+'\n') \ No newline at end of file diff --git a/PRO/train/preprocess_data/step_3_gen_test_data.py b/PRO/train/hh_preprocess_data/step_3_gen_test_data.py similarity index 90% rename from PRO/train/preprocess_data/step_3_gen_test_data.py rename to PRO/train/hh_preprocess_data/step_3_gen_test_data.py index 0aea94d..4945a88 100644 --- a/PRO/train/preprocess_data/step_3_gen_test_data.py +++ b/PRO/train/hh_preprocess_data/step_3_gen_test_data.py @@ -47,7 +47,6 @@ def reward_model_ranker(prefixes, suffixes): def extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, split='train'): file_list = [] - # for root,dirs,files in os.walk('refilled_data'): for root,dirs,files in os.walk(root_dir): for file in files: if not file.endswith("json"): @@ -103,37 +102,40 @@ def extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, for l in training_data: l['reward'] = [1.0] * len(l['suffix']) + for l in training_data: + l['sft_index'] = 0 + return training_data if __name__ == '__main__': root_dir = os.path.join('..','..','data',"preprocessed_data") data_aug = False - os.makedirs(os.path.join('..','..','data','test'), exist_ok=True) + os.makedirs(os.path.join('..','..','data','hh_test'), exist_ok=True) random.seed(42) harmless_base_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "harmless-base"), if_score = True, if_rerank=True, split = 'dev') random.shuffle(harmless_base_dev_data) - with open(os.path.join('..','..','data','test','harmless_base.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_test','harmless_base.json'),'w', encoding='utf-8') as f: for sample in harmless_base_dev_data: 
f.write(json.dumps(sample,ensure_ascii=False)+'\n') random.seed(42) helpful_base_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "helpful-base"), if_score = True, if_rerank=False, split = 'dev') random.shuffle(helpful_base_dev_data) - with open(os.path.join('..','..','data','test','helpful_base.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_test','helpful_base.json'),'w', encoding='utf-8') as f: for sample in helpful_base_dev_data: f.write(json.dumps(sample,ensure_ascii=False)+'\n') random.seed(42) helpful_online_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "helpful-online"), if_score = True, if_rerank=False, split = 'dev') random.shuffle(helpful_online_dev_data) - with open(os.path.join('..','..','data','test','helpful_online.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_test','helpful_online.json'),'w', encoding='utf-8') as f: for sample in helpful_online_dev_data: f.write(json.dumps(sample,ensure_ascii=False)+'\n') random.seed(42) helpful_rejection_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "hhrlhf", "helpful-rejection"), if_score = True, if_rerank=False, split = 'dev') random.shuffle(helpful_rejection_dev_data) - with open(os.path.join('..','..','data','test','helpful_rejection.json'),'w', encoding='utf-8') as f: + with open(os.path.join('..','..','data','hh_test','helpful_rejection.json'),'w', encoding='utf-8') as f: for sample in helpful_rejection_dev_data: f.write(json.dumps(sample,ensure_ascii=False)+'\n') \ No newline at end of file diff --git a/PRO/train/summarize_preprocess_data/step_1_process.py b/PRO/train/summarize_preprocess_data/step_1_process.py new file mode 100644 index 0000000..69200fd --- /dev/null +++ b/PRO/train/summarize_preprocess_data/step_1_process.py @@ -0,0 +1,109 @@ +import json +import re +import pprint +import os +import tqdm +import random +random.seed(42) + +def gen_global_index(): + index = 0 + while True: + yield index + index += 1 + +def split_trans(split): + if split == 'train' or split == 'test' or split == 'dev': + return split + elif split == 'valid': + return 'dev' + elif split == 'valid1': + return 'dev' + elif split == 'valid2': + return 'test' + else: + raise Exception('unknown split: {}'.format(split)) + +def summarize_from_feedback_preprocess(path,index_generator): + files = os.listdir(path) + files = [filename for filename in files if filename.endswith('.json')] + target_samples = { + 'train':[], + 'dev':[], + 'test':[] + } + + for filename in files: + with open(os.path.join(path,filename),'r', encoding="utf-8") as f: + raw = f.readlines() + + data = [] + for line in raw: + line = json.loads(line) + data.append(line) + + samples = [] + bar = tqdm.tqdm(data) + for index,sample in enumerate(bar): + bar.set_description(os.path.join(path,filename)) + assert len(sample['summaries']) == 2 + if 'post' in sample['info']: + prefix = "SUBREDDIT: r/{}\nTITLE: {}\nPOST: {}\nTL;DR:".format(sample['info']['subreddit'], sample['info']['title'],sample['info']['post']).strip() + one_sample = { + 'available': [ + { + 'id':next(index_generator), + 'prefix': prefix, + 'target_num':2, + 'target':[ + " {}".format(sample['summaries'][sample['choice']]['text'].strip()), + " {}".format(sample['summaries'][1-sample['choice']]['text'].strip()), + ] + }, + ], + 'split': split_trans(sample['split']), + 'source': { + 'path': os.path.join(path,filename), + 'line_num': index+1, + } + } + target_samples[one_sample['split']].append(one_sample) + 
else: + # samples without a Reddit 'post' field (article-style data) are skipped + prefix = "Article: {}\nTL;DR:".format(sample['info']['article']) + pass + + os.makedirs(path.replace('raw_data','preprocessed_data'), exist_ok=True) + + true_dev_index = random.sample(list(range(len(target_samples['dev']))),1000) + true_dev = [] + for index, sample in enumerate(target_samples['dev']): + if index in true_dev_index: + sample['split'] = 'dev' + true_dev.append(sample) + else: + sample['split'] = 'train' + target_samples['train'].append(sample) + target_samples['dev'] = true_dev + + with open(os.path.join(path.replace('raw_data','preprocessed_data'), "train.json"), 'w', encoding='utf-8') as f: + for sample in target_samples['train']: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') + print("{}: {}".format(os.path.join(path.replace('raw_data','preprocessed_data'),"train.json"),len(target_samples['train']))) + + with open(os.path.join(path.replace('raw_data','preprocessed_data'), "dev.json"), 'w', encoding='utf-8') as f: + for sample in target_samples['dev']: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') + print("{}: {}".format(os.path.join(path.replace('raw_data','preprocessed_data'),"dev.json"),len(target_samples['dev']))) + + with open(os.path.join(path.replace('raw_data','preprocessed_data'), "test.json"), 'w', encoding='utf-8') as f: + for sample in target_samples['test']: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') + print("{}: {}".format(os.path.join(path.replace('raw_data','preprocessed_data'),"test.json"),len(target_samples['test']))) + +if __name__ == "__main__": + global_index_generator = gen_global_index() + + summarize_from_feedback_preprocess( + os.path.join('..','..','data','raw_data','summarize_from_feedback','comparisons'), + global_index_generator + ) \ No newline at end of file diff --git a/PRO/train/summarize_preprocess_data/step_2_gen_train_data.py b/PRO/train/summarize_preprocess_data/step_2_gen_train_data.py new file mode 100644 index 0000000..e1fc16e --- /dev/null +++ b/PRO/train/summarize_preprocess_data/step_2_gen_train_data.py @@ -0,0 +1,118 @@ +import os +import sys +sys.path.append("..") +import json +import random +import numpy as np +import tqdm +from utils.metrics_summarize import create_reward_fn +get_score, reward_batch_size = create_reward_fn() + +def split_trans(split): + if split == 'train' or split == 'test' or split == 'dev': + return split + elif split == 'valid': + return 'dev' + elif split == 'valid1': + return 'dev' + elif split == 'valid2': + return 'test' + else: + raise Exception('unknown split: {}'.format(split)) + +def concat_wo_ranker(prefixes, suffixes): + #prefixes = [[a,b,c],[d,e,f]] + #suffixes = [[a,b,c],[d,e,f]] + training_stage_num = len(prefixes[0]) + batch_size = len(prefixes) + new_prefixes = sum(prefixes,[]) + new_suffixes = sum(suffixes,[]) + rewards = get_score(new_prefixes, new_suffixes).view(batch_size, training_stage_num).cpu().detach().numpy().tolist() #[batch_size, ranking] + + return prefixes, suffixes, rewards + +def reward_model_ranker(prefixes, suffixes): + #prefixes = [[a,b,c],[d,e,f]] + #suffixes = [[a,b,c],[d,e,f]] + training_stage_num = len(prefixes[0]) + batch_size = len(prefixes) + new_prefixes = sum(prefixes,[]) + new_suffixes = sum(suffixes,[]) + rewards = get_score(new_prefixes, new_suffixes).view(batch_size, training_stage_num).cpu().detach().numpy() #[batch_size, ranking] + indices = np.argsort(-rewards,axis=1) + prefixes = [[prefixes[i][index] for index in indices[i]] for i in range(batch_size)] + suffixes = [[suffixes[i][index] for index in indices[i]] for i in range(batch_size)] + 
rewards = [[float(rewards[i][index]) for index in indices[i]] for i in range(batch_size)] + return prefixes, suffixes, rewards + +def extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, split='train'): + training_data = [] + with open(root_dir, 'r', encoding='utf-8') as f: + raw_data = f.readlines() + for line in raw_data: + sample = json.loads(line) + if split_trans(sample['split']) == split: + new_sample = {'meta': sample['source'], 'prefix':[],'suffix':[]} + if data_aug: + for s in sample['extended']+sample['available']: + for suffix in s['target']: + assert isinstance(suffix,str) + new_sample['prefix'].append(s['prefix']) + new_sample['suffix'].append(suffix) + else: + for s in sample['available']: + for suffix in s['target']: + assert isinstance(suffix,str) + new_sample['prefix'].append(s['prefix']) + new_sample['suffix'].append(suffix) + training_data.append(new_sample) + if training_stage_num is None: + training_stage_num = len(new_sample['prefix']) + assert training_stage_num == len(new_sample['prefix']) + + if if_score: + batch_size = reward_batch_size // 2 # default; integer division, since batch_size is used as a range() step + for index in tqdm.tqdm(range(0,len(training_data),batch_size),desc="rewarding"): + prefixes = [] + suffixes = [] + if len(training_data)-index < batch_size: + batch_size = len(training_data)-index + for sub_index in range(batch_size): + prefixes.append(training_data[index+sub_index]['prefix']) + suffixes.append(training_data[index+sub_index]['suffix']) + if if_rerank: + prefixes, suffixes, rewards = reward_model_ranker(prefixes,suffixes) + else: + prefixes, suffixes, rewards = concat_wo_ranker(prefixes,suffixes) + for sub_index in range(batch_size): + training_data[index+sub_index]['prefix'] = prefixes[sub_index] + training_data[index+sub_index]['suffix'] = suffixes[sub_index] + training_data[index+sub_index]['reward'] = rewards[sub_index] + else: + for l in training_data: + l['reward'] = [1.0] * len(l['suffix']) + + for l in training_data: + l['sft_index'] = 0 + + return training_data + +if __name__ == '__main__': + root_dir = os.path.join('..','..','data',"preprocessed_data", "summarize_from_feedback", "comparisons") + data_aug = False + os.makedirs(os.path.join('..','..','data','summarize_train_len2'), exist_ok=True) + random.seed(42) + training_data = extract_train_data(root_dir = os.path.join(root_dir, "train.json"), if_score = True, if_rerank=True, split = 'train') + random.shuffle(training_data) + with open(os.path.join('..','..','data','summarize_train_len2','train.json'),'a', encoding='utf-8') as f: + for sample in training_data: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') + + data_aug = False + os.makedirs(os.path.join('..','..','data','summarize_dev'), exist_ok=True) + random.seed(42) + total_dev_data = extract_train_data(root_dir = os.path.join(root_dir, "dev.json"), if_score = True, if_rerank=False, split = 'dev') + random.shuffle(total_dev_data) + with open(os.path.join('..','..','data','summarize_dev','sampled_dev.json'),'a', encoding='utf-8') as f: + for sample in total_dev_data: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') \ No newline at end of file diff --git a/PRO/train/summarize_preprocess_data/step_3_gen_test_data.py b/PRO/train/summarize_preprocess_data/step_3_gen_test_data.py new file mode 100644 index 0000000..fd64e58 --- /dev/null +++ b/PRO/train/summarize_preprocess_data/step_3_gen_test_data.py @@ -0,0 +1,109 @@ +import os +import sys +sys.path.append("..") +import json +import random +import numpy as np +import tqdm +from utils.metrics import 
create_reward_fn_2 +get_score, reward_batch_size = create_reward_fn_2() + +def split_trans(split): + if split == 'train' or split == 'test' or split == 'dev': + return split + elif split == 'valid': + return 'dev' + elif split == 'valid1': + return 'dev' + elif split == 'valid2': + return 'test' + else: + raise Exception('unknown split: {}'.format(split)) + +def concat_wo_ranker(prefixes, suffixes): + #prefixes = [[a,b,c],[d,e,f]] + #suffixes = [[a,b,c],[d,e,f]] + training_stage_num = len(prefixes[0]) + batch_size = len(prefixes) + new_prefixes = sum(prefixes,[]) + new_suffixes = sum(suffixes,[]) + rewards = get_score(new_prefixes, new_suffixes).view(batch_size, training_stage_num).cpu().detach().numpy().tolist() #[batch_size, ranking] + + return prefixes, suffixes, rewards + +def reward_model_ranker(prefixes, suffixes): + #prefixes = [[a,b,c],[d,e,f]] + #suffixes = [[a,b,c],[d,e,f]] + training_stage_num = len(prefixes[0]) + batch_size = len(prefixes) + new_prefixes = sum(prefixes,[]) + new_suffixes = sum(suffixes,[]) + rewards = get_score(new_prefixes, new_suffixes).view(batch_size, training_stage_num).cpu().detach().numpy() #[batch_size, ranking] + indices = np.argsort(-rewards,axis=1) + prefixes = [[prefixes[i][index] for index in indices[i]] for i in range(batch_size)] + suffixes = [[suffixes[i][index] for index in indices[i]] for i in range(batch_size)] + rewards = [[float(rewards[i][index]) for index in indices[i]] for i in range(batch_size)] + return prefixes, suffixes, rewards + +def extract_train_data(root_dir, if_score, if_rerank, training_stage_num = None, split='train'): + training_data = [] + with open(root_dir, 'r', encoding='utf-8') as f: + raw_data = f.readlines() + for line in raw_data: + sample = json.loads(line) + if split_trans(sample['split']) == split: + new_sample = {'meta': sample['source'], 'prefix':[],'suffix':[]} + if data_aug: + for s in sample['extended']+sample['available']: + for suffix in s['target']: + assert isinstance(suffix,str) + new_sample['prefix'].append(s['prefix']) + new_sample['suffix'].append(suffix) + else: + for s in sample['available']: + for suffix in s['target']: + assert isinstance(suffix,str) + new_sample['prefix'].append(s['prefix']) + new_sample['suffix'].append(suffix) + training_data.append(new_sample) + if training_stage_num is None: + training_stage_num = len(new_sample['prefix']) + assert training_stage_num == len(new_sample['prefix']) + + if if_score: + batch_size = reward_batch_size // 2 # default; integer division, since batch_size is used as a range() step + for index in tqdm.tqdm(range(0,len(training_data),batch_size),desc="rewarding"): + prefixes = [] + suffixes = [] + if len(training_data)-index < batch_size: + batch_size = len(training_data)-index + for sub_index in range(batch_size): + prefixes.append(training_data[index+sub_index]['prefix']) + suffixes.append(training_data[index+sub_index]['suffix']) + if if_rerank: + prefixes, suffixes, rewards = reward_model_ranker(prefixes,suffixes) + else: + prefixes, suffixes, rewards = concat_wo_ranker(prefixes,suffixes) + for sub_index in range(batch_size): + training_data[index+sub_index]['prefix'] = prefixes[sub_index] + training_data[index+sub_index]['suffix'] = suffixes[sub_index] + training_data[index+sub_index]['reward'] = rewards[sub_index] + else: + for l in training_data: + l['reward'] = [1.0] * len(l['suffix']) + + for l in training_data: + l['sft_index'] = 0 + + return training_data + +if __name__ == '__main__': + root_dir = os.path.join('..','..','data',"preprocessed_data", "summarize_from_feedback", "comparisons") + data_aug = False + 
os.makedirs(os.path.join('..','..','data','summarize_test'), exist_ok=True) + random.seed(42) + test_data = extract_train_data(root_dir = os.path.join(root_dir, "test.json"), if_score = True, if_rerank=False, split = 'test') + random.shuffle(test_data) + with open(os.path.join('..','..','data','summarize_test','test.json'),'w', encoding='utf-8') as f: + for sample in test_data: + f.write(json.dumps(sample,ensure_ascii=False)+'\n') \ No newline at end of file diff --git a/PRO/train/train3_summarize.sh b/PRO/train/train3_summarize.sh new file mode 100755 index 0000000..7df8fc2 --- /dev/null +++ b/PRO/train/train3_summarize.sh @@ -0,0 +1,28 @@ +export OMP_NUM_THREADS=16 +root_dir=.. + +#stage 23 +id=$1 +data_path=$2 +ranking_len=$3 +mkdir -p $root_dir/logs/$id/$ranking_len +accelerate launch --num_processes 7 --config_file ds_config2.yaml main.py \ + --task summarize \ + --train_file_path $root_dir/data/${data_path} \ + --validation_file_path $root_dir/data/summarize_dev \ + --validation_file_name sampled_dev.json \ + --output_dir $root_dir/checkpoints/index_$id/stage_$ranking_len \ + --log_path $root_dir/logs/$id/$ranking_len \ + --index $id \ + --seed 42 \ + --temperature 1 \ + --sft_weight 0.05 \ + --num_train_epochs 2 \ + --training_stage_num $ranking_len \ + --block_size 720 \ + --learning_rate 5e-6 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 28 \ + --model_name_or_path decapoda-research/llama-7b-hf \ + --do_train \ + --do_validation > $root_dir/logs/$id/$ranking_len/train_detail.log 2>&1 \ No newline at end of file diff --git a/PRO/train/train.sh b/PRO/train/train_hh.sh similarity index 91% rename from PRO/train/train.sh rename to PRO/train/train_hh.sh index e180381..d9c3271 100755 --- a/PRO/train/train.sh +++ b/PRO/train/train_hh.sh @@ -7,8 +7,9 @@ data_path=$2 ranking_len=$3 mkdir -p $root_dir/logs/$id/$ranking_len accelerate launch --num_processes 7 --config_file ds_config.yaml main.py \ + --task hh \ --train_file_path $root_dir/data/${data_path} \ - --validation_file_path $root_dir/data/dev \ + --validation_file_path $root_dir/data/hh_dev \ --validation_file_name sampled_dev.json \ --output_dir $root_dir/checkpoints/index_$id/stage_$ranking_len \ --log_path $root_dir/logs/$id/$ranking_len \ diff --git a/PRO/train/train_summarize.sh b/PRO/train/train_summarize.sh new file mode 100755 index 0000000..8796d7f --- /dev/null +++ b/PRO/train/train_summarize.sh @@ -0,0 +1,28 @@ +export OMP_NUM_THREADS=16 +root_dir=.. 
+ +#stage 23 +id=$1 +data_path=$2 +ranking_len=$3 +mkdir -p $root_dir/logs/$id/$ranking_len +accelerate launch --num_processes 7 --config_file ds_config.yaml main.py \ + --task summarize \ + --train_file_path $root_dir/data/${data_path} \ + --validation_file_path $root_dir/data/summarize_dev \ + --validation_file_name sampled_dev.json \ + --output_dir $root_dir/checkpoints/index_$id/stage_$ranking_len \ + --log_path $root_dir/logs/$id/$ranking_len \ + --index $id \ + --seed 42 \ + --temperature 1 \ + --sft_weight 0.05 \ + --num_train_epochs 2 \ + --training_stage_num $ranking_len \ + --block_size 720 \ + --learning_rate 5e-6 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 28 \ + --model_name_or_path decapoda-research/llama-7b-hf \ + --do_train \ + --do_validation > $root_dir/logs/$id/$ranking_len/train_detail.log 2>&1 \ No newline at end of file diff --git a/PRO/train/utils/config.py b/PRO/train/utils/config.py index 8e51d8c..084f6d0 100644 --- a/PRO/train/utils/config.py +++ b/PRO/train/utils/config.py @@ -6,6 +6,11 @@ from transformers import SchedulerType def parse_args(): parser = argparse.ArgumentParser(description="Preference Ranking Optimization For Human Alignment") + parser.add_argument( + "--task", + type=str, + default="hh", + ) parser.add_argument( "--do_train", action="store_true", @@ -73,7 +78,7 @@ def parse_args(): type=int, default=20, ) - parser.add_argument("--num_train_epochs", type=int, default=1") + parser.add_argument("--num_train_epochs", type=int, default=1) parser.add_argument( "--max_train_steps", type=int, diff --git a/PRO/train/utils/data_manager.py b/PRO/train/utils/data_manager.py index 0af1681..0c4efdb 100644 --- a/PRO/train/utils/data_manager.py +++ b/PRO/train/utils/data_manager.py @@ -16,7 +16,7 @@ from transformers import ( DataCollatorWithPadding, ) -class DataManager(): +class HH_DataManager(): def __init__(self, config, training_stage, tokenizer_path = args.model_name_or_path): self.config = config if self.config.architectures[0].lower() == "llamaforcausallm": @@ -49,7 +49,7 @@ class DataManager(): def train_data_collator(self, features): samples_num = len(features) - training_stage = self.training_stage #len(features[0]['input_ids']) + training_stage = self.training_stage origin_state = (self.tokenizer.padding_side, self.tokenizer.truncation_side) self.tokenizer.truncation_side = "left" @@ -83,7 +83,6 @@ class DataManager(): )['input_ids'] ps_lens = [len(p_input_ids)-1 for p_input_ids in ps_input_ids] - # assemble self.tokenizer.padding_side = "right" self.tokenizer.truncation_side = "right" @@ -100,7 +99,6 @@ class DataManager(): return_tensors = self.return_tensors, ) - #prepare prefix_mask seq_len = batch["attention_mask"].shape[1] prefix_mask = [] for p_len in ps_lens: @@ -168,7 +166,6 @@ class DataManager(): max_length = self.max_length - 128, truncation = True, add_special_tokens = self.add_special_tokens, - # pad_to_multiple_of = self.pad_to_multiple_of, return_tensors = self.return_tensors, ).to(model.device) batch_size = len(prefixes) @@ -182,7 +179,170 @@ class DataManager(): num_beams=1, do_sample=False, num_return_sequences = 1, - ) #tensor + ) + + instant_text = self.batch_decode(predicted_sents) + + # restore states + self.tokenizer.padding_side, self.tokenizer.truncation_side = origin_state + + for index in range(len(instant_text)): + assert truncated_prefixes[index].rstrip() in instant_text[index], (truncated_prefixes[index].strip(), instant_text[index]) + instant_text[index] = 
instant_text[index].replace(truncated_prefixes[index].rstrip(), "").strip() + instant_text[index] = self.early_truncation(instant_text[index]) + + return instant_text + +class Summarize_DataManager(): + def __init__(self, config, training_stage, tokenizer_path = args.model_name_or_path): + self.config = config + if self.config.architectures[0].lower() == "llamaforcausallm": + self.tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path, use_fast=False) + self.tokenizer.unk_token = "<unk>" + self.tokenizer.bos_token = "<s>" + self.tokenizer.eos_token = "</s>" + else: + self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False) + + self.tokenizer.pad_token = self.tokenizer.eos_token + self.padding = True + self.max_length = args.block_size + self.pad_to_multiple_of = 8 + self.return_tensors = "pt" + self.add_special_tokens = True + self.training_stage = training_stage + self.stop_sequences = ["\n\n"] + + def batch_decode(self, model_output): + # model_output = [batch, seq_len] + return self.tokenizer.batch_decode(model_output, skip_special_tokens=True) + + def early_truncation(self, text): + for stop in self.stop_sequences: + stop_ix = text.find(stop) + if stop_ix >= 0: + text = text[:stop_ix].strip() + return text.strip() + + def train_data_collator(self, features): + samples_num = len(features) + training_stage = self.training_stage + origin_state = (self.tokenizer.padding_side, self.tokenizer.truncation_side) + + self.tokenizer.truncation_side = "right" + ps = [] + ss = [] + rs = [] + sft_index = [] + for feature_index, feature in enumerate(features): + for p, s, r in zip(feature['prefix'][:training_stage], feature['suffix'][:training_stage], feature['reward'][:training_stage]): + ps.append(p) + ss.append(s) + rs.append(r) + assert feature["sft_index"] < training_stage + sft_index.append(feature["sft_index"]) + + ps_input_ids = self.tokenizer( + ps, + add_special_tokens = self.add_special_tokens, + )['input_ids'] + ps_lens = [len(p_input_ids)-1 for p_input_ids in ps_input_ids] + + self.tokenizer.padding_side = "right" + self.tokenizer.truncation_side = "right" + + texts = [] + for p, s in zip(ps, ss): + texts.append(p + s) + + batch = self.tokenizer( + texts, + padding=self.padding, + max_length = self.max_length, + truncation = True, + add_special_tokens = self.add_special_tokens, + return_tensors = self.return_tensors, + ) + + seq_len = batch["attention_mask"].shape[1] + prefix_mask = [] + for p_len in ps_lens: + assert seq_len > p_len + prefix_mask.append( + [1 if i