autogen subpackage (#968)

* math utils in autogen

* cleanup

* code utils

* remove check function from code response

* comment out test

* GPT-4

* increase request timeout

* name

* logging and error handling

* better doc

* doc

* codegen optimized

* GPT series

* text

* no demo example

* math

* import openai

* import openai

* azure model name

* azure model name

* openai version

* generate assertion if necessary

* condition to generate assertions

* init region key

* rename

* comments about budget

* prompt

---------

Co-authored-by: Susan Xueqing Liu <liususan091219@users.noreply.github.com>
Chi Wang 2023-04-07 20:04:01 -07:00 committed by GitHub
parent 7f9402b8fd
commit 82f0a4309d
20 changed files with 5251 additions and 3638 deletions


@@ -23,9 +23,9 @@
## What is FLAML
FLAML is a lightweight Python library that finds accurate machine
learning models automatically, efficiently and economically. It frees users from selecting
models and hyperparameters for each model. It can also be used to tune generic hyperparameters for large language models (LLM), MLOps/LMOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations and so on.
models and hyperparameters for each model. It can also be used to tune generic hyperparameters for foundation models, MLOps/LMOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations and so on.
1. For common machine learning or AI tasks like classification, regression, and generation, it quickly finds quality models for user-provided data with low computational resources. It supports both classical machine learning models and deep neural networks, including large language models such as the OpenAI GPT-3 models.
1. For common machine learning or AI tasks like classification, regression, and generation, it quickly finds quality models for user-provided data with low computational resources. It supports both classical machine learning models and deep neural networks, including foundation models such as the GPT series.
1. It is easy to customize or extend. Users can find their desired customizability from a smooth range: minimal customization (computational resource budget), medium customization (e.g., scikit-style learner, search space and metric), or full customization (arbitrary training and evaluation code).
1. It supports fast automatic tuning, capable of handling complex constraints/guidance/early stopping. FLAML is powered by a new, [cost-effective
hyperparameter optimization](https://microsoft.github.io/FLAML/docs/Use-Cases/Tune-User-Defined-Function/#hyperparameter-optimization-algorithm)
@@ -95,6 +95,22 @@ estimator = LGBMRegressor()
estimator.fit(X_train, y_train)
```
* (New) You can optimize [generations](https://microsoft.github.io/FLAML/docs/Use-Cases/Auto-Generation) by ChatGPT or GPT-4 etc. with your own tuning data, success metrics and budgets.
```python
from flaml import oai
config, analysis = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_func,
inference_budget=0.05,
optimization_budget=3,
num_samples=-1,
)
```
## Documentation
You can find detailed documentation about FLAML [here](https://microsoft.github.io/FLAML/), including the API documentation, use cases and examples.


@@ -2,7 +2,7 @@ import logging
from flaml.automl import AutoML, logger_formatter
from flaml.tune.searcher import CFO, BlendSearch, FLOW2, BlendSearchTuner, RandomSearch
from flaml.onlineml.autovw import AutoVW
from flaml.integrations import oai
from flaml.autogen import oai
from flaml.version import __version__

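For reference, a minimal sketch (not part of the diff) of how the relocated subpackage is imported after this change; the top-level `flaml.oai` name keeps working because `flaml/__init__.py` re-exports it from `flaml.autogen`:

```python
# Minimal sketch, assuming flaml is installed with the [openai] option.
from flaml import oai                     # re-exported from flaml.autogen.oai
from flaml.autogen.oai import Completion  # direct import of the same class

assert oai.Completion is Completion
```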
flaml/autogen/code_utils.py Normal file

@@ -0,0 +1,181 @@
import signal
import subprocess
import sys
from typing import List, Dict, Tuple, Optional, Union, Callable
from flaml import oai
def timeout_handler(signum, frame):
raise TimeoutError("Timed out!")
def execute_code(code: str, max_exec_time: Optional[int] = 3):
"""Execute the code in a subprocess with a timeout; return 1 on success, 0 otherwise."""
signal.signal(signal.SIGALRM, timeout_handler)
code = code.strip()
with open("codetest.py", "w") as fout:
fout.write(code)
try:
signal.alarm(max_exec_time)
result = subprocess.run(
[sys.executable, "codetest.py"],
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE,
)
signal.alarm(0)
except TimeoutError:
return 0
return int(result.returncode == 0)
def generate_assertions(
definition: str, model: Optional[str] = "gpt-3.5-turbo"
) -> Tuple[str, float]:
"""Generate assertions for a function.
Args:
definition (str): The function definition, including the signature and docstr.
model (str): The model used for generation.
Returns:
str: The generated assertions.
float: The cost of the generation.
"""
prompt = """Given the signature and docstring, write the exactly same number of assertion(s) for the provided example(s) in the docstring, without assertion messages.
func signature:
{definition}
assertions:"""
response = oai.Completion.create(
{"definition": definition},
model=model,
prompt=prompt,
max_tokens=256,
stop="\n\n",
)
cost = oai.Completion.cost(model, response)
assertions = oai.Completion.extract_text(response)[0]
return assertions, cost
def _remove_check(response):
"""Remove the check function from the response."""
# find the position of the check function
pos = response.find("def check(")
if pos == -1:
return response
return response[:pos]
def eval_function_completions(
responses: List[str],
definition: str,
test: Optional[str] = None,
entry_point: Optional[str] = None,
assertions: Optional[Union[str, Callable[[str], Tuple[str, float]]]] = None,
) -> Dict:
"""Select a response from a list of responses for the function completion task (using generated assertions), and/or evaluate if the task is successful using a gold test.
Args:
responses (list): The list of responses.
definition (str): The input definition.
test (Optional, str): The test code.
entry_point (Optional, str): The name of the function.
assertions (Optional, str or Callable): The assertion code which serves as a filter of the responses, or an assertion generator.
When provided, only the responses that pass the assertions will be considered for the actual test (if provided).
Returns:
dict: The success metrics.
"""
n = len(responses)
if assertions is None:
# no assertion filter
success_list = []
for i in range(n):
response = _remove_check(responses[i])
code = (
f"{response}\n{test}\ncheck({entry_point})"
if response.startswith("def")
else f"{definition}{response}\n{test}\ncheck({entry_point})"
)
success = execute_code(code)
success_list.append(success)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
}
if callable(assertions) and n > 1:
# assertion generator
assertions, gen_cost = assertions(definition)
else:
gen_cost = 0
if n > 1 or test is None:
for i in range(n):
response = responses[i] = _remove_check(responses[i])
code = (
f"{response}\n{assertions}"
if response.startswith("def")
else f"{definition}{response}\n{assertions}"
)
succeed_assertions = execute_code(code)
if succeed_assertions:
break
else:
# just test, no need to check assertions
succeed_assertions = False
i, response = 0, responses[0]
if test is None:
# no test code
return {
"index_selected": i,
"succeed_assertions": succeed_assertions,
"gen_cost": gen_cost,
"assertions": assertions,
}
code_test = (
f"{response}\n{test}\ncheck({entry_point})"
if response.startswith("def")
else f"{definition}{response}\n{test}\ncheck({entry_point})"
)
success = execute_code(code_test)
return {
"index_selected": i,
"succeed_assertions": succeed_assertions,
"success": success,
"gen_cost": gen_cost,
"assertions": assertions,
}
def implement(
definition: str,
configs: List[Dict],
assertions: Optional[
Union[str, Callable[[str], Tuple[str, float]]]
] = generate_assertions,
) -> Tuple[str, float, int]:
"""Implement a function from a definition.
Args:
definition (str): The function definition, including the signature and docstr.
configs (list): The list of configurations for completion.
assertions (Optional, str or Callable): The assertion code which serves as a filter of the responses, or an assertion generator.
Returns:
str: The implementation.
float: The cost of the implementation.
int: The index of the configuration which generates the implementation.
"""
cost = 0
if len(configs) > 1 and callable(assertions):
assertions, cost = assertions(definition)
for i, config in enumerate(configs):
response = oai.Completion.create({"definition": definition}, **config)
cost += oai.Completion.cost(config["model"], response)
responses = oai.Completion.extract_text(response)
metrics = eval_function_completions(
responses, definition, assertions=assertions
)
assertions = metrics["assertions"]
cost += metrics["gen_cost"]
if metrics["succeed_assertions"] or i == len(configs) - 1:
return responses[metrics["index_selected"]], cost, i

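A minimal usage sketch of the new `implement` helper (not part of the diff); the `definition` string and config values below are made-up examples, and an OpenAI API key is assumed to be configured:

```python
from flaml import oai
from flaml.autogen.code_utils import implement

# A made-up HumanEval-style definition: signature plus docstring with examples.
definition = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b.
    >>> add(1, 2)
    3
    """
'''
# Try a cheap config first and fall back to a stronger one; assertions generated
# by generate_assertions (the default) filter the candidate completions.
configs = [
    {"model": "gpt-3.5-turbo", "prompt": "# Python 3{definition}", "temperature": 0, "seed": 0},
    {"model": "gpt-4", "prompt": "# Python 3{definition}", "n": 2, "seed": 0},
]
oai.Completion.set_cache(0)  # cache responses on disk under .cache/0
response, cost, config_index = implement(definition, configs)
print(f"config {config_index} used, cost ${cost:.4f}")
print(response)
```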
flaml/autogen/math_utils.py Normal file

@@ -0,0 +1,312 @@
from typing import Optional
def remove_boxed(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the text within a \\boxed{...} environment.
Example:
>>> remove_boxed("\\boxed{\\frac{2}{3}}")
\\frac{2}{3}
"""
left = "\\boxed{"
try:
assert string[: len(left)] == left
assert string[-1] == "}"
return string[len(left) : -1]
except Exception:
return None
def last_boxed_only_string(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the last \\boxed{...} or \\fbox{...} element from a string.
"""
idx = string.rfind("\\boxed")
if idx < 0:
idx = string.rfind("\\fbox")
if idx < 0:
return None
i = idx
right_brace_idx = None
num_left_braces_open = 0
while i < len(string):
if string[i] == "{":
num_left_braces_open += 1
if string[i] == "}":
num_left_braces_open -= 1
if num_left_braces_open == 0:
right_brace_idx = i
break
i += 1
if right_brace_idx is None:
retval = None
else:
retval = string[idx : right_brace_idx + 1]
return retval
def _fix_fracs(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions.
Examples:
>>> _fix_fracs("\\frac1b")
\\frac{1}{b}
>>> _fix_fracs("\\frac12")
\\frac{1}{2}
>>> _fix_fracs("\\frac1{72}")
\\frac{1}{72}
"""
substrs = string.split("\\frac")
new_str = substrs[0]
if len(substrs) > 1:
substrs = substrs[1:]
for substr in substrs:
new_str += "\\frac"
if substr[0] == "{":
new_str += substr
else:
try:
assert len(substr) >= 2
except Exception:
return string
a = substr[0]
b = substr[1]
if b != "{":
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}{" + b + "}" + post_substr
else:
new_str += "{" + a + "}{" + b + "}"
else:
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}" + b + post_substr
else:
new_str += "{" + a + "}" + b
string = new_str
return string
def _fix_a_slash_b(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions formatted as a/b to \\frac{a}{b}.
Example:
>>> _fix_a_slash_b("2/3")
\\frac{2}{3}
"""
if len(string.split("/")) != 2:
return string
a_str = string.split("/")[0]
b_str = string.split("/")[1]
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:
return string
def _remove_right_units(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Remove units (on the right).
"\\text{ " only ever occurs (at least in the val set) when describing units.
"""
if "\\text{ " in string:
splits = string.split("\\text{ ")
assert len(splits) == 2
return splits[0]
else:
return string
def _fix_sqrt(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat square roots.
Example:
>>> _fix_sqrt("\\sqrt3")
\\sqrt{3}
"""
if "\\sqrt" not in string:
return string
splits = string.split("\\sqrt")
new_string = splits[0]
for split in splits[1:]:
if split[0] != "{":
a = split[0]
new_substr = "\\sqrt{" + a + "}" + split[1:]
else:
new_substr = "\\sqrt" + split
new_string += new_substr
return new_string
def _strip_string(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Apply the reformatting helper functions above.
"""
# linebreaks
string = string.replace("\n", "")
# print(string)
# remove inverse spaces
string = string.replace("\\!", "")
# print(string)
# replace \\ with \
string = string.replace("\\\\", "\\")
# print(string)
# replace tfrac and dfrac with frac
string = string.replace("tfrac", "frac")
string = string.replace("dfrac", "frac")
# print(string)
# remove \left and \right
string = string.replace("\\left", "")
string = string.replace("\\right", "")
# print(string)
# Remove circ (degrees)
string = string.replace("^{\\circ}", "")
string = string.replace("^\\circ", "")
# remove dollar signs
string = string.replace("\\$", "")
# remove units (on the right)
string = _remove_right_units(string)
# remove percentage
string = string.replace("\\%", "")
string = string.replace("%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
string = string.replace("{.", "{0.")
# if empty, return empty string
if len(string) == 0:
return string
if string[0] == ".":
string = "0" + string
# to consider: get rid of e.g. "k = " or "q = " at beginning
if len(string.split("=")) == 2:
if len(string.split("=")[0]) <= 2:
string = string.split("=")[1]
# fix sqrt3 --> sqrt{3}
string = _fix_sqrt(string)
# remove spaces
string = string.replace(" ", "")
# \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc.
# Even works with \frac1{72} (but not \frac{72}1).
# Also does a/b --> \\frac{a}{b}
string = _fix_fracs(string)
# manually change 0.5 --> \frac{1}{2}
if string == "0.5":
string = "\\frac{1}{2}"
# NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
string = _fix_a_slash_b(string)
return string
def get_answer(solution: Optional[str]) -> Optional[str]:
if solution is None:
return None
last_boxed = last_boxed_only_string(solution)
if last_boxed is None:
return None
answer = remove_boxed(last_boxed)
if answer is None:
return None
return answer
def is_equiv(str1: Optional[str], str2: Optional[str]) -> float:
"""Returns (as a float) whether two strings containing math are equivalent up to differences of formatting in
- units
- fractions
- square roots
- superfluous LaTeX.
Source: https://github.com/hendrycks/math
"""
if str1 is None and str2 is None:
print("WARNING: Both None")
return 1.0
if str1 is None or str2 is None:
return 0.0
try:
ss1 = _strip_string(str1)
ss2 = _strip_string(str2)
return float(ss1 == ss2)
except Exception:
return float(str1 == str2)
def is_equiv_chain_of_thought(str1: str, str2: str) -> float:
"""Strips the solution first before calling `is_equiv`."""
ans1 = get_answer(str1)
ans2 = get_answer(str2)
return is_equiv(ans1, ans2)
def voting_counts(responses):
answers = {}
for i in range(len(responses)):
equiv = i
if get_answer(responses[i]) is None:
# ignore None answers
continue
for j in answers:
if is_equiv_chain_of_thought(responses[i], responses[j]):
equiv = j
break
if equiv in answers:
answers[equiv] += 1
else:
answers[equiv] = 1
return answers
def eval_math_responses(responses, solution=None, **args):
"""Select a response for a math problem using voting, and check if the response is correct if the solution is provided.
Args:
responses (list): The list of responses.
solution (str): The canonical solution.
Returns:
dict: The success metrics.
"""
success_list = []
n = len(responses)
if solution is not None:
for i in range(n):
response = responses[i]
succeed = is_equiv_chain_of_thought(response, solution)
success_list.append(succeed)
# voting
answers = voting_counts(responses)
# find the answer with highest votes in answers
answer, votes = max(answers.items(), key=lambda x: x[1], default=(0, 0))
# check if the answer is correct
success_vote = is_equiv_chain_of_thought(responses[answer], solution)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
"success_vote": success_vote,
"voted_answer": responses[answer],
"votes": votes,
}

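A minimal sketch (not part of the diff) of the math utilities above; the strings are made-up responses, and no API call is needed because these helpers are pure functions:

```python
from flaml.autogen.math_utils import eval_math_responses, is_equiv

solution = "Counting the outcomes gives a probability of $\\boxed{\\frac{2}{3}}$."
responses = [
    "The probability is $\\boxed{2/3}$.",
    "Simplifying, the answer is $\\boxed{\\frac{2}{3}}$.",
    "I get $\\boxed{\\frac{1}{3}}$.",
]
print(is_equiv("\\frac{2}{3}", "2/3"))           # 1.0: formatting differences are ignored
print(eval_math_responses(responses, solution))  # includes success_vote and voted_answer
```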

@@ -0,0 +1,3 @@
from flaml.autogen.oai.completion import Completion, ChatCompletion
__all__ = ["Completion", "ChatCompletion"]


@@ -2,7 +2,10 @@ from time import sleep
import logging
import numpy as np
import time
from typing import List
import sys
from flaml import tune, BlendSearch
from flaml.automl.logger import logger_formatter
try:
import openai
@@ -22,6 +25,11 @@ except ImportError:
"please install flaml[openai] option to use the flaml.oai subpackage."
)
logger = logging.getLogger(__name__)
if not logger.handlers:
# Add the console handler.
_ch = logging.StreamHandler(stream=sys.stdout)
_ch.setFormatter(logger_formatter)
logger.addHandler(_ch)
def get_key(config):
@@ -50,6 +58,7 @@ class Completion:
chat_models = {
"gpt-3.5-turbo",
"gpt-3.5-turbo-0301",
"gpt-35-turbo",
"gpt-4",
"gpt-4-32k",
"gpt-4-32k-0314",
@@ -67,6 +76,7 @@ class Completion:
"text-davinci-003": 0.02,
"gpt-3.5-turbo": 0.002,
"gpt-3.5-turbo-0301": 0.002,
"gpt-35-turbo": 0.002,
"gpt-4": (0.03, 0.06),
"gpt-4-0314": (0.03, 0.06),
"gpt-4-32k": (0.06, 0.12),
@@ -95,12 +105,13 @@ class Completion:
}
seed = 41
cache_path = f".cache/{seed}"
# retry after this many seconds
retry_time = 10
# fail a request after hitting RateLimitError for this many seconds
retry_timeout = 60
retry_timeout = 120
# timeout for a request to the openai server
request_timeout = 30
request_timeout = 60
openai_completion_class = not ERROR and openai.Completion
_total_cost = 0
@@ -156,14 +167,18 @@ class Completion:
# retry after retry_time seconds
if time.time() - start_time + cls.retry_time < cls.retry_timeout:
logger.info(f"retrying in {cls.retry_time} seconds...", exc_info=1)
elif not eval_only:
elif eval_only:
raise
else:
break
sleep(cls.retry_time)
except InvalidRequestError:
if "azure" == openai.api_type and "model" in config:
# azure api uses "engine" instead of "model"
config = config.copy()
config["engine"] = config.pop("model")
config["engine"] = config.pop("model").replace(
"gpt-3.5-turbo", "gpt-35-turbo"
)
else:
raise
logger.warning(
@@ -219,6 +234,13 @@ class Completion:
num_completions, invalid_n.get(max_tokens, np.inf)
)
@classmethod
def _pop_subspace(cls, config):
if "subspace" in config:
config = config.copy()
config.update(config.pop("subspace"))
return config
@classmethod
def _get_prompt_messages_from_config(cls, model, config):
prompt, messages = None, None
@@ -254,6 +276,7 @@ class Completion:
"""
cost = 0
data = cls.data
config = cls._pop_subspace(config)
model = config["model"]
data_length = len(data)
price = cls.price1K.get(model)
@@ -300,8 +323,10 @@ class Completion:
start_n = max_valid_n + 1
else:
start_n = config_n
region_key = None
params = config.copy()
params["stop"] = stop
if "stop" in config:
params["stop"] = stop
temperature_or_top_p = params.pop("temperature_or_top_p", None)
if temperature_or_top_p:
params.update(temperature_or_top_p)
@@ -329,11 +354,7 @@ class Completion:
result["cost"] = cost
return result
# evaluate the quality of the responses
responses = (
[r["message"]["content"].rstrip() for r in response["choices"]]
if model in cls.chat_models
else [r["text"].rstrip() for r in response["choices"]]
)
responses = cls.extract_text(response)
usage = response["usage"]
n_input_tokens = usage["prompt_tokens"]
n_output_tokens = usage.get("completion_tokens", 0)
@@ -491,11 +512,12 @@ class Completion:
```
log_file_name (str, optional): The log file.
inference_budget (float, optional): The inference budget.
optimization_budget (float, optional): The optimization budget.
inference_budget (float, optional): The inference budget in dollars per instance.
optimization_budget (float, optional): The total optimization budget in dollars.
num_samples (int, optional): The number of samples to evaluate.
-1 means no hard restriction in the number of trials
and the actual number is decided by optimization_budget. Defaults to 1.
logging_level (optional): logging level. Defaults to logging.WARNING.
**config (dict): The search space to update over the default search.
For prompt, please provide a string/Callable or a list of strings/Callables.
- If prompt is provided for chat models, it will be converted to messages under role "user".
@@ -570,22 +592,38 @@ class Completion:
cls.data = data
cls.avg_input_tokens = None
search_alg = BlendSearch(
cost_attr="cost",
cost_budget=optimization_budget,
metric=metric,
mode=mode,
space=space,
)
space_model = space["model"]
if not isinstance(space_model, str) and len(space_model) > 1:
# make a hierarchical search space
subspace = {}
if "max_tokens" in space:
subspace["max_tokens"] = space.pop("max_tokens")
if "temperature_or_top_p" in space:
subspace["temperature_or_top_p"] = space.pop("temperature_or_top_p")
if "best_of" in space:
subspace["best_of"] = space.pop("best_of")
if "n" in space:
subspace["n"] = space.pop("n")
choices = []
for model in space["model"]:
choices.append({"model": model, **subspace})
space["subspace"] = tune.choice(choices)
space.pop("model")
# start all the models with the same hp config
search_alg = BlendSearch(
cost_attr="cost",
cost_budget=optimization_budget,
metric=metric,
mode=mode,
space=space,
)
config0 = search_alg.suggest("t0")
points_to_evaluate = [config0]
for model in space_model:
if model != config0["model"]:
if model != config0["subspace"]["model"]:
point = config0.copy()
point["model"] = model
point["subspace"] = point["subspace"].copy()
point["subspace"]["model"] = model
points_to_evaluate.append(point)
search_alg = BlendSearch(
cost_attr="cost",
@@ -595,6 +633,15 @@ class Completion:
space=space,
points_to_evaluate=points_to_evaluate,
)
else:
search_alg = BlendSearch(
cost_attr="cost",
cost_budget=optimization_budget,
metric=metric,
mode=mode,
space=space,
)
old_level = logger.getEffectiveLevel()
logger.setLevel(logging_level)
with diskcache.Cache(cls.cache_path) as cls._cache:
analysis = tune.run(
@@ -605,7 +652,7 @@ class Completion:
verbose=3,
)
config = analysis.best_config
params = config.copy()
params = cls._pop_subspace(config)
if cls._prompts:
params["prompt"] = cls._prompts[config["prompt"]]
else:
@@ -615,6 +662,7 @@ class Completion:
temperature_or_top_p = params.pop("temperature_or_top_p", None)
if temperature_or_top_p:
params.update(temperature_or_top_p)
logger.setLevel(old_level)
return params, analysis
@classmethod
@@ -636,12 +684,14 @@ class Completion:
if ERROR:
raise ERROR
params = cls._construct_params(context, config)
if use_cache:
with diskcache.Cache(cls.cache_path) as cls._cache:
return cls._get_response(params)
return cls.openai_completion_class.create(
request_timeout=cls.request_timeout, **params
)
if not use_cache:
return cls._get_response(params, eval_only=True, use_cache=False)
seed = cls.seed
if "seed" in params:
cls.set_cache(params.pop("seed"))
with diskcache.Cache(cls.cache_path) as cls._cache:
cls.set_cache(seed)
return cls._get_response(params, eval_only=True)
@classmethod
def _construct_params(cls, data_instance, config, prompt=None, messages=None):
@@ -698,8 +748,7 @@ class Completion:
use_cache=True,
agg_method="avg",
return_responses_and_per_instance_result=False,
seed=41,
cache_path=".cache",
logging_level=logging.WARNING,
):
"""Evaluate the responses created with the config for the OpenAI API call.
@@ -750,54 +799,45 @@ class Completion:
return_responses_and_per_instance_result (bool): Whether to also return responses
and per instance results in addition to the aggregated results.
seed (int): Random seed for the evaluation. Defaults to 41.
cache_path (str): Path to the cache directory. Defaults to '.cache'.
If a cache directory does not exist, it will be created, otherwise use the existing one.
logging_level (optional): logging level. Defaults to logging.WARNING.
Returns:
None in case of rate limit error or when a valid eval_func is not provided in either test or tune;
None when no valid eval_func is provided in either test or tune;
Otherwise, a dict of aggregated results, responses and per instance results if `return_responses_and_per_instance_result` is True;
Otherwise, a dict of aggregated results (responses and per instance results are not returned).
"""
model = config["model"]
result_agg, responses_list, result_list = {}, [], []
metric_keys = None
cls.set_cache(seed, cache_path)
with diskcache.Cache(cls.cache_path) as cls._cache:
for i, data_i in enumerate(data):
logger.info(f"evaluating data instance {i}")
params = cls._construct_params(data_i, config)
response = cls._get_response(
params, eval_only=True, use_cache=use_cache
cost = 0
model = config["model"]
old_level = logger.getEffectiveLevel()
logger.setLevel(logging_level)
for i, data_i in enumerate(data):
logger.info(f"evaluating data instance {i}")
response = cls.create(data_i, use_cache, **config)
cost += cls.cost(model, response)
# evaluate the quality of the responses
responses = cls.extract_text(response)
if eval_func is not None:
metrics = eval_func(responses, **data_i)
elif hasattr(cls, "_eval_func"):
metrics = cls._eval_func(responses, **data_i)
else:
logger.warning(
"Please either provide a valid eval_func or do the test after the tune function is called."
)
if response == -1: # rate limit error, treat as invalid
return None
# evaluate the quality of the responses
responses = (
[r["message"]["content"].rstrip() for r in response["choices"]]
if model in cls.chat_models
else [r["text"].rstrip() for r in response["choices"]]
)
if eval_func is not None:
metrics = eval_func(responses, **data_i)
elif hasattr(cls, "_eval_func"):
metrics = cls._eval_func(responses, **data_i)
else:
logger.warning(
"Please either provide a valid eval_func or do the test after the tune function is called"
)
return
if not metric_keys:
metric_keys = []
for k in metrics.keys():
try:
_ = float(metrics[k])
metric_keys.append(k)
except ValueError:
pass
result_list.append(metrics)
if return_responses_and_per_instance_result:
responses_list.append(responses)
return
if not metric_keys:
metric_keys = []
for k in metrics.keys():
try:
_ = float(metrics[k])
metric_keys.append(k)
except ValueError:
pass
result_list.append(metrics)
if return_responses_and_per_instance_result:
responses_list.append(responses)
if isinstance(agg_method, str):
if agg_method in ["avg", "average"]:
for key in metric_keys:
@@ -824,25 +864,57 @@ class Completion:
"agg_method needs to be a string ('avg' or 'median'),\
or a callable, or a dictionary of callable."
)
logger.setLevel(old_level)
# should we also return the result_list and responses_list or not?
if "cost" not in result_agg:
result_agg["cost"] = cost
if "inference_cost" not in result_agg:
result_agg["inference_cost"] = cost / len(data)
if return_responses_and_per_instance_result:
return result_agg, result_list, responses_list
else:
return result_agg
@classmethod
def cost(cls, model: str, response: dict):
"""Compute the cost of a completion.
Args:
model (str): The model name.
response (dict): The response from OpenAI API.
Returns:
The cost in USD.
"""
if model not in cls.price1K:
raise ValueError(f"Unknown model: {model}")
usage = response["usage"]
n_input_tokens = usage["prompt_tokens"]
n_output_tokens = usage.get("completion_tokens", 0)
price1K = cls.price1K[model]
if isinstance(price1K, tuple):
return (price1K[0] * n_input_tokens + price1K[1] * n_output_tokens) / 1000
return price1K * (n_input_tokens + n_output_tokens) / 1000
@classmethod
def extract_text(cls, response: dict) -> List[str]:
"""Extract the text from a completion response.
Args:
response (dict): The response from OpenAI API.
Returns:
A list of text in the responses.
"""
choices = response["choices"]
if "text" in choices[0]:
return [choice["text"] for choice in choices]
return [choice["message"]["content"] for choice in choices]
class ChatCompletion(Completion):
"""A class for OpenAI API ChatCompletion."""
price1K = {
"gpt-3.5-turbo": 0.002,
"gpt-3.5-turbo-0301": 0.002,
"gpt-4": (0.03, 0.06),
"gpt-4-0314": (0.03, 0.06),
"gpt-4-32k": (0.06, 0.12),
"gpt-4-32k-0314": (0.06, 0.12),
}
default_search_space = Completion.default_search_space.copy()
default_search_space["model"] = tune.choice(["gpt-3.5-turbo", "gpt-4"])
openai_completion_class = not ERROR and openai.ChatCompletion

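A minimal sketch (not part of the diff) that exercises the helpers added here (`cost`, `extract_text`, and `test` with `logging_level`); the tiny dataset is a made-up placeholder and an OpenAI API key is assumed to be configured:

```python
import logging
from flaml import oai
from flaml.autogen.math_utils import eval_math_responses

config = {"model": "gpt-4", "prompt": "{problem} Put the final answer in \\boxed{{}}.", "n": 3}
oai.ChatCompletion.set_cache(41)          # responses cached under .cache/41
oai.ChatCompletion.request_timeout = 120  # per-request timeout in seconds

# Single call: extract the generated texts and compute the dollar cost.
response = oai.ChatCompletion.create(context={"problem": "What is 2 + 2?"}, **config)
print(oai.ChatCompletion.extract_text(response))
print(oai.ChatCompletion.cost("gpt-4", response))

# Evaluate a config over a (made-up) dataset with INFO-level progress logging.
test_data = [{"problem": "What is 2 + 2?", "solution": "The answer is $\\boxed{4}$."}]
result = oai.ChatCompletion.test(
    test_data, config, eval_math_responses, logging_level=logging.INFO
)
print(result)  # aggregated metrics plus "cost" and "inference_cost"
```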

@@ -1,3 +0,0 @@
from flaml.integrations.oai.completion import Completion, ChatCompletion
__all__ = ["Completion", "ChatCompletion"]

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,787 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License.\n",
"\n",
"# Use FLAML to Optimize Code Generation Performance\n",
"\n",
"In this notebook, we optimize OpenAI models for code generation. We use [the HumanEval benchmark](https://huggingface.co/datasets/openai_humaneval) released by OpenAI for synthesizing programs from docstrings. \n",
"\n",
"## Requirements\n",
"\n",
"FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:\n",
"```bash\n",
"pip install flaml[openai]==1.2.0\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:36.910966Z",
"iopub.status.busy": "2023-02-24T23:25:36.910473Z",
"iopub.status.idle": "2023-02-24T23:25:36.914554Z",
"shell.execute_reply": "2023-02-24T23:25:36.914030Z"
}
},
"outputs": [],
"source": [
"# %pip install flaml[openai]==1.2.0 datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Set your OpenAI key:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:36.917301Z",
"iopub.status.busy": "2023-02-24T23:25:36.917011Z",
"iopub.status.idle": "2023-02-24T23:25:36.923156Z",
"shell.execute_reply": "2023-02-24T23:25:36.922619Z"
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = \"<your OpenAI API key here>\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you use Azure OpenAI, uncomment the following:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:36.925804Z",
"iopub.status.busy": "2023-02-24T23:25:36.925423Z",
"iopub.status.idle": "2023-02-24T23:25:36.928191Z",
"shell.execute_reply": "2023-02-24T23:25:36.927673Z"
}
},
"outputs": [],
"source": [
"# import openai\n",
"# openai.api_type = \"azure\"\n",
"# openai.api_base = \"https://<your_endpoint>.openai.azure.com/\"\n",
"# openai.api_version = \"2023-03-15-preview\" # change if necessary"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load dataset\n",
"\n",
"First, we load the humaneval dataset. The dataset contains 164 examples. In each example, the \"prompt\" is the prompt string for eliciting the code generation (renamed into \"definition\"), \"test\" is the Python code for unit test for the example, and \"entry_point\" is the function name to be tested."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:36.931255Z",
"iopub.status.busy": "2023-02-24T23:25:36.930838Z",
"iopub.status.idle": "2023-02-24T23:25:39.148799Z",
"shell.execute_reply": "2023-02-24T23:25:39.148113Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset openai_humaneval (/home/vscode/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1fdc8853bf2a4aecaa2cd024ad99b5a2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading cached shuffled indices for dataset at /home/vscode/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75/cache-1e8448101c1b32e8.arrow\n"
]
}
],
"source": [
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"openai_humaneval\")[\"test\"].shuffle(seed=seed)\n",
"data = data.select(range(len(data))).rename_column(\"prompt\", \"definition\").remove_columns([\"task_id\", \"canonical_solution\"])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:39.164187Z",
"iopub.status.busy": "2023-02-24T23:25:39.163867Z",
"iopub.status.idle": "2023-02-24T23:25:39.169009Z",
"shell.execute_reply": "2023-02-24T23:25:39.168427Z"
}
},
"outputs": [],
"source": [
"from flaml.autogen.code_utils import eval_function_completions, implement\n",
"from flaml import oai"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `implement` function will first generate assertion statements for a problem. Then, it uses the assertions to select the generated responses."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-24T23:25:39.179030Z",
"iopub.status.busy": "2023-02-24T23:25:39.178624Z",
"iopub.status.idle": "2023-02-24T23:25:40.584410Z",
"shell.execute_reply": "2023-02-24T23:25:40.583802Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example 0, config 1, success 1\n",
"Example 1, config 0, success 2\n",
"Example 2, config 0, success 3\n",
"Example 3, config 2, success 4\n",
"Example 4, config 2, success 5\n",
"Example 5, config 4, success 6\n",
"Example 6, config 4, success 6\n",
"Example 7, config 2, success 7\n",
"Example 8, config 2, success 8\n",
"Example 9, config 0, success 9\n",
"Example 10, config 1, success 10\n",
"Example 11, config 0, success 10\n",
"Example 12, config 2, success 11\n",
"Example 13, config 2, success 12\n",
"Example 14, config 0, success 13\n",
"Example 15, config 2, success 14\n",
"Example 16, config 0, success 15\n",
"Example 17, config 1, success 15\n",
"Example 18, config 1, success 16\n",
"Example 19, config 3, success 17\n",
"Example 20, config 2, success 18\n",
"Example 21, config 2, success 19\n",
"Example 22, config 2, success 19\n",
"Example 23, config 2, success 20\n",
"Example 24, config 0, success 21\n",
"Example 25, config 0, success 22\n",
"Example 26, config 4, success 23\n",
"Example 27, config 2, success 24\n",
"Example 28, config 4, success 24\n",
"Example 29, config 2, success 25\n",
"Example 30, config 2, success 26\n",
"Example 31, config 0, success 27\n",
"Example 32, config 0, success 28\n",
"Example 33, config 0, success 29\n",
"Example 34, config 2, success 30\n",
"Example 35, config 1, success 30\n",
"Example 36, config 0, success 31\n",
"Example 37, config 0, success 32\n",
"Example 38, config 0, success 33\n",
"Example 39, config 2, success 34\n",
"Example 40, config 0, success 35\n",
"Example 41, config 0, success 36\n",
"Example 42, config 3, success 37\n",
"Example 43, config 0, success 38\n",
"Example 44, config 2, success 39\n",
"Example 45, config 2, success 40\n",
"Example 46, config 2, success 40\n",
"Example 47, config 0, success 41\n",
"Example 48, config 3, success 42\n",
"Example 49, config 2, success 43\n",
"Example 50, config 1, success 44\n",
"Example 51, config 2, success 45\n",
"Example 52, config 3, success 46\n",
"Example 53, config 2, success 47\n",
"Example 54, config 0, success 48\n",
"Example 55, config 2, success 49\n",
"Example 56, config 2, success 50\n",
"Example 57, config 2, success 51\n",
"Example 58, config 0, success 52\n",
"Example 59, config 1, success 53\n",
"Example 60, config 0, success 53\n",
"Example 61, config 0, success 54\n",
"Example 62, config 1, success 55\n",
"Example 63, config 1, success 56\n",
"Example 64, config 0, success 57\n",
"Example 65, config 2, success 58\n",
"Example 66, config 2, success 59\n",
"Example 67, config 2, success 60\n",
"Example 68, config 2, success 61\n",
"Example 69, config 4, success 61\n",
"Example 70, config 2, success 62\n",
"Example 71, config 0, success 63\n",
"Example 72, config 0, success 64\n",
"Example 73, config 0, success 65\n",
"Example 74, config 0, success 66\n",
"Example 75, config 0, success 67\n",
"Example 76, config 1, success 68\n",
"Example 77, config 2, success 69\n",
"Example 78, config 1, success 70\n",
"Example 79, config 4, success 70\n",
"Example 80, config 2, success 71\n",
"Example 81, config 2, success 72\n",
"Example 82, config 0, success 72\n",
"Example 83, config 0, success 73\n",
"Example 84, config 4, success 73\n",
"Example 85, config 3, success 74\n",
"Example 86, config 0, success 75\n",
"Example 87, config 2, success 76\n",
"Example 88, config 2, success 77\n",
"Example 89, config 1, success 78\n",
"Example 90, config 0, success 79\n",
"Example 91, config 2, success 80\n",
"Example 92, config 1, success 81\n",
"Example 93, config 0, success 82\n",
"Example 94, config 0, success 83\n",
"Example 95, config 0, success 84\n",
"Example 96, config 2, success 85\n",
"Example 97, config 2, success 86\n",
"Example 98, config 2, success 87\n",
"Example 99, config 4, success 88\n",
"Example 100, config 0, success 89\n",
"Example 101, config 0, success 90\n",
"Example 102, config 2, success 91\n",
"Example 103, config 4, success 91\n",
"Example 104, config 2, success 92\n",
"Example 105, config 2, success 93\n",
"Example 106, config 4, success 93\n",
"Example 107, config 2, success 94\n",
"Example 108, config 0, success 95\n",
"Example 109, config 2, success 96\n",
"Example 110, config 0, success 97\n",
"Example 111, config 0, success 98\n",
"Example 112, config 2, success 99\n",
"Example 113, config 0, success 99\n",
"Example 114, config 2, success 100\n",
"Example 115, config 2, success 100\n",
"Example 116, config 0, success 101\n",
"Example 117, config 0, success 102\n",
"Example 118, config 0, success 103\n",
"Example 119, config 4, success 104\n",
"Example 120, config 2, success 105\n",
"Example 121, config 2, success 106\n",
"Example 122, config 0, success 107\n",
"Example 123, config 2, success 108\n",
"Example 124, config 1, success 109\n",
"Example 125, config 0, success 110\n",
"Example 126, config 1, success 111\n",
"Example 127, config 4, success 111\n",
"Example 128, config 2, success 112\n",
"Example 129, config 2, success 113\n",
"Example 130, config 0, success 114\n",
"Example 131, config 2, success 115\n",
"Example 132, config 0, success 116\n",
"Example 133, config 2, success 117\n",
"Example 134, config 1, success 118\n",
"Example 135, config 1, success 119\n",
"Example 136, config 0, success 120\n",
"Example 137, config 0, success 121\n",
"Example 138, config 2, success 122\n",
"Example 139, config 2, success 123\n",
"Example 140, config 2, success 124\n",
"Example 141, config 2, success 125\n",
"Example 142, config 2, success 126\n",
"Example 143, config 0, success 127\n",
"Example 144, config 0, success 128\n",
"Example 145, config 2, success 129\n",
"Example 146, config 1, success 130\n",
"Example 147, config 1, success 131\n",
"Example 148, config 2, success 132\n",
"Example 149, config 0, success 133\n",
"Example 150, config 0, success 134\n",
"Example 151, config 2, success 135\n",
"Example 152, config 0, success 136\n",
"Example 153, config 2, success 137\n",
"Example 154, config 2, success 138\n",
"Example 155, config 2, success 139\n",
"Example 156, config 0, success 140\n",
"Example 157, config 0, success 141\n",
"Example 158, config 4, success 142\n",
"Example 159, config 2, success 143\n",
"Example 160, config 0, success 144\n",
"Example 161, config 0, success 145\n",
"Example 162, config 0, success 146\n",
"Example 163, config 4, success 147\n",
"Success rate: 0.896\n",
"Average cost: 0.00818\n"
]
}
],
"source": [
"prompt = \"# Python 3{definition}\"\n",
"stops = [[\"\\nclass\", \"\\ndef\", \"\\nif\", \"\\nprint\"], None]\n",
"configs = [{\"model\": 'gpt-3.5-turbo', \"prompt\": prompt, \"stop\": stops[1], \"temperature\": 0, \"seed\": 0}, {\"model\": 'gpt-3.5-turbo', \"prompt\": prompt, \"stop\": stops[0], \"n\": 7, \"seed\": 0}, {\"model\": 'gpt-4', \"prompt\": prompt, \"stop\": stops[1], \"temperature\": 0, \"seed\": 1}, {\"model\": 'gpt-4', \"prompt\": prompt, \"stop\": stops[0], \"n\": 2, \"seed\": 2}, {\"model\": 'gpt-4', \"prompt\": prompt, \"stop\": stops[0], \"n\": 1, \"seed\": 2}]\n",
"oai.Completion.set_cache(0)\n",
"oai.Completion.retry_timeout = 600\n",
"cost = 0\n",
"success = 0\n",
"for i, d in enumerate(data):\n",
" response, cost_i, j = implement(d[\"definition\"], configs)\n",
" metrics = eval_function_completions(responses=[response], **d)\n",
" success += metrics[\"success\"]\n",
" cost += cost_i\n",
" print(f\"Example {i}, config {j}, success {success}\")\n",
"print(f\"Success rate: {success / len(data):.3f}\")\n",
"print(f\"Average cost: {cost / len(data):.5f}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"vscode": {
"interpreter": {
"hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"24dd93300e0442788ee6cc1310e5bf14": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"35cd066a31b242bb87b2c106ee72e5f2": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_8e7ee7687a99410d88a98a74ecfcea99",
"IPY_MODEL_421e02a11a974b40b3ddb75382b3b640",
"IPY_MODEL_77db9797e78b49438d21c5c8da34b4cb"
],
"layout": "IPY_MODEL_47d3046236a54b0e8f9ae455a82c7e0b",
"tabbable": null,
"tooltip": null
}
},
"3d5d106a38954af2bb3bde5777702f4e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"3e1ebb31412443b0bca86a301cbdac11": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"421e02a11a974b40b3ddb75382b3b640": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_e6398d4027c9459a97965b9d91ae484f",
"max": 1,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_3e1ebb31412443b0bca86a301cbdac11",
"tabbable": null,
"tooltip": null,
"value": 1
}
},
"47d3046236a54b0e8f9ae455a82c7e0b": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"754800f7feb04acea977696e4787d1ff": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"77db9797e78b49438d21c5c8da34b4cb": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_7b6c4e1c11e249409a1edcd63be450d8",
"placeholder": "",
"style": "IPY_MODEL_3d5d106a38954af2bb3bde5777702f4e",
"tabbable": null,
"tooltip": null,
"value": " 1/1 [00:00&lt;00:00, 44.40it/s]"
}
},
"7b6c4e1c11e249409a1edcd63be450d8": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"8e7ee7687a99410d88a98a74ecfcea99": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_754800f7feb04acea977696e4787d1ff",
"placeholder": "",
"style": "IPY_MODEL_24dd93300e0442788ee6cc1310e5bf14",
"tabbable": null,
"tooltip": null,
"value": "100%"
}
},
"e6398d4027c9459a97965b9d91ae484f": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}
},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -0,0 +1,784 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved. \n",
"\n",
"Licensed under the MIT License.\n",
"\n",
"# Math Study\n",
"\n",
"In this notebook, we study GPT-4 for math problem solving. We use [the MATH benchmark](https://crfm.stanford.edu/helm/latest/?group=math_chain_of_thought) for measuring mathematical problem solving on competition math problems with chain-of-thoughts style reasoning. \n",
"\n",
"## Requirements\n",
"\n",
"FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the [openai] option:\n",
"```bash\n",
"pip install flaml[openai]==1.2.0\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.317406Z",
"iopub.status.busy": "2023-02-13T23:40:52.316561Z",
"iopub.status.idle": "2023-02-13T23:40:52.321193Z",
"shell.execute_reply": "2023-02-13T23:40:52.320628Z"
}
},
"outputs": [],
"source": [
"# %pip install flaml[openai]==1.2.0 datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Set your OpenAI key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.324240Z",
"iopub.status.busy": "2023-02-13T23:40:52.323783Z",
"iopub.status.idle": "2023-02-13T23:40:52.330570Z",
"shell.execute_reply": "2023-02-13T23:40:52.329750Z"
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"if \"OPENAI_API_KEY\" not in os.environ:\n",
" os.environ[\"OPENAI_API_KEY\"] = \"<your OpenAI API key here>\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Uncomment the following to use Azure OpenAI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.333547Z",
"iopub.status.busy": "2023-02-13T23:40:52.333249Z",
"iopub.status.idle": "2023-02-13T23:40:52.336508Z",
"shell.execute_reply": "2023-02-13T23:40:52.335858Z"
}
},
"outputs": [],
"source": [
"# import openai\n",
"# openai.api_type = \"azure\"\n",
"# openai.api_base = \"https://<your_endpoint>.openai.azure.com/\"\n",
"# openai.api_version = \"2023-03-15-preview\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load dataset\n",
"\n",
"First, we load the competition_math dataset. We use a random sample of 50 examples for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:52.339977Z",
"iopub.status.busy": "2023-02-13T23:40:52.339556Z",
"iopub.status.idle": "2023-02-13T23:40:54.603349Z",
"shell.execute_reply": "2023-02-13T23:40:54.602630Z"
}
},
"outputs": [],
"source": [
"import datasets\n",
"\n",
"seed = 41\n",
"data = datasets.load_dataset(\"competition_math\")\n",
"train_data = data[\"train\"].shuffle(seed=seed)\n",
"test_data = data[\"test\"].shuffle(seed=seed)\n",
"n_tune_data = 20\n",
"tune_data = [\n",
" {\n",
" \"problem\": train_data[x][\"problem\"],\n",
" \"solution\": train_data[x][\"solution\"],\n",
" }\n",
" for x in range(len(train_data)) if train_data[x][\"level\"] == \"Level 5\" and train_data[x][\"type\"] == \"Counting & Probability\"\n",
"][:n_tune_data]\n",
"test_data = [\n",
" {\n",
" \"problem\": test_data[x][\"problem\"],\n",
" \"solution\": test_data[x][\"solution\"],\n",
" }\n",
" for x in range(len(test_data)) if test_data[x][\"level\"] == \"Level 5\" and test_data[x][\"type\"] == \"Counting & Probability\"\n",
"]\n",
"print(len(tune_data), len(test_data))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Check a tuning example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.607152Z",
"iopub.status.busy": "2023-02-13T23:40:54.606441Z",
"iopub.status.idle": "2023-02-13T23:40:54.610504Z",
"shell.execute_reply": "2023-02-13T23:40:54.609759Z"
},
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"print(tune_data[1][\"problem\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is one example of the canonical solution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.613590Z",
"iopub.status.busy": "2023-02-13T23:40:54.613168Z",
"iopub.status.idle": "2023-02-13T23:40:54.616873Z",
"shell.execute_reply": "2023-02-13T23:40:54.616193Z"
}
},
"outputs": [],
"source": [
"print(tune_data[1][\"solution\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Success Metric\n",
"\n",
"For each math task, we use voting to select a response with the most common answers out of all the generated responses. If it has an equivalent answer to the canonical solution, we consider the task as successfully solved. Then we can optimize the mean success rate of a collection of tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.626998Z",
"iopub.status.busy": "2023-02-13T23:40:54.626593Z",
"iopub.status.idle": "2023-02-13T23:40:54.631383Z",
"shell.execute_reply": "2023-02-13T23:40:54.630770Z"
}
},
"outputs": [],
"source": [
"from flaml.autogen.math_utils import eval_math_responses"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Import the oai and tune subpackages from flaml.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:54.634335Z",
"iopub.status.busy": "2023-02-13T23:40:54.633929Z",
"iopub.status.idle": "2023-02-13T23:40:56.105700Z",
"shell.execute_reply": "2023-02-13T23:40:56.105085Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from flaml import oai"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For (local) reproducibility and cost efficiency, we cache responses from OpenAI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:56.109177Z",
"iopub.status.busy": "2023-02-13T23:40:56.108624Z",
"iopub.status.idle": "2023-02-13T23:40:56.112651Z",
"shell.execute_reply": "2023-02-13T23:40:56.112076Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"oai.ChatCompletion.set_cache(seed)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a disk cache in \".cache/{seed}\". You can change `cache_path` in `set_cache()`. The cache for different seeds are stored separately."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2023-02-13T23:40:56.115383Z",
"iopub.status.busy": "2023-02-13T23:40:56.114975Z",
"iopub.status.idle": "2023-02-13T23:41:55.045654Z",
"shell.execute_reply": "2023-02-13T23:41:55.044973Z"
}
},
"outputs": [],
"source": [
"prompt = \"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the success rate on the test data\n",
"\n",
"You can use flaml's `oai.ChatCompletion.test` to evaluate the performance of an entire dataset with the tuned config."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"config_n1 = {\"model\": 'gpt-4', \"prompt\": prompt, \"max_tokens\": 600, \"n\": 1}\n",
"n1_result = oai.ChatCompletion.test(test_data[:50], config_n1, eval_math_responses)\n",
"print(n1_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"oai.ChatCompletion.request_timeout = 120\n",
"config_n10 = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": 10}\n",
"n10_result = oai.ChatCompletion.test(test_data[:50], config_n10, eval_math_responses, logging_level=logging.INFO)\n",
"print(n10_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"config_n30 = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": 30}\n",
"n30_result = oai.ChatCompletion.test(test_data[:50], config_n30, eval_math_responses, logging_level=logging.INFO)\n",
"print(n30_result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"import matplotlib.pyplot as plt\n",
"\n",
"prompts = [\"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\\\boxed{{}}.\"]\n",
"markers = [\"o\", \"s\", \"D\", \"v\", \"p\", \"h\", \"d\", \"P\", \"X\", \"H\", \"8\", \"4\", \"3\", \"2\", \"1\", \"x\", \"+\", \">\", \"<\", \"^\", \"v\", \"1\", \"2\", \"3\", \"4\", \"8\", \"s\", \"p\", \"*\", \"h\", \"H\", \"d\", \"D\", \"|\", \"_\"]\n",
"for j, n in enumerate([10, 30]):\n",
" config = {\"model\": 'gpt-4', \"prompt\": prompts[0], \"max_tokens\": 600, \"n\": n}\n",
" metrics = []\n",
" x, y = [], []\n",
" votes_success = defaultdict(lambda: [0, 0])\n",
" for i, data_i in enumerate(test_data[:50]):\n",
" response = oai.ChatCompletion.create(context=data_i, **config)\n",
" responses = oai.ChatCompletion.extract_text(response)\n",
" metrics.append(eval_math_responses(responses, **data_i))\n",
" votes = metrics[-1][\"votes\"]\n",
" success = metrics[-1][\"success_vote\"]\n",
" votes_success[votes][0] += 1\n",
" votes_success[votes][1] += success\n",
" for votes in votes_success:\n",
" x.append(votes)\n",
" y.append(votes_success[votes][1] / votes_success[votes][0])\n",
"\n",
" plt.scatter(x, y, marker=markers[j])\n",
" plt.xlabel(\"top vote\")\n",
" plt.ylabel(\"success rate\")\n",
"plt.legend([\"n=10\", \"n=30\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"vscode": {
"interpreter": {
"hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
"2d910cfd2d2a4fc49fc30fbbdc5576a7": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"454146d0f7224f038689031002906e6f": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_e4ae2b6f5a974fd4bafb6abb9d12ff26",
"IPY_MODEL_577e1e3cc4db4942b0883577b3b52755",
"IPY_MODEL_b40bdfb1ac1d4cffb7cefcb870c64d45"
],
"layout": "IPY_MODEL_dc83c7bff2f241309537a8119dfc7555",
"tabbable": null,
"tooltip": null
}
},
"577e1e3cc4db4942b0883577b3b52755": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_2d910cfd2d2a4fc49fc30fbbdc5576a7",
"max": 1,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_74a6ba0c3cbc4051be0a83e152fe1e62",
"tabbable": null,
"tooltip": null,
"value": 1
}
},
"6086462a12d54bafa59d3c4566f06cb2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"74a6ba0c3cbc4051be0a83e152fe1e62": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"7d3f3d9e15894d05a4d188ff4f466554": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"b40bdfb1ac1d4cffb7cefcb870c64d45": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_f1355871cc6f4dd4b50d9df5af20e5c8",
"placeholder": "",
"style": "IPY_MODEL_ca245376fd9f4354af6b2befe4af4466",
"tabbable": null,
"tooltip": null,
"value": " 1/1 [00:00&lt;00:00, 44.69it/s]"
}
},
"ca245376fd9f4354af6b2befe4af4466": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "StyleView",
"background": null,
"description_width": "",
"font_size": null,
"text_color": null
}
},
"dc83c7bff2f241309537a8119dfc7555": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"e4ae2b6f5a974fd4bafb6abb9d12ff26": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "2.0.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "2.0.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "2.0.0",
"_view_name": "HTMLView",
"description": "",
"description_allow_html": false,
"layout": "IPY_MODEL_6086462a12d54bafa59d3c4566f06cb2",
"placeholder": "",
"style": "IPY_MODEL_7d3f3d9e15894d05a4d188ff4f466554",
"tabbable": null,
"tooltip": null,
"value": "100%"
}
},
"f1355871cc6f4dd4b50d9df5af20e5c8": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "2.0.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "2.0.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "2.0.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border_bottom": null,
"border_left": null,
"border_right": null,
"border_top": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}
},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -120,7 +120,7 @@ setuptools.setup(
"pytorch-forecasting>=0.9.0",
],
"benchmark": ["catboost>=0.26", "psutil==5.8.0", "xgboost==1.3.3"],
"openai": ["openai==0.27.0", "diskcache", "optuna==2.8.0"],
"openai": ["openai==0.27.4", "diskcache", "optuna==2.8.0"],
"synapse": ["joblibspark>=0.5.0", "optuna==2.8.0", "pyspark>=3.2.0"],
},
classifiers=[

View File

@ -1,10 +1,15 @@
import datasets
import signal
import subprocess
import sys
import numpy as np
import pytest
from functools import partial
from flaml import oai
from flaml.autogen.code_utils import (
eval_function_completions,
generate_assertions,
implement,
)
from flaml.autogen.math_utils import eval_math_responses
@pytest.mark.skipif(
@ -12,58 +17,16 @@ from flaml import oai
reason="do not run on windows",
)
def test_humaneval(num_samples=1):
def timeout_handler(signum, frame):
raise TimeoutError("Timed out!")
signal.signal(signal.SIGALRM, timeout_handler)
max_exec_time = 3 # seconds
def execute_code(code):
code = code.strip()
with open("codetest.py", "w") as fout:
fout.write(code)
try:
signal.alarm(max_exec_time)
result = subprocess.run(
[sys.executable, "codetest.py"],
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE,
)
signal.alarm(0)
except TimeoutError:
return 0
return int(result.returncode == 0)
def success_metrics(responses, prompt, test, entry_point):
"""Check if the response is correct.
Args:
responses (list): The list of responses.
prompt (str): The input prompt.
test (str): The test code.
entry_point (str): The name of the function.
Returns:
dict: The success metrics.
"""
success_list = []
n = len(responses)
for i in range(n):
response = responses[i]
code = f"{prompt}{response}\n{test}\ncheck({entry_point})"
succeed = execute_code(code)
success_list.append(succeed)
return {
"expected_success": 1 - pow(1 - np.mean(success_list), n),
"success": any(s for s in success_list),
}
eval_with_generated_assertions = partial(
eval_function_completions, assertions=generate_assertions
)
seed = 41
data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
{
"prompt": data[x]["prompt"],
"definition": data[x]["prompt"],
"test": data[x]["test"],
"entry_point": data[x]["entry_point"],
}
@ -71,7 +34,7 @@ def test_humaneval(num_samples=1):
]
test_data = [
{
"prompt": data[x]["prompt"],
"definition": data[x]["prompt"],
"test": data[x]["test"],
"entry_point": data[x]["entry_point"],
}
@ -79,335 +42,80 @@ def test_humaneval(num_samples=1):
]
oai.Completion.set_cache(seed)
try:
# a minimal tuning example
config, _ = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=success_metrics,
n=1,
)
responses = oai.Completion.create(context=test_data[0], **config)
# a minimal tuning example for tuning chat completion models using the Completion class
config, _ = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=success_metrics,
n=1,
model="gpt-3.5-turbo",
)
responses = oai.Completion.create(context=test_data[0], **config)
# a minimal tuning example for tuning chat completion models using the Completion class
config, _ = oai.ChatCompletion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=success_metrics,
n=1,
messages=[{"role": "user", "content": "{prompt}"}],
)
responses = oai.ChatCompletion.create(context=test_data[0], **config)
print(responses)
# a more comprehensive tuning example
config, analysis = oai.Completion.tune(
data=tune_data,
metric="expected_success",
mode="max",
eval_func=success_metrics,
log_file_name="logs/humaneval.log",
inference_budget=0.002,
optimization_budget=2,
num_samples=num_samples,
prompt=[
"{prompt}",
"# Python 3{prompt}",
"Complete the following Python function:{prompt}",
"Complete the following Python function while including necessary import statements inside the function:{prompt}",
],
stop=["\nclass", "\ndef", "\nif", "\nprint"],
)
print(config)
print(analysis.best_result)
print(test_data[0])
responses = oai.Completion.create(context=test_data[0], **config)
print(responses)
oai.Completion.data = test_data[:num_samples]
result = oai.Completion._eval(analysis.best_config, prune=False, eval_only=True)
print("result without pruning", result)
result = oai.Completion.test(test_data[:num_samples], config=config)
print(result)
import openai
import diskcache
except ImportError as exc:
print(exc)
return
# a minimal tuning example
config, _ = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_function_completions,
n=1,
prompt="{definition}",
)
responses = oai.Completion.create(context=test_data[0], **config)
# a minimal tuning example for tuning chat completion models using the Completion class
config, _ = oai.Completion.tune(
data=tune_data,
metric="succeed_assertions",
mode="max",
eval_func=eval_with_generated_assertions,
n=1,
model="gpt-3.5-turbo",
prompt="{definition}",
)
responses = oai.Completion.create(context=test_data[0], **config)
# a minimal tuning example for tuning chat completion models using the Completion class
config, _ = oai.ChatCompletion.tune(
data=tune_data,
metric="expected_success",
mode="max",
eval_func=eval_function_completions,
n=1,
messages=[{"role": "user", "content": "{definition}"}],
)
responses = oai.ChatCompletion.create(context=test_data[0], **config)
print(responses)
code, cost, _ = implement(tune_data[1], [config])
print(code)
print(cost)
print(eval_function_completions([code], **tune_data[1]))
# a more comprehensive tuning example
config2, analysis = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_with_generated_assertions,
log_file_name="logs/humaneval.log",
inference_budget=0.002,
optimization_budget=2,
num_samples=num_samples,
prompt=[
"{definition}",
"# Python 3{definition}",
"Complete the following Python function:{definition}",
],
stop=[["\nclass", "\ndef", "\nif", "\nprint"], None], # the stop sequences
)
print(config2)
print(analysis.best_result)
print(test_data[0])
responses = oai.Completion.create(context=test_data[0], **config2)
print(responses)
oai.Completion.data = test_data[:num_samples]
result = oai.Completion._eval(analysis.best_config, prune=False, eval_only=True)
print("result without pruning", result)
result = oai.Completion.test(test_data[:num_samples], config=config2)
print(result)
code, cost, selected = implement(tune_data[1], [config2, config])
print(selected)
print(eval_function_completions([code], **tune_data[1]))
def test_math(num_samples=-1):
from typing import Optional
def remove_boxed(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the text within a \\boxed{...} environment.
Example:
>>> remove_boxed(\\boxed{\\frac{2}{3}})
\\frac{2}{3}
"""
left = "\\boxed{"
try:
assert string[: len(left)] == left
assert string[-1] == "}"
return string[len(left) : -1]
except Exception:
return None
def last_boxed_only_string(string: str) -> Optional[str]:
"""Source: https://github.com/hendrycks/math
Extract the last \\boxed{...} or \\fbox{...} element from a string.
"""
idx = string.rfind("\\boxed")
if idx < 0:
idx = string.rfind("\\fbox")
if idx < 0:
return None
i = idx
right_brace_idx = None
num_left_braces_open = 0
while i < len(string):
if string[i] == "{":
num_left_braces_open += 1
if string[i] == "}":
num_left_braces_open -= 1
if num_left_braces_open == 0:
right_brace_idx = i
break
i += 1
if right_brace_idx is None:
retval = None
else:
retval = string[idx : right_brace_idx + 1]
return retval
def _fix_fracs(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions.
Examples:
>>> _fix_fracs("\\frac1b")
\frac{1}{b}
>>> _fix_fracs("\\frac12")
\frac{1}{2}
>>> _fix_fracs("\\frac1{72}")
\frac{1}{72}
"""
substrs = string.split("\\frac")
new_str = substrs[0]
if len(substrs) > 1:
substrs = substrs[1:]
for substr in substrs:
new_str += "\\frac"
if substr[0] == "{":
new_str += substr
else:
try:
assert len(substr) >= 2
except Exception:
return string
a = substr[0]
b = substr[1]
if b != "{":
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}{" + b + "}" + post_substr
else:
new_str += "{" + a + "}{" + b + "}"
else:
if len(substr) > 2:
post_substr = substr[2:]
new_str += "{" + a + "}" + b + post_substr
else:
new_str += "{" + a + "}" + b
string = new_str
return string
def _fix_a_slash_b(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Reformat fractions formatted as a/b to \\frac{a}{b}.
Example:
>>> _fix_a_slash_b("2/3")
\frac{2}{3}
"""
if len(string.split("/")) != 2:
return string
a_str = string.split("/")[0]
b_str = string.split("/")[1]
try:
a = int(a_str)
b = int(b_str)
assert string == "{}/{}".format(a, b)
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
return new_string
except Exception:
return string
def _remove_right_units(string: str) -> str:
"""Source: https://github.com/hendrycks/math"""
if "\\text{ " in string:
splits = string.split("\\text{ ")
assert len(splits) == 2
return splits[0]
else:
return string
def _fix_sqrt(string: str) -> str:
"""Source: https://github.com/hendrycks/math"""
if "\\sqrt" not in string:
return string
splits = string.split("\\sqrt")
new_string = splits[0]
for split in splits[1:]:
if split[0] != "{":
a = split[0]
new_substr = "\\sqrt{" + a + "}" + split[1:]
else:
new_substr = "\\sqrt" + split
new_string += new_substr
return new_string
def _strip_string(string: str) -> str:
"""Source: https://github.com/hendrycks/math
Apply the reformatting helper functions above.
"""
# linebreaks
string = string.replace("\n", "")
# print(string)
# remove inverse spaces
string = string.replace("\\!", "")
# print(string)
# replace \\ with \
string = string.replace("\\\\", "\\")
# print(string)
# replace tfrac and dfrac with frac
string = string.replace("tfrac", "frac")
string = string.replace("dfrac", "frac")
# print(string)
# remove \left and \right
string = string.replace("\\left", "")
string = string.replace("\\right", "")
# print(string)
# Remove circ (degrees)
string = string.replace("^{\\circ}", "")
string = string.replace("^\\circ", "")
# remove dollar signs
string = string.replace("\\$", "")
# remove units (on the right)
string = _remove_right_units(string)
# remove percentage
string = string.replace("\\%", "")
string = string.replace(r"\%", "")
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
string = string.replace(" .", " 0.")
string = string.replace("{.", "{0.")
# if empty, return empty string
if len(string) == 0:
return string
if string[0] == ".":
string = "0" + string
# to consider: get rid of e.g. "k = " or "q = " at beginning
if len(string.split("=")) == 2:
if len(string.split("=")[0]) <= 2:
string = string.split("=")[1]
# fix sqrt3 --> sqrt{3}
string = _fix_sqrt(string)
# remove spaces
string = string.replace(" ", "")
# \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc.
# Even works with \frac1{72} (but not \frac{72}1).
# Also does a/b --> \\frac{a}{b}
string = _fix_fracs(string)
# manually change 0.5 --> \frac{1}{2}
if string == "0.5":
string = "\\frac{1}{2}"
# NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
string = _fix_a_slash_b(string)
return string
def get_answer(solution: Optional[str]) -> Optional[str]:
if solution is None:
return None
last_boxed = last_boxed_only_string(solution)
if last_boxed is None:
return None
answer = remove_boxed(last_boxed)
if answer is None:
return None
return answer
def is_equiv(str1: Optional[str], str2: Optional[str]) -> float:
"""Returns (as a float) whether two strings containing math are equivalent up to differences of formatting in
- units
- fractions
- square roots
- superfluous LaTeX.
Source: https://github.com/hendrycks/math
"""
if str1 is None and str2 is None:
print("WARNING: Both None")
return 1.0
if str1 is None or str2 is None:
return 0.0
try:
ss1 = _strip_string(str1)
ss2 = _strip_string(str2)
return float(ss1 == ss2)
except Exception:
return float(str1 == str2)
def is_equiv_chain_of_thought(str1: str, str2: str) -> float:
"""Strips the solution first before calling `is_equiv`."""
ans1 = get_answer(str1)
ans2 = get_answer(str2)
return is_equiv(ans1, ans2)
def success_metrics(responses, solution, **args):
"""Check if each response is correct.
Args:
responses (list): The list of responses.
solution (str): The canonical solution.
Returns:
dict: The success metrics.
"""
success_list = []
n = len(responses)
for i in range(n):
response = responses[i]
succeed = is_equiv_chain_of_thought(response, solution)
success_list.append(succeed)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
}
seed = 41
data = datasets.load_dataset("competition_math")
train_data = data["train"].shuffle(seed=seed)
@ -436,78 +144,87 @@ def test_math(num_samples=-1):
print(len(tune_data), len(test_data))
# prompt template
prompts = [
lambda data: "Given a mathematics problem, determine the answer. Simplify your answer as much as possible.\n###\nProblem: What is the value of $\\sqrt{3! \\cdot 3!}$ expressed as a positive integer?\nAnswer: $\\sqrt{3!\\cdot3!}$ is equal to $\\sqrt{(3!)^2}=3!=3\\cdot2\\cdot1=\\boxed{6}$.\n###\nProblem: %s\nAnswer:"
+ data["problem"]
lambda data: "%s Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{}."
% data["problem"]
]
try:
oai.ChatCompletion.set_cache(seed)
vanilla_config = {
"model": "gpt-3.5-turbo",
"temperature": 1,
"max_tokens": 2048,
"n": 1,
"prompt": prompts[0],
"stop": "###",
}
test_data_sample = test_data[0:3]
result = oai.ChatCompletion.test(
test_data_sample, vanilla_config, success_metrics
)
test_data_sample = test_data[3:6]
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
success_metrics,
use_cache=False,
agg_method="median",
)
def my_median(results):
return np.median(results)
def my_average(results):
return np.mean(results)
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
success_metrics,
use_cache=False,
agg_method=my_median,
)
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
success_metrics,
use_cache=False,
agg_method={"expected_success": my_median, "success": my_average},
)
print(result)
config, _ = oai.ChatCompletion.tune(
data=tune_data, # the data for tuning
metric="expected_success", # the metric to optimize
mode="max", # the optimization mode
eval_func=success_metrics, # the evaluation function to return the success metrics
# log_file_name="logs/math.log", # the log file name
inference_budget=0.002, # the inference budget (dollar)
optimization_budget=0.01, # the optimization budget (dollar)
num_samples=num_samples,
prompt=prompts, # the prompt templates to choose from
stop="###", # the stop sequence
)
print("tuned config", config)
result = oai.ChatCompletion.test(test_data_sample, config)
print("result from tuned config:", result)
except (ImportError, NameError) as exc:
import openai
import diskcache
except ImportError as exc:
print(exc)
return
oai.ChatCompletion.set_cache(seed)
vanilla_config = {
"model": "gpt-3.5-turbo",
"temperature": 1,
"max_tokens": 2048,
"n": 1,
"prompt": prompts[0],
"stop": "###",
}
test_data_sample = test_data[0:3]
result = oai.ChatCompletion.test(
test_data_sample, vanilla_config, eval_math_responses
)
test_data_sample = test_data[3:6]
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
eval_math_responses,
use_cache=False,
agg_method="median",
)
def my_median(results):
return np.median(results)
def my_average(results):
return np.mean(results)
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
eval_math_responses,
use_cache=False,
agg_method=my_median,
)
result = oai.ChatCompletion.test(
test_data_sample,
vanilla_config,
eval_math_responses,
use_cache=False,
agg_method={
"expected_success": my_median,
"success": my_average,
"success_vote": my_average,
"votes": np.mean,
},
)
print(result)
config, _ = oai.ChatCompletion.tune(
data=tune_data, # the data for tuning
metric="expected_success", # the metric to optimize
mode="max", # the optimization mode
eval_func=eval_math_responses, # the evaluation function to return the success metrics
# log_file_name="logs/math.log", # the log file name
inference_budget=0.002, # the inference budget (dollar)
optimization_budget=0.01, # the optimization budget (dollar)
num_samples=num_samples,
prompt=prompts, # the prompt templates to choose from
stop="###", # the stop sequence
)
print("tuned config", config)
result = oai.ChatCompletion.test(test_data_sample, config)
print("result from tuned config:", result)
if __name__ == "__main__":
import openai
openai.api_key_path = "test/openai/key.txt"
test_humaneval(-1)
test_math(-1)
test_humaneval(1)
# test_math(1)

View File

@ -45,18 +45,18 @@ def run_notebook(input_nb, output_nb="executed_openai_notebook.ipynb", save=Fals
skip,
reason="do not run openai test if openai is not installed",
)
def test_integrate_openai(save=False):
run_notebook("integrate_openai.ipynb", save=save)
def test_autogen_openai(save=False):
run_notebook("autogen_openai.ipynb", save=save)
@pytest.mark.skipif(
skip,
reason="do not run openai test if openai is not installed",
)
def test_integrate_chatgpt(save=False):
run_notebook("integrate_chatgpt.ipynb", save=save)
def test_autogen_chatgpt(save=False):
run_notebook("autogen_chatgpt.ipynb", save=save)
if __name__ == "__main__":
test_integrate_chatgpt(save=True)
test_integrate_openai(save=True)
test_autogen_chatgpt(save=True)
test_autogen_openai(save=True)

View File

@ -1,9 +1,11 @@
FLAML offers a cost-effective hyperparameter optimization technique [EcoOptiGen](https://arxiv.org/abs/2303.04673) for tuning Large Language Models. Our study finds that tuning hyperparameters can significantly improve the utility of the OpenAI API.
# AutoGen - OpenAI
FLAML offers a cost-effective hyperparameter optimization technique [EcoOptiGen](https://arxiv.org/abs/2303.04673) for tuning Large Language Models. Our study finds that tuning hyperparameters can significantly improve their utility.
In this example, we will tune several hyperparameters for the OpenAI's completion API, including the temperature, prompt and n (number of completions), to optimize the inference performance for a code generation task.
### Prerequisites
Install the [openai] option. The OpenAI integration is in preview. ChaptGPT support is available since version 1.2.0.
Install the [openai] option. The OpenAI integration is in preview.
```bash
pip install "flaml[openai]==1.2.0"
```
@ -19,9 +21,11 @@ if "OPENAI_API_KEY" not in os.environ:
If you use Azure OpenAI, set up Azure using the following code:
```python
import openai
openai.api_type = "azure"
openai.api_base = "https://<your_endpoint>.openai.azure.com/"
openai.api_version = "2022-12-01" # change if necessary
openai.api_version = "2023-03-15-preview" # change if necessary
```
### Load the dataset
@ -36,7 +40,7 @@ data = datasets.load_dataset("openai_humaneval")["test"].shuffle(seed=seed)
n_tune_data = 20
tune_data = [
{
"prompt": data[x]["prompt"],
"definition": data[x]["prompt"],
"test": data[x]["test"],
"entry_point": data[x]["entry_point"],
}
@ -44,7 +48,7 @@ tune_data = [
]
test_data = [
{
"prompt": data[x]["prompt"],
"definition": data[x]["prompt"],
"test": data[x]["test"],
"entry_point": data[x]["entry_point"],
}
@ -54,71 +58,16 @@ test_data = [
### Defining the metric
Before starting tuning, you need to define the metric for the optimization. For the HumanEval dataset, we use the success rate as the metric. So if one of the returned responses can pass the test, we consider the task as successfully solved. Then we can define the mean success rate of a collection of tasks.
#### Define a code executor
First, we write a simple code executor. The code executor takes the generated code and the test code as the input, and execute them with a timer.
Before starting tuning, you need to define the metric for the optimization. For each code generation task, we can use the model to generate multiple candidate responses, and then select one from them. If the final selected response can pass a unit test, we consider the task as successfully solved. Then we can define the average success rate on a collection of tasks as the optimization metric.
```python
import signal
import subprocess
import sys
from functools import partial
from flaml.autogen.code_utils import eval_function_completions, generate_assertions
def timeout_handler(signum, frame):
raise TimeoutError("Timed out!")
signal.signal(signal.SIGALRM, timeout_handler)
max_exec_time = 3 # seconds
def execute_code(code):
code = code.strip()
with open("codetest.py", "w") as fout:
fout.write(code)
try:
signal.alarm(max_exec_time)
result = subprocess.run(
[sys.executable, "codetest.py"],
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE,
)
signal.alarm(0)
except TimeoutError:
return 0
return int(result.returncode == 0)
eval_with_generated_assertions = partial(eval_function_completions, assertions=generate_assertions)
```
This function will create a temp file "codetest.py" and execute it in a separate process. It allows for 3 seconds to finish that code.
#### Define a function to evaluate the success for a given program synthesis task
Now we define the success metric.
```python
def success_metrics(responses, prompt, test, entry_point):
"""Check if the task is successful.
Args:
responses (list): The list of responses.
prompt (str): The input prompt.
test (str): The test code.
entry_point (str): The name of the function.
Returns:
dict: The success metrics.
"""
success_list = []
n = len(responses)
for i in range(n):
response = responses[i]
code = f"{prompt}{response}\n{test}\ncheck({entry_point})"
succeed = execute_code(code)
success_list.append(succeed)
return {
"expected_success": 1 - pow(1 - sum(success_list) / n, n),
"success": any(s for s in success_list),
}
```
This function will first generate assertion statements for each problem. Then, it uses the assertions to select the generated responses.
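As a hypothetical usage sketch (the candidate completions below are made up), the resulting evaluation function can be applied to a list of responses for one task:
```python
# Hypothetical check of two made-up candidate completions for one tuning instance.
# Each instance supplies "definition", "test" and "entry_point" as keyword arguments.
metrics = eval_with_generated_assertions(
    ["    return sorted(numbers)\n", "    return numbers\n"],
    **tune_data[0],
)
print(metrics)  # a dict of success metrics
```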
### Tuning Hyperparameters for OpenAI
@ -131,24 +80,25 @@ The tuning will be performed under the specified optimization budgets.
Users can specify tuning data, optimization metric, optimization mode, evaluation function, search spaces etc.
```python
from flaml import oai
config, analysis = oai.Completion.tune(
data=tune_data, # the data for tuning
metric="expected_success", # the metric to optimize
metric="success", # the metric to optimize
mode="max", # the optimization mode
eval_func=success_metrics, # the evaluation function to return the success metrics
eval_func=eval_with_generated_assertions, # the evaluation function to return the success metrics
# log_file_name="logs/humaneval.log", # the log file name
inference_budget=0.1, # the inference budget (dollar)
optimization_budget=4, # the optimization budget (dollar)
inference_budget=0.05, # the inference budget (dollar per instance)
optimization_budget=3, # the optimization budget (dollar in total)
# num_samples can further limit the number of trials for different hyperparameter configurations;
# -1 means decided by the optimization budget only
num_samples=-1,
prompt=[
"{prompt}",
"# Python 3{prompt}",
"Complete the following Python function:{prompt}",
"Complete the following Python function while including necessary import statements inside the function:{prompt}",
"{definition}",
"# Python 3{definition}",
"Complete the following Python function:{definition}",
], # the prompt templates to choose from
stop=["\nclass", "\ndef", "\nif", "\nprint"], # the stop sequence
stop=[["\nclass", "\ndef", "\nif", "\nprint"], None], # the stop sequences
)
```
@ -168,7 +118,7 @@ We can apply the tuned config to the request for an instance:
```python
responses = oai.Completion.create(context=tune_data[1], **config)
print(responses)
print(success_metrics([response["text"].rstrip() for response in responses["choices"]], **tune_data[1]))
print(eval_with_generated_assertions(oai.Completion.extract_text(responses), **tune_data[1]))
```
#### Evaluate the success rate on the test data
@ -177,9 +127,9 @@ You can use flaml's `oai.Completion.test` to evaluate the performance of an enti
```python
result = oai.Completion.test(test_data, config)
print(result)
print("performance on test data with the tuned config:", result)
```
The result will vary with the inference budget and optimization budget.
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_openai.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_openai.ipynb)
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/autogen_openai.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/autogen_openai.ipynb)

View File

@ -7,10 +7,8 @@ learning models automatically, efficiently and economically. It frees users from
### Main Features
1. For common machine learning or AI tasks like classification, regression, and generation, it quickly finds quality models for user-provided data with low computational resources. It supports both classical machine learning models and deep neural networks, including large language models such as the OpenAI GPT-3 models.
1. For common machine learning or AI tasks like classification, regression, and generation, it quickly finds quality models for user-provided data with low computational resources. It supports both classical machine learning models and deep neural networks, including foundation models such as the GPT series.
2. It is easy to customize or extend. Users can find their desired customizability from a smooth range: minimal customization (computational resource budget), medium customization (e.g., scikit-style learner, search space and metric), or full customization (arbitrary training and evaluation code). Users can customize only when and what they need to, and leave the rest to the library.
3. It supports fast and economical automatic tuning, capable of handling large search space with heterogeneous evaluation cost and complex constraints/guidance/early stopping. FLAML is powered by a new, [cost-effective
hyperparameter optimization](Use-Cases/Tune-User-Defined-Function#hyperparameter-optimization-algorithm)
and model selection method invented by Microsoft Research, and many followup [research studies](Research).
@ -88,6 +86,26 @@ from flaml.default import LGBMClassifier
Then, you can use it just like you use the original `LGBMClassifier`. Your other code can remain unchanged. When you call the `fit()` function from `flaml.default.LGBMClassifier`, it will automatically instantiate a good data-dependent hyperparameter configuration for your dataset, which is expected to work better than the default configuration.
#### (New) [Auto Generation](Use-Cases/Auto-Generation)
You can optimize generations by ChatGPT or GPT-4 etc. with your own tuning data, success metrics and budgets.
```python
from flaml import oai
config, analysis = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_func,
inference_budget=0.05,
optimization_budget=3,
num_samples=-1,
)
```
The optimization can help you maximize the utility out of these expensive models.
### Where to Go Next?
* Understand the use cases for [Task-oriented AutoML](Use-Cases/task-oriented-automl), [Tune user-defined function](Use-Cases/Tune-User-Defined-Function) and [Zero-shot AutoML](Use-Cases/Zero-Shot-AutoML).

View File

@ -0,0 +1,117 @@
# Auto Generation
`flaml.autogen` is a subpackage for automating generation tasks. It uses [`flaml.tune`](../reference/tune/tune) to find good hyperparameter configurations under budget constraints.
Such optimization has several benefits:
* Maximize the utility out of using expensive foundation models.
* Reduce the inference cost by using cheaper models or configurations which achieve equal or better performance.
## Choices to Optimize
The cost of using foundation models for text generation is typically measured in terms of the number of tokens in the input and output combined. From the perspective of an application builder using foundation models, the use case is to maximize the utility of the generated text under an inference budget constraint (e.g., measured by the average dollar cost needed to solve a coding problem). This can be achieved by optimizing the hyperparameters of the inference,
which can significantly affect both the utility and the cost of the generated text.
The tunable hyperparameters include:
1. model - this is a required input, specifying the model ID to use.
1. prompt - the input prompt to the model, which provides the context for the text generation task.
1. max_tokens - the maximum number of tokens (words or word pieces) to generate in the output.
1. temperature - a value between 0 and 1 that controls the randomness of the generated text. A higher temperature will result in more random and diverse text, while a lower temperature will result in more predictable text.
1. top_p - a value between 0 and 1 that controls the sampling probability mass for each token generation. A lower top_p value will make it more likely to generate text based on the most likely tokens, while a higher value will allow the model to explore a wider range of possible tokens.
1. n - the number of responses to generate for a given prompt. Generating multiple responses can provide more diverse and potentially more useful output, but it also increases the cost of the request.
1. stop - a list of strings that, when encountered in the generated text, will cause the generation to stop. This can be used to control the length or the validity of the output.
1. presence_penalty, frequency_penalty - values that control the relative importance of the presence and frequency of certain words or phrases in the generated text.
1. best_of - the number of responses to generate server-side when selecting the "best" (the one with the highest log probability per token) response for a given prompt.
The cost and utility of text generation are intertwined with the joint effect of these hyperparameters.
There are also complex interactions among subsets of the hyperparameters. For example,
the temperature and top_p are not recommended to be altered from their default values together because they both control the randomness of the generated text, and changing both at the same time can result in conflicting effects; n and best_of are rarely tuned together because if the application can process multiple outputs, filtering on the server side causes unnecessary information loss; both n and max_tokens will affect the total number of tokens generated, which in turn will affect the cost of the request.
These interactions and trade-offs make it difficult to manually determine the optimal hyperparameter settings for a given text generation task.
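For reference, a single inference configuration is just a set of values for these hyperparameters. The dict below is a hypothetical example; the values are illustrative, not recommendations:
```python
# A hypothetical inference configuration; the values are for illustration only.
config = {
    "model": "gpt-3.5-turbo",  # required model ID
    "prompt": "{problem} Solve the problem carefully.",  # prompt template
    "max_tokens": 600,  # cap on the number of generated tokens
    "temperature": 1,  # sampling randomness (avoid tuning top_p at the same time)
    "n": 10,  # number of responses per prompt
    "stop": "###",  # stop sequence
}
```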
## Tune Hyperparameters
The tuning can be performed with the following information:
1. Validation data.
1. Evaluation function.
1. Metric to optimize.
1. Search space.
1. Budgets: inference and optimization respectively.
### Validation data
Collect a diverse set of instances. They can be stored in an iterable of dicts. For example, each instance dict can contain "problem" as a key and the description str of a math problem as the value; and "solution" as a key and the solution str as the value.
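A minimal sketch of such validation data for a math task (the instances themselves are placeholders):
```python
# Validation data as an iterable of dicts; the instances are placeholders.
tune_data = [
    {
        "problem": "What is the value of $1 + 1$?",
        "solution": "The value is $\\boxed{2}$.",
    },
    {
        "problem": "How many positive divisors does 6 have?",
        "solution": "6 has $\\boxed{4}$ positive divisors: 1, 2, 3 and 6.",
    },
]
```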
### Evaluation function
The evaluation function should take a list of responses, and other keyword arguments corresponding to the keys in each validation data instance as input, and output a dict of metrics. For example,
```python
def success_metrics(responses: List[str], problem: str, solution: str) -> Dict:
# select a response from the list of responses
# check whether the answer is correct
return {"success": True or False}
```
`flaml.autogen` offers some example evaluation functions for common tasks such as code generation and math problem solving.
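For instance, the built-in evaluation helpers used in this commit can be imported directly:
```python
from functools import partial

# Example evaluation functions shipped with flaml.autogen.
from flaml.autogen.code_utils import eval_function_completions, generate_assertions
from flaml.autogen.math_utils import eval_math_responses

# For code generation, assertions can be auto-generated and used to select responses.
eval_with_generated_assertions = partial(eval_function_completions, assertions=generate_assertions)
```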
### Metric to optimize
The metric to optimize is usually an aggregated metric over all the tuning data instances. For example, users can specify "success" as the metric and "max" as the optimization mode. By default, the aggregation function is taking the average. Users can provide a customized aggregation function if needed.
### Search space
Users can specify the (optional) search range for each hyperparameter, as illustrated in the sketch after this list.
1. model. Either a constant str, or multiple choices specified by `flaml.tune.choice`.
1. prompt. Either a str or a list of strs, of the prompt templates.
Each prompt template will be formatted with each data instance. For example, the prompt template can be:
"{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}."
And `{problem}` will be replaced by the "problem" field of each data instance.
1. max_tokens, n, best_of. They can be constants, or specified by `flaml.tune.randint`, `flaml.tune.qrandint`, `flaml.tune.lograndint` or `flaml.tune.qlograndint`. By default, max_tokens is searched in [50, 1000); n is searched in [1, 100); and best_of is fixed to 1.
1. stop. It can be a str or a list of strs, or a list of lists of strs or None. Default is None.
1. temperature or top_p. One of them can be specified as a constant or by `flaml.tune.uniform` or `flaml.tune.loguniform` etc.
Please don't provide both. By default, each configuration will choose either a temperature or a top_p in [0, 1] uniformly.
1. presence_penalty, frequency_penalty. They can be constants or specified by `flaml.tune.uniform` etc. Not tuned by default.
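Putting these together, the search space is expressed through keyword arguments of the tune call. The sketch below assumes `tune_data` and `eval_func` are defined as in the previous sections; the concrete choices are illustrative:
```python
from flaml import oai, tune

# Sketch of a tune call with an explicit search space; the choices are illustrative.
config, analysis = oai.Completion.tune(
    data=tune_data,
    metric="success",
    mode="max",
    eval_func=eval_func,
    inference_budget=0.05,
    optimization_budget=3,
    num_samples=-1,
    model=tune.choice(["text-davinci-003", "gpt-3.5-turbo"]),  # search over two models
    prompt=[
        "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \\boxed{{}}.",
        "Solve the following problem and put the final answer in \\boxed{{}}: {problem}",
    ],  # prompt templates to choose from
    max_tokens=tune.randint(300, 1001),  # override the default search range
    n=tune.randint(1, 31),
)
```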
### Budgets
One can specify an inference budget and an optimization budget.
The inference budget refers to the average inference cost per data instance.
The optimization budget refers to the total budget allowed for the tuning process. Both are measured in dollars, following the price per 1000 tokens.
### Perform tuning
Now, you can use [`flaml.oai.Completion.tune`](../reference/autogen/oai/completion#tune) for tuning. For example,
```python
from flaml import oai
config, analysis = oai.Completion.tune(
data=tune_data,
metric="success",
mode="max",
eval_func=eval_func,
inference_budget=0.05,
optimization_budget=3,
num_samples=-1,
)
```
`num_samples` is the number of configurations to sample. -1 means unlimited (until optimization budget is exhausted).
The returned `config` contains the optimized configuration and `analysis` contains an [ExperimentAnalysis](../reference/tune/analysis#experimentanalysis-objects) object for all the tried configurations and results.
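For example, the best configuration and its recorded metrics can be inspected directly, mirroring the test code in this commit:
```python
print("tuned config:", config)
print("best result:", analysis.best_result)  # metrics recorded for the best trial
```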
### Perform inference with the tuned config
One can use [`flaml.oai.Completion.create`](../reference/autogen/oai/completion#create) to perform inference. It materializes a prompt using a given context. For example,
```python
response = oai.Completion.create(context={"problem": problem}, **config)
responses = oai.Completion.extract_text(response)
# Extract a list of str responses
```
`flaml.oai.Completion` is compatible with both `openai.Completion` and `openai.ChatCompletion`. So models such as "text-davinci-003", "gpt-3.5-turbo" and "gpt-4" can share a common API. When only tuning the chat-based models, `flaml.oai.ChatCompletion` can be used.
`flaml.oai.Completion` also offers some additional utilities including a `test` function to conveniently evaluate the configuration over test data, a `cost` function to calculate the cost of an API call, and caching and error handling. It also supports both OpenAI API and Azure OpenAI API.
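A brief sketch of these utilities, assuming `test_data`, `config` and `eval_math_responses` are defined as in the notebooks referenced below:
```python
from flaml import oai

# Cache completions on disk, keyed by a seed, for reproducibility and cost saving.
oai.ChatCompletion.set_cache(41)

# Evaluate a config over a whole dataset with a chosen evaluation function.
result = oai.ChatCompletion.test(test_data, config, eval_math_responses)
print(result)

# The chat variant shares the same create/extract_text API.
response = oai.ChatCompletion.create(context=test_data[0], **config)
print(oai.ChatCompletion.extract_text(response))
```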
Interested in trying it yourself? Please check the following notebook examples:
* [Optimize for Code Gen](https://github.com/microsoft/FLAML/blob/main/notebook/autogen_openai.ipynb)
* [Optimize for Math](https://github.com/microsoft/FLAML/blob/main/notebook/autogen_chatgpt.ipynb)