update README.md

黎星 2023-11-17 13:49:55 +08:00
parent 1d76900822
commit 5d5a2d40b2
5 changed files with 12 additions and 11 deletions


@@ -1,25 +1,26 @@
# API-Bank: A Benchmark for Tool-Augmented LLMs
Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
# API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
arXiv: [[Abstract]](https://arxiv.org/abs/2304.08244)/[[PDF]](https://arxiv.org/pdf/2304.08244.pdf)
<!-- PDF: [API-Bank-arxiv-version.pdf](API-Bank-arxiv-version.pdf)
-->
## News
- **The Lynx model is released on [Huggingface Hub](https://huggingface.co/liminghao1630/Lynx-7b).** (A loading sketch follows this list.)
- **API-Bank is accepted by EMNLP 2023.**
- **The code and data of API-Bank have been released.**
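A minimal sketch of loading the released checkpoint, assuming the standard `transformers` auto classes work for it (the repo ID comes from the link above; check the model card for the authoritative usage):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID taken from the Huggingface Hub link in the News section; loading
# via the generic auto classes is an assumption, not documented usage.
model_id = "liminghao1630/Lynx-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```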
## Abstract
Recent research has shown that Large Language Models (LLMs) can utilize external tools to improve their contextual processing abilities, moving away from the pure language modeling paradigm and paving the way for Artificial General Intelligence. Despite this, there has been a lack of systematic evaluation demonstrating the efficacy of LLMs using tools to respond to human instructions. This paper presents API-Bank, the first benchmark tailored for Tool-Augmented LLMs. API-Bank includes 53 commonly used API tools, a complete Tool-Augmented LLM workflow, and 264 annotated dialogues that encompass a total of 568 API calls. These resources have been designed to thoroughly evaluate LLMs' ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs. The experimental results show that GPT-3.5 exhibits an emergent ability to use tools relative to GPT-3, while GPT-4 has stronger planning performance. Nevertheless, there remains considerable scope for improvement compared to human performance. Additionally, detailed error analysis and case studies demonstrate the feasibility of Tool-Augmented LLMs for daily use, as well as the primary challenges that future research needs to address.
Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs at utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome for LLMs to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool-utilization performance by more than 26 points and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field, answering the third question.
## Tool-Augmented LLMs Paradigm
## Multi-Agent Dataset Synthesis
![Paradigm](https://cdn.jsdelivr.net/gh/liminghao1630/auxiliary_use/figures/flowchart.png)
![multiagent](./figures/multi-agent.png)
## System Design
## Evaluation Tasks
![System](https://cdn.jsdelivr.net/gh/liminghao1630/auxiliary_use/figures/system.png)
![ability](./figures/three_ability.png)
## Demo
As far as we know, there is a conflict between the dependencies of the `googletrans` package and those of the `gradio` package, which may prevent the demo from running properly. There is no clean solution at the moment; you can uninstall `googletrans` before using the demo.
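Until the conflict is resolved upstream, a guard like the one below can fail fast instead of surfacing a confusing import error; it is only a sketch (the package names come from the note above, and the exact conflicting pins are not documented here):

```python
import importlib.util
import sys

# The demo may not run while googletrans is installed alongside gradio;
# exit early with an actionable message before gradio is imported.
if importlib.util.find_spec("googletrans") is not None:
    sys.exit("Dependency conflict: `pip uninstall googletrans` before starting the demo.")
```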
@@ -50,8 +51,8 @@ JsDelivr: https://cdn.jsdelivr.net/gh/liminghao1630/auxiliary_use/gpt-3.5-demo.g
## Evaluation
The conversation data for level-1 and level-2 is stored in the `lv1-lv2-samples` directory; follow the code in `evaluator.py` to design your evaluation script.
The evaluation of level-3 needs to be done manually; you can use `simulator.py` or `demo.py` for testing.
The conversation data for level-1 and level-2 is stored in the `lv1-lv2-samples` directory or `test-data`; follow the code in `evaluator.py`/`evaluator_by_json.py` to design your evaluation script.
The evaluation of level-3 requires `lv3_evaluator.py`.
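For orientation, here is a minimal sketch of a level-1/level-2 evaluation loop over the JSON samples. The record layout (`role`, `api_name`, `param_dict`) is a hypothetical stand-in; `evaluator.py`/`evaluator_by_json.py` define the real schema and matching rules:

```python
import json
from pathlib import Path

def load_dialogues(data_dir="lv1-lv2-samples"):
    """Yield one annotated dialogue per JSON file under the sample directory."""
    for path in sorted(Path(data_dir).glob("**/*.json")):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)

def api_call_accuracy(dialogues, predict_call):
    """Fraction of gold API-call turns that the model reproduces exactly."""
    correct = total = 0
    for dialogue in dialogues:
        for turn in dialogue:
            if turn.get("role") == "API":  # hypothetical marker for a gold call
                total += 1
                gold = (turn["api_name"], turn["param_dict"])
                correct += predict_call(dialogue, turn) == gold
    return correct / max(total, 1)
```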

Binary file not shown. (Before: 227 KiB → After: 96 KiB)

Binary file not shown. (Before: 180 KiB → After: 243 KiB)