mirror of https://github.com/microsoft/autogen.git
Magentic-One Log Viewer + preview API (#4032)
* update example script with logs dir, add screenshot timestamp
* readme examples update
* add flask app to view magentic_one
* remove copy example
* rename
* changes to magentic one helper
* update test web surfer to delete logs
* magentic_one icons
* fix colors - final log viewer
* fix termination condition
* update coder and log viewer
* timeout time
* make tests pass
* logs dir
* repeated thing
* remove log_viewer, mm web surfer comments
* coder change prompt, edit readmes
* type ignore
* remove logviewer
* add flag for coder agent
* readme
* changes readme
* uv lock
* update readme figures
* not yet
* pointer images
parent eca8a95c61
commit 8603317537

@ -33,10 +33,8 @@
*.tsx text
*.xml text
*.xhtml text diff=html

# Docker
Dockerfile text eol=lf

# Documentation
*.ipynb text
*.markdown text diff=markdown eol=lf

@ -62,7 +60,6 @@ NEWS text eol=lf
readme text eol=lf
*README* text eol=lf
TODO text

# Configs
*.cnf text eol=lf
*.conf text eol=lf

@ -84,8 +81,9 @@ yarn.lock text -diff
browserslist text
Makefile text eol=lf
makefile text eol=lf

# Images
*.png filter=lfs diff=lfs merge=lfs -text
*.jpg filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
python/packages/autogen-magentic-one/imgs/autogen-magentic-one-example.png filter=lfs diff=lfs merge=lfs -text
python/packages/autogen-magentic-one/imgs/autogen-magentic-one-agents.png filter=lfs diff=lfs merge=lfs -text

@ -0,0 +1,153 @@
# Magentic-One

> [!CAUTION]
> Using Magentic-One involves interacting with a digital world designed for humans, which carries inherent risks. To minimize these risks, consider the following precautions:
>
> 1. **Use Containers**: Run all tasks in Docker containers to isolate the agents and prevent direct system attacks.
> 2. **Virtual Environment**: Use a virtual environment to run the agents and prevent them from accessing sensitive data.
> 3. **Monitor Logs**: Closely monitor logs during and after execution to detect and mitigate risky behavior.
> 4. **Human Oversight**: Run the examples with a human in the loop to supervise the agents and prevent unintended consequences.
> 5. **Limit Access**: Restrict the agents' access to the internet and other resources to prevent unauthorized actions.
> 6. **Safeguard Data**: Ensure that the agents do not have access to sensitive data or resources that could be compromised. Do not share sensitive information with the agents.
>
> Be aware that agents may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure agents are monitored and operate within a controlled environment to prevent unintended consequences. Moreover, be cautious that Magentic-One may be susceptible to prompt injection attacks from webpages.

> [!NOTE]
> This code is currently being ported to AutoGen AgentChat. If you want to build on top of Magentic-One, we recommend waiting for the port to be completed. In the meantime, you can use this codebase to experiment with Magentic-One.

We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives.

![](./imgs/autogen-magentic-one-example.png)

> _Example_: The figure above illustrates the Magentic-One multi-agent team completing a complex task from the GAIA benchmark. Magentic-One's Orchestrator agent creates a plan, delegates tasks to other agents, and tracks progress towards the goal, dynamically revising the plan as needed. The Orchestrator can delegate tasks to a FileSurfer agent to read and handle files, a WebSurfer agent to operate a web browser, or a Coder or Computer Terminal agent to write or execute code, respectively.

## Architecture

![](./imgs/autogen-magentic-one-agents.png)

Magentic-One is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger where it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One's other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for several consecutive steps, it can update the Task Ledger and create a new plan. This is illustrated in the figure above; the Orchestrator's work is thus divided into an outer loop, where it updates the Task Ledger, and an inner loop, where it updates the Progress Ledger.

Overall, Magentic-One consists of the following agents:

- Orchestrator: the lead agent, responsible for task decomposition and planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed.
- WebSurfer: an LLM-based agent that is proficient in commanding and managing the state of a Chromium-based web browser. With each incoming request, the WebSurfer performs an action in the browser and then reports on the new state of the web page. The action space of the WebSurfer includes navigation (e.g., visiting a URL, performing a web search), web page actions (e.g., clicking and typing), and reading actions (e.g., summarizing or answering questions). The WebSurfer relies on the accessibility tree of the browser and on set-of-marks prompting to perform its actions.
- FileSurfer: an LLM-based agent that commands a markdown-based file preview application to read local files of most types. The FileSurfer can also perform common navigation tasks such as listing the contents of directories and navigating a folder structure.
- Coder: an LLM-based agent specialized through its system prompt for writing code, analyzing information collected from the other agents, or creating new artifacts.
- ComputerTerminal: provides the team with access to a console shell where the Coder's programs can be executed and where new programming libraries can be installed.

Together, Magentic-One's agents provide the Orchestrator with the tools and capabilities that it needs to solve a broad variety of open-ended problems, as well as the ability to autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.

While the default multimodal LLM we use for all agents is GPT-4o, Magentic-One is model-agnostic and can incorporate heterogeneous models to support different capabilities or meet different cost requirements when getting tasks done. For example, it can use different LLMs and SLMs, and their specialized versions, to power different agents. We recommend a strong reasoning model, such as GPT-4o, for the Orchestrator agent. In a different configuration of Magentic-One, we also experiment with using OpenAI o1-preview for the outer loop of the Orchestrator and for the Coder, while other agents continue to use GPT-4o.

### Logging in Team One Agents

Team One agents can emit several log events that can be consumed by a log handler (see the example log handler in [utils.py](src/autogen_magentic_one/utils.py)). The currently emitted events are:

- OrchestrationEvent: emitted by an [Orchestrator](src/autogen_magentic_one/agents/base_orchestrator.py) agent.
- WebSurferEvent: emitted by a [WebSurfer](src/autogen_magentic_one/agents/multimodal_web_surfer/multimodal_web_surfer.py) agent.

Developers can also handle and process logs generated by the AutoGen core library (e.g., LLMCallEvent). See the example log handler in [utils.py](src/autogen_magentic_one/utils.py) for how this can be implemented. By default, the logs are written to a file named `log.jsonl`; the filename can be configured as a parameter of the log handler. These logs can be parsed to retrieve data about agent actions.
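For example, the example scripts in this commit attach the handler to the AutoGen event logger roughly as follows (this mirrors `examples/example.py`; the `./logs` directory is only an illustrative choice):

```python
import logging
import os

from autogen_core.application.logging import EVENT_LOGGER_NAME
from autogen_magentic_one.utils import LogHandler

logs_dir = "./logs"  # illustrative location; any writable directory works
os.makedirs(logs_dir, exist_ok=True)

# Route Magentic-One events (OrchestrationEvent, WebSurferEvent, ...) to logs/log.jsonl.
logger = logging.getLogger(EVENT_LOGGER_NAME)
logger.setLevel(logging.INFO)
logger.handlers = [LogHandler(filename=os.path.join(logs_dir, "log.jsonl"))]
```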

# Setup and Usage

You can install the Magentic-One package and then run the example code to see how the agents work together to accomplish a task.

1. Clone the code and install the package:

```bash
git clone -b staging https://github.com/microsoft/autogen.git
cd autogen/python/packages/autogen-magentic-one
pip install -e .
```

The following instructions are for running the example code:

2. Configure the environment variables for the chat completion client. See the instructions below under [Environment Configuration for Chat Completion Client](#environment-configuration-for-chat-completion-client).
3. Magentic-One uses code execution, so you need to have [Docker installed](https://docs.docker.com/engine/install/) to run any examples.
4. Magentic-One uses Playwright to interact with web pages. Install the Playwright dependencies by running the following command:

```bash
playwright install-deps
```

5. Now you can run the example code to see how the agents work together to accomplish a task.

> [!CAUTION]
> The example code may download files from the internet, execute code, and interact with web pages. Ensure you are in a safe environment before running the example code.

> [!NOTE]
> You will need to ensure Docker is running prior to running the example.

```bash

# Specify logs directory
python examples/example.py --logs_dir ./my_logs

# Enable human-in-the-loop mode
python examples/example.py --logs_dir ./my_logs --hil_mode

# Save screenshots of browser
python examples/example.py --logs_dir ./my_logs --save_screenshots
```

Arguments:

- logs_dir: (Required) Directory for logs, downloads, and browser screenshots
- hil_mode: (Optional) Enable human-in-the-loop mode (default: disabled)
- save_screenshots: (Optional) Save screenshots of the browser (default: disabled)

6. [Preview] We have a preview API for Magentic-One.
You can use the `MagenticOneHelper` class to interact with the system. See the [interface README](interface/README.md) for more details.

## Environment Configuration for Chat Completion Client

This guide outlines how to configure your environment to use the `create_completion_client_from_env` function, which reads environment variables to return an appropriate `ChatCompletionClient`.

Currently, Magentic-One only supports OpenAI's GPT-4o as the underlying LLM.

### Azure with Active Directory

To configure for Azure with Active Directory, set the following environment variables:

- `CHAT_COMPLETION_PROVIDER='azure'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:

```json
{
  "api_version": "2024-02-15-preview",
  "azure_endpoint": "REPLACE_WITH_YOUR_ENDPOINT",
  "model_capabilities": {
    "function_calling": true,
    "json_output": true,
    "vision": true
  },
  "azure_ad_token_provider": "DEFAULT",
  "model": "gpt-4o-2024-05-13"
}
```
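For illustration, in a bash shell the two variables can be exported like this (using the placeholder values from the JSON above):

```bash
export CHAT_COMPLETION_PROVIDER='azure'
export CHAT_COMPLETION_KWARGS_JSON='{
  "api_version": "2024-02-15-preview",
  "azure_endpoint": "REPLACE_WITH_YOUR_ENDPOINT",
  "model_capabilities": {"function_calling": true, "json_output": true, "vision": true},
  "azure_ad_token_provider": "DEFAULT",
  "model": "gpt-4o-2024-05-13"
}'
```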

### With OpenAI

To configure for OpenAI, set the following environment variables:

- `CHAT_COMPLETION_PROVIDER='openai'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:

```json
{
  "api_key": "REPLACE_WITH_YOUR_API",
  "model": "gpt-4o-2024-05-13"
}
```

Feel free to replace the model with newer versions of gpt-4o if needed.
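Once the variables are set, a client can be created the same way the example scripts do (a minimal sketch; the `model` argument mirrors the call in `examples/example.py`):

```python
from autogen_magentic_one.utils import create_completion_client_from_env

# Reads CHAT_COMPLETION_PROVIDER and CHAT_COMPLETION_KWARGS_JSON from the environment
# and returns a ChatCompletionClient configured accordingly.
client = create_completion_client_from_env(model="gpt-4o")
```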

### Other Keys (Optional)

Some functionality, such as web search, requires a Bing API key.
You can set it using:

```bash
export BING_API_KEY=xxxxxxx
```

@ -1,11 +1,34 @@
# Examples of Magentic-One

**Note**: The examples in this folder are run at your own risk. They involve agents navigating the web, executing code and browsing local files. Please supervise the execution of the agents to reduce any risks. We also recommend running the examples in a docker environment.
**Note**: The examples in this folder are run at your own risk. They involve agents navigating the web, executing code and browsing local files. Please supervise the execution of the agents to reduce any risks. We also recommend running the examples in a virtual machine or a sandboxed environment.

We include various examples for using Magentic-One and its agents:

- [example.py](example.py): Is a human-in-the-loop run of Magentic-One trying to solve a task specified by user input. If you wish for the team to execute the task without involving the user, remove user_proxy from the Orchestrator agents list.
- [example.py](example.py): Is a human-in-the-loop run of Magentic-One trying to solve a task specified by user input.

```bash

# Specify logs directory
python examples/example.py --logs_dir ./my_logs

# Enable human-in-the-loop mode
python examples/example.py --logs_dir ./my_logs --hil_mode

# Save screenshots of browser
python examples/example.py --logs_dir ./my_logs --save_screenshots
```

Arguments:

- logs_dir: (Required) Directory for logs, downloads, and browser screenshots
- hil_mode: (Optional) Enable human-in-the-loop mode (default: disabled)
- save_screenshots: (Optional) Save screenshots of the browser (default: disabled)

The following examples are for individual agents in Magentic-One:

- [example_coder.py](example_coder.py): Is an example of the Coder + Executor agents in Magentic-One -- without the Magentic-One orchestrator. In a loop, specified by using the RoundRobinOrchestrator, the coder will write code based on user input, the executor will run the code, and then the user is asked for input again.

@ -16,4 +39,3 @@ We include various examples for using Magentic-One and its agents:
- [example_websurfer.py](example_websurfer.py): Is an example of the MultimodalWebSurfer agent in Magentic-One -- without the orchestrator. To view the browser the agent uses, pass the argument 'headless = False' to 'actual_surfer.init'. In a loop, specified by using the RoundRobinOrchestrator, the web surfer will perform a single action on the browser in response to user input and then the user is asked for input again.

Running these examples is simple. First make sure you have installed 'autogen-magentic-one' either from source or from pip, then run 'python example.py'.

@ -1,5 +1,6 @@
"""This example demonstrates MagenticOne performing a task given by the user and returning a final answer."""

import argparse
import asyncio
import logging
import os

@ -8,7 +9,7 @@ from autogen_core.application import SingleThreadedAgentRuntime
from autogen_core.application.logging import EVENT_LOGGER_NAME
from autogen_core.base import AgentId, AgentProxy
from autogen_core.components.code_executor import CodeBlock
from autogen_ext.code_executor.docker_executor import DockerCommandLineCodeExecutor
from autogen_ext.code_executors import DockerCommandLineCodeExecutor
from autogen_magentic_one.agents.coder import Coder, Executor
from autogen_magentic_one.agents.file_surfer import FileSurfer
from autogen_magentic_one.agents.multimodal_web_surfer import MultimodalWebSurfer

@ -28,14 +29,14 @@ async def confirm_code(code: CodeBlock) -> bool:
    return response.lower() == "yes"


async def main() -> None:
async def main(logs_dir: str, hil_mode: bool, save_screenshots: bool) -> None:
    # Create the runtime.
    runtime = SingleThreadedAgentRuntime()

    # Create an appropriate client
    client = create_completion_client_from_env(model="gpt-4o")

    async with DockerCommandLineCodeExecutor() as code_executor:
    async with DockerCommandLineCodeExecutor(work_dir=logs_dir) as code_executor:
        # Register agents.
        await Coder.register(runtime, "Coder", lambda: Coder(model_client=client))
        coder = AgentProxy(AgentId("Coder", "default"), runtime)

@ -61,11 +62,15 @@ async def main() -> None:
        )
        user_proxy = AgentProxy(AgentId("UserProxy", "default"), runtime)

        agent_list = [web_surfer, coder, executor, file_surfer]
        if hil_mode:
            agent_list.append(user_proxy)

        await LedgerOrchestrator.register(
            runtime,
            "Orchestrator",
            lambda: LedgerOrchestrator(
                agents=[web_surfer, user_proxy, coder, executor, file_surfer],
                agents=agent_list,
                model_client=client,
                max_rounds=30,
                max_time=25 * 60,

@ -79,10 +84,12 @@ async def main() -> None:
        actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, type=MultimodalWebSurfer)
        await actual_surfer.init(
            model_client=client,
            downloads_folder=os.getcwd(),
            downloads_folder=logs_dir,
            start_page="https://www.bing.com",
            browser_channel="chromium",
            headless=True,
            debug_dir=logs_dir,
            to_save_screenshots=save_screenshots,
        )

        await runtime.send_message(RequestReplyMessage(), user_proxy.id)

@ -90,8 +97,21 @@ async def main() -> None:


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run MagenticOne example with log directory.")
    parser.add_argument("--logs_dir", type=str, required=True, help="Directory to store log files and downloads")
    parser.add_argument("--hil_mode", action="store_true", default=False, help="Run in human-in-the-loop mode")
    parser.add_argument(
        "--save_screenshots", action="store_true", default=False, help="Save additional browser screenshots to file"
    )

    args = parser.parse_args()

    # Ensure the log directory exists
    if not os.path.exists(args.logs_dir):
        os.makedirs(args.logs_dir)

    logger = logging.getLogger(EVENT_LOGGER_NAME)
    logger.setLevel(logging.INFO)
    log_handler = LogHandler()
    log_handler = LogHandler(filename=os.path.join(args.logs_dir, "log.jsonl"))
    logger.handlers = [log_handler]
    asyncio.run(main())
    asyncio.run(main(args.logs_dir, args.hil_mode, args.save_screenshots))

@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e89c451d86c7e693127707e696443b77ddad2d9c596936f5fc2f6225cf4b431d
size 97407
oid sha256:25a3a1f79319b89d80b8459af8b522eb9a884dea842b11e3d7dae2bca30add5e
size 90181

@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a3aa615fa321b54e09efcd9dbb2e4d25a392232fd4e065f85b5a58ed58a7768c
size 298340

@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e6d0c57dc734747319fd4f847748fd2400cfb73ea01e87ac85dc8c28c738d21f
size 206468
oid sha256:fc910bda7e5f3b54d6502f26384f7b10b67f0597d7ac4631dfb45801882768fa
size 201460

@ -0,0 +1,50 @@
# MagenticOne Interface

This repository contains a preview interface for interacting with the MagenticOne system. It includes helper classes and example usage.

## Usage

### MagenticOneHelper

The MagenticOneHelper class provides an interface to interact with the MagenticOne system. It saves logs to a user-specified directory and provides methods to run tasks, stream logs, and retrieve the final answer.

The class provides the following methods:

- `async initialize(self) -> None`: Initializes the MagenticOne system, setting up agents and runtime.
- `async run_task(self, task: str) -> None`: Runs a specific task through the MagenticOne system.
- `get_final_answer(self) -> Optional[str]`: Retrieves the final answer from the Orchestrator.
- `async stream_logs(self) -> AsyncGenerator[Dict[str, Any], None]`: Streams logs from the system as they are generated.
- `get_all_logs(self) -> List[Dict[str, Any]]`: Retrieves all logs that have been collected so far.

We show an example of how to use the MagenticOneHelper class in [example_magentic_one_helper.py](example_magentic_one_helper.py).

```python
from magentic_one_helper import MagenticOneHelper
import asyncio
import json


async def magentic_one_example():
    # Create and initialize MagenticOne
    magnetic_one = MagenticOneHelper(logs_dir="./logs")
    await magnetic_one.initialize()
    print("MagenticOne initialized.")

    # Start a task and stream logs
    task = "How many members are in the MSR HAX Team"
    task_future = asyncio.create_task(magnetic_one.run_task(task))

    # Stream and process logs
    async for log_entry in magnetic_one.stream_logs():
        print(json.dumps(log_entry, indent=2))

    # Wait for task to complete
    await task_future

    # Get the final answer
    final_answer = magnetic_one.get_final_answer()

    if final_answer is not None:
        print(f"Final answer: {final_answer}")
    else:
        print("No final answer found in logs.")
```
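The accompanying script can then be invoked from this directory, for example (the task string is just a sample; `--logs_dir` defaults to `./logs`):

```bash
python example_magentic_one_helper.py "How many members are in the MSR HAX Team" --logs_dir ./logs
```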
@ -0,0 +1,40 @@
from magentic_one_helper import MagenticOneHelper
import asyncio
import json
import argparse
import os


async def main(task, logs_dir):
    magnetic_one = MagenticOneHelper(logs_dir=logs_dir)
    await magnetic_one.initialize()
    print("MagenticOne initialized.")

    # Create task and log streaming tasks
    task_future = asyncio.create_task(magnetic_one.run_task(task))
    final_answer = None

    # Stream and process logs
    async for log_entry in magnetic_one.stream_logs():
        print(json.dumps(log_entry, indent=2))

    # Wait for task to complete
    await task_future

    # Get the final answer
    final_answer = magnetic_one.get_final_answer()

    if final_answer is not None:
        print(f"Final answer: {final_answer}")
    else:
        print("No final answer found in logs.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run a task with MagenticOneHelper.")
    parser.add_argument("task", type=str, help="The task to run")
    parser.add_argument("--logs_dir", type=str, default="./logs", help="Directory to store logs")
    args = parser.parse_args()
    if not os.path.exists(args.logs_dir):
        os.makedirs(args.logs_dir)
    asyncio.run(main(args.task, args.logs_dir))

@ -0,0 +1,217 @@
import asyncio
import logging
import os
from typing import Optional, AsyncGenerator, Dict, Any, List
from datetime import datetime
import json
from dataclasses import asdict

from autogen_core.application import SingleThreadedAgentRuntime
from autogen_core.application.logging import EVENT_LOGGER_NAME
from autogen_core.base import AgentId, AgentProxy
from autogen_core.components import DefaultTopicId
from autogen_core.components.code_executor import LocalCommandLineCodeExecutor
from autogen_ext.code_executor.docker_executor import DockerCommandLineCodeExecutor
from autogen_core.components.code_executor import CodeBlock
from autogen_magentic_one.agents.coder import Coder, Executor
from autogen_magentic_one.agents.file_surfer import FileSurfer
from autogen_magentic_one.agents.multimodal_web_surfer import MultimodalWebSurfer
from autogen_magentic_one.agents.orchestrator import LedgerOrchestrator
from autogen_magentic_one.agents.user_proxy import UserProxy
from autogen_magentic_one.messages import BroadcastMessage
from autogen_magentic_one.utils import LogHandler, create_completion_client_from_env
from autogen_core.components.models import UserMessage
from threading import Lock


async def confirm_code(code: CodeBlock) -> bool:
    return True


class MagenticOneHelper:
    def __init__(self, logs_dir: str = None, save_screenshots: bool = False) -> None:
        """
        A helper class to interact with the MagenticOne system.
        Initialize MagenticOne instance.

        Args:
            logs_dir: Directory to store logs and downloads
            save_screenshots: Whether to save screenshots of web pages
        """
        self.logs_dir = logs_dir or os.getcwd()
        self.runtime: Optional[SingleThreadedAgentRuntime] = None
        self.log_handler: Optional[LogHandler] = None
        self.save_screenshots = save_screenshots

        if not os.path.exists(self.logs_dir):
            os.makedirs(self.logs_dir)

    async def initialize(self) -> None:
        """
        Initialize the MagenticOne system, setting up agents and runtime.
        """
        # Create the runtime
        self.runtime = SingleThreadedAgentRuntime()

        # Set up logging
        logger = logging.getLogger(EVENT_LOGGER_NAME)
        logger.setLevel(logging.INFO)
        self.log_handler = LogHandler(filename=os.path.join(self.logs_dir, "log.jsonl"))
        logger.handlers = [self.log_handler]

        # Create client
        client = create_completion_client_from_env(model="gpt-4o")

        # Set up code executor
        self.code_executor = DockerCommandLineCodeExecutor(work_dir=self.logs_dir)
        await self.code_executor.__aenter__()

        await Coder.register(self.runtime, "Coder", lambda: Coder(model_client=client))

        coder = AgentProxy(AgentId("Coder", "default"), self.runtime)

        await Executor.register(
            self.runtime,
            "Executor",
            lambda: Executor("A agent for executing code", executor=self.code_executor, confirm_execution=confirm_code),
        )
        executor = AgentProxy(AgentId("Executor", "default"), self.runtime)

        # Register agents.
        await MultimodalWebSurfer.register(self.runtime, "WebSurfer", MultimodalWebSurfer)
        web_surfer = AgentProxy(AgentId("WebSurfer", "default"), self.runtime)

        await FileSurfer.register(self.runtime, "file_surfer", lambda: FileSurfer(model_client=client))
        file_surfer = AgentProxy(AgentId("file_surfer", "default"), self.runtime)

        agent_list = [web_surfer, coder, executor, file_surfer]
        await LedgerOrchestrator.register(
            self.runtime,
            "Orchestrator",
            lambda: LedgerOrchestrator(
                agents=agent_list,
                model_client=client,
                max_rounds=30,
                max_time=25 * 60,
                max_stalls_before_replan=10,
                return_final_answer=True,
            ),
        )

        self.runtime.start()

        actual_surfer = await self.runtime.try_get_underlying_agent_instance(web_surfer.id, type=MultimodalWebSurfer)
        await actual_surfer.init(
            model_client=client,
            downloads_folder=os.getcwd(),
            start_page="https://www.bing.com",
            browser_channel="chromium",
            headless=True,
            debug_dir=self.logs_dir,
            to_save_screenshots=self.save_screenshots,
        )

    async def __aexit__(self, exc_type, exc_value, traceback) -> None:
        """
        Clean up resources.
        """
        if self.code_executor:
            await self.code_executor.__aexit__(exc_type, exc_value, traceback)

    async def run_task(self, task: str) -> None:
        """
        Run a specific task through the MagenticOne system.

        Args:
            task: The task description to be executed
        """
        if not self.runtime:
            raise RuntimeError("MagenticOne not initialized. Call initialize() first.")

        task_message = BroadcastMessage(content=UserMessage(content=task, source="UserProxy"))

        await self.runtime.publish_message(task_message, topic_id=DefaultTopicId())
        await self.runtime.stop_when_idle()

    def get_final_answer(self) -> Optional[str]:
        """
        Get the final answer from the Orchestrator.

        Returns:
            The final answer as a string
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")

        for log_entry in self.log_handler.logs_list:
            if (
                log_entry.get("type") == "OrchestrationEvent"
                and log_entry.get("source") == "Orchestrator (final answer)"
            ):
                return log_entry.get("message")
        return None

    async def stream_logs(self) -> AsyncGenerator[Dict[str, Any], None]:
        """
        Stream logs from the system as they are generated. Stops when it detects both
        the final answer and termination condition from the Orchestrator.

        Yields:
            Dictionary containing log entry information
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")

        last_index = 0
        found_final_answer = False
        found_termination = False
        found_termination_no_agent = False

        while True:
            current_logs = self.log_handler.logs_list
            while last_index < len(current_logs):
                log_entry = current_logs[last_index]
                yield log_entry
                # Check for termination condition

                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (final answer)"
                ):
                    found_final_answer = True

                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (termination condition)"
                ):
                    found_termination = True

                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (termination condition)"
                    and log_entry.get("message") == "No agent selected."
                ):
                    found_termination_no_agent = True

                if self.runtime._run_context is None:
                    return

                if found_termination_no_agent and found_final_answer:
                    return
                elif found_termination and not found_termination_no_agent:
                    return

                last_index += 1

            await asyncio.sleep(0.1)  # Small delay to prevent busy waiting

    def get_all_logs(self) -> List[Dict[str, Any]]:
        """
        Get all logs that have been collected so far.

        Returns:
            List of all log entries
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")
        return self.log_handler.logs_list

@ -7,7 +7,7 @@ name = "autogen-magentic-one"
version = "0.0.1"
license = {file = "LICENSE-CODE"}
description = ''
readme = "readme.md"
readme = "README.md"
requires-python = ">=3.10"
keywords = []
classifiers = [

@ -18,7 +18,7 @@ classifiers = [

dependencies = [
    "autogen-core",
    "autogen-ext",
    "autogen-ext[docker]",
    "beautifulsoup4",
    "aiofiles",
    "requests",

@ -1,230 +0,0 @@
# Magentic-One

Magentic-One is a generalist multi-agent softbot that utilizes a combination of five agents, including LLM and tool-based agents, to tackle intricate tasks. For example, it can be used to solve general tasks that involve multi-step planning and action in the real world.

![](./imgs/autogen-magentic-one-example.png)

> _Example_: Suppose a user requests the following: _Can you rewrite the readme of the autogen GitHub repository to be more clear_. Magentic-One will use the following process to handle this task. The Orchestrator agent will break down the task into subtasks and assign them to the appropriate agents. In this case, the WebSurfer will navigate to GitHub, search for the autogen repository, and extract the readme file. Next, the Coder agent will rewrite the readme file for clarity and return the updated content to the Orchestrator. At each point, the Orchestrator will monitor progress via a ledger, and terminate when the task is completed successfully.

## Architecture

<!-- <center>
<img src="./imgs/autgen" alt="drawing" style="width:350px;"/>
</center> -->

![](./imgs/autogen-magentic-one-agents.png)

Magentic-One uses agents with the following personas and capabilities:

- Orchestrator: The orchestrator agent is responsible for planning, managing subgoals, and coordinating the other agents. It can break down complex tasks into smaller subtasks and assign them to the appropriate agents. It also keeps track of the overall progress and takes corrective actions if needed (such as reassigning tasks or replanning when stuck).

- Coder: The coder agent is skilled in programming languages and is responsible for writing code.

- Computer Terminal: The computer terminal agent acts as the interface that can execute code written by the coder agent.

- Web Surfer: The web surfer agent is proficient in web-related tasks. It can browse the internet, retrieve information from websites, and interact with web-based applications. It can handle interactive web pages, forms, and other web elements.

- File Surfer: The file surfer agent specializes in navigating files such as PDFs, PowerPoints, WAV files, and other file types. It can search, read, and extract information from files.

We created Magentic-One with one agent of each type because their combined abilities help tackle tough benchmarks. By splitting tasks among different agents, we keep the code simple and modular, like in object-oriented programming. This also makes each agent's job easier since they only need to focus on specific tasks. For example, the websurfer agent only needs to navigate webpages and doesn't worry about writing code, making the team more efficient and effective.

### Planning and Tracking Task Progress

<center>
<img src="./imgs/autogen-magentic-one-arch.png" alt="drawing" style=""/>
</center>

The figure illustrates the workflow of an orchestrator managing a multi-agent setup, starting with an initial prompt or task. The orchestrator creates or updates a ledger with gathered information, including verified facts, facts to look up, derived facts, and educated guesses. Using this ledger, a plan is derived, which consists of a sequence of steps and task assignments for the agents. Before execution, the orchestrator clears the agents' contexts to ensure they start fresh. The orchestrator then evaluates if the request is fully satisfied. If so, it reports the final answer or an educated guess.

If the request is not fully satisfied, the orchestrator assesses whether the work is progressing or if there are significant barriers. If progress is being made, the orchestrator orchestrates the next step by selecting an agent and providing instructions. If the process stalls for more than two iterations, the ledger is updated with new information, and the plan is adjusted. This cycle continues, iterating through steps and evaluations, until the task is completed. The orchestrator ensures organized, effective tracking and iterative problem-solving to achieve the prompt's goal.

Note that many parameters such as termination logic and the maximum number of stalled iterations are configurable. Also note that the orchestrator cannot instantiate new agents. This is possible but not implemented in Magentic-One.

## Table of Definitions:

| Term | Definition |
| --------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Agent | A component that can (autonomously) act based on observations. Different agents may have different functions and actions. |
| Planning | The process of determining actions to achieve goals, performed by the Orchestrator agent in Magentic-One. |
| Ledger | A record-keeping component used by the Orchestrator agent to track the progress and manage subgoals in Magentic-One. |
| Stateful Tools | Tools that maintain state or data, such as the web browser and markdown-based file browser used by Magentic-One. |
| Tools | Resources used by Magentic-One for various purposes, including stateful and stateless tools. |
| Stateless Tools | Tools that do not maintain state or data, like the command-line executor used by Magentic-One. |

## Capabilities and Performance

### Capabilities

- Planning: The Orchestrator agent in Magentic-One excels at performing planning tasks. Planning involves determining actions to achieve goals. The Orchestrator agent breaks down complex tasks into smaller subtasks and assigns them to the appropriate agents.

- Ledger: The Orchestrator agent in Magentic-One utilizes a ledger, which is a record-keeping component. The ledger tracks the progress of tasks and manages subgoals. It allows the Orchestrator agent to monitor the overall progress of the system and take corrective actions if needed.

- Acting in the Real World: Magentic-One is designed to take action in the real world based on observations. The agents in Magentic-One can autonomously perform actions based on the information they observe from their environment.

- Adaptation to Observation: The agents in Magentic-One can adapt to new observations. They can update their knowledge and behavior based on the information they receive from their environment. This allows Magentic-One to effectively handle dynamic and changing situations.

- Stateful Tools: Magentic-One utilizes stateful tools such as a web browser and a markdown-based file browser. These tools maintain state or data, which is essential for performing complex tasks that involve actions that might change the state of the environment.

- Stateless Tools: Magentic-One also utilizes stateless tools such as a command-line executor. These tools do not maintain state or data.

- Coding: The Coder agent in Magentic-One is highly skilled in programming languages and is responsible for writing code. This capability enables Magentic-One to create and execute code to accomplish various tasks.

- Execution of Code: The Computer Terminal agent in Magentic-One acts as an interface that can execute code written by the Coder agent. This capability allows Magentic-One to execute the code and perform actions in the system.

- File Navigation and Extraction: The File Surfer agent in Magentic-One specializes in navigating and extracting information from various file types such as PDFs, PowerPoints, and WAV files. This capability enables Magentic-One to search, read, and extract relevant information from files.

- Web Interaction: The Web Surfer agent in Magentic-One is proficient in web-related tasks. It can browse the internet, retrieve information from websites, and interact with web-based applications. This capability allows Magentic-One to handle interactive web pages, forms, and other web elements.

### What Magentic-One Cannot Do

- **Video Scrubbing:** The agents are unable to navigate and process video content.
- **User in the Loop Optimization:** The system does not currently incorporate ongoing user interaction beyond the initial task submission.
- **Code Execution Beyond Python or Shell:** The agents are limited to executing code written in Python or shell scripts.
- **Agent Instantiation:** The orchestrator agent cannot create new agents dynamically.
- **Session-Based Learning:** The agents do not learn from previous sessions or retain information beyond the current session.
- **Limited LLM Capacity:** The agents' abilities are constrained by the limitations of the underlying language model.
- **Web Surfer Limitations:** The web surfer agent may struggle with certain types of web pages, such as those requiring complex interactions or extensive JavaScript handling.

### Safety and Risks

**Code Execution:**

- **Risks:** Code execution carries inherent risks as it happens in the environment where the agents run using the command line executor. This means that the agents can execute arbitrary Python code.
- **Mitigation:** Users are advised to run the system in isolated environments, such as Docker containers, to mitigate the risks associated with executing arbitrary code.

**Web Browsing:**

- **Capabilities:** The web surfer agent can operate on most websites, including performing tasks like booking flights.
- **Risks:** Since the requests are sent online using GPT-4-based models, there are potential privacy and security concerns. It is crucial not to provide sensitive information such as keys or credit card data to the agents.

**Safeguards:**

- **Guardrails from LLM:** The agents inherit the guardrails from the underlying language model (e.g., GPT-4). This means they will refuse to generate toxic or stereotyping content, providing a layer of protection against generating harmful outputs.
- **Limitations:** The agents' behavior is directly influenced by the capabilities and limitations of the underlying LLM. Consequently, any lack of guardrails in the language model will also affect the behavior of the agents.

**General Recommendations:**

- Always use isolated or controlled environments for running the agents to prevent unauthorized or harmful code execution.
- Avoid sharing sensitive information with the agents to protect your privacy and security.
- Regularly update and review the underlying LLM and system configurations to ensure they adhere to the latest safety and security standards.

### Performance

Magentic-One currently achieves the following performance on complex agent benchmarks.

#### GAIA

GAIA is a benchmark from Meta that contains complex tasks that require multi-step reasoning and tool use. For example,

> _Example_: If Eliud Kipchoge could maintain his record-making marathon pace indefinitely, how many thousand hours would it take him to run the distance between the Earth and the Moon its closest approach? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Round your result to the nearest 1000 hours and do not use any comma separators if necessary.

In order to solve this task, the orchestrator begins by outlining the steps needed to solve the task of calculating how many thousand hours it would take Eliud Kipchoge to run the distance between the Earth and the Moon at its closest approach. The orchestrator instructs the web surfer agent to gather Eliud Kipchoge's marathon world record time (2:01:39) and the minimum perigee distance of the Moon from Wikipedia (356,400 kilometers).

Next, the orchestrator assigns the assistant agent to use this data to perform the necessary calculations. The assistant converts Kipchoge's marathon time to hours (2.0275 hours) and calculates his speed (approximately 20.81 km/h). It then calculates the total time to run the distance to the Moon (17,130.13 hours) and rounds it to the nearest thousand hours, resulting in approximately 17 thousand hours. The orchestrator then confirms and reports this final result.
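As a quick sanity check of these numbers (a small illustrative script; the 42.195 km marathon distance is an assumption not stated in the text, while the other figures come from the paragraph above):

```python
# Verify the rough arithmetic for the GAIA example.
record_hours = 2 + 1 / 60 + 39 / 3600      # 2:01:39 expressed in hours (≈ 2.0275)
speed_kmh = 42.195 / record_hours          # marathon distance / record time ≈ 20.81 km/h
hours_to_moon = 356_400 / speed_kmh        # minimum perigee distance ≈ 17,100 hours of running
print(round(hours_to_moon / 1000), "thousand hours")  # -> 17
```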

Here is the performance of Magentic-One on a GAIA development set.

| Level | Task Completion Rate\* |
| ------- | ---------------------- |
| Level 1 | 55% (29/53) |
| Level 2 | 34% (29/86) |
| Level 3 | 12% (3/26) |
| Total | 37% (61/165) |

\*Indicates the percentage of tasks completed successfully on the _validation_ set.

#### WebArena

> Example: Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the Showerthoughts forum.

To solve this task, the agents began by logging into the Postmill platform using provided credentials and navigating to the Showerthoughts forum. They identified the latest post in this forum, which was made by a user named Waoonet. To proceed with the task, they then accessed Waoonet's profile to examine the comments section, where they could find all comments made by this user.

Once on Waoonet's profile, the agents focused on counting the comments that had received more downvotes than upvotes. The web_surfer agent analyzed the available comments and found that Waoonet had made two comments, both of which had more upvotes than downvotes. Consequently, they concluded that none of Waoonet's comments had received more downvotes than upvotes. This information was summarized and reported back, completing the task successfully.

| Site | Task Completion Rate |
| -------------- | -------------------- |
| Reddit | 54% (57/106) |
| Shopping | 33% (62/187) |
| CMS | 29% (53/182) |
| Gitlab | 28% (50/180) |
| Maps | 35% (38/109) |
| Multiple Sites | 15% (7/48) |
| Total | 33% (267/812) |

### Logging in Team One Agents

Team One agents can emit several log events that can be consumed by a log handler (see the example log handler in [utils.py](src/autogen_magentic_one/utils.py)). The currently emitted events are:

- OrchestrationEvent: emitted by an [Orchestrator](src/autogen_magentic_one/agents/base_orchestrator.py) agent.
- WebSurferEvent: emitted by a [WebSurfer](src/autogen_magentic_one/agents/multimodal_web_surfer/multimodal_web_surfer.py) agent.

Developers can also handle and process logs generated by the AutoGen core library (e.g., LLMCallEvent). See the example log handler in [utils.py](src/autogen_magentic_one/utils.py) for how this can be implemented. By default, the logs are written to a file named `log.jsonl`; the filename can be configured as a parameter of the log handler. These logs can be parsed to retrieve data about agent actions.

# Setup

You can install the Magentic-One package using pip and then run the example code to see how the agents work together to accomplish a task.

1. Clone the code.

```bash
git clone -b staging https://github.com/microsoft/autogen.git
cd autogen/python/packages/autogen-magentic-one
pip install -e .
```

2. Configure the environment variables for the chat completion client. See instructions below.
3. Now you can run the example code to see how the agents work together to accomplish a task.

**NOTE:** The example code may download files from the internet, execute code, and interact with web pages. Ensure you are in a safe environment before running the example code.

```bash
python examples/example.py
```

## Environment Configuration for Chat Completion Client

This guide outlines how to configure your environment to use the `create_completion_client_from_env` function, which reads environment variables to return an appropriate `ChatCompletionClient`.

### Azure with Active Directory

To configure for Azure with Active Directory, set the following environment variables:

- `CHAT_COMPLETION_PROVIDER='azure'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:

```json
{
  "api_version": "2024-02-15-preview",
  "azure_endpoint": "REPLACE_WITH_YOUR_ENDPOINT",
  "model_capabilities": {
    "function_calling": true,
    "json_output": true,
    "vision": true
  },
  "azure_ad_token_provider": "DEFAULT",
  "model": "gpt-4o-2024-05-13"
}
```

### With OpenAI

To configure for OpenAI, set the following environment variables:

- `CHAT_COMPLETION_PROVIDER='openai'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:

```json
{
  "api_key": "REPLACE_WITH_YOUR_API",
  "model": "gpt-4o-2024-05-13"
}
```

### Other Keys

Some functionality, such as web search, requires a Bing API key.
You can set it using:

```bash
export BING_API_KEY=xxxxxxx
```

@ -40,10 +40,12 @@ Reply "TERMINATE" in the end when everything is done.""")
        model_client: ChatCompletionClient,
        description: str = DEFAULT_DESCRIPTION,
        system_messages: List[SystemMessage] = DEFAULT_SYSTEM_MESSAGES,
        request_terminate: bool = False,
    ) -> None:
        super().__init__(description)
        self._model_client = model_client
        self._system_messages = system_messages
        self._request_terminate = request_terminate

    async def _generate_reply(self, cancellation_token: CancellationToken) -> Tuple[bool, UserContent]:
        """Respond to a reply request."""

@ -53,7 +55,10 @@ Reply "TERMINATE" in the end when everything is done.""")
            self._system_messages + self._chat_history, cancellation_token=cancellation_token
        )
        assert isinstance(response.content, str)
        return "TERMINATE" in response.content, response.content
        if self._request_terminate:
            return "TERMINATE" in response.content, response.content
        else:
            return False, response.content


# True if the user confirms the code, False otherwise

@ -6,6 +6,7 @@ import logging
|
|||
import os
|
||||
import pathlib
|
||||
import re
|
||||
import time
|
||||
import traceback
|
||||
from typing import Any, BinaryIO, Dict, List, Optional, Tuple, Union, cast # Any, Callable, Dict, List, Literal, Tuple
|
||||
from urllib.parse import quote_plus # parse_qs, quote, unquote, urlparse, urlunparse
|
||||
|
@ -85,7 +86,7 @@ class MultimodalWebSurfer(BaseWorker):
|
|||
self,
|
||||
description: str = DEFAULT_DESCRIPTION,
|
||||
):
|
||||
"""Do not instantiate directly. Call MultimodalWebSurfer.create instead."""
|
||||
"""To instantiate properly please make sure to call MultimodalWebSurfer.init"""
|
||||
super().__init__(description)
|
||||
|
||||
# Call init to set these
|
||||
|
@ -116,12 +117,28 @@ class MultimodalWebSurfer(BaseWorker):
|
|||
start_page: str | None = None,
|
||||
downloads_folder: str | None = None,
|
||||
debug_dir: str | None = os.getcwd(),
|
||||
to_save_screenshots: bool = False,
|
||||
# navigation_allow_list=lambda url: True,
|
||||
markdown_converter: Any | None = None, # TODO: Fixme
|
||||
) -> None:
|
||||
"""
|
||||
Initialize the MultimodalWebSurfer.
|
||||
|
||||
Args:
|
||||
model_client (ChatCompletionClient): The client to use for chat completions.
|
||||
headless (bool): Whether to run the browser in headless mode. Defaults to True.
|
||||
browser_channel (str | type[DEFAULT_CHANNEL]): The browser channel to use. Defaults to DEFAULT_CHANNEL.
|
||||
browser_data_dir (str | None): The directory to store browser data. Defaults to None.
|
||||
start_page (str | None): The initial page to visit. Defaults to DEFAULT_START_PAGE.
|
||||
downloads_folder (str | None): The folder to save downloads. Defaults to None.
|
||||
debug_dir (str | None): The directory to save debug information. Defaults to the current working directory.
|
||||
to_save_screenshots (bool): Whether to save screenshots. Defaults to False.
|
||||
markdown_converter (Any | None): The markdown converter to use. Defaults to None.
|
||||
"""
|
||||
self._model_client = model_client
|
||||
self.start_page = start_page or self.DEFAULT_START_PAGE
|
||||
self.downloads_folder = downloads_folder
|
||||
self.to_save_screenshots = to_save_screenshots
|
||||
self._chat_history: List[LLMMessage] = []
|
||||
self._last_download = None
|
||||
self._prior_metadata_hash = None
|
||||
|
@@ -175,35 +192,57 @@ class MultimodalWebSurfer(BaseWorker):

        if not os.path.isdir(self.debug_dir):
            os.mkdir(self.debug_dir)

        debug_html = os.path.join(self.debug_dir, "screenshot.html")
        async with aiofiles.open(debug_html, "wt") as file:
            await file.write(
                f"""
<html style="width:100%; margin: 0px; padding: 0px;">
<body style="width: 100%; margin: 0px; padding: 0px;">
<img src="screenshot.png" id="main_image" style="width: 100%; max-width: {VIEWPORT_WIDTH}px; margin: 0px; padding: 0px;">
<script language="JavaScript">
var counter = 0;
setInterval(function() {{
    counter += 1;
    document.getElementById("main_image").src = "screenshot.png?bc=" + counter;
}}, 300);
</script>
</body>
</html>
""".strip(),
        current_timestamp = "_" + int(time.time()).__str__()
        screenshot_png_name = "screenshot" + current_timestamp + ".png"
        debug_html = os.path.join(self.debug_dir, "screenshot" + current_timestamp + ".html")
        if self.to_save_screenshots:
            async with aiofiles.open(debug_html, "wt") as file:
                await file.write(
                    f"""
<html style="width:100%; margin: 0px; padding: 0px;">
<body style="width: 100%; margin: 0px; padding: 0px;">
<img src= {screenshot_png_name} id="main_image" style="width: 100%; max-width: {VIEWPORT_WIDTH}px; margin: 0px; padding: 0px;">
<script language="JavaScript">
var counter = 0;
setInterval(function() {{
    counter += 1;
    document.getElementById("main_image").src = "screenshot.png?bc=" + counter;
}}, 300);
</script>
</body>
</html>
""".strip(),
                )
        if self.to_save_screenshots:
            await self._page.screenshot(path=os.path.join(self.debug_dir, screenshot_png_name))
            self.logger.info(
                WebSurferEvent(
                    source=self.metadata["type"],
                    url=self._page.url,
                    message="Screenshot: " + screenshot_png_name,
                )
            )
            self.logger.info(
                f"Multimodal Web Surfer debug screens: {pathlib.Path(os.path.abspath(debug_html)).as_uri()}\n"
            )
        await self._page.screenshot(path=os.path.join(self.debug_dir, "screenshot.png"))
        self.logger.info(f"Multimodal Web Surfer debug screens: {pathlib.Path(os.path.abspath(debug_html)).as_uri()}\n")

    async def _reset(self, cancellation_token: CancellationToken) -> None:
        assert self._page is not None
        future = super()._reset(cancellation_token)
        await future
        await self._visit_page(self.start_page)
        if self.debug_dir:
            await self._page.screenshot(path=os.path.join(self.debug_dir, "screenshot.png"))
        if self.to_save_screenshots:
            current_timestamp = "_" + int(time.time()).__str__()
            screenshot_png_name = "screenshot" + current_timestamp + ".png"
            await self._page.screenshot(path=os.path.join(self.debug_dir, screenshot_png_name))  # type: ignore
            self.logger.info(
                WebSurferEvent(
                    source=self.metadata["type"],
                    url=self._page.url,
                    message="Screenshot: " + screenshot_png_name,
                )
            )

        self.logger.info(
            WebSurferEvent(
                source=self.metadata["type"],
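The debug page above is a simple live viewer: the `?bc=` counter is a cache-buster, so the browser re-fetches the screenshot roughly every 300 ms. A standalone sketch of the same trick follows; the viewer file name and the default width are illustrative, not taken from the module constants.

```python
# Standalone sketch of the auto-refreshing debug viewer written above.
# The viewer file name and default width are illustrative.
import os


def write_debug_viewer(debug_dir: str, screenshot_png_name: str, max_width: int = 1440) -> str:
    """Write an HTML page that re-loads the screenshot every 300 ms via a cache-busting query string."""
    html = f"""
<html style="width:100%; margin: 0px; padding: 0px;">
<body style="width: 100%; margin: 0px; padding: 0px;">
<img src="{screenshot_png_name}" id="main_image" style="width: 100%; max-width: {max_width}px; margin: 0px; padding: 0px;">
<script language="JavaScript">
var counter = 0;
setInterval(function() {{
    counter += 1;
    document.getElementById("main_image").src = "{screenshot_png_name}?bc=" + counter;
}}, 300);
</script>
</body>
</html>
""".strip()
    viewer_path = os.path.join(debug_dir, "viewer.html")
    with open(viewer_path, "wt") as f:
        f.write(html)
    return viewer_path
```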
@@ -373,7 +412,7 @@ setInterval(function() {{

        # Handle metadata
        page_metadata = json.dumps(await self._get_page_metadata(), indent=4)
        metadata_hash = hashlib.sha256(page_metadata.encode("utf-8")).hexdigest()
        metadata_hash = hashlib.md5(page_metadata.encode("utf-8")).hexdigest()
        if metadata_hash != self._prior_metadata_hash:
            page_metadata = (
                "\nThe following metadata was extracted from the webpage:\n\n" + page_metadata.strip() + "\n"
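The digest comparison above is a change-detection gate: page metadata is only re-included in the prompt when it differs from the previous page's. A minimal sketch of the same pattern in isolation, with illustrative names:

```python
# Illustrative sketch of the change-detection pattern above; any stable digest
# works here since it is only compared for equality.
import hashlib
import json

_prior_metadata_hash = None


def metadata_if_changed(metadata: dict) -> str:
    """Return the metadata blurb only when the page metadata differs from the previous call."""
    global _prior_metadata_hash
    serialized = json.dumps(metadata, indent=4)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    if digest == _prior_metadata_hash:
        return ""
    _prior_metadata_hash = digest
    return "\nThe following metadata was extracted from the webpage:\n\n" + serialized.strip() + "\n"
```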
@@ -394,9 +433,18 @@ setInterval(function() {{
        position_text = str(percent_scrolled) + "% down from the top of the page"

        new_screenshot = await self._page.screenshot()
        if self.debug_dir:
            async with aiofiles.open(os.path.join(self.debug_dir, "screenshot.png"), "wb") as file:
                await file.write(new_screenshot)
        if self.to_save_screenshots:
            current_timestamp = "_" + int(time.time()).__str__()
            screenshot_png_name = "screenshot" + current_timestamp + ".png"
            async with aiofiles.open(os.path.join(self.debug_dir, screenshot_png_name), "wb") as file:  # type: ignore
                await file.write(new_screenshot)  # type: ignore
            self.logger.info(
                WebSurferEvent(
                    source=self.metadata["type"],
                    url=self._page.url,
                    message="Screenshot: " + screenshot_png_name,
                )
            )

        ocr_text = (
            await self._get_ocr_text(new_screenshot, cancellation_token=cancellation_token) if use_ocr is True else ""

@@ -435,9 +483,17 @@ setInterval(function() {{
        screenshot = await self._page.screenshot()
        som_screenshot, visible_rects, rects_above, rects_below = add_set_of_mark(screenshot, rects)

        if self.debug_dir:
            som_screenshot.save(os.path.join(self.debug_dir, "screenshot.png"))

        if self.to_save_screenshots:
            current_timestamp = "_" + int(time.time()).__str__()
            screenshot_png_name = "screenshot_som" + current_timestamp + ".png"
            som_screenshot.save(os.path.join(self.debug_dir, screenshot_png_name))  # type: ignore
            self.logger.info(
                WebSurferEvent(
                    source=self.metadata["type"],
                    url=self._page.url,
                    message="Screenshot: " + screenshot_png_name,
                )
            )
        # What tools are available?
        tools = [
            TOOL_VISIT_URL,

@@ -516,8 +572,8 @@ When deciding between tools, consider if the request can be best addressed by:
        # Scale the screenshot for the MLM, and close the original
        scaled_screenshot = som_screenshot.resize((MLM_WIDTH, MLM_HEIGHT))
        som_screenshot.close()
        if self.debug_dir:
            scaled_screenshot.save(os.path.join(self.debug_dir, "screenshot_scaled.png"))
        if self.to_save_screenshots:
            scaled_screenshot.save(os.path.join(self.debug_dir, "screenshot_scaled.png"))  # type: ignore

        # Add the multimodal message and make the request
        history.append(
@@ -104,6 +104,7 @@ def message_content_to_str(
class LogHandler(logging.FileHandler):
    def __init__(self, filename: str = "log.jsonl") -> None:
        super().__init__(filename)
        self.logs_list: List[Dict[str, Any]] = []

    def emit(self, record: logging.LogRecord) -> None:
        try:

@@ -121,6 +122,7 @@ class LogHandler(logging.FileHandler):
                        "type": "OrchestrationEvent",
                    }
                )
                self.logs_list.append(json.loads(record.msg))
                super().emit(record)
            elif isinstance(record.msg, AgentEvent):
                console_message = (

@@ -135,6 +137,7 @@ class LogHandler(logging.FileHandler):
                        "type": "AgentEvent",
                    }
                )
                self.logs_list.append(json.loads(record.msg))
                super().emit(record)
            elif isinstance(record.msg, WebSurferEvent):
                console_message = f"\033[96m[{ts}], {record.msg.source}: {record.msg.message}\033[0m"

@@ -145,6 +148,7 @@ class LogHandler(logging.FileHandler):
                }
                payload.update(asdict(record.msg))
                record.msg = json.dumps(payload)
                self.logs_list.append(json.loads(record.msg))
                super().emit(record)
            elif isinstance(record.msg, LLMCallEvent):
                record.msg = json.dumps(

@@ -155,6 +159,7 @@ class LogHandler(logging.FileHandler):
                        "type": "LLMCallEvent",
                    }
                )
                self.logs_list.append(json.loads(record.msg))
                super().emit(record)
        except Exception:
            self.handleError(record)
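`LogHandler` now keeps a parsed copy of every emitted event in `logs_list`, and the same records land as JSON objects in `log.jsonl`, which is what a log viewer can consume. A minimal sketch for reading that file back; the event `"type"` values and the `"Screenshot: ..."` message format follow the `emit()` branches above, and the filtering example is illustrative:

```python
# Sketch: reading the JSONL log written by LogHandler. Event "type" values and the
# "Screenshot: ..." message format come from the emit() code above; everything else is illustrative.
import json
from typing import Any, Dict, List


def load_log(path: str = "log.jsonl") -> List[Dict[str, Any]]:
    events: List[Dict[str, Any]] = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


# e.g. collect the screenshots recorded by the web surfer for a timeline view
screenshots = [
    e["message"].removeprefix("Screenshot: ")
    for e in load_log()
    if e.get("type") == "WebSurferEvent" and str(e.get("message", "")).startswith("Screenshot: ")
]
```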
@@ -41,7 +41,7 @@ pytest_plugins = ("pytest_asyncio",)
BLOG_POST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AutoGen"
BING_QUERY = "Microsoft"

DEBUG_DIR = "test_logs_web_surfer_autogen"

skip_all = False

@@ -65,6 +65,22 @@ else:
    skip_openai = True


def _rm_folder(path: str) -> None:
    """Remove all the regular files in a folder, then deletes the folder. Assumes a flat file structure, with no subdirectories."""
    for fname in os.listdir(path):
        fpath = os.path.join(path, fname)
        if os.path.isfile(fpath):
            os.unlink(fpath)
    os.rmdir(path)


def _create_logs_dir() -> None:
    logs_dir = os.path.join(os.getcwd(), DEBUG_DIR)
    if os.path.isdir(logs_dir):
        _rm_folder(logs_dir)
    os.mkdir(logs_dir)


def generate_tool_request(tool: ToolSchema, args: Mapping[str, str]) -> list[FunctionCall]:
    ret = [FunctionCall(id="", arguments="", name=tool["name"])]
    ret[0].arguments = dumps(args)
@@ -106,7 +122,9 @@ async def test_web_surfer() -> None:
    runtime.start()

    actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer, MultimodalWebSurfer)
    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
    await actual_surfer.init(
        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
    )

    # Test some basic navigations
    tool_resp = await make_browser_request(actual_surfer, TOOL_VISIT_URL, {"url": BLOG_POST_URL})

@@ -189,7 +207,9 @@ async def test_web_surfer_oai() -> None:
    runtime.start()

    actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, MultimodalWebSurfer)
    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
    await actual_surfer.init(
        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
    )

    await runtime.send_message(
        BroadcastMessage(

@@ -248,7 +268,9 @@ async def test_web_surfer_bing() -> None:

    runtime.start()
    actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, MultimodalWebSurfer)
    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
    await actual_surfer.init(
        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
    )

    # Test some basic navigations
    tool_resp = await make_browser_request(actual_surfer, TOOL_WEB_SEARCH, {"query": BING_QUERY})

@@ -262,10 +284,15 @@ async def test_web_surfer_bing() -> None:
    markdown = await actual_surfer._get_page_markdown()  # type: ignore
    assert "https://en.wikipedia.org/wiki/" in markdown
    await runtime.stop_when_idle()
    # remove the logs directory
    _rm_folder(DEBUG_DIR)


if __name__ == "__main__":
    """Runs this file's tests from the command line."""

    _create_logs_dir()
    asyncio.run(test_web_surfer())
    asyncio.run(test_web_surfer_oai())
    # IMPORTANT: last test should remove the logs directory
    asyncio.run(test_web_surfer_bing())
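When run as a script, the block above creates the logs directory up front and relies on the Bing test running last to remove it. Under pytest that ordering is not guaranteed; one alternative sketch (not part of this commit) is a session-scoped fixture that owns the directory lifecycle, reusing the helpers defined earlier in this test module:

```python
# Alternative sketch, not part of this commit: let pytest manage DEBUG_DIR instead of
# relying on test ordering. Reuses _create_logs_dir/_rm_folder from this test module.
import pytest


@pytest.fixture(scope="session", autouse=True)
def logs_dir():
    _create_logs_dir()
    yield DEBUG_DIR
    _rm_folder(DEBUG_DIR)
```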
@@ -4,8 +4,10 @@ resolution-markers = [
    "python_full_version < '3.11'",
    "python_full_version == '3.11.*'",
    "python_full_version >= '3.12' and python_full_version < '3.12.4'",
    "python_full_version < '3.13'",
    "python_full_version >= '3.13'",
    "python_full_version < '3.11'",
    "python_full_version == '3.11.*'",
    "python_full_version >= '3.12' and python_full_version < '3.12.4'",
    "python_full_version >= '3.12.4'",
]

[manifest]

@@ -436,7 +438,7 @@ requires-dist = [
    { name = "opentelemetry-api", specifier = "~=1.27.0" },
    { name = "pillow" },
    { name = "protobuf", specifier = "~=4.25.1" },
    { name = "pydantic", specifier = "<3.0.0,>=2.0.0" },
    { name = "pydantic", specifier = ">=2.0.0,<3.0.0" },
    { name = "tiktoken" },
    { name = "typing-extensions" },
]

@@ -534,7 +536,7 @@ source = { editable = "packages/autogen-magentic-one" }
dependencies = [
    { name = "aiofiles" },
    { name = "autogen-core" },
    { name = "autogen-ext" },
    { name = "autogen-ext", extra = ["docker"] },
    { name = "beautifulsoup4" },
    { name = "mammoth" },
    { name = "markdownify" },

@@ -567,7 +569,7 @@ dev = [
requires-dist = [
    { name = "aiofiles" },
    { name = "autogen-core", editable = "packages/autogen-core" },
    { name = "autogen-ext", editable = "packages/autogen-ext" },
    { name = "autogen-ext", extras = ["docker"], editable = "packages/autogen-ext" },
    { name = "beautifulsoup4" },
    { name = "mammoth" },
    { name = "markdownify" },

@@ -578,7 +580,7 @@ requires-dist = [
    { name = "pdfminer-six" },
    { name = "playwright" },
    { name = "puremagic" },
    { name = "pydantic", specifier = "<3.0.0,>=2.0.0" },
    { name = "pydantic", specifier = ">=2.0.0,<3.0.0" },
    { name = "pydub" },
    { name = "python-pptx" },
    { name = "requests" },

@@ -3672,7 +3674,7 @@ name = "psycopg"
version = "3.2.3"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
    { name = "typing-extensions" },
    { name = "tzdata", marker = "sys_platform == 'win32'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/d1/ad/7ce016ae63e231575df0498d2395d15f005f05e32d3a2d439038e1bd0851/psycopg-3.2.3.tar.gz", hash = "sha256:a5764f67c27bec8bfac85764d23c534af2c27b893550377e37ce59c12aac47a2", size = 155550 }

@@ -4798,7 +4800,7 @@ name = "sqlalchemy"
version = "2.0.36"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "greenlet", marker = "(python_full_version < '3.13' and platform_machine == 'AMD64') or (python_full_version < '3.13' and platform_machine == 'WIN32') or (python_full_version < '3.13' and platform_machine == 'aarch64') or (python_full_version < '3.13' and platform_machine == 'amd64') or (python_full_version < '3.13' and platform_machine == 'ppc64le') or (python_full_version < '3.13' and platform_machine == 'win32') or (python_full_version < '3.13' and platform_machine == 'x86_64')" },
    { name = "greenlet", marker = "platform_machine == 'AMD64' or platform_machine == 'WIN32' or platform_machine == 'aarch64' or platform_machine == 'amd64' or platform_machine == 'ppc64le' or platform_machine == 'win32' or platform_machine == 'x86_64'" },
    { name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/50/65/9cbc9c4c3287bed2499e05033e207473504dc4df999ce49385fb1f8b058a/sqlalchemy-2.0.36.tar.gz", hash = "sha256:7f2767680b6d2398aea7082e45a774b2b0767b5c8d8ffb9c8b683088ea9b29c5", size = 9574485 }