Magentic-One Log Viewer + preview API (#4032)

* update example script with logs dir, add screenshot timestamp

* readme examples update

* add flask app to view magentic_one

* remove copy example

* rename

* changes to magentic one helper

* update test web surfer to delete logs

* magentic_one icons

* fix colors - final log viewer

* fix termination condition

* update coder and log viewer

* timeout time

* make tests pass

* logs dir

* repeated thing

* remove log_viewer, mm web surfer comments

* coder change prompt, edit readmes

* type ignore

* remove logviewer

* add flag for coder agent

* readme

* changes readme

* uv lock

* update readme figures

* not yet

* pointer images
Hussein Mozannar 2024-11-04 17:18:46 -08:00 committed by GitHub
parent eca8a95c61
commit 8603317537
17 changed files with 660 additions and 298 deletions

.gitattributes

@@ -33,10 +33,8 @@
 *.tsx text
 *.xml text
 *.xhtml text diff=html
 # Docker
 Dockerfile text eol=lf
 # Documentation
 *.ipynb text
 *.markdown text diff=markdown eol=lf
@@ -62,7 +60,6 @@ NEWS text eol=lf
 readme text eol=lf
 *README* text eol=lf
 TODO text
 # Configs
 *.cnf text eol=lf
 *.conf text eol=lf
@@ -84,8 +81,9 @@ yarn.lock text -diff
 browserslist text
 Makefile text eol=lf
 makefile text eol=lf
 # Images
 *.png filter=lfs diff=lfs merge=lfs -text
 *.jpg filter=lfs diff=lfs merge=lfs -text
 *.jpeg filter=lfs diff=lfs merge=lfs -text
+python/packages/autogen-magentic-one/imgs/autogen-magentic-one-example.png filter=lfs diff=lfs merge=lfs -text
+python/packages/autogen-magentic-one/imgs/autogen-magentic-one-agents.png filter=lfs diff=lfs merge=lfs -text


@@ -0,0 +1,153 @@
# Magentic-One
> [!CAUTION]
> Using Magentic-One involves interacting with a digital world designed for humans, which carries inherent risks. To minimize these risks, consider the following precautions:
>
> 1. **Use Containers**: Run all tasks in docker containers to isolate the agents and prevent direct system attacks.
> 2. **Virtual Environment**: Use a virtual environment to run the agents and prevent them from accessing sensitive data.
> 3. **Monitor Logs**: Closely monitor logs during and after execution to detect and mitigate risky behavior.
> 4. **Human Oversight**: Run the examples with a human in the loop to supervise the agents and prevent unintended consequences.
> 5. **Limit Access**: Restrict the agents' access to the internet and other resources to prevent unauthorized actions.
> 6. **Safeguard Data**: Ensure that the agents do not have access to sensitive data or resources that could be compromised. Do not share sensitive information with the agents.
> Be aware that agents may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure agents are monitored and operate within a controlled environment to prevent unintended consequences. Moreover, be cautious that Magentic-One may be susceptible to prompt injection attacks from webpages.
> [!NOTE]
> This code is currently being ported to AutoGen AgentChat. If you want to build on top of Magentic-One, we recommend waiting for the port to be completed. In the meantime, you can use this codebase to experiment with Magentic-One.
We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives.
![](./imgs/autogen-magentic-one-example.png)
> _Example_: The figure above illustrates the Magentic-One multi-agent team completing a complex task from the GAIA benchmark. Magentic-One's Orchestrator agent creates a plan, delegates tasks to other agents, and tracks progress towards the goal, dynamically revising the plan as needed. The Orchestrator can delegate tasks to a FileSurfer agent to read and handle files, a WebSurfer agent to operate a web browser, or a Coder or Computer Terminal agent to write or execute code, respectively.
## Architecture
![](./imgs/autogen-magentic-one-agents.png)
Magentic-One is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger in which it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One's other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress has stalled for too many steps, it can update the Task Ledger and create a new plan. This is illustrated in the figure above; the Orchestrator's work is thus divided into an outer loop, where it updates the Task Ledger, and an inner loop, where it updates the Progress Ledger.
Overall, Magentic-One consists of the following agents:
- Orchestrator: the lead agent responsible for task decomposition and planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed
- WebSurfer: This is an LLM-based agent that is proficient in commanding and managing the state of a Chromium-based web browser. With each incoming request, the WebSurfer performs an action in the browser and then reports on the new state of the web page. The action space of the WebSurfer includes navigation (e.g., visiting a URL, performing a web search); web page actions (e.g., clicking and typing); and reading actions (e.g., summarizing or answering questions). The WebSurfer relies on the accessibility tree of the browser and on set-of-marks prompting to perform its actions.
- FileSurfer: This is an LLM-based agent that commands a markdown-based file preview application to read local files of most types. The FileSurfer can also perform common navigation tasks such as listing the contents of directories and navigating a folder structure.
- Coder: This is an LLM-based agent specialized through its system prompt for writing code, analyzing information collected from the other agents, or creating new artifacts.
- ComputerTerminal: Finally, ComputerTerminal provides the team with access to a console shell where the Coder's programs can be executed and where new programming libraries can be installed.
Together, Magentic-One's agents provide the Orchestrator with the tools and capabilities that it needs to solve a broad variety of open-ended problems, as well as the ability to autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.
While the default multimodal LLM we use for all agents is GPT-4o, Magentic-One is model-agnostic and can incorporate heterogeneous models to support different capabilities or meet different cost requirements when getting tasks done. For example, it can use different LLMs and SLMs, and their specialized versions, to power different agents. We recommend a strong reasoning model, such as GPT-4o, for the Orchestrator agent. In a different configuration of Magentic-One, we also experiment with using OpenAI o1-preview for the outer loop of the Orchestrator and for the Coder, while the other agents continue to use GPT-4o.
### Logging in Team One Agents
Team One agents can emit several log events that can be consumed by a log handler (see the example log handler in [utils.py](src/autogen_magentic_one/utils.py)). The currently emitted events are:
- OrchestrationEvent: emitted by an [Orchestrator](src/autogen_magentic_one/agents/base_orchestrator.py) agent.
- WebSurferEvent: emitted by a [WebSurfer](src/autogen_magentic_one/agents/multimodal_web_surfer/multimodal_web_surfer.py) agent.
In addition, developers can also handle and process logs generated by the AutoGen core library (e.g., LLMCallEvent). See the example log handler in [utils.py](src/autogen_magentic_one/utils.py) for how this can be implemented. By default, the logs are written to a file named `log.jsonl`, which can be configured as a parameter of the log handler. These logs can be parsed to retrieve data about agent actions.
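For example, a minimal sketch (the path is illustrative) that attaches the example `LogHandler` and later filters the JSONL log might look like this:
```python
import json
import logging

from autogen_core.application.logging import EVENT_LOGGER_NAME
from autogen_magentic_one.utils import LogHandler

# Route all Magentic-One / AutoGen core events to a JSONL file.
logger = logging.getLogger(EVENT_LOGGER_NAME)
logger.setLevel(logging.INFO)
logger.handlers = [LogHandler(filename="./my_logs/log.jsonl")]

# ... run the team (see examples/example.py) ...

# Afterwards, each line of log.jsonl is one JSON event that can be filtered by type.
with open("./my_logs/log.jsonl") as f:
    events = [json.loads(line) for line in f]
orchestration_events = [e for e in events if e.get("type") == "OrchestrationEvent"]
```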
# Setup and Usage
You can install the Magentic-One package and then run the example code to see how the agents work together to accomplish a task.
1. Clone the code and install the package:
```bash
git clone -b staging https://github.com/microsoft/autogen.git
cd autogen/python/packages/autogen-magentic-one
pip install -e .
```
The following instructions are for running the example code:
2. Configure the environment variables for the chat completion client. See instructions below [Environment Configuration for Chat Completion Client](#environment-configuration-for-chat-completion-client).
3. Magentic-One uses code execution, so you need to have [Docker installed](https://docs.docker.com/engine/install/) to run the examples.
4. Magentic-One uses Playwright to interact with web pages. Install the Playwright dependencies by running:
```bash
playwright install-deps
```
5. Now you can run the example code to see how the agents work together to accomplish a task.
> [!CAUTION]
> The example code may download files from the internet, execute code, and interact with web pages. Ensure you are in a safe environment before running the example code.
> [!NOTE]
> You will need to ensure Docker is running prior to running the example.
```bash
# Specify logs directory
python examples/example.py --logs_dir ./my_logs
# Enable human-in-the-loop mode
python examples/example.py --logs_dir ./my_logs --hil_mode
# Save screenshots of browser
python examples/example.py --logs_dir ./my_logs --save_screenshots
```
Arguments:
- logs_dir: (Required) Directory for logs, downloads, and browser screenshots
- hil_mode: (Optional) Enable human-in-the-loop mode (default: disabled)
- save_screenshots: (Optional) Save screenshots of browser (default: disabled)
6. [Preview] We have a preview API for Magentic-One.
You can use the `MagenticOneHelper` class to interact with the system. See the [interface README](interface/README.md) for more details.
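As a minimal sketch (assuming the package is installed, the environment variables below are configured, and the script is run from the `interface/` folder; the task is taken from the interface README example):
```python
import asyncio

from magentic_one_helper import MagenticOneHelper  # located in the interface/ folder of this package


async def run() -> None:
    helper = MagenticOneHelper(logs_dir="./logs")
    await helper.initialize()
    await helper.run_task("How many members are in the MSR HAX Team")  # example task from the interface README
    print(helper.get_final_answer())


asyncio.run(run())
```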
## Environment Configuration for Chat Completion Client
This guide outlines how to configure your environment to use the `create_completion_client_from_env` function, which reads environment variables to return an appropriate `ChatCompletionClient`.
Currently, Magentic-One only supports OpenAI's GPT-4o as the underlying LLM.
### Azure with Active Directory
To configure for Azure with Active Directory, set the following environment variables:
- `CHAT_COMPLETION_PROVIDER='azure'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:
```json
{
"api_version": "2024-02-15-preview",
"azure_endpoint": "REPLACE_WITH_YOUR_ENDPOINT",
"model_capabilities": {
"function_calling": true,
"json_output": true,
"vision": true
},
"azure_ad_token_provider": "DEFAULT",
"model": "gpt-4o-2024-05-13"
}
```
### With OpenAI
To configure for OpenAI, set the following environment variables:
- `CHAT_COMPLETION_PROVIDER='openai'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:
```json
{
"api_key": "REPLACE_WITH_YOUR_API",
"model": "gpt-4o-2024-05-13"
}
```
Feel free to replace the model with newer versions of gpt-4o if needed.
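If it is more convenient (for example in a notebook), the same configuration can be set from Python before creating the client; the values below are placeholders:
```python
import json
import os

from autogen_magentic_one.utils import create_completion_client_from_env

os.environ["CHAT_COMPLETION_PROVIDER"] = "openai"
os.environ["CHAT_COMPLETION_KWARGS_JSON"] = json.dumps(
    {
        "api_key": "REPLACE_WITH_YOUR_API",
        "model": "gpt-4o-2024-05-13",
    }
)

# Reads the variables above and returns an appropriate ChatCompletionClient.
client = create_completion_client_from_env(model="gpt-4o")
```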
### Other Keys (Optional)
Some functionality, such as web search, requires an API key for Bing.
You can set it using:
```bash
export BING_API_KEY=xxxxxxx
```


@@ -1,11 +1,34 @@
 # Examples of Magentic-One
-**Note**: The examples in this folder are ran at your own risk. They involve agents navigating the web, executing code and browsing local files. Please supervise the execution of the agents to reduce any risks. We also recommend running the examples in a docker environment.
+**Note**: The examples in this folder are run at your own risk. They involve agents navigating the web, executing code and browsing local files. Please supervise the execution of the agents to reduce any risks. We also recommend running the examples in a virtual machine or a sandboxed environment.
 We include various examples for using Magentic-One and its agents:
-- [example.py](example.py): Is a human-in-the-loop of Magentic-One trying to solve a task specified by user input. If you wish for the team to execute the task without involving the user, remove user_proxy from the Orchestrator agents list.
+- [example.py](example.py): Magentic-One trying to solve a task specified by user input, optionally with a human in the loop (see the `--hil_mode` argument below).
+```bash
+# Specify logs directory
+python examples/example.py --logs_dir ./my_logs
+# Enable human-in-the-loop mode
+python examples/example.py --logs_dir ./my_logs --hil_mode
+# Save screenshots of browser
+python examples/example.py --logs_dir ./my_logs --save_screenshots
+```
+Arguments:
+- logs_dir: (Required) Directory for logs, downloads, and browser screenshots
+- hil_mode: (Optional) Enable human-in-the-loop mode (default: disabled)
+- save_screenshots: (Optional) Save screenshots of browser (default: disabled)
+The following examples are for individual agents in Magentic-One:
 - [example_coder.py](example_coder.py): Is an example of the Coder + Execution agents in Magentic-One -- without the Magentic-One orchestrator. In a loop, specified by using the RoundRobinOrchestrator, the coder will write code based on user input, executor will run the code and then the user is asked for input again.
@@ -16,4 +39,3 @@ We include various examples for using Magentic-One and its agents:
 - [example_websurfer.py](example_websurfer.py): Is an example of the MultimodalWebSurfer agent in Magentic-one -- without the orchestrator. To view the browser the agent uses, pass the argument 'headless = False' to 'actual_surfer.init'. In a loop, specified by using the RoundRobinOrchestrator, the web surfer will perform a single action on the browser in response to user input and then the user is asked for input again.
-Running these examples is simple. First make sure you have installed 'autogen-magentic-one' either from source or from pip, then run 'python example.py'


@@ -1,5 +1,6 @@
 """This example demonstrates MagenticOne performing a task given by the user and returning a final answer."""
+import argparse
 import asyncio
 import logging
 import os
@@ -8,7 +9,7 @@ from autogen_core.application import SingleThreadedAgentRuntime
 from autogen_core.application.logging import EVENT_LOGGER_NAME
 from autogen_core.base import AgentId, AgentProxy
 from autogen_core.components.code_executor import CodeBlock
-from autogen_ext.code_executor.docker_executor import DockerCommandLineCodeExecutor
+from autogen_ext.code_executors import DockerCommandLineCodeExecutor
 from autogen_magentic_one.agents.coder import Coder, Executor
 from autogen_magentic_one.agents.file_surfer import FileSurfer
 from autogen_magentic_one.agents.multimodal_web_surfer import MultimodalWebSurfer
@@ -28,14 +29,14 @@ async def confirm_code(code: CodeBlock) -> bool:
     return response.lower() == "yes"
-async def main() -> None:
+async def main(logs_dir: str, hil_mode: bool, save_screenshots: bool) -> None:
     # Create the runtime.
     runtime = SingleThreadedAgentRuntime()
     # Create an appropriate client
     client = create_completion_client_from_env(model="gpt-4o")
-    async with DockerCommandLineCodeExecutor() as code_executor:
+    async with DockerCommandLineCodeExecutor(work_dir=logs_dir) as code_executor:
         # Register agents.
         await Coder.register(runtime, "Coder", lambda: Coder(model_client=client))
         coder = AgentProxy(AgentId("Coder", "default"), runtime)
@@ -61,11 +62,15 @@ async def main() -> None:
         )
         user_proxy = AgentProxy(AgentId("UserProxy", "default"), runtime)
+        agent_list = [web_surfer, coder, executor, file_surfer]
+        if hil_mode:
+            agent_list.append(user_proxy)
         await LedgerOrchestrator.register(
             runtime,
             "Orchestrator",
             lambda: LedgerOrchestrator(
-                agents=[web_surfer, user_proxy, coder, executor, file_surfer],
+                agents=agent_list,
                 model_client=client,
                 max_rounds=30,
                 max_time=25 * 60,
@@ -79,10 +84,12 @@ async def main() -> None:
         actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, type=MultimodalWebSurfer)
         await actual_surfer.init(
             model_client=client,
-            downloads_folder=os.getcwd(),
+            downloads_folder=logs_dir,
             start_page="https://www.bing.com",
             browser_channel="chromium",
             headless=True,
+            debug_dir=logs_dir,
+            to_save_screenshots=save_screenshots,
         )
         await runtime.send_message(RequestReplyMessage(), user_proxy.id)
@@ -90,8 +97,21 @@ async def main() -> None:
 if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run MagenticOne example with log directory.")
+    parser.add_argument("--logs_dir", type=str, required=True, help="Directory to store log files and downloads")
+    parser.add_argument("--hil_mode", action="store_true", default=False, help="Run in human-in-the-loop mode")
+    parser.add_argument(
+        "--save_screenshots", action="store_true", default=False, help="Save additional browser screenshots to file"
+    )
+    args = parser.parse_args()
+    # Ensure the log directory exists
+    if not os.path.exists(args.logs_dir):
+        os.makedirs(args.logs_dir)
     logger = logging.getLogger(EVENT_LOGGER_NAME)
     logger.setLevel(logging.INFO)
-    log_handler = LogHandler()
+    log_handler = LogHandler(filename=os.path.join(args.logs_dir, "log.jsonl"))
     logger.handlers = [log_handler]
-    asyncio.run(main())
+    asyncio.run(main(args.logs_dir, args.hil_mode, args.save_screenshots))


@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e89c451d86c7e693127707e696443b77ddad2d9c596936f5fc2f6225cf4b431d
-size 97407
+oid sha256:25a3a1f79319b89d80b8459af8b522eb9a884dea842b11e3d7dae2bca30add5e
+size 90181


@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a3aa615fa321b54e09efcd9dbb2e4d25a392232fd4e065f85b5a58ed58a7768c
-size 298340


@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e6d0c57dc734747319fd4f847748fd2400cfb73ea01e87ac85dc8c28c738d21f
-size 206468
+oid sha256:fc910bda7e5f3b54d6502f26384f7b10b67f0597d7ac4631dfb45801882768fa
+size 201460


@@ -0,0 +1,50 @@
# MagenticOne Interface
This folder contains a preview interface for interacting with the MagenticOne system. It includes helper classes and example usage.
## Usage
### MagenticOneHelper
The MagenticOneHelper class provides an interface to interact with the MagenticOne system. It saves logs to a user-specified directory and provides methods to run tasks, stream logs, and retrieve the final answer.
The class provides the following methods:
- `async initialize(self) -> None`: Initializes the MagenticOne system, setting up agents and runtime.
- `async run_task(self, task: str) -> None`: Runs a specific task through the MagenticOne system.
- `get_final_answer(self) -> Optional[str]`: Retrieves the final answer from the Orchestrator.
- `async stream_logs(self) -> AsyncGenerator[Dict[str, Any], None]`: Streams logs from the system as they are generated.
- `get_all_logs(self) -> List[Dict[str, Any]]`: Retrieves all logs that have been collected so far.
We show an example of how to use the MagenticOneHelper class in [example_magentic_one_helper.py](example_magentic_one_helper.py).
```python
from magentic_one_helper import MagenticOneHelper
import asyncio
import json


async def magentic_one_example():
    # Create and initialize MagenticOne
    magnetic_one = MagenticOneHelper(logs_dir="./logs")
    await magnetic_one.initialize()
    print("MagenticOne initialized.")

    # Start a task and stream logs
    task = "How many members are in the MSR HAX Team"
    task_future = asyncio.create_task(magnetic_one.run_task(task))

    # Stream and process logs
    async for log_entry in magnetic_one.stream_logs():
        print(json.dumps(log_entry, indent=2))

    # Wait for task to complete
    await task_future

    # Get the final answer
    final_answer = magnetic_one.get_final_answer()
    if final_answer is not None:
        print(f"Final answer: {final_answer}")
    else:
        print("No final answer found in logs.")
```


@@ -0,0 +1,40 @@
from magentic_one_helper import MagenticOneHelper
import asyncio
import json
import argparse
import os


async def main(task, logs_dir):
    magnetic_one = MagenticOneHelper(logs_dir=logs_dir)
    await magnetic_one.initialize()
    print("MagenticOne initialized.")

    # Create task and log streaming tasks
    task_future = asyncio.create_task(magnetic_one.run_task(task))
    final_answer = None

    # Stream and process logs
    async for log_entry in magnetic_one.stream_logs():
        print(json.dumps(log_entry, indent=2))

    # Wait for task to complete
    await task_future

    # Get the final answer
    final_answer = magnetic_one.get_final_answer()
    if final_answer is not None:
        print(f"Final answer: {final_answer}")
    else:
        print("No final answer found in logs.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run a task with MagenticOneHelper.")
    parser.add_argument("task", type=str, help="The task to run")
    parser.add_argument("--logs_dir", type=str, default="./logs", help="Directory to store logs")
    args = parser.parse_args()
    if not os.path.exists(args.logs_dir):
        os.makedirs(args.logs_dir)
    asyncio.run(main(args.task, args.logs_dir))


@@ -0,0 +1,217 @@
import asyncio
import logging
import os
from typing import Optional, AsyncGenerator, Dict, Any, List
from datetime import datetime
import json
from dataclasses import asdict

from autogen_core.application import SingleThreadedAgentRuntime
from autogen_core.application.logging import EVENT_LOGGER_NAME
from autogen_core.base import AgentId, AgentProxy
from autogen_core.components import DefaultTopicId
from autogen_core.components.code_executor import LocalCommandLineCodeExecutor
from autogen_ext.code_executor.docker_executor import DockerCommandLineCodeExecutor
from autogen_core.components.code_executor import CodeBlock

from autogen_magentic_one.agents.coder import Coder, Executor
from autogen_magentic_one.agents.file_surfer import FileSurfer
from autogen_magentic_one.agents.multimodal_web_surfer import MultimodalWebSurfer
from autogen_magentic_one.agents.orchestrator import LedgerOrchestrator
from autogen_magentic_one.agents.user_proxy import UserProxy
from autogen_magentic_one.messages import BroadcastMessage
from autogen_magentic_one.utils import LogHandler, create_completion_client_from_env
from autogen_core.components.models import UserMessage
from threading import Lock


async def confirm_code(code: CodeBlock) -> bool:
    return True


class MagenticOneHelper:
    def __init__(self, logs_dir: str = None, save_screenshots: bool = False) -> None:
        """
        A helper class to interact with the MagenticOne system.
        Initialize MagenticOne instance.

        Args:
            logs_dir: Directory to store logs and downloads
            save_screenshots: Whether to save screenshots of web pages
        """
        self.logs_dir = logs_dir or os.getcwd()
        self.runtime: Optional[SingleThreadedAgentRuntime] = None
        self.log_handler: Optional[LogHandler] = None
        self.save_screenshots = save_screenshots

        if not os.path.exists(self.logs_dir):
            os.makedirs(self.logs_dir)

    async def initialize(self) -> None:
        """
        Initialize the MagenticOne system, setting up agents and runtime.
        """
        # Create the runtime
        self.runtime = SingleThreadedAgentRuntime()

        # Set up logging
        logger = logging.getLogger(EVENT_LOGGER_NAME)
        logger.setLevel(logging.INFO)
        self.log_handler = LogHandler(filename=os.path.join(self.logs_dir, "log.jsonl"))
        logger.handlers = [self.log_handler]

        # Create client
        client = create_completion_client_from_env(model="gpt-4o")

        # Set up code executor
        self.code_executor = DockerCommandLineCodeExecutor(work_dir=self.logs_dir)
        await self.code_executor.__aenter__()

        await Coder.register(self.runtime, "Coder", lambda: Coder(model_client=client))
        coder = AgentProxy(AgentId("Coder", "default"), self.runtime)

        await Executor.register(
            self.runtime,
            "Executor",
            lambda: Executor("A agent for executing code", executor=self.code_executor, confirm_execution=confirm_code),
        )
        executor = AgentProxy(AgentId("Executor", "default"), self.runtime)

        # Register agents.
        await MultimodalWebSurfer.register(self.runtime, "WebSurfer", MultimodalWebSurfer)
        web_surfer = AgentProxy(AgentId("WebSurfer", "default"), self.runtime)

        await FileSurfer.register(self.runtime, "file_surfer", lambda: FileSurfer(model_client=client))
        file_surfer = AgentProxy(AgentId("file_surfer", "default"), self.runtime)

        agent_list = [web_surfer, coder, executor, file_surfer]
        await LedgerOrchestrator.register(
            self.runtime,
            "Orchestrator",
            lambda: LedgerOrchestrator(
                agents=agent_list,
                model_client=client,
                max_rounds=30,
                max_time=25 * 60,
                max_stalls_before_replan=10,
                return_final_answer=True,
            ),
        )

        self.runtime.start()

        actual_surfer = await self.runtime.try_get_underlying_agent_instance(web_surfer.id, type=MultimodalWebSurfer)
        await actual_surfer.init(
            model_client=client,
            downloads_folder=os.getcwd(),
            start_page="https://www.bing.com",
            browser_channel="chromium",
            headless=True,
            debug_dir=self.logs_dir,
            to_save_screenshots=self.save_screenshots,
        )

    async def __aexit__(self, exc_type, exc_value, traceback) -> None:
        """
        Clean up resources.
        """
        if self.code_executor:
            await self.code_executor.__aexit__(exc_type, exc_value, traceback)

    async def run_task(self, task: str) -> None:
        """
        Run a specific task through the MagenticOne system.

        Args:
            task: The task description to be executed
        """
        if not self.runtime:
            raise RuntimeError("MagenticOne not initialized. Call initialize() first.")

        task_message = BroadcastMessage(content=UserMessage(content=task, source="UserProxy"))
        await self.runtime.publish_message(task_message, topic_id=DefaultTopicId())
        await self.runtime.stop_when_idle()

    def get_final_answer(self) -> Optional[str]:
        """
        Get the final answer from the Orchestrator.

        Returns:
            The final answer as a string
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")
        for log_entry in self.log_handler.logs_list:
            if (
                log_entry.get("type") == "OrchestrationEvent"
                and log_entry.get("source") == "Orchestrator (final answer)"
            ):
                return log_entry.get("message")
        return None

    async def stream_logs(self) -> AsyncGenerator[Dict[str, Any], None]:
        """
        Stream logs from the system as they are generated. Stops when it detects both
        the final answer and termination condition from the Orchestrator.

        Yields:
            Dictionary containing log entry information
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")

        last_index = 0
        found_final_answer = False
        found_termination = False
        found_termination_no_agent = False

        while True:
            current_logs = self.log_handler.logs_list
            while last_index < len(current_logs):
                log_entry = current_logs[last_index]
                yield log_entry

                # Check for termination condition
                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (final answer)"
                ):
                    found_final_answer = True
                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (termination condition)"
                ):
                    found_termination = True
                if (
                    log_entry.get("type") == "OrchestrationEvent"
                    and log_entry.get("source") == "Orchestrator (termination condition)"
                    and log_entry.get("message") == "No agent selected."
                ):
                    found_termination_no_agent = True

                if self.runtime._run_context is None:
                    return
                if found_termination_no_agent and found_final_answer:
                    return
                elif found_termination and not found_termination_no_agent:
                    return

                last_index += 1
            await asyncio.sleep(0.1)  # Small delay to prevent busy waiting

    def get_all_logs(self) -> List[Dict[str, Any]]:
        """
        Get all logs that have been collected so far.

        Returns:
            List of all log entries
        """
        if not self.log_handler:
            raise RuntimeError("Log handler not initialized")
        return self.log_handler.logs_list


@@ -7,7 +7,7 @@ name = "autogen-magentic-one"
 version = "0.0.1"
 license = {file = "LICENSE-CODE"}
 description = ''
-readme = "readme.md"
+readme = "README.md"
 requires-python = ">=3.10"
 keywords = []
 classifiers = [
@@ -18,7 +18,7 @@ classifiers = [
 dependencies = [
     "autogen-core",
-    "autogen-ext",
+    "autogen-ext[docker]",
     "beautifulsoup4",
     "aiofiles",
     "requests",


@@ -1,230 +0,0 @@
# Magentic-One
Magentic-One is a generalist multi-agent softbot that utilizes a combination of five agents, including LLM and tool-based agents, to tackle intricate tasks. For example, it can be used to solve general tasks that involve multi-step planning and action in the real-world.
![](./imgs/autogen-magentic-one-example.png)
> _Example_: Suppose a user requests the following: _Can you rewrite the readme of the autogen GitHub repository to be more clear_. Magentic-One will use the following process to handle this task. The Orchestrator agent will break down the task into subtasks and assign them to the appropriate agents. In this case, the WebSurfer will navigate to GiHub, search for the autogen repository, and extract the readme file. Next the Coder agent will rewrite the readme file for clarity and return the updated content to the Orchestrator. At each point, the Orchestrator will monitor progress via a ledger, and terminate when the task is completed successfully.
## Architecture
<!-- <center>
<img src="./imgs/autgen" alt="drawing" style="width:350px;"/>
</center> -->
![](./imgs/autogen-magentic-one-agents.png)
Magentic-One uses agents with the following personas and capabilities:
- Orchestrator: The orchestrator agent is responsible for planning, managing subgoals, and coordinating the other agents. It can break down complex tasks into smaller subtasks and assign them to the appropriate agents. It also keeps track of the overall progress and takes corrective actions if needed (such as reassigning tasks or replanning when stuck).
- Coder: The coder agent is skilled in programming languages and is responsible for writing code.
- Computer Terminal: The computer terminal agent acts as the interface that can execute code written by the coder agent.
- Web Surfer: The web surfer agent is proficient is responsible for web-related tasks. It can browse the internet, retrieve information from websites, and interact with web-based applications. It can handle interactive web pages, forms, and other web elements.
- File Surfer: The file surfer agent specializes in navigating files such as pdfs, powerpoints, WAV files, and other file types. It can search, read, and extract information from files.
We created Magentic-One with one agent of each type because their combined abilities help tackle tough benchmarks. By splitting tasks among different agents, we keep the code simple and modular, like in object-oriented programming. This also makes each agent's job easier since they only need to focus on specific tasks. For example, the websurfer agent only needs to navigate webpages and doesn't worry about writing code, making the team more efficient and effective.
### Planning and Tracking Task Progress
<center>
<img src="./imgs/autogen-magentic-one-arch.png" alt="drawing" style=""/>
</center>
The figure illustrates the workflow of an orchestrator managing a multi-agent setup, starting with an initial prompt or task. The orchestrator creates or updates a ledger with gathered information, including verified facts, facts to look up, derived facts, and educated guesses. Using this ledger, a plan is derived, which consists of a sequence of steps and task assignments for the agents. Before execution, the orchestrator clears the agents' contexts to ensure they start fresh. The orchestrator then evaluates if the request is fully satisfied. If so, it reports the final answer or an educated guess.
If the request is not fully satisfied, the orchestrator assesses whether the work is progressing or if there are significant barriers. If progress is being made, the orchestrator orchestrates the next step by selecting an agent and providing instructions. If the process stalls for more than two iterations, the ledger is updated with new information, and the plan is adjusted. This cycle continues, iterating through steps and evaluations, until the task is completed. The orchestrator ensures organized, effective tracking and iterative problem-solving to achieve the prompt's goal.
Note that many parameters such as terminal logic and maximum number of stalled iterations are configurable. Also note that the orchestrator cannot instantiate new agents. This is possible but not implemented in Magentic-One.
## Table of Definitions:
| Term | Definition |
| --------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Agent | A component that can (autonomously) act based on observations. Different agents may have different functions and actions. |
| Planning | The process of determining actions to achieve goals, performed by the Orchestrator agent in Magentic-One. |
| Ledger | A record-keeping component used by the Orchestrator agent to track the progress and manage subgoals in Magentic-One. |
| Stateful Tools | Tools that maintain state or data, such as the web browser and markdown-based file browser used by Magentic-One. |
| Tools | Resources used by Magentic-One for various purposes, including stateful and stateless tools. |
| Stateless Tools | Tools that do not maintain state or data, like the commandline executor used by Magentic-One. |
## Capabilities and Performance
### Capabilities
- Planning: The Orchestrator agent in Magentic-One excels at performing planning tasks. Planning involves determining actions to achieve goals. The Orchestrator agent breaks down complex tasks into smaller subtasks and assigns them to the appropriate agents.
- Ledger: The Orchestrator agent in Magentic-One utilizes a ledger, which is a record-keeping component. The ledger tracks the progress of tasks and manages subgoals. It allows the Orchestrator agent to monitor the overall progress of the system and take corrective actions if needed.
- Acting in the Real World: Magentic-One is designed to take action in the real world based on observations. The agents in Magentic-One can autonomously perform actions based on the information they observe from their environment.
- Adaptation to Observation: The agents in Magentic-One can adapt to new observations. They can update their knowledge and behavior based on the information they receive from their environment. This allows Magentic-One to effectively handle dynamic and changing situations.
- Stateful Tools: Magentic-One utilizes stateful tools such as a web browser and a markdown-based file browser. These tools maintain state or data, which is essential for performing complex tasks that involve actions that might change the state of the environment.
- Stateless Tools: Magentic-One also utilizes stateless tools such as a command-line executor. These tools do not maintain state or data.
- Coding: The Coder agent in Magentic-One is highly skilled in programming languages and is responsible for writing code. This capability enables Magentic-One to create and execute code to accomplish various tasks.
- Execution of Code: The Computer Terminal agent in Magentic-One acts as an interface that can execute code written by the Coder agent. This capability allows Magentic-One to execute the code and perform actions in the system.
- File Navigation and Extraction: The File Surfer agent in Magentic-One specializes in navigating and extracting information from various file types such as PDFs, PowerPoints, and WAV files. This capability enables Magentic-One to search, read, and extract relevant information from files.
- Web Interaction: The Web Surfer agent in Magentic-One is proficient in web-related tasks. It can browse the internet, retrieve information from websites, and interact with web-based applications. This capability allows Magentic-One to handle interactive web pages, forms, and other web elements.
### What Magentic-One Cannot Do
- **Video Scrubbing:** The agents are unable to navigate and process video content.
- **User in the Loop Optimization:** The system does not currently incorporate ongoing user interaction beyond the initial task submission.
- **Code Execution Beyond Python or Shell:** The agents are limited to executing code written in Python or shell scripts.
- **Agent Instantiation:** The orchestrator agent cannot create new agents dynamically.
- **Session-Based Learning:** The agents do not learn from previous sessions or retain information beyond the current session.
- **Limited LLM Capacity:** The agents' abilities are constrained by the limitations of the underlying language model.
- **Web Surfer Limitations:** The web surfer agent may struggle with certain types of web pages, such as those requiring complex interactions or extensive JavaScript handling.
### Safety and Risks
**Code Execution:**
- **Risks:** Code execution carries inherent risks as it happens in the environment where the agents run using the command line executor. This means that the agents can execute arbitrary Python code.
- **Mitigation:** Users are advised to run the system in isolated environments, such as Docker containers, to mitigate the risks associated with executing arbitrary code.
**Web Browsing:**
- **Capabilities:** The web surfer agent can operate on most websites, including performing tasks like booking flights.
- **Risks:** Since the requests are sent online using GPT-4-based models, there are potential privacy and security concerns. It is crucial not to provide sensitive information such as keys or credit card data to the agents.
**Safeguards:**
- **Guardrails from LLM:** The agents inherit the guardrails from the underlying language model (e.g., GPT-4). This means they will refuse to generate toxic or stereotyping content, providing a layer of protection against generating harmful outputs.
- **Limitations:** The agents' behavior is directly influenced by the capabilities and limitations of the underlying LLM. Consequently, any lack of guardrails in the language model will also affect the behavior of the agents.
**General Recommendations:**
- Always use isolated or controlled environments for running the agents to prevent unauthorized or harmful code execution.
- Avoid sharing sensitive information with the agents to protect your privacy and security.
- Regularly update and review the underlying LLM and system configurations to ensure they adhere to the latest safety and security standards.
### Performance
Magentic-One currently achieves the following performance on complex agent benchmarks.
#### GAIA
GAIA is a benchmark from Meta that contains complex tasks that require multi-step reasoning and tool use. For example,
> _Example_: If Eliud Kipchoge could maintain his record-making marathon pace indefinitely, how many thousand hours would it take him to run the distance between the Earth and the Moon its closest approach? Please use the minimum perigee value on the Wikipedia page for the Moon when carrying out your calculation. Round your result to the nearest 1000 hours and do not use any comma separators if necessary.
In order to solve this task, the orchestrator begins by outlining the steps needed to solve the task of calculating how many thousand hours it would take Eliud Kipchoge to run the distance between the Earth and the Moon at its closest approach. The orchestrator instructs the web surfer agent to gather Eliud Kipchoge's marathon world record time (2:01:39) and the minimum perigee distance of the Moon from Wikipedia (356,400 kilometers).
Next, the orchestrator assigns the assistant agent to use this data to perform the necessary calculations. The assistant converts Kipchoge's marathon time to hours (2.0275 hours) and calculates his speed (approximately 20.81 km/h). It then calculates the total time to run the distance to the Moon (17,130.13 hours), rounding it to the nearest thousand hours, resulting in approximately 17,000 thousand hours. The orchestrator then confirms and reports this final result.
Here is the performance of Magentic-One on a GAIA development set.
| Level | Task Completion Rate\* |
| ------- | ---------------------- |
| Level 1 | 55% (29/53) |
| Level 2 | 34% (29/86) |
| Level 3 | 12% (3/26) |
| Total | 37% (61/165) |
*Indicates the percentage of tasks completed successfully on the *validation\* set.
#### WebArena
> Example: Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the Showerthoughts forum.
To solve this task, the agents began by logging into the Postmill platform using provided credentials and navigating to the Showerthoughts forum. They identified the latest post in this forum, which was made by a user named Waoonet. To proceed with the task, they then accessed Waoonet's profile to examine the comments section, where they could find all comments made by this user.
Once on Waoonet's profile, the agents focused on counting the comments that had received more downvotes than upvotes. The web_surfer agent analyzed the available comments and found that Waoonet had made two comments, both of which had more upvotes than downvotes. Consequently, they concluded that none of Waoonet's comments had received more downvotes than upvotes. This information was summarized and reported back, completing the task successfully.
| Site | Task Completion Rate |
| -------------- | -------------------- |
| Reddit | 54%  (57/106) |
| Shopping | 33%  (62/187) |
| CMS | 29%  (53/182) |
| Gitlab | 28%  (50/180) |
| Maps | 35%  (38/109) |
| Multiple Sites | 15%  (7/48) |
| Total | 33%  (267/812) |
### Logging in Team One Agents
Team One agents can emit several log events that can be consumed by a log handler (see the example log handler in [utils.py](src/autogen_magentic_one/utils.py)). A list of currently emitted events are:
- OrchestrationEvent : emitted by a an [Orchestrator](src/autogen_magentic_one/agents/base_orchestrator.py) agent.
- WebSurferEvent : emitted by a [WebSurfer](src/autogen_magentic_one/agents/multimodal_web_surfer/multimodal_web_surfer.py) agent.
In addition, developers can also handle and process logs generated from the AutoGen core library (e.g., LLMCallEvent etc). See the example log handler in [utils.py](src/autogen_magentic_one/utils.py) on how this can be implemented. By default, the logs are written to a file named `log.jsonl` which can be configured as a parameter to the defined log handler. These logs can be parsed to retrieved data agent actions.
# Setup
You can install the Magentic-One package using pip and then run the example code to see how the agents work together to accomplish a task.
1. Clone the code.
```bash
git clone -b staging https://github.com/microsoft/autogen.git
cd autogen/python/packages/autogen-magentic-one
pip install -e .
```
2. Configure the environment variables for the chat completion client. See instructions below.
3. Now you can run the example code to see how the agents work together to accomplish a task.
**NOTE:** The example code may download files from the internet, execute code, and interact with web pages. Ensure you are in a safe environment before running the example code.
```bash
python examples/example.py
```
## Environment Configuration for Chat Completion Client
This guide outlines how to configure your environment to use the `create_completion_client_from_env` function, which reads environment variables to return an appropriate `ChatCompletionClient`.
### Azure with Active Directory
To configure for Azure with Active Directory, set the following environment variables:
- `CHAT_COMPLETION_PROVIDER='azure'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:
```json
{
"api_version": "2024-02-15-preview",
"azure_endpoint": "REPLACE_WITH_YOUR_ENDPOINT",
"model_capabilities": {
"function_calling": true,
"json_output": true,
"vision": true
},
"azure_ad_token_provider": "DEFAULT",
"model": "gpt-4o-2024-05-13"
}
```
### With OpenAI
To configure for OpenAI, set the following environment variables:
- `CHAT_COMPLETION_PROVIDER='openai'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:
```json
{
"api_key": "REPLACE_WITH_YOUR_API",
"model": "gpt-4o-2024-05-13"
}
```
### Other Keys
Some functionalities, such as using web-search requires an API key for Bing.
You can set it using:
```bash
export BING_API_KEY=xxxxxxx
```


@@ -40,10 +40,12 @@ Reply "TERMINATE" in the end when everything is done.""")
         model_client: ChatCompletionClient,
         description: str = DEFAULT_DESCRIPTION,
         system_messages: List[SystemMessage] = DEFAULT_SYSTEM_MESSAGES,
+        request_terminate: bool = False,
     ) -> None:
         super().__init__(description)
         self._model_client = model_client
         self._system_messages = system_messages
+        self._request_terminate = request_terminate
     async def _generate_reply(self, cancellation_token: CancellationToken) -> Tuple[bool, UserContent]:
         """Respond to a reply request."""
@@ -53,7 +55,10 @@
             self._system_messages + self._chat_history, cancellation_token=cancellation_token
         )
         assert isinstance(response.content, str)
-        return "TERMINATE" in response.content, response.content
+        if self._request_terminate:
+            return "TERMINATE" in response.content, response.content
+        else:
+            return False, response.content
 # True if the user confirms the code, False otherwise
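For illustration, a minimal sketch of opting into the new flag when registering the Coder (this mirrors the registration pattern in `examples/example.py` and assumes the chat-completion environment variables described in the README are set):
```python
import asyncio

from autogen_core.application import SingleThreadedAgentRuntime
from autogen_magentic_one.agents.coder import Coder
from autogen_magentic_one.utils import create_completion_client_from_env


async def register_coder() -> None:
    runtime = SingleThreadedAgentRuntime()
    client = create_completion_client_from_env(model="gpt-4o")
    # With request_terminate=True the Coder keeps the old behavior and can end the
    # conversation by replying "TERMINATE"; with the new default (False) it never does.
    await Coder.register(runtime, "Coder", lambda: Coder(model_client=client, request_terminate=True))


asyncio.run(register_coder())
```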


@@ -6,6 +6,7 @@ import logging
 import os
 import pathlib
 import re
+import time
 import traceback
 from typing import Any, BinaryIO, Dict, List, Optional, Tuple, Union, cast  # Any, Callable, Dict, List, Literal, Tuple
 from urllib.parse import quote_plus  # parse_qs, quote, unquote, urlparse, urlunparse
@@ -85,7 +86,7 @@ class MultimodalWebSurfer(BaseWorker):
         self,
         description: str = DEFAULT_DESCRIPTION,
     ):
-        """Do not instantiate directly. Call MultimodalWebSurfer.create instead."""
+        """To instantiate properly please make sure to call MultimodalWebSurfer.init"""
         super().__init__(description)
         # Call init to set these
@@ -116,12 +117,28 @@
         start_page: str | None = None,
         downloads_folder: str | None = None,
         debug_dir: str | None = os.getcwd(),
+        to_save_screenshots: bool = False,
         # navigation_allow_list=lambda url: True,
         markdown_converter: Any | None = None,  # TODO: Fixme
     ) -> None:
+        """
+        Initialize the MultimodalWebSurfer.
+        Args:
+            model_client (ChatCompletionClient): The client to use for chat completions.
+            headless (bool): Whether to run the browser in headless mode. Defaults to True.
+            browser_channel (str | type[DEFAULT_CHANNEL]): The browser channel to use. Defaults to DEFAULT_CHANNEL.
+            browser_data_dir (str | None): The directory to store browser data. Defaults to None.
+            start_page (str | None): The initial page to visit. Defaults to DEFAULT_START_PAGE.
+            downloads_folder (str | None): The folder to save downloads. Defaults to None.
+            debug_dir (str | None): The directory to save debug information. Defaults to the current working directory.
+            to_save_screenshots (bool): Whether to save screenshots. Defaults to False.
+            markdown_converter (Any | None): The markdown converter to use. Defaults to None.
+        """
         self._model_client = model_client
         self.start_page = start_page or self.DEFAULT_START_PAGE
         self.downloads_folder = downloads_folder
+        self.to_save_screenshots = to_save_screenshots
         self._chat_history: List[LLMMessage] = []
         self._last_download = None
         self._prior_metadata_hash = None
@@ -175,35 +192,57 @@
         if not os.path.isdir(self.debug_dir):
             os.mkdir(self.debug_dir)
-        debug_html = os.path.join(self.debug_dir, "screenshot.html")
-        async with aiofiles.open(debug_html, "wt") as file:
-            await file.write(
-                f"""
-<html style="width:100%; margin: 0px; padding: 0px;">
-<body style="width: 100%; margin: 0px; padding: 0px;">
-<img src="screenshot.png" id="main_image" style="width: 100%; max-width: {VIEWPORT_WIDTH}px; margin: 0px; padding: 0px;">
-<script language="JavaScript">
-var counter = 0;
-setInterval(function() {{
-    counter += 1;
-    document.getElementById("main_image").src = "screenshot.png?bc=" + counter;
-}}, 300);
-</script>
-</body>
-</html>
-""".strip(),
-            )
-        await self._page.screenshot(path=os.path.join(self.debug_dir, "screenshot.png"))
-        self.logger.info(f"Multimodal Web Surfer debug screens: {pathlib.Path(os.path.abspath(debug_html)).as_uri()}\n")
+        current_timestamp = "_" + int(time.time()).__str__()
+        screenshot_png_name = "screenshot" + current_timestamp + ".png"
+        debug_html = os.path.join(self.debug_dir, "screenshot" + current_timestamp + ".html")
+        if self.to_save_screenshots:
+            async with aiofiles.open(debug_html, "wt") as file:
+                await file.write(
+                    f"""
+<html style="width:100%; margin: 0px; padding: 0px;">
+<body style="width: 100%; margin: 0px; padding: 0px;">
+<img src= {screenshot_png_name} id="main_image" style="width: 100%; max-width: {VIEWPORT_WIDTH}px; margin: 0px; padding: 0px;">
+<script language="JavaScript">
+var counter = 0;
+setInterval(function() {{
+    counter += 1;
+    document.getElementById("main_image").src = "screenshot.png?bc=" + counter;
+}}, 300);
+</script>
+</body>
+</html>
+""".strip(),
+                )
+        if self.to_save_screenshots:
+            await self._page.screenshot(path=os.path.join(self.debug_dir, screenshot_png_name))
+            self.logger.info(
+                WebSurferEvent(
+                    source=self.metadata["type"],
+                    url=self._page.url,
+                    message="Screenshot: " + screenshot_png_name,
+                )
+            )
+            self.logger.info(
+                f"Multimodal Web Surfer debug screens: {pathlib.Path(os.path.abspath(debug_html)).as_uri()}\n"
+            )

     async def _reset(self, cancellation_token: CancellationToken) -> None:
         assert self._page is not None
         future = super()._reset(cancellation_token)
         await future
         await self._visit_page(self.start_page)
-        if self.debug_dir:
-            await self._page.screenshot(path=os.path.join(self.debug_dir, "screenshot.png"))
+        if self.to_save_screenshots:
+            current_timestamp = "_" + int(time.time()).__str__()
+            screenshot_png_name = "screenshot" + current_timestamp + ".png"
+            await self._page.screenshot(path=os.path.join(self.debug_dir, screenshot_png_name))  # type: ignore
+            self.logger.info(
+                WebSurferEvent(
+                    source=self.metadata["type"],
+                    url=self._page.url,
+                    message="Screenshot: " + screenshot_png_name,
+                )
+            )
         self.logger.info(
             WebSurferEvent(
                 source=self.metadata["type"],
@ -373,7 +412,7 @@ setInterval(function() {{
# Handle metadata # Handle metadata
page_metadata = json.dumps(await self._get_page_metadata(), indent=4) page_metadata = json.dumps(await self._get_page_metadata(), indent=4)
metadata_hash = hashlib.sha256(page_metadata.encode("utf-8")).hexdigest() metadata_hash = hashlib.md5(page_metadata.encode("utf-8")).hexdigest()
if metadata_hash != self._prior_metadata_hash: if metadata_hash != self._prior_metadata_hash:
page_metadata = ( page_metadata = (
"\nThe following metadata was extracted from the webpage:\n\n" + page_metadata.strip() + "\n" "\nThe following metadata was extracted from the webpage:\n\n" + page_metadata.strip() + "\n"
@ -394,9 +433,18 @@ setInterval(function() {{
position_text = str(percent_scrolled) + "% down from the top of the page" position_text = str(percent_scrolled) + "% down from the top of the page"
new_screenshot = await self._page.screenshot() new_screenshot = await self._page.screenshot()
if self.debug_dir: if self.to_save_screenshots:
async with aiofiles.open(os.path.join(self.debug_dir, "screenshot.png"), "wb") as file: current_timestamp = "_" + int(time.time()).__str__()
await file.write(new_screenshot) screenshot_png_name = "screenshot" + current_timestamp + ".png"
async with aiofiles.open(os.path.join(self.debug_dir, screenshot_png_name), "wb") as file: # type: ignore
await file.write(new_screenshot) # type: ignore
self.logger.info(
WebSurferEvent(
source=self.metadata["type"],
url=self._page.url,
message="Screenshot: " + screenshot_png_name,
)
)
ocr_text = ( ocr_text = (
await self._get_ocr_text(new_screenshot, cancellation_token=cancellation_token) if use_ocr is True else "" await self._get_ocr_text(new_screenshot, cancellation_token=cancellation_token) if use_ocr is True else ""
@@ -435,9 +483,17 @@ setInterval(function() {{
         screenshot = await self._page.screenshot()
         som_screenshot, visible_rects, rects_above, rects_below = add_set_of_mark(screenshot, rects)
-        if self.debug_dir:
-            som_screenshot.save(os.path.join(self.debug_dir, "screenshot.png"))
+        if self.to_save_screenshots:
+            current_timestamp = "_" + int(time.time()).__str__()
+            screenshot_png_name = "screenshot_som" + current_timestamp + ".png"
+            som_screenshot.save(os.path.join(self.debug_dir, screenshot_png_name))  # type: ignore
+            self.logger.info(
+                WebSurferEvent(
+                    source=self.metadata["type"],
+                    url=self._page.url,
+                    message="Screenshot: " + screenshot_png_name,
+                )
+            )
         # What tools are available?
         tools = [
             TOOL_VISIT_URL,
@@ -516,8 +572,8 @@ When deciding between tools, consider if the request can be best addressed by:
             # Scale the screenshot for the MLM, and close the original
             scaled_screenshot = som_screenshot.resize((MLM_WIDTH, MLM_HEIGHT))
             som_screenshot.close()
-            if self.debug_dir:
-                scaled_screenshot.save(os.path.join(self.debug_dir, "screenshot_scaled.png"))
+            if self.to_save_screenshots:
+                scaled_screenshot.save(os.path.join(self.debug_dir, "screenshot_scaled.png"))  # type: ignore
             # Add the multimodal message and make the request
             history.append(
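Taken together, these hunks replace the single overwritten screenshot.png with timestamp-suffixed files plus a WebSurferEvent log line per capture. A minimal standalone sketch of that naming-and-logging pattern (the save_screenshot helper and its logger are assumptions for illustration, not the surfer's API):

    # Minimal sketch, assuming a plain logging.Logger; not the MultimodalWebSurfer API.
    import logging
    import os
    import time

    logger = logging.getLogger("web_surfer_illustration")

    def save_screenshot(debug_dir: str, png_bytes: bytes, prefix: str = "screenshot") -> str:
        """Write the bytes under a unique timestamped name and log which file was written."""
        current_timestamp = "_" + str(int(time.time()))
        screenshot_png_name = prefix + current_timestamp + ".png"
        with open(os.path.join(debug_dir, screenshot_png_name), "wb") as file:
            file.write(png_bytes)
        logger.info("Screenshot: " + screenshot_png_name)
        return screenshot_png_name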
View File
@@ -104,6 +104,7 @@ def message_content_to_str(
 class LogHandler(logging.FileHandler):
     def __init__(self, filename: str = "log.jsonl") -> None:
         super().__init__(filename)
+        self.logs_list: List[Dict[str, Any]] = []

     def emit(self, record: logging.LogRecord) -> None:
         try:
@@ -121,6 +122,7 @@ class LogHandler(logging.FileHandler):
                         "type": "OrchestrationEvent",
                     }
                 )
+                self.logs_list.append(json.loads(record.msg))
                 super().emit(record)
             elif isinstance(record.msg, AgentEvent):
                 console_message = (
@@ -135,6 +137,7 @@ class LogHandler(logging.FileHandler):
                         "type": "AgentEvent",
                     }
                 )
+                self.logs_list.append(json.loads(record.msg))
                 super().emit(record)
             elif isinstance(record.msg, WebSurferEvent):
                 console_message = f"\033[96m[{ts}], {record.msg.source}: {record.msg.message}\033[0m"
@@ -145,6 +148,7 @@ class LogHandler(logging.FileHandler):
                 }
                 payload.update(asdict(record.msg))
                 record.msg = json.dumps(payload)
+                self.logs_list.append(json.loads(record.msg))
                 super().emit(record)
             elif isinstance(record.msg, LLMCallEvent):
                 record.msg = json.dumps(
@@ -155,6 +159,7 @@ class LogHandler(logging.FileHandler):
                         "type": "LLMCallEvent",
                     }
                 )
+                self.logs_list.append(json.loads(record.msg))
                 super().emit(record)
         except Exception:
             self.handleError(record)
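Because emit() now also appends each parsed record to logs_list, callers can read recent events from memory instead of re-parsing log.jsonl. A self-contained sketch of that idea (the handler name and demo logger below are illustrative, not the package's LogHandler import path):

    # Self-contained sketch; mirrors the logs_list idea above without importing the package.
    import json
    import logging
    from typing import Any, Dict, List

    class InMemoryJsonlHandler(logging.FileHandler):
        def __init__(self, filename: str = "log.jsonl") -> None:
            super().__init__(filename)
            self.logs_list: List[Dict[str, Any]] = []

        def emit(self, record: logging.LogRecord) -> None:
            # Assumes the message is already a JSON string, as in the handler above.
            if isinstance(record.msg, str):
                try:
                    self.logs_list.append(json.loads(record.msg))
                except json.JSONDecodeError:
                    pass
            super().emit(record)

    logger = logging.getLogger("log_viewer_demo")
    logger.setLevel(logging.INFO)
    handler = InMemoryJsonlHandler("log.jsonl")
    logger.addHandler(handler)
    logger.info(json.dumps({"type": "OrchestrationEvent", "message": "hello"}))
    print(handler.logs_list[-1]["message"])  # most recent event, already parsed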
View File
@@ -41,7 +41,7 @@ pytest_plugins = ("pytest_asyncio",)
 BLOG_POST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
 BLOG_POST_TITLE = "Does Model and Inference Parameter Matter in LLM Applications? - A Case Study for MATH | AutoGen"
 BING_QUERY = "Microsoft"
+DEBUG_DIR = "test_logs_web_surfer_autogen"

 skip_all = False
@@ -65,6 +65,22 @@ else:
     skip_openai = True

+def _rm_folder(path: str) -> None:
+    """Remove all the regular files in a folder, then deletes the folder. Assumes a flat file structure, with no subdirectories."""
+    for fname in os.listdir(path):
+        fpath = os.path.join(path, fname)
+        if os.path.isfile(fpath):
+            os.unlink(fpath)
+    os.rmdir(path)
+
+
+def _create_logs_dir() -> None:
+    logs_dir = os.path.join(os.getcwd(), DEBUG_DIR)
+    if os.path.isdir(logs_dir):
+        _rm_folder(logs_dir)
+    os.mkdir(logs_dir)
+
+
 def generate_tool_request(tool: ToolSchema, args: Mapping[str, str]) -> list[FunctionCall]:
     ret = [FunctionCall(id="", arguments="", name=tool["name"])]
     ret[0].arguments = dumps(args)
@@ -106,7 +122,9 @@ async def test_web_surfer() -> None:
     runtime.start()
     actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer, MultimodalWebSurfer)
-    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
+    await actual_surfer.init(
+        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
+    )

     # Test some basic navigations
     tool_resp = await make_browser_request(actual_surfer, TOOL_VISIT_URL, {"url": BLOG_POST_URL})
@@ -189,7 +207,9 @@ async def test_web_surfer_oai() -> None:
     runtime.start()
     actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, MultimodalWebSurfer)
-    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
+    await actual_surfer.init(
+        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
+    )

     await runtime.send_message(
         BroadcastMessage(
@@ -248,7 +268,9 @@ async def test_web_surfer_bing() -> None:
     runtime.start()
     actual_surfer = await runtime.try_get_underlying_agent_instance(web_surfer.id, MultimodalWebSurfer)
-    await actual_surfer.init(model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium")
+    await actual_surfer.init(
+        model_client=client, downloads_folder=os.getcwd(), browser_channel="chromium", debug_dir=DEBUG_DIR
+    )

     # Test some basic navigations
     tool_resp = await make_browser_request(actual_surfer, TOOL_WEB_SEARCH, {"query": BING_QUERY})
@@ -262,10 +284,15 @@ async def test_web_surfer_bing() -> None:
     markdown = await actual_surfer._get_page_markdown()  # type: ignore
     assert "https://en.wikipedia.org/wiki/" in markdown
     await runtime.stop_when_idle()
+    # remove the logs directory
+    _rm_folder(DEBUG_DIR)


 if __name__ == "__main__":
     """Runs this file's tests from the command line."""
+    _create_logs_dir()
     asyncio.run(test_web_surfer())
     asyncio.run(test_web_surfer_oai())
+    # IMPORTANT: last test should remove the logs directory
     asyncio.run(test_web_surfer_bing())
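The tests now create the logs directory up front and delete it after the last test. An alternative way to express the same guarantee, shown here only as a sketch and not part of this PR, is a pytest fixture whose teardown runs even when a test fails:

    # Alternative sketch (not part of this PR): per-test logs directory with guaranteed cleanup.
    import os
    import shutil

    import pytest

    DEBUG_DIR = "test_logs_web_surfer_autogen"

    @pytest.fixture
    def logs_dir():
        path = os.path.join(os.getcwd(), DEBUG_DIR)
        if os.path.isdir(path):
            shutil.rmtree(path)  # start clean if a previous run left files behind
        os.mkdir(path)
        yield path
        shutil.rmtree(path)  # teardown runs even if the test raised

    def test_uses_logs_dir(logs_dir: str) -> None:
        # A real test would pass logs_dir as debug_dir= when initializing the surfer.
        assert os.path.isdir(logs_dir)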
View File
@@ -4,8 +4,10 @@ resolution-markers = [
     "python_full_version < '3.11'",
     "python_full_version == '3.11.*'",
     "python_full_version >= '3.12' and python_full_version < '3.12.4'",
-    "python_full_version < '3.13'",
-    "python_full_version >= '3.13'",
+    "python_full_version < '3.11'",
+    "python_full_version == '3.11.*'",
+    "python_full_version >= '3.12' and python_full_version < '3.12.4'",
+    "python_full_version >= '3.12.4'",
 ]

 [manifest]
@@ -436,7 +438,7 @@ requires-dist = [
     { name = "opentelemetry-api", specifier = "~=1.27.0" },
     { name = "pillow" },
     { name = "protobuf", specifier = "~=4.25.1" },
-    { name = "pydantic", specifier = "<3.0.0,>=2.0.0" },
+    { name = "pydantic", specifier = ">=2.0.0,<3.0.0" },
     { name = "tiktoken" },
     { name = "typing-extensions" },
 ]
@@ -534,7 +536,7 @@ source = { editable = "packages/autogen-magentic-one" }
 dependencies = [
     { name = "aiofiles" },
     { name = "autogen-core" },
-    { name = "autogen-ext" },
+    { name = "autogen-ext", extra = ["docker"] },
     { name = "beautifulsoup4" },
     { name = "mammoth" },
     { name = "markdownify" },
@@ -567,7 +569,7 @@ dev = [
 requires-dist = [
     { name = "aiofiles" },
     { name = "autogen-core", editable = "packages/autogen-core" },
-    { name = "autogen-ext", editable = "packages/autogen-ext" },
+    { name = "autogen-ext", extras = ["docker"], editable = "packages/autogen-ext" },
     { name = "beautifulsoup4" },
     { name = "mammoth" },
     { name = "markdownify" },
@@ -578,7 +580,7 @@ requires-dist = [
     { name = "pdfminer-six" },
     { name = "playwright" },
     { name = "puremagic" },
-    { name = "pydantic", specifier = "<3.0.0,>=2.0.0" },
+    { name = "pydantic", specifier = ">=2.0.0,<3.0.0" },
     { name = "pydub" },
     { name = "python-pptx" },
     { name = "requests" },
@@ -3672,7 +3674,7 @@ name = "psycopg"
 version = "3.2.3"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+    { name = "typing-extensions" },
     { name = "tzdata", marker = "sys_platform == 'win32'" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/d1/ad/7ce016ae63e231575df0498d2395d15f005f05e32d3a2d439038e1bd0851/psycopg-3.2.3.tar.gz", hash = "sha256:a5764f67c27bec8bfac85764d23c534af2c27b893550377e37ce59c12aac47a2", size = 155550 }
@@ -4798,7 +4800,7 @@ name = "sqlalchemy"
 version = "2.0.36"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "greenlet", marker = "(python_full_version < '3.13' and platform_machine == 'AMD64') or (python_full_version < '3.13' and platform_machine == 'WIN32') or (python_full_version < '3.13' and platform_machine == 'aarch64') or (python_full_version < '3.13' and platform_machine == 'amd64') or (python_full_version < '3.13' and platform_machine == 'ppc64le') or (python_full_version < '3.13' and platform_machine == 'win32') or (python_full_version < '3.13' and platform_machine == 'x86_64')" },
+    { name = "greenlet", marker = "platform_machine == 'AMD64' or platform_machine == 'WIN32' or platform_machine == 'aarch64' or platform_machine == 'amd64' or platform_machine == 'ppc64le' or platform_machine == 'win32' or platform_machine == 'x86_64'" },
     { name = "typing-extensions" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/50/65/9cbc9c4c3287bed2499e05033e207473504dc4df999ce49385fb1f8b058a/sqlalchemy-2.0.36.tar.gz", hash = "sha256:7f2767680b6d2398aea7082e45a774b2b0767b5c8d8ffb9c8b683088ea9b29c5", size = 9574485 }