{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Web Scraping using Spider API\n",
"\n",
"This notebook shows how to use the open \n",
"source [Spider](https://spider.cloud/) web crawler together with AutoGen agents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we need to install the Spider SDK and the AutoGen library."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"! pip install -qqq pyautogen spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting up the LLM configuration and the Spider API key is also required."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"config_list = [\n",
" {\"model\": \"gpt-4o\", \"api_key\": os.getenv(\"OPENAI_API_KEY\")},\n",
"]\n",
"\n",
"spider_api_key = os.getenv(\"SPIDER_API_KEY\")"
]
},
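{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, verify that both keys are present before continuing. This is a minimal\n",
"sanity check of our own environment, not something the Spider SDK or AutoGen requires:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fail early with a clear message if either key is missing from the environment.\n",
"for var in (\"OPENAI_API_KEY\", \"SPIDER_API_KEY\"):\n",
"    if not os.getenv(var):\n",
"        raise EnvironmentError(f\"Please set the {var} environment variable before continuing.\")"
]
},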
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define the tool for scraping and crawling data from any website with Spider.\n",
"Read more about tool use in this [tutorial chapter](/docs/tutorial/tool-use)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'content': 'Spider - The Fastest Web Crawling Service[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHub](https://github.com/spider-rs/spider) [Twitter](https://twitter.com/spider_rust) Toggle ThemeSign InRegisterTo help you get started with Spider, well give you $200 in credits when you spend $100. [Get Credits](/credits/new)LangChain integration [now available](https://python.langchain.com/docs/integrations/document_loaders/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 2.5secs ###To crawl 200 pages### 100-500x ###Faster than alternatives### 500x ###Cheaper than traditional scraping services Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.Example used tailwindcss.com - 04/16/2024[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labelingSee what\\'s being said----------[<img alt=\"Merrick Christensen\" src=\"https://pbs.twimg.com/profile_images/1715432638574678016/EpXREbKV_normal.jpg\" height=\"48\" width=\"48\" />](https://twitter.com/iammerrick/status/1787873425446572462)[Merrick Christensen](https://twitter.com/iammerrick/status/1787873425446572462)[@iammerrick ](https://twitter.com/iammerrick/status/1787873425446572462)· [Follow](https://twitter.com/intent/follow?screen_name=iammerrick)[](https://twitter.com/iammerrick/status/1787873425446572462)Rust based crawler Spider is next level for crawling & scraping sites. So fast. Their cloud offering is also so easy to use. Good stuff. 
[ github.com/spider-rs/spid… ](https://github.com/spider-rs/spider)[ 3:53 PM · May 7, 2024 ](https://twitter.com/iammerrick/status/1787873425446572462) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[12 ](https://twitter.com/intent/like?tweet_id=1787873425446572462) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1787873425446572462)[ Read more on Twitter ](https://twitter.com/iammerrick/status/1787873425446572462)[<img alt=\"William Espegren\" src=\"https://pbs.twimg.com/profile_images/1725251248796864513/GjyNyMl8_normal.jpg\" height=\"48\" width=\"48\" />](https://twitter.com/WilliamEspegren/status/1789419820821184764)[William Espegren](https://twitter.com/WilliamEspegren/status/1789419820821184764)[@WilliamEspegren ](https://twitter.com/WilliamEspegren/status/1789419820821184764)· [Follow](https://twitter.com/intent/follow?screen_name=WilliamEspegren)[](https://twitter.com/WilliamEspegren/status/1789419820821184764)Web crawler built in rust, currently the nr1 performance in the world with crazy resource management Aaaaaaand they have a cloud offer, thats wayyyy cheaper than any competitor Name a reason for me to use anything else? [ github.com/spider-rs/spid… ](https://github.com/spider-rs/spider)[ 10:18 PM · May 11, 2024 ](https://twitter.com/WilliamEspegren/status/1789419820821184764) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[2 ](https://twitter.com/intent/like?tweet_id=1789419820821184764) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1789419820821184764)[ Read 1 reply ](https://twitter.com/WilliamEspegren/status/1789419820821184764)[<img alt=\"Troy Lowry\" src=\"https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png\" height=\"48\" width=\"48\" />](https://twitter.com/Troyusrex/status/1791497607925088307)[Troy Lowry](https://twitter.com/Troyusrex/status/1791497607925088307)[@Troyusrex ](https://twitter.com/Troyusrex/status/1791497607925088307)· [Follow](https://twitter.com/intent/follow?screen_name=Troyusrex)[](https://twitter.com/Troyusrex/status/1791497607925088307)[ @spider\\\\_rust ](https://twitter.com/spider_rust) First, the good: Spider has enabled me to speed up my scraping 20X and with a bit higher quality than I was getting before. I am having a few issues however. First, the documentation link doesn\\'t work ([ spider.cloud/guides/(/docs/… ](https://spider.cloud/guides/(/docs/api)))I\\'ve figured out how to get it to work…[ 3:54 PM · May 17, 2024 ](https://twitter.com/Troyusrex/status/1791497607925088307) [](https://help.twitter.com/en/twitter-for-websites-ads-info-and-privacy)[1 ](https://twitter.com/intent/like?tweet_id=1791497607925088307) [Reply ](https://twitter.com/intent/tweet?in_reply_to=1791497607925088307)[ Read 2 replies ](https://twitter.com/Troyusrex/status/1791497607925088307)FAQ----------Frequently asked questions about Spider<details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\" open=\"\"><summary class=\"cursor-pointer\">What is Spider?----------</summary>Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">Why is my website not crawling?----------</summary>Your crawl may fail if it requires JavaScript rendering. 
Try setting your request to \\'chrome\\' to solve this issue.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">Can you crawl all pages?----------</summary>Yes, Spider accurately crawls all necessary content without needing a sitemap.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">What formats can Spider convert web data into?----------</summary>Spider outputs HTML, raw, text, and various markdown formats. It supports JSON, JSONL, CSV, and XML for API responses.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">Is Spider suitable for large scraping projects?----------</summary>Absolutely, Spider is ideal for large-scale data collection and offers a cost-effective dashboard for data management.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">How can I try Spider?----------</summary>Purchase credits for our cloud system or test the Open Source Spider engine to explore its capabilities.</details><details itemscope=\"\" itemprop=\"mainEntity\" itemtype=\"https://schema.org/Question\"><summary class=\"cursor-pointer\">Does it respect robots.txt?----------</summary>Yes, compliance with robots.txt is default, but you can disable this if necessary.</details>[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula) [FAQ](/faq)© 2024 Spider from A11yWatch[GitHubGithub](https://github.com/spider-rs/spider) [X - Twitter ](https://twitter.com/spider_rust)', 'error': None, 'status': 200, 'url': 'https://spider.cloud'}]\n"
]
}
],
"source": [
"from typing import Any, Dict, List\n",
"\n",
"from spider import Spider\n",
"from typing_extensions import Annotated\n",
"\n",
"\n",
"def scrape_page(\n",
" url: Annotated[str, \"The URL of the web page to scrape\"],\n",
" params: Annotated[dict, \"Dictionary of additional params.\"] = None,\n",
") -> Annotated[Dict[str, Any], \"Scraped content\"]:\n",
" # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n",
" client = Spider(spider_api_key)\n",
"\n",
" if params is None:\n",
" params = {\"return_format\": \"markdown\"}\n",
"\n",
" scraped_data = client.scrape_url(url, params)\n",
" return scraped_data[0]\n",
"\n",
"\n",
"def crawl_page(\n",
" url: Annotated[str, \"The url of the domain to be crawled\"],\n",
" params: Annotated[dict, \"Dictionary of additional params.\"] = None,\n",
") -> Annotated[List[Dict[str, Any]], \"Scraped content\"]:\n",
" # Initialize the Spider client with your API key, if no api key is specified it looks for SPIDER_API_KEY in your environment variables\n",
" client = Spider(spider_api_key)\n",
"\n",
" if params is None:\n",
" params = {\"return_format\": \"markdown\"}\n",
"\n",
" crawled_data = client.crawl_url(url, params)\n",
" return crawled_data"
]
},
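{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before wiring the tools into agents, you can call them directly. The cell below is a\n",
"minimal sketch that scrapes a single page and previews the returned markdown; note that\n",
"it performs a real API call and consumes Spider credits (the target URL is just an example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional standalone test of the scrape_page tool.\n",
"result = scrape_page(\"https://spider.cloud\", params={\"return_format\": \"markdown\"})\n",
"print(result[\"url\"], result[\"status\"])\n",
"print(result[\"content\"][:500])  # preview the first 500 characters"
]
},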
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the agents and register the tool."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from autogen import ConversableAgent, register_function\n",
"\n",
"# Create web scraper agent.\n",
"scraper_agent = ConversableAgent(\n",
" \"WebScraper\",\n",
" llm_config={\"config_list\": config_list},\n",
" system_message=\"You are a web scraper and you can scrape any web page to retrieve its contents.\"\n",
" \"Returns 'TERMINATE' when the scraping is done.\",\n",
")\n",
"\n",
"# Create web crawler agent.\n",
"crawler_agent = ConversableAgent(\n",
" \"WebCrawler\",\n",
" llm_config={\"config_list\": config_list},\n",
" system_message=\"You are a web crawler and you can crawl any page with deeper crawling following subpages.\"\n",
" \"Returns 'TERMINATE' when the scraping is done.\",\n",
")\n",
"\n",
"# Create user proxy agent.\n",
"user_proxy_agent = ConversableAgent(\n",
" \"UserProxy\",\n",
" llm_config=False, # No LLM for this agent.\n",
" human_input_mode=\"NEVER\",\n",
" code_execution_config=False, # No code execution for this agent.\n",
" is_termination_msg=lambda x: x.get(\"content\", \"\") is not None and \"terminate\" in x[\"content\"].lower(),\n",
" default_auto_reply=\"Please continue if not finished, otherwise return 'TERMINATE'.\",\n",
")\n",
"\n",
"# Register the functions with the agents.\n",
"register_function(\n",
" scrape_page,\n",
" caller=scraper_agent,\n",
" executor=user_proxy_agent,\n",
" name=\"scrape_page\",\n",
" description=\"Scrape a web page and return the content.\",\n",
")\n",
"\n",
"register_function(\n",
" crawl_page,\n",
" caller=crawler_agent,\n",
" executor=user_proxy_agent,\n",
" name=\"crawl_page\",\n",
" description=\"Crawl an entire domain, following subpages and return the content.\",\n",
")"
]
},
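{
"cell_type": "markdown",
"metadata": {},
"source": [
"After registration, each caller agent's `llm_config` carries a JSON schema describing its tool,\n",
"which is what the LLM sees when deciding whether to call it. Printing the schema is a quick way\n",
"to confirm the registration worked (this assumes the `llm_config[\"tools\"]` layout used by pyautogen 0.2):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Show the JSON schema generated for the scrape_page tool from its type annotations.\n",
"print(json.dumps(scraper_agent.llm_config[\"tools\"], indent=2))"
]
},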
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start the conversation for scraping web data. We used the\n",
"`reflection_with_llm` option for summary method\n",
"to perform the formatting of the output into a desired format.\n",
"The summary method is called after the conversation is completed\n",
"given the complete history of the conversation."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mUserProxy\u001b[0m (to WebScraper):\n",
"\n",
"Can you scrape william-espegren.com for me?\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebScraper\u001b[0m (to UserProxy):\n",
"\n",
"\u001b[32m***** Suggested tool call (call_qCNYeQCfIPZkUCKejQmm5EhC): scrape_page *****\u001b[0m\n",
"Arguments: \n",
"{\"url\":\"https://www.william-espegren.com\"}\n",
"\u001b[32m****************************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION scrape_page...\u001b[0m\n",
"\u001b[33mUserProxy\u001b[0m (to WebScraper):\n",
"\n",
"\u001b[33mUserProxy\u001b[0m (to WebScraper):\n",
"\n",
"\u001b[32m***** Response from calling tool (call_qCNYeQCfIPZkUCKejQmm5EhC) *****\u001b[0m\n",
"[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"https://www.william-espegren.com\"}]\n",
"\u001b[32m**********************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebScraper\u001b[0m (to UserProxy):\n",
"\n",
"I successfully scraped the website \"william-espegren.com\". Here is the content retrieved:\n",
"\n",
"```\n",
"William Espegren - Portfolio\n",
"\n",
"keep scrolling\n",
"\n",
"MADE WITH\n",
"CSS, JS\n",
"\n",
"MADE BY\n",
"Uppsala\n",
"\n",
"William Espegren\n",
"With Love\n",
"\n",
"Open For Projects\n",
"\n",
"[CONTACT ME](https://www.linkedin.com/in/william-espegren/)\n",
"[Instagram](https://www.instagram.com/williamespegren/)\n",
"[LinkedIn](https://www.linkedin.com/in/william-espegren/)\n",
"[Twitter](https://twitter.com/WilliamEspegren)\n",
"[Github](https://github.com/WilliamEspegren)\n",
"```\n",
"\n",
"Is there anything specific you would like to do with this information?\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mUserProxy\u001b[0m (to WebScraper):\n",
"\n",
"Please continue if not finished, otherwise return 'TERMINATE'.\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebScraper\u001b[0m (to UserProxy):\n",
"\n",
"TERMINATE\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Scrape page\n",
"scraped_chat_result = user_proxy_agent.initiate_chat(\n",
" scraper_agent,\n",
" message=\"Can you scrape william-espegren.com for me?\",\n",
" summary_method=\"reflection_with_llm\",\n",
" summary_args={\"summary_prompt\": \"\"\"Summarize the scraped content\"\"\"},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n",
"\n",
"Can you crawl william-espegren.com for me, I want the whole domains information?\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n",
"\n",
"\u001b[32m***** Suggested tool call (call_0FkTtsxBtA0SbChm1PX085Vk): crawl_page *****\u001b[0m\n",
"Arguments: \n",
"{\"url\":\"http://www.william-espegren.com\"}\n",
"\u001b[32m***************************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION crawl_page...\u001b[0m\n",
"\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n",
"\n",
"\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n",
"\n",
"\u001b[32m***** Response from calling tool (call_0FkTtsxBtA0SbChm1PX085Vk) *****\u001b[0m\n",
"[{\"content\": \"William Espegren - Portfoliokeep scrollingMADE WITHCSS, JSMADE BYUppsalaWilliam EspegrenWith \\u00b7LoveOpen For Projects[CONTACT ME](https://www.linkedin.com/in/william-espegren/)[Instagram](https://www.instagram.com/williamespegren/)[LinkedIn](https://www.linkedin.com/in/william-espegren/)[Twitter](https://twitter.com/WilliamEspegren)[team-collaboration/version-control/github Created with Sketch.Github](https://github.com/WilliamEspegren)\", \"error\": null, \"status\": 200, \"url\": \"http://www.william-espegren.com\"}]\n",
"\u001b[32m**********************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n",
"\n",
"The crawl of [william-espegren.com](http://www.william-espegren.com) has been completed. Here is the gathered content:\n",
"\n",
"---\n",
"\n",
"**William Espegren - Portfolio**\n",
"\n",
"Keep scrolling\n",
"\n",
"**MADE WITH:** CSS, JS\n",
"\n",
"**MADE BY:** Uppsala\n",
"\n",
"**William Espegren**\n",
"\n",
"**With Love**\n",
"\n",
"**Open For Projects**\n",
"\n",
"**[CONTACT ME](https://www.linkedin.com/in/william-espegren/)**\n",
"\n",
"- [Instagram](https://www.instagram.com/williamespegren/)\n",
"- [LinkedIn](https://www.linkedin.com/in/william-espegren/)\n",
"- [Twitter](https://twitter.com/WilliamEspegren)\n",
"- [Github](https://github.com/WilliamEspegren)\n",
"\n",
"---\n",
"\n",
"If you need further information or details from any specific section, please let me know!\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mUserProxy\u001b[0m (to WebCrawler):\n",
"\n",
"Please continue if not finished, otherwise return 'TERMINATE'.\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[31m\n",
">>>>>>>> USING AUTO REPLY...\u001b[0m\n",
"\u001b[33mWebCrawler\u001b[0m (to UserProxy):\n",
"\n",
"TERMINATE\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Crawl page\n",
"crawled_chat_result = user_proxy_agent.initiate_chat(\n",
" crawler_agent,\n",
" message=\"Can you crawl william-espegren.com for me, I want the whole domains information?\",\n",
" summary_method=\"reflection_with_llm\",\n",
" summary_args={\"summary_prompt\": \"\"\"Summarize the crawled content\"\"\"},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is stored in the summary."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The website belongs to William Espegren, who is based in Uppsala and possesses skills in CSS and JavaScript. He is open to new projects. You can contact him through the following links:\n",
"\n",
"- [LinkedIn](https://www.linkedin.com/in/william-espegren/)\n",
"- [Instagram](https://www.instagram.com/williamespegren/)\n",
"- [Twitter](https://twitter.com/WilliamEspegren)\n",
"- [GitHub](https://github.com/WilliamEspegren)\n",
"\n",
"Feel free to reach out to him for project collaborations.\n"
]
}
],
"source": [
"print(scraped_chat_result.summary)\n",
"# print(crawled_chat_result.summary) # We show one for cleaner output"
]
}
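,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the summary, the returned `ChatResult` keeps the full message history and a cost\n",
"breakdown. The cell below sketches how to inspect them, assuming the `chat_history` and\n",
"`cost` fields exposed by pyautogen's `ChatResult`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Walk the recorded conversation and show a short preview of each message.\n",
"for message in scraped_chat_result.chat_history:\n",
"    print(f\"{message.get('role', 'unknown')}: {str(message.get('content'))[:80]}\")\n",
"\n",
"# Token/cost accounting for the LLM calls made during the chat.\n",
"print(scraped_chat_result.cost)"
]
}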
],
"metadata": {
"front_matter": {
"description": "Scraping/Crawling web pages and summarizing the content using agents.",
"tags": [
"web scraping",
"spider",
"tool use"
],
"title": "Web Scraper & Crawler Agent using Spider"
},
"kernelspec": {
"display_name": "autogen",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}