GPT-4V + Puppeteer = an AI agent that browses the web like a human?

There's one type of AI agent use case that has been trending fast over the past few weeks, and multiple teams have made huge progress in this direction. The team behind HyperWrite published a self-operating computer framework that gives GPT-4V direct access to and control of your whole computer. This means it can open Spotlight to search for apps like Google Chrome, conduct actions like navigating to a specific URL, go to Google Docs and log in with your Google account, and perform more advanced actions like browsing the internet or booking flight tickets.

On the other hand, the team at MultiOn showcased a web AI agent with direct access to the web browser. It successfully completed the California online driving test by itself, the first fully autonomous completion of a real-world human knowledge task by AI. This type of agent system, which gives a super powerful multimodal model like GPT-4V direct computer access, is a fascinating idea, as it seems to unlock so much potential.

So, what are the use cases and opportunities that the self-operating computer can enable? Let's explore.

Use Cases and Market Opportunities

One way to look at this is to examine previous attempts to build similar systems and understand their use cases and limitations. One direct market category is Robotic Process Automation (RPA). RPA is a category of software that helps enterprises build automated robots to handle repetitive and standardized tasks, such as invoice handling or data entry. Platforms like UiPath have been successful in this category, allowing users to build automations that interact directly with desktop apps like calculators, browsers, Excel, or legacy systems without API endpoints. Enterprises already spend more than $3 billion every year on these process automations.

However, the limitations of these RPA solutions are quite clear. Most of these systems struggle with non-standardized or ever-changing processes, not to mention processes that involve more complex decision-making. For example, if you want a robot to scrape pricing and product data from Nike, Adidas, and Puma websites, you would need to build a specific process for each website. If any of these websites update their structure, the previous automation would break. This makes the setup cost very high, limiting the use of RPA to standardized and repetitive tasks in enterprises.

This is where multimodal AI agents that can directly control the computer and browser become exciting. Theoretically, they can handle much more complex situations with much less setup cost. In the data scraping example, instead of building a specific automation for each website, you can simply give the agent the URLs of different competitors and let it automatically navigate the websites, take screenshots, and extract data. The agent can make its own decisions, and even if the website structure changes, it can adapt. These AI agents can go beyond simple automation to complete intelligent tasks such as customer support, sales, and marketing. As agents gain the ability to access different systems, we are getting closer to deploying real AI workers in companies.

However, delivering useful AI worker solutions is not just about understanding the technology. It also requires understanding the end-to-end workflow for specific job functions. A recent research report by HubSpot surveyed and interviewed more than 1,400 global sales leaders to understand how modern sales teams work and their workflow. The report covers key challenges, opportunities, best practices, and top AI use cases adopted by sales leaders. It provides a deep dive into how sales functions work and the key opportunities in the market.

Now, let's dive into an example of how to build an AI agent with direct control of your web browser to perform sophisticated web research and tasks.

Building an AI Agent with Direct Control of the Web Browser

To build an AI agent with direct control of your web browser, we will use a combination of GPT-4V and Puppeteer, a Node.js library for controlling web browsers programmatically. Puppeteer allows us to take screenshots, interact with web elements, and navigate through websites.

First, let's create a Node.js file called "screenshot.js" to take a screenshot of a web page using Puppeteer. We will define the URL of the web page and set a timeout for the page to load. Then, we will create a Puppeteer function to launch a new browser, open a new page, set the viewport size, navigate to the URL, and wait until the page is fully loaded. Finally, we will take a screenshot of the page and save it as "screenshot.jpg".
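A minimal sketch of what "screenshot.js" could look like is below. It assumes Puppeteer is installed (`npm install puppeteer`); the viewport size and timeout value are illustrative choices, not taken from the original.

```javascript
// screenshot.js — capture a screenshot of a URL with Puppeteer (sketch).
const TIMEOUT_MS = 8000; // illustrative page-load timeout, not from the original

async function takeScreenshot(url, outPath = 'screenshot.jpg') {
  const puppeteer = require('puppeteer'); // npm install puppeteer
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 }); // illustrative viewport
    // Wait until the network is idle so lazy-loaded content has rendered.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: TIMEOUT_MS });
    await page.screenshot({ path: outPath, type: 'jpeg', quality: 80, fullPage: true });
  } finally {
    await browser.close();
  }
  return outPath;
}

module.exports = { takeScreenshot, TIMEOUT_MS };
```

To run it as the standalone script described here, you could add a line like `takeScreenshot(process.argv[2]);` at the bottom, so a subprocess call can pass the URL as a command-line argument.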

Next, we will create a Python file called "vision_scraper.py" to call the "screenshot.js" file and extract data from the screenshot using GPT-4V. We will import the necessary libraries, load the OpenAI API key from a .env file, and create a function to convert the image file to a format that can be passed to GPT-4V. Then, we will create a function that takes the URL of a website, removes any existing screenshot file, and runs the "screenshot.js" file using the subprocess module. After taking the screenshot, we will pass it to GPT-4V through the OpenAI API and extract the desired information.

Now, let's move on to building a more advanced AI agent that can interact with different websites, click on links, and perform web research. We will create a new JavaScript file called "web_agent.js" and import the necessary libraries. We will also create a function to convert the image file to a base64 format, a function to take a URL and capture a screenshot using Puppeteer, and a function to create a command-line interface for user interaction.

One important function in the "web_agent.js" file is the "highlight_links" function. This function removes any previously highlighted bounding boxes and returns all the interactive elements (buttons, inputs, text areas, and links) on the web page. It also checks that each element is visible and within the viewport. The function highlights the interactive elements and sets a special attribute called "gbt_link_text" on each element. This attribute will be used as an identifier for the elements that the AI agent should interact with.
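A sketch of this function is below. It is written to run inside the browser context (e.g. via `page.evaluate(highlightLinks)` on a Puppeteer page); the exact selector list and the red-outline styling are illustrative assumptions, while the "gbt_link_text" attribute name follows the description above.

```javascript
// highlightLinks — runs in the browser context via page.evaluate(highlightLinks).
function highlightLinks() {
  // Clear highlights and identifiers left over from a previous pass.
  document.querySelectorAll('[gbt_link_text]').forEach((el) => {
    el.style.outline = '';
    el.removeAttribute('gbt_link_text');
  });
  const labels = [];
  // Collect the interactive elements: links, buttons, inputs, text areas.
  document.querySelectorAll('a, button, input, textarea, [role="button"]').forEach((el) => {
    const rect = el.getBoundingClientRect();
    const visible =
      rect.width > 0 && rect.height > 0 &&
      rect.bottom > 0 && rect.top < window.innerHeight; // within the viewport
    if (!visible) return;
    const label = (el.innerText || el.value || '').trim();
    el.setAttribute('gbt_link_text', label); // identifier the agent clicks by
    el.style.outline = '2px solid red';      // visible bounding box
    labels.push(label);
  });
  return labels;
}
```

Highlighting matters because GPT-4V only sees the screenshot: the red outlines tell the model which elements are clickable, and the attribute gives the agent a selector to click them with afterwards.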

In the main part of the code, we create a Puppeteer browser, open a new page, set the viewport, and define a system message for the AI agent. We then create a loop to continuously interact with the web page. The AI agent receives messages from GPT-4V, displays them to the user, and performs actions based on the messages. If the message says to click on a specific UI element, the agent will try to find that element on the web page and interact with it. If the message says to fetch a URL, the agent will navigate to that URL. The loop continues until the agent thinks the task is completed.
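The loop can be sketched roughly as follows. The message format follows the GPT-4V chat completions API, but the reply conventions (`CLICK:`/`URL:`/`ANSWER:` prefixes), the step cap, and the helper names `screenshotToBase64` and `highlightLinks` are illustrative assumptions, not the exact protocol from the original code.

```javascript
// Sketch of the web_agent.js main loop. `page` is a Puppeteer page; the
// helpers are passed in. Assumes OPENAI_API_KEY is set in the environment.
async function runAgent(page, task, { screenshotToBase64, highlightLinks }) {
  const messages = [
    // The reply protocol below is an illustrative assumption.
    { role: 'system', content: 'You are a web browsing agent. Reply with CLICK:<link text>, URL:<url>, or ANSWER:<answer>.' },
    { role: 'user', content: task },
  ];
  for (let step = 0; step < 10; step++) { // cap the number of steps
    // Highlight interactive elements, then send a fresh screenshot to GPT-4V.
    await page.evaluate(highlightLinks);
    const image = await screenshotToBase64(page);
    messages.push({
      role: 'user',
      content: [{ type: 'image_url', image_url: { url: image } }],
    });
    const res = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({ model: 'gpt-4-vision-preview', messages, max_tokens: 500 }),
    });
    const reply = (await res.json()).choices[0].message.content;
    console.log('Agent:', reply); // show the model's message to the user
    messages.push({ role: 'assistant', content: reply });

    if (reply.startsWith('ANSWER:')) return reply.slice(7).trim(); // task done
    if (reply.startsWith('URL:')) {
      await page.goto(reply.slice(4).trim(), { waitUntil: 'networkidle0' });
    } else if (reply.startsWith('CLICK:')) {
      // Find the element whose gbt_link_text attribute matches, and click it.
      const text = reply.slice(6).trim();
      await page.click(`[gbt_link_text="${text}"]`);
    }
  }
  return null; // step limit reached without an answer
}

module.exports = { runAgent };
```

Note that the screenshot is re-taken on every iteration, so after each click or navigation the model reasons about the page it actually landed on rather than a stale view.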

With this AI agent, you can ask it questions, navigate through websites, and extract information. It can even complete complex tasks like finding information on specific websites or interacting with forms. Although the functionality is still limited and there are improvements to be made, this AI agent shows great potential for completing complex tasks.

In conclusion, the combination of GPT-4V and Puppeteer allows us to build AI agents that can browse the web like humans. These agents have the potential to revolutionize automation, customer support, sales, and marketing. With direct access to and control of the computer and web browser, AI agents can handle complex tasks with much less setup cost. The possibilities are endless, and I'm excited to see the innovative AI agents that will be created in the future.

Resources:

Get HubSpot's free research report on how sales teams use AI in 2024: https://offers.hubspot.com/sales-trends-report?utm_source=youtube&utm_medium=social&utm_campaign=CR00148Nov2023_AIJason%2Fpartner_youtube

Frequently Asked Questions

1. Can AI agents replace human workers?

No, AI agents are designed to assist and augment human workers, not replace them. They can handle repetitive and standardized tasks, allowing humans to focus on more complex and creative work.

2. How accurate are AI agents in completing tasks?

The accuracy of AI agents depends on various factors, such as the quality of training data, the complexity of the task, and the capabilities of the AI model. With advancements in AI technology, the accuracy of AI agents is continuously improving.

3. Are AI agents secure?

AI agents can be secure if proper security measures are implemented. It is important to ensure that AI agents have limited access to sensitive data and are protected against potential vulnerabilities and attacks.

4. Can AI agents learn from user interactions?

Yes, AI agents can learn from user interactions. By analyzing user feedback and behavior, AI agents can improve their performance and provide more personalized and accurate assistance.

5. How can AI agents benefit businesses?

AI agents can benefit businesses by automating repetitive tasks, improving customer support, increasing efficiency, and providing valuable insights. They can help businesses save time and resources, enhance productivity, and deliver better customer experiences.