LLMs Operating Computers: An Introduction to and Framework for RL-Driven Agents

1. Proposer-Agent-Evaluator (PAE)
2. WebArena: A Realistic Web Environment for Building Autonomous Agents
3. WebVoyager: Building an End-to-End Web Agent with LMMs
4. Ragen: Not What We Need
5. Tools May Take
- 5.1. Magentic-UI
  - 5.1.1. Roles
  - 5.1.2. Workflow
- 5.2. Puppeteer

1. Proposer-Agent-Evaluator (PAE)

Paper: Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

1.1. Observations (what a VLM can see)

Domain-specific Observations and Actions

Number Marks:

"To provide better action grounding, we follow the practice from prior works (Zheng et al., 2024; He et al., 2024) to augment the observation space with number marks on top of each interactive element such as web links and text boxes."
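As a rough illustration of the number-mark idea, the sketch below assigns a numeric mark to each interactive element so that an action can refer to an element by its number. The `Element` schema, the `INTERACTIVE_TAGS` set, and the reading-order sort are assumptions made for this example; they are not taken from the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A page element with its bounding box in screenshot pixels (hypothetical schema)."""
    tag: str
    text: str
    box: tuple[int, int, int, int]  # (left, top, right, bottom)


# Tags treated as interactive here; an assumption, not the paper's exact list.
INTERACTIVE_TAGS = {"a", "button", "input", "textarea", "select"}


def assign_number_marks(elements: list[Element]) -> dict[int, Element]:
    """Give each interactive element a numeric mark, in reading order."""
    interactive = [e for e in elements if e.tag in INTERACTIVE_TAGS]
    # Sort top-to-bottom, then left-to-right, so marks follow the visual layout.
    interactive.sort(key=lambda e: (e.box[1], e.box[0]))
    return {mark: e for mark, e in enumerate(interactive)}


def describe_marks(marks: dict[int, Element]) -> str:
    """Render the mark table as text that can accompany the annotated screenshot."""
    return "\n".join(f"[{m}] <{e.tag}> {e.text}" for m, e in marks.items())
```

Drawing the marks onto the screenshot itself is left out; the point is only that each mark resolves back to exactly one element.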

1.2. Task Proposer (how to generate corpus)

Prompt the proposer with only the website name as information
Provide context information about the website
Provide some screenshots and user demos (usage cases)
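The options above differ only in how much context reaches the proposer. A minimal sketch of assembling such a prompt follows; the function name `build_proposer_prompt`, its parameters, and the prompt wording are all hypothetical, and screenshots would be attached separately as images rather than as text.

```python
from typing import Optional


def build_proposer_prompt(
    website: str,
    context: Optional[str] = None,
    demos: Optional[list[str]] = None,
    num_tasks: int = 5,
) -> str:
    """Assemble a task-proposal prompt from whatever context is available."""
    parts = [f"Propose {num_tasks} realistic tasks a user might do on {website}."]
    if context:
        # Optional free-text description of the website.
        parts.append(f"Website description: {context}")
    if demos:
        # Optional usage cases, one per line.
        parts.append("Example usage cases:")
        parts.extend(f"- {d}" for d in demos)
    parts.append("Return one task per line.")
    return "\n".join(parts)
```

With only a website name, the prompt collapses to two lines; richer context simply adds sections.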

1.3. Evaluator (How to provide the reward & How to evaluate)

The reward is binary (0/1) and purely outcome-based.

It reflects only the success of the final outcome.
A VLM acts as the annotator that provides this feedback.
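A binary outcome reward of this kind might be wrapped as below. The `Judge` callable stands in for a VLM call and is hypothetical, as are the prompt wording and the SUCCESS/FAILURE reply convention; the paper's actual evaluator prompt may differ.

```python
from typing import Callable

# A judge maps a text prompt plus screenshots (raw bytes) to a free-form verdict.
# In practice this would wrap a VLM call; the signature is an assumption.
Judge = Callable[[str, list[bytes]], str]


def outcome_reward(
    judge: Judge, task: str, final_answer: str, screenshots: list[bytes]
) -> int:
    """Return 1 if the judge deems the final outcome successful, else 0."""
    prompt = (
        f"Task: {task}\n"
        f"Agent's final answer: {final_answer}\n"
        "Given the final screenshots, was the task completed? "
        "Reply SUCCESS or FAILURE."
    )
    verdict = judge(prompt, screenshots)
    # Anything other than an explicit SUCCESS counts as failure.
    return 1 if verdict.strip().upper().startswith("SUCCESS") else 0
```

Treating any unclear verdict as 0 is a conservative design choice for this sketch: it avoids rewarding rollouts the judge could not confirm.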

1.4. CoT Training (How to train with reasoning)

Filtered Behavior Cloning (Filtered BC) serves as the RL algorithm: the policy is fine-tuned only on trajectories that the evaluator marks as successful.
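The data-selection step of Filtered BC can be sketched as follows: keep rollouts whose reward is 1 and flatten them into supervised examples that include the reasoning. The `Step` and `Trajectory` schemas and the prompt/completion layout are assumptions for illustration; the fine-tuning step itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class Step:
    observation: str  # e.g. a text rendering of the marked screenshot
    thought: str      # the agent's chain-of-thought for this step
    action: str       # e.g. "click [39]"


@dataclass
class Trajectory:
    task: str
    steps: list[Step]
    reward: int       # 0 or 1 from the outcome evaluator


def filtered_bc_dataset(trajectories: list[Trajectory]) -> list[dict[str, str]]:
    """Keep only successful rollouts and flatten them into supervised pairs."""
    examples = []
    for traj in trajectories:
        if traj.reward != 1:
            continue  # failed rollouts are dropped entirely
        for step in traj.steps:
            examples.append({
                "prompt": f"Task: {traj.task}\nObservation: {step.observation}",
                # Thought and action are cloned together so reasoning is learned too.
                "completion": f"Thought: {step.thought}\nAction: {step.action}",
            })
    return examples
```

The resulting list can be fed to any standard supervised fine-tuning loop.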

1.5. Experiments

1.5.1. Datasets or Environments

WebVoyager
- Desc: 15 websites, 643 tasks (Now 13 websites with 557 tasks are available)
- https://github.com/MinorJerry/WebVoyager
- https://arxiv.org/abs/2401.13919
WebArena
- Desc: 812 hand-crafted tasks on 5 websites
- https://arxiv.org/abs/2307.13854
- https://github.com/web-arena-x/webarena

1.5.2. Performances

1.6. TODO Source code Analysis

TODO.

2. WebArena: A Realistic Web Environment for Building Autonomous Agents

2.1. Website Selection

Analyze about 200 browser-history examples and select 4 representative categories:

E-commerce (Amazon, eBay)
Social forum platforms (Reddit, StackExchange)
Collaborative development platforms (GitLab)
Content Management Systems (CMS)

The authors then build simulated versions of these websites.

2.2. Observations

A browser view, with tabs and web pages.

HTML of the web page and its DOM tree
Screenshot of the current web page
Accessibility tree of the web page; see https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree

2.3. Actions

"emulates the keyboard and mouse operations"

2.4. Intents

Intents are written from universal templates, yet each resulting task is difficult and unique.

2.5. Evaluation Metrics

exact_match
must_include
fuzzy_match
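The three metric names suggest the checks sketched below. This is an approximation under stated assumptions: the normalization rules are invented for the example, and `fuzzy_match` takes a hypothetical `judge` callable in place of whatever semantic comparison the benchmark actually performs.

```python
from typing import Callable


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.strip().lower().split())


def exact_match(reference: str, prediction: str) -> float:
    """1.0 only if the prediction equals the reference after normalization."""
    return float(_normalize(reference) == _normalize(prediction))


def must_include(required: list[str], prediction: str) -> float:
    """1.0 only if every required phrase appears somewhere in the prediction."""
    pred = _normalize(prediction)
    return float(all(_normalize(r) in pred for r in required))


def fuzzy_match(
    reference: str, prediction: str, judge: Callable[[str, str], bool]
) -> float:
    """Delegate semantic equivalence to a judge, e.g. a wrapper around an LLM."""
    return float(judge(reference, prediction))
```

For example, `must_include(["gitlab", "42"], "GitLab has 42 open issues")` passes, while `exact_match` would fail on the same pair of strings.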

3. WebVoyager: Building an End-to-End Web Agent with LMMs

This work is based on Selenium (https://www.selenium.dev/), a WebDriver-based tool for automatically testing websites in browsers.

3.1. Observations

Same as PAE.

Screenshots with numeric labels on clickable elements
Labels are provided by https://github.com/ddupont808/GPT-4V-Act

3.2. Actions

Similar to WebArena, i.e., mouse and keyboard actions. Actions are written in a simplified code-like form, e.g., "click [39]".
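An action string such as "click [39]" has to be parsed before it can be executed in the browser. The sketch below handles a small assumed grammar: an action name, an optional numeric label in brackets, and an optional bracketed text argument after a semicolon. The `Action` class and this grammar are illustrative and not WebVoyager's exact format.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    name: str
    label: Optional[int] = None  # numeric mark of the target element
    text: Optional[str] = None   # content to type, if any


# name, optional "[label]", optional "; [text]" (assumed grammar)
_ACTION_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z_]+)\s*(?:\[(?P<label>\d+)\])?\s*(?:;\s*\[(?P<text>.*)\])?\s*$"
)


def parse_action(raw: str) -> Action:
    """Parse a model-emitted action string, raising ValueError if malformed."""
    m = _ACTION_RE.match(raw)
    if m is None:
        raise ValueError(f"unrecognized action: {raw!r}")
    label = int(m.group("label")) if m.group("label") is not None else None
    return Action(m.group("name").lower(), label, m.group("text"))
```

Rejecting malformed strings with an exception lets the caller re-prompt the model instead of executing a guess.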

3.3. Datasets

15 Realistic Websites
Omit websites that require login or CAPTCHA
Dataset Construction: Use an LLM to generate tasks, then filter them with human checks.
Evaluation: includes Golden Answers and Possible Answers.
- Golden: the full set of correct answers is known
- Possible: the answer may not be correct, or only a partial list is known.

3.4. Results

4. Ragen: Not What We Need

Reasons:

Trajectory-level optimization is not what we need.
The paper is still evaluated in gaming environments.

5. Tools May Take

5.1. Magentic-UI

WeChat introduction: WeChat Introduction.

Github Repository: https://github.com/microsoft/magentic-ui/

No RL training.
Human-AI interaction

5.1.1. Roles

🧑‍💼 Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
🌐 WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator. This agent is a significant improvement over the AutoGen MultimodalWebSurfer in terms of the actions it can do (tab management, select options, file upload, multimodal queries).
💻 Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
📁 FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
🧑 UserProxy is an agent that represents the user interacting with Magentic-UI. The Orchestrator can delegate work to the user instead of the other agents.

5.1.2. Workflow

5.2. Puppeteer

https://pptr.dev/

A JavaScript library that provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default.