LLMs Operating Computers: An Introduction to and Framework for RL-driven Agents

1. Proposer-Agent-Evaluator (PAE)

1.1. Observations (what a VLM can see)

Domain-specific Observations and Actions

screenshot_20250602_150959.png

Number Marks:

"To provide better action grounding, we follow the practice from prior works (Zheng et al., 2024; He et al., 2024) to augment the observation space with number marks on top of each interactive element such as web links and text boxes."

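The number-mark idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Element` record, the `INTERACTIVE_TAGS` set, and the bounding-box layout are all assumptions made for the example. It assigns a numeric mark to each interactive element and resolves a mark back to a click position.

```python
from dataclasses import dataclass


@dataclass
class Element:
    tag: str     # HTML tag name, e.g. "a", "input", "button"
    bbox: tuple  # (x, y, width, height) in screenshot pixels (assumed layout)


# Assumption: which tags count as "interactive" is chosen here for illustration.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


def assign_number_marks(elements):
    """Number the interactive elements from 0 in page order and return {mark: element}."""
    interactive = [e for e in elements if e.tag in INTERACTIVE_TAGS]
    return {mark: e for mark, e in enumerate(interactive)}


def click_point(marks, mark):
    """Resolve a number mark to the center of the element's bounding box."""
    x, y, w, h = marks[mark].bbox
    return (x + w // 2, y + h // 2)
```

With such a mapping, the model only has to output a mark number, and the environment turns it into concrete coordinates.
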
1.2. Task Proposer (how to generate corpus)

  1. Prompt the proposer with only the website name as context
  2. Provide context information about the website
  3. Provide screenshots and user demos (usage cases)
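
The three context levels above can be sketched as one prompt builder. This is a hypothetical helper, not PAE's actual prompt: the function name, the wording, and the text-only demo format are assumptions (real demos would include screenshots passed to a multimodal model).

```python
def build_proposer_prompt(website, context=None, demos=None, n_tasks=5):
    """Build a task-proposal prompt; more context is added as it becomes available."""
    # Level 1: the website name alone.
    lines = [f"Propose {n_tasks} tasks a real user might want to complete on {website}."]
    # Level 2: optional free-text description of the website.
    if context:
        lines.append(f"Website description: {context}")
    # Level 3: optional user demonstrations (text descriptions in this sketch).
    if demos:
        lines.append("Example user demonstrations:")
        lines.extend(f"- {d}" for d in demos)
    lines.append("Return one task per line.")
    return "\n".join(lines)
```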

1.3. Evaluator (How to provide the reward & How to evaluate)

The reward is 0/1, purely result-oriented.

  1. The reward reflects only the success of the final outcome
  2. A VLM acts as the annotator that provides this feedback
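
A minimal sketch of such an outcome-only reward, assuming the VLM is wrapped in a caller-supplied `judge` function that returns a verdict string starting with "success" or "failure". The signature and the verdict format are assumptions for illustration, not PAE's interface.

```python
from typing import Callable, Sequence


def outcome_reward(
    task: str,
    final_screenshots: Sequence[bytes],
    agent_answer: str,
    judge: Callable[[str, Sequence[bytes], str], str],
) -> int:
    """Return 1 if the judge reports success on the final state, else 0."""
    verdict = judge(task, final_screenshots, agent_answer)
    # Only the final outcome matters; intermediate steps are not scored.
    return 1 if verdict.strip().lower().startswith("success") else 0
```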

1.4. CoT Training (How to train with reasoning)

  • Uses Filtered Behavior Cloning (Filtered BC) as the RL algorithm.
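
Filtered BC keeps only the successful trajectories and then fine-tunes on them with ordinary supervised learning. The data-preparation step can be sketched as below; the trajectory layout (`reward`, `steps`, `observation`, `thought`, `action`) and the target format are assumptions for this example.

```python
def filtered_bc_dataset(trajectories):
    """Turn reward-1 trajectories into supervised (input, target) pairs."""
    examples = []
    for traj in trajectories:
        # Filter step: discard every trajectory the evaluator marked as failed.
        if traj["reward"] != 1:
            continue
        for step in traj["steps"]:
            # The target includes the reasoning, so the thought is cloned with the action.
            target = f"Thought: {step['thought']}\nAction: {step['action']}"
            examples.append({"input": step["observation"], "target": target})
    return examples
```

The resulting pairs would then be fed to a standard fine-tuning loop, which is omitted here.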

1.5. Experiments

1.5.1. Datasets or Environments

1.5.2. Performances

screenshot_20250602_152337.png

1.6. TODO Source code Analysis

TODO.

2. WebArena: A Realistic Web Environment for Building Autonomous Agents

2.1. Website Selection

The authors analyze 200 users' browser histories and select 4 representative categories:

  • E-commerce (Amazon, eBay)
  • Social forum platforms (Reddit, StackExchange)
  • Collaborative development platforms (GitLab)
  • Content Management Systems (CMS)

    They then build simulated versions of these websites.

2.2. Observations

A browser view, with tabs and web pages.

2.3. Actions

"emulates the keyboard and mouse operations"

screenshot_20250602_163305.png

2.4. Intents

Intents are instantiated from a set of shared templates, yet each resulting task is difficult and unique.

screenshot_20250602_163355.png

2.5. Evaluation Metrics

  • exact_match
  • must_include
  • fuzzy_match
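
The three metrics can be sketched as follows. This is an illustration under assumptions, not WebArena's code: the case and whitespace normalization are choices made here, and `fuzzy_match` is modeled as a caller-supplied `judge` (for example, a language model), because semantic equivalence cannot be decided by string comparison alone.

```python
def exact_match(pred, ref):
    """True if the prediction equals the reference after trimming and lowercasing."""
    return pred.strip().lower() == ref.strip().lower()


def must_include(pred, required):
    """True if every required phrase occurs somewhere in the prediction."""
    lowered = pred.lower()
    return all(phrase.lower() in lowered for phrase in required)


def fuzzy_match(pred, ref, judge):
    """Delegate semantic equivalence to a judge callable returning a truthy value."""
    return bool(judge(pred, ref))
```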

3. WebVoyager: Building an End-to-End Web Agent with LMMs

This work is built on Selenium (https://www.selenium.dev/), a WebDriver-based framework for automating and testing websites in browsers.

3.1. Observations

Same as PAE.

screenshot_20250603_095039.png

3.2. Actions

Similar to WebArena: mouse and keyboard actions, expressed in a simplified command syntax such as "click [39]".
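
A small parser for this kind of action string might look like the sketch below. Only "click [39]" comes from the notes; the general shape `name [mark] [text]` and the other example commands are assumptions for illustration.

```python
import re

# Assumed syntax: an action name, an optional numeric mark, an optional text argument.
ACTION_RE = re.compile(
    r"^\s*(?P<name>[a-z_]+)\s*(?:\[(?P<mark>\d+)\])?\s*(?:\[(?P<text>[^\]]*)\])?\s*$",
    re.IGNORECASE,
)


def parse_action(command):
    """Parse a command such as 'click [39]' into its name, mark, and text."""
    match = ACTION_RE.match(command)
    if match is None:
        raise ValueError(f"unrecognized action: {command!r}")
    mark = match.group("mark")
    return {
        "name": match.group("name").lower(),
        "mark": int(mark) if mark is not None else None,
        "text": match.group("text"),
    }
```

For example, `parse_action("click [39]")` yields the name `click` and the mark `39`, which the environment can then resolve to an element on the page.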

screenshot_20250603_095715.png

3.3. Datasets

  • 15 Realistic Websites
  • Omit websites that require login or CAPTCHA
  • Dataset Construction: use an LLM to generate tasks, then filter them with human checks.
  • Evaluation: includes Golden Answers and Possible Answers.
    • Golden: the complete set of correct answers is known.
    • Possible: the answer may not be correct, or only a partial list of answers is known.

screenshot_20250603_095537.png

3.4. Results

screenshot_20250603_095757.png

4. Ragen: Not What We Need

screenshot_20250603_100439.png

Reasons:

  • Its trajectory-level approach is not what we need.
  • The paper still works in game environments.

5. Tools We May Use

5.1. Magentic-UI

WeChat introduction: WeChat Introduction.

GitHub repository: https://github.com/microsoft/magentic-ui/

  • No RL training.
  • Human-AI interaction

5.1.1. Roles

  • 🧑‍💼 Orchestrator is the lead agent, powered by a large language model (LLM), that performs co-planning with the user, decides when to ask the user for feedback, and delegates sub-tasks to the remaining agents to complete.
  • 🌐 WebSurfer is an LLM agent equipped with a web browser that it can control. Given a request by the Orchestrator, it can click, type, scroll, and visit pages in multiple rounds to complete the request from the Orchestrator. This agent is a significant improvement over the AutoGen MultimodalWebSurfer in terms of the actions it can do (tab management, select options, file upload, multimodal queries).
  • 💻 Coder is an LLM agent equipped with a Docker code-execution container. It can write and execute Python and shell commands and provide a response back to the Orchestrator.
  • 📁 FileSurfer is an LLM agent equipped with a Docker code-execution container and file-conversion tools from the MarkItDown package. It can locate files in the directory controlled by Magentic-UI, convert files to markdown, and answer questions about them.
  • 🧑 UserProxy is an agent that represents the user interacting with Magentic-UI. The Orchestrator can delegate work to the user instead of the other agents.

5.1.2. Workflow

screenshot_20250603_101530.png

5.2. Puppeteer

https://pptr.dev/

A JavaScript library that provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi. Puppeteer runs in headless mode (no visible UI) by default.


Author: Zi Liang (zi1415926.liang@connect.polyu.hk)
Create Date: Mon Jun 2 10:54:48 2025
Last modified: 2025-06-04 Wed 16:32
Creator: Emacs 30.1 (Org mode 9.7.11)