You Only Look at Screens: Multimodal Chain-of-Action Agents

09/20/2023
by Zhuosheng Zhang, et al.

Autonomous user interface (UI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-UI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique – leveraging a series of intermediate previous action histories and future action plans – to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-UI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-UI.
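The chain-of-action idea can be pictured as conditioning each prediction on both the history of already-executed actions and a short plan of upcoming steps, alongside the screen input and the user goal. Below is a minimal, illustrative Python sketch of how such a chain-of-action context might be assembled; the names (`Action`, `build_chain_of_action_prompt`) and the exact prompt layout are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the chain-of-action context described in the abstract.
# The screen itself would go to a vision encoder separately; here we only
# assemble the textual history/plan context the agent conditions on.
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    action_type: str          # e.g. "click", "type", "scroll"
    target: str = ""          # e.g. a text argument or a touch-point description

    def __str__(self) -> str:
        return f"{self.action_type} {self.target}".strip()


def build_chain_of_action_prompt(goal: str,
                                 previous_actions: List[Action],
                                 future_plan: List[str]) -> str:
    """Compose the textual part of the multimodal input from the goal,
    the previous action history, and the planned future actions."""
    history = "; ".join(str(a) for a in previous_actions) or "none"
    plan = "; ".join(future_plan) or "none"
    return (
        f"Goal: {goal}\n"
        f"Previous actions: {history}\n"
        f"Planned next steps: {plan}\n"
        f"Predict the next action:"
    )


# Example usage for a web-shopping style task
prompt = build_chain_of_action_prompt(
    goal="Search for wireless earbuds and open the first result",
    previous_actions=[Action("click", "search bar"), Action("type", "wireless earbuds")],
    future_plan=["press enter", "click first result"],
)
print(prompt)
```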

Related research

07/19/2023 – Android in the Wild: A Large-Scale Dataset for Android Device Control
There is a growing interest in device-control systems that can interpret...

11/17/2022 – Planning with Large Language Models via Corrective Re-prompting
Extracting the common sense knowledge present in Large Language Models (...

10/06/2020 – Keep CALM and Explore: Language Models for Action Generation in Text-based Games
Text-based games present a unique challenge for autonomous agents to ope...

03/30/2021 – Grounding Open-Domain Instructions to Automate Web Support Tasks
Grounding natural language instructions on the web to perform previously...

05/25/2023 – Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
The captivating realm of Minecraft has attracted substantial research in...

08/11/2023 – BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
The massive successes of large language models (LLMs) encourage the emer...

05/31/2023 – From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Much of the previous work towards digital agents for graphical user inte...
