Android in the Wild: A Large-Scale Dataset for Android Device Control

by Christopher Rawles, et al.

There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance. Moreover, instead of simple UI-element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at
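The gesture-based action space described above can be sketched as a small data structure. This is an illustrative sketch only: the class and field names below are assumptions for exposition, not the dataset's actual schema. It models a gesture as a press point and a release point in normalized screen coordinates, from which taps and scroll directions can be recovered.

```python
from dataclasses import dataclass


@dataclass
class GestureAction:
    """A precise gesture: press at (touch_x, touch_y), release at (lift_x, lift_y).

    Coordinates are normalized to [0, 1] so the same action transfers
    across devices with different screen resolutions.
    (Hypothetical schema for illustration.)
    """
    touch_x: float
    touch_y: float
    lift_x: float
    lift_y: float

    def is_tap(self, eps: float = 0.04) -> bool:
        # A tap is a gesture whose press and release points nearly coincide.
        return (abs(self.touch_x - self.lift_x) <= eps
                and abs(self.touch_y - self.lift_y) <= eps)

    def scroll_direction(self) -> str:
        # Classify a drag by its dominant axis of motion.
        dx = self.lift_x - self.touch_x
        dy = self.lift_y - self.touch_y
        if abs(dx) > abs(dy):
            return "right" if dx > 0 else "left"
        return "down" if dy > 0 else "up"


# A horizontal swipe across a carousel widget (right-to-left drag):
swipe = GestureAction(touch_x=0.8, touch_y=0.5, lift_x=0.2, lift_y=0.5)
```

Representing actions this way, rather than as references to UI elements, is what forces an agent to ground its decisions in the visual appearance of the screen.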




