Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs

03/15/2022
by Taichi Iki, et al.

Pre-trained Transformers are good foundations for unified multi-task models owing to their task-agnostic representations. They are often combined with a text-to-text framework so that a single model can execute multiple tasks. Performing a task through a graphical user interface (GUI) is another way to accommodate various tasks, including multi-step tasks with vision and language inputs. However, few papers combine pre-trained Transformers with performing tasks through a GUI. To fill this gap, we explore a framework in which a model performs a task by manipulating, over multiple steps, a GUI implemented with web pages. We develop task pages with and without page transitions and propose a BERT extension for the framework. We jointly trained our BERT extension on those task pages and made the following observations. (1) The model learned to use task pages both with and without page transitions. (2) In four out of five tasks without page transitions, the model performs greater than 75% of the performance of the original BERT, which does not use browsers. (3) The model did not generalize effectively to unseen tasks. These results suggest that we can fine-tune BERTs for multi-step tasks through GUIs, and that there is room for improvement in their generalizability. Code will be available online.


