UIBert: Learning Generic Multimodal Representations for UI Understanding

07/29/2021
by Chongyang Bai, et al.

To improve the accessibility of smart devices and to simplify their usage, it is critical to build models that understand user interfaces (UIs) and assist users in completing their tasks. However, UI-specific characteristics pose unique challenges, such as how to effectively leverage multimodal UI features involving image, text, and structural metadata, and how to achieve good performance when high-quality labeled data is unavailable. To address these challenges, we introduce UIBert, a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data to learn generic feature representations for a UI and its components. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components are predictive of each other. We propose five pre-training tasks that exploit this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks, where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
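The abstract does not spell out the five pre-training tasks, but the self-alignment intuition (the image and text features of the same UI component should predict each other) can be illustrated with a small contrastive objective. The sketch below is an assumption-laden illustration, not UIBert's actual loss: the function name `self_alignment_loss`, the embedding tensors, and the InfoNCE-style formulation are hypothetical stand-ins written in PyTorch.

```python
# Minimal sketch of the self-alignment intuition: pull each UI component's
# image embedding toward its own text embedding, using the other components
# in the batch as negatives. This is an illustrative objective, not the
# paper's actual pre-training loss.
import torch
import torch.nn.functional as F

def self_alignment_loss(image_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over N components: row i of image_emb and
    row i of text_emb describe the same UI component."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarities
    targets = torch.arange(logits.size(0))           # diagonal = true pairs
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with random placeholder embeddings for 8 components, 128-dim each.
img = torch.randn(8, 128)
txt = torch.randn(8, 128)
print(self_alignment_loss(img, txt))
```

Under this reading, "predictive of each other" is operationalized as cross-modal retrieval within a UI: given a component's screenshot crop, its own text (and vice versa) should score higher than any other component's.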


Related research

12/22/2020 · ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
As mobile devices are becoming ubiquitous, regularly interacting with a ...

05/22/2023 · VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Large-scale image-text contrastive pre-training models, such as CLIP, ha...

04/11/2023 · MoMo: A shared encoder Model for text, image and multi-Modal representations
We propose a self-supervised shared encoder model that achieves strong r...

03/02/2021 · WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning...

03/31/2023 · Self-Supervised Multimodal Learning: A Survey
Multimodal learning, which aims to understand and analyze information fr...

08/18/2023 · Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
With the success of self-supervised learning, multimodal foundation mode...

03/15/2023 · GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model whic...
