A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

05/05/2023
by Andrea Burns, et al.

Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention while having lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
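The Prefix Global idea can be illustrated with a small mask-construction sketch: the first k tokens of the input (the structurally selected prefix) attend to and from the whole sequence, while the remaining tokens use only a local window, giving roughly O(kL + rL) nonzero attention entries instead of the O(L^2) of full attention. This is a minimal illustrative sketch, not the paper's implementation; the function name, prefix length, and local radius are assumptions for the example.

```python
import numpy as np

def prefix_global_mask(seq_len: int, prefix_len: int, local_radius: int) -> np.ndarray:
    """Build a boolean attention mask (illustrative sketch of a
    prefix-global pattern, not the paper's actual code).

    The first `prefix_len` tokens are "global": they attend to every
    position, and every position attends back to them. All remaining
    tokens attend only within a window of `local_radius` positions.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Global prefix tokens: full rows and full columns.
    mask[:prefix_len, :] = True
    mask[:, :prefix_len] = True
    # Non-prefix tokens: sliding local window around each position.
    for i in range(prefix_len, seq_len):
        lo = max(0, i - local_radius)
        hi = min(seq_len, i + local_radius + 1)
        mask[i, lo:hi] = True
    return mask

# Example: 10 tokens, 3 global prefix tokens, local radius 1.
m = prefix_global_mask(10, 3, 1)
```

With these toy settings, token 9 can see the prefix (positions 0-2) and its neighbors (8-10), but not a distant non-prefix token like position 5, which is what keeps the cost sub-quadratic.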


Related research

05/09/2023
WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset
Webpages have been a rich resource for language and vision-language task...

09/13/2016
Multimodal Attention for Neural Machine Translation
The attention mechanism is an important part of the neural machine trans...

12/23/2022
Do DALL-E and Flamingo Understand Each Other?
A major goal of multimodal research is to improve machine understanding ...

02/01/2022
WebFormer: The Web-page Transformer for Structure Information Extraction
Structure information extraction refers to the task of extracting struct...

01/19/2022
CM3: A Causal Masked Multimodal Model of the Internet
We introduce CM3, a family of causally masked generative models trained ...

02/13/2020
Sparse and Structured Visual Attention
Visual attention mechanisms are widely used in multimodal tasks, such as...

05/25/2022
Leveraging Locality in Abstractive Text Summarization
Despite the successes of neural attention models for natural language ge...
