CM3: A Causal Masked Multimodal Model of the Internet

01/19/2022
by Armen Aghajanyan, et al.

We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans, which are generated at the end of the string instead of at their original positions. The causal masking objective is a hybrid of the more common causal and masked language modeling objectives: it enables full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich, structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set a new state of the art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. With a single model, we can generate images unconditionally, generate images conditioned on text (like DALL-E), and caption images, all in a zero-shot setting.
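To make the data transformation concrete, the following is a minimal sketch of moving masked spans to the end of a token sequence, as described above. It is an illustration and not the authors' implementation: the function names `causal_mask` and `undo_causal_mask`, the `<mask:i>` sentinel format, and the convention that the caller supplies the span boundaries are all assumptions made for this example. How spans are sampled is not covered here.

```python
def causal_mask(tokens, spans):
    """Move each masked span to the end of the sequence.

    tokens: list of token strings.
    spans:  sorted, non-overlapping half-open (start, end) index pairs
            (hypothetical input format chosen for this sketch).
    Each span is replaced in place by a sentinel; the sentinel is then
    repeated at the end, followed by the span's original tokens.
    """
    body, tail = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<mask:{i}>"  # assumed sentinel format
        body.extend(tokens[prev:start])
        body.append(sentinel)
        tail.append(sentinel)
        tail.extend(tokens[start:end])
        prev = end
    body.extend(tokens[prev:])
    return body + tail


def undo_causal_mask(masked, num_spans):
    """Invert causal_mask, assuming no real token equals a sentinel."""
    if num_spans == 0:
        return list(masked)
    sentinels = {f"<mask:{i}>" for i in range(num_spans)}
    # The tail begins at the second occurrence of the first sentinel.
    first = masked.index("<mask:0>")
    tail_start = masked.index("<mask:0>", first + 1)
    body, tail = masked[:tail_start], masked[tail_start:]
    # Collect the tokens that follow each sentinel in the tail.
    fills, current = {}, None
    for tok in tail:
        if tok in sentinels:
            current = tok
            fills[current] = []
        else:
            fills[current].append(tok)
    # Splice each span back into its original position.
    restored = []
    for tok in body:
        if tok in fills:
            restored.extend(fills[tok])
        else:
            restored.append(tok)
    return restored


tokens = "the quick brown fox jumps over the lazy dog".split()
masked = causal_mask(tokens, [(1, 3)])
print(masked)
```

A model trained left to right on the transformed sequence sees the tokens on both sides of each sentinel before it has to produce the span's contents, which is where the bidirectional context for masked spans comes from. The inverse function shows that the transformation loses no information.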


