Image-and-Language Understanding from Pixels Only

12/15/2022
by Michael Tschannen, et al.

Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
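The abstract describes the core mechanism concretely: one encoder consumes pixels, whether they come from a photograph or from text rasterized onto a blank canvas, and is trained with a CLIP-style symmetric contrastive loss over matched pairs. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: render_text, SharedEncoder, and the toy convolutional backbone are illustrative stand-ins for the paper's ViT encoder and its text-rendering pipeline.

```python
# Minimal sketch of the CLIPPO training idea (NOT the authors' code).
# Assumptions: a toy convolutional encoder stands in for the paper's ViT,
# and render_text is a crude PIL-based rasterizer.
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image, ImageDraw


def render_text(text: str, size: int = 64) -> torch.Tensor:
    """Rasterize a string onto a white canvas, so text becomes just pixels."""
    img = Image.new("RGB", (size, size), "white")
    ImageDraw.Draw(img).text((2, 2), text, fill="black")
    arr = torch.frombuffer(bytearray(img.tobytes()), dtype=torch.uint8)
    return arr.float().reshape(size, size, 3).permute(2, 0, 1) / 255.0


class SharedEncoder(nn.Module):
    """One tower for both modalities: pixels in, unit-norm embedding out."""

    def __init__(self, dim: int = 128, size: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=8),  # crude patch embedding
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * (size // 8) ** 2, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(z_img: torch.Tensor, z_txt: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (image, rendered-text) pairs are positives,
    all other pairings in the batch are negatives."""
    logits = z_img @ z_txt.t() / temperature
    labels = torch.arange(z_img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))


encoder = SharedEncoder()
images = torch.rand(4, 3, 64, 64)  # stand-in for a batch of real photos
texts = torch.stack([render_text(t) for t in ["a cat", "a dog", "a car", "a tree"]])
loss = contrastive_loss(encoder(images), encoder(texts))
loss.backward()
```

The single-tower design is also what makes the paper's VQA and multilingual results plausible: since the encoder only ever sees pixels, a question can be handled by rendering it directly onto the input image, and no tokenizer or language-specific embedding table exists anywhere in the model.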


Related research

11/19/2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
In this paper, we propose a single UniFied transfOrmer (UFO), which is c...

05/04/2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Exploring large-scale pretrained foundation models is of significant int...

01/18/2021
Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models
In this work, we explore joint energy-based model (EBM) training during ...

06/06/2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Large sparsely-activated models have obtained excellent performance in m...

03/29/2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
The development of language models has moved from encoder-decoder to de...

06/05/2022
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Multi-channel video-language retrieval requires models to understand info...

03/20/2023
Multimodal Shannon Game with Images
The Shannon game has long been used as a thought experiment in linguisti...
