Playing Lottery Tickets with Vision and Language

04/23/2021
by   Zhe Gan, et al.

Large-scale transformer-based pre-training has recently revolutionized vision-and-language (V+L) research. Models such as LXMERT, ViLBERT and UNITER have significantly lifted the state of the art over a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that, when trained in isolation, can achieve performance on par with or even better than that of the dense networks. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models. We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR^2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks (i.e., the tickets) that strictly match the performance of the full UNITER model. However, it is encouraging to confirm that we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Adversarial training can be further used to enhance the performance of the found lottery tickets.
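
To make the notion of a "ticket" concrete, the following is a minimal sketch of how a sparse subnetwork can be extracted by magnitude pruning and then reset to the pre-trained weights before being trained in isolation. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not specify the pruning method, and the function names `magnitude_mask` and `make_ticket`, the flat weight lists, and the toy values are all hypothetical.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of weights
    with the smallest absolute value (assumed pruning criterion)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by |w|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def make_ticket(pretrained_weights, mask):
    """Apply the mask to the pre-trained weights; the surviving weights
    form the subnetwork that would then be trained in isolation."""
    return [w * m for w, m in zip(pretrained_weights, mask)]


# Toy example: weights after training on some task, and the
# pre-trained weights they started from (made-up numbers).
trained = [0.5, -0.1, 0.05, -2.0, 0.3]
pretrained = [1.0, 2.0, 3.0, 4.0, 5.0]

mask = magnitude_mask(trained, sparsity=0.4)
ticket = make_ticket(pretrained, mask)
print(mask, ticket)
```

In a real model the mask would be computed per weight matrix or globally over all prunable parameters, and pruning is often repeated over several rounds with retraining in between; the sketch shows only a single round on a flat list.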

