VinVL: Making Visual Representations Matter in Vision-Language Models

01/02/2021
by   Pengchuan Zhang, et al.

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model <cit.>, the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. As a result, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model <cit.>, and use an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, setting new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.
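To make the data flow described in the abstract concrete, the following is a minimal sketch of how region features from an object detector might be combined with text embeddings and passed through one self-attention step. It is not the authors' implementation: the names `fuse`, `linear`, and `self_attention`, the toy dimensions, the random inputs, and the use of identity query/key/value projections are all assumptions made for illustration.

```python
import math
import random


def linear(x, w, b):
    """Apply y = x @ w + b to a single vector x (list of floats)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, to keep the sketch short.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        )
        out.append([sum(w * v[j] for w, v in zip(weights, seq))
                    for j in range(d)])
    return out


def fuse(text_emb, region_feats, w, b):
    """Project region features to the text width, then attend over both."""
    projected = [linear(r, w, b) for r in region_feats]
    return self_attention(text_emb + projected)


# Toy sizes (hypothetical): real detector features are far wider.
HIDDEN, REGION_DIM = 4, 6
rng = random.Random(0)
text_emb = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(3)]
region_feats = [[rng.uniform(-1, 1) for _ in range(REGION_DIM)] for _ in range(2)]
w = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(REGION_DIM)]
b = [0.0] * HIDDEN

fused = fuse(text_emb, region_feats, w, b)
print(len(fused), len(fused[0]))  # one fused vector per text token and region
```

In this sketch, every output position attends over both text tokens and image regions, so the quality of the region features directly shapes every fused vector, which is the abstract's central point about visual features.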


