Does Visual Pretraining Help End-to-End Reasoning?

07/17/2023
by Chen Sun et al.

We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and would confirm the feasibility of a neural network "generalist" that solves both visual recognition and reasoning tasks. We propose a simple and general self-supervised framework that "compresses" each video frame into a small set of tokens with a transformer network and reconstructs the remaining frames from this compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation of each image and must capture temporal dynamics and object permanence from the temporal context. We evaluate on two visual reasoning benchmarks, CATER and ACRE, and observe that pretraining is essential for achieving compositional generalization in end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.
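To make the pretraining objective concrete, below is a minimal PyTorch sketch of the compress-then-reconstruct idea described in the abstract. The module names (FrameCompressor, ContextReconstructor), the token count, the single-masked-frame scheme, and the MSE target on patch embeddings are all illustrative assumptions, not the authors' exact implementation; the abstract only specifies that each frame is compressed into a small set of tokens and that held-out frames are reconstructed from the compressed temporal context.

```python
# Minimal sketch of the compress-and-reconstruct pretraining objective.
# All names, sizes, and the masking scheme are illustrative assumptions.
import torch
import torch.nn as nn

class FrameCompressor(nn.Module):
    """Encodes one frame's patch embeddings together with K learnable query
    tokens; keeps only the K query outputs as the compressed frame tokens."""
    def __init__(self, dim=256, num_tokens=8, depth=4, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                 # patches: (B, P, dim)
        B = patches.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out = self.encoder(torch.cat([q, patches], dim=1))
        return out[:, : self.queries.size(0)]   # (B, K, dim)

class ContextReconstructor(nn.Module):
    """Predicts the patches of a masked frame from the compressed tokens
    of the visible (context) frames via cross-attention."""
    def __init__(self, dim=256, num_patches=64, depth=4, heads=8):
        super().__init__()
        self.mask_tokens = nn.Parameter(torch.randn(num_patches, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)

    def forward(self, context_tokens):          # (B, (T-1)*K, dim)
        B = context_tokens.size(0)
        m = self.mask_tokens.unsqueeze(0).expand(B, -1, -1)
        return self.decoder(m, context_tokens)  # (B, P, dim)

def pretrain_step(frames, compressor, reconstructor, masked_idx=2):
    """frames: (B, T, P, dim) patch embeddings for T frames.
    Compress every context frame, then reconstruct the held-out frame.
    (A real pipeline would regress pixel patches or tokenizer targets.)"""
    B, T, P, D = frames.shape
    context = [compressor(frames[:, t]) for t in range(T) if t != masked_idx]
    context = torch.cat(context, dim=1)          # (B, (T-1)*K, D)
    pred = reconstructor(context)                # (B, P, D)
    return nn.functional.mse_loss(pred, frames[:, masked_idx])

if __name__ == "__main__":
    # Toy check: 2 videos, 4 frames, 64 patches of dim 256 each.
    frames = torch.randn(2, 4, 64, 256)
    loss = pretrain_step(frames, FrameCompressor(), ContextReconstructor())
    print(loss.item())
```

Because the reconstructor only sees a handful of tokens per context frame, minimizing this loss pressures the compressor to encode object states compactly and the decoder to model temporal dynamics, which is the intuition the abstract gives for why the representation transfers to reasoning tasks.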



Related research

Self-supervised video pretraining yields strong image representations (10/12/2022)
Videos contain far more information than still images and hold the poten...

Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction (06/01/2022)
Self-supervised learning for computer vision has achieved tremendous pro...

VirTex: Learning Visual Representations from Textual Annotations (06/11/2020)
The de-facto approach to many vision tasks is to start from pretrained v...

Im-Promptu: In-Context Composition from Image Prompts (05/26/2023)
Large language models are few-shot learners that can solve diverse tasks...

Revisiting the Transferability of Supervised Pretraining: an MLP Perspective (12/01/2021)
The pretrain-finetune paradigm is a classical pipeline in visual learnin...

Linguistically Driven Graph Capsule Network for Visual Question Reasoning (03/23/2020)
Recently, studies of visual question answering have explored various arc...

Dynamic Inference with Neural Interpreters (10/12/2021)
Modern neural network architectures can leverage large amounts of data t...
