Perceiver: General Perception with Iterative Attention

03/04/2021
by   Andrew Jaegle, et al.

Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, and proprioception. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but they also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model uses an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, which is what allows it to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions, by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
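
The asymmetric attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention, omits layer normalization, MLP blocks, and position encodings, and reuses the same weights on every iteration. The names `attention`, `init_params`, and `perceiver_sketch`, along with all dimensions, are hypothetical and chosen only for illustration. The point it shows is that a small latent array of N rows queries a large input array of M rows, so the cross-attention score matrix has shape (N, M) rather than (M, M).

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (illustrative only)."""
    q = queries @ w_q                         # (n_q, d_attn)
    k = keys_values @ w_k                     # (n_kv, d_attn)
    v = keys_values @ w_v                     # (n_kv, d_latent)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ v       # (n_q, d_latent)


def init_params(rng, d_input, d_latent, d_attn):
    """Hypothetical random weights; a real model would learn these."""
    def w(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    return {
        # Cross-attention: queries come from latents, keys/values from inputs.
        "cross": (w(d_latent, d_attn), w(d_input, d_attn), w(d_input, d_latent)),
        # Self-attention: queries, keys, and values all come from latents.
        "self": (w(d_latent, d_attn), w(d_latent, d_attn), w(d_latent, d_latent)),
    }


def perceiver_sketch(inputs, latents, params, num_iterations=2):
    """Iteratively distill `inputs` (M rows) into `latents` (N rows, N << M)."""
    for _ in range(num_iterations):
        # Asymmetric step: score matrix is (N, M), linear in the input size M.
        latents = latents + attention(latents, inputs, *params["cross"])
        # Latent step: score matrix is (N, N), independent of M.
        latents = latents + attention(latents, latents, *params["self"])
    return latents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(1000, 3))    # M = 1000 input elements
    latents = rng.normal(size=(16, 8))     # N = 16 latent vectors
    params = init_params(rng, d_input=3, d_latent=8, d_attn=4)
    out = perceiver_sketch(inputs, latents, params)
    print(out.shape)
```

Because only the cross-attention step touches the inputs, its cost grows linearly with M, while the latent self-attention cost depends only on N. Under these assumptions, that is why the bottleneck lets the input size grow without a quadratic blow-up.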

research
06/30/2021

Attention Bottlenecks for Multimodal Fusion

Humans perceive the world by concurrently processing and fusing high-dim...
research
08/01/2023

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

In line with the human capacity to perceive the world by simultaneously ...
research
02/22/2022

Hierarchical Perceiver

General perception systems such as Perceivers can process arbitrary moda...
research
12/06/2021

Input-level Inductive Biases for 3D Reconstruction

Much of the recent progress in 3D vision has been driven by the developm...
research
11/25/2021

PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Can we train a single transformer model capable of processing multiple m...
research
02/09/2023

Hypernetworks build Implicit Neural Representations of Sounds

Implicit Neural Representations (INRs) are nowadays used to represent mu...
research
12/22/2022

Scalable Adaptive Computation for Iterative Generation

We present the Recurrent Interface Network (RIN), a neural net architect...
