Multimodal Token Fusion for Vision Transformers

04/19/2022
by Yikai Wang, et al.

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources such as images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the intra-modal attentive weights may also be diluted, which could in turn undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments conducted on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images.
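The token-substitution idea described above can be sketched compactly. The following is a minimal, hypothetical PyTorch illustration of one fusion step for two aligned modalities: a small scoring network rates how informative each token is, tokens scoring below a threshold are replaced by a linear projection of the other modality's token at the same position, and the original positional embeddings are re-added so spatial alignment is preserved. The module names, scoring MLP, and threshold are illustrative assumptions, not the authors' implementation; in the paper the detection and substitution happen inside the transformer layers, whereas here the step is shown standalone for clarity.

```python
import torch
import torch.nn as nn


class TokenFusionStepSketch(nn.Module):
    """Sketch of TokenFusion's token substitution (illustrative, not the official code).

    Assumes two modalities with the same token count and embedding dimension,
    a per-token informativeness score in [0, 1], and a fixed threshold below
    which a token is considered uninformative and replaced.
    """

    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.threshold = threshold
        # One informativeness score per token, per modality (hypothetical scoring head).
        self.score_a = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid())
        self.score_b = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1), nn.Sigmoid())
        # Cross-modal projections applied to the tokens that are swapped in.
        self.proj_b_to_a = nn.Linear(dim, dim)
        self.proj_a_to_b = nn.Linear(dim, dim)

    def forward(self, tok_a, tok_b, pos_a, pos_b):
        # tok_*: (batch, tokens, dim) token embeddings of each modality
        # pos_*: (tokens, dim) positional embeddings, re-added after fusion
        score_a = self.score_a(tok_a)          # (B, N, 1)
        score_b = self.score_b(tok_b)
        mask_a = score_a < self.threshold      # True where modality A's token is uninformative
        mask_b = score_b < self.threshold
        # Substitute uninformative tokens with projected tokens of the other
        # modality at the same (aligned) positions.
        fused_a = torch.where(mask_a, self.proj_b_to_a(tok_b), tok_a)
        fused_b = torch.where(mask_b, self.proj_a_to_b(tok_a), tok_b)
        # Residual positional alignment: re-attach the original positional
        # embeddings so substituted tokens keep their spatial identity.
        return fused_a + pos_a, fused_b + pos_b


# Usage sketch: two modalities (e.g., RGB and depth) with 196 tokens of dim 256.
if __name__ == "__main__":
    fusion = TokenFusionStepSketch(dim=256)
    rgb = torch.randn(2, 196, 256)
    depth = torch.randn(2, 196, 256)
    pos_rgb = torch.randn(196, 256)
    pos_depth = torch.randn(196, 256)
    out_rgb, out_depth = fusion(rgb, depth, pos_rgb, pos_depth)
    print(out_rgb.shape, out_depth.shape)  # torch.Size([2, 196, 256]) twice
```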
