Meta-Transformer: A Unified Framework for Multimodal Learning

07/20/2023
by Yiyuan Zhang et al.

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, and tabular data) because of the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features. Composed of three main components (a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks), Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks show that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU data), and data mining (graph, tabular, and time-series data). Meta-Transformer thus points to a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer
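The three-component design described above can be sketched in a few lines of NumPy. This is a hedged, toy illustration of the data flow only, not the paper's actual code: all names, shapes, and the one-layer "encoder" are illustrative assumptions. The point it shows is structural: per-modality tokenizers land in one shared token space, a single encoder with frozen parameters serves every modality, and only the small task-specific heads differ per task.

```python
import numpy as np

D = 16  # shared token dimension (illustrative)
rng = np.random.default_rng(0)

def tokenize_text(token_ids, vocab=100):
    """Modality tokenizer: map token ids to D-dim embeddings (toy lookup table)."""
    table = np.linspace(0.0, 1.0, vocab * D).reshape(vocab, D)
    return table[np.asarray(token_ids)]              # (seq_len, D)

def tokenize_image(image, patch=4):
    """Modality tokenizer: split an image into flat patches, project to D dims."""
    h, w = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    proj = np.ones((patch * patch, D)) / (patch * patch)  # fixed projection
    return patches @ proj                            # (num_patches, D)

# Modality-shared encoder: parameters are created once and never updated.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

def shared_encoder(tokens):
    hidden = np.maximum(tokens @ W_frozen, 0.0)      # one frozen layer + ReLU
    return hidden.mean(axis=0)                       # pooled feature, shape (D,)

def task_head(feature, W_head):
    """Task-specific head: the only component a downstream task would train."""
    return feature @ W_head                          # (num_classes,)

# Inputs from two different modalities land in the same token space ...
text_tokens  = tokenize_text([3, 14, 15, 92])
image_tokens = tokenize_image(np.arange(64.0).reshape(8, 8))
assert text_tokens.shape[1] == image_tokens.shape[1] == D

# ... so one frozen encoder serves both; only the heads would be trained.
W_cls = rng.standard_normal((D, 5))
text_logits  = task_head(shared_encoder(text_tokens), W_cls)
image_logits = task_head(shared_encoder(image_tokens), W_cls)
```

In the actual framework the frozen encoder is a pretrained transformer and the tokenizers are learned per modality, but the division of labor is the same: all modality-specific cost is pushed into the tokenizers and heads, while the expensive shared backbone stays fixed.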


Related research

- Unified Model for Image, Video, Audio and Language Tasks (07/30/2023): Large Language Models (LLMs) have made the ambitious quest for generalis...
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (04/22/2021): We present a framework for learning multimodal representations from unla...
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks (12/02/2021): Biological intelligence systems of animals perceive the world by integra...
- Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction (09/21/2023): Wireless communications at high-frequency bands with large antenna array...
- Ducho: A Unified Framework for the Extraction of Multimodal Features in Recommendation (06/29/2023): In multimodal-aware recommendation, the extraction of meaningful multimo...
- Multimodal LLMs for health grounded in individual-specific data (07/18/2023): Foundation large language models (LLMs) have shown an impressive ability...
- Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception (05/10/2023): We present Integrated Multimodal Perception (IMP), a simple and scalable...