OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

07/01/2021
by   Jing Liu, et al.
0

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the three modalities, and two cross-modal decoders to generate text and image respectively. For the OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from three different data granularities, , token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. The pre-training task is carried out on a large amount of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.

READ FULL TEXT

page 3

page 8

research
12/31/2020

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Existed pre-training methods either focus on single-modal tasks or multi...
research
05/20/2021

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multi-modal pre-training approach...
research
07/20/2022

Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

Multiple modalities for certain information provide a variety of perspec...
research
02/07/2023

Revisiting Pre-training in Audio-Visual Learning

Pre-training technique has gained tremendous success in enhancing model ...
research
04/20/2022

Cross-stitched Multi-modal Encoders

In this paper, we propose a novel architecture for multi-modal speech an...
research
08/26/2023

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Cross-modal retrieval has become popular in recent years, particularly w...
research
04/26/2017

Deep Cross-Modal Audio-Visual Generation

Cross-modal audio-visual perception has been a long-lasting topic in psy...

Please sign up or login with your details

Forgot password? Click here to reset