Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment

12/04/2020
by Paul Pu Liang, et al.

The natural world is abundant with concepts expressed via visual, acoustic, tactile, and linguistic modalities. Much of the existing progress in multimodal learning, however, focuses on problems where the same set of modalities is present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose algorithms for cross-modal generalization: a learning paradigm to train a model that can (1) quickly perform new tasks in a target modality (i.e., meta-learning) and (2) do so while being trained on a different source modality. We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities? Our solution is based on meta-alignment, a novel method that aligns representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. We study this problem on three classification tasks: text to image, image to audio, and text to speech. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
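The recipe described above lends itself to a short sketch: train a separate encoder per modality, align the encoders into a shared embedding space using paired cross-modal data, then adapt in the target modality from only a handful of labels. The PyTorch code below is a minimal illustration under those assumptions only; the `Encoder` class, `alignment_loss`, dimensions, and the simple fine-tuned head are hypothetical stand-ins, not the authors' meta-alignment implementation.

```python
# Hedged sketch: contrastive cross-modal alignment followed by few-shot
# adaptation in the target modality. An illustrative reconstruction of
# the general idea, not the paper's code; all names, dimensions, and
# hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's features into a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(z_src, z_tgt, temperature=0.07):
    """InfoNCE-style loss: paired source/target embeddings (the diagonal
    of the similarity matrix) are pulled together, mismatches pushed apart."""
    logits = z_src @ z_tgt.t() / temperature
    targets = torch.arange(z_src.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Stage 1: align a source encoder (e.g., 300-d text features) and a
# target encoder (e.g., 512-d image features) on strongly paired data.
src_enc, tgt_enc = Encoder(300), Encoder(512)
opt = torch.optim.Adam([*src_enc.parameters(), *tgt_enc.parameters()], lr=1e-3)
src_x, tgt_x = torch.randn(32, 300), torch.randn(32, 512)  # toy paired batch
loss = alignment_loss(src_enc(src_x), tgt_enc(tgt_x))
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: few-shot adaptation in the target modality. With only 1-10
# labeled samples per class, fit a lightweight classifier head on top
# of the frozen, aligned target encoder.
head = nn.Linear(128, 5)                                   # 5-way toy task
head_opt = torch.optim.Adam(head.parameters(), lr=1e-2)
support_x, support_y = torch.randn(5, 512), torch.arange(5)  # 1-shot, 5-way
for _ in range(50):
    with torch.no_grad():
        feats = tgt_enc(support_x)       # reuse aligned representations
    head_opt.zero_grad()
    F.cross_entropy(head(feats), support_y).backward()
    head_opt.step()
```

In the paper's full setup, the adaptation step would itself be meta-learned over many source-modality tasks, so that a few gradient steps suffice on a new target-modality task; the plain fine-tuning loop above stands in for that inner loop.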


