SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

11/10/2021
by Minyoung Kim, et al.

We tackle the cross-modal retrieval problem, where training is supervised only by the relevant multi-modal pairs in the data. Contrastive learning is the most popular approach for this task; however, its sampling complexity is quadratic in the number of training data points, and it makes the potentially wrong assumption that instances in different pairs are automatically irrelevant. To address these issues, we propose a novel loss function based on self-labeling of the unknown classes. Specifically, we predict class labels for the data instances in each modality and assign those labels to the corresponding instances in the other modality (i.e., we swap the pseudo-labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss, leading to linear sampling complexity. We also maintain queues that store the embeddings of the latest batches, over which cluster assignment and embedding learning are performed jointly in an online fashion. This removes the computational overhead of interleaving epochs that sweep the entire training data for offline clustering. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval; on all these tasks, our method achieves significant performance improvements over contrastive learning.
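To illustrate why swapping pseudo-labels yields a linear-complexity loss, the following is a minimal NumPy sketch of the idea. It is a hypothetical simplification, not the paper's implementation: embeddings, class prototypes, and the hard-argmax pseudo-labeling are all assumptions made for illustration (the actual method uses online clustering over queues of recent batches).

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def swapped_label_loss(emb_a, emb_b, prototypes):
    """Illustrative swapped-assignment loss (hypothetical simplification).

    emb_a, emb_b : (N, D) embeddings of the paired instances in the two
                   modalities; prototypes : (K, D) shared class prototypes.
    Each modality's embeddings are scored against the prototypes, and the
    pseudo-label predicted from one modality supervises the paired instance
    in the other modality. The result is a plain per-instance cross-entropy,
    so the cost is linear in N rather than quadratic as in contrastive loss.
    """
    logits_a = emb_a @ prototypes.T          # (N, K) class scores, modality A
    logits_b = emb_b @ prototypes.T          # (N, K) class scores, modality B
    labels_a = logits_a.argmax(axis=1)       # pseudo-labels from modality A
    labels_b = logits_b.argmax(axis=1)       # pseudo-labels from modality B
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    n = len(emb_a)
    # Swap: modality A is trained with B's labels, and vice versa.
    loss = -(np.log(p_a[np.arange(n), labels_b]).mean()
             + np.log(p_b[np.arange(n), labels_a]).mean()) / 2
    return loss
```

In a real training loop the pseudo-labels would come from an online clustering step over the embedding queues, and gradients would flow through the soft assignments rather than a hard argmax; this sketch only shows how the swapped labels turn pairwise supervision into per-instance cross-entropy.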

Related research

- Contrastive Learning of Visual-Semantic Embeddings (10/17/2021): Contrastive learning is a powerful technique to learn representations th...
- Mixer: Image to Multi-Modal Retrieval Learning for Industrial Application (05/06/2023): Cross-modal retrieval, where the query is an image and the doc is an ite...
- CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations (09/30/2021): Contrastive learning allows us to flexibly define powerful losses by con...
- ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval (03/31/2022): Visual appearance is considered to be the most important cue to understa...
- (Un)likelihood Training for Interpretable Embedding (07/01/2022): Cross-modal representation learning has become a new normal for bridging...
- Cross-Modal Data Programming Enables Rapid Medical Machine Learning (03/26/2019): Labeling training datasets has become a key barrier to building medical ...
- Practical Cross-modal Manifold Alignment for Grounded Language (09/01/2020): We propose a cross-modality manifold alignment procedure that leverages ...
