Learning Shared Multimodal Embeddings with Unpaired Data

06/21/2018
by AJ Piergiovanni, et al.

In this paper, we propose a method to learn a joint multimodal embedding space. We compare the effect of various constraints using paired text and video data. Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.
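To make the abstract's idea concrete, below is a minimal, illustrative PyTorch sketch of the general recipe it describes: a paired-data objective that pulls matching text and video embeddings together, plus an adversarial objective on unpaired data that makes the two modalities indistinguishable in the shared space. This is not the authors' implementation; all architectures, feature dimensions, and loss choices (a cosine loss for pairs, a modality discriminator for unpaired alignment) are placeholder assumptions.

    # Illustrative sketch only: encoders map video/text features into a shared
    # space; paired data uses a cosine loss, unpaired data an adversarial loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMB_DIM = 256  # size of the shared embedding space (arbitrary choice)

    class VideoEncoder(nn.Module):
        def __init__(self, feat_dim=2048):  # e.g. precomputed video features
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, EMB_DIM))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    class TextEncoder(nn.Module):
        def __init__(self, feat_dim=300):  # e.g. pooled word embeddings
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, EMB_DIM))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    class ModalityDiscriminator(nn.Module):
        """Predicts whether an embedding came from video (1) or text (0)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, 1))
        def forward(self, z):
            return self.net(z)

    video_enc, text_enc, disc = VideoEncoder(), TextEncoder(), ModalityDiscriminator()
    enc_opt = torch.optim.Adam(list(video_enc.parameters()) +
                               list(text_enc.parameters()), lr=1e-4)
    disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

    def paired_loss(v_feat, t_feat):
        """Pull paired video/text embeddings together with a cosine loss."""
        zv, zt = video_enc(v_feat), text_enc(t_feat)
        return (1 - F.cosine_similarity(zv, zt)).mean()

    def adversarial_step(v_feat_unpaired, t_feat_unpaired):
        """Align unpaired embeddings: discriminator vs. encoders."""
        zv, zt = video_enc(v_feat_unpaired), text_enc(t_feat_unpaired)
        ones = torch.ones(zv.size(0), 1)
        zeros = torch.zeros(zt.size(0), 1)

        # 1) Train the discriminator to tell the two modalities apart.
        d_loss = (F.binary_cross_entropy_with_logits(disc(zv.detach()), ones) +
                  F.binary_cross_entropy_with_logits(disc(zt.detach()), zeros))
        disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

        # 2) Train the encoders to fool it (flipped labels), so the shared
        #    space no longer reveals which modality an embedding came from.
        g_loss = (F.binary_cross_entropy_with_logits(disc(video_enc(v_feat_unpaired)), zeros) +
                  F.binary_cross_entropy_with_logits(disc(text_enc(t_feat_unpaired)), ones))
        enc_opt.zero_grad(); g_loss.backward(); enc_opt.step()

A training loop would alternate paired_loss updates on paired batches with adversarial_step on unpaired batches; the relative weighting of the two objectives is another hyperparameter not specified here.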

