Cross-modal Embeddings for Video and Audio Retrieval

01/07/2018
by Dídac Surís, et al.

The growing volume of online video offers many opportunities for training self-supervised neural networks. Large-scale video datasets such as YouTube-8M make it possible to work with this volume of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, in which the features extracted from one modality are used to retrieve the most similar videos based on the features computed in the other modality.
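As a rough illustration of the evaluation described above, the sketch below computes Recall@K for cross-modal retrieval over a handful of toy embeddings. Several things here are assumptions rather than details from the abstract: cosine similarity as the ranking score, the convention that the audio and visual embeddings at the same index come from the same video, the helper names `cosine` and `recall_at_k`, and the made-up 2-D vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose paired target (same index) is in the top k.

    Assumes query_embs[i] and target_embs[i] come from the same video.
    """
    hits = 0
    for i, query in enumerate(query_embs):
        sims = [cosine(query, target) for target in target_embs]
        # Target indices sorted from most to least similar to the query.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy joint embeddings, invented for illustration only.
audio_embs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
visual_embs = [[0.9, 0.1], [0.7, 0.7], [0.1, 0.9]]

for k in (1, 2, 3):
    print(f"audio -> visual Recall@{k}: {recall_at_k(audio_embs, visual_embs, k):.3f}")
```

Swapping the two arguments gives the opposite retrieval direction (visual query, audio targets), which mirrors the two retrieval tasks the abstract mentions.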

