
Cross-modal Embeddings for Video and Audio Retrieval

by Dídac Surís, et al.

The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given audio query. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.


1 Introduction

Videos have become the next frontier in artificial intelligence. Their rich semantics make them a challenging data type, posing difficulties at the perceptual, reasoning and even computational levels. Mimicking the learning process and knowledge extraction that humans develop from visual and audio perception remains an open research question, and videos contain all this information in a format manageable for science and research.

Videos are used in this work for two main reasons. Firstly, they naturally integrate both visual and audio data, providing a weak labeling of one modality with respect to the other. Secondly, the high volume of both visual and audio data allows training machine learning algorithms whose models are governed by a large number of parameters. The huge-scale video archives available online and the increasing number of video cameras that constantly monitor our world offer more data than the computational power available to process them.

The popularization of deep neural networks among the computer vision and audio communities has defined a common framework boosting multimodal research. Tasks like video sonorization, speaker impersonation or self-supervised feature learning have exploited the opportunities offered by artificial neurons to project images, text and audio into a feature space where bridges across modalities can be built.

This work exploits the relation between the visual and audio contents of a video clip to learn a joint embedding space with deep neural networks. Two multilayer perceptrons (MLPs), one for visual features and a second one for audio features, are trained to map their inputs into the same cross-modal representation. We adopt a self-supervised approach, as we exploit the unsupervised correspondence between the audio and visual tracks in any video clip.

We propose a joint audiovisual space to address a retrieval task in which the query can be formulated from either of the two modalities. As depicted in Figure 1, either a video or an audio clip can be used as a query to search for its matching pair in a large collection of videos. For example, an animated GIF could be sonorized by finding an adequate audio track, or an audio recording could be illustrated with a related video.

In this paper, we present a simple yet very effective model for retrieving documents with a fast and light search. We do not address an exact alignment between the two modalities, which would require a much higher computational effort.

The paper is structured as follows. Section 2 introduces the related work on learned audiovisual embeddings with neural networks. Section 3 presents the architecture of our model and Section 4 describes how it was trained. Experiments are reported in Section 5 and final conclusions are drawn in Section 6. The source code and trained model used in this paper are publicly available from

Figure 1: A learned cross-modal embedding allows retrieving video from audio, and vice versa.

2 Related work

In the past years, the relationship between the audio and the visual content in videos has been researched in several contexts. Overall, conventional approaches can be divided into four categories according to the task: generation, classification, matching and retrieval.

As online music streaming and video sharing websites have become increasingly popular, some research has been done on the relationship between music and album covers [1, 2, 3, 4] and also on music and videos (instead of just images) as the visual modality [5, 6, 7, 8] to explore the multimodal information present in both types of data.

A recent study [9] also explored the cross-modal relations between the two modalities, but using images of people talking and speech. It is done through Canonical Correlation Analysis (CCA) and cross-modal factor analysis. Also applying CCA, [10] uses visual and sound features and common subspace features for aiding clustering in image-audio datasets. In the work presented by [11], the key idea was to use greedy layer-wise training with Restricted Boltzmann Machines (RBMs) between vision and sound.

The present work is focused on using the information present in each modality to create a joint embedding space to perform cross-modal retrieval. This idea has been exploited especially using text and image joint embeddings [12, 13, 14], but also between other kinds of data, for example creating a visual-semantic embedding [15] or using synchronous data to learn discriminative representations shared across vision, sound and text [16].

However, joint representations between the images (frames) of a video and its audio have yet to be fully exploited. To the best of the authors' knowledge, [17] is the work that has explored this option the most. In that paper, the authors seek a joint embedding space, but use only music videos, to obtain the closest and farthest video given a query video, based on either image or audio alone.

The main idea of the current work is borrowed from [14], which is the baseline for understanding our approach. There, the authors create a joint embedding space for recipes and their images. They can then use it to retrieve recipes from any food image by looking for the recipe that has the closest embedding. Apart from the retrieval results, they also perform other experiments, such as studying the localized unit activations or doing arithmetic with the images.

3 Architecture

In this section we present the architecture of our joint embedding model, which is depicted in Figure 2.

As inputs, we have the vector of features representing the images of the video and the vector of features representing the audio. These features are already precomputed and provided in the YouTube-8M dataset [18]. In particular, we use the video-level features, which represent the whole video clip with two vectors: one for the audio and another one for the video. These feature representations are the result of an average pooling of the local audio features computed over windows of one second, and local visual features computed over frames sampled at 1 Hz.
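The average pooling mentioned above can be illustrated with a minimal sketch. The dataset already ships the pooled vectors, so this is only meant to show the operation; the function name `average_pool` and the tiny 4-dimensional vectors are assumptions made for illustration.

```python
def average_pool(frame_features):
    """Average a list of per-second feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame-level feature vector")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all feature vectors must have the same length")
    n = len(frame_features)
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Three hypothetical 4-dimensional local features, one per second of video.
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_level = average_pool(frames)
print(video_level)  # [2.0, 1.0, 2.0, 2.0]
```

The same function would apply to the audio windows, giving one vector per modality for each clip.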

The main objective of the system is to transform the two different features (image and audio, separately) into other features lying in a joint space. This means that, for the same video, the image features and the audio features will ideally be transformed to the same joint features, in the same space. We will call these new features embeddings, and will represent them with $\Phi_i$ for the image embeddings and $\Phi_a$ for the audio embeddings.

The idea of the joint space is to represent the concept of the video, not just the image or the audio, but a generalization of it. As a consequence, videos with similar concepts will have closer embeddings and videos with different concepts will have embeddings further apart in the joint space. For example, the representation of a tennis match video will be close to the one of a football match, but not to the one of a maths lesson.

Thus, we use a set of fully connected layers of different sizes, stacked one after the other, going from the original features to the embeddings. The audio and the image networks are completely separate. These fully connected layers perform a non-linear transformation on the input features, mapping them to the embeddings; the parameters of this non-linear mapping are learned during optimization.

After that, a classification is performed from each of the two embeddings, using a fully connected layer that maps them to the different classes with a sigmoid activation function. This step is discussed in more detail in Section 4.2.


Figure 2: Schematic of the used architecture.

Neither the number of hidden layers nor the number of neurons per layer is fixed, since we experimented with different configurations. Each hidden layer uses ReLU as its activation function, and all the weights in each layer are regularized using the L2 norm.
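A minimal pure-Python sketch of the two-branch architecture described in this section follows. It is not the authors' TensorFlow code: the layer sizes are toy values, the random initialization is arbitrary, and leaving the final (embedding) layer linear is an assumption, since the text only specifies ReLU for the hidden layers.

```python
import random


def relu(x):
    return [max(0.0, v) for v in x]


def dense(x, weights, bias):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_branch(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes, e.g. [8, 6, 4]."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def embed(x, layers):
    """Hidden layers use ReLU; the last layer is kept linear (an assumption)."""
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return dense(x, weights, bias)


def l2_penalty(layers):
    """Sum of squared weights, the quantity an L2 regularizer would penalize."""
    return sum(w * w for weights, _ in layers for row in weights for w in row)


rng = random.Random(0)
image_branch = init_branch([8, 6, 6, 4], rng)  # toy sizes, not the paper's
audio_branch = init_branch([5, 6, 6, 4], rng)  # separate weights per modality

image_features = [rng.random() for _ in range(8)]
audio_features = [rng.random() for _ in range(5)]
phi_i = embed(image_features, image_branch)
phi_a = embed(audio_features, audio_branch)
print(len(phi_i), len(phi_a))  # 4 4
```

The two branches take inputs of different sizes but end in the same output dimension, which is what allows the embeddings to be compared in a joint space.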

4 Training

In this section we present the losses used, together with their meaning and intuition.

4.1 Similarity Loss

The objective of this work is to get the two embeddings of the same video to be as close as possible (ideally, the same), while keeping embeddings from different videos as far apart as possible.

Formally, we are given a video $v$, represented by its visual and audio features $v = \{i, a\}$ ($i$ represents the image features and $a$ the audio features of $v$). The objective is to maximize the similarity between $\Phi_i$, the embedding obtained by transformations on $i$, and $\Phi_a$, the embedding obtained by transformations on $a$.

At the same time, however, we have to prevent embeddings from different videos from being “close” in the joint space. In other words, we want them to have low similarity. However, the objective is not to force them to be opposite to each other. Instead of forcing them to have a similarity equal to zero, we allow a margin of similarity small enough to force the embeddings to be clearly not in the same place in the joint space. We call this margin $\alpha$.

During training, both positive and negative pairs are used. Positive pairs are those for which the image embedding $\Phi_i$ and the audio embedding $\Phi_a$ correspond to the same video $v$; negative pairs are those for which they come from different videos. The proportion of negative samples is a training parameter (see Section 4.3).
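A minimal sketch of how such pairs might be drawn is given here, assuming (as the following paragraph describes) that negatives are restricted to pairs of videos with no common label. The function `sample_pair`, the `max_tries` fallback and the toy label sets are hypothetical and only serve the illustration.

```python
import random


def sample_pair(videos, p_negative, rng, max_tries=100):
    """Return (image_idx, audio_idx, is_positive).

    videos is a list of dicts with a "labels" set. With probability
    p_negative a negative pair is drawn: two different videos sharing no label.
    """
    i = rng.randrange(len(videos))
    if rng.random() >= p_negative:
        return i, i, True
    for _ in range(max_tries):
        j = rng.randrange(len(videos))
        if j != i and not (videos[i]["labels"] & videos[j]["labels"]):
            return i, j, False
    return i, i, True  # fall back to a positive pair if no valid negative is found


videos = [
    {"labels": {"tennis", "sport"}},
    {"labels": {"football", "sport"}},
    {"labels": {"maths", "lecture"}},
    {"labels": {"guitar", "music"}},
]
rng = random.Random(0)
pairs = [sample_pair(videos, p_negative=0.6, rng=rng) for _ in range(1000)]
negative_fraction = sum(1 for _, _, positive in pairs if not positive) / len(pairs)
print(round(negative_fraction, 2))
```

In this toy set, the tennis and football videos share the label "sport", so they are never paired as a negative example.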

For the negative pairs, we selected random pairs that did not have any common label, in order to help the network learn how to distinguish different videos in the embedding space. The notion of “similarity” or “closeness” is mathematically translated into a cosine similarity between the embeddings, the cosine similarity being defined as:

$$\text{similarity}(u, w) = \frac{u \cdot w}{\|u\| \, \|w\|}$$

for any pair of real-valued vectors $u$ and $w$.
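As a small illustration, the cosine similarity between two real-valued vectors can be computed as in the sketch below. The function name and the explicit error on zero vectors are choices made for this example, not details taken from the paper.

```python
import math


def cosine_similarity(u, w):
    """Dot product of u and w divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_u == 0.0 or norm_w == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_w)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The value depends only on the angle between the vectors, not on their lengths, which is why the second example yields 1.0.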

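From this reasoning we get to the first and most important loss:

$$L_{sim} = \sum_{P} \left(1 - \text{similarity}(\Phi_i, \Phi_a)\right) + \sum_{N} \max\left(0, \text{similarity}(\Phi_i, \Phi_a) - \alpha\right)$$

where $P$ denotes positive sampling, $N$ denotes negative sampling, $\Phi_i$ and $\Phi_a$ are the image and audio embeddings of a pair, and $\alpha$ is the similarity margin allowed for negative pairs.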
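The reasoning of this section can be sketched as a margin-based cosine loss: positive pairs are penalized by how far their similarity is from 1, and negative pairs only when their similarity exceeds the margin. This is a hedged reading of the text, not the authors' implementation, and the margin value 0.25 used in the example is purely illustrative.

```python
import math


def cosine_similarity(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w)


def similarity_loss(pairs, margin):
    """pairs is an iterable of (image_embedding, audio_embedding, is_positive)."""
    total = 0.0
    for phi_i, phi_a, is_positive in pairs:
        sim = cosine_similarity(phi_i, phi_a)
        if is_positive:
            total += 1.0 - sim               # pull matching embeddings together
        else:
            total += max(0.0, sim - margin)  # push apart only beyond the margin
    return total


pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # aligned positive pair: contributes 0
    ([1.0, 0.0], [0.0, 1.0], False),  # orthogonal negative pair: contributes 0
    ([1.0, 0.0], [2.0, 0.0], False),  # aligned negative pair: contributes 1 - 0.25
]
print(similarity_loss(pairs, margin=0.25))  # 0.75
```

Note that the orthogonal negative pair adds nothing to the loss: once a negative pair is below the margin, it is not pushed any further apart.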

4.2 Classification Regularization

Inspired by the work presented in [14], we provide additional information to our system by incorporating the video labels (classes) provided by the YouTube-8M dataset. This information is added as a regularization term that seeks to solve the high-level classification problem, both from the audio and from the video embeddings, sharing the weights between the two branches. The key idea here is to have the classification weights from the embeddings to the labels shared by the two modalities.

This loss is optimized together with the previously explained similarity loss, serving as a regularization term. Basically, the system learns to classify the audio and the images of a video (separately) into the classes or labels provided by the dataset. We limit its effect with a regularization parameter $\lambda$.

To incorporate this regularization into the joint embedding, we use a single fully connected layer, as shown in Figure 2. Formally, we obtain the label probabilities as $p_i = \text{softmax}(W \Phi_i)$ and $p_a = \text{softmax}(W \Phi_a)$, where $\Phi_i$ and $\Phi_a$ are the image and audio embeddings and $W$ represents the learned weights, which are shared between the two branches. The softmax activation is used in order to obtain probabilities at the output. The objective is to make $p_i$ as similar as possible to $c_i$, and $p_a$ as similar as possible to $c_a$, where $c_i$ and $c_a$ are the category labels for the video represented by the image features and the audio features, respectively. For positive pairs, $c_i$ and $c_a$ are the same.

The loss function used for the classification is the well-known cross-entropy loss:

$$H(p, c) = -\sum_{k} c_k \log p_k$$

Thus, the classification loss is:

$$L_{class} = H(p_i, c_i) + H(p_a, c_a)$$

Finally, the loss function to be optimized is:

$$L = L_{sim} + \lambda L_{class}$$

where $L_{sim}$ is the similarity loss of Section 4.1.
4.3 Parameters and Implementation Details

For our experiments we used the following parameters:

  • Batch size of .

  • We saw that starting with the regularization parameter $\lambda$ different from zero led to a bad embedding similarity because the classification accuracy was preferred. Thus, we began the training with $\lambda = 0$ and set it to a nonzero value at step number 10,000.

  • Margin .

  • Percentage of negative samples = 0.6.

  • 4 hidden layers in each network branch, the number of neurons per layer being, from features to embedding, 2000, 2000, 700, 700 in the image branch, and 450, 450, 200, 200 in the audio branch.

  • Dimensionality of the feature vector = 250.

  • We trained a single epoch.
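The settings that survive in the list above can be gathered in a small configuration sketch, together with the step-based schedule for the classification weight. The dictionary keys, the function `lambda_at` and the value `lam_final=0.1` are hypothetical; values that are not legible in the list (batch size, margin, final regularization weight) are deliberately left out.

```python
CONFIG = {
    "negative_fraction": 0.6,
    "image_hidden_units": [2000, 2000, 700, 700],
    "audio_hidden_units": [450, 450, 200, 200],
    "feature_dim": 250,
    "lambda_switch_step": 10_000,
    "epochs": 1,
}


def lambda_at(step, lam_final, switch_step=CONFIG["lambda_switch_step"]):
    """Classification weight: zero before switch_step, lam_final from then on."""
    return 0.0 if step < switch_step else lam_final


# lam_final=0.1 is an illustrative placeholder, not the value used in the paper.
for step in (0, 9_999, 10_000, 20_000):
    print(step, lambda_at(step, lam_final=0.1))
```

Keeping the weight at zero for the first steps lets the similarity loss shape the joint space before the classification term starts to influence it.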

The simulation was programmed using TensorFlow [19], taking as a baseline the code provided by the YouTube-8M challenge authors.

5 Results

5.1 Dataset

The experiments presented in this section were developed over a subset of 6,000 video clips from the YouTube-8M dataset [18]. This dataset does not contain the raw video files, but their representations as precomputed features, both from audio and video. Audio features were computed using the method explained in [20] over audio windows of 1 second, while visual features were computed over frames sampled at 1 Hz with the Inception model provided in TensorFlow [19].

The dataset provides video-level features, which represent the whole video using a single vector (one for audio and another for visual information) and thus do not maintain temporal information. It also provides frame-level features, which consist of a single vector representing each second of audio and a single vector representing each frame of the video, sampled at 1 frame per second.

The main goal of this dataset is to provide enough data to reach state of the art results in video classification. Nevertheless, such a huge dataset also permits approaching other tasks related to videos and cross-modal tasks, such as the one we approach in this paper. For this work, and as a baseline, we only use the video-level features.

5.2 Quantitative Performance Evaluation

We divide our results in two different categories: quantitative (numeric) results and qualitative results.

To obtain the quantitative results we use the Recall@K metric. We define Recall@K as the recall rate at top K over all the retrieval experiments, that is, the percentage of all the queries for which the corresponding video is retrieved in the top K. Hence, higher is better.
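The metric can be sketched as follows, assuming that query and candidate embeddings are stored in the same order, so that the correct match for query number n is candidate number n, and that candidates are ranked by cosine similarity. The toy 2-dimensional embeddings are invented for the example.

```python
import math


def cosine_similarity(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w)))


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for q_idx, q in enumerate(queries):
        ranked = sorted(
            range(len(candidates)),
            key=lambda j: cosine_similarity(q, candidates[j]),
            reverse=True,
        )
        if q_idx in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical audio (query) and image (candidate) embeddings of three videos.
audio_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]
image_embeddings = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]]

r1 = recall_at_k(audio_embeddings, image_embeddings, k=1)
r2 = recall_at_k(audio_embeddings, image_embeddings, k=2)
print(r1, r2)  # two of three queries hit at k=1, all three at k=2
```

In this toy example the third audio query ranks the first video's image slightly above its own, so it only counts as a hit once K is raised to 2.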

The experiments are performed with different dimensions of the feature vector. Table 1 shows the recall results from audio to video; in other words, given the audio embedding of a video, how often we retrieve the embedding corresponding to the images of that same video. Table 2 shows the recall from video to audio.

To have a reference, the random guess result for Recall@K would be $K / N$, where $N$ is the number of elements. The obtained results show a very clear correspondence between the embeddings coming from the audio features and the ones coming from the video features. It is also interesting to notice that the results from audio to video and from video to audio are very similar, because the system has been trained bidirectionally.

| Number of elements | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| 256 | 21.5% | 52.0% | 63.1% |
| 512 | 15.2% | 39.5% | 52.0% |
| 1024 | 9.8% | 30.4% | 39.6% |

Table 1: Evaluation of Recall from audio to video

| Number of elements | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| 256 | 22.3% | 51.7% | 64.4% |
| 512 | 14.7% | 38.0% | 51.5% |
| 1024 | 10.2% | 29.1% | 40.3% |

Table 2: Evaluation of Recall from video to audio

5.3 Qualitative Performance Evaluation

In addition to the objective results, we performed some insightful qualitative experiments. They consisted of generating the embeddings of both the audio and the video for a list of 6,000 different videos. Then, we randomly chose a video and, from its image embedding, retrieved the video with the closest audio embedding, and the other way around (from one video’s audio we retrieved the video with the closest image embedding). If the closest embedding corresponded to the same video, we took the second one in the ordered list.
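The retrieval step of this qualitative experiment can be sketched as a nearest-neighbour search that skips the query's own video. The function `retrieve_other`, the use of cosine similarity for ranking and the toy embeddings are assumptions made for illustration.

```python
import math


def cosine_similarity(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w)))


def retrieve_other(query_idx, query_embeddings, candidate_embeddings):
    """Index of the most similar candidate that is not the query's own video."""
    q = query_embeddings[query_idx]
    ranked = sorted(
        range(len(candidate_embeddings)),
        key=lambda j: cosine_similarity(q, candidate_embeddings[j]),
        reverse=True,
    )
    for j in ranked:
        if j != query_idx:  # skip the query's own video wherever it ranks
            return j
    raise ValueError("need at least two candidates")


# Hypothetical image (query) and audio (candidate) embeddings of three videos.
image_embeddings = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]]
audio_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]

for idx in range(len(image_embeddings)):
    print(idx, "->", retrieve_other(idx, image_embeddings, audio_embeddings))
# 0 -> 2
# 1 -> 2
# 2 -> 0
```

Swapping the two arguments gives the opposite direction, from an audio query to the closest image embedding.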

Figure 3: Qualitative results. On the left we show the results obtained when we gave a video as a query. On the right, the results are based on an audio as a query.

Figure 3 shows some experiments. On the left, we can see the results obtained when giving a video query and retrieving the closest audio; on the right, the input query is an audio clip. Examples depicting the real videos and audio are available online. The figure shows both the results when going from image to audio and when going from audio to image. Four different random examples are shown in each case. For each result and each query, we also show their YouTube-8M labels, for completeness.

The results show that, when starting from the image features of a video, the retrieved audio represents a very accurate fit for those images. Subjectively, there are non-negligible cases where the retrieved audio actually fits the video better than the original one, for example when the original video has some artificially introduced music, or when a background commentator explains the video in a foreign (unknown) language. This analysis can also be done the other way around, that is, with the audio colorization approach, providing images for a given audio.

6 Conclusions and Future Work

We presented an effective method to retrieve audio samples that fit a given (muted) video well. The qualitative results show that existing online videos, due to their variety, represent a very good source of audio for new videos, even when retrieving from only a small subset of this large amount of videos. Given the difficulty of creating new audio from scratch, we believe that a retrieval approach is the path to follow in order to give audio to videos.

The range of possibilities to extend the presented work is excitingly broad. The first idea would be to make use of the variety and information of the YouTube-8M dataset. The temporal information provided by the individual image and audio features is not used in the current work. The most promising future work involves using this temporal information to match audio and images, making use of the implicit synchronization between the audio and the images of a video, without needing any supervised control. Thus, the next step in our research is to introduce a recurrent neural network, which will allow us to create more accurate representations of the video and also to retrieve different audio samples for each image, creating a fully synchronized system.

Also, it would be very interesting to study the behavior of the system depending on the class of the input. Observing the dataset, it is clear that not all the classes have the same degree of correspondence between audio and image. For example, some videos have music that was added artificially afterwards and is not related at all to the images.

In short, we believe the YouTube-8M dataset allows for promising future research in the field of video sonorization and audio retrieval, because it has a huge number of samples and captures multi-modal information in a highly compact way.

7 Acknowledgements

This work was partially supported by the Spanish Ministry of Economy and Competitivity and the European Regional Development Fund (ERDF) under contract TEC2016-75976-R. Amanda Duarte was funded by the mobility grant of the Severo Ochoa Program at Barcelona Supercomputing Center (BSC-CNS).