Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA

08/10/2019 ∙ by Donghuo Zeng, et al. ∙ 0

Deep learning has shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities, such as audio and video, should be taken into account. Retrieving music videos with a given piece of musical audio is a natural way to search and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, audio of an arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between them. This also preserves the temporal structure and the similarity between audio and visual contents from different videos with the same class label. The contribution of our approach is mainly manifested in two aspects: i) We propose to select the top k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which yields a good audio summary with local properties. ii) We propose an end-to-end deep model for cross-modal audio-visual learning where S-DCCA is trained to learn the semantic correlation between audio and visual modalities. Due to the lack of a music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that our proposed model can be applied to music video retrieval.




I Introduction

Deep cross-modal learning is an important research topic in multimedia and computer vision, with the goal of learning joint representations between different data modalities such as image-text [21, 24] and audio-lyrics [25]. In cross-modal music video retrieval, using a music audio segment to retrieve visual content is a natural way to find an interesting music video, which facilitates and improves people's music experiences. Imagine the following scenario: a user is sitting in a bar when a song attracts his attention. He instantly records the song with his cellphone and, with the recording as a query, finds semantically similar music videos, as shown in Fig. 1. Correlation learning between visual and audio sequences is non-trivial, and little work has addressed this task, where temporal structures of different modalities should be considered.

Fig. 1: Overview of music video retrieval: Select one or more representative audio chunks as query to find similar music video, based on content similarity.

The large volume of music videos that have emerged on the Internet provides a good opportunity to learn the correlation between visual and audio temporal sequences. A music video contains visual and audio modalities, which are embedded in musical temporal sequences to express the music theme and story. Moreover, as a special form of expression, a music video also conveys strong feelings and emotions, which are semantically contained in the audio and visual modalities. That is to say, music emotion is delivered by both the audio and visual modalities of a music video. This motivates us to learn a joint embedding space where music audio and visual contents are assumed to share the same semantic meaning.

In this work, we study how to use audio to retrieve music video under a realistic situation: with a segment of music audio that has a variable length as a query, the system automatically finds the music video that is similar to this audio with regard to emotions. In other words, an audio with an arbitrary length can retrieve a longer or full-length music video. It is natural for users to search music video in this way. However, this is a challenging research issue because audio and video are different modalities that have different low-level features with different properties of temporal structures. To this end, we propose a novel audio-visual embedding algorithm by Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a joint feature space to bridge the gap across different modalities. This also preserves the similarity among audio and visual contents from different videos with the same class label and the temporal structure. In addition to selecting 10K music video data from the YouTube-8M dataset, most importantly, several contributions are made in this paper as follows:

i) To the best of our knowledge, this is the first work that studies how to retrieve a full video by an audio having a variable length.

ii) We propose to select k representative audio chunks based on emotion features extracted by a Long Short-Term Memory (LSTM)-based attention model; these chunks serve as an audio summary while preserving the temporal structure.

iii) We propose an end-to-end deep architecture for cross-modal audio-visual embedding where S-DCCA is trained to learn the semantic correlation between audio and visual modalities.

iv) Evaluation demonstrates that our deep model has competitive performance compared with state-of-the-art approaches.

The rest of this paper is structured as follows. Section II introduces related work on deep cross-modal embedding and multimedia retrieval. Section III presents the architecture of our model and Section IV reports the experimental results. Finally, Section V draws the conclusion and points out future work.

II Related Work

Cross-modal music retrieval has focused intensively on music and visual modalities [22, 2, 14, 4, 18, 7]. In [14], the similarity between audio features extracted from songs and image features extracted from album covers is learned with the Java SOMToolbox framework. Based on this similarity, people can easily manage a music collection and use album covers as visual content to search for a song in a music dataset. Based on multimodal mixture models, a statistical method is applied to jointly model music, images, and text [4] to facilitate multimodal music retrieval. In [18], sensor data streams are mapped to a geo-feature, and a visual feature is calculated from the video content. With the trained model, mood tags associated with visual-aware likelihoods are generated. Then, the likelihoods of the mood tags associated with location information and video content are combined by late fusion. Mood tags with large likelihoods are regarded as the scene moods of the video. Finally, the songs matching the user's listening history are extracted as personalized recommendations. To learn the semantic correlation between music and video, a method based on feature selection and statistical novelty with kernel methods [7] is suggested to segment music songs. Co-occurring changes in the audio and video of music videos can then be found, and the resulting correlations can be applied to cross-modal audio-visual music retrieval.

The key idea of cross-modal correlation learning is to learn a joint space where different modalities can be correlated semantically. In particular, recent progress mainly focuses on cross-modal learning between text and image, such as [11, 26]. Most existing deep architectures with two sub-networks exploit a pre-trained convolutional neural network (CNN) [19] as the image branch [23] and use a pre-trained text-level embedding model [12] or hand-crafted feature extraction such as bag of words [11] as the text branch. The image and text features are then projected to the shared space in a feed-forward way to compute a ranking loss. Image-text benchmarks such as [13, 16] are used to evaluate the performance of cross-modal matching and retrieval.

Existing deep cross-modal retrieval methods have two properties: i) little work related to cross-modal correlation learning takes into account temporal structure of different modal data. ii) Pre-trained models are directly used to extract image or text features. Distinguished from existing deep cross-modal retrieval architectures, this work takes into account temporal structures to learn the correlation between audio and video for enabling cross-modal music video retrieval, where sequential audio and visual contents are projected to the same canonical space. An end-to-end neural network architecture with two-branch sequential structures for audio and video is investigated. Most importantly, we propose a novel method that extracts representative chunks from audio, which is able to summarize audios with different lengths. In addition, we propose a supervised deep CCA method to learn their semantic correlation.

III Architecture

Ideally, continuous audio segments (called chunks in this paper) that are short enough share the same musical property, such as an emotion attribute. This motivates us to divide a long audio sequence into chunks of equal length. Then, the emotion information of each chunk is computed, and the chunks with the highest attention intensity are used to represent the whole audio sequence. Through the cross-modal correlation between the best audio chunks and visual features, the most similar videos can be found.

III-A Neural Attention Modeling

The main part of the attention computation is realized by a Long Short-Term Memory (LSTM) network [9] with a bi-directional extension. An LSTM contains self-loops that keep the gradient flowing over long periods. The weights of the self-loops are conditioned on the context and change dynamically with the input sequence through four components of the LSTM structure:
1) The input gate $i_t$ decides which values will be updated. It depends on the current input $x_t$ and the previous hidden state $h_{t-1}$, and is calculated as follows:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

2) The forget gate $f_t$ decides what information should be discarded from the cell state, and is computed as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

3) The cell state $C_t$ is updated from the old state $C_{t-1}$ as follows:

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C), \quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

4) The output gate $o_t$ and the new hidden state $h_t$ are obtained as follows:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)$$

where $x_t$ represents the current input, $h_{t-1}$ denotes the previous hidden state, $\sigma$ is the logistic sigmoid, $\odot$ is the element-wise product, and $W$ and $b$ are the weight matrices and bias vectors, respectively.
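The four gate computations can be sketched in NumPy as a single time step. This is a minimal illustration of the standard LSTM cell from [9], not the authors' implementation; the function name `lstm_step`, the dictionary layout of `W` and `b`, and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps a gate name ("i", "f", "c", "o") to a (hidden, hidden + input)
    matrix and b maps it to a (hidden,) bias vector (hypothetical layout).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Toy run over a sequence of 5 random inputs (dimensions are illustrative).
rng = np.random.default_rng(0)
n_in, n_hid = 8, 4
W = {g: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for g in "ifco"}
b = {g: np.zeros(n_hid) for g in "ifco"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

A bi-directional layer would run the same step over the reversed sequence with a second set of weights and pair the two hidden states per time step.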

Fig. 2: (a) Main structure of the neural attention model, which takes a sequence of audio chunks as input, processes it with forward and backward LSTM models (shown as blue circles), and finally uses the output of the bi-directional LSTM to calculate the attention score and the attention distribution as a 72-dimensional vector. (b) An LSTM memory block, including three gates.

Fig. 3: Emotion learning model for evaluating the contribution of each chunk to emotions. When an original 216-second audio is divided into 3 chunks, the model calculates the contribution score of each chunk, which is used to select the top k chunks.

A standard LSTM processes the sequence in one direction only. In order to consider both past and future information, the bi-directional LSTM adds one more layer that processes the sequence in the opposite temporal order, as shown in Fig. 2. In our work, each audio is divided into 72 chunks of 3 seconds each. Then, the bi-directional LSTM model is applied to each chunk. In the attention model, the input of the bi-directional LSTM is the output of the global max-pooling layer, which is the first attention layer to compute the contribution scores of different audio chunks. The attention score $e_t$ of the t-th chunk can be computed as follows.

$$e_t = W_2^{\top} \tanh\left(W_1 [h_t^{f}; h_t^{b}]\right)$$

where $h_t^{f}$ and $h_t^{b}$ are the outputs of the forward and backward LSTM for the t-th chunk, and $W_1$ and $W_2$ are the weight parameters of the attention score function. Once the attention scores are obtained, the attention distribution $\alpha$ is calculated by a softmax function:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{72} \exp(e_j)}$$


We regard this architecture as an emotion learning model [10], which is trained on the MER31K dataset using emotion tags from AllMusic. The details of selecting audio segments with the emotion learning model are shown in Fig. 3. First, the emotion learning model is used to evaluate the contribution of each chunk to emotions. The contribution scores allow us to rank the chunks. Second, the top k chunks in the ranking are selected. For instance, the first audio in Fig. 3 is divided into 3 chunks and, according to the contribution scores, the third chunk is selected as the best one because it has the highest score within the audio.
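The ranking and selection step can be illustrated with a short sketch. It assumes the per-chunk attention scores are already available as a vector; the function names and the example scores are hypothetical, and returning the selected indices in temporal order is an assumption made here to reflect the stated goal of keeping the temporal structure.

```python
import numpy as np

def attention_distribution(scores):
    """Softmax over per-chunk attention scores (numerically stable)."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def select_top_k_chunks(scores, k):
    """Return the indices of the k highest-scoring chunks and the
    attention distribution. Indices are sorted back into temporal
    order (an assumption, not stated in the paper)."""
    alpha = attention_distribution(scores)
    top = np.argsort(-alpha, kind="stable")[:k]
    return np.sort(top), alpha

# An audio divided into 3 chunks, with made-up contribution scores.
scores = [0.2, 1.5, 3.0]
idx, alpha = select_top_k_chunks(scores, k=1)
print(idx)  # the third chunk (index 2) has the highest score
```

Because softmax is monotonic, ranking by the attention distribution is equivalent to ranking by the raw scores; the distribution is kept only because it is the quantity the model outputs.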

Fig. 4: Audio-visual embedding architecture through S-DCCA. (left) During the training process, the model learns the correlation between audio and visual content. (right) Using audio chunks as input to retrieve music videos.

III-B Supervised Deep Canonical Correlation Analysis and Distance Similarity

CCA [3] is a classical approach for correlation analysis between two or more modalities. Its core idea is to learn projection matrices that map features of different modalities into the same space, where the correlation between similar items of different modalities is maximized.

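Denote $x \in \mathbb{R}^{d_x}$ as an audio feature and $y \in \mathbb{R}^{d_y}$ as a visual feature, and denote $W_x$ and $W_y$ as matrices that linearly map $x$ and $y$ to the same space. Then $W_x$ and $W_y$ are found by maximizing the correlation between $W_x^{\top} X$ and $W_y^{\top} Y$, as follows:

$$(W_x^{*}, W_y^{*}) = \arg\max_{W_x, W_y} \operatorname{corr}(W_x^{\top} X, W_y^{\top} Y) = \arg\max_{W_x, W_y} \operatorname{tr}(W_x^{\top} \Sigma_{xy} W_y) \quad \text{s.t.} \quad W_x^{\top} \Sigma_{xx} W_x = W_y^{\top} \Sigma_{yy} W_y = I$$

where $X$ and $Y$ collect the audio and visual features of all training pairs, $\Sigma_{xx}$ and $\Sigma_{yy}$ represent the covariance matrices of X and Y, respectively, and $\Sigma_{xy}$ is their cross-covariance matrix.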
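For reference, linear CCA has a closed-form solution through whitening and a singular value decomposition. The sketch below is a generic NumPy version under stated assumptions: samples are stored as rows, a small ridge term `reg` is added to the covariances for numerical stability, and the names `linear_cca` and `inv_sqrt` are hypothetical. It is not the implementation used in the paper.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def linear_cca(X, Y, n_components, reg=1e-4):
    """X: (n, dx) and Y: (n, dy), rows are paired samples.
    Returns projection matrices Wx, Wy and the canonical correlations."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]

# Toy data: two views generated from a shared 2-D latent variable.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
Wx, Wy, corrs = linear_cca(X, Y, n_components=2)
```

In a deep variant, `X` and `Y` would be replaced by the outputs of the two network branches, and the sum of the canonical correlations would serve as the training objective.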

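DCCA extends CCA by realizing non-linear projections with deep neural networks (DNNs). Assume the outputs of layer $l$ in the two branches are $h_x^{l}$ and $h_y^{l}$ (with $h_x^{0} = x$ and $h_y^{0} = y$), and $W_x^{l}, W_y^{l}$ and $b_x^{l}, b_y^{l}$ are the weights and biases of the layers. Then the layer outputs are $h_x^{l} = s(W_x^{l} h_x^{l-1} + b_x^{l})$ and $h_y^{l} = s(W_y^{l} h_y^{l-1} + b_y^{l})$ at the two branches, where $s: \mathbb{R} \to \mathbb{R}$ is a nonlinear function applied element-wise. The outputs of the final ($d$-th) layer are $f_x(x) = h_x^{d}$ and $f_y(y) = h_y^{d}$. Let $\theta_x$ represent the parameters $W_x^{l}$, $b_x^{l}$, $l = 1, \ldots, d$, and $\theta_y$ represent the parameters $W_y^{l}$, $b_y^{l}$, $l = 1, \ldots, d$. They are optimized by

$$(\theta_x^{*}, \theta_y^{*}) = \arg\max_{\theta_x, \theta_y} \operatorname{corr}\left(f_x(X; \theta_x), f_y(Y; \theta_y)\right)$$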


Supervised deep CCA does not merely consider one-to-one match between all pairs of audio-visual data and apply deep CCA to learn the correlation. In order to preserve the similarity among items with the same class label, audio and visual contents from different videos with the same class label are formed as new relevant pairs to increase the number of training samples.

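In the training process, the CCA objective function is maximized to obtain the linear projection weights $W_x$, $W_y$ and the non-linear functions $f_x$, $f_y$, as follows.

$$(W_x^{*}, W_y^{*}, \theta_x^{*}, \theta_y^{*}) = \arg\max_{W_x, W_y, \theta_x, \theta_y} \operatorname{corr}\left(W_x^{\top} f_x(X; \theta_x), W_y^{\top} f_y(Y; \theta_y)\right)$$

where the covariance matrices $\Sigma_{xx}$, $\Sigma_{yy}$, and $\Sigma_{xy}$ are computed over the set $P$ of training pairs (with mean-centered network outputs) as

$$\Sigma_{xx} = \frac{1}{N} \sum_{(i,j) \in P} f_x(x_i) f_x(x_i)^{\top}, \quad \Sigma_{yy} = \frac{1}{N} \sum_{(i,j) \in P} f_y(y_j) f_y(y_j)^{\top}, \quad \Sigma_{xy} = \frac{1}{N} \sum_{(i,j) \in P} f_x(x_i) f_y(y_j)^{\top}$$

where $N = |P|$ is the number of all pairs. The value of $N$ determines the size of the training set. Different from DCCA, S-DCCA considers pairs between audio and visual contents from videos with the same class label, including those pairs formed from different videos, as reflected in the covariance definitions. Similar to DCCA, all parameters are optimized with the correlation objective. The left side of Fig. 4 shows the whole process.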
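The way same-label pairs enlarge the training set can be illustrated as follows. The sketch assumes that video i stores its audio and visual features at the same index i and that every audio is paired with every visual item sharing its label; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict
from itertools import product

def build_supervised_pairs(labels):
    """labels[i] is the class label of video i. Returns a list of
    (audio_index, visual_index) pairs covering every audio/visual
    combination that shares a label, including pairs formed from
    different videos and the original one-to-one pairs (i, i)."""
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    pairs = []
    for members in by_label.values():
        pairs.extend(product(members, members))
    return pairs

labels = ["calm", "happy", "calm", "happy", "calm"]
pairs = build_supervised_pairs(labels)
print(len(pairs))  # 3*3 + 2*2 = 13 pairs instead of 5 one-to-one pairs
```

The pair count grows with the square of each class size, which is why using all same-label combinations turns a few thousand one-to-one pairs into a much larger training set.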

III-C K-means Clustering

k-means clustering is a very popular unsupervised learning method for cluster analysis in data mining. k-means clustering enables n variables to be separated into k clusters based on the nearest mean, where k is usually pre-defined by users.

Given a set of variables $X = (x_1, x_2, \ldots, x_n)$, where each variable is a d-dimensional vector, the goal is to cluster them into k groups $S = \{S_1, S_2, \ldots, S_k\}$. A common method is to first randomly choose k values from $X$ as initial cluster centers, and then iteratively update the cluster centers after assigning each variable to its closest cluster, until the cluster centers no longer change. The objective function is defined as follows:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}$$

where $\mu_i$ is the mean of the points in $S_i$, i.e., its cluster center. In our experiments, we allocate 3 annotated audios to each of the 10 predefined categories (angry, tender, bitter, cheerful, fun, bright, happy, anxious, calm and warm) to compute the initial means $\mu_i$. We use the k-means method to cluster all audios into 10 semantic classes based on the emotion features.
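The seeded variant of k-means described here, where the initial centers come from a few annotated examples per category rather than random draws, can be sketched as below. This is a minimal NumPy version on toy 2-D data with two categories; the function name, the data, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def seeded_kmeans(X, init_centers, max_iter=100):
    """Lloyd's algorithm started from user-provided centers.
    X: (n, d) data, init_centers: (k, d). Returns labels and centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(assign == j):          # keep old center if cluster is empty
                new_centers[j] = X[assign == j].mean(0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

# Two toy "emotion" categories; 3 annotated points per category seed the means.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.3, size=(50, 2))
blob_b = rng.normal(5.0, 0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
seeds = [blob_a[:3].mean(0), blob_b[:3].mean(0)]
assign, centers = seeded_kmeans(X, seeds)
```

Seeding the centers this way ties each resulting cluster index to a named category, which random initialization would not do.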

III-D Matching and Ranking

It is not easy to recognize emotion in the visual modality, because the visual features of the dataset are high-level semantic features without clear emotion expressions such as facial expression changes or body movement. However, the high-level semantic information extracted by or trained with a complex deep network is able to represent the emotion attributes contained in music. Based on this background, we design an S-DCCA model to learn the correlation between audio and video, which enables us to use audio to retrieve video clips.

The audio-visual embedding maps audio chunks and visual features to a common space. This space links audio chunks and visual features in terms of emotion, and enables us to implement cross-modal music video retrieval based on emotion similarity. In cross-modal retrieval, given one or more audio chunks as the query, we calculate the similarity between the query audio chunks and each of the visual features in the database in the emotion-based embedding space. We use the cosine similarity between the embedded audio query $x$ and the embedded visual feature $y$ as the similarity metric, which is defined as follows.

$$\operatorname{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$


The details of our architecture, which consists of two branches (an audio branch and a visual branch), are shown in Fig. 4. First, the pre-trained VGG16 model is used to extract frame-level audio features and the pre-trained Inception model is used to extract frame-level visual features for all data in the dataset. Second, the frame-level visual features are aggregated into a video-level feature by max pooling. In the audio branch, we load the frame-level audio features into the pre-trained emotion learning model [10] to extract emotion features, based on which the top k chunks are selected for music video retrieval. The features are then fed into Sub-Net1 and Sub-Net2, respectively. Third, based on the extracted emotion features, we apply k-means to cluster the audios into 10 groups. Fourth, the video-level visual feature and the emotion features of the top k audio chunks are each fed into 4 fully connected layers, which generate compact features. Finally, the CCA components of these compact features are used to compute the similarity between videos and audio chunks.
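The final matching step can be sketched as a cosine-similarity ranking in the shared space. The sketch assumes the query chunks and the database videos have already been projected into that space; averaging several chunk embeddings into one query vector is an assumption, since the text does not say how multiple chunks are combined, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(q, V, eps=1e-12):
    """Cosine similarity between a query vector q (d,) and rows of V (m, d)."""
    q = q / (np.linalg.norm(q) + eps)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + eps)
    return V @ q

def rank_videos(query_chunks, video_embs):
    """query_chunks: (k, d) embedded audio chunks; video_embs: (m, d).
    Averages the chunk embeddings (assumed fusion rule) and returns the
    video indices sorted from most to least similar, plus the scores."""
    q = np.atleast_2d(query_chunks).mean(0)
    sims = cosine_similarity(q, video_embs)
    order = np.argsort(-sims, kind="stable")
    return order, sims

videos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[2.0, 0.1]])
order, sims = rank_videos(query, videos)
print(order)  # video 0 first, then video 2, then video 1
```

Because cosine similarity ignores vector length, the ranking depends only on the direction of the embeddings, not on their scale.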

IV Experiments

The performances of the proposed S-DCCA for cross-modal music video retrieval are evaluated in this section, with the studies on the influence of the number of chunks and cross-modal music video retrieval by audio.

IV-A Dataset and Evaluation Metric

IV-A1 Dataset

The second version of the YouTube-8M dataset [1] is a large-scale video dataset that includes more than 7 million videos with 4716 classes labeled by the annotation system. The dataset consists of three parts: a training set, a validation set, and a test set. In the training set, each class contains at least 100 training videos. Features of these videos are extracted by state-of-the-art pre-trained models and released for public use. Each video contains audio and visual modalities. Based on the visual information, videos are divided into 24 topics, such as sports, game, and arts&entertainment. In particular, the arts&entertainment topic contains the “music video” label, which allows us to construct a music dataset. A video that is included in our music video dataset (MV-10K) should satisfy two conditions:

  1. Each video should include the [music video] label, without other labels.

  2. The length of each video ranges from 213 to 219 seconds.

In order to keep enough information in each chunk, the number of chunks for each audio is set to 3, 6, or 9. We select videos whose length is around 216 seconds, because 216 is a common multiple of 3, 6, and 9. In our experiments, we construct 4 subsets of videos based on different video length spans; the details are shown in Table I.

YouTube-8M has already released frame-level and video-level features for both audio and visual information. The frame-level visual features are extracted by the public Inception model trained on ImageNet. Visual frames are sampled at one frame per second over the first 6 minutes of each video. After transfer learning and feature dimension reduction with PCA, the dimension of the frame-level visual feature is $T \times 1024$, where $T$ is the video length in seconds. The video-level visual feature is obtained by the DBoF approach [1]. The frame-level audio features are extracted by a VGG-like model, as described in [8], and their average is computed as the video-level audio feature.

IV-A2 Evaluation Metrics

In this paper, we choose recall, precision, and MAP as the main metrics for the quantitative evaluation of our method.

Precision and Recall [15] are a pair of metrics, which are related to the numbers of relevant documents and retrieved documents. In our experiments, precision is the fraction of retrieved music videos that are relevant to the audio query and recall is the fraction of the relevant music videos that are correctly retrieved.

Mean Average Precision (MAP) [6] over all audio queries is the mean of the average precision of each audio query. When a music audio is used as the query, the average precision (AP) over its ranked list of retrieved music videos is defined as

$$AP = \frac{1}{R} \sum_{j=1}^{n} p(j) \, rel(j)$$

where $R$ is the number of relevant music videos that belong to the same cluster as the query, $n$ is the number of retrieved music videos, $p(j)$ is the precision of the top $j$ music videos, and $rel(j)$ is a binary value that is 1 if the $j$-th music video belongs to the same cluster as the query, and 0 otherwise. The cluster of each audio-visual pair is only used during training. During testing, we assume that all music videos with the same cluster label as the query audio are relevant.
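The AP and MAP definitions translate directly into a few lines of NumPy. In this sketch each query is represented by the binary relevance of its ranked list; the function names are hypothetical, and the number of relevant items is taken from the list itself, which matches the definition when the whole database is ranked.

```python
import numpy as np

def average_precision(rel):
    """rel: binary relevance of a ranked list (1 = same cluster as the query)."""
    rel = np.asarray(rel, dtype=float)
    R = rel.sum()
    if R == 0:
        return 0.0
    # Precision of the top-j results for j = 1..n.
    prec_at_j = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((prec_at_j * rel).sum() / R)

def mean_average_precision(rel_lists):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in rel_lists]))

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 0.8333...
```

Only the positions of relevant videos contribute to the sum, so pushing a relevant video higher in the ranking always increases AP.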

Length (s)    Span (s)      Selected size
216±3         [213, 219]    10,000
216±6         [210, 222]    20,000
216±9         [207, 225]    30,000
216±12        [204, 228]    40,000
TABLE I: The Information of Music Dataset Selected

IV-B Experiment Setting

The frame-level video feature in YouTube-8M is computed at one frame per second. Following the pre-trained emotion learning model, we divide the 216-second frame-level audio feature into 72 chunks. The attention model is applied to each chunk to calculate its emotion contribution score, and the frames within each 3-second chunk share the same score. Finally, the result of max pooling is regarded as the emotion score of each chunk.

The following parameters are used in our experiments:

  • Network parameter. Both the audio and the visual branches have 4 hidden layers. The number of units per layer is 512, 512, 256, 256 in the visual branch, and 128, 128, 64, 64 in the audio branch. The number of CCA components is 30. We set the dropout probability to 0.2, apply the same activation function in each hidden layer, and use a separate function in the final layer.

  • Experiment parameter. Train batch size is 512 and test batch size is 64. The number of training epochs is 50.

  • We run the experiments with 5 fold cross-validation and get the average performance.

  • The learning rate of the optimizer is set to 0.001.
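The layer sizes listed above can be made concrete with a small forward-pass sketch. Several details are assumptions rather than facts from the paper: `tanh` stands in for the unspecified activation function, the input dimensions (1024 for visual, 128 for audio) follow the publicly released YouTube-8M features, dropout is applied after every hidden layer, and the weights are random rather than trained.

```python
import numpy as np

VISUAL_UNITS = [512, 512, 256, 256]   # hidden layer sizes, visual branch
AUDIO_UNITS = [128, 128, 64, 64]      # hidden layer sizes, audio branch

def init_branch(in_dim, units, rng):
    """Random (W, b) pairs for a stack of fully connected layers."""
    params, d = [], in_dim
    for u in units:
        W = rng.normal(scale=np.sqrt(1.0 / d), size=(d, u))
        params.append((W, np.zeros(u)))
        d = u
    return params

def forward(x, params, drop_p=0.0, rng=None):
    """Forward pass; tanh is an assumed activation, not the paper's."""
    h = x
    for W, b in params:
        h = np.tanh(h @ W + b)
        if drop_p > 0.0 and rng is not None:
            mask = rng.random(h.shape) >= drop_p
            h = h * mask / (1.0 - drop_p)   # inverted dropout (training only)
    return h

rng = np.random.default_rng(0)
visual_params = init_branch(1024, VISUAL_UNITS, rng)
audio_params = init_branch(128, AUDIO_UNITS, rng)
v_out = forward(rng.normal(size=(4, 1024)), visual_params)
a_out = forward(rng.normal(size=(4, 128)), audio_params)
print(v_out.shape, a_out.shape)  # (4, 256) (4, 64)
```

The two branches end in different widths (256 and 64), so a CCA projection onto a common number of components (30 in the setting above) is what makes the two outputs directly comparable.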

IV-C Baseline

Multi-view learning [27] is a machine learning technique that learns one function per view to model multiple views and jointly optimizes all functions to remove the cross-view gap.

CCA [20] finds the correlations between two multivariate sets of vectors by linear projections, and relies on singular value decomposition.

KCCA [5] is also a method to extract common features from two data sets. Instead of the linear correlation, KCCA tries to obtain a non-linear correlation through the kernel method. We use a Gaussian kernel with its parameter set to 0.4.

DCCA [3] learns nonlinear transformations of two data sets such that the outputs are highly correlated.

C-CCA [17] (Cluster-CCA) is a CCA variant. Different from standard CCA, the C-CCA algorithm clusters each data set into several groups or classes and tries to enhance the intra-cluster correlation.

Fig. 5: Precision-recall curve with the number of chunks set to 3, where “mean” denotes using the average of frame level audio feature as query, k (=1, 2) is the number of audio chunks selected as query.
Fig. 6: Precision-recall curve with the chunks=6, where “mean” denotes using the average of frame level audio feature, k(=1, 2, 3) is the number of audio chunks selected as query.
Fig. 7: Precision-recall curve with the chunks=9, where “mean” denotes using the average of frame level audio feature, k (=1, 2, 3) is the number of audio chunks selected as query.
Fig. 8: Precision-recall curve, achieved by changing the number of outputs, where k (=1, 2, 3) is the number of chunks selected as query from all chunks (c) of an audio; for example, k/c=1/3 denotes selecting 1 chunk from an audio that is divided into 3 chunks. “mean” denotes using the average of the whole audio as query.
Fig. 9: Mean average precision when using different numbers of audio chunks selected as query for video retrieval, where k denotes the number of chunks selected as query and c denotes the number of overall chunks that the audio is divided into.
k/chunks       1/3      2/6      3/9      mean
Multi-views    14.02    14.36    14.25    14.58
CCA            18.34    18.39    18.32    18.35
KCCA           17.54    17.04    17.49    17.80
DCCA           18.35    18.39    18.22    18.40
C-CCA          18.51    19.60    19.73    19.72
S-DCCA         21.38    21.43    21.24    21.76
TABLE II: The MAP results of different methods under different configurations.

IV-D Experiment Result and Analysis

Our experiments with S-DCCA use three different training sets to obtain three different models. The basic C-CCA and S-DCCA models are trained on 8000 one-to-one pairs. To enhance the intra-cluster correlation, we further consider the correlation between audio and visual contents from different videos of the same cluster to learn the relationship between the two modalities, which also lets us construct more audio-visual pairs for training. The C-CCA-extend1 and S-DCCA-extend1 models are trained on around 0.8 million pairs, and the C-CCA-extend2 and S-DCCA-extend2 models on around 1.5 million pairs. Here, the -extend1 models use 50% of all music videos of a cluster to form training pairs with each audio in the cluster, and the -extend2 models use 100% of the music videos in the same cluster.

We use the precision-recall curve to show how the results change as the number of outputs increases, so as to compare our S-DCCA model with the DCCA model and the S-DCCA-extend2 model. Our model tries to leverage the temporal structure inside the query audio: each query audio is divided into 3, 6, or 9 chunks, from which k chunks are selected as the actual query. In order to investigate the overall performance of our S-DCCA, we use MAP as the metric and compare S-DCCA with other CCA variants (DCCA, C-CCA, KCCA). We set the same embedding dimension for all methods, and the same hidden layer structure for DCCA, S-DCCA, S-DCCA-extend1, and S-DCCA-extend2. A retrieved video in the ranked list is counted as correct if it has the same category as the query, and as incorrect otherwise.

Figs. 5, 6, and 7 show the precision-recall curves, comparing the DCCA and S-DCCA-extend2 models. Each pair of precision and recall values is obtained by changing the number of music videos output. Generally, as the number of music videos output increases, the recall increases and the precision decreases. For the S-DCCA-extend2 model, the three figures show that precision starts at its highest value, decreases sharply before recall reaches 0.2, and then remains almost stable as recall increases to 1.0. The query and the model are the two main factors that control the trend of the curve. As for the query factor, when each audio is divided into 3 or 6 chunks, the precision-recall curves of the selected chunks and of the full-length audio are very close. But when each audio is divided into 9 chunks and 3 chunks are selected as query, the performance is better than in other configurations when the number of outputs is small. This suggests that these 3 chunks carry most of the emotion contribution and that this kind of information is helpful for cross-modal retrieval. As for the model factor, S-DCCA-extend2 is better than DCCA, which indicates that more videos in the output belong to the same cluster as the query with S-DCCA-extend2 than with DCCA.

We also investigate the influence of the number of overall chunks and the number of chunks selected. Fig. 8 shows that, with the same volume of audio information as query, the S-DCCA-extend2 model achieves the best performance when the audio is divided into 9 chunks and 3 chunks are selected as the query (precision ranges from 26.6% to 23.8%; recall ranges from 0.20 to 0.41).

In order to further study the influence of the number of overall chunks and the number of chunks selected as query, the MAP results of different models are compared in Table II and Fig. 9. As for the number of chunks selected, there is generally no big difference in MAP when the same model is used. When the same audio information is used as query, comparing the MAP results among different models shows that training which explicitly exploits the cluster information generally outperforms training without it. As a result, S-DCCA (and S-DCCA-extend1, S-DCCA-extend2) and C-CCA (and C-CCA-extend1, C-CCA-extend2) achieve higher MAP than Multi-views, CCA, KCCA, and DCCA. This indicates that correlation learning based on both cluster information and instance features is better than learning from instance features only. As the volume of training data increases within each of the two groups (group 1: C-CCA, C-CCA-extend1, C-CCA-extend2; group 2: S-DCCA, S-DCCA-extend1, S-DCCA-extend2), the MAP becomes higher. This suggests that considering all possible pairs between the two data sets within each label cluster gives better performance than one-to-one pairs, and that limited training data is not sufficient to learn the correlation between audio and visual features well in this case. Generally, using parts of the audio as queries achieves performance close to that obtained with full-length audio queries.

V Conclusions

We proposed a supervised deep CCA model to learn a semantic space where audio and visual data from music videos, which are in different modalities, are linked to learn the cross-modal correlation. Besides the pairwise similarity, the semantic similarity between audio and visual contents from different videos in the same cluster is also explicitly considered. An end-to-end deep architecture that represents an audio sequence by representative chunks is studied. The experimental evaluation on the MV-10K data selected from YouTube-8M demonstrates the effectiveness of the proposed deep audio-visual embedding algorithm in cross-modal music video retrieval. In future work, we will try to integrate more user preference information into our deep architecture for personalized cross-modal music video recommendation, and we will investigate the task of taking a short video as query to retrieve a longer or full-length audio.


This work was partially supported by JSPS KAKENHI Grant Number 16K16058. The first author would like to thank Francisco Raposo for discussing how to implement CCA.


  • [1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
  • [2] Esra Acar, Frank Hopfgartner, and Sahin Albayrak. Understanding affective content of music videos through learned representations. In International Conference on Multimedia Modeling, pages 303–314. Springer, 2014.
  • [3] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
  • [4] Eric Brochu, Nando De Freitas, and Kejie Bao. The sound of an album cover: Probabilistic multimedia and information retrieval. In Artificial Intelligence and Statistics (AISTATS), 2003.
  • [5] Nello Cristianini, John Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
  • [6] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In Proceedings of the 22nd ACM international conference on Multimedia, pages 7–16. ACM, 2014.
  • [7] Olivier Gillet, Slim Essid, and Gal Richard. On the correlation of automatic audio and visual segmentations of music videos. IEEE Transactions on Circuits and Systems for Video Technology, 17(3):347–355, 2007.
  • [8] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE, 2017.
  • [9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [10] Yu-Siang Huang, Szu-Yu Chou, and Yi-Hsuan Yang. Music thumbnailing via neural attention modeling of music emotion. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017, pages 347–350. IEEE, 2017.
  • [11] Qing-Yuan Jiang and Wu-Jun Li. Deep cross-modal hashing. CoRR, 2016.
  • [12] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.
  • [13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [14] Rudolf Mayer. Analysing the similarity of album art with self-organising maps. In

    International Workshop on Self-Organizing Maps

    , pages 357–366. Springer, 2011.
  • [15] David Martin Powers. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. 2011.
  • [16] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia, pages 251–260. ACM, 2010.
  • [17] Nikhil Rasiwasia, Dhruv Mahajan, Vijay Mahadevan, and Gaurav Aggarwal. Cluster canonical correlation analysis. In Artificial Intelligence and Statistics, pages 823–831, 2014.
  • [18] Rajiv Ratn Shah, Yi Yu, and Roger Zimmermann.

    Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings.

    In Proceedings of the 22nd ACM international conference on Multimedia, pages 607–616. ACM, 2014.
  • [19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [20] Bruce Thompson. Canonical correlation analysis. Encyclopedia of statistics in behavioral science, 2005.
  • [21] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pages 5005–5013, 2016.
  • [22] Yi Yu, Zhijie Shen, and Roger Zimmermann. Automatic music soundtrack generation for outdoor videos from contextual sensor information. In Proceedings of the 20th ACM international conference on Multimedia, pages 1377–1378. ACM, 2012.
  • [23] Yi Yu, Suhua Tang, Kiyoharu Aizawa, and Akiko Aizawa. Venuenet: Fine-grained venue discovery by deep correlation learning. In Multimedia (ISM), 2017 IEEE International Symposium on, pages 288–291. IEEE, 2017.
  • [24] Yi Yu, Suhua Tang, Kiyoharu Aizawa, and Akiko Aizawa. Category-based deep cca for fine-grained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems, pages 1–9, 2018.
  • [25] Yi Yu, Suhua Tang, Francisco Raposo, and Lei Chen. Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Transaction on Multimedia Computing Communication and Applications, 2017.
  • [26] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. Video captioning and retrieval models with semantic attention. arxiv preprint. arXiv preprint arXiv:1610.02947, 2, 2016.
  • [27] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.