Cross-modal Variational Auto-encoder for Content-based Micro-video Background Music Recommendation

07/15/2021 · Jing Yi et al., Wuhan University

In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of the two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimations such that the modality with a lower noise level is given more weight. Therefore, the micro-video latent variables contain less irrelevant information, which results in more robust model generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the established TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE by visualizing some recommendation results is also included.


1 Introduction

Corresponding author: Zhenzhong Chen, zzchen@whu.edu.cn

Nowadays, micro-videos have become an increasingly prevalent medium on the Web. Compared with texts or pictures, micro-videos contain rich visual contents and audio, as well as textual tags and descriptions, which allow people to more vividly record and share their daily lives. Matching micro-videos with suitable background music can help uploaders better convey their contents and emotions, and increase the click-through rate of their uploaded videos. However, manually selecting the background music is a painstaking task due to the voluminous and ever-growing pool of candidate music. Therefore, automatically recommending background music to videos has become a crucial task that has attracted considerable attention from researchers.

Existing cross-modal matching methods have mainly focused on the matching between textual and visual (image or video) modalities. For example, Xu et al. [59] proposed a joint video-language framework for text-based video retrieval evaluated on [6]. Semedo and Magalhães [39] proposed a new temporal constraint for image-text matching using the NUS-WIDE benchmark [10]. Zhang et al. [62] utilized inter-modal attention to discover semantic alignments between words and image patches for image-text retrieval, where the intra-modal attention is also exploited to learn semantic correlations of fragments for each modality. Meanwhile, metric-learning-based methods have been proposed to better implement cross-modal matching [7, 34, 51]. The continual success in such areas is facilitated by the establishment of Flickr [38], MS-COCO [32], ActivityNet-captions [27], MSR-VTT [58], etc., which are widely employed benchmarks for the evaluation of visual-textual retrieval models [49, 62, 9, 14]. He et al. [19] further proposed a new dataset for fine-grained cross-modal retrieval. However, there exists no publicly available dataset for matching videos with background music, which has become the main obstacle for research regarding the automatic recommendation of background music to videos.

It is very challenging to construct such a video background music recommendation dataset: for example, it is non-trivial to collect diverse music clips and then acquire, for each clip, the different videos that use it as background music. Taking these issues into consideration, we manage to establish a micro-video background music matching dataset, which we name TT-150k, from the popular micro-video sharing platform TikTok (https://www.tiktok.com/). Specifically, the candidate music clips are selected from the TikTok pop charts and from the background music adopted by high-quality micro-videos to ensure their quality. The established TT-150k dataset includes more than 3,000 music clips and about 150,000 micro-videos that use music clips from the candidate pool. In our establishment, we ensure that the popularity of the music clips is proportional to their true distribution on TikTok, and therefore the dataset can faithfully reflect the music popularity distribution in the real-world scenario.

With the TT-150k dataset established, we aim to design an effective and accurate algorithm for automatically matching micro-videos with background music. Previous studies on video-music matching [35, 29] mainly rely on manually annotated emotion tags to match music and videos in the affective space. However, such manual annotation is laborious and time-consuming for large datasets. Therefore, it is imperative to seek methods that match music to videos without manually-labeled information. In addition, the semantic structure of micro-videos can differ drastically from that of the music: a silent micro-video is a hodgepodge of visual and textual information, while the music contains only audio information. Therefore, it is challenging to match the semantic-rich video latent space with the monotonous music latent space. To address the above challenges, we propose a Cross-Modal Variational Auto-Encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to micro-videos by projecting them into a shared low-dimensional latent space. Specifically, the latent space is constrained by cross-generation such that the embeddings of a matched video-music pair are closer than those of an unmatched pair. Through this, a more accurate similarity measurement between micro-videos and music can be obtained and utilized for the recommendation. Meanwhile, to fully utilize the heterogeneous visual and textual information of the micro-video for matching, the bimodal information in micro-videos is comprehensively fused according to the product-of-experts (PoE) principle. In this way, the information in each modality is weighted by the reciprocal of its variance estimation to give the modality with a lower noise level more weight, such that irrelevant information in the micro-video embeddings can be reduced. Besides, the PoE fusion is shown to be robust to the textual modality missing problem, which is commonly encountered when performing micro-video analysis. The main contributions of this paper can be summarized as follows:


  • We propose a novel hierarchical Bayesian generative model, CMVAE, for content-based micro-video background music recommendation. CMVAE matches relevant background music to a micro-video by constraining their latent variables to generate each other, so that their latent embeddings can be better aligned via cross-generation. Therefore, better matching between the micro-video and music can be achieved in the latent space for recommendation.

  • The product-of-experts (PoE) principle is adopted to fuse the modality-specific embeddings of the visual and textual modalities of the micro-video so that the complementary multimodal information can be better exploited. With PoE, the semantic information in each modality is weighted according to its variance estimation, giving more weight to the modality with a lower noise level, such that the micro-video embeddings contain less irrelevant information and generalize more robustly.

  • A large-scale content-based micro-video background music recommendation dataset, TT-150k, is established. The dataset contains more than 3,000 candidate music clips and about 150k corresponding micro-videos, where the popularity distribution is consistent with that of the real-world scenario. On the established TT-150k dataset, experiments show that the proposed CMVAE significantly outperforms the state-of-the-art methods.

The remainder of this article is organized as follows. Section 2 gives a literature review of the works related to our proposed method. Section 3 introduces the establishment of the dataset TT-150k. In Section 4, we illustrate the proposed CMVAE with details. Section 5 presents the experimental settings and analyzes the experimental results. Finally, Section 6 briefly summarizes the article.

2 Related Work

2.1 Audiovisual Cross-modal Matching

Cross-modal matching aims to match relevant materials where the modality of the target is different from that of the query [47]. Compared with matching two sources from a single modality, cross-modal matching requires measuring the similarity of heterogeneous sources composed of different modalities, and is therefore a more challenging task. Recent research has mainly focused on matching the textual and visual modalities [52, 50, 63, 8, 22, 64, 51, 44, 13], using open-sourced datasets such as MS-COCO [32] and Flickr [38]. Although the need for matching audio and visual contents, such as video background music recommendation, has existed for a long time, few efforts have been dedicated to solving this problem.

Several pioneering methods include the following. Chao et al. [5] recommended background music for digital photo albums using the relatedness between image tags and music mood tags. Wu et al. [56] proposed ranking canonical correlation analysis (CCA) for cross-modal matching of music and images. Li and Kumar [29] utilized emotion tags as metadata to align the music and video modalities. These emotion tags for the video, however, were manually annotated via crowdsourcing, which incurs high labor and time costs. Liu and Chen [35] relied on Thayer's emotion model [45], where videos and audios falling in the same quadrant of emotions were regarded as positive samples. Lin et al. [33] argued for the importance of matching the rhythm of the music with the visual movement of the video, where the tags of videos provided by users were used to calculate their relationship with the song lyrics of the music candidates. Surís et al. [42] further employed the visual and audio features provided by Youtube-8M [1] to constrain the visual and audio embeddings of the same video to be as close as possible and to predict the corresponding label of the video. Wu et al. [57] constructed a manually annotated music-image dataset to explore the matching patterns between the two modalities using a CCA-based method. CBVMR [21] introduced a content-based retrieval model that only used the matching signal between music and videos without any metadata such as emotions. However, since there exists no publicly available dataset, these methods are not directly comparable to each other, which motivates us to establish the content-based micro-video background music recommendation dataset, TT-150k, in this paper.

2.2 Variational-based Recommendation

Variational auto-encoder (VAE) [26] can be represented as a probabilistic graphical model $p_\theta(x, z) = p_\theta(x \mid z)\,p(z)$, where $x$ is the observation and $z$ is the latent variable that governs the generation of $x$. The main idea of VAE is to construct an analytical approximation $q_\phi(z \mid x)$ to the intractable posterior of $z$ from a tractable distribution family as the encoder network, given the generative model $p_\theta(x \mid z)$ represented by a decoder network that reconstructs $x$ from $z$. The objective of VAE is to maximize the marginal log likelihood $\log p_\theta(x)$, which is proved to be equivalent to optimizing an Evidence Lower BOund (ELBO) [26]:

$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right),$ (1)

where $x$ is sampled from the observation population, and $\theta$ and $\phi$ are the parameters of the decoder and encoder, respectively.

Compared to the auto-encoder (AE) [2], which suffers from performance degradation when facing noisy inputs, and the denoising auto-encoder (DAE) [46], which adds fixed noise to the input during the training phase and cannot handle diverse noise for different samples, VAE can capture the semantic structures of high-dimensional data in latent variables, with the inferred latent embeddings corrupted with dynamic Gaussian noise to improve robustness. Owing to these advantages, VAE has recently been extended for recommendation. For example, MultiVAE [31] is a generative model that fits a user's interactions with items via a VAE with a multinomial likelihood. MacridVAE [36] extended MultiVAE by constraining the user representations to be disentangled at both the macro and micro levels. On the other hand, Li and She [30] focused more on modeling item content factors, coupling item content VAEs with a matrix factorization module. However, the utilization of VAE for multimodal cross-modal retrieval is comparatively less explored, which leads us to propose CMVAE for micro-video background music recommendation.

3 Introduction of the TT-150k Dataset

To bridge the gap caused by the lack of a publicly available matching dataset that associates videos with background music, we establish TT-150k, which considers the practical scenario where a music clip can be associated with multiple videos, whereas related datasets such as Youtube-8M [1] mainly associate one music clip with one video. TT-150k is intended to be a benchmark for micro-video background music recommendation. It is collected from the popular micro-video sharing platform TikTok, where numerous videos with appealing background music are uploaded every day. The central criterion for establishing this dataset is to faithfully reflect the distribution of micro-videos and candidate music clips in the real-world scenario, so as to facilitate research on discovering the matching patterns between music and micro-videos. Then, relevant background music can be automatically recommended upon the upload of a micro-video.

3.1 Dataset Introduction

We first built a music candidate pool with approximately 3,000 music clips. A music clip was selected into the pool if it satisfied one of two criteria: 1) it appeared on the TikTok pop charts, or 2) it was used in a randomly crawled popular video (which ensures the quality of the background music). Note that the background music clip of a popular micro-video is not necessarily a popular music clip, and therefore the popularity distribution of the music clips in the candidate pool can still faithfully reflect the real-world scenario. Music clips with the title “original music” were eliminated, because these clips are often made and uploaded by users and tailored to a certain video, and thus may not generalize well to other micro-videos. Besides, different remixes of the same music are indexed as different music clips in our dataset. With the music candidate pool established, we crawled, for each music clip in the pool, a list of videos that use it as background music according to TikTok's song search API. We also collected the number of videos using each music clip as background music (i.e., the music usage amount) at the same time.

Figure 1: Popularity distribution of the candidate music clips in our TT-150k dataset.

To approximate the true distribution of music adoption, i.e., its popularity, we used the crawled music usage amounts to sample videos, following the principle of keeping the relative popularity of these music clips as close as possible to their relative popularity in the real-world scenario. Specifically, we used a Gamma distribution to fit the distribution of music usage amounts through Maximum Likelihood Estimation. The popularity distribution of the candidate music clips after sampling is shown in Figure 1. With this sampling strategy, we can reduce the number of videos without losing the original distribution of the real-world data, which makes the dataset suitable for analyzing the matching patterns between videos and background music. After calculating the number of videos to be crawled for each music clip based on the relative music popularity distribution, we sampled the latest uploaded videos for each clip. At the same time, the brief description or hashtags attached to a video, if present, were also crawled. In this way, the micro-videos consist of bimodal representations of visual and corresponding textual contents.
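As an illustration of this step, the following minimal sketch fits a Gamma distribution to hypothetical crawled usage counts with SciPy and derives per-clip video quotas proportional to relative popularity; the file name and the exact way the fitted distribution feeds into the sampling are assumptions, not the authors' code.

```python
import numpy as np
from scipy import stats

# Hypothetical per-clip usage counts: number of TikTok videos using each candidate clip.
usage_counts = np.loadtxt("music_usage_counts.txt")

# Maximum-likelihood fit of a Gamma distribution to the usage amounts (location fixed at 0).
shape, loc, scale = stats.gamma.fit(usage_counts, floc=0)
print(f"fitted Gamma: shape={shape:.2f}, scale={scale:.2f}")

# Sample videos per clip in proportion to relative popularity, so that the reduced
# dataset keeps the real-world popularity distribution (150k is the approximate target).
target_total = 150_000
videos_per_clip = np.round(usage_counts / usage_counts.sum() * target_total).astype(int)
```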

3.2 Feature Engineering

For the visual information of the micro-videos, we preprocess the videos and extract video-level features using an effective pre-trained ResNet model [17]. The structure of ResNet enhances the quality of image representations by refining the raw information from the input images in a cascaded manner via residual modeling. Specifically, we first utilize FFmpeg (https://www.ffmpeg.org/) to extract video frames at three frames per second. A pre-trained model such as ResNet [17] or I3D [4] can then be used to extract the visual features; we adopt ResNet in this work for its simplicity and follow [1] in applying temporal global average pooling to fuse the features extracted from different frames into the final video-level features.
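A minimal sketch of this visual pipeline is given below, assuming frames have already been extracted with FFmpeg at three frames per second; the specific ResNet depth (ResNet-50) and preprocessing constants are illustrative choices, not necessarily those used for TT-150k.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet backbone with the classification head removed -> 2048-d per-frame features.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_level_feature(frame_paths):
    """Temporal global average pooling over per-frame ResNet features."""
    frames = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    per_frame = resnet(frames)    # (num_frames, 2048)
    return per_frame.mean(dim=0)  # (2048,) video-level feature
```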

For the textual information of the micro-videos, since TikTok is an international micro-video sharing platform whose users post in diverse languages, we use the multi-lingual Bert-M model [11] to extract the textual features. Bert-M is pre-trained on a large corpus covering 104 languages, from which multi-lingual aligned semantic structures can be learned.
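A minimal sketch of the textual feature extraction is given below; the Hugging Face checkpoint name (bert-base-multilingual-cased) and the masked mean-pooling over token embeddings are assumptions made for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

@torch.no_grad()
def text_feature(description: str) -> torch.Tensor:
    """768-d textual feature for a micro-video description / hashtag string."""
    tokens = tokenizer(description, truncation=True, max_length=128, return_tensors="pt")
    hidden = bert(**tokens).last_hidden_state      # (1, seq_len, 768)
    mask = tokens["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1).squeeze(0) / mask.sum()  # masked mean pooling
```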

For the music features, we extract spectrograms from the raw audio clips and then utilize the Vggish network [20] pre-trained on AudioSet [15] to extract song-genre-related features. Moreover, we exploit openSMILE [12] to extract pitch- and emotion-related features. Specifically, we reduce the dimension of the openSMILE features to the same size as the Vggish features with PCA [53] after normalizing them to zero mean and unit variance.
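A minimal sketch of the openSMILE normalization and dimensionality reduction step is given below, assuming the Vggish and openSMILE features have already been extracted and saved; the file names and the final concatenation of the two feature sets are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical pre-extracted features (file names are assumptions).
vggish_feats = np.load("vggish_features.npy")        # (num_music, 128)
opensmile_feats = np.load("opensmile_features.npy")  # (num_music, D), D >> 128

# Normalize openSMILE features to zero mean and unit variance, then reduce them
# with PCA to the same dimensionality as the Vggish features.
opensmile_scaled = StandardScaler().fit_transform(opensmile_feats)
opensmile_reduced = PCA(n_components=vggish_feats.shape[1]).fit_transform(opensmile_scaled)

# Concatenation into a single music feature vector is an assumption for illustration.
music_features = np.concatenate([vggish_feats, opensmile_reduced], axis=1)
```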

3.3 Dataset Statistics

The statistics of the micro-video background music recommendation dataset are presented in Table 1. In summary, the established TT-150k dataset contains 3,003 music clips and 146,351 videos, where a music clip is selected by at least 3 and at most 219 micro-videos. Figure 2 shows an exemplar subset of music clips and micro-videos in the established TT-150k dataset.

#Music  #Videos  avg. #videos/music  std. #videos/music  min/max #videos/music
3,003  146,351  49  57  3 / 219
Table 1: Statistics of the established TT-150k dataset.
Figure 2: An exemplar subset of videos and their matched background music in the established TT-150k dataset.

TT-150k is established with videos that use a certain music clip in the candidate music set as the background music. In this way, for each video, the ground truth is the background music that the uploader chose for the video. We use two dataset split strategies to create the train, validation and test sets. The first strategy is “weak generalization”, where for each background music clip, the associated videos are split with a ratio of 8:1:1 to construct the train, validation and test sets. In this case, although the videos in the test set are new, the candidate music clips in the test set have been adopted by at least one micro-video in the training set. The second strategy we consider is “strong generalization”, where the music clips in the test set are not present in the training set. To keep the popularity distribution of the music candidates in the test set identical to that of the training set, a stratified sampling strategy is adopted to split the music clips: in each popularity stratum, we randomly sample the music clips, together with all their associated videos, with an 8:1:1 ratio for the train, validation and test sets.
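A minimal sketch of the strong-generalization split is given below, assuming the dataset is available as a table of (video_id, music_id) pairs; the file name, column names, and the use of quantile bins as popularity strata are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
videos = pd.read_csv("tt150k_videos.csv")  # assumed columns: video_id, music_id

popularity = videos.groupby("music_id").size()        # #videos per music clip
strata = pd.qcut(popularity, q=10, duplicates="drop")  # popularity strata (quantile bins)

split_of_music = {}
for _, clips in popularity.groupby(strata):
    ids = rng.permutation(clips.index.to_numpy())      # shuffle clips within the stratum
    n_train, n_val = int(0.8 * len(ids)), int(0.1 * len(ids))
    for i, mid in enumerate(ids):
        split_of_music[mid] = ("train" if i < n_train
                               else "val" if i < n_train + n_val else "test")

# Every video follows its music clip into the same split (strong generalization).
videos["split"] = videos["music_id"].map(split_of_music)
```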

4 Methodology

The overall framework of CMVAE is shown in Figure 3. Specifically, CMVAE aims to align the latent embeddings of music and videos via a cross-modal generative process to model the matching of the music and videos. With the generative process defined, the intractable posterior distributions of the latent variables are estimated by variational inference. Furthermore, to effectively utilize the comprehensive multimodal video contents, CMVAE fuses the modality-specific latent variables by a product-of-experts system, where the information in each modality is weighted by its importance for the matching purpose. The details of the proposed model are expounded in the following sections.

4.1 Problem Definition

Suppose we have a set of music $\mathcal{M}$ and a set of videos $\mathcal{V}$, where each video $v \in \mathcal{V}$ is associated with a visual feature $x_v^{vis}$ extracted from its image sequence and a textual feature $x_v^{txt}$ from its description, and each music $m \in \mathcal{M}$ is associated with an audio feature $x_m$. For each music $m$ and video $v$, we define a mapping $y$ that depicts whether or not music $m$ matches video $v$. The mapping induces a set of triads $(v, m, y_{vm})$, where $y_{vm} \in \{0, 1\}$ is the matching indicator. Given a new video, our goal is to retrieve a list of music candidates where each music clip is a potential match for the query video.

Figure 3: The framework of our proposed CMVAE for background music recommendation of micro-videos. Specifically, the music feature $x_m$ is encoded into the latent variable $z_m$, which follows a Gaussian distribution parameterized by an MLP. The video features ($x_v^{vis}$ and $x_v^{txt}$) are fed into modality-specific encoders, and the latent variables from the visual and textual modalities are fused by the product-of-experts (PoE) principle to compute $z_v$. The final loss function consists of the reconstruction losses, the KL divergence losses, and the matching loss.

Figure 4: The probabilistic graphical model (PGM) of the proposed CMVAE. Hierarchically, we first model modality-level latent embeddings, and then the micro-video-level latent embedding is obtained for cross-generation.

4.2 Cross-modal Variational Auto-encoder

Given a video-music pair $(v, m)$, considering that their original feature spaces are high-dimensional and misaligned, CMVAE aims to first map $x_v$ and $x_m$ into a shared $d$-dimensional Gaussian latent space where their matching degree can be properly judged. Such a latent space should have the property that the latent variables of a matched video-music pair are closer than those of an unmatched pair. In CMVAE, this is achieved by constraining the music latent embedding $z_m$ and the video latent embedding $z_v$ to be able to generate the video feature $x_v$ and the music feature $x_m$, respectively, for a matched pair, i.e., cross-generation. Based on this criterion, the generative process can be formulated as:

$x_v \sim p_\theta(x_v \mid z_m), \quad z_m \sim p(z_m),$ (2)
$x_m \sim p_\theta(x_m \mid z_v), \quad z_v \sim p(z_v).$ (3)

In addition to the cross-generation, we also define the generation of the matching status $y_{vm}$ by the matching generative distribution $p(y_{vm} \mid v, m)$, where the probability depicts the matching degree of the given video-music pair. Since we measure the matching degree of $v$ and $m$ in the latent space via $z_v$ and $z_m$, $p(y_{vm} \mid v, m)$ can be specified as the expected conditional distribution:

$p(y_{vm} \mid v, m) = \mathbb{E}_{z_v, z_m}\left[p(y_{vm} \mid z_v, z_m)\right].$ (4)

Considering that each video consists of multimodal features (i.e., the visual and textual features $x_v^{vis}$ and $x_v^{txt}$), both of which are important for the matching of background music, the generative process is extended to the multimodal scenario as shown in Figure 4, where the cross-generation is specified at a finer granularity as follows:

$x_v^{vis} \sim p_\theta(x_v^{vis} \mid z_m), \quad x_v^{txt} \sim p_\theta(x_v^{txt} \mid z_m),$ (5)
$x_m \sim p_\theta(x_m \mid z_v).$ (6)

To infer the joint video latent embedding $z_v$ based on the complementary information from both the visual and textual modalities, inspired by [55], we assume the observations of the two modalities, $x_v^{vis}$ and $x_v^{txt}$, are conditionally independent given $z_v$, so that the joint posterior can be factorized into modality-specific posteriors as follows:

$p(z_v \mid x_v^{vis}, x_v^{txt}) \propto \dfrac{p(z_v \mid x_v^{vis})\, p(z_v \mid x_v^{txt})}{p(z_v)},$ (7)

where $z_v$ is the joint latent embedding inferred from the multimodal video features, and $z_v^{vis}$ and $z_v^{txt}$ are the latent embeddings inferred from the visual modality and the textual modality, respectively. Eq. (7) shows that the multimodal fusion of the bimodal information of micro-videos is in essence a product-of-experts (PoE) system. For Gaussian variables, the product is also Gaussian, where the new mean and new variance become:

$\mu_v = \dfrac{\sum_i \mu_i / \sigma_i^2}{\sum_i 1/\sigma_i^2},$ (8)
$\sigma_v^2 = \dfrac{1}{\sum_i 1/\sigma_i^2},$ (9)

where $(\mu_i, \sigma_i^2)$, $i \in \{vis, txt\}$, denote the mean and variance of the modality-specific Gaussian posteriors.

Since, theoretically, the mean of a Gaussian variable depicts its semantic structure and its variance denotes uncertainty, the mean vector of the micro-video embedding is a weighted sum of the semantic information in each modality according to how informative that modality is for the matching task. Therefore, the heterogeneous information from the visual and textual modalities is comprehensively fused, with irrelevant information down-weighted for better generalization. Another by-product of such a factorization is that, if the textual modality is missing in the training or test phase, which is a commonly encountered problem when users are reluctant to add descriptions to their videos, we can temporarily drop the textual network in CMVAE and proceed with training or testing using the visual network alone.
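To make the PoE fusion of Eqs. (8)-(9) concrete, the following is a minimal sketch (not the authors' released code) of the product of diagonal-Gaussian experts, assuming each modality encoder outputs a mean and a log-variance; the optional standard Normal prior expert follows the formulation of [55].

```python
import torch

def product_of_experts(mus, logvars, include_prior_expert=True, eps=1e-8):
    """Fuse diagonal-Gaussian experts (cf. Eqs. (8)-(9)).
    mus, logvars: lists of (batch, d) tensors, one per available modality."""
    if include_prior_expert:
        # Standard Normal prior expert: mean 0, variance 1 (logvar 0), as in [55].
        mus = mus + [torch.zeros_like(mus[0])]
        logvars = logvars + [torch.zeros_like(logvars[0])]
    precisions = [torch.exp(-lv) + eps for lv in logvars]  # 1 / sigma_i^2
    precision_sum = torch.stack(precisions).sum(dim=0)
    joint_var = 1.0 / precision_sum                         # Eq. (9)
    joint_mu = joint_var * torch.stack([m * p for m, p in zip(mus, precisions)]).sum(dim=0)  # Eq. (8)
    return joint_mu, torch.log(joint_var)

# If the textual modality is missing at training or test time, simply drop its expert:
# mu_v, logvar_v = product_of_experts([mu_vis], [logvar_vis])
```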

After defining the inference process for the joint video variational posterior $q(z_v \mid x_v^{vis}, x_v^{txt})$, our discussion shifts to the estimation of $z_m$, $z_v^{vis}$ and $z_v^{txt}$ with the collected matching video-music pairs. Since the generative processes of the modalities and the matching status are parameterized as deep neural networks, the posterior distributions of $z_m$, $z_v^{vis}$ and $z_v^{txt}$ are intractable. Therefore, we resort to variational inference, where we assume $q(z_m \mid x_m)$, $q(z_v^{vis} \mid x_v^{vis})$ and $q(z_v^{txt} \mid x_v^{txt})$ come from tractable families of distributions (which are also parameterized by deep neural networks) and in those families find the distributions closest to the true posteriors as measured by the KL-divergence [3]. Previous work proves that the minimization of the KL-divergence is equivalent to the maximization of the Evidence Lower BOund (ELBO) [26], which is composed of two parts: the cross reconstruction term $\mathcal{L}_{recon}$ and the KL-divergence term $\mathcal{L}_{KL}$. The cross reconstruction term is

$\mathcal{L}_{recon} = \mathbb{E}_{q(z_m \mid x_m)}\left[\log p_\theta(x_v^{vis} \mid z_m) + \log p_\theta(x_v^{txt} \mid z_m)\right] + \mathbb{E}_{q(z_v \mid x_v^{vis}, x_v^{txt})}\left[\log p_\theta(x_m \mid z_v)\right],$ (10)

which aims to reconstruct the input features from the latent variables. The term $\mathcal{L}_{KL}$ penalizes the latent variables for deviating from the prior, which prevents encoding excessive information from the inputs and serves as a regularizer. It can be formulated as:

$\mathcal{L}_{KL} = \mathrm{KL}\left(q(z_m \mid x_m) \,\|\, p(z_m)\right) + \mathrm{KL}\left(q(z_v \mid x_v^{vis}, x_v^{txt}) \,\|\, p(z_v)\right),$ (11)

where $p(z_m)$ and $p(z_v)$ are the priors for the music and video latent variables, respectively, both of which are specified as the standard Normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. In addition, inspired by [54, 48], the video-music matching loss that maximizes $p(y_{vm} = 1 \mid v, m)$ for matched pairs is specified as a bi-directional ranking loss as follows:

$\mathcal{L}_{match} = \sum_{(v, m^+)} \sum_{m^-} \max\left(0, \alpha - s(z_v, z_{m^+}) + s(z_v, z_{m^-})\right) + \lambda \sum_{(m, v^+)} \sum_{v^-} \max\left(0, \alpha - s(z_m, z_{v^+}) + s(z_m, z_{v^-})\right),$ (12)

where $s(\cdot, \cdot)$ denotes the cosine similarity, $\alpha$ is the margin, and $\lambda$ weights the music-to-video term. The first term on the right hand side of Eq. (12) is the video-to-music matching loss, where for a fixed video $v$, $m^+$ and $m^-$ denote the matched and unmatched music, respectively. The second term is the music-to-video matching loss, where the symbols are defined accordingly. For each term, we select the unmatched samples with the top cosine similarities as the negative samples. Finally, combining the ELBO and the matching loss, the joint training objective can be specified as:

$\mathcal{L} = -\beta\, \mathcal{L}_{recon} + \mathcal{L}_{KL} + \gamma\, \mathcal{L}_{match},$ (13)

where $\beta$ and $\gamma$ control the weights of the cross reconstruction loss and the matching loss, respectively. Note that to increase the model's robustness to modality missing in the test phase, we utilize a sub-sampled training strategy inspired by [55], where the latent embedding of the visual modality or the textual modality alone is also constrained to complete the matching task by adding the corresponding single-modal objectives to the joint training objective $\mathcal{L}$. The detailed training steps of the proposed CMVAE are summarized in Algorithm 1 for reference.

Input: A video-music matching dataset $\mathcal{D}$;
  the video feature set containing the visual modality $x_v^{vis}$ and the textual modality $x_v^{txt}$ for each video $v$;
  the music feature set containing the audio modality $x_m$ for each music $m$;
  the mapping $y$ with matching indicators $y_{vm}$;
  $\Theta$ indicates the model parameters.
Randomly initialize $\Theta$. while not converged do
       Randomly sample a batch $\mathcal{B}$ from $\mathcal{D}$. forall $(v, m, y_{vm}) \in \mathcal{B}$ do
             Compute $q(z_m \mid x_m)$, $q(z_v^{vis} \mid x_v^{vis})$ and $q(z_v^{txt} \mid x_v^{txt})$ via the variational encoders.
       end forall
      Compute the mean and variance of $z_v$ from the visual and textual posteriors with the PoE module. Add the KL-divergence with the priors to the loss $\mathcal{L}$. Sample $z_v$ and $z_m$ via the reparametrization trick and compute the cross reconstructions. Add the cross reconstruction loss and the matching loss as in Eq. (10) and Eq. (12) to $\mathcal{L}$. Add the losses of the modality-specific $z_v^{vis}$ and $z_v^{txt}$ to obtain the final $\mathcal{L}$. Compute the gradient of the loss $\mathcal{L}$. Update $\Theta$ by taking stochastic gradient steps.
end while
return Output: CMVAE model trained on dataset $\mathcal{D}$.
Algorithm 1 CMVAE-SGD: Training CMVAE with SGD.
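To complement Algorithm 1, the sketch below illustrates one training step in PyTorch under several assumptions that the paper does not fix: MLP encoders and decoders, Gaussian likelihoods implemented as mean-squared-error reconstruction, cosine similarity for $s(\cdot,\cdot)$, a single hardest in-batch negative per direction (the paper selects the ten most confusing negatives), and illustrative feature and hidden sizes. It reuses product_of_experts() from the earlier sketch and omits the single-modal sub-sampled objectives for brevity; it is a sketch, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_VIS, D_TXT, D_MUS, D_Z = 2048, 768, 256, 512  # assumed feature / latent dimensions

class GaussianEncoder(nn.Module):
    """MLP encoder producing the mean and log-variance of a diagonal Gaussian posterior."""
    def __init__(self, d_in, d_z, d_hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu_head = nn.Linear(d_hidden, d_z)
        self.logvar_head = nn.Linear(d_hidden, d_z)
    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)

enc_vis, enc_txt, enc_mus = (GaussianEncoder(D_VIS, D_Z),
                             GaussianEncoder(D_TXT, D_Z),
                             GaussianEncoder(D_MUS, D_Z))
dec_vis = nn.Sequential(nn.Linear(D_Z, 1024), nn.ReLU(), nn.Linear(1024, D_VIS))  # p(x_v^vis | z_m)
dec_txt = nn.Sequential(nn.Linear(D_Z, 1024), nn.ReLU(), nn.Linear(1024, D_TXT))  # p(x_v^txt | z_m)
dec_mus = nn.Sequential(nn.Linear(D_Z, 1024), nn.ReLU(), nn.Linear(1024, D_MUS))  # p(x_m | z_v)

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_standard_normal(mu, logvar):
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1).mean()

def bidirectional_ranking_loss(z_v, z_m, margin=0.2, lam=3.0):
    """Hinge loss in the spirit of Eq. (12), with one hardest in-batch negative per direction."""
    sim = F.cosine_similarity(z_v.unsqueeze(1), z_m.unsqueeze(0), dim=-1)  # (B, B)
    pos = sim.diag()
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_music = sim.masked_fill(diag, -1.0).max(dim=1).values  # per video
    hardest_video = sim.masked_fill(diag, -1.0).max(dim=0).values  # per music
    v2m = F.relu(margin - pos + hardest_music).mean()
    m2v = F.relu(margin - pos + hardest_video).mean()
    return v2m + lam * m2v

def training_step(x_vis, x_txt, x_mus, beta=1.0, gamma=1.0):
    # Modality-specific posteriors and PoE fusion for the video (Eqs. (7)-(9)).
    mu_vis, lv_vis = enc_vis(x_vis)
    mu_txt, lv_txt = enc_txt(x_txt)
    mu_v, lv_v = product_of_experts([mu_vis, mu_txt], [lv_vis, lv_txt])
    mu_m, lv_m = enc_mus(x_mus)
    z_v, z_m = reparameterize(mu_v, lv_v), reparameterize(mu_m, lv_m)

    # Cross reconstruction (Eq. (10)); MSE stands in for the negative Gaussian log-likelihood.
    recon = (F.mse_loss(dec_vis(z_m), x_vis) + F.mse_loss(dec_txt(z_m), x_txt)
             + F.mse_loss(dec_mus(z_v), x_mus))
    # KL regularization (Eq. (11)) and bi-directional ranking loss (Eq. (12)).
    kl = kl_to_standard_normal(mu_v, lv_v) + kl_to_standard_normal(mu_m, lv_m)
    match = bidirectional_ranking_loss(z_v, z_m)
    return beta * recon + kl + gamma * match  # joint objective in the form of Eq. (13)
```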

5 Experiments

In this section, we conduct extensive experiments based on the established TT-150k dataset to evaluate our proposed model CMVAE. Specifically, we aim to answer the following research questions: 1) How does CMVAE perform compared with the baseline methods for the micro-video background music recommendation? 2) How do the cross-generation strategy and the PoE fusion module in CMVAE contribute to the performance? 3) How does CMVAE perform if the textual modality is missing in the test phase?

Moreover, a qualitative assessment of CMVAE is included to visualize its recommendations.

Existing music (Weak generalization) New music (Strong generalization)
Recall@10 Recall@15 Recall@20 Recall@25 Recall@10 Recall@15 Recall@20 Recall@25
CMVAE 0.2606 0.3684 0.4489 0.5137 0.1610 0.2274 0.2881 0.3479
Random 0.0959 0.1398 0.1788 0.2191 0.0978 0.1443 0.1866 0.2268
PopularRank 0.0689 0.1030 0.1390 0.1766 - - - -
CCA [16] 0.0465 0.0640 0.0934 0.1278 0.0412 0.0670 0.0941 0.1273
CEVAR [42] 0.1262 0.1749 0.2302 0.2794 0.1223 0.1823 0.2416 0.3019
CMVBR [21] 0.0927 0.1431 0.1929 0.2425 0.0966 0.1441 0.1940 0.2395
DSCMR [64] 0.1062 0.1553 0.2077 0.2651 0.1133 0.1775 0.2241 0.2793
UWML [51] 0.1053 0.1554 0.2103 0.2573 0.0933 0.1299 0.1729 0.2255
Dual-VAE [40] 0.1607 0.2272 0.2897 0.3514 0.1450 0.2143 0.2705 0.3294
DA-VAE [25] 0.1645 0.2390 0.3063 0.3698 0.1417 0.2104 0.2740 0.3280
Cross-VAE [61] 0.2584 0.3590 0.4360 0.5077 0.1391 0.2008 0.2706 0.3233
Table 2: Comparison between the proposed CMVAE and various baselines under weak and strong generalization scenarios.

5.1 Methods for Comparisons

To demonstrate the effectiveness of the proposed CMVAE, we draw comparisons with several state-of-the-art cross-modal matching baselines. For the background music recommendation task in this article, each video has exactly one background music clip as the ground truth, so the co-occurrence information necessary for collaborative filtering does not exist. Therefore, collaborative-filtering-based methods (e.g., NCF [18]) are not applicable in our paper. For methods whose original task is to match two sources, each with a single modality, we concatenate the visual and textual features as the “single modality” representation of the micro-videos. The baselines are listed as follows:


  • Random: For each video, music clips are randomly selected as the candidates for recommendation.

  • PopularRank: The number of adoptions of a music clip, i.e., its popularity, is utilized to select and recommend the top-$K$ popular music clips for every video.

  • CCA [16]: CCA (Canonical correlation analysis) is a widely adopted multivariate correlation analysis method. We extract two canonical variates from music features and video features via CCA, and then calculate cosine similarity for matching.

  • CEVAR [42]: CEVAR constrains the visual and audio embeddings that come from the same video to be close using a cosine similarity loss, optimized together with a supervision loss on the corresponding class of the video. To make it amenable to our problem, we eliminate the label prediction loss for comparison.

  • CMVBR [21]: CMVBR is a content-based method that introduces a soft intra-modal structure loss to better align intra-modal features for matching two multimodal sources.

  • DSCMR [64]: DSCMR is a supervised model that requires semantic category labels of the samples, and it minimizes the recognition loss in both the label space and the common representation space. A weight-sharing strategy is used to reduce the cross-modal discrepancy of multimedia data. We eliminate the recognition loss in the label space to make it applicable to our dataset.

  • UWML [51]: UWML is a metric-learning-based method with a triplet loss that encourages positive pairs to be closer than negative pairs. It introduces a weighting framework for the positive and negative pairs, where a larger weight, calculated by polynomial functions of the similarity scores, is assigned to harder-to-match pairs.

We also compare our proposed cross-modal variational auto-encoder with the state-of-the-art variational-based generative models:


  • Dual-VAE [40]: Dual-VAE exploits deconvolution over word sequences in the decoder of a text-based retrieval model. It is trained jointly on question-to-question and answer-to-answer reconstruction.

  • DA-VAE [25]: This dual-aligned variational auto-encoder focuses on the situation where samples in some modalities are missing for image-text retrieval. DA-VAE utilizes an entropy-maximization constraint to further align two modalities along with the self reconstruction loss.

  • Cross-VAE [61]: This crossing variational auto-encoder utilizes a cross reconstruction strategy where answers are reconstructed from question embeddings and questions are reconstructed from answer embeddings for text-based retrieval.

For the above three methods, for a fair comparison, we only replace the PoE-based multimodal cross-generation module in the proposed CMVAE with the corresponding cross-modal generation module, concatenating the multimodal features of the micro-video to form the query modality. Dual-VAE and DA-VAE both optimize the self reconstruction loss and the matching loss, whereas DA-VAE further utilizes an entropy-maximization constraint to better align the two modalities. For Cross-VAE, the cross reconstruction loss is utilized along with the matching loss to form the training objective.

5.2 Evaluation Settings

We evaluate the models under both the weak and strong generalization scenarios. For weak generalization, all the music clips in the test set have been adopted by at least one micro-video in the training set, while for strong generalization, the music clips in the test set are not present in the training stage. We use Recall@$K$ [51] to evaluate the model performance, a standard metric for cross-modal retrieval that calculates the average hit ratio of the matched music clips among the music clips ranked with the top-$K$ match scores. Since the popularity of different music clips in our dataset varies considerably, as shown in Figure 1, weighting the hits equally for different music clips could lead to a systematic bias that favors popular music clips. Therefore, following [24], the music clips for evaluation are weighted by the inverse of their popularity levels (i.e., estimated propensity scores), and thus the Recall@$K$ we adopt in this paper is formalized as follows:

$\mathrm{Recall@}K = \dfrac{\sum_{v \in \mathcal{V}_{test}} w_v \cdot \mathbb{1}\left(m_v \in \mathrm{top}\text{-}K(v)\right)}{\sum_{v \in \mathcal{V}_{test}} w_v},$ (14)
$w_v = \dfrac{1}{\mathrm{pop}(m_v)},$ (15)

where $\mathcal{V}_{test}$ represents the set of test videos and $w_v$ denotes the weight of video $v$ calculated by Eq. (15). Specifically, $w_v$ is the reciprocal of the popularity $\mathrm{pop}(m_v)$ of the music clip $m_v$ that the target video uses as background music, and $\mathrm{top}\text{-}K(v)$ is the top-$K$ recommendation list for video $v$. With Eq. (14), the Recall@$K$ used in this paper calculates the popularity-debiased hit-ratio of relevant music clips ranked among the top-$K$ recommendation list for all videos in the test set.
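The following is a minimal sketch of the popularity-debiased Recall@$K$ in Eqs. (14)-(15) as reconstructed above, assuming the per-video rankings and the music popularity counts have already been computed; the self-normalization by the sum of weights is part of that reconstruction.

```python
def debiased_recall_at_k(ranked_music_ids, gt_music_ids, music_popularity, k=10):
    """ranked_music_ids: per test video, a list of music ids sorted by predicted score;
    gt_music_ids: the ground-truth background music id of each test video;
    music_popularity: dict mapping a music id to the number of videos that use it."""
    weighted_hits, weights = 0.0, 0.0
    for ranking, gt in zip(ranked_music_ids, gt_music_ids):
        w = 1.0 / music_popularity[gt]          # Eq. (15): inverse-popularity weight
        weighted_hits += w * (gt in ranking[:k])
        weights += w
    return weighted_hits / weights              # Eq. (14): weighted hit ratio
```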

Our CMVAE model is implemented in PyTorch [37]. The embedding size is fixed to 512 and the batch size is set to 1,024 for all models. We use ReLU as the non-linear activation. For all models, the hyperparameters are selected based on the metrics on the validation set. The learning rate is tuned among {0.0001, 0.005, 0.001, 0.05, 0.01}. By searching, we set the weight of the L2 norm penalty to 0.001 and the weight of the music-to-video matching loss $\lambda$ to 3, and choose the ten most confusing negative samples in the bi-directional ranking loss during training. For strong and weak generalization, the learning rate is set to 0.05 and 0.0001, respectively, with the weight of the cross reconstruction loss $\beta$ likewise selected on the validation set.

5.3 Performance Comparison

The comparisons between the proposed CMVAE and the state-of-the-art baselines are summarized in Table 2. For strong generalization, since the music clips in the test set are not present at the training stage, we assume that the popularity of the test music clips is unknown or cannot be accurately estimated due to lack of data, and thus exclude the PopularRank method from comparisons.

Among the methods that we draw comparisons with, CCA finds a linear projection that maximizes the correlation between the projected vectors of two different modalities. The inferior performance of CCA demonstrates that the matching pattern between video and music cannot be modeled by a simple linear relation. For the deep learning-based methods, CEVAR and CMVBR are originally designed for video-music matching, whereas DSCMR and UWML are designed for matching between visual (i.e., image or video) and textual modalities. Therefore, the modules specifically designed for image-text matching, such as the stacked cross attention network [28] that aligns image regions and text words, are not applicable to our task and are removed. We find that directly extending these image-text matching methods to the video-music setting, where one of the sources is composed of multiple modalities, could not achieve satisfying results. The reason may be that the patterns responsible for the matching between videos and background music are more elusive than those in image-text matching, and therefore the naive matching loss cannot force the model to capture the complex matching patterns needed to align the music and video latent spaces. CEVAR performs the best among the selected deep-learning-based baselines. Compared to CMVBR, CEVAR utilizes a cosine matching loss, instead of an inner-product loss, to constrain the closeness of the embeddings of matched video-music pairs, which eliminates systematic differences in the image embeddings due to varied lightness, saturation, etc.

The generative-based matching approaches listed at the bottom of Table 2 improve significantly over the matching-loss-based baselines. Among them, Dual-VAE and DA-VAE employ only self reconstruction, which leads to a more structured shared latent space for the micro-video and music embeddings. DA-VAE further imposes an entropy-maximization constraint on the joint representations, motivated by Jaynes's theory [23], which improves the performance over Dual-VAE. By adding a cross reconstruction module to Dual-VAE, Cross-VAE performs significantly better than DA-VAE for weak generalization and performs on par with DA-VAE for strong generalization.

CMVAE performs significantly better than all other methods. We attribute the improvement of CMVAE to the following two designs: 1) CMVAE takes advantage of both generative-based and matching-loss-based models, as it is optimized against a composite loss consisting of the cross reconstruction loss, the bi-directional ranking loss, and the VAE regularization loss. Therefore, the latent embeddings are constrained to encode more matching-relevant information from the videos and music while being regularized against overfitting. 2) By aggregating the information in the textual and visual modalities with the PoE module, the information in each modality is fused according to its importance to the matching purpose. Consequently, the irrelevant and redundant information contained in the micro-video embeddings is reduced, such that more robust generalization can be achieved.

Method  Recall@10  Recall@15  Recall@20  Recall@25
Matching loss only [60]  0.1305  0.1979  0.2583  0.3173
Dual-generation [40]  0.1637  0.2302  0.2937  0.3544
Cross-generation (CMVAE)  0.2606  0.3684  0.4489  0.5137
(a) Existing music
Method  Recall@10  Recall@15  Recall@20  Recall@25
Matching loss only [60]  0.1163  0.1767  0.2287  0.2935
Dual-generation [40]  0.1470  0.2183  0.2730  0.3314
Cross-generation (CMVAE)  0.1610  0.2274  0.2881  0.3479
(b) New music
Table 3: Recall results for different generation strategies.

5.4 Ablation Study of CMVAE

In this section, we investigate the effectiveness of different components in CMVAE. In detail, as the cross-generation strategy plays a vital role in CMVAE, we compare it with other generation strategies. Moreover, we explore the effectiveness of PoE fusion by comparing it with other variational multimodal fusion methods.

5.4.1 The effectiveness of cross-generation strategy

To explore the effectiveness of the cross-generation strategy, which is a core component of our method, we compare it with other generation strategies: the method with only the matching loss and without any generation module [60], and the method with dual-generation (i.e., the self-reconstruction strategy of a standard VAE) [40]. As shown in Table 3, cross-generation brings significant performance improvements. The superior results verify that the cross-generation strategy forces the latent embeddings of videos and background music to align with each other, so that similarities can be calculated to recommend music clips to videos. In contrast, the dual-generation strategy can only enhance the modeling of the latent embeddings by constraining the video latent embeddings to generate video features and the music latent embeddings to generate music features, whereas the alignment of the video and music embeddings is not guaranteed. The method without any generation strategy yields inferior results: the semantic-rich video latent space and the monotonous music latent space are naturally hard to align, and the weak associations between videos and background music make it difficult for the matching loss alone to learn the matching patterns and perform effective recommendations.

5.4.2 The effectiveness of PoE fusion

Method  Recall@10  Recall@15  Recall@20  Recall@25
MoE [41]  0.1990  0.2929  0.3673  0.4453
JMVAE [43]  0.2144  0.3058  0.3856  0.4498
PoE (CMVAE)  0.2606  0.3684  0.4489  0.5137
(a) Existing music
Method  Recall@10  Recall@15  Recall@20  Recall@25
MoE [41]  0.1408  0.2008  0.2633  0.3325
JMVAE [43]  0.1238  0.1879  0.2614  0.3249
PoE (CMVAE)  0.1610  0.2274  0.2881  0.3479
(b) New music
Table 4: Recall results for different multimodal variational fusion methods for videos.

In this section, we explore the effectiveness of PoE by comparing it with other multimodal variational fusion methods: the mixture-of-experts (MoE) [41], which assumes the joint variational posterior follows a Gaussian mixture distribution, and the joint multimodal variational auto-encoder (JMVAE) [43], which concatenates the encoded features of multiple modalities to learn a joint variational posterior. Table 4 shows that CMVAE with the PoE module performs significantly better than the other two methods under both strong and weak generalization. The superior results demonstrate that the video latent variable, whose mean is a sum of the mean vectors from the visual and textual modalities weighted by the reciprocals of the corresponding variances, contains less irrelevant information and therefore improves the model's generalization. In contrast, MoE treats each modality as equally important and JMVAE simply concatenates the available modalities as the micro-video representation, which makes their utilization of multimodal information less effective than the PoE module used in CMVAE.

5.5 Robustness to Modality Missing

Recall@10 Recall@15 Recall@20 Recall@25
V 0.2479 0.3472 0.4281 0.4977
V + T 0.2606 0.3684 0.4489 0.5137
(a) Existing music
Recall@10 Recall@15 Recall@20 Recall@25
V 0.1573 0.2275 0.2849 0.3459
V + T 0.1610 0.2274 0.2881 0.3479
(b) New music
Table 5: Recall results for the modality-missing problem in the test phase. (V: Visual, T: Textual)

In the real-world scenario, it is inevitable to encounter modality missing when analyzing micro-video contents. For example, some casual uploaders may be reluctant to write a title or hashtags when posting a video. Therefore, it is crucial for our multimodal model to be able to handle the modality-missing problem in the test phase without training multiple networks for each subset of modality combinations. With the PoE module fusing the multimodal information, our model can easily address the modality-missing problem by using the remaining visual embedding as a surrogate for the joint video embedding. To explore the performance of CMVAE when modality missing occurs at test time, we eliminate the textual modality of micro-videos in the test sets and report the performance. The results in Table 5 show that the performance decrease of CMVAE due to the missing textual modality is minor. Since CMVAE is trained with the sub-sampled training paradigm [55], where the latent embeddings of a single modality are also forced to perform the matching task, these results demonstrate the model's robustness to the textual-modality-missing problem.

5.6 Qualitative Assessment

Figure 5: Qualitative assessment of CMVAE for micro-video background music recommendation by visualizing some examples of query videos in the test set and the retrieved music clips ranked by their matching levels. The pictures in blue on the left are the representative frames of the micro-videos. The pictures on the right are the cover images of the music clips which are ranked by their matching level to the micro-video judged by CMVAE with the ground-truth music clips marked in red.

We conduct experiments to visualize some examples of query videos in the test set and the retrieved music clips ranked by their matching levels as illustrated in Figure 5 to better check the effectiveness of the recommendations of our method. Specifically, the pictures in blue on the left are the representative frames of the micro-videos with the textual information attached. The pictures on the right are the cover images of the music clips which are ranked by their matching level to the micro-video judged by CMVAE with the ground-truth music clips marked in red.

From the figure, we can see that CMVAE can recommend suitable background music clips by aligning video and music latent embeddings via cross-generation using multimodal information. Moreover, the expression and atmosphere that the video conveys through its visual and textual information are matched to the atmosphere of the background music recommended by CMVAE. Taking the first two videos in the figure as examples, the uppermost video expresses an atmosphere of magic and the top-ranked music clips are mostly ethereal tunes, while the video in the second row shows a painting process, for which our model recommends lighter and more lovely music.

6 Conclusions

In this paper, we introduce CMVAE, a hierarchical Bayesian cross-modal generative model for content-based micro-video background music recommendation. To address the lack of a publicly available dataset, we establish a large-scale dataset, TT-150k, which contains more than 3,000 candidate music clips and about 150k micro-videos, with a popularity distribution reflecting the real-world scenario. Experimental results demonstrate that by modeling the matching of relevant music to micro-videos as a multimodal cross-generation problem with PoE as the fusion strategy, CMVAE significantly improves the background music recommendation quality compared to state-of-the-art methods and is robust to the textual-modality-missing problem in the test phase.

References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §2.1, §3.2, §3.
  • [2] P. Baldi (2011) Autoencoders, unsupervised learning and deep architectures. In Proceedings of the International Conference on Unsupervised and Transfer Learning Workshop, pp. 37–50. Cited by: §2.2.
  • [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American statistical Association 112 (518), pp. 859–877. Cited by: §4.2.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4733. Cited by: §3.2.
  • [5] J. Chao, H. Wang, W. Zhou, W. Zhang, and Y. Yu (2011) Tunesensor: a semantic-driven music recommendation service for digital photo albums. In Proceedings of the International Semantic Web Conference, Cited by: §2.1.
  • [6] D. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 190–200. Cited by: §1.
  • [7] F. Chen, J. Shao, Y. Zhang, X. Xu, and H. T. Shen (2020) Interclass-relativity-adaptive metric learning for cross-modal matching and beyond. IEEE Transactions on Multimedia (), pp. 1–1. Cited by: §1.
  • [8] H. Chen, G. Ding, Z. Lin, S. Zhao, and J. Han (2019) Cross-modal image-text retrieval with semantic consistency. In Proceedings of the ACM International Conference on Multimedia, pp. 1749–1757. Cited by: §2.1.
  • [9] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han (2020) Imram: iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12655–12663. Cited by: §1.
  • [10] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Cited by: §1.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 4171–4186. Cited by: §3.2.
  • [12] F. Eyben, M. Wöllmer, and B. Schuller (2010) Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM International Conference on Multimedia, pp. 1459–1462. Cited by: §3.2.
  • [13] X. Fu, Y. Zhao, Y. Wei, Y. Zhao, and S. Wei (2020) Rich features embedding for cross-modal retrieval: a simple baseline. IEEE Transactions on Multimedia 22 (9), pp. 2354–2365. Cited by: §2.1.
  • [14] D. Gao, L. Jin, B. Chen, M. Qiu, Y. Wei, Y. Hu, and H. Wang (2020) FashionBERT: text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251–2260. Cited by: §1.
  • [15] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §3.2.
  • [16] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16 (12), pp. 2639–2664. Cited by: 3rd item, Table 2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.2.
  • [18] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the International Conference on World Wide Web, pp. 173–182. Cited by: §5.1.
  • [19] X. He, Y. Peng, and L. Xie (2019) A new benchmark and approach for fine-grained cross-media retrieval. In Proceedings of the ACM International Conference on Multimedia, pp. 1740–1748. Cited by: §1.
  • [20] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. (2017) CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 131–135. Cited by: §3.2.
  • [21] S. Hong, W. Im, and H. S. Yang (2018) CBVMR: content-based video-music retrieval using soft intra-modal structure constraint. In Proceedings of the ACM on International Conference on Multimedia Retrieval, pp. 353–361. Cited by: §2.1, 5th item, Table 2.
  • [22] P. Huang, G. Kang, W. Liu, X. Chang, and A. G. Hauptmann (2019) Annotation efficient cross-modal retrieval with adversarial attentive alignment. In Proceedings of the ACM International Conference on Multimedia, pp. 1758–1767. Cited by: §2.1.
  • [23] E. T. Jaynes (1957) Information theory and statistical mechanics. ii. Physical Review 108, pp. 171–190. Cited by: §5.3.
  • [24] Z. Ji, H. Pi, W. Wei, B. Xiong, M. Woźniak, and R. Damasevicius (2019) Recommendation based on review texts and social communities: a hybrid model. IEEE Access 7, pp. 40416–40427. Cited by: §5.2.
  • [25] M. Jing, J. Li, L. Zhu, K. Lu, Y. Yang, and Z. Huang (2020) Incomplete cross-modal retrieval with dual-aligned variational autoencoders. In Proceedings of the ACM International Conference on Multimedia, pp. 3283–3291. Cited by: 2nd item, Table 2.
  • [26] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations, Cited by: §2.2, §4.2.
  • [27] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. In International Conference on Computer Vision, Cited by: §1.
  • [28] K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision, Cited by: §5.3.
  • [29] B. Li and A. Kumar (2019) Query by video: cross-modal music retrieval. In International Society for Music Information Retrieval Conference, pp. 604–611. Cited by: §1, §2.1.
  • [30] X. Li and J. She (2017) Collaborative variational autoencoder for recommender systems. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 305–314. Cited by: §2.2.
  • [31] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. In Proceedings of the International Conference on World Wide Web, pp. 689–698. Cited by: §2.2.
  • [32] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1, §2.1.
  • [33] Y. Lin, T. Tsai, M. Hu, W. Cheng, and J. Wu (2014) Semantic based background music recommendation for home videos. In International Conference on Multimedia Modeling, pp. 283–290. Cited by: §2.1.
  • [34] V. E. Liong, J. Lu, Y. Tan, and J. Zhou (2017) Deep coupled metric learning for cross-modal matching. IEEE Transactions on Multimedia 19 (6), pp. 1234–1244. Cited by: §1.
  • [35] C. Liu and Y. Chen (2018) Background music recommendation based on latent factors and moods. Knowledge-Based Systems 159, pp. 158–170. Cited by: §1, §2.1.
  • [36] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu (2019) Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems, pp. 5712–5723. Cited by: §2.2.
  • [37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §5.2.
  • [38] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2641–2649. Cited by: §1, §2.1.
  • [39] D. Semedo and J. Magalhães (2020) Adaptive temporal triplet-loss for cross-modal embedding learning. In Proceedings of the ACM International Conference on Multimedia, pp. 1152–1161. Cited by: §1.
  • [40] D. Shen, Y. Zhang, R. Henao, Q. Su, and L. Carin (2018) Deconvolutional latent-variable model for text sequence matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: 1st item, §5.4.1, Table 2.
  • [41] Y. Shi, N. Siddharth, B. Paige, and P. Torr (2019) Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Information Processing Systems, pp. 15692–15703. Cited by: §5.4.2.
  • [42] D. Surís, A. Duarte, A. Salvador, J. Torres, and X. Giró-i-Nieto (2018) Cross-modal embeddings for video and audio retrieval. In Proceedings of the European Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.1, 4th item, Table 2.
  • [43] M. Suzuki, K. Nakayama, and Y. Matsuo (2017) Joint multimodal learning with deep generative models. In International Conference on Learning Representations Workshops, Cited by: §5.4.2.
  • [44] H. Tan, X. Liu, B. Yin, and X. Li (2021) Cross-modal semantic matching generative adversarial networks for text-to-image synthesis. IEEE Transactions on Multimedia (), pp. 1–1. Cited by: §2.1.
  • [45] R. E. Thayer (1990) The biopsychology of mood and arousal. Oxford University Press. Cited by: §2.1.
  • [46] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, pp. 3371–3408. Cited by: §2.2.
  • [47] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215. Cited by: §2.1.
  • [48] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 394–407. Cited by: §4.2.
  • [49] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013. Cited by: §1.
  • [50] Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan, X. Wang, and J. Shao (2019) Camp: cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5764–5773. Cited by: §2.1.
  • [51] J. Wei, X. Xu, Y. Yang, Y. Ji, Z. Wang, and H. T. Shen (2020) Universal weighting metric learning for cross-modal matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 13005–13014. Cited by: §1, §2.1, 7th item, §5.2, Table 2.
  • [52] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan (2016) Cross-modal retrieval with CNN visual features: a new baseline. IEEE Transactions on Cybernetics 47 (2), pp. 449–460. Cited by: §2.1.
  • [53] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1-3), pp. 37–52. Cited by: §3.2.
  • [54] F. Wu, X. Lu, Z. Zhang, S. Yan, Y. Rui, and Y. Zhuang (2013) Cross-media semantic representation via bi-directional learning to rank. In Proceedings of the ACM International Conference on Multimedia, pp. 877–886. Cited by: §4.2.
  • [55] M. Wu and N. Goodman (2018) Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pp. 5575–5585. Cited by: §4.2, §4.2, §5.5.
  • [56] X. Wu, Y. Qiao, X. Wang, and X. Tang (2012) Cross matching of music and image. In Proceedings of the ACM International Conference on Multimedia, pp. 837–840. Cited by: §2.1.
  • [57] X. Wu, Y. Qiao, X. Wang, and X. Tang (2016) Bridging music and image via cross-modal ranking analysis. IEEE Transactions on Multimedia 18 (7), pp. 1305–1318. Cited by: §2.1.
  • [58] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288–5296. Cited by: §1.
  • [59] R. Xu, C. Xiong, W. Chen, and J. Corso (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §1.
  • [60] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y. Sung, et al. (2020) Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 87–94. Cited by: §5.4.1.
  • [61] W. Yu, L. Wu, Q. Zeng, S. Tao, Y. Deng, and M. Jiang (2020) Crossing variational autoencoders for answer retrieval. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 5635–5641. Cited by: 3rd item, Table 2.
  • [62] Q. Zhang, Z. Lei, Z. Zhang, and S. Z. Li (2020) Context-aware attention network for image-text retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3536–3545. Cited by: §1.
  • [63] Z. Zhang, Z. Lin, Z. Zhao, and Z. Xiao (2019) Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 655–664. Cited by: §2.1.
  • [64] L. Zhen, P. Hu, X. Wang, and D. Peng (2019) Deep supervised cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10394–10403. Cited by: §2.1, 6th item, Table 2.