As an emerging type of user-generated content, micro-video drastically enriches people's entertainment experiences and social interactions. However, the popularity pattern of an individual micro-video remains elusive to researchers. One of the major challenges is that the potential popularity of a micro-video tends to fluctuate under the impact of various external factors, which makes it full of uncertainties. In addition, since micro-videos are mainly uploaded by individuals who lack professional production skills, multiple types of noise may obscure the useful information. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework for micro-video popularity prediction tasks. MMVED learns a stochastic Gaussian embedding of a micro-video that is informative about its popularity level while simultaneously preserving the inherent uncertainties. Moreover, through the optimization of a deep variational information bottleneck lower bound (IBLBO), the learned hidden representation is shown to be maximally expressive about the popularity target while maximally compressive of the noise in micro-video features. Furthermore, the Bayesian product-of-experts principle is applied to the multimodal encoder, where the decision to keep or discard information is made comprehensively with all available modalities. Extensive experiments conducted on a public dataset and a dataset we collected from Xigua demonstrate the effectiveness of the proposed MMVED framework.
In recent years, with the prevalence of social media, people have grown increasingly enthusiastic about spreading various forms of their creations on the Internet, and the consumption of user-generated content (UGC) has gradually become an indispensable daily entertainment activity for much of the population. Interestingly, of the vast number of UGCs posted on social media every day, most soon fall into oblivion, while some manage to attract a great deal of attention and disseminate widely, as reflected by comparatively large numbers of views, likes, comments, and reposts.
Various internal and external factors may affect the popularity of UGCs, including not only the quality of their content and the influence of their publishers, but even the release timing [26, 6]. Accurate prediction of the popularity level of UGC can be of great benefit: it allows service providers to make strategic decisions about the management of their resources, such as caching and network optimization, and provides users with more satisfactory personalized recommendations. Therefore, the popularity prediction of UGCs has received considerable attention among researchers.
Extensive work has been done on popularity prediction for traditional forms of UGCs, such as articles, news, and images. Generally speaking, early studies on the popularity prediction of UGC can be divided into two categories, i.e., feature-driven methods and generative (time-series-driven) methods. Feature-driven approaches first extracted a large number of features related to the UGC content, the profiles of users, or the social networks, and then trained machine learning models such as Support Vector Machines (SVMs) or Random Forests (RFs) to optimize a mapping from the feature space to a pre-defined popularity space. They mainly focused on feature engineering techniques and could achieve good performance, provided that the extracted features were effective. On the other hand, to analyze the popularity evolution pattern of UGCs over time, generative approaches utilized the temporal regularities of the early-stage popularity curve to fit an autoregressive model with linear or non-linear dynamics and predict its future trend [38, 16, 31]; most of them relied on strong hypotheses about the popularity accumulation mechanism. Recently, inspired by the outstanding performance of deep learning models, several deep learning-based methods for UGC popularity prediction have been proposed in both categories [5, 2], taking advantage of the massive representational power of convolutional neural networks (CNNs) to extract more powerful features and of recurrent neural networks (RNNs) to capture more complicated temporal relationships.
However, for micro-videos, an emerging form of UGC, little effort has been dedicated to their understanding. Compared to popularity prediction for traditional UGCs or professionally made videos such as movies, predicting the popularity of micro-videos faces its own challenges due to the following factors: (1) Heterogeneity: Aside from a short video that lasts from 6 to 60 seconds, several other components are usually attached, such as background music, titles, and hashtags, each describing the micro-video from a different perspective. Micro-videos are therefore far richer in information than other forms of UGC, which makes it more difficult to extract relevant features and fuse them effectively across modalities to explain the observed popularity trend. (2) External uncertainty: Even if features that are highly informative about the popularity of micro-videos could be properly extracted and fused, since micro-videos are made to gratify the mercurial taste of massive online audiences, the popularity of a micro-video can vary under the impact of many external uncertainties, such as the time of publishing, changes of trends, or influence from online celebrities and other social media. Thus, the observed popularity usually exhibits a certain amount of randomness. (3) Internal noise: Moreover, compared to the professional teams by whom movies are made, the uploaders of micro-videos usually lack expertise, such as film grammar, for conveying affective information to the audience, or lack professional devices to shoot the video, which may lead to poor quality in both the visual and acoustic content. Besides, when users upload micro-videos, they may make up sensational titles and tags that are irrelevant to the content purely for eye-catching effect, and user profiles can be falsified as well, which makes the textual and user features untrustworthy.
Therefore, in order to address these challenges, we propose a multimodal variational encoder-decoder (MMVED) framework for micro-video popularity prediction. Unlike the majority of previous approaches, where the mapping from the features of a piece of UGC to its popularity level is assumed to be deterministic, MMVED learns a stochastic Gaussian embedding of a micro-video that is informative for predicting its future popularity level, while simultaneously preserving the inherent randomness of the popularity caused by various external uncertainties. In particular, MMVED takes the maximization of a variational approximation to the information bottleneck as its objective, such that only relevant cues in the information-rich micro-video features are extracted into the hidden representation, ignoring the irrelevant and noisy parts. Specifically, in the multimodal encoder, we adopt the Bayesian product-of-experts strategy to fuse the modality-specific embeddings, comprehensively accounting for both the information heterogeneity and the differing uncertainty of all modalities. MMVED can also be shown to handle the modality-missing problem in the test phase in a parameter-efficient manner.
Please note that a preliminary conference version of this paper was presented at WWW 2020. Compared to the initial paper, which constructed the MMVED objective by adding the Variational Auto-Encoder (VAE) loss as an ad hoc regularizer to the vanilla encoder-decoder target, this manuscript rigorously proves that the MMVED objective in essence lower-bounds the information bottleneck objective through the deep variational information bottleneck (D-VIB) theory, which further reveals the underlying information-sifting and denoising mechanism of the proposed framework. Besides, we demonstrate both theoretically and empirically that various forms of decoders, such as an MLP with cross-entropy loss, an MLP with MSE loss, or an RNN, can be incorporated into the MMVED framework for the corresponding UGC popularity prediction tasks, such as popularity classification, regression, and temporal regression. Moreover, more advanced feature engineering techniques are utilized to characterize different aspects of the micro-video, taking advantage of recent advances in deep neural networks from both the computer vision and audio processing communities.
The remainder of this paper is organized as follows: Section 2 surveys related work on the popularity prediction of UGCs; Section 3 expounds the proposed MMVED framework in detail; Section 4 describes the adopted feature engineering techniques; Sections 5 and 6 evaluate the proposed model on two micro-video popularity prediction datasets and analyze the experimental results; finally, Section 7 concludes the paper.
Due to its importance in recommendation, advertising, and many other applications, popularity prediction of UGCs on social media has received considerable attention in both industry and academia. The first step in predicting the popularity of UGCs is to measure it numerically. Generally, the popularity of a piece of UGC is defined by the volume of positive responses it receives, which can be estimated by the numbers of views, likes, comments, and reposts. In practice, a weighted average of any combination of these indexes can adequately serve as an indicator of the popularity level of UGC, and such indicators are widely used in research to construct the groundtruth for UGC popularity prediction tasks.
For popularity prediction of online textual content, such as hashtags, microblogs, and articles, most studies focused on combining representative features from the textual content and the social context, such as the way users are linked, to train a popularity predictor. In one exemplary work, several content features from the hashtags of tweets were fused with contextual features from the user social graph to train multiple classifiers. Besides, for popularity prediction of Reddit comments, Zayats and Ostendorf extracted textual features from the comments and combined the user interaction graph with a long short-term memory (LSTM) network to model the influence of user connections and temporal evolution on the popularity of comments over time.
As for predicting the popularity of images and long videos (such as those posted on YouTube), where rich information is contained in the visual and aural modalities, multimodal learning that fuses information from different views is one of the most commonly used techniques. Li et al. introduced a novel propagation-based popularity prediction method considering both the intrinsic attractiveness of a video and the underlying propagation structure. Khosla et al. explored the relative significance of individual features, involving multiple visual features as well as various social context features, including the numbers of views and contacts. Cao et al. utilized two coupled neural networks to iteratively capture the cascading effect in information diffusion to predict future popularity. Trzcinski et al. predicted the popularity of a video from both visual cues and the early popularity pattern over a certain period after its release. However, none of these methods considered the cold-start scenario. To deal with this challenge, McParlane et al. took a content-based strategy and utilized only visual appearance and user context for Flickr image popularity prediction. Bielski et al. predicted the popularity of videos before publication by exploiting the spatio-temporal characteristics of videos through a soft self-attention mechanism, and intuitively interpreted the impact of content on video popularity with the Grad-CAM algorithm.
Although micro-videos are an emerging form of UGC, several pioneering works have been dedicated to their popularity prediction. Chen et al. first employed four types of heterogeneous features, i.e., visual, acoustic, textual, and social features, to describe the characteristics of micro-videos, and proposed a transductive multimodal learning model to regress the popularity index of micro-videos after their release. Trzciński and Rokita, on the other hand, focused on using the early pattern of the popularity curve to predict the trend thereafter, which extended the micro-video popularity prediction task to an online setting. Afterwards, Jing et al. thoroughly discussed the detrimental effects of internal noise on micro-video analysis studies, and augmented their multi-view learning method with a low-rank constraint, such that only a few principal patterns in the micro-video features are allowed to be kept in its representation.
The most salient characteristic that distinguishes our method from the above ones is that we also consider the inherent uncertainty in popularity and treat all external uncertain factors as randomness, which can be properly preserved when we learn the Gaussian embedding of the micro-videos. Besides, compared with prior work that also considered the adverse effect of noise, we resort to combining the representation learning power of deep neural networks with the denoising ability of the information bottleneck structure, where, for each sample, the information that is allowed to enter the hidden representation is actively learned and dynamically decided. Furthermore, in that work the popularity was a single numerical value, whereas our model is also suitable for tasks where the popularity is represented by a time-related sequence, which is more difficult since it requires understanding the hidden evolutionary pattern of the popularity trend.
Consider a micro-video popularity prediction dataset of the form $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i = \{\mathbf{x}_i^{(m)}\}_{m=1}^{M}$ denotes the features extracted from $M$ modalities, and $\mathbf{y}_i$ is the corresponding popularity groundtruth, which can take one of the following three forms: a binary label for classification, a scalar for regression, or a sequence for temporal regression. Our goal is to learn a hidden representation of a micro-video, based on the integrated information from all the modalities, that is highly informative about its potential popularity level while being robust to the external uncertain factors and the internal feature noise. The main notations used in Section 3 are summarized in Table 1 for convenience of reference.
|$M$|number of modalities in a micro-video|
|$N$|number of samples in the dataset|
|$L$|number of samples used to estimate the gradient|
|$D$|dimension of the Gaussian embedding|
|$\mathbf{x}$|variable for the set of multimodal features|
|$\mathbf{y}$|variable for the popularity groundtruth|
|$\mathbf{z}$|variable for the Gaussian embedding|
|$\boldsymbol{\mu}$|mean of the Gaussian embedding|
|$\boldsymbol{\sigma}$|standard deviation of the Gaussian embedding|
|$\phi$|trainable parameters of the encoder|
|$\theta$|trainable parameters of the decoder|
|$\beta$|weight in the training objective|
|$I(\cdot\,;\cdot)$|mutual information operator|
|ELL|expected log-likelihood term of the IBLBO|
|IBLBO|information bottleneck lower bound|
As the hidden representation $\mathbf{z}$ is derived from the micro-video features $\mathbf{x}$, whereas the popularity $\mathbf{y}$ is predicted based on $\mathbf{z}$, their relationship can be represented by the Bayesian chain $\mathbf{x} \rightarrow \mathbf{z} \rightarrow \mathbf{y}$ and modeled by an encoder-decoder framework. Since the mapping from the content of a video to its popularity may not be deterministic, due to unpredictable external factors, it is reasonable to introduce some randomness and represent both the encoder and the decoder as probabilistic distributions. In this paper, we assume the hidden representation lies in a $D$-dimensional Gaussian space, and use $p_\phi(\mathbf{z}\mid\mathbf{x})$ and $q_\theta(\mathbf{y}\mid\mathbf{z})$ to denote the encoder and decoder distributions, respectively.
Due to the generally low quality of micro-videos, not all information from the input modalities is beneficial for predicting their popularity, and thus the noise in $\mathbf{x}$ should not be extracted into the hidden representation. Towards this end, we first define a common criterion for measuring the relevance level of one random variable to another, i.e., the mutual information (MI), which is formulated as follows:

$$I(\mathbf{a};\mathbf{b}) = \int p(\mathbf{a},\mathbf{b}) \log \frac{p(\mathbf{a},\mathbf{b})}{p(\mathbf{a})\,p(\mathbf{b})}\, d\mathbf{a}\, d\mathbf{b}. \quad (1)$$
Since the learned embedding is expected to pick up only the cues in the input that are relevant to the popularity while ignoring the noisy distractors, we can utilize the MI to construct the following constrained optimization objective:

$$\max_{\phi}\; I(\mathbf{z};\mathbf{y}), \quad \text{s.t.}\;\; I(\mathbf{z};\mathbf{x}) \le I_c, \quad (2)$$

where $I_c$ is the maximum amount of information about $\mathbf{x}$ that $\mathbf{z}$ is allowed to contain. The equation above forces the stochastic embedding to be expressive of $\mathbf{y}$ while being compressive of $\mathbf{x}$ at the same time, which provides a mechanism for the model to predict the true popularity level while accessing a minimal amount of information from the input, thereby blocking the noise. By introducing a Lagrange multiplier $\beta$, the constrained optimization problem above can be proved equivalent to the following unconstrained one, which has the same form as the information bottleneck first proposed by Tishby et al.:

$$\max_{\phi}\; I(\mathbf{z};\mathbf{y}) - \beta\, I(\mathbf{z};\mathbf{x}). \quad (3)$$
Intuitively, $\beta$ determines the degree of penalty associated with keeping more information about $\mathbf{x}$ in $\mathbf{z}$, controlling the trade-off between using minimal information from the input and predicting the popularity target.
Since the decoder distribution can take the form of any valid conditional distribution, most of which are not even differentiable, it is intractable to directly calculate the two MI terms in Eq. 3 according to Eq. 1 and optimize them. Therefore, we resort to the variational approach, which posits that the decoder comes from a tractable family of distributions and finds the member of that family closest to the optimal decoder distribution as measured by the KL-divergence. According to the deep variational information bottleneck (D-VIB) theory, for any $\beta > 0$, Eq. 3 can be proved to be lower-bounded by:

$$\mathcal{L}_{\mathrm{IBLBO}} = \mathbb{E}_{p(\mathbf{x},\mathbf{y})}\Big[\,\mathbb{E}_{p_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log q_\theta(\mathbf{y}\mid\mathbf{z})\big] - \beta\, \mathrm{KL}\big(p_\phi(\mathbf{z}\mid\mathbf{x})\,\|\,r(\mathbf{z})\big)\Big], \quad (4)$$

where $r(\mathbf{z})$ is an uninformative prior. We term this bound the Information Bottleneck Lower BOund (IBLBO). The IBLBO has the same form as the objective we used in the conference version; however, there the objective was constructed by the ad hoc addition of a VAE loss to the encoder-decoder objective. The full proof of the IBLBO can be found in Appendix A. Hence, we can optimize the IBLBO as a surrogate for the intractable Eq. 3. Note that the outer expectation of Eq. 4 is taken w.r.t. the data distribution $p(\mathbf{x},\mathbf{y})$. Given that the true data distribution is usually inaccessible, the empirical data distribution is used instead, and in the rest of the paper we omit this outer expectation for simplicity. The IBLBO resembles the likelihood-prior trade-off commonly found in statistics: the first term in the objective is the expected log-likelihood, which encourages the probability density of the encoder to be placed where the hidden representation best explains the observed popularity trend, whereas the second term is the KL-divergence between the encoder distribution and the uninformative prior, which penalizes the deviation of the encoder from the prior for keeping excessive information about the input and thus serves as a regularizer. We name the model represented by Eq. 4 the Multimodal Variational Encoder-Decoder (MMVED) for the rest of the paper.
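To make the likelihood-prior trade-off concrete, the per-sample IBLBO with a diagonal-Gaussian encoder and a standard-normal prior can be sketched in a few lines of NumPy. This is a minimal illustration under those assumptions; the function and variable names are ours, not taken from any released code.

```python
import numpy as np

def iblbo(mu, log_sigma, ell, beta):
    """Per-sample IBLBO = expected log-likelihood - beta * KL.

    mu, log_sigma : parameters of the diagonal-Gaussian encoder p(z|x)
    ell           : Monte Carlo estimate of E_{p(z|x)}[log q(y|z)]
    beta          : information-bottleneck trade-off weight
    """
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), available in closed form
    kl = 0.5 * np.sum(mu ** 2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma)
    return ell - beta * kl
```

When the encoder matches the prior (zero mean, unit variance), the KL penalty vanishes; a larger beta penalizes prior-deviating, information-rich encodings more heavily.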
The stochastic encoder $p_\phi(\mathbf{z}\mid\mathbf{x})$ for the micro-video is parameterized as a deep neural network (DNN), more specifically a multi-layer perceptron (MLP), which is basically a stack of fully connected layers with intermediate activations. A naive way to construct the encoder network is to adopt the early fusion strategy, which takes the concatenated features from all modalities as its input and outputs the parameters of the encoder distribution, i.e., its mean $\boldsymbol{\mu}$ (semantic part) and the logarithm of its standard deviation $\log\boldsymbol{\sigma}$ (uncertainty and noise part):

$$\big[\boldsymbol{\mu},\, \log\boldsymbol{\sigma}\big] = \mathrm{MLP}_\phi\big(\big[\mathbf{x}^{(1)}; \cdots; \mathbf{x}^{(M)}\big]\big). \quad (5)$$
However, such an approach was shown by previous work to have unsatisfactory performance for the micro-video popularity prediction task, since it is unable to account for the relatedness among multiple modalities. Thus, in order to infer $\mathbf{z}$ based on the complementary information from all modalities, inspired by recent advances in multimodal variational inference frameworks, we assume conditional independence among the features of each modality given the hidden embedding, i.e., $p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \mid \mathbf{z}) = \prod_{m=1}^{M} p(\mathbf{x}^{(m)} \mid \mathbf{z})$. Then, as was shown previously, the joint inference distribution can be factorized into the product of modality-specific encoder distributions, $p_\phi(\mathbf{z} \mid \mathbf{x}) \propto \prod_{m=1}^{M} p_{\phi_m}(\mathbf{z} \mid \mathbf{x}^{(m)})$, which is a typical Bayesian product-of-experts (PoE) system. We refer interested readers to the cited work for the full proof of the above deductions; some of the most important steps are included in Appendix B to keep our paper self-contained. Such a factorization means that we can first use the modality-specific encoder networks to compute the parameters of the latent representation for each modality as follows:

$$\big[\boldsymbol{\mu}_m,\, \log\boldsymbol{\sigma}_m\big] = \mathrm{MLP}_{\phi_m}\big(\mathbf{x}^{(m)}\big), \quad m = 1, \ldots, M. \quad (6)$$
Then, by using the properties of the multivariate Gaussian, the parameters of the hidden representation for the whole micro-video can be calculated as:

$$\boldsymbol{\sigma}^2 = \Big(\sum_{m=1}^{M} \boldsymbol{\sigma}_m^{-2}\Big)^{-1}, \qquad \boldsymbol{\mu} = \boldsymbol{\sigma}^2 \odot \sum_{m=1}^{M} \boldsymbol{\sigma}_m^{-2} \odot \boldsymbol{\mu}_m, \quad (7)$$

where $\odot$ is the element-wise product operation. Eq. 7 shows that, under the conditional independence assumption, the semantic part $\boldsymbol{\mu}$ of the video-level hidden representation is in essence the average of the modality-specific means $\boldsymbol{\mu}_m$ weighted by the reciprocals of the corresponding variances $\boldsymbol{\sigma}_m^2$. With such a calculation, experts with greater precision, which indicates that the information from their associated modalities bears less uncertainty and more relevance to the popularity prediction task, have more influence (larger weights) on the overall hidden representation than those with higher variance. Another by-product of adopting this factorization is that it easily handles the modality-missing problem in the test phase: under such cases, the only change to our framework is to remove the corresponding modality-specific encoder networks and recompute the PoE distribution with the available modalities, eliminating the need for additional inference networks and multi-stage training regimes.
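The PoE fusion rule amounts to precision-weighted averaging, which can be sketched as follows (illustrative NumPy only; the function name is ours):

```python
import numpy as np

def poe_fuse(mus, sigmas):
    """Fuse modality-specific Gaussian experts N(mu_m, sigma_m^2)
    into a single Gaussian via the product-of-experts rule."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    precision = 1.0 / sigmas ** 2             # low-variance experts dominate
    var = 1.0 / precision.sum(axis=0)         # fused variance
    mu = var * (precision * mus).sum(axis=0)  # precision-weighted mean
    return mu, np.sqrt(var)
```

With two experts of equal variance the fused mean is the plain average, while a low-variance (confident) expert pulls the fused mean toward its own, matching the intuition described above.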
The choice of the tractable family of decoder distributions depends on the specific popularity prediction task at hand; below we discuss three commonly encountered tasks and the corresponding decoder structures.
If the popularity groundtruth $y$ is represented as a binary variable, with 1 indicating popular and 0 otherwise, the family of decoder distributions can be chosen as an MLP whose output is squashed through a Sigmoid function as follows:

$$q_\theta(y \mid \mathbf{z}) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}, \qquad \hat{y} = \mathrm{Sigmoid}\big(\mathrm{MLP}_\theta(\mathbf{z})\big), \quad (8)$$

where $\hat{y} \in (0, 1)$. Maximizing the expected log-likelihood term in Eq. 4 is then equivalent to minimizing the binary cross-entropy loss between the MLP prediction and the groundtruth.
However, representing the popularity level of a micro-video as a binary variable is too coarse in granularity, failing to discriminate micro-videos with different degrees of popularity. Therefore, it is more common to define $\mathbf{y}$ as a continuous variable. Under such circumstances, in order to calculate the ELL, we can make the following assumption:

$$q_\theta(\mathbf{y} \mid \mathbf{z}) = \mathcal{N}\big(\mathbf{y};\, \mathrm{MLP}_\theta(\mathbf{z}),\, \mathbf{I}\big), \quad (9)$$

which is equivalent to assuming that the output of the MLP specifies the mean of a unit-variance Gaussian distribution. Under this assumption, the ELL term in Eq. 4 can be rewritten into the following form:

$$\mathbb{E}_{p_\phi(\mathbf{z}\mid\mathbf{x})}\big[\log q_\theta(\mathbf{y} \mid \mathbf{z})\big] = -\tfrac{1}{2}\,\mathbb{E}_{p_\phi(\mathbf{z}\mid\mathbf{x})}\big[\lVert \mathbf{y} - \mathrm{MLP}_\theta(\mathbf{z}) \rVert_2^2\big] + C, \quad (10)$$

where $C$ is a constant, and maximizing it is equivalent to minimizing the mean squared error (MSE) loss commonly used in regression tasks.
Moreover, if the target is a popularity sequence observed at fixed intervals after the post of a micro-video, the tractable family of the variational decoder can be chosen as a recurrent neural network (RNN). An RNN captures the dynamic patterns of a sequence by maintaining a hidden state, whose value at one timestep is composed of information passed on from the previous timestep and new information incorporated from the outside. An RNN comprises two parts: (1) a temporal dynamic that determines the internal evolution of the hidden state; (2) a mapping from the RNN state to the output. Mathematically, the two parts of an RNN can be formulated as:

$$\mathbf{h}_t = \tanh\big(\mathbf{W}_{ih}\mathbf{u}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1}\big), \qquad \hat{\mathbf{y}}_t = \mathbf{W}_{ho}\mathbf{h}_t, \quad (11)$$

where $\mathbf{W}_{ih}$, $\mathbf{W}_{hh}$, and $\mathbf{W}_{ho}$ are the input-state, state-state, and state-output transition matrices, respectively. For a vanilla RNN, $\mathbf{z}$ can be directly taken as its initial state $\mathbf{h}_0$. However, if a long short-term memory (LSTM) network, which addresses the exploding-gradient issues of vanilla RNNs by introducing a gating mechanism, is used, $\mathbf{z}$ can be transformed through two distinct dense layers to obtain its initial context variable and hidden state separately. The input $\mathbf{u}_t$ at each timestep is set to the standardized absolute time of that prediction timestep (which is different from the RNN timestep, which denotes the relative time from the start), based on the observation that the popularity level of a micro-video tends to vary in a regular fashion at different times of the day. Besides, in order to avoid error accumulation due to exposure bias, the predicted result from the previous timestep is not fed into the RNN cell at the next timestep as an additional input.
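A minimal vanilla-RNN temporal decoder of this kind, with the embedding as the initial state and the standardized absolute time as the per-step input, can be sketched as follows. The weights here are random placeholders (they would be trained in practice), and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 24                       # embedding size, prediction horizon

# randomly initialized transition matrices (trained in practice)
W_ih = rng.normal(0, 0.1, (D, 1))  # input -> state
W_hh = rng.normal(0, 0.1, (D, D))  # state -> state
W_ho = rng.normal(0, 0.1, (1, D))  # state -> output

def decode_sequence(z, times):
    """Roll the RNN forward from initial state z over the given
    standardized absolute times; no predicted value is fed back."""
    h, outputs = z, []
    for t in times:
        h = np.tanh(W_ih @ np.array([t]) + W_hh @ h)  # temporal dynamics
        outputs.append(float(W_ho @ h))               # state-to-output map
    return np.array(outputs)

y_hat = decode_sequence(np.zeros(D), np.linspace(-1.0, 1.0, T))
```

Note that, consistent with the exposure-bias discussion above, each step consumes only the time input and the carried hidden state, never the previous prediction.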
Similar to the regression task, we can assume that the output of the RNN at each timestep specifies the mean of a unit-variance Gaussian variable for the popularity level at that timestep:

$$q_\theta(\mathbf{y}_t \mid \mathbf{z}) = \mathcal{N}\big(\mathbf{y}_t;\, \hat{\mathbf{y}}_t,\, \mathbf{I}\big). \quad (12)$$

Then, utilizing the local Markov assumption of RNNs, i.e., that the output at each timestep depends on $\mathbf{z}$ only through the current hidden state, the joint probability of the whole popularity sequence can be factorized into the product of the per-timestep conditional probabilities (Eq. 12). Therefore, the following equation holds:

$$\log q_\theta(\mathbf{y} \mid \mathbf{z}) = \sum_{t=1}^{T} \log q_\theta(\mathbf{y}_t \mid \mathbf{z}) = -\tfrac{1}{2}\sum_{t=1}^{T} \lVert \mathbf{y}_t - \hat{\mathbf{y}}_t \rVert_2^2 + C, \quad (13)$$

and maximizing it is equivalent to minimizing the sum of the per-step MSE losses.
By now, each component of the MMVED framework is fully specified, and for the rest of the paper we distinguish the MMVED variants for classification, regression, and temporal regression as MMVED-CLS, MMVED-REG, and MMVED-TMP. Figure 1 schematically illustrates the structure of the proposed MMVED framework.
Optimizing the IBLBO with respect to the network parameters is not trivial, since its gradients in essence involve gradients of expectations over probabilistic distributions, which precludes us from calculating them analytically. As a result, Monte Carlo (MC) methods are introduced to form unbiased estimators of the gradients.
We assume that the encoder and decoder networks are parameterized by $\phi$ and $\theta$, respectively. Since the KL term in the IBLBO can be calculated analytically, the only gradient that needs to be computed by sampling is that of the expected log-likelihood term (ELL). Observing that the ELL is explicitly in expectation form, its gradient w.r.t. $\theta$ can be easily estimated with MC by averaging the gradients computed from $L$ samples drawn from the encoder distribution:

$$\nabla_\theta \mathrm{ELL} \simeq \frac{1}{L}\sum_{l=1}^{L} \nabla_\theta \log q_\theta\big(\mathbf{y} \mid \mathbf{z}^{(l)}\big), \quad (15)$$

where the symbol $\simeq$ means that the RHS is an unbiased estimator of the LHS, and $\mathbf{z}^{(l)}$ is the $l$-th sample drawn from the encoder distribution $p_\phi(\mathbf{z}\mid\mathbf{x})$. The unbiased estimator of the gradient w.r.t. $\phi$ is more difficult to obtain, since there is no obvious way to explicitly rewrite it in expectation form. However, with the reparameterization trick, samples from $p_\phi(\mathbf{z}\mid\mathbf{x})$ can be formed by a transformation of a deterministic part and a stochastic noise of the following form:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad (16)$$

where $\boldsymbol{\epsilon}$ is a random vector of the same size as $\mathbf{z}$ drawn from the standard normal distribution. Eq. 16 means that the gradient of the ELL w.r.t. $\phi$ can be estimated by drawing samples from the noise distribution and averaging:

$$\nabla_\phi \mathrm{ELL} \simeq \frac{1}{L}\sum_{l=1}^{L} \nabla_\phi \log q_\theta\big(\mathbf{y} \mid \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}^{(l)}\big). \quad (17)$$

As was discussed in prior work, the variance of the reparameterization trick is low enough that a single sample suffices for training to converge. Thus, $L$ in both Eq. 17 and Eq. 15 is set to one. Eq. 16 can also be viewed from a denoising perspective: it shows that during training, the hidden representations are injected with self-adaptive random Gaussian noise controlled by the standard deviation $\boldsymbol{\sigma}$, which previous work has shown to be more robust to noise in the test phase than adding deterministic noise in the feature space, and which provides a systematic way to model the latent noise generation mechanism incurred by the low quality of micro-videos. The specific training steps of the proposed MMVED are summarized in Algorithm 1.
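The reparameterized sampler used during training can be sketched as follows (illustrative NumPy; names are ours). Because the noise enters through a deterministic transformation of the encoder parameters, gradients can flow through the mean and log-standard-deviation:

```python
import numpy as np

def sample_z(mu, log_sigma, rng):
    """Reparameterized draw z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)  # stochastic part, parameter-free
    return mu + np.exp(log_sigma) * eps  # deterministic transform of (mu, sigma)

rng = np.random.default_rng(0)
mu, log_sigma = np.array([1.0, -2.0]), np.array([-1.0, -1.0])
draws = np.stack([sample_z(mu, log_sigma, rng) for _ in range(20000)])
```

Averaging many draws recovers the encoder mean and standard deviation, confirming that the transformation indeed samples from the intended Gaussian; at test time, MMVED simply keeps the mean as the fixed representation.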
Although MMVED exhibits a certain amount of randomness to cope with the interference of uncertainty in popularity during the training phase, its behavior in validation and prediction is designed to be deterministic, so that the results of different rounds of prediction for the same micro-video are consistent. Since the mean of the Gaussian variable carries the information whereas the standard deviation preserves the uncertainty, after the training of MMVED, the mean of the Gaussian embedding is kept as the fixed representation of a newly released micro-video. The extracted representation is then fed into the decoder to predict the popularity.
In this section, we discuss the form of $\mathbf{x}$, i.e., the multimodal representation of the micro-video. Previous work has confirmed that whether a micro-video comes into fashion after its release is closely related to its visual and aural content, its attached descriptions, and the profile of its publisher. Thus, features are extracted from four different aspects, namely visual, acoustic, textual, and social, as the multimodal characterization of the micro-video.
In order to describe the visual content of the micro-video, we utilize a state-of-the-art convolutional neural network (CNN), which has achieved great success in various computer vision tasks such as image classification, object detection, and action recognition. Specifically, we keep the convolutional base of ResNet50 pre-trained on ImageNet as a fixed feature extractor.
As has been shown previously, the last layers of CNNs pre-trained on ImageNet encode information about the existence of certain objects and their relationships, which we believe can benefit the popularity prediction of micro-videos, since certain objects, such as delicious food and cute pets, are naturally more attractive than others. The detailed feature extraction steps are as follows: for each micro-video, we first extract its keyframes with the FFmpeg toolkit (http://ffmpeg.org/). Then, for each keyframe, a 2,048-D activation is obtained from the global average pooling layer of the pre-trained ResNet. In the end, the extracted activations are temporally averaged to obtain a fixed-length representation of the whole micro-video.
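The final pooling step, which averages the per-keyframe CNN activations into one video-level vector, is simply the following (illustrative sketch; the 2,048-D activations here are random stand-ins for real ResNet50 features):

```python
import numpy as np

rng = np.random.default_rng(0)
num_keyframes = 12
# one 2,048-D global-average-pooling activation per keyframe
frame_feats = rng.normal(size=(num_keyframes, 2048))
# temporal average pooling -> fixed-length video-level representation
video_feat = frame_feats.mean(axis=0)
```

The averaged vector has a fixed length regardless of how many keyframes a micro-video contains, which is what allows videos of different durations to share one encoder input dimension.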
The audio modality of a micro-video usually includes two key types of information that can help predict its popularity: the original audio track, which provides information complementary to the visual elements, such as the tone and speech of the protagonists; and the accompanying background music, which gives a strong hint about the affective state of the uploader. Similarly, we use a recent CNN specifically designed for audio spectrograms to extract the aural features; such networks have shown promising results in micro-video recommendation, hashtag recommendation, and video affective content analysis.
In particular, for each micro-video with $K$ keyframes, we first split the audio track into $K$ non-overlapping one-second windows whose centers precisely match the positions of the corresponding keyframes. Then, the spectrogram of each audio window is computed with open-sourced code and fed into VGGish pre-trained on AudioSet to extract a 128-dimensional deep feature. The features for the same micro-video are globally pooled to obtain the video-level aural features.
Besides, micro-videos are usually accompanied by textual descriptions, such as titles, remarks, or hashtags, which summarize the important content and emotional feelings the uploaders wish to convey, or accentuate the salient characteristics of the micro-video. The attached texts thus provide another aspect to support popularity prediction. For example, a title containing "funny moments of Husky!" suggests that the micro-video is intended to resonate with a large population of pet lovers, whereas a hashtag like "AdvancedTheoryDiscovery" indicates that the posted video is specially made for a small group.
For the majority of widely used languages, mature natural language processing (NLP) toolkits have been developed and made public by NLP specialists. For example, the FudanNLP toolkit (https://github.com/FudanNLP/nlp-beginner/), one of the most commonly used toolkits for Chinese natural language processing, can classify texts with models pre-trained on large corpora and extract a 20-D textual feature vector for each sentence. As for English, sentence2vec (https://github.com/klb3713/sentence2vec/) is able to embed English sentences of variable length into a hidden space based on their semantic similarities. Besides, the Stanford CoreNLP toolkit (http://stanfordnlp.github.io/CoreNLP/) is equipped with a broad range of text analysis packages, one of which classifies input texts into five sentiment classes with a pre-trained Sentiment Treebank model. These toolkits have been shown to be handy for extracting reliable textual features that support micro-video popularity prediction [6, 19].
In addition to the features that are directly linked to the micro-video content, the profile of its publisher could also provide key information for predicting its potential popularity. For instance, uploaders who have more followers or have their accounts verified may have more influence, and their productions tend to attract more attention among viewers, in comparison with the common users.
As different micro-video sharing platforms unveil different attributes of the uploaders to the public, features of the social modality may vary from platform to platform. Nonetheless, we summarize several useful and widely available social characteristics of the publisher: the follower and followee counts, the total post count, the loop count, the verification status, etc., which mainly portray the degree of influence of the uploader.
Finally, the features extracted from all four modalities are standardized to zero mean and unit variance, to eliminate scale bias and stabilize the training of the MMVED network.
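A minimal sketch of the per-feature standardization, assuming statistics are computed on the training split and reused at test time (a standard convention; the paper does not spell this out):

```python
import numpy as np

def standardize(train_feats, test_feats, eps=1e-8):
    """Zero-mean / unit-variance scaling per feature dimension.
    Statistics come from the training set only, to avoid leakage."""
    mu = train_feats.mean(axis=0)
    sigma = train_feats.std(axis=0)
    scale = np.where(sigma > eps, sigma, 1.0)  # guard constant dims
    return (train_feats - mu) / scale, (test_feats - mu) / scale

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))
train_z, test_z = standardize(train, test)
print(np.allclose(train_z.mean(axis=0), 0.0, atol=1e-10),
      np.allclose(train_z.std(axis=0), 1.0, atol=1e-10))
```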
We first focus on the micro-video popularity regression task with the MMVED-REG model on the NUS dataset. The NUS micro-video popularity regression dataset (http://acmmm2016.wixsite.com/micro-videos/) was established and released by researchers from the Lab for Media Search (LMS) at the National University of Singapore. The dataset contains 303,242 micro-videos collected between July 2015 and October 2015 from Vine (https://vine.co/), a then widely-used micro-video sharing platform; most of the videos last 6-8 seconds. The stabilized values of the comment, repost, like, and loop counts are recorded and averaged to form the sole popularity score of a micro-video. Unfortunately, at the time of our experiments, a proportion of the links to micro-videos in the NUS dataset were invalid, and we could successfully download only 186,637 of them. Therefore, for a fair comparison, we keep the same number of test samples as previous papers (30,324, 10% of 303,242) [6, 19], and set aside another 30,324 samples for validation. The models are tested over five random splits of the dataset, and the averaged prediction performances are reported.
In our implementation of the MMVED-REG model, for the visual, acoustic, textual, and social modalities, the numbers of units in the hidden layers of each modality-specific MLP encoder are empirically set to 32 and 8+8 (for the mean and log-std, respectively). We use Adam as the SGD optimizer, with a linearly decayed learning rate. Training stops after 50 epochs. As the regularization coefficient β is very important to the performance of our framework, we first discuss its impact in Section 5.2.2, and then fix it to its empirically optimal value in the subsequent sections.
For our method to be comparable with previous work, we follow the normalized mean squared error (nMSE) first proposed in prior work to measure the performance of our model. The nMSE metric is defined as:

nMSE = (1 / (n · σ²)) Σ_{i=1}^{n} (y_i − ŷ_i)²,

where y_i and ŷ_i are the real and predicted popularity scores of the i-th micro-video sample, and σ is the standard deviation of the popularity groundtruth. Intuitively, nMSE re-scales the ordinary MSE by the groundtruth variation, which eliminates the bias incurred by varying groundtruth variance across dataset splits.
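A minimal numpy sketch of the nMSE metric as described above; the function name is ours:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized MSE: ordinary MSE divided by the variance of the
    groundtruth, so splits with different popularity spread are
    comparable."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(nmse(y, y))                          # 0.0 for a perfect prediction
print(nmse(y, np.full_like(y, y.mean())))  # 1.0 for a constant predictor
```

Note that a constant predictor scores exactly 1.0, which gives the metric an intuitive reference point.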
As aforementioned, the role β plays can be viewed from two aspects. First, Eq. 3 shows that β controls the penalty for encoding an extra unit of information from the input modalities into the hidden representation z. Second, Eq. 4 shows that β is the weight of the KL term and thus controls the penalty for the deviation of the encoder distribution from the standard Gaussian prior. Both views can be unified in the sense that the closer the encoder distribution is to the uninformative prior, the less information about the input would be left in z.
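The two views of β can be made concrete with a small sketch of a β-weighted objective for a diagonal-Gaussian encoder; the function names are ours, and the reconstruction term is a placeholder scalar:

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, logstd):
    """KL( N(mu, diag(exp(logstd)^2)) || N(0, I) ), summed over dims."""
    var = np.exp(2.0 * logstd)
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - 2.0 * logstd)

def ib_loss(recon_nll, mu, logstd, beta):
    """beta-weighted information-bottleneck objective: reconstruction
    term plus beta times the KL penalty on the encoder distribution."""
    return recon_nll + beta * kl_diag_gaussian_to_standard(mu, logstd)

# When the encoder equals the prior, the KL penalty vanishes
mu, logstd = np.zeros(8), np.zeros(8)
print(kl_diag_gaussian_to_standard(mu, logstd))  # 0.0
print(ib_loss(1.5, mu, logstd, beta=0.7))        # 1.5
```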
In order to find the optimal value of β, we train MMVED-REG with different settings of β, record the training dynamics, and evaluate the performance on the test set. The results are illustrated in Figure 2. As Figure 2 shows, the test performance generally increases first and then drops as β grows. This phenomenon can be explained as follows: when β is too small, the information bottleneck constraint on the Gaussian representation is relaxed, and information flows unbridled from the noisy input modalities into the hidden representations, which makes training unstable, oscillating, and hard to converge (the blue curve). On the other hand, when β grows too large, an excessive constraint is placed upon the hidden embedding, and useful information for popularity prediction is blocked out along with the noise; the multimodal encoder is then unable to learn a good representation from the micro-video content, which leads the model to converge to a suboptimum (the red curve) compared to a model trained with a suitable β (the green curve).
Besides, we notice an interesting phenomenon: for our model, an insufficient information constraint (β too small) is worse than an oversized information constraint (β too large). Such results could reveal that the information-rich but low-quality nature of micro-videos renders the extracted multimodal features low in signal-to-noise ratio, which not only interferes with but sabotages the training process and generalization ability of the popularity prediction model. Therefore, based on the analysis above, we fix β of the MMVED-REG to its empirically optimal value 0.7 when drawing the comparison with other SoTA methods on the NUS dataset in the next section.
The SoTA methods that are selected for the comparison with the proposed MMVED-REG model are listed below:
TMALL. The transductive multi-modal learning model  regresses the popularity of micro-videos through learning embeddings of micro-videos where the hidden space is constrained to be consistent with the multimodal feature spaces, such that the semantic characteristics of micro-videos are preserved.
TRLMVR. The transductive low-rank multi-view regression model  is an extension of TMALL, where a novel low-rank constraint is set upon the learned micro-video embeddings such that only principal components of the feature space are allowed to be kept in the representations.
The proposed MMVED-REG: nMSE = 0.914. The best results are in bold.
Table 2 reports the performance of the proposed MMVED-REG and the SoTA algorithms. From Table 2, we first notice that, compared to ELM, where features from different modalities are simply concatenated, TMALL utilizes a multi-view method to fuse heterogeneous features from four modalities subject to a consistency constraint, and it achieves a better result than ELM. TRLMVR further improves on TMALL by adding a low-rank constraint on the hidden space to the multi-view learning objective, such that the insignificant components of the feature spaces are removed, which leads to substantial improvements over TMALL. However, in TRLMVR, the low-rank constraint is hardwired and universal for all samples, so samples with different noise levels cannot be properly distinguished. Besides, the hidden representation of TRLMVR is inherently deterministic. In contrast, in our MMVED-REG model, we first preserve the uncertainty of the popularity mapping by modeling the hidden representation as a Gaussian variable. In addition, for each micro-video, the mean (semantic part) and std (uncertainty and noise part) of its Gaussian embedding are independently inferred through the PoE inference network. Furthermore, an information bottleneck constraint is placed upon the Gaussian embeddings, such that the relevant information obscured by the information-rich but noisy micro-video content can be actively learned and extracted into the hidden representation under the guidance of the training popularity groundtruth. Therefore, our method achieves the best results among the SoTA methods.
Next, our discussion shifts to the temporal popularity prediction of micro-videos on the Xigua dataset. The Xigua micro-video dataset is built from Xigua (https://www.ixigua.com/), a famous micro-video sharing platform in China, whose distinguishing characteristic is that the increments of loops for each micro-video are continuously recorded every 15 minutes immediately after its release for 3 days, which makes the popularity groundtruth a sequence. The detailed crawling strategy and dataset description can be found in the cited work. The Xigua dataset includes 3,231,072 records from 11,219 micro-videos posted by 2,664 users between July 24th and August 14th, 2019. In the Xigua dataset, the ResNet visual feature is reduced to 128-D by PCA to prevent overfitting, and the follower/followee counts of the uploader together with the verification status are recorded as the 3-D social feature.
Since predicting popularity at a 15-minute granularity is too fine for practical use, the popularity sequences are re-sampled such that the recording interval is equivalent to 8 hours for the main part of the experiments. Nonetheless, the influence of the re-sampling interval will be thoroughly discussed in Section 6.2.4. The reported performances, unless otherwise specified, are the averaged results over five splits, each of which randomly selects 64% of the micro-videos for training, 16% for validation, and 20% for testing. For the implementation of MMVED-TMP, the encoder structure and the training strategy remain similar to MMVED-REG. Besides, the dimension of the hidden units of the variational RNN decoder is empirically set to 8.
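The re-sampling step can be sketched as follows, under the assumption that coarser popularity increments are obtained by summing the 15-minute increments within each bin (one plausible aggregation; the paper does not state the exact rule):

```python
import numpy as np

def resample_increments(increments, interval_hours=8, step_minutes=15):
    """Aggregate 15-minute loop-count increments into coarser bins by
    summing, e.g. 288 quarter-hour points over 3 days become 9 points
    at an 8-hour interval."""
    per_bin = int(interval_hours * 60 / step_minutes)
    increments = np.asarray(increments, dtype=np.float64)
    n_bins = len(increments) // per_bin
    return increments[:n_bins * per_bin].reshape(n_bins, per_bin).sum(axis=1)

seq = np.ones(288)  # 3 days of 15-minute increments
coarse = resample_increments(seq)
print(coarse.shape, coarse[0])  # (9,) 32.0
```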
Two metrics are utilized in our work to evaluate the performance of different temporal popularity prediction methods. The first is the temporal normalized mean squared error (nMSE-TMP), a variant of nMSE suitable for measuring the closeness between two sequences. The nMSE-TMP metric utilized in our paper is defined as:

nMSE-TMP = (1/N) Σ_{i=1}^{N} [ (1/T) Σ_{t=1}^{T} (y_{i,t} − ŷ_{i,t})² ] / std(y_i)²,

where y_i and ŷ_i are the real and predicted popularity sequences of the i-th micro-video sample in the dataset, T is the sequence length, and std(·) stands for the standard deviation operator. The nMSE-TMP index rescales the mean squared prediction error by the groundtruth variance, which alleviates the bias due to the different degrees of fluctuation among popularity sequences and the difference in variation among dataset splits.
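A numpy sketch of this per-sequence normalization; the function name is ours:

```python
import numpy as np

def nmse_tmp(Y_true, Y_pred):
    """Temporal nMSE: per-sequence MSE divided by that sequence's
    groundtruth variance, averaged over all sequences."""
    Y_true = np.asarray(Y_true, dtype=np.float64)
    Y_pred = np.asarray(Y_pred, dtype=np.float64)
    mse = np.mean((Y_true - Y_pred) ** 2, axis=1)  # per-sequence MSE
    var = np.var(Y_true, axis=1)                   # per-sequence variance
    return np.mean(mse / var)

Y = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
print(nmse_tmp(Y, Y))  # 0.0 for perfect predictions
```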
Besides, observing that two sequences that are close under nMSE-TMP can still exhibit opposite trends, we adopt Spearman's rank correlation (SRC) as a complementary metric, defined as follows:

SRC = 1 − 6 Σ_{t=1}^{T} d_t² / (T(T² − 1)),

where d_t is the difference between the ranks of y_{i,t} and ŷ_{i,t} within their respective sequences. Generally, SRC indicates the trend consistency of the prediction, which complements nMSE-TMP since the latter only focuses on squared deviations of absolute values between y_i and ŷ_i.
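A self-contained sketch of SRC computed as the Pearson correlation of ranks, which is equivalent to the closed-form formula when there are no ties; tie handling (averaged ranks) is omitted for brevity:

```python
import numpy as np

def spearman_src(y_true, y_pred):
    """Spearman's rank correlation: Pearson correlation between the
    ranks of the two sequences; +1 means identical ordering.
    Note: assumes no ties (ranks are not averaged)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x), dtype=np.float64)
        r[order] = np.arange(len(x))
        return r
    rt = ranks(np.asarray(y_true, dtype=np.float64))
    rp = ranks(np.asarray(y_pred, dtype=np.float64))
    rt -= rt.mean()
    rp -= rp.mean()
    return float(np.sum(rt * rp) / np.sqrt(np.sum(rt**2) * np.sum(rp**2)))

print(spearman_src([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 (same trend)
print(spearman_src([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 (opposite trend)
```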
Similarly, we first analyze the impact of β on the performance and training dynamics of the MMVED-TMP model under the nMSE and SRC metrics. The results are illustrated in Figures 3 and 4. From both figures, we find that the relationship between β, model performance, and training dynamics is in line with the regression task. Both metrics indicate that model performance improves first and then worsens as β grows. Besides, with β set to 0, where no information constraint is placed upon the hidden representation, the validation performance slightly increases in the first few epochs but drops drastically afterwards, which indicates that the model overfits to the noise and irrelevant information specific to the training set; whereas with β set to 0.3 or 1, since the information capacity of the embedding is restricted by the KL divergence constraint, the validation performance improves in a nearly monotonic manner. Moreover, with β set to a suitable value (β = 0.3), the asymptotic optimal performance of MMVED-TMP is better than with a too-large β.
Besides, we also note several differences between MMVED-TMP on the Xigua dataset and MMVED-REG on the NUS dataset when analyzing the sensitivity to β. The most salient one is that for MMVED-TMP, the optimal value of β shifts to a smaller value, 0.3, which implies that under our model design and experimental settings, a weaker constraint leads to better performance on the temporal popularity regression task. Moreover, for the MMVED-TMP model, if no constraint is placed upon the hidden representation (β = 0), although the model overfits quickly, as reflected by the rapid rebound of the validation loss, the training loss converges for all splits (while for the MMVED-REG model, training fails to converge for some splits). Such phenomena can be explained as follows. By visualizing the statistical characteristics of the popularity sequences in a stratified manner, we found that the popularity levels of micro-videos tend to follow similar temporal evolution patterns, reaching a peak shortly after launch and then gradually dropping to near zero and stabilizing, as people prefer fresh micro-videos to stale and obsolete ones. Thus, the absolute time fed into the RNN at each timestep is itself a strong indicator of the popularity level, which makes MMVED-TMP less dependent on the latent representation than MMVED-REG, for which, on the contrary, the noisy micro-video content is the sole source of information. Therefore, the MMVED-TMP model exhibits more tolerance to irrelevant information in the hidden representation. Since setting β to 0.3 empirically reaches the optimal performance for the MMVED-TMP model, we use it as the default value of β in the sections afterwards.
In order to further verify the effectiveness of the proposed MMVED-TMP model, in this section we compare it against two machine learning baselines, i.e., the temporal support vector regressor and the temporal random forest, and two deep learning baselines, i.e., the contextual LSTM and the multimodal deterministic encoder-decoder:
Temporal SVR. The Support Vector Regressor (SVR) is a widely-used kernel method for regression tasks. To make SVR amenable to the temporal regression task, the absolute time is concatenated with the micro-video features from the four modalities.
Temporal RFR. The Random Forest Regressor (RFR) is another machine learning method based on an ensemble of decision trees. We apply the same feature pre-processing as for the Temporal SVR to make the RFR suitable for the temporal prediction task.
Contextual LSTM. The Contextual LSTM  augments the input at each timestep of the LSTM with a global embedding of the object as auxiliary contextual information. In our implementation, we encode the concatenation of features from all modalities with an MLP as the contextual variable.
Deterministic Encoder-Decoder. In order to verify the effectiveness of uncertainty preservation, we design the multimodal deterministic encoder-decoder (MMDED) baseline, where the mean of the PoE Gaussian embedding is taken as a deterministic representation of the input modalities. All other structures remain the same as the MMVED-TMP.
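The difference between MMVED-TMP and the MMDED baseline boils down to how the hidden representation is formed from the encoder's Gaussian parameters; a minimal sketch (names ours), using the reparameterization trick for the stochastic case:

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_embedding(mu, logstd):
    """MMVED-style sample via the reparameterization trick:
    z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(logstd) * rng.standard_normal(mu.shape)

def deterministic_embedding(mu, logstd):
    """MMDED-style embedding: keep only the mean, discarding the
    uncertainty carried by the variance."""
    return mu

mu, logstd = np.zeros(4), np.log(0.5) * np.ones(4)
z_sto = stochastic_embedding(mu, logstd)
z_det = deterministic_embedding(mu, logstd)
print(z_det)        # [0. 0. 0. 0.]
print(z_sto.shape)  # (4,)
```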
The proposed MMVED-TMP: nMSE-TMP = 0.971, SRC = 0.767. The best results are in bold; a small nMSE and a large SRC indicate good performance.
Table 3 summarizes the comparison of performance between the proposed MMVED-TMP model and the baselines. Three conclusions can be drawn from Table 3. First, deep learning-based methods consistently outperform the machine learning-based ones by a large margin, since the machine learning-based baselines predict the popularity level at each timestep independently, taking the absolute time as an auxiliary 1-D feature, and thus exploit no temporal relationships, whereas capturing the dynamic patterns of sequences, on the contrary, is the strength of RNNs. Second, among the encoder-decoder-based deep learning methods, those that take the hidden encoding as the initial state of the RNN (MMDED, MMVED-TMP) outperform those that use the hidden encoding to augment the input at each timestep (CLSTM). The reason could be that by only using the micro-video embedding as an initialization of the RNN hidden state, the former methods force the encoder to summarize information from the micro-video features relevant to the popularity trend, blocking the undesired shortcut of basing the prediction solely on the absolute time and ignoring the encoding. Besides, by concatenating the high-dimensional micro-video feature with the absolute time, the CLSTM model dilutes the 1-D absolute time information, which was shown to be a strong indicator of micro-video popularity trends in the previous section.
Finally, we also confirm that modeling the hidden representation as a stochastic variable (MMVED-TMP) performs better than its deterministic counterpart (MMDED). Two reasons could explain such result: first, the mapping from the feature of micro-video to its popularity trend is non-deterministic as external uncertain factors also influence its popularity level, and encoding the micro-video content into stochastic Gaussian embedding where the uncertainty is preserved in its variance is flexible enough to deal with such randomness. Besides, the MMVED-TMP is able to systematically model the hidden noise generative process by adding a self-adaptive noise to its encoded hidden representation during the training phase, which shows more robustness to noise than the MMDED model when faced with new data.
In order to gain more insight into the effect of the popularity sequence length on the performance of the proposed MMVED-TMP model and the baselines, in this section we test model performance with varied re-sampling intervals (which, as a result, change the popularity sequence length). As changing the interval alters the groundtruth variance, which invalidates comparisons under the nMSE metric, we only use the SRC metric as the performance indicator. The evaluation results are illustrated in Figure 5. Figure 5 shows that, generally, for all three models, SRC drops as the sequence length increases, since a longer sequence makes it harder to propagate gradient information from the later part of the popularity sequence back to both the temporal RNN and the multimodal encoder; besides, prediction errors accumulate more severely with longer output sequences. However, even for popularity sequences of length 36, which is quite long given the off-line nature of our task, the predictions made by MMVED-TMP still achieve a correlation above 0.5 with the groundtruth and show the lowest variance across dataset splits compared with the two other baselines, which demonstrates the potential of the proposed MMVED-TMP model to generalize to popularity prediction tasks with finer granularity and longer terms.
| Modalities kept | nMSE-TMP | SRC |
| A + T + S (visual missing) | 0.9706 | 0.7669 |
| V + T + S (acoustic missing) | 0.9707 | 0.7670 |
| V + A + S (textual missing) | 0.9705 | 0.7669 |
| V + T + A (social missing) | 0.9710 | 0.7614 |
| No missing modality | 0.9705 | 0.7667 |
Finally, in this section, we demonstrate the robustness of our model under circumstances where some modalities are missing in the test phase. Such a situation is quite common, since some users are naturally reluctant to turn on the microphone when filming their videos, to add descriptions to their posted micro-videos, or to upload any user profile. As mentioned in Section 3.3, with the conditional independence assumption, when faced with new samples having fewer modalities than those in the training set, we simply drop the corresponding encoder networks and re-compute the product-of-experts distribution with the encodings calculated from the available modalities, saving the effort of re-training the whole MMVED framework from scratch.
In the experiments, we fix the weights of the trained MMVED-TMP model, eliminate one modality and its corresponding encoder network at a time, and report the test performance. The results are listed in Table 4. From Table 4, we find that the detrimental effect of missing one modality is quite negligible. Aside from the strong temporal features, such results can also be explained by the PoE nature of the multimodal encoder, where the decision about which information to encode from the micro-video content into the hidden representation is made based on the complementary information from all modalities, weighted by their confidence. We observe that eliminating the social modality costs the most performance, which is in agreement with previous findings, accentuating the important role the uploader plays in the potential popularity of a micro-video. Eliminating the visual or aural modality, on the other hand, shows the least effect on performance: although features from both modalities have the largest dimensions, they are redundant and can be complemented by information from the other available modalities, whereas the user attributes from the social modality are compact and comparatively irreplaceable.
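The modality-dropping behavior follows directly from the closed form of a product of Gaussian experts: precisions add, so omitting a modality simply removes its terms from the sums, with no re-training needed. A minimal sketch (function name ours, 2-D toy embeddings):

```python
import numpy as np

def poe_fuse(mus, logstds, prior_mu=0.0, prior_var=1.0):
    """Product-of-experts fusion of diagonal Gaussian experts (one per
    available modality) with a standard Gaussian prior expert.
    Precisions add; the fused mean is the precision-weighted average.
    Dropping a modality just means omitting its (mu, logstd) pair."""
    mus = [np.asarray(m, dtype=np.float64) for m in mus]
    logstds = [np.asarray(ls, dtype=np.float64) for ls in logstds]
    prec = np.full(mus[0].shape, 1.0 / prior_var)           # prior precision
    weighted = np.full(mus[0].shape, prior_mu / prior_var)  # prior contribution
    for m, ls in zip(mus, logstds):
        p = np.exp(-2.0 * ls)  # expert precision 1 / sigma^2
        prec = prec + p
        weighted = weighted + m * p
    return weighted / prec, 1.0 / prec  # fused mean and variance

# Two modalities agree at mu = 1 with unit variance (logstd = 0)
mus = [np.ones(2), np.ones(2)]
logstds = [np.zeros(2), np.zeros(2)]
mu, var = poe_fuse(mus, logstds)
print(mu)   # pulled toward the prior mean 0: [0.667 0.667]
print(var)  # sharper than any single expert: [0.333 0.333]
```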
In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework for micro-video popularity prediction tasks. The proposed MMVED is capable of learning a stochastic Gaussian embedding of a micro-video that is informative to its potential popularity level, where the inherent uncertainty is properly preserved. Besides, faced with the internal noise of micro-videos, a deep information bottleneck constraint is set upon the Gaussian embedding such that only relevant information is allowed to flow from the input modalities to the hidden representations.
Based on experiments on two real-world datasets, we draw the following conclusions. First, explicitly taking the uncertainty into consideration by modeling the hidden representation as a random variable improves the performance of micro-video popularity prediction models compared with their deterministic counterparts. Second, it is necessary to constrain the information contained in the stochastic embedding to deal with the noise in micro-video features, so that the trained models can achieve satisfactory generalization ability.
Portuguese Conference on Artificial Intelligence, pp. 535–546.
IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
Collaborative variational autoencoder for recommender systems. In ACM International Conference on Knowledge Discovery and Data Mining, pp. 305–314.
Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pp. 5580–5590.
Theorem 1. If the Markov chain y ↔ x ↔ z holds, i.e., the input x contains all necessary information about the target y w.r.t. the hidden representation z, the following information bottleneck inequality holds:

I(z, y) − β I(z, x) ≥ ∫ p(x, y) p(z|x) log q(y|z) dx dy dz − β ∫ p(x) p(z|x) log ( p(z|x) / r(z) ) dx dz,

where p(x, y) is the data distribution, p(z|x) is the encoder distribution, q(y|z) is the variational decoder distribution, and r(z) is the specified prior for z. In order to prove the theorem, we first prove the two lemmas below.
Lemma 2. Under the assumptions of Theorem 1, the following holds:

I(z, y) ≥ ∫ p(x, y) p(z|x) log q(y|z) dx dy dz.

Proof. According to the definition of mutual information:

I(z, y) = ∫ p(y, z) log ( p(y|z) / p(y) ) dy dz.

Observing that the KL divergence KL( p(y|z) || q(y|z) ) is non-negative, the following inequality holds:

∫ p(y|z) log p(y|z) dy ≥ ∫ p(y|z) log q(y|z) dy.

Then I(z, y) is lower-bounded by:

I(z, y) ≥ ∫ p(y, z) log q(y|z) dy dz + H(y).

Since H(y) is the entropy of y, a positive constant independent of the hidden encoding z, it can be safely ignored. According to the independence assumption of Theorem 1, p(y, z) = ∫ p(x, y) p(z|x) dx; then a new lower bound of I(z, y) can be deduced:

I(z, y) ≥ ∫ p(x, y) p(z|x) log q(y|z) dx dy dz,
which finishes our proof of Lemma 2.
Lemma 3. Under the assumptions of Theorem 1, the following holds:

I(z, x) ≤ ∫ p(x) p(z|x) log ( p(z|x) / r(z) ) dx dz.

Proof. For clarity, with a slight abuse of notation, we use p(z) to denote the marginal distribution ∫ p(x) p(z|x) dx and r(z) the assumed prior. Observing the fact that

KL( p(z) || r(z) ) ≥ 0, i.e., ∫ p(z) log p(z) dz ≥ ∫ p(z) log r(z) dz,

the following upper bound can be obtained:

I(z, x) = ∫ p(x) p(z|x) log ( p(z|x) / p(z) ) dx dz ≤ ∫ p(x) p(z|x) log ( p(z|x) / r(z) ) dx dz,
which finishes our proof of Lemma 3.
By making the conditional independence assumption, i.e., assuming p(x_1, …, x_M | z) = Π_{m=1}^{M} p(x_m | z), and using the Bayes rule, the following equation holds:

p(z | x_1, …, x_M) ∝ p(z) Π_{m=1}^{M} ( p(z | x_m) / p(z) ).

Moreover, to avoid the quotient of probability distributions, we further assume that p(z | x_m) / p(z) can be approximated with q_m(z | x_m), where q_m is the encoder network for the m-th modality, and p(z) is the prior. Then, the following simplification of Eq. 31 can be made:

p(z | x_1, …, x_M) ≈ p(z) Π_{m=1}^{M} q_m(z | x_m),
which finishes the deduction for the product-of-experts based multimodal encoder utilized in our paper.
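The tractability of the PoE encoder rests on the fact that a product of diagonal-Gaussian experts is itself Gaussian, with precision-weighted statistics (a standard identity, stated here for completeness):

```latex
% Product of M diagonal-Gaussian experts q_m(z) = N(z; \mu_m, \sigma_m^2):
\prod_{m=1}^{M} \mathcal{N}\!\left(z;\, \mu_m, \sigma_m^2\right)
\;\propto\;
\mathcal{N}\!\left(z;\, \mu, \sigma^2\right),
\qquad
\sigma^2 = \Big(\sum_{m=1}^{M} \sigma_m^{-2}\Big)^{-1},
\qquad
\mu = \sigma^2 \sum_{m=1}^{M} \frac{\mu_m}{\sigma_m^2}
```

Including the prior as an extra expert with unit variance and zero mean recovers the fused distribution used by the multimodal encoder, and dropping a modality simply removes its term from both sums.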