Live video commenting is a new interaction mode that has emerged on online video websites. It allows viewers to write real-time comments while watching videos, either to express opinions about the video content or to interact with other viewers. Building on these features, the Automatic Live Video Commenting (ALVC) task aims to generate live comments for videos while taking both the video and the surrounding comments (in this paper, the comments made by other viewers near the current timestamp) into consideration. Figure 1 illustrates an example of this task. Automatically generating real-time comments makes video watching more fun and makes video content easier for human viewers to understand. It also draws viewers’ attention and can increase the popularity of the video.
Despite its usefulness, the ALVC task has not been widely explored. Ma et al. (2019) were the first to propose this task, and theirs is the only attempt so far. They employ separate attention over the video and the surrounding comments to obtain their representations. However, such an approach neglects the interactions between video and text: the surrounding comments are written in response to the video, and they in turn highlight important features of the video frames. Therefore, in this work, we aim to integrate the information from video and text with a co-attention mechanism.
As an effective method in multi-modal scenarios, co-attention has been applied in multiple tasks Lu et al. (2016); Xiong et al. (2017); Nguyen and Okatani (2018); Yang et al. (2019); Li et al. (2019b). Different from the traditional co-attention mechanism, we propose a novel model with a Diversified Co-Attention (DCA) module and a Gated Attention Module (GAM) for the ALVC task. By learning different distance metrics to characterize the dependencies between two information sources, DCA builds bidirectional interactions between video frames and surrounding comments from multiple perspectives, so as to produce diversified co-dependent representations. Integrating the diversified information from DCA, the GAM is designed to collect an informative and comprehensive context for comment generation. We further propose a simple yet effective parameter orthogonalization technique to alleviate potential information redundancy in DCA’s multi-perspective setting. Experimental results suggest that our model outperforms both the previous approaches to the ALVC task and the traditional co-attention, reaching new state-of-the-art results. The incremental analysis further supports the effectiveness of the proposed DCA and GAM.
The main contributions of this work are summarized as follows:
We propose a novel model for the ALVC task, equipped with a diversified co-attention module and a gated attention module.
We propose an effective parameter orthogonalization technique, which is able to alleviate potential information redundancy of DCA’s multi-perspective setting.
Experiments show that our approach achieves state-of-the-art results in the ALVC task, and generates comments with more valuable information.
2 Preliminary: Co-Attention
Co-attention is an effective attention-based mechanism that has been used in several multi-modal tasks Lu et al. (2016); Xiong et al. (2017); Nguyen and Okatani (2018); Yang et al. (2019); Li et al. (2019b). Although these approaches are different in detailed implementation, their general idea is to let visual and textual inputs attend to each other by computing their similarity. Generalizing the previous works, in this paper, we formalize the co-attention mechanism as follows.
Given a sequence of visual features $V = (v_1, \dots, v_n) \in \mathbb{R}^{n \times d}$ and a sequence of textual features $T = (t_1, \dots, t_m) \in \mathbb{R}^{m \times d}$ (here we assume that $V$ and $T$ share the same dimension $d$; otherwise, a linear transformation can be introduced to ensure that their dimensions are the same), we first connect them by computing their similarity matrix $S \in \mathbb{R}^{n \times m}$:
$$S = V W T^\top \tag{1}$$
where $W \in \mathbb{R}^{d \times d}$ contains learnable weights.
Each element $S_{ij}$ in the similarity matrix denotes the similarity score between $v_i$ and $t_j$. $S$ is normalized row-wise to produce the vision-to-text attention weights $A^v$, and column-wise to produce the text-to-vision attention weights $A^t$. The final representations are computed as the product of attention weights and original features:
$$A^v = \mathrm{softmax}(S), \qquad A^t = \mathrm{softmax}(S^\top) \tag{2}$$
$$C^v = A^v T, \qquad C^t = A^t V \tag{3}$$
where $C^v$ and $C^t$ denote the co-dependent representations of vision and text.
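The formalization above can be illustrated with a minimal NumPy sketch. The names `V`, `T`, `W`, `co_attention` and the toy sizes are assumptions chosen for illustration; the bilinear similarity followed by row-wise and column-wise softmax follows the description in this section, not any released implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, T, W):
    """Generic co-attention between visual features V (n, d) and text features T (m, d)."""
    S = V @ W @ T.T                  # similarity matrix, shape (n, m)
    A_v2t = softmax(S, axis=1)       # row-wise: each visual item attends over text
    A_t2v = softmax(S, axis=0).T     # column-wise: each text item attends over vision, (m, n)
    C_v = A_v2t @ T                  # co-dependent visual representation, (n, d)
    C_t = A_t2v @ V                  # co-dependent textual representation, (m, d)
    return S, C_v, C_t

# Toy inputs (sizes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
n, m, d = 5, 7, 4
V = rng.normal(size=(n, d))
T = rng.normal(size=(m, d))
W = rng.normal(size=(d, d))
S, C_v, C_t = co_attention(V, T, W)
```

Each row of `C_v` is a convex combination of text features, and each row of `C_t` is a convex combination of visual features, which is what makes the two representations co-dependent.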
3 Proposed Model
The ALVC task is defined as follows: given a timestamp, the surrounding video frames and the surrounding comments (we concatenate all surrounding comments in time order as a single sequence), the model aims at generating a reasonable comment. Here, the surrounding video frames and surrounding comments are sampled from a time interval around the timestamp, whose length is a hyper-parameter. Figure 2(a) presents a sketch of our proposed model.
3.1 Video Encoder and Text Encoder
The text encoder aims to obtain representations of the surrounding comments and is implemented as a GRU Cho et al. (2014) network. The hidden representation of each word $x_i$ is computed as:
$$h^x_i = \mathrm{GRU}(h^x_{i-1}, e(x_i)) \tag{4}$$
where $e(x_i)$ is the word embedding of $x_i$. The textual representation matrix is denoted as $H^x$.
The video encoder is used to obtain representations of video frames and is implemented as another GRU network. The hidden representation of each video frame $v_i$ is computed as:
$$h^v_i = \mathrm{GRU}(h^v_{i-1}, v_i) \tag{5}$$
The visual representation matrix is denoted analogously as $H^v$.
3.2 Diversified Co-Attention
In this section, we first introduce the idea of metric learning into the traditional co-attention mechanism, and then describe DCA’s multi-perspective setting as well as the parameter orthogonalization technique used in our model.
3.2.1 Metric Learning
The design of DCA derives from the idea of metric learning. Different from the traditional co-attention mechanism described in Section 2, we consider the parameter matrix $W$ in Eq.(1) as a task-specific distance metric in the joint space of video and text. Such a distance metric is learned to measure the distance between video and text representations.
According to Xing et al. (2002), to ensure that $W$ is a distance metric, $W$ is required to be a positive semi-definite matrix, so that non-negativity and the triangle inequality are satisfied. Since $W$ is continuously updated during model training, the positive semi-definite constraint is difficult to keep satisfied. To remedy this problem, we propose an alternative solution: the video representations $H^v$ and the text representations $H^x$ are first passed through the same linear transformation $P$, then the inner product of the transformed matrices is computed as their similarity score:
$$S = (H^v P)(H^x P)^\top = H^v P P^\top (H^x)^\top \tag{6}$$
where $P P^\top$ is regarded as an approximation of $W$ in Eq.(1). Since $P P^\top$ is symmetric and positive semi-definite by construction (for any vector $z$, $z^\top P P^\top z = \lVert P^\top z \rVert^2 \ge 0$), it meets the requirement of a distance metric.
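The argument above can be checked numerically with a small sketch. The names `H_v`, `H_x`, `P`, `metric_similarity` and all sizes are illustrative assumptions; the point is only that sharing one linear map between both inputs yields an induced matrix that is symmetric with non-negative eigenvalues.

```python
import numpy as np

def metric_similarity(H_v, H_x, P):
    """Similarity after applying the same linear map P to both inputs."""
    return (H_v @ P) @ (H_x @ P).T

rng = np.random.default_rng(0)
H_v = rng.normal(size=(5, 4))    # toy video representations
H_x = rng.normal(size=(7, 4))    # toy text representations
P = rng.normal(size=(4, 3))      # shared linear transformation (hypothetical shape)

S = metric_similarity(H_v, H_x, P)

# The induced metric matrix is symmetric positive semi-definite by construction.
M = P @ P.T
eigvals = np.linalg.eigvalsh(M)  # eigenvalues of a symmetric matrix, ascending
```

With a non-square `P` as chosen here, `M` is rank-deficient, so its smallest eigenvalue is zero up to rounding: the matrix is positive semi-definite but not positive definite.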
3.2.2 Multi-Perspective Setting
As distance metrics between vectors can be defined in various forms, learning a single distance metric does not suffice to comprehensively measure the similarity between two kinds of representations. Moreover, we hope the DCA module is able to provide informative context for the comment decoder from diversified perspectives.
To address this limitation, we introduce a multi-perspective setting in our DCA module. We ask the DCA module to learn multiple distance metrics to capture the dependencies between video and text from different perspectives. To achieve this, DCA learns $K$ different parameter matrices $\{P_k\}_{k=1}^{K}$ in Eq.(6), where $K$ is a hyper-parameter denoting the number of perspectives. Intuitively, each $P_k$ represents a learnable distance metric. Given two sets of representations $H^v$ and $H^x$, each $P_k$ yields a similarity matrix $S_k$ as well as co-dependent representations $C^v_k$ and $C^x_k$ from its unique perspective. DCA is then able to build bidirectional interactions between the two information sources from multiple perspectives. Finally, a mean-pooling layer is used to integrate the representations from different perspectives:
$$C^v = \frac{1}{K} \sum_{k=1}^{K} C^v_k, \qquad C^x = \frac{1}{K} \sum_{k=1}^{K} C^x_k \tag{7}$$
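A minimal sketch of the multi-perspective setting, under the assumption that each perspective applies the shared-transformation similarity followed by standard co-attention and that the results are averaged. The function and variable names (`diversified_co_attention`, `P_list`) and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diversified_co_attention(H_v, H_x, P_list):
    """Co-attention under several learned metrics, integrated by mean pooling."""
    C_v_all, C_x_all = [], []
    for P in P_list:                                  # one distance metric per perspective
        S = (H_v @ P) @ (H_x @ P).T                   # similarity under this metric
        C_v_all.append(softmax(S, axis=1) @ H_x)      # video attends over text
        C_x_all.append(softmax(S, axis=0).T @ H_v)    # text attends over video
    # Mean pooling over the perspectives.
    return np.mean(C_v_all, axis=0), np.mean(C_x_all, axis=0)

rng = np.random.default_rng(0)
H_v = rng.normal(size=(5, 4))                         # toy video representations
H_x = rng.normal(size=(7, 4))                         # toy text representations
P_list = [rng.normal(size=(4, 4)) for _ in range(3)]  # 3 perspectives, an arbitrary choice
C_v, C_x = diversified_co_attention(H_v, H_x, P_list)
```

If all matrices in `P_list` were identical, the pooled output would equal that of a single perspective, which is the redundancy issue the orthogonalization step is meant to counter.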
3.2.3 Parameter Orthogonalization
One potential problem of the above multi-perspective setting is information redundancy, meaning that the information extracted from different perspectives may overlap excessively. Specifically, the parameter matrices may become highly similar after many rounds of training. In the extreme case where the information learned from different perspectives is identical, the multiple perspectives degenerate into a single perspective.
To alleviate this information redundancy problem, one possible solution is to apply regularization to differentiate the learned parameters. However, we empirically find that introducing regularization terms causes model training to collapse (see the Appendix for details). As an alternative, we propose a parameter orthogonalization technique. After back-propagation updates all parameters at each learning step, we further perform the following secondary update on each parameter matrix:
where means the trace of the matrix and is a hyper-parameter.
In the Appendix, we show that Eq.(8) is equivalent to an orthonormality constraint, ensuring that the parameter matrices are nearly orthogonal. According to Lin et al. (2017), this suggests that the information carried by these matrices rarely overlaps. Experiments in Sections 4.6 and 4.7 demonstrate that, by reducing information redundancy in the multi-perspective setting, the orthogonalization technique helps DCA collect diversified information from the video and text sources.
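To make the idea of a post-hoc orthogonalization step concrete, the sketch below uses the Parseval-style update of Cissé et al. (2017), which the Appendix cites. This is an assumption for illustration: the exact form of Eq.(8) may differ (the text mentions a trace term that this sketch does not include), and the names `orthogonalize_step`, `orthogonality_error` and the chosen `beta` and step count are hypothetical.

```python
import numpy as np

def orthogonalize_step(P, beta=0.01):
    """One Parseval-style secondary update: P <- (1 + beta) P - beta (P P^T) P."""
    return (1.0 + beta) * P - beta * (P @ P.T) @ P

def orthogonality_error(P):
    """Frobenius distance between P^T P and the identity matrix."""
    return np.linalg.norm(P.T @ P - np.eye(P.shape[1]))

rng = np.random.default_rng(0)
d = 8
P = rng.normal(size=(d, d)) / np.sqrt(d)   # toy parameter matrix

err_before = orthogonality_error(P)
for _ in range(3000):                      # applied alone here, without gradient steps
    P = orthogonalize_step(P, beta=0.05)
err_after = orthogonality_error(P)
```

Each step moves every singular value of `P` towards 1, so repeated application drives `P` towards an orthogonal matrix. In training, one such step would follow each regular gradient update rather than being iterated on its own.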
3.3 Gated Attention Module
In order to balance the co-dependent representations from DCA and the original representations from the encoders, a Gated Attention Module (GAM) is placed after the DCA module. The GAM also serves to collect informative context for the comment decoder. Given the hidden state of the decoder at the current timestep, the GAM generates a context vector for the next decoder timestep. A sketch of the GAM is shown in Figure 2(b).
First, we apply an attention mechanism to the co-dependent and original representations respectively, using the decoder hidden state as the query:
where the attention function is the attention mechanism of Bahdanau et al. (2015). Then, the two resulting attention contexts are passed through a gated unit to generate comprehensive textual representations:
where the three gate parameters are learnable, $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication. The output is the balanced textual representation of the co-dependent and original attention contexts. Symmetrically, we obtain the balanced visual representation through Eq.(9) to Eq.(11), based on the co-dependent and original visual representations.
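The gating step can be sketched as follows. The exact equations of the gated unit are not reproduced here, so this block assumes one common form of gated fusion: a sigmoid gate computed from both inputs with two weight matrices and a bias, followed by an element-wise convex combination. All names (`gated_fusion`, `W_co`, `W_orig`, `b`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(a_co, a_orig, W_co, W_orig, b):
    """Balance a co-dependent context a_co against an original context a_orig.

    Assumed form: gate = sigmoid(W_co a_co + W_orig a_orig + b),
    output = gate * a_co + (1 - gate) * a_orig (element-wise).
    """
    gate = sigmoid(W_co @ a_co + W_orig @ a_orig + b)
    return gate * a_co + (1.0 - gate) * a_orig

rng = np.random.default_rng(0)
d = 6
a_co = rng.normal(size=d)                 # attention context over co-dependent representations
a_orig = rng.normal(size=d)               # attention context over original representations
W_co = rng.normal(size=(d, d)) * 0.1
W_orig = rng.normal(size=(d, d)) * 0.1
b = np.zeros(d)
fused = gated_fusion(a_co, a_orig, W_co, W_orig, b)
```

Because the gate lies in (0, 1), every element of `fused` falls between the corresponding elements of the two inputs, which is the sense in which the unit balances them.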
In the ALVC task, the contribution of video information and textual information towards the desired comment may not be equivalent. It is inappropriate to model video and text in a completely symmetrical way. Therefore, we calculate the final context vector as:
where a learnable vector weights the information collected from video and text, $\otimes$ denotes the outer product and FFN denotes a feed-forward network. The outer product represents the relationship between two vectors more richly than the dot product, since it keeps every pairwise product of their elements rather than only their sum. We therefore combine the outer product with the feed-forward network to collect an informative context for generation.
3.4 Decoder
Given the context vector obtained by GAM, the decoder aims to generate a comment via another GRU network. The hidden state at timestep is computed as:
where is the word generated at time-step , and the semicolon denotes vector concatenation. The decoder samples a word from the output probability distribution:
where denotes an output linear layer. The model is trained by maximizing the log-likelihood of the ground-truth comment.
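The training objective can be illustrated with a short sketch that computes the log-likelihood of a target comment from per-step output logits. The names `log_softmax` and `sequence_log_likelihood`, and the toy shapes, are assumptions for illustration; in practice the logits would come from the decoder's output layer.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_likelihood(logits, target_ids):
    """Sum of log-probabilities assigned to the ground-truth words.

    logits: array of shape (steps, vocab); target_ids: sequence of length steps.
    """
    log_probs = log_softmax(logits)
    steps = np.arange(len(target_ids))
    return log_probs[steps, target_ids].sum()

# Uniform logits over a 4-word vocabulary and a 3-word target comment.
logits = np.zeros((3, 4))
target_ids = [0, 1, 2]
ll = sequence_log_likelihood(logits, target_ids)
```

Maximizing this quantity over the training set is equivalent to minimizing the usual cross-entropy loss summed over decoding steps.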
3.5 Extension to Transformer
We also implement our model based on Transformer Vaswani et al. (2017). Specifically, the text encoder, video encoder and comment decoder are implemented as Transformer blocks. The DCA module and the GAM in the Transformer version remain the same as in the aforementioned GRU-based model. Since this extension is not the focus of this paper, we do not explain it in more detail. Readers can refer to Vaswani et al. (2017) for detailed descriptions of the Transformer block.
4 Experiments
4.1 Dataset
We conduct experiments on the Live Comment Dataset (https://github.com/lancopku/livebot) proposed by Ma et al. (2019). The dataset is collected from the popular Chinese video streaming website Bilibili (https://www.bilibili.com). It contains 895,929 instances in total, which belong to 2,361 videos. Each instance contains a human-written comment (ground-truth), 5 surrounding comments from the nearest timestamps and 5 surrounding video frames presented as images (see Figure 1 for an example). Table 1 shows detailed statistics about the Live Comment dataset.
4.2 Baselines
The baseline models in our experiments include the previous approaches to the ALVC task as well as the traditional co-attention model. For each of the listed Seq2Seq-based models, we implement another Transformer-based version by replacing the encoder and decoder with Transformer blocks, so as to compare with our Transformer-based model.
S2S-Video Vinyals et al. (2015) uses a CNN to encode the video frames and a GRU decoder to generate the comment.
S2S-Text Bahdanau et al. (2015) uses a GRU encoder to encode the surrounding comments and a GRU decoder to generate the comment.
S2S-Concat Venugopalan et al. (2015) adopts two GRU encoders to encode video frames and surrounding comments, respectively. Outputs from two encoders are concatenated and fed into a GRU decoder.
S2S-SepAttn Ma et al. (2019) employs separate attention on video and text representations. The attention contexts are concatenated and fed into a GRU decoder.
S2S-CoAttn Yang et al. (2019) applies traditional co-attention on video and text representations. The co-attention outputs are fed into a GRU decoder via attention mechanism.
Accordingly, the Transformer versions are named as Trans-Video, Trans-Text, Trans-Concat, Trans-SepAttn and Trans-CoAttn.
4.3 Experiment Settings
In our experiments, all comments are limited to no longer than 20 Chinese words. Word embeddings are set to size 512 and are learned from scratch. We adopt a 34-layer ResNet He et al. (2016) pretrained on ImageNet to process the raw video frames before feeding them into the video encoder. For Seq2Seq-based models, both encoders and the decoder are implemented as 2-layer GRU networks, and the encoders are bi-directional. For Transformer-based models, both encoders and the decoder are composed of 6 Transformer blocks. We set the number of perspectives to in Eq. (7), and the hyper-parameter in Eq. (8) is set to 0.01. The length of the data sampling interval is set to 10 seconds.
4.4 Evaluation Metrics
Since the possible comments can be diverse for a certain video, how to properly evaluate comment generation is still a tough challenge. In this work, considering both quality and diversity of the generated comments, we use both reference-based and rank-based metrics in evaluation. Besides, we also conduct human evaluation to solidify the results.
4.4.1 Automatic Evaluation
Reference-based metrics aim to evaluate the consistency or similarity between the generated text and the ground-truth. Therefore, they are widely used to measure the quality of the generated text. We apply several popular reference-based metrics, including BLEU Papineni et al. (2002), ROUGE Lin (2004), METEOR Banerjee and Lavie (2005) and CIDEr Vedantam et al. (2015).
Due to the diversity of reasonable comments for a given video, we cannot collect all possible comments for reference-based comparison. Hence, using reference-based metrics alone is not sufficient for evaluation. Following Das et al. (2017) and Ma et al. (2019), we adopt rank-based metrics as a complement to the reference-based metrics. Given a set of selected candidate comments, the model is asked to assign a likelihood score to each candidate. Since the model generates the sentence with the highest likelihood score, it is reasonable to judge a model by its ability to rank the ground-truth comment at the top. The 100 candidate comments are collected as follows:
Ground-truth: The human-written comment in the original video.
Plausible: The 30 comments in the training set most similar to the video title, excluding the ground-truth comment and the input surrounding comments. Plausibility is computed as the cosine similarity between the comment and the video title based on tf-idf values.
Popular: The 20 most frequent comments in the training set, most of which are meaningless short sentences like “Hahaha” or “Great”.
Random: Comments that are randomly picked from the training set to make the candidate set up to 100 sentences.
The models are asked to sort the candidate list in descending order of likelihood. We report evaluation results on the following metrics: Recall@k (the percentage of instances in which the ground-truth appears in the top k of the ranked candidates), MR (the mean rank of the ground-truth), and MRR (the mean reciprocal rank of the ground-truth).
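The three rank-based metrics can be computed as in the sketch below. The function names, the cut-offs `ks=(1, 5, 10)` and the toy ranks are illustrative assumptions; ties are resolved in favour of the ground-truth, which is one possible convention.

```python
def rank_of_ground_truth(scores, gt_index):
    """1-based rank of the ground-truth when candidates are sorted by descending score.

    Candidates with a score equal to the ground-truth's are not counted ahead of it.
    """
    gt_score = scores[gt_index]
    return 1 + sum(1 for s in scores if s > gt_score)

def ranking_metrics(ranks, ks=(1, 5, 10)):
    """Recall@k, mean rank (MR) and mean reciprocal rank (MRR) over a list of ranks."""
    n = len(ranks)
    metrics = {f"Recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["MR"] = sum(ranks) / n
    metrics["MRR"] = sum(1.0 / r for r in ranks) / n
    return metrics

# Ground-truth ranks for four hypothetical test instances.
ranks = [1, 3, 12, 2]
metrics = ranking_metrics(ranks)
example_rank = rank_of_ground_truth([0.1, 0.7, 0.4], gt_index=2)
```

Higher Recall@k and MRR are better, while a lower MR is better, since MR averages the position of the ground-truth in the ranked list.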
4.4.2 Human Evaluation
In human evaluation, we randomly pick 200 instances from the test set. We ask five human annotators to score the generated comments from different models on a scale of 1 to 5. The annotators are required to evaluate from the following aspects: Fluency (whether the sentence is grammatically correct), Relevance (whether the comment is relevant to the video and surrounding comments), Informativeness (whether the comment carries rich and meaningful information) and Overall (the annotator’s general recommendation).
4.5 Experiment Results
The results of the automatic and human evaluation are shown in Table 2 and Table 3 respectively, showing that our model performs better than the previous approaches in the ALVC task as well as the traditional co-attention model.
According to the rank-based metrics in automatic evaluation, our model assigns higher ranks to ground-truth comments. These results indicate that our model is better at discriminating highly relevant comments from irrelevant ones. Since generation can also be viewed as retrieving the best sentence among all possible word sequences, this suggests that our model is better at generating high-quality sentences.
Furthermore, the reference-based metrics in automatic evaluation suggest that our model generates comments that are of better quality and more consistent with the ground-truth comments. Nevertheless, the BLEU, ROUGE and METEOR scores of all models are rather low, which accords with our observation in Section 4.4.1 that reference-based metrics have their limitations in such diversified generation tasks.
Additionally, our model is preferred by the human judges in human evaluation. This suggests that our model generates comments that are more consistent with human writing habits. We also find that the margin between our model and the baselines in Informativeness is larger than in the other aspects. Since the proposed DCA and GAM modules integrate information from video and text, the generated sentences tend to contain richer information than those of the other models.
The experiments show consistent results for Seq2Seq-based models and Transformer-based models. Hence, the proposed DCA and GAM modules appear to generalize well and can adapt to different model architectures.
4.6 Incremental Analysis
In order to understand the efficacy of the proposed methods, we further conduct an incremental analysis on different settings of our Transformer-based model. The results are shown in Table 4, showing that the DCA module, the orthogonalization technique and the GAM all contribute to the performance of the proposed model.
As is shown in the results, DCA outscores the traditional co-attention by learning multiple distance metrics in the joint space of video and text. Thus, DCA builds interactions between two information sources from multiple perspectives. The parameter orthogonalization further improves the model by reducing information redundancy in DCA’s multi-perspective setting (see further explanations in Section 4.7). The GAM also makes contributions to the model by balancing and integrating the information collected from encoders and the DCA module.
4.7 Visualization of DCA
In this section, we illustrate the effectiveness of parameter orthogonalization by visualizing the similarity matrices in the DCA module. In the vanilla DCA (shown in Figure 3), each similarity matrix is generated by one distance metric through Eq.(6). However, the similarity matrices are highly similar to each other. This shows that the information extracted from the different perspectives suffers from the information redundancy problem. After introducing the parameter orthogonalization (shown in Figure 3), apparent differences can be seen among these similarity matrices. This further explains the performance improvement in Table 4 after adding the orthogonalization technique to the proposed model. The parameter orthogonalization helps the DCA module learn distinct distance metrics and generate diversified representations, thus alleviating information redundancy in DCA’s multi-perspective setting.
5 Related Work
Automatic Article Commenting
One similar task to our work is automatic article commenting. Qin et al. (2018) is the first to introduce this task and constructs a Chinese news dataset. Lin et al. (2019) combines retrieval and generation methods based on user-generated data to assist comment generation. Ma et al. (2018) proposes a retrieval-based commenting framework on unpaired data via unsupervised learning. Yang et al. (2019) leverages visual information for comment generation on graphic news. Zeng et al. (2019) uses a gated memory module to generate personalized comments on social media. Li et al. (2019a) models the news article as a topic interaction graph and proposes a graph-to-sequence model.
Another similar task to ALVC is video captioning, which aims to generate a descriptive text for a video. Venugopalan et al. (2015) applies a unified deep neural network with CNN and LSTM layers. Shetty and Laaksonen (2016) strives to extract both object attributes and action features in the video. Shen et al. (2017) proposes a sequence generation model with weakly supervised information for dense video captioning. Xiong et al. (2018) produces descriptive paragraphs for videos via a recurrent network by assembling temporally localized descriptions. Li et al. (2019c) uses a residual attention-based LSTM to reduce information loss in generation. Xu et al. (2019) jointly performs event detection and video description via a hierarchical network.
Our model is also inspired by the previous work of co-attention. Lu et al. (2016) introduces a hierarchical co-attention model in visual question answering. Xiong et al. (2017) uses a dynamic co-attention based on iterative pointing procedure. Nguyen and Okatani (2018) proposes a dense co-attention network with a fully symmetric architecture. Tay et al. (2018) applies a co-attentive multi-pointer network to model user-item relationships. Hsu et al. (2018) adds co-attention module into CNNs to perform unsupervised object co-segmentation. Yu et al. (2019) applies a deep modular co-attention network in combination of self-attention and guided-attention. Li et al. (2019b) uses positional self-attention and co-attention to replace RNNs in video question answering.
6 Conclusion
In this work, we propose a novel model for automatic live video commenting. Equipped with a Diversified Co-Attention module and a Gated Attention Module, the model integrates information from diversified perspectives. We further propose an effective parameter orthogonalization technique to alleviate potential information redundancy in DCA’s multi-perspective setting. Experiments on both Seq2Seq and Transformer architectures demonstrate the advantage of our model over previous approaches and the traditional co-attention. Further incremental analysis and visualization support the effectiveness of the methods proposed in this paper.
References
- Neural machine translation by jointly learning to align and translate. In ICLR 2015.
- METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, pp. 65–72.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, pp. 1724–1734.
- Parseval networks: improving robustness to adversarial examples. In ICML 2017, pp. 854–863.
- Visual dialog. In CVPR 2017, pp. 1080–1089.
- Deep residual learning for image recognition. In CVPR 2016, pp. 770–778.
- Co-attention CNNs for unsupervised object co-segmentation. In IJCAI 2018, pp. 748–756.
- Coherent comments generation for Chinese articles with a graph-to-sequence model. In ACL 2019, pp. 4843–4852.
- Beyond RNNs: positional self-attention with co-attention for video question answering. In AAAI 2019, pp. 8658–8665.
- Residual attention-based LSTM for video captioning. World Wide Web 22 (2), pp. 621–636.
- ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
- Learning comment generation by leveraging user-generated data. In ICASSP 2019, pp. 7225–7229.
- A structured self-attentive sentence embedding. In ICLR 2017.
- Hierarchical question-image co-attention for visual question answering. In NeurIPS 2016, pp. 289–297.
- LiveBot: generating live video comments based on visual and textual contexts. In AAAI 2019, pp. 6810–6817.
- Unsupervised machine commenting with neural variational topic model. arXiv preprint arXiv:1809.04960.
- Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In CVPR 2018, pp. 6087–6096.
- BLEU: a method for automatic evaluation of machine translation. In ACL 2002, pp. 311–318.
- Automatic article commenting: the task and dataset. In ACL 2018, pp. 151–156.
- Weakly supervised dense video captioning. In CVPR 2017, pp. 5159–5167.
- Frame- and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 24th ACM international conference on Multimedia, pp. 1073–1076.
- Multi-pointer co-attention networks for recommendation. In KDD 2018, pp. 2309–2318.
- Attention is all you need. In NeurIPS 2017, pp. 5998–6008.
- CIDEr: consensus-based image description evaluation. In CVPR 2015, pp. 4566–4575.
- Sequence to sequence - video to text. In ICCV 2015, pp. 4534–4542.
- Show and tell: A neural image caption generator. In CVPR 2015, pp. 3156–3164.
- Distance metric learning with application to clustering with side-information. In NeurIPS 2002, pp. 505–512.
- Dynamic coattention networks for question answering. In ICLR 2017.
- Move forward and tell: A progressive generator of video descriptions. In ECCV 2018, pp. 489–505.
- Joint event detection and description in continuous video streams. In WACV 2019, pp. 396–405.
- Cross-modal commentator: automatic machine commenting based on cross-modal information. In ACL 2019, pp. 2680–2686.
- Deep modular co-attention networks for visual question answering. In CVPR 2019, pp. 6281–6290.
- Automatic generation of personalized comment based on user profile. In ACL 2019, pp. 229–235.
Appendix A Derivation of Parameter Orthogonalization
According to Lin et al. (2017), to alleviate the problem of information redundancy, the parameter matrices should be as orthogonal as possible (that is, the product of each matrix with its transpose should be close to the identity matrix). We first try to introduce a regularization term (Cissé et al., 2017) into the original loss function as an orthonormality constraint:
where the coefficient of the regularization term is a tunable hyper-parameter. However, we empirically find that naively introducing the regularization term tends to cause the collapse of model training. The reason is that it may cause the parameters to be updated in a direction away from the main gradient. To remedy this, we propose an approximate alternative: after back-propagation updates all parameters at each learning step, we apply a post-processing step that is equivalent to the aforementioned orthonormality constraint, updating each parameter matrix along the gradient of the regularization term:
For unit learning rate, the gradient update is , namely,