Supervised Video Summarization via Multiple Feature Sets with Parallel Attention

The assignment of importance scores to particular frames or (short) segments in a video is crucial for summarization, but also a difficult task. Previous work utilizes only one source of visual features. In this paper, we suggest a novel model architecture that combines three feature sets for visual content and motion to predict importance scores. The proposed architecture applies an attention mechanism before fusing motion features with features representing the (static) visual content, i.e., features derived from an image classification model. Comprehensive experimental evaluations are reported for two well-known datasets, SumMe and TVSum. In this context, we identify methodological issues in how previous work used these benchmark datasets, and present a fair evaluation scheme with appropriate data splits that can be used in future work. When using static and motion features with a parallel attention mechanism, we improve state-of-the-art results for SumMe, while being on par with the state of the art for the other dataset.


1 Introduction

In the current information age, the enormous amount of available informative or entertaining multimedia content has increased the need for methods to detect important and thus relevant content. The number of available videos is also growing rapidly [DBLP:conf/cvpr/SongVSJ15, DBLP:conf/eccv/ZhangCSG16], and this velocity highlights that the detection of important video segments is an essential task for the field of computer vision. Every minute, hundreds [DBLP:conf/eccv/ZhangCSG16, DBLP:conf/eccv/ZhangGS18] or even a thousand hours of video are uploaded to video or social media platforms. Skipping forward to a desired or interesting part of a video is widespread, and this behavior is also denoted as “skim through” [DBLP:conf/eccv/PotapovDHS14]. This behavior is undoubtedly subjective, but similar behavior across many viewers can highlight the importance of a particular video segment. Overall, the fine-grained identification of important segments in a video is an important task.

Video summarization can be defined as the conversion of a (potentially long) video into a shorter video that contains essential segments and thus allows a viewer to understand the video content. One of the main challenges is to identify and select important frames and segments in the original video that can be reused in the video summary. Generating summaries in the form of selected frames is useful when the main goal is to obtain shorter videos that retain the potentially important parts. Many methods apply video segmentation as an initial pre-processing step for video summarization [DBLP:conf/cvpr/SongVSJ15, DBLP:conf/eccv/ZhangCSG16, DBLP:conf/eccv/GygliGRG14, DBLP:conf/mm/Wang000FT19]. For example, fine-grained approaches that operate on video segments as short as one second can open doors to potential applications such as live video streams in surveillance, learning, or the entertainment sector. Although the problem of assigning importance scores to frames for video summarization has been studied before [DBLP:conf/eccv/ZhangCSG16, DBLP:conf/eccv/GygliGRG14, DBLP:conf/mm/Wang000FT19, DBLP:conf/eccv/PotapovDHS14, DBLP:conf/nips/ShiCWYWW15, DBLP:conf/nips/GongCGS14], little attention has been paid to the impact of core system components, that is, the role of different types of visual features and how to combine them. In this regard, most previous work relied only on content-based features for image classification [DBLP:conf/eccv/ZhangGS18, DBLP:journals/corr/abs-2006-01410, DBLP:conf/accv/FajtlSAMR18, DBLP:journals/corr/abs-1708-09545, DBLP:conf/mm/FengLKZ18].

In this paper, we address this research gap and investigate how different feature types, i.e., static and motion features, can be integrated in a model architecture for video summarization. In particular, we examine how a fusion of different feature types affects the overall performance when the features are incorporated with an attention mechanism similar to the one used in previous work [DBLP:journals/corr/abs-2006-01410, DBLP:conf/accv/FajtlSAMR18, DBLP:journals/corr/abs-1708-09545, DBLP:conf/wacv/FuTC19]. For this task, we propose a novel deep learning model for supervised video summarization called Multi-Source Visual Attention (MSVA). The model fuses image and motion features based on a self-attention mechanism [DBLP:conf/accv/FajtlSAMR18] in a parallel fashion. Our comprehensive analysis on two benchmark datasets shows that our model outperforms other systems on the SumMe dataset [DBLP:conf/eccv/ZhangCSG16], while obtaining performance similar to the state of the art on the TVSum dataset [DBLP:conf/cvpr/SongVSJ15]. In addition, we uncover issues in the experimental evaluation of previous methods, specifically in the cross-validation on both benchmark datasets: in some cases, videos were either excluded from the evaluation or reused in multiple splits, which makes it difficult to compare the published systems in a fair manner and hinders the reproducibility of results. Therefore, we present a revised version of both benchmark datasets by providing five new non-overlapping splits and evaluate previous approaches on them. We share the source code for the proposed model and the new non-overlapping splits for the TVSum and SumMe datasets with the research community. Our main contributions can be summarized as follows:

1.) We introduce a novel architecture based on multi-source visual features with an attention mechanism.

2.) We identify issues in previous experimental setups and reproduce the experimental results for some approaches on valid cross-validation folds for two benchmark datasets.

3.) State-of-the-art results are improved for the SumMe dataset, while achieving similar results in comparison with other models on the TVSum dataset.

The rest of the paper is structured as follows. In Section 2, we review previous work on supervised video summarization. Section 3 describes the different feature sets, the proposed model architecture, and the attention mechanism. Experimental results and the comparison with other state-of-the-art methods are reported in Section 4. We conclude the paper with a summary in Section 5.

Figure 1: The neural network architecture for the Multi-Source Visual Attention (MSVA) model with parallel self-attention mechanism based on multiple feature sets.

2 Related Work

To solve the task of video summarization, both supervised and unsupervised machine learning approaches have been suggested in the literature. Supervised methods train classifiers to learn the importance of a frame or segment for a summary. The process starts with the segmentation of videos, either uniformly into equally sized chunks, as done by Gygli et al. [DBLP:conf/eccv/GygliGRG14], or using algorithms like kernel temporal segmentation (KTS) by Potapov et al. [DBLP:conf/eccv/PotapovDHS14]. Gygli et al. [DBLP:conf/eccv/GygliGRG14] computed an interestingness score for each segment using a weighted sum of features by combining low-level spatio-temporal salience or high-level motion information, while Song et al. [DBLP:conf/cvpr/SongVSJ15] measure frame-level importance using learned factorization. Another approach, suggested by Potapov et al. [DBLP:conf/eccv/PotapovDHS14], trains SVMs to classify frames in segments obtained through KTS.

Recurrent Neural Networks (RNNs), specifically long short-term memory (LSTM) and bidirectional LSTM (BiLSTM) networks, have also been proposed for video summarization, for example a BiLSTM model stacked with a Determinantal Point Process (DPP) [DBLP:conf/eccv/ZhangCSG16] and weighted memory layers combined with an LSTM [DBLP:conf/eccv/ZhangGS18]. In these approaches, the model either helps to avoid similar frames in the final selection of a summary or addresses this problem by encoding long video sequences into short sequences. As mentioned before, attention mechanisms are widely used in video summarization and combined with different neural architectures [DBLP:journals/corr/abs-2006-01410, DBLP:conf/accv/FajtlSAMR18, DBLP:journals/corr/abs-1708-09545, DBLP:conf/wacv/FuTC19], where promising or even the best results have been achieved recently.

Our model MSVA differs from approaches like MAVS [DBLP:conf/mm/FengLKZ18], M-AVS [DBLP:journals/corr/abs-1708-09545] and MC-VSA [DBLP:journals/corr/abs-2006-01410] as follows. The proposed MSVA model has multiple sources of visual features, and attention is applied to each source in a parallel fashion. The MAVS system [DBLP:conf/mm/FengLKZ18] is a memory-augmented video summarizer with global attention, M-AVS [DBLP:journals/corr/abs-1708-09545] uses multiplicative attention within an encoder-decoder architecture, and MC-VSA [DBLP:journals/corr/abs-2006-01410] is a multi-concept video self-attention model in which attention is applied to multiple layers of the encoder and decoder. The majority of previous work [DBLP:conf/eccv/ZhangCSG16, DBLP:conf/eccv/ZhangGS18, DBLP:journals/corr/abs-1708-09545] uses pre-trained image features from GoogleNet [DBLP:conf/cvpr/SzegedyLJSRAEVR15] to encode video frames.

3 Supervised Video Summarization with Multi-source Features

In this section, we describe the overall architecture of the Multi-Source Visual Attention (MSVA) model and the details of its building blocks. First, we define some variables of the target problem. A video can be represented as a sequence $V = (v_1, \dots, v_N)$, where $N$ is the number of frames and $v_t$ is the visual frame at time $t$. The sequence of frames can be represented by different visual features as $X = (x_1, \dots, x_N)$, where $x_t \in \mathbb{R}^D$ is a vector that represents the extracted features from the $t$-th frame based on a specific feature encoder with a dimension $D$. The task of the model is to produce $Y = (y_1, \dots, y_N)$ as an output that represents the importance scores of the frames.

The feature encoders can be based on pre-trained models like GoogleNet [DBLP:conf/cvpr/SzegedyLJSRAEVR15], or content-based image features as mentioned in related work, in order to extract features to represent frames in videos. In contrast to prior work, we exploit additional models to enhance the representation of visual information in frames. For instance, different actions such as bungee jumping or hiking need to have different representations and rely on motion and temporal aspects, and not just on the static image content and object categories present in the ImageNet dataset. Therefore, we propose the use of content-based image features in combination with motion-related features to capture a richer representation within our model.

Once the features are extracted, we apply an attention mechanism followed by a couple of linear layers to each feature type and fuse the results to obtain a common embedding space to represent frames. After fusion, we apply linear layers, normalization, and activation functions, and predict the importance scores of the given input frames. The overall architecture of the MSVA model is shown in Figure 1. Next, we describe visual feature extraction, the attention mechanism, and fusion techniques to incorporate visual features from multiple sources for video summarization.

3.1 Content-based Image and Motion Features

We propose to combine three different feature types to provide a richer representation of frames in videos. After applying uniform sub-sampling to all frames of a video (two frames per second), the selected frames are passed as input to the pre-trained models to extract the following features.
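As a rough illustration of the uniform sub-sampling step, the following sketch selects frame indices at a target rate of two frames per second. The function name `subsample_indices` and the example frame rate of 30 fps are assumptions made for illustration and are not taken from the paper.

```python
def subsample_indices(num_frames: int, fps: float, target_fps: float = 2.0) -> list[int]:
    """Return the indices of frames kept when sampling `target_fps` frames per second."""
    if num_frames <= 0 or fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, fps and target_fps must be positive")
    # Keep every `step`-th frame; never use a step smaller than one frame.
    step = max(1, round(fps / target_fps))
    return list(range(0, num_frames, step))


# Hypothetical example: a 10-second clip recorded at 30 fps.
indices = subsample_indices(num_frames=300, fps=30.0)
print(len(indices), indices[:4])
```

Only the frames at the returned indices would then be passed to the pre-trained feature extractors.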

Image content: A deep neural network is trained on the ImageNet dataset for an object classification task [DBLP:conf/cvpr/SzegedyLJSRAEVR15]. The most common pre-trained model to extract visual features for the video summarization domain is GoogleNet [DBLP:conf/cvpr/SzegedyLJSRAEVR15]. We use the same model to extract content-based frame features from the pool5 layer (1024 dimensions), and represent them as $X_o$.

Motion: To leverage motion-related features, we use the I3D (Inflated 3D ConvNet) model [DBLP:conf/cvpr/CarreiraZ17] pre-trained on the Kinetics dataset [kinetics_dataset], which is composed of human actions such as drawing, drinking, laughing, hugging, and opening presents. From this model, we can extract two types of features: RGB (red, green, blue) and optical flow. RGB features, denoted as $X_r$, capture the channel-wise color information with regard to scene changes, while the optical flow features, denoted as $X_f$, represent the motion in consecutive frames. The features are extracted from the second last layer of the pre-trained I3D model (1024 dimensions).

3.2 Attention Module

The attention module used in our architecture is based on the approach of Fajtl et al. [DBLP:conf/accv/FajtlSAMR18]. It is shown in Figure 1 (b). Each type of feature is fed into a separate attention mechanism followed by two linear layers and a single normalization layer. An important aspect of the attention mechanism is the aperture $w$, which defines the attention window with a range $[-w, +w]$; the normalized attention vectors for a window are $\alpha_t$. The attention weights $e_t$ at time index $t$ with a subset from the feature set $X$ are calculated as follows.

$$e_{t,i} = s \left[ (U x_i)^\top (V x_t) \right], \quad t, i \in [0, N) \qquad (1)$$

Here, $N$ is the number of video frames, $U$ and $V$ are learnable weight matrices, $s$ is a scale parameter that reduces the value of the dot product, and $x_i$ is the $i$-th feature vector from the entire input sequence.

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})} \qquad (2)$$

$\alpha_{t,i}$ is a pairwise attention weight of the input vector $x_t$ with respect to a vector $x_i$ from the entire input sequence $X$. The vector $\alpha_t$ contains the attention weights for the target frame at time $t$ based on the other vectors from the input sequence.

$$c_t = \sum_{i} \alpha_{t,i} \, x_i \qquad (3)$$

The feature vector of a visual descriptor is multiplied with the attention weights and fed into two linear layers ($L_1$, $L_2$) and then into a normalization layer to obtain a latent representation $k_t$ of each type of features (as shown in Figure 1 (c)):

$$k_t = \mathrm{Norm}\left( L_2 \left( L_1 (c_t) \right) \right) \qquad (4)$$
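The attention branch for a single feature type can be sketched as follows. This is a minimal, dependency-free illustration written with plain Python lists, not the authors' implementation: the names `attention_branch`, `U`, `V`, `L1`, `L2`, the scale value, the random weights, and the way the aperture is applied (restricting attention to frames within `aperture` positions of the target frame) are all assumptions made for this example.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix `m` (list of rows) with vector `v`."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def attention_branch(x, U, V, L1, L2, scale, aperture):
    """Windowed self-attention, then two linear layers and a normalization."""
    ux = [matvec(U, xi) for xi in x]  # projections of the context frames
    vx = [matvec(V, xt) for xt in x]  # projections of the target frames
    latent, weights = [], []
    for t in range(len(x)):
        # Assumed aperture handling: only frames in [t - aperture, t + aperture].
        lo, hi = max(0, t - aperture), min(len(x), t + aperture + 1)
        e_t = [scale * sum(a * b for a, b in zip(ux[i], vx[t])) for i in range(lo, hi)]
        alpha_t = softmax(e_t)
        # Attention-weighted sum of the input features inside the window.
        c_t = [sum(alpha_t[j] * x[lo + j][d] for j in range(hi - lo))
               for d in range(len(x[t]))]
        latent.append(layer_norm(matvec(L2, matvec(L1, c_t))))
        weights.append(alpha_t)
    return latent, weights


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or extracted features."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_frames = 4, 6
frames = rand_matrix(n_frames, dim, rng)  # stand-in for per-frame features
latent, weights = attention_branch(
    frames,
    rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng),  # U, V
    rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng),  # L1, L2
    scale=0.06, aperture=2,  # arbitrary illustrative values
)
print(len(latent), len(latent[0]), len(weights[0]), len(weights[3]))
```

In the full model, one such branch would be run per feature type in parallel, each with its own weights.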


3.3 Multi-source Fusion

As mentioned above, our proposed model incorporates different types of features. In this stage, all latent representations after attention and the following linear transformation layers are fused to obtain a single vector representation of the input frames. We use an addition function to fuse the three vectors $k_o$, $k_f$, $k_r$ corresponding to the latent representations of the input features $X_o$, $X_f$, $X_r$ (image, optical flow, RGB). The result is passed to a linear layer ($L_3$), followed by ReLU activation, dropout, normalization, and another linear layer ($L_4$). Finally, the output vector is fed into a sigmoid function $\sigma$ that outputs the importance scores $y$ for the input frames:

$$y = \sigma\left( L_4 \left( \mathrm{Norm}\left( \mathrm{Dropout}\left( \mathrm{ReLU}\left( L_3 (k_o + k_f + k_r) \right) \right) \right) \right) \right) \qquad (5)$$

The formula given in Equation 5 and the model architecture depicted in Figure 1 can be considered as an intermediate fusion, since the combination of latent representations of different feature types is performed in the intermediate layers of the neural network. Additionally, we experiment with other techniques such as early and late fusion. Early fusion is realized by combining the input features directly after the encoding of the input frames, followed by a single attention mechanism, linear layers, normalization, dropout, activation functions, and the classification layer. For the late fusion, we combine the latent representations in the last linear layer ($L_4$), which means that the latent representations of the different feature types are processed by separate layers in parallel until $L_4$.
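The intermediate fusion step for a single frame can be sketched as follows, under stated assumptions: the function name `intermediate_fusion`, the weight names `L3`, `b3`, `L4`, `b4`, and all numeric values are hypothetical, and dropout is left out because it acts as the identity at inference time.

```python
import math


def linear(weights, bias, v):
    """Affine map: weights @ v + bias."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def sigmoid(a):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def intermediate_fusion(k_image, k_flow, k_rgb, L3, b3, L4, b4):
    """Add three latent vectors, then linear -> ReLU -> norm -> linear -> sigmoid."""
    fused = [a + b + c for a, b, c in zip(k_image, k_flow, k_rgb)]
    hidden = layer_norm(relu(linear(L3, b3, fused)))
    return sigmoid(linear(L4, b4, hidden)[0])


# Hypothetical latent vectors of one frame for the three feature types.
k_image, k_flow, k_rgb = [0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, -0.4, 0.1]
L3, b3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0]
L4, b4 = [[0.5, -0.5, 0.25]], [0.0]
score = intermediate_fusion(k_image, k_flow, k_rgb, L3, b3, L4, b4)
print(f"importance score: {score:.3f}")
```

Because the fusion is a plain addition, the order of the three branches does not affect the score. An early-fusion variant would instead combine the raw features before a single attention branch, and a late-fusion variant would keep the three branches separate up to the last linear layer.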

4 Experimental Setup and Results

In this section, we present details about benchmark datasets, evaluation protocols, ablation studies, comparison with state-of-the-art baselines, and video-wise qualitative analysis.

4.1 Datasets and Evaluation Metrics

We use the following benchmark datasets to evaluate our approach and compare it with state-of-the-art approaches:

1.) TVSum [DBLP:conf/cvpr/SongVSJ15]: 50 videos with a length of 2-10 minutes, annotated by 20 users.

2.) SumMe [DBLP:conf/eccv/ZhangCSG16]: 25 videos with a length of 1-6 minutes, annotated by 18 users.

The evaluation of previous approaches on these datasets is based on 5-fold cross-validation, and the reported results are scores averaged across the five splits of the corresponding dataset. The splits for the TVSum and SumMe datasets are provided by Zhang et al. [DBLP:conf/eccv/ZhangCSG16]. When reproducing the results of previous work, we observed methodological issues in the evaluation for both benchmark datasets. Some videos are dropped from certain splits and are never part of the validation set. For instance, "video_5" and "video_8" in SumMe as well as "video_21" and "video_28" in TVSum are not part of any validation split. In total, eight videos from SumMe and a number of videos from TVSum are dropped from certain splits and were never evaluated. Another issue is that some videos are repeated in multiple splits. To fix these problems and to provide a fair comparison for future research, we release new versions of the two benchmark datasets by providing five non-overlapping splits in which videos are equally divided across splits without any repetition or exclusion.
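The requirement that every video is tested exactly once can be checked mechanically. The sketch below is only an illustration of how non-overlapping splits could be generated and validated. The function names `make_splits` and `check_splits`, the random seed, and the shuffling strategy are assumptions and do not describe how the released splits were actually produced.

```python
import random


def make_splits(video_ids, n_splits=5, seed=0):
    """Partition `video_ids` into `n_splits` disjoint test folds of near-equal size."""
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_splits] for i in range(n_splits)]
    return [
        {"train": sorted(set(ids) - set(fold)), "test": sorted(fold)}
        for fold in folds
    ]


def check_splits(splits, video_ids):
    """Verify that each video is tested exactly once and never trained on in that fold."""
    tested = [v for s in splits for v in s["test"]]
    assert sorted(tested) == sorted(video_ids), "a video is missing or repeated"
    for s in splits:
        assert not set(s["train"]) & set(s["test"]), "train/test overlap"


videos = [f"video_{i}" for i in range(1, 51)]  # 50 videos, as in TVSum
splits = make_splits(videos)
check_splits(splits, videos)
print([len(s["test"]) for s in splits])
```

Running `check_splits` on an existing set of splits would flag both of the issues described above, excluded videos and videos repeated across test folds.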

Regarding the evaluation, Otani et al. [DBLP:conf/cvpr/OtaniNRH19] argue that the F1 score has certain limitations for the task of video summarization and proposed to measure the correlation between predictions and human annotations. In particular, they suggested Spearman's $\rho$ and Kendall's $\tau$ as correlation coefficients to evaluate how close the summaries predicted by models are to human annotations. Following these arguments, we evaluate our model architecture on the original splits and on the new non-overlapping splits for both datasets in terms of F1 scores and correlation coefficients. The corresponding non-overlapping splits and the source code for the evaluation metrics will be made available to enable fair comparisons and reproducibility of future research. To compare against previous methods, we have also reproduced the work of Fajtl et al. [DBLP:conf/accv/FajtlSAMR18] and evaluated it with both the F1 measure and the correlation coefficients, computed according to [DBLP:journals/corr/abs-2006-01410, DBLP:conf/cvpr/OtaniNRH19].
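To make the two rank correlation coefficients concrete, the following dependency-free sketch computes Spearman's rho and Kendall's tau between predicted and annotated frame scores. The function names and the example scores are hypothetical. The Kendall variant shown is the simple tau-a, which is an assumption of this sketch; library routines such as `scipy.stats.kendalltau` handle ties differently by default.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(pred, truth):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(pred), average_ranks(truth))


def kendall_tau(pred, truth):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    n, total = len(pred), 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred[i] - pred[j]) * (truth[i] - truth[j])
            total += (s > 0) - (s < 0)  # tied pairs contribute zero
    return total / (n * (n - 1) / 2)


# Hypothetical importance scores for five frames.
predicted = [0.1, 0.4, 0.35, 0.8, 0.6]
annotated = [0.2, 0.3, 0.5, 0.9, 0.4]
print(round(spearman_rho(predicted, annotated), 3),
      round(kendall_tau(predicted, annotated), 3))
```

Unlike the F1 score, both coefficients compare the ordering of frame scores directly, without first selecting a summary.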

4.2 Results

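Dataset  Fusion        Features            F1
SumMe    -             one feature set     50.5
SumMe    early         three feature sets  46.7
SumMe    early         two feature sets    44.5
SumMe    early         two feature sets    44.8
SumMe    intermediate  three feature sets  53.4
SumMe    intermediate  two feature sets    50.9
SumMe    intermediate  two feature sets    51.5
SumMe    late          three feature sets  51.0
SumMe    late          two feature sets    50.1
SumMe    late          two feature sets    50.8
TVSum    -             one feature set     60.1
TVSum    early         three feature sets  57.3
TVSum    early         two feature sets    56.7
TVSum    early         two feature sets    56.3
TVSum    intermediate  three feature sets  61.5
TVSum    intermediate  two feature sets    61.1
TVSum    intermediate  two feature sets    61.2
TVSum    late          three feature sets  60.1
TVSum    late          two feature sets    58.9
TVSum    late          two feature sets    59.7

Table 1: Ablation study with different feature types and fusion techniques, using the best aperture size (250) for the MSVA model. F1 is the average F1 score calculated on the newly provided five non-overlapping splits.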

Ablation study: We evaluated different building blocks of our proposed MSVA model. We performed a grid search on model hyper-parameters such as the aperture size of the attention mechanism, the fusion technique (early, intermediate, late), and the combination of feature types (object, RGB, optical flow) using a random subset of the data. Adam (Adaptive Moment Estimation) is used as the optimizer, and each variation is trained with a stopping criterion of 50 epochs when the loss is static.
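A stopping criterion of this kind is commonly implemented with a patience counter. The sketch below shows one possible reading, in which training stops once the loss has not improved for 50 consecutive epochs. The function name `stopping_epoch`, the `min_delta` parameter, and the synthetic loss curve are assumptions for illustration only.

```python
def stopping_epoch(epoch_losses, patience=50, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    `epoch_losses` stands in for the loss observed after each training epoch.
    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(epoch_losses):
        if loss < best - min_delta:
            best, stale = loss, 0  # improvement: reset the patience counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(epoch_losses) - 1  # patience never exhausted


# Synthetic curve: the loss decreases for 20 epochs and then stays flat.
losses = [1.0 / (e + 1) for e in range(20)] + [0.05] * 100
print(stopping_epoch(losses))
```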

The results are given in Table 1. The reported F1 score is the average of 5-fold cross-validation on the non-overlapping splits that we provide for both benchmark datasets. We include only the best performing combinations of hyper-parameters (including the aperture window size of 250). It can be concluded that the model with an intermediate fusion of three features gives the best performance for both datasets with an aperture size of 250.

Overall comparison: We compared our proposed MSVA model with other state-of-the-art models that were evaluated on both benchmark datasets.

The results are given in Table 2 for both benchmark datasets. We report the F1 score as well as Kendall's $\tau$ and Spearman's $\rho$ correlation coefficients on the newly provided non-overlapping splits, along with the F1 score on the original splits. The results of previous work that did not share source code are reported only for the original splits. We evaluated VASNet [DBLP:conf/accv/FajtlSAMR18] on all evaluation metrics, as the source code of the model is available. Based on the given results, we can see that our model improves the state-of-the-art results for the SumMe dataset and achieves comparable results for the TVSum dataset, while outperforming multiple systems including VASNet [DBLP:conf/accv/FajtlSAMR18]. A similar pattern is seen for both correlation coefficients, where our model obtains the best results for both datasets. Another observation is that the performance of VASNet is reduced by 1-2 points in terms of F1 score when evaluated on the non-overlapping splits in comparison to the original splits of both datasets. This can be explained by the fact that, in the original splits, some videos were not part of any test split for 5-fold cross-validation and some videos were repeated across multiple splits. The MAVS approach [DBLP:conf/mm/FengLKZ18] is the best performing of the compared models on TVSum, while it performs poorly on SumMe. In contrast, our proposed model outperforms all baselines on SumMe, while still achieving comparable results for TVSum and the best results in terms of correlations.

Dataset  Method                                      F1 (original splits)  F1 (non-overlapping splits)  Kendall's τ  Spearman's ρ
SumMe    MAVS [DBLP:conf/mm/FengLKZ18]               43.1                  -                            -            -
SumMe    M-AVS [DBLP:journals/corr/abs-1708-09545]   44.4                  -                            -            -
SumMe    re-SEQ2SEQ [DBLP:conf/eccv/ZhangGS18]       44.9                  -                            -            -
SumMe    VASNet [DBLP:conf/accv/FajtlSAMR18]         49.7                  48.0                         0.16         0.17
SumMe    MC-VSA [DBLP:journals/corr/abs-2006-01410]  51.6                  -                            -            -
SumMe    MSVA (ours)                                 54.5                  53.4                         0.20         0.23
TVSum    M-AVS [DBLP:journals/corr/abs-1708-09545]   61.0                  -                            -            -
TVSum    VASNet [DBLP:conf/accv/FajtlSAMR18]         61.4                  59.8                         0.16         0.17
TVSum    MC-VSA [DBLP:journals/corr/abs-2006-01410]  63.7                  -                            0.116        0.142
TVSum    re-SEQ2SEQ [DBLP:conf/eccv/ZhangGS18]       63.9                  -                            -            -
TVSum    MAVS [DBLP:conf/mm/FengLKZ18]               67.5                  -                            -            -
TVSum    MSVA (ours)                                 62.8                  61.5                         0.19         0.21

Table 2: Comparison of different methods on benchmark datasets. F1 (non-overlapping splits) is the average F1 score calculated on the newly provided five non-overlapping splits, and F1 (original splits) is the score reported by previous systems on the original five splits. ρ is Spearman's and τ is Kendall's correlation coefficient.

Figure 2: Comparison of predictions and ground truth labels on videos with a low (a) “Playing on water slide” and a high (b) “Playing ball” F1 score from the SumMe dataset.

4.3 Qualitative analysis

In Figure 2, we plot the importance score predictions of the MSVA model compared to the average ground truth labels assigned by the annotators. Our model achieved a lower performance for the video on the left, while it achieved a higher performance for the video on the right. The video on the left is called “Playing on water slide”; its main content is recorded next to a water slide where a number of kids are playing. The difficulty lies in assigning different importance scores to frames that look similar, i.e., the background is always a water slide and children. Such confusion can also be observed in the ground truth labels in the middle of the video, where different scores are assigned to visually similar frames. The video on the right is called “Playing ball”, with the main content being a dog and a bird playing with a white ball. One observation for this video is that only a few objects appear at any particular time, i.e., three objects, where one is prominent and the others, a bird and a ball, are small.

Figure 3: F1 score analysis for all videos in SumMe. The scores are taken when the videos were used for testing across the five folds.

To understand the impact of splitting the dataset, we plot the F1 scores for all videos in SumMe in Figure 3. Each score is computed when the individual video was part of the test set across the five folds. There is a big difference in the F1 score across videos. Thus, both the exclusion of certain videos and the repetition of the same videos in multiple splits affect the overall comparison across different models.

4.4 Discussion

The experimental results for both datasets demonstrate the importance of exploiting multiple sources of feature types, particularly with regard to motion. Similarly, applying the attention mechanism to each source in a parallel fashion plays a vital role in providing decisive power to the model. Although the supervised method has the advantage of learning from annotated labels, the model also incorporates bias from datasets based on (partially disagreeing) annotations from multiple users, as shown by the qualitative analysis above. Lastly, it is shown that the fusion strategy has a noticeable impact on the performance when utilizing multiple feature sets.

Model performance varies considerably across videos, and the reasons can be identified by observing the visual content of each video. This also calls for features from the visual contextual domain to close the gap on these difficult kinds of videos.

5 Conclusion

In this paper, we have proposed a model architecture that utilizes visual features from multiple sources, i.e., static object and motion features, with an attention mechanism and different fusion techniques. The intermediate fusion of object and motion features turned out to be the best configuration, as shown by the experimental results. State-of-the-art results were improved on the benchmark dataset SumMe, and comparable results were obtained on the other benchmark dataset TVSum. Furthermore, methodological issues have been identified in the evaluation setup of previous work, and we have provided non-overlapping splits for cross-validation to enable a fair comparison. In the future, we will focus on semantic aspects to enhance the model with additional decisive capabilities.