Video recording devices have become omnipresent. Most of the videos taken with smartphones, surveillance cameras and wearable cameras are recorded with a "capture first, filter later" mentality. However, most raw videos are never curated and remain too long, shaky, redundant and boring to watch. This raises new challenges in searching both within and across videos.
The problem of making video content more accessible has spurred research in automatic tagging [Qi2007, Ballan2014, Mazloom2016] and video summarization [Sun2014, Gygli2015, Potapov2014, Ghosh2012, Khosla, Lu2013, Park2014, Kim, Zhao]. In automatic tagging, the goal is to predict metadata in the form of tags, which makes videos searchable via text queries. Video summarization, on the other hand, aims at making videos more accessible by reducing them to a few interesting and representative frames [Khosla, Ghosh2012] or shots [Gygli2015, Song2015].
This paper combines the goals of summarising videos and making them searchable with text. Specifically, we propose a novel method that generates video summaries adapted to a text query (see Fig. 1). Our approach improves on previous work in the area of textual-visual embeddings [Kiros2014, Liu2015] and extends an existing video summarization method based on submodular mixtures [Gygli2015] to create query-adaptive summaries.
Our method for creating query-relevant summaries consists of two parts. We first develop a relevance model which allows us to rank frames of a video according to their relevance given a text query. Relevance is computed as the sum of two terms: the cosine similarity between the embeddings of a frame and a text query in a learned visual-semantic embedding space, and a query-independent term. While the embedding captures semantic similarity between video frames and text queries, the query-independent term predicts relevance based on the quality, composition and interestingness of the content itself. We train this model on a large dataset of image search data [hua2013clickture] and our newly introduced Relevance and Diversity dataset (Section 5). The second part of the summarization system is a framework for optimising the selected set of frames not only for relevance, but also for representativeness and diversity using a submodular mixture of objectives. Figure 2 shows an overview of our complete pipeline. We publish our code and demos (https://github.com/arunbalajeev/query-video-summary) and make the following contributions:
Several improvements on learning a textual-visual embedding for thumbnail selection compared to the work by Liu et al. [Liu2015]. These include a better alignment of the learning objective to the task at test time and modeling the text queries using LSTMs, yielding significant performance gains.
A way to model semantic similarity and quality aspects of frames jointly, leading to better performance compared to using the similarity to text queries only.
An adaptation of the submodular mixtures model for video summarization by Gygli et al. [Gygli2015] to create query-adaptive and diverse summaries using our frame-based relevance model.
A new video thumbnail dataset providing query relevance and diversity labels. As the judgements are subjective, we collect multiple annotations per video and analyse the consistency of the obtained labelling.
2 Related Work
The goal of video summarization is to select a subset of frames that gives a user an idea of the video’s content at a glance [Truong2007]. To find informative frames for this task, two dominant approaches exist: (i) modelling generic frame interestingness [Ghosh2012, Gygli2016] or (ii) using additional information such as the video title or a text query to find relevant frames [Liu2009, Song2015, Liu2015]. In this work, we combine the two into one model and make several contributions for query-adaptive relevance prediction. Such models are related to automatic tagging [Qi2007, Ballan2014, Mazloom2016], textual-visual embeddings [Frome2013, Socher2014, Liu2015] and image description [Das2013, Barbu2012, Karpathy2015, Donahue2015b, Mao2014, Chen2014, Karpathy2014c, Fang2015]. In the following, we discuss approaches for video summarization, generic interestingness prediction models and previous works for obtaining embeddings.
Video summarization methods can be broadly classified into abstractive and extractive approaches. Abstractive or compositional approaches transform the initial video into a more compact and appealing representation, e.g. hyperlapses [Kopf2014], montages [Sun2014a] or video synopses [Pritch2008]. The goal of extractive methods is instead to select an informative subset of keyframes [Wolf1996, Ghosh2012, Khosla, Kim] or video segments [Gygli2015, Lu2013] from the initial video. Our method is extractive. Extractive methods need to optimise at least two properties of the summary: the quality of the selected frames and their diversity [Sharghi2016, Gygli2015, Gong2014]. Sometimes, additional objectives such as temporal uniformity [Gygli2015] and relevance [Sharghi2016] are also optimised. The simplest approach to obtain a representative and diverse summary is to cluster videos into events and select the best frame per event [DeAvila2011]. More sophisticated approaches jointly optimise for importance and diversity by using determinantal point processes (DPPs) [Gong2014, Sharghi2016, zhang2016video] or submodular mixtures [Lin, Gygli2015]. Most related to our paper is the work of Sharghi et al. [Sharghi2016], who present an approach for query-adaptive video summarization using DPPs. Their method, however, is limited to a small, fixed set of concepts such as car or flower. The authors leave the handling of unconstrained queries, as in our approach, for future work. In this work, we formulate video summarization as a maximisation problem over a set of submodular functions, following [Gygli2015].
Most methods that predict frame interestingness are based on supervised learning. The prediction problem can be formulated as classification [Potapov2014], regression [Ghosh2012, zen2016mouse], or, as is now most common, as a ranking problem [Sun2014, Gygli2016, yao2016highlight, sun2017semantic]. To simplify the task, some approaches assume that the domain of the video is given and train a model for each domain [Potapov2014, Sun2014, yao2016highlight].
An alternative approach based on unsupervised learning, proposed by Xiong and Grauman [xiong2014detecting], detects “snap points” by using a web image prior. Their model considers frames suitable as keyframes if the composition of the frames matches the composition of the web images, regardless of the frame content. Our approach is partially inspired by this work in that it predicts relevance even in the absence of a query, but it relies on supervised learning.
Unconstrained textual-visual models. Several methods exist that can retrieve images given unconstrained text or vice versa [Frome2013, Mao2014, Karpathy2014c, Karpathy2015, Donahue2015b, Fang2015, habibian2016videostory]. These typically project both modalities into a joint embedding space [Frome2013], where semantic similarity can be compared using a measure like cosine similarity. Word2vec [Mikolov2013a] and GloVe [pennington2014glove] are popular choices to obtain the embeddings of text. Deep image features are then mapped to the same space via a learned projection. Once both modalities are in the same space, they may be easily compared [Frome2013]. A multi-modal semantic embedding space is often used by zero-shot learning approaches [Frome2013, norouzi2013zero, jain2015objects2action] to predict test labels which are unseen during training. Habibian et al. [habibian2016videostory], in the same spirit, propose zero-shot recognition of events in videos by learning a video representation that aligns text, audio and video features. Similarly, Liu et al. [Liu2015] use textual-visual embeddings for video thumbnail selection. Our relevance model is based on Liu et al. [Liu2015], but we provide several important improvements. (i) Rather than keeping the word representation fixed, we jointly optimise the word and image projection. (ii) Instead of embedding each word separately, we train an LSTM model that combines a complete query into one single embedding vector; thus it even learns multi-word combinations such as visit to lake and Star Wars movie. (iii) In contrast to Liu et al. [Liu2015], we directly optimise the target objective. Our experiments show that these changes lead to significantly better performance in predicting relevant thumbnails.
3 Method for Relevance Prediction
The goal of this work is to introduce a method to automatically select a set of video thumbnails that are both relevant with respect to a query and diverse enough to represent the video. To later optimise relevance and diversity jointly, we first need a way to evaluate the relevance of frames.
Our relevance model learns a projection of video frames and text queries into the same embedding space. We denote the projection of and as and , respectively. Once trained, the relevance of a frame given a query can be estimated via some similarity measure. As in [Frome2013], we use the cosine similarity between the two embeddings (Eq. (1)).
While this lets us assess the semantic relevance of a frame w.r.t. a query, it is also possible to predict the suitability of a frame as a thumbnail a priori, based on the frame quality, composition, etc. [xiong2014detecting]. Thus, we propose to extend the above notion of relevance and model the quality aspects of thumbnails explicitly by computing the final relevance as the sum of the embedding similarity and a query-independent frame quality term,
where is a query-independent score determining the suitability of as a thumbnail, based on the quality of a frame.
In the following, we investigate how to formulate the task of obtaining the embeddings and , as well as .
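The relevance score described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names and the use of plain Python lists for embeddings are assumptions made here, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevance(query_emb, frame_emb, frame_quality):
    """Relevance = embedding similarity + query-independent quality term."""
    return cosine_similarity(query_emb, frame_emb) + frame_quality


def rank_frames(query_emb, frames):
    """Rank frames by relevance, best first.

    frames: list of (frame_embedding, quality_score) pairs (hypothetical layout).
    Returns frame indices sorted by decreasing relevance.
    """
    scores = [relevance(query_emb, emb, quality) for emb, quality in frames]
    return sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
```

Because the quality term does not depend on the query, it can be computed once per frame and reused for every query on the same video.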
3.1 Training objective
Intuitively, our model should be able to answer “What is the best thumbnail for this query?”. Thus, the problem of picking the best thumbnail for a video is naturally formulated as a ranking problem. We desire that the embedding vectors of a query and a frame that are a good match are more similar than those of the same query and a non-relevant frame. (Footnote 1: Liu et al. [Liu2015] do the inverse. They pose the problem as learning to assign a higher similarity to a corresponding frame and query than to the same frame and a random query. Thus, their model learns to answer the question “What is a good query for this image?”.) Thus, our model should learn to satisfy the rank constraint that, given a query , the relevance score of the relevant frame is higher than the relevance score of the irrelevant frame :
Alternatively, we can train the model by requiring that both the similarity score and the quality score of the relevant frame are higher than for the irrelevant frame explicitly, rather than imposing a constraint only on their sum, as above. In this case we would be imposing the two following constraints:
Experimentally, we find that training with these explicit constraints leads to slightly improved performance (see Tab. 1).
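For reference, the two formulations of the rank constraint can be written as follows. The notation is assumed here for illustration: $q$ is the query, $v^{+}$ and $v^{-}$ are the relevant and irrelevant frames, $s$ is the embedding similarity of Eq. (1) and $b$ is the query-independent quality score.

```latex
\begin{align*}
  r(q, v) &= s(q, v) + b(v)
    && \text{(relevance as the sum of similarity and quality)} \\
  r(q, v^{+}) &> r(q, v^{-})
    && \text{(constraint on the sum)} \\
  s(q, v^{+}) > s(q, v^{-})
    \quad &\text{and} \quad b(v^{+}) > b(v^{-})
    && \text{(explicit constraints)}
\end{align*}
```

The explicit constraints imply the constraint on the sum, but not the other way round, so they are the stricter training signal.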
In order to impose these constraints and train the model, we define the loss as
where is a cost function and is a margin parameter. We follow [Gygli2016] and use a Huber loss for , the robust version of an ℓ2 loss. Next, we describe how to parametrize the query embedding, the frame embedding and the quality score, so that they can be learned.
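A margin-based ranking loss with a Huber cost can be sketched as below. This is an illustration under stated assumptions rather than the exact training loss: the function names are hypothetical, the default `delta` is a placeholder for the Huber tradeoff parameter, and combining the two explicit constraints by adding their losses is an assumption.

```python
def huber(violation, delta):
    """Huber cost of a non-negative violation: quadratic up to delta, linear beyond."""
    if violation <= delta:
        return 0.5 * violation * violation
    return delta * (violation - 0.5 * delta)


def rank_loss(score_pos, score_neg, margin=1.0, delta=1.0):
    """Penalise the relevant frame for not beating the irrelevant one by `margin`."""
    violation = max(0.0, margin - score_pos + score_neg)
    return huber(violation, delta)


def explicit_loss(sim_pos, sim_neg, qual_pos, qual_neg, margin=1.0, delta=1.0):
    """Separate rank losses on similarity and quality, added together (assumption)."""
    return (rank_loss(sim_pos, sim_neg, margin, delta)
            + rank_loss(qual_pos, qual_neg, margin, delta))
```

The linear regime of the Huber cost limits the influence of triplets with very large violations, which helps when some training labels are noisy.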
3.2 Text and Frame Representation
We use a convolutional neural network (CNN) for predicting the frame embedding and the quality score, while the query embedding is obtained via a recurrent neural network. To jointly learn the parameters of these networks, we use a Siamese ranking network, trained with triplets of a query, a relevant frame and an irrelevant frame, where the weights of the subnets processing the two frames are shared. We provide the model architecture in the supplementary material. We now describe the textual representation and the image representation in more detail.
Textual representation. As a feature representation of the textual query , we first project each word of the query into a 300-dimensional semantic space using the word2vec model [mikolov2013distributed], which is trained on the GoogleNews dataset. We fine-tune the word2vec model using the unique queries from the Bing Clickture dataset [hua2013clickture] as sentences. Then, we encode the individual word representations into a single fixed-length embedding using an LSTM [hochreiter1997long]. We use a many-to-one prediction, where the model outputs a fixed-length vector at the final time-step. This allows us to emphasize visually informative words and handle phrases.
Image representation. To represent the image, we leverage the feature representations of a pre-trained VGG-19 network [simonyan2014very]. We replace the softmax layer (1000 nodes) of the VGG-19 network with a linear layer with 301 dimensions. The first 300 dimensions are used as the frame embedding, while the last dimension represents the quality score.
4 Summarization model
We use the framework of submodular optimization to create summaries that take into account multiple objectives [Lin]. In this framework, summarization is posed as the problem of selecting a subset (in our case, of frames) that maximizes a linear combination of submodular objective functions . Specifically,
where denote the set of all possible solutions and the features of video . In this work, we assume that the cardinality is fixed to some value (we use in our experiments).
For non-negative weights , the objective in Eq. (6) is submodular [Krause2011], meaning that it can be optimized near-optimally in an efficient way using a greedy algorithm with lazy evaluations [Nemhauser1978, Minoux1978].
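Written out, the selection problem described above takes the following form. The symbols are assumed here for illustration: $\mathcal{Y}$ is the set of all possible solutions, $x$ the features of the video, $f_i$ the submodular objectives, $w_i$ their weights and $k$ the fixed summary size.

```latex
\[
  y^{*} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y},\; |y| = k}
  \;\sum_{i} w_i \, f_i(x, y),
  \qquad w_i \ge 0 .
\]
```

For monotone submodular objectives under a cardinality constraint, the greedy algorithm is guaranteed to reach at least a $(1 - 1/e)$ fraction of the optimal value [Nemhauser1978].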
Objective functions. We choose a small set of objective functions, each capturing different aspects of the summary.
Query similarity where is the query embedding, is frame embedding and denotes the cosine similarity defined in Eq. (1).
Quality score , where represents score that is based on the quality of as a thumbnail. This model scores the image relevance in a query-independent manner based on properties such as contrast, composition, etc.
Diversity of the elements in the summary, according to some dissimilarity measure. We use the Euclidean distance of the FC2 features of the VGG-19 network as this measure (a derivation of the submodularity of this objective is provided in the supplementary material).
Representativeness [Gygli2015]. This objective favors selecting the medoid frames of a video, such that the visually frequent frames in the video are represented in the summary.
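The greedy maximisation of a weighted mixture of objectives can be sketched as follows. This is a simplified illustration: it uses plain greedy selection without lazy evaluations, and the two toy objectives (a modular relevance sum and a coverage-style stand-in for representativeness) are assumptions made for the example, not the exact objectives listed above.

```python
import math


def greedy_select(n_frames, objectives, weights, k):
    """Greedily pick k frame indices maximising sum_i w_i * f_i(selected)."""
    def total(selection):
        return sum(w * f(selection) for w, f in zip(weights, objectives))

    selected = []
    for _ in range(min(k, n_frames)):
        base = total(selected)
        best, best_gain = None, -math.inf
        for candidate in range(n_frames):
            if candidate in selected:
                continue
            gain = total(selected + [candidate]) - base  # marginal gain
            if gain > best_gain:
                best, best_gain = candidate, gain
        selected.append(best)
    return selected


def make_relevance_objective(relevance_scores):
    """Modular objective: sum of per-frame relevance scores."""
    return lambda selection: sum(relevance_scores[i] for i in selection)


def make_coverage_objective(similarity):
    """Coverage objective: each frame is credited with its best match in the summary."""
    n = len(similarity)
    def coverage(selection):
        if not selection:
            return 0.0
        return sum(max(similarity[j][i] for i in selection) for j in range(n))
    return coverage
```

As a usage example, take four frames at 1-D positions 0.0, 0.1, 5.0 and 5.1 with relevance scores 1.0, 0.9, 0.2 and 0.1 and similarity 1 / (1 + distance). With weights (1, 0) the greedy summary of size two is frames 0 and 1, which are near-duplicates; with weights (1, 1) the coverage term moves the second pick to frame 2.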
Weight learning. To learn the weights in Eq. (6), ground truth summaries for query-video pairs are required. Previous methods typically only optimized for relevance [Liu2015] or used small datasets with limited vocabularies [Sharghi2016]. Thus, to be able to train our model, we collected a new dataset with relevance and diversity annotations, which we introduce in the next section.
If relevance and diversity labels are known, we can estimate the optimal mixing weights of the submodular functions through subgradient descent [Lin]. In order to directly optimize for the F1-score used at test time, we use a locally modular approximation based on the procedure of [narasimhan2012submodular] and optimize the weights using AdaGrad [Duchi2011].
5 Relevance And Diversity Dataset (RAD)
We collected a dataset with query relevance and diversity annotation to let us train and evaluate query-relevant summaries. Our dataset consists of videos, each of which was retrieved given a different query.
Using Amazon Mechanical Turk (AMT), we first annotate the video frames with query relevance labels and then partition the frames into clusters according to visual similarity. This kind of labelling was used previously in the MediaEval diverse social images challenge [Ionescu2015] and enabled the evaluation of automatic methods for creating relevant and diverse summaries.
To select a representative sample of queries and videos for the dataset, we used the following procedure: We take the top YouTube queries between and from different categories as seed queries (https://www.google.com/trends/explore). These queries are typically rather short and generic concepts, so to obtain longer, more realistic queries we use YouTube auto-complete to suggest phrases. Using this approach we collect queries. Some examples are brock lesnar vs big show, taylor swift out of the woods, etc. For each query, we take the top video result with a duration of to minutes.
Table 1: Configuration tests on the QTS dataset (Sec. 6).
|Method||HIT@1 (VG or G)||Spearman corr.||mAP|
|Loss of Liu||68.75||0.186||0.6308|
|Loss of Liu + LSTM||70.62||0.270||0.6507|
|Ours: Huber + LSTM||72.63||0.367||0.6685|
|Ours: Frame quality only||65.95||0.236||0.6315|
|Ours: Huber + LSTM + quality (summed, Eq. 3)||70.76||0.371||0.6657|
|Ours: Huber + LSTM + quality (explicit, Eq. 4)||74.76||0.376||0.6712|
To annotate the videos, we set up two consecutive tasks on AMT. All videos are sampled at one frame per second. In the first task, a worker is asked to label each frame with its relevance w.r.t. the given query. Options for answers are “Very Good”, “Good”, “Not good” and “Trash”, where trash indicates that the frame is both irrelevant and low-quality (e.g. blurred, bad contrast, etc.). After annotating the relevance, the worker is asked to distribute the frames into clusters according to their visual similarity. We obtain one clustering per worker, where each clustering consists of mutually exclusive subsets of video frames as clusters. The number of clusters in the clustering is chosen by the worker. Each video is annotated by five different people and a total of subjects participated in the annotation. To ensure high-quality annotations, we defined a qualification task, where we check the results manually to ensure that the workers provide good annotations. Only workers who pass this test are allowed to take further assignments.
We now analyse the two kinds of annotations obtained through this procedure and describe how we merge these annotations into one set of ground truth labels per video.
Label distributions. The distribution of relevance labels is “Very Good”: , “Good”: , “Not good”: and “Trash”: . The minimum, maximum and mean number of clusters per video are , and respectively over all videos of RAD.
Relevance annotation consistency. Given the inherent subjectivity of the task, we want to know whether annotators agree with each other about the query relevance of frames. To do this, we follow previous work [Isola2011a, GygliICCV13, wang2016event] and compute the Spearman rank correlation (ρ) between the relevance scores of different subjects, splitting the five annotations of each video into two groups of two and three raters each. We take all split combinations to find the mean ρ for a video.
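The split-based consistency measure can be sketched as follows, using only the standard library. Two points are assumptions made here: the scores of the raters within a group are aggregated by their mean, and numeric relevance scores are already available per rater and frame. The correlation is undefined when a group's scores are constant across frames.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def split_consistency(annotations):
    """Mean Spearman correlation over all splits into 2 raters vs. the rest.

    annotations: one list of per-frame scores per rater (five raters here).
    """
    n_raters, n_frames = len(annotations), len(annotations[0])
    rhos = []
    for group in combinations(range(n_raters), 2):
        rest = [r for r in range(n_raters) if r not in group]
        a = [mean(annotations[r][f] for r in group) for f in range(n_frames)]
        b = [mean(annotations[r][f] for r in rest) for f in range(n_frames)]
        rhos.append(spearman(a, b))
    return mean(rhos)
```

With five raters there are ten ways to choose the group of two, so the per-video score is the mean of ten correlations.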
Our dataset has an average correlation of over all videos, where 1 is a perfect correlation while 0 would indicate no consistency in the scores. On the related task of event-specific image importance, using five annotators, consistency is only [wang2016event]. Thus, we can be confident that our relevance labels are of high quality.
Cluster consistency. To the best of our knowledge, we are the first to annotate multiple clusterings per video and look into the consistency of multiple annotators. MediaEval, for example, used multiple relevance labels but only one clustering [Ionescu2015]. Various ways of measuring the consistency of clusterings exist, e.g. Variation of Information, Normalised Mutual Information or the Rand index (See Wagner and Wagner [Wagner2007] for an excellent overview). In the following we propose to use Normalised Mutual Information (NMI), an information theoretic measure [Fred2003] which is the ratio of the mutual information between two clusterings () and the sum of entropies of the clusterings ():
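One common way to write this ratio is given below. The symbols $C_1$ and $C_2$ for the two clusterings are assumed here, and the factor of 2, which makes the measure equal to 1 for identical clusterings, is the usual normalisation rather than something stated explicitly in the text above.

```latex
\[
  \mathrm{NMI}(C_1, C_2) \;=\; \frac{2\, I(C_1, C_2)}{H(C_1) + H(C_2)},
  \qquad
  I(C_1, C_2) \;=\; \sum_{a \in C_1} \sum_{b \in C_2}
    p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)},
  \qquad
  H(C) \;=\; -\sum_{a \in C} p(a) \log p(a),
\]
```

where $p(a)$ is the fraction of frames assigned to cluster $a$ and $p(a, b)$ is the fraction of frames assigned to both $a$ in $C_1$ and $b$ in $C_2$.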
We chose NMI over the more recently proposed Variation of Information (VI) [meilua2003comparing], as NMI has a fixed range ([0, 1]) while still being closely related to VI (see supplementary material).
Our dataset has a cluster consistency of . Since NMI is 0 if two clusterings are independent and 1 iff they are identical, we see that our annotators have a high degree of agreement.
Ground truth. For evaluation on the test videos, we create a single ground truth annotation for each video. We merge the five relevance annotations as well as the clusterings of each query-video pair. For the final ground truth of relevance prediction, we require the labels to be either positive or negative for each video frame. We map all “Very Good” labels to , “Good” labels to and “Not good” and “Trash” labels to . We compute the mean of the five relevance annotation labels and label the frame as positive if the mean is and as negative otherwise.
To merge the clustering annotations, we calculate the NMI between all pairs of clusterings and choose the clustering with the highest mean NMI, i.e. the most prototypical clustering. An example of relevance and clustering annotation is provided in Fig. 6.
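The merging step can be sketched as follows. This is an illustration under assumptions: each clustering is represented as a list of cluster labels, one per frame, NMI uses the normalisation by the sum of entropies with a factor of 2, and ties are broken in favour of the first annotator.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


def nmi(a, b):
    """Normalised mutual information between two label lists over the same frames."""
    denom = entropy(a) + entropy(b)
    if denom == 0.0:  # both clusterings consist of a single cluster
        return 1.0
    return 2.0 * mutual_information(a, b) / denom


def most_prototypical(clusterings):
    """Index of the clustering with the highest mean NMI to all others (needs >= 2)."""
    best_index, best_score = 0, -1.0
    for i, current in enumerate(clusterings):
        scores = [nmi(current, other)
                  for j, other in enumerate(clusterings) if j != i]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Note that NMI is invariant to how clusters are named, so two annotators who produce the same partition with different labels still score 1.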
6 Configuration testing
Before comparing our proposed relevance model against the state of the art in Sec. 7, we first analyze the performance of our model using different objectives, cost functions and text representations. For evaluation, we use the Query-dependent Thumbnail Selection Dataset (QTS) provided by [Liu2015]. The dataset contains candidate thumbnails for each video, each of which is labeled with one of five classes: Very Good (VG), Good (G), Fair (F), Bad (B), or Very Bad (VB). We evaluate on the available query-video pairs. To transform the categorical labels to numerical values, we use the same mapping as [Liu2015].
As evaluation metrics, we use HIT@1 and mean Average Precision (mAP) as reported and defined in Liu et al. [Liu2015], as well as Spearman’s rank correlation. HIT@1 is computed as the hit ratio for the highest ranked thumbnail.
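The two ranking metrics can be sketched for a single video as below. These are the standard textbook definitions, written with hypothetical function names; the exact variants defined in [Liu2015] may differ in details such as tie handling. The dataset-level HIT@1 and mAP are the means of these per-video values.

```python
def hit_at_1(scores, is_positive):
    """1.0 if the highest-scored thumbnail is a positive one, else 0.0."""
    top = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if is_positive[top] else 0.0


def average_precision(scores, is_positive):
    """Mean of the precision values at the rank of each positive thumbnail."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, index in enumerate(order, start=1):
        if is_positive[index]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```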
Training dataset. For training, we use two datasets: (i) the Bing Clickture dataset [hua2013clickture] and (ii) the RAD dataset (Sec. 5). Clickture is a large dataset consisting of queries and retrieved images from Bing Image search. The annotation is in the form of triplets meaning that the image was clicked times in the search results of the query . This dataset is well suited for training our relevance model, since our task is the retrieval of relevant keyframes from a video, given a text query. It is, however, from the image and not the video domain. Thus, we additionally fine-tune the models on the complete RAD dataset consisting of query-video pairs. From each query-video pair, we sample an equal number of positive and negative frames to give equal weight to each video. In total, we use triplets (as in Sec. 3.2) from the Clickture and triplets from the RAD for training.
Implementation details. We preprocess the images as in [simonyan2014very]. We truncate the number of words in the query at , as a tradeoff between the mean and maximum query length in the Clickture dataset ( and respectively) [mueller2016siamese]. We set the margin parameter in the loss in Eq. (5) to 1 and the tradeoff parameter for the Huber loss to as in [Gygli2016]. The LSTM consists of a hidden layer with units. We train the parameters of the LSTM and projection layer using stochastic gradient descent with adaptive weight updates (AdaGrad) [Duchi2011]. We add an penalty on the weights, with a of . We train for epochs using minibatches of triplets.
6.1 Tested components
We discuss three important components of our model next.
Objective. We compare our proposed training objective to that of Liu [Liu2015]. Their model is trained to rank a positive query higher than a negative query given a fixed frame. In contrast, our method is trained to rank a positive frame higher than a negative frame given a fixed query.
Cost function. We also investigate the importance of modeling frame quality. In particular, we compare different cost functions: (i) we enforce two ranking constraints, one for the quality term and one for the embedding similarity, as in Eq. (4); (ii) we sum the quality and similarity terms into one output score, for which we enforce the rank constraint, as in Eq. (3); or (iii) we do not model quality at all.
Text representation. As mentioned in Sec. 3.2, we represent the words of the query using word vectors. To combine the individual word representations into a single vector, we investigate two approaches: (i) averaging the word embedding vectors and (ii) using an LSTM model that learns to combine the individual word embeddings.
We show the results of our detailed experiments in Tab. 1. They give insights on several important points.
Text representation. Modeling queries with an LSTM, rather than averaging the individual word representations, improves performance significantly. This is not surprising, as this model can learn to ignore words that are not visually informative (e.g. 2014).
|Method||HIT@1 (VG)||HIT@1 (VG or G)||Spear.||mAP|
Objective and cost function. The analysis shows that training with our objective leads to better performance compared to using the objective of Liu et al. [Liu2015]. This can be explained by the properties of videos, which typically contain many frames that are low-quality or not visually informative [song2016click]. Thus, formulating the thumbnail task in a way that lets the model learn about these quality aspects is beneficial. Using the appropriate triplets for training boosts performance substantially (Spearman correlation with the loss of Liu et al. [Liu2015] + LSTM: 0.270, Ours: Huber + LSTM: 0.367). When including a quality term in the model, performance improves further, where an explicit loss performs slightly better (Ours: Huber + LSTM + explicit quality in Tab. 1).
Somewhat surprisingly, modeling quality alone already outperforms Liu [Liu2015] in terms of mAP, despite not using any textual information. Quality adds a significant boost to performance in the video domain. Interestingly, this is different in the image domain, due to the difference in quality statistics. Images returned by a search engine are mostly of good quality, thus explicitly accounting for it does not improve performance (see supplementary material).
|No textual input|
|Ours: Frame quality||69.0||0.135||0.749|
|Liu [Liu2015] +LSTM||70.0||0.134||0.731|
|Liu [Liu2015] +LSTM||72.0||0.204||0.730|
To conclude, we see that the better alignment of the objective to the keyframe retrieval task, the addition of an LSTM and modeling the quality of the thumbnails all improve performance. Together, they provide a substantial improvement compared to the model of Liu et al. Our method achieves an absolute improvement of 6.01% in HIT@1, 4.04% in mAP, and an improvement in correlation from 0.186 to 0.376 (Tab. 1). These gains are even more significant when we consider the possible ranges of these metrics. For example, for Spearman correlation, human agreement is at on the RAD dataset (Sec. 5.1), thus providing an upper bound. Similarly, HIT@1 and mAP have small effective ranges given their high scores for a random model.
7 Experiments
In the previous section, we determined that the combination of our objective, embedding queries with an LSTM and explicitly modelling quality performs best. We call this model QAR (Quality-Aware Relevance) in the following and compare it against state-of-the-art (s-o-a) models on the QTS and RAD datasets. We also evaluate the full summarization model on RAD. For these experiments, we split RAD into videos for training, for validation and for testing.
Evaluation metrics. For relevance we use the same metrics as in Sec. 6. To evaluate video summaries on RAD, we additionally use F1 scores. The F1 score is the harmonic mean of the precision of relevance prediction and cluster recall [Ionescu2015]. It is high if a method selects relevant frames from diverse clusters.
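The summary F1 score can be sketched as follows. The data layout and function name are hypothetical, and one detail is an assumption: cluster recall is computed here as the fraction of all ground truth clusters that contain at least one selected frame.

```python
def summary_f1(selected, relevant, cluster_of):
    """Harmonic mean of relevance precision and cluster recall.

    selected:   list of selected frame ids
    relevant:   set of frame ids labelled as relevant
    cluster_of: dict mapping every frame id to its ground truth cluster id
    """
    if not selected:
        return 0.0
    precision = sum(1 for f in selected if f in relevant) / len(selected)
    covered = {cluster_of[f] for f in selected}
    cluster_recall = len(covered) / len(set(cluster_of.values()))
    if precision + cluster_recall == 0.0:
        return 0.0
    return 2.0 * precision * cluster_recall / (precision + cluster_recall)
```

A summary made of near-duplicate relevant frames gets a high precision but a low cluster recall, which the harmonic mean penalises.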
7.1 Evaluating the Relevance Model
We evaluate our model (QAR) and compare it to Liu [Liu2015] and Video2GIF [Gygli2016].
Query-dependent Thumbnail Selection Dataset (QTS) [Liu2015]
We compare against the s-o-a on the QTS evaluation dataset in Tab. 2. We report the performance of Liu [Liu2015] from their paper. Note, however, that the results are not directly comparable, as they use query-video pairs for predicting relevance, while only the titles are shared publicly. Thus, we use the titles instead, which is an important difference. Relevance is annotated with respect to the queries, which often differ from the video titles. We compare the re-implementation of [Liu2015] using titles in detail in Tab. 1.
Encouragingly, our model performs well even when just using the titles and outperforms them on most metrics. It improves mAP by % over [Liu2015] and correlation by a margin of 0.254 (Table 2). Figure 3 shows the precision-recall curve for the experiment. As can be seen QAR outperforms [Liu2015] for all recall ratios. To better understand the effects of using titles or queries, we quantify the value of the two on the RAD dataset below.
Our dataset (RAD). We also evaluate our model on the RAD test set (Tab. 3). QAR (ours) significantly outperforms the previous s-o-a of [Liu2015, Gygli2016], even when augmenting Liu et al. [Liu2015] with an LSTM. QAR improves mAP by % when using Titles and % when using Queries over our implementation of Liu et al. [Liu2015] + LSTM.
We also see that modeling quality leads to significant gains in terms of mAP when using Titles or Queries ( in both cases). HIT@1 for query relevance, however, is lower when including quality. We believe that the reason for this is that when the query is given, the textual-visual similarity is a more reliable signal to determine the single best keyframe. While including quality improves the overall ranking on mAP, it is solely based on appearance and thus seems to inhibit the fine-grained ranking results at low recall (Fig. 4). However, when only the title is used, the frame quality becomes a stronger predictor for thumbnail selection and improves performance on all metrics. We present some qualitative results of different methods for relevance prediction in Fig. 5.
7.2 Evaluating the Summarization Model
As mentioned in Sec. 4, we use four objectives for our summarization model. Referring to Tab. 4, we use the QAR model to obtain the Similarity and Quality scores, while the Diversity and Representativeness scores are obtained as described in Sec. 4. We compare the performance of our full model with each individual objective, a baseline based on Maximal Marginal Relevance (MMR) [carbonell1998use] and Hecate [song2016click]. MMR greedily builds a set that maximises the weighted sum of two terms: (i) the similarity of the selected elements to a query and (ii) the dissimilarity to previously selected elements. To estimate the similarity to the query we use our own model (QAR without the quality term) and for dissimilarity the diversity as defined in Sec. 4. Finally, we compare it to Hecate, recently introduced in [song2016click]. Hecate estimates frame quality using the stillness of the frame and selects representative and diverse thumbnails by clustering the video with k-means and selecting the highest quality frame from the k largest clusters.
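The MMR baseline described above can be sketched as follows. This is an illustration, not the evaluated baseline: the function name and the `trade_off` parameter are hypothetical, and measuring the dissimilarity to the current summary by the distance to its closest element is an assumption.

```python
def mmr_select(query_sim, dissim, k, trade_off=0.5):
    """Greedy Maximal Marginal Relevance selection of k elements.

    query_sim[i]: similarity of element i to the query
    dissim[i][j]: dissimilarity between elements i and j
    trade_off:    weight of the query similarity; (1 - trade_off) weights novelty
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(c):
            # Novelty: distance to the closest already selected element.
            novelty = min((dissim[c][s] for s in selected), default=0.0)
            return trade_off * query_sim[c] + (1.0 - trade_off) * novelty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `trade_off=1.0` the procedure reduces to picking the k elements most similar to the query, so the parameter controls how strongly near-duplicates are suppressed.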
Results. Quantitative results are shown in Tab. 4, while Fig. 6 shows qualitative results. As can be seen, combining all objectives with our model works best. It outperforms all single objectives, as well as the MMR [carbonell1998use] baseline, even though MMR also uses our well-performing similarity estimation. Similarity alone has the highest precision, but tends to pick frames that are visually similar (Fig. 6), thus resulting in low cluster recall. Diversification objectives (diversity and representativeness) have a high cluster recall, but the frames are less relevant. Somewhat surprisingly, Hecate [song2016click] is a relatively strong baseline. In particular, it performs well in terms of relevance, despite using a simple quality score. This further highlights the importance of quality for the thumbnail selection task. It also indicates that the VGG-19 architecture we use might be suboptimal for predicting quality. CNNs for classification use small input resolutions, which makes it difficult to predict quality aspects such as blur. Finding better architectures for that task is an active research topic, e.g. [lu2015deep, mai2016composition], and such architectures might be used to improve our method.
When analysing the learned weights (Tab. 4), we find that the similarity prediction is the most important objective, which matches our expectations. Quality gets a lower, but non-zero weight, thus showing that it provides information that is complementary to query similarity and helps to predict the relevance of a frame. The reader should however be aware that differences in the variance of the objectives can affect the learned weights. Thus, they should be taken with a grain of salt and only be considered tendencies.
8 Conclusion
We introduced a new method for query-adaptive video summarization. At its core lies a textual-visual embedding, which lets us select frames relevant to a query. In contrast to earlier works, such as [zeng2016semantic, Sharghi2016], this model allows us to handle unconstrained queries and even full sentences. We proposed and empirically evaluated several improvements over [Liu2015] for learning a relevance model. Our empirical evaluation showed that a better training objective, a more sophisticated text model, and explicitly modelling quality lead to significant performance gains. In particular, we showed that quality plays an important role in the absence of high-quality relevance information, such as queries, i.e. when only the title can be used. Finally, we introduced a new dataset for thumbnail selection which comes with query-relevance labels and a grouping of the frames according to visual and semantic similarity. On this data, we tested our full summarization framework and showed that it compares favourably to strong baselines such as MMR [carbonell1998use] and Hecate [song2016click]. We hope that our new dataset will spur further research in query-adaptive video summarization.
We thank Achanta Radhakrishna, Radu Timofte and Prof. Sabine Susstrunk for their insightful comments.