PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation

Highlight detection models are typically trained to identify cues that make visual content appealing or interesting for the general public, with the objective of reducing a video to such moments. However, the "interestingness" of a video segment or image is subjective. Thus, such highlight models provide results of limited relevance for the individual user. On the other hand, training one model per user is inefficient and requires large amounts of personal information which is typically not available. To overcome these limitations, we present a global ranking model which conditions on each particular user's interests. Rather than training one model per user, our model is personalized via its inputs, which allows it to effectively adapt its predictions, given only a few user-specific examples. To train this model, we create a large-scale dataset of users and the GIFs they created, giving us an accurate indication of their interests. Our experiments show that using the user history substantially improves the prediction accuracy. On our test set of 850 videos, our model improves the recall by 8% with respect to generic highlight detectors. Furthermore, our method proves more precise than the user-agnostic baselines even with just one person-specific example.




1. Introduction

With the increasing availability of camera devices, more and more video is recorded and shared. In order to share these videos, however, they typically have to be edited to remove boring and redundant content and present the most interesting parts only. It’s no coincidence that animated GIFs had a revival in the past years, as they make exactly this promise: that the video is reduced to the single most interesting moment (bakhshi2016fast). Most users resort to online tools such as giphy or ezgif to create their GIFs manually, but video editing is usually a tedious and time-consuming task. Recently, the research community has taken growing interest in automating the editing process (arev2014automatic; GygliSum15; gygli2016video2gif; LeePred15; lin2015summarizing; PotapovSum; SunRank; yaohighlight; zhang2016summary; zhang2016video; ZhaoSum; yang2015unsupervised). These existing methods, however, share a common limitation, as they all learn a generic highlight detection or summarization model. This limits their potential performance, as not all users share the same interests (soleymani2015quest) and thus edit videos in different ways (AVS). One user may edit basketball videos to extract the slams, while another may just want to see the team’s mascot jumping. A third may prefer to see the kiss cam segments of the game. An automatic method should therefore adapt its results to specific users, as exemplified in Figure 1.

Figure 1. The notion of a highlight in a video is, to some extent, subjective. While previous methods trained generic highlight detection models, our method takes a user’s previously selected highlights into account when making predictions. This reduces the ambiguity of the task and results in more accurate predictions.

In this work, we address the limitation of generic highlight detection and propose a model that explicitly takes a user’s interests into account. Our model builds on the success of deep ranking models for highlight detection (gygli2016video2gif; yaohighlight), but makes the crucial enhancement of making highlight detection personalized. Thereby, our model uses information on the GIFs a user previously created. This allows the model to make accurate user-specific predictions, as a user’s GIF history allows for a fine-grained understanding of their interests and thus provides a strong signal for personalization. This stands in contrast to relying on demographic data such as age or gender or interest in certain topics only. Knowing that a user is interested in basketball, for example, is not sufficient. A high-performing highlight detection model needs to have knowledge of what parts of basketball videos the user likes. Thus, to obtain a high-performing model, we need to collect information on a user’s interest in certain objects or events (babaguchi2007learning; GygliInt13) and use that information for highlight detection.

To obtain that kind of data, we turn to a web video editor and its user base and collect a novel and large-scale dataset of users and the GIFs they created. On this data, we train several models for highlight detection that condition on the user history. Our experiments show that using the history allows making significantly more accurate predictions compared to generic highlight detection models, relatively improving upon the previous state of the art for automatic GIF creation (gygli2016video2gif) by 4.3% in nMSD and 8% in recall.

To summarize, we make the following contributions:

  • A new large-scale dataset with personalized highlight information. It consists of users with annotations on videos. To the best of our knowledge, this is the first dataset with personalized highlight information as well as the biggest highlight detection dataset in general. We make the dataset publicly available.

  • A novel model for personalized highlight detection (PHD). Our model’s predictions are conditioned on a specific user by providing their previously chosen highlight segments as inputs to the model. This allows us to use all available annotations by training a single high-capacity model for all users jointly, while making personalized predictions at test time.

  • An extensive experimental analysis and comparison to generic highlight detection approaches. Our qualitative analysis finds that users often have high consistency in the content they select. Empirically, we show that our model can effectively use this history as a signal for highlight detection. Our experiments further show the benefits of our approach over existing personalization methods: our model improves over generic highlight detection, even when only one user specific example is available, and outperforms the baseline of training one model per user.

2. Related Work

Our work aims to predict what video segments a user is most interested in, using visual features alone. It is thus a content-based recommender system (ricci2011introduction) for video segments, similar to highlight detection and personalized video summarization. Our method further relates to collaborative filtering. In the following, we discuss the most relevant and recent works in these areas. For an excellent overview and review of earlier video summarization and highlight detection techniques, we refer the reader to (truong2007video).

Personalized video summarization.

Early approaches in summarization cannot be personalized, as they are based on heuristics such as the occurrence of certain events, e.g. the scoring of a goal (truong2007video). Exceptions are (jaimes2002learning; agnihotri2005framework; babaguchi2007learning; takahashi2007user), which build a user profile and use it for personalization. Most notably, Jaimes et al. (jaimes2002learning) also learn user-specific models directly from highlight annotation of a particular user. All these methods, however, rely on annotated meta-data, rather than using only audio-visual inputs.

In the last years, methods using supervised learning on audio-visual inputs have become increasingly popular (ChuCosum; gong2014diverse; GygliSum15; LeeDisco12; ma2005generic; plummer2017enhancing; xu2015gaze; zhang2016summary; zhang2016video). These methods learn (parts of) a summarization model from annotated training examples. Thus, they can be personalized by training on annotation coming from a single user, similar to (jaimes2002learning). While that approach works in principle, it has two important practical issues. (i) Computational cost. Having a model per user is often infeasible in practice, due to the cost of training and storing models. (ii) Limited data. Typically, only a small number of examples per user are available. This limits the class of possible methods to simple models that can be trained from a handful of examples. In contrast to that, we train a global model that is personalized via its inputs, by conditioning on the user history. This allows training more complex models by learning from all users jointly. Thus, the proposed approach is able to perform well even for users that have not been seen in training and that have no examples to train with (cold start problem). Furthermore, as the user information is an input and is not embedded into the model parameters, our method does not need retraining as new user information arrives.

An alternative way to personalize summarization models is by analyzing the user behavior at recording (arev2014automatic; VariniPref) or visualization time (peng2011editing; zen2016mouse), or requiring user input at inference time, either through specifying a text query (liu2015multi; sharghi2016query; vasudevan2017query; yang2003videoqa) or via an interactive approach (AVS; singla2016noisy). In the interactive approaches, the user gives feedback on individual proposals (AVS) or pairwise preferences (singla2016noisy), which is then used to present a refined summary. Instead, we do not require the user to know the full content of the video, nor require any input, such as user feedback, at test time: our model uses the user’s history as the signal for personalization.

Highlight detection methods.

The goal of highlight detection is to find the most interesting events of a video. In contrast to traditional video summarization approaches it does not aim to give an overview of the video, but rather just to extract the best moments (truong2007video). Recent methods for that task have used a ranking formulation, where the goal is to score interesting segments higher than non-interesting ones (gygli2016video2gif; SunRank; yaohighlight; jiao2017video; yu2018deep). While (SunRank) used a ranking SVM model, (gygli2016video2gif; yaohighlight; jiao2017video; yu2018deep) trained a deep neural network using a ranking loss. Our work is similar to these approaches, in particular (gygli2016video2gif), which also proposes a highlight detection model trained for GIF creation. But while they train a generic model, we use the user history to make personalized predictions. (soleymani2015quest) also predicts personalized interestingness, but does so on images and by training a separate model per user. It thus suffers from the same issues as (jaimes2002learning) and other existing supervised summarization methods that train one separate model per user. Ren et al. (ren2017personalized) improve upon these methods by proposing a generic regression model which is personalized with a second, simpler model. The second model predicts the residual of the generic model for a specific user. Thus, like our approach, this method can also handle users with no history, but it still requires (re-)training a model for each new user.

Collaborative Filtering.

In collaborative filtering (CF) (koren2009matrix), interactions of users with items (e.g. movie ratings) are used to learn user and item representations that accurately predict these and new interactions, e.g. a user’s rating for a movie. CF has shown strong performance, for example in the Netflix challenge (bell2007lessons), and is used for video recommendation at YouTube (covington2016deep). While powerful, CF techniques cannot be easily applied to highlight detection, as that would require several interactions with the same video segment. We find that in our data, few users create GIFs from the same video, let alone the same segment. This prevents learning a model from interaction data alone.

3. Dataset

Figure 2. The dataset in numbers: distribution of the number of (a) videos per user, and (b) GIFs per user.

In order to be able to do personalized highlight detection, one key challenge is obtaining a training set that provides useful user information. Different kinds of user information are possible, e.g. meta-data on the user’s age, gender, geographic location, what web editor was used and so on. For our dataset, instead, we directly collect information on what video segments a specific user considers a highlight. Having this kind of data allows for strong personalization models, as specific examples of what a user is interested in help the model obtain a fine-grained understanding of that specific user. This stands in contrast to knowing demographic data, which would only allow customizing models based on loose indicators of interest such as the gender or location. Our idea of using a web video editor as a data source is similar to (gygli2016video2gif; SunRank), but we additionally associate each GIF with a specific user, which allows for personalization.

3.1. Data source

To obtain personalized highlight data, we turned to a video editor for the web that has a large base of registered users. When a user creates a GIF, e.g. by extracting a key moment from a YouTube video, that GIF is linked to the user. This allows us to query for profiles of users who have created several GIFs, i.e. profiles that contain a history describing the user’s interests. To have a reasonably sized sample of the users of interest, we restricted the selection to users that created GIFs from a minimum of five videos, where the last video is used for prediction, while the remaining ones serve as the history. Thus, in our dataset, each user has a history of at least four videos.
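The selection rule above can be written down in a few lines. This is a minimal sketch, assuming each user is represented by a chronologically ordered list of videos; the function name `split_user_videos` and that data layout are hypothetical and not part of the released dataset tooling.

```python
def split_user_videos(videos, min_videos=5):
    """Split one user's videos into (history, test_video).

    `videos` is assumed to be ordered from oldest to newest. Users with
    fewer than `min_videos` videos are discarded (None is returned).
    """
    if len(videos) < min_videos:
        return None
    # The last video is held out for prediction, the rest form the history.
    return videos[:-1], videos[-1]


example = split_user_videos(["v1", "v2", "v3", "v4", "v5"])
```

With the minimum of five videos, every retained user ends up with a history of at least four videos, as stated in the text.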

(a) Examples from a user consistently selecting GIFs of soccer players (202 GIFs). His interests differ from the majority of users, which consider goal scenes the most interesting (truong2007video).
(b) Examples from a user consistently selecting GIFs of funny or cute pets (446 GIFs)
(c) Examples from a user with GIFs with interests in several categories like sports, funny animals and people (21 GIFs).
Figure 3. Example user histories (subsampled)
Figure 4. Ten most selected moments. The most popular moments in our dataset often show cats, music videos such as k-pop or famous movie scenes.

3.2. Analysis

In total, the dataset contains annotations on YouTube videos from the users that fulfilled our conditions. This is a significant leap with respect to other popular datasets such as the YouTube video highlight dataset (SunRank) and the Video2GIF dataset (gygli2016video2gif), which includes annotations in the form of GIFs.

Out of these users, a subset was selected to form the test set (more details on the use of the dataset are given in Section 5.3). The selection was done such that the test videos (i.e. the last video the user created a GIF from) are neither too short nor too long, to avoid overly simple scenarios as well as to prevent extremely sparse labels suffering from chronological bias (song2015tvsum). The distributions for the number of videos and GIFs per user in the full dataset are shown in Figure 2. Note that a user may generate more than one GIF from the same video, and thus the total number of GIFs is greater than the number of videos.

In Figure 3 we show examples of user histories. When analyzing users we find that most have a clear focus, e.g. they mostly or even exclusively create GIFs of funny pets. In some cases, users also have multiple interests (c.f. Figure 3(c)) and some have a clear focus with one or two outliers that show a different type of content. On the other hand, the most popular moments (most selected video segments in our dataset) show higher diversity. Their contents range from scenes with pets to interviews, cartoons, music videos and scenes of famous movies (see Figure 4). Given the high diversity of the dataset, and the consistent interests of specific users, we hypothesize that the user’s history provides a reliable signal for predicting what GIFs they will create in the future.

4. Method

In the following, we introduce our approach for highlight detection, which uses information about a user when making predictions. In particular, we propose a model that predicts the score of a segment as a function of both the segment itself and the user’s previously selected highlights. As such, the model learns to take into account the user history to make accurate personalized predictions.

We define $V$ as the video from which a user wants to generate a GIF, and $s \in V$ as the segments that form it. For our method, we use a ranking approach (joachims2002optimizing), where a model is trained to score positive video segments, $s^+$, higher than negative segments, $s^-$, from the same video. Thereby a segment is a positive if it was part of the user’s GIF and a negative otherwise, as in (gygli2016video2gif). In contrast to previous works (SunRank; gygli2016video2gif; yaohighlight), however, we do not make the predictions based on the segment alone, but also take a user’s previously chosen highlights, their history, into account. Thus, our objective is

$$h(s^+, \mathcal{G}) > h(s^-, \mathcal{G}), \quad \forall (s^+, s^-) \in V, \tag{1}$$

where $s^+$, $s^-$ are positive and negative segments coming from the same video $V$ and $h(s, \mathcal{G})$ is the score assigned to segment $s$. $\mathcal{G}$ denotes all the GIFs that the user previously generated, i.e. the user’s history. Our formulation thus allows the model to personalize its predictions by conditioning on the user’s previously selected highlights.
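To make the ranking objective concrete, the sketch below computes a pairwise loss over all positive/negative segment pairs of one video. The hinge form, the margin of 1 and the name `pairwise_ranking_loss` are assumptions for illustration; the text only requires that positives score higher than negatives and does not state which ranking loss is used.

```python
def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over all (positive, negative) pairs of one video.

    A pair contributes zero loss once the positive segment scores at
    least `margin` higher than the negative one. Both the hinge form
    and the margin are illustrative assumptions.
    """
    losses = [
        max(0.0, margin - (p - n))
        for p in pos_scores
        for n in neg_scores
    ]
    return sum(losses) / len(losses)


# Scores would come from the personalized model, i.e. they already
# depend on the user's history.
loss = pairwise_ranking_loss([1.0, 3.0], [0.5])
```

Because both scores in a pair are produced with the same history as input, minimizing such a loss pushes the model to order segments the way that particular user would.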

While there are several ways to do personalization, making the user history an input to the model has the advantage that a single model is sufficient and that the model can use all annotations from all users in training. A single model can predict personalized highlights for all users and new user information can trivially be included. Previous methods instead embedded the personal preferences into the model weights (soleymani2015quest; ren2017personalized), which requires training one model per user and retraining to accommodate the new information.

We propose two models for the scoring function $h(s, \mathcal{G})$ of a segment $s$ given the user history $\mathcal{G}$, which are combined with late fusion. One takes the segment representation and aggregated history as input (PHD-CA), while the second uses the distances between the segments and the history (SVM-D). Next, we discuss these two architectures in more detail. In all models we represent the segments and the history elements using C3D (tran2015learning) (conv5 layer). We denote these vector representations $\mathbf{s}$ and $\mathbf{g}_i$, respectively.

4.1. Model with aggregated history

We propose to use a feed-forward neural network (FNN) similar to (gygli2016video2gif; yaohighlight), but with the history as an additional input. More specifically, we average the history representations $\mathbf{g}_i$ across examples to obtain $\bar{\mathbf{g}}$. The segment representation $\mathbf{s}$ and the aggregated history $\bar{\mathbf{g}}$ are then concatenated and used as input to the model:

$$h_{FNN}(s, \mathcal{G}) = \mathrm{FNN}([\mathbf{s}, \bar{\mathbf{g}}]). \tag{2}$$

As a model, we used a small neural network with 2 hidden layers of 512 and 64 neurons. While different aggregation methods are possible, we found averaging the history to work well in practice. We also tried alternative ways to aggregate, such as learning the aggregation with a sequence model (LSTM), but found this to lead to inferior performance (see Section 5).
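The forward pass of the aggregated-history model can be sketched in plain Python. This is only an illustration under stated assumptions: the ReLU activations, the random weights standing in for trained parameters, the toy feature size `dim` and all function names are hypothetical; only the averaging, the concatenation and the hidden layer sizes (512 and 64) come from the text.

```python
import random


def average_history(history):
    """Element-wise mean of the history feature vectors."""
    n = len(history)
    return [sum(column) / n for column in zip(*history)]


def dense(x, weights, bias, relu=True):
    """Fully connected layer; `weights` holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Small random weights, standing in for trained parameters."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def phd_ca_score(segment, history, layers):
    """Score a segment from the concatenation [segment, mean(history)]."""
    x = segment + average_history(history)  # list concatenation
    for i, (weights, bias) in enumerate(layers):
        # No activation on the final scoring layer.
        x = dense(x, weights, bias, relu=i < len(layers) - 1)
    return x[0]


rng = random.Random(0)
dim = 8  # toy feature size; real C3D features are much larger
layers = [init_layer(2 * dim, 512, rng),
          init_layer(512, 64, rng),
          init_layer(64, 1, rng)]
segment = [rng.random() for _ in range(dim)]
history = [[rng.random() for _ in range(dim)] for _ in range(4)]
score = phd_ca_score(segment, history, layers)
```

Since the history only enters through the input vector, the same weights serve every user, and a new GIF in a user's history changes the prediction without any retraining.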

4.2. Distance-based model

The assumption behind using a model of the form $h(s, \mathcal{G})$ is that the score of a segment depends on the similarity of the segment to a user’s history. Thus, we investigated explicitly encoding that assumption into the model. Specifically, we create a feature vector that contains the cosine distances between the segment representation $\mathbf{s}$ and the most similar history elements $\mathbf{g}_i$. We denote this feature vector $\mathbf{d}$. Using this representation we train a linear ranking model (ranking SVM (lee2014large)) to predict the score of a segment, i.e.

$$h_{SVM}(s, \mathcal{G}) = \mathbf{w}^\top \mathbf{d} + b, \tag{3}$$

where $\mathbf{w}$, $b$ are the learned weights and bias. While the distance features could directly be provided to the model introduced in Section 4.1, we find that training two separate models and combining them with late fusion leads to improved performance (c.f. Table 2). This is in line with previous approaches that found this method to be superior to fusing different modalities in a single neural network (simonyan2014two; carreira2017quo).
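The distance features and the linear scoring step can be illustrated as follows. This is a sketch under assumptions: the number of nearest history elements `k` is a free parameter here (the value used by the authors is not given in this text), the weights are placeholders rather than a trained ranking SVM, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def distance_features(segment, history, k):
    """Cosine distances to the k most similar history elements, ascending."""
    distances = sorted(cosine_distance(segment, g) for g in history)
    return distances[:k]


def linear_score(features, weights, bias):
    """Linear model w^T d + b on the distance features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


history = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
features = distance_features([1.0, 0.0], history, k=2)
# Negative placeholder weights: smaller distances give higher scores.
score = linear_score(features, weights=[-1.0, -0.5], bias=0.1)
```

Sorting the distances gives the feature vector a fixed meaning per position (closest element first), which is what lets a single linear model be shared across users with different histories. Histories shorter than `k` would need padding, which this sketch does not handle.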

4.3. Model fusion

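We propose to combine the models introduced in Sections 4.1 and 4.2 with late fusion. As the models differ in the range of their predictions and their performance, we apply a weight $\omega$ for the model ensemble. To be concrete, the final prediction is computed as

$$h(s, \mathcal{G}) = h_{FNN}(s, \mathcal{G}) + \omega \, h_{SVM}(s, \mathcal{G}), \tag{4}$$

where $h_{FNN}$ and $h_{SVM}$ are the predictions of the models from Sections 4.1 and 4.2, respectively, and $\omega$ is learned with a ranking SVM on the videos of a held out validation set.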
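Late fusion of the two models amounts to a weighted sum of their per-segment scores. The sketch below assumes the weight is applied to the distance-model score and uses a fixed illustrative value for it; in the paper the weight is learned on a validation set. The function names are hypothetical.

```python
def fuse_scores(fnn_scores, svm_scores, omega):
    """Late fusion: network score plus weighted distance-model score."""
    return [f + omega * s for f, s in zip(fnn_scores, svm_scores)]


def rank_segments(scores):
    """Segment indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


fnn_scores = [0.2, 0.9, 0.5]    # one score per segment (toy values)
svm_scores = [1.0, -1.0, 0.5]
fused = fuse_scores(fnn_scores, svm_scores, omega=0.5)
ranking = rank_segments(fused)
```

In this toy example the distance model demotes the segment that the network alone would have ranked first, which is the kind of re-ordering at the top of the ranking that fusion is meant to provide.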

Figure 5. Model architectures. We show our proposed model (bold) and alternative ways to encode the history and fuse predictions (see section 5.2).

5. Experiments

We evaluate the proposed method, called PHD-CA + SVM-D, on the dataset introduced in Section 3. We start by comparing it against the state of the art for non-personalized highlight detection, as well as several personalization baselines in Section 5.1. Then, Section 5.2 analyses variations of our method and quantifies the contribution of the different inputs and architectural choices.

Evaluation metrics.

We follow (gygli2016video2gif) and report mean Average Precision (mAP) and normalized Meaningful Summary Duration (nMSD), which rates how much of the video has to be watched before the majority of the ground truth selection was shown, if the shots in the video had been re-arranged to match the predicted ranking order. In addition, we report Recall@5 (R@5), i.e. the ratio of frames from the user-generated GIFs (the ground truth) that are included in the 5 highest ranked GIFs.
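Two of these metrics are simple enough to sketch directly. The code below is an illustration, not the authors' evaluation script: it assumes segments are given as frame ranges with an exclusive end, treats a segment as a binary positive or negative for average precision, and omits nMSD. Function names are hypothetical.

```python
def recall_at_k(segments, scores, gt_frames, k=5):
    """Fraction of ground-truth frames covered by the k top-scored segments.

    `segments` are (start, end) frame ranges, with `end` exclusive.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    covered = set()
    for i in order[:k]:
        start, end = segments[i]
        covered.update(range(start, end))
    gt = set(gt_frames)
    return len(covered & gt) / len(gt)


def average_precision(labels, scores):
    """Average precision of a ranking; `labels` are 1 for highlight segments."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


segments = [(0, 10), (10, 20), (20, 30)]
scores = [0.1, 0.9, 0.5]
recall_top1 = recall_at_k(segments, scores, range(15, 25), k=1)
ap = average_precision([1, 0, 1], [0.9, 0.8, 0.7])
```

Recall@5 only looks at the top of the ranking, whereas average precision is affected by the position of every positive segment; mAP is the mean of the per-video average precision.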

Model                            mAP      nMSD     R@5      Notes
Random                           12.97%   50.60%   21.38%

Video2GIF (gygli2016video2gif)   15.69%   42.59%   27.28%   Trained on (gygli2016video2gif)
Highlight SVM                    14.47%   45.55%   26.13%
Video2GIF (ours)                 15.86%   42.06%   28.42%

Max Similarity                   15.49%   44.22%   26.44%   unsupervised
V-MMR                            14.86%   43.72%   28.22%   unsupervised
Residual                         14.89%   47.07%   26.05%
SVM-D                            15.64%   43.49%   28.01%
Ours (CA + SVM-D)                16.68%   40.26%   30.71%

Table 1. State-of-the-art comparison (videos segmented into shots of fixed length). For mAP and R@5, higher is better; for nMSD, lower is better.

5.1. Baseline comparison

We compare our method against several strong baselines:

Video2GIF (gygli2016video2gif). This work is the state of the art for automatic highlight detection for GIF creation. We evaluate the pre-trained model which is publicly available. As the model is trained on a different dataset we additionally provide results for a slight variation of (gygli2016video2gif), trained on our dataset, which we refer to as Video2GIF (ours).

Highlight SVM. This model is a ranking SVM (lee2014large) trained to correctly rank positive and negative segments as per Eq. (1), but only using the segment’s descriptor and ignoring the user history.

Maximal similarity. This baseline scores segments according to their maximum similarity with the elements in the user history $\mathcal{G}$. We use the cosine similarity as a similarity measure.

Video-MMR. Following the approach presented in (li2010multi), the user history $\mathcal{G}$ is used as query so that the segments that are most similar to it are scored highly. Specifically, we use the mean cosine similarity to the history elements as an estimate of the relevance of a segment.

Residual Model. Inspired by (ren2017personalized), we include a residual model for ranking. (ren2017personalized) proposes a generic regression model and a second user-specific model that personalizes predictions by fitting the residual error of the generic model. To adapt this idea to the ranking setting, we propose training a user-specific ranking SVM that gets the generic predictions from Video2GIF (ours) as an input, in addition to the segment representation $\mathbf{s}$. Thus, a user’s model is defined as

$$h_{res}(s, \mathcal{G}) = \mathbf{w}_{\mathcal{G}}^\top [\mathbf{s}, h_{V2G}(s)], \tag{5}$$

where $h_{V2G}(s)$ is the generic prediction of Video2GIF (ours) and $\mathbf{w}_{\mathcal{G}}$ are the weights learned from the history $\mathcal{G}$.

Ranking SVM on the distances. This model corresponds to the model presented in Section 4.2.
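The two unsupervised baselines in the list above (maximal similarity and Video-MMR) require no training and can be sketched in a few lines. The function names are hypothetical, and the Video-MMR variant shown is only the mean-similarity relevance score described in the text, not the full method of (li2010multi).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def max_similarity_score(segment, history):
    """Score of the 'Maximal similarity' baseline."""
    return max(cosine_similarity(segment, g) for g in history)


def mean_similarity_score(segment, history):
    """Relevance estimate used for the Video-MMR baseline."""
    sims = [cosine_similarity(segment, g) for g in history]
    return sum(sims) / len(sims)


history = [[1.0, 0.0], [0.0, 1.0]]
best = max_similarity_score([1.0, 0.0], history)
mean = mean_similarity_score([1.0, 0.0], history)
```

The maximum rewards a segment for matching any single past GIF, while the mean favors segments close to the history as a whole, which matters for users with several distinct interests.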

(a) User with interest in forests
(b) User favouring knock-outs
(c) User creating previews for TV shows.
Figure 6. Qualitative Examples. We compare our method (PHD-CA + SVM-D) to generic highlight detection (Video2GIF (ours)). Videos for which personalization improves the Top 5 results are shown in (a) and (b). In both cases the users are consistent in what content they create GIFs from. Thus, personalization provides more accurate results (correct results have green borders). In (c) we show a failure case, where the user history is misleading the model.


We show quantitative results in Table 1 and qualitative examples in Figure 6. When analyzing the results, we find that our method outperforms (gygli2016video2gif) as well as all baselines by a significant margin. Adding information about the user history to the highlight detection model (Ours (CA + SVM-D)) leads to a relative improvement over generic highlight detection (Video2GIF (ours)) of 5.2% (+0.8%) in mAP, 4.3% (-1.8%) in nMSD and 8% (+2.3%) in Recall@5. This is a significant improvement in this challenging high-level task and compares favorably to the improvement obtained in previous work (gygli2016video2gif). The improvement of our method over using the user history alone is even larger, thus reinforcing the need to train a personalized highlight detection model that uses the information about all users jointly.
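The relative improvements quoted above follow directly from the Table 1 rows for Video2GIF (ours) and Ours (CA + SVM-D). The short check below reproduces the arithmetic; the helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(baseline, ours):
    """Relative change in percent with respect to the baseline."""
    return 100.0 * (ours - baseline) / baseline


generic = {"mAP": 15.86, "nMSD": 42.06, "R@5": 28.42}    # Video2GIF (ours)
personal = {"mAP": 16.68, "nMSD": 40.26, "R@5": 30.71}   # Ours (CA + SVM-D)

changes = {name: relative_change(generic[name], personal[name])
           for name in generic}
```

This gives roughly +5.2% for mAP, -4.3% for nMSD (where lower is better) and +8.1% for Recall@5, matching the figures in the text up to rounding.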

Models using only generic highlight information or only the similarity to previous GIFs perform similarly (15.86% mAP for Video2GIF (ours) vs. 15.64% mAP for SVM-D), despite the simplicity of the distance model. Thus, we can conclude that these two kinds of information are both important and that there is a lot of signal contained in a user’s history about their future choice of highlights. This concurs with our qualitative analysis in Section 3.2, where we find that most users in our dataset show high consistency in the kind of highlights they selected.

Given that the combination of the two kinds of information improves the final results, we conclude that they are complementary to each other and that it is beneficial to use models that consider them both. The residual model also combines generic highlight detection and personalization. It however estimates model weights per user, which leads to inferior results on our dataset, due to the small number of training examples per user. Indeed, the Residual baseline is outperformed by the generic highlight detection and the personalization baselines, in particular SVM-D. Our method, on the other hand, performs well in this challenging setting and outperforms all baselines by a large margin.

To better understand how the model works, Figure 6 shows qualitative results for our method and a non-personalized baseline, along with the user history. As can be seen from Figures 6(a) and 6(b), our method effectively uses the history to make more accurate predictions. In Figure 6(c) we show a failure case, where the history is not indicative of the highlight chosen by the user.

5.2. Detailed experiments

In the following, we analyze different variations of our approach. In particular, we compare various ways to include the user history, network architectures, and fusion of different inputs. Figure 5 shows these different configurations, while their performance is given in Table 2. Additionally, we analyze the performance of our model as the size of the user history varies (Figure 7).

Learning an aggregation vs averaging?

Our proposed model aggregates the history via averaging (PHD-CA, c.f. Section 4.1). Alternatively, Recurrent Neural Networks are often successfully used to encode visual sequences (srivastava2015unsupervised; GarciadelMolino2018predicting). Thus, we also explored a model that uses an LSTM to learn to aggregate the history (PHD-RH). The history is then concatenated to the segment representation and passed through 2 fully-connected layers. As can be seen from Table 2, having a predefined aggregation performs better than learning it. We attribute this to the challenge of learning a sequence embedding from limited data and conclude that an average aggregation provides an effective representation of the users’ history.

Convolutional combination or concatenation?

In Section 4.1 we propose to concatenate the average history to the segment representation. Since they both use the same C3D representation, however, it is also possible to first aggregate each dimension of the two vectors with 1D convolutions, before passing them through fully connected layers (PHD-SA). We compared these two approaches and found the concatenation to give superior performance. The convolutional aggregation uses the structure of the data to reduce the number of network parameters and therefore has roughly half the parameters of the concatenation model. Convolutional aggregation, however, requires the network to aggregate the history into the segment information per dimension, using the same weights. Thus it is limited in its modeling capacity, compared to a network using concatenated features as inputs.

Model                      mAP      nMSD     R@5
PHD-SA                     15.73%   42.80%   28.65%
PHD-RH                     15.74%   42.75%   27.45%
PHD-CA                     16.58%   41.01%   28.18%
PHD-CA-ED (1st layer)      16.14%   41.26%   29.20%
PHD-CA-LD (last layer)     16.20%   41.07%   29.78%
Video2GIF (ours) + SVM-D   16.39%   40.90%   28.70%
PHD-CA + SVM-D             16.68%   40.26%   30.71%

Table 2. Detailed experiments. We analyze different ways to represent and aggregate the history, as well as ways to use the distances to the history to improve the prediction.

Figure 7. Performance of different methods as a function of the history size. We observe that our method improves over generic highlight detection with as little as one history element per user. Furthermore, performance has not saturated even when using the full history, thus indicating that our method can effectively use longer histories as well. Interestingly, we find that only models including the distances to the history as a feature improve Recall@5, i.e. provide better results at the top of the ranking. Best viewed in color.

Adding distances, with early or late fusion?

As we discussed, our assumption is that the similarity of a segment to the previously chosen GIFs is informative when predicting the score of a segment. Thus, we tested models that use the distance to the history elements as an additional input.

Since using distance features leads to a different representation compared to the feature activations of C3D, it is unclear how to best merge the two different modalities. We tried early fusion (concatenation of the two inputs, PHD-CA-ED), late fusion before the prediction layer in one single model (PHD-CA-LD) and late fusion with training two separate models (PHD-CA + SVM-D), as shown in Figure 5. We find that late fusion is superior to early fusion, and that combining two different models outperforms merging on the last layer of the neural network. The superiority of late fusion is to be expected, as neural networks often struggle to combine information from different modalities (simonyan2014two; carreira2017quo). Adding the distances in the neural network even slightly decreases mAP, while Recall@5 improves. While this inconsistency is somewhat surprising, Recall@5 is arguably more important, as it evaluates the accuracy of the top-ranked elements, which is what matters for finding highlights in videos, while mAP considers the complete ranking. When using a separate model for the distances and fusing their predictions, we obtain a consistent improvement in all metrics.

We also tried adding personalization to a generic highlight detection model by combining its predictions with the predictions of the distance SVM (Video2GIF (ours) + SVM-D in Table 2). This leads to a significant improvement over the generic model. While it doesn't perform quite as well as our full model, this approach provides a simple way to personalize existing highlight detection models and improve their performance.

How much does personalization help for different history sizes?

We are interested in how well the model performs when very little user-specific information is available. To test this, we restrict the history provided to the model to the last videos a user created GIFs from, rather than providing the full history. (Footnote 3: Note that some users may have fewer than videos in their history, and only videos can be considered.)
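
The history restriction can be sketched as follows, assuming each user's history is stored as a list ordered from oldest to newest. The helper name `restrict_history` and the parameter `k` are hypothetical and only illustrate the truncation.

```python
def restrict_history(history, k):
    """Keep only the k most recent videos of a user's history.

    `history` is assumed to be ordered from oldest to newest. Users with
    fewer than k videos simply keep everything they have.
    """
    if k <= 0:
        # Guard needed because history[-0:] would return the full list.
        return []
    return list(history[-k:])
```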

We plot the performance as a function of the history length in Figure 7. From this plot, we make several important observations. (i) Adding personalization helps even for small histories. Recall@5 improves by 5.6% (+1.6%) over the generic model for a history size of , for example. Even with a single history video, our method outperforms generic highlight detection across all metrics. Having a model that performs well given few history elements is important, as the history size in our dataset follows a long-tail distribution (cf. Figure 2). Indeed, we discarded more than 90% of the user profiles when creating our dataset, as they had a history of fewer than 5 elements. (ii) While PHD-CA quickly improves mAP as the history grows, only the model including the distances significantly improves Recall@5. This is consistent with our experiments in Table 2. Improving the ranking of the highest-scoring segments is challenging, as they often differ only subtly. The similarity to a user's history makes it possible to capture these differences and thus obtain a better ordering of the top elements. (iii) Performance is not yet saturated for the history lengths in the dataset. Thus our model is not only able to make use of small histories, but can also effectively use larger histories to further improve prediction accuracy.

5.3. Implementation details

Data Setup:

The dataset consists of a total of users, of which are used for training, for validation, and for testing. At both training and test time, the goal of our models is to predict which part of a video a user chooses, given their history . As such, corresponds to the last video from each user, and all other videos are used to build each user's history . The validation set is used to find the best hyper-parameters for the highlight models and also to find the right weight for Eq. 4.

To train our models, we have sampled five positive-negative pairs from each user’s video , where a positive example is a shot that was part of the user’s GIFs for that video (see Figure 8), and a negative example is a shot that was not included in any GIF. To split the user selected segments into shots, we use the shot detection of (Gygli17DeepCut) and deterministically split shots longer than seconds into second chunks. For the user history , we use a maximum of shots, which are selected at random ensuring that there is at least one shot from each of the last videos in the user’s history. Since a user may generate several overlapping GIFs before being satisfied with the result, (and analogously the ground truth for ) does not correspond to each of the user-generated GIFs independently, but rather their union.
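
The two interval operations described above, taking the union of overlapping GIFs and splitting long shots into fixed-length chunks, can be sketched as below. The function names and the parameters `max_len` and `chunk_len` are placeholders (the actual thresholds are not reproduced here), and letting the final chunk be shorter is an assumption.

```python
def merge_intervals(intervals):
    """Union of (start, end) intervals in seconds.

    Overlapping or touching intervals are merged into one.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def split_long_shot(start, end, max_len, chunk_len):
    """Split a shot longer than max_len into consecutive chunk_len pieces.

    Shorter shots are returned unchanged. The last chunk may be shorter
    than chunk_len (an assumption made for this sketch).
    """
    if end - start <= max_len:
        return [(start, end)]
    chunks = []
    t = start
    while t < end:
        chunks.append((t, min(t + chunk_len, end)))
        t += chunk_len
    return chunks
```

Applied to a user's GIFs for one video, `merge_intervals` yields the selection used as ground truth, so repeated, slightly shifted GIFs of the same moment count only once.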

At test time the videos are segmented into fixed segments of 5 seconds to be able to compare to (gygli2016video2gif). Furthermore, (Gygli17DeepCut) may predict short shots and gaps (due to slow scene transitions), which, when used at test time, would lead to noise in the evaluation. We use the user’s full history when making predictions. Since the distance-based models require a -dimensional input, the distance vector is filled with zeros if , and the elements further away are discarded if .
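
A minimal sketch of how a fixed-size distance input could be built from a variable-length history is shown below. The Euclidean metric, the sorting by increasing distance, the function name `history_distance_vector`, and the parameter `n` are all assumptions for illustration; only the zero-filling and the discarding of the most distant elements come from the description above.

```python
import numpy as np

def history_distance_vector(segment_feat, history_feats, n):
    """Distances from one segment to the history as a fixed n-dim vector."""
    segment_feat = np.asarray(segment_feat, dtype=float)
    history_feats = np.asarray(history_feats, dtype=float)
    history_feats = history_feats.reshape(-1, segment_feat.shape[0])
    # Euclidean distance to every history element (the metric is assumed).
    dists = np.linalg.norm(history_feats - segment_feat, axis=1)
    # Keep the n nearest elements; those further away are discarded.
    dists = np.sort(dists)[:n]
    out = np.zeros(n)
    # Zero-fill the tail when the history has fewer than n elements.
    out[:len(dists)] = dists
    return out
```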

Figure 8. Procedure to obtain the pairs of segments in , the user selection for the evaluation, and the user history from any other .

Training methodology:

We optimize the network parameters using grid search over different possible FNN architectures. Different dropout values (random search between and for the input layer, and to for the intermediate ones) and activation functions (ReLU and SELU (klambauer2017self)) were explored, as well as the use of batch normalization (ioffe2015batch) after each layer. Using RMSProp as optimizer and a weight decay between and , the initial learning rate (randomly set between and ) is decreased by half every four epochs, for a total of epochs per search iteration. The pairwise loss function used for all models is . Our models are implemented in TensorFlow.
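
The learning-rate schedule described above (halving every four epochs) amounts to a simple step decay. The sketch below assumes 0-based epoch indices; the function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(initial_lr, epoch, halve_every=4):
    """Step decay: the rate is halved once every `halve_every` epochs.

    `epoch` is assumed to be 0-based, so epochs 0-3 use the initial rate.
    """
    return initial_lr * 0.5 ** (epoch // halve_every)
```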


For the aggregation of and , a size of either four or ten neurons is considered for the 1-D convolution in PHD-SA, flattened with a single neuron convolution before the FNN layers. For the PHD-RH model, we tested using or neurons in the hidden layer of the LSTM.

For learning the combination of the user profile and the segment information, we ran a hyper-parameter search and varied the number of hidden layers of the FNN from 1 to 3. We tested layers having up to neurons, where each following layer would have the same number or fewer neurons. We find that smaller architectures perform best: two hidden layers of and neurons for PHD-CA (with dropout of and in the input and intermediate layers, respectively); a single hidden layer of neurons for PHD-SA; and a single hidden layer of neurons for PHD-RH.

6. Conclusion

In this work, we proposed an approach for personalized highlight detection in videos. The core idea of our approach is to use a model that is trained for all users jointly and which is customized via its inputs, by providing a user’s previously chosen highlights at test time. Such an approach allows training a high-capacity model, even when few examples per user are available. In our experiments, we have shown that the user history provides a useful signal for future selections and that incorporating that information into our highlight detection model significantly improves performance: Our method outperforms generic highlight detection by 8% in Recall@5. When training a separate model per user, as done in previous work, personalization does not outperform generic highlight detection. Our method, on the other hand, works well, even when given very few user-specific training examples. It outperforms generic highlight detection given just a single user-specific training example, thus confirming the benefit of our model architecture.

Finally, in order to train and test our model, we have introduced a large-scale dataset with user-specific highlights. To the best of our knowledge, it is the first personalized highlight dataset at that scale and the first which is made publicly available.