
ILS-SUMM: Iterated Local Search for Unsupervised Video Summarization

by   Yair Shemer, et al.

In recent years, there has been an increasing interest in building video summarization tools, where the goal is to automatically create a short summary of an input video that properly represents the original content. We consider shot-based video summarization where the summary consists of a subset of the video shots which can be of various lengths. A straightforward approach to maximize the representativeness of a subset of shots is by minimizing the total distance between shots and their nearest selected shots. We formulate the task of video summarization as an optimization problem with a knapsack-like constraint on the total summary duration. Previous studies have proposed greedy algorithms to solve this problem approximately, but no experiments were presented to measure the ability of these methods to obtain solutions with low total distance. Indeed, our experiments on video summarization datasets show that the success of current methods in obtaining results with low total distance still has much room for improvement. In this paper, we develop ILS-SUMM, a novel video summarization algorithm to solve the subset selection problem under the knapsack constraint. Our algorithm is based on the well-known metaheuristic optimization framework – Iterated Local Search (ILS), known for its ability to avoid weak local minima and obtain a good near-global minimum. Extensive experiments show that our method finds solutions with significantly better total distance than previous methods. Moreover, to indicate the high scalability of ILS-SUMM, we introduce a new dataset consisting of videos of various lengths.





1 Introduction

In recent years, the amount of video data has significantly increased. In addition to many cinema movies, news videos, and TV shows, people frequently shoot events with their cellphones and share them with others on social media. To illustrate, it has been reported that every minute, 300 hours of new videos are uploaded to YouTube. Consequently, the user’s ability to manage, search, and retrieve a specific item of content is limited. One remedy to this challenge can be an automatic video summarization algorithm where the goal is to create a shortened video that contains the essence of the original video. In fact, several commercial video summarization products are already on the market.

Various video summarization algorithms have been suggested in the literature. In most methods, the process consists of two main stages – segmenting the video into short video shots, and then choosing a subset of the shots to aggregate a summary (Otani et al., 2019). In order to be a good summary, this shot subset selection should optimize a certain property. For example, the selected shots should well represent the content of the video in the sense that each object from the original video has a similar object in the summary.

Video summarization approaches can generally be divided into supervised and unsupervised. Supervised methods include exploiting a ground truth importance score of each frame to train a model (Zhang et al., 2016; Zhao et al., 2018; Gygli et al., 2015) and utilizing auxiliary data such as web images (Khosla et al., 2013), titles (Song et al., 2015), category (Potapov et al., 2014) and any other side information (Yuan et al., 2017). A pitfall of a supervised approach is the necessity of expensive human-made labels. This drawback is especially restrictive because of the complicated and vague structure of a good summary, which requires a lot of labeled data.

In contrast, unsupervised methods do not need human-made labels as they follow rational guidelines for creating a good summary. One group of unsupervised algorithms maximizes the similarity between the original video and the generated summary using generative adversarial networks as evaluators (Mahasseni et al., 2017; Jung et al., 2019; Yuan et al., 2019), or by dictionary learning (Cong et al., 2011; Zhao and Xing, 2014). Another salient group of methods seeks to minimize the total distance between shots and their nearest selected shots while satisfying the limit on the summary duration. Attempts in this direction include using submodular optimization (Gygli et al., 2015), reinforcement learning (Zhou et al., 2018) and clustering methods (Chheng, 2007; De Avila et al., 2011; Hadi et al., 2006).

Even though several methods have been proposed for minimizing this total distance, no experiments have been presented that directly measure the success of these methods in obtaining solutions with low total distance. Our experiments indicate that the best existing method obtains solutions whose total distance is, on some datasets, around 10% worse than the optimal solution on average. Hence, there is room for a new method that leads to better solutions.

In this paper, we propose ILS-SUMM, a novel unsupervised video summarization algorithm which uses the Iterated Local Search (ILS) optimization framework (Lourenço et al., 2003) to find a representative subset of shots. We formalize the following optimization problem: given the entire set of shots with varied shot duration, select a subset which minimizes the total distance between shots and their nearest selected shots while satisfying a knapsack constraint, i.e., the limit on the summary duration. This problem is known in the Operations Research field as the Knapsack Median (KM) problem, and is known to be NP-hard. A major challenge in performing a local search in the solution domain is the high chance of getting stuck in a local minimum because of the hard knapsack constraint (Fang et al., 2002). Therefore we use the ILS framework, which is the basis for several state-of-the-art algorithms for NP-hard problems (Lourenço et al., 2019).

ILS-SUMM creates a video summary by selecting shots that represent the original video well, using the ILS framework. First, it initializes a solution that satisfies the summary duration limit. Then, it applies improvement steps by adding or replacing one shot at a time, while allowing only steps that obey the knapsack constraint. When a local minimum is reached, ILS executes a gentle, yet noticeable, perturbation to the current solution, to escape the local minimum while trying to keep part of the high quality of the solution it obtained. We perform extensive experiments on the SumMe and TvSum benchmarks showing that our method finds solutions that are on average less than 2% worse than the optimal solution, which is significantly better than the results of previous methods. Moreover, experiments on long real open-source movies indicate the scalability of ILS-SUMM. A Python implementation of the proposed method is released in [].

2 Related Work

2.1 Video Summarization

Various unsupervised video summarization methods have been presented in the recent literature. Most methods share the underlying assumption that a good summary should represent and be similar to the original video. Cong et al. (2011) and Zhao and Xing (2014) build a representative dictionary of key frames that minimizes the reconstruction error of the original video. Mahasseni et al. (2017) train a deep neural network to minimize the distance between original videos and a distribution of their summarizations. Chheng (2007) clusters the video shots into k clusters using the k-means algorithm, and then selects shots from various clusters. Gygli et al. (2015) apply submodular optimization to minimize the total distance between shots and the nearest of the selected shots. Zhou et al. (2018) use a reinforcement learning framework to train a neural network to select frames such that the representativeness and diversity of the summary are maximized.

In recent years, most methods have evaluated their results by measuring the similarity between automatic summaries and human-made summaries. Recently, Otani et al. (2019) observed that randomly generated summaries obtain performance on this metric that is competitive with, or better than, the state of the art. Based on this surprising observation, instead of evaluating our method using human labels, we measure the success of our algorithm in terms of having low total distance between all shots and the nearest of the selected shots.

2.2 Iterated Local Search

In the ILS framework, a sequence of locally optimal solutions is iteratively generated by a heuristic algorithm. The initial point of the heuristic search in each iteration is a perturbation of a previously obtained solution, rather than a completely random trial. ILS has been applied successfully to various optimization problems, leading to high performance and even establishing the current state-of-the-art algorithms for some tasks.

Some successful applications of ILS include solving common combinatorial problems such as graph partitioning problems, traveling salesperson problems and scheduling problems (De Corte and Sörensen, 2016), in addition to other applications such as image registration (Cordón and Damas, 2006), car sequencing (Cordeau et al., 2008; Ribeiro et al., 2008) and the generalized quadratic multiple knapsack problem (Avci and Topaloglu, 2017). To the best of our knowledge, this paper is the first to apply Iterated Local Search to the Knapsack Median problem.

3 Method

In this section, we introduce an unsupervised method for video summarization, based on the Iterated Local Search (ILS) optimization framework.

3.1 Formulation

Given an input video $v$, the video is divided temporally into a set of shots $\{s_1, \dots, s_N\}$, where $N$ is the number of shots in the video. We denote the duration in seconds of a shot $s$ as $t(s)$. Each shot $s$ is represented by its middle frame feature vector $x(s)$ (for details see section 4.1). A condensed video summary is a representative subset of shots $S \subseteq \{s_1, \dots, s_N\}$. The summarization task is then formulated by the following optimization problem. We denote by $TD(S)$ the total distance between all video shots and their nearest shot in $S$:

$$TD(S) = \sum_{i=1}^{N} \min_{s \in S} \mathrm{dist}\big(x(s_i), x(s)\big) \tag{1}$$

The objective is to obtain the subset $S^*$ that minimizes the total distance:

$$S^* = \operatorname*{arg\,min}_{S} \; TD(S) \tag{2}$$

subject to:

$$\sum_{s \in S} t(s) \le T \tag{3}$$

where $\mathrm{dist}(x, y)$ denotes some distance metric between $x$ and $y$, and $T$ is the maximum total duration in seconds allowed for the video summary. Equation 2 expresses the goal that the subset will minimize the total distance between shots and their nearest selected shots. Equation 3 encodes the knapsack constraint which limits the total duration of the selected shots not to exceed $T$ seconds. This problem is known in the Operations Research field as the Knapsack Median (KM) problem.
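As a small illustration of the formulation, the sketch below evaluates the total distance of a candidate subset and checks the duration constraint. It assumes Euclidean distance as the metric and represents shots by their indices; the function names `euclidean`, `total_distance` and `is_feasible` and the toy data are hypothetical and not taken from the paper.

```python
from typing import Sequence, Set

Vector = Sequence[float]


def euclidean(x: Vector, y: Vector) -> float:
    """Euclidean distance; an assumed choice, the paper only says 'some distance metric'."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def total_distance(features: Sequence[Vector], selected: Set[int]) -> float:
    """Sum over all shots of the distance to the nearest selected shot."""
    return sum(min(euclidean(x, features[j]) for j in selected) for x in features)


def is_feasible(durations: Sequence[float], selected: Set[int], limit: float) -> bool:
    """Knapsack constraint: total duration of the selected shots must not exceed the limit."""
    return sum(durations[j] for j in selected) <= limit


# Toy video with four shots, one-dimensional features, durations in seconds.
features = [[0.0], [1.0], [5.0], [6.0]]
durations = [2.0, 3.0, 2.0, 4.0]

print(total_distance(features, {0, 2}))    # shots 1 and 3 are each at distance 1 from a selected shot
print(is_feasible(durations, {0, 2}, 4.0))
```

Selecting shots 0 and 2 covers both clusters of the toy video, so only the two unselected shots contribute to the total distance.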

Motivated by the success of Iterated Local Search (ILS) in many computationally hard problems (Lourenço et al., 2019), we use this simple yet powerful metaheuristic to address the KM problem, and consequently obtain a representative video summary. A fundamental component in the ILS framework is a local search algorithm. In the following section, we first introduce a local search algorithm tailored to the KM problem, which we name Local-Search-SUMM, and subsequently we present the complete ILS-SUMM algorithm.

3.2 Local-Search-SUMM

A local search algorithm starts from a particular solution of a given problem and sequentially improves the solution by performing a local move at each step (Osman and Kelly, 1997). The pseudo-code of Local-Search-SUMM, a local search algorithm we developed for the KM problem, is given by Algorithm 1. The algorithm relies on three functions: an objective function, a map from a solution to a neighborhood of solutions, and a function that selects one of the neighbors, all detailed below. As input, Local-Search-SUMM receives an initial solution. In each iteration, it selects a neighbor of the current solution and moves to this neighbor if it decreases the objective function. The loop continues until a local minimum is reached or until the predefined maximum number of trials is reached. To solve the KM problem with a local search algorithm, we use the setting described below.

Algorithm 1: Local-Search-SUMM

Algorithm 2: ILS-SUMM

Objective function. Straightforwardly, we define the objective function as the total distance between shots and their nearest selected shots, i.e., the total distance defined in equation (1).

Note that an extension to a multi-objective function is straightforward, by changing the objective function to be a weighted sum of objectives as proposed by Gygli et al. (2015).

Initialization. To deal with the knapsack constraint, we define the local search initialization and neighborhood such that throughout all the solution process, only feasible solutions will be considered. To ensure that the initialized solution satisfies the knapsack constraint, the initialization subset is set to be the single shortest shot.

Neighborhood. A neighborhood includes any subset which is obtained by swapping a single shot of the current solution or adding a single shot to it, while satisfying the knapsack constraint. Removing a shot will never decrease the objective function, and therefore is not included in the neighborhood set.

Selection. As a selection method, we use steepest descent, i.e., selecting the neighbor which decreases the objective function the most. To boost run-time performance, the algorithm first considers adding a shot, and only if this is impossible does it consider swaps. This approach reduces the complexity of the algorithm and demonstrates significantly better run-time performance in our experiments.
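The setting above (shortest-shot initialization, add-or-swap neighborhood, steepest descent with additions tried first) can be sketched as follows. This is a minimal reading of the prose, not the authors' released code: Euclidean distance, the names `shortest_shot_start` and `local_search`, the `max_trials` default and the toy data are all assumptions made for illustration.

```python
from typing import Sequence, Set

Vector = Sequence[float]


def euclidean(x: Vector, y: Vector) -> float:
    # Assumed metric; the paper leaves the distance function open.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def total_distance(features: Sequence[Vector], selected: Set[int]) -> float:
    return sum(min(euclidean(x, features[j]) for j in selected) for x in features)


def shortest_shot_start(durations: Sequence[float]) -> Set[int]:
    """Initial solution: the single shortest shot (assumed to fit within the limit)."""
    return {min(range(len(durations)), key=lambda j: durations[j])}


def local_search(features, durations, limit, start, max_trials=1000) -> Set[int]:
    current = set(start)
    current_cost = total_distance(features, current)
    n = len(features)
    for _ in range(max_trials):
        used = sum(durations[j] for j in current)
        outside = [j for j in range(n) if j not in current]
        # Feasible additions are considered first.
        neighbors = [current | {j} for j in outside if used + durations[j] <= limit]
        if not neighbors:
            # Only when no shot can be added are feasible swaps considered.
            neighbors = [
                (current - {i}) | {j}
                for i in current
                for j in outside
                if used - durations[i] + durations[j] <= limit
            ]
        if not neighbors:
            break
        # Steepest descent: take the neighbor with the lowest objective value.
        best = min(neighbors, key=lambda s: total_distance(features, s))
        best_cost = total_distance(features, best)
        if best_cost >= current_cost:
            break  # no improving neighbor: local minimum reached
        current, current_cost = best, best_cost
    return current


features = [[0.0], [1.0], [5.0], [6.0]]
durations = [2.0, 3.0, 2.0, 4.0]
summary = local_search(features, durations, 4.0, shortest_shot_start(durations))
print(sorted(summary), total_distance(features, summary))
```

Because every neighbor is built to satisfy the duration limit, the search never leaves the feasible region, which mirrors the feasibility-preserving design described in the text.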

3.3 ILS-SUMM

A local search algorithm may lead to a poor local minimum that is far away from the global minimum. Hence, after getting stuck in a local minimum, it is worthwhile to continue searching for other solutions which potentially can be far better. ILS performs this continued search, by repeatedly calling a local search algorithm which in each call starts from a different starting solution. As illustrated in figure 1, in each iteration the starting point of the next iteration is a perturbation of the current solution.

Algorithm 3: Perturbation

ILS-SUMM pseudo-code is given by Algorithm 2. It consists of three main components which are executed at each iteration: A perturbation of the last solution, a call of a local search algorithm and a decision whether to accept the new found local minimum or to stay with the old solution. As a local search, we use the Local-Search-SUMM introduced above. In the following, we will go into more details regarding the perturbation mechanism and acceptance criterion.

Figure 1: Illustration of the Iterated Local Search framework. Given a local minimum, a perturbation leads to a new starting solution. Then, a call of a local search algorithm obtains a new local minimum which is potentially better than the previous one.

Perturbation. In this stage, the previous solution is modified to a different solution. Specifically, ILS-SUMM perturbs a given subset by swapping a number of shots in the subset with shots that are currently not in it. See the perturbation mechanism pseudo-code in Algorithm 3. To maximize the chance of getting a feasible solution, the perturbation is executed in a constraint-greedy manner. This constraint greediness means that the longest-duration currently selected shots are swapped with the shortest-duration non-selected shots. However, if the new solution does not satisfy the knapsack constraint, then the original solution is returned (line 7 of Algorithm 3). This perturbation mechanism is deterministic. Another option is to add randomness when selecting which shots to swap. Since the stochastic version did not lead to an improvement in the experiments, we retain the deterministic version which also enjoys the benefit of repeatability.
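The constraint-greedy perturbation can be sketched as below. The function name `perturb`, the `strength` parameter (the number of shots swapped) and the tie-breaking by shot index are assumptions introduced for illustration; only the swap rule and the fallback to the original solution come from the description above.

```python
from typing import Sequence, Set


def perturb(selected: Set[int], durations: Sequence[float],
            limit: float, strength: int) -> Set[int]:
    """Swap the `strength` longest selected shots with the `strength` shortest unselected ones."""
    # Longest selected shots first; ties broken by index so the result is deterministic.
    inside = sorted(selected, key=lambda j: (-durations[j], j))
    # Shortest unselected shots first.
    outside = sorted(
        (j for j in range(len(durations)) if j not in selected),
        key=lambda j: (durations[j], j),
    )
    k = min(strength, len(inside), len(outside))
    candidate = (set(selected) - set(inside[:k])) | set(outside[:k])
    # If the swap violates the knapsack constraint, keep the original solution.
    if sum(durations[j] for j in candidate) > limit:
        return set(selected)
    return candidate


durations = [2.0, 3.0, 2.0, 4.0]
print(sorted(perturb({0, 2}, durations, 5.0, 1)))  # swap is feasible under a 5-second limit
print(sorted(perturb({0, 2}, durations, 4.0, 1)))  # swap would exceed 4 seconds, so unchanged
```

Swapping long selected shots for short unselected ones can only lower or preserve the total duration when the incoming shots are shorter, which is why this ordering tends to produce feasible perturbations.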

The strength of the perturbation can range between two extremes. On one extreme, it can totally change the solution and, in fact, restart from a new solution. Such complete re-initializations typically lead to long iterations and poor solutions because of the low quality of the starting solution. On the other extreme, applying a weak perturbation which only slightly changes the solution may lead to being stuck repeatedly in the same local minimum; hence, a good perturbation has a balanced intensity. As described in the ILS-SUMM pseudo-code (Algorithm 2), we use a gradually increasing perturbation strength. It starts from its minimum value and gradually increases by one until a predefined maximum value is reached. In this way, we use the minimal strength of perturbation that accomplishes exiting the current local minimum.

Acceptance Criterion. In this stage, the algorithm decides which solution will be perturbed in the next iteration to get a new starting point. Two extreme cases of this procedure are either to always continue with the new local minimum obtained or to stick with the currently best achieved local minimum. The first extreme can be interpreted as a random walk over the local minima, whereas the second extreme can be viewed as a greedy local search over the local minima (Lourenço et al., 2019). An intermediate approach would be to prioritize good solutions while occasionally exploring inferior solutions. An example of such a scheme is the Metropolis Heuristic where worse solutions randomly get the chance to be explored. An interesting modification of the Metropolis Heuristic is Simulated Annealing (Van Laarhoven and Aarts, 1987), where the temperature of these exploration events, i.e., the probability of moving to a worse solution, progressively decreases. Since all the above options demonstrate similar results in our experiments, we assign an acceptance criterion which chooses the best achieved local minimum.
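Putting the three components together, a compact sketch of the overall loop might look as follows. It is written from the prose only: the reset of the perturbation strength to one after an improvement, the starting strength of one, Euclidean distance, and all function names are assumptions, and the helper functions are simplified versions kept short so that the block is self-contained.

```python
from typing import Sequence, Set


def euclidean(x, y) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5  # assumed metric


def total_distance(features, selected: Set[int]) -> float:
    return sum(min(euclidean(x, features[j]) for j in selected) for x in features)


def local_search(features, durations, limit, start, max_trials=1000) -> Set[int]:
    current, cost = set(start), total_distance(features, start)
    for _ in range(max_trials):
        used = sum(durations[j] for j in current)
        outside = [j for j in range(len(features)) if j not in current]
        neighbors = [current | {j} for j in outside if used + durations[j] <= limit]
        if not neighbors:  # swaps are tried only when nothing can be added
            neighbors = [(current - {i}) | {j} for i in current for j in outside
                         if used - durations[i] + durations[j] <= limit]
        if not neighbors:
            break
        best = min(neighbors, key=lambda s: total_distance(features, s))
        best_cost = total_distance(features, best)
        if best_cost >= cost:
            break
        current, cost = best, best_cost
    return current


def perturb(selected, durations, limit, strength) -> Set[int]:
    inside = sorted(selected, key=lambda j: (-durations[j], j))
    outside = sorted((j for j in range(len(durations)) if j not in selected),
                     key=lambda j: (durations[j], j))
    k = min(strength, len(inside), len(outside))
    candidate = (set(selected) - set(inside[:k])) | set(outside[:k])
    return candidate if sum(durations[j] for j in candidate) <= limit else set(selected)


def ils_summ(features, durations, limit, max_strength=5) -> Set[int]:
    start = {min(range(len(durations)), key=lambda j: durations[j])}
    best = local_search(features, durations, limit, start)
    strength = 1  # assumed starting strength
    while strength <= max_strength:
        restart = perturb(best, durations, limit, strength)
        candidate = local_search(features, durations, limit, restart)
        # Acceptance criterion: keep the best local minimum found so far.
        if total_distance(features, candidate) < total_distance(features, best):
            best, strength = candidate, 1  # assumed: reset strength after an improvement
        else:
            strength += 1  # perturb harder if the same basin was reached again
    return best


features = [[0.0], [1.0], [5.0], [6.0]]
durations = [2.0, 3.0, 2.0, 4.0]
summary = ils_summ(features, durations, 4.0)
print(sorted(summary), total_distance(features, summary))
```

Since a candidate replaces the incumbent only on strict improvement, the returned solution is never worse than the plain local search result and always satisfies the duration limit.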

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate our approach on two popular video summarization datasets, SumMe (Gygli et al., 2014) and TvSum (Song et al., 2015), as well as on our new Open Source Total Distance dataset with full-length videos which we present below. SumMe consists of 25 user videos on various topics such as cooking, traveling, and sport. Each video length ranges from 1 to 6 minutes. TvSum consists of 50 videos with a duration varying between 2 and 10 minutes. We use a video shot segmentation technology to subdivide each video into shots (see Implementation Details below). For SumMe and TvSum we use their common summary length limit, which is 15% of the video length. We use the Python PuLP library to obtain the optimal total distance of each video using an integer programming solver. As we will show in the results section, this integer programming tool has poor run-time scalability, but is useful for obtaining the optimal total distance as ground truth.
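The text does not spell out the integer program that is handed to the solver. One standard way to write the Knapsack Median problem as a 0-1 program is sketched below, under the following assumed notation: $N$ shots, $d_{ij}$ the distance between the features of shots $i$ and $j$, $t_j$ the duration of shot $j$, $T$ the duration limit, $y_j = 1$ if shot $j$ is selected, and $z_{ij} = 1$ if shot $i$ is assigned to selected shot $j$.

```latex
\begin{align}
\min_{y,\,z} \quad & \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}\, z_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{N} z_{ij} = 1 && i = 1, \dots, N \\
& z_{ij} \le y_j && i, j = 1, \dots, N \\
& \sum_{j=1}^{N} t_j\, y_j \le T \\
& y_j \in \{0, 1\}, \quad z_{ij} \in \{0, 1\} && i, j = 1, \dots, N
\end{align}
```

At an optimum each shot is assigned to its nearest selected shot, so the objective value equals the total distance of the selected subset. The number of assignment variables grows quadratically with the number of shots, which is consistent with the poor run-time scalability reported for the solver on long videos.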

To evaluate the total distance results and the scalability on longer movies, we establish a new total distance dataset: Open-Source-Total-Distance (OSTD). This dataset consists of 18 movies with durations ranging between 10 and 104 minutes, taken from the OVSD dataset (Rotman et al., 2016). For these videos, we set the summary length limit to be the minimum between 4 minutes and 10% of the video length. Links for OSTD movies and ground truth optimal total distance of all above datasets are available on [].
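The two summary length rules stated above reduce to a few lines of arithmetic. The helper below is only an illustration; the function name `summary_limit_seconds` and the string labels used to pick the rule are hypothetical.

```python
def summary_limit_seconds(video_seconds: float, dataset: str) -> float:
    """Maximum summary duration in seconds under the rules stated in the text."""
    if dataset in ("SumMe", "TvSum"):
        return 0.15 * video_seconds                   # 15% of the video length
    if dataset == "OSTD":
        return min(4 * 60.0, 0.10 * video_seconds)    # min(4 minutes, 10% of the length)
    raise ValueError(f"unknown dataset label: {dataset}")


# A 596-second movie is limited by the 10% rule, a 6195-second movie by the 4-minute cap.
print(summary_limit_seconds(596, "OSTD"))
print(summary_limit_seconds(6195, "OSTD"))
```

For OSTD the 4-minute cap becomes the binding rule for any movie longer than 40 minutes.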

Implementation Details. For temporal segmentation, we use two different types of shot segmentation methods, in accordance with video types. For the OSTD movies, we use the FFprobe Python tool (FFprob, 2019), since this tool has high accuracy when applied to videos with fast shot transitions. For the SumMe and TvSum datasets, we use KTS, proposed by Potapov et al. (2014), since this shot segmentation method is more appropriate for catching slow shot transitions, which are common in these two datasets. For feature extraction, we use the RGB color histogram with 32 bins for each of the three color channels. See some comments on using deep features in section 4.3.
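A 32-bin-per-channel RGB histogram yields a 96-dimensional feature vector per frame. The sketch below computes it in plain Python from a list of 8-bit `(r, g, b)` tuples; the input representation, the normalization by pixel count and the function name `rgb_histogram` are assumptions, since the text specifies only the number of bins.

```python
from typing import List, Sequence, Tuple


def rgb_histogram(pixels: Sequence[Tuple[int, int, int]], bins: int = 32) -> List[float]:
    """Concatenated per-channel histograms of 8-bit RGB pixels, normalized by pixel count."""
    width = 256 // bins               # 8 intensity levels per bin when bins == 32
    hist = [0.0] * (3 * bins)
    for r, g, b in pixels:
        hist[r // width] += 1.0               # red channel occupies entries [0, bins)
        hist[bins + g // width] += 1.0        # green channel occupies [bins, 2 * bins)
        hist[2 * bins + b // width] += 1.0    # blue channel occupies [2 * bins, 3 * bins)
    count = len(pixels)
    return [h / count for h in hist] if count else hist


# Two-pixel toy "frame".
feature = rgb_histogram([(0, 0, 0), (255, 128, 7)])
print(len(feature), sum(feature))
```

Normalizing by the pixel count makes frames of different resolutions comparable, which matters when distances between shot features are summed across a whole video.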

For the perturbation mechanism of ILS-SUMM, we set the maximum perturbation strength to 5, since we found that this value leads to a balanced perturbation intensity. However, we observed that ILS-SUMM is not sensitive to this value, and other values are just as satisfactory.

Evaluation. To compare different approaches, we calculate the total distance defined in equation (1) that each approach achieved for each video. We then calculate the optimality percentage, i.e., the ratio between the optimal value and the achieved value, multiplied by 100. For each method, we average the optimality percentages achieved on all the videos of a specific dataset, and report the averaged optimality percentage.
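The evaluation measure can be written out directly. The sketch below uses made-up numbers purely to show the arithmetic; the function names and the handling of a zero achieved distance are assumptions.

```python
from typing import Iterable, Tuple


def optimality_percentage(optimal: float, achieved: float) -> float:
    """100 * optimal / achieved; equals 100 when the achieved total distance is optimal."""
    if achieved == 0.0:
        return 100.0  # assumed convention: a zero total distance is necessarily optimal
    return 100.0 * optimal / achieved


def mean_optimality(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the per-video percentages over a dataset of (optimal, achieved) pairs."""
    values = [optimality_percentage(opt, ach) for opt, ach in pairs]
    return sum(values) / len(values)


# Hypothetical two-video dataset: one video solved to 80% optimality, one solved optimally.
print(mean_optimality([(2.0, 2.5), (3.0, 3.0)]))
```

Because the problem is a minimization, the achieved value is never below the optimal one, so the percentage lies in (0, 100] and higher is better.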

Comparison. To compare total distance results with other approaches, we apply DR-DSN (Zhou et al., 2018) and Submodular (Gygli et al., 2015) on the datasets. For both methods, we use the implementations provided by the original authors. Although both methods can optimize multiple objectives, for our experiments we set them to maximize only representativeness since this is the evaluation metric we use. As mentioned above, an extension of our method to a multi-objective setting is straightforward, but to simplify the comparison we focus on representativeness.

Figure 2: ILS-SUMM selection in the Cosmos Laundromat movie from the OSTD dataset. The middle frame of each shot is presented. A red circle denotes the shots that were chosen by the ILS-SUMM algorithm.

4.2 Results

First, we compare our method with simple local search baselines. Then, we compare our method with previously proposed algorithms.

Comparison with baselines. We use two variants of local search as baseline algorithms. The first baseline is Local-Search-SUMM, described above in Algorithm 1. The second baseline, denoted by Restart-SUMM, repeatedly restarts Local-Search-SUMM initialized with a different single shot at each restart and then selects the best result. The algorithm stops when it has gone over all the video shots or when the run-time limit is reached. For each video we set the maximum run-time allowed for Restart-SUMM to be the video duration. Table 1 reports the total distance optimality percentage achieved by the baselines and ILS-SUMM on the SumMe, TvSum, and OSTD datasets.

| Method | SumMe | TvSum | OSTD |
|---|---|---|---|
| Local-Search-SUMM | 70.80% | 87.11% | 91.66% |
| Restart-SUMM | 93.19% | 98.19% | 94.95% |
| ILS-SUMM | 98.48% | 99.27% | 98.38% |

Table 1: Results (total distance optimality percentage) of different variants of local search on SumMe, TvSum and OSTD.

We can see that ILS-SUMM clearly outperforms Local-Search-SUMM. This result demonstrates the importance of the exploration process of ILS, since stopping the algorithm in the first reached local minimum as done in Local-Search-SUMM is far from optimal.

Although Restart-SUMM is significantly better than Local-Search-SUMM, it is still inferior to ILS-SUMM. More importantly, Restart-SUMM is highly impractical since in many videos the time it takes to generate a summary with Restart-SUMM is equal to the time it takes to watch the full video (for more details see the run-time analysis below). This indicates the usefulness of the ILS perturbation mechanism, which, rather than initializing the solution to a completely new solution, partially reuses the good solution it already obtained and thus obtains better results in less time.

Comparison with previous approaches. Table 2 shows the results of ILS-SUMM measured against other video summarization methods that aim to minimize total distance, on SumMe, TvSum, and OSTD. It can be seen that ILS-SUMM significantly outperforms the other approaches on all datasets.

| Method | SumMe | TvSum | OSTD |
|---|---|---|---|
| DR-DSN | 90.78% | 82.50% | 62.56% |
| Submodular | 85.18% | 94.14% | 95.99% |
| ILS-SUMM | 98.48% | 99.27% | 98.38% |

Table 2: Results (total distance optimality percentage) of different approaches on SumMe, TvSum and OSTD. Our ILS-SUMM exhibits a significant advantage over the others.

Run-time performance. Table 3 presents the run-time measurements of the PuLP, Submodular, Restart-SUMM and ILS-SUMM methods on the OSTD dataset. Our experiments demonstrate that for obtaining a reasonable solution, Submodular is the fastest approach. These results may be expected since Submodular runs only two iterations of greedily adding shots, without any further exploration. However, as we presented above, ILS-SUMM obtains significantly better results than submodular optimization, while enjoying substantially better run-time scalability than PuLP. With these numbers, one can trade off solution optimality against run-time for a specific use of video summarization.

| Video (duration) | PuLP | Submodular | Restart-SUMM | ILS-SUMM |
|---|---|---|---|---|
| Big Buck Bunny (596 sec) | 1.96% | 0.02% | 8.08% | 0.33% |
| La Chute d’une Plume (624 sec) | 0.48% | 0.01% | 2.12% | 0.09% |
| Elephants Dream (654 sec) | 0.60% | 0.01% | 2.34% | 0.11% |
| Meridian (719 sec) | 1.66% | 0.01% | 5.10% | 0.21% |
| Cosmos Laundromat (731 sec) | 0.95% | 0.01% | 2.41% | 0.06% |
| Tears of Steel (734 sec) | 1.51% | 0.01% | 6.90% | 0.29% |
| Sintel (888 sec) | 0.83% | 0.01% | 5.67% | 0.32% |
| Jathia’s Wager (1261 sec) | 2.09% | 0.02% | 21.20% | 0.38% |
| 1000 Days (2620 sec) | 5.48% | 0.02% | 71.48% | 0.98% |
| Pentagon (3034 sec) | 4.71% | 0.02% | 50.10% | 0.60% |
| Seven Dead Men (3424 sec) | 22.49% | 0.02% | 62.47% | 0.36% |
| Boy Who Never Slept (4186 sec) | 25.47% | 0.03% | 100% | 0.84% |
| Sita Sings the Blues (4891 sec) | 58.33% | 0.06% | 100% | 4.97% |
| CH7 (5189 sec) | 24.10% | 0.02% | 100% | 0.85% |
| Honey (5210 sec) | 45.12% | 0.03% | 100% | 1.19% |
| Valkaama (5586 sec) | 51.82% | 0.04% | 100% | 1.86% |
| Star Wreck (6195 sec) | 91.96% | 0.05% | 100% | 2.38% |
| Route 66 (6205 sec) | 49.06% | 0.05% | 100% | 3.13% |

Table 3: Run-time comparison (% of video duration) between PuLP, Submodular, Restart-SUMM and ILS-SUMM on the OSTD dataset.

4.3 Deep Features

Recently, deep features have been used in many applications, including video summarization, as they give state-of-the-art results in tasks such as semantic image classification, visual art processing and image restoration. However, since the right way to evaluate video summarization is still an open question (Otani et al., 2019), there is no solid evidence for an advantage in using deep features rather than color histogram features for this task. To decide which features to use, we extracted both types of features for all videos from the SumMe dataset. For color histograms we used 32 bins for each of the RGB channels, and as deep features we used the penultimate layer of the ResNet model (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009). Then, for each video, we applied a dimensionality reduction on these features using PCA.

(a) Visualization of the color histogram features.
(b) Visualization of the deep features.
Figure 3: Visualization of the features of the shots of the “Bearpark Climbing” video from the SumMe dataset. In each panel, the feature dimension was reduced to two using PCA. Panel (a) visualizes the color histogram features, and panel (b) visualizes the deep features.

We observed that even though deep features are better at representing the semantics of the images, color histogram features seem to represent background and scene changes better. For example, Figure 3(a) visualizes the color histogram feature space of the “Bearpark Climbing” video from the SumMe dataset, and Figure 3(b) visualizes the deep feature space. The plot’s axes are the first two principal components of the shot features. Each shot is represented by the image of the middle frame in the shot. It can be seen that in both cases, deep and shallow features, different scenes tend to be located and grouped in different parts of the feature space. However, the grouping in the color histogram space visually looks better than the grouping in the deep feature space, especially given the task definition of creating a summary which is visually similar to the source. Therefore, we used the color histogram features to represent shots in this paper. Future research may examine the integration of deep and color histogram features.

5 Conclusion

In this paper, we have proposed a new subset selection algorithm based on the Iterated Local Search (ILS) framework for unsupervised video summarization. Motivated by the success of ILS in many computationally hard problems, we leverage this method for explicitly minimizing the total distance between video shots and their nearest selected shots under a knapsack-like constraint on the total summary duration. We have shown that a proper balance between local search and global exploration indeed leads to an efficient and effective algorithm for the Knapsack Median problem. Our experiments on video summarization datasets indicate that ILS-SUMM outperforms other video summarization approaches and finds solutions with significantly better total distance. Furthermore, experiments on a dataset of long videos that we have introduced demonstrate the high scalability of our method.


  • M. Avci and S. Topaloglu (2017) A multi-start iterated local search algorithm for the generalized quadratic multiple knapsack problem. Computers & Operations Research 83, pp. 54–65. Cited by: §2.2.
  • T. Chheng (2007) Video summarization using clustering. Department of Computer Science University of California, Irvine. Cited by: §1, §2.1.
  • Y. Cong, J. Yuan, and J. Luo (2011) Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia 14 (1), pp. 66–75. Cited by: §1, §2.1.
  • J. Cordeau, G. Laporte, and F. Pasin (2008) Iterated tabu search for the car sequencing problem. European Journal of Operational Research 191 (3), pp. 945–956. Cited by: §2.2.
  • O. Cordón and S. Damas (2006) Image registration with iterated local search. Journal of Heuristics 12 (1-2), pp. 73–94. Cited by: §2.2.
  • S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr, and A. de Albuquerque Araújo (2011) VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32 (1), pp. 56–68. Cited by: §1.
  • A. De Corte and K. Sörensen (2016) An iterated local search algorithm for water distribution network design optimization. Networks 67 (3), pp. 187–198. Cited by: §2.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.3.
  • H. Fang, Y. Kilani, J. H. Lee, and P. J. Stuckey (2002) Reducing search space in local search for constraint satisfaction. In AAAI/IAAI, pp. 28–33. Cited by: §1.
  • FFprob (2019) Python software foundation, Cited by: §4.1.
  • M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool (2014) Creating summaries from user videos. In European Conference on Computer Vision, pp. 505–520. Cited by: §4.1.
  • M. Gygli, H. Grabner, and L. Van Gool (2015) Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3090–3098. Cited by: §1, §1, §2.1, §3.2, §4.1.
  • Y. Hadi, F. Essannouni, and R. O. H. Thami (2006) Unsupervised clustering by k-medoids for video summarization. In ISCCSP’06 (The Second International Symposium on Communications, Control and Signal Processing), Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.3.
  • Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon (2019) Discriminative feature learning for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8537–8544. Cited by: §1.
  • A. Khosla, R. Hamid, C. Lin, and N. Sundaresan (2013) Large-scale video summarization using web-image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2698–2705. Cited by: §1.
  • H. R. Lourenço, O. C. Martin, and T. Stützle (2003) Iterated local search. In Handbook of Metaheuristics, pp. 320–353. Cited by: §1.
  • H. R. Lourenço, O. C. Martin, and T. Stützle (2019) Iterated local search: framework and applications. In Handbook of Metaheuristics, pp. 129–168. Cited by: §1, §3.1, §3.3.
  • B. Mahasseni, M. Lam, and S. Todorovic (2017) Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 202–211. Cited by: §1, §2.1.
  • I. H. Osman and J. P. Kelly (1997) Meta-heuristics theory and applications. Journal of the Operational Research Society 48 (6), pp. 657–657. Cited by: §3.2.
  • M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila (2019) Rethinking the evaluation of video summaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7596–7604. Cited by: §1, §2.1, §4.3.
  • D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid (2014) Category-specific video summarization. In European Conference on Computer Vision, pp. 540–555. Cited by: §1, §4.1.
  • C. C. Ribeiro, D. Aloise, T. F. Noronha, C. Rocha, and S. Urrutia (2008) A hybrid heuristic for a multi-objective real-life car sequencing problem with painting and assembly line constraints. European Journal of Operational Research 191 (3), pp. 981–992. Cited by: §2.2.
  • D. Rotman, D. Porat, and G. Ashour (2016) Robust and efficient video scene detection using optimal sequential grouping. In 2016 IEEE International Symposium on Multimedia (ISM), pp. 275–280. Cited by: §4.1.
  • Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes (2015) Tvsum: summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5179–5187. Cited by: §1, §4.1.
  • P. J. Van Laarhoven and E. H. Aarts (1987) Simulated annealing. In Simulated Annealing: Theory and Applications, pp. 7–15. Cited by: §3.3.
  • L. Yuan, F. E. Tay, P. Li, L. Zhou, and J. Feng (2019) Cycle-sum: cycle-consistent adversarial lstm networks for unsupervised video summarization. arXiv preprint arXiv:1904.08265. Cited by: §1.
  • Y. Yuan, T. Mei, P. Cui, and W. Zhu (2017) Video summarization by learning deep side semantic embedding. IEEE Transactions on Circuits and Systems for Video Technology 29 (1), pp. 226–237. Cited by: §1.
  • K. Zhang, W. Chao, F. Sha, and K. Grauman (2016) Video summarization with long short-term memory. In European Conference on Computer Vision, pp. 766–782. Cited by: §1.
  • B. Zhao, X. Li, and X. Lu (2018) Hsa-rnn: hierarchical structure-adaptive rnn for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7405–7414. Cited by: §1.
  • B. Zhao and E. P. Xing (2014) Quasi real-time summarization for consumer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2513–2520. Cited by: §1, §2.1.
  • K. Zhou, Y. Qiao, and T. Xiang (2018) Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1, §4.1.