Multi-View Egocentric Video Summarization

by   Mohamed Elfeki, et al.
University of Central Florida

With vast amounts of video content being uploaded to the Internet every minute, video summarization becomes critical for efficient browsing, searching, and indexing of visual content. Nonetheless, the spread of social and egocentric cameras tends to create an abundance of sparse scenarios captured by several devices, and ultimately required to be jointly summarized. In this paper, we propose the problem of summarizing videos recorded simultaneously by several egocentric cameras that intermittently share the field of view. We present a supervised-learning framework that (a) identifies a diverse set of important events among dynamically moving cameras that often are not capturing the same scene, and (b) selects the most representative view(s) at each event to be included in the universal summary. A key contribution of our work is collecting a new multi-view egocentric dataset, Multi-Ego, due to the lack of an applicable and relevant alternative. Our dataset consists of 41 sequences, each recorded simultaneously by 3 cameras and covering a wide variety of real-life scenarios. The footage is annotated comprehensively by multiple individuals under various summarization settings: (a) single view, (b) two view, and (c) three view, with a consensus analysis ensuring a reliable ground truth. We conduct extensive experiments on the compiled dataset to show the effectiveness of our approach over several state-of-the-art baselines. We also show that it can learn from data of varied number-of-views, deeming it a scalable and a generic summarization approach. Our dataset and materials are publicly available.



1 Introduction

In a world where nearly everyone has several mobile cameras ranging from smart-phones to body-cameras [23, 34], brevity is no longer an accessory. Rather, it is essential to efficiently extract important and relevant content from this immense array of static and moving cameras. The task of video summarization aims at selecting a set of frames or segments from a visual sequence, such that it contains the most important and representative events across the entire sequence. Not only is summarization useful for efficiently extracting the essence of data, but it also serves many other applications such as video indexing [12], video retrieval [38], and anomaly detection.


Figure 1: Several views are recorded simultaneously and intermittently overlap their fields-of-view. Our supervised approach dynamically accounts for inter- and intra-view dependencies, providing a comprehensive summary of all views.

Problem and challenges: In this work, we address a generic setting where multiple users have egocentric cameras that record simultaneous footage. Since users are allowed to move freely in an uncontrolled environment, the cameras’ fields-of-view may or may not overlap throughout the sequence. A summary extracted for such a setting should capture a diverse set of important events across all the footage (i.e., all the different viewpoints). Nevertheless, whenever an event is captured by more than one camera, the summary should include only the most representative view and dismiss the rest.

This setting presents itself in several real-life scenarios where many egocentric videos need to be summarized. For instance, rising claims of police misconduct led to a proliferation of police body cameras [33, 2]. Typical police patrols contain multiple officers working 10-12 hour shifts. Although it is crucial to thoroughly inspect key details, manually going through 10 hours of video content is extremely challenging and prone to human error. Multiplying shift lengths by the number of officers on duty, it is obvious that there are copious amounts of data to analyze with no guiding index. A similar example occurs at sports and social events such as concerts, live shows, and live games. Those events tend to be recorded simultaneously by several cameras that are dynamically changing their fields-of-view. Nevertheless, the final highlight summary of such events is likely to contain frames from all the cameras.

Despite considerable progress in single-view video summarization for both egocentric and fixed cameras (e.g., [43, 29, 7, 21]), these techniques are not readily applicable to summarizing multi-view videos. Single-view summarizers ignore the temporal order by processing the simultaneous views sequentially so that they fit a single-view input. This results in redundant and repetitive summaries that do not reflect the multi-view nature of the footage. On the other end of the spectrum, the multi-view video summarization literature mainly focuses on surveillance cameras (e.g., [27, 28, 26]). This enables some methods to rely on the geometric alignment of cameras, inferring the relationship between their fields-of-view and using it to produce a representative summary (e.g., [1, 6]). Thus, previous work mostly uses unsupervised and semi-supervised methods built on heuristic objective functions, which break down whenever the cameras’ geometric positioning changes dynamically. Furthermore, egocentric videos display rapid changes in illumination, unpredictable camera motion, unusual composition and viewpoints, and often complex hand-object manipulations, all of which make egocentric videos harder to summarize.

Our contributions: To address the shortcomings of the discussed settings, we propose a generic environment where multiple egocentric cameras are simultaneously recording and their fields-of-view are intermittently overlapping. Summarizing such videos requires accounting for both intra- and inter-view dependencies. We therefore present what is, to the best of our knowledge, the first supervised-learning approach that performs multi-view summarization. This enables learning importance and diversity factors from human-annotated summaries, even with the rapid changes induced by egocentric vision. Specifically, we formulate a novel adaptation of the widely used Determinantal Point Process (DPP) [43, 21, 7, 30] to accommodate our setting, which we call Multi-DPP. Since no existing dataset is readily applicable for evaluating such a setting, we collect and annotate a new dataset, Multi-Ego. Using an extensive experimental protocol, we show that our method outperforms state-of-the-art frameworks in unsupervised multi-view summarization, semi-supervised multi-view summarization, and supervised single-view summarization.

2 Related Work

Single-View Video Summarization

Among the many approaches proposed for summarizing single-view videos, supervised approaches usually stand out with the best performance. In such a setting, the purpose is to simulate the patterns that people exhibit when performing the summarization task, by using human-annotated summaries. Two factors influence the performance of supervised models: (a) the reliability of annotations, and (b) the framework’s modeling capability. The reliability of annotations is evaluated through a consensus analysis, as in several benchmark datasets [19, 31, 17]. As for the modeling capabilities, supervised approaches vary in their modeling complexity and effectiveness [7, 10, 42, 9, 41, 4].

Recurrent Neural Networks in general, and Long Short-Term Memory (LSTM) [11] in particular, have been widely used in video processing to obtain the temporal features in videos [36, 25, 45, 18]. In recent years, using LSTMs has been a common practice for solving the video summarization problem [13, 32, 39, 44, 40]. For instance, Zhang et al. [43] use a mixture of Bi-directional LSTMs (Bi-LSTM) and a Multi-Layer Perceptron to summarize single-view videos in a supervised manner. They maximize the likelihood of the Determinantal Point Process (DPP) measure [16, 8, 37] to enforce diversity within the selected summary. Also, Mahasseni et al. [21] present a framework that adversarially trains LSTM networks, where the discriminator is used to learn a discrete similarity measure for training the recurrent encoder/decoder and the frame selector LSTMs.

Multi-view Video Summarization

Most multi-view summarization methods tend to rely on feature selection in an unsupervised optimization paradigm [24, 26, 28, 27]. Fu et al. [6] introduce the problem of multi-view video summarization tailored for fixed surveillance cameras. They construct a spatiotemporal graph and formulate the problem as a graph-labeling task. Similarly, the authors of [27, 26] assume that cameras in a surveillance camera network have a considerable overlap in their fields-of-view. Therefore, they apply well-crafted objective functions that learn an embedding space and jointly optimize for a succinct representative summary. Since those approaches target fixed surveillance cameras, they rightfully assume a significant correlation among the frames along the same view over time. In our generalized setting, cameras move dynamically and their fields-of-view change rapidly, which weakens the aforementioned assumption and makes the problem harder to solve.

Arev et al. [1] introduce a problem similar to ours, which entails editing footage recorded from social cameras. They propose a graph-based approach that provides an automatically generated cut of a specific length out of the videos from all users. Additionally, they obtain a universal knowledge of the event by reconstructing its 3D structure from motion. While their technique may work in certain scenarios, constructing the 3D structure is generally unattainable when the cameras are moving dynamically and contain considerable egocentric noise.

3 Multi-Ego: A new multi-view egocentric summarization dataset

Figure 2: Visualizing annotations provided by human subjects on one of the sequences across the three views (Y-axis). This shows a major consensus between subjects’ annotations in every view.

While a number of multi-view datasets exist (e.g., [6, 24]), none of them is recorded from an egocentric perspective. Therefore, we decided to collect our own data that align with the established problem setting. We asked three users to collect a total of 12 hours of video using GoPro cameras with a frame rate of 30 fps while performing different real-life activities. During the data collection, we covered various uncontrolled environments and activities. We also made sure to include different levels of interaction among the individuals: (a) two views interacting while the third one is independent, (b) all views interacting with each other, and (c) all views independent of each other.

Then, we extracted 41 different sequences that vary in length from three to seven minutes. Each sequence contains three views covering a variety of indoors and outdoors activities. To make the data more accessible for training and evaluation, we grouped the sequences into 6 different collections: (1) Car-ride, (2) College-Tour, (3) Supermarket, (4) Sea-world, (5) Indoors-Outdoors, and (6) Library. More details about the data-collection process, contents of the sequences, and sample frames are provided in supplementary materials.

3.1 Collecting User Annotations

To annotate and process the data for the summarization task, we sub-sample the videos uniformly to one fps following  [30]. This resulted in a reasonable number of frames in each sequence: 180 to 360 frames for each of the three views. Further, each shot is constructed to contain three consecutive frames given to the annotators. The number of frames per shot was chosen empirically to maintain a consistent activity within one shot.

Figure 3: Percentage of frames selected by at least 1, 2, 3, 4, and 5 subjects in the annotations. In every collection, at least 3 annotators agree on the frames that represent the summary.

Then, we asked 5 human annotators (four undergraduate students and one high school student) to perform a three-stage annotation task. In stage one, they were asked to choose the most interesting and informative shots that represent each view independently, without any consideration of the other views. To construct two-view summaries in stage two, we displayed only the first two views simultaneously, while asking the users to select the shots from either of the two views that best represent both cameras. Similarly, in stage three the users were asked to select shots from any of the three views that best represent all the cameras. It is worth noting that the annotators were not limited to choosing only one view of a certain shot; they could choose as many as they deemed important.

The annotating-in-stages procedure explained above was employed because of a human’s limited capability in keeping track of the unfolding storylines along multiple views simultaneously. Consequently, using this technique resulted in a significant improvement in the consensus between user summaries compared to when we initially collected summaries in an unordered annotation task. Figure  2 shows an example of the major consensus between the users in stage three after following the multi-stage process. Please refer to supplementary materials for further details about the annotation process and a behavioral analysis on the obtained annotations.

3.2 Analyzing User Annotations

To ensure the reliability and consistency of the obtained annotations, we perform a consensus analysis using two metrics: average pairwise F1-measure and selection ratio. Following [31, 30, 29], we compute the average pairwise F1-measure to estimate the frame-level overlap and agreement. We calculated the F1-measure for all possible pairs of users’ annotations and averaged the results across all the pairs, obtaining an average of 0.803, 0.762, and 0.834 for the first, second, and third stage, respectively.
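The pairwise agreement computation described above can be sketched as follows. This is a minimal illustration, assuming each annotation is represented as a set of selected frame indices; the function names and the toy data are hypothetical and not taken from the dataset.

```python
from itertools import combinations


def f1_between(a, b):
    """Frame-level F1 between two selections given as sets of frame indices."""
    a, b = set(a), set(b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def average_pairwise_f1(annotations):
    """Mean F1 over all unordered pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(f1_between(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy annotations: selected frame indices for three annotators.
toy_annotations = [{1, 2, 3, 8}, {2, 3, 8, 9}, {1, 2, 3, 9}]
score = average_pairwise_f1(toy_annotations)
```

Because F1 is symmetric in its two arguments, each unordered pair only needs to be scored once.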

For further annotation quality assessment, we used the selection ratio metric. According to [31, 7, 30], the usual summary length should be 5-15% of the total length of the sequence. Any frame that is a part of the final summary should be selected by at least three out of the five annotators. Figure 3 shows the ratio of the frames (with respect to sequence length) that have been chosen by at least 1, 2, 3, 4, and 5 subjects, respectively for each collection in stage three. For all the collections, the ratio of the frames chosen by at least three users is within the 5-15% range.
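The selection ratio can be illustrated with a short sketch. It assumes, as a simplification, that each annotator's summary is a set of frame indices and that the ratio is taken with respect to the number of frames in the sequence; `selection_ratios` and the toy data are hypothetical.

```python
from collections import Counter


def selection_ratios(annotations, num_frames):
    """Fraction of frames chosen by at least k annotators, for k = 1..len(annotations)."""
    votes = Counter(frame for selected in annotations for frame in set(selected))
    return {
        k: sum(1 for count in votes.values() if count >= k) / num_frames
        for k in range(1, len(annotations) + 1)
    }


# Hypothetical toy example: five annotators on a 20-frame sequence.
toy_annotations = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {3, 9}, {3}]
ratios = selection_ratios(toy_annotations, num_frames=20)
```

Under this reading, the check reported in the text corresponds to verifying that `ratios[3]` falls between 0.05 and 0.15 for every collection.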

3.3 Creating Oracle Summaries

Finally, training a supervised method usually requires a single set of labels. That means in our case, we need to use only one summary per video, which is often referred to as the Oracle Summary. To create an oracle summary from multiple human-created summaries, we follow [7] and use the algorithm proposed in [15]. This algorithm greedily chooses the shot that results in the largest marginal gain on the F-score, and iteratively repeats the greedy selection until the length of the summary reaches 15% of the single-view length.
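A greedy selection of this kind can be sketched as below. This is an illustrative simplification rather than the exact algorithm of [15]: it assumes shots are tuples of frame indices, user summaries are sets of frame indices, and the score being maximized is the mean frame-level F1 against all users. The names `mean_f1` and `greedy_oracle` are hypothetical.

```python
def mean_f1(summary, user_summaries):
    """Mean frame-level F1 of `summary` against every user summary."""
    total = 0.0
    for user in user_summaries:
        overlap = len(summary & user)
        if overlap:
            total += 2 * overlap / (len(summary) + len(user))
    return total / len(user_summaries)


def greedy_oracle(shots, user_summaries, budget):
    """Repeatedly add the shot with the best resulting score until the frame budget is reached."""
    summary, remaining = set(), list(shots)
    while remaining:
        best = max(remaining, key=lambda s: mean_f1(summary | set(s), user_summaries))
        if len(summary) + len(best) > budget:
            break
        summary |= set(best)
        remaining.remove(best)
    return summary


# Hypothetical toy data: four 3-frame shots, three annotators, budget of 6 frames.
shots = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
users = [{0, 1, 2, 6, 7, 8}, {0, 1, 2, 3, 4, 5}, {0, 1, 2, 6, 7, 8}]
oracle = greedy_oracle(shots, users, budget=6)
```

In this toy case the first shot is picked because all three users selected it, and the third shot follows because two of the three users agree on it.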

4 Approach

We first discuss the standard DPP formulation in Section 4.1. Then, we illustrate how we adapted the formulation to the multi-view setting in Section 4.2. In Section 4.3, we elaborate on the details of our approach. Finally, we discuss our system’s scalability in Section 4.4.

4.1 Determinantal Point Process (DPP)

DPP is a probabilistic measure that provides a tractable and efficient means to capture negative correlation with respect to a similarity measure [20, 16]. Formally, a discrete point process $\mathcal{P}$ on a ground set $\mathcal{Y}$ is a probability measure on the power set $2^{\mathcal{Y}}$, where $N = |\mathcal{Y}|$ is the size of the ground set. A point process $\mathcal{P}$ is called determinantal if, for every subset $\mathbf{y} \subseteq \mathcal{Y}$: $\mathcal{P}(\mathbf{Y} = \mathbf{y}) \propto \det(L_{\mathbf{y}})$. $\mathbf{Y}$ is the selection random variable that is sampled according to $\mathcal{P}$, and $L$ is a symmetric positive semi-definite matrix that represents the kernel.

Kulesza et al. [14] proposed modeling the kernel $L$ as a Gram matrix in the following manner:

$$\mathcal{P}(\mathbf{Y} = \mathbf{y}) \propto \det(L_{\mathbf{y}}), \qquad L_{ij} = q_i \, \phi_i^{\top} \phi_j \, q_j \qquad (1)$$

When optimizing the DPP kernel, this decomposition learns a “quality score” $q_i$ of each item $i$, where $q_i \geq 0$. It also allows learning a feature vector $\phi_i$ for each item of the subset $\mathbf{y}$. In this case, the dot product $\phi_i^{\top} \phi_j$, where $i, j \in \mathbf{y}$, is evaluated as a “pair-wise similarity measure” between the features of item $i$ and the features of item $j$. Thus, the DPP kernel can be used to quantify the diversity within any subset selected from a ground set $\mathcal{Y}$. Choosing a diverse subset is equivalent to choosing a brief representative subset, since the redundancy is minimized. Hence, it is only natural that a considerable number of document and video summarization approaches use this measure to extract representative summaries of documents and videos [15, 43, 21, 7, 37].
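To make the quality/similarity decomposition concrete, the following sketch builds a small kernel and compares the unnormalized DPP scores of a redundant and a diverse pair of items. It assumes NumPy is available and uses the standard Gram-matrix form of the kernel with unit-normalized features; the helper names and the toy values are hypothetical.

```python
import numpy as np


def dpp_kernel(quality, features):
    """Kernel with entries q_i * (phi_i . phi_j) * q_j, using unit-normalized feature rows."""
    phi = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = phi @ phi.T
    return quality[:, None] * similarity * quality[None, :]


def subset_score(kernel, subset):
    """Unnormalized DPP score: determinant of the kernel restricted to the subset."""
    idx = np.asarray(subset)
    return np.linalg.det(kernel[np.ix_(idx, idx)])


# Hypothetical toy items: items 0 and 1 are nearly identical, item 2 is orthogonal to item 0.
quality = np.array([1.0, 1.0, 1.0])
features = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
kernel = dpp_kernel(quality, features)
redundant = subset_score(kernel, [0, 1])
diverse = subset_score(kernel, [0, 2])
```

With equal qualities, the determinant of the near-duplicate pair is close to zero while the orthogonal pair reaches the maximum, which is the sense in which the measure rewards diversity.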

4.2 Adapting DPP to Multi-view: Multi-DPP

The standard DPP process described above is suitable for selecting a diverse subset from a single ground set. However, when presented with several ground sets $\mathcal{Y}_1, \dots, \mathcal{Y}_M$, the standard process can only be applied in one of two settings: either (a) merging all the ground sets into a single ground set and selecting a diverse subset out of the merged ground set, or (b) selecting a diverse subset from each ground set and then merging all the selected subsets.

Even though the former setting preserves the information of all elements of the ground sets, it causes the complexity of the subset selection to grow exponentially. In practice, this leads to an accumulation of error due to numerical overflow and underflow, as well as slower computation. The latter setting, in turn, assumes that the features of the different ground sets do not intersect. This is essentially inapplicable if the ground sets have a significant dynamic feature overlap, leading to redundancy and compromising the very purpose of the DPP. To address those shortcomings, we propose a new adaptation of the discussed DPP decomposition, called Multi-DPP.

In Multi-DPP, ground sets are processed in parallel, so that any potential feature overlap across the ground sets is handled in a temporally appropriate manner while the complexity grows linearly with the number of views. For every element in the ground sets, we need to represent two joint quantities, features and quality, that satisfy the following four requirements. First, we need a model that can operate on any number of ground sets (i.e., generic to any number of ground sets $M$). Second, we need a joint representation of the features at each index, such that it only selects the most effective ones (i.e., invariance to noise and non-important features). Third, we need a joint representation of the qualities at each index, such that it is affected by the quality of each ground set at that particular index (i.e., variance to the quality of each ground set). Fourth, we need to ensure that our adaptation follows the DPP decomposition in Eq. 1, by selecting joint features $\phi_i$ and joint qualities $q_i$.

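To account for joint features, we apply max-pooling to select the most effective features across all ground sets at every index, which also satisfies the feature decomposition in Eq. 1. Selecting joint qualities, on the other hand, needs to account for the quality of each ground set at every index. We choose to use the product of all the qualities at each index. This makes the joint quality at each index dependent on all ground sets while also ensuring $q_i \geq 0$. Therefore, we generalize the Determinantal Point Process based on the decomposition in Eq. 1 as follows:

$$\mathcal{P}(\mathbf{y}_1 \subseteq \mathcal{Y}_1, \dots, \mathbf{y}_M \subseteq \mathcal{Y}_M) \propto \det(L_{\mathbf{y}}), \qquad L_{ij} = q_i \, \phi_i^{\top} \phi_j \, q_j, \qquad \phi_i = \max_{m = 1, \dots, M} \phi_i^{m}, \qquad q_i = \prod_{m=1}^{M} q_i^{m} \qquad (2)$$

where $M$ is the number of ground sets, $\mathbf{y}_m$ is the subset selected from ground set $\mathcal{Y}_m$, $\mathbf{y}$ is the set of selected indices, and the maximum is taken element-wise.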

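Then, we follow [16] to formulate an optimization of a supervised learning algorithm. We apply Maximum Likelihood Estimation of the Multi-DPP measure as follows:

$$\theta^{*} = \arg\max_{\theta} \sum_{k} \log \mathcal{P}\left(\mathbf{Y}^{(k)} = \mathbf{y}^{(k)*}; \theta\right) \qquad (3)$$

where $\theta$ is the set of parameters of our model, $\mathbf{y}^{(k)*}$ is the target subset (i.e., ground truth) of example $k$, and $k$ indexes the set of training examples.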
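The joint representation and its likelihood can be sketched numerically as follows. This is a minimal illustration under stated assumptions: features are max-pooled and qualities multiplied across views as described in the text, and the normalizer is the standard DPP constant det(L + I), which is an assumption here because the text does not spell it out. The function name and the toy values are hypothetical.

```python
import numpy as np


def multi_dpp_log_likelihood(features, qualities, target):
    """features: (M, N, D), qualities: (M, N), target: indices of ground-truth time-steps."""
    joint_phi = features.max(axis=0)  # max-pool across views -> (N, D)
    joint_phi = joint_phi / np.linalg.norm(joint_phi, axis=1, keepdims=True)
    joint_q = qualities.prod(axis=0)  # product of qualities across views -> (N,)
    kernel = joint_q[:, None] * (joint_phi @ joint_phi.T) * joint_q[None, :]
    idx = np.asarray(target)
    _, logdet_subset = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
    # Assumed normalizer: the standard DPP constant det(L + I).
    _, logdet_norm = np.linalg.slogdet(kernel + np.eye(len(kernel)))
    return logdet_subset - logdet_norm


# Hypothetical toy input: M = 2 views, N = 3 time-steps, D = 2 feature dimensions.
features = np.array([
    [[1.0, 0.1], [0.2, 0.9], [0.5, 0.5]],
    [[0.8, 0.3], [0.1, 1.0], [0.9, 0.2]],
])
qualities = np.array([[0.9, 0.8, 0.7], [0.6, 0.9, 0.5]])
log_likelihood = multi_dpp_log_likelihood(features, qualities, target=[0, 2])
swapped = multi_dpp_log_likelihood(features[::-1], qualities[::-1], target=[0, 2])
```

Maximizing this quantity summed over training sequences corresponds to the maximum likelihood objective; swapping the two views leaves the value unchanged because both pooling operations ignore view order.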

4.3 Our Framework

As shown in Figure 4, the input to our system is $M$ temporally aligned views, each containing $N$ frames. We begin by extracting spatial features of each frame in each view using a pre-trained CNN. Then, we input the spatial features to a Bidirectional-LSTM layer which extracts temporal features from each view. We aggregate both the spatial and temporal features, representing each frame with a comprehensive spatiotemporal feature at each view. We note that extracting the spatiotemporal features in this manner is a common practice, as in [43, 21, 3]. We choose to share the weights of the Bi-LSTM layer across the views for two reasons: (a) it allows the system to operate on any number of views without increasing the number of trainable parameters, which alleviates overfitting, and (b) the process of learning temporal features is independent of the view, thus it should utilize data from all views to produce better temporal modeling.

Figure 4: Our Framework: A multi-stream Bi-LSTM extracts spatio-temporal features across all the views. Then Multi-DPP is applied to increase diversity within the selected time-steps. To choose the representative view(s) at each time-step, we apply cross entropy loss.

We break down our objective into two tasks: selecting diverse events and identifying the view(s) contributing to illustrating each selected event in the summary. In the first task, to select diverse events, we construct a feature set that accounts for all the views at each time-step. We do so by max-pooling the spatiotemporal features from all the views, resulting in the most prominent feature at each index of the feature vector. We follow the max-pooling by a two-layer Multi-Layer Perceptron (MLP) that applies non-linear activation on the joint features that are represented as in Eq.  2.

For the second task, to identify the most representative view(s) at each event, we use a two-layer MLP that classifies each view at each time-step. Formulating this task as a classification problem serves three purposes. First, it selects the views that are included in the summary, which is an intrinsic part of the solution. Second, it regularizes the process of learning the importance of each event by not selecting any view when the time-step is non-important. Finally, the classification confidence of view $m$ can be used to represent the quality ($q_i^{m}$) at time-step $i$. This is later used to compute the Multi-DPP measure that determines which time-steps are selected. In the case of non-overlapping views, the framework may need to select multiple views at the same time-step. For that reason, we apply binary classification, which allows classifying each view independently from the rest.

Similar to the weights of the Bi-LSTM, the view classifier MLP weights are also shared across the views for two reasons. First, it uses the same number of trainable parameters for data with any number of views, resulting in fewer trainable parameters, which helps control overfitting to the training data. Second, it establishes a view-dependent classification. That is, at any time-step, choosing a representative view among all the views is affected by the relative quality of all the views, rather than each one independently. During training, we start by estimating the quality $q_i^{m}$ of each view $m$ at each time-step $i$, which serves as the view selection and is used later to compute the Multi-DPP measure. Then, we optimize the view(s) selection procedure by using the binary cross-entropy objective: $\mathcal{L}_{CE} = -\sum_{m=1}^{M} \sum_{i=1}^{N} \left[ y_i^{m} \log \hat{y}_i^{m} + (1 - y_i^{m}) \log (1 - \hat{y}_i^{m}) \right]$; where $y_i^{m}$ and $\hat{y}_i^{m}$ are the ground truth and the model’s prediction for time-step $i$ in view $m$. To evaluate the Multi-DPP measure, we compute the joint features as in Eq. 2. We jointly optimize the framework by minimizing the sum of both losses, using the Oracle summary as the ground truth.
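The per-view binary classification can be illustrated with a small numerical sketch. It assumes a sigmoid output per view and time-step and a cross-entropy summed over both axes; the summation (rather than averaging), the 0.5 decision threshold, and all names and values are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def view_selection_loss(logits, targets, eps=1e-7):
    """Binary cross-entropy summed over views (axis 0) and time-steps (axis 1)."""
    probs = np.clip(sigmoid(logits), eps, 1.0 - eps)
    loss = -np.sum(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return loss, probs


# Hypothetical classifier outputs for M = 2 views and N = 2 time-steps.
logits = np.array([[2.0, -2.0], [2.0, -3.0]])
targets = np.array([[1.0, 0.0], [1.0, 0.0]])
loss, quality = view_selection_loss(logits, targets)
selected = quality > 0.5  # each view is decided independently of the others
wrong_loss, _ = view_selection_loss(logits, 1.0 - targets)
```

Because each view has its own sigmoid rather than competing in a softmax, both views can be selected at the first time-step and neither at the second, and the resulting confidences can double as per-view quality scores.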

4.4 Multi-view Scalability

A scalable multi-view video summarization algorithm must be invariant to view order and number-of-views. Invariance to view order implies producing the same summary for input views $(v_1, v_2, \dots, v_M)$ as for $(v_{\pi(1)}, v_{\pi(2)}, \dots, v_{\pi(M)})$, for all possible permutations $\pi$ of $\{1, \dots, M\}$. Our approach intrinsically satisfies this first invariance requirement by constructing the joint features using max-pooling. Thus, it is only influenced by the most effective features, regardless of their order.

The second condition requires the ability to train on datasets with varying numbers-of-views and test on a dataset with any number-of-views. Satisfying this condition requires the number of trainable parameters to be invariant to the number-of-views of the input. This way, the same set of parameters can be used to train/test on data with any number-of-views. We followed two techniques to ensure a fixed number of trainable parameters: (a) max-pooling view-specific features, and (b) weight-sharing for the Bi-LSTM and view selection layers. Applying max-pooling on view-specific features produces a fixed-size joint feature vector that is invariant to the number of views in the input. Additionally, choosing the prominent features across the views entails learning the intra-view dependencies. Finally, weight sharing across the Bi-LSTM view-streams and view selection layers ensures our framework has a single set of trainable parameters for each of those layers regardless of the number of views. Consequently, our model is supervised to learn inter-view dependencies, which are, along with intra-view dependencies, crucial for a representative and brief multi-view summarization.
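Both invariance properties can be checked on random data with a few lines. The sketch assumes NumPy and represents the views as the leading axis of the feature and quality arrays; `joint_representation` and the array sizes are hypothetical.

```python
import numpy as np


def joint_representation(features, qualities):
    """Max-pool features and multiply qualities across the view axis (axis 0)."""
    return features.max(axis=0), qualities.prod(axis=0)


rng = np.random.default_rng(0)
features = rng.random((3, 5, 4))  # hypothetical: 3 views, 5 time-steps, 4-dim features
qualities = rng.random((3, 5))

# Invariance to view order: permuting the views leaves the joint quantities unchanged.
perm = [2, 0, 1]
phi_all, q_all = joint_representation(features, qualities)
phi_perm, q_perm = joint_representation(features[perm], qualities[perm])

# Invariance to number-of-views: two views give a joint feature of the same shape.
phi_two, q_two = joint_representation(features[:2], qualities[:2])
```

Since the joint feature has the same shape for two or three views, any layer that consumes it needs no view-specific parameters.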

5 Experiments and Results

5.1 Baseline Methods

Since we propose the first supervised multi-view summarization approach, we compare our method to a random sampling baseline, unsupervised and semi-supervised multi-view summarization, and fully supervised single-view summarization:

Random summarization: Sampling uniform frames across all the views such that the summary constitutes 15% of the single view’s length.

Multi-view summarization

  • Unsupervised feature selection  [22]: Optimizing the feature space of all the views to select the best relevant subset of features with respect to norms.

  • Unsupervised joint embedding [26]: Projecting all views’ features to a latent embedding. Using a sparse representative selection, it jointly optimizes learning the embedding space as well as the optimal features subset.

  • Semi-supervised sub-modular mixture of objectives [10]: Learning an objective for each view representing the importance of global characteristics of a summary. Then jointly optimizing a universal loss for the objectives from all the views.

Supervised Single-View Summarization [43]: Extracting a single-view summary using Bi-LSTM and MLPs while optimizing the standard DPP measure on the extracted features. The DPP loss increases the diversity within the selected summary, which is equivalent to selecting a representative summary. To apply the single-view configuration to multi-view videos, we examine two settings:

  • Merge-Views: Aggregating the views and summarizing the aggregate footage using the single-view summarizer. The summary should be consistent if the views are independent.

  • Merge-Summaries: Summarizing each view independently and then aggregating the summaries. Complementary to the former setting, this should result in a consistent summary if the summaries are independent.

| Method | Two-View Precision | Two-View Recall | Two-View F1-Score | Three-View Precision | Three-View Recall | Three-View F1-Score |
|---|---|---|---|---|---|---|
| Random baseline: uniform sampling | 9.83 | 10.65 | 9.85 | 5.83 | 5.16 | 5.77 |
| Feature selection [22] | 17.83 | 19.15 | 17.46 | 12.33 | 16.28 | 10.70 |
| Joint embedding [26] | 18.37 | 25.20 | 20.66 | 13.88 | 24.85 | 17.17 |
| Sub-modular mixture of objectives [10] | 19.91 | 25.21 | 22.71 | 18.49 | 22.71 | 20.19 |
| Merge-Views [43] | 27.87 | 28.57 | 27.67 | 23.25 | 23.87 | 22.95 |
| Merge-Summaries [43] | 26.61 | 27.25 | 26.43 | 22.86 | 23.59 | 22.76 |
| Ours: Cross-Entropy (CE) | 27.33 | 27.83 | 27.13 | 21.33 | 22.03 | 21.10 |
| Ours: Multi-DPP + CE | 28.58 | 29.05 | 28.30 | 25.06 | 25.79 | 25.03 |

Table 1: Performance evaluation (%) for the two-view and three-view settings. Our full model (Multi-DPP + CE) consistently outperforms the baselines on all the measures. We also run an ablation study to show the effect of optimizing the Multi-DPP measure as compared to only using the Cross-Entropy loss.

5.2 Experimental Setup

We use GoogLeNet  [35] features for all the methods as an input. For a fair comparison, we train all supervised baselines  [10, 43] and Ours with the same experimental setup: iterations number, batch size, and optimization.

The supervised frameworks are trained for twenty iterations with a batch size of 10 sequences. Adam optimizer is used to optimize the losses with a learning rate of 0.001. After each iteration, we calculate the mean validation loss and only evaluate the model with the best validation loss across all iterations. We discuss further details of the architecture and training in the supplementary materials.

As discussed in Section 3, we categorize our dataset sequences into six collections to facilitate training and evaluation. In our experiments, we follow a round-robin approach to train-validate-test the supervised/semi-supervised learning frameworks. We use four collections for training, one for validation, and one for testing, across all 30 different combinations of collections. For the unsupervised approaches (Random, [26], and [22]), since no training is required, we only test the methods on each collection separately.
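The count of 30 combinations follows from choosing an ordered (validation, test) pair out of six collections, i.e. 6 × 5 = 30. A short sketch of the enumeration, with a hypothetical helper name:

```python
from itertools import permutations

COLLECTION_NAMES = [
    "Car-ride", "College-Tour", "Supermarket",
    "Sea-world", "Indoors-Outdoors", "Library",
]


def round_robin_splits(names):
    """Yield (train, validation, test) for every ordered choice of validation and test collection."""
    for validation, test in permutations(names, 2):
        train = [name for name in names if name not in (validation, test)]
        yield train, validation, test


splits = list(round_robin_splits(COLLECTION_NAMES))
```

Each split leaves exactly four collections for training, and every collection is used as the test set in five of the splits.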

To evaluate the summaries produced by all the methods, we follow the protocols in  [21, 43, 13, 31] to compare the predictions against the oracle summary. We start by temporally segmenting all the views using the KTS algorithm  [29] to non-overlapping intervals. Then, we repetitively extract key-shot based summaries using MAP  [42] while setting the threshold of summary length to be 15% of a single view’s length. For each of the selected shots, we consider all of its frames to be included in the summary.

5.3 Performance Evaluation

Similar to [28, 26, 43, 21, 6], we use three metrics to evaluate our performance: F1-score, precision, and recall. These metrics evaluate the quality of the produced summaries by comparing frame-level correspondences between the predicted summary and the ground-truth summary. Table 1 shows the mean precision, recall, and F1-score across all the combinations of training-validation-testing for both the two-view setting and the three-view setting (i.e., stages two and three of the annotations).

The unsupervised methods [26, 22] obtain the lowest performance due to the lack of supervision, indicating an inability to adapt to the visual changes caused by egocentric motion. Semi-supervision in [10] slightly improves the performance; however, it is still not capable of completely learning representative summaries. Finally, the full supervision in [43] reasonably adapts to learn the noisy patterns of egocentric motion. However, the Merge-Summaries setting ignores the temporal ordering of the views by processing the summaries sequentially. Also, applying the Merge-Views setting causes the complexity of training to grow exponentially with the number of views, resulting in an accumulated error within the learning process (e.g., Merge-Views performs relatively worse on the three-view setting than on the two-view setting). Unlike the rest of the baselines, our approach processes data in parallel, preserving the temporal order of the views and keeping a linear growth with the number-of-views. It also adapts well to the noisy egocentric motion by learning in a fully supervised manner. Thus, our method consistently outperforms all other methods across all evaluation metrics, indicating the best match with the ground-truth summary.

As an ablation study, we evaluate our approach when optimizing only the cross-entropy loss, and compare it with our full model to study the impact of enforcing diversity when summarizing multi-view videos. Ours: Cross-Entropy (CE) in Table 1 corresponds to training our model by only selecting representative views, without explicitly enforcing diversity. It is worth noting that we cannot train our model with only the Multi-DPP measure, because we would not have a criterion for view selection. Evidently, learning our model with the Multi-DPP measure in addition to the CE loss improves the results, especially in the three-view setting, where there is more input footage to diversify. In general, the performance in the two-view setting is higher than that in the three-view setting. This is due to the increase in problem complexity when more views have to be summarized. However, we note that the ranking of all the methods remains nearly the same for both settings.

Additionally, we address a shortcoming of the common evaluation metrics that presents itself in our setting. Consider the case of two or more views having nearly identical visual content at the same time-step, which happens due to the dynamic overlapping of fields-of-view. When annotating the sequences, the user will only include one of the views in the ground-truth summary at important events. However, if the prediction model selects any of the other views, it should not be penalized, since the views are visually similar. To address this case, we evaluate the F1-score at several similarity thresholds. That is, if the Euclidean distance between the normalized CNN features of two views at the same time-step is less than a threshold (0%, 10%, 20%, 30%), we do not penalize the prediction model for selecting one of these views instead of the other. We recompute the F1-scores for all the models at the different threshold values. As shown in Fig. 5, our method continues to obtain the highest F1-score at all the threshold levels.
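One way to realize this relaxed evaluation is sketched below. Several details are assumptions made for illustration only: the function name `tolerant_f1`, the array layout (views × frames for the masks, views × frames × feature dimension for the CNN features), the interpretation of the threshold as an absolute Euclidean distance between L2-normalized features, and the choice to credit recall symmetrically (a ground-truth frame counts as recovered if a similar view was predicted at the same time-step).

```python
import numpy as np

def tolerant_f1(pred, gt, feats, thr):
    """F1-score that does not penalize choosing a visually similar view.

    pred, gt : binary masks of shape (V, T), V views and T time-steps.
    feats    : CNN features of shape (V, T, D).
    thr      : distance below which two views at the same time-step are
               treated as interchangeable (absolute distance; how the
               paper's percentages map to distances is assumed).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    feats = np.asarray(feats, dtype=float)
    n_views = pred.shape[0]

    # L2-normalize, then compute pairwise view distances per time-step.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-12)
    f = np.transpose(f, (1, 0, 2))                       # (T, V, D)
    dist = np.linalg.norm(f[:, :, None, :] - f[:, None, :, :], axis=-1)

    # A view is always interchangeable with itself.
    similar = (dist < thr) | np.eye(n_views, dtype=bool)[None]  # (T, V, V)

    # A predicted (v, t) is correct if some similar view u is in GT at t.
    pred_hit = (similar & gt.T[:, None, :]).any(axis=2).T       # (V, T)
    # A GT (u, t) is recovered if some similar view v was predicted at t.
    gt_hit = (similar & pred.T[:, None, :]).any(axis=2).T       # (V, T)

    precision = (pred & pred_hit).sum() / pred.sum() if pred.sum() else 0.0
    recall = (gt & gt_hit).sum() / gt.sum() if gt.sum() else 0.0
    denom = precision + recall
    return float(2 * precision * recall / denom) if denom else 0.0
```

With `thr=0` the function reduces to the strict frame-level F1-score, since only the exact ground-truth view is credited.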

To ensure a fair comparison with the fully-supervised single-view baseline, we also verify whether the order of the views in Merge-Views affects the results. We trained the model described in [43] using different view orders, but the produced summaries were nearly identical (i.e. a difference in F1-score ). This aligns with the observation made in [7] that the DPP measure is agnostic to the order of items. Therefore, it reaches indistinguishable predictions regardless of the order of the views in the training process.

Figure 5: F1-score computed when prediction models are not penalized for choosing a view that is similar to the ground-truth (GT) view, at various threshold levels.

5.4 Scalability Analysis

In this section, we study our framework’s capability to learn from a varying number-of-views in a sequence by verifying whether the training process can exploit any increase in data regardless of its number-of-views. We start by splitting our data into two categories: (a) three-view (Collections: Indoors-Outdoors, SeaWorld, Supermarket), and (b) two-view (Collections: Car-Ride, College-Tour, Library). This division has nearly the same number of sequences in each category: 20 sequences in the three-view category, and 21 sequences in the two-view category. We investigate the performance of three train/test configurations where the testing data is limited to a single category:

  • Same category training (2two-view & 1two-view): Train the framework on 2 collections from the same category as the testing collection.

  • Different category training (3two-view & 3three-view): Train the framework on 3 collections from one category, and then test it on a collection belonging to a different category.

  • Training using data from the two categories (3two-view + 2two-view & 2two-view + 3two-view): Train the framework on data from different categories, and test it on a collection from one of the categories in the training data.

For each of the scenarios enumerated above, the model is tested on all 3 possible test collections available to us. For example, when evaluating 3two-view, there are 3 collection instances of the three-view category. Therefore, we report the average performance across all of them.
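The three configurations above can be enumerated programmatically. The following sketch uses the collection names given in the text; the function name `scalability_splits`, the dictionary layout, and the configuration labels are hypothetical and only meant to make the protocol concrete.

```python
# Collections per category, as listed in the text.
CATEGORIES = {
    "three-view": ["Indoors-Outdoors", "SeaWorld", "Supermarket"],
    "two-view": ["Car-Ride", "College-Tour", "Library"],
}

def scalability_splits(categories=CATEGORIES):
    """Enumerate train/test splits for the three training configurations.

    For every test collection, this yields:
      - "same":      the 2 remaining collections of the test category,
      - "different": the 3 collections of the other category,
      - "combined":  both of the above together.
    """
    splits = []
    for cat, collections in categories.items():
        # All collections that belong to any other category.
        other = [c for k, v in categories.items() if k != cat for c in v]
        for test in collections:
            same = [c for c in collections if c != test]
            splits.append({"test": test, "config": "same", "train": same})
            splits.append({"test": test, "config": "different", "train": list(other)})
            splits.append({"test": test, "config": "combined", "train": same + other})
    return splits
```

Averaging a metric over the three splits that share a test category and a configuration gives one row of the kind reported in the scalability table.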

Test         Train          Precision   Recall   F1-Score

two-view     2two-view      29.83       29.77    29.67
             3three-view    29.77       30.30    30.2
             2two-view +    34.37       35.03    34.33

three-view   2three-view    18.53       18.80    18.33
             2two-view      18.23       18.27    17.67
             3two-view +    21.53       21.87    21.33
Table 2: Scalability Analysis: Our framework can be trained and tested on data of different number-of-views. It exploits data from various number-of-views to improve the performance on test data.

As shown in Table 2, training our framework on the same category or on a different category obtains comparable results when testing on both the two-view and three-view settings. However, increasing the training data size by combining both categories improves the results by roughly 3 to 5 F1 points in both test settings. This shows that our model can be trained and tested on data with various numbers of views, and that it can take advantage of any increase in data regardless of its number-of-views setting.

6 Conclusion

In this work, we proposed the problem of multi-view video summarization for dynamically moving cameras that often do not share the same field-of-view. Unlike previous work in multi-view video summarization, we presented a supervised approach that utilizes human-annotated labels to generate a comprehensive summary for all the views without any prior assumptions on the videos, which accommodates a generalized summarization setting. It identifies important events across all the views and selects the view(s) that best illustrate each event in the final summary.

We also introduced a new dataset that is recorded in uncontrolled environments and includes a variety of real-life activities. Several human users annotated the footage, and we then ran a consensus analysis on the annotations to ensure a reliable ground truth. When evaluated on the collected benchmark, our approach outperformed all other baselines, including state-of-the-art approaches in single-view video summarization and semi-supervised and unsupervised multi-view summarization.


  • [1] I. Arev, H. S. Park, Y. Sheikh, J. Hodgins, and A. Shamir. Automatic editing of footage from multiple social cameras. ACM Transactions on Graphics (TOG), 33(4):81, 2014.
  • [2] B. Ariel, W. A. Farrar, and A. Sutherland. The effect of police body-worn cameras on use of force and citizens’ complaints against the police: A randomized controlled trial. Journal of quantitative criminology, 31(3):509–535, 2015.
  • [3] C. Chen and Chen. Video to text summary: Joint video summarization and captioning with recurrent neural networks. In BMVC, pages 1–10, 2017.
  • [4] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and M. S. Ryoo. Identifying first-person camera wearers in third-person videos. arXiv preprint arXiv:1704.06340, 2017.
  • [5] Y. Feng, Y. Yuan, and X. Lu. Learning deep event models for crowd anomaly detection. Neurocomputing, 219:548–556, 2017.
  • [6] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. Multi-view video summarization. IEEE Transactions on Multimedia, 12(7):717–729, 2010.
  • [7] B. Gong, W.-L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2069–2077. Curran Associates, Inc., 2014.
  • [8] S. Gupta. 1 determinantal point processes.
  • [9] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In European conference on computer vision, pages 505–520. Springer, 2014.
  • [10] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3090–3098, 2015.
  • [11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [12] R. Hong, L. Li, J. Cai, D. Tao, M. Wang, and Q. Tian. Coherent semantic-visual indexing for large-scale image retrieval in the cloud. IEEE Transactions on Image Processing, 2017.
  • [13] Z. Ji, K. Xiong, Y. Pang, and X. Li. Video summarization with attention-based encoder-decoder networks. arXiv preprint arXiv:1708.09545, 2017.
  • [14] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS, 2010.
  • [15] A. Kulesza and B. Taskar. Learning determinantal point processes. 2011.
  • [16] A. Kulesza, B. Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
  • [17] Y. J. Lee and K. Grauman. Predicting important objects for egocentric video summarization. International Journal of Computer Vision, 114(1):38–55, 2015.
  • [18] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
  • [19] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In Proceedings of the tenth ACM international conference on Multimedia, pages 533–542. ACM, 2002.
  • [20] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
  • [21] B. Mahasseni, M. Lam, and S. Todorovic. Unsupervised video summarization with adversarial lstm networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pages 1–10, 2017.
  • [22] F. Nie, H. Huang, X. Cai, and C. H. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in neural information processing systems, pages 1813–1821, 2010.
  • [23] W. OBILE. Ericsson mobility report, 2016.
  • [24] S.-H. Ou, C.-H. Lee, V. S. Somayazulu, Y.-K. Chen, and S.-Y. Chien. On-line multi-view video summarization for wireless video sensor network. IEEE Journal of Selected Topics in Signal Processing, 9(1):165–179, 2015.
  • [25] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1029–1038, 2016.
  • [26] R. Panda and A. R. Chowdhury. Multi-view surveillance video summarization via joint embedding and sparse optimization. IEEE Transactions on Multimedia, 2017.
  • [27] R. Panda, A. Dasy, and A. K. Roy-Chowdhury. Video summarization in a multi-view camera network. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 2971–2976. IEEE, 2016.
  • [28] R. Panda, N. C. Mithun, and A. Roy-Chowdhury. Diversity-aware multi-video summarization. IEEE Transactions on Image Processing, 2017.
  • [29] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In European conference on computer vision, pages 540–555. Springer, 2014.
  • [30] A. Sharghi, J. S. Laurel, and B. Gong. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. arXiv preprint arXiv:1707.04960, 2017.
  • [31] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5179–5187, 2015.
  • [32] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning, pages 843–852, 2015.
  • [33] J. Stanley. Police body-mounted cameras: With right policies in place, a win for all. New York: ACLU, 2013.
  • [34] A. Swartz. Gopro posts record fourth-quarter sales but stock falls 15 percent on poor outlook, 2015.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. Cvpr, 2015.
  • [36] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015.
  • [37] A. Vershik. Asymptotic Combinatorics with Applications to Mathematical Physics: A European Mathematical Summer School held at the Euler Institute, St. Petersburg, Russia, July 9-20, 2001. Springer, 2003.
  • [38] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, pages 1618–1625, 2017.
  • [39] H. Yang, B. Wang, S. Lin, D. Wipf, M. Guo, and B. Guo. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 4633–4641, 2015.
  • [40] H. Yang, B. Wang, S. Lin, D. Wipf, M. Guo, and B. Guo. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 4633–4641, 2015.
  • [41] R. Yonetani, K. M. Kitani, and Y. Sato. Ego-surfing first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5445–5454, 2015.
  • [42] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1059–1067, 2016.
  • [43] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision, pages 766–782. Springer, 2016.
  • [44] B. Zhao, X. Li, and X. Lu. Hierarchical recurrent neural network for video summarization. In Proceedings of the 2017 ACM on Multimedia Conference, pages 863–871. ACM, 2017.
  • [45] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, volume 2, page 8, 2016.