Multi-view Metric Learning for Multi-view Video Summarization

05/25/2014 ∙ by Yanwei Fu, et al. ∙ 0

Traditional video summarization methods are designed to generate summaries for single-view video records, and thus cannot fully exploit the redundancy in multi-view video records. In this paper, we present a multi-view metric learning framework for multi-view video summarization that combines the advantages of maximum margin clustering with the disagreement minimization criterion. The learning framework can thus find a metric that best separates the data while forcing the learned metric to preserve the original intrinsic information between data points, for example geometric information. Building on this framework, we develop a systematic solution to the multi-view video summarization problem. To the best of our knowledge, this is the first work to address multi-view video summarization from the viewpoint of metric learning. The effectiveness of the proposed method is demonstrated by experiments.




1 Introduction

In many real-world applications, unlabeled data arrive in the form of a number of highly correlated views. Examples of this kind are frequently encountered in video processing, where different cameras may focus on roughly the same field of view (FOV) from different viewpoints, such as in office coverage or surveillance records. In such cases, one might expect to use these correlations to help understand and characterize the data and, preferably, to find an “optimal” metric that reflects the intrinsic structure of the input data.

In this paper, we are interested in this kind of problem, and in multi-view video summarization in particular. Suppose that complicated human motion, expressed in local geometric coordinates, is a function of time that is sampled simultaneously by multiple cameras. To reveal the characteristics of this original space, traditional methods generally extract a high-dimensional feature space for each view separately under a manifold assumption, and then apply dimension reduction methods. However, videos from different views often carry distinctive and complementary information about the original data. We therefore present a multi-view metric learning framework that integrates all views of these videos into a new metric space and discloses the intrinsic features of the original human motion. Here, the video summary is the intrinsic feature we are striving for.

Figure 1: The flowchart of the multi-view metric learning framework with application to multi-view video summarization.

We first provide a unified framework for multi-view video summarization based on multi-view metric learning. Multi-view video simultaneously captures different visual projections of the same spatio-temporal manifold in real life. Our multi-view metric learning projects all the multi-view videos into a new metric space that best approximates the real-world manifold. This greatly facilitates video summarization by preserving the most intrinsic features across different views. Specifically, the framework is derived from Maximal Margin Clustering (MMC) by adding a disagreement minimization criterion on the learned metric. In the learned metric space, visual data are summarized by clustering them and extracting key frames from each cluster.

2 Previous work

Multi-view learning has received considerable attention in the past decade. Most previous methods, however, are devoted to semi-supervised learning (see [23] for a detailed survey). Some studies on the unsupervised case have been performed [4, 22, 9], focusing on merging different metrics and minimizing disagreements among them. Our approach differs from them in that we simultaneously take the minimizing-disagreement criterion and the Maximal Margin Clustering (MMC) criterion into account. The optimization involved achieves a trade-off between them.

Maximum Margin Clustering (MMC) is a classical approach to clustering [18, 19] that aims at finding clusters with large margins. It often exhibits superior performance compared with traditional clustering algorithms. Following the same criterion as MMC approaches, the method proposed in this paper optimizes a graph-theoretic measure to find a kernel matrix that allows larger margins between clusters.

Metric learning [17, 16] and multiple kernel learning [7] aim at finding an “optimal” distance metric (or a convex combination of kernels that implicitly defines a distance metric) that allows distance-based or kernel-based algorithms to achieve better performance. Previous studies on metric learning and multiple kernel learning mainly focus on situations where additional information (such as side information [17] or class labels [16, 24, 1]) is available. Studies on “pure” unsupervised learning utilize only either the maximum margin criterion [20, 21] or the minimizing-disagreement criterion [9].

Video summarization has been a well-studied topic over the past two decades; we refer to [12] for a comprehensive survey. Although some previous studies have been dedicated to multi-camera systems, they focused either on tracking moving objects across cameras with non-overlapping fields of view [11, 10] or on compression [13]. Fu et al. [5] made the first effort to systematically study skim-based multi-view video summarization (especially for surveillance videos) by using hypergraph structures, and [8] extracted keyframes for such multi-view summarization. We instead explore this problem with a multi-view metric learning framework, directly addressing video summarization on multiple overlapping views.

3 Multi-view metric learning framework

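Suppose $X^{(1)}, \dots, X^{(m)}$ are the low-level features of $m$ different views, where the $X^{(v)}$ are coordinate matrices. Our goal is to find a unified coordinate matrix $X$ minimizing

$$\lambda_1\, \ell_{\mathrm{emp}}(X) + \lambda_2\, \ell_{\mathrm{str}}(X) + \lambda_3\, \ell_{\mathrm{dis}}\big(X; X^{(1)}, \dots, X^{(m)}\big),$$

where $\ell_{\mathrm{emp}}$, $\ell_{\mathrm{str}}$, and $\ell_{\mathrm{dis}}$ are the empirical, structural, and disagreement losses of $X$, respectively, and $\lambda_1, \lambda_2, \lambda_3$ are parameters controlling the trade-off between the objectives.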

The classical MMC contains only the first two parts, the empirical and structural losses. Our problem, however, requires that the learned metric preserve important information about the data points in the original space. Therefore, the disagreement minimization criterion (DMC) is added through the disagreement loss.

The empirical loss is usually defined according to label information (such as labels of instances or certain “side information”). For example, in supervised multiple kernel learning, the empirical loss is usually defined as the minimum hinge loss achievable on the learned metric. The structural loss can be defined as the complexity of classifiers (as in the case of SVM), or be used to ensure that “similar” instances have “similar” labels (as in some formulations of manifold learning, e.g. [2]). The disagreement loss measures the extent to which the learned metric differs from the original per-view metrics.

3.1 Unsupervised multi-view metric learning

This section discusses the choice of each loss function for the framework.

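First, suppose $W^{(1)}, \dots, W^{(m)}$ are the similarity matrices defined by the metric spaces $X^{(1)}, \dots, X^{(m)}$, respectively, where $W^{(v)}_{ij}$ is the similarity between data points $i$ and $j$ on the $v$-th view (we use the RBF kernel to define similarity; [14] has a full discussion). Let $L^{(v)}$ be the normalized Laplacian of $W^{(v)}$, where the normalized Laplacian of a similarity matrix $W$ is defined as

$$L = I - D^{-1/2}\, W\, D^{-1/2},$$

where $D = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i = \sum_j W_{ij}$, and $I$ is the identity matrix.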
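As a minimal sketch of the construction above, the snippet below builds an RBF similarity matrix from toy features and forms its normalized Laplacian. The function names, the bandwidth `sigma`, and the random toy data are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Pairwise RBF similarities w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def normalized_laplacian(W):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2}, D = diag(row sums of W)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Toy data (assumption): 6 frames with 4-dimensional features from one view.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W = rbf_similarity(X, sigma=1.5)
L = normalized_laplacian(W)
eigvals = np.linalg.eigvalsh(L)  # eigenvalues in ascending order
```

The resulting matrix is symmetric with its smallest eigenvalue at zero, which is the property the spectral arguments in this section rely on.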

A good video summary should be well coordinated across views and invariant to metric transformations of synchronous frames, such as rotation, translation, and scaling. More subtly, it is nontrivial to make the framework robust to different visual conditions, especially for surveillance video summarization. To this end, we define the disagreement loss as


is the similarity transformation of the metric . This function can be viewed as a simplified version of the Canonical Correlation Analysis (CCA) [6] measure. Like CCA, it is invariant to certain kinds of metric transformations such as rotation, translation, and scaling and better coordinate different visual conditions. Furthermore, it is more desirable in that it introduces no optimization variables.

Our definition of the structural loss is motivated by the following results from spectral graph theory.

Theorem ([14])

The multiplicity $k$ of the eigenvalue $0$ of the normalized Laplacian $L$ equals the number of connected components in the graph.
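The first theorem can be checked numerically on a small example. The sketch below uses a hand-built unweighted graph (an illustrative assumption) with three connected components and counts the zero eigenvalues of its normalized Laplacian.

```python
import numpy as np

# Graph with three connected components: a triangle {0, 1, 2} and two
# single edges {3, 4} and {5, 6}. Every node has degree at least 1.
W = np.zeros((7, 7))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (5, 6)]:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(d)
L = np.eye(7) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

eigvals = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.abs(eigvals) < 1e-9))
print(num_zero)  # 3, one zero eigenvalue per connected component
```

With real similarity graphs the components are rarely exactly disconnected, so the eigenvalues are small rather than exactly zero, which motivates looking at the magnitude of the smallest eigenvalues.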

Theorem ([3])

For , we have

where , is the shortest path from to , , , and are the eigenvalues of .

These theorems indicate that the $k$ smallest eigenvalues of the normalized Laplacian $L$ determine the quality of a $k$-clustering on the metric implicitly defined by $L$ (which is a transformation of the original metric). Therefore, we define the structural loss as

$$\ell_{\mathrm{str}} = \sum_{i=1}^{k} \lambda_i,$$

where $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ are the eigenvalues of $L$, and $k$ is a parameter indicating the desired number of clusters.
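To illustrate why small leading eigenvalues indicate a good clustering, the sketch below evaluates one plausible reading of the structural loss, namely the sum of the k smallest eigenvalues of the normalized Laplacian. That exact form is an assumption made here for illustration, as are the function names and the two toy similarity matrices.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - s[:, None] * W * s[None, :]

def structural_loss(L, k):
    """Sum of the k smallest eigenvalues of L (assumed form of the loss)."""
    eigvals = np.linalg.eigvalsh(L)  # ascending order
    return float(np.sum(eigvals[:k]))

ones = np.ones((3, 3))
eps = 0.01
# Two tight groups that are only weakly similar to each other.
W_separated = np.block([[ones, eps * ones], [eps * ones, ones]])
# Six points that are all equally similar: no cluster structure.
W_mixed = np.ones((6, 6))

loss_separated = structural_loss(normalized_laplacian(W_separated), k=2)
loss_mixed = structural_loss(normalized_laplacian(W_mixed), k=2)
```

For these two matrices the loss is about 0.0198 for the well-separated case and 1.0 for the unstructured case, so a smaller value corresponds to a cleaner two-cluster structure.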

Finally, since the unsupervised learning setting has no label information, we simply set the empirical loss to zero.

Combining the definitions above, we finally formulate our optimization objective for unsupervised multi-view metric learning as


3.2 Discussion for some alternative choices

As the disagreement loss measures the disagreement between metric spaces, one may consider CCA a good choice. However, computing CCA involves optimizing over transformation matrices, which introduces additional optimization variables and makes the overall problem intractable.

A simplification of the CCA measure leads to the following prediction-based disagreement measure [15]:


where the two arguments denote the predictions of the classifiers learned according to the two metrics being compared. This definition is advisable when classification results can be easily deduced from the learned metric within the same optimization framework. Problems arise, however, in clustering tasks, where the disagreement between different clustering results may be difficult to calculate.

Compared with these definitions, our definition of disagreement loss is more straightforward and computationally efficient as it is directly based on the metric learned and introduces no additional optimization variables.

4 Optimization for unsupervised multi-view metric learning

In this section, we present an efficient algorithm for solving the optimization problem in Eq. 5.

Let and . And note that, once the is found, a metric space is implicitly defined. In fact, given , the coordinate matrix is a metric space, where

is the eigenvector corresponding to the $i$-th smallest eigenvalue of the learned Laplacian, and the $k$-means algorithm can be used for clustering according to this metric space. This is exactly the way in which normalized cut on a graph is usually performed [14]. Therefore, for the purpose of clustering, it suffices to compute the Laplacian itself (note that a matrix with the same eigenvectors as the Laplacian leads to the same clustering result). The optimization problem now turns to


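For efficiency, we further assume that the learned metric is a combination of the original per-view metrics. The problem can then be solved efficiently by an alternating descent method: first, with the combination coefficients fixed, the eigenvectors are obtained via eigen-decomposition; then, with the eigenvectors fixed, the coefficients are obtained by solving a quadratic program (Eq. 9). The two steps are repeated until convergence. This quadratic program can be solved efficiently by Mosek because the number of variables is always small in practice: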


5 Application to multi-view video summarization

To generate a video summary, we assume that each event in the real world corresponds to a distribution centered at a small region in a “latent” semantic space. Each “instance” of the event is a data point sampled according to this distribution in the latent semantic space.

Our solution to multi-view video summarization is summarized in Algorithm 1. We deal with videos of the same spot taken from different angles, so the high-dimensional low-level features of each view are embedded in the same low-dimensional space. This justifies the use of the above framework, which imposes a disagreement-minimization criterion on the metric learning.

  1. Decompose the video records into sets of frames, denoted $X^{(1)}, \dots, X^{(m)}$, where $X^{(v)}$ is the $d_v$-dimensional feature representation of the frames in the $v$-th view.

  2. Learn a unified metric space $X$ according to the information lying in $X^{(1)}, \dots, X^{(m)}$.

  3. Perform clustering on $X$, using the centers of the clusters as representatives, denoted $c_1, \dots, c_k$.

  4. Select a frame for each $c_j$ out of the frames corresponding to it, and output these frames as the final summary.

Algorithm 1 Multi-view video summarization.
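The four steps of Algorithm 1 can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the method as optimized in Section 4: the learned unified metric is replaced by a uniform average of the per-view normalized Laplacians, the k-means routine is a plain Lloyd iteration with a deterministic initialization, and all function names and the toy three-view data are hypothetical.

```python
import numpy as np

def rbf_laplacian(X, sigma=1.0):
    """Normalized Laplacian I - D^{-1/2} W D^{-1/2} of an RBF similarity graph."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(X)) - s[:, None] * W * s[None, :]

def kmeans(U, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def summarize(views, k, sigma=1.0):
    """Return sorted indices of one key frame per cluster."""
    # Step 2 (stand-in, ASSUMPTION): uniform average of per-view Laplacians
    # instead of the learned combination.
    L = np.mean([rbf_laplacian(X, sigma) for X in views], axis=0)
    # Step 3: embed frames with the eigenvectors of the k smallest
    # eigenvalues, then cluster the embedded frames.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    labels, centers = kmeans(U, k)
    # Step 4: in each cluster, keep the frame closest to the cluster center.
    keyframes = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        dists = np.sum((U[members] - centers[j]) ** 2, axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)

# Step 1 (toy stand-in): 12 synchronized frames, 3 events of 4 frames each,
# observed by three views that rotate or scale the same latent coordinates.
rng = np.random.default_rng(0)
event_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
latent = np.repeat(event_centers, 4, axis=0)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
views = [
    latent + 0.1 * rng.normal(size=latent.shape),
    latent @ R + 0.1 * rng.normal(size=latent.shape),
    1.5 * latent + 0.1 * rng.normal(size=latent.shape),
]
keyframes = summarize(views, k=3)
```

On this toy input the summary contains one frame from each of the three events. Swapping the uniform average for learned combination weights would change only the line that builds `L`.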

6 Experiments

We conduct our experiments on the Road and Office1 datasets [5], which were captured by three hand-held video cameras with 360-degree coverage of the scene. Some representative frames are shown in Fig. 1. The same important objects (a bus or a human) are highlighted and extracted from the different original views. This facilitates quick browsing and understanding of the original videos with overlapping views. For the baselines, we construct a graph over the frames in each view, employ normalized cut for clustering, and select the representative frames. The ED (Euclidean distance) method uses the original feature space (Euclidean space) of each view for metric learning, while the DM method uses the diffusion metric.

We employ the ground truth of important events in the Office1 dataset defined in [5] to measure objective performance. We report the results of [5] and extract summaries of the same length for Uni., Ran., ED, DM, and our method in Table 1. The results show that our method achieves the highest recall among the compared methods while maintaining 100% precision.

Methods     No. Eve.   Precision (%)   Recall (%)
[5]         16         100             61
Uni./Ran.   10/5       70/60           26.9/11.5
ED/DM       10/13      80/76.9         30.8/38.5
Ours        20         100             76.9
Table 1: Objective performance comparison with previous methods on Office1. Uni. means we uniformly summarize the videos, while Ran. indicates we randomly summarize the frames of the videos.
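The precision and recall columns can be related to event counts with a short calculation. The sketch below assumes that “No. Eve.” is the number of events in each summary and that the ground truth contains 26 important events; the value 26 and the per-method counts of correct events are not stated in the text, they are inferred here because they reproduce the reported percentages.

```python
def precision_recall(num_detected, num_correct, num_ground_truth):
    """Precision and recall in percent, rounded to one decimal place."""
    precision = 100.0 * num_correct / num_detected
    recall = 100.0 * num_correct / num_ground_truth
    return round(precision, 1), round(recall, 1)

# ASSUMPTION: total number of ground-truth events, inferred from the table.
GROUND_TRUTH_EVENTS = 26

# (events in summary, events judged correct); the second value is inferred.
rows = {
    "Uni.": (10, 7),
    "Ran.": (5, 3),
    "ED": (10, 8),
    "DM": (13, 10),
    "Ours": (20, 20),
}

results = {
    name: precision_recall(detected, correct, GROUND_TRUTH_EVENTS)
    for name, (detected, correct) in rows.items()
}
for name, (precision, recall) in results.items():
    print(f"{name:5s} precision={precision:5.1f}  recall={recall:5.1f}")
```

Under the same assumption, the 16 events reported for [5] give a recall of about 61.5%, which matches the table's 61 up to rounding.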

To further evaluate the effectiveness of these five methods, we conduct a user study in which 12 participants gave their judgements of the results. Table 2 shows the scores, which are normalized to the range 0 to 1; higher scores indicate better satisfaction. The summaries of [5] are not directly comparable in this part because they are skim-based, while ours are keyframe-based. The results show that the learned multi-view metric space improves user satisfaction over the baselines.

Methods     Road        Office1
Uni./Ran.   0.4/0.3     0.45/0.35
ED/DM       0.68/0.78   0.72/0.75
Ours        0.80        0.76
Table 2: Statistical data of the user study.

7 Conclusion

In this paper, we present a systematic solution to multi-view video summarization. The solution is based on reconstructing the latent semantic metric by multi-view metric learning. The multi-view metric learning method achieves a balance between the separability of clusters and the similarity to original metrics with an efficient optimization algorithm.

The multi-view metric learning algorithm proposed in the paper can be used to efficiently learn an “optimal” combination of multiple metrics. The “optimality” is defined as a trade-off between the maximum margin between clusters achievable on the metric and the similarity between the learned metric and the original ones.


  • [1] A. Argyriou, M. Herbster, and M. Pontil. Combining graph laplacians for semi–supervised learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS 2005), pages 67–74. MIT Press, Cambridge, MA, 2006.
  • [2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
  • [3] F. R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, No. 92. American Mathematical Society, 1997.
  • [4] V. R. de Sa. Spectral clustering with two views. In Proceedings of the International Conference on Machine Learning (ICML 2005) Workshop on Learning with Multiple Views, Bonn, Germany, 2005.
  • [5] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. IEEE TMM, 2010.
  • [6] D. R. Hardoon, S. Szedmák, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
  • [7] G. R. G. Lanckriet, T. D. Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004.
  • [8] P. Li, Y. Guo, and H. Sun. Multi-keyframe abstraction from videos. 2011.
  • [9] B. Long, P. S. Yu, and Z. M. Zhang. A general model for multiple view unsupervised learning. In Proceedings of the SIAM International Conference on Data Mining (SDM 2008), pages 822–833, Atlanta, Georgia, USA, 2008.
  • [10] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009.
  • [11] B. Prosser, S. Gong, and T. Xiang. Multi-camera matching using bi-directional cumulative brightness transfer functions. In 2008 British Machine Vision Conference (BMVC 2008), pages xx–yy, 2008.
  • [12] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1):3, 2007.
  • [13] A. Vetro, S. Yea, M. Zwicker, W. Matusik, and H. Pfister. Overview of multiview video coding and anti-aliasing for 3D displays. In 2007 IEEE International Conference on Image Processing (ICIP 2007), pages 17–20, 2007.
  • [14] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
  • [15] Z. Wang, S. Chen, and T. Sun. Multik-mhks: A novel multiple kernel learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):348–353, 2008.
  • [16] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS 2005), pages 1473–1480. MIT Press, Cambridge, MA, 2006.
  • [17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 505–512. MIT Press, Cambridge, MA, 2003.
  • [18] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 1537–1544. MIT Press, Cambridge, MA, 2005.
  • [19] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In The Twentieth National Conference on Artificial Intelligence (AAAI 2005), pages 904–910, Pittsburgh, Pennsylvania, USA, 2005.
  • [20] C.-Y. Yeh, C.-W. Huang, and S.-J. Lee. Multi-kernel support vector clustering for multi-class classification. In Proceedings of the 2008 3rd International Conference on Innovative Computing Information and Control (ICICIC 2008), page 331, Washington, DC, USA, 2008. IEEE Computer Society.
  • [21] B. Zhao, J. T. Kwok, and C. Zhang. Multiple kernel clustering. In Proceedings of the SIAM International Conference on Data Mining (SDM 2009), pages 638–649, Sparks, Nevada, USA, 2009.
  • [22] D. Zhou and C. J. C. Burges. Spectral clustering and transductive learning with multiple views. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), pages 1159–1166, Corvalis, Oregon, USA, 2007.
  • [23] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
  • [24] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), pages 1191–1198, Corvalis, Oregon, USA, 2007.