1 Introduction
In many real-world applications, unlabeled data arrive in the form of several highly correlated views. Examples are frequently encountered in video processing, where different cameras may cover roughly the same field of view (FOV) from different viewpoints, as in office coverage or surveillance records. In such cases, one might expect to exploit these correlations to help understand and characterize the data and, preferably, to find an “optimal” metric that reflects the intrinsic structure of the input data.
In this paper, we are interested in this kind of problem, especially multiview video summarization. Suppose that complicated human motion in local geometric coordinates is a function varying with time and sampled temporally by multiple cameras simultaneously. To reveal the characteristics of this original space, traditional methods generally extract a high-dimensional feature vector space for each view individually under a manifold assumption, and then apply one of many dimension reduction methods. However, videos from different views often contain distinctive and complementary information about the original data. We therefore present a multiview metric learning framework that integrates all views into a new metric space and discloses the intrinsic features of the original human motion. Here, the video summary is the intrinsic feature we are striving for.
We first provide a unified framework for multiview video summarization by multiview metric learning. Multiview video simultaneously captures different visual projections of the same time-space manifold in real life. Our multiview metric is learned to project all the multiview videos into a new metric space that best simulates the real-world manifold space. This greatly facilitates video summarization by preserving the most intrinsic features across different views. Specifically, the framework is derived from Maximum Margin Clustering (MMC) by adding a disagreement minimization criterion for the learned metric. In the learned metric space, visual data are summarized by clustering them and extracting key frames from each cluster.
2 Previous work
Multiview learning has received considerable attention in the past decade. Most previous methods, however, are devoted to semi-supervised learning (see [23] for a detailed survey). Some studies on the unsupervised case have been performed [4, 22, 9], focusing on merging different metrics and minimizing disagreements among them. Our approach differs from them in that we simultaneously take the minimizing-disagreement criterion and the Maximum Margin Clustering (MMC) criterion into account. The optimization involved achieves a tradeoff between them.
Maximum Margin Clustering (MMC) is a classical approach to clustering [18, 19] that aims at finding clusters with large margins. It often exhibits superior performance compared with traditional clustering algorithms. Following the same criteria as MMC approaches, the method proposed in this paper optimizes a graph-theoretic measure to find a kernel matrix that allows larger margins between clusters.
Metric learning [17, 16] and multiple kernel learning [7] aim at finding an “optimal” distance metric (or a convex combination of kernels that implicitly defines a distance metric) that allows distance-based or kernel-based algorithms to achieve better performance. Previous studies on metric learning and multiple kernel learning mainly focus on situations where additional information (such as side information [17] or class labels [16, 24, 1]) is available. Studies on “pure” unsupervised learning utilize only either the maximum margin criterion [20, 21] or the minimizing-disagreement criterion [9].
Video summarization has been a well-studied topic for the past two decades; we refer to [12] for a comprehensive survey. Although some previous studies have been dedicated to multi-camera systems, they focused either on tracking moving objects across cameras with non-overlapping fields of view [11, 10] or on compression [13]. Fu et al. [5] made the first effort to systematically study skim-based multiview video summarization (especially in surveillance videos) by using hypergraph structures, and [8] extracted keyframes for such multiview summarization. We instead explore this problem through a multiview metric learning framework and directly address video summarization on multiple overlapping views.
3 Multiview metric learning framework
Suppose are the low-level features of different views, where are the coordinate matrices. Our goal is to find a unified coordinate matrix minimizing
(1) 
where are the empirical, structural, and disagreement losses of , respectively. are parameters controlling the tradeoff of objectives.
The classical MMC contains the former two parts: . However, our problem requires that the learned metric preserve important information about the data points in the original space. Therefore, a disagreement minimization criterion (DMC) is added through .
The empirical loss is usually defined according to label information (such as labels of instances or certain “side information”). For example, in supervised multiple kernel learning, is usually defined as the minimum hinge loss achievable on the metric defined by . The structural loss can be defined as the complexity of classifiers (as in the case of SVM), or be used to ensure that “similar” instances have “similar” labels (as in some formulations of manifold learning, e.g. [2]), etc. The disagreement loss measures the extent to which differs from the .
3.1 Unsupervised multiview metric learning
This section discusses the choice of each loss function for the framework.
First, suppose are the similarity matrices defined by the metric spaces , respectively, where is the similarity between data points and on the th view (we use an RBF kernel to define similarity; see [14] for a full discussion). Let be the normalized Laplacian of , where the normalized Laplacian of a similarity matrix is defined as
(2) 
where with , and I is the identity matrix.
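As an illustration of the construction described above, the following minimal sketch builds an RBF similarity matrix for one view and its normalized Laplacian. Because Eq. 2 is not legible here, the sketch assumes the symmetric form I - D^{-1/2} W D^{-1/2} from [14], where D is the diagonal degree matrix. The function names, the bandwidth `sigma`, and the toy features are hypothetical.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Pairwise RBF similarities W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def normalized_laplacian(W):
    """Assumed symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}."""
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Toy features for a single view: 6 frames, 4 feature dimensions (hypothetical data).
rng = np.random.default_rng(0)
X_view = rng.normal(size=(6, 4))
W = rbf_similarity(X_view, sigma=2.0)
L = normalized_laplacian(W)
eigvals = np.linalg.eigvalsh(L)  # eigenvalues in ascending order
```

Under this assumed form, the eigenvalues of the Laplacian lie between 0 and 2, and the smallest one is 0.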
A good video summary should be well coordinated across views and invariant to metric transformations of synchronous frames, such as rotation, translation, and scaling. More subtly, it is nontrivial to make the framework robust to different visual conditions, especially for surveillance video summaries. To this end, we define the disagreement loss as
(3) 
is the similarity transformation of the metric . This function can be viewed as a simplified version of the Canonical Correlation Analysis (CCA) [6] measure. Like CCA, it is invariant to certain kinds of metric transformations, such as rotation, translation, and scaling, and it better coordinates different visual conditions. Furthermore, it is more desirable in that it introduces no optimization variables.
Our definition of is motivated by the following results on spectral graph theory.
Theorem ([14])
Theorem ([3])
For , we have
where , is the shortest path from to , , , and are the eigenvalues of .
These theorems indicate that the first smallest eigenvalues of determine the quality of clustering on the metric implicitly defined by (which is a transformation of the metric ). Therefore, we define the structural loss as
(4) 
where are the eigenvalues of , and is a parameter indicating the desired number of clusters.
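A small numerical sketch may help make the role of the smallest eigenvalues concrete. Eq. 4 is not legible here, so the code assumes that the structural loss is the sum of the `k` smallest eigenvalues of the normalized Laplacian, with `k` standing for the desired number of clusters. The function names and the toy similarity matrices are hypothetical.

```python
import numpy as np

def normalized_laplacian(W):
    """Assumed symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def structural_loss(L, k):
    """Assumed structural loss: sum of the k smallest eigenvalues of L."""
    eigvals = np.linalg.eigvalsh(L)  # ascending order
    return float(np.sum(eigvals[:k]))

within = np.kron(np.eye(2), np.ones((3, 3)))        # two groups of 3 points
between = np.kron(1.0 - np.eye(2), np.ones((3, 3)))  # links across the groups

W_separated = within                 # two perfectly separated clusters
W_mixed = within + 0.5 * between     # same clusters with cross similarities

loss_separated = structural_loss(normalized_laplacian(W_separated), k=2)
loss_mixed = structural_loss(normalized_laplacian(W_mixed), k=2)
```

In this toy case the perfectly separated graph gives a loss of zero, while adding cross-cluster similarity raises it, which matches the intuition that small leading eigenvalues indicate well-separated clusters.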
Finally, since the unsupervised learning setting has no label information, we simply let
Combining the definitions above, we finally formulate our optimization objective for unsupervised multiview metric learning as
(5)  
(6) 
3.2 Discussion for some alternative choices
As is the measure of disagreement between metric spaces, one may consider the CCA as a good choice. However, the calculation of CCA involves optimization on transformation matrices, which will introduce optimization variables into the optimization problem, making the optimization intractable.
A simplification of the CCA measure leads to the following prediction-based disagreement measure [15]:
(7) 
where and denote the predictions of the classifiers learned according to the metrics and . This definition is advisable when classification results can be easily deduced from the learned metric in the same optimization framework. Yet problems arise when we face clustering tasks, where the disagreement between different clustering results may be difficult to calculate.
Compared with these definitions, our definition of disagreement loss is more straightforward and computationally efficient as it is directly based on the metric learned and introduces no additional optimization variables.
4 Optimization for unsupervised multiview metric learning
In this section, we present an efficient algorithm for solving the optimization problem in Eq. 5.
Let and . Note that, once the is found, a metric space is implicitly defined. In fact, given , the coordinate matrix is a metric space, where is the eigenvector corresponding to the th smallest eigenvalue of , and the k-means algorithm can be used for clustering according to this metric space. This is exactly the way in which normalized cut on a graph is usually performed [14]. Therefore, for the purpose of clustering, it suffices to compute the itself (note that has the same eigenvectors as and therefore leads to the same clustering result). The optimization problem now turns to
(8)
For efficiency, we further assume that . The problem can then be solved efficiently by an alternating descent method: first, with fixed, can be solved via eigendecomposition of ; then, with fixed, is solved by a quadratic program (Eq. 9). The two steps are repeated until convergence. This quadratic program can be solved efficiently by Mosek because is always small in practice:
(9) 
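The alternating scheme can be sketched as follows. Since Eqs. 8 and 9 are not legible here, the objective is an assumed stand-in rather than the paper's exact formulation: the learned Laplacian is taken to be a convex combination of the per-view Laplacians with weights `mu`, the structural term is the sum of the `k` smallest eigenvalues, and the disagreement term is the summed squared Frobenius distance to each view, weighted by `beta`. Projected gradient descent replaces the Mosek solver for the quadratic step. All names and the toy data are hypothetical.

```python
import numpy as np

def normalized_laplacian(W):
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - s[:, None] * W * s[None, :]

def project_to_simplex(v):
    """Euclidean projection onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def combine(mu, laplacians):
    return np.tensordot(mu, laplacians, axes=1)  # sum_i mu[i] * L_i

def objective(mu, laplacians, k, beta):
    L = combine(mu, laplacians)
    structural = np.sum(np.linalg.eigvalsh(L)[:k])
    disagreement = sum(np.sum((L - Li) ** 2) for Li in laplacians)
    return structural + beta * disagreement

def alternating_descent(laplacians, k, beta=1.0, outer_iters=30,
                        inner_iters=100, tol=1e-8):
    m = len(laplacians)
    flat = laplacians.reshape(m, -1)
    G = flat @ flat.T  # G[i, j] = Frobenius inner product of L_i and L_j
    step = 1.0 / (2.0 * m * beta * np.linalg.eigvalsh(G)[-1])
    mu = np.full(m, 1.0 / m)
    for _ in range(outer_iters):
        mu_prev = mu.copy()
        # Step 1: weights fixed, eigenvectors of the k smallest eigenvalues.
        _, vecs = np.linalg.eigh(combine(mu, laplacians))
        U = vecs[:, :k]
        # Step 2: eigenvectors fixed, quadratic program in mu on the simplex.
        c = np.array([np.trace(U.T @ Li @ U) for Li in laplacians])
        for _ in range(inner_iters):
            grad = c + 2.0 * beta * (m * G @ mu - G.sum(axis=1))
            mu = project_to_simplex(mu - step * grad)
        if np.linalg.norm(mu - mu_prev) < tol:
            break
    return mu

# Toy data: three noisy views of the same two-cluster point set.
rng = np.random.default_rng(0)
base = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

def rbf(X):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / 2.0)

laplacians = np.array([normalized_laplacian(rbf(base + rng.normal(0, s, base.shape)))
                       for s in (0.1, 0.5, 1.0)])
mu = alternating_descent(laplacians, k=2, beta=1.0)
```

Each step can only lower the assumed objective, so the returned weights are never worse than the uniform starting point under that objective.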
5 Application to multiview video summarization
To generate a video summary, we assume that each event in the real world corresponds to a distribution centered at a small region in a “latent” semantic space. Each “instance” of the event is a data point sampled according to in the latent semantic space.
Our solution to multiview video summarization is summarized in Algorithm 1. We deal with videos of the same spot from different angles, so the high-dimensional low-level features of each view are embedded in the same low-dimensional space. This justifies the use of the above-mentioned framework, which imposes a disagreement-minimization criterion on the metric learning.
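The clustering and keyframe selection step can be sketched as follows. This is not a reproduction of Algorithm 1; it assumes that a learned Laplacian over frames is already available, embeds the frames with the eigenvectors of its smallest eigenvalues, clusters them with a plain k-means, and picks the frame closest to each cluster center as a keyframe. The function names, the nearest-to-center selection rule, and the toy similarity matrix are hypothetical.

```python
import numpy as np

def spectral_embedding(L, k):
    """Rows are frames; columns are eigenvectors of the k smallest eigenvalues."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]

def assign(points, centers):
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100):
    # Deterministic farthest-point initialisation.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(points, centers), centers

def select_keyframes(L, k):
    """Return one frame index per non-empty cluster, plus the cluster labels."""
    emb = spectral_embedding(L, k)
    labels, centers = kmeans(emb, k)
    keyframes = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        d = np.sum((emb[members] - centers[j]) ** 2, axis=1)
        keyframes.append(int(members[np.argmin(d)]))
    return sorted(keyframes), labels

# Toy similarity: frames 0-3 form one event, frames 4-7 another.
W = (np.kron(np.eye(2), np.ones((4, 4)))
     + 0.01 * np.kron(1.0 - np.eye(2), np.ones((4, 4))))
s = 1.0 / np.sqrt(W.sum(axis=1))
L = np.eye(8) - s[:, None] * W * s[None, :]
keyframes, labels = select_keyframes(L, k=2)
```

On this toy input the sketch returns one keyframe from each of the two events.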
6 Experiments
We conduct our experiments on the Road and Office1 datasets [5], which were captured by three hand-held video cameras with 360-degree coverage of the scene. Some representative frames are shown in Fig. 1. The same important objects (bus or human) are highlighted and extracted from the original different views. This facilitates quick browsing and understanding of the original videos with overlapping views. For the baselines, we construct a graph over the frames in each view, employ normalized cut for clustering, and select the representative frames. The ED (Euclidean distance) method uses the original feature vector space (Euclidean space) of each view for metric learning, while the DM method uses the diffusion metric for metric learning.
We employ the ground truth of important events in the Office1 dataset defined in [5] to measure objective performance. We report the results of [5] and extract summaries of the same length for Uni., Ran., ED, DM, and our method in Tab. 1. The results show that our method achieves the highest recall (76.9) while matching the precision of [5] (100).
Methods     No. Eve.   Precision (%)   Recall (%)
[5]         16         100             61
Uni./Ran.   10/5       70/60           26.9/11.5
ED/DM       10/13      80/76.9         30.8/38.5
Ours        20         100             76.9
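For readers who want to see how such precision and recall figures are obtained, a minimal sketch follows. It assumes the usual definitions: precision is the share of detected events that are correct, and recall is the share of ground-truth events that are detected. The counts in the example are hypothetical; in particular, the total number of ground-truth events is not stated in the text.

```python
def precision_recall(n_detected, n_correct, n_ground_truth):
    """Return (precision, recall) as percentages rounded to one decimal."""
    precision = 100.0 * n_correct / n_detected
    recall = 100.0 * n_correct / n_ground_truth
    return round(precision, 1), round(recall, 1)

# Hypothetical counts: 10 detected events, 8 of them correct,
# and an assumed total of 26 ground-truth events.
precision, recall = precision_recall(n_detected=10, n_correct=8, n_ground_truth=26)
```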
To further evaluate the effectiveness of these five methods, we conducted a user study in which 12 participants gave their judgements of the results. Table 2 shows the scores, which are normalized to the range from 0 to 1; higher scores indicate better satisfaction. The summary results of [5] are not directly comparable in this part because they are skim-based summaries, while ours are keyframe summaries. The results show that the learned multiview metric space yields higher user satisfaction than the baselines.
Methods     road        office1
Uni./Ran.   0.4/0.3     0.45/0.35
ED/DM       0.68/0.78   0.72/0.75
Ours        0.80        0.76
7 Conclusion
In this paper, we present a systematic solution to multiview video summarization. The solution is based on reconstructing the latent semantic metric by multiview metric learning. The multiview metric learning method achieves a balance between the separability of clusters and the similarity to original metrics with an efficient optimization algorithm.
The multiview metric learning algorithm proposed in the paper can be used to efficiently learn an “optimal” combination of multiple metrics. The “optimality” is defined as a tradeoff between the maximum margin between clusters achievable on the metric and the similarity between the learned metric and the original ones.
References
 [1] A. Argyriou, M. Herbster, and M. Pontil. Combining graph laplacians for semi-supervised learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS 2005), pages 67–74. MIT Press, Cambridge, MA, 2006.

 [2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
 [3] F. R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, No. 92. American Mathematical Society, 1997.
 [4] V. R. de Sa. Spectral clustering with two views. In Proceedings of the International Conference on Machine Learning (ICML 2005) Workshop on Learning with Multiple Views, Bonn, Germany, 2005.
 [5] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. IEEE TMM, 2010.
 [6] D. R. Hardoon, S. Szedmák, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
 [7] G. R. G. Lanckriet, T. D. Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004.
 [8] P. Li, Y. Guo, and H. Sun. Multikeyframe abstraction from videos. 2011.
 [9] B. Long, P. S. Yu, and Z. M. Zhang. A general model for multiple view unsupervised learning. In Proceedings of the SIAM International Conference on Data Mining (SDM 2008), pages 822–833, Atlanta, Georgia, USA, 2008.

 [10] C. C. Loy, T. Xiang, and S. Gong. Multi-camera activity correlation analysis. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009.
 [11] B. Prosser, S. Gong, and T. Xiang. Multi-camera matching using bi-directional cumulative brightness transfer functions. In 2008 British Machine Vision Conference (BMVC 2008), pages xx–yy, 2008.
 [12] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1):3, 2007.
 [13] A. Vetro, S. Yea, M. Zwicker, W. Matusik, and H. Pfister. Overview of multiview video coding and anti-aliasing for 3D displays. In 2007 IEEE International Conference on Image Processing (ICIP 2007), pages 17–20, 2007.
 [14] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 [15] Z. Wang, S. Chen, and T. Sun. MultiK-MHKS: A novel multiple kernel learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):348–353, 2008.
 [16] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS 2005), pages 1473–1480. MIT Press, Cambridge, MA, 2006.
 [17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 505–512. MIT Press, Cambridge, MA, 2003.
 [18] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 1537–1544. MIT Press, Cambridge, MA, 2005.

 [19] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In The Twentieth National Conference on Artificial Intelligence (AAAI 2005), pages 904–910, Pittsburgh, Pennsylvania, USA, 2005.
 [20] C.-Y. Yeh, C.-W. Huang, and S.-J. Lee. Multi-kernel support vector clustering for multi-class classification. In Proceedings of the 2008 3rd International Conference on Innovative Computing Information and Control (ICICIC 2008), page 331, Washington, DC, USA, 2008. IEEE Computer Society.
 [21] B. Zhao, J. T. Kwok, and C. Zhang. Multiple kernel clustering. In Proceedings of the SIAM International Conference on Data Mining (SDM 2009), pages 638–649, Sparks, Nevada, USA, 2009.
 [22] D. Zhou and C. J. C. Burges. Spectral clustering and transductive learning with multiple views. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), pages 1159–1166, Corvallis, Oregon, USA, 2007.
 [23] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
 [24] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), pages 1191–1198, Corvallis, Oregon, USA, 2007.