1 Introduction
Unsupervised partitioning problems are ubiquitous in machine learning and other data-oriented fields such as computer vision, bioinformatics or signal processing. They include (a) traditional unsupervised clustering problems, with the classical K-means algorithm, hierarchical linkage methods [14] and spectral clustering [22], (b) unsupervised image segmentation problems where two neighboring pixels are encouraged to be in the same cluster, with mean-shift techniques [9] or normalized cuts [25], and (c) changepoint detection problems adapted to multivariate sequences (such as video) where segments are composed of contiguous elements, with typical window-based algorithms [11] and various methods looking for a change in the mean of the features (see, e.g., [8]).
All the algorithms mentioned above rely on a specific distance (or more generally a similarity measure) on the space of configurations. A good metric is crucial to the performance of these partitioning algorithms and its choice is heavily problem-dependent. While the choice of such a metric was originally tackled manually (often by trial and error), recent work has considered learning such a metric directly from data. Without any supervision, the problem is ill-posed and methods based on generative models may learn a metric or reduce dimensionality (see, e.g., [10]), but typically with no guarantees that they lead to better partitions.
In this paper, we follow [4, 32, 3] and consider the goal of learning a metric for potentially several partitioning problems sharing the same metric, assuming that several fully or partially labelled partitioned datasets are available during the learning phase. While such labelled datasets are typically expensive to produce, there are several scenarios where these datasets have already been built, often for evaluation purposes. These occur in video segmentation tasks (see Section 6.1), image segmentation tasks (see Section 6.3) as well as changepoint detection tasks in bioinformatics (see [15] and Section 5.3).
In this paper, we consider partitioning problems based explicitly or implicitly on the minimization of Euclidean distortions, which include K-means, spectral clustering and normalized cuts, and mean-based changepoint detection. We make the following contributions:

We review and unify several partitioning algorithms in Section 2, and cast them as the maximization of a linear function of a rescaled equivalence matrix, which can be solved by algorithms based on spectral relaxations or dynamic programming.

Given fully labelled datasets, we cast in Section 4 the metric learning problem as a large-margin structured prediction problem, with appropriate definitions of regularizers, losses and efficient loss-augmented inference.

Given partially labelled datasets, we propose in Section 5 an algorithm, iterating between labelling the full datasets given a metric and learning a metric given the fully labelled datasets. We also consider in Section 5.3 extensions that allow changes in the full distribution of univariate time series (rather than changes only in the mean), with application to bioinformatics.

We provide in Section 6 experiments showing that learning the metric may significantly improve the partitioning performance in synthetic examples, video segmentation and image segmentation problems.
Related work.
The need for metric learning goes far beyond unsupervised partitioning problems. [30] proposed a large-margin framework for learning a metric in nearest-neighbour algorithms based on sets of must-link/must-not-link constraints, while [13] considers a probability-based nonconvex formulation. For these works, a single dataset is fully labelled and the goal is to learn a metric leading to good testing performance on unseen data.
Some recent work [17] proved links between metric learning and kernel learning, which make it possible to kernelize any Mahalanobis distance learning problem.
Metric learning has also been considered in semi-supervised clustering of a single dataset, where some partial constraints are given. This includes the works of [4, 32], both based on efficient convex formulations. As shown in Section 6, these can be used in our settings as well by stacking several datasets into a single one. However, our discriminative large-margin approach outperforms these.
Moreover, the task of learning how to partition was tackled in [3] for spectral clustering. The problem setup is the same (availability of several fully partitioned datasets); however, the formulation is nonconvex and relies on the unstable optimization of eigenvectors. In Section 5.1, we propose a convex, more stable large-margin approach.
Other approaches do not require any supervision [10], and perform dimensionality reduction and clustering at the same time, by iteratively alternating the computation of a low-rank matrix and a clustering of the data using the corresponding metric. However, they are unable to take advantage of the labelled information that we use.
Our approach can also be related to the one of [26]. Given a small set of labelled instances, they use a similar large-margin framework, inspired by [29], to learn parameters of Markov random fields, using graph cuts for solving the “loss-augmented inference problem” of structured prediction. However, their segmentation framework does not apply to unsupervised segmentation (which is the goal of this paper). In this paper, we present a supervised learning framework aiming at learning how to perform an unsupervised task.
Our approach to learning the metric is nevertheless slightly different from the ones mentioned above. Indeed, we cast this problem as the solution of a structured SVM as in [29, 27]. Our paper thus shares many conceptual steps with works such as [7, 21], which use a structured SVM to learn weights for graph matching in one case and a metric for ranking in the other.
2 Partitioning through matrix factorization
In this section, we consider $T$ multi-dimensional observations $x_1, \dots, x_T \in \mathbb{R}^P$, which may be represented in a matrix $X \in \mathbb{R}^{T \times P}$. Partitioning the $T$ observations into $K$ classes is equivalent to finding an assignment matrix $Y \in \{0,1\}^{T \times K}$, such that $Y_{ij} = 1$ if the $i$-th observation is assigned to cluster $j$ and $Y_{ij} = 0$ otherwise. For general partitioning problems, no additional constraints are used, but for changepoint detection problems, it is assumed that the segments are contiguous and with increasing labels. That is, the matrix $Y$ is of the form
$$ Y = \begin{pmatrix} 1_{T_1} & 0 & \cdots & 0 \\ 0 & 1_{T_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1_{T_K} \end{pmatrix}, $$
where $1_D$ is the $D$-dimensional vector with constant components equal to one, and $T_k$ is the number of elements in cluster $k$. For any partition, we may reorder (non-uniquely) the data points so that the assignment matrix has the same form; this is typically useful for the understanding of partitioning problems.
2.1 Distortion measure
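In this paper, we consider partitioning models where each data point $x_i$ in cluster $k$ is modelled by a vector (often called a centroid or a mean) $c_k \in \mathbb{R}^P$, the overall goal being to find a partition and a set of means so that the distortion measure $\sum_{i=1}^{T} \sum_{k=1}^{K} Y_{ik} \|x_i - c_k\|^2$ is as small as possible, where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^P$. By considering the Frobenius norm defined through $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$, this is equivalent to minimizing
$$ \|X - YC\|_F^2 \qquad (1) $$
with respect to an assignment matrix $Y \in \{0,1\}^{T \times K}$ and the centroid matrix $C \in \mathbb{R}^{K \times P}$.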
2.2 Representing partitions
Following [3, 10], the quadratic minimization problem in $C$ can be solved in closed form, with solution $C = (Y^\top Y)^{-1} Y^\top X$ (it can be found by computing the matrix gradient and setting it to zero). Thus, the partitioning problem (with known number of clusters $K$) of minimizing the distortion in Eq. (1) is equivalent to:
$$ \min_{Y \in \{0,1\}^{T \times K},\; Y 1_K = 1_T} \; \|X - Y (Y^\top Y)^{-1} Y^\top X\|_F^2 \qquad (2) $$
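Thus, the problem is naturally parameterized by the $T \times T$ matrix $M = Y (Y^\top Y)^{-1} Y^\top$. This matrix, which we refer to as a rescaled equivalence matrix, has a specific structure. First, the matrix $D = Y^\top Y$ is diagonal, with $k$-th diagonal element equal to the number of elements in cluster $k$. Thus $M_{ij} = 0$ if $i$ and $j$ are in different clusters, and otherwise $M_{ij} = 1/D_{kk}$, where $D_{kk}$ is the number of elements in the cluster $k$ containing the $i$-th data point. Thus, if the points are reordered so that the segments are composed of contiguous elements, then we have the following form
$$ M = \begin{pmatrix} \frac{1}{T_1} 1_{T_1} 1_{T_1}^\top & 0 & \cdots & 0 \\ 0 & \frac{1}{T_2} 1_{T_2} 1_{T_2}^\top & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{T_K} 1_{T_K} 1_{T_K}^\top \end{pmatrix}, $$
where $T_k$ is the number of elements in cluster $k$ and $1_{T_k}$ is the all-ones vector of dimension $T_k$.
In this paper, we use this representation of partitions. Note the difference with the alternative representation $Y Y^\top$, which has values in $\{0, 1\}$, used in particular by [18].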
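We denote by $\mathcal{M}_K$ the set of rescaled equivalence matrices, i.e., matrices $M \in \mathbb{R}^{T \times T}$ such that there exists an assignment matrix $Y \in \{0,1\}^{T \times K}$ such that $M = Y (Y^\top Y)^{-1} Y^\top$. For situations where the number of clusters is unspecified, we denote by $\mathcal{M}$ the union of all $\mathcal{M}_K$ for $K \in \{1, \dots, T\}$.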
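Note that the number of clusters may be obtained from the trace of $M$, since $\operatorname{Tr} M = \operatorname{Tr}\, (Y^\top Y)^{-1} Y^\top Y = K$. This can also be seen by noticing that $M^2 = M$, i.e., $M$ is a projection matrix, with eigenvalues in $\{0, 1\}$, and the number of eigenvalues equal to one is exactly the number of clusters. Thus, $\mathcal{M}_K = \{M \in \mathcal{M} : \operatorname{Tr} M = K\}$.
Learning the number of clusters $K$.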
Given the number of clusters $K$, we have seen from Eq. (2) that the partitioning problem is equivalent to
$$ \min_{M \in \mathcal{M}_K} \operatorname{Tr}\bigl[X X^\top (I - M)\bigr] \qquad (3) $$
In changepoint detection problems, an extra constraint of contiguity of segments is added.
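As a numerical sanity check of the properties above, the following minimal sketch builds the rescaled equivalence matrix from a label vector and verifies that it is a projector, that its trace is the number of clusters, and that the distortion equals the trace of $XX^\top (I - M)$. The function names and the random data are illustrative assumptions, not part of the original method.

```python
import numpy as np

def rescaled_equivalence(labels):
    """Return M = Y (Y^T Y)^{-1} Y^T for a vector of integer cluster labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Assignment matrix Y: one row per point, one column per cluster.
    Y = (labels[:, None] == clusters[None, :]).astype(float)
    sizes = Y.sum(axis=0)  # diagonal of Y^T Y (cluster sizes)
    return (Y / sizes) @ Y.T

def distortion(X, labels):
    """Sum of squared distances of each point to its cluster mean."""
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        block = X[labels == k]
        total += ((block - block.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical toy data: 8 points in 3 dimensions, 3 clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
M = rescaled_equivalence(labels)
identity = np.eye(len(labels))
trace_form = np.trace(X @ X.T @ (identity - M))
```

Here `trace_form` and `distortion(X, labels)` agree up to rounding, which is the equivalence used to pass from the distortion measure to a linear function of the rescaled equivalence matrix.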
In the common situation when the number of clusters $K$ is unknown, it may be estimated directly from data by penalizing the distortion measure by a term proportional to the number of clusters, as usually done for instance in changepoint detection [19]. This is a classical idea that can be traced back to the AIC criterion [1] for instance. Given that the number of clusters for a rescaled equivalence matrix $M$ is $\operatorname{Tr} M$, this leads to the following formulation:
$$ \min_{M \in \mathcal{M}} \operatorname{Tr}\bigl[X X^\top (I - M)\bigr] + \lambda \operatorname{Tr} M \qquad (4) $$
Note that our metric learning algorithm also learns the extra penalty parameter $\lambda$.
Thus, the two types of partitioning problems (with fixed or unknown number of clusters) can be cast as the problem of maximizing a linear function of the form $\operatorname{Tr}(AM)$ with respect to a rescaled equivalence matrix $M \in \mathcal{M}$, with the potential constraint that $\operatorname{Tr} M = K$. In general, such optimization problems may not be solved in polynomial time. In Section 2.3, we show how adding contiguity constraints makes it possible to obtain a solution in polynomial time through dynamic programming. For general situations, the K-means algorithm, although not exact, can be used to get good partitioning in polynomial time. In Section 2.4, we provide a spectral relaxation, which we use within our large-margin framework in Section 4.
2.3 Changepoint detection by dynamic programming
The changepoint detection problem is a restriction of the general partitioning problem where the segments are composed of contiguous elements. We denote by $\mathcal{M}^{\mathrm{seq}}$ the set of partition matrices for the changepoint detection problem, and by $\mathcal{M}^{\mathrm{seq}}_K$ its restriction to partitions with $K$ segments.
The problem is thus that of solving Eq. (3) (known number of clusters) or Eq. (4) (unknown number of clusters) with the extra constraint that $M \in \mathcal{M}^{\mathrm{seq}}$. In these two situations, the contiguity constraint leads to exact polynomial-time algorithms based on dynamic programming; see, e.g., [24]. This leads to algorithms for maximizing $\operatorname{Tr}(AM)$, when $A$ is positive semidefinite, in $O(T^2)$. When the number of segments $K$ is known, the running time complexity is $O(KT^2)$.
We now describe a reformulation that can solve $\max_{M \in \mathcal{M}^{\mathrm{seq}}} \operatorname{Tr}(AM)$ for any matrix $A$ (potentially with negative eigenvalues, as from Eq. (4)). This algorithm is presented in Algorithm 1. It only requires some preprocessing of the input matrix $A$, namely computing its summed area table (or image integral) $S$, defined to have the same size as $A$ and with $S_{ij} = \sum_{i' \le i,\, j' \le j} A_{i'j'}$. In words, $S_{ij}$ is the sum of the elements of $A$ which are above and to the left of, respectively, $i$ and $j$. A similar algorithm can be derived in the case where $M \in \mathcal{M}^{\mathrm{seq}}_K$.
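The dynamic program described above can be sketched as follows for the case of an unknown number of segments. This is a minimal illustration under stated assumptions, not necessarily identical to Algorithm 1: the function names are hypothetical, the objective is written as the sum over contiguous segments of the block sum of the input matrix divided by the segment length (which is the trace of the product of the input matrix with a contiguous rescaled equivalence matrix), and the summed area table is padded with a leading row and column of zeros for convenience.

```python
import numpy as np

def summed_area_table(A):
    """Padded integral image: S[i, j] = sum of A[:i, :j]."""
    T = A.shape[0]
    S = np.zeros((T + 1, T + 1))
    S[1:, 1:] = A.cumsum(axis=0).cumsum(axis=1)
    return S

def block_sum(S, s, t):
    """Sum of the square block A[s:t, s:t], in O(1) from the table."""
    return S[t, t] - S[s, t] - S[t, s] + S[s, s]

def best_contiguous_partition(A):
    """Maximize sum over segments [s, t) of block_sum / (t - s), in O(T^2).

    Returns the optimal value and the sorted (exclusive) segment end indices.
    """
    T = A.shape[0]
    S = summed_area_table(A)
    value = np.full(T + 1, -np.inf)
    value[0] = 0.0
    prev = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        for s in range(t):
            cand = value[s] + block_sum(S, s, t) / (t - s)
            if cand > value[t]:
                value[t] = cand
                prev[t] = s
    # Backtrack the segment boundaries from the end of the sequence.
    ends = []
    t = T
    while t > 0:
        ends.append(t)
        t = int(prev[t])
    return float(value[T]), sorted(ends)
```

For instance, with a one-dimensional signal `x`, calling `best_contiguous_partition(np.outer(x, x) - lam * np.eye(len(x)))` corresponds to the penalized formulation, where `lam` (a hypothetical name) plays the role of the penalty on the number of segments. The input matrix may have negative eigenvalues, since the recursion never relies on positive semidefiniteness.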
2.4 Kmeans clustering and spectral relaxation
For a known number of clusters $K$, K-means is an iterative algorithm aiming at minimizing the distortion measure in Eq. (1): it iterates between (a) optimizing with respect to the centroids $C$, i.e., $C = (Y^\top Y)^{-1} Y^\top X$, and (b) minimizing with respect to the assignment matrix $Y$ (by assigning points to the closest centroids). Note that this algorithm only converges to a local minimum and there is no known algorithm to perform an exact decoding in polynomial time in high dimensions $P$. Moreover, the K-means algorithm cannot be readily applied to approximately maximize any linear function $\operatorname{Tr}(AM)$ with respect to $M \in \mathcal{M}$, i.e., when $A$ is not positive-definite or the number of clusters is not known.
Following [25, 22, 3], we now present a spectral relaxation of this problem. This is done by relaxing the set $\mathcal{M}_K$ to the set of symmetric matrices that satisfy $M^2 = M$ (i.e., removing the constraint that $M$ takes a finite number of distinct values). When the number of clusters $K$ is known, this leads to the classical spectral relaxation, i.e.,
$$ \max_{M^2 = M,\; \operatorname{Tr} M = K} \operatorname{Tr}(AM), $$
which is equal to the sum of the $K$ largest eigenvalues of $A$; the optimal matrix $M$ of the spectral relaxation is the orthogonal projector on the eigenvectors of $A$ with the $K$ largest eigenvalues.
When the number of clusters is unknown, we have:
$$ \max_{M^2 = M} \operatorname{Tr}(AM) = \operatorname{Tr}(A)_+, $$
where $\operatorname{Tr}(A)_+$ is the sum of positive eigenvalues of $A$. The optimal matrix $M$ of the spectral relaxation is the orthogonal projector on the eigenvectors of $A$ with positive eigenvalues. Note that in the formulation from Eq. (4), this corresponds to thresholding all eigenvalues of $X X^\top$ which are less than $\lambda$.
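We denote by $\mathcal{M}^{\mathrm{spec}}_K = \{M : M^2 = M,\ \operatorname{Tr} M = K\}$ and $\mathcal{M}^{\mathrm{spec}} = \{M : M^2 = M\}$ the relaxed sets of rescaled equivalence matrices.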
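The spectral relaxation of this section can be sketched with a symmetric eigendecomposition. The function name and the `K=None` convention for an unknown number of clusters are assumptions made for illustration; the sketch returns the optimal projector together with the optimal value of the relaxed problem.

```python
import numpy as np

def spectral_relaxation(A, K=None):
    """Maximize Tr(A M) over orthogonal projectors M.

    If K is given, the projector is constrained to have trace K
    (top-K eigenvectors); otherwise all eigenvectors with positive
    eigenvalues are kept.
    """
    eigvals, eigvecs = np.linalg.eigh((A + A.T) / 2)  # symmetrize for safety
    if K is not None:
        keep = np.argsort(eigvals)[::-1][:K]
    else:
        keep = np.flatnonzero(eigvals > 0)
    U = eigvecs[:, keep]
    return U @ U.T, float(eigvals[keep].sum())
```

Rounding the relaxed projector back to a valid partition (for instance by running K-means on the selected eigenvectors) is a separate step that this sketch does not cover.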
2.5 Metric learning
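In this paper, we consider learning a Mahalanobis metric, which may be parameterized by a positive definite matrix $B \in \mathbb{R}^{P \times P}$. This corresponds to replacing dot-products $x_i^\top x_j$ by $x_i^\top B x_j$, and $X X^\top$ by $X B X^\top$. Thus, when the number of clusters is known, this corresponds to
$$ \min_{M \in \mathcal{M}_K} \operatorname{Tr}\bigl[X B X^\top (I - M)\bigr] \qquad (5) $$
or, when the number of clusters is unknown, to:
$$ \min_{M \in \mathcal{M}} \operatorname{Tr}\bigl[X B X^\top (I - M)\bigr] + \lambda \operatorname{Tr} M \qquad (6) $$
Note that by replacing $B$ by $\lambda B$ and dividing the equation by $\lambda$, we may use an equivalent formulation of Eq. (6) with $\lambda = 1$, that is:
$$ \min_{M \in \mathcal{M}} \operatorname{Tr}\bigl[X B X^\top (I - M)\bigr] + \operatorname{Tr} M \qquad (7) $$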
The key aspect of the partitioning problem is that it is formulated as optimizing, with respect to the partition matrix $M$, a function linearly parameterized by the metric $B$. The linear parametrization in $B$ will be useful when defining proper losses and efficient loss-augmented inference in Section 4.
Note that we may allow the metric matrix $B$ to be just positive semidefinite. In that case, the zero eigenvalues of the pseudo-metric correspond to irrelevant dimensions. That means in particular that we have performed dimensionality reduction on the input data. We propose a simple way to encourage this desirable property in Section 4.3.
3 Loss between partitions
Before going further and applying the framework of structured prediction [29] in the context of metric learning, we need to find a loss on the output space of possible partitionings which is well suited to our context. To avoid any notation conflict, we will refer in this section to as a general set of partitions (it can correspond for instance to ).
3.1 Some standard losses
The Rand index
When comparing partitions [16], a standard way to measure how different two of them are is to use the Rand [23] index which is defined, for two partitions of the same set of elements and as the sum of concordant pairs over the number of possible pairs. More precisely, if we consider all the possible pairs of elements of , the concordant pairs are defined as the sum of the pairs of elements which both belong to the same set in and and of the pairs which are not in the same set both in and . In matricial terms, it is linked to the Frobenius distance between the equivalence matrices representing and (these matrices are binary matrices of size which are 1 if and only if the element and the element belong to the same set of the partition).
This loss is not necessarily very well suited to our problem, since intuitively one can see that it doesn’t take into account the size of each subset inside the partition, whereas our concern is to optimize intra class variance which is a rescaled indicator.
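The Rand index described above can be computed directly from its pairwise definition. The sketch below is a straightforward quadratic-time illustration in which partitions are given as label sequences (an assumed encoding); it is not an optimized implementation.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which two partitions agree.

    A pair is concordant if it is grouped together in both partitions
    or separated in both partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same elements")
    pairs = list(combinations(range(len(labels_a)), 2))
    concordant = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return concordant / len(pairs)
```

Every pair contributes equally to the index, whatever the size of the clusters involved, which illustrates why the index is insensitive to cluster sizes.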
Hausdorff distance
In the changepoint detection literature, a very common way to measure dissimilarities between partitions is the so-called Hausdorff distance [6] on the elements of the frontier of the elements of the partitions (the need for a frontier makes it inapplicable directly to the case of general clustering). Let us consider two partitions of a finite set of T elements. We assume that the elements have a sequential order and thus elements of partitions and have to be contiguous. It is then possible to define the frontier (or set of ruptures) of as the collection of indexes . Then, by embedding the set into (this just amounts to normalizing the time indexes so that they are in ), we can consider a distance on , (typically the absolute value) and then define the associated Hausdorff distance
The loss considered in our context
In this paper, we consider the following loss, which was originally proposed in a slightly different form by [16] and has since been widely used in the field of clustering [3]. This loss is a variation of the association in a contingency table (see [16]). More precisely, if we consider the contingency table associated to (partition of a set of size ) with elements and with elements (the contingency table being the table such that the number of elements in element of and in element of ), we have that .
(8) 
Moreover, if the partitions encoded by and have clusters and , then This loss is equal to zero if the partitions are equal, and always less than . Another equivalent interpretation of this index is given by, with the usual convention that for the element of indexed by is the subset of where belongs:
This index seems intuitively much better suited to the study of the problem of variance minimization since it involves the rescaled equivalence matrices which naturally parametrize this kind of problem. We examine in the Appendix more facts about these losses and their links, especially about the asymptotic behaviour of the loss we use in the paper. We also show a link between this loss and the Hausdorff distance in the case of changepoint detection.
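A loss of this type can be sketched as the squared Frobenius distance between the rescaled equivalence matrices of two partitions. The normalization constant of Eq. (8) is not reproduced here, so the unnormalized version below is an assumption, as are the function names. Because rescaled equivalence matrices are projectors, the squared distance reduces to the two cluster counts minus twice the trace of the product, which the sketch also computes.

```python
import numpy as np

def rescaled_equivalence(labels):
    """Return M = Y (Y^T Y)^{-1} Y^T for a vector of integer cluster labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    Y = (labels[:, None] == clusters[None, :]).astype(float)
    return (Y / Y.sum(axis=0)) @ Y.T

def partition_loss(labels_a, labels_b):
    """Unnormalized squared Frobenius distance between the two partitions."""
    M = rescaled_equivalence(labels_a)
    N = rescaled_equivalence(labels_b)
    return float(np.sum((M - N) ** 2))

def partition_loss_trace_form(labels_a, labels_b):
    """Same quantity written as K + L - 2 Tr(M N)."""
    M = rescaled_equivalence(labels_a)
    N = rescaled_equivalence(labels_b)
    K = len(np.unique(labels_a))
    L = len(np.unique(labels_b))
    return float(K + L - 2.0 * np.trace(M @ N))
```

The loss is invariant to a relabelling of the clusters, since it only depends on which points are grouped together and on the cluster sizes.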
4 Structured prediction for metric learning
As shown in the previous section, our goal is to learn a positive definite matrix , in order to improve the performance of the structured output algorithm that minimizes, with respect to , the cost function of Eq. (7). Using the change of variable described in the table below, the partitioning problem may be cast as
where is the Frobenius dot product.
Number of clusters 


Known  
Unknown 
We denote by the vector space to which the vector defined above belongs. Our goal is thus to estimate from pairs of observations . This is exactly the goal of large-margin structured prediction [29], which we now present. We denote by a generic set of matrices, which may either be , , , , , , depending on the situation (see Section 4.2 for specific cases).
4.1 Largemargin structured output learning
In the margin-rescaling framework of [29], using a certain loss between elements of (here partitions), the goal is to minimize with respect to ,
where is any (typically convex) regularizer. This framework is standard in machine learning in general and metric learning in particular (see, e.g., [17]). This loss function is not convex in , and may be replaced by the convex surrogate
leading to the minimization of
(9) 
In order to apply this framework, several elements are needed: (a) a regularizer , (b) a loss function , and (c) the associated efficient algorithms for computing , i.e., solving the loss-augmented inference problem .
As discussed in Section 3, a natural loss on our output space is given by the Frobenius norm of the rescaled equivalence matrices associated to partitions.
4.2 Loss-augmented inference problem
Efficient minimization is key to the applicability of large-margin structured prediction and this problem is a classical computational bottleneck. In our situation the cardinality of is exponential, but the choice of loss between partitions leads to the problem where:

if the number of clusters is known.

otherwise.
Thus, the loss-augmented problem may be solved exactly for changepoint problems (see Section 2.3) or through a spectral relaxation otherwise (see Section 2.4). Namely, for changepoint detection problems, is either or , while for general partitioning problems, it is either or .
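Why the loss-augmented problem keeps the same linear form can be sketched as follows. The derivation assumes that the loss is a positive multiple $c$ of the squared Frobenius distance between rescaled equivalence matrices, writes $M_i$ for the ground-truth matrix of the $i$-th dataset $X_i$, $B$ for the metric and $\lambda$ for the penalty on the number of clusters; these symbols and the constant $c$ are notational assumptions. It uses $\|M\|_F^2 = \operatorname{Tr} M$, which holds because $M$ is a symmetric projector.

```latex
\begin{align*}
\ell(M, M_i)
  &= c\,\|M - M_i\|_F^2
   = c\,\bigl(\operatorname{Tr} M + \operatorname{Tr} M_i - 2\operatorname{Tr}(M_i M)\bigr), \\
\ell(M, M_i) + \operatorname{Tr}\bigl(X_i B X_i^\top M\bigr) - \lambda \operatorname{Tr} M
  &= \operatorname{Tr}\bigl[\bigl(X_i B X_i^\top - 2c\,M_i + (c - \lambda) I\bigr)\,M\bigr]
     + c\,\operatorname{Tr} M_i .
\end{align*}
```

Under these assumptions, maximizing over the partition matrix is again the maximization of a linear function of it, with a matrix that is in general not positive semidefinite. When the number of clusters is fixed, the terms proportional to the trace of the partition matrix are constant and can be dropped.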
4.3 Regularizer
We may consider several parametrizations/regularizers for our positive semidefinite matrix . We may classically (see e.g, [17]) penalize , which is the classical squared Euclidean norm. However, two variants of our algorithm are often needed for practical problems.
Diagonal metric.
To limit the number of parameters, we may be interested in only reweighting the different dimensions of the input data, i.e., we can constrain the metric to be diagonal, i.e., where . Then, the constraint is , and we may penalize by or , depending on whether we want to promote zeros in (i.e., to do feature selection).
Low-rank metric.
Another potentially desirable property is the interpretability of the obtained metric in terms of its eigenvectors. Ideally we want to have a pseudo-metric with a small rank. As is classically done, we relax the rank into the sum of singular values. Here, since the matrix is symmetric positive definite, this is simply its trace.
4.4 Optimization
In order to optimize the objective function of Eq. (9), we can use several optimization techniques. This objective has the drawback of being nonsmooth, and thus the convergence rates that we can expect are not very fast.
In the structured prediction literature, the most common solvers are based on cutting-plane methods (see [29]), which can be used in our case for low-dimensional problems. Otherwise we use a projected subgradient method, which leads to more numerous but cheaper iterations.
Cutting-plane and bundle methods [28] show the best speed when the dimension of the feature space of the data to partition is low, but were empirically outperformed by a subgradient method in the very high-dimensional setting.
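The projection used by a projected subgradient method on the set of positive semidefinite matrices can be sketched as follows. The step-size schedule and the computation of the subgradient itself (which requires solving the loss-augmented inference problem) are outside the scope of this sketch, and the function names are hypothetical.

```python
import numpy as np

def project_psd(B):
    """Euclidean (Frobenius) projection of a symmetric matrix onto the PSD cone."""
    B = (B + B.T) / 2  # guard against small asymmetries
    w, V = np.linalg.eigh(B)
    return (V * np.clip(w, 0.0, None)) @ V.T

def projected_subgradient_step(B, subgradient, step_size):
    """One iteration: move against the subgradient, then project back."""
    return project_psd(B - step_size * subgradient)
```

Each iteration costs one eigendecomposition of a matrix whose size is the dimension of the data, which is cheap compared with solving the quadratic program of a cutting-plane iteration when that dimension is large.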
5 Extensions
We now present extensions which make our metric learning more generally applicable.
5.1 Spectral clustering and normalized cuts
Normalized cut segmentation is a graph-based formulation for clustering aiming at finding roughly balanced cuts in graphs [25]. The input data is now replaced by a similarity matrix and, for a known number of clusters , as shown by [22, 3], it is exactly equivalent to
where is the normalized similarity matrix.
Parametrization of the similarity matrix .
Typically, given data points $x_1, \dots, x_T \in \mathbb{R}^P$ (in image segmentation problems, these are often the concatenation of the positions in the image and local feature vectors), the similarity matrix $W$ is computed as
$$ W_{ij} = \exp\bigl(-(x_i - x_j)^\top B\, (x_i - x_j)\bigr), \qquad (10) $$
where $B$ is a positive semidefinite matrix. Learning the matrix $B$ is thus of key practical importance.
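A similarity matrix of this kind can be sketched as a Gaussian kernel on Mahalanobis distances. The exact form (in particular the absence of a factor one half in the exponent) is an assumption made for illustration, as are the function and variable names.

```python
import numpy as np

def gaussian_similarity(X, B):
    """W[i, j] = exp(-(x_i - x_j)^T B (x_i - x_j)) for the rows x_i of X."""
    diff = X[:, None, :] - X[None, :, :]  # shape (T, T, P)
    sq_dist = np.einsum("ijp,pq,ijq->ij", diff, B, diff)
    return np.exp(-sq_dist)
```

The pairwise difference tensor has size T x T x P, so for images with many pixels a blockwise or sparse computation would be needed in practice.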
However, our formulation would lead to efficiently learning (as a convex optimization problem) parameters only for a linear parametrization of . While the linear combination is attractive computationally, we follow the experience from the supervised setting where learning linear combinations of kernels, while formulated as a convex problem, does not significantly improve on methods that learn the metric within a Gaussian kernel with nonconvex approaches (see, e.g., [12, 20]).
We thus stick to the parametrization of Eq. (10). In order to make the problem simpler and more tractable, we consider spectral clustering directly with and not with its normalized version, i.e., our partitioning problem becomes
In order to solve the previous problem, the spectral relaxation outlined in Section 2.4 may be used, and corresponds to computing the eigenvectors of (the first ones if is known, and the ones corresponding to eigenvalues greater than a certain threshold otherwise).
Nonconvex optimization.
In our structured output prediction formulation, the loss function for the th observation becomes (for the case where the number of clusters is known):
It is not a convex function of ; however, it is the sum of a concave and a convex function (i.e., a difference of convex functions), which can be dealt with using a majorization-minimization algorithm [33]. The idea of this algorithm is simply to upper-bound the concave part by its linear tangent. Then the problem becomes convex and can be optimized using one of the methods proposed in Section 4.4. We then iterate the process, which is known to converge to a stationary point.
5.2 Partial labellings
The large-margin convex optimization framework relies on fully labelled datasets, i.e., pairs where is a dataset and the corresponding rescaled equivalence matrix. In many situations however, only partial information is available. In these situations, starting from the PCA metric, we propose to iterate between (a) labelling all datasets using the current metric while respecting the constraints imposed by the partial labels, and (b) learning the metric using Section 4 from the fully labelled datasets. See an application in Section 6.1.
5.3 Detecting changes in distribution of temporal signals
In sequential problems, for now, we are just able to detect changes in the mean of the distribution of time series but not to detect changepoints in the whole distribution (e.g., the mean may be constant but the variance piecewise constant). Let us consider a temporal series in which some breakpoints occur in the distribution of the data. From this single series, we build several series that make it possible to detect these changes, by considering features built from , in which the change of distribution appears as a change in mean. A naive way would be to consider the moments of the data but unfortunately as grows these moments explode. A way to prevent them from exploding is to use the robust Hermite moments [31]. These moments are computed using the Hermite functions and make it possible to consider the dimensional series , where is the th Hermite function .
Bioinformatics application.
Detection of changepoints in DNA sequences for cancer prognosis provides a natural testbed for this approach. Indeed, in this field, researchers face data which are linked to the number of copies of each gene along the DNA (aCGH data as used in [15]). The presence of such changes is generally related to the development of certain types of cancers. On the data from the Neuroblastoma dataset [15], some karyotypes with changes of distribution were manually annotated. Without any metric learning, the global error rate in changepoint identification is 12%. By considering the first 5 Hermite moments and learning a metric, we reach a rate of 6.9%, thus significantly improving the performance.
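The Hermite-function features of Section 5.3 can be sketched as follows. The sketch assumes the standard orthonormal Hermite functions (physicists' Hermite polynomials times a Gaussian envelope, indexed from zero), which may differ in indexing or scaling from the definition used in [31]; the function name is hypothetical. Each univariate sample is mapped to a bounded feature vector, so that a change in distribution of the series shows up as a change in the mean of the feature series.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hermite_features(x, n_functions):
    """Map a univariate series of length T to a (T, n_functions) feature series.

    Column k holds the k-th orthonormal Hermite function evaluated at x:
    H_k(x) * exp(-x^2 / 2) / sqrt(2^k * k! * sqrt(pi)).
    """
    x = np.asarray(x, dtype=float)
    columns = []
    for k in range(n_functions):
        coeffs = np.zeros(k + 1)
        coeffs[k] = 1.0  # selects the physicists' Hermite polynomial H_k
        norm = 1.0 / sqrt(2.0 ** k * factorial(k) * sqrt(pi))
        columns.append(norm * hermval(x, coeffs) * np.exp(-x ** 2 / 2))
    return np.stack(columns, axis=1)
```

Unlike raw moments, these features stay bounded whatever the order, because of the Gaussian envelope. In practice the series would typically be centred and scaled before the features are computed, a preprocessing choice that is not specified here.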
6 Experiments
We have conducted a series of experiments showing improvements of our large-margin metric learning methods over previous metric learning techniques.
6.1 Change point detection
Synthetic examples and robustness to lack of information.
We consider dimensional time series of length with an unknown number of breakpoints. Among these series only 10 are relevant to the problem of changepoint detection, i.e., 290 series have abrupt changes which should be discarded. Since the identity of the 10 relevant time series is unknown, by learning a metric we hope to obtain high weights on the relevant series and small weights on the others. The number of segments is not assumed to be known and is learned automatically.
Moreover, in this experiment we progressively remove information, in the sense that as input of the algorithm we only give a fraction of the original time series (and we measure the amount of information given through the ratio of the given temporal series compared to the original one). Results are presented in Figure 1. As expected, the performance without metric learning is bad, while it is improved with PCA. Techniques such as RCA [4] which use the labels improve even more (all datasets were stacked into a single one with the corresponding supervision); however, RCA is not directly adapted to changepoint detection, it requires dimensionality reduction to work, and its performance is not robust to the choice of the number of dimensions. Note also that all methods except ours are given the exact number of changepoints. Our large-margin approach outperforms the other metrics in the convex setting (i.e., extreme right of the curves), but also in the partially supervised setting where we use the alternating approach described in Section 5.2.
Video segmentation.
We applied our method to data coming from old TV shows (the length of the time series in that case is about 5400, with 60 to 120 changepoints) where some speaking passages alternate with singing ones. The videos are from 1h up to 1h30 long. We aim at recovering the segmentation induced by the speaking parts and the musical ones. Following [2], we use GIST features for the video part and MFCC features for the audio. The features were aggregated every second so that the temporal series we are considering are several thousand vectors long, which is still computationally tractable using the dynamic programming of Algorithm 1. We used 4 shows for training, 3 for validation and 3 for testing. The running times of our Matlab implementation were on the order of a few hours.
The results are described in Table 1. We consider three different settings: using only the image stream, only the audio stream or both. In these three cases, we consider using the existing metric (no learning), PCA, or our approach. In all settings, metric learning improves performance. Note that the performance is best with only the audio stream, and that our metric learning, given both streams, manages to do almost as well as with only the audio stream, thus illustrating the robustness of using metric learning in this context.
Method 
Audio  Video  Both  

PCA  23  41  34  40  55  25  29  53  37 
Reg. parameter  29  48  33  59  55  47  40  48  36 
Metric learning  

6.2 K-means clustering
Using the partition induced by the classes as ground truth, we tested our algorithm on some classification datasets from the UCI machine learning repository, using the classification information as partitions, following the methodology proposed by [32]. This application of our framework is a little extreme in the sense that we assume only one partitioning as training point (i.e., ). The results are presented in Table 2. For the “Letters” and “Mov. Libras” datasets, there are no significant differences, while for the “Wine” dataset, RCA is the best, and for the “Iris” dataset, our large-margin approach is best: even in this extreme case, we are competitive with existing techniques.
Dataset  Ours  Euclidean  RCA  [32]  

Iris  0.55  0.43  0.02  0.30  0.01  
Wine  1.03  3.4  0.14  3.08  0.1  
Letters  34.5  41.62  34.8  0.5  35.26  
Mov. Libras  14  15  22  2  15.07  1 
6.3 Image Segmentation
We now consider learning metrics for normalized cuts and consider the Weizmann horses database [5], for which ground-truth segmentation is available. Using color and position features, we learn a metric with the method presented in Section 5.1 on 10 fully labelled images. We then test on the remaining 318 images.
We compare the results of this procedure to a cross-validation approach with an exhaustive search on a 2D grid, adjusting one parameter for the position features and another for the color ones. The loss between ground truth and segmentations obtained by the normalized cuts algorithm is measured either by Eq. (8) or the Jaccard distance. Results are summarized in Table 3, with some visual examples in Figure 2. The metric learning within the Gaussian kernel significantly improves performance. The running times of our pure Matlab implementation were on the order of several hours to get convergence of the convex-concave procedure we used.
Loss used  Learned metric  Grid  
Loss of Eq. (8)  1.54  1.77  0.3 
Jaccard distance  0.45  0.53  0.11 

The last column is the standard deviation of the difference between the loss with our metric and the loss with the grid search. To assess the significance of our results, we perform t-tests whose p-values are respectively and .
7 Conclusion
We have presented a large-margin framework to learn metrics for unsupervised partitioning problems, with application in particular to changepoint detection in video streams and image segmentation, with a significant improvement in partitioning performance. For the applicative part, following recent trends in image segmentation (see, e.g., [18]), it would be interesting to extend our changepoint framework so that it allows unsupervised co-segmentation of several videos: each segment could then be automatically labelled so that segments from different videos but with the same label correspond to the same action.
References
[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, Dec. 1974.
[2] S. Arlot, A. Celisse, and Z. Harchaoui. Kernel change-point detection, Feb. 2012. arXiv:1202.3878.
[3] F. Bach and M. Jordan. Learning spectral clustering. In Adv. NIPS, 2003.
[4] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6(1):937, 2006.
[5] E. Borenstein and S. Ullman. Learning to segment. In Proc. ECCV, 2004.
[6] L. Boysen, A. Kempe, V. Liebscher, A. Munk, and O. Wittich. Consistencies and rates of convergence of jump-penalized least squares estimators. Annals of Statistics, 37:157–183.
[7] T. S. Caetano, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph matching. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.
[8] J. Chen and A. K. Gupta. Parametric Statistical Change Point Analysis. Birkhäuser, 2011.
[9] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995.
[10] F. De la Torre and T. Kanade. Discriminative cluster analysis. In Proc. ICML, 2006.
[11] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. Sig. Proc., 53(8):2961–2974, 2005.
[12] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In Proc. ICCV, 2009.
[13] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Adv. NIPS, 2004.
[14] J. C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Applied Statistics, pages 54–64, 1969.
[15] T. Hocking, G. Schleiermacher, I. Janoueix-Lerosey, O. Delattre, F. Bach, and J.-P. Vert. Learning smoothing models of copy number profiles using breakpoint annotations. HAL, archives ouvertes, 2012.
[16] L. J. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[17] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon. Metric and kernel learning using a linear transformation. J. Mach. Learn. Res., 13:519–547, Mar. 2012.
[18] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proc. CVPR, 2010.
[19] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501–1510, 2005.
[20] M. Marszałek, C. Schmid, H. Harzallah, and J. Van De Weijer. Learning object representations for visual object class recognition. Technical Report 00548669, HAL, 2007.
[21] B. McFee and G. Lanckriet. Metric learning to rank. In Proceedings of the 27th Annual International Conference on Machine Learning (ICML), 2010.
[22] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Adv. NIPS, 2002.
[23] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[24] G. Rigaill. Pruned dynamic programming for optimal multiple change-point detection. Technical Report 1004.0887, arXiv, 2010.
[25] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22:888–905, 1997.
[26] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proc. ECCV, 2008.
[27] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Adv. NIPS, 2003.
[28] C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 2009.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. Journal of Machine Learning Research, 2005.
[30] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Adv. NIPS, 2006.
[31] M. Welling. Robust higher order statistics. In Proc. Int. Workshop Artif. Intell. Statist. (AISTATS), 2005.
[32] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with applications to clustering with side-information. In Adv. NIPS, 2002.
[33] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
A Asymptotics of the loss between partitions
Note that in this section, we will denote by the “normalized” loss between partitions. This means that, with the notations of the article when considering two matrices and representing some partitions and in the generic set of partitions , we have . Throughout this section, we will refer to the size of a partition as the number of clusters.
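To make the idea of comparing partitions through matrices concrete, here is a minimal sketch under stated assumptions: each partition is represented by its 0/1 equivalence matrix, and the loss is the squared Frobenius distance between the two matrices divided by the squared number of items. This choice of representation and scaling is an assumption made for illustration and is not necessarily the normalization used in this appendix; the names `equivalence_matrix` and `partition_loss` are hypothetical.

```python
def equivalence_matrix(labels):
    """0/1 matrix whose (i, j) entry is 1 when items i and j share a cluster."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def partition_loss(labels_p, labels_q):
    """Squared Frobenius distance between equivalence matrices, divided by T^2.

    The 1/T^2 scaling is an assumption made for this illustration.
    """
    if len(labels_p) != len(labels_q):
        raise ValueError("both partitions must cover the same items")
    t = len(labels_p)
    if t == 0:
        raise ValueError("partitions must be non-empty")
    m = equivalence_matrix(labels_p)
    n = equivalence_matrix(labels_q)
    squared_frobenius = sum(
        (m[i][j] - n[i][j]) ** 2 for i in range(t) for j in range(t)
    )
    return squared_frobenius / (t * t)


# Two partitions of four items; they disagree on 3 unordered pairs,
# i.e. 6 entries of the matrices, so the loss is 6 / 16.
labels_p = [0, 0, 1, 1]
labels_q = [0, 0, 0, 1]
loss = partition_loss(labels_p, labels_q)
```

Because the equivalence matrix depends only on which items share a cluster, the loss is unchanged when cluster labels are renamed.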
A.1 Hypothesis

We assume we consider and two partitions of the same size, with a common number of clusters .

, we denote , the flow which goes out from to when goes to .

We define the global outer flow as and the global inner flow as
A.2 Main result
Theorem 1.
Let and two partitions satisfying our hypothesis. If we note , then such that and of the same size , ,
Proof.
From the expressions of Section 3.1, we can write:
The second term is easily bounded using
We can go further, noticing that , which eventually leads to, if (and this is the case if tends to 0 in the sense of the assumption of the theorem):
We now bound the first term, which takes slightly longer:
But, for the same reasons as when we bounded the second term
Using the fact that , we finally get that, when :
Thus, putting everything together, when , we get the statement of the theorem. ∎
B Equivalence between the loss between partitions and the Hausdorff distance for change-point detection
As the title of this appendix indicates, there is a close link between the Hausdorff distance and the loss between partitions used throughout this paper in the case of change-point detection. We show here that the two distances are equivalent.
B.1 Hypothesis and notations

We consider the segmentations and as having been embedded in so that we can consider a distance on to define the Hausdorff distance between the frontiers of the elements of and .

We denote the minimal length of a segment in a partition and the maximal one.

We denote by the Hausdorff distance between partitions, as described in Section 3.
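For reference, the Hausdorff distance between two segmentations can be sketched as follows, assuming each segmentation is summarized by the positions of its change-points (segment frontiers) on the line and that the distance between two positions is their absolute difference. The function name `hausdorff_distance` and the example positions are hypothetical.

```python
def hausdorff_distance(frontiers_p, frontiers_q):
    """Hausdorff distance between two non-empty sets of change-point positions.

    It is the maximum of two directed terms: the largest distance from a
    frontier of P to its nearest frontier of Q, and the same with P and Q
    exchanged.
    """
    if not frontiers_p or not frontiers_q:
        raise ValueError("both sets of frontiers must be non-empty")

    def directed(source, target):
        # Largest distance from a source frontier to its nearest target frontier.
        return max(min(abs(s - t) for t in target) for s in source)

    return max(directed(frontiers_p, frontiers_q),
               directed(frontiers_q, frontiers_p))


# Frontiers shifted by at most 7 positions.
shifted = hausdorff_distance([10, 25, 40], [12, 25, 47])

# A spurious frontier at 20: only the second directed term is non-zero.
spurious = hausdorff_distance([10, 30], [10, 20, 30])
```

The second call shows why both directed terms are needed: every frontier of the first segmentation is matched exactly, yet the extra frontier at 20 lies 10 positions away from its nearest counterpart.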
B.2 Main result
Theorem 2.
Let P, Q denote two partitions. If and , then we have the following:
Moreover, without assuming , we get
Proof.
We first prove the upper bound. Using the expressions of Section 3.1, we have to lower-bound . Note that the hypothesis that the Hausdorff distance is less than half of the minimal segment length only serves to ensure that the th segment of partition Q can only overlap with the th, th and th elements of . Thus:
which gives the upper bound.
Note that we used the fact that the inequality holds.
The lower bound always holds, but we only give the proof in the case where the Hausdorff distance is such that and where .
We begin with some general statements:
i) By definition .
ii) If the first term in the max is attained, there exists some such that . It also means that, if we look at the sequences, there is no element of between and . Thus, by definition of the loss , and a short computation leads to .
iii) If the second term in the max is attained, the same lower bound holds after exchanging the indices.
Going back to our special case, we have and .
This leads to
∎