1.1 Clustering and change-point detection
An important problem in the vast domain of statistical learning is the question of unsupervised classification of high dimensional data. Many examples fall into this category such as the classification of curves, of images, the segmentation of domains (geographical, economical, medical, astrophysical…) into homogeneous regions given observed quantities.
The problem can be summarized into finding clusters (“coherent ensembles”) for individuals, given that each one is observed through a -dimensional vector. We will be mostly interested here (although it is not a necessary condition) in the case where
-dimensional vector. We will be mostly interested here (although it is not a necessary condition) in the case whereis large, possibly very large compared to . A lot of examples fall into this setting, for instance the case where you need to partition a geographic zone into smaller areas which are highly homogeneous with respect to climatic quantities such as temperature, wind measures. In such a case the data generally consists, for meteorological stations —regularly spaced on the zone— of years of hourly measures, leading to a huge vector of dimension for each station.
In such a case, the clustering problem can generally be summarized in two steps: in a first step, a preprocessing finds a representation of the data, which can be the raw data or a more subtle transformation. Then, a segmentation algorithm is performed on the transformed data. We will focus here on the case where this algorithm is the famous -means algorithm, corresponding to estimation via the empirical risk minimizer.
Between or around these two steps and especially when is very large, there is a need for “smoothing”, or in other terms, to reduce the dimension . This is especially important from a computational point of view. Without this step, the -means algorithm might be unstable or even not work at all.
In this paper, we will consider the problem from a theoretical point of view (as opposed to algorithmic point of view).
More precisely, we will concentrate on the following questions:
Without referring to the feasibility, what is more efficient to obtain a clustering result: keeping the raw data or smoothing the data?
If the data are high-dimensional but “sparse”, is there a way to use this sparsity to get better clustering results?
If smoothing proves to be efficient, how could it be performed? Do usual nonparametric smoothing methods work as well in a clustering problem?
Does on-line (signal by signal smoothing —station by station in the meteorological example) performs as well as off-line smoothing (using a preprocessing involving all the signals)?
We will attack this problem in a much simpler setting, and see that even in this simplified framework, there are still open questions. The first simplification will be that the number of classes will be fixed: we will assume that there are exactly two classes. Moreover, we make the more restrictive assumption that the change between one class and the other occurs on a time scale. In other words, there exists a change-point : before , the observations have a certain regime, after , they have another regime.
In a wide range of areas, change-point problems may occur in a high-dimensional context. This is the case for instance in the analysis of network traffic data Levy-Leduc and Roueff (2009); Lung-Yut-Fong et al. (2012), in bioinformatics, when studying copy-number variation Bleakley and Vert (2011); Zhang et al. (2010), in astrostatistics Bourguignon et al. (2011); Suleiman et al. (2014); Meillier et al. (2016) or in multimedia indexation Harchaoui et al. (2009). In these practical applications, the number of observations is relatively small compared with their dimension, with the change-point possibly occurring only for a few components.
The change-point problem has its own interest and has also a long history. For an introduction to the domain, the reader may refer for instance to the monographs and articles by Shiryaev (1978), Ritov (1990), Müller (1992), Basseville and Nikiforov (1993), Brodsky and Darkhovsky (1993), Carlstein et al. (1994), Csörgő and Horváth (1997). Change-point detection based on resampling has been investigated in Fiosina and Fiosins (2011) and Arlot and Celisse (2011).
Minimax estimation is considered for example in Korostelev (1987) , in the Gaussian white noise model.
In this framework, high-dimensional change-point problems are studied by
, in the Gaussian white noise model. In this framework, high-dimensional change-point problems are studied byKorostelev and Lepski (2008), who propose an asymptotically minimax estimator of the change-point location, when the Euclidean norm of the gap tends to as .
1.2 Main results and organization of the paper
We begin Section 2 by introducing the two class model, and the change-point model. As well, we present the empirical minimizer estimator of the change-point, depending on the smoothing.
We prove that up to a logarithmic term the empirical minimizer (-means) has a minimax rate of convergence. It is important to notice that we do not know whether this logarithmic term is necessary or not. Indeed, in Korostelev and Lepski (2008), “the edges are known”, meaning that the minimax rate is established in the case where the change-point cannot occur before a known proportion of the observation and as well after a known proportion of observation. Our method of estimation is agnostic to this knowledge, creating obvious additional difficulties.
Secondly, in Section 3, we show that if the data is sparse —here, in a “Sobolev” sense, there exists an optimal smoothing. To attain this optimal smoothing, we provide a method relying on the Lepski smoother. This method, which basically consists in performing a Lepski smoothing on a surrogate data vector built on the whole observation, has the advantage on being performed beforehand and will not create additional computational difficulties in the -means algorithm. It could be interestingly compared with the lasso--means (see Levrard (2013, 2015)), which is known to be difficult to implement in large data sets.
We provide in Section 4 a numerical experimentation of our methods, which shows promising results.
Section 5 is devoted to the proofs.
2 Two class model – Change-point model
Let . For a set , we will denote its cardinal by . We observe independent signals . We assume that each signal , is observed discretely, through components: for every , is a random vector with values in .
We consider the following Gaussian clustering framework. We suppose that there exist a set (unknown) and two vectors and of such that
Recently, Enikeeva and Harchaoui (2017) have considered a similar Gaussian model from the point of view of testing. The aim is to test whether there is a change-point or not. The Gaussian vectors may be high-dimensional, with the change possibly occurring in an unknown subset of the components. In a double asymptotic setting, where the number of observations and the dimension grow to infinity, the authors build an optimal test.
The covariance matrix of the noise is chosen to be proportional to identity. Choosing instead a covariance of the form , where is a known matrix, would lead to a somewhat identical discussion by a simple change of variable, provided that we make also the appropriate regularity assumptions on the parameters and .
For a first step, we suppose here to be known. Note that may depend on . For instance, if we think of a Gaussian white noise, then would frequently be of the form , where is an absolute known constant. We will not investigate the case where is unknown.
The fact that the noise is Gaussian is a simplification which is useful but not essential: basically, concentration inequalities are needed and similar results are likely under sub-Gaussian hypotheses on the errors.
We will mainly be interested in the behavior of the maximum likelihood (MLE) estimators (also known as two-class -means estimators in this context):
2.1 Simplified two-class model: change-point clustering
Change-point clustering essentially means that the set has the following form:
In other terms, there exist a change-point and two vectors and of , such that the model is given by
To prove our results, we will additionally assume the following conditions in different places.
2.1.1 Condition [Edge-out]
We assume that there exists , such that . This condition is introduced basically to avoid problems at the border of the interval . It is important to notice that our results will depend on . However, the procedure is agnostic to , which is not supposed to be known.
2.1.2 Condition on the means 
For , we define
We will suppose that and are in .
This assumption expresses a form of sparsity of the coefficients, which reflects an ordering in their importance: the first ones are supposedly more important than the last ones. This is quite a reasonable assumption when, as it is generally the case, the clustering is operated in two steps: during the first one, the data is projected to a feature space (in a linear or nonlinear way) supposedly reflecting the salient parts of the data.
Note that there are possible extensions to other kinds of sparsity, considering for instance coefficients belonging to the set
where , but this choice requires more sophisticated smoothing algorithms.
2.2 Smooth estimation of
Our problem is, considering the regularity assumptions, to determine, for the problem of estimating the change-point , whether or not it is efficient to smooth the data. More specifically, we will investigate the effect of replacing the vectors , (called in the sequel “raw data”), by, for , , , the vectors of of the first coordinates of .
We consider the associated empirical minimizer:
and we set
To investigate the behavior of the procedure, we begin by proving a result describing the behavior of this estimated change-point .
In the sequel, we will use the notation
We also define, for ,
Let us assume condition [edge-out].
For any , there exist constants and such that, if
Note that no condition on the sparsity of and is needed for this result.
Thanks to Korostelev and Lepski (2008), one can observe that is the minimax rate in this framework. Compared to their result, we are apparently loosing a logarithmic factor. But it is important to stress that in their paper, the bound was supposed to be known, whereas our estimator is adaptive in . So it is not absurd to suggest that the logarithmic term might be necessary.
For , . The rate is composed of two different regimes: a “good one” , not depending on the dimension and a “slow one” which is rapidly deteriorating with the dimension.
From the results above, we deduce that if , the rate of convergence is , whereas if , it is . This last rate is obviously much better, and with this latter condition on , taking (so raw data) allows to obtain the best rate . Taking a smaller could lead to a reduction of damaging the rate.
However this latter condition is quite restrictive on when is large. In the next paragraph, we will try to refine this condition, gaining on the size of the smoothed vector.
Without assumptions on the behavior of the parameters and , there is nothing much to hope about the way is increasing in . However, the regularity assumption allows us to assume that for such that , then and are comparable, in the sense that . Indeed, , so that
This is precisely what is exploited in the first part of Theorem 1 below.
Let us observe that if and are comparable, then is much easier to analyse. In particular we see that again it is composed of two regimes —a slow one and a good one— and the dependence in is more clear: for , and for larger ’s.
This is also corresponding to what is frequently observed in practical applications: when the dimension is increasing, one first observes indications of convergence being decreasing, then stable for a while and then increasing substantially.
2.3 Consequences of Proposition 1
An immediate consequence of Proposition 1 is the following theorem.
We consider the model (1), and we assume conditions [edge-out], and .
For any , there exist constants and such that, if
Optimizing Condition (2) in leads to .
Under the conditions above, for any , there exist constants and such that, if
We see here that there is an advantage in smoothing since it allows to obtain the best rate with less restricting conditions on the gap .
We see that the greater the parameter , the faster the rate of convergence of , which is natural, since corresponds to the Euclidean distance between the two means and and the segmentation task is obviously easier when groups are well-separated.
At first sight, the rate of convergence and the conditions could seem quite unsatisfactory, but observe that very often is of the form . In this case, the rate of convergence is of the order
If we now look for a procedure searching for an optimal in an adaptive way (without knowing the regularity ), some remarks can be made before giving a solution. In particular, one may ask whether it is possible to optimize individually (on each signal of ), or if it is necessary to perform an off-line preprocessing (requiring the use of all the signals). The form of the optimal smoothing allows to answer this question, proving that any adaptive smoothing performed individually on each signal (thresholding, lasso… ) would lead instead to an optimal smoother of the form: , inevitably creating in the rates a loss of a polynomial factor in . This means that it is certainly more efficient to find a procedure performing the smoothing globally (off-line).
3 Adaptive choice of
3.1 Lepski’s procedure
where the ’s are i.i.d. , a standard choice for the smoothing parameter consists in defining as follows:
where is a tuning constant of the procedure.
Here, we will use Lepski’s procedure in a special case.
First, using the complete data set (so off-line), we will create a surrogate data vector, estimating a parameter of regularity . These data will be used to find an optimal .
Of course, it is known that estimating the regularity of a signal is impossible without important extraneous assumptions, but what adaptive procedures are producing —and especially in this case Lepski’s procedure— is a smoothing parameter which, with overwhelming probability will be smaller than the optimal one
which, with overwhelming probability will be smaller than the optimal one(defined above). This is not enough when one wants to estimate the regularity (unless extraneous assumptions are imposed). However, fortunately, Lepski’s procedure also controls the bias of the procedure, assuring that is still reasonable, which is precisely the need here.
3.2.1 Surrogate data
For the sake of simplicity, we consider that is even; otherwise, the modifications are elementary.
Let us consider the following vector:
It is easy to see that this model is a special case of (3) with
We consider the Lepski procedure applied to the vector , producing a smoothing parameter . This smoothing parameter is then just plugged in the -means procedure for estimating .
The following theorem states that the method leads to an optimal selection, up to logarithmic terms (same convergence rate as in Corollary 1 with ).
In the model (1), we assume that and belong to . We suppose that there exists a constant such that
Then, for any , there exist constants and and such that, if
4 Numerical study
In this section, we provide some simulations illustrating our theoretical results.
4.1 Rate of convergence
In this experiment, we study the rate of convergence of the estimator . Let , , , . Let us consider data generated from Model (1) with the means and obtained from the following distribution: .
To get a first insight about the rate of convergence, we simulate 1000 times a sample of length , for chosen between 20 and 4000, and plot in Figure 1 the mean and median of the error over the 1000 trials in function of , together with the function corresponding to the theoretical rate of convergence obtained in Proposition 1. Note that the rate of convergence of is given in the proposition up to a constant . Nevertheless, the figure provides an appropriate illustration of the result as soon as is large enough.
Then, simulating 1000 samples, for each value of the sample size between 500 and 4000, we try to estimate of the rate of convergence by computing the linear regression of
between 500 and 4000, we try to estimate of the rate of convergence by computing the linear regression ofby : omitting the logarithmic factor, an exponent is to be found, corresponding to the rate of convergence . Figure 2 provides an illustration of this linear regression, considering again the mean and the median over the 1000 trials. On this example, the estimated slope of the regression line is for the mean and for the median.
4.2 Selection of
In Theorem 2, we suggest to select using the Lepski method. Before introducing a practical procedure for the selection of , let us illustrate the fact that the performance of the estimator may indeed vary a lot as a function of , so that selecting the right is a crucial issue in the estimation of .
We set , , , . We consider data generated from Model (1) with means and built as follows:
Case : , , for .
Case : is such that for , and for . is such that for , and for .
We simulated 5000 data sets according to Model (1) in each of the two cases. Figure 3 and 4 show the mean and median error over the 5000 trials in function of . In the first case, the best result is obtained already with , whereas for the second, taking around 30 is a good choice.
Theorem 2 provides a theoretical way to select . However, since the statement depends on an unknown tuning constant , the theorem cannot be used directly for choosing in practice. In the sequel, two selection procedures for are investigated, yielding two estimators and .
Method 1. This method is often used to replace the search of tuning constants in adaptive methods. The idea is instead to find a division of the set into and its complementary, where the two subsets are corresponding to 2 “regimes” for the data, one of “big coefficients”, one of small ones.
Let and and consider
This quantity is computed for every and the value is chosen such that
Indeed, this -means-like procedure, by searching for a change-point along , should separate the first most significative differences , , from the remaining ones, expected to be less significative for estimating , in such a way that keeping for the estimation all components until seems a reasonable choice.
Method 2. The second idea is more computationally involved and based on subsampling. When performing subsampling, the indices drawn at random are sorted, so that the parameter of interest remains indeed approximatively unchanged. For each , we compute for a collection of subsamples. Then, is set to the value of
minimizing the variance ofover all subsamples. Here, 100 subsamples are built, each of them containing of the initial sample.
Proportions of data from to have also been tried, with quite similar results. Observe that picking a quite small proportion of data for subsampling could be interesting since it provides more variability between the subsamples, but, at the same time, the fact that the ratio between the dimension and the sample size is modified may be annoying when the aim is to select . We also considered a version of subsampling where a different subsampling index is drawn for every : again, this provides more variability in the subsamples, but may also vary more than in the classical version. The results were not significantly different.
The performance of the two methods is compared with the result obtained using the value of minimizing the average value of over a large number of trials, called hereafter oracle (here, as obtained above for 5000 trials). Of course, is not available in practice, since it depends on the true . However, it is introduced as a benchmark. The results, corresponding to 1000 trials, are shown in Figure 5 and Table 1. Observe that the performances of the two methods are very similar, with a slight advantage of Method 2 over Method 1. However, Method 2 is based on subsampling, and, as such, is more CPU-time consuming.
|Error over 1000 trials||Oracle|
5.1 Proof of Proposition 1
Our proof will heavily rely on standard concentration inequalities, detailed in Appendix (see Section 6).
In the sequel, for the sake of simplicity, we will assume that, additionally to , . This will not have any consequence on the result but will avoid unnecessary integer parts.
Also, in this proof, will be replaced by , when there is no possible confusion.
Let us denote by the probability distribution associated with model
the probability distribution associated with model1. We will consider the behavior of our estimators under the probability . Using the notation , and , observe that may be defined in the following way:
Here, the function is given (for ) by
Let us consider the case . The other case can be treated in a symmetrical way.
For , and under the distribution , we may write
where is the vector . Now, we have to minimize in this expression
The minimum is attained by taking, for every ,
So, the minimum is
Under , can be written in the following way:
Observe that , , are independent random variables, as well as , . Moreover, , .
We will only consider the first term, the other one can be treated in a symmetrical way. We have