Given the growing number of available statistical estimation strategies, trying to combine several procedures in a smart way is a very natural idea, and so, an abundant literature on aggregation of estimators for different types of statistical models has sprung up in recent years. A widespread aggregation method consists in building a linear or a convex combination of a bunch of initial estimators (see, for instance, Catoni (2004), Juditsky and Nemirovski (2000), Nemirovski (2000), Yang (2000, 2001, 2004), Györfi et al. (2002), Wegkamp (2003), Audibert (2004), Bunea et al. (2006, 2007a, 2007b), and Dalalyan and Tsybakov (2008). Observe that the model selection approach, which aims at selecting the best estimator in the list, is also linked to the same goal (see, for example, the monograph by Massart (2007)).
In this paper, we focus on the question of combining estimators for two kinds of statistical procedures, pertaining to the supervised learning framework: classification, which consists in assigning to an observation a label, out of a finite number of candidates, and regression, whose aim is to associate real outputs to entries.
Beside the usual linear aggregation and model selection methods, a quite different point of view has been introduced by Mojirsheibani (1999) for classification. The idea consists in an original combination method, which is non linear in the initial estimators and is based on a consensus concept. More specifically, in this combining scheme, an observation is considered to be reliable for contributing to the classification of a new query point if all initial classifiers predict the same label for both points. Then, the label for the query point is estimated thanks to a majority vote among the labels of the observations which have been retained this way. Note that more regular versions, based on smoothing kernels, have also been developed (Mojirsheibani (2000)). A numerical comparison study of several combining schemes is available in Mojirsheibani (2002b), and recently, a variant of the method has been proposed in Balakrishnan and Mojirsheibani (2015).
This strategy has just been adapted in the regression framework by Biau et al. (2016). In this context, an observation is used in the combination step if all the initial estimators predict a similar value for the observation and the new point: the difference between both predictions is required to be less than some prespecified threshold. Then, the new prediction by the combined estimator is the average of the outputs corresponding to the selected entries. Note that the functional data framework has also been considered, by Cholaquidis et al. (2016).
In the classification case as well as in the regression one, when the initial list contains a consistent estimator, it can be shown that the combined estimator inherits this consistency property. Let us mention that the techniques of proof, and consequently, the assumptions made in both situations, are quite different. For instance, the number of initial estimators is expected to tend to infinity with the sample size in Mojirsheibani (2000) whereas it is fixed in Biau et al. (2016).
It is worth pointing out that, in both contexts described above, the condition for an observation to be reliable is in principle required to be satisfied for all estimators. In a further paper, Mojirsheibani (2002a) notes that this rule may seem too restrictive and proposes to allow a few disagreements (typically, a single one). The resulting classifier is still consistent provided that the number of initial classifiers keeps tending to infinity after removing those with disagreement. Similarly, in Biau et al. (2016), this unanimity constraint is relaxed in practice by demanding that the distance condition for keeping an observation is true at least for a certain proportion of the estimators (for example, ). The theoretical results remain true provided that tends to 1.
Here, our purpose is to investigate a new approach, based on distances between observations, which also aims at reducing the effect of a possibly bad initial estimator. Roughly, choosing a kernel point of view, we will propose a combined estimator with weights constructed by mixing distances between entries with distances between predictions coming from the individual estimators. Our motivation for introducing such a strategy is the intuition that taking advantage of the efficiency of the consensus idea of Mojirsheibani (1999) and Biau et al. (2016) without for all that forgetting the information related to the proximity between entries shall help improving the prediction, especially in the presence of an initial estimator that does not perform very well.
Our modified rule will be shown to be consistent under general assumptions. In particular, the combined estimator may perfectly be consistent even if the list of initial estimators does not contain any consistent estimator. We also conduct numerical experiments, both on simulated and real data, which demonstrate the benefits of our strategy, with respect to the original combining method and the individual estimators.
The paper is organized as follows. The new combined estimator is defined in Section 2. Then, the main theoretical results are stated in Section 3. Section 4 is devoted to numerical experiments with simulated and real examples. A few perspectives are presented in a brief conclusive paragraph, Section 5. For the sake of clarity, proofs are postponed to Section 6.
2 Notation and definition of the estimator
Let denote a random pair taking its values in . The variable has distribution . We are interested in two different situations: , which corresponds to the binary classification problem, and , that is bounded regression. Let stand for the regression function . Note that in the classification context.
Let denote the Bayes classifier, given by
It is well-known that minimizes over all possible classifiers the missclassification error .
Throughout the document, we assume that we are given a sample of the random pair .
Our goal in regression is to estimate the function using . In classification, we aim at building a classifier based on whose error mimics the Bayes classifier error.
For every if ,
will denote the Euclidean norm of the vector, and the closed ball centered at , with radius . To simplify notation, as there is no ambiguity, we will most of the time drop the index and simply write and .
Let be a kernel, that is a nonnegative and monotone decreasing function along rays starting from the origin. The next assumption will be made on the kernel (see Devroye et al. (1996)).
We suppose that the kernel is regular, that is, there exist and such that
For all , .
We propose to combine the predictions of initial estimators, denoted by in the classification context or in the regression context. Sometimes, the notation , embracing both cases, will be employed. For , let , and . For ease of exposition, we will assume throughout that and do not depend on the sample .
Using a simple sample-splitting device, our results extend to the case where the individual estimators depend on the sample. More specifically, we may split into two independent parts and , assuming that are based on alone, and using for the combination step.
The definition of our combined estimator is first introduced in the regression framework.
Let the function be such that , where .
Suppose that we are given a set of initial regression estimators . The regression combined estimator is defined by
where , , .
To lighten the equations, we will sometimes use the notation to mean the function and for .
We introduce the notation , for the quantity defined by
We now state the definition of the estimator in the context of classification.
Suppose that we are given a set of initial classifiers . The combined classifier be defined by
where , , and .
The random vectors form an i.i.d. sequence, and so is the sequence .
Here, the exponent or will be dropped whenever there is no need to specify the classification or regression context.
In the particular case where can be expressed in function of the squared Euclidean norm, i.e. we have, for , , then the quantity takes the somewhat more explicit form . This is the case for a Gaussian kernel for instance.
3 Main results
We are now in a position to state the main results of the paper.
Theorem 3.1 (Regression case).
If and as , then, for every , there exists such that for , the following exponential bound holds:
where is a constant depending on and .
Let denote the Bayes error and the missclassification error of .
Theorem 3.2 (Classification case).
If and as , then, for every , there exists such that for , the following exponential bound holds:
where is a constant depending on and .
These results may be seen as “combining” versions of the strong consistency results for kernel regression and the kernel classification rule, described in Devroye and Krzyżak (1989) (see also the books of Devroye et al. (1996) and Györfi et al. (2002)).
For Theorem 3.1 (regression context), let us write
Thus, the result will be obtained by replacing, on the one hand, by , and by controlling, on the other hand, the error due to the difference between the two terms. This is done respectively in Lemma 3.1 and 3.2 below.
If and as , then there exists , depending on and , such that for every , if is large enough,
The next lemma is devoted to the control of the difference between and .
If and as , then there exists , depending on and , such that for every , if is large enough,
The following upper bound holds:
We deduce from this upper bound that
To control the quantity in order to prove Lemma 3.1, we may use the following decomposition, for :
The term will be studied first and then McDiarmid’s inequality will be employed to handle the deviation .
Let and (recall that equals 0 or 1 in the classification case and we consider bounded regression with ) be a continuous function with compact support such that
The following inequality holds:
We will derive an upper bound for the integral against of each term. The next result is proved in Section 6.
For a function as defined in (3.4), the following statements hold.
If as , then
Let us write
for some constant depending on , the kernel and the dimensions and . Moreover, if as , then, for and large enough,
The proof of this proposition rests upon the following important result, which extends the covering lemma of Devroye and Krzyżak (1989) to our context.
Lemma 3.4 (Covering lemma).
There exists , depending on and , such that
, there exists such that
4 Numerical Experiments
The section presents numerical experiments to illustrate the benefits of using the new combining approach. The classification case is illustrated with numerical simulations and the regression case with real operational data recorded from two applications: modeling of the electrical power consumption of an air compressor and modeling of the electricity production of different wind turbines in a wind farm.
Two aggregation strategies are run, the strategy by Mojirsheibani (1999) or Biau et al. (2016), which only combines output predictions, and our strategy, which combines the input distance and output predictions. In the sequel, the original approach is called Cobra, from the name of the R package implementing the method (Guedj, 2013), and our version is called MixCobra.
Figure 1 presents the 6 examples simulated to illustrate the classification performances. For each example, a data set of observations is simulated, split into two classes (with half of the observations in each class). All the figures are depicted with observations, to get graphics reflecting more accurately the distribution; however, to be more realistic with respect to available data in most applications, only
observations (100 per class) are used to evaluate the performances. The first four graphs present mixtures of uniform or Gaussian distributions. The two last graphs represent respectively two concentric circles and two interlocked spirals (see annex for details on simulated data).
Parametric and non parametric machines are here introduced for classification. For parametric methods, linear discriminant analysis (lda) and logistic regression (logit) are used. For nonparametric methods,
-nearest neighbors (knn) (withCobra aggregation procedure is performed using all the available machines, meaning that the consensus is required for all machines. For each simulated example, the performances of each machine are computed using repetitions. For each repetition, observations are generated: 75% randomly chosen observations are used to calibrate the model and the remaining 25% to compute the test performances.
One first interesting question is to analyze whenever a specific machine always shows the best performance over all the repetitions. During an off-line study, if a practitioner observes that a given machine performs always the best, then the choice of the model to use is easy to state and an aggregation procedure relying on all machines may not be useful in this particularly case. Table 1 shows, for each machine, the number of runs where this machine provides the best model (i.e. has the best performance). For each simulated example (each column), the sum of the rows corresponding to the different methods equals the total number of repetitions (). We observe that, except for the “circles” example for which the knn machine outperforms all the others, almost every machine wins at least once over all the runs. For all simulated examples except the “circles” example, an aggregation procedure may be interesting.
Table 2 presents the average performances for all the classification examples. As expected for the “gauss” example corresponding to the well separated mixture of two Gaussian distributions, both parametric and non parametric machines perform well; otherwise, the nonparametric methods perform in general better.
|lda||7.0 (8.7)||51.0 (15.6)||12.9 (10.6)||17.6 (7.2)||52.5 (16.5)||34.3 (15.8)|
|logit||7.5 (8.2)||51.1 (15.6)||9.6 (9.0)||17.8 (7.3)||52.4 (16.3)||34.4 (16.0)|
|knn||7.0 (8.1)||25.0 (14.5)||6.0 (7.1)||9.2 (6.6)||0.0 (0.0)||4.1 (6.4)|
|svm||8.7 (9.0)||25.9 (14.6)||5.9 (8.0)||10.3 (6.7)||1.8 (4.1)||12.2 (10.8)|
|cart||7.2 (8.3)||23.0 (14.8)||4.0 (6.0)||7.2 (5.8)||0.0 (0.0)||17.8 (13.7)|
|bag||8.7 (8.8)||28.3 (15.8)||5.1 (7.5)||7.5 (5.5)||0.5 (2.2)||3.0 (5.4)|
|rf||8.9 (9.1)||27.5 (15.3)||4.7 (6.8)||7.4 (5.5)||0.3 (1.7)||2.2 (5.0)|
|Cobra||8.6 (8.6)||25.8 (15.2)||8.2 (8.9)||8.5 (5.9)||1.1 (3.5)||6.2 (8.3)|
|Mixcobra||7.7 (9.2)||28.2 (15.4)||5.4 (6.7)||7.2 (5.8)||0.3 (1.7)||2.8 (6.2)|
Average classification error (and standard deviation into brackets) for the different machines and both aggregation methodsCobra and MixCobra, computed for the simulated classification examples (1 unit ).
Concerning the aggregation procedures, Cobra and MixCobra succeed in yielding very satisfactory performances. When local information brings indirectly strong knowledge on the class label, as for the “circles” and “spirals” examples, MixCobra outperforms compared to Cobra. This may be explained by the fact that the input part of MixCobra weights behaves like a nonparametric kernel-like method, performing well in such cases, as shown by the low error of the knn classifier.
For the regression framework, we use the eponymous R package COBRA available on CRAN (Guedj, 2013) to compute the performance results for Cobra aggregation. It should be noted that this package implements the procedure only for the regression framework and not for classification (so, it was not possible to use the package in the classification framework). In the R package COBRA, two parameters, called hereafter and , drive the numerical results of the aggregation strategy (see Guedj, 2013). The smoothing parameter governs the consensus, insofar that it sets a threshold for the distance between the prediction for a new observation and the prediction for an observation in the training set: for every machine ,
The collective estimator is a weighted average of the outputs of the training data set: for ,
The value of the weight attributed to each output observation , , of the training data set depends on the consensus, possibly computed only for a certain proportion of machines. This approach will be referred to as “Cobra Adaptive method” by opposition to the “Cobra Fixed method” when the number of machines is fixed.
A too small value leads to many zero weights and discards a large number of data points, whereas a large value provides just an average of all the predictions. The parameter drives the proportion of machines to be actually considered among all initial machines. With no a priori information on the performances of the different machines, both parameters and are quite hard to choose. Practically, in the R COBRA package, these parameters are computed by cross-validation, using a subset of the observations in the training data set.
Concerning MixCobra, two parameters need also to be chosen, as in the initial Cobra method. The parameter drives the influence of the inputs (through the proximity between a new input observation and the input observations of the training data set) and the parameter drives the consensus of the predictions. It may be interesting to note that, in MixCobra, the two parameters play more symmetric roles than the parameters of Cobra. In practice, as in the COBRA package, both parameters are chosen by cross-validation.
The following paragraphs illustrate, in two industrial applications, the benefit of combining several regressors relying on a consensus notion using the Cobra and the MixCobra method.
Modeling Air Compressors
Air compressors are crucial elements of production and distribution of air gases. Compressors are instrumented with sensors that provide valuable information on the compressor status. Power consumption is measured on the electric motor driving the compressor and other information as flow, input pressure, output pressure, cooling water temperature and air temperature are also recorded. Monitoring compressor power consumption is essential to an Air Separation Unit safe and efficient operation (Cadet et al., 2005). In this study,
regression machines are used to model the air compressor consumption: linear regression model (lm), regression tree (cart), bagging regression tree (bag), random forest (rf), Support Vector Machines (svm) and-nearest neighbors (knn) with different values of .
The data set contains hourly observations of a working air compressor. Each variable is first transformed to have zero mean and unit variance. The performances are here computed by cross-validation using replications. For each replication, 2/3 of the observations are randomly chosen to calibrate the machines and 1/3 of the remaining observations are used to estimate the performances of the different machines. The performances on the test sets are presented in Table 3: for each case, the mean squared error is computed. We observed that the cart bagging and the svm machines provide in average the smallest error. Moreover, we observe that the best machine (i.e. the machine providing the smallest test performance error) changes from run to run. For replications, the best machine is alternatively the cart bagging machine (49 runs), the svm machine (49 runs) or the lm machine (2 runs). In this case, aggregation methods seem particularly well appropriate. Figure 2 shows the boxplots computed with the different machines and the 3 aggregation algorithms: Cobra with all machines (CobraF), Cobra with an adaptive number of machines (CobraA) and MixCobra. As expected, the Cobra algorithm with all machines yields in average the worst performance among the three aggregation techniques. Choosing adaptively the number of machines (CobraA) allows to discard a possibly bad machine and, so, improves the performance of Cobra. We observe that MixCobra provides, in average, the best performance, associated with a low standard deviation compared to the CobraF and CobraA aggregation methods.
Modeling Wind Turbines
The second application aims at modeling 6 different wind turbines on a wind plant in France. Each turbine is described by 5 variables, representing half-hourly information. The production of electricity is here the target variable. The explanatory variables are the wind power, the wind direction (sine, cosine), and the temperature. For each wind turbine, each variable is first transformed to have zero mean and unit variance.
Table 4 presents the performances obtained in modeling the 6 wind turbines (, ), for operational data points, using the same set of learning machines as in the previous paragraph and the corresponding Cobra and MixCobra aggregations. For all wind turbines, the cart bagging machine provides in average the best performances. However, we study in detail, for each wind turbine, the performances obtained for each run over the repetitions. Table 5 presents the number of runs where every given machine provides the smallest error. For the repetitions, the model which provides the smallest error is never the same, and is alternatively either cart bagging or svm. We observe that the proportion differs for each wind turbine. In this case, aggregation methods, and particularly MixCobra, are especially interesting, providing a unique global approach to model all wind turbines, as repetitions show that the best model is never the same.
|lm||14.39 (0.9)||18.22 (1.2)||16.65 (1.2)||15.72 (1.7)||15.16 (1.2)||16.97 (1.8)|
|cart||18.39 (1.7)||19.53 (1.8)||19.46 (1.9)||19.10 (2.0)||19.02 (1.7)||19.22 (1.8)|
|bag||9.13 (0.7)||9.95 (1.2)||9.61 (1.0)||9.35 (0.7)||9.75 (0.8)||10.17 (1.1)|
|rf||11.43 (0.9)||16.32 (2.7)||15.41 (2.2)||13.89 (1.8)||15.12 (2.0)||14.87 (1.7)|
|svm||9.28 (1.0)||13.55 (3.5)||12.72 (3.5)||11.12 (2.3)||12.01 (2.1)||12.00 (2.6)|
|knn 2||13.00 (0.6)||15.59 (1.5)||14.65 (1.1)||14.62 (1.0)||15.75 (1.2)||15.70 (1.1)|
|knn 5||12.12 (0.7)||15.26 (1.8)||13.90 (1.1)||13.60 (1.0)||14.95 (1.3)||14.75 (1.3)|
|knn 10||12.69 (0.8)||16.78 (1.8)||15.00 (1.3)||14.41 (1.2)||15.93 (1.4)||15.66 (1.5)|
|Cobra Fixed||18.83 (4.5)||29.55 (10.6)||31.76 (10.9)||24.01 (8.5)||24.73 (9.2)||23.92 (8.4)|
|Cobra Adaptive||14.15 (5.5)||17.76 (8.1)||17.38 (9.8)||16.43 (7.4)||17.71 (9.0)||15.29 (7.2)|
|MixCobra||9.02 (0.6)||10.47 (1.4)||10.04 (1.0)||9.45 (0.8)||10.37 (0.9)||10.57 (1.1)|
Increasing the dimension or the number of estimators
Since our combining method relies on a kernel rule, it is expected to be affected in some sense by the curse of dimensionality. However, an interesting feature of our approach is that the term based on the distance between the entries,, and the term involving the preliminary estimations, , are not affected the same way, since the “dimension” in not the same in both cases. Indeed, in the first case, the dimension is the actual dimension of the space containing the entries, whereas in the second case, the role of the dimension is played by the number of individual estimators. In fact, the final combined estimator shows an interesting flexibility through the calibration of and . When the dimension increases, the method will give more weight to the combining part, which is not affected in itself by increasing and may only be affected through the particular behavior of the initial estimators considered. Conversely, for reasonable values of , the effect of an increase of the number of estimators may be balanced by the distance term between the entries.
|lm||11.6 (0.4)||11.7 (0.3)||11.8 (0.4)||11.9 (0.4)||11.8 (0.3)|
|cart||26.4 (1.9)||26.0 (2.0)||26.6 (1.9)||26.8 (2.2)||26.2 (1.8)|
|bag||10.7 (0.6)||11.4 (0.7)||12.2 (0.7)||12.3 (0.6)||12.6 (0.6)|
|rf||20.1 (1.6)||17.0 (1.3)||17.0 (1.5)||18.8 (1.4)||18.2 (1.7)|
|svm||10.8 (1.0)||14.8 (1.1)||16.4 (0.7)||16.6 (0.7)||16.2 (0.7)|
|knn 2||18.4 (1.2)||41.3 (2.1)||52.9 (2.3)||60.3 (2.8)||65.1 (3.4)|
|knn 5||18.8 (1.0)||36.5 (1.9)||45.5 (1.8)||51.9 (2.5)||56.6 (3.1)|
|knn10||20.5 (1.1)||36.3 (2.1)||44.5 (1.9)||50.3 (2.1)||54.4 (3.6)|
|MixCobra||10.8 (0.7)||13.2 (0.9)||13.9 (0.8)||14.5 (0.9)||14.1 (0.4)|
Performances with the initial air compressor data embedded into high dimensional spaces, adding respectively 5, 10, 15, 20, 25 random variables.
To study the effect of the inputs dimensionality, the original compressor data are embedded in successive high dimensional spaces of size , by artificially adding to the 6 initial variables
independent random variables uniformly distributed.
Table 6 shows the performances computed by cross validation using replications. As previously, for each replication, of the observations are randomly chosen to calibrate the machines and of the remaining observations are used to estimate the performances for the different machines. We observe, as expected, that the performances of the knn machines strongly deteriorate as the dimension increases. The performances of the random forest and the svm machines are also impacted with an increase of the dimension but to a lower extent. The performances of the lm machines and the cart machines are stable. Table 6 shows that MixCobra performs well, even if the performances of some machines decrease, and adapts to the increase of dimensionality.
It is worth pointing out that further definitions of combining estimate linking an aggregation part and a distance between entries part could be investigated. A point of view that seem very appealing is to employ general multivariate kernels and allow the bandwidth to be different depending on the direction, that is the machine and also the direction in . The advantage of this method would be a great flexibility through his data-driven adaptive nature, whereas the increase of the number of parameters necessary for the estimation of a bandwidth matrix would represent a challenge both for estimation and computational aspects.
In our future research work, we plan to study more specifically the high dimensional case. In this framework, it could be interesting to compute more appropriate distances between entries. In particular, a promising strategy could be to try to bring into play the correlations between the output and the different explanatory variables.
6.1 Proof of Lemma 3.3
For any classifier , we may write
Then, applying this to and , we obtain
The event corresponds to either one of the two following situations:
, , which can be rewritten and . In this case, we have
, which means that and . We have
6.2 Proof of the covering lemma 3.4
We recover with balls with radius and centers . Each element of is in a finite number of such balls. This number depends on the dimension .
For the first bound, we search for an upper bound for , and a lower bound for for every such that . Here, we use the notation and .
For the upper bound, we note that the element of
belongs to a finite number of balls of radius and center one of the ’s. We are going to upper bound the quantity of interest by the sum of the suprema over these different balls. The fact that belongs to such a ball (with a center ) means that . We may write
For the lower bound, we use the property . We have
Now, if , let us check that
Thus, when , we obtain
Consequently, we have