In classification/regression, empirical risk estimates are sample mean statistics and the theory of Empirical Risk Minimization (ERM) has been originally developed in this context, see Devroye et al. (1996). The ERM theory essentially relies on the study of maximal deviations between these empirical averages and their expectations, under adequate complexity assumptions on the set of prediction rule candidates. The relevant tools are mainly concentration inequalities for empirical processes, see Ledoux and Talagrand (1991) for instance.
In a wide variety of problems that received a good deal of attention in the machine learning literature and ranging from clustering to image recognition through ranking or learning on graphs, natural estimates of the risk are not basic sample means but take the form of averages of-tuples, usually referred to as
-statistics in Probability and Statistics, seeLee (1990). In Clémençon et al. (2005) for instance, ranking is viewed as pairwise classification and the empirical ranking error of any given prediction rule is a -statistic of order , just like the within cluster point scatter2014) or empirical performance measures in metric learning, refer to Cao et al. (2012) for instance. Because empirical functionals are computed by averaging over tuples of sampling observations, they exhibit a complex dependence structure, which appears as the price to be paid for low variance estimates. Linearization techniques (see Hoeffding, 1948) are the main ingredient in studying the behavior of empirical risk minimizers in this setting, allowing to establish probabilistic upper bounds for the maximal deviation of collection of centered -statistics under appropriate conditions by reducing the analysis to that of standard empirical processes. However, while the ERM theory based on minimization of -statistics is now consolidated (see Clémençon et al., 2008), putting this approach in practice generally leads to significant computational difficulties that are not sufficiently well documented in the machine learning literature. In many concrete cases, the mere computation of the risk involves a summation over an extremely high number of tuples and runs out of time or memory on most machines.
Whereas the availability of massive information in the Big Data era, which machine learning procedures could theoretically now rely on, has motivated the recent development of parallelized / distributed approaches in order to scale-up certain statistical learning algorithms, see Bekkerman et al. (2011) or Bianchi et al. (2013) and the references therein, the present paper proposes to use sampling techniques as a remedy to the apparent intractability of learning from data sets of explosive size, in order to break the current computational barriers. More precisely, it is the major goal of this article to study how a simplistic sampling technique (i.e. drawing with replacement) applied to risk estimation, as originally proposed by Blom (1976) in the context of asymptotic pointwise estimation, may efficiently remedy this issue without damaging too much the “reduced variance” property of the estimates, while preserving the learning rates (including certain ”fast-rate” situations). For this purpose, we investigate to which extent a -process, that is a collection of -statistics, can be accurately approximated by a Monte-Carlo version (which shall be referred to as an incomplete -process throughout the paper) involving much less terms, provided it is indexed by a class of kernels of controlled complexity (in a sense that will be explained later). A maximal deviation inequality connecting the accuracy of the approximation to the number of terms involved in the approximant is thus established. This result is the key to the analysis of the statistical performance of minimizers of risk estimates when they are in the form of an incomplete -statistic. In particular, this allows us to show the advantage of using this specific sampling technique, compared to more naive approaches with exactly the same computational cost, consisting for instance in first drawing a subsample and then computing a risk estimate of the form of a (complete) -statistic based on it. We also show how to incorporate this sampling strategy into iterative statistical learning techniques based on stochastic gradient descent (SGD), see Bottou (1998). The variant of the SGD method we propose involves the computation of an incomplete -statistic to estimate the gradient at each step. For the estimator thus produced, rate bounds describing its statistical performance are established under mild assumptions. Beyond theoretical results, we present illustrative numerical experiments on metric learning and clustering with synthetic and real-world data that support the relevance of our approach.
The rest of the article is organized as follows. In Section 2, we recall basic definitions and concepts pertaining to the theory of -statistics/processes and present important examples in machine learning where natural estimates of the performance/risk measure are -statistics. We then review the existing results for the empirical minimization of complete -statistics. In Section 3, we recall the notion of incomplete -statistic and we derive maximal deviation inequalities describing the error made when approximating a -statistic by its incomplete counterpart uniformly over a class of kernels that fulfills appropriate complexity assumptions. This result is next applied to derive (possibly fast) learning rates for minimizers of the incomplete version of the empirical risk and to model selection. Extensions to incomplete -statistics built by means of other sampling schemes than sampling with replacement are also investigated. In Section 4, estimation by means of incomplete -statistics is applied to stochastic gradient descent for iterative ERM. Section 5 presents some numerical experiments. Finally, Section 6 collects some concluding remarks. Technical details are deferred to the Appendix.
2 Background and Preliminaries
As a first go, we briefly recall some key notions of the theory of -statistics (Section 2.1) and provide several examples of statistical learning problems for which natural estimates of the performance/risk measure are in the form of -statistics (Section 2.2). Finally, we review and extend the existing rate bound analysis for the empirical minimization of (complete) generalized -statistics (Section 2.3). Here and throughout, denotes the set of all strictly positive integers, the set of nonnegative real numbers.
2.1 -Statistics/Processes: Definitions and Properties
For clarity, we recall the definition of generalized -statistics. An excellent account of properties and asymptotic theory of -statistics can be found in Lee (1990).
Let and . Let , , be independent samples of sizes and composed of i.i.d. random variables taking their values in some measurable space be a measurable function, square integrable with respect to the probability distribution
and composed of i.i.d. random variables taking their values in some measurable spacewith distribution respectively. Let
be a measurable function, square integrable with respect to the probability distribution. Assume in addition (without loss of generality) that is symmetric within each block of arguments (valued in ), . The generalized (or -sample) -statistic of degrees with kernel , is then defined as
where the symbol refers to summation over all subsets related to a set of indexes and .
The above definition generalizes standard sample mean statistics, which correspond to the case . More generally when , is an average over all -tuples of observations, while corresponds to the multi-sample situation with a -tuple for each sample . A -process is defined as a collection of -statistics indexed by a set of kernels. This concept generalizes the notion of empirical process.
Many statistics used for pointwise estimation or hypothesis testing are actually generalized -statistics (e.g. the sample variance, the Gini mean difference, the Wilcoxon Mann-Whitney statistic, Kendall tau). Their popularity mainly arises from their “reduced variance” property: the statistic
has minimum variance among all unbiased estimators of the parameter
tends to infinity, we have:
Asymptotic results and deviation/moment inequalities for-sample -statistics can be classically established by means of specific representations of this class of functionals, see (15) and (27) introduced in later sections. Significant progress in the analysis of -statistics and -processes has then recently been achieved by means of decoupling theory, see de la Peña and Giné (1999). For completeness, we point out that the asymptotic behavior of (multisample) -statistics has been investigated under weaker integrability assumptions than that stipulated in Definition 1, see Lee (1990).
2.2 Motivating Examples
In this section, we review important supervised and unsupervised statistical learning problems where the empirical performance/risk measure is of the form of a generalized -statistics. They shall serve as running examples throughout the paper.
Clustering refers to the unsupervised learning task that consists in partitioning a set of data pointsin a feature space into a finite collection of subgroups depending on their similarity (in a sense that must be specified): roughly, data points in the same subgroup should be more similar to each other than to those lying in other subgroups. One may refer to Chapter 14 in Friedman et al. (2009) for an account of state-of-the-art clustering techniques. Formally, let be the number of desired clusters and consider a symmetric function such that for any . measures the dissimilarity between pairs of observations : the larger , the less similar and . For instance, if , could take the form , where , for all and is any borelian nondecreasing function such that . In this context, the goal of clustering methods is to find a partition of the feature space in a class of partition candidates that minimizes the following empirical clustering risk:
where . Assuming that the data are i.i.d. realizations of a generic random variable drawn from an unknown probability distribution on , the quantity , also known as the intra-cluster similarity or within cluster point scatter, is a one sample -statistic of degree two ( and ) with kernel given by:
according to Definition 1 provided that . The expectation of the empirical clustering risk is given by
where is an independent copy of the r.v. , and is named the clustering risk of the partition . The statistical analysis of the clustering performance of minimizers of the empirical risk (5) over a class of appropriate complexity can be found in Clémençon (2014). Based on the theory of -processes, it is shown in particular how to establish rate bounds for the excess of clustering risk of any empirical minimizer, namely, under appropriate complexity assumptions on the cells forming the partition candidates.
2.2.2 Metric Learning
Many problems in machine learning, data mining and pattern recognition (such as the clustering problem described above) rely on a metric to measure the distance between data points. Choosing an appropriate metric for the problem at hand is crucial to the performance of these methods. Motivated by a variety of applications ranging from computer vision to information retrieval through bioinformatics, metric learning aims at adapting the metric to the data and has attracted a lot of interest in recent years(see for instance Bellet et al., 2013, for an account of metric learning and its applications). As an illustration, we consider the metric learning problem for supervised classification. In this setting, we observe independent copies of a random couple , where the r.v. takes values in some feature space and in a finite set of labels, with say. Consider a set of distance measures . Roughly speaking, the goal of metric learning in this context is to find a metric under which pairs of points with the same label are close to each other and those with different labels are far away. The risk of a metric can be expressed as:
is a convex loss function upper bounding the indicator function, such as the hinge loss . The natural empirical estimator of this risk is
which is a one sample -statistic of degree two with kernel given by:
2.2.3 Multipartite Ranking
Given objects described by a random vector of attributes/featuresand the (temporarily hidden) ordinal labels assigned to it, the goal of multipartite ranking is to rank them in the same order as that induced by the labels, on the basis of a training set of labeled examples. This statistical learning problem finds many applications in a wide range of fields (e.g. medicine, finance, search engines, e-commerce). Rankings are generally defined by means of a scoring function , transporting the natural order on the real line onto the feature space and the gold standard for evaluating the ranking performance of is the manifold, or its usual summary the criterion ( standing for Volume Under the Surface), see Clémençon and Robbiano (2014) and the references therein. In Clémençon et al. (2013), optimal scoring functions have been characterized as those that are optimal for all bipartite subproblems. In other words, they are increasing transforms of the likelihood ratio , where denotes the class-conditional distribution for the -th class. When the set of optimal scoring functions is non-empty, the authors also showed that it corresponds to the functions which maximize the volume under the surface
Given independent samples for , the empirical counterpart of the VUS can be written in the following way:
The empirical (11) is a -sample -statistic of degree with kernel given by:
2.3 Empirical Minimization of -Statistics
As illustrated by the examples above, many learning problems can be formulated as finding a certain rule in a class in order to minimize a risk of the same form as (2.1), , with kernel . Based on independent i.i.d. samples
the ERM paradigm in statistical learning suggests to replace the risk by the -statistic estimation in the minimization problem. The study of the performance of minimizers of the empirical estimate over the class of rule candidates naturally leads to analyze the fluctuations of the -process
Given the bound
a probabilistic control of the maximal deviation naturally provides statistical guarantees for the generalization ability of the empirical minimizer . As shown at length in the case and in Clémençon et al. (2008) and in Clémençon (2014) for specific problems, this can be achieved under adequate complexity assumptions of the class . These results rely on the Hoeffding’s representation of -statistics, which we recall now for clarity in the general multisample -statistics setting. Denote by the symmetric group of order for any and by the -th coordinate of any permutation for . Let be the integer part of any real number and set
Observe that the -sample -statistic (1) can be expressed as
This representation, sometimes referred to as the first Hoeffding’s decomposition (see Hoeffding, 1948), allows to reduce a first order analysis to the case of sums of i.i.d. random variables. The following result extends Corollary 3 in Clémençon et al. (2008) to the multisample situation.
Let be a collection of bounded symmetric kernels on such that
Suppose also that is a VC major class of functions with finite Vapnik-Chervonenkis dimension . For all , we have with probability at least ,
(Uniform boundedness) We point out that condition (16) is clearly satisfied for the class of kernels considered in the multipartite ranking situation, whatever the class of scoring functions considered. In the case of the clustering example, it is fulfilled as soon as the essential supremum of is uniformly bounded over , whereas in the metric learning example, it is satisfied when the essential supremum of the r.v. is uniformly bounded over . We underline that this simplifying condition can be easily relaxed and replaced by appropriate tail assumptions for the variables , , combining the arguments of the subsequent analysis with the classical “truncation trick” originally introduced in Fuk and Nagaev (1971).
(Complexity assumptions) Following in the footsteps of Clémençon et al. (2008) which considered -sample -statistics of degree 2, define the Rademacher average
where are independent Rademacher random variables (random symmetric sign variables), independent from the ’s. As can be seen by simply examining the proof of Proposition 1 (Appendix A), a control of the maximal deviations similar to (17) relying on this particular complexity measure can be obtained: the first term on the right hand side is then replaced by the expectation of the Rademacher average , up to a constant multiplicative factor. This expected value can be bounded by standard metric entropy techniques and in the case where is a VC major class of functions of dimension , we have:
See Appendix A for further details.
3 Empirical Minimization of Incomplete -Statistics
We have seen in the last section that the empirical minimization of -statistics leads to a learning rate of . However, the computational cost required to find the empirical minimizer in practice is generally prohibitive, as the number of terms to be summed up to compute the -statistic (1) is equal to:
In the usual asymptotic framework (4), it is of order as . It is the major purpose of this section to show that, in the minimization problem, the -statistic can be replaced by a Monte-Carlo estimation, referred to as an incomplete -statistic, whose computation requires to average much less terms, without damaging the learning rate (Section 3.1). We further extend these results to model selection (Section 3.2), fast rates situations (Section 3.3) and alternative sampling strategies (Section 3.4).
3.1 Uniform Approximation of Generalized -Statistics
As a remedy to the computational issue mentioned above, the concept of incomplete generalized -statistic has been introduced in the seminal contribution of Blom (1976). The calculation of such a functional involves a summation over low cardinality subsets of the -tuples of indices, , solely. In the simplest formulation, the subsets of indices are obtained by sampling independently with replacement, leading to the following definition.
(Incomplete Generalized -statistic) Let . The incomplete version of the -statistic (1) based on terms is defined by:
where is a set of cardinality built by sampling with replacement in the set
and for all .
We stress that the distribution of a complete -statistic built from subsamples of reduced sizes drawn uniformly at random is quite different from that of an incomplete -statistic based on terms sampled with replacement in , although they involve the summation of the same number of terms, as depicted by Fig. 1.
In practice, should be chosen much smaller than the cardinality of , namely , in order to overcome the computational issue previously mentioned. We emphasize the fact that the cost related to the computation of the value taken by the kernel at a given point depending on the form of is not considered here: the focus is on the number of terms involved in the summation solely. As an estimator of , the statistic (19) is still unbiased, i.e. , but its variance is naturally larger than that of the complete -statistic . Precisely, writing the variance of the r.v. as the expectation of its conditional variance given plus the variance of its conditional expectation given , we obtain
One may easily check that , and the difference vanishes as increases. Refer to Lee (1990) for further details (see p. 193 therein). Incidentally, we underline that the empirical variance of (19) is not easy to compute either since it involves summing approximately terms and bootstrap techniques should be used for this purpose, as proposed in Bertail and Tressou (2006). The asymptotic properties of incomplete -statistics have been investigated in several articles, see Janson (1984); Brown and Kildea (1978); Enqvist (1978). The angle embraced in the present paper is of very different nature: the key idea we promote here is to use incomplete versions of collections of -statistics in learning problems such as that described in Section 2.2. The result stated below shows that this approach solves the numerical problem, while not damaging the learning rates under appropriate complexity assumptions on the collection of (symmetric) kernels considered, the complexity being described here in terms of VC dimension for simplicity. In particular, it reveals that concentration results established for -processes (i.e. collections of -statistics) such as Proposition 1 may extend to their incomplete versions, as shown by the following theorem.
(Maximal deviation) Let be a collection of bounded symmetric kernels on that fulfills the assumptions of Proposition 1. Then, the following assertions hold true.
For all , with probability at least , we have: , ,
For all , with probability at least , we have: , ,
(Complexity assumptions continued) We point out that a bound of the same order as that stated above can be obtained under standard metric entropy conditions by means of classical chaining arguments, or under the assumption that the Rademacher average defined by
has an expectation of the order . The quantity indicates whether the subset of indexes has been picked at the -th draw () or not (), see the calculation at the end of Appendix C. Equipped with this notation, notice that the ’s are i.i.d. multinomial random variables such that . This assumption can be easily shown to be fulfilled in the case where is a VC major class of finite VC dimension (see the proof of Theorem 1 in Appendix B). Notice however that although the variables , , are conditionally i.i.d. given , they are not independent and the quantity (22) cannot be related to complexity measures of the type (18) mentioned in Remark 2.
We underline that, whereas can be proved to be of order under adequate complexity assumptions in the specific situation where is a collection of degenerate -statistics (see Section 3.3), the bound in Theorem 1 cannot be improved in the degenerate case. Observe indeed that, conditioned upon the observations , the deviations of the approximation (19) from its mean are of order , since it is a basic average of i.i.d. terms.
From the theorem stated above, one may straightforwardly deduce a bound on the excess risk of kernels minimizing the incomplete version of the empirical risk based on terms, i.e. such that
The first assertion of Theorem 1 provides a control of the deviations between the -statistic (1) and its incomplete counterpart (19) uniformly over the class . As the number of terms increases, this deviation decreases at a rate of . The second assertion of Theorem 1 gives a maximal deviation result with respect to . Observe in particular that, with the asymptotic settings previously specified, and as . The bounds stated above thus show that, for a number of terms tending to infinity at a rate as , the maximal deviation is asymptotically of the order , just like , see bound (17) in Proposition 1. In short, when considering an incomplete -statistic (19) with terms only, the learning rate for the corresponding minimizer is of the same order as that of the minimizer of the complete risk (1), whose computation requires to average terms. Minimizing such incomplete -statistics thus yields a significant gain in terms of computational cost while fully preserving the learning rate. In contrast, as implied by Proposition 1, the minimization of a complete -statistic involving terms, obtained by drawing subsamples of sizes uniformly at random, leads to a rate of convergence of , which is much slower except in the trivial case where and . These striking results are summarized in Table 1.
The important practical consequence of the above is that when is too large for the complete risk (1) to be used, one should instead use the incomplete risk (19) (setting the number of terms as large as the computational budget allows).
|Empirical risk criterion||Nb of terms||Rate bound|
Complete -statistic based on subsamples
Incomplete -statistic (our result)
3.2 Model Selection Based on Incomplete -Statistics
Automatic selection of the model complexity is a crucial issue in machine learning: it includes the number of clusters in cluster analysis (see Clémençon, 2014) or the choice of the number of possible values taken by a piecewise constant scoring function in multipartite ranking for instance (cf. Clémençon and Vayatis, 2009). In the present situation, this boils down to choosing the adequate level of complexity of the class of kernels , measured through its (supposedly finite) VC dimension for simplicity, in order to minimize the (theoretical) risk of the empirical minimizer. It is the purpose of this subsection to show that the incomplete -statistic (19) can be used to define a penalization method to select a prediction rule with nearly minimal risk, avoiding procedures based on data splitting/resampling and extending the celebrated structural risk minimization principle, see Vapnik (1999). Let be the collection of all symmetric kernels on and set . Let be a sequence of uniformly bounded major subclasses of , of increasing complexity (VC dimension). For any , let denote the VC dimension of the class and set . We suppose that there exists such that . Given and , the complexity penalized empirical risk of a solution of the ERM problem (23) with is
where the quantity is a distribution free penalty given by:
As shown in Assertion of Corollary 1
, the quantity above is an upper bound for the expected maximal deviationand is thus a natural penalty candidate to compensate the overfitting within class . We thus propose to select
As revealed by the theorem below, choosing , the prediction rule based on a penalized criterion involving the summation of terms solely, achieves a nearly optimal trade-off between the bias and the distribution free upper bound (25) on the variance term.
(Oracle inequality) Suppose that Theorem 1’s assumptions are fulfilled for all and that . Then, we have: , ,
3.3 Fast Rates for ERM of Incomplete -Statistics
In Clémençon et al. (2008), it has been proved that, under certain “low-noise” conditions, the minimum variance property of the -statistics used to estimate the ranking risk (corresponding to the situation and ) leads to learning rates faster than . These results rely on the Hajek projection, a linearization technique originally introduced in Hoeffding (1948) for the case of one sample -statistics and next extended to the analysis of a much larger class of functionals in Hájek (1968). It consists in writing as the sum of the orthogonal projection
which is itself a sum of independent basic sample means based on i.i.d. r.v.’s (of the order each, after recentering), plus a possible negligible term. This representation was used for instance by Grams and Serfling (1973) to refine the CLT in the multisample -statistics framework. Although useful as a theoretical tool, it should be noticed that the quantity is not of practical interest, since the conditional expectations involved in the summation are generally unknown.
Although incomplete -statistics do not share the minimum variance property (see Section 3.1), we will show that the same fast rate bounds for the excess risk as those reached by ERM of -statistics (corresponding to the summation of pairs of observations) can be attained by empirical ranking risk minimizers, when estimating the ranking risk by incomplete -statistics involving the summation of terms solely.
For clarity (and comparison purpose), we first recall the statistical learning framework considered in Clémençon et al. (2008). Let be a pair of random variables defined on the same probability space, where is a real-valued label and models some input information taking its values in a measurable space hopefully useful to predict . Denoting by an independent copy of the pair . The goal pursued here is to learn how to rank the input observations and , by means of an antisymmetric ranking rule (i.e. for any ), so as to minimize the ranking risk
The minimizer of the ranking risk is the ranking rule (see Proposition 1 in Clémençon et al., 2008). The natural empirical counterpart of (28) based on a sample of independent copies of the pair is the -sample -statistic of degree two with kernel for all and in given by:
Equipped with these notations, a statistical version of the excess risk is a -statistic with kernel . The key “noise-condition”, which allows to exploit the Hoeffding/Hajek decomposition of , is stated below.