Survey schemes for stochastic gradient descent with applications to M-estimation

01/09/2015, by Stephan Clémençon, et al.

In certain situations that will undoubtedly become more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full sample is hardly feasible, if not impossible. A natural approach in this context consists in using survey schemes and substituting the "full data" statistics with their counterparts based on the resulting random samples of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods in large-scale statistical and machine-learning problems. Precisely, we prove that, in the presence of some a priori information, one may significantly increase asymptotic accuracy by choosing appropriate first order inclusion probabilities, without affecting complexity. These striking results are described here by limit theorems and are also illustrated by numerical experiments.


1 Introduction

In many situations, data are not the sole information that can be exploited by statisticians. Sometimes, they can also make use of weights resulting from some survey sampling design. Such quantities correspond either to true inclusion probabilities or to calibrated or post-stratification weights, minimizing some discrepancy under certain margin constraints for the inclusion probabilities. Asymptotic analysis of Horvitz-Thompson estimators based on survey data (see HT51) has received a good deal of attention, in particular in the context of mean estimation and regression (see Hajek64, Ros72, Rob82, DS92, Ber98 for instance). The last few years have witnessed significant progress towards a comprehensive functional limit theory for distribution function estimation, refer to Wellner88, Breslow06, Breslow08, Breslow09, Wellner11 or BCC13. In parallel, the field of machine-learning has undergone a spectacular development. Its practice has become increasingly popular in a wide range of fields thanks to various breakout algorithms (e.g. neural networks, SVM, boosting methods) and is supported by a sound probabilistic theory based on recent results in the study of empirical processes, see DGL96, Kolt06, BBL05. However, our increasing capacity to collect data, due to the ubiquity of sensors, has improved much faster than our ability to process and analyze Big Datasets, see BB08. The availability of massive information in the Big Data era, which machine-learning procedures could theoretically now rely on, has motivated the recent development of parallelized/distributed variants of certain statistical learning algorithms, see BBL11, mateos:bazerque:giannakis:2010, navia:etal:2006 or BCJM13 among others. It also strongly suggests using survey techniques, as a remedy to the apparent intractability of learning from datasets of explosive size, in order to break the current computational barriers, see CRT13. The present article explores the latter approach, following in the footsteps of CRT13, where the advantages of specific sampling plans compared to naive sub-sampling strategies are proved when the risk functional is estimated by generalized U-statistics.

Our goal here is to show how to incorporate sampling schemes into iterative statistical learning techniques based on stochastic gradient descent (SGD in abbreviated form, see B98), such as SVM, Neural Networks or soft K-means for instance, and to establish (asymptotic) results in order to guarantee their theoretical validity. The variant of the SGD method we propose involves a specific estimator of the gradient, which shall be referred to as the Horvitz-Thompson gradient estimator (HTGD estimator in abbreviated form) throughout the paper, and accounts for the sampling design by means of which the data sample has been selected at each iteration. For the estimator thus produced, consistency and asymptotic normality results describing its statistical performance are established under adequate assumptions on the first and second order inclusion probabilities. They reveal that accuracy may significantly increase (i.e. the asymptotic variance may be drastically reduced) when the inclusion probabilities of the survey design are picked adequately, depending on some supposedly available extra information, compared to a naive implementation with equal inclusion probabilities. This is thoroughly discussed in the particular case of the Poisson survey scheme. Although it is one of the simplest sampling designs, many more general survey schemes may be expressed as Poisson schemes conditioned upon specific events. We point out that statistical learning based on non-i.i.d. data has been investigated in Steinwart09 (see also Duchi13 for analogous results in the on-line framework). However, the framework considered by these authors relies on mixing assumptions, guaranteeing the weak dependency of the data sequences analyzed, and is thus quite different from that developed in the present article. Finally, we point out that a very preliminary version of this work was presented at the 2014 IEEE International Conference on Big Data.

The rest of the paper is structured as follows. Basics in M-estimation and SGD techniques, together with key notions of survey sampling theory, are briefly recalled in section 2. Section 3 first describes the Horvitz-Thompson variant of SGD in the context of a general M-estimation problem. In section 4, limit results are established in a general framework, revealing the possibly significant gain in terms of asymptotic variance resulting from sampling with unequal probabilities in the presence of extra information. They are next discussed in more depth in the specific case of Poisson surveys. Illustrative numerical experiments, consisting in fitting a logistic regression model (respectively, a semi-parametric shift model) with extra information, are displayed in section 5. Technical proofs are postponed to the Appendix section, together with a rate bound analysis of the HTGD algorithm.

2 Theoretical background

As a first go, we start off by describing the mathematical setup and recalling key concepts of survey theory involved in the subsequent analysis. Here and throughout, the indicator function of any event $\mathcal{E}$ is denoted by $\mathbb{I}\{\mathcal{E}\}$, the Dirac mass at any point $x$ by $\delta_x$ and the power set of any set $E$ by $\mathcal{P}(E)$. The Euclidean norm of any vector $x\in\mathbb{R}^d$, $d\ge 1$, is denoted by $\|x\|$. The transpose of a matrix $M$ is denoted by $M^T$, and the square root of any symmetric positive semi-definite matrix $A$ by $A^{1/2}$.

2.1 Iterative M-estimation and SGD methods

Let $\Theta\subset\mathbb{R}^q$ with $q\ge 1$ be some parameter space and $\psi:\mathbb{R}^d\times\Theta\to\mathbb{R}$ be some smooth loss function. Let $Z$ be a random variable taking its values in $\mathbb{R}^d$ such that $\psi(Z,\theta)$ is square integrable for any $\theta\in\Theta$. Set $L(\theta)=\mathbb{E}[\psi(Z,\theta)]$ for all $\theta\in\Theta$. Consider the risk minimization problem

$$\min_{\theta\in\Theta} L(\theta).$$

Based on independent copies $Z_1,\ldots,Z_n$ of the r.v. $Z$, the empirical version of the risk function is $\widehat{L}_n(\theta)=\int\psi(z,\theta)\,d\widehat{P}_n(z)$, where $\widehat{P}_n=\frac{1}{n}\sum_{i=1}^n\delta_{Z_i}$ denotes the empirical measure, for all $\theta\in\Theta$. As $n\to+\infty$, asymptotic properties of M-estimators, i.e. minimizers of $\widehat{L}_n$, have been extensively investigated, see Chapter 5 in vdV98 for instance. Here and throughout, we respectively denote by $\nabla_\theta$ and $\nabla^2_\theta$ the gradient and Hessian operators w.r.t. $\theta$. By convention, $\nabla^0_\theta$ denotes the identity operator and gradient values are represented as column vectors.

Gradient descent. Concerning computational issues (see Berts03), many practical machine-learning algorithms implement variants of the standard gradient descent method, following the iterations:

$$\theta_{t+1}=\theta_t-\gamma_t\,\nabla_\theta\widehat{L}_n(\theta_t), \qquad (1)$$

with an initial value $\theta_0$ arbitrarily chosen and a learning rate (step size or gain) $\gamma_t\ge 0$ such that $\sum_{t\ge 1}\gamma_t=+\infty$ and $\sum_{t\ge 1}\gamma_t^2<+\infty$. Here we place ourselves in a large-scale setting, where the sample size $n$ of the training dataset is so large that computing the gradient of $\widehat{L}_n$,

$$\nabla_\theta\widehat{L}_n(\theta)=\frac{1}{n}\sum_{i=1}^n\nabla_\theta\psi(Z_i,\theta), \qquad (2)$$

at each iteration (1) is too demanding regarding available memory capacity. Beyond parallel and distributed implementation strategies (see BBL11), a natural approach consists in replacing (2) by a counterpart computed from a subsample $\mathcal{S}\subset\{1,\ldots,n\}$ of reduced size $S\ll n$, fixed in advance so as to fulfill the computational constraints, and drawn at random (uniformly) among all possible subsets of same size:

$$\bar{g}_{\mathcal{S}}(\theta)=\frac{1}{S}\sum_{i\in\mathcal{S}}\nabla_\theta\psi(Z_i,\theta). \qquad (3)$$

The convergence properties of such a stochastic gradient descent, usually referred to as mini-batch SGD, have received a good deal of attention, in particular in the case $S=1$, suited to the on-line situation where training data are progressively available. Results, mainly based on stochastic approximation combined with convex minimization theory under appropriate assumptions on the decay of the step size $\gamma_t$, are well-documented in the statistical learning literature. References are much too numerous to be listed exhaustively; see KYBook for instance.
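For concreteness, here is a minimal sketch of the uniform mini-batch strategy (3): at each iteration, $S$ indices are drawn uniformly without replacement and the corresponding gradients are averaged in place of (2). The callback grad_i, the step schedule and all names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def minibatch_sgd(grad_i, theta0, n, S, n_iter, gamma0=0.1, rng=None):
    """Uniform mini-batch SGD (sketch): grad_i(theta, i) is assumed to return the
    gradient of the loss at observation i; S indices are drawn uniformly without
    replacement at each step and their gradients averaged."""
    rng = np.random.default_rng(rng)
    theta = np.array(theta0, dtype=float)
    for t in range(1, n_iter + 1):
        idx = rng.choice(n, size=S, replace=False)        # uniform subsample of size S
        g = np.mean([grad_i(theta, i) for i in idx], axis=0)
        theta -= (gamma0 / t) * g                          # decreasing step gamma_t = gamma0 / t
    return theta
```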

Example 1.

(Binary classification) We place ourselves in the usual binary classification framework, where $Y$ is a binary random output, taking its values in $\{-1,+1\}$ say, and $X$ is an input random vector valued in a high-dimensional space $\mathcal{X}$, modeling some (hopefully) useful observation for predicting $Y$. Based on training data $\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, the goal is to build a prediction rule $\mathrm{sign}(h(X))$, where $h$ is some measurable decision function, which minimizes the risk $L_\varphi(h)$, where the expectation defining $L_\varphi$ is taken over the unknown distribution of the pair of r.v.'s $(X,Y)$ and $\varphi$ denotes a cost function (i.e. a measurable function such that $\varphi(u)\ge\mathbb{I}\{u\ge 0\}$ for any $u\in\mathbb{R}$). For simplicity, consider the case where the decision function candidates are assumed to belong to the parametric set of square integrable (with respect to $X$'s distribution) functions $\{h_\theta:\ \theta\in\Theta\}$ indexed by $\theta$, and the cost function $\varphi$ is convex. Notice that, in this case, the optimal decision function $h^*$ is given in closed form, and the classification rule $\mathrm{sign}(h^*(X))$ thus coincides with the naive Bayes classifier. We abusively set $L_\varphi(\theta)=L_\varphi(h_\theta)$ for all $\theta\in\Theta$. Consider the problem of finding a classification rule with minimum risk, i.e. the optimization problem $\min_{\theta\in\Theta}L_\varphi(\theta)$. In the ideal case where a standard gradient descent could be applied, one would iteratively generate a sequence $\theta_t$, $t\ge 1$, satisfying the following update equation:

$$\theta_{t+1}=\theta_t-\gamma_t\,\nabla_\theta L_\varphi(\theta_t),$$

where $\gamma_t$ is the learning rate. Naturally, as $(X,Y)$'s distribution is unknown, the expectation involved in the $t$-th iteration cannot be computed and must be replaced by a statistical version, in accordance with the Empirical Risk Minimization paradigm. This is a particular case of the problem previously described, where $Z=(X,Y)$ and $\psi(Z,\theta)$ is the cost of the decision $h_\theta(X)$ for the label $Y$.

Example 2.

(Logistic regression) Consider the same probabilistic model as above, except that the goal pursued is to find $\theta$ so as to minimize the opposite of the conditional log-likelihood given the $X_i$'s related to the parametric logistic regression model.
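As a point of reference, a standard form of this criterion and of its gradient is sketched below, assuming the linear logistic model with labels in $\{-1,+1\}$ and a linear decision function; this common convention may differ in minor details from the exact parametrization used here.

```python
import numpy as np

def logistic_loss_and_grad(theta, X, y):
    """Average negative conditional log-likelihood of the (assumed) linear logistic
    model P(Y=+1 | X=x) = 1 / (1 + exp(-theta.x)), with labels y in {-1,+1},
    together with its gradient with respect to theta."""
    margins = y * (X @ theta)                        # y_i * theta^T x_i
    loss = np.mean(np.logaddexp(0.0, -margins))      # mean of log(1 + exp(-margin))
    # d/dtheta log(1 + exp(-m_i)) = -y_i x_i / (1 + exp(m_i))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
    return loss, grad
```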

2.2 Survey sampling

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $n\ge 1$. In the framework we consider throughout the article, it is assumed that $Z_1,\ldots,Z_n$ is a sample of i.i.d. random variables defined on $(\Omega,\mathcal{F},\mathbb{P})$, taking their values in $\mathbb{R}^d$. The $Z_i$'s correspond to independent copies of a generic r.v. $Z$ observed on a finite population $\mathcal{I}_n:=\{1,\ldots,n\}$. We call a survey sample of (possibly random) size $N$ of the population $\mathcal{I}_n$ any subset $S\subset\mathcal{I}_n$ with cardinality $N$ less than $n$. Given the statistical population $\mathcal{I}_n$, a sampling scheme (design/plan) without replacement is determined by a probability distribution $R_n$ on the set of all possible samples $\mathcal{P}(\mathcal{I}_n)$. For any $i\in\{1,\ldots,n\}$, the (first order) inclusion probability,

$$\pi_i(R_n)=\mathbb{P}_{R_n}\{i\in S\},$$

is the probability that the unit $i$ belongs to a random sample $S$ drawn from distribution $R_n$. We set $\boldsymbol{\pi}(R_n)=(\pi_1(R_n),\ldots,\pi_n(R_n))$. The second order inclusion probabilities are denoted by

$$\pi_{ij}(R_n)=\mathbb{P}_{R_n}\{(i,j)\in S^2\},$$

for any $i\neq j$ in $\{1,\ldots,n\}^2$. Equipped with this notation, we have $\pi_{ii}=\pi_i$ for $1\le i\le n$. When no confusion is possible, we shall omit to mention the dependence in $R_n$ when writing the first/second order probabilities of inclusion. The information related to the resulting random sample $S\subset\{1,\ldots,n\}$ is fully enclosed in the r.v. $\boldsymbol{\epsilon}_n=(\epsilon_1,\ldots,\epsilon_n)$, where $\epsilon_i=\mathbb{I}\{i\in S\}$. Given the statistical population, the conditional 1-d marginal distributions of the sampling scheme $\boldsymbol{\epsilon}_n$ are the Bernoulli distributions $\mathcal{B}(\pi_i)$, $1\le i\le n$, and the conditional covariance matrix of the r.v. $\boldsymbol{\epsilon}_n$ is given by $\Gamma_n=\{\pi_{ij}-\pi_i\pi_j\}_{1\le i,j\le n}$. Observe that, equipped with the notations above, $\sum_{i=1}^n\epsilon_i=\mathrm{card}(S)$.

One of the simplest survey plans is the Poisson scheme (without replacement). For such a plan, conditioned upon the statistical population of interest, the $\epsilon_i$'s are independent Bernoulli random variables with parameters $p_1,\ldots,p_n$ in $(0,1)$. The first order inclusion probabilities thus fully characterize such a plan: equipped with the notations above, $\pi_i=p_i$ for $1\le i\le n$. Observe in addition that the size $\sum_{i=1}^n\epsilon_i$ of a sample generated this way is random with mean $\sum_{i=1}^n p_i$ and goes to infinity as $n\to+\infty$ with probability one, provided that the $p_i$'s remain bounded away from zero. In addition to its simplicity (regarding the procedure to select a sample thus distributed), it plays a crucial role in sampling theory, insofar as it can be used to build a wide range of survey plans by conditioning arguments, see Hajek64. For instance, a rejective sampling plan of size $n_0\le n$ corresponds to the distribution of a Poisson scheme conditioned upon the event $\{\sum_{i=1}^n\epsilon_i=n_0\}$. One may refer to Cochran77, Dev87 for accounts of survey sampling techniques and examples of designs to which the subsequent analysis applies.
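For concreteness, drawing a Poisson sample simply amounts to $n$ independent Bernoulli trials; the minimal sketch below (hypothetical names) returns the indices of the selected units, whose number is random with mean equal to the sum of the inclusion probabilities.

```python
import numpy as np

def poisson_sample(pi, rng=None):
    """Draw a Poisson survey sample: unit i is included independently with
    probability pi[i]. The realized sample size is random, with mean sum(pi)."""
    rng = np.random.default_rng(rng)
    eps = rng.random(len(pi)) < pi          # independent Bernoulli(pi_i) draws
    return np.flatnonzero(eps)              # indices of the selected units

# Example: expected sample size 3.0 out of a population of 6 units.
pi = np.array([0.9, 0.8, 0.5, 0.4, 0.3, 0.1])
sample = poisson_sample(pi, rng=0)
```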

2.3 The Horvitz-Thompson estimator

Suppose that independent r.v.'s $Q_1,\ldots,Q_n$, copies of a generic variable $Q$ taking its values in $\mathbb{R}$, are observed on the population $\mathcal{I}_n$. A natural approach to estimate the total $T_n=\sum_{i=1}^n Q_i$ based on a sample $S$ generated from a survey design with (first order) inclusion probabilities $(\pi_1,\ldots,\pi_n)$ consists in computing the Horvitz-Thompson estimator (HT estimator in abbreviated form)

$$\widehat{T}^{HT}_n=\sum_{i\in S}\frac{Q_i}{\pi_i}=\sum_{i=1}^n\frac{\epsilon_i}{\pi_i}\,Q_i, \qquad (4)$$

with $0/0=0$ by convention. Notice that, given the whole statistical population $(Q_1,\ldots,Q_n)$, the HT estimator is an unbiased estimate of the total: $\mathbb{E}\big[\widehat{T}^{HT}_n\mid Q_1,\ldots,Q_n\big]=T_n$ almost-surely. When samples drawn from the design are of fixed size, the conditional variance is given by:

$$\mathrm{Var}\big(\widehat{T}^{HT}_n\mid Q_1,\ldots,Q_n\big)=-\frac{1}{2}\sum_{i\neq j}\big(\pi_{ij}-\pi_i\pi_j\big)\left(\frac{Q_i}{\pi_i}-\frac{Q_j}{\pi_j}\right)^2. \qquad (5)$$

When the survey design is a Poisson plan with probabilities $p_1,\ldots,p_n$, it is given by:

$$\mathrm{Var}\big(\widehat{T}^{HT}_n\mid Q_1,\ldots,Q_n\big)=\sum_{i=1}^n\frac{1-p_i}{p_i}\,Q_i^2. \qquad (6)$$
Remark 1.

(Auxiliary information) In practice, the first order inclusion probabilities are defined as a function of an auxiliary variable $W$, taking its values in a space $\mathcal{W}$ say, which is observed on the entire population (e.g. a low-dimensional marginal vector of the observation for instance): for all $i\in\{1,\ldots,n\}$ we can write $\pi_i=f(W_i)$ for some link function $f:\mathcal{W}\to(0,1]$. When $W$ and $Q$ are strongly correlated, proceeding this way may help us select more informative samples and consequently yield estimators with reduced variance. A more detailed discussion on the use of auxiliary information in the present context can be found in subsection 4.1.

Going back to the SGD problem, the Horvitz-Thompson estimator of the gradient $\nabla_\theta\widehat{L}_n(\theta)$ based on a survey sample $S$ drawn from a design with (first order) inclusion probabilities $(\pi_1,\ldots,\pi_n)$ is:

$$\bar{g}^{HT}(\theta)=\frac{1}{n}\sum_{i\in S}\frac{1}{\pi_i}\,\nabla_\theta\psi(Z_i,\theta). \qquad (7)$$

As pointed out in Remark 1, ideally, the weight $\pi_i$ should be strongly correlated with $\|\nabla_\theta\psi(Z_i,\theta)\|$. Hence, this leads to consider a procedure where the survey design used to estimate the gradient may change at each step, as in the HTGD algorithm described in the next section. For instance, one could stipulate the availability of extra information taking the form of random fields $\{W_i(\theta):\ \theta\in\Theta\}$ on a space $\mathcal{W}$, with $1\le i\le n$, and assume the existence of a link function $f:\mathcal{W}\to(0,1]$ such that $\pi_i(\theta)=f(W_i(\theta))$. Of course, such an approach is of benefit only when the cost of the computation of the weight $\pi_i(\theta)$ is smaller than that of the gradient $\nabla_\theta\psi(Z_i,\theta)$. As shall be seen in section 5, this happens to be the case in many situations encountered in practice.
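A minimal sketch of the estimate (7) is given below, under the assumption that it is normalized by the full sample size $n$ so as to mimic the mean gradient (2); the callback grad_i and all names are hypothetical.

```python
import numpy as np

def ht_gradient(theta, grad_i, sample, pi, n):
    """Horvitz-Thompson estimate of the empirical (mean) gradient, cf. (7):
    (1/n) * sum_{i in sample} grad_i(theta, i) / pi[i].
    Assumes a non-empty sample; grad_i returns a numpy array of gradient values."""
    g = np.zeros_like(np.asarray(grad_i(theta, sample[0]), dtype=float))
    for i in sample:
        g += np.asarray(grad_i(theta, i), dtype=float) / pi[i]   # inverse-probability weighting
    return g / n
```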

3 Horvitz-Thompson gradient descent

This section presents, in full generality, the variant of the SGD method we promote in this article. It can be implemented in particular when some extra information about the target (the gradient vector field in the present case) is available, hopefully allowing one to pick a sample yielding a more accurate estimate of the (true) gradient than that obtained from a sample chosen completely at random. Several tuning parameters must be picked by the user, including the parameter $n_0$, which controls the (expected) number of terms involved in the empirical gradient estimation at each iteration, see Fig. 1.

Horvitz-Thompson Gradient Descent Algorithm (HTGD)

(Input.) Datasets $\{Z_1,\ldots,Z_n\}$ and $\{W_1,\ldots,W_n\}$ (auxiliary information, if available). Maximum (expected) sample size $n_0\le n$. Collection of sampling plans $R_n(\theta)$ with first order inclusion probabilities $\pi_i(\theta)$ for $1\le i\le n$, indexed by $\theta\in\Theta$, with (expected) sample sizes less than $n_0$. Learning rate $\gamma_t$. Number of iterations $T\ge 1$.

(Initialization.) Choose $\theta_0$ in $\Theta$.

(Iterations.) For $t=0,\ldots,T-1$:

  1. Draw a survey sample $S_t\subset\{1,\ldots,n\}$, described by $\boldsymbol{\epsilon}_t=(\epsilon_1(t),\ldots,\epsilon_n(t))$, according to $R_n(\theta_t)$ with inclusion probabilities $\pi_i(\theta_t)$ for $1\le i\le n$.

  2. Compute the HT gradient estimate at $\theta_t$,
     $$\bar{g}^{HT}_t(\theta_t)=\frac{1}{n}\sum_{i\in S_t}\frac{1}{\pi_i(\theta_t)}\,\nabla_\theta\psi(Z_i,\theta_t).$$

  3. Update the estimator,
     $$\theta_{t+1}=\theta_t-\gamma_t\,\bar{g}^{HT}_t(\theta_t).$$

(Output.) The HTGD estimator $\theta_T$.

Figure 1: The generic HTGD algorithm

The asymptotic accuracy of the estimator or decision rule produced by the algorithm above, as the number of iterations tends to infinity, is investigated in the next section under specific assumptions.
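For concreteness, the following sketch mimics the HTGD iterations of Fig. 1 in the Poisson case; the callback grad_i, the map pi_of_theta (e.g. built from cheap auxiliary variables through a link function) and the step-size schedule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def htgd(grad_i, pi_of_theta, theta0, n, n_iter, gamma0=0.1, rng=None):
    """Generic HTGD loop (sketch), using Poisson sampling: at each step, draw a
    Poisson sample with first order inclusion probabilities pi_of_theta(theta),
    then take a step along the Horvitz-Thompson gradient estimate (7)."""
    rng = np.random.default_rng(rng)
    theta = np.array(theta0, dtype=float)
    for t in range(1, n_iter + 1):
        pi = np.clip(pi_of_theta(theta), 1e-12, 1.0)     # first order inclusion probabilities
        sample = np.flatnonzero(rng.random(n) < pi)      # Poisson scheme (independent Bernoulli draws)
        g = np.zeros_like(theta)
        for i in sample:                                 # HT gradient estimate, cf. (7)
            g += grad_i(theta, i) / pi[i]
        theta -= (gamma0 / t) * g / n                    # gradient step with decreasing learning rate
    return theta
```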

Remark 2.

(Balance between accuracy and computational cost) We point out that the complexity of any Poisson sampling algorithm is $O(n)$, just as in the usual case where the data involved in the standard SGD are uniformly drawn with(out) replacement. However, even if it can be straightforwardly parallelized, the numerical computation of the inclusion probabilities at each step naturally induces a certain amount of additional latency. Hence, although HTGD may largely outperform SGD for a fixed number of iterations, this overhead should be taken into consideration when optimizing computation time.

4 Main results

This section is dedicated to the analysis of the performance of the HTGD method under adequate constraints, related to the (expected) size of the survey samples considered. We first focus on Poisson survey schemes and next discuss how to establish results in a general framework.

4.1 Poisson schemes

Fix $\theta\in\Theta$ and $n_0\le n$. Given the sample $(Z_1,\ldots,Z_n)$, consider a Poisson scheme with parameter $\mathbf{p}_n=(p_1,\ldots,p_n)$. In this case, Eq. (5) becomes:

$$\frac{1}{n^2}\sum_{i=1}^n\frac{1-p_i}{p_i}\,\big\|\nabla_\theta\psi(Z_i,\theta)\big\|^2.$$

Searching for the parameters $p_1,\ldots,p_n$ such that the distance between the empirical gradient evaluated at $\theta$ and its HT version is minimum, under the constraint that the expected sample size is equal to $n_0$, yields the optimization problem:

$$\min_{\mathbf{p}_n\in(0,1]^n}\ \sum_{i=1}^n\frac{1-p_i}{p_i}\,\big\|\nabla_\theta\psi(Z_i,\theta)\big\|^2\quad\text{subject to}\quad\sum_{i=1}^n p_i=n_0. \qquad (8)$$

As can be shown by means of the Lagrange multipliers method, the solution corresponds to weights being proportional to the values taken by the norm of the gradient:

$$p_i^*=n_0\,\frac{\big\|\nabla_\theta\psi(Z_i,\theta)\big\|}{\sum_{j=1}^n\big\|\nabla_\theta\psi(Z_j,\theta)\big\|},\qquad 1\le i\le n. \qquad (9)$$

However, selecting a sample distributed this way requires knowledge of the full statistical population $(Z_1,\ldots,Z_n)$. In practice, one may consider situations where the weights are defined by means of a link function and auxiliary variables correlated with the gradient norms $\|\nabla_\theta\psi(Z_i,\theta)\|$, as suggested previously. Observe in addition that the goal pursued here is not to estimate the gradient, but to implement a stochastic gradient descent involving an expected number of terms fixed in advance, while yielding results close to those that would be obtained by means of a gradient descent algorithm whose mean field is based on the whole dataset. However, as shall be seen in the subsequent analysis, these two problems do not in general share the same solution from the angle embraced in this article.
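For illustration, the sketch below builds inclusion probabilities proportional to the gradient norms, as in (9), rescaled so that the expected sample size equals a prescribed value; capping probabilities at one and redistributing the excess mass is a standard adjustment assumed here, not spelled out above.

```python
import numpy as np

def proportional_inclusion_probs(grad_norms, expected_size):
    """Inclusion probabilities proportional to the per-observation gradient norms,
    rescaled so that sum(pi) = expected_size; probabilities exceeding 1 are capped
    and the excess mass is redistributed over the remaining units (an assumed,
    standard adjustment). Assumes at least one strictly positive norm."""
    g = np.asarray(grad_norms, dtype=float)
    pi = expected_size * g / g.sum()
    while np.any(pi > 1.0):                  # cap at 1 and redistribute the excess
        capped = pi >= 1.0
        pi[capped] = 1.0
        free = ~capped
        budget = expected_size - capped.sum()
        pi[free] = budget * g[free] / g[free].sum()
    return pi
```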

In the next subsection, we exhibit assumptions on the survey design under which the HTGD method yields accurate asymptotic results, surpassing those obtained with equal inclusion probabilities (i.e. $\pi_i=n_0/n$ for all $i$).

4.2 Limit theorems

We now consider a collection of general (i.e. not necessarily Poisson) sampling designs and investigate the limit properties of the estimator produced by the HTGD algorithm, conditioned upon the data (or upon the data and the extra variables, cf. Remark 1). The asymptotic analysis involves the regularity conditions listed below, which are classically required in stochastic approximation.

Assumption 1.

The conditions below hold true.

  • For any , is of class .

  • For any compact set , we have with probability one: ,

  • The set of stationary points is of finite cardinality.

Theorem 1.

(Consistency) Assume that the learning rate $\gamma_t$ decays to $0$ so that $\sum_{t\ge 1}\gamma_t=+\infty$ and $\sum_{t\ge 1}\gamma_t^2<+\infty$. Suppose also that the HTGD algorithm is stable, i.e. there exists a compact set $\mathcal{K}\subset\Theta$ s.t. $\theta_t\in\mathcal{K}$ for all $t\ge 0$. Under Assumption 1, conditioned upon the data, the sequence $(\theta_t)_{t\ge 0}$ converges to an element of the set of stationary points with probability one, as $t\to+\infty$.

The stability condition is generally difficult to check. In practice, one may guarantee it by confining the sequence to a compact set fixed in advance and using a projected version of the algorithm above. For simplicity, the present study is restricted to the simplest framework for stochastic gradient descent and we refer to KYBook or Borkar (see section 5.4 therein) for further details.

Consider a stationary point $\theta^*$. The following local assumptions are also required to establish asymptotic normality results, conditioned upon the event that the sequence $(\theta_t)_{t\ge 0}$ converges to $\theta^*$.

Assumption 2.

The conditions below hold true.

  • There exists a neighborhood of such that for all , the mapping is of class on .

  • The Hessian matrix is a stable positive-definite matrix: its smallest eigenvalue is denoted by $l$, with $l>0$.

  • For all , the mapping is continuous.

Theorem 2.

(A conditional CLT) Suppose that Assumptions 1-2 are fulfilled and that for some constants and . When , take and . Set otherwise. Given the observations (respectively, ) and conditioned upon the event , we have the convergence in distribution as

where the asymptotic covariance matrix is the unique solution of the Lyapunov equation:

(10)

with

(11)

The result stated below provides the asymptotic conditional distribution of the error. Its proof is a direct application of the second order delta method and is left to the reader.

Corollary 1.

(Error rate) Under the hypotheses of Theorem 2, given the observations (respectively, the observations and the extra variables) and conditioned upon the event that the algorithm converges to $\theta^*$, we have convergence in distribution towards a non-central chi-square distribution as $t\to+\infty$, where the limit is expressed in terms of a centered Gaussian r.v. with the identity as covariance matrix.

Before showing how the results above can be used to understand how specific sampling designs may improve statistical analysis, a few comments are in order.

Remark 3.

(Asymptotic covariance estimation) An estimate of the asymptotic covariance matrix could be obtained by solving the Lyapunov equation (10), replacing in (11) the (unknown) target value $\theta^*$ by the estimate produced by the HTGD algorithm after $T$ iterations. Alternatively, a percentile Bootstrap method could also be used for this purpose, repeating $B$ times the HTGD algorithm based on $B$ replicates of the original sample.

For completeness, a rate bound analysis of the HTGD algorithm is also provided in the Appendix section.

4.3 Asymptotic covariance optimization in the Poisson case

Now that the limit behavior of the solution produced by the HTGD algorithm has been described for general collections of survey designs of fixed expected sample size, we turn to the problem of finding survey plans yielding estimates with minimum variability. Formulated in a quantitative manner, this boils down to finding the first order inclusion probabilities so as to minimize the asymptotic covariance exhibited above, for an appropriately chosen norm on the space of matrices with real entries for instance. In order to get a natural summary of the asymptotic variability, we consider here the Hilbert-Schmidt norm, i.e. $\|A\|_{HS}=\sqrt{\mathrm{Tr}(AA^T)}$ for any matrix $A$, where $\mathrm{Tr}(\cdot)$ denotes the Trace operator. For simplicity's sake, we focus on Poisson schemes. Let $\mathbf{p}_n=(p_1,\ldots,p_n)$ be a collection of first order inclusion probabilities. The following result exhibits an optimal collection of Poisson schemes among those with $n_0$ as expected sizes, in the sense that it yields an HTGD estimator with an asymptotic covariance whose square root has minimum Hilbert-Schmidt norm. We point out that it is generally different from the collection considered in subsection 4.1, revealing the difference between the issue of accurately estimating the empirical gradient by means of a Poisson scheme and that of optimizing the HTGD procedure.

Proposition 1.

(Optimality) Let . The collection of Poisson designs defined by: , ,

is a solution of the minimization problem

where the infimum is taken over all collections of Poisson designs. In addition, we have

Of course, the optimal solution exhibited in the result stated above is completely useless from a practical perspective, since the matrix involved is unknown in practice and the computation of the values taken by the gradient at each data point is precisely what we are trying to avoid in order to reduce the computational cost of the SGD procedure. In the next section, we show that choosing inclusion probabilities positively correlated with the gradient norms $\|\nabla_\theta\psi(Z_i,\theta)\|$ is actually sufficient to reduce asymptotic variability (compared to the situation where equal inclusion probabilities are used). In addition, as illustrated by the two easily generalizable examples described in section 5, such a sampling strategy can be implemented in many situations.

4.4 Extensions to more general Poisson survey designs

In this subsection, we still consider Poisson schemes and, for simplicity, place ourselves in the situation where the information at our disposal consists of a collection of i.i.d. random pairs $(Z_i,W_i)$, $1\le i\le n$, the $W_i$'s being auxiliary variables taking their values in a space $\mathcal{W}$. We consider inclusion probabilities $p_i=f(W_i)$ defined through a link function $f:\mathcal{W}\to(0,1]$, see Remark 1. The computational cost of $f(W_i)$ is assumed to be much smaller than that of the gradient $\nabla_\theta\psi(Z_i,\theta)$ (see the examples in section 5 below) for all $i$. The assumption introduced below involves the empirical covariance between the weights $p_i$ and the gradient norms $\|\nabla_\theta\psi(Z_i,\theta)\|$, for $1\le i\le n$. Observe that it can be written as:

$$\frac{1}{n}\sum_{i=1}^n\big(p_i-\bar{p}_n\big)\Big(\big\|\nabla_\theta\psi(Z_i,\theta)\big\|-\bar{G}_n(\theta)\Big),$$

with $\bar{p}_n=\frac{1}{n}\sum_{i=1}^n p_i$ and $\bar{G}_n(\theta)=\frac{1}{n}\sum_{i=1}^n\|\nabla_\theta\psi(Z_i,\theta)\|$.

Assumption 3.

The link function fulfills the following condition:

The result stated below reveals to what extent sampling with inclusion probabilities defined by some appropriate link function may improve upon sampling with equal inclusion probabilities $p_i=n_0/n$, $1\le i\le n$, when implementing stochastic gradient descent. Namely, the accuracy of the HTGD gets closer and closer to the optimum as the empirical covariance increases to its maximum. Notice that in the case where the inclusion probabilities are all equal, the empirical covariance above vanishes.

Proposition 2.

Let be fixed. Suppose that the collection of Poisson designs with expected sizes is defined by a link function satisfying Assumption 3. Then, we have

as well as

where denotes the empirical variance of the r.v. .

As illustrated by the easily generalizable examples provided in the next section, one may generally find link functions fulfilling Assumption 3 without great effort, making it possible to gain in accuracy from the implementation of the HTGD algorithm.

5 Illustrative numerical experiments

For illustration purposes, this section shows how the results previously established apply to two problems, by means of simulation experiments. For both examples, the performance of the HTGD algorithm is compared with that of a basic SGD strategy with the same (mean) sample size.

5.1 Linear logistic regression

Consider the linear logistic regression model corresponding to Example 2. Let $X'$ be a low-dimensional marginal vector of the input r.v. $X$, of dimension much smaller than $d$ say, so that one may write $X=(X',X'')$ as well as $\theta=(\theta',\theta'')$ in a similar manner. The problem of fitting the parameter $\theta$ through conditional MLE corresponds to the risk minimization problem of subsection 2.1 with the logistic loss. We propose to implement the HTGD with a link function defined from the low-dimensional marginal $X'$ only.

In order to illustrate the advantages of the HTGD technique for logistic regression, we considered a toy numerical model with fixed parameter values, the input variables being independent and uniformly distributed. The maximum likelihood estimators of $\theta$ were computed using the HTGD and (mini-batch) SGD algorithms. In order to compare them, the same number of iterations was chosen in each situation and the same decreasing learning rate was used.
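For illustration, a hypothetical implementation of such a link function is sketched below: the inclusion probability of each observation is taken proportional to the norm of its logistic gradient computed from the low-dimensional marginal only, then rescaled to the target expected sub-sample size. The alignment of the first coordinates of theta with the marginal, and all names, are assumptions made for the sketch.

```python
import numpy as np

def logistic_link_probs(theta, X_marg, y, expected_size):
    """Hypothetical link function for HTGD in logistic regression: the inclusion
    probability of observation i is proportional to the norm of its logistic
    gradient computed from a cheap low-dimensional marginal X_marg of the input
    (assumed to correspond to the first coordinates of theta), then rescaled so
    that the expected Poisson sample size equals expected_size."""
    d = X_marg.shape[1]
    margins = y * (X_marg @ theta[:d])                     # proxy margins on the marginal only
    proxy = np.linalg.norm(X_marg, axis=1) / (1.0 + np.exp(margins))
    pi = expected_size * proxy / proxy.sum()
    return np.clip(pi, 1e-6, 1.0)                          # keep probabilities in (0, 1]
```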

As a first go, we drew a single sample on which the two algorithms were run for the same number of iterations. Two sub-sample sizes were considered. As can be seen in Fig. 2, while the two settings are virtually equivalent in terms of computation time, taking a larger sample improves the efficiency of the HTGD. The HTGD also appears to reach a better level of precision in fewer steps than both competitors, a phenomenon that is consistent across all coordinates of the estimate.

Figure 2: Evolution of the estimator of the parameter with the number of iterations for the HTGD (solid), mini-batch SGD (dotted) and GD (dashed) algorithms, for the two sub-sample sizes considered (left and right)

So as to account for the randomness due to the data, we then simulated 50 samples according to the model, for two population sizes. For both the HTGD and the mini-batch SGD algorithms, the same sub-sample size was chosen. As shown in Table 1, the HTGD seems to be more robust to data randomness than SGD and GD. This is not surprising, since the sampling phase selects the most informative observations for the gradient descent, which makes the HTGD less sensitive to noise. It also provides more precise estimates, as suggested by the results in Table 2.

HTGD 1.52 1.45
SGD 2.21 2.09

Table 1: Mean standard deviations of the final estimates of the parameter across the simulations
Min. Median Max. Mean S.D.
HTGD
-9.5 -8.7 -7.8 -8.6 1.45
13.3 14.6 15.9 14.5 1.52
SGD
-9.9 -8.2 -7.4 -8.2 2.09
12.7 13.9 16.6 15.2 2.21
Table 2: Statistics on the global behavior of the final estimates of the two parameter components across the simulations
Figure 3: 50 trajectories of the estimator of the parameter as a function of the number of iterations for the HTGD (solid) and mini-batch SGD (dotted) algorithms, over 50 populations (left) and over 1 population (right)

5.2 The symmetric model

Consider now an i.i.d. sample $X_1,\ldots,X_n$ drawn from an unknown probability distribution on the real line, supposed to belong to a semi-parametric collection of location models, dominated by some $\sigma$-finite measure. The related densities are denoted by $f_\theta(x)=f(x-\theta)$, where $\theta$ is a location parameter and $f$ a (twice differentiable) density, symmetric about $0$, i.e. $f(-x)=f(x)$ for all $x$. The density $f$ is unknown in practice and may be multimodal. For simplicity, we restrict ourselves to the univariate setting, but similar arguments can be developed in the multivariate case. For such a general semi-parametric model, it is well-known that neither the sample mean nor the median (if, for instance, the distribution does not weight the singleton $\{\theta\}$) are good candidates for estimating the location parameter $\theta$. In the semiparametric literature this model is referred to as the symmetric model, see BKRW93. It is known that the tangent space (i.e. the set of scores) with respect to the parameter of interest and that with respect to the nuisance parameter are orthogonal. The global tangent space is thus given by the (orthogonal) sum of the tangent space with respect to the parameter of interest and the tangent space with respect to the nuisance parameter.

Orthogonality simply results from the fact that the score with respect to the location parameter is an odd function, and it implies that the parameter $\theta$ can be adaptively estimated, as if the density $f$ were known; refer to BKRW93 for more details. In practice, $f$ is estimated by means of some symmetrized kernel density estimator. Given a Parzen-Rosenblatt kernel $K$ (e.g. a Gaussian kernel) for instance, one considers the corresponding kernel density estimate, where $h>0$ is the smoothing bandwidth, and forms its symmetrized version (which is an even function). The related score is then derived from this symmetrized estimate.
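For illustration, a minimal sketch of such a symmetrized kernel estimate with a Gaussian kernel, together with a finite-difference approximation of the associated score, is given below; the bandwidth choice, the sign convention of the score and all names are assumptions made for the sketch.

```python
import numpy as np

def symmetrized_kde(x, data, theta, h):
    """Gaussian kernel estimate of the density of X - theta, symmetrized about 0
    (averaging the estimate at x and -x), so that the result is an even function."""
    def kde(u):
        z = (u[:, None] - (data[None, :] - theta)) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))
    x = np.atleast_1d(x).astype(float)
    return 0.5 * (kde(x) + kde(-x))

def score(x, data, theta, h, eps=1e-4):
    """Finite-difference approximation of -(d/dx) log f_sym(x), an assumed sign
    convention for the score entering the (approximate) likelihood gradient."""
    f_plus = symmetrized_kde(x + eps, data, theta, h)
    f_minus = symmetrized_kde(x - eps, data, theta, h)
    f = symmetrized_kde(x, data, theta, h)
    return -(f_plus - f_minus) / (2 * eps * f)
```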

In order to perform maximum likelihood estimation approximately, one can try to implement a gradient descent method to get an efficient estimator of $\theta$. For instance, for a reasonable sample size $n$, it is possible to show that, starting for instance from the empirical median and with an adequate choice of the rate $\gamma_t$, the resulting sequence converges to the true MLE. The complexity of this algorithm is typically of order $n^2 T$ if $T$ is the number of iterations, due to the tedious computations required to evaluate the kernel density estimator (and its derivatives) at all points $X_1,\ldots,X_n$. It is thus relevant in this case to try to reduce it by means of (Poisson) survey sampling, the iterations then taking the form of the HTGD updates described in section 3. As shown in section 4.3, the optimal choice would be to take, at the $t$-th iteration, inclusion probabilities proportional to the quantities exhibited in Proposition 1:

(12)

Unfortunately this is not possible, because the target value is unknown, and replacing it by the current iterate in (12) yields obvious computational difficulties. For this reason, we suggest using the (much simpler) Poisson weights:

Fig. 5 depicts the performance of the HTGD algorithm when the location parameter is fixed and the underlying density $f$ is a balanced mixture of two Gaussian densities with distinct means and the same variance, compared to that of the usual SGD method. Based on a population sample, the HTGD and SGD methods have been implemented with the same expected sub-sample size and