Online Variance Reduction
Modern stochastic optimization methods often rely on uniform sampling, which is agnostic to the underlying characteristics of the data. This can degrade convergence by yielding estimates that suffer from high variance. A possible remedy is to employ non-uniform importance sampling techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting which poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling datapoints, it naturally extends to selecting coordinates or even blocks thereof. Empirical validations underline the benefits of our method in several settings.
Introduction
Empirical risk minimization (ERM) is among the most important paradigms in machine learning, and is often the strategy of choice due to its generality and statistical efficiency. In ERM, we draw a set of samples $\{x_i\}_{i=1}^{n}$ from the underlying data distribution and aim to find a solution $w$ that minimizes the empirical risk,
$$\min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i),$$
where $\ell(\cdot,\cdot)$ is a given loss function, and $\mathcal{W}$ is usually a compact domain.
In this work we are interested in sequential procedures for minimizing the ERM objective, and refer to such methods as ERM solvers. More concretely, we focus on the regime where the number of samples $n$ is very large, and it is therefore desirable to employ ERM solvers that require only a few passes over the dataset. There exists a rich arsenal of such efficient solvers, investigated throughout the years, with the canonical example from this category being Stochastic Gradient Descent (SGD).
Typically, such methods require an unbiased estimate of the loss function at each round, which is usually generated by sampling a few points uniformly at random from the dataset. However, by employing uniform sampling, these methods are insensitive to the intrinsic structure of the data. In the case of SGD, for example, some data points might produce large gradients, yet they are assigned the same probability of being sampled as any other point. Ignoring this structure often results in high-variance estimates, which is likely to degrade performance.
The above issue can be mended by employing non-uniform importance sampling. And indeed, we have recently witnessed several techniques to do so: Zhao and Zhang (2015) and similarly Needell et al. (2014), suggest using prior knowledge on the gradients of each datapoint in order to devise predefined importance sampling distributions. Stich et al. (2017) devise adaptive sampling techniques guided by a robust optimization approach. These are only a few examples of a larger body of work (Bouchard et al., 2015; Alain et al., 2015; Csiba and Richtárik, 2016).
Interestingly, the recent works of Namkoong et al. (2017) and Salehi et al. (2017) formulate the task of devising importance sampling distributions as an online learning problem with bandit feedback. In this context, they think of the algorithm, which adaptively chooses the distribution, as a player that competes against the ERM solver. The goal of the player is to minimize the cumulative variance of the resulting (gradient) estimates. Curiously, both methods rely on some form of the "linearization trick" (i.e., they update according to a first-order approximation of the costs rather than the costs themselves) in order to resort to the analysis of EXP3 (Auer et al., 2002).
On the other hand, the theoretical guarantees of the above methods are somewhat limited. Strictly speaking, none of them provides regret guarantees with respect to the best fixed distribution in hindsight: Namkoong et al. (2017) only compete with the best distribution among a subset of the simplex (around the uniform distribution). Conversely, Salehi et al. (2017) compete against a solution which might perform worse than the best in hindsight by a multiplicative factor.
In this work, we adopt the above mentioned online learning formulation and design novel importance sampling techniques. Our adaptive sampling procedure is simple and efficient, and in contrast to previous work, we are able to provide regret guarantees with respect to the best fixed distribution in the simplex. As our contributions, we
motivate theoretically why regret minimization is meaningful in this setting,
propose a novel bandit algorithm for variance reduction ensuring regret of $\tilde{O}(n^{1/3} T^{2/3})$,
empirically validate our method and provide an efficient implementation (the source code is available at https://github.com/zalanborsos/online-variance-reduction).
On the technical side, we do not rely on a "linearization trick" but rather directly employ a scheme based on the classical Follow-the-Regularized-Leader approach. Our analysis entails several technical challenges, most notably handling unbounded cost functions while only receiving partial (bandit) feedback. Our design and analysis draw inspiration from the seminal works of Auer et al. (2002) and Abernethy et al. (2008). Although we present our method for choosing datapoints, it naturally applies to choosing coordinates in coordinate descent, or even blocks thereof (Allen-Zhu et al., 2016; Perekrestenko et al., 2017; Nesterov, 2012; Necoara et al., 2011). More broadly, the proposed algorithm can be incorporated in any sequential algorithm that relies on an unbiased estimate of the loss. A prominent application of our method is variance reduction for SGD, which can be achieved by considering gradient norms as losses, i.e., replacing $\ell(w, x_i)$ with $\|\nabla_w \ell(w, x_i)\|$. With this modification, our method minimizes the cumulative variance of the gradients throughout the optimization process. The latter quantity directly affects the quality of optimization (we elaborate on this in Appendix A).
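The benefit of gradient-norm-based sampling over uniform sampling can be illustrated numerically. The following sketch (our own illustration, with assumed notation, not the paper's algorithm) compares the second moment of an importance-weighted gradient-magnitude estimate under uniform sampling and under sampling proportional to the gradient norms, which minimizes the second moment over the simplex by the Cauchy-Schwarz inequality:

```python
import numpy as np

# For fixed per-point gradient norms g_i, sampling index i with probability
# p_i and reweighting by 1 / (n * p_i) yields an unbiased estimate whose
# second moment is (1/n^2) * sum_i g_i^2 / p_i.  Sampling proportionally to
# g_i minimizes this quantity; the minimum equals the squared mean of the g_i.
rng = np.random.default_rng(0)
g = rng.uniform(0.1, 5.0, size=50)     # hypothetical per-point gradient norms
n = len(g)

def second_moment(p):
    return (g ** 2 / p).sum() / n ** 2

p_uniform = np.full(n, 1.0 / n)
p_prop = g / g.sum()                   # proportional to the gradient norms

assert second_moment(p_prop) <= second_moment(p_uniform)
assert np.isclose(second_moment(p_prop), g.mean() ** 2)
```

The gap between the two second moments grows with the spread of the gradient norms, which is exactly the regime where importance sampling pays off.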
The paper is organized as follows. In Section 2, we formalize the online learning setup of variance reduction and motivate why regret is a suitable performance measure. As the first step of our analysis, we investigate the full information setting in Section 3, which serves as a means for studying the bandit setting in Section 4. Finally, we validate our method empirically and provide a detailed discussion of the results in Section 5.
Typical sequential solvers for ERM require a fresh unbiased estimate of the loss at each round, which is obtained by repeatedly sampling from the dataset. The template of Figure 1 captures a rich family of such solvers, including SGD, SAGA (Defazio et al., 2014), SVRG (Johnson and Zhang, 2013), and online $k$-Means (Bottou and Bengio, 1995).
A natural way to devise the unbiased estimates is to sample $i$ uniformly at random and return $\ell(w, x_i)$. Indeed, uniform sampling is the common practice when applying SGD, SAGA, SVRG, and online $k$-Means. Nevertheless, any distribution $p$ in the probability simplex $\Delta_n$ with strictly positive entries induces an unbiased estimate. Concretely, sampling an index $i \sim p$ induces the estimate
$$\tilde{\ell}(w) = \frac{\ell(w, x_i)}{n\, p_i},$$
and it is immediate to show that $\mathbb{E}_{i \sim p}\big[\tilde{\ell}(w)\big] = \frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i)$. This work is concerned with efficient ways of choosing a "good" sequence of sampling distributions $p_1, \ldots, p_T$.
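The unbiasedness above holds for any strictly positive sampling distribution, and can be checked exactly (not just by Monte Carlo) by summing over all indices. A minimal sketch with assumed variable names:

```python
import numpy as np

# Importance-sampling estimate: draw i ~ p, return ell_i / (n * p_i).
# Its exact expectation over i ~ p equals the empirical risk (1/n) sum_i ell_i
# for ANY p with strictly positive entries.
rng = np.random.default_rng(1)
losses = rng.uniform(0.0, 2.0, size=10)   # ell(w, x_i) for a fixed w
n = len(losses)

p = rng.uniform(0.2, 1.0, size=n)
p /= p.sum()                              # arbitrary positive distribution

estimates = losses / (n * p)              # estimate value if index i is drawn
expectation = (p * estimates).sum()       # exact expectation over i ~ p

assert np.isclose(expectation, losses.mean())
```

While the expectation is independent of $p$, the variance of the estimate is not, which is precisely what the sequel optimizes.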
It is well known that the performance of typical solvers (e.g., SGD, SAGA, SVRG) improves as the variance of the estimates becomes smaller. Thus, a natural criterion for measuring the performance of a sampling distribution is the variance of the induced estimate,
$$\mathrm{Var}_{i \sim p}\big[\tilde{\ell}(w)\big] = \frac{1}{n^2}\sum_{i=1}^{n} \frac{\ell^2(w, x_i)}{p_i} - \Big(\frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i)\Big)^2.$$
Denoting $\ell_t(i) := \ell(w_t, x_i)$ and noting that the second term above is independent of $p$, we may now cast the task of sequentially choosing the sampling distributions as the online optimization problem shown in Figure 2. In this protocol, we treat the sequential solver as an adversary that chooses a sequence of loss vectors $\ell_1, \ldots, \ell_T \in \mathbb{R}^n$, where $\ell_t(i)$ denotes $\ell(w_t, x_i)$. Each loss vector is a function of $w_t$, the solution chosen by the solver in the corresponding round (note that we abstract out this dependence on $w_t$ in $\ell_t$). The cost (we use the term "cost function" to refer to $f_t$ in order to distinguish it from the loss $\ell_t$) that the player incurs at round $t$ is the second moment of the loss estimate induced by the distribution $p_t$ chosen by the player at round $t$,
$$f_t(p_t) = \frac{1}{n^2}\sum_{i=1}^{n} \frac{\ell_t^2(i)}{p_t(i)}.$$
Next, we define the regret, which is our performance measure for the player,
$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(p_t) - \min_{p \in \Delta_n} \sum_{t=1}^{T} f_t(p).$$
Our goal is to devise a no-regret algorithm, i.e., one ensuring $\mathrm{Regret}_T = o(T)$, which in turn guarantees that we asymptotically recover the best fixed sampling distribution. In the bandit feedback setting, the player aims to minimize its expected regret $\mathbb{E}[\mathrm{Regret}_T]$, where the expectation is taken with respect to the randomized choices of the player and the adversary. Note that we allow the choices of the adversary to depend on the past choices of the player.
There are a few noteworthy comments regarding the above setup. First, it is immediate to verify that the cost functions are convex in $p$, so this is an online convex optimization problem. Second, the cost functions are unbounded over the simplex (as some $p_i \to 0$), which poses a challenge in ensuring no-regret. Finally, notice that the player receives bandit feedback, i.e., he is allowed to inspect the losses only at the coordinate $i_t$ chosen at time $t$. To the best of our knowledge, this is the first natural setting where, as we will show, it is possible to provide no-regret guarantees despite bandit feedback and unbounded costs.
Throughout this work, we assume that the losses are bounded: $\ell_t(i) \le L$ for all $t$ and $i$. Note that our analysis may be extended to the case where the bounds are instance-dependent, i.e., $\ell_t(i) \le L_i$ for all $t$ and $i$. In practice, it can be beneficial to take the different $L_i$'s into account, as we demonstrate in our experiments.
Let us focus on the family of ERM solvers depicted in Figure 1. As discussed above, devising loss estimates with low variance is beneficial for such solvers — in the case of SGD, this is due to the strong connection between the cumulative variance of the gradients and the quality of optimization, which we discuss in more detail in Appendix A. Translating this observation into the online variance reduction setting suggests a natural performance measure: rather than competing with the best fixed distribution in hindsight, we would like to compete against the sequence of best distributions per round, $p_t^\star \in \arg\min_{p \in \Delta_n} f_t(p)$. This optimal sequence ensures zero variance in every round, and is therefore the ideal baseline to compete against. This also raises the question of whether regret guarantees, which compare against the best fixed distribution in hindsight, are at all meaningful in this context. Note that regret minimization is meaningful in stochastic optimization when we assume that the losses are generated i.i.d. from some fixed distribution (Cesa-Bianchi et al., 2004). Yet, this certainly does not apply in our case, since the losses are non-stationary and non-oblivious.
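The zero-variance property of the per-round ideal baseline follows because sampling proportionally to the losses makes the importance-weighted estimate constant in $i$. A short numerical check (our own sketch, with assumed names):

```python
import numpy as np

# With non-negative losses, sampling i with probability proportional to
# ell_t(i) makes the estimate ell_t(i) / (n * p_i) equal to the empirical
# mean for every i, hence the variance of the estimate is exactly zero.
rng = np.random.default_rng(2)
ell = rng.uniform(0.5, 3.0, size=8)
n = len(ell)

p_star = ell / ell.sum()
estimates = ell / (n * p_star)          # constant: equals ell.mean() everywhere

mean = estimates @ p_star
variance = (p_star * (estimates - mean) ** 2).sum()

assert np.allclose(estimates, ell.mean())
assert np.isclose(variance, 0.0)
```

Of course, realizing this baseline would require knowing $\ell_t$ before choosing $p_t$, which is exactly what the online setting forbids.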
Unfortunately, ensuring guarantees compared to the sequence of best distributions per round seems hard in general. However, as we show next, the regret is still a meaningful measure for sequential ERM solvers. Concretely, recall that our ultimate goal is to minimize the ERM objective. Thus, we are only interested in ERM solvers that actually converge to a (hopefully good) solution of the ERM problem. More formally, let us define the asymptotic losses
$$\ell_\infty(i) := \lim_{t \to \infty} \ell_t(i),$$
where we recall that $\ell_t(i) = \ell(w_t, x_i)$, and assume the above limit to exist for every $i$. We will also denote $\ell_\infty := (\ell_\infty(1), \ldots, \ell_\infty(n))$. Moreover, let us assume that the asymptotic solution is better on average than any of the sequential solutions, in the sense that
$$\frac{1}{n}\sum_{i=1}^{n} \ell_\infty(i) \;\le\; \frac{1}{n}\sum_{i=1}^{n} \ell_t(i) \qquad \text{for all } t.$$
This assumption naturally holds when the ERM solver converges to the optimal solution of the problem, as applies for SGD in the convex case.
The next lemma shows that under these mild assumptions, competing against the best fixed distribution in hindsight is not far from competing against the ideal baseline.
Lemma 1. Consider the online variance reduction setting, and for any $T$ denote by $\epsilon_T$ the average excess loss of the solver's iterates over the asymptotic solution, $\epsilon_T := \frac{1}{T}\sum_{t=1}^{T}\big(\frac{1}{n}\sum_{i=1}^{n} \ell_t(i) - \frac{1}{n}\sum_{i=1}^{n} \ell_\infty(i)\big)$. Assuming that the losses $\ell_t(i)$ are non-negative for all $t$ and $i$, the gap between the cumulative cost of the best fixed distribution in hindsight and that of the ideal per-round baseline is controlled by $\epsilon_T$, for any $T$.
Thus, the above lemma connects the convergence rate of the ERM solver to the benefit of regret minimization: the benefit is larger if the ERM solver converges faster. As an example, let us assume that $\epsilon_T = O(1/\sqrt{T})$, which loosely speaking holds for SGD. Hence, by Lemma 1, the regret guarantees translate into guarantees with respect to the ideal baseline, up to an additional cost governed by this convergence rate.
In this section, we analyze variance reduction with full-information feedback. We henceforth consider the same setting as in Figure 2, with the difference that in each round the player receives as feedback the full loss vector $\ell_t$ instead of only $\ell_t(i_t)$. We introduce a new algorithm based on the FTRL approach, and establish an $O(\sqrt{T})$ regret bound for our method in Theorem 3. While this setup in itself has little practical relevance, it later serves as a means for investigating the bandit setting.
Follow-the-Regularized-Leader (FTRL) is a powerful approach to online learning problems. According to FTRL, in each round one selects the point that minimizes the cumulative cost over past rounds plus a regularization term, i.e., $p_{t+1} \in \arg\min_{p \in \Delta_n} \sum_{s=1}^{t} f_s(p) + \mathcal{R}(p)$. The regularizer usually ensures that the choices do not change abruptly over the rounds. We choose $\mathcal{R}(p) = \frac{\gamma^2}{n^2}\sum_{i=1}^{n} \frac{1}{p_i}$, which allows us to write FTRL as
$$p_{t+1} \in \arg\min_{p \in \Delta_n} \; \frac{1}{n^2}\sum_{i=1}^{n} \frac{\sum_{s=1}^{t} \ell_s^2(i) + \gamma^2}{p_i}. \tag{3}$$
The regularizer $\mathcal{R}$ is a natural candidate in our setting, since it has the same structural form as the cost functions. It also prevents FTRL from assigning vanishing probability mass to any component, thus ensuring that the incurred costs never explode. Moreover, it admits a closed-form solution to the FTRL objective, as the following lemma shows.
Lemma 2. Denote $\tilde{L}_t(i) := \sqrt{\sum_{s=1}^{t} \ell_s^2(i) + \gamma^2}$. The solution to Eq. (3) is $p_{t+1}(i) = \tilde{L}_t(i) \big/ \sum_{j=1}^{n} \tilde{L}_t(j)$.
Proof. Recalling $\tilde{L}_t(i)$ allows us to write the FTRL objective as $\frac{1}{n^2}\sum_{i=1}^{n} \tilde{L}_t^2(i)/p_i$. It is immediate to validate that the offered solution satisfies the first-order optimality conditions in $\Delta_n$. Global optimality follows since the FTRL objective is convex over the simplex.
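The closed-form step can be checked numerically. The following sketch (variable names are our own) verifies that minimizing $\sum_i a_i / p_i$ over the simplex, with $a_i$ the cumulative squared losses plus the regularization constant, is achieved by $p_i \propto \sqrt{a_i}$, with optimal value $\big(\sum_i \sqrt{a_i}\big)^2$ by the Cauchy-Schwarz inequality:

```python
import numpy as np

# FTRL-style closed form: p_i proportional to sqrt(a_i) minimizes
# sum_i a_i / p_i over the simplex.
rng = np.random.default_rng(3)
n, t, gamma = 6, 20, 0.5
past_losses = rng.uniform(0.0, 1.0, size=(t, n))
a = (past_losses ** 2).sum(axis=0) + gamma ** 2   # cumulative squared losses

p_ftrl = np.sqrt(a) / np.sqrt(a).sum()

def objective(p):
    return (a / p).sum()

# The optimal value equals (sum_i sqrt(a_i))^2:
assert np.isclose(objective(p_ftrl), np.sqrt(a).sum() ** 2)

# No random point of the simplex does better:
for _ in range(100):
    q = rng.dirichlet(np.ones(n))
    assert objective(q) >= objective(p_ftrl) - 1e-9
```

Note how the $\gamma^2$ term keeps every $a_i$, and hence every $p_i$, bounded away from zero, mirroring the role of the regularizer discussed above.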
We are interested in the regret incurred by our method. The following theorem shows that, despite the non-standard form of the cost functions, we can obtain $O(\sqrt{T})$ regret.
Theorem 3. With an appropriate choice of $\gamma$, the regret of the FTRL scheme proposed in Equation (3) is $O(\sqrt{T})$, with constants depending on the loss bound $L$ and on $n$.
Before presenting the proof, we briefly sketch it. When trying to apply the classical FTRL regret bounds, we encounter a difficulty, namely that the regularizer in Equation (3) can be unbounded. To overcome this issue, we first consider competing with the optimal distribution on a restricted simplex, where the regularizer is bounded. Then we investigate the cost of considering the restricted simplex instead of the full simplex.
Along the lines described above, consider the simplex $\Delta_n$ and a restricted simplex $\Delta_n' \subset \Delta_n$, in which every coordinate is bounded away from zero by a parameter to be defined later. We can now decompose the regret into two terms: (A) the regret with respect to the best distribution in $\Delta_n'$, and (B) the difference between the minimal cumulative costs over $\Delta_n'$ and over $\Delta_n$. We continue by bounding these terms separately. To bound (A), we use standard tools relating the regret to the stability of the FTRL decision sequence (the FTL-BTL lemma). Term (B) is bounded by a direct calculation of the minimal values over $\Delta_n'$ and $\Delta_n$.
The following lemma bounds term (A).
Setting the restriction parameter appropriately, we have:
The regret of FTRL may be related to the stability of the online decision sequence, as shown in the following lemma due to Kalai and Vempala (2005) (the proof can also be found in Hazan (2011) or in Shalev-Shwartz et al. (2012)):
Let $\mathcal{K}$ be a convex set and $\mathcal{R} : \mathcal{K} \to \mathbb{R}$ be a regularizer. Given a sequence of cost functions $\{f_t\}_{t=1}^{T}$ defined over $\mathcal{K}$, setting $p_{t+1} \in \arg\min_{p \in \mathcal{K}} \sum_{s=1}^{t} f_s(p) + \mathcal{R}(p)$ ensures
$$\sum_{t=1}^{T} f_t(p_t) - \min_{p \in \mathcal{K}} \sum_{t=1}^{T} f_t(p) \;\le\; \sum_{t=1}^{T} \big(f_t(p_t) - f_t(p_{t+1})\big) + \max_{p \in \mathcal{K}} \mathcal{R}(p) - \min_{p \in \mathcal{K}} \mathcal{R}(p).$$
Notice that $\mathcal{R}$ is non-negative and bounded over the restricted simplex $\Delta_n'$. Thus, applying the above lemma implies that,
Using the closed-form solution for the $p_t$'s (see Lemma 2) enables us to upper bound the last term as follows,
Combining the above bounds completes the proof.
The next lemma bounds term (B).
Using first-order optimality conditions, we are able to compute the minimal value of the cumulative cost over the restricted simplex exactly. A similar analysis allows us to extract a closed-form solution for the best distribution in hindsight over the full simplex $\Delta_n$. This in turn enables us to upper bound the difference between the two minimal values. Combining these bounds, we are able to prove the lemma.
In this section, we investigate the bandit setting (see Figure 2), which is of great practical appeal, as described in Section 2. Our method for the bandit setting is depicted in Algorithm 1, and it ensures a bound of $\tilde{O}(n^{1/3} T^{2/3})$ on the expected regret (see Theorem 8). Importantly, this bound holds even for non-oblivious adversaries. The design and analysis of our method build on some of the ideas that appeared in the seminal work of Auer et al. (2002).
Algorithm 1 uses the bandit feedback in order to design an unbiased estimate of the true loss in each round. These estimates are then used instead of the true losses by the full-information FTRL algorithm analyzed in the previous section. We do not play directly according to the FTRL predictions, but rather mix them with the uniform distribution. Mixing is necessary in order to ensure that the loss estimates are bounded, which is a crucial condition used in the analysis. Next, we elaborate on our method and its analysis.
The algorithm samples an arm $i_t \sim p_t$ at every round and receives the bandit feedback $\ell_t(i_t)$. (The sampling and update in the presented form have a complexity of $O(n)$ per round. There is a standard way to improve this based on segment trees that gives $O(\log n)$ for sampling and update; a detailed description of this idea can be found in Section A.4 of Salehi et al. (2017). The efficient implementation of the sampler is available at https://github.com/zalanborsos/online-variance-reduction.) This feedback may be used to construct an estimate of the true (squared) loss as follows,
$$\tilde{\ell}_t^2(i) := \frac{\ell_t^2(i)}{p_t(i)}\,\mathbb{1}\{i_t = i\},$$
and it is immediate to validate that the above is unbiased in the following sense,
$$\mathbb{E}_{i_t \sim p_t}\big[\tilde{\ell}_t^2(i) \,\big|\, p_t\big] = \ell_t^2(i).$$
Analogously to the previous section, it is natural to define the modified cost functions as
$$\tilde{f}_t(p) = \frac{1}{n^2}\sum_{i=1}^{n} \frac{\tilde{\ell}_t^2(i)}{p_i}.$$
Clearly, $\tilde{f}_t$ is an unbiased estimate of the true cost: $\mathbb{E}\big[\tilde{f}_t(p) \,\big|\, p_t\big] = f_t(p)$. From now on, we omit the conditioning on $p_t$ for notational brevity.
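The sampling subroutine must draw an index from $p_t$ and update a single weight each round; as noted in the footnote above, both operations can be supported in $O(\log n)$ time with a tree over the weights. A minimal sketch using a Fenwick (binary indexed) tree — our own illustration, not the paper's implementation:

```python
import random

class FenwickSampler:
    """Weighted sampling with O(log n) draws and O(log n) weight updates."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)   # Fenwick tree of partial sums
        self.w = [0.0] * self.n
        for i, wi in enumerate(weights):
            self.update(i, wi)

    def update(self, i, new_weight):
        # Change the weight of index i in O(log n).
        delta = new_weight - self.w[i]
        self.w[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def _prefix(self, i):
        # Sum of the first i weights in O(log n).
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, rng=random):
        # Draw an index with probability proportional to its weight in O(log n)
        # by descending the implicit tree.
        u = rng.random() * self._prefix(self.n)
        idx, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                idx = nxt
            bit //= 2
        return idx                          # 0-based index of the drawn arm

sampler = FenwickSampler([1.0, 2.0, 3.0, 4.0])
rng = random.Random(0)
counts = [0] * 4
for _ in range(20000):
    counts[sampler.sample(rng)] += 1
freqs = [c / 20000 for c in counts]         # roughly proportional to the weights
```

A segment tree over the weights, as referenced in Salehi et al. (2017), achieves the same complexities; the Fenwick tree is simply a compact alternative.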
Having devised an unbiased estimate, we could return to the full-information analysis of FTRL with the modified losses. However, this poses a difficulty, since the modified losses can be unbounded. We remedy this by mixing the FTRL output, $\tilde{p}_t$, with the uniform distribution. Mixing encourages exploration, which in turn gives a handle on the possibly unbounded modified losses. Let $\theta \in (0, 1)$ be the mixing coefficient, and define
$$p_t = (1 - \theta)\,\tilde{p}_t + \frac{\theta}{n}\,\mathbf{1}.$$
Indeed, since $p_t(i) \ge \theta / n$, we have $\tilde{\ell}_t^2(i) \le n L^2 / \theta$.
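The effect of mixing can be verified directly: however peaked the FTRL output is, the mixed distribution has every coordinate at least $\theta / n$, which caps the importance-weighted squared losses. A sketch with assumed symbols ($\theta$, $L$):

```python
import numpy as np

# Mixing the FTRL output with the uniform distribution guarantees
# p_t(i) >= theta / n, so ell_t(i)^2 / p_t(i) <= n * L^2 / theta.
rng = np.random.default_rng(4)
n, theta, L = 20, 0.3, 1.0

p_ftrl = rng.dirichlet(np.ones(n) * 0.1)     # a possibly very peaked FTRL output
p_mixed = (1 - theta) * p_ftrl + theta / n

assert np.isclose(p_mixed.sum(), 1.0)        # still a distribution
assert p_mixed.min() >= theta / n            # uniform floor on every coordinate

losses = rng.uniform(0.0, L, size=n)         # ell_t(i) in [0, L]
weighted = losses ** 2 / p_mixed
assert weighted.max() <= n * L ** 2 / theta + 1e-12
```

Larger $\theta$ tightens this bound at the price of a larger deviation from the FTRL prediction, which is exactly the trade-off balanced in the analysis below.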
We start by analyzing the pseudo-regret of our algorithm, in which we compare the expected cost incurred by the algorithm to the expected cost incurred by the optimal distribution. The pseudo-regret is defined as
$$\overline{\mathrm{Regret}}_T = \mathbb{E}\Big[\sum_{t=1}^{T} f_t(p_t)\Big] - \min_{p \in \Delta_n} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(p)\Big],$$
where the expectation is taken with respect to both the player's choices and the loss realizations. The pseudo-regret is only a lower bound for the expected regret, with equality when the adversary is oblivious, i.e., does not take the past choices of the player into account.
Theorem 7. Assuming the losses are bounded by $L$, Algorithm 1 ensures the following bound on the pseudo-regret,
Proof sketch. Using the unbiasedness of the modified costs, we can decompose the pseudo-regret into two terms: (A) the cost we incur by mixing, and (B) a term upper bounded by the regret of playing FTRL with the modified losses. We now inspect each term separately.
An upper bound on the mixing term (A) results from the following simple observation:
For bounding (B), notice that the algorithm performs FTRL over the modified cost sequence. Combining this with the bound $\tilde{\ell}_t^2(i) \le n L^2 / \theta$ allows us to apply Theorem 3 and get,
Due to Jensen’s inequality we have
Putting these results together, we get an upper bound on the pseudo-regret, which we can optimize in terms of $\theta$: using the above bounds, and since the losses are bounded, we can set $\theta$ appropriately to get the result. Note that this choice of $\theta$ depends on knowing $T$ in advance. If we do not assume that this is possible, we can use the "doubling trick" and incur an additional constant multiplier in the regret. ∎
Ultimately, we are interested in the expected regret, where we allow the adversary to be non-oblivious, i.e., to make decisions by taking the player's past choices into account. Next, we present the main result of this paper, which establishes an expected regret bound of $\tilde{O}(n^{1/3} T^{2/3})$, where the $\tilde{O}$ notation hides logarithmic factors.
Theorem 8. Assuming the losses are bounded by $L$, the following holds for the expected regret of Algorithm 1,
Proof sketch. Using the unbiasedness of the modified costs allows us to decompose the regret; the decomposition uses Equation (7) together with Jensen's inequality (similarly to the proof of Theorem 7), as well as the closed-form solutions for the minimal values of the true and modified cumulative costs over the simplex.
Our approach to bounding the remaining term is to establish a high-probability bound on it. In order to do so, we bound the differences between the true and the estimated cumulative losses per point. This can be done by applying the appropriate concentration results described below.
Bounding the estimation error. Fix a point $i$, and define the differences $Z_t := \tilde{\ell}_t^2(i) - \ell_t^2(i)$. Recalling that $\mathbb{E}\big[\tilde{\ell}_t^2(i) \,\big|\, p_t\big] = \ell_t^2(i)$, we have that $\{Z_t\}_t$ is a martingale difference sequence with respect to the filtration associated with the history of the strategy. This allows us to apply a version of Freedman's inequality (Freedman, 1975), which bounds the sum of the differences with respect to their cumulative conditional variance. Loosely speaking, Freedman's inequality implies that with probability at least $1 - \delta$,
Importantly, the sum of conditional variances can be related to the regret. Indeed, let $p^\star$ be the best distribution in hindsight, i.e., $p^\star \in \arg\min_{p \in \Delta_n} \sum_{t=1}^{T} f_t(p)$, and consider the cumulative conditional variance. Then the following can be shown:
To simplify the proof sketch, ignore the second term. Plugging this back into Freedman’s inequality we get,
Final bound. Combining the above with the definition of the regret, one can show that with probability at least $1 - \delta$,
Since the losses are bounded by $L$, we can take a small enough $\delta$ such that,
where the second line uses Jensen's inequality with respect to the concave function $\sqrt{\cdot}$, and the last line again relies on Jensen's inequality. Plugging the above bound back into Eq. (4), we are able to establish the proof. The full proof is deferred to Appendix E. Note that in the full proof we do not explicitly relate the conditional variances to the regret; rather, this relation is implicit in the analysis. ∎
To verify this intuition, we perform the image classification experiment of Bouchard et al. (2015). We train one-vs-all logistic regression on the Pascal VOC 2007 dataset (Everingham et al., 2010), with image features extracted from the last layer of VGG16 (Simonyan and Zisserman, 2015) pretrained on ImageNet. We measure the average precision by reporting its mean over the 20 classes of the test data. The optimization is performed with AdaGrad (Duchi et al., 2011), with the learning rate initialized to 0.1. The losses received by the bandit methods are the norms of the logistic loss gradients. We compare our method, Variance Reducer Bandit (VRB), to:
uniform sampling for SGD,
Adaptive Weighted SGD (AW) (Bouchard et al., 2015) — variance reduction by sampling from a chosen distribution whose parameters are optimized in alternation with the model parameters,
MABS (Salehi et al., 2017) — bandit algorithm for variance reduction that relies on EXP3 through employing modified losses.
The hyperparameters of the methods are chosen by cross-validation on the validation portion of the dataset. The results are shown in Figure 4, where the shaded areas represent confidence intervals over 10 runs. The best performing method is AW, but its disadvantage compared to the bandit algorithms is that it requires choosing a family of sampling distributions, which usually incorporates prior knowledge, and calculating the derivative of the log-density. VRB and AW both outperform uniform sampling with respect to training time. VRB performs similarly to AW at convergence, and speeds up training compared to uniform sampling by attaining a given score level roughly 10 times faster. We have also experimented with the variance reduction method of Namkoong et al. (2017), but it did not significantly outperform uniform sampling. Since cross-validation is costly, in Figure 4 we also show the effect of the hyperparameters of our method. More specifically, we compare the performance of VRB with a misspecified regularization parameter $\gamma$ to the best $\gamma$ chosen by cross-validation, where we compensate by using a higher mixing coefficient $\theta$. The fact that only the early-stage performance is affected is a sign of the method's robustness against regularizer misspecification.
In this experiment, we show that in some applications it is beneficial to work with per-sample upper bound estimates $L_i$ instead of a single global bound $L$. As an illustrative example, we choose mini-batch $k$-Means clustering (Sculley, 2010). This is a slight deviation from the presented theory, since we sample multiple points for the batch and update the sampler only once, upon observing the loss for the batch.
In the case of $k$-Means, the parameters consist of the coordinates of the cluster centers. As the cost function for a point is the squared Euclidean distance to the closest center, the loss received by VRB is the norm of the corresponding gradient. This lends itself to a natural estimation of the per-sample bounds $L_i$: choose a reference point randomly from the dataset and bound each $L_i$ in terms of the distance of $x_i$ to that reference point. The remaining hyperparameters are fixed for this experiment.
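The gradient norm that VRB receives has a simple closed form: for the $k$-Means cost of a point, $f(x) = \min_j \|x - c_j\|^2$, the gradient with respect to the closest center $c^\star$ is $2(c^\star - x)$, so its norm is twice the distance to that center. A short check of this identity (our own illustration):

```python
import numpy as np

# For f(x) = min_j ||x - c_j||^2, the gradient w.r.t. the closest center c*
# is 2 * (c* - x); its norm equals 2 * ||x - c*||.
rng = np.random.default_rng(5)
centers = rng.normal(size=(5, 3))          # k = 5 centers in R^3
x = rng.normal(size=3)                     # a data point

dists = np.linalg.norm(centers - x, axis=1)
j = dists.argmin()                         # index of the closest center
grad = 2.0 * (centers[j] - x)              # gradient of ||x - c_j||^2 in c_j

assert np.isclose(np.linalg.norm(grad), 2.0 * dists[j])
```

This is why distances to a fixed reference point give a cheap proxy for the per-sample bounds: the gradient norm of a point scales with its distance to the nearest center.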
We solve mini-batch $k$-Means with uniform sampling and with VRB, using the same $k$ and batch size for both. The initial centers are chosen with $k$-Means++ (Arthur and Vassilvitskii, 2007) from a random subsample of 1000 points of the training data, and they are shared between the methods. We generate 10 different sets of initial centers and run both algorithms 10 times on each set of centers, with different random seeds for the samplers. We train on the training portion of the data, and measure the cost on the test portion, for the following datasets:
CSN (Faulkner et al., 2011) — cellphone accelerometer data with 80,000 observations and 17 features,
KDD (KDD Cup 2004) — dataset used for the Protein Homology Prediction KDD competition, containing 145,751 observations with 74 features,
MNIST (LeCun et al., 1998) — 70,000 low resolution images of handwritten characters transformed using PCA with whitening and retaining 10 dimensions.
The evolution of the cost function on the test set with respect to the elapsed training time is shown in Figure 5. The chosen datasets illustrate three observed behaviors of our algorithm. In the case of CSN, our method significantly outperforms uniform sampling. In the case of KDD, the advantage of our method can be seen in the reduced variance of the cost over multiple runs, whereas on MNIST we observe no advantage. This behavior is highly dependent on intrinsic dataset characteristics: for MNIST, we note that the entropy of the best-in-hindsight sampling distribution is close to the entropy of the uniform distribution. We have also compared VRB with the bandit algorithms mentioned in the previous section. Since mini-batch $k$-Means converges in 1-2 epochs, these methods with uniform initialization do not significantly outperform uniform sampling. Thus, for this setting, careful initialization is necessary, which is naturally supported by our method.
We presented a novel importance sampling technique for variance reduction in an online learning formulation. First, we motivated why regret is a sensible measure of performance in this setting. Despite the bandit feedback and the unbounded costs, we provided an expected regret guarantee of $\tilde{O}(n^{1/3} T^{2/3})$, where the reference point is the best fixed sampling distribution in hindsight. We confirmed the theoretical findings with empirical validation.
Among the many possible future directions stands the question of the tightness of the expected regret bound of the algorithm. Another naturally arising direction is the theoretical analysis of the method when employed in conjunction with advanced stochastic solvers such as SVRG and SAGA.
The authors would like to thank Hasheminezhad Seyedrouzbeh for useful discussions during the course of this work. This research was supported by SNSF grant through the NRP 75 Big Data program. K.Y.L. is supported by the ETH Zurich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.
The relationship between the cumulative second moment of the gradients and the quality of optimization has been demonstrated in several works. Since the difference between the second moment and the variance is independent of the sampling distribution $p$, the guarantees of our method also translate to guarantees with respect to the cumulative second moments of the gradient estimates. Here we provide two concrete references.
For the following, assume that we would like to minimize a convex objective
$$F(w) = \mathbb{E}_{x \sim \mathcal{D}}\big[\ell(w, x)\big],$$
and that we are able to draw i.i.d. samples from the unknown distribution $\mathcal{D}$. Thus, given a point $w$, we are able to design an unbiased estimate of $\nabla F(w)$ by sampling $x \sim \mathcal{D}$ and taking $g = \nabla_w \ell(w, x)$ (clearly, $\mathbb{E}[g] = \nabla F(w)$). Now assume a gradient-based update rule, i.e.,
$$w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t g_t\big), \tag{10}$$
where $\Pi_{\mathcal{W}}$ denotes the projection onto $\mathcal{W}$ and $g_t$ is an unbiased gradient estimate at $w_t$. Next, we show that for two very popular gradient-based methods, AdaGrad and SGD for strongly-convex functions, the performance is directly related to the cumulative second moment of the gradient estimates, $\sum_{t=1}^{T} \mathbb{E}\big[\|g_t\|^2\big]$. The latter is exactly the objective of our online variance reduction method.
The AdaGrad algorithm employs the same rule as in Eq. (10), using a step size inversely proportional to $\sqrt{\sum_{s \le t} \|g_s\|^2}$. The next theorem substantiates its guarantees.
Assume that the diameter of $\mathcal{W}$ is bounded by $D$. Then, for the average iterate $\bar{w}_T$,
$$\mathbb{E}\big[F(\bar{w}_T)\big] - \min_{w \in \mathcal{W}} F(w) \;\le\; \frac{\sqrt{2}\,D}{T}\, \mathbb{E}\Bigg[\sqrt{\sum_{t=1}^{T} \|g_t\|^2}\Bigg].$$
The SGD algorithm for $\mu$-strongly-convex objectives employs the same rule as in Eq. (10), using $\eta_t = 1/(\mu t)$. The next theorem substantiates its guarantees.
Assume that $F$ is $\mu$-strongly convex. Then, for the average iterate $\bar{w}_T$,
$$\mathbb{E}\big[F(\bar{w}_T)\big] - \min_{w \in \mathcal{W}} F(w) \;\le\; \frac{1}{2\mu T}\sum_{t=1}^{T} \frac{\mathbb{E}\big[\|g_t\|^2\big]}{t}.$$
Next, we bound the cumulative loss per point,
where the second line uses