is currently the standard in machine learning for the optimization of highly multivariate functions if their gradient is corrupted by noise. This includes the online or mini-batch training of neural networks, logistic regression (Zhang, 2004; Bottou, 2010) and variational models (e.g. Hoffman et al., 2013; Hensman et al., 2012; Broderick et al., 2013). In all these cases, noisy gradients arise because an exchangeable loss function $\mathcal{L}(x)$ of the optimization parameters $x \in \mathbb{R}^D$, across a large dataset $\{d_i\}_{i=1,\dots,M}$, is evaluated only on a subset $\{d_j\}_{j=1,\dots,m}$:
$$\mathcal{L}(x) := \frac{1}{M}\sum_{i=1}^{M} \ell(x, d_i) \approx \frac{1}{m}\sum_{j=1}^{m} \ell(x, d_j) =: \hat{\mathcal{L}}(x), \qquad m \ll M. \tag{1}$$
If the indices $j$ are i.i.d. draws from $[1, M]$, by the Central Limit Theorem, the error $\hat{\mathcal{L}}(x) - \mathcal{L}(x)$
is unbiased and approximately normally distributed. Despite its popularity and its low cost per step, sgd has well-known deficiencies that can make it inefficient, or at least tedious to use in practice. Two main issues are that, first, the gradient itself, even without noise, is not the optimal search direction; and second, sgd requires a step size (learning rate) that has a drastic effect on the algorithm’s efficiency, is often difficult to choose well, and is virtually never optimal for each individual descent step. The former issue, adapting the search direction, has been addressed by many authors (see George and Powell, 2006, for an overview). Existing approaches range from lightweight ‘diagonal preconditioning’ approaches like Adam (Kingma and Ba, 2014), AdaGrad (Duchi et al., 2011), and ‘stochastic meta-descent’ (Schraudolph, 1999), to empirical estimates for the natural gradient (Amari et al., 2000) or the Newton direction (Roux and Fitzgibbon, 2010), to problem-specific algorithms (Rajesh et al., 2013), and more elaborate estimates of the Newton direction (Hennig, 2013). Most of these algorithms also include an auxiliary adaptive effect on the learning rate. Schaul et al. (2013) provided an estimation method to explicitly adapt the learning rate from one gradient descent step to another. Several very recent works have proposed the use of reinforcement learning and ‘learning-to-learn’ approaches for parameter adaptation (Andrychowicz et al., 2016; Hansen, 2016; Li and Malik, 2016). Mostly these methods are designed to work well on a specified subset of optimization problems, which they are also trained on; they thus need to be re-learned for differing objectives. The corresponding algorithms are usually orders of magnitude more expensive than the low-level black box proposed here, and often require a classic optimizer (e.g. sgd) to tune their internal hyper-parameters.
None of the mentioned algorithms change the size of the current descent step. Accumulating statistics across steps in this fashion requires some conservatism: If the step size is initially too large, or grows too fast, sgd can become unstable and ‘explode’, because individual steps are not checked for robustness at the time they are taken.
In essence, the same problem exists in deterministic (noise-free) optimization problems. There, providing stability is one of several tasks of the line search subroutine. It is a standard constituent of algorithms like the classic nonlinear conjugate gradient (Fletcher and Reeves, 1964) and BFGS (Broyden, 1969; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) methods (Nocedal and Wright, 1999, §3). [Footnote 1: In these algorithms, another task of the line search is to guarantee certain properties of the surrounding estimation rule. In BFGS, e.g., it ensures positive definiteness of the estimate. This aspect will not feature here.] In the noise-free case, line searches are considered a solved problem (Nocedal and Wright, 1999, §3). But the methods used in deterministic optimization are not stable to noise. They are easily fooled by even small disturbances, either becoming overly conservative or failing altogether. The reason for this brittleness is that existing line searches take a sequence of hard decisions to shrink or shift the search space. This yields efficiency, but breaks hard in the presence of noise. Section 3 constructs a probabilistic line search for noisy objectives, stabilizing optimization methods like the works cited above. As line searches only change the length, not the direction of a step, they could be used in combination with the algorithms adapting sgd’s direction, cited above. In this paper we focus on parameter tuning of the sgd algorithm and leave other search directions to future work.
2.1 Deterministic Line Searches
There is a host of existing line search variants (Nocedal and Wright, 1999, §3). In essence, though, these methods explore a univariate domain ‘to the right’ of a starting point, until an ‘acceptable’ point is reached (Figure 1). More precisely, consider the problem of minimizing $\mathcal{L}(x): \mathbb{R}^D \to \mathbb{R}$, with access to $\nabla\mathcal{L}(x): \mathbb{R}^D \to \mathbb{R}^D$. At iteration $i$, some ‘outer loop’ chooses, at location $x_i$, a search direction $s_i \in \mathbb{R}^D$ (e.g. by the BFGS rule, or simply $s_i = -\nabla\mathcal{L}(x_i)$ for gradient descent). It will not be assumed that $s_i$ has unit norm. The line search operates along the univariate domain $x(t) = x_i + t s_i$ for $t \in \mathbb{R}_+$. Along this direction it collects scalar function values and projected gradients that will be denoted $f(t) = \mathcal{L}(x(t))$ and $f'(t) = s_i^\top \nabla\mathcal{L}(x(t)) \in \mathbb{R}$. Most line searches involve an initial extrapolation phase to find a point $t_r$ with $f'(t_r) > 0$. This is followed by a search in $[0, t_r]$, by interval nesting or by interpolation of the collected function and gradient values, e.g. with cubic splines. [Footnote 2: This is the strategy in minimize.m by C. Rasmussen, which provided a model for our implementation. At the time of writing, it can be found at http://learning.eng.cam.ac.uk/carl/code/minimize/minimize.m]
2.1.1 The Wolfe Conditions for Termination
As the line search is only an auxiliary step within a larger iteration, it need not find an exact root of $f'$; it suffices to find a point ‘sufficiently’ close to a minimum. The Wolfe conditions (Wolfe, 1969) are a widely accepted formalization of this notion; they consider $t$ acceptable if it fulfills
$$f(t) \le f(0) + c_1 t f'(0) \quad \text{(W-I)} \qquad \text{and} \qquad f'(t) \ge c_2 f'(0) \quad \text{(W-II)},$$
using two constants $0 \le c_1 < c_2 \le 1$ chosen by the designer of the line search, not the user. W-I is the Armijo or sufficient decrease condition (Armijo, 1966). It encodes that acceptable function values should lie below a linear extrapolation line of slope $c_1 f'(0)$. W-II is the curvature condition, demanding a decrease in slope. The choice $c_1 = 0$ accepts any value below $f(0)$, while $c_1 = 1$ rejects all points for convex functions. For the curvature condition, $c_2 = 0$ only accepts points with $f'(t) \ge 0$; while $c_2 = 1$ accepts any point of greater slope than $f'(0)$. W-I and W-II are known as the weak form of the Wolfe conditions. The strong form replaces W-II with $|f'(t)| \le c_2 |f'(0)|$. This guards against accepting points of low function value but large positive gradient. Figure 1 shows a conceptual sketch illustrating the typical process of a line search, and the weak and strong Wolfe conditions. The exposition in §3.3 will initially focus on the weak conditions, which can be precisely modeled probabilistically. Section 3.3.1 then adds an approximate treatment of the strong form.
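For noise-free evaluations, the weak and strong Wolfe conditions described above reduce to a few comparisons. The following is a minimal sketch; the function name `wolfe_conditions` and the default constants `c1=1e-4`, `c2=0.9` are illustrative assumptions (common textbook values), not the settings used in this paper.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions at step size t > 0.

    f0, df0: function value and projected gradient at t = 0
    ft, dft: function value and projected gradient at t
    c1, c2:  thresholds with 0 <= c1 < c2 <= 1 (defaults are assumptions)
    Returns (weak_ok, strong_ok).
    """
    armijo = ft <= f0 + c1 * t * df0              # W-I: sufficient decrease
    curvature_weak = dft >= c2 * df0              # W-II: curvature condition
    curvature_strong = abs(dft) <= c2 * abs(df0)  # strong replacement for W-II
    return (armijo and curvature_weak, armijo and curvature_strong)


# Toy objective along the search direction: f(t) = (t - 1)^2
f = lambda t: (t - 1.0) ** 2
df = lambda t: 2.0 * (t - 1.0)

print(wolfe_conditions(f(0.0), df(0.0), f(1.0), df(1.0), 1.0))      # (True, True)
print(wolfe_conditions(f(0.0), df(0.0), f(1.95), df(1.95), 1.95))   # (True, False)
```

The second call illustrates the point made in the text: a step with low function value but large positive slope passes the weak form and fails the strong form.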
2.2 Bayesian Optimization
A recently blossoming sample-efficient approach to global optimization revolves around modeling the objective with a probability measure ; usually a Gaussian process (gp). Searching for extrema, evaluation points are then chosen by a utility functional . Our line search borrows the idea of a Gaussian process surrogate, and a popular acquisition function, expected improvement (Jones et al., 1998). Bayesian optimization (bo) methods are often computationally expensive, thus ill-suited for a cost-sensitive task like a line search. But since line searches are governors more than information extractors, the kind of sample-efficiency expected of a Bayesian optimizer is not needed. The following sections develop a lightweight algorithm which adds only minor computational overhead to stochastic optimization.
3 A Probabilistic Line Search
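We now consider minimizing $y(t) = \hat{\mathcal{L}}(x(t))$ from Eq. 1. That is, the algorithm can access only noisy function values and gradients $y_t, y'_t$ at location $t$, with Gaussian likelihood
$$p(y_t, y'_t \mid f) = \mathcal{N}\left( \begin{bmatrix} y_t \\ y'_t \end{bmatrix}; \begin{bmatrix} f(t) \\ f'(t) \end{bmatrix}, \begin{bmatrix} \sigma_f^2 & 0 \\ 0 & \sigma_{f'}^2 \end{bmatrix} \right). \tag{3}$$
See §3.4.3 and Appendix A regarding estimation of the variances $\sigma_f^2, \sigma_{f'}^2$, and some further notes on the independence assumption of $y$ and $y'$. Each evaluation of $(y, y')$ uses a newly drawn mini-batch.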
Our algorithm is modeled after the classic line search routine minimize.m (see footnote 2) and translates each of its ingredients one-by-one to the language of probability. The following table illustrates the four ingredients of the probabilistic line search and their corresponding classic parts.
|building block||classic||probabilistic|
|1) 1D surrogate for objective||piecewise cubic splines||gp whose mean is a piecewise cubic spline|
|2) candidate selection||one local minimizer of cubic splines xor extrapolation||local minimizers of cubic splines and extrapolation|
|3) choice of best candidate||———||bo acquisition function|
|4) acceptance criterion||classic Wolfe conditions||probabilistic Wolfe conditions|
The table already motivates certain design choices, for example the particular choice of the gp-surrogate for $f(t)$, which strongly resembles the classic design. Probabilistic line searches operate in the same scheme as classic ones: 1) they construct a surrogate for the underlying 1D-function; 2) they select candidates for evaluation, which can interpolate between datapoints or extrapolate; 3) a heuristic chooses among the candidate locations and the function is evaluated there; 4) the evaluated points are checked for Wolfe-acceptance. The following sections introduce all of these building blocks in greater detail: a robust yet lightweight Gaussian process surrogate on $f(t)$ facilitating analytic optimization (§3.1); a simple Bayesian optimization objective for exploration (§3.2); and a probabilistic formulation of the Wolfe conditions as a termination criterion (§3.3). Appendix D contains detailed pseudocode of the probabilistic line search; Algorithm 1 very roughly sketches the structure of the probabilistic line search and highlights its essential ingredients.
3.1 Lightweight Gaussian Process Surrogate
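We model information about the objective in a probability measure $p(f(t))$. There are two requirements on such a measure: First, it must be robust to irregularity (low and high variability) of the objective. And second, it must allow analytic computation of discrete candidate points for evaluation, because a line search should not call yet another optimization subroutine itself. Both requirements are fulfilled by a once-integrated Wiener process, i.e. a zero-mean Gaussian process prior $p(f) = \mathcal{GP}(f; 0, k)$ with covariance function
$$k(t, t') = \theta^2 \left[ \tfrac{1}{3} \min{}^3(\tilde t, \tilde t') + \tfrac{1}{2} |t - t'| \min{}^2(\tilde t, \tilde t') \right]. \tag{4}$$
Here $\tilde t := t + \tau$ and $\tilde t' := t' + \tau$ denote a shift by a constant $\tau > 0$. This ensures this kernel is positive semi-definite; the precise value of $\tau$ is irrelevant as the algorithm only considers positive values of $t$ (our implementation uses a fixed value). See §3.4 regarding the scale $\theta^2$. With the likelihood of Eq. 3, this prior gives rise to a gp posterior whose mean function is a cubic spline (Wahba, 1990). [Footnote 3: Eq. 4 can be generalized to the ‘natural spline’, removing the need for the constant $\tau$ (Rasmussen and Williams, 2006, §6.3.1). However, this notion is ill-defined in the case of a single observation, which is crucial for the line search.] We note in passing that regression on $f$ and $f'$ from $N$ observations of pairs $(y_t, y'_t)$ can be formulated as a filter (Särkkä, 2013) and thus performed in $\mathcal{O}(N)$ time. However, since a line search typically collects only a few data points, generic gp inference, using a Gram matrix, has virtually the same, low cost.
Because Gaussian measures are closed under linear maps, Eq. 4 implies a joint gp model on $f$ and its derivative $f'$,
$$p(f; f') = \mathcal{GP}\left( \begin{bmatrix} f \\ f' \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k & k^{\partial} \\ {}^{\partial}k & {}^{\partial}k^{\partial} \end{bmatrix} \right), \tag{5}$$
with (using the indicator function $\mathbb{I}(x) = 1$ if $x$, else 0)
$$k^{\partial}(t,t') = \theta^2 \left[ \mathbb{I}(t < t') \frac{\tilde t^2}{2} + \mathbb{I}(t \ge t') \left( \tilde t \tilde t' - \frac{\tilde t'^2}{2} \right) \right], \quad {}^{\partial}k(t,t') = \theta^2 \left[ \mathbb{I}(t' < t) \frac{\tilde t'^2}{2} + \mathbb{I}(t' \ge t) \left( \tilde t \tilde t' - \frac{\tilde t^2}{2} \right) \right], \quad {}^{\partial}k^{\partial}(t,t') = \theta^2 \min(\tilde t, \tilde t').$$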
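Given a set of evaluations $(\boldsymbol{t}, \boldsymbol{y}, \boldsymbol{y}')$ (vectors, with $N$ elements) with independent likelihood 3, the posterior $p(f \mid \boldsymbol{y}, \boldsymbol{y}')$ is a gp with posterior mean function $\mu$ and covariance function $\tilde k$ as follows:
$$\mu(t) = g^\top(t)\, G^{-1} \begin{bmatrix} \boldsymbol{y} \\ \boldsymbol{y}' \end{bmatrix}, \qquad \tilde k(t, t') = k_{tt'} - g^\top(t)\, G^{-1} g(t'),$$
$$\text{where} \quad g^\top(t) := \begin{bmatrix} k_{t\boldsymbol{t}} & k^{\partial}_{t\boldsymbol{t}} \end{bmatrix}, \qquad G := \begin{bmatrix} k_{\boldsymbol{tt}} + \sigma_f^2 I & k^{\partial}_{\boldsymbol{tt}} \\ {}^{\partial}k_{\boldsymbol{tt}} & {}^{\partial}k^{\partial}_{\boldsymbol{tt}} + \sigma_{f'}^2 I \end{bmatrix}.$$
The posterior marginal variance will be denoted by $\mathbb{V}(t) = \tilde k(t, t)$. To see that $\mu$ is indeed piecewise cubic (i.e. a cubic spline), we note that it has at most three non-vanishing derivatives, because
$$\frac{\partial^2 k(t,t')}{\partial t^2} = \theta^2\, \mathbb{I}(t \le t')(t' - t), \quad \frac{\partial^2 k^{\partial}(t,t')}{\partial t^2} = \theta^2\, \mathbb{I}(t \le t'), \quad \frac{\partial^3 k(t,t')}{\partial t^3} = -\theta^2\, \mathbb{I}(t \le t'), \quad \frac{\partial^3 k^{\partial}(t,t')}{\partial t^3} = 0.$$
[Footnote 4: There is no well-defined probabilistic belief over $f''$ and higher derivatives: sample paths of the Wiener process are almost surely non-differentiable almost everywhere (Adler, 1981, §2.2). But $\mu(t)$ is always a member of the reproducing kernel Hilbert space induced by $k$, thus piecewise cubic (Rasmussen and Williams, 2006, §6.1).]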
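To make the surrogate concrete, the sketch below computes the posterior mean, its derivative, and the marginal variance from noisy values and gradients, using the standard covariance of a once-integrated Wiener process shifted by a constant and a dense Gram-matrix solve. It is an illustration, not the authors' implementation: the function names, the shift `TAU = 10.0`, and the default `theta = 1.0` are assumptions made for this example.

```python
import numpy as np

TAU = 10.0  # assumed shift constant; any positive value works for t >= 0


def _grid(a, b):
    """Shifted inputs as broadcastable column/row arrays."""
    a = np.asarray(a, dtype=float)[:, None] + TAU
    b = np.asarray(b, dtype=float)[None, :] + TAU
    return a, b


def k(a, b, theta=1.0):
    """cov(f(a), f(b)) for the once-integrated Wiener process."""
    a, b = _grid(a, b)
    m = np.minimum(a, b)
    return theta**2 * (m**3 / 3.0 + 0.5 * np.abs(a - b) * m**2)


def kd(a, b, theta=1.0):
    """cov(f(a), f'(b)): derivative of k in its second argument."""
    a, b = _grid(a, b)
    return theta**2 * np.where(a < b, 0.5 * a**2, a * b - 0.5 * b**2)


def dk(a, b, theta=1.0):
    """cov(f'(a), f(b)): derivative of k in its first argument."""
    a, b = _grid(a, b)
    return theta**2 * np.where(b < a, 0.5 * b**2, a * b - 0.5 * a**2)


def dkd(a, b, theta=1.0):
    """cov(f'(a), f'(b)): a plain Wiener process on the derivative."""
    a, b = _grid(a, b)
    return theta**2 * np.minimum(a, b)


def posterior(t_query, T, y, dy, sigma_f=0.0, sigma_df=0.0, theta=1.0):
    """Posterior mean of f, mean of f', and variance of f at t_query."""
    T = np.asarray(T, dtype=float)
    tq = np.atleast_1d(np.asarray(t_query, dtype=float))
    n = T.size
    # Gram matrix of the 2N observations (N values, N gradients).
    G = np.block([
        [k(T, T, theta) + sigma_f**2 * np.eye(n), kd(T, T, theta)],
        [dk(T, T, theta), dkd(T, T, theta) + sigma_df**2 * np.eye(n)],
    ])
    w = np.linalg.solve(G, np.concatenate([y, dy]))
    g = np.hstack([k(tq, T, theta), kd(tq, T, theta)])     # cov(f(tq), data)
    dg = np.hstack([dk(tq, T, theta), dkd(tq, T, theta)])  # cov(f'(tq), data)
    mean = g @ w
    dmean = dg @ w
    var = np.diag(k(tq, tq, theta)) - np.sum(g * np.linalg.solve(G, g.T).T, axis=1)
    return mean, dmean, var
```

With zero noise and two observations, the posterior mean on the enclosed cell coincides with the cubic Hermite interpolant of the data, which is the behaviour the text describes for the noise-free limit.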
This piecewise cubic form of $\mu$ is crucial for our purposes: having collected $N$ values of $f$ and $f'$, respectively, all local minima of $\mu$ can be found analytically in $\mathcal{O}(N)$ time in a single sweep through the ‘cells’ $t_{i-1} < t < t_i$, $i = 1, \dots, N$ (here $t_0 = 0$ denotes the start location, where $(y_0, y'_0)$ are ‘inherited’ from the preceding line search; for typical line searches $N$ is small, c.f. §4). In each cell, $\mu(t)$ is a cubic polynomial with at most one minimum in the cell, found by an inexpensive quadratic computation from the three scalars $\mu'(t_i), \mu''(t_i), \mu'''(t_i)$. This is in contrast to other gp regression models, for example the one arising from a squared exponential kernel, which give more involved posterior means whose local minima can be found only approximately. Another advantage of the cubic spline interpolant is that it does not assume the existence of higher derivatives (in contrast to the Gaussian kernel, for example), and thus reacts robustly to irregularities in the objective. In our algorithm, after each evaluation of $(y_N, y'_N)$, we use this property to compute a short list of candidates for the next evaluation, consisting of the local minimizers of $\mu(t)$ and one additional extrapolation node at $t_{\max} + \alpha$, where $t_{\max}$ is the currently largest evaluated $t$, and $\alpha$ is an extrapolation step size starting at a fixed initial value and doubled after each extrapolation step.
[Footnote 5: For the integrated Wiener process and heteroscedastic noise, the variance always attains its maximum exactly at the mid-point between two evaluations; including the variance into the candidate selection biases the existing candidates towards the center (additional candidates might occur between evaluations without local minimizer, even for noise-free observations/classic line searches). We did not explore this further since the algorithm showed very good sample efficiency already with the adopted scheme.]
Another motivation for using the integrated Wiener process as surrogate for the objective, as well as for the described candidate selection, are classic line searches. There, the 1D-objective is modeled by piecewise cubic interpolations between neighboring datapoints. In a sense, this is a non-parametric approach, since a new spline is defined, when a datapoint is added. Classic line searches always only deal with one spline at a time, since they are able to collapse all other parts of the search space. For noise free observations, the mean of the posterior gp is identical to the classic cubic interpolations, and thus candidate locations are identical as well; this is illustrated in Figure 3. The non-parametric approach also prevents issues of over-constrained surrogates for more than two datapoints. For example, unless the objective is a perfect cubic function, it is impossible to fit a parametric third order polynomial to it, for more than two noise free observations. All other variability in the objective would need to be explained away by artificially introducing noise on the observations. An integrated Wiener process very naturally extends its complexity with each newly added datapoint without being overly assertive – the encoded assumption is, that the objective has at least one derivative (which is also observed in this case).
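The per-cell ‘inexpensive quadratic computation’ mentioned in this section can be sketched as follows. Within a cell the posterior mean is a cubic, so its derivative is a quadratic in the offset from the cell boundary, and the local minimum is the root with positive curvature. The function name `cubic_cell_minimizer` and its argument names are hypothetical; the inputs are the first three derivatives of the mean at the left end of the cell.

```python
import math


def cubic_cell_minimizer(d1, d2, d3, width):
    """Offset delta in (0, width) of the local minimum of a cubic, or None.

    d1, d2, d3: first, second and third derivative of the cubic at the
                left cell boundary (d3 is constant within the cell)
    width:      length of the cell
    The derivative along the cell is d1 + d2*delta + 0.5*d3*delta**2.
    """
    if abs(d3) < 1e-12:
        # Degenerates to a quadratic: a minimum exists only if it is convex.
        if d2 <= 0.0:
            return None
        delta = -d1 / d2
    else:
        disc = d2 * d2 - 2.0 * d3 * d1
        if disc <= 0.0:
            return None  # no stationary point, or only an inflection point
        # Choose the root where the curvature d2 + d3*delta = +sqrt(disc) > 0.
        delta = (-d2 + math.sqrt(disc)) / d3
    return delta if 0.0 < delta < width else None


# Cubic t^3 - 3t seen from t = 0: derivatives (-3, 0, 6), minimum at t = 1.
print(cubic_cell_minimizer(-3.0, 0.0, 6.0, width=2.0))  # 1.0
print(cubic_cell_minimizer(-3.0, 0.0, 6.0, width=0.5))  # None
```

Sweeping this routine once over all cells gives the interpolation candidates; the extrapolation candidate is appended separately.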
3.2 Choosing Among Candidates
The previous section described the construction of discrete candidate points for the next evaluation. To decide at which of the candidate points to actually call $\hat{\mathcal{L}}$ and $\nabla\hat{\mathcal{L}}$, we make use of a popular acquisition function from Bayesian optimization. Expected improvement (Jones et al., 1998) is the expected amount, under the gp surrogate, by which the function $f(t)$ might be smaller than a ‘current best’ value $\eta$ (we set $\eta = \min_{i=0,\dots,N} \{\mu(t_i)\}$, where $t_i$ are observed locations),
$$u_{\mathrm{EI}}(t) = \mathbb{E}_{p(f_t \mid \boldsymbol{y}, \boldsymbol{y}')}\left[\max\{0, \eta - f(t)\}\right] = \frac{\eta - \mu(t)}{2}\left(1 + \operatorname{erf}\frac{\eta - \mu(t)}{\sqrt{2\mathbb{V}(t)}}\right) + \sqrt{\frac{\mathbb{V}(t)}{2\pi}} \exp\left(-\frac{(\eta - \mu(t))^2}{2\mathbb{V}(t)}\right). \tag{9}$$
The next evaluation point is chosen as the candidate maximizing the product of Eq. 9 and the Wolfe probability $p^{\mathrm{Wolfe}}$, which is derived in the following section. The intuition is that $p^{\mathrm{Wolfe}}$ precisely encodes properties of desired points, but has poor exploration properties; $u_{\mathrm{EI}}$ has better exploration properties, but lacks the information that we are seeking a point with low curvature; $u_{\mathrm{EI}}$ thus puts weight on points that are clearly ruled out by W-II. An illustration of the candidate proposal and selection is shown in Figure 4.
In principle other acquisition functions (e.g. the upper-confidence bound, gp-ucb (Srinivas et al., 2010)) are possible, which might have a stronger explorative behavior; we opted for $u_{\mathrm{EI}}$ since exploration is less crucial for line searches than for general bo, and some alternatives (e.g. gp-ucb) have one additional parameter to tune. We tracked the sample efficiency of $u_{\mathrm{EI}} \cdot p^{\mathrm{Wolfe}}$ instead and it was very good (low); the experimental Subsection 4.3 contains further comments and experiments on the alternative choices of $u_{\mathrm{EI}}$ and $p^{\mathrm{Wolfe}}$ as standalone acquisition functions; they performed equally well (in terms of loss and sample efficiency) to their product.
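The selection rule of this section can be sketched in a few lines. The code uses the standard closed form of expected improvement for minimization under a Gaussian belief, and picks the candidate that maximizes its product with a Wolfe probability supplied by the caller. The names `expected_improvement` and `choose_candidate`, and the tuple layout of the candidates, are assumptions made for this illustration.

```python
import math


def expected_improvement(mu, var, eta):
    """Expected amount by which f ~ N(mu, var) falls below the value eta."""
    if var <= 0.0:
        return max(0.0, eta - mu)  # no uncertainty left at this location
    s = math.sqrt(var)
    z = (eta - mu) / s
    cdf_term = 0.5 * (eta - mu) * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf_term = s / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * z * z)
    return cdf_term + pdf_term


def choose_candidate(candidates, eta):
    """Pick the location maximizing expected improvement times p_wolfe.

    candidates: iterable of (t, mu, var, p_wolfe) tuples, where mu and var
                are the posterior mean and variance at t and p_wolfe is the
                probability that t satisfies the Wolfe conditions.
    """
    best = max(candidates,
               key=lambda c: expected_improvement(c[1], c[2], eta) * c[3])
    return best[0]


# Two candidates with the same belief; the Wolfe probability breaks the tie.
cands = [(0.5, 0.0, 1.0, 0.1), (1.0, 0.0, 1.0, 0.9)]
print(choose_candidate(cands, eta=0.0))  # 1.0
```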
3.3 Probabilistic Wolfe Conditions for Termination
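The key observation for a probabilistic extension of the Wolfe conditions W-I and W-II is that they are positivity constraints on two variables $a_t, b_t$ that are both linear projections of the (jointly Gaussian) variables $f$ and $f'$:
$$\begin{bmatrix} a_t \\ b_t \end{bmatrix} = \begin{bmatrix} 1 & c_1 t & -1 & 0 \\ 0 & -c_2 & 0 & 1 \end{bmatrix} \begin{bmatrix} f(0) \\ f'(0) \\ f(t) \\ f'(t) \end{bmatrix} \ge 0.$$
The gp of Eq. (5) on $f$ thus implies, at each value of $t$, a bivariate Gaussian distribution
$$p(a_t, b_t) = \mathcal{N}\left( \begin{bmatrix} a_t \\ b_t \end{bmatrix}; \begin{bmatrix} m^a_t \\ m^b_t \end{bmatrix}, \begin{bmatrix} C^{aa}_t & C^{ab}_t \\ C^{ab}_t & C^{bb}_t \end{bmatrix} \right),$$
with means $m^a_t = \mu(0) - \mu(t) + c_1 t \mu'(0)$ and $m^b_t = \mu'(t) - c_2 \mu'(0)$, and with covariances obtained by applying the same linear map to the posterior covariance of $[f(0), f'(0), f(t), f'(t)]$.
The quadrant probability $p^{\mathrm{Wolfe}}_t = p(a_t > 0 \wedge b_t > 0)$ for the Wolfe conditions to hold is an integral over a bivariate normal probability,
$$p^{\mathrm{Wolfe}}_t = \int_{-m^a_t / \sqrt{C^{aa}_t}}^{\infty} \int_{-m^b_t / \sqrt{C^{bb}_t}}^{\infty} \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_t \\ \rho_t & 1 \end{bmatrix} \right) \mathrm{d}a\, \mathrm{d}b, \tag{14}$$
with correlation coefficient $\rho_t = C^{ab}_t / \sqrt{C^{aa}_t C^{bb}_t}$. It can be computed efficiently (Drezner and Wesolowsky, 1990), using readily available code. [Footnote 6: e.g. http://www.math.wsu.edu/faculty/genz/software/matlab/bvn.m] The line search computes this probability for all evaluation nodes, after each evaluation. If any of the nodes fulfills the Wolfe conditions with $p^{\mathrm{Wolfe}}_t$ greater than some threshold $0 < c_W \le 1$, it is accepted and returned. If several nodes simultaneously fulfill this requirement, the most recently evaluated node is returned; there are additional safeguards for cases where e.g. no Wolfe-point can be found, which can be deduced from the pseudo-code in Appendix D; they are similar to standard safeguards of classic line search routines (e.g. returning the node of lowest mean). Section 3.4.1 below motivates the fixed value chosen for $c_W$. The acceptance procedure is illustrated in Figure 5.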
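As a rough illustration of the quadrant probability described above, the sketch below evaluates it by one-dimensional numerical integration instead of the specialised routine cited in the text: conditioning on the first standardized variable leaves a univariate normal tail for the second. The function name `wolfe_probability`, its argument names (means and covariances of the two Wolfe variables), the truncation at 8 standard deviations, and the Simpson rule with 2000 intervals are all choices made for this example.

```python
import math


def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wolfe_probability(m_a, m_b, c_aa, c_bb, c_ab, n_intervals=2000):
    """P(a > 0 and b > 0) for jointly Gaussian (a, b).

    m_a, m_b:         means of the two Wolfe variables
    c_aa, c_bb, c_ab: their variances and covariance
    n_intervals:      even number of Simpson intervals (assumed setting)
    """
    lo_a = -m_a / math.sqrt(c_aa)  # standardized lower limits
    lo_b = -m_b / math.sqrt(c_bb)
    rho = c_ab / math.sqrt(c_aa * c_bb)
    rho = max(-0.999999, min(0.999999, rho))  # guard the degenerate case
    s = math.sqrt(1.0 - rho * rho)

    lower, upper = max(lo_a, -8.0), 8.0  # the mass beyond 8 sigma is negligible
    if lower >= upper:
        return 0.0

    def integrand(z):
        # density of the first variable times P(second > lo_b | first = z)
        return _pdf(z) * _cdf((rho * z - lo_b) / s)

    h = (upper - lower) / n_intervals
    total = integrand(lower) + integrand(upper)
    for i in range(1, n_intervals):
        total += (4.0 if i % 2 else 2.0) * integrand(lower + i * h)
    return total * h / 3.0


# Zero means and correlation 0.5: the exact value is 1/4 + arcsin(0.5)/(2*pi).
p = wolfe_probability(0.0, 0.0, 1.0, 1.0, 0.5)
print(abs(p - 1.0 / 3.0) < 1e-6)  # True
```

A node would then be accepted when the returned probability exceeds the acceptance threshold, which the caller supplies.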
3.3.1 Approximation for Strong Conditions:
As noted in Section 2.1.1, deterministic optimizers tend to use the strong Wolfe conditions, which use $|f'(0)|$ and $|f'(t)|$. A precise extension of these conditions to the probabilistic setting is numerically taxing, because the distribution over $|f'|$ is a non-central $\chi$-distribution, requiring customized computations. However, a straightforward variation of Eq. 14 captures the spirit of the strong Wolfe conditions, that large positive derivatives should not be accepted: Assuming $f'(0) < 0$ (i.e. that the search direction is a descent direction), the strong second Wolfe condition can be written exactly as
$$0 \le b_t = f'(t) - c_2 f'(0) \le -2 c_2 f'(0).$$
The value is bounded to confidence by
Hence, an approximation to the strong Wolfe conditions can be reached by replacing the infinite upper integration limit on in Eq. 14 with . The effect of this adaptation, which adds no overhead to the computation, is shown in Figure 2 as a dashed line.
3.4 Eliminating Hyper-parameters
As a black-box inner loop, the line search should not require any tuning by the user. The preceding section introduced six so-far undefined parameters: $c_1, c_2, c_W, \theta, \sigma_f, \sigma_{f'}$. We will now show that $c_1, c_2, c_W$ can be fixed by hard design decisions; $\theta$ can be eliminated by standardizing the optimization objective within the line search; and the noise levels can be estimated at runtime with low overhead for finite-sum objectives of the form in Eq. 1. The result is a parameter-free algorithm that effectively removes the one most problematic parameter from sgd: the learning rate.
3.4.1 Design Parameters
Our algorithm inherits the Wolfe thresholds and from its deterministic ancestors. We set and . This is a standard setting that yields a ‘lenient’ line search, i.e. one that accepts most descent points. The rationale is that the stochastic aspect of sgd is not always problematic, but can also be helpful through a kind of ‘annealing’ effect.
The acceptance threshold is a new design parameter arising only in the probabilistic setting. We fix it to . To motivate this value, first note that in the noise-free limit, all values are equivalent, because then switches discretely between 0 and 1 upon observation of the function. A back-of-the-envelope computation, assuming only two evaluations at and and the same fixed noise level on and (which then cancels out), shows that function values barely fulfilling the conditions, i.e. , can have while function values at for with ‘unlucky’ evaluations (both function and gradient values one standard-deviation from true value) can achieve . The choice
balances the two competing desiderata for precision and recall. Empirically (Fig.6), we rarely observed values of close to this threshold. Even at high evaluation noise, a function evaluation typically either clearly rules out the Wolfe conditions, or lifts well above the threshold. A more in-depth analysis of , , and is done in the experimental Section 4.2.1.
3.4.2 Scale
The parameter $\theta$ of Eq. 4 simply scales the prior variance. It can be eliminated by scaling the optimization objective: We set $\theta = 1$ and scale $y_i \leftarrow (y_i - y_0)/|y'_0|$, $y'_i \leftarrow y'_i / |y'_0|$ within the code of the line search. This gives $y(0) = 0$ and $y'(0) = -1$, and typically ensures the objective ranges in the single digits across the range of $t$ where most line searches take place. The division by $|y'_0|$ causes a non-Gaussian disturbance, but this does not seem to have notable empirical effect.
3.4.3 Noise Scales
The likelihood 3 requires standard deviations for the noise on both function values ($\sigma_f$) and gradients ($\sigma_{f'}$). One could attempt to learn these across several line searches. However, in exchangeable models, as captured by Eq. 1, the variance of the loss and its gradient can be estimated directly for the mini-batch, at low computational overhead, an approach already advocated by Schaul et al. (2013). We collect the empirical statistics
$$\hat{S}(x) := \frac{1}{m}\sum_{j=1}^{m} \ell^2(x, d_j), \qquad \nabla\hat{S}(x) := \frac{1}{m}\sum_{j=1}^{m} \nabla\ell(x, d_j)^{.2} \tag{17}$$
(where $^{.2}$ denotes the element-wise square) and estimate, at the beginning of a line search from $x_i$,
$$\sigma_f^2 = \frac{1}{m-1}\left(\hat{S}(x_i) - \hat{\mathcal{L}}(x_i)^2\right), \qquad \sigma_{f'}^2 = \left(s_i^{.2}\right)^\top \left[\frac{1}{m-1}\left(\nabla\hat{S}(x_i) - \left(\nabla\hat{\mathcal{L}}(x_i)\right)^{.2}\right)\right]. \tag{18}$$
This amounts to the assumption that noise on the gradient is independent. We finally scale the two empirical estimates as described in Section 3.4.2: $\sigma_f \leftarrow \sigma_f / |y'(0)|$, and ditto for $\sigma_{f'}$. The overhead of this estimation is small if the computation of $\ell(x, d_j)$ itself is more expensive than the summation over $j$. In the neural network examples N-I and N-II of the experimental Section 4, the additional steps added only a small cost overhead to the evaluation of the loss. A more general statement about memory and time requirements can be found in Sections 3.6 and 3.7. Of course, this approach requires a mini-batch size $m > 1$. For single-sample mini-batches, a running average could be used instead. (Single-sample mini-batches are not necessarily a good choice: in our experiments, for example, vanilla sgd with batch size 10 converged faster in wall-clock time than unit-batch sgd.) Estimating noise separately for each input dimension captures the often inhomogeneous structure among gradient elements, and its effect on the noise along the projected direction. For example, in deep models, gradient noise is typically higher on weights between the input and first hidden layer, hence line searches along the corresponding directions are noisier than those along directions affecting higher-level weights. A detailed description of the noise estimator can be found in Appendix A.
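The within-batch noise estimate described above can be sketched as follows, assuming per-example losses and per-example gradients are available as arrays. The function name `noise_estimates` and the array layout are assumptions for this illustration; the estimator is the usual unbiased sample variance of the mini-batch mean, with gradient elements treated as independent when projecting onto the search direction.

```python
import numpy as np


def noise_estimates(losses, grads, direction):
    """Variance of the mini-batch loss and of the projected gradient.

    losses:    shape (m,), per-example losses
    grads:     shape (m, D), per-example gradients
    direction: shape (D,), search direction
    Returns (var_f, var_df).
    """
    losses = np.asarray(losses, dtype=float)
    grads = np.asarray(grads, dtype=float)
    direction = np.asarray(direction, dtype=float)
    m = losses.shape[0]
    if m < 2:
        raise ValueError("a variance estimate needs a mini-batch size m > 1")

    loss_mean = losses.mean()                 # mini-batch loss
    grad_mean = grads.mean(axis=0)            # mini-batch gradient
    loss_sq_mean = (losses**2).mean()         # mean of squared losses
    grad_sq_mean = (grads**2).mean(axis=0)    # element-wise squared gradients

    var_f = (loss_sq_mean - loss_mean**2) / (m - 1)
    var_grad = (grad_sq_mean - grad_mean**2) / (m - 1)  # per gradient element
    var_df = float((direction**2) @ var_grad)           # independent elements
    return float(var_f), var_df


rng = np.random.default_rng(0)
losses = rng.normal(size=32)
grads = rng.normal(size=(32, 5))
s = -grads.mean(axis=0)                       # plain gradient-descent direction
var_f, var_df = noise_estimates(losses, grads, s)
print(np.isclose(var_f, losses.var(ddof=1) / 32))  # True
```

Only two extra running sums per evaluation are needed, which is why the overhead stays small relative to computing the per-example losses themselves.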
3.4.4 Propagating Step Sizes Between Line Searches
As will be demonstrated in §4, the line search can find good step sizes even if the length of the direction is mis-scaled. Since such scale issues typically persist over time, it would be wasteful to have the algorithm re-fit a good scale in each line search. Instead, we propagate step lengths from one iteration of the search to another: We set the initial search direction to with some initial learning rate . Then, after each line search ending at , the next search direction is set to (with ). Thus, the next line search starts its extrapolation at times the step size of its predecessor (Section 4.2.2 for details).
3.5 Relation to Bayesian Optimization and Noise-Free Limit
The probabilistic line search algorithm is closely related to Bayesian optimization (bo) since it approximately minimizes a 1D-objective under potentially noisy function evaluations. It thus uses notions of bo (e.g. a gp-surrogate for the objective, and an acquisition function to discriminate locations for the next evaluation of the loss), but there are some differences concerning the aim, requirements on computational efficiency, and termination condition, which are briefly discussed here:
(i) Performance measure: The final performance measure in bo is usually the lowest found value of the objective function. Line searches are subroutines inside of a greedy, iterative optimization machine, which usually performs several thousand steps (and line searches); many very approximate steps often perform better than fewer, more precise steps.
(ii) Termination: The termination condition of a line search is imposed from the outside in the form of the Wolfe conditions. Stricter Wolfe conditions do not usually improve the performance of the overall optimizer; thus, no matter if a better (lower) minimum could be found, any Wolfe-point is acceptable at all times.
(iii) Sample efficiency: Since the last evaluation from the previous line search can be re-used in the current line search, only one additional value and gradient evaluation is enough to terminate the procedure. This ‘immediate-accept’ is the desired behavior if the learning rate is currently well calibrated.
(iv) Locations for evaluation: bo usually calls an optimizer to maximize some acquisition function, and the precision of this optimization is crucial for performance. Line searches just need to find a Wolfe-acceptable point; classic line searches suggest that it is enough to look at plausible locations, like the minimizer of a local interpolator, or some rough extrapolation point; this inexpensive heuristic usually works rather well.
(v) Exploration: bo needs to solve an intricate trade-off between exploring enough of the parameter space for possible locations of minima, and exploiting locations around them further. Since line searches are only concerned with finding a Wolfe-point, they do not need to explore the parameter space of possible step sizes to that extent; crucial features are rather the possibility to explore somewhat larger steps than previous ones (which is done by extrapolation-candidates), and likewise shorter steps (which is done by interpolation-candidates).
In the limit of noise-free observed gradients and function values ($\sigma_f, \sigma_{f'} \to 0$) the probabilistic line search behaves like its classic parent, except for very slight variations in the candidate choice (building block 3): The gp-mean reverts to the classic interpolator; all candidate locations are thus identical, but the probabilistic line search might propose a second option, since (even if there is a local minimizer) it always also proposes an extrapolation candidate. For intuitive purposes, this is illustrated in the following table.
|building block||classic||probabilistic (noise free)|
|1) 1D surrogate for objective||piecewise cubic splines||gp-mean identical to classic interpolator|
|2) candidate selection||local minimizer of cubic splines xor extrapolation||local minimizer of cubic splines or extrapolation|
|3) choice of best candidate||———||bo acquisition function|
|4) acceptance criterion||classic Wolfe conditions||identical to classic Wolfe conditions|
3.6 Computational Time Overhead
The line search routine itself has little memory and time overhead; most importantly it is independent of the dimensionality of the optimization problem. After every call of the objective function the gp (§3.1) needs to be updated, which at most is at the cost of inverting a -matrix, where usually is equal to , or but never . In addition, the bivariate normal integral of Eq. 14 needs to be computed at most times. On a laptop, one evaluation of costs about 100 microseconds. For the choice among proposed candidates (§3.2), again at most , for each, we need to evaluate and (Eq. 9) where the latter comes at the expense of evaluating two error functions. Since all of these computations have a fixed cost (in total some milliseconds on a laptop), the relative overhead becomes less the more expensive the evaluation of .
The largest overhead actually lies outside of the actual line search routine. In case the noise levels $\sigma_f$ and $\sigma_{f'}$ are not known, we need to estimate them. The approach we took is described in Section 3.4.3, where the variance of $\nabla\hat{\mathcal{L}}$ is estimated using the sample variance of the mini-batch, each time the objective function is called. Since in this formulation the variance estimation is about half as expensive as one backward pass of the net, the time overhead depends on the relative cost of the feed forward and backward passes (Balles et al., 2016). If forward and backward pass are the same cost, the most straightforward implementation of the variance estimation would make each function call 1.25 times as expensive. [Footnote 7: It is desirable to decrease this value in the future by reusing computation results or by approximation, but this is beyond the scope of this discussion.] At the same time though, all the exploratory experiments that considerably increase the time spent when using sgd with a hand-tuned learning rate schedule need not be performed anymore. In Section 4.1 we will also see that sgd using the probabilistic line search often needs fewer function evaluations to converge, which might lead to overall faster convergence in wall-clock time than classic sgd in a single run.
3.7 Memory Requirement
Vanilla sgd, at all times, keeps around the current optimization parameters $x \in \mathbb{R}^D$ and the gradient vector $\nabla\hat{\mathcal{L}}(x) \in \mathbb{R}^D$. In addition to this, the probabilistic line search needs to store the estimated gradient variances (Eq. 18) of the same size. The memory requirement of sgd+probLS is thus comparable to AdaGrad or Adam. If combined with a search direction other than that of sgd, one additional vector of size $D$ always needs to be stored.
4 Experiments
This section reports on an extensive set of experiments to characterise and test the line search. The overall evidence from these tests is that the line search performs well and is relatively insensitive to the choice of its internal hyper-parameters as well as the mini-batch size. We performed experiments on two multi-layer perceptrons N-I and N-II; both were trained on two well-known datasets, MNIST and CIFAR-10.
MNIST (LeCun et al., 1998): multi-class classification task with 10 classes: hand-written digits in gray-scale of size $28 \times 28$ (numbers ‘0’ to ‘9’); training set size 60 000, test set size 10 000.
CIFAR-10 (Krizhevsky and Hinton, 2009): multi-class classification task with 10 classes: color images of natural objects (horse, dog, frog,…) of size $32 \times 32$; training set size 50 000, test set size 10 000; like other authors, we only used the “batch 1” sub-set of CIFAR-10 containing 10 000 training examples.
In addition we train logistic regressors with sigmoidal output (N-III) on the following binary classification tasks:
Wisconsin Breast Cancer Dataset (WDBC) (Wolberg et al., 2011): binary classification of tumors as either ‘malignant’ or ‘benign’. The set consists of 569 examples, of which we used 169 to monitor generalization performance; thus 400 remain for the training set; 30 features describe, for example, radius, area, symmetry, et cetera. In comparison to the other datasets and networks, this yields a very low dimensional optimization problem with only 30 (+1 bias) input parameters as well as just a small number of datapoints.
GISETTE (Guyon et al., 2005): binary classification of the handwritten digits ‘4’ and ‘9’. The original images are taken from the MNIST dataset; then the feature set was expanded and consists of the original normalized pixels, plus a randomly selected subset of products of pairs of features, which are slightly biased towards the upper part of the image; in total there are 5000 features, instead of 784 as in the original MNIST. The size of the training set and test set is 6000 and 1000 respectively.
EPSILON: synthetic dataset from the PASCAL Challenge 2008 for binary classification. It consists of 400 000 training set datapoints and 100 000 test set datapoints, each having 2000 features.
In the text and figures, sgd using the probabilistic line search will occasionally be denoted as sgd+probLS. Section 4.1 contains experiments on the sensitivity to varying gradient noise levels (mini-batch sizes) performed on both multi-layer perceptrons N-I and N-II, as well as on the logistic regressor N-III. Section 4.2 discusses sensitivity to the hyper-parameter choices introduced in Section 3.4, and Section 4.3 contains additional diagnostics on step size statistics. Each single experiment was performed times with different random seeds that determined the starting weights and the mini-batch selection; seeds were shared across all experiments. We report all results of the instances as well as means and standard deviations.
4.1 Varying Mini-batch Sizes
The noise level of the gradient estimate and the loss is determined by the mini-batch size, and ultimately there should exist an optimal mini-batch size that maximizes the optimizer’s performance in wall-clock time. In practice, of course, the cost of computing the gradient estimate and the loss is not necessarily linear in the mini-batch size, since it is upper bounded by the memory capacity of the hardware used. We assume here that the mini-batch size is chosen by the user; thus we test the line search with the default hyper-parameter setting (see Sections 3.4 and 4.2) on four different mini-batch sizes:
and (for MNIST, CIFAR-10, and EPSILON)
, and (for WDBC and GISETTE)
which correspond to increasing signal-to-noise ratios. Since the training set of WDBC only consists of 400 datapoints, the run with the largest mini-batch size of 400 in fact runs full-batch gradient descent on WDBC; this is not a problem since, as discussed above, the probabilistic line search can also handle noise-free observations. (Footnote 9: Since the dataset size of WDBC is very small, we used the factor instead of to scale the sample variances of Eq. 17; for both factors are nearly identical. The former measures the noise level relative to the empirical risk, the latter relative to the risk; so both choices are sensible depending on what is the desired objective.) We compare to sgd-runs using a fixed step size (which is typical for these architectures) and an annealed step size with annealing schedule . Because annealed step sizes performed much worse than sgd with a fixed step size, we will only report on the latter results in the plots. (Footnote 10: An example of annealed step size performance can be found in Mahsereci and Hennig (2015).) Since classic sgd without the line search needs a hand-crafted learning rate, we search on exhaustive logarithmic grids of
We run different initializations for each learning rate, each mini-batch size, and each net and dataset combination ( runs in total) for a budget large enough to reach convergence, and report all numbers. Then we perform the same experiments, using the same seeds and setups, with sgd using the probabilistic line search and compare the results. For sgd+probLS, is the initial learning rate which is used in the very first step. After that, the line search automatically adapts the learning rate, and shows no significant sensitivity to its initialization.
Results of N-I and N-II on both MNIST and CIFAR-10 are shown in Figures 7, 14, 15, and 16; results of N-III on WDBC, GISETTE and EPSILON are shown in Figures 18, 17, and 19, respectively. All instances (sgd and sgd+probLS) get the same computational budget (number of mini-batch evaluations), not the same number of optimization steps. The latter would favour the probabilistic line search since, on average, a bit more than one mini-batch is evaluated per step. Likewise, all plots show the performance measure versus the number of mini-batch evaluations, which is proportional to the computational cost.
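The comparison protocol described above (a logarithmic grid of fixed learning rates, several seeded runs per setting, and a budget counted in mini-batch evaluations rather than optimization steps) can be sketched as follows. This is only an illustration on a toy noisy quadratic: the grid exponents, the number of seeds, the budget, and all function names are assumptions, not the values or code used in the experiments.

```python
import random


def noisy_grad(w, rng, noise_std=1.0):
    # Gradient of the toy loss 0.5 * w**2, corrupted by Gaussian "mini-batch" noise.
    return w + rng.gauss(0.0, noise_std)


def run_sgd(lr, seed, budget=200):
    # One sgd run with a fixed learning rate.
    # `budget` counts gradient (mini-batch) evaluations, not optimization steps;
    # for plain sgd the two coincide, for a line search they would not.
    rng = random.Random(seed)      # the seed fixes the starting weight and the noise
    w = rng.uniform(-5.0, 5.0)
    for _ in range(budget):
        w -= lr * noisy_grad(w, rng)
    return 0.5 * w * w             # final noise-free loss


def grid_search(exponents=range(-4, 1), seeds=range(10)):
    # Hypothetical logarithmic grid 10**-4 ... 10**0, same seeds for every setting.
    results = {}
    for e in exponents:
        lr = 10.0 ** e
        losses = [run_sgd(lr, s) for s in seeds]
        results[lr] = sum(losses) / len(losses)
    return results


if __name__ == "__main__":
    for lr, mean_loss in grid_search().items():
        print(f"lr={lr:g}  mean final loss={mean_loss:.4f}")
```

Sharing the seeds across all grid points, as in the text, means that differences between settings are not confounded by different initial weights.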
All plots show similar results: while classic sgd is sensitive to the learning rate choice, the line search-controlled sgd performs as well as, close to, or sometimes even better than the (in practice unknown) optimal classic sgd instance. In Figure 7, for example, sgd+probLS converges much faster to a good test set error than the best classic sgd instance. In all experiments, across a reasonable range of mini-batch sizes and of initial values, the line search quickly identified good step sizes , stabilized the training, and progressed efficiently, reaching test set errors similar to those reported in the literature for tuned versions of these kinds of architectures and datasets. The probabilistic line search thus effectively removes the need for exploratory experiments and learning-rate tuning.
Overfitting and training error curves: The training error of sgd+probLS often plateaus earlier than that of vanilla sgd, especially for smaller mini-batch sizes. This does not seem to impair the performance of the optimizer on the test set. We did not investigate this further, since it seemed like a nice natural annealing effect; the exact causes are unclear for now. One explanation might be that the line search does indeed mitigate overfitting, since it tries to measure descent (by Wolfe conditions which rely on the noise-informed gp). This means that if, close to a minimum, successive acceptance decisions cannot identify a descent direction anymore, diffusion might set in.
4.2 Sensitivity to Design Parameters
Most, if not all, numerical methods make implicit or explicit choices about their hyper-parameters. Most of these are never seen by the user since they are either estimated at run time, or set by design to a fixed, approximately insensitive value. Well-known examples are the discount factor in ordinary differential equation solvers (Hairer et al., 1987, §2.4), or the two Wolfe parameters of classic line searches (§3.4.1). The probabilistic line search inherits both Wolfe parameters from its classical counterpart and introduces two more: the Wolfe threshold and the extrapolation factor. The Wolfe threshold does not appear in the classical formulation, since there the objective function can be evaluated exactly and the Wolfe probability is binary (either fulfilled or not). While the Wolfe threshold is thus a natural consequence of allowing the line search to model noise explicitly, the extrapolation factor compensates for the line search’s tendency to favor shorter steps. This tendency is discussed in more detail below; its most prominent cause is a bias in the line search’s first gradient observation.
In the following sections we give an intuition for the roles of the three most influential design parameters (the Wolfe II parameter, the Wolfe threshold, and the extrapolation factor), discuss how they affect the probabilistic line search, and validate good design choices by exploring the parameter space and showing insensitivity to most of them. All experiments on hyper-parameter sensitivity were performed training N-II on MNIST with mini-batch size . For a full search of the parameter space we performed runs in total with different parameter combinations. All results are reported.
4.2.1 Wolfe II Parameter and Wolfe Threshold
As described in Section 3.4, the Wolfe II parameter encodes the strictness of the curvature condition W-II. Pictorially speaking, a larger value extends the range of acceptable gradients (green shaded area in the lower part of Figure 5) and leads to a lenient line search, while a smaller value shrinks this area, leading to a stricter line search. The Wolfe threshold controls how certain we want to be that the Wolfe conditions are actually fulfilled. In the extreme case of complete uncertainty about the collected gradients and function values () the Wolfe probability will always be , if the strong Wolfe conditions are imposed. In the limit of certain observations () the Wolfe probability is binary and reverts to the classic Wolfe criteria. An overly strict line search (e.g. and/or ) will therefore still be able to optimize the objective function well, but will waste evaluations at the expense of efficiency. Figure 10 explores the parameter space spanned by the Wolfe II parameter and the Wolfe threshold (while keeping the extrapolation factor fixed at 1.3). The left column shows final test and train set error, the right column the average number of function evaluations per line search, both versus different choices of the Wolfe II parameter. The left column thus shows the overall performance of the optimizer, while the right column is representative of the computational efficiency of the line search. Intuitively, a line search which is minimally invasive (one that only corrects the learning rate when it is really necessary) is preferred. Rows in Figure 10 show the same plot for different choices of the Wolfe threshold.
The effect of a strict Wolfe II parameter can be observed clearly in Figure 10: for smaller values, the average number of function evaluations spent in one line search goes up slightly in comparison to looser settings, while very good performance is still reached in terms of train and test set error. Likewise, the last row of Figure 10, for the extreme value of (demanding certainty about the validity of the Wolfe conditions), shows a significant loss in computational efficiency, with an average number of function evaluations per line search, but still does not break. Lowering this threshold a bit to increases the computational efficiency of the line search to be nearly optimal again.
Ideally, we want to trade off two desiderata: being strict enough to reject steps that are too small or too large and would prevent the optimizer from converging, but lenient enough to allow all other reasonable steps, thus increasing computational efficiency. The values and , which are adopted in our current implementation, are marked as dark red vertical lines in Figure 10.
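To make the role of the Wolfe II parameter concrete, the following sketch checks the classic (noise-free) Wolfe conditions on a one-dimensional toy function. It covers only the deterministic limit mentioned above, in which the conditions are either fulfilled or not; the probabilistic line search instead compares a Wolfe probability to the Wolfe threshold. The names `c1` and `c2`, their values, and the test function are illustrative assumptions.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=0.05, c2=0.5, strong=False):
    """Return (W-I, W-II) for a step t > 0 along a descent direction (df0 < 0).

    f0, df0: function value and directional derivative at t = 0
    ft, dft: function value and directional derivative at the candidate t
    """
    w1 = ft <= f0 + c1 * t * df0       # sufficient decrease (W-I)
    w2 = dft >= c2 * df0               # curvature condition (W-II)
    if strong:
        # Strong variant: additionally bound the gradient magnitude.
        w2 = w2 and abs(dft) <= c2 * abs(df0)
    return w1, w2


# Toy 1D objective along the search direction, with its minimum at t = 1.
def f(t):
    return (t - 1.0) ** 2


def df(t):
    return 2.0 * (t - 1.0)


def accepted(t, c2):
    w1, w2 = wolfe_conditions(f(0.0), df(0.0), f(t), df(t), t, c2=c2)
    return w1 and w2


if __name__ == "__main__":
    for c2 in (0.1, 0.5, 0.9):
        ok = [t / 10 for t in range(1, 30) if accepted(t / 10, c2)]
        print(f"c2={c2}: accepted steps from {ok[0]} to {ok[-1]}")
```

In this toy setting a larger `c2` admits shorter steps that a smaller `c2` rejects, which is the lenient-versus-strict behaviour described in the text; overly long steps are rejected by W-I regardless of `c2`.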
4.2.2 Extrapolation Factor
The extrapolation parameter , introduced in Section 3.4.4, pushes the line search to first try a learning rate larger than the one accepted in the previous step. Figure 9 is structured like Figure 10, but this time explores the line search sensitivity in the parameter space spanned by the Wolfe II parameter and the extrapolation factor (abscissa and rows, respectively) while keeping the Wolfe threshold fixed at . Unless we choose (no step size increase between steps) in combination with a lenient choice of the Wolfe II parameter, the line search performs well. For now we adopt as default value, which again is shown as a dark red vertical line in Figure 9.
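The effect of the extrapolation factor on the sequence of accepted learning rates can be illustrated with a deliberately crude toy model. The sketch below assumes a caricature of a line search that can only shrink its first proposal (mimicking the tendency towards short steps discussed in this section); the real line search is more flexible. The function name, the initial step, the cap `max_acceptable`, and the factor 1.3 used in the demo are hypothetical choices for illustration.

```python
def accepted_steps(alpha_ext, initial_step=1e-4, max_acceptable=0.1, n_searches=40):
    # Toy model: each line search starts from alpha_ext times the previously
    # accepted step and accepts the proposal unless it exceeds `max_acceptable`,
    # in which case it falls back to that cap.
    step = initial_step
    trace = []
    for _ in range(n_searches):
        proposal = alpha_ext * step            # first candidate of the new line search
        step = min(proposal, max_acceptable)   # the toy search never enlarges a proposal
        trace.append(step)
    return trace


if __name__ == "__main__":
    print("no extrapolation:  ", accepted_steps(1.0)[-1])
    print("with extrapolation:", accepted_steps(1.3)[-1])
```

Without extrapolation (factor 1) this toy search stays at the small initial learning rate forever, while a factor above 1 lets the accepted step grow geometrically until it reaches the largest acceptable value.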
The introduction of is a necessary and well-working fix for a few shortcomings of the current design. First, the curvature condition W-II is the single condition that prevents too small steps and pushes optimization progress. On the other hand, both W-I and W-II simultaneously penalize too large steps (see Figure 1 for a sketch). This is not a problem in the case of deterministic observations (), where W-II undoubtedly decides if a gradient is still too negative. In the presence of noise, unless W-II is chosen very tightly (small Wolfe II parameter) or the Wolfe threshold is chosen unnecessarily large (both choices, as discussed above, are undesirable), the Wolfe probability will thus be more reliable in preventing overshooting than in pushing progress. The first row of Figure 9 illustrates this behavior: the performance drops somewhat if no extrapolation is done () in combination with a looser version of W-II (larger ).
Another factor that contributes towards accepting smaller rather than larger learning rates is a bias introduced in the first observation of the line search at . Observations that the gp gets to see are projections of the gradient sample onto the search direction . Since the first observation is computed from the same mini-batch as the search direction (not doing this would double the optimizer’s computational cost), an inevitable bias of approximate size is introduced (where is the expected angle between gradient evaluations from two independent mini-batches at ). Since the scale parameter of the Wiener process is implicitly set by (§3.4.2), the gp becomes more uncertain at unobserved points than it needs to be; or, alternatively, it expects the 1D-gradient to cross zero at smaller steps, and thus underestimates a potential learning rate. The posterior at observed positions is little affected. The over-estimation of rather pushes the posterior towards the likelihood (since there is less model to trust) and thus still gives a reliable measure for and . The effect on the Wolfe conditions is similar. With biased towards larger values, the Wolfe conditions, which measure the drop in projected gradient norm, are thus prone to accept larger gradients combined with smaller function values, which again is met by making small steps. Ultimately though, since candidate points at that are currently queried for acceptance are always observed and unbiased, this can be controlled by an appropriate design of the Wolfe factor (§3.4.1 and §4.2.1) and, of course, the extrapolation factor.
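The origin of this bias can be checked numerically. The sketch below compares the magnitude of the projected gradient when the gradient sample and the search direction come from the same mini-batch with the case of two independent mini-batches. It assumes, purely for illustration, a fixed true gradient and isotropic Gaussian gradient noise; under that assumption the same-batch projection has expectation $\|g\|^2 + d\,\sigma^2$, whereas the independent projection has expectation $\|g\|^2$. The function name and all numerical settings are hypothetical.

```python
import random


def projected_gradient_means(dim=20, noise_std=1.0, n_samples=5000, seed=0):
    # Monte Carlo estimate of E[g1 . g1] (same mini-batch) and
    # E[g1 . g2] (independent mini-batches) for noisy gradient samples.
    rng = random.Random(seed)
    true_grad = [1.0] * dim        # assumed true gradient, squared norm = dim

    def sample():
        # Noisy gradient: true gradient plus isotropic Gaussian noise.
        return [g + rng.gauss(0.0, noise_std) for g in true_grad]

    same = 0.0
    indep = 0.0
    for _ in range(n_samples):
        g1, g2 = sample(), sample()
        same += sum(a * a for a in g1)                 # direction and observation share noise
        indep += sum(a * b for a, b in zip(g1, g2))    # noise is independent
    return same / n_samples, indep / n_samples


if __name__ == "__main__":
    same, indep = projected_gradient_means()
    print(f"same mini-batch:        {same:.2f}")
    print(f"independent mini-batch: {indep:.2f}")
```

With these settings the same-batch projection is roughly twice as large as the independent one, because the shared noise is counted as if it were signal.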
4.2.3 Full Hyper-Parameter Search
An exhaustive performance evaluation on the whole three-dimensional parameter grid is shown in Appendix C (Parameter Sensitivity) in Figures 20-24 and Figures 25-35. As discussed above, it shows the necessity of introducing the extrapolation parameter and shows slightly less efficient performance for obviously undesirable parameter combinations. In a large volume of the parameter space, and most importantly in the vicinity of the chosen design parameters, the line search performance is stable and comparable to carefully hand-tuned learning rates.
4.2.4 Safeguarding Mis-scaled gps:
For completeness, an additional experiment was performed on the threshold parameter which is denoted by in the pseudo-code (Appendix D, Pseudocode) and safeguards against gp mis-scaling. The introduction of noisy observations makes it necessary to model the variability of the 1D-function, which is described by the kernel scale parameter (§3.4.2). Setting this hyper-parameter is done implicitly by scaling the observation input, assuming a similar scale as in the previous line search (§3.4.2). If, for some reason, the previous line search accepted an unexpectedly large or small step (what this means is encoded in ), the gp scale for the next line search is reset to an exponential running average of previous scales ( in the pseudo-code). This occurs very rarely (for the default value the reset occurred in of all line searches), but is necessary to safeguard against extremely mis-scaled gps. therefore is not part of the probabilistic line search model as such, but prevents mis-scaled gps due to some unlucky observation or a sudden extreme change in the learning rate. Figure 8 shows the performance of the line search for and , showing no significant performance change. We adopted in our implementation since this is the expected and desired multiplicative (inverse) factor to maximally vary the learning rate in one single step.
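A minimal sketch of such a safeguard is given below. It is not the pseudo-code of Appendix D: the function name, the decay constant of the running average, the placeholder threshold value, and the choice to compare against the running average before updating it are all assumptions made for illustration.

```python
def safeguard_scale(accepted, running_avg, theta_reset=100.0, decay=0.95):
    """Return (next_scale, new_running_avg, was_reset).

    accepted:     step size accepted by the previous line search
    running_avg:  exponential running average of earlier accepted steps
    theta_reset:  placeholder threshold; a step counts as an outlier if it differs
                  from the running average by more than this multiplicative factor
    """
    too_large = accepted > theta_reset * running_avg
    too_small = accepted < running_avg / theta_reset
    if too_large or too_small:
        # Outlier: fall back to the running average and leave it unchanged,
        # so that the outlier does not contaminate future resets.
        return running_avg, running_avg, True
    new_avg = decay * running_avg + (1.0 - decay) * accepted
    return accepted, new_avg, False


if __name__ == "__main__":
    print(safeguard_scale(0.02, 0.01))   # ordinary step: passed through
    print(safeguard_scale(10.0, 0.01))   # extreme step: reset to the running average
```

Because the test is multiplicative in both directions, a step is only overridden when it differs from recent history by orders of magnitude, which is consistent with the reset being a rare event.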
4.3 Candidate Selection and Learning Rate Traces
In the current implementation of the probabilistic line search, the choice among candidates for evaluation is made by evaluating an acquisition function at every candidate point , then choosing the one with the highest value for evaluation of the objective (§3.2). The Wolfe probability actually encodes precisely what kind of point we want to find and incorporates both conditions (W-I and W-II), on the function value and on the gradient (§3.3). However, the Wolfe probability does not have very desirable exploration properties. Since the uncertainty of the gp grows to ‘the right’ of the last observation, the Wolfe probability quickly drops to a low, approximately constant value there (Figure 4). The Wolfe probability also partially allows for undesirably short steps (§4.2.2). The expected improvement , on the other hand, is a well-studied acquisition function of Bayesian optimization that trades off exploration and exploitation. It aims to globally find a point with a function value lower than a current best guess. Though this is a desirable property for the probabilistic line search as well, it lacks the information that we are seeking a point that also fulfills the W-II curvature condition. This is evident in Figure 4 where significantly drops at points where the objective function is already evaluated but does not. In addition, we do not need to explore the positive space to the extent that the expected improvement suggests, since the aim of a line search is just to find a good, acceptable point at positive and not the globally best one. The product of both acquisition functions is thus a trade-off between exploring enough and preventing too much exploitation in obviously undesirable regions. In practice, though, we found that all three choices ((i) , (ii) only, (iii) only) perform comparably. The following experiments were all performed training N-II on MNIST; only the mini-batch size might vary as indicated.
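The three candidate-selection rules compared in this section can be sketched as follows. The expected improvement is written in its standard closed form for a Gaussian posterior and a minimization problem. The Wolfe probabilities are treated as given inputs, since computing them requires the joint gp posterior over function values and gradients; all names, the `mode` switch, and the numbers in the demo are assumptions for illustration.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    # Closed-form expected improvement over `best` (minimization)
    # for a Gaussian posterior with the given mean and standard deviation.
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * normal_cdf(z) + std * normal_pdf(z)


def select_candidate(means, stds, wolfe_probs, best, mode="product"):
    # Return the index of the candidate with the highest acquisition value.
    scores = []
    for m, s, p in zip(means, stds, wolfe_probs):
        ei = expected_improvement(m, s, best)
        if mode == "product":
            scores.append(ei * p)      # expected improvement times Wolfe probability
        elif mode == "ei":
            scores.append(ei)          # expected improvement only
        elif mode == "wolfe":
            scores.append(p)           # Wolfe probability only
        else:
            raise ValueError(f"unknown mode: {mode}")
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    means, stds = [0.0, -0.5, -1.0], [1.0, 0.5, 2.0]
    wolfe_probs = [0.1, 0.6, 0.05]     # hypothetical values
    for mode in ("product", "ei", "wolfe"):
        print(mode, select_candidate(means, stds, wolfe_probs, best=0.0, mode=mode))
```

In the demo, the expected improvement alone prefers the far, highly uncertain candidate, while the product favours the candidate that is both promising and likely to satisfy the Wolfe conditions.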
Figure 11 compares all three choices for mini-batch size and default design parameters. The top plot shows the evolution of the logarithmic test and train set error (for plot and color description see the figure caption). All test and train set error curves, respectively, bundle up (only the curve plotted last is clearly visible). The choice of acquisition function thus does not change the performance here. Rows 2-4 of Figure 11 show learning rate traces of a single seed. All three curves show very similar global behavior: first the learning rate grows, then it drops again, and finally it settles around the best found constant learning rate. This is intriguing, since on average a larger learning rate seems to be better at the beginning of the optimization process, later dropping again to a smaller one. This might also explain why sgd+probLS outperforms vanilla sgd in the first part of the optimization process (Figure 7). Runs that use just slightly larger constant learning rates than the best performing constant one (above the gray horizontal lines in Figure 11) failed after a few steps. This shows that there is some non-trivial adaptation going on, not just globally, but locally at every step.
Figure 12 shows traces of accepted learning rates for different mini-batch sizes . Again the global behavior is qualitatively similar for all three mini-batch sizes on the given architecture. For the largest mini-batch size (last row of Figure 12) the probabilistic line search accepts a larger learning rate (on average and in absolute value) than for the smaller mini-batch sizes and , which is in agreement with practical experience and theoretical findings (Hinton (2012, §4 and 7), Goodfellow et al. (2016, §9.1.3), Balles et al. (2016)).
Figure 13 shows traces of the (scaled) noise levels and and the average number of function evaluations per line search for different noise levels (); same colors show the same setup but different seeds. The average number of function evaluations rises very slightly to for mini-batch size towards the end of the optimization process, in comparison to for . This seems counterintuitive in a way, but since larger mini-batch sizes also observe smaller function values and gradients (especially towards the end of the optimization process), the relative noise levels might actually be larger. (Although the curves for varying are shown versus the same abscissa, the corresponding optimizers might be in different regions of the loss surface, especially probably reaches regions of smaller absolute gradients.) At the start of the optimization the average number of function evaluations is high, because the initial default learning rate is small () and the line search extends each step multiple times.
The line search paradigm widely accepted in deterministic optimization can be extended to noisy settings. Our design combines existing principles from the noise-free case with ideas from Bayesian optimization, adapted for efficiency. We arrived at a lightweight “black-box” algorithm that exposes no parameters to the user. Empirical evaluations so far show compatibility with the sgd search direction and viability for logistic regression and multi-layer perceptrons. The line search effectively frees users from worries about the choice of a learning rate: Any reasonable initial choice will be quickly adapted and lead to close to optimal performance. Our matlab implementation can be found at http://tinyurl.com/probLineSearch.
Thanks to Jonas Jaszkowic who prepared the base of the pseudo-code.
Appendix A. – Noise Estimation
Section 3.4.3 introduced the statistical variance estimators
of the function and gradient estimate and at position . The underlying assumption is that and are distributed according to
which implies Eq. 3
where is the possibly new search direction at . This is an approximation since the true covariance matrix is in general not diagonal. A better estimator for the projected gradient noise would be (dropping from the notation)