The past two decades have witnessed the almost unprecedented rise of an intriguing algorithmic field: machine learning (ML). With roots in statistics and computer science, ML has a mathematical optimization engine at its core. In fact, these days, ML and other data-driven disciplines have influenced much of the latest theoretical and practical advances in the optimization research community. Still, despite these connections, many barriers remain between the statistics, computer science, and optimization communities working on ML and related topics. These barriers—of differing terminology, goals, and avenues for collaboration—have led to duplications of efforts and continue to inhibit effective exchanges of ideas.
The aim of this tutorial is to present an overview of some of the key issues and research questions that relate to optimization within the field of ML. With the Operations Research (OR) community in mind, we assume that the reader is familiar with basic optimization methodologies, but will introduce terminology and concepts used in the broader ML community in a manner that we hope will facilitate communication between OR experts and those from other contributing fields. To aid in this pursuit, we provide in Table 1 a small glossary of the most important terms that will be introduced and used throughout this tutorial.
Table 1: Glossary of key terms used throughout this tutorial.

| term | description | also known as |
| --- | --- | --- |
| input | element $x$ from the input space $\mathcal{X}$ | |
| output | element $y$ from the output space $\mathcal{Y}$ | label (vector) |
| sample set | set of pairs $\{(x_i, y_i)\}$ from the input-output space | |
| testing set | another set of pairs from the input-output space | |
| prediction function | function $p$ such that $p(x)$ predicts $y$ | predictor |
| parameter vector | parameterization vector $w$ for a prediction function, i.e., $p(w, \cdot)$ with $w$ in a space $\mathcal{W}$ | weights |
| loss function | function $\ell$ for assigning a penalty when $p(w, x)$ does not predict the correct label for $x$ | |
| training/testing loss | average loss evaluated over the sample/testing set | |
| training/testing error | percentage of mislabeled elements in the sample/testing set | |
| stepsize sequence | multipliers $\{\alpha_k\}$ for the steps in an optimization algorithm | learning rate |
1.1 Motivating illustration
The idea of machine learning arises with the fundamental question of whether machines (i.e., computers) can “think” like humans. At a more tangible level, this leads to questions such as whether, given a particular input, a machine can produce a reasonable/rational output response, e.g., for the purpose of making a prediction or decision. Let us begin with a simple illustration of how this might be done before introducing the idea of a learning problem more formally in the next subsection.
Suppose that a company aims to predict whether Product A is going to be profitable (yes or no) in an upcoming quarter. A human expert might attempt to make such a prediction by considering various factors that can be found in historical data, say about Product A, related products, and/or other factors. For simplicity of illustration, suppose that one considers two factors, also known as features: the demand for Product A and another factor, call it Factor X, both projected for the upcoming quarter. Certainly, one might expect that projected high demand might suggest high potential profitability in the upcoming quarter, but compounded with the influence of Factor X the outcome might be less obvious. For example, Factor X may reflect the cost of production or delivery of Product A, which might depend on the costs of raw materials or the set-up (take-down) costs of ramping up (reducing) production levels compared to previous quarters. Overall, the influence of Factor X could be complex and nonlinear.
Looking at historical data over multiple consecutive quarters, suppose that the data points in Figure 1 show the pairs of demand for Product A and value of Factor X that have been observed. The points in green indicate pairs corresponding to quarters in which Product A was profitable while those in red indicate unprofitable quarters. Using this data, one might aim to learn how to take the inputs of projected demand and Factor X and predict whether or not the product will be profitable. For example, this could be achieved by learning a dividing line or a dividing curve between green and red dots, as illustrated in the center and rightmost plots in Figure 1, respectively. The idea is that, if a good prediction tool (i.e., dividing line/curve) is determined, then one could take a new pair of inputs and accurately predict whether Product A will be profitable in the upcoming quarter (in this example, by determining on which side of the dividing line the new pair of inputs lies).
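To make the "which side of the line" idea concrete, here is a minimal sketch of such a prediction rule. The coefficients below are hypothetical, standing in for values that would be learned from the historical quarters in Figure 1.

```python
import numpy as np

# A learned dividing line a1*demand + a2*factor_x + b = 0 splits the plane.
# The coefficients are hypothetical stand-ins for learned values.
a = np.array([1.0, -0.5])   # weights on (demand, Factor X)
b = -2.0                    # offset of the line

def predict_profitable(features):
    """Return True ('profitable') if the point lies on the positive side."""
    return float(np.dot(a, features) + b) > 0.0

# Projected quarters: high demand with moderate Factor X, and vice versa.
print(predict_profitable([5.0, 1.0]))   # 1*5 - 0.5*1 - 2 = 2.5 > 0 -> True
print(predict_profitable([1.0, 4.0]))   # 1*1 - 0.5*4 - 2 = -3  -> False
```

Every point in the feature plane is assigned a yes/no answer by a single inner product, which is exactly what makes linear predictors attractive computationally.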
Of course, our illustration in Figure 1 presents an idealized case in which dividing lines exist between data points corresponding to differing labels. The situation is not always so clean. We address this and other issues next in our more general discussion of learning.
1.2 Learning problems and (surrogate) optimization problems
Let us now define a generic learning problem more formally. In particular, let us consider the problem of trying to determine an effective procedure for taking an input (i.e., feature vector) $x$ from an input space $\mathcal{X}$ and predicting its correct output (i.e., label vector) $y$ in an output space $\mathcal{Y}$. This can be cast as trying to learn a prediction function, call it $p : \mathcal{X} \to \mathcal{Y}$, that takes an input $x$ and produces $p(x)$ that one hopes can identify $y$. Not knowing what might be future inputs of interest, one tries to determine $p$ such that, over the distribution $\mathcal{D}$ of pairs $(x, y)$ in the input-output space $\mathcal{X} \times \mathcal{Y}$, one maximizes the probability of a correct prediction; i.e., the goal is to choose $p$ to maximize

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbb{1}[p(x) \equiv y]\big]. \tag{1}$$

Here, $\mathbb{1}[\cdot]$ is the indicator function that takes the value 1 if its argument is true and 0 otherwise. As for the notation "$p(x) \equiv y$", it could literally mean that the two vectors are equal (or close in some sense), that one is less than or equal to the other by some measure, or some other related meaning. We discuss various such possibilities later on.
Clearly, for any given application, there is not necessarily one correct manner in which the specifics of the function in (1) should be chosen. In addition, in its present form, it is far from tractable to maximize (1) directly. In the remainder of this section, we discuss how, with various approximations and manipulations, one can go from the generic learning goal of maximizing (1) to a practical optimization problem that one can reasonably solve.
The first issue that one must address when aiming to maximize (1) is what class of prediction functions to consider. If the class is too large and/or involves a diverse collection of complex functions, then finding the function that maximizes (1) can be extremely difficult. On the other hand, if the class is too small or only involves simple functions, then even the optimal function within the class might yield poor predictions for various inputs. (Recall the choice between trying to separate points based on a line or a more general type of curve in §1.1.) Overall, there is a critical balance that one should attempt to find when determining the class of prediction functions; see §1.3 for further discussion.
For our purposes, let us assume that some family of prediction functions is chosen that is parameterized by a real vector $w$, i.e., $\mathcal{P} = \{p(w, \cdot) : w \in \mathcal{W}\}$. For example, in a simple, yet very common setting, one considers the class of linear predictors $p(w, x) = a^T x + \beta$ with $a \in \mathbb{R}^{d_x}$ and $\beta \in \mathbb{R}$, making the entire space of parameters $\mathcal{W} = \mathbb{R}^{d_x + 1}$ with $w = (a, \beta)$. Another example is the class of quadratic functions in $x$, where $p(w, x) = x^T H x + a^T x + \beta$ with symmetric $H \in \mathbb{R}^{d_x \times d_x}$, $a \in \mathbb{R}^{d_x}$, and $\beta \in \mathbb{R}$, while $\mathrm{svec}(H)$ denotes the $\frac{d_x(d_x+1)}{2}$-dimensional symmetric vectorization of the matrix $H$, thus making $w = (\mathrm{svec}(H), a, \beta)$. (The symmetric vectorization of $x$ produces a vector whose elements correspond to the upper triangle of the outer product $xx^T$, namely terms of the form $x_i^2$ and $x_i x_j$.) While this function is nonlinear in $x$ (and hence can fit complex relationships between $x$ and $y$), it is linear in the parameters $w$. Hence, this class can actually be viewed as a class of linear predictors for a modified feature space, where for each $x \in \mathcal{X}$ one uses the modified feature vector $(\mathrm{svec}(xx^T), x)$. It turns out that many, but not all, classes of prediction functions can be viewed as linear predictors in some modified space. A notable exception of interest arises in the use of deep neural networks; see §3.
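The "quadratic predictor equals a linear predictor in a lifted space" observation can be checked numerically. The sketch below uses hypothetical values for the quadratic coefficients; the only substantive point is that the cross term appears twice in $x^T H x$, so it carries a factor of 2 in the lifted weight vector.

```python
import numpy as np

def svec_features(x):
    """Map x in R^2 to the lifted feature vector (svec(x x^T), x, 1).
    The upper triangle of x x^T contributes x1^2, x1*x2, x2^2."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2, x1, x2, 1.0])

# A quadratic predictor p(w, x) = x^T H x + a^T x + beta (hypothetical values)
H = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric matrix
a = np.array([1.0, -1.0])
beta = 3.0

# ... equals a *linear* predictor w^T phi(x) in the lifted space.
# The cross term x1*x2 appears twice in x^T H x (via H[0,1] and H[1,0]).
w = np.array([H[0, 0], 2.0 * H[0, 1], H[1, 1], a[0], a[1], beta])

x = np.array([1.5, -2.0])
quadratic_value = x @ H @ x + a @ x + beta
linear_value = w @ svec_features(x)
print(np.isclose(quadratic_value, linear_value))  # True
```

The same trick underlies polynomial feature maps in general: nonlinearity in $x$ is compatible with linearity, and hence convexity, in $w$.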
Our learning problem can now be cast as trying to determine the parameters $w$ solving

$$\max_{w \in \mathcal{W}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbb{1}[p(w, x) \equiv y]\big]. \tag{2}$$

As a next step toward tractability, let us now discuss possible meanings for the expression $p(w, x) \equiv y$. Typically, the meaning depends on the type of labels involved. For example:
- In binary classification ("yes/no" labeling) with $\mathcal{Y} = \{-1, 1\}$, the expression can represent $p(w, x)\, y > 0$. In this manner, we say that the predicted value identifies the correct output as long as the two values have the same sign.
- In regression with $\mathcal{Y} = \mathbb{R}$, the expression might represent the fact that $|p(w, x) - y| \le \epsilon$, where $\epsilon \ge 0$ is some prescribed accuracy threshold.
- In multi-class classification with $\mathcal{Y} = \{1, \dots, m\}$ such that $p(w, x) \in \mathbb{R}^m$, the expression might represent that $y$ is the index $i$ such that $[p(w, x)]_i = \max_j [p(w, x)]_j$. In this manner, one can view the $i$th element of $p(w, x)$ as some predicted probability that the true label of $x$ is $i$, and the label that will be predicted is the one with the largest predicted probability.
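The three conventions for "$p(w,x)$ identifies $y$" can be written as small predicates, one per label type. This is a minimal sketch with hypothetical inputs.

```python
import numpy as np

def correct_binary(pred, y):
    """Binary classification, y in {-1, +1}: correct if signs agree."""
    return pred * y > 0

def correct_regression(pred, y, eps=0.1):
    """Regression: correct if the prediction is within a threshold eps."""
    return abs(pred - y) <= eps

def correct_multiclass(pred_probs, y_index):
    """Multi-class: correct if the largest predicted probability
    sits at the index of the true class."""
    return int(np.argmax(pred_probs)) == y_index

print(correct_binary(-0.3, -1))                 # True: same sign
print(correct_regression(2.05, 2.0))            # True: within eps
print(correct_multiclass([0.2, 0.7, 0.1], 1))   # True: argmax is index 1
```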
Let us continue our development toward a tractable problem by taking binary classification as a particular example. If we use $p(w, x)\, y > 0$ to represent $p(w, x) \equiv y$, then (2) becomes

$$\max_{w \in \mathcal{W}} \; \mathbb{P}\big[p(w, x)\, y > 0\big]. \tag{3}$$

This objective is very easy to interpret; it is the probability that the value $p(w, x)$ correctly predicts the sign of $y$. The problem can be rewritten as

$$\min_{w \in \mathcal{W}} \; \mathbb{E}\big[\mathbb{1}[p(w, x)\, y \le 0]\big]. \tag{4}$$

The indicator $\mathbb{1}[p(w, x)\, y \le 0]$ is known as the 0-1 loss function. It counts a unit loss if $p(w, x)$ incorrectly predicts the sign of $y$, and no loss otherwise. This makes sense to do, but there are some drawbacks to using this loss function. For one thing, it is discontinuous, making the goal of optimizing over $w$ potentially difficult. In addition, and perhaps more importantly, this loss function does not quantify the magnitude of the error; e.g., for a given $(x, y)$, small perturbations in the data can cause large perturbations in the loss, even though other (large) perturbations might not affect the loss at all. This can lead to instability. To overcome these issues, one can replace the 0-1 loss by a similar, yet continuous (and perhaps smooth) surrogate loss function. One way in which this can be done is through the idea of logistic regression, which can be described as follows. First, imagine that the label is not deterministic. Instead, given an input $x \in \mathcal{X}$, let $Y(x)$ represent the random variable corresponding to the correct label for $x$. The goal is to choose a parameter vector $w$ such that $\mathbb{P}[Y(x) = y]$ is large for the pairs $(x, y)$ that one observes. Since $p(w, x) \in \mathbb{R}$ while $\mathbb{P}[Y(x) = 1] \in [0, 1]$, in order to make such a connection, we can create a relationship between these quantities by equating

$$\log\left(\frac{\mathbb{P}[Y(x) = 1]}{\mathbb{P}[Y(x) = -1]}\right) = p(w, x). \tag{5}$$

(The left-hand side of this equation is known as the logit of $\mathbb{P}[Y(x) = 1]$.) This implies, after some simple algebraic manipulations, that

$$\mathbb{P}[Y(x) = y] = \frac{1}{1 + \exp(-y\, p(w, x))}, \tag{6}$$

and leads to the expected-loss minimization problem

$$\min_{w \in \mathcal{W}} \; \mathbb{E}\big[\ell(p(w, x), y)\big], \tag{7}$$

where (by taking the negative logarithm of (6)) we use the logistic loss function

$$\ell(p(w, x), y) = \log\big(1 + \exp(-y\, p(w, x))\big).$$
When $p(\cdot, x)$ is linear in $w$, this loss function is convex in $w$, and it is, in fact, one of the most common loss functions used in learning to train linear predictors. Other commonly used convex loss functions are the hinge loss for classification, namely, $\max\{0, 1 - y\, p(w, x)\}$, and the least-squares loss for regression, namely, $(p(w, x) - y)^2$. In the case of linear predictors, these two loss functions typically give rise to convex quadratic optimization problems, possibly very large-scale ones, for which specialized optimization algorithms have been designed. (In this tutorial, we do not discuss such methods in detail since they are too specialized for our general setting.)
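The losses just named are easiest to compare as functions of the margin $m = y\, p(w, x)$. A minimal sketch:

```python
import numpy as np

# Margin-based losses for binary classification with y in {-1, +1},
# written as functions of the margin m = y * p(w, x).
def zero_one_loss(m):
    return np.where(m <= 0, 1.0, 0.0)     # discontinuous at m = 0

def logistic_loss(m):
    return np.log(1.0 + np.exp(-m))       # smooth surrogate

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)       # convex, nonsmooth at m = 1

def squared_loss(pred, y):
    return (pred - y) ** 2                # for regression

margins = np.array([-2.0, -0.1, 0.1, 2.0])
print(zero_one_loss(margins))   # [1. 1. 0. 0.]
print(logistic_loss(margins))   # decreases smoothly as the margin grows
print(hinge_loss(margins))      # [3.  1.1 0.9 0. ]
```

Unlike the 0-1 loss, the logistic and hinge losses penalize confident mistakes (very negative margins) more heavily than borderline ones, which is exactly the magnitude information the 0-1 loss discards.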
Problem (7) can be approached using stochastic optimization methods, such as stochastic approximation [78, 59] and sample average approximation [72, 64]. For guarantees in terms of solving (7), the theory for such methods necessarily assumes that one can sample indefinitely from $\mathcal{X} \times \mathcal{Y}$ according to the distribution $\mathcal{D}$. However, in many settings in machine learning, this is not possible, so the last issue that we must confront is the fact that the objective of (7) and its derivatives cannot be computed since $\mathcal{D}$ is unknown. Instead, a surrogate problem can be considered. For example, in supervised learning, one assumes that there exists a set of input-output pairs $\{(x_i, y_i)\}_{i=1}^n$ (known as samples or examples) that is available a priori. With this set, one can approximately solve (7) by solving a related deterministic optimization problem defined over this set of samples, i.e.,

$$\min_{w \in \mathcal{W}} \; \frac{1}{n} \sum_{i=1}^n \ell(p(w, x_i), y_i). \tag{8}$$
Going forward, we will be interested in both problems (7) and (8). We consider the latter to be the problem that one can attempt to solve in practice, though we will also be interested in how solutions obtained from solving (8) relate to optimal solutions of (7). We refer to the objective $R(w) := \mathbb{E}[\ell(p(w, x), y)]$ of (7) as the expected risk (loss) function and to the objective $R_n(w) := \frac{1}{n}\sum_{i=1}^n \ell(p(w, x_i), y_i)$ of (8) as the empirical risk (loss) function.
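Computing the empirical risk is a one-line average over the samples. The sketch below uses a tiny hypothetical sample set and a linear predictor $p(w, x) = w^T x$.

```python
import numpy as np

def logistic(pred, y):
    """Logistic loss for a single prediction/label pair."""
    return np.log(1.0 + np.exp(-y * pred))

def empirical_risk(w, X, y):
    """Empirical risk R_n(w): average loss over the n samples,
    using the linear predictor p(w, x) = w^T x."""
    preds = X @ w
    return float(np.mean(logistic(preds, y)))

# Tiny hypothetical sample set: rows of X are inputs, y holds labels in {-1,+1}.
X = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -1.5]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.1])

print(empirical_risk(w, X, y))  # an estimate of the (unknown) expected risk
```

Note that at $w = 0$ every prediction is 0 and each sample contributes $\log 2$, a useful sanity check when debugging training code.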
1.3 Learning bounds, overfitting, and regularization
Let us now discuss relationships between problems (7) and (8) and their optimal solutions. To start, suppose that these problems are convex, as is the case when $\ell$ is the logistic loss and $p$ is linear in $w$. Let $w^*$ and $w_n$, respectively, denote the expected and empirical risk minimizers. Given that one aims to minimize $R_n$ in hopes of approximately minimizing $R$, a variety of important questions arise about the potential differences between $w^*$ and $w_n$, or between the optimal values of (7) and (8). For example, one might be interested in a bound for the decomposition

$$R(w_n) - R(w^*) = \big[R(w_n) - R_n(w_n)\big] + \big[R_n(w_n) - R(w^*)\big]. \tag{9}$$

Bounding the first term on the right-hand side relates to asking: While $w_n$ is designed to behave well on the sample set, can we be sure that it behaves well in expectation? Bounding the second term relates to asking whether we can be sure that the empirical loss corresponding to $w_n$ is not far from the expected loss of the optimal vector $w^*$ with respect to $\mathcal{D}$.
Different bounds can be derived for these quantities in various problem-specific cases. Here, let us state a generic bound, which says that, with probability at least $1 - \delta$,

$$\sup_{w \in \mathcal{W}} \; \big|R(w) - R_n(w)\big| \;\le\; O\!\left(\sqrt{\frac{\mathcal{C} + \log(1/\delta)}{n}}\right), \tag{10}$$

where $\mathcal{C}$ is a scalar representing the complexity of the class of prediction functions $\mathcal{P} = \{p(w, \cdot) : w \in \mathcal{W}\}$. A similar bound can be derived on the second term in (9). From (10), a few intuitive notions are confirmed. First, the discrepancy between empirical and expected risk is inversely proportional to $\sqrt{n}$, meaning that more data leads the empirical risk to better approximate the expected risk. Second, the discrepancy depends on the complexity $\mathcal{C}$ of the class of prediction functions.
It is beyond the scope of this tutorial to show how one might derive $\mathcal{C}$ in general. Instead, for the sake of intuition, let us briefly introduce one such measure for binary classification: the Vapnik-Chervonenkis (VC) dimension. Briefly, the VC-dimension of a class of predictors $\{p(w, \cdot) : w \in \mathcal{W}\}$ with $p(w, \cdot) : \mathcal{X} \to \{-1, 1\}$ is the largest number $C$ for which there exists a set of $C$ points $\{x_1, \dots, x_C\}$ such that, for any label set $\{y_1, \dots, y_C\} \subseteq \{-1, 1\}$, there exists a predictor in the class that predicts all labels without error, i.e., $p(w, x_i) = y_i$ for all $i \in \{1, \dots, C\}$. For example, for the class of linear predictors over $\mathbb{R}^2$, the VC-dimension is 3. To see this, in Figure 2 we illustrate a set of 3 points in $\mathbb{R}^2$ and a linear predictor for each labeling of these points, where in each case one finds a predictor that predicts all labels without error. On the other hand, it is easy to show that for any set of 4 distinct points in $\mathbb{R}^2$, there exists at least one labeling of the points that cannot be perfectly classified by a linear function. Hence, the VC-dimension of linear predictors over $\mathbb{R}^2$ is 3.
For linear predictors, the VC-dimension is equal to the number of parameters that define the prediction function. However, this is not true for all predictor classes. In such cases, instead of using a precise measure of complexity in (10), a bound can be used. For example, if the input space is bounded by an $\ell_2$-norm ball of radius $R_x$ and the class of predictors is such that $w$ is constrained to a ball of radius $R_w$, then, given smoothness of the loss function $\ell$, a bound on $\mathcal{C}$ can be derived in terms of the product $R_x R_w$. For more information on complexity measures of classes of predictors, we refer the reader to [4, 31, 85].
For our purposes going forward, it is important to observe that as $\mathcal{C}$ gets larger, a larger sample set is required to keep the right-hand side of (10) small. This means that in order for the optimal value of problem (8) to represent the actual expected risk corresponding to its solution, there needs to be a balance between the complexity of the predictor class and the number of sample points. If optimization is performed over a complex class of predictors using a sample set that is not sufficiently large, then the optimal solution $w_n$ may achieve a small empirical loss $R_n(w_n)$, but have a large expected loss $R(w_n)$. Such a solution will have poor predictive properties because it overfits the training data. In short, to learn "effectively", one needs to balance the complexity of the predictors and the size of the data set.
One way to control the complexity of a class is simply to control the norm $\|w\|_2$, since a smaller $\|w\|_2$ results in a smaller $R_w$ and hence a smaller bound on an appropriate complexity measure $\mathcal{C}$. Alternatively, particularly for linear predictors, one can control complexity by bounding the number of nonzeros in $w$, thereby constraining the class to sparse predictors. This idea is of particular interest when each data vector has many features, yet the prediction function is believed to depend only on a small (unknown) subset of these features. Once the subset of features is selected, the VC-dimension of the class reduces to the number of nonzeros in $w$. Hence, if a sparse predictor achieves a small empirical error, then it is likely to achieve a small expected error.
Instead of explicitly constraining the number of nonzeros in $w$, which would make the optimization problem harder, a regularization term such as $\|w\|_1$ can be added to the objective function. The addition of this term encourages feature selection. It does not always guarantee that a sparse solution will be reached, but it has been shown to work well in practice. Generically, to attempt to restrict the complexity of the prediction function class in some manner, one often considers a regularized optimization problem of the form

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n \ell(p(w, x_i), y_i) + \lambda r(w), \tag{11}$$

where $\lambda > 0$ is a weighting parameter, and $r$ is either $\|\cdot\|_1$ or some other convex (potentially nonsmooth) regularization function. How the parameter $\lambda$ should be chosen is a subject of work in structural risk minimization. However, the details of this are beyond the scope of this tutorial. For our purposes, suffice it to say that for any given value of $\lambda$, one has an optimization problem to solve (at least approximately), so now let us turn to optimization algorithms that may serve well in the context of problem (11).
2 Methods for Solving the Logistic Regression Problem
The methods that we discuss in this section for solving problem (11) could be employed whenever the loss $\ell$ and the regularizer $r$ are convex functions with respect to $w$. There is a large variety of machine learning models that fall into this category, including support vector machines, Lasso, sparse inverse covariance selection, and others; further details on these models can be found in the literature. Here, in order to be more concrete at various times, we will refer specifically to the case of regularized logistic regression for binary classification. To simplify our notation in this case, let us assume without loss of generality that $p(w, x) = w^T x$. (That is, we omit the bias term $\beta$, which can be done by augmenting the input vector by an extra feature that is always equal to 1.) Denoting the dimension of both $x_i$ and $w$ as $d$, this leads to the specific convex problem

$$\min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{n} \sum_{i=1}^n \log\big(1 + \exp(-y_i\, w^T x_i)\big) + \lambda r(w). \tag{12}$$
It is worthwhile to note that the regularization term is necessary in this problem. To see why this is the case, consider a parameter vector $\bar w$ such that $y_i\, \bar w^T x_i > 0$ for all $i \in \{1, \dots, n\}$, i.e., a $\bar w$ that classifies every sample correctly. Then, consider the unbounded ray $\{\tau \bar w : \tau > 0\}$. It is easy to see that, in this case, each loss term $\log(1 + \exp(-\tau y_i\, \bar w^T x_i))$ decreases toward 0 as $\tau \to \infty$, meaning that the infimum of the unregularized objective cannot be achieved. On the other hand, by adding a (coercive) regularization function $r$, it is guaranteed that problem (12) will have an optimal solution.
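This unbounded-ray phenomenon is easy to observe numerically. Below is a minimal sketch on hypothetical separable data: the unregularized loss along $\tau \bar w$ keeps decreasing but never reaches its infimum of 0.

```python
import numpy as np

# Separable toy data: w_bar = (1, 0) classifies every point correctly,
# i.e., y_i * w_bar^T x_i > 0 for all i.
X = np.array([[1.0, 0.3], [2.0, -0.5], [-1.5, 0.2]])
y = np.array([1.0, 1.0, -1.0])
w_bar = np.array([1.0, 0.0])

def unregularized_loss(w):
    """Average logistic loss with no regularization term."""
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))))

# Along the ray tau * w_bar the loss decays toward 0 but is never attained,
# so without regularization there is no minimizer.
for tau in [1.0, 5.0, 20.0]:
    print(tau, unregularized_loss(tau * w_bar))
```

Adding $\frac{\lambda}{2}\|w\|_2^2$ to this objective makes it coercive, which rules out such escape to infinity.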
For the regularization function $r$, we will refer to the common choices $r(w) = \frac{1}{2}\|w\|_2^2$ and $r(w) = \|w\|_1$. For simplicity, we will refer mostly to the former choice, which makes the objective of (12) a continuously differentiable function. By contrast, $r(w) = \|w\|_1$ results in a nonsmooth problem, whose minimization requires more sophisticated algorithms.
2.1 First-order methods
We begin by discussing first-order methods for solving (12). Here, "first-order" refers to the fact that these techniques only require first-order derivatives of the terms defining $F$.
2.1.1 Gradient descent
Conceptually, the most straightforward method for minimizing a smooth convex objective is gradient descent. In this approach, one starts with an initial solution estimate $w_0$ and iteratively updates the estimate via the formula

$$w_{k+1} \leftarrow w_k - \alpha_k \nabla F(w_k), \tag{13}$$

where $\alpha_k > 0$ is a stepsize parameter. The performance of the algorithm is inherently tied to the choice of the stepsize sequence $\{\alpha_k\}$. In the optimization research community, it is well known that employing a line search in each iteration to determine $\alpha_k$ can lead to a well-performing algorithm for a variety of types of problems. However, for ML applications, such operations are expensive due to the fact that each computation of $F(w)$ requires a pass over the entire dataset, which can be prohibitively expensive if $n$ is large.
Fortunately, theoretical convergence guarantees for the algorithm can still be proved when $\alpha_k$ is set to a positive constant $\bar\alpha$ for all $k$, as long as this fixed value is sufficiently small. (When the stepsize is fixed, it is known in the ML community as the learning rate for the algorithm. Some also use this term to refer to each $\alpha_k$ or to the entire sequence $\{\alpha_k\}$, even when it is not constant.) The convergence rate depends on whether $F$ is strongly convex or merely convex. If $F$ is $c$-strongly convex, the gradient function $\nabla F$ is Lipschitz continuous with Lipschitz constant $L$, and $\bar\alpha \in (0, 1/L]$, then it can be shown that the number of iterations required until $F(w_k) - F(w^*) \le \epsilon$ is at most $O(\kappa \log(1/\epsilon))$, where $w^*$ denotes the minimizer of $F$ and $\kappa := L/c$. This is a linear rate of convergence. (If $r = \frac{1}{2}\|\cdot\|_2^2$ with $\lambda > 0$, then these conditions hold for problem (12); in particular, $c \ge \lambda$ and $L \le \lambda + \frac{1}{4n}\sum_{i=1}^n \|x_i\|_2^2$.) On the other hand, if $F$ is merely convex (as would be the case in (12) when $\lambda = 0$), then such an $\epsilon$-optimal solution is obtained within at most $O(1/\epsilon)$ iterations, which is a sublinear rate. One can actually improve these rates by employing an acceleration technique by Nesterov. Using such an approach allows one to attain $\epsilon$-optimality within $O(\sqrt{\kappa}\,\log(1/\epsilon))$ iterations when $F$ is strongly convex and within $O(1/\sqrt{\epsilon})$ iterations when it is merely convex.
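The fixed-stepsize scheme is short enough to state in full. Below is a minimal sketch of gradient descent on (12) with $r = \frac{1}{2}\|\cdot\|_2^2$, using hypothetical data; with $\bar\alpha \le 1/L$ the objective decreases monotonically.

```python
import numpy as np

# Hypothetical sample set for l2-regularized logistic regression.
X = np.array([[1.0, 2.0], [-1.0, 1.0], [0.5, -1.5], [2.0, 0.3]])
y = np.array([1.0, -1.0, 1.0, 1.0])
lam = 0.1

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def gradient(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))   # per-sample scalar factors
    return X.T @ s / len(y) + lam * w

w = np.zeros(2)
alpha = 0.5          # fixed stepsize (learning rate), chosen below 1/L here
values = [objective(w)]
for _ in range(200):
    w = w - alpha * gradient(w)
    values.append(objective(w))

print(values[0], values[-1])   # monotone decrease toward the minimum
```

For this data, $L \le \lambda + \frac{1}{4n}\sum_i \|x_i\|_2^2 \approx 0.95$, so $\bar\alpha = 0.5$ is safely below $1/L$.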
Extensions of gradient descent and accelerated gradient descent for solving $\ell_1$-norm regularized logistic regression problems (i.e., (12) when $r = \|\cdot\|_1$) are known as ISTA and FISTA, respectively. Observe that, in this setting, the objective function may not be strongly convex even if $\lambda > 0$. ISTA and FISTA have the same sublinear convergence rates as their smooth counterparts when the objective function is convex.
The important aspect of gradient descent in the context of ML is to recognize the computational cost of computing the gradient of $F$ in each iteration. In particular, in the context of ML, the cost of a single gradient computation is typically $O(nd)$; this can be seen, e.g., in the case of problem (12) with $r = \frac{1}{2}\|\cdot\|_2^2$, where the gradient of $F$ for a given $w$ is

$$\nabla F(w) = \frac{1}{n} \sum_{i=1}^n \frac{-y_i\, x_i}{1 + \exp(y_i\, w^T x_i)} + \lambda w. \tag{14}$$

(Here, one finds a sum of $n$ terms, where for each term one needs to compute an inner product of $d$-dimensional vectors, hence the $O(nd)$ cost.) The dependence of this computation on $n$ (which can be extremely large in various ML applications) is computationally prohibitive, and can be viewed as potentially wasteful when many of the elements of the sample set are the same or at least very similar. In the next subsection, we discuss a stochastic optimization method whose per-iteration computational cost is drastically smaller, and yet can still offer convergence guarantees.
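A standard way to validate a hand-derived gradient such as (14) is to compare it against central finite differences. The sketch below does so on randomly generated (hypothetical) data.

```python
import numpy as np

# Check the O(n d) gradient formula against central finite differences.
rng = np.random.default_rng(0)
n, d, lam = 50, 4, 0.1
X = rng.normal(size=(n, d))
y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def gradient(w):
    # One pass over the data: n inner products of d-dimensional vectors.
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / n + lam * w

w = rng.normal(size=d)
g = gradient(w)
g_fd = np.zeros(d)
h = 1e-6
for j in range(d):
    e = np.zeros(d)
    e[j] = h
    g_fd[j] = (objective(w + e) - objective(w - e)) / (2 * h)

print(np.max(np.abs(g - g_fd)))   # tiny: only finite-difference error remains
```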
2.1.2 Stochastic gradient method
The stochastic gradient method, well known in the OR community due to its use for minimizing stochastic objective functions, is the hallmark optimization algorithm in the ML community. Originally proposed by Robbins and Monro in the context of solving stochastic systems of equations, the method is notable in that it can be employed to minimize a stochastic objective with nice convergence guarantees while its per-iteration cost is only $O(d)$, as opposed to the $O(nd)$ cost of gradient descent.
In each iteration, the stochastic gradient method computes an unbiased estimator $g_k$ of the true gradient $\nabla F(w_k)$. This estimator can be computed at very low cost; e.g., for (12) with $r = \frac{1}{2}\|\cdot\|_2^2$, a stochastic gradient can be computed as

$$g_k = \frac{1}{|\mathcal{S}_k|} \sum_{i \in \mathcal{S}_k} \frac{-y_i\, x_i}{1 + \exp(y_i\, w_k^T x_i)} + \lambda w_k, \tag{15}$$

where the set $\mathcal{S}_k \subseteq \{1, \dots, n\}$, known as the mini-batch, has elements chosen uniformly at random from $\{1, \dots, n\}$. The step is then taken similarly to gradient descent:

$$w_{k+1} \leftarrow w_k - \alpha_k g_k. \tag{16}$$
Absolutely critical to the algorithm is the choice of the stepsize sequence $\{\alpha_k\}$. Unlike gradient descent, a fixed stepsize (i.e., learning rate) does not guarantee convergence of the algorithm to the minimizer of a strongly convex $F$, but rather only guarantees convergence to a neighborhood of the minimizer. However, Robbins and Monro showed that with

$$\sum_{k=0}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty, \tag{17}$$

one can guarantee a sublinear rate of convergence (almost surely) to the minimizer $w^*$. Note that the stochastic gradient method is not, strictly speaking, a descent method, since $-g_k$ is not guaranteed to be a descent direction for $F$ at $w_k$ and the objective function does not decrease monotonically over the course of the optimization process. However, we will refer to it as SGD, for stochastic gradient descent, as is often done in the literature.
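A minimal mini-batch SGD sketch, with a decaying stepsize sequence satisfying the conditions in (17), is shown below on hypothetical data (the schedule $\alpha_k = 0.5/(1 + 0.01k)$ has a divergent sum and a convergent sum of squares).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)            # hypothetical "ground truth"
y = np.where(X @ w_true > 0, 1.0, -1.0)

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def stochastic_gradient(w, batch):
    """Unbiased estimate of the full gradient from a mini-batch."""
    Xb, yb = X[batch], y[batch]
    s = -yb / (1.0 + np.exp(yb * (Xb @ w)))
    return Xb.T @ s / len(batch) + lam * w

w = np.zeros(d)
batch_size = 10
for k in range(3000):
    batch = rng.integers(0, n, size=batch_size)  # sampled uniformly at random
    alpha_k = 0.5 / (1.0 + 0.01 * k)             # Robbins-Monro-style decay
    w = w - alpha_k * stochastic_gradient(w, batch)

print(objective(np.zeros(d)), objective(w))      # the loss has been reduced
```

Each iteration touches only `batch_size` samples, so the per-iteration cost is independent of $n$.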
The convergence rate of SGD is slower than that of gradient descent. In particular, when $F$ is strongly convex, it only guarantees that an expected $\epsilon$-accurate solution (i.e., one with $\mathbb{E}[F(w_k)] - F(w^*) \le \epsilon$) is obtained once $k \ge O(1/\epsilon)$, and when $F$ is merely convex, it only guarantees that such a solution is found once $k \ge O(1/\epsilon^2)$. On the other hand, as previously mentioned, if the size of $\mathcal{S}_k$ is bounded by a constant (independent of $n$ and $k$), then the per-iteration cost of SGD is $n$ times smaller than that of gradient descent.
This trade-off between convergence rate and per-iteration cost can be analyzed in the context of ML, with strong results in favor of SGD. To outline the ideas behind this analysis, let us ignore regularization terms and recall that our ultimate goal is to solve (7) with some accuracy $\epsilon$. This means that we want to obtain $w$ such that $R(w) - R(w^*) \le \epsilon$, where $w^*$ is the optimal solution of (7). Instead, however, we solve problem (8) to some accuracy $\epsilon_{\mathrm{opt}}$, obtaining $\tilde w_n$ such that $R_n(\tilde w_n) - R_n(w_n) \le \epsilon_{\mathrm{opt}}$, where $w_n$ is the optimal solution of (8). If we use $\tilde w_n$ as our approximate solution to (7), then, from (9) and (10) and since $R_n(\tilde w_n) \le R_n(w_n) + \epsilon_{\mathrm{opt}}$, we have with probability at least $1 - \delta$ that

$$R(\tilde w_n) - R(w^*) \;\le\; O\!\left(\sqrt{\frac{\mathcal{C} + \log(1/\delta)}{n}}\right) + \epsilon_{\mathrm{opt}}, \quad \text{while} \quad R_n(\tilde w_n) - R_n(w_n) \;\le\; \epsilon_{\mathrm{opt}}. \tag{18}$$

We can now compare algorithms by quantifying the computational costs they require to satisfy these bounds. For example, suppose that we apply some algorithm to solve problem (8), where, for a given $\epsilon_{\mathrm{opt}}$, the cost of obtaining $\tilde w_n$ satisfying the latter bound in (18) is $\mathcal{T}(n, \epsilon_{\mathrm{opt}})$, which increases with both $n$ and $1/\epsilon_{\mathrm{opt}}$. For a fixed family of prediction functions with complexity $\mathcal{C}$, reducing the first term in the former bound in (18) to $O(\epsilon)$ requires $n \ge O(\mathcal{C}/\epsilon^2)$ samples (ignoring log factors). Considering now the optimization algorithm and its cost $\mathcal{T}(n, \epsilon_{\mathrm{opt}})$, it is clear that any algorithm that computes $R_n$, its gradient, or its Hessian at any point has a cost of at least $O(n)$ to perform a single iteration, regardless of the rate at which it converges to a minimizer of $R_n$. This is the case, e.g., for the gradient descent method. SGD, on the other hand, has a per-iteration cost that is independent of $n$ and can be shown to converge to an $\epsilon$-optimal solution of (7) within at most $O(1/\epsilon^2)$ iterations (when the objective function is convex and one can sample indefinitely from the input-output space according to the distribution $\mathcal{D}$). From this discussion, we can conclude that, at least in theory, SGD is a superior algorithm for large-scale (i.e., large $n$) ML applications.
In practice, however, standard SGD is not necessarily the most effective approach for solving optimization problems in ML. Indeed, there is a great deal of active research in the ML and optimization communities on developing improvements and/or alternatives to SGD. In the subsequent two sections, we discuss two categories of such methods: variance reducing methods and second-order methods. However, there are a variety of approaches even beyond these two categories. For example, one extension of SGD that has often been found to perform better in practice is SGD with momentum; see Algorithm 1. This algorithm is a stochastic variant of the classical momentum method of Polyak, which enjoys an improved convergence rate compared to standard gradient descent. SGD with momentum has been shown to be quite successful in machine learning, especially deep learning, our topic in §3. It has also been shown that Nesterov's accelerated gradient method can be cast as a classical momentum approach. That said, to the best of our knowledge, this stochastic version of the momentum method does not have convergence guarantees, though some results analyzing stochastic variants of acceleration methods can be found in the literature.
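A minimal sketch of SGD with (heavy-ball) momentum follows, in the spirit of Algorithm 1: the step direction is a geometrically decaying average of past stochastic gradients. The data, learning rate, and momentum constant are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) > 0, 1.0, -1.0)

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def stochastic_gradient(w, batch):
    Xb, yb = X[batch], y[batch]
    s = -yb / (1.0 + np.exp(yb * (Xb @ w)))
    return Xb.T @ s / len(batch) + lam * w

w = np.zeros(d)
v = np.zeros(d)            # momentum ("velocity") term
alpha, beta = 0.05, 0.9    # learning rate and momentum constant
for k in range(2000):
    batch = rng.integers(0, n, size=10)
    v = beta * v + stochastic_gradient(w, batch)   # accumulate past gradients
    w = w - alpha * v

print(objective(np.zeros(d)), objective(w))
```

Note the effective stepsize is roughly $\alpha/(1-\beta)$, which is why $\alpha$ is chosen smaller than in plain SGD.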
As a final remark on SGD, we note that aside from its slow convergence rate (in theory and practice), SGD is very sensitive to parameter choices, such as the mini-batch size and learning rate. The best choice of these parameters heavily depends on the dataset, and the wrong choice can severely slow progress or cause the algorithm to stall completely.
2.1.3 Variance reducing methods
The arguments for SGD that we raised in the previous section rely on the assumptions that the true aim is to solve problem (7) and that one can sample indefinitely from the space of inputs and outputs. However, in ML applications, the sample set is often given and fixed, and if $n$ is sufficiently large, then there is good reason to view a discrete uniform distribution over the sample set as a good approximation to the true distribution $\mathcal{D}$. Thus, one can argue that the conclusions of the previous section should only be used as a theoretical guideline, reminding us that while (7) might be the true problem of interest, one can reasonably be interested in the most efficient methods for solving (8) or, very often, the regularized problem (11).
Considering problem (11), one finds that SGD can be improved upon by exploiting the structure of the objective as a finite sum of $n$ functions plus a simple convex term. Several methods have been developed along these lines, such as SAG, SAGA, SDCA, and SVRG. SAG and SAGA, for example, rely on averaging past stochastic gradients in a particular manner in an attempt to accumulate more accurate gradient estimates as the optimization algorithm proceeds. As a result, they enjoy the same convergence rate as full gradient methods (with better constant factors). However, these methods require the storage of $n$ past stochastic gradients so that individual components can be updated as the method progresses. In contrast, SVRG does not require such storage, though it does require computing the full gradient once every $m$ iterations. For reference, we state SVRG as Algorithm 2. The algorithm performs one full gradient computation at each outer iteration, then takes $m$ steps along random directions that are stochastic corrections of this full gradient. The inner loop size $m$ must satisfy certain conditions to ensure convergence.
The name SVRG, which stands for stochastic variance-reduced gradient, comes from the fact that the algorithm can be viewed as a variance-reduced variant of SGD (specifically for finite-sum minimization). To see this, write the empirical objective as $F(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$ and let $\tilde w$ denote the current outer (snapshot) iterate. First observe that the random directions taken by the algorithm,

$$g_k := \nabla f_{i_k}(w_k) - \nabla f_{i_k}(\tilde w) + \nabla F(\tilde w), \quad i_k \text{ drawn uniformly from } \{1, \dots, n\},$$

are in fact unbiased gradient estimates: $\mathbb{E}_{i_k}[g_k] = \nabla F(w_k)$. Second, notice that the computation of the full gradient at each outer iteration helps reduce the variance of the stochastic gradient estimates used in SGD; indeed, if $w_k$ is "close" to $\tilde w$, then one should expect $g_k$ to be "closer" to $\nabla F(w_k)$ than $\nabla f_{i_k}(w_k)$ is alone. In practice, SVRG is somewhat more robust than SGD to the choice of the learning rate $\alpha$, though it can still be quite sensitive to the choice of the inner loop size $m$.
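A compact SVRG sketch (cf. Algorithm 2) for the $\ell_2$-regularized logistic regression objective follows; data and parameter choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 100, 4, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) > 0, 1.0, -1.0)

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def grad_i(w, i):
    """Gradient of the i-th component function f_i."""
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
    return s * X[i] + lam * w

def full_grad(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / n + lam * w

w_tilde = np.zeros(d)          # outer (snapshot) iterate
alpha, m = 0.1, 2 * n          # inner stepsize and inner loop length
for outer in range(30):
    mu = full_grad(w_tilde)    # one full gradient per outer iteration
    w = w_tilde.copy()
    for k in range(m):
        i = int(rng.integers(0, n))
        # Stochastic correction of the full gradient; unbiased for grad F(w).
        g = grad_i(w, i) - grad_i(w_tilde, i) + mu
        w = w - alpha * g
    w_tilde = w

print(objective(np.zeros(d)), objective(w_tilde))
```

Only one gradient table entry is ever needed at a time, which is the storage advantage over SAG/SAGA mentioned above.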
A newer method that combines ideas from SVRG and SAGA, called SARAH, differs from SVRG only in terms of the inner loop step, which in SARAH is given by the recursion

$$g_k := \nabla f_{i_k}(w_k) - \nabla f_{i_k}(w_{k-1}) + g_{k-1}.$$

This change causes $\mathbb{E}[g_k] \neq \nabla F(w_k)$, so that the steps in SARAH are not based on unbiased gradient estimates. However, it attains improved convergence properties relative to SVRG.
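The recursive correction is the only change relative to the SVRG sketch above; everything else (snapshot, full gradient at the start of each outer loop) is unchanged. A hypothetical-data sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 100, 4, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) > 0, 1.0, -1.0)

def objective(w):
    return float(np.mean(np.log(1.0 + np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w)

def grad_i(w, i):
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
    return s * X[i] + lam * w

def full_grad(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / n + lam * w

w_prev = np.zeros(d)
alpha, m = 0.1, 2 * n
for outer in range(30):
    g = full_grad(w_prev)                 # g_0: one full gradient
    w = w_prev - alpha * g                # first inner step
    for k in range(1, m):
        i = int(rng.integers(0, n))
        # SARAH recursion: correct the *previous direction*, not a snapshot.
        g = grad_i(w, i) - grad_i(w_prev, i) + g
        w_prev, w = w, w - alpha * g
    w_prev = w                            # next outer loop restarts from here

print(objective(np.zeros(d)), objective(w))
```

Because the correction is applied to $g_{k-1}$ rather than to a fixed snapshot gradient, the bias accumulates, but so does the variance reduction along the trajectory.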
In Tables 2 and 3, we summarize the complexity properties of several popular first-order methods when applied to minimize strongly convex (Table 2) and merely convex (Table 3) problems. Here, complexity refers to an upper bound on the number of iterations that may be required to attain an $\epsilon$-accurate solution. In the former case, recall the constant $\kappa := L/c$ defined in §2.1.1, often referred to as the condition number of a strongly convex problem.
Yet another branch of variance reduction methods comprises those that employ the standard SGD step formula (16), but attempt to achieve a faster convergence rate by increasing the mini-batch size during the optimization process. If done carefully (and if one can sample indefinitely from the input-output space), then such an approach can achieve a linear rate of convergence for minimizing the expected risk (7); e.g., see . Similar ideas have also been explored for the case of minimizing finite sums; e.g., see .
2.2 Second-order and quasi-Newton methods
Motivated by decades of work in the deterministic optimization research community, one of the most active research areas in optimization for ML concerns how one might use second-order derivative (i.e., curvature) information to speed up training. As in the deterministic setting, the idea is to compute the $k$th step by approximately minimizing a quadratic model of the objective function about the current iterate $w_k$ of the form
where the curvature matrix $B_k$ is positive definite. A variety of algorithms of this type exist, each distinguished by how the curvature matrix is obtained and how an (approximate or exact) minimizer of the model is computed. The prototypical example is Newton’s method, in which $B_k$ is the Hessian $\nabla^2 f(w_k)$, assuming this matrix is positive (semi)definite. For example, for the case of the regularized logistic regression problem (12), one finds that this Hessian matrix, like the gradient in (14), comes from the sum of similar terms defined over the dataset: with samples $(x_i, y_i)$ and $\sigma$ denoting the sigmoid function, it takes the form
$$\nabla^2 f(w) = \frac{1}{n} \sum_{i=1}^{n} \sigma(y_i x_i^T w)\big(1 - \sigma(y_i x_i^T w)\big)\, x_i x_i^T + \lambda I.$$
Unfortunately, the computation and storage of a Hessian matrix become prohibitively expensive in ML applications when the number of samples and/or the number of parameters is large. As an alternative, one may consider using stochastic Hessian information by replacing the average in (22) with an average over a small subset of the samples, as is done for computing gradient estimates. In the case of (22), this might appear to be a good idea since the resulting stochastic Hessian is a sum of relatively few rank-one matrices and a scaled identity matrix, meaning that solving a system of equations with such a matrix might not be too expensive. However, such a low-rank approximation might not adequately capture complete curvature information, making the added costs of the algorithm (as compared to those of SGD or one of its variants) not worthwhile. Recently, several variants of sub-sampled Newton methods have been proposed, in which the sample size is increased to improve the accuracy of the Hessian estimates as the algorithm progresses. We give a brief overview of these ideas in §3.4.
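A sketch of one sub-sampled Newton step for $\ell_2$-regularized logistic regression may make this concrete. Here the full gradient is retained while the Hessian (a sum of rank-one terms plus a scaled identity) is estimated on a sample `batch`; the function name and labels in $\{-1,+1\}$ are illustrative assumptions:

```python
import numpy as np

def subsampled_newton_step(w, A, y, lam, batch, alpha=1.0):
    """One sub-sampled Newton step for l2-regularized logistic regression.

    Rows of A are the inputs, y holds labels in {-1, +1}, lam is the
    regularization constant, and batch indexes the Hessian sample.
    """
    d = A.shape[1]
    z = y * (A @ w)
    sig = 1.0 / (1.0 + np.exp(z))  # sigma(-z_i)
    # Full gradient of the regularized logistic loss.
    grad = -(A * (y * sig)[:, None]).mean(axis=0) + lam * w
    # Sub-sampled Hessian: few rank-one terms plus a scaled identity.
    zs, As = z[batch], A[batch]
    dvals = 1.0 / ((1.0 + np.exp(zs)) * (1.0 + np.exp(-zs)))  # sigma'(z_i)
    H = (As * dvals[:, None]).T @ As / len(batch) + lam * np.eye(d)
    return w - alpha * np.linalg.solve(H, grad)
```

With `batch` equal to the full sample this reduces to an exact Newton step; smaller batches trade Hessian accuracy for cheaper linear algebra.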
Another class of algorithms based on models of the form (21) are quasi-Newton methods. Interestingly, these approaches compute no explicit second-order derivatives; instead, they construct Hessian approximation matrices entirely from first-order derivatives by applying low-rank updates in each iteration. For example, let us briefly describe the most popular quasi-Newton algorithm, known as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. In this approach, one first recognizes that the minimizer of (21) is $w_k - B_k^{-1} \nabla f(w_k)$, revealing that it is actually convenient to compute inverse Hessian approximations $H_k \approx B_k^{-1}$. With $H_k$ in hand along with the step $s_k := w_{k+1} - w_k$ and gradient displacement $y_k := \nabla f(w_{k+1}) - \nabla f(w_k)$, one chooses $H_{k+1}$ to minimize its distance to $H_k$ subject to satisfying the secant equation $H_{k+1} y_k = s_k$. Using a carefully selected norm, the solution of this problem can be written explicitly as
$$H_{k+1} = \left(I - \frac{s_k y_k^T}{s_k^T y_k}\right) H_k \left(I - \frac{y_k s_k^T}{s_k^T y_k}\right) + \frac{s_k s_k^T}{s_k^T y_k},$$
where the difference between $H_{k+1}$ and $H_k$ can be shown to be only a rank-two matrix.
For reference, a complete classical BFGS algorithm is stated as Algorithm 3.
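The rank-two inverse update itself is only a few lines of linear algebra; the sketch below applies the standard formula directly:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of the inverse Hessian approximation H.

    s is the step w_{k+1} - w_k and y the gradient displacement.
    Requires s^T y > 0; the result satisfies the secant equation
    H_new @ y == s and remains symmetric positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

One can verify numerically that the updated matrix satisfies the secant equation and preserves symmetry.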
It has been shown (e.g., see ), under certain conditions including strong convexity of the objective function $f$, that the BFGS method yields the Dennis-Moré condition, namely,
$$\lim_{k \to \infty} \frac{\big\|\big(B_k - \nabla^2 f(w_*)\big) s_k\big\|}{\|s_k\|} = 0,$$
which in turn can be used to show that the BFGS method converges locally superlinearly. Critical in this analysis is the condition that $s_k^T y_k > 0$ for all $k$, which guarantees that $H_{k+1}$ in (23) is positive definite as long as $H_k$ is. When the objective is strongly convex, this condition holds automatically, but, for merely convex or potentially nonconvex $f$, it is guaranteed by the Wolfe line search.
In the context of large-scale ML, the classical BFGS method is intractable for a variety of reasons. Of course, there are the issues of the computational costs of computing exact gradients and performing a line search in each iteration. However, in addition, there is the fact that the Hessian approximations will almost certainly be dense, which might lead to storage issues in addition to the expense of computing matrix-vector products with large dense matrices. Fortunately, however, the optimization literature has already addressed these latter issues with the idea of using limited memory quasi-Newton approaches, such as limited memory BFGS, also known as L-BFGS.
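The key device in L-BFGS is the two-loop recursion, which applies the inverse Hessian approximation to a vector using only the most recent (step, gradient-displacement) pairs, never forming a dense matrix. A sketch (the initial scaling shown is one common choice):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: implicitly computes H_k @ grad from the stored
    (s, y) pairs; the search direction would be the negative of the result.
    """
    q = grad.copy()
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):  # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial scaling based on the most recent pair (a common heuristic).
    s, y = s_list[-1], y_list[-1]
    r = ((s @ y) / (y @ y)) * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

With a single stored pair this reproduces exactly one explicit BFGS inverse update applied to a scaled identity, which gives a convenient correctness check.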
For our purposes, let us simply give an overview of several variants of the (L-)BFGS method that have been recently developed to address various challenges of ML applications. One common feature of all of these approaches is that the line search component is completely removed in favor of a predetermined stepsize sequence $\{\alpha_k\}$ such as is used for SGD. Another common feature is that the sequence $\{y_k\}$ no longer represents (exact) gradient displacements, since exact gradients are too expensive to compute. Instead, the techniques mentioned in the bullets below use different ideas for this sequence of vectors.
- If one defines $y_k$ as the difference of two stochastic gradient estimates formed using two different mini-batches, then one typically fails to have $s_k^T y_k > 0$ and, even if this bound does hold, the Hessian approximation sequence can be highly volatile. An alternative idea that overcomes these issues is to define $y_k$ so that the displacement is computed using the same mini-batch at each point. This is what is proposed in the methods known as oBFGS (oLBFGS) and SGD-QN. This idea requires two computations of a stochastic gradient per iterate: one with the previous mini-batch (to complete the previous displacement) and one with the current mini-batch (to compute the next step). The resulting method has been shown to perform well for convex problems, though convergence eventually slows down when noise in the gradient estimates starts to dominate the improvements made by the steps.
- In , it is proposed to use large overlapping (but still distinct) mini-batches on consecutive iterations within an L-BFGS framework. In other words, $y_k$ is the difference of stochastic gradients computed on consecutive mini-batches that are not independent, but contain a relatively large overlap. Positive results on optimizing the training error for logistic regression were obtained by combining L-BFGS with carefully designed distributed computations.
- Another stochastic (L-)BFGS method has been proposed to encourage the self-correcting property of BFGS-type updating. In particular, it is known that (L-)BFGS behaves well when the inner product $s_k^T y_k$ remains bounded above and below (away from zero). Hence, in , a modification of the secant equation is proposed to ensure such bounds: the gradient displacement $y_k$ is replaced by a damped combination of $y_k$ and the step $s_k$, with the damping parameter chosen specifically to ensure that the relevant inner product is bounded above and below by positive constants. With this damping of the update vectors, the method exhibits stable performance in convex and nonconvex settings.
- An alternative idea aimed at ensuring that $s_k^T y_k > 0$ has been proposed in  and extended in  to a block version of the BFGS method. The key idea is to set $y_k$ not as the difference of stochastic gradient estimates, but as the product of a sub-sampled Hessian and the step $s_k$, where the sub-sampled Hessian is computed on a batch different from the one used to compute the step. This approach has been successful on convex models such as logistic regression. However, if the sample Hessian is computed using small mini-batches, then $s_k^T y_k$ may be small and the method may become unstable.
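Two of the displacement constructions above, the same-mini-batch gradient difference and the Hessian-action product, can be sketched side by side; `grad_batch` and `hess_batch` are assumed callables returning mini-batch gradient and Hessian estimates:

```python
import numpy as np

def grad_diff_pair(grad_batch, w_prev, w, batch):
    """oBFGS-style displacement: evaluate the stochastic gradient at both
    points with the SAME mini-batch, so y reflects curvature, not noise."""
    s = w - w_prev
    y = grad_batch(w, batch) - grad_batch(w_prev, batch)
    return s, y

def hess_action_pair(hess_batch, w_prev, w, batch):
    """Hessian-action displacement: y is a sub-sampled Hessian applied to s,
    with the Hessian batch chosen independently of the step computation."""
    s = w - w_prev
    y = hess_batch(w, batch) @ s
    return s, y
```

For a quadratic objective the two constructions coincide when evaluated on the same batch, which makes the connection between them easy to verify.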
Stochastic optimization methods (without variance reduction) cannot achieve a convergence rate faster than sublinear, even if second-order information is used. However, using second-order information remains attractive since, if the Hessian approximation matrices converge to the true Hessian at the solution, then the constant in the convergence rate can be lessened, as the effects of ill-conditioning are reduced.
Unfortunately, despite practical improvements that have been observed, there has yet to be a practical second-order method that provably achieves such improvements in theory. As of now, most practical methods only provably achieve the convergence (rate) properties of SGD as long as the Hessian (approximation) matrices remain well-behaved. For example, if the sequence $\{H_k\}$ (not necessarily produced by BFGS updates) satisfies, for all $k$,
$$\mu I \preceq H_k \preceq L I \quad \text{for some constants } 0 < \mu \leq L,$$
then the resulting method achieves the same convergence rate properties as SGD. It is reasonable to assume that such bounds hold for the methods discussed above, with appropriate safeguards; however, one should be careful in a quasi-Newton context in which the stochastic gradient estimates might be correlated with the Hessian approximations.
3 Deep Learning
We have seen that mathematical optimization plays an integral role in machine learning. In particular, in §1.2, we saw that in hopes of solving a given learning problem, a careful sequence of choices and practical approximations must be made before one arrives at a tractable problem such as (8). We have also seen how, as in many traditional machine learning models, one can arrive at a convex optimization problem such as (12) for which numerous types of algorithms can be applied with well-known strong convergence guarantees. However, for a variety of learning problems, researchers have found that insisting on a convex problem goes too far in terms of simplifying the function one is trying to learn: recent advances have shown that such convex models are unable to provide predictors as powerful as those achievable through more complex nonconvex models. In this section, we provide an overview of recent developments in this area of optimization for machine learning.
The major advancements that have been made along these lines involve the use of deep neural networks (DNNs). The corresponding branch of ML known as deep learning (or hierarchical learning) represents classes of algorithms that attempt to construct high-level abstractions in data by using a deep graph with multiple layers involving sequential linear and nonlinear transformations [6, 51, 73, 37, 38, 23]. A variety of neural network types have been studied in recent years, including fully-connected networks (FNNs) [84, 28], convolutional networks (CNNs), and recurrent networks (RNNs) [41, 57, 52]. For our purposes, we will mainly refer to the first two types of networks, while keeping in mind the others.
While the idea of “high-level abstractions” might sound complicated, understanding the basics of DNNs is not difficult once one is familiar with the terminology. Deep neural networks derive their name from being inspired by simplified models from neuroscience. One need not be an expert in neuroscience to understand DNNs, but many find it convenient to borrow its terminology when describing the structure of a DNN and how it combines input information through a sequence of stages until an ultimate output is produced. The analogy is to how the human brain takes various pieces of input from multiple sources and combines them through a multi-step process (learned since birth) in order to make a prediction. For example, to classify an image as that of a dog, one combines bits of information—e.g., sections that look like ears, a tail, and fur all oriented in the appropriate manner—to make such a determination.
Structurally, a DNN takes the form of a graph with subsets of nodes arranged in a sequence. Each subset of nodes (or neurons) is called a layer, where in simple cases edges only exist between neurons in a layer and the subsequent layer. However, besides its structure, the key aspect of a DNN is how information is “fed” through it. In the simple case of a feed-forward network, this occurs as follows. First, each element of an input vector is given to a different neuron in the first layer, also known as the input layer. The values in this layer are then each passed to the neurons in the next layer after multiplication with weights associated with the corresponding edges. Once at a given node, a value can be further transformed through application of a (linear or nonlinear) activation function before values continue to be passed through the network. The last layer of the network, which provides the predicted output, is known as the output layer, whereas layers between the input and output are called hidden layers. The more hidden layers that are present, the deeper the network. Figure 3 provides a partial illustration of a small fully-connected DNN for a simple classification model with two hidden layers.
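The forward pass just described can be sketched in a few lines; the use of tanh as the activation and a linear output layer are illustrative choices, not requirements:

```python
import numpy as np

def forward(x, weights, biases, act=np.tanh):
    """Feed-forward pass through a fully-connected network (sketch).

    Each hidden layer applies an affine map (edge weights plus bias)
    followed by an activation; the output layer is left linear here.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = act(W @ a + b)               # hidden layers
    return weights[-1] @ a + biases[-1]  # output layer
```

Each weight matrix maps one layer's values to the next, so the depth of the network is simply the number of affine/activation stages chained together.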