MATLAB scripts to reproduce the results of our OPT 2009 paper
We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analyses of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to a wide variety of sparse estimation problems. We experimentally confirm our analysis on a large-scale ℓ_1-regularized logistic regression problem and extensively compare the efficiency of the DAL algorithm with that of previously proposed algorithms on both synthetic and benchmark datasets.
Sparse estimation through convex regularization has become common practice in many application areas, including bioinformatics and natural language processing. However, facing the rapid increase in the size of the data sets that we analyze every day, we clearly need optimization algorithms that are tailored for machine learning applications.
Regularization-based sparse estimation methods estimate unknown variables through the minimization of a loss term (or a data-fit term) plus a regularization term. In this paper, we focus on convex methods; i.e., both the loss term and the regularization term are convex functions of the unknown variables. Regularizers may be non-differentiable at some points; this non-differentiability can promote various types of sparsity in the solution.
Although the problem is convex, three factors challenge the straightforward application of general tools for convex optimization (Boyd and Vandenberghe, 2004) in the context of machine learning.
The first factor is the diversity of loss functions. Arguably the squared loss is most commonly used in the field of signal/image reconstruction, in which many algorithms for sparse estimation have been developed (Figueiredo and Nowak, 2003; Daubechies et al., 2004; Cai et al., 2008). However, the variety of loss functions is much wider in machine learning, including, to name a few, the logistic loss and other log-linear loss functions. Note that these functions are not necessarily strongly convex, unlike the squared loss. See Table 4 for a list of the loss functions that we consider.
The second factor is the nature of the data matrix, which we call the design matrix in this paper. For a regression problem, the design matrix is defined by stacking input vectors along rows. If the input vectors are numerical (e.g., gene expression data), the design matrix is dense and has no structure. In addition, the characteristics of the matrix (e.g., the condition number) are unknown until the data is provided. Therefore, we would like to minimize assumptions about the design matrix, such as sparsity, structure, or good conditioning.
The third factor is the large number of unknown variables (or parameters) compared to the number of observations. This is the situation in which regularized estimation methods are commonly applied. This factor may have been overlooked in the context of signal denoising, in which the number of observations and the number of parameters are equal.
Various methods have been proposed for efficient sparse estimation (see Figueiredo and Nowak (2003); Daubechies et al. (2004); Combettes and Wajs (2005); Andrew and Gao (2007); Koh et al. (2007); Wright et al. (2009); Beck and Teboulle (2009); Yu et al. (2010), and the references therein). Many previous studies focus on the non-differentiability of the regularization term. In contrast, we focus on the couplings between variables (or non-separability) caused by the design matrix. In fact, if the optimization problem can be decomposed into smaller problems (e.g., each containing a single variable), optimization is easy. Recently, Wright et al. (2009) showed that the so-called iterative shrinkage/thresholding (IST) method (see Figueiredo and Nowak (2003); Daubechies et al. (2004); Combettes and Wajs (2005); Figueiredo et al. (2007a)) can be seen as an iterative separable approximation process.
In this paper, we show that a recently proposed dual augmented Lagrangian (DAL) algorithm (Tomioka and Sugiyama, 2009) can be considered as an exact (up to finite tolerance) version of the iterative approximation process discussed in Wright et al. (2009). Our formulation is based on the connection between proximal minimization (Rockafellar, 1976a) and the augmented Lagrangian (AL) algorithm (Hestenes, 1969; Powell, 1969; Rockafellar, 1976b; Bertsekas, 1982). The proximal minimization framework also allows us to rigorously study the convergence behaviour of DAL. We show that DAL converges super-linearly under some mild conditions, which means that the number of iterations needed to obtain an ε-accurate solution grows no more than logarithmically with 1/ε. Due to the generality of the framework, our analysis applies to a wide variety of practically important regularizers. Our analysis improves the classical result on the convergence of augmented Lagrangian algorithms in Rockafellar (1976b) by taking the special structure of sparse estimation into account. In addition, we make no asymptotic arguments as in Rockafellar (1976b) and Kort and Bertsekas (1976); instead, our convergence analysis is built on top of the recent result in Beck and Teboulle (2009).
Augmented Lagrangian formulations have also been considered in Yin et al. (2008) and Goldstein and Osher (2008) for sparse signal reconstruction. What differentiates the DAL approach of Tomioka and Sugiyama (2009) from those studied earlier is that the AL algorithm is applied to the dual problem (see Section 2.2), which results in an inner minimization problem that can be solved efficiently by exploiting the sparsity of intermediate solutions (see Section 4.1). Applying the AL formulation to the dual problem also plays an important role in the convergence analysis, because some loss functions (e.g., the logistic loss) are not strongly convex in the primal; see Section 5. Recently, Yang and Zhang (2009) compared primal and dual augmented Lagrangian algorithms for ℓ_1-problems and reported that the dual formulation was more efficient. See also Tomioka et al. (2011) for related discussions.
This paper is organized as follows. In Section 2, we mathematically formulate the sparse estimation problem and review the DAL algorithm. We derive the DAL algorithm from the proximal minimization framework in Section 3, and special instances of the DAL algorithm are discussed in Section 4. In Section 5, we theoretically analyze the convergence behaviour of the DAL algorithm. We discuss previously proposed algorithms in Section 6 and contrast them with DAL. In Section 7, we confirm our analysis on a simulated ℓ_1-regularized logistic regression problem. Moreover, we extensively compare recently proposed algorithms for ℓ_1-regularized logistic regression, including DAL, on synthetic and benchmark datasets under a variety of conditions. Finally, we conclude the paper in Section 8. Most of the proofs are given in the appendix.
In this section, we first formulate the sparse estimation problem as a convex optimization problem and state our assumptions. Next, we derive the DAL algorithm for the ℓ_1-problem as an augmented Lagrangian method applied to the dual.
We consider the problem of estimating an dimensional parameter vector from training examples as described in the following optimization problem:
where is the parameter vector to be estimated, is a design matrix, and is a loss function. We call the first term in the minimand the loss term and the second term the regularization term, or the regularizer.
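To make the structure of the objective concrete, the sketch below evaluates a loss term plus an ℓ_1 regularization term numerically; the choice of the logistic loss, all function names (`logistic_loss`, `objective`), and the toy data are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def logistic_loss(z, y):
    """Logistic loss: sum_i log(1 + exp(-y_i z_i))."""
    return np.sum(np.log1p(np.exp(-y * z)))

def objective(w, A, y, lam):
    """Loss term evaluated at A w, plus the regularizer lam * ||w||_1."""
    return logistic_loss(A @ w, y) + lam * np.sum(np.abs(w))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))            # design matrix: input vectors stacked as rows
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # binary labels
val = objective(np.zeros(3), A, y, lam=0.1)  # equals 5 * log(2) at w = 0
```

Note how the design matrix is kept separate from the loss function, which is the convention used throughout the paper.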
We assume that the loss function is a closed proper strictly convex function (“closed” means that the epigraph is a closed set, and “proper” means that the function is not everywhere +∞; see, e.g., Rockafellar (1970); in the sequel, we use the term “convex function” to mean “closed proper convex function”). See Table 4 for examples of loss functions. We assume that the loss function has a Lipschitz continuous gradient with the modulus specified in Assumption (A2) in Section 5.2. If
the loss is twice differentiable, this condition is equivalent to saying that the maximum eigenvalue of its Hessian is uniformly bounded by the modulus. Such a constant exists, for example, for the quadratic loss, the logistic loss, and other log-linear losses. However, non-smooth loss functions (e.g., the hinge loss and the absolute loss) are excluded. Note that since we separate the data matrix from the loss function, we can quantify the above constant without examining the data. Moreover, we assume that the convex conjugate (the convex conjugate of a function f is the function f*(y) = sup_x (⟨y, x⟩ − f(x)), which takes values in (−∞, +∞]) is (essentially) twice differentiable. Note that the first-order differentiability of the convex conjugate is implied by the strict convexity of the loss function (Rockafellar, 1970, Theorem 26.3).
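As a sanity check on the convex-conjugate definition above, the snippet below verifies the well-known closed form of the conjugate of the per-sample logistic loss (the negative binary entropy on (−1, 0)) against a direct numerical maximization; the function names and grid are our own, and this particular conjugate is a standard fact rather than something stated in this section.

```python
import numpy as np

def logistic(z):
    """Per-sample logistic loss f(z) = log(1 + exp(-z))."""
    return np.log1p(np.exp(-z))

def logistic_conjugate(v):
    """Closed form f*(v) = (-v) log(-v) + (1 + v) log(1 + v) for v in (-1, 0)."""
    p = -v
    return p * np.log(p) + (1 - p) * np.log(1 - p)

v = -0.3
z = np.linspace(-20.0, 20.0, 200001)   # dense grid approximating the supremum
numeric = np.max(v * z - logistic(z))  # f*(v) = sup_z (v z - f(z))
```

The agreement of `numeric` with `logistic_conjugate(v)` illustrates that the conjugate is finite only on a bounded domain, which is why the dual problem below involves a constraint.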
The regularization term is a convex possibly non-differentiable function. In addition, we assume that for all , .
In this subsection, we review the DAL algorithm following the line of Tomioka and Sugiyama (2009). Although the squared loss function and the ℓ_1-regularizer were considered in the original paper, we deal with a slightly more general setting in Eq. (2) for notational convenience; i.e., we consider general closed convex loss functions instead of the squared loss. For general information on augmented Lagrangian algorithms (Powell, 1969; Hestenes, 1969; Rockafellar, 1976b), see Bertsekas (1982) and Nocedal and Wright (1999).
where is the indicator function (Rockafellar, 1970, p. 28) of the ℓ_∞-ball of radius equal to the regularization constant, namely
that is, the function that takes the value zero inside the ball and +∞ outside.
Let us consider the augmented Lagrangian (AL) function with respect to the above dual problem (3)
where the primal variable is interpreted as a Lagrangian multiplier vector in the AL framework. Note that the AL function reduces to the ordinary Lagrangian when the proximity parameter is zero.
Let be a non-decreasing sequence of positive numbers. At every time step , given the current primal solution , we maximize the AL function with respect to and . The maximizer is used to update the primal solution (Lagrangian multiplier) as follows:
Note that the maximization of the AL function (6) with respect to can be carried out in closed form, because the terms involved in the maximization separate into independent terms, each containing a single component, as follows:
where the subscript denotes the corresponding element of a vector. Since the indicator function is infinite outside its domain, the maximizer is obtained as a projection onto the ℓ_∞-ball as follows (see also Figure 9):
where denotes an -dimensional vector whose th element is given by . Note that the ratio is defined to be zero when its denominator is zero (an analogous convention is used to define the soft-threshold operations corresponding to the group-lasso regularization; see Section 4.2). Substituting the above back into Eq. (7), we obtain the following update equation:
where is called the soft-threshold operation (this notation is a simplified version of the general notation we introduce later in Eq. (15)) and is defined as follows:
The soft-threshold operation is well known in the signal processing community and has been studied extensively (Donoho, 1995; Figueiredo and Nowak, 2003; Daubechies et al., 2004; Combettes and Wajs, 2005).
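The soft-threshold operation just defined can be sketched in a few lines; the function name is our own, and the sign/maximum formulation is the standard component-wise shrinkage.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-threshold: shrink each component of x toward zero by lam,
    setting to zero any component with magnitude at most lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.5, 1.2, 0.0])
y = soft_threshold(x, 1.0)   # nonzero only where |x_j| > 1
```

The `np.sign(0.0) == 0` convention automatically implements the 0/0 := 0 rule mentioned above.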
The first contribution of this paper is to derive the DAL algorithm we reviewed in Section 2.2 from the proximal minimization framework (Rockafellar, 1976a), which allows for a new interpretation of the algorithm (see Section 3.3) and a rigorous analysis of its convergence (see Section 5).
Choose some initial solution and a sequence of non-decreasing positive numbers .
The proximity term tries to keep the next solution close to the current solution . Importantly, the objective (11) is strongly convex even if the original objective (1) is not; see Rockafellar (1976b). Although at this point it is not clear how we are going to carry out the above minimization, by definition we have ; i.e., provided that the step-size is positive, the function value decreases monotonically at every iteration.
The function to be minimized in Eq. (11) is strongly convex. However, there seems to be no obvious way to minimize Eq. (11), because it is still (possibly) non-differentiable and cannot be decomposed into smaller problems because the elements of are coupled.
One way to make the proximal minimization algorithm practical is to linearly approximate (see Wright et al. (2009)) the loss term at the current point as
where constant terms are omitted from the right-hand side. Note that because of the linear approximation, there is no coupling between the components of the parameter vector. In this case, the minimand on the right-hand side of the above equation can be separated into terms each containing a single component, which can be minimized separately.
Rewriting the above update equation, we obtain the well-known iterative shrinkage/thresholding (IST) method (Figueiredo and Nowak, 2003; Daubechies et al., 2004; Combettes and Wajs, 2005; Figueiredo et al., 2007a), also known as the forward-backward splitting method (Lions and Mercier, 1979; Combettes and Wajs, 2005; Duchi and Singer, 2009); see Section 6. The IST iteration can be written as follows:
where the proximity operator is defined as follows:
Note that the soft-threshold operation (9) is the proximity operator corresponding to the ℓ_1-regularizer.
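The IST iteration above can be sketched as follows for the squared loss; the step size, iteration count, variable names, and toy data are our own assumptions for illustration, not taken from the paper's experiments.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximity operator of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ist(A, b, lam, n_iter=1000):
    """IST for min_w 0.5 * ||A w - b||^2 + lam * ||w||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the Lipschitz constant
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)            # gradient of the loss term at w
        w = soft_threshold(w - eta * grad, eta * lam)  # forward step, then shrinkage
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
w_true = np.zeros(10)
w_true[:2] = [2.0, -1.5]                    # sparse ground truth
b = A @ w_true                              # noiseless observations
w_hat = ist(A, b, lam=0.01)
```

Each iteration is a gradient step on the loss term followed by the separable proximity operator, which is exactly the iterative separable approximation view of Wright et al. (2009).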
The above IST approach can be considered as constructing a linear lower bound of the loss term in Eq. (11) at the current point. In this subsection, we show that we can minimize Eq. (11) exactly (up to finite tolerance) using a parametrized linear lower bound that can be adjusted to be tightest at the next point. Our approach is based on the convexity of the loss function. First note that we can rewrite the loss function as a point-wise maximum as follows:
where is the convex conjugate function of the loss. Now we substitute this expression into the iteration (11) as follows:
Note that the loss term is now expressed as a linear function, as in the IST approach; see Eq. (13). Next we exchange the order of minimization and maximization, which is justified because the objective in Eq. (17) is a saddle function, i.e., convex with respect to the primal variable and concave with respect to the dual variable (Rockafellar, 1970), as follows:
The minimization with respect to in Eq. (18) gives the following update equation
where denotes the maximizer with respect to in Eq. (18). Note that is in general different from used in the IST approach (14). Actually, we show below that if the max-min problem (18) is solved exactly. Therefore taking can be considered as a naive approximation to this.
The final step to derive DAL algorithm is to compute the maximizer in Eq. (18). This step is slightly involved and the derivation is presented in Appendix B. The result of the derivation can be written as follows (notice that the maximization in Eq. (18) is turned into a minimization by reversing the sign):
We call the function in Eq. (20) the augmented Lagrangian (AL) function.
What we need to do at every iteration is to minimize the AL function and update the Lagrangian multiplier as in Eq. (19) using the minimizer in Eq. (20). Of course in practice we would like to stop the inner minimization at a finite tolerance. We discuss the stopping condition in Section 5.
The algorithm we derived above is indeed a generalization of the DAL algorithm we reviewed in Section 2.2. This can be shown by computing the proximity operator (19) and the Moreau envelope (21) for the specific case of ℓ_1-regularization; see Section 4.1 and also Table 4.
The AL function is continuously differentiable, because it is the sum of a function that is differentiable by assumption and an envelope function, which is differentiable (see Appendix A). In fact, using Lemma 9 in Appendix A, the derivative of the AL function can be evaluated as follows:
where . The expression for the second derivative depends on the particular regularizer chosen.
Notice again that the above update equation (19) is very similar to the one in the IST approach (Eq. (14)). However, the slope of the lower bound (16) is optimized in the inner minimization (20) so that the lower bound is tightest at the next point; this follows from Eq. (22) when the inner minimization is solved exactly. The difference between the strategies used in IST and DAL to construct a lower bound is highlighted in Figure 1. IST uses a fixed gradient-based lower bound, which is tightest at the current solution, whereas DAL uses a variational lower bound, which can be adjusted to become tightest at the next solution.
The general connection between the augmented Lagrangian algorithm and the proximal minimization algorithm, and (asymptotic) convergence results can be found in Rockafellar (1976b); Bertsekas (1982). The derivation we show above is a special case when the objective function can be split into a part that is easy to handle (regularization term ) and the rest (loss term ).
In this section, we discuss special instances of the DAL framework presented in Section 3 and qualitatively discuss the efficiency of minimizing the inner objective. We first discuss the simple case of ℓ_1-regularization (Section 4.1), and then the group lasso (Section 4.2) and other more general regularization using so-called support functions (Section 4.3). In addition, the case of component-wise regularization is discussed in Section 4.4. See also Table 4 for a list of regularizers.
For the ℓ_1-regularization, the update equation (19) can be rewritten as follows:
where is the proximity operator corresponding to the ℓ_1-regularizer defined in Eq. (9). Moreover, noticing that the convex conjugate of the ℓ_1-regularizer is the indicator function in Eq. (5), we can derive the envelope function in Eq. (21) as follows (see also Figure 9):
We use Newton’s method for the minimization of the inner objective . The overall algorithm is shown in Algorithm 1. The gradient and Hessian of the AL function (10) can be evaluated as follows (Tomioka and Sugiyama, 2009):
where , and is the matrix that consists of the columns of the design matrix that correspond to the “active” variables (i.e., the non-zero elements). Note that Eq. (24) equals the general expression (22) from the proximal minimization framework.
It is worth noting that in both the computation of matrix-vector product in Eq. (24) and the computation of matrix-matrix product in Eq. (25), the cost is only proportional to the number of non-zero elements of . Thus when we are aiming for a sparse solution, the minimization of Eq. (10) can be performed efficiently.
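The Newton-based inner minimization described above can be sketched for the squared loss; this is our own Python translation (the paper's scripts are in MATLAB), and the conjugate expression, the doubling schedule for the proximity parameter, and all names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def dal_lasso(A, b, lam, eta=1.0, n_outer=10, n_inner=30, tol=1e-10):
    """DAL sketch for min_w 0.5 * ||A w - b||^2 + lam * ||w||_1.

    Inner loop: Newton's method on the AL function, using the fact that
    the conjugate of the squared loss gives f*(-alpha) = 0.5 ||alpha||^2 - b'alpha.
    """
    m, n = A.shape
    w = np.zeros(n)
    alpha = np.zeros(m)
    for _ in range(n_outer):
        for _ in range(n_inner):
            s = soft_threshold(w + eta * (A.T @ alpha), eta * lam)
            grad = alpha - b + A @ s                  # gradient of the AL function
            if np.linalg.norm(grad) < tol:
                break
            Ap = A[:, s != 0.0]                       # columns of the active variables
            H = np.eye(m) + eta * (Ap @ Ap.T)         # Hessian of the AL function
            alpha = alpha - np.linalg.solve(H, grad)  # Newton step
        w = soft_threshold(w + eta * (A.T @ alpha), eta * lam)  # primal update
        eta *= 2.0                                    # increase proximity parameter
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
w_true = np.zeros(10)
w_true[:2] = [2.0, -1.5]
b = A @ w_true
w_hat = dal_lasso(A, b, lam=0.01)
```

Only the active columns of the design matrix enter the Hessian, which is the sparsity-exploiting property emphasized in this subsection.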
Let be the group-lasso penalty (Yuan and Lin, 2006), i.e.,
where is a disjoint partition of the index set , and is a sub-vector of that consists of rows of indicated by . The proximity operator corresponding to the group-lasso regularizer is obtained as follows:
where similarly to Eq. (9), denotes an -dimensional vector whose component is given by . Moreover, analogous to update equation (23) (see also Eq. (10)) in the -case, the update equations can be written as follows:
where is the minimizer of the AL function
The overall algorithm is obtained by replacing the soft-thresholding operations in Algorithm 1 by the one defined above (27). In addition, the gradient and Hessian of the AL function can be written as follows:
where , and is a subset of that consists of active groups, namely ; is a sub-matrix of that consists of columns of that corresponds to the index-set ; is the identity matrix; the vector is defined as and , where is defined analogously to . Note that in the above expression, for by the soft-threshold operation (27).
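The group-wise soft-thresholding operation used above can be sketched as follows; the function name, group encoding, and toy data are our own.

```python
import numpy as np

def group_soft_threshold(x, lam, groups):
    """Block-wise soft-threshold: shrink the Euclidean norm of each group
    by lam, zeroing any group whose norm is at most lam."""
    out = np.zeros_like(x)
    for g in groups:                     # disjoint index sets partitioning 0..n-1
        norm = np.linalg.norm(x[g])
        if norm > lam:                   # the guard implements the 0/0 := 0 rule
            out[g] = (1.0 - lam / norm) * x[g]
    return out

x = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
y = group_soft_threshold(x, 1.0, groups)   # first group shrinks, second is zeroed
```

Sparsity is promoted at the level of whole groups: a group survives or vanishes together.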
The ℓ_1-norm regularization and the group lasso regularization in Eq. (26) can be generalized to the class of support functions. The support function of a convex set is defined as follows:
For example, the ℓ_1-norm is the support function of the unit ℓ_∞-ball (see Rockafellar (1970)) and the group lasso regularizer (26) is the support function of the group-generalized ℓ_2-ball defined as . It is well known that the convex conjugate of the support function (31) is the indicator function of the set (see Rockafellar (1970)), namely,
The proximity operator corresponding to the support function (31) can be written as follows:
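By the Moreau decomposition, the proximity operator of a support function is the identity minus a projection onto the (scaled) set. The sketch below assumes the set is the unit ℓ_∞-ball, in which case this recipe recovers the soft-threshold; the function name is our own.

```python
import numpy as np

def prox_l1_via_projection(x, lam):
    """Prox of lam * ||.||_1 computed as x minus its projection onto the
    l_inf-ball of radius lam (Moreau decomposition for support functions)."""
    return x - np.clip(x, -lam, lam)   # np.clip projects onto the scaled ball

x = np.array([3.0, -0.5, 1.2])
y = prox_l1_via_projection(x, 1.0)     # coincides with the soft-threshold of x
```

This reduces computing the proximity operator of any support function to a projection, which is often easy for simple sets.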
The ℓ_1-regularizer in Section 4.1 and the group lasso regularizer in Section 4.2 assume that all the components (variables or groups) are regularized by the same constant. However, the general formulation in Section 3.3 allows using a different regularization constant for each component.
For example, let us consider the following regularizer:
where (). Note that we can also leave some terms unregularized (e.g., a bias term) by setting the corresponding regularization constant to zero. The soft-thresholding operation corresponding to the regularizer (34) is written as follows:
where again the ratio is defined to be zero when its denominator is zero. Note that if the regularization constant of a component is zero, the soft-thresholding operation is the identity mapping for that component. Moreover, by noticing that the regularizer (34) is a support function (see Section 4.3), the envelope function in Eq. (21) is written as follows:
As a concrete example, let be an unregularized bias term and let us assume that all the components of are regularized by the same regularization constant . In other words, we aim to solve the following optimization problem:
where is the minimizer of the AL function as follows:
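The component-wise soft-thresholding operation described in this subsection can be sketched as follows; the vectorized formulation and names are our own, with a zero constant playing the role of the unregularized bias.

```python
import numpy as np

def componentwise_soft_threshold(x, lams):
    """Soft-threshold with a per-component regularization constant;
    lams[j] == 0 leaves component j untouched (e.g., an unregularized bias)."""
    return np.sign(x) * np.maximum(np.abs(x) - lams, 0.0)

x = np.array([0.5, -2.0, 3.0])
lams = np.array([0.0, 1.0, 1.0])   # first component is the unregularized bias
y = componentwise_soft_threshold(x, lams)
```

The bias component passes through unchanged while the regularized components are shrunk as in the ordinary ℓ_1 case.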
In this section, we first show the convergence of the DAL algorithm assuming that the inner minimization problem (20) is solved exactly (Section 5.1), in which case the algorithm is equivalent to the proximal minimization algorithm (11). The convergence is presented both in terms of the function value and the norm of the residual. Next, since it is impractical to perform the inner minimization to high precision, finite-tolerance versions of the two theorems are presented in Section 5.2. The convergence rate obtained in Section 5.2 is slightly worse than in the exact case. In Section 5.3, we show that the convergence rate can be improved by performing the inner minimization more precisely. Most of the proofs are given in Appendix C for the sake of readability.
Our result is inspired partly by Beck and Teboulle (2009) and is similar to the one given in Rockafellar (1976a) and Kort and Bertsekas (1976). However, our analysis does not require asymptotic arguments as in Rockafellar (1976a) or rely on the strong convexity of the objective as in Kort and Bertsekas (1976). Importantly the stopping criterion we discuss in Section 5.2 can be checked in practice. Key to our analysis is the Lipschitz continuity of the gradient of the loss function and the assumption that the proximation with respect to (see Eq. (15)) can be computed exactly. Connections between our assumption and the ones made in earlier studies are discussed in Section 5.4.
First notice that the next iterate minimizes Eq. (11) by construction. Therefore, using convexity, we have
where the third line follows from the Cauchy-Schwarz inequality and the last line follows from the inequality of arithmetic and geometric means.
Let be the sequence generated by DAL algorithm (Eqs. (19) and (20)); let be the set of minimizers of the objective (1) and let denote the minimum objective value. If the inner minimization (Eq. (20)) is solved exactly and the proximity parameter is increased exponentially, then DAL algorithm converges exponentially as follows:
where denotes the minimum distance between the initial solution and , namely, . Note that also grows exponentially. Let be any point in . Substituting in Eq. (38) and summing both sides from to , we have
In addition, since () from Eq. (11), we have
The above theorem claims the convergence of the residual function values obtained along the sequence . We can convert the above result into convergence in terms of the residual norm by introducing an assumption that connects the residual function value to the residual norm. In addition, we slightly generalize Lemma 5.1 to improve the convergence rate. Consequently, we obtain the following theorem.
where denotes the minimum objective value, and denotes the minimum distance between and the set of minimizers as .
If the inner minimization is solved exactly, we have the following inequality:
Moreover, this implies that
That is, converges to super-linearly if or and is increasing, in a global and non-asymptotic sense. See Appendix C.1. Note that the above super-linear convergence holds without the assumption in Theorem 5.1 that is increased exponentially.
The loss function has a Lipschitz continuous gradient with modulus , i.e.,
The proximation with respect to (see Eq. (15)) can be computed exactly.
Under assumptions (A2)–(A4), for arbitrary we have
The assumptions we make here are rather weak. In Assumption (A2), the loss function does not include the design matrix (see Table 4). Therefore, it is easy to compute the constant . Accordingly, the stopping criterion (A4) can be checked without assuming anything about the data.
Finally, an analogue of Theorem 5.1, which does not assume the exponential increase in , is obtained as follows.