Risk minimization is fundamental to machine learning. It trades off the empirical loss against regularization:

$$\min_{\mathbf{x}} F(\mathbf{x}) \equiv f(\mathbf{x}) + g(\mathbf{x}), \tag{1}$$

where $\mathbf{x}$ is the model parameter, $f$ is the loss and $g$ is the regularizer. The choice of regularizer is important and application-specific, and is often the crux of obtaining good prediction performance. Popular examples include the sparsity-inducing regularizers, which are commonly used in image processing (Beck and Teboulle, 2009; Mairal et al., 2009; Jenatton et al., 2011) and high-dimensional feature selection (Tibshirani et al., 2005; Jacob et al., 2009; Liu and Ye, 2010); and the low-rank regularizer in matrix and tensor learning, with good empirical performance on tasks such as recommender systems (Candès and Recht, 2009; Mazumder et al., 2010) and visual data analysis (Liu et al., 2013; Lu et al., 2014).
Most of these regularizers are convex. Well-known examples include the $\ell_1$-regularizer for sparse coding (Donoho, 2006), and the nuclear norm regularizer in low-rank matrix learning (Candès and Recht, 2009). Besides having nice theoretical guarantees, convex regularizers also allow easy optimization. Popular optimization algorithms in machine learning include the proximal algorithm (Parikh and Boyd, 2013), the Frank-Wolfe (FW) algorithm (Jaggi, 2013), the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), and stochastic gradient descent and its variants (Bottou, 1998; Xiao and Zhang, 2014). Many of these are efficient, scalable, and have sound convergence properties.
Table 1: Example nonconvex regularizers.
| regularizer |
| GP (Geman and Yang, 1995) |
| LSP (Candès et al., 2008) |
| MCP (Zhang, 2010a) |
| Laplace (Trzasko and Manduca, 2009) |
| SCAD (Fan and Li, 2001) |
However, convex regularizers often lead to biased estimation. For example, in sparse coding, the solution obtained by the $\ell_1$-regularizer is often not as sparse and accurate as desired (Zhang, 2010b). In low-rank matrix learning, the rank estimated with the nuclear norm regularizer is often much higher than the true rank (Mazumder et al., 2010). To alleviate this problem, a number of nonconvex regularizers have been proposed (Geman and Yang, 1995; Fan and Li, 2001; Candès et al., 2008; Zhang, 2010a; Trzasko and Manduca, 2009). As can be seen from Table 1, they are all (i) nonsmooth at zero, which encourages a sparse solution; and (ii) concave, which places a smaller penalty than the $\ell_1$-regularizer on features with large magnitudes. Empirically, these nonconvex regularizers usually outperform convex regularizers.
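As a rough numerical illustration of properties (i) and (ii), the sketch below evaluates four of the penalties of Table 1 on nonnegative arguments and compares them with the $\ell_1$ penalty. The parameterizations used here (with parameters named `beta` and `theta`) are the forms commonly stated in the literature and are an assumption of this sketch, not necessarily the exact forms used in Table 1.

```python
import numpy as np

# Assumed (commonly used) parameterizations; `beta` and `theta` are
# illustrative names. Each function maps a magnitude a >= 0 to a penalty.
def lsp(a, beta=1.0, theta=1.0):
    return beta * np.log1p(a / theta)

def gp(a, beta=1.0, theta=1.0):
    return beta * a / (theta + a)

def laplace(a, beta=1.0, theta=1.0):
    return beta * (1.0 - np.exp(-a / theta))

def mcp(a, beta=1.0, theta=2.0):
    quadratic = beta * a - a ** 2 / (2.0 * theta)
    return np.where(a <= beta * theta, quadratic, theta * beta ** 2 / 2.0)

alpha = np.linspace(0.0, 10.0, 1001)
penalties = {"LSP": lsp, "GP": gp, "Laplace": laplace, "MCP": mcp}

for name, kappa in penalties.items():
    values = kappa(alpha)
    # With the defaults above every penalty has slope 1 at zero, so it can be
    # compared directly with the l1 penalty |alpha|.
    print(name, "penalty at alpha=10:", float(values[-1]), "l1 penalty:", float(alpha[-1]))
```

With these defaults, each penalty matches the slope of the $\ell_1$ penalty at zero but grows much more slowly for large magnitudes, which is the source of the reduced bias.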
Even with a convex loss, the resulting nonconvex problem is much harder to optimize. One can use general-purpose nonconvex optimization solvers such as the concave-convex procedure (Yuille and Rangarajan, 2002). However, the subproblem in each iteration can be as expensive as the original problem, and the concave-convex procedure is thus often slow in practice (Gong et al., 2013; Zhong and Kwok, 2014).
Recently, the proximal algorithm has also been extended to nonconvex problems. Examples include NIPS (Sra, 2012), IPiano (Ochs et al., 2014), UAG (Ghadimi and Lan, 2016), GIST (Gong et al., 2013), IFB (Bot et al., 2016), and nmAPG (Li and Lin, 2015). Specifically, NIPS, IPiano and UAG allow $f$ in (1) to be Lipschitz smooth (possibly nonconvex) but require $g$ to be convex; while GIST, IFB and nmAPG further allow $g$ to be nonconvex. The current state of the art is nmAPG. However, efficient computation of the underlying proximal operator is only possible for simple nonconvex regularizers. When the regularizer is complicated, such as the nonconvex versions of the fused lasso and overlapping group lasso regularizers (Zhong and Kwok, 2014), the corresponding proximal step has to be solved numerically and is again expensive. Another approach is to use the proximal average (Zhong and Kwok, 2014), which computes and averages the proximal step of each underlying regularizer. However, because the proximal step is only approximate, convergence is usually slower than in typical applications of the proximal algorithm (Li and Lin, 2015).
When $f$ is smooth, there are endeavors to extend other algorithms from convex to nonconvex optimization. For the global consensus problem, standard ADMM converges only when $g$ is convex (Hong et al., 2016). When $g$ is nonconvex, convergence of ADMM is only established for problems of a specific form whose associated matrix has full row rank (Li and Pong, 2015). The convergence of ADMM in more general cases is an open issue. More recently, the stochastic variance reduced gradient (SVRG) algorithm (Johnson and Zhang, 2013), which is a variant of the popular stochastic gradient descent with reduced variance in the gradient estimates, has also been extended to problems with nonconvex $f$. However, the regularizer $g$ is still required to be convex (Reddi et al., 2016a; Zhu and Hazan, 2016).
Sometimes, it is desirable to have a nonsmooth loss $f$. For example, the absolute loss is more robust to outliers than the square loss, and has been popularly used in applications such as image denoising (Yan, 2013), robust dictionary learning (Zhao et al., 2011) and robust PCA (Candès et al., 2011). The resulting optimization problem becomes more challenging. When both $f$ and $g$ are convex, ADMM is often the main optimization tool for problem (1) (He and Yuan, 2012). However, when either $f$ or $g$ is nonconvex, ADMM no longer guarantees convergence. Besides a nonconvex $g$, we may also want to use a nonconvex loss $f$, such as the $\ell_0$-norm (Yan, 2013) and the capped-$\ell_1$ norm (Sun et al., 2013), as they are more robust to outliers and can obtain better performance. However, when $f$ is nonsmooth and nonconvex, none of the above-mentioned algorithms (i.e., proximal algorithms, FW algorithms, ADMM, and SVRG) can be used. As a last resort, one can use more general nonconvex optimization approaches such as the concave-convex procedure (CCCP) (Yuille and Rangarajan, 2002). However, they are slow in general.
In this paper, we first consider the case where the loss function is smooth (possibly nonconvex) and the regularizer is nonconvex. We propose to handle nonconvex regularizers by reusing the abundant repository of efficient algorithms originally designed for convex regularizers. The key is to shift the nonconvexity associated with the nonconvex regularizer to the loss function, and transform the nonconvex regularizer to a familiar convex regularizer. To illustrate the practical usefulness of this convexification scheme, we show how it can be used with popular optimization algorithms in machine learning. For example, for the proximal algorithm, the resultant proximal step can be much easier after transformation. Specifically, for the nonconvex tree-structured lasso and nonconvex sparse group lasso, we show that the corresponding proximal steps have closed-form solutions on the transformed problems, but not on the original ones. For the nonconvex total variation problem, though there is no closed-form solution for the proximal step before and after the transformation, we show that the proximal step is still cheaper and easier for optimization after the transformation. To allow further speedup, we propose a proximal algorithm variant that allows the use of inexact proximal steps when the convexified regularizer has no closed-form proximal step solution. For the FW algorithm, we consider its application to nonconvex low-rank matrix learning problems, and propose a variant with guaranteed convergence to a critical point of the nonconvex problem. For SVRG in stochastic optimization and ADMM in consensus optimization, we show that these algorithms have convergence guarantees on the transformed problems but not on the original ones.
We further consider the case where $f$ is also nonconvex and nonsmooth (and $g$ is nonconvex). We demonstrate that problem (1) can be transformed to an equivalent problem with a smooth loss and convex regularizer using our proposed idea. However, as the proximal step with the transformed regularizer has to be solved numerically and an exact proximal step is required, usage with the proximal algorithm may not be efficient. We show that this problem can be addressed by the proposed inexact proximal algorithm. Finally, in the experiments, we demonstrate the above-mentioned advantages of optimizing the transformed problems instead of the original ones on various tasks, and show that running algorithms on the transformed problems can be much faster than the state of the art on the original ones.
The rest of the paper is organized as follows. Section 2 provides a review of related works. The main idea for problem transformation is presented in Section 3, and its usage with various algorithms is discussed in Section 4. Experimental results are shown in Section 5, and the last section gives some concluding remarks. All the proofs are in Appendix A. Note that this paper extends a shorter version published in the proceedings of the International Conference on Machine Learning (Yao and Kwok, 2016).
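We denote vectors and matrices by lowercase and uppercase boldface letters, respectively. For a vector $\mathbf{x} \in \mathbb{R}^d$, $\|\mathbf{x}\|_2 = (\sum_{i=1}^d x_i^2)^{1/2}$ is its $\ell_2$-norm, and $\text{Diag}(\mathbf{x})$ returns a diagonal matrix $\mathbf{X}$ with $X_{ii} = x_i$. For a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ (where $m \le n$ without loss of generality), its nuclear norm is $\|\mathbf{X}\|_* = \sum_{i=1}^m \sigma_i(\mathbf{X})$, where $\sigma_i(\mathbf{X})$'s are the singular values of $\mathbf{X}$, and its Frobenius norm is $\|\mathbf{X}\|_F = (\sum_{i,j} X_{ij}^2)^{1/2}$. For a square matrix $\mathbf{X}$, $\mathbf{X} \succeq 0$ indicates that it is positive semidefinite. For two matrices $\mathbf{X}$ and $\mathbf{Y}$, $\langle \mathbf{X}, \mathbf{Y} \rangle = \sum_{i,j} X_{ij} Y_{ij}$. For a smooth function $f$, $\nabla f(\mathbf{x})$ is its gradient at $\mathbf{x}$. For a convex but nonsmooth $f$, $\partial f(\mathbf{x}) = \{\mathbf{u} : f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \mathbf{u}, \mathbf{y} - \mathbf{x} \rangle \text{ for all } \mathbf{y}\}$ is its subdifferential at $\mathbf{x}$, and $\mathbf{u} \in \partial f(\mathbf{x})$ is a subgradient.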
2 Related Works
In this section, we review some popular algorithms for solving (1). Here, $f$ is assumed to be Lipschitz smooth.
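2.1 Concave-Convex Procedure (CCCP)
The concave-convex procedure (CCCP) (Yuille and Rangarajan, 2002; Lu, 2012) is a popular and general solver for (1). It assumes that $F$ can be decomposed as a difference of convex (DC) functions (Hiriart-Urruty, 1985), i.e., $F = F_{\text{vex}} + F_{\text{cav}}$, where $F_{\text{vex}}$ is convex and $F_{\text{cav}}$ is concave. In each CCCP iteration, $F_{\text{cav}}$ is linearized at $\mathbf{x}_t$, and $\mathbf{x}_{t+1}$ is generated as

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x}} F_{\text{vex}}(\mathbf{x}) + F_{\text{cav}}(\mathbf{x}_t) - (\mathbf{x} - \mathbf{x}_t)^\top \mathbf{s}_t, \tag{2}$$

where $\mathbf{s}_t \in \partial[-F_{\text{cav}}(\mathbf{x}_t)]$ is a subgradient. Note that as the last two terms are linear, (2) is a convex problem and can be easier than the original problem.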
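However, CCCP is expensive as (2) needs to be exactly solved. Sequential convex programming (SCP) (Lu, 2012) improves its efficiency when $F$ is in the form of (1). It assumes that $f$ is $L$-Lipschitz smooth (possibly nonconvex); while $g$ can be nonconvex, but admits a DC decomposition as $g = g_{\text{vex}} + g_{\text{cav}}$. It then generates $\mathbf{x}_{t+1}$ as

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x}} f(\mathbf{x}_t) + (\mathbf{x} - \mathbf{x}_t)^\top \nabla f(\mathbf{x}_t) + \frac{L}{2}\|\mathbf{x} - \mathbf{x}_t\|_2^2 + g_{\text{vex}}(\mathbf{x}) - (\mathbf{x} - \mathbf{x}_t)^\top \mathbf{s}_t, \tag{3}$$

where $\mathbf{s}_t \in \partial[-g_{\text{cav}}(\mathbf{x}_t)]$.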
2.2 Proximal Algorithm
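The proximal algorithm (Parikh and Boyd, 2013) has been popularly used for optimization problems of the form in (1). Let $f$ be convex and $L$-Lipschitz smooth, and $g$ be convex. The proximal algorithm generates iterates as

$$\mathbf{x}_{t+1} = \text{prox}_{\eta g}(\mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)), \tag{4}$$

where $\eta = 1/L$ is the stepsize and $\text{prox}_{\eta g}(\mathbf{z}) \equiv \arg\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{x} - \mathbf{z}\|_2^2 + \eta g(\mathbf{x})$ is the proximal step.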
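As a concrete instance of this iteration, the following minimal sketch applies the proximal algorithm to the lasso, i.e., a least-squares loss with an $\ell_1$ regularizer, whose proximal step is the soft-thresholding operator. The stepsize $1/L$, the function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal step of tau * ||x||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    eta = 1.0 / L                      # assumed stepsize
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)       # gradient of the smooth loss
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

def lasso_objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Small synthetic problem with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = proximal_gradient_lasso(A, b, lam=0.1)
print("objective:", lasso_objective(A, b, 0.1, x_hat))
```

Each iteration costs one gradient evaluation plus an elementwise shrinkage, which is why a closed-form proximal step is so valuable in practice.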
Recently, the proximal algorithm has been extended to nonconvex optimization. In particular, NIPS (Sra, 2012), IPiano (Ochs et al., 2014) and UAG (Ghadimi and Lan, 2016) allow $f$ to be nonconvex, while $g$ is still required to be convex. GIST (Gong et al., 2013), IFB (Bot et al., 2016) and nmAPG (Li and Lin, 2015) further remove this restriction and allow $g$ to be nonconvex. It is desirable that the proximal step has a closed-form solution. This is true for many convex regularizers such as the lasso regularizer (Tibshirani, 1996), tree-structured lasso regularizer (Liu and Ye, 2010; Jenatton et al., 2011) and sparse group lasso regularizer (Jacob et al., 2009). However, when $g$ is nonconvex, such a solution only exists for some simple $g$, e.g., the nonconvex lasso regularizer (Gong et al., 2013), and usually does not exist in more general cases, e.g., the nonconvex tree-structured lasso regularizer (Zhong and Kwok, 2014).
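When the regularizer is a combination of several simple regularizers, the proximal average (Zhong and Kwok, 2014) can be used instead. Each of the constituent proximal steps can be computed inexpensively, and thus the per-iteration complexity is low. It only converges to an approximate solution, but an approximation guarantee is provided. However, empirically, the convergence can be slow.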
2.3 Frank-Wolfe (FW) Algorithm
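The FW algorithm (Frank and Wolfe, 1956) is used for solving optimization problems of the form

$$\min_{\mathbf{x}} f(\mathbf{x}) \;:\; \mathbf{x} \in \mathcal{C},$$

where $f$ is Lipschitz-smooth and convex, and $\mathcal{C}$ is a compact convex set. Recently, it has been popularly used in machine learning (Jaggi, 2013). In each iteration, the FW algorithm generates the next iterate $\mathbf{x}_{t+1}$ as

$$\mathbf{s}_t = \arg\min_{\mathbf{s} \in \mathcal{C}} \mathbf{s}^\top \nabla f(\mathbf{x}_t), \tag{5}$$
$$\gamma_t = \arg\min_{\gamma \in [0,1]} f((1-\gamma)\mathbf{x}_t + \gamma \mathbf{s}_t), \tag{6}$$
$$\mathbf{x}_{t+1} = (1-\gamma_t)\mathbf{x}_t + \gamma_t \mathbf{s}_t. \tag{7}$$

Here, (5) is a linear subproblem which can often be easily solved; (6) performs line search, and the next iterate $\mathbf{x}_{t+1}$ is generated from a convex combination of $\mathbf{x}_t$ and $\mathbf{s}_t$ in (7). The FW algorithm has a convergence rate of $O(1/T)$ (Jaggi, 2013).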
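The three steps (linear subproblem, line search, convex combination) can be sketched as follows for a least-squares loss over an $\ell_1$-ball. The choice of constraint set, the exact line search for a quadratic loss and all names are assumptions made for illustration; they are not taken from the algorithms discussed in this paper.

```python
import numpy as np

def frank_wolfe_l1(A, b, radius, n_iter=200):
    """Minimize 0.5 * ||A x - b||^2 subject to ||x||_1 <= radius."""
    x = np.zeros(A.shape[1])
    gaps = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        # Linear subproblem: a vertex of the l1-ball minimizing <s, grad>.
        i = int(np.argmax(np.abs(grad)))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(grad[i])
        d = s - x
        gap = float(-grad @ d)         # FW duality gap, an optimality certificate
        gaps.append(gap)
        if gap <= 1e-10:
            break
        # Exact line search for a quadratic loss, clipped to [0, 1].
        Ad = A @ d
        gamma = min(1.0, gap / float(Ad @ Ad))
        # Convex combination of the current iterate and the vertex.
        x = x + gamma * d
    return x, gaps

def ls_objective(A, b, x):
    return 0.5 * np.sum((A @ x - b) ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
x_fw, gaps = frank_wolfe_l1(A, b, radius=1.0)
print("final gap:", gaps[-1], "l1 norm:", np.sum(np.abs(x_fw)))
```

Because every iterate is a convex combination of feasible points, no projection is ever needed; the price is the sublinear rate mentioned above.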
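In this paper, we will focus on using the FW algorithm to learn a low-rank matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$. Without loss of generality, we assume that $m \le n$. Let $\sigma_i$'s be the singular values of $\mathbf{X}$. The nuclear norm of $\mathbf{X}$, $\|\mathbf{X}\|_* = \sum_{i=1}^m \sigma_i$, is the tightest convex envelope of $\text{rank}(\mathbf{X})$, and is often used as a low-rank regularizer (Candès and Recht, 2009). The low-rank matrix learning problem can be written as

$$\min_{\mathbf{X}} f(\mathbf{X}) + \mu \|\mathbf{X}\|_*, \tag{8}$$

where $f$ is the loss. For example, in matrix completion (Candès and Recht, 2009),

$$f(\mathbf{X}) = \frac{1}{2}\|P_{\Omega}(\mathbf{X} - \mathbf{O})\|_F^2, \tag{9}$$

where $\mathbf{O}$ is the observed incomplete matrix, $\Omega \in \{0,1\}^{m \times n}$ contains indices to the observed entries in $\mathbf{O}$, and $[P_{\Omega}(\mathbf{A})]_{ij} = A_{ij}$ if $\Omega_{ij} = 1$, and 0 otherwise.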
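The FW algorithm for this nuclear norm regularized problem is shown in Algorithm 1 (Zhang et al., 2012). Let the iterate at the $t$th iteration be $\mathbf{X}_t$. As in (5), the following linear subproblem has to be solved (Jaggi, 2013):

$$\min_{\mathbf{S} : \|\mathbf{S}\|_* \le 1} \langle \mathbf{S}, \nabla f(\mathbf{X}_t) \rangle. \tag{10}$$

This can be obtained from the rank-one SVD of $\nabla f(\mathbf{X}_t)$ (step 3). Similar to (6), line search is performed at step 4. As a rank-one matrix is added into $\mathbf{X}_t$ in each iteration, it is convenient to write $\mathbf{X}_t$ as

$$\mathbf{X}_t = \sum_{i=1}^{t} \mathbf{u}_i \mathbf{v}_i^\top = \mathbf{U}_t \mathbf{V}_t^\top, \tag{11}$$

where $\mathbf{U}_t = [\mathbf{u}_1, \dots, \mathbf{u}_t]$ and $\mathbf{V}_t = [\mathbf{v}_1, \dots, \mathbf{v}_t]$. The FW algorithm has a convergence rate of $O(1/T)$ (Jaggi, 2013). To make it empirically faster, Algorithm 1 also performs local optimization at step 6 (Laue, 2012; Zhang et al., 2012). Substituting $\|\mathbf{X}\|_* = \min_{\mathbf{X} = \mathbf{U}\mathbf{V}^\top} \frac{1}{2}(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2)$ (Srebro et al., 2004) into (8), we have the following local optimization problem:

$$\min_{\mathbf{U}, \mathbf{V}} f(\mathbf{U}\mathbf{V}^\top) + \frac{\mu}{2}\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right). \tag{12}$$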
This can be solved by standard solvers such as L-BFGS (Nocedal and Wright, 2006).
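To make the linear subproblem over the nuclear-norm ball concrete, the sketch below computes its minimizer from the leading singular vectors of the gradient of a matrix completion loss. It is an illustration under stated assumptions: a full SVD is used for brevity (a practical implementation would typically use a power or Lanczos method to obtain only the leading singular pair), and the function names and the 0/1 `mask` encoding of the observed entries are hypothetical.

```python
import numpy as np

def completion_grad(X, O, mask):
    """Gradient of 0.5 * ||mask * (X - O)||_F^2 with a 0/1 observation mask."""
    return mask * (X - O)

def nuclear_ball_lmo(G):
    """Minimizer of <S, G> over {S : ||S||_* <= 1}, i.e. -u1 v1^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -np.outer(U[:, 0], Vt[0])

rng = np.random.default_rng(0)
O = rng.standard_normal((6, 4))
mask = (rng.random((6, 4)) < 0.6).astype(float)
X = np.zeros((6, 4))
G = completion_grad(X, O, mask)
S = nuclear_ball_lmo(G)
# The optimal value equals minus the largest singular value of the gradient.
print("<S, G> =", np.sum(S * G), "  -sigma_max =", -np.linalg.norm(G, 2))
```

Only the leading singular pair of the gradient is needed, which is what makes each FW iteration cheap compared with the full SVD required by the proximal step of the nuclear norm.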
2.4 Alternating Direction Method of Multipliers (ADMM)
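ADMM is a simple but powerful algorithm first introduced in the 1970s (Glowinski and Marroco, 1975). Recently, it has been popularly used in diverse fields such as machine learning, data mining and image processing (Boyd et al., 2011). It can be used to solve optimization problems of the form

$$\min_{\mathbf{x}, \mathbf{y}} f(\mathbf{x}) + g(\mathbf{y}) \;:\; \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{y} = \mathbf{c}, \tag{13}$$

where $f, g$ are convex functions, and $\mathbf{A}, \mathbf{B}$ (resp. $\mathbf{c}$) are constant matrices (resp. vector) of appropriate sizes. Consider the augmented Lagrangian $\mathcal{L}(\mathbf{x}, \mathbf{y}, \mathbf{u}) = f(\mathbf{x}) + g(\mathbf{y}) + \mathbf{u}^\top(\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{y} - \mathbf{c}) + \frac{\tau}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{y} - \mathbf{c}\|_2^2$, where $\mathbf{u}$ is the vector of Lagrangian multipliers, and $\tau > 0$ is a penalty parameter. At the $t$th iteration of ADMM, the values of $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{u}$ are updated as

$$\mathbf{x}_{t+1} = \arg\min_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \mathbf{y}_t, \mathbf{u}_t), \quad \mathbf{y}_{t+1} = \arg\min_{\mathbf{y}} \mathcal{L}(\mathbf{x}_{t+1}, \mathbf{y}, \mathbf{u}_t), \tag{14}$$
$$\mathbf{u}_{t+1} = \mathbf{u}_t + \tau(\mathbf{A}\mathbf{x}_{t+1} + \mathbf{B}\mathbf{y}_{t+1} - \mathbf{c}). \tag{15}$$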
In this paper, we will focus on a special case of (13), namely, the consensus optimization problem:

$$\min_{\mathbf{y}, \mathbf{x}_1, \dots, \mathbf{x}_M} \sum_{i=1}^{M} f_i(\mathbf{x}_i) + g(\mathbf{y}) \;:\; \mathbf{x}_1 = \dots = \mathbf{x}_M = \mathbf{y}. \tag{16}$$

Here, each $f_i$ is Lipschitz-smooth, $\mathbf{x}_i$ is the variable in the local objective $f_i$, and $\mathbf{y}$ is the global consensus variable. This type of problem is often encountered in machine learning, signal processing and wireless communication (Bertsekas and Tsitsiklis, 1989; Boyd et al., 2011). For example, in regularized risk minimization, $\mathbf{y}$ is the model parameter, $f_i$ is the regularized risk functional defined on data subset $i$, and $g$ is the regularizer. When $f_i$ is smooth and $g$ is convex, ADMM converges to a critical point of (16) (Hong et al., 2016). However, when $g$ is nonconvex, its convergence is still an open issue.
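For a convex instance of the consensus problem, the ADMM updates can be sketched as below, with least-squares local objectives and an $\ell_1$ regularizer on the consensus variable. The sketch uses the scaled form of the multipliers (as in Boyd et al., 2011); the penalty value `tau=20.0`, the data and all function names are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def consensus_admm_lasso(As, bs, lam, tau=20.0, n_iter=300):
    """Consensus ADMM for sum_i 0.5*||A_i x_i - b_i||^2 + lam*||y||_1, x_i = y."""
    M, d = len(As), As[0].shape[1]
    xs = np.zeros((M, d))              # local variables x_i
    us = np.zeros((M, d))              # scaled multipliers u_i / tau
    y = np.zeros(d)                    # global consensus variable
    # Pre-invert the small local systems (acceptable for small d).
    invs = [np.linalg.inv(A.T @ A + tau * np.eye(d)) for A in As]
    Atb = [A.T @ b for A, b in zip(As, bs)]
    for _ in range(n_iter):
        for i in range(M):             # local updates (parallelizable)
            xs[i] = invs[i] @ (Atb[i] + tau * (y - us[i]))
        # Consensus update: proximal step of the regularizer at the average.
        y = soft_threshold((xs + us).mean(axis=0), lam / (tau * M))
        us += xs - y                   # multiplier update
    return xs, y

rng = np.random.default_rng(0)
x_true = np.zeros(10)
x_true[:2] = [1.0, -2.0]
As = [rng.standard_normal((30, 10)) for _ in range(3)]
bs = [A @ x_true + 0.01 * rng.standard_normal(30) for A in As]
xs, y = consensus_admm_lasso(As, bs, lam=1.0)
print("max consensus violation:", np.max(np.abs(xs - y)))
```

The regularizer enters only through its proximal step in the consensus update, which is where a convexified regularizer would be substituted.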
3 Shifting Nonconvexity from Regularizer to Loss
In recent years, a number of nonconvex regularizers have been proposed. Examples include the Geman penalty (GP) (Geman and Yang, 1995), log-sum penalty (LSP) (Candès et al., 2008) and Laplace penalty (Trzasko and Manduca, 2009). In general, learning with nonconvex regularizers is much more difficult than learning with convex regularizers. In this section, we show how to move the nonconvex component from the nonconvex regularizers to the loss function. Existing algorithms can then be reused to learn with the convexified regularizers.
First, we make the following standard assumptions on (1).
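A1. $F$ is bounded from below and $\lim_{\|\mathbf{x}\|_2 \to \infty} F(\mathbf{x}) = \infty$;
A2. $f$ is $L$-Lipschitz smooth (i.e., $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L\|\mathbf{x} - \mathbf{y}\|_2$), but possibly nonconvex.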
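Let $\kappa$ be a function that is concave, non-decreasing, $\rho$-Lipschitz smooth with $\kappa'$ non-differentiable at finite points, and $\kappa(0) = 0$. With the exception of the capped-$\ell_1$ norm penalty (Zhang, 2010a) and the $\ell_0$-norm regularizer, all regularizers in Table 1 satisfy the requirements on $\kappa$. We consider $g$ of the following forms.
C1. $g(\mathbf{x}) = \sum_{i=1}^{K} \mu_i g_i(\mathbf{x})$, where $\mu_i \ge 0$, $\mathbf{A}_i$ is a matrix, and

$$g_i(\mathbf{x}) = \kappa(\|\mathbf{A}_i \mathbf{x}\|_2). \tag{17}$$

C2. $g(\mathbf{X}) = \mu \sum_{i=1}^{m} \kappa(\sigma_i(\mathbf{X}))$, where $\mathbf{X}$ is a matrix and $\mu \ge 0$. When $\kappa$ is the identity function, $g$ reduces to the nuclear norm.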
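First, consider $g$ in C1. Rewrite each nonconvex $g_i$ in (17) as

$$g_i(\mathbf{x}) = \bar{g}_i(\mathbf{x}) + \kappa_0 \|\mathbf{A}_i \mathbf{x}\|_2, \tag{18}$$

where $\kappa_0 = \kappa'(0)$, and $\bar{g}_i(\mathbf{x}) = \kappa(\|\mathbf{A}_i \mathbf{x}\|_2) - \kappa_0 \|\mathbf{A}_i \mathbf{x}\|_2$, with $\kappa$ the concave function defined above. Obviously, $\kappa_0 \|\mathbf{A}_i \mathbf{x}\|_2$ is convex but nonsmooth. The following shows that $\bar{g}_i$, though nonconvex, is concave and Lipschitz smooth. In the sequel, a function with a bar on top (e.g., $\bar{f}$) denotes that it is smooth; whereas a function with a breve (e.g., $\breve{f}$) denotes that it may be nonsmooth.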
is concave and -Lipschitz smooth.
is concave and Lipschitz smooth with modulus .
Corollary 3. $g$ can be decomposed as $g = \bar{g} + \breve{g}$, where $\bar{g}$ is concave and Lipschitz-smooth, while $\breve{g}$ is convex but nonsmooth.
When , where is the unit vector for dimension , and
Using Corollary 3, can be decomposed as , where is concave and -Lipschitz smooth, while is convex and nonsmooth. When and , an illustration of , and for the various nonconvex regularizers is shown in Figure 1. When is the identity function and , in (19) reduces to the lasso regularizer .
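With the decomposition $g = \bar{g} + \breve{g}$, problem (1) can be rewritten as

$$\min_{\mathbf{x}} \bar{f}(\mathbf{x}) + \breve{g}(\mathbf{x}),$$

where $\bar{f} = f + \bar{g}$. Note that $\bar{f}$ (which can be viewed as an augmented loss) is Lipschitz smooth while $\breve{g}$ (viewed as a convexified regularizer) is convex but possibly nonsmooth. In other words, nonconvexity is shifted from the regularizer $g$ to the loss $f$, while ensuring that the augmented loss is smooth.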
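Any $g$ in C2 can be decomposed as $\bar{g} + \breve{g}$, where $\bar{g}(\mathbf{X}) = \mu \sum_{i=1}^{m} \left[\kappa(\sigma_i(\mathbf{X})) - \kappa'(0)\,\sigma_i(\mathbf{X})\right]$ is concave and Lipschitz smooth, while $\breve{g}(\mathbf{X}) = \kappa'(0)\,\mu \|\mathbf{X}\|_*$ is convex and nonsmooth.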
Since $\bar{g}$ is concave and $\breve{g}$ is convex, the nonconvex regularizer $g$ can be viewed as a difference of convex functions (DC) (Hiriart-Urruty, 1985). Lu (2012), Gong et al. (2013) and Zhong and Kwok (2014) also relied on DC decompositions of the nonconvex regularizer. However, they do not exploit the decomposition in their computational procedures, while we use it to simplify the regularizers. As will be seen, though the DC decomposition of a nonconvex function is not unique in general, the particular one proposed here is crucial for efficient optimization.
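The decomposition can be checked numerically in the scalar case. The sketch below uses the log-sum penalty in an assumed parameterization, $\kappa(\alpha) = \beta \log(1 + \alpha/\theta)$, so that the slope at zero is $\beta/\theta$; the names `g_bar` and `g_breve` and the parameter values are illustrative only. It splits $\kappa(|x|)$ into a part that is concave and smooth at zero and a scaled absolute value.

```python
import numpy as np

beta, theta = 1.0, 0.5                 # assumed LSP parameters
kappa0 = beta / theta                  # slope of kappa at zero

def kappa(a):
    return beta * np.log1p(a / theta)

def g(x):                              # nonconvex regularizer kappa(|x|)
    return kappa(np.abs(x))

def g_breve(x):                        # convexified part: scaled l1 penalty
    return kappa0 * np.abs(x)

def g_bar(x):                          # remainder moved into the loss
    return g(x) - g_breve(x)

x = np.linspace(-3.0, 3.0, 601)
h = 1e-6
# One-sided slopes at zero: g has a kink there, g_bar does not.
print("slope of g at 0+    :", float(g(np.array([h]))[0] / h))
print("slope of g_bar at 0+:", float(g_bar(np.array([h]))[0] / h))
print("max second difference of g_bar:", float(np.max(np.diff(g_bar(x), 2))))
```

The kink at zero, which is what induces sparsity, stays entirely in the convex part, so standard $\ell_1$ machinery can handle it while the smooth concave remainder is absorbed into the loss.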
4 Example Use Cases
In this section, we provide concrete examples to show how the proposed convexification scheme can be used with various optimization algorithms. An overview is summarized in Table 2.
| algorithm | section | advantages |
| proximal algorithm | 4.1, 4.6 | cheaper proximal step |
| FW algorithm | 4.2 | cheaper linear subproblem |
| (consensus) ADMM | 4.3 | cheaper proximal step; provide convergence guarantee |
| SVRG | 4.4 | cheaper proximal step; provide convergence guarantee |
| mOWL-QN | 4.5 | simpler analysis; capture curvature information |
4.1 Proximal Algorithms
In this section, we provide example applications of the proximal algorithm to nonconvex structured sparse learning. The proximal algorithm has been commonly used for learning with convex regularizers (Parikh and Boyd, 2013). With a nonconvex regularizer, the underlying proximal step becomes much more challenging. Gong et al. (2013), Li and Lin (2015) and Bot et al. (2016) extended the proximal algorithm to simple nonconvex $g$, but cannot handle more complicated nonconvex regularizers such as the tree-structured lasso regularizer (Liu and Ye, 2010; Schmidt et al., 2011), sparse group lasso regularizer (Jacob et al., 2009) and total variation regularizer (Nikolova, 2004). Using the proximal average (Bauschke et al., 2008), Zhong and Kwok (2014) can handle nonconvex regularizers of the form $g = \sum_{i=1}^{K} \mu_i g_i$, where each $g_i$ is simple. However, the solutions obtained are only approximate. General nonconvex optimization techniques such as the concave-convex procedure (CCCP) (Yuille and Rangarajan, 2002) or its variant sequential convex programming (SCP) (Lu, 2012) can also be used, though they are slow in general (Gong et al., 2013; Zhong and Kwok, 2014).
Using the proposed transformation, one only needs to solve the proximal step of a standard convex regularizer instead of that of a nonconvex regularizer. This allows reuse of existing solutions for the proximal step and is much less expensive. As proximal algorithms have the same convergence guarantee for convex and nonconvex $f$ (Gong et al., 2013; Li and Lin, 2015), solving the transformed problem can be much faster. The following gives some specific examples.
4.1.1 Nonconvex Sparse Group Lasso
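In sparse group lasso, the feature vector $\mathbf{x} \in \mathbb{R}^d$ is divided into groups. Assume that group $\mathcal{G}_j$ contains a subset of the dimensions of $\mathbf{x}$. Let $[\mathbf{x}_{\mathcal{G}_j}]_i = x_i$ if $i \in \mathcal{G}_j$, and 0 otherwise. Given training samples $\{(\mathbf{a}_1, y_1), \dots, (\mathbf{a}_N, y_N)\}$, (convex) sparse group lasso is formulated as (Jacob et al., 2009):

$$\min_{\mathbf{x}} \sum_{i=1}^{N} \ell(y_i, \mathbf{a}_i^\top \mathbf{x}) + \lambda \|\mathbf{x}\|_1 + \sum_{j=1}^{K} \mu_j \|\mathbf{x}_{\mathcal{G}_j}\|_2, \tag{22}$$

where $\ell$ is a smooth loss, and $K$ is the number of (non-overlapping) groups.
For the nonconvex extension, the regularizer becomes

$$g(\mathbf{x}) = \lambda \sum_{i=1}^{d} \kappa(|x_i|) + \sum_{j=1}^{K} \mu_j \kappa(\|\mathbf{x}_{\mathcal{G}_j}\|_2). \tag{23}$$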
Using Corollary 3 and Remark 3, the convexified regularizer is a scaled version of the standard (convex) sparse group lasso regularizer. Its proximal step can be easily computed by the algorithm in (Yuan et al., 2011). Specifically, the proximal operator can be obtained by computing the proximal step for each group separately. This can then be used with any proximal algorithm that can handle nonconvex objectives (as the augmented loss is nonconvex). In particular, we will adopt the state-of-the-art nonmonotone APG (nmAPG) algorithm (Li and Lin, 2015) (shown in Algorithm 2). Note that nmAPG cannot be directly used with the nonconvex regularizer in (23), as the corresponding proximal step has no inexpensive closed-form solution.
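For non-overlapping groups, the proximal step of the convex sparse group lasso regularizer has a well-known closed form: elementwise soft-thresholding followed by a groupwise shrinkage. The sketch below illustrates this composition; the function names, the index-list representation of groups and the parameter names `lam`, `mus` and `eta` are assumptions for illustration, and the code is not a reproduction of the algorithm in (Yuan et al., 2011).

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_sparse_group_lasso(z, groups, lam, mus, eta=1.0):
    """Prox of eta * (lam*||x||_1 + sum_j mus[j]*||x_Gj||_2), disjoint groups."""
    v = soft_threshold(z, eta * lam)   # handles the l1 part
    x = v.copy()
    for idx, mu in zip(groups, mus):   # each group is processed separately
        norm = np.linalg.norm(v[idx])
        scale = max(0.0, 1.0 - eta * mu / norm) if norm > 0 else 0.0
        x[idx] = scale * v[idx]
    return x

def sgl_prox_objective(x, z, groups, lam, mus, eta=1.0):
    reg = lam * np.sum(np.abs(x)) + sum(mu * np.linalg.norm(x[idx])
                                        for idx, mu in zip(groups, mus))
    return 0.5 * np.sum((x - z) ** 2) + eta * reg

rng = np.random.default_rng(0)
z = rng.standard_normal(6)
groups = [[0, 1, 2], [3, 4, 5]]
x_prox = prox_sparse_group_lasso(z, groups, lam=0.3, mus=[0.5, 0.5])
print("prox output:", x_prox)
```

The cost is linear in the dimension, which is the kind of inexpensive proximal step that the transformed problem inherits from the convex regularizer.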
As mentioned in Section 3, the proposed decomposition of the nonconvex regularizer can be regarded as a DC decomposition, which is not unique in general. For example, we might try to add a quadratic term to convexify the nonconvex regularizer. Specifically, we can decompose in (23) as , where
and . It can be easily shown that is concave, and Proposition 4.1.1 shows that is convex. Thus, can be transformed as , where is Lipschitz-smooth, and is convex but nonsmooth. However, the proximal step associated with has no simple closed-form solution.
4.1.2 Nonconvex Tree-Structured Group Lasso
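In (convex) tree-structured group lasso (Liu and Ye, 2010; Jenatton et al., 2011), the dimensions in $\mathbf{x}$ are organized as nodes in a tree, and each group corresponds to a subtree. The regularizer is of the form $\sum_{j=1}^{K} \mu_j \|\mathbf{x}_{\mathcal{G}_j}\|_2$, where $\mathbf{x}_{\mathcal{G}_j}$ denotes $\mathbf{x}$ restricted to group $\mathcal{G}_j$. Interested readers are referred to (Liu and Ye, 2010) for details.
For the nonconvex extension, $g(\mathbf{x})$ becomes $\sum_{j=1}^{K} \mu_j \kappa(\|\mathbf{x}_{\mathcal{G}_j}\|_2)$. Again, there is no closed-form solution of its proximal step. On the other hand, the convexified regularizer is $\breve{g}(\mathbf{x}) = \kappa'(0) \sum_{j=1}^{K} \mu_j \|\mathbf{x}_{\mathcal{G}_j}\|_2$. As shown in (Liu and Ye, 2010), its proximal step can be computed efficiently by processing all the groups once in some appropriate order.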
4.1.3 Nonconvex Total Variation (TV) Regularizer
In an image, nearby pixels are usually strongly correlated. The TV regularizer captures such behavior by assuming that changes between nearby pixels are small. Given an image $\mathbf{X} \in \mathbb{R}^{m \times n}$, the TV regularizer is defined as $\text{TV}(\mathbf{X}) = \|\mathbf{D}_v \mathbf{X}\|_1 + \|\mathbf{X} \mathbf{D}_h\|_1$ (Nikolova, 2004), where $\mathbf{D}_h$ and $\mathbf{D}_v$ are the horizontal and vertical partial derivative operators, respectively. It is popular in image processing problems, such as image denoising and deconvolution (Nikolova, 2004; Beck and Teboulle, 2009).
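As in previous sections, the nonconvex extension of the TV regularizer can be defined as

$$g(\mathbf{X}) = \mu \sum_{i,j} \left[ \kappa(|[\mathbf{D}_v \mathbf{X}]_{ij}|) + \kappa(|[\mathbf{X} \mathbf{D}_h]_{ij}|) \right]. \tag{25}$$

Again, it is not clear how its proximal step can be efficiently computed. However, with the proposed transformation, the transformed problem is

$$\min_{\mathbf{X}} f(\mathbf{X}) + \bar{g}(\mathbf{X}) + \kappa'(0)\, \mu\, \text{TV}(\mathbf{X}), \tag{26}$$

where $\mu$ is the regularization parameter, and $\bar{g}$ is concave and Lipschitz smooth. One then only needs to compute the proximal step of the standard TV regularizer.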
However, unlike the proximal steps in Sections 4.1.1 and 4.1.2, the proximal step of the TV regularizer has no closed-form solution and needs to be solved iteratively. In this case, Schmidt et al. (2011) showed that using inexact proximal steps can make proximal algorithms faster. However, they only considered the situation where both $f$ and $g$ are convex. In the following, we extend nmAPG (Algorithm 2), which can be used with nonconvex objectives, to allow for inexact proximal steps (steps 5 and 9 of Algorithm 3). However, Lemma 2 of (Li and Lin, 2015), which is key to the convergence of nmAPG, no longer holds due to the inexact proximal step. To fix this problem, step 6 of Algorithm 3 uses a modified quantity in place of the one used in Algorithm 2. Besides, we also drop the comparison originally performed in step 9 of Algorithm 2.
Inexactness of the proximal step can be controlled as follows. Let $h$ be the objective of the proximal step. As $\breve{g}$ is convex, $h$ is also convex. Let $\tilde{\mathbf{x}}$ be an inexact solution of this proximal step. The inexactness is upper-bounded by the duality gap $h(\tilde{\mathbf{x}}) - \mathcal{D}(\mathbf{w})$, where $\mathcal{D}$ is the dual of $h$, and $\mathbf{w}$ is the corresponding dual variable. In step 5 (resp. step 9) of Algorithm 3, we solve the proximal step until its duality gap is smaller than a given threshold. The following theorem shows convergence of Algorithm 3.
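The idea of stopping an iterative proximal solver once its duality gap falls below a threshold can be sketched for a one-dimensional TV proximal step. The sketch below uses projected gradient ascent on the dual, which is an assumption made for simplicity and not the solver used in this paper; the function names, the forward-difference operator and the test signal are likewise illustrative.

```python
import numpy as np

def prox_objective(x, z, lam):
    """Primal objective 0.5*||x - z||^2 + lam * sum_i |x[i+1] - x[i]|."""
    return 0.5 * np.sum((x - z) ** 2) + lam * np.sum(np.abs(np.diff(x)))

def tv1d_prox_inexact(z, lam, tol, max_iter=10000):
    """Approximate 1-D TV prox; stops when the duality gap is at most tol."""
    n = len(z)
    D = np.diff(np.eye(n), axis=0)     # (n-1) x n forward-difference operator
    w = np.zeros(n - 1)                # dual variable, constrained to [-lam, lam]
    step = 0.25                        # 1 / upper bound on ||D D^T||_2
    for _ in range(max_iter):
        x = z - D.T @ w                # primal point induced by the dual variable
        dual = w @ (D @ z) - 0.5 * np.sum((D.T @ w) ** 2)
        gap = prox_objective(x, z, lam) - dual
        if gap <= tol:
            break
        # Projected gradient ascent step on the dual objective.
        w = np.clip(w + step * (D @ x), -lam, lam)
    return x, gap

rng = np.random.default_rng(0)
z = np.concatenate([np.zeros(10), np.ones(10)]) + 0.1 * rng.standard_normal(20)
x_loose, gap_loose = tv1d_prox_inexact(z, lam=0.5, tol=1e-1)
x_tight, gap_tight = tv1d_prox_inexact(z, lam=0.5, tol=1e-8)
print("loose gap:", gap_loose, " tight gap:", gap_tight)
```

By weak duality, the returned gap upper-bounds the suboptimality of the returned point, so a loose tolerance gives a cheap proximal step whose error is still explicitly controlled.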
If the proximal step is exact, can be used to measure how far is from a critical point (Gong et al., 2013; Ghadimi and Lan, 2016). In Algorithm 3, the proximal step is inexact, and is an inexact solution to , where if step 7 is executed, and if step 9 is executed. As converges to a critical point of (1), we propose using to measure how far is from a critical point. The following Proposition shows a convergence rate on .
(i) ; and (ii) converges to zero at a rate of .
Note that the (exact) nmAPG in Algorithm 2 cannot handle the nonconvex $g$ in (25) efficiently, as the corresponding proximal step has no closed-form solution but has to be solved exactly. Even the proposed inexact nmAPG (Algorithm 3) cannot be directly used with nonconvex $g$. As the dual of the nonconvex proximal step is difficult to derive and the optimal duality gap is nonzero in general, the proximal step's inexactness cannot be easily controlled.
4.2 Frank-Wolfe Algorithm
In this section, we use the Frank-Wolfe algorithm to learn a low-rank matrix for matrix completion as reviewed in Section 2.3. The nuclear norm regularizer in (8) may over-penalize the top singular values. Recently, there has been growing interest in replacing it with nonconvex regularizers (Lu et al., 2014, 2015; Yao et al., 2015; Gui et al., 2016). Hence, instead of (8), we consider
and . This only involves the standard nuclear norm regularizer. However, Algorithm 1 still cannot be used as in (28) is no longer convex. A FW variant allowing nonconvex is proposed in (Bredies et al., 2009). However, condition 1 in (Bredies et al., 2009) requires to satisfy . Such condition does not hold with in (27) as
In the following, we propose a nonconvex FW variant (Algorithm 4) for the transformed problem (27). It is similar to Algorithm 1, but with three important modifications. First, in (28) depends on the singular values of , which cannot be directly obtained from the factorization in (11). Instead, we use the low-rank factorization