Introduction
This paper aims to solve the following general problem:
(1) $\min_{x \in \mathbb{R}^n} F(x) = \lambda g(h(x)) + f(x)$,
where $\lambda > 0$ is a parameter, and the functions in the above formulation satisfy the following conditions:

(C1) $g$ is nonnegative, concave and increasing.

(C2) $h: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is a nonnegative multidimensional function, such that the following problem
(2) $\min_{x} \ \langle w, h(x) \rangle + \frac{\mu}{2}\|x - v\|_2^2$
is convex and can be cheaply solved for any given nonnegative $w$.

(C3) $f$ is a smooth function of type $C^{1,1}$, i.e., continuously differentiable with the Lipschitz continuous gradient
(3) $\|\nabla f(x) - \nabla f(y)\|_2 \leq L(f)\|x - y\|_2$, for any $x, y \in \mathbb{R}^n$;
$L(f)$ is called the Lipschitz constant of $\nabla f$.

(C4) $F(x) \rightarrow +\infty$ iff $\|x\|_2 \rightarrow +\infty$.
Note that problem (1) can be convex or nonconvex. Though $g$ is concave, $g(h(x))$ can be convex w.r.t. $x$. Also $g$ and $h$ are not necessarily smooth, and $f$ is not necessarily convex.
Based on different choices of $g$, $h$, and $f$, the general problem (1) involves many sparse representation models, which have many important applications in machine learning and computer vision [Wright et al.2009, Beck and Teboulle2009, Jacob, Obozinski, and Vert2009, Gong, Ye, and Zhang2012b]. For the choice of $f$, the least squares and logistic loss functions are the two most widely used ones which satisfy (C3):
(4) $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $f(x) = \sum_{i=1}^{s} \log\left(1 + \exp(-y_i a_i^T x)\right)$,
where $A = [a_1, \ldots, a_s]^T \in \mathbb{R}^{s \times n}$, $b \in \mathbb{R}^s$, and $y_i \in \{-1, 1\}$. As for the choice of $h$, $h(x) = |x|$ (absolute value of $x$, elementwise) and $h(x) = x^2$ (square of $x$, elementwise) are widely used. One may also use $h_i(X) = \|X_i\|_2$ ($X_i$ denotes the $i$-th column of $X$) when pursuing column sparsity of a matrix $X$. As for the choice of $g$, almost all the existing nonconvex surrogate functions of the $\ell_0$ norm are concave on $[0, +\infty)$. Elementwise, they include the $\ell_p$ penalty $g(u) = u^p$ ($0 < p < 1$) [Knight and Fu2000], the logarithm function [Candès, Wakin, and Boyd2008], the smoothly clipped absolute deviation (SCAD) [Fan and Li2001], and the minimax concave penalty (MCP) [Zhang2010].
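For concreteness, the concave penalties above can be sketched in a few lines. This is our own illustrative code, not an implementation from the paper; the names `g_lp`, `g_log`, `g_capped_l1` and the parameters `p`, `eps`, `theta` are our assumptions. Each function is nonnegative, concave and increasing on the nonnegative axis, and is meant to be applied to $u = h(x)$, e.g. $u = |x|$ elementwise.

```python
import numpy as np

def g_lp(u, p=0.5):
    """l_p penalty: g(u) = u**p with 0 < p < 1 (concave, increasing on u >= 0)."""
    return u ** p

def g_log(u, eps=1e-2):
    """Logarithm penalty used in reweighted l1 schemes: g(u) = log(u + eps) - log(eps)
    (shifted so that g(0) = 0, keeping the penalty nonnegative)."""
    return np.log(u + eps) - np.log(eps)

def g_capped_l1(u, theta=1.0):
    """Capped-l1 penalty: g(u) = min(u, theta); flat beyond the threshold theta."""
    return np.minimum(u, theta)
```

All three are concave surrogates of the $\ell_0$ indicator: they grow quickly near zero and flatten out for large inputs, which is what makes the reweighting schemes discussed later assign smaller weights to large entries.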
The above nonconvex penalties can be further extended to define structured sparsity [Jacob, Obozinski, and Vert2009]. For example, let $x = [x_{G_1}; x_{G_2}; \ldots; x_{G_K}]$ be partitioned into $K$ nonoverlapping groups. By taking $h(x) = [\|x_{G_1}\|_2; \ldots; \|x_{G_K}\|_2]$ and $g(u) = \sum_{i=1}^{K} \tilde{g}(u_i)$, with $\tilde{g}$ being any of the above concave functions, $\lambda \sum_{i=1}^{K} \tilde{g}(\|x_{G_i}\|_2)$ is the nonconvex group Lasso penalty. By taking $g(u) = \sum_{i=1}^{K} u_i$, $\lambda \sum_{i=1}^{K} \|x_{G_i}\|_2$ is the group Lasso penalty.
Problem (1) contains only one variable. We will show that our proposed model can be naturally used to handle problems with several variables (by which we mean groups of variables that can be updated simultaneously due to the separability structure of the problem). An example for multi-task learning can be found in [Gong, Ye, and Zhang2012b].
Related Works
If the condition (C3) holds and
(5) $x^{k+1} = \arg\min_x \ \lambda g(h(x)) + \langle \nabla f(x^k), x - x^k \rangle + \frac{\mu}{2}\|x - x^k\|_2^2$
can be cheaply computed, then problem (1) can be solved by iteratively solving a series of problems (5) [Gong et al.2013]. Such an updating procedure is the same as in the ISTA algorithm [Beck and Teboulle2009], which was originally designed for convex optimization. It can be proved that any accumulation point of $\{x^k\}$ is a stationary point of problem (1). If $g(h(x))$ and $f$ are convex, the fast ISTA algorithm (FISTA) [Beck and Teboulle2009] converges to the globally optimal solution with a convergence rate $O(1/k^2)$ ($k$ is the iteration number). But for nonconvex optimization, it is usually very difficult to obtain the globally optimal solution to problem (5). Sometimes, it is not easy even if problem (5) is convex.
The multi-stage algorithm in [Zhang2008] solves problem (1) by solving a series of convex problems
(6) $x^{k+1} = \arg\min_x \ \lambda \langle w^k, h(x) \rangle + f(x)$,
where $w^k$ is a weight depending on $x^k$. However, solving such a convex problem requires other iterative solvers, which is not efficient. It also fails when the subproblem (6) is nonconvex.
More specifically, the Iteratively Reweighted L1 (IRL1) [Chen and Zhou] and Iteratively Reweighted Least Squares (IRLS) [Lai, Xu, and Yin2013] algorithms are special cases of the multi-stage algorithm. They aim to solve the following $\ell_p$-regularization problem
(7) $\min_x \ \lambda \|x\|_p^p + \frac{1}{2}\|Ax - b\|_2^2$, where $\|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$, $0 < p < 1$.
The above problem is NP-hard. IRL1 instead considers the following relaxed problem
(8) $\min_x \ \lambda \sum_{i=1}^{n} (|x_i| + \epsilon)^p + \frac{1}{2}\|Ax - b\|_2^2$,
with $\epsilon > 0$. IRL1 updates $x^{k+1}$ by solving
(9) $x^{k+1} = \arg\min_x \ \lambda \sum_{i=1}^{n} w_i^k |x_i| + \frac{1}{2}\|Ax - b\|_2^2$,
with $w_i^k = p(|x_i^k| + \epsilon)^{p-1}$. Problem (8) is a special case of (1) by letting $g(u) = (u + \epsilon)^p$ ($0 < p < 1$, applied elementwise) and $h(x) = |x|$. However, IRL1 is not efficient since it has to solve a number of nonsmooth problems (9) by using some other convex optimization method, e.g., FISTA.
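The IRL1 weight update is a one-liner; the sketch below uses our own naming (`x_k`, `p`, `eps` are the iterate and the parameters of the relaxed problem) and is illustrative rather than the authors' code:

```python
import numpy as np

def irl1_weights(x_k, p=0.5, eps=1e-3):
    """Weights for the next weighted-l1 subproblem: the derivative of the
    concave penalty (|x| + eps)**p evaluated at the current iterate."""
    return p * (np.abs(x_k) + eps) ** (p - 1)
```

Because $p - 1 < 0$, entries of $x^k$ that are close to zero receive large weights, so they are penalized more heavily (and pushed harder toward zero) in the next subproblem, while large entries are penalized only lightly.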
The other method, IRLS, smooths problem (7) as
(10) $\min_x \ \lambda \sum_{i=1}^{n} (x_i^2 + \epsilon)^{p/2} + \frac{1}{2}\|Ax - b\|_2^2$,
and updates $x^{k+1}$ by solving
(11) $x^{k+1} = \arg\min_x \ \lambda \sum_{i=1}^{n} w_i^k x_i^2 + \frac{1}{2}\|Ax - b\|_2^2$,
with $w_i^k = \frac{p}{2}\left((x_i^k)^2 + \epsilon\right)^{p/2 - 1}$. Problem (10) is also a special case of (1) by taking $g(u) = (u + \epsilon)^{p/2}$ and $h(x) = x^2$ (elementwise). However, the solution obtained by IRLS may not be naturally sparse, or it may require a lot of iterations to get a sparse solution. One may perform thresholding appropriately to achieve a sparse solution, but there is no theoretically sound rule to suggest a correct threshold.
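A single IRLS iteration reduces to a weighted ridge-regression solve, which also shows why the iterates are dense. The sketch below is ours (names `lam`, `p`, `eps` are illustrative, and a dense normal-equations solve is used for clarity rather than efficiency):

```python
import numpy as np

def irls_step(A, b, x_k, lam=0.1, p=0.5, eps=1e-3):
    """One IRLS iteration: fix the weights at x_k, then minimize
    lam * sum_i w_i * x_i**2 + 0.5 * ||A x - b||**2,
    whose optimality condition is (A^T A + 2*lam*diag(w)) x = A^T b."""
    w = (p / 2.0) * (x_k ** 2 + eps) ** (p / 2.0 - 1.0)
    H = A.T @ A + 2.0 * lam * np.diag(w)
    return np.linalg.solve(H, A.T @ b)
```

The linear solve returns a generally dense vector, matching the remark above that IRLS iterates are not naturally sparse; sparsity only emerges in the limit as the weights on small entries blow up.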
Another related work is [Lu2012], which aims to solve
(12) $\min_x \ \lambda \|x\|_p^p + f(x)$.
In each iteration, $x$ is efficiently updated by solving a series of problems
(13) $x^{k+1} = \arg\min_x \ \lambda \sum_{i=1}^{n} w_i^k |x_i| + \langle \nabla f(x^k), x - x^k \rangle + \frac{\mu}{2}\|x - x^k\|_2^2$.
But their solver is only for problem (12), which is a special case of (1). The convergence proofs also depend on special properties of the $\ell_p$ norm, and thus are not general.
Furthermore, the previous iterative algorithms can only solve problems with a single variable. They cannot be naively generalized to solve multi-variable problems. However, there are many problems involving two or more variables, e.g., stable robust principal component analysis [Zhou et al.2010] and robust multi-task feature learning [Gong, Ye, and Zhang2012b]. So it is desirable to extend the iteratively reweighted algorithms to the multi-variable case.
Contributions
In this work, we propose a novel method to solve the general problem (1), and address the scalability and multi-variable issues. In each iteration we only need to solve problem (2), whose computational cost is usually the same as that of previous state-of-the-art first-order convex methods. This method is named the Proximal Iteratively REweighted (PIRE) algorithm. We further propose two multiple-splitting versions of PIRE to handle the multi-variable problem: PIRE with Parallel Splitting (PIRE-PS) and PIRE with Alternative Updating (PIRE-AU). Parallel splitting makes the algorithm highly parallelizable, so PIRE-PS is suitable for distributed computing, which is important for large-scale applications. PIRE-AU may converge faster than PIRE-PS. In theory, we prove that any sequence generated by PIRE, PIRE-PS or PIRE-AU is bounded and any accumulation point is a stationary point. To the best of our knowledge, PIRE-PS and PIRE-AU are the first two algorithms for problem (1) with multiple variables. If problem (1) is convex, the obtained solution is globally optimal.
Proximal Iteratively Reweighted Algorithm
In this section, we show how to solve problem (1) by our Proximal Iteratively Reweighted (PIRE) algorithm. Instead of minimizing $F(x)$ in (1) directly, we update $x$ by minimizing the sum of two surrogate functions, which correspond to the two terms of $F(x)$, respectively.
Algorithm 1: Proximal Iteratively REweighted (PIRE) algorithm
Input: $\lambda > 0$, $\mu > L(f)/2$, where $L(f)$ is the Lipschitz constant of $\nabla f$.
Initialize: $k = 0$, $x^0$.
Output: $x^*$.
while not converge do

Update $x^{k+1}$ by solving the following problem
$x^{k+1} = \arg\min_x \ \lambda \langle w^k, h(x) \rangle + \frac{\mu}{2}\left\|x - x^k + \frac{1}{\mu}\nabla f(x^k)\right\|_2^2$;

Update the weight $w^{k+1}$ by $-w^{k+1} \in \partial\left(-g(h(x^{k+1}))\right)$;
end while
First, note that since $g$ is concave, $-g$ is convex. By the definition of the subgradient of a convex function, we have
(14) $-g(h(x)) \geq -g(h(x^k)) + \langle -w^k, h(x) - h(x^k) \rangle$,
where $-w^k$ is the subgradient of $-g$ at $h(x^k)$, i.e.,
(15) $-w^k \in \partial\left(-g(h(x^k))\right)$.
Eqn. (14) is equivalent to
(16) $g(h(x)) \leq g(h(x^k)) + \langle w^k, h(x) - h(x^k) \rangle$.
The right-hand side of (16) is then used as a surrogate function of $g(h(x))$.
The loss function $f$, which has a Lipschitz continuous gradient, owns the following property [Bertsekas1999]:
(17) $f(x) \leq f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{L(f)}{2}\|x - x^k\|_2^2$.
Let $\mu > L(f)/2$; then $f(x^k) + \langle \nabla f(x^k), x - x^k \rangle + \frac{\mu}{2}\|x - x^k\|_2^2$ is used as a surrogate function of $f(x)$.
Combining (16) and (17), we update $x^{k+1}$ by minimizing the sum of these two surrogate functions:
(18) $x^{k+1} = \arg\min_x \ \lambda \langle w^k, h(x) \rangle + \langle \nabla f(x^k), x - x^k \rangle + \frac{\mu}{2}\|x - x^k\|_2^2 = \arg\min_x \ \lambda \langle w^k, h(x) \rangle + \frac{\mu}{2}\left\|x - x^k + \frac{1}{\mu}\nabla f(x^k)\right\|_2^2$,
where $w^k$ is also called the weight corresponding to $x^k$.
For the choice of $\mu$ in (18), our theoretical analysis shows that $\mu > L(f)/2$ guarantees the convergence of the proposed algorithm. Note that $g$ is concave and increasing; this guarantees that $w^k$ in (15) is nonnegative. Usually problem (18) can be cheaply computed based on the condition (C2). For example, if $h(x) = |x|$, solving problem (18) costs only $O(n)$. Such a computational cost is the same as that of state-of-the-art convex solvers for $\ell_1$-minimization. This idea leads to the Proximal Iteratively REweighted (PIRE) algorithm, as shown in Algorithm 1. In the next section, we will prove that the sequence generated by PIRE is bounded and any accumulation point is a stationary point of problem (1).
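To make the PIRE update (18) concrete for the common case $h(x) = |x|$ with a least-squares loss, here is a minimal sketch of the full loop. This is our own code under stated assumptions: the penalty is $g(u) = (u + \epsilon)^p$, the step size `mu = 0.6 * L` is just one valid choice satisfying the stated condition, and the least-squares initialization stands in for the paper's initialization; the subproblem (18) separates per coordinate and is solved by weighted soft-thresholding.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form solution of min_x t*|x| + 0.5*(x - v)**2, elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pire_lp(A, b, lam=0.1, p=0.5, eps=1e-3, iters=200):
    """Sketch of PIRE for min_x lam * sum_i (|x_i|+eps)^p + 0.5*||Ax-b||^2."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # simple warm start
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of grad f
    mu = 0.6 * L                              # any mu > L/2 suffices in theory
    for _ in range(iters):
        w = p * (np.abs(x) + eps) ** (p - 1)  # weight from the concave penalty
        grad = A.T @ (A @ x - b)              # gradient of the smooth part
        x = soft_threshold(x - grad / mu, lam * w / mu)  # update (18)
    return x
```

Each iteration costs one matrix-vector product plus an $O(n)$ thresholding, with no inner solver, which is the efficiency advantage over IRL1 discussed above.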
Convergence Analysis of PIRE
Theorem 1.
Let $\lambda > 0$ and $\mu > L(f)/2$, where $L(f)$ is the Lipschitz constant of $\nabla f$. The sequence $\{x^k\}$ generated in Algorithm 1 satisfies the following properties:

(1) $F(x^k)$ is monotonically decreasing. Indeed,
$F(x^k) - F(x^{k+1}) \geq \left(\mu - \frac{L(f)}{2}\right)\|x^k - x^{k+1}\|_2^2$;

(2) The sequence $\{x^k\}$ is bounded;

(3) $\sum_{k=0}^{+\infty} \|x^k - x^{k+1}\|_2^2 < +\infty$. In particular, we have $\lim_{k \rightarrow +\infty} \|x^k - x^{k+1}\|_2 = 0$.
Proof. Since $x^{k+1}$ is the globally optimal solution to problem (18), the zero vector is contained in the subgradient of the objective of (18) with respect to $x$. That is, there exists $v^{k+1} \in \partial \langle w^k, h(x^{k+1}) \rangle$ such that
(19) $\lambda v^{k+1} + \nabla f(x^k) + \mu(x^{k+1} - x^k) = 0$.
A dot-product with $x^k - x^{k+1}$ on both sides of (19) gives
(20) $\lambda \langle x^k - x^{k+1}, v^{k+1} \rangle = \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \mu\|x^{k+1} - x^k\|_2^2$.
Recalling the definition of the subgradient of a convex function, we have
(21) $\langle w^k, h(x^k) \rangle - \langle w^k, h(x^{k+1}) \rangle \geq \langle v^{k+1}, x^k - x^{k+1} \rangle$.
Combining (20) and (21) gives
(22) $\lambda \langle w^k, h(x^k) - h(x^{k+1}) \rangle \geq \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \mu\|x^{k+1} - x^k\|_2^2$.
Since $g$ is concave, similar to (16), we get
(23) $g(h(x^k)) - g(h(x^{k+1})) \geq \langle w^k, h(x^k) - h(x^{k+1}) \rangle$.
By the condition (C3), we have
(24) $f(x^k) - f(x^{k+1}) \geq -\langle \nabla f(x^k), x^{k+1} - x^k \rangle - \frac{L(f)}{2}\|x^{k+1} - x^k\|_2^2$.
Now, combining (22), (23) and (24), we have
(25) $F(x^k) - F(x^{k+1}) = \lambda g(h(x^k)) - \lambda g(h(x^{k+1})) + f(x^k) - f(x^{k+1}) \geq \left(\mu - \frac{L(f)}{2}\right)\|x^k - x^{k+1}\|_2^2 \geq 0$.
Hence $F(x^k)$ is monotonically decreasing. Summing all the above inequalities for $k \geq 0$, it follows that
(26) $\left(\mu - \frac{L(f)}{2}\right)\sum_{k=0}^{+\infty} \|x^k - x^{k+1}\|_2^2 \leq F(x^0) - \lim_{k \rightarrow +\infty} F(x^k) < +\infty$.
This implies that $\lim_{k \rightarrow +\infty} \|x^k - x^{k+1}\|_2 = 0$. Also $\{x^k\}$ is bounded due to the condition (C4), since $F(x^k) \leq F(x^0)$ for all $k$.
Theorem 2.
Any accumulation point $x^*$ of the sequence $\{x^k\}$ generated by Algorithm 1 is a stationary point of problem (1). Please refer to the Supplementary Material for the proof.
We conclude this section with the following remarks:

(1) When proving the convergence of IRL1 for solving problem (8) or (12) in [Chen and Zhou, Lu2012], Young's inequality is used, which is a special property of the function $u^p$ ($0 < p < 1$):
(28) $y^p \leq p\, s\, y + (1 - p)\, s^{\frac{p}{p-1}}$, for any $y \geq 0$ and $s > 0$.
Eqn. (28) is a special case of (23). But (23) is obtained by using the concavity of $g$, which is much more general.

(2) PIRE requires that $\mu > L(f)/2$. But sometimes the Lipschitz constant $L(f)$ is not known, or it is not computable for large-scale problems. One may use the backtracking rule to estimate $L(f)$ in each iteration [Beck and Teboulle2009]. PIRE with multiple splitting, shown in the next section, also eases this problem.
PIRE with Multiple Splitting
In this section, we will show that PIRE can also solve the following multi-variable problem:
(29) $\min_{x_1, \ldots, x_s} F(x_1, \ldots, x_s) = \lambda \sum_{i=1}^{s} g_i(h_i(x_i)) + f(x_1, \ldots, x_s)$,
where $g_i$ and $h_i$ satisfy the same assumptions as $g$ and $h$ in problem (1), respectively. Problem (29) is similar to problem (1), but splits $x$ into $s$ blocks $x = [x_1; \ldots; x_s]$, where $x_i \in \mathbb{R}^{n_i}$ and $\sum_{i=1}^{s} n_i = n$.
Based on different assumptions on $f$, we have two splitting versions of the PIRE algorithm. They use different updating orders of the variables.
PIRE with Parallel Splitting
If we still assume that (C3) holds, i.e., $f$ has a Lipschitz continuous gradient with Lipschitz constant $L(f)$, PIRE is naturally parallelizable. In each iteration, we update all blocks $x_i$ in parallel by
(30) $x_i^{k+1} = \arg\min_{x_i} \ \lambda \langle w_i^k, h_i(x_i) \rangle + \langle \nabla_i f(x^k), x_i - x_i^k \rangle + \frac{\mu}{2}\|x_i - x_i^k\|_2^2$,
where the notation $\nabla_i f(x^k)$ denotes the gradient of $f$ w.r.t. $x_i$ at $x^k$, $\mu > L(f)/2$, and $w_i^k$ is the weight vector corresponding to $x_i^k$, which can be computed by
(31) $-w_i^k \in \partial\left(-g_i(h_i(x_i^k))\right)$.
When updating $x_i^{k+1}$ in the $(k+1)$-th iteration, only the variables from the $k$-th iteration are used. Thus the variables $x_i^{k+1}$, $i = 1, \ldots, s$, can be updated in parallel. This is known as the Jacobi iteration in numerical algebra [Liu, Lin, and Su2013]. This algorithm is named PIRE with Parallel Splitting (PIRE-PS). Actually the updating rule of PIRE-PS is the same as that of PIRE, but performed in parallel. It is easy to check that the proofs of Theorems 1 and 2 also hold for PIRE-PS.
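One Jacobi sweep of PIRE-PS can be sketched as follows for the squared loss with $h_i(x_i) = |x_i|$. This is our own illustrative code: the per-block step size `mu` is one choice we assume satisfies the step-size condition, and the key point is that every block is updated from the same residual computed at $x^k$, so the loop body is embarrassingly parallel.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pire_ps_step(blocks, b, x, lam, p=0.5, eps=1e-6):
    """One parallel (Jacobi) sweep: blocks is [A_1, ..., A_s], x is [x_1, ..., x_s]."""
    s = len(blocks)
    # Residual uses ONLY the old iterate x^k; block updates are independent.
    resid = sum(A @ xi for A, xi in zip(blocks, x)) - b
    new_x = []
    for A, xi in zip(blocks, x):
        mu = 0.6 * s * np.linalg.norm(A, 2) ** 2    # per-block step size (assumed valid)
        w = p * (np.abs(xi) + eps) ** (p - 1)       # reweighting for this block
        grad = A.T @ resid                          # gradient w.r.t. block i at x^k
        new_x.append(soft_threshold(xi - grad / mu, lam * w / mu))
    return new_x
```

Since each loop iteration reads only `resid` and its own block, the body can be dispatched to separate workers or machines, which is the distributed-computing advantage claimed for PIRE-PS.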
For some special cases of $f$, we can use different step sizes $\mu_i$, usually smaller than $\mu$, for updating each $x_i$. This may lead to faster convergence [Zuo and Lin2011]. If $f(x) = \frac{1}{2}\left\|\sum_{i=1}^{s} A_i x_i - b\right\|_2^2$, we can update $x_i^{k+1}$ by
(32) $x_i^{k+1} = \arg\min_{x_i} \ \lambda \langle w_i^k, h_i(x_i) \rangle + \langle \nabla_i f(x^k), x_i - x_i^k \rangle + \frac{\mu_i}{2}\|x_i - x_i^k\|_2^2$,
where $\mu_i > sL_i/2$ and $L_i = \|A_i\|_2^2$ is the Lipschitz constant of $\nabla_i f$. If the size of $A$ is very large, $\|A\|_2$ may not be computable. We can split $A$ into $[A_1, \ldots, A_s]$ and compute each $\|A_i\|_2$ instead. Similar convergence results as in Theorems 1 and 2 also hold when updating $x_i^{k+1}$ by (32). For detailed proofs, please refer to the Supplementary Material. A main difference in the convergence proof is that we use the Pythagoras relation
(33) $\frac{1}{2}\|Ay - b\|_2^2 = \frac{1}{2}\|Ax - b\|_2^2 + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\|A(y - x)\|_2^2$
for the squared loss $f(x) = \frac{1}{2}\|Ax - b\|_2^2$. This identity is much tighter than the property (17) of a general function with Lipschitz continuous gradient.
The result that using the squared loss leads to smaller Lipschitz constants in PIRE-PS is very interesting and useful. Intuitively, it amounts to minimizing a tighter upper-bound surrogate function. Our experiments show that this leads to faster convergence of the PIRE-PS algorithm.
PIRE with Alternative Updating
In this section, we propose another splitting method to solve problem (29), based on the assumption that each $\nabla_i f$ is Lipschitz continuous with constant $L_i$. Different from PIRE-PS, which updates each $x_i^{k+1}$ based only on the blocks from the $k$-th iteration, we instead update $x_i^{k+1}$ based on all the latest blocks. This is the well-known Gauss-Seidel iteration in numerical algebra. We name this method PIRE with Alternative Updating (PIRE-AU).
Since $\nabla_i f$ is Lipschitz continuous, similar to (17), we have
(34) $f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \ldots, x_s^k) \leq f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, \ldots, x_s^k) + \langle \nabla_i f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, \ldots, x_s^k), x_i - x_i^k \rangle + \frac{L_i}{2}\|x_i - x_i^k\|_2^2$.
The right-hand side of (34) is used as a surrogate function of $f$, which is tighter than (17) in PIRE. Then we update $x_i^{k+1}$ by
(35) $x_i^{k+1} = \arg\min_{x_i} \ \lambda \langle w_i^k, h_i(x_i) \rangle + \langle \nabla_i f(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, \ldots, x_s^k), x_i - x_i^k \rangle + \frac{\mu_i}{2}\|x_i - x_i^k\|_2^2$,
where $\mu_i > L_i/2$ and $w_i^k$ is defined in (31).
The updating rule of PIRE-AU by (35) and (31) also converges. Any accumulation point of $\{x^k\}$ is a stationary point. See the detailed proofs in the Supplementary Material.
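The Gauss-Seidel sweep of PIRE-AU differs from the parallel sweep only in that the residual is refreshed immediately after each block update, so block $i$ sees the latest values of blocks $1, \ldots, i-1$. A sketch under the same assumptions as before (our own naming; the step size `mu` per block is one illustrative choice):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pire_au_sweep(blocks, b, x, lam, p=0.5, eps=1e-6):
    """One sequential (Gauss-Seidel) sweep over the blocks [A_1, ..., A_s]."""
    x = [xi.copy() for xi in x]
    resid = sum(A @ xi for A, xi in zip(blocks, x)) - b
    for i, (A, xi) in enumerate(zip(blocks, x)):
        mu = 0.6 * np.linalg.norm(A, 2) ** 2        # per-block step, no factor s
        w = p * (np.abs(xi) + eps) ** (p - 1)
        xi_new = soft_threshold(xi - (A.T @ resid) / mu, lam * w / mu)
        resid += A @ (xi_new - xi)                  # refresh residual right away
        x[i] = xi_new
    return x
```

The immediate residual refresh is the design difference: it loses the parallelism of PIRE-PS but lets each block use fresher information and a smaller per-block constant, which is why PIRE-AU may need fewer iterations.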
Both PIRE-PS and PIRE-AU can solve the multi-variable problem (29). The advantage of PIRE-PS is that it is naturally parallelizable, while PIRE-AU may converge in fewer iterations due to smaller Lipschitz constants. If the squared loss function is used, PIRE-PS uses the same small Lipschitz constants as PIRE-AU.
Experiments
We present several numerical experiments to demonstrate the effectiveness of the proposed PIRE algorithm and its splitting versions. All the algorithms are implemented in Matlab and tested on a PC with 8 GB of RAM and an Intel Core 2 Quad Q9550 CPU.
$\ell_p$-Minimization
We compare our proposed PIRE, PIRE-PS and PIRE-AU algorithms with IRLS and IRL1 for solving the $\ell_p$-minimization problem (7). For a fair comparison, we use the same settings for all the compared algorithms. We use the solution to the $\ell_1$-minimization problem as the initialization. We find that this accelerates the convergence of the iteratively reweighted algorithms, and also enhances the recovery performance. The choice of $\epsilon$ in (8) and (10) plays an important role in sparse signal recovery, but no theoretical support has been established so far. Several different decreasing rules have been tested before [Candès, Wakin, and Boyd2008, Mohan and Fazel2012, Lai, Xu, and Yin2013], but none of them dominates the others. Since the sparsity of the signal is usually unknown, we empirically decrease $\epsilon$ in each iteration following [Mohan and Fazel2012]. The algorithms are stopped when the change between consecutive iterates is sufficiently small.
IRL1 requires solving (9) as an inner loop. FISTA is employed to solve (9) with a warm start, i.e., using the previous solution as the initialization of the next inner loop. This trick greatly reduces the number of inner-loop iterations, which are the main cost of IRL1. For the PIRE-PS and PIRE-AU algorithms, we solve problem (29) by splitting $x$ into several blocks.
Table 1: Iteration number, running time, objective function value and recovery error on data sets of different sizes $(m, n, t)$.

| Size $(m, n, t)$ | Method  | Iter. | Time (s) | Obj.   | Recovery error |
|------------------|---------|-------|----------|--------|----------------|
| (100, 500, 50)   | PIRE    | 116   | 0.70     | 5.238  | 2.529          |
|                  | PIRE-PS | 58    | 0.48     | 5.239  | 2.632          |
|                  | PIRE-AU | 56    | 0.63     | 5.239  | 2.632          |
|                  | IRLS    | 168   | 81.82    | 5.506  | 2.393          |
|                  | IRL1    | 56    | 3.43     | 5.239  | 2.546          |
| (200, 800, 100)  | PIRE    | 119   | 1.48     | 16.923 | 2.246          |
|                  | PIRE-PS | 37    | 0.82     | 16.919 | 2.192          |
|                  | PIRE-AU | 36    | 0.88     | 16.919 | 2.192          |
|                  | IRLS    | 169   | 474.19   | 17.784 | 2.142          |
|                  | IRL1    | 81    | 13.53    | 16.924 | 2.248          |
| (300, 1000, 200) | PIRE    | 151   | 4.63     | 42.840 | 2.118          |
|                  | PIRE-PS | 29    | 1.38     | 42.815 | 1.978          |
|                  | PIRE-AU | 28    | 1.34     | 42.815 | 1.977          |
|                  | IRLS    | 171   | 1298.70  | 44.937 | 2.015          |
|                  | IRL1    | 79    | 35.59    | 42.844 | 2.124          |
| (500, 1500, 200) | PIRE    | 159   | 8.88     | 64.769 | 2.010          |
|                  | PIRE-PS | 26    | 2.27     | 64.718 | 1.814          |
|                  | PIRE-AU | 25    | 2.20     | 64.718 | 1.814          |
|                  | IRLS    | 171   | 3451.79  | 67.996 | 1.923          |
|                  | IRL1    | 89    | 80.89    | 64.772 | 2.013          |
| (800, 2000, 200) | PIRE    | 140   | 14.99    | 87.616 | 1.894          |
|                  | PIRE-PS | 33    | 5.15     | 87.533 | 1.648          |
|                  | PIRE-AU | 32    | 4.97     | 87.533 | 1.648          |
|                  | IRLS    | 177   | 7211.2   | 91.251 | 1.851          |
|                  | IRL1    | 112   | 173.26   | 87.617 | 1.895          |
Sparse Signal Recovery
The first experiment examines the recovery performance of sparse signals using the proposed methods. The setup for each trial is as follows. The dictionary $A \in \mathbb{R}^{m \times n}$ is a Gaussian random matrix generated by the Matlab command randn. The sparse signal $x^*$ is randomly generated with a fixed sparsity level. The response is $b = Ax^* + e$, where $e$ is a Gaussian random noise vector. Given $A$ and $b$, we recover $x^*$ by solving the $\ell_p$-minimization problem by different methods. The parameter $\lambda$ is set to the same value for all methods. We use the relative recovery error $\|\hat{x} - x^*\|_2 / \|x^*\|_2$ to measure the recovery performance, where $\hat{x}$ is the recovered solution. Based on the above settings and generated data, we find that the recovery performance is stable. We run 20 trials and report the mean relative error for comparison. Figure 1 plots the relative recovery errors versus different values of $p$ on three data sets with different numbers of measurements. The result for $p = 1$ is obtained by FISTA for $\ell_1$-minimization. We can see that all the iteratively reweighted algorithms achieve better recovery performance with $p < 1$ than $\ell_1$-minimization. Also, a smaller value of $p$ leads to better recovery performance, though the $\ell_p$-minimization problem is nonconvex and a globally optimal solution is not available. In most cases, PIRE is comparable with IRLS and IRL1. A surprising result is that PIRE-PS and PIRE-AU outperform the other methods when $p$ is small. They use a smaller Lipschitz constant than PIRE, and thus may converge faster. But none of these iteratively reweighted methods is guaranteed to be optimal.
Running Time Comparison
The second experiment shows the advantage in running time of the proposed methods. We implement all the compared methods in matrix form for solving the following $\ell_p$-minimization problem
(36) $\min_X \ \lambda \sum_{i,j} |X_{ij}|^p + \frac{1}{2}\|AX - B\|_F^2$,
where $A \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{n \times t}$, $B \in \mathbb{R}^{m \times t}$, and $p$ is set to 0.5 in this test. $A$, $X$ and $B$ are generated by the same procedure as in the above section, and the same algorithm parameter settings are used. Each column of $X$ is generated with a fixed sparsity level. We test on several data sets of different sizes, parameterized as $(m, n, t)$. The iteration number, running time, objective function value and relative recovery error are tabulated in Table 1. It can be seen that the proposed methods are much more efficient than IRLS and IRL1. PIRE-PS and PIRE-AU converge with fewer iterations and less running time. In our test, IRL1 is more efficient than IRLS. The reasons are: (1) initialization with a sparse solution to $\ell_1$-minimization is a good choice for IRL1, but not for IRLS; (2) for each iteration of IRLS, solving the equations (11) in a loop in Matlab is not efficient; (3) IRL1 converges with fewer inner-loop iterations due to the warm start.
We also plot running time versus objective function value on three larger data sets in Figure 2. The algorithms are stopped within 500 seconds in this test. IRLS costs much more time, and thus is not plotted. IRL1 is not plotted for the largest data set. It can be seen that PIRE-PS and PIRE-AU decrease the objective function value faster than PIRE.
Multi-Task Feature Learning
In this experiment, we use our methods to solve the multi-task feature learning problem. Assume we are given $m$ learning tasks associated with $\{(X_1, y_1), \ldots, (X_m, y_m)\}$, where $X_i \in \mathbb{R}^{n_i \times d}$ is the data matrix of the $i$-th task with each row being a sample, $y_i \in \mathbb{R}^{n_i}$ is the label vector of the $i$-th task, $n_i$ is the number of samples for the $i$-th task, and $d$ is the data dimension. Our goal is to find a matrix $W = [w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$ such that $X_i w_i \approx y_i$. The capped-$\ell_1$ norm is used to regularize $W$ [Gong, Ye, and Zhang2012a]:
(37) $\min_W \ \lambda \sum_{j=1}^{d} \min\left(\|w^j\|_1, \theta\right) + \sum_{i=1}^{m} \ell(X_i w_i, y_i)$,
where $\ell$ is the loss function, $\theta > 0$ is the thresholding parameter, and $w^j$ is the $j$-th row of $W$. The above problem can be solved by our proposed PIRE, PIRE-PS and PIRE-AU algorithms, by letting $g(u) = \min(u, \theta)$, $h(W) = \left[\|w^1\|_1; \ldots; \|w^d\|_1\right]$ and $f(W) = \sum_{i=1}^{m} \ell(X_i w_i, y_i)$.
The Isolet [Bache and Lichman2013] data set is used in our test. 150 subjects spoke the name of each letter of the alphabet twice, so we have 52 training examples from each speaker. The speakers are grouped into 5 subsets of 30 speakers each. Thus, we have 5 tasks, with each task corresponding to a subset. There are 1560, 1560, 1560, 1558, and 1559 samples in the 5 tasks, respectively. The data dimension is 617, and the response is the English letter label (1-26). We randomly select the training samples from each task with different training ratios (0.1, 0.2 and 0.3) and use the rest of the samples to form the test set. We compare our PIRE, PIRE-PS and PIRE-AU ($W$ is split into several blocks in PIRE-PS and PIRE-AU) with the Multi-Stage algorithm [Zhang2008]. We report the Mean Squared Error (MSE) on the test set and the running time for solving (37) on the training set. The results are averaged over 10 random splits. As shown in Figure 3, all these methods achieve comparable performance, but our PIRE, PIRE-PS and PIRE-AU are much more efficient than the Multi-Stage algorithm.
Conclusions
This paper proposes the PIRE algorithm for solving the general problem (1). PIRE solves a series of problems (2), whose computational cost is usually very cheap. We further propose two splitting versions of PIRE to handle multi-variable problems. In theory, we prove that PIRE (and also its splitting versions) converges and any limit point is a stationary point. We test our methods on the $\ell_p$-minimization problem and the multi-task feature learning problem. Experimental results on both synthetic and real data sets show that our methods achieve comparable learning performance, but are much more efficient than IRLS, IRL1 and multi-stage algorithms. It would be interesting to apply PIRE to structured sparsity optimization, and also to nonconvex low-rank regularized minimization problems [Lu et al.2014].
Acknowledgements
This research is supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by NSF of China (Grant nos. 61272341, 61231002, and 61121002) and MSRA.
References
 [Bache and Lichman2013] Bache, K., and Lichman, M. 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml.
 [Beck and Teboulle2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202.
 [Bertsekas1999] Bertsekas, D. P. 1999. Nonlinear programming. Athena Scientific (Belmont, Mass.), 2nd edition.
 [Candès, Wakin, and Boyd2008] Candès, E.; Wakin, M.; and Boyd, S. 2008. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications 14(5):877–905.
 [Chen and Zhou] Chen, X., and Zhou, W. Convergence of the reweighted $\ell_1$ minimization algorithm for $\ell_2$-$\ell_p$ minimization. To appear in Computational Optimization and Applications.
 [Clarke1983] Clarke, F. H. 1983. Optimization and Nonsmooth Analysis. Wiley, New York.
 [Fan and Li2001] Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348–1360.
 [Gong et al.2013] Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J. 2013. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. ICML.
 [Gong, Ye, and Zhang2012a] Gong, P.; Ye, J.; and Zhang, C. 2012a. Multistage multitask feature learning. In NIPS.
 [Gong, Ye, and Zhang2012b] Gong, P.; Ye, J.; and Zhang, C. 2012b. Robust multitask feature learning. In ACM SIGKDD, 895–903. ACM.
 [Jacob, Obozinski, and Vert2009] Jacob, L.; Obozinski, G.; and Vert, J.P. 2009. Group Lasso with overlap and graph Lasso. In ICML, 433–440. ACM.
 [Knight and Fu2000] Knight, K., and Fu, W. 2000. Asymptotics for Lassotype estimators. Annals of Statistics 1356–1378.
 [Lai, Xu, and Yin2013] Lai, M.-J.; Xu, Y.; and Yin, W. 2013. Improved iteratively reweighted least squares for unconstrained smoothed $\ell_q$ minimization. SIAM Journal on Numerical Analysis 51(2):927–957.
 [Liu, Lin, and Su2013] Liu, R.; Lin, Z.; and Su, Z. 2013. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. In ACML.
 [Lu et al.2014] Lu, C.; Tang, J.; Lin, Z.; and Yan, S. 2014. Generalized nonconvex nonsmooth lowrank minimization. In CVPR.
 [Lu2012] Lu, Z. 2012. Iterative reweighted minimization methods for $\ell_p$ regularized unconstrained nonlinear programming. Mathematical Programming.
 [Mohan and Fazel2012] Mohan, K., and Fazel, M. 2012. Iterative reweighted algorithms for matrix rank minimization. In JMLR, volume 13, 3441–3473.

 [Wright et al.2009] Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. TPAMI 31(2):210–227.
 [Zhang2008] Zhang, T. 2008. Multi-stage convex relaxation for learning with sparse regularization. In NIPS, 1929–1936.
 [Zhang2010] Zhang, C. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics.
 [Zhou et al.2010] Zhou, Z.; Li, X.; Wright, J.; Candès, E.; and Ma, Y. 2010. Stable principal component pursuit. In IEEE International Symposium on Information Theory Proceedings, 1518–1522. IEEE.
 [Zuo and Lin2011] Zuo, W., and Lin, Z. 2011. A generalized accelerated proximal gradient approach for totalvariationbased image restoration. TIP 20(10):2748–2759.
Proof of Theorem 2
Proof. The sequence $\{x^k\}$ generated in Algorithm 1 is bounded by Theorem 1. Hence, there exist an accumulation point $x^*$ and a subsequence $\{x^{k_j}\}$ such that $\lim_{j \rightarrow +\infty} x^{k_j} = x^*$. From the fact that $\lim_{k \rightarrow +\infty} (x^k - x^{k+1}) = 0$ in Theorem 1, we have $\lim_{j \rightarrow +\infty} x^{k_j + 1} = x^*$. Since $x^{k_j + 1}$ solves problem (18), there exists $v^{k_j + 1} \in \partial \langle w^{k_j}, h(x^{k_j + 1}) \rangle$ such that
(38) $\lambda v^{k_j + 1} + \nabla f(x^{k_j}) + \mu(x^{k_j + 1} - x^{k_j}) = 0$.
By the upper semi-continuity of the subdifferential [Clarke1983], there exists $v^* \in \partial \langle w^*, h(x^*) \rangle$, with $-w^* \in \partial(-g)(h(x^*))$, such that
(39) $\lambda v^* + \nabla f(x^*) = 0$.
Hence $x^*$ is a stationary point of (1).
Convergence Analysis of PIREPS
For the general PIRE-PS algorithm solving problem (29) by (30) and (31), the updating rule is actually the same as that of PIRE. Thus the same convergence results as in Theorems 1 and 2 hold. In this section, we provide the convergence analysis of the PIRE-PS algorithm by (32) and (31) for solving problem (29) with the squared loss function $f(x) = \frac{1}{2}\left\|\sum_{i=1}^{s} A_i x_i - b\right\|_2^2$.
Theorem 3.
Assume that the squared loss function is used in problem (29). Let $\lambda > 0$ and $\mu_i > sL_i/2$, $i = 1, \ldots, s$, where $L_i = \|A_i\|_2^2$ is the Lipschitz constant of $\nabla_i f$. The sequence $\{x^k\}$ generated by PIRE-PS in (32) and (31) satisfies the following properties:

(1) $F(x^k)$ is monotonically decreasing, i.e., $F(x^{k+1}) \leq F(x^k)$. Indeed,
$F(x^k) - F(x^{k+1}) \geq \sum_{i=1}^{s} \left(\mu_i - \frac{sL_i}{2}\right)\|x_i^k - x_i^{k+1}\|_2^2$;

(2) The sequence $\{x^k\}$ is bounded;

(3) $\sum_{k=0}^{+\infty} \|x^k - x^{k+1}\|_2^2 < +\infty$;

(4) Any accumulation point of $\{x^k\}$ is a stationary point of problem (29);

(5) For any $i = 1, \ldots, s$, $\lim_{k \rightarrow +\infty} \|x_i^k - x_i^{k+1}\|_2 = 0$.
Proof. Since $x_i^{k+1}$ is the globally optimal solution to problem (32), the zero vector is contained in the subgradient of the objective of (32) with respect to $x_i$. That is, there exists $v_i^{k+1} \in \partial \langle w_i^k, h_i(x_i^{k+1}) \rangle$ such that
(42) $\lambda v_i^{k+1} + \nabla_i f(x^k) + \mu_i(x_i^{k+1} - x_i^k) = 0$.
A dot-product with $x_i^k - x_i^{k+1}$ on both sides of (42) gives
(43) $\lambda \langle x_i^k - x_i^{k+1}, v_i^{k+1} \rangle = \langle \nabla_i f(x^k), x_i^{k+1} - x_i^k \rangle + \mu_i\|x_i^{k+1} - x_i^k\|_2^2$.
Recalling the definition of the subgradient of a convex function, we have
(44) $\langle w_i^k, h_i(x_i^k) \rangle - \langle w_i^k, h_i(x_i^{k+1}) \rangle \geq \langle v_i^{k+1}, x_i^k - x_i^{k+1} \rangle$,
which, combined with (43), gives
(45) $\lambda \langle w_i^k, h_i(x_i^k) - h_i(x_i^{k+1}) \rangle \geq \langle \nabla_i f(x^k), x_i^{k+1} - x_i^k \rangle + \mu_i\|x_i^{k+1} - x_i^k\|_2^2$.
By using the Pythagoras relation (33),