I Introduction
In the past decade, sparse and lowrank recovery have attracted much study attention in many areas, such as signal processing, image processing, statistics, bioinformatics and machine learning. To achieve sparsity and lowrankness promotion, sparsity and lowrankness constraints or penalties are commonly employed. Among the sparsity and lowrankness inducing penalties, the norm and nuclear norm penalties are of the most popular. This is mainly due to their convexity as it makes the involved optimization problems tractable in that, existing convex optimization techniques with wellestablished convergence properties can be directly used or can be applied after some extension.
Generally, under certain conditions, the
and nuclear norm penalties can reliably recover the underlying true sparse signal and lowrank matrix with high probability. However, both of them have a bias problem, which would result in significantly biased estimates, and cannot achieve reliable recovery with the least observations [1], [2], [3]. In comparison, a nonconvex penalty, such as the
, (), smoothly clipped absolute deviation (SCAD) or minimax concave penalty (MCP), is superior in that it can ameliorate the bias problem of the one. In recent, nonconvex regularization based sparse and lowrank recovery have drawn considerable interest and achieved significant performance improvement in many applications over convex regularization. This progress benefited a lot from the recent developments in nonconvex and nonsmooth optimization, and, meanwhile, promoted the developments of the latter.The goal of this article is to give an overview of the recent developments in nonconvex regularization based sparse and lowrank recovery in signal/image processing, statistics and machine learning. In a field as wide as this, we mainly focus on the following eight important topics.
1) Compressive sensing (CS): CS aims to acquire sparse signals (or signals can be sparsely represented in some basis) at a significantly lower rate than the classical Nyquist sampling [14]–[16]. In CS, exploiting the sparsity (or sparse representation) of the desired signals is the key for their reconstruction. There exist a number of recent works addressing nonconvex regularized sparse reconstruction, e.g., using the and regularization [39]–[66]. It has been demonstrated that, nonconvex regularization not only ameliorates the bias problem of the one but also requires fewer measurements for reliable recovery.
2) Sparse regression and variable selection
: Sparse regression aims to simultaneously select variables and estimate coefficients of variables, which is a fundamental problem in highdimensional statistical analysis. Nonconvex penalties, such as the SCAD,
, , MCP penalties, have been widely employed to attain more accurate estimation over the one [11], [12], [83]–[96].3) Sparse signals separation and image inpainting
: Sparse signals separation problems arise in many important applications, such as source separation, superresolution, inpainting, interference cancellation, saturation and clipping restoration, and robust sparse recovery in impulsive (sparse) noise. In these applications, the
penalty can attain considerable restoration performance improvement over the penalty [120].4) Sparse principal component analysis (PCA)
: PCA is a useful statistical tool for data analysis and dimensionality reducing, which is widely used in virtually all areas of science and engineering. Sparse PCA aims to obtain sparse loading vectors to enhance the interpretability of the principle components (PCs). Nonconvex regularization, such as
regularization, has been widely used for sparsity promotion [128]–[141].5) Large sparse covariance matrix estimation: Large covariance matrix estimation is a fundamental problem in modern highdimensional statistical analysis, which has found wide applications such as in economics, finance, bioinformatics, social networks, climate studies, and health sciences. In the highdimensional setting where the dimensionality is often comparable to (or even larger than) the sample size, sparsity regularized estimation is especially effective. Some works addressing nonconvex regularized covariance matrix estimation include [148], [152].
6) Large sparse inverse covariance matrix estimation: Large inverse covariance matrix estimation is also fundamental to highdimensional statistical analysis, which is closely related to undirected graphs under a Gaussian model. Some papers addressing nonconvex regularized inverse covariance matrix estimation include [160]–[166].
7) Matrix completion
: Matrix completion deals with the recovery of a lowrank matrix from its partially observed entries. Such problems arise in many applications such as in recommender systems, computer vision and system identification. Various algorithms have been developed using the nonconvex Schatten
() norm, truncated nuclear norm, and MCP penalties for lowrankness inducing, e.g., [6], [48], [175]–[182], to achieve better recovery performance over the nuclear norm.8) Robust PCA
: Robust PCA aims to enhance the robustness of PCA against outliers or sparse corruption which is ubiquitous in modern applications such as image processing, web data analysis, and bioinformatics. Basically, robust PCA is a joint sparse and lowrank recovery problem, which seeks to identify a lowdimensional structure from grossly corrupted observations. Existing works using different combinations of nonconvex sparse and lowrank penalties for robust PCA include [9], [193]–[197].
Among these topics, CS, sparse regression, sparse signals separation, and sparse PCA are sparse vector recovery problems, large sparse covariance and inverse covariance matrices estimation are sparse matrix recovery problems, whilst matrix completion and robust PCA are lowrank recovery problems. To be more precise, sparse PCA is not a vector recovery problem, but in many popular greedy methods, the PCs are estimated in a vectorbyvector manner. Meanwhile, robust PCA is a joint sparse and lowrank recovery problem.
In this paper, we also provide some critical perspectives. As it is often the case that, when new techniques are introduced, there may also be some overexcitement and abuse. We comment on the following points: There exist certain instances in both sparse and lowrank recovery, where the use of nonconvex regularization is simply unnecessary and will not significantly improve performance [95]. The use of nonconvex regularization does not always guarantee distinct performance improvement over convex regularization. Moreover, employing nonconvex regularization models can even be disadvantageous because the related nonconvex and nonsmooth optimization problems are less tractable than convex problems. Performance should be defined in a broader sense that includes not only recovery accuracy, but also other properties of the algorithm, such as convergence speed. Generally, for a nonconvex regularized algorithm, the performance is usually closely related to the initialization and the convergence rate is usually slower than that of a convex regularized algorithm. In comparison, a convex algorithm has better stability and convergence properties, and is insensitive to initialization since converging to a global minimal is usually easy guaranteed [202]. Meanwhile, for firstorder convex algorithms, welldeveloped acceleration techniques can be used with guaranteed convergence [26]. However, when such acceleration techniques are applied to nonconvex algorithms, there is no guarantee of the convergence and stability.
Therefore, the question of whether to use convex or nonconvex regularization models requires careful deliberation. We show that convex models may be preferable when the signal is not strictly sparse (or the matrix is not strictly lowrank) or the signaltonoise ratio (SNR) is low, since in these cases the performance improvement of nonconvex models are often not distinct and may not deserve the price of more slowly converging nonconvex algorithms. We provide a number of concrete examples that clarify these points.
We also address the issues of penalty selection and the convergence of related nonconvex algorithms. We hope that our paper will illuminate the role nonconvex regularization plays in sparse and lowrank recovery in signal processing, statistics and machine learning, and demonstrate when and how it should be used.
There also exist some recent overview articles related to the topics of this work, e.g., on lowrank matrix recovery [203], [204], and nonconvex optimization in machine leaning [205]. While the works [203], [204] mainly focus on matrix factorization based (nonconvex) lowrank recovery problems, the lowrank recovery problems investigated in this work are more general. Meanwhile, while the article [205] mainly focuses on the optimization aspect of general nonconvex problems in machine learning, we focus on nonconvex regularized problems characterized with the nonconvexity and nonsmoothness of the involved problems. Moreover, we provide a more wide scope on the applications in various fields.
Outline
: The rest of this paper is organized as follows. In section II, we review the proximity operator for nonconvex penalties, and present extended vector proximity operator (for joint sparse recovery) and singular value shrinkage operator (for lowrank matrix recovery) for generalized nonconvex penalties. Section III discusses nonconvex regularized sparse vector recovery problems, including CS, sparse regression, sparse signals separation and sparse PCA. Section IV reviews nonconvex regularized sparse matrix recovery problems, including large sparse covariance and inverse covariance matrices estimation. Section V discusses nonconvex regularized lowrank recovery problems, including matrix completion and robust PCA. Section VI further discusses other applications involving nonconvex regularized sparse and lowrank recovery. Section VII concludes the overview.
Notations: For a matrix , , , , and are the rank, trace, determinant, spectral norm and Frobenius norm, respectively, whilst , and
denote the maximal eigenvalue, minimal eigenvalue and the
th largest singular value of . For a matrix , is a diagonal matrix which has the same diagonal elements as that of , whilst for a vector , is a diagonal matrix with diagonal elements be . mean that is positivesemidefinite. and stand for the inner product and transpose, respectively. and stand for the gradient and subdifferential of the function , respectively. denotes the sign of a quantity with . stands for an identity matrix. with denotes the norm defined as . is the Kronecker delta function. denotes the expectation. denotes the indicator function.Ii Proximity Operator for Nonconvex Regularization Penalties
Proximity operator plays a central role in developing efficient proximal splitting algorithms for many optimization problems, especially for nonconvex and nonsmooth inverse problems encountered in the applications addressed in this paper. As will be shown later, for convex or nonconvex penalized minimization problems, proximity operator is the core of most highlyefficient firstorder algorithms which scale well for highdimensional problems. This section reviews nonconvex regularization penalties and their corresponding proximity operator, including the hardthresholding, norm, an explicit shrinkage, SCAD, MCP, and firm thresholding.
Iia Scalar Proximity Operator
For a proper and lower semicontinuous penalty function where is a threshold parameter, consider the following scalar proximal mapping
(1) 
As is separable, the proximity operator of a vector , denoted by , can be computed in an elementwise manner as
(2) 
Table 1 shows the penalties and the corresponding proximal mapping operator. Among the presented penalties, only the softthresholding penalty is convex, while the other penalties are (symmetric) folded concave functions, as shown in Fig. 1.
The hardthresholding was first introduced in [36] and then applied for wavelet applications in statistics [4], which is a natural selection for sparse inducing [77]–[81]. The wellknown softthresholding rule was first observed by Donoho, Johnstone, Hoch and Stem [35] and then used in wavelet applications [36], which forms the core of the LASSO introduced by Tibshirani [37]. The penalty is the most popular one as its convexity makes the related optimization problems more tractable than that using a nonconvex one.
However, the penalty has a bias problem. More specifically, when the true parameter has a relatively large magnitude, the softthresholding estimator is biased since it imposes a constant shrinkage on the parameter, as shown in Fig. 2. Fig. 2 plots the thesholding/shrinkage functions for the hard, soft,  and SCAD penalties with a same threshold. In contrast, the hardthresholding and SCAD estimators are unbiased for large parameter. Meanwhile, the thresholding rule corresponding to SCAD, and shrinkage fall in (sandwiched) between hard and softthresholding.
Penalty name  Penalty formulation  Proximity operator  

(i) Soft thresholding [35]  
(ii) Hard thresholding [36] 


(iii) norm [6], [7]  , 


(iv) shrinkage [10]  N/A ()  
(v) SCAD [11]  
(vi) MCP [12] 


(vii) Firm thresholding [13]  where 
The penalty with bridges the gap between the and penalties, and intuitively its shrinkage function is less biased than softthresholding. The proximity operator of the norm penalty does not has a closedform expression except for the two special cases of and [5] (in these two cases the proximal mapping can be explicitly expressed as the solution of a cubic or quartic equation), but it can be efficiently solved, e.g., by a Newton’s method. Moreover, explicit shrinkage mapping has been proposed in [8]–[10], which has some qualitative resemblance to the proximal mapping while being continuous and explicit. The shrinkage reduces to the softthresholding when , while it tends pointwise to the hardthresholding in the limit as . As an acceptable price to pay for having an explicit proximal mapping, its penalty does not have a closedform expression.
The SCAD penalty has been widely used in variable selection problems, and it has shown favorable effectiveness compared with other penalties in highdimensional variable selection problems [11]. As well as the , , and SCAD penalties, the MCP penalty can also ameliorate the bias problem of the
penalty [12], and it has been widely used for penalized variable selection in highdimensional linear regression. For the MCP penalty with each
, we can obtain a continuum of penalties and threshold operators by varying in the range . Moreover, the firm thresholding is a continuous and piecewiselinear approximation of the hardthresholding [13]. In addition, a class of nonconvex penalties which can maintain the convexity of the global cost function have been designed in [209]–[211], whilst log penalties have been considered in [212], [213].Among the presented shrinkage functions, the soft, SCAD, shrinkage and firm thresholding are continuous while the hard and thresholding are discontinuous. For each of the presented penalties, the corresponding thresholding/shrinkage operator satisfies
(3) 
where is a threshold dependent on . These conditions establish the sign consistency, shrinkage, thresholding, and limited shrinkage properties for the thresholding/shrinkage operator corresponding to a generalized penalty.
IiB Vector Proximity Operator (for Multitask Joint Sparse Recovery)
In many applications, as will be shown in section III, it is desirable to jointly recover multichannel signals to exploit the multichannel joint sparsity pattern. In this case, it involves solving the following vector optimization problem
(4) 
where and with be the channel number. The case of be the penalty has been considered in [120], we extend the result to generalized penalties in the following.
Theorem 1. Suppose that is a nondecreasing function for , for any , the solution to (4) is given by
(5) 
Proof: See Appendix A.
IiC Singular Value Shrinkage Operator (for LowRank Matrix Recovery)
For a matrix
, consider the singular value decomposition (SVD) of rank
, , where contains the singular values, and contain the orthonormal singular vectors. In lowrank matrix recovery, the lowrankness promotion on a matrix is usually achieved by sparsity promotion on the sigular values of the matrix. We denote a generalized penalty for lowrankness promotion by , which is defined as(6) 
where is a generalized penalty for sparsity inducing as introduced in Table 1. In the two cases of and , becomes the rank and nuclear norm of , respectively. When is the penalty, becomes the Schatten quasinorm of matrix.
In the following, we provide the generalized singularvalue shrinkage operator for a generalized penalty .
Theorem 2. For a rank matrix , suppose that it has an SVD , where and , and contain the left and right singular vectors. Then, for any defined as (6) with satisfying (3), the solution to the optimization problem
(7) 
is given by
(8) 
where is the proximity operator defined in (2).
Proof: See Appendix B.
Iii Sparse Vector Recovery
This section reviews nonconvex regularization based sparse vector signals recovery, mainly on the following four topics, CS, sparse regression and variable selection, sparse signals separation with application to image inpainting and superresolution, and sparse PCA. Strictly speaking, sparse PCA is not a vector recovery problem, but in many popular greedy approaches, the principle components are estimated in a onebyone (vectorbyvector) manner.
Iiia Compressive Sensing
In the past decade, compressive sensing has attracted extensive studies [14]–[17] and has found wide applications in radar [18], [19], communications [20], medical imaging [21], image processing [22], and speech signal processing [23]. In the CS framework, sparse signals (or signals can be sparsely represented in some basis) can be acquired at a significantly lower rate than the classical Nyquist sampling, and signals only need to be sampled at a rate proportional to their information content.
In CS, the objective is to reconstruct a sparse signal from its compressed measurement
(9) 
where with is the sensing matrix (also called measurement matrix), is additive measurement noise. Since , the recovery of from the compressed measurement is generally illposed. However, provided that is sparse and the sensing matrix satisfies some stable embedding conditions [17], can be reliably recovered with an error upper bounded by the noise strength. This can be achieved in the noiseless case by the formulation
(10) 
or in the noisy case by the formulation
(11) 
where is a penalty for sparsity inducing (a special case of with ), and bounds the norm of the residual error. This constrained formulation (11) can be converted into an unconstrained formulation as
(12) 
Naturally, using the norm penalty, i.e., and , which counts the number of nonzero components in the vector , (10), (11) and (12) are the exact formulations of finding a sparse vector to fulfill the linear constraint , satisfy the residual constraint
, and minimize the quadratic loss function in (12), respectively. However, with the
penalty the problems (10)(12) are nonconvex and NPhard, thus, convex relaxation methods are often considered, e.g., replace the penalty by the one. With and , the formulations (10), (11) and (12) are the wellknown basispursuit (BP) [24], basispursuit denoising (BPDN) and LASSO [13], respectively. In this case, the formulations are convex and hence tractable. A large number of algorithms have been developed in the past decade for these minimization problems (see [25]–[29] and the reference therein).The CS theory has established that if the sensing matrix satisfies some conditions, such as the restricted isometry property (RIP) [17], [30]–[32], the null space property [33], and the incoherence condition [34], the sparse signal can be reconstructed by regularization reliably. However, due to the relaxation, the recovery accuracy is often degraded, e.g., it often introduces extra bias [1], [2] and cannot reconstruct a signal with the least observations [3]. Furthermore, for some applications, the result of the minimization is not sparse enough and the original signals cannot be recovered. A simple example of such a case has been given in [38] with intuitive explanation.
To address this problem, a number of improved algorithms have been developed via employing the nonconvex norm penalty instead of the one, i.e., and with . For , is the quasinorm defined as . Compared with the norm, the norm is a closer approximation of the norm. It has been shown in [37] that under certain RIP conditions of the sensing matrix, regularized algorithms require fewer measurements to achieve reliable recovery than regularized algorithms. Moreover, the sufficient conditions in terms of RIP for regularization are weaker than those for regularization [39], [40]. Meanwhile, it has been shown in [41] that for any given measurement matrix with restricted isometry constant , there exists some that guarantees exact recovery of signals with support smaller than by minimization.
Recently, regularized sparse reconstruction has been extensively studied for CS, e.g., [5], [8], [10], [39]–[66], [200], and extended to structured sparse recovery [201]. As the penalty is nonsmooth and nonconvex, many of these algorithms solve a smoothed (approximated) minimization problem, e.g., the works [47]–[49] use an approximation of as
(13) 
where is a smoothing parameter. Furthermore, the iteratively reweighted algorithms [38], [44], [46] use the following two penalties (at the th iteration)
which explicitly relate to the norm approximation.
These algorithms have been shown to achieve better recovery performance relative to the regularized algorithms. However, due to the nonconvexity of minimization, many of these algorithms are generally inefficient and impractical for largescale problems. For example, the StSALq method in [47] requires repetitive computation of matrix inversion of dimension , whilst the IRucLq method in [48] solve a set of linear equations using matrix factorization (with matrix dimension of ). In comparison, the LpRLS method in [49], which uses an efficient conjugate gradient (CG) method in solving a sequence of smoothed subproblems, has better scalable capability. Moreover, the iteratively reweighted method in [50] also involves solving a sequence of weighted mixed subproblems. Although efficient firstorder algorithms can be used to solve the subproblems involved in the methods in [49], [50], both the methods have a double loop which hinders the overall efficiency. While subsequence convergence is guaranteed for the iteratively reweighted methods [48] and [50], there is no such guarantee for StSALq [47] and LpRLS [49].
In comparison, the proximal gradient descent (PGD) and alternative direction method of multipliers (ADMM) algorithms for the problem (12) are globally convergent under some mild conditions, while being much more efficient. Specifically, let , consider the following quadratic approximation of the objective in the formulation (12) at iteration and at a given point as
(14) 
where is a proximal parameter. Then, minimizing reduces to the proximity operator introduced in section 2 as
(15) 
which can be computed elementwise via the shrinkage/thresholding function in Table I.
This PGD algorithm fits the frameworks of the forwardbackward splitting method [67] and the generalized gradient projection method [68]. Very recently, the convergence properties of this kind of algorithms have been established via exploiting the KurdykaLojasiewicz (KL) property of the objective function [69]–[71]. Suppose that is a closed, proper, lower semicontinuous, KL function, if , then, the sequence generated by PGD (15) converges to a stationary point of the problem (12). Further, under some more conditions, the convergence of PGD to a local minimizer can be guaranteed [72], [73].
For ADMM algortithm, using an auxiliary variable
the problem (12) can be equivalently reformulated as
(16) 
Then, the ADMM algorithm consists of the following steps
where is the dual variable, is a penalty parameter. As a standard trick, the subproblem can be solved approximately via linearizaing the quadratic term. For a closed, proper, lower semicontinuous and KL , under some condition of the proximal parameter and (should be chosen sufficiently large), this proximal ADMM algorithm globally converges to a stationary point of the problem (16) [74]–[75].
For PGD and ADMM, the dominant computational load in each iteration is matrixvector multiplication with complexity , thus, scale well for highdimension problems. These two algorithms may be further accelerated by the schemes in [26], [76], however, for a nonconvex , there is no guarantee of convergence when using such acceleration schemes.
Example 1 (Image reconstruction). We evaluate the PGD (15) with hard,  and softthresholding on image reconstruction. The used images are of size (), include a synthetic image, “SheppLogan” and an MRI image, as shown in Fig. 3. We use the Haar wavelets as the basis for sparse representation of the images, and is a partial DCT matrix with . The thresholding is initialized by the solution of the softthresholding. The regularization parameter is selected by providing the best performance.
Fig. 4 shows the peaksignal noise ratio (PSNR) of recovery of thresholding for different value of with SNR = 50 dB, including the hard and softthresholding as special cases. It can be seen that each method is able to achieve a high PSNR greater than 60 dB in recovering the synthetic image, but degrades significantly in recovering the MRI image (less than 30 dB). This is due the nature that, as shown in Fig. 5, the wavelet coefficients of the synthetic image are truly sparse (approximately 5.7% nonzeros), while that of a reallife image are not strictly sparse but rather approximately follow an exponential decay, which is referred to as compressible. Also due to this, for the two images, the best performance of thresholding are given by and , respectively, as shown in Fig. 4. That is, a relatively small value of should be used for strictly sparse signals, while a relatively large value of should be used for compressible (nonstrictly sparse) signals. Moreover, the hard and thresholding can achieve significant performance improvement over the softthresholding only for strictly sparse signals.
IiiB Sparse Regression and Variable Selection
Nowadays, the analysis of data sets with the number of variables comparable to or even much larger than the sample size arises in many areas, such as genomics, health sciences, economics and machine learning [126]. In this context with highdimensional data, most traditional variable selection procedures, such as AIC and BIC, becomes infeasible and impractical due to too expensive computational cost. In this scenario, sparse regression, which can simultaneously select variables and estimate coefficients of variables, has become a very popular topic in the last decade due to its effectiveness in the highdimensional case [37], [82].
Let denote the (deterministic) design matrix, contains the unknown regression coefficients, and is the response vector. Further, assume that depends on through a linear combination and the conditional loglikelihood given is . In the variable selection problem, the assumption is that majority of the true regression coefficients are zero, and the goal is to identify and estimate the subset model. Under the sparse assumption of the true regression coefficients, a natural method for simultaneously locating and estimating those nonzero coefficients in is to maximize the following penalized likelihood of the form
(17) 
where is a generalized sparsity promotion penalty as defined in section 2. Moreover, there exists a popular alternative which uses the following penalized leastsquare (LS) formulation
(18) 
The wellknown LASSO method is first proposed in [37] for the linear regression problem (18). In the same spirit of LASSO, nonconcave penalty functions, such as SCAD [11] and MCP [12], have been proposed to select significant variables for various parametric models, including linear regression, generalized linear regression and robust linear regression models [83]. Extension to some semiparametric models, such as the Cox model and partially linear models have been considered in [84]–[86]. It has been shown in these works that, with appropriately selected regularization parameters, nonconvex penalized estimators can perform as well as the oracle procedure in selecting the correct subset model and estimating the true nonzero coefficients. Further, even for superpolynomial of sample size, nonconvex penalized likelihood methods possess model selection consistency with oracle properties [87]. In addition, adaptive LASSO has been proposed in [88], which uses an adaptively weighted
penalty. While LASSO variable selection can be inconsistent in some scenarios, adaptive LASSO enjoys the oracle properties and also leads to a nearminimax optimal estimator. The oracle property of adaptive LASSO has also been demonstrated in the highdimensional case [89].To solve a nonconvex penalized sparse regression problem, locally approximation of the penalty function can be utilized, such as local quadratic approximation (LQA) [11], [90] and local linear approximation (LLA) [91]. While the LQA based methods uses a backward stepwise variable selection procedure, the LLA based method [91] is onestep and hence more favorable.
More recently, benefit from the progress in nonconvex and nonsmooth optimization, direct methods have been widely designed. Specifically, in [92], a coordinatedescent algorithm has been proposed for the linear regression model (18). By using a continuation process of the parameters of the SCAD or MCP penalty, its convergence to a minimum is guaranteed under certain conditions. Meanwhile, an alternative coordinatedescent algorithm has been presented in [93] with guaranteed convergence to a local minimum. Then, a cyclic descent algorithm employing the penalty for multivariate linear regression has been introduced in [94]. Subsequently, a cyclic descent algorithm for the penalized LS problem has been proposed in [7], [95], whilst an majorizationminimization (MM) algorithm with momentum acceleration for the penalized LS formulation has been developed in [96]. For both the methods in [95] and [96], convergence to a local minimizer is guaranteed under certain conditions. Moreover, as introduced in the last subsection, there exist numerous algorithms, e.g., the PGD and ADMM algorithms, can be applied to the penalized LS formulation (18).
In addition, a global optimization approach has been recently proposed in [206] for concave penalties (e.g., SCAD and MCP) based nonconvex learning via reformulating the nonconvex problems as general quadratic programs. Meanwhile, theoretical analysis on the statistical performance of local minimizer methods using folded concave penalties has been provided in [207]. Moreover, a class of nonconvex penalties termed trimmed Lasso has been considered in [208], which enables exact control of the desired sparsity level.
Example 2 (Estimation accuracy in different SNR and different sparsity level). We evaluate the performance of different penalties for the linear sparse regression problem (18) using the PGD (15), with and . Three sparsity levels, 2%, 5%, and 10% active coefficients of , are considered. Fig. 6 shows the performance of different penalties versus SNR when the PGD is initialized by zero, while Fig. 7 shows the results when the PGD is initialized by the solution of the convex penalty. It can be seen that the performance of the nonconvex penalties is heavily dependent on the initialization. The advantage of a nonconvex penalty over the penalty is significant when the sparsity level of the coefficient vector is relatively low and/or the SNR is relatively high.
IiiC Sparse Signals Separation and Image Inpainting
Sparse signals separation has wide applications, such as source separation, superresolution and inpainting, interference cancellation, saturation and clipping restoration, and robust sparse recovery in impulsive (sparse) noise. The objective of the problem is to demix the two sparse vectors , , from their mixed linear measurements as
(19) 
where are known deterministic dictionaries. More specifics on the applications involving the model (19) are as follows.
1) Source separation
: such as the separation of texture in images [97], [98] and the separation of neuronal calcium transients in calcium imaging [99],
and are two dictionaries allowing for sparse representation of the two distinct features, and are the (sparse or approximately sparse) coefficients describing these features [100]–[102]. 2) Superresolution and inpainting: in the superresolution and inpainting problem for images, audio, and video signals [103]–[105], only a subset of the desired signal is available. Given , the objective is to fill in the missing parts in , in which case and stands for the missing parts. 3) Interference cancellation: in some audio, video, or communication applications, it is desired to restore a signal corrupted by narrowband interference, such as electric hum [101]. As narrowband interference can be sparsely represented in the frequency domain,can be an inverse discrete Fourier transform matrix.
4) Saturation and clipping restoration: in many practical systems where the measurements are quantized, nonlinearities in amplifiers may result in signal saturation and causes significant nonlinearity and potentially unbounded errors [101], [106], [107]. In this case, is the desired signal, is the situated measurement with stands for the saturation errors. 5) Robust sparse recovery in impulsive noise: impulsive noise is usually approximately sparse and has a distribution with heavy tail. In practical image and video processing applications [108]–[110], impulsive noise may come from the measurement process, or caused by transmission problems, faulty memory locations, buffer overflow and unreliable memory [111]–[114]. In these cases, represents the (sparsely) impulsive noise and .Exploiting the sparsity, and can be reconstructed via the following formulation
(20) 
where and are penalties for sparsity promotion, is a parameter takes the statistic difference between the two components into consideration. As will be shown later, when and have different sparsity properties, using two different penalties for and can obtain performance gain over using a same one.
When both and are the penalty, i.e., , (20) reduces to the sparse separation formulation in [102]. Further, when and , the formulation (20) degenerates to the BP form considered in [100]. Moreover, when and , (20) reduces to the regularized leastabsolute problem for robust sparse recovery [115], which has outstanding robustness in the presence of impulsive noise. In addition, the Mestimation method in [218] can also be considered as a special case of (20) with a selected robust loss term.
Naturally, nonconvex penalties can be expected to yield better reconstruction performance over the above convex methods. For example, the penalty has been used in [116]–[119] to obtain more robust restoration of images corrupted by saltandpepper impulsive noise. Very recently, a generalized formulation using penalty has been proposed in [120] with and , . The formulation (20) can be directly solved by a standard twoblock ADMM procedure [29], but it often fails to converge in the nonconvex case [120]. To develop convergent algorithms, a recenty work [120] proposed to solve a quadratic approximation of (20) and developed two firstorder algorithms based on the proximal block coordinate descent (BCD) and ADMM frameworks. The proximal BCD method consists of the following two update steps
(21)  
(22) 
where and are proximal parameters, is quadratic approximation parameter. Suppose that and are closed, proper, lower semicontinuous, KL functions, if and , the algorithm updated via (21) and (22) is a descent algorithm and the generated sequence converges to a stationary point of the approximated problem.
In many applications such as superresolution and inpainting for color images with 3 (RGB) channels, multichannel joint recovery is more favorable than independently recovery of each channel, as the former can exploit the feature correlation among different channels. In the multitask case, the linear measurements of channels can be expressed as
(23) 
where , , are the sparse features in the two components. To exploit the joint sparsity among the channels, joint sparsity penalties can be used, which is defined as
Using such a penalty, e.g., for and for , the proxiaml BCD algorithm for the multitask problem consists of the following two steps [120]
(24) 
(25) 
where and are proximal parameters. These two subproblems can be solved rowwise as (5). Suppose that and are closed, proper, lower semicontinuous, KL functions, if , and , the algorithm updated via (24) and (25) is a descent algorithm and the generated sequence converges to a stationary point of the approximated problem.
Example 3 (Color image inpainting). We compare the performance of the and regularization on color images inpainting using the BCD method (24) and (25). The task is to restore the original image from text overwriting or saltandpepper noise corruption. In this case, and represents the sparse corruption in the three (RGB) channels. We select as an inverse discrete cosine transformation (IDCT) matrix, and, accordingly, contains the DCT coefficients of the image. As shown in Fig. 8, regularization significantly outperforms the one, e.g., the improvement is more than 9 dB in the overwritten text case and more than 5 dB in the saltandpepper noise case.
Example 4 (Evaluation on different penalties). An important problem in practice is how to select a nonconvex penalty. This example sheds some light on this by three application cases. Case 1: the color image inpainting experiment in Fig. 8 (a). Case 2: , and are respectively DCT and Gaussian matrices, and are strictly sparse vectors with sparsity . Case 3 (robust sparse recovery in impulsive noise): , is a Gaussian matrix, , is a strictly sparse vector with , and is symmetric stable () noise with characteristic exponent and dispersion .
As the penalty has a flexible parametric form that adapts to different thresholding functions, we evaluate the effect of the values of and on the recovery performance in the three cases. Fig. 9 indicates that for strictly sparse signals, a relatively small value of is favorable, while for nonstrictly sparse signals, a relatively large value of is favorable. For example, in case 1, (DCT coefficients of the image) is nonstrictly sparse whilst is strictly sparse, thus, a relatively large value of and a relatively small value of should be used. In case 3, is strictly sparse whilst ( noise) is nonstrictly sparse, thus, a relatively small value of and a relatively large value of would result in good performance.
IiiD Sparse PCA
PCA is a useful tool for dimensionality reduction and feature extraction, which has been applied in virtually all areas of science and engineering, such as signal processing, machine learning, statistics, biology, medicine, finance, neurocomputing, and computer networks, to name just a few. In many real applications, sparse loading vectors are desired in PCA to enhance the interpretability of the principle components (PCs). For instance, in gene analysis, the sparsity of PCs can facilitate the understanding of the relation between the whole gene microarrays and certain genes; in financial analysis, the sparsity of PCs implies fewer assets in a portfolio thus is helpful to reducing the trading costs. In these scenarios, it is desirable not only to achieve the dimensionality reduction but also to reduce the number of explicitly used variables.
To achieve sparse PCA, as hoc methods were firstly designed via thresholding the PC loadings [124], [125]. Then, more popular methods incorporating a sparsity inducing mechanism into the traditional PCA formulation have been developed [126]. Let be a centered data matrix with and respectively be the dimensionality and the size of the data. Employing a sparsity constraint in the traditional PCA model, the sparse PCA problem to find an dimensional subspace can be formulated as
(26) 
for . is a sparsity inducing penalty. Instead of solving (26), a widely used greedy strategy is to find the approximate solution of (26) by sequentially solving the following single PC problem
(27) 
An alternative of the sparsity constrained formulation (27) is the sparsity penalized formulation as follows
(28) 
where is a regularization parameter.
There exist numerous algorithms employing different penalty functions for the formulations (26)–(28) and their variants. In [126], the constraint is used for the formulation (26). In [127], an elasticnet regression based algorithm has been proposed. Then, a semidefinite relaxation method has been developed in [128], [129] for the formulation (27) with constraint. Meanwhile, in [130], a regularized low rank matrix approximation method has been designed with the consideration of different penalties, i.e., , and SCAD. In [131], with and penalties, reformulations of (28) and its block variant have been solved via gradient algorithms. Generalized version of (28) with penalty has been considered in [132]. Moreover, an alternative discrete spectral formulation of constrained (27) and an effective greedy approach have been presented in [133]. In [134], unified alternating maximization framework for and constrained or penalized sparse PCA problems (using or loss) has been proposed.
More recently, robust sparse PCA using loss and penalty has been considered in [135]. Meanwhile, an ADMM based distributed sparse PCA algorithm has been proposed in [136] which covers the , log sum and MCP penalties. Moreover, Shatten penalty has been used for structured sparse PCA in [137]. In addition, there also exist several other methods for constrained or penalized sparse PCA problems, e.g., [138]–[141].
Iv Sparse Matrix Recovery
This section reviews the nonconvex regularization based sparse matrix recovery problems, mainly on large covariance matrix and inverse covariance matrix estimation, which are two fundamental problems in modern multivariate analysis [142]. Nowadays, the advance of information technology makes massive highdimensional data widely available for scientific discovery. In this context, effective statistical analysis for highdimensional data is becoming increasingly important. In many applications involving statistical analysis of highdimensional data, estimating large covariance or inverse covariance matrices is necessary for effective dimensionality reduction or discriminant analysis. Such applications arise in economics and finance, bioinformatics, social networks, smart grid, climate studies, and health sciences [142]–[144]. In the highdimensional setting, the dimensionality is often comparable to (or even larger than) the sample size. In these cases, the sample covariance matrix estimator has a poor performance [145], and intrinsic structures such as sparsity can be exploited to improve the estimation of covariance and inverse covariance matrices [142], [146]–[148], [155].
Iva Large Sparse Covariance Matrix Estimation
Consider a vector with covariance , the objective is to estimate its covariance from observations . Usually, compared to estimate directly, it is more favorable to estimate the correlation matrix first, . Then, given the estimated correlation matrix , the corresponding estimation of the covariance matrix is , where is the sample covariance matrix. That is because the correlation matrix has the same sparsity pattern of the covariance matrix but with all the diagonal elements known to be one, thus, it can be estimated more accurately than the covariance matrix [149]–[151].
Given the sample correlation matrix , the generalized thresholding estimator [148] solves the following problem
(29) 
with be generalized penalty function for sparsity promotion as introduced in section 2. Note that, the diagonal elements are not penalized since the diagonal elements of a correlation (also covariance) matrix are always positive. The solution to (29) is a thresholding of the sample correlation matrix , which can be efficiently computed as shown in section 2.
The thresholding estimator (29) have good theoretical properties. It is consistent over a large class of (approximately) sparse covariance matrices [148]. However, in practical finite sample applications, such an estimator is not always positivedefinite although it converges to a positivedefinite limit in the asymptotic setting [149], [151]. To simultaneously achieve sparsity and positivedefiniteness, positivedefinite constraint can be added into (29) as [152]
Comments
There are no comments yet.