Nonconvex Regularization Based Sparse and Low-Rank Recovery in Signal Processing, Statistics, and Machine Learning

08/16/2018 ∙ by Fei Wen, et al. ∙ Shanghai Jiao Tong University

In the past decade, sparse and low-rank recovery have drawn much attention in many areas such as signal/image processing, statistics, bioinformatics and machine learning. To induce sparsity and/or low-rankness, the ℓ_1 norm and nuclear norm are among the most popular regularization penalties due to their convexity. While the ℓ_1 and nuclear norms are convenient, as the related convex optimization problems are usually tractable, it has been shown in many applications that a nonconvex penalty can yield significantly better performance. Recently, nonconvex regularization based sparse and low-rank recovery has attracted considerable interest and, in fact, is a main driver of the recent progress in nonconvex and nonsmooth optimization. This paper gives an overview of this topic in various fields of signal processing, statistics and machine learning, including compressive sensing (CS), sparse regression and variable selection, sparse signals separation, sparse principal component analysis (PCA), large covariance and inverse covariance matrix estimation, matrix completion, and robust PCA. We present recent developments of nonconvex regularization based sparse and low-rank recovery in these fields, addressing the issues of penalty selection, applications, and the convergence of nonconvex algorithms.




I Introduction

In the past decade, sparse and low-rank recovery have attracted much attention in many areas, such as signal processing, image processing, statistics, bioinformatics and machine learning. To promote sparsity and low-rankness, sparsity and low-rankness constraints or penalties are commonly employed. Among the sparsity and low-rankness inducing penalties, the ℓ_1-norm and nuclear norm penalties are the most popular. This is mainly due to their convexity, which makes the involved optimization problems tractable: existing convex optimization techniques with well-established convergence properties can be used directly or after some extension.

Generally, under certain conditions, the ℓ_1 and nuclear norm penalties can reliably recover the underlying true sparse signal and low-rank matrix with high probability. However, both of them have a bias problem, which can result in significantly biased estimates, and they cannot achieve reliable recovery with the least observations [1], [2], [3]. In comparison, a nonconvex penalty, such as the ℓ_p-norm (0 ≤ p < 1), the smoothly clipped absolute deviation (SCAD), or the minimax concave penalty (MCP), is superior in that it can ameliorate the bias problem of the ℓ_1 penalty. Recently, nonconvex regularization based sparse and low-rank recovery has drawn considerable interest and achieved significant performance improvement over convex regularization in many applications. This progress benefited greatly from recent developments in nonconvex and nonsmooth optimization and, in turn, promoted the development of the latter.

The goal of this article is to give an overview of the recent developments in nonconvex regularization based sparse and low-rank recovery in signal/image processing, statistics and machine learning. In a field as wide as this, we mainly focus on the following eight important topics.

1) Compressive sensing (CS): CS aims to acquire sparse signals (or signals that can be sparsely represented in some basis) at a significantly lower rate than classical Nyquist sampling [14]–[16]. In CS, exploiting the sparsity (or sparse representation) of the desired signals is the key to their reconstruction. A number of recent works address nonconvex regularized sparse reconstruction, e.g., using ℓ_0 and ℓ_p regularization [39]–[66]. It has been demonstrated that nonconvex regularization not only ameliorates the bias problem of the ℓ_1 penalty but also requires fewer measurements for reliable recovery.

2) Sparse regression and variable selection: Sparse regression aims to simultaneously select variables and estimate their coefficients, which is a fundamental problem in high-dimensional statistical analysis. Nonconvex penalties, such as the SCAD, ℓ_0, ℓ_p and MCP penalties, have been widely employed to attain more accurate estimation than the ℓ_1 penalty [11], [12], [83]–[96].

3) Sparse signals separation and image inpainting: Sparse signals separation problems arise in many important applications, such as source separation, super-resolution, inpainting, interference cancellation, saturation and clipping restoration, and robust sparse recovery in impulsive (sparse) noise. In these applications, nonconvex penalties can attain considerable restoration performance improvement over the ℓ_1 penalty [120].

4) Sparse principal component analysis (PCA): PCA is a useful statistical tool for data analysis and dimensionality reduction, which is widely used in virtually all areas of science and engineering. Sparse PCA aims to obtain sparse loading vectors to enhance the interpretability of the principal components (PCs). Nonconvex regularization, such as ℓ_0 regularization, has been widely used for sparsity promotion [128]–[141].

5) Large sparse covariance matrix estimation: Large covariance matrix estimation is a fundamental problem in modern high-dimensional statistical analysis, which has found wide application in economics, finance, bioinformatics, social networks, climate studies, and health sciences. In the high-dimensional setting, where the dimensionality is often comparable to (or even larger than) the sample size, sparsity regularized estimation is especially effective. Works addressing nonconvex regularized covariance matrix estimation include [148], [152].

6) Large sparse inverse covariance matrix estimation: Large inverse covariance matrix estimation is also fundamental to high-dimensional statistical analysis, and is closely related to undirected graphical models under a Gaussian assumption. Papers addressing nonconvex regularized inverse covariance matrix estimation include [160]–[166].

7) Matrix completion: Matrix completion deals with the recovery of a low-rank matrix from its partially observed entries. Such problems arise in many applications, such as recommender systems, computer vision and system identification. Various algorithms have been developed using the nonconvex Schatten-p (0 < p < 1) norm, truncated nuclear norm, and MCP penalties for low-rankness inducing, e.g., [6], [48], [175]–[182], to achieve better recovery performance than the nuclear norm.

8) Robust PCA: Robust PCA aims to enhance the robustness of PCA against outliers or sparse corruption, which is ubiquitous in modern applications such as image processing, web data analysis, and bioinformatics. Basically, robust PCA is a joint sparse and low-rank recovery problem, which seeks to identify a low-dimensional structure from grossly corrupted observations. Existing works using different combinations of nonconvex sparse and low-rank penalties for robust PCA include [9], [193]–[197].

Among these topics, CS, sparse regression, sparse signals separation, and sparse PCA are sparse vector recovery problems; large sparse covariance and inverse covariance matrix estimation are sparse matrix recovery problems; whilst matrix completion and robust PCA are low-rank recovery problems. To be more precise, sparse PCA is not strictly a vector recovery problem, but in many popular greedy methods the PCs are estimated in a vector-by-vector manner. Meanwhile, robust PCA is a joint sparse and low-rank recovery problem.

In this paper, we also provide some critical perspectives. As is often the case when new techniques are introduced, there can be some overexcitement and abuse. We comment on the following points. There exist instances in both sparse and low-rank recovery where the use of nonconvex regularization is simply unnecessary and will not significantly improve performance [95]. The use of nonconvex regularization does not always guarantee a distinct performance improvement over convex regularization. Moreover, employing nonconvex regularization models can even be disadvantageous, because the related nonconvex and nonsmooth optimization problems are less tractable than convex ones. Performance should be defined in a broader sense that includes not only recovery accuracy but also other properties of the algorithm, such as convergence speed. Generally, for a nonconvex regularized algorithm, the performance is closely related to the initialization, and the convergence rate is usually slower than that of a convex regularized algorithm. In comparison, a convex algorithm has better stability and convergence properties, and is insensitive to initialization since convergence to a global minimum is usually easily guaranteed [202]. Meanwhile, for first-order convex algorithms, well-developed acceleration techniques can be used with guaranteed convergence [26]. However, when such acceleration techniques are applied to nonconvex algorithms, there is no guarantee of convergence and stability.

Therefore, the question of whether to use convex or nonconvex regularization models requires careful deliberation. We show that convex models may be preferable when the signal is not strictly sparse (or the matrix is not strictly low-rank) or the signal-to-noise ratio (SNR) is low, since in these cases the performance improvement of nonconvex models is often not distinct and may not justify the cost of more slowly converging nonconvex algorithms. We provide a number of concrete examples that clarify these points.

We also address the issues of penalty selection and the convergence of related nonconvex algorithms. We hope that our paper will illuminate the role nonconvex regularization plays in sparse and low-rank recovery in signal processing, statistics and machine learning, and demonstrate when and how it should be used.

There also exist some recent overview articles related to the topics of this work, e.g., on low-rank matrix recovery [203], [204], and on nonconvex optimization in machine learning [205]. While the works [203], [204] mainly focus on matrix factorization based (nonconvex) low-rank recovery problems, the low-rank recovery problems investigated in this work are more general. Meanwhile, while the article [205] mainly focuses on the optimization aspect of general nonconvex problems in machine learning, we focus on nonconvex regularized problems characterized by the non-convexity and non-smoothness of the involved formulations. Moreover, we provide a wider scope of applications in various fields.


Outline: The rest of this paper is organized as follows. In Section II, we review the proximity operator for nonconvex penalties, and present an extended vector proximity operator (for joint sparse recovery) and a singular value shrinkage operator (for low-rank matrix recovery) for generalized nonconvex penalties. Section III discusses nonconvex regularized sparse vector recovery problems, including CS, sparse regression, sparse signals separation and sparse PCA. Section IV reviews nonconvex regularized sparse matrix recovery problems, including large sparse covariance and inverse covariance matrix estimation. Section V discusses nonconvex regularized low-rank recovery problems, including matrix completion and robust PCA. Section VI further discusses other applications involving nonconvex regularized sparse and low-rank recovery. Section VII concludes the overview.

Notations: For a matrix X, rank(X), tr(X), det(X), ‖X‖₂ and ‖X‖_F are the rank, trace, determinant, spectral norm and Frobenius norm, respectively, whilst λ_max(X), λ_min(X) and σ_i(X) denote the maximal eigenvalue, minimal eigenvalue and the i-th largest singular value of X. For a matrix X, diag(X) is a diagonal matrix which has the same diagonal elements as X, whilst for a vector x, diag(x) is a diagonal matrix with diagonal elements x. X ⪰ 0 means that X is positive-semidefinite. ⟨·,·⟩ and (·)^T stand for the inner product and transpose, respectively. ∇f and ∂f stand for the gradient and subdifferential of the function f, respectively. sign(·) denotes the sign of a quantity, with sign(0) = 0. I stands for an identity matrix. ‖x‖_p with p > 0 denotes the ℓ_p-norm defined as ‖x‖_p = (Σᵢ |xᵢ|^p)^{1/p}. δ(·) is the Kronecker delta function. E(·) denotes the expectation. 1(·) denotes the indicator function.

II Proximity Operator for Nonconvex Regularization Penalties

The proximity operator plays a central role in developing efficient proximal splitting algorithms for many optimization problems, especially for the nonconvex and nonsmooth inverse problems encountered in the applications addressed in this paper. As will be shown later, for convex or nonconvex penalized minimization problems, the proximity operator is the core of most highly efficient first-order algorithms, which scale well to high-dimensional problems. This section reviews nonconvex regularization penalties and their corresponding proximity operators, including hard-thresholding, the ℓ_p-norm, an explicit p-shrinkage, SCAD, MCP, and firm thresholding.

II-A Scalar Proximity Operator

For a proper and lower semi-continuous penalty function P_λ(·), where λ > 0 is a threshold parameter, consider the following scalar proximal mapping

prox_{P_λ}(t) = arg min_x { (x − t)²/2 + P_λ(x) }.  (1)

As the penalty is separable over the components, the proximity operator of a vector v, denoted by prox_{P_λ}(v), can be computed in an element-wise manner as

[prox_{P_λ}(v)]_i = prox_{P_λ}(v_i).  (2)
Table I shows the penalties and the corresponding proximal mapping operators. Among the presented penalties, only the soft-thresholding (ℓ_1) penalty is convex, while the other penalties are (symmetric) folded concave functions, as shown in Fig. 1.

The hard-thresholding was first introduced in [36] and then applied in wavelet applications in statistics [4]; it is a natural choice for sparsity inducing [77]–[81]. The well-known soft-thresholding rule was first observed by Donoho, Johnstone, Hoch and Stern [35] and then used in wavelet applications [36]; it forms the core of the LASSO introduced by Tibshirani [37]. The ℓ_1 penalty is the most popular one, as its convexity makes the related optimization problems more tractable than those using a nonconvex penalty.

However, the ℓ_1 penalty has a bias problem. More specifically, when the true parameter has a relatively large magnitude, the soft-thresholding estimator is biased, since it imposes a constant shrinkage on the parameter, as shown in Fig. 2, which plots the thresholding/shrinkage functions for the hard-, soft-, ℓ_p- and SCAD penalties with the same threshold. In contrast, the hard-thresholding and SCAD estimators are unbiased for large parameters. Meanwhile, the thresholding rules corresponding to the SCAD and p-shrinkage penalties fall (sandwiched) between hard- and soft-thresholding.
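As a concrete illustration of this bias behavior, the following Python sketch implements soft-, hard- and SCAD thresholding in commonly used parameterizations (an assumption for illustration: λ here plays the role of the threshold parameter, and the SCAD parameter a = 3.7 is the usual default, not necessarily the exact conventions of Table I):

```python
import numpy as np

def soft_threshold(v, lam):
    """Soft-thresholding: prox of the l1 penalty lam*|x|; shrinks by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def hard_threshold(v, lam):
    """Hard-thresholding: prox of the l0 penalty lam*1{x != 0};
    the effective threshold is sqrt(2*lam)."""
    return np.where(np.abs(v) > np.sqrt(2.0 * lam), v, 0.0)

def scad_threshold(v, lam, a=3.7):
    """SCAD thresholding rule (a > 2): soft near zero, identity for
    large inputs, linear interpolation in between."""
    absv = np.abs(v)
    return np.where(absv <= 2.0 * lam,
                    np.sign(v) * np.maximum(absv - lam, 0.0),
                    np.where(absv <= a * lam,
                             ((a - 1.0) * v - np.sign(v) * a * lam) / (a - 2.0),
                             v))
```

With the same λ, soft_threshold shrinks even large inputs by λ (the constant bias discussed above), while hard_threshold and scad_threshold leave large inputs unchanged.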

Fig. 1: Plot of the penalty functions.
Fig. 2: Plot of the thresholding/shrinkage functions (with the same threshold).
TABLE I: Regularization penalties and the corresponding proximity operators (λ is a thresholding parameter): (i) soft thresholding [35]; (ii) hard thresholding [36]; (iii) ℓ_p-norm [6], [7] (proximity operator without a closed form in general); (iv) p-shrinkage [10] (explicit shrinkage whose penalty has no closed form); (v) SCAD [11]; (vi) MCP [12]; (vii) firm thresholding [13].

The ℓ_p penalty with 0 < p < 1 bridges the gap between the ℓ_0 and ℓ_1 penalties, and intuitively its shrinkage function is less biased than soft-thresholding. The proximity operator of the ℓ_p-norm penalty does not have a closed-form expression except for the two special cases p = 1/2 and p = 2/3 [5] (in these two cases the proximal mapping can be explicitly expressed via the solution of a cubic or quartic equation), but it can be efficiently computed, e.g., by a Newton's method. Moreover, an explicit p-shrinkage mapping has been proposed in [8]–[10], which has some qualitative resemblance to the ℓ_p proximal mapping while being continuous and explicit. The p-shrinkage reduces to the soft-thresholding when p = 1, while it tends pointwise to the hard-thresholding in the limit as p → −∞. As an acceptable price to pay for having an explicit proximal mapping, its penalty does not have a closed-form expression.
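A small Python sketch of the explicit p-shrinkage mapping, in the commonly cited form shrink_p(t) = sign(t)·max(|t| − λ^{2−p}|t|^{p−1}, 0) (this exact parameterization is an assumption drawn from the p-shrinkage literature, not a formula quoted from Table I):

```python
import numpy as np

def p_shrinkage(v, lam, p):
    """Explicit p-shrinkage: reduces to soft-thresholding at p = 1 and
    approaches hard-thresholding as p decreases toward -infinity."""
    absv = np.abs(v)
    with np.errstate(all='ignore'):       # |v| = 0 raises 0**(p-1) warnings
        shrunk = np.maximum(absv - lam ** (2.0 - p) * absv ** (p - 1.0), 0.0)
    return np.sign(v) * shrunk
```

Note how the shrinkage amount λ^{2−p}|t|^{p−1} decays for large |t|, so large coefficients are left nearly untouched, unlike soft-thresholding.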

The SCAD penalty has been widely used in variable selection problems and has shown favorable effectiveness compared with other penalties in high-dimensional variable selection [11]. Like the ℓ_0, ℓ_p, and SCAD penalties, the MCP penalty can also ameliorate the bias problem of the ℓ_1 penalty [12], and it has been widely used for penalized variable selection in high-dimensional linear regression. For the MCP penalty, a continuum of penalties and threshold operators can be obtained by varying its concavity parameter over its admissible range. Moreover, the firm thresholding is a continuous and piecewise-linear approximation of the hard-thresholding [13]. In addition, a class of nonconvex penalties which can maintain the convexity of the global cost function has been designed in [209]–[211], whilst log penalties have been considered in [212], [213].

Among the presented shrinkage functions, the soft-thresholding, SCAD, p-shrinkage and firm thresholding are continuous, while the hard- and ℓ_p-thresholding are discontinuous. For each of the presented penalties, the corresponding thresholding/shrinkage operator satisfies

sign(prox_{P_λ}(t)) = sign(t),  |prox_{P_λ}(t)| ≤ |t|,  prox_{P_λ}(t) = 0 for |t| ≤ t₀,  and  |t − prox_{P_λ}(t)| ≤ t₀,  (3)

where t₀ > 0 is a threshold dependent on λ. These conditions establish the sign consistency, shrinkage, thresholding, and limited shrinkage properties of the thresholding/shrinkage operator corresponding to a generalized penalty.

II-B Vector Proximity Operator (for Multitask Joint Sparse Recovery)

In many applications, as will be shown in Section III, it is desirable to jointly recover multichannel signals to exploit the multichannel joint sparsity pattern. This involves solving the following vector optimization problem

prox_{P_λ}(v) = arg min_x { ‖x − v‖₂²/2 + P_λ(‖x‖₂) },  (4)

where x, v ∈ ℝ^L with L being the channel number. A special case has been considered in [120]; we extend the result to generalized penalties in the following.

Theorem 1. Suppose that P_λ(t) is a nondecreasing function of |t|. Then, for any v ≠ 0, the solution to (4) is given by

x* = (prox_{P_λ}(‖v‖₂)/‖v‖₂) v.  (5)
Proof: See Appendix A.
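Theorem 1 admits a compact implementation: the ℓ_2 norm of the vector is shrunk by a scalar proximity operator and the direction is retained. A minimal Python sketch (the scalar operator passed in is interchangeable; soft-thresholding is used below only as an example):

```python
import numpy as np

def group_prox(v, lam, scalar_prox):
    """Vector proximity operator of (4): shrink ||v||_2 with a scalar
    proximity operator and keep the direction of v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    return (scalar_prox(norm, lam) / norm) * v

# example scalar operator: soft-thresholding
soft = lambda t, lam: np.sign(t) * max(abs(t) - lam, 0.0)
```

With soft as the scalar operator, this recovers the well-known group soft-thresholding used in multitask joint sparse recovery.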

II-C Singular Value Shrinkage Operator (for Low-Rank Matrix Recovery)

For a matrix X ∈ ℝ^{m×n}, consider the singular value decomposition (SVD) of rank r, X = U Σ Vᵀ, where Σ contains the singular values, and U and V contain the orthonormal singular vectors. In low-rank matrix recovery, low-rankness promotion on a matrix is usually achieved by sparsity promotion on the singular values of the matrix. We denote a generalized penalty for low-rankness promotion by Q_λ(·), defined as

Q_λ(X) = Σᵢ P_λ(σᵢ(X)),  (6)

where P_λ is a generalized penalty for sparsity inducing as introduced in Table I. In the two cases of P_λ being the ℓ_0 and ℓ_1 penalties, Q_λ(X) becomes the rank and the nuclear norm of X, respectively. When P_λ is the ℓ_p penalty, Q_λ(X) becomes the Schatten-p quasi-norm of the matrix.

In the following, we provide the generalized singular-value shrinkage operator for a generalized penalty Q_λ.

Theorem 2. For a rank-r matrix X ∈ ℝ^{m×n}, suppose that it has the SVD X = U Σ Vᵀ, where Σ = diag(σ₁, …, σᵣ), and U and V contain the left and right singular vectors. Then, for any Q_λ defined as in (6) with P_λ satisfying (3), the solution to the optimization problem

min_Z { ‖Z − X‖_F²/2 + Q_λ(Z) }  (7)

is given by

Z* = U diag(prox_{P_λ}(σ₁), …, prox_{P_λ}(σᵣ)) Vᵀ,  (8)

where prox_{P_λ}(·) is the proximity operator defined in (2).

Proof: See Appendix B.
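Theorem 2 can be sketched in a few lines of Python: compute an SVD, apply a scalar proximity operator to each singular value, and reassemble (the scalar operator is a pluggable argument; with soft-thresholding this yields the familiar singular value soft-thresholding associated with the nuclear norm):

```python
import numpy as np

def singular_value_shrink(X, lam, scalar_prox):
    """Generalized singular-value shrinkage of (8): shrink each singular
    value of X with a scalar proximity operator."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.array([scalar_prox(si, lam) for si in s])
    return (U * s_shrunk) @ Vt

# example scalar operator: soft-thresholding
soft = lambda t, lam: np.sign(t) * max(abs(t) - lam, 0.0)
```

Swapping in a nonconvex scalar operator (hard, SCAD, p-shrinkage) gives the corresponding nonconvex singular value shrinkage without changing the SVD machinery.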

III Sparse Vector Recovery

This section reviews nonconvex regularization based sparse vector signal recovery, mainly on the following four topics: CS, sparse regression and variable selection, sparse signals separation with application to image inpainting and super-resolution, and sparse PCA. Strictly speaking, sparse PCA is not a vector recovery problem, but in many popular greedy approaches the principal components are estimated in a one-by-one (vector-by-vector) manner.

III-A Compressive Sensing

In the past decade, compressive sensing has attracted extensive study [14]–[17] and has found wide applications in radar [18], [19], communications [20], medical imaging [21], image processing [22], and speech signal processing [23]. In the CS framework, sparse signals (or signals that can be sparsely represented in some basis) can be acquired at a significantly lower rate than classical Nyquist sampling, and signals only need to be sampled at a rate proportional to their information content.

In CS, the objective is to reconstruct a sparse signal x from its compressed measurement

y = Ax + v,  (9)

where A ∈ ℝ^{m×n} with m < n is the sensing matrix (also called the measurement matrix), and v is additive measurement noise. Since m < n, the recovery of x from the compressed measurement y is generally ill-posed. However, provided that x is sparse and the sensing matrix satisfies some stable embedding conditions [17], x can be reliably recovered with an error upper bounded by the noise strength. This can be achieved in the noiseless case by the formulation

min_x P_λ(x)  subject to  Ax = y,  (10)

or in the noisy case by the formulation

min_x P_λ(x)  subject to  ‖Ax − y‖₂ ≤ η,  (11)

where P_λ(·) is a penalty for sparsity inducing, and η > 0 bounds the ℓ_2-norm of the residual error. The constrained formulation (11) can be converted into an unconstrained formulation as

min_x { ‖Ax − y‖₂²/2 + P_λ(x) }.  (12)
Naturally, using the ℓ_0-norm penalty, which counts the number of nonzero components of the vector, (10), (11) and (12) are the exact formulations of finding a sparse vector that fulfills the linear constraint Ax = y, satisfies the residual constraint ‖Ax − y‖₂ ≤ η, and minimizes the quadratic loss function in (12), respectively. However, with the ℓ_0 penalty the problems (10)–(12) are nonconvex and NP-hard; thus, convex relaxation methods are often considered, e.g., replacing the ℓ_0 penalty by the ℓ_1 one. With the ℓ_1 penalty, the formulations (10), (11) and (12) are the well-known basis pursuit (BP) [24], basis pursuit denoising (BPDN), and LASSO [37] formulations, respectively. In this case, the formulations are convex and hence tractable. A large number of algorithms have been developed in the past decade for these ℓ_1-minimization problems (see [25]–[29] and the references therein).

The CS theory has established that if the sensing matrix satisfies certain conditions, such as the restricted isometry property (RIP) [17], [30]–[32], the null space property [33], or the incoherence condition [34], the sparse signal can be reliably reconstructed by ℓ_1 regularization. However, due to the relaxation, the recovery accuracy is often degraded; e.g., the relaxation often introduces extra bias [1], [2] and cannot reconstruct a signal with the least observations [3]. Furthermore, for some applications, the result of ℓ_1-minimization is not sparse enough and the original signal cannot be recovered. A simple example of such a case has been given in [38] with an intuitive explanation.

To address this problem, a number of improved algorithms have been developed by employing the nonconvex ℓ_p-norm penalty instead of the ℓ_1 one. For 0 < p < 1, ‖x‖_p is the quasi-norm defined as ‖x‖_p = (Σᵢ |xᵢ|^p)^{1/p}. Compared with the ℓ_1-norm, the ℓ_p-norm is a closer approximation of the ℓ_0-norm. It has been shown in [37] that, under certain RIP conditions on the sensing matrix, ℓ_p-regularized algorithms require fewer measurements to achieve reliable recovery than ℓ_1-regularized algorithms. Moreover, the sufficient conditions in terms of RIP for ℓ_p regularization are weaker than those for ℓ_1 regularization [39], [40]. Meanwhile, it has been shown in [41] that, for any given measurement matrix with a bounded restricted isometry constant, there exists some p > 0 that guarantees exact recovery of sufficiently sparse signals by ℓ_p-minimization.

Recently, ℓ_p-regularized sparse reconstruction has been extensively studied for CS, e.g., [5], [8], [10], [39]–[66], [200], and extended to structured sparse recovery [201]. As the ℓ_p penalty is nonsmooth and nonconvex, many of these algorithms solve a smoothed (approximated) ℓ_p-minimization problem; e.g., the works [47]–[49] use an approximation of ‖x‖_p^p of the form

Σᵢ (xᵢ² + ε)^{p/2},  (13)

where ε > 0 is a smoothing parameter. Furthermore, the iteratively reweighted algorithms [38], [44], [46] use reweighted ℓ_1 and ℓ_2 penalties of the form (at the k-th iteration)

Σᵢ (|xᵢ^(k)| + ε)^{p−1} |xᵢ|  and  Σᵢ ((xᵢ^(k))² + ε)^{p/2−1} xᵢ²,

which explicitly relate to the ℓ_p-norm approximation.

These algorithms have been shown to achieve better recovery performance than ℓ_1-regularized algorithms. However, due to the non-convexity of ℓ_p-minimization, many of these algorithms are generally inefficient and impractical for large-scale problems. For example, the StSALq method in [47] requires repetitive computation of a matrix inversion, whilst the IRucLq method in [48] solves a set of linear equations using matrix factorization. In comparison, the Lp-RLS method in [49], which uses an efficient conjugate gradient (CG) method in solving a sequence of smoothed subproblems, scales better. Moreover, the iteratively reweighted method in [50] also involves solving a sequence of weighted subproblems. Although efficient first-order algorithms can be used to solve the subproblems involved in the methods in [49], [50], both methods have a double loop, which hinders their overall efficiency. While subsequence convergence is guaranteed for the iteratively reweighted methods [48] and [50], there is no such guarantee for StSALq [47] and Lp-RLS [49].

In comparison, the proximal gradient descent (PGD) and alternating direction method of multipliers (ADMM) algorithms for the problem (12) are globally convergent under mild conditions, while being much more efficient. Specifically, let f(x) = ‖Ax − y‖₂²/2 and consider the following quadratic approximation of the objective in the formulation (12) at iteration k, at a given point x^(k):

Q(x, x^(k)) = f(x^(k)) + ⟨∇f(x^(k)), x − x^(k)⟩ + (μ/2)‖x − x^(k)‖₂² + P_λ(x),  (14)

where μ > 0 is a proximal parameter. Then, minimizing Q(x, x^(k)) reduces to the proximity operator introduced in Section II as

x^(k+1) = prox_{P_{λ/μ}}(x^(k) − ∇f(x^(k))/μ),  (15)

which can be computed element-wise via the shrinkage/thresholding functions in Table I.

This PGD algorithm fits the frameworks of the forward-backward splitting method [67] and the generalized gradient projection method [68]. Very recently, the convergence properties of this class of algorithms have been established by exploiting the Kurdyka-Lojasiewicz (KL) property of the objective function [69]–[71]. Suppose that P_λ is a closed, proper, lower semi-continuous KL function; then, for a sufficiently large proximal parameter μ, the sequence generated by PGD (15) converges to a stationary point of the problem (12). Further, under additional conditions, convergence of PGD to a local minimizer can be guaranteed [72], [73].
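A minimal Python sketch of the PGD iteration (15) for the penalized LS problem (12), with the step size 1/μ set from the Lipschitz constant of the gradient of the quadratic term (the prox argument is pluggable; soft-thresholding appears below only as an example, and the stopping rule is a fixed iteration count for simplicity):

```python
import numpy as np

def pgd(A, y, lam, prox, n_iter=200, x0=None):
    """Proximal gradient descent for min_x 0.5*||Ax - y||^2 + lam*P(x),
    where P is handled through its (possibly nonconvex) proximity operator."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.copy()
    mu = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # gradient of the smooth term
        x = prox(x - grad / mu, lam / mu)  # shrinkage/thresholding step
    return x

# example prox: vectorized soft-thresholding
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```

The per-iteration cost is dominated by the two matrix-vector products, matching the O(mn) complexity discussed below; replacing soft with a nonconvex operator changes only the prox argument.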

For the ADMM algorithm, using an auxiliary variable z, the problem (12) can be equivalently reformulated as

min_{x,z} { P_λ(x) + ‖Az − y‖₂²/2 }  subject to  x = z.  (16)

Then, the ADMM algorithm consists of the following steps

x^(k+1) = arg min_x { P_λ(x) + (ρ/2)‖x − z^(k) + w^(k)‖₂² },
z^(k+1) = arg min_z { ‖Az − y‖₂²/2 + (ρ/2)‖x^(k+1) − z + w^(k)‖₂² },
w^(k+1) = w^(k) + x^(k+1) − z^(k+1),

where w is the (scaled) dual variable and ρ > 0 is a penalty parameter. As a standard trick, the z-subproblem can be solved approximately by linearizing the quadratic term. For a closed, proper, lower semi-continuous KL penalty P_λ, under some conditions on the proximal parameter and ρ (which should be chosen sufficiently large), this proximal ADMM algorithm globally converges to a stationary point of the problem (16) [74], [75].

For PGD and ADMM, the dominant computational load in each iteration is matrix-vector multiplication with complexity O(mn); thus, they scale well to high-dimensional problems. These two algorithms may be further accelerated by the schemes in [26], [76]; however, for a nonconvex penalty, there is no guarantee of convergence when using such acceleration schemes.
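The ADMM steps above can be sketched as follows in Python (a minimal sketch with the exact z-update via a cached matrix inverse rather than the linearized variant; the prox argument is again a pluggable shrinkage operator, and the iteration count is fixed for simplicity):

```python
import numpy as np

def admm(A, y, lam, prox, rho=1.0, n_iter=200):
    """ADMM sketch for min_{x,z} lam*P(x) + 0.5*||Az - y||^2 s.t. x = z,
    with scaled dual variable w."""
    m, n = A.shape
    x, z, w = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))  # cached for the z-update
    Aty = A.T @ y
    for _ in range(n_iter):
        x = prox(z - w, lam / rho)           # x-update: proximity operator
        z = M @ (Aty + rho * (x + w))        # z-update: regularized LS
        w = w + x - z                        # dual update
    return x

# example prox: vectorized soft-thresholding
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```

Caching the inverse (or a Cholesky factorization) keeps the per-iteration cost to matrix-vector products; the linearized variant mentioned above avoids even the one-time factorization.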

Fig. 3: The two considered images.
Fig. 4: Recovery performance of ℓ_p-thresholding versus p in reconstructing the two images.
Fig. 5: Sorted and normalized absolute values of the wavelet coefficients of the two images (the first 10000 largest values).

Example 1 (Image reconstruction). We evaluate the PGD (15) with hard-, ℓ_p- and soft-thresholding on image reconstruction. The test images, shown in Fig. 3, are a synthetic image, “Shepp-Logan”, and an MRI image. We use the Haar wavelets as the basis for sparse representation of the images, and A is a partial DCT sensing matrix. The ℓ_p-thresholding is initialized by the solution of the soft-thresholding. The regularization parameter is selected to give the best performance.

Fig. 4 shows the peak signal-to-noise ratio (PSNR) of the recovery by ℓ_p-thresholding for different values of p at SNR = 50 dB, including the hard- and soft-thresholding as special cases. Each method is able to achieve a high PSNR, greater than 60 dB, in recovering the synthetic image, but degrades significantly in recovering the MRI image (less than 30 dB). This is due to the fact that, as shown in Fig. 5, the wavelet coefficients of the synthetic image are truly sparse (approximately 5.7% nonzeros), while those of a real-life image are not strictly sparse but rather approximately follow an exponential decay, which is referred to as compressible. For the same reason, the best performance of ℓ_p-thresholding on the two images is attained by a relatively small and a relatively large value of p, respectively, as shown in Fig. 4. That is, a relatively small value of p should be used for strictly sparse signals, while a relatively large value of p should be used for compressible (non-strictly sparse) signals. Moreover, the hard- and ℓ_p-thresholding achieve significant performance improvement over the soft-thresholding only for strictly sparse signals.

III-B Sparse Regression and Variable Selection

Nowadays, the analysis of data sets with the number of variables comparable to, or even much larger than, the sample size arises in many areas, such as genomics, health sciences, economics and machine learning [126]. With such high-dimensional data, most traditional variable selection procedures, such as AIC and BIC, become infeasible and impractical due to their prohibitive computational cost. In this scenario, sparse regression, which can simultaneously select variables and estimate their coefficients, has become a very popular topic in the last decade due to its effectiveness in the high-dimensional case [37], [82].

Let X denote the (deterministic) design matrix, β contain the unknown regression coefficients, and y be the response vector. Further, assume that y depends on X through the linear combination Xβ and that the conditional log-likelihood given X is ℓ(Xβ, y). In the variable selection problem, the assumption is that the majority of the true regression coefficients are zero, and the goal is to identify and estimate the subset model. Under this sparsity assumption, a natural method for simultaneously locating and estimating the nonzero coefficients of β is to maximize a penalized likelihood of the form

ℓ(Xβ, y) − Σⱼ P_λ(βⱼ),  (17)

where P_λ is a generalized sparsity promotion penalty as defined in Section II. Moreover, a popular alternative uses the following penalized least-squares (LS) formulation

min_β { ‖y − Xβ‖₂²/2 + Σⱼ P_λ(βⱼ) }.  (18)
Fig. 6: Recovery performance of the hard-, SCAD-, ℓ_p- and soft-thresholding versus SNR with zero initialization. (a) 2% active coefficients in β; (b) 5% active coefficients in β; (c) 10% active coefficients in β.
Fig. 7: Recovery performance of the hard-, SCAD-, ℓ_p- and soft-thresholding versus SNR, initialized by the solution of the convex penalty. (a) 2% active coefficients in β; (b) 5% active coefficients in β; (c) 10% active coefficients in β.

The well-known LASSO method was first proposed in [37] for the linear regression problem (18). In the same spirit as LASSO, nonconvex penalty functions, such as SCAD [11] and MCP [12], have been proposed to select significant variables in various parametric models, including linear regression, generalized linear regression and robust linear regression models [83]. Extensions to some semiparametric models, such as the Cox model and partially linear models, have been considered in [84]–[86]. It has been shown in these works that, with appropriately selected regularization parameters, nonconvex penalized estimators can perform as well as the oracle procedure in selecting the correct subset model and estimating the true nonzero coefficients. Further, even when the dimensionality grows super-polynomially with the sample size, nonconvex penalized likelihood methods possess model selection consistency with oracle properties [87]. In addition, the adaptive LASSO has been proposed in [88], which uses an adaptively weighted ℓ_1 penalty. While LASSO variable selection can be inconsistent in some scenarios, the adaptive LASSO enjoys the oracle properties and also leads to a near-minimax-optimal estimator. The oracle property of the adaptive LASSO has also been demonstrated in the high-dimensional case [89].

To solve a nonconvex penalized sparse regression problem, local approximation of the penalty function can be utilized, such as local quadratic approximation (LQA) [11], [90] and local linear approximation (LLA) [91]. While the LQA based methods use a backward stepwise variable selection procedure, the LLA based method [91] is one-step and hence more favorable.
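As an illustration of the LLA idea: given an initial estimate $\tilde\beta$ (e.g., a LASSO solution), one LLA step replaces the nonconvex penalty by a weighted $\ell_1$ penalty with weights $P'_\lambda(|\tilde\beta_j|)$ and solves the resulting weighted LASSO. The sketch below is our own minimal illustration (helper names are ours, and the weighted problem is solved by plain ISTA); it uses the standard SCAD derivative from [11], not the exact implementation of [91].

```python
import numpy as np

def scad_deriv(beta, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan & Li) evaluated at |beta|."""
    b = np.abs(beta)
    return np.where(b <= lam, lam,
                    np.maximum(a * lam - b, 0.0) / (a - 1.0))

def lla_one_step(A, y, beta0, lam, n_iter=500):
    """One LLA step: solve the weighted-l1 surrogate
       min_beta 0.5||y - A beta||^2 + sum_j w_j |beta_j|,
    with w_j = P'_lam(|beta0_j|), via ISTA (weighted soft thresholding)."""
    w = scad_deriv(beta0, lam)
    t = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L
    beta = beta0.copy()
    for _ in range(n_iter):
        z = beta - t * A.T @ (A @ beta - y)  # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - t * w, 0.0)
    return beta
```

Note that coefficients whose initial magnitude exceeds $a\lambda$ receive zero weight and are left unpenalized, which is how the one-step estimator removes the LASSO bias on strong signals.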

More recently, benefiting from the progress in nonconvex and nonsmooth optimization, a number of direct methods have been developed. Specifically, in [92], a coordinate-descent algorithm has been proposed for the linear regression model (18). By using a continuation process on the parameters of the SCAD or MCP penalty, its convergence to a minimum is guaranteed under certain conditions. Meanwhile, an alternative coordinate-descent algorithm has been presented in [93] with guaranteed convergence to a local minimum. Then, a cyclic descent algorithm employing a nonconvex penalty for multivariate linear regression has been introduced in [94]. Subsequently, a cyclic descent algorithm for the nonconvex penalized LS problem has been proposed in [7], [95], whilst a majorization-minimization (MM) algorithm with momentum acceleration for the penalized LS formulation has been developed in [96]. For both methods in [95] and [96], convergence to a local minimizer is guaranteed under certain conditions. Moreover, as introduced in the last subsection, numerous algorithms, e.g., the PGD and ADMM algorithms, can be applied to the penalized LS formulation (18).
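To make the PGD route concrete, the following minimal sketch (our own illustration, not the implementation of any cited work) runs proximal gradient descent on the penalized LS formulation (18) with either soft thresholding ($\ell_1$ penalty) or hard thresholding ($\ell_0$ penalty); the step size $1/\|A\|_2^2$ and the hard threshold level $\sqrt{2t\lambda}$ are the standard choices.

```python
import numpy as np

def pgd(A, y, penalty="l1", lam=0.01, n_iter=1000):
    """Proximal gradient descent for 0.5||y - A x||^2 + lam * P(x),
    where P is the l1 norm (soft thresholding) or the l0 penalty
    (hard thresholding), starting from the zero initialization."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - t * A.T @ (A @ x - y)        # gradient step
        if penalty == "l1":                  # prox of lam*||x||_1
            x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
        else:                                # prox of lam*||x||_0
            x = np.where(np.abs(z) > np.sqrt(2 * t * lam), z, 0.0)
    return x
```

On a noiseless, well-conditioned problem the hard-thresholding iteration typically recovers the true sparse vector exactly, while the soft-thresholding solution carries the usual $\ell_1$ shrinkage bias, which is consistent with the comparisons in Figs. 6 and 7.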

In addition, a global optimization approach has been recently proposed in [206] for concave-penalty (e.g., SCAD and MCP) based nonconvex learning, via reformulating the nonconvex problems as general quadratic programs. Meanwhile, a theoretical analysis of the statistical performance of local minimizer methods using folded concave penalties has been provided in [207]. Moreover, a class of nonconvex penalties termed trimmed Lasso has been considered in [208], which enables exact control of the desired sparsity level.

Example 2 (Estimation accuracy at different SNRs and sparsity levels). We evaluate the performance of different penalties for the linear sparse regression problem (18) using the PGD (15). Three sparsity levels, 2%, 5%, and 10% active coefficients, are considered. Fig. 6 shows the performance of the different penalties versus SNR when the PGD is initialized by zero, while Fig. 7 shows the results when the PGD is initialized by the solution of the convex $\ell_1$ penalty. It can be seen that the performance of the nonconvex penalties depends heavily on the initialization. The advantage of a nonconvex penalty over the $\ell_1$ penalty is significant when the sparsity level of the coefficient vector is relatively low and/or the SNR is relatively high.

III-C Sparse Signals Separation and Image Inpainting

Sparse signals separation has wide applications, such as source separation, super-resolution and inpainting, interference cancellation, saturation and clipping restoration, and robust sparse recovery in impulsive (sparse) noise. The objective of the problem is to demix two sparse vectors $x_1$ and $x_2$ from their mixed linear measurements

$$y = A x_1 + B x_2, \tag{19}$$

where $A$ and $B$ are known deterministic dictionaries. More specifics on the applications involving the model (19) are as follows.

1) Source separation: in applications such as the separation of texture in images [97], [98] and the separation of neuronal calcium transients in calcium imaging [99], $A$ and $B$ are two dictionaries allowing for sparse representation of the two distinct features, and $x_1$ and $x_2$ are the (sparse or approximately sparse) coefficients describing these features [100]–[102].

2) Super-resolution and inpainting: in the super-resolution and inpainting problem for images, audio, and video signals [103]–[105], only a subset of the samples of the desired signal is available. Given $y$, the objective is to fill in the missing parts, in which case $B = I$ and $x_2$ stands for the missing parts.

3) Interference cancellation: in some audio, video, or communication applications, it is desired to restore a signal corrupted by narrowband interference, such as electric hum [101]. As narrowband interference can be sparsely represented in the frequency domain, $B$ can be an inverse discrete Fourier transform matrix.

4) Saturation and clipping restoration: in many practical systems where the measurements are quantized, nonlinearities in amplifiers may result in signal saturation, causing significant nonlinearity and potentially unbounded errors [101], [106], [107]. In this case, $Ax_1$ is the desired signal and $y$ is the saturated measurement, with $x_2$ standing for the saturation errors.

5) Robust sparse recovery in impulsive noise: impulsive noise is usually approximately sparse and heavy-tailed. In practical image and video processing applications [108]–[110], impulsive noise may come from the measurement process, or be caused by transmission problems, faulty memory locations, buffer overflow, or unreliable memory [111]–[114]. In these cases, $x_2$ represents the (sparsely) impulsive noise and $B = I$.

Exploiting the sparsity, $x_1$ and $x_2$ can be reconstructed via the following formulation

$$\min_{x_1, x_2} \;\; P_1(x_1) + \mu P_2(x_2) \quad \text{subject to} \quad y = A x_1 + B x_2, \tag{20}$$

where $P_1$ and $P_2$ are penalties for sparsity promotion, and $\mu > 0$ is a parameter that takes the statistical difference between the two components into consideration. As will be shown later, when $x_1$ and $x_2$ have different sparsity properties, using two different penalties for $P_1$ and $P_2$ can yield a performance gain over using the same one.

When both $P_1$ and $P_2$ are the $\ell_1$ penalty, (20) reduces to the sparse separation formulation in [102]. Further, with $P_1 = P_2 = \|\cdot\|_1$ and $\mu = 1$, the formulation (20) degenerates to the BP form considered in [100]. Moreover, when $B = I$ and both penalties are the $\ell_1$ norm, (20) reduces to the $\ell_1$-regularized least-absolute problem for robust sparse recovery [115], which has outstanding robustness in the presence of impulsive noise. In addition, the M-estimation method in [218] can also be considered a special case of (20) with a suitably selected robust loss term.

Naturally, nonconvex penalties can be expected to yield better reconstruction performance than the above convex methods. For example, nonconvex regularization has been used in [116]–[119] to obtain more robust restoration of images corrupted by salt-and-pepper impulsive noise. Very recently, a generalized formulation using the $\ell_q$ penalty has been proposed in [120], with $P_1 = \|\cdot\|_{q_1}^{q_1}$ and $P_2 = \|\cdot\|_{q_2}^{q_2}$, $0 \le q_1, q_2 \le 1$. The formulation (20) can be directly solved by a standard two-block ADMM procedure [29], but this often fails to converge in the nonconvex case [120]. To develop convergent algorithms, a recent work [120] proposed to solve a quadratic approximation of (20) and developed two first-order algorithms based on the proximal block coordinate descent (BCD) and ADMM frameworks. The proximal BCD method consists of the following two update steps

$$x_1^{k+1} = \arg\min_{x_1} \; P_1(x_1) + \frac{\rho}{2}\left\|y - A x_1 - B x_2^{k}\right\|_2^2 + \frac{c_1}{2}\left\|x_1 - x_1^{k}\right\|_2^2, \tag{21}$$

$$x_2^{k+1} = \arg\min_{x_2} \; \mu P_2(x_2) + \frac{\rho}{2}\left\|y - A x_1^{k+1} - B x_2\right\|_2^2 + \frac{c_2}{2}\left\|x_2 - x_2^{k}\right\|_2^2, \tag{22}$$

where $c_1$ and $c_2$ are proximal parameters and $\rho$ is the quadratic approximation parameter. Suppose that $P_1$ and $P_2$ are closed, proper, lower semi-continuous KL functions; then, for appropriately chosen $c_1 > 0$ and $c_2 > 0$, the algorithm updated via (21) and (22) is a descent algorithm and the generated sequence converges to a stationary point of the approximated problem.
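To make the procedure concrete, here is a minimal sketch of a BCD iteration for the quadratically approximated separation problem, specialized (for illustration only, with our own function names) to $\ell_1$ penalties on both blocks, so that each block update reduces to a gradient step followed by soft thresholding with per-block step sizes $1/\|A\|_2^2$ and $1/\|B\|_2^2$.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding (prox of t*||.||_1)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_bcd_separate(y, A, B, lam1=0.02, lam2=0.02, n_iter=300):
    """Alternating proximal gradient steps on the two blocks of
       min 0.5||y - A x1 - B x2||^2 + lam1||x1||_1 + lam2||x2||_1."""
    x1 = np.zeros(A.shape[1])
    x2 = np.zeros(B.shape[1])
    t1 = 1.0 / np.linalg.norm(A, 2) ** 2
    t2 = 1.0 / np.linalg.norm(B, 2) ** 2
    for _ in range(n_iter):
        r = A @ x1 + B @ x2 - y              # residual for block 1
        x1 = soft(x1 - t1 * A.T @ r, t1 * lam1)
        r = A @ x1 + B @ x2 - y              # residual for block 2
        x2 = soft(x2 - t2 * B.T @ r, t2 * lam2)
    return x1, x2
```

Each block update is a proximal gradient step with step size $1/L$ of that block's smooth part, so the objective is monotonically non-increasing, matching the descent property stated above.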

In many applications, such as super-resolution and inpainting for color images with 3 (RGB) channels, multichannel joint recovery is more favorable than independent recovery of each channel, as the former can exploit the feature correlation among the channels. In the multitask case, the linear measurements of the channels can be expressed as

$$Y = A X_1 + B X_2, \tag{23}$$

where the columns of $X_1$ and $X_2$ are the sparse features of the two components in the individual channels. To exploit the joint sparsity among the channels, a joint sparsity penalty can be used, e.g., a row-wise $\ell_{q,2}$ penalty defined as

$$P(X) = \sum_{i} \left\|X_{i,:}\right\|_2^{q}, \quad 0 \le q \le 1,$$

where $X_{i,:}$ denotes the $i$-th row of $X$.
(a) Inpainting of an image corrupted by overwritten text
(b) Inpainting of an image corrupted by salt-and-pepper noise (30% of the pixels are corrupted)
Fig. 8: Restoration performance of the $\ell_1$ and $\ell_q$ regularization in color image inpainting.
Fig. 9: Restoration performance versus $q_1$ and $q_2$ in terms of RelErr in dB. Left: Case 1: restoration of the image in Fig. 8 (a), where $x_1$ contains the DCT coefficients of the image and $x_2$ represents the overwritten text. Middle: Case 2: both $x_1$ and $x_2$ are strictly sparse. Right: Case 3: $x_1$ is strictly sparse and $x_2$ is $\alpha$-stable ($S\alpha S$) noise (non-strictly sparse).

Using such a penalty, e.g., with exponent $q_1$ for $X_1$ and $q_2$ for $X_2$, the proximal BCD algorithm for the multitask problem consists of the following two steps [120]

$$X_1^{k+1} = \arg\min_{X_1} \; P_1(X_1) + \frac{\rho}{2}\left\|Y - A X_1 - B X_2^{k}\right\|_F^2 + \frac{c_1}{2}\left\|X_1 - X_1^{k}\right\|_F^2, \tag{24}$$

$$X_2^{k+1} = \arg\min_{X_2} \; \mu P_2(X_2) + \frac{\rho}{2}\left\|Y - A X_1^{k+1} - B X_2\right\|_F^2 + \frac{c_2}{2}\left\|X_2 - X_2^{k}\right\|_F^2, \tag{25}$$

where $c_1$ and $c_2$ are proximal parameters. These two subproblems can be solved row-wise as in (5). Suppose that $P_1$ and $P_2$ are closed, proper, lower semi-continuous KL functions; then, for appropriately chosen $\rho$, $c_1$ and $c_2$, the algorithm updated via (24) and (25) is a descent algorithm and the generated sequence converges to a stationary point of the approximated problem.
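The row-wise subproblem solutions rely on a multivariate (group) thresholding operator. As a concrete illustration, for the convex case $q = 1$ the proximal operator of $t\sum_i \|X_{i,:}\|_2$ has a simple closed form, sketched below with our own helper name: each row's $\ell_2$ norm is shrunk by $t$, and rows whose norm falls below $t$ are zeroed, which is exactly the joint-sparsity analogue of scalar soft thresholding.

```python
import numpy as np

def row_group_soft(Z, t):
    """Proximal operator of t * sum_i ||Z_{i,:}||_2:
    shrink each row's l2 norm by t, zeroing rows with norm <= t."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return Z * scale
```

For example, with `t = 1` a row `[3, 4]` (norm 5) is scaled by 0.8 to `[2.4, 3.2]`, while a row with norm below 1 is set entirely to zero, so whole rows (i.e., features shared across channels) are selected or discarded jointly.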

Example 3 (Color image inpainting). We compare the performance of the $\ell_1$ and $\ell_q$ regularization on color image inpainting using the BCD method (24)–(25). The task is to restore the original image from text overwriting or salt-and-pepper noise corruption. In this case, $B = I$ and $X_2$ represents the sparse corruption in the three (RGB) channels. We select $A$ as an inverse discrete cosine transformation (IDCT) matrix, so that $X_1$ contains the DCT coefficients of the image. As shown in Fig. 8, the $\ell_q$ regularization significantly outperforms the $\ell_1$ one, e.g., the improvement is more than 9 dB in the overwritten-text case and more than 5 dB in the salt-and-pepper noise case.

Example 4 (Evaluation of different penalties). An important problem in practice is how to select a nonconvex penalty. This example sheds some light on this through three application cases. Case 1: the color image inpainting experiment in Fig. 8 (a). Case 2: $A$ and $B$ are respectively DCT and Gaussian matrices, and $x_1$ and $x_2$ are strictly sparse vectors. Case 3 (robust sparse recovery in impulsive noise): $B = I$, $A$ is a Gaussian matrix, $x_1$ is a strictly sparse vector, and $x_2$ is symmetric $\alpha$-stable ($S\alpha S$) noise.

As the $\ell_q$ penalty has a flexible parametric form that adapts to different thresholding functions, we evaluate the effect of the values of $q_1$ and $q_2$ on the recovery performance in the three cases. Fig. 9 indicates that for strictly sparse signals a relatively small value of $q$ is favorable, while for non-strictly sparse signals a relatively large value of $q$ is favorable. For example, in Case 1, $x_1$ (the DCT coefficients of the image) is non-strictly sparse whilst $x_2$ is strictly sparse; thus, a relatively large value of $q_1$ and a relatively small value of $q_2$ should be used. In Case 3, $x_1$ is strictly sparse whilst $x_2$ (the $S\alpha S$ noise) is non-strictly sparse; thus, a relatively small value of $q_1$ and a relatively large value of $q_2$ would result in good performance.
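Since the scalar $\ell_q$ proximal (thresholding) operator has a simple closed form only for a few special values of $q$ (e.g., $q = 1/2, 2/3$), a generic implementation can evaluate it numerically. The sketch below (our own helper, using SciPy's `minimize_scalar`) compares the interior minimizer against $x = 0$; with $q = 1$ it reproduces soft thresholding, and as $q \to 0$ its behavior approaches hard thresholding, which is the flexibility exploited in this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_lq(z, lam, q, tol=1e-10):
    """Numerically evaluate the scalar lq proximal operator
       argmin_x 0.5*(x - z)^2 + lam*|x|^q,  0 < q <= 1,
    by comparing the best interior point with the candidate x = 0."""
    s, a = np.sign(z), abs(z)
    if a == 0.0:
        return 0.0
    f = lambda x: 0.5 * (x - a) ** 2 + lam * x ** q
    res = minimize_scalar(f, bounds=(tol, a), method="bounded")
    x = res.x if res.fun < f(0.0) else 0.0   # thresholding decision
    return s * x
```

For instance, `prox_lq(3.0, 1.0, 1.0)` returns approximately 2 (soft thresholding shrinks by $\lambda$), whereas for small $q$ the surviving coefficients are barely shrunk, at the price of a more abrupt (hard-like) threshold.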

III-D Sparse PCA

PCA is a useful tool for dimensionality reduction and feature extraction, which has been applied in virtually all areas of science and engineering, such as signal processing, machine learning, statistics, biology, medicine, finance, neurocomputing, and computer networks, to name just a few. In many real applications, sparse loading vectors are desired in PCA to enhance the interpretability of the principal components (PCs). For instance, in gene analysis, the sparsity of PCs can facilitate the understanding of the relation between whole gene microarrays and certain genes; in financial analysis, the sparsity of PCs implies fewer assets in a portfolio and thus helps reduce the trading costs. In these scenarios, it is desirable not only to achieve dimensionality reduction but also to reduce the number of explicitly used variables.

To achieve sparse PCA, ad hoc methods were first designed via thresholding the PC loadings [124], [125]. Then, more popular methods incorporating a sparsity-inducing mechanism into the traditional PCA formulation have been developed [126]. Let $X \in \mathbb{R}^{n \times p}$ be a centered data matrix, with $p$ and $n$ respectively the dimensionality and the size of the data. Employing a sparsity constraint in the traditional PCA model, the sparse PCA problem of finding an $m$-dimensional subspace can be formulated as

$$\max_{U \in \mathbb{R}^{p \times m}} \; \operatorname{tr}\left(U^{T} X^{T} X U\right) \quad \text{subject to} \quad U^{T} U = I, \;\; P(u_i) \le k_i \tag{26}$$

for $i = 1, \dots, m$, where $P(\cdot)$ is a sparsity-inducing penalty. Instead of solving (26) directly, a widely used greedy strategy is to approximate its solution by sequentially solving the following single-PC problem

$$\max_{u \in \mathbb{R}^{p}} \; u^{T} X^{T} X u \quad \text{subject to} \quad \|u\|_2 = 1, \;\; P(u) \le k. \tag{27}$$
An alternative to the sparsity-constrained formulation (27) is the following sparsity-penalized formulation

$$\max_{u \in \mathbb{R}^{p}} \; u^{T} X^{T} X u - \lambda P(u) \quad \text{subject to} \quad \|u\|_2 = 1, \tag{28}$$

where $\lambda > 0$ is a regularization parameter.

There exist numerous algorithms employing different penalty functions for the formulations (26)–(28) and their variants. In [126], the $\ell_1$ constraint is used for the formulation (26). In [127], an elastic-net regression based algorithm has been proposed. Then, a semidefinite relaxation method has been developed in [128], [129] for the formulation (27) with the $\ell_0$ constraint. Meanwhile, in [130], a regularized low-rank matrix approximation method has been designed with the consideration of different penalties, i.e., $\ell_0$, $\ell_1$ and SCAD. In [131], with $\ell_0$ and $\ell_1$ penalties, reformulations of (28) and its block variant have been solved via gradient algorithms. A generalized version of (28) with a nonconvex penalty has been considered in [132]. Moreover, an alternative discrete spectral formulation of the $\ell_0$-constrained problem (27) and an effective greedy approach have been presented in [133]. In [134], a unified alternating maximization framework for $\ell_0$ and $\ell_1$ constrained or penalized sparse PCA problems has been proposed.
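As one concrete algorithmic pattern for the $\ell_0$-constrained single-PC problem (27), a truncated power iteration alternates a power step with hard truncation to the $k$ largest-magnitude entries. The sketch below is our own illustration of this pattern (in the spirit of the greedy/iterative methods above), not the implementation of any specific cited work.

```python
import numpy as np

def truncated_power(S, k, n_iter=50):
    """Truncated power iteration for the l0-constrained single-PC
    sparse PCA problem: max v' S v  s.t.  ||v||_2 = 1, ||v||_0 <= k,
    where S is a (p x p) PSD matrix such as X'X."""
    p = S.shape[0]
    v = np.ones(p) / np.sqrt(p)              # dense initialization
    for _ in range(n_iter):
        u = S @ v                            # power step
        idx = np.argsort(np.abs(u))[:-k]     # all but the k largest
        u[idx] = 0.0                         # hard truncation
        v = u / np.linalg.norm(u)            # renormalize
    return v
```

On a matrix with a planted $k$-sparse leading eigenvector, the iteration locks onto the correct support and then reduces to an ordinary power iteration on that support.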

More recently, robust sparse PCA using a robust loss together with a sparsity penalty has been considered in [135]. Meanwhile, an ADMM based distributed sparse PCA algorithm has been proposed in [136], which covers several penalties including the log-sum and MCP penalties. Moreover, the Schatten-$p$ penalty has been used for structured sparse PCA in [137]. In addition, there also exist several other methods for constrained or penalized sparse PCA problems, e.g., [138]–[141].

IV Sparse Matrix Recovery

This section reviews nonconvex regularization based sparse matrix recovery problems, focusing on large covariance and inverse covariance matrix estimation, which are two fundamental problems in modern multivariate analysis [142]. Nowadays, the advance of information technology makes massive high-dimensional data widely available for scientific discovery, and effective statistical analysis of such data is becoming increasingly important. In many applications involving statistical analysis of high-dimensional data, estimating large covariance or inverse covariance matrices is necessary for effective dimensionality reduction or discriminant analysis. Such applications arise in economics and finance, bioinformatics, social networks, smart grid, climate studies, and health sciences [142]–[144]. In the high-dimensional setting, the dimensionality is often comparable to, or even larger than, the sample size. In these cases, the sample covariance matrix estimator performs poorly [145], and intrinsic structures such as sparsity can be exploited to improve the estimation of covariance and inverse covariance matrices [142], [146]–[148], [155].

IV-A Large Sparse Covariance Matrix Estimation

Consider a random vector $x \in \mathbb{R}^{p}$ with covariance matrix $\Sigma$; the objective is to estimate $\Sigma$ from $n$ observations $x_1, \dots, x_n$. Usually, rather than estimating $\Sigma$ directly, it is more favorable to first estimate the correlation matrix $\Gamma = \operatorname{diag}(\Sigma)^{-1/2}\, \Sigma \,\operatorname{diag}(\Sigma)^{-1/2}$. Then, given an estimate $\hat{\Gamma}$ of the correlation matrix, the corresponding estimate of the covariance matrix is $\hat{\Sigma} = \operatorname{diag}(S)^{1/2}\, \hat{\Gamma}\, \operatorname{diag}(S)^{1/2}$, where $S$ is the sample covariance matrix. This is because the correlation matrix has the same sparsity pattern as the covariance matrix but with all diagonal elements known to be one, and thus it can be estimated more accurately than the covariance matrix [149]–[151].

Given the sample correlation matrix $R$, the generalized thresholding estimator [148] solves the following problem

$$\min_{\Gamma} \;\; \frac{1}{2}\left\|\Gamma - R\right\|_F^2 + \sum_{i \ne j} P_{\lambda}\left(|\Gamma_{ij}|\right), \tag{29}$$

where $P_{\lambda}(\cdot)$ is a generalized penalty function for sparsity promotion as introduced in Section II. Note that the diagonal elements are not penalized, since the diagonal elements of a correlation (or covariance) matrix are always positive. The solution to (29) is an elementwise thresholding of the sample correlation matrix $R$, which can be efficiently computed as shown in Section II.
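A minimal sketch of this estimator with the soft-thresholding rule (the $\ell_1$ instance of generalized thresholding; the function name is ours): threshold the off-diagonal entries of the sample correlation matrix, keep the diagonal untouched, and rescale back to the covariance scale as described above.

```python
import numpy as np

def threshold_correlation(X, lam):
    """Soft-threshold the off-diagonal entries of the sample
    correlation matrix of data X (n x p, rows are observations),
    then rescale by the sample standard deviations."""
    S = np.cov(X, rowvar=False)              # sample covariance
    d = np.sqrt(np.diag(S))                  # sample std deviations
    R = S / np.outer(d, d)                   # sample correlation
    Rt = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    np.fill_diagonal(Rt, 1.0)                # diagonal not penalized
    return np.outer(d, d) * Rt               # back to covariance scale
```

Other rules from Section II (hard, SCAD, MCP thresholding) drop in by replacing the soft-thresholding line, since the solution of (29) is elementwise.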

The thresholding estimator (29) has good theoretical properties: it is consistent over a large class of (approximately) sparse covariance matrices [148]. However, in practical finite-sample applications, such an estimator is not always positive-definite, although it converges to a positive-definite limit in the asymptotic setting [149], [151]. To simultaneously achieve sparsity and positive-definiteness, a positive-definite constraint can be added to (29) as in [152]