I Introduction
Remote sensing image processing [1] is a fast-moving area of science. Data acquired from satellite or airborne sensors and converted into useful information (land cover maps, target maps, mineral compositions, biophysical parameters) have nowadays entered many application fields: efficient and effective methods for such conversion are therefore needed. This is particularly true for data sources such as hyperspectral and very high resolution images, whose data volumes are large and whose structure is complex: for this reason, many traditional methods perform poorly when confronted with this type of data. The problem is even more exacerbated when dealing with multisource and multimodal data, representing different views of the land being studied (different frequencies, seasons, angles, …). This created the need for more advanced techniques, often based on statistical learning [2].
Among such methodologies, regularized methods are certainly the most successful. Using a regularizer imposes constraints on the class of functions to be preferred during the optimization of the model, and can thus be beneficial if we know which properties the solution should have. Most often, regularizers are used to favour simpler functions over very complex ones, in order to avoid overfitting the training data: in classification, the support vector machine uses this form of regularization
[3, 4], while in regression examples can be found in kernel ridge regression or Gaussian processes
[5]. But smoothness-promoting regularizers are not the only ones that can be used: depending on the properties one wants to promote, other choices are becoming more and more popular. A first success story is the use of Laplacian regularization [6]: by enforcing smoothness in the local structure of the data, one can promote the fact that points that are similar in the input space must have a similar decision function (Laplacian SVM [7, 8] and dictionary-based methods [9, 10]
) or be projected close after a feature extraction step (Laplacian eigenmaps
[11] and manifold alignment [12]). Another popular property to be enforced, on which we will focus in the rest of this paper, is sparsity [13]. Sparse models have only a part of the initial coefficients active (i.e. nonzero) and are thus compact. This is desirable in classification when the dimensionality of the data is very high (e.g. when adding many spatial filters [14, 15] or using convolutional neural networks
[16, 17]) or in sparse coding, when we need to find a relevant dictionary to express the data [18]. Even though nonsparse models can work well in terms of overall accuracy, they still store information about the training samples to be used at test time: if such information is very high dimensional and the number of training samples is large, the memory requirements, the model complexity and – as a consequence – the execution time are strongly affected. Therefore, when processing next-generation, large data using models generating millions of features [19, 20], sparsity is very much needed to make models portable while remaining accurate. For this reason, sparsity has been extensively used in i) spectral unmixing [21], where a large variety of algorithms is deployed to select endmembers as a small fraction of the existing data [18, 22, 23], ii) image classification, where sparsity is promoted to obtain portable models, either at the level of the samples used in reconstruction-based methods [24, 25] or in feature selection schemes
[26, 15, 27], and iii) in more focused applications such as 3D reconstruction from SAR [28], phase estimation
[29] or pansharpening [30]. A popular approach to recover sparse features is to solve a convex optimization problem involving the $\ell_1$ norm (or Lasso) regularization [31, 32, 33]. Proximal splitting methods have been shown to be highly effective in solving sparsity-constrained problems [34, 35, 36]. The Lasso formulation, based on a penalty on the $\ell_1$ norm of the model, has been shown to be an efficient shrinkage and sparse model selection method in regression [37, 38, 39]. However, the Lasso regularizer is known to promote biased estimators, leading to suboptimal classification performance when strong sparsity is promoted [40, 41]
. A way out of this dilemma between sparsity and performance is to retrain a classifier, this time nonsparse, after the feature selection has been performed with Lasso
[15]. Such a scheme works, but at the price of training a second model, thus leading to extra computational effort and to the risk of suboptimal solutions, since we are training a model with the features that were considered optimal by another. In unmixing, synthetic examples also show that the Lasso regularization is not the one leading to the best abundance estimation [42]. In recent years, there has been a trend in the study of unbiased sparse regularizers. These regularizers, typically the $\ell_0$ and $\ell_p$ ($0 < p < 1$) pseudonorms and the Log Sum Penalty (LSP [40]), can solve the dilemma between sparsity and performance, but are nonconvex and therefore cannot be handled by off-the-shelf convex optimization tools. Therefore, such regularizers have until now received little attention in the remote sensing community. A handful of papers using such norms are found in the field of spectral unmixing [43, 42, 44], where authors consider nonnegative matrix factorization solutions; in the modelling of electromagnetic induction responses, where the model parameters were estimated by regularized least squares estimation [45]; in feature extraction using deconvolutional networks [46]
and in structured prediction, where authors use a nonconvex sparse classifier to provide posterior probabilities to be used in a graph cut model
[47]. In all these studies, the nonconvex regularizer outperformed the Lasso while still providing sparse solutions. In this paper, we give a critical explanation and theoretical motivations for the success of sparse regularized classification, with a focus on nonconvex methods. By comparing nonconvex regularization with other traditional regularizers (ridge $\ell_2$ and Lasso $\ell_1$), we advocate its use in remote sensing image processing tasks: nonconvex optimization marries the advantages of accuracy and sparsity in a single model, without the need to unbias the model in two steps or to reduce the level of sparsity to increase performance. We also provide a freely available toolbox for the interested readers who would like to enter this growing field of investigation.
The remainder of this paper is organized as follows: in Section II, we present a general framework for regularized remote sensing image processing and discuss different forms of convex and nonconvex regularization. We also present the proposed optimization algorithm. Then, in Section III, we apply the proposed nonconvex regularizers to the problem of multi- and hyperspectral image classification: we present the specific data term for classification and study it in synthetic and real examples. In Section IV, we apply our framework to the problem of linear unmixing, present the specific data term for unmixing and study the behaviour of the different regularizers in simulated examples involving true spectra from the USGS library. Section V concludes the paper.
II Optimization and nonconvex regularization
In this Section, we give an intuitive explanation of regularized models. We first introduce the general problem of regularization and then explore convex and nonconvex regularization schemes, with a focus on sparsity-inducing regularizers. Finally, we present the optimization algorithms used to solve nonconvex regularized problems, with an emphasis on proximal splitting methods such as GIST [48].
II-A Optimization problem
Regularized models address the following optimization problem:
$$\min_{\mathbf{w}} \; F(\mathbf{w}) = L(\mathbf{w}) + \lambda \Omega(\mathbf{w}) \qquad (1)$$
where $L$ is a smooth function (with Lipschitz gradient), $\lambda \ge 0$ is a regularization parameter and $\Omega$ is a regularization function. This kind of problem is extremely common in data mining, denoising and parameter estimation.
$L$ is often an empirical loss that measures the discrepancy between a model and a dataset containing real-life observations.
The regularization term is added to the optimization problem in order to promote a simple model, which has been shown to lead to a better estimation [49]. All the regularization terms discussed in this work are of the form:
$$\Omega(\mathbf{w}) = \sum_{j} g(|w_j|) \qquad (2)$$
where $g$ is a monotonically increasing function. This means that the complexity of the model can be expressed as the sum of the complexities of each feature in the model.
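As a concrete illustration, the separable form above can be sketched in a few lines; the penalty definitions follow Table I, and the LSP parameter `theta` and the exponent `p` are chosen arbitrarily for the example.

```python
import numpy as np

# Separable regularizers Omega(w) = sum_j g(|w_j|), a minimal sketch:
# ridge g(t) = t^2, lasso g(t) = t, LSP g(t) = log(1 + t/theta),
# and the l_p penalty g(t) = t^p with 0 < p < 1.

def ridge(w):
    return np.sum(np.abs(w) ** 2)

def lasso(w):
    return np.sum(np.abs(w))

def lsp(w, theta=1.0):
    return np.sum(np.log(1.0 + np.abs(w) / theta))

def lp(w, p=0.5):
    return np.sum(np.abs(w) ** p)

w = np.array([0.0, -2.0, 0.5])
print(ridge(w), lasso(w), lsp(w), lp(w))
```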
The specific form of the regularizer changes the assumptions made on the model. In the following, we discuss several classes of regularizers of increasing complexity: differentiable, nondifferentiable (i.e. sparsity-inducing) and finally both nondifferentiable and nonconvex. A summary of all the regularization terms investigated in this work is given in Table I, along with an illustration of the regularization as a function of the value of the coefficient (Fig. 1).
Table I: regularization terms investigated in this work, expressed through $g(t)$ with $t = |w_j|$.

Ridge, $\ell_2$ norm: $g(t) = t^2$
Lasso, $\ell_1$ norm: $g(t) = t$
Log sum penalty (LSP): $g(t) = \log(1 + t/\theta)$
$\ell_p$ pseudonorm: $g(t) = t^p$, with $0 < p < 1$
II-B Nonsparse regularization
One of the most common regularizers is the squared $\ell_2$ norm of the model $\mathbf{w}$, i.e., $\Omega(\mathbf{w}) = \|\mathbf{w}\|_2^2$. This regularization penalizes large values in the vector $\mathbf{w}$ but is isotropic, i.e. it does not promote a given direction for the vector. This regularization term is also known as $\ell_2$, quadratic or ridge regularization, and is commonly used in linear regression and classification. For instance, logistic regression is often regularized with a quadratic term. Also note that Support Vector Machines are regularized using the $\ell_2$ norm of the function in the Reproducing Kernel Hilbert Space [50].

II-C Sparsity-promoting regularization
In some cases, not all the features or observations are of interest for the model. In order to get a better estimation, one wants the vector $\mathbf{w}$ to be sparse, i.e. to have several components exactly $0$. For linear prediction, sparsity in the model implies that not all features are used for the prediction (note that zero coefficients might also appear in the solution of a nonsparse model, but the regularizer itself does not promote their appearance). The features showing a nonzero value in $\mathbf{w}$ are then “selected”. Similarly, when estimating a mixture, one can suppose that only a few materials are present, which again implies sparsity of the abundance coefficients.
In order to promote sparsity in $\mathbf{w}$, one needs a regularization term that increases when the number of active components grows. The obvious choice is the $\ell_0$ pseudonorm, which directly returns the number of nonzero coefficients in $\mathbf{w}$. Nevertheless, the $\ell_0$ term is nonconvex and nondifferentiable, and cannot be optimized exactly unless all the possible subsets are tested. Despite recent works aiming at solving this problem directly via discrete optimization [51], this approach is still computationally intractable even for medium-sized problems. Greedy optimization methods have been proposed to solve this kind of problem and have led to efficient algorithms such as Orthogonal Matching Pursuit (OMP) [52] or Orthogonal Least Squares (OLS) [53]. However, one of the most common approaches to promote sparsity without resorting to the $\ell_0$ regularizer is to use the $\ell_1$ norm instead. This approach, also known as the Lasso in linear regression, has been widely used in compressed sensing in order to estimate with precision a few components in a large sparse vector.
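The greedy strategy behind OMP can be sketched in a few lines (a simplified illustration, not the implementation of [52]): at each step, select the dictionary column most correlated with the current residual, then refit the coefficients by least squares on the selected support.

```python
import numpy as np

# Orthogonal Matching Pursuit sketch: greedily approximate the
# l0-constrained problem without testing all subsets.

def omp(X, y, k):
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))   # best matching atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef          # orthogonal refit
    w = np.zeros(X.shape[1])
    w[support] = coef
    return w

# Synthetic 2-sparse recovery problem (sizes are illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
w_true = np.zeros(8); w_true[[1, 4]] = [3.0, -2.0]
y = X @ w_true
print(omp(X, y, 2))  # a 2-sparse approximation of w_true
```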
We now discuss the intuition of why a regularization term such as the $\ell_1$ norm promotes sparsity. The reason lies in its nondifferentiability at $0$, shown in Fig. 1 (dashed blue line). For the sake of readability, we suppose here that the regularizer is convex, but the intuition is the same and the results can be generalized to the nonconvex functions presented in the next section. For a more illustrative example, we use a 1D comparison between the $\ell_2$ and $\ell_1$ regularizers (Fig. 2).

When both the data and regularization terms are differentiable, a stationary point $\mathbf{w}^\star$ has the following property:
$$\nabla L(\mathbf{w}^\star) + \lambda \nabla \Omega(\mathbf{w}^\star) = 0 \qquad (3)$$
In other words, the gradients of the two functions have to cancel each other exactly. This holds everywhere for the $\ell_2$ regularizer, and for the $\ell_1$ as well, with the exception of $w_j = 0$. If we consider the $\ell_2$ regularizer as an example (left plot in Fig. 2), we see that each point has a specific gradient, corresponding to the tangent at that point (e.g. the red dashed line). The stationary point is reached in this case at the position marked by the black line in the left plot of Fig. 2.

When the second term in Eq. (3) is not differentiable (as is the case at $0$, presented in the right plot of Fig. 2), the gradient is not unique anymore and one has to use subgradients and subdifferentials. For a convex function $g$, a subgradient at $u$ is a vector $v$ such that $g(z) \ge g(u) + v^\top (z - u)$ for all $z$, i.e. it is the slope of a linear function that remains below the function. In 1D, a subgradient defines a line touching the function at the nondifferentiable point (in the case of Fig. 2, at $0$) but staying below the function everywhere else, e.g. the black and green dot-dashed lines in Fig. 2 (right). The subdifferential is the set of all the subgradients that respect the minorization relation above. The subdifferential is illustrated in Fig. 2 by the area in light blue, which contains all possible subgradients.
Now the optimality condition cannot rely on an equality, since the subgradient is not unique, which leads to the following condition:
$$0 \in \nabla L(\mathbf{w}^\star) + \lambda \, \partial \Omega(\mathbf{w}^\star) \qquad (4)$$
This is very interesting in our case, because this condition is much easier to satisfy than Eq. (3). Indeed, we just need a single subgradient in the whole subdifferential that cancels the gradient $\nabla L$. In other words, only one of the possible straight lines in the blue area is needed to cancel the gradient, thus making the chances for a null coefficient much higher. For instance, when using the $\ell_1$ regularization, the subdifferential of $|w_j|$ at $w_j = 0$ is the set $[-1, 1]$. When $\lambda$ becomes large enough, it exceeds all the components of the gradient of $L$ and the only solution verifying the condition is the null vector $\mathbf{w} = \mathbf{0}$.
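This mechanism can be verified numerically on the scalar problem $\min_w \frac{1}{2}(w - v)^2 + \lambda |w|$: the optimality condition $0 \in (w - v) + \lambda \, \partial |w|$ yields the well-known soft-thresholding operator, which returns exactly zero whenever $|v| \le \lambda$ (a minimal sketch):

```python
import numpy as np

# Soft-thresholding: the closed-form solution of the scalar l1 problem.
# Inputs with magnitude below the threshold lam are set exactly to zero,
# which is the sparsity mechanism discussed in the text.

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

print(soft_threshold(np.array([0.3, -2.0, 1.5]), 0.5))
```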
The $\ell_1$ regularization has been largely studied because it is convex, meaning it avoids the problem of local minima, and many efficient optimization procedures exist to solve it (e.g. LARS [54], Forward-Backward Splitting [55]). But the sparsity of the solution obtained with $\ell_1$ regularization often comes at a cost in terms of generalization. While theoretical studies show that under some constraints the Lasso can recover the true relevant variables and their signs, the solution obtained will be biased toward $0$ [56]. Figure 3 illustrates this bias on a two-class toy dataset: the $\ell_1$ decision function (red line) is biased with respect to the Bayes decision function (blue line). In this case, the bias corresponds to a rotation of the separating hyperplane. In practice, one can deal with this bias by re-estimating the model on the selected subset of variables using an isotropic norm (e.g. $\ell_2$) [15], but this requires solving a second optimization problem. The approach we propose in this paper is to use a nonconvex regularization term that still promotes sparsity, while minimizing the aforementioned bias. To this end, we present nonconvex regularization in the next section.

II-D Nonconvex regularization
In order to promote more sparsity while reducing the bias, several works have looked at nonconvex, yet continuous regularization. Such regularizers have been proposed for instance in statistical estimation [57], compressed sensing [40]
or in machine learning
[41]. Popular examples are the Smoothly Clipped Absolute Deviation (SCAD) [57], the Minimax Concave Penalty (MCP) [58] and the Log Sum Penalty (LSP) [40] considered below (see [48] for more examples). In the following, we investigate two of them in more detail: the $\ell_p$ pseudonorm with $0 < p < 1$ and the LSP, both also displayed in Fig. 1. All the nonconvex regularizers above share some characteristics that make them of interest in our case. First (as the $\ell_0$ pseudonorm and $\ell_1$ norm), they all are nondifferentiable at $0$, which – as we have seen in the previous section – promotes sparsity. Second, they are all concave on the positive orthant, which limits the bias: their gradient decreases for large values of the coefficients, limiting the shrinkage (as compared to the $\ell_1$ norm, whose gradient is constant for $w_j > 0$). Intuitively, this means that with a nonconvex regularizer it becomes more difficult for large coefficients to be shrunk toward $0$, because their gradient is small. On the contrary, the $\ell_1$ norm treats all coefficients equally and applies the same attraction toward the stationary point to all of them. The decision functions for the LSP and $\ell_p$ regularizers are shown in Fig. 3 and are much closer to the actual (true) Bayes decision function.
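The shrinkage argument can be checked numerically: for $w > 0$ the $\ell_1$ penalty contributes a constant derivative $\lambda$, while the LSP derivative $\lambda / (\theta + w)$ vanishes as the coefficient grows, so large coefficients are barely shrunk. The values of `lam` and `theta` below are illustrative.

```python
import numpy as np

# Derivative of the penalty = strength of the shrinkage applied to w > 0.
lam, theta = 1.0, 1.0

def d_l1(w):
    return lam * np.ones_like(w)          # constant pull toward zero

def d_lsp(w):
    return lam / (theta + w)              # derivative of lam*log(1 + w/theta)

w = np.array([0.1, 1.0, 10.0])
print(d_l1(w))   # same shrinkage for every coefficient
print(d_lsp(w))  # shrinkage decays as the coefficient grows
```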
II-E Optimization algorithms
Thanks to the differentiability of the data term $L$, the optimization problem can be solved using proximal splitting methods [55]. The convergence of those algorithms to a global minimum is well studied in the convex case. For nonconvex regularization, recent works have proved that proximal methods can be used when a simple closed-form solution of the proximity operator of the regularizer can be computed [48]. Recent works have also studied the convergence of proximal methods with nonconvex regularization and proved convergence to a local stationary point for a large family of loss functions [59]. In this work, we used the General Iterative Shrinkage and Thresholding (GIST) algorithm proposed in [48]. This approach is a first-order method that consists in iteratively linearizing $L$ in order to solve a very simple proximal problem at each iteration. At each iteration, one computes the model update by solving
$$\mathbf{w}^{k+1} = \arg\min_{\mathbf{w}} \; L(\mathbf{w}^k) + \nabla L(\mathbf{w}^k)^\top (\mathbf{w} - \mathbf{w}^k) + \frac{t_k}{2} \|\mathbf{w} - \mathbf{w}^k\|^2 + \lambda \Omega(\mathbf{w}) \qquad (5)$$
When $t_k$ is larger than the Lipschitz constant of $\nabla L$, the cost function above is a majorization of the objective, which ensures a decrease of the objective function at each iteration. Problem (5) can be reformulated as a proximity operator
$$\mathbf{w}^{k+1} = \operatorname{prox}_{\frac{\lambda}{t_k} \Omega} \left( \mathbf{w}^k - \frac{1}{t_k} \nabla L(\mathbf{w}^k) \right) \qquad (6)$$
which can be seen as a gradient step w.r.t. $L$ followed by a proximal operator at each iteration. Note that the efficiency of a proximal algorithm depends on the existence of a simple closed-form solution for the proximity operator in Eq. (6). Luckily, numerous such operators exist in the convex case (detailed list in [55]) and some nonconvex proximal operators can be computed for the regularizers used in our work (see [48, Appendix 1] for LSP and [60, Eq. 11] for $\ell_p$ with $0 < p < 1$). Note that efficient methods estimating the Hessian matrix [61, 62] also exist, as well as a wide range of methods based on DC programming, which have been shown to work very well in practice [62, 63] and can handle the general case of the $\ell_0$ pseudonorm (see [64] for an implementation).
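A minimal GIST-style iteration can be sketched as follows (this is an illustration, not the authors' implementation): linearize the smooth loss at $\mathbf{w}^k$, take a gradient step of size $1/t$, apply the proximal operator of the penalty, and backtrack on $t$ until the quadratic majorization of the smooth loss holds. For readability the prox used here is soft-thresholding ($\ell_1$); a nonconvex penalty would only change the prox.

```python
import numpy as np

def gist(grad_L, loss_L, prox, w0, lam, t0=1.0, n_iter=100, eta=2.0):
    w = w0.copy()
    for _ in range(n_iter):
        t = t0
        g = grad_L(w)
        obj = loss_L(w)
        while True:
            w_new = prox(w - g / t, lam / t)
            gap = w_new - w
            # descent-lemma check: accept once t exceeds the local curvature
            if loss_L(w_new) <= obj + g @ gap + 0.5 * t * np.sum(gap ** 2):
                break
            t *= eta  # backtracking: shrink the step
        w = w_new
    return w

# Tiny least-squares example: min_w 0.5*||Xw - y||^2 + lam*||w||_1
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)
prox_l1 = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

w_hat = gist(grad, loss, prox_l1, np.zeros(2), lam=0.5)
print(w_hat)  # the second, uninformative coefficient is driven to zero
```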
Finally, when one wants to perform variable selection using the $\ell_0$ pseudonorm as regularizer, the exact solution of the combinatorial problem is not always necessary. As mentioned above, greedy optimization methods have been proposed to solve this kind of problem and have led to efficient algorithms such as Orthogonal Matching Pursuit (OMP) [52] or Orthogonal Least Squares (OLS) [53]. In this paper, we will not consider these methods in detail, but they have been shown to perform well on least squares minimization problems.
III Classification with feature selection
In this section, we tackle the problem of sparse classification. Through a toy example and a series of real data experiments, we study the interest of nonconvex regularization.
III-A Model
The model we consider in the experiments is a simple linear classifier of the form $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, where $\mathbf{w}$ is the normal vector to the separating hyperplane and $b$ is a bias term. In the binary case ($y \in \{-1, +1\}$), the estimation is performed by solving the following regularized optimization problem:
$$\min_{\mathbf{w}, b} \; \sum_{i=1}^{n} \ell\big(y_i, \mathbf{w}^\top \mathbf{x}_i + b\big) + \lambda \Omega(\mathbf{w}) \qquad (7)$$
where $\Omega$ is one of the regularizers in Table I and $\ell$ is a classification loss that measures the discrepancy between the prediction $f(\mathbf{x}_i)$ and the true label $y_i$. Hereafter, we will use the squared hinge loss $\ell(y, f(\mathbf{x})) = \max(0, 1 - y f(\mathbf{x}))^2$.
When dealing with multiclass problems, we use a One-Against-All procedure, i.e. we learn one linear function $f_c$ per class and then predict the final class for a given observed pixel as $\hat{y} = \arg\max_c f_c(\mathbf{x})$. In practice, this leads to an optimization problem similar to Eq. (7), where we need to estimate a matrix $\mathbf{W}$ containing one coefficient vector per class. The number of coefficients to be estimated is therefore the size of the input space multiplied by the number of classes $C$.
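The One-Against-All prediction rule and the squared hinge loss can be sketched as follows; shapes and values are illustrative, not the authors' code.

```python
import numpy as np

def squared_hinge(y, f):
    # squared hinge loss max(0, 1 - y*f)^2 used for each binary problem
    return np.maximum(0.0, 1.0 - y * f) ** 2

def ova_predict(X, W, b):
    # X: (n, d) samples, W: (d, C) one weight vector per class, b: (C,)
    scores = X @ W + b
    return np.argmax(scores, axis=1)   # class with the maximal score

X = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.zeros(2)
print(ova_predict(X, W, b))        # -> [0 1]
print(squared_hinge(1.0, 0.5))     # -> 0.25
```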
III-B Toy example
First, we consider in detail the toy example in Fig. 3: the data are 20-dimensional, where the first two dimensions are discriminative (they correspond to those plotted in Fig. 3), while the others are not (they are generated as Gaussian noise). The correct solution is therefore to assign nonzero coefficients to the two discriminative features and zero coefficients to all the others.
Figure 3 shows the classifiers estimated for the smallest value of the regularization parameter that leads to the correct sparsity level (2 features selected). This ensures that we have selected the proper components, while minimizing the bias for all methods. This also illustrates that the $\ell_1$-regularized classifier has a stronger bias (i.e. provides a decision function further away from the optimal Bayes classifier) than the classifiers regularized by nonconvex functions.
Let us now focus on the effect of the regularization term and of its strength, defined by the regularization parameter $\lambda$ in Eq. (7). Figure 4 illustrates a regularization path, i.e., all the solutions obtained by increasing the regularization parameter. (A “regularization path” for the Lasso is generally computed using homotopy algorithms [65]; however, experiments show that the computational complexity of the complete Lasso path remains high for high-dimensional data, so in our experiments we used an approximate path, i.e., a discrete sampling of $\lambda$ values along the path.) Each line corresponds to one input variable, and those with the largest coefficients (in color) are the discriminative ones. Considering the $\ell_2$ regularization (top left panel in Fig. 4), no sparsity is achieved and, even if the two correct features have the largest coefficients, the solution is not compact. The $\ell_1$ solution (top right panel) is correctly sparse at the value marked by the vertical black line, where all the coefficients but two are $0$, but the smallest coefficient is biased (it is smaller than expected from the Bayes classifier, represented by the horizontal dashed lines). The two nonconvex regularizers (bottom row of Fig. 4) show the correct features selected, but a smaller bias: the coefficients retrieved are closer to the optimal ones of the Bayes classifier. Moreover, the nonzero coefficients stay close to the correct values for a wider range of regularization parameters and then drop directly to zero: the nonconvex models either retain too few features or retain few features with the right coefficients, contrary to the $\ell_1$, which can retrieve sparse solutions with wrong coefficients, as can be seen to the right of the vertical black line of the regularization path.

III-C Remote sensing images
Data. The real datasets considered are three very high resolution remote sensing images.

Thetford Mines. The first dataset is acquired over the Thetford Mines site in Québec, Canada, and contains two data sources: a VHR color image (three channels, red-green-blue) at 20 cm resolution and a long wave infrared (LWIR, 84 channels) hyperspectral image at approximately 1 m resolution. (The data were proposed as the Data Fusion Contest 2014 [66] and are available for download on the IADF TC website, http://www.grssieee.org/community/technicalcommittees/datafusion/.) The LWIR images are resampled by a factor 5 to match the resolution of the RGB data, leading to a single datacube. The RGB composite, band 1 of the LWIR data and the train/test ground truths are provided in Fig. 5.

Houston. The second image is a hyperspectral CASI image acquired over Houston. A field survey is also available (labeled pixels divided into 14 land use classes). A LiDAR DSM was also available and was used as an additional feature. (The data were proposed as the Data Fusion Contest 2013 [67] and are available for download on the IADF TC website, http://www.grssieee.org/community/technicalcommittees/datafusion/.) The CASI image was corrected with histogram matching for a large shadowed part on the right side (as in [27]) and the DSM was detrended by a 3 m trend in the left-right direction. Image, DSM and ground truth are illustrated in Fig. 6.

Zurich Summer. The third dataset is a series of QuickBird images acquired over the city of Zurich, Switzerland, in August 2002. (The dataset is freely available at https://sites.google.com/site/michelevolpiresearch/data/zurichdataset.) The data have been pansharpened at 0.6 m spatial resolution and a dense ground truth is provided for each image. Eight classes are depicted: buildings, roads, railways, water, swimming pools, trees, meadows and bare soil. More information on the data can be found in [68]. To reduce computational complexity, we extracted a set of superpixels using the Felzenszwalb algorithm [69], which reduced the number of samples per image from the full pixel grid to a few thousand superpixels. An example of the superpixels extracted on image tile #3 is given in Fig. 7.
Fig. 5 panels: (a) RGB; (b) LWIR band 1; (c) GT training; (d) GT test.
Setup. For all datasets, contextual features were added to the spectral bands in order to improve the geometric quality of the classification [14]: morphological and texture filters were added, following the list in [15]. For each image, the most effective filters were extracted:

For the Thetford Mines dataset, the filters were extracted from the RGB image and from a normalized ratio between the red band and the average of the LWIR bands (following the strategy of the winners of the 2014 Data Fusion Contest [66]), which approximates a vegetation index. Given the extremely high resolution of the dataset, the filters were computed with sizes in the range {7, …, 23}, leading to 100 spatial features.

For the Houston case, the filters were calculated on both the first principal component projections extracted from the hyperspectral image and the DSM. Given the coarser resolution of this dataset, the convolution sizes of the local filters span a smaller range of pixels. This leads to 240 spatial features.

For the Zurich Summer dataset, spatial filters were computed directly on the four spectral bands, plus the NDVI and NDWI indices. Then, the average, minimum, maximum and standard deviation values per superpixel were extracted as feature values. Since the spatial resolution is comparable to that of the Houston dataset, the same sizes of convolution filters are used, leading to a total of 360 spatial features.
The joint spatial-spectral input space is obtained by stacking the original image bands with the spatial filters above; its dimension is therefore the number of original bands plus the 100, 240 and 360 spatial features listed above for the Thetford Mines, Houston and Zurich Summer cases, respectively.
Regarding the classifier, we considered the linear classifier of Eq. (7) with the squared hinge loss.

In the Thetford Mines case, we use a fixed number of labeled pixels per class. Given the spatial resolution of the image and the labeled points available in the training ground truth, this represents only approximately 5% of the labeled pixels in the training image. For testing, we use the entire test ground truth, which is spatially disconnected from the training one (except for the class ‘soil’, see Fig. 5) and carries millions of labeled pixels.

In the Houston case, we also perform pixel classification. All the models are trained with 60 labeled pixels per class, randomly selected, and all the remaining labeled pixels are considered as the test set. We report performances on the entire test set provided in the Data Fusion Contest 2013, which is spatially disconnected from the training set (Fig. 6).

For the Zurich Summer data, we deal with superpixels and 20 separate images. We used images #1–15 to train the classifier and then tested on the five remaining images (Fig. 8). Given the complexity of the task (not all the images contain all the classes, and the urban fabric depicted varies from scene to scene), we used 90% of the available superpixels in the training images. All the labeled superpixels in the test images are used as the test set.
Regarding the regularizers, we compare the four regularizers of Table I ($\ell_2$, $\ell_1$, Log sum penalty and $\ell_p$ with $0 < p < 1$) and study the joint behavior of accuracy and sparsity along a regularization path, i.e. for a range of values of $\lambda$ sampled in 18 steps. For each step, the experiment was repeated ten times with different train/test sets (each run with the same training samples for all regularizers) and the average Kappa and number of active coefficients are reported in Fig. 9. Also note that we report the total number of coefficients in the multiclass case, which is equal to the number of features multiplied by the number of classes, plus one additional coefficient per class (bias term): the model therefore estimates $(d+1) \cdot C$ coefficients for an input space of dimension $d$ and $C$ classes.
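The approximate regularization path described above can be sketched as follows, warm-starting each solve from the previous solution; the inner solver here is plain ISTA for the $\ell_1$ case, on synthetic data with two relevant features (all sizes and values are illustrative):

```python
import numpy as np

def ista(X, y, lam, w0, n_iter=500):
    # plain ISTA for min_w 0.5*||Xw - y||^2 + lam*||w||_1
    t = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = w0.copy()
    for _ in range(n_iter):
        v = w - X.T @ (X @ w - y) / t      # gradient step
        w = np.sign(v) * np.maximum(np.abs(v) - lam / t, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
w_true = np.zeros(10); w_true[:2] = [2.0, -1.5]     # two relevant features
y = X @ w_true + 0.01 * rng.standard_normal(40)

w = np.zeros(10)
for lam in np.logspace(2, -2, 18):                  # 18 steps along the path
    w = ista(X, y, lam, w)                          # warm start
    n_active = int(np.sum(w != 0))                  # sparsity at this lambda
print(n_active)
```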
Fig. 9 rows: Thetford Mines, Houston, Zurich Summer.
Results. The results are reported in Fig. 9, comparing the regularization paths for the four regularizers on the three datasets presented above. The graphs can be read as a ROC curve: the most desirable situation is a classifier with high accuracy and few active features, i.e., a score close to the top-left corner. The $\ell_2$ model shows no variation on the sparsity axis (all the coefficients are active) and very little variability on the accuracy axis: it is therefore represented by a single green dot. It is remarkably accurate, but is the least compact model, since all its coefficients are active. Employing the $\ell_1$ regularizer (red line), as is mainly done in the literature on sparse classification, achieves a sharp decrease in the number of active coefficients, but at the price of a steep decrease in the performance of the classifier. When using 100 active coefficients, the $\ell_1$ model suffers a 20% drop in performance, a trend observed in all the experiments reported.
Using the nonconvex regularizers provides the best of both worlds: the $\ell_p$ regularizer (black line) in particular, but also the Log sum penalty regularizer (blue line), achieve improvements of about 15–20% with respect to the $\ell_1$ model. More stable results along the regularization path are also observed: the nonconvex regularizers are less biased than the $\ell_1$ norm in classification and achieve competitive performances with respect to the (nonsparse) $\ell_2$ model with a fraction of the features (around 1–2%). Note that the models of all experiments were initialized with the zero vector. This is sensible for the nonconvex problem, since all the regularizers discussed in the paper (even $\ell_2$) tend to shrink the model toward this point. By initializing at $\mathbf{0}$ for nonconvex regularization, we simply promote a local solution not too far from this neutral point. In other words, one can see the initialization as an additional regularization. Moreover, the experiments show that the nonconvexity leads to state-of-the-art performance.
IV Sparse linear unmixing
In this section, we express the sparse linear unmixing problem in the same optimization framework as Eq. (7) and discuss the advantage of using nonconvex optimization. The performances of the $\ell_2$, $\ell_1$ and the nonconvex $\ell_p$ and LSP regularization terms are then compared on a simulated example using real reflectance spectra (as in [18]).
IV-A Model
Sparse linear unmixing can be expressed as the following optimization problem:
$$\min_{\mathbf{a} \ge 0} \; \frac{1}{2} \|\mathbf{x} - \mathbf{D}\mathbf{a}\|^2 + \lambda \Omega(\mathbf{a}) \qquad (8)$$
where $\mathbf{x}$ is an observed noisy spectrum and $\mathbf{D}$ is a matrix containing a dictionary of spectra (typically a spectral library). This formulation adds a positivity constraint on the abundance vector $\mathbf{a}$ w.r.t. problem (7). In practice, (8) can be reformulated as the following unconstrained optimization problem:
(9) 
where is the indicator function that has value when one of the component of is and value when it is in the positive orthant. By supposing that is equivalent to , we can gather the last two terms into , thus leading to a problem similar to Eq. (7). All the optimization procedures discussed above can therefore be used for this reformulation, as long as the proximal operator w.r.t. can be computed efficiently. The proximal operator for all the regularization terms in Table I with additional positivity constraints can be obtained by an orthogonal projection on the positive orthant followed by the proximal of :
(10) 
where is taken componentwise. This shows that we can use the exact same algorithm as in the classification experiments of Section III, since we have an efficient proximal operator.
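For the $\ell_1$ penalty, Eq. (10) reduces to a one-line operation: project onto the positive orthant, then soft-threshold. A minimal numpy sketch (function name is ours, not the toolbox's):

```python
import numpy as np

def prox_nonneg_l1(w, lam):
    # Eq. (10) for the l1 penalty: orthogonal projection on the positive
    # orthant, followed by soft thresholding. Since the projected vector is
    # nonnegative, the two steps collapse to max(w_i - lam, 0).
    return np.maximum(np.maximum(w, 0.0) - lam, 0.0)

w = np.array([1.5, 0.2, -0.7])
print(prox_nonneg_l1(w, 0.5))  # -> [1. 0. 0.]
```

Components that are negative, or positive but below the threshold `lam`, are set to exactly zero, which is what produces sparsity along the regularization path.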
The solution $\mathbf{w}$ of Eq. (8) is expected to have only a few nonzero components, so one might want to promote more sparsity with a nondifferentiable regularization term. Therefore, in the following we investigate the use of nonconvex regularization for linear unmixing. We focus on problem (8), but a large part of the unmixing literature works with an additional sum-to-one constraint on the coefficients. This additional prior can sometimes reflect a physical measure and adds some information to the optimization problem. In our framework, this constraint can make the direct computation of the proximal operator nontrivial. In this case it is more interesting to use multiple splittings instead of one, and to rely on other algorithms such as generalized FBS [70] or ADMM, which has already been used for remote sensing applications [71].
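When the positivity and sum-to-one constraints are imposed together (and no other penalty is present), the proximal operator becomes the Euclidean projection onto the probability simplex, which admits the classical sort-and-threshold solution. A sketch of this projection (our own illustration, not part of the paper's toolbox):

```python
import numpy as np

def proj_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1},
    # using the classical sort-and-threshold algorithm.
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    # Largest index rho such that u_rho - (css_rho - 1)/rho > 0
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

print(proj_simplex(np.array([2.0, 0.0])))  # -> [1. 0.]
```

With an additional sparsity penalty on top of this constraint, the combined proximal operator no longer has such a simple form, which is exactly why splitting schemes such as generalized FBS or ADMM become attractive.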
IV-B Numerical experiments
In the unmixing application we consider an example simulated using the USGS spectral library^{7}^{7}7The dataset can be downloaded from http://www.lx.it.pt/~bioucas/: from the library, we extract 23 spectra corresponding to different materials (by keeping spectra with a minimum angular distance to each other). Using these 23 base spectra, we simulate mixed pixels by creating random linear combinations of a few endmembers. The random weights of the active components are obtained by uniform random generation (leading to weights that do not sum to one). We then add Gaussian noise to the resulting signatures. For each numerical experiment we solve the unmixing problem by least squares with the four regularizers of Table I: $\ell_2$, $\ell_1$, $\ell_p$ and LSP. An additional approach consisting in hard thresholding the positive least-squares solution (named 'LS+threshold' hereafter) has also been investigated. As for the previous example on classification, we compute the unmixing performance on a regularization path, i.e., a series of values of the regularization parameter $\lambda$ in Eq. (8). We assess the success of the unmixing by the model error. We repeat the simulation 50 times, to account for different combinations of the original elements of the dictionary: all results reported are averages over those 50 simulations.
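The experimental pipeline above can be sketched end to end in a few lines. The sketch below is a simplified Python analogue under stated assumptions: a random nonnegative matrix stands in for the USGS spectral library, and a plain ISTA loop with the positivity-constrained $\ell_1$ prox of Eq. (10) stands in for the toolbox's GIST solver; all names and parameter values are ours:

```python
import numpy as np

def unmix_ista(x, D, lam, n_iter=500):
    # Proximal gradient (ISTA) for Eq. (9) with the l1 penalty:
    # gradient step on 0.5*||x - D w||^2, then the positivity-
    # constrained soft threshold of Eq. (10).
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1/L, L = Lipschitz constant
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ w - x)
        w = np.maximum(np.maximum(w - step * grad, 0.0) - step * lam, 0.0)
    return w

rng = np.random.default_rng(0)
D = rng.random((100, 23))                      # stand-in for a spectral library
w_true = np.zeros(23)
w_true[[2, 7, 11]] = rng.random(3)             # a few active endmembers
x = D @ w_true + 0.01 * rng.standard_normal(100)
w_hat = unmix_ista(x, D, lam=0.5)
print(np.count_nonzero(w_hat))                 # number of active components
```

Sweeping `lam` over a grid and recording the model error and the number of nonzero components at each value reproduces, in miniature, the regularization paths plotted in Figure 10.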
First, we compare the different regularization schemes for different noise levels (Figure 10). We report the model error along the regularization path (varying $\lambda$) on the top row of Figure 10. On the bottom row, we report the model error as a function of the number of selected components, again along the same regularization path. We observe that the nonconvex strategies achieve the lowest errors (triangle-shaped markers) on low and medium noise levels, while robustness to noise differs across regularizers. The $\ell_1$ norm also achieves good results, in particular in high-noise situations. Regarding the error achieved per level of sparsity (represented in the bottom row of Fig. 10), we observe that the nonconvex regularizers achieve far better reconstruction errors, in particular around the right number of active coefficients. On average, the best results are obtained by the LSP and $\ell_p$ regularizations. Note that the $\ell_1$ regularizer needs a larger number of active components than the true one in order to achieve a good model reconstruction. The LS+threshold approach seems to work well for component selection, but leads to an important decrease in the accuracy of the model.
In order to evaluate the ability of a method to simultaneously estimate a good model and select the correct active components, we run simulations with a fixed noise level but a varying number of true active components. In this configuration, we first find, for all regularizations, the smallest $\lambda$ that leads to the correct number of selected components. The average model error as a function of the number of true active components is reported in Figure 11(a). We can see that nonconvex regularization leads to better performances when the correct number of spectra is selected (compared to $\ell_1$ and LS+threshold). In Figure 11(b) we report the number of selected components as a function of the true number of active components when the model error is minimal. We observe that nonconvex regularization manages to both select the correct components and estimate a good model when a small number of components are active, but also that it fails (as $\ell_1$ does) for large numbers of active components. This result illustrates the fact that nonconvex regularization is more aggressive in terms of sparsity and obviously performs best when sparsity is truly needed.
V Conclusions
In this paper, we presented a general framework for nonconvex regularization in remote sensing image processing. We discussed different ways to promote sparsity and avoid the bias when sparsity is required via the use of nonconvex regularizers. We applied the proposed regularization schemes to problems of hyperspectral image classification and linear unmixing: in all scenarios, we showed that nonconvex regularization leads to the best performances when accounting for both sparsity and quality of the final product. Nonconvex regularizers promote compact solutions, but without the bias (and the decrease in performance) related to nondifferentiable convex norms such as the popular $\ell_1$ norm.
Nonconvex regularization is a flexible and general framework that can be applied to every regularized processing scheme: keeping this in mind, we also provide a toolbox to the community to apply nonconvex regularization to a wider number of problems. The toolbox can now be accessed online (see also the Appendix of this article for a description of the toolbox).
VI Acknowledgements
The authors would like to thank Telops Inc. (Québec, Canada) for acquiring and providing the Thetford mines data, as well as the IEEE GRSS Image Analysis and Data Fusion Technical Committee (IADFTC) and Dr. Michal Shimoni (Signal and Image Centre, Royal Military Academy, Belgium) for organizing the 2014 Data Fusion Contest, the Centre de Recherche Public Gabriel Lippmann (CRPGL, Luxembourg) and Dr. Martin Schlerf (CRPGL) for their contribution of the HyperCam LWIR sensor, and Dr. Michaela De Martino (University of Genoa, Italy) for her contribution to data preparation.
The authors would like to thank the Hyperspectral Image Analysis group and the NSF Funded Center for Airborne Laser Mapping (NCALM) at the University of Houston for providing the Houston data sets and the IEEE GRSS IADFTC for organizing the 2013 Data Fusion Contest.
The authors would like to acknowledge Dr. M. Volpi and Dr. Longbotham for making their Zurich Summer data available.
The authors would like to acknowledge Dr. Iordache and Dr. BioucasDias for sharing the USGS library used in the unmixing experiment.
Appendix
VI-A Optimization toolbox
In order to promote the use of nonconvex regularization in the remote sensing community, we provide the reader with a simple-to-use Matlab/Octave generic optimization toolbox. The code provides a generic solver (a complete rewriting of GIST) for problem (7) that is able to handle a number of regularization terms (at least all the terms in Table I) and any differentiable data-fitting term. We provide several loss functions for performing multiclass classification tasks, such as the SVM, logistic regression and calibrated hinge losses. For linear unmixing we provide the least-squares loss, but extensions to other, possibly more robust, data-fitting terms can be performed easily. For instance, performing unmixing with the more robust Huber loss [72] would require the change of only two lines in the function ``gist_least.m'', i.e., the computation of the Huber loss and its gradient. The toolbox can now be accessed at https://github.com/rflamary/nonconvexoptimization. It is freely available on Github.com as a community project and we welcome contributions.
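As an illustration of the kind of change involved, the Huber data-fitting term and its gradient can be sketched as follows. This is a Python analogue written for this description, not the actual Matlab code of ``gist_least.m''; the threshold `delta` and all names are our own:

```python
import numpy as np

def huber_loss_grad(w, D, x, delta=1.0):
    # Huber data-fitting term for the residual r = D w - x, and its gradient.
    # Quadratic for |r| <= delta, linear beyond: robust to outlier residuals.
    r = D @ w - x
    quad = np.abs(r) <= delta
    loss = np.sum(np.where(quad, 0.5 * r**2, delta * np.abs(r) - 0.5 * delta**2))
    grad = D.T @ np.clip(r, -delta, delta)     # psi(r) = clipped residual
    return loss, grad

# Sanity check: for a very large delta, Huber reduces to least squares.
rng = np.random.default_rng(0)
D, x, w = rng.random((20, 5)), rng.random(20), rng.random(5)
loss, grad = huber_loss_grad(w, D, x, delta=1e6)
print(np.allclose(grad, D.T @ (D @ w - x)))    # True
```

Only the loss value and the gradient change; the proximal step on the regularizer is untouched, which is what makes swapping data-fitting terms a two-line edit in a GIST-style solver.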
References
 [1] G. CampsValls, D. Tuia, L. GómezChova, S. Jimenez, and J. Malo, Remote Sensing Image Processing, Synthesis Lectures on Image, Video, and Multimedia Processing. Morgan and Claypool, 2011.
 [2] G. CampsValls, D. Tuia, L. Bruzzone, and J. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Proc. Mag., vol. 31, no. 1, pp. 45–54, 2014.
 [3] G. CampsValls and L. Bruzzone, “Kernelbased methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, 2005.
 [4] G. Mountrakis, J. Im, and C. Ogole, “Support vector machines in remote sensing: A review,” ISPRS J. Photogramm. Rem. Sens., vol. 66, no. 3, pp. 247–259, 2011.
 [5] J. Verrelst, J. MuñozMarí, L. Alonso, J. Delegido, J. P. Rivera, G. CampsValls, and J. Moreno, “Machine learning regression algorithms for biophysical parameter retrieval: Opportunities for Sentinel2 and 3,” Remote Sens. Enviro., vol. 118, pp. 127–139, 2012.
 [6] M. Belkin, I. Matveeva, and P. Niyogi, “On manifold regularization,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS), Bonn, Germany, 2005, pp. 17–24.
 [7] L. Gómez-Chova, G. Camps-Valls, J. Muñoz-Marí, and J. Calpe, “Semisupervised image classification with Laplacian support vector machines,” IEEE Geosci. Remote Sens. Lett., vol. 5, no. 4, pp. 336–340, 2008.
 [8] M. A. Bencherif, J. Bazi, A. Guessoum, N. Alajlan, F. Melgani, and H. Alhichri, “Fusion of extreme learning machine and graphbased optimization methods for active classification of remote sensing images,” IEEE Geosci. Remote Sens. Letters, vol. 12, no. 3, pp. 527–531, 2015.
 [9] W. Zhangyang, N. Nasrabadi, and T. S. Huang, “Semisupervised hyperspectral classification using taskdriven dictionary learning with Laplacian regularization,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1161–1173, 2015.
 [10] X. Sun, N. Nasrabadi, and T. Tran, “Taskdriven dictionary learning for hyperspectral image classification with structured sparsity constraints,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4457–4471, 2015.
 [11] S. T. Tu, J. Y. Chen, W. Yang, and H. Sun, “Laplacian eigenmapsbased polarimetric dimensionality reduction for SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 1, pp. 170–179, 2012.
 [12] D. Tuia, M. Volpi, M. Trolliet, and G. CampsValls, “Semisupervised manifold alignment of multimodal remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 12, pp. 7708–7720, 2014.
 [13] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [14] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, “Advances in spectralspatial classification of hyperspectral images,” Proc. IEEE, vol. 101, no. 3, pp. 652 – 675, 2013.
 [15] D. Tuia, M. Volpi, M. Dalla Mura, A. Rakotomamonjy, and R. Flamary, “Automatic feature learning for spatiospectral image classification with sparse SVM,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6062–6074, 2014.
 [16] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised deep feature extraction for remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., 2016, in press.
 [17] M. Campos-Taberner, A. Romero-Soriano, C. Gatta, G. Camps-Valls, A. Lagrange, B. Le Saux, A. Beaupère, A. Boulch, A. Chan-Hon-Tong, S. Herbin, H. Randrianarivo, M. Ferecatu, M. Shimoni, G. Moser, and D. Tuia, “Processing of extremely high resolution LiDAR and optical data: Outcome of the 2015 IEEE GRSS Data Fusion Contest. Part A: 2D contest,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., in press.
 [18] M.D. Iordache, J. M. BioucasDias, and A. Plaza, “Sparse unmixing of hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2014–2039, 2011.
 [19] P. Tokarczyk, J. Wegner, S. Walk, and K. Schindler, “Features, color spaces, and boosting: New insights on semantic classification of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 1, pp. 280–295, 2014.
 [20] M. Volpi and D. Tuia, “Dense semantic labeling with convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., in press.
 [21] J. BioucasDias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot, “Hyperspectral unmixing overview: Geometrical, statistical, and sparse regressionbased approaches,” IEEE J. Sel. Topics Appl. Earth Observ., vol. 5, no. 2, pp. 354–379, April 2012.
 [22] Q. Qu, N. Nasrabadi, and T. Tran, “Abundance estimation for bilinear mixture models via joint sparse and lowrank representation,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 7, pp. 4404–4423, July 2014.
 [23] M.D. Iordache, J. BioucasDias, and A. Plaza, “Collaborative sparse regression for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 341–354, Jan 2014.
 [24] Y. Chen, N. Nasrabadi, and T. Tran, “Hyperspectral image classification using dictionarybased sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973–3985, Oct 2011.
 [25] K. Tan, S. Zhou, and Q. Du, “Semisupervised discriminant analysis for hyperspectral imagery with blocksparse graph,” IEEE Geosci. Remote Sens. Letters, vol. 12, no. 8, pp. 1765–1769, Aug 2015.
 [26] B. Song, J. Li, M. Dalla Mura, P. Li, A. Plaza, J. BioucasDias, J. Atli Benediktsson, and J. Chanussot, “Remotely sensed image classification using sparse representations of morphological attribute profiles,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 5122–5136, Aug 2014.
 [27] D. Tuia, N. Courty, and R. Flamary, “Multiclass feature learning for hyperspectral image classification: sparse and hierarchical solutions,” ISPRS J. Int. Soc. Photo. Remote Sens., vol. 105, pp. 272–285, 2015.
 [28] X. X. Zhu and R. Bamler, “Superresolution power and robustness of compressive sensing for spectral estimation with application to spaceborne tomographic SAR,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 1, pp. 247–258, Jan 2012.
 [29] H. Hongxing, J. Bioucas-Dias, and V. Katkovnik, “Interferometric phase image estimation via sparse coding in the complex domain,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2587–2602, May 2015.
 [30] S. Li and B. Yang, “A new pansharpening method using a compressed sensing technique,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 2, pp. 738–746, Feb 2011.
 [31] D. L. Donoho and P. B. Stark, “Uncertainty principles and signal recovery,” SIAM Journal on Applied Mathematics, vol. 49, no. 3, pp. 906–931, 1989.
 [32] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization,” Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.
 [33] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589–592, 2008.
 [34] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forwardbackward splitting,” Multiscale Modeling & Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
 [35] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa, “Solving structured sparsity regularization with proximal methods,” in Machine Learning and Knowledge Discovery in Databases, pp. 418–433. Springer, 2010.
 [36] P. L. Combettes and J.C. Pesquet, “Proximal thresholding algorithm for minimization over orthonormal bases,” SIAM Journal on Optimization, vol. 18, no. 4, pp. 1351–1376, 2007.
 [37] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 [38] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The entire regularization path for the support vector machine,” Journal of Machine Learning Research, vol. 5, pp. 1391–1415, 2004.
 [39] D. L. Donoho and B. F. Logan, “Signal recovery and the large sieve,” SIAM Journal on Applied Mathematics, vol. 52, no. 2, pp. 577–591, 1992.
 [40] E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted $\ell_1$ minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877–905, 2008.
 [41] T. Zhang, “Analysis of multistage convex relaxation for sparse regularization,” J. Mach. Learn. Res., vol. 11, pp. 1081–1107, 2010.
 [42] J. Sigurdsson, M. Ulfarsson, and J. Sveinsson, “Hyperspectral unmixing with regularization,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 11, pp. 6793–6806, Nov 2014.
 [43] Y. Qian, S. Jia, J. Zhou, and A. RoblesKelly, “Hyperspectral unmixing via sparsityconstrained nonnegative matrix factorization,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4282–4297, Nov 2011.
 [44] W. Wang and Y. Qian, “Adaptive sparsityconstrained nmf with halfthresholding algorithm for hyperspectral unmixing,” IEEE J. Sel. Topics Appl. Earth Observ., vol. 8, no. 6, pp. 2618–2631, June 2015.
 [45] M.H. Wei, J. McClellan, and W. Scott, “Estimation of the discrete spectrum of relaxations for electromagnetic induction responses using regularized least squares for ,” IEEE Geosci. Remote Sens. Letters, vol. 8, no. 2, pp. 233–237, March 2011.
 [46] J. Zhang, P. Zhong, Y. Chen, and S. Li, “ regularized deconvolution network for the representation and restoration of optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2617–2627, May 2014.
 [47] S. Jia, X. Zhang, and Q. Li, “Spectralspatial hyperspectral image classification using regularized lowrank representation and sparse representationbased graph cuts,” IEEE J. Sel. Topics Appl. Earth Observ., vol. 8, no. 6, pp. 2473–2484, June 2015.
 [48] P. Gong, C. Zhang, Z. Lu, J. Z. Huang, and J. Ye, “A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems,” in Proc. ICML, Atlanta, GE, 2013.
 [49] O. Bousquet and A. Elisseeff, “Stability and generalization,” The Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.
 [50] B. Schölkopf, C. J. Burges, and A. J. Smola, Advances in kernel methods: support vector learning, MIT press, 1998.
 [51] S. Bourguignon, J. Ninin, H. Carfantan, and M. Mongeau, “Exact sparse approximation problems via mixedinteger programming: Formulations and computational performance,” IEEE Trans. Signal Proc., vol. 64, no. 6, pp. 1405–1419, March 2016.
 [52] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Signals, Systems and Computers, 1993. 1993 Conference Record of The TwentySeventh Asilomar Conference on. IEEE, 1993, pp. 40–44.
 [53] S. Chen, C. F. Cowan, and P. M. Grant, “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Trans. Neural Networks, vol. 2, no. 2, pp. 302–309, 1991.
 [54] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al., “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
 [55] P. L. Combettes and J.C. Pesquet, “Proximal splitting methods in signal processing,” in Fixedpoint algorithms for inverse problems in science and engineering, pp. 185–212. Springer, 2011.
 [56] H. Zou, “The adaptive Lasso and its oracle properties,” J. American Stat. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.
 [57] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” J. American Stat. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.
 [58] C. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
 [59] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the kurdykalojasiewicz inequality,” Mathematics of Operations Research, vol. 35, no. 2, pp. 438–457, 2010.
 [60] Z. Xu, X. Chang, F. Xu, and H. Zhang, “$L_{1/2}$ regularization: a thresholding representation theory and a fast solver,” IEEE Trans. Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
 [61] E. Chouzenoux, J.C. Pesquet, and A. Repetti, “A block coordinate variable metric forwardbackward algorithm,” 2013.
 [62] A. Rakotomamonjy, R. Flamary, and G. Gasso, “Dc proximal newton for nonconvex optimization problems,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 27, no. 3, pp. 636–647, 2016.
 [63] L. Laporte, R. Flamary, S. Canu, S. Déjean, and J. Mothe, “Nonconvex regularizations for feature selection in ranking with sparse svm,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 25, no. 6, pp. 1118–1130, 2014.
 [64] G. Gasso, A. Rakotomamonjy, and S. Canu, “Recovering sparse signals with a certain family of nonconvex penalties and DC programming,” IEEE Trans. Signal Proc., vol. 57, no. 12, pp. 4686–4698, 2009.
 [65] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization path for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, pp. 1–122, 2010.
 [66] W. Liao, X. Huang, F. Van Collie, A. Gautama, W. Philips, H. Liu, T. Zhu, M. Shimoni, G. Moser, and D. Tuia, “Processing of thermal hyperspectral and digital color cameras: outcome of the 2014 data fusion contest,” IEEE J. Sel. Topics Appl. Earth Observ., vol. 8, no. 6, pp. 2984–2996, 2015.
 [67] F. Pacifici, Q. Du, and S. Prasad, “Report on the 2013 IEEE GRSS data fusion contest: Fusion of hyperspectral and LiDAR data,” IEEE Remote Sens. Mag., vol. 1, no. 3, pp. 36–38, 2013.
 [68] M. Volpi and V. Ferrari, “Semantic segmentation of urban scenes by learning local class interactions,” in IEEE/CVF CVPRW Earthvision, 2015.
 [69] P. Felzenszwalb and D. Huttenlocher, “Efficient graphbased image segmentation,” IJCV, vol. 59, no. 2, pp. 167–181, 2004.
 [70] H. Raguet, J. Fadili, and G. Peyré, “A generalized forwardbackward splitting,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1199–1226, 2013.
 [71] M.D. Iordache, J. M. BioucasDias, and A. Plaza, “Total variation spatial regularization for sparse hyperspectral unmixing,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 50, no. 11, pp. 4484–4502, 2012.
 [72] P. J. Huber et al., “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.