 # Iterative Log Thresholding

Sparse reconstruction approaches using the re-weighted l1-penalty have been shown, both empirically and theoretically, to provide a significant improvement in recovering sparse signals in comparison to the l1-relaxation. However, numerical optimization of such penalties involves solving problems with l1-norms in the objective many times. Using the direct link of reweighted l1-penalties to the concave log-regularizer for sparsity, we derive a simple prox-like algorithm for the log-regularized formulation. The proximal splitting step of the algorithm has a closed form solution, and we call the algorithm 'log-thresholding' in analogy to soft thresholding for the l1-penalty. We establish convergence results, and demonstrate that log-thresholding provides more accurate sparse reconstructions compared to both soft and hard thresholding. Furthermore, the approach can be directly extended to optimization over matrices with penalty for rank (i.e. the nuclear norm penalty and its re-weigthed version), where we suggest a singular-value log-thresholding approach.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider sparse reconstruction problems which attempt to find sparse solutions to over-determined systems of equations. A basic example of such a problem is to recover a sparse vector

from measurements , where with , and captures corruption by noise. Attempting to find maximally sparse solutions is known to be NP-hard, so convex relaxations involving -norms have gained unprecedented popularity. Basis pursuit (or LASSO in statistics literature) minimizes the following objective:

 min∥y−Ax∥22+λ∥x∥1 (1)

Here is a parameter that balances sparsity versus the norm of the residual error. There is truly a myriad of algorithms for solving (1) (see e.g. [1, 2, 3]), and for large-scale instances, variations of iterative soft thresholding have become very popular:

 x(n+1)=Sλ(x(n)+AT(y−x(n))) (2)

where applies soft-thresholding for each entry:

 Sλ(zi)=sign(zi)max(0,|zi|−λ). (3)

Based on operator splitting and proximal projection theories, the algorithm in (2) converges if the spectral norm [4, 5]. This can be achieved simply by rescaling . Accelerated versions of iterative thresholding have appeared .

An exciting albeit simple improvement over -norms for approximating sparsity involves weighting the -norm: with . Ideal weights require knowledge of the sparse solution, but a practical idea is to use weights based on solutions of previous iterations [7, 8]:

 w(n+1)i=1δ+|^x(n)i| (4)

This approach can be motivated as a local linearization of the

-heuristic for sparsity

. There is strong empirical  and recent theoreical evidence that reweighted approaches improve recovery of sparse signals, in the sense of enabling recovery from fewer measurements [9, 10].

In this paper, we consider the log-regularized formulation that gives rise to the re-weighting schemes mentioned above, and propose a simple prox-like optimization algorithm for its optimization. We derive a closed-form solution for the proximal step, which we call -thresholding. We establish monotone convergence of iterative -thresholding (ILT) to its fixed points, and derive conditions under which these fixed points are local minima of the log-regularized objective. Sparse recovery performance of the method on numerical examples surpasses both soft and hard iterative thresholding (IST and IHT). We also extend the approach to minimizing rank for matrix functions via singular value -thresholding.

To put this into context of related work,  has considered iterative thresholding based on non-convex -norm penalties for sparsity. However, these penalties do not have a connection to re-weighted optimization. Also,  have investigated coordinate descent based solutions for non-convex penalties including the log-penalized penalty, but their approach does not use closed form log-thresholding. Finally, general classes of non-convex penalties, their benefits for sparse recovery, and reweighed convex-type methods for their optimization are studied in . This class of methods is different from the log-thresholding we propose.

## 2 ISTA as proximal splitting

We briefly review how soft-thresholding can be used to solve the sparse reconstruction problem in (1). Functions of the form where is convex differentiable with a Lipschitz gradient, and is general convex can be solved by a general proximal splitting method :

 ^x(n+1)=proxg(x(n)−∇h(x(n))). (5)

The prox-operation is a generalization of projection onto a set to general convex functions:

 proxh(x)=argminzh(z)+12∥x−z∥22. (6)

If is an indicator function for a convex set, then the prox-operation is equivalent to the projection onto the set, and ISTA itself is equivalent to the projected gradient approach.

Forward-backward splitting can be applied to the sparse recovery problem (1) by deriving the proximal operator for -norm, which is precisely the soft-thresholding operator in (3). The convergence of ISTA in (2) thus follows directly from the theory derived for forward-backward splitting .

## 3 Log-thresholding

The reweighted- approach can be justified as an iterative upper bounding by a linear approximation to the concave -heuristic for sparsity (here is a small positive constant) :

 minf(x)=min∥y−Ax∥22+λ∑ilog(δ+|xi|). (7)

While the -penalty is concave rather than convex, we still consider the scalar proximal objective around a fixed :

 gλ(z)≜(z−x)2+λlog(δ+|z|). (8)

We note that for small enough, the global minimum of over (with held constant) is always at . However, when , the function also exhibits a local minimum, which disappears for small . We show that it is the local, rather than the global minimum, that provides the link to re-weighted minimization and is key to the log-prox algorithm we propose. Indeed, it is also the local (and not global) minimum that provides the link to iteratively re-weighteld algorithms.

Our algorithm arises directly from first order necessary conditions for optimality. For , we solve the equation to find the local minimum in closed-form. We call this operation log-thresholding , :

 Lλ(x)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩12((xi−δ)+√(xi+δ)2−2λ), x>x012((xi+δ)−√(xi−δ)2−2λ), x<−x00, otherwise (9)

where . We illustrate log-thresholding in Figure 1. The left plot shows as a function of for several values of . For large the function has a local minimum, but for small the local minimum disappears. For -thresholding we are specifically interested in the the local minimum: an iterative re-weighted approach with small enough step size starting at , i.e. beyond the local minimum, will converge to the local minimum, avoiding the global one. The right plot in Figure 1 shows the -thresholding operation with as a function of . It can be seen as a smooth alternative falling between hard and soft thresholding.

In analogy to ISTA, we can now formally define the iterative -thresholding algorithm:

 ^xn+1=Lλ(xn+AT(y−Axn)) (10)

where applied the element-wise -thresholding operation we obtained in (9). We establish its convergence next.

### 3.1 Convergence of iterative log-thresholding

The theory of forward-backward splitting does not allow an analysis of -thresholding, because the log is non-convex, and log-thresholding is not a contraction (in particular, it is not firmly non-expansive). Therefore, for the analysis we use an approach based on optimization transfer using surrogate functions  to prove convergence of ILT to its fixed points. At a high-level the analysis follows the program for IHT in , but some of the steps are notably different, and in particular some assumptions on the operator action are necessary to establish that fixed points correspond to local minima of our formulation. In the appendix we establish:

###### Proposition 1

Under the assumption , the ILT algorithm in (10) monotonically decreases the objective in (7), and converges to fixed points. A sufficient condition for these fixed points to be local minima is that restricted to the non-zero coefficients is well-conditioned, specifically that the lowest singular values of the restriction are greater than .

## 4 Singular Value Log-thresholding

A closely related problem to finding sparse solutions to systems of linear equations is finding low-rank matrices from sparse observations, known as matrix completion:

 minrank(X) such that Xi,j=Yi,j,{(i,j)∈Ω} (11)

Similar to sparsity, rank is a combinatorial objective which is typically intractable to optimize directly. However, the nuclear norm , where are the singular values of X, serves as the tightest convex relaxation of rank, analogous to -norm being the convex relaxation of the -norm. In fact, the nuclear norm is exactly the -norm of the singular value spectrum of a matrix. This connection enables the application of various singular value thresholding algorithms: for instance, the SVT algorithm of  alternates soft-thresholding of the singular value spectrum with gradient descent steps. In the experimental section we investigate a simplified singular-value log-thresholding algorithm for matrix completion, where we replace soft thresholding with hard and log-thresholdings. We present very promising empirical results of singular value log-thresholding in Section 5, and a full convergence analysis will appear in a later publication. Figure 2: Noiseless sparse recovery: (a) average error-norm (b) probability of exact recovery after 250 iterations over 1000 random trials. M=100,N=200.

## 5 Experiments

We investigate the performance of iterative log thresholding via numerical experiments on noiseless and noisy sparse recovery. Intuitively we expect ILT to recover sparser solution than soft-thresholding due to the connection to re-weighted- norms, and also to behave better than the non-smooth iterative hard thresholding.

First we consider sparse recovery without noise, i.e. we would like to find the sparsest solution that satisfies exactly. One could in principle solve a sequence of problems (1) with decreasing , i.e. increasing penalty on via IST, IHT, ILT. Howeveer, when we know an upper bound on the desired number of non-zero coefficients, a more successful approach is to adaptively change to eliminate all except the top- coefficients in each iteration111 This is easy for IST and IHT by sorting in descending order: let then . For ILT we have from (9). as used e.g. in . We compare the performance of IST, IHT, and the proposed ILT in Figure 2. We have and we vary . Apart from changing the thresholding operator, all the algorithms are exactly the same. The top plot shows the average reconstruction error from the true sparse solution . It is averaged over trials allowing IST, IHT and ILT to run for up-to iterations. The bottom plot shows probability of recovering the true sparse solution. We can see that ILT is superior in both probability of recovery (higher probability of recovery) and in reconstruction error (lower reconstruction error) over both IST and IHT. Figure 3: Sparse recovery with noise. Average error vs. sparsity over 100 trials, after 250 iterations.

Our next experiment compares the three iterative thresholding algorithms on noisy data. Since regularization parameters have a different meaning for the different penalties, we plot the whole solution path of squared residual error vs. sparsity for the three algorithms in Figure 3. We compute the average residual norm for a given level of sparsity for all three algorithms, averaged over runs. We have , and a small amount of noise is added. We can see that the iterative log thresholding consistently achieves the smallest error for each level of sparsity.

In our final experiment we consider singular value log-thresholding for matrix completion. We study a simplified algorithm that parallels the noiseless sparse recovery algorithm with known number of nonzero-elements . We alternate gradient steps with steps of eliminating all but the first singular values by soft, hard and log-thresholding. We have an matrix with observed entries, and rank, . We show the average error in Frobenius norm from the true underlying solution as a function of iteration number over random runs in Figure 4. We see that the convergence of log-SV-thresholding to the correct solution is consistently faster. We expect similar improvements to hold for other algorithms involving soft-thresholding, and to other problems beyond matrix completion, e.g. robust PCA.

## 6 Convergence of ILT

Here we establish Proposition 1. We first define a surrogate function for in (7):

 Q(x,z)=∥y−Ax∥22+λ∑ilog(δ+|xi|)+ ∥x−z∥22−∥A(x−z)∥22 (12)

Note that . Simplifying (6) we have

 Q(x,z)=∑i(xi−ki(z))2+λlog(xi+|δ|))+K(z), (13)

where and contains terms independent of . The optimization over is now separable, i.e. can be done independently for each coordinate. We can see that finding local minima over of corresponds to iterative log-thresholding.

Using this motivation for ILT, we can now prove convergence to fixed points of . First, we have:

###### Proposition 2

and , are monotonically decreasing with iterations as long as the spectral norm .

The proof parallels the IHT proof of  using the fact that , which is independent of the thresholding used. The main difference for ILT is that is not the global minimum of but it still holds that .
Next, we have:

###### Proposition 3

Any fixed point of (10) satisfies the following:

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩aTi(y−A¯x)=λ2(¯xi+δ)if¯xi>x0aTi(y−A¯x)=λ2(¯xi−δ)if¯xi<−x0|aTi(y−A¯x)|≤x0% otherwise

In other words, if , then the corresponding gradient component satisfies local stationarity conditions for problem (7), and if , the gradient is bounded.

Proof: Given a fixed point of (10) define

 si=aTi(y−A¯x), (14)

Suppose first that . Explicitly writing (10),

 ¯xi−si+δ =√(¯xi+si+δ)2−2λ,

squaring both sides, and simplifying, we have

 aTi(y−A¯x)=λ2(¯xi+δ),

which is precisely equivalent to local optimality of (7) with respect to the th coordinate. Otherwise, suppose . Then we have , and so .

###### Proposition 4

For any fixed point of the ILT algorithm (7) and any small perturbation , if is small enough, for small we have

 Q(¯x+η,¯x)>Q(¯x)+∥P0η∥2+34∥P1η∥2,

where and denote the projections onto the zero and nonzero indices of . The precise condition on is as follows:

 λδ+2δ>2√2λ, (15)

Proof: This result follows by Proposition 3, together with the proof technique of [Lemma 3]. In particular, for any perturbation , we can write as

 =∑i(−2ηisi+η2i+λlog(|¯xi+ηi|+δ|¯xi|+δ))

The above inequality is easily verified using (13). Defining now and , we can use the optimality properties of Proposition 3 to rewrite :

 ∥η∥2+∑i∈Γ0(−2ηisi+λlog(|ηi|+δ|δ|))+ ∑i∈Γ1(−ηiλ¯xi+|δ|+λlog(|¯xi+ηi|+δ|¯xi|+δ))

We now consider lower bounds for each of these two sums taking all WLOG:

 ∑i∈Γ0 −2ηisi+λlog(|ηi|+δδ)≥ =∑i∈Γ0λlog(1+|ηi|δ)−2ηi|x0| =∑i∈Γ0(λ|ηi|δ−2|ηi|x0)−O(∥η∥2)

Given that (15) holds, the quantity on the last line is positive for small enough. For , we have

 ∑i∈Γ1−ηiλ¯xi+δ+λlog(|¯xi+ηi|+δ|¯xi|+δ)≥ ∑i∈Γ1−ηiλ+ηiλ¯xi+δ−142λη2i(|¯xi|+δ)2≥−14∥η∥22

where the last inequality comes from the fact that for any , we have .

###### Proposition 5

ILT converges to its fixed points if . Moreover, if the singular values of the restriction of to the columns are greater than , these fixed points must be the local minima of (7).

The result follows from the proof technique of [Theorem 3]; in particular the sums are monotonically increasing and bounded, so the iterates must converge. Finally, we have

 Q(¯x+η) =Q(¯x+η,¯x)−∥η∥2+∥Aη∥2 ≥Q(¯x)−14∥P1η∥2+∥AP1η∥2.

This quantity is non-negative provided that the singular values of the restriction of to are greater than .

## References

•  MÁrio A. T. Figueiredo, Robert D. Nowak, and Stephen J. Wright, “Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 586–597, Dec. 2007.
•  Michael P. Friedlander Ewout van den Berg, Ewout van den Berg, and Michael P. Friedlander, “Probing the Pareto Frontier for Basis Pursuit Solutions,” SIAM Journal on Scientific Computing, vol. 31, no. 2, pp. 890–912, Jan. 2009.
•  Stanley Osher, Yu Mao, Bin Dong, and Wotao Yin, “Fast linearized Bregman iteration for compressive sensing and sparse denoising,” Communications in Mathematical Sciences, vol. 8, no. 1, pp. 93–111, Mar. 2010.
•  I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. on pure and applied mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
•  P. L. Combettes and J. C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, 2011.
•  A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
•  M. Fazel, H. Hindi, and S. P. Boyd, “A rank minimization heuristic with application to minimum order system approximation,” in IEEE American Control Conf., 2001.
•  E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted 1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008.
•  D. Needell, “Noisy signal recovery via iterative reweighted 1-minimization,” in Forty-Third Asilomar Conference on Signals, Systems and Computers, 2009. IEEE, 2009, pp. 113–117.
•  A. Khajehnejad, W. Xu, S. Avestimehr, and B. Hassibi, “Weighted 1 minimization for sparse recovery with prior information,” in IEEE Int. Symposium on Inf. Theory, 2009., 2009, pp. 483–487.
•  Sergey Voronin and Rick Chartrand, “A new generalized thresholding algorithm for inverse problems with sparsity constraints,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2013.
•  R. Mazumder, J. H. Friedman, and T. Hastie, “Sparsenet: Coordinate descent with nonconvex penalties,” Journal of the American Statistical Association, vol. 106, no. 495, 2011.
•  Qiuying Lin, Sparsity and nonconvex nonsmooth optimization, Ph.D. thesis, University of Washington, 2010.
•  K. Lange, D. R. Hunter, and I. Yang, “Optimization transfer using surrogate objective functions,” Journal of comput. and graphical statistics, vol. 9, no. 1, pp. 1–20, 2000.
•  T. Blumensath and M. E. Davies, “Iterative thresholding for sparse approximations,” Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 629–654, 2008.
•  J.F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optim., vol. 20, no. 4, pp. 1956–1982, 2010.
•  A. Maleki, “Coherence analysis of iterative thresholding algorithms,” in 47th Annual Allerton Conference on Communication, Control, and Computing, 2009. IEEE, 2009, pp. 236–243.