 # Robust Compressed Sensing and Sparse Coding with the Difference Map

In compressed sensing, we wish to reconstruct a sparse signal x from observed data y. In sparse coding, on the other hand, we wish to find a representation of an observed signal y as a sparse linear combination, with coefficients x, of elements from an overcomplete dictionary. While many algorithms are competitive at both problems when x is very sparse, it can be challenging to recover x when it is less sparse. We present the Difference Map, which excels at sparse recovery when sparseness is lower and noise is higher. The Difference Map out-performs the state of the art with reconstruction from random measurements and natural image reconstruction via sparse coding.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In compressed sensing (CS), we are given a measurement matrix (where ), observed data , and we wish to recover a sparse such that

 y=Φx.

The compressed sensing problem can then be written as

 argminx∥x∥0 subject to y=Φx, (1)

where is the penalty function, giving the number of nonzero elements. In the noisy case, where

 ~y=Φx+ϵ⋅N(0,σ)

is assumed to be a noisy observation, we often replace the linear constraints with quadratic ones:

 argminx∥x∥0 subject to ∥~y−Φx∥22≤δ (2)

for some . The problem (2) can also be used for sparse coding (SC); in this setting, is an overcomplete dictionary,

is a signal (such as an image patch), and we seek a sparse coefficient vector

.

Recently, a variety of algorithms have achieved good results for a variety of CS and SC problems, including Least Angle Regression (LARS) , Subspace Pursuit , Matching Pursuit and its variants [11, 19], Iterative Hard Thresholding (IHT) and its variants [4, 3, 2], Iteratively Reweighted Least Squares (IRLS) , and the Alternating Direction Method of Multipliers (ADMM) [5, 7].

Because solving problems (1) and (2) directly is known to be NP-hard , some approaches relax the penalty to the convex norm [18, 12], while some attempt to address the case directly [9, 19, 4], and still others consider any number of (quasi-)norms for [5, 8, 7]. In general, the challenge for CS and SC algorithms is to balance two competing constraints on the solution : to accurately reconstruct the observed data , and to be sparse.

This paper presents a method for solving CS and SC problems without relaxing the constraint, using a general method known as the Difference Map . The Difference Map (DM) has been used to solve a wide variety constraint-intersection problems. Given sets and and distance-minimizing projections and 111By distance-minimizing projection, we mean that subject to , and likewise for ., respectively, DM searches for a point . One iteration of DM is defined by , where

 D(x)=x+β[PA∘fB(x)−PB∘fA(x)] (3)

where

 fA(x)=PA(x)−(PA(x)−x)/β fB(x)=PB(x)+(PB(x)−x)/β

and . One can test for convergence by monitoring the value , which vanishes when a solution is found. Recently, DM has achieved state-of-the-art performance on a variety of NP-hard nonconvex optimization problems including protein folding, -SAT, and packing problems .

The rest of this paper is organized as follows. In Section 2, we introduce an adaptation of the Difference Map for compressed sensing and sparse coding, which we compare at a high level to other well-known algorithms. In Section 3, we compare the algorithms on CS problems using random measurements, and we reconstruct natural images via SC in Section 4.

## 2 Compressed Sensing and Sparse Coding with the Difference Map

Given a matrix (where ) and data , we wish to find a sparse vector that is a solution to problem (2). We apply the Difference Map to this problem by defining the constraint sets

 A={x∈Rn:∥x∥0≤s} B={x∈Rn:∥Φx−y∥22<δ}

for a pre-defined positive integer and scalar .

The minimum-distance projection onto is known as hard thresholding, and is defined by

 PA(x)=[x]s, (4)

where is obtained by setting to zero the components of having the smallest absolute values.

A minimum-distance projection onto the set involves solving a quadratically-constrained quadratic programming problem, which can be very costly. We approximate this projection with:

 PB(x)=x−Φ+(Φx−y) (5)

where is the Moore-Penrose pseudo-inverse of .

The motivation for (5) comes from a simplification of our definition for . Consider the set with linear constraints (as in (1)):

 {x∈Rn:Φx=y}.

The minimum distance projection onto this set is given by the linearly constrained quadratic programming (QP) problem

 P(x0)=argminx∈Rn12∥x−x0∥22 % such that Φx=y.

The Lagrangian of this QP is

 L(x,λ)=12∥x−x0∥22+λ(Φx−y).

The that solves the QP is a minimizer of , and is found by setting , which yields

 x=x0+ΦTλ. (6)

Plugging (6) into and solving for gives

 λ=(ΦΦT)−1(Φx0−y).

Finally, we plug this into (6) to get

 x=x0−ΦT(ΦΦT)−1(Φx0−y)

as in (5).

Although the motivation for (5) comes from the assumption of non-noisy observations , we will see that it performs very well in the noisy case.

It is worth noting that the pseudo-inverse is expensive to compute, though it only needs to be computed once. Thus in the case of sparse image reconstruction where each image patch is reconstructed independently, amortizing the cost of computing over all image patches significantly reduces the pre-computation overhead.

### 2.1 Comparison to Other Algorithms

In what follows, we compare the Difference Map to a representative sample of commonly-used algorithms for solving CS and SC problems: Least Angle Regression (LARS) , Stagewise Orthogonal Matching Pursuit (StOMP) , Accelerated Iterative Hard Thresholding (AIHT) , Subspace Pursuit , Iteratively-Reweighted Least Squares (IRLS)  and Alternating Direction Method of Multipliers (ADMM) [5, 7]. As a final point of comparison, we test the Alternating Map (AM) defined by , with and defined as in (4) and (5), respectively, which resembles the ECME algorithm for known sparsity levels .

The projection (4), known as hard thresholding, is an important part of many CS algorithms [2, 3, 4, 9, 16, 17]. The projection (5) also appears in the ECME algorithm [16, 17]. Normalized Iterative Hard Thresholding (NIHT)  uses a calculation similar to , replacing the pseudo-inverse with for an appropriately chosen scalar .

Given that many algorithms consider the same types of projections as DM, any advantage achieved by DM must not come from the individual projections and , but rather the way in which DM combines the two projections into a single iterative procedure. This is particularly true when comparing DM to the simple alternating map. Alternating between projections is guaranteed to find a point at the intersection of the two constraints if both are convex; however, if either of the constraints is not convex, it is easy for this scheme to get stuck in a local minimum that does not belong to intersection.

While many of the theoretical questions about the Difference Map remain open, it does come with a crucial guarantee here: even on nonconvex problems, a fixed point (meaning ) implies that we have found a solution (meaning a point in ). To see this, note that implies

 PA∘fB(x)=PB∘fA(x). (7)

Thus we have found a point that exists in both and . This leads us to believe that DM will find better sparse solutions when other algorithms are stuck in local minima.

Note that we are not the first to consider applying DM to compressed sensing. Qiu and Dogandžić 

apply DM to the ECME algorithm (a variant of expectation maximization) in order to improve upon one of the two projections inside that algorithm. Although one of ECME’s two projections uses DM

internally, ECME continues to combine the two projections in a simple alternating fashion. This is in stark contrast to our proposed algorithm, which uses DM externally to the individual projections as a more intricate way of combining them. The resulting algorithm, called DM-ECME, is intended only for compressed sensing with non-negative signals. Because we consider different types of problems in this paper, we do not include DM-ECME in the comparisons below.

### 2.2 Implementation Details

We implemented the Difference Map in Matlab (code is available online). All experiments were performed on a computer with a 3 GHz quad-core Intel Xeon processor, running Matlab R2011a. We obtained Matlab implementations of LARS and StOMP from SparseLab v2.1 . Implementations of AIHT  and Subspace Pursuit  were found on the websites of the papers’ authors. We also obtained Matlab implementations of ADMM  and IRLS  directly from the authors of the cited papers.

The implementations of LARS, Subspace Pursuit, AIHT, and StOMP are parameter-free. It was necessary to tune a single parameter for DM, and two parameters each for ADMM and IRLS. We tuned the parameters in two iterations of grid search. ADMM and IRLS required different parameters for the two different experiments presented in this paper (reconstruction from random measurements, and natural image reconstruction). Interestingly, DM performed well with the same parameter for both types of experiments.

We use “training” matrices of the same dimension, sparsity, and noise level as the ones presented in the figures of this paper in order to tune parameters. We chose parameters to minimize the MSE of the recovering , averaged over all training problems. When tuning parameters for natural image reconstruction, we used a training set of 1000 image patches taken from the person and hill

categories of ImageNet

, providing a good variety of natural scenery.

When tuning for DM, we first perform grid search with an interval of , between and 333The natural range for the parameter is [-1,1] (excluding 0), but Elser et al report that occasionally values outside of this interval work well. . Next, in a radius of around the best , we performed another grid search with an interval of . Surprisingly, all in the interval appeared to be equally good for all problems reported in this paper. We chose because it performed slightly better during our experiments, but the advantage over other was not significant.

We use logarithmic grid search to tune the two parameters for ADMM and IRLS. First, we search parameter values by powers of ten, meaning , for . We then search in the neighborhood of best exponent by powers of ten, meaning for .

For random measurements (the experiments in Section 3), this results in parameter values for ADMM, and . For natural image reconstruction (Section 4), we found for ADMM and for IRLS. Note that the parameter for IRLS has nothing to do with the parameter for DM. We refer to both as only to remain consistent with the respective bodies of literature about each algorithm, but in what follows we will only refer to the parameter for DM. IRLS is capable of addressing the quasi-norm for a variety of values , while ADMM uses modifications of the quasi-norm designed to have a simple proximal mapping . In both cases we tried and , and found to perform better. Figure 1: Reconstructing signals with various levels of sparsity s. With sparser signals (left, middle), most algorithms get very close to the true signal. With less sparse signals (right), the Difference Map gets closer than other algorithms to recovering the signal. Each plot is averaged over ten runs, with ϵ chosen to give an SNR of approximately 20 dB, and Φ∈R400×1000. Figure 2: Reconstructing signals with various levels of noise ϵ. With less noise (left), most algorithms get very close to the true signal. With more noise (middle, right), the Difference Map gets closer than other algorithms to recovering the signal. Each plot is averaged over ten runs, with s=150 and Φ∈R400×1000. Figure 3: In the low sparsity regime, the difference map outperforms other algorithms at recovering x from a noisy observed signal with a wide variety of matrix sizes. The sparsity ratio s/n is approximately 1/3, and the noisy observation ~y=Φx+ϵ⋅N(0,1) has an SNR of approximately 20 dB. Each plot is averaged over ten runs.

## 3 Random Measurements

In this Section, we compare the performance of DM to other algorithms when reconstructing signals from random measurements, testing a wide variety of matrix sizes, sparsity, and noise levels. Given positive integers and

, we generate the random matrix

with entries drawn from . We then ensure that columns have zero mean and unit norm. We generate the -sparse vector whose nonzero elements are drawn from . We then calculate , and the noisy “observation” . Finally, we ask each algorithm to reconstruct given only and .

We measure runtime instead of iterations, as the time required per iteration varies widely for the algorithms considered. Additionally, the pre-computation for DM is the longest of any algorithm, requiring the pseudo-inverse of the dictionary. This pre-computation overhead is included in the timekeeping.

In the first experiment, each algorithm attempts to reconstruct as we vary the sparsity level . We choose so that the signal-to-noise ratio (SNR) is close to 20 dB. The results in Figure 1 demonstrate that for small values of (left, middle), meaning sparser signals, most algorithms are able to recover almost perfectly. As we increase the value of (right), meaning denser signals, other algorithms converge to undesirable minima. The Difference Map, however, continues to get very close to recovering .

In the next experiment, each algorithm attempts to reconstruct as we vary the noise by changing . is fixed at 150. The results in Figure 2 show that with less noise (left), meaning lower and higher SNR, several algorithms are able to get close to recovering the true signal . As the noise increases (middle, right), meaning higher and lower SNR, only the Difference Map is able to get close to recovering the signal.

Note that DM and AM start “late” in all plots from Figures 1 and 2 because their pre-computation time is the longest (calculating ). Despite using the same projections, we notice a large disparity in performance between DM and AM when . Because the two algorithms both use the same two projections and , this performance gap shows the power of combining two simple projections in a more elaborate way than simply alternating between them.

From the results in Figures 1 and 2, we hypothesize that DM has a significant advantage with noisy, less sparse signals (higher and ) where other state-of-the-art compressed sensing algorithms get stuck in local minima or require a large amount of time to reach a good solution. We test this hypothesis with a variety of different matrix sizes in Figure 3, each time with a sparsity ratio of approximately and an SNR of approximately 20 dB. The results show that DM does indeed outperform other algorithms in this setting, for all matrix sizes tested.

## 4 Sparse Coding Image Reconstruction

The results from Section 3 indicate that DM offers an advantage over other algorithms when the underlying signal is less sparse and the observation is noisy. The less-sparse, noisy setting corresponds well to images which contain a large variety of textures, such as natural images. In this Section, we measure the sparse coding performance of the same algorithms as in the previous section (with the exception of AM, whose sparse coding performance was not competitive), by comparing reconstruction quality for a variety of images.

When reconstructing a large image, we treat each patch as an independent signal to reconstruct. Because our dictionary is constant, we only need to compute the pseudo-inverse in (5) once. By amortizing the cost of the pseudo-inversion over all patches, this effectively allows DM to converge in less time per patch. We amortize the cost of pre-computation for other algorithms as well (most notably ADMM and IRLS).

In order to test the performance of the algorithms when reconstructing natural images, we require a dictionary learned for sparse image reconstruction. Dictionary learning is not the focus of this paper, but we present our method for completeness. The dictionary is trained with 10 million image patches, and we choose to learn 1000 atoms, resulting in a dictionary of size . We train the dictionary with patches from the person and hills category of ImageNet , which provide a variety of natural scenery. We alternate sparse coding using 20 iterations of ADMM using -shrinkage with (see  for details), with a dictionary update using the method of optimal directions . We used ADMM as the sparse-coding algorithm simply because we had access to MPI-parallelized C code for this purpose. We do not believe that this gave an unfair advantage to ADMM, because the reconstructed images presented in this paper are separate from the dataset used to train the dictionary.

Using 1,000 processors, the dictionary converged in about 2.5 hours. The training patches were reconstructed by the dictionary with an average of 29 nonzero components (out of 1000), and the reconstruction of the training images had a relative error of 5.7%. The dictionary contains the typical combination of high- and low-frequency edges, at various orientations and scales. Some examples are shown in Figure 4. Figure 4: Example atoms from the dictionary Φ∈R400×1000 used for reconstruction. The dictionary contains elements of size 20×20, learned from 10 million image patches from the person and hill categories of ImageNet .

We reconstruct several natural images and measure the quality of the sparse reconstruction as a function of time. At time , we measure the reconstruction quality of patch as follows. First, we perform hard-thresholding on the algorithm’s current guess , setting the smallest absolute values to zero, yielding the -sparse vector . We then calculate the reconstruction

 yt=Φ[xt]s

and measure the SNR of (the true image patch) to . Thus we are measuring how well, at time , the algorithm can create a sparse reconstruction of . Note that algorithms returning a solution that is sparser than required will not be affected by the hard-thresholding step.

We reconstruct a image of a dog, seen in Figure 5, using the dictionary from Figure 4. We measure results for both and , as well as and 444 We measure time in seconds per patch. Thus corresponds to approximately 10 seconds for the entire image, and to 20 seconds for the entire images. The astute reader will notice that when has dimension , it takes longer than 0.1 seconds to compute . This can be seen in Figures 1 and 2, where it takes almost 0.2 seconds for DM to finish calculating and begin searching for . However, because we only need to calculate the pseudo-inverse once for the entire image, this start-up cost is amortized over all patches and becomes negligible.. The results in Table 1 show that DM consistently achieves the highest SNR. Furthermore, while increasing and tend to improve each algorithm’s reconstruction, the relative quality between algorithms stays fairly consistent.

The highest quality reconstructions, achieved with and , are shown in Figure 5 (top row). While some algorithms fail to reconstruct details in the animal’s fur and the grass, many algorithms reconstruct the image well enough to make it difficult to find errors by mere visual inspection. We show the difference between the reconstructions and the original image (Figure 5, bottom row), where a neutral gray color in the difference image corresponds to a perfect reconstruction of that pixel; white and black are scaled to a difference of 0.3 and -0.3, respectively (the original image was scaled to the interval [0,1]). Figure 5: Reconstructing a natural image. The Difference Map outperforms the other algorithms when reconstructing a 320×240 image of a dog (top). Difference images (bottom) show the difference between the reconstruction and the original image, which ranges from -0.3 (black) to 0.3 (white) – original grayscale values are between 0 (black) and 1 (white). Results for s=200 are shown, and the time limit was 0.1 seconds per 20×20 patch.

The advantage of DM over other algorithms, when sparsely reconstructing images, can be seen with a large variety of images. In Figure 6, we see that DM consistently achieves the best reconstruction. All original images are available online. Figure 6: The Difference Map regularly outperforms other algorithms in finding sparse reconstructions of a variety of images. We measure the SNR in decibels (right column) between the reconstruction and the original image (left column). Images are scaled to 240×320 pixels (320×240 for horizontal images). Reconstructions have sparsity s=200, and are completed in 20 seconds per image (approximately 0.1 seconds per 20×20 patch). The dictionary Φ is the same as in Figure 4.

## 5 Conclusions

We have presented the Difference Map, a method of finding a point at the intersection of two constraint sets, and we have introduced an implementation of DM for compressed sensing and sparse coding. The constraint-set formulation is a natural fit for sparse recovery problems, in which we have two competing constraints for : to be consistent with the data , and to be sparse.

When the solution is very sparse and the observation is not too noisy, DM takes more time in finding the same solution as competing algorithms. However, when the solution is less sparse and when the observation is noisy, DM outperforms state of the art sparse recovery algorithms. The noisy, less sparse setting corresponds well to reconstructing natural images, which can often require a large number of components in order to accurately reconstruct. Experiments show that DM performs favorably in reconstructing a variety of images, with a variety of parameter settings.

Parameter tuning can present a laborious hurdle to the researcher. DM requires tuning only a single parameter . For all experiments in this paper (natural image reconstruction for various images; reconstruction with random matrix dictionaries of various sizes, with varying amounts of sparsity and noise), we found DM to work almost equally as well for all . The robustness of DM under such a wide variety of parameter values and problems makes DM a very competitive choice for compressed sensing.

The robustness of DM comes from how it combines two simple projections into a single iterative procedure. The Alternating Map (AM) combines the same projections in a simple alternating fashion, and struggles in almost all experiments. The gap in performance between these two methods demonstrates the power of combining multiple constraints in a more perspicacious way.

Finally, we recall that performance in all experiments was measured as a function of time, which would seem to put DM at a natural disadvantage to other algorithms: DM requires the pseudo-inverse of the dictionary, computing which requires more time than any other algorithm’s pre-computation. Despite this, DM consistently outperforms other algorithms.

## References

•  SparseLab Matlab Software. http://sparselab.stanford.edu.
•  T. Blumensath. Accelerated iterative hard thresholding. Signal Processing, 92(3):752 – 756, 2012.
•  T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
•  T. Blumensath and M. E. Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE Journal of Selected Topics in Signal Processing, pages 298–309, 2010.
•  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers.

Foundations and Trends in Machine Learning

, 3(1):1–122, 2011.
•  R. Chartrand. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Transactions on Signal Processing, 60:5810–5819, 2012.
•  R. Chartrand and B. Wohlberg. A nonconvex ADMM algorithm for group sparsity with sparse groups. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
•  R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3869–3872, 2008.
•  W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.
•  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In

IEEE Conference on Computer Vision and Pattern Recognition

, 2009.
•  D. L. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck. Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Transactions on Information Theory, 58(2):1094–1121, 2012.
•  B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
•  V. Elser, I. Rankenburg, and P. Thibault. Searching with iterated maps. Proceedings of the National Academy of Sciences, 104(2):418–423, 2007.
•  K. Engan, S. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2443–2446, 1999.
•  B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
•  K. Qiu and A. Dogandžić. Double overrelaxation thresholding methods for sparse signal reconstruction. In Information Sciences and Systems, pages 1–6. IEEE, 2010.
•  K. Qiu and A. Dogandžić. Nonnegative signal reconstruction from compressive samples via a difference map ECME algorithm. In IEEE Workshop on Statistical Signal Processing, pages 561–564, 2011.
•  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
•  J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.