    # Average performance of Orthogonal Matching Pursuit (OMP) for sparse approximation

We present a theoretical analysis of the average performance of OMP for sparse approximation. For signals, that are generated from a dictionary with K atoms and coherence μ and coefficients corresponding to a geometric sequence with parameter α, we show that OMP is successful with high probability as long as the sparsity level S scales as Sμ^2 K ≲ 1-α . This improves by an order of magnitude over worst case results and shows that OMP and its famous competitor Basis Pursuit outperform each other depending on the setting.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In sparse approximation the goal is to approximate a given signal by a linear combination of a small number of elements , called atoms, out of a given larger set, such as basis or a frame, called the dictionary. Storing the normalised atoms as columns in the dictionary matrix and denoting the restriction to the columns indexed by a set by we can write informally,

 findy≈∑k∈Iφkxk=ΦIxIs.t.|I|=S≪d (1)

Finding the smallest error for a given sparsity level and the corresponding support set , which determines via , where is the Moore-Penrose pseudo inverse, becomes an NP-hard problem in general unless the dictionary is an orthonormal system. In this case thresholding, meaning choosing as the indices of the atoms having the -largest inner products with the signal in magnitude, will succeed. For all other cases, one had to find algorithms which are more efficient, if less optimal than an exhaustive search through all possible supports sets with subsequent projection . The two most investigated directions are greedy methods and convex relaxation techniques - the two golden classics being Orthogonal Matching Pursuit (OMP), , and Basis Pursuit (BP), , respectively.
OMP finds the support iteratively, adding the index of the atom which has the largest absolute inner product with the residual and updating the residual. So initialising , it

until a stopping criterion is met, such as reaching the desired number of iterations or size of the residual/inner product is sufficiently small.
The Basis Pursuit principle, on the other hand, prescribes finding the minimiser of the convex programme

 ^x=argminx:y=Φx∥x∥1, (2)

and choosing as the index set of the -largest entries of in magnitude. The interesting question concerning both schemes is when they are successful, meaning assuming that the signal is known to be S-sparse, meaning with , when can they recover the support . It was first studied in [15, 7] and for dictionaries with coherence a sufficient condition for both schemes to succeed is that , which is relaxed in comparison to the sufficient condition for thresholding , but still quite restrictive, especially considering the much better performance in practice. This led to the investigation of the average performance when modelling signals as

 y=∑kσkckφp(k), (3)

where is a Rademacher sequence, the coefficient sequence is non-increasing, , and for and is some permutation such that the support satisfies , where for a matrix the transpose is denoted by .
It was shown that BP recovers the true support except with probability as long as ,111The theorem actually considers Steinhaus instead of Rademacher sequences. However, the proof for Rademacher sequences is exactly the same; simply use Hoeffding’s inequality instead of the complex Bernstein inequality. Also beware the buggy support condition in model M1., and that thresholding succeeds except with probability as long as , 222The improved constant presented here is due to the fact that also for Rademacher sequences .. The fact that for OMP a similar result could only be found in a multi-signal scenario, , started to give OMP the reputation of being weaker than BP.
This was further increased by the advent of Compressed Sensing (CS), , which can be seen as sparse approximation with design freedom for the dictionary. While for BP-type schemes in combination with randomly chosen dictionaries strong results appeared very early, [2, 1], comparable results for OMP and its variants took longer to develop and are weaker in general, [8, 12]. Still, thanks to its computational advantages and flexibility, e.g. concerning the stopping criteria, OMP remained popular in signal processing - the only difference being that users had a defensive statement a la ’of course BP will perform even better’ ready at all times.
Contribution: Here we will provide the long missing analysis of the average performance of OMP and show that on average neither BP nor OMP are stronger, but confirm folklore wisdom, that OMP works better for signals with decaying coefficients while BP is better for equal sized coefficients. The idea that the performance of OMP improves for decaying coefficients has already been used in  and the simplified result states that if the sorted absolute coefficients form a geometric sequence with decay , then OMP is guaranteed to succeed for all sparsity levels with . Replacing certainty with high probability we will relax this bound by an order of magnitude to .
Organisation: We will first address full support recovery in the noiseless case and then extend this result to partial support recovery in the noisy case. In Section 4 we will conduct two experiments showing that our theoretical results accurately predict the average performance, before finally discussing our results and future work.

## 2 Noiseless Case

We start with the simple case of signals following the model in (3). Note that from  we know that for a randomly chosen subset (permutation ) the condition is satisfied with high probability as long as .

Assume that the signals follow the model in (3) and that for the coefficients satisfy for . Then, except with probability , OMP will recover the full support as long as

 ⎛⎝t⌈Sλ⌉+√mtSlogKλ+1⎞⎠Sμ2≤120. (4)

Before presenting the proof, we want to provide some background information on the ideas used for proving success of OMP and the difficulties associated with an average case analysis. A necessary and sufficient condition for a step of OMP to succeed is that for the current (correct) sub-support we have

 maxk∈I/J|⟨φi,rJ⟩|>maxk∉I|⟨φk,rJ⟩|. (5)

Thus a sufficient condition for OMP to fully recover the support is, that for all possible sub-supports the missing atom which has the largest coefficient satisfies

 |⟨φiJ,rJ⟩|>maxk∉I|⟨φk,rJ⟩|. (6)

If the coefficients have random signs then for all the inner products should concentrate around their expectation,

 |⟨φi,rJ⟩|2⇝∑k∈I/J|⟨φi,[Id−P(ΦJ)]φk⟩|2c2k⇝c2i±μ2∥cI/J∥22, (7)

so a condition of the form should ensure success with high probability. The problem is that there are sub-supports for which we need to have this concentration. So taking a union bound for the probability of not enough concentration over all sub-supports, we get back the worst case condition but with a non-zero failure probability. The immediate conclusion is that in order to get a useful average case result, we have to reduce the number of intermediate supports that we need to control. For equally sized coefficients this is impossible, since the random signs determine the order of the absolute inner products. However, if the coefficients exhibit some decay, there is a natural order and it is more likely that atoms with large coefficients are picked first. For instance, with sufficient decay it might happen that the atom with the second largest coefficient is picked before that with the largest, but very unlikely that the atom with the smallest coefficient is picked first. The idea of the proof is that OMP will only pick ’sensible’ sub-supports, so we only need to ensure concentration for a much smaller number of them. The amount of concentration needed can then be further reduced by a pooling ’sensible’ supports of the same type and combining probabilistic and deterministic bounds.
We use the following short hands as well as for the residual based on an index set and for the signed coefficients, . In order to better understand the various bounds of terms involving , we recommend a quick familiarisation with Lemma 6.2, . Further we will assume w.l.o.g., that is, by reordering the dictionary matrix, that . We now define the following sets for and , to be specified later,

 Ai ={1…i−1},(A1=∅), (8) Mi ={i+1,…,i+T−1}∩S, (9) Zi ={i+T,…,K}∪Sc. (10)

We call a sub-support admissible if there exist and such that . We write .
A sufficient condition for OMP to succeed is that it only picks admissible sub-supports. Assuming is admissible, OMP picks another admissible support if (suff. cond.)

 |⟨φi,rJ⟩|>maxk∈Zi|⟨φk,rJ⟩|. (11)

Since is admissible the residual has the form

 rJ=Q(ΦJ)y=Q(ΦJ)(xiφi+ΦBcxBc+ΦZixZi), (12)

and we have for , the index of the largest missing coefficient,

 ⟨φi,rJ⟩=xi∥Q(ΦJ)φi∥22+⟨φi,Q(ΦJ)ΦBcxBc⟩+⟨φi,Q(ΦJ)ΦZixZi⟩. (13)

Using the identities and , to further split the last term above into,

 ⟨φi,Q(ΦJ)ΦZixZi⟩ =⟨φi,Q(ΦJ)Q(ΦAi)ΦZixZi⟩+⟨φi,Q(ΦJ)P(ΦAi)ΦZixZi⟩ =⟨φi,Q(ΦAi)ΦZixZi⟩−⟨φi,P(ΦB)Q(ΦAi)ΦZixZi⟩,

 ⟨φi,rJ⟩ =xi∥Q(ΦJ)φi∥22+⟨φi,Q(ΦJ)ΦBcxBc⟩ −⟨φi,P(ΦB)Q(ΦAi)ΦZixZi⟩+⟨φi,Q(ΦAi)ΦZixZi⟩. (14)

Similarly we have for all

 ⟨φk,rJ⟩ =xk(1−∥P(ΦJ)φk∥22)+xi⟨φk,[Id−P(ΦJ)]φi⟩−⟨φk,[Id−P(ΦJ)]ΦBcxBc⟩ +⟨φk,[Id−P(ΦAi)]ΦZkixZki⟩−⟨φk,P(ΦB)[Id−P(ΦAi)]ΦZkixZki⟩,

where we have used the shorthand .
We first bound the terms involving . So for all and , respectively, we have

 ∥Q(ΦJ)φk∥22=1−∥P(ΦJ)φk∥22≥1−|J|μ21−δJ (15) and|⟨φk,Q(ΦJ)φi⟩|≤μ+|J|μ21−δJ. (16)

Next we bound the terms involving . For we have

 |⟨φk,Q(ΦJ)ΦBcxBc⟩| ≤|⟨φk,ΦBcxBc⟩|+|⟨φk,P(ΦJ)ΦBcxBc⟩| ≤∥xBc∥1(μ+∥Φ⋆BcP(ΦJ)φk∥∞) ≤∥xBc∥1(μ+|J|μ21−δJ), (17)

as well as

 |⟨φk,P(ΦB)Q(ΦAi)ΦZkixZki⟩| ≤∥P(ΦB)φk∥2∥Q(ΦAi)ΦZkixZki∥2 ≤∥P(ΦB)φk∥2∥ΦZkixZki∥2 ≤μ√|B|√1−δB√1+δZi∥xZki∥2. (18)

We bound the remaining terms with high probability using Hoeffding’s inequality. For we have

 P(|⟨φk,Q(ΦAi)ΦZkixZki⟩|≥θk) =P(∣∣∑j∈Zki⟨φk,Q(ΦAi)φj⟩cjσj∣∣>θk) ≤2exp⎛⎜⎝−θ2k2∑j∈Zki|⟨φk,Q(ΦAi)φj⟩|2c2j⎞⎟⎠.

To bound we use that

 |⟨Q(ΦAi)φk,φj⟩|≤μ+|Ai|μ21−δAi. (19)

Setting , we get

 P(|⟨φk,Q(ΦAi)ΦZkixZki⟩|≥~θk)≤2e−2θ2k. (20)

Combining all our estimates we get that except with probability

we have for all

 |⟨φi,rJ⟩|−|⟨φk,rJ⟩|≥ci (1−μ−2|J|μ21−δJ)−ck−2∥cBc∥1(μ+|J|μ21−δJ) −2∥cZi∥2μ√|B|√1−δB√1+δI−2(θk+θi)∥cZi∥2(μ+|Ai|μ21−δAi).

We now determine the sets and by choosing to get the following bounds. For all we have

 ck≤ci(1−λS)T/t≤ci(1−λS)S/λ≤cie−1, (21)

as well as and

 ∥cZi∥2≤ci(1−λS)S/λ(t1−α2)1/2≤cie−1(tSλ)1/2.

Using and setting , for all , we get except with probability for all ,

 c−1i(|⟨φi,rJ⟩|−|⟨φk,rJ⟩|) ≥1−μ−4|J|μ2−e−1−2t⌈Sλ⌉(μ+2|J|μ2) >0.6−4(t⌈Sλ⌉+√mtSlogKλ+1)(μ+Sμ2). (22)

In case the deterministic analysis holds and the theorem is trivially true. If conversely , then and so (4) implies that the expression above is larger than zero, which further implies that OMP will pick another admissible sub-support. Taking a union bound over all possible sets we get that OMP will succeed with probability at least as long as (4) holds. First note that the theorem only improves over the worst case analysis when . However, if conversely , then the ratio between largest and smallest coefficient is of the order , so thresholding should still have a good success probability.
To get a better feeling for the quality of the theorem we next specialise it to the case , where the coefficients form a sub-geometric sequence with parameter , meaning . In this case the theorem essentially says that OMP will recover the support except with probability as long as and . Comparing this to the condition for BP, , for failure probability , we see that OMP has the advantage that the admissible sparsity level has a milder dependence on the dictionary size and success probability while BP has the advantage of being independent of the coefficient decay. This means that each algorithm can outperform the other depending on the setting. Before confirming this in the numerical simulation in Section 4 we first have a look at the performance of OMP in a noisy setting.

## 3 Noisy Case

We next study partial support recovery, when the sparse signals are contaminated with noise and are modelled as

 ~y=y+η=∑kσkckφp(k)+η, (23)

with as in the previous section and

a sub-Gaussian noise vector with parameter

. This means that and that for all unit vectors and the marginals satisfy .
For Gaussian noise the parameter

corresponds to the standard deviation and so for normalised coefficient sequences

the signal to noise ratio is . Similar bounds also hold in the general case, . In the noisy setting we clearly cannot recover coefficients below the noise level so with more decay there will be a trade off between allowing to recover more atoms and decreasing the coefficients faster. Assume that the signals follow the model in (23) and that for the coefficients satisfy , . Then OMP will recover an atom from the support in the first steps, except with probability , as long as,

 ⎛⎝t⌈Sλ⌉+√mtSlogKλ+1⎞⎠(μ+sμ2)≤110 (24) andcs≥10ρ√mlogK(1+2μt⌈Sλ⌉). (25)

We use the same approach as before, assuming w.l.o.g. , but take into account the new expression for the residuals

 ~rJ=Q(ΦJ)~y=Q(ΦJ)(y+η)=rJ+Q(ΦJ)η. (26)

and inner products . If is an admissible sub-support, for , then a sufficient condition for OMP to pick another admissible support in the noisy case is that for all we have

 |⟨φi,rJ⟩|−|⟨φk,rJ⟩|>|⟨φi,Q(ΦJ)η⟩|+|⟨φk,Q(ΦJ)η⟩|.

This means that we need to bound . Using the decomposition we can rewrite

 ⟨φk,Q(ΦJ)η⟩ =⟨φk,Q(ΦJ)[P(ΦAi)+Q(ΦAi)]η⟩ =⟨φk,[Id−P(ΦJ)]Q(ΦAi)η⟩ =⟨φk,Q(ΦAi)η⟩−⟨φk,P(ΦB)Q(ΦAi)η⟩ =⟨φk,Q(ΦAi)η⟩−⟨(Φ⋆BΦB)−1Φ⋆Bφk,Φ⋆BQ(ΦAi)η⟩. (27)

To bound with high probability we use the sub-Gaussian property of for the marginals , with . For this leads to

 P(∃i≤s,k:|⟨Q(ΦAi)φk,η⟩|≥θη) ≤2sKexp(−θ2η2ρ2),

where we have used that . We substitute this bound into (27) and use the Cauchy-Schwarz inequality as well as to get, except with probability ,

 |⟨φk,Q(ΦJ)η⟩| ≤θη+∥Φ†Bφk∥2⋅∥Φ⋆BQ(ΦAi)η∥2 ≤θη+μ√|B|1−δB⋅∥Φ⋆BQ(ΦAi)η∥∞⋅√|B| ≤θη(1+μ|B|1−δB). (28)

Using the intermediate expression in (22) as well as (28) with and setting we get, except with probability , for all with and all that

 c−1i|⟨φi,~rJ⟩|−|⟨φk,~rJ⟩| >0.6−4(t⌈Sλ⌉+√mtSlogKλ+1)⋅(μ+μ2|J|) −2ρcs√mlogK(1+2μt⌈Sλ⌉). (29)

To get to the final statement observe that the conditions in (24)/(25) imply that (29) is larger than zero, which in turn implies recovery of a correct atom in the first steps.

## 4 Numerical Simulations Figure 1: Percentage of correctly recovered supports for noiseless signals with various sparsity and coefficient decay parameters via BP (a,d), OMP (b,e) and thresholding (c,f) in the Dirac-DCT dictionary (a,b,c) and the Dirac-DCT-random dictionary (d,e,f). Figure 2: Percentage of correctly recovered atoms before recovery of first wrong atom via OMP for signals with various sparsity levels and coefficient decay parameter contaminated with no noise (a,d) or Gaussian noise corresponding to SNR=256 (b,e) and SNR=16 (c,f) in the Dirac-DCT dictionary (a,b,c) and the Dirac-DCT-random dictionary (d,e,f), as well as the percentage of correctly recoverable atoms for SNR=256 and SNR=16 (g,h).

In the first experiment we draw permutations and sign sequences and for each pair count how often BP, OMP and thresholding can recover the full support from the corresponding signals. From the results in Figure 1 we can see that the success region of OMP is indeed a union of two areas, one derived from the worst case analysis, , and one from the average case analysis . In particular, we can see the linear dependence of the breakdown sparsity level on the parameter . We can also see that the success of BP is not influenced by the coefficient decay and that, as indicated by theory, neither BP nor OMP is better in general but that each of them is better in a certain region. Finally, observe that the price you have to pay for the computational lightness of thresholding is the very limited range of parameters, where it is performing well.
In the second experiment we additionally draw noise-vectors to create the signals. For each signal we count how many atoms OMP identifies correctly before recovering the first incorrect atom. Figure 2 shows the average over all realisations divided by the correct sparsity level for both dictionaries and three noise levels, as well as the relative number of recoverable atoms for the two non-zero noise levels (g, h), meaning the number of coefficients above the noise level . Comparing to the success rates in the noiseless case, we can clearly see the overlay of the two effects; for small coefficient decay we recover as many atoms as in the noiseless case, while for large decay we recover all atoms with coefficients above the noise level.

## 5 Discussion

We have shown that OMP is successful with high probability if the coefficients exhibit decay and in such settings can even outperform BP. In particular, for geometric sequences with parameter the admissible sparsity level scales as . Our next goal is to extend the results to OMP using a perturbed dictionary, which is a necessary step to help tackle dictionary learning algorithms like K-SVD theoretically. We are also interested in deriving average case results for other algorithms such as stagewise OMP, , which picks more than one atom in each round, or Hard Thresholding Pursuit, , an iterative thresholding scheme. Both these algorithms can be computationally more efficient due to using less iterations and a theoretical analysis might allow the design of hybrids that automatically adapt to the decay, retain the computational efficiency and allow for multiple stopping criteria.

This work was supported by the Austrian Science Fund (FWF) under Grant no. Y760. The computational results presented have been achieved (in part) using the HPC infrastructure LEO of the University of Innsbruck. Finally, many, many, many thanks go to M.C. Pali for proof-reading the manuscript.

## References

• Candès et al.  E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
• Candès and Tao  Emmanuel J. Candès and Terence Tao. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
• Donoho and Elad  D. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via minimization. Proc. Nat. Aca. Sci.,, 100(5):2197–2202, March 2003.
• Donoho  D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
• Donoho et al.  D.L. Donoho, Y. Tsaig, I. Drori, and J.L. Starck. Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Transactions on Information Theory, 58(2):1094–1121, 2012.
• Foucart  S. Foucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
• Fuchs  J. J. Fuchs. Extension of the pisarenko method to sparse linear arrays. IEEE Transactions on Signal Processing, 45(2413-2421), October 1997.
• Gilbert and Tropp  A.C. Gilbert and J.A. Tropp. Signal recovery from random measurements via Orthogonal Matching Pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, December 2007.
• Gribonval et al.  R. Gribonval, H. Rauhut, K. Schnass, and P. Vandergheynst. Atoms of all channels, unite! Average case analysis of multi-channel sparse recovery using greedy algorithms. Journal of Fourier Analysis and Applications, 14(5):655–687, 2008.
• Herzet et al.  C. Herzet, A. Drémeau, and C. Soussen. Relaxed recovery conditions for OMP/OLS by exploiting both coherence and decay. IEEE Transactions on Information Theory, 62(1):459 – 470, 2016.
• Hsu et al.  D. Hsu, S.M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability (arXiv:1110.2842), 17(14), 2012.
• Needell and Vershynin [DOI: 10.1007/s10208-008-9031-3] D. Needell and R. Vershynin. Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit. Foundations of Computational Mathematics, DOI: 10.1007/s10208-008-9031-3.
• Pati et al.  Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal Matching Pursuit: recursive function approximation with application to wavelet decomposition. In Asilomar Conf. on Signals Systems and Comput., 1993.
• Schnass and Vandergheynst  K. Schnass and P. Vandergheynst. Average performance analysis for thresholding. IEEE Signal Processing Letters, 14(11):828–831, 2007.
• Tropp  J.A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, October 2004.
• Tropp  J.A. Tropp. On the conditioning of random subdictionaries. Applied and Computational Harmonic Analysis, 25(1-24), 2008.