# Finite Sample Analysis of Approximate Message Passing Algorithms

Approximate message passing (AMP) refers to a class of efficient algorithms for statistical estimation in high-dimensional problems such as compressed sensing and low-rank matrix estimation. This paper analyzes the performance of AMP in the regime where the problem dimension is large but finite. For concreteness, we consider the setting of high-dimensional regression, where the goal is to estimate a high-dimensional vector β_0 from a noisy measurement y=A β_0 + w. AMP is a low-complexity, scalable algorithm for this problem. Under suitable assumptions on the measurement matrix A, AMP has the attractive feature that its performance can be accurately characterized in the large system limit by a simple scalar iteration called state evolution. Previous proofs of the validity of state evolution have all been asymptotic convergence results. In this paper, we derive a concentration inequality for AMP with i.i.d. Gaussian measurement matrices with finite size n × N. The result shows that the probability of deviation from the state evolution prediction falls exponentially in n. This provides theoretical support for empirical findings that have demonstrated excellent agreement of AMP performance with state evolution predictions for moderately large dimensions. The concentration inequality also indicates that the number of AMP iterations t can grow no faster than order n/ n for the performance to be close to the state evolution predictions with high probability. The analysis can be extended to obtain similar non-asymptotic results for AMP in other settings such as low-rank matrix estimation.

• 21 publications
• 22 publications
11/06/2017

### Estimation of Low-Rank Matrices via Approximate Message Passing

Consider the problem of estimating a low-rank symmetric matrix when its ...
09/24/2021

### Graph-based Approximate Message Passing Iterations

Approximate-message passing (AMP) algorithms have become an important el...
03/25/2020

### Rigorous State Evolution Analysis for Approximate Message Passing with Side Information

A common goal in many research areas is to reconstruct an unknown signal...
01/13/2022

### Statistically Optimal First Order Algorithms: A Proof via Orthogonalization

We consider a class of statistical estimation problems in which we are g...
01/11/2014

### Multi Terminal Probabilistic Compressed Sensing

In this paper, the `Approximate Message Passing' (AMP) algorithm, initia...
02/01/2019

### An Analysis of State Evolution for Approximate Message Passing with Side Information

A common goal in many research areas is to reconstruct an unknown signal...
02/11/2018

### TARM: A Turbo-type Algorithm for Affine Rank Minimization

The affine rank minimization (ARM) problem arises in many real-world app...

## I Introduction

Consider the high-dimensional regression problem, where the goal is to estimate a vector from a noisy measurement given by

 y=Aβ0+w. (1.1)

Here is a known real-valued measurement matrix, and is the measurement noise. The sampling ratio is denoted by .

Approximate Message Passing (AMP) [1, 2, 3, 4, 5, 6] is a class of low-complexity, scalable algorithms to solve the above problem, under suitable assumptions on and . AMP algorithms are derived as Gaussian or quadratic approximations of loopy belief propagation algorithms (e.g., min-sum, sum-product) on the dense factor graph corresponding to (1.1).

Given the observed vector , AMP generates successive estimates of the unknown vector, denoted by for . Set , the all-zeros vector. For , AMP computes

 zt =y−Aβt+zt−1nN∑i=1η′t−1([A∗zt−1+βt−1]i), (1.2) βt+1 =ηt(A∗zt+βt), (1.3)

for an appropriately-chosen sequence of functions . In (1.2) and (1.3), denotes the transpose of , acts component-wise when applied to a vector, and denotes its (weak) derivative. Quantities with a negative index are set to zero throughout the paper. For a demonstration of how the AMP updates (1.2) and (1.3) are derived from a min-sum-like message passing algorithm, we refer the reader to [1].

For a Gaussian measurement matrix with entries that are i.i.d. , it was rigorously proven [1, 7] that the performance of AMP can be characterized in the large system limit via a simple scalar iteration called state evolution. This result was extended to the class of matrices with i.i.d. sub-Gaussian entries in [8]. In particular, these results imply that performance measures such as the -error and the -error converge almost surely to constants that can be computed via the distribution of . (The large system limit is defined as such that , a constant.)

AMP has also been applied to a variety of other high-dimensional estimation problems. Some examples are low-rank matrix estimation [9, 10, 11, 12, 13, 14], decoding of sparse superposition codes [15, 16, 17], matrix factorization [18], and estimation in generalized linear and bilinear models [5, 19, 20].

Main Contributions: In this paper, we obtain a non-asymptotic result for the performance of the AMP iteration in (1.2)–(1.3), when the measurement matrix has i.i.d. Gaussian entries . We derive a concentration inequality (Theorem 1) that implies that the probability of -deviation between various performance measures (such as ) and their limiting constant values fall exponentially in . Our result provides theoretical support for empirical findings that have demonstrated excellent agreement of AMP performance with state evolution predictions for moderately large dimensions, e.g., of the order of several hundreds [2].

In addition to refining earlier asymptotic results, the concentration inequality in Theorem 1 also clarifies the effect of the iteration number versus the problem dimension . One implication is that the actual AMP performance is close to the state evolution prediction with high probability as long as is of order smaller than . This is particularly relevant for settings where the number of AMP iterations and the problem dimension are both large, e.g., solving the LASSO via AMP [6].

We prove the concentration result in Theorem 1 by analyzing the following general recursion:

 bt=Aft(ht,β0)−λtgt−1(bt−1,w),ht+1=A∗gt(bt,w)−ξtft(ht,β0). (1.4)

Here, for , the vectors , describe the state of the algorithm, are Lipschitz functions that are separable (act component-wise when applied to vectors), and are scalars that can be computed from the state of the algorithm. The algorithm is initialized with . Further details on the recursion in (1.4), including how the AMP in (1.2)–(1.3) can be obtained as a special case, are given in Section IV-A.

For ease of exposition, our analysis will focus on the recursion (1.4) and the problem of high-dimensional regression. However, it can be extended to a number of related problems. A symmetric version of the above recursion yields AMP algorithms for problems such as solving the TAP equations in statistical physics [21] and symmetric low-rank matrix estimation [10, 12]. This recursion is defined in terms of a symmetric matrix with entries i.i.d. , and i.i.d. for . (In other words, can be generated as , where has i.i.d. entries.) Then, for , let

 mt+1=Apt(mt)−btpt−1(mt−1). (1.5)

Here, for , the state of the algorithm is represented by a single vector , the function is Lipschitz and separable, and is a constant computed from the state of the algorithm (see [1, Sec. IV] for details). The recursion (1.5) is initialized with a deterministic vector .

Our analysis of the recursion (1.4) can be easily extended to obtain an analogous non-asymptotic result for the symmetric recursion in (1.5). Therefore, for problems of estimating either symmetric or rectangular low-rank matrices in Gaussian noise, our analysis can be used to refine existing asymptotic AMP guarantees (such as those in [9, 10, 11]), by providing a concentration result similar to that in Theorem 1. We also expect that the non-asymptotic analysis can be generalized to the case where the recursion in (1.4) generates matrices rather than vectors, i.e, and (where remains fixed as grow large; see [7] for details). Extending the analysis to this matrix recursion would yield non-asymptotic guarantees for the generalized AMP [5] and AMP for compressed sensing with spatially coupled measurement matrices [22].

Since the publication of the conference version of this paper, the analysis described here has been used in a couple of recent papers: an error exponent for sparse regression codes with AMP decoding was obtained in [23], and a non-asymptotic result for AMP with non-separable denoisers was given in [24].

### I-a Assumptions

Before proceeding, we state the assumptions on the model (1.1) and the functions used to define the AMP. In what follows, are generic positive constants whose values are not exactly specified but do not depend on . We use the notation to denote the set .

Measurement Matrix: The entries of measurement matrix are i.i.d. .

Signal: The entries of the signal

are i.i.d. according to a sub-Gaussian distribution

. We recall that a zero-mean random variable

is sub-Gaussian if there exist positive constants such that , [25].

Measurement Noise: The entries of the measurement noise vector are i.i.d. according to some sub-Gaussian distribution with mean and for . The sub-Gaussian assumption implies that, for ,

 P(∣∣∣1n∥w∥2−σ2∣∣∣≥ϵ)≤Ke−κnϵ2, (1.6)

for some constants [25].

The Functions : The denoising functions, , in (1.3) are Lipschitz continuous for each , and are therefore weakly differentiable. The weak derivative, denoted by , is assumed to be differentiable, except possibly at a finite number of points, with bounded derivative everywhere it exists. Allowing to be non-differentiable at a finite number of points covers denoising functions like soft-thresholding which is used in applications such as the LASSO [6].

Functions defined with scalar inputs are assumed to act component-wise when applied to vectors.

The remainder of the paper is organized as follows. In Section II we review state evolution, the formalism predicting the performance of AMP, and discuss how knowledge of the signal distribution and the noise distribution can help choose good denoising functions . However, we emphasize that our result holds for the AMP with any choice of satisfying the above condition, even those that do not depend on and . In Section II-A, we introduce a stopping criterion for termination of the AMP. In Section III, we give our main result (Theorem 1) which proves that the performance of AMP can be characterized accurately via state evolution for large but finite sample size . Section IV gives the proof of Theorem 1. The proof is based on two technical lemmas: Lemmas 3 and 5. The proof of Lemma 5 is long; we therefore give a brief summary of the main ideas in Section IV-F and then the full proof in Section V. In the appendices, we list a number of concentration inequalities that are used in the proof of Lemma 5. Some of these, such as the concentration inequality for the sum of pseudo-Lipschitz functions of i.i.d. sub-Gaussian random variables (Lemma B.4), may be of independent interest.

## Ii State Evolution and the Choice of ηt

In this section, we briefly describe state evolution, the formalism that predicts the behavior of AMP in the large system limit. We only review the main points followed by a few examples; a more detailed treatment can be found in [1, 4].

Given , let Let , where . Iteratively define the quantities and as

 τ2t=σ2+σ2t,σ2t =1δE[(ηt−1(β+τt−1Z)−β)2], (2.1)

where and are independent random variables.

The AMP update (1.3) is underpinned by the following key property of the vector : for large , is approximately distributed as , where is an i.i.d.  random vector independent of . In light of this property, a natural way to generate from the “effective observation” is via the conditional expectation:

 βt+1(s)=E[β∣β+τtZ=s], (2.2)

i.e., is the MMSE estimate of given the noisy observation . Thus if is known, the Bayes optimal choice for is the conditional expectation in (2.2).

In the definition of the “modified residual” , the third term on the RHS of (1.2) is crucial to ensure that the effective observation has the above distributional property. For intuition about the role of this ‘Onsager term’, the reader is referred to [1, Section I-C].

We review two examples to illustrate how full or partial knowledge of can guide the choice of the denoising function . In the first example, suppose we know that each element of is chosen uniformly at random from the set . Computing the conditional expectation in (2.2) with this , we obtain [1]. The constants are determined iteratively from the state evolution equations (2.1).

As a second example, consider the compressed sensing problem, where , and is such that . The parameter determines the sparsity of . For this problem, the authors in [2, 4] suggested the choice , where the soft-thresholding function is defined as

 η(s;θ)=⎧⎪⎨⎪⎩(s−θ), if s>θ,0 if −θ≤s≤θ,(s−θ), if s<−θ.

The threshold at step is set to , where is a tunable constant and is determined by (2.1

), making the threshold value proportional to the standard deviation of the noise in the effective observation. However, computing

using (2.1) requires knowledge of . In the absence of such knowledge, we can estimate by : our concentration result (Lemma 5(e)) shows that this approximation is increasingly accurate as grows large. To fix , one could run the AMP with several different values of , and choose the one that gives the smallest value of for large .

We note that in each of the above examples is Lipschitz, and its derivative satisfies the assumption stated in Section I-A.

### Ii-a Stopping Criterion

To obtain a concentration result that clearly highlights the dependence on the iteration and the dimension , we include a stopping criterion for the AMP algorithm. The intuition is that the AMP algorithm can be terminated once the expected squared error of the estimates (as predicted by state evolution equations in (2.1)) is either very small or stops improving appreciably.

For Bayes-optimal AMP where the denoising function is the conditional expectation given in (2.2), the stopping criterion is as follows. Terminate the algorithm at the first iteration for which either

 σ2t<ε0, or σ2tσ2t−1>1−ε′0, (2.3)

where and are pre-specified constants. Recall from (2.1) that is expected squared error in the estimate. Therefore, for suitably chosen values of , the AMP will terminate when the expected squared error is either small enough, or has not significantly decreased from the previous iteration.

For the general case where is not the Bayes-optimal choice, the stopping criterion is: terminate the algorithm at the first iteration for which at least one of the following is true:

 σ2t<ε1,  or  (σ⊥t)2<ε2,  or  (τ⊥t)2<ε3, (2.4)

where are pre-specified constants, and are defined in (4.19). The precise definitions of the scalars are postponed to Sec. IV-B as a few other definitions are needed first. For now, it suffices to note that are measures of how close and are to and , respectively. Indeed, for the Bayes-optimal case, we show in Sec IV-C that

 (σ⊥t)2:=σ2t(1−σ2tσ2t−1),  (τ⊥t)2:=τ2t(1−τ2tτ2t−1).

Let be the first value of for which at least one of the conditions is met. Then the algorithm is run only for . It follows that for ,

 σ2t>ε1,τ2t>σ2+ε1,(σ⊥t)2>ε2,(τ⊥t)2>ε3. (2.5)

In the rest of the paper, we will use the stopping criterion to implicitly assume that are bounded below by positive constants.

## Iii Main Result

Our result, Theorem 1, is a concentration inequality for pseudo-Lipschitz

(PL) loss functions. As defined in

[1], a function is pseudo-Lipschitz (of order ) if there exists a constant such that for all , where denotes the Euclidean norm.

###### Theorem 1.

With the assumptions listed in Section I-A, the following holds for any (order-) pseudo-Lipschitz function , and , where is the first iteration for which the stopping criterion in (2.4) is satisfied.

 P(∣∣ ∣∣1NN∑i=1ϕ(βt+1i,β0i)−E[ϕ(ηt(β+τtZ),β)]∣∣ ∣∣≥ϵ)≤Kte−κtnϵ2. (3.1)

In the expectation in (3.1), and are independent, and is given by (2.1). The constants are given by , where are universal constants (not depending on , , or ) that are not explicitly specified.

The probability in (3.1) is with respect to the product measure on the space of the measurement matrix , signal , and the noise .

Remarks:

1. By considering the pseudo-Lipschitz function , Theorem 1 proves that state evolution tracks the mean square error of the AMP estimates with exponentially small probability of error in the sample size . Indeed, for all ,

 P(∣∣∣1N∥βt+1−β0∥2−δσ2t+1∣∣∣≥ϵ)≤Kte−κtnϵ2. (3.2)

Similarly, taking the theorem implies that the normalized -error is concentrated around .

2. Asymptotic convergence results of the kind given in [1, 6] are implied by Theorem 1. Indeed, from Theorem 1, the sum

 ∞∑N=1P(∣∣1NN∑i=1ϕ(βt+1i,β0i)−E[ϕ(ηt(β+τtZ),β)]∣∣≥ϵ)

is finite for any fixed . Therefore the Borel-Cantelli lemma implies that for any fixed :

 limN→∞1NN∑i=1ϕ(βt+1i,β0i)a.s.=E[ϕ(ηt(β+τtZ),β)].

3. Theorem 1 also refines the asymptotic convergence result by specifying how large can be (compared to the dimension ) for the state evolution predictions to be meaningful. Indeed, if we require the bound in (3.1) to go to zero with growing , we need as . Using the expression for from the theorem then yields .

Thus, when the AMP is run for a growing number of iterations, the state evolution predictions are guaranteed to be valid until iteration if the problem dimension grows faster than exponentially in . Though the constants in the bound have not been optimized, we believe that the dependence of these constants on is inevitable in any induction-based proof of the result. An open question is whether this relationship between and is fundamental, or a different analysis of the AMP can yield constants which allow to grow faster with .

4. As mentioned in the introduction, we expect that non-asymptotic results similar to Theorem 1 can be obtained for other estimation problems (with Gaussian matrices) for which rigorous asymptotic results have been proven for AMP. Examples of such problems include low-rank matrix estimation [9, 10, 11], robust high-dimensional M-estimation [26], AMP with spatially coupled matrices [22], and generalized AMP [7, 27].

As our proof technique depends heavily on being i.i.d. Gaussian, extending Theorem 1 to AMP with sub-Gaussian matrices [8] and to variants of AMP with structured measurement matrices (e.g., [28, 29, 30]) is non-trivial, and an interesting direction for future work.

## Iv Proof of Theorem 1

We first lay down the notation that will be used in the proof, then state two technical lemmas (Lemmas 3 and 5) and use them to prove Theorem 1.

### Iv-a Notation and Definitions

For consistency and ease of comparison, we use notation similar to [1]. To prove the technical lemmas, we use the general recursion in (1.4), which we write in a slightly different form below. Given , , define the column vectors and for recursively as follows, starting with initial condition :

 bt:=Aqt−λtmt−1,mt:=gt(bt,w),ht+1:=A∗mt−ξtqt,qt:=ft(ht,β0). (4.1)

where the scalars and are defined as

 ξt:=1nn∑i=1g′t(bti,wi),λt:=1δNN∑i=1f′t(hti,β0i). (4.2)

In (4.2), the derivatives of and are with respect to the first argument. The functions are assumed to be Lipschitz continuous for , hence the weak derivatives and exist. Further, and are each assumed to be differentiable, except possibly at a finite number of points, with bounded derivative everywhere it exists.

Let with . We let and assume that there exist constants such that

 P(∣∣∣1n∥q0∥2−σ20∣∣∣≥ϵ)≤Ke−κnϵ2. (4.3)

Define the state evolution scalars and for the general recursion as follows.

 τ2t:=E[(gt(σtZ,W))2],σ2t:=1δE[(ft(τt−1Z,β))2], (4.4)

where , and are independent random variables. We assume that both and are strictly positive.

The AMP algorithm is a special case of the general recursion in (4.1) and (4.2). Indeed, the AMP can be recovered by defining the following vectors recursively for , starting with and .

 ht+1=β0−(A∗zt+βt),qt=βt−β0,bt=w−zt,mt=−zt. (4.5)

It can be verified that these vectors satisfy (4.1) and (4.2) with

 ft(a,β0)=ηt−1(β0−a)−β0, and gt(a,w)=a−w. (4.6)

Using this choice of in (4.4) yields the expressions for given in (2.1). Using (4.6) in (4.2), we also see that for AMP,

 λt=−1δNN∑i=1η′t−1([A∗βt−1+zt−1]i),ξt=1. (4.7)

Recall that is the vector we would like to recover and is the measurement noise. The vector is the noise in the effective observation , while is the error in the estimate . The proof will show that and are approximately i.i.d. , while

is approximately i.i.d. with zero mean and variance

.

For the analysis, we work with the general recursion given by (4.1) and (4.2). Notice from (4.1) that for all ,

 bt+λtmt−1=Aqt,ht+1+ξtqt=A∗mt. (4.8)

Thus we have the matrix equations and where

 Xt:=[h1+ξ0q0∣h2+ξ1q1∣…∣ht+ξt−1qt−1],Yt:=[b0∣b1+λ1m0∣…∣bt−1+λt−1mt−2],Mt:=[m0∣…∣mt−1],Qt:=[q0∣…∣qt−1]. (4.9)

The notation is used to denote a matrix with columns . Note that and are the all-zero vector. Additionally define the matrices

 Ht:=[h1|…|ht],Ξt:=diag(ξ0,…,ξt−1),Bt:=[b0|…|bt−1],Λt:=diag(λ0,…,λt−1). (4.10)

Note that , , , and are all-zero vectors. Using the above we see that and

We use the notation and to denote the projection of and onto the column space of and , respectively. Let

 αt:=(αt0,…,αtt−1)∗,γt:=(γt0,…,γtt−1)∗ (4.11)

be the coefficient vectors of these projections, i.e.,

 mt∥:=t−1∑i=0αtimi,qt∥:=t−1∑i=0γtiqi. (4.12)

The projections of and onto the orthogonal complements of and , respectively, are denoted by

 mt⊥:=mt−mt∥,qt⊥:=qt−qt∥. (4.13)

Lemma 5 shows that for large , the entries of and are concentrated around constants. We now specify these constants and provide some intuition about their values in the special case where the denoising function in the AMP recursion is the Bayes-optimal choice, as in (2.2).

### Iv-B Concentrating Values

Let and each be sequences of of zero-mean jointly Gaussian random variables whose covariance is defined recursively as follows. For ,

 E[˘Zr˘Zt]=~Er,tσrσt,E[~Zr~Zt]=˘Er,tτrτt, (4.14)

where

 ~Er,t:=1δE[fr(τr−1~Zr−1,β)ft(τt−1~Zt−1,β)],˘Er,t:=E[gr(σr˘Zr,W)gt(σt˘Zt,W)], (4.15)

where and are independent random variables. In the above, we take , the initial condition. Note that and , thus .

Define matrices for such that

 ~Cti+1,j+1=~Ei,j,  and  ˘Cti+1,j+1=˘Ei,j, 0≤i,j≤t−1. (4.16)

With these definitions, the concentrating values for and (if and are invertible) are

 ^γt:=(~Ct)−1~Et, and ^αt:=(˘Ct)−1˘Et, (4.17)

with

 ~Et:=(~E0,t…,~Et−1,t)∗,  and  ˘Et:=(˘E0,t…,˘Et−1,t)∗. (4.18)

Let and , and for define

 (σ⊥t)2:=σ2t−(^γt)∗~Et=~Et,t−~E∗t(~Ct)−1~Et,(τ⊥t)2:=τ2t−(^αt)∗˘Et=˘Et,t−˘E∗t(˘Ct)−1˘Et. (4.19)

Finally, we define the concentrating values for and as

 ^λt:=1δE[f′t(τt−1~Zt−1,β)],  and  ^ξt=E[g′t(σt˘Zt,W)]. (4.20)

Since and are assumed to be Lipschitz continuous, the derivatives and are bounded for . Therefore defined in (4.2) and defined in (4.20) are also bounded. For the AMP recursion, it follows from (4.6) that

 ^λt=−1δE[η′t−1(β−τt−1~Zt−1)],  and  ^ξt=1. (4.21)
###### Lemma 1.

If and are bounded below by some positive constants (say and , respectively) for , then the matrices and defined in (4.16) are invertible for .

###### Proof:

We prove the result using induction. Note that and are both strictly positive by assumption and hence invertible. Assume that for some , and are invertible. The matrix can be written as

 ~Ck+1=[M1M2M3M4],

where , , and defined in (4.18). By the block inversion formula, is invertible if and the Schur complement are both invertible. By the induction hypothesis is invertible, and

 M4−M3M−11M2=~Ek,k−~E∗k(~Ck)−1~Ek=(σ⊥k)2≥~c>0. (4.22)

Hence is invertible. Showing that is invertible is very similar. ∎

We note that the stopping criterion ensures that and are invertible for all that are relevant to Theorem 1.

### Iv-C Bayes-optimal AMP

The concentrating constants in (4.14)–(4.19) have simple representations in the special case where the denoising function is chosen to be Bayes-optimal, i.e., the conditional expectation of given the noisy observation , as in (2.2). In this case:

1. It can be shown that in (4.15) equals for . This is done in two steps. First verify that the following Markov property holds for the jointly Gaussian with covariance given by (4.14):

 E[β∣β+τt~Zt, β+τr~Zr]=E[β∣β+τt~Zt],0≤r≤t.

We then use the above in the definition of (with given by (4.6)), and apply the orthogonality principle to show that for .

2. Using in (4.14) and (4.15), we obtain .

3. From the orthogonality principle, it also follows that for ,

 E[∥βt∥2]=E[β∗βt],  %and  E[∥βr∥2]=E[(βr)∗βt],

where .

4. With and for , the quantities in (4.17)–(4.19) simplify to the following for :

 ^γt=[0,…,0,σ2t/σ2t−1],^αt=[0,…,0,τ2t/τ2t−1],(σ⊥t)2:=σ2t(1−σ