## 1 Introduction

We consider the standard linear regression problem, where the goal is to recover the vector $x_0 \in \mathbb{R}^n$ from measurements

$$y = A x_0 + w \in \mathbb{R}^m \tag{1}$$

where $A \in \mathbb{R}^{m \times n}$ is a known matrix and $w$ is an unknown disturbance. With high-dimensional random $A$, the approximate message passing (AMP) algorithm [1] remains one of the most celebrated and best understood iterative algorithms. In particular, when the entries of $A$ are drawn i.i.d. from a sub-Gaussian distribution and $m, n \to \infty$ with $m/n \to \delta \in (0, \infty)$, ensemble behaviors of AMP, such as the per-iteration mean-squared error (MSE), can be perfectly predicted using a state evolution (SE) formalism [2].^{1} Furthermore, the SE formalism shows that, in certain regimes, AMP’s MSE converges to the minimum MSE as predicted by the replica method [3, 2], which has been shown to coincide with the minimum MSE for linear regression under i.i.d. Gaussian $A$ [4, 5] as $m, n \to \infty$ with $m/n \to \delta$. More recently, it has been proven that the state evolution accurately characterizes AMP’s behavior for large but finite $m, n$ [6].

^{1} See also [3] for an earlier proof of AMP’s state evolution under i.i.d. Gaussian entries.

The rigorous SE proofs in [2, 3, 6], however, are long and complicated, and thus remain out of reach for many readers. And, although the AMP algorithm can be heuristically derived from an approximation of loopy belief propagation (LBP), as outlined in [1] and [7], the LBP perspective is lacking in several respects. First, LBP is generally suboptimal, making it surprising that a simplified approximation of LBP can be optimal. Second, the LBP derivation provides little insight into why large i.i.d. matrices are important for AMP. Third, the LBP derivation does not suggest a scalar state evolution.

In this work, we propose a heuristic derivation of AMP and its MSE state evolution that uses the simple idea of “first-order cancellation.” This derivation provides insights missing from the LBP derivation, while being much more accessible than the rigorous SE proofs.

## 2 Problem Setup

In our treatment of the linear regression problem (1), $x_0$, $w$, and $y$ are deterministic vectors and $A$ is a deterministic matrix. Importantly, however, we assume that the components $\{a_{ij}\}$ of $A$ are realizations of i.i.d. Bernoulli^{2} random variables $A_{ij} \in \pm\tfrac{1}{\sqrt{m}}$ that are drawn independently of $x_0$ and $w$. Our model for $A$ is a special case of that considered in [2].

^{2} With additional work, our derivation can be extended to i.i.d. Gaussian $A$, but doing so lengthens the derivation and provides little additional insight.

Throughout, we will focus on the following large-system limit.

###### Definition 1.

The “large-system limit” is defined as $m, n \to \infty$ with $m/n \to \delta$ for some fixed sampling ratio $\delta \in (0, \infty)$.

We will assume that the components of $x_0$, $w$, and $y$ scale as $O(1)$ in the large-system limit.

We consider a family of algorithms that, starting with $x^0 = 0$, iterates the following over iteration index $t = 0, 1, 2, \dots$:

$$v^t = y - A x^t + \mu^t \tag{2a}$$

$$x^{t+1} = \eta^t\big(x^t + A^{\mathsf{T}} v^t\big) \tag{2b}$$

where $\eta^t(\cdot)$ is a component-wise function (i.e., $[\eta^t(r)]_j = \eta^t(r_j)$ for all $j$) and $\mu^t$ is a correction term. The quantity $x^t$ is the iteration-$t$ estimate of the unknown vector $x_0$. We refer to $\eta^t$ as a “denoiser” for reasons that will become clear in the sequel. For technical reasons, we will assume that $\eta^t$ is a polynomial function of bounded degree, similar to the assumption in [2].

The classical iterative shrinkage/thresholding (IST) algorithm [8] uses no correction, i.e.,

$$\mu^t = 0 \tag{3}$$

for all iterations $t$, whereas the AMP algorithm [1] uses the “Onsager” correction

$$\mu^t = \frac{1}{m}\, v^{t-1} \sum_{j=1}^{n} {\eta^{t-1}}'\big(x_j^{t-1} + a_j^{\mathsf{T}} v^{t-1}\big) \tag{4}$$

initialized with $\mu^0 = 0$. In (4), $a_j$ denotes the $j$th column of $A$ and ${\eta^{t}}'$ refers to the derivative of $\eta^{t}$. Our goal is to analyze the effect of $\mu^t$ on the behavior of algorithm (2) in the large-system limit, and in particular to understand how and why the Onsager correction (4) is a good choice. To do this, we will analyze the errors on the quantities iterated in (2) and drop terms that vanish in the large-system limit.
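To make the role of the correction term concrete, the following is a minimal pure-Python sketch of iteration (2) with and without the Onsager term. It rests on several assumptions that are not taken from the text above: the usual AMP form of (2) and (4), a soft-thresholding denoiser (rather than the polynomial denoiser assumed in the analysis), a threshold proportional to $\sqrt{\|v^t\|^2/m}$ with a hypothetical tuning constant `alpha`, and illustrative problem sizes. The function name `run` and all parameter values are hypothetical.

```python
import math
import random

def soft(r, theta):
    """Scalar soft-thresholding denoiser (an assumed choice of eta)."""
    return math.copysign(max(abs(r) - theta, 0.0), r)

def run(A, y, x0, n_iter=15, alpha=1.5, onsager=True):
    """Iterate (2) from x = 0 and return the per-iteration MSE against x0."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    v_prev, n_active = [0.0] * m, 0   # n_active = sum_j eta'(r_j) for soft thresholding
    mse = []
    for _ in range(n_iter):
        # (2a): residual plus correction (zero for IST, Onsager term for AMP)
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        scale = n_active / m if onsager else 0.0
        v = [y[i] - Ax[i] + scale * v_prev[i] for i in range(m)]
        # assumed threshold rule: alpha times the root-mean-square of v
        theta = alpha * math.sqrt(sum(vi * vi for vi in v) / m)
        # (2b): denoise r = x + A^T v component-wise
        r = [x[j] + sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
        x = [soft(rj, theta) for rj in r]
        n_active = sum(1 for xj in x if xj != 0.0)
        v_prev = v
        mse.append(sum((x[j] - x0[j]) ** 2 for j in range(n)) / n)
    return mse

# Illustrative problem (sizes and noise level are assumptions, not the paper's values)
random.seed(0)
n, m, k = 300, 150, 30
A = [[random.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
x0 = [0.0] * n
for j in random.sample(range(n), k):
    x0[j] = random.gauss(0.0, 1.0)
y = [sum(A[i][j] * x0[j] for j in range(n)) + random.gauss(0.0, 0.1) for i in range(m)]

mse_amp = run(A, y, x0, onsager=True)    # Onsager correction
mse_ist = run(A, y, x0, onsager=False)   # no correction
```

Comparing `mse_amp` and `mse_ist` iteration by iteration gives a quick feel for how much the correction changes the trajectory. With a unit step size, the uncorrected iteration is not guaranteed to converge in this setting.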

## 3 AMP Derivation

We will now analyze the error on the input to the denoiser , i.e.,

(5) |

(6) | ||||

(7) | ||||

(8) |

Let us examine the th component of when . We have that

(9) | ||||

(10) | ||||

(11) |

since . Continuing,

(12) | ||||

(13) |

where omits the direct contribution of from and thus is only weakly dependent on . We formalize this weak dependence through Assumption 1, which is admittedly an approximation. In fact, the approximate nature of Assumption 1 is one of the main reasons that our derivation is heuristic.

###### Assumption 1.

The matrix entry is a realization of an equiprobable Bernoulli random variable , where are mutually independent and is independent of , , and .

Assumption 1 will often be used when analyzing summations, as in the following lemma.

###### Lemma 1.

Consider the quantity , where are realizations of i.i.d. random variables with zero mean and . If are drawn independently of , and scale as in the large-system limit, then also scales as .

###### Proof.

First, note that is a realization of the random variable . Furthermore, , since if and if . Clearly and are both in the large-system limit. Thus we conclude that is . Finally, since is a realization of a random variable whose second moment is , we conclude that scales as in the large-system limit. ∎

In the sequel, we will make use of the following lemma, whose proof is postponed because it is a bit long and does not provide much insight.

###### Lemma 2.

###### Proof.

See Appendix A. ∎
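The scaling claimed in Lemma 1 can be checked numerically. The sketch below assumes one concrete reading of the lemma, which is an assumption rather than a quotation: $z = \sum_{i=1}^{n} a_i u_i$ with $a_i$ equiprobable in $\pm 1/\sqrt{m}$, fixed $u_i$ of order one, and $m = \delta n$. Under that reading the second moment of $z$ should equal $\frac{1}{m}\sum_i u_i^2$, which stays bounded as $n$ grows. The helper name `second_moment_of_z` and the sizes are hypothetical.

```python
import math
import random

def second_moment_of_z(u, m, trials=2000, rng=random):
    """Monte Carlo estimate of E[Z^2] for Z = sum_i A_i u_i, A_i = +-1/sqrt(m)."""
    s = 1.0 / math.sqrt(m)
    total = 0.0
    for _ in range(trials):
        # draw a fresh equiprobable sign for every term of the sum
        z = sum(s * ui if rng.random() < 0.5 else -s * ui for ui in u)
        total += z * z
    return total / trials

random.seed(1)
delta = 0.5                      # assumed sampling ratio m/n
ratios = []
for n in (50, 100, 200):
    m = int(delta * n)
    u = [random.uniform(-1.0, 1.0) for _ in range(n)]   # O(1) entries, fixed per n
    predicted = sum(ui * ui for ui in u) / m            # (1/m) * sum_i u_i^2
    ratios.append(second_moment_of_z(u, m) / predicted)
```

Each entry of `ratios` should be close to one regardless of `n`, which is the sense in which $z$ stays $O(1)$ in the large-system limit.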

We now perform a Taylor series expansion of the term in (13) about :

(14) |

where the scaling follows from the fact that , that both and scale as via Lemma 2, and is polynomial of bounded degree, which implies that also scales as . We will ignore the term in (14) since it will vanish relative to the component in the large-system limit. Thus we have

(15) | ||||

(16) |

using .

We are now in a position to observe the principal mechanism of AMP. As we argue below (using the central limit theorem), the first and second terms in (19) behave like realizations of zero-mean Gaussians in the large-system limit, because are realizations of i.i.d. zero-mean random variables that are independent of , , and under Assumption 1. But the same cannot be said in general about the third term in (19), because is strongly coupled to . Consequently, the denoiser input-error is difficult to characterize for general .

With AMP’s choice of , however, the third term in (19) vanishes in the large-system limit. In particular, with the Onsager choice (4), the third term in (19) takes the form

(20) | ||||

(21) |

where for the last step we used the Taylor-series expansion

(22) |

and dropped the term, since it will vanish relative to the term in the large-system limit. Looking at (21), the first term is

(23) |

since and is due to Lemma 2. Thus the first term in (21) will vanish in the large-system limit. The second term in (21) is

(24) |

which will also vanish in the large-system limit. The scaling in (24) follows from Lemma 2 under Assumption 1, and the scaling follows from the fact that and .

Thus, for large and the AMP choice of , equation (19) becomes

(25) |

Under Assumption 1, is a realization of equiprobable that is independent of , , and . Thus we can apply the central limit theorem to say that, for any fixed , the first term converges to a Gaussian with mean zero and variance

(26) | ||||

(27) |

From the Taylor expansion (14), we have

(28) | ||||

(29) |

where the scaling follows from the facts that and is . Notice that is the denoiser output error. Because the term in (29) vanishes in the large-system limit, we see that (27) becomes

(30) | ||||

(31) |

where

(32) |

is the average squared error on the denoiser output . We have thus deduced that, in the large-system limit, the first term in (25) behaves like a zero-mean Gaussian with variance . For the second term in (25), we can again use the central limit theorem to say that, for any fixed , the second term converges to a Gaussian with mean zero and variance

(33) | ||||

(34) |

where denotes the empirical second moment of the noise:

(35) |

To summarize, with AMP’s choice of from (4), the th component of the denoiser input-error behaves like

(36) |

in the large-system limit, where denotes a Gaussian random variable with mean and variance . With other choices of (e.g., IST’s choice of ), it is difficult to characterize the denoiser input-error and, in general, it will not be Gaussian.

## 4 AMP State Evolution

In Section 3, we used Assumption 1 to argue that the AMP algorithm yields a denoiser input-error whose components are in the large system limit. Here, where is the average squared-error at the denoiser output in the large-system limit.

Recalling the definition of from (32), we can write

(37) | ||||

(38) |

where is a scalar random variable defined from the empirical distribution

(39) |

with denoting the Dirac delta function. Thus we can argue that, in the large-system limit,

(40) |

where now is distributed according to the limit of the empirical distribution. Combining (40) with the update equation for gives the following recursion for :

(41a) | ||||

(41b) |

initialized with . The recursion (41) is known as AMP’s “state evolution” for the mean-squared error [1, 3, 2].
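A scalar recursion of this kind is easy to simulate. The sketch below assumes the commonly used form of AMP's state evolution, namely an input-noise variance $\tau^t = \sigma_w^2 + \mathcal{E}^t/\delta$ followed by $\mathcal{E}^{t+1} = \mathrm{E}\big[(\eta^t(X + \sqrt{\tau^t}\,Z) - X)^2\big]$ with $Z$ standard normal. This may differ in notation from (41). The expectation is approximated by Monte Carlo, and the soft-thresholding denoiser, the Bernoulli-Gaussian prior, and all numeric values are illustrative assumptions.

```python
import math
import random

def soft(r, theta):
    """Scalar soft-thresholding denoiser (an assumed choice of eta)."""
    return math.copysign(max(abs(r) - theta, 0.0), r)

def state_evolution(delta, sigma_w2, sample_x, n_iter=10, alpha=1.5,
                    n_mc=20000, rng=random):
    """Monte Carlo version of the assumed scalar recursion; returns [E^0, E^1, ...]."""
    xs = [sample_x(rng) for _ in range(n_mc)]
    E = sum(x * x for x in xs) / n_mc          # error of the all-zero initial estimate
    history = [E]
    for _ in range(n_iter):
        tau = sigma_w2 + E / delta             # denoiser input-noise variance
        theta = alpha * math.sqrt(tau)         # assumed threshold rule
        E = sum((soft(x + math.sqrt(tau) * rng.gauss(0.0, 1.0), theta) - x) ** 2
                for x in xs) / n_mc            # denoiser output MSE
        history.append(E)
    return history

def bernoulli_gaussian(rng, rate=0.1):
    """Draw X = 0 with probability 1 - rate, otherwise X ~ N(0, 1)."""
    return rng.gauss(0.0, 1.0) if rng.random() < rate else 0.0

random.seed(2)
E_hist = state_evolution(delta=0.5, sigma_w2=0.01, sample_x=bernoulli_gaussian)
```

The list `E_hist` traces the predicted denoiser output error across iterations. Replacing `soft` by another scalar function shows how the denoiser choice moves the fixed point of the recursion.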

The reason that we call a “denoiser” should now be clear. To minimize the mean-squared error , the function should remove as much of the noise from its input as possible. The smaller that is, the smaller the input-noise variance will be during the next iteration.

## 5 AMP Variance Estimation

For best performance, the iteration- denoiser should be designed in accordance with the iteration- input noise variance . With the AMP algorithm, there is an easy way to estimate the value of at each iteration from the vector, i.e., [7]. Below, we explain this approach using arguments similar to those used above.

Equation (58) shows that

(42) |

Ignoring the term and plugging in AMP’s choice of from (4) yields

(43) | ||||

(44) |

where we used the Taylor series (22) in the second step and to justify the scaling. Since the last term in (44) is the scaled average of terms, with scaling, the entire term is . We can thus drop it since it will vanish relative to the others in the large-system limit. Doing this and plugging in yields

(45) |

recalling the definition of from (25). Squaring the result and averaging over yields

(46) |

We now examine the components of (46) in the large-system limit. By definition, the first term in (46) converges to . By the law of large numbers, the second term converges to

(47) |

since when and when . Using the relationship between and from (29), it can be seen that

(48) |

where is implicitly a function of because . In summary,

(49) |

which shows that is well estimated by in the large-system limit.
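The estimator discussed above can be compared against the true input-error variance in a small simulation. The sketch below assumes that the estimate is $\|v^t\|^2/m$ and that the quantity being estimated is the empirical variance of the denoiser input error $x^t + A^{\mathsf{T}} v^t - x_0$. Both are the standard AMP conventions but are stated here as assumptions, as are the soft-thresholding denoiser, the constant `alpha`, and the problem sizes. The function name `amp_variance_trace` is hypothetical.

```python
import math
import random

def amp_variance_trace(A, y, x0, n_iter=8, alpha=1.5):
    """Run AMP and record (||v||^2/m, empirical input-error variance) per iteration."""
    m, n = len(A), len(A[0])
    x, v_prev, n_active = [0.0] * n, [0.0] * m, 0
    trace = []
    for _ in range(n_iter):
        # residual with Onsager correction (the correction is zero at the first iteration)
        v = [y[i] - sum(A[i][j] * x[j] for j in range(n)) + (n_active / m) * v_prev[i]
             for i in range(m)]
        # denoiser input r = x + A^T v
        r = [x[j] + sum(A[i][j] * v[i] for i in range(m)) for j in range(n)]
        tau_hat = sum(vi * vi for vi in v) / m                      # estimate from v alone
        tau_emp = sum((r[j] - x0[j]) ** 2 for j in range(n)) / n    # needs the true x0
        trace.append((tau_hat, tau_emp))
        theta = alpha * math.sqrt(tau_hat)
        x = [math.copysign(max(abs(rj) - theta, 0.0), rj) for rj in r]
        n_active = sum(1 for xj in x if xj != 0.0)
        v_prev = v
    return trace

# Illustrative problem (sizes and noise level are assumptions)
random.seed(3)
n, m, k = 300, 150, 30
A = [[random.choice((-1.0, 1.0)) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
x0 = [0.0] * n
for j in random.sample(range(n), k):
    x0[j] = random.gauss(0.0, 1.0)
y = [sum(A[i][j] * x0[j] for j in range(n)) + random.gauss(0.0, 0.1) for i in range(m)]

trace = amp_variance_trace(A, y, x0)
```

Each pair in `trace` puts the data-driven estimate next to the quantity it is meant to track. The second entry requires knowledge of `x0` and is only available in simulation, whereas the first is computable inside the algorithm.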

## 6 Numerical Experiments

We now present numerical experiments that demonstrate the AMP behaviors discussed above. In all experiments, we used a sampling ratio of , drawn i.i.d. zero-mean Gaussian with variance , drawn i.i.d. from the Bernoulli-Gaussian distribution with sparsity rate (i.e., , where denotes the Dirac delta distribution), and drawn i.i.d. zero-mean Gaussian with variance and , so that dB. We experimented with two denoisers: the MMSE denoiser and the soft-thresholding denoiser with , which is the minimax choice, i.e., the value of that minimizes the maximum MSE over all -sparse signals (see [7] for more details). With the soft-thresholding denoiser, AMP solves the LASSO problem “