Optimization in high-dimensional landscapes can be computationally hard. This difficulty is often attributed to the topological complexity of the landscape. We show here that for planted signal recovery problems in high dimensions, there is another key obstruction to local optimization methods. Indeed, we find that a crucial factor in these settings is the competition between the strength of the signal and the entropy of the prior. We focus on a well-known optimization problem from high-dimensional statistics which is known to be NP-hard, namely maximum likelihood estimation for tensor principal component analysis (PCA).
Suppose that we are given i.i.d. observations, , of a -tensor of rank 1 which has been subject to Gaussian noise. That is,
where is deterministic, are i.i.d. Gaussian -tensors with and is the signal-to-noise ratio. Our goal is to infer the “planted signal” or “spike”, , by maximum likelihood estimation.
Observe that maximum likelihood estimation for this problem boils down to optimizing an empirical risk of the form
where denotes the usual Euclidean inner product. Note further that in this setting, optimizing this risk is equivalent (in law) to optimizing the same risk for a single observation upon making the change . We therefore restrict our analysis to the case .
When , this is the well-known spiked matrix model. In this setting it is known that there is an order 1 critical signal-to-noise ratio, , such that below , it is information-theoretically impossible to detect the spike, and above , the maximum likelihood estimator is a distinguishing statistic. This transition is commonly referred to as the BBP transition. In this setting the maximum likelihood estimator is the top eigenvector, which can be computed in polynomial time by, e.g., power iteration. Much more detailed information is known about this transition for spiked matrix models, including universality, fluctuations, and large deviations. See, e.g., [28, 12, 10] for a small sample of these works.
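To make the matrix case concrete, here is a minimal sketch of power iteration on a spiked symmetric matrix. The normalization (a unit-norm planted vector `v`, noise entries of variance `1/n`, and the parameter name `snr`) is an assumption of this sketch and need not match the scaling used in the paper.

```python
import math
import random


def spiked_matrix(n, snr, rng):
    """Return (Y, v) with Y = snr * v v^T + W / sqrt(n), W symmetric Gaussian.

    The scaling is an illustrative assumption, not the paper's normalization.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            noise = rng.gauss(0.0, 1.0) / math.sqrt(n)
            y[i][j] = y[j][i] = snr * v[i] * v[j] + noise
    return y, v


def power_iteration(y, rng, steps=200):
    """Approximate the eigenvector of largest absolute eigenvalue of y."""
    n = len(y)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(steps):
        x = [sum(row[j] * x[j] for j in range(n)) for row in y]
        norm = math.sqrt(sum(c * c for c in x))
        x = [c / norm for c in x]
    return x


rng = random.Random(0)
Y, v = spiked_matrix(40, 5.0, rng)
x_hat = power_iteration(Y, rng)
# Absolute overlap with the planted vector (the sign is not identifiable).
overlap = abs(sum(a * b for a, b in zip(x_hat, v)))
print(f"overlap with planted vector: {overlap:.3f}")
```

With a signal-to-noise ratio well above the spectral edge of the noise, as chosen here, the estimate should align closely with the planted vector; near the critical ratio the overlap degrades.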
When , this is the spiked tensor model. In this case, there is a regime of signal-to-noise ratios for which it is information-theoretically possible to recover the signal but for which there is no known algorithm to efficiently approximate it. This is called a statistical-to-algorithmic gap. In particular, it was shown in [33, 29, 31] that the minimal signal-to-noise ratio above which it is information-theoretically possible to detect the signal, called the information-theoretic threshold, is of order 1. See also [32, 31, 27, 13] for similar results with different priors. On the other hand, the minimal signal-to-noise ratio above which one can efficiently detect the signal, called the algorithmic threshold, has been proved or predicted to scale like for some for every studied algorithm. (By the correspondence mentioned above, the regime of diverging can be translated to the regime of of order 1 with a diverging number of observations , so that this regime is also of practical interest.) In , two local optimization methods, Approximate Message Passing and Tensor Power Iteration, were shown to have critical exponents at most with predicted thresholds at . Semi-definite relaxation approaches have also been analyzed. Tensor unfolding was shown  to have a critical exponent of at most and conjecturally . It was also shown that the degree Sum-of-Squares algorithm  and a related spectral algorithm  (in the case ) have sharp critical thresholds of . See also  for a similar analysis in the case . We remark that statistical-to-algorithmic gaps, often diverging in the underlying dimension, have also been observed in myriad other problems of interest [11, 39, 1, 15, 5, 6].
Let us also discuss the complexity of the landscape given by (1.1). The complexity in the absence of a spike (the case where ) has been extensively studied [3, 2, 37]; see also  for a related line of work. When adding in the signal term so that , it was proved  that the expected number of critical points of , called the annealed complexity, is exponentially large in
and has a topological phase transition as one varies the signal-to-noise ratio on the order 1 scale.
One might wonder why the statistical-to-algorithmic gap is diverging when . We investigate this issue for algorithms which directly perform maximum likelihood estimation, by analyzing the behavior of a family of “plain vanilla” algorithms, called Langevin dynamics, which contain, e.g., gradient descent. We find that for natural initializations, the statistical-to-algorithmic gap for Langevin dynamics diverges like . One may expect that this issue is due to the topological complexity of . Our proof, however, suggests that this gap is actually due to the weakness of the signal in the region of maximal entropy for the uninformative prior.
To clarify this point, we study Langevin dynamics on a more general family of random landscapes. For convenience, let us rescale our problem to be on , the sphere in of radius . We consider a function of the form
where is a deterministic, non-linear function and is a noise term. To put ourselves in a general setting, we only assume that is a mean-zero Gaussian process with a rotationally invariant law that is well-defined in all dimensions. That is, we assume that for every , has covariance of the form
for some fixed function , where denotes the Euclidean inner product. (Footnote 1: It is classical  that the largest class of such is of the form with for some .) For simplicity, we take the function to be a function of the inner product of
with some “unknown” vector. As is isotropic, without loss of generality, we assume that , the first canonical Euclidean basis vector, so that is a function of
which we call the correlation. In particular, we take of the form
where is not necessarily integer. The case and integer , corresponds to the setting of (1.1). The case where corresponds to the -spin glass model from , whose topological phase transitions have been precisely analyzed in  via a computation of the quenched complexity using a novel replica-Kac-Rice approach.
We analyze here the performance of Langevin dynamics and gradient descent in achieving order 1 correlation as one varies the initialization, the non-linearity of the signal, , and the signal-to-noise ratio, . If , we find that the critical threshold for algorithmic recovery via Langevin dynamics diverges like , with , for a natural class of initializations. On the other hand, we find that if , this algorithmic threshold is of order 1. In the former regime, the second derivative of the signal is vanishing in the maximum entropy region of the uninformative prior, whereas in the latter it is diverging, matching the mechanism proposed above.
Our analysis has two main thrusts: recovery above critical thresholds and refutation below them. In both of these settings, we find that the obstacle to recovering via Langevin dynamics is escaping the equator, i.e., the region where , which corresponds to the maximum entropy region of the uninformative prior. In Section 2.2, we provide a hierarchy of sufficient conditions on the initial data that imply that Langevin dynamics with will strongly solve the recovery problem down to a hierarchy of thresholds in order 1 times; the lowest of these thresholds, , is the threshold below which Langevin dynamics started from a uniformly chosen point would not even solve the recovery problem if given a pure signal, i.e., . In Section 2.4, we provide examples of initial data that satisfy these conditions at different levels: the case of the volume measure is discussed in depth in Section 2.1. To prove these results, we build on the “bounding flows” strategy of . In particular, we show that on times, we can compare the evolution of the correlation, , to the gradient descent for the problem with no noise . This follows by a stochastic Taylor expansion upon combining the Sobolev-type -norm estimates developed in  for spin glasses, with estimates on the regularity of the initial data developed here. This is discussed in more detail in Section 2.6.
We conjecture that the threshold is the sharp threshold for recovery initialized from the volume measure, and for efficient recovery more generally. We end the paper by showing that is sharp for a Gibbs class of initial data. To prove the desired refutation theorem below , we formalize the notion of free energy wells (see Definition 2.14). We show that their existence implies exponential lower bounds on the exit times of the well from a natural class of initial data. We find that below the critical , there is a free energy well at the equator, and use this to deduce hardness of recovery for high-temperature Gibbs-initializations. For more on this, see Section 2.5.
2. Statements of main results
Our main focus is a canonical class of optimization algorithms called Langevin dynamics with Hamiltonian . These interpolate between gradient descent and Brownian motion via a parameter, usually called the inverse temperature. More precisely, let solve the stochastic differential equation (SDE)
where is Brownian motion on , denotes the covariant derivative on , and is called the Hamiltonian which is given here by (1.2). The infinitesimal generator of this Markov process is
where is the Laplacian on , and is the metric tensor. In particular, as is , this martingale problem is well-posed so that is well-defined [14, 36]. (When is an integer, is smooth so that one can solve this in the strong sense as well.) We denote the law of started at by . Although we focus mainly on the case of Langevin dynamics, much of our analysis applies equally to gradient descent (). For more on this, see Section 2.3.
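As an illustration of the dynamics, the following sketch discretizes Langevin dynamics on the sphere of radius `sqrt(n)` by a projected Euler step followed by renormalization. Several ingredients are assumptions of the sketch rather than statements from the paper: the SDE is taken in the form "noise of intensity `sqrt(2)` plus drift `-beta * grad H`", the Hamiltonian is the pure-signal one `H(x) = -snr * n * m(x)**k` with correlation `m(x) = x[0] / sqrt(n)`, and the noise field is dropped entirely to keep the example short.

```python
import math
import random


def langevin_pure_signal(n, k, snr, beta, m0, dt, steps, rng):
    """Projected Euler scheme for Langevin dynamics on the sphere of radius sqrt(n).

    Assumed (hypothetical) Hamiltonian: H(x) = -snr * n * (x[0] / sqrt(n)) ** k.
    Returns the final correlation x[0] / sqrt(n).
    """
    radius = math.sqrt(n)
    # Start at correlation m0, with the remaining mass in a random direction.
    rest = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
    rest_norm = math.sqrt(sum(r * r for r in rest))
    scale = radius * math.sqrt(1.0 - m0 * m0) / rest_norm
    x = [m0 * radius] + [r * scale for r in rest]
    for _ in range(steps):
        m = x[0] / radius
        # The Euclidean gradient of H points along the first coordinate.
        grad_first = -snr * k * radius * m ** (k - 1)
        step = [math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0) for _ in range(n)]
        step[0] -= beta * dt * grad_first
        # Project the step onto the tangent space at x, then move.
        radial = sum(s * c for s, c in zip(step, x)) / n
        x = [c + s - radial * c for c, s in zip(x, step)]
        # Retract back onto the sphere.
        norm = math.sqrt(sum(c * c for c in x))
        x = [c * radius / norm for c in x]
    return x[0] / radius


rng = random.Random(0)
final_m = langevin_pure_signal(n=50, k=3, snr=10.0, beta=1.0, m0=0.5,
                               dt=1e-3, steps=3000, rng=rng)
print(f"final correlation: {final_m:.3f}")
```

Started from an order 1 correlation with a strong signal, the drift dominates and the correlation settles near 1; the interesting regime in the text is the opposite one, where the initial correlation is microscopic.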
We aim to determine the minimal for which efficient recovery of the signal, , is possible via Langevin dynamics, and understand the role that the initialization plays. There are, of course, multiple notions of recovery. The main ones in which we are interested are weak recovery and strong recovery. For fixed , sequence , and sequence of initial data , we say that the Langevin dynamics weakly recovers the signal in order time if it attains order 1 correlation in
time with high probability. On the other hand, we say that Langevin dynamics strongly recovers the signal in order time if it attains correlation in time with high probability.
2.1. Recovery initialized from the volume measure
Perhaps the most natural initialization is a completely uninformative prior, i.e., the (uniform) volume measure on . This is particularly motivated from the algorithmic perspective as it is easy to sample from the volume measure on in order 1 time (the volume measure has a log-Sobolev inequality with constant uniformly bounded away from .) In order to focus on the key issues and deal with all in a comprehensive manner, we restrict to the upper hemisphere: . Of course, a point sampled from the volume measure on is in the upper hemisphere with probability .
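The scale of the initial correlation under the uniform measure can be checked numerically. The sketch below samples uniform points on the sphere of radius `sqrt(n)` by normalizing Gaussian vectors and records the correlation, taken here to be `x[0] / sqrt(n)` (this normalization is our assumption). It estimates the mean of `n * m**2`, which equals 1 by symmetry, and the fraction of samples in the upper hemisphere.

```python
import math
import random


def sample_sphere(n, rng):
    """Uniform point on the sphere of radius sqrt(n), via a normalized Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = math.sqrt(n)
    return [c * radius / norm for c in g]


rng = random.Random(1)
n, trials = 100, 2000
total_scaled_sq = 0.0
upper = 0
for _ in range(trials):
    x = sample_sphere(n, rng)
    m = x[0] / math.sqrt(n)      # assumed definition of the correlation
    total_scaled_sq += n * m * m
    upper += m > 0

mean_scaled_sq = total_scaled_sq / trials   # close to 1: m is of order n ** -0.5
upper_fraction = upper / trials             # close to one half, by symmetry
print(mean_scaled_sq, upper_fraction)
```

This is the sense in which a uniformly chosen starting point sits near the equator: its correlation with any fixed direction is of order `n ** -0.5`.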
As a consequence of a general framework developed in Section 2.2, we obtain the following recovery guarantees for Langevin dynamics starting from the volume measure on the upper hemisphere. Let be the volume measure on and let denote the law of the noise .
Fix any , and .
If and for , for every , there exists so that for all ,
If and is a large enough constant, there exists such that for all ,
Our proof approach suggests the recovery guarantees above hold down to and .
Item (1) in Theorem 2.1 holds for all and item (2) holds for all .
We first pause to comment on the special case of with integer, corresponding to maximum likelihood estimation for tensor PCA. The thresholds of Theorem 2.1 improve upon the rigorous known threshold for Approximate Message Passing and Tensor Power Iteration, and the conjectured threshold matches the conjectured threshold for those algorithms.
The thresholds of Conjecture 1 correspond to the signal-to-noise ratios for which the second derivative of diverges at points of correlation (the asymptotic support of ). We predict that these thresholds are sharp for efficient algorithmic recovery of via local optimization, so that when the second derivative is at these correlations, efficient recovery is not possible. When , it is easy to see that for and order 1, or and with , the Langevin dynamics takes at least stretched-exponential time to correlate with ; we expect this to persist with the addition of noise. This is discussed more in depth in Section 2.6.
We are able to prove sharpness of the thresholds proposed in Conjecture 1 for a high-temperature Gibbs-type initialization that approximates the volume measure as . In the next section, we define general conditions, Condition 1 and Condition 2, on the initial data that guarantee recovery above those conjectured thresholds. Then, the obstruction to proving Conjecture 1 becomes a purely static () question of obtaining concentration estimates for derivatives, and contractions of derivatives, of under . (As Remark 2.9 notes, we can improve the thresholds of Theorem 2.1 to and , though the proof is omitted for conciseness.)
2.2. General thresholds for recovery
We introduce here the following natural hierarchy of conditions on a choice of initial data which will guarantee recovery of down to a corresponding threshold in . Let denote the space of probability measures on . A choice of initial data corresponds to a choice of measure . Our main recovery guarantees apply to any initial data which satisfy the following two natural conditions.
The first condition is on the regularity of the initial data. Let , defined as
be the generator of Langevin dynamics with respect to . For every , , and , let
where we emphasize that there is a dependence on here that is suppressed in the notation.
We say that a sequence of random probability measures satisfies Condition 1 at level n at inverse temperature if for every ,
Again notice that the dependence of the condition on is implicit in , but is important to keep in mind. When the choice of is clear, we drop it from the description of Condition 1.
The second condition ensures that the initial correlation is on the typical scale, so that the drift from gradient descent for the signal is not negligible at time zero.
A sequence of random measures, , satisfies Condition 2, if
A sequence of random measures satisfies Condition 2’ if for every ,
We emphasize that neither of these conditions involves the parameters (the non-linearity of the signal) or (the signal-to-noise ratio). The conditions can be shown to hold for various natural choices of initial data, such as the volume measure on the upper hemisphere, implying Theorem 2.1, as well as certain “high-temperature” Gibbs measures. For more on this, see Section 2.4.
Let us now turn to our main results. We begin with the supercritical regime, , where one will need to diverge with to efficiently recover, as the curvature of the signal in the region where is negligible. For every , let
We then have the following result regarding strong recovery.
Fix any , , and . Let and consider an initialization satisfying Condition 1 at level at inverse temperature . We then have the following.
If also satisfies Condition 2, then for every and every , there exists a such that for every ,
If satisfies Condition 2’, the same convergence holds, instead, in probability.
The above theorem shows that in the regime , we need to diverge for Langevin dynamics to recover the signal in order 1 time. Observe that for such , the second derivative of in the region is vanishing as . Let us now show, conversely, that in the subcritical regime , i.e., the regime where the second derivative of is diverging when , order 1 time weak recovery holds for large but finite signal-to-noise ratios. That is, the statistical-to-algorithmic gap is at most order 1 for . In this regime, one cannot hope for strong recovery (see the remark on strong recovery at the end of Section 2.6.1). If we let
then we have the following weak recovery guarantee.
Fix any and . There exists such that for all the following holds. If satisfies Condition 1 at level and inverse temperature and Condition 2, then for every and every , there exists and , such that for any
The main ideas behind Theorems 2.4 and 2.5 are essentially the same. We will explain the intuition behind their proofs presently (see Section 2.6 below). We end this section with the following remark on possible relaxations of Condition 1.
One could not make the set much smaller, since measures that don’t contain information about the planted signal, e.g., the volume measure on , have .
2.3. Gradient descent
It is also natural to study these recovery problems for gradient descent, which can be seen as the limit of the Langevin dynamics: that is, which solves the ODE,
The above recovery results extend naturally to this setting as well. Even though gradient descent can, in principle, get stuck at the exponentially many critical points near the equator while Langevin dynamics will not, their thresholds for order 1 time recovery seem to match one another. To avoid technical issues regarding the existence of solutions to this equation (when , is only -Hölder), let us focus on the supercritical regime.
In this setting, we let be the infinitesimal generator of gradient descent on ,
Fix any and . Let and consider initialization satisfying Condition 1 at level and . Gradient descent satisfies the following.
If also satisfies Condition 2, then for every and every , there exists such that
If satisfies Condition 2’, then the same convergence holds, instead, in probability.
The proof of this result is the same, mutatis mutandis, as that of Theorem 2.4. For an explanation of these changes, see Section 5.4.
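For intuition on the gradient descent thresholds, consider the pure-signal problem only. Assuming a Hamiltonian of the form `H(x) = -snr * n * m**k` on the sphere of radius `sqrt(n)` (a hypothetical normalization chosen for this sketch), gradient flow reduces to the scalar ODE `dm/dt = snr * k * m**(k-1) * (1 - m**2)` for the correlation `m`. The sketch below integrates this ODE by forward Euler from `m0 = n ** -0.5` and compares the time to reach correlation 1/2 for an order 1 signal-to-noise ratio and for one of order `n ** 0.5`, with `k = 3`.

```python
def hitting_time(k, snr, m0, target=0.5, dt=1e-3, t_max=1e3):
    """Forward-Euler time for dm/dt = snr*k*m^(k-1)*(1-m^2) to reach `target`."""
    m, t = m0, 0.0
    while m < target and t < t_max:
        m += dt * snr * k * m ** (k - 1) * (1.0 - m * m)
        t += dt
    return t


n = 10_000
m0 = n ** -0.5                            # typical correlation of a uniform start
t_slow = hitting_time(3, 1.0, m0)         # snr of order 1: time grows with n
t_fast = hitting_time(3, n ** 0.5, m0)    # snr of order n ** 0.5: order 1 time
print(t_slow, t_fast)
```

In this toy reduction the escape time scales like `n ** 0.5 / snr`, so an order 1 escape time requires the signal-to-noise ratio to grow with the dimension. The noise field, which is the main difficulty treated in the paper, is absent here.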
2.4. Examples of initial data satisfying Condition 1 and Condition 2
Let us now turn to some examples of initial data that satisfy the conditions of our theorems. When considering initial data for such problems there are a few natural choices.
Let us begin by observing that -a.s., any initial data which is concentrated on the region , for example where , satisfies Condition 1 at level 1 for every and Condition 2 tautologically. Let us now turn to higher levels.
The normalized volume measure on satisfies Condition 1 at level 3 at every and satisfies Condition 2.
The normalized volume measure on , satisfies Condition 1 at level for every at every .
By similar arguments, we can show that the volume measure on satisfies Condition 1 at level 4. However, as the argument becomes unwieldy and we believe the proof at level 3 already suggests the essential difficulty behind proving Conjecture 2, we only include the proof of the level 3 case.
Although we do not prove Conjecture 2 for the volume measure on —which would imply efficient recovery from the uninformative prior above —we do find another choice of initial data which is natural and does in fact satisfy Condition 1 at level for every .
Let be the Gibbs measure on corresponding only to the noise at inverse temperature , and let be conditioned on , i.e.,
Let be even. There exists such that for all , the measure satisfies Condition 1 at level at inverse temperature for every . Moreover, the measure satisfies Condition 2’.
As an immediate corollary, we obtain the following.
Let be even and . If , for every , if with , the Langevin dynamics starting from strongly recovers the signal in order 1 time in -prob.
Notice, in particular, that like the volume measure, the measure is completely independent of the noise as well as the signal-to-noise ratio . Moreover, as , the measure approximates the volume measure, lending further support to Conjecture 2. We end this section with the following conjecture regarding the measure . While this result would imply an almost sure recovery result for and , as well as the matching weak recovery result for , we also believe that it is of independent interest.
For every , if is the law of a standard Gaussian,
weakly as measures -a.s. In particular, satisfies Condition 2.
2.5. Refutation below
Our final result for this section concerns the sharpness of the threshold with respect to this Gibbs initialization. Corollary 2.11 shows that for even, Langevin dynamics strongly recovers the planted signal for all when started with initial data . Our last result shows that this is sharp. For any , let be the hitting time
and let be as in (2.8).
Fix any , any and , and let . If , there exists such that for every sufficiently small,
Motivated by Conjecture 2, and the fact that as , approximates the volume measure on , we believe that a similar refutation result also holds for initialization from the volume measure whenever ; this would make the thresholds and sharp for initialization from the volume measure.
2.6. Ideas of proofs
We now sketch some of the key ideas underlying the above recovery and refutation results and their proofs.
2.6.1. Ideas of proofs of Theorems 2.4 and 2.5
Our interest is in understanding the transition for signal recovery in short times. It turns out that the subcritical and supercritical problems are essentially the same. To see why, consider, for the moment, the recovery question for Langevin dynamics in the simpler setting where there is only a signal, and .
By rotation invariance, the question of escaping the equator for the problem with pure signal is effectively the same as studying the escape from the origin for a 1-dimensional Langevin dynamics with Hamiltonian,
in the small noise regime (noise of order ). Evidently, this amounts to studying the ODE,
where . The second term reverts to the origin. To escape, we must then hope that the first term dominates at the initial point. In particular, if is positive and small and is large, one hopes to compare this ODE to the simpler system.
In this setting, one may then apply a standard comparison inequality (see Lemma 5.1), which compares solutions of this ODE to certain power laws. (Footnote 2: In the critical regime, , this is the classical Gronwall inequality which is, of course, not of power law type.) Evidently, the order of growth of the second derivative of is the essential ingredient in resolving the tradeoff above. Indeed, under Condition 2, which places , if and is order 1, then , and similarly, if then provided grows sufficiently fast (). Notice also that if , then Langevin dynamics with from would not efficiently recover the signal even in this trivial pure spike problem.
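The power-law comparison can be made explicit in a toy case. The symbols below ($u$, $c$, $p$, $u_0$, $\epsilon$) are illustrative and are not the paper's notation; under our reading, $p$ plays the role of the exponent of the leading drift term near the equator (the signal exponent minus one), and $c$ is proportional to the signal-to-noise ratio. The first display solves the comparison ODE, and the second gives the time to reach a fixed level $\epsilon$ from $u_0 \asymp N^{-1/2}$.

```latex
% Illustrative symbols only: u(t) > 0, rate c > 0, exponent p > 0, level \epsilon \in (u_0, 1).
\begin{align*}
\dot u = c\,u^{p},\quad u(0)=u_0>0
\;\Longrightarrow\;
u(t) =
\begin{cases}
\bigl(u_0^{-(p-1)} - (p-1)\,c\,t\bigr)^{-1/(p-1)}, & p>1,\\[2pt]
u_0\,e^{ct}, & p=1,\\[2pt]
\bigl(u_0^{1-p} + (1-p)\,c\,t\bigr)^{1/(1-p)}, & 0<p<1.
\end{cases}
\end{align*}
% Time t_\epsilon at which u reaches \epsilon, with u_0 \asymp N^{-1/2}:
\begin{align*}
p>1:&\quad t_\epsilon = \frac{u_0^{-(p-1)} - \epsilon^{-(p-1)}}{(p-1)\,c} \asymp \frac{N^{(p-1)/2}}{c},\\
p=1:&\quad t_\epsilon = \frac{\log(\epsilon/u_0)}{c} \asymp \frac{\log N}{c},\\
0<p<1:&\quad t_\epsilon = \frac{\epsilon^{1-p} - u_0^{1-p}}{(1-p)\,c} \le \frac{\epsilon^{1-p}}{(1-p)\,c}.
\end{align*}
```

In this toy model, an order 1 escape time requires $c$ to diverge polynomially in $N$ when $p>1$, only logarithmically when $p=1$ (the Gronwall case), and not at all when $p<1$, which mirrors the supercritical, critical, and subcritical regimes described above.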
When adding back , we consider the evolution equation for given by
where , given by (2.3), is the infinitesimal generator for Langevin dynamics with respect to , and is a martingale. We will see that , so that on short times, this is not far from the situation of (2.11). The remaining discrepancy, evidently, is to ensure that starts and remains smaller than . To this end, we use the -norm estimates from  to show that provided is suitably localized, i.e., provided Condition 1 holds at level , then remains localized on the relevant timescale needed to recover the signal above (see Theorem 5.3). The main result then follows by combining this localization with the aforementioned comparison inequality of Lemma 5.1: this is developed in Section 5.
(Strong recovery is impossible for finite ) When is order 1, one cannot hope to obtain a strong recovery result. Indeed if we start from any point sufficiently close to the north pole, correlation for sufficiently small, then will decrease in correlation in order 1 time. To see this, we examine the drift in (2.12), and expand as in (6.5): if is sufficiently small, then will dominate ; furthermore the maximum of on the spherical cap can be shown to scale down to zero as goes to zero. Putting these together with the types of arguments found in Section 4 would imply the desired.
2.6.2. Ideas of proofs of Theorem 2.12
The underlying idea behind our refutation result is the presence of what we call a free energy well for the correlation, which is defined as follows. Define the Gibbs measure for by,
which is normalized to be a probability measure, where is the normalized volume measure on . In the following, for a real number , we let , denote the ball of radius around . For any function , we define the entropy:
We can now define free energy wells for Lipschitz functions.
A Lipschitz function, , has an –free energy well of height in if the following holds: there exists and such that and
Such free energy wells are the exit time analog of the free energy barriers formalized in  for spectral gap estimates. We show in thm:fe-barrier-metastability that free energy wells confine the dynamics on timescales that are exponential in the height, , when started from this Gibbs measure restricted to the well. We then show that for , there is a free energy well for the correlation: namely, the function has a free energy well of height in (see Proposition 7.1). Theorem 2.12 then follows by combining this with the facts that and are comparable when restricted to this band of correlations, and is asymptotically supported in this region.
We conclude with a remark regarding exceptional points which facilitate recovery at order one .
Remark 2.15 (Equatorial passes).
In light of the free energy well for the correlation, one might hope to prove an even stronger refutation theorem. It may be tempting to believe that when and is order 1, the Langevin dynamics cannot recover the signal in sub-exponential times, uniformly over all initial with . Indeed, this is the case for the simpler “pure signal” problem where . As a consequence of (2.12) and lower bounds on (see e.g., ), however, this guess does not hold. One can show the following: for every and , there exist initial data such that and such that Langevin dynamics started from succeeds at weak recovery.
We thank Giulio Biroli, Chiara Cammarota, Florent Krzakala, Lenka Zdeborova, Afonso Bandeira, and David Gamarnik for inspiring discussions. This research was conducted while G.B.A. was supported by BSF 2014019 and NSF DMS1209165, and A.J. was supported by NSF OISE-1604232. R.G. thanks NYU Shanghai for its hospitality during the time some of this work was completed.
3. Preliminaries: Regularity theory and stochastic analysis in high dimensions
Throughout the paper, we will make frequent use of certain uniform Sobolev-type estimates for , developed in the context of spin-glass dynamics, as well as properties of solutions to certain Langevin-type stochastic differential equations. We recall these results in this section. In what follows, for functions we say that if there is a constant such that .
3.1. Regularity theory of spin glasses and the norm
As is often the case in such problems it will be important to understand the regularity of the related Hamiltonians. It turns out that in high-dimensional analysis problems, one needs to define these Sobolev spaces carefully, as the scaling of the norms in the dimension is often crucial to the problem at hand.
With this in mind, let us recall the -norm, which provides a (topologically) equivalent norm on the usual Sobolev space, , but which is better suited to high-dimensional problems, as well as the related -norm regularity of , established in , which will be crucial to our analysis.
A function is in the space , if
Here, denotes the natural operator norm when is viewed as a -form acting on the -fold product of the tangent space . Throughout the paper, unless otherwise specified will denote this norm.
By the equivalence of norms on finite dimensional vector spaces, for fixed , this space is equivalent to the canonical Sobolev space , which is defined using Frobenius norms. We use the operator norm instead for the following reason. For , we need to bound the operator norms of random tensors. It is well-known that for such random tensors, there is a marked difference in the scaling of the Frobenius and operator norms in the dimension (see, e.g., ).
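The gap between the two norms is already visible for matrices, the simplest case of random tensors. The sketch below, which is only an illustration and not part of the paper's argument, draws an `n` by `n` matrix with i.i.d. standard Gaussian entries and compares its Frobenius norm (of order `n`) with an estimate of its operator norm (of order `2 * sqrt(n)`), obtained by power iteration on `W^T W`.

```python
import math
import random


def gaussian_matrix(n, rng):
    """n x n matrix with i.i.d. standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def frobenius_norm(w):
    return math.sqrt(sum(v * v for row in w for v in row))


def operator_norm(w, rng, steps=100):
    """Estimate the largest singular value by power iteration on W^T W.

    The returned value |W x| for a unit vector x is a lower bound on the norm.
    """
    n = len(w)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in x))
    x = [c / norm for c in x]
    for _ in range(steps):
        y = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = W x
        z = [sum(w[i][j] * y[i] for i in range(n)) for j in range(n)]  # z = W^T y
        norm = math.sqrt(sum(c * c for c in z))
        x = [c / norm for c in z]
    y = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(c * c for c in y))


rng = random.Random(2)
n = 40
W = gaussian_matrix(n, rng)
fro = frobenius_norm(W)
op = operator_norm(W, rng)
print(f"Frobenius: {fro:.1f}, operator: {op:.1f}, sqrt(n): {math.sqrt(n):.1f}")
```

The ratio of the two norms grows like `sqrt(n) / 2` in this example, which is why a norm built from operator norms can have a dimension-free scaling where one built from Frobenius norms would not.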
We let denote the special case , which is chosen precisely such that the scaling in of the -norm is independent of . Namely, in , it was shown that for every , the -norm of is order in . Recall from (1.3) and that implies that .
Theorem 3.3 ([7, Theorem 3.3]).
For every , there exist such that is in uniformly in with high probability: for every ,
The result was stated there for with only one nonzero , i.e., the -spin model. However, as observed in [7, Remark 3.4], it easily extends to this setting by Borell’s inequality and the fact that in that case, the corresponding is of at most polynomial growth in .
As further motivation for the definition of the norm , specifically with , we note here the following easy observations which are useful in bounding the regularity of observables with respect to Langevin-type operators: we call these the ladder relations for .
Lemma 3.5 (Ladder relations).
For every , there exists such that for every ,
In particular, if and for some that satisfy and then there is a such that
The first result in (3.2) follows from the fact that traces commute with covariant derivatives. Indeed for smooth, observe that
so that .
As a result of the above and explicit calculation, one also sees the following.
For every , there exist and such that for every ,
Finally, an explicit computation also shows that always lives in the space for any . In particular, for every and every , there exists such that
3.2. The Langevin operator and existence of the martingale solution
Let us now recall some elementary results from stochastic analysis. For a function , we always let denote its evolution under the Langevin dynamics (2.1). We also let denote the martingale part of this evolution,
Observe that is well-defined as the martingale problem for given by (2.2) is well-posed.
Let us now recall the following elementary estimate. Suppose that is smooth and ; then by Doob’s maximal inequality,
As we will frequently use the following estimate, we note that in the case that , one has by (3.4), and Doob’s maximal inequality, that there is a universal such that for every , and every ,
We also define here the following notation which is used throughout the paper. Let be given by
4. On Weak and Strong recovery
As mentioned in the introduction, there are two main notions of recovery that we study in this paper: weak and strong recovery. In this section, we discuss the relationship between these. We prove that weak recovery implies strong recovery in the diverging regime. We then show that depending on the rate of divergence in , there is a certain related radius of correlations, , which is , such that one can weakly recover in order 1 time from every initial point with correlation greater than . This reduces the difficulty of proving our recovery theorems to showing that the dynamics “escapes the equator”. We end the section observing the stability of weak recovery.
4.1. Weak recovery implies strong recovery
We show that as long as is diverging, weak recovery with Langevin dynamics implies strong recovery. In the following, we let be such that eventually -almost surely. Recall that such a exists by Theorem 3.3. For any , we let denote the first hitting time for the set
Fix and . For every and every sequence , there exists such that for all , eventually -almost surely,
Fix and suppose that is any diverging sequence. Let be the first hitting time of . We wish to show that there exists such that uniformly over all , we have that and for every , with –probability . For such initial data, for all , eventually -almost surely, from (3.7) satisfies
Consequently, for any there is a such that for every , for all . Applying (3.6) with , we see that for every ,
for some universal . As a consequence, there exists a such that . By similar reasoning, for every ,
The strong Markov property and a union bound over the above two estimates then imply the desired result. ∎
4.2. Weak recovery from microscopic scales
By a similar argument to the preceding, one can show that in this regime, weak recovery occurs as soon as has crossed a certain microscopic correlation. More precisely, we obtain the following.
Fix and . There exist such that if satisfies
for some and some , then, for every , there is a such that for all , -almost surely,
Fix . Suppose that with , and that and satisfy
for some . Then -almost surely,