# Implicit stochastic approximation

## Authors

• 13 publications
• 18 publications
• ### Towards stability and optimality in stochastic gradient descent

Iterative procedures for parameter estimation based on stochastic gradie...
05/10/2015 ∙ by Panos Toulis, et al. ∙ 0

• ### Stochastic gradient descent methods for estimation with large data sets

We develop methods for parameter estimation in settings with large-scale...
09/22/2015 ∙ by Dustin Tran, et al. ∙ 0

• ### Convergence Analysis of Inexact Randomized Iterative Methods

In this paper we present a convergence rate analysis of inexact variants...
03/19/2019 ∙ by Nicolas Loizou, et al. ∙ 11

• ### Symmetry-invariant optimization in deep networks

Recent works have highlighted scale invariance or symmetry that is prese...
11/05/2015 ∙ by Vijay Badrinarayanan, et al. ∙ 0

• ### Backpropagation for Implicit Spectral Densities

Most successful machine intelligence systems rely on gradient-based lear...
06/01/2018 ∙ by Aditya Ramesh, et al. ∙ 0

• ### Forward and Reverse Gradient-Based Hyperparameter Optimization

We study two procedures (reverse-mode and forward-mode) for computing th...
03/06/2017 ∙ by Luca Franceschi, et al. ∙ 0

• ### Optimizing Non-decomposable Performance Measures: A Tale of Two Classes

Modern classification problems frequently present mild to severe label i...
05/26/2015 ∙ by Harikrishna Narasimhan, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Robbins and Monro (1951) considered the problem of estimating the zero of a function . Specifically, for every fixed , they assumed that the exact value

is unknown but can be unbiasedly estimated by a random variable

, such that . Starting from an estimate , Robbins and Monro (1951) recursively approximated as follows,

 θrmn=θrmn−1−γnWθrmn−1, (1)

where is a decreasing sequence of positive numbers, known as the learning rate sequence; typically, such that to guarantee convergence, and to guarantee that convergence can be towards any point in . Robbins and Monro (1951) proved convergence in quadratic mean of procedure (1) —also known as the “Robbins-Monro procedure"—, i.e., . Since then, several other authors have explored its theoretical properties, for example, Ljung et al. (1992); Kushner and Yin (2003); Borkar (2008). Due to its remarkable simplicity and computational efficiency, the Robbins-Monro procedure has found widespread applications across scientific disciplines, including statistics (Nevel’son et al., 1973), engineering (Benveniste et al., 1990), and optimization (Nesterov, 2004).

Recently, the Robbins-Monro procedure has received renewed interest for its applicability in parameter estimation with large data sets, and its connections to stochastic optimization methods. In such settings, even though there is a finite data set , the Robbins-Monro procedure can be applied with being, for example, the log-likelihood of at a single data point that is sampled with replacement from . In this case, the theory of Robbins and Monro (1951) implies that , where the expectation is now with respect to the empirical distribution of data points in , and is an estimator of , such as the maximum-likelihood estimator, or maximum a-posteriori if regularization is used. In this work, we will study the theoretical properties of a modified stochastic approximation method in the more general setting with a stream of data points, but will also discuss applications to iterative estimation with a finite data set.

Despite its remarkable properties, a well-known issue with the Robbins-Monro procedure is that the sequence crucially affects both its numerical stability and convergence. Regarding stability, consider a simple example where in Eq.(1) is approximating a scale parameter, and thus needs to be positive. Clearly, even if the previous iterate is positive, there is no guarantee that the update in Eq.(1) will provide a positive next iterate . Regarding convergence, the Robbins-Monro procedure can be arbitrarily slow if is even slightly misspecified. For example, let , and assume there exists the scalar potential , such that , for all . If is strongly convex with parameter , then if (Nemirovski et al., 2009, Section 1); (Moulines and Bach, 2011, Section 3.1). In summary, large learning rates can make the iterates diverge numerically, whereas small rates can make the iterates converge very slowly to . Importantly, these conflicting requirements for stability and fast convergence are very hard to reconcile in practice, especially in high-dimensional problems (Toulis and Airoldi, 2015a, Section 3.5)

, which renders the Robbins-Monro procedure, and all its derived methods, inapplicable without heuristic modifications.

## 2 Implicit stochastic approximation

Our idea to improve the Robbins-Monro procedure (1) is to transform its iteration into a stochastic fixed-point equation as follows,

 θimn=θimn−1−γnWθ(∗)n, s.t. (2) θ(∗)n=θimn−1−γnh(θ(∗)n)\mathclap\scriptsize def=E(θimn|Fn−1), (3)

where is the -algebra adapted to . The implicit stochastic approximation method of Eq.(2) and Eq.(3) can be motivated as the limit of a sequence of improved Robbins-Monro procedures, as follows. First, for convenience, drop the superscript from procedure (1), and fix the sample history . Then, . Assuming that can be computed even though is unknown , then it is reasonable to expect that it is better to use instead of when drawing in procedure (1). This leads to update . In turn, this update has expected value , and thus can be used instead of , and so on. This argument can be repeated ad infinitum producing the following sequence of improved Robbins-Monro procedures,

 θn =θn−1−γnWθn−1, θn =θn−1−γnWθ(1)n, θn =θn−1−γnWθ(2)n, … θn =θn−1−γnWθ(∗)n,

where is defined recursively as . The initial iterate, , corresponds to the classic Robbins-Monro procedure (1). The limit iterate, , is the fixed point of Eq.(3), and corresponds to implicit stochastic approximation (2).

The improvement achieved by the fixed-point equation (3) can be explained through the following simple argument. First, take norms in Eq.(3) to obtain

 ||θimn−1−θ⋆||2≥||θ(∗)n−θ⋆||2+2γnh(θ(∗)n)⊺(θ(∗)n−θ⋆).

As before, suppose that the scalar potential exists and is convex. Then, it follows , which implies that . Let , such that , then Eq.(2) can be written as . Thus, the fixed-point equation contracts towards , and potentially contracts as well. This idea is also closely related to proximal operators in optimization because Eq.(3) can be written as

 θ(∗)n=argminθ{12||θ−θimn−1||2+γnH(θ)}. (4)

The right-hand side of Eq.(4) is a proximal operator (Bauschke and Combettes, 2011), mapping to the intermediate point , and has as a fixed point. Interest on proximal operators has surged in recent years because they are non-expansive and converge with minimal assumptions. Furthermore, they can be applied on non-smooth objectives, and can easily be combined in modular algorithms for optimization in large-scale and distributed settings (Parikh and Boyd, 2013).

On the practical side, the key advantage of the contractive property of implicit stochastic approximation (2) compared to classic stochastic approximation (1) is that it is no longer required to have small learning rates for stability. This resolves the conflicting requirements for stability and convergence in classic stochastic approximation, and provides valuable flexibility in choosing the learning rate . The theoretical results of the following section do confirm that implicit stochastic approximation converges under minimal conditions, maintains the asymptotic properties of classic stochastic approximation, but also has remarkable stability in short-term iterations.

On the downside, the implementation of implicit stochastic approximation is generally a challenging task because is unknown. In specific, the point cannot be computed from Eq.(3) because function is unknown, otherwise we could simply set in Eq.(2). However, as mentioned before, Eq.(2) can be written as , where , and therefore we only need a noisy estimate of the intermediate point in order to implement implicit stochastic approximation. We leverage this result to provide several concrete implementations in Section 4.

## 3 Theory of implicit stochastic approximation

The symbol denotes the vector/matrix norm. We also define the error random variables , such that . The parameter space for will be without loss of generality. For a positive scalar sequence , the sequence is such that , for some fixed , and every ; the sequence is such that in the limit where . denotes a positive sequence decreasing towards zero. We further assume that implicit stochastic approximation (2) operates under a combination of the following assumptions.

###### Assumption 1.

It holds, , and .

###### Assumption 2.

The regression function is Lipschitz with parameter , i.e., for all ,

 ||h(θ1)−h(θ2)||≤L||θ1−θ2||.
###### Assumption 3.

Function satisfies either

1. , for all , or

2. , where , and , for all .

###### Assumption 4.

There exists a scalar potential such that , for all .

###### Assumption 5.

There exists fixed such that, for every ,

 E(||ξn||2|Fn−1)≤σ2.
###### Assumption 6.

Let , then , and for fixed positive-definite matrix . Furthermore, if , then for all , if , or otherwise.

Assumption 3(a) is a typical convexity assumption. Assumption 3(b) is stronger than the convexity assumption, but weaker than the assumption of “strong convexity” that is typical in the literature. Assumption 5 was first introduced by Robbins and Monro (1951), and has since been standard in stochastic approximation analysis. Assumption 6 is the Lindeberg condition that is used to prove asymptotic normality of . Overall, our assumptions are weaker than the assumptions used in classic stochastic approximation; compare, for example, Assumptions 1-6 with assumptions (A1)-(A4) of Borkar (2008, Section 2.1), or assumptions of Benveniste et al. (1990, Theorem 15).

### 3.1 Convergence of implicit stochastic approximation

In Theorem 1 we derive a proof of almost-sure convergence of implicit stochastic approximation, which relies on the supermartingale lemma of Robbins and Siegmund (1985).

###### Theorem 1.

Suppose that Assumptions 1, 2, 3(a), and 5 hold. Then the iterates of the implicit stochastic approximation procedure (2) converge almost-surely to ; i.e., , such that , almost-surely.

The conditions for almost-sure convergence of implicit stochastic approximation are weaker than the conditions for classic stochastic approximation. For example, to show almost-sure convergence for standard stochastic approximation methods, it is typically assumed that the iterates are almost-surely bounded (Borkar, 2008, Assumption (A4)).

### 3.2 Non-asymptotic analysis

In this section, we prove results on upper bounds for the deviance and the errors . This provides information on the rate of convergence, as well as the stability of implicit stochastic approximation methods. Theorem 2 on deviance uses Assumption 3(a), which only assumes non-strong convexity of , whereas Theorem 3 on squared errors uses Assumption 3(b), which is weaker than strong convexity.

###### Theorem 2.

Suppose that Assumptions 1, 2, 3(a), 4, and 5 hold. Define . Then, if , there exists such that, for all ,

 E(H(θimn)−H(θ⋆))≤[2Γ2γγ1+o(1)]n−1+γ.

If , there exists such that, for all ,

 E(H(θimn)−H(θ⋆))≤[Γσ√Lγ1+o(1)]n−γ/2.

Otherwise, and there exists such that, for all ,

 E(H(θimn)−H(θ⋆))≤⎡⎢ ⎢⎣3+√9+4γ31Lσ2/Γ22γ1/Γ2+o(1)⎤⎥ ⎥⎦n−1/3.

There are two main results in Theorem 2. First, the rates of convergence for the deviance are either or , depending on the learning rate parameter . These rates match standard stochastic approximation results under non-strong convexity; see, for example, Theorem 4 of Moulines and Bach (2011). Second, there is a uniform decay of the expected deviance towards zero, since the constants can be made small, depending on the desired accuracy in the constants of the upper-bounds in Theorem 2. In contrast, in standard stochastic approximation methods under non-strong convexity, there is a term (Moulines and Bach, 2011, Theorem 4), which can amplify the initial conditions arbitrarily. Thus, implicit stochastic approximation has similar asymptotic properties to classic stochastic approximation, but is significantly more stable.

###### Theorem 3.

Suppose that Assumptions 1, 3(b), and 5 hold, and define and . Then, if , for every it holds,

 ζn≤e−logκ⋅n1−γ−δζ0+σ2γ1κδ1n−γ+δ+O(n−γ+δ−1).

Otherwise, if , it holds,

 ζn≤e−logκ⋅lognζ0+σ2γ1κδ1n−1+O(n−2).

There are two main results in Theorem 3. First, if is strongly convex (), then the rate of convergence of is , which matches the rate of convergence for classic stochastic approximation under strong convexity (Benveniste et al., 1990, Theorem 22, p.244). Second, there is an exponential discounting of initial conditions regardless of the specification of the learning rate parameter and the Lipschitz parameter . In stark contrast, in classic stochastic approximation there exists a term in front of the initial conditions , which can make the approximation diverge numerically if is misspecified with respect to the Lipschitz parameter (Moulines and Bach, 2011, Theorem 1). Thus, as in the non-strongly convex case of Theorem 2, implicit stochastic approximation has similar asymptotic rates to classic stochastic approximation, but is also significantly more stable.

### 3.3 Asymptotic distribution

Asymptotic distributions are well-studied in classic stochastic approximation. In this section, we leverage this theory to show that iterates from implicit stochastic approximation are asymptotically normal. The following theorem establishes this result using Theorem 1 of Fabian (1968); see also (Ljung et al., 1992, Chapter II.8).

###### Theorem 4.

Suppose that Assumptions 1,2, 3(a), 5, and 6 hold. Suppose also that is positive-definite, where is the Jacobian of at , and is the identity matrix. Then, the iterate of implicit stochastic approximation (2) is asymptotically normal, such that

The covariance matrix is the unique solution of

 (γ1Jh(θ⋆)−I/2)Σ+Σ(γ1Jh(θ⋆)⊺−I/2)=Ξ.

The asymptotic distribution of the implicit iterate is identical to the asymptotic distribution of , as derived by Fabian (1968). Intuitively, in the limit,

with high probability, and thus implicit stochastic approximation behaves like the classic one. We also note that if

commutes with then can be derived in closed-form as .

## 4 Algorithms for implicit stochastic approximation

In the previous section, we showed that implicit stochastic approximation has similar asymptotic properties to classic stochastic approximation, but it is also more stable. Therefore, implicit stochastic approximation is arguably a superior form of stochastic approximation.

However, the main drawback of implicit stochastic approximation is that the intermediate value , which is necessary to compute the update in Eq.(3), is not readily available because it depends on the regression function that is unknown. Thus, every implementation of implicit stochastic approximation needs to approximate or estimate at every iteration.

One strategy for such implementation is to approximate through a separate standard stochastic approximation procedure; i.e., at every th iteration of procedure (2) run a Robbins-Monro procedure for , starting from and updating as follows,

 xk=xk−1−ak(γnWxk−1+xk−1−θimn−1), (5)

where satisfies the Robbins-Monro conditions, and are independent draws with fixed . By the properties of the Robbins-Monro procedure, converges to that satisfies , and thus , by Eq.(3). For practical reasons, it would be sufficient to apply algorithm (5) for a finite number of steps, say , even before has converged to , and then use in place of in procedure (2). We note that Eq.(5) is in fact a stochastic fixed-point iteration, where one critical necessary condition is that the mapping on the right-hand side of Eq.(5) is non-expansive (Borkar, 2008, Theorem 4, Section 10.3); this condition might be hard to validate in practice.

Another implementation is possible if the analytic form of is known, even though the regression function is not. The idea is, rather counter-intuitively, to use the next iterate as an estimator of . This is reasonable because, by definition, , i.e., is an unbiased estimator of . Thus, procedure (2) could be approximately implemented as follows,

 θimn=θimn−1−γnWθimn. (6)

The next iterate appears on both sides of Eq.(6) and thus Eq.(6) is a multi-dimensional fixed-point equation, also known as implicit update. Algorithm (6) appears to be the most practical implementation of implicit stochastic approximation, which justifies why the implicit update of algorithm (6) lends its name to implicit stochastic approximation.

Now, the main problem in applying algorithm (6) is to solve efficiently the fixed-point equation at every iteration and calculate . To solve this problem we distinguish two cases. First, let , where is a scalar random variable that may depend on parameter , and is unit-norm vector random variable that does not depend on . Then, it is easy to see that algorithm is equivalent to

 θimn=θimn−1−γnλns(θimn−1)U, (7)

where the scalar satisfies

 λns(θimn−1)=s(θimn−1−γnλns(θimn−1)U). (8)

Therefore, at the th iteration we can first draw , then solve the one-dimensional fixed-point equation (8), given that the analytic form of is known, and finally calculate through Eq.(7). The solution of the one-dimensional fixed-point equation is computationally efficient in a large family of statistical models when is the negated gradient of the log-likelihood; this family includes, for example, generalized linear models and convex M-estimation (Toulis and Airoldi, 2015a, Theorem 4.1). Second, consider the more general case , where is a random vector that can depend on parameter . The problem is now is that we cannot sample because it depends on the next iterate. However, we can simply draw instead, which reduces to the previous case where , and therefore algorithm (7) is again applicable.

## 5 Application in parameter estimation

A key application of stochastic approximation is in iterative estimation of the parameters of a statistical model (Nevel’son et al., 1973, Chapter 8); (Ljung et al., 1992, Chapter 10); (Borkar, 2008, Section 10.2). In particular, consider a stream of i.i.d. data points , , where the outcome is distributed conditional on covariates according to known density , but unknown model parameters . We will consider two cases, one where the likelihood is fully known and can be calculated easily at each data point , and one where the likelihood is not fully known (e.g., it is known up to normalizing constant) and there is a finite data set instead of a stream. We will also discuss a third case of likelihood-free estimation.

### 5.1 Likelihood-based estimation

Consider the random variable , where the regression function is still unknown. In many statistical models the log-likelihood is concave (Bickel and Doksum, 2015), and thus is the unique point for which . Therefore the standard stochastic approximation procedure (1) can be written as

 θrmn=θrmn−1+γn∇logf(Yn;Xn,θrmn−1). (9)

Stochastic approximation theory implies and therefore is a consistent estimator of . Procedure (9) is known as stochastic gradient descent (SGD) in optimization and signal processing (Coraluppi and Young, 1969), and has been fundamental in modern machine learning with large data sets (Zhang, 2004; Bottou, 2010; Toulis and Airoldi, 2015b).

An implicit stochastic approximation version of procedure (9) can be implemented through algorithm (6), resulting in the iteration

 θimn=θimn−1+γn∇logf(Yn;Xn,θimn). (10)

Procedure (10) is known as the incremental proximal method in optimization (Bertsekas, 2011), or as implicit stochastic gradient descent in statistical estimation (Toulis et al., 2014), and has shown superior performance to standard stochastic gradient descent, both in theory and applications (Toulis et al., 2014; Toulis and Airoldi, 2015a). In particular, in accordance to the theoretical properties of their stochastic approximation counterparts, implicit SGD has identical asymptotic efficiency and convergence rate as standard SGD, but it is significantly more stable. The stochastic proximal gradient algorithm (Singer and Duchi, 2009; Rosasco et al., 2014) is related to implicit SGD. However, in contrast to implicit SGD, the proximal gradient algorithm first makes a standard update (forward step), and then an implicit update (backward step), which may increase convergence speed, but may also introduce instability due to the forward step.

Example. Let be the true parameter of a normal model with i.i.d. observations , . Thus, , and . Assume as the learning rate. Then, the SGD procedure (9) is given by

 θrmn =(1−γnX2n)θrmn−1+γnYnXn. (11)

Procedure (11) is known as the least mean squares filter (LMS) in signal processing, or as the Widrow-Hoff algorithm (Widrow and Hoff, 1960). The implicit SGD procedure for this problem, using update (10), is given by

 θimn =11+γnX2nθimn−1+γn1+γnX2nYnXn. (12)

Procedure (12) is also known as the normalized least mean squares filter (NLMS) in signal processing (Nagumo and Noda, 1967). From Eq. (11) we see that it is crucial for standard SGD to have a well-specified learning rate parameter . For instance, assume fixed for simplicity, then if the iterate will diverge to a value . In contrast, a very large will not cause divergence in implicit SGD, but it will simply put more weight on the th observation than the previous iterate . Moreover, from a statistical perspective, implicit SGD specifies the correct averaging by weighing the estimate and observation according to the inverse of information .

### 5.2 Estimation with likelihood known up to normalizing constant

We now consider the case where the likelihood is known up to a normalizing constant, which arises frequently in practice (Gelman and Meng, 1998)

. In such situations, a large family of estimation methods relies on being able to sample from the underlying model, e.g., through Metropolis-Hastings, which does not require knowledge of the normalization constant. For example, the method of Markov chain Monte Carlo maximum-likelihood estimation

(Geyer, 1991) uses simulations to estimate ratios of normalization constants, which appear in the maximization of the log-likelihood over a finite data set.

The same idea underlying simulation-based methods can be applied in iterative estimation with stochastic approximation. Consider a finite data set with data points, and a complete sufficient statistic that can be calculated on the th data point. Let denote the averaged value of the statistic over independent data points that are simulated conditional on . Then, the iteration

 n ∼Uniform{1,2,…,N} θn =θn−1+γn(Sn−^S(θn−1;k)), (13)

can converge to under typical conditions of stochastic approximation. The learning rates can be chosen as , as it is common, and adaptive schemes are possible (Lai and Robbins, 1979). The constant

affects the variance of the stochastic part in update (

5.2): higher lowers the variance, but also increases the computational burden because more simulations are needed. Resolution of such statistical/computational trade-offs typically require problem-specific considerations.

The implicit counterpart of procedure (5.2) uses an update as follows,

 θn=θn−1+γn(Sn−λn^S(θn−1;k)), (14)

where the scalar satisfies

 (15)

Solution of Eq.(15) is particularly challenging because one needs to find the scalar such that the update (14) leads to a new iterate with simulated statistics satisfying (15). One idea is to consider only , which is true under strong convexity; for example, in the normal linear model of Eq.(11), . More generally, this idea is reasonable because usually acts as a shrinkage factor (Toulis and Airoldi, 2015a, Section 5). Then, we can repeat simulations on a grid of values , and set as follows,

 λn=argminλ∈[0,1]m||λ^S(θn−1;k)−^S(θn(λ);k)||2, (16)

where . Procedure (14) uses simulations per iteration from the underlying model, and is thus more computationally demanding than procedure (5.2), which uses only

simulations per iteration. Improvements in computation are possible if certain simulated moments are reused, similar to the idea of

Bartz et al. (2009). This is reasonabe because the grid of values will increasingly yield similar simulated moments as the iteration counter increases and .

Example. Suppose we observe graphs , , and we want to fit an exponential random graph model (ERGM) (Frank and Strauss, 1986; Pattison and Wasserman, 1999; Robins et al., 2007) with density of a graph defined as

 f(G;θ)=exp(θ⊺XG)/c(θ), (17)

where are properties of , such as number of triangles, number of edges, etc., and is an appropriate normalizing constant. Typically, is hard to compute, and thus remains unknown.

In this case, it is still possible to use the stochastic approximation procedure (18), where we define , and , where is an independent sample from the ERGM model (17) with parameter value fixed at . Obtaining such samples is computationally straightforward (Snijders, 2002; Bartz et al., 2009). The implicit procedure (14) implemented through (16) has potential stability advantages, but a more efficient implementation remains an interesting open problem ahead.

### 5.3 Likelihood-free estimation

An important class of estimation methods in statistics does not rely on likelihood, such as the method of moments, or non-parametric methods. Typically, in such cases, there is a statistic that can be calculated at every th data point, and the regression function is known. Then, the procedure

 θn=θn−1+γn(Sn−T(θn−1)), (18)

can converge to under typical stochastic approximation conditions (e.g., convexity of vector-valued function ). If the system of equations is over-specified, convexification or regularization could be applied, similar, for example, to the way the generalized method of moments (Hall, 2005) resolves over-specification in moment conditions. Interestingly, the form of procedure (18), where the update depends on a discrepancy between an observed statistic and its expected value, also appears in likelihood-based estimation with SGD, e.g., in exponential family models —cf. Eq.(11). The implicit counterpart of procedure (18) has straightforward implementations through the algorithms of Section 4, which we illustrate in the following example.

Example. In their seminal paper, Robbins and Monro (1951) described an application of procedure (1

) in iterative quantile estimation. In particular, consider a random variable

. An experimenter wants to find the point such that , for a fixed . The experimenter cannot draw samples of , but has access to the random variable , for any value of . Robbins and Monro (1951) argued that the procedure

 θn=θn−1−γnWθn−1, (19)

will converge to . Indeed, . Convexity or concavity of is not required because is nondecreasing and , which are the original conditions for convergence established by Robbins and Monro (1951).

Quantile estimation through implicit stochastic approximation can be accomplished through algorithm (5), with the th iteration defined as

 xk =xk−1−ak(γnWxk−1+xk−1−θimn−1),k=1,2,…,K, θimn =xK, (20)

where , , and is a small constant; we give concrete values to these constants in a numerical simulation at the end of this example.

The benefits of the implicit procedure over the classic one can be emphatically illustrated through the following numerical exercise. Suppose is a standard normal random variable, and . In this case, , and thus is in a region of the parameter space where is very flat; in fact, the region is within from the value of . It is well-known that the convergence of the classic Robbins-Monro procedure depends on : if the product is small then convergence will be slow (Toulis and Airoldi, 2015b, Section 2.1). In this example, because is very small in the flat region, the learning rate parameter needs to compensate by being large, otherwise convergence to will be very slow. In fact, standard theory suggests that it is optimal to select . However, if is large the classic procedure (19) will overshoot early on to a point such that . Return from that point will be very slow because the values of will be close to zero, since . For example, suppose and , then to return back to we will need steps for which

 n0+m∑n=n0+1γnWθn−1 ≈(10−3.09) γ1n0+m∑n=n0+10.001/n ≈6.91 logm ≥6.91⋅103/γ1≈23. (21)

The classic procedure will therefore require at least steps to return back after this simple overshoot from to . Smaller values of the learning rate, e.g., , will exacerbate this problem.

In stark contrast, the implicit procedure (5.3) can neither overshoot nor undershoot. The update (5.3) satisfies approximately , which can be written as . Because is nondecreasing it follows that if , then . Similarly, if then . Furthermore, a large value of will in fact push more towards . The implicit equation therefore bestows a remarkable stability property to the underlying stochastic approximation procedure.

To validate our theory, we ran 20 replications of both the Robbins-Monro procedure (19) and the implicit procedure (5.3) for quantile estimation. Both procedures had a learning rate , where we picked multiple values of to test the stabillity of the procedures. For the implicit procedure we also set and . Each procedure was initialized at and was run for 5,000 iterationsl, at each replication. The final parameter estimate , , was stored for each procedure and replication. The results are shown in Figure 1.

We observe that, as the learning rate constant increases, the classic RM procedure produces iterates that overshoot and get stuck. This is consistent with the theoretical analysis of this section, where we showed that the classic procedure cannot converge after overshooting because the true parameter value is in a region where the objective function is flat. In stark contrast, the implicit procedure remains significantly robust, with iterates around the true parameter value across different learning rates. Importantly, the implicit procedure was only approximately implemented with algorithm (5), where the inner stochastic approximation for was executed for only iterations within each th iteration of the main implicit procedure (5.3).

## 6 Conclusion

The need to estimate parameters of a statistical model from massive amounts of data has reinvigorated interest in approximate inference procedures. While most gradient-based procedures of current interest can be seen as special case of a gradient-free stochastic approximation method developed by Robbins and Monro (1951), there is no general gradient-free method that can accommodate procedures with updates defined through implicit equations. Here, we conceptualize a gradient-free implicit stochastic approximation method, and develop asymptotic and non-asymptotic theory for it. This new approximation method provides the theoretical basis for gradient-based procedures that rely on proximal operators (implicit updates), and opens the door to new iterative estimation procedures that do not require access to a gradient or a fully-known likelihood function.

## References

• Bartz et al. (2009) Bartz, K., J. Liu, and J. Blitzstein (2009). Monte carlo maximum likelihood for exponential random graph models: From snowballs to umbrella densities.
• Bauschke and Combettes (2011) Bauschke, H. H. and P. L. Combettes (2011). Convex analysis and monotone operator theory in Hilbert spaces. Springer Science & Business Media.
• Benveniste et al. (1990) Benveniste, A., P. Priouret, and M. Métivier (1990). Adaptive algorithms and stochastic approximations.
• Bertsekas (2011) Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization. Mathematical programming 129(2), 163–195.
• Bickel and Doksum (2015) Bickel, P. J. and K. A. Doksum (2015). Mathematical Statistics: Basic Ideas and Selected Topics, volume I, Volume 117. CRC Press.
• Borkar (2008) Borkar, V. S. (2008). Stochastic approximation. Cambridge Books.
• Bottou (2010) Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Springer.
• Coraluppi and Young (1969) Coraluppi, G. and T. Y. Young (1969). Stochastic signal representation. Circuit Theory, IEEE Transactions on 16(2), 155–161.
• Fabian (1968) Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 1327–1332.
• Frank and Strauss (1986) Frank, O. and D. Strauss (1986). Markov graphs. Journal of the american Statistical association 81(395), 832–842.
• Gelman and Meng (1998) Gelman, A. and X.-L. Meng (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, 163–185.
• Geyer (1991) Geyer, C. J. (1991). Markov chain monte carlo maximum likelihood.
• Gladyshev (1965) Gladyshev, E. (1965). On stochastic approximation. Theory of Probability & Its Applications 10(2), 275–278.
• Hall (2005) Hall, A. R. (2005). Generalized method of moments. Oxford University Press Oxford.
• Kushner and Yin (2003) Kushner, H. J. and G. Yin (2003). Stochastic approximation and recursive algorithms and applications, Volume 35. Springer Science & Business Media.
• Lai and Robbins (1979) Lai, T. L. and H. Robbins (1979). Adaptive design and stochastic approximation. The annals of Statistics, 1196–1221.
• Ljung et al. (1992) Ljung, L., G. Pflug, and H. Walk (1992). Stochastic approximation and optimization of random systems.
• Moulines and Bach (2011) Moulines, E. and F. R. Bach (2011).

Non-asymptotic analysis of stochastic approximation algorithms for machine learning.

In Advances in Neural Information Processing Systems, pp. 451–459.
• Nagumo and Noda (1967) Nagumo, J.-I. and A. Noda (1967). A learning method for system identification. Automatic Control, IEEE Transactions on 12(3), 282–287.
• Nemirovski et al. (2009) Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4), 1574–1609.
• Nesterov (2004) Nesterov, Y. (2004). Introductory lectures on convex optimization, Volume 87. Springer Science & Business Media.
• Nevel’son et al. (1973) Nevel’son, M. B., R. Z. Khasʹminskii, and B. Silver (1973). Stochastic approximation and recursive estimation. American Mathematical Society Providence, RI.
• Parikh and Boyd (2013) Parikh, N. and S. Boyd (2013). Proximal algorithms. Foundations and Trends in optimization 1(3), 123–231.
• Pattison and Wasserman (1999) Pattison, P. and S. Wasserman (1999). Logit models and logistic regressions for social networks: Ii. multivariate relations. British Journal of Mathematical and Statistical Psychology 52(2), 169–194.
• Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The annals of mathematical statistics, 400–407.
• Robbins and Siegmund (1985) Robbins, H. and D. Siegmund (1985). A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pp. 111–135. Springer.
• Robins et al. (2007) Robins, G., P. Pattison, Y. Kalish, and D. Lusher (2007). An introduction to exponential random graph (p*) models for social networks. Social networks 29(2), 173–191.
• Rosasco et al. (2014) Rosasco, L., S. Villa, and B. C. Vũ (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint arXiv:1403.5074.
• Singer and Duchi (2009) Singer, Y. and J. C. Duchi (2009). Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pp. 495–503.
• Snijders (2002) Snijders, T. A. (2002). Markov chain monte carlo estimation of exponential random graph models. Journal of Social Structure 3(2), 1–40.
• Toulis et al. (2014) Toulis, P., E. Airoldi, and J. Rennie (2014). Statistical analysis of stochastic gradient methods for generalized linear models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 667–675.
• Toulis and Airoldi (2015a) Toulis, P. and E. M. Airoldi (2015a). Implicit stochastic gradient descent for principled estimation with large datasets. arXiv preprint arXiv:1408.2923.
• Toulis and Airoldi (2015b) Toulis, P. and E. M. Airoldi (2015b). Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and computing 25(4), 781–795.
• Widrow and Hoff (1960) Widrow, B. and M. E. Hoff (1960). Adaptive switching circuits. Defense Technical Information Center.
• Zhang (2004) Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp. 116. ACM.

## Appendix A Appendix

### a.1 Implicit stochastic approximation

Implicit stochastic approximation is defined as follows,

 θimn=θimn−1−γnWθ(∗)n, s.t. (2) E(θimn|Fn