# Properties of the Stochastic Approximation EM Algorithm with Mini-batch Sampling

To speed up convergence a mini-batch version of the Monte Carlo Markov Chain Stochastic Approximation Expectation Maximization (MCMC-SAEM) algorithm for general latent variable models is proposed. For exponential models the algorithm is shown to be convergent under classical conditions as the number of iterations increases. Numerical experiments illustrate the performance of the mini-batch algorithm in various models. In particular, we highlight that an appropriate choice of the mini-batch size results in a tremendous speed-up of the convergence of the sequence of estimators generated by the algorithm. Moreover, insights on the effect of the mini-batch size on the limit distribution are presented.

## Authors

• 6 publications
• 4 publications
• 2 publications
• ### Mini-batch learning of exponential family finite mixture models

Mini-batch algorithms have become increasingly popular due to the requir...
02/09/2019 ∙ by H D Nguyen, et al. ∙ 0

• ### Mini-batch Tempered MCMC

In this paper we propose a general framework of performing MCMC with onl...
07/31/2017 ∙ by Dangna Li, et al. ∙ 0

• ### Nested Mini-Batch K-Means

A new algorithm is proposed which accelerates the mini-batch k-means alg...
02/09/2016 ∙ by James Newling, et al. ∙ 0

• ### Mining Smart Card Data for Travelers' Mini Activities

In the context of public transport modeling and simulation, we address t...
12/19/2017 ∙ by Boris Chidlovskii, et al. ∙ 0

• ### An Investigation into Mini-Batch Rule Learning

We investigate whether it is possible to learn rule sets efficiently in ...
06/18/2021 ∙ by Florian Beck, et al. ∙ 0

• ### Unsupervised Learning Framework of Interest Point Via Properties Optimization

This paper presents an entirely unsupervised interest point training fra...
07/26/2019 ∙ by Pei Yan, et al. ∙ 10

• ### On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

The EM algorithm is one of the most popular algorithm for inference in l...
10/28/2019 ∙ by Belhal Karimi, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

On very large datasets the computing time of the classical expectation-maximization (EM) algorithm (Dempster et al., 1977) as well as its variants such as MCEM, SAEM, MCMC-SAEM and others is long, since all data are used in every iteration. To circumvent this problem, a bunch of EM-type algorithms have been proposed, namely various mini-batch (Neal and Hinton, 1999; Liang and Klein, 2009; Karimi et al., 2018; Nguyen et al., 2019) and online (Titterington, 1984; Lange, 1995; Cappé and Moulines, 2009; Cappé, 2011) versions of the EM algorithm. They all consist in using only a part of the observations during one iteration in order to accelerate convergence. While online algorithms process a single observation per iteration handled in the order of arrival, mini-batch algorithms use larger, randomly chosen subsets of observations. The size of these subsets of data is called mini-batch size. Choosing large mini-batch sizes entails long computing times, while very small mini-batch sizes as well as online algorithms may result in a loss of accuracy of the algorithm. Hence, an optimal mini-batch size would achieve a compromise between accuracy and computing time. However this issue is generally overlooked.

In this article, we propose a mini-batch version of the MCMC-SAEM algorithm (Delyon et al., 1999; Kuhn and Lavielle, 2004). The original MCMC-SAEM algorithm is a powerful alternative to EM when the E-step is intractable. This is particularly interesting for nonlinear models or non-Gaussian models, where the unobserved data cannot be simulated exactly from the conditional distribution. Moreover, the MCMC-SAEM algorithm is also more computing efficient than the MCMC-EM algorithm, since only a single instance of the latent variable is sampled at every iteration of the algorithm. Nevertheless, when the dimension of the latent variable is huge, the sampling step can be time consuming. From this point of view the here proposed mini-batch version is computationally more efficient than the original MCMC-SAEM, since at each iteration only a small proportion of the latent variable is simulated and only the corresponding data are visited. For curved exponential models, we prove almost-sure convergence of the sequence of estimates generated by the mini-batch MCMC-SAEM algorithm under classical conditions as the number of iterations of the algorithm increases. Moreover, we assess in simulation experiments the influence of the mini-batch size on the speed-up of the convergence in various models and highlight that an appropriate choice of the mini-batch size results in an important gain. We also present insights on the effect of the mini-batch size on the limit distribution of the estimates. Numerical results illustrate our findings.

## 2 Latent variable model and algorithm

This section first presents the general latent variable model and the required assumptions. Then the original MCMC-SAEM algorithm is described, before presenting the new mini-batch version of the MCMC-SAEM algorithm.

### 2.1 Model and assumptions

Consider the common latent variable model with observed (incomplete) data and latent (unobserved) variable . Denote the dimension of the latent variable . In many latent variable models, corresponds to the number of observations, but it is not necessary that and have the same size or that each observation depends only on a single latent component .

Denote

the model parameter of the joint distribution of the complete data

. In what follows, omitting all dependencies in the observations , which are considered as fixed realizations in the analysis, we assume that the complete-data likelihood function has the following form

 f(z;θ)=exp{−ψ(θ)+⟨S(z),ϕ(θ)⟩}c(z), (1)

where denotes the scalar product,

denotes a vector of sufficient statistics of the model with values in

and and are functions on . The posterior distribution of the latent variables given the observations, is denoted by .

### 2.2 Description of MCMC-SAEM algorithm

The original MCMC-SAEM algorithm proposed by Kuhn and Lavielle (2004) consists in replacing the E-step in the EM-algorithm by a simulation step combined with a stochastic approximation step. Here, we focus on a version where the MCMC part is a Metropolis-Hastings-within-Gibbs algorithm (Robert and Casella, 2004). More precisely, at iteration , the following three steps are performed.

#### Simulation step

A new realization of the latent variable is sampled from an ergodic Markov transition kernel , whose stationary distribution is the posterior distribution . In practice, one iteration of a Metropolis-Hastings-within-Gibbs algorithm is used. We consider a collection of symmetric random walk Metropolis kernels defined on and acting only on the -th coordinate, see Fort et al. (2003). More precisely, let be the canonical basis of . Then, for each starting from the -vector , the proposal in the direction of is given by , where is sampled from a symmetric increment density

. This proposal is then accepted with probability

#### Stochastic approximation step

The approximation of the sufficient statistic is updated by

 sk=(1−γk)sk−1+γkS(zk), (2)

where is a decreasing sequence of positive step-sizes such that and . That is, the current approximation of the sufficient statistic is a weighted mean of its previous value and the value obtained by the current simulation step.

#### Maximization step

The model parameter is updated by

 θk =^θ(sk),

with where

Depending on the model the maximization problem may not have a closed-form solution.

### 2.3 Mini-batch MCMC-SAEM algorithm

When is large, the simulation step can be very time-consuming. Indeed, simulating all components of the latent variable at every iteration is costly in time. Thus, according to the spirit of other mini-batch algorithms, updating only a part of the latent components may speed up the convergence of the algorithm. With this idea in mind, denote the (mean) proportion of components of the latent variable that are updated during one iteration.

#### Mini-batch simulation step

In the mini-batch version of the MCMC-SAEM algorithm the simulation step consists of two parts. First, select the indices of the components of the latent variable that will be updated. That is, we sample the number

of indices from a binomial distribution Bin

and then select randomly indices among without replacement. Denote this set of selected indices at iteration . Next, at iteration , instead of sampling all the components , we may sample only the components with index from the Markov kernel using one iteration of a Metropolis-Hastings-within-Gibbs algorithm.

#### Stochastic approximation and maximization step

These two steps are identical to the original SAEM algorithm. However, in many models the evaluation of the sufficient statistic can be performed by an update of a small part of the components of its previous value . This may be computationally more efficient than computing naively.

#### Initialization

Initial values , and for the model parameter, the sufficient statistic and the latent variable, respectively, have to be chosen by the user or at random.

See Algorithm 1 for a complete description of the algorithm.

## 3 Convergence of the algorithm

In this section we show that the mini-batch MCMC-SAEM algorithm converges as the number of iterations increase under classical conditions (which are the ones that ensure convergence of the original MCMC-SAEM algorithm). Indeed, compared to classical MCMC-SAEM algorithm, the mini-batch version involves an additional stochastic part that comes from the selection of indexes of the latent variable to be updated. This additional randomness is ruled by the value of .

### 3.1 Equivalent descriptions

The above description of the simulation step is convenient for achieving maximal computing efficiency. We now focus on an equivalent framework that underlines the fact that our mini-batch algorithm is a special case of the MCMC-SAEM classical algorithm. For two kernels , we denote their composition by

 P2∘P1(z,z′)=∫~zP2(~z,z′)P1(z,d~z).

With this notation at hand, the Metropolis-Hastings-within-Gibbs relies on the kernel . Now, in the mini-batch simulation step, the update of only a part of the latent variable is equivalent to generating latent vectors according to the Markov kernel defined as the following recursive composition

 Πα(z,z′|θ) where Πα,i(z,z′|θ) =αΠi(z,(z1,…,z′i,…,zn)|θ)+(1−α)δz(z′), and thus Πα(z,z′|θ) =n∑k=0αk(1−α)n−k∑1≤i1<⋯

where denotes the Dirac measure at . That is, is the composition of mixtures of the original kernels and a deterministic component with mixing weight . In other words, the mini-batch MCMC-SAEM algorithm corresponds to the original MCMC-SAEM algorithm with a particular choice of the transition kernel. Note that it can also be seen as a mixture over different trajectories (the choice of indexes to be updated) of Metropolis-Hastings-within-Gibbs kernels acting on a subpart of the latent vector .

A third description of the algorithm will be appropriate for the analysis of its theoretical properties. In order to stress the role of the mini-batch procedure, we denote by the sequence of latent variables obtained by the mini-batch algorithm with mini-batch size . In the -th mini-batch simulation step, for each

, we sample a Bernoulli random variable

with parameter . This is an indicator of whether the latent variable is updated at iteration . Next, we sample a realization from the transition kernel and set

 (4)

When , the sequence generated by the batch algorithm is a Markov chain with transition kernel .

Fort et al. (2003) establish results on the geometric ergodicity of hybrid samplers and in particular for the random-scan Gibbs sampler. The latter is defined as , where each is a kernel on acting only on the -th component. More generally the random-scan Gibbs sampler may be defined as where

is a probability distribution. This means that at each step of the algorithm, only one component

is drawn from the probability distribution and then updated. These probabilities may be chosen uniformly () or for e.g. can help focusing on a component that is more difficult to simulate. We generalize their results to a setup where at each step , we rely on a kernel iterated from a random-scan Gibbs sampler as follows

 Πα(⋅,⋅|θk) =˜Πα(⋅,⋅|θk)∑iUi,k (5) and ˜Πα(⋅,⋅|θk) ={(∑ni=1Uk,i)−1∑ni=1Uk,iΠi(⋅,⋅|θk) if ∑ni=1Uk,i≥1,Idelse

where denotes the identity kernel . Note that this is not exactly the kernel corresponding to the algorithm described above, as here this one could formally update the same component more than once in one step of the algorithm. Nonetheless, we neglect this effect and establish our result for the algorithm corresponding to this kernel.

### 3.2 Assumptions and result

Assume that the random variables are defined on the same probability space . We denote the increasing family of -algebras generated by the random variables . Assume that the following regularity conditions on the model hold.

(M1)

The parameter space is an open subset of . The complete data likelihood function is given by

 f(z;θ)=exp(−ψ(θ)+⟨S(z),ϕ(θ)⟩)c(z),

where is a continuous function on taking its values in an open subset of . Moreover, the convex hull of is included in and for all

 ∫S(z)π(z;θ)dz<∞.

(M2) The functions and are twice continuously differentiable on .

(M3) The function defined as is continuously differentiable on .

(M4) The observed-data log-likelihood function defined as is continuously differentiable on and

(M5) Define as There exists a continuously differentiable function , such that

 ∀s∈S,  ∀θ∈Θ,  L(s;ˆθ(s))≥L(s;θ).

We now introduce the usually required conditions for proving convergence of the SAEM procedure.

(SAEM1) For all in , , and .

(SAEM2) The functions and are times differentiable, where we recall that is an open subset of .

For any , we define and its expectation with respect to the posterior distribution denoted by . For any denote . We consider the following additional assumptions.

(H1)

There exists a constant such that

 L ={s∈S,⟨∇ℓ(^θ(s)),h(s)⟩=0}⊂{s∈S,−ℓ(^θ(s))

In addition, there exits such that is a compact set.

(H2)

The family of symmetric densities is such that there exist some constants and (for ) such that whenever .

(H3)

There are constants and with such that

 ξ≜infi=1,…,n∫Δδqi(x)dx>0,

and, for any sequence with , we may extract a subsequence with the property that, for some , all and all

 limjπ(~zj;θ)π(~zj−sign(zji)xei;θ)=0, % and limjπ(~zj+sign(zji)xei;θ)π(~zj;θ)=0.
(H4)

There exist , and such that for all ,

 |S(z)|≤Cπ(z;θ0)−ρ.

To state our convergence result, we consider the version of the algorithm with truncation on random boundaries studied by Andrieu et al. (2005). This additional projection step ensures in particular the stability of the algorithm for the theoretical analysis and is only a technical tool for the proof without any practical consequences.

###### Theorem 1.

Assume (M1)(M5), (SAEM1)(SAEM2) and (H1)(H4). Then, for all , the sequence generated by the mini-batch MCMC-SAEM algorithm with corresponding Markov kernel converges towards the set of critical points of the observed likelihood as the number of iterations increases.

###### Proof of Theorem 1.

The proof consists of two steps. First, we prove the convergence of the sequence of sufficient statistics towards the set of zeros of function using Theorem 5.5 in Andrieu et al. (2005). Then, in a second step, following the usual reasoning for EM-type algorithms, described for instance in Delyon et al. (1999), we deduce that the sequence converges to the set of critical points of the observed data log-likelihood .

#### First step.

In order to apply Theorem 5.5 in Andrieu et al. (2005), we need to establish that their conditions (A1) to (A4) are satisfied. In what follows, (A1) to (A4) always refer to the conditions stated in Andrieu et al. (2005). First, note that under our assumptions (H1), (M1)(M5) and (SAEM2), condition (A1) is satisfied. Indeed, this is a consequence of Lemma 2 in Delyon et al. (1999). To establish (A2) and (A3), as suggested in Andrieu et al. (2005), we establish their drift conditions (DRI), see Proposition 6.1 in Andrieu et al. (2005). We first focus on establishing (DRI1) in Andrieu et al. (2005). To this aim, we rely on Fort et al. (2003) that establishes results for the random-scan Metropolis sampler. In their context, they consider a sampler . We generalize their results to our setup according to (5). Following the lines of the proof of Theorem 2 in Fort et al. (2003), we can show that Equations (6.1) and (6.3) appearing in the drift condition (DRI1) in Andrieu et al. (2005) are satisfied as soon as (H2)-(H3) hold. Indeed following the strategy developed in Allassonnière et al. (2010), we first establish Equations (6.1) and (6.3) using a drift function depending on namely where is given in (H4). Then we define the common drift function as follow. Let and given in (H4) and define . Then for any compact , there exist two positive constants and such that for all and for all , we get . We then establish Equations (6.1) and (6.3) for this drift function . Moreover, using Proposition 1 and Proposition 2 in Fort et al. (2003) we obtain that Equation (6.2) in (DRI1) from Andrieu et al. (2005) holds. Under assumption (H4) we have the first part of (DRI2) in Andrieu et al. (2005). The second part of the same equation is true with in our case. Finally, (DRI3) in Andrieu et al. (2005) is true in our context with because is twice continuously differentiable, thus Lipschitz on any compact set. To prove this, we decompose the space in the acceptance and rejection regions and consider the integral over four sets leading to four different expressions of the acceptance ratio (see for example proof of Lemma 4.7 in Fort et al., 2015). This concludes that (DRI) and therefore (A2)–(A3) in Andrieu et al. (2005) are satisfied. Notice that (SAEM1) ensures (A4). This concludes the first step of the proof.

#### Second step.

As the function is continuous, the second step is immediate by applying Lemma 2 in Delyon et al. (1999). ∎

## 4 Experiments

We carry out various simulation experiments in a frailty model, a nonlinear mixed effects model and a Bayesian deformable template model to illustrate the performance of the proposed mini-batch MCMC-SAEM algorithm and the potential gain in efficiency.

### 4.1 Frailty model in survival analysis

In survival analysis the frailty model is an extension of the well-known Cox model (Duchateau and Janssen, 2008). Indeed, the hazard rate function in the frailty model includes an additional random effect called frailty to account for unexplained heterogeneity.

#### Model

We observe survival times measured over groups with measurements per group, and covariates . The latent random variables correspond to the frailty terms, each component being the frailty of one group. The

’s are supposed to be i.i.d. with centered Gaussian distribution with variance

. We denote by the baseline hazard function. Here we choose for the Weibull function given by

 λ0(t)=λ0ρtρ−1,t>0,

with and . The conditional hazard rate of observation given the frailty is given by

 λij(⋅|zi)=λ0(⋅)exp(Xij⊺β+zi),

where . Thus the unknown model parameter is . In practical applications the main interest lies in the estimation of the regression parameter .

#### Likelihood

In the frailty model the conditional survival function is given by

 Gij(t|zi) =P(tij>t|zi) =exp[−λ0tρexp(Xij⊺β+zi)].

In other words, the conditional distribution of the survival time given is the Weibull distribution with scale and shape . For the conditional and the complete likelihood function we obtain

 Pθ(t|z) =n∏i=1m∏j=1Gij(tij|zi)λij(tij|zi), Pθ(t,z) =Pθ(t|z)n∏i=1φ(ziσ),

where

denotes the density of the standard normal distribution.

The complete likelihood may be written in the form (1), with sufficient statistics where and for .

#### Simulation step

In the simulation step we use the following sampling procedure. Let be the subset of indices of latent variable components that have to be updated at iteration . For each , we use a Metropolis-Hastings procedure: first, draw a candidate from the normal distribution , then compute the logarithm of the acceptance ratio given by

 log(τk,i) =m(~zk,i−zk−1,i)−12σ2k−1(~z2k,i−z2k−1,i) −λ0(e~zk,i−ezk−1,i)(m∑j=1tρijexp[Xij⊺βk−1]).

Then, for a realization

of the uniform distribution

, we set if , and otherwise.

#### Stochastic approximation step

Compute the stochastic approximation of the sufficient statistics according to Eq. (2), that is, for , compute

 sk,l=(1−γk)sk−1,l+γkSl(zk),

where the sequence is chosen as for and otherwise.

#### Maximization step

The maximization of is explicit in and , with solutions given by

 σ2k =sk,0 λ0,k =nm(n∑i=1m∑j=1tρijexp[Xij⊺βk]sk,i)−1.

The update of and are done by Newton’s method, as these maximizations are not explicit.

#### Numerical results

In a simulation study we consider the frailty model with parameters fixed to , and . We set and . The covariates are drawn independently from the uniform distribution for every dataset.

In the first setting, 500 datasets are generated to which the mini-batch MCMC-SAEM with random initial values and different mini-batch sizes is applied, that is, with varying values of the proportion . Figure 1 shows the evolution of the precision of the mean of the estimates of parameter component as a function of the number of passes through the data. That means that the value 10 on the -axis, for example, corresponds to 10 iterations in the batch MCMC-SAEM () and to 100 iterations in the mini-batch MCMC-SAEM with . That is, parameter estimators are compared when the different algorithms have visited (approximately) the same amount of data, or, to put it differently, when the algorithms have generated the same number of latent components .

Figure 1 a) shows 80%-confidence bands for parameter component

, and graph b) shows the corresponding evolution of the logarithm of the mean squared error. Obviously, for all algorithms estimation improves when the number of passes through the data increases. Moreover, it is clear from both graphs that the rate of convergence depends extremely on the mini-batch size. Indeed, at (almost) any number of passes through the data, the mean squared error and the width of the confidence intervals is increasing in the proportion

. In this sense, the best choice that achieves the fastest convergence is a mini-batch size corresponding to the smallest value of , here . From graph b) we see that convergence seems to be attained after only three passes through the data when , while almost 30 are required when . For larger mini-batch sizes convergence is even much slower.

In the second setting, we aim at studying the asymptotic behavior of the estimates, when the algorithms are supposed to have converged. Here, estimates of the different algorithms are compared at a fixed large number of iterations. That is, after iterations the batch algorithm has visited the whole dataset times, while the mini-batch version with e.g. proportion has only visited half as much information. To compute the variance of the estimates that is only due to the stochasticity of the algorithm, we fix a dataset and run the mini-batch MCMC-SAEM algorithm 500 times using different initial values. We choose iterations and consider proportions varying in . Figure 2 illustrates the limit distributions of the estimates of for different mini-batch sizes, which seem to be Gaussians centered at the same value but with varying variances. Table 1 gives the corresponding standard deviation of the estimates of , which is clearly decreasing in . This is expected as more data are visited for larger . These results give some insight into the asymptotic behavior of the algorithms and in particular into the impact of the mini-batch size on the limit distribution, which is generally overlooked in the literature.

### 4.2 Nonlinear mixed model for pharmacokinetic study

In this section we consider a classical one-compartment model used in clinical pharmacokinetic studies. The model presented in Davidian and Giltinan (1995) serves to analyze the kinetic of the drug theophylline used in therapy for respiratory diseases. For and , we define

 yij=dikaiVikai−Cli[e−Clitij/Vi−e−kaitij]+ϵij,

where the observation is the measure of drug concentration on individual  at time . The drug dose administered to individual  is denoted . The parameters for individual  are the volume of the central compartment, the constant of the drug absorption rate, and the drug’s clearance . The random measurement error is denoted by and supposed to have a centered normal distribution with variance . For the individual parameters , and log-normal distributions are considered given by

 logVi = log(μV)+zi,1, logkai = log(μka)+zi,2, logCli = log(μCl)+zi,3,

where are independent following a centered normal distribution with variance . Then the model parameters are .

In a simulation study we estimate the parameters by both the original MCMC-SAEM algorithm (see Kuhn and Lavielle, 2005, for implementation details) and our mini-batch version. More precisely, one dataset is generated with the following values: . The dose is constant, equal to . The times are such that for all . Then we perform repetitions of the estimation task by using the mini-batch MCMC-SAEM algorithm with . We set for and otherwise.

The results for parameter are shown in Figures 3 and 4. The other results are similar and therefore omitted. Figure 3 presents the mean parameter estimate sequence with respect to the number of passes through the dataset as in Figure 1 for different values of the proportion . We observe that the smaller the proportion , the faster the sequence of estimates converges, in particular during the first passes through the data. Figure 4 presents for different values of the proportion the estimates of the empirical standard deviation with respect to the number of iterations. We observe that as the number of iterations increases, the standard deviations are lower than for higher values of . This illustrates in particular that including more data in the inference task leads to more accurate estimation results. This is indeed very intuitive. Therefore choosing an optimal value for remains to achieve a trade-off between speeding up the convergence and involving enough data in the process to get accurate estimates.

### 4.3 Deformable template model for image analysis

We consider the dense deformation template model introduced in Allassonnière et al. (2007). Such models allow to represent observed images as deformations of a given common reference image called template. We deal with the formulation proposed in Allassonnière et al. (2010). We observe gray level images denoted by . Each image is defined on a grid of pixels where for each , is the location of pixel in a domain . We assume that each image derives from the deformation of a common unknown template , which is a function from to . Furthermore we assume for each image the existence of an unobserved deformation field such that

 y(s)=I(xs−Φ(xs))+σϵ(s),

where are distributed with respect to a normalized Gaussian distribution and denotes the variance. To formulate this complex problem in a simpler parametric way, the template and the deformation are supposed to have parametric forms as follows. Given a set of landmarks denoted by which covers the domain , the template function is parametrized by coefficients through

 Iξ=Kpξ,where(Kpξ)(x)=kp∑k=1Kp(x,pk)ξ(k),

and is a fixed known kernel. Likewise, for another fixed set of landmarks , the deformation field is given by

 Φz(x)=(Kgz)(x)=kg∑k=1Kg(x,gk)(z(1)(k),z(2)(k)),

where and again, is a fixed known kernel. The variables are the latent variables of this model and are assumed to follow a centered Gaussian distribution with covariance matrix .

We refer to Allassonnière et al. (2010) for further details on the model and for the implemention of the MCMC-SAEM algorithm. We estimate all model parameters, namely , with both algorithms - the batch and the mini-batch with . For the first experiment we use images from the United States Postal Service database. We present the results obtained on digit as an illustrating example in Figure 5. We observe that during the first iterations of the algorithms the convergence is sped up by using the mini-batch version leading to a more contrasted and accurate template estimate after three passes through the dataset.

In the second experiment we assess the effect of the mini-batch version in the asymptotic behavior of the algorithm. Therefore we run the batch version on images of the dataset and the mini-batch version with on images of the United States Postal Service database, both during iterations to reach asymptotic behavior. Note that choosing a small mini-batch size allows us to increase the number of images in the input, while the computing time of both algorithms is of the same order. To assess simultaneously the accuracy of the obtained estimates for the template and the deformation, we generate new samples from the model using both estimates of and . These results are presented in Figure 6. We observe that the samples of the second row look more like usual handwritten digits as those of the database than the ones of the first row, highlighting that both template and deformation are better estimated by the mini-batch version performed on images of the dataset. This shows that given a constraint on the computing time, more accuracy can be obtained by using the mini-batch MCMC-SAEM instead of the original algorithm.

## 5 Conclusion

The proposed mini-batch version of the MCMC-SAEM algorithm has a good theoretical foundation, as it is shown to be almost surely convergent whenever the original algorithm is. Moreover, in the simulations carried out in different models, the new algorithm turns out to achieve an important speed-up of convergence during the first passes through the data compared to the original algorithm. This opens the way to possibly drastic reductions of the computing time, or to increase accuracy by processing larger datasets and keeping the computing time fixed. Furthermore, our investigations on the limit distribution of the sequence of estimates generated by the algorithm yields: the larger the mini-batch size, the more concentrated is the limit distribution.

These results encourage the development of other mini-batch EM-type algorithms, as for example for the variational EM algorithm, in order to achieve similar speed-ups in further models and applications.

Finally, the question about an optimal choice of the mini-batch size arises naturally. According to our findings, an optimal value of the proportion must simultaneously achieve two objectives: speeding up the convergence to drastically reduce the computing time, while estimating accurately model parameters. Therefore, it is of great interest to develop an empirical criterion leading to an optimal choice of the mini-batch size by operating a trade-off between accuracy and computing time.

All programs are available on request from the authors.

## Acknowledgement

Work partly supported by the grant ANR-18-CE02-0010 of the French National Research Agency ANR (project EcoNet).

## References

• Allassonnière et al. (2007) Allassonnière, S., Y. Amit, and A. Trouvé (2007). Toward a coherent statistical framework for dense deformable template estimation. J. Roy. Statist. Soc. Ser. B 69, 3–29.
• Allassonnière et al. (2010) Allassonnière, S., E. Kuhn, and A. Trouvé (2010). Construction of Bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli 16(3), 641–678.
• Andrieu et al. (2005) Andrieu, C., E. Moulines, and P. Priouret (2005). Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44(1), 283–312.
• Cappé (2011) Cappé, O. (2011).

Online EM algorithm for hidden Markov models.

Journal Computational and Graphical Statistics 20(3), 728–749.
• Cappé and Moulines (2009) Cappé, O. and E. Moulines (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. Ser. B 71(3), 593–613.
• Davidian and Giltinan (1995) Davidian, M. and D. M. Giltinan (1995). Nonlinear Models for Repeated Measurement Data. Chapman & Hall/CRC Press.
• Delyon et al. (1999) Delyon, B., M. Lavielle, and E. Moulines (1999). Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist. 27(1), 94–128.
• Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39(1), 1–38.
• Duchateau and Janssen (2008) Duchateau, L. and P. Janssen (2008). The frailty model. New York, NY: Springer.
• Fort et al. (2015) Fort, G., B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz (2015). Convergence of the wang-landau algorithm. Mathematics of Computation 84(295), 2297–2327.
• Fort et al. (2003) Fort, G., E. Moulines, G. O. Roberts, and J. S. Rosenthal (2003). On the geometric ergodicity of hybrid samplers. Journal of Applied Probability 40, 123–146.
• Karimi et al. (2018) Karimi, B., M. Lavielle, and E. Moulines (2018). On the convergence properties of the mini-batch EM and MCEM algorithms. Unpublished.
• Kuhn and Lavielle (2004) Kuhn, E. and M. Lavielle (2004). Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: P&S 8, 115–131.
• Kuhn and Lavielle (2005) Kuhn, E. and M. Lavielle (2005). Maximum likelihood estimation in nonlinear mixed effects models. Comput Stat Data An. 49(4), 1020–1038.
• Lange (1995) Lange, K. (1995, 01). A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Statist. Soc. Ser. B 2(57), 425–437.
• Liang and Klein (2009) Liang, P. and D. Klein (2009). Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pp. 611–619. Association for Computational Linguistics.
• Neal and Hinton (1999) Neal, R. M. and G. E. Hinton (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models, pp. 355–368. Cambridge, MA, USA: MIT Press.
• Nguyen et al. (2019) Nguyen, H., F. Forbes, and G. McLachlan (2019). Mini-batch learning of exponential family finite mixture models. Technical report, arXiv:1902.03335.
• Robert and Casella (2004) Robert, C. P. and G. Casella (2004). Monte Carlo statistical methods (Second ed.). Springer Texts in Statistics. Springer-Verlag, New York.
• Titterington (1984) Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. Ser. B 2(46), 257–267.