1 Introduction
Expectation Maximization (EM) is a widely used and general technique for estimating maximum likelihood parameters of a latent variable model (Dempster et al., 1977). We will be considering models with a sequential structure. Elegant algorithms are available for special cases of sequential models, such as linear systems with Gaussian noise (Shumway and Stoffer, 1982)
, and finitestate hidden Markov models
(Baum, 1972). Here we focus on inference in complex models that do not admit analytic solutions, for which sequential Monte Carlo (SMC) methods are widely used to approximate the expectation in the Estep. Generally, the use of Monte Carlo methods in the context of EM is known as stochastic approximation EM (SAEM; Delyon et al. 1999) and this class of methods is favored in practice over gradientbased approaches due to their relative stability and computational efficiency when estimating high dimensional parameters (Chitralekha et al., 2010; Kantas et al., 2009).Convergence of EM methods can nevertheless be slow for complex models and/or with large data volumes. Several authors have proposed acceleration techniques (Jamshidian and Jennrich, 1993; Lange, 1995; Varadhan and Roland, 2008), but these require that the Estep is analytically tractable. For SAEM standard recursive EM methods are used instead, the two most popular being batch EM (BEM) and online EM (OEM). Both methods require the user to specify a tuning parameter, and in both cases the performance of the algorithm is strongly dependent on the chosen parameter. For instance, for BEM, very large batch sizes lead to inaccurate estimates because of slow convergence, whereas very small batch sizes lead to imprecise estimates due to the inherent stochasticity of the model within a small batch of observations. The optimal batch size in BEM, or equivalently the optimal learning rate in OEM, depends on the particularities of the model.
While the relative merits of these and other methods for parameter estimation have been studied in detail (see e.g. Kantas et al. 2009
), the problem of choosing optimal learning rates has received relatively little attention. Here we introduce a novel algorithm, termed Introspective Online EM (IOEM), which removes the need for setting the learning rate altogether by estimating the optimal parameterspecific learning rate along with the parameters of interest. This is particularly helpful when inferring parameters in a high dimensional model, since the optimal tuning parameter may differ between parameters. Broadly, IOEM works by estimating both the precision and the accuracy of parameters in an online manner through weighted linear regression, and uses these estimates to tune the learning rate so as to improve both simultaneously.
The outline of this paper is as follows. Sect. 2 uses a oneparameter autoregressive statespace model to introduce BEM, OEM, and a simplified version of IOEM. Sect. 3 considers the full 3parameter autogressive model, which requires the complete IOEM algorithm. Sect. 4
considers a 2dimensional autoregressive model to show the benefit of the proposed algorithm when inferring many parameters. Finally, Sect.
5 demonstrates desirable performance in the stochastic volatility model, an important case as it is nonlinear and hence more similar to applications of SAEM.2 EM for a Simplified Autoregressive Model
Here we review SMC, BEM, OEM, and present the IOEM algorithm with a simple model. This illustrates the main concepts behind IOEM before delving into details in Sect. 3.
We consider a simple autoregressive model with one unknown parameter. We observe the sequence of random variables
which depends on the unobserved sequence , as follows:(1) 
where and are i.i.d. standard normal variates, and are known parameters, and is unknown. Under this model, we have the following transition and emission densities:
We have chosen as the unknown parameter as it is the most straightforward to estimate, allowing us to introduce the idea of IOEM without certain complications which we address in Sect. 3. As and are members of the exponential family of distributions, the M step of EM can be done using sufficient statistics, and so the E step amounts to the expectation of the sufficient statistics. In this model, the parameter has the sufficient statistic
(2) 
The estimate of is obtained by setting . More generally, for an unknown parameter , where is a known function mapping sufficient statistics to parameter estimates.
To estimate , we use sequential Monte Carlo (SMC) to simulate particles and their associated weights , , so that
(3) 
approximates the distribution . The standard MCEM approximation of would require storage of all observations , the simulation of each time is updated, and ideally an increasing Monte Carlo sample size as the parameter estimates near convergence. To avoid this, we employ SAEM which effectively averages over previous parameter estimates as an alternative to generating a new Monte Carlo sample every time an estimate is updated, and hence is more suitable to online inference. This method as proposed in Cappé and Moulines (2009) approximates the expectation in (2) recursively.
The outline of the SMC with EM algorithm we consider in this paper is as follows:
Algorithm 1 Sequential Importance Resampling (bootstrap filter)
For time :


Compute normalized weights satisfying

Update to using chosen EM method

Resample particles if
Here is the initial distribution for , is the effective sample size defined as , , and is shorthand for the coordinate of . In models with multiple unknown parameters, each parameter is updated in step 3 of the algorithm, however we will refer only to a single parameter to keep the notation simple.
Throughout this paper we follow common practice in using the fixedlag technique in order to reduce the mean square error between and (Cappé and Moulines, 2005; Cappé et al., 2007). In particular, we choose a lag and then at time , using particles shaped by data , estimate the term of the summation in (2). We will use to denote the coordinate of the particle , but we will continue to write as a shorthand for . (see Table 1 for an overview of notation used in this paper.)
The fixedlag technique involves making the approximation
(4) 
where we assume that can be written as
This allows to be updated in an online manner by computing the componentwise sufficient statistics
allowing to be updated as
with some weight ; in (2) . This approach is slightly different from that of (Cappé and Moulines, 2005); see Sect. 7.1 for a discussion.
Choosing a large value of allows SMC to use many observations to improve the posterior distribution of . However the cost of a large
is a loss in particle independence due to the resampling procedure which increases the sample variance. The optimal choice for
balances the opposing influences of the forgetting rate of the model and the collapsing rate of the resampling process due to the divergence between the proposal distribution and the posterior distribution. For the examples in this paper we chose as recommended by Cappé and Moulines (2005), which seems to be a reasonable choice for our models.There are various other techniques to improve on this basic SMC method, including improved resampling schemes (Douc and Cappé, 2005; Olsson et al., 2008; Doucet and Johansen, 2009; Cappé et al., 2007), and choosing better sampling distributions through lookahead strategies or resamplemove procedures (Pitt and Shephard, 1999; Lin et al., 2013; Doucet and Johansen, 2009), which are not discussed further here. Instead, in the remainder of this paper, we focus on the process of updating the parameter estimates . The remainder of this section describes the options for step 3 of Algorithm 1.
2.1 Batch Expectation Maximization
Batch Expectation Maximization (BEM) processes the data in batches. Within a batch of size , the parameter estimate stays constant () and the update to the sufficient statistic
is collected at each iteration . At the end of the th batch we have , at which time
is our approximation of , and .
The batch size determines the convergence behavior of the estimates. For a fixed computational cost, choosing too small will result in noisedominated estimates and low precision, whereas choosing too large will result in precise but inaccurate estimates due to slow convergence.
2.2 Online Expectation Maximization
BEM only makes use of the collected evidence at the end of each batch, missing potential early opportunities for improving parameter estimates. OEM addresses this issue by updating the parameter estimate at every iteration. The approximation of at time is a running average of , weighted by a prespecified weighting sequence. The choice of weighting sequence determines how quickly the algorithm “forgets” the earlier parameter estimates. In OEM at time ,
(5) 
where is the chosen weighting sequence, typically of the form for a chosen (Cappé, 2009). Note that when using lag , for . This update rule ensures that at time , is a weighted sum of where the term has weight
(6) 
Algorithm 2 Online Expectation Maximization for a simplified autoregressive model
For time :

Simulate and calculate weights of new particles as outlined in Algorithm 1

Collect sufficient statistic

Update running average of sufficient statistics

Maximize expected likelihood by setting
Although this method can outperform BEM, its performance remains strongly dependent on the parameter determining the weighting sequence, and a suboptimal choice can reduce performance by orders of magnitude. At one extreme, the estimates will depend strongly only on the most recent data, resulting in noisy parameter estimates and low precision. At the other extreme, the estimates will average out stochastic effects but be severely affected by false initial estimates, resulting in more precise but less accurate estimates. Again, the best choice depends on the model.
A pragmatic approach to the problem of choosing a tuning parameter in OEM takes inspiration from Polyak (1990). In this method, a weight sequence that emphasizes incoming data is used to ensure quick initial convergence, while imprecise estimates are avoided at later iterations by averaging all OEM estimates beyond a threshold .
Choosing an appropriate threshold can be more straightforward than choosing for , but it still requires the user to have an intuition for how the estimates for each parameter will behave. We will refer to this method as AVG, use , and set which is half the total iterations for our examples.
2.3 Introspective Online Expectation Maximization
We now introduce IOEM to address the issue of having to prespecify a weighting sequence . The algorithm is similar to OEM, but instead of prespecifying , we estimate the precision and accuracy in the sufficient statistic updates and use these to determine the next weight . More precisely, we keep online estimates of a weighted regression on the dependent variables where the index serves as the explanatory variable and the data point has weight (6) as before. This weighted regression results in intercept and slope estimates , , and estimates of their variance , . We next use these estimates to define a proposed weight as follows:
This definition of ensures that a substantial slope estimate indicating low accuracy in our previous parameter estimates will put a large weight on the incoming statistic, improving accuracy. A large reflecting low precision in the estimates will result in a small weight, so that successive estimates are smoothed out, improving precision.
We do not use standard weighted regression, where the weights are assumed to be inversely proportional to the variance of the observation, as this assumption is not justified here; the standard prodecure would lead to biased estimates of and would impact the performance of IOEM. Instead we assume that observations share an unknown variance, and we use the weights to modulate the influence of each observation to the estimates of both and . See Sect. 7.2 for details.
We impose restrictions on which keep it between the most extreme choices for OEM. Taken together, the update step for becomes
(7) 
where is chosen to be very close to and guarantees convergence. These restrictions ensure that our algorithm satisfies the assumptions of Theorem 1 of Cappé and Moulines (2009), namely that , , and . Hence for any model for which and satisfy the assumptions guaranteeing convergence of the standard OEM estimator, the IOEM algorithm is also guaranteed to converge. The precise conditions are detailed in Assumption 1, Assumption 2, and Theorem 1 of Cappé and Moulines (2009).
Algorithm 3 Introspective Online Expectation Maximization for a simplified autoregressive model
For time :

Simulate and calculate weights of new particles using SMC with parameter

Collect sufficient statistic

Maximize expected likelihood by setting

Perform weighted regression on to calculate
The results of using BEM, OEM, and IOEM to perform parameter inference on model (1) with a wide range of tuning parameters from to , and from to , are presented in Figure 1. The choice of tuning parameter in BEM and OEM makes a significant difference to the precision of the estimate even after 100,000 observations. IOEM was able to recognize that behavior similar to BEM with or OEM with was optimal. The accuracy and precision of IOEM are comparable with those of the postOEM averaging technique (AVG) with parameters and .
The adapting weight sequence sets IOEM apart from OEM. This formulation of IOEM only works in the setting where has a linear relationship with a single sufficient statistic (here ) and is meant as an introduction to some of the ideas involved in IOEM. The method outlined in Algorithm 3 will not suffice when the function mapping the sufficient statistics to does not have this simple form. We introduce the general IOEM algorithm in Sect. 3 below.
3 EM Simulations in the Full Autoregressive Model
The model of Sect. 2 is special in that the sufficient statistic and the parameter of interest coincide. Generally this is not true, leading to a more involved setup that we explore here. To this end, we now consider the full noisilyobserved autoregressive model AR(1) with master equations as in (1), but now with unknown parameters , , and . We define four sufficient statistics,
Then, in BEM and OEM, we update the parameter estimates to
(8)  
(9)  
(10) 
where is an approximation of .
In most cases, as above, the function mapping to is nonlinear, and requires multiple sufficient statistics as input. To avoid bias, we want all sufficient statistics that inform one parameter estimate to share a weight sequence . We therefore estimate an adapting weight sequence for each parameter independently, by performing the regression on the level of the parameter estimates (Algorithm 4), rather than on the level of the sufficient statistics. We will calculate as in OEM (5) using our adapting weight sequence instead of a user specified weighting sequence. Because the adapting weight sequence is specific to each parameter, we will have multiple estimates of certain summary sufficient statistics. In this case and are estimated by and for (8) and by and for (9).
Simply regressing on with respect to would correspond to regression on , not . As is a running average, there is a strong correlation between and and hence also a strong dependence between and . In order to perform the regression on the parameters we must “unsmooth” to create pseudoindependent parameter updates (see Algorithm 4). This is accomplished by taking linear combinations,
where the coefficients are chosen so as to minimize the covariance between successive updates, justifying the term pseudoindependent. The resulting updates correspond with the unsmoothed sufficient statistics updates used in Sect. 2.3. See Sect. 7.3 for further details on this step.
For time : Simulate and calculate weights of new particles using SMC with parameter Collect sufficient statistics Update running average of sufficient statistics
Maximize expected likelihood Create pseudoindependent parameter updates
Perform weighted regression on to calculate
Estimates for the parameter under different EM methods are presented in Fig. 2; for the other parameter inferences see Sect. 7.5, Fig. 5. In the AR(1) model, IOEM outperforms most other EM methods when estimating the parameter. It is worth noting that in this case, OEM with substantially outperforms OEM with . This is a result of the bad initial estimates. OEM with forgets the earlier simulations much faster than OEM with and hence is able to move its estimates of , , and much more quickly. Here IOEM recognizes that it should have similar behavior to OEM with , whereas in the inference displayed in Figure 1 IOEM chose behavior similar to OEM with . IOEM can indeed adapt to the model.
4 EM Simulations in a TwoDimensional AR Model
Now we investigate a model with a larger number of parameters and varying accuracy of initial parameter estimates. IOEM’s main advantage over OEM is its ability to adapt to each parameter independently. To highlight this, we applied IOEM to a simple 2dimensional autoregressive model. For this model we consider the sequences as observed, while are unobserved, where
(11) 
Note that and are uncoupled, and that their master equation have independent parameters except for a shared parameter . By giving component good initial estimates and bad initial estimates, we can see how the different EM methods cope with a combination of accurate and inaccurate initializations. IOEM is able to identify the set with good initial estimates () and quickly start smoothing out noise. To IOEM, the other parameters appear to not have converged ( and because they are at the wrong value, because it will be changing to compensate for and ).
OEM with and OEM with both suffer in this model as they are both well suited to parameter estimation in one of the components, but not the other. IOEM on the other hand is able to capture the best of both worlds, striving for precision in component A and initially foregoing precision in favour of accuracy in component B.
5 Stochastic volatility model
The previous sections have demonstrated IOEM is comparable to choosing the optimal tuning parameter in OEM or BEM in certain models. However, the models shown have all been based on the noisily observed autoregressive model, which is a linear Gaussian case where in practice analytic techniques would be prefered over SAEM. We now examine the behaviour of these algorithms when inferring the parameters of a nonlinear stochastic volatility model defined by transition and emission densities
6 Conclusion
Stochastic Approximation EM is a general and effective technique for estimating parameters in the context of SMC. However, convergence can be slow, and improving convergence speed is of particular interest in this setting. We have shown that IOEM produces accurate and precise parameter estimates when applied to continuous statespace models. Across models, and across varying levels of accuracy of the intial estimates, the efficiency of IOEM matches that of BEM/OEM with the optimal choice of tuning parameter. The AVG procedure also shows good behaviour, but like BEM/OEM it has tuning parameters, and when these are chosen suboptimally performance is not as good as IOEM (Figs. 89). In addition, BEM/OEM/AVG all make use of a single learning schedule , and for more complex models a single learning schedule generally cannot achieve optimal convergence rates for all parameters, as we have shown for the 2dimensional AR example.
IOEM finds parameterspecific learning schedules, resulting in better performance than standard methods with a single learning rate parameter are able to achieve. IOEM can be applied with minimal prior knowledge of the model’s behavior, and requires no user supervision, while retaining the convergence guarantees of BEM/OEM, therefore providing an efficient, practical approach to parameter estimation in SMC methods.
=0mu plus 1mu
References
 Baum (1972) Leonard E Baum. An equality and associated maximization technique in statistical estimation for probabilistic functions of markov processes. Inequalities, 3:1–8, 1972.
 Cappé (2009) Olivier Cappé. Online sequential monte carlo em algorithm. In Statistical Signal Processing, 2009. SSP’09. IEEE/SP 15th Workshop on, pages 37–40. IEEE, 2009.
 Cappé and Moulines (2005) Olivier Cappé and Eric Moulines. On the use of particle filtering for maximum likelihood parameter estimation. In Signal Processing Conference, 2005 13th European, pages 1–4. IEEE, 2005.
 Cappé and Moulines (2009) Olivier Cappé and Eric Moulines. Online expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.
 Cappé et al. (2007) Olivier Cappé, Simon J Godsill, and Eric Moulines. An overview of existing methods and recent advances in sequential monte carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
 Chitralekha et al. (2010) Saneej B Chitralekha, J Prakash, H Raghavan, RB Gopaluni, and Sirish L Shah. A comparison of simultaneous state and parameter estimation schemes for a continuous fermentor reactor. Journal of Process Control, 20(8):934–943, 2010.
 Delyon et al. (1999) Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochastic approximation version of the em algorithm. The Annals of Statistics, 27(1):94–128, 1999.
 Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
 Douc and Cappé (2005) Randal Douc and Olivier Cappé. Comparison of resampling schemes for particle filtering. In Image and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium on, pages 64–69. IEEE, 2005.

Doucet and Johansen (2009)
Arnaud Doucet and Adam M Johansen.
A tutorial on particle filtering and smoothing: Fifteen years later.
Handbook of Nonlinear Filtering
, 12:656–704, 2009.  Jamshidian and Jennrich (1993) Mortaza Jamshidian and Robert I Jennrich. Acceleration of the em algorithm by using quasinewton methods. Journal of the American statistical association, 88(421):221–228, 1993.
 Kantas et al. (2009) Nicholas Kantas, Arnaud Doucet, Sumeetpal Sindhu Singh, and Jan Marian Maciejowski. An overview of sequential monte carlo methods for parameter estimation in general statespace models. In 15th IFAC Symposium on System Identification (SYSID), SaintMalo, France.(invited paper), volume 102, page 117, 2009.
 Kaufman (1995) P.J. Kaufman. Smarter Trading: Improving Performance in Changing Markets. McGrawHill, 1995. ISBN 9780070340022. URL https://books.google.co.uk/books?id=ndq_21wRJjEC.
 Lange (1995) Kenneth Lange. A quasi newton acceleration of the em algorithm. Statistica Sinica, 5:1–18, 1995.
 Lin et al. (2013) Ming Lin, Rong Chen, Jun S Liu, et al. Lookahead strategies for sequential monte carlo. Statistical Science, 28(1):69–94, 2013.
 Olsson et al. (2008) Jimmy Olsson, Olivier Cappé, Randal Douc, Eric Moulines, et al. Sequential monte carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli, 14(1):155–179, 2008.
 Pitt and Shephard (1999) Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association, 94(446):590–599, 1999.
 Polyak (1990) Boris Teodorovich Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, (7):98–107, 1990.
 Shumway and Stoffer (1982) Robert H Shumway and David S Stoffer. An approach to time series smoothing and forecasting using the em algorithm. Journal of time series analysis, 3(4):253–264, 1982.
 Varadhan and Roland (2008) Ravi Varadhan and Christophe Roland. Simple and globally convergent methods for accelerating the convergence of any em algorithm. Scandinavian Journal of Statistics, 35(2):335–353, 2008.
7 Supplemental text
7.1 Fixedlag technique
Our fixedlag technique is slightly different than that proposed in the literature (Cappé and Moulines, 2005; Olsson et al., 2008). Compared to the existing approach it uses less intermediate storage. Recall that the approximation we aim to evaluate is
where the sufficient statistic is written explicitly as a sum over the path traced out by the particle . The drawback is that for the paths will have collapsed due to resampling, increasing the variance for those contributions to . The solution proposed in Cappé and Moulines (2005) is to use instead the approximation
This requires storing the quantities
for each sufficient statistic and each particle. This storage can be expensive if large numbers of sufficient statistics are tracked. Instead, at iteration we use the approximation
By disregarding terms involving for and switching the summation in this way, we can now update at each iteration by adding the contribution of the current particles to a single summary statistic at a distance , without requiring perparticle storage other than each particle’s recent history.
7.2 Weighted regression
The term “weighted regression” usually refers to regression where the errors are independent and normally distributed with zero mean and known variance (up to a multiplicative constant), and the data is weighted inversely proportionally to its variance. In our case, the data is assumed to drift, contributing an additional, nonindependent term to the error. Weights are used to only focus on recent data where the drift contributes an error of the same order of magnitude as the normally distributed noise, while discounting the impact of data points further away. In this setup we are interested both in estimating the regression coefficients, and the error in these estimates.
Perry Kaufman’s adaptive moving average (AMA) (Kaufman, 1995) is a similar averaging technique which reacts to the trends and volatility (jointly referred to as the behavior) of the sequence. The difference lies in the measure of the behavior. AMA relies on a user specified window length . The most recent data points are used to measure the behavior. This would be equivalent to using equallyweighted linear regression over the last points. By using weighted regression, the contribution of points to the behavior measures is also influenced by the previously observed behavior. For example, a sharp trend will effectively employ a smaller value as we have lost interest in the behavior before that trend.
Let be the matrix consisting of a column of s and a column with the dependent variable, let
be the vector of observations, let
be the two coefficients, and the vector of errors, with . Finally let be a vector of weights. We estimate by minimizingwhere and are defined as
Setting the derivative to zero and solving for results in the standard estimator for weighted regression
or more explicitly
From this expression we can see that can be updated in an online manner as increases simply by updating the above summations. The variance in can be estimated as follows:
If this simplifies to the usual . Writing out the expression for explicitly shows that it is again possible to find online updates for the relevant terms.
7.3 Pseudoindependent parameter updates
In order to perform our regression on the level of the parameters, we need to map from to and then to . We do not wish to regress on , as and are highly correlated. Instead we want a sequence defined in the parameter space where the correlations resemble those in . We define this sequence as
Here we show that and are uncorrelated for all , under the assumption that and are uncorrelated (). Define to be the sequence that satisfies and . Note that , , and so on. Now,
(15) 
Comments
There are no comments yet.