# Efficient data augmentation techniques for Gaussian state space models

We propose a data augmentation scheme for improving the rate of convergence of the EM algorithm in estimating Gaussian state space models. The scheme is based on a linear transformation of the latent states, and two working parameters are introduced for simultaneous rescaling and re-centering. A variable portion of the mean and scale is thus moved into the missing data. We derive optimal values of the working parameters (which maximize the speed of the EM algorithm) by minimizing the fraction of missing information. We also study the large sample properties of the working parameters and their dependence on the autocorrelation and signal-to-noise ratio. We show that instant convergence is achievable when the mean is the only unknown parameter, and this result is extended to Gibbs samplers and variational Bayes algorithms.


## 1 Introduction

The expectation-maximization or EM algorithm (Dempster et al., 1977) is an iterative method for maximum likelihood estimation that is widely used in missing data problems. For state space models, the EM algorithm provides a natural approach for estimating the model parameters (e.g. Shumway and Stoffer, 1982), as the latent states can be taken as the missing data and the E-step can be performed using smoothed estimates from the Kalman filter (Kalman, 1960). Compared with other likelihood maximization techniques such as scoring or Newton-Raphson, the EM algorithm has some attractive features such as numerical stability (each iteration increases the observed data likelihood) and guaranteed convergence to a local maximum for exponential families (Wu, 1983). However, the EM algorithm has been found to converge slowly in the later stages of the iterative procedure despite being able to move quickly to a region close to the maximum (Watson and Engle, 1983; Harvey and Peters, 1990).

The rate of convergence of the EM algorithm is governed by the fraction of missing information in the data augmentation scheme. Let $y_{aug} = (y, x)$ denote the complete or augmented data, where $y$ is the observed data and $x$ is the missing data, and let $\theta^*$ be the maximum likelihood estimate of an unknown parameter $\theta$. The (matrix) rate of convergence of the EM algorithm (Dempster et al., 1977) is

$$DM(\theta^*) = I_{mis}(\theta^*)I_{aug}^{-1}(\theta^*) = I - I_{obs}(\theta^*)I_{aug}^{-1}(\theta^*), \quad I_{mis}(\theta) = -E_{p(x|y,\theta)}\left\{\frac{\partial^2\log p(x \mid y,\theta)}{\partial\theta\,\partial\theta^T}\right\}, \quad I_{aug}(\theta) = -E_{p(x|y,\theta)}\left\{\frac{\partial^2\log p(y_{aug} \mid \theta)}{\partial\theta\,\partial\theta^T}\right\}. \tag{1}$$

The EM algorithm converges slowly ($DM$ is close to the identity) if the proportion of missing information relative to the augmented information is large. Conversely, convergence is rapid ($DM$ is close to zero) if the observed information is close to the augmented information. Thus, the rate of convergence can be optimized by minimizing the missing information in the data augmentation scheme. Meng and van Dyk (1997, 1998) develop augmentation schemes for the multivariate $t$-model and mixed effects models by rescaling the missing data and introducing "working parameters" into the model that control the portion of the scale factor that is shifted into the missing data. Optimal values of the working parameters are then found by minimizing the fraction of missing information. A similar approach was considered by Tan et al. (2007) for quadratic optimization problems. The PX-EM algorithm (Liu et al., 1998) also seeks to reduce the fraction of missing information, albeit by expanding the set of model parameters to adjust the covariance among parameters in the M-step.

In this article, we consider Gaussian state space models of the form:

$$y_t = x_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2), \qquad x_t = \mu + \phi(x_{t-1} - \mu) + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \qquad x_0 \sim N(\mu, \sigma_\eta^2/(1-\phi^2)), \quad (t = 1, \dots, n), \tag{2}$$

where $y = (y_1, \dots, y_n)^T$ are the observations, $x = (x_0, x_1, \dots, x_n)^T$ are the latent states, $|\phi| < 1$, $\sigma_\epsilon^2 > 0$ and $\sigma_\eta^2 > 0$. The $\epsilon_t$ and $\eta_s$ are independent for all $t$ and $s$, and are independent of $x_0$. Let $\theta = (\mu, \phi, \sigma_\epsilon^2, \sigma_\eta^2)^T$ be the vector of model parameters and $\gamma = \sigma_\eta^2/\sigma_\epsilon^2$ be the signal-to-noise ratio. This model is also called the AR(1) plus noise or ARMA(1,1) model in time series analysis (Durbin and Koopman, 2012). We propose a data augmentation scheme that introduces two working parameters $a$ and $w = (w_0, w_1, \dots, w_n)^T$ to rescale and re-center the latent states. For $t = 0, 1, \dots, n$, we define the transformed latent state $\alpha_t$ as

$$\alpha_t = \sigma_\eta^{-a}(x_t - w_t\mu), \quad a \in \mathbb{R}, \; w_t \in \mathbb{R}. \tag{3}$$
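As a concrete illustration, the model (2) and transformation (3) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the helper names and parameter values are ours, chosen arbitrarily.

```python
import numpy as np

def simulate_ssm(n, mu, phi, sigma_eps, sigma_eta, rng):
    """Simulate the AR(1) plus noise model (2)."""
    x = np.empty(n + 1)
    x[0] = mu + rng.normal(0.0, sigma_eta / np.sqrt(1.0 - phi**2))  # stationary x_0
    for t in range(1, n + 1):
        x[t] = mu + phi * (x[t - 1] - mu) + rng.normal(0.0, sigma_eta)
    y = x[1:] + rng.normal(0.0, sigma_eps, size=n)   # y_t = x_t + eps_t
    return y, x

def transform_states(x, mu, sigma_eta, a, w):
    """Transformation (3): alpha_t = sigma_eta^{-a} (x_t - w_t * mu)."""
    return sigma_eta ** (-a) * (x - w * mu)

rng = np.random.default_rng(0)
mu, sigma_eta = 2.0, 0.5
y, x = simulate_ssm(200, mu=mu, phi=0.8, sigma_eps=1.0, sigma_eta=sigma_eta, rng=rng)
w = np.full(x.shape, 0.5)        # partially noncentered location (w_t = 0.5 for all t)
alpha = transform_states(x, mu, sigma_eta, a=1.0, w=w)
# the transformation is invertible: x_t = sigma_eta^a * alpha_t + w_t * mu
assert np.allclose(sigma_eta ** 1.0 * alpha + w * mu, x)
```

Setting `a=0, w=0` leaves the states untouched (centered), while `a=1, w=1` standardizes them (noncentered); intermediate values give the partially noncentered schemes studied below.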

Suppose $w_t$ is common for all $t$. In the context of Markov chain Monte Carlo algorithms, (3) is well known as the noncentered parametrization when $a = 1$ and $w_t = 1$, and the centered parametrization when $a = 0$ and $w_t = 0$. It has been shown, for instance, for random effect models (Gelfand et al., 1995, 1996), Gaussian state space models (Pitt and Shephard, 1999; Frühwirth-Schnatter, 2004) and a large class of hierarchical models (Papaspiliopoulos et al., 2003, 2007), that the convergence rates of Markov chain Monte Carlo algorithms depend critically on the parametrization (Roberts and Sahu, 1997). The centered and noncentered parametrizations have been found to play complementary roles in that the Gibbs sampler often converges much faster under one of these parametrizations than the other. To take advantage of this contrasting feature, Yu and Meng (2011) propose an ancillarity-sufficiency strategy that interweaves these two parametrizations.

A less well-known alternative is the partially noncentered parametrization, which has been shown to be capable of yielding better convergence than both the centered and noncentered parametrizations for fitting random effect models using Gibbs samplers (Papaspiliopoulos et al., 2003). Partially noncentering (of the mean) assumes the form of (3) with $w_t \in (0, 1)$, and is thus regarded as lying on the continuum between centering and noncentering. In this article, we consider a data augmentation scheme or parametrization of the Gaussian state space model where both the location and scale are partially noncentered, and investigate how this scheme can be optimized to construct efficient EM algorithms. As the rates of convergence of EM algorithms and Gibbs samplers are closely linked (Sahu and Roberts, 1999), we show how some of our results carry over to Gibbs samplers and in fact variational Bayes algorithms as well.

## 2 EM algorithm for partially noncentered Gaussian state space model

Let $\tilde w_t = 1 - w_t$ and $\alpha = (\alpha_0, \alpha_1, \dots, \alpha_n)^T$. We can express (2) in terms of $\alpha$ as

$$y_t = \sigma_\eta^a\alpha_t + w_t\mu + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2), \qquad \sigma_\eta^a\alpha_t = \tilde w_t\mu + \phi(\sigma_\eta^a\alpha_{t-1} - \tilde w_{t-1}\mu) + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2), \qquad \sigma_\eta^a\alpha_0 \sim N(\tilde w_0\mu, \sigma_\eta^2/(1-\phi^2)), \quad (t = 1, \dots, n). \tag{4}$$

Now $\mu$ and $\sigma_\eta$ appear in both the state and observation equations. When $a = 0$ and $w = 0$, we recover the centered parametrization in (2), so called as the latent state is centered around its a priori expected value $\mu$, and the parameters $\mu$ and $\sigma_\eta$ appear only in the state equation. The noncentered parametrization is obtained when $a = 1$ and $w = 1$. In matrix notation, (4) can be expressed as

$$y \mid \alpha, \theta \sim N(\sigma_\eta^a\alpha + \mu w, \sigma_\epsilon^2 I), \qquad \sigma_\eta^a\alpha \mid \theta \sim N(\mu\tilde w, \sigma_\eta^2\Lambda^{-1}),$$

where $\tilde w = 1 - w$ and $\Lambda$ is a symmetric tridiagonal matrix with diagonal elements $(1, 1+\phi^2, \dots, 1+\phi^2, 1)$ and off-diagonal elements equal to $-\phi$. For $|\phi| < 1$, $\Lambda$ is invertible. We use $I$, $1$ and $0$ to denote the identity matrix and the vectors of all ones and zeros respectively, where the dimension is inferred from the context. The marginal distribution of $y$ is $N(\mu 1, \Sigma)$, where $\Sigma = \sigma_\epsilon^2 I + \sigma_\eta^2\Lambda^{-1}$.

Suppose we wish to find the maximum likelihood estimate $\theta^*$ of $\theta$ that maximizes the observed data log-likelihood $\log p(y \mid \theta)$ using an EM algorithm. We consider the latent states $\alpha$ as the missing data and $(y, \alpha)$ as the augmented data. Given an initial estimate $\theta^{(0)}$, the algorithm performs an E-step and an M-step at each iteration $i$, where the E-step computes $Q(\theta \mid \theta^{(i)}) = E_{p(\alpha|y,\theta^{(i)})}\{\log p(y, \alpha \mid \theta)\}$, and the M-step maximizes $Q(\theta \mid \theta^{(i)})$ with respect to $\theta$. The conditional distribution $p(\alpha \mid y, \theta)$ is $N(m_{aw}, V_a)$, where

$$V_a^{-1} = \sigma_\eta^{2a}\Omega, \quad \sigma_\eta^a m_{aw} = \Omega^{-1}z + \mu\tilde w, \quad z = \sigma_\epsilon^{-2}(y - \mu 1), \quad \Omega = \sigma_\epsilon^{-2}I + \sigma_\eta^{-2}\Lambda. \tag{5}$$

The subscripts of $m_{aw}$ and $V_a$ represent their dependence on the values of $a$ and $w$ in the scheme. For instance, we use $m_{11}$ when $a = 1$ and $w = 1$. Let $\zeta = \sigma_\eta^a m_{aw} - \mu\tilde w$. We have

$$Q(\theta \mid \theta^{(i)}) = 2^{-1}\big[\log|\Lambda| - \sigma_\epsilon^{-2}\{(y - \mu w - \sigma_\eta^a m_{aw})^T(y - \mu w - \sigma_\eta^a m_{aw}) + \sigma_\eta^{2a}\mathrm{tr}(V_a)\} - n\log(\sigma_\epsilon^2) - n(1-a)\log(\sigma_\eta^2) - \sigma_\eta^{-2}\{\zeta^T\Lambda\zeta + \sigma_\eta^{2a}\mathrm{tr}(\Lambda V_a)\}\big] - n\log(2\pi),$$

where $m_{aw}$, $V_a$ and $\zeta$ are evaluated at $\theta^{(i)}$. At each iteration, the EM algorithm updates $m_{aw}$ and $V_a$ given $\theta^{(i)}$ and sets $\theta^{(i+1)} = \arg\max_\theta Q(\theta \mid \theta^{(i)})$. The expectation-conditional maximization algorithm (Meng and Rubin, 1993) reduces the complexity of the M-step by replacing it with a sequence of conditional maximization steps. If we maximize $Q(\theta \mid \theta^{(i)})$ with respect to each element of $\theta$ with the remaining elements held fixed at their current values, then the update of each element can be obtained by setting the corresponding partial derivative of $Q$ to zero. This yields the following closed form updates for $\mu$ and $\sigma_\epsilon^2$:

$$\mu = \{\sigma_\epsilon^{-2}(y - \sigma_\eta^a m_{aw})^T w + \sigma_\eta^{a-2}m_{aw}^T\Lambda\tilde w\}/\tau(w), \quad \tau(w) = \sigma_\epsilon^{-2}w^Tw + \sigma_\eta^{-2}\tilde w^T\Lambda\tilde w, \quad \sigma_\epsilon^2 = n^{-1}\{(y - \mu w - \sigma_\eta^a m_{aw})^T(y - \mu w - \sigma_\eta^a m_{aw}) + \sigma_\eta^{2a}\mathrm{tr}(V_a)\}. \tag{6}$$

Setting $\partial Q/\partial\sigma_\eta^2 = 0$ yields

$$n(1-a)\sigma_\eta^2 = (1-a)\sigma_\eta^{2a}\{m_{aw}^T\Lambda m_{aw} + \mathrm{tr}(\Lambda V_a)\} + \mu^2\tilde w^T\Lambda\tilde w + (a-2)\sigma_\eta^a\mu m_{aw}^T\Lambda\tilde w + a\sigma_\eta^2\sigma_\epsilon^{-2}[\sigma_\eta^a(y - \mu w)^Tm_{aw} - \sigma_\eta^{2a}\{m_{aw}^Tm_{aw} + \mathrm{tr}(V_a)\}], \tag{7}$$

and closed form updates are available only in the following special cases:

$$\sigma_\eta^2 = \begin{cases} n^{-1}\{(m_{0w} - \mu\tilde w)^T\Lambda(m_{0w} - \mu\tilde w) + \mathrm{tr}(\Lambda V_0)\} & \text{if } a = 0,\\ \{m_{11}^Tm_{11} + \mathrm{tr}(V_1)\}^{-2}\{(y - \mu 1)^Tm_{11}\}^2 & \text{if } a = 1,\ w = 1. \end{cases}$$

Setting $\partial Q/\partial\phi = 0$ yields a cubic equation in $\phi$,

$$\phi\sigma_\eta^2 = (1-\phi^2)n^{-1}\sum_{t=1}^{n-1}(\zeta_t\zeta_{t+1} + \sigma_\eta^{2a}V_{a,t,t+1}) - \phi(1-\phi^2)n^{-1}\sum_{t=2}^{n-1}(\zeta_t^2 + \sigma_\eta^{2a}V_{a,tt}), \tag{8}$$

which can be solved using iterative methods.
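The E-step quantities in (5) can be sketched numerically. The snippet below is an illustrative sketch (not code from the paper): for simplicity it keeps only the states $x_1, \dots, x_n$ and drops $x_0$, so that $\Lambda$ is the $n \times n$ precision matrix of a stationary AR(1) sequence; the helper names and parameter values are ours. The check at the end confirms that $\sigma_\eta^a m_{aw} + w\mu$ recovers the posterior mean of the states.

```python
import numpy as np

def ar1_precision(n, phi):
    # Scaled precision of a stationary AR(1): tridiagonal with diagonal
    # (1, 1+phi^2, ..., 1+phi^2, 1) and off-diagonal elements -phi.
    L = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    L += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return L

def e_step(y, mu, phi, s2_eps, s2_eta, a, w):
    """Moments m_aw and V_a of p(alpha | y, theta) as in (5)."""
    n = len(y)
    Lam = ar1_precision(n, phi)
    Omega = np.eye(n) / s2_eps + Lam / s2_eta
    z = (y - mu) / s2_eps
    s_eta_a = s2_eta ** (a / 2.0)                 # sigma_eta^a
    m = (np.linalg.solve(Omega, z) + mu * (1.0 - w)) / s_eta_a
    V = np.linalg.inv(Omega) / s_eta_a**2
    return m, V

rng = np.random.default_rng(1)
y = rng.normal(size=50)
mu, phi, s2_eps, s2_eta, a = 0.3, 0.7, 1.0, 0.25, 1.0
w = np.full(50, 0.5)
m, V = e_step(y, mu, phi, s2_eps, s2_eta, a, w)
# sanity check: sigma_eta^a * m + w * mu must equal the posterior mean of x,
# namely Omega^{-1} (y / s2_eps + mu * Lam @ 1 / s2_eta)
Lam = ar1_precision(50, phi)
Omega = np.eye(50) / s2_eps + Lam / s2_eta
post_mean_x = np.linalg.solve(Omega, y / s2_eps + mu * Lam @ np.ones(50) / s2_eta)
assert np.allclose(s2_eta ** (a / 2.0) * m + w * mu, post_mean_x)
```

In practice the Kalman smoother delivers the same moments in $O(n)$ time; dense linear algebra is used here only for transparency.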

Here we are interested in finding the values of $a$ and $w$ that optimize the rate of convergence of the EM algorithm. From (1), since $I_{obs}(\theta^*)$ depends only on the observed data and is independent of the parametrization, it is sufficient to minimize $I_{aug}(\theta^*)$ with respect to $a$ and $w$. We use $I$ to denote the element in $I_{aug}(\theta^*)$ corresponding to the unknown parameter. We first consider the cases where only the location or the scale is unknown, followed by the case where all the parameters are unknown.

## 3 Unknown location parameter

Suppose $\phi$, $\sigma_\epsilon^2$ and $\sigma_\eta^2$ are known and $\mu$ is the only unknown parameter. The EM algorithm for this case, hereby called Algorithm 1, alternately updates $m_{aw}$ and $\mu$ as in (5) and (6).

###### Theorem 1.

The rate of convergence of Algorithm 1 is $1 - 1^T\Sigma^{-1}1/\tau(w)$, where $\Sigma = \sigma_\epsilon^2 I + \sigma_\eta^2\Lambda^{-1}$ and $\tau(w)$ is given in (6). This rate is minimized to zero when $w$ is

$$w_{opt} = 1 - \sigma_\epsilon^{-2}\Omega^{-1}1. \tag{9}$$
###### Proof.

It can be shown that $I_{aug}(\mu) = \tau(w)$ and $I_{obs}(\mu) = 1^T\Sigma^{-1}1$. From (1), the rate of convergence is $1 - 1^T\Sigma^{-1}1/\tau(w)$. Since $\tau(w) \geq 1^T\Sigma^{-1}1$ for any $w$ and $\tau(w_{opt}) = 1^T\Sigma^{-1}1$, the rate is minimized to be zero at $w = w_{opt}$. ∎

From Theorem 1, the rates of convergence of Algorithm 1 for the centered ($w = 0$) and noncentered ($w = 1$) parametrizations are $1 - 1^T\Sigma^{-1}1/\tau(0)$ and $1 - 1^T\Sigma^{-1}1/\tau(1)$ respectively. When $w = w_{opt}$, Algorithm 1 converges instantly. This optimal rate is achievable only if the portion of $\mu$ subtracted from $x_t$ in (3) is allowed to vary with $t$. If $w_t$ is common for all $t$, the optimal value of $w_t$ can still be derived, but the rate of convergence is positive except when $\phi = 0$. Furthermore, it is possible to compute $w_{opt}$ in advance for Algorithm 1 as $w_{opt}$ does not depend on $\mu$. The rate of convergence is also independent of $a$. Thus, $a$ can be set to any convenient value, say zero.
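The instant convergence can be verified numerically. Below is an illustrative sketch (not code from the paper; helper names and parameter values are ours) that, for simplicity, takes $\Lambda$ as the $n \times n$ stationary AR(1) precision: with $a = 0$ and $w = w_{opt}$ from (9), a single EM iteration of Algorithm 1 lands exactly on the maximum likelihood estimate of $\mu$, regardless of the starting value.

```python
import numpy as np

def ar1_precision(n, phi):
    # Scaled precision of a stationary AR(1): tridiagonal with diagonal
    # (1, 1+phi^2, ..., 1+phi^2, 1) and off-diagonal elements -phi.
    L = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    L += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return L

n, phi, s2_eps, s2_eta = 100, 0.8, 1.0, 0.5
Lam = ar1_precision(n, phi)
Omega = np.eye(n) / s2_eps + Lam / s2_eta
w_opt = 1.0 - np.linalg.solve(Omega, np.ones(n)) / s2_eps      # eq. (9)

rng = np.random.default_rng(2)
Sigma = s2_eps * np.eye(n) + s2_eta * np.linalg.inv(Lam)       # marginal covariance of y
y = rng.multivariate_normal(3.0 * np.ones(n), Sigma)           # true mu = 3

def em_update_mu(mu0, w):
    """One EM iteration of Algorithm 1 with a = 0: E-step (5), then M-step (6)."""
    z = (y - mu0) / s2_eps
    m = np.linalg.solve(Omega, z) + mu0 * (1.0 - w)            # E-step posterior mean
    tau = w @ w / s2_eps + (1.0 - w) @ Lam @ (1.0 - w) / s2_eta
    return ((y - m) @ w / s2_eps + m @ Lam @ (1.0 - w) / s2_eta) / tau

Sinv1 = np.linalg.solve(Sigma, np.ones(n))
mu_mle = (Sinv1 @ y) / Sinv1.sum()          # direct MLE of mu from y ~ N(mu 1, Sigma)
mu1 = em_update_mu(-10.0, w_opt)            # one EM step from a poor starting value
assert np.isclose(mu1, mu_mle)              # converges in a single iteration
```

In this sketch, the cancellation happens in the E-step: at $w = w_{opt}$ the conditional mean $m$ no longer depends on the current value of $\mu$, so the first M-step already returns the fixed point.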

To investigate the range of the elements in $w_{opt}$ and their dependence on $\theta$, we derive an explicit expression for $w_{opt}$. From (9), $1 - w_{opt}$ is simply the vector of row sums of $\Omega^{-1}$ divided by $\sigma_\epsilon^2$. We first present an expression for these row sums and some important properties. If $\phi \neq 0$, let $Q = \gamma\sigma_\epsilon^2\Omega/|\phi| = (\Lambda + \gamma I)/|\phi|$, so that $Q$ is a symmetric tridiagonal matrix with diagonal elements equal to $c$ (except the first and last, which equal $c - |\phi|$) and off-diagonal elements equal to $-b$, where $b = \phi/|\phi|$ and $c = (1 + \phi^2 + \gamma)/|\phi|$. It can be shown that $c > 2$ and $r_+r_- = 1$.

###### Property 1.

If $\phi \neq 0$, the elements in the first row of $Q^{-1}$ are $(Q^{-1})_{1t} = v_t$ for $t = 1, \dots, n$, where

$$v_t = b^{t-1}\kappa_{n-t}/\kappa, \quad u_t = b^{t-1}\kappa_{t-1}/\kappa_0 \quad (t = 1, \dots, n), \qquad \kappa = \varphi_+^2r_+^{n-1} - \varphi_-^2r_-^{n-1}, \quad \kappa_t = \varphi_+r_+^t - \varphi_-r_-^t \quad (t = 0, \dots, n-1), \qquad \varphi_+ = r_+ - |\phi|, \quad \varphi_- = r_- - |\phi|, \quad r_\pm = \{c \pm (c^2 - 4)^{1/2}\}/2. \tag{10}$$
###### Property 2.

The sum of the $t$th row of $Q^{-1}$ is $s_t = |\phi|\{1 - b(1-\phi)(v_t + v_{n-t+1})\}/\{(1-\phi)^2 + \gamma\}$.

###### Property 3.

If $\phi > 0$, $v_t$ is positive for $t = 1, \dots, n$ and all elements of $Q^{-1}$ are positive.

###### Theorem 2.

Let $w_{opt,t}$ be the $t$th element of $w_{opt}$. If $\phi = 0$, $w_{opt,t} = 1/(1+\gamma)$, and

$$w_{opt,t} = \frac{(1-\phi)^2 + b\gamma(1-\phi)(v_t + v_{n-t+1})}{(1-\phi)^2 + \gamma} \quad \text{if } \phi \neq 0. \tag{11}$$
###### Proof.

If $\phi = 0$, $\Lambda = I$ and $\Omega = (\sigma_\epsilon^{-2} + \sigma_\eta^{-2})I$. Hence $w_{opt,t} = 1 - \sigma_\epsilon^{-2}(\sigma_\epsilon^{-2} + \sigma_\eta^{-2})^{-1} = 1/(1+\gamma)$. If $\phi \neq 0$, $w_{opt,t} = 1 - (\gamma/|\phi|)s_t$. Substituting the expression for $s_t$ from Property 2 yields the result in (11). ∎

As the $v_t$ are elements of the first row of $Q^{-1}$ by Property 1, Theorem 2 implies that $w_{opt}$ can be computed using only elements from the first row of $Q^{-1}$. In addition, $w_{opt}$ is symmetric since $w_{opt,t} = w_{opt,n-t+1}$ for each $t$. It is clear that the value of $w_{opt,t}$ depends on $\phi$, and on $\sigma_\epsilon^2$ and $\sigma_\eta^2$ only through the signal-to-noise ratio $\gamma$. Corollary 1 presents bounds for $w_{opt,t}$ which are tight when $\phi = 0$.

###### Corollary 1.

For $t = 1, \dots, n$,

$$1 - B_1 \leq w_{opt,t} \leq 1 - B_2 \quad (0 \leq \phi < 1), \qquad 1 - B_2 \leq w_{opt,t} \leq 1 + B_2 - 2B_1 \quad (-1 < \phi < 0),$$

where $B_1 = \gamma/\{(1-\phi)^2 + \gamma\}$ and $B_2 = \gamma/\{(1+\phi)^2 + \gamma\}$. When $\phi = 0$, $B_1 = B_2 = \gamma/(1+\gamma)$ and both bounds reduce to $w_{opt,t} = 1/(1+\gamma)$.

###### Proof.

From Theorem 2, if $\phi = 0$ then $w_{opt,t} = 1/(1+\gamma)$ and it clearly satisfies the bounds. If $0 < \phi < 1$, $v_t > 0$ for $t = 1, \dots, n$ from Property 3 and hence $w_{opt,t} \geq 1 - B_1$ from (11). From Property 2, the bound on the row sums $s_t$ implies that $w_{opt,t} \leq 1 - B_2$. If $-1 < \phi < 0$, then $b = -1$ and the same arguments applied to the bounds on $s_t$ in Property 2 yield $1 - B_2 \leq w_{opt,t} \leq 1 + B_2 - 2B_1$. ∎

###### Corollary 2.

If $0 \leq \phi < 1$, each element of $w_{opt}$ decreases strictly as the signal-to-noise ratio $\gamma$ increases and as $\phi$ increases. As $\phi$ approaches 1, $w_{opt}$ approaches the zero vector.

###### Proof.

Writing $\tilde w_{opt} = 1 - w_{opt} = \gamma(\Lambda + \gamma I)^{-1}1$, we have

$$\frac{\partial w_{opt}}{\partial\gamma} = -\frac{Q^{-1}w_{opt}}{|\phi|}, \qquad \frac{\partial w_{opt}}{\partial\phi} = \frac{Q^{-1}}{|\phi|}\frac{d\Lambda}{d\phi}\tilde w_{opt},$$

where $d\Lambda/d\phi$ is a symmetric tridiagonal matrix with diagonal elements equal to $2\phi$ (zero for the first and last) and off-diagonal elements equal to $-1$. From Property 2 and Corollary 1, all elements of $Q^{-1}$ and $w_{opt}$ are positive. Hence each element of $\partial w_{opt}/\partial\gamma$ is negative and $w_{opt}$ decreases strictly with $\gamma$ for all $0 \leq \phi < 1$. To show that each element of $\partial w_{opt}/\partial\phi$ is negative, it suffices to show that each element of $(d\Lambda/d\phi)\tilde w_{opt}$ (a symmetric vector) is negative. The first and last elements are equal to $-\tilde w_{opt,2}$ and $-\tilde w_{opt,n-1}$ respectively, which are negative. From Theorem 2, the $t$th element of $(d\Lambda/d\phi)\tilde w_{opt}$ for $1 < t < n$ is

$$2\phi\tilde w_{opt,t} - \tilde w_{opt,t-1} - \tilde w_{opt,t+1} = \frac{\gamma}{\phi}(2\phi s_t - s_{t-1} - s_{t+1}) < 0$$

from Property 2. It is clear from (11) that as $\phi$ approaches 1, $w_{opt}$ approaches the zero vector. ∎

For random effects models (Gelfand et al., 1995; Papaspiliopoulos et al., 2003), the partial noncentering parameter for the location always lies between 0 and 1. This is also true for the Gaussian state space model when $0 \leq \phi < 1$. However, when $-1 < \phi < 0$, $w_{opt,t}$ is positive but not necessarily bounded above by one, as shown in Fig. 1. From Corollary 2, the centered parametrization ($w = 0$) is increasingly preferred as $\phi$ and $\gamma$ increase when $0 \leq \phi < 1$. However, when $-1 < \phi < 0$, Fig. 1 shows that $w_{opt,t}$ may not be strictly decreasing with either $\phi$ or $\gamma$.

## 4 Unknown scale parameter

Next, suppose $\mu$, $\phi$ and $\sigma_\epsilon^2$ are known and $\sigma_\eta^2$ is the only unknown parameter. The EM algorithm for this case (hereby called Algorithm 2) alternately updates $m_{aw}$ and $V_a$ at the E-step and $\sigma_\eta^2$ at the M-step, as given in (5) and (7) respectively.

###### Theorem 3.

If $\mu = 0$, the rate of convergence of Algorithm 2 is independent of $\tilde w$ and is minimized at $a = 2n/(2n + z^T\Omega^{-1}z)$. If $\mu \neq 0$, the rate is jointly minimized at

$$a_{opt} = 1 - (n\sigma_\eta^2)^{-1}z^T\Omega^{-1}\Lambda\Omega^{-1}z, \qquad \tilde w_{opt} = (\mu\Omega)^{-1}\{2\Lambda\Omega^{-1}z/(a_{opt}\sigma_\eta^2) - z\}. \tag{12}$$
###### Proof.

The rate of convergence of Algorithm 2 can be optimized by minimizing $I_{aug}(\sigma_\eta^2)$ with respect to $a$ and $\tilde w$. We can show that $I_{aug}(\sigma_\eta^2)$ is proportional to $I$, where

$$I = \mu^2a^2\tilde w^T\Omega\tilde w/2 + \mu a^2z^T\tilde w - 2\sigma_\eta^{-2}a\mu z^T\Omega^{-1}\Lambda\tilde w + n(1-a)^2 + a^2z^T\Omega^{-1}z/2.$$

Note that $z$ is defined in (5) and is to be evaluated at $\theta^*$. If $\mu = 0$, $I = n(1-a)^2 + a^2z^T\Omega^{-1}z/2$ and is independent of $\tilde w$. Therefore $I$ is minimized at $a = 2n/(2n + z^T\Omega^{-1}z)$. If $\mu \neq 0$, setting $\partial I/\partial a = 0$ and $\partial I/\partial\tilde w = 0$ yields

$$a = \frac{2n + 2\sigma_\eta^{-2}\mu z^T\Omega^{-1}\Lambda\tilde w}{2n + (z + \mu\Omega\tilde w)^T\Omega^{-1}(z + \mu\Omega\tilde w)}, \qquad \tilde w = (\mu\Omega)^{-1}\{2\Lambda\Omega^{-1}z/(a\sigma_\eta^2) - z\}.$$

Solving the two equations simultaneously, we obtain the results in (12). The Hessian of $I$ can be verified to be positive definite at $(a_{opt}, \tilde w_{opt})$. Hence $I$ is minimized at $(a_{opt}, \tilde w_{opt})$. ∎

From Theorem 3, $a_{opt}$ and $\tilde w_{opt}$ depend not only on the unknown $\sigma_\eta^2$, but also on the observed data $y$. Thus even if the true parameter is known, $a_{opt}$ and $\tilde w_{opt}$ will still vary across datasets generated from model (4) due to sampling variability. We recommend updating $a_{opt}$ and $\tilde w_{opt}$ at each iteration in Algorithm 2 based on the latest update of $\sigma_\eta^2$.
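A sketch of how the optimal working parameters in (12) could be computed from data (illustrative only, not the paper's code; the helper name and parameter values are ours, and $\Lambda$ is taken as the $n \times n$ stationary AR(1) precision for simplicity):

```python
import numpy as np

def ar1_precision(n, phi):
    # Scaled precision of a stationary AR(1): tridiagonal with diagonal
    # (1, 1+phi^2, ..., 1+phi^2, 1) and off-diagonal elements -phi.
    L = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    L += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return L

n, mu, phi, s2_eps, s2_eta = 200, 1.0, 0.6, 1.0, 0.8
Lam = ar1_precision(n, phi)
Omega = np.eye(n) / s2_eps + Lam / s2_eta
rng = np.random.default_rng(3)
Sigma = s2_eps * np.eye(n) + s2_eta * np.linalg.inv(Lam)    # marginal covariance of y
y = rng.multivariate_normal(mu * np.ones(n), Sigma)
z = (y - mu) / s2_eps
Oz = np.linalg.solve(Omega, z)                              # Omega^{-1} z
a_opt = 1.0 - (Oz @ Lam @ Oz) / (n * s2_eta)                # eq. (12), first part
w_tilde_opt = np.linalg.solve(Omega, 2.0 * Lam @ Oz / (a_opt * s2_eta) - z) / mu
assert a_opt <= 1.0                                         # a_opt is bounded above by 1
```

Rerunning with a fresh seed gives different values of `a_opt` and `w_tilde_opt`, illustrating the sampling variability noted above.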

If $\mu = 0$, the optimal value of $a$ is clearly bounded in $(0, 1)$. If $\mu \neq 0$, $a_{opt}$ is always bounded above by 1 but may be negative, especially when $n$ is small and the signal-to-noise ratio is large. To investigate the dependence of $a_{opt}$ on $\phi$ and the signal-to-noise ratio $\gamma$, we compute $a_{opt}$ for datasets simulated under a variety of parameter settings, generating 1000 datasets in each setting. Fig. 2 shows that the mean of $a_{opt}$ seems to decrease as the signal-to-noise ratio increases, is symmetric about $\phi = 0$, and varies with $\phi$ in the form of an (inverted) U-shaped curve. Support for these observations is provided by Theorem 4 below, which derives an estimate of $a_{opt}$ for large $n$. This estimate (shown in Fig. 2) improves as $n$ increases and is almost indistinguishable from the mean of $a_{opt}$ for $n$ greater than 500. The proof of Theorem 4 requires the traces of $Q^{-1}$ and $Q^{-2}$, which are stated in Property 4 (see Tan, 2017).

###### Property 4.

Closed form expressions are available for the traces of $Q^{-1}$ and $Q^{-2}$; they involve the quantity

$$S = 4n^2\gamma^2 + 8n\gamma(\phi^2 - 1) + \kappa_0^{-1}nc(\varphi_+^4r_+^{2n-2} - \varphi_-^4r_-^{2n-2}) - 4\gamma(1+\phi^2) + 2(\phi^2-1)^2 + 4\gamma\kappa_0^{-2}\{4\gamma + c(\varphi_+^2r_+^{2n-1} + \varphi_-^2r_-^{2n-1})\} + 2\kappa_0^{-1}(\phi^2-1)(\varphi_+^2r_+^{2n-1} - \varphi_-^2r_-^{2n-1}).$$
###### Theorem 4.

If $\phi \neq 0$, $a_{opt}$ converges to $\hat a_{opt}$ in quadratic mean as $n \to \infty$, where

$$\hat a_{opt} = 1 - \gamma[\{(1-\phi)^2 + \gamma\}\{(1+\phi)^2 + \gamma\}]^{-1/2}.$$
###### Proof.

The marginal distribution of $y$ is $N(\mu 1, \Sigma)$, so that $z = \sigma_\epsilon^{-2}(y - \mu 1) \sim N(0, \sigma_\epsilon^{-4}\Sigma)$. Hence

$$E(a_{opt}) = 1 - (n\sigma_\eta^2)^{-1}\sigma_\epsilon^{-4}\mathrm{tr}(\Omega^{-1}\Lambda\Omega^{-1}\Sigma) = 1 - (n\sigma_\epsilon^2)^{-1}\mathrm{tr}(\Omega^{-1}) = 1 - (n|\phi|)^{-1}\gamma\,\mathrm{tr}(Q^{-1}),$$
$$\mathrm{var}(a_{opt}) = 2(n\sigma_\eta^2)^{-2}\sigma_\epsilon^{-8}\mathrm{tr}\{(\Omega^{-1}\Lambda\Omega^{-1}\Sigma)^2\} = 2(n\sigma_\epsilon^2)^{-2}\mathrm{tr}(\Omega^{-2}) = 2(n|\phi|)^{-2}\gamma^2\,\mathrm{tr}(Q^{-2}).$$

We first show that $E(a_{opt}) \to \hat a_{opt}$ and $\mathrm{var}(a_{opt}) \to 0$ as $n \to \infty$. From Property 4, $\lim_{n\to\infty} n^{-1}\mathrm{tr}(Q^{-1})$ is

$$\frac{1}{\kappa_0}\lim_{n\to\infty}\frac{1 + (\varphi_-^2/\varphi_+^2)(r_-/r_+)^{n-1}}{1 - (\varphi_-^2/\varphi_+^2)(r_-/r_+)^{n-1}} + \frac{2\gamma}{\kappa_0^2}\lim_{n\to\infty}\frac{1 - (r_-/r_+)^n}{n\{\varphi_+^2/r_+ - (\varphi_-^2/r_+)(r_-/r_+)^{n-1}\}},$$

which reduces to $1/\kappa_0$ since $0 < r_-/r_+ < 1$. Moreover, $|\phi|\kappa_0 = [\{(1-\phi)^2 + \gamma\}\{(1+\phi)^2 + \gamma\}]^{1/2}$, so $E(a_{opt}) \to 1 - \gamma/(|\phi|\kappa_0) = \hat a_{opt}$. Similarly, we can show that $\mathrm{var}(a_{opt}) \to 0$ by using the expression for the trace of $Q^{-2}$ in Property 4. Finally,

$$E\{(a_{opt} - \hat a_{opt})^2\} = \mathrm{var}(a_{opt}) + \{E(a_{opt}) - \hat a_{opt}\}^2 \to 0$$

as $n \to \infty$. ∎
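Theorem 4 can be checked numerically through the identity $E(a_{opt}) = 1 - (n\sigma_\epsilon^2)^{-1}\mathrm{tr}(\Omega^{-1})$ from the proof: this deterministic quantity should approach $\hat a_{opt}$ as $n$ grows. An illustrative sketch (arbitrary parameter values, with $\Lambda$ taken as the $n \times n$ stationary AR(1) precision for simplicity; helper name is ours):

```python
import numpy as np

def ar1_precision(n, phi):
    # Scaled precision of a stationary AR(1): tridiagonal with diagonal
    # (1, 1+phi^2, ..., 1+phi^2, 1) and off-diagonal elements -phi.
    L = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    L += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return L

phi, gamma = 0.6, 0.8                        # gamma = sigma_eta^2 / sigma_eps^2
s2_eps, s2_eta = 1.0, gamma
a_hat = 1.0 - gamma / np.sqrt(((1.0 - phi)**2 + gamma) * ((1.0 + phi)**2 + gamma))
errors = []
for n in (100, 500, 1000):
    Lam = ar1_precision(n, phi)
    Omega = np.eye(n) / s2_eps + Lam / s2_eta
    mean_a_opt = 1.0 - np.trace(np.linalg.inv(Omega)) / (n * s2_eps)  # E(a_opt)
    errors.append(abs(mean_a_opt - a_hat))
assert errors[-1] < errors[0]                # the approximation improves with n
assert errors[-1] < 1e-2                     # and is already tight at n = 1000
```

The discrepancy shrinks at rate $O(1/n)$, consistent with the boundary terms in Property 4.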

###### Corollary 3.

If $\phi \neq 0$, $a_{opt} > 0$ asymptotically almost surely.

###### Proof.

From Theorem 4,

$$\mathrm{pr}(a_{opt} < \hat a_{opt} - \epsilon) \leq \mathrm{pr}(|a_{opt} - \hat a_{opt}| > \epsilon) \leq E\{(a_{opt} - \hat a_{opt})^2\}/\epsilon^2 \to 0$$

as $n \to \infty$ for any $\epsilon > 0$. Setting $\epsilon = \hat a_{opt} > 0$ yields the result. ∎

The approximation in Theorem 4 lends insight into how $a_{opt}$ varies with $\phi$ and $\gamma$. For instance, $\hat a_{opt}$ is symmetric about $\phi = 0$. It is also easy to verify that $\partial\hat a_{opt}/\partial\gamma$ is always negative and hence $\hat a_{opt}$ decreases strictly with the signal-to-noise ratio. From Corollary 3, $a_{opt}$ is almost sure to lie in the interval $(0, 1]$ when $n$ is large. On the other hand, $\tilde w_{opt}$ in (12) is not bounded in $[0, 1]$, unlike $w_{opt}$ in Section 3. Results from simulations (see Fig. 3) show that as $n \to \infty$, $\tilde w_{opt}$ is close to one under different conditions, but there is greater variability when the signal-to-noise ratio is large. In Theorem 5, we show that $E(\tilde w_{opt}) \to 1$ approximately as $n \to \infty$ by taking the moments of a second order Taylor approximation. However, Fig. 3 shows that the 5th and 95th quantiles of $\tilde w_{opt}$ do not approach one even for large $n$, especially when $|\phi|$ is close to one and/or the signal-to-noise ratio is large. Thus the variability of $\tilde w_{opt}$ does not seem to vanish as $n \to \infty$.

###### Theorem 5.

As $n \to \infty$, $E(\tilde w_{opt}) \to 1$ approximately.

Since