Efficient data augmentation techniques for Gaussian state space models

12/24/2017 ∙ by Linda S. L. Tan, et al. ∙ National University of Singapore

We propose a data augmentation scheme for improving the rate of convergence of the EM algorithm in estimating Gaussian state space models. The scheme is based on a linear transformation of the latent states, and two working parameters are introduced for simultaneous rescaling and re-centering. A variable portion of the mean and scale is thus moved into the missing data. We derive optimal values of the working parameters (which maximize the speed of the EM algorithm) by minimizing the fraction of missing information. We also study the large sample properties of the working parameters and their dependence on the autocorrelation and signal-to-noise ratio. We show that instant convergence is achievable when the mean is the only unknown parameter, and this result is extended to Gibbs samplers and variational Bayes algorithms.


1 Introduction

The expectation-maximization or EM algorithm (Dempster et al., 1977) is an iterative method for maximum likelihood estimation that is widely used in missing data problems. For state space models, the EM algorithm provides a natural approach for estimating the model parameters (e.g. Shumway and Stoffer, 1982), as the latent states can be taken as the missing data and the E-step can be performed using smoothed estimates from the Kalman filter (Kalman, 1960). Compared with other likelihood maximization techniques such as scoring or Newton-Raphson, the EM algorithm has some attractive features, such as numerical stability (each iteration increases the observed data likelihood) and guaranteed convergence to a local maximum for exponential families (Wu, 1983). However, the EM algorithm has been found to converge slowly in the later stages of the iterative procedure despite being able to move quickly to a region close to the maximum (Watson and Engle, 1983; Harvey and Peters, 1990).

The rate of convergence of the EM algorithm is governed by the fraction of missing information in the data augmentation scheme. Let $y_{\text{aug}} = (y_{\text{obs}}, y_{\text{mis}})$ denote the complete or augmented data, where $y_{\text{obs}}$ is the observed data and $y_{\text{mis}}$ is the missing data, and let $\hat{\theta}$ be the maximum likelihood estimate of an unknown parameter $\theta$. The (matrix) rate of convergence of the EM algorithm (Dempster et al., 1977) is

$\mathcal{R}(\hat{\theta}) = I - I_{\text{obs}}(\hat{\theta})\, I_{\text{aug}}(\hat{\theta})^{-1}, \qquad (1)$

where $I_{\text{obs}}(\hat{\theta})$ is the observed data information and $I_{\text{aug}}(\hat{\theta})$ is the expected augmented (complete data) information given the observed data.

The EM algorithm converges slowly ($\mathcal{R}(\hat{\theta})$ is close to the identity) if the proportion of missing information relative to the augmented information is large. Conversely, convergence is rapid ($\mathcal{R}(\hat{\theta})$ is close to zero) if the observed information is close to the augmented information. Thus, the rate of convergence can be optimized by minimizing the missing information in the data augmentation scheme. Meng and van Dyk (1997, 1998) develop augmentation schemes for the multivariate $t$-model and mixed effects models by rescaling the missing data and introducing “working parameters” into the model that control the portion of the scale factor that is shifted into the missing data. Optimal values of the working parameters are then found by minimizing the fraction of missing information. A similar approach was considered by Tan et al. (2007) for quadratic optimization problems. The PX-EM algorithm (Liu et al., 1998) also seeks to reduce the fraction of missing information, albeit by expanding the set of model parameters to adjust the covariance among parameters in the M-step.
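To make the role of (1) concrete, the following toy sketch (our own illustration, not code from the paper) runs EM for the mean of a simple latent Gaussian model, treating the latent variables as missing data, and compares the empirical per-iteration error ratio with the theoretical rate $\sigma^2/(\sigma^2+\tau^2)$ implied by the fraction of missing information.

```python
import numpy as np

# Toy illustration of (1): EM for the mean mu of a latent Gaussian model
#   y_i = x_i + eps_i,  eps_i ~ N(0, sigma2),   x_i ~ N(mu, tau2),
# with the x_i treated as missing data (a "centered" augmentation).
# The theoretical rate I - I_obs/I_aug equals sigma2 / (sigma2 + tau2).
# All names and values here are illustrative, not taken from the paper.

rng = np.random.default_rng(0)
n, mu_true, tau2, sigma2 = 200, 1.5, 1.0, 4.0
y = mu_true + rng.normal(0.0, np.sqrt(tau2), n) + rng.normal(0.0, np.sqrt(sigma2), n)

mu_hat = y.mean()                      # MLE of mu is the sample mean
mu = 10.0                              # deliberately poor starting value
rates = []
for _ in range(30):
    # E-step: posterior mean of each latent x_i given y_i and the current mu
    x_mean = (tau2 * y + sigma2 * mu) / (tau2 + sigma2)
    mu_new = x_mean.mean()             # M-step: maximize expected complete-data log-likelihood
    rates.append((mu_new - mu_hat) / (mu - mu_hat))   # empirical per-iteration error ratio
    mu = mu_new

print("empirical rate:", rates[-1])
print("theoretical rate sigma2/(sigma2+tau2):", sigma2 / (sigma2 + tau2))
```

In this toy example the EM map is exactly linear, so the error ratio equals the theoretical rate at every iteration; in general the ratio only approaches the largest eigenvalue of (1) near convergence.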

In this article, we consider Gaussian state space models of the form:

$y_t = x_t + \epsilon_t \;\; (t = 1, \dots, n), \qquad x_t = \mu + \phi (x_{t-1} - \mu) + \eta_t \;\; (t = 2, \dots, n), \qquad x_1 \sim N\{\mu, \sigma_\eta^2/(1 - \phi^2)\}, \qquad (2)$

where $y_1, \dots, y_n$ are the observations, $x_1, \dots, x_n$ are the latent states, $\epsilon_t \sim N(0, \sigma^2)$, $\eta_t \sim N(0, \sigma_\eta^2)$ and $|\phi| < 1$. The $\{\epsilon_t\}$ and $\{\eta_s\}$ sequences are independent for all $t$ and $s$, and are independent of $x_1$. Let $\theta = (\mu, \phi, \sigma^2, \sigma_\eta^2)^\top$ be the vector of model parameters and $q = \sigma_\eta^2/\sigma^2$ be the signal-to-noise ratio. This model is also called the AR(1) plus noise or ARMA(1,1) model in time series analysis (Durbin and Koopman, 2012). We propose a data augmentation scheme that introduces two working parameters, $w = (w_1, \dots, w_n)^\top$ and $a$, to rescale and re-center the latent states. For $t = 1, \dots, n$, we define the transformed latent state $\tilde{x}_t$ as

$\tilde{x}_t = (x_t - w_t \mu)/\sigma_\eta^a. \qquad (3)$
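As a minimal illustration of (2) and (3) under the notation reconstructed above (the variable names mu, phi, sig2, sig2_eta, w and a are ours and are not taken from the paper), the sketch below simulates from the AR(1)-plus-noise model and forms the partially noncentered states.

```python
import numpy as np

# Simulate the AR(1)-plus-noise model in (2) and apply transformation (3).
# Notation (mu, phi, sig2, sig2_eta, w, a) follows our reconstruction above
# and is illustrative only.

rng = np.random.default_rng(1)
n, mu, phi, sig2, sig2_eta = 500, 2.0, 0.9, 1.0, 0.25
sig_eta = np.sqrt(sig2_eta)

x = np.empty(n)
x[0] = mu + rng.normal(0.0, sig_eta / np.sqrt(1.0 - phi**2))   # stationary initial state
for t in range(1, n):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.normal(0.0, sig_eta)
y = x + rng.normal(0.0, np.sqrt(sig2), n)

# Partially noncentered states: w_t shifts a portion of the mean into the
# observation equation and a rescales by sig_eta**a.  w = 0, a = 0 gives the
# centered states x_t; w = 1, a = 1 gives the noncentered (x_t - mu)/sig_eta.
w = np.full(n, 0.5)
a = 0.5
x_tilde = (x - w * mu) / sig_eta**a
```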

Suppose $w_1 = \dots = w_n = w$. In the context of Markov chain Monte Carlo algorithms, (3) is well known as the noncentered parametrization when $w = 1$ and $a = 1$, and the centered parametrization when $w = 0$ and $a = 0$. It has been shown, for instance, for random effect models (Gelfand et al., 1995, 1996), Gaussian state space models (Pitt and Shephard, 1999; Frühwirth-Schnatter, 2004) and a large class of hierarchical models (Papaspiliopoulos et al., 2003, 2007), that the convergence rates of Markov chain Monte Carlo algorithms depend critically on the parametrization (Roberts and Sahu, 1997). The centered and noncentered parametrizations have been found to play complementary roles, in that the Gibbs sampler often converges much faster under one of these parametrizations than under the other. To take advantage of this contrasting feature, Yu and Meng (2011) propose an ancillarity-sufficiency strategy that interweaves these two parametrizations.

A less well-known alternative is the partially noncentered parametrization, which has been shown to be capable of yielding better convergence than both the centered and noncentered parametrizations for fitting random effect models using Gibbs samplers (Papaspiliopoulos et al., 2003). Partial noncentering (of the mean) assumes the form of (3) with $a = 0$ and $0 < w_t < 1$, and is thus regarded as lying on the continuum between centering and noncentering. In this article, we consider a data augmentation scheme or parametrization of the Gaussian state space model where both the location and scale are partially noncentered, and investigate how this scheme can be optimized to construct efficient EM algorithms. As the rates of convergence of EM algorithms and Gibbs samplers are closely linked (Sahu and Roberts, 1999), we show how some of our results carry over to Gibbs samplers and, in fact, to variational Bayes algorithms as well.

2 EM algorithm for partially noncentered Gaussian state space model

Let $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_n)^\top$ and $w = (w_1, \dots, w_n)^\top$. We can express (2) in terms of $\tilde{x}$ as

$y_t = \sigma_\eta^a \tilde{x}_t + w_t \mu + \epsilon_t, \qquad \tilde{x}_t = \phi \tilde{x}_{t-1} + \sigma_\eta^{-a} \{ (1 - w_t) - \phi (1 - w_{t-1}) \} \mu + \sigma_\eta^{-a} \eta_t, \qquad (4)$

with $\tilde{x}_1 \sim N\{\sigma_\eta^{-a} (1 - w_1)\mu, \sigma_\eta^{2-2a}/(1 - \phi^2)\}$. Now $\mu$ and $\sigma_\eta$ appear in both the state and observation equations. When $w = 0$ and $a = 0$, we recover the centered parametrization in (2), so called as the latent state is centered around the a priori expected value and the parameters $\mu$, $\sigma_\eta$ appear only in the state equation. The noncentered parametrization is obtained when $w = 1$ and $a = 1$. In matrix notation, (4) can be expressed as

$y \mid \tilde{x} \sim N(\sigma_\eta^a \tilde{x} + \mu w, \sigma^2 I), \qquad \tilde{x} \sim N\{\sigma_\eta^{-a} \mu (1 - w), \sigma_\eta^{2-2a} K^{-1}\},$

where $K$ is a symmetric tridiagonal matrix with diagonal $(1, 1 + \phi^2, \dots, 1 + \phi^2, 1)$ and off-diagonal elements equal to $-\phi$. For $|\phi| < 1$, $K$ is invertible. We use $I$, 1 and 0 to denote the identity matrix and the vectors of all ones and zeros respectively, where the dimension is inferred from the context. The marginal distribution of $y$ is $N(\mu 1, \Sigma)$, where $\Sigma = \sigma_\eta^2 K^{-1} + \sigma^2 I$.
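The sketch below gives a quick numerical check of this structure under our notation (the name $K$ and the parameter values are ours): it builds the tridiagonal matrix for a stationary AR(1) process and verifies that $\sigma_\eta^2 K^{-1}$ reproduces the familiar autocovariance $\sigma_\eta^2 \phi^{|i-j|}/(1-\phi^2)$, so that the marginal covariance of $y$ is $\sigma_\eta^2 K^{-1} + \sigma^2 I$.

```python
import numpy as np

# Build the symmetric tridiagonal matrix K described above (our notation):
# diagonal (1, 1+phi^2, ..., 1+phi^2, 1), off-diagonal elements -phi.
# Check that sig2_eta * K^{-1} matches the stationary AR(1) covariance
# sig2_eta * phi^{|i-j|} / (1 - phi^2).

def ar1_precision(n, phi):
    K = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    K += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return K

n, phi, sig2, sig2_eta = 6, 0.8, 1.0, 0.25
K = ar1_precision(n, phi)

idx = np.arange(n)
cov_ar1 = sig2_eta * phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi**2)
print(np.allclose(sig2_eta * np.linalg.inv(K), cov_ar1))   # True

# Marginal covariance of y in the same notation
Sigma = sig2_eta * np.linalg.inv(K) + sig2 * np.eye(n)
```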

Suppose we wish to find the maximum likelihood estimate $\hat{\theta}$ of $\theta$ that maximizes the observed data log-likelihood $\log p(y \mid \theta)$ using an EM algorithm. We consider the transformed latent states $\tilde{x}$ as the missing data and $(y, \tilde{x})$ as the augmented data. Given an initial estimate $\theta^{(0)}$, the algorithm performs an E-step and an M-step at each iteration $i$, where the E-step computes $Q(\theta \mid \theta^{(i)}) = E\{\log p(y, \tilde{x} \mid \theta) \mid y, \theta^{(i)}\}$ and the M-step maximizes $Q(\theta \mid \theta^{(i)})$ with respect to $\theta$. The conditional distribution $p(\tilde{x} \mid y, \theta)$ is $N(m_{w,a}, V_{w,a})$, where

$V_{w,a} = \sigma_\eta^{2-2a} (K + qI)^{-1}, \qquad m_{w,a} = \sigma_\eta^{-a} (K + qI)^{-1} \{ \mu K (1 - w) + q (y - \mu w) \}. \qquad (5)$

The subscripts of $m_{w,a}$ and $V_{w,a}$ represent their dependence on the values of $w$ and $a$ in the scheme; for instance, we write $m_{0,0}$ and $V_{0,0}$ when $w = 0$ and $a = 0$. In the expansion of $Q(\theta \mid \theta^{(i)})$, $m_{w,a}$ and $V_{w,a}$ are evaluated at $\theta^{(i)}$. At each iteration, the EM algorithm updates $m_{w,a}$ and $V_{w,a}$ given $\theta^{(i)}$ and sets $\theta^{(i+1)} = \arg\max_\theta Q(\theta \mid \theta^{(i)})$. The expectation-conditional maximization algorithm (Meng and Rubin, 1993) reduces the complexity of the M-step by replacing it with a sequence of conditional maximization steps. If we maximize $Q(\theta \mid \theta^{(i)})$ with respect to each element of $\theta$ with the remaining elements held fixed at their current values, then the update of each element can be obtained by setting the corresponding partial derivative of $Q(\theta \mid \theta^{(i)})$ to zero. This yields the following closed form updates for $\mu$ and $\sigma^2$.

(6)

Setting $\partial Q(\theta \mid \theta^{(i)})/\partial \sigma_\eta^2 = 0$ yields

(7)

and closed form updates for $\sigma_\eta^2$ are available only in certain special cases. Setting $\partial Q(\theta \mid \theta^{(i)})/\partial \phi = 0$ yields a cubic equation in $\phi$,

(8)

which can be solved using iterative methods.
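To fix ideas, here is a compact sketch of one ECM-style iteration for the centered parametrization ($w = 0$, $a = 0$) under our reconstruction of the model. The closed-form updates below are derived from the standard complete-data Gaussian likelihood rather than transcribed from (6)-(8), and the $\phi$-step maximizes the expected complete-data log-likelihood over a grid instead of solving the cubic equation (8); all names and values are ours.

```python
import numpy as np

# Sketch of one ECM-style iteration for the centered parametrization (w=0, a=0)
# of the AR(1)-plus-noise model reconstructed above.  The updates are derived
# from the standard complete-data Gaussian likelihood; they illustrate the
# E-step / conditional-maximization structure and are not the paper's (6)-(8).

def ar1_precision(n, phi):
    # Symmetric tridiagonal precision of a stationary AR(1), unit innovation variance
    K = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    K += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return K

def ecm_step(y, mu, phi, sig2, sig2_eta):
    n = len(y)
    ones = np.ones(n)
    K = ar1_precision(n, phi)

    # E-step: posterior mean m and covariance V of the states x given y and theta
    V = np.linalg.inv(K / sig2_eta + np.eye(n) / sig2)
    m = V @ (y / sig2 + (mu / sig2_eta) * (K @ ones))

    # CM-steps with closed-form solutions for mu, sigma^2 and sigma_eta^2
    mu_new = (ones @ K @ m) / (ones @ K @ ones)
    sig2_new = (np.sum((y - m) ** 2) + np.trace(V)) / n
    r = m - mu_new * ones
    sig2_eta_new = (r @ K @ r + np.sum(K * V)) / n   # sum(K*V) = trace(KV), both symmetric

    # CM-step for phi: maximize the expected complete-data log-likelihood on a grid
    # (the paper solves a cubic instead; a grid keeps this sketch dependency-free)
    def expected_loglik(p):
        Kp = ar1_precision(n, p)
        return 0.5 * np.log(1.0 - p**2) - 0.5 * (r @ Kp @ r + np.sum(Kp * V)) / sig2_eta_new
    grid = np.linspace(-0.99, 0.99, 99)
    phi_new = grid[np.argmax([expected_loglik(p) for p in grid])]

    return mu_new, phi_new, sig2_new, sig2_eta_new

# Illustrative usage on simulated data (all values are arbitrary)
rng = np.random.default_rng(2)
n, mu, phi, sig2, sig2_eta = 200, 1.0, 0.8, 1.0, 0.5
x = mu + rng.normal(0.0, np.sqrt(sig2_eta / (1.0 - phi**2)))
xs = []
for _ in range(n):
    x = mu + phi * (x - mu) + rng.normal(0.0, np.sqrt(sig2_eta))
    xs.append(x)
y = np.array(xs) + rng.normal(0.0, np.sqrt(sig2), n)

theta = (0.0, 0.5, 2.0, 1.0)
for _ in range(25):
    theta = ecm_step(y, *theta)
print("mu, phi, sigma2, sigma2_eta:", theta)
```

In practice the E-step would exploit the banded structure of the posterior precision (a tridiagonal solve) rather than forming a dense inverse; the dense version is kept here only for readability.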

Here we are interested in finding the values of $w$ and $a$ that optimize the rate of convergence of the EM algorithm. From (1), since $I_{\text{obs}}(\hat{\theta})$ depends only on the observed data and is independent of the parametrization, it suffices to minimize $I_{\text{aug}}(\hat{\theta})$ with respect to $w$ and $a$. We use $(A)_{ij}$ to denote the $(i,j)$ element of a matrix $A$. We first consider the cases where only the location or the scale is unknown, followed by the case where all the parameters are unknown.

3 Unknown location parameter

Suppose $\phi$, $\sigma^2$ and $\sigma_\eta^2$ are known and $\mu$ is the only unknown parameter. The EM algorithm for this case, hereafter called Algorithm 1, alternately updates $m_{w,a}$ and $\mu$ as in (5) and (6).

Theorem 1.

The rate of convergence of Algorithm 1 is $1 - I_{\text{obs}}(\hat{\mu})/I_{\text{aug}}(\hat{\mu})$, where $I_{\text{obs}}(\hat{\mu})$ and $I_{\text{aug}}(\hat{\mu})$ are the observed and augmented information for $\mu$ given in the proof. This rate is minimized to zero when $w$ is

$\hat{w} = (K + qI)^{-1} K 1. \qquad (9)$
Proof.

It can be shown that $I_{\text{obs}}(\hat{\mu}) = 1^\top (\sigma_\eta^2 K^{-1} + \sigma^2 I)^{-1} 1$ and $I_{\text{aug}}(\hat{\mu}) = (1 - w)^\top K (1 - w)/\sigma_\eta^2 + w^\top w/\sigma^2$. From (1), the rate of convergence is $1 - I_{\text{obs}}(\hat{\mu})/I_{\text{aug}}(\hat{\mu})$. Since $I_{\text{aug}}(\hat{\mu}) \geq I_{\text{obs}}(\hat{\mu})$ for any $w$ and $I_{\text{aug}}(\hat{\mu}) - I_{\text{obs}}(\hat{\mu}) = 0$ at $w = \hat{w}$, the rate is minimized to be zero at $\hat{w}$. ∎

From Theorem 1, the rates of convergence of Algorithm 1 for the centered ($w = 0$) and noncentered ($w = 1$) parametrizations are $1 - \sigma_\eta^2 I_{\text{obs}}(\hat{\mu})/(1^\top K 1)$ and $1 - \sigma^2 I_{\text{obs}}(\hat{\mu})/n$ respectively. When $w = \hat{w}$, Algorithm 1 converges instantly. This optimal rate is achievable only if the portion of $\mu$ subtracted from $x_t$ in (3) is allowed to vary with $t$. If $w_t$ is common for all $t$, we can show that the optimal value of each $w_t$ is $1^\top K 1/(1^\top K 1 + nq)$ and the rate of convergence is positive except when $\phi = 0$. Furthermore, it is possible to compute $\hat{w}$ in advance for Algorithm 1 as $\hat{w}$ does not depend on $\mu$. The rate of convergence is also independent of $a$. Thus, $a$ can be set to any convenient value, say zero.
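As a numerical check of Theorem 1 under the expressions reconstructed above (our own derivation, which may differ in form from the paper's), the sketch below compares the rate $1 - I_{\text{obs}}/I_{\text{aug}}$ for the centered, noncentered and optimal choices of $w$; with $w = \hat{w}$ the rate is numerically zero.

```python
import numpy as np

# Compare the EM rate 1 - I_obs/I_aug for mu under the centered (w = 0),
# noncentered (w = 1) and optimal (w = w_hat) schemes.  The information
# expressions follow our reconstruction of Theorem 1, not the paper verbatim.

def ar1_precision(n, phi):
    K = np.diag(np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0])
    K += np.diag(np.full(n - 1, -phi), 1) + np.diag(np.full(n - 1, -phi), -1)
    return K

n, phi, sig2, sig2_eta = 100, 0.9, 1.0, 0.5
q = sig2_eta / sig2
K = ar1_precision(n, phi)
ones = np.ones(n)

I_obs = ones @ np.linalg.solve(sig2_eta * np.linalg.inv(K) + sig2 * np.eye(n), ones)

def rate(w):
    I_aug = (ones - w) @ K @ (ones - w) / sig2_eta + w @ w / sig2
    return 1.0 - I_obs / I_aug

w_hat = np.linalg.solve(K + q * np.eye(n), K @ ones)
print("centered   :", rate(np.zeros(n)))
print("noncentered:", rate(ones))
print("optimal    :", rate(w_hat))   # numerically zero
```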

To investigate the range of the elements of $\hat{w}$ and their dependence on $\theta$, we derive an explicit expression for $\hat{w}$. From (9), $\hat{w}$ is given by the row sums of $(K + qI)^{-1} K$. We first present an expression for $(K + qI)^{-1}$ and some important properties. When $q > 0$, $K + qI$ can be written in terms of a symmetric tridiagonal matrix whose diagonal and off-diagonal elements are constant, and for which an explicit inverse is available.

Property 1.

If , for where

(10)
Property 2.

The sum of the th row of is . If , . If , .

Property 3.

If , is positive for and all elements of are positive.

Theorem 2.

Let be the th element of . If , and

(11)
Proof.

If , and . Hence . If , . Substituting the expression for from Property 2 and noting that yields the result in (11). ∎

By (10), Theorem 2 implies that $\hat{w}$ can be computed using only elements from the first row of $(K + qI)^{-1}$. In addition, $\hat{w}$ is symmetric in the sense that $\hat{w}_t = \hat{w}_{n+1-t}$ for each $t$. It is clear that the value of $\hat{w}$ depends on $\phi$, and on $\sigma^2$ and $\sigma_\eta^2$ only through the signal-to-noise ratio $q$. Corollary 1 presents bounds for $\hat{w}$ which are tight when $\phi = 0$.

Corollary 1.

For ,

where and . When , and when , .

Proof.

From Theorem 2, if and it clearly satisfies the bounds. If , for from Property 3 and hence from (11). From Property 2, implies that . If , and . Again from Property 2, implies . ∎

Corollary 2.

If $0 \leq \phi < 1$, each element of $\hat{w}$ decreases strictly as the signal-to-noise ratio $q$ increases and as $\phi$ increases. As $\phi$ approaches 1, $\hat{w}$ approaches the zero vector.

Proof.

Writing ,

where is a symmetric tridiagonal matrix with diagonal and off-diagonal elements equal to . From Property 2 and Corollary 1, all elements of and are positive. Hence each element of is negative and decreases strictly with for all . To show that each element of is negative, it suffices to show that each element of (a symmetric vector) is negative. The first and last elements are equal to which is negative. From Theorem 2, the th element of for is

from Property 2. It is clear from (11) that as $\phi$ approaches 1, $\hat{w}$ approaches the zero vector. ∎

For random effects models (Gelfand et al., 1995; Papaspiliopoulos et al., 2003), the partial noncentering parameter for location always lies between 0 and 1. This is also true for the Gaussian state space model when $\phi \geq 0$. However, when $\phi < 0$, $\hat{w}$ is positive but not necessarily bounded above by one, as shown in Fig. 1. From Corollary 2, the centered parametrization ($w = 0$) is increasingly preferred as $\phi$ and $q$ increase when $\phi \geq 0$. However, when $\phi < 0$, Fig. 1 shows that $\hat{w}$ may not be strictly decreasing with either $\phi$ or $q$.

Figure 1: Left: Plot of $\hat{w}_t$ (solid line) against $t$; dashed lines represent the bounds in Corollary 1. Right: Plot of $\hat{w}_t$ against $t$ under a different parameter setting.

4 Unknown scale parameter

Next, suppose $\mu$, $\phi$ and $\sigma^2$ are known and $\sigma_\eta^2$ is the only unknown parameter. The EM algorithm for this case (hereafter called Algorithm 2) alternately updates $m_{w,a}$ and $V_{w,a}$ at the E-step and $\sigma_\eta^2$ at the M-step, as given in (5) and (7) respectively.

Theorem 3.

If , the rate of convergence of Algorithm 2 is independent of and is minimized at . If , the rate is jointly minimized at

(12)
Proof.

The rate of convergence of Algorithm 2 can be optimized by minimizing with respect to and . We can show that , where

Note that is defined in (5) and is to be evaluated at . If , and is independent of . Therefore is minimized at . If , setting and yields

Solving the two equations simultaneously, we obtain the results in (12). The Hessian of can be verified to be positive definite at . Hence is minimized at . ∎

From Theorem 3, the optimal working parameters in (12), which we denote by $w^\ast$ and $a^\ast$, depend not only on the unknown $\sigma_\eta^2$ but also on the observed data $y$. Thus, even if the true parameter is known, $w^\ast$ and $a^\ast$ will still vary across datasets generated from model (4) due to sampling variability. We recommend updating $w$ and $a$ at each iteration in Algorithm 2 based on the latest update of $\sigma_\eta^2$.

In some settings $w^\ast$ is clearly bounded in $[0, 1]$; in general it is bounded above by 1 but may be negative, especially when $n$ is small and the signal-to-noise ratio is large. To investigate the dependence of $w^\ast$ on $\phi$ and the signal-to-noise ratio $q$, we compute $w^\ast$ for datasets simulated under a variety of parameter settings. We vary $n$, $\phi$ and $q$ over a range of values and generate 1000 datasets in each setting. Fig. 2 shows that the mean of $w^\ast$ seems to decrease as the signal-to-noise ratio increases, is symmetric about $\phi = 0$, and varies with $\phi$ in the form of an (inverted) U-shaped curve. Support for these observations is provided by Theorem 4 below, which derives an estimate of $w^\ast$ for large $n$. This estimate (shown in Fig. 2) improves as $n$ increases and is almost indistinguishable from the mean of $w^\ast$ for $n$ greater than 500. The proof of Theorem 4 requires the traces of two matrices, which are stated in Property 4 (see Tan, 2017).

Figure 2: Plots show the value of $w^\ast$ averaged over 1000 datasets simulated under different parameter settings. Solid lines represent the large-sample value given in Theorem 4.
Property 4.

The trace of is and the trace of is , where

Theorem 4.

If $|\phi| < 1$, $w^\ast$ converges in quadratic mean, as $n \to \infty$, to a limit that depends only on $\phi$ and the signal-to-noise ratio $q$.

Proof.

The marginal distribution of is . Let .

We first show that and as . From Property 4, is

which reduces to since . Moreover . Similarly, we can show that by using the expression of in Property 4. Finally

as . ∎

Corollary 3.

If , asymptotically almost surely.

Proof.

From Theorem 4,

as for any . Set . ∎

The approximation in Theorem 4 lends insight into how the large-sample limit of $w^\ast$ varies with $\phi$ and $q$. For instance, the limit is symmetric about $\phi = 0$. It is also easy to verify that its derivative with respect to $q$ is always negative, so the limit decreases strictly with the signal-to-noise ratio. From Corollary 3, $w^\ast$ almost surely lies in the interval $(0, 1)$ when $n$ is large. On the other hand, $a^\ast$ in (12) is not bounded in $(0, 1)$, unlike in Section 3. Results from simulations (see Fig. 3) show that as $n \to \infty$, $a^\ast$ is close to one under different conditions, but there is greater variability when the signal-to-noise ratio is large. In Theorem 5, we show that $a^\ast$ is approximately equal to one in expectation as $n \to \infty$, by taking the moments of a second order Taylor approximation. However, Fig. 3 shows that the 5th and 95th quantiles of $a^\ast$ do not approach one even for large $n$, especially when $\phi$ is close to one and/or the signal-to-noise ratio is large. Thus the variability of $a^\ast$ does not seem to vanish as $n \to \infty$.

Figure 3: Black, red and blue lines represent respectively the mean, 5th and 95th quantiles of the values of $a^\ast$ over 1000 datasets simulated under different parameter settings.
Theorem 5.

As $n \to \infty$, $E(a^\ast) \to 1$ approximately.

Proof.

Since