Reduced-dimensional Monte Carlo Maximum Likelihood for Latent Gaussian Random Field Models

10/22/2019 ∙ by Jaewoo Park and Murali Haran

Monte Carlo maximum likelihood (MCML) provides an elegant approach to finding maximum likelihood estimators (MLEs) for latent variable models. However, MCML algorithms are computationally expensive when the latent variables are high-dimensional and correlated, as is the case for latent Gaussian random field models. Latent Gaussian random field models are widely used, for example in building flexible regression models and in the interpolation of spatially dependent data in many research areas, such as the analysis of count data in disease modeling and of presence-absence satellite images of ice sheets. We propose a computationally efficient MCML algorithm that uses a projection-based approach to reduce the dimension of the random effects. We develop an iterative method for finding an effective importance function; this is generally a challenging problem and is crucial for the MCML algorithm to be computationally feasible. We find that our method is applicable to both continuous (latent Gaussian process) and discrete domain (latent Gaussian Markov random field) models. We illustrate the application of our methods to simulated and real data examples for which maximum likelihood estimation would otherwise be prohibitively expensive. Furthermore, we study an often overlooked challenge in MCML approaches to latent variable models: practical issues in calculating standard errors of the resulting estimates and in assessing whether the resulting confidence intervals provide nominal coverage. Our study therefore provides useful insights into the details of implementing MCML algorithms for high-dimensional latent variable models.

1 Introduction

Monte Carlo maximum likelihood (MCML) provides an elegant and flexible approach for maximum likelihood estimation for latent variable models. However, MCML algorithms may not be practical in many settings, especially in the context of high-dimensional latent variable models. For instance, they may be computationally infeasible for high-dimensional versions of stochastic volatility models for financial econometrics (cf. Jacquier et al., 2007, Duan et al., 2017), and for finite Gaussian mixture models for galaxy data (Johansen et al., 2008). A particularly important example of latent variable models where MCML estimation can be impractical is latent Gaussian random field models or spatial generalized linear mixed models (SGLMMs), which are popular models for spatial data and widely used in many scientific domains (cf. Banerjee et al., 2014, Lawson et al., 2016). The main challenge in MCML for high-dimensional latent variable models arises from the high-dimensional integration over a multivariate random variable. It is difficult to even obtain reasonable importance functions – distributions from which samples are drawn to obtain a Monte Carlo approximation of the likelihood – to carry out MCML for such models. In this manuscript, we develop a new computationally efficient approach to MCML for high-dimensional latent Gaussian random field models. Our method is based on projecting the high-dimensional latent variable to much lower dimensions, and involves a practical iterative approach for finding an importance function that leads to a computationally efficient algorithm. We also study different approaches for approximating the standard errors of the maximum likelihood estimates.

There is a vast literature on Bayesian approaches for latent Gaussian random field models, and there have been many algorithms proposed for handling the computational challenges effectively. Just a few of these include the predictive process (Banerjee et al., 2008), the integrated nested Laplace approximation or INLA (Rue et al., 2009), and reparameterization and dimension reduction methods (cf. Haran et al., 2003, Rue and Held, 2005, Christensen et al., 2006, Hughes and Haran, 2013, Guan and Haran, 2018). Frequentist methods for latent Gaussian random field models have been much less popular than Bayesian approaches at least in part due to the fact that it is very computationally challenging to do maximum likelihood estimation in this context. Here, we focus on the challenge of making maximum likelihood estimation efficient for such models via the MCML approach. We note that a Bayesian approach typically allows more easily for flexible hierarchical modeling than frequentist methods, though for certain commonly used SGLMMs a maximum likelihood approach can potentially lend itself to faster algorithms, and can also provide some automation by avoiding the need for prior specification.

Several algorithms have been developed for maximum likelihood inference for SGLMMs. Zhang (2002) proposes Monte Carlo expectation-maximization (MCEM) (Wei and Tanner, 1990) and Christensen (2004) proposes Monte Carlo maximum likelihood (MCML) (Geyer and Thompson, 1992). The MCEM algorithm uses Monte Carlo approximation in the expectation step of the expectation-maximization (EM) algorithm, while the MCML algorithm works with a Monte Carlo approximation of the log-likelihood function. These approximations require sampling the random effects and evaluating the likelihood function, which is computationally expensive for large data sets. Evangelou et al. (2011) and Bonat and Ribeiro (2016) develop sophisticated Laplace approximations for the full conditional distribution of the random effects; this approach is related to INLA. However, both algorithms require finding optimized values of the random effects for the Laplace approximations, which can be challenging for large data sets. Furthermore, unless the conditional distribution is well approximated by a Gaussian distribution, the Laplace approximation can be poor for high-dimensional data sets. Maximum likelihood inference has therefore rarely been used for analyzing large non-Gaussian spatial data sets. A recent exception is Guan and Haran (2019), which develops a projection-based Monte Carlo expectation-maximization algorithm that is computationally feasible for such problems. However, as discussed in Knudson (2016), rigorously quantifying uncertainties for the estimates can be difficult, as MCEM algorithms lack the theoretical justifications of MCML. With increasing Monte Carlo sample size, the asymptotic properties of the likelihood approximations in MCML are well established (Geyer and Thompson, 1992, Geyer, 1994). Specifically, under a compactness assumption on the parameter space, the Monte Carlo MLE (MCMLE) converges to the MLE almost surely (Geyer, 1994). The efficient MCML algorithm we develop in this manuscript can take advantage of existing asymptotic theory for estimating standard errors and constructing asymptotic confidence intervals. Furthermore, the MCML algorithm may be easier to extend to more general settings than the MCEM algorithm, for instance when the distributions of the latent variables include intractable normalizing functions, as in image denoising problems (Geman and Graffigne, 1986, Tanaka, 2002, Murphy, 2012) or in hidden Markov models for disease mapping (Green and Richardson, 2002).

In this manuscript we propose a fast MCML algorithm that uses projection methods (cf. Banerjee et al., 2012, Hughes and Haran, 2013, Guan and Haran, 2018). We show via simulated and real data examples that our algorithm can address the computational and inferential challenges of maximum likelihood estimation for SGLMMs, and demonstrate that our method is applicable to both the continuous and discrete spatial domain. The outline of the remainder of this paper is as follows. In Section 2 we describe SGLMMs and introduce relevant notation, and in Section 3 we outline a standard MCML approach and point out its computational challenges. In Section 4 we propose our fast dimension-reduced MCML approach to inference and provide implementation details. We also investigate in detail the computational complexity of our method. We study the performance of our approach via simulation studies in Section 5 and describe the application of our methods to two real data sets in Section 6. We conclude with a summary and discussion in Section 7.

2 Spatial Generalized Linear Mixed Models

In this section, we describe spatial generalized linear mixed models (SGLMMs), which are latent Gaussian random field models for spatial data. Non-Gaussian spatial data are common in climate science, epidemiology, ecology, and agriculture, among other disciplines. Examples include binary (ice/no ice) satellite data on ice sheets (Chang et al., 2016) and count data on plant disease (Zhang, 2002). SGLMMs are popular and flexible models well suited to non-Gaussian spatial data, both for continuous domains (Diggle et al., 1998) and for discrete domain or lattice data (Besag, 1974). They are commonly used either to interpolate these data or to adjust for spatial dependence in a spatial regression model. However, for large data sets, maximum likelihood inference for such models becomes computationally very demanding because evaluating the likelihood function involves high-dimensional integration and operations on large matrices. Furthermore, spatial confounding between the fixed and random effects can lead to unreliable parameter estimates (Reich et al., 2006, Hanks et al., 2015, Guan and Haran, 2018). In this manuscript we provide a fast MCML algorithm for SGLMMs that also allows for adjustments to account for spatial confounding.

We now describe SGLMMs and provide relevant notation. Let $\mathcal{D}$ be the spatial domain of interest. Consider $Y = (Y(s_1), \dots, Y(s_n))^T$, the observed data at coordinates $s_1, \dots, s_n \in \mathcal{D}$, and let $X$ be the $n \times p$ matrix of covariates depending on the spatial locations. At each location we can define a spatially correlated random effect $W(s_i)$. Then our model can be defined as follows. The random effects are defined differently depending on whether the spatial domain of interest is continuous or discrete: (1) For continuous domains, $\{W(s) : s \in \mathcal{D}\}$ is a zero-mean second-order stationary Gaussian process with a covariance kernel $C_{\theta}(\cdot, \cdot)$, where the covariance function depends on the parameters $\theta = (\sigma^2, \phi)$; $\sigma^2$ is the variance parameter and $\phi$ is the range parameter. For example, the Matérn class (Stein, 2012) of covariance functions is widely used. Then $W = (W(s_1), \dots, W(s_n))^T$ follows a normal distribution, $W \sim N(0, \Sigma(\theta))$. (2) For discrete domains, $W$ is assumed to follow a zero-mean Gaussian Markov random field (GMRF) (Besag, 1974). For all locations, $A_{ij} = 1$ if the $i$th location and the $j$th location are neighbors, $A_{ij} = 0$ otherwise, and $A_{ii}$ is defined as 0. This forms an adjacency matrix $A$. Then the distribution of $W$ is $W \sim N(0, (\tau Q)^{-1})$, where $Q = \operatorname{diag}(A\mathbf{1}) - A$ and $\tau$ controls the smoothness of the spatial field.
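To make the discrete-domain construction concrete, here is a minimal R sketch, under the assumptions above, that builds the adjacency matrix $A$ and the precision $Q = \operatorname{diag}(A\mathbf{1}) - A$ for a small regular lattice; the function name and lattice dimensions are illustrative, not from the paper.

```r
# A minimal sketch: build the adjacency matrix A and GMRF precision
# Q = diag(A %*% 1) - A for an nr x nc lattice with rook neighborhoods.
lattice_precision <- function(nr, nc) {
  n <- nr * nc
  A <- matrix(0, n, n)
  idx <- function(i, j) (j - 1) * nr + i      # column-major cell index
  for (i in 1:nr) for (j in 1:nc) {
    if (i < nr) A[idx(i, j), idx(i + 1, j)] <- A[idx(i + 1, j), idx(i, j)] <- 1
    if (j < nc) A[idx(i, j), idx(i, j + 1)] <- A[idx(i, j + 1), idx(i, j)] <- 1
  }
  Q <- diag(rowSums(A)) - A                   # precision up to the scale tau
  list(A = A, Q = Q)
}
```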

Then, with a link function $g$, the conditional mean can be modeled as

$g\left(E[Y(s_i) \mid \beta, W]\right) = X(s_i)^T \beta + W(s_i), \qquad i = 1, \dots, n. \qquad (1)$

Given the random effects $W$ and regression parameters $\beta$, the $Y(s_i)$ are assumed to be conditionally independent of each other (i.e., $f(y \mid \beta, W) = \prod_{i=1}^{n} f(y_i \mid \beta, W)$). Let $\Theta = (\beta, \theta)$ be the model parameters for SGLMMs. Then the likelihood function is

$L(\Theta \mid y) = \int f(y \mid \beta, W)\, f(W \mid \theta)\, dW. \qquad (2)$

The number of random effects grows with the size of the data, which makes evaluating the likelihood function difficult. Therefore, direct maximization of (2) is infeasible.

We note that for SGLMMs, spatial confounding can lead to uninterpretability of parameters and to variance inflation. Let $P = X(X^T X)^{-1} X^T$ and $P^{\perp} = I - P$. Then our model can be represented as

$g\left(E[Y \mid \beta, W]\right) = X\beta + PW + P^{\perp}W. \qquad (3)$

Then, similar to the multicollinearity issue in standard regression models, $X\beta$ is confounded with $PW$ because they have a linear relationship. To alleviate spatial confounding, Reich et al. (2006) suggest that $PW$ should be deleted from the model, which is called a restricted spatial regression model. For this model, Hanks et al. (2015) propose an a posteriori adjustment based on the MCMC samples of the random effects to account for the effect of removing $PW$. Hughes and Haran (2013) and Guan and Haran (2018) develop fast reduced-dimensional Bayesian methods, and Guan and Haran (2019) develops a Monte Carlo expectation-maximization approach for maximum likelihood inference; all these approaches address both computational and confounding issues. In what follows, we develop a Monte Carlo maximum likelihood approach that is computationally efficient and also useful for addressing confounding issues.

3 Monte Carlo Maximum Likelihood

Here we outline a Monte Carlo maximum likelihood (MCML) algorithm (Geyer and Thompson, 1992, Christensen, 2004) for SGLMMs. For any fixed $\tilde{\Theta}$, the likelihood function can be written as

$L(\Theta \mid y) = \int \frac{f(y \mid \beta, W)\, f(W \mid \theta)}{f_{\tilde{\Theta}}(W \mid y)}\, f_{\tilde{\Theta}}(W \mid y)\, dW, \qquad (4)$

where $f_{\tilde{\Theta}}(W \mid y)$ is the conditional density of the random effects given the data. In the MCML context $f_{\tilde{\Theta}}(W \mid y)$ is called the "importance function", and $\tilde{\Theta}$ should be reasonably close to the MLE for an accurate approximation. Then the MLE can be obtained by maximizing the following Monte Carlo approximation to the log-likelihood function:

$\hat{\ell}_K(\Theta) = \log \frac{1}{K} \sum_{k=1}^{K} \frac{f(y \mid \beta, W^{(k)})\, f(W^{(k)} \mid \theta)}{f_{\tilde{\Theta}}(W^{(k)} \mid y)}, \qquad (5)$

where $W^{(1)}, \dots, W^{(K)}$ are sampled by MCMC from the distribution $f_{\tilde{\Theta}}(W \mid y)$. This approach works well for small data sets. However, for data sets with more than a few thousand observations, constructing efficient MCMC algorithms is challenging because each update is expensive due to the dimensionality of the random effects. Furthermore, the Markov chains thus constructed tend to mix very slowly, which means that many more iterations are needed to obtain reasonable approximations. Christensen and Waagepetersen (2002) propose a well-mixing Langevin–Hastings algorithm that uses gradient information about the target distribution in the proposal, though the computational cost per iteration can increase significantly, thereby reducing the gains from better mixing. Haran et al. (2003) and Knorr-Held and Rue (2002) suggest approaches for constructing proposals for large block updates of the random effects. Although this can improve the mixing of the MCMC algorithm, application to large data sets remains challenging. Furthermore, the choice of importance function is crucial to the success of the MCML approach; that is, $\tilde{\Theta}$ should be reasonably close to the MLE. To address these challenges, we propose a projection-based MCML approach in the following section.
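As a concrete illustration, the following R sketch evaluates the Monte Carlo objective for a Poisson SGLMM. Since the importance density in (5) is known only up to the normalizing constant $L(\tilde{\Theta})$, the sketch evaluates the joint-density ratio, which approximates $\ell(\Theta) - \ell(\tilde{\Theta})$ and differs from (5) only by a constant in $\Theta$, so it has the same maximizer. All object names are illustrative; `W` is a $K \times n$ matrix of MCMC draws from the importance function.

```r
# A minimal sketch, assuming a Poisson SGLMM with log link.
library(mvtnorm)

log_joint <- function(y, X, beta, w, Sigma) {
  sum(dpois(y, lambda = exp(X %*% beta + w), log = TRUE)) +
    dmvnorm(w, sigma = Sigma, log = TRUE)
}

mcml_loglik <- function(y, X, W, beta, Sigma, beta_tilde, Sigma_tilde) {
  log_ratios <- apply(W, 1, function(w) {
    log_joint(y, X, beta, w, Sigma) - log_joint(y, X, beta_tilde, w, Sigma_tilde)
  })
  m <- max(log_ratios)                 # log-mean-exp for numerical stability
  m + log(mean(exp(log_ratios - m)))
}
```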

4 Projection-based Monte Carlo Maximum Likelihood

4.1 Projection-based Dimension Reduction

Projection methods (Hughes and Haran, 2013, Guan and Haran, 2018) have been developed for both the continuous and discrete spatial domain to alleviate spatial confounding issues as well as to reduce computational costs. They reduce the dimension of the random effects from $n$ to $m \ll n$ by using an $n \times m$ projection matrix $\Phi$.

In the continuous spatial domain, the model includes the zero-mean Gaussian random effects $W$ with covariance $\Sigma(\theta) = \sigma^2 R(\phi)$, where $\sigma^2$ is the variance parameter and $\phi$ is the range parameter of the correlation matrix $R(\phi)$. In the context of principal component analysis (PCA), the projection matrix is constructed via $\Phi = U_m \Lambda_m^{1/2}$, where $U_m$ contains the first $m$ eigenvectors of the correlation matrix $R(\phi)$ and $\Lambda_m$ is a diagonal matrix with the corresponding eigenvalues. The resulting model (Guan and Haran, 2018) is

$g\left(E[Y \mid \beta, \delta]\right) = X\beta + \Phi\delta, \qquad \delta \sim N(0, \sigma^2 I_m). \qquad (6)$

By multiplying $\Phi$ by $P^{\perp}$, a restricted model (Reich et al., 2006) can easily be constructed to eliminate spatial confounding. For every update of $\phi$, an eigendecomposition of $R(\phi)$ is necessary. This can lead to computational challenges for large data sets. Guan and Haran (2018) replace this exact eigendecomposition with a probabilistic version of Nyström's approximation (Williams and Seeger, 2001, Drineas and Mahoney, 2005). We provide an outline of the random projection method in the supplementary material.
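The following R sketch constructs this PCA-style projection for a given $\phi$ using an exact eigendecomposition; for large $n$ the Nyström approximation mentioned above would replace the `eigen()` call. The Matérn smoothness-2.5 correlation and all function names are illustrative choices, not prescribed by the paper.

```r
# A sketch of the projection Phi = U_m Lambda_m^{1/2}, assuming a Matern
# (smoothness 2.5) correlation function.
matern25 <- function(d, phi) {
  (1 + sqrt(5) * d / phi + 5 * d^2 / (3 * phi^2)) * exp(-sqrt(5) * d / phi)
}

make_projection <- function(coords, phi, m) {
  R   <- matern25(as.matrix(dist(coords)), phi)  # n x n correlation matrix
  eig <- eigen(R, symmetric = TRUE)
  U_m      <- eig$vectors[, 1:m]                 # first m eigenvectors
  Lambda_m <- diag(eig$values[1:m])              # corresponding eigenvalues
  U_m %*% sqrt(Lambda_m)                         # Phi: an n x m matrix
}
```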

In the discrete case, we can construct $\Phi$ by taking the first $m$ principal components (eigenvectors) of the Moran operator $P^{\perp} A P^{\perp}$, where $A$ is the adjacency matrix. Such a projection matrix can account for the spatial covariates of interest as well as for the underlying graph on a lattice domain. Then the projected model (Hughes and Haran, 2013) can be written as follows:

$g\left(E[Y \mid \beta, \delta]\right) = X\beta + \Phi\delta, \qquad \delta \sim N\left(0, (\tau\, \Phi^T Q \Phi)^{-1}\right). \qquad (7)$
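A minimal R sketch of this discrete-domain construction, with illustrative names, is given below.

```r
# A sketch: take the first m eigenvectors of the Moran operator
# P_perp %*% A %*% P_perp (Hughes and Haran, 2013).
moran_projection <- function(X, A, m) {
  n <- nrow(X)
  P_perp <- diag(n) - X %*% solve(crossprod(X), t(X))  # I - X(X'X)^{-1}X'
  M <- P_perp %*% A %*% P_perp                         # Moran operator
  eigen(M, symmetric = TRUE)$vectors[, 1:m]            # Phi: n x m
}
```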

In the following section we propose a projection-based MCML approach for SGLMMs. Our approach addresses computational and inferential challenges, and is generally applicable to both the continuous and discrete domain. Furthermore, we provide some guidance on how to tune the algorithm.

4.2 Projection-based MCML

Here we propose a fast Monte Carlo maximum likelihood (MCML) algorithm for SGLMMs. Our approach is applicable to both the continuous and discrete spatial domain. Based on the projection approaches in the previous section, the likelihood function can be represented as

$L(\Theta \mid y) = \int \frac{f(y \mid \beta, \delta)\, f(\delta \mid \theta)}{f_{\tilde{\Theta}}(\delta \mid y)}\, f_{\tilde{\Theta}}(\delta \mid y)\, d\delta, \qquad (8)$

where $f_{\tilde{\Theta}}(\delta \mid y)$ is the importance function for the reduced-dimensional random effects. Then we can obtain the MLE by maximizing the following Monte Carlo approximation of the reduced-dimensional log-likelihood function:

$\hat{\ell}_K(\Theta) = \log \frac{1}{K} \sum_{k=1}^{K} \frac{f(y \mid \beta, \delta^{(k)})\, f(\delta^{(k)} \mid \theta)}{f_{\tilde{\Theta}}(\delta^{(k)} \mid y)}, \qquad (9)$

where $\delta^{(1)}, \dots, \delta^{(K)}$ are sampled from the distribution $f_{\tilde{\Theta}}(\delta \mid y)$ via MCMC. The MCML algorithm based on projection shows significant gains in computational efficiency over the original method (Christensen, 2004). This is because, compared to the full model likelihood function (4), (8) requires integration over a much smaller dimension ($m$ instead of $n$). For example, in our simulation studies we show that a small rank $m$ can provide accurate estimates and prediction. Furthermore, $\delta$ is less correlated than the original random effects $W$, resulting in fast mixing of the MCMC algorithm used to sample the random effects from $f_{\tilde{\Theta}}(\delta \mid y)$.

In the continuous spatial domain, the model parameters are $\Theta = (\beta, \sigma^2, \phi)$. The maximization of (9) with respect to $\beta$ and $\sigma^2$ is straightforward, because we can easily derive the first and second derivatives of (9). Consider the conditional distribution of the response variable, $f(y \mid \beta, \delta)$, which is from the exponential family. Then, for a canonical link, the first and second derivatives with respect to the regression parameters are

$\frac{\partial}{\partial \beta} \log f(y \mid \beta, \delta) = X^T (y - \mu), \qquad \frac{\partial^2}{\partial \beta\, \partial \beta^T} \log f(y \mid \beta, \delta) = -X^T D X, \qquad (10)$

where $\mu = E[Y \mid \beta, \delta]$ and $D$ is a diagonal matrix whose elements are the conditional variance of $Y_i$. In the continuous case $\delta \sim N(0, \sigma^2 I_m)$, and the derivatives with respect to $\sigma^2$ are

$\frac{\partial}{\partial \sigma^2} \log f(\delta \mid \sigma^2) = -\frac{m}{2\sigma^2} + \frac{\delta^T \delta}{2\sigma^4}, \qquad \frac{\partial^2}{\partial (\sigma^2)^2} \log f(\delta \mid \sigma^2) = \frac{m}{2\sigma^4} - \frac{\delta^T \delta}{\sigma^6}. \qquad (11)$

The derivatives with respect to $\phi$ do not have a closed form. For a given $\phi$, we use a Newton-Raphson procedure to obtain the $\hat{\beta}$ and $\hat{\sigma}^2$ which maximize (9). We refer the reader to the supplementary material for the calculation of the first and second derivatives of the objective function (9). Then $\hat{\beta}$ and $\hat{\sigma}^2$ are plugged into (9) to obtain a profile objective in $\phi$. This function is maximized with respect to $\phi$ using numerical optimization. Our projection-based MCML algorithm is described in Algorithm 1 below.
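To illustrate the Newton-Raphson step, the R sketch below performs one update for $\beta$ in the projected Poisson model, combining the derivatives in (10) with self-normalized importance weights (the chain-rule gradient of the log-sum objective (9)); for simplicity the weight-gradient contribution to the Hessian, which appears in the supplementary material, is omitted here. All names are illustrative, and `log_w` holds the $K$ log importance ratios evaluated at the current iterate.

```r
# A sketch of one Newton-Raphson update for beta (Poisson model, log link).
newton_step_beta <- function(beta, y, X, Phi, delta, log_w) {
  v <- exp(log_w - max(log_w)); v <- v / sum(v)   # self-normalized weights
  p <- length(beta)
  grad <- rep(0, p); hess <- matrix(0, p, p)
  for (k in seq_len(nrow(delta))) {
    mu   <- as.numeric(exp(X %*% beta + Phi %*% delta[k, ]))  # Poisson mean
    grad <- grad + v[k] * as.numeric(crossprod(X, y - mu))    # X'(y - mu)
    hess <- hess - v[k] * crossprod(X, X * mu)                # -X' D X
  }
  as.numeric(beta - solve(hess, grad))            # Newton-Raphson step
}
```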

Given $\tilde{\Theta}$ which is close to the MLE:
1. Simulate random effects $\delta^{(1)}, \dots, \delta^{(K)}$ via MCMC from $f_{\tilde{\Theta}}(\delta \mid y)$.
2. Construct the Monte Carlo approximation $\hat{\ell}_K(\Theta)$ to the log-likelihood function as in (9).
3. For given $\phi$, obtain $\hat{\beta}(\phi)$ and $\hat{\sigma}^2(\phi)$ via an iterative Newton-Raphson method.
4. Given $\hat{\beta}(\phi)$ and $\hat{\sigma}^2(\phi)$, obtain $\hat{\phi}$ via numerical optimization of the profile objective.
5. For the given $\hat{\phi}$, obtain $\hat{\beta}$ and $\hat{\sigma}^2$ via an iterative Newton-Raphson method.
Algorithm 1: Basic (non-iterative) projection-based MCML algorithm (continuous domain)

In Step 1 of Algorithm 1, we construct a standard Metropolis-Hastings (MH) algorithm whose stationary distribution is $f_{\tilde{\Theta}}(\delta \mid y)$ (see the supplementary material for details). In practice, however, it is challenging to start the algorithm with a $\tilde{\Theta}$ that is reasonably close to the MLE. Therefore, we provide Algorithm 2, an iterative method for finding $\tilde{\Theta}$. Starting with an arbitrary initial value (e.g. GLM estimates), we repeat Algorithm 1 with one modification: we replace the numerical optimization of $\phi$ with a numerical search over neighboring values of $\phi$ to reduce the computational expense of that step. The iterative search is repeated until $\hat{\ell}_K(\Theta^{(t+1)})$ is close to 0, which implies that $\Theta^{(t+1)}$ and $\Theta^{(t)}$ are close to each other. We present details about stopping rules for the algorithm in Section 4.3.

For $\Theta^{(t)}$ at the $t$th iteration, the following procedures are repeated until $\hat{\ell}_K(\Theta^{(t+1)})$ becomes close to 0.
1. Simulate random effects $\delta^{(1)}, \dots, \delta^{(K)}$ via MCMC from $f_{\Theta^{(t)}}(\delta \mid y)$.
2. Construct the Monte Carlo approximation $\hat{\ell}_K(\Theta)$ for the objective function as in (9).
3. For $\phi^{(t)}$, obtain $\beta^{(t+1)}$ and $\sigma^{2\,(t+1)}$ via an iterative Newton-Raphson method.
4. For $\beta^{(t+1)}$ and $\sigma^{2\,(t+1)}$, obtain $\phi^{(t+1)}$ via a numerical search over the neighboring values of $\phi^{(t)}$ which maximizes $\hat{\ell}_K$.
5. Set $\Theta^{(t+1)} = (\beta^{(t+1)}, \sigma^{2\,(t+1)}, \phi^{(t+1)})$.
Algorithm 2: An iterative method for finding $\tilde{\Theta}$ (continuous domain)

In the discrete spatial domain, the parameters of interest are $\Theta = (\beta, \tau)$. The derivatives with respect to $\beta$ are identical to the derivatives in the continuous case, (10). However, the derivatives with respect to the covariance parameter $\tau$ are derived separately because of the difference in the covariance structure. Since $\delta \sim N(0, (\tau\, \Phi^T Q \Phi)^{-1})$, the derivatives with respect to $\tau$ are

$\frac{\partial}{\partial \tau} \log f(\delta \mid \tau) = \frac{m}{2\tau} - \frac{\delta^T \Phi^T Q \Phi\, \delta}{2}, \qquad \frac{\partial^2}{\partial \tau^2} \log f(\delta \mid \tau) = -\frac{m}{2\tau^2}. \qquad (12)$

Since we have closed-form derivatives for both $\beta$ and $\tau$, the projection-based MCML algorithm and its iterative search method can be written as shown in Algorithm 3 and Algorithm 4.

Given $\tilde{\Theta}$ which is close to the MLE:
1. Simulate random effects $\delta^{(1)}, \dots, \delta^{(K)}$ via MCMC from $f_{\tilde{\Theta}}(\delta \mid y)$.
2. Construct the Monte Carlo approximation $\hat{\ell}_K(\Theta)$ to the log-likelihood function as in (9).
3. Obtain $(\hat{\beta}, \hat{\tau})$ via an iterative Newton-Raphson method.
Algorithm 3: Basic (non-iterative) projection-based MCML algorithm (discrete domain)

For $\Theta^{(t)}$ at the $t$th iteration, the following procedures are repeated until $\hat{\ell}_K(\Theta^{(t+1)})$ becomes close to 0.
1. Simulate random effects $\delta^{(1)}, \dots, \delta^{(K)}$ via MCMC from $f_{\Theta^{(t)}}(\delta \mid y)$.
2. Construct the Monte Carlo approximation $\hat{\ell}_K(\Theta)$ for the objective function as in (9).
3. Obtain $(\beta^{(t+1)}, \tau^{(t+1)})$ via an iterative Newton-Raphson method.
Algorithm 4: An iterative method for finding $\tilde{\Theta}$ (discrete domain)

4.3 Implementation Details

For our fast MCML approaches, we examine the approximation error. There are two sources of error for MCML estimates: (1) sampling error and (2) Monte Carlo error. Knudson (2016) derives these two sources of error in the context of generalized linear mixed models (GLMMs). We apply the results in Knudson (2016) to our fast MCML approaches for SGLMMs. Consider the MLE $\hat{\Theta}_n$ from the data with sample size $n$, and let $\Theta^{*}$ be the true parameter value. Then, under the regularity conditions listed in Geyer et al. (2013), $\sqrt{n}(\hat{\Theta}_n - \Theta^{*})$ asymptotically follows $N(0, J^{-1})$, where $J$ is the Fisher information. The other source of error comes from the Monte Carlo approximation. Let $\hat{\Theta}_{n,K}$ be the Monte Carlo estimate based on $K$ Monte Carlo samples. Then, under the regularity conditions listed in Geyer (1994), $\sqrt{K}(\hat{\Theta}_{n,K} - \hat{\Theta}_n)$ converges to $N(0, V)$, where $V$ is the sandwich covariance

$V = J^{-1}\, \Sigma\, J^{-1}, \qquad (13)$

with $\Sigma$ the asymptotic covariance matrix of the Monte Carlo score. $J$ can be estimated as $-\nabla^2 \hat{\ell}_K(\hat{\Theta}_{n,K})$ based on the observed Fisher information. $\Sigma$ can also be estimated from the MCMC output as

(14)

Then Monte Carlo standard errors can be obtained as the square roots of the diagonal elements of $\hat{V}/K$. One might want to use this quantity as a stopping rule for the number of Monte Carlo samples. However, evaluating it at each iteration of the MCMC algorithm is computationally expensive because it requires calculating the score contribution for each of the $K$ samples. Instead, we use the effective sample size (ESS) (Kass et al., 1998, Robert and Casella, 2013) as a stopping rule. Incorporating these two sources of error is challenging in general. Sung et al. (2007) found that the sampling error and the Monte Carlo error can be added when the importance sampling distribution is constructed independently of the data. However, we cannot apply this result directly because our importance sampling distribution depends on the data.

We can also obtain standard errors based on the parametric bootstrap (Efron and Tibshirani, 1994) as follows: (1) from the observed data, we obtain parameter estimates via our fast MCML approach; (2) multiple data sets are simulated from the SGLMM at the estimated parameter values; (3) we obtain point estimates for each simulated data set using our fast MCML method; (4) finally, we estimate standard errors from the standard deviations of the resulting parameter estimates. In this manuscript we compare the observed Fisher information-based standard errors with the bootstrap-based standard errors. We find that bootstrap-based standard errors can provide more reasonable coverage (see Table 3 and Table 5).
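A minimal R sketch of this bootstrap procedure is given below; `fit_mcml` and `simulate_sglmm` are hypothetical wrappers around the fitting and simulation steps described above, not functions from the paper's code.

```r
# A sketch of the parametric bootstrap for standard errors.
bootstrap_se <- function(Theta_hat, coords, X, B = 100) {
  est <- replicate(B, {
    y_b <- simulate_sglmm(Theta_hat, coords, X)  # (2) simulate from fitted model
    fit_mcml(y_b, coords, X)                     # (3) re-estimate on each replicate
  })
  apply(est, 1, sd)                              # (4) SDs across replicates
}
```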

For this projection-based MCML approach, we need to tune the following: (1) the rank $m$, (2) the number of random effects simulations $K$ for the Monte Carlo approximations, and (3) the number of iterations of the search for $\tilde{\Theta}$. For the rank selection we follow the suggestions in Hughes and Haran (2013) and Guan and Haran (2018). The rank can be chosen to achieve a desired proportion of variation (e.g. 95%) explained by the leading eigenvalues. For the continuous domain we conduct an eigendecomposition of the correlation matrix $R(\phi)$ for the given initial $\phi$. For the discrete domain we can take the first $m$ principal components of the Moran operator $P^{\perp} A P^{\perp}$. For all the examples we consider in our study, a modest rank appears to suffice.

The number of MCMC iterations is determined by the effective sample size (ESS) (Kass et al., 1998, Robert and Casella, 2013, Gong and Flegal, 2015). The ESS approximates the number of independent samples that corresponds to a given number of dependent MCMC samples. A nearly independent chain would return an ESS similar to the actual length of the Markov chain; therefore, the ESS provides a rough diagnostic for how well the chain is mixing. In Algorithm 1 and Algorithm 3 we run the MCMC until the ESS of the first random effect is at least 20 times the rank $m$. However, for the iterative search (Algorithm 2 and Algorithm 4) we do not need a long MCMC run, because the Monte Carlo approximations do not need to be as accurate. Therefore we run the MCMC until the ESS of the first random effect is at least 3 times the rank.
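This stopping rule is straightforward to implement with the `coda` package; the sketch below assumes the chain for the first random effect is stored in a numeric vector `delta_chain`, and the function name is illustrative.

```r
# A sketch of the ESS-based stopping rule.
library(coda)
enough_samples <- function(delta_chain, m, mult = 20) {
  # mult = 20 for Algorithms 1 and 3; mult = 3 for the iterative search
  effectiveSize(as.mcmc(delta_chain)) >= mult * m
}
```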

In Algorithm 2 and Algorithm 4, the iterative search is repeated until $\hat{\ell}_K(\Theta^{(t+1)})$ becomes close to 0, which indicates that $\Theta^{(t+1)}$ and $\Theta^{(t)}$ are similar to each other. We stop the iterative search when $|\hat{\ell}_K(\Theta^{(t+1)})| < c \times \mathrm{ASE}$, where ASE denotes the asymptotic standard error of the log-likelihood estimate obtained using batch means (Flegal et al., 2008). We set $c$ to 0.5; this value was obtained by trial and error. In summary, we determine the number of random effects simulations $K$ as a function of the rank $m$; we find that an ESS of at least $20m$ for the final iteration and at least $3m$ in all previous iterations works well in practice.
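One way to implement this stopping rule is sketched below, using the batch-means standard error from the `mcmcse` package together with a delta-method adjustment for the log of the mean; the function name, the input `log_ratios` (the per-sample log importance ratios from the current iteration), and the delta-method choice are our own illustrative assumptions.

```r
# A sketch of the iterative-search stopping rule with c = 0.5.
library(mcmcse)
stop_search <- function(log_ratios, c = 0.5) {
  r   <- exp(log_ratios - max(log_ratios))
  ll  <- log(mean(r)) + max(log_ratios)  # MC log-likelihood ratio, as in (9)
  ase <- mcse(r)$se / mean(r)            # delta-method batch-means ASE of ll
  abs(ll) < c * ase                      # stop when the log-ratio is near 0
}
```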

4.4 Prediction

In many continuous spatial domain applications, prediction at unsampled locations is of great interest. Consider prediction at locations $s^{*}_1, \dots, s^{*}_q$. In the projection-based model, the spatial random effects at the observed coordinates can be reparameterized as $W = \Phi\delta$. To obtain the random effects $W^{*}$ at the coordinates $s^{*}_1, \dots, s^{*}_q$, we use basic properties of the Gaussian process to obtain

$\begin{pmatrix} W \\ W^{*} \end{pmatrix} \sim N\left(0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right), \qquad (15)$

where $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$ are the blocks of the joint covariance matrix implied by the covariance function. Then the conditional distribution of $W^{*}$ given the observed $W$ is

$W^{*} \mid W \sim N\left(\Sigma_{21}\Sigma_{11}^{-1} W,\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right). \qquad (16)$

To make a prediction of the random effects, the Monte Carlo sample of the random effects and the parameter estimates from the last iteration of the MCML algorithm are plugged into (16). Then we can obtain the empirical best linear unbiased predictor (EBLUP) of $W^{*}$ from the mean in (16).
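The R sketch below computes the conditional (kriging) mean in (16), reusing the illustrative `matern25()` helper from Section 4.1's sketch; with a common variance $\sigma^2$, the correlation matrices suffice because the scale factor cancels in $\Sigma_{21}\Sigma_{11}^{-1}$.

```r
# A sketch of the EBLUP of W* at new locations, under the earlier assumptions.
library(fields)  # for rdist()
predict_w <- function(coords_obs, coords_new, phi, Phi, delta_hat) {
  w_obs   <- as.numeric(Phi %*% delta_hat)                 # W = Phi delta at observed sites
  R_obs   <- matern25(as.matrix(dist(coords_obs)), phi)    # Sigma_11 / sigma^2
  R_cross <- matern25(rdist(coords_new, coords_obs), phi)  # Sigma_21 / sigma^2
  R_cross %*% solve(R_obs, w_obs)                          # conditional (kriging) mean
}
```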

4.5 Computational Complexity

Here we examine the computational complexity of our projection-based MCML approaches and the standard MCML approaches (Christensen, 2004), summarizing how our methods scale as we increase the size of the data. Table 1 summarizes these calculations. We provide details about calculating computational complexity for the algorithms in the supplementary material.

Continuous Domain Operations Fast MCML Standard MCML
Importance sampling
Update
Update
Discrete Domain Operations Fast MCML Standard MCML
Importance sampling
Update
Table 1: Computational complexity. is the number of data points and is the reduced rank.

The computational benefits come from reducing the dimension of the random effects through projection. We can avoid expensive computations involving large covariance matrices when we update the model parameters. Considering that $m$ is much smaller than $n$ (in our simulation studies $m \le 100$ for $n$ in the thousands), our projection-based MCML approach scales well compared to the standard MCML approach.

5 Simulation Studies

We apply our approach to count and binary data for both the continuous and discrete domain. The code is implemented in R, and the source code can be downloaded from the following repository: https://github.com/jwpark88/projMCML. All the code was run on dual 10-core Xeon E5-2680 processors on the Penn State high-performance computing cluster.

5.1 Count Data in a Continuous Domain

We simulate a count data set at $n = 1{,}400$ locations in the unit domain with fixed model parameters. We first simulate the random effects using a Matérn class (Stein, 2012) covariance function with smoothness parameter 2.5. Given the simulated random effects, count observations are simulated from a Poisson distribution with rates $\lambda_i = \exp(X_i \beta + W_i)$, where $X$ is the coordinate matrix for the locations, randomly generated from the unit domain. We fit the model using the first 1,000 observations, and 400 points are used for prediction.
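For concreteness, here is an R sketch of this simulation. Since the true parameter values are not reproduced above, the values below ($\beta = (1, 1)$, $\sigma^2 = 1$, $\phi = 0.2$) are illustrative only, chosen to be consistent with the estimates reported in Table 2.

```r
# A sketch of the Section 5.1 simulation: coordinates uniform on the unit
# square, Matern (smoothness 2.5) covariance, Poisson response with log link.
library(fields); library(mvtnorm)
set.seed(1)
n      <- 1400                              # 1,000 for fitting, 400 for prediction
coords <- cbind(runif(n), runif(n))
X      <- coords                            # covariate matrix from coordinates
beta   <- c(1, 1); sigma2 <- 1; phi <- 0.2  # illustrative "true" values
Sigma  <- sigma2 * Matern(rdist(coords), range = phi, smoothness = 2.5)
w      <- drop(rmvnorm(1, sigma = Sigma))   # latent Gaussian random field
y      <- rpois(n, lambda = exp(X %*% beta + w))
```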

Our initial values are obtained from GLM estimates. For the range parameter $\phi$, we use the first quartile of the Euclidean distances between the locations. For the simulated example, this determines the initial values of the model parameters. We select a reduced rank $m$ and follow the stopping criteria described in Section 4.3. Figure 1 indicates agreement between the simulated and predicted random effects; we observed similar spatial patterns. We study our methods for different values of $m$. Table 2 shows that the parameter estimates from the different choices of $m$ are reasonably close to the true values. We observe that the bootstrap-based confidence intervals are wider than the observed Fisher information-based confidence intervals. Furthermore, we can obtain confidence intervals for $\phi$ using the bootstrap, without calculating analytically intractable derivatives.

Figure 1: The left panel shows the simulated random effects at the observation (top) and prediction locations (bottom). The right panel shows the estimated (top) and predicted (bottom) random effects. Spatial patterns are similar between simulated and predicted random effects.
m | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\sigma}^2$ | $\hat{\phi}$ | Time (min)
25 | 0.996 | 0.922 | 0.999 | 0.190 | 7.676
Fisher 95% CI | (0.859, 1.132) | (0.784, 1.060) | (0.423, 1.576) | NA
Bootstrap 95% CI | (0.861, 1.127) | (0.745, 1.068) | (0.558, 5.734) | (0.124, 0.411)
50 | 1.078 | 0.952 | 1.53 | 0.237 | 17.450
Fisher 95% CI | (0.944, 1.212) | (0.817, 1.086) | (0.877, 2.185) | NA
Bootstrap 95% CI | (0.922, 1.189) | (0.779, 1.054) | (0.659, 5.304) | (0.174, 0.384)
75 | 1.030 | 0.912 | 1.48 | 0.255 | 23.439
Fisher 95% CI | (0.971, 1.088) | (0.877, 0.946) | (0.961, 1.992) | NA
Bootstrap 95% CI | (0.882, 1.144) | (0.767, 1.028) | (0.599, 3.799) | (0.168, 0.367)
100 | 1.030 | 0.901 | 1.44 | 0.246 | 31.774
Fisher 95% CI | (0.872, 1.182) | (0.746, 1.036) | (1.011, 1.872) | NA
Bootstrap 95% CI | (0.911, 1.169) | (0.726, 1.020) | (0.587, 3.846) | (0.179, 0.343)
Table 2: Continuous domain simulated Poisson data; parameter estimates for different choices of $m$ are close to the true values. We cannot obtain standard errors of $\hat{\phi}$ via the Fisher information because the required derivatives are analytically intractable.

To validate our methods, we repeat the simulation 100 times. We used the same algorithm tuning parameters ($m$, $K$) for all the simulated data sets. Figure 2 indicates that the point estimates for each simulated data set are distributed around the true values. Compared to the regression parameters, the estimates of the covariance parameters are slightly right-skewed, but still close to the true simulated values. We also investigate the coverage obtained from two different types of confidence intervals, based on: (1) the bootstrap, and (2) the observed Fisher information. We observe that coverage based on the observed Fisher information is lower than the nominal 95% level. On the other hand, coverage based on the bootstrap is close to the nominal rate.

Figure 2: Point estimates for the 100 simulated Poisson data sets in a continuous spatial domain. Red lines indicate true value. Point estimates are distributed around the true values.
Methods | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\sigma}^2$ | $\hat{\phi}$
Fisher | 0.81 | 0.83 | 0.69 | NA
Bootstrap | 0.93 | 0.88 | 0.95 | 0.97
Table 3: Coverage for simulated Poisson data sets in a continuous spatial domain.

5.2 Binary Data in a Continuous Domain

We simulate a binary data set at $n = 1{,}400$ locations in the unit domain with fixed model parameters. Similar to the Poisson data simulation, we generate the random effects using a Matérn class covariance function with smoothness parameter 2.5. Given the simulated random effects, binary observations are simulated through a logit link function, $\operatorname{logit}(p_i) = X_i \beta + W_i$, where $X$ is the coordinate matrix for the locations. We fit the model using the first 1,000 observations, and 400 points are used for prediction.

We use initial values obtained from GLM estimates. We choose a reduced rank $m$ and use the stopping rules described in Section 4.3. We observe similar spatial patterns between the simulated and predicted random effects (Figure 3). We also study our methods for different choices of $m$. In Table 4, we observe that the Fisher information-based confidence intervals are wider than those in the Poisson cases. As in the Poisson data examples, the bootstrap-based confidence intervals are wider than the observed Fisher information-based confidence intervals.

Figure 3: The left panel shows the simulated random effects at the observation (top) and prediction locations (bottom). The right panel shows the estimated (top) and predicted (bottom) random effects. Spatial patterns are similar between simulated and predicted random effects.
m | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\sigma}^2$ | $\hat{\phi}$ | Time (min)
25 | 0.869 | 1.405 | 2.274 | 0.292 | 6.919
Fisher 95% CI | (0.520, 1.219) | (0.994, 1.815) | (1.686, 2.862) | NA
Bootstrap 95% CI | (-0.054, 1.403) | (0.807, 2.019) | (0.689, 32311657) | (0.117, 2.959)
50 | 0.805 | 1.381 | 1.181 | 0.208 | 11.482
Fisher 95% CI | (0.403, 1.207) | (0.979, 1.783) | (0.646, 1.717) | NA
Bootstrap 95% CI | (0.227, 1.363) | (0.851, 2.018) | (0.444, 13727.69) | (0.0989, 1.128)
75 | 0.824 | 1.424 | 1.461 | 0.217 | 21.913
Fisher 95% CI | (0.387, 1.259) | (0.995, 1.852) | (0.741, 2.181) | NA
Bootstrap 95% CI | (0.131, 1.505) | (0.832, 1.929) | (0.389, 192147.4) | (0.082, 1.101)
100 | 0.796 | 1.364 | 2.909 | 0.374 | 30.614
Fisher 95% CI | (0.413, 1.180) | (0.973, 1.756) | (2.017, 3.802) | NA
Bootstrap 95% CI | (-0.003, 1.422) | (0.714, 1.905) | (0.481, 559588) | (0.138, 7.918)
Table 4: Continuous domain simulated binary data; parameter estimates for different choices of $m$ are close to the true values. We cannot obtain standard errors of $\hat{\phi}$ via the Fisher information because the required derivatives are analytically intractable.

We repeat the simulation 100 times to investigate the distribution of the point estimates. The point estimates are distributed around the true simulated values (Figure 4). Coverage obtained from the observed Fisher information is higher than in the Poisson cases, but still lower than the nominal 95% rate. Coverage based on the bootstrap is close to the nominal rate.

Figure 4: Point estimates for the 100 simulated binary data sets in a continuous spatial domain. Red lines indicate true value. Point estimates are distributed around the true values.
Methods | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\sigma}^2$ | $\hat{\phi}$
Fisher | 0.87 | 0.90 | 0.72 | NA
Bootstrap | 0.97 | 0.96 | 0.97 | 0.93
Table 5: Coverage for simulated binary data sets in a continuous spatial domain.

5.3 Count Data on a Lattice

Our methods are also applicable to the discrete spatial domain setting. We follow the simulation settings in Hughes and Haran (2013). We create a Poisson data set on a lattice over the unit domain with fixed model parameters. By taking the first 400 eigenvectors of the Moran operator, we construct the matrix $\Phi$. Then we simulate the random effects from $\delta \sim N(0, (\tau\, \Phi^T Q \Phi)^{-1})$. Given the generated random effects, count observations are simulated from a Poisson distribution with rates $\lambda_i = \exp(X_i \beta + \Phi_i \delta)$, where $X$ is the coordinate matrix for the locations.

As in the previous examples, we use initial values obtained from GLM estimates. We choose a reduced rank $m$ and use the stopping rules described in Section 4.3. We study our methods for different choices of $m$. In Table 6, we observe that all of the estimated values are close to the true simulated values. Both types of confidence intervals (Fisher information, bootstrap) are tighter than those in the continuous domain cases.

m | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\tau}$ | Time (min)
25 | 1.069 | 0.963 | 6.263 | 9.558
Fisher 95% CI | (0.968, 1.171) | (0.861, 1.065) | (1.548, 10.979)
Bootstrap 95% CI | (1.010, 1.153) | (0.906, 1.058) | (2.367, 9.775)
50 | 1.053 | 0.970 | 6.712 | 15.988
Fisher 95% CI | (0.951, 1.154) | (0.867, 1.072) | (2.680, 10.744)
Bootstrap 95% CI | (0.986, 1.137) | (0.920, 1.049) | (3.117, 9.284)
75 | 1.049 | 0.964 | 6.94 | 23.684
Fisher 95% CI | (0.947, 1.152) | (0.861, 1.067) | (3.055, 10.819)
Bootstrap 95% CI | (0.986, 1.144) | (0.913, 1.044) | (3.388, 8.435)
100 | 1.046 | 0.958 | 5.731 | 28.411
Fisher 95% CI | (0.944, 1.148) | (0.855, 1.061) | (2.780, 8.682)
Bootstrap 95% CI | (0.992, 1.130) | (0.907, 1.032) | (2.916, 9.304)
Table 6: Discrete domain simulated Poisson data; parameter estimates for different choices of $m$ are close to the true values.

We conduct the simulation 100 times; the point estimates for $\beta$ are distributed around the true simulated values. However, the point estimates for $\tau$ are slightly biased (Figure 5). In Table 7, we observe that the coverage obtained from the observed Fisher information and the coverage based on the bootstrap are both close to the nominal rate (95%) for the regression parameters. This is because, unlike in the continuous domain cases, we can simultaneously update $\beta$ and $\tau$ based on the derivatives with respect to each parameter. However, the coverage for $\tau$ is lower than the nominal rate, due to the bias in the point estimates. In general it is challenging to estimate $\tau$ correctly (Hughes and Haran, 2013).

Figure 5: Point estimates for the 100 simulated Poisson data sets in a lattice spatial domain. Red lines indicate true value. Point estimates are distributed around the true values.
Methods | $\hat{\beta}_1$ | $\hat{\beta}_2$ | $\hat{\tau}$
Fisher | 0.93 | 0.96 | 0.63
Bootstrap | 0.95 | 0.93 | 0.49
Table 7: Coverage for simulated Poisson data sets in a discrete spatial domain.

6 Real Data Examples

We illustrate the application of our approach to two real data examples: (1) binary mistletoe data over the continuous domain, and (2) county-level US infant mortality data. For both large non-Gaussian spatial data sets, a projection-based MCML algorithm can conduct statistical inference as well as provide prediction for unobserved locations within a reasonable amount of time.

6.1 Mistletoe Data

We apply our method to a mistletoe data set from the Minnesota Department of Natural Resources forest inventory. The data set contains forest stand information from black spruce stands in Minnesota forests (the binary response indicates whether or not dwarf mistletoe was found in the stand). Eastern spruce dwarf mistletoe (Arceuthobium pusillum) is a parasitic plant which causes the most serious disease of black spruce (Picea mariana) throughout its range (Baker et al., 2006). These data are analyzed in Hanks et al. (2011) under a GLM framework. We use 4,000 locations for training and 1,000 locations for model prediction. In addition to the coordinates (in the Universal Transverse Mercator coordinate system), we use the average age of trees in the stand, the basal area per acre, the average height of trees, and the total volume of trees in the stand as additional covariates. As in the simulated examples, we use initial values for the model parameters from GLM estimates. We use an initial value of $\phi$ given by the first quartile of the Euclidean distance matrix of the coordinates. For the given initial value of $\phi$, we conduct an eigendecomposition of the correlation matrix $R(\phi)$. We chose a reduced rank that explains 99% of the variation, and we use the stopping rules described in Section 4.3.

Our fast MCML approach takes about 2 hours. Because of the expensive MCMC simulations needed to approximate the log-likelihood function, regular MCML (Christensen, 2004) is impractical. The projection-based MCEM approach in Guan and Haran (2019) may have comparable computational costs to our approach. However, as mentioned before, MCML offers an alternative that has rigorous theoretical justifications. Furthermore, our MCML approach may be more easily applicable to, and computationally tractable for, a broader class of models, for instance image processing problems (Geman and Graffigne, 1986, Tanaka, 2002, Murphy, 2012) or hidden Markov models for disease mapping (Green and Richardson, 2002), where the distributions of the latent variables are intractable, for example when an Ising or Potts model is used to model the latent variables. Because our goal is maximum likelihood estimation, we do not compare our results to the plethora of fast algorithms for Bayesian inference, though we note that Bayesian approaches based on similar projection methods are much more computationally expensive; for instance, the algorithm in Guan and Haran (2018) would take several days to run. Inference results are summarized in Table 8. Confidence intervals are obtained from the observed Fisher information. Figure 6 indicates similar binary spatial patterns between the observed and predicted locations.

Parameter Estimate 95% CI
X-coordinates -2.242e-05 (-2.489e-05, -1.995e-05)
Y-coordinates 2.869e-06 (-1.072e-06, 6.811e-06)
Age 0.005 (0.002, 0.007)
Basal area -0.002 (-0.005, 0.001)
Height 0.024 (0.018, 0.030)
Volume -0.002 (-0.004, 0.001)
$\hat{\sigma}^2$ 4.772 (2.893, 6.651)
$\hat{\phi}$ 35847.78 NA
Table 8: Inference results for a mistletoe data.
Figure 6: The left panel shows the mistletoe presence at the observation (top) and prediction locations (bottom). The right panel shows the estimated (top) and predicted (bottom) mistletoe presence. Red and blue points indicate presence and absence of mistletoe, respectively. Spatial patterns are similar between the observed and predicted locations.

6.2 US Infant Mortality Data

We apply our projection-based MCML algorithm to an infant mortality data set observed in 3,071 US counties. The data set is obtained from the 2008 Area Resource File, which is maintained by the Bureau of Health Professions, Health Resources and Services Administration, US Department of Health and Human Services. This data set is analyzed in Hughes and Haran (2013) under a Bayesian framework. As in Hughes and Haran (2013), we use the three-year (2002-2004) average number of infant deaths before the first birthday as the response variable. We use the rate of low birth weight, the percentage of black residents, the percentage of Hispanic residents, the Gini coefficient (which measures income inequality), a composite score of social affluence, and residential stability as covariates (see Hughes and Haran (2013) for details). We then use the average number of live births as an offset. As in the previous examples, we set the initial values for the model parameters from GLM estimates. We chose a reduced rank as in Hughes and Haran (2013), and use the same stopping rules as in Section 4.3.

Parameter Estimate 95% CI
Intercept -5.423 (-5.605, -5.240)
Low birth weight 8.802 (7.565, 10.038)
Black 0.004 (0.003, 0.006)
Hispanic -0.004 (-0.005, -0.003)
Gini -0.575 (-1.000, -0.151)
Affluence -0.077 (-0.089, -0.065)
Stability -0.029 (-0.044, -0.015)
$\hat{\tau}$ 7.815 (2.152, 13.479)
Table 9: Inference results for the US infant mortality data.

Our approach takes about 1.5 hours. As in the binary mistletoe example, regular MCML (Christensen, 2004) is infeasible. In the continuous domain there is a significant computational advantage of the MCML approach over the corresponding projection-based Bayesian approach via MCMC. In the discrete domain, however, the computational benefits of MCML are not as dramatic. The Bayesian projection-based approach (Hughes and Haran, 2013) takes about 3.5 hours for this example. However, a direct computational comparison is difficult because the Bayesian approach is implemented in C++ via the Rcpp and RcppArmadillo packages (Eddelbuettel et al., 2011). Table 9 summarizes the inference results for the US infant mortality data set. We construct confidence intervals using the observed Fisher information. The results in Table 9 are comparable to those in Hughes and Haran (2013).

7 Discussion

In this manuscript, we have proposed a fast Monte Carlo maximum likelihood approach for latent Gaussian random field models. Our methods are based on recently developed projection approaches (Hughes and Haran, 2013, Guan and Haran, 2018). By reducing the dimension of the random effects, we can avoid expensive computations involving large covariance matrices and improve the mixing of the MCMC algorithms used to sample the random effects, leading to a fast Monte Carlo approximation to the likelihood function. Our method is applicable to both the continuous and discrete domain. Our study shows that our approach obtains computational gains over existing methods and provides accurate point estimates. Our methods also provide accurate predictions in many applications (Figure 1, Figure 3, Figure 6).

We have studied two sources of error for MCML estimates: sampling error and Monte Carlo error. Combining these two sources of error when the importance function depends on the data is challenging (Knudson, 2016). Asymptotic confidence intervals can be constructed from the sampling error via the observed Fisher information. We observe that in the continuous domain cases, the coverage obtained from the observed Fisher information is lower than the nominal rate. These results are consistent with the well-known fact that, under fixed-domain asymptotics, not all parameters in the Matérn class can be estimated consistently, and hence they cannot be asymptotically normal (Zhang, 2004). Considering that interpolation is the ultimate goal in many geostatistical problems, constructing accurate confidence intervals is often of limited interest in practice. However, for researchers who are interested in constructing confidence intervals, we recommend the bootstrap-based intervals, which have coverage close to the nominal rate. We point out that the bootstrap-based approach is still computationally demanding, though we can take advantage of high-performance computing resources. To our knowledge, no existing approach provides practical ways to address this coverage issue in the context of SGLMMs. Developing practical ways to efficiently calculate confidence intervals for SGLMMs is an interesting avenue for future research.

The computational methods developed in this manuscript allow researchers to perform maximum likelihood estimation to fit SGLMMs with much larger data sets than was previously possible, and we hope that they will permit researchers to fit such models more routinely. The algorithms we discuss here apply to a large and widely applicable class of spatial models, and can be much faster than corresponding Bayesian approaches to fitting the same models. Bayesian approaches are of course enormously flexible, especially for handling missing data, incorporating prior information, and integrating multiple data sets; there is an extensive literature on fast computational methods for fitting models in the Bayesian context (cf. Lindgren et al., 2011, Banerjee et al., 2008, Hughes and Haran, 2013, Guan and Haran, 2018, Rue et al., 2009). The research on efficient MCMLE approaches for maximum likelihood estimation for computationally challenging problems is limited; we hope that the ideas and methods we have discussed here may be broadly relevant for efficient maximum likelihood inference for high-dimensional latent variable models. Examples include a Gaussian mixture model for galaxy data (Johansen et al., 2008), stochastic volatility models in finance (Jacquier et al., 2007, Duan et al., 2017), and state space models in economics (Duan and Fulop, 2015).

Acknowledgement

MH and JP were partially supported by the National Science Foundation through NSF-DMS-1418090. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM123007. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors are grateful to Yawen Guan and Christina Knudson for providing useful sample code and advice.

Supplementary Material for Reduced-dimensional Monte Carlo Maximum Likelihood for Spatial Generalized Linear Mixed Models


Jaewoo Park and Murali Haran

Appendix A Monte Carlo Likelihood Approximation Calculations and Derivatives

Consider the logarithm of the Monte Carlo approximation to the likelihood,

$\hat{\ell}_K(\Theta) = \log \frac{1}{K} \sum_{k=1}^{K} h_k(\Theta), \qquad h_k(\Theta) = \frac{f(y \mid \beta, \delta^{(k)})\, f(\delta^{(k)} \mid \theta)}{f_{\tilde{\Theta}}(\delta^{(k)} \mid y)}. \qquad (17)$

The derivatives of the individual terms can be written as

$\nabla_{\Theta} h_k(\Theta) = h_k(\Theta)\, \nabla_{\Theta} \log\left\{ f(y \mid \beta, \delta^{(k)})\, f(\delta^{(k)} \mid \theta) \right\}. \qquad (18)$

From the chain rule, the gradient of the Monte Carlo log-likelihood can be derived as

$\nabla_{\Theta} \hat{\ell}_K(\Theta) = \frac{\sum_{k=1}^{K} \nabla_{\Theta} h_k(\Theta)}{\sum_{k=1}^{K} h_k(\Theta)}. \qquad (19)$

By using the product rule, the Hessian of the MCML objective is as follows:

$\nabla^2_{\Theta} \hat{\ell}_K(\Theta) = \frac{\sum_{k=1}^{K} \nabla^2_{\Theta} h_k(\Theta)}{\sum_{k=1}^{K} h_k(\Theta)} - \left[\nabla_{\Theta} \hat{\ell}_K(\Theta)\right]\left[\nabla_{\Theta} \hat{\ell}_K(\Theta)\right]^T. \qquad (20)$

Completing the derivative for each term gives