## 1 Introduction

Clinical data on individuals are often collected from different geographical regions and then aggregated and analyzed in public health studies. Analyses of such data conducted at a higher level often assume that covariate effects are constant over the entire spatial domain. This is a rather strong assumption, as all intrinsic heterogeneities in the data are ignored. For example, if one were to study the hazard for patients with lung cancer, it is expected that the true hazard is not the same in areas with little air pollution and in areas with severe air pollution, even for patients with similar characteristics. From Tobler's first law of geography (Tobler, 1970), "Everything is related to everything else, but near things are more related than distant things," it is reasonable to consider similarities between nearby locations in survival data due to shared environmental circumstances in geographically close regions.

Existing approaches that account for such patterns in survival data can be put into two major categories. The first is to incorporate spatial random effects in survival models, such that spatial variations are accounted for by different intercepts for different regions, while the parameters for covariates are held constant. Banerjee and Dey (2005) used a frailty model to study infant mortality with two specifications for the spatial random effects: the areal conditionally autoregressive (CAR) prior and a geostatistical spatial process. Zhou et al. (2008) proposed a joint model that uses the CAR prior to account for spatial variations in modeling both the age at diagnosis and the subsequent conditional survival time. The CAR prior has also been used in the accelerated failure time (AFT) model to analyze prostate cancer survival in Louisiana (Zhang and Lawson, 2011). In addition to normal spatial random effects, or log-normal frailties, gamma frailties have been used with the proportional hazards model to investigate spatial leukemia survival (Henderson et al., 2012). The second approach, instead of assuming all covariate effects are constant, allows the parameters to be spatially varying. Hu et al. (2020) proposed an AFT model that allows the parameters, instead of spatial frailties, to follow certain spatial prior distributions. Similar to geographically weighted regression, where observations are assigned weights depending on their distance to the target location of interest, Hu and Huffer (2019) weighted observations with an exponential kernel and proposed modified Nelson–Aalen and Kaplan–Meier estimators for spatial survival functions. Moving to the study of covariate effects, Xue et al. (2019) used a similar weighting scheme and proposed a modified partial likelihood to estimate varying coefficients in a spatial proportional hazards model.

Despite their flexibility, the aforementioned spatially varying coefficient models can be unnecessarily large. Imposing constraints so that nearby regions share the same parameter values provides an efficient way of reducing the model size without sacrificing too much flexibility. While similar endeavors have been made to cluster spatial survival responses (Huang et al., 2007; Bhatt and Tiwari, 2014), the clustering of covariate effects and baseline hazards has yet to be studied for survival data.

Two challenges must be tackled when clustering coefficients and baseline hazards in spatial survival models. First, the spatial structure needs to be appropriately incorporated into the clustering process. Contiguity constraints should be added so that truly similar neighbors are driven to the same cluster. The constraints, however, should not be overly emphasized, as two distant regions may still share similar geographical and demographic characteristics, and thus parameters. Existing methods, such as those in Lee et al. (2017, 2019) and Li and Sang (2019), do not allow for globally discontiguous clusters, which is a serious limitation. Second, the true number of clusters is unknown and needs to be estimated. Within the probabilistic Bayesian framework, simultaneous estimation of the number of clusters and the clustering configuration for each region is achieved by complicated search algorithms (e.g., reversible jump MCMC, Green, 1995) in variable-dimensional parameter spaces. Such algorithms assign a prior to the number of clusters that needs to be updated in every MCMC iteration, which makes them difficult to implement or automate, and they suffer from mixing issues as well as a lack of scalability. Nonparametric Bayesian approaches, such as the Chinese restaurant process (CRP; Pitman, 1995), provide another way to allow for uncertainty in the number of clusters. Its extension, the distance dependent CRP (ddCRP; Blei and Frazier, 2011), incorporates spatial information and yields a flexible class of distributions over partitions that allows for dependencies between their elements. The CRP framework, however, has been shown to be inconsistent in its estimation of the number of clusters (Miller and Harrison, 2013). Lu et al. (2018) proposed the powered CRP, which suppresses small tail clusters. Similar to the traditional CRP, however, it does not consider distance information, and is therefore not well suited when spatial homogeneity is to be detected.

To address these challenges, in this work we consider a spatial proportional hazards model and propose a geographically weighted Chinese restaurant process (gwCRP) to capture the spatial homogeneity of both the regression coefficients and the baseline hazards over subareas within the piecewise constant hazards framework (Friedman et al., 1982). Our main contributions in this paper are threefold. First, we develop a new nonparametric Bayesian method for spatial clustering which combines the ideas of geographical weights and Dirichlet mixture models to leverage geographical information. Compared with existing methods, our proposed approach is able to capture both locally spatially contiguous clusters and globally discontiguous clusters. Moreover, an efficient Markov chain Monte Carlo (MCMC) algorithm without reversible jumps is proposed to simultaneously estimate the number of clusters and the clustering configuration. Second, we introduce a model diagnostic technique, the Logarithm of the Pseudo-Marginal Likelihood (LPML; Gelfand and Dey, 1994), to tune the decay parameter in the gwCRP. After tuning based on LPML, the gwCRP yields a nearly consistent estimate of the number of clusters. Finally, we apply the method to the analysis of Surveillance, Epidemiology, and End Results (SEER) Program data across the counties of Louisiana, which provides important information for studying spatial survival rates.

The remainder of the paper is organized as follows. In Section 2, we develop a homogeneity pursuit of survival data in the piecewise constant proportional hazards framework with the gwCRP prior. In Section 3, a collapsed Gibbs sampler algorithm and post-MCMC inference are discussed. Extensive simulation studies are carried out in Section 4. For illustration, our proposed methodology is applied to respiratory cancer survival data in Section 5. Finally, we conclude the paper with a brief discussion in Section 6. For ease of exposition, additional technical results are given in the supplementary material.

## 2 Methodology

### 2.1 Spatial Piecewise Constant Hazards Models

Let $t_{ij}$ denote the survival time for patient $j$ at location $s_i$, and let $\delta_{ij}$ be the event indicator, with $\delta_{ij} = 1$ representing the event and $\delta_{ij} = 0$ indicating censoring. Let $x_{ij}$ denote the $p$-dimensional vector of covariates corresponding to $t_{ij}$, for $j = 1, \ldots, n_i$ and $i = 1, \ldots, n$, with $n_i$ denoting the number of patients at $s_i$. All observed data are denoted by $D = \{(t_{ij}, \delta_{ij}, x_{ij}) : j = 1, \ldots, n_i,\; i = 1, \ldots, n\}$. The semi-parametric proportional hazards model of Cox (1972) has long been used for time-to-event data. Under the Cox model, without the need to estimate the baseline hazard function, the covariate effects are estimated by maximizing the logarithm of the partial likelihood

$$\ell(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \delta_{ij} \left[ x_{ij}^{\top} \beta - \log \sum_{(i', j') \in \mathcal{R}(t_{ij})} \exp(x_{i'j'}^{\top} \beta) \right], \qquad (1)$$

where $\mathcal{R}(t)$ is the set of indices for subjects at risk at time $t$, and $\beta$ is a $p$-dimensional vector of regression coefficients. It can be seen from (1) that only the vector of regression coefficients is of interest, while the nonparametric baseline hazard function cancels out. However, if one is interested not only in covariate effects but also in prediction, the baseline hazard is needed. To circumvent this problem, we consider a proportional hazards model with a piecewise constant baseline hazard function. Assume we partition the time axis into $K$ intervals with cut points $0 = a_0 < a_1 < \cdots < a_K = \infty$; the hazard function is then given by

$$h(t \mid x_{ij}) = \lambda_k \exp(x_{ij}^{\top} \beta), \quad t \in (a_{k-1}, a_k], \qquad (2)$$

with piecewise constant baseline hazard function $\lambda_0(t) = \lambda_k$ for $t \in (a_{k-1}, a_k]$, $k = 1, \ldots, K$.
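The piecewise specification in (2), together with the survival function it induces, can be sketched in code; the helper names below are illustrative, not from the paper.

```python
import math

def cum_baseline(t, cuts, lambdas):
    """Cumulative baseline hazard Lambda0(t), with interior cut points `cuts`
    (e.g. [1.5, 6.0]) and one constant rate per piece in `lambdas`."""
    edges = [0.0] + list(cuts) + [math.inf]
    total = 0.0
    for k in range(len(lambdas)):
        lo, hi = edges[k], edges[k + 1]
        if t <= lo:
            break
        total += lambdas[k] * (min(t, hi) - lo)
    return total

def hazard(t, x, beta, cuts, lambdas):
    """h(t | x) = lambda_k * exp(x' beta) for t in piece k."""
    edges = [0.0] + list(cuts) + [math.inf]
    k = next(i for i in range(len(lambdas)) if edges[i] <= t < edges[i + 1])
    lin = sum(b * xi for b, xi in zip(beta, x))
    return lambdas[k] * math.exp(lin)

def survival(t, x, beta, cuts, lambdas):
    """S(t | x) = exp(-Lambda0(t) * exp(x' beta))."""
    lin = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(-cum_baseline(t, cuts, lambdas) * math.exp(lin))
```

For instance, with cut points 1.5 and 6 and rates (0.1, 0.2, 0.3), the cumulative baseline hazard at $t = 2$ is $0.1 \times 1.5 + 0.2 \times 0.5 = 0.25$.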

For the piecewise constant hazard function in (2), the baseline hazards and regression coefficients are constant over different regions. Due to environmental factors, however, spatially varying patterns in the baseline hazards and regression coefficients of the hazard function need to be considered. The piecewise constant hazard function with a spatially varying pattern is therefore given by

$$h(t \mid x_{ij}, s_i) = \lambda_{ik} \exp(x_{ij}^{\top} \beta_i), \quad t \in (a_{k-1}, a_k], \qquad (3)$$

where $\lambda_{ik} > 0$ for $k = 1, \ldots, K$. Under this model, $\lambda_i = (\lambda_{i1}, \ldots, \lambda_{iK})^{\top}$ and $\beta_i$ represent the location-specific baseline hazards and regression coefficients.

Based on the location-specific survival model, the likelihood function for the observed survival data is obtained as

$$L(\{\beta_i, \lambda_i\}_{i=1}^{n} \mid D) = \prod_{i=1}^{n} \prod_{j=1}^{n_i} h(t_{ij} \mid x_{ij}, s_i)^{\delta_{ij}}\, S(t_{ij} \mid x_{ij}, s_i), \qquad (4)$$

where

$$S(t_{ij} \mid x_{ij}, s_i) = \exp\left\{ -\exp(x_{ij}^{\top} \beta_i) \sum_{k=1}^{K} \lambda_{ik} \Delta_{ijk} \right\}, \qquad (5)$$

with $\Delta_{ijk}$ the length of the overlap of $(0, t_{ij}]$ with $(a_{k-1}, a_k]$, is the survival function for patient $j$ at location $s_i$.

After some algebra, the logarithm of the likelihood function for the observed survival data is obtained as

$$\ell = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \sum_{k=1}^{K} \left[ \delta_{ijk} \left( \log \lambda_{ik} + x_{ij}^{\top} \beta_i \right) - \lambda_{ik} \Delta_{ijk} \exp(x_{ij}^{\top} \beta_i) \right], \qquad (6)$$

where $\delta_{ijk} = \delta_{ij}\, \mathbb{1}\{t_{ij} \in (a_{k-1}, a_k]\}$, and $\Delta_{ijk}$ is the length of the overlap of $(0, t_{ij}]$ with $(a_{k-1}, a_k]$.

For a particular location $s_i$, let $\theta_i$ denote the collection of parameters $(\lambda_i, \beta_i)$. The maximum likelihood estimate (MLE) $\hat{\theta}_i$ can be obtained by solving the score equation, i.e., setting to zero the derivative of the log-likelihood in (6), and the estimated variance-covariance matrix of the MLE is $\hat{\Sigma}_i = \hat{H}_i^{-1}$, where $\hat{H}_i$ denotes the negative Hessian matrix evaluated at $\hat{\theta}_i$. Based on the MLEs and estimated variance-covariance matrices, we have the following approximation of the likelihood.

###### Proposition 1.

As $\min_i n_i \to \infty$, the data likelihood is approximated as

$$L(\theta_1, \ldots, \theta_n \mid D) \approx \prod_{i=1}^{n} \mathrm{MVN}(\hat{\theta}_i \mid \theta_i, \hat{\Sigma}_i), \qquad (7)$$

where $\mathrm{MVN}(\cdot \mid \mu, \Sigma)$ stands for the multivariate normal density with mean $\mu$ and covariance $\Sigma$.

For ease of exposition, the derivations of $\hat{H}_i$ and the proof of Proposition 1 are given in Appendix A and Appendix B, respectively.
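As a minimal illustration of the normal approximation in Proposition 1, consider an exponential hazard with no covariates (a single baseline piece, $K = 1$, $p = 0$). Reparameterizing as $\eta = \log \lambda$, the MLE is $\hat{\lambda} = d/T$ (events over total time at risk) and the inverse negative Hessian of the log-likelihood at the MLE is $1/d$. A sketch under these simplifying assumptions:

```python
import math, random

def laplace_approx_exponential(times, events):
    """Normal (Laplace) approximation for eta = log(lambda) under an
    exponential hazard with right censoring: the MLE and the inverse
    negative Hessian play the roles of the MVN mean and variance in (7)."""
    d = sum(events)               # number of observed events
    T = sum(times)                # total time at risk
    eta_hat = math.log(d / T)     # MLE: lambda_hat = d / T
    var_hat = 1.0 / d             # -l''(eta_hat) = d, so variance = 1/d
    return eta_hat, var_hat

# simulate data with hazard 0.2 and independent exponential censoring
random.seed(1)
obs, ev = [], []
for _ in range(5000):
    t = random.expovariate(0.2)
    c = random.expovariate(0.05)
    obs.append(min(t, c))
    ev.append(1 if t <= c else 0)

eta_hat, var_hat = laplace_approx_exponential(obs, ev)
```

With this sample size, $\exp(\hat{\eta})$ recovers the true hazard 0.2 closely, and the approximate posterior variance $1/d$ is small, as Proposition 1 requires for large $n_i$.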

Based on the normal approximation given in Proposition 1, a natural way to model a spatially varying pattern of baseline hazards and regression coefficients, following Gelfand et al. (2003), is to place a Gaussian process prior on $\theta = (\theta_1^{\top}, \ldots, \theta_n^{\top})^{\top}$. The Gaussian process for $\theta$ is defined as

$$\theta \sim \mathrm{MVN}\big(1_n \otimes \mu_0,\; H(\phi) \otimes \Sigma_0\big), \qquad (8)$$

where $\mu_0$ is a $(p + K)$-dimensional vector, $H(\phi)$ is an $n \times n$ spatial correlation matrix depending on the distance matrix with parameter $\phi$, $\Sigma_0$ is a $(p + K) \times (p + K)$ covariance matrix, and $\otimes$ denotes the Kronecker product. The $(i, i')$-th entry of $H(\phi)$ is $\exp(-\phi\, d_{ii'})$, where $d_{ii'}$ is the distance between $s_i$ and $s_{i'}$, and $\phi$ is the range parameter for the spatial correlation. Under the Gaussian process prior, the parameters of closer locations have stronger correlations.

For many spatial survival datasets, some regions share the same covariate effects or baseline hazards with their nearby regions. In addition, some regions share similar parameters regardless of their geographical distances, due to similarities in demographic information such as income distribution (Ma et al., 2019; Hu et al., 2020), food environment index, and air pollution (Zhao et al., 2020). A smoothly varying spatial pattern for $\theta$ is therefore not always valid. Based on this homogeneity pattern, we focus on the clustering of spatially varying parameters. In our setting, we assume that the parameter vectors $\theta_1, \ldots, \theta_n$ can be clustered into $r$ groups taking distinct values $\tilde{\theta}_1, \ldots, \tilde{\theta}_r$.

### 2.2 Geographically Weighted Chinese Restaurant Process

A latent clustering structure can be introduced to accommodate the spatial heterogeneity in the parameters of sub-areas. Under the frequentist framework, the clustering problem could be solved in a two-stage approach: first obtain an estimate of the number of clusters, $\hat{r}$, and then find the optimal assignment among all possible clusterings of $n$ elements into $\hat{r}$ clusters. In this approach, however, the performance of the estimated cluster assignments relies heavily on the estimated number of clusters; uncertainty from the first stage is ignored, and redundant cluster assignments may result. Bayesian nonparametric methods are a natural remedy that simultaneously estimates the number of clusters and the cluster assignments. The Chinese restaurant process (CRP; Pitman, 1995; Neal, 2000) allows uncertainty in the number of clusters by placing a prior distribution on the partition. In the CRP, the cluster labels $z_1, \ldots, z_n$ are defined through the following conditional distribution (also called a Pólya urn scheme, Blackwell et al., 1973):

$$P(z_i = c \mid z_{-i}) \propto \begin{cases} n_c, & \text{for an existing cluster } c, \\ \alpha, & \text{for a new cluster}. \end{cases} \qquad (9)$$

Here $n_c$ refers to the size of the cluster labeled $c$, and $\alpha$ is the concentration parameter of the underlying Dirichlet process. Under the Pólya urn scheme in (9), a customer has no preference for sitting with any particular other customers. For spatial survival data, however, nearby regions share similar environmental exposures, such as PM2.5 and water quality, which lead nearby sub-regions to share similar parameters. To account for such similarity induced by geographical proximity, we modify the traditional CRP to a geographically weighted CRP (gwCRP), under which a customer has a higher probability of sitting with customers who are geographically nearby. The conditional distribution of $\theta_i$ given $\theta_{-i}$ is based on the following definition.

###### Definition 1.

If $G_0$ is a continuous distribution and $\alpha > 0$, the distribution of $\theta_i$ given $\theta_{-i}$ is proportional to

$$\sum_{c=1}^{k^{(-i)}} \Big( \sum_{j \neq i:\, z_j = c} w_{ij} \Big)\, \delta_{\tilde{\theta}_c}(\theta_i) + \alpha\, G_0(\theta_i), \qquad (10)$$

where $k^{(-i)}$ denotes the number of clusters excluding the $i$-th observation, and $\tilde{\theta}_1, \ldots, \tilde{\theta}_{k^{(-i)}}$ are the distinct values of $\theta_{-i}$.

Based on Definition 1, the conditional distribution in (10) yields a Pólya urn scheme for the gwCRP analogous to that of the CRP.

###### Proposition 2.

A Pólya urn scheme of the gwCRP is defined as

$$P(z_i = c \mid z_{-i}) \propto \begin{cases} \sum_{j \neq i:\, z_j = c} w_{ij}, & \text{for an existing cluster } c, \\ \alpha, & \text{for a new cluster}, \end{cases} \qquad (11)$$

where $w_{ij}$ is the geographical weight between areas $i$ and $j$.

For the geographical weights, we adapt the Stochastic Neighborhood Conditional Autoregressive (SNCAR) model (White and Ghosh, 2009) for areal data. The SNCAR model is an extension of the ordinary conditional autoregressive model (Banerjee et al., 2014). Unlike the general adjacency matrix, whose diagonal elements are all 0 and whose off-diagonal element $a_{ij} = 1$ if areas $i$ and $j$ share a common boundary, the SNCAR model allows the off-diagonal elements to depend on unknown parameters. Compared with the existing geographically weighted regression literature, our weights are based on the graph distance between areas. Following Müller et al. (1987), Bhattacharyya and Bickel (2014), and Xue et al. (2019), we denote a graph as $G = (V, E)$, with vertex set $V$ and edge set $E$. The graph distance between two vertices $i$ and $j$ is defined as

$$d_{ij} = \min_{P:\, i \leftrightarrow j} |E(P)|, \qquad (12)$$

where $|E(P)|$ represents the number of edges in a path $P$ connecting $i$ and $j$. In this way, we can calculate the graph distance among the counties in the data set. Based on the graph distance calculated by (12), we calculate the geographical weights by

$$w_{ij} = \exp\{-h\,(d_{ij} - 1)\}, \qquad (13)$$

where $d_{ij}$ is the graph distance between areas $i$ and $j$, and $h \geq 0$ is a decay parameter. The weighting function in (13) gives the largest weight ($w_{ij} = 1$) to areas sharing a common boundary, which follows the first law of geography (Tobler, 1970). For simplicity, we refer to the gwCRP introduced above as $\mathrm{gwCRP}(\alpha, h)$.
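The graph distance in (12) and the weighting in (13) can be sketched as follows; the exact functional form of `gw_weight` is reconstructed here from the stated properties of the weights (weight 1 for adjacent areas, CRP limit as $h \to 0$), so treat it as illustrative.

```python
import math
from collections import deque

def graph_distances(adj, source):
    """Breadth-first search: graph distance = number of edges on the
    shortest path from `source` to every reachable vertex."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def gw_weight(d, h):
    """Assumed weight exp(-h * (d - 1)): adjacent areas (d = 1) get weight 1,
    and all weights tend to 1 as h -> 0 (CRP limit)."""
    return math.exp(-h * (d - 1))

# toy county adjacency structure: a chain of four areas
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = graph_distances(adj, 0)
```

On the chain above, area 3 is at graph distance 3 from area 0, and its weight decays accordingly as $h$ grows.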

###### Remark 1.

Based on the Pólya urn scheme defined in (11) and the geographical weighting scheme defined in (13), we find that (i) when $h = 0$, the gwCRP reduces to the traditional CRP, which leads to an over-clustering problem in estimating the number of clusters; and (ii) when $h \to \infty$, a new customer can only choose a table representing spatially contiguous regions, which leads to the same over-clustering problem as the CRP.
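The seating rule in (11) can be sketched as follows; `gwcrp_probs` is a hypothetical helper that computes the normalized seating probabilities for one customer, given the current labels and a weight matrix.

```python
def gwcrp_probs(labels, i, W, alpha):
    """Seating probabilities for customer i under (11): an existing cluster c
    gets the sum of weights W[i][j] over its current members j, and a brand
    new cluster gets alpha; probabilities are then normalized."""
    probs = {}
    for j, c in enumerate(labels):
        if j == i or c is None:   # skip self and unseated customers
            continue
        probs[c] = probs.get(c, 0.0) + W[i][j]
    new_label = (max(probs) + 1) if probs else 0
    probs[new_label] = alpha
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

labels = [0, 0, 1, None]          # customer 3 is being (re)seated
W = [[0, 1, 0.5, 1],
     [1, 0, 0.5, 0.5],
     [0.5, 0.5, 0, 1],
     [1, 0.5, 1, 0]]
p = gwcrp_probs(labels, 3, W, alpha=1.0)
```

Here cluster 0 receives mass $1 + 0.5 = 1.5$, cluster 1 receives $1$, and a new cluster receives $\alpha = 1$, so the probabilities are $1.5/3.5$, $1/3.5$, and $1/3.5$. With all weights equal to 1 (the $h = 0$ limit), the rule collapses to the standard CRP urn in (9).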

### 2.3 gwCRP for Piecewise Constant Hazards Models

Adapting the gwCRP to the piecewise constant hazards models, our model and prior can be expressed hierarchically as

$$\begin{aligned} \hat{\theta}_i \mid \theta_i &\sim \mathrm{MVN}(\theta_i, \hat{\Sigma}_i), \quad i = 1, \ldots, n, \\ \theta_i &= \tilde{\theta}_{z_i}, \\ \tilde{\theta}_c &\overset{\text{i.i.d.}}{\sim} G_0 = \mathrm{MVN}(\mu_0, \Sigma_0), \\ (z_1, \ldots, z_n) &\sim \mathrm{gwCRP}(\alpha, h), \end{aligned} \qquad (14)$$

where each $\tilde{\theta}_c$ is a $(p + K)$-dimensional vector, and $\mu_0$ and $\Sigma_0$ are the hyperparameters of the base distribution $G_0$. We choose hyperparameter values providing noninformative priors in all the simulations and real data analysis. The concentration parameter $\alpha$ controls the probability of introducing a new cluster, as in the CRP, while different values of $h$ lead to different weighting scales across sub-regions. In the following simulations and real data analysis, we fix $\alpha$ and tune $h$ over a grid of values.

## 3 Bayesian Inference

In this section, we introduce the MCMC sampling algorithm, the post-MCMC inference method, and the Bayesian model selection criterion.

### 3.1 Bayesian Computation

Our goal is to sample from the posterior distribution of the unknown parameters: the cluster labels $z_1, \ldots, z_n$, the number of clusters $k$, and the cluster-specific parameters $\tilde{\theta}_1, \ldots, \tilde{\theta}_k$. Based on Propositions 1 and 2, the sampler is presented in Algorithm 1, which efficiently cycles through the full conditional distributions of $z_i$ for $i = 1, \ldots, n$ and of $\tilde{\theta}_c$ for $c = 1, \ldots, k$. The marginalization over the cluster-specific parameters in updating $z_i$ avoids complicated reversible jump MCMC algorithms or allocation samplers. The full conditionals of $z_i$ are given in Proposition 3.

###### Proposition 3.

The full conditional distributions of $z_i$, $i = 1, \ldots, n$, are given in Appendix C.

### 3.2 Inference of MCMC Results

Another important task for Bayesian nonparametric methods is inference from the MCMC output. The posterior mean or median of clustering configurations is not well defined, and Dahl's method (Dahl, 2006) provides a remedy for posterior inference on the clustering configuration and the estimated parameters. Inference under Dahl's method is based on the membership matrices $B^{(1)}, \ldots, B^{(M)}$ from the $M$ posterior samples. The membership matrix for the $m$-th posterior sample is defined as

$$B^{(m)} = \big( B^{(m)}(i, j) \big)_{n \times n}, \quad B^{(m)}(i, j) = \mathbb{1}\{z_i^{(m)} = z_j^{(m)}\}, \qquad (15)$$

where $B^{(m)}(i, j)$ is either 0 or 1 for all $i, j$; $B^{(m)}(i, j) = 1$ indicates that observations $i$ and $j$ are in the same cluster in the $m$-th posterior sample. After obtaining the membership matrices of the posterior samples, the empirical pairwise probability matrix is calculated as the elementwise mean of the membership matrices, i.e., $\bar{B} = \frac{1}{M} \sum_{m=1}^{M} B^{(m)}$, with entry $\bar{B}(i, j)$ representing the empirical probability that observations $i$ and $j$ are in the same cluster among the $M$ posterior samples. The iteration with the least squared distance to $\bar{B}$ is then obtained via

$$m^{*} = \operatorname*{arg\,min}_{m \in \{1, \ldots, M\}} \sum_{i=1}^{n} \sum_{j=1}^{n} \big( B^{(m)}(i, j) - \bar{B}(i, j) \big)^2, \qquad (16)$$

and the clustering configuration and estimated parameters from the $m^{*}$-th posterior sample are the posterior estimates selected by Dahl's method. The advantage of Dahl's method is that it utilizes information from all posterior clusterings and selects the "average" clustering rather than forming one from an external algorithm.
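Dahl's method as described above can be sketched directly; `dahl_select` below returns the index of the posterior draw minimizing the least-squares criterion in (16).

```python
def dahl_select(label_draws):
    """Dahl's method: pick the posterior draw whose membership matrix is
    closest, in squared Euclidean distance, to the empirical pairwise
    co-clustering probability matrix."""
    M = len(label_draws)
    n = len(label_draws[0])
    # empirical pairwise probability matrix (elementwise mean of B^(m))
    pihat = [[sum(1.0 for z in label_draws if z[i] == z[j]) / M
              for j in range(n)] for i in range(n)]
    best, best_loss = 0, float("inf")
    for m, z in enumerate(label_draws):
        loss = sum((float(z[i] == z[j]) - pihat[i][j]) ** 2
                   for i in range(n) for j in range(n))
        if loss < best_loss:
            best, best_loss = m, loss
    return best

# four toy posterior draws of labels for three observations
draws = [[0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
```

For these draws, the partition $\{0, 0, 1\}$ (appearing in the first two draws) is closest to the empirical co-clustering matrix, so index 0 is selected.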

### 3.3 Model Selection Criterion

In the gwCRP, the decay parameter $h$ for the geographical weights needs to be tuned, and we use the Logarithm of the Pseudo-Marginal Likelihood (LPML; Ibrahim et al., 2001), based on the conditional predictive ordinate (CPO; Gelfand et al., 1992; Geisser, 1993; Gelfand and Dey, 1994), to select $h$. The LPML is defined as

$$\mathrm{LPML} = \sum_{i=1}^{N} \log(\mathrm{CPO}_i), \qquad (17)$$

where $\mathrm{CPO}_i$ is the $i$-th conditional predictive ordinate, i.e., the predictive density of the $i$-th observation given the remaining data, and $N$ is the total number of subjects. Following Chen et al. (2000), a Monte Carlo estimate of the CPO within the Bayesian framework can be obtained as

$$\widehat{\mathrm{CPO}}_i = \left( \frac{1}{M} \sum_{m=1}^{M} \frac{1}{L_i(\theta^{(m)})} \right)^{-1}, \qquad (18)$$

where $M$ is the total number of Monte Carlo iterations, $\theta^{(m)}$ is the $m$-th posterior sample from Algorithm 1, and $L_i(\cdot)$ is the likelihood contribution of the $i$-th observation from (6). An estimate of the LPML can subsequently be calculated as

$$\widehat{\mathrm{LPML}} = \sum_{i=1}^{N} \log(\widehat{\mathrm{CPO}}_i). \qquad (19)$$

A model with a larger LPML value is preferred.
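The harmonic-mean CPO estimate in (18) and the LPML in (19) can be sketched as follows, taking a matrix of per-subject log-likelihoods over posterior draws as input; the log-sum-exp step is for numerical stability.

```python
import math

def lpml(loglik_matrix):
    """LPML estimate from per-subject log-likelihoods: CPO_i is the harmonic
    mean of the likelihoods L_i(theta^(m)) over draws m, estimated as
    M / sum_m 1/L_i^(m); LPML is the sum of log CPO_i over subjects."""
    M = len(loglik_matrix)        # posterior draws
    n = len(loglik_matrix[0])     # subjects
    total = 0.0
    for i in range(n):
        # log of the mean of 1/L_i^(m), computed stably via log-sum-exp
        neg = [-loglik_matrix[m][i] for m in range(M)]
        a = max(neg)
        log_mean_inv = a + math.log(sum(math.exp(v - a) for v in neg) / M)
        total += -log_mean_inv    # log CPO_i
    return total
```

As a sanity check, if every draw assigns the same log-likelihood to a subject, that subject's log CPO equals that constant.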

## 4 Simulation

In this section, we illustrate our proposed methods through simulation studies.

### 4.1 Simulation Setting and Evaluation Metrics

In this section, we present simulation studies under four different designs to illustrate the performance of our proposed gwCRP method and to compare it with the traditional CRP, in terms of both the clustering configuration and the estimation of regression coefficients and piecewise constant baseline hazards under the proportional hazards model. Survival datasets that resemble the SEER respiratory cancer data for Louisiana are generated, with a censoring rate of around 30%. We design four different geographical clustering patterns in Louisiana, shown in Figure 1. Designs I and III have three true clusters, and Designs II and IV have two true clusters. In addition, Designs II and III each have one cluster consisting of two disjoint areas since, in practice, it is still possible for two distant counties to belong to the same cluster. Design IV has two clusters, both consisting of disjoint areas.

For each design, 100 replicate datasets are generated under the proportional hazards model with piecewise constant baseline hazard. In each replicate, we generate survival data for 60 subjects in each county, including three i.i.d. regression covariates, survival times, and censoring times. We set three pieces for the baseline hazards with cut points 1.5 and 6 for all designs; across the four designs there are at most three true clusters, and the true regression coefficients and baseline hazards are chosen from three distinct sets of values. Censoring times are generated independently by taking the minimum of 150 and random values from Exp(0.01), which has expectation 100. For each replicate, we run different values of $h$, from 0 to 2 in steps of 0.2 and from 3 to 10 in steps of 1, and select the optimal $h$ via LPML. A total of 2000 MCMC iterations are run for each replicate, with the first 500 iterations discarded as burn-in.
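Survival times under a piecewise constant baseline hazard can be generated by inverse-transform sampling: draw $E \sim \mathrm{Exp}(1)$ and solve $\Lambda_0(t) \exp(x^{\top}\beta) = E$ piece by piece. A sketch, using illustrative parameter values rather than the true values of the designs:

```python
import math, random

def sample_survival_time(x, beta, cuts, lambdas, rng):
    """Inverse-transform sampling for a piecewise constant baseline hazard:
    solve Lambda0(t) = E * exp(-x' beta) by walking through the pieces."""
    target = rng.expovariate(1.0) * math.exp(-sum(b * xi for b, xi in zip(beta, x)))
    edges = [0.0] + list(cuts) + [math.inf]
    acc = 0.0
    for k, lam in enumerate(lambdas):
        width = edges[k + 1] - edges[k]
        if acc + lam * width >= target:   # the target falls in piece k
            return edges[k] + (target - acc) / lam
        acc += lam * width
    return edges[-2]  # unreachable when the last interval is unbounded

rng = random.Random(7)
# constant hazard 0.5 across all three pieces => Exp(0.5), mean 2
times = [sample_survival_time([0.0], [0.0], [1.5, 6.0], [0.5, 0.5, 0.5], rng)
         for _ in range(20000)]
```

In a full replicate, observed times would then be the minimum of these survival times and the censoring times described above.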

To compare the clustering performance of the gwCRP under different values of $h$, both the estimated number of clusters and the matchability of clustering configurations are reported. Since the cluster-specific parameters are marginalized out in our collapsed Gibbs sampler, we do not obtain the posterior distribution of the number of clusters directly; it can, however, be estimated by Dahl's method introduced in Section 3.2.

For the matchability of clustering configurations, we use the Rand Index (Rand, 1971) to measure the agreement between the estimated and the true clustering configurations. The Rand Index of two partitions $\rho_1$ and $\rho_2$ of a set of $n$ objects is defined as

$$\mathrm{RI}(\rho_1, \rho_2) = \frac{a + b}{\binom{n}{2}},$$

where $a$ denotes the number of pairs of objects that are in the same cluster in $\rho_1$ and in the same cluster in $\rho_2$, and $b$ denotes the number of pairs of objects that are in different clusters in $\rho_1$ and in different clusters in $\rho_2$. The Rand Index lies between 0 and 1, with a higher value indicating better agreement between the two partitions; in particular, when the two partitions agree perfectly, the Rand Index is 1. In our simulations, the Rand Index is computed using the R package fossil (Vavrek, 2011).
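A minimal sketch of the Rand Index computation described above: a pair of objects "agrees" when both partitions make the same together-or-apart decision for it.

```python
from itertools import combinations

def rand_index(z1, z2):
    """Rand Index: fraction of object pairs on which the two labelings agree
    (both place the pair in the same cluster, or both place it in different
    clusters); equals (a + b) / C(n, 2)."""
    pairs = list(combinations(range(len(z1)), 2))
    agree = sum(1 for i, j in pairs
                if (z1[i] == z1[j]) == (z2[i] == z2[j]))
    return agree / len(pairs)
```

Note the index is invariant to label permutations: `[0, 0, 1]` and `[1, 1, 0]` give a Rand Index of 1.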

In addition to clustering performance, we further evaluate the estimation of the covariate coefficients and baseline hazards, assessed by the average mean squared error (AMSE), defined as follows. Let $z^{0} = (z_1^{0}, \ldots, z_n^{0})$ be the true clustering label vector, $\theta_c^{0}$ be the true parameter value of cluster $c$, $n_c$ be the number of counties in cluster $c$, and $k_0$ be the true number of clusters; for simulated data set $r$, let $\hat{\theta}_i^{(r)}$ be the Dahl's method estimate at location $s_i$. Then the AMSE is calculated as

$$\mathrm{AMSE} = \frac{1}{100} \sum_{r=1}^{100} \frac{1}{k_0} \sum_{c=1}^{k_0} \frac{1}{n_c} \sum_{i:\, z_i^{0} = c} \big\| \hat{\theta}_i^{(r)} - \theta_c^{0} \big\|^2,$$

which computes the mean squared error for each cluster first and then averages across clusters.

### 4.2 Simulation Results

Figure 2 shows the histograms of the estimated number of clusters and boxplots of the Rand Index under different values of $h$, as well as under the optimal $h$ selected by LPML, for the four simulation designs. When $h = 0$, the proposed gwCRP method is identical to the traditional CRP; in this case the CRP always tends to over-cluster and often yields a smaller Rand Index than the results under $h > 0$. Another important trend is that, as $h$ increases, the estimated number of clusters first decreases and then increases, while the Rand Index first increases and then decreases as $h$ becomes too large. As discussed in Remark 1, this is because when $h$ increases from 0, the spatial patterns in the data are captured by the proposed gwCRP method; however, as $h \to \infty$, the geographical weights for spatially discontiguous counties approach 0, meaning that only adjacent counties can be classified into the same cluster, which leads to the over-clustering phenomenon again. The clustering performance under the optimal $h$ selected by LPML is very good, with the probability of selecting the true number of clusters always greater than 0.75, and a Rand Index larger than or comparable to the best results attained by any fixed value of $h$.

Table 1 summarizes the AMSE of the parameter estimates of the gwCRP under different values of $h$ for the four designs. For brevity, the AMSE reported for $\beta$ is the average of the AMSEs of its components, since they have similar scales, and likewise for $\lambda$. Similar to the clustering performance, the traditional CRP has the largest AMSE; the AMSE decreases as $h$ increases from 0 to moderate values and increases again as $h$ becomes relatively large. The optimal $h$ selected by LPML also gives the best estimation performance.

| Method | AMSE | Design I | Design II | Design III | Design IV |
|---|---|---|---|---|---|
| gwCRP | $\beta$ | 0.0069 | 0.0067 | 0.0083 | 0.0060 |
| | $\lambda$ | 0.0193 | 0.0186 | 0.0219 | 0.0190 |
| gwCRP | $\beta$ | 0.0065 | 0.0058 | 0.0087 | 0.0049 |
| | $\lambda$ | 0.0194 | 0.0171 | 0.0217 | 0.0172 |
| gwCRP | $\beta$ | 0.0068 | 0.0055 | 0.0085 | 0.0049 |
| | $\lambda$ | 0.0190 | 0.0158 | 0.0216 | 0.0191 |
| gwCRP | $\beta$ | 0.0129 | 0.0072 | 0.0204 | 0.0074 |
| | $\lambda$ | 0.0281 | 0.0217 | 0.0296 | 0.0195 |
| gwCRP Optimal | $\beta$ | 0.0059 | 0.0055 | 0.0067 | 0.0035 |
| | $\lambda$ | 0.0177 | 0.0145 | 0.0203 | 0.0177 |
| CRP | $\beta$ | 0.0086 | 0.0092 | 0.0089 | 0.0082 |
| | $\lambda$ | 0.0228 | 0.0233 | 0.0239 | 0.0223 |

In brief, based on our simulation studies, the gwCRP models outperform the CRP in both clustering and parameter estimation, and our proposed model selection criterion, LPML, nearly always selects the best-performing value of $h$ for both tasks.

## 5 SEER Respiratory Cancer Data

### 5.1 Data Description

In this section, we apply our proposed model to analyze respiratory cancer data for the state of Louisiana, downloaded from the Surveillance, Epidemiology, and End Results (SEER) Program. Modifying the criteria in Zhang et al. (2016), we conducted an initial cleaning of the raw data. First, we deleted subjects for whom the respiratory cancer was not the primary cancer. In addition, we excluded subjects with unknown covariate information, including marital status, sex, age at diagnosis, cancer stage, cancer grade, surgery status, and radiation status. Furthermore, subjects with unknown survival times or who died of causes other than respiratory cancer were not included in our analysis. After cleaning, 16213 observations remain, with a censoring rate of 30.44%. We select age, sex, cancer grade, and historical stage of cancer for our analysis, and summarize the survival times and covariates in Table 2. The median survival times for patients in each county are plotted in Figure 3.

| | Mean (SD) / Frequency (Percentage) |
|---|---|
| Survival Time | 22.43 (31.90) |
| Event | 12.63 (18.32) |
| Censor | 44.85 (43.06) |
| Diagnostic Age | 64.78 (10.89) |
| Sex | |
| &nbsp;&nbsp;Female | 6548 (40.39%) |
| &nbsp;&nbsp;Male | 9665 (59.61%) |
| Cancer Grade | |
| &nbsp;&nbsp;lower grades | 5307 (32.73%) |
| &nbsp;&nbsp;III or IV | 10906 (67.27%) |
| Historical Stage | |
| &nbsp;&nbsp;not distant | 9005 (55.54%) |
| &nbsp;&nbsp;distant | 7208 (44.46%) |

Demographics for the studied dataset. For continuous variables, the mean and standard deviation (SD) are reported. For binary variables, the frequency and percentage of each class are reported.

We first fit the Cox model to the patients in each county using the selected covariates. The regression coefficients are visualized in Figure 4. From Figures 3 and 4, it is seen that some counties have similar characteristics, not limited to adjacent counties, indicating the possibility of globally discontiguous clusters.

### 5.2 Data Analysis

We set three pieces for the baseline hazard with cut points 2.01 and 6.01, and run $h$ from 0 to 10 in steps of 0.1; for each $h$, 2000 MCMC iterations are run, with the first 500 dropped as burn-in. The optimal $h$ selected by LPML is 3.9, with a corresponding estimate of three clusters, while the traditional CRP classifies the counties into five clusters. The clustering patterns of the CRP and gwCRP Optimal ($h = 3.9$) are shown in Figure 5, from which it is seen that the gwCRP captures the globally discontiguous clusters very well. The estimates of the regression coefficients and baseline hazards obtained by gwCRP Optimal are given in Table 3; although the baseline hazards are similar, the regression coefficients are quite different across clusters. From the clustering results in Figure 5, we see that our proposed method successfully detects spatially contiguous and discontiguous clusters simultaneously: most counties in Cluster 1 are spatially contiguous, while most counties in Cluster 3 are spatially discontiguous. From the estimation results in Table 3, we see that (i) for the counties in Cluster 1, all four covariates have moderate hazard effects compared with the other counties; (ii) for the counties in Cluster 2, diagnostic age has the smallest hazard effect, while male sex, higher cancer grade, and distant historical stage have larger hazard effects than in the other counties; and (iii) for the counties in Cluster 3, diagnostic age has the largest hazard effect.

| Cluster | Age | Sex | Grade | Stage | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ |
|---|---|---|---|---|---|---|---|
| 1 | 0.2158 | 0.1349 | 0.5351 | 1.4192 | 1.0631 | 1.0515 | 0.9752 |
| 2 | 0.1009 | 0.3638 | 0.7852 | 1.6679 | 1.1103 | 0.9649 | 0.9807 |
| 3 | 0.3042 | 0.0657 | 0.4282 | 1.0555 | 1.0658 | 1.0133 | 0.9363 |

## 6 Discussion

In this paper, we proposed a geographically weighted Chinese restaurant process to capture the spatial homogeneity of regression coefficients and baseline hazards under a piecewise constant hazards model. An efficient MCMC algorithm without complicated reversible jump moves is proposed for our method. Extensive simulations show that the proposed method has better clustering performance than the traditional CRP in spatial homogeneity pursuit for survival data, as well as promising results in estimating the coefficients and baseline hazards. An application to the SEER data provides an interesting illustration of the proposed method.

Furthermore, three topics beyond the scope of this paper are worth further investigation. First, our proposed algorithm is based on a two-step estimation under the piecewise constant proportional hazards model assumption; developing an efficient sampling algorithm that avoids the Laplace approximation is important future work. Second, we fixed the number of pieces of the baseline hazards in both the simulation studies and the real data analysis; allowing an adaptive number of pieces is left for future research. Finally, variable selection approaches based on the hierarchical CRP (Griffiths et al., 2004) are also worth investigating.

## Appendix A Derivations of the Hessian Matrix

After taking the second-order partial derivatives of the log-likelihood at location $s_i$ in (6), the elements of the Hessian matrix are given by

$$\frac{\partial^2 \ell_i}{\partial \lambda_{ik}^2} = -\sum_{j=1}^{n_i} \frac{\delta_{ijk}}{\lambda_{ik}^2}, \qquad \frac{\partial^2 \ell_i}{\partial \lambda_{ik}\, \partial \beta_i} = -\sum_{j=1}^{n_i} \Delta_{ijk}\, x_{ij} \exp(x_{ij}^{\top} \beta_i), \qquad \frac{\partial^2 \ell_i}{\partial \beta_i\, \partial \beta_i^{\top}} = -\sum_{j=1}^{n_i} \sum_{k=1}^{K} \lambda_{ik} \Delta_{ijk}\, x_{ij} x_{ij}^{\top} \exp(x_{ij}^{\top} \beta_i). \qquad (20)$$

Let $\eta_{ik} = \log \lambda_{ik}$. Then, after this reparameterization, the blocks of the negative Hessian matrix are given by

$$-\frac{\partial^2 \ell_i}{\partial \eta_{ik}^2} = \lambda_{ik} \sum_{j=1}^{n_i} \Delta_{ijk} \exp(x_{ij}^{\top} \beta_i), \qquad -\frac{\partial^2 \ell_i}{\partial \eta_{ik}\, \partial \beta_i} = \lambda_{ik} \sum_{j=1}^{n_i} \Delta_{ijk}\, x_{ij} \exp(x_{ij}^{\top} \beta_i), \qquad -\frac{\partial^2 \ell_i}{\partial \beta_i\, \partial \beta_i^{\top}} = \sum_{j=1}^{n_i} \sum_{k=1}^{K} \lambda_{ik} \Delta_{ijk}\, x_{ij} x_{ij}^{\top} \exp(x_{ij}^{\top} \beta_i). \qquad (21)$$

The estimated variance-covariance matrix of the MLE is $\hat{\Sigma}_i = \hat{H}_i^{-1}$, the inverse of the negative Hessian evaluated at the MLE.

## Appendix B Proof of Proposition 1

For the likelihood at particular location , MLE and its variance estimates . We can used the second order of the Taylor at , where is the MLE of , is the vector of parameters we want to estimate

(22) |

Since is the MLE from likelihood, , and as . Thus the likelihood at location can be written as

(23) |

And then we have

(24) |

## Appendix C Full Conditionals of Collapsed Gibbs Sampling

The full conditional of $z_i$ is

$$P(z_i = c \mid z_{-i}, \hat{\theta}_1, \ldots, \hat{\theta}_n) \propto \begin{cases} \Big( \sum_{j \neq i:\, z_j = c} w_{ij} \Big)\, p\big(\hat{\theta}_i \mid \{\hat{\theta}_j : j \neq i,\, z_j = c\}\big), & \text{for an existing cluster } c, \\ \alpha \displaystyle\int \mathrm{MVN}(\hat{\theta}_i \mid \theta, \hat{\Sigma}_i)\, \mathrm{d}G_0(\theta), & \text{for a new cluster}. \end{cases} \qquad (25)$$

A closed-form expression for the marginal density under the normal base distribution $G_0 = \mathrm{MVN}(\mu_0, \Sigma_0)$ is

$$\int \mathrm{MVN}(\hat{\theta}_i \mid \theta, \hat{\Sigma}_i)\, \mathrm{d}G_0(\theta) = \mathrm{MVN}(\hat{\theta}_i \mid \mu_0, \Sigma_0 + \hat{\Sigma}_i). \qquad (26)$$

If $c$ is an existing cluster, then