Potts models can be used to describe spatial grids for which each location falls into one of a fixed number of classes. The Ising model (Ising1925) can be viewed as a special case of the Potts model with two classes. Since both of these models were originally developed to describe the magnetization of atoms in crystalline solids, they involve an inverse temperature parameter that controls the strength of magnetization. The Potts model also coincides with the autologistic models introduced by Besag1974 for analyzing categorical data on spatial grids. Potts models have been used in a variety of areas beyond statistical physics, including neuroscience (Roudi2009) and quantum computing (King2018). The hidden Potts model, in which the model for the data depends on an unobserved Potts model configuration, has also seen multiple applications, including in medical image processing (doi:10.1177/1176935117711910) and satellite image analysis (Moores2020).
Potts models are computationally challenging, as model probabilities involve a normalizing constant that becomes intractable for large spatial grids. To address this intractability, 2987782 introduced a pseudo-likelihood consisting of the product of full conditional distributions, which, due to a cancellation, does not require calculation of the normalizing constant. However, the asymptotic variance of the maximum-pseudo-likelihood estimator remains high as the size of the spatial grid tends to infinity (stoehr2017). 10.2307/3088837 performed an algebraic simplification of the normalizing constant of the observed Potts model using its Markov property. Although this works efficiently for small spatial grids, the computational expense increases exponentially with the grid size. Several authors have proposed MCMC algorithms and variational methods (as reviewed by stoehr2017) for inference on the parameter of the observed Potts model. Some of these methods can also be used to sample the spatial grid for given values of the parameter.
Inference for hidden Potts models is even more involved due to a doubly intractable likelihood. Many works have developed approximate Bayesian computation (ABC) algorithms that speed up the computation by approximating the likelihood of the hidden Potts model. Recently, Moores2020 introduced a parametric functional approximate Bayesian (PFAB) algorithm for inference on the temperature parameter of the hidden Potts model, which surpasses the computational efficiency of previous methods by a wide margin while preserving accurate inference.
Here, we propose an ordered conditional approximation (OCA) that approximates the joint density of a Potts model as a product of conditional distributions, for which we order the spatial locations by their coordinates and then condition each location on its nearest previously ordered neighbors. Our OCA is motivated by the Vecchia approximation, which has been highly successful for speeding up Gaussian-process inference (e.g., Vecchia1988; Stein2004; Datta2016; Guinness2016a; Katzfuss2017a; Katzfuss2018). However, unlike in Gaussian-process inference, where the conditional distributions are available as closed-form multivariate Gaussian distributions, conditional distributions under the Potts model do not have such closed-form expressions and require computing sums whose cost increases exponentially with the grid size. To overcome this computational issue, we restrict the sums in the conditional distributions to a sub-region of the grid to construct the OCA. Because of its construction, our OCA can take advantage of distributed computing for fast calculations. Additionally, we propose a simple mechanism for drawing random samples from the joint distribution of a Potts field for specific values of the parameter. We extend our approach to hidden Potts models, to allow fast evaluation of the marginal likelihood and to enable scalable inference on the hidden configuration of the spatial grid. We also propose a Gibbs sampler as an optional tool for inference in this setting. All OCA inference scales linearly in the total number of spatial locations for fixed tuning parameters.
The remainder of this article is organized as follows. In Section 2, we describe the Potts model and introduce its OCA. In Section 3, we extend our OCA to the hidden Potts model. In Sections 4 and 5, we present numerical results using simulations and a satellite image, respectively. We conclude in Section 6.
2 The Potts model
Let be a set of locations on a two-dimensional rectangular spatial grid of size . At each location , we observe one of classes, . The Potts model assumes that the probability of a class label, say , at a particular location increases with the number of neighboring grid locations falling into the same class. The strength of this spatial dependence is controlled by an inverse temperature parameter .
Specifically, the joint density (i.e., the joint probability mass function) for a particular grid configuration $\mathbf{y}$ is given by
\[ p(\mathbf{y} \mid \beta) = \frac{\exp\{H_\beta(\mathbf{y})\}}{C(\beta)}, \tag{1} \]
where
\[ H_\beta(\mathbf{y}) = \beta\, S(\mathbf{y}) \tag{2} \]
is the Hamiltonian function,
\[ S(\mathbf{y}) = \sum_{i \sim j} \mathbf{1}(y_i = y_j) \tag{3} \]
is the summary statistic, and
\[ C(\beta) = \sum_{\mathbf{y}^\ast} \exp\{H_\beta(\mathbf{y}^\ast)\}, \tag{4} \]
with the sum taken over all possible grid configurations $\mathbf{y}^\ast$, is the normalizing constant.
In (3), $i \sim j$ means that locations $i$ and $j$ are neighbors. Throughout, we consider a first-order neighborhood structure consisting, for each (interior) pixel, of the four pixels immediately above, below, left, and right; more on neighborhood structures can be found in stoehr2017. The Ising model is a special case of the Potts model with two states. See Figure 1 for examples of realizations of the Ising and Potts models.
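To make these definitions concrete, the following sketch computes the summary statistic for a first-order neighborhood and, for tiny grids only, the exact normalized log-density by brute-force enumeration. The paper's experiments use R, but the logic carries over; the function names and interfaces here are our own illustrative choices, not the paper's code.

```python
import itertools
import numpy as np

def summary_stat(grid):
    """S(y): number of like-labeled first-order neighbor pairs (right + down)."""
    s = np.sum(grid[:, :-1] == grid[:, 1:])   # horizontal neighbor pairs
    s += np.sum(grid[:-1, :] == grid[1:, :])  # vertical neighbor pairs
    return int(s)

def exact_log_density(grid, beta, q):
    """Exact Potts log-density via brute-force normalization (tiny grids only)."""
    m, n = grid.shape
    log_unnorm = beta * summary_stat(grid)
    # Normalizing constant: sum over all q^(m*n) configurations.
    z = 0.0
    for labels in itertools.product(range(q), repeat=m * n):
        cfg = np.array(labels).reshape(m, n)
        z += np.exp(beta * summary_stat(cfg))
    return log_unnorm - np.log(z)
```

The exponential cost of the normalizing constant is visible directly: the loop over configurations has `q ** (m * n)` iterations, which is exactly the intractability the OCA is designed to avoid.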
For a particular grid configuration, calculation of the joint density in (1) involves evaluating the normalizing constant in (4), which requires summing over the possible states, and is thus computationally infeasible for even moderately large . 2987782 introduced a pseudo-likelihood consisting of the product of full conditional distributions, , which, due to a cancellation, does not require calculation of the normalizing constant.
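The pseudo-likelihood, by contrast, is straightforward to compute. A minimal sketch, assuming a q-state Potts model on a rectangular grid with free boundaries (again an illustrative Python rendering, not the paper's implementation):

```python
import numpy as np

def log_pseudo_likelihood(grid, beta, q):
    """Besag's log pseudo-likelihood: sum of log full conditionals
    log p(y_i | y_{-i}) over all pixels, which avoids the constant C(beta)."""
    m, n = grid.shape
    total = 0.0
    for i in range(m):
        for j in range(n):
            # count neighbors of pixel (i, j) in each class
            counts = np.zeros(q)
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < m and 0 <= b < n:
                    counts[grid[a, b]] += 1
            logits = beta * counts
            total += logits[grid[i, j]] - np.log(np.sum(np.exp(logits)))
    return total
```

Each full conditional depends only on the four neighbors, so the total cost is linear in the number of pixels.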
2.2 Ordered conditional approximation of the Potts model
To introduce our ordered conditional approximation (OCA), we first assume that the locations (and hence the corresponding observations in ) follow a lexicographic ordering according to their coordinates, as illustrated in the right panel of Figure 1.
Next, we obtain an ordered conditional expression of the joint density (1) as a product of conditional densities:
where the conditional distributions for the Potts model can be shown to be
We propose to approximate the expression in (6) by considering subvectors of and . Define and (i.e., the blue and magenta boxes in Figure 1) of size and , respectively. Then define the modified Hamiltonian for as
Since depends only on a subset of the full set of states , if we replace with in (6), the sums over and can be replaced by sums over and , yielding
which is much less costly to evaluate, as long as is small. Note that while the expressions include , the states that are not part of do not enter into , so their values do not matter. In addition, since is a smaller vector than , there are fewer terms in each individual evaluation of , providing further computational speedups. Furthermore, since the states appear in both the numerator and the denominator in (6), we can ignore their contribution when defining the modified Hamiltonian in (7). A detailed derivation of (8) from (6) can be found in Appendix A. The state vector captures the dependence of each on its previously ordered nearest neighbors. Although the calculation time is linear in , choosing a large value of will not yield better results, since calculations involving the previously ordered neighbors cancel out from both the numerators and denominators in (8). Thus, a reasonable choice of can be .
Thus, the OCA of the Potts density in (5) can be expressed as,
Evaluating this expression requires time, which is linear in the grid size . Furthermore, each of the terms in (9) can be computed independently of the remaining terms. Hence, we can also take advantage of parallel computation for calculating the joint log-likelihood.
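A schematic, unoptimized implementation of the OCA log-likelihood might look as follows. To keep the sketch short, we take the conditioning set to be the previous `m_c` indices in lexicographic order and the future set to be the next `m_f` indices including the current one; this simplifies the paper's box-shaped conditioning sets, and all names are our own.

```python
import itertools
import numpy as np

def oca_log_likelihood(grid, beta, q, m_c=4, m_f=3):
    """Schematic OCA log-likelihood for a lexicographically ordered Potts grid.
    For each pixel, conditions on m_c previous pixels and sums only over the
    m_f 'future' pixels in its window, instead of the whole remaining grid."""
    m, n = grid.shape
    y = grid.ravel()  # lexicographic (row-major) order
    N = y.size

    def neighbors(k):
        i, j = divmod(k, n)
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            a, b = i + di, j + dj
            if 0 <= a < m and 0 <= b < n:
                yield a * n + b

    total = 0.0
    for k in range(N):
        cond = set(range(max(0, k - m_c), k))      # conditioning set c(i)
        future = list(range(k, min(N, k + m_f)))   # future set f(i), i first
        active = cond | set(future)

        def h_tilde(vals):
            """Modified Hamiltonian: like-pairs with at least one future member,
            restricted to the active sub-region."""
            state = {j: int(y[j]) for j in cond}
            state.update(zip(future, vals))
            s = 0
            for j in future:
                for nb in neighbors(j):
                    if nb in active and nb < j and state[nb] == state[j]:
                        s += 1
            return beta * s

        num = sum(np.exp(h_tilde((int(y[k]),) + rest))
                  for rest in itertools.product(range(q), repeat=len(future) - 1))
        den = sum(np.exp(h_tilde(vals))
                  for vals in itertools.product(range(q), repeat=len(future)))
        total += np.log(num / den)
    return total
```

Each of the N conditional terms involves sums of at most `q ** m_f` configurations and can be computed independently, which is where the linear scaling and the opportunity for parallelism come from.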
Inference on can proceed by replacing the exact likelihood or density by the OCA , both of which implicitly depend on . Both maximum-likelihood estimation and Bayesian inference are possible. In contrast to the pseudo-likelihood of 2987782, the OCA has the virtue that it converges to the exact likelihood as the conditioning sets grow; we show in Section 4.1.1 below that the OCA can be more accurate than the pseudo-likelihood even when using very small conditioning sets.
As a brief aside, it has been shown that other orderings (e.g., the maximum-minimum-distance, or maximin, ordering) can improve on lexicographic ordering in terms of the accuracy of the Vecchia approximation of Gaussian processes (e.g., Guinness2016a; Katzfuss2017a). However, our exploratory numerical experiments indicated that both maximin and reverse-maximin orderings were less accurate than lexicographic ordering for the OCA of the Potts model.
2.3 Sampling from the Potts model using OCA
We also provide an algorithm for joint sampling from a Potts field using OCA. Unlike many existing algorithms based on iterative techniques such as Markov chain Monte Carlo (MCMC), our approach directly draws a joint sample from the (approximate) Potts model for the whole grid (i.e., from (9)) in a single iteration. Thus, OCA does not require a large number of iterations to ensure mixing and convergence. For each location , (8) can be rewritten as,
for a -state Potts model. Thus, sequentially for each , we evaluate for , and then sample from this discrete distribution. Simulation studies to examine this method are performed in Section 4.1.2.
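The sequential sampling scheme can be sketched as follows; as in the likelihood sketch above, the future window is simplified to the next `m_f` lexicographic indices, and all names are illustrative assumptions rather than the paper's code.

```python
import itertools
import numpy as np

def oca_sample(m, n, beta, q, m_f=3, rng=None):
    """Draw one joint sample from the OCA of a q-state Potts model by sampling
    pixels sequentially in lexicographic order from their approximate conditionals."""
    rng = np.random.default_rng(rng)
    N = m * n
    y = np.zeros(N, dtype=int)

    def neighbors(k):
        i, j = divmod(k, n)
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            a, b = i + di, j + dj
            if 0 <= a < m and 0 <= b < n:
                yield a * n + b

    for k in range(N):
        future = list(range(k, min(N, k + m_f)))
        weights = np.zeros(q)
        for c in range(q):
            # marginalize over the remaining future pixels in the window
            for rest in itertools.product(range(q), repeat=len(future) - 1):
                state = dict(zip(future, (c,) + rest))
                s = 0
                for j in future:
                    for nb in neighbors(j):
                        if nb < j:
                            v = y[nb] if nb < k else state[nb]
                            if v == state[j]:
                                s += 1
                weights[c] += np.exp(beta * s)
        y[k] = rng.choice(q, p=weights / weights.sum())
    return y.reshape(m, n)
```

Because each pixel is visited exactly once, a full field is produced in a single pass, with no burn-in or convergence diagnostics needed.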
3 The hidden Potts model
Let denote a spatial grid generated from a Potts model as in Section 2.1, but now assume that is hidden (i.e., latent), and instead we observe a vector . We assume that are conditionally independent given :
Various distributional assumptions for the likelihood are possible, including a normal distribution (4767596; 10.2307/3085830) or a multivariate normal distribution for vector-valued . For example, in the normal case, we could assume that
Some realizations from such a hidden Potts model are shown in Figure 2.
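Under the conditional-independence assumption above, simulating data from a hidden Potts model given its latent labels is a one-liner. A minimal sketch, assuming class-specific normal emissions (function name and interface are our own):

```python
import numpy as np

def simulate_hidden_potts_data(z, mu, sigma, rng=None):
    """Given a latent label grid z (values in {0,...,q-1}), draw independent
    observations x_i ~ N(mu[z_i], sigma[z_i]^2), one per pixel."""
    rng = np.random.default_rng(rng)
    z = np.asarray(z)
    return rng.normal(np.asarray(mu, dtype=float)[z],
                      np.asarray(sigma, dtype=float)[z])
```

This mirrors the two-stage simulation used in Section 4.2: first draw the latent Potts field, then draw one noisy observation per pixel conditional on its class.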
3.2 OCA of the marginal likelihood
Similar to Section 2.2, we begin by writing the joint marginal density of the data (i.e., the integrated likelihood) in an ordered conditional form,
which can be written as the ratio of joint distributions
As before, we replace the with , which allows us to sum over a much smaller set,
Hence, by virtue of OCA, the final approximated integrated likelihood can be written as,
The OCA integrated likelihood can be evaluated in time, which is also linear in .
3.3 OCA inference on the hidden Potts model
We now consider the joint posterior distribution of a hidden -state Potts model, which can be written as
since, for each location , each of the previous noisy observations is independent of the original states once is given.
The conditional distributions in (16) can be written as:
However, computing the exact posterior probability is infeasible, because it requires calculating the full Hamiltonian for all state configurations.
To overcome this, we again replace by , and write the approximate likelihood as:
This approximate posterior probability calculation has a complexity of .
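In the simplest case, where the future set contains only the current pixel, the approximate full conditional of a latent label reduces to the product of the emission density and a Potts neighbor term. The sketch below implements this simplified single-pixel variant (our own illustrative reduction, not the paper's block version):

```python
import numpy as np

def latent_full_conditional(i, j, z, x, mu, sigma, beta, q):
    """Approximate full conditional p(z_ij = c | z_-ij, x_ij): normal emission
    density times exp(beta * like-neighbor count), normalized over classes."""
    m, n = z.shape
    logp = np.empty(q)
    for c in range(q):
        same = 0
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            a, b = i + di, j + dj
            if 0 <= a < m and 0 <= b < n and z[a, b] == c:
                same += 1
        loglik = -0.5 * ((x[i, j] - mu[c]) / sigma[c]) ** 2 - np.log(sigma[c])
        logp[c] = loglik + beta * same
    p = np.exp(logp - logp.max())  # stabilized softmax
    return p / p.sum()
```

The data term pulls the label toward the class whose mean best matches the observation, while the Potts term pulls it toward the labels of its neighbors; the inverse temperature controls the balance.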
3.4 Gibbs sampler for hidden Potts model
We also provide a tool for Bayesian inference on hidden Potts models and their parameters using the OCA, which proceeds via a Gibbs sampler in which we sample the hidden field from its joint (OCA) full-conditional distribution. We consider the following prior distributions for the parameters of the -state hidden Potts model described in Section 3.1.
In addition, we assume a uniform prior on over the positive real line. The Gibbs sampler consists of the following steps.
1. Sample , the hidden field, given , , and the , as described in Section 3.3.
2. Update the using the current hidden configuration , by sampling from the following normal-inverse-gamma posterior:
where , , , .
3. Update using a Metropolis update based on the following un-normalized pdf:
where is approximated as in (9).
4. Repeat from step 1.
This algorithm has been illustrated with simulation studies in Section 4.2.
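Step 2 of the sampler is a standard conjugate normal-inverse-gamma update. Since the paper's exact hyperparameter expressions appear in equations not reproduced here, the sketch below uses the textbook form of this update, with all hyperparameter names (`m0`, `k0`, `a0`, `b0`) our own assumptions:

```python
import numpy as np

def nig_update(x_j, m0=0.0, k0=1.0, a0=2.0, b0=1.0, rng=None):
    """Sample (mu_j, sigma_j^2) from the normal-inverse-gamma posterior,
    given the observations x_j currently assigned to class j."""
    rng = np.random.default_rng(rng)
    x_j = np.asarray(x_j, dtype=float)
    n = x_j.size
    xbar = x_j.mean() if n > 0 else 0.0
    ss = np.sum((x_j - xbar) ** 2)
    kn = k0 + n
    mn = (k0 * m0 + n * xbar) / kn
    an = a0 + n / 2.0
    bn = b0 + 0.5 * ss + 0.5 * k0 * n * (xbar - m0) ** 2 / kn
    sigma2 = 1.0 / rng.gamma(an, 1.0 / bn)      # inverse-gamma draw
    mu = rng.normal(mn, np.sqrt(sigma2 / kn))   # normal draw given sigma2
    return mu, sigma2
```

Because the update only needs the count, mean, and sum of squares of the pixels currently in class j, it is cheap relative to the field update in step 1.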
4 Numerical experiments
4.1 Simulation for observed Potts model
4.1.1 Parameter estimation using the OCA likelihood
Figure 3 shows the log-likelihood for an Ising model (i.e., a Potts model with ) on a spatial field of size for different conditioning-set sizes and . As the set sizes increased, the log-likelihood appeared to converge, with differences between successive curves decreasing. For , the OCA log-likelihoods (and their maxima) appeared quite accurate.
For a more systematic comparison, we fixed the true value of at 0.35 and simulated 180 datasets from the Ising model. We then estimated for each dataset by maximizing both the OCA likelihood and the pseudo-likelihood described in Besag1974, and compared the quality of these estimates using the root mean squared error (RMSE). As shown in Figure 4(a), the OCA estimates converged quickly as increased, resulting in a smaller RMSE than for the pseudo-likelihood. The OCA likelihood evaluations were also computationally feasible; even with , OCA Ising likelihood evaluations were obtained in under 60 seconds on a laptop.
We repeated the same set of simulations in Figure 4(b), but instead of the Ising model, we simulated datasets from a 3-state Potts model using the PottsUtils (PottsUtilsR) package in R. The OCA again yielded better RMSEs than the pseudo-likelihood method.
4.1.2 Sampling from observed Potts model
We simulated 60 samples from a 3-state observed Potts model for different values of using the OCA likelihood on a 50 × 50 spatial grid. The required sampling steps are described in Section 2.3. We also considered several sizes of the future conditioning sets, while setting .
We considered the Swendsen-Wang algorithm (swendsenwang) for comparison. Since the model configurations are nominal variables that indicate only the class labels and cannot be compared individually, we computed the summary statistic S(). Our simulation results are presented in Figure 5.
From the plot, we conclude that while our method performed well below the critical temperature (), the estimates of both the mean and the standard deviation for higher values of can differ more substantially from those of the Swendsen-Wang algorithm. Nonetheless, because of its simple construction, the OCA likelihood can be a useful sampling tool for small values of . For future-conditioning-set sizes of 2, 4, and 6, a single random sample from a 3-state Potts model on a spatial grid required 0.01, 0.1, and 2.4 seconds on average, respectively.
4.2 Simulation for hidden Potts model
We followed the assumptions in Section 3.1 to generate realizations from a 3-state hidden Potts model on a spatial field. We assumed for all . To simulate data from a hidden Potts model, we first simulated the latent Potts field using the PottsUtils (PottsUtilsR) package, and then simulated the noisy observations conditional on the Potts class at each pixel using the rnorm function in R.
4.2.1 Integrated likelihood
Figure 6 shows integrated log-likelihoods for hidden Potts models. For data generated from a 3-state and a 5-state Potts model at two different values (indicated by vertical lines) with noise level , we calculated the OCA likelihood of the data across a range of values. The computational expense of calculating the log-likelihood grows exponentially with the size of the conditioning sets: on average, a single evaluation of the integrated likelihood took 2 seconds with , 4 minutes with , and 10 minutes with for a 3-state hidden Potts model.
As expected, the error in estimating increased with the noise level (Figure 7). While the RMSE was approximately 0.014 for a hidden 3-state Potts model with low noise, it increased severalfold when more noise was added. For , we obtained an estimate of for the hidden Potts model within 2 minutes on a laptop. Both likelihood computation and optimization become more expensive as increases.
4.2.2 Gibbs sampler for the hidden field
We considered the Gibbs sampler described in Section 3.4 for a 3-state hidden Potts model (i.e., ). The data were generated with different noise levels, which can be found in Figure 2. For parameters of the prior distributions assumed in (18), we considered , , , and for .
We ran the Gibbs sampler for 8,000 iterations, with the first half considered burn-in. Our numerical results are plotted in Figure 8, including the highest-posterior-probability (HPP) map, which for each pixel shows the class that had the highest posterior probability. The results show that, even with moderately large noise in data, our algorithm can generally extract the hidden spatial structure.
4.2.3 Comparison with other models
We used the Brier score to compare our method with a finite mixture of Gaussian distributions (a brief overview of this Gaussian mixture model can be found in Appendix B) and with the PFAB algorithm of Moores2020. The Brier score measures the accuracy of probabilistic predictions for a set of mutually exclusive discrete outcomes. It is defined as
\[ \mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} (f_{ij} - o_{ij})^2, \]
where $f_{ij}$ is the forecast probability of class $j$ for observation $i$, and $o_{ij}$ is an indicator that equals 1 if observation $i$ truly belongs to class $j$, and 0 otherwise. Hence, by construction, the Brier score takes small positive values when there is little mismatch between the forecast and the test data (i.e., when the calibration is good).
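The score is simple to compute from a matrix of forecast probabilities; a minimal sketch (names our own):

```python
import numpy as np

def brier_score(forecast, labels, q):
    """Multiclass Brier score: mean over observations of the squared distance
    between the forecast probability vector and the one-hot true label."""
    forecast = np.asarray(forecast, dtype=float)  # shape (n, q)
    onehot = np.eye(q)[np.asarray(labels)]        # one-hot encode true classes
    return float(np.mean(np.sum((forecast - onehot) ** 2, axis=1)))
```

A perfect forecast scores 0; a uniform forecast over two classes scores 0.5 on every observation.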
[Table: Brier scores for three algorithms — Gaussian mixture model (GMM), ordered conditional approximation (OCA), and parametric functional approximate Bayesian (PFAB) — under three signal-to-noise ratios ( = 0.1, = 0.3, = 0.6), where denotes the noise for the j-th pixel. In our simulations, we used the same noise level for all pixels.]
From our simulation results, we see that in all cases our method (OCA) performed well compared to the Gaussian mixture model (GMM). However, it performed worse than the PFAB algorithm of Moores2020 in terms of calibration. On the other hand, the PFAB algorithm requires to be a scalar random variable. If is vector-valued (e.g., as in the modified Potts model of doi:10.1177/1176935117711910), the current PFAB algorithm does not apply, whereas our method could be extended to such a scenario.
5 Analysis of a satellite image
5.1 Image classification
We applied our method to the Menteith data in bayessR. This is a 100 × 100 satellite image of the Lake of Menteith, near Stirling, Scotland. Following the description in bayessR, our objective is to use the noisy satellite data of the lake and the surrounding land to classify each pixel into one of six classes of surface type, using the Gibbs sampler described in Section 3.4. We performed a k-means algorithm with six classes on the dataset and used the corresponding class means and standard deviations to find the optimal value of based on the log-likelihood described in Section 3. This value was then used as the initial value of for the Gibbs sampler as described in Section 3.3. To limit the computational expense, we set in Sections 2.2 and 3.3 when calculating the log-likelihood of and sampling from the posterior distributions. The original satellite image and the HPP image are shown in Figure 9, along with one image sampled from the posterior to convey the posterior uncertainty about the latent field. Interestingly, while our method is able to distinguish between the different classes, all pixels are classified into a single class when we use a Gaussian mixture model (see Appendix B) with six classes or components.
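The k-means initialization can be sketched as follows. This is a basic one-dimensional k-means with quantile initialization, written for illustration only, not the exact routine used in the paper:

```python
import numpy as np

def kmeans_init(x, q, iters=50):
    """Basic 1-D k-means (quantile-initialized) yielding per-class means and
    standard deviations, e.g., to initialize a Gibbs sampler."""
    x = np.asarray(x, dtype=float).ravel()
    centers = np.quantile(x, (np.arange(q) + 0.5) / q)  # spread initial centers
    labels = np.zeros(x.size, dtype=int)
    for _ in range(iters):
        # assign each pixel to its nearest center, then recompute the centers
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(q):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean()
    sds = np.array([x[labels == c].std() if np.any(labels == c) else 1.0
                    for c in range(q)])
    return centers, sds, labels
```

The resulting class means and standard deviations serve as rough plug-in values for the emission parameters, from which an initial value of the inverse temperature can then be chosen by maximizing the OCA log-likelihood.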
On a personal laptop with a 2-core Intel i5 7th-generation CPU, a single draw of the hidden field from the hidden Potts model takes 2 seconds, while computing the log-likelihood of for the 100 × 100 hidden Potts model takes around 5 minutes, resulting in a slow Metropolis update for sampling as in Section 4.2.2. Since our log-likelihood calculation can take advantage of distributed computation, we believe this calculation can be sped up considerably on a system with more processing cores.
5.2 Prediction for held-out pixels
Using the same setup as in Section 5.1, we effectively held out 10% of the pixels of the 100 × 100 spatial grid (i.e., 1,000 randomly chosen pixels) by assuming each of them to have a very large (known) standard deviation of 100, so that the information in these 1,000 data points was down-weighted and largely ignored. We then obtained the posterior distribution of the entire latent field as in step 1 of Section 3.4, in effect generating out-of-sample predictions for the "held-out" 1,000 pixels.
To limit the computational burden on a personal laptop, we used only 100 MCMC iterations. After performing steps 1 and 2 of Section 3.4 in each iteration, we considered the previously chosen pixels and drew 100 normal samples using the predicted means and standard deviations of the corresponding classification labels obtained in step 2. For example, if one of the 1,000 pixels is predicted to have label 3, then for that pixel we draw 100 normal samples using and from step 2 of the Gibbs sampler in Section 3.4. We then use the continuous ranked probability score (CRPS) to measure the discrepancy between the originally observed value of each pixel and the 100 samples. Given a probability distribution with cumulative distribution function $F$ and an observation $x$, the CRPS is given by
\[ \mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \big( F(t) - \mathbf{1}\{t \geq x\} \big)^2 \, dt. \]
We repeated this process of randomly selecting pixels and performing the above steps 10 times, and averaged the CRPS over the repetitions. We ran the same experiment using the Gaussian mixture model (GMM).
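For a forecast given by an empirical distribution of posterior-predictive samples, the CRPS can be computed with the standard energy-form identity CRPS(F, x) = E|X − x| − ½ E|X − X′|; a sketch (names our own):

```python
import numpy as np

def empirical_crps(samples, obs):
    """CRPS of the empirical distribution of `samples` against a scalar
    observation, via CRPS(F, x) = E|X - x| - 0.5 E|X - X'|."""
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - obs))
    term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return float(term1 - term2)
```

Averaging this quantity over held-out pixels (and repetitions) gives the reported scores; smaller values indicate that the predictive distribution concentrates near the observed value.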
In our experiment, the Gibbs sampler using the OCA approach yielded an average CRPS of 5.43, while the GMM yielded a much higher average CRPS of 20.36. By the definition of the CRPS, a low value implies that the empirical distribution of the samples lies close to the observed values, which means better prediction. We therefore conclude that when little information about the pixels is available (corresponding to a high standard deviation), OCA outperforms the GMM in terms of prediction.
6 Conclusions
We introduced an ordered conditional approximation (OCA) of the likelihood of both the Potts model and the hidden Potts model. The OCA does not require calculation of the intractable normalizing constant. We also devised a simple algorithm to sample from the observed Potts model using the OCA and showed numerically that it can be used to sample from the Potts model for a range of values of the parameter . Because of its simple structure of calculating multiple conditional likelihoods (see Section 2.2), our method could straightforwardly be extended to higher spatial dimensions (i.e., more than two coordinates). A Gibbs sampler for inference on the spatial configuration and the parameter has also been proposed. We tested these methods on both simulated data and real observations.
Inference using our methods requires time linear in , the total number of locations in the spatial grid. We used the root mean squared error (RMSE) to demonstrate the superiority of our method over the pseudo-likelihood. Additionally, we used the Brier score to show the better calibration of our method compared to a Gaussian mixture model. As the likelihood calculation slows down with increasing conditioning-set size, inference can become infeasible for large datasets on personal laptops. As a side note, the OCA could be combined with the algorithm developed by Moores2020; however, as this would be largely redundant with our existing comparisons, we have not pursued it.
One possible future direction for this work is to develop theoretical properties of the OCA as a function of the conditioning-set sizes. Schafer2020 provided theoretical bounds for conditional-likelihood approximations in the context of Gaussian processes, but, to the best of our knowledge, these results cannot be applied directly to the Potts model. Another direction is to extend our approach to modified Potts models, where the parameters vary with the class labels (doi:10.1177/1176935117711910). As the latter involves more than one parameter, gradient-based optimization may be used to obtain the maximum likelihood estimates.
MK and JG were partially supported by National Science Foundation (NSF) Grant DMS–1953005. MK was also partially supported by NSF Grants DMS–1654083 and CCF–1934904.
Appendix A Calculation of Potts model likelihood
Here, our objective is to approximate the conditional likelihood described in (6). Note that, for each fixed , exact calculation of the conditional likelihood takes approximately units of time, which is infeasible for small values of . For this reason, we truncate the calculation of the conditional likelihoods to a subset of future indices for every . Thus, we define (i.e., the magenta boxes in Figure 1) of size . Then, we define the first modified Hamiltonian as
where , and for a suitable choice of with , since the locations with indices in f(i) only have nearest neighbors among the locations with indices in (the blue boxes in Figure 1). Finally, we can approximate the conditional likelihood as
Appendix B Gaussian mixture model (GMM)
Suppose we have observations and the corresponding state vector denoting the individual class labels. We assume that the observations are distributed as follows.
where is the maximum number of unique states.
The goal is to make inference on the class labels and the corresponding parameters. However, we cannot directly observe the class labels; instead, we observe only . For this reason, we place priors on the class labels and the parameters using a framework called the Gaussian mixture model (GMM).
First, we consider the following priors for the individual means and standard deviations:
These distributions follow the same prior structure as the Bayesian framework described in Section 3.4. However, here we do not assume that . Instead, we assume that
where denote the individual class probabilities. We assume that the class probabilities are not known either (since we do not have any information about the class labels, as discussed earlier), and so we consider a Dirichlet prior on . We assume,
where are tuning parameters that control the means and variances of the corresponding parameters. We fixed as the tuning parameter in the numerical simulations; for example, in Section 4.2.3 and in Section 5. The posterior distributions for the GMM follow.
where = , j = , , , , . More details of this Gaussian mixture model can be found, for example, in rasmussen1999infinite.
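The Dirichlet prior on the class probabilities is conjugate to the multinomial labels, so the corresponding Gibbs update is a single Dirichlet draw with counts added to the concentration parameters. A sketch (names our own):

```python
import numpy as np

def sample_class_probs(labels, q, alpha=1.0, rng=None):
    """Dirichlet posterior draw for the class probabilities w:
    w | z ~ Dirichlet(alpha + n_1, ..., alpha + n_q), n_j = #{i : z_i = j}."""
    rng = np.random.default_rng(rng)
    counts = np.bincount(np.asarray(labels).ravel(), minlength=q)
    return rng.dirichlet(alpha + counts)
```

Classes with many assigned pixels receive correspondingly large posterior weight, while the concentration parameter keeps empty classes from collapsing to exactly zero probability.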