ConstrMixMod
R package for Flexible Mixture Modeling on Constrained Spaces
view repo
This paper addresses challenges in flexibly modeling multimodal data that lie on constrained spaces. Applications include climate or crime measurements in a geographical area, or flow-cytometry experiments, where unsuitable recordings are discarded. A simple approach to modeling such data is through the use of mixture models, with each component following an appropriate truncated distribution. Problems arise when the truncation involves complicated constraints, leading to difficulties in specifying the component distributions, and in evaluating their normalization constants. Bayesian inference over the parameters of these models results in posterior distributions that are doubly-intractable. We address this problem via an algorithm based on rejection sampling and data augmentation. We view samples from a truncated distribution as outcomes of a rejection sampling scheme, where proposals are made from a simple mixture model, and are rejected if they violate the constraints. Our scheme proceeds by imputing the rejected samples given mixture parameters, and then resampling parameters given all samples. We study two modeling approaches: mixtures of truncated components and truncated mixtures of components. In both situations, we describe exact Markov chain Monte Carlo sampling algorithms, as well as approximations that bound the number of rejected samples, achieving computational efficiency and lower variance at the cost of asymptotic bias. Overall, our methodology only requires practitioners to provide an indicator function for the set of interest. We present results on simulated data and apply our algorithm to two problems, one involving flow-cytometry data, and the other, crime recorded in the city of Chicago.
READ FULL TEXT VIEW PDFR package for Flexible Mixture Modeling on Constrained Spaces
This paper studies two approaches to flexibly modeling multimodal data that lie on constrained spaces. An example of such data is crime data, where measurements are restricted to within the complex boundaries of a geographical entity, either because none exist outside (due to topographical features like water bodies) or because measurements outside belong to another city/state/country. Flow cytometry is another example, with values outside some range (e.g. 0 to 1024) discarded. Other instances include single cell RNA-sequence data (Cao et al., 2017) (where count-measurements below some threshold are discarded), operational risk modeling (Luo et al., 2009) (where loss data from banks only above a threshold are provided), stock price data (Aban et al., 2006), mortality (Alai et al., 2013), survival (Cain et al., 2011) and capture-recapture data (Manning and Goldberg, 2010), where certain outcomes are truncated (Mandel, 2007)
, and climate data. In all such cases, it is important to accurately account for the boundaries of the constraint set, and to avoid biases which might arise if boundary effects incorrectly interact with smoothness assumptions inherent in typical probability models. Doing so, however, raises computational challenges due to the need to evaluate integrals over complicated subsets of a Euclidean space. This problem is exacerbated when the data exhibits rich multimodal structure, a common situation that requires mixture (and sometimes nonparametric mixture) models.
We address this problem in this paper, developing and extending a methodology proposed in Rao et al. (2016) that avoids having to compute intractable terms arising from complex constraints. Our approach is to treat observations lying on the constrained space as the outcome of a rejection-sampling algorithm (Robert and Casella, 2005), with a proposal distribution defined on the simpler unconstrained space, and with observations falling outside the constraint set discarded. Following Rao et al. (2016); Beskos et al. (2006), we carry out inference over the distribution by imputing the rejected samples. Implicit in is all information about the original distribution of interest. Working directly with the unconstrained , and imputing the rejected proposals, allows us to use standard Bayesian modeling and computational techniques (such as nonparametric Bayes, and Gibbs samplers based on the Chinese restaurant process (Neal, 2000) and the stick-breaking process (Ishwaran and James, 2001). We focus on mixture models, studying two approaches: mixtures of truncated components, and truncated mixtures of components. For both models, we describe exact and efficient Markov chain Monte Carlo (MCMC) schemes to impute the rejected proposals, allowing inference over cluster assignments, cluster parameters and cluster weights. We also explore computational approximations to speed up inference, and show how, at the price of some asymptotic bias, these can improve performance by reducing variance. Here, our approach is to bound the number of rejected proposals that our imputation scheme produces.
We start by briefly reviewing the rejection sampling method and the associated data augmentation algorithm of Beskos et al. (2006); Rao et al. (2016) in Section 2. In Section 3 we define the two modeling approaches along with details of the associated MCMC samplers. Section 4 outlines the thresholding approximation for computational efficiency. Finally, in Section 5, we present the results of our simulation studies on synthetic data, and real datasets of crime and flow cytometry measurements.
We consider observations lying on a subset of a Euclidean space . might be the set of positive reals, the unit sphere, or something more interesting like the city of Chicago. We show a simulated and a real example in Figure 1. Our methodology is most useful for nonstandard sets like those shown in Figure 1. However, it is useful even for more common subsets like the unit square or the simplex. For these situations, standard approaches can be inadequate, either discarding correlation structure (e.g. mixtures of products of betas on the unit square) or imposing strong edge-effects (such as models involving transformed multivariate Gaussians). Our methodology provides a simple modeling framework for a wide variety of constraint sets, with the practitioner only requiring an indicator function for that set.
We model the observations as i.i.d. draws from a probability distribution
whose support equals to. Our goal is to estimate
from the observations , and to do so, we will take a Bayesian approach, placing a prior over and studying the resulting posterior distribution. Unfortunately, the requirement that be restricted to raises challenges for both model specification, as well as computation. For many interesting settings, the likelihood will involve an integral over , and will therefore be intractable for all but simple choices of . Posterior inference is consequently a doubly-intractable problem (Murray et al., 2006).Our approach is to regard the distribution as a restriction (and renormalization) of some other distribution on the ambient space . Thus, , where is the indicator function for , and is some standard distribution for which MCMC sampling techniques already exist. Defining this way allows sharp drops in the probability density from inside to outside , and avoids undesirable smoothing effects at the boundaries. We restrict ourselves to subsets having positive probability under , and in practice, to subsets of having positive Lebesgue measure. Our methodology does not apply to situations where is a lower-dimensional manifold lying on a higher-dimensional Euclidean space; these require different techniques.
At a high-level, our strategy is to specify a flexible, possibly nonparametric prior over the distribution , and thus implicitly over . Placing a prior over the unconstrained distribution allows us to avail of standard Bayesian modeling tools. We treat the observations as the outcome of a rejection sampling algorithm, where we propose from and discard samples falling outside of . Given the observations , we compute the posterior distribution over , implicit in which, once more, is all information contained in the posterior over . While is a simpler object than , its conditional distribution given the observations is still not easy to sample from. In order to do this, we recognize that if we augment the observations with the rejected proposals from , then conditional inference over is straightforward. In particular, the rejected proposals (call these ) together with the observations form i.i.d. samples from the unconstrained model . This allows the use of standard MCMC methods for posterior inference over , and imputing eliminates any intractable integrals arising from the constraint set .
The question now is how to impute the rejected proposals . In Rao et al. (2016), it was shown that the set of rejected samples preceding each observation are exchangeable across different observations. Consequently, in order to reconstruct the rejected samples for any observation , one merely has to sample a new observation from the rejection sampler, and assign its rejected samples to . Concretely, this involves simulating from the proposal distribution until an acceptance, and after discarding the accepted sample, assigning the rest to . This idea was first proposed by Beskos et al. (2006), in the specific setting of parameter inference for stochastic differential equations. Repeating this for each observation in the dataset allows all discarded samples to be imputed.
To describe the overall MCMC algorithm, we parametrize both the
constrained density of interest, as well as the proposal density, by
, writing them as , and . As we describe
later, will represent the parameters
of a mixture model. We place a prior on , and given
observations , are interested in the posterior given ,
. To sample from this distribution, we simulate from
, which has as its marginal distribution.
We sample on this augmented space by repeating two Gibbs steps: 1) simulate
the rejected proposals given the parameter , and
2) update the parameter
given the rejected samples , targeting the density
.
Algorithm 1 provides the general data augmentation
scheme.
We now move from the abstract setting of the previous section to the focus of this paper, where the proposal density is a mixture model, and the parameter represents the mixing proportions , as well as the component parameters . In Bayesian settings, it is typical to place a Dirichlet prior over
, and when possible, a conjugate prior over
. Thus, when working with a mixture of Gaussians, would include the mean and covariance of the Gaussian components, and we would place a Normal-Inverse-Wishart (NIW) prior over the pairs. In nonparametric settings, is infinite-dimensional, and one might place a stochastic process prior (e.g. a Dirichlet process prior, Ferguson (1973)) over via e.g. a stick-breaking construction (Ishwaran and James, 2001). Letbe the hyperparameter for the prior over the component parameters, and
be the hyperparameter for the distribution over weights. We write the latter as a Dir, though it could be either the Dirichlet or a stick-breaking prior. Then, for the case of Gaussian likelihoods, the prior over parameters can be written as follows:(1) |
In the first approach we consider, our proposal distribution
is a simple unconstrained mixture model; for concreteness, we use a mixture of multivariate Gaussian distributions. Assume
components, each with its own mean and covariance matrix . Then the proposal has the form(2) |
The parameter consists of the mixing distribution , and the component parameters . In nonparametric settings with a Dirichlet process prior, is infinity, and the proposal distribution becomes a Dirichlet process mixture of Gaussians (Lo, 1984; Escobar and West, 1995; Rasmussen, 2000). We assume our constraint set is some subset of with nonzero Lebesgue measure, so that the observations follow a density equal to , truncated to and renormalized. We call this distribution a truncated mixture of Gaussians (TMoG):
(3) |
To simulate observations from this model, we make proposals from equation (2) until one lies in the constraint set . We note that since proposals are made independently from the mixture distribution of equation (2), rejected proposals preceding the same observation need not belong to the same cluster. Figure 2 describes the generative process of this model in more detail.
Given observations , MCMC inference over the parameters involve first imputing the rejected proposals conditioned on the parameters, and then updating the parameters given these imputed variables. Having updated the parameters, we discard the rejected samples and repeat the process. We describe the steps of the overall Gibbs sampling algorithm below.
To impute the rejected proposals given and , we follow steps 1 to 6 of Algorithm 1, proposing from the mixture of normals (equation (2)) to generate a pseudo-dataset of the same size as , and keeping all the rejected proposals generated along the way. Call these , and write for the total number of elements in . Each element lies outside the constraint set , and is assigned an associated cluster along the way. Write for the set of cluster assignments of the imputed proposals. The sets and , together with the observations will be used to update the parameters , as well as the cluster assignments of the observations (write these as ). The joint probability distribution of the model is
(4) |
Updating and the ’s is now straightforward as described next.
Write and for the total number of observations and rejected samples (respectively) in cluster . These are easily calculated from and . For a Dirichlet distribution prior over , the Gibbs conditional over takes the simple form
(5) |
For a nonparametric stick-breaking prior over , the conditional update for is a simple adaptation of standard methodology (such as in Ishwaran and James (2001)). The key point implied by equation (4) for any prior over is to include both observations and rejected samples in the cluster counts.
Use to represent quantities calculated after excluding observation . Then the conditional distribution for , the cluster assignment of observation , is given by
(6) |
Conditioned on , all the ’s can be updated independently. It is also possible to marginalize out , and update ’s sequentially. In this case, the ’s follow a Pólya urn/Chinese restaurant process update (Neal, 2000), where once again, we must consider both the number of observations as well as the number of rejected proposals at any cluster:
(7) |
For a new cluster, we replace with the concentration parameter .
The conditional distribution for the cluster parameters depends both on observations and rejected samples assigned to that cluster. Writing and for the observations and rejected samples assigned to cluster , and for all parameters except , we have
(8) |
The parameters for all components can be updated independently, and with a conjugate Normal-Inverse-Wishart prior, this distribution is easy to sample from.
In the previous section, we modeled an unknown density on a constrained space with a truncated mixture model (in particular, a truncated mixture of Gaussians). In this section, we take a second approach, modeling the density as a mixture of truncated distributions. Again, for concreteness, we consider a mixture of truncated Gaussians. For some subset of a -dimensional Euclidean space, write for a Gaussian with mean and covariance restricted to that subset:
(9) |
A mixture of truncated Gaussians, with parameters and with mixing proportion has probability density given by:
(10) |
Unlike the truncated mixture of Gaussians which involves a single intractable normalization constant (equation (2)), the equations above show that the mixture of truncated Gaussian involves (albeit simpler) intractable normalization constants. As before, we place Dirichlet/stick-breaking priors on , and a conjugate Normal-Inverse-Wishart prior on the components parameters . The generative process then follows:
Having chosen the cluster of observation from the distribution
, the challenge now is to sample from the corresponding truncated normal distribution
. To do this, we again use rejection sampling, now proposing from the unconstrained Gaussian distribution until acceptance. Note that now, unlike with the truncated mixture of Gaussians, all rejected samples associated with an observation come from the same component as that observation. Figure 3 outlines this process in more detail. This results is subtle differences in the associated MCMC sampler, where now, rejected proposals of each observation must be imputed in a cluster-specific manner. Having imputed these, we must update cluster assignments and parameters, again in a manner slightly different from the TMoG case. As before, at the end of these updates, we discard the imputed samples, and repeat. We describe the full Gibbs sampling procedure below.Unlike TMoG, where each observation is an accepted proposal from a mixture of Gaussians, now each observation is an accepted proposal from the single Gaussian component to which it was initially assigned. Accordingly, to impute the rejected proposals for observation belonging to cluster , we repeatedly propose from component until acceptance, and keep all the rejected proposals. We now have to keep track of which rejected proposals belong to which observation, and write for the set of rejected samples preceding observation , and for the cardinality of . The joint probability density for the entire set of variables is then
(11) |
We use this to update the latent variables as described below.
Write for the number of observations assigned to component . The conditional distribution for follows a Dirichlet distribution:
(12) |
Unlike TMoG, this distribution does not involve the rejected samples at all. This is because for each observation, a cluster is chosen from only once, with all rejected proposals assigned to that same component. For any other prior over (e.g. a Dirichlet process), the conditional update rule remains unchanged from standard methodology, involving only cluster assignment counts of observations.
Unlike TMoG, updating now involves the rejected proposals asssociated with observation , and has conditional distribution
(13) |
This is a consequence of the fact that the cluster assignment is made only once, after which proposals are made from that cluster until acceptance. This also accounts for the fact that without the rejected proposals, this distribution would be proportional to and would involve the intractable normalization constant of component . Imputing the rejected proposals avoids having to calculate this quantity, though now, the associated rejected proposals are transferred along with that observation to a new cluster. If were marginalized out (e.g. under a Chinese restaurant process), the cluster update rule becomes
(14) |
Here refers to all rejected proposals except those associated with .
For the same reason as above, updating the cluster parameters also require conditioning on the imputed samples. The conditional distribution is identical to that for TMoG, and is given by
With a conjugate Normal-Inverse-Wishart prior for the Gaussian mixture model, this distribution is easy to sample from.
The MCMC schemes for the two models outlined earlier target the exact posterior distribution through a data augmentation step that imputes the rejected proposals preceding each observation. As described, this does not a priori bound the number of rejected proposals. In our experiments, we see that as the corresponding MCMC algorithm explores parameter space, a cluster will occasionally be produced under which the constrained subset
has very low probability. This could involve a Gaussian whose mean is two or more standard deviations away from
. Producing an accepted proposal from this component can require a very large number of rejected proposals. As a consequence, the MCMC algorithm will experience a significant slow down, taking a very long time to move through an MCMC iteration. This does not just increase computational overhead but also affects Markov chain mixing: a large number of rejected proposals will increase coupling between successive MCMC iterations. Furthermore, a large number of rejected samples can swamp out the observations , causing the cluster parameters to drift away from the observations and , further exacerbating the issue. The net effect is longer computation times and larger variances in the MCMC estimates. Since MoTG proposes from a particular Gaussian component until acceptance (unlike TMoG, which picks a new component for each proposal), we expect that this effect is worse for MoTG than TMoG. Our experiments confirm this.To address this, we suggest bounding the maximum number of rejected samples per observation. We refer to algorithms with such a bound as thresholded samplers, giving TMoG and MoTG with thresholding. We consider different settings of this threshold, including 1, 5, and 50 rejections per observation. This introduces asymptotic bias into the MCMC sampler. We will see that introducing bias in this manner reduces the variance in both simulation run-times as well as estimates produced, resulting in faster and (for reasonable threshold settings) better estimates. Effectively, the threshold acts to regularize the proposal distribution , limiting the amount of probability mass it assigns outside by limiting the proportion of rejected samples. While it is possible to achieve the same effect through priors over concentrated near , this is a simpler and more direct approach. Section 5 studies the trade-offs involved for different threshold settings and the performance of the algorithm in terms of time and the predictive likelihood for various simulation studies.
We apply our models and algorithms to a number of synthetic and real datasets. We study the modeling and computational trade-offs between the truncated mixture of Gaussian (TMoG) and the mixture of truncated Gaussian (MoTG) models, as well as between different threshold settings for the number of auxiliary variables in our data augmentation algorithm. In all synthetic experiments, we generate datasets with 500 observations, and evaluate performance by using of the data as training, and the remaining as held-out test data. We evaluate test-likelihood using importance sampling, verifying these numbers with numerical integration when possible. Our synthetic experiments focus on relatively simple subsets to allow such numerical evaluation.
We consider 5 different threshold values: 0, 1, 5, 50, and infinity. Here, 0 means no data augmentation, so that the data are modeled with an unconstrained mixture model that is not cognizant of truncation boundaries. We will see that this can result in inappropriately low probability at the boundaries of the constraint set. A threshold of infinity means no thresholding, corresponding to the asymptotically unbiased MCMC algorithm. The remaining settings (1, 5, and 50) limit the maximum number of rejections per observation. We repeat each experiment 100 times, plotting error bars at 25% and 75% levels, as well as the median, for both test log-likelihood and compute time. Our MCMC samplers used a total of 5000 iterations, the first 2000 of which were discarded as burn-in. In all experiments, we use a stick-breaking prior truncated to 50 components, with concentration parameter set to 1.
We start with a simple one-dimensional setting, where the constraint set
is the unit interval. This does not really require our methodology, since all normalization constants can easily be calculated numerically, or since one might choose a different modeling approach like a mixture of Beta distributions. Nevertheless, this provides a simple test case to exactly evaluate the effect of different modeling and algorithmic choices. In higher dimensions (such as the unit square or hypercube), the limitations of standard methods become more evident, and our methodology will be necessary to capture correlation structure as well as high probabilities at the edges. We consider two datasets, one drawn from a truncated Gaussian centered at the midpoint of the interval (in particular,
), and one at the edge (in particular, ). We model these datasets using TMoG and MoTG, placing on the components a Normal-Inverse-Gamma prior:(15) |
We set the mean parameter to the true mean ( or ), and set parameters and to , and respectively.
with varying thresholds on the number of rejected proposals. The performance plot shows the 25% and 75% quartiles along x- and y-axes.
We present the results for MoTG on the first dataset in Figure 4 (the results of TMoG are almost identical). The left panel plots the posterior mean density given the observations for all threshold settings, and we can see no noticeable differences here. The right-panel quantifies this, plotting the log-likelihood of the test dataset (on the y-axis) against the time taken to run 100 iterations. Again, we see no differences in the predictive log-likelihood or run-times.
The lack of sensitivity thresholding stems from the simplicity of the problem, where the true data generation mechanism involves a Gaussian density, most of whose mass is restricted to within the constraint set. Our algorithm recovers this fact, so that the inferred proposal distribution closely approximates the generating Gaussian, resulting in the number of rejected samples rarely hitting even the smallest threshold level.
The second dataset presents a more challenging problem, with the mode located at the boundary. Now, thresholding does play a role, with the fit becoming increasingly accurate as rejected proposals are incorporated. Figure 5 shows posterior mean density for MoTG (to the left), and TMoG (to the right). With low thresholds (and especially when equal to 0) the posterior mean experiences a clear shift away from the left boundary; this is despite the fact that we use a flexible mixture of Gaussians to model this distribution. Essentially, the model struggles to reconcile the abrupt change in the number of observations across the boundary, and compromises by smoothing across the boundary, resulting in a moderate (rather than high) density at the edge. While it might be possible to try to get around this using a prior allowing very peaked components, this will not account for the smoothness of the density inside the interval. This emphasizes the importance of accounting for boundary effects in modeling constrained data. Overall TMoG appears to produce a better fit that MoTG.
Figure 6 presents the speed-accuracy trade-off for different threshold settings. The left and middle plots show results for MoTG and TMoG respectively, and interestingly, the qualitative degradation resulting from low threshold settings does not manifest itself quantitatively, with no statistically significant differences among the different settings for both models. This is partly because not enough test observations lie close to the edge to significantly affect the log-likelihood. We will see such effects later with sharper densities and higher-dimensional settings. If however we bias the test dataset to favor observations near the edges, we see a clear drop in performance when the threshold is set to . Here, we selected the test set from observations in the interval .
In terms of efficiency, we see a clear advantage to using lower thresholds. We also find that TMoG runs significantly faster than MoTG. This is a result of the generative process of MoTG, where, having picked a cluster, one must repeatedly sample from it until an acceptance. When the chosen cluster assigns low probability to , we get a large number of rejections before acceptance. This can also initiate a runaway event, with the rejected proposals (that lie outside ) drawing the cluster away from while increasing its mixture selection probability, resulting in even more rejections. Figure 7 demonstrates this by plotting histograms of the number of rejected proposals for different threshold settings. When this threshold is infinite, the number of rejected samples is more than an order of magnitude larger for MoTG.
Our next experiment, while still relatively simple, considers the situation when the true density lies outside the model class. Again we use the two models TMoG and MoTG, now with the Beta on the interval being the true true density. Observe that this setting of parameters results in sharp modes at each end of the unit interval.
Figure 8 plots the posterior mean densities for both MoTG and TMoG. Similar to the previous experiment, we see that if the truncation is too strong, the model fails to adequately capture the peaks at the boundaries of the interval. For both MoTG and TMoG, predictive performance improves as more and more rejected samples are incorporated. This is clearly demonstrated in Figure 9, which shows a trend of test log-likelihood improving with threshold for both models. As before, this accuracy comes at the price of greater computational expense, with larger thresholds having longer run-times. Similar to the previous experiment, each TMoG setting is significantly more efficient than its MoTG counterpart.
Our final synthetic experiment considers data on the two-dimensional plane. Our constraint set is the unit square , with data generated from a mixture of two Gaussians, one centered at and the other diagonally across at . We generate 1000 observations from this model, using as training and as test. For both TMoG and MoTG, we place Normal-Inverse-Wishart priors on the components:
(16) |
The parameters are set as follows: , , , and , where
is the 2-dimensional identity matrix. We show contour plots of the estimated densities for TMoG for different threshold settings in Figure
10. Visually, there is no obvious qualitative difference between these plots, though Figure 11, plotting test log-likelihood against compute time for different threshold settings, tells a more complicated story. As before, we see that TMoG is significantly faster than MToG, now by almost two orders of magnitude. Additionally, increasing the threshold parameter results in an increase in the run time. For TMoG, performance with a threshold of is significantly worse than other settings, though larger thresholds do not result in any significant improvement in performance. Interestingly, for MoTG, large settings of the threshold parameter result in a decrease in test-likelihood. This illustrates the importance of controlling variance through the introduction of asymptotic bias in the MCMC algorithm. The effect is more noticeable in two-dimensions than in one because the probability of rejection increases with dimension. A threshold of 5 to 10 represents a reasonable compromise between too much bias and too much variance. This effect is much less noticeable for TMoG (and not statistically significant), see Figure 11(right).The first real dataset we consider involves acute graft-versus-host disease (GvHD) flow-cytometry measurements from patients subject to bone-marrow transplants (Brinkman et al., 2007). GvHD is a serious complication that can arise after receiving stem cells, when transplanted donor T-cells (“the graft”) attack the patient’s healthy organs (“the host”). The dataset from Brinkman et al. (2007) contains 6809 “control” and 9083 “positive” observations from 31 patients. We focus on the control observations, which are from patients who did not develop either acute or chronic GvHD after the transplant. Each observation involves measurements of 4 activation markers: CD4, CD8b, CD3, and CD8, with each measurement varying between 0 and 1024. The data collection process is such that observations outside this range are discarded, resulting in a sharp drop in intensity outside this set (see Figure 12 for projections of the raw data onto different two-dimensional planes). We scale the data to lie in the four-dimensional unit hypercube, which forms our constraint set .
This dataset was briefly considered in Rao et al. (2016), where the authors demonstrated the feasibility of MCMC inference for TMoG. Here we analyze it more systematically, evaluating the performance of our two models, as well as different threshold settings. As before, in our mixture models we use the Normal-Inverse-Wishart as the prior, with parameters: , , , and . is the four-dimensional identity matrix.
Figure 12 shows contour plots of the predictive density of the models for threshold 0, 1, 5, and 50, while Figure 13 shows how thresholding trades-off between predictive performance and compute time. We do not include the exact MCMC sampler (where the threshold is set to infinity), since for MoTG, this occasionally produced a very large number of rejections. This effect is also visible for smaller thresholds, and just like the previous experiment with bivariate Gaussians, MoTG with threshold greater than 5 performs worse than for lower thresholds. In the contour plots of Figure 12, this manifests itself in the loss of multimodal structure, while Figure 13 shows a steep drop in test log-likelihood for large thresholds. This reiterates the point that even though a large threshold reduces bias, it can significantly increase variance, and that MoTG is particularly sensitive to this. As we described earlier, a significant driver of this large variance is poor mixing due to the large number of auxiliary variables. Figure 14 shows MCMC traceplots of the number of rejected samples over MCMC iterations for MoTG, and we see that other than for the low threshold, the MCMC chain mixes very poorly. Consequently, we do not recommended MoTG as a model for data constrained to sets of even moderate complexity, or for problems of more than a few dimensions.
The TMoG models show much better fits than their MoTG counterparts; this is clear qualitatively from Figure 12 and quantitatively from Figure 13. We do not see any significant changes in performance with increasing threshold (although there is improvement in median performance). As before, this is because a large fraction of test samples lie away from the edges. The rightmost subplot in Figure 13 focuses on performance at the boundaries, constructing the test set from observations within of an edge (this constitutes more than 5% of the entire dataset). Now we see a clear and statistically significant improvement with threshold value. For all threshold settings, TMoG demonstrates reasonable mixing (like the left-most plot in Figure 14), with autocorrelations not extending beyond 10-15 iterations.
Finally, we consider measurements of criminal activity recorded in the city of Chicago. We gathered data from the city of Chicago data-portal^{1}^{1}1https://data.cityofchicago.org/Public-Safety, and plot it in Figure 1. Each recording in this dataset includes details such as case number, time, type of crime, as well as the longitude and latitude of the crime location. We restrict ourselves to modeling locations of homicide crimes occuring from the years 2012 to 2017, resulting in 3220 observations. Thus, each observation is a measurement in a two-dimensional space where the longitude and latitude are the x- and and y-coordinates respectively. The range of longitude values is from to , while latitudes range from to . We rescaled these values to between and .
Within the 2-dimensional Euclidean space, our constraint set is the interior of the city of Chicago. To characterize this complex, non-convex set, we gathered data for the boundaries of the 77 neighborhood limits of Chicago^{2}^{2}2https://www.cityofchicago.org/city/en/depts/doit/provdrs/gis.html, and approximated each neighborhood with a polygon using the spatial polygon package, ^{3}^{3}3https://cran.r-project.org/web/packages/sp/. Combined together, these polygons formed the entire city limits. The package also allows to check whether a point lies inside a polygon. We used this function in our rejection sampling algorithm, to decide whether or not a proposal on lies within Chicago.
Below, we present results from modeling this data using TMoG. MoTG, as indicated by our previous experiments, takes much longer to run and produces results that are much worse, hence we do not include it here. For our prior over cluster parameters, we used a two-dimensional Normal-Inverse-Wishart distribution with parameters , , , and . Figure 15 visualizes the posterior distribution through samples from the MCMC algorithm for different settings of the threshold parameter. Each subplot shows the log of the posterior mean density for TMoG given the crime data, with the grey contour lines showing the estimated proposal distribution , and the black lines highlighting them within the Chicago city limits. The latter gives (up to a proportionality constant) the density of interest, whose properties can easily be estimated from the MCMC samples.
For all threshold settings, we observe two modes, one to the south of Chicago, and one to the west. Larger threshold values are however able to capture many details near the boundary that are otherwise lost. When the threshold is set to , the range of density values near the boundary is smaller, being flattened to account for the absence of observations just outside the border. Further, there is a sharp cluster of observations right on the north-east boundary of Chicago which is lost for the threshold settings and . For larger threshold settings, the rejected proposals produce a significant cluster most of whose mass lies outside the city limits, but which overlaps with the city to allow a bump in probability at the corner. Edge smoothing-effects due to the constraint also result in coarser estimates at the mode to the west of the city.
An interesting phenomenon is the mild structure in the density contours away from the city boundaries. These are transients, resulting from the data-augmentation interacting with the thresholding. These do not affect inferences over the subset of interest, and are not relevant to the main estimation problem. Nevertheless, these represent an inefficiency in the thresholding procedure, and a waste of computational resources. A future research direction is to favor thresholding away from the subset of interest.
Figure 16 quantifies the effect of threshold settings, plotting predictive performance of TMoG on held-out test data versus threshold. The left plot gives results for test data randomly sampled from the original dataset, and we see a slight, but not significant performance hit with no augmentation. The right subplot presents results when the test dataset is drawn from observations in Chicago neighborhoods touching the boundary. Now we see a significant loss of predictive performance without data-augmentation, providing quantitative justification of the importance of our data-augmentation scheme to accurately model probability structure near the edges of the constraint set.
As far as run-time is concerned, we see that a threshold of is much faster that other settings, and that the average run-time does not increase significantly for larger thresholds. In fact, the relative inefficiency of these settings has little to do with our data-augmentation scheme. For instance, our recommended setting of TMoG, with threshold set to , produces on average around 1250 rejected proposals, which is less than a 50% increase in the original dataset. Instead, the increased run-time reflects our relatively crude approach to deciding if a proposal lies in the Chicago city limit. Our implementation, using the R package SP, needs to check that a proposal does not lie in any of the Chicago neighborhoods before we can reject it. This accounts for the bulk of the run-time for non-zero thresholds (the vanilla mixture model without data-augmentation is unaware of the borders of the city and therefore does not include such checks). With a more careful implementation of the Chicago border, we can reduce this overhead.
We proposed two approaches to modeling data lying within a contrained space: the truncated mixtures of Gaussians (TMoG) and mixtures of truncated Gaussians (MoTG). Our methodology is most useful for nontrivial constraint sets, such as the boundaries of a city, though it is also useful in simpler settings like the simplex, or the unit disc, when existing mixture models are not flexible enough to represent rich correlation structure or to capture sharp changes in probability across boundaries. Implementation-wise, our code only requires an indicator function for the subset of interest.
By characterizing both our models through a rejection sampling scheme, we propose exact MCMC samplers that avoid computing intractable normalizing constants. We also study computational approximations that limit the number of rejections, allowing bias-variance trade-off in performance. Finally, we show that TMoG is significantly faster and performs better than MoTG in all our experiments as well as in real data applications.
Future studies can extend our work to continuous state space models with restricted domains, and to a more complex constraint sets like manifolds. Our thresholding scheme serves to regularize the proposal distribution, limiting how much mass it assigns outside the constraint set. It is interesting to look at more refined approaches to this, such as increasing the likelihood of truncation with distance from the constraint set. Finally, it is of interest to better theoretically understand the trade-offs involved in our computational approximation scheme, both statistically and computationally.
Lifetime dependence modelling using a truncated multivariate gamma distribution.
Insurance: Mathematics and Economics, 52(3):542–549.Bias due to left truncation and left censoring in longitudinal studies of developmental and disease processes.
American Journal of Epidemiology, 173:1078–84.Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06)
, pages 359–366. AUAI Press.
Comments
There are no comments yet.