Hidden Markov Pólya trees for high-dimensional distributions
The Pólya tree (PT) process is a general-purpose Bayesian nonparametric model that has found wide application in a range of inference problems. The PT has a simple analytic form and the resulting posterior computation boils down to straight-forward beta-binomial conjugate updates along a partition tree over the sample space. Recent development in PT models shows that performance of these models can be substantially improved by (i) incorporating latent state variables that characterize local features of the underlying distributions and (ii) allowing the partition tree to adapt to the structure of the underlying distribution. Despite these advances, however, some important limitations of the PT that remain include—(i) the sensitivity in the posterior inference with respect to the choice of the partition points, and (ii) the lack of computational scalability to multivariate problems beyond a small number (<10) of dimensions. We consider a modeling strategy for PT models that incorporates a very flexible prior on the partition tree along with latent states that can be first-order dependent (i.e., following a Markov process), and introduce a hybrid algorithm that combines sequential Monte Carlo (SMC) and recursive message passing for posterior inference that can readily accommodate PT models with or without latent states as well as flexible partition points in problems up to 100 dimensions. Moreover, we investigate the large sample properties of the tree structures and latent states under the posterior model. We carry out extensive numerical experiments in the context of density estimation and two-sample testing, which show that flexible partitioning can substantially improve the performance of PT models in both inference tasks. We demonstrate an application to a flow cytometry data set with 19 dimensions and over 200,000 observations.
READ FULL TEXT