Bayesian Approaches for Flexible and Informative Clustering of Microbiome Data
We propose two unsupervised clustering methods designed for human microbiome data. Existing clustering approaches do not fully address the challenges of microbiome data, which are typically structured as counts with a fixed-sum constraint. In addition to accounting for this structure, we recognize that high-dimensional microbiome datasets often contain uninformative features, or "noise" operational taxonomic units (OTUs), that hinder successful clustering. To address this challenge, we select the features that are useful in differentiating groups as part of the clustering process. By taking a Bayesian modeling approach, we are able to learn the number of clusters from the data rather than fixing it up front. We first describe a basic version of the model, which uses Dirichlet-multinomial distributions as mixture components and requires no additional information on the OTUs. When phylogenetic or taxonomic information is available, however, we rely on Dirichlet-tree multinomial distributions, which capture the tree-based topological structure of microbiome data. We test the performance of our methods through simulation and illustrate their application, first to gut microbiome data of children from different regions of the world, and then to a clinical study exploring differences in the microbiome between long-term and short-term pancreatic cancer survivors. Our results demonstrate that the proposed methods have performance advantages over commonly used unsupervised clustering algorithms, with the additional scientific benefit of identifying informative features.
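To make the mixture component concrete, the following is a minimal sketch of the standard Dirichlet-multinomial log-probability for a single sample of OTU counts. It is not the authors' implementation: the function name `dirichlet_multinomial_logpmf`, the example counts, and the two parameter vectors are hypothetical and chosen only for illustration. The feature selection step and the inference over the number of clusters described above are not shown.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer OTU counts for one sample.
    alpha:  positive concentration parameters, one per OTU.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)   # total reads in the sample (the fixed sum)
    a = sum(alpha)    # total concentration
    # Terms that depend only on the totals.
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    # Per-OTU terms.
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(x + 1) - lgamma(al)
    return log_p


# Hypothetical example: score one sample under two candidate clusters.
sample = [30, 5, 1, 4]
cluster_a = [6.0, 1.0, 0.5, 1.0]   # assumed parameters, favours OTU 1
cluster_b = [1.0, 1.0, 6.0, 1.0]   # assumed parameters, favours OTU 3

score_a = dirichlet_multinomial_logpmf(sample, cluster_a)
score_b = dirichlet_multinomial_logpmf(sample, cluster_b)
print("log p(sample | cluster A) =", score_a)
print("log p(sample | cluster B) =", score_b)
```

In a mixture model, per-cluster scores like these are combined with the cluster weights to give the probability that a sample belongs to each cluster. With all concentration parameters equal to 1, the distribution is uniform over all count vectors with the same total, which gives a quick sanity check.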