LinDA: Linear Models for Differential Abundance Analysis of Microbiome Compositional Data
One fundamental statistical task in microbiome data analysis is differential abundance analysis, which aims to identify microbial taxa whose abundance covaries with a variable of interest. Although the primary interest lies in changes in absolute abundance, i.e., the number of microbial cells per unit area or volume at an ecological site such as the human gut, the data from a sequencing experiment reflect only the relative abundances of taxa in a sample. Microbiome data are therefore compositional in nature. Analysis of such compositional data is challenging because a change in the absolute abundance of one taxon leads to changes in the relative abundances of the other taxa, which makes false positive control difficult. Here we present a simple yet robust and highly scalable approach to tackle compositional effects in differential abundance analysis. The method requires only the application of established statistical tools. It fits linear regression models on the centered log-ratio transformed data, identifies a bias term due to the transformation and the compositional effect, and corrects the bias using the mode of the regression coefficients. Owing to its algorithmic simplicity, our method is 100-1000 times faster than the state-of-the-art method ANCOM-BC. Under mild assumptions, we prove its asymptotic FDR control property, making it the first differential abundance method that enjoys a theoretical FDR control guarantee. The proposed method is very flexible and can be extended to mixed-effect models for the analysis of correlated microbiome data. Using comprehensive simulations and real data applications, we demonstrate that our method has the best overall performance in terms of FDR control and power among the competing methods. We implemented the proposed method in the R package LinDA (https://github.com/zhouhj1994/LinDA).
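The three steps named in the abstract (centered log-ratio transform, per-taxon linear regression, bias correction by the mode of the coefficients) can be illustrated with a minimal Python sketch. This is not the LinDA package or its API. All function names, the toy data, the optional pseudocount, and the shortest-half-interval mode estimator are assumptions made for illustration, and the hypothesis testing and FDR adjustment steps are omitted.

```python
import math

def clr(abundances, pseudocount=0.0):
    """Centered log-ratio transform of one sample (pseudocount is an assumed option for zeros)."""
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x, with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def shortest_half_mode(values):
    """Stand-in mode estimator (an assumption): midpoint of the shortest
    interval that covers just over half of the sorted values."""
    v = sorted(values)
    h = len(v) // 2 + 1  # number of points per candidate window
    start = min(range(len(v) - h + 1), key=lambda i: v[i + h - 1] - v[i])
    return 0.5 * (v[start] + v[start + h - 1])

def bias_corrected_slopes(samples, covariate):
    """Regress each taxon's CLR values on the covariate, then subtract the
    mode of the raw slopes as the estimated compositional bias."""
    clr_rows = [clr(row) for row in samples]
    n_taxa = len(samples[0])
    raw = [ols_slope(covariate, [row[j] for row in clr_rows]) for j in range(n_taxa)]
    bias = shortest_half_mode(raw)
    return [b - bias for b in raw], bias

# Hypothetical noise-free toy data: 20 taxa, 10 samples, binary covariate.
# Only taxon 0 changes in absolute abundance (8-fold increase when x = 1).
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
base = [10.0 + 3.0 * j for j in range(20)]
depth = [1.0, 2.0, 0.5, 1.5, 3.0, 0.8, 1.2, 2.5, 0.7, 1.0]  # arbitrary per-sample scaling
samples = [
    [depth[i] * base[j] * (8.0 if (j == 0 and x[i] == 1) else 1.0) for j in range(20)]
    for i in range(10)
]

corrected, bias = bias_corrected_slopes(samples, x)
print("estimated bias:", bias)
print("corrected slope, taxon 0:", corrected[0])
print("largest |corrected slope| among other taxa:", max(abs(c) for c in corrected[1:]))
```

In this toy setting the raw CLR slopes of the 19 unchanged taxa are all shifted away from zero by the same amount, which is the compositional bias. Because most taxa are unchanged, the mode of the raw slopes sits at that shift, and subtracting it recovers a slope of log 8 for taxon 0 and approximately zero for the rest. With noisy count data the estimates would only be approximate, and a smoother mode estimator may be preferable.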