Supervised Learning and Model Analysis with Compositional Data

05/15/2022
by Shimeng Huang, et al.

The compositionality and sparsity of high-throughput sequencing data pose a challenge for regression and classification. In microbiome research in particular, however, conditional modeling is an essential tool for investigating relationships between phenotypes and the microbiome. Existing techniques are often inadequate: they either rely on extensions of the linear log-contrast model (which adjusts for compositionality but is often unable to capture useful signals), or they are based on black-box machine learning methods (which may capture useful signals but ignore compositionality in downstream analyses). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on-par or improved predictive performance compared with state-of-the-art machine learning methods. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast models to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. Finally, we apply the KernelBiome framework to two public microbiome studies and illustrate the proposed model analysis. KernelBiome is available as an open-source Python package at https://github.com/shimenghuang/KernelBiome.
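
To make the idea of kernel-based regression on compositional data concrete, the following is a minimal sketch and not the KernelBiome API. It assumes one illustrative kernel choice (a Gaussian kernel on centered log-ratio coordinates, i.e. based on the Aitchison distance), handles zeros with a small pseudocount (an assumption that a zero-aware kernel would avoid), and fits kernel ridge regression in closed form with NumPy. All function names, parameter values, and the simulated data are hypothetical.

```python
import numpy as np


def clr(X, pseudocount=1e-6):
    """Centered log-ratio transform of row-wise compositions.

    Rows are first normalized to sum to one, then a small pseudocount
    (an assumption made here to handle zeros) is added before taking logs.
    """
    X = np.asarray(X, dtype=float)
    X = X / X.sum(axis=1, keepdims=True)
    log_x = np.log(X + pseudocount)
    return log_x - log_x.mean(axis=1, keepdims=True)


def aitchison_rbf_kernel(X, Y, gamma=0.1, pseudocount=1e-6):
    """Gaussian kernel exp(-gamma * d_A(x, y)^2) with d_A the Aitchison distance."""
    A, B = clr(X, pseudocount), clr(Y, pseudocount)
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def fit_kernel_ridge(K, y, lam=1e-2):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)


# Simulated compositions (rows sum to one) and a log-ratio response.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=60)
y = np.log(X[:, 0] / X[:, 1]) + 0.1 * rng.normal(size=60)

X_train, X_test = X[:40], X[40:]
y_train = y[:40]

K_train = aitchison_rbf_kernel(X_train, X_train)
alpha = fit_kernel_ridge(K_train, y_train)

# Predictions are kernel evaluations against the training set times alpha.
y_pred = aitchison_rbf_kernel(X_test, X_train) @ alpha
```

Because the transform normalizes each row before taking logs, rescaling a sample (for example, changing its sequencing depth) leaves the kernel values unchanged, which is the sense in which such a kernel respects compositionality. In practice `gamma` and `lam` would be chosen by cross-validation.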

research
03/31/2023

Regression and Classification of Compositional Data via a novel Supervised Log Ratio Method

Compositional data in which only the relative abundances of variables ar...
research
07/29/2021

MLMOD Package: Machine Learning Methods for Data-Driven Modeling in LAMMPS

We discuss a software package for incorporating into simulations data-dr...
research
09/11/2019

Robust Regression with Compositional Covariates

Many high-throughput sequencing data sets in biology are compositional i...
research
08/24/2021

Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions

Accurate and interpretable prediction of survey response rates is import...
research
05/02/2022

Reproducing Kernels and New Approaches in Compositional Data Analysis

Compositional data, such as human gut microbiomes, consist of non-negati...
research
09/15/2021

BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales

Algorithmic harmonization - the automated harmonization of a musical pie...
research
11/17/2021

Three approaches to supervised learning for compositional data with pairwise logratios

The common approach to compositional data analysis is to transform the d...
