Mycorrhiza: genotype assignment using phylogenetic networks

02/12/2020
by Jeremy Georges-Filteau, et al.

Motivation: The genotype assignment problem consists of predicting, from the genotype of an individual, which of a known set of populations it originated from. The problem arises in a variety of contexts, including wildlife forensics, invasive species detection, and biodiversity monitoring. Existing approaches perform well under ideal conditions but are sensitive to a variety of common violations of the assumptions they rely on.

Results: In this article, we introduce Mycorrhiza, a machine learning approach for the genotype assignment problem. Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples. Those features are then used as input to a Random Forests classifier. Classification accuracy was assessed on multiple published empirical SNP, microsatellite, or consensus sequence datasets with wide ranges of size, geographical distribution, and population structure, and on simulated datasets. Mycorrhiza compared favorably against widely used assessment tests or mixture analysis methods such as STRUCTURE and Admixture, and against another machine-learning-based approach using principal component analysis for dimensionality reduction. It yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium. Moreover, the phylogenetic network approach estimates mixture proportions with good accuracy.
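
To make the fixation index mentioned above concrete, here is a minimal sketch, not taken from the paper, of Wright's FST for a single biallelic locus, computed as (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled populations and H_S is the mean expected heterozygosity within populations. The function names and the input format (one reference-allele frequency per population) are hypothetical, and the sketch assumes equally weighted populations and Hardy-Weinberg proportions within each one.

```python
from statistics import mean


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic locus
    under Hardy-Weinberg proportions."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs: list[float]) -> float:
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic locus.

    freqs holds the reference-allele frequency in each population
    (hypothetical input format; populations are weighted equally).
    """
    if not freqs:
        raise ValueError("at least one population is required")
    p_bar = mean(freqs)                      # pooled allele frequency
    h_t = expected_heterozygosity(p_bar)     # total heterozygosity
    if h_t == 0.0:
        return 0.0                           # allele fixed everywhere
    # Mean within-population heterozygosity.
    h_s = mean(expected_heterozygosity(p) for p in freqs)
    return (h_t - h_s) / h_t
```

Identical allele frequencies across populations give a value of 0, while populations fixed for different alleles give 1. Datasets whose loci average toward the upper end of that range are the strongly structured case in which the abstract reports the largest gains.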
