Resource saving taxonomy classification with k-mer distributions and machine learning

03/10/2023
by Wolfgang Fuhl, et al.

Modern high-throughput sequencing technologies such as metagenomic sequencing generate millions of sequences, which have to be classified by taxonomic rank. Current approaches either apply local alignment against existing databases, as MMseqs2 does, or use deep neural networks, as in DeepMicrobes and BERTax. Alignment-based approaches are costly in terms of runtime, especially as databases keep growing. Deep learning-based approaches require specialized hardware for their computation and consume large amounts of energy. In this paper, we propose to use k-mer distributions obtained from DNA as features to classify its taxonomic origin using machine learning approaches such as the subspace k-nearest neighbors algorithm, neural networks, or bagged decision trees. In addition, we propose a feature space data set balancing approach, which reduces the size of the training data set and improves the performance of the classifiers. By comparing the performance, time, and memory consumption of our approach with those of state-of-the-art algorithms (BERTax and MMseqs2) on several data sets, we show that our approach improves classification at the genus level and achieves comparable results at the superkingdom and phylum levels. Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=
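To make the feature idea in the abstract concrete, the following is a minimal sketch of how a k-mer distribution could be computed from a DNA sequence. It is not the authors' implementation: the function name `kmer_distribution`, the default `k = 3`, the use of overlapping windows, and the choice to skip windows containing symbols other than A, C, G, T are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_distribution(sequence: str, k: int = 3) -> list[float]:
    """Return relative frequencies of all 4**k DNA k-mers in lexicographic order.

    Hypothetical helper for illustration; the paper's exact feature
    definition may differ.
    """
    sequence = sequence.upper()
    # Fixed vocabulary so that every sequence maps to a vector of equal length.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count all overlapping windows of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with ambiguous symbols (e.g. N) are not in the
    # vocabulary, so they do not contribute to the total.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


if __name__ == "__main__":
    features = kmer_distribution("ACGTACGTTAGC", k=2)
    print(len(features))  # one entry per possible 2-mer
```

The resulting fixed-length vector could then be passed to any standard classifier, such as the k-nearest neighbors, neural network, or bagged decision tree models named in the abstract.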


