Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets

01/09/2020
by Hayda Almeida, et al.

Fungal Biosynthetic Gene Clusters (BGCs) of secondary metabolites are clusters of genes capable of producing natural products, which play an important role in the production of a wide variety of bioactive compounds, including antibiotics and other pharmaceuticals. Identifying BGCs can lead to the discovery of novel natural products that benefit human health. Previous work has focused on developing automatic tools to support BGC discovery in plants, fungi, and bacteria. Data-driven methods, as well as probabilistic and supervised learning methods, have been explored for identifying BGCs. Most methods applied to fungal BGCs were data-driven and limited in scope. Supervised learning methods have been shown to perform well at identifying BGCs in bacteria and could be well suited to the same task in fungi. However, supervised learning requires labeled data instances, and openly accessible BGC databases contain only a very small portion of previously curated fungal BGCs. Making new fungal BGC datasets available could motivate the development of supervised learning methods for fungal BGCs and potentially improve prediction performance compared to data-driven methods. In this work, we propose new publicly available fungal BGC datasets to support the BGC discovery task using supervised learning. These datasets are prepared for binary classification, to predict candidate BGC regions in fungal genomes. In addition, we analyse the performance of a well-supported supervised learning tool developed to predict BGCs.
