Machine learning applications to DNA subsequence and restriction site analysis

by Ethan J. Moyer, et al.

Based on the BioBricks standard, restriction synthesis is a novel catabolic, iterative DNA synthesis method that uses endonucleases to synthesize a query sequence from a reference sequence. In this work, the reference sequence is built from shorter subsequences, each classified as applicable or inapplicable for the synthesis method using three machine learning methods: Support Vector Machines (SVMs), random forest, and Convolutional Neural Networks (CNNs). Before these methods are applied to the data, a series of feature selection, curation, and reduction steps creates an accurate and representative feature space. Following these preprocessing steps, three pipelines are proposed to classify subsequences based on their nucleotide sequence and other relevant features corresponding to the restriction sites of over 200 endonucleases. The sensitivity using SVMs, random forest, and CNNs is 94.9 [...] scores lower in specificity, with SVMs, random forest, and CNNs resulting in 77.4 [...] the misclassifications in SVMs and CNNs are investigated. Across these two models, different features with a derived nucleotide specificity visually contribute more to classification than other features. This observation is an important factor when considering new nucleotide sensitivity features for future studies.
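To make the abstract's terms concrete, the following is a minimal sketch of how a subsequence might be turned into features and how sensitivity and specificity are computed for a binary applicable/inapplicable classifier. It is not the authors' pipeline: the function names, the one-hot encoding, the site-count features, and the four-enzyme table (the BioBricks assembly enzymes, standing in for the more than 200 endonucleases used in the paper) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical, tiny site table (assumption): recognition sequences of the four
# BioBricks assembly enzymes. The paper draws on over 200 endonucleases.
RESTRICTION_SITES: Dict[str, str] = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
}


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as one 4-element row per base, in A, C, G, T order."""
    alphabet = "ACGT"
    return [[1 if base == letter else 0 for letter in alphabet]
            for base in seq.upper()]


def site_counts(seq: str, sites: Dict[str, str] = RESTRICTION_SITES) -> Dict[str, int]:
    """Count occurrences (overlaps included) of each recognition site in seq."""
    seq = seq.upper()
    counts: Dict[str, int] = {}
    for name, site in sites.items():
        width = len(site)
        counts[name] = sum(
            1 for i in range(len(seq) - width + 1) if seq[i:i + width] == site
        )
    return counts


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels where 1 means 'applicable'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

A higher sensitivity than specificity, as the abstract reports for all three models, would mean the classifiers miss few applicable subsequences but more often label inapplicable ones as applicable.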




