Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

02/03/2014
by Hubert Haoyang Duan, et al.

From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs), genetic variations at the DNA base pair level, collected from the Ontario Heart Genomics Study (OHGS). First, the thesis explains two commonly used supervised learning algorithms, the k-Nearest Neighbour (k-NN) and Random Forest classifiers, and includes a complete proof that the k-NN classifier is universally consistent in any finite-dimensional normed vector space. Second, the thesis introduces two dimensionality reduction steps: Random Projections, a known feature extraction technique based on the Johnson-Lindenstrauss lemma, and a new method for discrete domains termed Mass Transportation Distance (MTD) Feature Selection. Third, the thesis compares Random Projections paired with the k-NN classifier against MTD Feature Selection paired with Random Forest for predicting coronary artery disease, using accuracy, the F-Measure, and the area under the Receiver Operating Characteristic (ROC) curve as performance measures. The comparative results demonstrate that MTD Feature Selection with Random Forest is vastly superior to Random Projections with k-NN. When 3335 SNPs are selected by MTD Feature Selection for classification, the Random Forest classifier obtains an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS genetic dataset. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
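The three performance measures named in the abstract can be illustrated with a minimal sketch in plain Python. The function names and the toy labels and scores below are assumptions made up for illustration; they are not taken from the thesis or from the OHGS dataset. The sketch assumes binary labels with 1 as the positive (disease) class, takes the F-Measure to be the balanced F1 score, and computes the area under the ROC curve as the fraction of positive/negative pairs in which the positive example receives the higher score (ties counted as one half), which is a standard equivalent of the area under the curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Balanced F-Measure (F1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example is scored higher; ties contribute one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
# Hard predictions from an assumed decision threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy :", round(accuracy(y_true, y_pred), 3))
print("F-Measure:", round(f_measure(y_true, y_pred), 3))
print("ROC AUC  :", round(roc_auc(y_true, scores), 3))
```

Accuracy and the F-Measure depend on the decision threshold applied to the scores, while the area under the ROC curve depends only on how the scores rank the examples, so the same classifier can report noticeably different values for the two kinds of measure.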


