Feature set optimization by clustering, univariate association, Deep Machine learning omics Wide Association Study (DMWAS) for Biomarkers discovery as tested on GTEx pilot
Univariate and multivariate methods for associating genomic variations with the end phenotype or endophenotype have been widely used in genome-wide association studies. In addition to encoding the SNPs, we advocate the use of clustering as a novel method to encode structural variations (SVs) in genomes, such as deletion and insertion polymorphisms (DIPs), copy number variations (CNVs), translocations, and inversions, so that each can serve as an independent feature variable for downstream computation by artificial intelligence methods that predict the end phenotype or endophenotype. We introduce a clustering-based encoding scheme for structural variations and omics-based analysis. We conducted a complete association of all genomic variants with the phenotype using deep learning and other machine learning techniques, though other methods such as genetic algorithms can also be applied. Applying this encoding of SVs and one-hot encoding of SNPs to the GTEx V7 pilot DNA variation dataset, we obtained high accuracy with various DMWAS methods, and found logistic regression to work best for the death due to heart attack (MHHRTATT) phenotype. The genomic variants acting as feature sets were then arranged in descending order of their impact on the disease or trait phenotype, a step we call optimization, which also takes the top univariate associations into account. Variant ID P1_M_061510_3_402_P at chromosome 3, position 192063195, was found to be the most highly associated with MHHRTATT. We present here the top ten optimized genomic variant feature set for the MHHRTATT phenotypic cause of death.
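The two encodings named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or the SV attributes that are clustered, so everything here is an assumption made for illustration: SVs are represented only by their length, clusters are formed by a simple gap threshold (`max_gap`), SNP genotypes are assumed to be biallelic diploid calls, and the function names are hypothetical rather than part of DMWAS.

```python
from typing import Dict, List, Sequence

# Assumed genotype vocabulary for a biallelic, diploid SNP call.
GENOTYPES = ("0/0", "0/1", "1/1")


def one_hot_snp(genotype: str) -> List[int]:
    """Return a 3-element indicator vector for one SNP genotype."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unsupported genotype: {genotype!r}")
    return [1 if genotype == g else 0 for g in GENOTYPES]


def cluster_sv_lengths(lengths: Sequence[int], max_gap: int) -> Dict[int, int]:
    """Map each distinct SV length to a cluster label (illustrative stand-in).

    Distinct lengths are visited in ascending order; a new cluster starts
    whenever the gap to the previous length exceeds max_gap.
    """
    labels: Dict[int, int] = {}
    label = 0
    previous = None
    for length in sorted(set(lengths)):
        if previous is not None and length - previous > max_gap:
            label += 1
        labels[length] = label
        previous = length
    return labels


def encode_sample(
    snp_calls: Sequence[str],
    sv_lengths: Sequence[int],
    sv_labels: Dict[int, int],
) -> List[int]:
    """Concatenate one-hot SNP vectors with one cluster label per SV."""
    features: List[int] = []
    for call in snp_calls:
        features.extend(one_hot_snp(call))
    features.extend(sv_labels[length] for length in sv_lengths)
    return features


# Hypothetical cohort-wide SV lengths (base pairs) used to fit the clusters.
cohort_lengths = [50, 52, 55, 300, 310, 5000]
sv_labels = cluster_sv_lengths(cohort_lengths, max_gap=20)

# One hypothetical sample: two SNP calls and two SVs.
row = encode_sample(["0/1", "1/1"], [52, 5000], sv_labels)
print(row)  # [0, 1, 0, 0, 0, 1, 0, 2]
```

Each sample then becomes a fixed-length numeric row that a downstream classifier such as logistic regression could consume. The cluster label is treated here as a single integer feature; whether the original work encodes it that way is not stated in the abstract.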