A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis

04/21/2022
by   Preeti Jha, et al.
0

Genome sequencing projects are rapidly increasing the number of high-dimensional protein sequence datasets. Clustering a high-dimensional protein sequence dataset using traditional machine learning approaches poses many challenges. Many different feature extraction methods exist and are widely used. However, extracting features from millions of protein sequences becomes impractical because they are not scalable with current algorithms. Therefore, there is a need for an efficient feature extraction approach that extracts significant features. We have proposed two scalable feature extraction approaches for extracting features from huge protein sequences using Apache Spark, which are termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed huge protein sequences are used as an input in two clustering algorithms, i.e., Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (SRSIO-FCM) and Scalable Literal Fuzzy C-Means (SLFCM) for clustering. We have conducted extensive experiments on various soybean protein datasets to demonstrate the effectiveness of the proposed feature extraction methods, 60d-SPF, 6d-SCPSF, and existing feature extraction methods on SRSIO-FCM and SLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies-Bouldin index show that the proposed 60d-SPF extraction method on SRSIO-FCM and SLFCM clustering algorithms achieves significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/17/2015

ProtVec: A Continuous Distributed Representation of Biological Sequences

We introduce a new representation and feature extraction method for biol...
research
04/09/2015

Extraction of Protein Sequence Motif Information using PSO K-Means

The main objective of the paper is to find the motif information.The fun...
research
09/15/2020

Autonomous Learning of Features for Control: Experiments with Embodied and Situated Agents

As discussed in previous studies, the efficacy of evolutionary or reinfo...
research
01/28/2022

Bioinspired Cortex-based Fast Codebook Generation

A major archetype of artificial intelligence is developing algorithms fa...
research
11/17/2021

Exploring Unsupervised Learning Methods for Automated Protocol Analysis

The ability to analyse and differentiate network protocol traffic is cru...
research
06/07/2017

Time Stretch Inspired Computational Imaging

We show that dispersive propagation of light followed by phase detection...
research
02/05/2021

Scalable Robust Graph and Feature Extraction for Arbitrary Vessel Networks in Large Volumetric Datasets

Recent advances in 3D imaging technologies provide novel insights to res...

Please sign up or login with your details

Forgot password? Click here to reset