ProtVec: A Continuous Distributed Representation of Biological Sequences

03/17/2015
by Ehsaneddin Asgari, et al.

We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences. This representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined.
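
The abstract states that each protein sequence is mapped to a single dense n-dimensional vector but does not spell out the construction. As an illustration only, the sketch below assumes one common scheme: split the sequence into overlapping 3-residue "words" (k-mers), look up a learned vector for each, and sum them. The function names (`split_kmers`, `toy_embeddings`, `embed_sequence`), the choice of k = 3, and the random toy vectors are all assumptions made for this example; in practice the k-mer vectors would come from a neural embedding model trained on a large sequence corpus.

```python
import random


def split_kmers(sequence, k=3):
    """Return k lists of non-overlapping k-mers, one per reading offset.

    Assumption for illustration: these lists play the role of
    "sentences" when training a word-embedding model on proteins.
    """
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


def toy_embeddings(kmers, dim=4, seed=0):
    """Assign a random vector to each distinct k-mer.

    Hypothetical stand-in for trained embeddings; the values carry
    no biological meaning.
    """
    rng = random.Random(seed)
    return {
        kmer: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for kmer in sorted(set(kmers))
    }


def embed_sequence(sequence, embeddings, k=3, dim=4):
    """Sum the vectors of all overlapping k-mers in the sequence.

    K-mers missing from the embedding table are skipped.
    """
    vector = [0.0] * dim
    for i in range(len(sequence) - k + 1):
        kmer_vector = embeddings.get(sequence[i:i + k])
        if kmer_vector is None:
            continue
        for j, value in enumerate(kmer_vector):
            vector[j] += value
    return vector


if __name__ == "__main__":
    seq = "MKTAYIAK"  # made-up amino-acid string, not a real protein
    sentences = split_kmers(seq)
    table = toy_embeddings([kmer for s in sentences for kmer in s])
    print(sentences)
    print(embed_sequence(seq, table))
```

The resulting fixed-length vectors could then be passed to any standard classifier, such as the support vector machines mentioned in the abstract.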


