Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis

by Nabil Ibtehaz, et al.

Background: The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences.

Motivation: Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in Natural Language Processing. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are often applied to various biological problems. In this study, we investigate the applicability of the popular Skip-gram model to protein sequence analysis and attempt to incorporate some biological insights into it.

Results: We propose a novel k-mer embedding scheme, Align-gram, which maps similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram are better at aiding the modeling and training of deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
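To make the k-mer and Skip-gram terminology concrete, here is a minimal sketch of how a protein sequence might be split into k-mers and turned into the (center, context) training pairs that a standard Skip-gram model consumes. This illustrates the generic Skip-gram setup only, not the Align-gram method itself. The overlapping stride-1 tokenization, the values of `k` and `window`, the example sequence, and the helper names `kmers` and `skipgram_pairs` are all assumptions made for illustration; none of them come from the abstract.

```python
from typing import List, Tuple


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1 (an assumed scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair each token with every neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical short amino-acid sequence, used only for illustration.
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs[:3])
```

In the plain Skip-gram setting, pairs like these would train an embedding in which k-mers that share contexts end up close together; the abstract describes Align-gram as instead aiming to place similar k-mers close to each other.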



