Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis

12/06/2020
by Nabil Ibtehaz, et al.

Background: The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Motivation: Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in the domain of Natural Language Processing. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and attempted to incorporate some biological insights into it. Results: We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
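As a rough illustration of the setting the abstract describes, the sketch below shows how a protein sequence might be tokenized into k-mers and turned into the (center, context) training pairs that a standard Skip-gram model consumes. This is not the Align-gram method itself, whose details are not given here. The function names `kmers` and `skipgram_pairs`, the overlapping stride-1 tokenization, the choice k = 3, the window size, and the toy sequence are all assumptions made for illustration.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1 (an assumed scheme)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs from tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy amino-acid string, chosen only for illustration.
tokens = kmers("MKVLAAG", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs[:3])
```

A plain Skip-gram model trained on such pairs learns embeddings from co-occurrence alone; the abstract's stated aim is to additionally place similar k-mers close together in the vector space.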


