Advancing protein language models with linguistics: a roadmap for improved interpretability

07/03/2022
by Mai Ha Vu, et al.

Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, because they are largely black-box models and thus difficult to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, which hinders rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid in building more interpretable protein LMs that have learned relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs than in natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Combining linguistics with protein LMs enables the development of next-generation interpretable machine learning models with the potential to uncover the biological mechanisms underlying sequence-function relationships.


