Protein language models trained on multiple sequence alignments learn phylogenetic relationships

03/29/2022
by Umberto Lupo, et al.

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple and universal combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. This could help them separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency. To test this hypothesis, we generate synthetic MSAs, either with or without phylogeny, from Potts models trained on natural MSAs. We demonstrate that unsupervised contact prediction is indeed substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
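The Hamming distances referred to above are the standard pairwise distances between aligned sequences: the fraction of alignment columns in which two sequences differ. A minimal sketch of this computation, assuming sequences of equal length in an already-aligned MSA (the helper name `pairwise_hamming` is illustrative, not from the paper):

```python
import numpy as np

def pairwise_hamming(msa):
    """Pairwise normalized Hamming distances between aligned sequences.

    Each distance is the fraction of columns where two sequences differ,
    so values lie in [0, 1]. Gap characters are treated like any other symbol.
    """
    arr = np.array([list(seq) for seq in msa])  # shape: (n_seqs, n_cols)
    n = arr.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean(arr[i] != arr[j])  # fraction of differing columns
            dist[i, j] = dist[j, i] = d
    return dist

# Toy MSA of three aligned sequences (length 5)
D = pairwise_hamming(["MKTA-", "MKSA-", "MRTAG"])
```

For "MKTA-" vs "MKSA-", one of five columns differs, giving a distance of 0.2; such a distance matrix is what the column-attention combinations in the paper are reported to correlate with.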

