Sparse generative modeling of protein-sequence families

11/23/2020
by Pierre Barrat-Charlaix, et al.

Pairwise Potts models (PM) provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of distinct sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution, and couplings can a priori connect any pair of sites, even sites that are distant along the protein chain or in the three-dimensional protein fold. The most conservative choice for describing all of the coevolution signal is to include all possible two-site couplings in the PM. This choice, typically made by what is known as Direct Coupling Analysis, has been highly successful in using sequences to predict residue contacts in the three-dimensional structure, to predict mutational effects, and to generate new functional sequences. However, the resulting PM suffers from important over-fitting effects: many couplings are small, noisy and hardly interpretable, and the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a parameter-reduction procedure based on iterative decimation of the least statistically significant couplings. We propose an information-based criterion that identifies couplings that are either weak or statistically unsupported. We show that our procedure allows one to remove more than 90% of the couplings while preserving the predictive and generative properties of the original dense PM. The resulting model is far from criticality, meaning that it is more robust to noise, and its couplings are more easily interpretable.
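
As a rough illustration of the idea of decimating couplings, the sketch below evaluates a Potts energy with fields `h` and couplings `J`, and switches off a fraction of the weakest active site pairs through a boolean `mask`. Everything here is an assumption made for illustration: the function names, the array layout, the pair-level granularity of the removal, and in particular the score, which is the Frobenius norm of each coupling block and only a stand-in for the information-based criterion mentioned in the abstract (the abstract does not specify it). A real procedure would presumably also re-fit the remaining parameters between decimation steps, which this sketch omits.

```python
import numpy as np


def potts_energy(seq, h, J, mask):
    """Energy E(s) = -sum_i h[i, s_i] - sum_{i<j} mask[i, j] * J[i, j, s_i, s_j].

    seq  : sequence of integer-encoded residues, length L
    h    : fields, shape (L, q)
    J    : couplings, shape (L, L, q, q)
    mask : boolean, shape (L, L); False means the pair has been decimated
    """
    L = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            if mask[i, j]:
                energy -= J[i, j, seq[i], seq[j]]
    return energy


def decimate_once(J, mask, fraction=0.1):
    """Return a new mask with the lowest-scoring active pairs switched off.

    ASSUMPTION: the score is the Frobenius norm of the q x q coupling
    block, used only as a placeholder for a significance criterion.
    """
    L = J.shape[0]
    iu, ju = np.triu_indices(L, k=1)       # all pairs with i < j
    active = mask[iu, ju]
    iu, ju = iu[active], ju[active]        # keep only still-active pairs
    if iu.size == 0:
        return mask.copy()
    scores = np.linalg.norm(J[iu, ju], axis=(1, 2))
    n_remove = max(1, int(fraction * iu.size))
    drop = np.argsort(scores)[:n_remove]   # indices of the weakest pairs
    new_mask = mask.copy()
    new_mask[iu[drop], ju[drop]] = False
    new_mask[ju[drop], iu[drop]] = False   # keep the mask symmetric
    return new_mask
```

Calling `decimate_once` repeatedly, with a re-fit of `h` and `J` in between, would give an iterative scheme in the spirit of the one described; the stopping rule and the re-fitting method are left open here.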

