Unsupervised language models for disease variant prediction

12/07/2022
by Allan Zhou, et al.

There is considerable interest in predicting the pathogenicity of protein variants in human genes. Because high-quality labels are sparse, recent approaches turn to unsupervised learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then use the likelihood they assign to a variant as a proxy for its evolutionary fitness. In this work we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Rather than training a separate model for each gene, we find that a single protein LM trained on broad sequence datasets can score the pathogenicity of any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach VELM (Variant Effect via Language Models), and show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
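The idea of scoring a variant by the likelihood a sequence model assigns to it can be illustrated with a small sketch. The abstract does not state the exact scoring rule, so the code below assumes one common choice: the log-likelihood of the mutant sequence minus that of the wild type. The `ToyBigramLM` class, the helper functions, and the training sequences are all hypothetical stand-ins invented for illustration; a real application would replace the toy model with a pretrained protein LM.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START = "^"  # start-of-sequence symbol (illustrative choice)


class ToyBigramLM:
    """Hypothetical stand-in for a pretrained protein LM (NOT the paper's model).

    It is a bigram model with add-one smoothing, used only so the
    scoring logic below can run without external dependencies.
    """

    def __init__(self, training_sequences):
        self.counts = {}
        for seq in training_sequences:
            prev = START
            for aa in seq:
                row = self.counts.setdefault(prev, {})
                row[aa] = row.get(aa, 0) + 1
                prev = aa

    def log_likelihood(self, sequence):
        """Sum of log p(residue | previous residue) over the sequence."""
        total = 0.0
        prev = START
        for aa in sequence:
            row = self.counts.get(prev, {})
            # Add-one smoothing over the 20 standard amino acids.
            prob = (row.get(aa, 0) + 1) / (sum(row.values()) + len(AMINO_ACIDS))
            total += math.log(prob)
            prev = aa
        return total


def apply_substitution(wild_type, position, new_residue):
    """Return the sequence with a single substitution (position is 1-based)."""
    if not 1 <= position <= len(wild_type):
        raise ValueError("position out of range")
    if new_residue not in AMINO_ACIDS:
        raise ValueError("unknown amino acid")
    return wild_type[: position - 1] + new_residue + wild_type[position:]


def variant_score(model, wild_type, position, new_residue):
    """Assumed scoring rule: log p(mutant) - log p(wild type).

    Under the evolutionary-proxy assumption, a strongly negative score
    suggests the variant is less 'natural' and more likely pathogenic.
    """
    mutant = apply_substitution(wild_type, position, new_residue)
    return model.log_likelihood(mutant) - model.log_likelihood(wild_type)


# Made-up sequences standing in for a broad training corpus.
toy_corpus = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
model = ToyBigramLM(toy_corpus)
wild_type = "MKTAYIAK"

score_rare = variant_score(model, wild_type, 3, "W")    # T3W, unseen in corpus
score_common = variant_score(model, wild_type, 3, "S")  # T3S, seen in corpus
print(f"T3W score: {score_rare:.3f}")
print(f"T3S score: {score_common:.3f}")
```

In this toy setting the substitution never seen in the corpus (T3W) scores lower than the one that does appear (T3S), which is the behavior the likelihood-as-fitness assumption relies on. No per-gene training or alignment is involved at scoring time: the same model scores any sequence it is given.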

