ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

03/29/2023
by Youhan Lee, et al.

Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have become a promising tool for protein sequence design. In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while the other residues are kept fixed. Unfortunately, because pLMs generate left to right, existing pLMs can only modify suffix residues by prompting with prefix residues, which is insufficient for the infilling task, where the whole surrounding context (both prefix and suffix) must be considered. To find more effective pLMs for protein engineering, we design a new benchmark, Secondary structureE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on this benchmark, we reveal the weakness of existing language models and show that language models trained via the fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also demonstrate, through extensive experiments and visualizations, that ProtFIM generates protein sequences with decent protein representations.
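The fill-in-middle transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which a sequence is split into prefix, middle, and suffix, and the middle is moved to the end so that a left-to-right model sees both sides of the context before generating it. The sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the function names are hypothetical choices for illustration, not taken from ProtFIM itself.

```python
# Hypothetical sentinel tokens; the actual ProtFIM vocabulary may differ.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(sequence: str, start: int, end: int) -> str:
    """Move the middle span [start, end) to the end of the sequence.

    A causal model trained on the result conditions on both the prefix
    and the suffix before it generates the middle residues.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span must satisfy 0 <= start <= end <= len(sequence)")
    prefix = sequence[:start]
    middle = sequence[start:end]
    suffix = sequence[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_inverse(transformed: str) -> str:
    """Reassemble the original residue order from a transformed string."""
    if not transformed.startswith(PRE):
        raise ValueError("transformed sequence must start with the prefix sentinel")
    rest = transformed[len(PRE):]
    prefix, rest = rest.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # toy amino-acid sequence, for illustration only
    transformed = fim_transform(seq, 3, 7)
    print(transformed)
    print(fim_inverse(transformed))
```

At inference time, under the same assumptions, the prompt would be everything up to and including `<MID>`, and the model's continuation would be inserted back between the prefix and the suffix.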


research
08/16/2023

Atom-by-atom protein generation and beyond with language models

Protein language models learn powerful representations directly from seq...
research
06/27/2022

ProGen2: Exploring the Boundaries of Protein Language Models

Attention-based models trained on protein sequences have demonstrated in...
research
02/07/2022

Prompt-Guided Injection of Conformation to Pre-trained Protein Model

Pre-trained protein models (PTPMs) represent a protein with one fixed em...
research
12/20/2022

Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC

A long-standing goal of machine-learning-based protein engineering is to...
research
11/10/2022

Probabilistic thermal stability prediction through sparsity promoting transformer representation

Pre-trained protein language models have demonstrated significant applic...
research
06/14/2020

Autofocused oracles for model-based design

Data-driven design is making headway into a number of application areas,...
research
07/13/2020

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Computational biology and bioinformatics provide vast data gold-mines fr...
