Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling

01/16/2023
by Ahmed Elnaggar, et al.

As opposed to scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life, optimally. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4, surpassing state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.

research
03/11/2023

Enhancing Protein Language Models with Structure-based Encoder and Pre-training

Protein language models (PLMs) pre-trained on large-scale protein sequen...
research
01/06/2023

Conditional Generation of Paired Antibody Chain Sequences through Encoder-Decoder Language Model

Protein language models (LMs) have been successful in sequence, structur...
research
11/18/2022

Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes

Despite being self-supervised, protein language models have shown remark...
research
07/03/2022

Advancing protein language models with linguistics: a roadmap for improved interpretability

Deep neural-network-based language models (LMs) are increasingly applied...
research
12/09/2021

Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction

Protein-protein interactions (PPIs) are essentials for many biological p...
research
07/13/2020

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Computational biology and bioinformatics provide vast data gold-mines fr...
research
06/20/2023

Lingua Manga: A Generic Large Language Model Centric System for Data Curation

Data curation is a wide-ranging area which contains many critical but ti...
