Evaluating Protein Transfer Learning with TAPE

06/19/2019
by Roshan Rao, et al.

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
