Large-scale Pretraining Improves Sample Efficiency of Active Learning based Molecule Virtual Screening

09/20/2023
by Zhonglin Cao, et al.

Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As commercially available compound collections grow exponentially to the scale of billions, brute-force virtual screening with traditional tools such as docking becomes infeasible in terms of time and computational resources. Active learning and Bayesian optimization have recently been proven to be effective methods for narrowing down the search space. An essential component of these methods is a surrogate machine learning model that is trained on a small subset of the library to predict the desired properties of compounds. An accurate model can achieve high sample efficiency by finding the most promising compounds while only a fraction of the whole library is virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and a graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50000 compounds by docking score after screening only 0.6% of a library containing 99.5 million compounds, an 8% improvement over the baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Such models can serve as a boost to the accuracy and sample efficiency of active learning based molecule virtual screening.
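
To make the workflow in the abstract concrete, the following is a minimal sketch of a pool-based active learning loop, not the authors' implementation. It rests on several assumptions: the surrogate is a toy k-nearest-neighbour average rather than a pretrained transformer or graph neural network, acquisition is purely greedy rather than a full Bayesian optimization strategy, and `true_score` is a hypothetical synthetic function standing in for a docking run. All function names and parameters are illustrative.

```python
import random


def true_score(x):
    # Hypothetical stand-in for an expensive docking run; lower is better.
    return (x[0] - 0.3) ** 2 + (x[1] + 0.2) ** 2


def surrogate_predict(x, labelled, k=5):
    # Toy surrogate (assumption): mean score of the k nearest scored compounds.
    def sq_dist(item):
        features, _ = item
        return (features[0] - x[0]) ** 2 + (features[1] - x[1]) ** 2

    nearest = sorted(labelled, key=sq_dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning_screen(pool, init_size, batch_size, rounds, seed=0):
    # Score a random initial subset, then repeatedly refit the surrogate
    # and score the batch it predicts to be best (greedy acquisition).
    rng = random.Random(seed)
    order = list(range(len(pool)))
    rng.shuffle(order)
    scored = {i: true_score(pool[i]) for i in order[:init_size]}
    for _ in range(rounds):
        labelled = [(pool[i], s) for i, s in scored.items()]
        unscored = [i for i in range(len(pool)) if i not in scored]
        unscored.sort(key=lambda i: surrogate_predict(pool[i], labelled))
        for i in unscored[:batch_size]:
            scored[i] = true_score(pool[i])
    return scored


def top_k_recall(pool, scored, k):
    # Fraction of the true top-k compounds that the loop actually scored.
    ranked = sorted(range(len(pool)), key=lambda i: true_score(pool[i]))
    return len(set(ranked[:k]) & set(scored)) / k


# Synthetic 2-D "library" of 1000 compounds (illustrative only).
pool_rng = random.Random(42)
pool = [(pool_rng.uniform(-1, 1), pool_rng.uniform(-1, 1)) for _ in range(1000)]
scored = active_learning_screen(pool, init_size=50, batch_size=50, rounds=5)
recall = top_k_recall(pool, scored, k=25)
print(f"scored {len(scored)} of {len(pool)} compounds, top-25 recall = {recall:.2f}")
```

The quantity returned by `top_k_recall` mirrors the metric quoted in the abstract: the share of the true top-scoring compounds recovered after scoring only a fraction of the library. Computing the true top-k here requires scoring the whole pool, which is only feasible in a benchmark setting where exhaustive scores are already available.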

