A decision theoretic approach to model evaluation in computational drug discovery

07/24/2018
by Oliver Watson, et al.

Artificial intelligence, trained via machine learning or computational statistics algorithms, holds much promise for improving small-molecule drug discovery. However, structure-activity data are high dimensional with low signal-to-noise ratios, and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs. Twenty-five publicly available molecular datasets were extracted from ChEMBL. Neural nets, random forests, support vector machines (regression) and ridge regression were then fitted to the structure-activity data. A new validation method, based on quantile splits of the activity distribution function, is proposed for the construction of training and testing sets. Model validation based on random partitioning of the available data favours models which overfit and 'memorize' the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes models which cannot extrapolate onto structurally different molecules outside of the training data. This approach favours more constrained models, namely ridge regression and support vector regression. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity of defining model optimality with respect to the decision task at hand. Model performance should be evaluated from a decision theoretic perspective with subjective loss functions. Data-splitting based on the separation of high and low activity data provides a robust methodology for determining the best extrapolating model. Simpler, traditional statistical methods such as ridge regression outperform state-of-the-art machine learning methods in this setting.
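
To make the contrast between the two partitioning schemes concrete, the following is a minimal sketch of a quantile-based split next to a random split. It is an illustration only and not the authors' implementation: the function names `quantile_split` and `random_split`, the threshold `q`, the convention that larger values mean higher activity, and the toy activity values are all assumptions introduced here.

```python
import numpy as np


def quantile_split(activity, q=0.9):
    """Split indices at the q-th quantile of the activity values.

    Assumption: larger values mean higher activity, so the test set holds
    the most active compounds and the model must extrapolate to reach them.
    """
    activity = np.asarray(activity, dtype=float)
    threshold = np.quantile(activity, q)
    train_idx = np.flatnonzero(activity < threshold)
    test_idx = np.flatnonzero(activity >= threshold)
    return train_idx, test_idx


def random_split(n, test_fraction=0.1, seed=0):
    """Conventional random partition of n samples, for comparison."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])


# Hypothetical activity values (for example pIC50-like scores) for ten compounds.
activity = np.array([4.1, 5.0, 5.3, 6.2, 6.8, 7.0, 7.4, 8.1, 8.9, 9.5])

train_idx, test_idx = quantile_split(activity, q=0.8)
rand_train_idx, rand_test_idx = random_split(len(activity), test_fraction=0.2)

# Under the quantile split every test compound is more active than every
# training compound; a random split gives no such guarantee.
print("quantile split test activities:", activity[test_idx])
print("random split test activities:  ", activity[rand_test_idx])
```

With a quantile split, a model that has only memorized the training range has nothing to interpolate from when scoring the held-out compounds, which is the property the abstract relies on when it argues that this scheme penalizes overfitting models.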


