## 1 Ligand-based Supervised Biochemical Activity Modelling

In silico supervised modelling based on ligand profiles is widely used as a complementary method in the pharmaceutical industry, specifically in the early stages of drug design and development, to obtain an indication of off-target interactions. The aim is to train a model that generalizes the biological activities of ligands, i.e. small biochemical molecules, ions, or proteins, by reusing information from former in vitro laboratory experiments. A common approach to predicting the activities of ligands on biological targets, e.g. genes, is to exploit a panel of structure-analysis models such as quantitative structure-activity relationship (QSAR) models, with one model per biological target. Then, using statistical learning, we train a classification or regression model that predicts categorical (active or inactive) or numerical values, e.g. a probability of reaction, respectively, for a new unseen ligand. The suggested strategies for training the models are the single-target and multi-target schemes. In this text, we focus on simplified single-target modelling, which is a baseline for more general approaches such as the multi-target ones mentioned above.

Several machine learning methods are typically employed for constructing such models. Currently, the most popular ones are Deep Neural Networks (DNNs) and Support Vector Machines (SVMs). Although Deep Learning (DL) has become popular in recent years, SVMs are still applicable. One of the main advantages of SVMs is that they find learning functions maximizing geometric margins, unlike the generally composed functions produced by DNNs. A benefit of this additional requirement, which is implicit in SVMs (in contrast with DNNs), is a reduced generalization error, which helps to prevent overfitting of a model and improves its robustness in terms of the bias-variance tradeoff. In light of the latter, SVMs seem to be an appropriate technique for modelling biochemical activity from ligand profiles, because the variability among samples is typically small. This arises from the nature of in vitro experiments: laboratories test the biochemical activities of quite similar ligands on a particular target more commonly than they test ligands belonging to entirely different groups.

However, SVMs have a serious drawback: they are sensitive to imbalanced datasets, outliers and multicollinearity among training samples, which could cause one group to be preferred over another. Therefore, we propose to use Platt scaling for an additional model calibration, which transforms the output of the SVM classification model into a posterior probability distribution by fitting a logistic regression model to the raw SVM prediction scores. In practice, this calibration technique is used to reduce the impact of overfitting of a predictor, which is mainly caused by the training data. After training the calibrated models, we demonstrate their balanced predictive relevance by converting the predicted probabilities to label predictions using an optimal threshold.

This text is organized as follows. In Section 2, the SVM formulations with a relaxed-bias term are presented. A calibration technique based on Platt scaling is introduced in Section 3. The PermonSVM package is described in Section 4, and numerical experiments are presented in Section 5.

## 2 Support Vector Machines for Relaxed-bias Classification

SVMs belong to the conventional machine learning techniques, and they are used in practice for both classification and regression. Unlike the architectures underlying DL, SVMs can be considered single-perceptron problems that find the learning functions maximizing the geometric margins. Therefore, we can explain the qualities of a learning model and the behaviour of the underlying solver straightforwardly. In this paper, we focus only on the linear C-SVM for classification, in which a misclassification error term is penalized by a user-specified penalty $C$. We denote the C-SVM as SVM for simplicity in the further text.

SVM was originally designed by CorVap-ML-1995 as a supervised binary classifier, i.e. a classifier that decides whether a sample falls into either Class A or Class B by means of a model determined from already categorised samples in the training phase of the classifier. Let us denote the training data as a set of ordered sample-label pairs such that

$$T := \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}, \tag{1}$$

where $m$ is the number of samples, $x_i \in \mathbb{R}^n$ is the $i$-th sample and $y_i \in \{-1, 1\}$ denotes the label of the $i$-th sample, $i \in \{1, 2, \dots, m\}$. The linear SVM solves the problem of training the classification model in the form of a so-called maximal-margin hyperplane:

$$H = \langle w, x \rangle + b = 0, \tag{2}$$

where $w$ is a normal vector of the hyperplane $H$ and $b$ determines the offset of the hyperplane from the origin along its normal vector $w$. In the case of the relaxed-bias classification, we do not consider $b$ in the classification model; however, we include it into the problem by augmenting the vector $w$ and each sample $x_i$ with an additional dimension so that $\hat{w} \leftarrow [w^T, b]^T$, $\hat{x}_i \leftarrow [x_i^T, \beta]^T$, where $\beta \in \mathbb{R}^+$ is a user-defined variable, typically set to $1$.
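The augmentation used by the relaxed-bias approach can be illustrated with a few lines of Python. This is a minimal sketch, not part of PermonSVM: the function names are hypothetical, and `beta` is an assumed name for the user-defined constant appended to each sample.

```python
def augment_samples(samples, beta=1.0):
    """Append the user-defined constant beta to every sample: x_i -> [x_i, beta]."""
    return [list(x) + [beta] for x in samples]


def split_normal(w_hat):
    """Split an augmented normal vector [w, b] into the pair (w, b)."""
    return list(w_hat[:-1]), w_hat[-1]


def decision_value(w_hat, x_hat):
    """Inner product <w_hat, x_hat>, which equals <w, x> + b * beta."""
    return sum(wj * xj for wj, xj in zip(w_hat, x_hat))


samples = [[2.0, 1.0], [-1.0, 3.0]]
augmented = augment_samples(samples, beta=1.0)
w_hat = [0.5, -1.0, 0.25]   # hypothetical augmented normal vector
w, b = split_normal(w_hat)

print(augmented)                             # [[2.0, 1.0, 1.0], [-1.0, 3.0, 1.0]]
print(decision_value(w_hat, augmented[0]))   # 0.5*2 - 1.0*1 + 0.25*1 = 0.25
```

With `beta = 1.0`, the last component of the augmented normal vector plays the role of the bias, so no separate bias variable has to be handled by the solver.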

Let $p \in \{1, 2\}$ for purposes related to our application; then the problem of finding the hyperplane $H$ can be formulated as a constrained optimization problem in the following primal formulation:

$$\underset{\hat{w},\,\xi_i}{\arg\min}\ \frac{1}{2}\langle \hat{w}, \hat{w} \rangle + \frac{C}{p}\sum_{i=1}^{m}\xi_i^p \quad \text{s.t.} \quad y_i\langle \hat{w}, \hat{x}_i\rangle \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ \ i \in \{1, 2, \dots, m\}, \tag{3}$$

where $\xi_i = \max\{0,\, 1 - y_i\langle \hat{w}, \hat{x}_i\rangle\}$ is the hinge loss function related to the augmented samples $\hat{x}_i$. Essentially, the hinge loss function quantifies the error between the predicted and the correct classification of the sample $\hat{x}_i$. The variable $C \in \mathbb{R}^+$ is a penalty parameter that penalizes the misclassification error. Generally, a higher value of $C$ increases the importance of minimising the hinge loss functions and, on the other hand, causes $\|\hat{w}\|$ to grow, i.e. the margin to narrow, which leads to poor generalization capabilities of the classification model. The goal is to find a reasonable value of $C$ such that the resulting model balances the tradeoff between robustness and performance.

Further, we can say that, in general, the minimizer $\hat{w}$ associated with formulation (3) corresponds to an optimal rotation of the separating hyperplane in a feature space that has one more dimension than the original feature space.
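To make the roles of the hinge loss and the penalty concrete, the primal objective can be evaluated directly. The following Python sketch assumes the common scaling $\frac{1}{2}\langle \hat{w}, \hat{w} \rangle + \frac{C}{p}\sum_i \xi_i^p$ with $p \in \{1, 2\}$; the function names and the toy data are hypothetical.

```python
def hinge_losses(w_hat, samples_hat, labels):
    """xi_i = max(0, 1 - y_i * <w_hat, x_hat_i>) for every augmented sample."""
    losses = []
    for x_hat, y in zip(samples_hat, labels):
        score = sum(wj * xj for wj, xj in zip(w_hat, x_hat))
        losses.append(max(0.0, 1.0 - y * score))
    return losses


def primal_objective(w_hat, samples_hat, labels, C=1.0, p=1):
    """0.5 * <w_hat, w_hat> + (C / p) * sum_i xi_i**p (assumed scaling)."""
    if p not in (1, 2):
        raise ValueError("p must be 1 or 2")
    regularization = 0.5 * sum(wj * wj for wj in w_hat)
    xi = hinge_losses(w_hat, samples_hat, labels)
    return regularization + (C / p) * sum(v ** p for v in xi)


# Toy augmented samples (last component is the appended constant 1.0).
samples_hat = [[2.0, 0.0, 1.0], [0.0, 0.0, 1.0], [-2.0, 0.0, 1.0]]
labels = [1, 1, -1]
w_hat = [1.0, 0.0, 0.0]

print(hinge_losses(w_hat, samples_hat, labels))                   # [0.0, 1.0, 0.0]
print(primal_objective(w_hat, samples_hat, labels, C=2.0, p=1))   # 0.5 + 2.0 * 1.0 = 2.5
print(primal_objective(w_hat, samples_hat, labels, C=2.0, p=2))   # 0.5 + 1.0 * 1.0 = 1.5
```

Only the second sample violates the margin, so increasing `C` raises the cost of that single violation while the regularization term stays unchanged.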

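To reduce the number of unknowns, we can dualize the primal formulation (3) using Lagrange duality so that, for $p = 1$ and $p = 2$, we obtain

$$\underset{\lambda}{\arg\min}\ \frac{1}{2}\lambda^T Y^T G Y \lambda - \lambda^T e \quad \text{s.t.} \quad o \leq \lambda \leq Ce, \tag{4}$$

$$\underset{\lambda}{\arg\min}\ \frac{1}{2}\lambda^T \left(Y^T G Y + C^{-1} I\right)\lambda - \lambda^T e \quad \text{s.t.} \quad o \leq \lambda, \tag{5}$$

respectively. Here, $G$ is the matrix of inner products called the Gramian such that $G = X^T X$, $X = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m]$ is the data matrix of the training samples, $y = [y_1, y_2, \dots, y_m]^T$ is the vector of the corresponding labels, $Y = \operatorname{diag}(y)$, and $o$, $e$ denote a zero vector and an all-ones vector, respectively. In general, the Gramian is symmetric positive semi-definite (SPS) of rank

$$\operatorname{rank}(G) = r \leq \min\{m, n + 1\}, \tag{6}$$

where $r$ is the maximum number of linearly independent augmented training samples, and $m$ and $n$ are the numbers of training samples and their features, respectively.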

Comparing the dual SVM-QP formulations, specifically the 1-loss (4) and the 2-loss (5), we can see that they differ in the forms of the related Hessians and constraints. While the Hessian is generally an SPS matrix in the case of (4), the Hessian related to the formulation (5) is regularized by means of the matrix $C^{-1} I$. This can provide a better convergence rate for (5), and the associated optimization problem could be more stable. On the other hand, the 1-loss SVM could produce a more robust model in the sense of performance score, because using a linear sum of the hinge losses $\xi_i$ leads to better handling of the outliers during the training phase of the classifier.

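Further, to obtain a solution of the original primal problem from the solution $\lambda$ of the dual problem, we introduce the dual-to-primal reconstruction formula:

$$\hat{w} = X Y \lambda. \tag{7}$$

Using the reconstructed normal vector $\hat{w}$, we can set the decision rule for an augmented sample $\hat{x}$:

$$y(\hat{x}) = \begin{cases} +1 & \text{if } \langle \hat{w}, \hat{x} \rangle \geq 0, \\ -1 & \text{otherwise.} \end{cases} \tag{8}$$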
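The path from the dual solution to a label prediction can be illustrated on a toy problem. The following Python sketch is not the PermonSVM implementation: it assumes the usual dual Hessian with entries $y_i y_j \langle \hat{x}_i, \hat{x}_j \rangle$ (plus $C^{-1}$ on the diagonal for the 2-loss), and it uses a plain projected gradient iteration as a simple stand-in for the MPRGP solver. All function names are hypothetical.

```python
def dual_hessian(samples_hat, labels, C, p):
    """Entries y_i * y_j * <x_hat_i, x_hat_j>; the 2-loss adds 1/C on the diagonal."""
    m = len(samples_hat)
    H = [[labels[i] * labels[j]
          * sum(a * b for a, b in zip(samples_hat[i], samples_hat[j]))
          for j in range(m)] for i in range(m)]
    if p == 2:
        for i in range(m):
            H[i][i] += 1.0 / C
    return H


def solve_dual(H, C, p, steps=5000):
    """Projected gradient for min 0.5*l'Hl - l'e with 0 <= l (<= C for the 1-loss)."""
    m = len(H)
    # The maximal absolute row sum bounds the largest eigenvalue from above.
    step = 1.0 / max(sum(abs(v) for v in row) for row in H)
    upper = C if p == 1 else float("inf")
    lam = [0.0] * m
    for _ in range(steps):
        grad = [sum(H[i][j] * lam[j] for j in range(m)) - 1.0 for i in range(m)]
        lam = [min(upper, max(0.0, lam[i] - step * grad[i])) for i in range(m)]
    return lam


def reconstruct_normal(samples_hat, labels, lam):
    """Dual-to-primal reconstruction: w_hat = sum_i lam_i * y_i * x_hat_i."""
    n = len(samples_hat[0])
    return [sum(lam[i] * labels[i] * samples_hat[i][k] for i in range(len(lam)))
            for k in range(n)]


def classify(w_hat, x_hat):
    """Decision rule: +1 if <w_hat, x_hat> >= 0, otherwise -1."""
    return 1 if sum(a * b for a, b in zip(w_hat, x_hat)) >= 0 else -1


# One feature, augmented by the constant 1.0.
samples_hat = [[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
labels = [1, 1, -1, -1]

results = {}
for p in (1, 2):
    H = dual_hessian(samples_hat, labels, C=1.0, p=p)
    lam = solve_dual(H, C=1.0, p=p)
    w_hat = reconstruct_normal(samples_hat, labels, lam)
    results[p] = (w_hat, [classify(w_hat, x) for x in samples_hat])
    print(p, [round(v, 3) for v in w_hat], results[p][1])
```

On this symmetric toy data both variants classify all training samples correctly, while the 2-loss variant yields a normal vector of smaller norm because margin violations are penalized quadratically.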

In the sense of equivalence of solutions, we can easily show a connection between the classification models associated with the standard model (2) and the relaxed-bias formulations. Let us write the separating hyperplane equation in a component-wise form:

$$\langle \hat{w}, \hat{x} \rangle = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b\beta = \langle w, x \rangle + b\beta = 0, \tag{9}$$

where $\beta$ is the user-defined constant appended to each augmented sample $\hat{x}$; for the typical choice $\beta = 1$, this is equivalent to (2). However, since the bias term $b$ is incorporated into the regularization term in the sense of Tikhonov regularization, the resulting model could slightly differ from the one attained by the standard (non-relaxed) formulations. On the other hand, we do not have to deal with the equality constraint that appears in the standard dual formulations and is sometimes the reason why solvers diverge.

## 3 Model Calibration

Calibrating a classification model refers to a special type of statistical inference that transforms an uncalibrated output (raw prediction), particularly the decision function $h(\hat{x})$, to a probability of class membership $P(y = 1 \mid \hat{x})$. Commonly, calibration is required when we need to adjust the robustness of a classification model, balance the class preference, or provide a cost-sensitive classification. In this section, we pay attention to an estimation of the probability $P(y = 1 \mid \hat{x})$ employing a well-known calibration technique called Platt's scaling, introduced by Platt-Advances-1999. This technique is commonly known as Platt's calibration in the machine learning community.

The idea behind this technique is based on fitting a parametric form of a sigmoid-shaped function that maps the uncalibrated SVM output $h(\hat{x})$ to the posterior probability $P(y = 1 \mid \hat{x}) \approx P_{A,B}(\hat{x})$, where

$$P_{A,B}(\hat{x}) = \frac{1}{1 + \exp\left(A\,h(\hat{x}) + B\right)}. \tag{10}$$

The parameters $A$, $B \in \mathbb{R}$ determine the slope of the sigmoidal curve and its lateral displacement, respectively, and they are fitted in practice using maximum likelihood estimation (MLE). With regard to the relaxed-bias classification mentioned in Section 2, we assume that the raw SVM output

$$h(\hat{x}) = \langle \hat{w}, \hat{x} \rangle \tag{11}$$

is proportional to the log odds of positive samples in the model (10).

In the original paper, Platt suggested using an additional training set, i.e. a calibration set, for training the calibration curve on the output of a general instance-based SVM to avoid incorporating bias from margin failures, i.e. cases when $1 - y_i h(\hat{x}_i) > 0$, on the functional margin $y_i h(\hat{x}_i)$. Let us denote such a dataset as an ordered set:

$$T_c := \{(h_1, y_1), (h_2, y_2), \dots, (h_{m_c}, y_{m_c})\}, \tag{12}$$

where $m_c$ is the number of calibration samples and $h_i$ is an estimate of $h(\hat{x}_i)$ for $i \in \{1, 2, \dots, m_c\}$. On the other hand, when an optimal model performance is attained for a reasonably small value of the penalty $C$, e.g. in real-world applications employing linear SVMs, see Platt-Advances-1999, and the data are well-behaved, the bias due to margin failures usually becomes small. Therefore, it is often possible to simply fit the sigmoid on the training dataset.

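To prevent model overfitting, Platt proposed an additional transformation of the binary labels $y_i$ to target probabilities $t_i$ such that $t_i = \frac{N_+ + 1}{N_+ + 2}$ for $y_i = 1$, or $t_i = \frac{1}{N_- + 2}$ for $y_i = -1$, where $N_+$ and $N_-$ are the numbers of positive and negative calibration samples, respectively.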
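The label transformation and the sigmoid mapping can be sketched in a few lines of Python. This is an illustrative sketch under the standard form of Platt's targets, $(N_+ + 1)/(N_+ + 2)$ for positive and $1/(N_- + 2)$ for negative samples; the function names are hypothetical, and the sigmoid is evaluated naively here, so it may overflow for very large arguments.

```python
import math


def platt_targets(labels):
    """Map labels in {-1, +1} to Platt's smoothed target probabilities."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return [t_pos if y == 1 else t_neg for y in labels]


def platt_probability(score, A, B):
    """Sigmoid 1 / (1 + exp(A * score + B)) applied to a raw SVM score."""
    return 1.0 / (1.0 + math.exp(A * score + B))


labels = [1, 1, 1, -1]
targets = platt_targets(labels)
print(targets)                                 # three times 0.8, then 1/3
print(platt_probability(0.0, A=-1.0, B=0.0))   # 0.5
```

Note that a negative `A` is needed for the probability to increase with the raw score, which is the usual outcome of the fit when positive samples receive larger scores.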

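The best parameter setting $(A^*, B^*)$ is determined by minimizing the negative log-likelihood (cross-entropy error function) on the calibration data so that:

$$(A^*, B^*) = \underset{A,\,B}{\arg\min}\ -\sum_{i=1}^{m_c}\left(t_i \log p_i + (1 - t_i)\log(1 - p_i)\right), \tag{13}$$

where $p_i = \left(1 + \exp(A h_i + B)\right)^{-1}$ is the value of the sigmoid for the raw SVM output $h_i$ of the $i$-th calibration sample, $t_i$ is its target probability, and $m_c$ is the number of calibration samples. To solve (13), the author of Platt-Advances-1999 proposed using the Levenberg–Marquardt (LM) algorithm Levenberg-QAM-1944. Unfortunately, the technique for updating the damping factor of the LM method introduced by Platt can cause the solver to fail to converge to a minimum of (13), as discussed in Lin-ML-2007. To avoid the issue arising from the damping factor, the authors of Lin-ML-2007 suggested the Newton method with a backtracking line search.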

Though the proposed approach is favourable due to its simplicity, trust-region methods are more robust. Since we focus on training robust predictors in this paper, we exploit the Newton method with a trust region in all numerical experiments presented in Section 5.

## 4 PermonSVM

The PermonSVM package is a part of the PERMON toolbox designed for usage in a massively parallel distributed environment containing hundreds or thousands of computational cores. Technically, it is an extension of the core PERMON package called PermonQP, from which it inherits basic data structures, initialization routines, and the build system, and whose computational and QP transformation routines it utilizes, e.g. normalization of an objective function, dualization, etc. Programmatically, the core functionality of the PERMON toolbox is written on top of the PETSc framework and follows the same design and coding style, making it easy to use for anyone familiar with PETSc. It is usable on all main operating systems and architectures, ranging from smartphones through laptops to high-end supercomputers.

PermonSVM supports distributed parallel reading (through MPI) from formats such as SVMLight, HDF5, and PETSc binary files, more than 4 formulations of the classification problem, and two types of parallel cross-validation, namely k-fold and stratified k-fold cross-validation. The resulting QP-SVM problem with an implicitly represented Hessian, in which the Gram matrix is not assembled, is processed by solvers provided by the PermonQP package or the PETSc framework. Unlike standard machine learning libraries, the PERMON toolbox provides interface functions for changing the underlying QP-SVM solver and for monitoring and tweaking the algorithms. In Code 1, we present an example of the usage of the PermonSVM API. Our libraries are developed as an open-source project under the BSD 2-Clause Licence.

## 5 Numerical Experiments

In this section, we analyze numerical experiments related to balancing the predictive relevance of the single-target relaxed-bias classification model using Platt's calibration technique, which we introduced in Section 2 and Section 3, respectively. We benchmark this approach on datasets associated with biochemical activities of ligands on biological targets, namely abl (Abelson murine leukemia viral oncogene homolog protein), adora2a (adenosine A$_{2A}$ receptor), cnr1 (cannabinoid receptor type 1), and cnr2 (cannabinoid receptor type 2). These datasets were exported from the ExCAPE database, which was developed by Sun-JOC-2017. While it is possible to calibrate a model on the same dataset on which the model was trained, see Section 3, it could be problematic to decide whether the bias of an uncalibrated model is small enough. Therefore, we split the training samples into training and calibration datasets. After training the models, we evaluate their performance on the test dataset using the precision, sensitivity, and area under the receiver operating characteristic curve (AUC) performance scores. The input datasets were divided into training, calibration and test datasets such that they consist of , , ligands, respectively, and the ratio of active to inactive ligands is sufficiently preserved. The characteristics of these datasets are summarized in Table 1.

| Target (dataset) | #ligands | #active | #inactive |
|---|---|---|---|
| abl (training) | | | |
| abl (calibration) | | | |
| abl (test) | | | |
| adora2a (training) | | | |
| adora2a (calibration) | | | |
| adora2a (test) | | | |
| cnr1 (training) | | | |
| cnr1 (calibration) | | | |
| cnr1 (test) | | | |
| cnr2 (training) | | | |
| cnr2 (calibration) | | | |
| cnr2 (test) | | | |

For training the uncalibrated classification models, we choose the best penalty $C$ from the set

algorithmically, employing hyperparameter optimization (HyperOpt) by means of a grid search combined with stratified

-fold cross-validation (CV). The value of the best penalty is selected so that the accumulated precision and sensitivity during CV are maximized. All components of the initial guess are set to , as proposed in Pecha-LNEE-2019. The relative norm of the projected gradient being smaller than , discussed in Pecha-SVM-AIP-2018, is used as the stopping criterion for the MPRGP (Modified Proportioning and Reduced Gradient Projection) algorithm, see Dos-book-09, in all presented experiments. The expansion step-length is fixed and determined such as , where , where denotes the Hessian matrix associated with (4) and (5).

Using the PETSc implementation of the Newton method without preconditioning and with the default settings, the sigmoid-shaped calibration function is computed by minimizing the cross-entropy (13) on the calibration data. Since the Newton method converges quickly to the optimal solution when the initial guess is close enough to it, Platt-Advances-1999 proposed initial guesses for the parameters of the sigmoid such that $A = 0$ and $B = \log\frac{N_- + 1}{N_+ + 1}$, where $N_+$ and $N_-$ denote the numbers of active and inactive samples in the calibration dataset. To avoid numerical difficulties or catastrophic cancellation that could arise from the evaluation of $1 - p_i$, where $p_i$ is close to $1$, we evaluate (10) by using $p_i = \frac{\exp(-(A h_i + B))}{1 + \exp(-(A h_i + B))}$ when $A h_i + B \geq 0$; otherwise we use (10). These numerical improvements were proposed by Lin-ML-2007. Other numerical obstacles could arise from evaluating the Hessian

$$\nabla^2 F(A, B) = \begin{bmatrix} \sum_{i} h_i^2\, p_i (1 - p_i) & \sum_{i} h_i\, p_i (1 - p_i) \\ \sum_{i} h_i\, p_i (1 - p_i) & \sum_{i} p_i (1 - p_i) \end{bmatrix} \tag{14}$$

associated with the cross-entropy function (13), denoted here by $F$, where the sums run over all calibration samples. Thus, we replace the term $1 - p_i$ by $\frac{\exp(A h_i + B)}{1 + \exp(A h_i + B)}$ when $A h_i + B < 0$, else by $\frac{1}{1 + \exp(-(A h_i + B))}$, see Lin-ML-2007. Since the Hessian (14) is SPS in general Lin-ML-2007, we regularize it by the matrix , where in our experiments.
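The numerical safeguards described above can be combined into a compact fitting routine. The following Python sketch is an illustration, not the PETSc-based solver used in the experiments: it uses the Newton method with a backtracking line search in the spirit of Lin-ML-2007 rather than a trust region, and the regularization constant `sigma`, the tolerances, and all function names are assumed for illustration.

```python
import math


def prob_and_complement(z):
    """Return (p, 1 - p) for p = 1 / (1 + exp(z)) without overflow or cancellation."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e), 1.0 / (1.0 + e)
    e = math.exp(z)
    return 1.0 / (1.0 + e), e / (1.0 + e)


def cross_entropy(scores, targets, A, B):
    """Stable value of -sum(t * log(p) + (1 - t) * log(1 - p))."""
    total = 0.0
    for h, t in zip(scores, targets):
        z = A * h + B
        if z >= 0:
            total += t * z + math.log1p(math.exp(-z))
        else:
            total += (t - 1.0) * z + math.log1p(math.exp(z))
    return total


def fit_platt(scores, labels, max_iter=100, sigma=1e-12, tol=1e-5):
    """Fit (A, B) by a regularized Newton method with backtracking line search."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos, t_neg = (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]
    # Initial guess proposed by Platt.
    A, B = 0.0, math.log((n_neg + 1.0) / (n_pos + 1.0))
    f = cross_entropy(scores, targets, A, B)
    for _ in range(max_iter):
        h11 = h22 = sigma          # Hessian diagonal, regularized by sigma * I
        h21 = g1 = g2 = 0.0
        for h, t in zip(scores, targets):
            p, q = prob_and_complement(A * h + B)
            h11 += h * h * p * q
            h22 += p * q
            h21 += h * p * q
            g1 += h * (t - p)
            g2 += t - p
        if abs(g1) < tol and abs(g2) < tol:
            break
        det = h11 * h22 - h21 * h21
        dA = -(h22 * g1 - h21 * g2) / det
        dB = -(-h21 * g1 + h11 * g2) / det
        slope = g1 * dA + g2 * dB
        step, accepted = 1.0, False
        while step >= 1e-10:
            new_f = cross_entropy(scores, targets, A + step * dA, B + step * dB)
            if new_f < f + 1e-4 * step * slope:   # sufficient decrease
                A, B, f = A + step * dA, B + step * dB, new_f
                accepted = True
                break
            step /= 2.0
        if not accepted:
            break
    return A, B


scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # hypothetical raw SVM outputs
labels = [-1, -1, 1, -1, 1, 1]
A, B = fit_platt(scores, labels)
print(A, B)
```

For data in which larger scores indicate the positive class, the fitted `A` is expected to be negative, so that the calibrated probability increases with the raw score.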

Instead of stochastic optimization, which is commonly used in the machine learning community, the solvers used in these experiments are deterministic. They pass through all training samples in one iteration; therefore, we consider the terms epoch and iteration identical in this text. In other words, we do not split the training dataset into batches during the training phase of the classifier. In this way, we obtain strictly settled and reproducible training pipelines, unlike when employing DNNs or other traditionally used techniques. On the other hand, deterministic solvers are more costly in terms of computational resources, so training the predictors can take longer than with stochastic optimization. At this expense, we reduce uncertainty during the training process, which could be crucial for some scientific applications, e.g. those related to the pharmaceutical industry.

After calibrating a classification model, we convert the probabilities to label predictions using an optimal threshold (Thr.) in order to demonstrate the balanced class predictive relevance of the calibrated models. The optimal threshold is determined using a grid search so that the absolute value of the difference between the precision score (Pre.) and the sensitivity (Sen.) on the test dataset is minimized, and the F score must be greater than , i.e. the predictive ability of the model must be better than random.
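The threshold selection can be sketched as a simple grid search. The following Python code is a minimal illustration under stated assumptions: the grid resolution, the F score bound `min_f1 = 0.5`, and all function names are hypothetical and are not taken from the experiments.

```python
def precision_sensitivity(labels, predictions):
    """Precision and sensitivity (recall) for labels in {-1, +1}."""
    tp = sum(1 for y, z in zip(labels, predictions) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(labels, predictions) if y == -1 and z == 1)
    fn = sum(1 for y, z in zip(labels, predictions) if y == 1 and z == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    total = precision + sensitivity
    return 2.0 * precision * sensitivity / total if total else 0.0


def balanced_threshold(labels, probabilities, grid_size=101, min_f1=0.5):
    """Grid search for the threshold minimizing |precision - sensitivity|.

    Thresholds whose F score does not exceed min_f1 (an assumed bound) are skipped.
    Returns (gap, threshold, precision, sensitivity) or None.
    """
    best = None
    for k in range(grid_size):
        thr = k / (grid_size - 1)
        predictions = [1 if p >= thr else -1 for p in probabilities]
        pre, sen = precision_sensitivity(labels, predictions)
        if f1_score(pre, sen) <= min_f1:
            continue
        gap = abs(pre - sen)
        if best is None or gap < best[0]:
            best = (gap, thr, pre, sen)
    return best


labels = [1, 1, 1, -1, -1, -1, -1, 1]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]   # hypothetical calibrated outputs
gap, thr, pre, sen = balanced_threshold(labels, probabilities)
print(thr, pre, sen)   # 0.41 0.75 0.75
```

Ties are resolved in favour of the smallest threshold on the grid; a different tie-breaking rule would be equally consistent with the description above.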

Because all datasets are too small to utilize more than one processor core, all experiments were run on a single MPI process pinned to a processor core. In all presented experiments, we utilized the same node of the ANSELM supercomputer at IT4Innovations. The performance scores are summarized in Table 2 and Table 3 for the uncalibrated models and the models after calibration, respectively.

Uncalibrated model

| Target | Loss | Pre. [%] | Sen. [%] | F | AUC |
|---|---|---|---|---|---|
| abl | | | | | |
| adora2a | | | | | |
| cnr1 | | | | | |
| cnr2 | | | | | |

Looking at the performance scores presented in Table 2, we can see that -loss SVM outperforms -loss in the overall performance scores (F and AUC) for the abl and cnr datasets, while -loss SVM provides slightly better models for the adora2a and cnr datasets. As we mentioned in Section 2, the 1-loss SVM commonly produces better quality models. However, we have to take into account that we relax the bias in our approach. Therefore, the models are relaxed as well, which could be the cause of these unexpected results in the sense of the performance scores of the models.

Calibrated model (Thr., Pre., Sen. and AUC refer to the binary classification)

| Target | Loss | Brier score | Thr. | Pre. [%] | Sen. [%] | AUC |
|---|---|---|---|---|---|---|
| abl | | | | | | |
| adora2a | | | | | | |
| cnr1 | | | | | | |
| cnr2 | | | | | | |

Analysing the predictive relevance of the uncalibrated models, we can see that active ligands are preferred by the models related to abl, cnr1, and cnr2, whereas inactive ligands are preferred for the adora2a dataset. To balance the predictive relevance, we perform model calibration. After this, we can see in Table 3 that, comparing the Brier scores, the models trained using the 2-loss SVM seem to be better calibrated in all cases than the ones related to the 1-loss SVM. This could be a simple consequence of the robustness of the underlying model. Even though we use the relaxed approach, the 1-loss SVM still tries to produce a more robust model than the 2-loss SVM, since using a linear sum of hinge loss functions instead of a sum of squared hinge loss functions leads to slightly better handling of the outliers.

Elapsed time [s] (HyperOpt + Training + Calibration)

| Loss | abl | adora2a | cnr1 | cnr2 |
|---|---|---|---|---|

Thus, calibrating the models related to the 2-loss SVM has a more significant impact than in the case of the 1-loss SVM; as we can see, the predictive relevances of the classes are well balanced. Moreover, from Table 4, we can observe speedups (abl), (adora2a), (cnr1), and (cnr2) when using the 2-loss SVM instead of the 1-loss SVM.

However, calibrating the models could cause a deterioration of the overall model performance determined by means of AUC, as we can see by comparing Table 2 and Table 3. Specifically, the overall performance scores of the models decrease by to in the cases of abl (both loss-type models), cnr (-loss model) and cnr (-loss model). For the models related to the adora2a target and the -loss and -loss models associated with cnr and cnr, respectively, the AUC scores are the same. Since the models were trained, calibrated and tested on different datasets, we can consider these calibrated models to have the same overall performance score in the sense of AUC as their related uncalibrated models. Comparing the quality of the remaining calibrated and uncalibrated models is application-specific. In some applications, they could be considered models of the same quality. Under stricter quality criteria, the models can be considered to differ significantly.

From the achieved results, it seems that it is better to train models using the 2-loss SVM, which are not as robust as those of the 1-loss SVM, and then perform their calibration. Moreover, we can obtain a better convergence rate by employing this approach. We observe a speedup of up to in the case of training the model on the cnr dataset.

## 6 Conclusion

In this paper, we focused on the problem of balancing the predictive relevance of single-target models trained using SVMs. This calibration could be required since SVMs are sensitive to imbalanced datasets, outliers and high multicollinearity among training samples.

Regarding the calibration improvements of the models, we observe from the achieved results that an additional calibration works significantly better for models trained using the 2-loss SVM with relaxed bias. It seems that this could be a consequence of these models not being as robust as in the case of the 1-loss SVM. Moreover, we achieve a speedup of up to by means of the approach based on the 2-loss SVM. On the other hand, calibrating models could cause a deterioration of the overall model performance, as we saw in the presented numerical experiments. Therefore, it makes sense to calibrate models only for critical applications, e.g. the biochemical modelling presented in this paper, where balanced predictive relevance is required.

Since we achieved some unexpected results in the sense of model performance scores, which are probably caused by relaxing the bias term of the hyperplane, we are going to focus on calibrating models trained using the full dual formulation of SVM, i.e. the one with the equality constraint. Further, we are going to test other calibration techniques, e.g. isotonic regression.

## Acknowledgments

The author acknowledges the support of The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science - LQ1602”; the grant programme “Support for Science and Research in the Moravia–Silesia Region 2017” (RRC/10/2017), financed from the budget of the Moravian–Silesian region; and the Grant of SGS No. SP2020/84, VSB - Technical University of Ostrava. The author would also like to thank a reviewer for the constructive feedback.
