Quantitative structure-activity relationship (QSAR) modeling is a computational or mathematical method for identifying the relationship between the biological activity and the structural properties of chemical compounds. The underlying principle is that variations in structural properties cause differences in biological activity. Structural properties refer to physicochemical properties, and biological activity corresponds to pharmacokinetic properties such as absorption, distribution, metabolism, excretion and toxicity.
QSAR models help rank a large number of chemicals in terms of their desired biological activity and greatly reduce the number of candidates for testing. QSAR modeling has become a common process in pharmacology, but despite all the progress, and even after the well-known article by a group of co-authors, many limitations remain [3, 4].
For example, the data may include hundreds of thousands of compounds or, on the contrary, only a very small sample; each compound can be represented by multiple descriptors; some features are highly correlated; and the dataset is assumed to contain some errors because the relationships are estimated through in-situ experiments. Because of these and other limitations, it is difficult to achieve reliable predictions with a QSAR-based model.
Machine learning approaches have been applied to QSAR-based prediction: linear regression models and Bayesian neural networks [6-8] have been used. The random forest (RF) [9,10] deserves special mention: it is the most frequently used algorithm, with a high level of predictability, simplicity, and reliability. RF is an ensemble method based on sets of decision trees that can prevent overfitting. RF is considered the gold standard in this area, so new QSAR prediction methods are often compared against it.
The well-known Merck Kaggle competition in 2012 drew attention to neural networks. The winning team used multi-task neural networks (MTNN). The fundamental learning structure is based on simple feedforward neural networks; it avoids overfitting by learning multiple biological assays at the same time. The team achieved results that consistently outperformed the random forest algorithm. Despite the high performance of the multi-task neural network, the team ended up using an ensemble combining different methods. Both RF and many of the algorithms in that Kaggle competition used ensemble learning, a technique that trains a set of models and combines them to produce the final predictions. It has been shown theoretically and empirically that the predictive power of ensemble learning is superior to that of an individual algorithm when the individual algorithms are accurate and diverse [12-15]. Ensemble learning balances the strengths and weaknesses of the individual algorithms, similar to consensus decision making in critical situations.
Thus, it is popular to build algorithm ensembles by running several algorithms and combining their results. In an ensemble, the base algorithms generate partially dependent or independent results on the same or different parts of a dataset, and these results are then combined in one of several ways. The success of an ensemble depends on two main properties: the individual success of its base algorithms (low error), and the independence of the base algorithms’ results from each other (high diversity).
Ensemble methods have been widely used in drug-related studies. Examples include an ensemble of neural networks based on bootstrap sampling in QSAR (data sampling ensemble); an ensemble across different learning methods for drug interactions; a Bayesian ensemble model with various QSAR tools (method ensemble); ensemble learning based on qualitative and quantitative SAR models; a hybrid QSAR prediction model with various learning methods [20]; ensembles with different boosting methods; hybrid feature selection and learning in QSAR modeling; and an ensemble across various chemicals to predict carcinogenicity (representation ensemble).
In contrast to the work that presents a comparative analysis of ensemble algorithms, this study aims at overcoming the difficulties of QSAR modeling by using ensembles. Our experiments focus on regression ensembles because this type of model is simpler and easier to understand for medicinal chemists. We investigate performance with respect to both the ensemble algorithms themselves and the base algorithms used within them.
The article is organized as follows. Section 2 describes the ensemble and base regression algorithms, the dimension reduction process, and the dataset collection. Section 3 presents the simulation results and their discussion. Section 4 reviews previous works in order to place our results in the context of the achievements of other studies.
II Materials and Methods
In this section, the base and ensemble algorithms used in our study are briefly described. The scikit-learn library was used to evaluate the algorithms. Each ensemble algorithm was used with each of the base algorithms, and the base algorithms were also used alone. This configuration gives (2 ensemble + 1 single) × (19 base) = 57 different algorithms.
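The count of 57 configurations can be reproduced with a short enumeration. This is only an illustrative sketch: the names `base_01` to `base_19` are hypothetical placeholders for the 19 base regressors, and the `BG-`/`AR-` prefixes follow the naming used in the result tables.

```python
from itertools import product

# Hypothetical placeholder names standing in for the 19 base regressors.
base_names = [f"base_{i:02d}" for i in range(1, 20)]

# "" = base algorithm used alone, "BG-" = bagging, "AR-" = additive regression.
prefixes = ["", "BG-", "AR-"]

# One configuration per (base algorithm, prefix) pair.
configurations = [prefix + name for name, prefix in product(base_names, prefixes)]

print(len(configurations))  # 57
```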
We ran our experiments on Windows 10 (Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz). We used the scikit-learn package (version 0.23.2) for conventional machine learning methods.
II.1 Ensemble algorithms
Bagging/bootstrapping (BG): Bagging generates N new equal-sized datasets from the original dataset by sampling with replacement. The base algorithms are trained on these datasets, which makes the individual results independent to some degree. N was set to 10 in our experiments. The results of the base algorithms are simply averaged to produce the ensemble result.
Additive regression (AR): This is the adaptation of the AdaBoost algorithm to regression problems. At each iteration, the samples that had large errors in the previous iteration are given more emphasis. The number of iterations was set to 10 in our study. The ensemble result is the weighted mean of the base algorithms' results, with weights inversely proportional to the errors of the base algorithms.
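The two ensemble schemes can be sketched in plain Python as follows. This is a simplified illustration and not the implementation used in the study: the base learner is a toy one-dimensional weighted least-squares line, and the re-weighting rule and the constant `eps` in `additive` are assumptions chosen only to follow the verbal description above (they do not reproduce AdaBoost.R2 or any library routine).

```python
import random

def fit_line(xs, ys, ws=None):
    """Weighted least-squares fit of y = a*x + b (toy stand-in for a base learner)."""
    ws = ws or [1.0] * len(xs)
    total = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / total
    my = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:  # all sampled x identical: fall back to a constant model
        return 0.0, my
    a = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / sxx
    return a, my - a * mx

def bagging(xs, ys, n_models=10, seed=0):
    """BG: fit one model per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(a * x + b for a, b in models) / len(models)

def additive(xs, ys, n_iter=10, eps=1e-12):
    """AR-style sketch: emphasize high-error samples, then combine the models
    with weights inversely proportional to their errors (assumed update rule)."""
    n = len(xs)
    ws = [1.0 / n] * n
    models = []
    for _ in range(n_iter):
        a, b = fit_line(xs, ys, ws)
        errs = [abs(a * x + b - y) for x, y in zip(xs, ys)]
        model_err = sum(w * e for w, e in zip(ws, errs))
        models.append((a, b, 1.0 / (model_err + eps)))
        worst = max(errs)
        if worst > 0.0:
            # Samples with larger errors receive larger weights next round.
            ws = [w * (1.0 + e / worst) for w, e in zip(ws, errs)]
            total = sum(ws)
            ws = [w / total for w in ws]
    norm = sum(weight for _, _, weight in models)
    return lambda x: sum(weight * (a * x + b) for a, b, weight in models) / norm

xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free toy data: y = 2x + 1
bg_model = bagging(xs, ys)
ar_model = additive(xs, ys)
print(bg_model(25.0), ar_model(25.0))  # both close to 51 on this toy data
```

With scikit-learn, the corresponding estimators would presumably be `BaggingRegressor` and `AdaBoostRegressor` wrapped around each base regressor; the sketch above only makes the resampling and weighting logic explicit.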
II.2 Base regression algorithms
In our study, 19 regression algorithms were used as base learners in the ensembles. They are as follows:
Lasso: The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients.
Ridge: This model solves a regression problem in which the loss function is the linear least squares function and regularization is given by the l2-norm. It is also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression.
ElasticNet: ElasticNet is a linear regression model trained with both l1- and l2-norm regularization of the coefficients. This combination allows for learning a sparse model in which few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.
Orthogonal Matching Pursuit (OMP): OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
Bayesian Regression: Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.
Automatic Relevance Determination: This method fits the weights of a regression model using an ARD prior. The weights of the regression model are assumed to follow Gaussian distributions. The parameters lambda (precisions of the distributions of the weights) and alpha (precision of the distribution of the noise) are also estimated. The estimation is done by an iterative procedure (evidence maximization).
Passive Aggressive Algorithms: The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.
Theil-Sen estimator: TheilSenRegressor is comparable to Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method, which means it makes no assumption about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against corrupted data (outliers). In the univariate setting of simple linear regression, Theil-Sen has a breakdown point of about 29.3%, which means that it can tolerate arbitrarily corrupted data of up to 29.3%.
Huber Regression: Huber regression uses a loss function that is not heavily influenced by outliers while not completely ignoring their effect.
Kernel ridge regression (KRR): Kernel ridge regression combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
Support Vector Regression (SVR): The model produced by support vector classification depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function ignores samples whose prediction is close to their target.
Decision Tree Regressor (DTR): The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Random Forest Regressor: Random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Extra-Trees Regressor: This class implements a meta estimator that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Nearest Neighbors Regression: Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
Multi-layer Perceptron regressor (MLPR): MLPRegressor trains iteratively since at each time step the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters. It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting. This implementation works with data represented as dense and sparse numpy arrays of floating point values.
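Three of the ideas in the list above are simple enough to sketch in a few lines: the Huber loss, the classical one-dimensional Theil-Sen estimator, and nearest-neighbor regression. These are toy versions written for illustration, not the scikit-learn implementations; the threshold `delta=1.35` and the neighbor count `k=3` are assumed values.

```python
from itertools import combinations
from statistics import median

def huber_loss(residual, delta=1.35):
    """Quadratic for small residuals, linear for large ones (delta is assumed)."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def theil_sen_line(xs, ys):
    """Median of all pairwise slopes, then median intercept (1-D case)."""
    points = list(zip(xs, ys))
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2) if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept

def knn_predict(train_x, train_y, query, k=3):
    """Mean target of the k training points closest to the query (1-D case)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]   # toy data: y = 2x + 1
ys[-1] = 100.0                     # one corrupted target (an outlier)

slope, intercept = theil_sen_line(xs, ys)
print(slope, intercept)            # 2.0 1.0 despite the outlier
print(huber_loss(0.5), huber_loss(10.0))
print(knn_predict(xs, ys, 4.2))
```

In the toy data one of ten targets is corrupted, which is below the 29.3% breakdown point quoted above, so the median-based fit still recovers the underlying line.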
II.3 Dimension reduction process
Drug design datasets generally have a very large number of features. In our study, the original datasets and their dimensionally reduced versions are used. By doing so, the effects of the feature selection process on the accuracies of the algorithms are investigated. The accuracies over the original and dimensionally reduced datasets are compared. The Random Forest Importances method is used for feature selection.
II.4 Dataset collection
Our drug data collection consists of 4 drug datasets obtained from several studies. The datasets are shown in Table 1. The datasets with 2075 descriptors were generated using Dragon. The molecules and outputs were obtained from the original studies.
|Dataset ID||Dataset name||Number of samples||Original number of descriptors||Number of selected features||Reference|
III Results and Discussion
Nineteen base regressors were used together with each ensemble algorithm on 4 regression-type drug design problems. Before the simulations, we posed the same questions as in the article, since these concern the most important characteristics of ensemble methods:
Do the algorithm ensembles generate more successful results than a single algorithm?
What is the most successful ensemble algorithm?
What is the base algorithm-ensemble pair with the best results?
Which algorithm performs well with the ensembles?
What is the most successful single algorithm?
How are the algorithms and datasets grouped according to their performances?
How does the dimension reduction process affect the results?
To answer these questions, 57 algorithms ((2 ensemble + 1 single) × (19 base algorithms) = 57) were applied to the 4 drug design datasets described in Table 1 and to their dimensionally reduced versions. Cross-validation was used and the RMSE results were averaged.
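The RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$
where $\hat{y}_i$ is the prediction of the algorithm for the i-th test sample, $y_i$ is the actual output value of the i-th test sample, and N is the number of test samples.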
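The RMSE described above can be computed with a few lines of Python. This is a minimal sketch; the function name `rmse` and the example values are illustrative and not taken from the study.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error over N paired test samples."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equally long")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(targets))

# Illustrative values only.
print(rmse([2.0, 4.0], [0.0, 0.0]))  # square root of (4 + 16) / 2 = sqrt(10)
```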
Our base and ensemble algorithms have hyperparameters that could be optimized; we used the default values.
In the cross-validation methodology, the dataset is shuffled and randomly divided into 2 halves. One half is used for training and the other for testing. This validation is repeated 5 times, giving 5 estimates of the test RMSE for each algorithm and each dataset. In some experiments, very high RMSE values were obtained, especially with the simple linear regression algorithm, which distorted the overall averages. For this reason, the algorithms were compared by their success rankings instead of their averaged RMSEs. In each experiment, the averaged cross-validation RMSEs were sorted in ascending order. The algorithm with the lowest RMSE received the 1st ranking and the worst received the 57th. These success rankings are given in Tables 2 and 3: Table 2 shows the results with the original datasets, and Table 3 shows the results with the dimensionally reduced datasets. The 4 datasets are ordered along the columns of the tables and the algorithms along the rows. The average success ranking and standard deviation of each algorithm are shown in the last 2 columns.
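The validation and ranking procedure can be sketched as follows. The function names, the random seed, and the RMSE values are hypothetical; the example only illustrates why a rank is less sensitive than a plain average to a single diverging RMSE.

```python
import random

def repeated_half_splits(n_samples, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs: shuffle, then cut into two halves."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        half = n_samples // 2
        yield idx[:half], idx[half:]

def success_ranking(avg_rmse):
    """Map algorithm name -> rank, where rank 1 is the lowest averaged RMSE."""
    ordered = sorted(avg_rmse, key=avg_rmse.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}

splits = list(repeated_half_splits(10))

# Hypothetical averaged RMSEs; the diverging value would dominate a plain mean
# but only costs the algorithm the last rank.
avg_rmse = {"alg_a": 0.31, "alg_b": 0.29, "alg_c": 1.0e6}
ranks = success_ranking(avg_rmse)
print(ranks)  # {'alg_b': 1, 'alg_a': 2, 'alg_c': 3}
```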
In Tables 4 and 5, the summaries of Tables 2 and 3 are given, respectively. Each cell is the averaged success ranking of the experiments with the base algorithm in the cell’s row and the ensemble algorithm in the cell’s column. The average success rankings of the base algorithms used alone are given in the ‘Single’ column. In the ‘Avg.’ column, the averaged success rankings of the experiments with respect to the base algorithms are given. In the ‘Avg.’ row, the averaged success rankings of the experiments with respect to the ensemble algorithms are given.
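As a worked example of how one summary row is assembled, the snippet below rebuilds the Extra Trees Regressor row of Table 4 from the three average rankings reported for it in Table 2. The dictionary layout is an assumption made for illustration.

```python
from statistics import mean

# Average rankings of three rows of Table 2 (original datasets).
table2_avg = {
    "Extra Trees Regressor": 5.75,
    "BG-Extra Trees Regressor": 9.25,
    "AR-Extra Trees Regressor": 14.0,
}

base = "Extra Trees Regressor"
row = {
    "BG": table2_avg["BG-" + base],
    "AR": table2_avg["AR-" + base],
    "Single": table2_avg[base],
}
# The 'Avg.' cell is the mean over the BG, AR and Single cells of the row.
row["Avg."] = round(mean(row.values()), 2)
print(row)
```

The resulting average, 9.67, matches the value listed for Extra Trees Regressor in Table 4.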
When Tables 2-5 are examined, the following conclusions are reached. For the experiments with the original datasets (Tables 2 and 4):
–The best ranking performance (5.75) is obtained with the Extra Trees Regressor algorithm.
–The best-performing ensemble algorithm is additive regression (AR).
–The best-performing base algorithm is Support Vector Machine.
–The ensemble algorithms generally increased the performance of each base algorithm. The exceptions are Bayesian Ridge, Support Vector Machine and K Neighbors Regressor.
–The Decision Tree and Support Vector Machine base algorithms had their best performances with BG. The Decision Tree and Automatic Relevance Determination algorithms achieved their best performances with AR.
Table 2: Success rankings with the original datasets.
|No.||Algorithm||Dataset 1||Dataset 2||Dataset 3||Dataset 4||Average ranking|
|9||Lasso Least Angle Regression||34||44||34||45||39.25|
|10||BG-Lasso Least Angle Regression||32||43||32||46||38.25|
|11||AR-Lasso Least Angle Regression||36||42||36||47||40.25|
|12||Orthogonal Matching Pursuit||48||48||48||0||36|
|13||BG-Orthogonal Matching Pursuit||30||24||30||6||22.5|
|14||AR-Orthogonal Matching Pursuit||24||19||24||1||17|
|18||Automatic Relevance Determination||45||50||45||4||36|
|19||BG-Automatic Relevance Determination||38||17||38||3||24|
|20||AR-Automatic Relevance Determination||20||13||20||5||14.5|
|21||Passive Aggressive Regressor||37||40||37||51||41.25|
|22||BG-Passive Aggressive Regressor||35||30||35||52||38|
|23||AR-Passive Aggressive Regressor||40||14||40||53||36.75|
|33||Support Vector Machine||4||16||4||41||16.25|
|34||BG-Support Vector Machine||1||23||1||39||16|
|35||AR-Support Vector Machine||14||26||14||42||24|
|36||K Neighbors Regressor||19||41||19||33||28|
|37||BG-K Neighbors Regressor||16||37||16||34||25.75|
|38||AR-K Neighbors Regressor||22||38||22||35||29.25|
|45||Extra Trees Regressor||3||7||3||10||5.75|
|46||BG-Extra Trees Regressor||6||9||6||16||9.25|
|47||AR-Extra Trees Regressor||11||20||11||14||14|
|51||Gradient Boosting Regressor||17||22||17||8||16|
|52||BG-Gradient Boosting Regressor||10||11||10||9||10|
|53||AR-Gradient Boosting Regressor||7||27||7||7||12|
|54||Multi Level Perceptron||56||51||56||56||54.75|
|55||BG-Multi Level Perceptron||50||36||50||55||47.75|
|56||AR-Multi Level Perceptron||54||45||54||54||51.75|
Table 3: Success rankings with the dimensionally reduced datasets.
|No.||Algorithm||Dataset 1||Dataset 2||Dataset 3||Dataset 4||Average ranking|
|9||Lasso Least Angle Regression||42||35||42||47||41.5|
|10||BG-Lasso Least Angle Regression||39||33||39||49||40|
|11||AR-Lasso Least Angle Regression||45||30||45||50||42.5|
|12||Orthogonal Matching Pursuit||25||26||25||37||28.25|
|13||BG-Orthogonal Matching Pursuit||44||12||44||36||34|
|14||AR-Orthogonal Matching Pursuit||43||5||43||35||31.5|
|18||Automatic Relevance Determination||28||45||28||8||27.25|
|19||BG-Automatic Relevance Determination||47||43||47||11||37|
|20||AR-Automatic Relevance Determination||48||52||48||4||38|
|21||Passive Aggressive Regressor||55||56||55||53||54.75|
|22||BG-Passive Aggressive Regressor||27||41||27||52||36.75|
|23||AR-Passive Aggressive Regressor||29||54||29||51||40.75|
|33||Support Vector Machine||15||8||15||43||20.25|
|34||BG-Support Vector Machine||14||4||14||41||18.25|
|35||AR-Support Vector Machine||7||10||7||42||16.5|
|36||K Neighbors Regressor||36||13||36||33||29.5|
|37||BG-K Neighbors Regressor||32||11||32||32||26.75|
|38||AR-K Neighbors Regressor||50||23||50||44||41.75|
|45||Extra Trees Regressor||0||2||0||0||0.5|
|46||BG-Extra Trees Regressor||2||0||2||1||1.25|
|47||AR-Extra Trees Regressor||1||1||1||2||1.25|
|51||Gradient Boosting Regressor||3||25||3||19||12.5|
|52||BG-Gradient Boosting Regressor||18||20||18||13||17.25|
|53||AR-Gradient Boosting Regressor||6||19||6||14||11.25|
|54||Multi Level Perceptron||56||47||56||56||53.75|
|55||BG-Multi Level Perceptron||21||42||21||55||34.75|
|56||AR-Multi Level Perceptron||5||39||5||54||25.75|
Table 4: Summary of Table 2 (original datasets).
|Base algorithm||BG||AR||Single||Avg.|
|Lasso Least Angle Regression||38.25||40.25||39.25||39.25|
|Orthogonal Matching Pursuit||22.5||17||36||25.17|
|Automatic Relevance Determination||24||14.5||36||24.83|
|Passive Aggressive Regressor||38||36.75||41.25||38.67|
|Support Vector Machine||16||24||16.25||18.75|
|K Neighbors Regressor||25.75||29.25||28||27.67|
|Extra Trees Regressor||9.25||14||5.75||9.67|
|Gradient Boosting Regressor||10||12||16||12.67|
|Multi Level Perceptron||47.75||51.75||54.75||51.42|
Table 5: Summary of Table 3 (dimensionally reduced datasets).
|Base algorithm||BG||AR||Single||Avg.|
|Lasso Least Angle Regression||40||42.5||41.5||41.33|
|Orthogonal Matching Pursuit||34||31.5||28.25||31.25|
|Automatic Relevance Determination||37||38||27.25||34.08|
|Passive Aggressive Regressor||36.75||40.75||54.75||44.08|
|Support Vector Machine||18.25||16.5||20.25||18.33|
|K Neighbors Regressor||26.75||41.75||29.5||32.67|
|Extra Trees Regressor||1.25||1.25||0.5||1.00|
|Gradient Boosting Regressor||17.25||11.25||12.5||13.67|
|Multi Level Perceptron||34.75||25.75||53.75||38.08|
For the experiments with the dimensionally reduced datasets (Tables 3 and 5):
–The best ranking performance (0.5) is obtained with the Extra Trees Regressor algorithm.
–The best-performing ensemble algorithms are additive regression (AR) and bagging (BG).
–The best-performing base algorithm is Ridge Regression.
–The ensemble algorithms generally increased the performance of each base algorithm. The exceptions are Orthogonal Matching Pursuit, Bayesian Ridge, Automatic Relevance Determination, TheilSen Regressor and Kernel Ridge.
–The Ridge Regression and Support Vector Machine base algorithms had their best performances with BG. The Ridge Regression, Support Vector Machine and Decision Tree algorithms achieved their best performances with AR.
The average successes of the algorithms were investigated above. Next, the best performing algorithm is investigated for each individual dataset. Table 6 shows, for each dataset, the name and the error of the best performing algorithm for the original and the dimensionally reduced datasets.
Table 6: Best performing algorithm and its RMSE for each dataset.
|Dataset name||Best performing algorithm (original features)||RMSE (original features)||Best performing algorithm (selected features)||RMSE (selected features)|
|polymer_133||AR-Automatic Relevance Determination||0.01||BG-Gradient Boosting Regressor||0.01|
|alkaloid_53||AR-Elastic Net||0.29||BG-Extra Trees Regressor||0.31|
|alkaloid_103||AR-Decision Tree||0.57||Extra Trees Regressor||0.65|
|Polymer_150||Orthogonal Matching Pursuit||0.00||Orthogonal Matching Pursuit||0.02|
When Table 6 is examined, the following conclusions are reached:
–The best performing algorithms are generally ensemble algorithms. This is in agreement with the average success of the algorithms.
–Experiments with the dimensionally reduced datasets do not give better results than the original datasets, except for 1 dataset (polymer_133).
When the algorithms are clustered, the algorithms are represented by points having 4 (the number of datasets) features (dimensions). When the datasets are clustered, the datasets are represented by points having 57 (the number of algorithms) features (dimensions).
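The two representations can be illustrated with three rows of Table 2. This sketch assumes that the success rankings serve as the features and that Euclidean distance measures similarity; the clustering method itself is not shown.

```python
import math

# Rankings on the four original datasets, copied from Table 2.
rank_rows = {
    "Extra Trees Regressor": [3, 7, 3, 10],
    "BG-Extra Trees Regressor": [6, 9, 6, 16],
    "Multi Level Perceptron": [56, 51, 56, 56],
}

# Algorithms as points: one 4-dimensional vector per algorithm.
et = rank_rows["Extra Trees Regressor"]
d_pair = math.dist(et, rank_rows["BG-Extra Trees Regressor"])
d_far = math.dist(et, rank_rows["Multi Level Perceptron"])

# Datasets as points: transpose the matrix, one vector per dataset
# (57-dimensional in the full study, 3-dimensional in this excerpt).
dataset_points = [list(column) for column in zip(*rank_rows.values())]

print(round(d_pair, 2), round(d_far, 2))
print(dataset_points)
```

In this excerpt the bagged Extra Trees variant lies much closer to its base algorithm than to the Multi Level Perceptron.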
According to Figure 1, the following conclusions are reached:
–In both figures, the ensemble-algorithm pairs are generally clustered with their base single algorithms.
–The feature selection process does not affect the similarities of the algorithms dramatically.
According to Figure 2, the following conclusions are reached:
–On the left side of Figure 2, there is no obvious pattern between the clusters and the number of features/samples.
–On the right side of Figure 2, the polymers and alkaloids are clustered separately.
IV Previous works
The selected previous studies in this area for both classification and regression are shown comparatively in Table 7.
According to Table 7, together with our experiments, the following conclusions are reached:
–The number of drug design / chemical datasets used in our experiments is greater than in previous studies, except for the study that used 15 regression-type datasets (Table 7).
–The number of base machine learning methods used in our experiments is greater than in previous studies.
–The superior success of ensemble algorithms over single algorithms is confirmed.
Table 7: Selected previous studies.
|Reference||Compared methods in the study||Datasets||Results|
|||ctree, rtree, cforest, rforest, gbm, fnn, earth, glmnet, ridge, lm, pcr, plsr, rsm, rvm, ksvm, ksvmfp, nnet, nneth2o||1 regression-type (chemical data).||RandomForest showed good results.|
|||(4 ensemble + 1 single) * (7 base) = 35||15 regression-type (chemical data).||Ensemble methods showed good results.|
|||AdaBoostM1+Bagging (Ada_Bag), AdaBoostM1+Jrip (Ada_Jrip), AdaBoostM1+J48 (Ada_J48), AdaBoostM1+PART (Ada_PART), AdaBoostM1+RandomForest (Ada_RF)||MDDR database.||Bagging (Ada_Bag) and Random Forest (Ada_RF) algorithms showed good results.|
|||EnsemDT, EnsemKRR and other single methods||2 classification datasets (chemical data)||EnsemDT and EnsemKRR showed better results than other single methods.|
In machine learning, committee algorithms (ensembles), especially those with classification applications, are highly popular because they have better performances than single algorithms.
In this study, the comparative performances of algorithm ensembles with drug design datasets in regression applications were investigated. A drug design dataset collection with 4 regression-type datasets was used for this purpose. We obtained the performances of the single algorithms and the algorithm ensembles on those datasets. The combinations of 19 base algorithms and 2 ensemble algorithms were investigated.