Machine learning (ML) is the branch of Artificial Intelligence (AI) that focuses on developing systems that can learn from experience. Rather than being explicitly told how to solve a problem, ML algorithms learn from observations by induction (Russell and Norvig, 2016). Because ML algorithms have a generic ability to learn, rather than to solve any one particular problem, they are very widely applicable. The application of ML to science has a long history. The pioneering work was the development of learning algorithms for the analysis of mass-spectrometric data (Buchanan et al., 1968). Now, the significance of ML to science is generally recognized, and ML is being applied to a wide variety of scientific areas, such as functional genomics (King et al., 2009), physics (Schmidt and Lipson, 2009), drug discovery (Schneider, 2017), organic synthesis planning (Segler et al., 2018), materials science (Butler et al., 2018), and medicine (Esteva et al., 2017).
Probably the most exciting current area of machine learning is that of deep neural networks (DNNs) (LeCun et al., 2015; Silver et al., 2016; Esteva et al., 2017). Thanks to advances in computer hardware and the availability of vast amounts of data, DNNs have been shown to be capable of such impressive tasks as beating World Champions at games such as Go (Silver et al., 2016), and diagnosing skin cancers better than human specialists (Esteva et al., 2017). In practice, however, DNNs are applicable only to the very small subset of scientific problems for which such large amounts of data are available. In addition, most scientific problems require human-comprehensible models, while DNNs only provide black-box models.
1.1 Representation Learning
The key to success in machine learning (ML) is the use of effective data representations. Almost all machine learning is based on representations that use tuples of attributes, i.e. the data can be put into a single table, with the examples as rows, and the attributes (descriptors) as columns. An attribute is a proposition that is possibly true about an example. (Examples are described as tuples, and not vectors, as the order of the attributes does not matter, as long as it is the same for all the examples.) The attributes used to describe examples are intrinsic properties of the examples that are believed to be important: for example, if one wished to learn about the effectiveness of a drug, then properties of its molecular structure may be useful attributes; similarly, if one wished to learn about chess positions, then the position of the white King might be a useful attribute. Typically, one attribute is singled out as the one we want to predict, and the other attributes contribute information to make this prediction. If this attribute is categorical then the problem is a discrimination/classification task; if the attribute is a real number then the problem is a regression one. Here, we focus on regression problems. The recent success of DNNs has been based on their ability to utilize multiple neural network layers, and large amounts of data, to learn how to convert raw input representations (e.g., image pixel values) into richer internal representations that are effective for learning. This internal conversion has been especially successful in problems where the only available attributes are very simple and minimal, such as pixel colour, brightness, position, etc. Due to this ability to learn effective internal representations, DNNs have succeeded in domains that had previously proved recalcitrant to ML, such as face recognition and learning to play Go.
The archetypical case of this is face recognition, which was once considered to be intractable, but can now be solved with super-human ability on certain limited problems (Bengio, 2012).
1.2 Multi-task Learning and Transfer Learning
The large amounts of data required for DNNs to learn a good representation are unfortunately not available for many scientific problems. Nevertheless, many scientific problems do present themselves as sets of related problems which, taken together, provide significant amounts of data, e.g. learning quantitative structure activity relationships (QSARs) for related targets (proteins). Multi-task learning (Caruana, 1997) is the branch of machine learning in which related problems (called tasks) are learned simultaneously, with the aim of exploiting similarities between the tasks and thus obtaining improved performance (Ando and Zhang, 2005; Evgeniou et al., 2005). The tasks are learned in parallel using a shared representation, so that what is learned from one task (e.g. one where more data is available) can also be used for another task. Multi-task learning has been successful in many scientific applications, such as HIV therapy screening (Bickel et al., 2008), analysis of genotype and gene expression data (Kim and Xing, 2010), discovery of highly important marker genes (Xu et al., 2011), modelling of disease progression (Zhou et al., 2011), disease prediction (Zhang et al., 2012), biological sequence classification (Widmer et al., 2010), and predicting small interfering RNA (siRNA) efficacy (Liu et al., 2010). Multi-task learning is closely related to the field of transfer learning (Thrun and Pratt, 1998), in which information is transferred from a specific source task to a specific target task. This can be done by forcing the target model to be structurally or otherwise similar to the source model(s). Neural networks are well suited to transfer learning, as both the structure and the model parameters of the source models can be used as a good initialization for the target model, yielding a pre-trained model which can then be further fine-tuned using the available training data on the target task (Thrun and Mitchell, 1994; Baxter, 1995; Bengio, 2012; Caruana, 1995).
Large image datasets in particular, such as ImageNet (Krizhevsky et al., 2012), have been shown to yield pre-trained models that transfer well to other tasks (Donahue et al., 2014; Sharif Razavian et al., 2014). However, it has also been shown that this approach does not work well when the target task is not very similar to the source task (Yosinski et al., 2014). As such, it is often difficult to make transfer learning work for scientific problems.
The success or failure of multi-task learning often crucially depends on the existence of a good task similarity measure. For instance, one could learn a common Bayesian prior over model parameters trained on multiple tasks and use this to measure between-task similarity (Xue et al., 2007; Bakker and Heskes, 2003), or cluster tasks into groups outright (Jacob et al., 2009; Argyriou et al., 2008; Evgeniou et al., 2005). However, it is usually not straightforward to find a similarity measure that works well.
2 Transformative Learning
We present transformative learning, a novel method for transforming input representations into more effective ones. The fundamental new idea is to convert a representation based on intrinsic properties into an extrinsic representation based on the predictions of a set of pre-trained models, each trained on another task. This leverages available data from many related tasks, combining multi-task and transfer learning to make predictions. Transformative learning has the dual advantages of enabling better predictions and providing explainable models. The input to transformative learning is: (1) a set of related prediction problems, and (2) a set of related examples that have been applied to one or more of the prediction problems. Transformative learning is performed in two learning stages. In the first learning stage (Fig. 1), a separate prediction model is learned for each problem, using the available examples and the standard intrinsic attributes that describe them. In the second learning stage (Fig. 2), for each problem, the available examples are applied to the models to produce predicted values. These values form the transformed representation. Instead of representing examples by intrinsic attributes, they are represented by what other models predict about them. This transformed extrinsic representation is used to learn the final predictive model. In transformative learning we learn task similarity and a joint representation at the same time. Instead of using a predefined similarity measure to pre-select a set of similar tasks, we project the different tasks into one joint numeric representation, and use a meta-learning algorithm to learn from this new representation how to make accurate predictions for the task at hand.
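The two learning stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: a k-nearest-neighbour regressor stands in as a simple non-linear base learner, the three toy tasks are synthetic, and the names `make_model`, `extrinsic`, and `toy_task` are hypothetical.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training examples closest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def make_model(train_X, train_y, k=3):
    """Return a predictor for one task (k-NN simply stores the data)."""
    return lambda x: knn_predict(train_X, train_y, x, k)

def extrinsic(x, models, exclude):
    """Describe x by what every model except `exclude` predicts for it."""
    return [models[name](x) for name in sorted(models) if name != exclude]

# Synthetic, hypothetical data: three related regression tasks that share
# the same two intrinsic attributes.
random.seed(0)

def toy_task(weight, n=30):
    X = [[random.random(), random.random()] for _ in range(n)]
    y = [weight * a + math.sin(3 * b) for a, b in X]
    return X, y

tasks = {"task_a": toy_task(1.0), "task_b": toy_task(1.2), "task_c": toy_task(0.8)}

# Stage 1: learn one model per task from the intrinsic attributes.
models = {name: make_model(X, y) for name, (X, y) in tasks.items()}

# Stage 2: re-describe the examples of task_a by the predictions of the
# other tasks' models, then learn the final model on that representation.
X_a, y_a = tasks["task_a"]
Z_a = [extrinsic(x, models, exclude="task_a") for x in X_a]
final_model = make_model(Z_a, y_a)

# A new example is transformed in the same way before prediction.
query = [0.5, 0.5]
prediction = final_model(extrinsic(query, models, exclude="task_a"))
```

With three tasks, each example of `task_a` is described by two extrinsic descriptors, one per remaining task model.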
To demonstrate the utility of transformative learning we have applied it to three real-world scientific problems: drug design (quantitative structure activity relationship learning), predicting human gene expression (across different tissue types and drug treatments), and meta-machine learning (predicting how well machine learning methods will work on problems).
3 Quantitative Structure Activity Relationship Learning
The standard Quantitative Structure Activity Relationship (QSAR) learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (normally inhibiting a target protein), learn a predictive mapping from molecular representation to activity. QSAR problems are suitable for transformative learning as they can be related by having related target proteins (e.g. the problem of inhibiting mouse DHFR is similar to that of inhibiting human DHFR), and they can also be related by involving the same or chemically related small molecules.
Drug development is one of the most important applications of science. It is an essential step in the treatment of almost all diseases. Developing a new drug is, however, slow and expensive. The average cost to bring a new drug to market is 2.6 billion US dollars (Mullard, 2014). A key step in drug development is learning QSARs (Martin, 2010; Cherkasov et al., 2014; Cumming et al., 2013). Almost every form of statistical and machine learning method has been applied to this problem, but no single method has been found to be always best (Olier et al., 2018). The most important QSAR dataset is the ChEMBL database (Gaulton et al., 2016), a medicinal chemistry database managed by the European Bioinformatics Institute (EBI). It is abstracted and curated from the scientific literature, and covers a significant fraction of the medicinal chemistry corpus. The data consist of information on the drug targets, the structures of the tested compounds (from which different intrinsic chemoinformatic representations may be calculated), and the bioactivities of the compounds on their targets. We extracted 2,219 targets from ChEMBL with a diverse number of chemical compounds, ranging from 30 to about 6,000, each target resulting in a dataset with as many examples as compounds (Olier et al., 2018). Chemical compounds were intrinsically described using a standard fingerprint representation (as it is the most commonly used in QSAR learning), where the presence or absence of a particular molecular substructure in a molecule (e.g. methyl group, benzene ring) is indicated by a Boolean variable. Specifically, we calculated the 1,024-bit FCFP4 fingerprint representation using the Pipeline Pilot software from BIOVIA (Rogers and Hahn, 2010).
We applied transformative learning to generate extrinsic descriptors of the chemical compounds. For this we selected two learning methods: Random Forest (RF, 500 trees) (Breiman, 2001), and Linear Regression with Ridge Penalization (Ridge, L2 = 10) (Hoerl and Kennard, 1970). This choice was based on the results of Olier et al. (2018), where these two methods performed best for QSAR datasets using the 1,024-bit fingerprint representation. QSAR models were created, one for each dataset and learner. Then extrinsic descriptors were generated by predicting activity using all the models, excluding the one for which the compound was part of the training set. Therefore, 2,218 extrinsic descriptors were generated per chemical compound (i.e. 2,219 original datasets - 1 training dataset). We performed a comparative assessment of the two QSAR data representations: the original intrinsic one based on molecular fingerprints, and the transformed data representation based on model predictions. For the comparison we applied three machine learning methods: Random Forest (RF, 500 trees), Linear Regression with Ridge Penalization (Ridge, L2 = 10), and Support Vector Machines (SVM, radial basis function kernel, width = 0.2) (Cortes and Vapnik, 1995). Method performance was measured using the root mean squared error (RMSE). RMSE, whose values are in the same range as the response variable, is standard for regression tasks. 10-fold cross-validation was used across all experiments, with the same data splits to reduce the risk of bias. All the experiments were performed in R (Team et al., 2013). Table 1 reports average RMSE performance on the test sets.
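The evaluation protocol (RMSE over 10-fold cross-validation with splits reused across methods) can be illustrated with a short standard-library sketch. The experiments themselves were run in R; the Python below is only an illustration, and the helper names `rmse` and `kfold_indices` as well as the fixed seed are assumptions introduced here.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the response."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def kfold_indices(n, k=10, seed=0):
    """Split range(n) into k disjoint test folds.

    Fixing the seed means every learner and every representation is
    evaluated on exactly the same folds.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# The same folds are returned on every call with the same seed.
folds = kfold_indices(25, k=10, seed=0)
```

Reusing one set of folds for the intrinsic and the transformed representation means that any difference in RMSE cannot be attributed to a different partition of the data.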
|Learning Method||Original rep.||TL - RF||(%)||TL - Ridge||(%)|
First, consider the application of Random Forest learning to transform the intrinsic chemical representation. Applying Random Forest learning a second time to the transformed representation was found to outperform the first Random Forest on 1,118 of the 2,212 problems. This corresponds to % mean improvement in RMSE. A similar result was found when applying SVM to this transformed representation, where SVM outperformed the first SVM on 1,125 of the 2,212 problems, which also corresponds to a 10% mean improvement in RMSE. These results are especially noteworthy as we know from previous work, where we compared 18 common learning methods with 3 different intrinsic representations on the same data, that Random Forest with the fingerprint representation is the best method / intrinsic representation combination (Olier et al., 2018). Therefore, transformative learning has produced a large improvement over the best of 54 (18 x 3) intrinsic approaches. The transformative learning approach does not work well with Linear Regression with Ridge Penalization. Using Ridge as the learning method to transform the representation produces no improvement. Nor is Ridge successful at exploiting the transformed representation generated by Random Forest.
4 Gene Expression Learning
As our second problem domain we selected the problem of predicting gene expression levels. Our goal was to build predictive models that, given a drug and cancer cell type, would be able to predict gene expression levels. These models can then be used to guide laboratory-based drug discovery experiments. Specifically, we utilized the Library of Integrated Network-based Cellular Signatures (LINCS) data (Koleti et al., 2017). This data describes the effect of drugs in cancer cell lines on the expression levels of 978 landmark human genes. The prediction problem is to learn a model for each gene (978 models) that predicts the gene’s expression level given the experimental conditions (cell type, drug, dosage). The related examples are the experimental conditions themselves.
We used LINCS Phase II data (accession code GSE70138), which consists of 118,050 experimental conditions, along with the corresponding expression levels for 978 landmark genes. We generated attributes for each perturbation condition using the accompanying metadata. Each experimental condition is associated with a perturbagen (drug), cell type and site, perturbagen dosage, and perturbagen time frame. In total, there are 30 cell types (ct), 14 cell sites (cs), 83 dosages (d) and 3 time points (tp). Of the 2,170 drugs in the dataset, 1,795 have valid chemical structures (canonical SMILES codes) according to the metadata. We converted the canonical SMILES to 1,024-bit FCFP4 fingerprints (fp) using RDKit (Landrum, 2016). For all perturbation conditions with valid canonical SMILES as rows, we generated Boolean features with the following columns: cell type (ct), cell site (cs), dosage (d), time point (tp), and fingerprint bits (fp). This generated a 107,152 × 1,155 experimental condition matrix, row and column identifiers included, which can be used as input for building models to predict the expression levels of the 978 genes using traditional machine learning techniques. For each gene we generated both a train and a test set, with 7,000 and 3,000 samples respectively. We did this by first randomly splitting the original perturbation condition data, with 107,152 samples and their corresponding gene expression levels, into train and test sets of 70% and 30% respectively. From these main train and test sets, we randomly sampled train and test individuals for each gene. The gene expression levels for the 978 genes were normalised such that their values lie between 0.0 and 1.0. We used two learning algorithms for these experiments, Random Forests (RF) and Linear Regression with Ridge Penalization (Ridge). For the RF, 500 trees were grown, a third of the total number of variables were considered at each split, and five observations were used in each terminal node. For Ridge, the regularization parameter was chosen using 10-fold internal cross-validation.
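As a consistency check on the width of the experimental condition matrix, and assuming that each categorical field is encoded as one Boolean column per distinct value (an assumption; the exact encoding is not spelled out here), the column count works out as:

$$
\underbrace{30}_{\text{ct}} + \underbrace{14}_{\text{cs}} + \underbrace{83}_{\text{d}} + \underbrace{3}_{\text{tp}} + \underbrace{1{,}024}_{\text{fp}} = 1{,}154, \qquad 1{,}154 + 1\ (\text{identifier column}) = 1{,}155 .
$$

This agrees with the stated 1,155 columns if exactly one of them holds the row identifiers.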
All the experiments were performed in R (Team et al., 2013). The RF experiments were performed using version 4.6-12 of the randomForest package, and the Ridge experiments were performed using version 2.0-13 of the glmnet package. Model performance was calculated as the RMSE. For both Random Forests and Ridge, we considered 500 descriptors in the transformative learning step. For both learning methods the same gene models were used in the generation of the first-order descriptors.
First, consider the application of Random Forest learning to learn from the intrinsic representation and so produce the transformed representation. Applying Random Forest learning a second time to this transformed representation was found to outperform the first Random Forest on 977 of the 978 genes. This corresponds to a % mean improvement in RMSE. In contrast, applying Ridge learning to the transformed representation was found to outperform the first Random Forest on 862 of the 978 genes. This corresponds to a % mean improvement in RMSE, see Table 2.
|Learning Method||Original rep.||TL - RF||(%)||TL - Ridge||(%)|
Next, consider the application of Ridge learning to learn from the intrinsic representation and so produce the transformed representation. Applying Random Forest learning to this transformed representation was found to outperform the base Ridge models on 952 of the 978 genes. This corresponds to a % mean improvement in RMSE, see Table 2. In contrast, applying Ridge learning to the Ridge-transformed representation outperformed the base Ridge models on only 415 of the 978 genes.
5 Meta-Learning for Machine Learning
In machine learning, a key challenge is to select the best algorithm to train a predictive model on a new task. One approach to this problem is to apply machine learning itself to predict the best techniques (Vanschoren, 2018). Hence, this is called meta-learning, and we select it as our third problem domain. In this type of meta-learning, the prediction problem is to predict the performance of a machine learning method (given an exact configuration) on a new task, given the characteristics of the training data (e.g. statistics of the training data distribution). Domain problems can be related by having similar data distributions, data defects (e.g. missing values), or by containing data being generated by similar processes. The properties used to describe the datasets themselves are typically called meta-features.
Meta-learning for machine learning is feasible thanks to the creation of open repositories that collect datasets, meta-features, and experiment results. OpenML is an online machine learning platform where researchers can automatically log and share data, code, and experiments (Vanschoren et al., 2014). It brings together reproducible experiments from most major machine learning environments, such as WEKA (Java), mlr (R), and scikit-learn (Python). From OpenML we retrieved data from an earlier meta-learning study (details can be found at https://www.openml.org/s/7). Although we had to exclude a few tasks and algorithms because they lacked sufficient evaluations in OpenML, this yielded a set of 10,840 evaluations on 351 tasks (datasets) and 53 machine learning methods (called flows on OpenML) from mlr (Bischl et al., 2016). From each task, 21 dataset descriptors were extracted, such as the number of examples, number of missing values, and percentage of numeric features. We formed meta-datasets, one for each machine learning method. An observation within a meta-dataset represents an original OpenML task, and each feature a dataset descriptor. The original aim of the study was to predict the area under the ROC curve (AUC). In total, we produced 53 meta-datasets with a diverse number of OpenML tasks, ranging from above 100 to about 250. We applied transformative learning to transform the original representation of the datasets into extrinsic descriptors of the OpenML tasks. Three learners were selected to do the transformation: Random Forest (RF, 500 trees), Linear Regression with Ridge Penalization (Ridge, L2 = 10), and Support Vector Machines with Radial Basis Kernel Functions (SVM). The transformed descriptors were generated by predicting AUC using all available models, excluding the one for the meta-dataset to which the OpenML task belonged. In this way 52 extrinsic descriptors were generated for each OpenML task.
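The construction of one meta-dataset per machine learning method can be sketched as a simple grouping step. The records, descriptor names, and the helper `build_meta_datasets` below are hypothetical stand-ins for illustration only; they do not reproduce the OpenML schema or the actual study data.

```python
from collections import defaultdict

# Hypothetical evaluation records: (task_id, flow_name, auc).
evaluations = [
    (1, "rf", 0.91), (1, "svm", 0.88),
    (2, "rf", 0.75), (2, "svm", 0.80),
    (3, "rf", 0.66),
]

# Hypothetical dataset descriptors (meta-features), one dict per task.
descriptors = {
    1: {"n_examples": 150, "n_missing": 0, "pct_numeric": 100.0},
    2: {"n_examples": 1000, "n_missing": 12, "pct_numeric": 40.0},
    3: {"n_examples": 48, "n_missing": 3, "pct_numeric": 75.0},
}

def build_meta_datasets(evaluations, descriptors):
    """Group evaluations by flow: one row per task, AUC as the target."""
    meta = defaultdict(list)
    for task_id, flow, auc in evaluations:
        row = dict(descriptors[task_id])  # copy the task's meta-features
        row["task_id"] = task_id
        row["auc"] = auc
        meta[flow].append(row)
    return dict(meta)

meta_datasets = build_meta_datasets(evaluations, descriptors)
```

Because not every method was evaluated on every task, the meta-datasets differ in size, which matches the varying number of tasks per meta-dataset reported above.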
Table 3 shows comparative performance results between the two data representations: the intrinsic original representation using data descriptors (i.e. number of instances, percentage of numeric features, etc.), and the transformed extrinsic representation. We used the same learners as above (RF, 500 trees; Ridge, L2 = 10; and SVM). For instance, when we train a Random Forest on the intrinsic representation and use it to predict the performance of learning algorithms on every dataset, those predictions have an RMSE of 0.1184 (first row in Table 3). Training the Random Forest on the transformed representation (which does not have access to the dataset we are predicting for) was found to outperform the first Random Forest on 51 of the 52 tasks, yielding an RMSE of 0.526. This corresponds to an impressive mean improvement in RMSE. Similarly, applying Ridge to the transformed representation was found to outperform the first Random Forest on all of the 52 tasks, which corresponds to a mean improvement in RMSE. Applying SVM to the transformed representation was found to outperform the first Random Forest on 50 of the 52 tasks, which corresponds to a mean improvement in RMSE. Likewise, applying an SVM to learn from the transformed representations was found to vastly outperform training on the intrinsic representation (third row in Table 3), corresponding to a mean improvement in RMSE. Learning on features transformed by Random Forest was found to outperform the original SVM model on 50 of the 52 tasks, with a mean improvement in RMSE, and using features transformed by Ridge was found to outperform the first SVM method on all of the 52 tasks, which corresponds to a mean improvement in RMSE.
|Learning Method||Original rep.||TL - RF||(%)||TL - Ridge||(%)||TL - SVM||(%)|
As with QSAR learning and gene expression prediction, the application of Ridge learning to transform the representation was unsuccessful, with results little different from the original intrinsic representation.
Comparison with All-In Learning.
A standard meta-learning approach, often used with DNNs, is to try to learn one large model that encompasses all the problems. In some circumstances this can work well. However, this approach has clear disadvantages compared to transformative learning:
- If new data occurs for a task, the whole model has to be relearned.
- If a new task is added, the whole model has to be relearned.
- The relationships between tasks are not explicit.
- The relationships between examples are also not explicit.
A major motivation of transformative learning is to develop a learning approach that provides explainable models. The transformed representation generates clearly understandable descriptors for learning. For example, using the example problem in Fig 2 of classifying animals, it is possible to classify an animal as a rabbit if it has a combination of properties of a donkey and kitten. This explainability is in marked contrast to the black-box nature of DNNs. Transformative learning also enables one to better understand the relationships between the learning tasks. This can be achieved by using the models for each task to predict all the examples, and then clustering the tasks by their predictions: which displays how the tasks are related in prediction space. Similarly, it is possible to better understand the relationships between examples by clustering them by their different model predictions: which shows how the examples are related in task space.
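Clustering tasks in prediction space can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between prediction vectors and a simple single-linkage grouping with a fixed threshold; neither choice is prescribed above, and the prediction values and the helper `cluster_by_predictions` are hypothetical.

```python
import math

# Hypothetical prediction matrix: predictions[task] holds that task
# model's outputs for the same five examples.
predictions = {
    "task_a": [0.10, 0.40, 0.80, 0.30, 0.90],
    "task_b": [0.12, 0.38, 0.79, 0.33, 0.88],
    "task_c": [0.90, 0.10, 0.20, 0.70, 0.05],
    "task_d": [0.85, 0.15, 0.25, 0.72, 0.10],
}

def cluster_by_predictions(predictions, threshold):
    """Single-linkage grouping: merge tasks closer than `threshold`."""
    names = sorted(predictions)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the cluster.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(predictions[a], predictions[b]) <= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

task_clusters = cluster_by_predictions(predictions, threshold=0.2)
# task_clusters == [["task_a", "task_b"], ["task_c", "task_d"]]
```

Transposing the matrix, so that each example is described by the predictions of all task models, and applying the same function groups examples in task space instead.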
The Computational Cost of Transformative Learning.
One disadvantage of transformative learning is its additional computational cost. With transformative learning, in addition to the standard learning process, it is necessary to: 1) use each task model to predict all the examples to form the transformed representation, and 2) learn new task models using the transformed representation. Both steps are potentially computationally expensive. However, the cost of transformative learning is low compared to DNNs.
Transformative Learning using Linear Regression with Ridge Penalization.
Our results indicate that the use of Ridge to form a transformed representation does not result in improved predictions. This suggests that it is necessary for the learning method that forms the transformed representation to be non-linear. In contrast, the use of Ridge to make predictions based on a transformed representation made by Random Forests or SVM can work well, as it does for gene expression prediction and meta-learning for machine learning.
Second-Order Transformative Learning.
In transformative learning the fundamental new idea is to transform the original, intrinsic data representation into an extrinsic representation based on what a pre-trained set of models predict about the examples. Given the expectation that using the transformed representation produces better predictions than the original intrinsic representation, it is natural to extend the idea of transformative learning by applying it a second time, i.e. to use the predictions from the transformed representation to form a second-order transformed representation. As the predictions from the transformed representation are better than the ones from the intrinsic representation, learning using a second-order transformed representation should be more successful than with the first-order transformed representation. One clear disadvantage of this approach is the high computational cost of using a second-order transformed representation.
In the past, machine learning was most commonly applied in bespoke ways to isolated problems. Now, with the ever-increasing availability of data, machine learning is being increasingly applied to large sets of related problems. This is motivating an increased interest in multi-task and transfer learning. We have developed a novel and general representation learning approach for multi-task learning, and we have demonstrated the success of this approach on three real-world scientific problems: drug design, predicting human gene expression, and meta-learning for machine learning. In all three problems, transformative learning significantly outperforms the best intrinsic representations. We expect transformative learning to be of general application to scientific problems and beyond.
Acknowledgements. The authors would like to thank Rafael Mantovani for generating the original meta-learning data used in this study.
- Ando and Zhang  R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
- Argyriou et al.  A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
- Bakker and Heskes  B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
- Baxter  J. Baxter. Learning internal representations. In Proceedings of the eighth annual conference on Computational learning theory, pages 311–320. ACM, 1995.
- Bengio  Y. Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 17–36, 2012.
- Bickel et al.  S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th international conference on Machine learning, pages 56–63. ACM, 2008.
- Bischl et al.  B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. mlr: Machine learning in r. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://jmlr.org/papers/v17/15-066.html.
- Breiman  L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
- Buchanan et al.  B. Buchanan, G. Sutherland, and E. A. Feigenbaum. Heuristic DENDRAL: A program for generating explanatory hypotheses in organic chemistry. Stanford University, 1968.
- Butler et al.  K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547, 2018.
- Caruana  R. Caruana. Learning many related tasks at the same time with backpropagation. In Advances in neural information processing systems, pages 657–664, 1995.
- Caruana  R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
- Cherkasov et al.  A. Cherkasov, E. N. Muratov, D. Fourches, A. Varnek, I. I. Baskin, M. Cronin, J. Dearden, P. Gramatica, Y. C. Martin, R. Todeschini, et al. Qsar modeling: where have you been? where are you going to? Journal of medicinal chemistry, 57(12):4977–5010, 2014.
- Cortes and Vapnik  C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- Cumming et al.  J. G. Cumming, A. M. Davis, S. Muresan, M. Haeberlein, and H. Chen. Chemical predictive modelling to improve compound quality. Nature reviews Drug discovery, 12(12):948, 2013.
- Donahue et al.  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
- Esteva et al.  A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
- Evgeniou et al.  T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
- Gaulton et al.  A. Gaulton, A. Hersey, M. Nowotka, A. P. Bento, J. Chambers, D. Mendez, P. Mutowo, F. Atkinson, L. J. Bellis, E. Cibrián-Uhalte, et al. The chembl database in 2017. Nucleic acids research, 45(D1):D945–D954, 2016.
- Hoerl and Kennard  A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- Jacob et al.  L. Jacob, J.-p. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745–752, 2009.
- Kim and Xing  S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pages 543–550, 2010.
- King et al.  R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, et al. The automation of science. Science, 324(5923):85–89, 2009.
- Koleti et al.  A. Koleti, R. Terryn, V. Stathias, C. Chung, D. J. Cooper, J. P. Turner, D. Vidović, M. Forlin, T. T. Kelley, A. D’Urso, et al. Data portal for the Library of Integrated Network-based Cellular Signatures (LINCS) program: integrated access to diverse large-scale cellular perturbation response data. Nucleic acids research, 46(D1):D558–D566, 2017.
- Krizhevsky et al.  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Landrum  G. Landrum. RDKit: open-source cheminformatics. http://www.rdkit.org, 2016.
- LeCun et al.  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Liu et al.  Q. Liu, Q. Xu, V. W. Zheng, H. Xue, Z. Cao, and Q. Yang. Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study. BMC bioinformatics, 11(1):181, 2010.
- Martin  Y. C. Martin. Tautomerism, Hammett σ, and QSAR. Journal of computer-aided molecular design, 24(6-7):613–616, 2010.
- Mullard  A. Mullard. New drugs cost US$2.6 billion to develop, 2014.
- Olier et al.  I. Olier, N. Sadawi, G. R. Bickerton, J. Vanschoren, C. Grosan, L. Soldatova, and R. D. King. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery. Machine Learning, 107(1):285–311, 2018.
- Rogers and Hahn  D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
- Russell and Norvig  S. J. Russell and P. Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
- Schmidt and Lipson  M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
- Schneider  G. Schneider. Automating drug discovery. Nature Reviews Drug Discovery, 17(2):97, 2017.
- Segler et al.  M. H. Segler, M. Preuss, and M. P. Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
- Sharif Razavian et al.  A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In
- Silver et al.  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Team et al.  R Core Team. R: A language and environment for statistical computing. 2013.
- Thrun and Mitchell  S. Thrun and T. M. Mitchell. Learning one more thing. Technical report, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA, 1994.
- Thrun and Pratt  S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
- Vanschoren  J. Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
- Vanschoren et al.  J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.
- Widmer et al.  C. Widmer, J. Leiva, Y. Altun, and G. Rätsch. Leveraging sequence classification by taxonomy-based multitask learning. In Annual International Conference on Research in Computational Molecular Biology, pages 522–534. Springer, 2010.
- Xu et al.  Q. Xu, H. Xue, and Q. Yang. Multi-platform gene-expression mining and marker gene analysis. International journal of data mining and bioinformatics, 5(5):485–503, 2011.
- Xue et al.  Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8(Jan):35–63, 2007.
- Yosinski et al.  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- Zhang et al.  D. Zhang, D. Shen, Alzheimer’s Disease Neuroimaging Initiative, et al. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage, 59(2):895–907, 2012.
- Zhou et al.  J. Zhou, L. Yuan, J. Liu, and J. Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 814–822. ACM, 2011.