Although great strides have been made in predictive computational modeling where large datasets exist, the use of computational modeling where data is scarce and/or expensive has been limited. Data expense and scarcity are the norm in materials design [Hutchinson2017]. Computer-aided design of materials often relies on physics-based predictive methods [Shi2017; Afzal2019], which require no data but cannot predict complex properties like the activity of a drug or the refractive index of a thin film. Even when physics-based modeling is used, there is no clear mechanism to improve predictions as data is gathered in the course of testing materials. Thus, human intuition is often the state of the art for choosing which new materials to test when only small amounts of data are available.
Here we apply active learning to binary classification of peptides across a variety of tasks like predicting solubility or activity against bacteria. We examine two standard active learning methods: query by committee (QBC) and uncertainty minimization with supervised learning of a deep convolutional neural network. We also examine meta-learning to see if there is a gain in transferring knowledge from one task to another. To apply these methods to past data, we only allow our active learners to choose from the dataset of labeled peptides. The methods are evaluated not on the peptides chosen, but on the accuracy of the resulting trained model. The rationale is that there are many competing design constraints in peptide design (e.g., synthetic feasibility, cost, bioavailability, etc.), so it is better to have an accurate model than a finite set of examples proposed to be active. The goal of this work is to assess active learning and meta-learning as potential ways to improve iterative discovery of peptides in this setting.
Active learning has a long history as an extension of design of experiments, which concerns choosing the optimal experiments to perform with limited resources. Our concern is a sequence of experiments in which the results of previous experiments influence the choice of the next, whereas optimal design of experiments is about choosing the best experiments prior to beginning and assumes a linear model. Active learning is this process of choosing the next experiment optimally [Settles2010]. It is sometimes called optimal experimental design [Liepe2013], targeted experiment design [Vanlier2012], sequential design of hypotheses [ZACKS1996151], optimal learning [Gopakumar2018], or artificial-intelligence scientific discovery [Buchanan1968], depending on the goal and problem context.
Active learning is typically formulated as an optimization problem [Settles2010]. Consider observation pairs $(\vec{x}_i, y_i)$ of features and labels, respectively, with $i$ indicating the order of observation. Assume that $y_i$ is a class label and $\vec{x}_i$ is a vector of reals. We have a task model, $\hat{f}(\vec{x}; \theta)$, that assigns a probability to each class label for a feature $\vec{x}$ and is defined by parameters $\theta$, which are updated after each new observation. In this work $\hat{f}$ is a deep convolutional neural network. $\theta$ is updated according to some training procedure after a new pair is observed. In active learning, we choose the next point $\vec{x}_{i+1}$ from our fixed dataset of pairs according to

$$\vec{x}_{i+1} = \operatorname*{argmax}_{\vec{x}}\, A\left[\hat{f}, \vec{x}\right] \tag{1}$$

where $A$ is a functional of the task model and possibly $\vec{x}$. $A$ can be defined by parameters $\phi$, although it is normally fixed. For example, $A$ could select the most uncertain point, which gives

$$\vec{x}_{i+1} = \operatorname*{argmax}_{\vec{x}}\, \left[1 - \hat{f}(\hat{y} \mid \vec{x}; \theta)\right] \tag{2}$$

where $\hat{y}$ is the most likely class label for $\vec{x}$ [lewis1994sequential]. $A$ is called the acquisition function or utility function, depending on the problem setting [Settles2010].
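As a concrete illustration, the uncertainty utility above can be sketched in a few lines. This is a minimal sketch; the function names are ours, not from any particular library:

```python
import numpy as np

def uncertainty_acquisition(probs):
    """Uncertainty utility: 1 - P(y_hat | x), where y_hat is the most
    likely class label under the current task model."""
    return 1.0 - probs.max(axis=1)

def choose_next(pool_probs):
    """Pick the index of the pool point the model is least sure about
    (the argmax in Equation 1)."""
    return int(np.argmax(uncertainty_acquisition(pool_probs)))

# Predicted class probabilities for a pool of 3 unlabeled candidates.
probs = np.array([[0.90, 0.10],   # confident
                  [0.55, 0.45],   # most uncertain -> selected
                  [0.80, 0.20]])
assert choose_next(probs) == 1
```

In practice the pool probabilities come from the task model's softmax output over the unlabeled dataset.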
Within this framework, there are a variety of approaches depending on the form of the task model and utility function. If the task model is probabilistic, like above, one can choose utility functions that maximize information gain [Li2013a; Liepe2013], reduce model uncertainty [5509293], maximize expected model change [Settles2010], or reduce model variance [Mackay1992]. Within variance-reduction methods, there are so-called A-optimal [Mackay1992], D-optimal [Mackay1992; Chaloner1993], and E-optimal [Flaherty2005] approaches, which minimize the covariance matrix according to different assumptions. If the task model is non-parametric, Bayesian approaches are well-suited [Vanlier2012]. One still has a choice of utility function and can, for example, maximize the expected model improvement at each experiment [Chaloner1993; kapoor2007active]. Bayesian approaches can work with recent deep learning methods through Bayesian convolutional neural networks [Gal2017] or through the hypothesized connection between dropout and neural network uncertainty [Gal2015].
If the task model is not stochastic but there is flexibility in parameter choice, or there are multiple models to choose from, then there are a variety of pooled or consensus active learning approaches. QBC is an active learning approach that maintains a committee of models and chooses the next experiment based on where the committee disagrees most [seung1992query]. There are also active learning pool methods for specific model types, for example, k-nearest neighbors [wei2015submodularity], batch-mode selection [hoi2006batch; guo2007discriminative], and linear regression with noise [yu2006active]. One can also treat the choice of model as a probability distribution, so that the pool of models can be viewed as a stochastic or Bayesian model compatible with any of the previous utility functions.
There are active learning methods that are independent of the task model and instead find a characteristic set of data [cohn1996active]. This can be done by clustering the data, optimizing a function that describes local variance [Yang2015], finding regions of high uncertainty [cohn1994improving], or finding a "coreset" characteristic subset of the data via submodular function optimization [Sener2017; wei2015submodularity]. These methods are closely related to semi-supervised learning, which tries to use unlabeled data to improve a model when labeled data is sparse [corduneanu2002information; szummer2003information].
Another topic closely related to active learning is Bayesian optimization, or global optimization of black-box functions. There the goal is to optimize a function while evaluating it a minimum number of times. To connect this to the active learning described above, view the "expensive function" as the experiment. A surrogate stochastic model is constructed, often through bootstrapping or non-parametric models, and that surrogate model is used within an active learning framework to reduce the number of function evaluations [jones1998efficient]. Active learning is then applied so that each function evaluation improves the maximum function value and/or understanding of the model. This method can be equivalent to the variance-reduction techniques discussed above if the same utility functions are chosen [chaloner1995bayesian]. More sophisticated active learning methods can be used with the surrogate model, including Gaussian process regression and complex portfolios of acquisition (utility) functions [hoffman2011portfolio; shahriari2014an]. A common critique of Bayesian optimization is that it typically scales between $O(N^2)$ and $O(N^3)$ in the number of observations $N$, depending on approximations made, and struggles with high-dimensional surrogate models. However, this is not applicable here, since our goal is learning in systems with dozens of experiments, not thousands. A recent overview of the application of Bayesian optimization to materials design can be found in Frazier and Wang [frazier2016bayesian].
An early example of active learning in chemistry can be found in van de Walle et al. [van2002automating], where a phase diagram was explored using variance-minimization active learning on cluster expansions. Active learning works well in general for choosing cluster expansions via variance minimization [Mueller2010; PhysRevB.83.224111]. More recently, Lookman et al. [Lookman2017] explored elastic properties with ab initio calculations using a Bayesian optimization technique (Efficient Global Optimization) to optimize the ratio of three metals. The authors applied the same method to design new piezoelectrics [yuan2018accelerated] with a four-dimensional design space. Gopakumar et al. [Gopakumar2018] also showed how active learning methods that balance exploration and exploitation can work well on two- and three-dimensional materials systems. This active learning approach to finding compositions with Bayesian optimization is quickly gaining popularity in the materials informatics community for low-dimensional systems [Fukazawa2019a; Wen2019a; Rickman2019c].
Kim et al. [Kim2019] explored active learning methods to find polymers with a specific glass transition temperature. This is a high-dimensional system because the polymers are represented with a variety of descriptors. It was made tractable by treating a list of 731 possible polymers as known a priori, so at each step the model evaluates all 731 candidates. An earlier example using a fixed molecule set can also be found in Warmuth et al. [warmuth2002active], where candidate drug molecules were selected from a vendor catalogue with a variance-minimization active learning algorithm.
To avoid this difficulty of high dimensionality and the ambiguity of representing molecular structures as vectors (although see new work in [Sanchez-Lengeling2018; Blaschke2018; Gomez-Bombarelli2018]), this work focuses on peptides. Peptides are a natural fit as targets for machine-learning-driven experimental design because they draw from the finite alphabet of 20 natural amino acids. They can be readily encoded as discrete vectors, in contrast to the less well-defined space of all possible valid chemical structures, for which more specialized methods like autoencoders can be necessary. The easily encoded amino acid sequence representation of peptides also has a convenient parallel in existing machine learning methods for natural language processing, an established subfield of machine learning from which modeling techniques can be leveraged [cambria2014jumping].
A closely related prior study explored the use of Bayesian optimization for de novo design of peptide substrates. That work is similar in that both it and this work used a sequence model with the goal of minimizing the number of experiments required to optimize a peptide. The differences are that they modeled with regression, allowed complete choice in sequence space, did not have the goal of creating accurate models, and used a Naïve Bayes classifier rather than a deep learning model. Their work is an experimentally validated demonstration that intelligently designing experiments with predictive models can reduce the number of peptides that need to be synthesized and tested.
Another topic explored in this work is meta-learning, a technique for improving few-shot learning across multiple tasks. The goal, in our nomenclature above, is to make the task model depend on initial parameters $\theta_0$ that are trained to work well across multiple tasks, so that on a new task, starting from $\theta_0$, as few new examples as possible are required to improve performance. Active learning and meta-learning are connected because both are concerned with maximizing the value of data. As our goal is to minimize the number of experiments required for new systems, we find it natural to consider this method on our dataset. Examples of meta-learning for accelerating task models can be found in transfer learning [Pan2010], few-shot learning [Altae-tran2017; Snell2017; Vinyals2016], and automated machine learning [quanming2018taking]. One application that has recently connected active learning and meta-learning is model-free reinforcement learning [Duan2017; Gupta2018]. Pang and Fang [2015] have also explored the connection between active learning and meta-reinforcement learning.
Five previously published databases were used to generate training datasets in this work: the Antimicrobial Peptide Database (APD) [apd3], the collection of protein surface fragments from [barrett2018classifying], the PROSO II database [Smialowski2012PROSOPrediction], library hits with activity against the TULA-2 protein [Cheng2010], and library hits with activity against SHP-2 [Sweeney2005SHP]. The APD contains peptides flagged with a variety of activities, such as antibacterial, antifungal, antiviral, anticancer, and hemolytic activities. The PROSO II database contains peptides and proteins categorized as soluble or insoluble. The TULA-2 and SHP-2 libraries are fixed-width peptides optimized for binding to a specific target. Eight datasets were chosen from the APD: 1. "antibacterial," 2. "anticancer," 3. "antifungal," 4. "antiHIV," 5. "antiMRSA," 6. "antiparasital," 7. "antiviral," and 8. "hemolytic." All sequences above length 200 were excluded. This wide variety of tasks represents a range of modeling goals in peptide research.
The task model is a deep convolutional neural network classifier, necessitating negative training examples as well as positive ones. One corresponding negative training dataset was generated for each positive dataset. Two types of negative data were generated: fake scrambled data with amino acid distributions identical to the positive dataset, and samples from datasets expected to have no intersection with the positive examples. These are expected to be rather challenging negative examples, since scrambled peptides likely share many physical properties with the positive examples. The non-intersecting datasets are also, generally, naturally occurring peptides that are likely biologically relevant. To generate the fake scrambled negative dataset, peptides were generated randomly, with lengths sampled from the same length range and residues sampled from the same frequency distribution as the respective positive set. The non-intersecting examples are shown in Table 1 along with the rationale in the caption. To balance classes, the negative datasets were sampled down to the same size as the positive set. A t-SNE projection of the peptides in this work is shown in Figure 1.
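The scrambled-negative generation described above can be sketched as follows. This is a minimal sketch; function and variable names are our own, and the residue-frequency estimation is the straightforward counting approach implied by the text:

```python
import numpy as np

ALPHABET = list("ARNDCQEGHILKMFPSTWYV")  # 20 natural amino acids

def scrambled_negatives(positives, n, rng=None):
    """Generate n fake negative peptides whose lengths and residue
    frequencies match the positive set (the 'scrambled' negatives)."""
    rng = np.random.default_rng(rng)
    lengths = [len(p) for p in positives]
    # Count residue frequencies across the positive set.
    counts = np.zeros(len(ALPHABET))
    for p in positives:
        for aa in p:
            counts[ALPHABET.index(aa)] += 1
    freqs = counts / counts.sum()
    out = []
    for _ in range(n):
        L = int(rng.choice(lengths))  # sample a length seen in the positives
        out.append("".join(rng.choice(ALPHABET, size=L, p=freqs)))
    return out
```

For example, `scrambled_negatives(["ARN", "DCQE"], 5)` yields five random peptides of length 3 or 4 drawn only from the residues present in the positive examples.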
| Positive Dataset | Size | Negative Datasets |
| --- | --- | --- |
| antibacterial | 2079 | shp2, tula2, insoluble, antifungal, antiHIV, anticancer, fake |
| anticancer | 183 | shp2, tula2, insoluble, antifungal, antiHIV, antiparasital, antibacterial, fake |
| antifungal | 891 | shp2, tula2, insoluble, antiHIV, anticancer, antibacterial, fake |
| antiHIV | 87 | shp2, tula2, insoluble, antifungal, anticancer, antiparasital, antibacterial, fake |
| antiMRSA | 119 | shp2, tula2, insoluble, antiHIV, anticancer, antiparasital, fake |
| antiparasital | 90 | shp2, tula2, insoluble, antiHIV, anticancer, fake |
| antiviral | 150 | shp2, tula2, insoluble, antifungal, anticancer, antiparasital, antibacterial, fake |
| hemolytic | 253 | shp2, tula2, insoluble, human, fake |
| human | 2880 | insoluble, hemolytic, fake |
Peptide sequences are encoded as one-hot vectors of dimension $L \times 20$ (with $L$ the length of the peptide), where each index in the second dimension corresponds to the index of that position's amino acid in the alphabet of amino acids: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V]. Activity was encoded as a one-hot label vector of length 2, where $[1, 0]$ indicates a positive label and $[0, 1]$ indicates a negative label.
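A minimal sketch of this encoding (the `[1, 0]`/`[0, 1]` label convention in `label_vec` is an illustrative assumption; as noted later, label positions are swapped between tasks during meta-learning):

```python
import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # amino acid order from the text

def one_hot(seq):
    """Encode a peptide as an L x 20 one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, ALPHABET.index(aa)] = 1.0
    return x

def label_vec(active):
    """Length-2 one-hot label; index convention assumed for illustration."""
    return np.array([1.0, 0.0]) if active else np.array([0.0, 1.0])

x = one_hot("ARV")
assert x.shape == (3, 20)
assert x[0, 0] == 1 and x[1, 1] == 1 and x[2, 19] == 1  # A, R, V positions
```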
The task model structure is a convolutional neural network partially inspired by past work in peptide modeling [Barrett2018]. The model structure is shown schematically in Figure 2. The first layer of the neural network is a convolutional layer with a weight matrix of dimension $w \times A \times c$, where $w$ and $c$ conceptually represent the peptide "motif width" and number of "motif classes," respectively, and $A$ is the length of the amino acid alphabet considered ($A = 20$ for the naturally occurring peptides used here). The next layer is a max pool layer, which captures which "motif class" is most likely in the input peptide by pooling across the peptide's length dimension, leaving a length-$c$ vector.
The output of the max pool layer is concatenated with the relative frequency of each amino acid in the input peptide (a length-20 vector) and the standardized (zero mean, unit variance) sequence length. This is then fed to three fully-connected (FC) layers with ReLU activation functions and then to one FC layer with softmax activation for classification, ensuring the outgoing vector of size 2 sums to 1, since this vector represents the likelihood of assignment to the positive or negative class. The final output is compared with the true label vector (classification) or the true activity (regression), and loss is calculated as the cross-entropy between the two. The minimization algorithm used during training was TensorFlow's [tensorflow2015whitepaper] Adam optimizer with the parameters recommended in [KingmaAdam2014] and a learning rate of 0.001.
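The architecture just described can be sketched as a plain NumPy forward pass. This is a sketch with random weights, not the authors' implementation; the hidden-layer width and the length-standardization statistics are placeholder assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(w=5, c=6, hidden=32, rng=None):
    """Randomly initialized weights. hidden, len_mean, and len_std are
    placeholder assumptions, not values from the paper."""
    rng = np.random.default_rng(rng)
    dims = [c + 20 + 1, hidden, hidden, hidden]  # motif + aa freqs + length
    return {
        "conv": rng.normal(scale=0.1, size=(w, 20, c)),  # w x A x c filters
        "len_mean": 15.0, "len_std": 5.0,  # placeholder standardization stats
        "fc": [(rng.normal(scale=0.1, size=(dims[i + 1], dims[i])),
                np.zeros(dims[i + 1])) for i in range(3)],
        "out": (rng.normal(scale=0.1, size=(2, hidden)), np.zeros(2)),
    }

def forward(x, params):
    """x: L x 20 one-hot peptide -> length-2 class probabilities."""
    L, A = x.shape
    W = params["conv"]
    w, _, c = W.shape
    # Convolve motif filters across the length dimension.
    feats = np.stack([np.tensordot(x[i:i + w], W, axes=([0, 1], [0, 1]))
                      for i in range(L - w + 1)])
    motif = feats.max(axis=0)   # max pool over length -> length-c vector
    freqs = x.mean(axis=0)      # relative amino acid frequencies (length 20)
    zlen = (L - params["len_mean"]) / params["len_std"]  # standardized length
    h = np.concatenate([motif, freqs, [zlen]])
    for Wfc, b in params["fc"]:  # three hidden FC layers with ReLU
        h = relu(Wfc @ h + b)
    Wout, bout = params["out"]
    return softmax(Wout @ h + bout)  # softmax classification head
```

Running `forward` on any one-hot peptide of length at least $w$ returns a length-2 vector of non-negative class probabilities summing to 1.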
The method of uncertainty minimization is designed to favor exploration to maximize information gained per training iteration of a model. This is achieved by choosing a new training point based on some measure of the uncertainty of the machine learning model used. In this sense, it is well-aligned with an eventual goal of automated experiment selection, because it can minimize the number of necessary experiments to characterize a property space well. This, in turn, would lead to a reduction in operating costs and time spent performing experiments.
The training procedure for this active learning method is shown schematically in Figure 3. The model described previously was used, with $w$ and $c$ chosen as 5 and 6, respectively. Model weights are randomly initialized, and the model uncertainty is calculated as the variance of the output vector of the neural network for each peptide. One peptide is then sampled, with the probability of selecting each peptide weighted by its variance under the current model parameters, i.e. $P(\vec{x}_i) \propto \sigma_i^2$. This differs from Equation 1, where an argmax is used instead of sampling. The chosen peptide is then used to train the neural network for one training iteration, and an additional 5 training iterations are performed with batch size 5, sampling uniformly from all previously observed peptides for training input. Thus, in the first epoch the first point is used for 25 training steps; in the second, the first and second points are each used an expected 12.5 times, and so on. This process is repeated for 25 training epochs in one training run. To gather statistics, 30 training runs of 25 epochs each were performed for each dataset.
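The variance-weighted selection step might look like the following sketch (names are ours, not from the paper's code):

```python
import numpy as np

def sample_next(variances, rng=None):
    """Sample the index of the next training peptide with probability
    proportional to its variance under the current model, i.e.
    P(x_i) ~ sigma_i^2, rather than taking the argmax of Equation 1."""
    rng = np.random.default_rng(rng)
    p = np.asarray(variances, dtype=float)
    # Fall back to uniform if the model is certain about everything.
    p = np.ones_like(p) / len(p) if p.sum() == 0 else p / p.sum()
    return int(rng.choice(len(p), p=p))

# A peptide the model is certain about (zero variance) is never chosen.
assert sample_next([0.0, 3.0, 0.0], rng=0) == 1
```

Sampling (rather than the argmax) keeps some exploration in the selection, since moderately uncertain peptides still have a chance of being chosen.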
The QBC method employed in this work uses the same approach as uncertainty minimization, but instead of a single task model that is trained and used for selection, a committee of 10 models is used. The 9 additional models are structured in the same way as the model used in uncertainty minimization but use different hyperparameters: they differ in the dimensions of the weight matrix used in the convolutional layer, covering combinations of the motif width $w$ and number of motif classes $c$ along with the values used in uncertainty minimization. In QBC, input data is passed to each committee member (task model), and each produces a prediction. The average variance among all models is used as the sampling weight for selecting a new training point, and training is performed in the same way as for uncertainty minimization, with the same number of epochs and training runs.
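A sketch of the committee-disagreement score, assuming each member outputs class probabilities for every candidate in the pool (function name is ours):

```python
import numpy as np

def qbc_weights(committee_probs):
    """committee_probs: (n_models, n_pool, n_classes) predicted class
    probabilities from each committee member. Returns per-candidate
    disagreement: the variance across models, averaged over classes,
    used as the (unnormalized) sampling weight."""
    return committee_probs.var(axis=0).mean(axis=1)

# Two models agree on candidate 0 and disagree on candidate 1.
probs = np.array([[[0.9, 0.1], [0.9, 0.1]],
                  [[0.9, 0.1], [0.1, 0.9]]])
w = qbc_weights(probs)
assert w[0] == 0.0 and w[1] > 0.0  # disagreement concentrates on candidate 1
```

The returned weights can be fed directly to the same variance-weighted sampling used for uncertainty minimization.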
To compare these active learning methods against a base case, we use two control training methods with the same model as used for the uncertainty minimization method. The first control is a "baseline" Adam training, where the model trains on all peptides for 5,000 steps with a batch size of 32. The second, called "random," is a more direct comparison to the active learning methods: peptides are chosen randomly and the model is trained in the same way as in the active learning methods (batch size 5, 5 iterations, 25 epochs).
The meta-learning method used in this work is Reptile [Nichol2018]. It is related to the work by [Finn2017] called model-agnostic meta-learning. In our notation, the goal of meta-learning is to optimize the initial parameters of the task model, $\theta_0$, to work well given a sampled task $T \sim P(T)$. $P(T)$ is taken to be uniform across our datasets. $\theta_0$ is optimized by Adam optimization of a meta-objective function:

$$\min_{\theta_0}\; \mathbb{E}_{T \sim P(T)}\left[ L\left( U_T^k(\theta_0; \phi),\; D_T \right) \right] \tag{3}$$

where $D_T$ is the dataset corresponding to a task $T$, $\phi$ are the parameters defining the active learning method $A$, and $\theta_0$ is the given initial task model parameters. $L$ is the usual loss function and $U_T^k$ is a stand-in for doing $k$ steps of active learning training with $D_T$. The gradient of this meta-objective requires a Hessian, but Reptile approximates this with a Taylor expansion. In this work $k = 5$, meaning we train with active learning on 5 peptides each time with a batch size of 5. 2,500 meta-learning iterations were performed, and early stopping was used to prevent overfitting. This was done on an 80% split of the left-out dataset, and results were reported on the remaining 20% of sequences of the left-out dataset.
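The Reptile update itself is simple; below is a toy sketch. The inner training loop here is a stand-in for the actual active-learning training $U_T^k$, not the paper's model:

```python
import numpy as np

def reptile_step(theta0, inner_train, tasks, k=5, eps=0.1, rng=None):
    """One Reptile meta-update: sample a task T, run k inner training
    steps U_T^k starting from theta0, then move theta0 toward the
    adapted parameters -- no Hessian required."""
    rng = np.random.default_rng(rng)
    task = tasks[rng.integers(len(tasks))]
    theta_k = inner_train(np.copy(theta0), task, k)  # stand-in for U_T^k
    return theta0 + eps * (theta_k - theta0)

# Toy inner loop: each step moves halfway toward the task optimum.
def inner_train(theta, target, k):
    for _ in range(k):
        theta = theta + 0.5 * (target - theta)
    return theta

theta0 = np.zeros(1)
theta0 = reptile_step(theta0, inner_train, [np.array([2.0])], k=5, eps=0.1)
```

Repeating this update over tasks sampled from $P(T)$ drives $\theta_0$ toward initializations that adapt quickly on every task.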
AUC values and accuracies are reported on withheld data amounting to 20% of the dataset size. In the meta-learning results, the reported accuracies and AUCs are for a dataset withheld from training. During meta-learning, the location of active/inactive labels (first or second index) was swapped between tasks to prevent over-fitting to active being in one position or another; this gives minimal zero-shot accuracy. The minimized loss was cross-entropy between label probabilities and true labels. Error bars and individual traces differ due to differences in data splits and random seeds for initial parameters.
3 Results and Discussion
Figure 5 shows the active learning, meta-learning, and baseline results across the datasets for the uncertainty minimization active learning strategy. The baseline results (gray) show that the convolutional neural network provides reasonable results across the range of modeling tasks despite its simple architecture, with the exception of the soluble and Tula-2 tasks. The solubility task is challenging because many of the training examples are actually folded proteins, which require long-range sequence correlations to model properly. The current state of the art on this dataset is an accuracy of 0.77 [Khurana2018DeepSol:Prediction], whereas the convolutional neural network here has an accuracy of 0.56 (barely above random). The Tula-2 peptide affinity task is simply difficult to predict due to the small amount of data and the diversity of sequences. Figure 5 also compares choosing peptides randomly against choosing maximum-uncertainty peptides (uncertainty minimization). Uncertainty minimization (blue line) is in general no better than choosing randomly (dashed green); it is sometimes worse and sometimes better.
To assess the effect of meta-learning on reducing experiment number, it was evaluated both alone and in combination with active learning. Meta-learning results are shown in Figure 5. Meta-learning consistently improves accuracy. This can be seen in the red ML+Random line, which is consistently above the dashed green line (random alone). Combining uncertainty minimization with meta-learning provides no advantage, and mostly decreases accuracy. In many of the datasets, performance approaches baseline levels with only 25 examples, while the baseline is trained on all data. This demonstrates the advantage of using meta-learning.
Receiver operator characteristic (ROC) curves provide an accounting of the balance between type I and type II error. This is important for peptide activity because, due to the large design space, false positives are more detrimental to a model's usefulness. The area under curve (AUC) of a ROC curve gives a scalar number representing the quality of the ROC, although note that here we enforced balanced classes. The ROC AUCs are reported in Table 2. As observed in Figure 5, there is no gain from using uncertainty minimization active learning. Also, 25 examples are not enough to match the baseline models without meta-learning. The variance in this work arises because there are many ways to choose 25 peptides from the datasets and the datasets are small. Significant exceptions are the Tula-2, SHP-2, and soluble datasets, which show poor performance with limited examples relative to baseline models. In particular, SHP-2 seems to require the full dataset to achieve good accuracy. This may be due to the importance of motifs in this dataset [White2013b].
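For reference, ROC AUC can be computed directly from classifier scores as a rank statistic (a generic sketch, not the paper's evaluation code):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """AUC computed directly as the probability that a randomly chosen
    positive outscores a randomly chosen negative (ties count half),
    i.e. the normalized Mann-Whitney U statistic."""
    sp = np.asarray(scores_pos, dtype=float)[:, None]
    sn = np.asarray(scores_neg, dtype=float)[None, :]
    return float((sp > sn).mean() + 0.5 * (sp == sn).mean())

assert roc_auc([0.9, 0.8], [0.1, 0.2]) == 1.0  # perfect separation
assert roc_auc([0.5], [0.5]) == 0.5            # chance level
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking of positives above negatives.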
The QBC results are reported in Table 2 and Figure 8. QBC provides consistently better performance than random choice and uncertainty minimization, and it improves further with meta-learning, giving the best overall accuracy when the two are combined.
Overall, meta-learning improves learning across these data, especially when combined with QBC. Recent analysis of meta-learning has shown mixed results across tasks: [raghu2019rapid] showed that model-agnostic meta-learning methods like Reptile primarily learn to re-use features across tasks. To assess how feature re-use applies to these datasets, we performed a basic sensitivity analysis (Figure 6), which gives insight into the features used in the modeling. Figure 6 shows the features for the baseline model on the antibacterial dataset, along with the zero-shot features for meta-learning. Note that all motif frequencies are 1.0 because these are the meta-learned parameters, not a realization of training. The results show that meta-learning does not recover exactly the features found in the baseline model, although many of the important amino acids are shared (N, W, C, H). Some of the amino acids within the motifs are shared, but they are not identical.
Area under curve (AUC) for receiver operator characteristic curves for classifiers trained with different active learning methods on different datasets. Baseline was trained on all data, whereas others saw 10 peptides according to their active learning strategy. Errors are computed from standard deviation across 30 trials on different data splits and random sampling in active learning strategy. Umin is uncertainty minimization, ML is meta-learning, and QBC is query by committee.
To ensure our conclusions about meta-learning and QBC being preferred for peptide design are robust, we explored three different model choices. These are shown in Figure 7 for only the antibacterial task (although meta-learning traces were trained on all tasks but antibacterial). The first subplot is an ablation of the motif convolutions: without the motifs, accuracy is reduced a small amount and the advantage of QBC and meta-learning disappears. Switching from a cross-entropy loss to an absolute-error loss reduces accuracy and the advantage of QBC over random choice, though meta-learning remains advantageous. The last plot shows the effect of removing the label swapping, which is used to reduce over-fitting to "active" peptides. Zero-shot performance improves, due to the good correlation of activity in peptides across tasks. The QBC method has an improved margin, likely because it has more learnable parameters. Meta-learning is preferred in this setting, when it is known that the class labels of active vs. inactive can be re-used across tasks.
This work has explored active learning and meta-learning strategies for predicting peptide activity on a dataset of 12 different peptide activity tasks. The simple deep convolutional neural networks used here offer reasonable performance across the tasks. We expect more complex models with attention, more layers, and long-range interactions in sequence space could improve the accuracy. Active learning strategies provided improvements over sampling peptides randomly: around a 3-5% increase in accuracy after training the classifier on 25 examples. Meta-learning was found to improve accuracy. These conclusions were found to be robust across loss choice, model structure, and whether or not zero-shot learning is being optimized. This work provides a new peptide multi-task dataset and benchmark results for standard active learning and meta-learning methods.
This material is based upon work supported by the National Science Foundation under grants 1764415 and 1751471.