PepCVAE: Semi-Supervised Targeted Design of Antimicrobial Peptide Sequences

10/17/2018 ∙ by Payel Das, et al. ∙ 0

Given the emerging global threat of antimicrobial resistance, new methods for next-generation antimicrobial design are urgently needed. We report a peptide generation framework, PepCVAE, based on a semi-supervised variational autoencoder (VAE) model, for designing novel antimicrobial peptide (AMP) sequences. Our model learns a rich latent space of the biological peptide context by taking advantage of abundant, unlabeled peptide sequences. The model further learns a disentangled antimicrobial attribute space by using the feedback from a jointly trained AMP classifier that uses limited labeled instances. The disentangled representation allows for controllable generation of AMPs. Extensive analysis of the PepCVAE-generated sequences reveals superior performance of our model in comparison to a plain VAE, as PepCVAE generates novel AMP sequences with higher long-range diversity, while being closer to the training distribution of biological peptides. These features are highly desired in next-generation antimicrobial design.





Hospital-acquired infection is a serious global health concern and is the sixth leading cause of death in the United States, with an estimated cost of $10 billion annually (Peleg and Hooper, 2010). 60-70% of hospital-acquired infections are attributed to Gram-negative bacteria, which are also efficient at generating antibiotic-resistant mutants. Each year, 30 million sepsis cases are reported worldwide, and potentially 5 million deaths occur as a result of antibiotic-resistant infections (Fleischmann et al., 2016). The estimated annual number of deaths directly due to antimicrobial resistance (AMR) is reported to be at least 23,000 in the US alone (CDC) and 700,000 globally. The emergence of multidrug-resistant bacterial strains, also known as priority pathogens (CDC), combined with the dry drug pipeline, calls for urgent development of new approaches to fight AMR. Antimicrobial peptides (AMPs), or host defense peptides, are peptide sequences typically comprised of 10-50 amino acids. AMPs directly disrupt bacterial membrane integrity, leading to membrane pore formation and membranolysis, which is referred to as antimicrobial activity. These peptides are found among all classes of life and are considered potential candidates for next-generation antibiotics, because of their natural antimicrobial properties and a low propensity for development of resistance by microorganisms.

Until now, discovery of many therapeutic molecules has happened either by chance, e.g. the discovery of penicillin, or via exhaustive combinatorial search (Porto et al., 2018). In cerebro design of AMPs involves synthesis or modification of peptide sequences toward certain desired characteristics, e.g. enhancing net positive charge and hydrophobicity, which favors interaction with the negatively charged bacterial membrane. Such approaches suffer from three main obstacles: (1) it is practically impossible to perform an exhaustive search and characterization of the original sequence space; (2) hand-engineering and/or selecting features is frequently needed; and (3) it is often not possible to control the generation process during trial-and-error or brute-force search. Therefore, inverse design of therapeutic molecules remains challenging.

Recently, the combination of big datasets with machine learning methods, like deep generative models, has opened the door towards accelerated molecule discovery by using data-driven approaches. In fact, in recent years, popular deep generative models, such as generative adversarial networks (GAN) (Goodfellow et al., 2014) and variational autoencoders (VAE) (Kingma and Welling, 2013), have been successfully adapted to the development of new molecules (Kadurin et al., 2017; Blaschke et al., 2018; Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Jin et al., 2018). Often, the molecule generation task is approached by formulating the design problem as a natural language generation problem, in which molecules are represented as SMILES strings, i.e. sequences of characters. Similarly, biological molecule (peptide, nucleic acid) generation can be tackled by presenting the sequence as a text string of building blocks: e.g. a peptide as a string of 20 basic amino acid characters, and a nucleic acid as a string of 4 basic nucleotide characters. It has been suggested that biological sequences exhibit characteristics typical of natural-language texts, such as “signature-style” word usage indicative of authors or topics. For example, at a unigram level, AMPs are reported to be rich in cationic (K, R) and hydrophobic (A, C, L) amino acids. Natural language processing and generation algorithms may therefore be adapted to “biological language modeling” (Osmanbeyoglu and Ganapathiraju, 2011; Müller et al., 2018; Nagarajan et al., 2018). It is therefore reasonable to expect that, by exploiting the advent of large-scale data from next-generation sequencing, generative machine learning algorithms can potentially accelerate the discovery of novel AMPs.
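To make the "peptide as text" view concrete, the following is a minimal sketch of how a peptide string could be mapped to integer tokens and how unigram composition could be measured. The special tokens, the vocabulary layout, and the function names are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

# The 20 natural amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical special tokens for a sequence model (an assumption, not from the paper).
SPECIALS = ["<pad>", "<start>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}


def encode(peptide):
    """Map a peptide string to integer token ids, wrapped in start/end tokens."""
    ids = [VOCAB["<start>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in peptide]
    ids.append(VOCAB["<eos>"])
    return ids


def unigram_frequencies(peptides):
    """Relative frequency of each amino acid over a collection of peptides."""
    counts = Counter(aa for p in peptides for aa in p)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Two of the generated sequences listed in Table 3.
example = ["HKERRWRYW", "YHSIFFCFKKIKAK"]
freqs = unigram_frequencies(example)
cationic_fraction = freqs["K"] + freqs["R"]  # share of K and R residues
```

Comparing `cationic_fraction` between an AMP set and a background set is one simple way to check the unigram-level enrichment mentioned above.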

The generative modeling task is difficult mostly because it needs to fulfill three main desiderata: (1) discrete sequence data generation, which is more difficult than continuous data generation; (2) controlled generation with certain attributes (e.g. AMP characteristics) disentangled into controllable “knobs”; and (3) diversity of generated sequences. The diversity of generated peptide sequences is one of the most important features that a good AMP generator should possess. One primary reason for the difficulty in generating diverse sequences is that generative models are usually trained on labeled data only, which is relatively scarce and sparse, and labeling at massive scale remains expensive. Generated sequences from those models demonstrate high identity with known sequences in public databases and contain a restrictive set of amino acid patterns (see Porto et al., 2018, and references therein). Therefore, to achieve diversity, it is necessary to develop semi-supervised generative models that can simultaneously learn from the large unlabeled peptide sequence databases and a limited number of labeled sequences.

In this work, we employ a combination of a VAE and an AMP classifier to learn the disentangled latent space of peptide sequences following (Hu et al., 2017), and generate novel antimicrobial molecules by sampling from the latent space. We refer to this framework as PepCVAE. To this end, we collected and curated a new dataset consisting of two main parts: (1) a large unlabeled dataset (1.7M samples) of peptide sequences, and (2) a smaller set (15k samples) of peptides labeled for antimicrobial activity/inactivity. We demonstrate the advantage of our PepCVAE architecture and approach by rigorously comparing the generated sequences with those from a simple VAE architecture trained only on AMP sequences. The results show that our semi-supervised VAE setup produces a more diverse set of biologically relevant AMP sequences. The proposed approach can therefore be applied to the general task of targeted design of novel molecules and materials.

Related Work

Sequence Generation:

The simplest model one can consider for sequence generation involves a single recurrent neural network (RNN) language model that predicts the most probable next token, given the previous tokens. Müller et al. (2018) have proposed an RNN-based peptide generator that is trained on a set of known AMP sequences. A more versatile approach to sequence generation is based on the VAE framework, which allows sampling new sequences from a continuous latent space. Bowman et al. (2015) first used a VAE for probabilistic generation of natural language, which has more recently been adapted to molecular SMILES sequence generation (Gómez-Bombarelli et al., 2018). Sequence generation using GANs has also been the focus of recent research, although they require special techniques to deal with the non-differentiable nature of sequences (Yu et al., 2017; Kusner and Hernández-Lobato, 2016).

Controlled Sequence Generation: The controlled generation of text based on stylistic attributes has been the focus of a number of recent papers. Hu et al. (2017) have demonstrated that, in a variational encoder-decoder setup, feeding the generator with the attribute information along with the latent variable enables generation with control. Additionally, Hu et al.’s method allows for semi-supervised learning.


VAE-based Sequence Generation

Variational Autoencoder (VAE) is a class of generative models that build on the autoencoder (AE) framework by adding a particular type of regularization. The VAE’s regularization imposes a prior distribution $p(z)$ on the latent codes $z$. This allows the sampling of new instances (not present in the training set) by sampling from the prior distribution $p(z)$. Concretely, the encoder in a VAE parametrizes an approximate posterior distribution $q_E(z|x)$ over $z$ with a neural network conditioned on the input $x$. The prior distribution $p(z)$ is usually a standard Gaussian. The loss function in VAE encourages the model to keep its posterior distributions (encoder) close to the prior $p(z)$, and has the form:

$$\mathcal{L}_{\text{VAE}}(\theta; x) = \mathrm{KL}\big(q_E(z|x)\,\|\,p(z)\big) - \mathbb{E}_{q_E(z|x)}\big[\log p_G(x|z)\big] \quad (1)$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the Kullback-Leibler divergence, $-\mathbb{E}_{q_E(z|x)}[\log p_G(x|z)]$ is the reconstruction loss, $q_E(z|x)$ is the posterior distribution approximated by the encoder, $p_G(x|z)$ is the conditional distribution modeled by the decoder, and $\theta$ is the set of learnable parameters of the AE.
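For reference, the KL term of the VAE loss has a well-known closed form when the approximate posterior is a diagonal Gaussian with mean $\mu$ and variance $\sigma^2$ in $d$ latent dimensions and the prior is a standard Gaussian. The diagonal Gaussian posterior is a common choice that is assumed here; the text does not state it explicitly.

```latex
\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```

The expression is zero exactly when $\mu = 0$ and $\sigma^2 = 1$ in every dimension, i.e. when the posterior coincides with the prior.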

Figure 1: VAE for Modeling Sequences.
Figure 2: Controlled Sequence Generation.

Fig. 1 illustrates the VAE framework for the peptide sequence generation case. During training, known peptide sequences are fed into the encoder, which generates their respective latent codes. The latent code is then decoded into a peptide sequence by the decoder. Next, the two components of the loss are computed and the model is updated using stochastic gradient descent (SGD). In the figure, we detail the inputs for computing the two components of the loss function. After training, one can sample a latent code $z$ from the prior and use the decoder to generate the corresponding peptide sequence.

In our experiments, we use a single-layer Gated Recurrent Unit (GRU) RNN for both the encoder and decoder. The prior $p(z)$ is the standard Gaussian distribution. Previous work (Bowman et al., 2015) shows that a VAE faces optimization challenges when used to model sequences. The most common issue is that the model sets the posterior $q_E(z|x)$ equal to the prior $p(z)$; in this situation, the decoder essentially becomes a language model and ignores $z$. Two solutions proposed by Bowman et al. are: (a) KL annealing, which consists of adding a variable weight to the KL term in the cost function at training time; and (b) word dropout, which consists of randomly replacing a fraction of the conditioned-on tokens with a common unknown word token UNK. We adopt both solutions in our experiments.
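The two remedies can be sketched in a few lines. The linear shape of the annealing schedule, the default weights, the `<unk>` token string, and all function names below are assumptions for illustration; the text does not specify the schedule used.

```python
import random


def kl_weight(step, total_steps, start=0.0, end=1.0):
    """Linearly anneal the KL weight from `start` to `end` over `total_steps`.

    The linear shape is an assumption; other monotone schedules are possible.
    """
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def word_dropout(tokens, rate, unk="<unk>", rng=random):
    """Replace each decoder-input token with `unk` with probability `rate`."""
    return [unk if rng.random() < rate else tok for tok in tokens]


def annealed_loss(reconstruction_loss, kl_term, step, total_steps):
    """VAE objective with the KL term scaled by the current annealing weight."""
    return reconstruction_loss + kl_weight(step, total_steps) * kl_term
```

Early in training the KL weight is small, so the encoder is free to place information in the latent code; word dropout weakens the decoder's access to the previous tokens, which pushes it to rely on that code.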

Semi-supervised Controlled Sequence Generation

Although VAE is a versatile approach to train generative models, it lacks two important qualities to model AMPs: semi-supervised learning and controllable generation. The set of sequences known to be AMPs is small compared to the universe of known peptides. Therefore, when training AMP generators, we want to use semi-supervised methods that leverage not only the known AMP sequences but also the large number of available peptides. An additional design requirement is that the generative model allows “knobs” to control certain properties like AMP activity and toxicity.

Hu et al. (2017) proposed a VAE variant that allows both controllable generation and semi-supervised learning. In order to perform controllable generation, Hu et al.’s approach augments the unstructured latent codes $z$ with a set of structured variables $c$, each of which is trained to control a salient and independent attribute of the sequence: whether it is antimicrobial, toxic, soluble, etc. The model components of Figure 2 are updated in alternation, each with a different loss function. For the newly introduced classifier, we minimize the loss $\mathcal{L}_C$ with respect to the classifier parameters $\theta_C$:

$$\mathcal{L}_C(\theta_C) = \mathbb{E}_{(x, c) \sim X_L}\big[-\log q_C(c|x)\big] + \lambda_u\, \mathbb{E}_{p_G(\hat{x}|z,c)\,p(z)\,p(c)}\big[-\log q_C(c|\hat{x}) + \beta\, \mathcal{H}\big(q_C(c'|\hat{x})\big)\big] \quad (2)$$

where the first expectation is approximated with a minibatch of the small labeled dataset $X_L$, while the second expectation is computed with a minibatch of generated data $\hat{x}$ with $z$ and $c$ sampled from the prior: $z \sim p(z)$, $c \sim p(c)$. The classifier loss requires the attributes of both real and generated sequences to be classified correctly, while minimizing the entropy $\mathcal{H}$ of the classifier encourages it to have high confidence in its predictions on generated data. For the encoder, the loss is unchanged ($\mathcal{L}_{\text{VAE}}$), while for the decoder (generator), the loss becomes:

$$\mathcal{L}_G = \mathcal{L}_{\text{VAE}} + \lambda_c\, \mathcal{L}_{\text{Attr},c} + \lambda_z\, \mathcal{L}_{\text{Attr},z} \quad (3)$$

where $\mathcal{L}_{\text{Attr},c}$ and $\mathcal{L}_{\text{Attr},z}$ enforce correct classification of a minibatch of “soft” generated sentences under the classifier and the encoder, respectively. The full expressions of $\mathcal{L}_{\text{Attr},c}$ and $\mathcal{L}_{\text{Attr},z}$ are omitted; we refer the reader to (Hu et al., 2017).

This method gives a model with meaningful attribute codes $c$, with the major advantage that not all data needs all attribute labels. Specifically, we use a large unlabeled peptide database for the encoder and decoder losses, and a much smaller labeled dataset (peptides with reported antimicrobial annotation) for the classifier loss. We refer to this as semi-supervised generative modeling: using a large unlabeled corpus to capture the sequence distribution with the VAE, and a small labeled corpus to learn the controlling attribute code $c$.
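The classifier objective described in this section (a supervised term on labeled peptides, plus a term on generated peptides that combines correct classification with an entropy penalty) can be sketched numerically as follows. This is an illustration under assumptions, not the authors' implementation: the weights `lambda_u` and `beta` are hypothetical placeholder values, and predicted class probabilities are passed in directly instead of being produced by a network.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classifier_loss(labeled, generated, lambda_u=0.1, beta=0.1):
    """Combine the supervised term with the term on generated samples.

    `labeled` and `generated` are lists of (predicted_probs, class_index) pairs.
    For generated samples, class_index is the attribute code used to generate them.
    `lambda_u` and `beta` are hypothetical values chosen only for illustration.
    """
    supervised = sum(cross_entropy(p, c) for p, c in labeled) / len(labeled)
    on_generated = sum(
        cross_entropy(p, c) + beta * entropy(p) for p, c in generated
    ) / len(generated)
    return supervised + lambda_u * on_generated
```

A confident, correct prediction on a generated sample contributes nothing to the second term, while an uncertain prediction is penalized twice: through the cross-entropy and through the entropy penalty.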

A Dataset for Semi-supervised Training of AMP Generators

We compiled a new two-part dataset for semi-supervised modeling of antimicrobial peptides. The first part, AMP-lab-15K, contains about 15K labeled peptides for which we know whether they are AMPs or not. The second part, Uniprot-unlab-1.7M, contains just over 1.7M unlabeled sequences. In curating AMP-lab-15K, we created the positive set by extracting experimentally validated AMP sequences from two major databases: LAMP (Zhao et al., 2013) and satPDB (Singh et al., 2015). LAMP is a comprehensive database of AMPs with information about their antimicrobial activity and cytotoxicity. It consists of 3,904 natural AMPs and 1,643 synthetic peptides with antimicrobial activity. SatPDB is an integrated database of therapeutic peptides, curated from twenty public domain peptide databases and two datasets. The duplicates between these two datasets were removed to generate a non-redundant AMP dataset. As a preprocessing step, the sequences with non-natural amino acids (B, J, O, U, X, and Z) and the ones with lower case letters were eliminated, resulting in a total of 7960 positive monomeric sequences comprised of the 20 natural amino acids. The AMP-negative peptide sequences in AMP-lab-15K were taken from the negative AMP dataset created by AmPEP (Bhadra et al., 2018). Those sequences were originally retrieved from Uniprot-Trembl, which comprises computer-reviewed sequences (EMBL-EBI, 2018). Then, sequences with any of the following annotations were removed: AMP, membrane, toxic, secretory, defensive, antibiotic, anticancer, antiviral, and antifungal. We only considered unique sequences comprised of natural amino acids. The negative dataset contains 6948 sequences.

For the second part, Uniprot-unlab-1.7M, the unlabeled sequences were retrieved from the Uniprot-Trembl database comprising computer-reviewed sequences (EMBL-EBI, 2018). Again, duplicates and sequences with non-natural amino acids or lower case letters were removed, resulting in a total of 1.7M unlabeled sequences.
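The filtering steps described above (removing duplicates, sequences with non-natural amino acid codes, and sequences with lower case letters) can be sketched as below. The function names are hypothetical, and the optional length cutoff assumes that sequences no longer than max_seq_length are kept.

```python
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")


def is_clean(seq):
    """True if seq is non-empty and uses only the 20 natural amino acids in upper case."""
    return len(seq) > 0 and all(ch in NATURAL for ch in seq)


def curate(sequences, max_seq_length=None):
    """Drop duplicates and unclean sequences, preserving first-seen order.

    If max_seq_length is given, sequences longer than it are dropped as well
    (assumed interpretation of the length hyperparameter).
    """
    seen = set()
    kept = []
    for seq in sequences:
        if seq in seen or not is_clean(seq):
            continue
        if max_seq_length is not None and len(seq) > max_seq_length:
            continue
        seen.add(seq)
        kept.append(seq)
    return kept


raw = ["HKERRWRYW", "HKERRWRYW", "ACDXK", "acdek", "IIYLIWWWLNWV"]
print(curate(raw))                     # ['HKERRWRYW', 'IIYLIWWWLNWV']
print(curate(raw, max_seq_length=10))  # ['HKERRWRYW']
```

Because lower case letters and the codes B, J, O, U, X, and Z are simply absent from `NATURAL`, a single membership check covers both exclusion rules.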

When training the VAE model, we subselect sequences with length at most the hyperparameter max_seq_length ($l$). Furthermore, both AMP-lab-15K and Uniprot-unlab-1.7M were split into train, heldout, and test sets. This reduces the number of sequences available for training; e.g. for Uniprot-unlab-1.7M, the number of available training sequences is 93k for l=25 and 168k for l=30.


Experimental Setup

The model architecture from (Hu et al., 2017) was implemented in PyTorch. For Eq 3, we use the Uniprot-unlab-1.7M unlabeled sequences, while for the first term of Eq 2 we use our AMP-lab-15K dataset. We consider one iteration to be a single stochastic update of each of the classifier, generator, and encoder, with minibatch size 32. Unless otherwise noted, we pretrain the VAE for 20k iterations, followed by 5k iterations of full model training. We noticed a slight advantage from KL annealing, i.e. ramping the KL weight from its initial to its final value during VAE pretraining. Further hyperparameters (learning rate, balancing weights, and entropy weight) mostly follow Hu et al. (2017).

max_seq_length (l)      15       20       25       30
PepCVAE              82.17    82.11    84.33    83.04
VAE                  83.37    82.95    84.56    83.25
Unlabeled            60.16    47.98    38.26    28.06
Table 1: Accuracy (%) of the independently trained LSTM-based AMP classifier on generated sequences and random Uniprot-Trembl sequences.

Evaluation Metrics

In our framework we propose three escalating levels of evaluation. Level 1 comprises preliminary automated evaluation based on peptide heuristics (sequence similarity, diversity, uniqueness, and molecular characteristics) and an external AMP classifier. Level 2 consists of an AMP potency (high/low) ranking model and ab initio structure prediction. Level 3 consists of full-blown in silico atomistic simulations and wet lab experiments. In this work we performed level 1 and 2 evaluations. The more expensive level 3 evaluation will be covered in future work. All evaluation statistics presented in this work are averaged over 3 runs with different random seeds. We present results with l = 15, 20, 25, and 30. For evaluation we generate 5000 sequences.

Figure 3: Difference in amino acid composition between high-confidence and low-confidence AMP sequences, as returned by the LSTM-based AMP classifier.

AMP Classification Accuracy: We measure the efficacy of the models in generating positive AMP sequences by assessing the accuracy of a pretrained AMP classifier on a set of generated sequences. We independently trained an LSTM-based AMP classifier using a dataset of 8,944 labeled examples for training, 2,982 examples for validation, and another 2,982 for testing. Our LSTM-based classifier achieves 81% overall accuracy on the held-out test examples, with 87% accuracy for positive AMP examples and 74% accuracy for negative examples. This accuracy on positive samples is comparable to the models reported in the literature (Bhadra et al., 2018; Veltri et al., 2018), which were all trained on a much smaller set of AMPs. The relatively lower test accuracy on the negative samples is expected, as the majority of negative instances lack experimental validation, so there is an intrinsically low confidence associated with their label annotation.

Sequence Similarity and Uniqueness: Pairwise sequence similarity was estimated using the widely used BLOSUM62 amino acid substitution matrix (Henikoff and Henikoff, 1992). A penalty of -10 was assigned to a gap opening and -1 penalty was assigned to a gap extension. The final results were robust against the choice of gap penalty values. The sequence similarity was normalized with the logarithm of query sequence length. A positive value suggests stronger evolutionary relationship between two sequences.

Sequence Diversity: Three different metrics are used to evaluate sequence diversity. (1) Language model perplexity. A character-level LSTM language model (LM) (Merity et al., 2017) trained on the labeled AMP/non-AMP sequences was used to estimate the perplexity (PPL) of the generated sequences. A lower value of PPL suggests “closeness” of the sequence to the original distribution the LM was trained on. (2) $n$-gram entropy, that is, the information entropy per character computed over $n$-grams. From it we define the relative entropy gain of PepCVAE sequences with respect to VAE sequences. We further mixed generated samples with original samples at a 1:1 ratio and again estimated the relative entropy gain in the same manner. (3) The diversity is also estimated by measuring the number of shared $n$-grams (Osmanbeyoglu and Ganapathiraju, 2011), for different values of $n$, between the generated sequences and the original ones, which we refer to as the $n$-gram shared similarity. We report this similarity for PepCVAE relative to VAE; a lower value therefore implies more diversity of PepCVAE sequences at a particular $n$ compared to that of the VAE ones.
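As an illustration of metrics (2) and (3), the sketch below computes an $n$-gram entropy per character, a relative entropy gain between two sets of sequences, and the fraction of shared $n$-grams. The exact formulas are assumptions made for this example (Shannon entropy of the pooled $n$-gram distribution divided by $n$, and a gain normalized by the baseline entropy); the definitions actually used in the paper may differ.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """All overlapping n-grams of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def ngram_entropy(sequences, n):
    """Shannon entropy (bits) of the pooled n-gram distribution, divided by n."""
    counts = Counter(g for s in sequences for g in ngrams(s, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / n


def relative_entropy_gain(seqs_a, seqs_b, n):
    """(H_n(a) - H_n(b)) / H_n(b): an assumed form of the relative gain.

    Requires H_n(b) > 0, i.e. the baseline set must contain more than one distinct n-gram.
    """
    h_a, h_b = ngram_entropy(seqs_a, n), ngram_entropy(seqs_b, n)
    return (h_a - h_b) / h_b


def shared_ngram_fraction(generated, reference, n):
    """Fraction of distinct n-grams in `generated` that also occur in `reference`."""
    gen = {g for s in generated for g in ngrams(s, n)}
    ref = {g for s in reference for g in ngrams(s, n)}
    return len(gen & ref) / len(gen) if gen else 0.0
```

With these helpers, a positive `relative_entropy_gain(pepcvae_seqs, vae_seqs, n)` would indicate that the first set is more diverse at that $n$, and a smaller `shared_ngram_fraction` against the training AMPs would indicate fewer copied fragments.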

Molecular Characteristics: Peptide characteristics, e.g. hydrophobicity, charge, were estimated using the GlobalAnalysis method in modLAMP (Müller et al., 2017).

AMP Ranking Model: In level 2 evaluation, PepCVAE-generated sequences with high antimicrobial probability, as predicted by the external LSTM classifier, were selected and ranked according to their predicted potency (high/low). For ranking, an LSTM model was trained on 1200 independent sequences with broad-spectrum antimicrobial activities from (Pirtskhalava et al., 2015).

Ab Initio Structure Prediction: For structure prediction of sequences, PEP-FOLD3 server (Lamiable et al., 2016) was used, which employs structural alphabets (SA) to describe the structure of four consecutive amino acids, couples the predicted series of SA letters to a greedy algorithm and a coarse-grained force field, generates 3D structures and finally sorts them according to energy.

Experimental Results

Table 1 presents the classification accuracy of the AMP sequences generated by PepCVAE. Specifically, we generate sequences given an attribute code $c$ that matches AMP, and then use the pre-trained LSTM-based AMP classifier to assign AMP/non-AMP labels to the generated sequences. We compare the classification accuracy of PepCVAE-generated AMPs with that obtained by using a plain VAE trained solely on AMP sequences. The probability of randomly selected unlabeled sequences from the Uniprot-Trembl database being predicted as AMP by the classifier is also shown for comparison. It is evident from Table 1 that the AMP sequences generated by both PepCVAE and the plain VAE are predicted to be “active” with a probability above 82%, which is significantly higher than the predicted probability for the unlabeled training sequences of all lengths.

In Fig. 3, we show, for a set of 5K sequences generated by PepCVAE, the difference in unigram distribution between the high-confidence and low-confidence samples, as returned by the LSTM-based AMP classifier. This difference expresses the positive contribution of cationic (K, R) and hydrophobic (A, C, L, I, W) amino acids towards determining AMP character, which is consistent with features identified by existing AMP classifiers (Bhadra et al., 2018; Veltri et al., 2018).

Figure 4: Relative entropy gain of PepCVAE sequences w.r.t. VAE (solid lines - without mixing, dashed lines - with mixing). See Evaluation Metrics section for details.

Figure 5: $n$-gram shared similarity of PepCVAE over VAE sequences as a function of $n$. A lower value implies higher diversity of PepCVAE sequences.

Peptide Heuristics

Model        Length   Uniq-3   Uniq-4   Sself    Sorig    PPL
Training     15.53    16.00    52.40    -6.49    NA        3.57
PepCVAE      16.55    11.40    68.10    -6.52    -6.62    30.27
VAE          15.12     9.90    72.00    -7.49    -7.51    32.72
Unlabeled    18.84     4.90    68.70    -5.68    -6.71    28.77
Random       15.03     0.02    83.00    -6.90    -7.63    38.10
Table 2: Sequence heuristics for $l$ = 25. Positive instances from PepCVAE and plain VAE were compared with training AMPs, random sequences, and unlabeled Uniprot sequences.

Given that both PepCVAE and VAE generate AMP sequences with high probability, we next estimate a number of heuristics that give some clues about how similar or dissimilar the generated AMPs are compared to the ones present in the training set. The heuristics are sequence length, fraction of unique 3- and 4-grams, sequence similarity estimated using a standard amino acid substitution matrix, and language model perplexity (PPL). Table 2 presents these estimates for l=25. For the purpose of comparison, the values corresponding to random peptides and unlabeled sequences from Uniprot of similar length are also provided. The VAE sequences are closer in length to the original ones, while the PepCVAE ones are relatively longer. Both methods result in AMPs with lower uniqueness at the 3-gram level and higher uniqueness at the 4-gram level, with respect to the original ones. PepCVAE sequences appear closer to the original ones in terms of 3- and 4-gram uniqueness.

The average pairwise sequence similarity within the set of generated sequences itself, Sself, is consistently higher for PepCVAE compared to VAE (Table 2). Sself of PepCVAE sequences is closer to that of the biological distribution (both training and unlabeled). This is meaningful, as a higher self-similarity or “homology” implies a stronger evolutionary relation between the sequences. In this sense, the extent of homology within the PepCVAE sequences is higher than within the VAE ones and matches more closely that of existing AMPs. Sorig further suggests that PepCVAE sequences possess a stronger evolutionary relationship with the actual AMPs as well as with the unlabeled biological peptides. The high evolutionary dissimilarity of VAE sequences (both with themselves and with biological sequences) is reminiscent of random sequences (see Table 2). It is likely that PepCVAE learns a more “biologically plausible” latent space by exploiting a much larger dataset that includes unlabeled and negative sequences as well as the positive ones. These results suggest that the PepCVAE architecture intrinsically inserts more “biological” character/context during controlled generation of AMPs. The perplexity value (PPL) returned by an LSTM language model that was independently trained on the AMP-lab-15K dataset further supports this observation. PepCVAE sequences are lower in perplexity and are closer to biological sequences, whereas the higher PPL of VAE samples implies a more random-like character.

Sequence diversity

Next, we analyze the diversity of the generated sequences in terms of $n$-gram entropy gain (see Evaluation Metrics section). Figure 4 plots the relative entropy gain of PepCVAE sequences with respect to VAE as a function of $n$-gram size. We observe that the gain is negative for small $n$, while increasing to positive values for larger $n$. This result implies that, although VAE sequences are more diverse locally (small $n$), PepCVAE sequences demonstrate strong long-range diversity. Consistent with this result, the $n$-gram similarity, i.e. the fraction of shared $n$-grams with training AMPs, is lower for PepCVAE than for VAE at larger $n$ (Figure 5 and Evaluation Metrics section). Even though PepCVAE generates diverse sequences, at short range it is still consistent with biological sequences, as evident from the language model perplexity values (Table 2). In summary, the PepCVAE sequences show stronger diversity at higher $n$-grams. High peptide diversity compared to existing AMPs is a desired feature when designing next-generation antimicrobials, as it can potentially help prevent antimicrobial resistance.

Figure 6: Comparison of molecular characteristics between training data (training-pos - orange, training-neg - blue), PepCVAE sequences (cvae-pos - purple, cvae-neg - green), and VAE sequences (vae-pos - yellow). Horizontal dashed lines indicate the mean. Whiskers extend to the most extreme non-outlier data points. (a) amino acid distribution, (b) total charge distribution, (c) Eisenberg hydrophobicity, and (d) Eisenberg hydrophobic moment.

Sequence          Length   Charge   H       μH     Structure
MWHFIWYLILLPRR    13       3.0       0.30   0.37   Helix
LWNYWFLWSAFRAF    14       2.0       0.45   0.19   Helix
YHSIFFCFKKIKAK    14       5.0       0.07   0.28   Helix
IIYLIWWWLNWV      12       1.0       0.85   0.39   Helix
HKERRWRYW         9        4.0      -0.92   0.35   Helix
Table 3: High-potency AMP sequences (and their features) generated by PepCVAE with $l$ = 15.

Molecular Characteristics

Figure 6 compares PepCVAE and VAE sequences with the training data in terms of molecular features, e.g. charge, hydrophobicity (H), and hydrophobic moment (μH), which are of particular interest, as they play a key role in determining the membrane binding specificity (Fjell et al., 2012).

The amino acid composition, net positive charge, and hydrophobicity (H) (Fig. 6a-c) of the AMPs generated by PepCVAE and VAE match the training data well, suggesting that both models perform equally well in capturing the charge patterning, hydrophobicity, and composition within AMPs. The hydrophobic moment (μH), a measure of the helical character within the sequence, is another frequently used descriptor in cationic amphipathic AMP classification (Bhadra et al., 2018). Both VAE and PepCVAE generate sequences with hydrophobic moment values consistent with, albeit slightly lower than, existing AMPs.
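For readers unfamiliar with these descriptors, the sketch below computes a mean hydrophobicity and a helical hydrophobic moment over a whole sequence. It is an illustration under assumptions: the scale values are the Eisenberg consensus scale as commonly tabulated, the 100° angle corresponds to an ideal α-helix, and the result is not guaranteed to match the modLAMP GlobalAnalysis output used in the paper, which may use a different scale variant or a sliding window.

```python
import math

# Eisenberg consensus hydrophobicity scale (values as commonly tabulated; treat as an assumption).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def mean_hydrophobicity(seq):
    """Average hydrophobicity per residue."""
    return sum(EISENBERG[aa] for aa in seq) / len(seq)


def hydrophobic_moment(seq, angle_deg=100.0):
    """Length-normalized hydrophobic moment, assuming one residue every `angle_deg` degrees."""
    delta = math.radians(angle_deg)
    sin_sum = sum(EISENBERG[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(EISENBERG[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


h = mean_hydrophobicity("HKERRWRYW")   # a sequence from Table 3
mu_h = hydrophobic_moment("HKERRWRYW")
```

The moment is large when hydrophobic residues cluster on one face of the helix and polar or charged residues on the other, which is the amphipathic pattern typical of many AMPs.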

AMP Potency ranking and structure prediction

We first selected 45 high probability AMPs out of 5000 PepCVAE sequences and ranked them according to the predicted potency. Next, 3D models of the final 11 high-potency AMP sequences were constructed using the PEP-FOLD3 server (Lamiable et al., 2016). Out of those 11 candidates, the lowest energy model of 9 sequences consistently exhibited a helix (for examples see Table 3), one showed an extended structure, and one revealed coil. As amphipathic helices are abundant in antimicrobial peptides and determine their activity, this multi-level in silico screening scheme successfully identifies high potency, broad-spectrum antimicrobial candidates.

Conclusion and Future Work

We present a peptide sequence design framework, PepCVAE, based on a semi-supervised variational autoencoder model for generating novel antimicrobial peptide molecules. We curated a dataset that comprises a large number (1.7M) of unlabeled peptide sequences and a smaller set (15k) of labeled (AMP/non-AMP) sequences. The model architecture allows learning of a representation where desired properties are disentangled, and so can handle controlled generation of peptide sequences with AMP/non-AMP characteristics. Extensive analysis of the generated antimicrobial sequences reveals that the proposed framework is capable of learning and generating from a richer representation and yields AMPs that are closer to the original distribution, when compared with a vanilla VAE trained solely on AMP sequences. The generated AMPs from our architecture exhibit high diversity, particularly at a higher $n$-gram level, while still retaining biological characteristics, such as stronger homology and high helicity. These peptide characteristics suggest that the present framework is well-suited for therapeutic molecule design, where it is important to maintain control over “knobs” or attributes while generating novel samples. In the future, we plan to validate the generated AMP sequences using in silico modeling and wet lab experiments.


  • Bhadra et al. (2018) P. Bhadra, J. Yan, J. Li, S. Fong, and S. W. Siu. Ampep: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep., 2018.
  • Blaschke et al. (2018) T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, and H. Chen. Application of generative autoencoder in de novo molecular design. Molecular informatics, 2018.
  • Bowman et al. (2015) S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv:1511.06349, 2015.
  • CDC. Antibiotic resistance threats in the United States, 2013.
  • EMBL-EBI (2018) P. EMBL-EBI, SIB. Universal Protein Resource (UniProt)., 2018. [Online; accessed August-2018].
  • Fjell et al. (2012) C. D. Fjell, J. A. Hiss, R. E. Hancock, and G. Schneider. Designing antimicrobial peptides: form follows function. Nature reviews Drug discovery, 11(1):37, 2012.
  • Fleischmann et al. (2016) C. Fleischmann, A. Scherag, N. K. Adhikari, C. S. Hartog, T. Tsaganos, P. Schlattmann, D. C. Angus, and K. Reinhart. Assessment of global incidence and mortality of hospital-treated sepsis. current estimates and limitations. Am. J. Respir. Crit. Care Med., 2016.
  • Gómez-Bombarelli et al. (2018) R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2018.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc NIPS, 2014.
  • Henikoff and Henikoff (1992) S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 1992.
  • Hu et al. (2017) Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. arXiv preprint arXiv:1703.00955, 2017.
  • Jin et al. (2018) W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv:1802.04364, 2018.
  • Kadurin et al. (2017) A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, and A. Zhavoronkov. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 2017.
  • Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.
  • Kusner and Hernández-Lobato (2016) M. J. Kusner and J. M. Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv:1611.04051, 2016.
  • Kusner et al. (2017) M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
  • Lamiable et al. (2016) A. Lamiable, P. Thévenet, J. Rey, M. Vavrusa, P. Derreumaux, and P. Tufféry. PEP-FOLD3: faster de novo structure prediction for linear peptides in solution and in complex. Nucleic Acids Research, 44(W1):W449–W454, 2016.
  • Merity et al. (2017) S. Merity, N. S. Keskar, and R. Socher. Regularizing and Optimizing LSTM Language Models. arXiv:1708.02182, 2017.
  • Müller et al. (2017) A. T. Müller, G. Gabernet, J. A. Hiss, and G. Schneider. modlAMP: Python for antimicrobial peptides. Bioinformatics, 2017.
  • Müller et al. (2018) A. T. Müller, J. A. Hiss, and G. Schneider. Recurrent neural network model for constructive peptide design. Journal of chemical information and modeling, 2018.
  • Nagarajan et al. (2018) D. Nagarajan, T. Nagarajan, N. Roy, O. Kulkarni, S. Ravichandran, M. Mishra, D. Chakravortty, and N. Chandra. Computational antimicrobial peptide design and evaluation against multidrug-resistant clinical isolates of bacteria. Journal of Biological Chemistry, 2018.
  • Osmanbeyoglu and Ganapathiraju (2011) H. U. Osmanbeyoglu and M. K. Ganapathiraju. N-gram analysis of 970 microbial organisms reveals presence of biological language models. BMC bioinformatics, 2011.
  • Peleg and Hooper (2010) A. Y. Peleg and D. C. Hooper. Hospital-acquired infections due to Gram-negative bacteria. New England Journal of Medicine, 2010.
  • Pirtskhalava et al. (2015) M. Pirtskhalava, A. Gabrielian, P. Cruz, H. L. Griggs, R. B. Squires, D. E. Hurt, M. Grigolava, M. Chubinidze, G. Gogoladze, B. Vishnepolsky, et al. DBAASP v.2: an enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic Acids Research, 44(D1):D1104–D1112, 2015.
  • Porto et al. (2018) W. F. Porto, L. Irazazabal, E. S. Alves, S. M. Ribeiro, C. O. Matos, Á. S. Pires, I. C. Fensterseifer, V. J. Miranda, E. F. Haney, V. Humblot, et al. In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nature Commun., 9(1):1490, 2018.
  • Singh et al. (2015) S. Singh, K. Chaudhary, S. K. Dhanda, S. Bhalla, S. S. Usmani, A. Gautam, A. Tuknait, P. Agrawal, D. Mathur, and G. P. Raghava. SATPdb: a database of structurally annotated therapeutic peptides. Nucleic Acids Research, 2015.
  • Veltri et al. (2018) D. Veltri, U. Kamath, and A. Shehu. Deep learning improves antimicrobial peptide recognition. Bioinformatics, 2018.
  • Yu et al. (2017) L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
  • Zhao et al. (2013) X. Zhao, H. Wu, H. Lu, G. Li, and Q. Huang. LAMP: a database linking antimicrobial peptides. PLoS ONE, 2013.