Fast differentiable DNA and protein sequence optimization for molecular design

05/22/2020
by   Johannes Linder, et al.
University of Washington

Designing DNA and protein sequences with improved or novel function has the potential to greatly accelerate synthetic biology. Machine learning models that accurately predict biological fitness from sequence are becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, this method is limited by technical challenges, as it suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence. Here, we build on a previously proposed straight-through approximation method to optimize through discrete sequence samples. By normalizing nucleotide logits across positions and introducing an adaptive entropy variable, we remove bottlenecks arising from overly large or skewed sampling parameters. This results in a markedly improved algorithm with up to 100-fold faster convergence. Moreover, our method finds improved fitness optima compared to existing methods, including the original algorithm without normalization and global optimization heuristics such as Simulated Annealing. We demonstrate our improved method by designing DNA and enzyme sequences for six deep learning predictors, including a protein structure predictor (trRosetta).


1 Introduction

Rational design of DNA and protein sequences enables rapid development of novel drug molecules, vaccines, biological circuits and more. Most design methods are guided by predictive models that reliably relate sequence to fitness or function. In recent years, these models have increasingly been based on deep learning (Eraslan et al., 2019; Zou et al., 2019; Tareen et al., 2019). For example, neural networks have been trained to predict transcription factor binding (Alipanahi et al., 2015; Avsec et al., 2019), chromatin modifications (Zhou et al., 2015) and transcriptional activity (Movva et al., 2019). At the level of RNA, they have been used to predict translation (Sample et al., 2019), splicing (Jaganathan et al., 2019; Cheng et al., 2019) and polyadenylation (Bogard et al., 2019). Neural networks have even been used to predict protein structure (AlQuraishi, 2019; Senior et al., 2020; Yang et al., 2020), and generative models based on neural networks are readily applied for protein design (Yang et al., 2019; Riesselman et al., 2019; Costello et al., 2019; Brookes et al., 2019; Gupta et al., 2019).

A direct approach to sequence design based on a differentiable fitness predictor is to optimize the input pattern by gradient ascent (Lanchantin et al., 2016; Killoran et al., 2017; Bogard et al., 2019). The method, commonly referred to as activation maximization, uses the gradient of the neural network output to make incremental changes to the input. While simple, activation maximization cannot be directly applied to sequences, as they are discrete and not amenable to gradient ascent. Several extensions have been proposed to rectify this; Killoran et al. (2017) used a softmax layer to turn the sequences into continuous relaxations, and Bogard et al. (2019) used straight-through gradients to optimize discrete samples. However, both methods suffer from vanishing gradients. Furthermore, continuous input relaxations may cause predictor pathologies leading to poor convergence.

Here, we combine discrete nucleotide sampling and straight-through gradients with normalization across the parameters – nucleotide logits – of the sampling distributions, resulting in a markedly improved sequence design method. We demonstrate a 100-fold optimization speedup, and improved optima, for a number of DNA-, RNA- and protein design tasks. We further show that our method outperforms global search meta-heuristics such as Simulated Annealing.

2 Background

Given a sequence-predictive neural network $\mathcal{P}$ and an initial input pattern $x$, the gradient ascent method seeks to maximize the predicted fitness $\mathcal{P}(x)$ by tuning the input pattern $x$:

$$x^{\star} = \underset{x}{\arg\max}\; \mathcal{P}(x) \qquad (1)$$

Assuming $\mathcal{P}$ is differentiable, we can compute the gradient with respect to the input and optimize $x$ by updating it in the direction of the fitness gradient (Simonyan et al., 2013):

$$x^{(t+1)} = x^{(t)} + \eta \cdot \frac{\partial \mathcal{P}\big(x^{(t)}\big)}{\partial x^{(t)}} \qquad (2)$$

However, sequences are usually represented as one-hot coded patterns ($x \in \{0, 1\}^{N \times M}$, where $N$ is the sequence length and $M$ the number of channels), and discrete variables cannot be optimized by gradient ascent. Several different reparameterizations of $x$ have been proposed to bypass this issue. In one of the earliest implementations, Lanchantin et al. (2016) represented the sequence as an unstructured, real-valued pattern ($x \in \mathbb{R}^{N \times M}$) but imposed an $L_2$ penalty on $x$ in order to keep it from growing too large and causing predictor pathologies. Killoran et al. (2017) later introduced a softmax reparameterization, turning $x$ into a continuous relaxation $\sigma(l)$:

$$\sigma(l)_{ij} = \frac{e^{l_{ij}}}{\sum_{k=1}^{M} e^{l_{ik}}} \qquad (3)$$

Here, $l \in \mathbb{R}^{N \times M}$ are differentiable nucleotide logits. The gradient of $\sigma(l)_{ij}$ with respect to $l_{ik}$ is defined as:

$$\frac{\partial \sigma(l)_{ij}}{\partial l_{ik}} = \sigma(l)_{ij} \left(\delta_{jk} - \sigma(l)_{ik}\right) \qquad (4)$$

Given Equations 3 and 4, we can maximize $\mathcal{P}(\sigma(l))$ with respect to the logits $l$ using the gradient $\partial \mathcal{P}(\sigma(l)) / \partial l$. While elegant, there are two issues with this architecture. First, the gradient in Equation 4 becomes vanishingly small for large logit magnitudes (when $\sigma(l)_{ij} \to 1$ or $\sigma(l)_{ij} \to 0$), halting convergence. Second, sequence-predictive neural networks have only been trained on discrete one-hot coded patterns, and the predictive power of $\mathcal{P}$ may be poor on a continuous relaxation such as $\sigma(l)$.
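For concreteness, the softmax-relaxed scheme can be sketched in a few lines of TensorFlow/Keras (the framework used for the experiments in Section 5). This is a minimal sketch, not the authors' implementation: the placeholder `predictor` network, sequence length `N` and channel count `M` are assumptions, and the loop simply ascends the fitness gradient of Equation 2 with respect to the logits of Equation 3.

```python
import tensorflow as tf

# Minimal sketch of softmax-relaxed activation maximization (Eqs. 1-4).
# The predictor is a stand-in; any differentiable Keras model mapping an
# (N x M) pattern to a scalar fitness score could be plugged in.
N, M = 50, 4
predictor = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(N, M)),
    tf.keras.layers.Dense(1)
])

logits = tf.Variable(tf.random.normal([1, N, M]))    # trainable nucleotide logits l
optimizer = tf.keras.optimizers.Adam()

for _ in range(100):
    with tf.GradientTape() as tape:
        x_relaxed = tf.nn.softmax(logits, axis=-1)    # continuous relaxation sigma(l) (Eq. 3)
        loss = -tf.reduce_mean(predictor(x_relaxed))  # maximize fitness = minimize its negative
    grads = tape.gradient(loss, [logits])
    optimizer.apply_gradients(zip(grads, [logits]))
```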

Following advances in gradient estimators for discretized neurons (Bengio, Léonard, & Courville, 2013; Courbariaux et al., 2016), Bogard et al. (2019) developed a version of the gradient ascent design method replacing the softmax transform $\sigma(l)$ with a discrete, stochastic sampler $z$:

$$z_{i} \sim \mathrm{Cat}\big(\sigma(l)_{i1}, \dots, \sigma(l)_{iM}\big) \qquad (5)$$

Here, $z_i$ is a randomly drawn categorical nucleotide at the $i$:th position from the (softmax) probability distribution $\sigma(l)_i$. The nucleotide logits $l$ are thus used as parameters to categorical distributions, from which we sample nucleotides and construct a discrete, one-hot coded pattern $z \in \{0, 1\}^{N \times M}$. While $z$ is not directly differentiable, $l$ can be updated based on an estimate of $\partial \mathcal{P}(z) / \partial l$ using a straight-through (ST) approximation. Rather than using the original ST estimator $\partial z_{ij} / \partial l_{ij} = 1$ (Bengio et al., 2013), we here adapt an estimator with theoretically better properties from Chung et al. (2016), where the gradient of $z$ is replaced by that of the softmax $\sigma(l)$:

$$\frac{\partial z_{ij}}{\partial l_{ik}} \approx \frac{\partial \sigma(l)_{ij}}{\partial l_{ik}} \qquad (6)$$

Using discrete samples $z$ as input to $\mathcal{P}$ removes any pathologies arising from continuous input relaxations. But, as we show below, convergence remains almost as slow as with the softmax method. Switching to the original ST estimator ($\partial z_{ij} / \partial l_{ij} = 1$) speeds up convergence, but results in worse fitness optima.
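In TensorFlow, the sampler of Equation 5 combined with the softmax ST estimator of Equation 6 can be sketched with the usual stop-gradient trick. The function name and tensor shapes below are our own choices rather than the paper's code.

```python
import tensorflow as tf

def sample_one_hot_st(logits):
    """Draw discrete one-hot nucleotides from softmax(logits) (Eq. 5), while
    routing gradients through the softmax relaxation (softmax ST, Eq. 6)."""
    batch, length, channels = logits.shape
    probs = tf.nn.softmax(logits, axis=-1)               # sigma(l)
    flat = tf.reshape(logits, [-1, channels])
    idx = tf.random.categorical(flat, num_samples=1)     # one categorical draw per position
    z = tf.one_hot(tf.reshape(idx, [batch, length]), depth=channels)
    # Forward pass: discrete sample z. Backward pass: gradient of sigma(l).
    return probs + tf.stop_gradient(z - probs)
```

Calling `predictor(sample_one_hot_st(logits))` inside a `tf.GradientTape` then yields the straight-through gradient estimate of Equation 6 with respect to the logits.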

3 Related work

A wide selection of discrete search algorithms and meta-heuristics has been applied to computational sequence design, for example Genetic Algorithms (Deaton et al., 1996), Simulated Annealing (Hao et al., 2015) and Particle Swarms (Xiao et al., 2009; Ibrahim et al., 2011; Mustaza et al., 2011). Gradient-based optimization has a clear advantage over these methods: it makes stepwise local improvements to all nucleotides at once based on a gradient direction, rather than heuristically making a small number of changes and evaluating the effect afterwards.

Our design method is closely related to work by Bogard et al. (2019), which used Equations 5 and 6 to optimize a set of nucleotide logits through discrete sequence samples. We extend this method by normalizing the logits across sequence positions and introducing an adaptive entropy parameter. Our method is also similar to other gradient ascent-based methods (Lanchantin et al., 2016; Killoran et al., 2017), but differs in that we pass discrete samples as input. Finally, we recently proposed a deep generative model (Deep Exploration Network – DEN) which learns to sample a distribution of diverse, high-fitness sequences (Linder et al., 2019). DENs can generate large-scale sequence sets very efficiently. However, they first require selecting an appropriate generative network, whereas the method presented here is model- and parameter-free, making it easier to use when designing smaller sequence sets.

4 Logit normalization - balancing entropy and stochasticity

Figure 1: (A) The gradient ascent architecture. A normalization layer is prepended to the softmax layer, whose output parameterizes a sampling layer. (B) Generation of sequences maximizing DragoNN (SPI1), DeepSEA (CTCF Dnd41), MPRA-DragoNN (SV40), Optimus 5’ and APARENT.

Inspired by instance normalization in image GANs (Ulyanov et al., 2016), we hypothesized that the main bottleneck in earlier design methods stems from overly large and disproportionately scaled nucleotide logits. Here, we mitigate this problem by normalizing the logits across positions, i.e. we insert a normalization layer between the trainable logits and the sampling layer (Figure 1A).

For DNA sequence design, where the number of one-hot channels is small ($M = 4$), we use a normalization scheme commonly referred to as instance normalization. In this scheme, the nucleotide logits of each channel are normalized independently across positions. Let $\mu_j$ and $s_j$ be the sample mean and deviation of the logits for nucleotide $j$ across all positions $i$. For each step of gradient ascent, we compute the normalized logits $l^{\text{norm}}_{ij}$ as:

$$l^{\text{norm}}_{ij} = \frac{l_{ij} - \mu_j}{s_j} \qquad (7)$$

Since logits with zero mean and unit variance have limited expressiveness when used as parameters to a probability distribution, we associate each channel $j$ with a global scaling parameter $\gamma_j$ and offset $\beta_j$. Having an independent offset per channel is particularly well-suited for DNA, as nucleotides are often associated with a global preferential bias. The scaled, re-centered logits $\tilde{l}_{ij}$ are calculated as:

$$\tilde{l}_{ij} = \gamma_j \cdot l^{\text{norm}}_{ij} + \beta_j \qquad (8)$$

For protein sequence design, the number of one-hot channels is considerably larger ($M = 20$), resulting in fewer samples per channel and noisier normalization statistics. Here we found that layer normalization was more stable: we compute a global mean $\mu$ and deviation $s$ across all positions and channels, and use a single scale $\gamma$ and offset $\beta$ for all channels.

Given the normalized and scaled logits $\tilde{l}$ as parameters for the nucleotide sampler defined in Equation 5, we maximize $\mathcal{P}(z)$ with respect to $l$, $\gamma_j$ and $\beta_j$ (or the single $\gamma$ and $\beta$ in the context of proteins). Using the softmax ST estimator from Equation 6, we arrive at the following gradients:

(9)
(10)
(11)

The normalization removes logit drift by keeping the values proportionally scaled and centered at zero (zero mean and unit deviation across positions), enabling the gradients to swap nucleotides with few updates. Furthermore, the scaling parameter $\gamma$ adaptively adjusts the sampling entropy to control global versus local optimization. This can be deduced from the gradient components of $\gamma$ in Equation 10:

  1. The fitness gradient $\partial \mathcal{P}(z) / \partial z_{ij}$ is positive for nucleotides which increase fitness and negative otherwise.

  2. The normalized logit $l^{\text{norm}}_{ij}$ is positive when $l_{ij} > \mu_j$ and negative otherwise.

  3. The sampling-probability term is positive only when we are likely to sample the corresponding nucleotide.

Here, the product of the first two terms is positive if $l_{ij} > \mu_j$ and nucleotide $j$ raises fitness, or if $l_{ij} < \mu_j$ and nucleotide $j$ lowers fitness. Put together, the gradient of $\gamma$ increases when our confidence in a nucleotide is consistent with its impact on fitness. Conversely, inconsistent nucleotides decrement the gradient. At the start of optimization, $\gamma$ is small, leading to high PWM entropy and large jumps in sequence design space. As we sample consistent nucleotides and the entropy gradient turns positive, $\gamma$ increases. A larger $\gamma$ lowers the entropy and leads to more localized optimization. However, if we sample sufficiently many inconsistent nucleotides, the gradient of $\gamma$ may turn negative, again raising entropy and promoting global exploration.

Note that, in the context of protein design where we have a single scale $\gamma$ and offset $\beta$, the gradient expressions from Equations 10 and 11 are additively pooled across all channels. The argued benefits of instance normalization above thus hold for layer normalization as well.
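A minimal TensorFlow sketch of the normalization layer is shown below, assuming a `(batch, positions, channels)` logit tensor; the epsilon constant and the commented usage with the sampler sketched earlier are our own assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def normalize_logits(logits, gamma, beta, instance_norm=True, eps=1e-5):
    """Instance-normalize logits per channel across positions (Eqs. 7-8),
    or layer-normalize across all positions and channels (the protein variant)."""
    axes = [1] if instance_norm else [1, 2]
    mu = tf.reduce_mean(logits, axis=axes, keepdims=True)
    sd = tf.math.reduce_std(logits, axis=axes, keepdims=True)
    l_norm = (logits - mu) / (sd + eps)    # zero mean, unit deviation
    return gamma * l_norm + beta           # re-scale and re-center

# One Sampled-IN ascent step (sketch): gamma/beta have shape (1, 1, M) for DNA
# (per-channel) or (1, 1, 1) for proteins (single scale and offset).
# with tf.GradientTape() as tape:
#     z = sample_one_hot_st(normalize_logits(logits, gamma, beta))
#     loss = -tf.reduce_mean(predictor(z))
# grads = tape.gradient(loss, [logits, gamma, beta])
```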

5 Experiments

We first compare our new logit-normalized, straight-through sampled sequence design method (abbreviated Sampled-IN) to the previous versions of the algorithm described in Background:

  • PWM – The original method with continuous softmax-relaxed inputs (Killoran et al., 2017).

  • Sampled – The categorical sampling method described in Bogard et al. (2019), using the (non-normalized) softmax straight-through gradient estimator.

  • We also tested PWM-IN – a logit-normalized version of the softmax-relaxed method.

In addition to gradient-based methods, we compare Sampled-IN to discrete search algorithms. The first method is a pairwise nucleotide-swapping search (Evolved; Sample et al., 2019), where the sequence $x$ is mutated with either a single random substitution or, with a 50% chance, multiple random substitutions at each iteration, resulting in a new candidate sequence $x'$. The candidate $x'$ is only accepted if $\mathcal{P}(x') > \mathcal{P}(x)$. We also tested a well-known meta-heuristic, Simulated Annealing (SA; Kirkpatrick et al., 1983). In SA, mutations are accepted even if they result in lower fitness, with an acceptance probability controlled by a temperature parameter $T$. Here we use the Metropolis acceptance criterion (Metropolis et al., 1953), accepting a fitness-decreasing candidate with probability $e^{\left(\mathcal{P}(x') - \mathcal{P}(x)\right) / T}$.
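A sketch of the Simulated Annealing baseline is given below; the linear cooling schedule, single-substitution proposal and default parameters are illustrative assumptions, while the acceptance rule follows the Metropolis criterion described above.

```python
import numpy as np

def simulated_annealing(score_fn, seq, n_iters=20000, t_init=0.1, t_min=1e-3):
    """Simulated Annealing over one-hot sequences (shape (L, M)) with Metropolis
    acceptance. score_fn maps a one-hot sequence to a scalar fitness score."""
    rng = np.random.default_rng()
    f_curr = score_fn(seq)
    best, f_best = seq.copy(), f_curr
    for it in range(n_iters):
        temp = max(t_min, t_init * (1.0 - it / n_iters))    # assumed linear cooling
        cand = seq.copy()
        pos = rng.integers(cand.shape[0])                    # propose one random substitution
        cand[pos] = np.eye(cand.shape[1])[rng.integers(cand.shape[1])]
        f_cand = score_fn(cand)
        # Metropolis criterion: accept improvements always, otherwise accept
        # with probability exp((f_cand - f_curr) / temp).
        if f_cand >= f_curr or rng.random() < np.exp((f_cand - f_curr) / temp):
            seq, f_curr = cand, f_cand
            if f_curr > f_best:
                best, f_best = seq.copy(), f_curr
    return best
```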

The design methods are evaluated on the task of maximizing the classification or regression score of five genomic deep neural network predictors $\mathcal{P}$: DragoNN, DeepSEA (Zhou et al., 2015), APARENT (Bogard et al., 2019), MPRA-DragoNN (Movva et al., 2019), and Optimus 5’ (Sample et al., 2019). Here is a brief description of each fitness predictor:

  • DragoNN – Predicts the probability of SPI1 transcription factor (TF) binding within a 1000-nt sequence. We define $\mathcal{P}(x)$ as the logit score of the network output.

  • DeepSEA – Predicts multiple TF binding probabilities and chromatin modifications in a 1000-nt sequence. We define $\mathcal{P}(x)$ as the logit score of the CTCF (Dnd41) output.

  • APARENT – Predicts proximal alternative polyadenylation isoform abundance in a 206-nt sequence. We define $\mathcal{P}(x)$ as the logit score of the network output.

  • MPRA-DragoNN – Predicts transcriptional activity of a 145-nt promoter sequence. We define $\mathcal{P}(x)$ as the sixth output (SV40) of the ’Deep Factorized’ model.

  • Optimus 5’ – Predicts mean ribosome load in a 50-nt sequence. $\mathcal{P}(x)$ is the (non-scaled) output of the ’evolution’ model.

We further evaluated the methods by designing protein sequences which conform to target structures, using the neural network trRosetta as the tertiary structure predictor (Yang et al., 2020). The optimization objective is slightly more complicated than in the five examples above, so we devote Section 5.4 to describing the predictor and design task in more detail.

Starting with a randomly initialized logit matrix $l$, for the methods PWM and PWM-IN we optimize $\mathcal{P}(\sigma(l))$ using the continuous softmax relaxation from Equation 3. For Sampled and Sampled-IN, we optimize $\mathcal{P}(z)$ using the discrete nucleotide sampler from Equation 5. In all experiments, we report the mean training loss across the $K$ independently optimized patterns, defined as the negative predicted fitness of the pattern passed to the predictor: for PWM and PWM-IN this is the softmax relaxation $\sigma(l)$, and for Sampled and Sampled-IN it is the sampled one-hot pattern $z$.

The test loss is evaluated on the basis of discrete sequence samples drawn from the optimized softmax representation $\sigma(l)$, regardless of design method. In all four methods, we use the categorical nucleotide sampler to draw $S$ sequence samples from each softmax sequence at every weight update $t$, and compute the mean test loss as the average negative predicted fitness across the $S$ samples and the $K$ independent optimization runs. $S$ and $K$ were held fixed across design methods in all experiments.

All optimization experiments were carried out in Keras (Chollet, 2015) using Adam with default parameters (Kingma et al., 2014). Some predictor models were ported using pytorch2keras.

5.1 Predictor Maximization

Here, we use the design methods PWM, PWM-IN, Sampled and Sampled-IN to generate maximally scored sequences for each of the five DNA-based predictors (Figure 1B). Sampled-IN approaches its minimum test loss within a small number of logit updates for all predictors except MPRA-DragoNN and Optimus 5’, whereas PWM and Sampled do not converge within the full optimization budget. Sampled-IN also converges to up to 3-fold better optima than all other competing methods. In fact, for DragoNN, MPRA-DragoNN and DeepSEA, Sampled-IN reaches the same or better optima in 100-fold fewer updates than the competing methods, marking a 100x speedup. For Optimus 5’ and APARENT, the speedup is 20x-50x.

5.2 Pathologies of Softmax Relaxation

We hypothesized that some predictors, having been trained only on discrete one-hot coded patterns, may have poor predictive power on continuous-valued softmax sequence relaxations (the input to PWM). To test this, we measured both the training loss (which is based on the softmax sequence input $\sigma(l)$) and the test loss (which is based on discrete samples $z$) when maximizing MPRA-DragoNN.

Indeed, maximizing MPRA-DragoNN with PWM leads to a severely overestimated predictor score on the softmax input (Figure 2A), as the training loss is more than 6-fold lower than the test loss. Using Sampled-IN, on the other hand, the training and test losses are identical (Figure 2B). While the training loss is 2x higher than the training loss of PWM, the test loss is more than 3x lower.

Figure 2: Maximizing MPRA-DragoNN. (A) PWM. (B) Sampled-IN.

5.3 Comparison to Evolutionary Algorithms and Simulated Annealing

Next, we compared Sampled-IN to the discrete nucleotide-swapping search algorithm ’Evolved’ (Sample et al., 2019) as well as the global optimization heuristic Simulated Annealing (’SA’; Figure 3A; Kirkpatrick et al., 1983). We benchmarked the methods on the DragoNN maximization task (1,000-nt sequences). The results show that Sampled-IN significantly outperforms SA (Figure 3B); the best fitness score that SA reached in 20,000 iterations was reached by Sampled-IN in less than 1,000 iterations, marking a 20x speed-up. Interestingly, increasing the number of substitutions at each step of SA initially accelerates convergence, but results in worse optima. Furthermore, the pairwise nucleotide-swapping search, Evolved, never converges, suggesting that DragoNN maximization is a highly non-convex optimization problem.

Figure 3: (A) In Simulated Annealing, mutations are accepted with a temperature-controlled probability even if the predicted fitness decreases. (B) Maximizing DragoNN SPI1. Simulated Annealing is tested at several parameter configurations (number of substitutions per step / initial temperature).

5.4 Protein Structure Optimization

Figure 4: (A) Protein sequences are designed to minimize the KL-divergence between predicted and target distance and angle distributions (angle distribution maps omitted). The one-hot pattern is used for two of the trRosetta inputs. (B) Generating sequences which conform to the target predicted structure of a Sensor Histidine Kinase. Simulated Annealing was tested at several initial temperatures, with 1 substitution per step. (C) Predicted residue distance distributions after 200 iterations.

Multiple deep learning models have recently been developed for predicting tertiary protein structure (AlQuraishi, 2019; Senior et al., 2020; Yang et al., 2020). Here, we demonstrate our method by designing de novo protein sequences which conform to a target residue contact map as predicted by trRosetta (Yang et al., 2020). The predictor takes three inputs (Figure 4A): a one-hot coded sequence, a PSSM constructed from a multiple-sequence alignment (MSA) and a direct-coupling analysis (DCA) map. For our design task, we pass the optimizable one-hot pattern $z$ to the first two inputs and an all-zeros tensor as the DCA feature map. Given the predicted inter-residue distance and angle distributions, we minimize the mean KL-divergence against the corresponding target distributions:

$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{|\mathcal{O}|} \sum_{o \in \mathcal{O}} \frac{1}{L^{2}} \sum_{i=1}^{L} \sum_{j=1}^{L} D_{\mathrm{KL}}\big(\hat{p}^{(o)}_{ij} \,\big\|\, p^{\star (o)}_{ij}\big)$$

where $\mathcal{O}$ is the set of predicted output maps (the distance distribution and the angle distributions), $L$ is the sequence length, $\hat{p}^{(o)}_{ij}$ and $p^{\star (o)}_{ij}$ are the predicted and target distributions for residue pair $(i, j)$, and $D_{\mathrm{KL}}(\hat{p} \,\|\, p^{\star}) = \sum_{k} \hat{p}_{k} \log\big(\hat{p}_{k} / p^{\star}_{k}\big)$.
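A sketch of this loss in TensorFlow is shown below; the tensor shapes, epsilon constant and equal weighting of the output maps are our assumptions.

```python
import tensorflow as tf

def mean_kl_loss(predicted_maps, target_maps, eps=1e-8):
    """Average KL-divergence between predicted and target categorical maps
    (e.g. trRosetta distance/angle distributions of shape (L, L, bins))."""
    losses = []
    for p_hat, p_star in zip(predicted_maps, target_maps):
        kl = tf.reduce_sum(
            p_hat * (tf.math.log(p_hat + eps) - tf.math.log(p_star + eps)), axis=-1)
        losses.append(tf.reduce_mean(kl))    # mean over residue pairs (i, j)
    return tf.add_n(losses) / len(losses)    # mean over the output maps
```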

We compared Sampled and Simulated Annealing (SA) to a modified version of Sampled-IN in which the logits are normalized across all residue channels (layer-normalized rather than instance-normalized) to counteract the noisier per-channel normalization statistics. We used the methods to design protein sequences which conform to the target structure of an example protein (a Sensor Histidine Kinase) from the trRosetta GitHub repository (https://github.com/gjoni/trRosetta). We optimized several independent sequences per design method and recorded the median KL-loss at each iteration. The results show that Sampled-IN converges faster and reaches better minima (Figure 4B); after 200 iterations, Sampled-IN reached a 2x lower KL-divergence than all other methods, and much of the target structure is visible (Figure 4C).

6 Conclusion

We have presented an improved version of gradient ascent (or activation maximization) for sequence design, combining logit normalization with stochastic nucleotide sampling and straight-through gradients. By normalizing nucleotide logits across positions and using a global entropy parameter, we keep logits proportionally scaled and centered at zero. As demonstrated on five deep learning predictors, logit normalization enables extremely fast sequence optimization, with a 100-fold speedup compared to the original method for many predictors, and with better predicted optima.

In addition to logit drift and vanishing gradients, the original sequence ascent method suffers from predictor pathologies due to passing continuous softmax sequence relaxations as input, a problem fully removed by using discrete sampling. We further observed that straight-through sampling leads to consistently better optima than softmax relaxation, suggesting that it is better at traversing local optima. In fact, our method outperformed global optimization meta-heuristics such as Simulated Annealing on the more difficult design tasks, such as designing 1000-nt-long transcription factor binding sites or designing protein sequences which conform to a complex target structure.

The gradient of the entropy parameter $\gamma_j$ (or the single $\gamma$ in the protein setting) in our design method adaptively adjusts the sampling stochasticity to trade off global and local optimization. In the beginning, $\gamma$ is small, corresponding to a high PWM entropy and consequently very diverse sequence samples. As optimization progresses, $\gamma$ grows, leading to more localized sequence changes. This adaptive mechanism, in combination with flexible nucleotide logits due to the normalization, results in a highly efficient design method.

We hope these algorithmic advances showcase the potential of biomolecular optimization and open the door to more research in differentiable sequence design. The approach introduced here could accelerate the design of functional enzymes and other biomolecules, potentially resulting in novel drug therapies, molecular sensors and more.

References

Alipanahi, B., Delong, A., Weirauch, M.T., & Frey, B.J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831-838.

AlQuraishi, M. (2019). End-to-end differentiable learning of protein structure. Cell Systems 8, 292-301.

Avsec, Ž., Weilert, M., Shrikumar, A., Alexandari, A., Krueger, S., Dalal, K., … & Zeitlinger, J. (2019). Deep learning at base-resolution reveals motif syntax of the cis-regulatory code (bioRxiv).

Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation (arXiv).

Bogard, N., Linder, J., Rosenberg, A. B., & Seelig, G. (2019). A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 178, 91-106.

Brookes, D.H., Park, H., & Listgarten, J. (2019). Conditioning by adaptive sampling for robust design (arXiv).

Chollet, F. (2015). Keras.

Cheng, J., Nguyen, T.Y.D., Cygan, K.J., Çelik, M.H., Fairbrother, W.G. & Gagneur, J. (2019). MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology, 20, 48.

Chung, J., Ahn, S., & Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks (arXiv).

Costello, Z., & Martin, H.G. (2019). How to hallucinate functional proteins (arXiv).

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1 (arXiv).

Deaton, R. J., Murphy, R. C., Garzon, M. H., Franceschetti, D. R., & Stevens Jr, S. E. (1996, June). Good encodings for DNA-based solutions to combinatorial problems. In DNA Based Computers, 247-258.

Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T. F. G., Qin, C., … & Petersen, S. (2018). De novo structure prediction with deep-learning based scoring. Annual Reviews of Biochemistry 77, 363-382.

Eraslan, G., Avsec, Ž., Gagneur, J., & Theis, F.J. (2019). Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics 20, 389-403.

Gupta, A., & Zou, J. (2019). Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence 1, 105-111.

Hao, G. F., Xu, W. F., Yang, S. G., & Yang, G. F. (2015). Multiple simulated annealing-molecular dynamics (msa-md) for conformational space search of peptide and miniprotein. Scientific reports 5, 15568.

Ibrahim, Z., Khalid, N. K., Lim, K. S., Buyamin, S., & Mukred, J. A. A. (2011, December). A binary vector evaluated particle swarm optimization based method for DNA sequence design problem. In 2011 IEEE Student Conference on Research and Development, 160-164.

Jaganathan, K., Kyriazopoulou Panagiotopoulou, S., McRae, J. F., Darbandi, S. F., Knowles, D., Li, Y. I., … Farh, K. K.-H. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.

Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with gumbel-softmax (arXiv).

Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., & Frey, B. J. (2017). Generating and designing DNA with deep generative models (arXiv).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization (arXiv).

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220, 671-680.

Lanchantin, J., Singh, R., Lin, Z., & Qi, Y. (2016). Deep motif: Visualizing genomic sequence classifications (arXiv).

Linder, J., Bogard, N., Rosenberg, A. B., & Seelig, G. (2019). Deep exploration networks for rapid engineering of functional DNA sequences (bioRxiv).

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087-1092.

Movva, R., Greenside, P., Marinov, G. K., Nair, S., Shrikumar, A., & Kundaje, A. (2019). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PloS One, 14(6).

Mustaza, S. M., Abidin, A. F. Z., Ibrahim, Z., Shamsudin, M. A., Husain, A. R., & Mukred, J. A. A. (2011, December). A modified computational model of ant colony system in DNA sequence design. In 2011 IEEE Student Conference on Research and Development, 169-173.

Riesselman, A.J., Shin, J.E., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., & Marks, D.S. (2019). Accelerating Protein Design Using Autoregressive Generative Models (bioRxiv).

Sample, P. J., Wang, B., Reid, D. W., Presnyak, V., McFadyen, I. J., Morris, D. R., & Seelig, G. (2019). Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nature Biotechnology 37, 803-809.

Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., … & Penedones, H. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 1-5.

Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps (arXiv).

Tareen, A., & Kinney, J.B. (2019). Biophysical models of cis-regulation as interpretable neural networks (arXiv).

Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization (arXiv).

Xiao, J., Xu, J., Chen, Z., Zhang, K., & Pan, L. (2009). A hybrid quantum chaotic swarm evolutionary algorithm for DNA encoding. Computers & Mathematics with Applications 57, 1949-1958.

Yang, K.K., Wu, Z., & Arnold, F.H. (2019). Machine-learning-guided directed evolution for protein engineering. Nature Methods 16, 687-694.

Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., & Baker, D. (2020). Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences.

Zhou, J., & Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods 12, 931-934.

Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., & Telenti, A. (2019). A primer on deep learning in genomics. Nature genetics 51, 12-18.

Appendix A: Sequence optimization examples

Figure S1 depicts an example optimization run for each predictor, comparing the softmax sequences (PSSMs) generated by the PWM and Sampled-IN design methods. Results are shown for 200, 2000 and 20000 iterations of gradient ascent (Adam). The logit matrices were uniformly randomly initialized. When maximizing DragoNN, Sampled-IN generates dual SPI1 binding motifs. For APARENT, Sampled-IN generates five CFIm binding motifs, dual CSE hexamers, and multiple cut sites with CstF binding sites. For DeepSEA, Sampled-IN generates four CTCF binding sites. For MPRA-DragoNN, Sampled-IN generates a CRE- and a CTCF site. Finally, for Optimus 5’, Sampled-IN generates a T-rich sequence with a uAUG.

Figure S1: (A) Generation of sequences which maximize the score of DragoNN (SPI1), APARENT, DeepSEA (CTCF, Dnd41), MPRA-DragoNN (SV40 Mean activity) and Optimus 5’. Shown are the resulting sequence logos of running the PWM and Sampled-IN optimization methods for 200, 2000 and 20000 iterations respectively, starting from randomly initialized logits.

Appendix B: Supplemental Comparisons

Here we show that the improved design method (Sampled-IN) is insensitive to optimizer settings, that it outperforms PWM even when enforcing low entropy, and that the softmax straight-through (ST) estimator is superior to both the original ST estimator and the Gumbel distribution.

Entropy penalties and the Gumbel distribution

Curious whether the gap we observed between training and test loss for the PWM method in Figure 2 was caused by high softmax sequence entropy, we tested whether an explicit entropy penalty would improve the method. We re-optimized sequences for Optimus 5’ and DragoNN with the penalty enforcing a minimum mean nucleotide conservation (in bits) (Figure S2A). Even at low entropy, PWM does not converge to nearly as good minima as Sampled-IN.
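Such a penalty could be implemented as sketched below; the conservation target and epsilon constant are illustrative values, not the ones used in the experiments.

```python
import tensorflow as tf

def entropy_penalty(logits, target_bits=1.8, eps=1e-8):
    """Penalize softmax sequences whose mean per-position conservation
    (log2(M) minus entropy, in bits) falls below a target value."""
    probs = tf.nn.softmax(logits, axis=-1)
    entropy = -tf.reduce_sum(probs * tf.math.log(probs + eps), axis=-1) / tf.math.log(2.0)
    max_bits = tf.math.log(tf.cast(tf.shape(probs)[-1], tf.float32)) / tf.math.log(2.0)
    conservation = max_bits - entropy
    return tf.nn.relu(target_bits - tf.reduce_mean(conservation))
```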

We also compared the performance of our logit-normalized, softmax straight-through design method Sampled-IN to a version of the method using the Gumbel distribution for sampling instead (Jang et al., 2016; Figure S2B). While the Gumbel variant of the design method reached the same optima as Sampled-IN, it converged more slowly. Importantly, like PWM and Sampled, the Gumbel design method benefited substantially from logit normalization (Gumbel-IN).

Figure S2: (A) Maximizing Optimus 5’ and DragoNN (SPI1) with a softmax sequence entropy penalty. (B) Comparing Sampled-IN to a version of the same method using the Gumbel distribution.

Insensitivity to optimizer settings

We noted that the performance of the PWM method was dependent on optimizer settings (Figure S3A); the speed at which it could maximize APARENT was increased by switching to an SGD optimizer and setting a very high learning rate. However, the method could still not reach the same optimum as Sampled-IN, which operated well at the default learning rate.

Superiority of softmax straight-through gradients

We compared the performance of Sampled-IN, which uses the softmax straight-through gradient estimator, to a version using the original estimator ($\partial z_{ij} / \partial l_{ij} = 1$). As demonstrated on DragoNN, the softmax estimator reaches much better optima (Figure S3B). Sampling multiple sequences at each logit update and walking up the average gradient slightly speeds up convergence, but does not improve optima.

Figure S3: (A) Maximizing APARENT using Sampled-IN or PWM with SGD (LR = Learning Rate). (B) Maximizing DragoNN using Sampled-IN, with either the softmax- or original (simple) straight-through estimator. 1x and 10x refer to the number of sequences sampled at each update.

Appendix C: Additional Protein Structure Optimization Example

In addition to the protein structure design task evaluated in Figure 4 (Yang et al., 2020), we also benchmarked Sampled, Sampled-IN and Simulated Annealing (SA) on a separate protein structure. Here, we task the methods with designing sequences that conform to a coiled-coil hairpin structure, again using trRosetta as the differentiable structure predictor (Yang et al., 2020). We used the same KL-divergence loss as in Figure 4. The results, depicted in Figure S4, show that Sampled-IN and SA both converge very quickly to a near-optimal (zero) KL-divergence. This structure is likely much easier to design sequences for, as there is only one major long-range contact formation. The sequence is also only about half as long as the one in Figure 4.

Figure S4: (A) Sequences are designed to conform to a coiled-coil hairpin structure, as predicted by trRosetta. (B) Sequence optimization results after 1000 iterations. Simulated Annealing was tested at several initial temperatures. (C) Predicted residue distance distributions after 200 iterations.