1 Introduction
Rational design of DNA and protein sequences enables rapid development of novel drug molecules, vaccines, biological circuits and more. Most design methods are guided by predictive models that reliably relate sequence to fitness or function. In recent years, these models have often been based on deep learning (Eraslan et al., 2019; Zou et al., 2019; Tareen et al., 2019). For example, neural networks have been trained to predict transcription factor binding (Alipanahi et al., 2015; Avsec et al., 2019), chromatin modifications (Zhou et al., 2015) and transcriptional activity (Movva et al., 2019). At the level of RNA, they have been used to predict translation (Sample et al., 2019), splicing (Jaganathan et al., 2019; Cheng et al., 2019) and polyadenylation (Bogard et al., 2019). Neural networks have even been used to predict protein structure (AlQuraishi, 2019; Senior et al., 2020; Yang et al., 2020), and generative models based on neural networks are readily applied for protein design (Yang et al., 2019; Riesselman et al., 2019; Costello et al., 2019; Brookes et al., 2019; Gupta et al., 2019).
A direct approach to sequence design based on a differentiable fitness predictor is to optimize the input pattern by gradient ascent (Lanchantin et al., 2016; Killoran et al., 2017; Bogard et al., 2019). The method, commonly referred to as activation maximization, uses the gradient of the neural network output to make incremental changes to the input. While simple, activation maximization cannot be directly applied to sequences, as they are discrete and not amenable to gradient ascent. Several extensions have been proposed to rectify this: Killoran et al. (2017) used a softmax layer to turn the sequences into continuous relaxations, and Bogard et al. (2019) used straight-through gradients to optimize discrete samples. However, both methods suffer from vanishing gradients. Furthermore, continuous input relaxations may cause predictor pathologies leading to poor convergence.
Here, we combine discrete nucleotide sampling and straight-through gradients with normalization across the parameters – nucleotide logits – of the sampling distributions, resulting in a markedly improved sequence design method. We demonstrate a 100-fold optimization speedup, and improved optima, for a number of DNA, RNA and protein design tasks. We further show that our method outperforms global search metaheuristics such as Simulated Annealing.
2 Background
Given a sequence-predictive neural network $\mathcal{P}$ and an initial input pattern $x^{(0)}$, the gradient ascent method seeks to maximize the predicted fitness $\mathcal{P}(x)$ by tuning the input pattern $x$:

(1) $\hat{x} = \arg\max_{x} \mathcal{P}(x)$

Assuming $\mathcal{P}$ is differentiable, we can compute the gradient of the fitness with respect to the input and optimize $x$ by repeatedly updating it in the direction of the fitness gradient (Simonyan et al., 2013):

(2) $x^{(t+1)} = x^{(t)} + \eta \cdot \nabla_{x} \mathcal{P}\left(x^{(t)}\right)$
However, sequences are usually represented as one-hot coded patterns ($x \in \{0, 1\}^{N \times M}$, where $N$ is the sequence length and $M$ the number of channels), and discrete variables cannot be optimized by gradient ascent. Several different reparameterizations of $x$ have been proposed to bypass this issue. In one of the earliest implementations, Lanchantin et al. (2016) represented the sequence as an unstructured, real-valued pattern ($x \in \mathbb{R}^{N \times M}$) but imposed an $L_2$ penalty on $x$ in order to keep it from growing too large and causing predictor pathologies. Killoran et al. (2017) later introduced a softmax reparameterization, turning $x$ into a continuous relaxation $\sigma(l)$:
(3) $\sigma(l)_{ij} = \frac{e^{l_{ij}}}{\sum_{k=1}^{M} e^{l_{ik}}}$
Here, $l \in \mathbb{R}^{N \times M}$ are differentiable nucleotide logits. The gradient of $\sigma(l)_{ij}$ with respect to $l_{ik}$ is defined as:
(4) $\frac{\partial \sigma(l)_{ij}}{\partial l_{ik}} = \sigma(l)_{ij} \cdot \left( \mathbb{1}[j = k] - \sigma(l)_{ik} \right)$
Given Equations 3 and 4, we can maximize $\mathcal{P}(\sigma(l))$ with respect to the logits $l$ using the gradient $\nabla_{l} \mathcal{P}(\sigma(l))$. While elegant, there are two issues with this architecture. First, the gradient in Equation 4 becomes vanishingly small for large logit values (when $\sigma(l)_{ij} \to 0$ or $\sigma(l)_{ij} \to 1$), halting convergence. Second, sequence-predictive neural networks have only been trained on discrete one-hot coded patterns, and the predictive power of $\mathcal{P}$ may be poor on a continuous relaxation such as $\sigma(l)$.
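To make the vanishing-gradient argument concrete, the following NumPy sketch evaluates the softmax Jacobian of Equation 4 at moderate and at saturated logits; the specific logit values are illustrative only:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def softmax_jacobian(l):
    """d sigma(l)_j / d l_k = sigma_j * (1[j == k] - sigma_k) (Equation 4),
    for a single sequence position with M channels."""
    s = softmax(l)
    return np.diag(s) - np.outer(s, s)

# Moderate logits: gradient can still flow.
J_moderate = softmax_jacobian(np.array([0.5, -0.5, 0.2, -0.2]))

# Saturated logits (one nucleotide dominates, sigma -> 1): the entire
# Jacobian collapses toward zero -- the vanishing-gradient regime.
J_saturated = softmax_jacobian(np.array([30.0, -10.0, -10.0, -10.0]))
```

Once any logit dominates its position, every entry of the Jacobian is a product involving a probability near 0, so no update direction can recover.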
Following advances in gradient estimators for discretized neurons (Bengio, Léonard, & Courville, 2013; Courbariaux et al., 2016), Bogard et al. (2019) developed a version of the gradient ascent design method replacing the softmax transform $\sigma$ with a discrete, stochastic sampler $\delta$:

(5) $\delta_{ij} = \begin{cases} 1 & \text{if } j = c_i \\ 0 & \text{otherwise} \end{cases}, \quad c_i \sim \text{Categorical}\left(\sigma(l_i)\right)$

Here, $c_i$ is a randomly drawn categorical nucleotide at the $i$:th position from the (softmax) probability distribution $\sigma(l_i)$. The nucleotide logits $l$ are thus used as parameters to categorical distributions, from which we sample nucleotides and construct a discrete, one-hot coded pattern $\delta$. While $\delta$ is not directly differentiable, $l$ can be updated based on an estimate of $\frac{\partial \delta}{\partial l}$ using a straight-through (ST) approximation. Rather than using the original ST estimator of Bengio et al. (2013), we here adapt an estimator with theoretically better properties from Chung et al. (2016), where the gradient of $\delta$ is replaced by that of the softmax $\sigma$:

(6) $\frac{\partial \delta_{ij}}{\partial l_{ik}} \approx \frac{\partial \sigma(l)_{ij}}{\partial l_{ik}}$
Using discrete samples $\delta$ as input to $\mathcal{P}$ removes any pathologies from continuous input relaxations. But, as we show below, convergence remains almost as slow as with the softmax method. Switching to the original ST estimator ($\frac{\partial \delta_{ij}}{\partial l_{ij}} = 1$) speeds up convergence, but results in worse fitness optima.
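A minimal NumPy sketch of the sampler in Equation 5 and the softmax ST estimator in Equation 6; the `upstream` array stands in for $\partial \mathcal{P} / \partial \delta$, which would normally come from backpropagating through the predictor:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sample_onehot(logits, rng):
    """Equation 5: draw one categorical nucleotide per position and
    one-hot encode it."""
    probs = softmax(logits)                    # (N, M)
    n, m = probs.shape
    idx = np.array([rng.choice(m, p=p) for p in probs])
    delta = np.zeros_like(probs)
    delta[np.arange(n), idx] = 1.0
    return delta

def st_grad(logits, upstream):
    """Softmax straight-through estimator (Equation 6): backpropagate
    through the softmax Jacobian instead of the non-differentiable sample."""
    s = softmax(logits)
    inner = (upstream * s).sum(axis=-1, keepdims=True)
    return s * (upstream - inner)   # J^T u for J = diag(s) - s s^T

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))          # 8 positions, 4 nucleotides
delta = sample_onehot(logits, rng)        # discrete one-hot predictor input
upstream = rng.normal(size=logits.shape)  # stands in for dP/d(delta)
g = st_grad(logits, upstream)
```

The forward pass is fully discrete, so the predictor only ever sees valid one-hot sequences; only the backward pass uses the softmax relaxation.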
3 Related work
A wide selection of discrete search algorithms and metaheuristics has been applied to computational sequence design, for example Genetic Algorithms (Deaton et al., 1996), Simulated Annealing (Hao et al., 2015) and Particle Swarms (Xiao et al., 2009; Ibrahim et al., 2011; Mustaza et al., 2011). Gradient-based optimization has a clear advantage over these methods: it makes stepwise local improvements to all nucleotides at once based on a gradient direction, rather than heuristically making a small number of changes and evaluating the effect afterwards.
Our design method is closely related to work by Bogard et al. (2019), which used Equations 5 and 6 to optimize a set of nucleotide logits through discrete sequence samples. We extend this method by normalizing the logits across sequence positions and introducing an adaptive entropy parameter. Our method is also similar to other gradient ascent-based methods (Lanchantin et al., 2016; Killoran et al., 2017), but differs in that we pass discrete samples as input. Finally, we recently proposed a deep generative model (Deep Exploration Network – DEN) which learns to sample a distribution of diverse, high-fitness sequences (Linder et al., 2019). DENs can generate large-scale sequence sets very efficiently. However, they first require selecting an appropriate generative network, whereas the method presented here is model- and parameter-free, making it easier to use when designing smaller sequence sets.
4 Logit normalization – balancing entropy and stochasticity
Inspired by instance normalization in image GANs (Ulyanov et al., 2016), we hypothesized that the main bottleneck in earlier design methods stems from overly large and disproportionally scaled nucleotide logits. Here, we mitigate this problem by normalizing the logits across positions, i.e. we insert a normalization layer between the trainable logits and the sampling layer (Figure 1A).
For DNA sequence design, where the number of one-hot channels is small ($M = 4$), we use a normalization scheme commonly referred to as instance normalization. In this scheme, the nucleotide logits of each channel are normalized independently across positions. Let $\mu_j$ and $s_j$ be the sample mean and standard deviation of the logits for nucleotide $j$ across all positions $i$. For each step of gradient ascent, we compute the normalized logits as:
(7) $l^{\text{norm}}_{ij} = \frac{l_{ij} - \mu_j}{s_j}$
Since logits with zero mean and unit variance have limited expressiveness when used as parameters to a probability distribution, we associate each channel $j$ with a global scaling parameter $\gamma_j$ and offset $\beta_j$. Having an independent offset per channel is particularly well-suited for DNA, as nucleotides are often associated with a global preferential bias. The scaled, re-centered logits are calculated as:

(8) $l^{\text{scaled}}_{ij} = l^{\text{norm}}_{ij} \cdot \gamma_j + \beta_j$
For protein sequence design, the number of one-hot channels is considerably larger ($M = 20$), resulting in fewer samples per channel and noisier normalization statistics. Here we found that layer normalization was more stable: we compute a global mean $\mu$ and deviation $s$ across all positions and channels, and use a single scale $\gamma$ and offset $\beta$ for all channels.
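The two normalization schemes can be sketched as follows; the drifted logit matrix is synthetic, chosen only to illustrate the rescaling:

```python
import numpy as np

def instance_norm(l, gamma, beta, eps=1e-6):
    """Equations 7-8 for DNA (M = 4): normalize each nucleotide channel j
    independently across positions i, then rescale with a per-channel
    scale gamma_j and offset beta_j."""
    mu = l.mean(axis=0, keepdims=True)   # (1, M) channel means
    sd = l.std(axis=0, keepdims=True)    # (1, M) channel deviations
    return (l - mu) / (sd + eps) * gamma + beta

def layer_norm(l, gamma, beta, eps=1e-6):
    """Protein variant (M = 20): a single global mean/deviation and one
    scalar scale and offset shared across all channels."""
    return (l - l.mean()) / (l.std() + eps) * gamma + beta

rng = np.random.default_rng(1)
l = 50.0 * rng.normal(size=(10, 4)) + 5.0   # drifted, badly scaled logits
l_in = instance_norm(l, gamma=np.ones(4), beta=np.zeros(4))
l_ln = layer_norm(l, gamma=1.0, beta=0.0)
```

However large or offset the raw logits become during optimization, the sampler only ever sees zero-centered, unit-scale values modulated by the learned $\gamma$ and $\beta$.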
Given the normalized and scaled logits $l^{\text{scaled}}$ as parameters for the nucleotide sampler $\delta$ defined in Equation 5, we maximize $\mathcal{P}(\delta)$ with respect to $l$, $\gamma_j$ and $\beta_j$ (or $\gamma$ and $\beta$ in the context of proteins). Using the softmax ST estimator from Equation 6, we arrive at the following gradients:
(9) $\frac{\partial \mathcal{P}}{\partial l_{ij}} \approx \sum_{k=1}^{M} \frac{\partial \mathcal{P}}{\partial \delta_{ik}} \cdot \frac{\partial \sigma(l^{\text{scaled}})_{ik}}{\partial l^{\text{scaled}}_{ij}} \cdot \frac{\partial l^{\text{scaled}}_{ij}}{\partial l_{ij}}$

(10) $\frac{\partial \mathcal{P}}{\partial \gamma_j} \approx \sum_{i=1}^{N} \frac{\partial \mathcal{P}}{\partial \delta_{ij}} \cdot l^{\text{norm}}_{ij} \cdot \sigma(l^{\text{scaled}})_{ij}$

(11) $\frac{\partial \mathcal{P}}{\partial \beta_j} \approx \sum_{i=1}^{N} \frac{\partial \mathcal{P}}{\partial \delta_{ij}} \cdot \sigma(l^{\text{scaled}})_{ij}$
The normalization removes logit drift by keeping the values proportionally scaled and centered at zero (mean $0$, deviation $1$), enabling the gradients to swap nucleotides with few updates. Furthermore, the scaling parameter $\gamma$ adaptively adjusts the sampling entropy to control global versus local optimization. This can be deduced from the gradient components of $\gamma$ in Equation 10:

$\frac{\partial \mathcal{P}}{\partial \delta_{ij}}$ is positive for nucleotides which increase fitness and negative otherwise.

$l^{\text{norm}}_{ij}$ is positive when $l_{ij} > \mu_j$ and negative otherwise.

$\sigma(l^{\text{scaled}})_{ij}$ is large only when we are likely to sample the corresponding nucleotide.
Here, the product of the first two terms, $\frac{\partial \mathcal{P}}{\partial \delta_{ij}} \cdot l^{\text{norm}}_{ij}$, is positive if $l_{ij} > \mu_j$ and nucleotide $j$ raises fitness, or if $l_{ij} < \mu_j$ and nucleotide $j$ lowers fitness. Put together, the gradient for $\gamma$ increases when our confidence in nucleotide $j$ is consistent with its impact on fitness. Conversely, inconsistent nucleotides decrement the gradient. At the start of optimization, $\gamma$ is small, leading to high PWM entropy and large jumps in sequence design space. As we sample consistent nucleotides and the entropy gradient turns positive, $\gamma$ increases. A larger $\gamma$ lowers the entropy and leads to more localized optimization. However, if we sample sufficiently many inconsistent nucleotides, the gradient of $\gamma$ may turn negative, again raising entropy and promoting global exploration.
5 Experiments
We first compare our new logit-normalized, straight-through sampled sequence design method (abbreviated Sampled-IN) to the previous versions of the algorithm described in the Background:

PWM – The original method with continuous softmax-relaxed inputs (Killoran et al., 2017).

Sampled – The categorical sampling method described in Bogard et al. (2019), using the (non-normalized) softmax straight-through gradient estimator.

We also tested PWM-IN – a logit-normalized version of the softmax-relaxed method.
In addition to gradient-based methods, we compare Sampled-IN to discrete search algorithms. The first method is a pairwise nucleotide-swapping search (Evolved; Sample et al., 2019), where sequence $x$ is mutated with either one or, with a 50% chance, two random substitutions at each iteration, resulting in a new candidate sequence $\hat{x}$. $\hat{x}$ is only accepted if $\mathcal{P}(\hat{x}) > \mathcal{P}(x)$. We also tested a well-known metaheuristic, Simulated Annealing (SA; Kirkpatrick et al., 1983). In SA, mutations are accepted even if they result in lower fitness, with a probability that depends on the fitness difference and a temperature parameter $T$. Here we use the Metropolis acceptance criterion (Metropolis et al., 1953):

$p(\text{accept}) = \begin{cases} 1 & \text{if } \mathcal{P}(\hat{x}) \geq \mathcal{P}(x) \\ e^{\left(\mathcal{P}(\hat{x}) - \mathcal{P}(x)\right) / T} & \text{otherwise} \end{cases}$
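A hedged sketch of the SA baseline with Metropolis acceptance; the single-substitution proposal, the geometric cooling rate and the toy fitness function are illustrative assumptions, not the exact benchmark settings:

```python
import numpy as np

def simulated_annealing(fitness, x0, n_iters=5000, t0=1.0, cooling=0.999, seed=0):
    """Propose a random single-nucleotide substitution, accept it with the
    Metropolis criterion, and geometrically cool the temperature.
    `fitness` maps a one-hot (N, M) array to a scalar score."""
    rng = np.random.default_rng(seed)
    x, fx, t = x0.copy(), fitness(x0), t0
    for _ in range(n_iters):
        cand = x.copy()
        i = rng.integers(cand.shape[0])   # position to mutate
        j = rng.integers(cand.shape[1])   # proposed nucleotide
        cand[i] = 0.0
        cand[i, j] = 1.0
        f_cand = fitness(cand)
        delta = f_cand - fx
        # Metropolis: always accept improvements; accept worse moves
        # with probability exp(delta / t).
        if delta >= 0 or rng.random() < np.exp(delta / t):
            x, fx = cand, f_cand
        t *= cooling
    return x, fx

# Toy stand-in fitness: count of nucleotide A (channel 0).
toy_fitness = lambda x: float(x[:, 0].sum())
x0 = np.eye(4)[np.random.default_rng(2).integers(4, size=30)]
x_best, f_best = simulated_annealing(toy_fitness, x0)
```

Note that unlike the gradient-based methods, each iteration changes and evaluates a single position, which is why SA scales poorly to the long sequences benchmarked below.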
The design methods are evaluated on the task of maximizing the classification or regression score of five genomic deep neural network predictors $\mathcal{P}$: DragoNN, DeepSEA (Zhou et al., 2015), APARENT (Bogard et al., 2019), MPRA-DragoNN (Movva et al., 2019), and Optimus 5’ (Sample et al., 2019). Here is a brief description of each fitness predictor:

DragoNN – Predicts the probability of SPI1 transcription factor (TF) binding within a 1000-nt sequence. We define $\mathcal{P}$ as the logit score of the network output.

DeepSEA – Predicts multiple TF binding probabilities and chromatin modifications in a 1000-nt sequence. We define $\mathcal{P}$ as the logit score of the CTCF (Dnd41) output.

APARENT – Predicts proximal alternative polyadenylation isoform abundance in a 206-nt sequence. We define $\mathcal{P}$ as the logit score of the network output.

MPRA-DragoNN – Predicts transcriptional activity of a 145-nt promoter sequence. We define $\mathcal{P}$ as the sixth output (SV40) of the ’Deep Factorized’ model.

Optimus 5’ – Predicts mean ribosome load in a 50-nt sequence. $\mathcal{P}$ is the (non-scaled) output of the ’evolution’ model.
We further evaluated the methods by designing protein sequences which conform to target structures, using the neural network trRosetta as the tertiary structure predictor (Yang et al., 2020). The optimization objective is slightly more complicated than in the five examples above, so we devote Section 5.4 to describing the predictor and design task in more detail.
Starting with a randomly initialized logit matrix $l$, for the methods PWM and PWM-IN we optimize $\mathcal{P}(\sigma(l))$ using the continuous softmax relaxation from Equation 3. For Sampled and Sampled-IN, we optimize $\mathcal{P}(\delta)$ using the discrete nucleotide sampler from Equation 5. In all experiments, we report the mean training loss for $K$ independently optimized patterns. We define the train loss as:

$\mathcal{L}_{\text{train}} = -\frac{1}{K} \sum_{k=1}^{K} \mathcal{P}\left(x^{(k)}\right)$

For PWM and PWM-IN, $x^{(k)} = \sigma(l^{(k)})$. For Sampled and Sampled-IN, $x^{(k)} = \delta^{(k)}$.
The test loss is evaluated on the basis of discrete sequence samples drawn from the optimized softmax representation $\sigma(l)$, regardless of design method. For all four methods, we use the categorical nucleotide sampler to draw sequence samples and compute the mean test loss as:

$\mathcal{L}_{\text{test}} = -\frac{1}{K \cdot S} \sum_{k=1}^{K} \sum_{s=1}^{S} \mathcal{P}\left(\delta^{(k, s)}\right)$

Here, $S$ refers to the number of samples drawn from each softmax sequence at every weight update $t$, and $K$ is the number of independent optimization runs; both are held fixed across all experiments.
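This evaluation protocol can be sketched as follows, with a toy stand-in predictor in place of the actual fitness networks:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_test_loss(predictor, logit_mats, n_samples, rng):
    """Evaluate every method the same way: draw discrete one-hot samples
    from each optimized softmax sequence and average the negated
    predictor score."""
    losses = []
    for l in logit_mats:                 # one logit matrix per run (K runs)
        probs = softmax(l)
        n, m = probs.shape
        for _ in range(n_samples):       # S samples per run
            idx = np.array([rng.choice(m, p=p) for p in probs])
            delta = np.zeros_like(probs)
            delta[np.arange(n), idx] = 1.0
            losses.append(-predictor(delta))
    return float(np.mean(losses))

# Toy stand-in predictor: fraction of nucleotide G (channel 2).
toy_predictor = lambda x: float(x[:, 2].mean())

rng = np.random.default_rng(3)
logit_mats = [rng.normal(size=(12, 4)) for _ in range(5)]
loss = mean_test_loss(toy_predictor, logit_mats, n_samples=10, rng=rng)
```

Because the test loss is always computed on discrete samples, methods that optimize a continuous relaxation get no credit for scores the predictor would only assign to non-discrete inputs.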
All optimization experiments were carried out in Keras (Chollet, 2015) using Adam with default parameters (Kingma et al., 2014). Some predictor models were ported using pytorch2keras.

5.1 Predictor Maximization
Here, we use the design methods PWM, PWM-IN, Sampled and Sampled-IN to generate maximally scored sequences for each of the five DNA-based predictors (Figure 1B). Sampled-IN converges close to its minimum test loss within a small fraction of the logit update budget for all predictors except MPRA-DragoNN and Optimus 5’. In contrast, PWM and Sampled do not converge within the full update budget. Sampled-IN converges to up to 3-fold better optima than all other competing methods. In fact, Sampled-IN reaches the same or better optima in 100-fold fewer updates than the competing methods for DragoNN, MPRA-DragoNN and DeepSEA, marking a 100x speedup. For Optimus 5’ and APARENT, the speedup is 20x-50x.
5.2 Pathologies of Softmax Relaxation
We hypothesized that some predictors, having been trained only on discrete one-hot coded patterns, may have poor predictive power on continuous-valued softmax sequence relaxations (the input to PWM). To test this, we measured both the training loss (which is based on the softmax sequence input $\sigma(l)$) and the test loss (which is based on discrete samples $\delta$) when maximizing MPRA-DragoNN.
Indeed, maximizing MPRA-DragoNN with PWM leads to a severely overestimated predictor score on the softmax input (Figure 2A), as the training loss is more than 6-fold lower than the test loss. Using Sampled-IN, on the other hand, the training and test losses are identical (Figure 2B). While its training loss is 2-fold higher than the training loss of PWM, the test loss is more than 3-fold lower.
5.3 Comparison to Evolutionary Algorithms and Simulated Annealing
Next, we compared Sampled-IN to the discrete nucleotide-swapping search algorithm ’Evolved’ (Sample et al., 2019) as well as the global optimization heuristic Simulated Annealing (’SA’; Figure 3A; Kirkpatrick et al., 1983). We benchmarked the methods on the DragoNN maximization task (1,000-nt sequences). The results show that Sampled-IN significantly outperforms SA (Figure 3B); the best fitness score that SA reached in 20,000 iterations was reached by Sampled-IN in less than 1,000 iterations, marking a 20x speedup. Interestingly, increasing the number of substitutions at each step of SA initially speeds up convergence, but results in worse optima. Furthermore, the pairwise nucleotide-swapping search, Evolved, never converges, suggesting that DragoNN maximization is a highly non-convex optimization problem.
5.4 Protein Structure Optimization
Multiple deep learning models have recently been developed for predicting tertiary protein structure (AlQuraishi, 2019; Senior et al., 2020; Yang et al., 2020). Here, we demonstrate our method by designing de novo protein sequences which conform to a target residue contact map as predicted by trRosetta (Yang et al., 2020). The predictor takes three inputs (Figure 4A): a one-hot coded sequence, a PSSM constructed from a multiple-sequence alignment (MSA) and a direct-coupling analysis (DCA) map. For our design task, we pass the optimizable one-hot pattern $\delta$ to the first two inputs and an all-zeros tensor as the DCA feature map. Given the predicted distributions of inter-residue distances $d$ and angles $\omega$, $\theta$ and $\phi$, we minimize the mean KL-divergence against the corresponding target distributions $d^{(T)}$, $\omega^{(T)}$, $\theta^{(T)}$ and $\phi^{(T)}$:

$\mathcal{L}_{\text{KL}} = \frac{1}{4} \left[ \overline{\text{KL}}\left(d^{(T)} \,\|\, d\right) + \overline{\text{KL}}\left(\omega^{(T)} \,\|\, \omega\right) + \overline{\text{KL}}\left(\theta^{(T)} \,\|\, \theta\right) + \overline{\text{KL}}\left(\phi^{(T)} \,\|\, \phi\right) \right]$

where $\overline{\text{KL}}$ denotes the KL-divergence averaged across all residue pairs.
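As a concrete sketch of this objective, the following computes a per-pair averaged KL-divergence on synthetic distributions; the array shapes and bin count are illustrative and do not match trRosetta's actual output binning:

```python
import numpy as np

def mean_kl(target, pred, eps=1e-9):
    """Mean KL(target || pred) over all residue pairs. `target` and `pred`
    are (L, L, B) arrays holding one distribution over B bins per residue
    pair, each summing to 1 along the last axis."""
    t = np.clip(target, eps, 1.0)
    p = np.clip(pred, eps, 1.0)
    return float((t * (np.log(t) - np.log(p))).sum(axis=-1).mean())

rng = np.random.default_rng(4)
L, B = 6, 10                          # residues, distance/angle bins
pred = rng.random((L, L, B))
pred /= pred.sum(axis=-1, keepdims=True)
target = rng.random((L, L, B))
target /= target.sum(axis=-1, keepdims=True)

loss_mismatch = mean_kl(target, pred)
loss_match = mean_kl(target, target)  # zero when the prediction hits the target
```

The full objective would average one such term per output head (distances and each angle map), as in the equation above.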
We compared Sampled and Simulated Annealing (SA) to a modified version of Sampled-IN, where logits are normalized across all residue channels (layer-normalized rather than instance-normalized) to reduce the increased variance of the one-hot channels. We used the methods to design protein sequences conforming to the target structure of an example protein (Sensor Histidine Kinase) from the trRosetta GitHub repository (https://github.com/gjoni/trRosetta). We optimized several independent sequences per design method and recorded the median KL-loss at each iteration. The results show that Sampled-IN converges faster and reaches better minima (Figure 4B); after 200 iterations, Sampled-IN reached a 2-fold lower KL-divergence than all other methods, and much of the target structure is visible (Figure 4C).
6 Conclusion
We have presented an improved version of gradient ascent (or activation maximization) for sequence design, combining logit normalization with stochastic nucleotide sampling and straight-through gradients. By normalizing nucleotide logits across positions and using a global entropy parameter, we keep logits proportionally scaled and centered at zero. As demonstrated on five deep learning predictors, logit normalization enables extremely fast sequence optimization, with a 100-fold speedup compared to the original method for many predictors, and with better predicted optima.
In addition to logit drift and vanishing gradients, the original sequence ascent method suffers from predictor pathologies due to passing continuous softmax sequence relaxations as input, a problem fully removed by using discrete sampling. We further observed that straight-through sampling leads to consistently better optima than softmax relaxation, suggesting that it traverses local minima. In fact, our method outperformed global optimization metaheuristics such as Simulated Annealing on more difficult design tasks, such as designing 1000-nt transcription factor binding sites or designing protein sequences which conform to a complex target structure.
The gradient of the entropy parameter $\gamma$ (or $\gamma_j$) in our design method adaptively adjusts the sampling stochasticity to trade off global and local optimization. In the beginning, $\gamma$ is small, corresponding to a high PWM entropy and consequently very diverse sequence samples. As optimization progresses, $\gamma$ grows, leading to more localized sequence changes. This adaptive mechanism, in combination with flexible nucleotide logits due to the normalization, results in a highly efficient design method.
We hope these algorithmic advances showcase the potential of biomolecular optimization and open the door to more research in differentiable sequence design. The approach introduced here could accelerate the design of functional enzymes and other biomolecules, potentially resulting in novel drug therapies, molecular sensors and more.
References
Alipanahi, B., Delong, A., Weirauch, M.T., & Frey, B.J. (2015). Predicting the sequence specificities of DNA and RNA-binding proteins by deep learning. Nature Biotechnology 33, 831-838.
AlQuraishi, M. (2019). End-to-end differentiable learning of protein structure. Cell Systems 8, 292-301.
Avsec, Ž., Weilert, M., Shrikumar, A., Alexandari, A., Krueger, S., Dalal, K., … & Zeitlinger, J. (2019). Deep learning at base-resolution reveals motif syntax of the cis-regulatory code (bioRxiv).
Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation (arXiv).
Bogard, N., Linder, J., Rosenberg, A. B., & Seelig, G. (2019). A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation. Cell 178, 91-106.
Brookes, D.H., Park, H., & Listgarten, J. (2019). Conditioning by adaptive sampling for robust design (arXiv).
Chollet, F. (2015). Keras.
Cheng, J., Nguyen, T.Y.D., Cygan, K.J., Çelik, M.H., Fairbrother, W.G. & Gagneur, J. (2019). MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biology 20, 48.
Chung, J., Ahn, S., & Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks (arXiv).
Costello, Z., & Martin, H.G. (2019). How to hallucinate functional proteins (arXiv).
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1 (arXiv).
Deaton, R. J., Murphy, R. C., Garzon, M. H., Franceschetti, D. R., & Stevens Jr, S. E. (1996, June). Good encodings for DNA-based solutions to combinatorial problems. In DNA Based Computers, 247-258.
Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T. F. G., Qin, C., … & Petersen, S. (2018). De novo structure prediction with deep-learning based scoring. Annual Reviews of Biochemistry 77, 363-382.
Eraslan, G., Avsec, Ž., Gagneur, J., & Theis, F.J. (2019). Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics 20, 389-403.
Gupta, A., & Zou, J. (2019). Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence 1, 105-111.
Hao, G. F., Xu, W. F., Yang, S. G., & Yang, G. F. (2015). Multiple simulated annealing-molecular dynamics (MSA-MD) for conformational space search of peptide and miniprotein. Scientific Reports 5, 15568.
Ibrahim, Z., Khalid, N. K., Lim, K. S., Buyamin, S., & Mukred, J. A. A. (2011, December). A binary vector evaluated particle swarm optimization based method for DNA sequence design problem. In 2011 IEEE Student Conference on Research and Development, 160-164.
Jaganathan, K., Kyriazopoulou Panagiotopoulou, S., McRae, J. F., Darbandi, S. F., Knowles, D., Li, Y. I., … Farh, K. K.-H. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.
Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax (arXiv).
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D., & Frey, B. J. (2017). Generating and designing DNA with deep generative models (arXiv).
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization (arXiv).
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220, 671-680.
Lanchantin, J., Singh, R., Lin, Z., & Qi, Y. (2016). Deep motif: Visualizing genomic sequence classifications (arXiv).
Linder, J., Bogard, N., Rosenberg, A. B., & Seelig, G. (2019). Deep exploration networks for rapid engineering of functional DNA sequences (bioRxiv).
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087-1092.
Movva, R., Greenside, P., Marinov, G. K., Nair, S., Shrikumar, A., & Kundaje, A. (2019). Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS ONE 14(6).
Mustaza, S. M., Abidin, A. F. Z., Ibrahim, Z., Shamsudin, M. A., Husain, A. R., & Mukred, J. A. A. (2011, December). A modified computational model of ant colony system in DNA sequence design. In 2011 IEEE Student Conference on Research and Development, 169-173.
Riesselman, A.J., Shin, J.E., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., & Marks, D.S. (2019). Accelerating Protein Design Using Autoregressive Generative Models (bioRxiv).
Sample, P. J., Wang, B., Reid, D. W., Presnyak, V., McFadyen, I. J., Morris, D. R., & Seelig, G. (2019). Human 5’ UTR design and variant effect prediction from a massively parallel translation assay. Nature Biotechnology 37, 803-809.
Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., … & Penedones, H. (2020). Improved protein structure prediction using potentials from deep learning. Nature 577, 706-710.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps (arXiv).
Tareen, A., & Kinney, J.B. (2019). Biophysical models of cis-regulation as interpretable neural networks (arXiv).
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization (arXiv).
Xiao, J., Xu, J., Chen, Z., Zhang, K., & Pan, L. (2009). A hybrid quantum chaotic swarm evolutionary algorithm for DNA encoding. Computers & Mathematics with Applications 57, 1949-1958.
Yang, K.K., Wu, Z., & Arnold, F.H. (2019). Machine-learning-guided directed evolution for protein engineering. Nature Methods 16, 687-694.
Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., & Baker, D. (2020). Improved protein structure prediction using predicted inter-residue orientations. Proceedings of the National Academy of Sciences.
Zhou, J., & Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods 12, 931-934.
Zou, J., Huss, M., Abid, A., Mohammadi, P., Torkamani, A., & Telenti, A. (2019). A primer on deep learning in genomics. Nature Genetics 51, 12-18.
Appendix A: Sequence optimization examples
Figure S1 depicts an example optimization run for each predictor, comparing the softmax sequences (PSSMs) generated by the PWM and Sampled-IN design methods. Results are shown for 200, 2000 and 20000 iterations of gradient ascent (Adam). The logit matrices were uniformly randomly initialized. When maximizing DragoNN, Sampled-IN generates dual SPI1 binding motifs. For APARENT, Sampled-IN generates five CFIm binding motifs, dual CSE hexamers, and multiple cut sites with CstF binding sites. For DeepSEA, Sampled-IN generates four CTCF binding sites. For MPRA-DragoNN, Sampled-IN generates a CRE and a CTCF site. Finally, for Optimus 5’, Sampled-IN generates a T-rich sequence with a uAUG.
Appendix B: Supplemental Comparisons
Here we show that the improved design method (Sampled-IN) is insensitive to optimizer settings, that it outperforms PWM even when enforcing low entropy, and that the softmax straight-through (ST) estimator is superior to both the original ST estimator and Gumbel sampling.
Entropy penalties and the Gumbel distribution
Curious whether the gap we observed between training and test loss for the PWM method in Figure 2 was caused by high softmax sequence entropy, we tested whether an explicit entropy penalty would improve the method. We re-optimized sequences for Optimus 5’ and DragoNN with the penalty enforcing a minimum mean nucleotide conservation (in bits) (Figure S2A). Even at low entropy, PWM does not converge to nearly as good minima as Sampled-IN.
We also compared the performance of our logit-normalized, softmax straight-through design method Sampled-IN to a version of the method sampling from the Gumbel distribution instead (Jang et al., 2016; fixed temperature; Figure S2B). While the Gumbel variant of the design method reached the same optima as Sampled-IN, it converged more slowly. Importantly, just as with PWM and Sampled, the Gumbel design method benefited substantially from logit normalization (Gumbel-IN).
Insensitivity to optimizer settings
We noted that the performance of the PWM method depended on optimizer settings (Figure S3A); the speed at which it could maximize APARENT was increased by switching to an SGD optimizer with a very high learning rate. However, the method could still not reach the same optimum as Sampled-IN, which operated well at the default Adam learning rate.
Superiority of softmax straight-through gradients
We compared the performance of Sampled-IN, which uses the softmax straight-through gradient estimator ($\frac{\partial \delta}{\partial l} \approx \frac{\partial \sigma(l)}{\partial l}$), to a version using the original estimator ($\frac{\partial \delta}{\partial l} = 1$). As demonstrated on DragoNN, the softmax estimator reaches much better optima (Figure S3B). Sampling multiple sequences at each logit update and walking up the average gradient slightly speeds up convergence, but does not improve optima.
Appendix C: Additional Protein Structure Optimization Example
In addition to the protein structure design task evaluated in Figure 4 (Yang et al., 2020), we also benchmarked Sampled, Sampled-IN and Simulated Annealing (SA) on a separate protein structure. Here, we task the methods with designing sequences conforming to a coiled-coil hairpin structure, again using trRosetta as the differentiable structure predictor (Yang et al., 2020). We used the same KL-divergence loss as in Figure 4. The results, depicted in Figure S4, show that Sampled-IN and SA both converge very quickly to a near-optimal (zero) KL-divergence. This structure is likely much easier to design sequences for, as there is only one major long-range contact formation and the sequence is only about half as long as the one in Figure 4.