As deep learning systems have become more complex, their architectures and hyperparameters have become increasingly difficult and time-consuming to optimize by hand. In fact, many good designs may be overlooked entirely by humans with prior biases. Therefor, automating this process, known as metalearning, has become an essential part of the modern machine learning toolbox. Metalearning attempts to tackle this problem from numerous angles, both optimizing different aspects of the architecture, from hyperparametersschmidhuber1987evolutionary to topologies miikkulainen2019evolving
, as well as using different approaches from Bayesian optimization to evolutionary computationlemke2015metalearning .
Loss-function discovery and optimization has recently emerged as a new type of metalearning. It aims to tackle neural network’s root training goal, making it possible to regularize the solutions automatically. Genetic Loss Optimization (GLO) gonzalez2019glo
provided an initial implementation of this idea using a combination of genetic programming and evolutionary strategies.
However, loss functions can be challenging to represent in an optimization system due to their discrete nested structure that exists in conjunction with continuous coefficients. GLO tackled this problem by discovering and optimizing loss functions in two separate steps: evolution of structure and optimization of coefficients. Such separate processes make it challenging to find a mutually optimal structure and coefficients. Furthermore, the structured search space is difficult since small changes to the genotype do not always result in small changes in the phenotype and can easily make a function invalid.
In an ideal case, loss functions would be smoothly mapped into arbitrarily long, fixed-length vectors in a Hilbert space. This mapping should be smooth, well-behaved, well-defined, incorporate both a function’s structure and coefficients, and should by its very nature make large classes of infeasible loss functions a mathematical impossibility. This mapping should be well-behaved, incorporate both a function’s structure and coefficients, and should make large classes of infeasible loss functions mathematically impossible, by design.
This paper introduces such an approach: Multivariate Taylor expansion-based genetic loss-function optimization (TaylorGLO). By using a novel parameterization for loss functions, the key pieces of information that affect a loss function’s behavior can be compactly represented in a vector. Such vectors can then be optimized for a specific task using a Covariance-Matrix Adaptation Evolutionary Strategy (CMA-ES). Select techniques can be used to narrow down the search space and speed up evolution.
Loss functions discovered by TaylorGLO are able to outperform the log loss on the MNIST dataset. Compared to the original GLO technique, TaylorGLO is able to find loss functions that outperform the Baikal loss with significantly fewer evaluations. Like Baikal, TaylorGLO loss functions have good performance characteristics on reduced training datasets.
The following section aims to provide relevant literature in metalearning and loss function optimization, including a brief description of the original GLO technique, and background on multivariate Taylor approximations. This leads to a description of how such Taylor approximations are leveraged in a new loss function parameterization, and the TaylorGLO approach is described. TaylorGLO’s efficacy is then explored by applying it to image classification tasks. Finally, conclusions are presented.
2 Related Work
The recent rise in interest in machine learning and AI has occurred thanks to modern deep learning lecun2015deep . Designing performant deep neural networks involve a high-level of manual tuning of both the architecture and hyperparameters that has lead to neural network design being humorously considered an “art” rather than a “science”. The field of metalearning was developed to tackle this issue algorithmically schmidhuber1987evolutionary . Significant work exists in neural network metalearning using a variety of approaches lemke2015metalearning , with recent work tackling, for example, the evolution of entire architectures miikkulainen2019evolving .
Evolutionary computation is the right tool to use due to the power of evolution to find creative solutions that often elude humans across numerous disciplines lehman2018surprising , such as antenna design lohn2005evolution .
2.1 Loss Function Metalearning
Deep neural networks are traditionally trained iteratively, whereby model parameters (i.e., weights and biases, at a minimum) are updated using gradients that are propagated backwards through the network, starting from an error given by a loss function rumelhart1985learning . Loss functions represent the primary training objective of a neural network.
For many tasks, such as classification and language modeling, the cross-entropy loss (also known as the log loss) is used almost exclusively. Note that while many networks elect to add regularization terms (e.g. weight regularization tikhonov1963regularization ) to their loss function definition, the core component is still the cross-entropy loss. Information-theoretic reasoning is used to arrive at the cross-entropy loss, which aims to minimize the number of bits needed to identify a message from the true distribution, using a code from the predicted distribution.
While the variety in loss functions for classification is rather low, different types of tasks that do not fit neatly into a single-label classification modality often have different task-specific loss functions gonzalez2019faster ; gao20192sound ; kingma2013vae ; zhou2016modelling ; dong2017semantic , demonstrating the merit of choice. Every instance where a loss function is manually and semi-arbitrarily (i.e., without a specific justification) chosen for a task is an opportunity for a system to find a more optimal loss function automatically.
Genetic Loss Optimization (GLO) gonzalez2019glo provided an initial study into metalearning loss functions for neural networks. GLO used a two-phase approach that (1) evolved a function structure using a tree representation, and (2) optimized a structure’s coefficients using an evolutionary strategy. GLO was able to discover Baikal, a new loss function that outperformed the cross-entropy loss on tested image classification tasks, and had noteworthy properties, such as improved data utilization.
Tree-based representations, as used in GLO, have been dominant in genetic programming due to their flexibility and applicability to different types of function evolution domains. For example, this type of genetic programming succeeded in discovering nontrivial mathematical formulas that underly physical systems using noisy experimental data schmidt2009distilling .
However, due to the two-step approach that GLO takes to discovering loss functions, GLO is unable to discover a function with mutually beneficial structure and coefficients. Additionally, the majority of loss function candidates in GLO are never going to be useful, since, for example, many representable functions have discontinuities. Ultimately, this results in GLO’s search being very challenging, requiring large populations that are evolved for many generations. This is exacerbated by the fact that mutations to tree-based representations can have disproportionate effects on the functions they represent.
The technique presented in this paper TaylorGLO, aims to resolve these shortcomings by using evolutionary strategies and a novel loss function parameterization based on multivariate Taylor expansions.
2.2 Multivariate Taylor Expansions
Taylor expansions taylor1715methodus are a well-known function approximator that can represent differentiable functions within the neighborhood of a point using a polynomial series. Below, the common univariate Taylor expansion formulation is presented, followed by a natural extension to arbitrarily-multivariate functions.
Given a smooth (i.e., first through derivatives are continuous), real-valued function, , a th-order Taylor approximation at point , , where , can be constructed as,
Conventional, univariate Taylor expansions have a natural extension to arbitrarily high-dimensional inputs of . Given a smooth, real-valued function, , a th-order Taylor approximation at point , , where , can be constructed. The stricter smoothness constraint compared to the univariate case allows for the application of Schwarz’s theorem on equality of mixed partials, obviating the need to take the order of partial differentiation into account.
Let us define an th-degree multi-index, , where , , . , and . Multivariate partial derivatives can be concisely written using a multi-index
Thus, discounting the remainder term, the multivariate Taylor expansion for at is
The unique partial derivatives in and are parameters for a th order Taylor expansion. Thus, a th order Taylor expansion of a function in variables requires parameters to define the center, , and one parameter for each unique multi-index , where . That is:
The multivariate Taylor expansion can be leveraged for a novel loss-function parameterization. It enables TaylorGLO, a way to efficiently optimize loss functions, as will be described in subsequent sections.
3 Loss functions as multivariate Taylor expansions
Let an -class classification loss function be defined as
The function can be replaced by its th-order, bivariate Taylor expansion, . More sophisticated loss functions can be supported by having more input variables, beyond and
, such as a time variable or unscaled logits. This approach can be useful, for example, to evolve loss functions that change as training progresses.
For example, a loss function in and has the following 3rd-order parameterization with parameters (where ):
Notably, the reciprocal-factorial coefficients can be integrated to be a part of the parameter set by direct multiplication if desired.
As will be shown in this paper, the technique makes it possible to train neural networks that are more accurate and learn faster, than those with tree-based loss function representations. Representing loss functions in this manner confers several useful properties:
Guarantees smooth functions;
Functions do not have poles (i.e., discontinuities going to infinity or negative infinity) within their relevant domain;
Can be implemented purely as compositions of addition and multiplication operations;
Can be trivially differentiated;
Nearby points in the search space yield similar results (i.e., the search space is locally smooth), making the fitness landscape easier to search;
Valid loss functions can be found in fewer generations and with higher frequency;
Loss function discovery is consistent and does not depend as much on a specific initial population; and
The search space has a tunable complexity parameter (i.e., the order of the expansion).
These properties are not necessarily held by alternative function approximators. For instance:
- Fourier series
are well suited for approximating periodic functions fourier1829theorie , therefor, they are not as well suited for loss functions, whose local behavior within a narrow domain is important. Being a composition of waves, Fourier series tend to have many critical points within the domain of interest. Gradients fluctuate around such points, making gradient descent infeasible. Additionally, close approximations require a large number of terms, which in itself can be injurious, causing large, high-frequency fluctuations, known as “ringing”, due to Gibb’s phenomenon wilbraham1848certain .
- Padé approximants
can be more accurate approximations than Taylor expansions; indeed, Taylor expansions are a special case of Padé approximants where graves1979numerical . However, unfortunately, Padé approximants can model functions with one or more poles, which valid loss functions typically should not have. These problems still exist, and are exacerbated, for Chisholm approximants chisholm1973rational (a bivariate extension) and Canterbury approximants graves1975calculation (a multivariate generalization).
- Laurent polynomials
can represent functions with discontinuities, the simplest being . While Laurent polynomials provide a generalization of Taylor expansions into negative exponents, the extension is not useful because it results in the same issues as Padé approximants.
- Polyharmonic splines
seem like an excellent fit since they can represent continuous functions within a finite domain. However, the number of parameters is prohibitive in multivariate cases.
The multivariate Taylor expansion is therefore a better choice than the alternatives. It makes it possible to optimize loss functions efficiently in TaylorGLO, as will be described next.
4 The Taylor-GLO Approach
TaylorGLO (Figure 1) aims to find the optimal parameters for a loss function parameterized as a multivariate Taylor expansion, as described in Section 3. The parameters for a Taylor approximation (i.e., the center point and partial derivatives) are referred to as . , . TaylorGLO strives to find the vector that parameterizes the optimal loss function for a task.
Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) hansen1996cmaes ,a population-based search technique, is used to find try to find . At each generation, CMA-ES samples points in whose fitness must be determined; this is accomplished by training a model with the corresponding loss function and evaluating the model on a validation dataset. Fitness evaluations may be distributed across multiple machines in parallel. An initial vector of is chosen as a starting point in the search space to avoid imparting bias.
CMA-ES is a popular black-box optimization technique that is able to optimize an objective on rugged spaces. CMA-ES functions by maintaining a covariance matrix around a mean point that represents a distribution of solutions. At each generation CMA-ES adapts the distribution to better fit evaluated objective values from sampled individuals. In this manner, the area in the search space which is being sampled at each step dynamically grows, shrinks, and moves as needed to maximize sampled candidates’ fitnesses.
The specific variant of CMA-ES that TaylorGLO uses is ()-CMA-ES hansen2001cmaesmumulambda , which incorporates weighted rank- updates hansen2004weightedrankmucmaes to reduce the number of objective function evaluations that are needed. As in the original GLO technique, the objective function is the network’s performance on a validation dataset after a shortened training period. While objective function evaluations from incomplete training are noisier, this is offset by the large reduction in time spent per evaluation.
4.1 Search Space and Efficiency Improvements
While Taylor expansions already provide many desirable restrictions on the search space, the search space can be further refined to simplify and expedite the loss function discovery process.
4.1.1 Partial training
For each loss function evaluation, a model needs to be trained to assess the loss function’s efficacy. This can prove to be prohibitively expensive more all but the simplest problems. As such, models are only partially trained, since there is a positive correlation between performance near the beginning of training and at the end of training. An additional positive effect to partial training is the selection pressure towards loss functions that learn more quickly.
4.1.2 Trimming away ineffectual terms
For a loss function to be useful, it must have a derivative that depends on the prediction. Therefor, internal terms that do not contribute to can be trimmed away. This implies that any term, within , where , can be replaced with .
For example, this refinement simplifies Equation 6 to:
providing a reduction in the number of parameters from twelve to eight.
5 Experimental Setup
This section presents the experimental setup that was used to evaluate the TaylorGLO technique. The MNIST and CIFAR-10 image classification tasks were used as domains to measure the technique’s efficacy, and provide a point of comparison against the original GLO technique and the standard cross-entropy loss function. Additionally, having two domains makes it possible to test the transferability of TaylorGLO loss functions between tasks and model architectures.
Experiments on TaylorGLO are performed using two popular image classification datasets, MNIST Handwritten Digits mnist and CIFAR-10 krizhevsky2009learning . Both datasets are well understood and relatively quick to train. In the following two subsections, the two datasets and the respective model architectures are described. The model architectures are standard, since achieving state-of-the-art accuracy on MNIST and CIFAR-10 is not the focus of this paper, rather, the improvements brought about by using a TaylorGLO loss function are.
Both of these tasks are classification problems, and traditionally framed with the standard cross-entropy loss:
where is sampled from the true distribution, is from the predicted distribution, and is the number of classes. The cross-entropy loss is used as a baseline in this paper’s experiments.
The first domain used for evaluation was the MNIST Handwritten Digits mnist
, a widely used dataset where the goal is to classifypixel images as one of ten digits. MNIST has 55,000 training samples, 5,000 validation samples, and 10,000 testing samples. The same CNN architecture presented in the GLO study gonzalez2019glo is used to provide a direct point of comparison. Notably, this architecture contains a dropout layer hinton2012improving
to explicitly provide regularization. Additionally, as in GLO’s experimentation, training is based on stochastic gradient descent (SGD) with a batch size of 100, a learning rate of 0.01, and, unless otherwise specified, occurred over 20,000 steps.
To further validate TaylorGLO, the more challenging CIFAR-10 dataset krizhevsky2009learning dataset of small, color photographs in ten classes, was used to test the evolution and transferability of loss functions found on a different domain. CIFAR-10 traditionally consists of 50,000 training samples, and 10,000 testing samples; however 5,000 samples from the training dataset were used for validation of candidates, resulting in 45,000 training samples.
Models trained on CIFAR-10 used a ResNet architecture resnet
with five residual blocks, commonly known as ResNet-32. Models were trained using SGD with a batch size of 128, and momentum of 0.9 for 200 epochs. Training proceeded with a fixed learning-rate schedule that starts at 0.1 and decays by a factor of 0.1 at 100 and 150 epochs.
regularization with a weight decay of 0.0001 provided regularization on the model’s weights. Inputs to the network have their mean pixel value subtracted and are divided by their pixel standard deviation. Data augmentation techniques such as random, horizontal flips and croppings with two pixel padding were also applied during training.
A CMA-ES was set to have a population size and an initial step size . These values were found to work well experimentally. The candidates were third-order (i.e., ) TaylorGLO loss functions (Equation 7). Such functions were found experimentally to have a better trade-off between evolution time and performance compared to second- and fourth-order TaylorGLO loss functions (although the differences were relatively).
During candidate evaluation, models were trained for 10% of a full training run. On MNIST, this equates to 2,000 steps (i.e., 4 epochs), and on CIFAR-10, 20 epochs.
5.3 Implementation Details
Due to the number of partial training sessions that are needed to evaluate TaylorGLO loss function candidates, training was distributed across the network to a cluster, composed of dedicated machines with NVIDIA GeForce GTX 1080Ti GPUs. Training itself was implemented with TensorFlowtensorflow
in Python. The primary components of TaylorGLO (i.e., the genetic algorithm and CMA-ES) were implemented in the Swift programming language which allows for easy parallelization. These components run centrally on one machine and asynchronously dispatch work to the cluster. The implementation made use of the open-source SwiftCMAswiftcma library for CMA-ES.
The subsequent subsections present experimental results that show both how the TaylorGLO evolution process functions, and how the loss functions from TaylorGLO can be used as high-performance, drop-in replacements for the cross-entropy loss function.
On MNIST, the average testing accuracy for models trained with the cross-entropy loss was 0.9903, with a standard deviation of 0.0005 (). At 10% through the training process, corresponding to the amount of training undergone to evaluate TaylorGLO candidates, the average testing accuracy was 0.9656 with a standard deviation of 0.0015 ().
On CIFAR-10, the average testing accuracy for models trained with the cross-entropy loss was 0.9139, with a standard deviation of 0.0030 ()
Figure 2 shows an overview of the evolution process over 60 generations on the MNIST dataset, more than sufficient to reach convergence. TaylorGLO is able to quickly discover highly-performing loss functions, with all improvements being discovered within 20 generations. Generations’ average validation accuracy approaches generations’ best accuracy as evolution progresses, indicating that populations as a whole are improving. Notably, every single partial training session completed successfully without any instability, a stark contrast to GLO, whose unbounded search space often results in pathological functions.
Figure 3 shows the shapes and parameters of each generation’s highest-scoring loss function. They are plotted as if they were being used for binary classification, using the procedure described in the GLO study gonzalez2019glo . The functions are colored according to their generation. Additionally, plotted loss function curves are vertically shifted such that their loss at is zero; the raw value of a loss function is not relevant, the derivative, however, is. The functions have a distinct pattern through the evolution process, where early generations show a wider variety of shapes that converge in later generations towards curves with a shallow minimum around (the best loss function found on MNIST—described below—specifically had a minimum at ). Most notably, these well performing loss functions are not monotonically decreasing as the cross-entropy loss is. They each have a local minima near, but not at, , after which the loss increases again as approaches 1. This shape provides an implicit regularization effect in the same manner that the Baikal loss function does gonzalez2019glo : it discourages overfitting by preventing the model from being too confident in its predictions.
This structure becomes quite evident when dimensionality reduction using t-SNE tsne is performed on every candidate loss function within a run of TaylorGLO evolution, as illustrated in Figure 4. Points on the plot quickly migrate and spread in initial generations as CMA-ES grows the parameter space that is being explored (demonstrating resilience to the initial step size ). This pattern concurs with the spreading that occurs in early generations in the bottom plot of Figure 3. The points gradually migrate across the t-SNE space and finally coalesce in the smaller region of red points.
The best loss function obtained from running TaylorGLO on MNIST was found in generation 74. This function, with parameters , , , , , , , (rounded to four decimal-places), achieved a 2k-step validation accuracy of 0.9950 on its single evaluation, higher than 0.9903 for the cross entropy loss. This loss function was a modest improvement over the previous best loss function from generation 16, which had a validation accuracy of 0.9958.
Over 10 fully-trained models, the best TaylorGLO loss function achieved a mean testing accuracy of 0.9951, with a standard deviation of 0.0005, while the cross-entropy loss only reached 0.9899, and the BaikalCMA loss function from the original GLO gonzalez2019glo technique reached 0.9947. These figures are shown in Figure 5. The testing accuracy improvement the TaylorGLO loss function confers over the cross-entropy loss is highly statistically significant, with a -value of-value of 0.0625.
Notably, the top loss function from TaylorGLO on MNIST slightly outperforms the top loss function from the original GLO technique, while requiring significantly fewer generations and partial training sessions. BaikalCMA required 11,120 partial evaluations (i.e., 100 individuals over 100 GP generations plus 32 individuals over 35 CMA-ES generations, ignoring evaluations subsequent to the discovery of BaikalCMA), while the top TaylorGLO loss function only required 448 partial evaluations (that is, as many in this case). Thus, TaylorGLO is a practical replacement for GLO which can achieve comparable results with significantly fewer evaluations.
6.3 Performance on Reduced Datasets
The best TaylorGLO loss function on MNIST has significantly better performance characteristics than the cross-entropy loss function when applied on reduced datasets. This property appears to be a characteristic of optimized loss functions gonzalez2019glo . Figure 6 presents a comparison of accuracies for models trained on different portions of the MNIST dataset. Overall, TaylorGLO significantly outperforms the cross-entropy loss. All models were trained for the same number of steps (i.e., 20,000).
This paper proposes multivariate Taylor expansion-based genetic loss-function optimization (TaylorGLO), a new technique for loss-function metalearning. TaylorGLO leverages a novel parameterization for loss functions that is designed to have desirable properties and allows for a continuous optimization approach to be used, instead of genetic programming. TaylorGLO loss functions are able to significantly outperform the cross-entropy loss on MNIST, and provide a slight improvement over the original GLO technique while requiring many fewer candidates to be evaluated. Analyses suggest that TaylorGLO is a mature technique that can provide customized loss functions that result in higher testing accuracies and better data utilization.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, Savannah, GA, 2016. USENIX Association.
-  J. Chisholm. Rational approximants defined from double power series. Mathematics of Computation, 27(124):841–848, 1973.
H. Dong, S. Yu, C. Wu, and Y. Guo.
Semantic image synthesis via adversarial learning.
Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5706–5714, 2017.
-  J. B. Fourier. La théorie analytique de la chaleur. Mémoires de l’Académie Royale des Sciences de l’Institut de France, 8:581–622, 1829.
R. Gao and K. Grauman.
2.5D visual sound.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019.
-  S. Gonzalez. Swiftcma. https://github.com/sgonzalez/SwiftCMA, 2019.
-  S. Gonzalez, J. Landgraf, and R. Miikkulainen. Faster training by selecting samples using embeddings. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
-  S. Gonzalez and R. Miikkulainen. Improved training speed, accuracy, and data utilization through loss function optimization. arXiv preprint arXiv:1905.11528, 2019.
-  P. Graves-Morris. The numerical calculation of Padé approximants. In Padé approximation and its applications, pages 231–245. Springer, 1979.
-  P. Graves-Morris and D. Roberts. Calculation of Canterbury approximants. Computer Physics Communications, 10(4):234–244, 1975.
-  N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In International Conference on Parallel Problem Solving from Nature, pages 282–291. Springer, 2004.
-  N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
-  N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
-  D. Kingma and M. Welling. Auto-encoding variational Bayes. In Second International Conference on Learning Representations (ICLR), 12 2014.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
-  Y. LeCun, C. Cortes, and C. Burges. The MNIST dataset of handwritten digits, 1998.
-  J. Lehman et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.
-  C. Lemke, M. Budka, and B. Gabrys. Metalearning: a survey of trends and technologies. Artificial Intelligence Review, 44(1):117–130, 2015.
-  J. D. Lohn, G. S. Hornby, and D. S. Linden. Evolution, re-evolution, and prototype of an X-band antenna for NASA’s Space Technology 5 mission. In International Conference on Evolvable Systems, pages 205–214. Springer, 2005.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier, 2019.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
-  J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
-  M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
-  B. Taylor. Methodus incrementorum directa & inversa. Auctore Brook Taylor, LL. D. & Regiae Societatis Secretario. typis Pearsonianis: prostant apud Gul. Innys ad Insignia Principis in …, 1715.
-  A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. In Proceedings of the USSR Academy of Sciences, volume 4, pages 1035–1038, 1963.
-  H. Wilbraham. On a certain periodic function. The Cambridge and Dublin Mathematical Journal, 3:198–201, 1848.
-  Y. Zhou, C. Liu, and Y. Pan. Modelling sentence pairs with tree-structured attentive encoder. arXiv preprint arXiv:1610.02806, 2016.