1 Introduction
Much of the power of modern neural networks originates from their complexity, i.e., the number of parameters, hyperparameters, and topology. This complexity is beyond human ability to optimize, and automated methods are needed. An entire field of metalearning has emerged recently to address this issue, based on various methods such as gradient descent, simulated annealing, reinforcement learning, Bayesian optimization, and evolutionary computation (EC) elsken2018neural .
While a wide repertoire of work now exists for optimizing many aspects of neural networks, the dynamics of training are still usually set manually, without concrete, scientific methods. Training schedules, loss functions, and learning rates all affect the training and final functionality of a neural network. Perhaps they could also be optimized through metalearning?
The goal of this paper is to verify this hypothesis, focusing on the optimization of loss functions. A general framework for loss function metalearning, covering both novel loss function discovery and optimization, is developed and evaluated experimentally. This framework, Genetic Loss-function Optimization (GLO), leverages genetic programming to build loss functions represented as trees, and subsequently a covariance matrix adaptation evolution strategy (CMA-ES) to optimize their coefficients.
EC methods were chosen because EC is arguably the most versatile of the metalearning approaches. It is a population-based search method, allowing for extensive exploration, which often results in creative, novel solutions that are not obvious at first lehman2018surprising . EC has been successful in hyperparameter optimization and architecture design in particular miikkulainen2019evolving ; nature_neuroevolution ; real2018regularized ; loshchilov2016cma . It has also been used to discover mathematical formulas to explain experimental data schmidt2009distilling . It is, therefore, likely to find creative solutions in the loss-function optimization domain as well.
Indeed, in the MNIST image classification benchmark, GLO discovered a surprising new loss function, named Baikal for its shape. This function performs very well, presumably by establishing an implicit regularization effect. Baikal outperforms the standard cross-entropy loss in terms of training speed, final accuracy, and data requirements. Furthermore, Baikal was found to transfer to a more complicated classification task, CIFAR-10, while carrying over its benefits.
The next section reviews related work in metalearning and EC, to help motivate the need for GLO. Following this review, GLO is described in detail, along with the domains upon which it has been evaluated. The subsequent sections present the experimental results, including an analysis of the loss functions that GLO discovers.
2 Related Work
In addition to hyperparameter optimization and neural architecture search, new opportunities for metalearning have recently emerged. In particular, learning rate scheduling and adaptation can have a significant impact on a model's performance. Learning rate schedules determine how the learning rate changes as training progresses. In practice, this functionality tends to be encapsulated away by different gradient-descent optimizers, such as AdaGrad adagrad and Adam adam . While the general consensus has been that monotonically decreasing learning rates yield good results, new ideas, such as cyclical learning rates smith2017cyclical , have shown promise in learning better models in fewer epochs.
Metalearning methods have also been recently developed for data augmentation, such as AutoAugment cubuk2018autoaugment , a reinforcement-learning-based approach to finding new data augmentation policies. In reinforcement learning tasks, EC has proven a successful approach. For instance, in evolved policy gradients houthooft2018evolved , the policy loss is not represented symbolically, but rather as a neural network that convolves over a temporal sequence of context vectors. In reward function search niekum2010genetic , the task is framed as a genetic programming problem, leveraging PushGP push .
In terms of loss functions, a generalization of the L2 loss was proposed with an adaptive loss parameter barron2017general . This loss function is shown to be effective in domains with multivariate output spaces, where robustness might vary across dimensions. Specifically, the authors found improvements in Variational Autoencoder (VAE) models, unsupervised monocular depth estimation, geometric registration, and clustering.
Notably, no existing work in the metalearning literature automatically optimizes loss functions for neural networks. As shown in this paper, evolutionary computation can be used in this role to improve neural network performance, gain a better understanding of the processes behind learning, and help reach the ultimate goal of fully automated learning.
3 The GLO Approach
The task of finding and optimizing loss functions can be framed as a functional regression problem. GLO accomplishes this through the following high-level steps (shown in Figure 1): (1) loss function discovery: using approaches from genetic programming, a genetic algorithm builds new candidate loss functions; and (2) coefficient optimization: to further optimize a specific loss function, a covariance matrix adaptation evolution strategy (CMA-ES) is leveraged to optimize its coefficients.
3.1 Loss function discovery
GLO uses a population-based search approach, inspired by genetic programming, to discover new optimized loss function candidates. Under this framework, loss functions are represented as trees within a genetic algorithm. Trees are a logical choice for representing functions due to their hierarchical nature. The loss function search space is defined by the following tree nodes:

[noitemsep,nosep]
 Unary Operators:
 Binary Operators:
 Leaf Nodes: $x$, $y$, and integer scalars such as 1,
where $x$ represents a true label and $y$ represents a predicted label.
The search space is further refined by automatically assigning a fitness of 0 to trees that do not contain at least one $x$ and at least one $y$. Generally, a loss function's fitness within the genetic algorithm is the validation performance of a network trained with that loss function. To expedite the discovery process, and to encourage the invention of loss functions that enable faster learning, training does not proceed to convergence. Unstable training sessions that result in NaN values are assigned a fitness of 0. Fitness values are cached to avoid retraining the same network twice.
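The tree representation and the validity guard above can be sketched with nested tuples in Python. The operator sets below are illustrative stand-ins (the exact operator lists are part of GLO's search space definition, not reproduced here), and the helper names are hypothetical:

```python
import math

# Illustrative operator sets; GLO's actual unary/binary node lists
# are defined by its search space and may differ.
UNARY = {"log": lambda a: math.log(a), "neg": lambda a: -a,
         "square": lambda a: a * a}
BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
          "mul": lambda a, b: a * b, "div": lambda a, b: a / b}

def evaluate(tree, x, y):
    """Evaluate a loss tree at one (true label, predicted label) pair."""
    if tree == "x":
        return x
    if tree == "y":
        return y
    if isinstance(tree, (int, float)):
        return float(tree)
    op, *children = tree
    args = [evaluate(c, x, y) for c in children]
    return (UNARY[op] if op in UNARY else BINARY[op])(*args)

def references(tree, leaf):
    """True if the tree contains the given leaf ('x' or 'y')."""
    if tree == leaf:
        return True
    if isinstance(tree, tuple):
        return any(references(c, leaf) for c in tree[1:])
    return False

def is_valid(tree):
    # Trees lacking either label are assigned a fitness of 0 in GLO.
    return references(tree, "x") and references(tree, "y")

# Example: the negated per-class term log(y) - x/y as a tree.
tree = ("neg", ("sub", ("log", "y"), ("div", "x", "y")))
```

A degenerate tree such as `("log", "y")` fails `is_valid` and would receive a fitness of 0 without any training.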
The initial population is composed of randomly generated trees with a maximum depth of 2. Recursively starting from the root, nodes are randomly chosen from the allowable operator and leaf nodes using a weighted sampling, in which some node types are two or three times as likely to be chosen as others; this weighting imparts a bias and prevents, for example, the integer 1 from occurring too frequently. The genetic algorithm has a population size of 80, incorporates elitism with six elites per generation, and uses roulette sampling.
Recombination is accomplished by randomly splicing two trees together. For a given pair of parent trees, a random element is chosen in each as a crossover point. The two subtrees, whose roots are the two crossover points, are then swapped with each other. Figure 1 presents an example of this. Both resultant trees become part of the next generation. Recombination occurs with a fixed probability.
To introduce variation into the population, the genetic algorithm has the following mutations, applied in a bottom-up fashion:

[noitemsep,nosep]
 Integer scalar nodes are incremented or decremented with a fixed probability.
 Nodes are replaced with a weighted-random node with the same number of children with a fixed probability.
 Nodes (and their children) are deleted and replaced with a weighted-random leaf node with a fixed probability.
 Leaf nodes are deleted and replaced with a weighted-random element (and weighted-random leaf children if necessary) with a fixed probability.
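The subtree-swapping recombination and one of the mutations above can be sketched over nested-tuple trees. The tuple encoding and helper names are illustrative, not GLO's actual Swift implementation:

```python
import random

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs for every node in a tuple tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace_at(tree, path, subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(a, b, rng):
    """Pick a random crossover point in each parent and swap the
    subtrees rooted there; both offspring join the next generation."""
    pa, sa = rng.choice(list(nodes(a)))
    pb, sb = rng.choice(list(nodes(b)))
    return replace_at(a, pa, sb), replace_at(b, pb, sa)

def mutate_scalars(tree, rng, p=0.1):
    """One of GLO's mutations: increment or decrement integer scalars
    with probability p (the probability value is an assumption here)."""
    if isinstance(tree, (int, float)):
        return tree + rng.choice([-1, 1]) if rng.random() < p else tree
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(mutate_scalars(c, rng, p) for c in tree[1:])
    return tree
```

Note that subtree swapping conserves the combined node count of the two parents, so offspring cannot grow without bound in a single crossover.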
3.2 Coefficient optimization
Loss functions found by the above genetic algorithm can all be thought of as having unit coefficients for each node in the tree. This set of coefficients can be represented as a vector with dimensionality equal to the number of nodes in a loss function's tree. The coefficient vector is optimized independently and iteratively using a covariance matrix adaptation evolution strategy (CMA-ES) hansen1996cmaes . The specific variant of CMA-ES that GLO uses is $(\mu/\mu,\lambda)$-CMA-ES hansen2001cmaesmumulambda , which incorporates weighted rank-$\mu$ updates hansen2004weightedrankmucmaes to reduce the number of objective function evaluations that are needed. The implementation of GLO presented in this paper uses an initial step size $\sigma$. As in the discovery phase, the objective function is the network's performance on a validation dataset after a shortened training period.
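The coefficient-optimization loop can be sketched with a simplified $(\mu/\mu,\lambda)$ evolution strategy. This is a stand-in for CMA-ES: it samples around the current mean, keeps the best $\mu$ of $\lambda$ offspring, and recombines, but omits covariance adaptation (a crude step-size decay replaces it). The step-size value and all hyperparameters below are assumptions:

```python
import random

def optimize_coefficients(fitness, n_nodes, sigma=1.5, lam=12, mu=6,
                          iterations=50, seed=0):
    """Simplified (mu/mu, lambda) ES over a per-node coefficient vector.
    `fitness` maps a coefficient vector to validation performance
    (higher is better). Real GLO uses weighted rank-mu CMA-ES."""
    rng = random.Random(seed)
    mean = [1.0] * n_nodes  # GLO starts from implicit unit coefficients
    for _ in range(iterations):
        # Sample lambda offspring around the current mean.
        offspring = [[m + sigma * rng.gauss(0, 1) for m in mean]
                     for _ in range(lam)]
        # Truncation selection: keep the mu fittest.
        offspring.sort(key=fitness, reverse=True)
        elite = offspring[:mu]
        mean = [sum(col) / mu for col in zip(*elite)]
        sigma *= 0.95  # crude decay in place of CMA step-size control
    return mean
```

On a toy objective (negative squared distance to a target vector), the mean converges close to the target, which illustrates the mechanism without any network training.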
3.3 Implementation details
Due to the large number of partial training sessions that are needed for both the discovery and optimization phases, training is distributed across a cluster of dedicated machines that use Condor condor for scheduling. Each machine in this cluster has one NVIDIA GeForce GTX Titan Black GPU and two Intel Xeon E5-2603 (4-core) CPUs running at 1.80GHz with 8GB of memory. Training itself is implemented with TensorFlow tensorflow in Python. The primary components of GLO (i.e., the genetic algorithm and CMA-ES) are implemented in Swift. These components run centrally on one machine and asynchronously dispatch work to the Condor cluster over SSH. Code for the Swift CMA-ES implementation is open sourced at: https://github.com/sgonzalez/SwiftCMA
4 Experimental Evaluation
This section provides an experimental evaluation of GLO, on the MNIST and CIFAR-10 image classification tasks. Baikal, a GLO loss function found on MNIST, is presented and evaluated in terms of its resulting testing accuracy, training speed, training data requirements, and transferability to CIFAR-10.
4.1 Target tasks
Experiments on GLO are performed using two popular image classification datasets, MNIST Handwritten Digits mnist and CIFAR-10 krizhevsky2009learning . Both datasets, MNIST in particular, are well understood and relatively quick to train, which allowed rapid iteration in the development of GLO and left time for more thorough experimentation. The following two sections describe the two datasets and the respective model architectures that were used. The model architectures are simple, since achieving state-of-the-art accuracy on MNIST and CIFAR-10 is not the focus of this paper; rather, the focus is the improvements brought about by using a GLO loss function.
Both of these tasks, being classification problems, are traditionally framed with the standard cross-entropy loss (sometimes referred to as the log loss): $\mathcal{L}_{\text{log}} = -\frac{1}{n}\sum_{i=0}^{n} x_i \log(y_i)$, where $x_i$ is sampled from the true distribution, $y_i$ is from the predicted distribution, and $n$ is the number of classes. The cross-entropy loss is used as a baseline in this paper's experiments.
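The baseline cross-entropy loss can be computed directly in pure Python; here `x` is the one-hot true distribution and `y` the predicted softmax output (a minimal single-sample sketch, with the $\frac{1}{n}$ normalization included as in the formula above):

```python
import math

def cross_entropy(x, y):
    """Cross-entropy (log) loss for one sample: -1/n * sum(x_i * log(y_i))."""
    n = len(x)
    return -sum(xi * math.log(yi) for xi, yi in zip(x, y)) / n
```

Because only the true class contributes a nonzero term, the loss decreases monotonically as the predicted probability of the true class grows; for example, the confident prediction `[0.99, 0.005, 0.005]` scores lower than `[0.7, 0.2, 0.1]` for true class 0.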
4.1.1 MNIST
The first target task used for evaluation was the MNIST Handwritten Digits dataset mnist , a widely used dataset where the goal is to classify 28×28-pixel images as one of ten digits. The MNIST dataset has 55,000 training samples, 5,000 validation samples, and 10,000 testing samples.
A simple CNN architecture with the following layers is used: (1) a convolution with 32 filters, (2) stride-2 max-pooling, (3) a convolution with 64 filters, (4) stride-2 max-pooling, (5) a 1024-unit fully-connected layer, (6) a dropout layer hinton2012improving with 40% dropout probability, and (7) a softmax layer. ReLU activations nair2010rectifiedactivations are used. Training uses stochastic gradient descent (SGD) with a batch size of 100, a learning rate of 0.01, and, unless otherwise specified, runs for 20,000 steps.
4.1.2 CIFAR-10
To further validate GLO, the more challenging CIFAR-10 dataset krizhevsky2009learning (a popular dataset of small color photographs in ten classes) was used as a medium to test the transferability of loss functions found on a different domain. CIFAR-10 consists of 50,000 training samples and 10,000 testing samples.
A simple CNN architecture, taken from gonzalez2019faster (and itself inspired by AlexNet NIPS2012_4824 ), with the following layers is used: (1) a convolution with 64 filters and ReLU activations, (2) max-pooling with a stride of 2, (3) local response normalization NIPS2012_4824 , (4) a convolution with 64 filters and ReLU activations, (5) local response normalization, (6) max-pooling with a stride of 2, (7) a 384-unit fully-connected layer with ReLU activations, (8) a 192-unit fully-connected, linear layer, and (9) a softmax layer.
Inputs to the network are cropped to a smaller size than the 32×32 images provided in the dataset; this enables more sophisticated data augmentation. To force the network to better learn spatial invariance, random croppings are selected from each full-size image; these are randomly flipped longitudinally, randomly lightened or darkened, and randomly perturbed in contrast. Furthermore, to attain quicker convergence, an image's mean pixel value and variance are subtracted and divided, respectively, from the whole image during training and evaluation. CIFAR-10 networks were trained with SGD, $L_2$ regularization with a weight decay of 0.004, a batch size of 1024, and an initial learning rate of 0.05 that decays by a factor of 0.1 every 350 epochs.
4.2 The Baikal loss function
The most notable loss function that GLO discovered against the MNIST dataset (with 2,000-step training for candidate evaluation) is the Baikal loss (named as such due to its similarity to the bathymetry of Lake Baikal when its binary variant is plotted in 3D; see Section 5.1):
$$\mathcal{L}_{\text{Baikal}} = -\frac{1}{n}\sum_{i=0}^{n}\left(\log(y_i) - \frac{x_i}{y_i}\right) \quad (1)$$
where $x_i$ is from the true distribution, $y_i$ is from the predicted distribution, and $n$ is the number of classes. Additionally, after coefficient optimization, GLO arrived at the following version of the Baikal loss:
$$\mathcal{L}_{\text{BaikalCMA}} = -\frac{1}{n}\sum_{i=0}^{n} c_0\left(c_1\log(c_2\, y_i) - \frac{c_3\, x_i}{c_4\, y_i}\right) \quad (2)$$
where the $c_j$ are the per-node coefficients found by CMA-ES.
This loss function, BaikalCMA, was selected for having the highest validation accuracy out of the population. The Baikal and BaikalCMA loss functions had validation accuracies at 2,000 steps equal to 0.9838 and 0.9902, respectively. For comparison, the cross-entropy loss had a validation accuracy at 2,000 steps of 0.9700. Models trained with the Baikal loss on MNIST and CIFAR-10 (to test transfer) are the primary vehicle for validating GLO's efficacy, as detailed in subsequent sections.
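The Baikal loss is straightforward to implement; the pure-Python sketch below averages $\log(y_i) - x_i/y_i$ over the classes and negates the result:

```python
import math

def baikal(x, y):
    """Baikal loss: -1/n * sum(log(y_i) - x_i / y_i), where x is the
    true one-hot distribution and y the predicted softmax output."""
    n = len(x)
    return -sum(math.log(yi) - xi / yi for xi, yi in zip(x, y)) / n
```

Unlike cross-entropy, Baikal penalizes an extremely confident binary prediction such as `[0.99, 0.01]` more heavily than a moderately confident `[0.71, 0.29]`, since the $-\log(y_i)$ terms of the non-true classes blow up as those probabilities approach zero; this hints at the implicit regularization discussed in Section 5.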
4.3 Testing accuracy
Figure 2 shows the increase in testing accuracy that Baikal and BaikalCMA provide on MNIST over models trained with the cross-entropy loss. Over the trained models for each loss function, the mean testing accuracies for the cross-entropy loss, Baikal, and BaikalCMA were 0.9899, 0.9933, and 0.9947, respectively.
This increase in accuracy from Baikal over the cross-entropy loss is statistically significant in a heteroscedastic, two-tailed t-test. With the same significance test, the increase in accuracy from BaikalCMA over Baikal was also found to be statistically significant.
4.4 Training speed
Training curves for networks trained with the cross-entropy loss, Baikal, and BaikalCMA are shown in Figure 3. Each curve represents 80 testing dataset evaluations spread evenly (i.e., every 250 steps) throughout 20,000 steps of training on MNIST. Networks trained with Baikal and BaikalCMA both learn significantly faster than with the cross-entropy loss. Interestingly, the Baikal and BaikalCMA training curves are both smoother than the cross-entropy loss curve, implying that their loss surfaces have fewer or less detrimental local minima. These phenomena make Baikal a compelling loss function for fixed time-budget training, where the improvement in resultant accuracy over the cross-entropy loss becomes most evident.
4.5 Training data requirements
Figure 4 provides an overview of the effects of dataset size on networks trained with the cross-entropy loss, Baikal, and BaikalCMA. For each training dataset portion size, five individual networks were trained for each loss function.
The degree by which Baikal and BaikalCMA outperform the cross-entropy loss increases as the training dataset becomes smaller. This provides evidence of less overfitting when training a network with Baikal or BaikalCMA. As expected, BaikalCMA outperforms Baikal at all tested dataset sizes. The size of this improvement in accuracy does not grow as significantly as the improvement over the cross-entropy loss, suggesting that the overfitting characteristics of Baikal and BaikalCMA are very similar. Conceivably, one could run the optimization phase of GLO on a reduced dataset specifically to yield a loss function with even better performance than BaikalCMA on small datasets.
4.6 Loss function transfer to CIFAR10
Figure 5 presents a collection of 18 separate tests of the cross-entropy loss and Baikal applied to CIFAR-10. Baikal is found to outperform cross-entropy across all training durations, with the difference becoming more prominent for shorter training periods. These results present an interesting use case for GLO, where a loss function that is found on a simpler dataset can be transferred to a more complex dataset while still maintaining performance improvements. This provides a particularly persuasive argument for using GLO loss functions in fixed time-budget scenarios.
5 Analysis
This section presents a symbolic analysis of the Baikal loss function, followed by experiments that attempt to elucidate why Baikal works better than the cross-entropy loss. A likely explanation is that Baikal results in implicit regularization.
5.1 Binary classification
Loss functions used on the MNIST dataset, being a 10-dimensional classification problem, are difficult to plot and visualize graphically. In this section, loss functions are therefore analyzed in the context of binary classification, where $n = 2$ and the Baikal loss expands to:
$$\mathcal{L} = -\frac{1}{2}\left(\log(y_0) - \frac{x_0}{y_0} + \log(y_1) - \frac{x_1}{y_1}\right) \quad (3)$$
Since the vectors $x$ and $y$ sum to 1, as a consequence of being passed through a softmax function, for binary classification $x_1 = 1 - x_0$ and $y_1 = 1 - y_0$. This constraint simplifies the binary Baikal loss to the following function of two variables ($x_0$ and $y_0$):
$$\mathcal{L} = -\frac{1}{2}\left(\log(y_0) - \frac{x_0}{y_0} + \log(1 - y_0) - \frac{1 - x_0}{1 - y_0}\right) \quad (4)$$
This same methodology can be applied to the crossentropy loss and BaikalCMA.
In practice, true labels are assumed to be correct with certainty; thus, $x_0$ is equal to either 0 or 1. The specific case where $x_0 = 1$ is plotted in Figure 6 for the cross-entropy loss, Baikal, and BaikalCMA. The cross-entropy loss is shown to be monotonically decreasing, while Baikal and BaikalCMA counterintuitively show an increase in the loss value as the predicted label, $y_0$, approaches the true label $x_0$. Section 5.2 provides reasoning for this unusual phenomenon.
As also seen in Figure 6, the minimum of the Baikal loss lies around $y_0 = 0.71$, while the minimum of the BaikalCMA loss lies around $y_0 = 0.77$. This, along with the more pronounced slope around the minimum, is likely a reason why BaikalCMA performs better than Baikal.
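The location of the Baikal minimum can be checked numerically; the sketch below evaluates the binary Baikal loss over a grid of predicted probabilities, assuming the plotted case $x_0 = 1$:

```python
import math

def binary_baikal(y0, x0=1.0):
    """Binary Baikal loss with x1 = 1 - x0 and y1 = 1 - y0 substituted."""
    y1, x1 = 1.0 - y0, 1.0 - x0
    return -0.5 * (math.log(y0) - x0 / y0 + math.log(y1) - x1 / y1)

# Grid search over predicted probabilities in (0, 1).
grid = [i / 1000 for i in range(10, 991)]
y_min = min(grid, key=binary_baikal)
```

The grid search locates the minimum near $y_0 = 0.71$, consistent with the value reported above; notably, pushing the prediction further toward the true label (e.g., $y_0 = 0.9$) increases the loss.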
5.2 Implicit regularization
The Baikal and BaikalCMA loss functions are unusual in that they are not monotonically decreasing (see the previous section for more details). At first glance, this behavior may seem undesirable; however, it may be an advantageous trait that implicitly provides a form of regularization (enabling better generalization). This is strongly supported by pereyra2017regularizing , where researchers built a confidence regularizer, on top of the cross-entropy loss, that penalizes low-entropy prediction distributions. The bimodal distribution of output probabilities that the researchers found on MNIST is nearly identical to that which can be found on a network trained with Baikal or BaikalCMA.
Histograms of the output probability distributions of networks trained with the cross-entropy loss and BaikalCMA on the test dataset, after 15,000 steps of training on MNIST, are shown in Figure 7. Note that the abscissae in Figures 6 and 7 correspond with each other; thus one can qualitatively see how the channel-shaped curves for BaikalCMA may contribute to the shift in histogram peaks. Furthermore, the improved behavior under small-dataset conditions described in Section 4.5 supports this theory of implicit regularization, since less overfitting was observed when using Baikal and BaikalCMA.
6 Discussion and Future Work
This paper proposes loss function discovery and optimization as a new form of metalearning, and introduces an evolutionary computation approach to it. GLO was evaluated experimentally in the image classification domain, and discovered a surprising new loss function, Baikal. Experiments showed substantial improvements in accuracy, convergence speed, and data requirements. Further analysis suggests that these improvements result from implicit regularization that reduces overfitting to the data.
In the future, GLO can be applied to other machine learning datasets and tasks. The approach is general, and could result in the discovery of customized loss functions for different domains, or even specific datasets. It will be interesting to find out how much such customization matters, and whether general principles that apply across domains and tasks can be determined from the results. One particularly interesting domain is generative adversarial networks (GANs). Significant manual tuning is necessary in GANs to ensure that the generator and discriminator networks learn harmoniously. GLO could find co-optimal loss functions for the generator and discriminator networks in tandem, thus making GANs more powerful, robust, and easier to implement.
GAN optimization is an example of coevolution, where multiple interacting solutions are developed simultaneously. GLO could leverage coevolution more generally: for instance, it could be combined with techniques like CoDeepNEAT miikkulainen2019evolving to learn jointly optimal network structures, hyperparameters, learning rate schedules, data augmentation, and loss functions simultaneously. Such approaches require significant computing power, but they may also discover and utilize interactions between the design elements that result in higher complexity and better performance than is currently possible.
7 Conclusion
This paper proposes Genetic Loss-function Optimization (GLO) as a general framework for discovering and optimizing loss functions for a given task. A surprising new loss function was discovered in the experiments, and shown to outperform the cross-entropy loss on MNIST and CIFAR-10 in terms of accuracy, training speed, and data requirements. This function, Baikal, likely achieves these benefits through an implicit regularization effect. GLO can be combined with other aspects of metalearning in the future, paving the way to robust and powerful AutoML.
References
 [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, Savannah, GA, 2016. USENIX Association.
 [2] J. T. Barron. A general and adaptive robust loss function. arXiv preprint arXiv:1701.03077, 2017.
 [3] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
 [4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [5] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
 [6] S. Gonzalez, J. Landgraf, and R. Miikkulainen. Faster training by selecting samples using embeddings. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
 [7] N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In International Conference on Parallel Problem Solving from Nature, pages 282–291. Springer, 2004.
 [8] N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
 [9] N. Hansen and A. Ostermeier. Completely derandomized selfadaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
 [10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 [11] R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [15] Y. LeCun, C. Cortes, and C. Burges. The MNIST dataset of handwritten digits, 1998.
 [16] J. Lehman et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.
 [17] I. Loshchilov and F. Hutter. CMAES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269, 2016.
 [18] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier, 2019.
 [19] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pages 807–814, 2010.
 [20] S. Niekum, A. G. Barto, and L. Spector. Genetic programming for reward function search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010.
 [21] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 [22] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 [23] M. Schmidt and H. Lipson. Distilling freeform natural laws from experimental data. Science, 324(5923):81–85, 2009.

 [24] L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
 [25] L. Spector, E. Goodman, A. Wu, W. B. Langdon, H. m. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M. Garzon, E. Burke, and M. Kaufmann Publishers. Autoconstructive evolution: Push, PushGP, and Pushpop. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001.
 [26] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen. Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24–35, 2019.
 [27] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 17(2–4):323–356, 2005.