Much of the power of modern neural networks originates from their complexity, i.e. number of parameters, hyperparameters, and topology. This complexity is beyond human ability to optimize, and automated methods are needed. An entire field of metalearning has emerged recently to address this issue, based on various methods such as gradient descent, simulated annealing, reinforcement learning, Bayesian optimization, and evolutionary computation (EC) elsken2018neural.
While a wide repertoire of work now exists for optimizing many aspects of neural networks, the dynamics of training are still usually set manually without concrete, scientific methods. Training schedules, loss functions, and learning rates all affect the training and final functionality of a neural network. Perhaps they could also be optimized through metalearning?
The goal of this paper is to verify this hypothesis, focusing on optimization of loss functions. A general framework for loss function metalearning, covering both novel loss function discovery and optimization, is developed and evaluated experimentally. This framework, Genetic Loss-function Optimization (GLO), leverages Genetic Programming to build loss functions represented as trees, and subsequently a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize their coefficients.
EC methods were chosen because EC is arguably the most versatile of the metalearning approaches. It is a population-based search method, allowing for extensive exploration, which often results in creative, novel solutions that are not obvious at first lehman2018surprising. EC has been successful in hyperparameter optimization and architecture design in particular miikkulainen2019evolving ; nature_neuroevolution ; real2018regularized ; loshchilov2016cma. It has also been used to discover mathematical formulas to explain experimental data schmidt2009distilling. It is, therefore, likely to find creative solutions in the loss-function optimization domain as well.
Indeed, in the MNIST image classification benchmark, GLO discovered a surprising new loss function, named Baikal for its shape. This function performs very well, presumably by establishing an implicit regularization effect. Baikal outperforms the standard cross-entropy loss in terms of training speed, final accuracy, and data requirements. Furthermore, Baikal was found to transfer to a more complicated classification task, CIFAR-10, while carrying over its benefits.
The next section reviews related work in metalearning and EC, to help motivate the need for GLO. Following this review, GLO is described in detail, along with the domains upon which it has been evaluated. The subsequent sections present the experimental results, including an analysis of the loss functions that GLO discovers.
2 Related Work
In addition to hyperparameter optimization and neural architecture search, new opportunities for metalearning have recently emerged. In particular, learning rate scheduling and adaptation can have a significant impact on a model’s performance. Learning rate schedules determine how the learning rate changes as training progresses. This functionality tends to be encapsulated away in practice by different gradient-descent optimizers, such as AdaGrad adagrad and Adam adam. While the general consensus has been that monotonically decreasing learning rates yield good results, new ideas, such as cyclical learning rates smith2017cyclical, have shown promise in learning better models in fewer epochs.
Metalearning methods have also been recently developed for data augmentation, such as AutoAugment cubuk2018autoaugment, a reinforcement-learning-based approach to find new data augmentation policies. In reinforcement learning tasks, EC has proven a successful approach. For instance, in evolving policy gradients houthooft2018evolved, the policy loss is not represented symbolically, but rather as a neural network that convolves over a temporal sequence of context vectors. In reward function search niekum2010genetic, the task is framed as a genetic programming problem, leveraging PushGP push.
In terms of loss functions, a generalization of the L2 loss was proposed with an adaptive loss parameter barron2017general. This loss function is shown to be effective in domains with multivariate output spaces, where robustness might vary between dimensions. Specifically, the authors found improvements in Variational Autoencoder (VAE) models, unsupervised monocular depth estimation, geometric registration, and clustering.
Notably, no existing work in the metalearning literature automatically optimizes loss functions for neural networks. As shown in this paper, evolutionary computation can be used in this role to improve neural network performance, gain a better understanding of the processes behind learning, and help reach the ultimate goal of fully automated learning.
3 The GLO Approach
The task of finding and optimizing loss functions can be framed as a functional regression problem. GLO accomplishes this through the following high-level steps (shown in Figure 1): (1) loss function discovery: using approaches from genetic programming, a genetic algorithm builds new candidate loss functions, and (2) coefficient optimization: to further optimize a specific loss function, a covariance-matrix adaptation evolutionary strategy (CMA-ES) is leveraged to optimize coefficients.
3.1 Loss function discovery
GLO uses a population-based search approach, inspired by genetic programming, to discover new optimized loss function candidates. Under this framework, loss functions are represented as trees within a genetic algorithm. Trees are a logical choice to represent functions due to their hierarchical nature. The loss function search space is defined by the following tree nodes:
- Unary Operators:
- Binary Operators:
- Leaf Nodes:
, where $x$ represents a true label, and $y$ represents a predicted label.
The search space is further refined by automatically assigning a fitness of 0 to trees that do not contain at least one true-label leaf and at least one predicted-label leaf. Generally, a loss function’s fitness within the genetic algorithm is the validation performance of a network trained with that loss function. To expedite the discovery process, and to encourage the invention of loss functions that enable faster learning, training does not proceed to convergence. Unstable training sessions that result in NaN values are assigned a fitness of 0. Fitness values are cached so that the same network never needs to be retrained.
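The tree representation and the fitness rules above can be sketched as follows. This is a minimal illustration, not GLO's implementation: the nested-tuple encoding, the operator names, and the `train_and_validate` callback are all assumptions made for the example.

```python
import math

# Hypothetical encoding: a tree is either a leaf ("x", "y", or an integer
# constant) or a tuple (operator_name, child, ...). The operator set is
# illustrative and is not claimed to match GLO's search space.
UNARY = {"log": math.log, "sqrt": math.sqrt, "square": lambda a: a * a}
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}


def evaluate(tree, x, y):
    """Evaluate a tree for one true label x and one predicted label y."""
    if tree == "x":
        return x
    if tree == "y":
        return y
    if isinstance(tree, (int, float)):
        return tree
    op, *children = tree
    args = [evaluate(child, x, y) for child in children]
    return UNARY[op](*args) if op in UNARY else BINARY[op](*args)


def leaves(tree):
    """Yield every leaf of the tree."""
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from leaves(child)
    else:
        yield tree


def uses_both_labels(tree):
    """True if the tree has at least one "x" leaf and one "y" leaf."""
    found = set(leaves(tree))
    return "x" in found and "y" in found


_fitness_cache = {}


def fitness(tree, train_and_validate):
    """Apply the screening, NaN, and caching rules described above.

    train_and_validate is a hypothetical callback that partially trains a
    network with the given loss tree and returns its validation accuracy.
    """
    if not uses_both_labels(tree):
        return 0.0
    if tree not in _fitness_cache:
        score = train_and_validate(tree)
        _fitness_cache[tree] = 0.0 if math.isnan(score) else score
    return _fitness_cache[tree]
```

Because nested tuples are hashable, a tree can serve directly as its own cache key, so a structurally identical candidate that reappears in a later generation costs no additional training.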
The initial population is composed of randomly generated trees with a maximum depth of 2. Recursively starting from the root, nodes are randomly chosen from the allowable operator and leaf nodes using a weighting (where are three-times as likely and is two-times as likely as ); this weighting can impart a bias and prevent, for example, the integer 1 from occurring too frequently. The genetic algorithm has a population size of 80, incorporates elitism with 6 elites per generation, and uses roulette sampling.
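Elitism combined with roulette sampling can be sketched as below. The function name and the split into "elites" and "parents" are assumptions for illustration; in particular, the sketch assumes that elites are carried over unchanged and that the remaining slots are filled from fitness-proportional draws, which the text does not spell out.

```python
import random


def select(population, fitnesses, n_elites=6, rng=random):
    """Return (elites, parents) for the next generation.

    Elites are the n_elites fittest individuals. Parents are drawn with
    replacement, each with probability proportional to its fitness
    (roulette sampling). Fitnesses are assumed to be non-negative.
    """
    # Rank by (fitness, index) so individuals themselves are never compared.
    ranked = sorted(zip(fitnesses, range(len(population))), reverse=True)
    elites = [population[i] for _, i in ranked[:n_elites]]
    n_parents = len(population) - n_elites
    if sum(fitnesses) > 0:
        parents = rng.choices(population, weights=fitnesses, k=n_parents)
    else:
        # Every candidate failed, so fall back to uniform sampling.
        parents = rng.choices(population, k=n_parents)
    return elites, parents
```

With a population of 80 and 6 elites, this yields 74 parents per generation; an individual with a fitness of 0 is never drawn unless the whole population has a fitness of 0.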
Recombination is accomplished by randomly splicing two trees together. For a given pair of parent trees, a random element is chosen in each as a crossover point. The two subtrees, whose roots are the two crossover points, are then swapped with each other. Figure 1 presents an example of this. Both resultant trees become part of the next generation. Recombination occurs with a probability of.
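The subtree swap can be sketched as follows, assuming (for illustration only) that trees are nested tuples of the form `(operator_name, child, ...)` with strings or integers as leaves. The helper names are hypothetical, and crossover points are drawn uniformly over all nodes.

```python
import random


def subtree_paths(tree, path=()):
    """Yield the path (a tuple of child indices) to every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    """Return the subtree rooted at the given path."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(a, b, rng=random):
    """Pick one crossover point per parent and swap the two subtrees."""
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    sub_a, sub_b = get_subtree(a, pa), get_subtree(b, pb)
    return replace_subtree(a, pa, sub_b), replace_subtree(b, pb, sub_a)
```

Since the two subtrees are exchanged rather than copied, the total number of nodes across both children always equals the total across both parents.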
To introduce variation into the population, the genetic algorithm has the following mutations, applied in a bottom-up fashion:
- Integer scalar nodes are incremented or decremented with a probability.
- Nodes are replaced with a weighted-random node with the same number of children with a probability.
- Nodes (and their children) are deleted and replaced with a weighted-random leaf node with a probability.
- Leaf nodes are deleted and replaced with a weighted-random element (and weighted-random leaf children if necessary) with a probability.
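The four mutations can be sketched as one bottom-up pass over a nested-tuple tree. Everything here is illustrative: the operator and leaf names are hypothetical, the four probabilities are left as parameters rather than set to GLO's values, random choices are uniform rather than weighted, and the order in which the mutations are tried at each node is an assumption.

```python
import random

UNARY = ["log", "sqrt", "square"]  # illustrative operator names
BINARY = ["add", "sub", "mul", "div"]
LEAVES = ["x", "y", 1, -1]


def mutate(tree, p_int, p_swap, p_prune, p_grow, rng=random):
    """Apply the four mutations bottom-up; each p_* is a per-node probability."""
    if isinstance(tree, tuple):
        # Bottom-up: children are mutated before their parent.
        op, *children = tree
        children = [mutate(c, p_int, p_swap, p_prune, p_grow, rng)
                    for c in children]
        if rng.random() < p_swap:
            # Replace the operator with one that has the same arity.
            op = rng.choice(UNARY if len(children) == 1 else BINARY)
        if rng.random() < p_prune:
            # Delete the node and its children, leaving a random leaf.
            return rng.choice(LEAVES)
        return (op, *children)
    if rng.random() < p_swap:
        # A leaf has zero children, so its same-arity replacement is a leaf.
        tree = rng.choice(LEAVES)
    if isinstance(tree, int) and rng.random() < p_int:
        # Increment or decrement an integer scalar.
        tree += rng.choice([-1, 1])
    if rng.random() < p_grow:
        # Replace the leaf with a random element, adding leaf children if needed.
        new = rng.choice(UNARY + BINARY + LEAVES)
        if new in UNARY:
            return (new, rng.choice(LEAVES))
        if new in BINARY:
            return (new, rng.choice(LEAVES), rng.choice(LEAVES))
        return new
    return tree
```

Setting all four probabilities to 0 returns the tree unchanged, which makes each mutation easy to exercise in isolation.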
3.2 Coefficient optimization
Loss functions found by the above genetic algorithm can all be thought of as having unit coefficients for each node in the tree. This set of coefficients can be represented as a vector with dimensionality equal to the number of nodes in a loss function’s tree. The coefficient vector is optimized independently and iteratively using a covariance-matrix adaptation evolutionary strategy (CMA-ES) hansen1996cmaes. The specific variant of CMA-ES that GLO uses is ($\mu/\mu, \lambda$)-CMA-ES hansen2001cmaesmumulambda, which incorporates weighted rank-$\mu$ updates hansen2004weightedrankmucmaes to reduce the number of objective function evaluations that are needed. The implementation of GLO presented in this paper uses an initial step size . As in the discovery phase, the objective function is the network’s performance on a validation dataset after a shortened training period.
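To make the coefficient search concrete, the sketch below runs a deliberately simplified evolution strategy over a coefficient vector that starts at all ones. It is not CMA-ES: it samples from an isotropic Gaussian with a fixed step size and keeps only the rank-weighted recombination of the best half, with no covariance or step-size adaptation. The function name, the default population size, and the toy objective in the usage note are assumptions.

```python
import math
import random


def simple_es(objective, dim, sigma, generations=100, lam=12, rng=random):
    """Maximize objective over R^dim, starting from the all-ones vector.

    Simplified stand-in for CMA-ES: isotropic sampling around the mean and
    rank-weighted recombination of the best lam // 2 offspring.
    """
    mu = lam // 2
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    weights = [w / sum(raw) for w in raw]  # better ranks get larger weights
    mean = [1.0] * dim  # unit coefficients, one per tree node
    for _ in range(generations):
        offspring = [[m + sigma * rng.gauss(0, 1) for m in mean]
                     for _ in range(lam)]
        # One objective evaluation per offspring. In GLO's setting each
        # evaluation would be a shortened training run.
        offspring.sort(key=objective, reverse=True)
        mean = [sum(w * child[d] for w, child in zip(weights, offspring))
                for d in range(dim)]
    return mean
```

For example, `simple_es(lambda c: -sum((ci - ti) ** 2 for ci, ti in zip(c, (2.0, 0.5, 1.0))), dim=3, sigma=0.1)` moves the coefficients from `[1, 1, 1]` toward the hypothetical optimum `(2.0, 0.5, 1.0)`. The adaptation that this sketch omits is what lets CMA-ES cope with poorly scaled or correlated coefficients.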
3.3 Implementation details
Due to the large number of partial training sessions that are needed for both the discovery and optimization phases, training is distributed across the network to a cluster of dedicated machines that use Condor condor for scheduling. Each machine in this cluster has one NVIDIA GeForce GTX Titan Black GPU and two Intel Xeon E5-2603 (4 core) CPUs running at 1.80GHz with 8GB of memory. Training itself is implemented with TensorFlow tensorflow in Python. The primary components of GLO (i.e., the genetic algorithm and CMA-ES) are implemented in Swift. These components run centrally on one machine and asynchronously dispatch work to the Condor cluster over SSH. Code for the Swift CMA-ES implementation is open sourced at: https://github.com/sgonzalez/SwiftCMA
4 Experimental Evaluation
This section provides an experimental evaluation of GLO, on the MNIST and CIFAR-10 image classification tasks. Baikal, a GLO loss function found on MNIST, is presented and evaluated in terms of its resulting testing accuracy, training speed, training data requirements, and transferability to CIFAR-10.
4.1 Target tasks
Experiments on GLO are performed using two popular image classification datasets, MNIST Handwritten Digits mnist and CIFAR-10 krizhevsky2009learning. Both datasets, MNIST in particular, are well understood and relatively quick to train. This allowed rapid iteration in the development of GLO and left time for more thorough experimentation. In the following two sections, the two datasets and the respective model architectures that were used are described. The model architectures are simple, since the focus of this paper is not state-of-the-art accuracy on MNIST and CIFAR-10, but rather the improvement brought about by using a GLO loss function.
Both of these tasks, being classification problems, are traditionally framed with the standard cross-entropy loss (sometimes referred to as the log loss): $\mathcal{L}_{\text{Log}} = -\frac{1}{n}\sum_{i=0}^{n-1} x_i \log(y_i)$, where $x$ is sampled from the true distribution, $y$ is from the predicted distribution, and $n$ is the number of classes. The cross-entropy loss is used as a baseline in this paper’s experiments.
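As a small worked example of the baseline, the sketch below computes the cross-entropy loss for a single sample, assuming the form $-\frac{1}{n}\sum_i x_i \log(y_i)$ with a one-hot true distribution. The function name and the sample values are illustrative.

```python
import math


def cross_entropy(x, y):
    """Return -(1/n) * sum_i x_i * log(y_i) for true x and predicted y."""
    n = len(x)
    return -sum(xi * math.log(yi) for xi, yi in zip(x, y)) / n


true_dist = [0.0, 1.0, 0.0]   # one-hot label for class 1
hesitant = [0.2, 0.7, 0.1]    # predicted distribution, softmax output
confident = [0.05, 0.9, 0.05]

loss_hesitant = cross_entropy(true_dist, hesitant)
loss_confident = cross_entropy(true_dist, confident)
```

With a one-hot label only the true class contributes, so the loss reduces to $-\log(y_{\text{true}})/n$ and falls monotonically as the predicted probability of the true class rises.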
The first target task used for evaluation was the MNIST Handwritten Digits dataset mnist, a widely used dataset where the goal is to classify $28 \times 28$ pixel images as one of ten digits. The MNIST dataset has 55,000 training samples, 5,000 validation samples, and 10,000 testing samples.
A simple CNN architecture with the following layers is used: (1) convolution with 32 filters, (2) stride-2 max-pooling, (3) convolution with 64 filters, (4) stride-2 max-pooling, (5) 1024-unit fully-connected layer, (6) a dropout layer hinton2012improving, and (7) a softmax layer. ReLU nair2010rectified activations are used. Training uses stochastic gradient descent (SGD) with a batch size of 100 and a learning rate of 0.01 and, unless otherwise specified, runs for 20,000 steps.
To further validate GLO, the more challenging CIFAR-10 dataset krizhevsky2009learning (a popular dataset of small, color photographs in ten classes) was used as a medium to test the transferability of loss functions found on a different domain. CIFAR-10 consists of 50,000 training samples, and 10,000 testing samples.
A simple CNN architecture, taken from gonzalez2019faster (and itself inspired by AlexNet NIPS2012_4824 ), with the following layers is used: (1) convolution with 64 filters and ReLU activations, (2) max-pooling with a stride of 2, (3) local response normalization NIPS2012_4824 with , (4) convolution with 64 filters and ReLU activations, (5) local response normalization with , (6) max-pooling with a stride of 2, (7) 384-unit fully-connected layer with ReLU activations, (8) 192-unit fully-connected, linear layer, and (9) a softmax layer.
Inputs to the network are sized $24 \times 24 \times 3$, rather than $32 \times 32 \times 3$ as provided in the dataset; this enables more sophisticated data augmentation. To force the network to better learn spatial invariance, random $24 \times 24$ croppings are selected from each full-size image, which are randomly flipped longitudinally, randomly lightened or darkened, and their contrast is randomly perturbed. Furthermore, to attain quicker convergence, an image’s mean pixel value and variance are subtracted and divided, respectively, from the whole image during training and evaluation. CIFAR-10 networks were trained with SGD, L2 regularization with a weight decay of 0.004, a batch size of 1024, and an initial learning rate of 0.05 that decays by a factor of 0.1 every 350 epochs.
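The learning rate schedule described above can be written as a short function. This sketch assumes a staircase schedule, where the rate drops in discrete steps at epochs 350, 700, and so on; the text does not say whether the decay is stepped or continuous, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.05, factor=0.1, every=350):
    """Staircase decay: multiply the rate by `factor` once per `every` epochs."""
    return initial * factor ** (epoch // every)


# Rates at the start of training and just before and after the first two drops.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 349, 350, 700)}
```

Under this assumption the rate stays at 0.05 for the first 350 epochs, then falls to 0.005, then to 0.0005.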
4.2 The Baikal loss function
The most notable loss function that GLO discovered against the MNIST dataset (with 2,000-step training for candidate evaluation) is the Baikal loss (named as such due to its similarity to the bathymetry of Lake Baikal when its binary variant is plotted in 3D, see Section 5.1):
$$\mathcal{L}_{\text{Baikal}} = -\frac{1}{n}\sum_{i=0}^{n-1}\left(\log(y_i) - \frac{x_i}{y_i}\right)$$
where $x$ is from the true distribution, $y$ is from the predicted distribution, and $n$ is the number of classes. Additionally, after coefficient optimization, GLO arrived at the following version of the Baikal loss:
This loss function, BaikalCMA, was selected for having the highest validation accuracy out of the population. The Baikal and BaikalCMA loss functions had validation accuracies at 2,000 steps equal to 0.9838 and 0.9902, respectively. For comparison, the cross-entropy loss had a validation accuracy at 2,000 steps of 0.9700. Models trained with the Baikal loss on MNIST and CIFAR-10 (to test transfer) are the primary vehicle for validating GLO’s efficacy, as detailed in subsequent sections.
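The following sketch evaluates the Baikal loss numerically for a binary problem, taking its form to be $-\frac{1}{n}\sum_i \left(\log(y_i) - x_i / y_i\right)$; that form, the function name, and the grid search are assumptions of this example. It is meant only to show how the loss behaves as the prediction for the true class becomes more confident.

```python
import math


def baikal(x, y):
    """Assumed Baikal form: -(1/n) * sum_i (log(y_i) - x_i / y_i)."""
    n = len(x)
    return -sum(math.log(yi) - xi / yi for xi, yi in zip(x, y)) / n


def binary_baikal(p):
    """Baikal loss when the true class is certain and predicted with probability p."""
    return baikal([1.0, 0.0], [p, 1.0 - p])


# Scan predicted probabilities strictly between 0 and 1 for the minimizer.
grid = [i / 1000 for i in range(1, 1000)]
best_p = min(grid, key=binary_baikal)
```

Under this assumed form, the loss is lowest when the true class is predicted with a probability of roughly 0.7 and rises again as the prediction approaches 1, unlike the cross-entropy loss.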
4.3 Testing accuracy
Figure 2 shows the increase in testing accuracy that Baikal and BaikalCMA provide on MNIST over models trained with the cross-entropy loss. Over trained models each, the mean testing accuracies for cross-entropy loss, Baikal, and BaikalCMA were 0.9899, 0.9933, and 0.9947, respectively.
This increase in accuracy from Baikal over cross-entropy loss is found to be statistically significant, with a $p$-value ofsamples from each distribution. With the same significance test, the increase in accuracy from BaikalCMA over Baikal was found to be statistically significant, with a $p$-value of .
4.4 Training speed
Training curves for networks trained with the cross-entropy loss, Baikal, and BaikalCMA are shown in Figure 3. Each curve represents 80 testing dataset evaluations spread evenly (i.e., every 250 steps) throughout 20,000 steps of training on MNIST. Networks trained with Baikal and BaikalCMA both learn significantly faster than the cross-entropy loss. Interestingly, the Baikal and BaikalCMA training curves are both smoother than the cross-entropy loss curve, implying that their loss surfaces have fewer or less detrimental local minima. These phenomena make Baikal a compelling loss function for fixed time-budget training, where the improvement in resultant accuracy over the cross-entropy loss becomes most evident.
4.5 Training data requirements
Figure 4 provides an overview of the effects of dataset size on networks trained with cross-entropy loss, Baikal, and BaikalCMA. For each training dataset portion size, five individual networks were trained for each loss function.
The degree by which Baikal and BaikalCMA outperform cross-entropy loss increases as the training dataset becomes smaller. This provides evidence of less overfitting when training a network with Baikal or BaikalCMA. As expected, BaikalCMA outperforms Baikal at all tested dataset sizes. The size of this improvement in accuracy does not grow as significantly as the improvement over cross-entropy loss, suggesting that the overfitting characteristics of Baikal and BaikalCMA are very similar. Presumably, one could run the optimization phase of GLO on a reduced dataset specifically to yield a loss function with better performance than BaikalCMA on small datasets.
4.6 Loss function transfer to CIFAR-10
Figure 5 presents a collection of 18 separate tests of the cross-entropy loss and Baikal applied to CIFAR-10. Baikal is found to outperform cross-entropy across all training durations, with the difference becoming more prominent for shorter training periods. These results present an interesting use case for GLO, where a loss function that is found on a simpler dataset can be transferred to a more complex dataset while still maintaining performance improvements. This provides a particularly persuasive argument for using GLO loss functions in fixed time-budget scenarios.
This section presents a symbolic analysis of the Baikal loss function, followed by experiments that attempt to elucidate why Baikal works better than the cross-entropy loss. A likely explanation is that Baikal results in implicit regularization.
5.1 Binary classification
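Loss functions used on the MNIST dataset, being a 10-dimensional classification problem, are difficult to plot and visualize graphically. In this section, loss functions are analyzed in the context of binary classification; where $n = 2$, the Baikal loss expands to:
$$\mathcal{L}_{\text{Baikal2D}} = -\frac{1}{2}\left(\log(y_0) - \frac{x_0}{y_0} + \log(y_1) - \frac{x_1}{y_1}\right)$$
Since vectors $x$ and $y$ sum to $1$, by consequence of being passed through a softmax function, for binary classification $x_1 = 1 - x_0$ and $y_1 = 1 - y_0$. This constraint simplifies the binary Baikal loss to the following function of two variables ($x_0$ and $y_0$):
$$\mathcal{L}_{\text{Baikal2D}} = -\frac{1}{2}\left(\log(y_0) - \frac{x_0}{y_0} + \log(1 - y_0) - \frac{1 - x_0}{1 - y_0}\right)$$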
This same methodology can be applied to the cross-entropy loss and BaikalCMA.
In practice, true labels are assumed to be correct with certainty; thus, $x_0$ is equal to either $0$ or $1$. The specific case where $x_0 = 1$ is plotted in Figure 6 for the cross-entropy loss, Baikal, and BaikalCMA. The cross-entropy loss is shown to be monotonically decreasing, while Baikal and BaikalCMA counterintuitively show an increase in the loss value as the predicted label, $y_0$, approaches the true label $x_0$. Section 5.2 provides reasoning for this unusual phenomenon.
As also seen in Figure 6, the minimum for the Baikal loss where $x_0 = 1$ lies around 0.71, while the minimum for the BaikalCMA loss where $x_0 = 1$ lies around 0.77. This, along with the more pronounced slope around is likely a reason why BaikalCMA performs better than Baikal.
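The location of the Baikal minimum can be checked analytically. The derivation below is a sketch that assumes the binary Baikal loss, with the true label fixed at 1, reduces to the first line; $p$ denotes the predicted probability of the true class and is notation introduced here.

```latex
\begin{align*}
\mathcal{L}(p) &= -\frac{1}{2}\left(\log p - \frac{1}{p} + \log(1 - p)\right),
  \qquad 0 < p < 1 \\
\frac{d\mathcal{L}}{dp} &= -\frac{1}{2}\left(\frac{1}{p} + \frac{1}{p^{2}} - \frac{1}{1 - p}\right) = 0
  \;\Longrightarrow\; \frac{p + 1}{p^{2}} = \frac{1}{1 - p} \\
(p + 1)(1 - p) &= p^{2}
  \;\Longrightarrow\; 1 - p^{2} = p^{2}
  \;\Longrightarrow\; p = \frac{1}{\sqrt{2}} \approx 0.707
\end{align*}
```

Because $\mathcal{L}(p)$ grows without bound as $p \to 0$ and as $p \to 1$, this single critical point is the minimum, which agrees with the value of about 0.71 read from the plot.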
5.2 Implicit regularization
The Baikal and BaikalCMA loss functions are unusual in that they are not monotonically decreasing (see the previous section for more details). At first glance, this behavior may seem undesirable; however, it may be an advantageous trait that implicitly provides a form of regularization (enabling better generalization). This interpretation is supported by pereyra2017regularizing, where researchers built a confidence regularizer, on top of cross-entropy loss, that penalizes low-entropy prediction distributions. The bimodal distribution of output probabilities that the researchers found on MNIST is nearly identical to that which can be found on a network trained with Baikal or BaikalCMA.
Histograms of the output probability distributions of networks trained with the cross-entropy loss and BaikalCMA on the test dataset, after 15,000 steps of training on MNIST, are shown in Figure 7. Note that the abscissae in Figures 6 and 7 correspond with each other; thus, one can qualitatively see how the channel-shaped curves for BaikalCMA may contribute to the shift in histogram peaks.
Furthermore, the improved behavior under small-dataset conditions described in Section 4.5 backs this theory of implicit regularization, since less overfitting was observed when using Baikal and BaikalCMA.
6 Discussion and Future Work
This paper proposes loss function discovery and optimization as a new form of metalearning, and introduces an evolutionary computation approach to it. GLO was evaluated experimentally in the image classification domain, and discovered a surprising new loss function, Baikal. Experiments showed substantial improvements in accuracy, convergence speed, and data requirements. Further analysis suggests that these improvements result from implicit regularization that reduces overfitting to the data.
In the future, GLO can be applied to other machine learning datasets and tasks. The approach is general, and could result in discovery of customized loss functions for different domains, or even specific datasets. It will be interesting to find out how much such customization matters, and whether general principles that apply across domains and tasks can be determined from the results. One particularly interesting domain is generative adversarial networks (GANs). Significant manual tuning is necessary in GANs to ensure that the generator and discriminator networks learn harmoniously. GLO could find co-optimal loss functions for the generator and discriminator networks in tandem, thus making GANs more powerful, robust, and easier to implement.
GAN optimization is an example of co-evolution, where multiple interacting solutions are developed simultaneously. GLO could leverage co-evolution more generally: for instance, it could be combined with techniques like CoDeepNEAT miikkulainen2019evolving to learn jointly-optimal network structures, hyperparameters, learning rate schedules, data augmentation, and loss functions simultaneously. Such approaches require significant computing power, but they may also discover and utilize interactions between the design elements that result in higher complexity and better performance than is currently possible.
7 Conclusion
This paper proposes Genetic Loss-function Optimization (GLO) as a general framework for discovering and optimizing loss functions for a given task. A surprising new loss function was discovered in the experiments, and shown to outperform the cross-entropy loss on MNIST and CIFAR-10 in terms of accuracy, training speed, and data requirements. This function, Baikal, likely achieves these benefits through an implicit regularization effect. GLO can be combined with other aspects of metalearning in the future, paving the way to robust and powerful AutoML.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, Savannah, GA, 2016. USENIX Association.
-  J. T. Barron. A general and adaptive robust loss function. arXiv preprint arXiv:1701.03077, 2017.
-  E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
-  J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
-  T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
-  S. Gonzalez, J. Landgraf, and R. Miikkulainen. Faster training by selecting samples using embeddings. In 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
-  N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In International Conference on Parallel Problem Solving from Nature, pages 282–291. Springer, 2004.
-  N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
-  N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
-  R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
-  Y. LeCun, C. Cortes, and C. Burges. The MNIST dataset of handwritten digits, 1998.
-  J. Lehman et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.
-  I. Loshchilov and F. Hutter. CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269, 2016.
-  R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier, 2019.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
-  S. Niekum, A. G. Barto, and L. Spector. Genetic programming for reward function search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010.
-  G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
-  M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
-  L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
-  L. Spector. Autoconstructive evolution: Push, PushGP, and Pushpop. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001.
-  K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen. Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24–35, 2019.
-  D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the condor experience. Concurrency and computation: practice and experience, 17(2-4):323–356, 2005.