Improved Training Speed, Accuracy, and Data Utilization Through Loss Function Optimization

05/27/2019 ∙ by Santiago Gonzalez, et al. ∙ The University of Texas at Austin

As the complexity of neural network models has grown, it has become increasingly important to optimize their design automatically through metalearning. Methods for discovering hyperparameters, topologies, and learning rate schedules have led to significant increases in performance. This paper shows that loss functions can be optimized with metalearning as well, and result in similar improvements. The method, Genetic Loss-function Optimization (GLO), discovers loss functions de novo, and optimizes them for a target task. Leveraging techniques from genetic programming, GLO builds loss functions hierarchically from a set of operators and leaf nodes. These functions are repeatedly recombined and mutated to find an optimal structure, and then a covariance-matrix adaptation evolutionary strategy (CMA-ES) is used to find optimal coefficients. Networks trained with GLO loss functions are found to outperform the standard cross-entropy loss on standard image classification tasks. Training with these new loss functions requires fewer steps, results in lower test error, and allows for smaller datasets to be used. Loss-function optimization thus provides a new dimension of metalearning, and constitutes an important step towards AutoML.




1 Introduction

Much of the power of modern neural networks originates from their complexity, i.e., number of parameters, hyperparameters, and topology. This complexity is beyond human ability to optimize, and automated methods are needed. An entire field of metalearning has emerged recently to address this issue, based on various methods such as gradient descent, simulated annealing, reinforcement learning, Bayesian optimization, and evolutionary computation (EC) elsken2018neural .

While a wide repertoire of work now exists for optimizing many aspects of neural networks, the dynamics of training are still usually set manually without concrete, scientific methods. Training schedules, loss functions, and learning rates all affect the training and final functionality of a neural network. Perhaps they could also be optimized through metalearning?

The goal of this paper is to verify this hypothesis, focusing on optimization of loss functions. A general framework for loss function metalearning, covering both novel loss function discovery and optimization, is developed and evaluated experimentally. This framework, Genetic Loss-function Optimization (GLO), leverages Genetic Programming to build loss functions represented as trees, and subsequently a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize their coefficients.

EC methods were chosen because EC is arguably the most versatile of the metalearning approaches. It is a population-based search method, allowing for extensive exploration, which often results in creative, novel solutions that are not obvious at first lehman2018surprising . EC has been successful in hyperparameter optimization and architecture design in particular miikkulainen2019evolving ; nature_neuroevolution ; real2018regularized ; loshchilov2016cma . It has also been used to discover mathematical formulas to explain experimental data schmidt2009distilling . It is, therefore, likely to find creative solutions in the loss-function optimization domain as well.

Indeed, in the MNIST image classification benchmark, GLO discovered a surprising new loss function, named Baikal for its shape. This function performs very well, presumably by establishing an implicit regularization effect. Baikal outperforms the standard cross-entropy loss in terms of training speed, final accuracy, and data requirements. Furthermore, Baikal was found to transfer to a more complicated classification task, CIFAR-10, while carrying over its benefits.

The next section reviews related work in metalearning and EC, to help motivate the need for GLO. Following this review, GLO is described in detail, along with the domains upon which it has been evaluated. The subsequent sections present the experimental results, including an analysis of the loss functions that GLO discovers.

2 Related Work

In addition to hyperparameter optimization and neural architecture search, new opportunities for metalearning have recently emerged. In particular, learning rate scheduling and adaptation can have a significant impact on a model’s performance. Learning rate schedules determine how the learning rate changes as training progresses. This functionality tends to be encapsulated away in practice by different gradient-descent optimizers, such as AdaGrad adagrad and Adam adam . While the general consensus has been that monotonically decreasing learning rates yield good results, new ideas, such as cyclical learning rates smith2017cyclical , have shown promise in learning better models in fewer epochs.

Metalearning methods have also been recently developed for data augmentation, such as AutoAugment cubuk2018autoaugment , a reinforcement learning based approach to find new data augmentation policies. In reinforcement learning tasks, EC has proven a successful approach. For instance, in evolving policy gradients houthooft2018evolved , the policy loss is not represented symbolically, but rather as a neural network that convolves over a temporal sequence of context vectors. In reward function search niekum2010genetic , the task is framed as a genetic programming problem, leveraging PushGP push .

In terms of loss functions, a generalization of the L2 loss was proposed with an adaptive loss parameter barron2017general . This loss function is shown to be effective in domains with multivariate output spaces, where robustness might vary across dimensions. Specifically, the authors found improvements in Variational Autoencoder (VAE) models, unsupervised monocular depth estimation, geometric registration, and clustering.

Notably, no existing work in the metalearning literature automatically optimizes loss functions for neural networks. As shown in this paper, evolutionary computation can be used in this role to improve neural network performance, gain a better understanding of the processes behind learning, and help reach the ultimate goal of fully automated learning.

3 The GLO Approach

Figure 1:

Genetic Loss Optimization (GLO) overview. A genetic algorithm constructs candidate loss functions as trees. The best loss function from this set then has its coefficients optimized using CMA-ES. GLO loss functions are able to train models more quickly and more accurately.

The task of finding and optimizing loss functions can be framed as a functional regression problem. GLO accomplishes this through the following high-level steps (shown in Figure 1): (1) loss function discovery: using approaches from genetic programming, a genetic algorithm builds new candidate loss functions, and (2) coefficient optimization: to further optimize a specific loss function, a covariance-matrix adaptation evolutionary strategy (CMA-ES) is leveraged to optimize coefficients.

3.1 Loss function discovery

GLO uses a population-based search approach, inspired by genetic programming, to discover new optimized loss function candidates. Under this framework, loss functions are represented as trees within a genetic algorithm. Trees are a logical choice to represent functions due to their hierarchical nature. The loss function search space is defined by the following tree nodes:


Unary Operators:

Binary Operators:

Leaf Nodes: $x$, $y$, and integer scalars, where $x$ represents a true label, and $y$ represents a predicted label.

The search space is further refined by automatically assigning a fitness of 0 to trees that do not contain at least one $x$ and one $y$. Generally, a loss function’s fitness within the genetic algorithm is the validation performance of a network trained with that loss function. To expedite the discovery process, and encourage the invention of loss functions that enable faster learning, training does not proceed to convergence. Unstable training sessions that result in NaN values are assigned a fitness of 0. Fitness values are cached to avoid needing to retrain the same network twice.
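The tree representation and the fitness rules above can be sketched in a few lines. This is a minimal illustration, not GLO's implementation: trees are nested tuples, the operator tables and their names are hypothetical stand-ins for the search space, `x` and `y` denote the true and predicted label, and `train_and_validate` is an assumed callback that returns a validation score after a shortened training run.

```python
import math

# Hypothetical operator tables; the names are illustrative, not GLO's own.
UNARY = {"log": math.log, "sqrt": math.sqrt, "square": lambda a: a * a}
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def evaluate(tree, x, y):
    """Evaluate a tree for one true-label value x and one predicted value y."""
    if tree == "x":
        return x
    if tree == "y":
        return y
    if isinstance(tree, int):
        return tree  # integer scalar leaf
    op, *children = tree
    args = [evaluate(child, x, y) for child in children]
    return UNARY[op](*args) if op in UNARY else BINARY[op](*args)

def leaves(tree):
    """Yield every leaf of a nested-tuple tree."""
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from leaves(child)
    else:
        yield tree

def is_valid(tree):
    """A candidate must reference both the true and the predicted label."""
    found = set(leaves(tree))
    return "x" in found and "y" in found

def fitness(tree, train_and_validate, cache):
    """Fitness with the rules from the text: 0 for invalid trees, 0 for NaN, cached."""
    if not is_valid(tree):
        return 0.0
    if tree not in cache:
        score = train_and_validate(tree)  # assumed callback, partial training
        cache[tree] = 0.0 if math.isnan(score) else score
    return cache[tree]
```

Nested tuples are hashable, so a plain dictionary is enough for the fitness cache in this sketch.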

The initial population is composed of randomly generated trees with a maximum depth of 2. Recursively starting from the root, nodes are randomly chosen from the allowable operator and leaf nodes using a weighting (where are three times as likely and is two times as likely as ); this can impart a bias and prevent, for example, the integer 1 from occurring too frequently. The genetic algorithm has a population size of 80, incorporates elitism with 6 elites per generation, and uses roulette-sampling.
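Elitism combined with roulette sampling might look like the sketch below. It assumes that the elites are carried over unchanged and that the remaining slots are filled by fitness-proportional sampling; the function name and the uniform fallback when every fitness is zero are illustrative choices, not details taken from GLO.

```python
import random

def select_parents(population, fitnesses, n_elites=6, rng=random):
    """Return (elites, sampled): the top individuals plus roulette-sampled ones."""
    ranked = sorted(zip(fitnesses, range(len(population))), reverse=True)
    elites = [population[i] for _, i in ranked[:n_elites]]
    n_rest = len(population) - n_elites
    if sum(fitnesses) > 0:
        # Roulette sampling: selection probability proportional to fitness.
        sampled = rng.choices(population, weights=fitnesses, k=n_rest)
    else:
        # Assumed fallback: sample uniformly if no individual has positive fitness.
        sampled = rng.choices(population, k=n_rest)
    return elites, sampled
```

Because fitness 0 is assigned to invalid or unstable candidates, those individuals are never drawn by the roulette step as long as at least one candidate has positive fitness.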

Recombination is accomplished by randomly splicing two trees together. For a given pair of parent trees, a random element is chosen in each as a crossover point. The two subtrees, whose roots are the two crossover points, are then swapped with each other. Figure 1 presents an example of this. Both resultant trees become part of the next generation. Recombination occurs with a probability of
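The subtree swap described above can be sketched as follows, again with nested tuples standing in for trees. The helper names (`paths`, `get`, `put`, `crossover`) are hypothetical, and the crossover points are drawn uniformly over all nodes, which is an assumption.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(a, b, rng=random):
    """Pick one crossover point per parent and swap the two subtrees."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))
```

Since the two subtrees are exchanged rather than copied, the total number of nodes across both children equals the total across both parents.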


To introduce variation into the population, the genetic algorithm has the following mutations, applied in a bottom-up fashion:


  • Integer scalar nodes are incremented or decremented with a probability.

  • Nodes are replaced with a weighted-random node with the same number of children with a probability.

  • Nodes (and their children) are deleted and replaced with a weighted-random leaf node with a probability.

  • Leaf nodes are deleted and replaced with a weighted-random element (and weighted-random leaf children if necessary) with a probability.
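Two of the mutations listed above, the integer increment/decrement and the replacement of a subtree by a leaf, can be sketched as a bottom-up traversal. The probabilities are left as caller-supplied arguments (`p_int`, `p_prune`) rather than fixed values, the leaf set is a hypothetical placeholder, and the leaf is chosen uniformly here instead of with the weighting used by GLO.

```python
import random

LEAVES = ["x", "y", 1]  # hypothetical leaf set for illustration

def mutate(tree, p_int, p_prune, rng=random):
    """Apply two of the listed mutations bottom-up with the given probabilities."""
    if isinstance(tree, tuple):
        # Children are mutated first, then the node itself is considered.
        tree = (tree[0],) + tuple(mutate(c, p_int, p_prune, rng) for c in tree[1:])
        if rng.random() < p_prune:
            return rng.choice(LEAVES)  # delete node and children, insert a leaf
        return tree
    if isinstance(tree, int) and rng.random() < p_int:
        return tree + rng.choice((-1, 1))  # increment or decrement an integer leaf
    return tree
```

The remaining two mutations (same-arity node replacement and leaf expansion) follow the same pattern and are omitted to keep the sketch short.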

3.2 Coefficient optimization

Loss functions found by the above genetic algorithm can all be thought of as having unit coefficients for each node in the tree. This set of coefficients can be represented as a vector with dimensionality equal to the number of nodes in a loss function’s tree. The coefficient vector is optimized independently and iteratively using a covariance-matrix adaptation evolutionary strategy (CMA-ES) hansen1996cmaes . The specific variant of CMA-ES that GLO uses is $(\mu/\mu, \lambda)$-CMA-ES hansen2001cmaesmumulambda , and incorporates weighted rank-$\mu$ updates hansen2004weightedrankmucmaes to reduce the number of objective function evaluations that are needed. The implementation of GLO presented in this paper uses an initial step size . As in the discovery phase, the objective function is the network’s performance on a validation dataset after a shortened training period.
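To make the coefficient vector concrete, the sketch below attaches one multiplicative coefficient to each node (in pre-order) and searches over that vector with a deliberately simplified evolution strategy. Two assumptions should be stressed: multiplying each node's output by its coefficient is one plausible reading of the text, and the optimizer shown uses weighted recombination with a fixed isotropic step size, so it is not CMA-ES (there is no covariance or step-size adaptation). The toy objective in the usage note stands in for validation accuracy.

```python
import math
import random

OPS = {"log": math.log, "sub": lambda a, b: a - b, "div": lambda a, b: a / b}

def count_nodes(tree):
    """Number of nodes, which is also the coefficient vector's dimensionality."""
    if isinstance(tree, tuple):
        return 1 + sum(count_nodes(c) for c in tree[1:])
    return 1

def evaluate(tree, coeffs, x, y):
    """Evaluate the tree, scaling every node's value by its pre-order coefficient."""
    it = iter(coeffs)
    def go(node):
        c = next(it)
        if node == "x":
            value = x
        elif node == "y":
            value = y
        elif isinstance(node, tuple):
            value = OPS[node[0]](*[go(child) for child in node[1:]])
        else:
            value = node
        return c * value
    return go(tree)

def weighted_es(objective, start, sigma, lam=12, generations=60, rng=random):
    """Simplified ES: sample lam points, recombine the best mu with log weights."""
    mu = lam // 2
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    weights = [w / sum(raw) for w in raw]
    mean = list(start)
    for _ in range(generations):
        pop = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(lam)]
        pop.sort(key=objective, reverse=True)  # maximize the objective
        mean = [sum(w * ind[d] for w, ind in zip(weights, pop))
                for d in range(len(mean))]
    return mean
```

For example, `weighted_es(lambda c: -sum((ci - 2.0) ** 2 for ci in c), [1.0] * 3, sigma=0.3)` starts from unit coefficients, as the text describes, and moves the mean toward the toy optimum. A full CMA-ES would additionally adapt the sampling covariance and the step size between generations.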

3.3 Implementation details

Due to the large number of partial training sessions that are needed for both the discovery and optimization phases, training is distributed across the network to a cluster of dedicated machines that use Condor condor for scheduling. Each machine in this cluster has one NVIDIA GeForce GTX Titan Black GPU and two Intel Xeon E5-2603 (4 core) CPUs running at 1.80GHz with 8GB of memory. Training itself is implemented with TensorFlow tensorflow in Python. The primary components of GLO (i.e., the genetic algorithm and CMA-ES) are implemented in Swift. These components run centrally on one machine and asynchronously dispatch work to the Condor cluster over SSH. Code for the Swift CMA-ES implementation is open sourced at:

4 Experimental Evaluation

This section provides an experimental evaluation of GLO, on the MNIST and CIFAR-10 image classification tasks. Baikal, a GLO loss function found on MNIST, is presented and evaluated in terms of its resulting testing accuracy, training speed, training data requirements, and transferability to CIFAR-10.

4.1 Target tasks

Experiments on GLO are performed using two popular image classification datasets, MNIST Handwritten Digits mnist and CIFAR-10 krizhevsky2009learning . Both datasets, with MNIST in particular, are well understood, and relatively quick to train. This allowed rapid iteration in the development of GLO and allowed time for more thorough experimentation. In the following two sections, the two datasets, and the respective model architectures that were used are described. The model architectures are simple, since achieving state-of-the-art accuracy on MNIST and CIFAR-10 is not the focus of this paper; rather, the focus is on the improvements brought about by using a GLO loss function.

Both of these tasks, being classification problems, are traditionally framed with the standard cross-entropy loss (sometimes referred to as the log loss): $\mathcal{L}_{\mathrm{Log}} = -\frac{1}{n}\sum_{i=1}^{n} x_i \log(y_i)$, where $x$ is sampled from the true distribution, $y$ is from the predicted distribution, and $n$ is the number of classes. The cross-entropy loss is used as a baseline in this paper’s experiments.
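As a concrete reference point, the baseline can be written as a short function of a true distribution `x` and a predicted distribution `y`. The sketch assumes the class-averaged form of the cross-entropy (a factor of 1/n in front of the sum), which only rescales the loss relative to the unnormalized version.

```python
import math

def cross_entropy(x, y):
    """Class-averaged cross-entropy between true x and predicted y."""
    n = len(x)
    return -sum(xi * math.log(yi) for xi, yi in zip(x, y)) / n
```

With a one-hot true label, only the predicted probability of the correct class contributes, so the loss decreases monotonically as that probability approaches 1.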

4.1.1 MNIST

The first target task used for evaluation was the MNIST Handwritten Digits dataset mnist , a widely used dataset where the goal is to classify $28 \times 28$ pixel images as one of ten digits. The MNIST dataset has 55,000 training samples, 5,000 validation samples, and 10,000 testing samples.

A simple CNN architecture with the following layers is used: (1) convolution with 32 filters, (2) stride-2 max-pooling, (3) convolution with 64 filters, (4) stride-2 max-pooling, (5) 1024-unit fully-connected layer, (6) a dropout layer hinton2012improving with 40% dropout probability, and (7) a softmax layer. ReLU activations are used. Training uses stochastic gradient descent (SGD) with a batch size of 100 and a learning rate of 0.01, and, unless otherwise specified, runs for 20,000 steps.

4.1.2 CIFAR-10

To further validate GLO, the more challenging CIFAR-10 dataset krizhevsky2009learning (a popular dataset of small, color photographs in ten classes) was used as a medium to test the transferability of loss functions found on a different domain. CIFAR-10 consists of 50,000 training samples, and 10,000 testing samples.

A simple CNN architecture, taken from gonzalez2019faster (and itself inspired by AlexNet NIPS2012_4824 ), with the following layers is used:  (1) convolution with 64 filters and ReLU activations,  (2) max-pooling with a stride of 2,  (3) local response normalization NIPS2012_4824 with ,  (4) convolution with 64 filters and ReLU activations,  (5) local response normalization with ,  (6) max-pooling with a stride of 2,  (7) 384-unit fully-connected layer with ReLU activations,  (8) 192-unit fully-connected, linear layer,  and (9) a softmax layer.

Inputs to the network are sized , rather than as provided in the dataset; this enables more sophisticated data augmentation. To force the network to better learn spatial invariance, random croppings are selected from each full-size image, which are then randomly flipped longitudinally, randomly lightened or darkened, and randomly perturbed in contrast. Furthermore, to attain quicker convergence, an image’s mean pixel value and variance are subtracted and divided, respectively, from the whole image during training and evaluation. CIFAR-10 networks were trained with SGD, $L_2$ regularization with a weight decay of 0.004, a batch size of 1024, and an initial learning rate of 0.05 that decays by a factor of 0.1 every 350 epochs.
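The learning rate schedule described above can be written as a one-line function. The sketch assumes a staircase decay, meaning the rate drops in discrete steps at epoch boundaries rather than continuously; the text does not say which of the two is used.

```python
def learning_rate(epoch, initial=0.05, factor=0.1, every=350):
    """Staircase decay: multiply by `factor` once per `every` completed epochs."""
    return initial * factor ** (epoch // every)
```

Under this assumption the rate is 0.05 for epochs 0 through 349, 0.005 for epochs 350 through 699, and so on.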

4.2 The Baikal loss function

The most notable loss function that GLO discovered against the MNIST dataset (with 2,000-step training for candidate evaluation) is the Baikal loss (named as such due to its similarity to the bathymetry of Lake Baikal when its binary variant is plotted in 3D, see Section 5.1):

$$\mathcal{L}_{\mathrm{Baikal}} = -\frac{1}{n}\sum_{i=1}^{n}\left(\log(y_i) - \frac{x_i}{y_i}\right),$$

where $x$ is from the true distribution, $y$ is from the predicted distribution, and $n$ is the number of classes. Additionally, after coefficient optimization, GLO arrived at the following version of the Baikal loss:


where .
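For readers who want to experiment, the uncoefficiented Baikal loss can be sketched as below. The sketch assumes the form $-\frac{1}{n}\sum_i \left(\log(y_i) - x_i / y_i\right)$, with `x` the true distribution and `y` the predicted one; this form is consistent with the minimum near 0.71 reported for the binary case in Section 5.1. The optimized BaikalCMA coefficients are not reproduced here.

```python
import math

def baikal(x, y):
    """Baikal loss under the assumed form -(1/n) * sum(log(y_i) - x_i / y_i)."""
    n = len(x)
    return -sum(math.log(yi) - xi / yi for xi, yi in zip(x, y)) / n
```

Unlike the cross-entropy loss, the `x_i / y_i` term and the `log(y_i)` terms for the incorrect classes both grow as the prediction becomes extremely confident, so a network is discouraged from driving its outputs all the way to 0 or 1.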

This loss function, BaikalCMA, was selected for having the highest validation accuracy out of the population. The Baikal and BaikalCMA loss functions had validation accuracies at 2,000 steps equal to 0.9838 and 0.9902, respectively. For comparison, the cross-entropy loss had a validation accuracy at 2,000 steps of 0.9700. Models trained with the Baikal loss on MNIST and CIFAR-10 (to test transfer) are the primary vehicle for validating GLO’s efficacy, as detailed in subsequent sections.

4.3 Testing accuracy

Figure 2: Mean testing accuracy on MNIST, . Both Baikal and BaikalCMA provide statistically significant improvements to testing accuracy over the cross-entropy loss.

Figure 2 shows the increase in testing accuracy that Baikal and BaikalCMA provide on MNIST over models trained with the cross-entropy loss. Over trained models each, the mean testing accuracies for cross-entropy loss, Baikal, and BaikalCMA were 0.9899, 0.9933, and 0.9947, respectively.

This increase in accuracy from Baikal over cross-entropy loss is found to be statistically significant, with a $p$-value of , in a heteroscedastic, two-tailed t-test, with samples from each distribution. With the same significance test, the increase in accuracy from BaikalCMA over Baikal was found to be statistically significant, with a $p$-value of .
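A heteroscedastic two-sample t-test is commonly known as Welch's t-test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library; the function name is illustrative, and the example values in the usage note are toy numbers, not accuracies from the experiments.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For instance, `welch_t([1, 2, 3, 4], [2, 3, 4, 5])` gives a statistic of about -1.095 with 6 degrees of freedom. Converting the statistic to a two-tailed p-value requires the t-distribution's cumulative distribution function; with SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test.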

4.4 Training speed

Figure 3: Training curves for different loss functions on MNIST. Baikal and BaikalCMA result in faster and smoother training compared to the cross-entropy loss.

Training curves for networks trained with the cross-entropy loss, Baikal, and BaikalCMA are shown in Figure 3. Each curve represents 80 testing dataset evaluations spread evenly (i.e., every 250 steps) throughout 20,000 steps of training on MNIST. Networks trained with Baikal and BaikalCMA both learn significantly faster than the cross-entropy loss. Interestingly, the Baikal and BaikalCMA training curves are both smoother than the cross-entropy loss curve, implying that their loss surfaces have fewer or less detrimental local minima. These phenomena make Baikal a compelling loss function for fixed time-budget training, where the improvement in resultant accuracy over the cross-entropy loss becomes most evident.

4.5 Training data requirements

Figure 4: Dataset size sensitivity for different loss functions on MNIST. For each size, $n = 5$. Baikal and BaikalCMA increasingly outperform the cross-entropy loss on small datasets, providing evidence of reduced overfitting.

Figure 4 provides an overview of the effects of dataset size on networks trained with cross-entropy loss, Baikal, and BaikalCMA. For each training dataset portion size, five individual networks were trained for each loss function.

The degree to which Baikal and BaikalCMA outperform cross-entropy loss increases as the training dataset becomes smaller. This provides evidence of less overfitting when training a network with Baikal or BaikalCMA. As expected, BaikalCMA outperforms Baikal at all tested dataset sizes. The size of this improvement in accuracy does not grow as significantly as the improvement over cross-entropy loss, suggesting that the overfitting characteristics of Baikal and BaikalCMA are very similar. Conceivably, one could run the optimization phase of GLO on a reduced dataset specifically to yield a loss function with better performance than BaikalCMA on small datasets.

4.6 Loss function transfer to CIFAR-10

Figure 5: Testing accuracy across training steps on CIFAR-10. The Baikal loss, which has been transferred from MNIST, outperforms the cross-entropy loss on all training durations.

Figure 5 presents a collection of 18 separate tests of the cross-entropy loss and Baikal applied to CIFAR-10. Baikal is found to outperform cross-entropy across all training durations, with the difference becoming more prominent for shorter training periods. These results present an interesting use case for GLO, where a loss function that is found on a simpler dataset can be transferred to a more complex dataset while still maintaining performance improvements. This provides a particularly persuasive argument for using GLO loss functions in fixed time-budget scenarios.

5 Analysis

This section presents a symbolic analysis of the Baikal loss function, followed by experiments that attempt to elucidate why Baikal works better than the cross-entropy loss. A likely explanation is that Baikal results in implicit regularization.

5.1 Binary classification

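Loss functions used on the MNIST dataset, being a 10-dimensional classification problem, are difficult to plot and visualize graphically. In this section, loss functions are analyzed in the context of binary classification, where $n = 2$ and the Baikal loss expands to:

$$\mathcal{L}_{\mathrm{Baikal2D}} = -\frac{1}{2}\left(\log(y_1) - \frac{x_1}{y_1} + \log(y_2) - \frac{x_2}{y_2}\right).$$

Since vectors $x$ and $y$ sum to $1$, by consequence of being passed through a softmax function, for binary classification $x = \langle x_1, 1 - x_1 \rangle$ and $y = \langle y_1, 1 - y_1 \rangle$. This constraint simplifies the binary Baikal loss to the following function of two variables ($x_1$ and $y_1$):

$$\mathcal{L}_{\mathrm{Baikal2D}} = -\frac{1}{2}\left(\log(y_1) - \frac{x_1}{y_1} + \log(1 - y_1) - \frac{1 - x_1}{1 - y_1}\right).$$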
This same methodology can be applied to the cross-entropy loss and BaikalCMA.

Figure 6: Binary classification loss functions at $x_1 = 1$. Correct predictions lie on the right side of the graph, and incorrect predictions on the left. The log loss is shown to be monotonically decreasing, while Baikal and BaikalCMA present counterintuitive, sharp increases in loss as predictions approach the true label.

In practice, true labels are assumed to be correct with certainty; thus, $x_1$ is equal to either $0$ or $1$. The specific case where $x_1 = 1$ is plotted in Figure 6 for the cross-entropy loss, Baikal, and BaikalCMA. The cross-entropy loss is shown to be monotonically decreasing, while Baikal and BaikalCMA counterintuitively show an increase in the loss value as the predicted label, $y_1$, approaches the true label $x_1$. Section 5.2 provides reasoning for this unusual phenomenon.

As also seen in Figure 6, the minimum for the Baikal loss where lies around 0.71, while the minimum for the BaikalCMA loss where lies around 0.77. This, along with the more pronounced slope around is likely a reason why BaikalCMA performs better than Baikal.
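The location of the Baikal minimum can be checked numerically. The sketch below assumes the binary form $-\frac{1}{2}\left(\log(y) - x/y + \log(1 - y) - (1 - x)/(1 - y)\right)$, with `x1` the true label and `y1` the predicted probability, and scans a grid of predictions for the case where the true label is 1. Setting the derivative to zero in that case gives $1/y + 1/y^2 = 1/(1 - y)$, i.e. $y = 1/\sqrt{2} \approx 0.707$, which agrees with the value of about 0.71 read from the plot.

```python
import math

def binary_baikal(x1, y1):
    """Binary Baikal loss as a function of true label x1 and prediction y1."""
    return -0.5 * (math.log(y1) - x1 / y1
                   + math.log(1 - y1) - (1 - x1) / (1 - y1))

# Scan predictions strictly between 0 and 1 for the true label x1 = 1.
grid = [i / 10000 for i in range(1, 10000)]
y_min = min(grid, key=lambda y: binary_baikal(1.0, y))
```

Evaluating `binary_baikal(1.0, 0.99)` against `binary_baikal(1.0, y_min)` also shows the rise in loss as the prediction approaches the true label.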

5.2 Implicit regularization

Figure 7: Loss function input activation strength histograms for cross-entropy loss and BaikalCMA. The peaks are likely shifted with BaikalCMA due to implicit regularization. These histograms match those from a network trained with a confidence regularizer pereyra2017regularizing .

The Baikal and BaikalCMA loss functions are unusual in that they are not monotonically decreasing (see the previous section for more details). At first glance, this behavior may seem undesirable; however, this may be an advantageous trait that implicitly provides a form of regularization (enabling better generalization). This is strongly supported by pereyra2017regularizing , where researchers built a confidence regularizer, on top of cross-entropy loss, that penalizes low entropy prediction distributions. The bimodal distribution of output probabilities that the researchers found on MNIST is nearly identical to that which can be found on a network trained with Baikal or BaikalCMA.
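For comparison, a confidence regularizer of the kind referenced above can be sketched as the cross-entropy minus a weighted entropy of the predicted distribution, so that low-entropy (overconfident) predictions are penalized. The function names and the weight `beta` are hypothetical, no particular value of `beta` is implied, and the unnormalized cross-entropy is used here for simplicity.

```python
import math

def entropy(y):
    """Shannon entropy of a predicted distribution (natural log)."""
    return -sum(p * math.log(p) for p in y if p > 0)

def confidence_penalized_loss(x, y, beta):
    """Cross-entropy minus beta times the entropy of the prediction."""
    ce = -sum(xi * math.log(yi) for xi, yi in zip(x, y) if xi > 0)
    return ce - beta * entropy(y)
```

With `beta = 0` this reduces to the plain cross-entropy; for larger `beta` the minimum moves away from fully confident predictions, which is qualitatively the same effect the non-monotonic Baikal curves produce.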

Histograms of the output probability distributions of networks trained with the cross-entropy loss and BaikalCMA on the test dataset, after 15,000 steps of training on MNIST, are shown in Figure 7. Note that the abscissae in Figures 6 and 7 correspond with each other; thus, one can qualitatively see how the channel-shaped curves for BaikalCMA may contribute to the shift in histogram peaks.

Furthermore, the improved behavior under small-dataset conditions described in Section 4.5 backs this theory of implicit regularization, since less overfitting was observed when using Baikal and BaikalCMA.

6 Discussion and Future Work

This paper proposes loss function discovery and optimization as a new form of metalearning, and introduces an evolutionary computation approach to it. GLO was evaluated experimentally in the image classification domain, and discovered a surprising new loss function, Baikal. Experiments showed substantial improvements in accuracy, convergence speed, and data requirements. Further analysis suggests that these improvements result from implicit regularization that reduces overfitting to the data.

In the future, GLO can be applied to other machine learning datasets and tasks. The approach is general, and could result in discovery of customized loss functions for different domains, or even specific datasets. It will be interesting to find out how much such customization matters, and whether general principles that apply across domains and tasks can be determined from the results. One particularly interesting domain is generative adversarial networks (GANs). Significant manual tuning is necessary in GANs to ensure that the generator and discriminator networks learn harmoniously. GLO could find co-optimal loss functions for the generator and discriminator networks in tandem, thus making GANs more powerful, robust, and easier to implement.

GAN optimization is an example of co-evolution, where multiple interacting solutions are developed simultaneously. GLO could leverage co-evolution more generally: for instance, it could be combined with techniques like CoDeepNEAT miikkulainen2019evolving to learn jointly-optimal network structures, hyperparameters, learning rate schedules, data augmentation, and loss functions simultaneously. Such approaches require significant computing power, but they may also discover and utilize interactions between the design elements that result in higher complexity and better performance than is currently possible.

7 Conclusion

This paper proposes Genetic Loss-function Optimization (GLO) as a general framework for discovering and optimizing loss functions for a given task. A surprising new loss function was discovered in the experiments, and shown to outperform the cross-entropy loss on MNIST and CIFAR-10 in terms of accuracy, training speed, and data requirements. This function, Baikal, likely achieves these benefits through an implicit regularization effect. GLO can be combined with other aspects of metalearning in the future, paving the way to robust and powerful AutoML.