Population-Based Training for Loss Function Optimization

02/11/2020 ∙ by Jason Liang, et al.

Metalearning of deep neural network (DNN) architectures and hyperparameters has become an increasingly important area of research. Loss functions are a type of metaknowledge that is crucial to effective training of DNNs and their potential role in metalearning has not yet been fully explored. This paper presents an algorithm called Enhanced Population-Based Training (EPBT) that interleaves the training of a DNN's weights with the metalearning of optimal hyperparameters and loss functions. Loss functions use a TaylorGLO parameterization, based on multivariate Taylor expansions, that EPBT can directly optimize. On the CIFAR-10 and SVHN image classification benchmarks, EPBT discovers loss function schedules that enable faster, more accurate learning. The discovered functions adapt to the training process and serve to regularize the learning task by discouraging overfitting to the labels. EPBT thus demonstrates a promising synergy of simultaneous training and metalearning.

1 Introduction

Deep neural networks (DNNs) constitute a powerful machine learning approach capable of learning useful representations of complex, high-dimensional data (LeCun et al., 2015). DNNs have outperformed traditional machine learning models on a variety of benchmarks and tasks, including computer vision, reinforcement learning, and natural language processing (Krizhevsky et al., 2012; Bahdanau et al., 2014; Mnih et al., 2015).

However, training modern DNNs often requires extensive tuning, and thus many state-of-the-art DNNs must be carefully designed by hand. Over the years, much research has focused on the development of automated methods for metalearning and optimization of DNN hyperparameters and architectures, using techniques such as Bayesian optimization, reinforcement learning, and evolutionary search (Snoek et al., 2015; Zoph and Le, 2016; Miikkulainen et al., 2019; Real et al., 2019).

Recently, promising techniques have been developed for metalearning loss functions too (Houthooft et al., 2018; Gonzalez and Miikkulainen, 2019). Loss function optimization provides a different dimension of metalearning: instead of optimizing network structure or weights, it modifies the gradients themselves, making it possible to automatically regularize the learning process. However, as in most prior metalearning methods, training and metalearning were done separately. Such an approach is computationally complex and cannot adapt to different stages of learning.

A recent metalearning algorithm called Population-Based Training (PBT) is designed to overcome this limitation (Jaderberg et al., 2017). PBT interleaves DNN weight training with the optimization of hyperparameters that are relevant to the training process but have no particular fixed value (e.g., learning rate). Such online adaptation is crucial in domains where the learning dynamics are non-stationary.

Therefore, PBT forms a promising starting point for loss function optimization as well. Building on PBT, this paper makes the following contributions: First, a new online and adaptive hyperparameter optimization algorithm called Enhanced Population-Based Training (EPBT) is introduced. This algorithm makes use of powerful heuristics commonly used in evolutionary black-box optimization to discover promising combinations of hyperparameters for DNN training. In particular, EPBT uses selection, mutation, and crossover operators adapted from genetic algorithms (Whitley, 1994) to find good solutions.

Second, a recently developed function parameterization based on multivariate Taylor expansions called TaylorGLO (Gonzalez and Miikkulainen, 2020) is combined with EPBT to optimize loss functions. This parameterization makes it possible to encode a wide variety of different loss functions compactly. On the CIFAR-10 (Krizhevsky and Hinton, 2009) and SVHN (Netzer et al., 2011) image classification benchmarks, EPBT and TaylorGLO can achieve faster training and better convergence when compared to the standard training process that uses cross-entropy loss.

Third, an analysis of the shapes of the discovered loss functions suggests that they penalize overfitting, thus regularizing the learning process automatically. Different loss shapes are most effective at different stages of the training process, thus suggesting that an adaptive loss function may perform better than one that remains static throughout the training.

The rest of this paper is organized as follows: First, background work in metalearning and loss function optimization is reviewed. Next, the algorithm for EPBT and the parameterization of loss functions as Taylor expansions are described. Experimental results on the CIFAR-10 and SVHN image classification datasets are then presented, followed by an analysis of the optimized loss functions.

2 Background and Related Work

While metalearning and neural architecture search have become popular in the machine learning community, loss function optimization is a relatively new area of research. This section summarizes existing work done on both metalearning for DNNs and loss functions.

2.1 Deep Metalearning

Metalearning of good DNN hyperparameters and architectures is a highly active field of research. One popular approach is to use reinforcement learning or policy gradient methods to tune a controller that performs metalearning on a DNN’s structure and hyperparameters (Zoph and Le, 2016; Zoph et al., 2018; Pham et al., 2018). Another method is to make the metalearning differentiable to the performance of the DNN (Maclaurin et al., 2015; Liu et al., 2018), and thus learn by gradient descent.

Recently, metalearning methods based on evolutionary algorithms (EA) have also gained popularity. These methods can optimize DNNs of arbitrary topology and structure (Miikkulainen et al., 2019), achieving state-of-the-art results on large-scale image classification benchmarks (Real et al., 2019), and demonstrating good trade-offs in multiple objectives such as performance and network complexity (Lu et al., 2018). Many of these EAs use proven and time-tested heuristics such as mutation, crossover, selection, and elitism (Goldberg and Holland, 1988; Whitley, 1994; Stanley et al., 2019) to perform black-box optimization on arbitrary complex objectives. Advanced EAs such as CMA-ES (Hansen and Ostermeier, 1996) have also successfully optimized DNN hyperparameters in high-dimensional search spaces (Loshchilov and Hutter, 2016) and are competitive with popular hyperparameter tuning methods such as Bayesian optimization (Snoek et al., 2012, 2015).

One challenge shared by every DNN metalearning algorithm is to determine the right amount of training required to evaluate a network architecture/hyperparameter configuration on a benchmark task. Many algorithms simply stop training prematurely, assuming that the partially trained performance is correlated with the true performance (Li et al., 2017; Miikkulainen et al., 2019). Other methods rely on weight sharing, where many candidate architectures share model layers (Pham et al., 2018), thus ensuring that the training time is amortized among all solutions being evaluated.

PBT (Jaderberg et al., 2017) uses the weight sharing approach, which is more computationally efficient. PBT works by alternating between training models in parallel and tuning the model’s hyperparameters through an exploit-and-explore strategy. During exploitation, the hyperparameters and weights of well-performing models are duplicated to replace the worst performing ones. During exploration, hyperparameters are randomly perturbed within a constrained search space. Because PBT never retrains models from scratch, the computational complexity scales only with the population size and not with the total number of hyperparameter configurations searched. Besides tuning training hyperparameters such as the learning rate, PBT has successfully discovered data augmentation schedules (Ho et al., 2019). Therefore, PBT serves as a promising basis for the design of EPBT, which is described in more detail in Section 3.
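The exploit-and-explore step described above can be illustrated with a short Python sketch. It is not the reference PBT implementation: the dictionary layout of a population member, the `frac` and `factor` defaults (borrowed from the PBT configuration reported in Section 4.1), and the choice to perturb by either `factor` or `1 / factor` are assumptions made for illustration.

```python
import copy
import random

def pbt_step(population, frac=0.25, factor=1.2, rng=random):
    """One exploit-and-explore step (illustrative sketch, not the original code).

    Each member is a dict with keys 'weights', 'hparams' (dict of floats),
    and 'fitness'. The member layout and defaults are assumptions.
    """
    ranked = sorted(population, key=lambda m: m["fitness"], reverse=True)
    n_swap = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n_swap], ranked[-n_swap:]
    for loser in bottom:
        winner = rng.choice(top)
        # Exploit: duplicate the weights and hyperparameters of a good member.
        loser["weights"] = copy.deepcopy(winner["weights"])
        loser["hparams"] = dict(winner["hparams"])
        # Explore: perturb every copied hyperparameter multiplicatively.
        for name in loser["hparams"]:
            loser["hparams"][name] *= rng.choice([factor, 1.0 / factor])
    return population
```

Because the weights are copied rather than reinitialized, the training cost grows with the population size only; the stale fitness of a replaced member would be overwritten after its next round of training.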

2.2 Loss Function Optimization

DNNs are trained through the backpropagation of gradients that originate from a loss function (LeCun et al., 2015). Loss functions represent the primary training objective for a neural network. The choice of the loss function can have a significant impact on a network’s performance (Janocha and Czarnecki, 2017; Bosman et al., 2019; Gonzalez and Miikkulainen, 2019). Recently, Genetic Loss Optimization (GLO) (Gonzalez and Miikkulainen, 2019) was proposed as a new type of metalearning, making it possible to automatically discover novel loss functions that can be used to train higher-accuracy neural networks in less time.

In GLO, loss functions are represented as trees and optimized through genetic programming (Banzhaf et al., 1998). This approach has the advantage of allowing arbitrarily complex loss functions. However, it opens the door to pathological functions in the search space with undesirable behaviors, such as discontinuities. To resolve many of these shortcomings, this paper uses a loss function representation based on multivariate Taylor expansions (Gonzalez and Miikkulainen, 2020). This parameterization has several advantages including smoothness, guaranteed continuity, adjustable complexity, and ease of implementation. How these loss functions are optimized with EPBT is described in the next section.

3 Algorithm Description

This section will describe in detail how Enhanced Population-Based Training (EPBT) optimizes loss functions and other relevant hyperparameters during training. EPBT makes use of several powerful heuristics that are not part of the original PBT algorithm. In addition, TaylorGLO (Gonzalez and Miikkulainen, 2020), an optimizable parameterization for loss functions using multivariate Taylor expansions, is described.

3.1 EPBT Overview

Figure 1: An overview of the EPBT method. EPBT begins by randomly initializing individuals, which are composed of hyperparameters, model weights, and fitness values. Next, EPBT runs for multiple generations in a three-step loop: (1) selection of the best individuals, (2) generation of new individuals, and (3) evaluation of these individuals. In step one, promising individuals are selected using a heuristic. In step two, new individuals with updated hyperparameters are created, but the weights and fitness are inherited. In step three, these individuals are evaluated on a task and have their model weights and fitness (i.e., performance in the task) updated. Thus, EPBT makes it possible to simultaneously train the network and optimize hyperparameters such as loss function parameterizations.

The core concept of EPBT is intuitive and builds on extensive work already done with evolutionary optimization of DNNs (Stanley et al., 2019). Figure 1 shows how EPBT works by maximizing the fitness of a population of candidate solutions (individuals) over multiple iterations (generations). As a black-box method, EPBT requires no gradient information but only the fitness value of each individual. With EPBT, it is thus possible to incorporate metalearning in tasks where meta-gradients are not available.

At the beginning of generation $g$, the population $M_g$ consists of individuals $M_{gi}$. Each $M_{gi} = \{D_{gi}, h_{gi}, f_{gi}\}$, where $D_{gi}$ is a DNN model (both architecture and weights), $h_{gi}$ is a set of hyperparameters, and $f_{gi}$ is a real-valued scalar fitness. In Stage 1 of the generation, a selection operator $\tau$ is used to select promising individuals to form a parent set $\hat{M}_g$, where $\hat{M}_g \subseteq M_g$. In Stage 2, $\hat{M}_g$ is used to create a set $N_g$, which contains new individuals $N_{gi}$. Each of these new individuals inherits $D_{gi}$ from its parent $\hat{M}_{gi}$, but has updated hyperparameters $\hat{h}_{gi}$. The heuristics used for generating $\hat{h}_{gi}$ will be described in more detail later. Finally, in Stage 3, each $N_{gi}$ is evaluated by training $D_{gi}$ on a task or dataset, thereby creating an updated model $\hat{D}_{gi}$. The validation performance of $\hat{D}_{gi}$ is used to determine a new fitness value $\hat{f}_{gi}$. Thus, by the end of generation $g$, the population pool contains the evaluated individuals $\hat{N}_g$, where $\hat{N}_{gi} = \{\hat{D}_{gi}, \hat{h}_{gi}, \hat{f}_{gi}\}$. This process is repeated for multiple generations until the fitness of the best individual in the population converges.

Since the evaluation of an individual does not depend on other individuals, the entire process can be parallelized. In the current implementation of EPBT, fitness evaluations are mapped onto a multi-process pool of workers on a single machine. Each worker has access to a particular GPU of the machine, and if there are multiple GPUs available, every GPU will be assigned to at least one worker. For the experiments in this paper, a single worker does not fully utilize the GPU and multiple workers can be trained in parallel without any slowdown.

3.2 EPBT Heuristics

EPBT uses evolutionary optimization heuristics, or genetic operators (Whitley, 1994), to tune individuals. Below is a summary of how EPBT is initialized and how these operators are utilized at each stage of a generation.

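Initialization: A population $M_0$ with $P$ individuals is created as $M_0 = \{M_{01}, \ldots, M_{0P}\}$. For each individual $M_{0i}$, the model $D_{0i}$ is set to a fixed DNN architecture and its weights are randomly initialized. Also, each variable in the hyperparameter set $h_{0i}$ is uniformly sampled from within a fixed range and the fitness $f_{0i}$ is set to zero.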

Stage 1 – Tournament Selection: Using the tournament selection operator $\tau$, $t$ individuals are repeatedly chosen at random from the current population $M_g$. Each time, the individuals are compared and the one with the highest fitness is added to the parent set $\hat{M}_g$. This process is repeated until $|\hat{M}_g| = |M_g| - k$, where $k$ is the number of elites. Elitism will be described in more detail in Stage 3. The value $t = 2$ is commonly used in EA literature and also in the experiments in this paper.

Stage 2 – Mutation and Crossover: For each parent $\hat{M}_{gi}$, a uniform mutation operator $\mu$ is applied by introducing multiplicative Gaussian noise independently to each variable in its hyperparameters $h_{gi}$. The mutation operator can randomly and independently reinitialize every variable as well. This approach allows for the exploration of novel combinations of hyperparameters. After mutation, a uniform crossover operator $\kappa$ is applied, where each variable in $h_{gi}$ is randomly swapped (50% probability) with the same variable from another individual in $\hat{M}_g$, resulting in the creation of $\hat{h}_{gi}$. The model $D_{gi}$ is copied from $\hat{M}_{gi}$ and combined with $\hat{h}_{gi}$ to form the unevaluated individual $N_{gi}$.

Stage 3 – Fitness Evaluation with Elitism: The evaluation process proceeds as described above and results in evaluated individuals $\hat{N}_g$. After evaluation, EPBT uses an elitism heuristic to preserve progress. In elitism, the population $M_g$ is sorted by fitness $f_{gi}$ and the $k$ best performing individuals $E_g$ are preserved and combined with $\hat{N}_g$ to form $M_{g+1}$, the population for the next generation. By default, $k$ is set to half of the population size, which is a popular value in literature. In the same way that mutation and crossover encourage exploration of a search space, elitism allows for the exploitation of promising regions in the search space.

When viewed from the EA perspective, PBT (Jaderberg et al., 2017) can be seen as a simpler variant of EPBT. The explore step in PBT corresponds to mutation and the exploit step corresponds to elitism. EPBT improves upon PBT in two major ways. First, EPBT makes use of uniform Gaussian mutation (compared to the deterministic mutation in PBT) and uniform crossover. These biologically inspired heuristics allow the algorithm to scale better to higher dimensions. In particular, the crossover operator plays an important role in discovering good global solutions in large search spaces (Goldberg and Holland, 1988; Whitley, 1994). Second, EPBT utilizes tournament selection, a heuristic that helps prevent premature convergence to a local optimum (Shukla et al., 2015). The code for EPBT is summarized in Algorithm 1, with each line numbered by the corresponding stage.
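The three stages can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the noise scale `sigma`, the reset probability `reset_prob`, the sampling range, the way a crossover mate is picked, and the `toy_evaluate` function (a stand-in for actually training a network and measuring validation accuracy) are all assumptions.

```python
import random

def tournament_select(population, n_parents, t=2, rng=random):
    """Stage 1: each parent is the fittest of t randomly drawn individuals."""
    return [max(rng.sample(population, t), key=lambda m: m["fitness"])
            for _ in range(n_parents)]

def mutate(hparams, sigma=0.1, reset_prob=0.1, low=-1.0, high=1.0, rng=random):
    """Stage 2: multiplicative Gaussian noise, or an occasional random reset."""
    out = []
    for v in hparams:
        if rng.random() < reset_prob:
            v = rng.uniform(low, high)
        else:
            v = v * (1.0 + rng.gauss(0.0, sigma))
        out.append(v)
    return out

def crossover(hparams, mate_hparams, rng=random):
    """Stage 2: take each variable from the mate with 50% probability."""
    return [b if rng.random() < 0.5 else a
            for a, b in zip(hparams, mate_hparams)]

def epbt_generation(population, evaluate, k, rng=random):
    """Run Stages 1-3 once and return the next population."""
    parents = tournament_select(population, len(population) - k, rng=rng)
    children = []
    for parent in parents:
        mate = rng.choice(parents)  # assumption: mate drawn from the parent set
        h = crossover(mutate(parent["hparams"], rng=rng), mate["hparams"], rng=rng)
        # Weights are inherited; evaluate() continues training and scores them.
        weights, fitness = evaluate(parent["weights"], h)
        children.append({"weights": weights, "hparams": h, "fitness": fitness})
    # Stage 3 elitism: keep the k fittest members of the old population.
    elites = sorted(population, key=lambda m: m["fitness"], reverse=True)[:k]
    return children + elites

def toy_evaluate(weights, hparams):
    """Hypothetical fitness in (0, 1], highest when every variable is 0.5."""
    return weights, 1.0 / (1.0 + sum((v - 0.5) ** 2 for v in hparams))

rng = random.Random(0)
population = [{"weights": None,
               "hparams": [rng.uniform(-1, 1) for _ in range(8)],
               "fitness": 0.0} for _ in range(40)]
history = []
for _ in range(25):
    population = epbt_generation(population, toy_evaluate, k=20, rng=rng)
    history.append(max(m["fitness"] for m in population))
```

With elitism, the best fitness recorded in `history` cannot decrease from one generation to the next, which is the "preserve progress" property described in Stage 3.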

3.3 Loss Function Parameterization

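Loss functions can be represented by leveraging the TaylorGLO parameterization, which is defined by a fixed set of continuous-valued parameters, unlike the tree-based representation in GLO (Gonzalez and Miikkulainen, 2019). TaylorGLO loss functions have several functional advantages over GLO, including inherent stability, smoothness, and lack of discontinuities (Gonzalez and Miikkulainen, 2020). Furthermore, because of their simple and compact representation as a continuous vector, TaylorGLO functions can be easily tuned using black-box methods. Specifically in this paper, a third-order TaylorGLO loss function with parameters $\theta_0, \ldots, \theta_7$ is used:

$$\mathcal{L}(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ \theta_2 (y_i - \theta_1) + \tfrac{1}{2} \theta_3 (y_i - \theta_1)^2 + \tfrac{1}{6} \theta_4 (y_i - \theta_1)^3 + \theta_5 (x_i - \theta_0)(y_i - \theta_1) + \tfrac{1}{2} \theta_6 (x_i - \theta_0)(y_i - \theta_1)^2 + \tfrac{1}{2} \theta_7 (x_i - \theta_0)^2 (y_i - \theta_1) \Big] \qquad (1)$$

where $x$ is the sample’s true label in one-hot form, $y$ is the network’s prediction (i.e., scaled logits), and $n$ is the number of classes. The eight parameters ($\theta_0, \ldots, \theta_7$) are stored in the hyperparameter set of each individual and are optimized using EPBT.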
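As an illustration of how such an eight-parameter loss can be evaluated, the following Python sketch computes a third-order bivariate Taylor polynomial in the label and the prediction, keeping only the terms that depend on the prediction. The ordering of the parameters in `theta` (two expansion centers followed by six coefficients) is an assumption of this sketch and should be checked against the TaylorGLO paper before reuse.

```python
def taylor_glo_loss(x, y, theta):
    """Third-order Taylor-style loss for one sample (illustrative sketch).

    x: one-hot true label, y: predicted class scores, theta: eight floats.
    Assumed layout: theta[0], theta[1] are the expansion centers for x and y;
    theta[2:] are the coefficients of the six prediction-dependent terms.
    """
    t0, t1, t2, t3, t4, t5, t6, t7 = theta
    total = 0.0
    for xi, yi in zip(x, y):
        dx, dy = xi - t0, yi - t1
        total += (t2 * dy + t3 * dy ** 2 / 2 + t4 * dy ** 3 / 6
                  + t5 * dx * dy + t6 * dx * dy ** 2 / 2 + t7 * dx ** 2 * dy / 2)
    return -total / len(x)

# With only the x*y cross term active, the loss rewards mass on the true class.
print(taylor_glo_loss([1.0, 0.0], [0.8, 0.2], [0, 0, 0, 0, 0, 1, 0, 0]))  # -0.4
```

Because the whole loss is described by a flat vector of eight floats, it can be mutated and recombined by the genetic operators exactly like any other hyperparameter.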

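  Input: max generations $G$, initial population $M_0$, genetic operators $\tau$, $\mu$, $\kappa$
  for $g = 0$ to $G - 1$ do
     1. Select $\hat{M}_g \subseteq M_g$ using $\tau$
     2a. Set $\hat{h}_g = \kappa(\mu(h_g))$ for each parent in $\hat{M}_g$
     2b. Set $N_g = \{D_g, \hat{h}_g, f_g\}$
     3a. Evaluate $N_g$, set $\hat{N}_g = \{\hat{D}_g, \hat{h}_g, \hat{f}_g\}$
     3b. Set $E_g$ to top $k$ from $M_g$
     3c. Set $M_{g+1} = \hat{N}_g \cup E_g$
  end for
Algorithm 1: EPBT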
Figure 2: Experiments on CIFAR-10 with ResNet-32. Each line represents the test classification accuracy (y-axis) of the method over the number of epochs of training (x-axis). All results are averaged over five runs with error bars shown. The top plot is a zoomed-in version of the bottom plot. EPBT outperforms all baselines by a significant margin.
Figure 3: Experiments on CIFAR-10 with WRN-16-8. All results are averaged over five runs with error bars shown. The top plot is a zoomed-in version of the bottom plot. The best TaylorGLO loss function discovered by EPBT outperforms the baseline, which uses cross-entropy loss for training.
Figure 4: Experiments on SVHN with ResNet-32. All results are averaged over five runs with error bars shown. The top plot is a zoomed-in version of the bottom plot. EPBT outperforms the baseline, which uses cross-entropy loss for training.

4 Experimental Results

To show the effectiveness of EPBT, the algorithm was applied to optimize loss functions for two popular image classification datasets: CIFAR-10 and SVHN. Experimental results and comparisons to multiple baselines are presented below. An analysis of the performance and computational complexity of EPBT is also done.

4.1 CIFAR-10

CIFAR-10 (Krizhevsky and Hinton, 2009) is a widely used image classification dataset consisting of 60,000 natural images in ten classes. The dataset is composed of a training set of 50,000 images and a test set of 10,000 images. To evaluate individuals during EPBT, a separate validation set of 5,000 images was created by splitting the training set. The fitness was calculated by finding the classification accuracy of the trained model on the validation set. The test accuracies of each individual’s model at the end of every generation were also recorded for comparison purposes only.

To better understand the improvement brought by EPBT, three baselines were created. The first baseline is a model trained without EPBT: a 32-layer residual network (ResNet-32) with 0.47 million weights that was initialized with the He method (He et al., 2015, 2016). The model was trained using stochastic gradient descent (SGD) for 200 epochs on all 50,000 training images with a batch size of 128, momentum of 0.9, and cross-entropy loss. A fixed learning rate schedule that starts at 0.1 and decays by a factor of 10 at 100 and 150 epochs was used. Input images were normalized to have unit pixel variance and a mean pixel value of zero before training, while data augmentation techniques such as random flips and translations were applied during training.
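The fixed learning rate schedule of the first baseline can be written as a small step-decay function. The sketch below is illustrative: the function name `step_lr` and the convention that epochs are counted from zero, with the decay taking effect at the start of epochs 100 and 150, are assumptions.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Learning rate for a 0-indexed epoch under step decay.

    The rate starts at base_lr and is multiplied by gamma each time the
    epoch counter reaches one of the milestones.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Sample the 200-epoch CIFAR-10 schedule at a few points.
schedule = {epoch: step_lr(epoch) for epoch in (0, 99, 100, 149, 150, 199)}
```

The same function covers the 40-epoch SVHN baseline described in Section 4.2 by passing `milestones=(20, 30)`.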

The second baseline is the original PBT algorithm set to tune the eight parameters of the TaylorGLO loss function. The training setup was similar to the first baseline, and the algorithm configuration was based on previous work where it was used to find data augmentation schedules (Ho et al., 2019). The search space for TaylorGLO was constrained between and and the loss parameters were initialized around zero. The loss parameters were tuned using a mixture of both random resets and multiplicative perturbations of magnitude 1.2. Finally, the weights and loss parameters from the top 25% of the population were copied over to the bottom 25% every generation. PBT was run for 25 generations, each with eight epochs of training (for a total of 200 epochs), and population size of 40. The third baseline is PBT set to optimize the learning rate and momentum of SGD with the standard cross-entropy loss function.

The experiments with EPBT were run using a training setup similar to the one described above. Like the PBT baselines, EPBT was run for 25 generations of eight epochs each and with a population size of 40. EPBT was configured similarly to the PBT baseline, but with an elitism size of $k = 20$ and with the initial TaylorGLO parameters sampled uniformly between $-1$ and 1.


Algorithm                CIFAR-10, ResNet-32   CIFAR-10, WRN-16-8   SVHN, ResNet-32
Baseline 1 (no PBT)      91.38 (0.65)          95.15 (0.15)         97.75 (0.03)
Baseline 2 (PBT, TGLO)   91.19 (0.35)          -                    -
Baseline 3 (PBT, SGD)    90.48 (0.60)          -                    -
EPBT (TGLO)              92.13 (0.18)          95.29 (0.15)         98.00 (0.05)

Table 1: Mean and standard deviation (over five runs) of final test accuracies on the CIFAR-10 and SVHN datasets. EPBT achieves better results than the baselines. Results are reported in percentage.

The test accuracies of each baseline and of the best model in EPBT’s population, averaged over five independent runs with standard error bars shown, are summarized in Figure 2. EPBT converges rapidly to the highest test accuracy and outperforms all three baselines. Using PBT to optimize TaylorGLO (Baseline 2) results in a slightly lower accuracy that is not significantly different from that of cross-entropy loss without PBT (Baseline 1). Such a limited accuracy in Baseline 2 probably follows from optimizing over an eight-dimensional search space with just a simple mutation heuristic and no crossover operator. Using PBT to optimize SGD hyperparameters (Baseline 3) performs the worst. In this case, the model converged to a local optimum. Preliminary experiments were also run with EPBT optimizing SGD hyperparameters and TaylorGLO loss parameters simultaneously. These experiments resulted in an outcome similar to that of Baseline 3. There appear to be complex interactions between the learning rate and loss function that are difficult to tune.

EPBT can also be applied to larger DNN architectures. In Figure 3, Baseline 1 and both variants of EPBT were used to train a wide residual network (Zagoruyko and Komodakis, 2016) with 11 million weights (WRN-16-8). The results show that EPBT is again able to achieve noticeable improvements over the baseline and achieve better test accuracy at a faster pace.

4.2 SVHN

To demonstrate that loss function optimization scales with dataset size, EPBT was applied to SVHN (Netzer et al., 2011), a larger image classification task. This dataset is composed of around 600,000 training images and 26,000 testing images. Following existing practices (Huang et al., 2017), the dataset was normalized but no data augmentation was used during training. The baseline model was optimized with SGD on the full training set for a total of 40 epochs, with the learning rate decaying from 0.1 by a factor of 10 at 20 and 30 epochs respectively. EPBT was run for 40 generations, each with one epoch of training, and a validation set of 30,000 images was separated for evaluating individuals. Otherwise, the experiment setup was identical to the CIFAR-10 domain.

Figure 4 gives a comparison of EPBT against Baseline 1 in the SVHN domain. As expected, EPBT achieves higher test accuracy than the baseline. As in the earlier experiments with CIFAR-10 and ResNet-32, both EPBT variants learn faster and converge to a high test accuracy at the end.


Algorithm                CIFAR-10, ResNet-32   CIFAR-10, WRN-16-8   SVHN, ResNet-32
Baseline 1 (no PBT)      112                   152                  24
Baseline 2 (PBT, TGLO)   104                   -                    -
Baseline 3 (PBT, SGD)    104                   -                    -

Table 2: Number of training epochs required for EPBT to exceed the test accuracy of the baselines. The baselines were trained for 200 epochs in CIFAR-10 and 40 epochs in SVHN. EPBT surpasses most of the baselines at just over the half-way point.

4.3 Performance Analysis

A summary of the final test accuracies at the end of training for EPBT and the baselines is shown in Table 1. The results show that EPBT achieves the best results for multiple datasets and model architectures. Another noticeable benefit provided by EPBT is the ability to train models to convergence significantly faster than non-population based methods, especially with a limited number of training epochs. This is because multiple models are simultaneously trained with EPBT, each with different loss functions. If progress is made in one of the models, its higher fitness leads to that model’s loss function or weights being shared among the rest of the models, thus lifting their performance as well.

Table 2 details how many epochs of training are required for EPBT to surpass each of the baselines. As expected, EPBT outperforms most of the baselines after training for roughly half the total number of epochs. These experiments thus demonstrate the power of EPBT in not just training better models but also doing it faster.

4.4 Computational Complexity Analysis

Compared to simpler hyperparameter tuning methods that do not interleave training and optimization, EPBT is significantly more efficient. On the CIFAR-10 dataset, EPBT discovered 40 new loss functions during the first generation and an additional 20 loss functions every subsequent generation. EPBT was run for 25 generations and thus was able to explore up to 520 unique TaylorGLO parameterizations. This process is efficient given the size of the search space; if a grid search were performed at intervals of , a total of (38 billion) unique loss function parameterizations would have to be evaluated.

Furthermore, the computational complexity of EPBT scales linearly with the population size and not with the number of loss functions explored. Loss function evaluation is efficient in EPBT because it is not necessary to retrain the model from scratch whenever a new loss function is discovered; the model’s weights are copied over from an existing model with good performance. If each of the 520 discovered loss functions was used to fully train a model from random initialization, over 100,000 epochs of training would be required, much higher than the 8,000 epochs EPBT needed.
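The counts quoted in this section follow from simple arithmetic, reproduced in the Python sketch below. The population size, elitism size, generation count, and epochs per generation come from Section 4.1; the value of 21 grid points per dimension is an assumption, chosen only because $21^8$ is the eighth power that matches the quoted figure of roughly 38 billion.

```python
population_size = 40
elites = 20                  # half of the population is preserved each generation
generations = 25
epochs_per_generation = 8
epochs_per_run = generations * epochs_per_generation          # 200

# 40 loss functions in the first generation, then 20 new ones per generation.
new_per_generation = population_size - elites
explored = population_size + (generations - 1) * new_per_generation

# Training cost: every population slot is trained for the full run once.
epbt_epochs = population_size * epochs_per_run
from_scratch_epochs = explored * epochs_per_run

# Grid search over eight parameters (21 points per dimension is assumed).
grid_points_per_dim = 21
grid_size = grid_points_per_dim ** 8

print(explored, epbt_epochs, from_scratch_epochs, grid_size)
# 520 8000 104000 37822859361
```

Under these assumptions, training each explored loss function from scratch would cost 13 times as many epochs as the single EPBT run.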

Because EPBT evaluates all the individuals in the population in parallel, the real-time complexity of each generation is not significantly higher than training a single model for the same number of epochs. Furthermore, the amount of time spent in Stages 1 and 2 to generate new individuals is negligible compared to Stage 3, where model training occurs. The EPBT experiments in this paper were run on a machine with four NVIDIA V100 GPUs, utilized one GPU-day worth of computing, and took roughly six hours to complete.

Figure 5: EPBT loss function ancestries for the best candidates across five different runs on CIFAR-10 with ResNet-32. Their shapes are simplified into a 2D binary classification loss (Gonzalez and Miikkulainen, 2019) for visualization purposes. Cross-entropy loss is shown in the bottom right plot for comparison. Loss functions in the starting generations with fewer training epochs are darker, while functions from later generations with more training are lighter. Some runs have fewer ancestors because elitism allows the same loss function to be reused for multiple generations. Across all runs, there is a temporal pattern in the loss function ancestry. Early loss functions tend to regularize more (indicated by a positive slope at $y_0 = 1$), while later functions encourage more accurate fitting to the ground-truth labels.

5 Loss Function Analysis

Do the loss functions discovered by EPBT remain static or adapt to the current stage of the training process? An analysis of functions discovered by EPBT during an experiment indicates that they do change significantly over the generations.

To characterize how the loss functions adapt with increased training, the ancestries for the final top-performing functions across five separate runs of EPBT (CIFAR-10, ResNet-32) are shown in Figure 5. The cross-entropy loss (as used in Baseline 1) is also plotted for comparison. Ancestry is determined by tracing the sequence of individuals $m_0, m_1, \ldots$ through the generations, where $m_{g-1}$ is the parent whose model and hyperparameters were used to create $m_g$. The sequence is simplified by removing any duplicate individuals that do not change between generations due to elitism, thus causing some runs to have shorter ancestries.

Because the loss functions are multidimensional, graphing them is not straightforward. However, for visualization purposes, the losses can be simplified into a 2D binary classification modality where $y_0 = 1$ represents a perfect prediction, and $y_0 = 0$ represents a completely incorrect prediction (Gonzalez and Miikkulainen, 2019). This approach makes it clear that the loss generally decreases as the predicted labels become more accurate and closer to the ground-truth labels.
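One way to read this simplification is sketched below in Python: the true label is fixed to the first of two classes, the prediction is written as $(y_0, 1 - y_0)$, and the loss is examined as a function of $y_0$ alone. The helper names, the finite-difference slope estimate, and `toy_loss` (a hypothetical parabola-like loss, not one discovered by EPBT) are assumptions made for illustration.

```python
import math

def binary_slice(loss_fn, y0):
    """Loss of the two-class prediction (y0, 1 - y0) when class 0 is correct."""
    return loss_fn([1.0, 0.0], [y0, 1.0 - y0])

def cross_entropy(x, y):
    """Standard cross-entropy; zero-label terms are skipped to avoid log(0)."""
    return -sum(xi * math.log(yi) for xi, yi in zip(x, y) if xi > 0)

def toy_loss(x, y):
    """Hypothetical loss whose binary slice has its minimum at y0 = 0.8."""
    return sum(xi * (yi - 0.8) ** 2 for xi, yi in zip(x, y))

def slope_at_one(loss_fn, eps=1e-4):
    """Backward finite-difference slope of the binary slice at y0 = 1."""
    return (binary_slice(loss_fn, 1.0) - binary_slice(loss_fn, 1.0 - eps)) / eps

# Negative slope: the loss keeps rewarding more confident predictions.
ce_slope = slope_at_one(cross_entropy)
# Positive slope: fully confident predictions are penalized.
toy_slope = slope_at_one(toy_loss)
```

The sign of the slope at $y_0 = 1$ gives a compact summary of a curve: a negative value means the loss is still falling at a perfect prediction, while a positive value means the loss has already started to rise.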

There is an interesting trend across all five runs: the loss functions optimized by EPBT are not all monotonically decreasing. Instead, many have a parabolic shape that reaches its minimum below $y_0 = 1$ and rises slightly as $y_0$ approaches 1. Such a shape is likely a form of regularization that prevents the network from overfitting to the training data by penalizing low-entropy prediction distributions centered around $y_0 = 1$. Similar behavior was observed for GLO as well (Gonzalez and Miikkulainen, 2019).

The plots also show that the loss functions change shape as training progresses. As the number of epochs increases, the slope near $y_0 = 1$ becomes less positive, suggesting that less regularization occurs. This result is consistent with recent research that suggests regularization is most important during a critical period early in the training process (Golatkar et al., 2019). If regularization is reduced or removed after this critical period, generalization may sometimes even improve. In EPBT, this principle was discovered and optimized without any prior knowledge as part of the metalearning process. EPBT thus provides an automatic way of exploring metaknowledge that could be difficult to come upon manually.

6 Future Work

As shown by experimental analysis, different stages of EPBT utilize different types of loss functions. This finding supports the notion that a single static loss function might not be optimal for the entire training process. Furthermore, loss functions that change make sense considering that the learning dynamics for some DNNs are non-stationary or unstable (Jaderberg et al., 2017). For example, adaptive losses might improve the training of generative adversarial networks (Goodfellow et al., 2014; Radford et al., 2015).

Another direction of future work for EPBT is tuning an adaptive loss function parameterization that explicitly takes the current state of model training as input, thus resulting in more refined training and consequently better performance. Alternatively, domain information can be taken into account and allow learned EPBT loss function schedules to be more easily transferred to different tasks. Finally, EPBT can incorporate neural architecture search to jointly optimize network structure and loss functions.

7 Conclusion

This paper presents an evolutionary algorithm called EPBT that allows metalearning to be interleaved with weight training. EPBT was used to optimize a TaylorGLO loss function parameterization. Results on the CIFAR-10 and SVHN image classification benchmarks showed the power of EPBT in discovering loss functions that result in better and faster learning. An analysis of the optimized loss functions suggests that these advantages stem from automatically discovered regularization. Furthermore, an adaptive loss function schedule naturally arises and is likely to be the key to achieving good performance.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone (1998) Genetic programming: an introduction. Vol. 1, Morgan Kaufmann San Francisco. Cited by: §2.2.
  • A. S. Bosman, A. Engelbrecht, and M. Helbig (2019) Visualising basins of attraction for the cross-entropy and the squared error neural network loss functions. arXiv preprint arXiv:1901.02302. Cited by: §2.2.
  • A. S. Golatkar, A. Achille, and S. Soatto (2019) Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. In Advances in Neural Information Processing Systems 32, pp. 10677–10687. Cited by: §5.
  • D. E. Goldberg and J. H. Holland (1988) Genetic algorithms and machine learning. Machine learning 3 (2), pp. 95–99. Cited by: §2.1, §3.2.
  • S. Gonzalez and R. Miikkulainen (2019) Improved training speed, accuracy, and data utilization through loss function optimization. arXiv preprint arXiv:1905.11528. Cited by: §1, §2.2, §3.3, Figure 5, §5, §5.
  • S. Gonzalez and R. Miikkulainen (2020) Evolving loss functions with multivariate Taylor polynomial parameterizations. arXiv preprint arXiv:2002.00059. Cited by: §1, §2.2, §3.3, §3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §6.
  • N. Hansen and A. Ostermeier (1996) Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pp. 312–317. Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1.
  • D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen (2019) Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393. Cited by: §2.1, §4.1.
  • R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel (2018) Evolved policy gradients. In Advances in Neural Information Processing Systems, pp. 5400–5409. Cited by: §1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.2.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §1, §2.1, §3.2, §6.
  • K. Janocha and W. M. Czarnecki (2017) On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659. Cited by: §2.2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: §1, §4.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1, §2.2.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18 (1), pp. 6765–6816. Cited by: §2.1.
  • H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.1.
  • I. Loshchilov and F. Hutter (2016) CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269. Cited by: §2.1.
  • Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf (2018) NSGA-net: a multi-objective genetic algorithm for neural architecture search. arXiv preprint arXiv:1810.03522. Cited by: §2.1.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: §2.1.
  • R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §1, §2.1, §2.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §1, §4.2.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.1, §2.1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §6.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2.1.
  • A. Shukla, H. M. Pandey, and D. Mehrotra (2015) Comparative review of selection techniques in genetic algorithm. In 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), pp. 515–519. Cited by: §3.2.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §2.1.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §1, §2.1.
  • K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen (2019) Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. ISSN 2522-5839. Cited by: §2.1, §3.1.
  • D. Whitley (1994) A genetic algorithm tutorial. Statistics and computing 4 (2), pp. 65–85. Cited by: §1, §2.1, §3.2, §3.2.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.1.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.1.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710. Cited by: §2.1.