Together with topology, loss function, and learning rate, the choice of activation function plays a large role in determining how a neural network learns and behaves. However, only a small number of activation functions are widely used in modern deep learning architectures. The Rectified Linear Unit, ReLU(x) = max(x, 0), is among the most popular because it is simple and effective. Other activation functions such as sigmoid and tanh are commonly used when it is useful to restrict the activation value within a certain range. There have also been attempts to engineer new activation functions to have certain properties. For example, Leaky ReLU (Maas et al., 2013) allows information to flow when x < 0. Softplus (Nair and Hinton, 2010) is positive, monotonic, and smooth. Many hand-engineered activation functions exist (Nwankpa et al., 2018), but none have achieved widespread adoption comparable to ReLU. Activation function design can therefore be seen as a largely untapped resource in neural network design.
This situation points to an interesting opportunity: it may be possible to optimize activation functions automatically through metalearning. Rather than attempting to directly optimize a single model as in traditional machine learning, metalearning seeks to find better performance over a space of models, such as a collection of network architectures, hyperparameters, learning rates, or loss functions (Finn et al., 2017; Finn and Levine, 2017; Elsken et al., 2019; Wistuba et al., 2019; Feurer et al., 2015; Gonzalez and Miikkulainen, 2019). Several techniques for metalearning have been proposed, including gradient descent, Bayesian hyperparameter optimization, reinforcement learning, and evolutionary algorithms (Chen et al., 2019; Feurer et al., 2015; Zoph and Le, 2016; Real et al., 2019). Of these, evolution is the most versatile and can be applied to several aspects of neural network design.
Along these lines, this paper develops an evolutionary approach to optimize activation functions. Activation functions are represented as trees, and novel activation functions are discovered through crossover, mutation, and exhaustive search in expressive search spaces containing billions of candidate activation functions. The resulting functions are unlikely to be discovered manually, yet they perform well, surpassing traditional activation functions like ReLU on image classification on CIFAR-10 and CIFAR-100.
The paper is organized as follows. Section 2 summarizes related research in metalearning and neural network optimization and discusses how this work furthers the field. Section 3 explains in detail how activation functions are evolved, defining the search space of candidate functions as well as the implementation of crossover and mutation. Section 4 presents a number of search strategies, and Section 5 reports the results of the experiments. Section 6 discusses these results and potential future work, and Section 7 concludes.
2. Related Work
There has been a significant amount of recent work on metalearning with neural networks. Neuroevolution, the optimization of neural network architectures with evolutionary algorithms, is one of the most popular approaches (Elsken et al., 2019; Wistuba et al., 2019). Real et al. (2017) proposed a scheme where pairs of architectures are sampled and mutated, and the better of the two is trained while the other is discarded. Xie and Yuille (2017) implemented a full genetic algorithm with crossover to search for architectures. Real et al. (2019) introduced a novel evolutionary approach called aging evolution to discover an architecture that achieved higher ImageNet classification accuracy than any previous human-designed architecture.
While the above work focuses on using evolution to optimize neural network topologies, less attention has been paid to optimizing other aspects of neural networks. A rare example is work by Gonzalez and Miikkulainen (Gonzalez and Miikkulainen, 2019), which used a genetic algorithm to construct novel loss functions, and then optimized the coefficients of the loss functions with a covariance-matrix adaptation evolutionary strategy. They discovered a loss function that results in faster training and higher accuracy compared to the standard cross-entropy loss.
Other approaches to metalearning include Monte Carlo tree search to explore a space of architectures (Negrinho and Gordon, 2017) and reinforcement learning (RL), which has been used to automatically generate architectures through value function methods (Baker et al., 2016) and policy gradient approaches (Zoph and Le, 2016).
Like evolution, RL has been used to optimize other aspects of neural networks besides the architecture, including activation functions (Ramachandran et al., 2017). Ramachandran et al. discovered a number of high-performing activation functions in this manner; however, they analyzed just one in depth: x · σ(x), where σ is the sigmoid function, which they call Swish. This paper expands on the work by Ramachandran et al. by applying evolutionary optimization to the same problem. It also takes activation function design a step further: instead of only searching for a single activation function that performs reasonably well for most datasets and neural network architectures, it demonstrates that it is possible to evolve specialized activation functions that perform particularly well for specific datasets and neural network architectures, thus utilizing the full power of metalearning. This approach will be discussed in detail next.
3. Evolving Activation Functions
This section presents the approach to evolving activation functions, introducing the search space, mutation and crossover implementations, and the overall evolutionary algorithm.
3.1. Search Space
Each activation function is defined as a tree structure consisting of unary and binary functions. Functions are grouped in layers such that two unary functions always feed into one binary function. The following functions, modified slightly from the search space of Ramachandran et al. (Ramachandran et al., 2017), are used:
Unary: 0, 1, x, −x, |x|, x², x³, √x, e^x, e^(−x²), log(1 + e^x), log(|x + ε|), sin(x), sinh(x), asinh(x), cos(x), cosh(x), tanh(x), atan(x), sinc(x), max(x, 0), min(x, 0), σ(x), erf(x)
Binary: x1 + x2, x1 − x2, x1 · x2, x1 / (x2 + ε), max(x1, x2), min(x1, x2)
Following Ramachandran et al., a “core unit” is an activation function that can be represented as core_unit = binary(unary1(x), unary2(x)). The search spaces are defined as a family of balanced core unit trees: S1 is the set of functions representable by a single core unit, and for N > 1, SN contains functions in which each input of a core unit is itself computed by a tree from SN−1.
For example, S1 corresponds to the set of functions that can be represented by one core unit, S2 represents functions of the form core_unit1(core_unit2(x), core_unit3(x)), and so on. Elements of these search spaces are illustrated in Figures 1 and 2.
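This core-unit encoding can be sketched concretely. The snippet below is a minimal illustration using only a small assumed subset of the unary and binary operator sets; `evaluate_core`, `s1`, `s2`, and `random_core` are hypothetical helper names, not the paper's implementation:

```python
import math
import random

# Sketch of the core-unit encoding: a core unit is a triple
# (binary, unary1, unary2) and computes binary(unary1(a), unary2(b)).
# Only a small illustrative subset of the operator sets is included.
UNARY = {
    "zero": lambda x: 0.0,
    "one": lambda x: 1.0,
    "identity": lambda x: x,
    "negate": lambda x: -x,
    "relu": lambda x: max(x, 0.0),
    "tanh": math.tanh,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}

def evaluate_core(core, a, b):
    """Apply one core unit to inputs a and b."""
    bin_op, u1, u2 = core
    return BINARY[bin_op](UNARY[u1](a), UNARY[u2](b))

def s1(core, x):
    """A member of S1: one core unit applied to x."""
    return evaluate_core(core, x, x)

def s2(outer, inner1, inner2, x):
    """A member of S2: core_unit1(core_unit2(x), core_unit3(x))."""
    return evaluate_core(outer, s1(inner1, x), s1(inner2, x))

def random_core(rng):
    """Sample one core unit uniformly from the operator subsets."""
    return (rng.choice(sorted(BINARY)),
            rng.choice(sorted(UNARY)),
            rng.choice(sorted(UNARY)))
```

For instance, `s1(("add", "relu", "identity"), 2.0)` evaluates relu(2.0) + 2.0 = 4.0, and `("max", "zero", "identity")` reproduces ReLU itself.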
An evolutionary approach is used to explore these search spaces by selecting high-performing activation functions and creating new activation functions via mutation and crossover, as described next.
In mutation, one node in an activation function tree is selected uniformly at random. The function at that node is replaced with another random function in the search space. Unary functions are always replaced with unary functions, and binary functions with binary functions. Mutation is shown in Figure 1. Theoretically, mutation alone is sufficient for constructing any activation function. However, preliminary experiments showed that crossover can increase the rate at which good activation functions are found.
In crossover, two parent activation functions exchange randomly selected subtrees, producing one new child activation function. The subtrees are constrained to be of the same depth, ensuring the resulting child activation function is a member of the same search space as its parents. Crossover is depicted in Figure 2.
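The two operators can be sketched as follows, assuming the triple-of-core-units representation for an S2 individual and illustrative operator name lists; this is a sketch of the mechanism, not the paper's implementation:

```python
import random

# Sketch of mutation and crossover on core-unit trees. A core unit is a
# triple (binary, unary1, unary2); an S2 individual is a triple of core
# units. Operator names below are illustrative placeholders.
UNARY = ["zero", "one", "identity", "negate", "relu", "tanh", "sigmoid"]
BINARY = ["add", "sub", "mul", "max", "min"]

def mutate(ind, rng):
    """Replace one randomly chosen node, preserving its arity
    (binary nodes stay binary, unary nodes stay unary)."""
    core_i = rng.randrange(len(ind))
    slot = rng.randrange(3)            # 0 = binary node, 1/2 = unary nodes
    core = list(ind[core_i])
    core[slot] = rng.choice(BINARY if slot == 0 else UNARY)
    new = list(ind)
    new[core_i] = tuple(core)
    return tuple(new)

def crossover(parent_a, parent_b, rng):
    """Child takes a randomly chosen same-depth subtree (here, a whole
    core unit) from parent_b and the rest from parent_a, so the child
    stays in the same search space as its parents."""
    i = rng.randrange(len(parent_a))
    child = list(parent_a)
    child[i] = parent_b[i]
    return tuple(child)
```

Constraining the exchanged subtrees to the same depth is what keeps the child inside the parents' search space, as described above.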
Starting with a population of activation functions, a neural network is trained with each function on a given training dataset. Each function is assigned a fitness equal to the softmax of an evaluation metric, which is either the accuracy or the negative loss obtained on the validation dataset. More specifically, a function with metric value m_i receives fitness p_i = e^(m_i) / Σ_j e^(m_j). From the population, activation functions are selected with replacement for reproduction, with probability proportional to their fitness. Crossover followed by mutation is applied to the selected activation functions to obtain a new population. To increase exploration, randomly generated functions are also included, so that the final population returns to its original size. This process is repeated for several generations, and the activation functions with the best performance over the history of the search are returned as the result.
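The fitness assignment and parent selection above can be sketched as follows; `softmax_fitness` and `select_parents` are hypothetical names, and subtracting the maximum metric is a standard numerical-stability detail not stated in the text (it does not change the resulting probabilities):

```python
import math
import random

def softmax_fitness(metrics):
    """Softmax over validation metrics (accuracy, or negative loss)."""
    mx = max(metrics)                        # shift for numerical stability
    exps = [math.exp(m - mx) for m in metrics]
    total = sum(exps)
    return [e / total for e in exps]

def select_parents(population, metrics, k, rng):
    """Sample k parents with replacement, proportionally to fitness."""
    fitness = softmax_fitness(metrics)
    return rng.choices(population, weights=fitness, k=k)

pop = ["f0", "f1", "f2"]
acc = [0.10, 0.50, 0.90]
fit = softmax_fitness(acc)
# With accuracy-based fitness, f2 is e^0.8 ≈ 2.23 times more likely
# to be selected than f0.
```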
4. Experiments and Setup
This section presents experiments with multiple architectures, datasets, and search strategies.
4.1. Architectures and Datasets
A wide residual network (Zagoruyko and Komodakis, 2016) with depth 28 and widening factor 10 (WRN-28-10), implemented in TensorFlow (Abadi et al., 2016), is trained on the CIFAR-10 and CIFAR-100 image datasets (Krizhevsky et al., 2009). The architecture consists of repeated residual blocks that apply batch normalization and ReLU prior to each convolution. In the experiments, all ReLU activations are replaced with a candidate activation function; no other changes to the architecture are made. Hyperparameters are chosen to mirror those of Zagoruyko and Komodakis (2016) as closely as possible. Featurewise centering, horizontal flipping, and ZCA whitening are applied to the datasets as preprocessing. Dropout probability is 0.3, and the network is optimized using stochastic gradient descent with Nesterov momentum 0.9. WRN-40-4 (a deeper and thinner wide residual network) is also used in some experiments for comparison.
The CIFAR-10 and CIFAR-100 datasets both have 50K training images, 10K testing images, and no standard validation set. To prevent overfitting, balanced validation splits are created for both datasets by randomly selecting 500 images per class from the CIFAR-10 training set and 50 images per class from the CIFAR-100 training set. The test set is not modified so that the results can be compared with other work.
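The balanced split described above can be sketched as follows; `balanced_split` is a hypothetical helper (500 images per class for CIFAR-10, 50 per class for CIFAR-100, yielding 5K validation images either way):

```python
import random
from collections import defaultdict

def balanced_split(labels, per_class, rng):
    """Return (train_idx, val_idx) given one label per training image,
    drawing exactly per_class validation images from every class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    val = set()
    for y, idxs in by_class.items():
        val.update(rng.sample(idxs, per_class))
    train = [i for i in range(len(labels)) if i not in val]
    return train, sorted(val)
```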
To discover activation functions, a number of search strategies are used. Regardless of the strategy, the training set always consists of 45K training images while the validation set contains 5K validation images; the test set is never used during the search. All ReLU activations in WRN-28-10 are replaced with a candidate activation function, and the architecture is trained for 50 epochs. The initial learning rate is set to 0.1 and decreases by a factor of 0.2 after epochs 25, 40, and 45. Training for only 50 epochs makes it possible to evaluate many activation functions without excessive computational cost. After the search is complete, the top three activation functions by validation accuracy are returned. For each of these functions, a WRN-28-10 is trained from scratch for 200 epochs. The initial learning rate is set to 0.1 and decreases by a factor of 0.2 after epochs 60, 120, and 160, mirroring the work by Zagoruyko and Komodakis (2016). After training is complete, the test set accuracy is measured. The median test accuracy of five runs is reported as the final result, as is commonly done in similar work in the literature (Ramachandran et al., 2017).
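The two step learning-rate schedules above can be written directly; `step_lr` is a hypothetical helper (defaults match the 50-epoch search runs, milestones (60, 120, 160) the full 200-epoch runs):

```python
def step_lr(epoch, initial=0.1, factor=0.2, milestones=(25, 40, 45)):
    """Step schedule: start at `initial` and multiply by `factor`
    after each milestone epoch has been reached."""
    lr = initial
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```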
(Table 1 groups the discovered activation functions by search strategy: evolution with loss-based fitness (S2), evolution with accuracy-based fitness (S2), random search (S2), exhaustive search (S1), and baseline activation functions.)
4.2. Search Strategies
Three different techniques are used to explore the space of activation functions: exhaustive search, random search, and evolution. Exhaustive search methodically evaluates every function in S1, while random search and evolution explore S2. It is noteworthy that evolution is able to discover high-performing activation functions in S2, a search space containing over 41 billion possible function strings.
4.2.1. Exhaustive Search
Ramachandran et al. search for activation functions using reinforcement learning and argue that simple activation functions consistently outperform more complicated ones (Ramachandran et al., 2017). Although evolution is capable of discovering high-performing complex activation functions in an enormous search space, exhaustive search can be effective in smaller search spaces. In particular, exhaustive search is used to evaluate every function in S1.
4.2.2. Random Search
An illustrative baseline comparison with evolution is random search. Instead of evolving a population of 50 activation functions for 10 generations, 500 random activation functions from S2 are grouped into 10 “generations” of 50 functions each.
As shown in Figure 3, evolution discovers better activation functions more quickly than random search in S2, a search space where exhaustive search is infeasible. During evolution, candidate activation functions are assigned a fitness value based on either accuracy or loss on the validation set. Both fitness metrics give high-performing activation functions a better chance of surviving to the next generation of evolution. Accuracy-based fitness favors exploration over exploitation: activation functions with poor validation accuracy still have a reasonable probability of surviving to the next generation. A hypothetical activation function that achieves 90% validation accuracy is only 2.2 times more likely to be chosen for the next generation than a function with only 10% validation accuracy, since e^0.9 / e^0.1 = e^0.8 ≈ 2.2.
Loss-based fitness sharply penalizes poor activation functions. It finds high-performing activation functions more quickly, and gives them greater influence over future generations. An activation function with 0.01 validation loss is 21,807 times more likely to be selected for the following generation than a function with a validation loss of 10, since e^(−0.01) / e^(−10) = e^9.99 ≈ 21,807.
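Both selection ratios quoted above follow directly from the softmax fitness, with accuracy or negative loss as the metric; a quick arithmetic check:

```python
import math

# Accuracy-based fitness: 90% vs. 10% validation accuracy.
acc_ratio = math.exp(0.9) / math.exp(0.1)      # = e^0.8, about 2.2

# Loss-based fitness: validation loss 0.01 vs. 10 (metric is -loss).
loss_ratio = math.exp(-0.01) / math.exp(-10)   # = e^9.99, about 21,807
```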
Both experiments begin with an initial population of 50 random activation functions and run for 10 generations of evolution. Each new generation of 50 activation functions comprises the top five functions from the previous generation, 10 random functions, and 35 functions created by applying crossover and mutation to existing functions in the population, as described in Section 3.
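Assembling each new generation from elites, random newcomers, and offspring can be sketched as below; `make_random` and `breed` are hypothetical callbacks standing in for the random-generation and crossover-plus-mutation operators of Section 3:

```python
import random

def next_generation(population, fitness, make_random, breed, rng,
                    n_elite=5, n_random=10, size=50):
    """Build the next generation: carry over the n_elite fittest
    functions, add n_random fresh random functions, and fill the rest
    with offspring produced by crossover and mutation."""
    ranked = [f for _, f in sorted(zip(fitness, population),
                                   key=lambda p: p[0], reverse=True)]
    elites = ranked[:n_elite]
    randoms = [make_random(rng) for _ in range(n_random)]
    offspring = [breed(population, fitness, rng)
                 for _ in range(size - n_elite - n_random)]
    return elites + randoms + offspring
```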
4.3. Activation Function Specialization
An important question is the extent to which activation functions are specialized for the architecture and dataset for which they were evolved, or perform well across different architectures and datasets. To address this question, activation functions discovered for WRN-28-10 on CIFAR-10 are transferred to WRN-40-4 on CIFAR-100 and compared with activation functions discovered specifically for WRN-40-4 and CIFAR-100.
5. Results

This section presents the experimental results, which demonstrate that evolved activation functions can outperform baseline functions like ReLU and Swish.
5.1. Improving Performance
Table 1 lists the activation functions that achieved the highest validation set accuracies after 50 epochs of training with WRN-28-10 on CIFAR-10. The top three activation functions for each search strategy are included. To estimate their true performance, a WRN-28-10 with each activation function was trained for 200 epochs five times on both CIFAR-10 and CIFAR-100 and evaluated on the test set. The median accuracy of these runs is reported in Table 1. Although no search was performed on CIFAR-100 with WRN-28-10, the functions that perform well on CIFAR-10 successfully generalize to CIFAR-100.
All three activation functions discovered through exhaustive search in S1 consistently train well and outperform ReLU and Swish. This finding shows that good functions exist even in the small search space S1, and that an effective search method matters for finding them. It is likely that even better functions exist in S2, but with a search space of 41 billion functions, a more sophisticated search method is necessary.
The activation functions discovered by random search have unintuitive shapes (Table 1). Although they fail to outperform the baseline activation functions, it is impressive that they still consistently reach a reasonable accuracy. One of the functions discovered by random search occasionally failed to train to completion due to an asymptote.
Evolution with accuracy-based fitness is less effective because it does not penalize poor activation functions severely enough. One of the functions failed to learn anything better than random guessing; it is likely too sensitive to random initialization, or was unable to learn with the slightly different learning rate schedule of a full 200-epoch training. Another function often did not train to completion due to an asymptote. The one function that consistently trained well still failed to outperform ReLU.
Evolution with loss-based fitness is able to find good functions in S2. One of the three activation functions discovered by evolution outperformed both ReLU and Swish on CIFAR-10, and two of the three outperformed ReLU and Swish on CIFAR-100. Figure 4 shows the top activation function after each generation of loss-based evolution. This approach discovers both novel and unintuitive functions that perform reasonably well, as well as simple, smooth, and monotonic functions that outperform ReLU and Swish. It is therefore the most promising search method in large spaces of activation functions.
The performance gains on CIFAR-10 are consistent but small, and the improvement on CIFAR-100 is larger. It is likely that more difficult datasets provide more room for improvement that a novel activation function can exploit.
To evaluate the significance of these results, WRN-28-10 was trained on CIFAR-100 for 200 epochs 50 times with ReLU, 50 times with the best function found by exhaustive search in S1, 25 times with Swish, and 15 times with the best function found by evolution in S2. Table 2 shows the results of independent t-tests comparing the mean accuracies achieved with each activation function. The results show that replacing a baseline activation function with an evolved one results in a statistically significant increase in accuracy.
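A two-sample t-statistic of the kind reported in Table 2 can be computed as below (Welch's unequal-variance form; the accuracy values are invented for illustration and are not the paper's data):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t-statistic for samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical test accuracies over repeated runs (for illustration only).
evolved = [81.2, 81.0, 81.4, 81.1, 81.3]
relu    = [80.6, 80.4, 80.8, 80.5, 80.7]
t = welch_t(evolved, relu)
```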
5.2. Specialized Activation Functions
Since different functions are seen to emerge in different experiments, an important question is: How general or specialized are they to a particular architecture and dataset? To answer this question, the top activation function discovered for WRN-28-10 on CIFAR-10 was trained with WRN-40-4 on CIFAR-100 for 200 epochs. This result was then compared with the performance achieved by an activation function discovered specifically for WRN-40-4 on CIFAR-100. Table 3 summarizes the result: the activation function evolved for the first task does transfer to the second task, but even higher performance is achieved when a specialized function is discovered specifically for the second task.
The specialized activation function is shown in Figure 6c. It is similar to the transferred function in that it tends towards 0 as x decreases and approaches 1 as x increases; it differs in that it has a non-monotonic bump for small, negative values of x. A 50-epoch training of WRN-40-4 on CIFAR-100 with a similarly shaped activation function lacking this bump achieved a validation accuracy of just 63.2%. The superior performance of the specialized function suggests that the negative bump was an important factor in its success, as the shapes of the two activation functions are otherwise similar. This result demonstrates how evolution can discover specializations that make a significant difference.
|Activation Functions|t-statistic; p-value|
|Best from S1 and ReLU|4.51;|
|Best from S1 and Swish|3.17;|
|Best from S2 and ReLU|9.73;|
|Best from S2 and Swish|5.91;|
6. Discussion and Future Work
Among the top activation functions discovered, many are smooth and monotonic. Hand-engineered activation functions frequently share these properties (Nwankpa et al., 2018). Two notable exceptions were found by random search and evolution with accuracy-based fitness. Although these functions do not outperform ReLU, the fact that WRN-28-10 was able to achieve such high accuracy with these arbitrary functions raises questions as to what makes an activation function effective. Ramachandran et al. (2017) asserted that simpler activation functions consistently outperformed more complicated ones. However, the high accuracy achieved with activation functions discovered by evolution in S2 demonstrates that complicated activation functions can compete with simpler ones. Such flexibility may be particularly useful in specialization to different architectures and datasets. It is plausible that there exist many unintuitive activation functions which can outperform the more general ones in specialized settings. Evolution is well-positioned to discover them.
Activation functions discovered by evolution perform best on the architectures and datasets for which they were evolved. Figure 5 demonstrates this principle: it shows the performance of several activation functions when trained with WRN-28-10 for 50 epochs on CIFAR-10 and when trained with WRN-40-4 for 50 epochs on CIFAR-100. Activation functions that perform well for one task often perform well on another task, but not always. Therefore, if possible, one should evolve them specifically for each architecture and dataset. However, as the results in Section 5 show, it is feasible to evolve using smaller architectures and datasets and then transfer to scaled-up architectures and more difficult datasets within the same domain.
In the future, it may be possible to push such generalization further by evaluating functions across multiple architectures and datasets. In this manner, evolution may be able to combine the requirements of multiple tasks and discover functions that perform well in general. However, the main power of activation function metalearning is to discover functions that are specialized to each architecture and dataset; it is in that setting that the most significant improvements are possible.
7. Conclusion

Multiple strategies for discovering novel high-performing activation functions were presented and evaluated, namely exhaustive search in a small search space (S1) and random search and evolution in a larger search space (S2). Evolution with loss-based fitness finds novel activation functions that achieve high accuracy and outperform standard functions such as ReLU and novel functions such as Swish, demonstrating the power of search in large spaces. The discovered activation functions successfully transfer from CIFAR-10 to CIFAR-100 and from WRN-28-10 to WRN-40-4. However, the main power of activation function metalearning is in finding specialized functions for each architecture and dataset, leading to significant improvements.
Acknowledgements. This research was supported in part by NSF under grant DBI-0939454. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.
- M. Abadi et al. (2016). TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- B. Baker, O. Gupta, N. Naik, and R. Raskar (2016). Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
- X. Chen, L. Xie, J. Wu, and Q. Tian (2019). Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303.
- T. Elsken, J. H. Metzen, and F. Hutter (2019). Neural architecture search: a survey. Journal of Machine Learning Research 20(55), pp. 1–21.
- M. Feurer, J. T. Springenberg, and F. Hutter (2015). Initializing Bayesian hyperparameter optimization via meta-learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- C. Finn, P. Abbeel, and S. Levine (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.
- C. Finn and S. Levine (2017). Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622.
- S. Gonzalez and R. Miikkulainen (2019). Improved training speed, accuracy, and data utilization through loss function optimization. arXiv preprint arXiv:1905.11528.
- A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
- V. Nair and G. E. Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- R. Negrinho and G. Gordon (2017). DeepArchitect: automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792.
- C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall (2018). Activation functions: comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378.
- P. Ramachandran, B. Zoph, and Q. V. Le (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.
- E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019). Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
- E. Real et al. (2017). Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, pp. 2902–2911.
- M. Wistuba, A. Rawat, and T. Pedapati (2019). A survey on neural architecture search. arXiv preprint arXiv:1905.01392.
- L. Xie and A. Yuille (2017). Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1379–1388.
- S. Zagoruyko and N. Komodakis (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
- B. Zoph and Q. V. Le (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.