Searching for Activation Functions

10/16/2017 ∙ by Prajit Ramachandran, et al. ∙ Google

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x) = x ·sigmoid(β x), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.


1 Introduction

At the heart of every deep network lies a linear transformation followed by an activation function f(·). The activation function plays a major role in the success of training deep neural networks. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU) (Hahnloser et al., 2000; Jarrett et al., 2009; Nair & Hinton, 2010), defined as f(x) = max(x, 0). The use of ReLUs was a breakthrough that enabled the fully supervised training of state-of-the-art deep networks (Krizhevsky et al., 2012). Deep networks with ReLUs are more easily optimized than networks with sigmoid or tanh units, because gradients are able to flow when the input to the ReLU function is positive. Thanks to its simplicity and effectiveness, ReLU has become the default activation function used across the deep learning community.

While numerous activation functions have been proposed to replace ReLU (Maas et al., 2013; He et al., 2015; Clevert et al., 2015; Klambauer et al., 2017), none have managed to gain the widespread adoption that ReLU enjoys. Many practitioners have favored the simplicity and reliability of ReLU because the performance improvements of the other activation functions tend to be inconsistent across different models and datasets.

The activation functions proposed to replace ReLU were hand-designed to fit properties deemed to be important. However, the use of search techniques to automate the discovery of traditionally human-designed components has recently been shown to be extremely effective (Zoph & Le, 2016; Bello et al., 2017; Zoph et al., 2017). For example, Zoph et al. (2017) used reinforcement learning-based search to find a replicable convolutional cell that outperforms human-designed architectures on ImageNet.

In this work, we use automated search techniques to discover novel activation functions. We focus on finding new scalar activation functions, which take a scalar as input and output a scalar, because scalar activation functions can be used to replace the ReLU function without changing the network architecture. Using a combination of exhaustive and reinforcement learning-based search, we find a number of novel activation functions that show promising performance. To further validate the effectiveness of using searches to discover scalar activation functions, we empirically evaluate the best discovered activation function. The best discovered activation function, which we call Swish, is f(x) = x · sigmoid(βx), where β is a constant or trainable parameter. Our extensive experiments show that Swish consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation. On ImageNet, replacing ReLUs with Swish units improves top-1 classification accuracy by 0.9% on Mobile NASNet-A (Zoph et al., 2017) and 0.6% on Inception-ResNet-v2 (Szegedy et al., 2017). These accuracy gains are significant given that one year of architectural tuning and enlarging yielded 1.3% accuracy improvement going from Inception V3 (Szegedy et al., 2016) to Inception-ResNet-v2 (Szegedy et al., 2017).

2 Methods

In order to utilize search techniques, a search space that contains promising candidate activation functions must be designed. An important challenge in designing search spaces is balancing the size and expressivity of the search space. An overly constrained search space will not contain novel activation functions, whereas a search space that is too large will be difficult to effectively search. To balance the two criteria, we design a simple search space inspired by the optimizer search space of Bello et al. (2017) that composes unary and binary functions to construct the activation function.

Figure 1: An example activation function structure. The activation function is composed of multiple repetitions of the “core unit”, which consists of two inputs, two unary functions, and one binary function. Unary functions take in a single scalar input and return a single scalar output, such as u(x) = x² or u(x) = σ(x). Binary functions take in two scalar inputs and return a single scalar output, such as b(x₁, x₂) = x₁ · x₂ or b(x₁, x₂) = exp(−(x₁ − x₂)²).

As shown in Figure 1, the activation function is constructed by repeatedly composing the “core unit”, which is defined as b(u₁(x₁), u₂(x₂)). The core unit takes in two scalar inputs, passes each input independently through a unary function, and combines the two unary outputs with a binary function that outputs a scalar. Since our aim is to find scalar activation functions which transform a single scalar input into a single scalar output, the inputs of the unary functions are restricted to the layer preactivation x and the binary function outputs.
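The core unit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `UNARY` and `BINARY` and the helper `core_unit` are hypothetical names, and the handful of functions listed is a small sample, not the search space used in the paper. Both unary inputs are tied to the preactivation x, which is the single-core-unit case.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical, deliberately tiny sample of unary and binary choices.
UNARY = {
    "identity": lambda x: x,
    "sigmoid": sigmoid,
    "zero": lambda x: 0.0,
}
BINARY = {
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}


def core_unit(binary, unary1, unary2):
    """Build f(x) = b(u1(x), u2(x)) from the names of its three components."""
    b, u1, u2 = BINARY[binary], UNARY[unary1], UNARY[unary2]

    def activation(x):
        return b(u1(x), u2(x))

    return activation


# Two familiar functions expressed as a single core unit.
swish_1 = core_unit("mul", "identity", "sigmoid")  # x * sigmoid(x)
relu = core_unit("max", "identity", "zero")        # max(x, 0)
```

Larger candidates would feed the output of one core unit into the unary input of another, which is how the search space grows with the number of repetitions.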

Given the search space, the goal of the search algorithm is to find effective choices for the unary and binary functions. The choice of the search algorithm depends on the size of the search space. If the search space is small, such as when using a single core unit, it is possible to exhaustively enumerate the entire search space. If the core unit is repeated multiple times, the search space becomes extremely large, making exhaustive search infeasible.

Figure 2: The RNN controller used to search over large spaces. At each step, it predicts a single component of the activation function. The prediction is fed back as input to the next timestep in an autoregressive fashion. The controller keeps predicting until every component of the activation function has been chosen. The controller is trained with reinforcement learning.

For large search spaces, we use an RNN controller (Zoph & Le, 2016), which is visualized in Figure 2. At each timestep, the controller predicts a single component of the activation function. The prediction is fed back to the controller in the next timestep, and this process is repeated until every component of the activation function is predicted. The predicted string is then used to construct the activation function.

Once a candidate activation function has been generated by the search algorithm, a “child network” with the candidate activation function is trained on some task, such as image classification on CIFAR-10. After training, the validation accuracy of the child network is recorded and used to update the search algorithm. In the case of exhaustive search, a list of the top performing activation functions ordered by validation accuracy is maintained. In the case of the RNN controller, the controller is trained with reinforcement learning to maximize the validation accuracy, where the validation accuracy serves as the reward. This training pushes the controller to generate activation functions that have high validation accuracies.

Since evaluating a single activation function requires training a child network, the search is computationally expensive. To decrease the wall clock time required to conduct the search, a distributed training scheme is used to parallelize the training of each child network. In this scheme, the search algorithm proposes a batch of candidate activation functions which are added to a queue. Worker machines pull activation functions off the queue, train a child network, and report back the final validation accuracy of the corresponding activation function. The validation accuracies are aggregated and used to update the search algorithm.
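The exhaustive variant of this loop can be sketched as follows. Everything here is an assumption made for illustration: the search space is a tiny hypothetical subset, a thread pool stands in for the queue of worker machines, and `toy_score` is a placeholder for the validation accuracy of a trained child network (it merely measures closeness to ReLU on a grid, so it says nothing about real performance).

```python
import itertools
import math
from concurrent.futures import ThreadPoolExecutor

# Tiny hypothetical search space (not the lists used in the paper).
UNARY = {
    "identity": lambda x: x,
    "zero": lambda x: 0.0,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}
BINARY = {
    "mul": lambda a, b: a * b,
    "max": lambda a, b: max(a, b),
}
GRID = [i / 4.0 for i in range(-12, 13)]  # sample inputs in [-3, 3]


def build(candidate):
    """Turn (binary, unary1, unary2) names into a scalar function of x."""
    b, u1, u2 = candidate
    return lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))


def toy_score(candidate):
    """Placeholder reward; a real worker would train a child network here."""
    f = build(candidate)
    return -sum(abs(f(x) - max(x, 0.0)) for x in GRID) / len(GRID)


def exhaustive_search(score_fn, top_k=3, workers=4):
    """Score every single-core-unit candidate and keep the top_k by score."""
    candidates = list(itertools.product(BINARY, UNARY, UNARY))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker pulls a candidate, scores it, and reports back.
        scores = list(pool.map(score_fn, candidates))
    ranked = sorted(zip(scores, candidates), reverse=True)
    return ranked[:top_k]


best = exhaustive_search(toy_score)
```

Swapping `toy_score` for a function that trains and validates a network keeps the structure unchanged; for the RNN controller, the reported scores would instead be used as rewards.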

3 Search Findings

We conduct all our searches with the ResNet-20 (He et al., 2016a) as the child network architecture, and train on CIFAR-10 (Krizhevsky & Hinton, 2009) for 10K steps. This constrained environment could potentially skew the results because the top performing activation functions might only perform well for small networks. However, we show in the experiments section that many of the discovered functions generalize to larger models. Exhaustive search is used for small search spaces, while an RNN controller is used for larger search spaces. The RNN controller is trained with Proximal Policy Optimization (Schulman et al., 2017), using the exponential moving average of rewards as a baseline to reduce variance. The full list of unary and binary functions considered is as follows:

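  • Unary functions: x, −x, |x|, x², x³, √x, βx, x + β, log(|x| + ε), exp(x), sin(x), cos(x), sinh(x), cosh(x), tanh(x), sinh⁻¹(x), tan⁻¹(x), sinc(x), max(x, 0), min(x, 0), σ(x), log(1 + exp(x)), exp(−x²), erf(x), β

  • Binary functions: x₁ + x₂, x₁ · x₂, x₁ − x₂, x₁ / (x₂ + ε), max(x₁, x₂), min(x₁, x₂), σ(x₁) · x₂, exp(−β(x₁ − x₂)²), exp(−β|x₁ − x₂|), βx₁ + (1 − β)x₂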

where β indicates a per-channel trainable parameter and σ(x) = (1 + exp(−x))⁻¹ is the sigmoid function. Different search spaces are created by varying the number of core units used to construct the activation function and varying the unary and binary functions available to the search algorithm.
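The variance-reduction step mentioned above, subtracting an exponential moving average of past rewards from each new reward, can be sketched on its own. This is a minimal sketch, not the training procedure of the paper: the function name, the decay of 0.9, and initializing the baseline with the first reward are all assumptions, and the policy update itself is omitted.

```python
def ema_advantages(rewards, decay=0.9):
    """Return reward minus an exponential-moving-average baseline, per step.

    The baseline is updated only after it has been used, so each advantage
    compares a reward against rewards seen strictly before it.
    """
    baseline = None
    advantages = []
    for reward in rewards:
        if baseline is None:
            baseline = reward  # assumed initialization: first observed reward
        advantages.append(reward - baseline)
        baseline = decay * baseline + (1.0 - decay) * reward
    return advantages


# Validation accuracies of successive candidates act as rewards.
advantages = ema_advantages([0.50, 0.70, 0.60], decay=0.5)
```

A candidate that beats the running average gets a positive advantage and is made more likely by the controller update; one that falls below it gets a negative advantage.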

Figure 3: The top novel activation functions found by the searches. Separated into two diagrams for visual clarity. Best viewed in color.

Figure 3 plots the top performing novel activation functions found by the searches. We highlight several noteworthy trends uncovered by the searches:

  • Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization. The best performing activation functions can be represented by 1 or 2 core units.

  • A common structure shared by the top activation functions is the use of the raw preactivation x as input to the final binary function: b(x, g(x)). The ReLU function also follows this structure, where b(x₁, x₂) = max(x₁, x₂) and g(x) = 0.

  • The searches discovered activation functions that utilize periodic functions, such as sin and cos. The most common use of periodic functions is through addition or subtraction with the raw preactivation x (or a linearly scaled x). The use of periodic functions in activation functions has only been briefly explored in prior work (Parascandolo et al., 2016), so these discovered functions suggest a fruitful route for further research.

  • Functions that use division tend to perform poorly because the output explodes when the denominator is near 0. Division is successful only when functions in the denominator are either bounded away from 0, such as cosh(x), or approach 0 only when the numerator also approaches 0, producing an output of 1.

Since the activation functions were found using a relatively small child network, their performance may not generalize when applied to bigger models. To test the robustness of the top performing novel activation functions to different architectures, we run additional experiments using the preactivation ResNet-164 (RN) (He et al., 2016b), Wide ResNet 28-10 (WRN) (Zagoruyko & Komodakis, 2016), and DenseNet 100-12 (DN) (Huang et al., 2017) models. We implement the 3 models in TensorFlow and replace the ReLU function with each of the top novel activation functions discovered by the searches. We use the same hyperparameters described in each work, such as optimizing using SGD with momentum, and follow previous works by reporting the median of 5 different runs.

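Function            RN    WRN   DN
ReLU [max(x, 0)]    93.8  95.3  94.8
x · σ(βx)           94.5  95.5  94.9
max(x, σ(x))        94.3  95.3  94.8
cos(x) − x          94.1  94.8  94.6
min(x, sin(x))      94.0  95.1  94.4
(tan⁻¹(x))² − x     93.9  94.7  94.9
max(x, tanh(x))     93.9  94.2  94.5
sinc(x) + x         91.5  92.1  92.0
x · (sinh⁻¹(x))²    85.1  92.1  91.1
Table 1: CIFAR-10 accuracy.

Function            RN    WRN   DN
ReLU [max(x, 0)]    74.2  77.8  83.7
x · σ(βx)           75.1  78.0  83.9
max(x, σ(x))        74.8  78.6  84.2
cos(x) − x          75.2  76.6  81.8
min(x, sin(x))      73.4  77.1  74.3
(tan⁻¹(x))² − x     75.2  76.7  83.1
max(x, tanh(x))     74.8  76.0  78.6
sinc(x) + x         66.1  68.3  67.9
x · (sinh⁻¹(x))²    52.8  70.6  68.1
Table 2: CIFAR-100 accuracy.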

The results are shown in Tables 1 and 2. Despite the changes in model architecture, six of the eight activation functions successfully generalize. Of these six activation functions, all match or outperform ReLU on ResNet-164. Furthermore, two of the discovered activation functions, x · σ(βx) and max(x, σ(x)), consistently match or outperform ReLU on all three models.

While these results are promising, it is still unclear whether the discovered activation functions can successfully replace ReLU on challenging real world datasets. In order to validate the effectiveness of the searches, in the rest of this work we focus on empirically evaluating the activation function f(x) = x · σ(βx), which we call Swish. We choose to extensively evaluate Swish instead of max(x, σ(x)) because early experimentation showed better generalization for Swish. In the following sections, we analyze the properties of Swish and then conduct a thorough empirical evaluation comparing Swish, ReLU, and other candidate baseline activation functions on a number of large models across a variety of tasks.

4 Swish

To recap, Swish is defined as f(x) = x · σ(βx), where σ(z) = (1 + exp(−z))⁻¹ is the sigmoid function and β is either a constant or a trainable parameter. Figure 4 plots the graph of Swish for different values of β. If β = 1, Swish is equivalent to the Sigmoid-weighted Linear Unit (SiL) of Elfwing et al. (2017) that was proposed for reinforcement learning. If β = 0, Swish becomes the scaled linear function f(x) = x/2. As β → ∞, the sigmoid component approaches a 0-1 function, so Swish becomes like the ReLU function. This suggests that Swish can be loosely viewed as a smooth function which nonlinearly interpolates between the linear function and the ReLU function. The degree of interpolation can be controlled by the model if β is set as a trainable parameter.
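The two limiting cases can be checked numerically. The following is a minimal NumPy sketch rather than a reference implementation: the function names are illustrative, and the sigmoid is written through tanh only to avoid overflow for large negative arguments.

```python
import numpy as np


def sigmoid(z):
    # Identity sigmoid(z) = 0.5 * (1 + tanh(z / 2)); stable for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    x = np.asarray(x, dtype=np.float64)
    return x * sigmoid(beta * x)


x = np.arange(-3.0, 3.5, 0.5)
linear_limit = swish(x, beta=0.0)   # equals x / 2
relu_like = swish(x, beta=50.0)     # close to max(x, 0) on this grid
```

With beta = 0 the sigmoid is exactly 0.5 everywhere, so the output is x/2; with a large beta the sigmoid saturates to 0 or 1 away from the origin and the output approaches max(x, 0).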

Figure 4: The Swish activation function.
Figure 5: First derivatives of Swish.

Like ReLU, Swish is unbounded above and bounded below. Unlike ReLU, Swish is smooth and non-monotonic. In fact, the non-monotonicity of Swish distinguishes it from most common activation functions. The derivative of Swish is

f′(x) = βf(x) + σ(βx)(1 − βf(x))

The first derivative of Swish is shown in Figure 5 for different values of β. The scale of β controls how fast the first derivative asymptotes to 0 and 1. When β = 1, the derivative has magnitude less than 1 for inputs that are less than around 1.25. Thus, the success of Swish with β = 1 implies that the gradient preserving property of ReLU (i.e., having a derivative of 1 when x > 0) may no longer be a distinct advantage in modern architectures.
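For readers who want the intermediate steps, the derivative follows from the product rule together with the standard identity σ′(z) = σ(z)(1 − σ(z)). The derivation below uses only the definition f(x) = x · σ(βx); the grouping of terms in the last line is one convenient way to write the result.

```latex
\begin{aligned}
f(x)  &= x \,\sigma(\beta x) \\
f'(x) &= \sigma(\beta x) + \beta x \,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr) \\
      &= \beta x \,\sigma(\beta x) + \sigma(\beta x)\bigl(1 - \beta x \,\sigma(\beta x)\bigr) \\
      &= \beta f(x) + \sigma(\beta x)\bigl(1 - \beta f(x)\bigr)
\end{aligned}
```

As x grows large and positive, σ(βx) tends to 1 and the derivative tends to 1; as x grows large and negative, both f(x) and σ(βx) tend to 0 and so does the derivative (for β > 0).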

Figure 6: Preactivation distribution after training of Swish with β = 1 on ResNet-32.
Figure 7: Distribution of trained β values of Swish on Mobile NASNet-A.

The most striking difference between Swish and ReLU is the non-monotonic “bump” of Swish when x < 0. As shown in Figure 6, a large percentage of preactivations fall inside the domain of the bump (−5 ≤ x ≤ 0), which indicates that the non-monotonic bump is an important aspect of Swish. The shape of the bump can be controlled by changing the β parameter. While fixing β = 1 is effective in practice, the experiments section shows that training β can further improve performance on some models. Figure 7 plots the distribution of trained β values from a Mobile NASNet-A model (Zoph et al., 2017). The trained β values are spread out between 0 and 1.5 and have a peak at β ≈ 1, suggesting that the model takes advantage of the additional flexibility of trainable β parameters.
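How β shapes the bump can be made concrete by locating its lowest point numerically. The sketch below is an illustration, not part of the original experiments: `bump_minimum` is a hypothetical helper that does a dense grid search on the negative axis. Because x · σ(βx) equals (1/β) times the β = 1 function evaluated at βx, both the location and the depth of the minimum should scale as 1/β.

```python
import numpy as np


def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # same as x * sigmoid(beta * x)


def bump_minimum(beta=1.0, lo=-10.0, hi=0.0, n=100001):
    """Grid search for the minimum of Swish on [lo, hi]; returns (x, f(x))."""
    xs = np.linspace(lo, hi, n)
    ys = swish(xs, beta)
    i = int(np.argmin(ys))
    return float(xs[i]), float(ys[i])


x1, y1 = bump_minimum(beta=1.0)  # roughly x = -1.28, f(x) = -0.28
x2, y2 = bump_minimum(beta=2.0)  # half as far from 0 and half as deep
```

A larger β therefore gives a narrower, shallower bump closer to the origin, consistent with Swish approaching ReLU as β grows.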

Practically, Swish can be implemented with a single line code change in most deep learning libraries, such as TensorFlow (Abadi et al., 2016) (e.g., x * tf.sigmoid(beta * x) or tf.nn.swish(x) if using a version of TensorFlow released after the submission of this work). As a cautionary note, if BatchNorm (Ioffe & Szegedy, 2015) is used, the scale parameter should be set. Some high level libraries turn off the scale parameter by default due to the ReLU function being piecewise linear, but this setting is incorrect for Swish. For training Swish networks, we found that slightly lowering the learning rate used to train ReLU networks works well.

5 Experiments with Swish

We benchmark Swish against ReLU and a number of recently proposed activation functions on challenging datasets, and find that Swish matches or exceeds the baselines on nearly all tasks. The following sections describe our experimental settings and results in greater detail. As a summary, Table 3 shows Swish in comparison to each baseline activation function we considered (which are defined in the next section). The results in Table 3 are aggregated by comparing the performance of Swish to the performance of different activation functions applied to a variety of models, such as Inception ResNet-v2 (Szegedy et al., 2017) and Transformer (Vaswani et al., 2017), across multiple datasets, such as CIFAR, ImageNet, and English→German translation.[1] The improvement of Swish over other activation functions is statistically significant under a one-sided paired sign test.

[1] To avoid skewing the comparison, each model type is compared just once. A model with multiple results is represented by the median of its results. Specifically, the models with aggregated results are (a) ResNet-164, Wide ResNet 28-10, and DenseNet 100-12 across the CIFAR-10 and CIFAR-100 results, (b) Mobile NASNet-A and Inception-ResNet-v2 across the 3 runs, and (c) WMT Transformer model across the 4 newstest results.

Baselines          ReLU  LReLU  PReLU  Softplus  ELU  SELU  GELU
Swish > Baseline   9     7      6      6         8    8     8
Swish = Baseline   0     1      3      2         0    1     1
Swish < Baseline   0     1      0      1         1    0     0
Table 3: The number of models on which Swish outperforms, is equivalent to, or underperforms each baseline activation function we compared against in our experiments.
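For reference, a one-sided sign test on win/loss counts like those in Table 3 can be computed directly from the binomial distribution. This is a sketch of the general test, not a reproduction of the exact analysis: in particular, dropping ties before testing is a common convention that is assumed here, and the function name is illustrative.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).

    Ties are assumed to have been discarded before calling this function.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


p_relu = sign_test_p_value(wins=9, losses=0)    # Swish vs. ReLU counts
p_lrelu = sign_test_p_value(wins=7, losses=1)   # Swish vs. LReLU, tie dropped
```

With 9 wins and no losses the p-value is 0.5⁹, or about 0.002; with 7 wins and 1 loss it is 9/256, or about 0.035.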

5.1 Experimental Set Up

We compare Swish against several additional baseline activation functions on a variety of models and datasets. Since many activation functions have been proposed, we choose the most common activation functions to compare against, and follow the guidelines laid out in each work:

  • Leaky ReLU (LReLU) (Maas et al., 2013): f(x) = x if x ≥ 0, and f(x) = αx if x < 0,

    where α = 0.01. LReLU enables a small amount of information to flow when x < 0.

  • Parametric ReLU (PReLU) (He et al., 2015): The same form as LReLU but α is a learnable parameter. Each channel has a shared α which is initialized to 0.25.

  • Softplus (Nair & Hinton, 2010): f(x) = log(1 + exp(x)). Softplus is a smooth function with properties similar to Swish, but is strictly positive and monotonic. It can be viewed as a smooth version of ReLU.

  • Exponential Linear Unit (ELU) (Clevert et al., 2015): f(x) = x if x ≥ 0, and f(x) = α(exp(x) − 1) if x < 0,

    where α = 1.0.

  • Scaled Exponential Linear Unit (SELU) (Klambauer et al., 2017): f(x) = λx if x ≥ 0, and f(x) = λα(exp(x) − 1) if x < 0,

    with α ≈ 1.6733 and λ ≈ 1.0507.

  • Gaussian Error Linear Unit (GELU) (Hendrycks & Gimpel, 2016): f(x) = x · Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution. GELU is a nonmonotonic function that has a shape similar to Swish with β = 1.4.
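The baselines listed above are all short scalar functions, and a NumPy sketch makes their shapes easy to compare side by side. The code follows the standard definitions of these functions; the function names and default constants are written here for illustration and are not taken from any particular library. PReLU is not given its own function because it has the same form as Leaky ReLU with α learned per channel.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)  # elementwise erf without requiring SciPy


def lrelu(x, alpha=0.01):
    """Leaky ReLU; PReLU uses the same form with a learned alpha."""
    return np.where(x >= 0, x, alpha * x)


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), computed stably


def elu(x, alpha=1.0):
    # np.minimum keeps expm1 from overflowing on the unused positive branch.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def selu(x, alpha=1.6733, lam=1.0507):
    return lam * elu(x, alpha)


def gelu(x):
    return x * 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))  # x * Phi(x)


def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # x * sigmoid(beta * x)


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
values = {f.__name__: f(x) for f in (lrelu, softplus, elu, selu, gelu, swish)}
```

Printing `values` shows the qualitative differences discussed above: Softplus is positive everywhere, while GELU and Swish dip slightly below zero for small negative inputs.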

We evaluate both Swish with a trainable β and Swish with a fixed β = 1 (which for simplicity we call Swish-1, but it is equivalent to the Sigmoid-weighted Linear Unit of Elfwing et al. (2017)). Note that our results may not be directly comparable to the results in the corresponding works due to differences in our training setup.

5.2 CIFAR

We first compare Swish to all the baseline activation functions on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009). We follow the same set up used when comparing the activation functions discovered by the search techniques, and compare the median of 5 runs with the preactivation ResNet-164 (He et al., 2016b), Wide ResNet 28-10 (WRN) (Zagoruyko & Komodakis, 2016), and DenseNet 100-12 (Huang et al., 2017) models.

Model ResNet WRN DenseNet
LReLU 94.2 95.6 94.7
PReLU 94.1 95.1 94.5
Softplus 94.6 94.9 94.7
ELU 94.1 94.1 94.4
SELU 93.0 93.2 93.9
GELU 94.3 95.5 94.8
ReLU 93.8 95.3 94.8
Swish-1 94.7 95.5 94.8
Swish 94.5 95.5 94.8
Table 4: CIFAR-10 accuracy.
Model ResNet WRN DenseNet
LReLU 74.2 78.0 83.3
PReLU 74.5 77.3 81.5
Softplus 76.0 78.4 83.7
ELU 75.0 76.0 80.6
SELU 73.2 74.3 80.8
GELU 74.7 78.0 83.8
ReLU 74.2 77.8 83.7
Swish-1 75.1 78.5 83.8
Swish 75.1 78.0 83.9
Table 5: CIFAR-100 accuracy.

The results in Tables 4 and 5 show that Swish and Swish-1 consistently match or outperform ReLU on every model for both CIFAR-10 and CIFAR-100. Swish also matches or exceeds the best baseline performance on almost every model. Importantly, the “best baseline” changes between different models, which demonstrates the stability of Swish in matching these varying baselines. Softplus, which is smooth and approaches zero on one side, similar to Swish, also has strong performance.

5.3 ImageNet

Next, we benchmark Swish against the baseline activation functions on the ImageNet 2012 classification dataset (Russakovsky et al., 2015). ImageNet is widely considered one of the most important image classification datasets, consisting of 1,000 classes and 1.28 million training images. We evaluate on the validation dataset, which has 50,000 images.

We compare all the activation functions on a variety of architectures designed for ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3 (Szegedy et al., 2017), MobileNet (Howard et al., 2017), and Mobile NASNet-A (Zoph et al., 2017). All these architectures were designed with ReLUs. We again replace the ReLU activation function with different activation functions and train for a fixed number of steps, determined by the convergence of the ReLU baseline. For each activation function, we try 3 different learning rates with RMSProp (Tieleman & Hinton, 2012) and pick the best.[2] All networks are initialized with He initialization (He et al., 2015).[3] To verify that the performance differences are reproducible, we run the Inception-ResNet-v2 and Mobile NASNet-A experiments 3 times with the best learning rate from the first experiment. We plot the learning curves for Mobile NASNet-A in Figure 8.

[2] For some of the models with ELU, SELU, and PReLU, we train with an additional 3 learning rates (so a total of 6 learning rates) because the original 3 learning rates did not converge.

[3] For SELU, we tried both He initialization and the initialization recommended in Klambauer et al. (2017), and choose the best result for each model separately.
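He initialization draws weights from a zero-mean normal distribution whose variance is 2 divided by the number of input units. The sketch below shows that rule for a dense weight matrix; the function name and the (fan_in, fan_out) shape convention are assumptions for illustration, and convolutional layers would compute fan_in from the kernel size and input channels instead.

```python
import numpy as np


def he_normal(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix with variance 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
weights = he_normal(512, 256, rng)
target_std = np.sqrt(2.0 / 512)  # the empirical std should be close to this
```

The factor of 2 in the variance was derived with ReLU in mind, which is one reason the experiments here can be read as using settings tuned for ReLU rather than for Swish.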

Model      Top-1 Acc. (%)      Top-5 Acc. (%)
LReLU      73.8  73.9  74.2    91.6  91.9  91.9
PReLU      74.6  74.7  74.7    92.4  92.3  92.3
Softplus   74.0  74.2  74.2    91.6  91.8  91.9
ELU        74.1  74.2  74.2    91.8  91.8  91.8
SELU       73.6  73.7  73.7    91.6  91.7  91.7
GELU       74.6  -     -       92.0  -     -
ReLU       73.5  73.6  73.8    91.4  91.5  91.6
Swish-1    74.6  74.7  74.7    92.1  92.0  92.0
Swish      74.9  74.9  75.2    92.3  92.4  92.4
Table 6: Mobile NASNet-A on ImageNet, with 3 different runs ordered by top-1 accuracy. The additional 2 GELU experiments are still training at the time of submission.

Figure 8: Training curves of Mobile NASNet-A on ImageNet. Best viewed in color.

Model Top-1 Acc. (%) Top-5 Acc. (%)
LReLU 79.5 79.5 79.6 94.7 94.7 94.7
PReLU 79.7 79.8 80.1 94.8 94.9 94.9
Softplus 80.1 80.2 80.4 95.2 95.2 95.3
ELU 75.8 79.9 80.0 92.6 95.0 95.1
SELU 79.0 79.2 79.2 94.5 94.4 94.5
GELU 79.6 79.6 79.9 94.8 94.8 94.9
ReLU 79.5 79.6 79.8 94.8 94.8 94.8
Swish-1 80.2 80.3 80.4 95.1 95.2 95.2
Swish 80.2 80.2 80.3 95.0 95.2 95.0
Table 7: Inception-ResNet-v2 on ImageNet with 3 different runs. Note that the ELU sometimes has instabilities at the start of training, which accounts for the first result.
Model Top-1 Acc. (%) Top-5 Acc. (%)
LReLU 72.5 91.0
PReLU 74.2 91.9
Softplus 73.6 91.6
ELU 73.9 91.3
SELU 73.2 91.0
GELU 73.5 91.4
ReLU 72.0 90.8
Swish-1 74.2 91.6
Swish 74.2 91.7
Table 8: MobileNet on ImageNet.
Model Top-1 Acc. (%) Top-5 Acc. (%)
LReLU 78.4 94.1
PReLU 77.7 93.5
Softplus 78.7 94.4
ELU 77.9 93.7
SELU 76.7 92.8
GELU 77.7 93.9
ReLU 78.4 94.2
Swish-1 78.7 94.2
Swish 78.7 94.0
Table 9: Inception-v3 on ImageNet.
Model Top-1 Acc. (%) Top-5 Acc. (%)
LReLU 79.3 94.7
PReLU 79.3 94.4
Softplus 79.6 94.8
ELU 79.5 94.5
SELU 78.3 94.5
GELU 79.0 94.6
ReLU 79.2 94.6
Swish-1 79.3 94.7
Swish 79.3 94.6
Table 10: Inception-v4 on ImageNet.

The results in Tables 6-10 show strong performance for Swish. On Inception-ResNet-v2, Swish outperforms ReLU by a nontrivial 0.6% in median top-1 accuracy (80.2% versus 79.6%, Table 7). Swish performs especially well on mobile sized models, with a 1.3% boost in median top-1 accuracy on Mobile NASNet-A (74.9% versus 73.6%, Table 6) and a 2.2% boost on MobileNet (74.2% versus 72.0%, Table 8) over ReLU. Swish also matches or exceeds the best performing baseline on most models, where again, the best performing baseline differs depending on the model. Softplus achieves accuracies comparable to Swish on the larger models, but performs worse on both mobile sized models. For Inception-v4, the gains from switching between activation functions are more limited, and Swish slightly underperforms Softplus and ELU. In general, the results suggest that switching to Swish improves performance with little additional tuning.
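The ImageNet tables report top-1 and top-5 accuracy. As a reminder of what those metrics measure, here is a minimal NumPy sketch; the function name and the toy logits are made up for illustration and are unrelated to the reported results.

```python
import numpy as np


def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(-logits, axis=1)[:, :k]       # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)    # is the true label among them?
    return float(hits.mean())


# Toy example: 3 examples, 3 classes.
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = top_k_accuracy(logits, labels, k=1)
top2 = top_k_accuracy(logits, labels, k=2)
```

In this toy example only the first row is correct at k = 1, while the second row becomes correct at k = 2, so the two accuracies are 1/3 and 2/3.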

5.4 Machine Translation

We additionally benchmark Swish on the domain of machine translation. We train machine translation models on the standard WMT 2014 English→German dataset, which has 4.5 million training sentences, and evaluate on 4 different newstest sets using the standard BLEU metric. We use the attention based Transformer (Vaswani et al., 2017) model, which utilizes ReLUs in a 2-layered feedforward network between each attention layer. We train a 12 layer “Base Transformer” model with 2 different learning rates[4] for 300K steps, but otherwise use the same hyperparameters as in the original work, such as using Adam (Kingma & Ba, 2015) to optimize.

[4] We tried an additional learning rate for Softplus, but found it did not work well across all learning rates.

Model newstest2013 newstest2014 newstest2015 newstest2016
LReLU 26.2 27.9 29.8 33.4
PReLU 26.3 27.7 29.7 33.1
Softplus 23.4 23.6 25.8 29.2
ELU 24.6 25.1 27.7 32.5
SELU 23.7 23.5 25.9 30.5
GELU 25.9 27.3 29.5 33.1
ReLU 26.1 27.8 29.8 33.3
Swish-1 26.2 28.0 30.1 34.0
Swish 26.5 27.6 30.0 33.1
Table 11: BLEU score of a 12 layer Transformer on WMT English→German.

Table 11 shows that Swish outperforms or matches the other baselines on machine translation. Swish-1 does especially well on newstest2016, exceeding the next best performing baseline by 0.6 BLEU points (34.0 versus 33.4 for LReLU). The worst performing baseline function is Softplus, demonstrating inconsistency in performance across differing domains. In contrast, Swish consistently performs well across multiple domains.

6 Related Work

Swish was found using a variety of automated search techniques. Search techniques have been utilized in other works to discover convolutional and recurrent architectures (Zoph & Le, 2016; Zoph et al., 2017; Real et al., 2017; Cai et al., 2017; Zhong et al., 2017) and optimizers (Bello et al., 2017). The use of search techniques to discover traditionally hand-designed components is an instance of the recently revived subfield of meta-learning (Schmidhuber, 1987; Naik & Mammone, 1992; Thrun & Pratt, 2012). Meta-learning has been used to find initializations for one-shot learning (Finn et al., 2017; Ravi & Larochelle, 2016), adaptable reinforcement learning (Wang et al., 2016; Duan et al., 2016), and generating model parameters (Ha et al., 2016). Meta-learning is powerful because the flexibility derived from the minimal assumptions encoded leads to empirically effective solutions. We take advantage of this property in order to find scalar activation functions, such as Swish, that have strong empirical performance.

While this work focuses on scalar activation functions, which transform one scalar to another scalar, there are many types of activation functions used in deep networks. Many-to-one functions, like max pooling, maxout (Goodfellow et al., 2013), and gating (Hochreiter & Schmidhuber, 1997; Srivastava et al., 2015; van den Oord et al., 2016; Dauphin et al., 2016; Wu et al., 2016; Miech et al., 2017), derive their power from combining multiple sources in a nonlinear way. One-to-many functions, like Concatenated ReLU (Shang et al., 2016), improve performance by applying multiple nonlinear functions to a single input. Finally, many-to-many functions, such as BatchNorm (Ioffe & Szegedy, 2015) and LayerNorm (Ba et al., 2016), induce powerful nonlinear relationships between their inputs.

Most prior work has focused on proposing new activation functions (Maas et al., 2013; Agostinelli et al., 2014; He et al., 2015; Clevert et al., 2015; Hendrycks & Gimpel, 2016; Klambauer et al., 2017; Qiu & Cai, 2017; Zhou et al., 2017; Elfwing et al., 2017), but few studies, such as Xu et al. (2015), have systematically compared different activation functions. To the best of our knowledge, this is the first study to compare scalar activation functions across multiple challenging datasets.

Our study shows that Swish consistently outperforms ReLU on deep models. The strong performance of Swish challenges conventional wisdom about ReLU. Hypotheses about the importance of the gradient preserving property of ReLU seem unnecessary when residual connections (He et al., 2016a) enable the optimization of very deep networks. A similar insight can be found in the fully attentional Transformer (Vaswani et al., 2017), where the intricately constructed LSTM cell (Hochreiter & Schmidhuber, 1997) is no longer necessary when constant-length attentional connections are used. Architectural improvements lessen the need for individual components to preserve gradients.

7 Conclusion

In this work, we utilized automatic search techniques to discover novel activation functions that have strong empirical performance. We then empirically validated the best discovered activation function, which we call Swish and which is defined as f(x) = x · sigmoid(βx). Our experiments used models and hyperparameters that were designed for ReLU and just replaced the ReLU activation function with Swish; even this simple, suboptimal procedure resulted in Swish consistently outperforming ReLU and other activation functions. We expect additional gains to be made when these models and hyperparameters are specifically designed with Swish in mind. The simplicity of Swish and its similarity to ReLU mean that replacing ReLUs in any network is just a simple one line code change.

Acknowledgements

We thank Esteban Real, Geoffrey Hinton, Irwan Bello, Jascha Sohl-Dickstein, Jon Shlens, Kathryn Rough, Mohammad Norouzi, Navdeep Jaitly, Niki Parmar, Sam Smith, Simon Kornblith, Vijay Vasudevan, and the Google Brain team for help with this project.
