SPLASH: Learnable Activation Functions for Improving Accuracy and Adversarial Robustness

06/16/2020 ∙ by Mohammadamin Tavakoli, et al. ∙ University of California, Irvine

We introduce SPLASH units, a class of learnable activation functions shown to simultaneously improve the accuracy of deep neural networks while also improving their robustness to adversarial attacks. SPLASH units have a simple parameterization while maintaining the ability to approximate a wide range of non-linear functions. SPLASH units are: 1) continuous; 2) grounded (f(0) = 0); 3) use symmetric hinges; and 4) have hinge locations derived directly from the data (i.e. no learning required). Compared to nine other learned and fixed activation functions, including ReLU and its variants, SPLASH units show superior performance across three datasets (MNIST, CIFAR-10, and CIFAR-100) and four architectures (LeNet5, All-CNN, ResNet-20, and Network-in-Network). Furthermore, we show that SPLASH units significantly increase the robustness of deep neural networks to adversarial attacks. Our experiments on both black-box and open-box adversarial attacks show that commonly-used architectures, namely LeNet5, All-CNN, ResNet-20, and Network-in-Network, can be made up to 31% more robust to adversarial attacks by simply using SPLASH units instead of ReLUs.


1 Introduction

Nonlinear activation functions are fundamental for deep neural networks (DNNs). They determine the class of functions that DNNs can implement and influence their training dynamics, thereby affecting their final performance. For example, DNNs with rectified linear units (ReLUs) [nair2010rectified] have been shown to perform better than logistic and tanh units in several scenarios [pedamonti2018comparison, nwankpa2018activation, nair2010rectified, goodfellow2016deep]. Instead of using a fixed activation function, one can use a parameterized activation function and learn its parameters to add flexibility to the model. Piecewise linear functions are a reasonable choice for the parameterization of activation functions [agostinelli2014learning, He_2015, ramachandran2017searching, jin2016deep, li2016multi] due to their straightforward parameterization and their ability to approximate non-linear functions [garvin1957applications, stone1961approximation]. However, in the context of deep neural networks, the best way to parameterize these piecewise linear activation functions is still an open question. Previous piecewise linear activation functions either sacrifice expressive power for simplicity (i.e. having few parameters) or sacrifice simplicity for expressive power. While expressive power allows deep neural networks to approximate complicated functions, simplicity can make optimization easier by adding useful inductive biases and reducing the size of the hypothesis space. Therefore, we set out to find a parameterized piecewise linear activation function that is as simple as possible while maintaining the ability to approximate a wide range of functions.

Piecewise linear functions, in the most general form, are real-valued functions defined as line segments with hinges that denote where one segment ends and the next segment begins. As detailed in Section 3, a function of this most general form with S hinges requires 3S + 2 parameters. Many functions in this hypothesis space, such as discontinuous functions, are unlikely to be useful activation functions. Therefore, we significantly reduce the size of the hypothesis space while maintaining the ability to approximate a wide range of useful activation functions. We restrict the form of the piecewise linear function to be continuous and grounded (having an output of zero for an input of zero) with symmetric and fixed hinges. By doing so, we reduce the number of parameters to S + 1. Furthermore, we still maintain the ability to approximate almost every successful deep neural network activation function. We call this parameterized piecewise linear activation function SPLASH (Simple Piecewise Linear and Adaptive with Symmetric Hinges).

Typically, learned activation functions are evaluated in terms of accuracy on a test set. We compare the classification accuracy of SPLASH units to nine other learned and fixed activation functions and show that SPLASH units consistently give superior performance. We also perform ablation studies to gain insight into why SPLASH units improve performance and show that the flexibility of the SPLASH units during training significantly affects the final performance. In addition, we also evaluate the robustness of SPLASH units to adversarial attacks [szegedy2013intriguing, goodfellow2014explaining, nguyen2015deep]. When compared to ReLUs, SPLASH units reduce the success of adversarial attacks by up to 31%, without any modifications to how they are parameterized or learned.

2 Related Work

Variants of ReLUs, such as leaky-ReLUs [maas2013rectifier], exponential linear units (ELUs) [clevert2015fast], and scaled exponential linear units (SELUs) [klambauer2017self], have been shown to improve upon ReLUs. ELUs and SELUs encourage the outputs of the activation functions to have zero mean, while SELUs also encourage the outputs of the activation functions to have unit variance. Neural architecture search [ramachandran2017searching] has also discovered novel activation functions, in particular the Swish activation function, which is defined as f(x) = x · sigmoid(βx) and performs slightly better than ReLUs. It is worth mentioning that, in lin2013network, the authors proposed the network-in-network approach, where they replace activation functions in convolutional layers with small multi-layer perceptrons. Theoretically, due to the universal approximation theorem [csaji2001approximation], this is the most expressive activation function; however, it requires many more parameters.

Some of the early attempts to learn activation functions in neural networks can be found in poli1996parallel, weingaertner2002hierarchical, and khan2013fast, where the authors proposed learning the best activation function per neuron among a pool of candidate activation functions using genetic and evolutionary algorithms. Maxout [goodfellow2013maxout] has been introduced as an activation function aimed at enhancing the model averaging properties of dropout [srivastava2014dropout]. However, not only is it limited to approximating convex functions, but it also requires a significant increase in parameters.

APL units [agostinelli2014learning], P-ReLUs [He_2015], and S-ReLUs [jin2016deep] are adaptive activation functions from the piecewise linear family that can mimic both convex and non-convex functions. Of these activation functions, APL units are the most general. However, they require a parameter for the slope of each line segment as well as for the location of each hinge. Additionally, APL units give more expressive power to the left half of the input space than to the right half. Furthermore, the locations of the hinges are not determined by the data and, therefore, it is possible that some line segments may go unused. S-ReLUs also learn the slopes of the line segments and the locations of the hinges; however, the initial locations of the hinges are determined by the data. S-ReLUs have less expressive power than APL units, as the form of the function is restricted to only have two hinges. P-ReLUs are the simplest of these activation functions, with one fixed hinge and only the slope of one of the line segments learned. On the other hand, SPLASH units can have few or many hinges, and the locations of the hinges are fixed and determined by the data. Therefore, only the slopes of the line segments have to be learned. Furthermore, SPLASH units give equal expressive power to the left and the right half of the input space.

3 From Piecewise Linear Functions to SPLASH Units

3.1 Family of Piecewise Linear Functions

Given S + 1 line segments and S hinges, piecewise linear functions can be parameterized with 3S + 2 parameters: one parameter for the slope and one parameter for the y-intercept of each segment, plus S parameters for the locations of the hinges. We reduce the number of parameters to S + 1 while still being able to approximate a wide range of functions by restricting the activation function to be continuous and grounded with symmetric and fixed hinges.

Continuous

The general form of piecewise linear functions allows for discontinuous functions. Because virtually all successful activation functions are continuous, we argue that continuous learnable activation functions will still provide sufficient flexibility for DNNs. For a continuous piecewise linear function, we need to specify the y-intercept of one segment, the slopes of the S + 1 segments, as well as the locations of the S hinges, reducing the number of parameters to 2S + 2.

Grounded

Furthermore, we restrict the function to be grounded, that is, having an output of zero for an input of zero. We can do this without loss of generality, as a function that is not grounded can still be created with the use of a bias. Since the y-intercept is fixed at zero, we no longer have to specify the y-intercept for any of the segments, reducing the number of parameters to 2S + 1.

Symmetric Hinges

In our design, we place the hinges in symmetric locations on the positive and negative halves of the x-axis, giving equal expressive power to each half. This allows, if need be, the activation function to approximate both even and odd functions. Because the location of one hinge determines the location of another, we can reduce the number of parameters for the hinges to ⌊S/2⌋. In the case of an odd number of hinges, one hinge will be fixed at zero to maintain symmetry. This reduces the number of parameters to S + 1 + ⌊S/2⌋.

Fixed Hinges

Finally, we address the issue of where to set the exact location of each hinge. It is important that each segment has the potential to influence the output of the function. The distribution of the input could be such that only some of the segments influence the output while others remain unused. In the worst case, the input could be concentrated on a single segment, reducing the activation function to just a linear function. To ensure that each segment is able to play a role in the output of the function, we train our DNNs using batch normalization [ioffe2015batch]. At the beginning of training, batch normalization ensures that, for each batch, the input to the activation function has a mean of zero and a standard deviation of one. Using this knowledge, we can place the hinges at fixed locations that correspond to a certain number of standard deviations away from the mean. With the locations of the hinges fixed, the number of parameters is reduced to S + 1. This activation function can approximate the vast majority of existing activation functions, such as tanh units, ReLUs, leaky ReLUs, ELUs, and, with the use of a bias, logistic units. We show the different types of piecewise linear functions that we have described in Table 1.

| Type | General | Continuous | Continuous, Grounded | Continuous, Grounded, Symmetric hinges | Continuous, Grounded, Symmetric hinges, Fixed hinges |
| # Params | 3S + 2 | 2S + 2 | 2S + 1 | S + 1 + ⌊S/2⌋ | S + 1 |
| Viz | (example plot of each function type) | | | | |

Table 1: Different types of piecewise linear functions defined on S + 1 intervals. The rightmost column corresponds to the parameterization we use for our SPLASH activation functions.

3.2 SPLASH Units

We formulate the activation of a hidden unit as the summation of S + 1 max functions with symmetric offsets b_s, where S is an odd number and one of the offsets is zero:

SPLASH(x) = \sum_{s=1}^{(S+1)/2} a^{+}_{s} \max(0, x - b_s) + \sum_{s=1}^{(S+1)/2} a^{-}_{s} \max(0, -x - b_s)    (1)

The first summation contains max functions with a non-zero output starting at b_s and continuing to infinity. The second summation contains max functions with a non-zero output starting at -b_s and continuing to negative infinity. When summed together, these max functions form S + 1 continuous and grounded line segments with hinges located at b_s and -b_s. To ensure the function has symmetric and fixed hinges, we use the same b_s in both summations, where b_s ≥ 0 for all s; furthermore, the values of b_s remain fixed during training. Since we are using batch normalization, we fix the position of each hinge b_s to be a predetermined number of standard deviations away from the mean. We ensure there is always one hinge at zero by setting b_1 to zero. The learned parameters a_s^+ and a_s^- determine the slope of each line segment and are shared across all units in a layer. Therefore, SPLASH units add S + 1 parameters per layer. We study the effect of different initializations as well as the effect of the number of hinges, S, on training accuracy. From our experiments, we found that initializing SPLASH units to have the shape of a ReLU and setting S = 7 gave the best results. More details are given in the appendix.
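To make the parameterization concrete, the following is a minimal PyTorch sketch of a SPLASH unit as described by Equation 1; the class name, default hinge offsets, and initialization below are illustrative assumptions rather than the authors' reference implementation.

```python
# Minimal PyTorch sketch of a SPLASH unit (Equation 1). The default hinge
# offsets and the ReLU-shaped initialization are illustrative assumptions.
import torch
import torch.nn as nn


class SPLASH(nn.Module):
    """SPLASH(x) = sum_s a_plus[s]*max(0, x - b[s]) + sum_s a_minus[s]*max(0, -x - b[s])."""

    def __init__(self, offsets=(0.0, 1.0, 2.0)):
        super().__init__()
        # Fixed, non-negative offsets b_s (b_1 = 0); these give hinges at 0, +/-1, +/-2 (S = 5).
        self.register_buffer("b", torch.tensor(offsets))
        # ReLU-shaped initialization: a_1^+ = 1, all other slopes 0.
        a_plus = torch.zeros(len(offsets))
        a_plus[0] = 1.0
        self.a_plus = nn.Parameter(a_plus)                      # slopes acting on the positive half
        self.a_minus = nn.Parameter(torch.zeros(len(offsets)))  # slopes acting on the negative half

    def forward(self, x):
        x = x.unsqueeze(-1)                                     # broadcast against the offsets
        pos = torch.clamp(x - self.b, min=0.0)                  # max(0,  x - b_s)
        neg = torch.clamp(-x - self.b, min=0.0)                 # max(0, -x - b_s)
        return (self.a_plus * pos + self.a_minus * neg).sum(dim=-1)


# Usage sketch: place it after a batch-normalized layer, e.g.
# block = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64), SPLASH())
```

Because a_plus and a_minus depend only on the number of offsets, the learned slopes are shared across all units of the layer, matching the S + 1 parameters per layer described above.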

The following theorem shows that SPLASH units can approximate any non-linear and uniformly continuous function that has an output of zero for an input of zero in a closed interval of real numbers.

Theorem

For any function g and any ε > 0, there exists a SPLASH function such that |SPLASH(x) − g(x)| < ε for every x ∈ [l, u], where g(0) = 0, assuming:

  • l and u are finite real numbers.

  • g is uniformly continuous.

Proof

Uniform continuity of g implies that, for every ε > 0, there exists δ > 0 such that for every x_1 and x_2 with |x_1 − x_2| < δ, we have |g(x_1) − g(x_2)| < ε/2. Placing S equally spaced hinges h_1 < h_2 < ... < h_S on the interval [l, u] divides it into S + 1 equal sub-intervals [h_i, h_{i+1}]. We choose S large enough that (u − l)/(S + 1) < δ, so the length of each sub-interval is smaller than δ. On any sub-interval starting at h_i, we approximate g by the line segment that connects (h_i, g(h_i)) to (h_{i+1}, g(h_{i+1})). Due to the linear form of SPLASH(x) for x ∈ [h_i, h_{i+1}]:

|SPLASH(x) − g(h_i)| ≤ |g(h_{i+1}) − g(h_i)| < ε/2.    (2)

g is uniformly continuous, so:

|SPLASH(x) − g(x)| ≤ |SPLASH(x) − g(h_i)| + |g(h_i) − g(x)| < ε.    (3)

Now we need to show that the SPLASH function (i.e., Equation 1) is able to connect (h_i, g(h_i)) to (h_{i+1}, g(h_{i+1})) for every i. We do so by a simple induction as follows. Suppose that SPLASH connects (h_j, g(h_j)) to (h_{j+1}, g(h_{j+1})) for every sub-interval up to [h_{i−1}, h_i]. The slope of SPLASH in the sub-interval [h_{i−1}, h_i] is Σ_{s≤k} a_s^+ or Σ_{s≤k} a_s^− for some k (depending on the sign of the sub-interval). However, the slope of SPLASH in the sub-interval [h_i, h_{i+1}] is either Σ_{s≤k+1} a_s^+ or Σ_{s≤k+1} a_s^−. In both cases, the extra term a_{k+1}^+ or a_{k+1}^− can change the slope to any arbitrary value. This fact, plus the continuity of SPLASH, guarantees that SPLASH can also connect (h_i, g(h_i)) to (h_{i+1}, g(h_{i+1})), which was our proposed approximation. The last thing to mention is that, since SPLASH is grounded (SPLASH(0) = 0), this approximation by line segments can only approximate functions where g(0) = 0.
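The argument above can be illustrated numerically. The sketch below (not part of the paper) fits a continuous, grounded piecewise linear interpolant with symmetric hinges, which a SPLASH unit with suitable slopes can realize on the interval, to g(x) = sin(x) and checks that the maximum error shrinks as the number of hinges grows.

```python
# Numerical illustration of the approximation argument (not part of the paper):
# a continuous, grounded piecewise linear interpolant with S symmetric hinges
# approximates g(x) = sin(x) on [-3, 3] with an error that shrinks as S grows.
import numpy as np


def piecewise_linear_interpolant(g, S, lo=-3.0, hi=3.0):
    """Interpolate g at S interior hinges plus the interval endpoints (S odd -> a hinge at 0)."""
    knots = np.linspace(lo, hi, S + 2)
    values = g(knots)
    return lambda x: np.interp(x, knots, values)


grid = np.linspace(-3.0, 3.0, 10_001)
for S in (3, 7, 15, 31):
    f = piecewise_linear_interpolant(np.sin, S)
    err = np.max(np.abs(f(grid) - np.sin(grid)))
    print(f"S = {S:2d}   max |f(x) - sin(x)| = {err:.4f}")   # decreases as S increases
```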

4 Accuracy

4.1 Comparison to Other Activation Functions

In order to show that SPLASH units are beneficial for deep neural networks, we compare them with well-known activation functions in different architectures. We train LeNet5, Network-in-Network, All-CNN, and ResNet-20 on three different datasets: MNIST [lecun1998gradient], CIFAR-10, and CIFAR-100 [krizhevsky2009learning]. We set S = 7 and fix the locations of the hinges symmetrically around zero, at a predetermined number of standard deviations from the mean, as described in Section 3. The slope a_1^+ is initialized to 1 and the remaining slopes are initialized to 0. With this initialization, the starting shape of a SPLASH unit mimics the shape of a ReLU.

With the exception of the All-CNN architecture, moderate data augmentation is performed as explained in he2016deep. Moderate data augmentation adds horizontally flipped examples of all images to the training set, as well as random translations with a maximum translation of 5 pixels in each dimension. For the All-CNN architecture, we use the heavy data augmentation introduced in springenberg2014striving. More details on the hyper-parameters are given in the appendix.

We compare SPLASH units to ReLUs, leaky-ReLUs, PReLUs, APL units, tanh units, sigmoid units, ELUs, maxout units with nine features, and Swish units. We tune the hyperparameters for each DNN using ReLUs and use the same hyperparameters for each activation function. The results of the experiments are shown in Table 2. We report the average and the standard deviation of the error rate on the test set across five runs. The table shows that SPLASH units have the best performance across all datasets and architectures.

| Model + Activation | MNIST | CIFAR-10 | CIFAR-10 (D-A) | CIFAR-100 | CIFAR-100 (D-A) |
| LeNet5 + ReLU [bigballon2017cifar10cnn] | - | 31.22 | 23.77 | - | - |
| LeNet5 (ours) + ReLU | 1.11 | 30.98 | 23.41 | - | - |
| LeNet5 (ours) + PReLU | 1.13 | 30.71 | 23.33 | - | - |
| LeNet5 (ours) + SPLASH | 1.03 | 30.14 | 22.93 | - | - |
| Net in Net + ReLU [lin2013network] | - | 10.41 | 8.81 | 35.68 | - |
| Net in Net (ours) + ReLU | - | 9.71 | 8.11 | 36.06 | 32.98 |
| Net in Net + APL [agostinelli2014learning] | - | 9.59 | 7.51 | 34.40 | 30.83 |
| Net in Net (ours) + SPLASH | - | 9.21 | 7.29 | 33.91 | 30.32 |
| All-CNN + ReLU [springenberg2014striving] | - | 9.08 | 7.25 | 33.71 | - |
| All-CNN (ours) + ReLU | - | 9.24 | 7.42 | 34.11 | 32.43 |
| All-CNN (ours) + maxout | - | 9.19 | 7.45 | 34.21 | 32.33 |
| All-CNN (ours) + SPLASH | - | 9.02 | 7.18 | 33.14 | 32.06 |
| ResNet-20 + ReLU [he2016deep] | - | - | 8.75 | - | - |
| ResNet-20 (ours) + ReLU | - | 10.65 | 8.71 | 34.54 | 32.63 |
| ResNet-20 (ours) + APL | - | 10.29 | 8.59 | 34.62 | 32.51 |
| ResNet-20 (ours) + SPLASH | - | 9.98 | 8.18 | 33.97 | 32.12 |

Table 2: Deep neural networks with ReLUs, leaky-ReLUs, PReLUs, tanh units, sigmoid units, ELUs, maxout units with nine features, Swish units, APL units, and SPLASH units are compared on three different datasets. For each architecture, we compare the test-set error rates of SPLASH units to those of the best of the other activation functions. For the sake of brevity, D-A refers to data augmentation. The values in the table are error rates, reported in percentages and averaged over five runs.

4.2 Insights into why SPLASH Units Improve Accuracy

Figure 1 shows how the shapes of the SPLASH units change during training for the ResNet-20 architecture. From these figures, we can see that, during the early stages of training, the SPLASH units have a negative output for a negative input and a positive output for a positive input. During the later stages of training, SPLASH units have a positive output for both a negative input and a positive input. SPLASH units look similar to a leaky-ReLU during the early stages of training and look similar to a symmetric function during the later stages of training.

To better understand why SPLASH units lead to better performance, we used the final shape of the SPLASH units as a fixed activation function to train another ResNet-20 architecture. In Figure 2, we can see that the performance is only about as good as that of ReLUs. This leads us to believe that the evolution of the shape of the SPLASH units during training is crucial to obtaining improved performance. Since we observed that SPLASH units would first give a negative output for a negative input and later give a positive output for a negative input, we train ResNet-20 with SPLASH units under two different conditions: 1) the slope of the first line segment on the negative half of the input is forced to be positive, yielding a negative output for inputs just below zero (SPLASH-negative units), and 2) the slope of the first line segment on the negative half of the input is forced to be negative, yielding a positive output for inputs just below zero (SPLASH-positive units).

The performance of SPLASH-positive and SPLASH-negative units is shown in Figure 2. The figure shows that, although SPLASH-positive units have the ability to mimic the final learned shape of SPLASH units, they perform worse than SPLASH units and only slightly better than ReLUs. This shows that the ability to give a negative output for a negative input is crucial for SPLASH units. Furthermore, SPLASH-negative units perform better than SPLASH-positive units, but still worse than SPLASH units. In addition, we see that SPLASH-negative units exhibit a relatively quick decrease in the training loss, similar to that of SPLASH units, but do not reach the final training loss of SPLASH units. These observations suggest that the flexibility of the learnable activation function plays a crucial role in the final performance.
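For reference, the SPLASH-positive and SPLASH-negative constraints can be realized, for example, by clamping the relevant slope after each optimizer step. The sketch below reuses the SPLASH module sketched in Section 3.2 and is our illustration, not the code used for the experiments.

```python
# Hypothetical sketch of the SPLASH-positive / SPLASH-negative ablations,
# reusing the SPLASH module sketched in Section 3.2. The clamp below forces the
# segment immediately left of zero to output only negative (or only positive)
# values; call it after every optimizer step.
import torch


def constrain_splash(model, mode="negative"):
    for module in model.modules():
        if isinstance(module, SPLASH):
            with torch.no_grad():
                if mode == "negative":
                    # Negative outputs just below zero: output = a_minus[0] * |x| for small x < 0.
                    module.a_minus[0].clamp_(max=0.0)
                else:
                    # "positive": positive outputs just below zero.
                    module.a_minus[0].clamp_(min=0.0)


# Training-loop usage (sketch):
#   loss.backward(); optimizer.step(); constrain_splash(model, mode="negative")
```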

Figure 1: The shape of the SPLASH units in six different layers of the ResNet-20 architecture during training on the CIFAR-10 dataset. In the early stages of training, the shape of SPLASH units looks visually similar to that of a leaky-ReLU. However, during the later stages of training, the shape of SPLASH units looks visually similar to that of a symmetric function.
Figure 2: Training loss for ReLUs and different types of SPLASH units for the ResNet-20 architecture on CIFAR-10. SPLASH units converge faster and also have the lowest final loss. Fixed SPLASH is a fixed activation function that mimics the final shape of the SPLASH units trained on the ResNet-20 architecture. Fixed SPLASH performs only about as well as ReLUs. SPLASH-negative units perform better than SPLASH-positive units, however, they perform worse than SPLASH units. Furthermore, although SPLASH-positive units have the ability to mimic the final shape of SPLASH units, they perform worse.

4.3 Tradeoffs

The benefits of SPLASH units come at the cost of longer training time. The average per-epoch training time and the final accuracy of a variety of fixed and learned activation functions are reported in Table 3. The table shows that training with SPLASH units can take between 1.2 and 3 times longer, depending on S and the chosen architecture. We see that the error rate does not significantly decrease beyond S = 7. Therefore, we chose S = 7 for our experiments. While, for many deep learning algorithms, obtaining better performance often comes at the cost of longer training times, in Section 5 we show that SPLASH units also improve the robustness of deep neural networks to adversarial attacks.

| | SPLASH (S=3) | SPLASH (S=5) | SPLASH (S=7) | SPLASH (S=9) | SPLASH (S=11) | Tanh | Maxout | ReLU | Swish | APL |
| MNIST (MLP), T | 10 | 14 | 16 | 18 | 19 | 8 | 13 | 6 | 7 | 14 |
| MNIST (MLP), E | 1.57 | 1.33 | 1.13 | 1.10 | 1.12 | 1.88 | 1.45 | 1.35 | 1.35 | 1.40 |
| CIFAR-10 (LeNet5), T | 21 | 24 | 29 | 33 | 35 | 19 | 22 | 17 | 17 | 24 |
| CIFAR-10 (LeNet5), E | 30.79 | 30.57 | 30.20 | 30.14 | 30.11 | 31.14 | 31.01 | 30.88 | 30.69 | 30.66 |

Table 3: Per-epoch training time is reported in seconds. The benefits of SPLASH come at the cost of slower training. All models are trained using an NVIDIA TITAN V GPU with 12036 MiB of memory at 850 MHz. Maxout is trained with six features and APL is set to have five hinges. For the sake of brevity, T and E correspond to per-epoch training time (seconds) and error rate (%), respectively.

5 Robustness to Adversarial Attacks

DNNs have been shown to be vulnerable to many types of adversarial attacks [szegedy2013intriguing, goodfellow2014explaining]. Research suggests that activation functions are a major cause of this vulnerability [zantedeschi2017efficient, brendel2017decision]. For example, zhang2018efficient bounded a given activation function using linear and quadratic functions with adaptive parameters and applied a different activation for each neuron to make neural networks robust to adversarial attacks. wang2018adversarial proposed a data-dependent activation function and empirically showed its robustness to both black-box and gradient-based adversarial attacks. Other studies such as rakin2018defend, dhillon2018stochastic, and song2018defense focused on other properties of activation functions, such as quantization and pruning, and showed that they can improve the robustness of DNNs to adversarial examples.

Recently, the authors of zhao2016suppressing theoretically showed that DNNs with symmetric activations are less likely to get fooled. The authors proved that “symmetric units suppress unusual signals of exceptional magnitude which result in robustness to adversarial fooling and higher expressibility.” Because SPLASH units are capable of approximating a symmetric function, they may also be capable of increasing the robustness of DNNs to adversarial attacks. In this section, we show that SPLASH units greatly improve the robustness of DNNs to adversarial attacks. This claim is verified through a wide range of experiments with the CIFAR-10 dataset under both black-box and open-box methods, including the one-pixel attack and the fast gradient sign method.

An intuition for why a DNN with SPLASH units is more robust than a DNN with ReLUs is provided in Figure 3. For each of the two networks, we take 100 random samples of frog and ship images and visualize their pre-softmax representations using tSNE [maaten2008visualizing]. The figure shows that the two classes have less overlap for the DNN with SPLASH units than for the DNN with ReLUs.
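A visualization of this kind can be reproduced along the following lines with scikit-learn; this is a sketch in which the feature-extraction step, the helper name, and the plotting details are assumptions, while the tSNE hyper-parameters follow the appendix.

```python
# Sketch of the tSNE visualization of pre-softmax features with scikit-learn.
# `features` is an (N, d) array of pre-softmax outputs for the sampled images
# and `labels` the corresponding CIFAR-10 class ids (both assumed to be given);
# the tSNE hyper-parameters follow the appendix.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_presoftmax_tsne(features, labels):
    embedding = TSNE(n_components=2, perplexity=40, learning_rate=30).fit_transform(features)
    for class_id, name in [(6, "frog"), (8, "ship")]:       # CIFAR-10 class indices
        mask = labels == class_id
        plt.scatter(embedding[mask, 0], embedding[mask, 1], s=10, label=name)
    plt.legend()
    plt.show()
```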

Figure 3: tSNE visualization of the pre-softmax layer’s outputs for the LeNet5 architecture trained on CIFAR-10. Left: trained with ReLUs. Right: trained with SPLASH units. The figures show that the samples from the frog and ship classes are better separated by the DNN trained with SPLASH units.

5.1 Black-Box Adversarial Attacks

For black-box adversarial attacks, we assume the adversary has no information about the parameters of the DNN. The adversary can only observe the inputs to the DNN and outputs of the DNN, similar to that of a cryptographic oracle. We test the robustness of DNNs with SPLASH units using two powerful black box adversarial attacks, namely, the one-pixel attack and the boundary attack.

5.1.1 One Pixel Attack

A successful one-pixel attack, based on differential evolution, was proposed by Su_2019. Using this technique, we iteratively generate adversarial images that try to minimize the confidence of the true class. The process starts with randomly modifying a few pixels to generate adversarial examples. At each step, several adversarial images are fed to the DNN and the output of the softmax function is observed. Examples that lowered the confidence of the true class are kept to generate the next generation of adversaries. New adversarial images are then generated through mutations. By repeating these steps for a few iterations, the adversarial modifications generate more and more misleading images. The last step returns the adversarial modification that reduced the confidence of the true class the most, the goal being that a class other than the true class ends up with the highest confidence.

In the following experiment, we modify one, three, and five pixels of images to generate adversarial examples. The mutation scheme we used for this experiment is as follows:

x_i(t+1) = x_{r_1}(t) + F (x_{r_2}(t) - x_{r_3}(t))    (4)

where r_1, r_2, and r_3 are three non-equal random indices of the modifications at step t, F is a scale factor, and x_i(t+1) will be an element of a new candidate modification.
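A minimal NumPy sketch of this mutation step is shown below; the encoding of a candidate modification and the scale factor F = 0.5 are assumptions, not values specified in the paper.

```python
# NumPy sketch of the differential-evolution mutation in Equation 4. Each row of
# `pop` encodes one candidate modification (e.g. pixel coordinates and RGB
# values); the scale factor F = 0.5 is a common default and an assumption here.
import numpy as np


def de_mutation(pop, F=0.5, rng=None):
    """Generate a new candidate x_i(t+1) for every member of the population."""
    rng = rng or np.random.default_rng()
    children = np.empty_like(pop)
    for i in range(len(pop)):
        r1, r2, r3 = rng.choice(len(pop), size=3, replace=False)  # three distinct random indices
        children[i] = pop[r1] + F * (pop[r2] - pop[r3])           # x_i(t+1) = x_r1 + F (x_r2 - x_r3)
    return children
```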

To evaluate the effect of SPLASH units on the robustness of DNNs, we employ commonly-used architectures, namely, LeNet5, Network-in-Network, All-CNN, and ResNet-20. Each architecture is trained with ReLUs, APL units, Swish units, and SPLASH units. The results are shown in Table 4. The results show that SPLASH units significantly improve robustness to adversarial attacks for all architectures and outperform all other activation functions. In particular, for LeNet5 and ResNet-20, SPLASH units improve performance over ReLUs by 31% and 28%, respectively.

| Model | Activation | one-pixel | three-pixels | five-pixels | avg. confidence |
| LeNet5 | ReLU | 736 | 803 | 868 | 0.740 |
| | Swish | 701 | 780 | 840 | 0.805 |
| | APL | 635 | 709 | 781 | 0.465 |
| | SPLASH | 514 | 588 | 651 | 0.540 |
| Net in Net | ReLU | 644 | 701 | 769 | 0.621 |
| | Swish | 670 | 715 | 760 | 0.419 |
| | APL | 521 | 661 | 703 | 0.455 |
| | SPLASH | 449 | 530 | 599 | 0.311 |
| All-CNN | ReLU | 580 | 661 | 707 | 0.366 |
| | Swish | 597 | 630 | 699 | 0.511 |
| | APL | 509 | 581 | 627 | 0.295 |
| | SPLASH | 471 | 515 | 570 | 0.253 |
| ResNet-20 | ReLU | 689 | 721 | 781 | 0.551 |
| | Swish | 650 | 689 | 730 | 0.601 |
| | APL | 579 | 631 | 692 | 0.290 |
| | SPLASH | 493 | 544 | 579 | 0.332 |

Table 4: Robustness to the one-pixel attack using 1000 randomly chosen CIFAR-10 test set images. We attack each architecture five times and report the mean number of successful attacks across the five runs. The maximum number of iterations for all attacks is set to 40. The last column reports the average confidence measure described in the text, computed for the one-pixel attack.

After examining adversarial samples that deceive both DNNs with ReLUs and DNNs with SPLASH units, we found that DNNs with SPLASH units still assign higher confidence to the true labels of the perturbed images than DNNs with ReLUs and Swish units. More precisely, over all adversarial samples on which both networks are fooled, we measure the average confidence that the softmax output assigns to the adversarial (incorrect) prediction for the perturbed sample. For each model, this measurement is included in the last column of Table 4. The results show that SPLASH units often have a smaller average value, again indicating that SPLASH units are more robust to adversarial attacks.

5.1.2 Boundary Attacks

We use another black-box adversarial attack to further examine the effect SPLASH units have on the robustness of DNNs to adversarial fooling. Boundary attacks, which were recently introduced by brendel2017decision, are a powerful and commonly used black-box adversarial attack. Considering the original pair of input image and corresponding target as (x, y), the attack algorithm is initialized from an adversarial pair (x̃_0, y), where x̃_0 is chosen such that the model does not classify it as y. Then, a random walk is performed along the boundary between the adversarial region and the region of the true label such that (1) the perturbed image stays in the adversarial region and (2) the distance towards the original image is reduced. Each iteration of the random walk uses the following three steps: (1) Draw a random sample from an i.i.d. Gaussian as the direction of the next move. (2) Project the sampled direction onto the sphere centered at the original image x whose radius is the current distance to x, and take a small step in this projected direction. (3) Make a small move towards the original image, which guarantees that the perturbed image gets closer to the original image at each step. Ideally, this algorithm converges to the adversarial sample that is closest to the original input x. The details and hyper-parameters of the attack are explained in the appendix.
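The following NumPy sketch outlines a single random-walk iteration of this attack; the experiments in this paper use the foolbox implementation (see the appendix), so the step sizes and the is_adversarial oracle below are placeholders.

```python
# Schematic NumPy sketch of one boundary-attack random-walk step as described
# above. The paper's experiments use the foolbox implementation; the step sizes
# and the `is_adversarial` oracle here are placeholders.
import numpy as np


def boundary_step(x_adv, x_orig, is_adversarial, delta=0.1, eps=0.05, rng=None):
    rng = rng or np.random.default_rng()

    # (1) Random direction drawn from an i.i.d. Gaussian.
    eta = rng.standard_normal(x_orig.shape)
    eta /= np.linalg.norm(eta)

    # (2) Step in that direction, then project back onto the sphere around the
    #     original image so the candidate keeps its current distance to x_orig.
    radius = np.linalg.norm(x_adv - x_orig)
    candidate = x_adv + delta * radius * eta
    candidate = x_orig + radius * (candidate - x_orig) / np.linalg.norm(candidate - x_orig)

    # (3) Contract towards the original image, reducing the distance.
    candidate = candidate + eps * (x_orig - candidate)

    # Keep the move only if the candidate is still adversarial.
    return candidate if is_adversarial(candidate) else x_adv
```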

In what follows, we employ the same architectures and activation functions that were used in the previous section. The results of this attack are shown in Table 5. We observe that DNNs with SPLASH units are more robust to this adversarial attack than DNNs with APL units, ReLUs, and Swish units.

| Model | Activation | # of successful attacks | avg. confidence |
| LeNet5 | ReLU | 801 | 0.815 |
| | Swish | 779 | 0.511 |
| | APL | 730 | 0.541 |
| | SPLASH | 619 | 0.401 |
| Net in Net | ReLU | 766 | 0.502 |
| | Swish | 759 | 0.391 |
| | APL | 654 | 0.340 |
| | SPLASH | 598 | 0.351 |
| All-CNN | ReLU | 744 | 0.621 |
| | Swish | 700 | 0.710 |
| | APL | 672 | 0.480 |
| | SPLASH | 611 | 0.421 |
| ResNet-20 | ReLU | 790 | 0.548 |
| | Swish | 793 | 0.566 |
| | APL | 711 | 0.471 |
| | SPLASH | 621 | 0.349 |

Table 5: Robustness to the boundary attack using 1000 randomly chosen CIFAR-10 test set images. We attack each architecture five times and report the mean number of successful attacks across the five runs. The last column reports the average confidence measure described in Section 5.1.1.

5.2 Open-Box Adversarial Attacks

For open-box adversarial attacks, the adversary now has information about the parameters of the DNN. To further explore the robustness of DNNs with SPLASH units, in this section, we consider two of the popular benchmarks of open-box adversarial attacks: the fast gradient sign method (FGSM) [goodfellow2014explaining] and Carlini and Wagner (CW) attacks [carlini2017towards]. For both attack methods, we consider four different architectures and compare the rate of successful attacks for each of the networks with ReLUs, Swish units, APL units, and SPLASH units. The dataset and architectures are the same as those used for black-box adversarial attacks.

5.2.1 FGSM

FGSM generates an adversarial image x̂ from the original image x by maximizing the loss L(x̂, y), where y is the true label of the image x. This maximization problem is subject to the constraint ‖x̂ − x‖_∞ ≤ ε, where ε is the attack strength. Using the first-order Taylor series approximation, we then have:

L(x̂, y) ≈ L(x, y) + (x̂ − x)^T ∇_x L(x, y).    (5)

So the adversarial image would be:

x̂ = x + ε · sign(∇_x L(x, y)).    (6)
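A minimal PyTorch sketch of Equation 6 (assuming a trained classifier and inputs scaled to [0, 1]) looks as follows:

```python
# Minimal PyTorch sketch of FGSM (Equation 6); `model` is a trained classifier
# and `images` a batch of inputs scaled to [0, 1] (an assumption).
import torch
import torch.nn.functional as F


def fgsm_attack(model, images, labels, eps):
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)        # L(x, y)
    loss.backward()
    with torch.no_grad():
        adv = images + eps * images.grad.sign()          # x_hat = x + eps * sign(grad_x L)
        adv = adv.clamp(0.0, 1.0)                        # keep pixels in the valid range
    return adv.detach()
```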

The results for different values of ε are summarized in Table 6. The results show that SPLASH units are consistently better than all other activation functions, with performance improvements of up to 28.5%.

| Model | Activation | ε_1 | ε_2 | ε_3 | avg. confidence |
| LeNet5 | ReLU | 690 | 755 | 825 | 0.710 |
| | Swish | 634 | 740 | 830 | 0.713 |
| | APL | 611 | 691 | 807 | 0.419 |
| | SPLASH | 493 | 598 | 772 | 0.521 |
| Net in Net | ReLU | 590 | 651 | 798 | 0.609 |
| | Swish | 577 | 619 | 750 | 0.439 |
| | APL | 531 | 607 | 719 | 0.561 |
| | SPLASH | 498 | 554 | 689 | 0.499 |
| All-CNN | ReLU | 561 | 653 | 741 | 0.590 |
| | Swish | 519 | 622 | 740 | 0.576 |
| | APL | 522 | 615 | 721 | 0.549 |
| | SPLASH | 479 | 588 | 676 | 0.333 |
| ResNet-20 | ReLU | 651 | 736 | 801 | 0.641 |
| | Swish | 639 | 730 | 793 | 0.522 |
| | APL | 609 | 701 | 749 | 0.303 |
| | SPLASH | 541 | 617 | 711 | 0.411 |

Table 6: Robustness to the FGSM attack using 1000 randomly chosen CIFAR-10 test set images, for three increasing attack strengths ε_1 < ε_2 < ε_3. We attack each architecture five times with a random start and report the mean number of successful attacks across the five runs. The last column reports the average confidence measure, computed for a single value of ε.
| Model | Activation | # of successful attacks | avg. confidence |
| LeNet5 | ReLU | 932 | 0.801 |
| | Swish | 919 | 0.713 |
| | APL | 922 | 0.609 |
| | SPLASH | 898 | 0.541 |
| Net in Net | ReLU | 916 | 0.790 |
| | Swish | 919 | 0.724 |
| | APL | 915 | 0.653 |
| | SPLASH | 892 | 0.674 |
| All-CNN | ReLU | 894 | 0.611 |
| | Swish | 887 | 0.631 |
| | APL | 876 | 0.509 |
| | SPLASH | 863 | 0.365 |
| ResNet-20 | ReLU | 903 | 0.603 |
| | Swish | 911 | 0.441 |
| | APL | 894 | 0.590 |
| | SPLASH | 870 | 0.541 |

Table 7: Robustness to the CW-L2 attack using 1000 randomly chosen CIFAR-10 test set images. We attack each architecture five times and report the mean number of successful attacks across the five runs. The last column reports the average confidence measure described in Section 5.1.1.

5.2.2 CW-L2

Another open-box adversarial attack, which is generally more powerful than FGSM, was introduced in carlini2017towards. For a given image x and label y, this technique tries to find the minimum perturbation δ so that the perturbed image x + δ is classified as a chosen target label t ≠ y. Using the L2 norm, this perturbation minimization problem can be formulated as follows:

min_δ ‖δ‖_2^2   subject to   C(x + δ) = t,    (7)

where C(·) denotes the classifier's predicted label. To ease the satisfaction of the equality constraint, Equation 7 can be rephrased as min_δ ‖δ‖_2^2 + c · g(x + δ), where g(x̂) = max(max_{i≠t} Z(x̂)_i − Z(x̂)_t, 0), c is a Lagrange multiplier, and Z(·) is the pre-softmax vector for the input x̂.
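For illustration, the relaxed objective can be written in PyTorch as follows; this is a sketch of the loss only, with the logits function, variable names, and the constant c assumed, and it is not the full Carlini-Wagner optimization procedure.

```python
# PyTorch sketch of the relaxed CW-L2 objective described above; `logits_fn`
# plays the role of the pre-softmax output Z(.), `target` is the target label t,
# and `c` is the Lagrange multiplier.
import torch


def cw_l2_loss(logits_fn, x, delta, target, c=1.0):
    logits = logits_fn(x + delta)                                   # Z(x + delta)
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, target.unsqueeze(1), float("-inf"))          # mask out the target class
    best_other = others.max(dim=1).values                           # max_{i != t} Z_i
    hinge = torch.clamp(best_other - target_logit, min=0.0)         # g(x + delta)
    l2 = (delta ** 2).flatten(1).sum(dim=1)                         # ||delta||_2^2
    return (l2 + c * hinge).mean()
```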

The robustness performance of ReLUs, Swish units, APL units, and SPLASH units for the CW-L2 attack is shown in Table 7. The table is consistent with previous results as it shows that SPLASH units are the most robust to this adversarial attack.

6 Conclusion

SPLASH units are simple and flexible parameterized piecewise linear functions that simultaneously improve both the accuracy and adversarial robustness of DNNs. They had the best classification accuracy across three different datasets and four different architectures when compared to nine other learned and fixed activation functions. When investigating the reason behind their success, we found that the final shape of the learnable SPLASH units did not serve as a good non-learnable (fixed) activation function. Additionally, in our ablation studies, we saw that restricting the flexibility of the activation function hurts performance, even if the restricted activation function can still mimic the final shape of the unrestricted SPLASH units. It could be possible that changes in the activation functions play a particular role in shaping the loss landscape of deep neural networks [hochreiter1997flat, dauphin2014identifying, choromanska2015loss]. Future work will use visualization techniques [craven1992visualizing, gallagher2003visualization, li2018visualizing] to obtain an intuitive understanding of how learnable activation functions affect the optimization process.

Though no adversarial examples are shown during training, SPLASH units still significantly increase the robustness of DNNs to adversarial attacks. Prior research suggests that the reason for this may be related to their final shape, which looks visually similar to that of a symmetric function [zhao2016suppressing]. Given that research has shown that certain activation functions may make deep neural networks susceptible to adversarial attacks [croce2018randomized], it is possible that adding more inductive biases aimed at reducing these vulnerabilities may increase the robustness of learned activation functions to adversarial attacks. Since our ablation studies have shown the importance of having flexible activation functions during training, these inductive biases may need to allow for flexibility or be applied during the later stages of training, for example, in the form of a regularization penalty.

7 Acknowledgement

Work in part supported by ARO grant 76649-CS, NSF grant 1839429, and NSF grant NRT 1633631 to PB. We wish to acknowledge Yuzo Kanomata for computing support.

References

8 Appendix

8.1 Initialization of SPLASH weights

In order to choose the best initialization of the SPLASH weights (a_s^+ and a_s^-), we compare the performance of five LeNet5 networks trained on CIFAR-10, each using a differently initialized SPLASH activation. Figure 4 shows that the leaky-ReLU and ReLU initializations perform the best. The leaky-ReLU initialization requires us to determine the slope of the line segment on the left side of the x-axis, adding another parameter that may need tuning. Therefore, for simplicity, we use the ReLU initialization (a_1^+ = 1 and all other parameters set to 0) in all of our experiments.

Figure 4: Left: The loss trajectory of training LeNet5 architecture on CIFAR-10 using different initializations of SPLASH units. Right: Visualizations of the initializations.

8.2 Number of Hinges

In this section, we perform a variety of experiments to find the best setting for SPLASH activation in terms of both complexity and performance.

| S | 3 | 5 | 7 | 9 | 11 |
| Error rate, MNIST (shared / independent) | 1.57 / 1.61 | 1.33 / 1.39 | 1.13 / 1.17 | 1.10 / 1.08 | 1.12 / 1.08 |
| Error rate, CIFAR-10 (shared / independent) | 30.79 / 30.55 | 30.57 / 30.29 | 30.20 / 30.18 | 30.14 / 30.22 | 30.11 / 30.19 |
| # of additional params, MNIST (shared / independent) | 12 / 1408 | 18 / 2112 | 24 / 2816 | 30 / 3520 | 36 / 4224 |
| # of additional params, CIFAR-10 (shared / independent) | 16 / 75k | 24 / 120k | 32 / 150k | 40 / 180k | 48 / 225k |

Table 8: Classification tasks on MNIST and CIFAR-10 are performed with five different numbers of hinges for the SPLASH activation. The number of additional parameters due to the use of SPLASH and the resulting error rates are compared. The MLP architecture consists of three layers with 256, 64, and 32 units; LeNet5 is used for CIFAR-10. For each experiment, two numbers are reported, corresponding to shared SPLASH units and independent SPLASH units, respectively.

First, we assess the effect of S on the performance of SPLASH. By Theorem 3.2, greater values of S increase the expressive power of SPLASH, which generally results in better training performance. We tried S ∈ {3, 5, 7, 9, 11} with symmetrically fixed hinges for the SPLASH units. We also use MNIST [lecun-mnisthandwrittendigit-2010] and CIFAR-10 [cifar10]. Each network is trained with two types of SPLASH activations: 1) a shared SPLASH unit, shared among all neurons of a layer, and 2) an independent SPLASH unit for each neuron of a layer. As summarized in Table 8, for S > 7 there is no significant improvement in the performance of the DNNs.
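As a sanity check, the "# of additional params" row of Table 8 can be reproduced directly from the S + 1 slopes each SPLASH unit learns; the short sketch below assumes the three-layer MLP described in the caption of Table 8.

```python
# Worked check of the "# of additional params" row of Table 8, assuming the
# three-layer MLP from the table caption (256, 64, and 32 units per layer).
S = 3                                    # number of hinges
slopes_per_unit = S + 1                  # each SPLASH unit learns S + 1 slopes
layer_sizes = [256, 64, 32]

shared = slopes_per_unit * len(layer_sizes)       # one SPLASH per layer:  4 * 3   = 12
independent = slopes_per_unit * sum(layer_sizes)  # one SPLASH per neuron: 4 * 352 = 1408
print(shared, independent)                        # matches the 12 / 1408 entry for S = 3
```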

On the other hand, due to the increase in the number of parameters of SPLASH, the activation units become more computationally expensive. In Table 3, we compare the per-epoch training run-time for different numbers of hinges of a shared SPLASH.

For small values of S, we can see that SPLASH is comparable to an exponential activation function such as tanh, and much faster than heavier activations such as maxout.

As one can conclude from Tables 8 and 3, there is a trade-off between the complexity of SPLASH units and the performance of DNNs. We believe that S = 7 is the best choice for the number of hinges.

8.3 Experiments’ Details and Statistical Significance

In this section, we explain the experimental conditions and all the parameters used for each experiment. Also, in order to make the results of Table 2 more interpretable, we perform a t-test [kim2015t] on all the error rates achieved in that experiment.

In Section 4, the experiments corresponding to Table 2 are performed using four different architectures. LeNet5 is used as introduced in lecun1998gradient: it has two convolutional layers followed by two fully connected layers that are connected to a softmax layer. We use our own implementation of LeNet5 with all the hyper-parameters from bigballon2017cifar10cnn; however, we train the networks for 100 epochs.

The All-CNN architecture, which only uses convolutional layers, was introduced in springenberg2014striving. Since we could not reproduce the exact top-1 accuracy numbers on the CIFAR-10 dataset using the specifications in the main article, we used our own implementation. We use a learning rate of 0.1, with a decay rate of 1e-6 and momentum of 0.9. The batch size is set to 64 and we train the networks for 300 epochs. The rest of the hyper-parameters are the same as those mentioned in springenberg2014striving.

For ResNet architectures, we use a popular variant, ResNet-20, introduced in he2016deep, which has 0.27M parameters. Our implementation of ResNet-20 is taken from chollet2015keras and bigballon2017cifar10cnn. All the hyper-parameters, including batch size, number of epochs, initialization, learning rate and its decay, and optimizer, are left at the default values of the mentioned repositories.

Lastly, the Net-in-Net architecture, which uses an MLP instead of a fixed nonlinear transformation, is taken from bigballon2017cifar10cnn. We use the same set of hyper-parameters (batch size, number of epochs, learning rate, etc.) as mentioned in bigballon2017cifar10cnn.

In Table 9, we show the statistical significance of the experiments performed in Section 4. Since each number is the average of five experiments, we are able to perform a t-test and provide p-values and statistical significance for each individual experiment. As one can see in Table 9, most of the comparisons in Table 2 are statistically significant.

| Model (comparison) | MNIST | CIFAR-10 | CIFAR-10 (D-A) | CIFAR-100 | CIFAR-100 (D-A) |
| LeNet5 (PReLU vs SPLASH) | 0.057 | 0.043 | 0.055 | - | - |
| Net in Net (ReLU vs SPLASH) | - | 0.042 | 0.039 | 0.038 | 0.055 |
| All-CNN (maxout vs SPLASH) | - | 0.041 | 0.050 | 0.066 | 0.061 |
| ResNet-20 (PReLU vs SPLASH) | - | 0.033 | 0.044 | 0.046 | 0.044 |

Table 9: For each architecture, the best activation among ReLU, leaky-ReLU, PReLU, tanh, sigmoid, ELU, maxout (nine features), and Swish is chosen by the minimum average error rate. The significance of the comparison between the best such network and the network with SPLASH activation is then calculated through a t-test. The p-values for each comparison are provided above.
Figure 5: Training loss trajectory for different SPLASH initializations compared to fixed ReLU and leaky ReLU.

In Section 4, we used the ResNet-20 architecture to visualize SPLASH shapes at different stages of the training process. Here we include two more plots showing the evolution of SPLASH units during training. Figure 6 and Figure 7 show the evolution of SPLASH units during the training of the MLP and LeNet5 architectures, respectively. Both architectures are described in Section 4.

Figure 6: Shape of SPLASH activation during training a simple network of MLPs on MNIST dataset.
Figure 7: Shape of SPLASH during training a LeNet5 architecture on CIFAR-10 dataset.

In Section 5, we start with a tSNE visualization of 100 random samples of frog and ship images from the CIFAR-10 test set. The tSNE mapping is performed using a learning rate of 30 and a perplexity of 40.

For the black-box adversarial attack experiments, each network is attacked five times, and the reported number is the average number of successful modifications over the five attacks. One-pixel attacks are performed with the maximum number of iterations set to 40 and the population size set to 400. For the boundary attack, we use the implementation in rauber2017foolbox. To reduce the rate of successful attacks, the hyper-parameter steps is set to 6000. All other hyper-parameters are left at the defaults of the mentioned implementation.

As for the open-box attacks, for both FGSM and the CW-L2 attack, we employ the implementation and default hyper-parameters in rauber2017foolbox. However, to reduce the attack success rate for the CW technique, we use 7 and 1000 for the variables binary search steps and steps, respectively. The network architectures used for the experiments in this section are identical to the architectures used in Section 4.

Lastly, four commonly used activation functions were used to train the different DNNs in Section 5: ReLU, APL (with fixed hinges), Swish, and SPLASH (with the configuration mentioned in the previous sections).