1 Introduction
Nonlinear activation functions are fundamental for deep neural networks (DNNs). They determine the class of functions that DNNs can implement and influence their training dynamics, thereby affecting their final performance. For example, DNNs with rectified linear units (ReLUs) [nair2010rectified] have been shown to perform better than logistic and tanh units in several scenarios [pedamonti2018comparison, nwankpa2018activation, nair2010rectified, goodfellow2016deep]. Instead of using a fixed activation function, one can use a parameterized activation function and learn its parameters to add flexibility to the model. Piecewise linear functions are a reasonable choice for the parameterization of activation functions [agostinelli2014learning, He_2015, ramachandran2017searching, jin2016deep, li2016multi] due to their straightforward parameterization and their ability to approximate nonlinear functions [garvin1957applications, stone1961approximation]. However, in the context of deep neural networks, the best way to parameterize these piecewise linear activation functions is still an open question. Previous piecewise linear activation functions either sacrifice expressive power for simplicity (i.e., having few parameters) or sacrifice simplicity for expressive power. While expressive power allows deep neural networks to approximate complicated functions, simplicity can make optimization easier by adding useful inductive biases and reducing the size of the hypothesis space. Therefore, we set out to find a parameterized piecewise linear activation function that is as simple as possible while maintaining the ability to approximate a wide range of functions.
Piecewise linear functions, in the most general form, are real-valued functions defined as line segments with hinges that denote where one segment ends and the next segment begins. As detailed in Section 3, a function of this most general form with $S$ hinges requires $3S + 2$ parameters. Many functions in this hypothesis space, such as discontinuous functions, are unlikely to be useful activation functions. Therefore, we significantly reduce the size of the hypothesis space while maintaining the ability to approximate a wide range of useful activation functions.
We restrict the form of the piecewise linear function to be continuous and grounded (having an output of zero for an input of zero) with symmetric and fixed hinges. By doing so, we reduce the number of parameters to $S + 1$, where $S$ is the number of hinges. Furthermore, we still maintain the ability to approximate almost every successful deep neural network activation function. We call this parameterized piecewise linear activation function SPLASH (Simple Piecewise Linear and Adaptive with Symmetric Hinges).
Typically, learned activation functions are evaluated in terms of accuracy on a test set. We compare the classification accuracy of SPLASH units to nine other learned and fixed activation functions and show that SPLASH units consistently give superior performance. We also perform ablation studies to gain insight into why SPLASH units improve performance and show that the flexibility of the SPLASH units during training significantly affects the final performance. Finally, we evaluate the robustness of SPLASH units to adversarial attacks [szegedy2013intriguing, goodfellow2014explaining, nguyen2015deep]. When compared to ReLUs, SPLASH units reduce the success of adversarial attacks by up to 31%, without any modifications to how they are parameterized or learned.
2 Related Work
Variants of ReLUs, such as leaky ReLUs [maas2013rectifier], exponential linear units (ELUs) [clevert2015fast], and scaled exponential linear units (SELUs) [klambauer2017self], have been shown to improve upon ReLUs. ELUs and SELUs encourage the outputs of the activation functions to have zero mean, while SELUs also encourage the outputs to have unit variance. Neural architecture search [ramachandran2017searching] has also discovered novel activation functions, in particular the Swish activation function. The Swish activation function is defined as $f(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the logistic sigmoid, and performs slightly better than ReLUs. It is worth mentioning that, in lin2013network, the authors proposed the network-in-network approach, in which activation functions in convolutional layers are replaced with small multilayer perceptrons. Theoretically, by the universal approximation theorem [csaji2001approximation], this is the most expressive activation function; however, it requires many more parameters.
Some of the early attempts to learn activation functions in neural networks can be found in poli1996parallel, weingaertner2002hierarchical, and khan2013fast, where the authors proposed learning the best activation function per neuron among a pool of candidate activation functions using genetic and evolutionary algorithms. Maxout [goodfellow2013maxout] was introduced as an activation function aimed at enhancing the model averaging properties of dropout [srivastava2014dropout]. However, not only is it limited to approximating convex functions, but it also requires a significant increase in parameters.
APL units [agostinelli2014learning], PReLUs [He_2015], and SReLUs [jin2016deep] are adaptive activation functions from the piecewise linear family that can mimic both convex and nonconvex functions. Of these activation functions, APL units are the most general. However, they require a parameter for the slope of each line segment as well as for the location of each hinge. Additionally, APL units give more expressive power to the left half of the input space than to the right half. Furthermore, the locations of the hinges are not determined by the data and, therefore, it is possible that some line segments may go unused. SReLUs also learn the slopes of the line segments and the locations of the hinges; however, the initial locations of the hinges are determined by the data. SReLUs have less expressive power than APL units as the form of the function is restricted to only have two hinges. PReLUs are the simplest of these activation functions, with one fixed hinge where only the slope of one of the line segments is learned. On the other hand, SPLASH units can have few or many hinges, and the locations of the hinges are fixed and determined by the data. Therefore, only the slopes of the line segments have to be learned. Furthermore, SPLASH units give equal expressive power to the left and the right half of the input space.
3 From Piecewise Linear Functions to SPLASH Units
3.1 Family of Piecewise Linear Functions
Given $S + 1$ line segments and $S$ hinges, piecewise linear functions can be parameterized with $3S + 2$ parameters: one parameter for the slope and one parameter for the y-intercept of each segment, plus $S$ parameters for the locations of the hinges. We reduce the number of parameters to $S + 1$ while still being able to approximate a wide range of functions by restricting the activation function to be continuous and grounded with symmetric and fixed hinges.
Continuous
The general form of piecewise linear functions allows for discontinuous functions. Because virtually all successful activation functions are continuous, we argue that continuous learnable activation functions will still provide sufficient flexibility for DNNs. For a continuous piecewise linear function, we need to specify the y-intercept of one segment, the slopes of the segments, as well as the locations of the hinges, reducing the number of parameters to $2S + 2$.
Grounded
Furthermore, we restrict the function to be grounded, that is, having an output of zero for an input of zero. We can do this without loss of generality as a function that is not grounded can still be created with the use of a bias. Since the y-intercept is fixed at zero, we no longer have to specify the y-intercept for any of the segments, reducing the number of parameters to $2S + 1$.
Symmetric Hinges
In our design, we place the hinges in symmetric locations on the positive and negative halves of the x-axis, giving equal expressive power to each half. This allows, if need be, the activation function to approximate both even and odd functions. Because the location of one hinge determines the location of another, we can reduce the number of parameters for the hinges to $\lfloor S/2 \rfloor$. In the case of an odd number of hinges, one hinge will be fixed at zero to maintain symmetry. This reduces the number of parameters to $S + 1 + \lfloor S/2 \rfloor$.
Fixed Hinges
Finally, we address the issue of where to set the exact location of each hinge. It is important that each segment has the potential to influence the output of the function. The distribution of the input could be such that only some of the segments influence the output while others remain unused. In the worst case, the input could be concentrated on a single segment, reducing the activation function to just a linear function. To ensure that each segment is able to play a role in the output of the function, we train our DNNs using batch normalization [ioffe2015batch]. At the beginning of training, batch normalization ensures that, for each batch, the input to the activation function has a mean of zero and a standard deviation of one. Using this knowledge, we can place the hinges at fixed locations that correspond to a certain number of standard deviations away from the mean. With the location of the hinges fixed, the number of parameters is reduced to $S + 1$. This activation function can approximate the vast majority of existing activation functions, such as tanh units, ReLUs, leaky ReLUs, ELUs, and, with the use of a bias, logistic units. We show the different types of piecewise linear functions that we have described in Table 1.
Table 1: Type, number of parameters (# Params), and visualization (Viz) of each piecewise linear function described above.
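The parameter counts in this section follow from simple bookkeeping, which can be sketched in a few lines. The sketch below assumes $S$ hinges and $S + 1$ line segments; the function name and dictionary keys are illustrative and do not come from the paper.

```python
def piecewise_linear_param_counts(num_hinges: int) -> dict:
    """Count free parameters under each restriction, for S hinges and S + 1 segments."""
    s = num_hinges
    segments = s + 1
    return {
        # slope and y-intercept per segment, plus one location per hinge
        "general": 2 * segments + s,
        # continuity leaves one y-intercept, all slopes, and all hinge locations
        "continuous": 1 + segments + s,
        # grounding fixes the remaining y-intercept at zero
        "grounded": segments + s,
        # mirrored hinge pairs share a single location parameter
        "symmetric_hinges": segments + s // 2,
        # with fixed hinge locations only the slopes are learned
        "fixed_hinges": segments,
    }


if __name__ == "__main__":
    for name, count in piecewise_linear_param_counts(7).items():
        print(f"{name}: {count}")
```

For any $S$, the last entry equals $S + 1$, which is the number of slopes that remain to be learned once the hinges are fixed.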
3.2 SPLASH Units
We formulate the activation of a hidden unit, $h(x)$, as the summation of $S + 1$ max functions with symmetric offsets, where $S$ is an odd number and one of the offsets is zero:
$$h(x) = \sum_{s=1}^{(S+1)/2} a_s^{+} \max(0, x - b_s) + \sum_{s=1}^{(S+1)/2} a_s^{-} \max(0, -x - b_s) \tag{1}$$
The first summation contains max functions with a nonzero output starting at $b_s$ and continuing to infinity. The second summation contains max functions with a nonzero output starting at $-b_s$ and continuing to negative infinity. When summed together, these max functions form continuous and grounded line segments with hinges located at $b_s$ and $-b_s$. To ensure the function has symmetric and fixed hinges, we use the same $b_s$ in both summations, where $b_s \geq 0$ for all $s$; furthermore, the values of $b_s$ remain fixed during training. Since we are using batch normalization, we fix the position of each hinge $b_s$ to be a predetermined number of standard deviations away from the mean. We ensure there is always one hinge at zero by setting $b_1$ to zero. The learned parameters $a_s^{+}$ and $a_s^{-}$ determine the slope of each line segment and are shared across all units in a layer. Therefore, SPLASH units add $S + 1$ parameters per layer. We study the effect of different initializations as well as the effect of the number of hinges, $S$, on training accuracy. From our experiments, we found that initializing SPLASH units to have the shape of a ReLU, together with a suitable choice of $S$, gave the best results. More details are given in the appendix.
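As an illustration of the sum-of-max-functions form described above, the following minimal sketch evaluates a SPLASH-style unit for a scalar input. It assumes the form $\sum_s a_s^{+} \max(0, x - b_s) + \sum_s a_s^{-} \max(0, -x - b_s)$ with $b_1 = 0$; the hinge values used in the demo are hypothetical and chosen only for illustration, and the function name is not from the paper.

```python
def splash(x, a_pos, a_neg, hinges):
    """Evaluate a SPLASH-style unit at a scalar input x.

    hinges: non-negative offsets b_1..b_m with b_1 == 0 (fixed, not learned)
    a_pos:  slopes applied to max(0, x - b_s)   (right half of the input space)
    a_neg:  slopes applied to max(0, -x - b_s)  (left half of the input space)
    """
    right = sum(a * max(0.0, x - b) for a, b in zip(a_pos, hinges))
    left = sum(a * max(0.0, -x - b) for a, b in zip(a_neg, hinges))
    return right + left


# Hypothetical hinge locations, in units of standard deviations.
HINGES = [0.0, 1.0, 2.0, 3.0]
# ReLU-shaped initialization: first positive slope is 1, everything else is 0.
A_POS = [1.0, 0.0, 0.0, 0.0]
A_NEG = [0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    for value in (-2.0, 0.0, 0.5, 2.0):
        print(value, splash(value, A_POS, A_NEG, HINGES))
```

With the ReLU-shaped initialization the unit returns $\max(0, x)$, and it returns zero at zero for any choice of slopes, which is the grounded property.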
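The following theorem shows that SPLASH units can approximate any nonlinear and uniformly continuous function that has an output of zero for an input of zero on a closed interval of real numbers.
Theorem
For any function $f: [A, B] \to \mathbb{R}$ with $f(0) = 0$ and any $\epsilon > 0$, there exists a number of hinges $S$ such that $|f(x) - \mathrm{SPLASH}(x)| < \epsilon$ for all $x \in [A, B]$, assuming:

$A$ and $B$ are finite real numbers.

$f$ is uniformly continuous.
Proof
Uniform continuity of $f$ implies that for every $\epsilon > 0$ there exists $\delta > 0$ such that, for every $x_1$ and $x_2$ with $|x_1 - x_2| < \delta$, we have $|f(x_1) - f(x_2)| < \epsilon$. Placing $S$ equally spaced hinges on the interval $[A, B]$ divides it into $S + 1$ equal subintervals $[b_i, b_{i+1}]$. We choose $S$ to be greater than $(B - A)/\delta$, so the length of each subinterval is smaller than $\delta$. For any of the subintervals starting at $b_i$, we approximate $f$ by the line segment which connects $(b_i, f(b_i))$ to $(b_{i+1}, f(b_{i+1}))$. Due to the linear form of SPLASH(x) for $x \in [b_i, b_{i+1}]$:
$$\min(f(b_i), f(b_{i+1})) \leq \mathrm{SPLASH}(x) \leq \max(f(b_i), f(b_{i+1})) \tag{2}$$
$f$ is uniformly continuous, and both $|x - b_i|$ and $|x - b_{i+1}|$ are smaller than $\delta$, so:
$$|f(x) - \mathrm{SPLASH}(x)| \leq \max(|f(x) - f(b_i)|, |f(x) - f(b_{i+1})|) < \epsilon \tag{3}$$
Now we need to show that the SPLASH function (i.e., Equation 1) is able to connect $(b_i, f(b_i))$ to $(b_{i+1}, f(b_{i+1}))$ for $x \in [b_i, b_{i+1}]$. We do so by a simple induction as follows: suppose that SPLASH connects $(b_{i-1}, f(b_{i-1}))$ to $(b_i, f(b_i))$ for $x \in [b_{i-1}, b_i]$. The slope of SPLASH in the subinterval $[b_{i-1}, b_i]$ is $\sum_{s=1}^{i-1} a_s^{+}$ or $-\sum_{s=1}^{i-1} a_s^{-}$ (depending on the sign of the subinterval), where $a_s^{+}$ and $a_s^{-}$ are the learned slopes of Equation 1. However, the slope of SPLASH in the subinterval $[b_i, b_{i+1}]$ is either $\sum_{s=1}^{i} a_s^{+}$ or $-\sum_{s=1}^{i} a_s^{-}$. In both cases, the extra term $a_i^{+}$ or $a_i^{-}$ can change the slope to any arbitrary value. This fact, plus the continuity of SPLASH, guarantees that SPLASH connects $(b_i, f(b_i))$ to $(b_{i+1}, f(b_{i+1}))$, which was our proposed approximation. Finally, since SPLASH is grounded (SPLASH(0)=0), this approximation by line segments can only approximate functions where $f(0) = 0$.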
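The inductive construction in the proof can be tried numerically. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it takes a grounded target function on a symmetric interval $[-B, B]$, places equally spaced hinges on each half, and sets each slope increment so that the unit interpolates the target at the knots. The helper names (`fit_splash`, `max_error`) and the choice of tanh as the target are hypothetical.

```python
import math


def splash(x, a_pos, a_neg, hinges):
    """Sum of max functions with symmetric offsets (hinges[0] must be 0)."""
    right = sum(a * max(0.0, x - b) for a, b in zip(a_pos, hinges))
    left = sum(a * max(0.0, -x - b) for a, b in zip(a_neg, hinges))
    return right + left


def fit_splash(f, bound, num_pieces):
    """Pick slopes so the unit interpolates f at equally spaced knots on [-bound, bound].

    Assumes f(0) == 0. Returns (hinges, a_pos, a_neg).
    """
    step = bound / num_pieces
    knots = [i * step for i in range(num_pieces + 1)]
    hinges = knots[:-1]
    a_pos, a_neg = [], []
    prev_pos = prev_neg = 0.0
    for left, right in zip(knots, knots[1:]):
        # Target slope on [left, right], and the rise per unit of |x| on [-right, -left].
        slope_pos = (f(right) - f(left)) / step
        slope_neg = (f(-right) - f(-left)) / step
        # Each new max term only has to supply the change in slope at its hinge.
        a_pos.append(slope_pos - prev_pos)
        a_neg.append(slope_neg - prev_neg)
        prev_pos, prev_neg = slope_pos, slope_neg
    return hinges, a_pos, a_neg


def max_error(f, bound, num_pieces, samples=2001):
    """Largest absolute gap between f and its SPLASH fit on a uniform grid."""
    hinges, a_pos, a_neg = fit_splash(f, bound, num_pieces)
    grid = [-bound + 2.0 * bound * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - splash(x, a_pos, a_neg, hinges)) for x in grid)


if __name__ == "__main__":
    for pieces in (4, 16, 64):
        print(pieces, max_error(math.tanh, 3.0, pieces))
```

Increasing the number of pieces shrinks the subintervals, and the printed error shrinks with it, which mirrors the role of $S$ in the theorem.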
4 Accuracy
4.1 Comparison to Other Activation Functions
In order to show that SPLASH units are beneficial for deep neural networks, we compare them with well-known activation functions in different architectures. We train LeNet5, Network-in-Network, AllCNN, and ResNet20 on three different datasets: MNIST [lecun1998gradient], CIFAR10, and CIFAR100 [krizhevsky2009learning]. We fix the number of hinges $S$ and the locations of the hinges $b_s$ before training. $a_1^{+}$ is initialized to 1 and the remaining slopes are initialized to 0. With this initialization, the starting shape of a SPLASH unit mimics the shape of a ReLU.
With the exception of the AllCNN architecture, moderate data augmentation is performed as explained in he2016deep. Moderate data augmentation adds horizontally flipped examples of all images to the training set as well as random translations with a maximum translation of 5 pixels in each dimension. For the AllCNN architecture, we use heavy data augmentation, which is introduced in springenberg2014striving. More details on the hyperparameters are given in the appendix.
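The moderate data augmentation described above (horizontal flips plus random translations of at most 5 pixels per dimension) can be sketched as follows. This is a simplified illustration on single-channel images stored as nested lists; the zero fill for uncovered pixels and all function names are assumptions, not details taken from he2016deep.

```python
import random


def horizontal_flip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, dy, fill=0):
    """Shift an image by (dx, dy) pixels; uncovered pixels take the value `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < height and 0 <= src_c < width:
                shifted[r][c] = image[src_r][src_c]
    return shifted


def augment_training_set(images, rng, max_shift=5):
    """Add a flipped copy of every image, then randomly translate each example."""
    with_flips = list(images) + [horizontal_flip(img) for img in images]
    return [
        translate(img, rng.randint(-max_shift, max_shift), rng.randint(-max_shift, max_shift))
        for img in with_flips
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    tiny = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(len(augment_training_set([tiny], rng, max_shift=1)))
```

In practice the translation would usually be redrawn every epoch rather than applied once, but the one-shot version keeps the sketch short.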
We compare SPLASH units to ReLUs, leakyReLUs, PReLUs, APL units, tanh units, sigmoid units, ELUs, maxout units with nine features, and Swish units. We tune the hyperparameters for each DNN using ReLUs and use the same hyperparameters for each activation function. The results of the experiments are shown in Table 2. We report the average and the standard deviation of the error rate on the test set across five runs. The table shows that SPLASH units have the best performance across all datasets and architectures.
Table 2: Test error rates (%) for each architecture and activation function. DA denotes data augmentation; a dash marks a value that is not reported.
Activation                                   MNIST   CIFAR10   CIFAR10 DA   CIFAR100   CIFAR100 DA
LeNet5 + ReLU [bigballon2017cifar10cnn]      -       31.22     23.77        -          -
LeNet5 (ours) + ReLU                         1.11    30.98     23.41        -          -
LeNet5 (ours) + PReLU                        1.13    30.71     23.33        -          -
LeNet5 (ours) + SPLASH                       1.03    30.14     22.93        -          -
Net in Net + ReLU [lin2013network]           -       10.41     8.81         35.68      -
Net in Net (ours) + ReLU                     -       9.71      8.11         36.06      32.98
Net in Net + APL [agostinelli2014learning]   -       9.59      7.51         34.40      30.83
Net in Net (ours) + SPLASH                   -       9.21      7.29         33.91      30.32
AllCNN + ReLU [springenberg2014striving]     -       9.08      7.25         33.71      -
AllCNN (ours) + ReLU                         -       9.24      7.42         34.11      32.43
AllCNN (ours) + maxout                       -       9.19      7.45         34.21      32.33
AllCNN (ours) + SPLASH                       -       9.02      7.18         33.14      32.06
ResNet20 + ReLU [he2016deep]                 -       -         8.75         -          -
ResNet20 (ours) + ReLU                       -       10.65     8.71         34.54      32.63
ResNet20 (ours) + APL                        -       10.29     8.59         34.62      32.51
ResNet20 (ours) + SPLASH                     -       9.98      8.18         33.97      32.12
4.2 Insights into why SPLASH Units Improve Accuracy
Figure 1 shows how the shape of the SPLASH units changes during training for the ResNet20 architecture. From these figures, we can see that, during the early stages of training, the SPLASH units have a negative output for a negative input and a positive output for a positive input. During the later stages of training, SPLASH units have a positive output for both a negative input and a positive input. SPLASH units look similar to leaky ReLUs during the early stages of training and look similar to a symmetric function during the later stages of training.
To better understand why SPLASH units lead to better performance, we used the final shape of the SPLASH units as a fixed activation function to train another ResNet20 architecture. In Figure 2, we can see that the performance is only as good as that of ReLUs. This leads us to believe that the evolution of the shape of the SPLASH units during training is crucial to obtaining improved performance. Since we observed that SPLASH units would first give a negative output for a negative input and then give a positive output for a negative input, we train ResNet20 with SPLASH units under two different conditions: 1) the first slope on the negative half of the input is forced to be only positive, yielding a negative output for the line segment at zero (SPLASH-negative units) and 2) the first slope on the negative half of the input is forced to be only negative, yielding a positive output for the line segment at zero (SPLASH-positive units).
The performance of SPLASH-positive and SPLASH-negative units is shown in Figure 2. The figure shows that, although SPLASH-positive units have the ability to mimic the final learned shape of SPLASH units, they perform worse than SPLASH units and only slightly better than ReLUs. This shows that the ability to give a negative output for a negative input is crucial for SPLASH units. Furthermore, SPLASH-negative units perform better than SPLASH-positive units, but still worse than SPLASH units. In addition, we see that SPLASH-negative units exhibit a relatively quick decrease in the training loss, similar to that of SPLASH units, but do not reach the final training loss of SPLASH units. These observations suggest that the flexibility of the learnable activation function plays a crucial role in the final performance.
4.3 Tradeoffs
The benefits of SPLASH units come at the cost of longer training time. The average per-epoch training time and the final error rate of a variety of fixed and learned activation functions are reported in Table 3. The table shows that training with SPLASH units can take between 1.2 and 3 times longer, depending on $S$ and the chosen architecture. We see that the error rate does not significantly decrease beyond a moderate number of hinges, which guided our choice of $S$ for our experiments. While, for many deep learning algorithms, obtaining better performance often comes at the cost of longer training times, in Section 5, we show that SPLASH units also improve the robustness of deep neural networks to adversarial attacks.
Table 3: Average per-epoch training time (T) and final error rate in % (E) for each activation function.
                          SPLASH                                    Tanh    Maxout  ReLU    Swish   APL
                          S=3     S=5     S=7     S=9     S=11
MNIST (MLP)          T    10      14      16      18      19        8       13      6       7       14
                     E    1.57    1.33    1.13    1.10    1.12      1.88    1.45    1.35    1.35    1.40
CIFAR10 (LeNet5)     T    21      24      29      33      35        19      22      17      17      24
                     E    30.79   30.57   30.20   30.14   30.11     31.14   31.01   30.88   30.69   30.66
5 Robustness to Adversarial Attacks
DNNs have been shown to be vulnerable to many types of adversarial attacks [szegedy2013intriguing, goodfellow2014explaining]. Research suggests that activation functions are a major cause of this vulnerability [zantedeschi2017efficient, brendel2017decision]. For example, zhang2018efficient bounded a given activation function using linear and quadratic functions with adaptive parameters and applied a different activation for each neuron to make neural networks robust to adversarial attacks. wang2018adversarial proposed a data-dependent activation function and empirically showed its robustness to both black-box and gradient-based adversarial attacks. Other studies, such as rakin2018defend, dhillon2018stochastic, and song2018defense, focused on other properties of activation functions, such as quantization and pruning, and showed that they can improve the robustness of DNNs to adversarial examples.
Recently, the authors of zhao2016suppressing showed theoretically that DNNs with symmetric activations are less likely to be fooled. The authors proved that “symmetric units suppress unusual signals of exceptional magnitude which result in robustness to adversarial fooling and higher expressibility.” Because SPLASH units are capable of approximating a symmetric function, they may also be capable of increasing the robustness of DNNs to adversarial attacks. In this section, we show that SPLASH units greatly improve the robustness of DNNs to adversarial attacks. This claim is verified through a wide range of experiments with the CIFAR10 dataset under both black-box and open-box methods, including the one-pixel attack and the fast gradient sign method.
An intuition for why a DNN with SPLASH units is more robust than a DNN with ReLUs is provided in Figure 3. For each of the two networks, we take 100 random samples of frog and ship images and visualize the pre-softmax representations using t-SNE [maaten2008visualizing]. The figure shows that the two classes have less overlap for the DNN with SPLASH units than for the DNN with ReLUs.
5.1 Black-Box Adversarial Attacks
For black-box adversarial attacks, we assume the adversary has no information about the parameters of the DNN. The adversary can only observe the inputs to the DNN and the outputs of the DNN, similar to a cryptographic oracle. We test the robustness of DNNs with SPLASH units using two powerful black-box adversarial attacks, namely, the one-pixel attack and the boundary attack.
5.1.1 One Pixel Attack
A successful one pixel attack was proposed by Su_2019, which is based on differential evolution. Using this technique, we can iteratively generate adversarial images to try to minimize the confidence of the true class. The process starts with randomly modifying a few pixels to generate adversarial examples. At each step, several adversarial images are fed to the DNN and the output of the softmax function is observed. Examples that lowered the confidence of the true class will be kept to generate the next generation of adversaries. New adversarial images are then generated through mutations. By repeating these steps for a few iterations, the adversarial modifications generate more and more misleading images. The last step returns the adversarial modification that reduced the confidence of the true class the most, with the goal being that a class other than the true class has the highest confidence.
In the following experiment, we modify one, three, and five pixels of images to generate adversarial examples. The mutation scheme we used for this experiment is as follows:
$$x_i(g+1) = x_{r_1}(g) + F \left( x_{r_2}(g) - x_{r_3}(g) \right) \tag{4}$$
where $r_1$, $r_2$, and $r_3$ are three distinct random indices of the modifications at step $g$, and $F$ is a scale factor. $x_i(g+1)$ will be an element of a new candidate modification.
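The mutation step of differential evolution can be sketched in a few lines. The sketch assumes the standard rule in which a new candidate is built from three distinct, randomly chosen members of the current population; the scale factor of 0.5, the encoding of a candidate as a flat list of numbers, and the function name are assumptions made for illustration.

```python
import random


def mutate(population, scale, rng):
    """Produce one child per population member from three distinct random members.

    population: list of candidates, each a list of floats
                (for example pixel coordinates followed by colour values).
    scale:      factor applied to the difference of two members.
    """
    children = []
    for _ in population:
        r1, r2, r3 = rng.sample(range(len(population)), 3)  # three distinct indices
        children.append([
            base + scale * (first - second)
            for base, first, second in zip(population[r1], population[r2], population[r3])
        ])
    return children


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical candidates: (row, column, red, green, blue) of one modified pixel.
    population = [[rng.uniform(0, 31), rng.uniform(0, 31),
                   rng.uniform(0, 255), rng.uniform(0, 255), rng.uniform(0, 255)]
                  for _ in range(6)]
    for child in mutate(population, scale=0.5, rng=rng):
        print([round(v, 1) for v in child])
```

A full attack would additionally clip each child to valid pixel ranges and keep a child only if it lowers the confidence of the true class more than its parent does.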
To evaluate the effect of SPLASH units on the robustness of DNNs, we employ commonly used architectures, namely, LeNet5, Network-in-Network, AllCNN, and ResNet20. Each architecture is trained with ReLUs, APL units, Swish units, and SPLASH units. The results are shown in Table 4. The results show that SPLASH units significantly improve robustness to adversarial attacks for all architectures and outperform all other activation functions. In particular, for LeNet5 and ResNet20, SPLASH units improve performance over ReLUs by 31% and 28%, respectively.
Table 4: Number of successful one-pixel, three-pixel, and five-pixel attacks, and the average measurement described in the text, for each model and activation function.
Model        Activation   one-pixel   three-pixels   five-pixels   measurement
LeNet5       ReLU         736         803            868           0.740
             Swish        701         780            840           0.805
             APL          635         709            781           0.465
             SPLASH       514         588            651           0.540
Net in Net   ReLU         644         701            769           0.621
             Swish        670         715            760           0.419
             APL          521         661            703           0.455
             SPLASH       449         530            599           0.311
AllCNN       ReLU         580         661            707           0.366
             Swish        597         630            699           0.511
             APL          509         581            627           0.295
             SPLASH       471         515            570           0.253
ResNet20     ReLU         689         721            781           0.551
             Swish        650         689            730           0.601
             APL          579         631            692           0.290
             SPLASH       493         544            579           0.332
After observing adversarial samples that deceive both DNNs with ReLUs and DNNs with SPLASH units, we found that DNNs with SPLASH units still assign higher confidence to the true labels of the perturbed images than DNNs with ReLUs and Swish units. More precisely, we average a measurement computed from $f(x')$ over all adversarial samples where both networks are fooled, where $f$ is the output of the softmax layer and $x'$ is the adversarial sample. For each model, this measurement is included in Table 4. The results show that SPLASH units often have a smaller average value, again showing that SPLASH units are more robust to adversarial attacks.
5.1.2 Boundary Attacks
We use another black-box adversarial attack to further examine the effect SPLASH units have on the robustness of DNNs to adversarial fooling. Boundary attacks, which were recently introduced by brendel2017decision, are a powerful and commonly used black-box adversarial attack. Denoting the original input image and its corresponding target as the pair $(x, y)$, the attack algorithm is initialized from an adversarial pair $(x', y')$, where $y' \neq y$. Then, a random walk is performed for a fixed number of steps along the boundary between the adversarial region and the region of the true label such that (1) $x'$ stays in the adversarial region and (2) the distance towards the original image is reduced. The random walk uses the following three steps: (1) Draw a random sample from an i.i.d. Gaussian as the direction of the next move. (2) Project the sampled direction onto the sphere centered at $x$ with a radius of $\|x' - x\|$ and take a step of size $\delta$ in this projected direction. This step guarantees that the perturbed image gets closer to the original image at each step. (3) Make a move of size $\epsilon$ towards the original image. Ideally, this algorithm will converge to the adversarial sample $x'$ which is the closest to the original input $x$. The details and hyperparameters of the attack are explained in the appendix.
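One iteration of the random walk can be sketched as below. This is a simplified illustration, not the reference implementation of brendel2017decision: it takes the random step first and then rescales the result back onto the sphere around the original image, treats both step sizes as relative to the current distance, and uses a caller-supplied `is_adversarial` oracle. All names are hypothetical.

```python
import math
import random


def norm(vector):
    """Euclidean length of a list of floats."""
    return math.sqrt(sum(v * v for v in vector))


def boundary_step(x_adv, x_orig, is_adversarial, delta, eps, rng):
    """Propose one boundary-attack move; keep it only if it is still adversarial."""
    dist = norm([a - o for a, o in zip(x_adv, x_orig)])

    # (1) Random Gaussian direction, rescaled to length delta * dist.
    noise = [rng.gauss(0.0, 1.0) for _ in x_adv]
    scale = delta * dist / norm(noise)
    moved = [a + scale * n for a, n in zip(x_adv, noise)]

    # (2) Pull the moved point back onto the sphere of radius dist around x_orig.
    offset = [m - o for m, o in zip(moved, x_orig)]
    shrink = dist / norm(offset)
    on_sphere = [o + shrink * d for o, d in zip(x_orig, offset)]

    # (3) Move a fraction eps of the remaining distance towards the original image.
    candidate = [p + eps * (o - p) for p, o in zip(on_sphere, x_orig)]

    return candidate if is_adversarial(candidate) else x_adv


if __name__ == "__main__":
    rng = random.Random(0)
    original = [0.0, 0.0, 0.0]
    adversarial = [3.0, 4.0, 0.0]
    # Toy oracle: anything with a first coordinate above 1 counts as adversarial.
    oracle = lambda point: point[0] > 1.0
    for _ in range(20):
        adversarial = boundary_step(adversarial, original, oracle, 0.1, 0.05, rng)
    print(norm(adversarial))
```

Each accepted step shrinks the distance to the original image by the factor $1 - \epsilon$, while rejected proposals leave the current adversarial point unchanged.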
In what follows, we employ the same architectures and activation functions that were used in the previous section. The results of this attack are shown in Table 5. We observe that DNNs with SPLASH units are more robust to this adversarial attack than DNNs with APL units, ReLUs, and Swish units.
Table 5: Number of successful boundary attacks and the average measurement for each model and activation function.
Model        Activation   # of successful attacks   measurement
LeNet5       ReLU         801                       0.815
             Swish        779                       0.511
             APL          730                       0.541
             SPLASH       619                       0.401
Net in Net   ReLU         766                       0.502
             Swish        759                       0.391
             APL          654                       0.340
             SPLASH       598                       0.351
AllCNN       ReLU         744                       0.621
             Swish        700                       0.710
             APL          672                       0.480
             SPLASH       611                       0.421
ResNet20     ReLU         790                       0.548
             Swish        793                       0.566
             APL          711                       0.471
             SPLASH       621                       0.349
5.2 Open-Box Adversarial Attacks
For open-box adversarial attacks, the adversary now has information about the parameters of the DNN. To further explore the robustness of DNNs with SPLASH units, in this section, we consider two popular benchmarks of open-box adversarial attacks: the fast gradient sign method (FGSM) [goodfellow2014explaining] and Carlini and Wagner (CW) attacks [carlini2017towards]. For both attack methods, we consider four different architectures and compare the rate of successful attacks for each of the networks with ReLUs, Swish units, APL units, and SPLASH units. The dataset and architectures are the same as those used for black-box adversarial attacks.
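5.2.1 FGSM
FGSM generates an adversarial image $x'$ from the original image $x$ by maximizing the loss $L(x', y)$, where $y$ is the true label of the image $x$. This maximization problem is subject to $\|x' - x\|_\infty \leq \epsilon$, where $\epsilon$ is the attack strength. Using the first-order Taylor series approximation, we then have:
$$L(x', y) \approx L(x, y) + \nabla_x L(x, y)^{\top} (x' - x) \tag{5}$$
So the adversarial image would be:
$$x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x, y)) \tag{6}$$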
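The update $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x, y))$ can be illustrated on a model small enough to differentiate by hand. The sketch below assumes a linear classifier with logistic loss and labels in $\{-1, +1\}$, so the input gradient has a closed form; the model, the weights, and the function names are hypothetical stand-ins for a DNN and its backpropagated gradient.

```python
import math


def logistic_loss(weights, x, label):
    """Logistic loss of a linear model for a label in {-1, +1}."""
    margin = label * sum(w * xi for w, xi in zip(weights, x))
    return math.log1p(math.exp(-margin))


def input_gradient(weights, x, label):
    """Gradient of the logistic loss with respect to the input x."""
    margin = label * sum(w * xi for w, xi in zip(weights, x))
    coefficient = -label / (1.0 + math.exp(margin))
    return [coefficient * w for w in weights]


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(weights, x, label, eps):
    """Move every input coordinate by eps in the direction that increases the loss."""
    gradient = input_gradient(weights, x, label)
    return [xi + eps * sign(g) for xi, g in zip(x, gradient)]


if __name__ == "__main__":
    weights = [1.0, -2.0, 0.5]
    image = [0.2, 0.1, -0.3]
    adversarial = fgsm(weights, image, label=1, eps=0.1)
    print(logistic_loss(weights, image, 1), logistic_loss(weights, adversarial, 1))
```

Because every coordinate moves by exactly $\epsilon$ or not at all, the perturbation satisfies the $\|x' - x\|_\infty \leq \epsilon$ constraint by construction.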
The results for different values of $\epsilon$ are summarized in Table 6. The results show that SPLASH units are consistently better than all other activation functions with performance improvements of up to 28.5%.
Table 6: Number of successful FGSM attacks for three attack strengths, and the average measurement, for each model and activation function.
Model        Activation   # of successful attacks (one column per attack strength)   measurement
LeNet5       ReLU         690     755     825                                        0.710
             Swish        634     740     830                                        0.713
             APL          611     691     807                                        0.419
             SPLASH       493     598     772                                        0.521
Net in Net   ReLU         590     651     798                                        0.609
             Swish        577     619     750                                        0.439
             APL          531     607     719                                        0.561
             SPLASH       498     554     689                                        0.499
AllCNN       ReLU         561     653     741                                        0.590
             Swish        519     622     740                                        0.576
             APL          522     615     721                                        0.549
             SPLASH       479     588     676                                        0.333
ResNet20     ReLU         651     736     801                                        0.641
             Swish        639     730     793                                        0.522
             APL          609     701     749                                        0.303
             SPLASH       541     617     711                                        0.411
Table 7: Number of successful CW-L2 attacks and the average measurement for each model and activation function.
Model        Activation   # of successful attacks   measurement
LeNet5       ReLU         932                       0.801
             Swish        919                       0.713
             APL          922                       0.609
             SPLASH       898                       0.541
Net in Net   ReLU         916                       0.790
             Swish        919                       0.724
             APL          915                       0.653
             SPLASH       892                       0.674
AllCNN       ReLU         894                       0.611
             Swish        887                       0.631
             APL          876                       0.509
             SPLASH       863                       0.365
ResNet20     ReLU         903                       0.603
             Swish        911                       0.441
             APL          894                       0.590
             SPLASH       870                       0.541
5.2.2 CW-L2
Another open-box adversarial attack, which is generally more powerful than FGSM, was introduced in carlini2017towards. For a given image $x$ and label $y$, this technique tries to find the minimum perturbation $\delta$, so that the perturbed image $x + \delta$ is classified as $t \neq y$. Using the $L_2$ norm, this perturbation minimization problem can be formulated as follows:
$$\min_{\delta} \|\delta\|_2 \quad \text{subject to} \quad C(x + \delta) = t \tag{7}$$
To ease the satisfaction of the equality constraint, Equation 7 can be rephrased as $\min_{\delta} \|\delta\|_2^2 + c \cdot g(x + \delta)$, where $g(x) = \max(\max_{i \neq t} Z(x)_i - Z(x)_t, 0)$, $c$ is the Lagrange multiplier, $C(x)$ is the predicted class, and $Z(x)$ is the pre-softmax vector for the input $x$.
The robustness performance of ReLUs, Swish units, APL units, and SPLASH units for the CW-L2 attack is shown in Table 7. The table is consistent with previous results as it shows that SPLASH units are the most robust to this adversarial attack.
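The relaxed objective of the CW-L2 attack can be written down directly. The sketch below assumes the commonly used form in which the squared $L_2$ size of the perturbation is added to a weighted hinge-style penalty on the pre-softmax scores; the function name and the example numbers are hypothetical, and a real attack would minimize this quantity over the perturbation with a gradient-based optimizer.

```python
def cw_l2_objective(delta, logits, target, c):
    """Squared L2 size of the perturbation plus c times the misclassification penalty.

    delta:  perturbation added to the image, as a flat list of floats
    logits: pre-softmax scores of the perturbed image
    target: index of the class the attack wants the model to predict
    c:      weight that trades perturbation size against attack success
    """
    size = sum(d * d for d in delta)
    best_other = max(score for i, score in enumerate(logits) if i != target)
    # The penalty vanishes once the target class has the largest score.
    penalty = max(best_other - logits[target], 0.0)
    return size + c * penalty


if __name__ == "__main__":
    print(cw_l2_objective([0.1, -0.2], [1.0, 3.0, 0.5], target=0, c=10.0))
    print(cw_l2_objective([0.1, -0.2], [1.0, 3.0, 0.5], target=1, c=10.0))
```

Once the target class wins, only the perturbation size remains in the objective, so the optimizer is pushed towards the smallest perturbation that still fools the network.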
6 Conclusion
SPLASH units are simple and flexible parameterized piecewise linear functions that simultaneously improve both the accuracy and adversarial robustness of DNNs. They had the best classification accuracy across three different datasets and four different architectures when compared to nine other learned and fixed activation functions. When investigating the reason behind their success, we found that the final shape of the learnable SPLASH units did not serve as a good nonlearnable (fixed) activation function. Additionally, in our ablation studies, we saw that restricting the flexibility of the activation function hurts performance, even if the restricted activation function can still mimic the final shape of the unrestricted SPLASH units. It could be possible that changes in the activation functions play a particular role in shaping the loss landscape of deep neural networks [hochreiter1997flat, dauphin2014identifying, choromanska2015loss]. Future work will use visualization techniques [craven1992visualizing, gallagher2003visualization, li2018visualizing] to obtain an intuitive understanding of how learnable activation functions affect the optimization process.
Though no adversarial examples are shown during training, SPLASH units still significantly increase the robustness of DNNs to adversarial attacks. Prior research suggests that the reason for this may be related to their final shape, which looks visually similar to that of a symmetric function [zhao2016suppressing]. Given that research has shown that certain activation functions may make deep neural networks susceptible to adversarial attacks [croce2018randomized], it is possible that adding more inductive biases aimed at reducing these vulnerabilities may increase the robustness of learned activation functions to adversarial attacks. Since our ablation studies have shown the importance of having flexible activation functions during training, these inductive biases may need to allow for flexibility or be applied during the later stages of training, for example, in the form of a regularization penalty.
7 Acknowledgement
Work in part supported by ARO grant 76649CS, NSF grant 1839429, and NSF grant NRT 1633631 to PB. We wish to acknowledge Yuzo Kanomata for computing support.
References
8 Appendix
8.1 Initialization of SPLASH weights
In order to choose the best initialization of the SPLASH weights ($a_s^{+}$ and $a_s^{-}$), we compare the performance of five different LeNet5 architectures trained on CIFAR10. Each of these architectures uses a differently initialized SPLASH activation. Figure 4 shows that the leaky ReLU and ReLU initializations perform the best. Leaky ReLUs require us to determine the slope of the line segment on the left side of the x-axis, adding another parameter that may need tuning. Therefore, for simplicity, we use the ReLU initialization ($a_1^{+} = 1$, and all other parameters set to 0) in all of our experiments.
8.2 Number of Hinges
In this section, we perform a variety of experiments to find the best setting for SPLASH activation in terms of both complexity and performance.
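Table 8: Error rate and number of additional parameters for different numbers of hinges $S$. Each cell lists the shared SPLASH value followed by the independent SPLASH value.
S                         3               5               7               9               11
Error rate
MNIST                     1.57 / 1.61     1.33 / 1.39     1.13 / 1.17     1.10 / 1.08     1.12 / 1.08
CIFAR10                   30.79 / 30.55   30.57 / 30.29   30.20 / 30.18   30.14 / 30.22   30.11 / 30.19
# of additional params
MNIST                     12 / 1408       18 / 2112       24 / 2816       30 / 3520       36 / 4224
CIFAR10                   16 / 75k        24 / 120k       32 / 150k       40 / 180k       48 / 225k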
First, we assess the effect of on the performance of SPLASH. Due to Theorem 3.2, greater values increase the expressive power of the SPLASH which generally results better training performance. We tried , with symmetrically fixed hinges for SPLASH units. We also use MNIST [lecunmnisthandwrittendigit2010] and CIFAR10 [cifar10]. Each network is trained with two types of SPLASH activations; 1) A shared SPLASH: a shared unit among all neurons of a layer and 2) An independent SPLASH unit for each neuron of a layer. As it is summarized in Table 8, in all cases of there is no significant improvement in the performance of the DNNs.
On the other hand, as the number of SPLASH parameters increases, the activation units become more computationally expensive. In Table 3, we compare the per-epoch training runtime for different numbers of hinges of a shared SPLASH.
For small values of S, we can see that SPLASH is comparable to an exponential activation unit such as tanh, and much faster than heavier activations such as maxout.
8.3 Experiments’ Details and Statistical Significance
In this section, we explain the experimental conditions and all the parameters used for each experiment. Also, to make the results of Table 2 more interpretable, we perform a t-test [kim2015t] on all the error rates achieved in that experiment.
In section 4, the experiments corresponding to Table 2 are performed using four different architectures. LeNet5 is used as it was introduced in [lecun1998gradient]. It has two convolution layers followed by two MLPs that are connected to a softmax layer. We use our own implementation of LeNet5 with all the hyperparameters from [bigballon2017cifar10cnn]. However, we train the networks for 100 epochs.
The AllCNN architecture, which uses only convolutional layers, was introduced in [springenberg2014striving]. Since we could not reproduce the exact numbers for the top-1 accuracy on the CIFAR10 dataset using the specifications in the main article, we used our own implementation. We use a learning rate of 0.1, with a decay rate of 1e-6 and a momentum of 0.9. The batch size is set to 64, and we trained the networks for 300 epochs. The rest of the hyperparameters are the same as those mentioned in [springenberg2014striving].
For ResNet architectures, we try a popular variant, ResNet20, introduced in [he2016deep], which has 0.27M parameters. Our implementation of ResNet20 is taken from [chollet2015keras] and [bigballon2017cifar10cnn]. All the hyperparameters, including batch size, number of epochs, initialization, learning rate and its decay, and optimizer, are left at the default values of the mentioned repositories.
Lastly, the Net in Net architecture, which uses an MLP instead of a fixed nonlinear transformation, is taken from [bigballon2017cifar10cnn]. We use the same set of hyperparameters, including batch size, number of epochs, and learning rate, as in [bigballon2017cifar10cnn].
In Table 9, we show the statistical significance of the experiments performed in section 4. Since each number is the average of five experiments, we are able to perform a t-test and provide p-values and statistical significance for each individual experiment. As one can see in Table 9, most of the numbers in Table 2 are statistically significant.
Activation  MNIST  CIFAR10  CIFAR100  
    DA    DA  
LeNet5 (PReLU vs SPLASH)  0.057  0.043  0.055  
Net in Net (ReLU vs SPLASH)  0.042  0.039  0.038  0.055  
AllCNN (maxout vs SPLASH)  0.041  0.050  0.066  0.061  
ResNet20 (PReLU vs SPLASH)  0.033  0.044  0.046  0.044 
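The significance test described above can be sketched as follows. The text does not state which variant of the t-test is used, so this example assumes a two-sided Welch (unequal variance) test on two sets of five runs. The error rates are made-up placeholders rather than results from our experiments, and the helper names are hypothetical. The p-value is obtained by numerically integrating the Student t density, so only the standard library is needed.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


# Placeholder error rates (%) for five runs of two activations; NOT real data.
splash_runs = [30.1, 30.3, 30.2, 30.4, 30.0]
baseline_runs = [30.9, 31.1, 30.8, 31.0, 31.2]

if __name__ == "__main__":
    t, df, p = welch_t_test(splash_runs, baseline_runs)
    print(f"t = {t:.3f}, df = {df:.2f}, p = {p:.5f}")
```

With five runs per configuration the test has few degrees of freedom, so a difference between means is declared significant only when it is large relative to the run-to-run variation.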
In section 5, we use the ResNet20 architecture to visualize SPLASH shapes at different stages of the training process. Here we include two more plots showing the evolution of SPLASH units during training. Figure 6 and Figure 7 show the evolution of SPLASH units during the training of the MLP and LeNet5 architectures, respectively. Both architectures are described in section 4.
In section 6, we start with a t-SNE visualization of 100 random samples of frog and ship images from the CIFAR10 test set. The t-SNE mapping is performed using a learning rate of 30 and a perplexity of 40.
For the black-box adversarial attack experiments, each network is attacked five times, and the reported number is the average number of successful modifications over the five attacks. One-pixel attacks are run with a maximum of 40 iterations and a population size of 400. For the boundary attack, we use the implementation in [rauber2017foolbox]. To reduce the rate of successful attacks, the steps hyperparameter is set to 6000. All other hyperparameters are left at the defaults of the mentioned implementation.
As for the open-box attacks, for both the FGSM and CW-L2 attacks, we employ the implementation and default hyperparameters in [rauber2017foolbox]. However, to reduce the attack success rate of the CW technique, we use 7 and 1000 for the binary search steps and steps variables, respectively. The network architectures used for the experiments in this section are identical to the architectures used in section 4.
Lastly, four commonly used activation functions were used to train different DNNs in section 6. ReLU (), APL (, with fixed hinges on , Swish ( with , and SPLASH (with the configurations mentioned in the previous section) are used.