Kafnets: kernel-based non-parametric activation functions for neural networks

07/13/2017 · Simone Scardapane et al. · Sapienza University of Rome, Universidad de Cantabria

Neural networks are generally built by interleaving (adaptable) linear layers with (fixed) nonlinear activation functions. To increase their flexibility, several authors have proposed methods for adapting the activation functions themselves, endowing them with varying degrees of flexibility. None of these approaches, however, have gained wide acceptance in practice, and research on this topic remains open. In this paper, we introduce a novel family of flexible activation functions that are based on an inexpensive kernel expansion at every neuron. Leveraging several properties of kernel-based models, we propose multiple variations for designing and initializing these kernel activation functions (KAFs), including a multidimensional scheme that allows information from different paths in the network to be combined nonlinearly. The resulting KAFs can approximate any mapping defined over a subset of the real line, either convex or nonconvex. Furthermore, they are smooth over their entire domain, linear in their parameters, and they can be regularized using any known scheme, including the use of ℓ_1 penalties to enforce sparseness. To the best of our knowledge, no other known model satisfies all these properties simultaneously. In addition, we provide a relatively complete overview of alternative techniques for adapting the activation functions, which is currently lacking in the literature. A large set of experiments validates our proposal.


1 Introduction

Neural networks (NNs) are powerful approximators, built by interleaving linear layers with nonlinear mappings (generally called activation functions). The latter step is usually implemented using an element-wise, (sub-)differentiable, and fixed nonlinear function at every neuron. In particular, the current consensus has shifted from the use of contractive mappings (e.g., sigmoids) to the use of piecewise-linear functions (e.g., rectified linear units, ReLUs (glorot2010understanding)), allowing a more efficient flow of the backpropagated error (goodfellow2016deep). This relatively inflexible architecture might help explain the extreme redundancy found in the trained parameters of modern NNs (denil2013predicting).

Designing ways to adapt the activation functions themselves, however, faces several challenges. On one hand, we can parameterize a known activation function with a small number of trainable parameters, describing for example the slope of a particular linear segment (he2015delving). While immediate to implement, this only results in a small increase in flexibility and, in the general case, a marginal improvement in performance (agostinelli2014learning). On the other hand, a more interesting task is to devise a scheme allowing each activation function to model a large range of shapes, such as any smooth function defined over a subset of the real line. In this case, one (or more) hyper-parameters let the user trade off greater flexibility against a larger number of parameters per neuron. We refer to these schemes in general as non-parametric activation functions, since the number of (adaptable) parameters can potentially grow without bound.

There are three main classes of non-parametric activation functions known in the literature: adaptive piecewise linear (APL) functions (agostinelli2014learning), maxout networks (goodfellow2013maxout), and spline activation functions (guarnieri1999multilayer). They are described in more depth in Section 5, where we also argue that none of these approaches is fully satisfactory, meaning that each of them loses one or more desirable properties, such as smoothness of the resulting functions in the APL and maxout cases (see Table 1 in Section 6 for a schematic comparison).

In this paper, we propose a fourth class of non-parametric activation functions, which are based on a kernel representation of the function. In particular, we define each activation function as a linear superposition of several kernel evaluations, where the dictionary of the expansion is fixed beforehand by sampling the real line. As we show later on, the resulting kernel activation functions (KAFs) have a number of desirable properties, including: (i) they can be computed cheaply using vector-matrix operations; (ii) they are linear with respect to the trainable parameters; (iii) they are smooth over their entire domain; (iv) using the Gaussian kernel, their parameters have only local effects on the resulting shapes; (v) the parameters can be regularized using any classical approach, including the possibility of enforcing sparseness through the use of $\ell_1$ norms. To the best of our knowledge, none of the known methods possesses all these properties simultaneously. We call a NN endowed with KAFs at every neuron a Kafnet.

Importantly, framing our method as a kernel technique allows us to leverage a vast literature on kernel methods in statistics, machine learning (hofmann2008kernel), and signal processing (liu2011kernel). Here, we preliminarily demonstrate this by discussing several heuristics to choose the kernel hyper-parameter, along with techniques for initializing the trainable parameters. However, much more can be applied in this context, as we discuss in depth in the conclusive section. We also propose a bi-dimensional variant of our KAF, allowing the information from multiple linear projections to be nonlinearly combined in an adaptive fashion.

In addition, we contend that one reason for the rarity of flexible activation functions in practice is the lack of a cohesive (introductory) treatment of the topic. To this end, a further aim of this paper is to provide a relatively comprehensive overview on the selection of a proper activation function. In particular, we divide the discussion of the state-of-the-art into three separate sections. Section 3 introduces the most common (fixed) activation functions used in NNs, from the classical sigmoid function up to the recently developed self-normalizing unit (klambauer2017self) and Swish function (ramachandran2017swish). Then, we describe in Section 4 how most of these functions can be efficiently parameterized by one or more adaptive scalar values in order to enhance their flexibility. Finally, we introduce the three existing models for designing non-parametric activation functions in Section 5. For each of them, we briefly discuss relative strengths and drawbacks, which motivate the model introduced subsequently.

The rest of the paper is composed of four additional sections. Section 6 describes the proposed KAFs, together with several practical implementation guidelines regarding the selection of the dictionary and the choice of a proper initialization for the weights. For completeness, Section 7 briefly describes additional strategies to improve the activation functions, going beyond the addition of trainable parameters in the model. A large set of experiments is described in Section 8, and, finally, the main conclusions and a set of future lines of research are given in Section 9.

Notation

We denote vectors using boldface lowercase letters, e.g., $\mathbf{a}$; matrices are denoted by boldface uppercase letters, e.g., $\mathbf{A}$. All vectors are assumed to be column vectors. The operator $\left\| \cdot \right\|_p$ is the standard $\ell_p$ norm on a Euclidean space. For $p = 2$, it coincides with the Euclidean norm, while for $p = 1$ we obtain the Manhattan (or taxicab) norm, defined for a generic vector $\mathbf{v} \in \mathbb{R}^B$ as $\left\| \mathbf{v} \right\|_1 = \sum_{k=1}^{B} \left| v_k \right|$. Additional notation is introduced along the paper when required.

2 Preliminaries

We consider training a standard feedforward NN, whose $l$-th layer is described by the following equation:

$$\mathbf{h}_l = g_l\left( \mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l \right) \qquad (1)$$

where $\mathbf{h}_{l-1} \in \mathbb{R}^{N_{l-1}}$ is the $N_{l-1}$-dimensional input to the layer, $\mathbf{W}_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and $\mathbf{b}_l \in \mathbb{R}^{N_l}$ are adaptable weight matrices and bias vectors, and $g_l(\cdot)$ is a nonlinear function, called activation function, which is applied element-wise. In a NN with $L$ layers, $\mathbf{h}_0 = \mathbf{x}$ denotes the input to the network, while $\hat{y} = \mathbf{h}_L$ denotes the final output.

For training the network, we are provided with a set of $N$ input/output pairs $\left\{ \left( \mathbf{x}_i, y_i \right) \right\}_{i=1}^{N}$, and we minimize a regularized cost function given by:

$$J(\mathbf{w}) = \sum_{i=1}^{N} l\left( y_i, \hat{y}_i \right) + \lambda \, r(\mathbf{w}) \qquad (2)$$

where $\mathbf{w}$ collects all the trainable parameters of the network, $l(\cdot, \cdot)$ is a loss function (e.g., the squared loss), $r(\cdot)$ is used to regularize the weights using, e.g., $\ell_2$ or $\ell_1$ penalties, and the regularization factor $\lambda > 0$ balances the two terms.
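As a concrete illustration, the following minimal NumPy sketch evaluates a cost of the form (2) for a squared loss and an $\ell_2$ regularizer; the function name and the value of the regularization factor are ours, chosen purely for illustration.

```python
import numpy as np

def regularized_cost(y_true, y_pred, weights, lam=1e-4):
    """Cost of the form (2): squared loss plus an l2 penalty on all weights.

    `weights` is a list of parameter arrays; `lam` is the regularization
    factor balancing the two terms (placeholder value).
    """
    data_term = np.sum((y_true - y_pred) ** 2)
    reg_term = sum(np.sum(w ** 2) for w in weights)
    return data_term + lam * reg_term

# Toy usage with random predictions and two weight matrices.
rng = np.random.default_rng(0)
y_true = rng.normal(size=10)
y_pred = rng.normal(size=10)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(3, 1))]
print(regularized_cost(y_true, y_pred, weights))
```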

In the following, we review common choices for the selection of $g_l(\cdot)$, before describing methods to adapt them based on the training data. For readability, we drop the subscript $l$, and we use the letter $s$ to denote a single input to the function (i.e., one element of the vector $\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l$), which we call an activation. Note that, in most cases, the activation function for the last layer cannot be chosen freely, as it depends on the task and a proper scaling of the output. In particular, it is common to select the identity $g(s) = s$ for regression problems, and a sigmoid function for binary classification problems with $y_i \in \{0, 1\}$:

$$\sigma(s) = \frac{1}{1 + \exp(-s)} \qquad (3)$$

For multi-class problems with dummy encodings on the output, the softmax function generalizes the sigmoid and ensures valid probability distributions in the output (bishop2006pattern).

3 Fixed activation functions

We briefly review some common (fixed) activation functions for neural networks, which are the basis for the parametric ones in the next section. Before the current wave of deep learning, most activation functions used in NNs were of a 'squashing' type, i.e., they were monotonically non-decreasing functions satisfying:

$$\lim_{s \to -\infty} g(s) = c \,, \qquad \lim_{s \to +\infty} g(s) = 1 \qquad (4)$$

where $c$ can be either $0$ or $-1$, depending on convention. Apart from the sigmoid in (3), another common choice is the hyperbolic tangent, defined as:

$$\tanh(s) = \frac{\exp(s) - \exp(-s)}{\exp(s) + \exp(-s)} \qquad (5)$$

cybenko1989approximation proved the universal approximation property for this class of functions, and his results were later extended to a larger class of functions in hornik1989multilayer. In practice, squashing functions were found to be of limited use in deep networks (where $L$ is large), being prone to the problem of vanishing and exploding gradients due to their bounded derivatives (hochreiter2001gradient). A breakthrough in modern deep learning came from the introduction of the rectified linear unit (ReLU) function, defined as:

$$\text{ReLU}(s) = \max\left( 0, s \right) \qquad (6)$$

Despite being unbounded and introducing a point of non-differentiability, the ReLU has proven to be extremely effective for deep networks (glorot2010understanding; maas2013rectifier). The ReLU has two main advantages. First, its gradient is either $0$ or $1$,¹ making back-propagation particularly efficient. Secondly, its activations are sparse, which is beneficial from several points of view. A smoothed version of the ReLU, called softplus, was also introduced in glorot2011deep:

$$\text{softplus}(s) = \log\left( 1 + \exp(s) \right) \qquad (7)$$

¹ For $s = 0$, the function is not differentiable, but any value in $[0, 1]$ is a valid subgradient. Most implementations of the ReLU use $0$ as the default choice in this case.

Despite its lack of smoothness, the ReLU is almost always preferred to the softplus in practice. One obvious problem of the ReLU is that, for a wrong initialization or an unfortunate weight update, its activation can get stuck at $0$, irrespective of the input. This is referred to as the 'dying ReLU' condition. To circumvent this problem, maas2013rectifier introduced the leaky ReLU function, defined as:

$$\text{LReLU}(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha s & \text{otherwise} \end{cases} \qquad (8)$$

where the user-defined parameter $\alpha$ is generally set to a small value, such as $0.01$. While the resulting pattern of activations is not exactly sparse anymore, the parameters cannot get stuck in a poor region. (8) can also be written more compactly as $\text{LReLU}(s) = \max(0, s) + \alpha \min(0, s)$.

Another problem of activation functions having only non-negative output values is that their mean value is always positive by definition. Motivated by an analogy with the natural gradient, clevert2016fast introduced the exponential linear unit (ELU) to renormalize the pattern of activations:

$$\text{ELU}(s) = \begin{cases} s & \text{if } s \ge 0 \\ \alpha \left( \exp(s) - 1 \right) & \text{otherwise} \end{cases} \qquad (9)$$

where in this case $\alpha$ is generally chosen as $1$. The ELU modifies the negative part of the ReLU with a function saturating at the user-defined value $-\alpha$. It is computationally efficient, smooth (differently from the ReLU and the leaky ReLU), and its gradient is either $1$ (for $s \ge 0$) or $\alpha \exp(s)$ for negative values of $s$.

The recently introduced scaled ELU (SELU) generalizes the ELU to have further control over the range of activations (klambauer2017self):

$$\text{SELU}(s) = \lambda \begin{cases} s & \text{if } s \ge 0 \\ \alpha \left( \exp(s) - 1 \right) & \text{otherwise} \end{cases} \qquad (10)$$

where $\lambda$ is a second user-defined parameter. In particular, it is shown in klambauer2017self that for $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$, the successive application of (1) converges towards a fixed distribution with zero mean and unit variance, leading to a self-normalizing network behavior.

Finally, Swish (ramachandran2017swish) is a recently proposed activation somewhat inspired by the gating steps in a standard LSTM recurrent cell:

$$\text{Swish}(s) = s \cdot \sigma(s) \qquad (11)$$

where $\sigma(\cdot)$ is the sigmoid in (3).
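For reference, the fixed activation functions of this section can be summarized in a short NumPy sketch. This is an illustration of ours, not code from the paper's library; the SELU constants are the approximate values reported above.

```python
import numpy as np

def sigmoid(s):                 # Eq. (3)
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):                    # Eq. (6)
    return np.maximum(0.0, s)

def softplus(s):                # Eq. (7)
    return np.log1p(np.exp(s))

def leaky_relu(s, alpha=0.01):  # Eq. (8), compact form
    return np.maximum(0.0, s) + alpha * np.minimum(0.0, s)

def elu(s, alpha=1.0):          # Eq. (9)
    return np.where(s >= 0, s, alpha * (np.exp(s) - 1.0))

def selu(s, alpha=1.6733, lam=1.0507):  # Eq. (10), approximate constants
    return lam * np.where(s >= 0, s, alpha * (np.exp(s) - 1.0))

def swish(s):                   # Eq. (11)
    return s * sigmoid(s)

s = np.linspace(-3, 3, 7)
print(swish(s))
```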

4 Parametric adaptable activation functions

An immediate way to increase the flexibility of a NN is to parameterize one of the previously introduced activation functions with a fixed (small) number of adaptable parameters, such that each neuron can adapt its activation function to a different shape. As long as the function remains differentiable with respect to these new parameters, it is possible to adapt them with any numerical optimization algorithm together with the linear weights and biases of the layer. Due to their fixed number of parameters and limited flexibility, we call these parametric activation functions.

Historically, one of the first proposals in this sense was the generalized hyperbolic tangent (chen1996feedforward), a tanh function parameterized by two additional positive scalar values $a$ and $b$:

$$g(s) = a \cdot \frac{1 - \exp(-b s)}{1 + \exp(-b s)} \qquad (12)$$

The parameters $a$ and $b$ are initialized randomly and adapted independently for every neuron. Specifically, $a$ determines the range of the output (called the amplitude of the function), while $b$ controls the slope of the curve. trentin2001networks provides empirical evidence that learning the amplitude of each neuron is beneficial (either in terms of generalization error, or speed of convergence) with respect to having unit amplitude for all activation functions. Similar results were also obtained for recurrent networks (goh2003recurrent).

More recently, he2015delving considered a parametric version of the leaky ReLU in (8), where the coefficient $\alpha$ is initialized at $0.25$ everywhere and then adapted for every neuron. The resulting activation function is called parametric ReLU (PReLU), and it has a very simple derivative with respect to the new parameter:

$$\frac{\partial g(s)}{\partial \alpha} = \begin{cases} 0 & \text{if } s \ge 0 \\ s & \text{otherwise} \end{cases} \qquad (13)$$

For a layer with $N_l$ hidden neurons, this introduces only $N_l$ additional parameters, compared to $2 N_l$ parameters for the generalized tanh. Importantly, in the case of $\ell_2$ regularization, the user has to be careful not to regularize the $\alpha$ parameters, which would bias the optimization process towards classical ReLU / leaky ReLU activation functions.
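A minimal sketch of a PReLU unit with its forward pass and the derivative in (13), written in NumPy with hypothetical names; in practice these gradients are obtained automatically by the framework's autodiff.

```python
import numpy as np

class PReLU:
    """Parametric ReLU: one adaptable slope per neuron, initialized at 0.25."""

    def __init__(self, n_neurons):
        self.alpha = np.full(n_neurons, 0.25)

    def forward(self, s):
        # s has shape (batch, n_neurons); negative part scaled by alpha.
        return np.where(s >= 0, s, self.alpha * s)

    def grad_alpha(self, s):
        # Derivative of the output w.r.t. alpha, as in Eq. (13):
        # 0 for s >= 0, s otherwise.
        return np.where(s >= 0, 0.0, s)

rng = np.random.default_rng(0)
act = PReLU(n_neurons=4)
s = rng.normal(size=(2, 4))
print(act.forward(s).shape, act.grad_alpha(s).shape)
```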

Similarly, trottier2016parametric propose a modification of the ELU function in (9) with two additional scalar parameters $a, b > 0$, called parametric ELU (PELU):

$$g(s) = \begin{cases} \dfrac{a}{b}\, s & \text{if } s \ge 0 \\ a \left( \exp\left( s / b \right) - 1 \right) & \text{otherwise} \end{cases} \qquad (14)$$

where both $a$ and $b$ are initialized randomly and adapted during the training process. Based on the analysis in jin2016deep, there always exists a setting of the linear weights and of $a$ and $b$ which avoids the vanishing gradient problem. Differently from the PReLU, however, the two parameters should be regularized in order to avoid a degenerate behavior with respect to the linear weights, where extremely small linear weights are coupled with very large values for the parameters of the activation functions.

A more flexible proposal is the S-shaped ReLU (SReLU) (jin2016deep), which is parameterized by four scalar values $\left\{ t^l, a^l, t^r, a^r \right\}$:

$$g(s) = \begin{cases} t^r + a^r \left( s - t^r \right) & \text{if } s \ge t^r \\ s & \text{if } t^l < s < t^r \\ t^l + a^l \left( s - t^l \right) & \text{if } s \le t^l \end{cases} \qquad (15)$$

The SReLU is composed of three linear segments, the middle of which is the identity. Differently from the PReLU, however, the cut-off points between the three segments can also be adapted. Additionally, the function can have both convex and nonconvex shapes, depending on the orientation of the left and right segments, making it more flexible than previous proposals. Similarly to the PReLU, the four parameters should not be regularized (jin2016deep).

Finally, a parametric version of the Swish function is the $\beta$-Swish (ramachandran2017swish), which includes a tunable parameter $\beta$ inside the self-gate:

$$g(s) = s \cdot \sigma\left( \beta s \right) \qquad (16)$$

5 Non-parametric activation functions

Intuitively, parametric activation functions have limited flexibility, resulting in mixed performance gains on average. Differently from parametric approaches, non-parametric activation functions can model a larger class of shapes (in the best case, any continuous segment), at the price of a larger number of adaptable parameters. As stated in the introduction, these methods generally introduce a further global hyper-parameter that balances the flexibility of the function by varying the effective number of free parameters, which can potentially grow without bound. Additionally, the methods can be grouped depending on whether each parameter has a local or global effect on the overall function, the former being a desirable characteristic.

In this section, we describe three state-of-the-art approaches for implementing non-parametric activation functions: APL functions in Section 5.1, spline functions in Section 5.2, and maxout networks in Section 5.3.

5.1 Adaptive piecewise linear methods

An APL function, introduced in agostinelli2014learning, generalizes the SReLU function in (15) by summing multiple linear segments, where all slopes and cut-off points are learned under the constraint that the overall function is continuous:

$$g(s) = \max\left( 0, s \right) + \sum_{i=1}^{S} a_i \max\left( 0, -s + b_i \right) \qquad (17)$$

$S$ is a hyper-parameter chosen by the user, while each APL is parameterized by $2S$ adaptable parameters $\left\{ a_i, b_i \right\}_{i=1}^{S}$. These parameters are randomly initialized for each neuron, and can be regularized with $\ell_2$ regularization, similarly to the PELU, in order to avoid the coupling of very small linear weights and very large coefficients for the APL units.

The APL unit cannot approximate any possible function. Its approximation properties are described in the following theorem.

Theorem 1 (Theorem 1, (agostinelli2014learning)).

The APL unit can approximate any continuous piecewise-linear function $h(s)$, for some choice of $S$ and $\left\{ a_i, b_i \right\}_{i=1}^{S}$, provided that $h$ satisfies the following two conditions:

  1. There exists $u \in \mathbb{R}$ such that $h(s) = s$ for all $s \ge u$.

  2. There exist two scalars $v, c \in \mathbb{R}$ such that $h'(s) = c$ for all $s \le v$.

The previous theorem implies that any piecewise-linear function can be approximated, provided that its behavior is linear for very large or very small values of $s$. A possible drawback of the APL activation function is that it introduces points of non-differentiability for each neuron, which may hamper the optimization algorithm. The next class of functions solves this problem, at the cost of a possibly larger number of parameters.
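The APL unit in (17) reduces to a ReLU plus a sum of hinge terms; the following NumPy sketch (ours, with a hypothetical random initialization) computes it for a whole mini-batch of activations.

```python
import numpy as np

def apl(s, a, b):
    """Adaptive piecewise linear unit, Eq. (17).

    s:    activations of shape (batch, n_neurons)
    a, b: arrays of shape (S, n_neurons) holding slopes and cut-off points.
    """
    out = np.maximum(0.0, s)                      # ReLU part
    for a_i, b_i in zip(a, b):                    # S hinge terms
        out = out + a_i * np.maximum(0.0, -s + b_i)
    return out

S, n_neurons = 3, 5
rng = np.random.default_rng(0)
a = 0.1 * rng.normal(size=(S, n_neurons))   # hypothetical small random init
b = rng.normal(size=(S, n_neurons))
s = rng.normal(size=(8, n_neurons))
print(apl(s, a, b).shape)   # (8, 5)
```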

5.2 Spline activation functions

An immediate way to exploit polynomial interpolation in NNs is to build the activation function over powers of the activation $s$ (piazza1992artificial):

$$g(s) = \sum_{i=0}^{P} a_i s^i \qquad (18)$$

where $P$ is a hyper-parameter and the coefficients $\left\{ a_i \right\}_{i=0}^{P}$ are adapted. Since a polynomial of degree $P$ can pass exactly through $P + 1$ points, this polynomial activation function (PAF) can in theory approximate any smooth function. The drawback of this approach is that each parameter has a global influence on the overall shape, and the output of the function can easily grow too large or encounter numerical problems, particularly for large absolute values of $s$ and large $P$.

An improved way to use polynomial expansions is spline interpolation, giving rise to the spline activation function (SAF). The SAF was originally studied in vecci1998learning; guarnieri1999multilayer, and later re-introduced in a more modern context in scardapane2016learning, following previous advances in nonlinear filtering (scarpiniti2013nonlinear). In the sequel, we adopt the newer formulation.

A SAF is described by a vector $\mathbf{q} \in \mathbb{R}^{Q}$ of adaptable parameters, called knots, corresponding to a sampling of its $y$-values over an equispaced grid of $Q$ points on the $x$-axis, which are symmetrically chosen around the origin with sampling step $\Delta x$. For any other value of the activation $s$, the output of the SAF is computed by spline interpolation over the closest knot and its $P$ rightmost neighbors, where $P$ is generally chosen equal to $3$, giving rise to a cubic spline interpolation scheme. Specifically, denote by $k$ the index of the closest knot, and by $\mathbf{q}_k = \left[ q_k, q_{k+1}, \ldots, q_{k+P} \right]^T$ the vector comprising the corresponding knot and its $P$ neighbors. We call this vector the span. We also define a new value

$$u = \frac{s}{\Delta x} - \left\lfloor \frac{s}{\Delta x} \right\rfloor \qquad (19)$$

where $\Delta x$ is the user-defined sampling step. $u$ defines a normalized abscissa value between the $k$-th knot and the $(k+1)$-th one. The output of the SAF is then given by (scarpiniti2013nonlinear):

$$g(s) = \mathbf{u}^T \mathbf{B} \mathbf{q}_k \qquad (20)$$

where the vector $\mathbf{u}$ collects powers of $u$ up to the order $P$:

$$\mathbf{u} = \left[ u^P, u^{P-1}, \ldots, u, 1 \right]^T \qquad (21)$$

and $\mathbf{B} \in \mathbb{R}^{(P+1) \times (P+1)}$ is the spline basis matrix, which defines the properties of the interpolation scheme (as shown later in Fig. 1b). For example, the popular Catmull-Rom basis for $P = 3$ is given by:

$$\mathbf{B} = \frac{1}{2} \begin{bmatrix} -1 & 3 & -3 & 1 \\ 2 & -5 & 4 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & 2 & 0 & 0 \end{bmatrix} \qquad (22)$$

The derivatives of the SAF can be computed in a similar way, both with respect to the activation $s$ and with respect to the knots $\mathbf{q}_k$, e.g., see scardapane2016learning. A visual example of the SAF output is given in Fig. 1.

(a) CR matrix
(b) B-basis matrix
Figure 1: Example of output interpolation using a SAF neuron. Knots are shown with red markers, while the overall function is shown in light blue. (a) For a given activation, shown in black, only the control points in the green background are active. (b) We use the same control points as before, but we interpolate using the B-basis matrix (scarpiniti2013nonlinear) instead of the CR matrix in (22). The resulting curve is smoother, but it is not guaranteed to pass through all the control points.

Each knot has only a limited local influence over the output, making its adaptation more stable. The resulting function is also smooth, and can in fact approximate any smooth function defined over a subset of the real line to a desired level of accuracy, provided that the sampling step $\Delta x$ is chosen small enough. The drawback is that regularizing the resulting activation functions is harder, since standard regularization penalties cannot be applied directly to the values of the knots. In guarnieri1999multilayer, this was solved by choosing a large $\Delta x$, in turn severely limiting the flexibility of the interpolation scheme. A different proposal was made in scardapane2016learning, where the vector of SAF parameters is regularized by penalizing deviations from its values at initialization. Note that it is straightforward to initialize the SAF as any of the known fixed activation functions described before.
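A sketch of the SAF output computation in (19)-(22) with the Catmull-Rom basis follows; the span selection and indexing are a simplified convention of ours, and boundary handling is omitted.

```python
import numpy as np

# Catmull-Rom spline basis matrix, Eq. (22).
B = 0.5 * np.array([[-1.0,  3.0, -3.0,  1.0],
                    [ 2.0, -5.0,  4.0, -1.0],
                    [-1.0,  0.0,  1.0,  0.0],
                    [ 0.0,  2.0,  0.0,  0.0]])

def saf(s, q, dx):
    """Cubic SAF output for a scalar activation s.

    q:  vector of knots (y-values), equispaced around the origin with step dx.
    Values of s falling near the boundary of the knot grid are not handled.
    """
    Q = len(q)
    t = s / dx + (Q - 1) / 2.0      # position of s on the knot grid
    k = int(np.floor(t))            # index of the closest knot on the left
    u = t - k                       # normalized abscissa, Eq. (19)
    u_vec = np.array([u**3, u**2, u, 1.0])   # Eq. (21), P = 3
    q_span = q[k - 1:k + 3]                  # span of 4 control points
    return u_vec @ B @ q_span                # Eq. (20)

dx = 0.4
knots = np.tanh(np.arange(-10, 11) * dx)   # initialize the SAF as a tanh
print(saf(0.3, knots, dx))                 # close to tanh(0.3)
```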

5.3 Maxout networks

Differently from the other functions described up to now, the maxout function introduced in goodfellow2013maxout replaces an entire layer in (1). In particular, for each neuron, instead of computing a single dot product $\mathbf{w}^T \mathbf{h} + b$ to obtain the activation (where $\mathbf{h}$ is the input to the layer), we compute $K$ different products with separate weight vectors $\left\{ \mathbf{w}_i \right\}_{i=1}^{K}$ and biases $\left\{ b_i \right\}_{i=1}^{K}$, and take their maximum:

$$g(\mathbf{h}) = \max_{i = 1, \ldots, K} \left\{ \mathbf{w}_i^T \mathbf{h} + b_i \right\} \qquad (23)$$

where the activation function is now a function of a subset of the output of the previous layer. A NN having maxout neurons in all hidden layers is called a maxout network, and remains a universal approximator according to the following theorem.

Theorem 2 (Theorem 4.3, (goodfellow2013maxout)).

Any continuous function can be approximated arbitrarily well on a compact domain by a maxout network with two maxout hidden units, provided that the number of affine maps $K$ is chosen sufficiently large.

The advantage of the maxout function is that it is extremely easy to implement using current linear algebra libraries. However, the resulting functions have several points of non-differentiability, similarly to the APL units. In addition, the number of resulting parameters is generally higher than with alternative formulations. In particular, by increasing $K$ we multiply the original number of parameters by a corresponding factor, while other approaches contribute only linearly to this number. Additionally, we lose the possibility of plotting the resulting activation functions, unless the input to the maxout neuron has very few dimensions. An example with a one-dimensional input is shown in Fig. 2.

Figure 2: An example of a maxout neuron with a one-dimensional input and $K = 3$. The three linear segments are shown in light gray, while the resulting activation is shown in shaded red. Note how the maxout can only generate convex shapes by definition. However, such plots cannot be made for inputs having more than three dimensions.
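The maxout operation in (23) is a maximum over K affine maps; a minimal NumPy sketch for a full layer (names and sizes ours) follows.

```python
import numpy as np

def maxout_layer(h, W, b):
    """Maxout layer, Eq. (23).

    h: input of shape (batch, d_in)
    W: weights of shape (K, d_out, d_in) -- K affine maps per output neuron
    b: biases of shape (K, d_out)
    Returns an array of shape (batch, d_out).
    """
    # Affine maps: (K, batch, d_out), then maximum over the K maps.
    z = np.einsum('koi,bi->kbo', W, h) + b[:, None, :]
    return z.max(axis=0)

rng = np.random.default_rng(0)
K, d_in, d_out, batch = 3, 6, 4, 8
W = rng.normal(size=(K, d_out, d_in))
b = rng.normal(size=(K, d_out))
h = rng.normal(size=(batch, d_in))
print(maxout_layer(h, W, b).shape)   # (8, 4)
```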

In order to solve the smoothness problem, zhang2014improving introduced two smooth versions of the maxout neuron. The first one is the soft-maxout:

$$g(\mathbf{h}) = \log\left( \sum_{i=1}^{K} \exp\left( \mathbf{w}_i^T \mathbf{h} + b_i \right) \right) \qquad (24)$$

The second one is the $\ell_p$-maxout, for a user-defined natural number $p$:

$$g(\mathbf{h}) = \sqrt[p]{\sum_{i=1}^{K} \left| \mathbf{w}_i^T \mathbf{h} + b_i \right|^p} \qquad (25)$$

Closely related to the $\ell_p$-maxout neuron is the $L_p$ unit proposed in gulcehre2013learned. Denoting for simplicity $z_i = \mathbf{w}_i^T \mathbf{h} + b_i$, the $L_p$ unit is defined as:

$$g(\mathbf{h}) = \left( \frac{1}{K} \sum_{i=1}^{K} \left| z_i - c_i \right|^p \right)^{1/p} \qquad (26)$$

where the parameters $\left\{ c_i \right\}_{i=1}^{K}$ and $p$ are all learned via back-propagation.² If we fix $c_i = 0$, for $p$ going to infinity the unit degenerates to a special case of the maxout neuron:

$$g(\mathbf{h}) = \max_{i = 1, \ldots, K} \left\{ \left| z_i \right| \right\} \qquad (27)$$

² In practice, $p$ is re-parameterized (e.g., through a softplus transformation ensuring $p \ge 1$) to guarantee that (26) defines a proper norm.

6 Proposed kernel-based activation functions

In this section we describe the proposed KAF. Specifically, we model each activation function in terms of a kernel expansion over $D$ terms as:

$$g(s) = \sum_{i=1}^{D} \alpha_i \kappa\left( s, d_i \right) \qquad (28)$$

where $\left\{ \alpha_i \right\}_{i=1}^{D}$ are the mixing coefficients, $\left\{ d_i \right\}_{i=1}^{D}$ are called the dictionary elements, and $\kappa(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a 1D kernel function (hofmann2008kernel). In kernel methods, the dictionary elements are generally selected from the training data. In a stochastic optimization setting, this means that $D$ would grow linearly with the number of training iterations, unless some proper strategy for the selection of the dictionary is implemented (liu2011kernel; van2012kernel). To simplify our treatment, we consider a simplified case where the dictionary elements are fixed, and we only adapt the mixing coefficients. In particular, we sample $D$ values over the $x$-axis, uniformly around zero, similarly to the SAF method, and we leave $D$ as a user-defined hyper-parameter. This has the additional benefit that the resulting model is linear in its adaptable parameters, and can be efficiently implemented for a mini-batch of training data using highly-vectorized linear algebra routines. Note that there is a vast literature on kernel methods with fixed dictionary elements, particularly in the field of Gaussian processes (snelson2006sparse).

The kernel function only needs to respect the positive semi-definiteness property, i.e., for any possible choice of coefficients $\left\{ c_i \right\}_{i=1}^{n}$ and points $\left\{ s_i \right\}_{i=1}^{n}$ we have that:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \kappa\left( s_i, s_j \right) \ge 0 \qquad (29)$$

For our experiments, we use the 1D Gaussian kernel defined as:

$$\kappa\left( s, d_i \right) = \exp\left( -\gamma \left( s - d_i \right)^2 \right) \qquad (30)$$

where $\gamma > 0$ is called the kernel bandwidth, and its selection is discussed at more length below. Other choices, such as the polynomial kernel of order $p \ge 1$, are also possible:

$$\kappa\left( s, d_i \right) = \left( 1 + s d_i \right)^p \qquad (31)$$

By the properties of kernel methods, KAFs are equivalent to learning linear functions over a large number of nonlinear transformations of the original activation $s$, without having to explicitly compute such transformations. The Gaussian kernel has an additional benefit: thanks to its definition, each mixing coefficient has only a local effect over the shape of the output function (with a radius depending on $\gamma$, see below), which is advantageous during optimization. In addition, the expression in (28) with the Gaussian kernel can approximate any continuous function over a subset of the real line (micchelli2006universal). The expression resembles a one-dimensional radial basis function network, whose universal approximation properties are also well studied (park1991universal). Below, we go more in depth over some additional considerations for implementing our KAF model. Note that the model has very simple derivatives for back-propagation:

$$\frac{\partial g(s)}{\partial \alpha_i} = \kappa\left( s, d_i \right) \qquad (32)$$

$$\frac{\partial g(s)}{\partial s} = \sum_{i=1}^{D} \alpha_i \frac{\partial \kappa\left( s, d_i \right)}{\partial s} \qquad (33)$$
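Putting (28) and (30) together, a KAF layer amounts to evaluating a Gaussian kernel matrix between the activations and the fixed dictionary, followed by a per-neuron weighted sum. The NumPy sketch below mirrors this vectorized computation; the dictionary range and the initialization scale are assumptions of ours, and the bandwidth uses the rule of thumb given later in (34).

```python
import numpy as np

def kaf(s, d, alpha, gamma):
    """Kernel activation function, Eq. (28) with the Gaussian kernel (30).

    s:      activations of shape (batch, n_neurons)
    d:      fixed dictionary of shape (D,), shared by all neurons
    alpha:  mixing coefficients of shape (n_neurons, D), one set per neuron
    """
    # Kernel evaluations: shape (batch, n_neurons, D).
    K = np.exp(-gamma * (s[..., None] - d) ** 2)
    # Linear mixing per neuron, Eq. (28).
    return np.einsum('bnd,nd->bn', K, alpha)

D, n_neurons, batch = 20, 5, 8
d = np.linspace(-2.0, 2.0, D)          # hypothetical dictionary range
delta = d[1] - d[0]
gamma = 1.0 / (6.0 * delta ** 2)       # rule of thumb, Eq. (34)

rng = np.random.default_rng(0)
alpha = 0.3 * rng.normal(size=(n_neurons, D))
s = rng.normal(size=(batch, n_neurons))
print(kaf(s, d, alpha, gamma).shape)   # (8, 5)
```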

On the selection of the kernel bandwidth

(a) γ = 2.0
(b) γ = 0.5
(c) γ = 0.1
Figure 3: Examples of KAFs. In all cases we sample the dictionary points uniformly on the $x$-axis, while the mixing coefficients are sampled from a normal distribution. The three plots show three different choices for the kernel bandwidth γ.

Selecting $\gamma$ is crucial for the well-behavedness of the method, since it acts indirectly on the effective number of adaptable parameters. In Fig. 3 we show some examples of functions obtained by fixing the dictionary, randomly sampling the mixing coefficients, and only varying the kernel bandwidth, showing how $\gamma$ acts on the smoothness of the overall function.

In the literature, many methods have been proposed to select the bandwidth parameter when performing kernel density estimation (jones1996brief). These include popular rules of thumb such as those of scott2015multivariate or silverman1986density.

In kernel density estimation, the abscissas correspond to a given dataset with an arbitrary distribution. In the proposed KAF scheme, the abscissas are instead chosen according to a grid, and as such the optimal bandwidth parameter depends uniquely on the grid resolution. Instead of leaving the bandwidth as an additional hyper-parameter, we have empirically verified that the following rule of thumb represents a good compromise between smoothness (to allow an accurate approximation of several initialization functions) and flexibility:

$$\gamma = \frac{1}{6 \Delta^2} \qquad (34)$$

where $\Delta$ is the distance between the grid points. We also performed some experiments in which $\gamma$ was adapted through back-propagation, though this did not provide any gain in accuracy.

On the initialization of the mixing coefficients

A random initialization of the mixing coefficients from a normal distribution, as in Fig. 3, provides good diversity for the optimization process. Nonetheless, a further advantage of our scheme is that we can initialize some (or all) of the KAFs to follow any known activation function, so as to guarantee a certain desired behavior. Specifically, denote by $\mathbf{t} \in \mathbb{R}^{D}$ the vector of desired initial KAF values corresponding to the dictionary elements $\mathbf{d} \in \mathbb{R}^{D}$. We can initialize the vector of mixing coefficients $\boldsymbol{\alpha}$ using kernel ridge regression:

$$\boldsymbol{\alpha} = \left( \mathbf{K} + \varepsilon \mathbf{I} \right)^{-1} \mathbf{t} \qquad (35)$$

where $\mathbf{K} \in \mathbb{R}^{D \times D}$ is the kernel matrix computed between the elements of $\mathbf{d}$, and we add a diagonal term with $\varepsilon > 0$ to avoid degenerate solutions with very large mixing coefficients. Two examples are shown in Fig. 4.

(a) Hyperbolic tangent
(b) ELU
Figure 4: Two examples of initializing a KAF using (35). (a) A hyperbolic tangent. (b) The ELU in (9). The red dots indicate the corresponding initialized values for the mixing coefficients.
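The initialization in (35) is a single linear solve. The sketch below (NumPy, with an ε value chosen by us purely for illustration) initializes a KAF to approximate a tanh over its dictionary.

```python
import numpy as np

def kaf_init(d, target_fn, gamma, eps=1e-6):
    """Initialize mixing coefficients via kernel ridge regression, Eq. (35).

    d:         dictionary elements, shape (D,)
    target_fn: desired activation function, evaluated on the dictionary
    eps:       diagonal regularizer (illustrative value, not from the paper)
    """
    t = target_fn(d)                                     # desired KAF values
    K = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2)  # kernel matrix on d
    return np.linalg.solve(K + eps * np.eye(len(d)), t)

D = 20
d = np.linspace(-2.0, 2.0, D)
gamma = 1.0 / (6.0 * (d[1] - d[0]) ** 2)
alpha = kaf_init(d, np.tanh, gamma)

# The initialized KAF should now be close to tanh on the dictionary points.
approx = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2) @ alpha
print(np.max(np.abs(approx - np.tanh(d))))   # small residual
```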

Multi-dimensional kernel activation functions

In our experiments, we also consider a two-dimensional variant of the proposed KAF, that we denote as 2D-KAF. Roughly speaking, the 2D-KAF acts on a pair of activation values, instead of a single one, and learns a two-dimensional function to combine them. It can be seen as a generalization of a two-dimensional maxout neuron, which is instead constrained to output the maximum value among the two inputs.

Similarly to before, we construct the dictionary by sampling a uniform grid over the 2D plane, considering $D$ positions uniformly spaced around zero in both dimensions, for a total of $D^2$ elements. We group the incoming activation values in pairs (assuming that the layer has even size), and for each possible pair of activations $\mathbf{s} = \left[ s_1, s_2 \right]^T$ we output:

$$g(\mathbf{s}) = \sum_{i=1}^{D^2} \alpha_i \kappa\left( \mathbf{s}, \mathbf{d}_i \right) \qquad (36)$$

where $\mathbf{d}_i \in \mathbb{R}^2$ is the $i$-th element of the dictionary, and we now have $D^2$ adaptable coefficients $\left\{ \alpha_i \right\}_{i=1}^{D^2}$. In this case, we consider the 2D Gaussian kernel:

$$\kappa\left( \mathbf{s}, \mathbf{d}_i \right) = \exp\left( -\gamma \left\| \mathbf{s} - \mathbf{d}_i \right\|^2 \right) \qquad (37)$$

where $\gamma$ is selected with the same rule of thumb as in (34), up to a multiplicative correction. The increase in parameters is counter-balanced by two factors. Firstly, by grouping the activations we halve the size of the linear matrix in the subsequent layer. Secondly, we generally choose a smaller $D$ with respect to the 1D case, i.e., we have found that relatively small dictionaries per axis are enough to provide a good degree of flexibility. Table 1 provides a comparison of the two proposed KAF models to the three alternative non-parametric activation functions described before. We briefly mention here that a multidimensional variant of the SAF was explored in solazzi2000artificial.
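A 2D-KAF evaluates a 2D Gaussian kernel between each pair of activations and a grid-shaped dictionary, as in (36)-(37). The following sketch (ours, with hypothetical sizes) processes a layer whose activations have already been grouped in pairs; the bandwidth is computed with the plain rule of thumb in (34), omitting the multiplicative correction mentioned above.

```python
import numpy as np

def kaf2d(s_pairs, d_grid, alpha, gamma):
    """Two-dimensional KAF, Eq. (36) with the 2D Gaussian kernel (37).

    s_pairs: activations grouped in pairs, shape (batch, n_pairs, 2)
    d_grid:  dictionary on a 2D grid, shape (D*D, 2)
    alpha:   mixing coefficients, shape (n_pairs, D*D)
    """
    diff = s_pairs[:, :, None, :] - d_grid           # (batch, n_pairs, D*D, 2)
    K = np.exp(-gamma * np.sum(diff ** 2, axis=-1))  # squared Euclidean norm
    return np.einsum('bpk,pk->bp', K, alpha)         # (batch, n_pairs)

D = 6                                              # hypothetical grid size
axis = np.linspace(-2.0, 2.0, D)
d_grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
gamma = 1.0 / (6.0 * (axis[1] - axis[0]) ** 2)     # rule of thumb, Eq. (34)

rng = np.random.default_rng(0)
batch, n_pairs = 8, 4
alpha = 0.3 * rng.normal(size=(n_pairs, D * D))
s_pairs = rng.normal(size=(batch, n_pairs, 2))
print(kaf2d(s_pairs, d_grid, alpha, gamma).shape)  # (8, 4)
```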

Name            | Smooth | Locality | Can use regularization | Plottable | Hyper-parameter            | Trainable weights (per neuron)
APL             | No     | Partially| Only ℓ_2 regularization| Yes       | Number of segments S       | 2S
SAF             | Yes    | Yes      | No                     | Yes       | Number of control points Q | Q
Maxout          | No     | No       | Yes                    | No*       | Number of affine maps K    | K(d+1), with d the input size
Proposed KAF    | Yes    | Yes**    | Yes                    | Yes       | Size of the dictionary D   | D
Proposed 2D-KAF | Yes    | Yes**    | Yes                    | Yes       | Size of the dictionary D   | D^2 (per pair of neurons)

  • * Maxout functions can only be plotted when the neuron has very few inputs, which is almost never the case.

  • ** Only when using the Gaussian (or similar) kernel function.

Table 1: A comparison of the existing non-parametric activation functions and the proposed KAF and 2D-KAF. In our definition, an activation function is local if each adaptable weight only affects a small portion of the output values.

7 Related work

Many authors have considered ways of improving the performance of the classical activation functions which do not necessarily require adapting their shape via numerical optimization, or which require special care when implemented. For completeness, we briefly review them here before moving to the experimental section.

As stated before, the problem of the ReLU is that its gradient is zero outside the 'active' regime where $s \ge 0$. To solve this, the randomized leaky ReLU (xu2015empirical) considers a leaky ReLU as in (8), in which during training the parameter $\alpha$ is randomly sampled at every step from a uniform distribution $\mathcal{U}(l, u)$, whose lower and upper bounds $l$ and $u$ are selected beforehand. To compensate for the stochasticity in training, in the test phase $\alpha$ is set equal to:

$$\alpha = \frac{l + u}{2} \qquad (38)$$

which is equivalent to taking the average of all possible values seen during training. This is similar to the dropout technique (srivastava2014dropout), which randomly deactivates some neurons during each step of training, and later rescales the weights during the test phase.

More generally, several papers have developed stochastic versions of the classical artificial neuron, whose output depends on one or more random variables sampled during its execution (bengio2013estimating), under the idea that the resulting noise can help guide the optimization process towards better minima. Notably, this provides a link between classical NNs and other probabilistic methods, such as generative networks and networks trained using variational inference (bengio2014deep; schulman2015gradient). The main challenge is to design stochastic neurons that provide a simple mechanism for back-propagating the error through the random variables, without requiring expensive sampling procedures, and with a minimal amount of interference over the network. As a representative example, the noisy activation functions proposed in gulcehre2016noisy achieve this by combining activation functions with 'hard saturating' regimes (i.e., their value is exactly zero outside a limited range) with random noise over the outputs, whose variance increases in the regime where the function saturates, to avoid problems due to the sparse gradient terms. An example is given in Fig. 5.

(a) Original function
(b) Noise
(c) Noisy activation function
Figure 5: An example of a noisy activation function. (a) Original sigmoid function (blue), together with its hard-thresholded version (red). (b) For any possible activation value outside the saturated regime, we add random half-normal noise with increasing variance and matching sign, according to the algorithm in gulcehre2016noisy (the shaded areas correspond to one standard deviation). (c) Final noisy activation function computed as in gulcehre2016noisy. At test time, only the expected values (represented with a solid green line) are returned.

Another approach is to design vector-valued activation functions to maximize parameter sharing. In the simplest case, the concatenated ReLU (shang2016understanding) returns two output values by applying a ReLU function both on $s$ and on $-s$. Similarly, the order statistic network (rennie2014deep) modifies a maxout neuron by returning the input activations in sorted order, instead of picking the highest value only. Multi-bias activation functions (li2016multi) compute several activation values by using different bias terms, and then apply the same activation function independently over each of the resulting values. The network-in-network (lin2013network) model is a non-parametric approach specific to convolutional neural networks, wherein the nonlinear units are replaced with a fully connected NN.

For specific tasks of audio modeling, some authors have proposed the use of Hermite polynomials for adapting the activation functions (siniscalchi2013hermitian; siniscalchi2017adaptation). Similarly to our proposed KAF, the functions are expressed as a weighted sum of several fixed nonlinear transformations of the activation values, i.e., the Hermite polynomials. However, the nonlinear transformations are computed through the use of a recurrence formula, thus highly increasing the computational load.

8 Experimental results

In this section we provide a comprehensive evaluation of the proposed KAFs and 2D-KAFs when applied to several use cases. As a preliminary experiment, we begin by comparing multiple activation functions on a relatively small classification dataset (Sensorless) in Section 8.2, where we discuss several examples of the shapes that are generally obtained by the networks and initialization strategies. We then consider a large-scale dataset taken from baldi2014searching in Section 8.3, where we show that two layers of KAFs are able to significantly outperform a feedforward network with five hidden layers, even when considering parametric activation functions and state-of-the-art regularization techniques. In Section 8.4 we show that KAFs and 2D-KAFs provide an increase in performance also when applied to convolutional layers on the CIFAR-10 dataset. Finally, we show in Section 8.5 that they achieve significantly faster training and higher cumulative reward on a set of reinforcement learning scenarios using MuJoCo environments from the OpenAI Gym (https://gym.openai.com/). We provide an open-source library to replicate the experiments, with the implementation of KAFs and 2D-KAFs in three separate frameworks, i.e., AutoGrad (https://github.com/HIPS/autograd), TensorFlow (https://www.tensorflow.org/), and PyTorch (http://pytorch.org/), which is publicly accessible on the web (https://github.com/ispamm/kernel-activation-functions/).

8.1 Experimental setup

Unless noted otherwise, in all experiments we linearly scale the input features between -1 and +1, and we replace any missing values with the median values computed from the corresponding feature columns. From the full dataset, we randomly keep a portion for validation and another portion for testing. All neural networks use a softmax activation function in their output layer, and they are trained by minimizing the average cross-entropy on the training dataset, to which we add a small $\ell_2$-regularization term whose weight is selected in accordance with the literature. For optimization, we use the Adam algorithm (kingma2014adam) with mini-batches of 100 elements and default hyper-parameters. After each epoch we compute the accuracy over the validation set, and we stop training whenever the validation accuracy has not improved for 15 consecutive epochs. Experiments are performed using the PyTorch implementation on a machine with an Intel Xeon E5-2620 CPU, 16 GB of RAM, and a CUDA back-end employing an Nvidia Tesla K20c. All accuracy measures over the test set are computed by repeating the experiments for 5 different splits of the dataset (unless the splits are provided by the dataset itself) and initializations of the networks. Weights of the linear layers are always initialized using the so-called 'Uniform He' strategy, while additional parameters introduced by parametric and non-parametric activation functions are initialized following the guidelines of the original papers.
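The preprocessing described above (median imputation followed by linear scaling of each feature to [-1, +1]) is summarized by the following sketch; function and variable names are ours, and computing the statistics on the training split only is a standard precaution we assume rather than a detail stated in the paper.

```python
import numpy as np

def preprocess(X_train, X_test):
    """Median-impute missing values, then scale each feature to [-1, +1].

    Statistics (medians, minima, maxima) are computed on the training split
    and reused on the test split.
    """
    med = np.nanmedian(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), med, X_train)
    X_test = np.where(np.isnan(X_test), med, X_test)

    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    to_range = lambda X: 2.0 * (X - lo) / scale - 1.0
    return to_range(X_train), to_range(X_test)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 5))
X_tr[rng.random(X_tr.shape) < 0.05] = np.nan     # inject missing values
X_te = rng.normal(size=(20, 5))
X_tr, X_te = preprocess(X_tr, X_te)
print(X_tr.min(), X_tr.max())                    # approximately -1.0 and 1.0
```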

8.2 Visualizing the activation functions

We begin with an experiment on the 'Sensorless' dataset to investigate whether KAFs and 2D-KAFs can indeed provide improvements in accuracy with respect to other baselines, and to visualize some of the common shapes that are obtained after training. The Sensorless dataset is a standard benchmark for supervised techniques, composed of 58509 examples with 49 input features representing electric signals, which are used to predict one among 11 different classes representing operating conditions. We partition it using a random 15% for validation and another 15% for testing, and we use a small $\ell_2$ regularization factor.

On this dataset, we found that the best performing fixed activation function is a simple hyperbolic tangent. In particular, a network with one hidden layer of 100 neurons already achieves a good test accuracy, while the best result is obtained with a network of three hidden layers (each composed of 100 neurons). Due to the simplicity of the dataset, we have not found improvements here by adding more layers or by including dropout during training as in the following sections. The best performing parametric activation function is instead the PReLU, which improves on these results both with a single hidden layer and with three hidden layers.

Figure 6: Examples of trained KAFs (with random initialization) on the Sensorless dataset, panels (a)-(f). In each panel, the $y$-axis reports the output value of the KAF. The KAF after initialization is shown with a dashed red line, while the final KAF is shown with a solid green line. The distribution of activation values after training is shown as a reference in light blue.

For comparison, we train several feedforward networks with KAFs in the hidden layers, having a dictionary of elements equispaced around zero, and initializing the mixing coefficients from a normal distribution. Using this setup, we already outperform all the other baselines with a single hidden layer, and the accuracy improves further when considering two hidden layers of KAFs.

Although the dataset is relatively simple, the shapes we obtain are representative of all the experiments we performed, and we provide a selection in Fig. 6. Specifically, the initialization of the KAF is shown with a dashed red line, while the final KAF is shown with a solid green line. To understand the behavior of the functions, we also plot the empirical distribution of the activations on the test set in light blue in the background. Some shapes are similar to the common activation functions discussed in Section 3, although they are shifted on the $x$-axis to match the distribution of activation values. For example, Fig. 6a is similar to an ELU, while Fig. 6d is similar to a standard saturating function. Also, while in the latter case the final shape is somewhat determined by the initialization, the final shapes in general tend to be independent of the initialization, as in the case of Fig. 6a. Another common shape is that of a radial basis function, as in Fig. 6f, which is similar to a Gaussian function centered on the mean of the empirical distribution. Shapes, however, can be vastly more complex than these. For example, in Fig. 6e we show a function which acts as a standard saturating function on the main part of the activations' distribution, while its right tail tends to remove values larger than a given threshold, effectively acting as a sort of implicit regularizer. In Fig. 6b we show a KAF without an intuitive shape, which selectively amplifies (either positively or negatively) multiple regions of its activation space. Finally, in Fig. 6c we show an interesting pruning effect, where useless neurons correspond to activation functions that are practically zero everywhere. This, combined with the possibility of applying sparsity-inducing regularization (scardapane2017group), allows networks to be obtained with a significantly smaller number of effective parameters.

Interestingly, the shapes obtained in Fig. 6 seem to be necessary for the high performance of the networks, and they are not an artifact of initialization. Specifically, we obtain similar accuracies (and similar output functions) even when initializing all KAFs as close as possible to hyperbolic tangents, following the method described in Section 6, while we obtain a vastly inferior performance (in some cases even worse than the baseline) if we initialize the KAFs randomly and prevent their adaptation. This (informally) suggests that their flexibility and adaptability are an intrinsic component of their good performance, both in this experiment and in the following sections, an aspect that we return to in the conclusive section.

Figure 7: Examples of trained 2D-KAFs on the Sensorless dataset. On the $x$- and $y$-axes we plot the two activation values, while the output of the 2D-KAF is shown as a heat map, where darker colors represent larger output values and lighter colors represent lower output values.

Results are also similar when using 2D-KAFs, which we initialize with $D$ elements on each axis using the same strategy as for KAFs. In this scenario, they obtain a comparable test accuracy with a single hidden layer, and improve over the KAF when using two hidden layers. Some examples of the obtained shapes are provided in Fig. 7.

8.3 Comparisons on the SUSY benchmark

In this section, we evaluate the algorithms on a realistic large-scale use case, the SUSY benchmark introduced in baldi2014searching. The task is to predict the presence of supersymmetric particles in a simulation of a collision experiment, starting from 18 features (both low-level and high-level) describing the simulation itself. The overall dataset is composed of five million examples, of which the last 500000 are used for testing, and another 500000 for validation. The task is interesting for several reasons. Due to the nature of the data, even a tiny change in accuracy (measured in terms of area under the curve, AUC) is generally statistically significant. In the original paper, baldi2014searching showed that the best AUC was obtained by a deep feedforward network having five hidden layers, with significantly better results when compared to a shallow network. Surprisingly, agostinelli2014learning later showed that a shallow network is in fact sufficient, so long as it uses non-parametric activation functions (in that case, APL units).

In order to replicate these results with our proposed methods, we consider a baseline network inspired by baldi2014searching, having five hidden layers with 300 neurons each and ReLU activation functions, with dropout applied to the last two hidden layers with probability 0.5. For comparison, we also consider the same architecture, but we substitute the ReLUs with ELU, SELU, and PReLU functions. For SELU, we also substitute the standard dropout with the customized version proposed in klambauer2017self. We compare with simpler networks composed of one or two hidden layers of 300 neurons, employing maxout neurons, APL units (with the configuration proposed in agostinelli2014learning), and the proposed KAFs and 2D-KAFs, following the same random initialization as in the previous section. Results, in terms of AUC and number of trainable parameters, are given in Table 2.

Activation function  | Testing AUC | Trainable parameters
ReLU                 |             |
ELU                  |             |
SELU                 |             |
PReLU                |             |
Maxout (one layer)   |             |
Maxout (two layers)  |             |
APL (one layer)      |             |
APL (two layers)     |             |
KAF (one layer)      |             |
KAF (two layers)     |             |
2D-KAF (one layer)   |             |
2D-KAF (two layers)  |             |
Table 2: Results of different activation functions on the SUSY benchmark. The last four rows are the proposed KAF and 2D-KAF. The standard deviation of the AUC is given between brackets, the best result is shown in bold, and the second best result is underlined. All networks with fixed or parametric activation functions have five hidden layers. See the text for a full description of the architectures.

There are several clear results that emerge from the analysis of Table 2. First of all, the use of sophisticated activation functions (such as the SELU), or of parametric functions (such as the PReLU) can improve performance, in some cases even significantly. However, these improvements still require several layers of depth, while they both fail to provide accurate results when experimenting with shallow networks. On the other hand, all non-parametric functions are able to achieve similar (or even superior) results, while only requiring one or two hidden layers of neurons. Among them, APL and Maxout achieve a similar AUC with one layer, but only APL is able to benefit from the addition of a second layer. Both KAF and 2D-KAF are able to significantly outperform all the competitors, and the overall best result is obtained by a 2D-KAF network with two hidden layers. This is obtained with a significant reduction in the number of trainable parameters, as also described more in depth in the following section.

8.4 Experiments with convolutive layers on CIFAR-10

Although our focus has been on feedforward networks, an interesting question is whether the superior performance exhibited by KAFs and 2D-KAFs can also be obtained with different architectures, such as convolutional neural networks (CNNs). To investigate this, we train several CNNs on the CIFAR-10 dataset, composed of color images of size 32x32 belonging to 10 classes. Since our aim is only to compare different nonlinearities for the convolutional filters, we train simple CNNs made by stacking convolutional 'modules', each of which is composed of (a) a convolutional layer, (b) a max-pooling operation, and (c) a dropout layer. We consider CNNs with a varying number of such modules, where the output of the last dropout operation is flattened before applying a linear projection and a softmax operation. Our training setup is equivalent to that of the previous sections.

We consider different choices for the nonlinearity of the convolutional filters, using ELU as baseline, and our proposed KAFs and 2D-KAFs. In order to improve the gradient flow in the initial stages of training, KAFs in this case are initialized with the KRR strategy using ELU as the target. The results are shown in Fig. 8, where we show on the left the final test accuracy, and on the right the number of trainable parameters of the three architectures.

(a) Test accuracy
(b) Parameters
Figure 8: Results of KAF, 2D-KAF, and a baseline composed of ELU functions when using only convolutional layers on the CIFAR-10 dataset (see the text for a full description of the architectures). (a) Test accuracy. (b) Number of trainable parameters for the architectures.

Interestingly, both KAFs and 2D-KAFs obtain significantly better results than the baseline, i.e., networks with fewer convolutional modules are sufficient to surpass the accuracy obtained by a deeper network with the baseline activation functions. From Fig. 8b, we can see that this is obtained with a negligible increase in the number of trainable parameters for KAF, and with a significant decrease for 2D-KAF. The reason, as before, is that each nonlinearity in the 2D-KAF merges information coming from two different convolutional filters, effectively halving the number of parameters required for the subsequent layer.

8.5 Experiments on a reinforcement learning scenario

Before concluding, we evaluate the performance of the proposed activation functions when applied to a relatively more complex reinforcement learning scenario. In particular, we consider some representative MuJoCo environments from the OpenAI Gym platform (https://github.com/openai/gym/mujoco), where the task is to learn a policy to control highly nonlinear physical systems, including pendulums and bipedal robots. As a baseline, we use the open-source OpenAI implementation of the proximal policy optimization algorithm (schulman2017proximal), which learns a policy function by alternating between gathering new episodes of interaction with the environment and optimizing a surrogate loss function. All hyper-parameters are taken directly from the original paper, without attempting a specific fine-tuning for our algorithm. The policy function for the baseline is implemented as a NN with two hidden layers with fixed nonlinearities, providing in output the mean of a Gaussian distribution that is used to select an action. For the comparison, we keep the overall setup fixed, but we replace the nonlinearities with KAF neurons, using the same initialization as in the preceding sections.

(a) swimmer
(b) humanoidstandup
(c) pendulum_inverted
Figure 9: Results for the reinforcement learning experiments, in terms of average cumulative reward. We compare the baseline algorithm to an equivalent architecture with KAF nonlinearities. Details on the models and hyper-parameters are provided in the main discussion.

We plot the average cumulative reward obtained at every iteration of the algorithms on different environments in Fig. 9. We see that the policy networks implemented with KAFs consistently learn faster than the baseline and, in several cases, achieve a consistent improvement in the final reward.

9 Conclusive remarks

In this paper, after extensively reviewing known methods to adapt the activation functions in a neural network, we proposed a novel family of non-parametric functions, framed as a kernel expansion of their input value. We showed that these functions combine several advantages of previous approaches, without introducing an excessive number of additional parameters. Furthermore, they are smooth over their entire domain, and their operations can be implemented easily with a high degree of vectorization. Our experiments showed that networks trained with these activations can obtain higher accuracy than competing approaches on a number of different benchmarks and scenarios, including feedforward and convolutional neural networks.

From our initial model, we made a number of design choices in this paper, which include the use of a fixed dictionary, of the Gaussian kernel, and of a hand-picked bandwidth for the kernel. However, many alternative choices are possible, such as the use of dictionary selection strategies, alternative kernels (e.g., periodic kernels), and several others. In this respect, one intriguing aspect of the proposed activation functions is that they provide a further link between neural networks and kernel methods, opening the door to a large number of variations of the described framework.

Acknowledgments

The authors would like to thank the anonymous reviewers for their suggestions on how to improve the paper.

References