Learning Neuron Non-Linearities with Kernel-Based Deep Neural Networks

07/17/2018 ∙ by Giuseppe Marra, et al. ∙ 0

The effectiveness of deep neural architectures has been widely supported in terms of both experimental and foundational principles. There is also clear evidence that the activation function (e.g. the rectifier and the LSTM units) plays a crucial role in the complexity of learning. Based on this remark, this paper discusses an optimal selection of the neuron non-linearity in a functional framework that is inspired by classic regularization arguments. It is shown that the best activation function is represented by a kernel expansion over the training set, which can be effectively approximated over a suitable set of points modeling 1-D clusters. The idea can be naturally extended to recurrent networks, where the expressiveness of kernel-based activation functions turns out to be a crucial ingredient for capturing long-term dependencies. We give experimental evidence of this property through a set of challenging experiments, where we compare the results with neural architectures based on state-of-the-art LSTM cells.




1 Introduction

By and large, the appropriate selection of the activation function in deep architectures is regarded as an important choice for achieving competitive performance. For example, the rectifier function [7] has played an important role in the impressive scaling up of today's deep nets. Likewise, LSTM cells [8] are widely recognized as the most important ingredient for handling long-term dependencies when learning with recurrent neural networks. Both choices come from insightful ideas on the actual non-linear process taking place in deep nets. At first glance, one might wonder why such an optimal choice must be restricted to a single unit instead of extending it to the overall function to be learned. In addition, this general problem has already been solved; its solution [12, 5, 6] is in fact at the basis of kernel machines, whose limitations as shallow nets have been widely addressed (see e.g. [10, 11]). However, the optimal formulation given for the neuron non-linearity enjoys the tremendous advantage of acting on 1-D spaces. This strongly motivates the reformulation of the problem of learning in deep neural networks as one where the weights and the activation functions are jointly determined by optimization in the framework of regularization operators [13], which are used to enforce the smoothness of the solution.

The idea of learning the activation function is not entirely new. In [15], activation functions are chosen from a pre-defined set, and this strategy is combined with a single scaling parameter that is learned during training. It has been argued that one can think of this function as a neural network itself, so that the overall architecture is still characterized by a directed acyclic graph [3]. Other approaches learn activation functions as piecewise linear functions [1], doubly truncated Gaussians [14] or Fourier series [4]. In this paper, it is proven that, as for kernel machines, the optimal solution can be expressed by a kernel expansion, so that the overall optimization is reduced to the discovery of a finite set of parameters. The risk function to be minimized contains the weights of the network connections, as well as the parameters associated with the points of the kernel expansion. Hence, the classic learning of the weights of the network takes place with the concurrent development of the optimal shape of the activation functions, one for each neuron. As a consequence, the machine architecture turns out to enjoy the strong representational properties of deep networks in high-dimensional spaces, combined with the elegant and effective setting of kernel machines for the learning of the activation functions. The powerful unified regularization framework is not the only feature that emerges from the proposed architecture.
Interestingly, unlike most of the activation functions used in deep networks, those that are developed during learning are not necessarily monotonic. This property has a crucial impact on their adoption in classic recurrent networks, since it directly addresses the classic issue of gradient vanishing when capturing long-term dependencies. Throughout this paper, recurrent networks with activation functions based on kernel expansion are referred to as Kernel-Based Recurrent Networks (KBRN). The intuition is that the associated iterated map can be either contractive or expansive. Hence, while in some states the contraction yields gradient vanishing, in others the expansion results in gradient pumping, which allows the neural network to propagate information back even in the case of long-term dependencies. The possibility of implementing contractive and expansive maps during the processing of a given sequence comes from the capability of KBRN to develop different activation functions for different neurons, which are not necessarily monotonic. This variety of units is somewhat related to the clever solution proposed in LSTM cells [8], whose authors realized early on that there was room for getting rid of the inherent limitation of the contractive maps deriving from sigmoidal units. The given experimental results provide evidence of this property on challenging benchmarks inspired by the seminal paper [2], where the distinctive information for the classification of long sequences is located only in the first positions, while the rest contains uniformly distributed noise. We get very promising results on these benchmarks when comparing KBRN with state-of-the-art recurrent architectures based on LSTM cells.

2 Representation and learning

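The feedforward architecture that we consider is based on a directed graph $D = (\mathcal{V}, \mathcal{A})$, where $\mathcal{V}$ is the set of ordered vertices and $\mathcal{A}$ is the set of the oriented arcs. Given $i, j \in \mathcal{V}$, there is a connection from $j$ to $i$ iff $(j, i) \in \mathcal{A}$. Instead of assuming a uniform activation function for each vertex of $\mathcal{V}$, a specific function $f_i$ is attached to each vertex. We denote with $I$ the set of input neurons, with $O$ the set of the output neurons and with $H$ the set of hidden neurons; the cardinality of these sets will be denoted as $|I|$, $|O|$, $|H|$ and $|\mathcal{V}|$. Without loss of generality we will also assume that: $I = \{1, \dots, |I|\}$, $H = \{|I|+1, \dots, |I|+|H|\}$ and $O = \{|I|+|H|+1, \dots, |\mathcal{V}|\}$.

The learning process is based on the training set $T_N = \{(e^\kappa, y^\kappa) \in \mathbb{R}^{|I|} \times \mathbb{R}^{|O|},\ \kappa = 1, \dots, N\}$. Given an input vector $z = (z_1, \dots, z_{|I|})$, the output associated with the vertices of the graph is computed as follows (we are using here Iverson's notation: given a statement $S$, we set $[S]$ to $1$ if $S$ is true and to $0$ if $S$ is false):

$$x_i(z) = z_i\,[i \in I] + f_i(a_i)\,[i \notin I], \qquad (1)$$

with $a_i = \sum_{j \in \mathrm{pa}(i)} w_{ij}\, x_j + b_i$, where $\mathrm{pa}(i)$ are the parents of neuron $i$, and $f_i \colon \mathcal{R} \to \mathbb{R}$ are one dimensional real functions; $\mathcal{R} = [-R, R]$, with $R$ chosen big enough, so that Eq. (1) is always well defined. Now let $f = (f_i)_{i \notin I}$ and define the output function of the network by

$$F(z, w, b; f) = (x_i(z))_{i \in O}.$$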

The learning problem can then be formulated as a double optimization problem defined on both the weights $w, b$ and the activation functions $f$. It is worth mentioning that, while the optimization on the weights of the graph reflects all the important issues connected with the powerful representational properties of deep nets, the optimal discovery of the activation functions is somewhat related to the framework of kernel machines. Such an optimization is defined with respect to the following objective function:

$$E(f; w, b) = \frac{1}{2} \sum_{i \notin I} \langle P f_i, P f_i \rangle + \sum_{\kappa=1}^{N} V\big(e^\kappa, y^\kappa, F(e^\kappa, w, b; f)\big), \qquad (2)$$

which accumulates the empirical risk and a regularization term based on regularization operators [13]. Here, we indicate with $\langle \cdot, \cdot \rangle$ the standard inner product of $L^2(\mathcal{R})$, with $P$ a differential operator of degree $p$, while $V$ is a suitable loss function.
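The forward rule of Eq. (1) and the empirical-risk part of the objective can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vertices are numbered from 0 in topological order (the text numbers them from 1), the names `forward` and `empirical_risk` are hypothetical, a squared loss stands in for the unspecified loss function, and the regularization term is left out.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Activation = Callable[[float], float]

def forward(z: Sequence[float],
            parents: List[List[int]],
            weights: Dict[Tuple[int, int], float],
            biases: List[float],
            activations: List[Optional[Activation]],
            n_inputs: int) -> List[float]:
    """Return the output x_i of every vertex, visiting vertices in order."""
    x = [0.0] * len(parents)
    for i in range(len(parents)):
        if i < n_inputs:
            x[i] = z[i]                       # input neurons copy the input
        else:
            # a_i: weighted sum over the parents of neuron i, plus a bias
            a = biases[i] + sum(weights[(i, j)] * x[j] for j in parents[i])
            x[i] = activations[i](a)          # neuron-specific activation f_i
    return x

def empirical_risk(samples, n_outputs: int, **net) -> float:
    """Sum of squared errors over (input, target) pairs (assumed loss)."""
    total = 0.0
    for z, y in samples:
        out = forward(z, **net)[-n_outputs:]  # output neurons come last
        total += sum((o - t) ** 2 for o, t in zip(out, y))
    return total

# Toy graph: two inputs -> one hidden neuron -> one output neuron.
relu = lambda a: max(0.0, a)
identity = lambda a: a
net = dict(
    parents=[[], [], [0, 1], [2]],
    weights={(2, 0): 1.0, (2, 1): -1.0, (3, 2): 2.0},
    biases=[0.0, 0.0, 0.5, 0.0],
    activations=[None, None, relu, identity],
    n_inputs=2,
)
print(forward([3.0, 1.0], **net))
```

Each non-input vertex carries its own callable, which is the only structural difference from a standard feedforward pass with a shared activation.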

Clearly, one can optimize $E$ by independently checking the stationarity with respect to the weights associated with the neural connections and the stationarity with respect to the activation functions. Now we show that the stationarity condition of $E$ with respect to the functional variables $f$ (chosen in a functional space that depends on the order of the differential operator $P$) yields a solution that is closely related to the classic case of kernel machines addressed in [13]. Consider a variation $v_i$ of $f_i$ with vanishing derivatives on the boundary of $\mathcal{R}$ up to order $p-1$ (here we are assuming that the values of the functions in $f$ at the boundaries, together with the derivatives up to order $p-1$, are fixed), and define $\phi_i(s) = E(f + s v_i; w, b)$, where only the $i$-th component of $f$ is perturbed. The first variation of the functional $E$ along $v_i$ is therefore $\phi_i'(0)$. Using arguments already discussed in related papers [12, 5, 13], we can easily see that

$$\phi_i'(0) = \int_{\mathcal{R}} \Big( L f_i(a) + \sum_{\kappa=1}^{N} \alpha_i^\kappa\, \delta(a - a_i^\kappa) \Big)\, v_i(a)\, da,$$

where $\alpha_i^\kappa$ is the derivative of the loss on the $\kappa$-th example with respect to the output $x_i$ of neuron $i$, and $L = P^\star P$, $P^\star$ being the adjoint operator of $P$. We notice in passing that the functional dependence of $\alpha_i^\kappa$ on $f$ is quite involved, since it depends on the compositions of linear combinations of the functions $f_i$ (see Figure 1(a)). Hence, the given expression of the coefficients $\alpha_i^\kappa$ is rather a formal equation that, however, dictates the structure of the solution.


Figure 1: (a) A simple network architecture; the output evaluated using Eq. (1) is . (b) Highlight of the structure of neuron (encircled in the dashed line) of (a): The activation function of the neuron is computed as an expansion over the training set. Each neuron , in the figure corresponds to the term in Eq. (4).

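The stationarity conditions reduce to the following Euler-Lagrange (E-L) equations

$$L f_i(a) + \sum_{\kappa=1}^{N} \alpha_i^\kappa\, \delta(a - a_i^\kappa) = 0, \qquad (3)$$

where $a_i^\kappa$ is the value of the argument $a_i$ of the activation function on the $\kappa$-th example of the training set, and $\alpha_i^\kappa$ are the coefficients of the first variation. Let $g$ be the Green function of the operator $L$, and let $k$ be the solution of $L k = 0$. Then, we can promptly see that

$$f_i(a) = k(a) - \sum_{\kappa=1}^{N} \alpha_i^\kappa\, g(a - a_i^\kappa) \qquad (4)$$

is the general form of the solution of Eq. (3). Whenever $L$ has null kernel, this solution reduces to an expansion of the Green function over the points of the training set. For example, this happens in the case of the pseudo-differential operator that originates the Gaussian as the Green function. If we choose $P = d/dx$, then $L = -d^2/dx^2$. Interestingly, the Green function of the second derivative is the rectifier and, moreover, we have $k(a) = \psi a + \omega$, since the solutions of $L k = 0$ are affine. In this case

$$f_i(a) = \psi_i\, a + \omega_i + \sum_{\kappa=1}^{N} \chi_i^\kappa \max(0,\, a - a_i^\kappa), \qquad (5)$$

where $\chi_i^\kappa$ are the coefficients of the expansion, determined by $\alpha_i^\kappa$, while $\psi_i$ and $\omega_i$ parameterize the affine term. Because of the representation structure expressed by Eq. (4), the objective function of the original optimization problem collapses to a standard finite-dimensional optimization on $(w, b, \chi)$ (here we omit the dependence of the objective function on the parameters that define $k$):

$$E(\chi; w, b) = R(\chi) + \sum_{\kappa=1}^{N} V\big(e^\kappa, y^\kappa, F(e^\kappa, w, b; \chi)\big),$$

where $R(\chi)$ is the regularization term and $F(\cdot, w, b; \chi)$ is the network output when the activation functions are given by the expansion. This collapse of dimensionality is the same one that leads to the dramatic simplification that gives rise to the theory of kernel machines. Basically, in all cases in which the Green function can be interpreted as a kernel, this analysis suggests the neural architecture depicted in Figure 1, where we can see the integration of graphical structures, typical of deep nets, with the representation in the dual space that is typical of kernel methods.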
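The statement that the rectifier is the Green function of the second derivative can be checked numerically. The sketch below is an illustration only (the helper names are hypothetical): it applies a central finite difference to the rectifier and verifies that the result behaves like a discrete Dirac delta, i.e. it vanishes away from the origin and integrates to one.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def second_difference(f, x: float, h: float) -> float:
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def mass_of_second_derivative(h: float = 2.0 ** -5, half_width: float = 1.0) -> float:
    """Riemann sum of the discrete second derivative of relu over [-half_width, half_width]."""
    n = int(round(half_width / h))
    return sum(second_difference(relu, k * h, h) * h for k in range(-n, n + 1))

h = 2.0 ** -5  # a power of two keeps the grid arithmetic exact in floating point
print(second_difference(relu, 0.5, h))    # away from the origin
print(second_difference(relu, 0.0, h))    # spike of height 1/h at the origin
print(mass_of_second_derivative(h))       # total mass of the spike
```

The spike has height 1/h on a cell of width h, which is the discrete counterpart of a unit-mass delta located at the kink of the rectifier.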

We can promptly see that the idea behind kernel-based deep networks can be extended to cyclic graphs, that is, to recurrent neural networks. In that case, the analogue of Eq. (1) is:

Here we denote with the input at step and with the state of the network. The set contains the vertices that are parents of neuron ; the corresponding arcs are associated with a delay, while vertices with non-delayed arcs

. The extension of learning in KBDNN to the case of recurrent nets is a straightforward consequence of classic Backpropagation Through Time.
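To make the role of Backpropagation Through Time concrete, the sketch below unrolls a single recurrent unit and accumulates the derivative of the final state with respect to the initial one. It is a simplified illustration, not the formulation of this paper: the scalar recurrence `x_t = f(w * x_{t-1} + v * u_t + b)`, the function names, and the one-term activation `slope * max(0, a)` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

def unroll(f: Callable[[float], float],
           df: Callable[[float], float],
           w: float, v: float, b: float,
           x0: float, inputs: Sequence[float]) -> Tuple[List[float], float]:
    """Iterate x_t = f(w * x_{t-1} + v * u_t + b); also return d x_T / d x_0."""
    x, jacobian, states = x0, 1.0, []
    for u in inputs:
        a = w * x + v * u + b
        jacobian *= w * df(a)   # chain rule: one factor per time step
        x = f(a)
        states.append(x)
    return states, jacobian

def scaled_relu(slope: float):
    """One-term expansion slope * max(0, a) together with its derivative."""
    f = lambda a: slope * max(0.0, a)
    df = lambda a: slope if a > 0 else 0.0
    return f, df

pulse = [1.0, 0.0, 0.0, 0.0]
for slope in (0.5, 2.0):
    states, jac = unroll(*scaled_relu(slope), w=1.0, v=1.0, b=0.0, x0=0.0, inputs=pulse)
    print(slope, states, jac)
```

With slope 0.5 the map is contractive and the accumulated derivative shrinks geometrically; with slope 2.0 it is expansive and the derivative grows. An activation whose slope changes across regions can mix both regimes within one sequence.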

3 Approximation and algorithmic issues

Experimenting with the model described in the previous section requires dealing with a number of important algorithmic issues. In particular, we need to address the typical problem associated with the kernel expansion over the entire training set, which is very expensive in computational terms. However, we can readily see that KBDNNs only require expressing the kernel in 1-D, which dramatically simplifies the kernel approximation. Hence, instead of expanding over the entire training set, we can use a number of points $d$ with $d \ll N$. This means that the expansion in Eq. (4) is approximated as follows

$$f_i(a) \approx \sum_{j=1}^{d} \chi_i^j\, g(a - c_i^j), \qquad (6)$$

where $c_i^j$ and $\chi_i^j$ are the centers and parameters of the expansion, respectively. Notice that $\chi_i^j$ are replacing $\alpha_i^\kappa$ in the formulation given in Section 2. We consider $c_i^j$ and $\chi_i^j$ as parameters to be learned, and integrate them in the whole optimization scheme.
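A minimal sketch of such an approximated activation, assuming the rectifier as Green function, is given below. The class name `KernelActivation` and the method `param_grads` are hypothetical; the gradients are written out by hand only to show that both the centers and the coefficients can be updated by gradient descent along with the weights.

```python
from typing import List, Tuple

def relu(x: float) -> float:
    return max(0.0, x)

class KernelActivation:
    """f(a) = sum_j coeffs[j] * relu(a - centers[j]) with d learnable centers."""

    def __init__(self, centers: List[float], coeffs: List[float]) -> None:
        if len(centers) != len(coeffs):
            raise ValueError("centers and coeffs must have the same length")
        self.centers = list(centers)
        self.coeffs = list(coeffs)

    def __call__(self, a: float) -> float:
        return sum(chi * relu(a - c) for chi, c in zip(self.coeffs, self.centers))

    def param_grads(self, a: float) -> Tuple[List[float], List[float]]:
        """Partial derivatives of f(a) w.r.t. each coefficient and each center."""
        d_coeffs = [relu(a - c) for c in self.centers]
        # d relu(a - c) / dc is -1 where a > c and 0 elsewhere (subgradient at a == c)
        d_centers = [-chi if a > c else 0.0
                     for chi, c in zip(self.coeffs, self.centers)]
        return d_coeffs, d_centers

# Three centers are enough to produce a non-monotonic, zigzag-shaped function.
f = KernelActivation(centers=[-1.0, 0.0, 1.0], coeffs=[1.0, -2.0, 2.0])
print([f(a) for a in (-1.0, 0.0, 1.0, 2.0)])
print(f.param_grads(0.5))
```

Evaluation costs O(d) per call, independent of the training-set size, which is the practical payoff of replacing the full expansion with a small set of centers.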

In the experiments described below we use the rectifier (ReLU) as Green function ($g(x) = \max(0, x)$) and neglect the linear terms from both $g$ and $k$. We can easily see that this is compatible with typical requirements in machine learning experiments, where in many cases the expected solution is not meaningful for very large inputs. For instance, the same assumption is typically at the basis of kernel machines, where the asymptotic behavior is typically not important. The regularization term $R(\chi)$ can be inherited from the regularization operator $P$. For the experiments carried out in this paper we decided to choose the $\ell_1$ norm (this choice is due to the fact that we want to enforce the sparseness of $\chi$, i.e. to use the smallest number of terms in expansion (6)):

$$R(\chi) = \lambda_\chi \sum_{i \notin I} \sum_{j=1}^{d} |\chi_i^j|,$$

with $\lambda_\chi$ being a hyper-parameter that measures the strength of the regularization.
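The sparsity-inducing penalty on the expansion coefficients can be sketched as follows. The function names and the argument `strength` (standing for the regularization hyper-parameter) are hypothetical, and the subgradient convention at zero is a choice made for this example.

```python
from typing import List

def l1_penalty(coeffs: List[float], strength: float) -> float:
    """Regularization term: strength times the sum of absolute coefficients."""
    return strength * sum(abs(chi) for chi in coeffs)

def l1_subgradient(coeffs: List[float], strength: float) -> List[float]:
    """One valid subgradient; 0 is chosen where a coefficient is exactly 0."""
    sign = lambda value: (value > 0) - (value < 0)
    return [strength * sign(chi) for chi in coeffs]

coeffs = [1.0, -2.0, 0.0]
print(l1_penalty(coeffs, strength=0.5))
print(l1_subgradient(coeffs, strength=0.5))
```

Unlike a squared penalty, the pull towards zero has constant magnitude, so small coefficients are driven to exactly zero and the corresponding terms drop out of the expansion.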

In a deep architecture, when stacking multiple layers of kernel-based units, the non-monotonicity of the activation functions implies the absence of guarantees about the interval on which these functions operate, thus requiring them to be responsive to very heterogeneous inputs. In order to face this problem and to allow kernel-based units to concentrate their representational power on limited input ranges, it is possible to apply a normalization [9] to the input of the function. In particular, given $a_i$, it can be normalized as:

$$\hat{a}_i = \gamma_i\, \frac{a_i - \mu_i}{\sigma_i} + \beta_i,$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $a_i$, estimated as in [9], while $\gamma_i$ and $\beta_i$ are additional trainable parameters.
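A sketch of this kind of input normalization is shown below. It assumes statistics computed over a mini-batch of pre-activations of one neuron, in the style of batch normalization; the function name, the `eps` stabilizer and the default values are assumptions made for the example.

```python
import math
from typing import List

def normalize(values: List[float], gamma: float = 1.0, beta: float = 0.0,
              eps: float = 1e-5) -> List[float]:
    """Standardize a mini-batch of pre-activations, then rescale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(var + eps)   # eps avoids division by zero
    return [scale * (v - mean) + beta for v in values]

print(normalize([1.0, 2.0, 3.0]))
print(normalize([101.0, 102.0, 103.0]))   # same output: the offset is removed
```

After this step the centers of the expansion only need to cover a fixed range around `beta`, regardless of the scale of the incoming signal.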

4 Experiments

Figure 2: XOR.  The plots show the activation functions learned by the simplest KBDNN, which consists of one unit only, for the 2-dim (a) and 4-dim (b) XOR. The first and second rows refer to experiments without and with regularization, respectively, whereas the three columns correspond to the chosen number of points for the expansion of the Green function.

We carried out several experiments in different learning settings to investigate the effectiveness of the KBDNN, with emphasis on the adoption of kernel-based units in recurrent networks for capturing long-term dependencies. Clearly, KBDNN architectures require choosing both the graph and the activation function. As will become clear in the remainder of this section, the interplay of these choices leads to remarkable properties.

Figure 3: Charging Problem. The plot shows the accuracy obtained by recurrent nets with classic sigmoidal units, LSTM cells, and KB units. The horizontal axis is in logarithmic scale.

The XOR problem.  We begin by presenting a typical experimental setup on the classic XOR benchmark. In this experiment we chose a single unit with the Green function , so that turns out to be

where , and are trainable variables and the learning of corresponds to the discovery of both the centroids and the associated weights . The simplicity of this learning task allows us to underline some interesting properties of KBDNNs. We carried out experiments by selecting a number of points for the expansion of the Green function that ranges from to . This was done purposely to assess the regularization capabilities of the model, which are closely related to what typically happens with kernel machines. In Figure 2, we can see the neuron function at the end of the learning process under different settings. In the different columns, we plot the function with different numbers of clusters, while the two rows refer to experiments carried out with and without regularization. As one could expect, the learned activation functions become more and more complex as the number of clusters increases. However, when performing regularization, the kernel-based component of the architecture plays a crucial role by smoothing the functions significantly.
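To see why a single unit can solve XOR at all, it helps to write down one non-monotonic activation by hand. The parameters below are picked manually for illustration and are not the values learned in the experiments; the function name `unit` is hypothetical. The unit sums its inputs and applies a rectifier expansion that rises and then falls, so the output is high only when exactly one input is on.

```python
from itertools import product

def relu(x: float) -> float:
    return max(0.0, x)

def unit(z, w, b, centers, coeffs) -> float:
    """Single kernel-based unit: linear combination followed by a ReLU expansion."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    return sum(chi * relu(a - c) for chi, c in zip(coeffs, centers))

# Hand-picked parameters (illustrative): f rises on [0, 1] and falls on [1, 2].
xor2 = dict(w=[1.0, 1.0], b=0.0, centers=[0.0, 1.0], coeffs=[1.0, -2.0])
for z in product([0.0, 1.0], repeat=2):
    print(z, unit(z, **xor2))

# The same idea with four centers gives a zigzag that computes 4-bit parity.
xor4 = dict(w=[1.0] * 4, b=0.0,
            centers=[0.0, 1.0, 2.0, 3.0], coeffs=[1.0, -2.0, 2.0, -2.0])
print(unit((1.0, 1.0, 1.0, 0.0), **xor4))
```

A monotonic activation cannot do this with one unit, because XOR is not linearly separable; the extra centers are what buy the additional turning points.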

The charging problem.  Let us consider a dynamical system which generates a Boolean sequence according to the model


where , is a sequence of integers and is a Boolean sequence, that is . An example of sequences generated by this system is the following:

Notice that the system keeps memory when other bits are coming, that is


The purpose of this experiment was to check the capability of KBRN to approximate sequences generated according to Eq. 7. The intuition is that a single KB-neuron is capable of charging the state according to an input, and then of discharging it until the state is reset. We generated sequences of length . Three random elements of each sequence were set to a random number ranging from to . We compared KBRN, RNN with sigmoidal units, and recurrent nets with LSTM cells, each with a single hidden unit. We used a KBRN unit with centers to approximate the activation function. Optimization used the Adam algorithm with in all cases. Each model was trained for iterations with mini-batches of size . Figure 3 shows the accuracy on a randomly generated test set of size during the training process. The horizontal axis is in logarithmic scale.

Figure 4: Activation functions.  The activation functions corresponding to the problem of capturing long-term dependencies in sequences that are only discriminated by the first two bits ( function). All functions are plotted in the interval . The functions with a dashed frame are the ones for which in some subset of .
Figure 5: Capturing Long-Term dependencies.  Number of successful trials and average number of iterations for a classification problem when the , , and functions are used to determine the target, given the first two discriminant bits.

Learning Long-Term dependencies.  We carried out a number of experiments aimed at investigating the capabilities of KBRN in learning tasks where we need to capture long-term dependencies. The difficulty of solving similar problems was addressed in [2] through a discussion of gradient vanishing, which is mostly due to the monotonicity of the activation functions. The authors also provided very effective yet simple benchmarks to claim that classic recurrent networks are unable to classify sequences where the distinguishing information is located only at the very beginning of the sequence; the rest of the sequence was supposed to be randomly generated. We defined a number of benchmarks inspired by the one given in [2], where the decision on the classification of a sequence is contained in the first bits of a Boolean sequence of length . We compared KBRN and recurrent nets with LSTM cells using an architecture where both networks were based on hidden units. We used the Adam algorithm with in all cases. Each model was trained for a maximum of iterations with mini-batches of size ; for each iteration, a single weight update was performed. For the LSTM cells, we used the standard implementation provided by TensorFlow (following [16]). For KBRN we used a number of centroids and the described normalization.

We automatically generated a set of benchmarks with and variable length , where the binary sequences can be distinguished by simply looking at the first two bits, while the rest is a noisy string with uniformly random distribution. Here we report some of our experiments when choosing the first two discriminant bits according to the , , and functions.
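A benchmark of this kind is straightforward to generate. The sketch below is an assumption-laden illustration: the function name `make_benchmark`, the seed handling and the two example rules are hypothetical, and only the equivalence rule is named explicitly in the text; any other two-argument Boolean function can be passed in the same way.

```python
import random
from typing import Callable, List, Tuple

def make_benchmark(n_sequences: int, length: int,
                   rule: Callable[[int, int], int],
                   seed: int = 0) -> List[Tuple[List[int], int]]:
    """Uniform random bit strings labelled by a Boolean rule on the first two bits."""
    if length < 2:
        raise ValueError("sequences need at least two bits")
    rng = random.Random(seed)   # local generator keeps the data reproducible
    data = []
    for _ in range(n_sequences):
        bits = [rng.randint(0, 1) for _ in range(length)]
        data.append((bits, rule(bits[0], bits[1])))   # the rest is pure noise
    return data

equivalence = lambda p, q: int(p == q)
xor = lambda p, q: p ^ q

for bits, label in make_benchmark(3, 10, equivalence, seed=1):
    print(bits, label)
```

Because the label depends only on the first two positions, a recurrent model must carry that information across all remaining steps, so increasing `length` directly increases the span of the dependency to be learned.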

For each Boolean function to be learned, and for several sequence lengths (up to 50), we performed 5 different runs with different initialization seeds. A trial was considered successful if the model was capable of learning the function before the maximum allowed number of iterations was reached. In Figure 5 we present the results of these experiments. Each of the four quadrants of Figure 5 refers to a different Boolean function and reports two plots. The first one has the sequence length on the x-axis and the number of successful trials on the y-axis. The second plot has the sequence length on the x-axis and, on the y-axis, the average number of iterations required to solve the task. The analysis of these plots allows us to draw a couple of interesting conclusions: (i) KBRN architectures are capable of solving the problems in almost all cases, regardless of the sequence length, while recurrent networks with LSTM cells started to experience difficulties for sequences longer than 30, and (ii) whenever convergence is achieved, KBRN architectures converge significantly faster than LSTM. In order to investigate in more detail the capability of KBRN to handle very long sequences, we carried out another experiment based on the benchmark that KBRN solved with the most difficulty, namely the equivalence problem. We processed sequences with length , and . In Figure 6, we report the results of this experiment. As we can see, KBRN are capable of solving the task even with sequences of length 150, eventually failing with sequences of length 200.

Figure 6: Capturing Long-Term dependencies.  Number of successful trials and average number of iterations when facing the problem with sequences of length ranging from 5 to 200, when the distinguishing information is located in the first two bits.

5 Conclusions

In this paper we have introduced Kernel-Based Deep Neural Networks. The proposed KBDNN model is characterized by the classic primal representation of deep nets, enriched with the expressiveness of activation functions given by a kernel expansion. The idea of learning the activation function is not entirely new. However, in this paper we have shown that the KBDNN representation turns out to be the solution of a general optimization problem, in which both the weights, which belong to a finite-dimensional space, and the activation functions, which are chosen from a functional space, are jointly determined. This naturally bridges the powerful representation capabilities of deep nets with the elegant and effective setting of kernel machines for the learning of the neuron functions.

Extensive experimentation with KBDNN is still required to assess the actual impact of the appropriate activation function in real-world problems. However, this paper already proposes a first important conclusion involving recurrent networks that are based on this kind of activation function. In particular, we have provided both theoretical and experimental evidence to claim that the KBRN architecture exhibits an ideal computational structure to deal with classic problems of capturing long-term dependencies.