By and large, the appropriate selection of the activation function in deep architectures is regarded as a crucial choice for achieving strong performance. For example, the rectifier function has played an important role in the impressive scaling up of today's deep nets. Likewise, LSTM cells 
are widely recognized as the most important ingredient for facing long-term dependencies when learning with recurrent neural networks. Both choices come from insightful ideas on the actual non-linear process taking place in deep nets. At first glance, one might wonder why such an optimal choice should be restricted to a single unit instead of being extended to the overall function to be learned. This general problem has in fact already been solved; its solution [12, 5, 6] is at the basis of kernel machines, whose limitations as shallow nets have been widely addressed (see e.g. [10, 11]). However, the optimal formulation given for the neuron non-linearity enjoys the tremendous advantage of acting on 1-D spaces. This strongly motivates the reformulation of the problem of learning in deep neural networks as one where the weights and the activation functions are jointly determined by optimization in the framework of regularization operators , which are used to enforce the smoothness of the solution. The idea of learning the activation function is not entirely new. In , activation functions are chosen from a pre-defined set, and this strategy is combined with a single scaling parameter that is learned during training. It has been argued that one can think of this function as a neural network itself, so that the overall architecture is still characterized by a directed acyclic graph . Other approaches learn activation functions as piecewise linear , doubly truncated Gaussian , or Fourier series  expansions. In this paper, it is proven that, as for kernel machines, the optimal solution can be expressed by a kernel expansion, so that the overall optimization reduces to the discovery of a finite set of parameters. The risk function to be minimized contains the weights of the network connections, as well as the parameters associated with the points of the kernel expansion. 
Hence, the classic learning of the weights of the network takes place with the concurrent development of the optimal shape of the activation functions, one for each neuron. As a consequence, the machine architecture enjoys the strong representational properties of deep networks in high-dimensional spaces, combined with the elegant and effective setting of kernel machines for the learning of the activation functions. The powerful unified regularization framework is not the only feature that emerges from the proposed architecture. Interestingly, unlike most of the activation functions used in deep networks, those developed during learning are not necessarily monotonic. This property has a crucial impact on their adoption in classic recurrent networks, since it properly addresses the classic issue of gradient vanishing when capturing long-term dependencies. Throughout this paper, recurrent networks with activation functions based on kernel expansion are referred to as Kernel-Based Recurrent Networks (KBRN). The intuition is that the associated iterated map can either be contractive or expansive. Hence, while in some states the contraction yields gradient vanishing, in others the expansion results in gradient pumping, which allows the neural network to propagate information back also in the case of long-term dependencies. The possibility of implementing contractive and expansive maps during the processing of a given sequence comes from the capability of KBRN to develop different activation functions for different neurons, which are not necessarily monotonic. This variety of units is somewhat related to the clever solution proposed in LSTM cells , where the authors early realized that there was room for getting rid of the inherent limitation of the contractive maps deriving from sigmoidal units. The given experimental results provide evidence of this property on challenging benchmarks inspired by the seminal paper 
, where the distinctive information for the classification of long sequences is located only in the first positions, while the rest contains uniformly distributed noisy information. We obtain very promising results on these benchmarks when comparing KBRN with state-of-the-art recurrent architectures based on LSTM cells.
2 Representation and learning
The feedforward architecture that we consider is based on a directed graph , where is the set of ordered vertices and is the set of oriented arcs. Given , there is a connection from to iff . Instead of assuming a uniform activation function for each vertex of , a specific function is attached to each vertex. We denote with the set of input neurons, with the set of output neurons, and with the set of hidden neurons; the cardinalities of these sets will be denoted as , , and . Without loss of generality we will also assume that: , and .
The learning process is based on the training set
. Given an input vector, the output associated with the vertices of the graph is computed as follows (here we use Iverson's notation: given a statement , we set to if is true and to if is false):
with , where are the parents of neuron , and are one-dimensional real functions; , with chosen large enough so that Eq. (1) is always well defined. Now let and define the output function of the network by
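The forward computation above can be sketched in a few lines. The following is a minimal illustration (all names are ours, not from the paper) in which each neuron applies its own one-dimensional activation to the weighted sum of its parents' outputs:

```python
def forward(graph, weights, activations, x_input):
    """Forward pass on a DAG whose neurons have individual activations.

    graph: dict neuron -> list of parent neurons, keys in topological order
    weights: dict (parent, child) -> connection weight
    activations: dict neuron -> one-dimensional callable f_i
    x_input: dict input neuron -> clamped input value
    """
    x = dict(x_input)
    for i, parents in graph.items():
        if i in x:                       # input neurons keep their values
            continue
        a = sum(weights[(j, i)] * x[j] for j in parents)
        x[i] = activations[i](a)         # per-neuron non-linearity
    return x

# Tiny example: two inputs, one hidden ReLU unit, one linear output.
relu = lambda v: max(0.0, v)
graph = {0: [], 1: [], 2: [0, 1], 3: [2]}
weights = {(0, 2): 1.0, (1, 2): 1.0, (2, 3): 2.0}
out = forward(graph, weights, {2: relu, 3: lambda v: v}, {0: 1.0, 1: -0.5})
```

The only difference from a standard feedforward pass is that `activations[i]` is a distinct function per neuron, which is what the joint optimization below determines.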
The learning problem can then be formulated as a double optimization problem defined on both the weights , and on the activation functions . It is worth mentioning that while the optimization on the weights of the graph reflects all important issues connected with the powerful representational properties of deep nets, the optimal discovery of the activation functions is somewhat related to the framework of kernel machines. Such an optimization is defined with respect to the following objective function:
which accumulates the empirical risk and a regularization term based on regularization operators . Here, we indicate with the standard inner product of , with a differential operator of degree , while
is a suitable loss function.
Clearly, one can optimize by independently checking the stationarity with respect to the weights associated with the neural connections and the stationarity with respect to the activation functions. Now we show that the stationarity condition of with respect to the functional variables (chosen in a functional space that depends on the order of the differential operator ) yields a solution that is closely related to the classic case of kernel machines addressed in . If we consider a variation with vanishing derivatives up to order on the boundary of (here we assume that the values of the functions in at the boundary, together with their derivatives up to order , are fixed) and define , then the first variation of the functional along is . Using arguments already discussed in related papers [12, 5, 13], we can easily see that
where and , being the adjoint operator of . We notice in passing that the functional dependence of on is quite involved, since it depends on the compositions of linear combinations of the functions (see Figure 1–(a)). Hence, the given expression of the coefficients is rather a formal equation that, however, dictates the structure of the solution.
The stationarity conditions reduce to the following Euler-Lagrange (E-L) equations
where is the value of the activation function on the -th example of the training set. Let be the Green function of the operator , and let be the solution of . Then, we can promptly see that
is the general form of the solution of Eq. (3). Whenever has null kernel, this solution reduces to an expansion of the Green function over the points of the training set. For example, this happens in the case of the pseudo-differential operator that originates the Gaussian as Green function. If we choose , then . Interestingly, the Green function of the second derivative is the rectifier and, moreover, we have . In this case
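The statement that the rectifier is the Green function of the second-derivative operator can be checked in the distributional sense:

```latex
\frac{d}{dx}\,\max(0,x) = H(x), \qquad
\frac{d^{2}}{dx^{2}}\,\max(0,x) = \delta(x),
```

where $H$ denotes the Heaviside step function; hence $g(x)=\max(0,x)$ satisfies $Lg=\delta$ for $L = d^{2}/dx^{2}$.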
where , while . Because of the representation structure expressed by Eq. (4), the objective function of the original optimization problem collapses to a standard finite-dimensional optimization (here we omit the dependencies of the objective function on the parameters that define ):
Here, is the regularization term and . This collapse of dimensionality is the same one that leads to the dramatic simplification giving rise to the theory of kernel machines. Basically, in all cases in which the Green function can be interpreted as a kernel, this analysis suggests the neural architecture depicted in Figure 1, where we can see the integration of graphical structures, typical of deep nets, with the representation in the dual space that is typical of kernel methods.
We can promptly see that the idea behind kernel-based deep networks can be extended to cyclic graphs, that is, to recurrent neural networks. In that case, the analogue of Eq. (1) is:
Here we denote with the input at step and with the state of the network. The set contains the vertices that are parents of neuron ; the corresponding arcs are associated with a delay, while the remaining vertices are connected through non-delayed arcs . The extension of learning in KBDNN to the case of recurrent nets is a straightforward consequence of classic Backpropagation Through Time.
3 Approximation and algorithmic issues
The actual experimentation of the model described in the previous section requires dealing with a number of important algorithmic issues. In particular, we need to address the typical problem associated with the kernel expansion over the entire training set, which is very expensive in computational terms. However, we can readily see that KBDNNs only require kernel expansions in 1-D, which dramatically simplifies the kernel approximation. Hence, instead of expanding over the entire training set, we can use a number of points with . This means that the expansion in Eq. (4) is approximated as follows
where and are the centers and the parameters of the expansion, respectively (notice that they replace the corresponding terms in the formulation given in Section 2). We consider and as parameters to be learned, and integrate them into the whole optimization scheme.
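As a concrete illustration of the reduced expansion, the following sketch (assuming the rectifier as Green function; variable names are ours) evaluates an activation of the form f(a) = Σ_q α_q · max(0, a − c_q) with a small number Q of learnable centers:

```python
import numpy as np

def kernel_activation(a, centers, coeffs):
    """Reduced 1-D kernel expansion of an activation function:
    f(a) = sum_q coeffs[q] * relu(a - centers[q]),
    where both centers and coeffs are trainable parameters."""
    a = np.asarray(a, dtype=float)
    return np.sum(coeffs * np.maximum(0.0, a[..., None] - centers), axis=-1)

# With three centers one already obtains a non-monotonic "hat" shape:
centers = np.array([-1.0, 0.0, 1.0])
coeffs = np.array([1.0, -2.0, 1.0])   # triangular bump, zero outside [-1, 1]
values = kernel_activation(np.array([-2.0, 0.0, 2.0]), centers, coeffs)
```

Non-monotonic shapes of this kind are precisely the property exploited in the recurrent setting, where the iterated map can alternate between contractive and expansive regimes.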
In the experiments described below we use the rectifier (ReLU) as Green function () and neglect the linear terms from both and . We can easily see that this is compatible with typical requirements in machine learning experiments, where in many cases the expected solution is not meaningful for very large inputs. For instance, the same assumption is typically at the basis of kernel machines, where the asymptotic behavior is typically not important. The regularization term can be inherited from the regularization operator . For the experiments carried out in this paper we decided to choose the norm (this choice is due to the fact that we want to enforce the sparseness of , i.e. to use the smallest number of terms in the expansion of Eq. (6)):
with being a hyper-parameter that measures the strength of the regularization.
In a deep architecture, when stacking multiple layers of kernel-based units, the non-monotonicity of the activation functions implies the absence of guarantees on the interval on which these functions operate, thus requiring them to be responsive to very heterogeneous inputs. To address this problem, and to allow kernel-based units to concentrate their representational power on limited input ranges, it is possible to apply a normalization  to the input of the function. In particular, given , can be normalized as:
where and are additional trainable parameters.
We carried out several experiments in different learning settings to investigate the effectiveness of the KBDNN, with emphasis on the adoption of kernel-based units in recurrent networks for capturing long-term dependencies. Clearly, KBDNN architectures require choosing both the graph and the activation functions. As will become clear in the remainder of this section, the interplay of these choices leads to remarkable properties.
The XOR problem. We begin by presenting a typical experimental setup on the classic XOR benchmark. In this experiment we chose a single unit with the Green function , so that turns out to be
where , and are trainable variables, and the learning of corresponds to the discovery of both the centroids and the associated weights . The simplicity of this learning task allows us to underline some interesting properties of KBDNNs. We carried out experiments by selecting a number of points for the expansion of the Green function ranging from to . This was done purposely to assess the regularization capabilities of the model, which are closely related to what typically happens with kernel machines. In Figure 2, we can see the neuron function at the end of the learning process under different settings. The different columns plot the function for different numbers of clusters, while the two rows refer to experiments carried out with and without regularization. As one could expect, the learned activation functions become more and more complex as the number of clusters increases. However, when performing regularization, the kernel-based component of the architecture plays a crucial role by significantly smoothing the functions.
The charging problem. Let us consider a dynamical system which generates a Boolean sequence according to the model
where , is a sequence of integers and is a Boolean sequence, that is . An example of a sequence generated by this system is the following:
Notice that the system keeps memory while other bits are coming, that is
The purpose of this experiment was to check the capability of KBRN to approximate sequences generated according to Eq. (7). The intuition is that a single KB-neuron is capable of charging the state according to an input, and then of discharging it until the state is reset. We generated sequences of length . Three random elements of each sequence were set to a random number ranging from to . We compared KBRN, RNNs with sigmoidal units, and recurrent networks with LSTM cells, each with a single hidden unit. We used a KBRN unit with centers to approximate the activation function. Optimization was carried out with the Adam algorithm with in all cases. Each model was trained for iterations with mini-batches of size . Figure 3 shows the accuracy on a randomly generated test set of size during the training process. The horizontal axis is in logarithmic scale.
Learning Long-Term dependencies. We carried out a number of experiments aimed at investigating the capabilities of KBRN in learning tasks where we need to capture long-term dependencies. The difficulty of solving similar problems was addressed in 
by discussions on gradient vanishing, which is mostly due to the monotonicity of the activation functions. The authors also provided very effective yet simple benchmarks to support the claim that classic recurrent networks are unable to classify sequences where the distinguishing information is located only at the very beginning of the sequence, the rest of the sequence being randomly generated. We defined a number of benchmarks inspired by the one given in , where the decision on the classification of a sequence is contained in the first bits of a Boolean sequence of length . We compared KBRN and recurrent nets with LSTM cells using an architecture where both networks were based on hidden units. We used the Adam algorithm with in all cases. Each model was trained for a maximum of iterations with mini-batches of size 
; for each iteration, a single weight update was performed. For the LSTM cells, we used the standard implementation provided by TensorFlow (following ). For KBRN we used a number of centroids and the normalization described above.
We automatically generated a set of benchmarks with and variable length , where the binary sequences can be distinguished by looking simply at the first two bits, while the rest is a noisy string with uniform random distribution. Here we report some of our experiments in which the first two discriminant bits were chosen according to the , , and functions.
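The benchmark construction can be sketched as follows (a hedged reconstruction from the description above; the exact generation procedure used in the experiments may differ):

```python
import numpy as np

def make_sequence(boolean_fn, length, rng):
    """One benchmark sequence: the class label depends only on the first
    two bits through boolean_fn; the remaining bits are uniform noise."""
    bits = rng.integers(0, 2, size=length)
    label = int(boolean_fn(int(bits[0]), int(bits[1])))
    return bits, label

# The four discriminant functions used in the experiments:
AND = lambda a, b: a & b
OR  = lambda a, b: a | b
XOR = lambda a, b: a ^ b
EQ  = lambda a, b: int(a == b)   # equivalence: the hardest of the four

rng = np.random.default_rng(0)
bits, label = make_sequence(XOR, 50, rng)
```

Since only the first two positions carry label information, a learner must propagate gradient signal across the whole noisy suffix, which is exactly the long-term dependency being tested.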
For each Boolean function to be learned, and for several sequence lengths (up to 50), we performed 5 different runs with different initialization seeds. A trial was considered successful if the model was capable of learning the function before the maximum allowed number of iterations was reached. In Figure 5 we present the results of these experiments. Each of the four quadrants of Figure 5 refers to a different Boolean function and reports two plots. The first one has the sequence length on the -axis and the number of successful trials on the -axis. The second plot has the sequence length on the -axis and, on the -axis, the average number of iterations required to solve the task. The analysis of these plots allows us to draw a couple of interesting conclusions: (i) KBRN architectures are capable of solving the problems in almost all cases, regardless of the sequence length, while recurrent networks with LSTM cells started to experience difficulties for sequences longer than 30; and (ii) whenever convergence is achieved, KBRN architectures converge significantly faster than LSTM. In order to investigate in more detail the capability of KBRN of handling very long sequences, we carried out another experiment based on the benchmark that KBRN solved with the most difficulty, namely the equivalence () problem. We processed sequences of length , and . In Figure 6, we report the results of this experiment. As we can see, KBRN are capable of solving the task even with sequences of length 150, eventually failing with sequences of length 200.
In this paper we have introduced Kernel-Based Deep Neural Networks. The proposed KBDNN model is characterized by the classic primal representation of deep nets, enriched with the expressiveness of activation functions given by a kernel expansion. The idea of learning the activation function is not entirely new. However, in this paper we have shown that the KBDNN representation turns out to be the solution of a general optimization problem in which both the weights, which belong to a finite-dimensional space, and the activation functions, which are chosen from a functional space, are jointly determined. This naturally bridges the powerful representational capabilities of deep nets with the elegant and effective setting of kernel machines for the learning of the neuron functions.
Massive experimentation of KBDNN is still required to assess the actual impact of appropriate activation functions in real-world problems. However, this paper already proposes a first important conclusion involving recurrent networks based on this kind of activation function. In particular, we have provided both theoretical and experimental evidence supporting the claim that the KBRN architecture exhibits an ideal computational structure to deal with the classic problem of capturing long-term dependencies.
-  Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.
-  Y. Bengio, P. Frasconi, and P. Simard. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994. Special Issue on Dynamic Recurrent Neural Networks.
-  Ilaria Castelli and Edmondo Trentin. Combination of supervised and unsupervised learning for training the activation functions of neural networks. Pattern Recogn. Lett., 37:178–191, February 2014.
-  Carson Eisenach, Zhaoran Wang, and Han Liu. Nonparametrically learning activation functions in deep neural nets. 2016.
-  F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–269, 1995.
-  F. Girosi, M. Jones, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
-  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors, AISTATS, volume 15 of JMLR Proceedings, pages 315–323. JMLR.org, 2011.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
-  Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Hrushikesh Mhaskar, Qianli Liao, and Tomaso A. Poggio. Learning real and boolean functions: When is deep better than shallow. CoRR, abs/1603.00988, 2016.
-  T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
-  A.J. Smola, B. Schoelkopf, and K.R. Mueller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637– 649, 1998.
-  Qinliang Su, xuejun Liao, and Lawrence Carin. A probabilistic framework for nonlinearities in stochastic neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4486–4495. Curran Associates, Inc., 2017.
-  Andrew James Turner and Julian Francis Miller. Neuroevolution: evolving heterogeneous artificial neural networks. Evolutionary Intelligence, 7(3):135–154, 2014.
-  Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.