The recent successes of deep learning models (e.g., in machine translation) have given a parallel boost to the theoretical understanding of their good generalization properties. In particular, several works have been devoted to the so-called ‘overfitting puzzle’ [zhang2017understanding, poggio2017theory], i.e., the fact that highly over-parameterized neural networks (NNs), able in principle to immediately memorize the entire training set [zhang2017understanding], are nonetheless able to generalize well even with only moderate amounts of regularization. While this is counter-intuitive from the point of view of classical results in statistical learning theory (e.g., capacity measures such as the VC-dimension), a wide range of alternative explanations have been proposed to justify the strong empirical performance of NNs [neyshabur2017exploring, kawaguchi2017generalization, raghu2017expressive, asadi2018chaining].
The vast majority of these works have focused on a standard class of NNs, composed of linear projections (or convolutions) interleaved with fixed, element-wise nonlinearities, particularly the rectified linear unit (ReLU) [du2018gradient]. Some have also explored the interplay of this type of NN with the optimization hyper-parameters, e.g., the batch size [hoffer2017train], or with newer types of regularization, such as batch normalization.
However, several authors have recently advocated the need for other types of architectures, especially through the use of flexible activation functions, able to learn per-neuron shapes during the training process. Examples of these are the maxout network [goodfellow2013maxout], the adaptive piecewise linear unit [agostinelli2014learning], and the kernel activation function (KAF) [scardapane2019kafnets], which is the main focus of this paper. While KAFs and similar models have shown good empirical performance compared to classical architectures (e.g., see also [scardapane2018recurrent] for the recurrent case), they introduce significantly more flexibility into the architecture, and thus more potential for overfitting. For this reason, it is essential to supplement their evaluation with thorough analyses of their generalization properties, which in the case of KAFs have not yet been addressed by previous research.
Contributions of the paper
The aim of this paper is to analyze the generalization capabilities of NNs endowed with KAF nonlinearities. To this end, we exploit the analysis presented in [hardt2016train], whose roots can be found in previous works linking the generalization capabilities of a model to its algorithmic stability [bousquet2002stability]. In more detail, [hardt2016train] showed that a non-convex model trained via stochastic gradient descent (SGD) for a finite number of steps can generalize well, provided we are able to bound several constants related to its Lipschitz continuity and smoothness. Using this, we can obtain bounds on the generalization properties of KAFs by indirectly analyzing their smoothness and plugging these results back into the theorems from [hardt2016train]. Interestingly, our main theorem in this sense (see Section IV) provides a rigorous bound on one key hyper-parameter of the model, thus providing a practical guideline for using KAFs in real-world scenarios. We note that the general outline of our proof method is similar to the one introduced in [eisenach2016nonparametrically], from which we take a few results. However, because of the differences in the models we explore, the bulk of the proof differs significantly from [eisenach2016nonparametrically].
Organization of the paper
The rest of this brief is organized as follows. In Section II we introduce the key concepts from [hardt2016train] that will be used for our analysis. Section III describes the KAF model from [scardapane2019kafnets], which is the focus of this brief. We prove the generalization of this model in Sections IV and V. After a small experimental evaluation in Section VI, we conclude in Section VII with insights on possible future work.
In this section, we recall some basic elements from the stability theory of [hardt2016train], on which we build our analysis. We denote by a generic NN, where we collect all adaptable parameters in , while denotes an input vector (later on we will specialize our analysis to a specific architecture for , based on KAFs [scardapane2019kafnets]). Given a loss function and an (unknown) probability distribution generating the data, we define the expected risk of under as:
If we are only provided with a sample of i.i.d. draws from given by , the empirical risk is the finite sample approximation of (1) using :
The expected generalization error of a randomized algorithm is given by:
where expectation is taken both with respect to all possible training sets and runs of the algorithm.
Algorithm is -uniformly stable if, for all data sets , differing for at most one example, we have:
Fundamentally for our analysis, stability of an algorithm implies generalization in expectation (see also [bousquet2002stability]).
(Hardt et al., 2016 [hardt2016train]) If is -uniformly stable, then .
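For reference, the quantities recalled above can be written out in the standard notation of [hardt2016train]. This is only a sketch: the symbols used here (a loss $f(w;z)$ of parameters $w$ on an example $z$, the distribution $\mathcal{D}$, the sample $S=(z_1,\dots,z_n)$, the algorithm $A$ and the stability constant $\epsilon$) are assumptions made for illustration and may differ from the notation used in the rest of this brief.

```latex
% Expected risk and empirical risk
R[w] = \mathbb{E}_{z \sim \mathcal{D}}\, f(w; z),
\qquad
R_S[w] = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i).

% Expected generalization error of a randomized algorithm A
\epsilon_{\mathrm{gen}} = \mathbb{E}_{S,A}\bigl[ R_S[A(S)] - R[A(S)] \bigr].

% A is \epsilon-uniformly stable if, for all S, S' differing in at most one example,
\sup_{z}\; \mathbb{E}_{A}\bigl[ f(A(S); z) - f(A(S'); z) \bigr] \le \epsilon.

% Stability implies generalization in expectation
\bigl| \mathbb{E}_{S,A}\bigl[ R_S[A(S)] - R[A(S)] \bigr] \bigr| \le \epsilon.
```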
In particular, [hardt2016train] considers training with SGD, where at every iteration the estimate is refined as:
with the index randomly chosen at every time step and is the step size. Before introducing the main theorem from [hardt2016train], we need two additional definitions.
A real-valued function is -Lipschitz if for any points and in its domain we have:
A real-valued function is -smooth if for any points and in its domain we have:
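As a reading aid, the SGD update and the two definitions above can be sketched in the standard notation of [hardt2016train]. The symbols ($w_t$ for the estimate at step $t$, $\alpha_t$ for the step size, $i_t$ for the randomly chosen index, $L$ and $\beta$ for the two constants) are assumed here for illustration.

```latex
% SGD update with a randomly chosen index i_t and step size \alpha_t
w_{t+1} = w_t - \alpha_t \nabla_w f(w_t; z_{i_t}).

% f is L-Lipschitz if, for any u and v in its domain,
| f(u) - f(v) | \le L \, \| u - v \|.

% f is \beta-smooth if, for any u and v in its domain,
\| \nabla f(u) - \nabla f(v) \| \le \beta \, \| u - v \|.
```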
(Uniform stability of SGD) Assume that the loss function is bounded, -Lipschitz and -smooth in terms of the vector , for any couple . Assume we optimize it with SGD for a finite number of steps, using monotonically non-increasing step sizes , with and a certain constant. Then, the algorithm is -uniformly stable with:
The proof can be found in Theorem 3.12 from [hardt2016train]. Summarizing these results, for any architecture trained using SGD one can prove generalization bounds in this way: first, one proves that the loss function is bounded, -Lipschitz and -smooth; then, one uses Theorems 1 and 2 to automatically infer uniform stability and, in turn, generalization bounds. Remarkably, this also provides an intuition which is somewhat in contrast with the overfitting puzzle, i.e., it makes sense to use models that are as expressive as possible, provided this additional flexibility translates to a faster rate of convergence, making it possible to reduce the number of SGD steps taken during training. In the words of [hardt2016train]: “it may make sense for practitioners to focus on minimizing training time, for instance, by designing model architectures for which stochastic gradient method converges fastest to a desired error level.”
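The two-step recipe above can be illustrated numerically. The following is a minimal sketch, not the procedure used in this brief: it runs SGD with the step sizes $\alpha_t = c/t$ on a toy squared loss and evaluates the stability bound as it is stated in Theorem 3.12 of [hardt2016train] for a loss taking values in $[0,1]$, namely $\frac{1 + 1/(\beta c)}{n-1}\,(2cL^2)^{\frac{1}{\beta c+1}}\,T^{\frac{\beta c}{\beta c+1}}$. The function names `sgd`, `squared_loss_grad` and `sgd_stability_bound`, as well as the toy data, are hypothetical.

```python
import random


def sgd(grad, w0, data, num_steps, c, seed=0):
    """Plain SGD with step sizes alpha_t = c / t (non-increasing, alpha_t <= c / t)."""
    rng = random.Random(seed)
    w = list(w0)
    for t in range(1, num_steps + 1):
        z = data[rng.randrange(len(data))]  # index chosen at random at every step
        alpha_t = c / t
        g = grad(w, z)
        w = [w_j - alpha_t * g_j for w_j, g_j in zip(w, g)]
    return w


def squared_loss_grad(w, z):
    """Gradient of the toy loss 0.5 * (w[0] * x - y)**2 with respect to w."""
    x, y = z
    return [(w[0] * x - y) * x]


def sgd_stability_bound(n, T, L, beta, c):
    """Uniform stability bound for T SGD steps on n examples (assumed form of the theorem)."""
    if n < 2:
        raise ValueError("the bound requires at least two training examples")
    q = beta * c
    return (1.0 + 1.0 / q) / (n - 1) * (2.0 * c * L ** 2) ** (1.0 / (q + 1.0)) * T ** (q / (q + 1.0))


if __name__ == "__main__":
    toy_data = [(1.0, 3.0)] * 4  # hypothetical data: y = 3 * x
    print(sgd(squared_loss_grad, [0.0], toy_data, num_steps=10, c=1.0))
    # The bound grows with the number of steps T and shrinks with the sample size n.
    for T in (10, 100, 1000):
        print(T, sgd_stability_bound(n=1000, T=T, L=1.0, beta=1.0, c=0.5))
```

The bound grows sublinearly in the number of steps and decays as the sample size grows, which is the quantitative counterpart of the remark on minimizing training time.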
III Kernel activation functions
Consider now a feedforward, fully connected NN architecture:
where each hidden layer consists of a certain number of functions, the neurons. More formally, let denote the th neuron of the th layer. When applied to an input vector , we have that is of the type , where:
and is a certain nonlinear transformation, the activation function. Some classical choices for such nonlinearities are and ReLU, for which Lipschitz properties have received considerable attention [wiatowski2018mathematical]. However, in this paper we consider a more expressive kind of activation function, one that can adapt its shape based on the training data, namely, one that has parameters to be tuned during the learning process. In particular, for we model each as a KAF [scardapane2019kafnets], whose definition is given below, whereas, since we consider multiclass classification problems, in the last layer, , we define the activation functions ’s so that they constitute the softmax.
A kernel activation function (KAF) [scardapane2019kafnets] for layer , node is defined as:
where is a generic input to the KAF, is a hyper-parameter, are the mixing coefficients, is a one-dimensional kernel function, and are the dictionary elements.
As in the original implementation of KAFs, we consider a separate set of mixing coefficients for every neuron (which are adapted via SGD together with the parameters of the linear projection), while the dictionary elements are fixed and shared across the entire network. In other words, in our scenario the parameter vector is composed of the parameters ’s and ’s, as usual, but also of the mixing coefficients ’s of the activation functions in the hidden layers. The dictionary elements are defined by first choosing a value for and then sampling values across the -axis uniformly with a certain sampling step. Generally speaking, increasing increases the expressivity of the single activation function. Depending on the kernel function , one obtains different schemes for the functions’ adaptation. In [scardapane2019kafnets] and most of the later extensions, e.g. [scardapane2018recurrent], a Gaussian kernel is used:
where is a second hyper-parameter (the inverse of the kernel bandwidth). Varying has a clear intuitive meaning in varying the ‘receptive field’ of each component of the linear expansion in (7). We consider this kernel for our analysis. A NN endowed with KAFs at all hidden layers is called a Kafnet.
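To make the definition concrete, the following is a minimal sketch of a single KAF with a Gaussian kernel, written as a linear expansion over a fixed dictionary. It assumes the usual form from [scardapane2019kafnets], i.e., the output is $\sum_i \alpha_i \exp(-\gamma (s - d_i)^2)$, and it assumes that the dictionary is an equally spaced grid symmetric around zero. The names `make_dictionary`, `gaussian_kernel`, `kaf`, `gamma` and `half_width` are hypothetical and introduced only for illustration.

```python
import math


def make_dictionary(num_elements, half_width):
    """Equally spaced dictionary on [-half_width, half_width] (assumed symmetric around zero)."""
    if num_elements < 2:
        raise ValueError("at least two dictionary elements are needed")
    step = 2.0 * half_width / (num_elements - 1)
    return [-half_width + i * step for i in range(num_elements)]


def gaussian_kernel(s, d, gamma):
    """One-dimensional Gaussian kernel; gamma is the inverse of the kernel bandwidth."""
    return math.exp(-gamma * (s - d) ** 2)


def kaf(s, alphas, dictionary, gamma):
    """KAF output: mixing coefficients times kernel values, summed over the dictionary."""
    if len(alphas) != len(dictionary):
        raise ValueError("one mixing coefficient per dictionary element is required")
    return sum(a * gaussian_kernel(s, d, gamma) for a, d in zip(alphas, dictionary))


if __name__ == "__main__":
    dictionary = make_dictionary(num_elements=5, half_width=2.0)
    alphas = [0.0, 0.0, 1.0, 0.0, 0.0]  # only the element at zero is active
    for s in (-1.0, 0.0, 1.0):
        print(s, kaf(s, alphas, dictionary, gamma=1.0))
```

In this sketch only the mixing coefficients would be trained, while the dictionary stays fixed. A larger `gamma` narrows the region around each dictionary element in which the corresponding coefficient influences the output.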
The function represents the affine transformation applied on the inputs of the neuron. Such inputs, due to the feedforward and fully connected architecture of the considered network, are the outputs ’s of the neurons from the previous layer. For the case , we have that the quantity is equal to the input and is equal to . The neurons in the output layer, where , are defined analogously, with the only difference that the softmax function is used in place of (9).
IV Generalization properties of Kafnets
To simplify reading, we state here our main result, and postpone its proof to the next section. As stated before, we focus on classification problems, although the method is easily extendable to regression tasks. For this, we assume the output function in (5) is a softmax, while in (2) is the cross-entropy loss. As in the previous section, we use to index a layer, to index a neuron inside a layer (strictly speaking we should write , but we drop the suffix for readability).
(Smoothness of Kafnets) Let us assume that there exist constants , , , such that:
For any , ;
For any , and ;
For any , .
In addition, if we define , under the following conditions:
The elements of the dictionary are sampled uniformly in with ;
then the loss function in (2) is bounded, -Lipschitz and -smooth with respect to , with:
We remark that the asymptotic notation used above is related to the variables and , whereas the parameters , , and are intended to be constants of the problem, namely, belonging to .
Due to the large number of computations involved, the proof is postponed to Section V. ∎
Note that conditions (1)-(3) are similar to those used in [eisenach2016nonparametrically] and can be enforced trivially. Condition (4) simply states that the network has at least one hidden layer. Conditions (5) and (6) are more interesting because they impose some constraints on the hyper-parameters of the model. In particular, condition (5) requires that the elements of the dictionary should be sampled from a sufficiently strict distribution, that increases at most linearly w.r.t. , while condition (6) is a non-trivial constraint on the receptive field of each mixing coefficient. Combining this result with the results from Section II gives us the desired property of generalization.
V Proof of the main theorem
V-A Notational Conventions and Outline of the Proof.
Furthermore, for any parameters and in , we define the generic partial derivatives of as:
Likewise, we define the partial derivatives of and as , , and .
Now we are ready to sketch the outline of the proof, deferring some technical digressions to the next subsections. According to Lemmas B.9 and B.10 in [eisenach2016nonparametrically], if the functions ’s and ’s are bounded by and , respectively, then the loss function is bounded, -Lipschitz and -smooth. More precisely, in asymptotic notation the two lemmas state that:
The core part of the proof consists of proving:
In fact, once these bounds are computed, the statement easily follows by substituting (15) in (14). (The computation of these bounds depends heavily on the kind of activation functions that are employed. For this reason, even if our proof follows an outline similar to that of its counterpart in [eisenach2016nonparametrically], it differs substantially in its technical apparatus, due to the very different characteristics of the activation functions involved here.) Therefore, in the remainder of the proof we will focus on proving (15).
V-B Some intermediate results.
In this subsection we collect some intermediate results that are needed to prove (15). Before continuing, we present a useful inequality that will be used repeatedly in what follows, , which is well known as the triangle inequality.
Contrariwise, when , the hypotheses and , together with the previous result in (17), lead to:
Now, we analyse the bounds of the first and second order derivatives of the functions ’s, ’s and ’s. In particular, these results hold under the assumption that, given a fixed , for any :
From the definition of in (8), we obtain:
Since, by the triangle inequality, , and since by hypothesis and , and recalling also that , we obtain:
Now, we focus on the functions ’s and ’s. Let us define , from the definition of in (9) we have:
and, recalling that , we have:
As usual, since we have , and, by the triangle inequality, , by using the hypothesis and (LABEL:eq:boundsEppp) we obtain:
With analogous considerations as above, we obtain:
Finally, we prove two boundary inequalities for the derivatives of the functions ’s. Let us define . According to the definition of in (10), we have:
and, since , we have:
When , recalling that is constant and equal to , we simply have and for any , and we also have . Therefore:
and is less than or equal to the maximum among:
In particular, for obtaining these results we used the identities , and , for any .
V-C Asymptotic bounds of the affine transformations.
In this subsection we conclude the proof, proving the missing piece (15). To this aim, by exploiting the recursive structure of the network, we prove by induction a more general result, such that, for any fixed , the functions ’s and ’s are all bounded by:
Then, since we assumed , this general result implies the desired one in (15). Preliminarily, we recall that the asymptotic notation used here is intended to be relative to the variables ’s and , whereas the parameters , , , , are intended to be constants. Technically speaking, this means that, for example, expressions like belong to . In what follows, we will repeatedly use some well-known properties of the asymptotic notation, for example, if and then , and .
We start by inductively proving the first part of (28) about the ’s. Let us consider the base case . From (18) and (24), we note that the bounds of the functions ’s and ’s depend neither on nor on , and, therefore, they are two constants and . Furthermore, using the hypothesis , and recalling that , we have that , and, therefore:
Now, we turn to the inductive step, namely, we prove (28) assuming that it holds for . First, from (19) we note that when the functions ’s have bound such that . We have again and , as in the base case, but now, since , and since fulfils (28), we have:
For the bounds ’s, exactly the same arguments apply. Due to space constraints, we provide only a brief sketch of their proof, which nevertheless includes all the most technical intermediate results. For the base case , since and , the asymptotic counterpart of (27) becomes:
The maximum returns the fourth entry (recall that ), that in compact form reduces to , namely, to the claim (28).
VI Experimental evaluation
In this section we provide a small experimental verification of our main theorem (Theorem 3). We note that this section is not intended to validate KAFs experimentally, as this was done extensively in the original publication [scardapane2019kafnets] and later works. Instead, we showcase a simplified toy scenario to test the validity of our bounds.
We use the experimental procedure from [guyon2003design] (which is also implemented in the scikit-learn Python library) to generate a simple two-class classification problem, where each class is described by two clusters randomly assigned to the vertices of a -dimensional hypercube. To this we add two additional random features with no correlation to the actual class. We sample points as our training set, and another for testing. We compare two NNs with one -dimensional hidden layer each, endowed with KAF nonlinearities. In both cases, each KAF has elements in the dictionary sampled uniformly from , while the mixing coefficients are randomly initialized from a normal distribution. For the first KAF we select , while for the second one we select a smaller . We remark that only the former choice is consistent with the condition in Theorem 3. Furthermore, we remark that the choice and the values assigned to the other hyperparameters are coherent with the ones considered in [scardapane2019kafnets]; this is an important point, since we are interested in testing generalization bounds for a Kafnet that has already proved to work well in practice. We train the two networks with the Adam optimization algorithm on random mini-batches of elements, evaluating at every step the empirical risk on a separate mini-batch taken from the test portion of the dataset.
Both models converge to a final training loss which is in only a few iterations. However, we plot in Fig. 1 the generalization gap of the two models, that we compute as the ratio between the test empirical risk and the training empirical risk. As can be seen, the model satisfying the assumptions of our theorem only starts to overfit after a few hundred iterations, while the model with the lower starts to overfit almost immediately, with a final empirical risk which is almost double for the new data than for the training data.
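The generalization gap described above, i.e., the ratio between the test and the training empirical risk, can be computed as in the following sketch. It assumes that the empirical risk is the average cross-entropy over a mini-batch of predicted class probabilities. The function names and the toy numbers are hypothetical and are not taken from the experiment.

```python
import math


def cross_entropy_risk(probabilities, labels):
    """Average cross-entropy of predicted class probabilities against integer labels."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need one label per prediction and a non-empty batch")
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)


def generalization_gap_ratio(test_risk, train_risk):
    """Ratio between test and training empirical risk; values above 1 indicate overfitting."""
    if train_risk <= 0.0:
        raise ValueError("the training risk must be strictly positive")
    return test_risk / train_risk


if __name__ == "__main__":
    # Hypothetical two-class predictions on a training and a test mini-batch.
    train_probs, train_labels = [[0.9, 0.1], [0.2, 0.8]], [0, 1]
    test_probs, test_labels = [[0.7, 0.3], [0.4, 0.6]], [0, 1]
    train_risk = cross_entropy_risk(train_probs, train_labels)
    test_risk = cross_entropy_risk(test_probs, test_labels)
    print(generalization_gap_ratio(test_risk, train_risk))
```

A ratio close to 1 means the two risks agree, while a ratio close to 2 corresponds to a risk on new data that is almost double the training risk.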
VII Conclusions and discussion
In this paper we provided an analysis of the generalization capabilities of a new class of non-parametric activation functions, i.e., kernel activation functions (KAFs). The analysis builds on top of recent results with respect to the interplay between stochastic gradient descent (SGD) and stability of the functions being optimized. Our main theorem shows that a NN endowed with KAFs is stable and generalizes well when trained with SGD for a finite number of steps, provided it satisfies a simple bound on one of its hyper-parameters.
Here, we focused on a specific variant of KAF, which is the one used in [scardapane2019kafnets]. For future research, we aim at generalizing our theorems to more types of kernel functions. In addition, we are eager to investigate different types of generalization analyses. For example, in [marra2018learning] it is shown that an activation function similar to KAFs can appear from an optimization problem formulated in a functional space. As a large literature has been devoted to complexity measures for these scenarios, it would be interesting to explore the interplay and possible extensions of these tools to the case considered here.