In the more extreme case, multiple authors have advocated for non-parametric formulations, in which the overall flexibility and number of parameters can be chosen freely by the user on the basis of one or more hyper-parameters. As a result, the trained functions can potentially approximate a much larger family of shapes. Proposals differ, however, in the way each function is modeled, resulting in vastly different characteristics in terms of approximation, optimization, and simplicity of implementation. Examples of non-parametric activation functions are the maxout neuron [goodfellow2013maxout, zhang2014improving], defined as the maximum over a fixed number of affine functions of its input, the Fourier activation function [eisenach2017nonparametrically], defined as a linear combination of a predetermined trigonometric basis expansion, or the Hermitian-based expansion [siniscalchi2017adaptation]. We refer to [scardapane2018kafnets] for a fuller overview of the topic.
In this paper we focus on the recently proposed kernel activation function (KAF) [scardapane2018kafnets], in which each (scalar) function is modeled as a one-dimensional kernel expansion, with the linear mixing coefficients adapted together with all the other parameters of the network during optimization. In [scardapane2018kafnets] it was shown that KAFs can greatly simplify the design of neural networks, allowing them to reach higher accuracies, sometimes with a smaller number of hidden layers. Linking neural networks with kernel methods also makes it possible to leverage a large body of literature on the learning of kernel functions (e.g., kernel filters [liu2011kernel]), particularly with respect to their approximation capabilities. At the same time, compared to ReLUs, KAFs introduce a number of additional design choices, most notably the selection of which kernel function to use, and of any associated hyper-parameters (e.g., the bandwidth of the Gaussian kernel). Although in [scardapane2018kafnets] and successive works we mostly focused on the Gaussian kernel, it is not guaranteed to be the optimal one in all applications.
1.0.1 Contribution of the paper
To solve the kernel selection problem of KAFs, in this paper we propose an extension inspired by the theory of multiple kernel learning [gonen2011multiple, aiolli2015easymkl]. In the proposed multi-KAF, different kernels are linearly combined for every neuron through an additional set of mixing coefficients, adapted during training. In this way, the optimal kernel for each neuron (or a specific mixture of them) can be learned in a principled way during the optimization process. In addition, since in our KAF implementation the points where the kernels are evaluated are fixed, a large amount of computation can be shared between the different kernels, leading to a very small computational overhead overall, as we show in the following sections.
1.0.2 Case study: In Codice Ratio
To show the usefulness of the proposed activation functions, we provide a realistic use case by applying them to the data from the ‘In Codice Ratio’ (ICR) project [firmani2018towards], whose aim is the automatic transcription of a large part of the Vatican Secret Archive (http://www.archiviosegretovaticano.va/). A key component of the project is an OCR tool applied to characters from a Latin handwritten text (see Section 4 for additional details). In [firmani2017codice] we presented a convolutional neural network (CNN) for this task, which we applied to a dataset of different Latin characters extracted from a sample selection of pages from the Vatican Register. In this paper we show that, using multi-KAFs, we can increase the accuracy of the CNN even while reducing the number of filters per layer.
1.0.3 Organization of the paper
2.1 Feedforward neural networks
Consider a generic feedforward NN layer, taking as input a vector and producing as output a vector:
where are adaptable weight matrices, and is an element-wise activation function. Multiple layers can be stacked to obtain a complete NN. In the following we focus especially on the choice of , but we note that everything extends immediately to more complex types of layer, including convolutional layers (wherein the matrix product is replaced by a convolutional operator), or recurrent layers [goodfellow2016deep].
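As an illustration of the layer just described, the following is a minimal sketch in plain Python. It assumes the usual affine form h = g(Wx + b); the names `dense_layer`, `W`, `b`, and `g` are chosen here for illustration only and are not taken from the original implementation.

```python
from typing import Callable, List

Vector = List[float]


def dense_layer(x: Vector, W: List[Vector], b: Vector,
                g: Callable[[float], float]) -> Vector:
    """Compute h = g(W x + b) for a single input vector x (assumed form)."""
    # Affine map: one pre-activation value per row of W.
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    # The activation function is applied element-wise.
    return [g(s) for s in pre]
```

Stacking several calls to `dense_layer`, each with its own weights, gives a complete feedforward network; swapping `g` is all that is needed to change the activation function.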
Generally speaking, the activation functions for the hidden (not last) layers are chosen as simple operations, such as the rectified linear unit (ReLU), originally introduced in [glorot2011deep]:
where we use the letter to denote a generic scalar input to the function, i.e., a single activation value. An NN is a generic composition of such layers, denoted as , that is trained with a dataset of training samples . In the experimental section, in particular, we deal with multi-class classification with classes, where the desired output represents a one-hot encoding of the target class. We train the network by minimizing a regularized cross-entropy cost:
where is a weight vector collecting all the adaptable weights of the network, is a positive scalar, and we use a subscript to denote the -th element of a vector.
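The ReLU in (2) and the cost in (3) can be sketched as follows. This is only an illustration under stated assumptions: the cross-entropy is averaged over the samples, the regularizer is the squared Euclidean norm of the weights scaled by a positive scalar `C`, and the network outputs are class probabilities (here produced by a softmax). The function names are hypothetical.

```python
import math
from typing import List


def relu(s: float) -> float:
    """Rectified linear unit: max(0, s)."""
    return max(0.0, s)


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax, turning scores into class probabilities."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def regularized_cross_entropy(probs: List[List[float]],
                              targets: List[List[float]],
                              weights: List[float], C: float) -> float:
    """Mean cross-entropy plus C * ||w||^2 (assumed form of the cost)."""
    n = len(probs)
    # Only the non-zero entries of each one-hot target contribute.
    ce = -sum(y_c * math.log(p_c)
              for p, y in zip(probs, targets)
              for p_c, y_c in zip(p, y) if y_c > 0) / n
    return ce + C * sum(w * w for w in weights)
```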
2.2 Kernel activation functions
Unlike (2), a KAF can be adapted from the data. In particular, each activation function is modeled in terms of expansions of the activation with a kernel function :
where the scalars form the so-called dictionary, while the scalars are called the mixing coefficients. To make the training problem simpler, the dictionary is fixed beforehand (and not adapted) by sampling values from the real line, uniformly around zero, and it is shared across the network, while a different set of mixing coefficients is adapted for every neuron. This makes implementation extremely efficient. The integer is the key hyper-parameter of the model: a higher value of increases the overall flexibility of each function, at the cost of additional mixing coefficients to be adapted.
Any kernel function from the literature can be used in (4), provided it respects the positive semi-definiteness property:
for any choice of the mixing coefficients and the dictionary. In practice, [scardapane2018kafnets] and all subsequent papers only used the one-dimensional Gaussian kernel defined as:
where is a parameter of the kernel. The value of influences the ‘locality’ of each with respect to the dictionary. In [scardapane2018kafnets] we proposed the following rule-of-thumb, found empirically:
where is the distance between any two adjacent dictionary elements. However, note that neither the Gaussian kernel in (6) nor the rule-of-thumb in (7) is optimal in general. For example, simple smooth shapes (like slowly varying polynomials) could be more easily modeled via different types of kernels or much larger values of with a possibly smaller . These are common problems also in the kernel literature. Drawing on that literature, in the next section we propose an extension of KAF to mitigate both problems.
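To make the construction of this section concrete, the sketch below evaluates a single KAF with a Gaussian kernel. It assumes the common one-dimensional form exp(-gamma * (s - d)^2) for the kernel and an equispaced dictionary centered at zero; `gamma` is left as a free argument rather than set by a specific rule. The names `make_dictionary`, `gaussian_kernel`, and `kaf`, and the `half_width` argument, are hypothetical.

```python
import math
from typing import List


def make_dictionary(D: int, half_width: float) -> List[float]:
    """D equispaced points in [-half_width, half_width] (requires D >= 2)."""
    step = 2.0 * half_width / (D - 1)
    return [-half_width + i * step for i in range(D)]


def gaussian_kernel(s: float, d: float, gamma: float) -> float:
    """One-dimensional Gaussian kernel (assumed form)."""
    return math.exp(-gamma * (s - d) ** 2)


def kaf(s: float, alpha: List[float], dictionary: List[float],
        gamma: float) -> float:
    """Kernel expansion: sum_i alpha_i * kernel(s, d_i)."""
    return sum(a * gaussian_kernel(s, d, gamma)
               for a, d in zip(alpha, dictionary))
```

In this sketch the dictionary is built once and reused by every neuron, while each neuron keeps its own list `alpha`. A larger `gamma` makes each term more local around its dictionary point.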
3 Proposed multiple kernel activation functions
In order to mitigate the problems mentioned in the previous section, assume that we have available a set of candidate kernel functions . These can be entirely different functions or the same kernel with different choices of its parameters. There is a vast body of research on how to successfully combine different kernels to obtain a new one, going under the name of multiple kernel learning (MKL) [aiolli2015easymkl]. For the purpose of this paper we adopt a simple approach, in order to evaluate its feasibility. In particular, we build each KAF with a new kernel given by a linearly weighted sum of the constituent (base) kernels:
where are an additional set of mixing coefficients. From the properties of reproducing kernel Hilbert spaces, it is straightforward to show that is a valid kernel if its constituents are also valid kernel functions. We call the resulting activation functions multi-KAFs. Note that such an approach only introduces additional parameters for each neuron, where is generally small. In addition, since in our implementation the dictionary is fixed, all kernels are evaluated on the same points, which greatly simplifies the implementation and allows a large part of the computation to be shared.
More in detail, for our experiments we consider an implementation with , where is the Gaussian kernel in (6), is chosen as the (isotropic) rational quadratic [genton2001classes]:
with being a parameter, and is chosen as the polynomial kernel of order 2:
The rational quadratic is similar to the Gaussian, but it is sometimes preferred in practical applications [genton2001classes], while the polynomial kernel makes it possible to introduce a smoothly varying global trend to the function. We show a simple example of the mix of and in Fig. 1.
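The three-kernel mixture described above can be sketched as follows. The closed forms used here are assumptions: the Gaussian is taken as exp(-gamma * (s - d)^2), the rational quadratic follows one common form from [genton2001classes], 1 - (s - d)^2 / ((s - d)^2 + c), and the order-2 polynomial is taken as (1 + s * d)^2. The function names and default parameter values are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Kernel = Callable[[float, float], float]


def gaussian(s: float, d: float, gamma: float = 1.0) -> float:
    return math.exp(-gamma * (s - d) ** 2)


def rational_quadratic(s: float, d: float, c: float = 1.0) -> float:
    sq = (s - d) ** 2
    return 1.0 - sq / (sq + c)


def polynomial2(s: float, d: float) -> float:
    return (1.0 + s * d) ** 2


def multi_kaf(s: float, alpha: List[float], mu: List[float],
              dictionary: List[float], kernels: Sequence[Kernel]) -> float:
    """sum_i alpha_i * sum_m mu_m * kernel_m(s, d_i)."""
    out = 0.0
    for a, d in zip(alpha, dictionary):
        # All base kernels are evaluated on the same pair (s, d).
        mixed = sum(m * k(s, d) for m, k in zip(mu, kernels))
        out += a * mixed
    return out
```

Setting `mu` to select a single kernel recovers a standard KAF with that kernel, so the mixture strictly generalizes the single-kernel case.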
In order to simplify optimization, especially for deeper architectures, we apply the kernel ridge regression initialization procedure described in [scardapane2018kafnets] to initialize all multi-KAFs to a known activation function, i.e., the exponential linear unit (ELU) [clevert2016fast]. For this purpose, denote by the vector of ELU values computed on our dictionary points. We initialize all to , and initialize the vector of mixing coefficients as:
where is the kernel matrix computed between and using , and . As a final remark, note that in (8) we considered an unrestricted linear combination of the constituting kernels. We can easily obtain more restricted formulations (which are sometimes found in the MKL literature [gonen2011multiple]) by applying some nonlinear transformation to the mixing coefficients , e.g., a softmax function to obtain convex combinations. We leave such comparisons to future work.
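The initialization step can be illustrated with a small sketch. It assumes the standard kernel ridge regression solution, alpha = (K + eps * I)^{-1} t, where t collects the ELU values on the dictionary and `eps` is a small positive constant; the value of `eps`, the helper names, and the hand-written linear solver are choices made for this example only.

```python
import math
from typing import Callable, List


def elu(s: float) -> float:
    """Exponential linear unit with unit scale."""
    return s if s > 0 else math.exp(s) - 1.0


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def krr_init(dictionary: List[float],
             kernel: Callable[[float, float], float],
             target: Callable[[float], float] = elu,
             eps: float = 1e-4) -> List[float]:
    """Mixing coefficients that make the expansion match `target` on the dictionary."""
    t = [target(d) for d in dictionary]
    # Kernel matrix on the dictionary, with eps added to the diagonal.
    K = [[kernel(di, dj) + (eps if i == j else 0.0)
          for j, dj in enumerate(dictionary)]
         for i, di in enumerate(dictionary)]
    return solve(K, t)
```

The `kernel` argument can be any valid kernel, including a fixed mixture of base kernels, so the same routine applies to both KAFs and multi-KAFs in this sketch.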
4 Case Study: In Codice Ratio
As stated in the introduction, we apply the proposed multi-KAF to a realistic case study taken from the ICR project. Apart from the details described below, we refer the reader to [firmani2017codice, firmani2018towards] for a fuller description of the project.
The overall goal of ICR is the transcription of a large portion of the Vatican Secret Archives, one of the largest existing historical libraries. In the first phase of the project, we collected and manually annotated a set of images representing different types of characters (one of which is a special ‘non-character’ class). All characters were extracted from a sample of (handwritten) pages of private correspondence of Pope Honorius III from the XIII century. Each character was then annotated using a crowdsourcing platform and volunteer students (the dataset is available on the web at http://www.dia.uniroma3.it/db/icr/). A few examples taken from the dataset are shown in Fig. 2.
In [firmani2017codice] we described the design and evaluation of a CNN for tackling the problem of automatically assigning a class to each character, represented as a black-and-white image. The final CNN had the following architecture: (i) a convolutional block with filters of size; (ii) max-pooling with size; (iii) two additional series of convolutional blocks and max-pooling, this time with filters per layer; (iv) a fully connected layer with neurons, followed by (v) an output layer of neurons. In the original implementation, all linear operations were preceded by dropout [srivastava2014dropout] (with probability), and followed by ReLU nonlinearities. We will use this dataset and architecture as the baseline for our experiments. Note that this design was heavily fine-tuned, and it was not possible to increase the testing accuracy by simply adding more neurons / layers to the resulting CNN model.
5 Experimental results
5.1 Experimental setup
For our comparisons, we use the same CNN architecture described in Section 4, but we replace the ReLU functions by either KAFs or the proposed multi-KAFs, with the hyper-parameters described in the previous sections. In order to make the networks comparable in terms of parameters, for KAF and multi-KAF we decrease the number of filters and neurons in the linear layers by . To stabilize training, we also replace dropout with a batch normalization step [ioffe2015batch] before applying KAF-based nonlinearities.
We train the networks following a procedure similar to that of [firmani2017codice]. We use the Adam optimization algorithm on random mini-batches of elements, with a small regularization factor . After every iterations of the optimization algorithm we evaluate the accuracy on a randomly held-out set of samples, taken from the original training set. Training is stopped whenever the validation accuracy has not improved for at least iterations. The networks are then evaluated on a second independent held-out set of examples. All networks are implemented in PyTorch, and experiments are run on a CUDA backend using the Google Colaboratory platform.
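The stopping rule described above can be sketched independently of any deep learning framework. In the snippet below, `step_fn` stands for one optimizer iteration and `validate_fn` for the evaluation on the held-out validation set; both callables, together with the arguments `eval_every`, `patience`, and `max_iters`, are hypothetical placeholders and not the settings used in our experiments.

```python
from typing import Callable, Tuple


def train_with_early_stopping(step_fn: Callable[[], None],
                              validate_fn: Callable[[], float],
                              eval_every: int, patience: int,
                              max_iters: int) -> Tuple[float, int, int]:
    """Run training steps until validation accuracy stops improving.

    Returns (best accuracy, iteration of best accuracy, last iteration run).
    """
    best_acc = float("-inf")
    best_iter = 0
    it = 0
    for it in range(1, max_iters + 1):
        step_fn()  # one mini-batch update
        if it % eval_every == 0:
            acc = validate_fn()
            if acc > best_acc:
                best_acc, best_iter = acc, it
            elif it - best_iter >= patience:
                # No improvement for at least `patience` iterations.
                break
    return best_acc, best_iter, it
```

In practice one would also store a copy of the model parameters whenever `best_acc` improves, and restore it before the final evaluation on the test set.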
5.2 Experimental results
|Activation function||Testing accuracy [%]||Trainable parameters|
The results of the experiments, averaged over different repetitions, are shown in Table 1, together with the number of trainable parameters of the different architectures. It can be seen that, while KAF fails to provide a meaningful improvement in this case, the multi-KAF architecture obtains a significant gain in testing accuracy, stably throughout the repetitions. Most notably, this gain is obtained with a strong decrease in the number of trainable parameters, which is around for the KAF-based architectures, compared to parameters for the baseline one.
Figure 3: Loss and validation accuracy evolution for the baseline network (using ReLUs) and the proposed multi-KAF. Standard deviation for all curves is shown with a lighter color.
This gain in accuracy is obtained not only with a smaller number of overall parameters, but also with a much faster rate of convergence. To see this, we plot in Fig. 3 the evolution of the loss function in (3) and the evolution of the accuracy on the validation portion of the dataset.
In this paper we investigated a new non-parametric activation function for neural networks, which extends the recently proposed kernel activation function, by incorporating ideas from the field of multiple kernel learning to simplify the choice of the kernel function and further increase the expressiveness.
We evaluated the resulting multi-KAF on a benchmark dataset for handwritten Latin character recognition, in the context of an ongoing real-world project. While in a sense these are only preliminary results, they point to the greater flexibility of such activation functions, coming with a faster rate of convergence and an overall smaller number of trainable parameters for the full architecture. We are currently in the process of collecting a larger dataset, considering a larger number of possible classes for the characters, in order to further evaluate the proposed architecture.
Additional research directions will consider the application of multi-KAFs to different types of benchmarks, going beyond standard CNNs, particularly with respect to recurrent models, complex-valued kernels [scardapane2018complex], and generative adversarial networks. Furthermore, we plan to test additional types of kernels more extensively, such as the periodic ones [genton2001classes], in order to evaluate the scalability of multi-KAFs in the presence of a larger number of constituent kernels. Generalization bounds for the architecture are also a promising research direction.