## Introduction

Convolutional neural networks (CNNs) have been tremendously successful in computer vision, e.g., image recognition [Krizhevsky:2012wl, He:2016ib] and object detection [Girshick:2015id, Ren:2017kt]. The core operator, convolution, was partially inspired by the animal visual cortex, where different neurons respond to stimuli in a restricted and partially overlapping region known as the receptive field [hubel1962receptive, hubel1968receptive]. Convolution leverages its equivariance to translation to improve the performance of a machine learning system [Goodfellow2016deep]. Its efficiency lies in the fact that the learnable parameters are sparse and shared across the entire input (receptive field). Nonetheless, convolution still has certain limitations, which are analyzed below. To address them, this paper introduces kervolution, which generalizes convolution via the kernel trick. Artificial neural networks containing kervolutional layers are named kervolutional neural networks (KNN).

There is circumstantial evidence suggesting that most cells inside the striate cortex (the part of the visual cortex that is involved in processing visual information) can be categorized as simple, complex, and hypercomplex, with specific response properties [hubel1968receptive]. However, convolutional layers are linear and designed to mimic the behavior of simple cells in the human visual cortex [Zoumpourlis:2017vh]; hence they are not able to express the non-linear behaviors of the complex and hypercomplex cells inside the striate cortex. It has also been demonstrated that higher-order non-linear feature maps make subsequent linear classifiers more discriminative [lin2015bilinear, Blondel:2016uz, Cui:2017em]. However, the non-linearity that comes from the activation layers, e.g., the rectified linear unit (ReLU), is only point-wise. We argue that CNNs may perform better if convolution is generalized to patch-wise non-linear operations via the kernel trick. Because of the increased expressibility and model capacity, better model generalization may be obtained.

Non-linear generalization is simple in mathematics; however, it is generally difficult to retain the advantages of convolution, i.e., (i) sharing weights (weight sparsity) and (ii) low computational complexity. There exist several works towards non-linear generalization. The non-linear convolutional networks [Zoumpourlis:2017vh] implement a quadratic convolution at the expense of $n(n+1)/2$ additional parameters, where $n$ is the size of the receptive field. However, the quadratic form of convolution loses the property of "weight sparsity", since the number of additional parameters of the non-linear terms increases exponentially with the polynomial order, which dramatically increases the training complexity. Another strategy to introduce high-order features is to explore the pooling layers. The kernel pooling method in [Cui:2017em] directly concatenates the non-linear terms, but it requires explicit calculation of these terms, resulting in a higher complexity.

To address the above problems, kervolution is introduced in this paper to extend convolution to kernel space while keeping the aforementioned advantages of linear convolution. Since convolution has been applied to many fields, e.g., image and signal processing, we expect kervolution to also play an important role in those applications. However, in this paper, we focus on its applications in artificial neural networks. The contributions of this paper include: (i) via the kernel trick, the convolutional operator is generalized to kervolution, which retains the advantages of convolution and brings new features, including increased model capacity, translational equivariance, stronger expressibility, and better generalization; (ii) we provide explanations for kervolutional layers from the feature view and show that kervolution is a powerful tool for network construction; (iii) it is demonstrated that KNN achieves better accuracy than the baseline CNN.

## Related Work

As the name indicates, CNN [lecun1989generalization] employs convolution as the main operation, which is modeled to mimic the behavior of simple cells found in the primary visual cortex, known as V1 [hubel1962receptive, hubel1968receptive]. It has been tremendously successful in numerous applications [lecun1989backpropagation, Krizhevsky:2012wl, He:2016ib, Goodfellow2016deep]. Many strategies have been applied to improve the capability of model generalization.

AlexNet [Krizhevsky:2012wl] shows that the ensemble method "dropout" is very effective for reducing over-fitting of convolutional networks. The non-saturated rectified linear unit (ReLU) dramatically improves the convergence speed [Krizhevsky:2012wl] and has become a standard component of CNN. Network in network (NIN) [Lin:2013wb] establishes micro networks for local patches within the receptive field, each of which consists of multiple fully connected layers followed by non-linear activation functions. This improves the model capacity at the expense of additional calculation and complex structures. GoogLeNet [Szegedy:2015ja] increases both the depth and width of CNN by introducing the Inception model, which further improves the performance. VGG [Simonyan:2015ws] shows that a deep CNN with small convolution filters ($3 \times 3$) is able to bring about significant improvement.

ResNet [He:2016ib] addresses the training problem of deeper CNNs and proposes to learn residual functions with reference to the layer input. This strategy makes CNNs easier to optimize and improves accuracy with increased depth. DenseNet [Huang:2017kg] proposes to connect each layer to every other layer in a feed-forward fashion, which further mitigates the vanishing-gradient problem. ResNeXt [Xie:2017hu] is constructed by repeating a building block that aggregates a set of transformations with the same topology, resulting in a homogeneous, multi-branch architecture. It demonstrates the essence of a new dimension, which is the size of the set of transformations. In order to increase the representation power, SENet [Hu:2017tf] focuses on channels and adaptively recalibrates channel-wise feature responses by explicitly modeling the interdependencies between channels.

In recent years, researchers have been paying much attention to the extension of convolution. To enable the expressibility of convolution for complex cells, the non-linear convolutional network [Zoumpourlis:2017vh] extends convolution to non-linear space by directly introducing high-order terms. However, as indicated before, this introduces a large number of additional parameters and increases the training complexity exponentially. To be invariant to spatial transformations, the spatial transformer network [Jaderberg:2015vo] inserts learnable modules into CNN for manipulating transformed data. For the same purpose, the deformable convolutional network [Dai:2017vy] adds 2-D learnable offsets to the regular grid sampling locations of standard convolution, which enables the learning of affine transforms, while [Henriques:2017te] applies a simple two-parameter image warp before a convolution. CapsNet [Sabour:2017ts] proposes to replace convolution by representing the instantiation parameters of a specific type of entity as activity vectors via a capsule structure. This opens a new research space for artificial neural networks, although its performance on large datasets is still relatively weak. The decoupled network [Liu:2018vk] interprets convolution as the product of the norm and the cosine angle of the weight and input vectors, resulting in explicit geometrical modeling of the intra-class and inter-class variation. To process graph inputs, Spline-CNN [fey2018splinecnn] extends convolution by using a continuous B-spline basis, which is parametrized by a constant number of trainable control values. To decrease the storage, Modulated CNN [wang2018modulated] extends the convolution operator to binary filters, resulting in easier deployment on low-power devices.

The kernel technique in this paper was applied to create non-linear classifiers in the context of optimal margin [boser1992training], which was later recognized as support vector machines (SVM) [cortes1995support]. It has recently also been widely applied to correlation filters for improving the processing speed. For example, the kernelized correlation filter (KCF) [Henriques:2015jy] speeds up the calculation of kernel ridge regression by bypassing a large matrix inversion, but it assumes that all the data are circular shifts of each other [wang2019kernel]; hence it can only predict signal translation. To break this theoretical limitation, the kernel cross-correlator (KCC) is proposed in [Wang:2018vt] by defining the correlator in the frequency domain directly, resulting in a closed-form solution with a computational complexity of $O(n \log n)$, where $n$ is the signal length. Moreover, it does not impose any constraints on the training data; thus KCC is useful for other applications [Wang:2017wb, 2017arXiv171005502W, Wang:2017wl] and is applicable to affine transform prediction, e.g., translation, rotation, and scale. This theorem is further extended to speed up the prediction of joint rotation-scale transforms in [wang2018correlation]. The above works show that the kernel technique is a powerful tool for obtaining both accuracy and efficiency.

The kernel technique has recently also been applied to artificial neural networks to improve the model performance. The convolutional kernel network (CKN) [Mairal:2014wb] proposes to learn transform invariance by kernel approximation, where the kernel is used as a tool for learning CNN. Nevertheless, the aim of CKN is not to extract non-linear features, and it only differs from CNN in the cost function. SimNets [Cohen:2016gp] proposes to insert kernel similarity layers under convolutional layers. However, both the similarity templates and the filters need to be trained and require a pre-training process for initialization, which dramatically increases the complexity. To capture higher-order interactions of features, kernel pooling [Cui:2017em] is proposed in a parameter-free manner. This is motivated by the aforementioned idea that higher-dimensional feature maps produced by kernel functions make subsequent linear classifiers more discriminative [Blondel:2016uz]. However, the kernel extension in the pooling stage cannot extract non-linear features in a patch-wise way. Moreover, the additional higher-order features still need to be calculated explicitly, which also dramatically increases the complexity. To solve these problems, kervolution is defined to generalize convolution via the kernel trick.

## Kervolution

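We start from a convolution with output $\mathbf{f}(\mathbf{x})$,

$$\mathbf{f}(\mathbf{x}) = \mathbf{x} \oplus \mathbf{w},$$

where $\oplus$ is the convolutional operator, $\mathbf{x} \in \mathbb{R}^n$ is a vectorized input, and $\mathbf{w} \in \mathbb{R}^n$ is the filter. Specifically, the $i$-th element of the convolution output $\mathbf{f}(\mathbf{x})$ is calculated as:

$$f_i(\mathbf{x}) = \langle \mathbf{x}_{(i)}, \mathbf{w} \rangle, \tag{eq:linear-kernel}$$

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors and $\mathbf{x}_{(i)}$ is the circular shift of $\mathbf{x}$ by $i$ elements. We define the index $i$ to start from $0$. The kervolution output $\mathbf{g}(\mathbf{x})$ is defined as

$$\mathbf{g}(\mathbf{x}) = \mathbf{x} \otimes \mathbf{w}, \tag{eq:kervolution}$$

where $\otimes$ is the kervolutional operator. Specifically, the $i$-th element of $\mathbf{g}(\mathbf{x})$ is defined as:

$$g_i(\mathbf{x}) = \langle \varphi(\mathbf{x}_{(i)}), \varphi(\mathbf{w}) \rangle, \tag{eq:kernel-function-high}$$

where $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^d$ ($d \gg n$) is a non-linear mapping function. The definition eq:kernel-function-high enables us to extract features in a high-dimensional space, while its computational complexity is also much higher than that of eq:linear-kernel. Fortunately, we are able to bypass the explicit calculation of the high-dimensional features via the kernel trick [cortes1995support], since

$$\langle \varphi(\mathbf{x}_{(i)}), \varphi(\mathbf{w}) \rangle = \sum_{k} c_k \left(\mathbf{x}_{(i)}^T \mathbf{w}\right)^k = \kappa(\mathbf{x}_{(i)}, \mathbf{w}), \tag{eq:kernel-function}$$

where $\kappa(\cdot, \cdot): \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a kernel function, whose complexity is normally $O(n)$, the same as the inner product of convolution. The coefficient $c_k$ can be determined by the mapping function $\varphi(\cdot)$ or by a predefined kernel $\kappa(\cdot, \cdot)$, e.g., the Gaussian RBF kernel, in which case the feature dimension $d$ is infinite. Intuitively, the inner product eq:linear-kernel is a linear kernel; thus convolution is a linear case of kervolution.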
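The definitions above can be illustrated with a minimal pure-Python sketch of 1-D kervolution over circular shifts. It is only an illustration, not the authors' implementation: the helper names (`circular_shift`, `kervolve`), the shift direction, the example vectors, and the default hyperparameters (`c_p=1.0`, `d_p=3`, `gamma=1.0`) are assumptions chosen for readability.

```python
import math

def circular_shift(x, i):
    """Shift x circularly so that element k of the result is x[(k + i) % n].
    The shift direction is an assumption made for this sketch."""
    n = len(x)
    return [x[(k + i) % n] for k in range(n)]

def linear_kernel(a, b):
    """Inner product <a, b>; with this kernel kervolution is plain convolution."""
    return sum(p * q for p, q in zip(a, b))

def polynomial_kernel(a, b, c_p=1.0, d_p=3):
    """Polynomial kernel (a^T b + c_p)^d_p with illustrative default values."""
    return (linear_kernel(a, b) + c_p) ** d_p

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def kervolve(x, w, kernel):
    """g_i(x) = kernel(x_(i), w) for every circular shift i = 0, ..., n-1."""
    return [kernel(circular_shift(x, i), w) for i in range(len(x))]

x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, -0.5, 1.0, 0.0]

conv_out = kervolve(x, w, linear_kernel)       # linear case: convolution
poly_out = kervolve(x, w, polynomial_kernel)   # same filter, non-linear response
gauss_out = kervolve(x, w, gaussian_kernel)

print(conv_out)   # [-0.5, 0.0, 1.5, 1.0]
print(poly_out)   # [0.125, 1.0, 15.625, 8.0]
```

Each kernel evaluation touches every element of the patch once, so the cost per output element stays linear in $n$ regardless of which kernel is plugged in, and the filter `w` has the same number of weights in all three cases.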

The kervolution eq:kervolution retains the advantages of convolution and brings new features: (i) sharing weights (Section Sharing Weights); (ii) equivariance to translation (Section Translational Equivariance); (iii) increased model capacity and new feature similarity (Section Model Capacity and Features).

### Sharing Weights

Sharing weights normally means fewer trainable parameters and lower computational complexity. The number of elements in the filter $\mathbf{w}$ is not increased by the definition eq:kernel-function; thus kervolution keeps the sparse connectivity of convolution. As a comparison, we take the Volterra series-based non-linear convolution adopted in [Zoumpourlis:2017vh] as an example: the additional parameters of the non-linear terms dramatically increase the training complexity, since the number of learnable parameters increases exponentially with the order of non-linearity. Even the quadratic expression eq:non-linear-convolution from [Zoumpourlis:2017vh] is of complexity $O(n^2)$:

$$g_i^v(\mathbf{x}) = \langle \mathbf{x}_{(i)}, \mathbf{w}_1 \rangle + \mathbf{x}_{(i)}^T \mathbf{w}_2 \mathbf{x}_{(i)}, \tag{eq:non-linear-convolution}$$

where $\mathbf{w}_1 \in \mathbb{R}^n$ and $\mathbf{w}_2 \in \mathbb{R}^{n \times n}$ are the linear and quadratic filters, respectively. The quadratic term in eq:non-linear-convolution introduces $n(n+1)/2$ additional parameters ($\mathbf{w}_2$ is an upper triangular matrix). In contrast, a typical non-linear kernel, e.g., the Gaussian RBF kernel, is normally of complexity $O(n)$, which is the same as that of the linear kernel eq:linear-kernel; thus kervolution preserves the linear computational complexity $O(n)$.
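To make the parameter comparison concrete, the following sketch counts the weights of a single filter for a few patch sizes. It assumes a single-channel square patch flattened to $n$ elements and an upper triangular quadratic filter with $n(n+1)/2$ entries; the function names are hypothetical.

```python
def kervolution_params(n):
    """A kervolution filter keeps the same n weights as a convolution filter."""
    return n

def quadratic_volterra_params(n):
    """n linear weights plus the n(n+1)/2 entries of an upper triangular
    quadratic filter."""
    return n + n * (n + 1) // 2

for side in (3, 5, 7):
    n = side * side  # flattened single-channel patch, e.g. 3x3 -> 9
    print(side, kervolution_params(n), quadratic_volterra_params(n))
# 3 9 54
# 5 25 350
# 7 49 1274
```

Already at the quadratic order the per-filter parameter count grows quadratically with the patch size, whereas the kervolution filter stays at $n$ weights for any kernel.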

Another strategy to introduce higher-order features is to explore the pooling layers. For example, the kernel pooling method proposed in [Cui:2017em] directly concatenates the non-linear terms as in eq:kernel-function to the pooling layers. However, this requires explicit calculation of the non-linear terms up to a given order, although it can be approximated by repeatedly applying the discrete Fourier transform (DFT), which results in a higher computational complexity. In contrast, based on the kernel trick, kervolution can introduce non-linear terms of any order while retaining linear complexity.

### Translational Equivariance

A crucial aspect of current deep learning architectures is the encoding of invariances. One of the reasons is that convolutional layers are naturally equivariant to image translation [Goodfellow2016deep]. In this section, we show that kervolution preserves this important property. An operator is equivariant to a transform when the effect of the transform is detectable in the operator output [Cohen:2016to]. Therefore, we have $\mathbf{x}_{(j)} \oplus \mathbf{w} = (\mathbf{x} \oplus \mathbf{w})_{(j)}$, which means that an input translation results in the same translation of the output [Goodfellow2016deep]. Similarly, kervolution eq:kervolution is equivariant to translation. Assume $\mathbf{x}' = \mathbf{x}_{(j)}$; according to eq:kernel-function, we have

$$g_i(\mathbf{x}') = \kappa(\mathbf{x}'_{(i)}, \mathbf{w}) = \kappa(\mathbf{x}_{(i+j)}, \mathbf{w}) = g_{i+j}(\mathbf{x}).$$

Therefore, the $i$-th element of $\mathbf{g}(\mathbf{x}_{(j)})$ is the $(i+j)$-th element of $\mathbf{g}(\mathbf{x})$ (indices taken modulo $n$); hence we have

$$\mathbf{x}_{(j)} \otimes \mathbf{w} = (\mathbf{x} \otimes \mathbf{w})_{(j)},$$

which completes the proof. Note that the translational invariance of CNN is achieved by concatenating pooling layers to convolutional layers [Goodfellow2016deep], and the translational invariance of KNN can be achieved similarly. This property is crucial, since when invariances are present in the data, encoding them explicitly in an architecture provides an important source of regularization, which reduces the amount of training data required [Henriques:2017te]. As mentioned in Section Related Work, the same property is also presented in [Henriques:2015jy], where it is achieved by assuming that all the training samples are circular shifts of each other [wang2019kernel], while ours is inherited from convolution. Interestingly, the kernel cross-correlator (KCC) defined in [Wang:2018vt] is equivariant to any affine transform (e.g., translation, rotation, and scale), which may be useful for further development of this work.
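The equivariance property can also be checked numerically. The sketch below is an illustration under stated assumptions (a Gaussian kernel with an arbitrary `gamma=0.5`, hypothetical helper names, and the convention that element $k$ of a shifted vector is element $k+j$ of the original, modulo $n$): it compares the kervolution of a shifted input with the shifted kervolution of the original input.

```python
import math

def shift(v, j):
    """Circular shift: element k of the result is v[(k + j) % n]."""
    n = len(v)
    return [v[(k + j) % n] for k in range(n)]

def gaussian_kernel(a, b, gamma=0.5):
    """Gaussian RBF kernel; gamma is an arbitrary illustrative value."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def kervolve(x, w, kernel):
    """g_i(x) = kernel(x_(i), w) for every circular shift i."""
    return [kernel(shift(x, i), w) for i in range(len(x))]

x = [0.2, -1.0, 0.7, 0.0, 1.5]
w = [1.0, 0.0, -0.5, 0.3, 0.1]
j = 2

lhs = kervolve(shift(x, j), w, gaussian_kernel)   # kervolution of shifted input
rhs = shift(kervolve(x, w, gaussian_kernel), j)   # shifted kervolution output

equivariant = all(math.isclose(a, b) for a, b in zip(lhs, rhs))
print(equivariant)  # True
```

The check does not depend on the particular kernel: replacing `gaussian_kernel` by any other function of the shifted patch and the filter gives the same result, because the shift only changes which patch is presented to the kernel.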

Method | Convolution | $L^1$-norm | $L^2$-norm |
---|---|---|---|
None | 99.17 | 99.12 | 99.11 |
FGSM | 71.92 | 74.08 | 76.36 |

### Model Capacity and Features

The kernel function eq:kernel-function takes kervolution to non-linear space; thus the model capacity is increased without introducing extra parameters. Recall that CNN is a powerful approach to extract discriminative local descriptors. In particular, the linear kernel eq:linear-kernel of convolution measures the similarity of the input $\mathbf{x}$ and the filter $\mathbf{w}$, i.e., the cosine of the angle $\theta$ between the two patches, since $\langle \mathbf{x}, \mathbf{w} \rangle = \cos(\theta) \cdot \|\mathbf{x}\| \|\mathbf{w}\|$. From this point of view, kervolution measures the similarity by match kernels, which are equivalent to extracting specific features [Bo:2009tp]. We next discuss how to interpret the kernel functions and present a few instances of the kervolutional operator. One of the advantages of kervolution is that the non-linear properties can be customized without explicit calculation.

#### $L^p$-Norm Kervolution

The $L^1$-norm in eq:1norm-kernel and the $L^2$-norm in eq:2norm-kernel simply measure the Manhattan and Euclidean distances between the input $\mathbf{x}$ and the filter $\mathbf{w}$, respectively.

$$\kappa_m(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_1, \tag{eq:1norm-kernel}$$

$$\kappa_e(\mathbf{x}, \mathbf{w}) = \|\mathbf{x} - \mathbf{w}\|_2. \tag{eq:2norm-kernel}$$

Both distances aggregate the element-wise distances between two points. If two vectors are close on most elements but more discrepant on one of them, the Euclidean distance shrinks that discrepancy (elements are mostly smaller than 1 because of the normalization layers) and is more influenced by the closeness of the other elements. Therefore, the Euclidean kervolution may be more robust to slight pixel perturbations. This hypothesis is supported by a simple simulation of an adversarial attack using the fast gradient sign method (FGSM) [Goodfellow:2014tl], shown in the table above that compares convolution with $L^1$- and $L^2$-norm kervolution, where "None" means the test accuracy on clean data.
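One way to make this intuition concrete is to compare how much each distance changes under a small sign perturbation of the input. The sketch below is an illustration only, not the FGSM experiment reported by the authors: the vectors, the step `eps`, and the helper names are assumptions. Pushing every element away from the filter by `eps` changes the $L^1$ response by exactly $n \cdot \text{eps}$, while the triangle inequality bounds the change of the $L^2$ response by $\sqrt{n} \cdot \text{eps}$.

```python
import math

def l1_kernel(x, w):
    """Manhattan distance between patch and filter."""
    return sum(abs(a - b) for a, b in zip(x, w))

def l2_kernel(x, w):
    """Euclidean distance between patch and filter."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))

x = [0.2, 0.4, 0.1, 0.3]   # illustrative patch with elements below 1
w = [0.1, 0.1, 0.1, 0.1]   # illustrative filter
eps = 0.05                 # illustrative perturbation size
n = len(x)

# Sign perturbation that moves every element away from the filter.
x_adv = [a + eps * (1 if a >= b else -1) for a, b in zip(x, w)]

d1 = l1_kernel(x_adv, w) - l1_kernel(x, w)   # equals n * eps
d2 = l2_kernel(x_adv, w) - l2_kernel(x, w)   # at most sqrt(n) * eps

print(d1, n * eps)
print(d2, math.sqrt(n) * eps)
```

In this toy setting the $L^2$ response moves less than the $L^1$ response for the same perturbation, which is in line with the hypothesis stated above, although it does not by itself explain the measured accuracies.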

#### Polynomial Kervolution

Although the existing literature has shown that the polynomial kernel eq:polynomial-kernel works well for natural language processing (NLP) problems when using SVM [goldberg2008splitsvm], we find that a different polynomial order performs better in KNN for image recognition.

$$\kappa_p(\mathbf{x}, \mathbf{w}) = \left(\mathbf{x}^T \mathbf{w} + c_p\right)^{d_p} = \sum_{j=0}^{d_p} \binom{d_p}{j} c_p^{\,d_p - j} \left(\mathbf{x}^T \mathbf{w}\right)^j, \tag{eq:polynomial-kernel}$$

where the order $d_p$ ($d_p \in \mathbb{Z}^+$) extends the feature space to $d_p$ dimensions, and the balancer $c_p$ ($c_p \in \mathbb{R}^+$) balances the non-linear orders (intuitively, higher-order terms play more important roles when $c_p < 1$). As a comparison, the kernel pooling strategy [Cui:2017em] concatenates the non-linear terms directly, and they are finally combined linearly by the subsequent fully connected layer, which dramatically increases the number of learnable parameters in the linear layer.
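The role of the balancer can be seen by expanding the kernel term by term. The sketch below assumes the standard polynomial kernel $(\mathbf{x}^T\mathbf{w} + c_p)^{d_p}$ and its binomial expansion; the example vectors and the values `c_p = 0.5`, `c_p = 2.0`, and `d_p = 3` are illustrative assumptions, not settings taken from the experiments.

```python
from math import comb, isclose

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

def polynomial_kernel(x, w, c_p, d_p):
    """Closed form: a single inner product followed by a power."""
    return (dot(x, w) + c_p) ** d_p

def polynomial_terms(x, w, c_p, d_p):
    """Binomial expansion: the j-th entry is the contribution of order j."""
    s = dot(x, w)
    return [comb(d_p, j) * c_p ** (d_p - j) * s ** j for j in range(d_p + 1)]

x = [1.0, 2.0]
w = [0.5, 0.25]   # x^T w = 1.0, so each term equals its coefficient

small_c = polynomial_terms(x, w, c_p=0.5, d_p=3)
large_c = polynomial_terms(x, w, c_p=2.0, d_p=3)

print(small_c)  # [0.125, 0.75, 1.5, 1.0]
print(large_c)  # [8.0, 12.0, 6.0, 1.0]
```

With `c_p = 0.5` the constant and first-order terms are damped relative to the higher orders, while with `c_p = 2.0` the low orders dominate. In both cases the closed form needs only one inner product, so none of these terms is ever computed explicitly inside a kervolutional layer.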

To show the behavior of polynomial kervolution, the learned filters of LeNet-5 trained on MNIST are visualized in Figure Kervolutional Neural Networks, which contains all six channels of the first kervolutional layer using the polynomial kernel. The optimization process is described in Section Ablation Study. For comparison, the learned filters from CNN are also presented. It is interesting that some of the learned filters of KNN and CNN are quite similar, e.g., channel 4, which means that part of the capacity of KNN learns linear behavior as CNN does. This is consistent with our understanding of the polynomial kernel, which is a combination of linear and higher-order terms. This phenomenon also indicates that polynomial kervolution introduces higher-order feature interaction in a more flexible and direct way than the existing methods.

#### Gaussian Kervolution

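The Gaussian RBF kernel eq:gaussian-kernel extends kervolution to infinite dimensions.

$$\kappa_g(\mathbf{x}, \mathbf{w}) = \exp\left(-\gamma_g \|\mathbf{x} - \mathbf{w}\|^2\right), \tag{eq:gaussian-kernel}$$

where $\gamma_g$ ($\gamma_g \in \mathbb{R}^+$) is a hyperparameter that controls the smoothness of the decision boundary. It extends kervolution to infinite dimensions because of the $i$-degree terms in eq:gaussian-series.

$$\kappa_g(\mathbf{x}, \mathbf{w}) = C \sum_{i=0}^{\infty} \frac{\left(\mathbf{x}^T \mathbf{w}\right)^i}{i!}, \tag{eq:gaussian-series}$$

where $C = \exp\left(-\tfrac{1}{2}\left(\|\mathbf{x}\|^2 + \|\mathbf{w}\|^2\right)\right)$ if $\gamma_g = \tfrac{1}{2}$.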
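The series form can be checked numerically against the closed form. The sketch below assumes the Gaussian kernel $\exp(-\gamma_g \|\mathbf{x}-\mathbf{w}\|^2)$ with $\gamma_g = 1/2$, for which $\|\mathbf{x}-\mathbf{w}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{w}\|^2 - 2\mathbf{x}^T\mathbf{w}$ gives the constant $C = \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{w}\|^2)/2)$; the vectors and the truncation at 30 terms are illustrative assumptions.

```python
import math

def gaussian_kernel(x, w, gamma):
    """Closed form exp(-gamma * ||x - w||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, w)))

def gaussian_series(x, w, terms=30):
    """Truncated power series in x^T w, valid for gamma = 1/2."""
    s = sum(a * b for a, b in zip(x, w))
    c = math.exp(-0.5 * (sum(a * a for a in x) + sum(b * b for b in w)))
    return c * sum(s ** i / math.factorial(i) for i in range(terms))

x = [0.3, -0.2, 0.5]
w = [0.1, 0.4, -0.3]

closed_form = gaussian_kernel(x, w, gamma=0.5)
series = gaussian_series(x, w)

print(math.isclose(closed_form, series, rel_tol=1e-12))  # True
```

Every power of $\mathbf{x}^T\mathbf{w}$ appears in the sum, which is why the implicit feature space is infinite-dimensional, yet the closed form costs a single pass over the patch.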

The expression eq:gaussian-series is helpful for an intuitive understanding, while recent discoveries reveal more information. It is shown in [Bo:2010vi] that the Gaussian kernel and its variants are able to measure the similarity of gradient-based patch-wise features, e.g., SIFT [Anonymous:2004uq] and HOG [Dalal:2005gq]. This provides a unified way to generate a rich, diverse visual feature set [Gehler:2009hj]. However, instead of using hand-crafted features as in kernel SVM, with KNN we are able to inherit the substantial achievements based on the kernel trick while still taking advantage of the great generalization ability of neural networks.

### Kervolutional Layers and Learnable Kernel

Similar to a convolutional layer, the operation of a kervolutional layer is slightly different from the standard definition eq:kervolution, in that $\mathbf{x}$ becomes a 3-D patch in a sliding window on the input. To be compatible with existing works, we also implement all popular structures of convolution available in the CNN library [paszke2017automatic] for kervolution, including the input and output channels, input padding, bias, groups (to control the connections between input and output), and the size, stride, and dilation of the sliding window. Therefore, the convolutional layers of all existing networks can be directly or partially replaced by kervolutional layers, which makes KNN inherit all the existing achievements of CNN, e.g., network architectures [Krizhevsky:2012wl, He:2016ib] and their numerous applications [Ren:2017kt].

With kervolution, we are able to extract specific types of features without paying attention to the weight parameters. However, as mentioned above, we still need to tune the hyperparameters of some specific kernels, e.g., the balancer $c_p$ in the polynomial kernel and the smoother $\gamma_g$ in the Gaussian RBF kernel. Although we noticed that the model performance is mostly insensitive to the kernel hyperparameters, as presented in Section Hyperparameters, tuning is sometimes troublesome when we have no knowledge about the kernel. Therefore, we also implement training of the network with learnable kernel hyperparameters based on back-propagation [rumelhart1988learning]. This slightly increases the training complexity theoretically, but in experiments we found that it brings more flexibility and that the additional cost of training several kernel parameters is negligible compared to learning millions of parameters in the network.
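Taking the Gaussian kervolution as an example, the gradients are computed as:

$$\frac{\partial}{\partial \mathbf{w}} \kappa_g(\mathbf{x}, \mathbf{w}) = 2\gamma_g (\mathbf{x} - \mathbf{w})\, \kappa_g(\mathbf{x}, \mathbf{w}),$$

$$\frac{\partial}{\partial \gamma_g} \kappa_g(\mathbf{x}, \mathbf{w}) = -\|\mathbf{x} - \mathbf{w}\|^2\, \kappa_g(\mathbf{x}, \mathbf{w}).$$

Note that the polynomial order is not trainable because it is restricted to integers: a real-valued exponent may produce complex numbers, which would complicate the network.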
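The two gradient expressions for the Gaussian kervolution can be sanity-checked with central finite differences. The sketch below assumes the kernel $\kappa_g(\mathbf{x},\mathbf{w}) = \exp(-\gamma_g\|\mathbf{x}-\mathbf{w}\|^2)$; the function names, test vectors, `gamma = 0.7`, and step `h` are illustrative assumptions. In an automatic differentiation framework these gradients would be obtained without writing them by hand.

```python
import math

def gaussian_kernel(x, w, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, w)))

def grad_w(x, w, gamma):
    """Analytic gradient w.r.t. the filter: 2 * gamma * (x - w) * kappa."""
    k = gaussian_kernel(x, w, gamma)
    return [2.0 * gamma * (a - b) * k for a, b in zip(x, w)]

def grad_gamma(x, w, gamma):
    """Analytic gradient w.r.t. gamma: -||x - w||^2 * kappa."""
    sq = sum((a - b) ** 2 for a, b in zip(x, w))
    return -sq * gaussian_kernel(x, w, gamma)

def numeric_grad_w(x, w, gamma, h=1e-6):
    """Central finite differences, one filter element at a time."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        diff = gaussian_kernel(x, w_plus, gamma) - gaussian_kernel(x, w_minus, gamma)
        grads.append(diff / (2.0 * h))
    return grads

def numeric_grad_gamma(x, w, gamma, h=1e-6):
    diff = gaussian_kernel(x, w, gamma + h) - gaussian_kernel(x, w, gamma - h)
    return diff / (2.0 * h)

x = [0.3, -0.2, 0.5]
w = [0.1, 0.4, -0.3]
gamma = 0.7

w_ok = all(
    math.isclose(a, b, rel_tol=1e-5, abs_tol=1e-7)
    for a, b in zip(grad_w(x, w, gamma), numeric_grad_w(x, w, gamma))
)
gamma_ok = math.isclose(
    grad_gamma(x, w, gamma), numeric_grad_gamma(x, w, gamma),
    rel_tol=1e-5, abs_tol=1e-7,
)
print(w_ok, gamma_ok)  # True True
```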

## Ablation Study

This section explores the influences of the kernels, the hyperparameters, and the combination of kervolutional layers using LeNet-5 and MNIST [Lecun:1998hy]. To eliminate the influence of other factors, all other configurations are kept the same. The accuracy of modern networks on MNIST has saturated; thus we adopt the evaluation criterion proposed in DAWNBench [Coleman:ue], which jointly considers computational effort and precision. It measures the total training time to a target validation accuracy (98%), which is a trade-off between efficiency and accuracy. In all the experiments of this section, we apply the stochastic gradient descent (SGD) method for training, with a mini-batch size of 50, a momentum of 0.9, an initial learning rate of 0.003, a multiplicative learning-rate factor of 0.1 applied at the milestone epochs, and a maximum of 20 epochs. Our algorithm is implemented based on the PyTorch library [paszke2017automatic]. All tests are conducted on a single Nvidia GeForce GTX 1080Ti GPU. The reported training time does not include the testing and checkpoint saving time.

### Kernels

Following the ablation principle, we only replace the convolutional layers of LeNet-5 by kervolutional layers using three kernel functions: the polynomial kernel eq:polynomial-kernel, the Gaussian kernel eq:gaussian-kernel, and the sigmoid kernel. As shown in Figure Kervolutional Neural Networks (a) and (b), although the computational complexity of the non-linear kernels is slightly higher than that of the linear kernel (convolution), the polynomial and Gaussian KNN still converge to the target validation accuracy faster than the original CNN. However, the convergence speed of the sigmoid KNN is slower than that of CNN, which indicates that the choice of kernel function is crucial and has a significant impact on performance. Thanks to the wealth of traditional methods, many other useful kernels are available [smola1998learning], although we cannot test all of them in this paper. The $L^1$- and $L^2$-norm KNN achieve an accuracy of 99.05% and 99.19%, respectively, but we omit them in Figure Kervolutional Neural Networks (a) and (b) because they nearly coincide with the polynomial curve.

### Hyperparameters

From the above analysis, we credit the significant improvements in convergence speed to the usage of different kernels. This part verifies this assumption and further explores the influences of the kernel hyperparameters. The polynomial kervolution eq:polynomial-kernel with two hyperparameters, the non-linear order $d_p$ and the balancer $c_p$, is selected. As shown in Figure Kervolutional Neural Networks (c) and (d), the convergence speed and validation accuracy of polynomial KNN with different hyperparameters are similar to those in Figure Kervolutional Neural Networks (a) and (b), which indicates that KNN is not very sensitive to the kernel hyperparameters.

It is also noticed that the KNN with learnable kernel parameters achieves the best precision (99.20%) in this group, although its training time is slightly longer than that of KNN with fixed hyperparameters. However, the cost is justifiable since it saves the hyperparameter tuning process and the convergence is still much faster than that of the baseline CNN.

### Layer Arrangement

This part explores the influences of the placement of kervolutional layers. Thanks to the simplicity of LeNet-5 (two convolutional layers), we can test all possible configurations of layer arrangement, i.e., "conv-conv", "kerv-conv", "conv-kerv", and "kerv-kerv". As shown in Figure Kervolutional Neural Networks (e) and (f), where the polynomial kernel is adopted, KNN still brings faster convergence. One interesting phenomenon is that the "kerv-conv" architecture achieves better precision but slower convergence than "conv-kerv" (we ran the experiment multiple times and the results are similar). This indicates that the sequence of kervolutional layers has an impact on performance, although the model complexity is the same. One side effect is that some effort may be needed to adjust the layer sequence for deeper KNNs. It is also noticed that the "kerv-kerv" architecture achieves the fastest convergence but only a validation accuracy comparable to that of CNN. We argue that this is caused by over-fitting, since its final training loss is very close to the others, which means that the model capacity of two kervolutional layers together with the activation and max pooling layers is too large for the MNIST dataset.

### Removing ReLU

As mentioned in Section Introduction, the non-linearity of CNN mainly comes from the activation (ReLU) and max pooling layers. Intuitively, KNN may be able to achieve the same model performance without activation or max pooling layers. To this end, we simply remove all the activation layers of LeNet-5 and replace the max pooling layers by average pooling layers, which means that all the non-linearity comes from the kervolutional layers. Unsurprisingly, the CNN only achieves an accuracy of 92.22%, which is far from the target accuracy of 98%; hence the training time comparison figure is omitted. However, the "gaussian-polynomial" and "polynomial-polynomial" KNNs both achieve an accuracy of 99.11%, which further verifies the effectiveness of kervolution. In another sense, removing the activation layers is one of the solutions to the over-fitting problem mentioned in Section Layer Arrangement, although more investigation is needed to find the best architectures for KNN.
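The reason a CNN without ReLU and max pooling loses so much accuracy is that every remaining operation is linear. The sketch below is an illustration with assumed integer vectors and a hypothetical polynomial kernel with `c_p = 1` and `d_p = 2`: it tests the superposition property on a 1-D circular convolution and on the corresponding polynomial kervolution.

```python
def shift(v, j):
    """Circular shift: element k of the result is v[(k + j) % n]."""
    n = len(v)
    return [v[(k + j) % n] for k in range(n)]

def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))

def polynomial_kernel(a, b, c_p=1, d_p=2):
    """Illustrative polynomial kernel (a^T b + c_p)^d_p."""
    return (linear_kernel(a, b) + c_p) ** d_p

def kervolve(x, w, kernel):
    return [kernel(shift(x, i), w) for i in range(len(x))]

def satisfies_superposition(kernel, x, y, w, a, b):
    """Check whether op(a*x + b*y) equals a*op(x) + b*op(y)."""
    combo = [a * p + b * q for p, q in zip(x, y)]
    lhs = kervolve(combo, w, kernel)
    rhs = [a * p + b * q
           for p, q in zip(kervolve(x, w, kernel), kervolve(y, w, kernel))]
    return lhs == rhs

x, y, w = [1, 0, 2], [0, 1, -1], [1, 2, 3]

conv_is_linear = satisfies_superposition(linear_kernel, x, y, w, 2, -1)
kerv_is_linear = satisfies_superposition(polynomial_kernel, x, y, w, 2, -1)
print(conv_is_linear, kerv_is_linear)  # True False
```

Stacked convolutions and average pooling therefore collapse into a single linear map, whereas a kervolutional layer with a non-linear kernel breaks superposition by itself, without any activation layer.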

Network | CIFAR-10 error (%) | CIFAR-100 error (%) |
---|---|---|
CNN [Huang:2016vd] | 13.63 | 44.74 |
KNN | 10.85 | 37.12 |

Architecture | CIFAR-10+ (CNN) | CIFAR-10+ (KNN) | CIFAR-100+ (CNN) | CIFAR-100+ (KNN) |
---|---|---|---|---|
GoogLeNet [dubey2018pairwise] | 13.37 | 5.16 | 26.65 | 20.84 |
ResNet [He:2016ib] | 6.43 | 4.69 | 27.22 | 22.49 |
DenseNet [Huang:2017kg] | 5.24 | 5.08 | 24.42 | 24.92 |

## Performance

This section aims to demonstrate the effectiveness of deep KNN on larger datasets. In practice, the network architecture has a significant impact on performance. Since modern networks are so deep and kervolution provides many possibilities via different kernels, we cannot perform exhaustive tests to find the best sequence of kervolutional layers. Hence, we construct KNN based on several existing architectures, mainly by changing the first convolutional layers to kervolutional layers. Other factors, such as data augmentation and optimizers, are kept in their original configurations. As discovered in Section Layer Arrangement, this may not be the best configuration, but it can still demonstrate the effectiveness of kervolution.

The CIFAR experiments in this section are conducted on a single Nvidia GeForce GTX 1080Ti GPU, while four Nvidia Tesla M40 GPUs are employed in the ImageNet experiments. The polynomial kervolutional layer in this section adopts a learnable balancer $c_p$ with a fixed power $d_p$.

### CIFAR

The CIFAR-10 and CIFAR-100 [krizhevsky2009learning] datasets consist of colored natural images with $32 \times 32$ pixels in 10 and 100 classes, respectively. Each dataset contains 50,000 images for training and 10,000 for testing. In the testing procedure, only a single view of the original image is evaluated.

Hyperparameter | ||
---|---|---|
4.78 | 5.42 | |
4.60 | 5.36 | |
learnable | 4.76 | 4.73 |

Hyperparameter | ||
---|---|---|
0.83 | 1.41 | |
1.41 | 1.46 | |
learnable | 0.86 | 0.79 |

The proposed KNNs are first evaluated without data augmentation using the architecture of ResNet. We construct and train ResNet-110 following the architecture of [He:2016ib] with cross-entropy loss. Stochastic gradient descent (SGD) is adopted with a momentum of 0.9. We train the networks for 200 epochs with a mini-batch size of 128. The learning rate decays by a factor of 0.1 at epochs 75, 125, and 150; the weight decay is kept constant. The validation errors of KNN and the best performance of the baseline CNN from [Huang:2016vd] are presented in the table of CIFAR-10 and CIFAR-100 results without augmentation. Notably, KNN outperforms CNN by a significant margin on both CIFAR datasets.

We perform more experiments using data augmentation techniques, and the datasets are denoted as "CIFAR-10+" and "CIFAR-100+", respectively. The KNNs are constructed following the architectures of GoogLeNet [Szegedy:2015ja] and DenseNet-40-12 [Huang:2017kg]. Data augmentation is applied following the configuration in ResNet [He:2016ib], including horizontal flipping with a probability of 50%, reflection-padding by 4 pixels, and random cropping.

Different from ResNet, we train DenseNet-40-12 following its original configuration [Huang:2017kg], i.e., SGD with a batch size of 64 for 300 epochs. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of iterations. The table of CIFAR-10+ and CIFAR-100+ results lists the performance of KNN and the baselines from [He:2016ib, Huang:2017kg, dubey2018pairwise]. We cannot see a significant improvement on DenseNet, which indicates that polynomial kervolution may not be suitable for such a fully connected architecture.

We further examine the sensitivity to the kernel hyperparameters. The first hyperparameter table lists the validation errors of KNN on CIFAR-10+ with polynomial kernels of different hyperparameters using the architecture of ResNet-32. The performance with a learnable balancer $c_p$ is also given for comparison. As suggested by [Coleman:ue], the training time to an accuracy of 94% is measured and presented in the second hyperparameter table. Interestingly, a configuration with fixed hyperparameters achieves the best accuracy, while a configuration with a learnable balancer requires the least training time. The networks with learnable kernels achieve the best overall performance when training time and accuracy are considered jointly. This indicates that the learnable kernel technique can provide a good compromise without parameter tuning.

### ImageNet

The ILSVRC 2012 classification dataset [deng2009imagenet] is composed of images for training and images for validation in classes. For a fair comparison, we apply the same data augmentation as described in [He:2016ib, he2016identity], where single-crop and 10-crop at a size of are applied for testing.

We select four versions of ResNet [He:2016ib], namely ResNet-18, ResNet-34, ResNet-50 and ResNet-101, as the baselines. The kervolutional layer is applied with a polynomial kernel (). All the networks are trained using stochastic gradient descent (SGD) for 100 epochs with a batch size of 256. The learning rate is set to 0.1, and is reduced every 30 epochs. In addition, a weight decay of and a momentum of without dampening are employed. In our experiments, the best reported performance of ResNet could not be reached within the limited training time. To guarantee a fair comparison, we therefore take the ResNet results with the best accuracy in [fb2016, He:2016ib, he2016identity, Huang:2017kg]. We report the single-crop and 10-crop validation error on ImageNet in Table Kervolutional Neural Networks, where the performance of KNN is the average of five runs.

In Table Kervolutional Neural Networks, top-1 errors using ResNet-18/34/50/101 are reduced by 0.50%, 0.41%, 0.29%, and 0.70% in single-crop testing and 0.45%, 0.75%, 0.80%, and 0.83% in 10-crop testing, respectively. For top-5 errors, KNNs outperform the corresponding ResNets by 0.43%, 0.24%, 0.23%, and 0.28% in single-crop testing, and 0.39%, 0.68%, 0.74%, and 0.87% in 10-crop testing, respectively. These reductions are absolute differences in error rate. Simply replacing the convolutional layers in ResNet therefore leads to consistent improvements at every depth. We believe that more customized network architectures as well as extensive hyperparameter searches can further improve the performance on ImageNet.

Network | Top-1 error (%), single-crop / 10-crop | Top-5 error (%), single-crop / 10-crop |
---|---|---|
ResNet-18 | 30.24 / 27.88 | 10.92 / 9.42 |
KNN-18 | 29.74 / 27.43 | 10.49 / 9.03 |
ResNet-34 | 26.70 / 25.03 | 8.58 / 7.76 |
KNN-34 | 26.29 / 24.28 | 8.34 / 7.08 |
ResNet-50 | 23.85 / 22.85 | 7.13 / 6.71 |
KNN-50 | 23.56 / 22.05 | 6.90 / 5.97 |
ResNet-101 | 22.63 / 21.75 | 6.44 / 6.05 |
KNN-101 | 21.93 / 20.92 | 6.16 / 5.18 |
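The error reductions quoted in the text can be re-derived from the table entries with a few lines of arithmetic. The sketch below simply copies the table values; the dictionary layout and the function name are illustrative choices, not part of the original work.

```python
# Validation errors (%) copied from the table, ordered as
# (top-1 single-crop, top-1 10-crop, top-5 single-crop, top-5 10-crop).
errors = {
    "18": {"ResNet": (30.24, 27.88, 10.92, 9.42), "KNN": (29.74, 27.43, 10.49, 9.03)},
    "34": {"ResNet": (26.70, 25.03, 8.58, 7.76), "KNN": (26.29, 24.28, 8.34, 7.08)},
    "50": {"ResNet": (23.85, 22.85, 7.13, 6.71), "KNN": (23.56, 22.05, 6.90, 5.97)},
    "101": {"ResNet": (22.63, 21.75, 6.44, 6.05), "KNN": (21.93, 20.92, 6.16, 5.18)},
}


def reductions(depth):
    """Absolute error reduction (ResNet minus KNN) for each of the four columns."""
    pair = errors[depth]
    return tuple(round(r - k, 2) for r, k in zip(pair["ResNet"], pair["KNN"]))


if __name__ == "__main__":
    for depth in errors:
        print(f"ResNet-{depth} -> KNN-{depth}:", reductions(depth))
```

Each tuple is an absolute difference in percentage points, so the values are directly comparable across depths.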

## Discussion

#### Kernel

Different from convolution, which can only extract linear features, kervolution is able to extract customized patch-wise non-linear features, which makes KNN much more flexible. The experiments show that the higher order terms make the subsequent linear classifier more discriminative, while the computational complexity does not increase. However, we have only tested a few kernels, namely polynomial and Gaussian, which may not be optimal. The kernel functions and their hyperparameters can be chosen in a task-driven manner, and more investigation is necessary.
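To make the patch-wise idea concrete, the following is a minimal one-dimensional sketch in plain Python. It assumes the commonly used forms of the polynomial kernel, `(x·w + c)^d`, and the Gaussian kernel, `exp(-gamma * ||x - w||^2)`; the function names and the default values of `c`, `d`, and `gamma` are hypothetical and chosen only for illustration.

```python
import math


def dot(x, w):
    """Inner product of a patch and a filter of equal length."""
    return sum(a * b for a, b in zip(x, w))


def linear_kernel(x, w):
    """Plain inner product: recovers ordinary convolution (cross-correlation)."""
    return dot(x, w)


def polynomial_kernel(x, w, c=1.0, d=2):
    """Polynomial kernel (x.w + c)^d; c balances the lower and higher order terms."""
    return (dot(x, w) + c) ** d


def gaussian_kernel(x, w, gamma=0.1):
    """Gaussian (RBF) kernel exp(-gamma * ||x - w||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, w))
    return math.exp(-gamma * sq_dist)


def kervolve_1d(signal, weights, kernel):
    """Slide the filter over the signal and apply the kernel to every patch."""
    k = len(weights)
    return [kernel(signal[i:i + k], weights) for i in range(len(signal) - k + 1)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    weights = [1.0, 0.0, -1.0]
    print(kervolve_1d(signal, weights, linear_kernel))
    print(kervolve_1d(signal, weights, polynomial_kernel))
    print(kervolve_1d(signal, weights, gaussian_kernel))
```

In this sketch every kernel reuses the same filter `weights`, so switching kernels changes the response function without adding filter parameters; the polynomial kernel with `c=0` and `d=1` reduces to the linear case.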

#### Training

We also notice that training can be unstable when a network contains too much non-linearity. This happens because the model complexity becomes too high for the specific task, and it can be solved simply by reducing the number of kervolutional layers. Finding the appropriate amount of non-linearity for a specific task remains challenging and requires further investigation.

#### Architecture

We have only applied kervolution to existing architectures such as ResNet. This is not optimal, especially considering that the mechanism of deep architectures is still unclear [kuo2016understanding]. The performance of kervolution clearly depends on the architecture. One of the interesting challenges for future work is to investigate the relationship between the architecture and kervolution.

## Conclusion

This paper introduces kervolution to generalize convolution to a non-linear space, and extends convolutional neural networks to kervolutional neural networks. It is shown that kervolution not only retains the advantages of convolution, namely weight sharing and equivariance to translation, but also enhances the model capacity and captures higher order interactions of features via patch-wise kernel functions, without introducing additional parameters. It has been demonstrated that, with a carefully chosen kernel, the performance of CNNs can be significantly improved on the MNIST, CIFAR, and ImageNet datasets by replacing convolutional layers with kervolutional layers. Because the number of possible kervolution choices is large, we cannot perform a brute-force search over all of them; however, this opens a new space for the construction of deep networks. We expect that introducing kervolutional layers into more architectures, together with extensive hyperparameter searches, can further improve the performance.
