1 Introduction
The idea of using correlation/convolution operators in neural networks goes back to Fukushima, who proposed the cognitron (Fukushima, 1975) and neocognitron (Fukushima, 1980, 1988) neural networks. In the neocognitron, the input image is matched against a set of patterns that are represented by weights. For each pattern, a map is produced that marks the positions in the input image at which that pattern is present. This operation can be interpreted as either correlation or convolution, depending on how we assemble the weight vector into an image pattern. In this paper, we stick to the term convolution, since it is well established in this context and since correlational neural networks refer to a completely different network (Chandar et al, 2016). To reduce the sensitivity to the exact positions of the patterns in the input image, Fukushima proposed subsampling the produced maps with max-pooling. By repeating these layers of convolution and pooling (which he termed S-layers and C-layers), it was possible to represent more complex patterns by neurons at deeper layers. It is interesting that the convolution and max-pooling operations, which constitute the backbone of today's convolutional neural networks (CNN), were already present in the very early design of the neocognitron. The neocognitron was trained with an unsupervised learning algorithm that clearly implemented a matched filter.
After the invention of the backpropagation algorithm by Williams and Hinton (1986), LeCun et al (1989) introduced CNNs, which were essentially a neocognitron trained with the backpropagation algorithm. However, in contrast to the initial goal of implementing matched filters, it was observed that the weights of a trained CNN contained both positive and negative values. With negative weights, the interpretation of the convolution operator in CNNs generalized from the original idea of implementing a matched filter to a new interpretation in which the convolution operator extracts features from the image planes of the previous layer. Whether we accept the matched filter or the feature extraction viewpoint, the convolution operator computes the inner product between image patches and a pattern represented by the weights. We call this the generalized matched filter viewpoint, in which we still view the inner product operator as a similarity measure, even though the pattern may take negative values.
In this paper, we propose two generalizations of the convolution operator. In the first generalization, which is based on kernel methods, we propose substituting the inner product operator within the convolution operation by a positive definite kernel function. In contrast to kernel methods such as support vector machines (SVM) and kernel principal component analysis (KPCA), here the positive definiteness of the kernel function is not crucial, and we show that any monotonically increasing function of a positive definite kernel function can be used as well. The second generalization comes from the fact that the primary goal of including the convolution layer in the neocognitron and CNNs was to detect spots in the input plane that are locally similar to a target pattern. In this view, we propose that the inner product operation within the convolution operator can be replaced by a similarity measure. Specifically, we define a similarity measure as any monotonically increasing function of the negation of a distance function which assigns similarity zero to distance infinity. In this way, numerous similarity measures can be constructed by applying different monotonically increasing nonlinear functions to the negation of a distance metric. Therefore, instead of implementing a full similarity-based convolution layer, we implement a generalized convolution layer with the inner product operation replaced by the desired distance function. We then negate the result and apply an appropriate monotonically increasing function to arrive at a similarity measure. We have implemented our generalized convolution operators as layers within the well-known Caffe
(Jia et al, 2014) framework.
Prior to this work, some researchers also proposed networks in which the inner product operation within the convolution operator had been replaced by some other function $f$. We denote the resulting operation as $f$-convolution. Serre et al (2007) proposed the HMax model for object recognition where, in its second scale, the similarity between image patches and stored patterns is measured by a Gaussian function. In other words, the second scale of HMax computes a Gaussian-convolution. However, the HMax model is considerably different from a CNN and is not trained by the backpropagation algorithm. In convolutional kernel networks (CKN), Mairal et al (2014) considered a special kernel function for measuring the similarity between two complete images and showed that its associated feature map can be approximated by a Gaussian-convolution followed by pooling. Assuming that the image patches and the learned patterns are normalized, Mairal et al (2014) showed that the computation of a Gaussian-convolution is equivalent to an ordinary convolution operator followed by a special nonlinearity that resembles the rectified linear unit (ReLU) in the interval $[-1,1]$. With CKNs implemented as an ordinary convolution operator followed by a nonlinearity, Mairal (2016) could train the network by the backpropagation algorithm. Lin et al (2014) were the first to explicitly propose generalizing the convolution operator in CNNs and training the whole network with backpropagation. They introduced the network in network (NIN) model, in which the inner product operation within the convolution operator is replaced with a multilayer perceptron (MLP). At first sight, considering the universal approximation property of multilayer perceptrons (Hornik et al, 1989), one may view NIN as a radical generalization of the convolution operator in which the internal inner product operation is substituted by an arbitrary function. However, it can be shown that MLP-convolution is equivalent to an ordinary convolution followed by several $1 \times 1$ convolutions and nonlinearities. In fact, this feature helped Lin et al (2014) to implement NIN with ordinary CNNs without altering their implementation. In this view, it can be said that NIN did not generalize the convolution operator at all; its contribution was the discovery that CNNs should be much deeper and should decrease the resolution of the convolution planes by pooling at a slow pace. Although the broad idea of generalizing the convolution operator has been present in the above-mentioned works, the specific ideas presented in this paper are completely novel, and this is the first time that a network with a truly generalized convolution operator is trained by the backpropagation algorithm.
The paper proceeds as follows. In section 2, we introduce two classes of generalized convolution operators. Some specific examples of generalized convolution operators are introduced in section 3. We found that simple random initialization of the parameters and algorithms like Xavier (Glorot and Bengio, 2010) are not suitable choices for initializing generalized convolutional neural networks (GCNN). Two initialization algorithms that can be used for GCNNs are introduced in section 4. We report our experiments on the MNIST dataset in section 5. We conclude the paper in section 6 and mention future work in section 7.
2 The proposed method
A CNN is a deep neural network which consists of different types of layers, including convolution, pooling, nonlinearity, inner product, and loss layers. Usually, a module consisting of a sequence of convolution, nonlinearity, and pooling layers is repeated several times to produce a suitable representation of the input data, which is then fed to a fully connected network to estimate the output (for recent generalizations of this block see Szegedy et al, 2015; He et al, 2016). The convolution layer computes the convolution between its input planes and several filters represented by the weight parameters and produces a set of output planes, one associated with each filter. However, the convolution is essentially a linear operation. It is well known that deepening a neural network with linear activation functions does not increase its representational power: such a network still represents a linear function of the input data. So, to increase the modeling capability of CNNs, the convolution layer is usually followed by a nonlinearity layer. Common activation functions include the rectified linear unit (ReLU), hyperbolic tangent (TanH), and logistic sigmoid (Sigmoid). The goal of the pooling layer is to reduce the sensitivity of the network to translations of the input images and to summarize the important information of the input planes in a more compact representation.
Consider a convolution layer which operates on $C$ input planes and generates $M$ output planes. Assume that $\mathbf{x}_p$ is an $n$-dimensional vector generated by vectorizing the patches of the input planes centered at position $p$. Let $\mathbf{w}_j$ be the weight vector associated with the $j$'th output plane. Then the output value at position $p$ of plane $j$ is computed by the formula
$$y_j(p) = \langle \mathbf{x}_p, \mathbf{w}_j \rangle. \qquad (1)$$
Our first proposal for generalizing the convolution operator is to replace the inner product operation in Eq. (1) with a positive definite kernel function. Assuming that $k$ is a positive definite kernel function, the output is now computed by
$$y_j(p) = k(\mathbf{x}_p, \mathbf{w}_j). \qquad (2)$$
This generalization allows us to use a handful of kernel functions such as Gaussian, polynomial, Laplacian, cosine, Cauchy, and intersection in place of the inner product operation. However, our choice of kernel functions is severely restricted by the positive definiteness requirement. In a kernel method like SVM, the positive definiteness property plays a crucial role, and violating it makes the objective function unbounded from below. In RBF neural networks, the positive definiteness property of kernel functions guarantees that the kernel matrix is invertible, so the optimal weights of the radial basis functions exist and are unique. Generally, the positive definiteness property of kernel functions eliminates the possibility that the inner product of a vector with itself becomes negative. Assume that $k$ is an inner product kernel function with feature space $\mathcal{H}$ and feature map $\phi$ (i.e. $k(\mathbf{x},\mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle_{\mathcal{H}}$) which is not positive definite. Since $k$ is not positive definite, there exist inputs $\mathbf{x}_1,\dots,\mathbf{x}_m$ and coefficients $c_1,\dots,c_m$ such that
$$\sum_{i=1}^{m}\sum_{j=1}^{m} c_i c_j\, k(\mathbf{x}_i,\mathbf{x}_j) < 0. \qquad (3)$$
One can easily verify that the expression on the left side of Eq. (3) is equal to the inner product of the vector $\sum_{i=1}^{m} c_i \phi(\mathbf{x}_i)$ with itself. Therefore, the use of non-positive definite kernel functions in GCNNs may lead to patterns which are not similar to themselves, violating the generalized pattern matching viewpoint. However, in GCNNs we are not concerned with all vectors that can be constructed in the feature space associated with a kernel function. Instead, we apply the kernel function directly to two input vectors, and the requirement that patterns be similar to themselves translates to the condition $k(\mathbf{x},\mathbf{x}) \ge 0$ for all input vectors $\mathbf{x}$. This requirement is satisfied for any function having the form $f(k(\mathbf{x},\mathbf{z}))$, where $k$ is a positive definite kernel function and $f$ is a monotonically increasing function with $f(0) \ge 0$. So, we arrive at our first generalization of the convolution operator.
Generalization 1: The convolution operator in CNNs can be generalized by substituting the inner product operation between a vectorized input $\mathbf{x}$ and a weight vector $\mathbf{w}$ by $f(k(\mathbf{x},\mathbf{w}))$, where $k$ is a positive definite kernel function and $f$ is a monotonically increasing function with $f(0) \ge 0$.
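The substitution described in Generalization 1 can be illustrated with a minimal sketch. It assumes a single input plane, a toy 4x4 image, and an isotropic Gaussian kernel; the helper names (`extract_patches`, `generalized_conv`) and the value `gamma=1.0` are illustrative choices and not part of the Caffe implementation described in this paper.

```python
import math

def extract_patches(image, k):
    """Map each valid top-left position to its flattened k x k patch."""
    rows, cols = len(image), len(image[0])
    return {
        (r, c): [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(rows - k + 1)
        for c in range(cols - k + 1)
    }

def inner_product(x, w):
    """Ordinary convolution response: <x, w>."""
    return sum(a * b for a, b in zip(x, w))

def gaussian_kernel(x, w, gamma=1.0):
    """Isotropic Gaussian kernel (gamma is an assumed, illustrative value)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, w)))

def generalized_conv(image, w, k, similarity, f=lambda v: v):
    """Output plane whose value at each position is f(similarity(patch, w))."""
    return {pos: f(similarity(x, w)) for pos, x in extract_patches(image, k).items()}

image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
w = [1, 2, 3, 4]  # 2 x 2 template, flattened row by row

ordinary = generalized_conv(image, w, 2, inner_product)
gaussian = generalized_conv(image, w, 2, gaussian_kernel)
best = max(gaussian, key=gaussian.get)  # position where the template matches best
print(best, ordinary[best], gaussian[best])
```

Only the `similarity` argument changes between the ordinary and the generalized layer; the patch extraction and the layout of the output plane stay the same, which is the sense in which the operator is a drop-in generalization.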
It is evident that the main purpose of the inner product stage within the convolution operator in neocognitron was to measure the similarity between the patches of the input maps of the preceding layer and a template pattern. One justification for this is that the inner product operator is essentially a similarity measure (see section 1.1 of Schölkopf and Smola, 2002). It may be argued that the inner product operation is not a suitable similarity measure since for example there are vectors which are more similar to a chosen vector than itself. Our second proposal for generalizing the convolution operator is to substitute the inner product operator with a similarity measure. Defining similarity based on a distance measure, we arrive at our second generalization of the convolution operator.
Generalization 2: The convolution operator in CNNs can be generalized by substituting the inner product operation between a vectorized input $\mathbf{x}$ and a weight vector $\mathbf{w}$ by $f(-d(\mathbf{x},\mathbf{w}))$, where $d$ is a distance metric and $f$ is a monotonically increasing function with $f(-\infty) = 0$. Adding the additional constraint $f(0) = 1$ ensures that $f(-d(\mathbf{x},\mathbf{x})) = 1$, meaning that the similarity of each vector with itself is 1. However, since the dimension of the input space is usually high (for example, in our experiments on the MNIST dataset, we have 12 planes in the first convolution layer which, considering a window of size 5, induces a dimensionality of $12 \times 5 \times 5 = 300$ on the input of the second convolution layer), exact matching almost never happens and even similar items typically have high numerical distances. In addition, since we want to permit the use of non-squashing activation functions such as ReLU, which have experimentally proven to do better than squashing functions such as TanH and Sigmoid, we do not impose the restriction $f(0) = 1$.
In practice, we implemented similarity/kernel functions by using multiple layers in Caffe. These layers include a generalized convolution layer based on a metric distance (e.g. weighted L2 distance), possibly followed by an AdaptiveLinear layer with negative slope, followed by an activation function layer like exponential (Exp) or ReLU. AdaptiveLinear is a simple new kind of layer that we have added to Caffe; it implements $y = ax + b$, where the parameters $a$ and $b$ differ between output channels. In this view, activation functions are appropriate monotonically increasing functions that complement the functionality of a distance-based generalized convolution layer such that the whole module implements a generalized convolution operator. For example, the ReLU activation function can be seen as a monotonically increasing function that complements the role of the preceding layers by enforcing the non-negativity criterion of a similarity measure.
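The module described above (distance layer, AdaptiveLinear with negative slope, then ReLU) can be sketched for a single output channel as follows. This is a minimal illustration, not the Caffe code: it assumes that the distance layer outputs the weighted squared L2 distance, and the function names and the values `a=-1.0`, `b=1.0` are hypothetical.

```python
def wl2dist(x, mu, gamma):
    """Weighted squared L2 distance between patch x and pattern mu (assumed form)."""
    return sum(g * (a - m) ** 2 for a, m, g in zip(x, mu, gamma))

def adaptive_linear(d, a, b):
    """Per-channel affine map y = a * d + b; a negative slope negates the distance."""
    return a * d + b

def relu(v):
    """Monotonically increasing activation that enforces non-negativity."""
    return max(0.0, v)

def similarity_module(x, mu, gamma, a=-1.0, b=1.0):
    """ReLU(a * WL2Dist(x, mu) + b): a distance-based generalized convolution unit."""
    return relu(adaptive_linear(wl2dist(x, mu, gamma), a, b))

mu = [0.5, -0.2, 1.0]      # template pattern; negative entries are allowed
gamma = [1.0, 2.0, 0.5]    # positive precisions

exact = similarity_module(mu, mu, gamma)                # zero distance
near = similarity_module([0.6, -0.2, 1.0], mu, gamma)   # small distance
far = similarity_module([5.0, 5.0, 5.0], mu, gamma)     # large distance
print(exact, near, far)
```

With these illustrative parameters the similarity decreases from `b` at zero distance to exactly zero once the distance exceeds `b / |a|`, so the slope and bias of the AdaptiveLinear layer control how tolerant the unit is to mismatches.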
3 Examples of generalized convolution operators
3.1 Non-isotropic Gaussian kernel
The non-isotropic Gaussian kernel is defined as
$$k(\mathbf{x},\mathbf{z}) = \exp\Big(-\sum_{i=1}^{n} \gamma_i (x_i - z_i)^2\Big), \qquad (4)$$
where $\boldsymbol{\gamma}$ is the precision vector, which consists of positive values. The use of the Gaussian kernel function in generalized convolution is admissible both because it is a positive definite kernel function and because it can be expressed as the application of the monotonically increasing function $f$ with the definition $f(x) = \exp(x)$ to the negation of the weighted L2 distance (WL2Dist). In addition to satisfying the required constraint $f(-\infty) = 0$, the function $f$ has the additional property that $f(0) = 1$, ensuring that it is a similarity measure spanning the range $(0,1]$.
3.2 Non-isotropic Laplacian kernel
The non-isotropic Laplacian kernel is defined as
$$k(\mathbf{x},\mathbf{z}) = \exp\Big(-\sum_{i=1}^{n} \gamma_i |x_i - z_i|\Big), \qquad (5)$$
where $\boldsymbol{\gamma}$ is the precision vector, whose entries should be positive. Again, the Laplacian kernel function is a positive definite kernel function, and it can also be expressed as the application of the monotonically increasing function $f(x) = \exp(x)$ to the negation of the weighted L1 distance (WL1Dist).
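Both kernels of sections 3.1 and 3.2 share the same structure: a weighted distance, negated and passed through the exponential. The sketch below makes this composition explicit. It assumes the standard forms of the two kernels (a sum of precision-weighted squared differences for the Gaussian, and of precision-weighted absolute differences for the Laplacian); the function names and the sample vectors are illustrative.

```python
import math

def wl2dist(x, z, gamma):
    """Weighted squared L2 distance with per-dimension precisions gamma."""
    return sum(g * (a - b) ** 2 for a, b, g in zip(x, z, gamma))

def wl1dist(x, z, gamma):
    """Weighted L1 distance with per-dimension precisions gamma."""
    return sum(g * abs(a - b) for a, b, g in zip(x, z, gamma))

def gaussian_kernel(x, z, gamma):
    """Non-isotropic Gaussian kernel: exp applied to the negated WL2Dist."""
    return math.exp(-wl2dist(x, z, gamma))

def laplacian_kernel(x, z, gamma):
    """Non-isotropic Laplacian kernel: exp applied to the negated WL1Dist."""
    return math.exp(-wl1dist(x, z, gamma))

z = [1.0, 2.0]        # pattern
gamma = [0.5, 2.0]    # positive precisions
x = [2.0, 2.5]        # input patch

print(gaussian_kernel(z, z, gamma), laplacian_kernel(z, z, gamma))  # self-similarity
print(gaussian_kernel(x, z, gamma), laplacian_kernel(x, z, gamma))
```

Because `exp` is monotonically increasing, maps zero to one, and tends to zero as its argument tends to minus infinity, both functions take values in the half-open interval from 0 (excluded) to 1 (included).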
3.3 Cosine kernel: justifying the Sine activation function
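It is well known that $k(\mathbf{x},\mathbf{z}) = \cos(\mathbf{w}^\top(\mathbf{x}-\mathbf{z}))$ is a positive definite kernel function, so it can be used to measure the similarity between an input image patch $\mathbf{x}$ and a pattern $\mathbf{z}$. Since $\mathbf{w}$ is a parameter of the kernel, it is fixed, and for a given pattern $\mathbf{z}$ the product $\mathbf{w}^\top\mathbf{z}$ is equal to some constant $c$. It follows that $k(\mathbf{x},\mathbf{z}) = \cos(\mathbf{w}^\top\mathbf{x} + b)$, where $b = -c$. Thus, $\cos(\mathbf{w}^\top\mathbf{x} + b)$ is essentially computing the kernel function $k(\mathbf{x},\mathbf{z})$, where $\mathbf{z}$ is an implicit pattern satisfying the equation $\mathbf{w}^\top\mathbf{z} = -b$. Note that in contrast to the ordinary convolution operation, where the similarity of $\mathbf{x}$ is measured against the vector of weights $\mathbf{w}$, here $\mathbf{w}$ is solely a parameter of the kernel function and the desired pattern is hidden in the bias parameter $b$. The above line of reasoning applies exactly to the cosine activation function. However, since the gradient of the cosine function vanishes at zero, the parameters of a network with the cosine activation function would get stuck at their initial values. This is because almost all initialization algorithms initialize the weight vector $\mathbf{w}$ and the bias parameter $b$ in a way that $\mathbf{w}^\top\mathbf{x} + b$ is on average zero. The Sine activation function avoids this problem: it equals the cosine with the bias shifted by $\pi/2$, and its gradient at zero is nonzero.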
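The argument above can be checked numerically. The following sketch assumes the cosine kernel has the form cos(w . (x - z)) and verifies that a cosine unit with bias b = -(w . z) computes it, and that a sine unit is the same unit with its bias shifted by pi/2. All vectors are arbitrary illustrative values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_kernel(x, z, w):
    """Assumed form of the cosine kernel: cos(w . (x - z))."""
    return math.cos(dot(w, [xi - zi for xi, zi in zip(x, z)]))

def cosine_unit(x, w, b):
    """Neuron with cosine activation: cos(w . x + b)."""
    return math.cos(dot(w, x) + b)

def sine_unit(x, w, b):
    """Neuron with sine activation: sin(w . x + b)."""
    return math.sin(dot(w, x) + b)

w = [0.3, -1.2, 0.7]    # kernel parameter
z = [1.0, 0.5, -2.0]    # implicit pattern
b = -dot(w, z)          # the bias that hides the pattern
x = [0.2, 0.1, 0.4]     # input patch

kernel_value = cosine_kernel(x, z, w)
unit_value = cosine_unit(x, w, b)
shifted_sine = sine_unit(x, w, b + math.pi / 2)   # sin(t + pi/2) = cos(t)

# Gradients of the two activations at pre-activation 0, where typical
# initializations place the units on average.
grad_cos_at_zero = -math.sin(0.0)
grad_sin_at_zero = math.cos(0.0)
print(kernel_value, unit_value, shifted_sine, grad_cos_at_zero, grad_sin_at_zero)
```

The last two values show why the sine is preferable in practice: the cosine has zero slope at the origin, whereas the sine has slope one there.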
In section 5.3 we will experimentally show that the Sine activation function performs similarly to ReLU and significantly better than TanH. One benefit of Sine is that it does not have the saturation problem of TanH and Sigmoid. As illustrated in Figure 1, Sine and TanH have similar shapes around the origin; however, farther from the origin TanH saturates while Sine is periodic. One problem with TanH is that if the target value is $\pm 1$, then the weights are pushed towards infinity and the gradient of the TanH function vanishes. To remedy this problem, LeCun et al (1998b) proposed a scaled version of TanH with the definition
$$f(x) = 1.7159 \tanh\left(\tfrac{2}{3}x\right). \qquad (6)$$
Another benefit of Sine is that it can produce an output value of $\pm 1$ without pushing the weights towards infinity.
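The saturation argument can be made concrete with a few evaluations. This short sketch compares plain TanH, the scaled TanH of LeCun et al (1998b), and Sine at a target output of 1; the probe points are arbitrary.

```python
import math

def scaled_tanh(x):
    """Scaled hyperbolic tangent: 1.7159 * tanh(2x / 3)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Plain TanH only approaches 1, and its gradient shrinks as it gets closer.
print(math.tanh(5.0), tanh_grad(5.0))

# The scaled TanH reaches 1 at a finite pre-activation (x = 1).
print(scaled_tanh(1.0))

# Sine reaches 1 at the finite pre-activation pi / 2.
print(math.sin(math.pi / 2))
```

Both remedies let the unit hit a target of 1 at a finite pre-activation; the scaled TanH still saturates for large inputs, whereas Sine does not.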
3.4 Other similarities based on the weighted L2 distance
We saw in section 3.1 that the Gaussian kernel equals the composition of the monotonically increasing function $f$ with definition $f(x) = \exp(x)$ and the negation of WL2Dist. In this section, we propose to use other activation functions on top of the WL2Dist. For the case of the Gaussian kernel, we can view Eq. (4) as the unnormalized Gaussian probability density that an input patch $\mathbf{x}$ matches pattern $\mathbf{z}$. Rewriting Eq. (4) in the usual form of a multivariate Gaussian distribution we have
$$k(\mathbf{x},\mathbf{z}) = \exp\Big(-\tfrac{1}{2}(\mathbf{x}-\mathbf{z})^\top \Lambda\, (\mathbf{x}-\mathbf{z})\Big), \qquad (7)$$
where $\Lambda$ is a diagonal precision matrix with entries $\Lambda_{ii} = 2\gamma_i$.
Viewing the Gaussian kernel as a probability density function, we can say that the value returned by this function is proportional to the probability density of input $\mathbf{x}$ in a Gaussian distribution with mean $\mathbf{z}$ and precision matrix $\Lambda$. The problem with this value is that it cannot directly be used as a measure for deciding whether the input data is similar to the desired pattern or not. In this section, we exploit this probabilistic point of view to arrive at some other activation functions on top of the WL2Dist.
Since, by assumption, the precision matrix of the Gaussian distribution is diagonal, it follows that different dimensions are independent of each other, and so the squared weighted L2 distance $d^2 = (\mathbf{x}-\mathbf{z})^\top \Lambda\, (\mathbf{x}-\mathbf{z})$ is equal to the sum of squares of $n$ standard normal variables, which is known to have a $\chi^2$ distribution with $n$ degrees of freedom. So, the probability that an input $\mathbf{x}$ with squared distance $d^2$ belongs to the Gaussian distribution associated with pattern $\mathbf{z}$ is equal to the p-value of the $\chi^2$ distribution at point $d^2$. Using the inverse cumulative distribution, one can determine two thresholds $\tau_1$ and $\tau_2$ such that for inputs with $d^2 \le \tau_1$ the probability that $\mathbf{x}$ is generated by this distribution is very high, and for inputs with $d^2 \ge \tau_2$ this probability is very low. Therefore, a suitable value for the similarity of input $\mathbf{x}$ to pattern $\mathbf{z}$ is given by
$$s(\mathbf{x},\mathbf{z}) = \begin{cases} 1 & d^2 \le \tau_1 \\ \dfrac{\tau_2 - d^2}{\tau_2 - \tau_1} & \tau_1 < d^2 < \tau_2 \\ 0 & d^2 \ge \tau_2 \end{cases}$$
If we first apply a linear transformation with the slope $-1/(\tau_2 - \tau_1)$ and the bias $\tau_2/(\tau_2 - \tau_1)$ to the output of the WL2Dist layer, then the desired functionality can be achieved using the DoubleThreshold activation function defined as
$$\mathrm{DoubleThreshold}(v) = \min(\max(v, 0), 1).$$
If we allow similarities greater than $1$, then the upper limit of the DoubleThreshold activation function is dropped and we arrive at the well-known ReLU activation function.
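The threshold construction of this section can be sketched as follows. Several choices here are assumptions made for illustration only: the chi-square quantiles are computed with the Wilson-Hilferty approximation rather than an exact inverse CDF, the tail probabilities 0.05 and 0.95 are arbitrary, the dimensionality is set to 25, and the similarity is taken to fall linearly between the two thresholds.

```python
from statistics import NormalDist

def chi2_quantile(p, n):
    """Wilson-Hilferty approximation to the chi-square inverse CDF (n degrees of freedom)."""
    z = NormalDist().inv_cdf(p)
    return n * (1.0 - 2.0 / (9.0 * n) + z * (2.0 / (9.0 * n)) ** 0.5) ** 3

def double_threshold(v):
    """Clamp to [0, 1]; dropping the upper limit would give ReLU."""
    return min(1.0, max(0.0, v))

def similarity(d2, t1, t2):
    """Affine map with negative slope followed by DoubleThreshold."""
    slope = -1.0 / (t2 - t1)
    bias = t2 / (t2 - t1)
    return double_threshold(slope * d2 + bias)

n = 25                         # patch dimensionality (illustrative)
t1 = chi2_quantile(0.05, n)    # below t1: very likely generated by the pattern
t2 = chi2_quantile(0.95, n)    # above t2: very unlikely
mid = 0.5 * (t1 + t2)

for d2 in (0.0, t1, mid, t2, 100.0):
    print(round(d2, 3), similarity(d2, t1, t2))
```

An exact inverse CDF, where available, can replace `chi2_quantile` without changing the rest of the sketch.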
3.5 Other similarities based on the weighted L1 distance
The discussion of the previous section can be repeated for the Laplacian distribution, resulting in generalized convolution operators based on the WL1Dist.
4 Initialization of parameters
The performance of deep CNNs is strongly influenced by the method of initializing the parameters and by controlling the amount of backward gradient returned to each parameter (Krähenbühl et al, 2016; Mishkin and Matas, 2016; Glorot and Bengio, 2010; Ioffe and Szegedy, 2015). Initialization and optimization algorithms proposed for neural networks are designed based on the linear model of neurons (i.e. $y = \mathbf{w}^\top\mathbf{x} + b$). When the incoming weights are initialized randomly with mean zero, this model ensures that the mean of the output is zero as well. By choosing appropriate values for the magnitudes of the weights, one can ensure that all neurons have a mean value of zero and a variance of one, avoiding the vanishing/exploding problems in the forward pass (LeCun et al, 1998b). Recently, similar approaches have been devised that control the magnitude of the gradient in the backward pass (Glorot and Bengio, 2010; Ioffe and Szegedy, 2015; Krähenbühl et al, 2016). However, by substituting the inner product operation with a kernel/distance function, all of these nice properties fade and the vanishing/exploding problems reappear in both the forward and backward passes. Each kernel/distance function has its own properties that should be considered in initializing its parameters. In this section, we consider two initialization algorithms for networks based on the weighted L1/L2 distances. The specific architecture we are considering is depicted in Figure 2.
4.1 Precision adjustment initialization algorithm
The goal of this initialization method is to ensure that all signals in the forward pass have appropriate magnitudes. The initialization proceeds module by module, adjusting the precision parameters of the generalized convolution layer of each module to ensure that the empirical mean of the signal passed to the subsequent nonlinearity layer is zero and its empirical variance equals a prescribed target value. This algorithm is very similar to the initialization algorithm of Mishkin and Matas (2016), except that here the precision parameters are pre-initialized randomly, while Mishkin and Matas (2016) pre-initialize weights with orthonormal matrices. Since this algorithm only adjusts the precision parameters of the generalized convolution layers, we call it the precision adjustment initialization algorithm.
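One plausible reading of a single step of this procedure, for one output channel, is sketched below. It is not the algorithm's actual implementation. It assumes that the layer output is a weighted squared L2 distance, which is linear in the precisions, so that one rescaling suffices to reach the target standard deviation; it further assumes that the zero mean is obtained through an additive offset such as the bias of a following AdaptiveLinear layer. The random patches stand in for a batch of real data.

```python
import random
from statistics import mean, pstdev

def wl2dist(x, mu, gamma):
    """Weighted squared L2 distance (assumed form of the layer output)."""
    return sum(g * (a - m) ** 2 for a, m, g in zip(x, mu, gamma))

def adjust_precision(patches, mu, gamma, target_std=1.0):
    """Rescale gamma so the outputs have the target std; return the centering offset too."""
    d = [wl2dist(x, mu, gamma) for x in patches]
    scale = target_std / pstdev(d)        # the distance is linear in gamma
    new_gamma = [g * scale for g in gamma]
    offset = -mean(d) * scale             # assumption: applied by a subsequent bias
    return new_gamma, offset

random.seed(0)
patches = [[random.random() for _ in range(9)] for _ in range(200)]  # stand-in data
mu = [random.random() for _ in range(9)]
gamma = [random.uniform(0.5, 1.5) for _ in range(9)]  # random pre-initialization

new_gamma, offset = adjust_precision(patches, mu, gamma)
out = [wl2dist(x, mu, new_gamma) + offset for x in patches]
print(mean(out), pstdev(out))
```

In a multi-module network the same step would be repeated module by module, feeding each adjusted module's outputs to the next one.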
4.2 Whole-network adjustment initialization algorithm
Glorot and Bengio (2010) proposed that the weights should be initialized in a way that both the activation values of neurons in the forward pass and the gradients backpropagated in the backward pass have appropriate variances. They proposed an analytical algorithm for initializing the weights of a convolutional neural network based on this idea. Recently, Krähenbühl et al (2016) proposed a data-dependent iterative algorithm for attaining this goal and showed that their approach outperforms the analytical approach of Glorot and Bengio (2010), at least in the experiments reported in their paper. We added support for the AdaptiveLinear, WL1Dist, and WL2Dist layers to the implementation of Krähenbühl et al (2016). Since this algorithm adjusts the parameters of the whole network, layer by layer, we call it the whole-network adjustment algorithm.
5 Experiments
In this section, we experimentally evaluate two realizations of GCNNs. To compare GCNNs with ordinary CNNs in the fairest and most informative way, we conduct our experiments on the MNIST dataset (LeCun et al, 1998a), which has been a classical testbed for CNNs since their advent. The MNIST dataset is a collection of 60000 training and 10000 testing samples of handwritten digits. We train the networks with the official training samples, without applying any distortions.
5.1 Experimental setup
The general form of the network architecture considered in this section is depicted in Figure 3. Only boxes with thick border may differ between the experiments. The dotted boxes of the AdaptiveLinear layers imply that they may not be present in some experiments (or are present with slope 1, bias 0, and zero learning rate multipliers). The number of planes of the first and second generalized convolution layers is and , respectively, with a window size of . In our experiments, we set the base learning rate to , the momentum to , the minibatch size to , the maximum number of iterations to (which is equivalent to epochs), and the weight decay coefficient to . The learning rate at ’th iteration is computed by dividing the base learning rate by , where and . We chose a significance level of for determining the statistical significance of the experiments. If not explicitly mentioned, the number of repetitions of each experiment is . To ensure exact reproducibility of the results, we have set the random_seed parameter of each experiment to a deterministic function of the experiment number.
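The learning-rate schedule described above, in which the base rate is divided by a power of a linearly growing term, matches the form of Caffe's `inv` learning-rate policy; that correspondence is an assumption of the sketch below. The numeric values of `base_lr`, `gamma`, and `power` are placeholders for illustration and are not the values used in the experiments. The minibatch size of 100 is inferred from the statement in section 5.2.2 that one epoch corresponds to 600 iterations on 60000 training samples.

```python
def inv_learning_rate(iteration, base_lr, gamma, power):
    """Learning rate at a given iteration: base_lr / (1 + gamma * iteration) ** power."""
    return base_lr / (1.0 + gamma * iteration) ** power

def iterations_per_epoch(num_samples, batch_size):
    """Number of minibatch iterations needed to see every training sample once."""
    return num_samples // batch_size

# Placeholder hyperparameters (illustrative only).
base_lr, gamma, power = 0.01, 1e-4, 0.75

for it in (0, 600, 6000):
    print(it, inv_learning_rate(it, base_lr, gamma, power))

print(iterations_per_epoch(60000, 100))
```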
5.2 Some insights into ordinary convolutional neural networks
In this section, we perform some experiments on ordinary CNNs that will prove to be useful in design and analysis of our experiments on GCNNs. First, we conduct experiments to investigate the role of training of the convolution layers in the accuracy of CNNs. Second, we investigate the role of the negative weights in the accuracy of CNNs.
5.2.1 Investigating the role of convolution layers
In this experiment, we want to identify the role of the convolution layers in the accuracy of CNNs on the MNIST dataset. In the experimental settings described in Section 5.1, an ordinary CNN obtains an accuracy of
on the MNIST dataset. If we confine the weights of the convolution layers to their initial random values (by setting their associated learning rate multipliers to zero), the accuracy declines to
. This shows that only less than of the accuracy of a CNN on the MNIST dataset is due to training of the convolution layers.
5.2.2 Studying matched filter CNNs
In this experiment, our goal is to study the effect of negative weights on the accuracy of CNNs. In the matched filter viewpoint of CNNs, the weights of the convolution layer should represent a cluster of patches of the preceding layer. Since the values of the original image are between and the values of all convolution layers are passed from ReLU, all input maps to the convolution layers are positive. In this experiment, we want to examine the effect of confining the weights of the convolution layers to positive values on the accuracy of CNNs. To minimize the unwanted effects of initialization and optimization issues on this experiment, we force the positivity of the weights by the absolute value operation, so that the magnitudes of the gradients with respect to weights is unchanged. For a network with positive weights, the average activation of neurons would no longer be zero and the initialization algorithm of Xavier(Glorot and Bengio, 2010) is inapplicable. So, we first use the precision adjustment
initialization algorithm to ensure that the outputs of the convolution layers have mean zero and standard deviation
. After training for one epoch (i.e. 600 iterations), we use the whole-network adjustment initialization algorithm to initialize the weights, ensuring that the gradients returned to all layers have appropriate magnitude.
To eliminate the effect of the initialization algorithm, we first trained an ordinary CNN with the whole-network adjustment and obtained an accuracy of . After imposing the positivity constraint on weights, the accuracy significantly reduced to . This indicates that the negative weights play a remarkable role in the high accuracy of CNNs and that the matched filter viewpoint (at least when weights are initialized randomly) cannot fully explain the high accuracy of CNNs.
We conclude from this experiment that in GCNNs we should also allow the template pattern to take negative values. For example, in the case of distance-based generalized convolution operators (such as Gaussian, Laplacian, WL1Dist, and WL2Dist) we allow the mean parameter to take negative values.
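The positivity constraint used in this experiment can be sketched for a single neuron as follows. The layer stores unconstrained weights and uses their absolute values in the forward pass, so the derivative with respect to each stored weight differs from the unconstrained one only by a sign. The function names and the convention of treating the derivative of the absolute value at zero as +1 are illustrative assumptions.

```python
def forward(x, w):
    """Inner product with effective weights |w|, which are always nonnegative."""
    return sum(abs(wi) * xi for wi, xi in zip(w, x))

def grad_wrt_w(x, w, upstream=1.0):
    """Gradient of upstream * forward(x, w) with respect to the stored weights w."""
    # d|w_i|/dw_i = sign(w_i); the value at w_i = 0 is taken as +1 here.
    return [upstream * (1.0 if wi >= 0 else -1.0) * xi for wi, xi in zip(w, x)]

def grad_unconstrained(x, upstream=1.0):
    """Gradient of an ordinary inner product <x, w> with respect to w."""
    return [upstream * xi for xi in x]

x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 2.0]

print(forward(x, w))
print(grad_wrt_w(x, w))
print(grad_unconstrained(x))
```

Since the magnitudes of the two gradients coincide, the step sizes of the optimizer are unaffected by the constraint, which is the property the experiment relies on.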
5.2.3 Effect of initialization method on accuracy of ordinary CNNs
None of the GCNNs considered in this paper can be trained with randomly initialized weights. So, we have to resort to the initialization algorithms of section 4. In this section, we study the suitability of these initialization algorithms for initializing an ordinary CNN on the MNIST dataset. In addition to the initialization algorithms of section 4, we also consider the Xavier algorithm (Glorot and Bengio, 2010), which is the initialization algorithm chosen by the MNIST example in Caffe. (Strictly speaking, although Glorot and Bengio (2010) introduced a new algorithm which considers the backward gradient, the Xavier initialization algorithm in Caffe with default parameters is the one introduced by LeCun et al (1998b) many years earlier.) Table 1 shows the accuracies obtained by different initialization algorithms on ordinary CNNs. As can be seen, our data-dependent precision adjustment algorithm works significantly better than both Xavier (Glorot and Bengio, 2010) and whole-network adjustment (Krähenbühl et al, 2016) in this experiment. So, in section 5.4, when comparing CNNs with GCNNs, we also consider CNNs initialized by the precision adjustment algorithm.
Initialization algorithm  Accuracy
Xavier (Glorot and Bengio, 2010)
Precision adjustment
Whole-network adjustment
5.3 Experimental evaluation of the Sine activation function
In section 3.3 we showed that the use of Sine as a neural activation function can be explained from a kernel methods perspective. In this section, we experimentally compare Sine with other important activation functions such as Sigmoid, TanH, and ReLU. To identify the role of negative values at the output, we also include the rectified sine (ReSine) and rectified hyperbolic tangent (ReTanH) activation functions in our experiments. Table 2 shows the results of these experiments. In all experiments, we have used the Xavier algorithm (Glorot and Bengio, 2010) for initialization of the parameters. As the results show, Sine performs significantly better than Sigmoid and TanH. However, the slightly higher accuracy of Sine in comparison to ReLU is not statistically significant.
Activation  Accuracy
None
Sigmoid
TanH
ReLU
ReTanH
ReSine
Sine
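For reference, the activation functions compared in Table 2 can be written down in a few lines. The definitions of ReSine and ReTanH below are an assumption: "rectified" is read as clipping negative outputs to zero, by analogy with ReLU.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def re_tanh(x):
    """Rectified TanH (assumed definition): max(0, tanh(x))."""
    return max(0.0, math.tanh(x))

def re_sine(x):
    """Rectified Sine (assumed definition): max(0, sin(x))."""
    return max(0.0, math.sin(x))

activations = {
    "None": lambda x: x,
    "Sigmoid": sigmoid,
    "TanH": math.tanh,
    "ReLU": relu,
    "ReTanH": re_tanh,
    "ReSine": re_sine,
    "Sine": math.sin,
}

for name, fn in activations.items():
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Comparing Sine with ReSine, and TanH with ReTanH, isolates the contribution of negative outputs, since each pair agrees wherever the unrectified function is nonnegative.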
5.4 Weighted L1 and L2 distances
Earlier, we showed that WL1Dist/WL2Dist+AdaLin+ReLU modules implement legitimate generalized convolution operators. In this section, we study these modules experimentally. We force the positivity of the precision parameters with the absolute value operation. The mean parameter is initialized with a uniform distribution on a nonnegative range but is allowed to take arbitrary positive/negative values during learning. We experimentally found that the maximum number of iterations chosen for ordinary CNNs is not sufficient for full training of GCNNs in this experiment, and so we increased this number to 36000 iterations. We also repeated our previous experiments with ordinary CNNs with 36000 iterations, which slightly improved the previous results. Each experiment is repeated with values 1 and 10 for the learning rate multipliers (lrmult) of the parameters of the generalized convolution layers. Table 3 shows the accuracies obtained with different initialization algorithms. In each row, accuracies that are significantly higher than the others are boldfaced.
Init. alg.  lrmult  CNN  WL1Dist-GCNN  WL2Dist-GCNN
Xavier  1
Xavier  10
PrecAdj  1
PrecAdj  10
WhlNetAdj  1
WhlNetAdj  10
6 Conclusions
In this paper, we proposed two methods for generalizing the convolution operator in CNNs. The first method is based on substituting the inner product operation within the convolution operator with a monotonically increasing function of a positive definite kernel function. In the second method, we replace the inner product operator with a monotonically increasing function of the negation of a distance function. In this paper, we implemented generalized CNNs (GCNN) based on the cosine kernel and weighted L1/L2 distances, and showed that the resulting networks achieve or even slightly surpass the accuracies of ordinary CNNs on the MNIST dataset. However, we believe that the main merit of this research is that it introduces a generalized conceptual framework that paves the way for the application of sophisticated methods developed in other fields of machine learning at the heart of GCNNs. Some of the machine learning methods that can be potentially used in GCNNs include kernel principal component analysis, multiple kernel learning, infinite kernel learning, metric learning, and similarity learning. In addition, this work sheds more light on the nature of the convolution operator as a central element of CNNs.
7 Future works
In this paper, we introduced the key idea that the convolution operator in CNNs can be generalized by a wide class of kernel/distance functions. We experimentally supported this idea by implementing two generalized convolution operators based on the weighted L1/L2 distance functions and carrying out experiments on the MNIST dataset. In the future, we aim to study and improve the proposed approach in several directions. First, we plan to implement the weighted L1/L2 distance GCNNs on GPU and apply them to more challenging datasets such as CIFAR10 and CIFAR100 (Krizhevsky and Hinton, 2009) and ImageNet (Russakovsky et al, 2015). This would be a difficult task since the successful network models proposed for these datasets are very deep and use other complementary techniques, such as dropout (Srivastava et al, 2014), which are not yet adapted to the proposed generalized framework. Second, we intend to exploit the discovered link between kernel methods and CNNs to apply kernel methods machinery (such as SVM, KPCA, KFDA, MKL, and IKL) to CNNs. Finally, this work can be followed by implementing other possible forms of GCNNs (e.g. those based on polynomial or inverse multiquadric kernels). Our preliminary experiments suggest that almost every generalization of the convolution operator requires its own handling of initialization and optimization algorithms.
Acknowledgments
The author wishes to express appreciation to the Research Deputy of Ferdowsi University of Mashhad for supporting this project by grant No. 2/43037. The author also thanks his colleagues Ahad Harati and Ehsan FazlErsi for their valuable comments.
References
 Chandar et al (2016) Chandar S, Khapra MM, Larochelle H, Ravindran B (2016) Correlational neural networks. Neural computation
 Fukushima (1975) Fukushima K (1975) Cognitron: A self-organizing multilayered neural network. Biological cybernetics 20(3-4):121–136
 Fukushima (1980) Fukushima K (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36(4):193–202
 Fukushima (1988) Fukushima K (1988) Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks 1(2):119–130
 Glorot and Bengio (2010) Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, vol 9, pp 249–256
 He et al (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
 Hornik et al (1989) Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural networks 2(5):359–366
 Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of The 32nd International Conference on Machine Learning, pp 448–456
 Jia et al (2014) Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093
 Krähenbühl et al (2016) Krähenbühl P, Doersch C, Donahue J, Darrell T (2016) Data-dependent initializations of convolutional neural networks. In: International Conference on Learning Representations
 Krizhevsky and Hinton (2009) Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto
 LeCun et al (1989) LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1(4):541–551
 LeCun et al (1998a) LeCun Y, Bottou L, Bengio Y, Haffner P (1998a) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324
 LeCun et al (1998b) LeCun Y, Bottou L, Orr GB, Müller KR (1998b) Efficient backprop. In: Neural Networks: Tricks of the Trade, pp 9–50
 Lin et al (2014) Lin M, Chen Q, Yan S (2014) Network in network. In: International Conference on Learning Representations
 Mairal (2016) Mairal J (2016) End-to-end kernel learning with supervised convolutional kernel networks. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp 1399–1407
 Mairal et al (2014) Mairal J, Koniusz P, Harchaoui Z, Schmid C (2014) Convolutional kernel networks. In: Advances in Neural Information Processing Systems, pp 2627–2635
 Mishkin and Matas (2016) Mishkin D, Matas J (2016) All you need is a good init. In: International Conference on Learning Representations
 Russakovsky et al (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211–252
 Schölkopf and Smola (2002) Schölkopf B, Smola A (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA
 Serre et al (2007) Serre T, Wolf L, Bileschi S, Riesenhuber M, Poggio T (2007) Robust object recognition with cortex-like mechanisms. IEEE transactions on pattern analysis and machine intelligence 29(3):411–426
 Srivastava et al (2014) Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958
 Szegedy et al (2015) Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
 Williams and Hinton (1986) Williams D, Hinton G (1986) Learning representations by back-propagating errors. Nature 323:533–536