The idea of using correlation/convolution operators in neural networks goes back to Fukushima, who proposed the cognitron (Fukushima, 1975) and neocognitron (Fukushima, 1980, 1988) neural networks. In the neocognitron, the input image is matched against a set of patterns that are represented by weights. For each pattern, a map is produced that marks the positions in the input image at which that pattern is present. This operation can be interpreted both as correlation and as convolution, depending on how we assemble the weight vector as an image pattern. In this paper, we adhere to the term convolution since it is well established in this context and since correlational neural networks refer to a completely different network (Chandar et al, 2016). To reduce the sensitivity to the exact positions of the patterns in the input image, Fukushima proposed sub-sampling of the produced maps with max-pooling. By repeating these layers of convolution and pooling (which he termed U-layers and S-layers) it was possible to represent more complex patterns by neurons at deeper layers. It is interesting that the convolution and max-pooling operations, which constitute the backbone of today's convolutional neural networks (CNN), were already present in the very early design of the neocognitron. The neocognitron was trained with an unsupervised learning algorithm that clearly implemented a matched filter.
LeCun et al introduced CNNs, which were essentially neocognitrons trained with the back-propagation algorithm. However, in contrast to the initial goal of implementing matched filters, it was observed that the weights of a trained CNN contained both positive and negative values. With weights taking negative values, the interpretation of the convolution operator in CNNs generalized from the original idea of implementing a matched filter to a new interpretation in which the convolution operator extracts features from the image planes of the previous layer. Whether we accept the matched filter or the feature extraction viewpoint, the convolution operator computes the inner product between image patches and a pattern represented by the weights. We call this the generalized matched filter viewpoint, in which we still view the inner product operator as a similarity measure, even though the pattern may take negative values.
In this paper, we propose two generalizations of the convolution operator. In the first generalization, which is based on the kernel methods, we propose substituting the inner product operator within the convolution operation by a positive definite kernel function. In contrast to kernel methods such as support vector machines (SVM) and kernel principal component analysis (KPCA), here the positive definiteness of the kernel function is not crucial and we show that any monotonically increasing function of a positive definite kernel function can be used as well. The second generalization comes from the fact that the primary goal of including the convolution layer in the neocognitron and CNNs was to detect spots in the input plane that are locally similar to a target pattern. In this view, we propose that the inner product operation within the convolution operator can be replaced by a similarity measure. Specifically, we define a similarity measure as any monotonically increasing function of the negation of a distance function which assigns similarity zero to distance infinity. In this way, numerous similarity measures can be constructed by applying different monotonically increasing nonlinear functions to the negation of a distance metric. Therefore, instead of implementing a full similarity-based convolution layer, we implement a generalized convolution layer with the inner product operation replaced by the desired distance function. We then negate the result and apply an appropriate monotonically increasing function to arrive at a similarity measure. We have implemented our generalized convolution operators as layers within the well-known Caffe(Jia et al, 2014) framework.
Prior to this work, some researchers also proposed networks in which the inner product operation within the convolution operator had been replaced by some other function $g$. We denote the resulting operation as $g$-convolution. Serre et al (2007) proposed the HMax model for object recognition where, in its second scale, the similarity between image patches and stored patterns is measured by a Gaussian function. In other words, the second scale of HMax computes a Gaussian-convolution. However, the HMax model is considerably different from a CNN and is not trained by the back-propagation algorithm. In convolutional kernel networks (CKN), Mairal et al (2014) considered a special kernel function for measuring the similarity between two complete images and showed that its associated feature map can be approximated by a Gaussian-convolution followed by pooling. Assuming that the image patches and the learned patterns are normalized, Mairal et al (2014) showed that the computation of a Gaussian-convolution is equivalent to an ordinary convolution operator followed by a special nonlinearity that resembles the rectified linear unit (ReLU) in the interval $[-1,1]$. With CKNs implemented as an ordinary convolution operator followed by a nonlinearity, Mairal (2016) could train the network by the back-propagation algorithm. Lin et al (2014) were the first who explicitly proposed generalizing the convolution operator in CNNs and training the whole network with back-propagation. They introduced the network in network (NIN) model in which the inner product operation within the convolution operator is replaced with a multilayer perceptron (MLP). At first sight, considering the universal approximation property of multilayer perceptrons (Hornik et al, 1989), one may view NIN as a radical generalization of the convolution operator in which the internal inner product operation is substituted by an arbitrary function. However, it can be shown that MLP-convolution is equivalent to an ordinary convolution followed by several $1 \times 1$ convolutions and nonlinearities. In fact, this feature helped Lin et al (2014) to implement NIN by ordinary CNNs without altering their implementation. In this view, it can be said that NIN did not generalize the convolution operator at all; its contribution was the discovery that CNNs should be much deeper and should decrease the resolution of the convolution planes by pooling at a slower pace. Although the broad idea of generalizing the convolution operator has been present in the above-mentioned works, the specific ideas presented in this paper are novel, and this is the first time that a network with a genuinely generalized convolution operator is trained by the back-propagation algorithm.
The paper proceeds as follows. In section 2, we introduce two classes of generalized convolution operators. Some specific examples of generalized convolution operators are introduced in section 3. We found that simple random initialization of the parameters or applying algorithms like Xavier(Glorot and Bengio, 2010) are not suitable choices for initializing generalized convolutional neural networks (GCNN). Two initialization algorithms that can be used for the initialization of GCNNs are introduced in section 4. We report our experiments on the MNIST dataset in section 5. We conclude the paper in section 6 and mention the future works in section 7.
2 The proposed method
CNN is a deep neural network which consists of different types of layers, including convolution, pooling, nonlinearity, inner product, and loss layers. Usually, a module consisting of a sequence of convolution, nonlinearity, and pooling layers is repeated several times to produce a suitable representation of the input data which is then fed to a fully connected network to estimate the output (for recent generalizations of this block see(Szegedy et al, 2015; He et al, 2016)). The convolution layer computes the convolution between its input planes and several filters represented by the weight parameters and produces a set of output planes, one associated with each filter. However, the convolution is essentially a linear operation. It is well-known that deepening of neural networks with linear activation function does not increase the representational power and these networks are still representing a linear function of the input data. So, to increase the modeling capability of CNNs, the convolution layer is usually followed by a nonlinearity layer. Some examples of common activation functions include rectified linear unit (ReLU), tangent hyperbolic(TanH), and logistic sigmoid (Sigmoid). The goal of the pooling layer is to reduce the sensitivity of the network to translations of the input images and to summarize the important information of the input planes in a more compact representation.
Consider a convolution layer which operates on $C$ input planes and generates $M$ output planes. Assume that $x_{ij}$ is an n-dimensional vector generated by vectorizing patches of input planes centered at position $(i,j)$. Let $w_m$ be the weight vector associated with the $m$'th output plane. Then the output value at position $(i,j)$ of plane $m$ is computed by the formula

$$y^{m}_{ij} = \langle x_{ij}, w_m \rangle. \qquad (1)$$

Our first proposal for generalizing the convolution operator is to replace the inner product operation in Eq. (1) with a positive definite kernel function. Assuming that $k$ is a positive definite kernel function, the output is now computed by

$$y^{m}_{ij} = k(x_{ij}, w_m). \qquad (2)$$
This generalization allows us to use a handful of kernel functions such as Gaussian, polynomial, Laplacian, cosine, Cauchy, and intersection in place of the inner product operation. However, our choice of kernel functions is severely restricted by the positive definiteness requirement. In a kernel method like SVM, the positive definiteness property plays a crucial role and violation of it makes the objective function unbounded from below. In RBF neural networks, the positive definiteness property of kernel functions guarantees that the kernel matrix is invertible and so the optimal weights of radial basis functions exist and are unique. Generally, the positive definiteness property of kernel functions eliminates the possibility that the inner product of a vector with itself becomes negative. Assume that $k$ is an inner product kernel function with feature space $\mathcal{H}$ and feature map $\phi$ (i.e. $k(x,z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}$) which is not positive definite. Since $k$ is not positive definite, there exist inputs $x_1, \dots, x_N$ and coefficients $\alpha_1, \dots, \alpha_N$ such that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, k(x_i, x_j) < 0. \qquad (3)$$

One can easily verify that the expression on the left side of Eq. (3) is equal to the inner product of the vector $\sum_{i=1}^{N} \alpha_i \phi(x_i)$ with itself.
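The verification is a one-line expansion using bilinearity of the inner product. The symbols $\phi$, $\alpha_i$, and $N$ below are notational assumptions chosen to match the description above:

```latex
\begin{aligned}
\Big\langle \sum_{i=1}^{N}\alpha_i\phi(x_i),\ \sum_{j=1}^{N}\alpha_j\phi(x_j)\Big\rangle
&= \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\,\langle\phi(x_i),\phi(x_j)\rangle \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\,k(x_i,x_j) \;<\; 0 .
\end{aligned}
```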
Therefore, the use of non-positive definite kernel functions in GCNNs may lead to patterns which are not similar to themselves, violating the generalized pattern matching viewpoint. However, in GCNNs we are not concerned with all vectors that can be constructed in the feature space associated with a kernel function. Instead, we apply the kernel function directly to two input vectors, and the requirement of similarity of patterns to themselves translates to the condition $k(x,x) \ge 0$ for all input vectors $x$. This requirement is satisfied for any function of the form $f(k(x,z))$, where $k$ is a positive definite kernel function and $f$ is a monotonically increasing function with $f(0) \ge 0$. So, we arrive at our first generalization of the convolution operator.
Generalization 1: The convolution operator in CNNs can be generalized by substituting the inner product operation between a vectorized input $x$ and a weight vector $w$ by $f(k(x,w))$, where $k$ is a positive definite kernel function and $f$ is a monotonically increasing function with $f(0) \ge 0$.
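As an illustration of Generalization 1, the following minimal sketch slides a window over a single input plane and replaces the inner product by a pluggable kernel followed by an optional monotone function. It is not the Caffe implementation used in this paper; the function names, the single-plane restriction, the top-left patch indexing, and the isotropic Gaussian kernel with `gamma=0.1` are all assumptions made for the example.

```python
import math

def extract_patches(image, size):
    """Map each valid top-left position (i, j) to its vectorized size x size patch."""
    rows, cols = len(image), len(image[0])
    return {
        (i, j): [image[i + di][j + dj] for di in range(size) for dj in range(size)]
        for i in range(rows - size + 1)
        for j in range(cols - size + 1)
    }

def inner_product(x, w):
    """Ordinary convolution: plain inner product between patch and weights."""
    return sum(a * b for a, b in zip(x, w))

def gaussian_kernel(x, w, gamma=0.1):
    """Isotropic Gaussian kernel, one positive definite choice for k (hypothetical gamma)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, w)))

def generalized_conv(image, w, size, kernel, f=lambda t: t):
    """Output plane whose entry (i, j) is f(kernel(patch_ij, w))."""
    patches = extract_patches(image, size)
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    return [[f(kernel(patches[(i, j)], w)) for j in range(cols)] for i in range(rows)]

image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]

# With the inner product and the identity function this is an ordinary convolution layer.
ordinary = generalized_conv(image, [1.0, 0.0, 0.0, 1.0], 2, inner_product)
# With a Gaussian kernel the response peaks where the patch equals the pattern.
gaussian = generalized_conv(image, [1.0, 2.0, 4.0, 5.0], 2, gaussian_kernel)
```

Passing a different `kernel` or `f` changes the operator without touching the sliding-window logic, which is the sense in which the inner product is only one admissible choice.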
It is evident that the main purpose of the inner product stage within the convolution operator in neocognitron was to measure the similarity between the patches of the input maps of the preceding layer and a template pattern. One justification for this is that the inner product operator is essentially a similarity measure (see section 1.1 of Schölkopf and Smola, 2002). It may be argued that the inner product operation is not a suitable similarity measure since for example there are vectors which are more similar to a chosen vector than itself. Our second proposal for generalizing the convolution operator is to substitute the inner product operator with a similarity measure. Defining similarity based on a distance measure, we arrive at our second generalization of the convolution operator.
Generalization 2: The convolution operator in CNNs can be generalized by substituting the inner product operation between a vectorized input $x$ and a weight vector $w$ by $f(-d(x,w))$, where $d$ is a distance metric and $f$ is a monotonically increasing function with $\lim_{t \to -\infty} f(t) = 0$. Adding the additional constraint $f(0) = 1$ ensures that $f(-d(x,x)) = 1$, meaning that the similarity of each vector with itself is 1. However, since the dimension of the input space is usually high (for example, in our experiments on the MNIST dataset, we have 12 planes in the first convolution layer which, considering a window of size 5, induces a dimensionality of $12 \times 5 \times 5 = 300$ on the input of the second convolution layer), exact matching almost never happens and even similar items typically have high numerical distances. In addition, since we want to permit the use of non-squashing activation functions such as ReLU, which have experimentally proven to do better than squashing functions such as TanH and Sigmoid, we do not impose the restriction $f(0) = 1$.
In practice, we implemented similarity/kernel functions by using multiple layers in Caffe. These layers include a generalized convolution layer based on a metric distance (e.g. weighted L2 distance), possibly followed by an AdaptiveLinear layer with negative slope (AdaptiveLinear is a simple new kind of layer that we have added to Caffe which implements $y = ax + b$, where the parameters $a$ and $b$ differ between output channels), followed by an activation function layer like exponential (Exp) or ReLU. In this view, activation functions are appropriate monotonically increasing functions that complement the functionality of a distance-based generalized convolution layer such that the whole module implements a generalized convolution operator. For example, the ReLU activation function can be seen as a monotonically increasing function that complements the role of the preceding layers by enforcing the non-negativity criterion of a similarity measure.
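The layer stack described above can be sketched for a single patch and a single output channel as follows. This is a plain-Python illustration rather than the Caffe layers themselves; the function names and the numeric slope and bias values are assumptions, and the weighted L2 distance is taken in its squared form $\sum_i \gamma_i (x_i - w_i)^2$.

```python
import math

def wl2dist(x, w, gamma):
    """Weighted (squared) L2 distance between patch x and pattern w."""
    return sum(g * (a - b) ** 2 for a, b, g in zip(x, w, gamma))

def adaptive_linear(t, slope, bias):
    """Per-channel affine map y = slope * t + bias."""
    return slope * t + bias

def relu(t):
    return max(t, 0.0)

def similarity_module(x, w, gamma, slope, bias, activation):
    """Distance layer, then AdaptiveLinear with negative slope, then a monotone activation."""
    return activation(adaptive_linear(wl2dist(x, w, gamma), slope, bias))

x = [1.0, 2.0]
w = [0.0, 0.0]
gamma = [1.0, 0.5]

# Exp on top of a negated, halved distance gives a Gaussian-shaped similarity in (0, 1].
gauss_sim = similarity_module(x, w, gamma, slope=-0.5, bias=0.0, activation=math.exp)
# ReLU on top of a negated, shifted distance gives a non-negative, unbounded-above similarity.
relu_sim = similarity_module(x, w, gamma, slope=-1.0, bias=2.0, activation=relu)
```

Because the slope is negative and both activations are monotonically non-decreasing, a larger distance never yields a larger similarity.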
3 Examples of generalized convolution operators
3.1 Non-isotropic Gaussian kernel
Non-isotropic Gaussian kernel is defined as

$$k(x,w) = \exp\Big(-\tfrac{1}{2} \sum_{i=1}^{n} \gamma_i (x_i - w_i)^2\Big), \qquad (4)$$

where $\gamma$ is the precision vector which consists of positive values. The use of the Gaussian kernel function in generalized convolution is admissible both because it is a positive definite kernel function and because it can be expressed as the application of the monotonically increasing function $f(t) = \exp(t/2)$ to the negation of the weighted L2 distance (WL2Dist), $\mathrm{WL2Dist}(x,w) = \sum_{i=1}^{n} \gamma_i (x_i - w_i)^2$. In addition to satisfying the required constraint $\lim_{t \to -\infty} f(t) = 0$, the function $f$ has the additional property that $f(0) = 1$, ensuring that it is a similarity measure spanning the range $(0, 1]$.
3.2 Non-isotropic Laplacian kernel
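Non-isotropic Laplacian kernel is defined as

$$k(x,w) = \exp\Big(-\sum_{i=1}^{n} \gamma_i \, |x_i - w_i|\Big),$$

where $\gamma$ is the precision vector which should be positive. Again, the Laplacian kernel function is a positive definite kernel function and can also be expressed as the application of the monotonically increasing function $f(t) = \exp(t)$ to the negation of the weighted L1 distance (WL1Dist), $\mathrm{WL1Dist}(x,w) = \sum_{i=1}^{n} \gamma_i \, |x_i - w_i|$.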
3.3 Cosine kernel: justifying the Sine activation function
It is well known that $k(x,z) = \cos(\langle w, x - z \rangle)$ is a positive definite kernel function and so it can be used to measure the similarity between an input image patch $x$ and a pattern $z$. Since $z$ is a parameter of the kernel, it is fixed and $\langle w, z \rangle$ is equal to some constant $c$. It follows that $k(x,z) = \cos(\langle w, x \rangle - c) = \sin(\langle w, x \rangle + b)$, where $b = \pi/2 - c$. Thus, $\sin(\langle w, x \rangle + b)$ is essentially computing the kernel function $k(x,z)$, where $z$ is an implicit pattern satisfying the equation $\langle w, z \rangle = \pi/2 - b$. Note that in contrast to the ordinary convolution operation where the similarity of $x$ is measured against the vector of weights $w$, here $w$ is solely a parameter of the kernel function and the desired pattern is hidden in the bias parameter $b$. The above line of reasoning works exactly for the cosine activation function. However, since the gradient of the cosine function vanishes at zero, the parameters of a network with cosine activation function would get stuck in their initial values. This is because almost all initialization algorithms initialize the weight vector $w$ and the bias parameter $b$ in a way that $\langle w, x \rangle + b$ is on average zero.
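The identity behind this argument can be checked numerically. The sketch below assumes the cosine kernel has the form $\cos(\langle w, x - z \rangle)$ and that the bias is chosen as $b = \pi/2 - \langle w, z \rangle$; the vectors are arbitrary illustrative values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_kernel(x, z, w):
    """k(x, z) = cos(<w, x - z>), with w acting as the kernel parameter."""
    return math.cos(dot(w, [xi - zi for xi, zi in zip(x, z)]))

def sine_unit(x, w, b):
    """A neuron with Sine activation: sin(<w, x> + b)."""
    return math.sin(dot(w, x) + b)

w = [0.3, -1.2, 0.7]   # kernel parameter (the neuron's weights)
z = [1.0, 0.5, -2.0]   # implicit pattern
x = [0.4, 2.0, 1.5]    # input patch

# The pattern z enters the neuron only through the bias.
b = math.pi / 2 - dot(w, z)

kernel_value = cosine_kernel(x, z, w)
neuron_value = sine_unit(x, w, b)

# Slopes of the two activations at a zero pre-activation:
cos_slope_at_zero = -math.sin(0.0)  # cosine has zero gradient there
sin_slope_at_zero = math.cos(0.0)   # sine has unit gradient there
```

The two slope values at the end show why a zero-mean pre-activation stalls learning with cosine but not with sine.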
In section 5.3 we will experimentally show that the Sine activation function performs similarly to ReLU and significantly better than TanH. One benefit of Sine is that it does not have the saturation problem of TanH and Sigmoid. As illustrated in Figure 1, Sine and TanH have similar shapes in a range around the origin; outside this region, however, TanH saturates while Sine is periodic. One problem with TanH is that if the target value is $\pm 1$, then the weights are pushed towards infinity and the gradient of the TanH function vanishes. To remedy this problem, LeCun et al (1998b) proposed a scaled version of TanH with definition

$$f(x) = 1.7159 \tanh\left(\tfrac{2}{3} x\right).$$

Another benefit of Sine is that it can produce an output value of $\pm 1$ without pushing the weights towards infinity.
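A short numeric comparison makes the point concrete. It assumes the commonly quoted form of the scaled TanH, $1.7159 \tanh(2x/3)$, whose constants are chosen so that the output is approximately $\pm 1$ at $x = \pm 1$.

```python
import math

def scaled_tanh(x):
    """Scaled TanH in the commonly quoted form 1.7159 * tanh(2x / 3)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)

# Plain TanH only approaches 1 asymptotically, so a target of 1 needs an unbounded input.
plain_at_5 = math.tanh(5.0)

# The scaled version reaches a target of about 1 at a finite pre-activation of 1.
scaled_at_1 = scaled_tanh(1.0)

# Sine reaches exactly 1 at the finite pre-activation pi / 2.
sine_at_half_pi = math.sin(math.pi / 2)
```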
3.4 Other similarities based on the weighted L2-distance
We saw in section 3.1 that the Gaussian kernel equals the composition of the monotonically increasing function $f(t) = \exp(t/2)$ and the negation of WL2Dist. In this section, we propose to use other activation functions on top of the WL2Dist. For the case of the Gaussian kernel, we can view Eq. (4) as the unnormalized Gaussian probability density that an input patch $x$ matches pattern $w$. Rewriting Eq. (4) in the usual form of a multivariate Gaussian distribution we have

$$k(x,w) = \exp\Big(-\tfrac{1}{2} (x - w)^{T} \Lambda (x - w)\Big),$$

where $\Lambda$ is a diagonal precision matrix with entries $\Lambda_{ii} = \gamma_i$.

Viewing the Gaussian kernel as a probability density function, we can say that the value returned by this function is proportional to the probability density of input $x$ in a Gaussian distribution with mean $w$ and precision matrix $\Lambda$. The problem with this value is that it cannot directly be used as a measure for deciding whether the input data is similar to the desired pattern or not. In this section, we exploit this probabilistic point of view to arrive at some other activation functions on top of the WL2Dist.
Since, by assumption, the precision matrix of the Gaussian distribution is diagonal, it follows that different dimensions are independent of each other and so the squared weighted L2 distance $d = \sum_{i=1}^{n} \gamma_i (x_i - w_i)^2$ is equal to the sum of squares of $n$ standard normal variables, which is known to have a $\chi^2$ distribution with $n$ degrees of freedom. So, the probability that an input $x$ with distance $d$ belongs to the Gaussian distribution associated with pattern $w$ is equal to the p-value of the $\chi^2$ distribution at point $d$. Using the inverse cumulative distribution, one can determine two thresholds $\tau_1$ and $\tau_2$ such that for inputs with $d \le \tau_1$ the probability that $x$ is generated by this distribution is very high and for inputs with $d \ge \tau_2$ this probability is very low. Therefore, a suitable value for the similarity of input $x$ to pattern $w$ is given by

$$s(x,w) = \begin{cases} 1 & d \le \tau_1 \\ \dfrac{\tau_2 - d}{\tau_2 - \tau_1} & \tau_1 < d < \tau_2 \\ 0 & d \ge \tau_2 \end{cases}$$

If we first apply a linear transformation with the slope $-1/(\tau_2 - \tau_1)$ and the bias $\tau_2/(\tau_2 - \tau_1)$ to the output of the WL2Dist layer, then the desired functionality can be achieved using the DoubleThreshold activation function defined as

$$\mathrm{DoubleThreshold}(t) = \min(\max(t, 0), 1).$$

If we allow similarities greater than $1$, then the upper limit of the DoubleThreshold activation function is dropped and we arrive at the well-known ReLU activation function.
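The construction above can be sketched as follows. The inverse cumulative distribution of the $\chi^2$ law is replaced here by the Wilson-Hilferty approximation so that only the standard library is needed; the probability levels 0.05 and 0.95 used to pick the two thresholds, and all function names, are assumptions for illustration only.

```python
from statistics import NormalDist

def chi2_quantile(p, n):
    """Wilson-Hilferty approximation to the chi-square inverse CDF with n degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * n)
    return n * (1.0 - c + z * c ** 0.5) ** 3

def double_threshold(t):
    """Clip to [0, 1]; dropping the upper clip would give ReLU."""
    return min(max(t, 0.0), 1.0)

def make_similarity(n, p_low=0.05, p_high=0.95):
    """Build the similarity from two chi-square thresholds (hypothetical levels)."""
    tau1 = chi2_quantile(p_low, n)
    tau2 = chi2_quantile(p_high, n)
    slope = -1.0 / (tau2 - tau1)
    bias = tau2 / (tau2 - tau1)

    def similarity(d):
        # Linear transformation of the distance followed by DoubleThreshold.
        return double_threshold(slope * d + bias)

    return tau1, tau2, similarity

# Example with the 300-dimensional input of the second convolution layer.
tau1, tau2, similarity = make_similarity(n=300)
```

Distances below `tau1` map to similarity 1, distances above `tau2` map to 0, and the values in between fall off linearly.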
3.5 Other similarities based on the weighted L1-distance
The discussion of the previous section can be repeated for the Laplacian distribution, resulting in generalized convolution operators based on the WL1Dist.
4 Initialization of parameters
The performance of deep CNNs is strongly influenced by the method of initializing the parameters and by controlling the amount of backward gradient returned to each parameter (Krähenbühl et al, 2016; Mishkin and Matas, 2016; Glorot and Bengio, 2010; Ioffe and Szegedy, 2015). Initialization and optimization algorithms proposed for neural networks are designed based on the linear model of neurons (i.e. $y = \langle w, x \rangle + b$). When the incoming weights are initialized randomly with mean zero, this model ensures that the mean of the output is zero as well. By choosing appropriate values for the magnitudes of weights one can ensure that all neurons have a mean value of zero and a variance of one, avoiding the vanishing/exploding problems in the forward pass (LeCun et al, 1998b). Recently, similar approaches have been devised that control the magnitude of the gradient in the backward pass (Glorot and Bengio, 2010; Ioffe and Szegedy, 2015; Krähenbühl et al, 2016). However, by substituting the inner product operation with a kernel/distance function all of these nice properties fade and the vanishing/exploding problems reappear both in the forward and backward passes. Each kernel/distance function has its own properties that should be considered in initializing its parameters. In this section, we consider two initialization algorithms for networks based on the weighted L1/L2 distances. The specific architecture we are considering is depicted in Figure 2.
4.1 Precision adjustment initialization algorithm
The goal of this initialization method is to ensure that all signals at the forward pass have appropriate magnitudes. The initialization proceeds module by module, adjusting the precision parameters of the generalized convolution layer of each module to ensure that the empirical mean of the signal passed to the subsequent nonlinearity layer is zero and its empirical variance is some target value $\sigma^2$. This algorithm is very similar to the initialization algorithm of Mishkin and Matas (2016) except that here the precision parameters are pre-initialized randomly while Mishkin and Matas (2016) pre-initialize weights with orthonormal matrices. Since this algorithm only adjusts the precision parameters of the generalized convolution layers, we call it the precision adjustment initialization algorithm.
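One possible reading of this procedure, for a single output channel of a WL2Dist layer, is sketched below. It is an assumption-laden illustration, not the algorithm as implemented for this paper: it assumes the signal passed to the nonlinearity is the negated distance plus a bias, uses the fact that the distance is linear in the precisions so a single rescaling fixes the variance, and centers the mean through the bias. All names are hypothetical.

```python
import random
import statistics

def wl2dist(x, w, gamma):
    """Weighted (squared) L2 distance between patch x and pattern w."""
    return sum(g * (a - b) ** 2 for a, b, g in zip(x, w, gamma))

def precision_adjustment(patches, w, gamma, target_var=1.0):
    """Rescale the precisions to hit target_var and return a centering bias."""
    d = [wl2dist(x, w, gamma) for x in patches]
    # Distances scale linearly with gamma, so their variance scales with scale ** 2.
    scale = (target_var / statistics.pvariance(d)) ** 0.5
    new_gamma = [g * scale for g in gamma]
    # The bias cancels the mean of the negated, rescaled distances.
    bias = statistics.fmean(d) * scale
    return new_gamma, bias

random.seed(0)
patches = [[random.random() for _ in range(4)] for _ in range(200)]
w = [0.5, 0.5, 0.5, 0.5]
gamma = [1.0, 2.0, 0.5, 1.5]  # randomly pre-initialized precisions in practice

new_gamma, bias = precision_adjustment(patches, w, gamma, target_var=1.0)
signals = [-wl2dist(x, w, new_gamma) + bias for x in patches]
```

After the adjustment, `signals` has empirical mean zero and empirical variance equal to the target on the data used for initialization.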
4.2 Whole-network adjustment initialization algorithm
Glorot and Bengio (2010) proposed that the weights should be initialized in a way that both the activation values of neurons in the forward pass and the gradient back-propagated in the backward pass have appropriate variances. They proposed an analytical algorithm for initializing the weights of a convolutional neural network based on this idea. Recently, Krähenbühl et al (2016) proposed a data-dependent iterative algorithm for attaining this goal and showed that their approach works superior to the analytical approach of Glorot and Bengio (2010), at least in the experiments reported in their paper. We added the support for AdaptiveLinear, WL1Dist, and WL2Dist layers to the implementation of Krähenbühl et al (2016). Since this algorithm adjusts the parameters of the whole network, layer by layer, we call it the whole-network adjustment algorithm.
5 Experiments
In this section, we experimentally evaluate two realizations of GCNNs. To compare GCNNs with ordinary CNNs in the fairest and most informative way, we conduct our experiments on the MNIST (LeCun et al, 1998a) dataset, which has been a classical testbed for CNNs from their advent until now. The MNIST dataset is a collection of 60000 training and 10000 testing samples of handwritten digits. We train the networks with the official training samples, without applying any distortions.
5.1 Experimental setup
The general form of the network architecture considered in this section is depicted in Figure 3. Only boxes with thick border may differ between the experiments. The dotted boxes of the AdaptiveLinear layers imply that they may not be present in some experiments (or are present with slope 1, bias 0, and zero learning rate multipliers). The number of planes of the first and second generalized convolution layers is and , respectively, with a window size of . In our experiments, we set the base learning rate to , the momentum to , the mini-batch size to , the maximum number of iterations to (which is equivalent to epochs), and the weight decay coefficient to . The learning rate at ’th iteration is computed by dividing the base learning rate by , where and . We chose a significance level of for determining the statistical significance of the experiments. If not explicitly mentioned, the number of repetitions of each experiment is . To ensure exact reproducibility of the results, we have set the random_seed parameter of each experiment to a deterministic function of the experiment number.
5.2 Some insights into ordinary convolutional neural networks
In this section, we perform some experiments on ordinary CNNs that will prove to be useful in design and analysis of our experiments on GCNNs. First, we conduct experiments to investigate the role of training of the convolution layers in the accuracy of CNNs. Second, we investigate the role of the negative weights in the accuracy of CNNs.
5.2.1 Investigating the role of convolution layers
In this experiment, we want to identify the role of the convolution layers in the accuracy of CNNs on the MNIST dataset. In the experimental settings described in Section 5.1, an ordinary CNN obtains an accuracy of
on the MNIST dataset. If we confine the weights of the convolution layers to their initial random values (by setting their associated learning rate multipliers to zero), the accuracy declines to. This shows that only less than of the accuracy of a CNN on the MNIST dataset is due to training of the convolution layers.
5.2.2 Studying matched filter CNNs
In this experiment, our goal is to study the effect of negative weights on the accuracy of CNNs. In the matched filter viewpoint of CNNs, the weights of the convolution layer should represent a cluster of patches of the preceding layer. Since the values of the original image are non-negative and the values of all convolution layers are passed through ReLU, all input maps to the convolution layers are non-negative. In this experiment, we want to examine the effect of confining the weights of the convolution layers to positive values on the accuracy of CNNs. To minimize the unwanted effects of initialization and optimization issues on this experiment, we force the positivity of the weights by the absolute value operation, so that the magnitudes of the gradients with respect to the weights are unchanged. For a network with positive weights, the average activation of neurons would no longer be zero and the initialization algorithm of Xavier (Glorot and Bengio, 2010) is inapplicable. So, we first use the precision adjustment initialization algorithm to ensure that the outputs of the convolution layers have mean zero and the target standard deviation. After training for one epoch (i.e. 600 iterations), we use the whole-network adjustment initialization algorithm to initialize the weights, ensuring that the gradients returned to all layers have appropriate magnitude.
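The effect of enforcing positivity through the absolute value can be seen in a small sketch for a single inner-product unit. The function names are hypothetical; the point is only that, by the chain rule, the gradient with respect to a raw weight differs from the unconstrained gradient by the sign of that weight, so its magnitude is unchanged.

```python
def forward(w_raw, x):
    """Inner product computed with the effective (non-negative) weights |w_raw|."""
    return sum(abs(w) * xi for w, xi in zip(w_raw, x))

def grad_wrt_raw(w_raw, x, grad_out=1.0):
    """Chain rule: d|w|/dw = sign(w), so each gradient entry only changes sign."""
    return [grad_out * xi * (1.0 if w >= 0 else -1.0) for w, xi in zip(w_raw, x)]

w_raw = [-1.0, 2.0]
x = [3.0, 4.0]

y = forward(w_raw, x)        # uses effective weights [1.0, 2.0]
g = grad_wrt_raw(w_raw, x)   # same magnitudes as x, signs follow w_raw
```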
To eliminate the effect of the initialization algorithm, we first trained an ordinary CNN with the whole-network adjustment and obtained an accuracy of . After imposing the positivity constraint on weights, the accuracy significantly reduced to . This shows that the negative weights have a remarkable role in the high accuracy of CNNs and that the matched filter viewpoint (at least when weights are initialized randomly) cannot fully explain the high accuracy of CNNs. We conclude from this experiment that in GCNNs we should also allow the template pattern to take negative values. For example, in the case of distance-based generalized convolution operators (such as Gaussian, Laplacian, WL1Dist, and WL2Dist) we allow the mean parameter to take negative values.
5.2.3 Effect of initialization method on accuracy of ordinary CNNs
None of the GCNNs considered in this paper can be trained with randomly initialized weights. So, we have to resort to the initialization algorithms of section 4. In this section, we want to study the suitability of these initialization algorithms for initializing an ordinary CNN on the MNIST dataset. In addition to the initialization algorithms of section 4, we also consider the Xavier algorithm (Glorot and Bengio, 2010), which is the initialization algorithm chosen by the MNIST example in Caffe (strictly speaking, although Glorot and Bengio (2010) introduced a new algorithm which considers the backward gradient, the Xavier initialization algorithm in Caffe with default parameters is the one introduced many years earlier by LeCun et al (1998b)). Table 1 shows the accuracies obtained by different initialization algorithms on ordinary CNNs. As can be seen, our data-dependent precision adjustment algorithm works significantly better than both Xavier (Glorot and Bengio, 2010) and whole-network adjustment (Krähenbühl et al, 2016) in this experiment. So, in section 5.4, when comparing CNNs with GCNNs, we also consider CNNs initialized by the precision adjustment algorithm.
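Table 1: Accuracies of ordinary CNNs on the MNIST dataset with different initialization algorithms: Xavier (Glorot and Bengio, 2010), precision adjustment, and whole-network adjustment (Krähenbühl et al, 2016).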
5.3 Experimental evaluation of the Sine activation function
In section 3.3 we showed that the use of Sine as a neural activation function can be explained from a kernel methods perspective. In this section, we experimentally compare Sine with other important activation functions such as Sigmoid, TanH, and ReLU. To identify the role of negative values at the output, we also include the rectified sine (ReSine) and rectified hyperbolic tangent (ReTanH) activation functions in our experiments. Table 2 shows the results of these experiments. In all experiments, we have used the Xavier (Glorot and Bengio, 2010) algorithm for initialization of the parameters. As the results show, Sine has performed significantly better than Sigmoid and TanH. However, the slightly higher accuracy of Sine in comparison to ReLU is not statistically significant.
5.4 Weighted L1 and L2 distances
Earlier, we showed that WL1Dist/WL2Dist+AdaLin+ReLU modules implement legitimate generalized convolution operators. In this section, we want to study these modules experimentally. We force the positivity of the precision parameters with the absolute value operation. The mean parameter is initialized with a uniform distribution on a non-negative range but is allowed to take arbitrary positive/negative values during learning. We experimentally found that the maximum number of iterations chosen for ordinary CNNs is not sufficient for full training of GCNNs in this experiment and so increased this number to 36000 iterations. We also repeated our previous experiments with ordinary CNNs with 36000 iterations, which slightly improved the previous results. Each experiment is repeated with two values for the learning rate multipliers (lr-mult) of the parameters of the generalized convolution layers. Table 3 shows the accuracies obtained with different initialization algorithms. In each row, accuracies that are significantly higher than others are boldfaced.
6 Conclusions
In this paper, we proposed two methods for generalizing the convolution operator in CNNs. The first method is based on substituting the inner product operation within the convolution operator with a monotonically increasing function of a positive definite kernel function. In the second method, we replace the inner product operator with a monotonically increasing function of the negation of a distance function. In this paper, we implemented generalized CNNs (GCNN) based on the cosine kernel and weighted L1/L2 distances, and showed that the resulting networks achieve or even slightly surpass the accuracies of ordinary CNNs on the MNIST dataset. However, we believe that the main merit of this research is that it introduces a generalized conceptual framework that paves the way for the application of sophisticated methods developed in other fields of machine learning at the heart of GCNNs. Some of the machine learning methods that can be potentially used in GCNNs include kernel principal component analysis, multiple kernel learning, infinite kernel learning, metric learning, and similarity learning. In addition, this work sheds more light on the nature of the convolution operator as a central element of CNNs.
7 Future works
In this paper, we introduced the key idea that the convolution operator in CNNs can be generalized by a wide class of kernel/distance functions. We experimentally supported this idea by implementing two generalized convolution operators based on the weighted L1/L2 distance functions and carrying out experiments on the MNIST dataset. In the future, we aim to study and improve the proposed approach in several directions. First, we plan to implement the weighted L1/L2 distance GCNNs on GPU and apply them to more challenging datasets like CIFAR10 and CIFAR100 (Krizhevsky and Hinton, 2009), and ImageNet (Russakovsky et al, 2015). This would be a difficult task since the successful network models proposed for these datasets are very deep and use other complementary techniques, such as dropout (Srivastava et al, 2014), which are not yet adapted to the proposed generalized framework. Second, we intend to exploit the discovered link between kernel methods and CNNs to apply kernel methods machinery (such as SVM, KPCA, KFDA, MKL, and IKL) to CNNs. Finally, this work can be followed by implementing other possible forms of GCNNs (e.g. those based on polynomial or inverse multiquadric kernels). Our preliminary experiments suggest that almost every generalization of the convolution operator requires its own handling of initialization and optimization algorithms.
The author wishes to express appreciation to the Research Deputy of Ferdowsi University of Mashhad for supporting this project under grant No. 2/43037. The author also thanks his colleagues Ahad Harati and Ehsan Fazl-Ersi for their valuable comments.
- Chandar et al (2016) Chandar S, Khapra MM, Larochelle H, Ravindran B (2016) Correlational neural networks. Neural computation
- Fukushima (1975) Fukushima K (1975) Cognitron: A self-organizing multilayered neural network. Biological cybernetics 20(3-4):121–136
- Fukushima (1980) Fukushima K (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36(4):193–202
- Fukushima (1988) Fukushima K (1988) Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks 1(2):119–130
- Glorot and Bengio (2010) Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, vol 9, pp 249–256
- He et al (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- Hornik et al (1989) Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural networks 2(5):359–366
- Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of The 32nd International Conference on Machine Learning, pp 448–456
- Jia et al (2014) Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093
- Krähenbühl et al (2016) Krähenbühl P, Doersch C, Donahue J, Darrell T (2016) Data-dependent initializations of convolutional neural networks. In: International Conference on Learning Representations
- Krizhevsky and Hinton (2009) Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto
- LeCun et al (1989) LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1(4):541–551
- LeCun et al (1998a) LeCun Y, Bottou L, Bengio Y, Haffner P (1998a) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324
- LeCun et al (1998b) LeCun Y, Bottou L, Orr GB, Müller KR (1998b) Efficient backprop. In: Neural Networks: Tricks of the Trade, pp 9–50
- Lin et al (2014) Lin M, Chen Q, Yan S (2014) Network in network. In: International Conference on Learning Representations
- Mairal (2016) Mairal J (2016) End-to-end kernel learning with supervised convolutional kernel networks. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp 1399–1407
- Mairal et al (2014) Mairal J, Koniusz P, Harchaoui Z, Schmid C (2014) Convolutional kernel networks. In: Advances in Neural Information Processing Systems, pp 2627–2635
- Mishkin and Matas (2016) Mishkin D, Matas J (2016) All you need is a good init. In: International Conference on Learning Representations
- Russakovsky et al (2015) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211–252
- Schölkopf and Smola (2002) Schölkopf B, Smola A (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA
- Serre et al (2007) Serre T, Wolf L, Bileschi S, Riesenhuber M, Poggio T (2007) Robust object recognition with cortex-like mechanisms. IEEE transactions on pattern analysis and machine intelligence 29(3):411–426
- Srivastava et al (2014) Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958
- Szegedy et al (2015) Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
- Williams and Hinton (1986) Williams D, Hinton G (1986) Learning representations by back-propagating errors. Nature 323:533–536