Implementation of Convolutional Kernel Networks for the paper "Kernel-based Translations of Convolutional Networks"
Convolutional Neural Networks, like most artificial neural networks, are commonly viewed as methods different in essence from kernel-based methods. We provide a systematic translation of Convolutional Neural Networks (ConvNets) into their kernel-based counterparts, Convolutional Kernel Networks (CKNs), and demonstrate that this perception is unfounded both formally and empirically. We show that, given a Convolutional Neural Network, we can design a corresponding Convolutional Kernel Network, easily trainable using a new stochastic gradient algorithm based on an accurate gradient computation, that performs on par with its Convolutional Neural Network counterpart. We present experimental results supporting our claims on landmark ConvNet architectures, comparing each ConvNet to its CKN counterpart over several parameter settings.
For many tasks, convolutional neural networks (ConvNets) are currently the most successful approach to learning a functional mapping from inputs to outputs. For example, this is true for image classification, where they can learn a mapping from an image to a visual object category. The common description of a convolutional neural network decomposes the architecture into layers that implement particular parameterized functions (Goodfellow et al., 2016).
The first generation of ConvNets, which includes the Neocognitron (Fukushima, 1980) and the LeNet series (LeCun, 1988, 1989; LeCun et al., 1989, 1995, 2001), stacks two main types of layers: convolutional layers and pooling layers. These two types of layers were motivated by the Hubel-Wiesel model of human visual perception (Hubel and Wiesel, 1962). A convolutional layer decomposes into several units. Each unit is connected to local patches in the feature maps of the previous layer through a set of weights. A pooling layer computes a local statistic of a patch of units in one feature map.
This operational description of ConvNets contrasts with the mathematical description of kernel-based methods. Kernel-based methods, such as support vector machines, were at one point the most popular array of approaches to learning functional mappings from input examples to output labels (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008). Kernels are positive-definite pairwise similarity measures that allow one to design and learn such mappings by defining them as linear functionals in a Hilbert space. Owing to the so-called reproducing property of kernels, these linear functionals can be learned from data.
This apparent antagonism between the two families of approaches is, however, misleading and somewhat unproductive. We argue and demonstrate that, in fact, any convolutional neural network can potentially be translated into a convolutional kernel network, a kernel-based method with an appropriate hierarchical compositional kernel. Indeed, the operational description of a ConvNet can be seen as the description of a data-dependent approximation of an appropriate kernel map.
The kernel viewpoint brings important insights. Despite the widespread use of ConvNets, relatively little is understood about them. We would, in general, like to be able to address questions such as the following: What kinds of activation functions should be used? How many filters should there be at each layer? Why should we use spatial pooling? Through CKNs we can begin to understand the answers to these questions. Activation functions used in ConvNets are used to approximate a kernel, i.e., a similarity measure, between patches. The number of filters determines the quality of the approximation. Moreover, spatial pooling may be viewed as approximately taking into account the distance between patches when measuring the similarity between images.
We lay out a systematic translation framework between ConvNets and CKNs and put it to practice with three landmark architectures on two problems. The three ConvNet architectures, LeNet-1, LeNet-5, and All-CNN-C, correspond to milestones in the development of ConvNets. We consider digit classification with LeNet-1 and LeNet-5 on MNIST (LeCun et al., 1995, 2001) and image classification with All-CNN-C (Springenberg et al., 2015) on CIFAR-10 (Krizhevsky and Hinton, 2009). We present an efficient algorithm to train a convolutional kernel network end-to-end, based on a first stage of unsupervised training and a second stage of gradient-based supervised training.
To our knowledge, this work presents the first systematic experimental comparison on an equal standing of the two approaches on real-world datasets. By equal standing we mean that the two architectures compared are analogous from a functional viewpoint and are trained similarly from an algorithmic viewpoint. This is also the first time kernel-based counterparts of convolutional nets are shown to perform on par with convolutional neural nets on several real datasets over a wide range of settings.
In summary, we make the following contributions:
Translating the LeNet-1, LeNet-5, and All-CNN-C ConvNet architectures into their Convolutional Kernel Net counterparts;
Establishing a general gradient formula and algorithm to train a Convolutional Kernel Net in a supervised manner;
Demonstrating that Convolutional Kernel Nets can achieve comparable performance to their ConvNet counterparts.
The CKN code for this project is publicly available in the software package yesweckn at https://github.com/cjones6/yesweckn.
This paper builds upon two interwoven threads of research related to kernels: the connections between kernel-based methods and (convolutional) neural networks and the use of compositions of kernels for the design of feature representations.
The first thread dates back to Neal (1996), who showed that an infinite-dimensional single-layer neural network is equivalent to a Gaussian process. Building on this, Williams (1996) derived the corresponding covariance functions for two specific activation functions. Later, Cho and Saul (2009) proposed what they termed the arc-cosine kernels, showing that they are equivalent to infinite-dimensional neural networks with specific activation functions (such as the ReLU) when the weights of the neural networks are independent and unit Gaussian. Moreover, they composed these kernels and obtained higher classification accuracy on a dataset like MNIST. However, all of these works dealt with neural networks, not convolutional neural networks.
More recently, several works used kernel-based methods or approximations thereof as substitutes to fully-connected layers or other parts of convolutional neural networks to build hybrid ConvNet architectures (Bruna and Mallat, 2013; Huang et al., 2014; Dai et al., 2014; Yang et al., 2015; Oyallon et al., 2017, 2018). In contrast to these works, we are interested in purely kernel-based networks.
The second thread began with Schölkopf and Smola (2002, Section 13.3.1), who proposed a kernel over image patches for image classification. The kernel took into account the inner product of the pixels at every pair of pixel locations within a patch, in addition to the distance between the pixels. This served as a precursor to later work that composed kernels. A related idea, introduced by Bouvrie et al. (2009), entailed having a hierarchy of kernels. These kernels were defined to be normalized inner products of “neural response functions”, which were derived by pooling the values of a kernel at the previous layer over a particular region of an image. In a similar vein, Bo et al. (2011) first defined a kernel over sets of patches from two images based on the sum over all pairs of patches within the sets. Within the summation is a weighted product of a kernel over the patches and a kernel over the patch locations. They then proposed using this in a hierarchical manner, approximating the kernel at each layer by projecting onto a subspace.
Multi-layer convolutional kernels were introduced in Mairal et al. (2014). In Mairal et al. (2014) and Paulin et al. (2017), kernel-based methods using such kernels were shown to achieve competitive performance for particular parameter settings on image classification and image retrieval tasks, respectively. The kernels considered were however different from the kernel counterparts of the ConvNets they competed with. Building upon this work, Mairal (2016) proposed an end-to-end training algorithm for convolutional kernel networks. Each of these aforementioned works relied on an approximation to the kernels based on either optimization or the Nyström method (Williams and Seeger, 2000; Bo and Sminchisescu, 2009). Alternatively, kernel approximations using random Fourier features (Rahimi and Recht, 2007) or variants thereof could also approximate such kernels for a variety of activation functions (Daniely et al., 2016, 2017), although at a slower rate. Finally, Bietti and Mairal (2019) studied the invariance properties of convolutional kernel networks from a theoretical function space point of view.
Building off of Mairal et al. (2014); Daniely et al. (2016, 2017); Bietti and Mairal (2019) and Paulin et al. (2017), we put to practice the translation of a convolutional neural network into its convolutional kernel network counterpart for several landmark architectures. Translating a convolutional neural network to its convolutional kernel network counterpart requires a careful examination of the details of the architecture, going beyond broad strokes simplifications made in previous works. When each translation is carefully performed, the resulting convolutional kernel network can compete with the convolutional neural network. We provide all the details of our translations in Appendix C. To effectively train convolutional kernel networks, we present a rigorous derivation of a general formula of the gradient of the objective with respect to the parameters. Finally, we propose a new stochastic gradient algorithm using an accurate gradient computation method to train convolutional kernel networks in a supervised manner. As a result, we demonstrate that convolutional kernel networks perform on par with their convolutional neural network counterparts.
To refresh the reader on convolutional kernel networks (CKNs), we begin this section by providing a novel informal viewpoint. We then proceed to describe the correspondences between CKNs and ConvNets. Using these correspondences, we then demonstrate how to translate an example ConvNet architecture, LeNet-5 (LeCun et al., 2001), to a CKN.
Convolutional kernel networks provide a natural means of generating features for structured data such as signals and images. Consider the aim of generating features for images such that “similar” images have “similar” features. CKNs approach this problem by developing a similarity measure between images.
Let $K$ be a similarity measure between images $F$ and $F'$, and suppose $K$ can be written as an inner product
$$K(F, F') = \langle \Phi(F), \Phi(F') \rangle_{\mathcal{H}_K}$$
in a space $\mathcal{H}_K$ for some map $\Phi$. Then we could take $\Phi(F)$ to be the feature representation for image $F$, and similarly for image $F'$. The question therefore is how to choose $K$. A similarity measure that is applied pixel-wise will be ineffective. This is because there is no reason why we would expect two “similar” images to be similar pixel-wise. Therefore, we consider similarity measures applied to patches. Here a patch consists of the pixel values at a (typically contiguous) subset of pixel locations. Statistics on patches more closely align with how humans perceive images. Moreover, treating a patch as a unit takes into account the fact that images are often not locally rotationally invariant.
Let $\{F_i\}_{i=1}^m$ and $\{F'_i\}_{i=1}^m$ be patches from images $F$ and $F'$, respectively, where for all $i$, $F_i$ and $F'_i$ are from the same positions. Then, given a similarity measure $k$ on the patches, we could choose the similarity measure $K$ on the images to be given by
$$K(F, F') = \sum_{i=1}^m \sum_{j=1}^m w_{ij}\, k(F_i, F'_j)$$
for some weights $w_{ij} \geq 0$. The overall similarity measure accounts for the fact that images that are similar will not necessarily have patches in the same locations by comparing all pairs of patches between the two images. The weighting accounts for the fact that while similar patches may not occur in the same location, we would not expect them to be too far apart.
Convolutional kernel networks build such similarity measures by using a positive definite kernel as the similarity measure between patches. A positive definite kernel implicitly maps the patches to an infinite-dimensional space (a reproducing kernel Hilbert space (RKHS)) and computes their inner product in this space, i.e., $k(F_i, F'_j) = \langle \phi(F_i), \phi(F'_j) \rangle_{\mathcal{H}_k}$, where $\phi$ is the mapping from patches to $\mathcal{H}_k$ induced by $k$. As long as $k$ is a kernel, $K$ is also a kernel and can therefore be written as $K(F, F') = \langle \Phi(F), \Phi(F') \rangle_{\mathcal{H}_K}$, where $\mathcal{H}_K$ is the RKHS associated with kernel $K$ and $\Phi$ is the mapping from images to $\mathcal{H}_K$ induced by $K$.
If two images are similar, we would expect their patches to be similar for several patch sizes. Multi-layer CKNs incorporate this via a hierarchy of kernels. A primary aim of using a hierarchical structure is to gain invariance in the feature representation of the images. Specifically, we seek to gain translation invariance and invariance to small deformations. We will now describe such a hierarchical structure.
Let $\phi_1$ be the canonical feature map of a kernel $k_1$ defined on patches of a given size, i.e.,
$$k_1(F_i, F'_j) = \langle \phi_1(F_i), \phi_1(F'_j) \rangle_{\mathcal{H}_1},$$
where $\mathcal{H}_1$ is the RKHS associated with kernel $k_1$. Then $\phi_1(F_i)$ provides a feature representation of the patch $F_i$. Applying $\phi_1$ to each patch of the given size in the image, we obtain a new representation of the image (see Figure 1).
If two images are similar, we would expect them to also be similar when comparing their representations obtained by applying $\phi_1$. Therefore, we may apply the same logic as before. Let $k_2$ be a kernel on patches in the new space. Applying its canonical feature map $\phi_2$, we obtain another representation of the image. The features at each spatial location in this representation are derived from a larger portion of the original image than those in the previous representation (previous layer). Specifically, denoting by $F_{\ell-1,i}$ and $F'_{\ell-1,j}$ image patches from image representations $F_{\ell-1}$ and $F'_{\ell-1}$ at layer $\ell-1$, we apply the canonical feature map $\phi_\ell$ of the kernel $k_\ell$ given by
$$k_\ell(F_{\ell-1,i}, F'_{\ell-1,j}) = \langle \phi_\ell(F_{\ell-1,i}), \phi_\ell(F'_{\ell-1,j}) \rangle_{\mathcal{H}_\ell}$$
to each patch in $F_{\ell-1}$ and $F'_{\ell-1}$.
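One way to increase the invariance is to include an averaging (i.e., pooling) step after applying each feature map $\phi_\ell$. Specifically, denote by $F_\ell(i,j)$ the feature representation of the image at spatial location $(i,j)$ after applying feature map $\phi_\ell$. Letting $\mathcal{N}_{i,j}$ denote a spatial neighborhood of the point $(i,j)$, we compute
$$\tilde{F}_\ell(i,j) = \sum_{(i',j') \in \mathcal{N}_{i,j}} w_{i',j'}\, F_\ell(i',j')$$
for all $(i,j)$, where $w_{i',j'}$ are pre-specified weights. For example, for average pooling, $w_{i',j'} = 1/|\mathcal{N}_{i,j}|$.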
After pooling we often subsample the locations in the image for computational purposes. Subsampling by an integer factor of entails retaining every th feature representation in each row and then every th feature representation in each column. By subsampling after pooling we aim to remove redundant features. Building layers in the above manner by applying feature maps , pooling, and subsampling, we obtain a convolutional kernel network.
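The pooling and subsampling steps just described can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes finite-dimensional features stored as a `(channels, height, width)` array, uniform (average) pooling weights, and stride-1 pooling windows; the function names `average_pool` and `subsample` are hypothetical.

```python
import numpy as np


def average_pool(features, size):
    """Average-pool every contiguous size x size window (stride 1).

    features has shape (channels, height, width); the result has shape
    (channels, height - size + 1, width - size + 1).
    """
    channels, height, width = features.shape
    out_h, out_w = height - size + 1, width - size + 1
    pooled = np.zeros((channels, out_h, out_w))
    # Sum the size * size shifted views, then divide (uniform weights).
    for di in range(size):
        for dj in range(size):
            pooled += features[:, di:di + out_h, dj:dj + out_w]
    return pooled / (size * size)


def subsample(features, factor):
    """Keep every factor-th location along rows and along columns."""
    return features[:, ::factor, ::factor]


features = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
pooled = average_pool(features, size=2)    # shape (2, 5, 5)
reduced = subsample(pooled, factor=2)      # shape (2, 3, 3)
```

Pooling over all contiguous windows first and subsampling afterwards mirrors the order used in the text: the subsampling step only discards locations whose features are largely redundant after averaging.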
Figure 1 depicts an example CKN. In the figure an initial RGB image of size (represented by the bottom rhombus in the figure) gets transformed by applying feature map $\phi_1$ to patches of size . As $\phi_1$ is applied with stride 1 (i.e., it is applied to every possible contiguous patch in the image), this results in a new feature representation (second rhombus) with spatial dimensions . Atop each spatial location sits an infinite-dimensional vector.
At the second layer pooling is applied to the infinite-dimensional vectors, followed by subsampling by a factor of 2. The pooling is performed on all contiguous patches, which initially decreases the spatial dimensions to . Subsampling by a factor of two entails removing the features on top of every other spatial location, yielding an output with spatial dimensions . The output of this layer results in the next representation (third rhombus).
Finally, the figure depicts the application of another feature map to patches of size to obtain the feature representation at the final layer. As the stride is , this results in an output with spatial dimensions .
Given such a network, we may then compute the similarity of two images by concatenating each image’s feature representation at the final layer and then applying a linear kernel. While there are only two feature maps , , depicted in this figure, the process could continue for many more layers.
While computing the overall kernel exactly is theoretically possible (assuming that the kernels at each layer only depend on inner products of features at previous layers), it is computationally unwieldy for even moderately-sized networks. To overcome the computational difficulties, we approximate the kernel $k_\ell$ at each layer $\ell$ by finding a map $\psi_\ell$ such that $k_\ell(x, x') \approx \langle \psi_\ell(x), \psi_\ell(x') \rangle_{\mathbb{R}^{p_\ell}}$ for some positive integer $p_\ell$. The $\psi_\ell$’s then replace the $\phi_\ell$’s at each layer, thereby providing feature representations of size $p_\ell$ of the patches at each layer $\ell$. There are many ways to choose $\psi_\ell$, including directly optimizing an approximation to the kernel, using random features, and projecting onto a subspace.
We consider here the approximation resulting from the projection onto a subspace spanned by “filters”, usually referred to as the Nyström method (Williams and Seeger, 2000; Bo and Sminchisescu, 2009; Mairal, 2016). These filters may be initialized at random by sampling patches from the images. We shall show in Sections 4.1-4.2 how to differentiate through this approximation and learn the filters from data in a supervised manner.
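Consider a dot product kernel $k$ with corresponding RKHS $\mathcal{H}$ and canonical feature map $\phi$. Furthermore, let $w_1, \dots, w_p \in \mathbb{R}^d$ be a set of filters. Given a patch $x \in \mathbb{R}^d$ of size $d$, the Nyström approximation projects $\phi(x)$ onto the subspace spanned by $\phi(w_1), \dots, \phi(w_p)$ in $\mathcal{H}$ by solving the kernel least squares problem
$$\alpha^\star \in \arg\min_{\alpha \in \mathbb{R}^p} \Big\| \phi(x) - \sum_{i=1}^p \alpha_i \phi(w_i) \Big\|_{\mathcal{H}}^2.$$
Defining $W = [w_1, \dots, w_p]^\top \in \mathbb{R}^{p \times d}$ and assuming that $k$ is a dot product kernel, this results in the coefficients $\alpha^\star = k(WW^\top)^{-1} k(Wx)$, where the kernel $k$ is understood to be applied element-wise. (A dot product kernel is a kernel of the form $k(x, y) = f(\langle x, y \rangle)$ for a function $f: \mathbb{R} \to \mathbb{R}$. For notational convenience for a dot product kernel we will write $k(t)$ rather than $f(t)$ where $t = \langle x, y \rangle$. For a matrix $A$ the element-wise application of $k$ to $A$ results in $k(A) = [k(A_{ij})]_{i,j}$.) Therefore, for two patches $x$ and $x'$ with corresponding optimal coefficients $\alpha^\star$ and $\alpha'^\star$, we have
$$\Big\langle \sum_{i=1}^p \alpha^\star_i \phi(w_i), \sum_{i=1}^p \alpha'^\star_i \phi(w_i) \Big\rangle_{\mathcal{H}} = \alpha^{\star\top} k(WW^\top)\, \alpha'^\star = \big\langle k(WW^\top)^{-1/2} k(Wx),\; k(WW^\top)^{-1/2} k(Wx') \big\rangle.$$
Hence, a finite-dimensional approximate feature representation of $x$ is given by
$$\psi(x) = k(WW^\top)^{-1/2} k(Wx) \in \mathbb{R}^p.$$
We will add a regularization term involving a small value $\epsilon > 0$, as $k(WW^\top)$ may be poorly conditioned.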
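As an illustration of the Nyström projection, the following NumPy sketch computes approximate features of the standard form $\psi(x) = (k(WW^\top) + \epsilon I)^{-1/2} k(Wx)$. It is a sketch under stated assumptions, not the code from the paper's package: the kernel is taken to be the RBF kernel on the sphere written as a dot product kernel, patches and filters are assumed to have unit norm, and the names `rbf_on_sphere` and `nystrom_features` as well as the values of `sigma` and `eps` are hypothetical.

```python
import numpy as np


def rbf_on_sphere(u, sigma=0.6):
    """Dot product kernel k(u) = exp((u - 1) / sigma^2), applied element-wise.

    For unit-norm x, y this equals exp(-||x - y||^2 / (2 sigma^2)).
    """
    return np.exp((u - 1.0) / sigma**2)


def nystrom_features(X, W, kernel=rbf_on_sphere, eps=1e-8):
    """Approximate feature map psi(x) = (k(W W^T) + eps I)^{-1/2} k(W x).

    X: (n, d) array of unit-norm patches, one per row.
    W: (p, d) array of unit-norm filters, one per row.
    Returns an (n, p) array whose rows are the features psi(x).
    """
    gram = kernel(W @ W.T) + eps * np.eye(W.shape[0])
    evals, evecs = np.linalg.eigh(gram)            # gram is symmetric PD
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T  # gram^{-1/2}
    return kernel(X @ W.T) @ inv_sqrt


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
W /= np.linalg.norm(W, axis=1, keepdims=True)
X = rng.standard_normal((5, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)

psi_W = nystrom_features(W, W)
psi_X = nystrom_features(X, W)
# On the filters themselves the projection reproduces the kernel (up to eps).
err_on_filters = np.abs(psi_W @ psi_W.T - rbf_on_sphere(W @ W.T)).max()
```

Because $\psi(x)$ is the coordinate vector of an orthogonal projection in the RKHS, its squared norm cannot exceed $k(x, x)$, which gives a quick sanity check on any implementation.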
Denote the input features to layer $\ell$ by $F_{\ell-1}$, where the rows index the features and the columns index the spatial locations (which are flattened into one dimension). Let $E_\ell$ be a function that extracts patches from $F_{\ell-1}$. We then write the features output by the Nyström method as
$$\Psi_\ell(F_{\ell-1}, W_\ell) = \big(k(W_\ell W_\ell^\top) + \epsilon I\big)^{-1/2} k\big(W_\ell E_\ell(F_{\ell-1})\big).$$
Here $\Psi_\ell$ denotes the function that applies the approximate feature map $\psi_\ell$ as derived above to the features at each spatial location. It is important to note that the number of filters $p_\ell$ controls the quality of the approximation at layer $\ell$. Moreover, such a procedure results in the term $W_\ell E_\ell(F_{\ell-1})$, in which the filters are convolved with the images. This convolution is followed by a non-linearity computed using the kernel $k$, resulting in the application of $\psi_\ell$.
The core hyperparameter in CKNs is the choice of kernel. For simplicity of the exposition we assume that the same kernel is used at each layer. Traditionally, CKNs use normalized kernels of the form
$$k(x, x') = \|x\| \, \|x'\| \; \tilde{k}\!\left( \left\langle \frac{x}{\|x\|}, \frac{x'}{\|x'\|} \right\rangle \right),$$
where $\tilde{k}$ is a dot product kernel on the sphere. Examples of such kernels include the arc-cosine kernel of order 0 and the RBF kernel on the sphere. Here we allow for this formulation. Using dot product kernels on the sphere allows us to restrict the filters to lying on the sphere. Doing so adds a projection step in the optimization.
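Let $F_0$ be an input image. Denote by $W_\ell$ the filters at layer $\ell$, $E_\ell$ the function that extracts patches from $F_{\ell-1}$ at layer $\ell$, and $N_\ell$ the function normalizing the patches of $F_{\ell-1}$ at layer $\ell$, with $N_\ell(F_{\ell-1})$ understood as the diagonal matrix of patch norms. Furthermore, let $P_\ell$ be the pooling and subsampling operator, represented by a matrix. (See Appendix D for precise definitions.) Then the representation at the next layer given by extracting patches, normalizing them, projecting onto a subspace, re-multiplying by the norms of the patches, pooling, and subsampling is given by
$$F_\ell = \big(k(W_\ell W_\ell^\top) + \epsilon I\big)^{-1/2}\, k\big(W_\ell E_\ell(F_{\ell-1}) N_\ell(F_{\ell-1})^{-1}\big)\, N_\ell(F_{\ell-1})\, P_\ell.$$
After $L$ such compositions we obtain a final representation of the image that can be used for a classification task.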
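The sequence of operations in one layer (extract patches, normalize, project, re-multiply by the norms, pool, subsample) can be sketched end to end as follows. This is an illustrative sketch only: it assumes the RBF kernel on the sphere, unit-norm filters, average pooling with stride 1, and a `(channels, height, width)` layout, and the names `extract_patches` and `ckn_layer` are hypothetical rather than taken from the paper's code.

```python
import numpy as np


def extract_patches(F, size):
    """Return all contiguous size x size patches of F (stride 1), flattened.

    F has shape (channels, height, width). The output has one patch per row,
    ordered row-major over spatial locations.
    """
    _, h, w = F.shape
    out_h, out_w = h - size + 1, w - size + 1
    rows = [F[:, i:i + size, j:j + size].ravel()
            for i in range(out_h) for j in range(out_w)]
    return np.stack(rows), (out_h, out_w)


def ckn_layer(F, W, patch_size, pool_size, factor, sigma=0.6, eps=1e-6):
    """One CKN layer with Nystrom features, average pooling, subsampling."""
    kernel = lambda u: np.exp((u - 1.0) / sigma**2)  # RBF kernel on the sphere
    patches, (out_h, out_w) = extract_patches(F, patch_size)
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit_patches = patches / np.maximum(norms, 1e-12)  # normalize patches
    gram = kernel(W @ W.T) + eps * np.eye(W.shape[0])
    evals, evecs = np.linalg.eigh(gram)
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    # Project onto the filter subspace, then re-multiply by the patch norms.
    feats = norms * (kernel(unit_patches @ W.T) @ inv_sqrt)
    maps = feats.T.reshape(W.shape[0], out_h, out_w)
    # Average pooling over all contiguous windows, then subsampling.
    ph, pw = out_h - pool_size + 1, out_w - pool_size + 1
    pooled = sum(maps[:, a:a + ph, b:b + pw]
                 for a in range(pool_size) for b in range(pool_size))
    pooled = pooled / pool_size**2
    return pooled[:, ::factor, ::factor]


rng = np.random.default_rng(0)
F0 = rng.standard_normal((3, 8, 8))
W1 = rng.standard_normal((5, 3 * 3 * 3))
W1 /= np.linalg.norm(W1, axis=1, keepdims=True)
F1 = ckn_layer(F0, W1, patch_size=3, pool_size=2, factor=2)
```

Since the kernel is evaluated on normalized patches and the result is rescaled by the patch norms, the layer is positively homogeneous: doubling the input doubles the output.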
Table 1: Correspondences between ConvNets and CKNs. (*) denotes an inexact correspondence.

| ConvNet component | CKN component |
|---|---|
| Convolutional layer | Projection onto the same subspace for all patch locations |
| Partially connected layer | Projection onto a different subspace for each region |
| Fully connected layer | Projection onto a subspace for the entire image representation |
| Convolution + no nonlinearity | Applying feature map of linear kernel |
| Convolution + tanh nonlinearity | Applying feature map of arc-cosine kernel of order 0 (*) |
| Convolution + ReLU nonlinearity | Applying feature map of arc-cosine kernel of order 1 |
| Average pooling | Averaging of feature maps |
| Local response normalization | Dividing patches by their norm (*) |
CKNs may be viewed as infinite-dimensional analogues of ConvNets. Table 1 lists a set of transformations between ConvNets and CKNs. These are discussed below in more detail. For the remainder of this section we let $F$ denote the feature representation of an image in a ConvNet. For clarity of exposition we represent it as a 3D tensor rather than a 2D matrix as for CKNs above. Here the first dimension indexes the features while the second and third dimensions index the spatial location. We denote the element of $F$ in feature map $z$ at spatial location $(i,j)$ by $F_{z,i,j}$.
The main component of ConvNets is the convolution of patches with filters, followed by a pointwise nonlinearity. More precisely, denote the filters by $W$, a patch from $F$ by $x$, and a nonlinearity by $a$. A ConvNet computes $a(Wx)$ for every patch $x$ in $F$, where $a$ is understood to be applied element-wise. This can be seen as an approximation of a kernel, as stated in the following proposition (see Daniely et al. (2016) for more details).
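Consider a measurable space $\Omega$ with probability measure $\mu$ and an activation function $a: \Omega \times \mathcal{X} \to \mathbb{R}$ such that $a(\cdot, x)$ is square integrable with respect to $\mu$ for any $x \in \mathcal{X}$. Then the pair $(\mu, a)$ defines a kernel on patches $x, x' \in \mathcal{X}$ as the dot product of the functions $a(\cdot, x)$ and $a(\cdot, x')$ on the measurable space $\Omega$ with probability measure $\mu$, i.e.,
$$k(x, x') = \int_{\Omega} a(w, x)\, a(w, x')\, d\mu(w).$$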
Hence, the convolution and pointwise nonlinearity in ConvNets with random weights approximate a kernel on patches. This approximation converges to the true value of the kernel as the number of filters goes to infinity. The downside to using such a random feature approximation is that it produces less concise approximations of kernels than e.g., the Nyström method. In order to assess whether trained CKNs perform similarly regardless of the approximation, we approximate CKNs using the Nyström method.
Several results have been proven relating specific activation functions to their corresponding kernels. For example, the ReLU corresponds to the arc-cosine kernel of order 1 (Cho and Saul, 2009) and the identity map corresponds to the linear kernel. The tanh nonlinearity may be approximated by a step function, and a step function corresponds to the arc-cosine kernel of order 0 (Cho and Saul, 2009).
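The ReLU correspondence can be checked numerically. The sketch below, which is an illustration and not part of the paper's experiments, draws filters with independent standard Gaussian entries, forms the random features $\sqrt{2/p}\,\max(Wx, 0)$, and compares their inner product with the closed-form arc-cosine kernel of order 1 from Cho and Saul (2009), $k_1(x, y) = \frac{1}{\pi}\|x\|\|y\|(\sin\theta + (\pi - \theta)\cos\theta)$. The function names and the choices of `d` and `p` are assumptions made for the example.

```python
import numpy as np


def arccos_kernel_order1(x, y):
    """Closed-form arc-cosine kernel of order 1 (Cho and Saul, 2009)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_theta = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_theta) / np.pi


def random_relu_features(x, W):
    """Random features sqrt(2 / p) * relu(W x) for p Gaussian filters."""
    return np.sqrt(2.0 / W.shape[0]) * np.maximum(W @ x, 0.0)


rng = np.random.default_rng(0)
d, p = 5, 200_000
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y = rng.standard_normal(d)
y /= np.linalg.norm(y)
W = rng.standard_normal((p, d))  # independent unit Gaussian weights

approx = random_relu_features(x, W) @ random_relu_features(y, W)
exact = arccos_kernel_order1(x, y)
```

The Monte Carlo error shrinks only at the rate $1/\sqrt{p}$, which illustrates why random features tend to need many more filters than a Nyström approximation to reach a given accuracy.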
ConvNets may have several types of layers, including convolutional, partially-connected, and fully connected layers. Each layer is parameterized by filters. Convolutional layers define patches and apply the same set of filters to each patch. On the other hand, partially-connected layers in ConvNets define patches and apply filters that differ across image regions to the patches. Finally, fully connected layers in ConvNets are equivalent to convolutional layers where the size of the patch is the size of the image.
As in ConvNets, CKNs may have convolutional, partially-connected, and fully connected layers. Recall from Section 3.1 that CKNs project onto a subspace at each layer and that the subspace is defined by a set of filters. At convolutional layers in CKNs the projection is performed onto the same subspace for every patch location. On the other hand, for partially connected layers for CKNs we project onto a different subspace for each image region. Finally, for fully connected layers CKNs project onto a subspace defined by filters that are the size of the feature representation of an entire image.
Pooling in ConvNets can take many forms, including average pooling. In each case, one defines spatial neighborhoods within the dimensions of the current feature representation (e.g., all blocks of a given size). Within each neighborhood a local statistic is computed from the points within each feature map. (In the ConvNet literature, in contrast to the kernel literature, a feature map is defined as a slice of the feature representation along the depth dimension. That is, for a given $z$, $F_{z,\cdot,\cdot}$ is a feature map.) Concretely, for a spatial neighborhood $\mathcal{N}_{i,j}$ centered at the point $(i,j)$, average pooling computes
$$\tilde{F}_{z,i,j} = \frac{1}{|\mathcal{N}_{i,j}|} \sum_{(i',j') \in \mathcal{N}_{i,j}} F_{z,i',j'}$$
for all $z$.
Average pooling in ConvNets corresponds to an averaging of the feature maps in CKNs. In addition, any weighted averaging, where the weights are the same across layers, corresponds to a weighted averaging of the feature maps. Specifically, note that in Mairal et al. (2014); Mairal (2016) and Paulin et al. (2017) the authors proposed Gaussian pooling for CKNs. In this formulation the weight of a feature map at location when averaging about a feature map at location is given by . Here is a hyperparameter.
There are a wide range of normalizations that have been proposed in the ConvNet literature. Normalizations of ConvNets modify the representation at each location of by taking into account values in a neighborhood of . One such normalization is local response normalization. Local response normalization computes
for all where , and are parameters that can be learned. In local response normalization the neighborhood is typically defined to be at a given spatial location across some or all of the feature maps. However, the spatial scale of the neighborhood could be expanded to be defined across multiple locations within feature maps.
In CKNs there is not a meaningful counterpart to defining a neighborhood across only a subset of the feature maps. Therefore, we present only a counterpart to using all feature maps at once. Consider local response normalization in ConvNets when taking the neighborhood to be the locations across all feature maps within a given spatial area. This roughly corresponds to dividing by a power of the norm of a patch in CKNs when and .
We illustrate the translation from ConvNets to CKNs on the LeNet-5 architecture. LeNet-5 was one of the first modern ConvNets. However, the order of its modules differs from that of many recent ConvNets, as the nonlinearities follow the pooling. See Appendix A for the details of the LeNet-5 ConvNet and Figure 2 for a depiction of the translated CKN architecture. For clarity of exposition we represent the features at each layer of the CKN as a 3D tensor rather than a 2D matrix as in Section 3.1. In performing the translation, we use the approximate correspondence between the tanh activation and the arc-cosine kernel of order 0.
Let denote the initial representation of an image. The first layer is the counterpart to a convolutional layer and consists of applying a linear kernel and projecting onto a subspace. Let . For , let
where denotes the convolution operation. Then the output of the first layer is given by with
Next, the second layer in the ConvNet performs average pooling and subsampling with learnable weights and then applies a pointwise nonlinearity (tanh). The corresponding CKN pools and subsamples and then applies an arc-cosine kernel on patches. Define where is a vector with 1 in element and 0 elsewhere. The pooling and subsampling result in given by
for . Next, let be the identity matrix and let be the arc-cosine kernel of order zero. The output of the second layer is then given by
The third layer in LeNet-5 is again a convolutional layer. Here we use a complete connection scheme since for the ConvNet we found that empirically a complete connection scheme outperforms an incomplete connection scheme (see Appendix B). Therefore, this layer again consists of applying a linear kernel and projecting onto a subspace. For , let
Then the output of the third layer is given by with
The fourth layer is similar to the second layer. The CKN pools and subsamples and then applies an arc-cosine kernel on patches. Define where is a vector with 1 in element and 0 elsewhere. The pooling and subsampling result in given by
for . Next, let be the identity matrix and let be the arc-cosine kernel of order zero. The output of the fourth layer is then given by
The fifth layer is a fully connected layer. Let and let be the arc-cosine kernel of order zero. Then the output of this layer is given by
Finally, the sixth layer is also a fully connected layer. Let and let be the arc-cosine kernel of order zero. Then the output is given by with
The output from this layer is the set of features provided to a classifier.
Like any functional mapping defined as a composition of modules differentiable with respect to their parameters, a CKN can be trained using a gradient-based optimization algorithm. An end-to-end learning approach to train a CKN in a supervised manner was first considered in Mairal (2016). There are three essential ingredients to optimizing CKNs that we contribute in this section: (i) a rigorously derived general gradient formula; (ii) a numerically accurate gradient computation algorithm; and (iii) an efficient stochastic optimization algorithm.
When a CKN uses a differentiable dot product kernel , each layer of the CKN is differentiable with respect to its weights and inputs. Therefore, the entire CKN is differentiable. This provides a benefit over commonly used ConvNets that use non-differentiable activation functions such as the ReLU, which must be trained using subgradient methods rather than gradient methods. Note that, while widely used, only weak convergence guarantees are known for stochastic subgradient methods. Moreover, they require sophisticated topological non-smooth analysis notions (Davis et al., 2019). As we shall show here, a CKN with a kernel corresponding to a smooth nonlinearity performs comparably to a ConvNet with non-smooth nonlinearities.
The derivatives of the loss function from Section 3.1 with respect to the filters at each layer and the inputs at each layer can be derived using the chain rule. First recall the output of a single convolutional layer presented in Section 3.1:
We then have the following proposition, which is detailed in Appendix D.
Let be the loss incurred by an image-label sample , where is the output of th layer of the network described by (3) and parameterizes the linear classifier. Then the Jacobian of the loss with respect to the inner weights , is given by
Computing the derivatives of the output of a convolutional layer involves several linear algebra manipulations. The critical component lies in differentiating through the matrix inverse square root. For this we use the following lemma.
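Define the matrix square root function $g$ by $g(A) = A^{1/2}$. Then for a positive definite matrix $A$ and a matrix $H$ such that $A + H$ is positive definite we have
$$g(A + H) = g(A) + L + o(\|H\|),$$
where $L$ is the unique solution of the continuous Lyapunov equation $A^{1/2} L + L A^{1/2} = H$.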
Hence, computing the gradient of the CKN in this manner consists of solving a continuous Lyapunov equation. The remainder of the gradient computations involve Kronecker products and matrix multiplications.
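To make the Lyapunov step concrete, the sketch below computes the first-order change of the matrix square root by solving $A^{1/2} L + L A^{1/2} = H$ and checks it against a finite difference. It assumes a symmetric positive definite $A$ and a symmetric perturbation $H$, and it solves the equation through an eigendecomposition rather than a dedicated Lyapunov solver; the function names are hypothetical.

```python
import numpy as np


def sqrtm_psd(A):
    """Square root of a symmetric positive definite matrix via eigh."""
    evals, evecs = np.linalg.eigh(A)
    return (evecs * np.sqrt(evals)) @ evecs.T


def sqrt_frechet(A, H):
    """Solve A^{1/2} L + L A^{1/2} = H for symmetric positive definite A.

    In the eigenbasis of A the equation decouples entry-wise:
    (s_i + s_j) * L_ij = H_ij, where s_i are the eigenvalues of A^{1/2}.
    """
    evals, U = np.linalg.eigh(A)
    s = np.sqrt(evals)
    H_tilde = U.T @ H @ U
    L_tilde = H_tilde / (s[:, None] + s[None, :])
    return U @ L_tilde @ U.T


rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                 # symmetric positive definite
H = rng.standard_normal((4, 4))
H = 0.5 * (H + H.T)                     # symmetric perturbation direction

L = sqrt_frechet(A, H)
S = sqrtm_psd(A)
t = 1e-6
finite_diff = (sqrtm_psd(A + t * H) - S) / t
```

The derivative of the inverse square root then follows from the chain rule, since the differential of $X \mapsto X^{-1}$ at $X = A^{1/2}$ in direction $L$ is $-A^{-1/2} L A^{-1/2}$.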
The straightforward approach to computing the derivative of the matrix inverse square root involved in Proposition 2 is to call a solver for continuous-time Lyapunov equations. However, this route becomes an impediment for large-scale problems requiring fast matrix-vector computations on GPUs. An alternative is to leave it to current automatic differentiation software, which computes it through a singular value decomposition. This route does not leverage the structure and leads to worse estimates of gradients (see Section 5.2).
Here we propose a simple and effective approach based on two intertwined Newton methods. Consider the positive definite matrix $M \in \mathbb{R}^{d \times d}$ whose inverse square root is required. We aim to compute $M^{-1/2}$ by an iterative method. Denote by $\lambda_1, \dots, \lambda_d$ the eigenvalues of $M$ and let
$$B = \begin{pmatrix} 0 & M \\ I & 0 \end{pmatrix}.$$
As the eigenvalues of $B$ are $\pm\sqrt{\lambda_i}$, $B$ can be diagonalized as $B = U \Lambda U^{-1}$ with $\Lambda$ the diagonal matrix of its eigenvalues. Let $\operatorname{sign}(B) = U S U^{-1}$ where $S$ is a diagonal matrix whose diagonal is the sign of the eigenvalues in $\Lambda$. Then $\operatorname{sign}(B)$ satisfies (Higham, 2008, Theorem 5.2)
$$\operatorname{sign}(B) = \begin{pmatrix} 0 & M^{1/2} \\ M^{-1/2} & 0 \end{pmatrix}.$$
The sign matrix of $B$ is a square root of the identity, i.e., $\operatorname{sign}(B)^2 = I$. It can then be computed by a Newton's method starting from $X_0 = B$ followed by $X_{k+1} = (X_k + X_k^{-1})/2$. Provided that $B$ has no purely imaginary eigenvalues, it converges quadratically to $\operatorname{sign}(B)$ (Higham, 2008, Theorem 5.6). Decomposing the iterates of this Newton's method on the blocks of $B$ gives Denman and Beavers' algorithm (Denman and Beavers Jr., 1976). This algorithm begins with $Y_0 = M$ and $Z_0 = I$ and proceeds with the iterations $Y_{k+1} = (Y_k + Z_k^{-1})/2$ and $Z_{k+1} = (Z_k + Y_k^{-1})/2$. The sequence $Z_k$ then converges to $M^{-1/2}$.
Each iteration, however, involves the inverses of the iterates $Y_k$ and $Z_k$, which are expensive to compute when using a large number of filters. We propose applying the Newton method one more time, yet now to compute $Y_k^{-1}$ and $Z_k^{-1}$ (sometimes called the Newton-Schulz method), starting respectively from $Z_k$ and $Y_k$ as initial guesses (Higham, 1997). An experimental evaluation of this strategy when we run a fixed number of iterations of the outer Newton method (to compute the inverse matrix square root) yet only one iteration of the inner Newton method (to compute the inverse matrices) demonstrates that it is remarkably effective in practice (see Figure 3 in Section 5.2). We present the pseudocode in Algorithm 1 for the case where one iteration of the inner Newton's method is used. Note that we first scale the matrix by its Frobenius norm to ensure convergence.
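The scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' Algorithm 1: it assumes a symmetric positive definite input, uses the Denman-Beavers iterates with a single Newton-Schulz step in place of each exact inverse, and scales by the Frobenius norm first. The function name `inverse_sqrt_newton` and the default of 20 outer iterations (the value quoted in the experimental section) are choices made for this example.

```python
import numpy as np


def inverse_sqrt_newton(m, n_outer=20):
    """Approximate M^{-1/2} for a symmetric positive definite matrix M.

    Outer loop: Denman-Beavers iteration on (Y, Z), with Z -> M_s^{-1/2}
    where M_s = M / ||M||_F.
    Inner step: one Newton-Schulz update replaces each exact inverse,
    using Z as the initial guess for Y^{-1} and Y as the guess for Z^{-1}.
    Only matrix multiplications are used.
    """
    scale = np.linalg.norm(m, "fro")
    y = m / scale                      # eigenvalues now lie in (0, 1]
    z = np.eye(m.shape[0])
    for _ in range(n_outer):
        y_inv = 2.0 * z - z @ y @ z    # one Newton-Schulz step toward Y^{-1}
        z_inv = 2.0 * y - y @ z @ y    # one Newton-Schulz step toward Z^{-1}
        y, z = 0.5 * (y + z_inv), 0.5 * (z + y_inv)
    # Undo the scaling: (M / c)^{-1/2} = sqrt(c) * M^{-1/2}.
    return z / np.sqrt(scale)


rng = np.random.default_rng(0)
b = rng.standard_normal((6, 6))
m = b @ b.T + 5.0 * np.eye(6)
z = inverse_sqrt_newton(m, n_outer=20)
residual = np.linalg.norm(z @ m @ z - np.eye(6))
print(f"||Z M Z - I||_F = {residual:.2e}")
```

With a single inner step, the two updates simplify to $Y_{k+1} = \tfrac{1}{2} Y_k (3I - Z_k Y_k)$ and $Z_{k+1} = \tfrac{1}{2} (3I - Z_k Y_k) Z_k$. Ill-conditioned matrices need more outer iterations, since the early iterations only increase the smallest eigenvalue of $Z_k Y_k$ by a constant factor.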
By differentiating through these iterations we can obtain the derivatives of the matrix inverse square root with respect to the entries of the matrix. Comparing the accuracy of the gradient obtained using this algorithm to the one returned using automatic differentiation in PyTorch, we find that our approach is twice as accurate. Furthermore, the algorithm only involves matrix multiplications, which is critical to scale to large problems with GPU computing. Hence, this provides a better means of computing the gradient.
The training of CKNs consists of two main stages: unsupervised and supervised learning. The unsupervised learning entails first initializing the filters with an unsupervised method and then fixing the filters and optimizing the ultimate layer using the full dataset. The supervised learning entails training the whole initialized architecture using stochastic estimates of the objective. Here we detail the second stage, for which we propose a new approach. Algorithm 2 outlines the overall CKN training with this new method.
A major difference between CKNs and ConvNets is the spherical constraints imposed on the inner layers. On the implementation side, this simply requires an additional projection during the gradient steps for those layers. On the theoretical side it amounts to a stochastic gradient step on a manifold whose convergence to a stationary point is still ensured, provided that the classifier is not regularized but constrained. Specifically, given image-label pairs , we consider the constrained empirical risk minimization problem
where is the product of Euclidean unit spheres at the th layer and is the output of layers of the network described by (3). Projected stochastic gradient descent draws a mini-batch of samples at iteration , forming an estimate of the objective, and performs the following update:
where is the Euclidean ball centered at the origin of radius and is a step size. Its convergence is stated in the following proposition, detailed in Appendix E.
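A projected stochastic gradient step of this kind can be sketched as follows. The sketch assumes that each inner-layer filter is stored as one row of a weight matrix, so that projecting onto the product of unit spheres amounts to normalizing rows, and that the mini-batch gradients are supplied by the caller. All function and argument names are hypothetical.

```python
import numpy as np


def project_rows_to_sphere(w):
    """Project each row (one filter) onto the Euclidean unit sphere."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)


def project_to_ball(w, radius):
    """Project onto the Euclidean ball of the given radius centered at 0."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def projected_sgd_step(filters, classifier, grad_filters, grad_classifier,
                       step_size, radius):
    """Gradient step followed by projection, layer by layer."""
    new_filters = [
        project_rows_to_sphere(w - step_size * g)
        for w, g in zip(filters, grad_filters)
    ]
    new_classifier = project_to_ball(
        classifier - step_size * grad_classifier, radius
    )
    return new_filters, new_classifier


rng = np.random.default_rng(0)
filters = [project_rows_to_sphere(rng.standard_normal((4, 9)))]
classifier = rng.standard_normal((10, 4))
grad_filters = [rng.standard_normal((4, 9))]   # stand-in mini-batch gradients
grad_classifier = rng.standard_normal((10, 4))

new_filters, new_classifier = projected_sgd_step(
    filters, classifier, grad_filters, grad_classifier,
    step_size=0.1, radius=1.0,
)
```

After the step, every filter again has unit norm and the classifier lies inside the ball, so the iterates stay feasible.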
In practice we use the penalized formulation to compare with classical optimization schemes for ConvNets.
The network architectures present a discrepancy between the inner layers and the ultimate layer: the former computes a feature representation, while the latter is a simple classifier that could be optimized easily once the inner layers are fixed. This motivates us to back-propagate the gradient in the inner layers through the classification performed in the ultimate layer. Formally, consider the regularized empirical risk minimization problem (1),
where denotes the parameters of the inner layers constrained on spheres in the set , parameterizes the last layer and , are the feature representations of the images output by the network. The problem can be simplified as
Strong convexity of the classification problem ensures that the simplified problem is differentiable and its stationary points are stationary points of the original objective. This is recalled in the following proposition, which is detailed in Appendix F.
Assume that is twice differentiable and that for any , the partial functions are strongly convex. Then the simplified objective is differentiable and satisfies
Therefore if a given is -near stationary for the simplified objective , then the pair , where , is -near stationary for the original objective .
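The gradient identity behind this simplification can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's setting: a single hypothetical "inner layer" with features $\tanh(Vx)$, a ridge-regularized least squares classifier $w$, and no spherical constraint on $V$. It compares the partial gradient with respect to $V$, evaluated at the optimal classifier, against finite differences of the simplified objective.

```python
import numpy as np


def features(v, x):
    """Toy inner layer: one tanh unit per row of v."""
    return np.tanh(x @ v.T)                       # shape (n, p)


def objective(v, w, x, y, lam):
    r = features(v, x) @ w - y
    return 0.5 * np.mean(r ** 2) + 0.5 * lam * np.dot(w, w)


def best_classifier(v, x, y, lam):
    """Closed-form minimizer over w (strongly convex ridge problem)."""
    phi = features(v, x)
    n, p = phi.shape
    return np.linalg.solve(phi.T @ phi / n + lam * np.eye(p), phi.T @ y / n)


def simplified_objective(v, x, y, lam):
    return objective(v, best_classifier(v, x, y, lam), x, y, lam)


def simplified_gradient(v, x, y, lam):
    """Partial gradient in v, evaluated at the optimal classifier."""
    w = best_classifier(v, x, y, lam)
    phi = features(v, x)
    r = phi @ w - y
    return (r[:, None] * (1.0 - phi ** 2) * w[None, :]).T @ x / x.shape[0]


def finite_difference_gradient(v, x, y, lam, eps=1e-6):
    """Central finite differences of the simplified objective."""
    grad = np.zeros_like(v)
    for idx in np.ndindex(*v.shape):
        step = np.zeros_like(v)
        step[idx] = eps
        grad[idx] = (simplified_objective(v + step, x, y, lam)
                     - simplified_objective(v - step, x, y, lam)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
v = rng.standard_normal((4, 3))
lam = 0.1

analytic = simplified_gradient(v, x, y, lam)
numeric = finite_difference_gradient(v, x, y, lam)
max_gap = np.max(np.abs(analytic - numeric))
print(f"largest entrywise gap: {max_gap:.2e}")
```

No derivative of the optimal classifier with respect to $V$ is needed: because the partial gradient in $w$ vanishes at the minimizer, that term drops out of the chain rule.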
Least squares loss. In the case of the least squares loss, the computations can be performed analytically, as shown in Appendix F. However, the objective cannot be simplified on the whole dataset, since it would lose its decomposability in the samples. Instead, we apply this strategy on mini-batches. That is, at iteration , denoting the objective formed by a mini-batch of the samples, the algorithm updates the inner layers via, for ,
where and we normalize the gradients by projecting them on the spheres to use a single scaling for all layers.
Other losses. For other losses such as the multinomial loss, no analytic form exists for the minimization. At each iteration we therefore approximate the partial objective on the mini-batch by a regularized quadratic approximation and perform the step above on the inner layers. The ultimate layer reversal step at iteration is detailed in Algorithm 3. The quadratic approximation in Step 2 depends on the current point and can be formed using the full Hessian or a diagonal approximation of the Hessian. The gradient in Step 4 is computed by back-propagating through the operations.
In the experiments we seek to address the following two questions:
How well do the proposed training methods perform for CKNs?
Can a supervised CKN attain the same performance as its ConvNet counterpart?
Previous works reported that specially-designed CKNs can achieve comparable performance to ConvNets in general on MNIST and CIFAR-10 (Mairal et al., 2014; Mairal, 2016). Another set of previous works designed hybrid architectures mixing kernel-based methods and ConvNet ideas (Bruna and Mallat, 2013; Huang et al., 2014; Dai et al., 2014; Yang et al., 2015; Oyallon et al., 2017, 2018). We are interested here in whether, given a ConvNet architecture, an analogous CKN can be designed and trained to achieve similar or superior performance. Our purely kernel-based approach stands in contrast to previous works as, for each (network, dataset) pair, we consider a ConvNet and its CKN counterpart, hence compare them on an equal standing, for varying numbers of filters.
The experiments use the datasets MNIST and CIFAR-10 (LeCun et al., 2001; Krizhevsky and Hinton, 2009). MNIST consists of 60,000 training images and 10,000 test images of handwritten digits numbered 0-9 of size $28 \times 28$ pixels. In contrast, CIFAR-10 consists of 50,000 training images and 10,000 test images from 10 classes of objects of size $32 \times 32$ pixels.
The raw images are transformed prior to being input into the networks. Specifically, the MNIST images are standardized while the CIFAR-10 images are standardized channel-wise and then ZCA whitened on a per-image basis. Validation sets are created for MNIST and CIFAR-10 by randomly separating the training set into two parts such that the validation set has 10,000 images.
The networks we consider in the experiments are LeNet-1 and LeNet-5 on MNIST (LeCun et al., 2001) and All-CNN-C on CIFAR-10 (Springenberg et al., 2015; Krizhevsky and Hinton, 2009). LeNet-1 and LeNet-5 are prominent examples of first modern versions of ConvNets. They use convolutional layers and pooling/subsampling layers and achieved state-of-the-art performance on digit classification tasks on datasets such as MNIST. The ConvNets from Springenberg et al. (2015), including All-CNN-C, were the first models used to make the claim that pooling is unnecessary. All-CNN-C was one of the best-performing models on CIFAR-10 at the time of publication. For mathematical descriptions of the ConvNets and their CKN counterparts, see Appendices A and C, respectively.
Using the principles outlined in Section 3, we translate each architecture to its CKN counterpart. The networks are in general reproduced as faithfully as possible. However, there are a few differences between the original implementations and ours. In particular, the original LeNets have an incomplete connection scheme at the third layer in which each feature map is only connected to a subset of the feature maps from the previous layer. This was included for computational reasons. In our implementation of the LeNets we find that converting the incomplete connection scheme to a complete connection scheme does not decrease performance (see Appendix B). We therefore use the complete connection schemes in our ConvNet and CKN implementations. In addition, the original All-CNN-C has a global average pooling layer as the last layer. In order to have trainable unconstrained parameters in the CKN, we add a fully connected layer after the global average pooling layer in the ConvNet and CKN. Also note that we apply zero-padding at the convolutional layers that have a stride of one to maintain the spatial dimensions at those layers. Moreover, we omit the dropout layers. Since CKNs do not have biases, we omit the biases from the ConvNets. Lastly, as the arc-cosine kernels are not differentiable, we switch to using the RBF kernel on the sphere for the supervised CKN implementations. The nonlinearity generated by this kernel resembles the ReLU (Mairal et al., 2014). We fix the bandwidths to 0.6. (Note, however, that it is possible to train the bandwidths; see Proposition 18 in Appendix D.)
The training of the ConvNets used in the experiments is performed as follows. The initialization is performed using draws from a mean-zero random normal distribution. For the LeNets the standard deviation is set to 0.2 while for All-CNN-C the standard deviation is set to 0.3. The output features are normalized in the same way as for the CKNs, so they are centered and on average have a norm of one. The multinomial logistic loss is used, and the network is trained with SGD with momentum set to 0.9. The batch size is set to the largest power of two that fits on the GPU when training the CKN counterpart (see Table 3 in Appendix G). The step size is chosen initially from a grid of candidate values by training for five iterations with each step size and choosing the step size yielding the lowest training loss on a separate mini-batch. The same method is used to update the step size every 100 iterations, except that at subsequent updates the step size is selected from a grid of values defined relative to the current step size. For All-CNN-C we monitor the training accuracy every epoch. If the accuracy decreases by more than 2% from one epoch to the next we replace the current network parameters with those from the previous epoch and decrease the learning rate by a factor of 4. Cross-validation is performed over a grid of values for the penalty of the multinomial logistic loss parameters. During cross-validation the optimization is performed for 1000 iterations. The final optimization using the optimal penalty is performed for 10,000 iterations.
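The step size search described above can be sketched as a small helper. This is an illustrative sketch only: the candidate grid below is hypothetical (the grids used in the experiments are not reproduced here), and `loss_after_trial` stands for whatever routine runs a few training iterations from the current parameters and reports the training loss on a separate mini-batch.

```python
import math


def select_step_size(candidates, loss_after_trial):
    """Return the candidate whose short trial run gives the lowest loss.

    Candidates whose trial diverges (non-finite loss) are skipped.
    Returns None if every candidate diverges.
    """
    best_step, best_loss = None, math.inf
    for step in candidates:
        loss = loss_after_trial(step)
        if math.isfinite(loss) and loss < best_loss:
            best_step, best_loss = step, loss
    return best_step


def make_quadratic_trial(w0=1.0, n_iters=5):
    """Toy stand-in for training: gradient descent on f(w) = 0.5 * w^2."""
    def trial(step):
        w = w0
        for _ in range(n_iters):
            w -= step * w              # the gradient of 0.5 * w^2 is w
        return 0.5 * w * w
    return trial


# Hypothetical grid of powers of two, used only for this toy example.
candidates = [2.0 ** i for i in range(-3, 2)]
best = select_step_size(candidates, make_quadratic_trial())
print(best)
```

In an actual training loop the same helper would be called again every 100 iterations with a grid centered on the current step size, restoring the parameters saved before each trial.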
Now we detail the unsupervised CKN initialization. The unsupervised training of the CKNs entails approximating the kernel at each layer and then training a classifier on top. Unless otherwise specified, the kernel approximations are performed using spherical $k$-means layer-wise with 10,000 randomly sampled non-constant patches per layer, all from different images. Unless otherwise specified, when evaluating the CKN at each layer the intertwined Newton method is used. In order to achieve a high accuracy but keep the computational costs reasonable, the number of outer Newton iterations is set to 20 and the number of inner Newton iterations is set to 1. The regularization of the Gram matrix on the filters is set to 0.001. After the unsupervised training the features are normalized as in Mairal et al. (2014) so that they are centered and on average have a norm of one. A classifier is trained on these CKN features using the multinomial logistic loss. The loss function is optimized using L-BFGS (Liu and Nocedal, 1989) on all of the features with the default parameters from the SciPy implementation. Cross-validation is performed over a grid of values for the penalty of the multinomial logistic loss parameters. Both the cross-validation and the final optimization with the optimal penalty are performed for a maximum of 1000 iterations.
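For readers unfamiliar with spherical $k$-means, the following sketch shows the basic procedure on normalized patches. It is a plain NumPy stand-in written for illustration (the experiments rely on FAISS), and it assumes that patches are given as rows of a matrix with nonzero norm; the function name, iteration count, and initialization from random patches are choices made for this example.

```python
import numpy as np


def spherical_kmeans(patches, n_filters, n_iters=20, seed=0):
    """Cluster unit-normalized patches by cosine similarity.

    Returns an (n_filters, dim) array of unit-norm centroids, which can
    serve as initial filters for one layer.
    """
    rng = np.random.default_rng(seed)
    x = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    init = rng.choice(len(x), size=n_filters, replace=False)
    centroids = x[init].copy()
    for _ in range(n_iters):
        # Assign each patch to the centroid with the largest dot product.
        assign = np.argmax(x @ centroids.T, axis=1)
        for j in range(n_filters):
            members = x[assign == j]
            if len(members) == 0:
                continue               # keep the previous centroid
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm   # project back onto the sphere
    return centroids


rng = np.random.default_rng(1)
patches = rng.standard_normal((200, 9))       # e.g. 200 flattened 3x3 patches
centroids = spherical_kmeans(patches, n_filters=8)
print(centroids.shape)
```

Because both the patches and the centroids live on the unit sphere, the resulting filters automatically satisfy the spherical constraint imposed on the inner layers.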
Finally, we describe the supervised CKN training. The supervised training of CKNs begins with the unsupervised initialization. The multinomial logistic loss is used, and the network is trained with our ultimate layer reversal (ULR) method. In the ultimate layer reversal method we use an approximation of the full Hessian for all but the LeNet-1 experiment with 128 filters per layer. Due to memory constraints we use a diagonal approximation to the Hessian for the LeNet-1 experiment with 128 filters per layer. The regularization parameter of the Hessian was selected via cross-validation and set to 0.03125 for the LeNets and to 0.0625 for All-CNN-C. The batch size is set to the largest power of two that fits on the GPU (see Table 3 in Appendix G). The step sizes are determined in the same way as for the ConvNets, but with a different grid of initial step sizes. The penalty of the multinomial logistic loss parameters is fixed to the initial value from the unsupervised training throughout the ULR iterations. After 10,000 iterations of ULR, the parameters of the loss function are once again optimized with L-BFGS for a maximum of 1000 iterations. Cross-validation over the penalty of the multinomial logistic loss parameters is once again performed at this final stage in the same manner as during the unsupervised initialization.
The code for this project was primarily written using PyTorch (Paszke et al., 2017) and may be found online at https://github.com/cjones6/yesweckn. FAISS (Johnson et al., 2017) is used during the unsupervised initialization of the CKNs. We ran the experiments on Titan Xps, Titan Vs, and Tesla V100s. The corresponding time to run the experiments on an NVIDIA Titan Xp GPU would be more than 20 days.
We begin by demonstrating the advantages of the training methods proposed in Section 4 over the standard methods.
The straightforward way of computing the gradient of a CKN is by allowing automatic differentiation software to differentiate through SVDs. In Section 4.2 we introduced an alternative approach: the intertwined Newton method. Here we compare the two approaches when training the deepest network we consider in the experiments: the CKN counterpart to All-CNN-C. We compare the gradients from differentiating through the SVD and the intertwined Newton method in two ways: directly and also indirectly via the performance when training a CKN.
First, we compare the gradients from each method to the result from using a finite difference method. We find that for the CKN counterpart to All-CNN-C on CIFAR-10 with 8 filters/layer, differentiating through 20 Newton iterations yields relative errors that are 2.5 times smaller than those from differentiating through the SVD. This supports the hypothesis that differentiating through the SVD is more numerically unstable than differentiating through Newton iterations. We moreover note that using Newton iterations allows us to control the numerical accuracy of the gradient and of the matrix inverse square root itself.
Given that the gradients from the intertwined Newton method are more accurate, we now investigate whether this makes a difference in the training. Figure 3 compares the performance of the two methods on All-CNN-C with 8 and 128 filters/layer. We set the number of outer Newton iterations to 50 and leave the number of inner Newton iterations at 1. From the plots we can see that for 8 filters/layer there is no difference in the training performance, despite the gradients for the intertwined Newton method being more accurate. However, for 128 filters/layer the intertwined Newton method begins to outperform the SVD after approximately 200 iterations. After 1000 iterations the accuracy from differentiating through the intertwined Newton method is 1.7% better than that from differentiating through the SVD. The intertwined Newton method therefore appears to be superior for larger networks.
Next, we compare training CKNs using stochastic gradient optimization (SGO) to using our proposed ultimate layer reversal method (ULR-SGO) as detailed in Section 4.3. In our SGO implementation we use the version of the optimization in which the regularization of the classifier enters as a penalty rather than as a constraint.
Figure 4 displays the results of the comparison for the CKN counterparts to LeNet-5 on MNIST and All-CNN-C on CIFAR-10 with 8 and 128 filters/layer. From the plots we can see that ULR-SGO is nearly always better than SGO throughout the iterations. This difference is most pronounced for the experiments in which the accuracy increased the most from the initialization: the LeNet-5 CKN with 8 filters/layer and the All-CNN-C CKN with 128 filters/layer. The final accuracy from the ULR-SGO method after 1000 iterations ranges from being 0.5% better on the easier task of classifying MNIST digits with the LeNet-5 CKN architectures to 4% and 40% better on the harder task of classifying CIFAR-10 images with the All-CNN-C CKN architectures. It is also interesting to note that the ULR-SGO curve is much smoother in the case of the LeNet-5 CKN with 8 filters/layer. In addition, in the final case of the All-CNN-C CKN with 128 filters/layer, the SGO method seems to have gotten stuck, whereas this was not a problem for ULR-SGO. The initial drop in performance in the plot for the All-CNN-C CKN with 8 filters/layer is due to the method choosing an initial learning rate that was too large. The learning rate was corrected when it was next updated, at 100 iterations, at which point the accuracy began to increase again.
While it is clear that ULR-SGO dominates SGO in terms of performance over the iterations, it is also important to ensure that this is true in terms of time. Figure 10 in Appendix G provides the same plots as Figure 4, except that the x-axis is now time. The experiments for the LeNet-5 CKN were performed using an Nvidia Titan Xp GPU while the All-CNN-C CKN experiments were performed using an Nvidia Tesla V100 GPU. From the plots we can see that the ULR-SGO method still outperforms the SGO method in terms of accuracy vs. time.
Now we turn to the comparison between CKNs and ConvNets. We perform this comparison for LeNet-1 and LeNet-5 on MNIST and for All-CNN-C on CIFAR-10. Figure 5 displays the results when we vary the number of filters per layer by powers of two, from 8 to 128. Beginning with the LeNets, we see that both the CKN and ConvNet perform well on MNIST over a wide range of the number of filters per layer. The CKN outperforms the ConvNet for almost every number of filters per layer. At best the performance of the CKN is 1% better and at worst it is 0.1% worse. The former value is large, given that the accuracy of both the CKNs and the ConvNets exceeds 97%. The success of the CKNs continues for All-CNN-C on CIFAR-10. For All-CNN-C the CKN outperforms the ConvNet by 3-16%. From the plot we can see that the CKN performance aligns well with the ConvNet performance toward the endpoints of the range considered. Overall, the results for the LeNets and All-CNN-C suggest that translated CKNs do perform similarly to their ConvNet counterparts.
Recalling from Section 4.3 that CKNs are initialized in an unsupervised manner, we also compare the performance of unsupervised CKNs to their supervised CKN and ConvNet counterparts. We explore this in Figure 11 in Appendix G. For LeNet-1 the unsupervised CKN performs extremely well, achieving at minimum 98% of the accuracy of the corresponding supervised CKN. The performance is slightly worse for LeNet-5, with the unsupervised CKN achieving 64-97% of the performance of the supervised CKN with the same number of filters. The relative performance is the worst for All-CNN-C, with the unsupervised performance being 44-59% of that of the supervised CKN with the same number of filters. Therefore, the supervised training contributes tremendously to the overall performance of the LeNet-5 CKN with a small number of filters and to the All-CNN-C CKN. These results also suggest that for more complex tasks the unsupervised CKN may require more than 16 times as many filters to achieve comparable performance to the supervised CKN and ConvNet.
In this work we provided a systematic study of the translation of a ConvNet to its CKN counterpart. We presented a new stochastic gradient algorithm to train a CKN in a supervised manner. When trained using this method, the CKNs we studied achieved comparable performance to their ConvNet counterparts. As with the training of any deep network, there are a number of design choices we made that could be modified. Each such choice in the ConvNet world has a counterpart in the CKN world. For example, we could perform the initialization of the filters of the ConvNet and CKN using a different method. In addition, we could use additional normalizations in the ConvNet and CKN. We leave the exploration of the effects of these alternatives to future work.
This work was supported by NSF TRIPODS Award CCF-1740551, the program “Learning in Machines and Brains” of CIFAR, and faculty research awards.
Journal of Machine Learning Research, 20:25:1–25:49, 2019.
Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.