An important goal in visual recognition is to devise image representations that are invariant to particular transformations. In this paper, we address this goal with a new type of convolutional neural network (CNN) whose invariance is encoded by a reproducing kernel. Unlike traditional approaches where neural networks are learned either to represent data or for solving a classification task, our network learns to approximate the kernel feature map on training data. Such an approach enjoys several benefits over classical ones. First, by teaching CNNs to be invariant, we obtain simple network architectures that achieve a similar accuracy to more complex ones, while being easy to train and robust to overfitting. Second, we bridge a gap between the neural network literature and kernels, which are natural tools to model invariance. We evaluate our methodology on visual recognition tasks where CNNs have proven to perform well, e.g., digit recognition with the MNIST dataset, and the more challenging CIFAR-10 and STL-10 datasets, where our accuracy is competitive with the state of the art.
We have recently seen a revival of attention given to convolutional neural networks (CNNs) [22] due to their high performance for large-scale visual recognition tasks [15, 21, 30]. The architecture of CNNs is relatively simple and consists of successive layers organized in a hierarchical fashion; each layer involves convolutions with learned filters followed by a pointwise non-linearity and a downsampling operation called "feature pooling". The resulting image representation has been empirically observed to be invariant to image perturbations and to encode complex visual patterns [33], which are useful properties for visual recognition. Training CNNs remains difficult, however, since high-capacity networks may involve billions of parameters to learn, which requires both high computational power, e.g., GPUs, and appropriate regularization techniques [18, 21, 30].
The exact nature of invariance that CNNs exhibit is also not precisely understood. Only recently, the invariance of related architectures has been characterized; this is the case for the wavelet scattering transform [8] or the hierarchical models of [7]. Our work revisits convolutional neural networks, but we adopt a significantly different approach than the traditional one. Indeed, we use kernels [26], which are natural tools to model invariance [14]. Inspired by the hierarchical kernel descriptors of [2], we propose a reproducing kernel that produces multi-layer image representations.
Our main contribution is an approximation scheme called convolutional kernel network (CKN) that makes the kernel approach computationally feasible. Our approach is a new type of unsupervised convolutional neural network that is trained to approximate the kernel map. Interestingly, our network uses non-linear functions that resemble rectified linear units [1, 30], even though they were not handcrafted and naturally emerge from an approximation scheme of the Gaussian kernel map. By bridging a gap between kernel methods and neural networks, we believe that we are opening a fruitful research direction for the future. Our network is learned without supervision since the label information is only used subsequently in a support vector machine (SVM). Yet, we achieve competitive results on several datasets such as MNIST [22], CIFAR-10 [20] and STL-10 [13] with simple architectures, few parameters to learn, and no data augmentation. Open-source code for learning our convolutional kernel networks is available on the first author's webpage.

There have been several attempts to build kernel-based methods that mimic deep neural networks; we only review here the ones that are most related to our approach.
Kernels for building deep large-margin classifiers have been introduced in [10]. The multilayer arc-cosine kernel is built by successive kernel compositions, and each layer relies on an integral representation. Similarly, our kernels rely on an integral representation and enjoy a multilayer construction. However, in contrast to arc-cosine kernels: (i) we build our sequence of kernels by convolutions, using local information over spatial neighborhoods (as opposed to compositions, using global information); (ii) we propose a new training procedure for learning a compact representation of the kernel in a data-dependent manner.

Kernels with invariance properties for visual recognition have been proposed in [7]. Such kernels are built with a parameterized "neural response" function, which consists in computing the maximal response of a base kernel over a local neighborhood. Multiple layers are then built by iteratively renormalizing the response kernels and pooling using neural response functions. Learning is performed by plugging the obtained kernel into an SVM. In contrast to [7], we propagate information up, from lower to upper layers, by using sequences of convolutions. Furthermore, we propose a simple and effective data-dependent way to learn a compact representation of our kernels and show that we obtain near state-of-the-art performance on several benchmarks.
The convolutional multilayer kernel is a generalization of the hierarchical kernel descriptors introduced in computer vision [2, 3]. The kernel produces a sequence of image representations that are built on top of each other in a multilayer fashion. Each layer can be interpreted as a non-linear transformation of the previous one with additional spatial invariance. We call these layers image feature maps,^1 and formally define them as follows:

An image feature map is a function φ: Ω → H, where Ω is a (usually discrete) subset of [0,1]^d representing normalized "coordinates" in the image, and H is a Hilbert space.

^1 In the kernel literature, "feature map" denotes the mapping between data points and their representation in a reproducing kernel Hilbert space (RKHS) [26]. Here, feature maps refer to spatial maps representing local image characteristics at every location, as usual in the neural network literature [22].

For all practical examples in this paper, Ω is a two-dimensional grid and corresponds to different locations in a two-dimensional image. In other words, Ω is a set of pixel coordinates. Given z in Ω, the point φ(z) represents some characteristics of the image at location z, or in a neighborhood of z. For instance, a color image with three channels, red, green, and blue, may be represented by an initial feature map φ₀: Ω₀ → H₀, where Ω₀ is a regular grid, H₀ is the Euclidean space R³, and φ₀ provides the color pixel values. With the multilayer scheme, non-trivial feature maps will be obtained subsequently, which will encode more complex image characteristics. With this terminology in hand, we now introduce the convolutional kernel, first for a single layer.
Let us consider two images represented by two image feature maps, respectively φ and φ′: Ω → H, where Ω is a set of pixel locations and H is a Hilbert space. The one-layer convolutional kernel between φ and φ′ is defined as

K(φ, φ′) = Σ_{z∈Ω} Σ_{z′∈Ω} ‖φ(z)‖_H ‖φ′(z′)‖_H e^{−(1/2β²)‖z−z′‖₂²} e^{−(1/2σ²)‖φ̃(z)−φ̃′(z′)‖_H²},   (1)

where β and σ are smoothing parameters of Gaussian kernels, and φ̃(z) := (1/‖φ(z)‖_H) φ(z) if φ(z) ≠ 0 and φ̃(z) = 0 otherwise. Similarly, φ̃′(z′) is a normalized version of φ′(z′).^2

^2 When Ω is not discrete, the sums in (1) should be replaced by Lebesgue integrals in the paper.
It is easy to show that the kernel K is positive definite (see Appendix A). It consists of a sum of pairwise comparisons between the image features φ(z) and φ′(z′) computed at all spatial locations z and z′ in Ω. To be significant in the sum, a comparison needs the corresponding locations z and z′ to be close in Ω, and the normalized features φ̃(z) and φ̃′(z′) to be close in the feature space H. The parameters β and σ respectively control these two definitions of "closeness". Indeed, when β is large, the kernel is invariant to the positions z and z′, but when β is small, only features placed at the same location are compared to each other. Therefore, the role of β is to control how much the kernel is locally shift-invariant. Next, we will show how to go beyond one single layer, but before that, we present concrete examples of simple input feature maps φ₀.
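As a concrete reference point, the double sum in (1) can be evaluated by brute force for small maps. The sketch below is illustrative (function name and map sizes are our own); it treats each map as an array over a regular grid of normalized coordinates:

```python
import numpy as np

def conv_kernel(H1, H2, beta, sigma):
    """Brute-force evaluation of the one-layer convolutional kernel of Eq. (1).

    H1, H2: arrays of shape (n, n, d) mapping n x n grid coordinates to features.
    """
    n = H1.shape[0]
    # normalized grid coordinates in [0, 1]^2
    g = np.linspace(0.0, 1.0, n)
    coords = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1).reshape(-1, 2)
    X1 = H1.reshape(-1, H1.shape[-1])
    X2 = H2.reshape(-1, H2.shape[-1])
    n1 = np.linalg.norm(X1, axis=1)
    n2 = np.linalg.norm(X2, axis=1)
    # normalized features, with the convention 0/0 = 0
    T1 = np.divide(X1, n1[:, None], out=np.zeros_like(X1), where=n1[:, None] > 0)
    T2 = np.divide(X2, n2[:, None], out=np.zeros_like(X2), where=n2[:, None] > 0)
    spatial = ((coords[:, None] - coords[None]) ** 2).sum(-1)   # ||z - z'||^2
    feat = ((T1[:, None] - T2[None]) ** 2).sum(-1)              # feature distances
    weights = np.outer(n1, n2)                                  # ||phi(z)|| ||phi'(z')||
    return float(np.sum(weights
                        * np.exp(-spatial / (2 * beta**2))
                        * np.exp(-feat / (2 * sigma**2))))
```

The resulting Gram matrices are symmetric and positive semi-definite, which can be checked numerically on small random maps.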
Assume that H = R² and that φ(z) provides the two-dimensional gradient of the image at pixel z, which is often computed with first-order differences along each dimension. Then, the quantity ‖φ(z)‖_H is the gradient intensity, and φ̃(z) is its orientation, which can be characterized by a particular angle: there exists θ in [0, 2π) such that φ̃(z) = [cos θ, sin θ]. The resulting kernel is exactly the kernel descriptor introduced in [2, 3] for natural image patches.
In that setting, φ associates to a location z an image patch centered at z. Then, the space H is simply the Euclidean space of patch pixel values, and φ̃(z) is a contrast-normalized version of the patch, which is a useful transformation for visual recognition according to classical findings in computer vision [19]. When the image is encoded with three color channels, patches carry the three channels as well.
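A minimal sketch of this "patch map" follows; the function name is hypothetical, and we use ℓ2 normalization as the contrast normalization, consistent with the normalized features φ̃ in (1):

```python
import numpy as np

def patch_map(image, m):
    """Associate to each valid location the m x m patch around it,
    together with an l2-normalized (contrast-normalized) copy.

    image: 2D array (grayscale); returns {(i, j): (patch, normalized_patch)}.
    """
    h, w = image.shape
    r = m // 2
    patches = {}
    for i in range(r, h - r):
        for j in range(r, w - r):
            p = image[i - r:i + r + 1, j - r:j + r + 1].ravel().astype(float)
            norm = np.linalg.norm(p)
            # convention: a zero patch maps to the zero vector
            patches[(i, j)] = (p, p / norm if norm > 0 else p)
    return patches
```

Each entry plays the role of (φ(z), φ̃(z)) for one location z of the grid.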
We now define the multilayer convolutional kernel, generalizing some ideas of [2].
Let us consider a set Ω_{k−1} ⊆ [0,1]^d and a Hilbert space H_{k−1}. We build a new set Ω_k and a new Hilbert space H_k as follows:

(i) choose a patch shape P_{k−1} defined as a bounded symmetric subset of [−1,1]^d, and a set of coordinates Ω_k such that for all locations z_k in Ω_k, the patch {z_k} + P_{k−1} is a subset of Ω_{k−1};^3 in other words, each coordinate z_k in Ω_k corresponds to a valid patch in Ω_{k−1} centered at z_k;

(ii) define the convolutional kernel K_k on the "patch" feature maps P_{k−1} → H_{k−1}, by replacing in (1): Ω by P_{k−1}, H by H_{k−1}, and σ, β by appropriate smoothing parameters σ_k, β_k. We denote by H_k the Hilbert space for which the positive definite kernel K_k is reproducing.

^3 For two sets A and B, the Minkowski sum A + B is defined as {a + b : a ∈ A, b ∈ B}.
An image represented by a feature map φ_{k−1}: Ω_{k−1} → H_{k−1} at layer k−1 is now encoded in the k-th layer as φ_k: Ω_k → H_k, where for all z_k in Ω_k, φ_k(z_k) is the representation in H_k of the patch feature map z ↦ φ_{k−1}(z_k + z) for z in P_{k−1}.

Concretely, the kernel K_k between two patches of φ_{k−1} and φ′_{k−1} at respective locations z_k and z′_k is

Σ_{z∈P_{k−1}} Σ_{z′∈P_{k−1}} ‖φ_{k−1}(z_k+z)‖ ‖φ′_{k−1}(z′_k+z′)‖ e^{−(1/2β_k²)‖z−z′‖₂²} e^{−(1/2σ_k²)‖φ̃_{k−1}(z_k+z)−φ̃′_{k−1}(z′_k+z′)‖²},   (2)

where ‖·‖ is the Hilbertian norm of H_{k−1}. In Figure 1(a), we illustrate the interactions between the sets of coordinates Ω_k, patches P_k, and feature spaces H_k across layers. For two-dimensional grids, a typical patch shape is a square. Information encoded in the k-th layer differs from the (k−1)-th one in two aspects: first, each point in layer k contains information about several points from the (k−1)-th layer and can possibly represent larger patterns; second, the new feature map is more locally shift-invariant than the previous one due to the term involving the parameter β_k in (2).
The multilayer convolutional kernel slightly differs from the hierarchical kernel descriptors of [2] but exploits similar ideas. Bo et al. [2] indeed define several ad hoc kernels for representing local information in images, such as gradient, color, or shape. These kernels are close to the one defined in (1) but with a few variations. Some of them do not use normalized features, and these kernels use weighting strategies for the summands of (1) that are specialized to the image modality, e.g., color or gradient, whereas we use the same weight for all kernels. The generic formulation (1) that we propose may be useful per se, but our main contribution comes in the next section, where we use the kernel as a new tool for learning convolutional neural networks.
Generic schemes have been proposed for approximating a non-linear kernel with a linear one, such as the Nyström method and its variants [5, 31], or random sampling techniques in the Fourier domain for shift-invariant kernels [24]. In the context of convolutional multilayer kernels, such an approximation is critical because computing the full kernel matrix on a database of images is computationally infeasible, even for a moderate number of images and layers. For this reason, Bo et al. [2] use the Nyström method for their hierarchical kernel descriptors.
In this section, we show that when the coordinate sets Ω_k are two-dimensional regular grids, a natural approximation for the multilayer convolutional kernel consists of a sequence of spatial convolutions with learned filters, pointwise non-linearities, and pooling operations, as illustrated in Figure 1(b). More precisely, our scheme approximates the kernel map of K defined in (1) at layer k by finite-dimensional spatial maps ξ_k: Ω′_k → R^{p_k}, where Ω′_k is a set of coordinates related to Ω_k, and p_k is a positive integer controlling the quality of the approximation. Consider indeed two images represented at layer k by image feature maps φ_k and φ′_k, respectively. Then,
(A) the corresponding maps ξ_k and ξ′_k are learned such that K(φ_k, φ′_k) ≈ ⟨ξ_k, ξ′_k⟩, where ⟨·,·⟩ is the Euclidean inner-product acting as if ξ_k and ξ′_k were vectors in R^{|Ω′_k| p_k};

(B) the set Ω_k is linked to Ω′_k by the relation Ω′_k = Ω_k + P′_k, where P′_k is a patch shape, and the quantities φ_k(z_k) in H_k admit finite-dimensional approximations ψ_k(z_k); as illustrated in Figure 1(b), ψ_k(z_k) is a patch from ξ_k centered at location z_k with shape P′_k;

(C) an activation map ζ_k is computed from ξ_{k−1} by convolution with p_k filters followed by a non-linearity. The subsequent map ξ_k is obtained from ζ_k by a pooling operation.
We call this approximation scheme a convolutional kernel network (CKN). In comparison to CNNs, our approach enjoys similar benefits such as efficient prediction at test time, and involves the same set of hyper-parameters: number of layers, numbers of filters p_k at each layer, shapes of the filters, and sizes of the feature maps. The other parameters can be automatically chosen, as discussed later. Training a CKN can be argued to be as simple as training a CNN in an unsupervised manner [25], since we will show that the main difference is in the cost function that is optimized.
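The forward pass of one such layer can be sketched schematically: patch normalization, an exponential non-linearity against learned filters, then Gaussian pooling with subsampling. This is a simplified illustration (1×1 "patches", normalization constants omitted, all names our own), not the exact algorithm of the paper:

```python
import numpy as np

def ckn_layer(xi, W, eta, sigma, beta, sub):
    """Schematic single CKN layer on an (n, n, p_in) map xi.

    W:   (p_out, p_in) unit-norm filters; eta: (p_out,) non-negative weights;
    sub: subsampling factor of the pooling step.
    """
    n = xi.shape[0]
    X = xi.reshape(-1, xi.shape[-1])
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
    # activation: ||psi(z)|| * sqrt(eta_l) * exp(-||psi~(z) - w_l||^2 / sigma^2)
    d2 = ((Xn[:, None, :] - W[None]) ** 2).sum(-1)           # (n*n, p_out)
    zeta = (norms * np.sqrt(eta) * np.exp(-d2 / sigma**2)).reshape(n, n, -1)
    # separable Gaussian pooling over normalized coordinates, then subsampling
    g = np.arange(n)
    G = np.exp(-((g[:, None] - g[None]) ** 2) / (beta**2 * n**2))
    pooled = np.einsum("ia,jb,abl->ijl", G, G, zeta)
    return pooled[::sub, ::sub]
```

All operations are a convolution-like linear step, a pointwise non-linearity, and a pooling, mirroring properties (A)-(C) above.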
A key component of our formulation is the Gaussian kernel. We start by approximating it by a linear operation with learned filters followed by a pointwise non-linearity. Our starting point is the next lemma, which can be obtained after a simple calculation.

For all x and x′ in R^d, and σ > 0,

e^{−(1/2σ²)‖x−x′‖₂²} = (2/(πσ²))^{d/2} ∫_{R^d} e^{−(1/σ²)‖x−w‖₂²} e^{−(1/σ²)‖x′−w‖₂²} dw.   (3)
The lemma gives us a mapping of any point x in R^d to the function w ↦ √C e^{−(1/σ²)‖x−w‖₂²} in L²(R^d), where the kernel becomes linear, and C is the constant in front of the integral. To obtain a finite-dimensional representation, we need to approximate the integral with a weighted finite sum, which is a classical problem arising in statistics (see [29] and chapter 8 of [6]). Then, we consider two different cases.
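The identity (3) can be checked numerically; the snippet below verifies it in dimension d = 1 with a simple quadrature (the values of σ, x and x′ are arbitrary):

```python
import numpy as np

# Numeric check of Eq. (3) in dimension d = 1:
# exp(-(x - y)^2 / (2 s^2)) = sqrt(2 / (pi s^2)) * int exp(-(x-w)^2/s^2) exp(-(y-w)^2/s^2) dw
sigma, x, y = 0.7, 0.3, -0.5
w = np.linspace(-20.0, 20.0, 200001)            # fine quadrature grid
dw = w[1] - w[0]
integrand = np.exp(-(x - w) ** 2 / sigma**2) * np.exp(-(y - w) ** 2 / sigma**2)
rhs = np.sqrt(2.0 / (np.pi * sigma**2)) * integrand.sum() * dw
lhs = np.exp(-(x - y) ** 2 / (2 * sigma**2))
```

The two sides agree up to quadrature error, confirming the integral representation that the finite sums below will approximate.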
When the data lives in a compact set of R^d, the integral in (3) can be approximated by uniform sampling over a large enough set. We choose such a strategy for two types of kernels from Eq. (1): (i) the spatial kernels e^{−(1/2β²)‖z−z′‖₂²}; (ii) the terms e^{−(1/2σ²)‖φ̃(z)−φ̃′(z′)‖²} when φ is the "gradient map" presented in Section 2. In the latter case, d = 2 and φ̃(z) is the gradient orientation. We typically sample a few orientations as explained in Section 4.
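For the orientation case, the Gaussian kernel between unit two-dimensional vectors can be approximated by uniformly sampling orientations on the circle. A small sketch (the number of samples, the value of σ, and the diagonal normalization are illustrative choices of ours):

```python
import numpy as np

sigma, p = 0.5, 64
thetas = 2 * np.pi * np.arange(p) / p                    # uniformly sampled angles
W = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # p unit vectors w_j

def g(v, W, sigma):
    """Vector of Gaussian responses exp(-||v - w_j||^2 / sigma^2)."""
    return np.exp(-((v[None] - W) ** 2).sum(-1) / sigma**2)

# normalize so that the approximation equals 1 on the diagonal (x = y)
x0 = np.array([1.0, 0.0])
c = 1.0 / (g(x0, W, sigma) ** 2).sum()

def approx_kernel(x, y):
    return c * (g(x, W, sigma) * g(y, W, sigma)).sum()

x = np.array([np.cos(0.4), np.sin(0.4)])
y = np.array([np.cos(1.4), np.sin(1.4)])
exact = np.exp(-((x - y) ** 2).sum() / (2 * sigma**2))
```

With a few dozen samples the approximation is already close to the exact Gaussian kernel for unit vectors.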
To prevent the curse of dimensionality, we learn to approximate the kernel on training data, which is intrinsically low-dimensional. We optimize importance weights η = [η_l]_{l=1}^p in R_+^p and sampling points W = [w_l]_{l=1}^p in R^{d×p} on n training pairs (x_i, y_i):

min_{η, W} Σ_{i=1}^n [ e^{−(1/2σ²)‖x_i−y_i‖₂²} − Σ_{l=1}^p η_l e^{−(1/σ²)‖x_i−w_l‖₂²} e^{−(1/σ²)‖y_i−w_l‖₂²} ]².   (4)
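An objective of this form, i.e., a squared loss between the exact Gaussian kernel and a non-negative combination of products of Gaussian responses, can be minimized with standard first-order methods. The sketch below uses plain projected gradient descent on synthetic unit-norm pairs (the paper itself uses L-BFGS-B; step size, iteration count, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, p, d, n = 1.0, 8, 2, 200
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(n, d)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
target = np.exp(-((X - Y) ** 2).sum(1) / (2 * sigma**2))

W = rng.normal(size=(p, d))        # sampling points w_l
eta = np.full(p, 1.0 / p)          # importance weights, kept non-negative

def model(W, eta):
    A = np.exp(-((X[:, None] - W[None]) ** 2).sum(-1) / sigma**2)  # n x p
    B = np.exp(-((Y[:, None] - W[None]) ** 2).sum(-1) / sigma**2)
    return A, B, (A * B) @ eta

def loss(W, eta):
    return np.mean((model(W, eta)[2] - target) ** 2)

lr, loss0 = 0.02, loss(W, eta)
for _ in range(300):
    A, B, m = model(W, eta)
    r = m - target                                   # residuals
    C = A * B                                        # n x p
    g_eta = 2.0 / n * C.T @ r
    # d m_i / d w_l = -(2/sigma^2) * eta_l * C_il * (2 w_l - x_i - y_i)
    D = 2 * W[None] - X[:, None] - Y[:, None]        # n x p x d
    g_W = -(4.0 / (n * sigma**2)) * np.einsum("np,npd->pd", r[:, None] * C * eta, D)
    eta = np.maximum(eta - lr * g_eta, 0.0)          # projected step, eta >= 0
    W = W - lr * g_W
```

The projection step enforces the non-negativity constraint on η after each update.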
Interestingly, we may already draw some links with neural networks. When applied to unit-norm vectors x_i and y_i, problem (4) produces sampling points w_l whose norm is close to one. After learning, a new unit-norm point x in R^d is mapped to the vector [√η_l e^{−(1/σ²)‖x−w_l‖₂²}]_{l=1}^p in R^p, which may be written as [f(w_l^⊤ x)]_{l=1}^p, assuming that the norm of w_l is always one, so that ‖x−w_l‖₂² = 2 − 2 w_l^⊤ x, where f is the function u ↦ √η_l e^{−(2/σ²)(1−u)} for u in [−1, 1]. Therefore, the finite-dimensional representation of x only involves a linear operation followed by a non-linearity, as in typical neural networks. In Figure 2, we show that the shape of f resembles the "rectified linear unit" function [30].
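The ReLU-like shape of this non-linearity is easy to see numerically; the sketch below evaluates f(u) = e^{−(2/σ²)(1−u)} (dropping the √η_l scale, with an illustrative σ): it is near zero on most of [−1, 0], monotonically increasing, and reaches 1 at u = 1:

```python
import numpy as np

sigma = 0.5                                            # illustrative value
f = lambda u: np.exp(-(2.0 / sigma**2) * (1.0 - u))    # omits the sqrt(eta_l) scale
u = np.linspace(-1.0, 1.0, 201)
vals = f(u)
```

Plotting `vals` against `u` reproduces the qualitative picture of Figure 2: a smooth, soft-thresholded rectification.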
We have now all the tools in hand to build our convolutional kernel network. We start by making assumptions on the input data, and then present the learning scheme and its approximation principles.
We assume that the input data is a finite-dimensional map ξ₀: Ω′₀ → R^{p₀}, and that φ₀ "extracts" patches from ξ₀. Formally, there exists a patch shape P′₀ such that Ω′₀ = Ω₀ + P′₀, H₀ = R^{p₀|P′₀|}, and for all z₀ in Ω₀, φ₀(z₀) is a patch of ξ₀ centered at z₀. Then, property (B) described at the beginning of Section 3 is satisfied for k = 0 by choosing ψ₀ = φ₀. The examples of input feature maps given earlier satisfy this finite-dimensional assumption: for the gradient map, ξ₀ is the gradient of the image along each direction, with p₀ = 2, P′₀ = {0} a 1×1 patch, and ψ₀ = φ₀ = ξ₀; for the patch map, ξ₀ is the input image, say with p₀ = 3 for RGB data.
The zeroth layer being characterized, we present in Algorithms 1 and 2 the subsequent layers and how to learn their parameters in a feedforward manner. It is interesting to note that the input parameters of the algorithm are exactly the same as those of a CNN—that is, number of layers and filters, sizes of the patches and feature maps (obtained here via the subsampling factor). Ultimately, CNNs and CKNs only differ in the cost function that is optimized for learning the filters and in the choice of non-linearities. As we show next, there exists a link between the parameters of a CKN and those of a convolutional multilayer kernel.
ζ_k(z) = ‖ψ_{k−1}(z)‖₂ [ √η_{kl} e^{−(1/σ_k²)‖ψ̃_{k−1}(z) − w_{kl}‖₂²} ]_{l=1}^{p_k}  for z in Ω_{k−1},   (5)

ξ_k(u) = √(2/(πβ_k²)) Σ_{z∈Ω_{k−1}} e^{−(1/β_k²)‖u−z‖₂²} ζ_k(z)  for u in Ω′_k.   (6)
We proceed recursively to show that the kernel approximation property (A) is satisfied; we assume that (B) holds at layer k−1, and then we show that (A) and (B) also hold at layer k. This is sufficient for our purpose since we have previously assumed (B) for the zeroth layer. Given two image feature maps φ_{k−1} and φ′_{k−1}, we start by approximating K(φ_{k−1}, φ′_{k−1}) by replacing φ_{k−1}(z) and φ′_{k−1}(z′) by their finite-dimensional approximations provided by (B):
K(φ_{k−1}, φ′_{k−1}) ≈ Σ_{z∈Ω_{k−1}} Σ_{z′∈Ω_{k−1}} ‖ψ_{k−1}(z)‖₂ ‖ψ′_{k−1}(z′)‖₂ e^{−(1/2β_k²)‖z−z′‖₂²} e^{−(1/2σ_k²)‖ψ̃_{k−1}(z)−ψ̃′_{k−1}(z′)‖₂²}.   (7)
Then, we use the finite-dimensional approximation of the Gaussian kernel involving σ_k and obtain

K(φ_{k−1}, φ′_{k−1}) ≈ Σ_{z∈Ω_{k−1}} Σ_{z′∈Ω_{k−1}} ζ_k(z)^⊤ ζ′_k(z′) e^{−(1/2β_k²)‖z−z′‖₂²},   (8)
where ζ_k is defined in (5) and ζ′_k is defined similarly by replacing ψ_{k−1} with ψ′_{k−1}. Finally, we approximate the remaining Gaussian kernel by uniform sampling on Ω′_k, following Section 3.1. After exchanging sums and grouping appropriate terms together, we obtain the new approximation

K(φ_{k−1}, φ′_{k−1}) ≈ (2Δ²/(πβ_k²)) Σ_{u∈Ω′_k} ( Σ_{z∈Ω_{k−1}} e^{−(1/β_k²)‖u−z‖₂²} ζ_k(z) )^⊤ ( Σ_{z′∈Ω_{k−1}} e^{−(1/β_k²)‖u−z′‖₂²} ζ′_k(z′) ),   (9)
where the constant 2Δ²/(πβ_k²) comes from the multiplication of the constant 2/(πβ_k²) from (3) with the weight Δ² of uniform sampling, Δ corresponding to the distance between two pixels of Ω′_k.^4 As a result, the right-hand side is exactly ⟨ξ_k, ξ′_k⟩, where ξ_k is defined in (6), giving us property (A). It remains to show that property (B) also holds, specifically that the quantity (2) can be approximated by the Euclidean inner-product ⟨ψ_k(z_k), ψ′_k(z′_k)⟩ with the patches ψ_k(z_k) and ψ′_k(z′_k) of shape P′_k; we assume for that purpose that P′_k is a subsampled version of the patch shape P_k by a factor γ_k.

^4 The choice of β_k in Algorithm 2 is driven by signal-processing principles. The feature pooling step can indeed be interpreted as a downsampling operation that reduces the resolution of the map from Ω_{k−1} to Ω′_k by using a Gaussian anti-aliasing filter, whose role is to reduce frequencies above the Nyquist limit.
We remark that the kernel (2) is the same as (1) applied to layer k−1 with Ω_{k−1} replaced by the patch domain. By making the same substitution in (9), we immediately obtain an approximation of (2). Then, all Gaussian terms are negligible for locations that are far from each other, say at distance larger than 2β_k. Thus, we may replace the sums over the exact patch domains by sums over the subsampled patch shape P′_k, which has the same set of "non-negligible" terms. This yields exactly the approximation ⟨ψ_k(z_k), ψ′_k(z′_k)⟩.
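The key algebraic step, factorizing the spatial Gaussian kernel through uniformly sampled "pooling" points via the lemma of Section 3.1, can be verified numerically in one dimension. This sketch (weights `a`, `b` stand in for the feature magnitudes; grid sizes are illustrative) compares the exact double sum with its factorized form:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, n = 0.2, 10
z = np.linspace(0.0, 1.0, n)                           # pixel coordinates Omega
a, b = rng.uniform(size=n), rng.uniform(size=n)        # stand-ins for feature weights

# exact value of sum_{z,z'} a(z) b(z') exp(-(z - z')^2 / (2 beta^2))
S = np.exp(-(z[:, None] - z[None]) ** 2 / (2 * beta**2))
exact = a @ S @ b

# factorized form via the lemma: uniform sampling points u on a fine grid
u = np.linspace(-1.0, 2.0, 3001)
delta = u[1] - u[0]
const = np.sqrt(2.0 / (np.pi * beta**2))
G = np.exp(-(u[:, None] - z[None]) ** 2 / beta**2)     # |u| x n Gaussian responses
xi_a, xi_b = G @ a, G @ b                              # "pooled" maps, one per image
approx = const * delta * xi_a @ xi_b
```

The double sum over pairs of locations collapses into an inner product of two independently pooled maps, which is precisely what makes the per-image representation ξ_k possible.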
Regarding problem (4), stochastic gradient descent (SGD) may be used since a potentially infinite amount of training data is available. However, we have preferred to use L-BFGS-B [9] on pairs of randomly selected training data points, and initialize with the K-means algorithm. L-BFGS-B is a parameter-free state-of-the-art batch method, which is not as fast as SGD but much easier to use. We always run the L-BFGS-B algorithm for a fixed number of iterations, which seems to ensure convergence to a stationary point. Our goal is to demonstrate the preliminary performance of a new type of convolutional network, and we leave any speed improvement as future work.

We now present experiments that were performed using Matlab and an L-BFGS-B solver [9] interfaced by Stephen Becker. Each image is represented by the last map ξ_k of the CKN, which is used in a linear SVM implemented in the software package LibLinear [16]. These representations are centered, rescaled to have unit ℓ2-norm on average, and the regularization parameter of the SVM is always selected on a validation set or by cross-validation over a logarithmic grid.
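The feature post-processing step before the SVM can be sketched as follows (the function name is ours; it implements centering followed by rescaling to unit average ℓ2-norm, as described above):

```python
import numpy as np

def postprocess(F):
    """Center each feature dimension, then rescale so that the
    representations have unit l2-norm on average (rows = images)."""
    F = F - F.mean(axis=0, keepdims=True)       # center each dimension
    avg_norm = np.mean(np.linalg.norm(F, axis=1))
    return F / avg_norm if avg_norm > 0 else F
```

The result feeds directly into a linear classifier such as LibLinear's SVM.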
The patches P_k are typically small; we tried a few sizes for the first layer and for the upper ones. The number of filters p_k in our experiments is chosen in a small set of values. The downsampling factor γ_k between two consecutive layers is fixed, whereas the last layer is downsampled so as to produce final maps of a small size. For the gradient map φ₀, we approximate the Gaussian kernel by uniformly sampling a small number of orientations. Finally, we also use a small offset ε to prevent numerical instabilities in the normalization steps.
Unsupervised learning was first used for discovering the underlying structure of natural image patches by Olshausen and Field [23]. Without making any a priori assumption about the data except a parsimony principle, the method is able to produce small prototypes that resemble Gabor wavelets—that is, spatially localized oriented basis functions. The results were found impressive by the scientific community, and their work received substantial attention. It is known that such results can also be achieved with CNNs [25]. We show in this section that this is also the case for convolutional kernel networks, even though they are not explicitly trained to reconstruct data.
Following [23], we randomly select a database of whitened natural image patches and learn filters using the formulation (4). We initialize with Gaussian random noise, without performing the K-means step, in order to ensure that the output we obtain is not an artifact of the initialization. In Figure 3, we display the filters associated with the largest weights η_l. Among the filters, most exhibit interpretable Gabor-like structures and the rest were less interpretable. To the best of our knowledge, this is the first time that the explicit kernel map of the Gaussian kernel for whitened natural image patches is shown to be related to Gabor wavelets.
The MNIST dataset [22] consists of 60,000 images of handwritten digits for training and 10,000 for testing. We use two types of initial maps in our networks: the "patch map", denoted by CKN-PM, and the "gradient map", denoted by CKN-GM. We follow the evaluation methodology of [25] for comparison when varying the training set size. We select the regularization parameter of the SVM by cross-validation when the training size is small, and otherwise we keep part of the training set for validation. We report in Table 1 the results obtained for four simple architectures. CKN-GM1 is the simplest one: its second layer uses only 50 filters, resulting in a network with few parameters to learn. Yet, it achieves an outstanding error rate on the full dataset. The best performing model, CKN-GM2, is similar to CKN-GM1 but uses 400 filters. When working with raw patches, two layers (CKN-PM2) give better results than one layer. More details about the network architectures are provided in the supplementary material. In general, our method achieves state-of-the-art accuracy for this task, since lower error rates have only been reported by using data augmentation [11].
Table 1: Test error in % on MNIST for increasing training set sizes, comparing CNN [25], Scat-1 [8], Scat-2 [8], CKN-GM1, CKN-GM2, CKN-PM1, CKN-PM2, and the methods of [32], [18], and [19].
We now move to the more challenging datasets CIFAR-10 [20] and STL-10 [13]. We select the best architectures on a validation set held out from the training set for CIFAR-10, and by cross-validation on STL-10. We report in Table 2 results for CKN-GM, defined in the previous section, without exploiting color information, and for CKN-PM when working on raw RGB patches whose mean color is subtracted. The best selected models always have two layers, with 800 filters for the top layer. Since CKN-PM and CKN-GM exploit different information, we also report a combination of these two models, CKN-CO, obtained by concatenating their normalized image representations. The standard deviations on STL-10 were always small. Our approach appears to be competitive with the state of the art, especially on STL-10 where only one method does better than ours, despite the fact that our models use only two layers and require learning few parameters. Note that better results than those reported in Table 2 have been obtained in the literature by using either data augmentation (on CIFAR-10 for [18, 30]) or external data (on STL-10 for [28]). We are planning to investigate similar data manipulations in the future.

In this paper, we have proposed a new methodology for combining kernels and convolutional neural networks. We show that mixing the ideas of these two concepts is fruitful, since we achieve near state-of-the-art performance on several datasets such as MNIST, CIFAR-10, and STL-10, with simple architectures and no data augmentation. Some challenges regarding our work are left open for the future. The first one is the use of supervision to better approximate the kernel for the prediction task. The second consists in leveraging the kernel interpretation of our convolutional neural networks to better understand the theoretical properties of the feature spaces that these networks produce.
This work was partially supported by grants from ANR (project MACARON ANR-14-CE23-0003-01), MSR-Inria joint centre, European Research Council (project ALLEGRO), CNRS-Mastodons program (project GARGANTUA), and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025).
To show that the kernel defined in (1) is positive definite (p.d.), we simply use elementary rules from the kernel literature described in Sections 2.3.2 and 3.4.1 of [26]. A linear combination of p.d. kernels with non-negative weights is also p.d. (see Proposition 3.22 of [26]), and thus it is sufficient to show that for all z and z′ in Ω, the following kernel on H × H is p.d.:

(x, x′) ↦ e^{−(1/2β²)‖z−z′‖₂²} ‖x‖_H ‖x′‖_H e^{−(1/2σ²)‖x̃−x̃′‖_H²}.

Since the spatial term is a non-negative constant for fixed z and z′, it is in turn sufficient to show that the following kernel on H × H is p.d.:

(x, x′) ↦ ‖x‖_H ‖x′‖_H e^{−(1/2σ²)‖x̃−x̃′‖_H²},

with the convention x̃ = 0 if x = 0. This is a pointwise product of two kernels, and it is p.d. when each of the two kernels is p.d. The first one is obviously p.d.: (x, x′) ↦ ‖x‖_H ‖x′‖_H. The second one is a composition of the Gaussian kernel, which is p.d., with the feature maps of a normalized linear kernel in H. This composition is p.d. according to Proposition 3.22, item (v) of [26], since the normalization does not remove the positive-definiteness property.
We present in detail the architectures used in the paper in Table 3.
Arch. | # layers | p₁ | γ₁ | p₂ | # param
--- | --- | --- | --- | --- | ---
**MNIST** | | | | |
CKN-GM1 | 2 | 12 | 2 | 50 |
CKN-GM2 | 2 | 12 | 2 | 400 |
CKN-PM1 | 1 | 200 | 2 | - |
CKN-PM2 | 2 | 50 | 2 | 200 |
**CIFAR-10** | | | | |
CKN-GM | 2 | 12 | 2 | 800 |
CKN-PM | 2 | 100 | 2 | 800 |
**STL-10** | | | | |
CKN-GM | 2 | 12 | 2 | 800 |
CKN-PM | 2 | 50 | 2 | 800 |