NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
view repo
In this paper, we introduce a new image representation based on a multilayer kernel machine. Unlike traditional kernel methods where data representation is decoupled from the prediction task, we learn how to shape the kernel with supervision. We proceed by first proposing improvements of the recently-introduced convolutional kernel networks (CKNs) in the context of unsupervised learning; then, we derive backpropagation rules to take advantage of labeled training data. The resulting model is a new type of convolutional neural network, where optimizing the filters at each layer is equivalent to learning a linear subspace in a reproducing kernel Hilbert space (RKHS). We show that our method achieves reasonably competitive performance for image classification on some standard "deep learning" datasets such as CIFAR-10 and SVHN, and also for image super-resolution, demonstrating the applicability of our approach to a large variety of image-related tasks.
READ FULL TEXT VIEW PDF
We propose a connectionist-inspired kernel machine model with three key
...
read it
Kernel learning methods are among the most effective learning methods an...
read it
Substring kernels are classical tools for representing biological sequen...
read it
Building highly non-linear and non-parametric models is central to sever...
read it
In this paper, we study deep signal representations that are invariant t...
read it
Local Hebbian learning is believed to be inferior in performance to
end-...
read it
Deep kernel learning aims at designing nonlinear combinations of multipl...
read it
This project collects the different accepted papers and their link to Arxiv or Gitxiv
In the past years, deep neural networks such as convolutional or recurrent ones have become highly popular for solving various prediction problems, notably in computer vision and natural language processing. Conceptually close to approaches that were developed several decades ago (see,
lecun-98x ), they greatly benefit from the large amounts of labeled data that have been made available recently, allowing to learn huge numbers of model parameters without worrying too much about overfitting. Among other reasons explaining their success, the engineering effort of the deep learning community and various methodological improvements have made it possible to learn in a day on a GPU complex models that would have required weeks of computations on a traditional CPU (see, e.g., he2015deep ; krizhevsky2012 ; simonyan2014very ).Before the resurgence of neural networks, non-parametric models based on positive definite kernels were one of the most dominant topics in machine learning
scholkopf2002learning . These approaches are still widely used today because of several attractive features. Kernel methods are indeed versatile; as long as a positive definite kernel is specified for the type of data considered—e.g., vectors, sequences, graphs, or sets—a large class of machine learning algorithms originally defined for linear models may be used. This family include supervised formulations such as support vector machines and unsupervised ones such as principal or canonical component analysis, or K-means and spectral clustering.
The problem of data representation is thus decoupled from that of learning theory and algorithms. Kernel methods also admit natural mechanisms to control the learning capacity and reduce overfitting scholkopf2002learning .On the other hand, traditional kernel methods suffer from several drawbacks. The first one is their computational complexity, which grows quadratically with the sample size due to the computation of the Gram matrix. Fortunately, significant progress has been achieved to solve the scalability issue, either by exploiting low-rank approximations of the kernel matrix williams2001 ; zhang2008improved , or with random sampling techniques for shift-invariant kernels rahimi2007 . The second disadvantage is more critical; by decoupling learning and data representation, kernel methods seem by nature incompatible with end-to-end learning—that is, the representation of data adapted to the task at hand, which is the cornerstone of deep neural networks and one of the main reason of their success. The main objective of this paper is precisely to tackle this issue in the context of image modeling.
Specifically, our approach is based on convolutional kernel networks, which have been recently introduced in mairal2014convolutional . Similar to hierarchical kernel descriptors bo2011 , local image neighborhoods are mapped to points in a reproducing kernel Hilbert space via the kernel trick. Then, hierarchical representations are built via kernel compositions, producing a sequence of “feature maps” akin to convolutional neural networks, but of infinite dimension. To make the image model computationally tractable, convolutional kernel networks provide an approximation scheme that can be interpreted as a particular type of convolutional neural network learned without supervision.
To perform end-to-end learning given labeled data, we use a simple but effective principle consisting of learning discriminative subspaces in RKHSs, where we project data
. We implement this idea in the context of convolutional kernel networks, where linear subspaces, one per layer, are jointly optimized by minimizing a supervised loss function. The formulation turns out to be a new type of convolutional neural network with a non-standard parametrization. The network also admits simple principles to learn without supervision: learning the subspaces may be indeed achieved efficiently with classical kernel approximation techniques
williams2001 ; zhang2008improved .To demonstrate the effectiveness of our approach in various contexts, we consider image classification benchmarks such as CIFAR-10 krizhevsky2012 and SVHN netzer2011reading , which are often used to evaluate deep neural networks; then, we adapt our model to perform image super-resolution, which is a challenging inverse problem. On the SVHN and CIFAR-10 datasets, we obtain a competitive accuracy, with about and error rates, respectively, without model averaging or data augmentation. For image up-scaling, we outperform recent approaches based on classical convolutional neural networks dong2014learning ; dong2015image .
We believe that these results are highly promising. Our image model achieves competitive performance in two different contexts, paving the way to many other applications. Moreover, our results are also subject to improvements. In particular, we did not use GPUs yet, which has limited our ability to exhaustively explore model hyper-parameters and evaluate the accuracy of large networks. We also did not investigate classical regularization/optimization techniques such as Dropout krizhevsky2012
ioffe2015batch , or recent advances allowing to train very deep networks he2015deep ; simonyan2014very . To gain more scalability and start exploring these directions, we are currently working on a GPU implementation, which we plan to publicly release along with our current CPU implementation.One of our goals is to make a bridge between kernel methods and deep networks, and ideally reach the best of both worlds. Given the potentially attractive features of such a combination, several attempts have been made in the past to unify these two schools of thought. A first proof of concept was introduced in cho2009
with the arc-cosine kernel, which admits an integral representation that can be interpreted as a one-layer neural network with random weights and infinite number of rectified linear units. Besides, a multilayer kernel may be obtained by kernel compositions
cho2009 . Then, hierarchical kernel descriptors bo2011 and convolutional kernel networks mairal2014convolutional extend a similar idea in the context of images leading to unsupervised representations mairal2014convolutional .Multiple kernel learning sonnenburg2006large is also related to our work since is it is a notable attempt to introduce supervision in the kernel design. It provides techniques to select a combination of kernels from a pre-defined collection, and typically requires to have already “good” kernels in the collection to perform well. More related to our work, the backpropagation algorithm for the Fisher kernel introduced in sydorov2014deep
learns the parameters of a Gaussian mixture model with supervision. In comparison, our approach does not require a probabilistic model and learns parameters at several layers. Finally, we note that a concurrent effort to ours is conducted in the Bayesian community with deep Gaussian processes
damianou , complementing the Frequentist approach that we follow in our paper.In this section, we present the principles of convolutional kernel networks and a few generalizations and improvements of the original approach of mairal2014convolutional . Essentially, the model builds upon four ideas that are detailed below and that are illustrated in Figure 1 for a model with a single layer.
Given a set , a positive definite kernel implicitly defines a Hilbert space , called reproducing kernel Hilbert space (RKHS), along with a mapping . This embedding is such that the kernel value corresponds to the inner product . Called “kernel trick”, this approach can be used to obtain nonlinear representations of local image patches bo2011 ; mairal2014convolutional .
More precisely, consider an image , where is the number of channels, e.g., for RGB, and is a set of pixel coordinates, typically a two-dimensional grid. Given two image patches of size , represented as vectors in , we define a kernel as
(1) |
where and denote the usual Euclidean norm and inner-product, respectively, and is a dot-product kernel on the sphere. Specifically, should be smooth and its Taylor expansion have non-negative coefficients to ensure positive definiteness scholkopf2002learning . For example, the arc-cosine cho2009 or the Gaussian (RBF) kernels may be used: given two vectors with unit -norm, choose for instance
(2) |
Then, we have implicitly defined the RKHS associated to and a mapping .
The representation of patches in a RKHS requires finite-dimensional approximations to be computationally manageable. The original model of mairal2014convolutional does that by exploiting an integral form of the RBF kernel. Specifically, given two patches and , convolutional kernel networks provide two vectors in such that the kernel value is close to the Euclidean inner product . After applying this transformation to all overlapping patches of the input image , a spatial map may be obtained such that for all in , , where is the patch from centered at pixel location .^{1}^{1}1
To simplify, we use zero-padding when patches are close to the image boundaries, but this is optional.
With the approximation scheme of mairal2014convolutional , can be interpreted as the output feature map of a one-layer convolutional neural network.A conceptual drawback of mairal2014convolutional is that data points are approximated by vectors that do not live in the RKHS . This issue can be solved by using variants of the Nyström method williams2001 , which consists of projecting data onto a subspace of with finite dimension . For this task, we have adapted the approach of zhang2008improved : we build a database of patches randomly extracted from various images and normalized to have unit -norm, and perform a spherical -means algorithm to obtain centroids with unit -norm. Then, a new patch is approximated by its projection onto the -dimensional subspace .
The projection of onto admits a natural parametrization in . The explicit formula is classical (see williams2001 ; zhang2008improved and Appendix A), leading to
(3) |
where we have introduced the matrix , and, by an abuse of notation, the function is applied pointwise to its arguments. Then, the spatial map introduced above can be obtained by (i) computing the quantities for all patches of the image (spatial convolution after mirroring the filters ); (ii) contrast-normalization involving the norm ; (iii) applying the pointwise non-linear function
; (iv) applying the linear transform
at every pixel location (which may be seen as spatial convolution); (v) multiplying by the norm making homogeneous. In other words, we obtain a particular convolutional neural network, with non-standard parametrization. Note that learning requires only performing a K-means algorithm and computing the inverse square-root matrix ; therefore, the training procedure is very fast.Then, it is worth noting that the encoding function with kernel (2
) is reminiscent of radial basis function networks (RBFNs)
broomhead1988radial , whose hidden layer resembles (3) without the matrix and with no normalization. The difference between RBFNs and our model is nevertheless significant. The RKHS mapping, which is absent from RBFNs, is indeed a key to the multilayer construction that will be presented shortly: a network layer takes points from the RKHS’s previous layer as input and use the corresponding RKHS inner-product. To the best of our knowledge, there is no similar multilayer and/or convolutional construction in the radial basis function network literature.The previous steps transform an image into a map , where each vector in encodes a point in representing information of a local image neighborhood centered at location . Then, convolutional kernel networks involve a pooling step to gain invariance to small shifts, leading to another finite-dimensional map with smaller resolution:
(4) |
The Gaussian weights act as an anti-aliasing filter for downsampling the map and is set according to the desired subsampling factor (see mairal2014convolutional ), which does not need to be integer. Then, every point in may be interpreted as a linear combination of points in , which is itself in since is a linear subspace. Note that the linear pooling step was originally motivated in mairal2014convolutional as an approximation scheme for a match kernel, but this point of view is not critically important here.
By following the first three principles described above, the input image is transformed into another one . It is then straightforward to apply again the same procedure to obtain another map , then , etc. By going up in the hierarchy, the vectors in represent larger and larger image neighborhoods (aka. receptive fields) with more invariance gained by the pooling layers, akin to classical convolutional neural networks.
The multilayer scheme produces a sequence of maps , where each vector encodes a point—say —in the linear subspace of . Thus, we implicitly represent an image at layer as a spatial map such that for all . As mentioned previously, the mapping to the RKHS is a key to the multilayer construction. Given , larger image neighborhoods are represented by patches of size that can be mapped to a point in the Cartesian product space endowed with its natural inner-product; finally, the kernel defined on these patches can be seen as a kernel on larger image neighborhoods than .
In the previous section, we have described a variant of convolutional kernel networks where linear subspaces are learned at every layer. This is achieved without supervision by a K-means algorithm leading to small projection residuals. It is thus natural to introduce also a discriminative approach.
We now consider a prediction task, where we are given a training set of images with respective scalar labels living either in for binary classification and for regression. For simplicity, we only present these two settings here, but extensions to multiclass classification and multivariate regression are straightforward. We also assume that we are given a smooth convex loss function that measures the fit of a prediction to the true label .
Given a positive definite kernel on images, the classical empirical risk minimization formulation consists of finding a prediction function in the RKHS associated to by minimizing the objective
(5) |
where the parameter controls the smoothness of the prediction function with respect to the geometry induced by the kernel, hence regularizing and reducing overfitting scholkopf2002learning . After training a convolutional kernel network with layers, such a positive definite kernel may be defined as
(6) |
where are the -th finite-dimensional feature maps of , respectively, and the corresponding maps in , which have been defined in the previous section. The kernel is also indexed by , which represents the network parameters—that is, the subspaces , or equivalently the set of filters from Eq. (3). Then, formulation (5) becomes equivalent to
(7) |
where is the Frobenius norm that extends the Euclidean norm to matrices, and, with an abuse of notation, the maps are seen as matrices in . Then, the supervised convolutional kernel network formulation consists of jointly minimizing (7) with respect to in and with respect to the set of filters , whose columns are constrained to be on the Euclidean sphere.
Since we consider a smooth loss function , e.g., logistic, squared hinge, or square loss, optimizing (7) with respect to can be achieved with any gradient-based method. Moreover, when is convex, we may also use fast dedicated solvers, (see, e.g., hongzhou , and references therein). Optimizing with respect to the filters , is more involved because of the lack of convexity. Yet, the objective function is differentiable, and there is hope to find a “good” stationary point by using classical stochastic optimization techniques that have been successful for training deep networks.
For that, we need to compute the gradient by using the chain rule—also called “backpropagation”
lecun-98x . We instantiate this rule in the next lemma, which we have found useful to simplify the calculation.Consider an image represented here as a matrix in , associated to a label in and call the -th feature map obtained by encoding with the network parameters . Then, consider a perturbation of the set of filters . Assume that we have for all ,
(8) |
where is equal to , and is a matrix in such that for all matrices of the same size,
(9) |
where the inner-product is the Frobenius’s one and are linear functions. Then,
(10) |
where denote the derivative of the smooth function with respect to its second argument.
The proof of this lemma is straightforward and follows from the definition of the Fréchet derivative. Nevertheless, it is useful to derive the closed form of the gradient in the next proposition.
Consider the quantities introduced in Lemma 1, but denote by for simplicity. By construction, we have for all ,
(11) |
where is seen as a matrix in ; is the linear operator that extracts all overlapping patches from a map such that is a matrix of size ; is a diagonal matrix whose diagonal entries carry the -norm of the columns of ; is short for ; and is a matrix of size performing the linear pooling operation. Then, the gradient of the loss with respect to the filters , is given by (10) with
(12) |
where is any matrix of the same size as , is the -th feature map before the pooling step, is the Hadamart (elementwise) product, is the adjoint of , and
(13) |
The proof is presented in Appendix B. Most quantities that appear above admit physical interpretations: multiplication by performs downsampling; multiplication by performs upsampling; multiplication of on the right by performs -normalization of the columns; can be seen as a spatial convolution of the map by the filters ; finally, “combines” a set of patches into a spatial map by adding to each pixel location the respective patch contributions.
Computing the gradient requires a forward pass to obtain the maps through (11) and a backward pass that composes the functions as in (10). The complexity of the forward step is dominated by the convolutions , as in convolutional neural networks. The cost of the backward pass is the same as the forward one up to a constant factor. Assuming , which is typical for lower layers that require more computation than upper ones, the most expensive cost is due to and which is the same as . We also pre-compute and
by eigenvalue decompositions, whose cost is reasonable when performed only once per minibatch. Off-diagonal elements of
are also not computed since they are set to zero after elementwise multiplication with a diagonal matrix. In practice, we also replace by with , which corresponds to performing a regularized projection onto (see Appendix A). Finally, a small offset of is added to the diagonal entries of .When using the kernel (2), the objective is differentiable with respect to the hyper-parameters . When large amounts of training data are available and overfitting is not a issue, optimizing the training loss by taking gradient steps with respect to these parameters seems appropriate instead of using a canonical parameter value. Otherwise, more involved techniques may be needed; we plan to investigate other strategies in future work.
The backpropagation rules of the previous section have set up the stage for using a stochastic gradient descent method (SGD). We now present a few strategies to accelerate it in our context.
Recently, many incremental optimization techniques have been proposed for solving convex optimization problems of the form (7) when is large but finite (see hongzhou and references therein). These methods usually provide a great speed-up over the stochastic gradient descent algorithm without suffering from the burden of choosing a learning rate. The price to pay is that they rely on convexity, and they require storing into memory the full training set. For solving (7) with fixed network parameters , it means storing the maps , which is often reasonable if we do not use data augmentation. To partially leverage these fast algorithms for our non-convex problem, we have adopted a minimization scheme that alternates between two steps: (i) fix , then make a forward pass on the data to compute the maps and minimize the convex problem (7) with respect to using the accelerated MISO algorithm hongzhou ; (ii) fix , then make one pass of a projected stochastic gradient algorithm to update the set of filters . The set of network parameters is initialized with the unsupervised learning method described in Section 2.
The kernels are defined on the sphere; therefore, it is natural to constrain the filters—that is, the columns of the matrices —to have unit -norm. As a result, a classical stochastic gradient descent algorithm updates at iteration each filter as follows , where
is an estimate of the gradient computed on a minibatch and
is a learning rate. In practice, we found that convergence could be accelerated by preconditioning, which consists of optimizing after a change of variable to reduce the correlation of gradient entries. For unconstrainedoptimization, this heuristic involves choosing a symmetric positive definite matrix
and replacing the update direction by , or, equivalently, performing the change of variable and optimizing over . When constraints are present, the case is not as simple since may not be a descent direction. Fortunately, it is possible to exploit the manifold structure of the constraint set (here, the sphere) to perform an appropriate update absil2009optimization . Concretely, (i) we choose a matrix per layer that is equal to the inverse covariance matrix of the patches from the same layer computed after the initialization of the network parameters. (ii) We perform stochastic gradient descent steps on the sphere manifold after the change of variable , leading to the update . Because this heuristic is not a critical component, but simply an improvement of SGD, we relegate mathematical details in Appendix C.Choosing the right learning rate in stochastic optimization is still an important issue despite the large amount of work existing on the topic, see, e.g., lecun-98x and references therein. In our paper, we use the following basic heuristic: the initial learning rate is chosen “large enough”; then, the training loss is evaluated after each update of the weights
. When the training loss increases between two epochs, we simply divide the learning rate by two, and perform “back-tracking” by replacing the current network parameters by the previous ones.
For classification tasks, “easy” samples have often negligible contribution to the gradient (see, e.g., lecun-98x ). For instance, for the squared hinge loss , the gradient vanishes when the margin is greater than one. This motivates the following heuristic: we consider a set of active samples, initially all of them, and remove a sample from the active set as soon as we obtain zero when computing its gradient. In the subsequent optimization steps, only active samples are considered, and after each epoch, we randomly reactivate of the inactive ones.
We now present experiments on image classification and super-resolution. All experiments were conducted on -core and -core GHz Intel CPUs using C++ and Matlab.
We consider the datasets CIFAR-10 krizhevsky2012 and SVHN netzer2011reading , which contain images from classes. CIFAR-10 is medium-sized with training samples and test ones. SVHN is larger with training examples and test ones. We evaluate the performance of a -layer network, designed with few hyper-parameters: for each layer, we learn filters and choose the RBF kernels defined in (2) with initial parameters . Layers use patches and a subsampling pooling factor of except for layer where the factor is ; Layers use simply patches and no subsampling. For CIFAR-10, the parameters are kept fixed during training, and for SVHN, they are updated in the same way as the filters. We use the squared hinge loss in a one vs all setting to perform multi-class classification (with shared filters between classes). The input of the network is pre-processed with the local whitening procedure described in paulin2015local . We use the optimization heuristics from the previous section, notably the automatic learning rate scheme, and a gradient momentum with parameter , following krizhevsky2012 . The regularization parameter and the number of epochs are set by first running the algorithm on a validation split of the training set. is chosen near the canonical parameter , in the range , with , and the number of epochs is at most . The initial learning rate is with a minibatch size of .
We present our results in Table 1 along with the performance achieved by a few recent methods without data augmentation or model voting/averaging. In this context, the best published results are obtained by the generalized pooling scheme of lee2016generalizing . We achieve about test error on SVHN and about on CIFAR-10, which positions our method as a reasonably “competitive” one, in the same ballpark as the deeply supervised nets of lee2015 or network in network of lin2013network .
Stoch P. zeiler2013 | MaxOut goodfellow2013 | NiN lin2013network | DSN lee2015 | Gen P. lee2016generalizing | SCKN (Ours) | |
---|---|---|---|---|---|---|
CIFAR-10 | 15.13 | 11.68 | 10.41 | 9.69 | 7.62 | 10.20 |
SVHN | 2.80 | 2.47 | 2.35 | 1.92 | 1.69 | 2.04 |
Due to lack of space, the results reported here only include a single supervised model. Preliminary experiments with no supervision show also that one may obtain competitive accuracy with wide shallow architectures. For instance, a two-layer network with (1024-16384) filters achieves error on CIFAR-10. Note also that our unsupervised model outperforms original CKNs mairal2014convolutional . The best single model from mairal2014convolutional gives indeed . Training the same architecture with our approach is two orders of magnitude faster and gives . Another aspect we did not study is model complexity. Here as well, preliminary experiments are encouraging. Reducing the number of filters to per layer yields indeed error on CIFAR-10 and on SVHN. A more precise comparison with no supervision and with various network complexities will be presented in another venue.
Image up-scaling is a challenging problem, where convolutional neural networks have obtained significant success dong2014learning ; dong2015image ; wang2015deep . Here, we follow dong2015image and replace traditional convolutional neural networks by our supervised kernel machine. Specifically, RGB images are converted to the YCbCr color space and the upscaling method is applied to the luminance channel only to make the comparison possible with previous work. Then, the problem is formulated as a multivariate regression one. We build a database of patches of size randomly extracted from the BSD500 dataset arbelaez2011contour after removing image 302003.jpg, which overlaps with one of the test images. versions of the patches are build using the Matlab function imresize, and upscaled back to
by using bicubic interpolation; then, the goal is to predict high-resolution images from blurry bicubic interpolations.
The blurry estimates are processed by a -layer network, with patches and filters at every layer without linear pooling and zero-padding. Pixel values are predicted with a linear model applied to the -dimensional vectors present at every pixel location of the last layer, and we use the square loss to measure the fit. The optimization procedure and the kernels are identical to the ones used for processing the SVHN dataset in the classification task. The pipeline also includes a pre-processing step, where we remove from input images a local mean component obtained by convolving the images with a averaging box filter; the mean component is added back after up-scaling.
For the evaluation, we consider three datasets: Set5 and Set14 are standard for super-resolution; Kodim is the Kodak Image database, available at http://r0k.us/graphics/kodak/, which contains high-quality images with no compression or demoisaicing artefacts. The evaluation procedure follows dong2014learning ; dong2015image ; timofte2013anchored ; wang2015deep by using the code from the author’s web page. We present quantitative results in Table 2. For x3 upscaling, we simply used twice our model learned for x2 upscaling, followed by a 3/4 downsampling. This is clearly suboptimal since our model is not trained to up-scale by a factor 3, but this naive approach still outperforms other baselines dong2014learning ; dong2015image ; wang2015deep that are trained end-to-end. Note that wang2015deep also proposes a data augmentation scheme at test time that slightly improves their results. In Appendix D, we also present a visual comparison between our approach and dong2015image , whose pipeline is the closest to ours, up to the use of a supervised kernel machine instead of CNNs.
Fact. | Dataset | Bicubic | SC zeyde2010single | ANR timofte2013anchored | A+timofte2013anchored | CNN1 dong2014learning | CNN2 dong2015image | CSCN wang2015deep | SCKN |
---|---|---|---|---|---|---|---|---|---|
x2 | Set5 | 33.66 | 35.78 | 35.83 | 36.54 | 36.34 | 36.66 | 36.93 | 37.07 |
Set14 | 30.23 | 31.80 | 31.79 | 32.28 | 32.18 | 32.45 | 32.56 | 32.76 | |
Kodim | 30.84 | 32.19 | 32.23 | 32.71 | 32.62 | 32.80 | 32.94 | 33.21 | |
x3 | Set5 | 30.39 | 31.90 | 31.92 | 32.58 | 32.39 | 32.75 | 33.10 | 33.08 |
Set14 | 27.54 | 28.67 | 28.65 | 29.13 | 29.00 | 29.29 | 29.41 | 29.50 | |
Kodim | 28.43 | 29.21 | 29.21 | 29.57 | 29.42 | 29.64 | 29.76 | 29.88 |
This work was supported by ANR (MACARON project ANR-14-CE23-0003-01).
Local convolutional features with unsupervised training for image retrieval.
In Proc. ICCV, 2015.First, we remark that the kernel is homogeneous such that for every patch and scalar ,
Thus, we may assume to have unit -norm without loss of generality and perform the projection on of the normalized patch, before applying the inverse rescaling.
Then, let us denote by the orthogonal projection of a patch with unit -norm defined as
which is equivalent to
After short calculation, we obtain
since the vectors provided by the spherical K-means algorithm have unit -norm. Assuming to be invertible, we have . After projection, normalized patches , may be parametrized by and , respectively. Then, we have
which is the desired result.
When is not invertible or simply badly conditioned, it is also common to use instead
where is a small regularization that improves the condition number of . Such a modification can be interpreted as performing a slightly regularized projection onto the finite-dimensional subspace .
To compute the gradient of the loss function, we use Lemma 1 and start by analyzing the effect of perturbing every quantity involved in (11) such that we may obtain the desired relations (8) and (9). Before proceeding, we recall the definition of the set and the precise definition of the Landau notation , which we use in (8). Here, it simply means a quantity that is negligible in front of the norm —that is,
Then, we start by initializing a recursion: is unaffected by the perturbation and thus . Consider now an index and assume that (8) holds for with .
First, we remark that
Then, the diagonal matrix becomes after perturbation
The inverse diagonal matrix becomes
and the matrix becomes
where we have used the relation . Note that the quantities that we have introduced are all . Then, by replacing the quantities by their perturbed versions in the definition of given in (11), we obtain that is equal to
Then, after short calculation, we obtain the desired relation with
First, we remark that , which is one of the induction hypothesis we need. Then, after plugging in the values of , and with further simplification, we obtain
where is the -th feature map of before the -th linear pooling step—that is, . We now see that is linear in and , which guarantees that there exist two linear functions that satisfy (9). More precisely, we want for all matrix of the same size as
and
Then, it is easy to obtain the form of given in (12), by using in the right order the following elementary calculus rules: (i) , (ii) , (iii) for any matrices of appropriate sizes, and also (iv) , by definition of the adjoint operator. We conclude by induction.
In this section, we present a preconditioning heuristic for optimizing over the sphere , inspired by second-order (Newton) optimization techniques on smooth manifolds absil2009optimization . Following absil2009optimization , we will consider gradient descent steps on the manifold. A fundamental operation is thus the projection operator onto the tangent space at a point . This operator is defined for the sphere by
for any vector in . Another important operator is the Euclidean projection on , which was denoted by in previous parts of the paper.
When optimizing on a manifold, the natural descent direction is the projected gradient . In the case of the sphere, a gradient step on the manifold is equivalent to a classical projected gradient descent step in with particular step size:
For unconstrained optimization in , faster convergence is usually achieved when one has access to an estimate of the inverse of the Hessian —assuming twice differentiability—and using the descent direction instead of ; then, we obtain a Newton method. When the exact Hessian is not available, or too costly to compute and/or invert, it is however common to use instead a constant estimate of the inverse Hessian, denoted here by , which we call pre-conditioning matrix. Finding an appropriate matrix is difficult in general, but for learning linear models, a typical choice is to use the inverse covariance matrix of the data (or one approximation). In that case, the preconditioned gradient descent step consists of the update . Such a matrix is defined similarly in the context of convolutional kernel networks, as explained in the main part of the paper. A useful interpretation of preconditioning is to see it as optimizing after a change of variable. Define indeed the objective
Then, minimizing is equivalent to minimizing with respect to , with the relation . Moreover, when there is no constraint on and , the regular gradient descent algorithm on is equivalent to the preconditioned gradient descent on :
We remark that the Hessian is equal to , which is equal to identity when coincides with the inverse Hessian of . In general, this is of course not the case, but the hope is to obtain a Hessian that is better conditioned than , thus resulting in faster convergence.
Unfortunately, using second-order information (or simply a pre-conditioning matrix) when optimizing over a constraint set or over a smooth manifold is not as simple as optimizing in since the quantities may not be feasible descent directions. However, the point of view that sees pre-conditioning as a change of variable will give us the right direction to follow.
Optimizing on is in fact equivalent to optimizing on the smooth manifold
which represents an ellipsoid. The tangent plane at a point of being defined by the normal vector , it is then possible to introduce the projection operator on the tangent space:
Then, we may define the gradient descent step rule on as
With the change of variable , this is equivalent to
This is exactly the update rule we have chosen in our paper, as a heuristic in a stochastic setting.
We present a quantitative comparison in Table 3 using the structural similarity index measure (SSIM), which is known to better reflect the quality perceived by humans than the PSNR; it is commonly used to evaluate the quality of super-resolution methods, see dong2015image ; timofte2013anchored ; wang2015deep . Then, we present a visual comparison between several approaches in Figures 2, 3, and 4. We focus notably on the classical convolutional neural network of dong2015image since our pipeline essentially differs in the use of our supervised kernel machine instead of convolutional neural networks. After subjective evaluation, we observe that both methods perform equally well in textured areas. However, our approach recovers better thin high-frequency details, such as the eyelash of the baby in the first image. By zooming on various parts, it is easy to notice similar differences in other images. We also observed a few ghosting artefacts near object boundaries with the method of dong2015image , which is not the case with our approach.
Fact. | Dataset | Bicubic | SC zeyde2010single | ANR timofte2013anchored | A+timofte2013anchored | CNN1 dong2014learning | CNN2 dong2015image | CSCN wang2015deep | SCKN |
---|---|---|---|---|---|---|---|---|---|
x2 | Set5 | 0.9299 | 0.9492 | 0.9499 | 0.9544 | 0.9521 | 0.9542 | 0.9552 | 0.9580 |
Set14 | 0.8689 | 0.8989 | 0.9004 | 0.9056 | 0.9037 | 0.9067 | 0.9074 | 0.9115 | |
Kodim | 0.8684 | 0.8990 | 0.9007 | 0.9075 | 0.9043 | 0.9068 | 0.9104 | 0.9146 | |
x3 | Set5 | 0.8677 | 0.8959 | 0.8959 | 0.9088 | 0.9025 | 0.9090 | 0.9144 | 0.9165 |
Set14 | 0.7741 | 0.8074 | 0.8092 | 0.8188 | 0.8148 | 0.8215 | 0.8238 | 0.8297 | |
Kodim | 0.7768 | 0.8066 | 0.8084 | 0.8175 | 0.8109 | 0.8174 | 0.8222 | 0.8283 |
Bicubic | Sparse coding zeyde2010single | CNN2 dong2015image | SCKN (Ours) |
Bicubic | Sparse coding zeyde2010single | CNN2 dong2015image | SCKN (Ours) |
Bicubic | Sparse coding zeyde2010single | CNN2 dong2015image | SCKN (Ours) |
Comments
There are no comments yet.