1 Introduction
kNN is one of the most popular classification methods due to its simplicity and reasonable effectiveness: it doesn’t require fitting a model, and it has been shown to perform well on many types of data. However, the classification performance of kNN depends heavily on the metric used to compute pairwise distances between data points. In practice, we often use Euclidean distance as the similarity metric to find the k nearest neighbors of a data point of interest. To classify high-dimensional data in real applications, we often need to learn or choose a good distance metric.
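As a concrete illustration of the baseline discussed above (not part of the proposed method; all names are illustrative), plain kNN classification with Euclidean distances can be sketched in a few lines of NumPy:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest
    training points under Euclidean distance."""
    preds = []
    for x in X_test:
        # squared Euclidean distances from x to all training points
        d2 = np.sum((X_train - x) ** 2, axis=1)
        nn = np.argsort(d2)[:k]              # indices of the k nearest neighbors
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])   # majority vote
    return np.array(preds)
```

Everything that follows in this paper amounts to learning a feature space in which this simple rule works well.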
Previous work on metric learning in [21] and [7] learns a global linear transformation matrix in the original feature space that, using additional similarity or label information, pulls similar data points closer together while pushing dissimilar data points farther apart. In [6], a global linear transformation of the original feature space is learned to obtain a Mahalanobis metric, which requires all data points in the same class to collapse to a single point. Such collapsing is unnecessary for kNN classification, and it may produce poor performance when the data cannot essentially be collapsed to points, which is often the case for classes containing multiple patterns. An information-theoretic approach is used to learn linear transformations in [4]. In [13], a global linear transformation is learned to directly improve kNN classification by achieving a large margin. This method has been shown to yield significant improvement over kNN classification, but the linear transformation often fails to give good performance in high-dimensional spaces, and a preprocessing dimensionality-reduction step by PCA is often required for success. In many situations, a linear transformation is not powerful enough to capture the underlying class-specific data manifold; we then need to resort to more powerful nonlinear transformations, so that each data point stays closer to its nearest neighbors of the same class than to any other data points in the nonlinearly transformed feature space. Kernel tricks have been used to kernelize some of the above methods in order to improve kNN classification [6, 4]. The method in [17] extends the work in [13] to perform linear dimensionality reduction for large-margin kNN classification and also kernelizes the method in [13]. However, kernel-based approaches behave almost like template-based approaches: if the chosen kernel does not reflect the true class-related structure of the data, the resulting performance will be poor. Besides, kernel-based approaches often have difficulty handling large datasets.
We might want to achieve nonlinear mappings by learning a directed multilayer belief net or a deep autoencoder, and then perform kNN classification using the hidden distributed representations of the original input data. However, a multilayer belief net often suffers from the “explaining away” effect: the top hidden units become dependent conditional on the bottom visible units, which makes inference intractable. Learning a deep autoencoder with backpropagation alone is almost impossible because the gradient backpropagated to the lower layers from the output often becomes very noisy and meaningless. Fortunately, recent research has shown that training a deep generative model called a Deep Belief Net is feasible by pretraining the deep net using a type of undirected graphical model called a Restricted Boltzmann Machine (RBM)
[11]. RBMs produce “complementary priors” that make inference in a deep belief net much easier, and the deep net can be trained greedily layer by layer using the simple and efficient RBM learning rule. This greedy layer-wise pretraining strategy has made learning models with deep architectures possible [14, 1]. Moreover, the greedy pretraining idea has also been successfully applied to initialize the weights of a deep autoencoder to learn a very powerful nonlinear mapping for dimensionality reduction, as illustrated in Fig. 1a) and 1b). Besides, the idea of deep learning has motivated researchers to use powerful generative models with deep architectures to learn better discriminative models
[20]. In this paper, by combining the ideas of deep learning and large-margin discriminative learning, we propose a new kNN classification and supervised dimensionality-reduction method called DNet-kNN. It learns a nonlinear feature transformation that directly achieves the goal of large-margin kNN classification, based on a Deep Encoder Network pretrained with RBMs as shown in Fig. 2. Our approach is mainly inspired by the work in [13], [17] and [12]. Given the labels of some or all training data, it allows us to learn a nonlinear feature mapping that minimizes the invasion of each data point’s genuine neighborhood by impostor nearest neighbors, which favours kNN classification directly. Previous researchers have used an autoencoder or a deep autoencoder for nonlinear dimensionality reduction to improve kNN [12, 15], but none of these approaches used an objective function as directly tied to kNN classification as the one we use here. The approach in [3] uses a convolutional net to learn a similarity metric discriminatively, but its architecture was hand-crafted. Our approach, based on general deep neural networks, is more flexible, and the connection weight matrices between layers are learned automatically from data.
We applied DNet-kNN to the USPS and MNIST handwritten digit datasets for classification. The test error we obtained on the MNIST benchmark dataset is 0.94%, which is better than that obtained by the deep belief net, the deep autoencoder and SVMs [5, 12, 11]. In addition, our fine-tuning process is very fast and converges to a good local minimum within several iterations of conjugate-gradient updates. Our experimental results show that: (1) a good generative model can be used as a pretraining stage to improve discriminative learning; (2) pretraining with generative models in a layer-wise greedy way makes it possible to learn a good discriminative model with a deep architecture; (3) pretraining with RBMs makes the discriminative learning process much faster than without pretraining; (4) pretraining helps to find a much better local minimum than without pretraining. These conclusions are consistent with the results of previous research on deep networks [14, 1, 20, 11, 12].
We organize this paper as follows: in section 2, we introduce kNN classification using linear transformations in a large-margin framework. In section 3, we describe previous work on RBMs and on training models with deep architectures. In section 4, we present DNet-kNN, which trains a Deep Encoder Network to improve large-margin kNN classification. In section 5, we present our experimental results on the USPS and MNIST [19] handwritten digit datasets. In section 6, we conclude the paper with some discussion and propose possible extensions of our current method.
2 Large-margin kNN classification using linear transformation
In this section, we review the large-margin framework for kNN classification described in [13]. Given a set of data points {x_1, ..., x_n} ⊂ R^d with class labels y_i ∈ {1, ..., C} for the labeled data points, where C is the total number of classes, and additional neighborhood information η_ij ∈ {0, 1}, with η_ij = 1 if x_j is one of x_i's target neighbors, we seek a distance function D(x_i, x_j) for pairwise data points x_i and x_j such that the given neighborhood information is preserved in the transformed feature space corresponding to the distance function. If D is based on Mahalanobis distances, then it admits the following form:
D(x_i, x_j) = ||A x_i − A x_j||^2 = (x_i − x_j)^T A^T A (x_i − x_j),   (1)
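The squared Mahalanobis distance ||A x_i − A x_j||^2 in Eq. (1) can be evaluated directly; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mahalanobis_sq(A, xi, xj):
    """Squared distance of Eq. (1): ||A xi - A xj||^2, equivalently
    (xi - xj)^T M (xi - xj) with M = A^T A positive semidefinite."""
    diff = A @ (xi - xj)     # map the difference through the linear transform
    return float(diff @ diff)
```

With A equal to the identity matrix this reduces to the ordinary squared Euclidean distance.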
where A is a linear transformation matrix. Based on the goal of margin maximization, we learn the parameters of the distance function, A, such that, for each data point x_i, the distance between x_i and each data point x_l from another class is at least 1 plus the largest distance between x_i and its target neighbors. Using a binary matrix Y, with y_il = 1 indicating that x_i and x_l are in the same class and y_il = 0 otherwise for the labeled data points, we can formulate the above problem as an optimization problem:
min_A  Σ_{ij} η_ij ||A(x_i − x_j)||^2 + c Σ_{ijl} η_ij (1 − y_il) h(x_i, x_j, x_l),   (2)
where c is a penalty coefficient penalizing constraint violations, and h(x_i, x_j, x_l) = [1 + ||A(x_i − x_j)||^2 − ||A(x_i − x_l)||^2]_+ is a hinge loss function with [z]_+ = max(z, 0). If A is a square d × d matrix, this problem corresponds to the work in [13]; if A is a rectangular r × d matrix with r < d, it corresponds to the work in [17]. When a non-square matrix A is learned for dimensionality reduction, the resulting problem is non-convex, and stochastic gradient descent or conjugate gradient descent is often used to solve it. When A is constrained to be a full-rank square matrix, we can solve for M = A^T A directly and the resulting problem is convex. Alternating projection or simple gradient-based methods can be applied here [13].
3 RBM and Deep Neural Network
On large datasets, the rich information in data features often enables us to build powerful generative models that learn the constraints and structures underlying the given data. The learned information often reveals the characteristics of data points belonging to different classes. In [12], it is shown that a deep belief net composed of stacked Restricted Boltzmann Machines (RBMs) can perform handwritten digit classification remarkably well [16]. An RBM is an undirected graphical model with one visible layer v and one hidden layer h. There are symmetric connections between the hidden layer and the visible layer, but no within-layer connections. For an RBM with stochastic binary visible units v and stochastic binary hidden units h, the joint probability distribution of a configuration (v, h) is defined through its energy as follows:
E(v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_{ij} v_i h_j w_ij,   (3)
P(v, h) = exp(−E(v, h)) / Z,   (4)
where a_i and b_j are biases, w_ij are connection weights, and Z is the partition function with Z = Σ_{v,h} exp(−E(v, h)). A good property of the RBM structure is that, given the visible states, the hidden units are conditionally independent, and given the hidden states, the visible units are conditionally independent:
p(h_j = 1 | v) = σ(b_j + Σ_i v_i w_ij),   (5)
p(v_i = 1 | h) = σ(a_i + Σ_j h_j w_ij),   (6)
where σ(x) = 1 / (1 + exp(−x)). This beneficial property allows us to get unbiased samples from the posterior distribution of the hidden units given an input data vector. By minimizing the negative log-likelihood of the observed input data vectors using gradient descent, the update rule for the weight w_ij turns out to be
Δw_ij = ε (⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model),   (7)
where ε is the learning rate, ⟨·⟩_data denotes the expectation with respect to the data distribution and ⟨·⟩_model denotes the expectation with respect to the model distribution. In practice, we do not have to sample from the equilibrium distribution of the model; even one-step reconstruction samples work very well [9]:
Δw_ij = ε (⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon).   (8)
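As a rough illustration of Eqs. (5)–(8) and of the greedy layer-wise stacking discussed in this section, the following toy NumPy sketch trains a small stack of binary RBMs with one-step contrastive divergence; all names, sizes and learning rates are illustrative, not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, n_epochs=5, eps=0.1):
    """Toy CD-1 trainer (Eqs. 5-8): up-pass, one-step reconstruction,
    up-pass again, then the contrastive-divergence weight update."""
    n, n_vis = data.shape
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    a = np.zeros(n_vis)   # visible biases
    b = np.zeros(n_hid)   # hidden biases
    for _ in range(n_epochs):
        ph0 = sigmoid(data @ W + b)                      # Eq. 5
        h0 = (rng.random(ph0.shape) < ph0).astype(float) # binary hidden sample
        pv1 = sigmoid(h0 @ W.T + a)                      # Eq. 6: reconstruction
        ph1 = sigmoid(pv1 @ W + b)                       # up-pass on reconstruction
        W += eps * (data.T @ ph0 - pv1.T @ ph1) / n      # Eq. 8
        a += eps * (data - pv1).mean(axis=0)
        b += eps * (ph0 - ph1).mean(axis=0)
    return W, b

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pretraining: train an RBM, treat its hidden
    probabilities as data for the next RBM, and collect the weights."""
    stack, x = [], data
    for n_hid in layer_sizes:
        W, b = train_rbm(x, n_hid)
        stack.append((W, b))
        x = sigmoid(x @ W + b)   # hidden representation becomes new data
    return stack
```

The collected weight matrices can then initialize the corresponding layers of an encoder network before discriminative fine-tuning.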
Although the above update rule does not follow the gradient of the log-likelihood of the data exactly, it approximately follows the gradient of another objective function [2]. In [11], it is shown that a deep belief net based on stacked RBMs can be trained greedily layer by layer: given some observed input data, we train an RBM to obtain hidden representations of the data; we can then view the learned hidden representations as new data and train another RBM, repeating this procedure many times. It is shown in [11] that, under this greedy training strategy, we always get better hidden representations of the original input data if the number of features in the added layer does not decrease, and a variational lower bound on the log-likelihood of the observed input data never decreases. In [12], the greedy training strategy is used to initialize the weights of a deep autoencoder as shown in Fig. 1a), and backpropagation is then used to tune the weights of the network as shown in Fig. 1b). In this case the lower-bound guarantee no longer holds, but the greedy pretraining still works very well in practice [12].
4 Large-margin kNN classification using deep neural networks
The work in [12] and [11] makes full use of the capabilities of generative models, but label information is only weakly used. In the following, we describe DNet-kNN, in which we use stacked RBMs to initialize the weights of an encoder with 4 hidden layers as shown in Fig. 2. We then fine-tune the weights of the encoder by minimizing the following objective function:
d_f(x_i, x_j) = ||f(x_i) − f(x_j)||^2,   (9)
L = Σ_{ijl} η_ij s_il [1 + d_f(x_i, x_j) − d_f(x_i, x_l)]_+,   (10)
where s_il = 1 if and only if x_l is an impostor neighbour of x_i, which will be discussed in detail later. The definition of η_ij is the same as in section 2. The function f is a continuous nonlinear mapping: each component of f(x) is a continuous function of the input vector x, and the parameters of f are the connection weight matrices of a deep neural network (for example, the encoder shown in Fig. 1). This objective differs from Eq. 2 in two ways. First, the distance between data points x_i and x_j is computed as the Euclidean distance between the feature vectors output by the top layer of the encoder in Fig. 1. Secondly, the objective function focuses on maximizing the margin and neglects the term reducing the distance between nearest neighbors. Additionally, unlike Hinton’s deep autoencoder, we no longer minimize the reconstruction error, since we found in practice that this criterion reduced the ability of the code vectors to accurately describe subtle differences between classes.
To reduce the complexity of the backpropagation training, we use simplified versions of η and s in the objective function (10), as compared to those described in Section 2. For each index i, η_ij = 1 only if x_j is one of x_i's top k nearest neighbors among the data points having the same class label as x_i (in-class nearest neighbors). In contrast, for each i, the x_l's for which s_il = 1 are selected from the set of impostor nearest neighbors of x_i, which is the union of the m nearest neighbors from each and every class other than the class of x_i. For example, in the case of digit recognition with ten classes, there are a total of 9m impostor x_l's for each data point x_i. This method of choosing impostor nearest neighbors is well suited to kNN classification because, by selecting impostor neighbors from every other class, we help ensure that all potential competitors are pushed out of the neighborhood.
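The neighbor selection described above can be sketched as follows; this is a naive O(n^2) NumPy version, and the function name and per-class impostor count are illustrative:

```python
import numpy as np

def neighbor_sets(X, y, k=5):
    """For each point, return its k in-class nearest neighbors (targets)
    and, for each other class, its k nearest neighbors (impostors),
    using Euclidean distances in the input (pixel) space."""
    n = X.shape[0]
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbor
    targets, impostors = [], []
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        targets.append(same[np.argsort(d2[i, same])[:k]])
        imp = []
        for c in np.unique(y):
            if c == y[i]:
                continue              # impostors come from other classes only
            other = np.where(y == c)[0]
            imp.extend(other[np.argsort(d2[i, other])[:k]])
        impostors.append(np.array(imp))
    return targets, impostors
```

Because the distances are taken in the input space, these sets are computed once and reused throughout fine-tuning, as noted in the text below.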
Let z_i = f(x_i) be the low-dimensional code vector of x_i generated by the Deep Encoder Network. The time complexity of computing Eq. 10 then scales with the number of retained triples, roughly 9km per data point, which is a significant improvement over Eq. 2, whose sums range over all pairs of data points. For the purposes of calculating nearest neighbors and impostor nearest neighbors, we use Euclidean distances in the pixel space. This means that the η_ij and s_il do not need to be recalculated each time the code vectors are updated. Unfortunately, due to the nonlinear mapping, ordinary data points in the pixel space may become impostors in the code space and will then not be taken into account in the objective function. However, the mapping is likely to be quasi-linear, so by taking a large value for m we capture most of the impostors in the code space, as evidenced by our low kNN classification errors. In our experiments, we use k = 5 in-class nearest neighbors.
To speed up the computation of the objective function and its gradient, a matrix of triples (i, j, l) was generated, representing all allowed index combinations in Eq. 10 for which η_ij and s_il are nonzero. In the triples matrix, the entries in the 2nd column are the in-class nearest neighbors of the entries in the first column, and the entries in the 3rd column are the impostor nearest neighbors of the entries in the first column. The triples matrix is used in calculating both the gradient of the objective function and the value of the objective function itself.
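Building the triples matrix from the neighbor sets is straightforward; a minimal sketch, assuming `targets[i]` and `impostors[i]` hold the index arrays described above (names illustrative):

```python
import numpy as np

def build_triples(targets, impostors):
    """Enumerate all (i, j, l) with eta_ij = 1 and s_il = 1: column 1 is
    the point, column 2 an in-class nearest neighbor, column 3 an
    impostor nearest neighbor."""
    rows = []
    for i, (tgt, imp) in enumerate(zip(targets, impostors)):
        for j in tgt:
            for l in imp:
                rows.append((i, j, l))
    return np.array(rows)
```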
The gradient of the objective function with respect to the code vector z_i = f(x_i) is given by:
∂L/∂z_i = 2 Σ_{(i,j,l)} δ_ijl (z_l − z_j) − 2 Σ_{(a,i,l)} δ_ail (z_a − z_i) + 2 Σ_{(a,j,i)} δ_aji (z_a − z_i),   (11)
where the first sum ranges over triples whose first index is i, the second over triples whose second index is i, the third over triples whose third index is i, and δ_ijl is the flag for margin violations: δ_ijl = 1 if 1 + d_f(x_i, x_j) − d_f(x_i, x_l) > 0 and δ_ijl = 0 otherwise.
While this equation is unwieldy to implement directly in Matlab, the triples matrix makes the computation much easier. In Eq. 11, we calculate each sum individually, using the triples matrix to determine the appropriate indices, and then combine them. For example, to determine the value of the first summation term, we search the triples matrix for all triples that yield a margin violation (δ = 1) and then keep those that have index i in their first column. This specific set of triples gives the appropriate indices for the first sum: the second column supplies the j index values and the third column the l index values. The same strategy is repeated for the second and third summations.
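The per-triple accumulation just described can be sketched directly: each violating triple contributes its three partial derivatives to the corresponding code vectors. A NumPy sketch (illustrative; `F` holds the code vectors z_i row-wise):

```python
import numpy as np

def margin_gradient(F, triples):
    """Gradient of sum over triples of [1 + ||f_i-f_j||^2 - ||f_i-f_l||^2]_+
    with respect to the code vectors F, accumulated only over triples
    for which the hinge is active (margin violated)."""
    G = np.zeros_like(F)
    for i, j, l in triples:
        d_ij = np.sum((F[i] - F[j]) ** 2)
        d_il = np.sum((F[i] - F[l]) ** 2)
        if 1.0 + d_ij - d_il > 0:          # margin violation: delta = 1
            G[i] += 2.0 * (F[l] - F[j])    # d/dz_i term
            G[j] += -2.0 * (F[i] - F[j])   # d/dz_j term
            G[l] += 2.0 * (F[i] - F[l])    # d/dz_l term
    return G
```

This gradient with respect to the top-layer codes is then backpropagated through the encoder in the usual way.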
5 Experimental Results
We test our model DNet-kNN for both classification and dimensionality reduction on two handwritten digit datasets: USPS and MNIST. We demonstrate two different types of classification: standard kNN and minimum-energy classification. For standard kNN, after we finish learning the nonlinear mapping by discriminative fine-tuning, we directly compute pairwise Euclidean distances between code vectors for kNN classification; this is what we denote by DNet-kNN. Alternatively, for minimum-energy classification, after we calculate the feature vectors of the training and test data, we predict the class label of a test data point as the class whose assignment gives the test point the lowest energy as defined by Eq. 10. This minimum-energy classification is denoted by “E” in the experimental results. In both the USPS and MNIST experiments, we use the same neighbor settings as described in Section 4.
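One plausible reading of the minimum-energy rule, assigning the class whose hypothesized label yields the lowest hinge energy under Eq. (10), can be sketched as follows; the neighbor counts and energy bookkeeping here are our own assumptions for illustration, not the paper's exact specification:

```python
import numpy as np

def min_energy_label(z_test, Z_train, y_train, k=5):
    """Assign the class whose hypothesized label gives the lowest
    hinge energy (in the spirit of Eq. 10) for code vector z_test."""
    d2 = np.sum((Z_train - z_test) ** 2, axis=1)
    energies = {}
    for c in np.unique(y_train):
        # k nearest in-class distances act as target-neighbor terms,
        # k nearest out-of-class distances as impostor terms
        d_tgt = np.sort(d2[y_train == c])[:k]
        d_imp = np.sort(d2[y_train != c])[:k]
        energies[c] = sum(max(0.0, 1.0 + dj - dl)
                          for dj in d_tgt for dl in d_imp)
    return min(energies, key=lambda c: (energies[c], c))
```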
5.1 Experimental Results on USPS Dataset
We downloaded the USPS digit dataset from a public link (http://www.cs.toronto.edu/~roweis/data/usps_all.mat). From this dataset, several different preparations are used. The first preparation is USPS-fixed, which takes the first 800 data points from each of the ten digit classes to create an 8000-point training set. The test set for USPS-fixed then consists of a further 3000 data points, with 300 from each class.
The second to sixth preparations, called USPS-random1 to USPS-random5, are obtained from USPS-fixed by randomly shuffling the data points of each class between the training and test sets.
In Figures 3 and 4, we show the training and test errors for codes of different dimensionality. In all cases, DNet-kNN classification outperforms the deep autoencoder (DA). Furthermore, as the dimensionality of the codes increases, the classification accuracy increases; this trend continues from d = 2 up to d = 15, and then levels off.
Figures 5 and 6 compare DNet-kNN 2D dimensionality reduction with the deep autoencoder. DNet-kNN clearly produces superior clustering of the classes in two-dimensional space. Some class overlaps remain, however, because the backpropagation algorithm we use to optimize for kNN classification is not the best choice for improving visualization: the objective function chooses the set of impostor nearest neighbours (the allowed x_l's) in the pixel space rather than the code space (see Section 4). Visualization requires reduction to very low-dimensional spaces, and the mapping from pixel space to code space must become highly nonlinear as the dimensionality is reduced. The pixel space therefore becomes a poorer proxy for spatial relationships in the code space, and the choice of impostor nearest neighbours becomes less reliable for visualization.
Table 1: Training error (%) on the USPS-random datasets.

Dataset        kNN   DA    LMNN  DNet-kNN
USPS-random1   5.12  2.66  0.76  0.00
USPS-random2   5.09  2.35  1.10  0.00
USPS-random3   4.95  2.28  0.71  0.00
USPS-random4   5.08  2.24  0.85  0.00
USPS-random5   4.93  2.48  0.95  0.01
Table 2: Test error (%) on the USPS-random datasets.

Dataset        kNN   DA    LMNN  LMNN-E  DNet-kNN  DNet-kNN-E
USPS-random1   4.47  2.80  2.20  1.77    1.20      1.43
USPS-random2   4.93  2.36  2.13  1.53    0.87      0.97
USPS-random3   5.23  2.33  2.36  1.80    1.43      1.50
USPS-random4   4.17  1.93  2.33  1.80    1.06      1.20
USPS-random5   5.37  2.23  1.93  1.63    1.13      1.00
Tables 1 and 2 show, respectively, the training and test errors on the USPS-random datasets. DNet-kNN outperforms the other methods in nearly every case.
5.2 Experimental Results on MNIST Dataset
This section deals with the MNIST dataset, another digit set available online (http://yann.lecun.com/exdb/mnist/). This dataset contains 60,000 training samples and 10,000 test samples. For the USPS dataset, it was possible to do both the pretraining and the backpropagation on a single batch of data. Given the size of the MNIST dataset, however, the training data had to be broken into smaller batches of 10,000 randomly selected data points, and RBM training and backpropagation were applied iteratively to each batch. In all our experiments, the batch size was set to 10,000.
Figures 7 and 8 show the mapping of the MNIST dataset onto a reduced space using DNet-kNN and the deep autoencoder. As with the USPS dataset, DNet-kNN shows a significant improvement over the deep autoencoder.
Table 3: Test error (%) on the MNIST dataset.

Method                                            Error
DNet-kNN (dim = 30, batch size = 10k)             0.94
DNet-kNN-E (dim = 30, batch size = 10k)           0.95
Nonlinear NCA based on a deep autoencoder [15]    1.03
Deep Belief Net [11]                              1.25
SVM: degree 9 [5]                                 1.4
kNN                                               3.05
LMNN (dim = 30)                                   2.62
LMNN-E (dim = 30)                                 1.58
Table 3 compares the classification error of DNet-kNN with other common classification techniques on the MNIST dataset. Despite the fact that we must use batches, DNet-kNN still produces the best classification results. This indicates that the DNet-kNN classifier is highly robust, since it performs well even when limited to seeing only part of the dataset at any one time.
Finally, it is worth noting that, unlike the deep autoencoder, the fine-tuning of the DNet-kNN classifier during backpropagation converges extremely fast: the error often reaches a minimum after three to five epochs. This is because the RBM pretraining provides a very good starting point, and because we use a supervised learning objective, as opposed to the unsupervised objective of the deep autoencoder.
6 Discussions and future research
In this paper, we present a new nonlinear feature mapping method called DNet-kNN that uses a deep encoder network pretrained with RBMs to achieve the goal of large-margin kNN classification. Our experimental results on USPS and MNIST handwritten digits show that DNet-kNN is powerful for both classification and nonlinear embedding. Our results suggest that pretraining with a good generative model is very helpful for learning a good discriminative model: the pretraining makes discriminative learning much faster and, especially in a deep architecture, often helps it find a much better local minimum than is found without pretraining. Our findings are consistent with the ideas discussed in [10].
On huge datasets, the current implementation of our method works only with mini-batches. We essentially compute the genuine nearest neighbors and impostor nearest neighbors within each mini-batch, which may not be optimal over the whole dataset. In the future, we plan to develop a dynamic version, in which the mini-batches change dynamically during training and we dynamically update the true nearest neighbors and impostor nearest neighbors of each data point. Additionally, we plan to use the label information of the training data to constrain the distances between pairs of data points in the same class. For example, we can add a penalty term using supervised stochastic neighbor embedding (SNE) [8] or t-SNE [18] to constrain the within-class distances.
Acknowledgement
We thank Geoff Hinton for his guidance and inspiration. We thank Lee Zamparo for proofreading the manuscript and Jin Ke for drawing Figures 1 and 2.
References
 [1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layerwise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.
[2] M. Á. Carreira-Perpiñán and G. E. Hinton. On contrastive divergence learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 10, 2005.
[3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546, 2005.
[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML ’07: Proceedings of the 24th International Conference on Machine Learning, pages 209–216, New York, NY, USA, 2007. ACM.
[5] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.
[6] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS 18, pages 451–458. MIT Press, Cambridge, MA, 2006.
 [7] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, NIPS 17, pages 513–520. MIT Press, Cambridge, MA, 2005.
[8] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pages 833–840. MIT Press, 2003.
 [9] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comp., 14(8):1771–1800, August 2002.
 [10] G. E. Hinton. To recognize shapes, first learn to generate images. Progress in brain research, 165:535–547, 2007.
 [11] G. E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, 2006.
 [12] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
[13] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS 18. MIT Press, 2006.
[14] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML 2007, pages 473–480, 2007.
 [15] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
[16] Y. W. Teh and G. E. Hinton. Rate-coded restricted Boltzmann machines for face recognition. In NIPS, pages 908–914, 2000.
[17] L. Torresani and K.-C. Lee. Large margin component analysis. In B. Schölkopf, editor, NIPS 19. MIT Press, 2007.
[18] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.
[19] H. Wang and S. Bengio. The MNIST database of handwritten upper-case letters. IDIAP-Com 04, IDIAP, 2002.
[20] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[21] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS 15. MIT Press, 2003.