1 Introduction
Given the class information of training data, metric learning methods for dimensionality reduction and data visualization essentially learn a linear or nonlinear transformation from a high-dimensional input feature space to a low-dimensional embedding space, aiming to increase the similarity between pairwise data points from the same class while decreasing the similarity between pairwise data points from different classes in the embedding space. These methods, in combination with kNN, have been widely used in many applications including computer vision, information retrieval, and bioinformatics. Recent surveys on metric learning can be found in [Kulis2013, Bellet et al.2013]. However, most of these approaches, including the popular Maximally Collapsing Metric Learning (MCML) [Globerson and Roweis2006], Neighborhood Component Analysis (NCA) [Goldberger et al.2004], and Large-Margin Nearest Neighbor (LMNN) [Weinberger and Saul2009], need to model neighborhood structures by comparing pairwise training data points, either for learning parameters or for constructing target neighborhoods in the input feature space. This results in quadratic computational complexity, requires careful tuning and heuristics to obtain approximate solutions in practice, and thus limits the methods' scalability. Moreover, during testing, kNN is often employed to compare each test data point against all training data points in the input feature or embedding space, which is also expensive in terms of both computational cost and required resources. In addition, many previous methods, e.g., MCML, on one extreme, focus on learning a Mahalanobis metric, which is equivalent to learning a linear feature transformation matrix and is thus incapable of achieving the goal of collapsing classes. On the other extreme, nonlinear metric learning methods based on deep neural networks, such as dt-MCML and dt-NCA [Min et al.2010], are powerful but very hard to learn and require complicated procedures such as tuning network architectures and tuning many hyperparameters. For data embedding and visualization purposes, most users are reluctant to go through these complicated procedures, which explains why dt-MCML and dt-NCA were not widely used although they are much more powerful than the simpler MCML, NCA, and LMNN.
To address the aforementioned issues of previous metric learning methods for dimensionality reduction and data visualization, in this paper we present an exemplar-centered supervised shallow parametric data embedding model based on a Maximally Collapsing Metric Learning objective and Student t-distributions. Our model learns a shallow high-order parametric embedding function that is as powerful as a deep neural network but much easier to learn. Moreover, during training, our model avoids pairwise training data comparisons and compares training data only with some jointly learned exemplars or precomputed exemplars from supervised k-means centers, resulting in an objective function with linear computational complexity with respect to the size of the training set. In addition, during testing, our model only compares each test data point against a very small number of exemplars. As a result, our model in combination with kNN accelerates kNN using high-dimensional input features by hundreds of times, owing to the benefits of both dimensionality reduction and sample size reduction, and achieves much better performance. Perhaps surprisingly, in terms of both accuracy and testing speed, our shallow model based on precomputed exemplars significantly outperforms the state-of-the-art deep embedding method dt-MCML. We also empirically observe that, using a very small number of exemplars randomly sampled from the training data, our model can still achieve competitive classification performance. We call our proposed model exemplar-centered High Order Parametric Embedding (en-HOPE).
Our contributions in this paper are summarized as follows: (1) We propose a scalable metric learning strategy for data embedding with an objective function of linear computational complexity, avoiding pairwise training data comparisons; (2) Our method compares test data only with a small number of exemplars and speeds up kNN by hundreds of times; (3) Our approach learns a simple shallow high-order parametric embedding function, beating state-of-the-art embedding models on several benchmark datasets in terms of both speed and accuracy.
2 Related Work
Metric learning methods and their applications have been comprehensively surveyed in [Kulis2013, Bellet et al.2013]. Among them, our proposed method en-HOPE is closely related to the ones that can be used for dimensionality reduction and data visualization, including MCML [Globerson and Roweis2006], NCA [Goldberger et al.2004], LMNN [Weinberger and Saul2009], nonlinear LMNN [Kedem et al.2012], and their deep learning extensions such as dt-MCML [Min et al.2010], dt-NCA [Min et al.2010], and DNet-kNN [Min et al.2009]. en-HOPE is also related to neighborhood-modeling dimensionality reduction methods such as LPP [He and Niyogi2003], t-SNE [van der Maaten and Hinton2008], its parametric implementation SNE-encoder [Min2005], and its deep parametric implementation pt-SNE [van der Maaten2009]. The objective functions of all these related methods have at least quadratic computational complexity with respect to the size of the training set, due to the pairwise training data comparisons required for either loss evaluations or target neighborhood constructions. Our work is also closely related to the RVML method [Perrot and Habrard2015], which suffers from the same scalability issues as MCML.

en-HOPE is closely related to a recent sample compression method called Stochastic Neighbor Compression (SNC) [Kusner et al.2014] for accelerating kNN classification in a high-dimensional input feature space. SNC learns a set of high-dimensional exemplars by optimizing a modified objective function of NCA. en-HOPE differs from SNC in several aspects: First, their objective functions are different; Second, en-HOPE learns a nonlinear metric based on a shallow model for dimensionality reduction and data visualization, whereas SNC does not have such capabilities; Third, en-HOPE does not necessarily learn exemplars, which can instead be precomputed. We will compare en-HOPE to SNC in the experiments to evaluate the compression ability of en-HOPE; however, the focus of en-HOPE is data embedding and visualization, not sample compression in a high-dimensional space.
en-HOPE learns a shallow parametric embedding function by considering high-order feature interactions. High-order feature interactions have been studied for learning Boltzmann Machines, autoencoders, structured outputs, feature selection, and biological sequence classification [Memisevic2011, Min et al.2014b, Min et al.2014a, Ranzato and Hinton2010, Ranzato et al.2010, Guo et al.2015, Purushotham et al.2014, Kuksa et al.2015]. To the best of our knowledge, our work here is the first successful one to model input high-order feature interactions for supervised data embedding and exemplar learning.

3 Method
In this section, we first introduce MCML and dt-MCML. Then we describe our shallow parametric embedding function based on high-order feature interactions. Finally, we present our scalable model en-HOPE.
3.1 A Shallow Parametric Embedding Model for Maximally Collapsing Metric Learning
Given a set of data points D = {x_i, c_i: i = 1, ..., n}, where x_i ∈ R^H is the input feature vector, c_i ∈ {1, ..., C} is the class label of a labeled data point, and C is the total number of classes, MCML learns a Mahalanobis distance metric to collapse all data points in the same class to a single point and push data points from different classes infinitely far apart. Learning a Mahalanobis distance metric can be thought of as learning a linear feature transformation y = f(x) = Ax from the high-dimensional input feature space to a low-dimensional latent embedding space, where A ∈ R^{d×H} and d < H. For data visualization, we often set d = 2.

MCML assumes that q_{j|i}, the probability of each data point i choosing every other data point j as its nearest neighbor in the latent embedding space, follows a Gaussian distribution,

q_{j|i} = exp(-d_{ij}) / Σ_{k≠i} exp(-d_{ik}),  q_{i|i} = 0,   (1)

and

d_{ij} = ||f(x_i) - f(x_j)||² = ||A x_i - A x_j||².   (2)
To maximally collapse classes, MCML minimizes the sum of the Kullback-Leibler divergences between the conditional probabilities q_{j|i} computed in the embedding space and the "ground-truth" probabilities p_{j|i} calculated based on the class labels of training data. Specifically, p_{j|i} ∝ 1 iff c_i = c_j and p_{j|i} = 0 iff c_i ≠ c_j. Formally, the objective function of MCML is as follows:

ℓ = Σ_i KL(p_{·|i} || q_{·|i}) ∝ -Σ_i Σ_{j≠i} δ(c_i, c_j) log q_{j|i},   (3)

where δ(c_i, c_j) is an indicator function that equals 1 iff c_i = c_j and 0 otherwise.
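As a concrete illustration of the objective above, the following is a minimal NumPy sketch (our own illustrative code, not the authors' implementation) that computes the Gaussian neighbor probabilities of Equations 1-2 and the class-collapsing loss of Equation 3 for a linear map A; the function name and the numerical-stability shift are our assumptions.

```python
import numpy as np

def mcml_loss(A, X, labels):
    """Gaussian neighbor probabilities and the class-collapsing
    KL objective of MCML (quadratic in the number of points n)."""
    Y = X @ A.T                                   # linear embedding, n x d
    # squared pairwise distances d_ij = ||y_i - y_j||^2
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)                  # enforce q_{i|i} = 0
    logits = -sq
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)             # rows of q_{j|i}
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    # objective (up to constants): -sum over same-class pairs of log q_{j|i}
    return -np.log(q[same] + 1e-12).sum()
```

Note that evaluating this loss requires the full n×n distance matrix, which is exactly the quadratic cost discussed above.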
However, learning a Mahalanobis metric requires solving a positive semidefinite programming problem, which is computationally prohibitive and prevents MCML from scaling to fairly big datasets. Moreover, a linear feature transformation is very constrained and makes it impossible for MCML to achieve its goal of collapsing classes. dt-MCML extends MCML in two aspects: (1) it learns a powerful deep neural network to parameterize the feature transformation function f(x); (2) it uses a symmetric heavy-tailed t-distribution to compute q_{j|i} for supervised embedding, owing to its capabilities of reducing overfitting, creating tight clusters, increasing class separation, and easing gradient optimization. Formally, this stochastic neighborhood metric first centers a t-distribution over f(x_i), and then computes the density of f(x_j) under the distribution as follows:

q_{j|i} = (1 + d_{ij})^{-1} / Σ_{k≠i} (1 + d_{ik})^{-1},  q_{i|i} = 0.   (4)
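For comparison with the Gaussian case, the heavy-tailed kernel of Equation 4 can be sketched as follows (our own illustrative code; the function name is hypothetical). Its (1 + d)^{-1} weights decay polynomially rather than exponentially, which is what helps create tight clusters and large class separations.

```python
import numpy as np

def t_neighbor_probs(Y):
    """Heavy-tailed neighbor probabilities: q_{j|i} proportional to
    (1 + ||y_i - y_j||^2)^{-1}, with q_{i|i} = 0."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    w = 1.0 / (1.0 + sq)                  # t-kernel weights
    np.fill_diagonal(w, 0.0)              # q_{i|i} = 0
    return w / w.sum(axis=1, keepdims=True)
```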
Although dt-MCML based on a deep neural network has a powerful nonlinear feature transformation, parameter learning is hard and requires complicated procedures such as tuning network architectures and tuning many hyperparameters. Most users who are only interested in data embedding and visualization are reluctant to go through these complicated procedures. Here we propose to use high-order feature interactions, which often capture structural knowledge of input data, to learn a shallow parametric embedding model instead of a deep model. The shallow model is much easier to train and does not have many hyperparameters. In the following, we present the shallow high-order parametric embedding function. We expand each input feature vector to have an additional component of 1 for absorbing bias terms, that is, x = [x; 1] with x ∈ R^{H+1}. A K-th order feature interaction is the product of K possible features x_{i_1} x_{i_2} ⋯ x_{i_K}, where i_1, ..., i_K ∈ {1, ..., H+1}. Ideally, we would use each K-th order feature interaction as a coordinate and then learn a linear transformation to map all these high-order feature interactions to a low-dimensional embedding space. However, it is very expensive to enumerate all possible K-th order feature interactions; even for moderate H and K, we must deal with an intractably high-dimensional vector of high-order features. We approximate a sigmoid-transformed high-order feature mapping by constrained tensor factorization as follows (derivations omitted due to space constraints),

f(x) = Vᵀ σ(h),  h_j = b_j + Σ_{f=1}^{F} W_{jf} (C_fᵀ x)^K,  j = 1, ..., m,   (5)

where b is a bias vector with j-th component b_j, C is a factorization matrix, C_f is the f-th column of C, W and V are projection matrices, h_j is the j-th component of h, F is the number of factors, m is the number of high-order hidden units, and σ(·) is the logistic sigmoid. Because the last component of x is 1 for absorbing bias terms, the full polynomial expansion of (C_fᵀ x)^K essentially captures all orders of input feature interactions up to order K. Empirically, we find that a small interaction order K works best for all datasets we have and keep it fixed for all our experiments. The hyperparameters F and m are set by users.
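A minimal sketch of the high-order embedding function of Equation 5, assuming the shapes implied by the surrounding description (C: (H+1)×F, W: m×F, V: m×d, b: length m); the function names and shape conventions are our own assumptions, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hope_embed(X, C, W, V, b, K=2):
    """Shallow high-order parametric embedding (sketch of Eq. 5).

    X: n x H inputs; a constant-1 column is appended to absorb biases.
    Each factor response (C_f^T x)^K implicitly mixes all feature
    interactions up to order K because of the appended 1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # n x (H+1)
    G = (Xb @ C) ** K                              # n x F factor responses
    H_units = sigmoid(G @ W.T + b)                 # n x m hidden units
    return H_units @ V                             # n x d embedding
```

Note how the factorization avoids ever materializing the exponentially long vector of explicit K-th order interactions: the cost is linear in F and m.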
Combining Equation 3, Equation 4, and the feature transformation function in Equation 5 leads to a method we call High Order Parametric Embedding (HOPE). Like MCML and dt-MCML, the objective function of HOPE involves comparing pairwise training data and thus has quadratic computational complexity with respect to the sample size. The parameters of HOPE are learned by Conjugate Gradient Descent.
3.2 en-HOPE for Data Embedding and Fast kNN Classification
Building upon HOPE for data embedding and visualization as described above, we present two related approaches to implementing en-HOPE, resulting in an objective function with linear computational complexity with respect to the size of the training set. The underlying intuition is that, instead of comparing pairwise training data points, we compare training data only with a small number of exemplars in the training set to achieve the goal of collapsing classes, collapsing all training data to the points defined by the exemplars. In the first approach, we simply precompute the exemplars by supervised k-means and only update the parameters of the embedding function during training. In the second approach, we simultaneously learn exemplars and embedding parameters during training. During testing, fast kNN classification can be efficiently performed in the embedding space against a small number of exemplars, which is especially beneficial when the dataset is huge.
Given the same dataset with the formal descriptions introduced in Section 3.1, we aim to obtain z exemplars from the whole dataset, with their designated class labels uniformly sampled from the training set to account for the data label distribution, where z is a user-specified free parameter and z ≪ n. We denote these exemplars by {e_j, c_j^(e): j = 1, ..., z}. In the first approach, we perform k-means on the training data of each class to identify the same number of exemplars per class as in the sampling step (please note that k-means often converges within a dozen iterations and shows linear computational cost in practice). Then we minimize the following objective function to learn the high-order embedding parameters while keeping the exemplars fixed,
ℓ = -Σ_{i=1}^{n} Σ_{j=1}^{z} δ(c_i, c_j^(e)) log q_{j|i},   (6)

where i indexes training data points, j indexes exemplars, θ denotes the high-order embedding parameters in Equation 5, δ(c_i, c_j^(e)) is calculated in the same way as described previously, but q_{j|i} is calculated with respect to the exemplars,

q_{j|i} = (1 + d_{ij})^{-1} / Σ_{k=1}^{z} (1 + d_{ik})^{-1},   (7)

d_{ij} = ||f(x_i) - f(e_j)||²,   (8)

where f(·) denotes the high-order embedding function as described in Equation 5. Note that unlike the probability distribution in Equation 4, here q_{j|i} is computed using only the pairwise distances between training data points and exemplars. This small modification has significant benefits. Because z ≪ n, compared to the quadratic computational complexity with respect to n of Equation 3, the objective function in Equation 6 has linear computational complexity with respect to n. In the second approach, we jointly learn the high-order embedding parameters and the exemplars simultaneously by optimizing the objective function in Equation 6. The derivative of the above objective function with respect to exemplar e_j is as follows,

∂ℓ/∂e_j = -2 (∂f(e_j)/∂e_j)ᵀ Σ_{i=1}^{n} (δ(c_i, c_j^(e)) - s_i q_{j|i}) (1 + d_{ij})^{-1} (f(x_i) - f(e_j)),   (9)

where s_i = Σ_{j'=1}^{z} δ(c_i, c_{j'}^(e)) denotes the number of exemplars sharing the class label of x_i.
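To make the complexity argument concrete, the exemplar-centered loss can be sketched as follows; note that the distance matrix is n×z rather than n×n. The code is our own illustration and assumes the training points and exemplars have already been mapped to the embedding space.

```python
import numpy as np

def enhope_loss(Y, E, y_labels, e_labels):
    """Exemplar-centered collapsing objective (sketch of Eq. 6):
    each of the n embedded points is compared only with z exemplar
    embeddings, so the cost is O(n*z) rather than O(n^2).

    Y: n x d embedded training points; E: z x d embedded exemplars."""
    sq = ((Y[:, None, :] - E[None, :, :]) ** 2).sum(-1)   # n x z distances
    w = 1.0 / (1.0 + sq)                                  # t-kernel
    q = w / w.sum(axis=1, keepdims=True)                  # q_{j|i}
    same = y_labels[:, None] == e_labels[None, :]
    return -np.log(q[same] + 1e-12).sum()
```

In the second approach, the gradient with respect to each exemplar could equivalently be obtained by automatic differentiation of this loss composed with the embedding function.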
In both approaches to implementing en-HOPE, all the model parameters are learned using Conjugate Gradient Descent. We call the first approach en-HOPE (k-means exemplars) and the second approach en-HOPE (learned exemplars).
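The per-class k-means step of the first approach can be sketched as below; this is our own illustration (function name and convergence settings are assumptions), and in practice any standard k-means implementation run independently on each class would do.

```python
import numpy as np

def supervised_kmeans_exemplars(X, labels, per_class, iters=20, seed=0):
    """Per-class (supervised) k-means: run plain k-means independently
    on the points of each class and use the centers as fixed exemplars."""
    rng = np.random.RandomState(seed)
    exemplars, ex_labels = [], []
    for c in np.unique(labels):
        Xc = X[labels == c]
        # initialize centers from random class members
        idx = rng.choice(len(Xc), per_class, replace=False)
        centers = Xc[idx].astype(float)
        for _ in range(iters):
            # assign each point to its nearest center, then update centers
            assign = ((Xc[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for k in range(per_class):
                if np.any(assign == k):
                    centers[k] = Xc[assign == k].mean(0)
        exemplars.append(centers)
        ex_labels.extend([c] * per_class)
    return np.vstack(exemplars), np.array(ex_labels)
```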
Table 1: Test error rates (%) of 5NN on two-dimensional embeddings generated by different linear and nonlinear methods on MNIST, USPS, and 20 Newsgroups.

| Linear Method | MNIST | USPS | 20 Newsgroups | NonLinear Method | MNIST | USPS | 20 Newsgroups |
|---|---|---|---|---|---|---|---|
| LPP | 47.20 | 34.77 | 24.64 | pt-SNE | 9.90 | 17.90 | 28.90 |
| NCA | 45.91 | 37.17 | 30.84 | dt-NCA | 3.48 | 5.11 | 25.85 |
| MCML | 35.67 | 44.60 | 26.65 | dt-MCML | 3.35 | 4.07 | 21.10 |
| LMNN | 56.28 | 48.40 | 29.15 | HOPE | 3.20 | 3.03 | 20.05 |
Table 2: Test error rates (%) in two-dimensional space on MNIST using 512-dimensional VGG features.

| Method | Error Rate |
|---|---|
| VGG + LMNN | 1.75 |
| VGG + NCA | 1.83 |
| VGG + MCML | 0.80 |
| VGG + HOPE | 0.65 |
| VGG + LMNN (sk-means) | 2.22 |
| VGG + NCA (sk-means) | 2.18 |
| VGG + en-HOPE (10 k-means exemplars) | 0.67 |
| VGG + en-HOPE (10 learned exemplars) | 0.66 |
| VGG + en-HOPE (20 k-means exemplars) | 0.64 |
| VGG + en-HOPE (20 learned exemplars) | 0.68 |
| VGG + en-HOPE (10 random exemplars) | 0.68 |
4 Experiments
In this section, we evaluate the effectiveness of HOPE and en-HOPE by comparing them against several baseline methods on three datasets, i.e., MNIST, USPS, and 20 Newsgroups. The MNIST dataset contains 60,000 training and 10,000 test gray-level 784-dimensional images. The USPS dataset contains 11,000 gray-level 256-pixel images, with 8,000 for training and 3,000 for testing. The 20 Newsgroups dataset is a collection of 16,242 newsgroup documents, among which we use 15,000 for training and the rest for testing, as in [van der Maaten2009].
To evaluate whether our proposed shallow high-order parametric embedding function is powerful enough, we first compare HOPE with four linear metric learning methods, including LPP, LMNN, NCA, and MCML, and three deep learning methods without convolutions, including a deep unsupervised model pt-SNE, as well as two deep supervised models, i.e., dt-NCA and dt-MCML. To keep the computational and tuning procedures for data visualization simple, none of these models was pretrained using any unsupervised learning strategy, although HOPE, en-HOPE, dt-NCA, and dt-MCML could all be pretrained by autoencoders or variants of Restricted Boltzmann Machines [Min et al.2010, Kuksa et al.2015]. We set the number of exemplars to 10 and 20 in all our experiments, and the number of nearest neighbors used by kNN was set according to the number of exemplars. We used a portion of the training data as a validation set to tune the number of factors F, the number of high-order hidden units m, and the batch size. For HOPE and en-HOPE, we used the same F and m for all the datasets. In practice, we find that a small feature interaction order often works best across applications. The parameters of all baseline methods were carefully tuned to achieve the best results.
4.1 Classification Performance of High-order Parametric Embedding
Table 1 presents the test error rates of a 5-nearest-neighbor classifier on the 2-dimensional embeddings generated by HOPE and the baseline methods. The error rate is calculated as the number of misclassified test data points divided by the total number of test data points. We chose 2D, as in pt-SNE, because we can effectively visualize and intuitively understand the quality of the constructed embeddings, as will be presented and discussed later in this section. The results in Table 1 indicate that HOPE significantly outperforms its linear and nonlinear competitors on all three datasets. Due to its non-scalability, the original MCML fails to run on the MNIST dataset; we therefore implemented an improved version of MCML that directly learns a linear feature transformation matrix based on conjugate gradient descent. Promisingly, the results in Table 1 suggest that our shallow model HOPE even outperforms embedding models based on deep neural networks, in terms of the accuracy obtained on the 2-dimensional embedding for visualization. For example, on MNIST, the error rate of HOPE (3.20%) is lower than those of the pt-SNE, dt-NCA, and dt-MCML methods. These results clearly demonstrate the representational efficiency and power of supervised shallow models with high-order feature interactions.

To further confirm the representational power of HOPE, we extracted the 512-dimensional features of MNIST digits below the softmax layer learned by the well-known deep convolutional architecture VGG [Simonyan and Zisserman2015], which currently holds state-of-the-art classification performance through a softmax layer on MNIST. Next, we ran HOPE on these features to generate 2D embeddings. As shown in the top part of Table 2, VGG+HOPE achieves an error rate of 0.65%. In contrast, NCA and LMNN on top of VGG produce test error rates of 1.83% and 1.75%, respectively. This error rate of HOPE represents a historically low test error rate in two-dimensional space on MNIST, which implies that, even on top of a powerful deep convolutional network, modeling explicit high-order feature interactions can further improve accuracy and outperform all other models without feature interactions.

4.2 Experimental Results for Different Methods with Exemplar Learning
Table 3: Test error rates (%) of kNN on two-dimensional embeddings generated by baseline methods trained with supervised k-means exemplars ("sk-means + method") and by en-HOPE, on MNIST, USPS, and 20 Newsgroups.

| Dataset | sk-means + method | Error | en-HOPE variant | Error |
|---|---|---|---|---|
| MNIST | LPP | 45.13 | en-HOPE (10 k-means exemplars) | 2.86 |
| MNIST | NCA | 50.67 | en-HOPE (10 learned exemplars) | 2.80 |
| MNIST | LMNN | 59.67 | en-HOPE (20 k-means exemplars) | 2.72 |
| MNIST | pt-SNE | 18.86 | en-HOPE (20 learned exemplars) | 2.66 |
| MNIST | dt-MCML | 3.17 | en-HOPE (10 random exemplars) | 3.19 |
| USPS | LPP | 33.23 | en-HOPE (10 k-means exemplars) | 2.96 |
| USPS | NCA | 35.13 | en-HOPE (10 learned exemplars) | 2.67 |
| USPS | LMNN | 59.67 | en-HOPE (20 k-means exemplars) | 2.83 |
| USPS | pt-SNE | 29.47 | en-HOPE (20 learned exemplars) | 3.03 |
| USPS | dt-MCML | 4.27 | en-HOPE (10 random exemplars) | 3.10 |
| 20 Newsgroups | LPP | 33.09 | en-HOPE (10 k-means exemplars) | 18.27 |
| 20 Newsgroups | NCA | 36.71 | en-HOPE (10 learned exemplars) | 18.84 |
| 20 Newsgroups | LMNN | 38.24 | en-HOPE (20 k-means exemplars) | 19.64 |
| 20 Newsgroups | pt-SNE | 33.17 | en-HOPE (20 learned exemplars) | 18.44 |
| 20 Newsgroups | dt-MCML | 21.90 | en-HOPE (10 random exemplars) | 18.84 |
In this section, we evaluate the performance of en-HOPE for data embedding, data visualization, and fast kNN classification. Table 3 presents the classification error rates of kNN on the 2-dimensional embeddings generated by en-HOPE with the two proposed exemplar-learning approaches. Exemplar-based en-HOPE consistently achieves better performance than HOPE in Table 1. To construct stronger baselines, we ran supervised k-means to obtain exemplars and trained each baseline method independently; during testing, we only use these k-means centers for comparisons with test data. We call these experiments "sk-means + methods". Please note that the "sk-means + methods" heuristics have objective functions with the same quadratic computational complexity as the original baseline methods and are thus not scalable to big datasets. To test whether en-HOPE can indeed effectively collapse classes, we also randomly selected data points from each class as fixed exemplars and then learned the high-order embedding function of en-HOPE. The results in Table 3 suggest the following: when coupled with exemplars, en-HOPE significantly outperforms the other baseline methods, including the deep embedding models; even with randomly sampled exemplars, for example, one exemplar per class on MNIST and USPS, en-HOPE with an objective function of linear computational complexity can still achieve very competitive performance compared to the baseline methods, demonstrating the effectiveness of our proposed shallow high-order model coupled with exemplars for collapsing classes. The bottom part of Table 2 again verifies the additional gain of our shallow high-order model en-HOPE on top of an established deep convolutional neural network.
4.2.1 Two-dimensional Data Embedding Visualization
Figure 1 shows the embeddings of MNIST test data produced by different methods. These embeddings were constructed by, respectively, MCML, dt-MCML, en-HOPE with 20 learned exemplars, and en-HOPE with 10 learned exemplars. The 20 learned exemplars overlap in the two-dimensional space. en-HOPE produced the best visualization: it collapsed all the data points in the same class close to each other and generated large separations between class clusters. Furthermore, the embeddings of the learned exemplars created during training (depicted as red empty circles in subfigures (c) and (d)) are located almost at the centers of all the clusters.
4.2.2 Computational Efficiency of en-HOPE for Sample Compression
Table 4: Speedup of en-HOPE (20 learned exemplars) over standard kNN, and test error rates (%) of en-HOPE in two-dimensional space vs. kNN in the high-dimensional input space.

| Dataset | MNIST | USPS | 20 Newsgroups |
|---|---|---|---|
| Speedup (times) | 463 | 28 | 101 |
| Error rate of en-HOPE in 2D space | 2.66 | 3.03 | 18.44 |
| Error rate of kNN in high-D space | 3.05 | 4.77 | 25.12 |
en-HOPE speeds up fast information retrieval tasks such as the kNN classification used in the above experiments by hundreds of times. Table 4 shows the experimentally observed computational speedup of en-HOPE over standard kNN on our desktop with an Intel Xeon 2.60GHz CPU and 48GB of memory on different datasets. The test error rates of kNN in the high-dimensional feature space are much worse than those produced by en-HOPE, even though the latter operates in a much lower feature dimension, i.e., the two-dimensional latent space. In detail, on our desktop, classifying the 10,000 MNIST test data points takes standard kNN 124.97 seconds, but our method en-HOPE with 20 learned exemplars takes only 0.24 seconds, including the time for computing the two-dimensional embeddings of the test data. In other words, en-HOPE obtains a 463-times speedup over standard kNN along with much better classification performance. This computational speedup will be even more pronounced on massive datasets.
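The test-time procedure behind these speedups is simply kNN against the z embedded exemplars instead of all n training points; a sketch with our own (hypothetical) function names:

```python
import numpy as np

def exemplar_knn_predict(Y_test, E, e_labels, k=1):
    """Classify embedded test points against z exemplar embeddings.

    Per test point, the cost drops from O(n) distance computations in
    the high-dimensional input space to O(z) in the low-dimensional
    embedding space (z << n).
    Y_test: n_test x d embedded test points; E: z x d embedded exemplars."""
    sq = ((Y_test[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # n_test x z
    nn = np.argsort(sq, axis=1)[:, :k]        # k nearest exemplars
    votes = e_labels[nn]
    # majority vote among the k nearest exemplars
    return np.array([np.bincount(v).argmax() for v in votes])
```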
4.2.3 Comparisons of en-HOPE with SNC and dt-MCML
Stochastic Neighbor Compression (SNC) [Kusner et al.2014] is a leading sample compression method in high-dimensional input feature space. In contrast to en-HOPE, SNC can only achieve up to a 136-times speedup over kNN with comparable performance on MNIST, and it requires at least 600 learned exemplars to do so [Kusner et al.2014]; that is, its compression ratio is about 30 times lower than that of en-HOPE. Part of the reason is that SNC is not designed for data embedding and visualization and is thus unable to compress the dataset through dimensionality reduction. This explanation is further supported by the following experimental observations. When using 20 learned exemplars in the high-dimensional input feature space, SNC produced test error rates of 6.31% on MNIST and 17.50% on USPS, which are much higher than those of en-HOPE. Also, if we pre-project the data to two-dimensional space by other methods such as PCA or LMNN and then run SNC, the results of SNC should be much worse than the ones in the high-dimensional input feature space. Although the focus of en-HOPE is not sample compression but data embedding and visualization by collapsing classes, when we embed the MNIST data into a 10-dimensional latent space using en-HOPE with 20 exemplars, we can further reduce the test error rate (from % to %).
We also further evaluated the performance of our shallow model en-HOPE with 20 learned exemplars against the deep method dt-MCML on the MNIST data. Compared to dt-MCML, en-HOPE achieves a 316-times speedup for classifying the MNIST test data in 2D, owing to its exemplar-learning functionality. It is also worth mentioning that, although both methods have the overhead of computing the 2D embeddings of test data, en-HOPE is two times faster than dt-MCML on this overhead, owing to its shallow architecture.
5 Conclusion and Future Work
In this paper, we presented an exemplar-centered supervised shallow parametric data embedding model, en-HOPE, which collapses classes for data visualization and fast kNN classification. Owing to the benefit of a small number of precomputed or learned exemplars, en-HOPE avoids pairwise training data comparisons and only has linear computational cost for both training and testing. Experimental results demonstrate that en-HOPE accelerates kNN classification by hundreds of times, outperforms state-of-the-art supervised embedding methods, and effectively collapses classes to produce impressive two-dimensional data visualizations, in terms of both classification performance and visual effects.
In the future, we aim to extend our method to the unsupervised learning setting to increase the scalability of traditional t-SNE, for which we just need to compute the pairwise probabilities using high-dimensional feature vectors instead of class labels and optimize the exemplars accordingly.
References
 [Bellet et al.2013] Aurelien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
 [Globerson and Roweis2006] Amir Globerson and Sam Roweis. Metric learning by collapsing classes. In Proceedings of Advances in Neural Information Processing Systems 21, pages 451–458. MIT Press, Cambridge, MA, 2006.
 [Goldberger et al.2004] Jacob Goldberger, Sam T. Roweis, Geoffrey E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Proceedings of Advances in Neural Information Processing Systems 19, pages 513–520. 2004.
 [Guo et al.2015] Hongyu Guo, Xiaodan Zhu, and Martin Renqiang Min. A deep learning model for structured outputs with high-order interaction. CoRR, abs/1504.08022, 2015.
 [He and Niyogi2003] Xiaofei He and Partha Niyogi. Locality preserving projections. In Proceedings of Advances in Neural Information Processing Systems 16, 2003.
 [Kedem et al.2012] Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, and Kilian Q Weinberger. Nonlinear metric learning. In Proceedings of Advances in Neural Information Processing Systems 25, pages 2573–2581. 2012.
 [Kuksa et al.2015] Pavel P. Kuksa, Martin R. Min, Rishabh Dugar, and Mark Gerstein. High-order neural networks and kernel methods for peptide-MHC binding prediction. Bioinformatics, 31(22):3600–3607, 2015.

 [Kulis2013] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
 [Kusner et al.2014] Matt J. Kusner, Stephen Tyree, Kilian Q. Weinberger, and Kunal Agrawal. Stochastic neighbor compression. In Proceedings of the 31st International Conference on Machine Learning, pages 622–630, 2014.
 [Memisevic2011] Roland Memisevic. Gradient-based learning of higher-order image features. In ICCV, pages 1591–1598, 2011.
 [Min et al.2009] Renqiang Min, David A. Stanley, Zineng Yuan, Anthony Bonner, and Zhaolei Zhang. A deep non-linear feature mapping for large-margin kNN classification. In ICDM, pages 357–366. IEEE, 2009.
 [Min et al.2010] Martin Renqiang Min, Laurens van der Maaten, Zineng Yuan, Anthony J. Bonner, and Zhaolei Zhang. Deep supervised tdistributed embedding. In Proceedings of the 27th International Conference on Machine Learning, pages 791–798, 2010.
 [Min et al.2014a] Martin Renqiang Min, Salim Chowdhury, Yanjun Qi, Alex Stewart, and Rachel Ostroff. An integrated approach to bloodbased cancer diagnosis and biomarker discovery. In Pacific Symposium on Biocomputing (PSB), pages 87–98, 2014.

 [Min et al.2014b] Martin Renqiang Min, Xia Ning, Chao Cheng, and Mark Gerstein. Interpretable sparse high-order Boltzmann machines. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 614–622, 2014.
 [Min2005] Martin Renqiang Min. A non-linear dimensionality reduction method for improving nearest neighbour classification. Master's thesis, Department of Computer Science, University of Toronto, 2005.
 [Perrot and Habrard2015] Michaël Perrot and Amaury Habrard. Regressive virtual metric learning. In NIPS15, Montreal, Quebec, Canada, pages 1810–1818, 2015.
 [Purushotham et al.2014] S. Purushotham, M. R. Min, C.-C. Jay Kuo, and R. Ostroff. Factorized sparse learning models with interpretable high-order feature interactions. In KDD, New York, USA, 2014.
 [Ranzato and Hinton2010] Marc'Aurelio Ranzato and Geoffrey E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, 2010.
 [Ranzato et al.2010] Marc'Aurelio Ranzato, Alex Krizhevsky, and Geoffrey E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 621–628, 2010.
 [Simonyan and Zisserman2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In Proceedings of International Conference on Learning Representations, 2015.
 [van der Maaten and Hinton2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9:2579–2605, 2008.
 [van der Maaten2009] Laurens van der Maaten. Learning a parametric embedding by preserving local structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 384–391, 2009.
 [Weinberger and Saul2009] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009.