1 Introduction
High-order feature interactions naturally exist in many kinds of real-world data, including images, documents, financial time series, biological sequences, and medical records, among many others. These interplays often convey essential information about the latent structure of the datasets of interest. For data embedding and visualization, therefore, it is crucial to exploit these high-order characteristic features when constructing the dimensionality reduction function.
Recently, supervised deep learning models have made promising progress on sensory data with strong regularities, such as images and speech, by generating powerful, complex parametric embedding functions that capture high-order feature interactions through deep architectures. Current state-of-the-art deep strategies, however, fail to deploy an explicit high-order parametric form to map high-dimensional data to a low-dimensional space. An explicit parametric mapping not only avoids the need for the out-of-sample extensions required by nonparametric methods such as t-SNE van2008visualizing , but also exposes structural information in a form intuitively understandable to humans, enabling people to make sense of the data through visualization or to acquire interpretable knowledge from it. Furthermore, current embedding methods often ignore, or fail to perform, data compression or summarization while generating the embedding. Such functionality is very desirable when dealing with large-scale datasets for fast information retrieval, on which we often perform kNN classification and for which computational efficiency is important.
To address the above challenges, in this paper we present a High-Order Parametric Embedding (HOPE) approach. The aims of HOPE are twofold: learning an explicit high-order parametric embedding function for data visualization, and constructing a small set of synthetic exemplars bearing high-order feature interactions to represent the whole input dataset. Specifically, our approach targets supervised data visualization with two new procedures. First, we linearly map explicit order-m interaction features, which are products of input features, to a two-dimensional space for visualization, such that pairwise data points in the same class stay close together and pairwise data points from different classes stay farther apart. To avoid directly enumerating all possible
feature interactions, which is computationally prohibitive, we propose using tensor factorization to learn a set of feature-interaction filters. As a result, the high-order interactions can not only be preserved in the low-dimensional embedding space, but also be explicitly represented by these feature-interaction filters; consequently, one can directly compute the explicit high-order interactions hidden in the data. Second, we develop exemplar learning techniques to create a small set of exemplars, associated with the embedding, that represent the entire dataset. One can then use just these exemplars, instead of the whole dataset, to perform fast information retrieval such as the widely adopted kNN classification, speeding up computation and offering insight into the characteristic features of the data. This is particularly important when the dataset is massive.
We evaluated the performance of HOPE and its nonlinear extension on the benchmark MNIST and USPS datasets. Our experimental results strongly support the effectiveness and efficiency of our methods for both data visualization and data compression.
2 Related work
Dimensionality reduction and data visualization methods mainly fall into two categories: unsupervised approaches Belkin2003 ; Hinton2003 ; Hotelling1933 ; MinThesis ; Roweis2000 ; Tenenbaum2000 ; Shieh2011 ; Maaten09 ; van2008visualizing and supervised approaches MCML2006 ; NCA2005 ; He2003 ; MinMYBZ10 ; MinICDM ; Weinberger:2009 . Among the supervised approaches, MCML MCML2006 , NCA NCA2005 , LPP He2003 , and LMNN Weinberger:2009 are linear methods, while dt-MCML MinMYBZ10 , dt-NCA MinMYBZ10 , and DNN-kNN MinICDM are deep nonlinear methods. Our method HOPE is a supervised embedding approach, but unlike the above methods, it directly maps explicit high-order interaction features, instead of the original input features, to the low-dimensional embedding space, either through a linear projection or through Sigmoid transformations followed by a linear projection. This simple feature mapping enables users to identify important interaction features. With only a linear projection, HOPE can be viewed as a linear method applied to high-order interaction features, so its baseline counterparts are the linear embedding methods. With a Sigmoid transformation followed by a linear projection, it can be viewed as a shallow nonlinear method. Through HOPE, a small number of exemplars conveying high-order feature interactions are synthesized. It is worth noting that HOPE with the two proposed exemplar learning techniques is similar to, but intrinsically different from, Stochastic Neighbor Compression (SNC) SNC2014 . Specifically, exemplar learning in HOPE constructs an embedding mapping that optimizes an objective of maximally collapsing classes MCML2006 , instead of the neighbourhood component analysis NCA2005 objective used in SNC. In particular, unlike in SNC, the joint exemplar learning technique in HOPE is coupled with high-order embedding parameter learning, which empowers the created exemplars to capture essential data variations bearing high-order interactions. In addition, the results of HOPE with the k-means based exemplar learning technique show that, given a powerful feature mapping generated by HOPE with Sigmoid transformations, optimization over the exemplars is unnecessary, which actually runs counter to the motivation of SNC.
High-order feature interactions have been studied for building more powerful generative models such as Boltzmann machines and autoencoders DBLP:journals/corr/GuoZM15 ; DBLP:conf/iccv/Memisevic11 ; DBLP:conf/aistats/MinNCG14 ; DBLP:conf/cvpr/RanzatoH10 ; DBLP:journals/jmlr/RanzatoKH10 . Factorization Machines (FM) Rendle2010 and FHIM Min2014kdd are similar to the version of HOPE with only a linear projection, but they use feature interactions for classification, regression, or feature selection. None of this previous research was conducted in the context of data embedding, visualization, or compression, and it therefore involves different objective functions and parametric forms. In particular, our joint learning approach is completely different from previous methods, and to the best of our knowledge, our work is the first successful attempt to model input feature interactions of order higher than two for practical supervised embedding.
3 High-Order Parametric Embedding
3.1 Supervised high-order parametric embedding by maximally collapsing classes
Given a set of data points $\mathcal{D} = \{x^{(i)}, c_i\}_{i=1}^{n}$, where $x^{(i)} \in \mathbb{R}^d$ is the input feature vector with the last component being $1$ for absorbing bias terms, $c_i \in \{1, \ldots, C\}$ is the class label of the $i$-th labeled data point, and $C$ is the total number of classes, HOPE intends to find a high-order parametric embedding function $f: \mathbb{R}^d \rightarrow \mathbb{R}^h$ that maps high-dimensional data points to a low-dimensional space in which data points in the same class stay tightly close to each other and data points from different classes stay farther apart. For data visualization, we often set $h = 2$. Unlike previous methods that directly embed the original input features $x$, HOPE assumes that high-order feature interactions are essential for capturing structural knowledge and learns a similarity metric directly based on these feature interactions. Suppose that HOPE directly embeds order-$m$ feature interactions, i.e., all possible products $x_{i_1} x_{i_2} \cdots x_{i_m}$ with $i_j \in \{1, \ldots, d\}$ for $j \in \{1, \ldots, m\}$. A straightforward approach is to explicitly calculate all these order-$m$ feature interactions, collect them into a new input feature vector $\phi(x)$, and then learn a linear projection matrix $W$ that maps them to an $h$-dimensional space as follows,

$$y = W \phi(x), \qquad (1)$$
where $W \in \mathbb{R}^{h \times d^m}$ and $y \in \mathbb{R}^h$ is the low-dimensional embedding vector. We can rewrite the above equation in the following equivalent tensor form,
$$y_k = \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} \cdots \sum_{i_m=1}^{d} \mathcal{T}^{(k)}_{i_1 i_2 \cdots i_m}\, x_{i_1} x_{i_2} \cdots x_{i_m}, \qquad (2)$$
where $\mathcal{T}^{(k)}$ is an $m$-way tensor and $k \in \{1, \ldots, h\}$. However, it is very expensive to enumerate all possible order-$m$ feature interactions: for any nontrivial $d$ and $m$, we must deal with a $d^m$-dimensional vector of high-order features. To speed up computation, we factorize the tensor $\mathcal{T}^{(k)}$ as follows,
$$\mathcal{T}^{(k)}_{i_1 i_2 \cdots i_m} = \sum_{f=1}^{F} P_{kf}\, C^{(1)}_{i_1 f} C^{(2)}_{i_2 f} \cdots C^{(m)}_{i_m f}, \qquad (3)$$
where $F$ is the number of factors. If we enforce $C^{(1)} = C^{(2)} = \cdots = C^{(m)} = C$, the $k$-th high-order embedding coordinate in Equation 2 can be rewritten as follows,
$$y_k = \sum_{f=1}^{F} P_{kf} \left( \mathbf{c}_f^{\top} x \right)^{m}, \qquad (4)$$
where $\mathbf{c}_f$ is the $f$-th column of $C$. With the above constrained tensor factorization, we can easily calculate the linear embedding of the high-order interaction features of any high-dimensional data point by an embarrassingly simple operation: a linear projection followed by an element-wise power operation. It is worth noting that the above factorized form not only reduces the computational complexity significantly, but is also amenable to explicitly modeling different orders of feature interactions in the data with the user-specified parameter $m$.
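As a concrete illustration of the factorized mapping in Equation 4, the following NumPy sketch (with toy dimensions of our own choosing, not the paper's settings) computes the embedding and, for order $m = 2$, verifies that it matches the explicit tensor contraction of Equation 2 under the factorization of Equation 3:

```python
import numpy as np

# Toy sizes, chosen for illustration only (not the paper's settings).
d, F, h, m = 6, 4, 2, 2   # input dim, factors, embedding dim, interaction order

rng = np.random.default_rng(0)
C = rng.standard_normal((d, F))   # shared factor matrix; columns are the filters c_f
P = rng.standard_normal((h, F))   # projection from factor outputs to the embedding

def hope_embed(x, C, P, m):
    """Factorized order-m linear embedding: y_k = sum_f P[k, f] * (c_f . x) ** m."""
    return P @ (C.T @ x) ** m

x = rng.standard_normal(d)
y = hope_embed(x, C, P, m)

# For m = 2, the factorization reproduces the explicit 2-way tensor
# T^(k)_{ij} = sum_f P[k, f] C[i, f] C[j, f] contracted with all products x_i x_j.
T = np.einsum('kf,if,jf->kij', P, C, C)
y_explicit = np.einsum('kij,i,j->k', T, x, x)
assert np.allclose(y, y_explicit)
```

The equivalence check makes the cost saving concrete: the explicit tensor has $h d^m$ entries, while the factorized form stores only $C$ and $P$.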
The above HOPE method has an explicit high-order parametric form for the mapping and is essentially equivalent to a linear model over all the explicit high-order feature interactions expanded as shown above. Compared to supervised deep embedding methods with complicated deep architectures, this linear projection method has limited modeling power. Fortunately, there is a very simple way to significantly enhance the model's expressive power: simply add Sigmoid transformations to the factorized model before performing the linear projection. We call the resulting model Sigmoid HOPE (SHOPE). In SHOPE, the $k$-th coordinate of the low-dimensional embedding vector is computed as,
$$y_k = \sum_{f=1}^{F} P_{kf}\, \sigma\!\left( (\mathbf{c}_f^{\top} x)^{m} + b_f \right), \qquad (5)$$
where $b_f$ is a bias term and $\sigma(v) = 1/(1 + e^{-v})$ is the Sigmoid function. SHOPE dramatically improves the modeling power of HOPE with a trivial modification. As shown in the experimental results section, the resulting shallow high-order parametric method even significantly outperforms state-of-the-art deep learning models with many layers for supervised embedding, which clearly demonstrates the representational power of shallow models with high-order feature interactions.
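A minimal sketch of the SHOPE mapping of Equation 5, again with illustrative made-up sizes, showing the Sigmoid applied to each biased order-$m$ factor output before the linear projection:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def shope_embed(X, C, P, b, m):
    """SHOPE (Equation 5): y_k = sum_f P[k, f] * sigmoid((c_f . x)^m + b_f),
    vectorized over a batch of inputs X with shape (n, d)."""
    H = sigmoid((X @ C) ** m + b)   # (n, F) high-order hidden activations
    return H @ P.T                  # (n, h) embeddings

# Illustrative sizes only.
rng = np.random.default_rng(1)
n, d, F, h, m = 5, 8, 10, 2, 3
X = rng.standard_normal((n, d))
C = rng.standard_normal((d, F))
P = rng.standard_normal((h, F))
b = rng.standard_normal(F)

Y = shope_embed(X, C, P, b, m)
assert Y.shape == (n, h)
```

Structurally this is a single hidden layer whose units are order-$m$ polynomial filters rather than ordinary linear ones, which is why the model stays shallow while gaining expressive power.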
Given the high-order feature mapping $f(x^{(i)})$ of the $i$-th data point $x^{(i)}$, we perform supervised metric learning by maximally collapsing classes (MCML) MCML2006 . Following the line of research in MCML2006 ; NCA2005 ; Hinton2003 ; MinMYBZ10 , we deploy a stochastic neighbourhood criterion to compute the pairwise similarities of data points in the transformed space. In this setting, the similarity of two data points $x^{(i)}$ and $x^{(j)}$ is measured by a probability $q_{ij}$, which indicates the chance that the data point $x^{(i)}$ assigns $x^{(j)}$ as its nearest neighbor in the low-dimensional embedding space. Following the work in MinMYBZ10 , we use a heavy-tailed t-distribution to compute $q_{ij}$ for supervised embedding, due to its capabilities of reducing overfitting, creating tight clusters, increasing class separation, and easing gradient optimization. Formally, this stochastic neighborhood metric first centers a t-distribution over $f(x^{(i)})$, and then computes the density of $f(x^{(j)})$ under the distribution as follows,

$$q_{ij} = \frac{(1 + d_{ij}^{2})^{-1}}{\sum_{k \neq l} (1 + d_{kl}^{2})^{-1}}, \qquad q_{ii} = 0, \qquad (6)$$

$$d_{ij} = \left\| f(x^{(i)}) - f(x^{(j)}) \right\|. \qquad (7)$$
To maximally collapse classes, the parameters of (S)HOPE are learned by minimizing the sum of the Kullback-Leibler divergences between the probabilities $q_{ij}$ computed in the embedding space and the "ground-truth" probabilities $p_{ij}$ calculated from the class labels of the training data. Specifically, $p_{ij} \propto 1$ iff $c_i = c_j$, and $p_{ij} = 0$ iff $c_i \neq c_j$. Formally, the objective function of the HOPE method is, up to an additive constant, as follows:

$$\ell = -\sum_{i \neq j} \delta(c_i = c_j) \log q_{ij}, \qquad (8)$$
where $\delta(\cdot)$ is an indicator function. The above objective essentially maximizes the product of the pairwise probabilities between data points in the same class, which creates favorably tight clusters suitable for supervised two-dimensional embedding in the limited accommodable space. We use Conjugate Gradient Descent to optimize this objective function. Although (S)HOPE shares the same objective as MCML MCML2006 and dt-MCML MinMYBZ10 , it learns a shallow explicit high-order embedding function; in contrast, MCML learns a linear mapping over the original input features, while dt-MCML targets a complicated deep nonlinear function parametrized by a deep neural network.
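The heavy-tailed pairwise similarities and the collapsing-classes objective can be sketched directly in NumPy. This is a simplified illustration over a handful of 2-D points (variable names and the toy layouts are ours), checking that collapsed, well-separated classes score a lower loss than a layout where the classes are mixed:

```python
import numpy as np

def mcml_t_loss(Y, labels):
    """Student-t pairwise similarities q_ij (one degree of freedom, as in the
    text) and the objective: minus the sum of log q_ij over same-class pairs."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)             # q_ii = 0
    Q = W / W.sum()                      # normalize over all ordered pairs
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return -np.log(Q[same]).sum()

labels = np.array([0, 0, 1, 1])
# Collapsed classes with large separation vs. classes split across two sites.
tight = np.array([[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.01, 5.0]])
mixed = np.array([[0.0, 0.0], [5.0, 5.0], [0.01, 0.0], [5.01, 5.0]])
assert mcml_t_loss(tight, labels) < mcml_t_loss(mixed, labels)
```

In (S)HOPE the same loss is minimized with respect to the embedding parameters producing `Y`, rather than the points themselves.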
3.2 Scalable exemplar learning for data compression and fast kNN classification
In addition to learning explicit high-order feature interactions for data embedding, we also aim to synthesize a small set of exemplars that do not exist in the training set for data compression, so that fast information retrieval such as kNN classification can be performed efficiently in the embedding space when the dataset is huge. Given the same dataset with the formal descriptions introduced in section 3.1, we aim to learn $z$ exemplars per class with their designated class labels fixed, where $z$ is a user-specified free parameter and $zC \ll n$. We denote these exemplars by $\{e_j\}_{j=1}^{zC}$. We propose two approaches to exemplar learning. The first is straightforward and relies on supervised k-means: we perform k-means on the training data of each class to identify its exemplars. If a powerful feature mapping such as the one learned by SHOPE is available, all the data points in the same class will be mapped to a compact point cloud in the two-dimensional space, so this simple exemplar learning approach achieves excellent performance; otherwise, further optimization over the exemplars is needed. The second approach is based on joint optimization: we jointly learn the high-order embedding parameters and the exemplars by optimizing the following objective function,
$$\min_{\Theta,\, \{e_j\}}\; \ell = \sum_{i=1}^{n} \sum_{j=1}^{zC} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad (9)$$
where $i$ indexes training data points, $j$ indexes exemplars, $\Theta$ denotes the high-order embedding parameters, $p_{ij}$ is calculated in the same way as in section 3.1, but $q_{ij}$ is calculated with respect to the exemplars,
$$q_{ij} = \frac{(1 + d_{ij}^{2})^{-1}}{\sum_{j'=1}^{zC} (1 + d_{ij'}^{2})^{-1}}, \qquad (10)$$
$$d_{ij} = \left\| f(x^{(i)}) - f(e_j) \right\|, \qquad (11)$$
where $f(\cdot)$ denotes the high-order embedding function described in Equations 4 and 5. Please note that, unlike the symmetric probability distribution in Equation 6, the asymmetric $q_{ij}$ here is computed only using the pairwise distances between training data points and exemplars. Because $zC \ll n$, this saves a lot of computation compared to using the original distribution in Equation 6. The derivative of the above objective function with respect to exemplar $e_j$ is as follows,

$$\frac{\partial \ell}{\partial e_j} = 2 \sum_{i=1}^{n} \frac{p_{ij} - q_{ij}}{1 + d_{ij}^{2}} \left( f(e_j) - f(x^{(i)}) \right)^{\top} \frac{\partial f(e_j)}{\partial e_j}, \qquad (12)$$
The derivatives with respect to the other model parameters can be calculated similarly. We update these synthetic exemplars and the embedding parameters of HOPE in a deterministic Expectation-Maximization fashion using Conjugate Gradient Descent. Specifically, the exemplars belonging to each class are initialized by the first exemplar learning approach. During the early phase of the joint optimization of exemplars and high-order embedding parameters, the learning process alternately fixes one set of variables while updating the other. The algorithm then updates all parameters simultaneously until reaching convergence or the specified maximum number of epochs.
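The supervised k-means initialization (the first exemplar-learning approach, also used to seed the joint optimization) can be sketched as follows; the plain Lloyd's k-means helper and the toy two-class data are ours, for illustration only:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; sufficient for per-class exemplar initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def init_exemplars(X, labels, z):
    """Supervised k-means: z centers per class, with each exemplar's label
    fixed to the class it was computed from."""
    exemplars, ex_labels = [], []
    for c in np.unique(labels):
        exemplars.append(kmeans(X[labels == c], z))
        ex_labels += [c] * z
    return np.vstack(exemplars), np.array(ex_labels)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
E, EL = init_exemplars(X, labels, z=3)
assert E.shape == (6, 4) and list(EL) == [0, 0, 0, 1, 1, 1]
```

In the joint approach, `E` would then be refined by gradient steps under Equation 9 while the embedding parameters are updated in alternation.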
Table 1: Test error rates (%) of 5-nearest-neighbor classification on two-dimensional embeddings of MNIST.

Linear Methods          Non-Linear Methods
LDA      52.00          deep AE    24.7
LPP      47.20          pt-SNE      9.90
NCA      45.91          dt-NCA      3.48
MCML     —              dt-MCML     3.35
LMNN     56.28
HOPE      5.96          SHOPE       3.20
4 Experiments
In this section, we evaluate the effectiveness of (S)HOPE by comparing it against nine baseline methods on two handwritten digit datasets, MNIST and USPS. The MNIST dataset contains 60,000 training and 10,000 test gray-level 784-dimensional images. The USPS dataset contains 11,000 gray-level 256-dimensional images, with 8,000 for training and 3,000 for testing. We compare the shallow linear HOPE with its five linear counterparts, LPP, LMNN, NCA, LDA, and MCML; we compare the nonlinear shallow SHOPE against four deep learning baselines, including two deep unsupervised models, a deep autoencoder (deep AE) Bengio2009 and pt-SNE Maaten09 , as well as two deep supervised models, dt-NCA MinMYBZ10 and dt-MCML MinMYBZ10 . We set the number of exemplars to 20 in all our experiments. We used a held-out portion of the training data as a validation set to tune hyperparameters such as the order $m$ of feature interactions, the number of factors $F$, the number of high-order units, the batch size, and the number of iterations of conjugate gradient descent on each minibatch; the selected values differed between MNIST and USPS for both HOPE and SHOPE. The hyperparameters of all baseline methods were carefully tuned to achieve their best results.
4.1 Results on MNIST
4.1.1 Classification performance on 2D embedding
Table 1 presents the test error rates of a 5-nearest-neighbor classifier on the two-dimensional embeddings generated by (S)HOPE and the baseline methods. The results indicate that the linear HOPE, with an error rate of 5.96%, significantly outperforms its linear counterparts LDA, LPP, NCA, and LMNN (all with error rates of roughly 50%).
Promisingly, the results in Table 1 also show that our shallow HOPE with simple Sigmoid transformations, namely SHOPE, significantly outperforms the deep embedding models based on deep neural networks in terms of accuracy on the two-dimensional embedding used for visualization. For example, the error rate of SHOPE (3.20%) is lower than those of deep AE, pt-SNE, dt-NCA, and dt-MCML. These results clearly demonstrate the representational efficiency and power of supervised shallow models with high-order feature interactions.
To further confirm the representational power of HOPE, we extracted the 512-dimensional features below the softmax layer learned by a well-known deep convolutional architecture, VGG Simonyan2015 , which currently holds the state-of-the-art classification performance through softmax layers on MNIST. We then ran SHOPE on these features to generate a 2D embedding. Promisingly, kNN achieves an error rate of 0.65% on it; in contrast, NCA and LMNN on top of VGG produce test error rates of 1.15% and 1.75%, respectively. This error rate of SHOPE represents the historically low test error rate in two-dimensional space on MNIST. This observation implies that, even with the most powerful deep networks, modeling explicit high-order feature interactions can further improve predictive accuracy over models without feature interactions.

4.1.2 Exemplar learning
Table 2: Test error rates (%) of kNN classification on two-dimensional embeddings of MNIST with exemplar learning.

supervised k-means      (S)HOPE with exemplars
LDA      48.80
LPP      45.13          supervised k-means + HOPE            45.29
NCA      50.67          supervised k-means + SHOPE            3.14
LMNN     59.67          HOPE with 20 optimized exemplars      5.52
pt-SNE   18.86          SHOPE with 20 optimized exemplars     3.14
dt-MCML   3.17
In this section, we evaluate the two proposed approaches to generating a small number of exemplars conveying high-order interactions to represent large-scale datasets. Table 2 presents the classification errors of kNN on the two-dimensional embeddings generated by (S)HOPE with the two proposed exemplar learning techniques. These results suggest the following: first, k-means exemplar learning works well only when coupled with SHOPE, which demonstrates the power of the feature mapping learned by SHOPE; second, when coupled with optimized exemplar learning, both HOPE and SHOPE work very well. These observations suggest that a sophisticated exemplar learning method is unnecessary given a powerful feature mapping function such as the one learned by SHOPE.
4.1.3 Exemplar visualization
In Figure 1, we present the 20 optimized exemplars created by the most accurate HOPE and SHOPE models. The figures indicate that SHOPE constructs more representative exemplars than HOPE: the exemplars generated by SHOPE clearly capture global shape information, whereas the exemplars created by HOPE can barely be recognized by humans. Part of the reason is that the former achieves a much lower error rate (3.14%) than the latter (5.52%). Another reason is that HOPE and SHOPE have different foci when optimizing the same cost function in Equation 9. Promisingly, the bottom subfigure clearly shows that these exemplars capture the most important variations in the data, such as the skew and style of different digits. Intuitively, because the exemplars are learned from the entire dataset, they summarize globally essential information about the data; this contrasts with the local knowledge contained in individual digits from a small sample when exploring massive data.
4.1.4 2D embedding visualization
Figure 2 shows the test data embeddings of MNIST. These embeddings were constructed by linear MCML, dt-MCML, HOPE with 20 optimized exemplars, and SHOPE, respectively. SHOPE produced the best visualization: it collapses all data points in the same class close to each other and generates large separations between class clusters. Furthermore, the embeddings of the optimized exemplars created during training (depicted as red empty circles in subfigure (c)) lie almost at the centers of the clusters, which suggests that the synthetic exemplars bear high-order feature interactions capturing essential data variations.
Table 3: Test error rates (%) of kNN classification on two-dimensional embeddings of USPS.

Linear Methods          Non-Linear Methods
LDA      38.23          deep AE    28.43
LPP      34.77          pt-SNE     17.90
NCA      37.17          dt-NCA      5.11
MCML     44.60          dt-MCML     4.07
LMNN     48.40
HOPE      6.90          SHOPE       3.03
Table 4: Test error rates (%) of kNN classification on two-dimensional embeddings of USPS with exemplar learning.

supervised k-means      (S)HOPE with exemplars
LDA      35.23
LPP      33.23          supervised k-means + HOPE            32.97
NCA      35.13          supervised k-means + SHOPE            2.97
LMNN     59.67          HOPE with 20 optimized exemplars      6.90
pt-SNE   29.47          SHOPE with 20 optimized exemplars     3.60
dt-MCML   4.27
4.2 Results on USPS
We also conducted experiments on the USPS dataset. Table 3 presents the performance of kNN classification on two-dimensional embeddings constructed by the various dimensionality reduction techniques, and Table 4 presents the exemplar learning results. From the results in Tables 3 and 4, one can draw conclusions very similar to those on the MNIST data. Visualizations of the learned exemplars and embeddings also show behavior of the (S)HOPE models consistent with that on MNIST. We include all the plotted images in high resolution in the supplementary material.
4.3 Computational efficiency of exemplar learning
SHOPE with exemplar learning speeds up fast information retrieval, such as the kNN classification used in the above experiments, by thousands of times. On MNIST and USPS, kNN in the original feature space predicts test data against, respectively, 60,000 training points in a 784-dimensional space (test error rate 3.05%) and 8,000 training points in a 256-dimensional space (test error rate 4.77%). With SHOPE and 20 synthesized exemplars, test prediction is performed against only 20 exemplars in a two-dimensional space, and even achieves comparable or much better performance than in the original feature space. This computational speedup will be even more pronounced on massive datasets.
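The quoted speedup follows from simple arithmetic over distance computations, under the rough assumption that one distance costs time proportional to the space's dimensionality (this ignores the small fixed cost of embedding each query through SHOPE, so it is a cost-model sketch rather than a wall-clock measurement):

```python
# Rough per-query cost model: one distance ~ O(dim) multiply-adds.
full_knn = 60000 * 784      # against all MNIST training points in pixel space
exemplar_knn = 20 * 2       # against 20 synthetic exemplars in the 2-D embedding

speedup = full_knn / exemplar_knn
assert speedup > 1000       # "thousands of times" faster
print(f"~{speedup:,.0f}x fewer multiply-adds per query")
```

The corresponding USPS ratio, 8,000 × 256 versus 20 × 2, is smaller but still above four orders of magnitude.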
5 Conclusion and future work
In this paper, we presented a supervised High-Order Parametric Embedding (HOPE) approach for data visualization and compression. Our experimental results indicate that modeling high-order feature interactions can significantly improve data visualization in a low-dimensional embedding space compared with linear counterparts. Surprisingly, our shallow HOPE method with simple Sigmoid transformations significantly outperforms state-of-the-art supervised deep embedding models based on deep neural networks, and even achieves a historically low test error rate of 0.65% in two-dimensional space on MNIST. In addition, the learned synthetic exemplars, combined with the shallow high-order feature mapping, speed up kNN classification by thousands of times while delivering comparable or much better performance than in the original feature space. These results clearly demonstrate the high representational efficiency and power of supervised shallow models with high-order feature interactions, and suggest that the performance and representational efficiency of supervised deep learning models might be significantly improved by incorporating explicit high-order feature interactions. Our methods can be readily extended to the unsupervised setting, for which we only need to compute the pairwise probabilities $p_{ij}$ using high-dimensional feature vectors instead of class labels and optimize the exemplars accordingly. SHOPE can also be easily extended to deep structures across different layers.

References
 (1) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.

 (2) Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.
 (3) A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, NIPS 18, pages 451–458. MIT Press, Cambridge, MA, 2006.
 (4) J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS 2004, pages 513–520. 2004.
 (5) H. Guo, X. Zhu, and M. R. Min. A deep learning model for structured outputs with highorder interaction. CoRR, abs/1504.08022, 2015.
 (6) X. He and P. Niyogi. Locality preserving projections. In Proceedings of Advances in Neural Information Processing Systems 16, 2003.
 (7) G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In Proceedings of Advances in Neural Information Processing Systems 16, 2003.
 (8) H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.
 (9) M. J. Kusner, S. Tyree, K. Q. Weinberger, and K. Agrawal. Stochastic neighbor compression. In ICML 2014, Beijing, China, 21–26 June 2014, pages 622–630.

 (10) R. Memisevic. Gradient-based learning of higher-order image features. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6–13, 2011, pages 1591–1598, 2011.
 (11) M. R. Min. A non-linear dimensionality reduction method for improving nearest neighbour classification. Master's thesis, Department of Computer Science, University of Toronto, 2005.

 (12) M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable sparse high-order Boltzmann machines. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22–25, 2014, pages 614–622, 2014.
 (13) M. R. Min, L. van der Maaten, Z. Yuan, A. J. Bonner, and Z. Zhang. Deep supervised t-distributed embedding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pages 791–798, 2010.
 (14) R. Min, D. A. Stanley, Z. Yuan, A. Bonner, and Z. Zhang. A deep non-linear feature mapping for large-margin kNN classification. In ICDM, pages 357–366. IEEE, 2009.
 (15) S. Purushotham, M. R. Min, C.-C. J. Kuo, and R. Ostroff. Factorized sparse learning models with interpretable high order feature interactions. In KDD 2014, New York, USA, 2014.
 (16) M. Ranzato and G. E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR 2010, 2010.

 (17) M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13–15, 2010, pages 621–628, 2010.
 (18) S. Rendle. Factorization machines. In Proceedings of the 10th International Conference on Data Mining, ICDM 2010, Sydney, Australia, 2010.
 (19) S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
 (20) A. D. Shieh, T. B. Hashimoto, and E. M. Airoldi. Tree preserving embedding. In Proceedings of the 28th International Conference on Machine Learning, 2011.
 (21) K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In Proceedings of International Conference on Learning Representations, 2015.
 (22) J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
 (23) L. van der Maaten. Learning a parametric embedding by preserving local structure. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16–18, 2009, pages 384–391, 2009.
 (24) L. van der Maaten and G. Hinton. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9:2579–2605, 2008.
 (25) K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009.