1 Introduction
Unsupervised representation learning has been shown to be effective in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning [Goodfellow, Bengio, and Courville2016]. Learned representations have been shown to achieve better performance on individual tasks than domain-specific handcrafted features, and different tasks can use the same learned representation [Goodfellow, Bengio, and Courville2016]. For example, the embedding obtained by methods like word2vec [Mikolov et al.2013] has been exploited in many different text mining systems [Catherine and Cohen2017, Zheng, Noroozi, and Yu2017]. Moreover, to help a user extract knowledge from a data set, a data exploration system can first learn a representation for each item in the data set without supervision, and then display both the clustering (e.g., k-means [Lloyd1982]) and visualization (e.g., projection with t-Distributed Stochastic Neighbor Embedding/t-SNE [Maaten and Hinton2008]) results produced from the representation.
There are two types of unsupervised representation learning methods: domain-specific methods and general methods. While domain-specific unsupervised representation learning methods like word2vec [Mikolov et al.2013] and video-based methods [Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017] have been widely adopted in their respective domains, their success cannot be directly transferred to other domains because their assumptions do not hold for other types of data. In contrast, general unsupervised representation learning methods, such as autoencoder [Bengio et al.2007, Huang et al.2007, Vincent et al.2010], can be effortlessly applied to data from various domains, but the performance of general methods is usually inferior to that of methods that utilize domain knowledge [Mikolov et al.2013, Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017].
In this work, we propose an unsupervised representation learning framework (i.e., neighborencoder) which is general, as it can be applied to various types of data, and versatile, since domain knowledge can be added by adopting various "off-the-shelf" data mining algorithms for finding neighbors. Figure 1 previews the t-SNE [Maaten and Hinton2008] visualization produced from a human physical activity data set (see Section 4.3 for details). The embedding is generated by projecting the representation learned by neighborencoder, the representation learned by autoencoder, and the raw data, respectively, to 2D. By using a suitable neighbor finding algorithm, the representation learned by neighborencoder provides a more meaningful visualization than its rival methods.
In summary, our major contributions include:

We propose a general and versatile framework, neighborencoder, which incorporates domain knowledge into unsupervised representation learning by leveraging a large family of off-the-shelf similarity search techniques.

We demonstrate that the performance of the representations learned by neighborencoder is superior to representations learned by autoencoder.

We showcase the applicability of neighborencoder in a diverse set of domains (i.e., handwritten digit data, text, and human physical activity data) for various data mining tasks (i.e., classification, clustering, and visualization).
To allow reproducibility, all the code and models associated with the paper can be downloaded from [nnwebsite]. The rest of this paper is organized as follows. In Section 2 we consider related work. In Section 3 we introduce the proposed neighborencoder framework. We perform a comprehensive evaluation in Section 4 before offering conclusions and directions for future research in Section 5.
2 Related Work
Unsupervised representation learning is usually achieved by optimizing either domain-specific objectives or general unsupervised objectives. For example, in the domains of computer vision and music processing, the unsupervised representation learning problem is formulated as a supervised learning problem with surrogate labels, generated by exploiting the temporal coherence in videos and music [Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017, Huang, Chou, and Yang2017]. In the case of natural language processing, word embedding can be achieved by optimizing an objective function that "pushes" words occurring in a similar context (i.e., surrounded by similar words) closer in the embedding space [Mikolov et al.2013]. Alternatively, general unsupervised objectives are also useful for unsupervised representation learning. For example, minimizing the self-reconstruction error is used in autoencoder [Bengio et al.2007, Huang et al.2007, Vincent et al.2010], while optimizing the k-means objective is shown to be effective in Coates and Ng (2012) and Yang et al. (2017). Other objectives, such as self-organizing map criteria [Kohonen1982, Bojanowski and Joulin2017] and adversarial training [Goodfellow et al.2014, Donahue, Krähenbühl, and Darrell2016, Radford, Metz, and Chintala2015, Larsen et al.2015], are also effective for unsupervised representation learning.
Autoencoder is a decade-old unsupervised learning framework for dimension reduction, representation learning, and deep hierarchical model pre-training; many variants have been proposed since its initial introduction [Bengio et al.2007, Goodfellow, Bengio, and Courville2016]. For example, the denoising autoencoder reconstructs the input data from its corrupted version; such modification improves the robustness of the learned representation [Vincent et al.2010]. The variational autoencoder (VAE) regularizes the learning process by imposing a standard normal prior over the latent variable (i.e., representation), and such constraints help the autoencoder learn a valid generative model [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014]. Larsen et al. (2015) and Makhzani et al. (2015) further improve generative model learning by combining VAE with adversarial training. Sparsity constraints on the learned representation are another form of regularization that helps autoencoders learn a more discriminating representation for classification; both the sparse autoencoder [Makhzani and Frey2013, Makhzani and Frey2015] and the competitive autoencoder [Chen and Zaki2017] incorporate such ideas.
3 Neighborencoder Framework
In this section, we introduce the proposed neighborencoder framework and make a comparison with autoencoder. Figure 2 shows different encoder-decoder configurations for both neighborencoder and autoencoder. In the following sections, we discuss the motivation and design of each encoder-decoder configuration in detail.
3.1 Autoencoder (AE)
The overall architecture of autoencoder consists of two components: an encoder and a decoder. Given input data $x_i$, the encoder $E(\cdot)$ is a function that encodes $x_i$ into a latent representation $z_i$ (usually in a lower-dimensional space), and the decoder $D(\cdot)$ is a function that decodes $z_i$ in order to reconstruct $x_i$. Figure 2(a) shows the feed-forward path of an autoencoder, where $z_i = E(x_i)$ and $\hat{x}_i = D(z_i)$. We train the autoencoder by minimizing the difference between the input data $x_i$ and the reconstructed data $\hat{x}_i$. Formally, given a set of training data $X$, the parameters in $E(\cdot)$ and $D(\cdot)$ are learned by minimizing the objective function $\sum_{x_i \in X} \mathrm{loss}(x_i, \hat{x}_i)$, where $\hat{x}_i = D(E(x_i))$. The particular loss function we used in this work is cross entropy, but other loss functions, like mean squared error or mean absolute error, can also be applied. Once the autoencoder is learned, any given data $x$ can be projected to the latent representation space with $z = E(x)$. Both the encoder and the decoder can adopt any existing neural network architecture, such as multilayer perceptron [Bengio et al.2007], convolutional net [Huang et al.2007], or long short-term memory [Hochreiter and Schmidhuber1997, Srivastava, Mansimov, and Salakhudinov2015].
3.2 Neighborencoder (NE)
Similar to the autoencoder, neighborencoder also consists of an encoder and a decoder. Both the encoder and the decoder in neighborencoder work similarly to their counterparts in autoencoder; the major difference is in the objective function. Given input data $x_i$ and the neighborhood function $N(\cdot)$ (which returns the neighbor $y_i$ of $x_i$), the encoder $E(\cdot)$ is a function that encodes $x_i$ into a latent representation $z_i$, and the decoder $D(\cdot)$ is a function that reconstructs $x_i$'s neighbor $y_i$ by decoding $z_i$. Figure 2(b) shows the feed-forward path of a neighborencoder, where $z_i = E(x_i)$ and $\hat{y}_i = D(z_i)$. Formally, given a set of training data $X$ and a neighborhood function $N(\cdot)$, the neighborencoder is learned by minimizing the objective function $\sum_{x_i \in X} \mathrm{loss}(y_i, \hat{y}_i)$, where $\hat{y}_i = D(E(x_i))$ and $y_i = N(x_i)$. Neighborencoder can be considered a generalization of autoencoder, as the input data can be treated as the nearest neighbor of itself with zero distance. Note that here neighbor can be defined in a variety of ways. We will introduce examples of different neighbor definitions later in Section 3.4.
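To make the difference between the two objectives concrete, here is a minimal Python sketch that evaluates the same summed reconstruction loss with two different target functions: the identity (autoencoder) and a nearest-neighbor lookup (neighborencoder). The toy encoder and decoder, the three data points, and the use of mean squared error instead of cross entropy are illustrative assumptions, not the networks used in this work.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mse(target: Vector, output: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)


def reconstruction_objective(
    data: Sequence[Vector],
    encoder: Callable[[Vector], Vector],
    decoder: Callable[[Vector], Vector],
    target_of: Callable[[Vector], Vector],
    loss: Callable[[Vector, Vector], float] = mse,
) -> float:
    """Sum over x of loss(target_of(x), decoder(encoder(x)))."""
    return sum(loss(target_of(x), decoder(encoder(x))) for x in data)


# Hypothetical, hand-written encoder/decoder (a real model would be trained).
def toy_encoder(x: Vector) -> Vector:
    return [sum(x) / len(x)]      # 1-D latent code: the mean of the input


def toy_decoder(z: Vector) -> Vector:
    return [z[0], z[0]]           # copy the code back into 2-D


data = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
# Precomputed Euclidean nearest neighbor of every point (excluding itself).
nearest = {(0.0, 0.0): (0.0, 2.0), (0.0, 2.0): (0.0, 0.0), (4.0, 4.0): (0.0, 2.0)}

# Autoencoder: the reconstruction target is the input itself.
ae_objective = reconstruction_objective(data, toy_encoder, toy_decoder, lambda x: x)
# Neighborencoder: the reconstruction target is the input's neighbor.
ne_objective = reconstruction_objective(data, toy_encoder, toy_decoder, nearest.__getitem__)
```

Only the target function changes between the two calls, which mirrors the statement that the major difference between the two frameworks lies in the objective function.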
Compared to autoencoder, we argue that neighborencoder can better retain the similarity between data samples in the latent representation space. Figure 3 builds a case for this claim. As shown in Figure 3(a), we assume the data set of interest consists of samples from two classes (i.e., blue class and red class, and each class forms a cluster) in 2D space. Since the autoencoder is trained by mapping each data point to itself, the learned representation for this data set would most likely be a rotated and/or rescaled version of Figure 3(a). In contrast, the neighborencoder (trained with the nearest neighbor relation, as shown in Figure 3(b)) would learn a representation with much less intra-class variation. As Figure 3(c) shows, when several similar data points share the same nearest neighbor, the objective function will force the network to generate exactly the same output for these similar data points, thus forcing their latent representations (which are the input of the decoder) to be very similar.
Alternatively, neighborencoder can be understood as a nonparametric way of generating corrupted data for denoising autoencoder. Instead of being trained to remove arbitrary noise (e.g., Gaussian noise) from the corrupted data (which is the norm), the neighborencoder is trained to remove more meaningful noise from the corrupted data. For example, a pair of nearest neighbors found using Euclidean distance in the MNIST database [LeCun et al.1998] usually reflects different writing styles of the same numeric digit (see Figure 5(a)). By training the neighborencoder with such nearest neighbor pairs, the learning process would push the encoder network to ignore or "remove" the writing style aspect from the handwritten digits.
Since we are using neighbor finding algorithms to guide the representation learning process, one may argue that we could instead construct a graph using the neighbor finding algorithm, then apply various graph-based representation learning methods like the ones proposed in [Perozzi, AlRfou, and Skiena2014, Tang et al.2015, Grover and Leskovec2016, Dong, Chawla, and Swami2017, Ribeiro, Saverese, and Figueiredo2017]. Graph-based methods are indeed valid alternatives to neighborencoder; however, they have the following two limitations: 1) If one wishes to encode newly obtained data, the out-of-sample problem would bring about additional complexity, as these methods are not designed to handle such a scenario. 2) It will be impossible to learn a generative model, as graph-based methods learn the representation by modeling the relationship between examples in a data set, rather than modeling the example itself. As a result, whenever the above limitations are crucial, the proposed neighborencoder is preferred over the graph-based methods.
3.3 $k$-neighborencoder
Similar to the idea of generalizing the 1-nearest-neighbor classifier to a $k$-nearest-neighbor classifier, neighborencoder can also be extended to the $k$-neighborencoder by reconstructing $k$ neighbors of the input data (see Figure 2(c)). We train $k$ decoders to simultaneously reconstruct all $k$ neighbors of the input. Given input data $x_i$ and the neighborhood function $N(\cdot)$ (which returns the $k$ neighbors $y_{i,1}, \ldots, y_{i,k}$ of $x_i$), the encoder $E(\cdot)$ is a function that encodes $x_i$ into the latent representation $z_i$. Then, we have a set of $k$ decoders $D_1(\cdot), \ldots, D_k(\cdot)$, in which each individual function $D_j(\cdot)$ decodes $z_i$ in order to reconstruct $x_i$'s $j$th neighbor $y_{i,j}$.
The $k$-neighborencoder learning process is slightly more complicated than that of the neighborencoder (i.e., the 1-neighborencoder). Given a set of training data $X$ and a neighborhood function $N(\cdot)$, the $k$-neighborencoder can be learned by minimizing $\sum_{x_i \in X} \sum_{j=1}^{k} \mathrm{loss}(y_{i,j}, \hat{y}_{i,j})$, where $\hat{y}_{i,j} = D_j(E(x_i))$ and $y_{i,j} \in N(x_i)$. Note that since there are $k$ decoders, we need to assign each $y_{i,j}$ to one of the $k$ decoders. If there are "naturally" $k$ types of neighbors, we can train one decoder for each type of neighbor. Otherwise, one possible decoder assignment strategy is choosing the decoder that provides the lowest reconstruction loss for each $y_{i,j} \in N(x_i)$. This decoder assignment strategy will work if each training example has fewer than $k$ "modes" of neighbors.
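The lowest-loss decoder assignment strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decoder outputs are given as plain vectors, the loss is a summed squared error, and the helper names `assign_decoders` and `k_neighbor_loss` are hypothetical. Note that this greedy rule does not force a one-to-one matching; two neighbors may pick the same decoder.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def squared_error(target: Vector, output: Vector) -> float:
    """Summed squared error between two equal-length vectors."""
    return sum((t - o) ** 2 for t, o in zip(target, output))


def assign_decoders(
    neighbors: Sequence[Vector],
    decoder_outputs: Sequence[Vector],
    loss: Callable[[Vector, Vector], float] = squared_error,
) -> List[int]:
    """For each neighbor, pick the decoder whose output has the lowest loss."""
    assignment = []
    for y in neighbors:
        losses = [loss(y, out) for out in decoder_outputs]
        assignment.append(losses.index(min(losses)))
    return assignment


def k_neighbor_loss(
    neighbors: Sequence[Vector],
    decoder_outputs: Sequence[Vector],
    loss: Callable[[Vector, Vector], float] = squared_error,
) -> float:
    """Total reconstruction loss for one input under the greedy assignment."""
    assignment = assign_decoders(neighbors, decoder_outputs, loss)
    return sum(loss(y, decoder_outputs[j]) for y, j in zip(neighbors, assignment))


# Two neighbors of one input and the outputs of two decoders for that input.
neighbors = [(0.0, 0.0), (10.0, 10.0)]
decoder_outputs = [(9.0, 9.0), (1.0, 1.0)]

assignment = assign_decoders(neighbors, decoder_outputs)
total = k_neighbor_loss(neighbors, decoder_outputs)
```

Here the first neighbor is matched to the second decoder and the second neighbor to the first decoder, since those pairings give the smaller errors.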
3.4 Neighborhood Function
To use any of the introduced neighborencoder configurations, we need to properly define the term neighbor. In this section, we discuss several possible neighborhood functions for the neighborencoder framework. Note that the functions listed in this section are just a small subset of all the available functions, and were chosen because they demonstrate the versatility of our approach.
Simple Neighbor is defined as the objects that are closest to a given object in Euclidean distance or other distances, assuming the distance between every two objects is computable. For example, given a set of objects $x_1, \ldots, x_n$ where each object is a real-valued vector, the neighboring relationship among the objects under Euclidean distance can be approximately identified by constructing a k-d tree.
Feature Space Neighbor is very similar to simple neighbor, except that instead of computing the distance between objects in the space where the reconstruction is performed (e.g., the raw-data space), we compute the distance in an alternative representation or feature space. To give a more concrete example, suppose we have a set of objects where each object is an audio clip in mel-frequency spectrum space. Instead of finding neighbors directly in the mel-frequency spectrum space, we transform the data into the Mel-Frequency Cepstral Coefficient (MFCC) space, as neighbors discovered in MFCC space are semantically more meaningful and searching in MFCC space is more efficient.
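The simple and feature space neighbor definitions differ only in the space where distances are measured. The sketch below illustrates this with a hypothetical helper, `nearest_neighbor_targets`, that takes an optional feature map. A brute-force search stands in for an index structure such as a tree, and the `rank` parameter (1 for the nearest neighbor, 2 for the second nearest, and so on) is an added assumption for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def nearest_neighbor_targets(
    objects: Sequence[Vector],
    feature_map: Callable[[Vector], Vector] = lambda x: x,
    rank: int = 1,
) -> List[Vector]:
    """Return, for every object, its rank-th nearest neighbor.

    Distances are Euclidean and measured in feature space; the returned
    targets are the original (raw-space) objects, so reconstruction still
    happens in the raw-data space.
    """
    features = [feature_map(x) for x in objects]
    targets = []
    for i, f in enumerate(features):
        others = [j for j in range(len(objects)) if j != i]
        others.sort(key=lambda j: math.dist(f, features[j]))
        targets.append(objects[others[rank - 1]])
    return targets


objects = [(0.0, 5.0), (0.1, -5.0), (3.0, 5.1)]

# Simple neighbor: distance computed on the raw vectors.
raw_targets = nearest_neighbor_targets(objects)
# Feature space neighbor: distance computed on the first coordinate only.
feat_targets = nearest_neighbor_targets(objects, feature_map=lambda x: (x[0],))
```

For the first object, the raw-space neighbor is `(3.0, 5.1)`, while the feature-space neighbor is `(0.1, -5.0)`, showing how the choice of space changes the reconstruction target.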
Time Series Subspace Neighbor, as defined for multidimensional time series data, is the similarity between two objects measured by using only a subset of all dimensions. By ignoring some dimensions, a time series could find higher quality neighbors since it is very likely that some of the dimensions contain irrelevant or noisy information (e.g., room temperature in human physical activity data). Given a multidimensional time series, we can use STAMP [Yeh, Kavantzas, and Keogh2017] to evaluate the neighboring relationship between all the subsequences within the time series.
Spatial or Temporal Neighbor defines the neighbor based on the spatial or temporal closeness of objects. Specifically, given a set of objects $x_1, \ldots, x_n$ where the subscript denotes the temporal (or spatial) arrival order, $x_i$ and $x_j$ are neighbors when $|i - j| < d$, where $d$ is the predefined size of the neighborhood. The skip-gram model in word2vec [Mikolov et al.2013] is an example of a spatial neighborencoder, as the skip-gram model can be regarded as reconstructing the spatial neighbors (in the form of one-hot vectors) of a given word.
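As a small illustration of the temporal (or spatial) neighbor definition, the following Python sketch enumerates all ordered neighbor pairs in a sequence. The window parameter `d`, the strict inequality on the index difference, and the helper name `temporal_neighbor_pairs` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_neighbor_pairs(sequence: Sequence[T], d: int) -> List[Tuple[T, T]]:
    """All ordered pairs (x_i, x_j) with i != j and |i - j| < d."""
    n = len(sequence)
    pairs = []
    for i in range(n):
        # Indices within the window around i, clipped to the sequence bounds.
        for j in range(max(0, i - d + 1), min(n, i + d)):
            if j != i:
                pairs.append((sequence[i], sequence[j]))
    return pairs


# Skip-gram-like usage: each word is paired with its adjacent words.
tokens = ["the", "cat", "sat", "down"]
pairs = temporal_neighbor_pairs(tokens, d=2)
```

With `d=2`, every word is paired only with the words immediately before and after it, which resembles the (center word, context word) pairs of a skip-gram model with a window of one.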
Side Information Neighbor defines the neighbor with side information, which could be more semantically meaningful than the aforementioned functions. For example, images shown in the same e-commerce webpage (e.g., Amazon) would most likely belong to the same merchandise, but they can reflect different angles, colors, etc., of the merchandise. If we select a random image from a webpage and assign it as the nearest neighbor for all the other images in the same page, we could train a representation that is invariant to view angles, lighting conditions, product variations (e.g., different colors of the same smartphone), and so forth. One may consider that using such side information implies a supervised learning system instead of an unsupervised learning system. However, note that we only have the information regarding similar pairs, while the information regarding dissimilar pairs (i.e., negative examples) is missing¹; compared to the information required by a conventional supervised learning system, this information is very limited.
¹ We can construct a nearest-neighbor graph by treating each image as a node and connecting each image with its nearest neighbor. One may sample pairs of disconnected nodes as negative examples, but such a sampling method may produce false negatives, as disconnected nodes may or may not be semantically dissimilar.
4 Experimental Evaluation
In this section, we show the effectiveness and versatility of neighborencoder compared to autoencoder by performing experiments on handwritten digits, texts, and human physical activities with different neighborhood functions. As the neighborencoder framework is a generalization of autoencoder, all the variants of autoencoder (e.g., denoising autoencoder [Vincent et al.2010], variational autoencoder [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014], sparse autoencoder [Makhzani and Frey2013, Makhzani and Frey2015], or adversarial autoencoder [Larsen et al.2015, Makhzani et al.2015]) can be directly ported to the neighborencoder framework. As a result, we did not exhaustively test all variants of autoencoder/neighborencoder, but instead selected the three most popular variants (i.e., vanilla, denoising, and variational). We leave the exhaustive comparison of the other variants for future work.
4.1 Handwritten Digits
The MNIST database is commonly used in the initial study of newly proposed methods due to its simplicity [LeCun et al.1998]. It contains 70,000 images of handwritten digits (one digit per image); 10,000 of these images are test data, and the other 60,000 are training data. The original task for the data set is multi-class classification. Since the proposed method is not a classifier but a representation learner (i.e., an encoder), we have evaluated our method using the following procedure: 1) we train the encoder with all the training data, 2) we encode both training data and test data into the learned representation space, 3) we train a simple classifier (i.e., linear support vector machine/SVM) with various amounts of labeled training data in the representation space, then apply the classifier to the representation of test data and report the classification error (i.e., a semi-supervised classification problem), and 4) we also apply a clustering method (i.e., k-means) to the representation of test data and report the adjusted Rand index. As a proof of concept, we did not put much effort into optimizing the structure of the encoder/decoder. We simply used a 4-layer convolutional net (64-64-128-128) as the encoder and a 4-layer transposed convolutional net (128-128-64-64) as the decoder. We have tried several other convolutional net architectures as well; the experimental results with these alternative architectures lead to the same conclusion.
Here we use the neighborencoder configuration (Figure 2(b)) with the simple neighbor definition for our neighborencoder. We compare the performance of three variants (vanilla, denoising, and variational) of neighborencoder and the same three variants of autoencoder. Figure 4 shows the classification error rate as we change the number of labeled training data for linear SVM. All neighborencoder variants outperform their corresponding autoencoder variants, except the variational neighborencoder when the number of labeled training data is larger. Overall, denoising neighborencoder produces the most discriminating representations.
Besides the semi-supervised learning experiment, we also performed a purely unsupervised clustering experiment with k-means. Table 1 summarizes the experiment's result. The overall conclusion is similar to that of the semi-supervised learning experiment, where all neighborencoder variants outperformed their corresponding autoencoder variants. Unlike the semi-supervised experiment, variational neighborencoder produces the most clusterable representations in this particular experiment, but all three variants of neighborencoder are comparable with each other.
Method  Vanilla  Denoising  Variational
AE      0.3005   0.3710     0.4492
NE      0.4926   0.5039     0.5179
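For readers unfamiliar with the adjusted Rand index used to score the clustering results, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is an illustration rather than the evaluation code used in this work; the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention: nothing to disagree on
    # Pairs of items that fall in the same cell of the contingency table.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial (e.g., a single cluster)
    return (same_cell - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
perfect = adjusted_rand_index(truth, [1, 1, 0, 0])    # same partition, renamed labels
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])    # every cluster is split evenly
```

The index is invariant to how cluster labels are named, so a perfect partition with swapped labels still scores 1, while a partition that cuts across every class scores below 0.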
In the previous two experiments, we define the neighbor of an object as its nearest neighbor under Euclidean distance. With this definition, the visual difference between an object and its neighbor is usually small, given that we have sufficient data. To allow for more visual discrepancy between the objects and their neighbors, we could change the neighbor definition to the $i$th nearest neighbor under Euclidean distance ($i > 1$). We have repeated the clustering experiment under different settings of $i$ to examine the effect of increasing discrepancy between the objects and their neighbors. We chose to perform the clustering experiment instead of the semi-supervised learning experiment because 1) clustering is unsupervised and 2) it is easier to present the clustering result in a single figure, as semi-supervised learning requires varying both the amount of training data and $i$.
Figure 6 summarizes the result, and Figure 5 shows a randomly selected set of object-neighbor pairs under different settings of the neighbor rank $i$. The performance peaks at one setting of $i$ and decreases as $i$ is increased beyond it; therefore, choosing the nearest neighbor of that rank as the reconstruction target for neighborencoder creates enough discrepancy between the object-neighbor pair for better representation learning. When neighborencoder is used in this fashion, it can be regarded as a nonparametric way of generating noisy objects (similar to the principle of denoising autoencoder), and the setting of $i$ controls the amount of noise added to the object. Note that neighborencoder is not equivalent to denoising autoencoder, as several objects can share the same $i$th nearest neighbor (recall Figure 3(c)), but denoising autoencoder would most likely generate different noisy inputs for different objects.
To explain the performance difference between autoencoder and neighborencoder, we randomly selected five test examples from each class (see Figure 7(a)) and fed them through both the autoencoder and the neighborencoder trained in the previous experiments. The outputs are shown in Figure 7, where the top row and bottom row are autoencoder and neighborencoder respectively. As expected, the output of autoencoder is almost identical to the input image. Although the output of neighborencoder is still very similar to the input image, its intra-class variation is smaller than that of the autoencoder output. This is because neighborencoder tends to reconstruct the same neighbor image from similar input data points (recall Figure 3(c)). As a result, the latent representation learned by neighborencoder is able to achieve better performance.
4.2 Texts
The 20Newsgroup data set (downloaded from [CardosoCachopo2007]) contains nearly 20,000 newsgroup posts grouped into 20 (almost) balanced newsgroups/classes. It is a popular data set for experimenting with machine learning algorithms on text documents. We follow the clustering experiment setup presented in Yang et al. (2017), wherein each document is represented as a tf-idf vector (using the 2,000 most frequent words in the corpus), and the performance of a method is measured by the normalized mutual information (NMI), adjusted Rand index (ARI), and clustering accuracy (ACC). To ensure the fairness of the comparison, we use an identical network structure (250-100-20 multilayer perceptron) for the encoder [Yang et al.2017].
We test three different autoencoder variants (vanilla autoencoder/AE, denoising autoencoder/DAE, and variational autoencoder/VAE) as the baselines, and enhance the best variant with the neighborencoder objective function (denoising neighborencoder/DNE). The neighbor definition adopted in this set of experiments is the feature space neighbor, where we find the nearest neighbor of each document in the current encoding space at each epoch. We use k-means (KM) to cluster the learned representation. Table 2 shows our experiment results accompanied by the experiment results reported in Yang et al. (2017). The proposed method (neighborencoder), when combined with the best variant of autoencoder, outperforms all other methods.
Methods   NMI   ARI   ACC
JNKM*     0.40  0.10  0.24
XARY*     0.19  0.02  0.18
SC*       0.40  0.17  0.34
KM*       0.41  0.15  0.30
NMF+KM*   0.39  0.17  0.33
LCCF*     0.46  0.17  0.32
SAE+KM*   0.47  0.28  0.42
DCN*      0.48  0.34  0.44
AE+KM     0.44  0.29  0.43
DAE+KM    0.52  0.38  0.53
VAE+KM    0.41  0.18  0.31
DNE+KM    0.56  0.41  0.57
* Experiment results reported by Yang et al. (2017).
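As a reference for the NMI metric used in this section, here is a minimal pure-Python sketch. Several normalizations of mutual information exist; this example normalizes by the arithmetic mean of the two label entropies, which is an assumption and not necessarily the variant used in the experiments. The function names are hypothetical.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        (c / n) * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    normalizer = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Assumed convention: two trivial labelings agree perfectly.
    return mutual_info / normalizer if normalizer > 0 else 1.0


truth = [0, 0, 1, 1]
nmi_perfect = normalized_mutual_information(truth, [1, 1, 0, 0])
nmi_independent = normalized_mutual_information(truth, [0, 1, 0, 1])
```

A clustering that matches the classes up to renaming scores 1, and one that is statistically independent of the classes scores 0.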
The most similar system (to our baselines) examined by Yang et al. (2017) is the stacked autoencoder with k-means (SAE+KM). When comparing our baselines with SAE+KM, AE+KM unsurprisingly performs similarly to SAE+KM, as they are almost identical. Out of our three baselines, the denoising autoencoder outperforms the other two variants considerably, with the variational autoencoder being the worst system. Because the denoising autoencoder is the best autoencoder variant, we decided to extend it with the neighbor reconstruction loss function. The resulting system (DNE+KM) outperforms all other systems, including the previous state-of-the-art deep clustering network (DCN).
Finally, we apply DNE+KM to a larger data set with imbalanced classes, RCV1-v2 [Lewis et al.2004], following the experiment/encoder setup with 20 clusters outlined in Yang et al. (2017). Table 3 summarizes the results. The performance of DNE+KM is similar to DCN in terms of NMI, while outperforming DCN in terms of ARI/ACC.
Methods   NMI   ARI   ACC
XARY*     0.25  0.04  0.28
DEC*      0.08  0.01  0.14
KM*       0.58  0.29  0.47
SAE+KM*   0.59  0.33  0.46
DCN*      0.61  0.33  0.47
DNE+KM    0.60  0.40  0.49
* Experiment results reported by Yang et al. (2017).
4.3 Human Physical Activities
In Section 3, we introduced the $k$-neighborencoder in addition to the neighborencoder. Here we test the $k$-neighborencoder on the PAMAP2 data set [Reiss and Stricker2012a, Reiss and Stricker2012b] using the time series subspace neighbor definition [Yeh, Kavantzas, and Keogh2017]. We chose the subspace neighbor definition because 1) it addresses one of the commonly seen multidimensional time series problem scenarios (the existence of irrelevant/noisy dimensions), 2) it is able to extract meaningful repeating patterns, and 3) it naturally gives multiple "types" of neighbors to each object.
The PAMAP2 data set was collected by mounting three inertial measurement units and a heart rate monitor on nine subjects, and recording them performing different physical activities (e.g., walking, running, playing soccer), with one session per subject, each lasting on the order of hours. The subjects performed one activity for a few minutes, took a short break, then continued performing another activity. In order to transform the data set into a format that we can use for evaluation (i.e., a training/test split), for each subject (or recording session) we cut the data into segments according to their corresponding physical activities; then, within each activity segment, we generated training data from the first half and test data from the second half, using a sliding window of fixed length and a step size of one. We make sure that there is no overlap between training data and test data. After the reorganization, we end up with nine data sets (one pair of training/test sets per subject). We ran experiments on each data set independently, and report averaged performance results.
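The per-segment split described above can be sketched as follows. This is an illustrative Python example under stated assumptions: a segment is a plain list of samples, the window length is a free parameter, and the helper names `sliding_windows` and `split_activity_segment` are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows(samples: Sequence[T], window: int, step: int = 1) -> List[List[T]]:
    """All length-`window` subsequences, taken every `step` samples."""
    return [
        list(samples[start:start + window])
        for start in range(0, len(samples) - window + 1, step)
    ]


def split_activity_segment(
    segment: Sequence[T], window: int
) -> Tuple[List[List[T]], List[List[T]]]:
    """Training windows from the first half, test windows from the second half.

    Windows never cross the midpoint, so no sample is shared between the
    training set and the test set.
    """
    half = len(segment) // 2
    train = sliding_windows(segment[:half], window)
    test = sliding_windows(segment[half:], window)
    return train, test


# One toy activity segment of ten time steps (integers stand in for sensor readings).
segment = list(range(10))
train, test = split_activity_segment(segment, window=3)
```

Splitting each segment before windowing is what keeps the two sets disjoint; windowing first and splitting afterwards would let overlapping windows leak samples across the split.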
The experiment procedure is very similar to the one presented in Section 4.1. We perform the experiments under two different scenarios: "clean" and "noisy." In the "clean" scenario, we manually deleted some dimensions of the data that are irrelevant (or harmful) to the classification/clustering tasks, while in the "noisy" scenario, all dimensions of the data are retained. Here we use a 4-layer convolutional net (64-64-128-256) as the encoder, and a 4-layer transposed convolutional net (256-128-64-64) as the decoder. Similar to Section 4.1, we did not put much effort into optimizing the network architecture. We have tried modifying the convolutional net architectures in various ways, such as adding batch normalization, changing the number of layers, or varying the number of filters for each layer, and the conclusion drawn from the experimental results remains virtually unchanged.
In Figure 8, we compare the semi-supervised classification capability of vanilla, denoising, and variational autoencoder/neighborencoder under both the "clean" scenario and the "noisy" scenario. Both vanilla and denoising neighborencoder outperform their corresponding autoencoders in all scenarios. The performance difference is more notable when the number of training data is small. In contrast, variational autoencoder outperforms the corresponding neighborencoder; however, the performance of both variational autoencoder and variational neighborencoder is considerably worse than that of their vanilla and denoising counterparts. Overall, both the vanilla and denoising neighborencoders work relatively well for this problem.
Table 4 shows the results of the clustering experiment with k-means. For the vanilla encoder-decoder system, neighborencoder surpasses autoencoder in both scenarios, especially in the noisy scenario. When the denoising mechanism is added to the encoder-decoder system, it greatly boosts the performance of autoencoders, but the performance of neighborencoder still exceeds that of autoencoder. Similar to the semi-supervised learning experiment, the variational encoder-decoder system performs poorly for this data set. In general, both the vanilla and denoising neighborencoders outperform their autoencoder counterparts for the clustering problem on the PAMAP2 data set.
Scenario  Method  Vanilla  Denoising  Variational
Clean     AE      0.3815   0.4159     0.1597
Clean     NE      0.4203   0.4272     0.1192
Noisy     AE      0.1844   0.2336     0.1034
Noisy     NE      0.3832   0.3948     0.1081
Figure 1 further demonstrates the advantage of neighborencoder over autoencoder. Here we use t-SNE to project various representations of the data of one subject into 2D space. The representations include the raw data itself, the latent representation learned by denoising autoencoder, and the latent representation learned by denoising neighborencoder. Although the clustering experiment suggests that autoencoder is comparable with neighborencoder, we can see that the latent representation learned by neighborencoder provides a much more meaningful visualization of different classes than the rival methods (including autoencoder) do in the face of noisy/irrelevant dimensions.
5 Conclusion
In this work, we have proposed an unsupervised learning framework called neighborencoder that is both general, in that it can easily be applied to data in various domains, and versatile, as it can incorporate domain knowledge by utilizing different neighborhood functions. We have showcased the effectiveness of neighborencoder compared to autoencoder in various domains, including images, text, and time series. In future work, we plan to either 1) explore the possibility of applying neighborencoder to problems like one-shot learning or 2) demonstrate the usefulness of the neighborencoder in more practical and applied tasks, including information retrieval. We made all the code and models available at [nnwebsite], to allow others to confirm and expand our work.
References
 [Agrawal, Carreira, and Malik2015] Agrawal, P.; Carreira, J.; and Malik, J. 2015. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision.
 [Bengio et al.2007] Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, 153–160.
 [Bojanowski and Joulin2017] Bojanowski, P., and Joulin, A. 2017. Unsupervised learning by predicting noise. In Proceedings of the 34th international conference on Machine learning.
 [CardosoCachopo2007] Cardoso-Cachopo, A. 2007. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa. http://ana.cachopo.org/.
 [Catherine and Cohen2017] Catherine, R., and Cohen, W. 2017. Transnets: Learning to transform for recommendation. arXiv preprint arXiv:1704.02298.
 [Chen and Zaki2017] Chen, Y., and Zaki, M. J. 2017. KATE: K-competitive autoencoder for text. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 85–94. ACM.
 [Coates and Ng2012] Coates, A., and Ng, A. Y. 2012. Learning feature representations with k-means. In Neural networks: Tricks of the trade. Springer. 561–580.
 [Donahue, Krähenbühl, and Darrell2016] Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
 [Dong, Chawla, and Swami2017] Dong, Y.; Chawla, N. V.; and Swami, A. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 135–144. ACM.
 [Goodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
 [Grover and Leskovec2016] Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864. ACM.
 [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 [Huang et al.2007] Huang, F. J.; Boureau, Y.-L.; LeCun, Y.; et al. 2007. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. IEEE Conference on, 1–8. IEEE.
 [Huang, Chou, and Yang2017] Huang, Y.-S.; Chou, S.-Y.; and Yang, Y.-H. 2017. Similarity embedding network for unsupervised sequential pattern learning by playing music puzzle games. arXiv preprint arXiv:1709.04384.
 [Jayaraman and Grauman2015] Jayaraman, D., and Grauman, K. 2015. Learning image representations tied to egomotion. In Proceedings of the IEEE International Conference on Computer Vision, 1413–1421.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 [Kohonen1982] Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological cybernetics 43(1):59–69.
 [Larsen et al.2015] Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
 [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Lewis et al.2004] Lewis, D. D.; Yang, Y.; Rose, T. G.; and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. Journal of machine learning research 5(Apr):361–397.
 [Lloyd1982] Lloyd, S. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28(2):129–137.
 [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
 [Makhzani and Frey2013] Makhzani, A., and Frey, B. 2013. k-sparse autoencoders. arXiv preprint arXiv:1312.5663.
 [Makhzani and Frey2015] Makhzani, A., and Frey, B. J. 2015. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, 2791–2799.
 [Makhzani et al.2015] Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
 [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
 [Nguyen et al.2017] Nguyen, M.; Purushotham, S.; To, H.; and Shahabi, C. 2017. m-TSNE: A framework for visualizing high-dimensional multivariate time series. arXiv preprint arXiv:1708.07942.
 [Pathak et al.2017] Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; and Hariharan, B. 2017. Learning features by watching objects move. In Computer Vision and Pattern Recognition, 2017. IEEE Conference on.
 [Perozzi, AlRfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710. ACM.
 [Radford, Metz, and Chintala2015] Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
 [Reiss and Stricker2012a] Reiss, A., and Stricker, D. 2012a. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, 40. ACM.
 [Reiss and Stricker2012b] Reiss, A., and Stricker, D. 2012b. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on, 108–109. IEEE.
 [Rezende, Mohamed, and Wierstra2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
 [Ribeiro, Saverese, and Figueiredo2017] Ribeiro, L. F.; Saverese, P. H.; and Figueiredo, D. R. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 385–394. ACM.
 [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 843–852.
 [Tang et al.2015] Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web.
 [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11:3371–3408.
 [Wang and Gupta2015] Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, 2794–2802.
 [Yang et al.2017] Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th international conference on Machine learning.
 [Yeh, Kavantzas, and Keogh2017] Yeh, C.-C. M.; Kavantzas, N.; and Keogh, E. 2017. Matrix profile VI: Meaningful multidimensional motif discovery. In 2017 IEEE 17th International Conference on Data Mining (ICDM).
 [Yeh2018] Yeh, C.-C. M. 2018. Project website. https://sites.google.com/view/neighborencoder/.
 [Zheng, Noroozi, and Yu2017] Zheng, L.; Noroozi, V.; and Yu, P. S. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 425–434. ACM.