
Representation Learning by Reconstructing Neighborhoods

by   Chin-Chia Michael Yeh, et al.
University of California, Riverside
University of New Mexico

Since its introduction, unsupervised representation learning has attracted considerable attention from the research community, as it has been demonstrated to be highly effective and easy to apply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning. In this work, we propose a novel unsupervised representation learning framework called neighbor-encoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to autoencoder, which reconstructs the input data itself, neighbor-encoder reconstructs the input data's neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any off-the-shelf similarity search algorithm or side information to find the neighbors of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We demonstrate the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighbor-encoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves state-of-the-art performance on text document clustering.



1 Introduction

Unsupervised representation learning has been shown effective in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning [Goodfellow, Bengio, and Courville2016]. Learned representations have been shown to achieve better performance on individual tasks than domain-specific handcrafted features, and different tasks can use the same learned representation [Goodfellow, Bengio, and Courville2016]. For example, the embedding obtained by methods like word2vec [Mikolov et al.2013] has been exploited in many different text mining systems [Catherine and Cohen2017, Zheng, Noroozi, and Yu2017]. Moreover, to help a user extract knowledge from a data set, a data exploration system can first learn the representation without supervision for each item in the data set; then display both the clustering (e.g., k-means [Lloyd1982]) and visualization (e.g., projection with t-Distributed Stochastic Neighbor Embedding/t-SNE [Maaten and Hinton2008]) results produced from the representation.

There are two types of unsupervised representation learning methods: domain-specific unsupervised representation learning methods and general unsupervised representation learning methods. While domain-specific unsupervised representation learning methods like word2vec [Mikolov et al.2013] and video-based methods [Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017] have been widely adopted in their respective domains, their success cannot be directly transferred to other domains because their assumptions do not hold for other types of data. In contrast, general unsupervised representation learning methods, such as autoencoder [Bengio et al.2007, Huang et al.2007, Vincent et al.2010], can be effortlessly applied to data from various domains, but the performance of general methods is usually inferior to those that utilize domain knowledge [Mikolov et al.2013, Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017].

In this work, we propose an unsupervised representation learning framework (i.e., neighbor-encoder) which is general, as it can be applied to various types of data, and versatile, since domain knowledge can be added by adopting various “off-the-shelf” data mining algorithms for finding neighbors. Figure 1 previews the t-SNE [Maaten and Hinton2008] visualization produced from a human physical activity data set (see Section 4.3 for details). The embedding is generated by projecting the representation learned by neighbor-encoder, the representation learned by autoencoder, and the raw data, respectively, to two dimensions. By using a suitable neighbor finding algorithm, the representation learned by neighbor-encoder provides a more meaningful visualization than its rival methods.

Figure 1: Visualizing the learned representation versus the raw time series on the PAMAP2 (human physical activity) data set using t-SNE with either Euclidean or dynamic time warping (DTW) distance [Nguyen et al.2017]. If we manually select dimensions of the time series that are clean and relevant (acceleration, gyroscope, magnetometer, etc.), the representation learned by both autoencoder and neighbor-encoder achieves better class separation than raw data. However, if the data includes noisy and/or irrelevant dimensions (heart rate, temperature, etc.), neighbor-encoder outperforms autoencoder noticeably.

In summary, our major contributions include:

  • We propose a general and versatile framework, neighbor-encoder, which incorporates domain knowledge into unsupervised representation learning by leveraging a large family of off-the-shelf similarity search techniques.

  • We demonstrate that the performance of the representations learned by neighbor-encoder is superior to representations learned by autoencoder.

  • We showcase the applicability of neighbor-encoder in a diverse set of domains (i.e., handwritten digit data, text, and human physical activity data) for various data mining tasks (i.e., classification, clustering, and visualization).

To allow reproducibility, all the code and models associated with the paper can be downloaded from nnwebsite (nnwebsite). The rest of this paper is organized as follows. In Section 2 we review related work. In Section 3 we introduce the proposed neighbor-encoder framework. We perform a comprehensive evaluation in Section 4 before offering conclusions and directions for future research in Section 5.

2 Related Work

Unsupervised representation learning is usually achieved by optimizing either domain-specific objectives or general unsupervised objectives. For example, in the domains of computer vision and music processing, the unsupervised representation learning problem is formulated as a supervised learning problem with surrogate labels, generated by exploiting the temporal coherence in videos and music [Agrawal, Carreira, and Malik2015, Jayaraman and Grauman2015, Wang and Gupta2015, Pathak et al.2017, Huang, Chou, and Yang2017]. In the case of natural language processing, word embedding can be achieved by optimizing an objective function that “pushes” words occurring in a similar context (i.e., surrounded by similar words) closer in the embedding space [Mikolov et al.2013]. Alternatively, general unsupervised objectives are also useful for unsupervised representation learning. For example, minimizing the self-reconstruction error is used in autoencoder [Bengio et al.2007, Huang et al.2007, Vincent et al.2010], while optimizing the k-means objective is shown effective in coates2012nn (coates2012nn) and yang2017icml (yang2017icml). Other objectives, such as self-organizing map criteria [Kohonen1982, Bojanowski and Joulin2017] and adversarial training [Goodfellow et al.2014, Donahue, Krähenbühl, and Darrell2016, Radford, Metz, and Chintala2015, Larsen et al.2015], are also effective for unsupervised representation learning.


Autoencoder is a decade-old unsupervised learning framework for dimension reduction, representation learning, and deep hierarchical model pre-training; many variants have been proposed since its initial introduction [Bengio et al.2007, Goodfellow, Bengio, and Courville2016]. For example, the denoising autoencoder reconstructs the input data from a corrupted version of it; this modification improves the robustness of the learned representation [Vincent et al.2010]. The variational autoencoder (VAE) regularizes the learning process by imposing a standard normal prior over the latent variable (i.e., the representation), and this constraint helps the autoencoder learn a valid generative model [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014]. larsen2015arxiv (larsen2015arxiv) and makhzani2015arxiv (makhzani2015arxiv) further improve generative model learning by combining VAE with adversarial training. Sparsity constraints on the learned representation are another form of regularization that helps autoencoders learn a more discriminating representation for classification; both the k-sparse autoencoder [Makhzani and Frey2013, Makhzani and Frey2015] and the k-competitive autoencoder [Chen and Zaki2017] incorporate this idea.

3 Neighbor-encoder Framework

In this section, we introduce the proposed neighbor-encoder framework and make a comparison with autoencoder. Figure 2 shows different encoder-decoder configurations for both neighbor-encoder and autoencoder. In the following sections, we discuss the motivation and design of each encoder-decoder configuration in detail.

Figure 2: Various encoder-decoder configurations for training autoencoder and neighbor-encoder: (a) autoencoder, (b) neighbor-encoder, and (c) k-neighbor-encoder with k decoders.

3.1 Autoencoder (AE)

The overall architecture of autoencoder consists of two components: an encoder and a decoder. Given input data x, the encoder f is a function that encodes x into a latent representation z (usually in a lower dimensional space), and the decoder g is a function that decodes z in order to reconstruct x. Figure 1(a) shows the feed-forward path of an autoencoder, where z = f(x) and x̂ = g(z). We train the autoencoder by minimizing the difference between the input data x and the reconstructed data x̂. Formally, given a set of training data X, the parameters in f and g are learned by minimizing the objective function Σ_{x∈X} L(x, x̂), where x̂ = g(f(x)). The particular loss function L we used in this work is cross entropy, but other loss functions, like mean squared error or mean absolute error, can also be applied. Once the autoencoder is learned, any given data x can be projected to the latent representation space with z = f(x). Both the encoder and the decoder can adopt any existing neural network architecture, such as multilayer perceptron [Bengio et al.2007], convolutional net [Huang et al.2007], or long short-term memory [Hochreiter and Schmidhuber1997, Srivastava, Mansimov, and Salakhudinov2015].
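As a concrete illustration of the self-reconstruction objective, the following sketch trains a toy linear autoencoder with mean squared error and plain gradient descent. The data, dimensions, learning rate, and the purely linear f and g are illustrative assumptions, not the convolutional architecture used later in this paper.

```python
import numpy as np

# Toy linear autoencoder: f(x) = x @ W_enc, g(z) = z @ W_dec,
# trained by gradient descent on the mean squared reconstruction error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                       # toy training set
d_latent = 3

W_enc = rng.normal(scale=0.1, size=(8, d_latent))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(d_latent, 8))   # decoder weights

def loss(X):
    X_hat = X @ W_enc @ W_dec                       # x_hat = g(f(x))
    return np.mean((X - X_hat) ** 2)

initial = loss(X)
lr = 0.05
for _ in range(200):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    err = X_hat - X                                 # gradient of the squared error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final = loss(X)
```

Because the latent dimension is smaller than the input dimension, the reconstruction error cannot reach zero; gradient descent drives it down toward the best low-rank approximation of the data.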

3.2 Neighbor-encoder (NE)

Similar to the autoencoder, neighbor-encoder also consists of an encoder and a decoder. Both the encoder and the decoder in neighbor-encoder work similarly to their counterparts in autoencoder; the major difference is in the objective function. Given input data x and the neighborhood function N (which returns the neighbor of x), the encoder f encodes x into a latent representation z, and the decoder g reconstructs x’s neighbor y = N(x) by decoding z. Figure 1(b) shows the feed-forward path of a neighbor-encoder, where z = f(x) and ŷ = g(z). Formally, given a set of training data X and a neighborhood function N, the neighbor-encoder is learned by minimizing the objective function Σ_{x∈X} L(y, ŷ), where y = N(x) and ŷ = g(f(x)). Neighbor-encoder can be considered a generalization of autoencoder, as the input data can be treated as the nearest neighbor of itself with zero distance. Note that here neighbor can be defined in a variety of ways. We will introduce examples of different neighbor definitions in Section 3.4.
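To make the change of objective concrete, the following sketch builds the (input, target) training pairs for neighbor-encoder with brute-force Euclidean nearest-neighbor search. The brute-force search and the toy points are illustrative stand-ins for any off-the-shelf similarity search algorithm.

```python
import numpy as np

# The only change neighbor-encoder makes to the training setup: the
# reconstruction target for each x becomes its nearest neighbor N(x)
# (excluding x itself) instead of x.
def neighbor_targets(X):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)       # a point is not its own neighbor
    nn_idx = dist.argmin(axis=1)
    return X[nn_idx]                     # Y[i] is the target for X[i]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Y = neighbor_targets(X)
# Training then minimizes L(Y[i], g(f(X[i]))) over pairs (X[i], Y[i])
# instead of the autoencoder's L(X[i], g(f(X[i]))).
```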

Compared to autoencoder, we argue that neighbor-encoder can better retain the similarity between data samples in the latent representation space. Figure 3 builds a case for this claim. As shown in Figure 2(a), we assume the data set of interest consists of samples from two classes (i.e., a blue class and a red class, each forming a cluster) in a 2D space. Since the autoencoder is trained by mapping each data point to itself, the learned representation for this data set would most likely be a rotated and/or re-scaled version of Figure 2(a). In contrast, the neighbor-encoder (trained with the nearest neighbor relation, as shown in Figure 2(b)) would learn a representation with much less intra-class variation. As Figure 2(c) shows, when several similar data points share the same nearest neighbor, the objective function will force the network to generate exactly the same output for these similar data points, thus forcing their latent representations (which are the inputs of the decoder) to be very similar.

Figure 3: Intuition behind neighbor-encoder compared to autoencoder: (a) a simple data set with two classes, (b) the nearest neighbor graph constructed for the data set (arrowheads are removed for clarity), and (c) an example of how neighbor-encoder would generate representation, with smaller intra-class variation for the highlighted data points.

Alternatively, neighbor-encoder can be understood as a non-parametric way of generating corrupted data for denoising autoencoder. Instead of being trained to remove arbitrary noise (e.g., Gaussian noise) from the corrupted data (which is the norm), the neighbor-encoder is trained to remove more meaningful noise from the corrupted data. For example, a pair of nearest neighbors found using Euclidean distance in the MNIST database [LeCun et al.1998] usually reflects different writing styles of the same numeric digit (see Figure 4(a)). By training the neighbor-encoder with such nearest neighbor pairs, the learning process would push the encoder network to ignore or “remove” the writing style aspect from the handwritten digits.

Since we are using neighbor finding algorithms to guide the representation learning process, one may argue that we could instead construct a graph using the neighbor finding algorithm, then apply various graph-based representation learning methods like the ones proposed in [Perozzi, Al-Rfou, and Skiena2014, Tang et al.2015, Grover and Leskovec2016, Dong, Chawla, and Swami2017, Ribeiro, Saverese, and Figueiredo2017]. Graph-based methods are indeed valid alternatives to neighbor-encoder; however, they have two limitations: 1) If one wishes to encode newly obtained data, the out-of-sample problem brings additional complexity, as these methods are not designed to handle such a scenario. 2) It is impossible to learn a generative model, as graph-based methods learn the representation by modeling the relationships between examples in a data set, rather than modeling the examples themselves. As a result, whenever the above limitations are crucial, the proposed neighbor-encoder is preferred over graph-based methods.

3.3 k-neighbor-encoder

Similar to the idea of generalizing the 1-nearest neighbor classifier to a k-nearest neighbor classifier, neighbor-encoder can also be extended to the k-neighbor-encoder by reconstructing the k neighbors of the input data (see Figure 1(c)). We train k decoders to simultaneously reconstruct all k neighbors of the input. Given input data x and the neighborhood function N_k (which returns the k neighbors of x), the encoder f encodes x into the latent representation z. Then, we have a set of k decoders g_1, …, g_k, in which each individual decoder g_i decodes z in order to reconstruct x’s ith neighbor y_i.

The k-neighbor-encoder learning process is slightly more complicated than that of the neighbor-encoder (i.e., the 1-neighbor-encoder). Given a set of training data X and a neighborhood function N_k, the k-neighbor-encoder can be learned by minimizing Σ_{x∈X} Σ_{i=1…k} L(y_i, ŷ_i), where y_i is the ith neighbor returned by N_k(x) and ŷ_i = g_i(f(x)). Note that since there are k decoders, we need to assign each y_i to one of the k decoders. If there are “naturally” k types of neighbors, we can train one decoder for each type of neighbor. Otherwise, one possible decoder assignment strategy is choosing the decoder that provides the lowest reconstruction loss for each y_i. This decoder assignment strategy will work if each training example has fewer than k “modes” of neighbors.
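The lowest-reconstruction-loss assignment strategy can be sketched as follows; the decoder outputs here are hypothetical stand-in arrays for g_1(f(x)), …, g_k(f(x)).

```python
import numpy as np

# Route each neighbor y_i of an input to whichever of the k decoders
# currently reconstructs it with the lowest squared-error loss.
def assign_decoders(neighbors, decoder_outputs):
    # neighbors: (k, d) targets; decoder_outputs: (k, d) reconstructions
    # losses[i, j] = squared error of decoder j reconstructing neighbor i
    diff = neighbors[:, None, :] - decoder_outputs[None, :, :]
    losses = (diff ** 2).sum(axis=-1)
    return losses.argmin(axis=1)          # chosen decoder index per neighbor

neighbors = np.array([[1.0, 0.0], [0.0, 1.0]])
decoder_outputs = np.array([[0.1, 0.9],   # decoder 0 is close to neighbor 1
                            [0.9, 0.1]])  # decoder 1 is close to neighbor 0
assignment = assign_decoders(neighbors, decoder_outputs)
```

Note that this greedy per-neighbor argmin can route several neighbors to the same decoder, which is why, as stated above, the strategy works when each training example has fewer than k “modes” of neighbors.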

3.4 Neighborhood Function

To use any of the introduced neighbor-encoder configurations, we need to properly define the term neighbor. In this section, we discuss several possible neighborhood functions for the neighbor-encoder framework. Note that the functions listed in this section are just a small subset of all the available functions, and were chosen because they demonstrate the versatility of our approach.

Simple Neighbor is defined as the objects that are closest to a given object in Euclidean distance (or another distance measure), assuming the distance between any two objects is computable. For example, given a set of objects where each object is a real-valued vector, the neighboring relationship among the objects under Euclidean distance can be approximately identified by constructing a k-d tree.

Feature Space Neighbor is very similar to simple neighbor, except that instead of computing the distance between objects in the space where the reconstruction is performed (e.g., the raw-data space), we compute the distance in an alternative representation or feature space. To give a more concrete example, suppose we have a set of objects where each object is an audio clip in mel-frequency spectrum space. Instead of finding neighbors directly in the mel-frequency spectrum space, we transform the data into the Mel-Frequency Cepstral Coefficient (MFCC) space, as neighbors discovered in MFCC space are semantically more meaningful and searching in MFCC space is more efficient.
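A minimal sketch of the idea, using a per-dimension absolute value as a toy stand-in for a real feature transform such as MFCC: the nearest neighbor found in the feature space can differ from the one found in the raw space.

```python
import numpy as np

# Distances are computed after mapping each object through a feature
# transform phi, rather than in the raw space where reconstruction happens.
def nearest_neighbor(X, phi=lambda x: x):
    F = phi(X)
    dist = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)       # exclude self-matches
    return dist.argmin(axis=1)

X = np.array([[1.0, 2.0], [-1.0, -2.0], [1.2, 2.1]])
raw_nn = nearest_neighbor(X)             # neighbors in the raw space
feat_nn = nearest_neighbor(X, phi=np.abs)  # neighbors in the feature space
# phi = np.abs treats x and -x as identical, so objects 0 and 1 become
# exact neighbors in the feature space even though they are far apart raw.
```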

Time Series Subspace Neighbor, as defined for multidimensional time series data, measures the similarity between two objects using only a subset of all dimensions. By ignoring some dimensions, a time series could find higher-quality neighbors, since it is very likely that some of the dimensions contain irrelevant or noisy information (e.g., room temperature in human physical activity data). Given a multidimensional time series, we can use STAMP [Yeh, Kavantzas, and Keogh2017] to evaluate the neighboring relationship between all the subsequences within the time series.

Spatial or Temporal Neighbor defines the neighbor based on the spatial or temporal closeness of objects. Specifically, given a set of objects {x_1, x_2, …, x_n} where the subscript denotes the temporal (or spatial) arrival order, x_i and x_j are neighbors when |i − j| ≤ m, where m is the predefined size of the neighborhood. The skip-gram model in word2vec [Mikolov et al.2013] is an example of a spatial neighbor-encoder, as the skip-gram model can be regarded as reconstructing the spatial neighbors (in the form of one-hot vectors) of a given word.
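Under this definition, generating training pairs reduces to enumerating the objects within a window of size m around each position. The sketch below applies it to a toy token sequence, mirroring how skip-gram (center, context) pairs are formed; the sequence is purely illustrative.

```python
# Enumerate temporal neighbor pairs: x_i and x_j are neighbors when
# |i - j| <= m, where m is the predefined neighborhood size.
def temporal_neighbor_pairs(sequence, m):
    pairs = []
    for i, center in enumerate(sequence):
        for j in range(max(0, i - m), min(len(sequence), i + m + 1)):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

pairs = temporal_neighbor_pairs(["the", "quick", "brown", "fox"], m=1)
# With m = 1, each token pairs only with its immediate predecessor
# and successor, exactly the window-1 skip-gram context.
```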

Side Information Neighbor defines the neighbor with side information, which could be more semantically meaningful than the aforementioned functions. For example, images shown on the same eCommerce webpage (e.g., Amazon) would most likely belong to the same merchandise, but they can reflect different angles, colors, etc., of the merchandise. If we select a random image from a webpage and assign it as the nearest neighbor of all the other images on the same page, we could train a representation that is invariant to view angles, lighting conditions, product variations (e.g., different colors of the same smartphone), and so forth. One may consider that using such side information implies a supervised learning system instead of an unsupervised learning system. However, note that we only have information regarding similar pairs, while information regarding dissimilar pairs (i.e., negative examples) is missing¹; compared to the information required by a conventional supervised learning system, this information is very limited.

¹ We can construct a k-nearest-neighbor graph by treating each image as a node and connecting each image with its nearest neighbor. One may sample pairs of disconnected nodes as negative examples, but such a sampling method may produce false negatives, as disconnected nodes may or may not be semantically dissimilar.

4 Experimental Evaluation

In this section, we show the effectiveness and versatility of neighbor-encoder compared to autoencoder by performing experiments on handwritten digits, texts, and human physical activities with different neighborhood functions. As the neighbor-encoder framework is a generalization of autoencoder, all the variants of autoencoder (e.g., denoising autoencoder [Vincent et al.2010], variational autoencoder [Kingma and Welling2013, Rezende, Mohamed, and Wierstra2014], k-sparse autoencoder [Makhzani and Frey2013, Makhzani and Frey2015], or adversarial autoencoder [Larsen et al.2015, Makhzani et al.2015]) can be directly ported to the neighbor-encoder framework. As a result, we did not exhaustively test all variants of autoencoder/neighbor-encoder, but instead selected the three most popular variants (i.e., vanilla, denoising, and variational). We leave the exhaustive comparison of the other variants for future work.

4.1 Handwritten Digits

The MNIST database is commonly used in the initial study of newly proposed methods due to its simplicity [LeCun et al.1998]. It contains 70,000 images of handwritten digits (one digit per image); 10,000 of these images are test data, and the other 60,000 are training data. The original task for the data set is multi-class classification. Since the proposed method is not a classifier but a representation learner (i.e., an encoder), we have evaluated our method using the following procedure: 1) we train the encoder with all the training data, 2) we encode both training data and test data into the learned representation space, 3) we train a simple classifier (i.e., linear support vector machine/SVM) with various amounts of labeled training data in the representation space, then apply the classifier to the representation of the test data and report the classification error (i.e., a semi-supervised classification problem), and 4) we also apply a clustering method (i.e., k-means) to the representation of the test data and report the adjusted Rand index. As a proof of concept, we did not put much effort into optimizing the structure of the encoder/decoder. We simply used a 4-layer convolutional net (64-64-128-128) as the encoder and a 4-layer transposed convolutional net (128-128-64-64) as the decoder. We have tried several other convolutional net architectures as well; we draw the same conclusion from the experimental results with these alternative architectures.

Here we use the neighbor-encoder configuration (Figure 1(b)) with the simple neighbor definition for our neighbor-encoder. We compare the performance of three variants (vanilla, denoising, and variational) of neighbor-encoder and the same three variants of autoencoder. Figure 4 shows the classification error rate as we change the amount of labeled training data for linear SVM. All neighbor-encoder variants outperform their corresponding autoencoder variants, except the variational neighbor-encoder when the amount of labeled training data is large. Overall, denoising neighbor-encoder produces the most discriminating representations.

Figure 4: The classification error rate with linear SVM versus various training data sizes using different variants (i.e., vanilla, denoising, variational) of autoencoder and neighbor-encoder.

Besides the semi-supervised learning experiment, we also performed a purely unsupervised clustering experiment with k-means. Table 1 summarizes the experiment’s result. The overall conclusion is similar to that of the semi-supervised learning experiment, where all neighbor-encoder variants outperformed their corresponding autoencoder variants. Unlike the semi-supervised experiment, variational neighbor-encoder produces the most clusterable representations in this particular experiment, but all three variants of neighbor-encoder are comparable with each other.

      Vanilla   Denoising   Variational
AE    0.3005    0.3710      0.4492
NE    0.4926    0.5039      0.5179

Table 1: The clustering adjusted Rand index with k-means.

In the previous two experiments, we define the neighbor of an object as its nearest neighbor under Euclidean distance. With this definition, the visual difference between an object and its neighbor is usually small, given that we have sufficient data. To allow for more visual discrepancy between the objects and their neighbors, we could change the neighbor definition to the kth nearest neighbor under Euclidean distance (k > 1). We have repeated the clustering experiment under different settings of k to examine the effect of increasing discrepancy between the objects and their neighbors. We chose to perform the clustering experiment instead of the semi-supervised learning experiment because 1) clustering is unsupervised and 2) it is easier to present the clustering result in a single figure, as semi-supervised learning requires varying both the amount of training data and k.
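Retrieving the kth nearest neighbor instead of the first only requires ranking the distances; a brute-force sketch (k = 1 recovers the earlier definition), with toy 1D points chosen for illustration:

```python
import numpy as np

# Generalized reconstruction target: each object's kth nearest neighbor
# under Euclidean distance, excluding the object itself. Brute-force
# ranking is used for clarity; any similarity search index that returns
# a ranked list would do.
def kth_neighbor_targets(X, k):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)       # exclude self-matches
    order = dist.argsort(axis=1)         # columns ranked by proximity
    return X[order[:, k - 1]]            # kth closest object per row

X = np.array([[0.0], [1.0], [3.0], [7.0]])
Y1 = kth_neighbor_targets(X, k=1)        # ordinary nearest neighbor
Y2 = kth_neighbor_targets(X, k=2)        # second nearest neighbor
```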

Figure 5: Neighbor pairs under different proximity settings.

Figure 6 summarizes the result, and Figure 5 shows a randomly selected set of object-neighbor pairs under different settings of k. The performance peaks at a small k and decreases as we further increase k; therefore, choosing a slightly more distant neighbor as the reconstruction target for neighbor-encoder creates enough discrepancy between the object-neighbor pair for better representation learning. When neighbor-encoder is used in this fashion, it can be regarded as a non-parametric way of generating noisy objects (similar to the principle of the denoising autoencoder), and the setting of k controls the amount of noise added to the object. Note that neighbor-encoder is not equivalent to denoising autoencoder, as several objects can share the same kth nearest neighbor (recall Figure 2(c)), but denoising autoencoder would most likely generate different noisy inputs for different objects.

Figure 6: The clustering adjusted Rand index versus the proximity of the neighbor using various neighbor-encoder variants (i.e., vanilla, denoising, variational). The proximity of a neighbor is defined as its ranking when queried with the input.

To explain the performance difference between autoencoder and neighbor-encoder, we randomly selected five test examples from each class (see Figure 6(a)) and fed them through both the autoencoder and the neighbor-encoder trained in the previous experiments. The outputs are shown in Figure 7, where the top row and bottom row are autoencoder and neighbor-encoder, respectively. As expected, the output of autoencoder is almost identical to the input image. Although the output of neighbor-encoder is still very similar to the input image, its intra-class variation is less than that of the autoencoder output. This is because neighbor-encoder tends to reconstruct the same neighbor image from similar input data points (recall Figure 2(c)). As a result, the latent representation learned by neighbor-encoder is able to achieve better performance.

(a) Input
(b) Vanilla AE
(c) Denoising AE
(d) Variational AE
(e) Vanilla NE
(f) Denoising NE
(g) Variational NE
Figure 7: Outputs of the decoders for different autoencoder (AE) and neighbor-encoder (NE) variations.

4.2 Texts

The 20Newsgroup data set (downloaded from cardoso2007phd (cardoso2007phd)) contains nearly 20,000 newsgroup posts grouped into 20 (almost) balanced newsgroups/classes. It is a popular data set for experimenting with machine learning algorithms on text documents. We follow the clustering experiment setup presented in yang2017icml (yang2017icml), wherein each document is represented as a tf-idf vector (using the 2,000 most frequent words in the corpus), and the performance of a method is measured by the normalized mutual information (NMI), adjusted Rand index (ARI), and clustering accuracy (ACC). To ensure the fairness of the comparison, we use an identical network structure (250-100-20 multilayer perceptron) for the encoder [Yang et al.2017].

We test three different autoencoder variants (vanilla autoencoder/AE, denoising autoencoder/DAE, and variational autoencoder/VAE) as the baselines, and enhance the best variant with the neighbor-encoder objective function (denoising neighbor-encoder/DNE). The neighbor definition adopted in this set of experiments is the feature space neighbor, where we find the nearest neighbor of each document in the current encoding space at each epoch. We use k-means (KM) to cluster the learned representation. Table 2 shows our experiment results accompanied by the experiment result reported in yang2017icml (yang2017icml). The proposed method (neighbor-encoder), when combined with the best variant of autoencoder, outperforms all other methods.

Method     NMI    ARI    ACC
JNKM*      0.40   0.10   0.24
XARY*      0.19   0.02   0.18
SC*        0.40   0.17   0.34
KM*        0.41   0.15   0.30
NMF+KM*    0.39   0.17   0.33
LCCF*      0.46   0.17   0.32
SAE+KM*    0.47   0.28   0.42
DCN*       0.48   0.34   0.44
AE+KM      0.44   0.29   0.43
DAE+KM     0.52   0.38   0.53
VAE+KM     0.41   0.18   0.31
DNE+KM     0.56   0.41   0.57

  • Experiment results reported by yang2017icml (yang2017icml).

Table 2: The results of the experiment on 20Newsgroup.

The most similar system (to our baselines) examined by yang2017icml (yang2017icml) is the stacked autoencoder with k-means (SAE+KM). When comparing our baselines with SAE+KM, AE+KM unsurprisingly performs similarly to SAE+KM, as they are almost identical. Of our three baselines, the denoising autoencoder outperforms the other two variants considerably, with the variational autoencoder being the worst. Because the denoising autoencoder is the best variant, we decided to extend it with the neighbor reconstruction loss function. The resulting system (DNE+KM) outperforms all other systems, including the previous state-of-the-art deep clustering network (DCN).

Finally, we apply DNE+KM to a larger data set with imbalanced classes, RCV1-v2 [Lewis et al.2004], following the experiment/encoder setup with 20 clusters outlined in yang2017icml (yang2017icml). Table 3 summarizes the results. The performance of DNE+KM is similar to DCN in terms of NMI, while outperforming DCN in terms of ARI/ACC.

Method     NMI    ARI    ACC
XARY*      0.25   0.04   0.28
DEC*       0.08   0.01   0.14
KM*        0.58   0.29   0.47
SAE+KM*    0.59   0.33   0.46
DCN*       0.61   0.33   0.47
DNE+KM     0.60   0.40   0.49

  • Experiment results reported by yang2017icml (yang2017icml).

Table 3: The results of the experiment on RCV1-v2 with 20 clusters.

4.3 Human Physical Activities

In Section 3, we introduced the k-neighbor-encoder in addition to the neighbor-encoder. Here we test the k-neighbor-encoder on the PAMAP2 data set [Reiss and Stricker2012a, Reiss and Stricker2012b] using the time series subspace neighbor definition [Yeh, Kavantzas, and Keogh2017]. We chose the subspace neighbor definition because 1) it addresses one of the commonly seen problem scenarios for multidimensional time series (the existence of irrelevant/noisy dimensions), 2) it is able to extract meaningful repeating patterns, and 3) it naturally gives multiple “types” of neighbors to each object.

The PAMAP2 data set was collected by mounting three inertial measurement units and a heart rate monitor on nine subjects and recording them performing different physical activities (e.g., walking, running, playing soccer), with one recording session per subject. The subjects performed one activity for a few minutes, took a short break, then continued with another activity. To convert the data set into a format suitable for evaluation (i.e., a training/test split), for each subject (or recording session) we cut the data into segments according to their corresponding physical activities; then, within each activity segment, we generated training data from the first half and test data from the second half, using a sliding window with a step size of one. We make sure that there is no overlap between training data and test data. After the reorganization, we end up with nine data sets (one pair of training/test sets per subject). We ran experiments on each data set independently, and report averaged performance results.
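The per-activity first-half/second-half split described above can be sketched as follows; the window length, step size, and toy session below are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def make_split(session, labels, window, step=1):
    """Per-activity split: windows from the first half of each activity
    segment go to train, windows from the second half to test."""
    train, test = [], []
    # find contiguous activity segments from the label sequence
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    for seg in np.split(np.arange(len(labels)), boundaries):
        half = len(seg) // 2
        for part, out in ((seg[:half], train), (seg[half:], test)):
            data = session[part]
            for start in range(0, len(data) - window + 1, step):
                out.append(data[start:start + window])
    return np.array(train), np.array(test)

rng = np.random.default_rng(0)
session = rng.standard_normal((40, 3))   # 40 samples, 3 sensor channels
labels = np.repeat([0, 1], 20)           # two activity segments
train, test = make_split(session, labels, window=5)
print(train.shape, test.shape)
```

Because training windows come only from the first half of a segment and test windows only from the second, no window appears in both splits.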

The experiment procedure is very similar to the one presented in Section 4.1. We perform the experiments under two different scenarios: “clean” and “noisy.” In the “clean” scenario, we manually deleted the dimensions of the data that are irrelevant (or harmful) to the classification/clustering tasks, while in the “noisy” scenario, all dimensions of the data are retained. Here we use a 4-layer convolutional net (64-64-128-256) as the encoder, and a 4-layer transposed convolutional net (256-128-64-64) as the decoder. As in Section 4.1, we did not put much effort into optimizing this network architecture. We tried modifying the convolutional nets in various ways, such as adding batch normalization, changing the number of layers, or varying the number of filters per layer, and the conclusions drawn from the experimental results remain virtually unchanged.
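To make the 64-64-128-256 channel progression of the encoder concrete, here is a minimal NumPy sketch of a strided 1-D convolutional stack; the input channel count, window length, kernel size, and stride are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride=2):
    """Valid strided 1-D convolution; x: (C_in, T), w: (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    y = np.empty((c_out, t_out))
    for t in range(t_out):
        patch = x[:, t * stride:t * stride + k]          # (C_in, K)
        y[:, t] = np.tensordot(w, patch, axes=([1, 2], [0, 1]))
    return np.maximum(y, 0.0)                            # ReLU

channels = [16, 64, 64, 128, 256]   # input sensors -> the 64-64-128-256 stack
x = rng.standard_normal((channels[0], 128))              # one windowed time series
for c_in, c_out in zip(channels, channels[1:]):
    w = rng.standard_normal((c_out, c_in, 4)) * 0.01     # random weights for shape check
    x = conv1d(x, w)
print(x.shape)   # latent feature map after the 4-layer encoder
```

The decoder would mirror this stack with transposed convolutions (256-128-64-64) to map the latent feature map back to the neighbor's dimensions.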

In Figure 8, we compare the semi-supervised classification capability of the vanilla, denoising, and variational autoencoders/k-neighbor-encoders under both the “clean” scenario and the “noisy” scenario. Both the vanilla and denoising k-neighbor-encoders outperform their corresponding autoencoders in all scenarios. The performance difference is more notable when the amount of labeled training data is small. In contrast, the variational autoencoder outperforms the corresponding k-neighbor-encoder; however, both variational variants perform considerably worse than their vanilla and denoising counterparts. Overall, both the vanilla and denoising k-neighbor-encoders work well for this problem.

(a) Clean scenario
(b) Noisy scenario
Figure 8: Classification accuracy with a linear SVM versus labeled training data size, using different variants (i.e., vanilla, denoising, variational) of either autoencoder or k-neighbor-encoder.
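The semi-supervised protocol behind Figure 8 is: freeze the learned encoder, embed all windows, then train a linear SVM on growing labeled subsets of the embeddings. A sketch with synthetic latent codes standing in for encoder outputs (the cluster means, dimensions, and labeled-set sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# stand-ins for encoder outputs: two well-separated latent clusters
z_train = np.vstack([rng.normal(0, 0.3, (20, 8)), rng.normal(3, 0.3, (20, 8))])
y_train = np.repeat([0, 1], 20)
z_test = np.vstack([rng.normal(0, 0.3, (10, 8)), rng.normal(3, 0.3, (10, 8))])
y_test = np.repeat([0, 1], 10)

# vary the labeled-set size, as in the accuracy-vs-size curves
for n in (4, 10, 40):
    idx = np.r_[0:n // 2, 20:20 + n // 2]   # n//2 labeled examples per class
    clf = LinearSVC().fit(z_train[idx], y_train[idx])
    print(n, clf.score(z_test, y_test))
```

A representation that separates classes well keeps accuracy high even at the smallest labeled-set sizes, which is where the neighbor-encoders' advantage shows.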

Table 4 shows the clustering experiment with k-means. For the vanilla encoder-decoder system, the k-neighbor-encoder surpasses the autoencoder in both scenarios, especially the noisy one. When the denoising mechanism is added to the encoder-decoder system, it greatly boosts the performance of the autoencoder, but the k-neighbor-encoder's performance still greatly exceeds it. As in the semi-supervised learning experiment, the variational encoder-decoder system performs poorly on this data set. In general, both the vanilla and denoising k-neighbor-encoders outperform their autoencoder counterparts on the clustering problem on the PAMAP2 data set.

Vanilla Denoising Variational
Clean AE 0.3815 0.4159 0.1597
NE 0.4203 0.4272 0.1192
Noisy AE 0.1844 0.2336 0.1034
NE 0.3832 0.3948 0.1081
Table 4: The clustering adjusted Rand index with k-means.

Figure 1 further demonstrates the advantage of the neighbor-encoder over the autoencoder. Here we use t-SNE to project various representations of one subject's data into two-dimensional space. The representations include the raw data itself, the latent representation learned by the denoising autoencoder, and the latent representation learned by the denoising k-neighbor-encoder. Although the clustering experiment suggests that the autoencoder is comparable with the k-neighbor-encoder, the latent representation learned by the k-neighbor-encoder provides a much more meaningful visualization of the different classes than the rival methods (including the autoencoder) in the face of noisy/irrelevant dimensions.
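A visualization like Figure 1 takes the learned latent codes and runs t-SNE [Maaten and Hinton2008] to embed them in two dimensions. A minimal sketch with scikit-learn, using synthetic latent codes in place of real encoder outputs (the cluster layout and dimensions are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in latent codes for one subject's windows (e.g., encoder outputs),
# drawn from three synthetic activity clusters
latent = np.vstack([rng.normal(c, 0.5, (50, 16)) for c in (0, 4, 8)])

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(latent)
print(emb.shape)   # one 2-D point per window, ready for a scatter plot
```

Coloring each 2-D point by its activity label then makes the class structure of the representation directly visible.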

5 Conclusion

In this work, we have proposed an unsupervised representation learning framework called neighbor-encoder that is both general, in that it can easily be applied to data in various domains, and versatile, in that it can incorporate domain knowledge by utilizing different neighborhood functions. We have showcased the effectiveness of the neighbor-encoder compared to the autoencoder in various domains, including images, text, and time series. In future work, we plan to 1) explore the possibility of applying the neighbor-encoder to problems such as one-shot learning, and 2) demonstrate its usefulness in more practical and applied tasks, including information retrieval. We have made all code and models available on the project website [Yeh2018], to allow others to confirm and extend our work.


  • [Agrawal, Carreira, and Malik2015] Agrawal, P.; Carreira, J.; and Malik, J. 2015. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision.
  • [Bengio et al.2007] Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, 153–160.
  • [Bojanowski and Joulin2017] Bojanowski, P., and Joulin, A. 2017. Unsupervised learning by predicting noise. In Proceedings of the 34th international conference on Machine learning.
  • [Cardoso-Cachopo2007] Cardoso-Cachopo, A. 2007. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
  • [Catherine and Cohen2017] Catherine, R., and Cohen, W. 2017. Transnets: Learning to transform for recommendation. arXiv preprint arXiv:1704.02298.
  • [Chen and Zaki2017] Chen, Y., and Zaki, M. J. 2017. Kate: K-competitive autoencoder for text. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 85–94. ACM.
  • [Coates and Ng2012] Coates, A., and Ng, A. Y. 2012. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade. Springer. 561–580.
  • [Donahue, Krähenbühl, and Darrell2016] Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • [Dong, Chawla, and Swami2017] Dong, Y.; Chawla, N. V.; and Swami, A. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 135–144. ACM.
  • [Goodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • [Grover and Leskovec2016] Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864. ACM.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Huang et al.2007] Huang, F. J.; Boureau, Y.-L.; LeCun, Y.; et al. 2007. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. IEEE Conference on, 1–8.
  • [Huang, Chou, and Yang2017] Huang, Y.-S.; Chou, S.-Y.; and Yang, Y.-H. 2017. Similarity embedding network for unsupervised sequential pattern learning by playing music puzzle games. arXiv preprint arXiv:1709.04384.
  • [Jayaraman and Grauman2015] Jayaraman, D., and Grauman, K. 2015. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, 1413–1421.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [Kohonen1982] Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biological cybernetics 43(1):59–69.
  • [Larsen et al.2015] Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
  • [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Lewis et al.2004] Lewis, D. D.; Yang, Y.; Rose, T. G.; and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5(Apr):361–397.
  • [Lloyd1982] Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
  • [Makhzani and Frey2013] Makhzani, A., and Frey, B. 2013. K-sparse autoencoders. arXiv preprint arXiv:1312.5663.
  • [Makhzani and Frey2015] Makhzani, A., and Frey, B. J. 2015. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, 2791–2799.
  • [Makhzani et al.2015] Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • [Nguyen et al.2017] Nguyen, M.; Purushotham, S.; To, H.; and Shahabi, C. 2017. m-TSNE: A framework for visualizing high-dimensional multivariate time series. arXiv preprint arXiv:1708.07942.
  • [Pathak et al.2017] Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; and Hariharan, B. 2017. Learning features by watching objects move. In Computer Vision and Pattern Recognition, 2017. IEEE Conference on.
  • [Perozzi, Al-Rfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710. ACM.
  • [Radford, Metz, and Chintala2015] Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [Reiss and Stricker2012a] Reiss, A., and Stricker, D. 2012a. Creating and benchmarking a new dataset for physical activity monitoring. In Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments,  40. ACM.
  • [Reiss and Stricker2012b] Reiss, A., and Stricker, D. 2012b. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on, 108–109. IEEE.
  • [Rezende, Mohamed, and Wierstra2014] Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • [Ribeiro, Saverese, and Figueiredo2017] Ribeiro, L. F.; Saverese, P. H.; and Figueiredo, D. R. 2017. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 385–394. ACM.
  • [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 843–852.
  • [Tang et al.2015] Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web.
  • [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11:3371–3408.
  • [Wang and Gupta2015] Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, 2794–2802.
  • [Yang et al.2017] Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th international conference on Machine learning.
  • [Yeh, Kavantzas, and Keogh2017] Yeh, C.-C. M.; Kavantzas, N.; and Keogh, E. 2017. Matrix profile vi: Meaningful multidimensional motif discovery. In 2017 IEEE 17th International Conference on Data Mining (ICDM).
  • [Yeh2018] Yeh, C.-C. M. 2018. Project website.
  • [Zheng, Noroozi, and Yu2017] Zheng, L.; Noroozi, V.; and Yu, P. S. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 425–434. ACM.