AutoEmbedder: A semi-supervised DNN embedding system for clustering

07/11/2020 ∙ by Abu Quwsar Ohi, et al. ∙ King Abdulaziz University

Clustering is a widely used unsupervised learning method that deals with unlabeled data. Deep clustering has become a popular study area that relates clustering with Deep Neural Network (DNN) architectures. Deep clustering methods downsample high-dimensional data, and may also incorporate a clustering loss. Deep clustering has also been introduced in semi-supervised learning (SSL). Most SSL methods depend on pairwise constraint information, which is a matrix encoding whether data pairs can be in the same cluster or not. This paper introduces a novel embedding system named AutoEmbedder, which downsamples higher-dimensional data to clusterable embedding points. To the best of our knowledge, this is the first research endeavor that relates a traditional classifier DNN architecture with a pairwise loss reduction technique. The training process is semi-supervised and uses a Siamese network architecture to compute the pairwise constraint loss in the feature learning phase. The AutoEmbedder outperforms most of the existing DNN-based semi-supervised methods tested on well-known datasets.




1 Introduction

Clustering is a fundamental approach to unsupervised learning. It is a widely studied topic and is applied in a wide range of applications, including image segmentation [266767], image processing [649912], network analysis [10.1145/1516360.1516426], document analysis [huang2018adaptive, Kim2017BagofconceptsCD], and so on. Clustering remains an active research area due to its simplicity and its ability to find patterns in unlabeled data. Although clustering is broadly used, the performance of clustering methods degrades when they are applied to high-dimensional data. To overcome this limitation, researchers apply feature reduction strategies that reduce the number of features while keeping the necessary ones.

Principal Component Analysis (PCA) is a common method used for data dimensionality reduction [10.5555/2976248.2976312, REN2012147]. Dimensionality reduction of data can be achieved through feature extraction or feature selection. However, in this paper, we attain dimensionality reduction of data through PCA, which reduces the dimension of data through feature extraction only.

The critical weakness of this process is that it ignores the relationship between clustering and the feature learning procedure. To eliminate this issue, Discriminative Cluster Analysis (DCA) was introduced. The process combines Linear Discriminant Analysis (LDA) and K-means into a joint framework. However, the aforementioned method fails to produce a better estimation.

Due to recent advancements, Deep Neural Networks (DNN) have been widely applied in supervised as well as unsupervised learning. The use of DNNs in clustering methods is often referred to as deep clustering. Almost all deep clustering architectures contain two phases: feature transformation and clustering. However, some deep clustering methods learn feature transformation and clustering jointly [xie2015unsupervised]. Although most unsupervised deep clustering methods fail to generate appreciable performance on complex datasets, the current state-of-the-art models generate promising results on simple datasets [xie2015unsupervised, shaham2018spectralnet].

Some studies also use a feature mapping network. The most used one is the basic Convolutional Deep Neural Network (CDNN), which is pre-trained on a bigger dataset. The trained CDNN is then used to generate embedding points from unseen data, and the embedding points are used to perform clustering [athanasiadis2018framework]. This type of learning is often interpreted as a transfer learning method. However, until now, no studies have attempted to improve the accuracy of CDNN networks by calculating a cluster loss.

Although clustering is unsupervised, there may exist some prior knowledge of the dataset that is to be used for a particular task. This prior knowledge is often used in Semi-Supervised Clustering (SSC). SSC methods rely on a pairwise constraint matrix to gain better accuracy than unsupervised deep clustering [REN2019121]. A pairwise constraint matrix records whether two instances are related or not. If they are related, they must be in the same cluster; otherwise, they must be in different clusters. SSC can use this information to improve its learning.

This paper contributes to a semi-supervised clustering process via DNN architecture. We introduce a semi-supervised embedding system named AutoEmbedder that aims to generate clusterable embedding points based on pairwise constraints. The AutoEmbedder is built upon traditional DNN architecture. The training process of the AutoEmbedder uses pairwise constraints, and this type of training procedure is termed an SSL process [REN2019121]. The AutoEmbedder is iteratively trained on a Siamese Neural Network (SNN) architecture. The SNN architecture uses two identically weighted AutoEmbedders in parallel. Therefore, the SNN receives a pair of inputs and generates a pair of outputs. From the SNN architecture, a pairwise loss is computed by calculating the pairwise distance of the embeddings generated by the SNN-AutoEmbedder. This loss is then reduced using the traditional backpropagation technique along with an optimization function. The AutoEmbedder is extracted from the SNN, and the trained AutoEmbedder is used to generate meaningful embeddings, on which clustering is performed.

Overall, our main contributions include:

  • We develop a semi-supervised embedding system named AutoEmbedder that learns to produce cluster separable meaningful embedding points based on pairwise constraints.

  • We introduce DNN as an embedding system.

  • We introduce a procedure of using DNN architectures that links to the embedding system and clustering.

  • We carry out experiments addressing unsupervised and semi-supervised DNN based algorithms and validate that the AutoEmbedder performs better in most of the complex datasets.

The rest of this paper is organized as follows: Section 2 presents the related work. Section 3 introduces the overall architecture and AutoEmbedder training procedure. Section 4 contains empirical results done on the architecture. Finally, Section 5 concludes the paper.

2 Related Work

Unsupervised learning is the process of identifying unknown patterns in an unlabeled dataset. Before the advancement of neural networks, clustering algorithms were the only methods used to perform unsupervised learning. Clustering is the process of separating data points based on their dissimilarity. Before the widespread adoption of deep neural networks, unsupervised clustering was performed using generalization techniques. Through a generalization technique, a machine learning model can identify specific unseen example classes based on specific feature patterns. Nevertheless, these unseen examples must contain the appropriate feature patterns. Otherwise, machine learning models may generate inappropriate cluster regions. This process of deriving a general feature or concept shared between seen and unseen data is often termed a generalization process. There exist generalization techniques based on PCA as well [10.1145/1015330.1015408]. However, down-sampling higher-dimensional data with PCA does not greatly improve clustering accuracy.

After the advancement of neural network architectures, researchers have introduced many unsupervised learning methods that are based on neural networks. Autoencoders (AE), Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), Deep Belief Nets (DBN), etc. are common architectures used to create the recent state-of-the-art unsupervised learning architectures [8412085]. Most of the methods relying on autoencoders require a pre-training process for calibration. Deep Clustering Network (DCN) represents a combined approach of autoencoders and the k-means algorithm [yang2016kmeansfriendly]. Deep Embedding Network (DEN) relies only on the reconstruction loss of autoencoders and converges to a cluster-friendly representation [6976982]. Deep Continuous Clustering (DCC), Deep Embedded Regularized Clustering (DEPICT), Deep Multi-Manifold Clustering (DMC), and Deep Subspace Clustering Networks (DSC-Nets) similarly use the reconstruction loss of autoencoders to perform clustering [shah2018deep, dizaji2017deep, AAAIW1715099, NIPS2017_6608]. Deep Embedding Clustering (DEC) is a renowned model that uses a pre-trained autoencoder [xie2015unsupervised]. Later, the model is fine-tuned using the combination of cluster loss and autoencoder loss, which is described as hardening loss. Unsupervised CDNN depends on features extracted from deep neural networks and fine-tunes the network based on a clustering loss, where the clustering loss measures the error of combining data points of different clusters into the same cluster and vice versa. It has also been shown that a supervised CDNN architecture, pre-trained on a large dataset, acquires better accuracy than traditional unsupervised pre-trained CDNN-based methods [gurin2017cnn]. On the contrary, Joint Unsupervised Learning (JULE) and Deep Adaptive Image Clustering (DAC) are non-pre-trained CDNN architectures [Yang_2016_CVPR, Chang2017DeepAI]. The drawback of JULE is that the computational cost and memory complexity of its learning process are greater than average for large datasets. Unlike JULE, DAC achieves better performance on some challenging datasets. Compared to the vast number of AE and CDNN implementations, there are relatively few implementations based on VAE and GAN architectures. Most VAE- and GAN-based architectures fall behind due to their high complexity and their difficulty in converging. Gaussian Mixture VAE (GMVAE) and Variational Deep Embedding (VaDE) use the VAE architecture [dilokthanakul2016deep, jiang2016variational]. Categorical Generative Adversarial Network (CatGAN) and Information Maximizing Generative Adversarial Network (InfoGAN) are well-recognized architectures that are based on GAN. Although these procedures present outstanding performance on simple datasets, they fail to produce reasonable performance on complex datasets. To improve performance, semi-supervised training architectures are used.

An SSL process combines a small portion of labeled data with a large portion of unlabeled data. In the context of clustering, there are several variants of semi-supervised methods [Bair_2013]. In SSL, some data may contain information suitable for training the unsupervised architecture. This information might be the label of the data or a cluster linkage. Cluster linkage defines pairwise information by establishing connections among data pairs [10.1145/1015330.1015360, Wagstaff01constrainedk-means, ilprints528, 10.1007/978-3-319-97304-3_64]. Cluster linkage is also called a pairwise constraint because it contains pairwise data linkage information. Most SSL methods use pairwise constraints as their basic training information. The main levers for improving the overall process are the choice of data dimensionality reduction technique or objective function, the pairwise constraints, and the pairwise loss calculation. Gang Chen used a Restricted Boltzmann Machine (RBM) as the objective function [chen2015deep]. Although the approach is promising, it fails to generate better accuracy with less data. A semi-supervised implementation of the renowned DEC method, named Semi-supervised Deep Embedded Clustering (SDEC), was shown to give better accuracy than the traditional DEC architecture [REN2019121]. However, the method fails to give better estimations on well-known datasets.

To further strengthen performance, many neural network architectures for unsupervised learning that are pre-trained on complex datasets are used to perform clustering on different datasets. This type of training is often referred to as transfer learning. Transfer learning relies on data dependency [tan2018survey] and is actively used in many domains, including unsupervised learning [athanasiadis2018framework]. Convolutional Neural Networks (CNN) have also been shown to perform better on complex image datasets [gurin2017cnn]. Therefore, a feed-forward CNN is used to perform semi-supervised classification, referred to as Semi-Supervised Feed-Forward CNN (SSFF-CNN) [chen2019semisupervised]. The SSFF-CNN architecture relates Feed-Forward CNN (FF-CNN) with SSL [zhang2017interpretable]. The parameters of the FF-CNN target layers are generated from the statistics of the previous layers. CNN methods that learn through backpropagation are referred to as BP-CNN methods [zhang2017interpretable], and these are also used in SSL. The most promising CNN-based semi-supervised network architectures are built using an ensemble system, where multiple weak architectures are connected to obtain a stronger one [2012]. Nevertheless, higher time and memory complexity is the burden of these ensemble systems.

SSL is mostly suitable for large datasets for which sufficient label information is not available. SSL becomes essential when there exists a large unlabeled dataset. The challenge of SSL is to acquire higher accuracy when it is trained with a small amount of labeled data. Most semi-supervised and unsupervised learning architectures use an AE for data dimensionality reduction. In most cases, the AE has to be pre-trained on similar types of datasets to achieve good performance. Moreover, an AE stores most of the information in the data, which in some cases may not be useful as a feature. Instead of relying on an AE, we propose a DNN architecture that performs data dimensionality reduction and can be used as an embedding system. The recent CDNN semi-supervised architectures rely on BP-CNN and FF-CNN architectures, but both exhibit poor accuracy when they are trained with less data. Although ensemble-based CDNN architectures show better performance on complex datasets, they have higher time and memory complexity due to the fusion of multiple DNNs in the architecture. We address this difficulty by implementing a better training method that requires a single CDNN architecture for producing clusterable embedding points.

This paper refers to a DNN as an embedding system on which clustering is applied. Although this architecture can be applied to most types of datasets, this paper specifically relates the AutoEmbedder to image-based datasets, and most evaluations are performed on image-based datasets. The AutoEmbedder extracts features from higher-dimensional data and compresses the features into a lower-dimensional embedding point, on which clustering is performed. The CDNN performs the dimensionality reduction by backpropagating a distance loss that is generated from the Siamese network architecture.

3 Methodology

The proposed approach of this paper relates the dimension reduction technique of DNN networks [gurin2017cnn] with a semi-supervised deep embedding system. Firstly, an embedding function is generated using DNN architecture, and it is trained using SNN architecture. Finally, the trained embedding function is used to transform higher dimensional data into meaningful low dimensional embedding points, on which clustering can be performed. The algorithm of the AutoEmbedder training process is presented in Algorithm 1. In the following subsections, we first derive the properties of the AutoEmbedder along with its training process. In the subsequent section 3.6, we emphasize the intuition and the proof of the overall training architecture of the AutoEmbedder. Finally, in the last subsection 3.7, we present a randomized training data selection scheme that boosts the accuracy of the AutoEmbedder.

Figure 1: AutoEmbedder extraction from classifier CNN.

3.1 AutoEmbedder Architecture

A classifier neural network classifies inputs based on the activated node that resides in the output layer. The output layer is also referred to as the softmax layer of a DNN architecture. In this paper, the layer preceding the final softmax layer of any classifier architecture is referred to as the ‘feature space’ layer, which is illustrated in Figure 1. The feature space layer extracts the final features from which the output layer performs the classification task. The feature space layer works as the output layer of the AutoEmbedder after the AutoEmbedder training process is complete. The number of nodes that reside in the feature space layer denotes the dimension of the embedding points. The remaining DNN architecture is defined as the AutoEmbedder. Mathematically, the AutoEmbedder can be represented as $f: \mathbb{R}^{n} \to \mathbb{R}^{d}$, with $n$ being the input dimension and $d$ being the dimension of the embedding subspace. Like the autoencoder, the DNN architecture of the AutoEmbedder performs well in downsampling higher-dimensional data while keeping the required clusterable properties.

Figure 2: The SNN training architecture of AutoEmbedder.

3.2 AutoEmbedder Training Network

The training process of the AutoEmbedder uses an SNN architecture, as shown in Figure 2. In the Siamese network architecture, a pair of AutoEmbedders with the same initial weights is placed in parallel. However, it is to be noted that the pair of networks does not share weights. The outputs of the AutoEmbedder pair are connected to a Euclidean distance function,

$$D(p, q) = \lVert p - q \rVert_2 = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2} \tag{1}$$

where $p = f_1(x_1)$ is the output of the first AutoEmbedder and $q = f_2(x_2)$ is the output of the second AutoEmbedder. The functions $f_1$ and $f_2$ are the AutoEmbedder pair. The output of the distance calculation layer is then passed through a Rectified Linear Unit (ReLU) activation function with an upper bound value $\alpha$, defined as,

$$\mathrm{ReLU}_{\alpha}(z) = \begin{cases} 0, & z < 0 \\ z, & 0 \le z < \alpha \\ \alpha, & z \ge \alpha \end{cases} \tag{2}$$

where $z$ is the input of the ReLU activation function and $\alpha$ is a hyperparameter set in the AutoEmbedder training phase. The hyperparameter $\alpha$ is also used in the pairwise constraints. Due to the upper bound set on the ReLU activation layer, the output of the SNN architecture lies in the range $[0, \alpha]$. The defined SNN architecture receives a pair of data inputs and outputs the Euclidean distance of the embedding vector pair. Combining equations 1 and 2, the overall training framework of the SNN can be represented as,

$$S(x_1, x_2) = \mathrm{ReLU}_{\alpha}\big(D(f_1(x_1), f_2(x_2))\big) \tag{3}$$

where $S$ represents the SNN architecture function.
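The forward computation described in this subsection (a Euclidean distance followed by a ReLU with an upper bound) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names and the parameter name `alpha` for the distance hyperparameter are assumptions, and the embedding vectors are passed in directly rather than produced by a network.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def capped_relu(z, alpha):
    """ReLU with an upper bound: 0 below zero, z in between, alpha above alpha."""
    return min(max(z, 0.0), alpha)

def snn_output(p, q, alpha):
    """Thresholded pairwise distance; the result always lies in [0, alpha]."""
    return capped_relu(euclidean_distance(p, q), alpha)

# Hypothetical 2-D embeddings; alpha is an assumed name for the distance bound.
near = snn_output([0.0, 0.0], [3.0, 4.0], alpha=100.0)      # plain distance
far = snn_output([0.0, 0.0], [300.0, 400.0], alpha=100.0)   # capped at alpha
```

Because the cap saturates at the bound, any pair that is already farther apart than the bound produces the same output as a pair exactly at the bound.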

3.3 AutoEmbedder Pairwise Constraints

To train the SNN architecture of the AutoEmbedder with a precise target value, a distance hyperparameter $\alpha$ is to be decided. For any input data pair $x_1$ and $x_2$, the pairwise constraint is set to $\alpha$ if there exists a can-not-link constraint. Otherwise, the distance value is set to $0$. This can be mathematically stated as,

$$Y(x_1, x_2) = \begin{cases} 0, & \text{if } x_1 \text{ and } x_2 \text{ hold a must-link constraint} \\ \alpha, & \text{if } x_1 \text{ and } x_2 \text{ hold a can-not-link constraint} \end{cases} \tag{4}$$

The pairwise constraints instruct the AutoEmbedder to create clusterable points. By equation 4, the AutoEmbedder pair is instructed to generate embedding vectors whose distance is close to zero if the input pair refers to the same class. Otherwise, if the input pair refers to different classes, it is instructed to generate embedding vectors whose distance is greater than or equal to $\alpha$.
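As a small illustration of how such targets could be derived when class labels are available for a pair, the sketch below maps same-class pairs to a target distance of zero and mixed-class pairs to the distance hyperparameter. The function name, the `alpha` parameter name, and the example labels are assumptions made for illustration.

```python
def pairwise_target(label_a, label_b, alpha):
    """Target distance: 0 for must-link (same class), alpha for can-not-link."""
    return 0.0 if label_a == label_b else float(alpha)

# Hypothetical labels for three input pairs.
labels_left = [3, 7, 1]
labels_right = [3, 2, 1]
targets = [pairwise_target(a, b, alpha=100)
           for a, b in zip(labels_left, labels_right)]
```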

(a) The actions performed in each training iteration.
(b) After some iterations.
Figure 3: The AutoEmbedder training process moves can-not-link embedding pairs to a distance greater than or equal to $\alpha$ and moves the embedding points of must-link pairs closer together. Subfigures 3(a) and 3(b) illustrate the scenario.

3.4 AutoEmbedder Training

In each iteration of the AutoEmbedder training, the pair of AutoEmbedders is trained twice. Let the SNN architecture be trained through the function $\mathrm{train}(\cdot)$, which receives three parameters. The first two parameters are the inputs of the first and second AutoEmbedder, respectively, and the third parameter is the ground-truth pairwise constraint output. At each iteration, the model is trained as,

$$\mathrm{train}(X_1, X_2, Y), \qquad \mathrm{train}(X_2, X_1, Y) \tag{5}$$

Here, $X_1$ and $X_2$ define two subsets of data, and $Y$ defines the pairwise constraints. After completing the training of the AutoEmbedder pair, either of the two trained AutoEmbedders can be taken from the SNN architecture. The backpropagation phase of each training iteration reduces the AutoEmbedder pairwise loss by moving the can-not-link pairs (mixed-class input pairs) to a distance of $\alpha$ and the must-link pairs (same-class input pairs) closer together, as illustrated in Figure 3. This is the basic requirement for a data embedding to be clusterable.

3.5 AutoEmbedder Pairwise Loss Calculation

The output of the SNN architecture is a thresholded pairwise input distance, which is a continuous output in the range $[0, \alpha]$. Because of this, the SNN architecture is trained as a regression problem. Most SNN-based architectures implement the contrastive loss function [hadsell2006dimensionality]. However, due to the threshold of the final ReLU activation function derived in equation 2, the contrastive loss function may generate improper results. Table 1 presents a comparison of accuracy when the AutoEmbedder is trained with mean squared error (MSE) and with contrastive loss. The training parameters used in the comparison are reported in Table 4. The comparison indicates that MSE is the more suitable loss for the SNN training architecture. Hence, the pairwise loss is calculated using MSE.

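Let the pairwise ground truth vector be $Y$, and the predicted pairwise vector be $\hat{Y}$. The MSE for each iteration batch is calculated as,

$$\mathrm{MSE} = \frac{1}{b} \sum_{i=1}^{b} \left( Y_i - \hat{Y}_i \right)^2 \tag{6}$$

where $b$ is the number of pairs in the batch. The Adam optimizer [kingma2014adam] and backpropagation are used to reduce the MSE.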
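A short worked example may help make the loss concrete. The sketch below computes the mean squared error between target distances and thresholded predicted distances for one small batch; the numbers are made up for illustration and assume a distance bound of 100.

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between target and predicted pair distances."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and equal in length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical batch: two must-link pairs (target 0) and two can-not-link pairs (target 100).
targets = [0.0, 100.0, 0.0, 100.0]
predictions = [10.0, 100.0, 0.0, 80.0]
loss = mean_squared_error(targets, predictions)  # (100 + 0 + 0 + 400) / 4
```

Only the first and last pairs contribute to the loss here: a must-link pair that is still 10 apart, and a can-not-link pair that is 20 short of the bound.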

Dataset Contrastive Loss ACC MSE Loss ACC Contrastive Loss NMI MSE Loss NMI Contrastive Loss ARI MSE Loss ARI
Table 1: A comparison between mean square error loss and contrastive loss based on ACC, NMI, and ARI metrics.
Figure 4: An illustration of optimal clusterable data points. Each pair of clusters tries to maintain a distance of $\alpha$, which is maintained by the SNN architecture through the pairwise constraints.

3.6 Proof of AutoEmbedder Training Architecture

The purpose of the SNN architecture is to train the AutoEmbedder so that, a) must-link embedding vector pairs remain as close as possible, and b) can-not-link embedding vector pairs keep a distance of at least $\alpha$ from each other. If it is possible to construct a function that maintains these properties, it can be concluded that the function generates clusterable embedding vectors. The role of the hyperparameter $\alpha$ is to construct a minimum margin, or distance, between dissimilar data embeddings. Let us consider the two AutoEmbedders of the SNN architecture as two functions, $f_1$ and $f_2$. Let us also consider that $A$ and $B$ are two different data class sets, which hold a can-not-link constraint with each other and a must-link constraint within themselves.

By considering these cluster characteristics, it can be assumed that a function $f_1$ generates clusterable embedding points if it maintains the following properties,

$$\lVert f_1(a) - f_1(a') \rVert_2 \approx 0, \quad \forall\, a, a' \in A \tag{7}$$

$$\lVert f_1(a) - f_1(b) \rVert_2 \ge \alpha, \quad \forall\, a \in A,\ b \in B \tag{8}$$

The same properties must hold for function $f_2$, since, as discussed in subsection 3.4, either of the functions can be used as the AutoEmbedder.

Through the training process of the SNN architecture, let us consider that both functions converge to an optimal state such that both functions optimally maintain the pairwise constraints. Therefore, the functions hold the following property due to the must-link constraints, for any two inputs $a$ and $a'$ of the same class,

$$\lVert f_1(a) - f_2(a') \rVert_2 \approx 0 \implies f_1(a) \approx f_2(a') \tag{9}$$

Also, due to the can-not-link constraint, the functions hold the following property,

$$\lVert f_1(a) - f_2(b) \rVert_2 \ge \alpha, \quad \forall\, a \in A,\ b \in B \tag{10}$$

By placing the approximate value from equation 9, $f_2(b) \approx f_1(b)$,

$$\lVert f_1(a) - f_1(b) \rVert_2 \ge \alpha \tag{11}$$

The above equation proves that the SNN architecture maintains the property described in equation 8.

Furthermore, let us suppose that the function $f_1$ produces embedding points for all input pairs $a$ and $x$ that are at a distance greater than zero, where $x$ is a random data input. This can be stated as,

$$\lVert f_1(a) - f_1(x) \rVert_2 > 0 \tag{12}$$

However, consider the case where $x = a'$ is an input of the class $A$. Mathematically, this can be written as,

$$\lVert f_1(a) - f_1(a') \rVert_2 > 0 \tag{13}$$

If we place the relevant value from equation 9, $f_1(a') \approx f_2(a')$, we get,

$$\lVert f_1(a) - f_2(a') \rVert_2 > 0 \tag{14}$$

Equations 13 and 14 contradict equation 9, because both functions are considered to satisfy equations 9 and 10 after they are trained through the SNN. As the distance cannot be negative, we can rewrite equation 13 as,

$$\lVert f_1(a) - f_1(a') \rVert_2 \approx 0 \tag{15}$$

The above equation proves that the SNN architecture maintains the property of equation 7. We can also construct a similar proof for function $f_2$. Hence, it can be concluded that the SNN architecture can train two functions such that they generate clusterable embedding points based on the pairwise constraints.

From the general implementation of the AutoEmbedder training architecture, it can be understood that the hyperparameter $\alpha$ works as a cluster margin. The margin states that two inputs are considered to be in separate clusters if their distance is greater than or equal to the margin value. By maintaining the pairwise constraints, the AutoEmbedder can learn to downsample higher-dimensional data into clusterable points, as illustrated in Figure 4. Also, the threshold of the ReLU activation function (equation 2) serves to reduce the distance of the can-not-link pairs if they are farther apart than $\alpha$. This scenario is illustrated in Figure 4 for cluster pair A and C.

Figure 5: Accuracy measurement of two datasets with different $\alpha$ values and with balanced and imbalanced pairwise constraints during training. The illustrated result is obtained after 300 epochs.

3.7 AutoEmbedder Random Train Data Selection

The AutoEmbedder is trained by selecting a fixed number of input pairs per training iteration. At each training iteration of the AutoEmbedder, two data subsets $X_1$ and $X_2$ are created, where $X_1 \subset X$, $X_2 \subset X$, and $|X_1| = |X_2| = b$. Here, $b$ is the number of inputs in each iteration. The elements of the sets $X_1$ and $X_2$ are randomly selected from the dataset $X$. The two data subsets $X_1$ and $X_2$ are used to train the AutoEmbedder at each iteration.

Let $P$ be the approximate probability of randomly selecting a data pair of the same class, $n$ be the number of data points of the same class, and $c$ be the total number of classes. Then $P$ can be defined as,

$$P \approx \frac{n}{n \times c} = \frac{1}{c} \tag{16}$$

As $c > 2$ for the datasets considered here, the approximate probability of randomly selecting a pair of input data of different classes, $1 - P$, is greater than $P$. As a result, a randomized selection of data pairs will contain more can-not-link constraints than must-link constraints. This leads to biased training data. If these link constraints are kept fully random or imbalanced, the accuracy of the AutoEmbedder is observed to be much lower after training, as depicted in Figure 5. Hence, for each iteration, half of the data pairs are randomly selected from must-link pairs, and the other half are randomly selected from can-not-link pairs. Mathematically, it can be viewed as,

$$n_{ml} = n_{cl} = \frac{b}{2} \tag{17}$$

where $n_{ml}$ is the number of pairs containing a must-link constraint, $n_{cl}$ is the number of pairs containing a can-not-link constraint, and $n_{ml} + n_{cl} = b$. By maintaining this criterion, a substantial accuracy improvement is observed, as illustrated in Figure 5.
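The balanced selection scheme can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: it assumes class labels are available for the sampled subset, that classes are roughly equal in size (so a random pair shares a class with probability of about one over the number of classes), and the names `balanced_pairs`, `alpha`, and `rng` are hypothetical.

```python
import random
from collections import defaultdict

def same_class_probability(num_classes):
    """Approximate chance that a random pair shares a class (equal class sizes assumed)."""
    return 1.0 / num_classes

def balanced_pairs(labels, batch_size, alpha, rng):
    """Return (left_index, right_index, target) triples:
    half must-link (target 0), half can-not-link (target alpha)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = list(by_class)
    pairs = []
    for _ in range(batch_size // 2):                     # must-link half
        cls = rng.choice(classes)
        # Both indices come from one class; they may coincide in this simple sketch.
        pairs.append((rng.choice(by_class[cls]), rng.choice(by_class[cls]), 0.0))
    for _ in range(batch_size - batch_size // 2):        # can-not-link half
        cls_a, cls_b = rng.sample(classes, 2)            # two distinct classes
        pairs.append((rng.choice(by_class[cls_a]), rng.choice(by_class[cls_b]), float(alpha)))
    return pairs

labels = [i % 10 for i in range(200)]                    # hypothetical 10-class labels
batch = balanced_pairs(labels, batch_size=128, alpha=100, rng=random.Random(0))
```

With ten balanced classes, purely random pairing would yield a must-link pair only about one time in ten, which is the imbalance the half-and-half split is meant to remove.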

Input: Subset of the dataset to be clustered $X$, AutoEmbedder model $f$, Number of iterations $T$, Training batch per iteration $b$, Distance hyperparameter $\alpha$
Result: Trained AutoEmbedder
initialize two AutoEmbedder models $f_1$, $f_2$ with the same architecture and weights as $f$;
build a siamese network $S$ with AutoEmbedders $f_1$ and $f_2$;
while $t < T$ do
       initialize two empty input data sets $X_1$ and $X_2$;
       initialize an empty target output set $Y$;
       while $|X_1| < b/2$ do
             select two random data inputs $x_1$ and $x_2$ holding a must-link constraint;
             append $x_1$ to $X_1$, $x_2$ to $X_2$ and $0$ to $Y$;
       while $|X_1| < b$ do
             select two random data inputs $x_1$ and $x_2$ holding a can-not-link constraint;
             append $x_1$ to $X_1$, $x_2$ to $X_2$ and $\alpha$ to $Y$;
       train siamese network $S$ with inputs $X_1$, $X_2$ and $Y$;
       train siamese network $S$ with inputs $X_2$, $X_1$ and $Y$;
Algorithm 1 AutoEmbedder Training Algorithm

4 Empirical Results

In this section, we describe evaluation metrics along with the experimental setup of the AutoEmbedder architecture, and the information of the datasets on which the tests were performed. Finally, we demonstrate the result of the AutoEmbedder.

4.1 Evaluation Metric

To evaluate the correctness of the proposed method, three well known and standard accuracy metrics are used. The evaluation metrics are presented below.

Accuracy: Accuracy (ACC) defines the unsupervised clustering accuracy, stated as,

$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbb{1}\{ y_i = m(c_i) \}}{n}$$

where $y_i$ defines the ground-truth label, $c_i$ defines the cluster assignment produced by the algorithm, $n$ is the number of data points, and $m$ ranges over all possible one-to-one mappings between labels and clusters, from which the best mapping is taken. The best mapping can be found efficiently using the Hungarian algorithm.
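For intuition, the best one-to-one mapping can be found by brute force when the number of clusters is small. The sketch below enumerates every mapping with the standard library instead of using the Hungarian algorithm, so it illustrates the definition only and is not suitable for many clusters. The function name and the example data are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best-match accuracy over all one-to-one cluster-to-label mappings (brute force)."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("this sketch assumes no more clusters than labels")
    best_hits = 0
    for assigned in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, assigned))          # cluster id -> label
        hits = sum(1 for y, c in zip(true_labels, cluster_ids) if mapping[c] == y)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

truth = [0, 0, 0, 1, 1, 2, 2, 2]
found = [2, 2, 1, 0, 0, 1, 1, 1]   # cluster ids are arbitrary names
acc = clustering_accuracy(truth, found)
```

In this example the best mapping sends cluster 2 to label 0, cluster 0 to label 1, and cluster 1 to label 2, leaving a single misassigned point.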

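Normalized Mutual Information: The normalized mutual information (NMI) is defined as,

$$\mathrm{NMI}(Y, C) = \frac{I(Y; C)}{\frac{1}{2}\left[ H(Y) + H(C) \right]}$$

where $Y$ is the ground truth and $C$ is the predicted cluster. $I(Y; C)$ refers to the mutual information between $Y$ and $C$, and $H(\cdot)$ denotes the entropy.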
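A compact way to see how mutual information and entropy combine is to compute them from label counts. The sketch below assumes the arithmetic-mean normalization (mutual information divided by the average of the two entropies); the text does not confirm which normalization the authors use, so treat that choice, and the function names, as assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(n * nij / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """Mutual information normalized by the mean of the two entropies (assumed form)."""
    denom = 0.5 * (entropy(a) + entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 1.0

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical partitions, renamed clusters
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # statistically independent partitions
```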

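Adjusted Rand Index: The adjusted Rand index (ARI) is calculated using the contingency table [santos2009use]. The ARI can be derived as,

$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right] - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}$$

Here, $n_{ij}$, $a_i$, and $b_j$ are the values from the contingency table: the cell counts, the row sums, and the column sums, respectively, and $n$ is the total number of data points.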
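The contingency-table computation can be sketched with the standard library, using the usual adjusted Rand index definition in terms of cell counts, row sums, and column sums. The function name is hypothetical, and the handling of the degenerate case (both partitions trivial) is a choice made for this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand index from the contingency table of two label sequences."""
    n = len(a)
    if n < 2 or n != len(b):
        raise ValueError("need two label sequences of equal length >= 2")
    sum_cells = sum(comb(v, 2) for v in Counter(zip(a, b)).values())  # cell counts
    sum_rows = sum(comb(v, 2) for v in Counter(a).values())           # row sums
    sum_cols = sum(comb(v, 2) for v in Counter(b).values())           # column sums
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:      # degenerate case, treated as perfect agreement here
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The second example shows that the index can fall below zero when two partitions agree less often than chance would predict.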

Both NMI and ARI produce a result in the range whereas, ACC produces a result in the scale . The higher value of these indices indicates a better correspondence between the cluster and the ground truth.

4.2 Experimental Setup

The neural network architecture is implemented using Keras [chollet2015keras]. NumPy is used to perform mathematical operations [walt2011numpy]. The overall clustering is performed using K-means via the scikit-learn library [scikit-learn] with the default parameters. The Keras implementation of the MobileNet architecture is used as the AutoEmbedder. The parameters of the MobileNet architecture are defined in Table 2.

Argument Value
alpha 1
depth_multiplier 1
dropout 0.2
include_top False
pooling None
input_tensor None
weights None
Table 2: The MobileNet parameters set for AutoEmbedder.

The ‘input_shape’ parameter of Table 2 is a variable that is based on the applied datasets. Also, the last layer of the MobileNet is ‘conv_pw_13_relu’ with a shape of . To reduce the dimension for each dataset, a feature space layer is added. The shape of the feature space layer is reported in column ‘Embedding Dimensions’ of Table 4. The number of nodes residing in the feature space layer denotes the embedding dimension of the AutoEmbedder. As the MobileNet can handle a minimum image shape of (32, 32, 3), the datasets with smaller image shapes are zero-padded. To convert a single-channel black and white image to a three-channel image, a copy of the black and white image is placed into each of the three channels. A graphical illustration of this image conversion is shown in Figure 6. The converted dimension is reported in Table 3. The must-link and can-not-link constraints are defined using the available data labels of the tested datasets.
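The image conversion described above (zero-padding a small single-channel image and copying it into three channels) can be sketched with nested lists. The text does not say where the padding is placed, so centering the image is an assumption of this sketch, as are the function name and its parameters.

```python
def pad_and_stack(image, target=32, channels=3):
    """Zero-pad a square grayscale image (list of rows) to target x target,
    then copy each pixel into every channel. Centered padding is an assumption."""
    size = len(image)
    if size > target or any(len(row) != size for row in image):
        raise ValueError("expected a square image no larger than the target size")
    before = (target - size) // 2
    after = target - size - before
    padded = [[0] * target for _ in range(before)]
    padded += [[0] * before + list(row) + [0] * after for row in image]
    padded += [[0] * target for _ in range(after)]
    # Resulting layout: rows x columns x channels.
    return [[[pixel] * channels for pixel in row] for row in padded]

digit = [[1] * 28 for _ in range(28)]      # hypothetical 28 x 28 single-channel image
converted = pad_and_stack(digit)           # 32 x 32 x 3
```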

Figure 6: Conversion process of the MNIST input image shape (28, 28, 1) to (32, 32, 3).

The AutoEmbedder for the REUTERS dataset contains four dense layers of 512, 256, 128, and 64 nodes, respectively, with the default ReLU activation function. The Keras implementations of mean squared error and the Adam optimizer are used to train the AutoEmbedder for all the datasets.

Dataset Classes Size Description Input Dimension Converted Input Dimension
MNIST 10 60,000 Handwritten Digits (28, 28, 1) (32, 32, 3)
Fashion-MNIST 10 60,000 Shoes/Clothing (28, 28, 1) (32, 32, 3)
CIFAR10 10 50,000 Vehicle/Animals (32, 32, 3) (32, 32, 3)
COIL20 20 1,440 Objects (128, 128, 1) (128, 128, 3)
SVHN 10 99,289 Street View House Number (32, 32, 3) (32, 32, 3)
REUTERS 46 11,228 Word Sequence - -
Table 3: Information on the datasets on which AutoEmbedder is evaluated. The reported image dimension is given by the format .

4.3 Datasets

The AutoEmbedder is tested on well-known datasets. A short description of the datasets on which the AutoEmbedder is evaluated is presented in Table 3.

4.4 Results

The test results are obtained by calculating the mean of the maximum ACC, NMI, and ARI scores over six runs. The results are reported in the mean ± std format. The batch size, the epochs, the embedding dimension used for each dataset, and the distance hyperparameter are reported in Table 4. The training iteration stated in Algorithm 1 can be calculated as, . The AutoEmbedder is compared with both unsupervised and semi-supervised methods. All of the comparisons were performed in the same environment, and the other methods were implemented with their optimal hyperparameters.

Dataset Embedding Dimension Batch Size Epochs Distance
MNIST 2 128 3,000 100
Fashion-MNIST 2 128 3,000 100
CIFAR10 3 128 3,000 100
COIL20 2 128 3,000 100
SVHN 3 128 3,000 100
REUTERS 16 128 3,000 100
Table 4: The parameters used to train the AutoEmbedder on different datasets.

Table 5 presents a comparison based on the ACC, NMI, and ARI scores on the image datasets. The table contains both unsupervised and semi-supervised methods; the semi-supervised methods are marked with an asterisk (*), and the highest scores are marked in bold. Some of the implemented methods generate pseudo labels rather than embeddings [Husser2017AssociativeDC, springenberg2015unsupervised]. Therefore, it is not possible to calculate NMI and ARI scores for these methods, and the corresponding entries are kept blank. From Table 5, it can be observed that some semi-supervised methods [basu2004active, REN2019121] perform less accurately than some unsupervised methods. However, unsupervised methods fail to achieve better accuracy on complex datasets, such as Fashion-MNIST and CIFAR10. GAN based semi-supervised and unsupervised methods [springenberg2015unsupervised, Mukherjee2019] perform better than most other methods, but they suffer from the unstable training of the generator and discriminator [springenberg2015unsupervised]. In contrast, the AutoEmbedder does not suffer from any instability and outperforms all of the other implemented semi-supervised and unsupervised methods.

Model MNIST Fashion-MNIST CIFAR10 COIL20
ACC NMI ARI ACC NMI ARI ACC NMI ARI ACC NMI ARI
ADC [Husser2017AssociativeDC] - - - - - - - -
CatGAN [springenberg2015unsupervised] - - - - - - - -
ClusterGAN [Mukherjee2019]
DAE Network [yang2019deep]
DBC [li2017discriminatively]
DEC-DA [pmlr-v95-guo18b]
DEN [6976982]
DEPICT [dizaji2017deep]
JULE [Yang_2016_CVPR]
RTM [Nina_2019_ICCV]
SpectralNet [shaham2018spectralnet]
TAGnet [e37fa3ff64a04c06adbe9d34d9e86cce]
VaDE [jiang2016variational]
SS-KM* [basu2004active]
SS-DEC* [REN2019121]
CatGAN* [springenberg2015unsupervised] - - - - - - - -
AutoEmbedder*

Table 5: ACC, NMI, and ARI score comparison of different unsupervised and semi-supervised architectures tested on the MNIST, Fashion-MNIST, CIFAR10, and COIL20 datasets. The semi-supervised architectures are marked with an asterisk (*). NMI and ARI entries are kept blank for methods on which they cannot be calculated.
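For readers who want to reproduce the ACC column, the sketch below computes clustering accuracy under the common best-mapping definition, in which cluster ids are matched one-to-one to ground-truth labels with the Hungarian algorithm. Whether this matches the paper's Equation 19 exactly is an assumption, as that equation is not restated in this section. The function name `clustering_accuracy` and the toy labels are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between cluster ids and ground-truth labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids, true_idx = np.unique(y_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(y_pred, return_inverse=True)
    # contingency[i, j] = number of samples in cluster i with true label j
    contingency = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(contingency, (pred_idx, true_idx), 1)
    # Hungarian algorithm: negate the counts to turn maximisation into minimisation.
    rows, cols = linear_sum_assignment(-contingency)
    return contingency[rows, cols].sum() / y_true.size

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [2, 2, 2, 0, 0, 1, 1, 0]  # same grouping with permuted ids and one mistake
print(clustering_accuracy(y_true, y_pred))  # 0.875
```

Because the score is invariant to how clusters are numbered, it can be applied directly to the output of any clustering algorithm run on the embeddings.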

The AutoEmbedder architecture also outperforms unsupervised methods that are applied to textual data. It is tested on the REUTERS dataset, and the evaluation report is presented in Table 6.

(a) Accuracy on MNIST dataset.
(b) Accuracy on CIFAR10 dataset.
(c) Accuracy on SVHN dataset.
Figure 7: A comparison of the AutoEmbedder with different label-based semi-supervised classifiers. Subfigures 7(a), 7(b), and 7(c) illustrate comparisons on the MNIST, CIFAR10, and SVHN datasets, respectively. The accuracy of the AutoEmbedder is reported using the ACC metric of Equation 19, and the accuracy of the label-based semi-supervised methods is reported as the ratio of correct (positive) predictions to total predictions.
DEC  [xie2015unsupervised]
AP  [Frey2007]
HDB  [Campello2013]
RCC  [Shah2017]
RCC-DR  [Shah2017]
Table 6: ACC, NMI, and ARI score of different unsupervised architectures tested on the REUTERS dataset.

We also compare the AutoEmbedder architecture with label-based semi-supervised architectures [chen2019semisupervised, zhang2017interpretable]. The label-based semi-supervised architectures are trained on a small portion of labeled data. Furthermore, as these DCNN methods are trained on labeled data, they contain activation functions in the last layer. Therefore, the final outputs of these architectures are class labels instead of embeddings. Although the training criteria of these methods and the AutoEmbedder differ, we perform a comparison among the architectures. Because the final labels of the label-based DCNN methods are not pseudo labels, the accuracy of these methods is calculated as the ratio of correct (positive) predictions to total predictions. Figure 7 presents comparisons on the MNIST, CIFAR10, and SVHN datasets. In the comparisons, the AutoEmbedder is compared with the label-based semi-supervised FF-CNN, BP-CNN, and ensemble architectures [chen2019semisupervised, zhang2017interpretable]. The methods are compared for different numbers of known labels of the dataset. The label-based semi-supervised methods are trained on the known labels, whereas the AutoEmbedder is trained by constructing pairwise constraints from the known labels. The AutoEmbedder outperforms all the label-based DCNN architectures on the CIFAR10 and SVHN datasets. However, it fails to produce the best results on the simple MNIST dataset, since MobileNet is not a suitable architecture for single-channel image datasets. The label-based DCNN architectures perform less accurately because the final layer activation functions are not optimized to generate optimal hyperplanes when trained on little data. In contrast, the AutoEmbedder generates embeddings based on the distance hyperparameter, which promises to generate low-dimensional clusterable points.
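Constructing pairwise constraints from a set of known labels can be sketched as below. The uniform random sampling of pairs and the function name `sample_pairwise_constraints` are assumptions made for illustration; the text does not specify how pairs are drawn or balanced.

```python
import numpy as np

def sample_pairwise_constraints(labels, n_pairs, rng):
    """Sample index pairs and mark each as must-link (1) or cannot-link (0)."""
    labels = np.asarray(labels)
    first = rng.integers(0, labels.size, size=n_pairs)
    second = rng.integers(0, labels.size, size=n_pairs)
    keep = first != second  # drop pairs of a sample with itself
    first, second = first[keep], second[keep]
    # Two samples are must-link exactly when their known labels agree.
    must_link = (labels[first] == labels[second]).astype(np.int8)
    return first, second, must_link

rng = np.random.default_rng(0)
known_labels = np.array([3, 3, 7, 7, 7, 1])  # hypothetical labeled subset
i, j, link = sample_pairwise_constraints(known_labels, n_pairs=10, rng=rng)
for a, b, m in zip(i, j, link):
    print(a, b, "must-link" if m else "cannot-link")
```

Only the labeled subset contributes constraints, so the number of distinct pairs available grows quadratically with the number of known labels.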

5 Conclusion

This paper introduces an embedding architecture, AutoEmbedder, that produces meaningful clusterable embedding points. The end-to-end training process of the architecture is semi-supervised and requires pairwise cluster-linking information in the training phase. The training procedure does not include any clustering loss measures; instead, it uses a Euclidean distance loss that is minimized by backpropagation. The AutoEmbedder only produces clusterable embedding points, and it can be built on any classification architecture with the required embedding dimension. The benchmarks of this paper show that the AutoEmbedder achieves better results on almost all the datasets. The embedding system constructs three-dimensional embedding points from the complex three-channel image datasets CIFAR10 and SVHN and still produces better results. From the empirical results, it may be concluded that the proposed method is beneficial for semi-supervised learning. We believe that the overall contribution of this paper opens a wider perspective on embedding systems, semi-supervised learning, and image clustering research.