Deep learning has shown great promise in solving pattern recognition problems (1; 2). It makes it possible to learn space transformations and to gradually extract higher semantic representations from one layer to the next. The automatic feature extraction is guided by prior knowledge, generally the class labels, which are considered the highest semantic representation of the data samples. Two samples belonging to the same class, although different from the raw-data perspective (e.g., at the pixel level in computer vision), are pushed to be considered the same by the neural network. Therefore, within-class variances and between-class similarities are destroyed, whereas within-class similarities and between-class variances are preserved and emphasized (3). Hence, label-oriented supervision can be seen as a dimensionality reduction operation, where we extract only the information embedded with respect to the labels and destroy everything else. Unfortunately, the progress made by deep learning is still mostly limited to supervised learning. It is still unclear how to obtain similar results without using any labeled data. Discovering hidden data structures based on the least possible prior knowledge (essentially the number of classes) remains an open and challenging research area.
As a matter of fact, a tremendous amount of labels can be expensive to obtain. Therefore, clustering has been intensively studied for several decades. Classical clustering approaches like K-means (4), DBSCAN (5) and agglomerative clustering (6) capture similarities implicitly, based on a notion of distance in the original data space, but they perform poorly in high-dimensional spaces (7). Classical clustering methods that rely on dimensionality reduction (8; 9) have proven more effective when dealing with high-dimensional data samples. However, when it comes to clustering high-semantic datasets using shallow representations, all the classical clustering models fall short of reliable discriminative abilities. Since natural data is compositional and can be efficiently represented hierarchically, a deep multi-layer architecture is a natural choice to handle the hierarchical structure of the data (10). Recently, clustering based on deep learning has gained popularity. Clustering approaches that rely on deep non-linear transformations are known as deep clustering.
Most of the existing deep clustering methods are based on autoencoders, which are neural networks with a particular architecture. In a nutshell, an autoencoder is composed of two parts, an encoder and a decoder, which are trained to reconstruct the input data samples after encoding them in a latent space. There are several variants of autoencoders which aim at learning the key factors of similarity in the embedded space with respect to the data semanticity (11; 12). The common point between all of them is the reconstruction loss function; they mainly differ from each other in the way the encoding operation is constrained. As mentioned before, the main challenge in unsupervised learning is that there is no obvious, straightforward objective function that can cluster data points according to their semanticity. Provided with their reconstruction capability, which implicitly helps in reducing the dimensionality and capturing data semanticity without forcing any kind of bias, autoencoders seem to be a natural choice for clustering "high-semantic & high-dimensional" datasets. However, a common problem in self-encoding oriented clustering (13; 14; 15; 16; 17; 18) is the blue sky problem (19). For instance, a picture of a plane and another of a bird would both contain many blue pixels and few pixels representing the variance. In such a case, the reconstruction would foster and prioritize encoding information that is irrelevant for clustering (e.g., the color of the sky).
To deal with this problem, our proposed model has a dynamic loss function that makes it possible to gradually dispense with the reconstruction loss while improving the discriminative capability and preserving the data topological characteristics. Our experimental results show significant improvement on benchmark datasets when compared to the state-of-the-art autoencoder-based clustering methods in terms of accuracy, normalized mutual information and running time. Unlike rival approaches, DynAE has reasonable dynamic hyperparameters which are updated automatically following the dynamics of the learning system. Automatic update of the parameters avoids cross-validation, which is impractical for purely unsupervised clustering problems.
To the best of our knowledge, we are the first to introduce a deep learning loss function with explicit smooth dynamics. The contributions of this work are: (a) a new deep clustering model with a dynamic loss function that solves the clustering-reconstruction trade-off by gradually and smoothly eliminating the reconstruction objective in favor of a construction one while preserving the space topology; (b) an improved way of obtaining a K-means-friendly latent space compared to DCN (15); and (c) outperforming its state-of-the-art rivals by a large margin.
2 Related work
Existing deep clustering methods broadly stem from two main families. In the first family, embedded learning and clustering are kept separate: the deep embedding transformations push the representation toward a clustering-friendly, blob-like distribution, after which a classical clustering algorithm is applied to that distribution. For example, in (20), the data is first projected into a lower-dimensional space using a stacked autoencoder, and k-means is then run on the embedding. Some other interesting research studies revolve around this two-step strategy (21; 22; 23). In the second family, embedded learning and clustering are performed jointly (simultaneously or alternately); in brief, a clustering loss is explicitly introduced in order to make up for the classification loss of supervised learning. A number of approaches fall into the scope of this group (13; 14; 24; 15; 19; 16; 25; 18).
Some state-of-the-art deep clustering algorithms rely on models whose initial parameters are pretrained on high-semantic datasets like ImageNet (26). Some other approaches freeze the first layers of famous models like VGG (27), Inception (2) or residual networks (28) and build on top of these features (19). Although they do not use any label from the target training set, these approaches strongly rely on features extracted in a supervised way (using labels from other datasets). And since low-semantic features are similar across different datasets, we consider such methods to be outside the pure unsupervised learning realm.
Good results have also been reported by approaches which firmly depend on unrealistic assumptions (e.g., prior knowledge of the size of each cluster) when formulating the loss function (16; 17; 29). Such assumptions may have a considerable influence on the clustering of ambiguous samples. While these methods may perform well in specific scenarios, the solution is not general enough to be applicable to any dataset.
Deep Embedding for Clustering (DEC) (13) projects the data points into a lower-dimensional space using an autoencoder. After that, the decoder is dispensed with and the encoder is trained to jointly improve the embedded representations and the clustering centers. DEC requires a pretraining phase. Furthermore, it offers no guarantee of preserving the space topology after discarding the decoder, which can lead to the generation of random features. In addition, each center's contribution to the loss function is normalized to block large clusters from disproportionately altering the embedded representations compared to small clusters. Therefore, DEC works well only for balanced datasets.
Improved Deep Embedded Clustering (IDEC) (14) is quite similar to DEC. The major contribution of IDEC is that it keeps a reconstruction loss after the pretraining phase, because eliminating the reconstruction during the clustering phase would harm the space topology and prevent the network from preserving the local structures. Nevertheless, maintaining the reconstruction end-to-end is not as beneficial as it seems, due to the natural compromise between clustering and reconstruction. On the one hand, clustering is associated with destroying irrelevant information that does not contribute to the data categorization, namely within-cluster variances and between-cluster similarities. On the other hand, reconstruction is mainly concerned with information preservation.
Deep Clustering Network (DCN) (15) is also an autoencoder-based clustering approach. Similar to DEC and IDEC, DCN is a joint optimization process. First, the autoencoder is pretrained to reduce the dataset dimensionality based on a reconstruction loss function. The ultimate goal of this approach is to obtain a k-means-friendly representation at the end of the training by applying a k-means loss in the latent space along with the vanilla reconstruction. This method requires hard clustering assignments (as opposed to soft clustering assignments based on probabilities). That induces a discrete optimization process which is incongruous with the differentiable nature of gradient descent. Similar to IDEC, DCN suffers from the reconstruction and clustering trade-off.
Deep Embedded Regularized Clustering (DEPICT) (16) makes use of a convolutional autoencoder for learning embedded features and clustering assignments. Similar to DEC, DEPICT has a relative cross-entropy (KL divergence) objective function, plus a regularization term which makes it possible to impose the size of every single cluster beforehand. Such a term excludes solutions that allocate most of the data points to a few specific clusters. We argue that this regularization term can play a key role in making critical decisions when assigning conflicted sub-clusters to their target clusters. However, this prior knowledge (the size of the clusters) assumed by DEPICT is impractical for a pure unsupervised problem.
Variational Deep Embedding (VaDE) (25) is a generative deep clustering method inspired by the Variational Autoencoder (VAE). It allows both clustering and data generation. The data distribution is encoded as a GMM in the embedded space, and the corresponding sampled embeddings are decoded using the reparameterization trick in order to allow optimization based on Stochastic Gradient Variational Bayes (SGVB). VaDE relies on variational inference. Thus, the information loss induced by the mean-field approximation can lead to unreliable latent representations, which in turn deteriorate the discriminative abilities of the associated embedded space.
For all the progress made in deep clustering, modern approaches are only passably successful when dealing with relatively low-semantic datasets like MNIST (30), USPS (31), COIL (32), etc. Moreover, all state-of-the-art strategies perform very poorly when it comes to clustering high-semantic datasets like CIFAR (33), STL (34), ImageNet (26), etc.
3 Dynamic autoencoder
All autoencoder-based clustering approaches rely on the same principle: representation learning based on a reconstruction loss function and clustering based on the learned representation. Basically, the joint optimization process is described as follows:

$$L = L_r + \gamma L_c,$$

where $L_r$ is the reconstruction loss function and $L_c$ is the clustering loss. $\gamma$ is a hyperparameter used to balance the two costs. It has been shown empirically (14) that it is better to keep $\gamma$ small in order to avoid having the clustering cost corrupt the latent space. The common network architecture of the autoencoder-based clustering approaches is illustrated in Figure 2.
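As a rough illustration of this joint objective, the following NumPy sketch computes the combined cost $L = L_r + \gamma L_c$; the function and parameter names (`joint_loss`, `gamma`) are ours for illustration, not taken from any published implementation.

```python
import numpy as np

def joint_loss(x, x_hat, z, centroids, assignments, gamma=0.1):
    """Illustrative sketch of the joint autoencoder-clustering objective
    L = L_r + gamma * L_c. All names here are hypothetical."""
    # Reconstruction loss: mean squared error between input and output.
    l_r = np.mean((x - x_hat) ** 2)
    # Clustering loss: distance between embeddings and their assigned centroids.
    l_c = np.mean(np.sum((z - centroids[assignments]) ** 2, axis=1))
    return l_r + gamma * l_c
```

Keeping `gamma` small, as noted above, limits how strongly the clustering term can distort the latent space.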
In the context of a multi-objective cost function, a neural network is susceptible to the feature drift phenomenon. Feature drift is associated with a significant deterioration in the global optimization performance. It takes place when the different objective functions compete improperly or unfairly against each other. A simple illustration of the feature drift phenomenon can be found in Figure 2. In this illustration, an object is pulled by two forces whose combination is adjusted by a balancing coefficient. If the two forces are not balanced correctly, the object will not reach its target even after a huge number of iterations. The competition can be very harmful if it is not considered thoroughly. It could slow down convergence, cause divergence or lead to an inadequate solution. Therefore, it is better to avoid any implicit competition within the same network if it is not required. This analogy was selected for the sake of simplicity of visualization. In our case, it is more difficult to display a real illustration, since the gradient vectors are computed in a high-dimensional space.
A deep data representation can be generative, discriminative or both at the same time. A generative representation is an encoded representation in a latent space from which we can recover, or nearly recover, the initial data distribution. For example, the bottleneck representation of an autoencoder can be considered a generative representation. In contrast, a discriminative representation is an encoded representation in a latent space where the principal factors of variation and similarity are stressed. Generally, a good discriminative representation captures the high-level similarities (essentially class similarities).
Some recent deep clustering studies have given up the autoencoder-based clustering strategy in favor of GAN and CNN architectures (24; 19; 35; 36; 29; 37). The decline of the autoencoder-based clustering trend can probably be explained by the fact that the clustering and reconstruction losses are not naturally adapted to cooperate with each other. The implicit competition between them not only slows down the convergence of the optimization process but also hinders the generation of better discriminative features. On the one hand, the clustering loss aims at preserving the between-cluster variances and the within-cluster similarities while destroying the within-cluster variances and the between-cluster similarities. On the other hand, the reconstruction loss aims at preserving all similarities and all variances (within-cluster similarities, within-cluster variances, between-cluster similarities, between-cluster variances). Therefore, any progress made by the clustering loss on the bottleneck representation can easily be drifted away by the optimization of the reconstruction cost function. Getting rid of the reconstruction completely during the clustering phase, as in (13), would dramatically corrupt the clustering space and render the deep representations meaningless by allowing the generation of random discriminative features. In order to deal with this feature drift problem, we propose a new dynamic loss function (a loss function which changes from one iteration to another) that gradually mitigates the reconstruction objective and converges to a new one that preserves the space topological characteristics.
Consider the problem of clustering a dataset $X = \{x_1, \ldots, x_N\}$ of $N$ data points in $\mathbb{R}^d$. This dataset can be grouped into $K$ clusters. Let $f_W$ and $g_{W'}$ stand for the encoder and decoder mappings, respectively, where $W$ and $W'$ represent their learnable parameters. Let $z_i = f_W(x_i)$ be the latent representation of the data point $x_i$, $\hat{x}_i = g_{W'}(z_i)$ be the reconstructed representation of $x_i$, and $z_{ij}^{\alpha}$ be a latent interpolation of $z_i$ and $z_j$. Similar to all autoencoder-based clustering methods, our DynAE has a pretraining phase. In our case, the network's weights are initialized based on the reconstruction loss regularized by data augmentation (e.g., small random shifting and small random rotation) and an adversarially constrained interpolation (38).
To keep the notation simple, we consider that $x_i$ represents the data after performing the random transformations (translation and rotation). Let $d$ be the critic network. While the critic is trained to regress the interpolation coefficient $\alpha$ from the interpolated points in (2), the main network is trained to fool the critic into considering the interpolated points realistic in (3). The second term in (2) enforces the critic to output 0 for non-interpolated inputs. At each iteration, the pairs of samples and the coefficients $\alpha$ are randomly generated. This regularization technique aims at making the interpolants look realistic, which allows for some form of continuity in the latent space, as shown in (38).
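The latent interpolation at the heart of this regularization can be sketched as a simple convex combination of two latent codes; `interpolate_latent` is an illustrative name, and the critic's training loop is omitted here.

```python
import numpy as np

def interpolate_latent(z1, z2, alpha):
    """Convex combination of two latent codes, as used in adversarially
    constrained interpolation (ACAI-style) regularization. Illustrative sketch."""
    return alpha * z1 + (1.0 - alpha) * z2
```

In this scheme, the critic receives the decoded interpolant and is trained to recover `alpha`, while the autoencoder is trained to make that recovery impossible.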
3.2.1 From reconstruction towards centroids construction
Our dynamic loss function has two parts. The first one is designed to make up for the reconstruction and can be formulated as follows:
Where $C = \{c_1, \ldots, c_K\}$ is the set of centroids and $K$ is the number of centroids; $t$ stands for the iteration index; $q_{ij}$ is the probability of assigning an embedded point $z_i$ to the centroid $c_j$, computed, similar to DEC, based on the Student's t-distribution; and $\nu$ is a coefficient related to the Student's t-distribution. The dynamic threshold sequence tracks the maximal value of the assignment probabilities over the iterations, and the two confidence hyperparameters belong to the interval $[0, 1]$.
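A minimal NumPy sketch of the DEC-style soft assignments based on the Student's t-distribution follows; the names (`soft_assignments`, `nu`) and the default of one degree of freedom are illustrative assumptions.

```python
import numpy as np

def soft_assignments(z, centroids, nu=1.0):
    """Soft cluster assignments q_ij based on the Student's t-distribution,
    as in DEC. Illustrative sketch; names are assumptions."""
    # Squared Euclidean distance between each embedding and each centroid.
    d2 = np.sum((z[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    # Student's t kernel with nu degrees of freedom.
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    # Normalize over centroids so each row sums to 1.
    return q / q.sum(axis=1, keepdims=True)
```

Each row of the output is a probability distribution over the $K$ centroids, which is exactly what the confidence conditions below operate on.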
Here, the embedded clustering assignment function takes a data point in the initial space as input and outputs the associated centroid in the latent space. Apart from k-means, there are multiple popular strategies for computing the centroid coordinates, such as k-medoids (39) and CLARANS (40). In this work, for the sake of simplicity, and since the goal is simply to show empirically that the new dynamic loss function is a promising substitute for the classical reconstruction, k-means is selected for updating the centroids.
The intuition behind this idea is to make the autoencoder output, for every data sample, the image of its associated centroid. The decoder is used to generate these images, which are then used as a supervisory signal. However, some data points have ambiguous clustering assignments (the probabilities of their highest assignments are very close to each other); we call them conflicted data points, given the uncertainty about their real associated centers. These conflicted data points can be identified based on the condition in (7). Samples with high-confidence clustering assignments are selected for centroid construction. As for the conflicted data points, the reconstruction cost is preserved until we are more confident about their associated centers. So, depending on the data sample, there are two possible training schemes, reconstruction or centroid construction, as stated by (5).
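The selection of conflicted versus confident samples can be sketched as follows, assuming two hypothetical thresholds `tau` (minimal confidence) and `delta` (minimal gap between the two highest assignment probabilities); both names are ours, not the paper's notation.

```python
import numpy as np

def split_conflicted(q, tau=0.8, delta=0.2):
    """Split samples into unconflicted and conflicted sets based on the soft
    assignments q. tau and delta are hypothetical threshold names."""
    top2 = np.sort(q, axis=1)[:, -2:]        # two highest probabilities per point
    confident = (top2[:, 1] >= tau) & (top2[:, 1] - top2[:, 0] >= delta)
    return np.where(confident)[0], np.where(~confident)[0]
```

Unconflicted indices are trained with centroid construction; conflicted ones keep the reconstruction objective.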
After pretraining the network using the reconstruction loss function, as in all autoencoder-based clustering strategies, the deep embedding becomes a generative and more discriminative, low-dimensional representation of the initial data distribution (22). Although the reconstruction loss nearly preserves all the factors of similarity and variation, its associated discriminative ability quickly hits a plateau, since discrimination is not explicitly reflected by the loss function during the pretraining phase. Afterwards, the embedded points can be extracted and used to compute the initial embedded centroids, which in turn are used to compute the loss function (4) based on (5), (6), (7), (8), (9), (10) and (11).
The centroid images, which are used as an implicit supervisory signal, are generated and do not represent real data points. Therefore, the first nearest neighbor (1NN) of each embedded center is selected as a more stable substitute for the decoder's generated images. It is possible that the regularization with the adversarially constrained interpolation performed during the pretraining phase contributes to generating stable centroids by introducing some sort of continuity in the latent space.
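The 1NN substitution can be sketched as follows; `nearest_real_centroids` is an illustrative name for the operation described above.

```python
import numpy as np

def nearest_real_centroids(z, centroids):
    """Replace each embedded centroid by its first nearest neighbor (1NN)
    among the real embedded points, as a more stable construction target."""
    # Distance from every centroid to every embedded data point.
    d2 = np.sum((centroids[:, None, :] - z[None, :, :]) ** 2, axis=2)
    # Index of the closest real point for each centroid.
    return z[np.argmin(d2, axis=1)]
```

Decoding these real embedded points, rather than the raw centroid coordinates, yields construction targets that always lie on the data manifold.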
DynAE has mainly two hyperparameters. The first stands for the minimal confidence threshold under which a data point is considered conflicted. The second is the minimal difference between the highest and the second-highest probabilities of the assigned centroids. Hyperparameters are known to be dataset-specific. Besides, it is vital to make unsupervised learning approaches less sensitive to the choice of unpredictable hyperparameters, because for real data applications, supervised cross-validation is impractical. Therefore, our hyperparameters are computed based on mathematical formulations. As shown in equation (11), both depend on the confidence threshold, which is updated in a way that captures the dynamics of the number of conflicted data points.
The dynamics of the loss function is described by the coefficient that specifies the amount of reconstruction in (4).
At the beginning of the training process, the hyperparameters are selected so as to start with a high number of conflicted data points. At this stage, most of the training set is devoted to reconstruction and the rest is used for centroid construction. Starting with higher hyperparameter values would allow for more reconstruction and less centroid construction, which would only slow down the convergence.
During the training process, the number of conflicted data points is supposed to decrease gradually based on the knowledge acquired through constructing the centroids of the unconflicted data points. The loss function is considered to have reached stability (to have become static) when the number of conflicted data points no longer decreases. At this level, the dynamic coefficient remains constant.
Local convergence and centroids update
Local convergence is characterized by the stability of the loss function. In order to escape local convergence, there are two options: the first is to decrease the confidence hyperparameters, and the second consists of updating the centroids. In this work, we opted for both solutions to avoid local stability. The hyperparameters are dropped by a fixed dropping rate.
At the end of the training process, almost no reconstruction remains. Thus, for every sample, the network constructs its associated centroid. The output images of the network are smoother and easier to recognize as illustrated in Figure 3.
Unlike DEC, the proposed method avoids any space corruption since the network is constructing points which are strictly related to the data distribution. So, there is no possibility of building random discriminative features.
3.2.2 Embedded clustering
The data points of each cluster are generally spread near non-linear manifolds. However, typical centroid-based clustering algorithms (e.g., K-means) are only suitable for blob-like distributions. To deal with this problem, a clustering loss function is proposed to make the target representation amenable to centroid-based clustering. In other words, the clustering cost function should push the bottleneck distribution to be K-means-friendly by penalizing the distance between the data points and their associated centroids. This strategy was first proposed in DCN (15). However, DCN has hard clustering assignments and does not consider the dynamics of the data points, since every sample is pulled by its associated centroid regardless of the uncertainty of its assignment to another center. In this work, we extend the mentioned approach to better handle the conflicted data points issue. Similar to the previous loss function (4), we propose a clustering cost function (13), controlled by the same confidence hyperparameters.
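A soft version of such a K-means-friendly penalty might look as follows; this is a sketch of the general idea, not necessarily the paper's exact formulation in (13), and all names are illustrative.

```python
import numpy as np

def embedded_clustering_loss(z, centroids, q):
    """Soft K-means-friendly clustering loss: each embedding is pulled toward
    the centroids in proportion to its soft assignment q, instead of DCN's
    hard assignments. Illustrative sketch."""
    # Squared distances to every centroid, weighted by the soft assignments.
    d2 = np.sum((z[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.mean(np.sum(q * d2, axis=1))
```

With one-hot rows in `q` this reduces to the hard DCN penalty; with soft rows, ambiguous samples are pulled by several centroids at once, which is the behavior discussed above.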
3.2.3 Joint optimization
The total objective is defined as
Similar to the pretraining phase, the loss function L is regularized by data augmentation. Therefore, we consider x to be the transformed data after applying random shifting and rotation. In related works (14; 15; 18; 17), an unpredictable, hard-to-tune (it requires labels) and dataset-specific hyperparameter is required to balance the two cost functions, in order to keep the clustering loss from corrupting the latent space and to keep the reconstruction loss from drifting the learned discriminative features. In this work, this hyperparameter is not required, since the main objective of DynAE is to get rid of the implicit competition between the two objectives. Empirical results on different datasets prove that the hyperparameter used by all the other autoencoder clustering approaches is not sufficient to solve the mentioned problem.
The loss function (15) is optimized using mini-batch stochastic gradient descent (SGD) and backpropagation to solve for the autoencoder's weights. We run the optimization process for a fixed number of batch iterations or until the number of conflicted data points falls below a specified fraction of the total number of data points.
To better understand the proposed approach, we draw an analogy with signal processing. Extracting discriminative features can be considered similar to a filtering operation. First, the initial data is expressed in another space; some irrelevant information is destroyed and the remaining pertinent patterns are emphasized. The Fourier transform is a bijective function. Therefore, it can be considered a generative transformation (similar to the reconstruction in our case). The new generative representation can be expressed in a higher-dimensional space (e.g., the wavelet transform) or a lower-dimensional one. Then, based on the new generative representation, a more discriminative one can be extracted by simply filtering out the unrequired components of the Fourier transform (in our case, destroying some unwanted pieces of information which are impeding the network from selecting the relevant ones). In the proposed training mode, the centroids are constructed using the bottleneck features extracted from the data samples. And because every centroid is different from its associated samples, some unrequired features need to be destroyed and some relevant ones should be preserved. Based on the proposed training strategy, this feature selection is done automatically even though we are not using any labels. For example, a seven can be written in different fashions, but all of them will construct the same centroid. Hence, the network will learn to filter out all the unwanted information. The simple analogy described above is summarized in Table 1.
|Signal processing transformation|Autoencoder clustering strategy|Effect|Output representation|
|---|---|---|---|
|Fourier transform|Reconstruction|The target patterns are more pronounced in the new space.|Generative and more discriminative.|
|Filtering|Centroids construction and embedded clustering|The unrequired patterns are destroyed.|Non-generative and more discriminative.|
The algorithm of our method is summarized in Algorithm 1. Without considering the initialization phase, the computational complexity of DynAE is $O(T L D^2)$, where $T$ is the number of training iterations, $L$ is the number of layers and $D$ is the maximal number of neurons of the hidden layers. It is worth mentioning that DEC, IDEC, DCN, DEPICT and DynAE all have the same computational complexity. Therefore, for the same network architecture, batch size and pretraining settings, the execution times of these five approaches can be strictly compared based on the number of iterations required for convergence.
MNIST-full (30): a dataset that consists of 70,000 28×28 grayscale images of handwritten digits. Each sample was normalized and flattened to a 784-dimensional vector.
MNIST-test: a test subset of the MNIST-full dataset with 10,000 data samples.
USPS (41): a dataset of 9,298 16×16 grayscale digit images. Each sample was flattened to a 256-dimensional vector.
Fashion-MNIST (42): a dataset of 70,000 28×28 grayscale images. It is composed of 10 classes.
A summary of the dataset statistics is shown in Table 2.
|Dataset||# points||# classes||Dimension||% of largest class|
In order to improve the generalization capability of our model, regularization using data augmentation is performed through stochastic affine distortions. Random transformations are applied to every single data point before it is fed to the training model. The following transformations are used in our experiments. Our implementation of the real-time data augmentation is based on Keras (43).
- Random shift along the images' width by a randomly drawn fraction of the width.
- Random shift along the images' height by a randomly drawn fraction of the height.
- Random rotation by a randomly drawn angle.
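The shift augmentation can be illustrated with a toy NumPy sketch; the fraction `max_frac` is an assumed placeholder (not the value used in the experiments), and the Keras-based real-time pipeline used in practice differs.

```python
import numpy as np

def random_shift(img, max_frac=0.1, rng=None):
    """Toy sketch of the random-shift augmentation on a 2-D grayscale image.
    Pixels shifted out of the frame are dropped; vacated pixels become zero."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape
    dy = int(rng.integers(-int(h * max_frac), int(h * max_frac) + 1))
    dx = int(rng.integers(-int(w * max_frac), int(w * max_frac) + 1))
    out = np.zeros_like(img)
    # Destination and source windows for the shifted content.
    ydst = slice(max(dy, 0), h + min(dy, 0))
    xdst = slice(max(dx, 0), w + min(dx, 0))
    ysrc = slice(max(-dy, 0), h + min(-dy, 0))
    xsrc = slice(max(-dx, 0), w + min(-dx, 0))
    out[ydst, xdst] = img[ysrc, xsrc]
    return out
```

A rotation augmentation would follow the same pattern with an interpolated affine warp instead of an integer translation.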
4.2 Evaluation Metrics
NMI and ACC each have pros and cons, and they are by far the most widely used evaluation standards in the deep clustering literature; employing both of them is sufficient to validate the effectiveness of a clustering algorithm.
Where $y$ is the ground-truth label vector and $c$ is the cluster index vector; $I$ denotes the mutual information function and $H$ denotes the entropy function; $m$ is the best one-to-one mapping function that matches clusters to the ground-truth labels. The Hungarian algorithm (46) is used to find this mapping.
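For a small number of clusters, ACC can be sketched with a brute-force search over one-to-one mappings in place of the Hungarian algorithm; this sketch assumes, for simplicity, that cluster indices and labels share the same value set, and the function name is illustrative.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy (ACC): accuracy under the best
    one-to-one mapping between cluster indices and labels. Brute-force
    version for small K; illustrative sketch only."""
    labels = np.unique(y_true)
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))          # cluster index -> label
        acc = np.mean([mapping[c] == t for c, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return best
```

The Hungarian algorithm computes the same optimum in polynomial time, which matters once $K$ grows beyond a handful of clusters.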
Since cross-validation is not allowed, we avoid dataset-specific settings. Therefore, we use the same network architecture for all experiments. Following DEC and IDEC, the encoder and decoder are fully connected neural networks. The autoencoder has 8 layers with dimensions d - 500 - 500 - 2000 - 10 - 2000 - 500 - 500 - d. Except for the bottleneck layer and the last layer, all the other layers use ReLU (47). During the pretraining stage, the autoencoder is trained adversarially end-to-end, in competition with a critic network, for a fixed number of iterations. Parameters are updated using the Adam (48) optimizer with default values for its hyperparameters (here we are referring to the optimizer hyperparameters, not the ones of our model). The remaining pretraining hyperparameters are set according to their original papers. During the training phase, the autoencoder is trained until meeting the convergence criterion, with a fixed convergence threshold, or reaching a maximal number of iterations. The confidence threshold and its dropping rate are set to fixed values. The training parameters are updated using the SGD optimizer with a fixed learning rate and momentum. Both pretraining and training are executed in mini-batches. DynAE was implemented in Python using TensorFlow (49).
As we can see in Table 3, DynAE outperforms all the other methods by a large margin. The superiority of our approach over its rivals demonstrates empirically that the competition between the reconstruction and the clustering cost functions hinders the discriminative capabilities of the embedded space. It also shows that DynAE is an efficient solution to this trade-off issue. The discriminative ability of DynAE is further illustrated by Figure 4, where we can see that each cluster is well separated from the others. While some of the other methods show a significant decrease in NMI and ACC when dealing with a small subset of MNIST (i.e., MNIST-test) compared to the full dataset (i.e., MNIST-full), the performance of our approach remains nearly unaffected by scaling down the dataset size. In addition, the largest improvement in ACC and NMI is observed for USPS, which is an unbalanced dataset compared to MNIST. A possible explanation for this glaring disparity is the fact that DEC, IDEC and DEPICT are known to be unsuitable for unbalanced datasets. Furthermore, we stress that, unlike (13; 14; 16), our approach does not assume any additional prior knowledge (e.g., the size of the clusters), and it does not have unpredictable, dataset-specific, hard-to-tune hyperparameters (14; 15; 18; 17).
In order to evaluate the running time of our algorithm, we compare it with DEC and IDEC. The first is known to suffer from random discriminative features, which can easily mislead the clustering procedure, and the second suffers from the feature drift phenomenon according to our hypothesis. For fairness of comparison, we use the same autoencoder architecture, the same initial weights, the same batch size, the same optimizer, the same learning rate and the same dataset. In this experiment, we made sure that the only difference between the three models is their respective loss functions and that all of them have the same starting point (the same initial accuracy). We trained each model until it surpassed a fixed clustering accuracy threshold. As we can see in Figure 5 (a), the clustering accuracy of DEC and DynAE makes a smooth jump in a few iterations. Both of them require almost the same number of iterations to reach the threshold. However, IDEC requires far more iterations to converge to the same threshold. In Figure 5 (b), which is a zoom on the IDEC training curve, we can observe intensive fluctuations. These fluctuations and the glaring slowdown in the convergence of IDEC can be explained by the competition between the reconstruction and clustering objectives, which causes what we called the feature drift.
In Figure 6, we plot the evolution of the number of conflicted data points (respectively, the number of unconflicted ones) against the number of iterations. This experiment was performed on the MNIST dataset. As we can see, the number of conflicted data points decreases smoothly while the model acquires knowledge by constructing the centroids from the unconflicted data points. Close to iterations, the dropping rate of the number of conflicted data points starts slowing down. At this stage, the model has reached local stability. In order to escape this local solution, the centroids are updated automatically. This enforced update causes an abrupt decrease in the number of conflicted data points, as we can see inside the red circle.
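To illustrate the conflicted/unconflicted split, the sketch below implements one plausible criterion: a point is unconflicted when its nearest centroid is markedly closer than the runner-up, so its assignment is unambiguous. This is our reading for illustration only; DynAE's exact rule may differ, and the `margin` threshold is a hypothetical parameter:

```python
# Hedged illustration (not necessarily DynAE's exact rule): label a point
# "unconflicted" when the relative gap between its nearest and
# second-nearest centroid distances exceeds a margin, else "conflicted".
import numpy as np

def split_conflicted(z, centroids, margin=0.2):
    d = np.sqrt(((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1))
    d_sorted = np.sort(d, axis=1)
    # Relative gap between nearest and second-nearest centroid.
    gap = (d_sorted[:, 1] - d_sorted[:, 0]) / (d_sorted[:, 1] + 1e-12)
    unconflicted = gap >= margin
    return unconflicted, ~unconflicted

# Two points sit on the centroids, one sits near the midpoint between them.
z = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, 2.4]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
unconf, conf = split_conflicted(z, mu)
print(unconf)  # points on a centroid are unconflicted, the midpoint is not
```

Under such a criterion, centroids estimated only from unconflicted points are shielded from ambiguous samples, which matches the smooth decrease of conflicted points observed in Figure 6.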
In this paper, we have introduced the Dynamic Autoencoder as the first deep clustering model to integrate smooth dynamics in its loss function. Our proposition consists of gradually making up for the reconstruction objective without generating random discriminative features. Empirical studies, in which we provide comparative results against state-of-the-art approaches on four benchmark datasets, show the superiority of our approach; our model comprehensively outperforms the others in all aspects. We believe that the simple but intuitive formulation of DynAE has further potential: with carefully constructed alternative formulations that follow the same philosophy of achieving smooth transitions, DynAE can become a source of inspiration for more dynamic deep learning systems.
All thanks to Almighty God. The first author was financially supported by his parents Fethi Mrabah and Nada Mechmech.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  Oren Rippel, Manohar Paluri, Piotr Dollar, and Lubomir Bourdev. Metric learning with adaptive density discrimination. arXiv preprint arXiv:1511.05939, 2015.
-  James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
-  Derya Birant and Alp Kut. St-dbscan: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering, 60(1):208–221, 2007.
-  Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.
-  Alexander Hinneburg and Daniel A Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proceedings of the 25th International Conference on Very Large Databases, pages 506–517, 1999.
-  Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
-  Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 833–840. Omnipress, 2011.
-  Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016.
-  Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1753–1759, 2017.
-  Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. arXiv preprint arXiv:1610.04794, 2016.
-  Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 5747–5756. IEEE, 2017.
-  Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deep discriminative latent space for clustering. arXiv preprint arXiv:1805.10795, 2018.
-  Sohil Atul Shah and Vladlen Koltun. Deep continuous clustering. arXiv preprint arXiv:1803.01449, 2018.
-  Philip Haeusser, Johannes Plapp, Vladimir Golkov, Elie Aljalbout, and Daniel Cremers. Associative deep clustering: Training a classification network with no labels. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2018.
-  Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
-  Gang Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.
-  Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
-  Peihao Huang, Yan Huang, Wei Wang, and Liang Wang. Deep embedding network for clustering. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 1532–1537. IEEE, 2014.
-  Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
-  Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720, 2017.
-  Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
-  Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, 1996.
-  Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. arXiv preprint arXiv:1807.05520, 2018.
-  Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5879–5887, 2017.
-  Chih-Chung Hsu and Chia-Wen Lin. Cnn-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
-  David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543, 2018.
-  Leonard Kaufman and Peter Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
-  Raymond T. Ng and Jiawei Han. Clarans: A method for clustering objects for spatial data mining. IEEE transactions on knowledge and data engineering, 14(5):1003–1016, 2002.
-  Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  François Chollet. Keras. GitHub repository. https://github.com/fchollet/keras, 2015. Accessed 2017.
-  Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
-  Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617, 2002.
-  Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
-  Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.