Decorrelating Adversarial Nets for Clustering Mobile Network Data

03/11/2021 ∙ by Marton Kajo, et al. ∙ Technische Universität München

Deep learning will play a crucial role in enabling cognitive automation for the mobile networks of the future. Deep clustering, a subset of deep learning, could be a valuable tool for many network automation use-cases. Unfortunately, most state-of-the-art clustering algorithms target image datasets, which makes them hard to apply to mobile network data due to their highly tuned nature and related assumptions about the data. In this paper, we propose a new algorithm, DANCE (Decorrelating Adversarial Nets for Clustering-friendly Encoding), intended to be a reliable deep clustering method which also performs well when applied to network automation use-cases. DANCE uses a reconstructive clustering approach, separating clustering-relevant from clustering-irrelevant features in a latent representation. This separation removes unnecessary information from the clustering, increasing consistency and peak performance. We comprehensively evaluate DANCE and other select state-of-the-art deep clustering algorithms, and show that DANCE outperforms these algorithms by a significant margin on a mobile network dataset.


I Introduction

Deep learning algorithms are the basis for many of today's technological advancements, and will continue to enable future use-cases in all technological fields. Naturally, mobile network automation follows this trend, where future Cognitive Autonomous Networks (CANs) are expected to utilize deep learning in many network management tasks in order to make the network robust, reliable and adaptive [can].

The majority of deep learning research targets supervised learning, such as classification, where the deep learning algorithms learn to output the ground truth that is explicitly defined in the training data in the form of labels or values, usually collected through crowdsourcing or data mining. However, these label generation processes are not available for mobile network automation, as network management tasks require expert knowledge, thus only a handful of people are capable of undertaking them. Requiring these experts to manually generate examples to be able to train deep learning algorithms in a supervised way is not feasible. This paper focuses on deep clustering, a form of unsupervised learning where the ground truth is not known, and the algorithm is not supported with the correct answers during training. In clustering, the task is to assign observations (data points) to groups (or clusters) based on their characteristics, so that similar observations end up in the same cluster.

In the CAN concept, an important element called Environmental Modeling and Abstraction (EMA) uses clustering to define network states: discrete descriptions of the settings and performance of the network, or parts thereof, which are then used as the basis for decision making by the different CAN components [ema]. The correct designation of these EMA network states is critical in this setting, as optimization, self-healing and load balancing decisions are mapped directly to the EMA states. If an EMA state incorporates dissimilar behavior, these control functions will not trigger at the correct place or at the correct time, making the network less reliable and less optimized. However, network states cannot be defined explicitly, as a network's behavior changes depending on a myriad of internal and external factors, such as time, location, components used in the network, user behavior, etc. Consequently, clustering is the only option to automatically define the EMA network states.

Apart from future management use-cases such as EMA, clustering is already used in a variety of network and service management use-cases, where deep clustering could improve current functionality. Some examples of these are:

  • Slice provisioning (instantiation) may utilize well-defined usage types through clustering, in order to select appropriate templates for the slices based on predicted requirements [slice].

  • Quality of Experience (QoE) estimation tries to map explicitly measurable Key Performance Indicators, such as network delay, jitter or throughput, to user satisfaction levels [qoe]. Here, clustering may be utilized to establish user archetypes.

  • Network anomaly detection may utilize generative clustering to map normal usage patterns, and detect outliers which point to anomalous events in the network [anomaly].

Fig. 1: Network automation use-cases involving behavior clustering.

These tasks all involve clustering the behavior of users, applications or services, by finding implicit patterns in the collected data (Fig. 1). Feature engineering and the manual definition of clustering rules is quite hopeless, as the data generally collected for these tasks does not contain the necessary information explicitly. Well-intended targeted data collection, such as deep packet inspection (snooping), is also increasingly difficult, as more and more of the communication is encrypted or governed by privacy laws (and rightly so). Thus, deep clustering algorithms that simultaneously extract implicit patterns and form groups using these patterns are a perfect match for these tasks.

Deep clustering has seen a surge in attention recently, with huge performance improvements being published every few months. Some algorithms are now quite close to supervised classification performance, a feat that was unimaginable even a few years ago. However, as we show in this paper, applying these cutting-edge algorithms to mobile network data is not straightforward. Most deep clustering algorithms are developed for image datasets, and are able to achieve great performance because of inherent assumptions and optimizations that are specific to image data. These biases often don't translate well to mobile networks, where the performance of the algorithms degrades, in some cases significantly. We will elaborate on the challenges faced in applying these new algorithms to network automation, and on how they can be overcome.

In this paper:

  • We discuss the design philosophy and examine the performance of state-of-the-art deep clustering algorithms (Sec. II), highlighting their strengths and weaknesses.

  • We propose our own deep clustering algorithm (Sec. III), which aims to mitigate some of the weaknesses of state-of-the-art algorithms. Furthermore, our algorithm is not biased towards any application domain, making it a preferable choice for clustering mobile network data.

  • We evaluate our algorithm extensively by comparing its performance against state-of-the-art deep clustering algorithms on both image and mobile network datasets (Sec. IV), as well as through an ablation study to examine the performance benefits of the different components (Sec. V).

Our paper discusses two types of networks: neural networks and mobile networks. To avoid confusion, we will refer to neural networks as "nets", while mobile networks will be referred to as "networks".

II State-of-the-Art in Deep Clustering

For the purpose of this discussion, we can split deep clustering methods into two categories: generative and discriminative. Generative methods try to describe the data in a clustering-friendly representation. While learning, this representation is used both to define clusters and to be able to (re)construct data points in the original data space. Discriminative methods immediately try to divide the data into groups, without learning to recreate or generate data points in the process.

Reconstructive methods, a subset of generative methods, learn to encode data into a simplified latent representation, from which the original data points can be decoded (reconstructed) effectively. For this purpose, most of the reconstructive clustering methods utilize autoencoder neural nets, made up of an encoder and a decoder sub-net (Fig. 2a). By learning to distill information into a constrained latent space, autoencoders compress and reduce noise, formulating a high-level latent representation of the data, which contains only the most descriptive, meaningful features. Reconstructive clustering methods use these latent features on the assumption that they are also the best descriptors of clusters in the data.

(a) Reconstructive
(b) Discriminative
Fig. 2: The basic setup of the two main deep clustering approaches.

One of the first examples of reconstructive deep clustering algorithms is Deep Embedded Clustering (DEC) [dec]. In DEC, a stacked autoencoder is pre-trained, after which cluster centroids in the latent space are jointly optimized with the encoder, in order to best fit the encoded points to a predefined distribution around the centroids. This optimization causes the encoded points to tightly group around the cluster centroids, which has the effect of refining the clusters and increasing the nearest-neighbor assignment’s accuracy. We will discuss this algorithm in more detail in Section III-C, as our proposed method is partly inspired by DEC. Other similar methods, where the encoding is jointly optimized with the internal clustering are Variational Deep Embedding (VaDE) [vade] and DEeP embedded regularIzed ClusTering (DEPICT) [depict].

Adversarially Constrained Autoencoder Interpolation (ACAI) [acai] represents a different reconstructive approach, where the clustering does not influence the encoding. Instead, a separate mechanism is used to optimize the encoded representation for the later clustering step. ACAI adopts an adversarial net, as commonly found in Generative Adversarial Nets [gan], to create believable data points when interpolating between encoded points in the latent space. Although not specifically meant for clustering, this regularization through believable interpolation leads to a highly clustering-friendly latent representation, where traditional clustering algorithms such as k-Means [kmeans] perform particularly well.

Purely generative methods are not as prevalent as reconstructive methods. One generative example is Latent Space Clustering in Generative Adversarial Networks (ClusterGAN) [clustergan], where a GAN generator is used to synthesize believable data points from a mixture of categorical and continuous latent points. Apart from the usual GAN setup of the generator (decoder) and adversarial nets, ClusterGAN also implements an encoder, effectively realizing an inside-out autoencoder. Because purely generative methods are rare, we will sometimes refer to reconstructive methods as generative in this paper.

Reconstructive methods assume that latent features learned through reconstruction are useful for clustering. Unfortunately, this assumption does not hold for data in which clustering-irrelevant information (e.g., small details, or information from other entities) outweighs clustering-relevant information. The best example of this can be seen in photographic datasets, where generative clustering algorithms often produce an effect labeled the blue sky problem: planes, birds and other flying objects are all assigned to the same category, because the largest area of the image is taken up not by the object itself, but by the sky in the background. For the reconstruction of these images, the autoencoder pays more attention to correctly encoding the sky, while losing sight of the clustering-relevant information about the objects in the latent representation.

In another light, learning to decode (reconstruct) data serves as a regularizer in the formulation of the high-level latent representation. Discriminative methods do away with this generative regularization, and replace it with their own specific regularization terms. This approach can have two benefits: the high-level representation can disregard information which is only useful for reconstruction, and the method can output cluster assignments directly, without the need for an additional step. Because of these advantages, discriminative clustering methods generally achieve higher accuracy while being more consistent on image datasets compared to their generative counterparts. The change in approach is also visible in the neural net topologies of these methods, usually consisting of a single sub-net, which effectively only implements the encoder half of an autoencoder (Fig. 2b).

Information Maximizing Self-Augmented Training (IMSAT) [imsat] was one of the first discriminative deep clustering methods to be published. IMSAT builds on Regularized Information Maximization (RIM) [rim], a (shallow) discriminative clustering approach which uses mutual information as a metric to develop cluster boundaries. In IMSAT, the added Self-Augmented Training (SAT) procedure regularizes the encoding developed by RIM so that the method can utilize deeper neural nets as encoders without easily falling into degenerate models, thus arriving at better clustering accuracy on complex datasets. In a sense, SAT replaces the generative regularization. Another early discriminative method is Deep Adaptive image Clustering (DAC) [dac].

The recently published Deep image Clustering with Category-Style representation (DCCS) [dccs] method is especially interesting to our discussion. In DCCS, the latent space is split into two feature groups, which the authors call category and style features. While all the latent features are used for information maximization, only the category features are used for clustering, which allows the method to further disregard irrelevant information, achieving even purer clusters and thus higher clustering accuracy. The regularization in DCCS is done through a combination of adversarial nets and data augmentation in the form of randomized image transformations, such as cropping, aspect-ratio changes, hue and brightness changes, and the occasional horizontal flipping of the image. Other noteworthy discriminative deep clustering methods which use adversarial nets or data augmentation as regularization are Associative Deep Clustering (ADC) [adc] and Invariant Information Clustering (IIC) [iic].

Lastly, an often cited deep clustering method that does not fit into our categorization is Joint Unsupervised LEarning (JULE) [jule]. JULE is an agglomerative clustering algorithm, which creates clusters by merging individual observations, and then smaller clusters, into ever bigger clusters. Its approach is fundamentally different from the previously discussed methods, which all, at some point, divide the space or the observations into clusters.

Most of the above-mentioned methods are developed for, and evaluated/benchmarked on, image datasets. Commonly used image datasets for the evaluation of these algorithms are the MNIST dataset (http://yann.lecun.com/exdb/mnist/), containing 28x28 pixel greyscale images of handwritten digits, and the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), containing 32x32 pixel color photos of 10 object categories (airplanes, cars, etc.). The clustering methods are trained in an unsupervised manner, without the input of the category labels, but are evaluated using the labels as ground truth. Their performance is measured using permutation-invariant external metrics, such as clustering ACCuracy (ACC), Normalized Mutual Information (NMI) or the Adjusted Rand Index (ARI), which quantify the similarity between the true category labels and the learned cluster assignments. Table I shows the published performance of the above discussed algorithms on these image datasets. For the photographic CIFAR-10 dataset, many generative methods have no published results, and the ones that do show worse performance than their discriminative peers, stemming from the previously discussed blue sky problem.

MNIST CIFAR-10
Alg. Year ACC NMI ACC NMI
DEC [dec] 2016
VaDE [vade] 2016
DEPICT [depict] 2017
ACAI [acai] 2019
ClusterGAN [clustergan] 2018
IMSAT [imsat] 2017
DAC [dac] 2017
DCCS [dccs] 2020
ADC [adc] 2019
IIC [iic] 2019
JULE [jule] 2016
TABLE I: Performance of the state-of-the-art algorithms on the MNIST and CIFAR-10 datasets. Values marked with * are taken from [dccs]. All other values stem from the respective publications.

III Clustering with Decorrelating Adversarial Nets

Fig. 3: Overview of the components, training phases and losses in DANCE.

III-A An Argument for Generative Clustering

In the previous section, we discussed the blue sky problem and its detrimental effect on generative (reconstructive) clustering methods when used on image datasets. Although common in photographic data, this specific kind of data pollution, i.e., information from other entities seeping into the observations, is less prevalent in data from other domains. In mobile networks, data isolation is a given; for example, values logged about a cell in the network will never accidentally contain information about other cells. For this reason, discriminative methods might not see such a large advantage over generative algorithms on mobile network data as they do on images.

Furthermore, applying discriminative algorithms developed for image datasets to non-image datasets is not straightforward. The specific regularization methods used by the discriminative approaches work well on image datasets because we humans have a strong intuitive understanding of how vision and images behave, so the authors were able to define meaningful augmentations of the data, which the methods then have to take into account in their model. The same can usually not be said about data from other domains: for example, image augmentations such as rotation or hue changes either don't make sense, or it is questionable whether such variations are truly present in the data. In contrast, reconstructive methods don't suffer from such applicability problems; reconstruction always makes sense regardless of the data domain.

However, all of the above does not mean that generative clustering algorithms always perform better than discriminative algorithms on non-image datasets. In fact, as we will see later, generative algorithms often still show lower performance. Our observation while working with these algorithms is that a major reason for the worse average performance is the inconsistency of the internal clustering step. The core of the reconstructive clustering methods is the training of the autoencoder, with clusters usually only defined in a subsequent step, sometimes not influencing the encoded features at all. The inconsistencies are a result of the applied simpler, "traditional" clustering algorithms, such as k-Means or Gaussian mixtures [gmm], which are sensitive to initialization and can get stuck in local minima when starting from an unfortunate position. Some deep clustering methods that apply these simple clustering algorithms mitigate this limitation by utilizing multiple initializations of the internal clustering step, trying to select the best clustering based on some unsupervised metric before proceeding (a small sketch of this practice follows below). However, traditional (non-deep) clustering algorithms have no measures to effectively quantify the goodness of the fit.
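To make this practice concrete, the following is a minimal sketch (not any specific algorithm's implementation) of selecting among repeated k-Means fits with an internal, label-free score, here the silhouette coefficient from scikit-learn; the encoded dataset Z and the cluster count k are placeholders:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_kmeans(Z, k, n_tries=10):
        # Run several differently seeded k-Means fits and keep the one with the best
        # internal score; this only partially rectifies the initialization sensitivity
        # discussed above, since such scores quantify the fit imperfectly.
        best_fit, best_score = None, -np.inf
        for seed in range(n_tries):
            fit = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(Z)
            score = silhouette_score(Z, fit.labels_)
            if score > best_score:
                best_fit, best_score = fit, score
        return best_fit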

We set out to create a method that improves on the above areas of generative clustering algorithms, with the objective of being consistent and easily applicable to non-image datasets, especially data from mobile networks. What we propose here is not an entirely new algorithm, but an inventive way of using already established neural net components and training methods. The main aspects of our proposed method are:

  • Easy cross-domain application stemming from the generative nature.

  • Mitigation of the detrimental effects of reconstructive information perturbing the clustering features.

  • A good initialization for the internal clustering.

III-B Decorrelating Adversarial Net

Our proposal is called DANCE, whose components and losses are illustrated by Fig. 3. This section details the core of our proposed approach, the Decorrelating Adversarial Net (DAN).

DANCE is based on an autoencoder neural net, made up of an encoder E and a decoder D. E realizes the non-linear mapping from the input (data) feature space X to the encoded, latent feature space Z, which has a (much) lower dimensionality than X, with θ_E denoting the learnable parameters. D approximates the inverse of E, trying to reconstruct the original observations from Z. Both the θ_E and θ_D parameters are optimized through stochastic gradient descent, with the objective of minimizing the reconstruction loss:

L_{rec} = \frac{1}{m\,d} \sum_{i=1}^{m} \sum_{j=1}^{d} (x_{i,j} - \hat{x}_{i,j})^2 \qquad (1)

where x_i and x̂_i denote the original and reconstructed data points, d the number of features in X, and m the number of data points in a batch, realizing the Mean Squared Error (MSE) commonly used for regression tasks.
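As a point of reference for the later snippets, the following is a minimal PyTorch sketch of such an autoencoder; the layer sizes and latent dimensionalities (d, dim_zc, dim_zr) are hypothetical placeholders, not the convolutional topology used in the evaluation (Sec. IV):

    import torch.nn as nn

    d, dim_zc, dim_zr = 64, 2, 2                 # hypothetical data and latent sizes

    encoder = nn.Sequential(                     # E: X -> Z
        nn.Linear(d, 128), nn.ReLU(),
        nn.Linear(128, dim_zc + dim_zr))

    decoder = nn.Sequential(                     # D: Z -> X, approximate inverse of E
        nn.Linear(dim_zc + dim_zr, 128), nn.ReLU(),
        nn.Linear(128, d))

    recon_loss = nn.MSELoss()                    # realizes Eq. (1)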

To reduce unnecessary reconstructive information used for clustering, we split Z into two sets: Z_C, containing the clustering-relevant features, and Z_R, containing the purely reconstructive features. We defined the following rules to distinguish the two sets:

  • As features in Z_R contain no clustering-relevant information, they must not be correlated to Z_C.

  • As features in Z_R contain only generic information which is applicable to all clusters, or reconstruction-specific information about smaller, finer details in the data, Z_R is likely to have a simple, noise-like distribution, such as a Gaussian distribution.

To separate Z_C from Z_R, the above description is posed as an adversarial game. Let ζ refer to points sampled randomly from a Gaussian distribution with the same dimensionality as Z_R, mean μ and variance σ². To create non-correlated reference points with the desired distribution, we replace Z_R in the original encoded features by ζ, arriving at Ẑ = Z_C ⊕ ζ (⊕ denotes concatenation). An adversarial net (the decorrelator) A has the task of detecting whether a point comes from Z or Ẑ, by formulating rules that either consider the difference in distribution between Z_R and ζ, or detect correlation between Z_C and Z_R. The encoder has to generate a latent encoding which is impossible to differentiate from Ẑ, thus mimicking the distribution of ζ with Z_R and breaking any correlation between Z_C and Z_R.

A outputs a single scalar, which represents the estimated probability that a point came from Z rather than Ẑ. To optimize the decorrelator parameters θ_A and the encoder parameters θ_E for the adversarial game, stochastic gradient descent is used. The respective losses, which realize the minimization of the Jensen–Shannon divergence as proposed in the original GAN paper [gan], are the following:

L_{A} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log A(z_i) + \log(1 - A(\hat{z}_i)) \right] \qquad (2)

L_{adv} = -\frac{1}{m} \sum_{i=1}^{m} \log(1 - A(z_i)) \qquad (3)

where A(z_i) and A(ẑ_i) refer to the decorrelator guesses on points coming from Z and Ẑ, and m refers to the number of data points in a batch.

The above described neural net setup is very similar to Wasserstein Autoencoders [wae]. As such, in theory the decorrelation does not interfere with the original autoencoding task, and any desired loss function could work well to calculate the reconstruction loss. To balance the losses that affect the encoder, a coefficient γ can be introduced, so that the final encoder loss is:

L_{E} = L_{rec} + \gamma L_{adv} \qquad (4)

A major problem with the decorrelator in this format is the continuous disturbance of the features in Z_C during training. As the correlation between Z_C and Z_R can be broken in both feature sets, L_adv generates gradients which try to move the points around in a chaotic manner in Z_C. Since the adversarial game does not define any target distribution for Z_C, this disturbance does not converge, and is constantly present during training, in the worst cases causing Z_C to collapse into a single point. An obvious choice would be to impose a prior distribution on Z_C, the same way we are already imposing a prior on Z_R (similarly to what DCCS does to some extent); however, in our experience priors on Z_C don't work well in the autoencoder setting. Instead, in DANCE, backpropagation of the gradient from A is stopped through Z_C, so that the adversarial game does not affect those features. This effect can be achieved in most deep learning frameworks with a simple detach() or stop_gradient() call. This gradient stop still allows the Z_C features to be used by A for the detection of correlation with Z_R, but the gradients only affect Z_R, leaving Z_C undisturbed by the decorrelation.
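The following is a minimal PyTorch sketch of one DAN training step, reusing the encoder and decoder from the earlier sketch and adding a decorrelator net that ends in a sigmoid; the label convention (encodings Z as 1, references Ẑ as 0) mirrors Eqs. (2) and (3), while the optimizer choices and the value of gamma are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    decorrelator = nn.Sequential(                # A: Z -> (0, 1)
        nn.Linear(dim_zc + dim_zr, 64), nn.ReLU(),
        nn.Linear(64, 1), nn.Sigmoid())
    opt_a = torch.optim.Adam(decorrelator.parameters())
    opt_ae = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()])
    gamma = 1.0                                  # placeholder balancing coefficient

    def dan_step(x):
        z = encoder(x)
        z_c, z_r = z[:, :dim_zc], z[:, dim_zc:]
        zeta = torch.randn_like(z_r)                      # Gaussian reference noise
        z_hat = torch.cat([z_c.detach(), zeta], dim=1)    # reference: Z_C kept, Z_R replaced

        # Decorrelator update: real encodings labeled 1, references labeled 0 (Eq. 2)
        d_real, d_ref = decorrelator(z.detach()), decorrelator(z_hat)
        loss_a = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy(d_ref, torch.zeros_like(d_ref))
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()

        # Encoder/decoder update: reconstruction plus fooling the decorrelator (Eq. 4).
        # z_c is detached here, so the adversarial gradient only reaches z_r: this is
        # the gradient stop described above. Decorrelator gradients produced by this
        # pass are discarded at the next opt_a.zero_grad().
        loss_rec = F.mse_loss(decoder(z), x)
        d_fool = decorrelator(torch.cat([z_c.detach(), z_r], dim=1))
        loss_adv = F.binary_cross_entropy(d_fool, torch.zeros_like(d_fool))   # Eq. (3)
        loss_e = loss_rec + gamma * loss_adv
        opt_ae.zero_grad(); loss_e.backward(); opt_ae.step()
        return loss_rec.item(), loss_adv.item(), loss_a.item()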

In theory, the DANCE setup could lead to an encoding where both clustering and reconstructive information is communicated only through Z_C, and Z_R does not carry any information at all, only capturing random noise in order to adhere to the prior. In our experience this is never the case, as the autoencoder always tries to utilize all latent features. However, possibly for this reason, or because of saturation problems in A, DANCE seems to work best if both Z_C and Z_R are of low dimensionality. Figure 4 shows a typical DANCE encoding of the MNIST dataset, where both Z_C and Z_R are low-dimensional.

(a) density
(b) ground truth
(c) density
(d) ground truth
Fig. 4: A typical DANCE encoding of the MNIST dataset at the end of the DAN pre-training. The top two figures depict how well Z_R follows the Gaussian prior, as well as being completely decorrelated from Z_C. The bottom two figures show the irregular, but well separated encoding in Z_C.

We realize that the above detailed decorrelation does not guarantee that all clustering-relevant information ends up in Z_C, nor that all clustering-irrelevant information is removed from Z_C. In reality, DANCE reduces variance in Z_C, and allows for a more coherent mapping where similar data points are close together, which is beneficial for a subsequent application of traditional clustering algorithms.

III-C RIM Initialization and DEC Clustering

The internal clustering step in DANCE is done with the mechanism from DEC. In its original form, DEC uses the k-Means algorithm to find initial positions for the cluster centroids. We found this initialization to be quite unreliable, because k-Means is biased towards convex, even-sized clusters by design, which is often not how the encoded clusters behave in DANCE. Instead, we opted to use the discriminative RIM [rim] algorithm to find a good initial clustering, which is then subsequently refined by DEC.

In RIM, a simple feed-forward neural net is used to find clusters, by looking for cluster boundaries that lie in sparsely populated regions of the input data space. To achieve this, RIM minimizes the conditional entropy, balanced by the maximization of the entropy of the label distribution, which helps to form clusters with even populations. In effect, this optimization task maximizes the empirical estimate of the mutual information between data points and their assignments. Let R denote the RIM net, where θ_R are the learnable parameters. R directly outputs cluster assignment probabilities for each input point. To train the RIM net, the following loss terms are minimized through stochastic gradient descent:

H(Y|X) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{k} p_{i,c} \log p_{i,c} \qquad (5)

H(Y) = -\sum_{c=1}^{k} \bar{p}_c \log \bar{p}_c, \qquad \bar{p}_c = \frac{1}{m} \sum_{i=1}^{m} p_{i,c} \qquad (6)

L_{R} = H(Y|X) - \lambda H(Y) + \eta \Omega(\theta_R) \qquad (7)

where p_{i,c} refers to the cluster assignment probabilities output by R, k refers to the number of clusters, λ is a balancing coefficient between the two entropy terms, and η is a balancing coefficient for the regularization term Ω(θ_R). For the regularization term we used the L2 norm of the parameters θ_R, also commonly referred to as weight decay.
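A minimal PyTorch sketch of this objective, assuming the RIM net ends in a softmax over the k clusters; the names lam and eta simply mirror the coefficients λ and η above, and their values are placeholders:

    import torch

    def rim_loss(p, params=None, lam=1.0, eta=1e-3):
        # p: (m, k) cluster assignment probabilities output by the RIM net
        eps = 1e-12
        cond_entropy = -(p * (p + eps).log()).sum(dim=1).mean()       # Eq. (5)
        p_bar = p.mean(dim=0)
        marg_entropy = -(p_bar * (p_bar + eps).log()).sum()           # Eq. (6)
        reg = sum((w ** 2).sum() for w in params) if params else 0.0  # L2 weight decay
        return cond_entropy - lam * marg_entropy + eta * reg          # Eq. (7)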

We found that the conditional entropy achieved at the end of the RIM training is also a reasonably good indicator of the objective goodness of the clustering, rectifying a problem mentioned in Sec. III-A. To exploit this, we re-train R multiple times for a fixed number of epochs, and select the training run with the lowest conditional entropy at the end. The averages of the clusters in Z_C found by this R are then used as centroids to start the DEC refinement. Figure 5a shows the clusters and the subsequent cluster centroids defined by RIM on the DANCE-encoded MNIST dataset.
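A sketch of this restart-and-select procedure; train_rim(), conditional_entropy() and make_rim() are hypothetical helpers (the entropy term can be computed as in the previous snippet), and n_tries and k are placeholder values:

    import copy
    import torch

    def init_centroids(z_c, make_rim, n_tries=10, k=8):
        best_rim, best_h = None, float("inf")
        for _ in range(n_tries):
            rim = train_rim(make_rim(), z_c)            # fixed number of epochs
            h = conditional_entropy(rim(z_c))           # Eq. (5) on the whole set
            if h < best_h:
                best_rim, best_h = copy.deepcopy(rim), h
        y = best_rim(z_c).argmax(dim=1)                 # hard assignments
        centroids = torch.stack([z_c[y == j].mean(dim=0) for j in range(k)])
        return centroids, y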

(a) RIM initialization
(b) DEC refinement
Fig. 5: RIM initialization and DEC refinement on the encoded Z_C (MNIST). The white dots represent the cluster averages (centroids), the coloration shows the cluster assignments.

DEC uses the initialized cluster centroids to influence the latent space in Z_C, by moving the encoded points closer to their assigned centroids, while simultaneously moving the centroids to best fit their assigned points [dec]. To do this, DEC uses Student's t-distribution to define a soft assignment between encoded points and cluster centroids. The encoded points and the centroids are then jointly moved to best fit a target distribution by minimizing the Kullback–Leibler divergence. Because of the soft assignment, the DEC optimization has a more pronounced effect on the encoded points closest to the centroids compared to more remote points, and vice versa; the centroids are moved to best fit the closest encoded points. The DEC loss is defined as:

q_{i,j} = \frac{(1 + \lVert z_{C,i} - c_j \rVert^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + \lVert z_{C,i} - c_{j'} \rVert^2 / \alpha)^{-\frac{\alpha+1}{2}}} \qquad (8)

p_{i,j} = \frac{q_{i,j}^2 / f_j}{\sum_{j'} q_{i,j'}^2 / f_{j'}}, \qquad f_j = \sum_{i} q_{i,j} \qquad (9)

L_{DEC} = \mathrm{KL}(P \parallel Q) = \sum_{i} \sum_{j} p_{i,j} \log \frac{p_{i,j}}{q_{i,j}} \qquad (10)

where c_j refers to the cluster centroids, α is the degrees of freedom of the Student's t-distribution, q_{i,j} is the soft assignment of encoded point z_{C,i} to centroid c_j, and p_{i,j} is the auxiliary (target) distribution. During the DEC clustering phase, the encoder loss expands to:

L_{E} = L_{rec} + \gamma L_{adv} + \delta L_{DEC} \qquad (11)

where δ is an additional coefficient balancing the DEC loss with the others.

The DEC loss tightly groups the encoded points around the centroids (Fig. 5b), which further refines the latent space by forcing the autoencoder to make "decisions" about where encoded points end up. This tight grouping also compensates for the limitations of nearest-neighbor assignment, because the tight groups can be efficiently separated by linear boundaries (Voronoi tessellation). Without this grouping, straight cluster boundaries often don't align with the correct distribution boundaries, making nearest-neighbor clustering algorithms, especially k-Means, ineffective in these cases.
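A minimal PyTorch sketch of the DEC soft assignment, target distribution and KL loss (Eqs. 8-10), following the formulation of the original DEC paper [dec]; alpha is typically set to 1, and averaging over the batch is a simplification:

    import torch

    def dec_loss(z_c, centroids, alpha=1.0):
        # Eq. (8): Student's t soft assignment between encoded points and centroids
        dist_sq = torch.cdist(z_c, centroids) ** 2
        q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
        q = q / q.sum(dim=1, keepdim=True)
        # Eq. (9): auxiliary target distribution, sharpened and frequency-normalized
        f = q.sum(dim=0)
        p = (q ** 2) / f
        p = (p / p.sum(dim=1, keepdim=True)).detach()   # target treated as constant
        # Eq. (10): KL(P || Q), averaged over the batch
        return (p * (p / q).log()).sum(dim=1).mean()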

Input: Dataset X; initial parameters θ_E, θ_D, θ_A, θ_R of encoder E, decoder D, decorrelator A and RIM initializer R; the dimensionalities of Z_C and Z_R; hyper-parameters such as the number of clusters k and the numbers of iterations and RIM tries; coefficients γ, δ.
Procedure autoencode(X_batch)
       Compute Z = E(X_batch), X̂ = D(Z) and L_rec using these (Eq. 1);
       Generate Ẑ = Z_C ⊕ ζ, with ζ sampled from the Gaussian prior;
       Compute A(Z), A(Ẑ) and L_A, L_adv using these (Eq. 2) and (Eq. 3);

DAN pre-training
       for n_DAN iterations do
             autoencode(X_batch);
             Update θ_A by minimizing L_A;
             Update θ_E, θ_D by minimizing L_E (Eq. 4);
       end for

RIM initialization
       Deconcatenate Z into Z_C and Z_R;
       for n_tries tries do
             for n_RIM iterations do
                   Compute R(Z_C) and L_R using it (Eq. 7);
                   Update θ_R by minimizing L_R;
             end for
             Compute R(Z_C) and the conditional entropy using it (Eq. 5);
             Store θ_R if the conditional entropy is the lowest so far;
       end for
       Compute assignments y = argmax R(Z_C);
       for j = 1 … k do
             c_j = mean of the points in Z_C assigned to cluster j;
       end for

DEC refinement
       for n_DEC iterations do
             autoencode(X_batch);
             Compute L_DEC (Eq. 10) using Z_C and the centroids c (Eqs. 8, 9);
             Update θ_A by minimizing L_A;
             Update θ_E, θ_D, c by minimizing L_E (Eq. 11);
       end for

Compute final assignments;
Algorithm 1: DANCE

The complete DANCE algorithm is shown in Alg. 1. The three main phases (DAN pre-training, RIM initialization and DEC refinement) can also be seen as overhauled versions of the original DEC training steps, but we hope the reader agrees that the changes are significant enough to warrant a different algorithm name. Although other papers often refer to multi-phase methods in a negative light, we have found that the isolated phases are easier to debug and parameterize, since training does not have to be completely restarted if something is not perfect, which also helped us during the evaluation of the algorithm.
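As a rough illustration of how the three phases can be wired together, the sketch below reuses the hypothetical helpers from the earlier snippets (dan_step, init_centroids, dec_loss) together with assumed data-access placeholders (next_batch(), all_data); the iteration counts and the value of delta are arbitrary, and the adversarial term of Eq. (11) is omitted for brevity:

    import torch
    import torch.nn.functional as F

    for _ in range(5000):                                  # DAN pre-training
        dan_step(next_batch())

    with torch.no_grad():                                  # RIM initialization on Z_C
        z_c_all = encoder(all_data)[:, :dim_zc]
    centroids, _ = init_centroids(z_c_all, make_rim)
    centroids = torch.nn.Parameter(centroids)
    opt_dec = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(), centroids])

    delta = 0.1                                            # placeholder balancing coefficient
    for _ in range(2000):                                  # DEC refinement
        x = next_batch()
        z = encoder(x)
        loss = F.mse_loss(decoder(z), x) + delta * dec_loss(z[:, :dim_zc], centroids)
        opt_dec.zero_grad(); loss.backward(); opt_dec.step()

    with torch.no_grad():                                  # final hard assignments in Z_C
        assignments = torch.cdist(encoder(all_data)[:, :dim_zc], centroids).argmin(dim=1)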

IV Evaluation

IV-A Methodology

Our ultimate goal was to evaluate DANCE on the mobile network dataset, but in order to be able to compare the performance to other state-of-the-art methods with this dataset, we needed to have working implementations of the other algorithms. As it would be an enormous undertaking to try to re-implement and evaluate every method we listed in Section II, we have selected four methods to compare against, based on their performance, age, and their connection to our algorithm:

  • DEC is an obvious choice for an older generative algorithm, as DANCE shares its internal clustering and the overall structure. In both methods, the internal clustering influences the encoding.

  • ACAI is a state-of-the-art generative clustering algorithm which, contrary to DEC and DANCE, develops the encoding independently from the internal clustering.

  • IMSAT is one of the first discriminative algorithms, working without manual data augmentation for the regularization. IMSAT builds on RIM, which we also utilize in DANCE.

  • DCCS is our choice of a state-of-the-art discriminative algorithm. In contrast to IMSAT, DCCS does use explicit data augmentation as regularization, and shares the core idea of separating clustering-relevant from clustering-irrelevant features with DANCE.

Altogether, we have an even split of generative and discriminative deep clustering algorithms, as well as an even split between earlier and recent publications. This aspect is important, because we suspected that older algorithms are by no means necessarily worse than newer publications on mobile network data, contrary to what the trend might show on image datasets.

Another beneficial effect of having to re-implement the contending methods is that we can control for neural net complexity in the evaluation. Even in the few years since the first publication of these methods, the emergence of dedicated hardware accelerators, easy-to-use deep learning frameworks, and new types of components and training methods has tremendously expanded what is practical to train. While newer publications use residual nets tens or even hundreds of layers deep, some older algorithms were evaluated using only a few simple fully-connected layers. We suspected that these topological differences account for a large portion of the performance differences, and we were interested in understanding how the algorithms differ in performance when using the same neural net components and sizes. Thus, in the following evaluation, all methods use the same convolutional encoder, convolutional decoder and fully-connected adversarial nets (per dataset).

We ran multiple trainings of all methods for each dataset, to be able to present worst, average and peak performance metrics. Of course, not having invented and exhaustively fine-tuned these methods, we likely cannot utilize them to their utmost potential, and probably leave a few percentage points of accuracy on the table. On the other hand, the usability and applicability of an algorithm are just as important as peak performance, and are a focus of this paper. We ask the reader to keep this in mind while reading the following sections.

IV-B Evaluation on Image Data

To give a performance comparison in a common setting, and to establish that our implementations are working correctly, we first evaluated the methods on the MNIST image dataset. This also allows us to compare DANCE to the published performance of other state-of-the-art algorithms without having to implement them. Lastly, this evaluation gave us a chance to see if the older algorithms show improved performance over their published results when using our deeper convolutional nets. The performance of the compared methods can be seen in Table II. The (external) metrics we utilize throughout this evaluation are:

  • ACC, which measures the ratio of correctly assigned points to the total number of data points in the dataset. As the mapping between labels and clusters is ambiguous, we used the Hungarian method [hungarian] to determine the best mapping/permutation, thus making the metric permutation-invariant (a sketch of both metrics follows this list).

  • NMI, which measures the mutual information between labels and cluster assignments. NMI is normalized so that 0 means no mutual information, while 1 is the maximal mutual information achievable.
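For concreteness, the following minimal sketch computes both metrics with SciPy's Hungarian solver and scikit-learn's NMI implementation; the integer label arrays y_true and y_pred are illustrative:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import normalized_mutual_info_score

    def clustering_accuracy(y_true, y_pred):
        k = int(max(y_true.max(), y_pred.max())) + 1
        cost = np.zeros((k, k), dtype=np.int64)
        for t, p in zip(y_true, y_pred):
            cost[p, t] += 1                        # co-occurrence counts
        row, col = linear_sum_assignment(-cost)    # best cluster-to-label permutation
        return cost[row, col].sum() / len(y_true)

    # nmi = normalized_mutual_info_score(y_true, y_pred)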

ACC NMI
Alg. avg (std) min - max avg (std)
DEC () - ()
ACAI () - ()
IMSAT () - ()
DCCS () - ()
DANCE () - ()
TABLE II: Performance of the evaluated algorithms on the MNIST dataset.

The two generative algorithms, DEC and ACAI, exhibit large standard deviations in accuracy and mutual information, which we attribute to the inconsistency of the traditional clustering algorithms in this setting; k-Means cannot reliably find the true clusters in the encoding, and often converges to local minima, arriving at sub-optimal fits.

Apart from this, DEC performed quite a lot better than the originally published results (shown in Table I) when using our deeper encoder and decoder. Especially noteworthy is the impressive maximum accuracy, which is by far the best out of any reconstructive algorithm. Also quite interesting is the relatively high NMI achieved compared to the ACC, which represents a high mutual information content between ground truth labels and cluster assignments. This phenomenon occurs when wrongly clustered observations have a systematic error. In the MNIST dataset, an example of such a systematic error would be a cluster that contains all instances of one digit along with a few instances of a single similar-looking digit, but no other numbers.

The ACAI results are a little worse than the originally published results in Table I. This is not a mistake or misconfiguration on our part; in the original paper the authors themselves admit to selecting the best k-Means clustering based on external metrics, reasoning that ACAI was not originally intended for clustering, and that the shown results only signify the potential of such an approach [acai]. In order to provide a fair and unsupervised comparison of the algorithms, we selected the best out of the repeated k-Means fits using an internal metric (not utilizing the ground truth) in our evaluation, hence the worse results.

IMSAT performed phenomenally, even improving on the already excellent originally published results and coming close to the published performance of DCCS. Of note are the very low deviation and high minimum values, which give the user a welcome guarantee that even in the worst case the clustering is almost as good as it can be. This kind of trust is very important for unsupervised algorithms, as the user has no way of confirming the quality of the clustering in a real-life scenario.

DCCS, on the other hand, proved quite sensitive to the net topology, and performed much worse than the published performance in Table I. We were definitely able to reproduce the originally published results using the originally proposed net, but switching to our more complex topology caused DCCS to learn sub-optimal fits, where often one or more of the clusters were left unused (unpopulated). We tried rectifying this by changing the balance of the prior loss, as well as by adding batch-normalization layers to the nets, but to no avail. It seems to us that the net complexity plays a major role for DCCS in inherently regularizing the model, and more complex nets are not regularized sufficiently by the additional mechanisms in DCCS alone. This is very counter-intuitive, as all the other algorithms benefited from the increased modeling capability of the more complex neural nets used.

Our DANCE algorithm performed as expected: excellent for a generative clustering method, yet not in the range of most discriminative methods' capabilities. Compared to the published values shown in Table I, DANCE is among the best performing generative approaches on the MNIST dataset. We are particularly happy with the quite high minimum accuracy, which again plays a major role in establishing trust towards the algorithm. As far as we can tell, most of the loss in accuracy stems from malformed latent representations, where part of a class ends up separated, far away from most of the points in the same class. This effect could possibly be mitigated by utilizing more dimensions for Z_C, but in our experience the gain in consistency is counteracted by the loss in performance both from the decorrelation and from the RIM initialization, negating any benefit.

Lastly, our goal with this evaluation was to tune most of the hyper-parameters of the algorithms and, apart from the net topologies, reuse these settings in the mobile data evaluation. However, we realized that the difference between the two domains likely causes the hyper-parameters to be sub-optimal for the network dataset, so in the end we allowed the tuning of some parameters based on internal metrics, such as balancing losses or adjusting learning rates. These changes could reasonably be made without knowledge of the ground truth, in order to stay within the bounds of a realistic clustering scenario. Furthermore, every algorithm was given more training iterations to compensate for the smaller dataset, which results in fewer updates per epoch.

IV-C Evaluation on Mobile Network Data

Label Traffic Speed [km/h] Movement
Stationary 1 File Transfer Protocol (FTP) -
Stationary 2 Voice over IP (VoIP) -
Stationary 3 HyperText Transfer Protocol (HTTP) -
Pedestrian 1 FTP random
Pedestrian 2 VoIP random
Pedestrian 3 HTTP random
Vehicular 1 VoIP - streets
Vehicular 2 HTTP - streets
TABLE III: User groups in the mobile network dataset.

The evaluation on mobile network data represents the target clustering scenario for our algorithm, which also tests the other algorithms’ flexibility towards different application domains. The evaluation we devised tries to pose a similar task to the common problem in the use-cases introduced in Sec. I: the clustering of behavioral patterns. In our evaluation, the clustering algorithms were to assign mobile users to groups based on how they use the network, and what they use it for, using information that implicitly describes their behavior.

In order to generate data where we know the ground truth, we used a mobile network simulator, which enabled us to use the usual external metrics in our evaluation. The simulation scenario was set in the city of Helsinki, where mobile users moved around and used a multi-layer heterogeneous network to communicate (Fig. 6). The network comprised multiple macro, micro and WiFi cells (access points), and covered most of the city. The users were allocated into user groups, differentiated based on the users' mobility patterns (stationary, pedestrian, vehicular) and their network usage type (talking using VoIP, web-browsing using HTTP and transferring files using FTP). The definitions of the user groups can be seen in Table III.

Fig. 6: Excerpt from the Helsinki simulation scenario.

The collected data contained:

For every user, a fixed set of values was collected at regular intervals. The simulation contained an even distribution of users from each of the user groups, with each user observed for a number of consecutive fixed-length sequences, corresponding in total to a number of hours of simulation time. The collected data was organized into an array that is functionally equivalent to pixel images a single pixel tall, with one channel per collected value (instead of the usual 3: red, green and blue). The clustering algorithms processed this data using one-dimensional convolutional encoders (and decoders). The resulting performances can be seen in Table IV.
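For illustration, a minimal sketch of such a one-dimensional convolutional encoder for multi-channel KPI sequences; the channel count (n_ch), sequence length (n_steps), layer sizes and latent dimensionality are hypothetical placeholders, not the values used in the evaluation:

    import torch
    import torch.nn as nn

    n_ch, n_steps, dim_z = 8, 64, 4                # hypothetical sizes

    conv_encoder = nn.Sequential(
        nn.Conv1d(n_ch, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * (n_steps // 4), dim_z))

    z = conv_encoder(torch.randn(16, n_ch, n_steps))   # (batch, channels, time) -> (batch, dim_z)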

ACC NMI
Alg. avg (std) min - max avg (std)
DEC () - ()
ACAI () - ()
IMSAT () - ()
DCCS () - ()
DANCE () - ()
TABLE IV: Performance of the evaluated algorithms on the mobile network dataset.

DEC and ACAI, the two generative algorithms, show low average and peak performance, even in the lucky training runs with good k-Means fits. We attribute this to the large amount of clustering-irrelevant information in the latent representation. The clusters formed by k-Means and the DEC mechanism ultimately incorporate this irrelevant information, which, in many instances, causes the encoded points to end up in the wrong cluster. On the other hand, the regularization through reconstruction seems to function well, as the deviation in accuracy for these algorithms is comparatively lower than that of the others.

IMSAT was not successful on the mobile network dataset, producing abysmal results. We originally chose to re-implement IMSAT because, despite being a discriminative algorithm, it does not utilize any domain-specific regularization methods such as image transformations; rather, the regularization is done through the SAT mechanism. SAT disturbs the data on-the-fly during training, in a seemingly domain-agnostic manner. However, in order to calibrate the disturbance imposed by SAT, IMSAT uses a pre-calculated value for every datapoint. For the MNIST dataset, these values are the Euclidean distances between datapoints and their closest neighbors (calculated in the original data space, where each pixel is a separate dimension). The same calibration value calculated on the mobile network dataset does not seem to work well, the reason being that the Euclidean distance is simply not as meaningful for our dataset as it is for MNIST, or images in general. The large variance in the position of important patterns in the sequences, caused by the arbitrary sequence framing, creates a large distance even between identical patterns shifted in time. We tried tuning which neighbor to use for the calculation, but saw no significant improvement. The bad performance could also be an indication of a mismatch between the data complexity and the used neural net topologies, although we think the other algorithms are proof that the nets were at least capable of producing good results, if not optimal ones.

DCCS was the second most accurate algorithm on the mobile network dataset. DCCS uses randomized data augmentation to separate "categorical" features from "style" features. These data augmentations are commonly used image transformations for the MNIST dataset: zooming, aspect ratio changes, and brightness, hue and saturation changes. Zooming was quite straightforward to implement for the mobile network dataset, and one could argue that such variation is probably present in the data: zooming in the temporal dimension is the equivalent of processes taking longer or shorter times, which manifests as the expansion or contraction of the generated patterns in the data. Aspect ratio changes do not apply to the mobile network dataset, as it behaves as an "image" which is a single pixel tall. We replaced the value variations (brightness, hue and saturation changes) with a randomized offset and scaling on individual channels, tuning the parameters on the MNIST dataset to produce visibly similar images to the originally proposed image transformations. It seems that these augmentations were adequate for the mobile network dataset, as DCCS proved to be quite accurate in its clustering. We suspect that a fine-tuned net topology, as well as better-tuned augmentations, could further improve the performance of DCCS; however, specifically this type of tuning is not possible if the user only has an unlabeled dataset available, which is the main premise of unsupervised learning.

DANCE performed the best on the mobile network dataset, reaching the highest average, maximum, and, most importantly, the highest minimum accuracy. Our algorithm did not require hyper-parameter changes apart from changing γ; because of a higher reconstruction loss on the mobile network dataset, this balancing coefficient had to be increased in order for the decorrelator to have an effect on the encoding. This tuning can be done without any labeled data, solely by making sure that the decorrelator adversary converges close to 50% accuracy (random guessing) when choosing between Z and Ẑ at the end of the training. A scatter plot of the encoded data points and the clustering can be seen in Fig. 7. In order to further explore how this performance is achieved in DANCE, we continue with an ablation study.

(a) ground truth
(b) ground truth
(c) density
(d) RIM initialization
Fig. 7: A typical DANCE encoding of the mobile network dataset. The top two figures depict how well Z_R follows the uncorrelated prior, as well as how well the ground truth classes separate in Z_C. The bottom two figures show the density of the encoded Z_C points, and how well RIM was able to find the initial cluster centroids.

V Short Ablation Study

It is important to see how much each of the DANCE components contributes to the overall performance, which also helps in understanding the synergies between them. In the following ablation study, we examine every combination of the components evaluated on the mobile network dataset: the DAN, the RIM initialization and the DEC cluster refinement. Without the DAN, the autoencoder and the internal clustering steps worked in a single latent space, which was set to have the combined dimensionality of Z_C and Z_R. In the absence of RIM, we utilized k-Means to find the initial cluster centroids for DEC. If DEC was not used either, k-Means alone determined the final clustering. The results from the ablation study can be seen in Table V.

ACC
DAN RIM DEC avg (std) min - max
() -
() -
() -
() -
() -
() -
() -
() -
TABLE V: The effect of the different DANCE components on performance, measured on the mobile network dataset.

Without any of the components, the algorithm is simply k-Means run on an autoencoder-formed latent encoding. This setup is often used as a baseline for deep clustering, with the premise that the algorithm should greatly improve upon these results. Using RIM instead of the k-Means clustering does not bring tangible benefits, probably because the irrelevant information in the latent encoding hides the otherwise sparsely populated cluster boundaries RIM is looking for. Using DEC with a k-Means initialization is basically the originally proposed DEC algorithm (highlighted with blue); however, the results are a little worse than what we have shown in Table IV, because in this case DEC is operating with a lower-dimensional latent space. Adding RIM as an initialization for DEC once again does not improve performance meaningfully, for the same reason RIM was not greatly beneficial by itself.

Using the DAN and clustering with k-Means only in Z_C already improves the average accuracy as much as the other two components combined, but more importantly improves peak accuracy by a great margin. This is because the decorrelated encoding in Z_C maps clusters in a compact manner, without many datapoints mixed into wrong clusters. k-Means, though unreliably, sometimes fits these clusters well, resulting in high peak accuracy. Using RIM instead of k-Means for clustering greatly improves minimum and average accuracy, because in Z_C, with most of the clustering-irrelevant information removed, the sparse cluster boundaries stand out, and RIM is able to find them reliably. Not using RIM but using DEC once again loses these benefits, only retaining the high peak accuracy achieved through the DAN decorrelation. Finally, with all components combined, we arrive at the complete DANCE algorithm (highlighted with green), where DEC is able to exert its full benefits on the clustering, improving worst-case and average accuracy by quite a significant margin without losing its capability to maximize peak accuracy.

VI Conclusion

In this paper, we discussed state-of-the-art deep clustering algorithms, splitting them into two groups: generative and discriminative methods. Although discriminative methods seem to be the peak performers in image clustering, their highly tuned nature and assumptions about the data make them hard to apply to mobile network data. Reasoning that generative algorithms seem to be more domain-agnostic, we have proposed our own generative deep clustering algorithm, DANCE, built around the core idea of isolating clustering-relevant features in the latent space. We have evaluated the performance of DANCE and other state-of-the-art algorithms on an image dataset and a mobile network dataset, while also providing an ablation study to highlight the significance of the different components of our algorithm. DANCE achieved good performance on the image dataset, and excellent performance on the mobile network dataset, surpassing its competitors by a sizable margin.

In real-world applications, clustering algorithms require a great deal of expertise to use. In the hands of a less experienced user, or somebody who does not have the resources, time, or a labeled dataset to fine-tune these algorithms for the specific use-case, simplicity, usability and reliability play a far bigger role than peak performance in the overall usefulness of the algorithm.

As a closing remark, we would like to highlight the idea shared between DCCS and our DANCE algorithm: the concept of separating latent features into clustering-relevant and clustering-irrelevant sets. Although the two implementations show different advantages and disadvantages in different data domains, at the least we can say that the concept itself is very promising, and could be an interesting topic for future research.

References