Improved Multi-objective Data Stream Clustering with Time and Memory Optimization

01/13/2022
by   Mohammed Oualid Attaoui, et al.
Université Paris 13
The analysis of data streams has received considerable attention over the past few decades, driven by the volume of data produced by sensors, social media, and similar sources. It aims to recognize patterns in an unordered, infinite, and evolving stream of observations. Clustering this type of data imposes restrictions on time and memory. This paper introduces a new data stream clustering method (IMOC-Stream). Unlike other clustering algorithms, this method uses two different objective functions to capture different aspects of the data. The goals of IMOC-Stream are to: 1) reduce computation time by using idle times to apply genetic operations and enhance the solution; 2) reduce memory allocation by introducing a new tree synopsis; 3) find arbitrarily shaped clusters by using a multi-objective framework. We conducted an experimental study on high-dimensional stream datasets and compared our method to well-known stream clustering techniques. The experiments show the ability of our method to partition the data stream into arbitrarily shaped, compact, and well-separated clusters while optimizing time and memory. Our method also outperformed most of the stream algorithms in terms of NMI and ARAND measures.

1 Introduction

Swarm intelligence, or distributed intelligence, is the collective behavior of decentralized and self-organizing natural or artificial systems nayyar2018advances. It has attracted considerable interest from researchers over the last two decades because such systems are dynamic and flexible and are highly efficient at solving real-world nonlinear problems nayyar2018introduction.

AntTree azzag2003anttree is a hierarchical clustering method that models how ants form living structures and uses this behavior to organize data into a tree that is built in a distributed manner. Intuitively, each ant (data point) starts at a fixed support (the tree root). The behavior of the ants then consists either in moving or in clinging to the structure to extend it and allow other ants to come and attach in their turn. This behavior is determined in particular by the similarity between the data and the local structure of the tree. The result is a tree-like organization of the data whose properties allow us to determine a classification automatically and to obtain a visual overview of the tree.

The AntTree algorithm was proposed to deal with data streams, a kind of data that evolves and arrives in an unbounded sequence. Analyzing a data stream imposes time and space constraints. Data stream clustering consists of creating compact and well-separated partitions from dynamic streaming data in a single scan, using limited time and memory.

Most clustering techniques follow a single objective function. However, each objective function represents a different property of the clusters, such as compactness or separateness. When an algorithm assumes a homogeneous similarity measure over the entire data set, it is not robust to variations in cluster shape, size, dimensionality, and other characteristics handl2007evolutionary. Multi-objective clustering (MOC) methods law2004multiobjective retrieve clusters by applying two or more objective functions. They use a two-step process: 1) generate multiple clustering solutions and store the optimal ones; 2) construct an optimal partition based on the Pareto-set solutions. The following definitions are useful for understanding MOC methods:

1.1 Definitions:

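  • Dominated solutions: a solution $x_1$ is said to dominate a solution $x_2$ if $f_i(x_1) \le f_i(x_2)$ for all $i \in \{1, \dots, m\}$, and there exists $j \in \{1, \dots, m\}$ such that $f_j(x_1) < f_j(x_2)$, where $m$ is the number of objective functions and $f_i$ is the $i$-th objective function (all objectives being minimized).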

  • Pareto-optimal solutions: a solution is called Pareto-optimal if it is not dominated by any other feasible solution. The set of non-dominated solutions is called the Pareto-set.

  • Idle times: in the case of a slow stream, time delays can appear between data points, i.e., periods during which no data point is available. Traditional algorithms stop and wait for new data points before resuming processing. Figure 1 illustrates the concept of idle times.

    Figure 1: Idle times.
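The dominance and Pareto-set definitions above can be illustrated with a short sketch. It assumes that every objective is minimized and that each solution is summarized by its tuple of objective values; the function names `dominates` and `pareto_set` are illustrative and do not come from the paper.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a dominates b (assumption: all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_set(solutions: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the solutions that are not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Four candidate clusterings described by two objective values each.
candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_set(candidates)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

The quadratic scan is adequate for the small populations typically kept in a Pareto-set; larger populations would call for a sorting-based approach.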

Main Contributions

In our previous work attaoui2020moc, we proposed a multi-objective stream clustering method. That method applies genetic operators in every iteration to improve the solutions, which is costly in computation time. Its second drawback is that it computes the distances between all the clusters when searching for the neighborhood of a particular cluster. This paper introduces an improved multi-objective data stream clustering method based on the Ant-Tree clustering algorithm azzag2003anttree. The method reduces the computation time by applying the genetic operators only during idle times, instead of in every iteration. It reduces memory allocation by using the Ant-Tree algorithm to find a cluster's neighborhood. It also introduces a new aggregation approach for the Ant-Tree algorithm that stores only a synopsis of the data instead of all the data points. The method presented in this paper has the following merits compared to other multi-objective and single-objective data stream clustering methods:

  • It uses the Ant-tree algorithm to give the hierarchical aspect to our method and make it easier to determine the clusters’ neighborhood. It also reduces memory allocation by modifying the AntTree algorithm not to store data points.

  • It reduces computation time by using idle times to apply genetic functions and enhance clustering quality.

  • It optimizes two objective functions to obtain high-quality solutions and arbitrarily shaped clusters.

  • It does not require the number of clusters to be specified, and it uses a fading function to give more importance to the most recent data, which better reflect changes in the data distribution.

This paper is organized as follows: in Section 2, we present some background methods. In Section 3, we describe our method and its main features. In Section 4, we present the experiments and the results obtained and compare them to some known clustering methods. Finally, we conclude this paper.

2 Related Works

This section discusses previous work on data stream clustering and highlights the most relevant algorithms proposed in the literature. For stream clustering, only one multi-objective clustering method has been proposed. In paul2020online, the authors optimize multiple objectives capturing cluster compactness and feature relevancy. They consider an evolutionary-based technique and optimize multiple objective functions simultaneously to determine the optimal subspace clusters. The clusters generated by this method are allowed to overlap.

The closest methods to MOC are Evolutionary algorithms since they use the same encoding of the solutions and the same process with a single objective function.

2.1 AntTree Clustering

The Ant-Tree algorithm azzag2003anttree produces a hierarchical structure incrementally, in the same way that ants join together. In this algorithm, each ant represents a single data point, and it moves in the structure according to its similarity with the other ants already connected in the tree under construction. The similarity $Sim(a_i, a_j)$ is represented by the Euclidean distance between two ants $a_i$ and $a_j$. Note that this tree is not strictly equivalent to a dendrogram as used in standard hierarchical clustering techniques: each node in our tree corresponds to one data point, whereas in dendrograms data points generally correspond only to leaves.

Starting from the support, materialized by a fictitious node $a_0$, the ants progressively fix themselves on this initial point, then successively on the ants attached to this initial point, and so on until all the ants are attached to the structure. During the construction of the structure, each ant $a_i$ is either moving on the graph or connected to it. In the first case, $a_i$ is free to move to a neighbor of the ant on which it is located (or to the support).

In the second case, $a_i$ can no longer be released. Furthermore, each ant has only one outgoing link to another ant and cannot have more than $L_{max}$ links connected to it from other ants (the tree has at most $L_{max}$ children per node). Initially, all ants are placed on the support. Each ant has a similarity threshold and a dissimilarity threshold, which are set to 1 and 0, respectively.

An ant $a_i$ will connect under an existing node of the tree (ant $a_{pos}$) if it is sufficiently similar to this node but also dissimilar enough to the children of the node: $a_i$ will thus form a subclass of $a_{pos}$ which will be different from the other subclasses of $a_{pos}$ (possibly already existing). Otherwise, $a_i$ will move randomly in the tree, looking for another location to fix itself. Each time the ant fails to attach to the structure, it is made more tolerant in order to increase its chances of connecting at its next iteration: its similarity threshold is decreased, and its dissimilarity threshold is increased. The particular case of the support is treated as follows: an ant connects to the support if it is sufficiently dissimilar to the other ants already connected directly to $a_0$. This means that a new class has just been built at the highest level of the tree. This class must be as distinct as possible from the other classes already created.

The algorithm ends when all the ants are connected. The sub-trees appearing at the first level of the tree, just below the support, are interpreted as different classes. The properties of the tree can be analyzed visually and interactively (e.g., the classification error decreases as one goes down the tree). It is also possible to transform this tree into a dendrogram (by moving the data placed on internal nodes down to the leaves). The Ant-Tree algorithm is presented in Algorithm 1 for a given ant $a_i$.

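notation: $a_i$ is the moving ant, $a_{pos}$ the node on which it is located, $a_0$ the root support, $a^+$ the daughter of $a_{pos}$ most similar to $a_i$
if no ant or only one ant is connected to the supporting node $a_{pos}$ then
        - connect $a_i$ to $a_{pos}$
else if two ants are connected to the supporting node $a_{pos}$ then
        - disconnect the second ant from $a_{pos}$ (and recursively all ants connected to it)
        - place all these ants back onto the support $a_0$
        - connect $a_i$ to $a_{pos}$
else
        - let $T_{Dissim}(a_{pos})$ be the lowest dissimilarity value between daughters of $a_{pos}$ (i.e., $T_{Dissim}(a_{pos}) = \min Sim(a_j, a_k)$, where $a_j$ and $a_k$ are ants connected to $a_{pos}$)
        - if $a_i$ is dissimilar enough to $a^+$ ($Sim(a_i, a^+) < T_{Dissim}(a_{pos})$) then $a_i$ connects to $a_{pos}$
        - else $a_i$ moves toward $a^+$
Algorithm 1 Connection of an ant in the Ant-Tree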
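To make the three cases of Algorithm 1 concrete, the following is a minimal sketch of a single connection step. Several elements are assumptions that do not come from the text: the similarity is taken as 1 / (1 + Euclidean distance) so that larger values mean more similar, the disconnection case is applied only once per node (the `split_done` flag) so that the step cannot repeat indefinitely, and the names `Ant`, `sim`, `subtree`, and `connection_step` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Ant:
    point: Optional[Tuple[float, ...]]          # None for the fictitious support
    children: List["Ant"] = field(default_factory=list)
    split_done: bool = False                    # assumption: disconnect at most once per node

def sim(a: Ant, b: Ant) -> float:
    """Assumed similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a.point, b.point))

def subtree(node: Ant) -> List[Ant]:
    """The node itself and every ant connected below it."""
    nodes = [node]
    for child in node.children:
        nodes.extend(subtree(child))
    return nodes

def connection_step(ant: Ant, pos: Ant, support_queue: List[Ant]) -> Optional[Ant]:
    """One step for `ant` located on `pos`.

    Returns None when the ant has been connected to `pos`, otherwise the
    daughter of `pos` toward which the ant moves."""
    daughters = pos.children
    if len(daughters) <= 1:                     # case 1: at most one ant connected
        daughters.append(ant)
        return None
    if len(daughters) == 2 and not pos.split_done:
        pos.split_done = True                   # case 2: disconnect the second ant
        second = daughters.pop()
        for node in subtree(second):            # list is built before links are cleared
            node.children = []
            support_queue.append(node)          # these ants go back onto the support
        daughters.append(ant)
        return None
    # case 3: compare with the lowest similarity observed between two daughters
    threshold = min(sim(a, b) for i, a in enumerate(daughters) for b in daughters[i + 1:])
    nearest = max(daughters, key=lambda d: sim(ant, d))
    if sim(ant, nearest) < threshold:           # dissimilar enough: new daughter of pos
        daughters.append(ant)
        return None
    return nearest                              # otherwise move toward the nearest daughter

root = Ant(None)
queue: List[Ant] = []
a, b, c = Ant((0.0, 0.0)), Ant((10.0, 0.0)), Ant((0.0, 1.0))
for ant in (a, b, c):
    connection_step(ant, root, queue)           # c triggers the one-time disconnection of b
moved_to = connection_step(queue.pop(), root, queue)  # b reconnects to the root
```

A full run would repeat this step for every ant waiting in the queue, restarting each ant from the support, until the queue is empty.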

2.2 Evolutionary Multi-Objective Clustering Methods (MOC)

In the past decade, multi-objective evolutionary algorithms have been heavily used in clustering because of their effectiveness. However, there has not been any dedicated effort to review all of these methods. The most prominent effort in this direction can be found in mukhopadhyay2015survey, in which many multi-objective clustering algorithms and techniques were presented. This section surveys previous work on multi-objective clustering and highlights the most relevant algorithms proposed in the literature.

MOCK (handl2007evolutionary), multi-objective clustering with automatic K-determination, consists of two main phases. In its initial clustering phase, MOCK uses a multi-objective evolutionary algorithm (MOEA) to optimize two complementary clustering objectives. The output of this first phase is a set of mutually non-dominated clustering solutions, each corresponding to a different tradeoff between the two objectives. In the second phase, MOCK analyzes the shape of the tradeoff curve and compares it to the tradeoffs obtained for an appropriate null model (i.e., by clustering random data). Based on this analysis, the algorithm provides an estimate of the quality of all individual clustering solutions and determines a set of potentially promising clustering solutions. Often, a single solution is preferred, and, in these cases, the number of clusters inherent to the data set, $K$, is thus estimated implicitly. Figure 2 illustrates the Pareto set in the MOCK algorithm. An improved version of MOCK, Δ-MOCK (garza2017improved), has been proposed; it significantly decreases the computational overhead and reduces the search space.

Figure 2: Clustering solutions plotted according to their objective functions. Each point represents a clustering solution. The Pareto-optimal solution is obtained when K=6.

An Ant Colony Optimization-based clustering method ACO-C (inkaya2015ant) combines the connectivity, proximity, density, and distance information with the exploration and exploitation capabilities of ACO in a multi-objective framework. The proposed clustering methodology is capable of handling several challenging issues of the clustering problem, including solution evaluation, extraction of local properties, scalability, and the clustering task itself.

The multi-objective evolutionary algorithm with simultaneous clustering and classification, MOASCC (luo2016learning), uses a clustering process to enhance the performance of the classification. To achieve this goal, two objective functions are adopted: a fuzzy clustering connectedness function and the classification error rate. A mutation operator is designed to make use of the feedback from both clustering and classification.

IMCPSO (gong2017improved) proposes an improved multi-objective clustering framework using particle swarm optimization. The authors used the overall deviation and the mean distance between clusters as objective functions. They introduced a clustering method to improve each particle (clustering solution) by finding a topological center, which is the point that has the maximum number of neighbors belonging to the same cluster. Figure 3 illustrates the use of topological centers to improve the clustering. Finally, the best particle is selected from the Pareto-set based on the sparsity of the solution.

Figure 3: Using topology centers to improve the clustering solution.

EMO-KC (wang2018multi) uses the term bi-objective clustering to describe a MOC method with two objective functions. The method has two main steps: (i) constructing two conflicting objective functions, and (ii) solving the bi-objective optimization problem with an effective EMO (Evolutionary Multi-Objective) algorithm.

Another MOC algorithm, SOMDEA-clust (saini2019sophisticated), proposes an efficient automated decomposition-based multi-objective clustering technique, which is a hybridization of Self-Organizing Maps (SOM) and the differential evolution algorithm. Two internal cluster validity indices, namely the Silhouette index (SI) and the PBM (Pakhira-Bandyopadhyay-Maulik) index, are used as objective functions. The SOM algorithm is used to create new solutions based on the neighborhood of each neuron.

AMOGA (dutta2019automatic), automatic clustering by a multi-objective genetic algorithm, is a multi-objective clustering algorithm that handles numeric and categorical features. Each clustering solution is encoded as a gene so that genetic operators can be applied. The method initializes a population using the K-prototypes algorithm and the GA operators crossover and mutation. AMOGA uses compactness and separateness as objective functions and different validity measures (DB index, Purity…) to select the best solution from the Pareto-optimal set.

The multi-objective gradient evolution algorithm (MOGE) (kuo2020multi) extends the gradient evolution (GE) algorithm so that it can be applied to multi-objective problems. It applies the Pareto ranking assignment to sort the vectors based on their fitness values. K-means is then run on the Pareto-optimal solutions to obtain the final clustering.


The combinatorial multi-objective pigeon-inspired optimization algorithm (CMOPIO) (chen2020multi) is based on a bio-inspired algorithm called pigeon-inspired optimization (PIO). In CMOPIO, pigeons only interact with the pigeons in their neighborhood. The update of a pigeon's position and velocity therefore relies on that pigeon's neighborhood rather than on the global best position. These improvements allow CMOPIO to identify a variety of Pareto-optimal clustering solutions.

Table 1 compares the multi-objective clustering algorithms.

Method          Encoding scheme   Genetic functions     Objective functions
MOGE            Centroid-based    /                     SSW (Separateness), SSB (Compactness)
CMOPIO          Locus-based       /                     Connectivity, Compactness
MOCK            Locus-based       /                     Stability
Δ-MOCK          Centroid-based    Crossover, Mutation   Connectivity, Class error rate
ACO-C           Point-based       /                     Adjusted Compactness, Relative Separateness
MCPSO           Locus-based       /                     Compactness, Separateness
SOMDEA-Clust    Centroid-based    Mutation, Crossover   PBM Index, Silhouette Index
IMCPSO          Locus-based       /                     Overall deviation, Mean distance between clusters
MOASCC          Centroid-based    Mutation              Connectedness, Error rate
EMO-KC          Centroid-based    Crossover, Mutation   SSD, Overlap-Separateness

Table 1: Comparison Between Multi-Objective Clustering Algorithms

2.3 Data Stream Clustering Methods

To the best of our knowledge, apart from paul2020online, no multi-objective clustering method for data streams has been proposed. As discussed above, the closest methods to MOC are evolutionary algorithms. evoStream carnein2018evostream (Evolutionary Stream Clustering) makes use of an evolutionary algorithm to bridge the gap between the online and offline components. Evolutionary algorithms are inspired by natural evolution, where promising solutions are combined and slightly modified to create offspring, which can yield an improved solution. By iteratively selecting the best solutions, an evolutionary pressure is created, which improves the result over time. evoStream uses this concept to iteratively enhance the macro-clusters through recombinations and small variations. Since macro-clusters are created incrementally, the evolutionary steps can be performed while the online component waits for new observations, i.e., when the algorithm would usually idle. As a result, the computational overhead of the offline part is removed, and clusters are available at any time. The online component is similar to DBSTREAM (hahsler2016clustering) but does not maintain a shared density, since it is not necessary for reclustering.

evoStream is based on DBSTREAM (hahsler2016clustering) (Density-based Stream Clustering), which uses the shared density between two micro-clusters to decide whether they belong to the same macro-cluster. A new observation is merged into every micro-cluster whose center lies within a fixed radius of it. Subsequently, the centers of all micro-clusters that absorb the observation are updated by moving them towards the observation. If the observation is not assigned to any micro-cluster, it is used to initialize a new one. Additionally, the algorithm maintains the shared density between two micro-clusters as the density of points in the intersection of their radii, relative to the size of the intersection area. At regular intervals, it removes micro-clusters and shared densities whose weight has decayed below a respective threshold. In the offline component, micro-clusters with high shared density are merged into the same cluster.
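The online insertion rule described above can be sketched as follows. This is a simplified illustration and not the DBSTREAM implementation: it assumes a fixed learning rate `rate` for moving the centers, and it omits the shared-density bookkeeping and the periodic decay. The function name `insert_observation` is hypothetical.

```python
import math
from typing import List

def insert_observation(centers: List[List[float]], weights: List[float],
                       x: List[float], radius: float, rate: float = 0.1) -> List[int]:
    """Insert x and return the indices of the micro-clusters that absorbed it."""
    hit = [i for i, c in enumerate(centers) if math.dist(c, x) <= radius]
    if not hit:
        # No micro-cluster is close enough: x starts a new one.
        centers.append(list(x))
        weights.append(1.0)
        return [len(centers) - 1]
    for i in hit:
        # Assumption: move each absorbing center a fixed fraction of the way toward x.
        centers[i] = [ci + rate * (xi - ci) for ci, xi in zip(centers[i], x)]
        weights[i] += 1.0
    return hit

centers = [[0.0, 0.0], [5.0, 5.0]]
weights = [1.0, 1.0]
absorbed = insert_observation(centers, weights, [0.5, 0.0], radius=1.0)
created = insert_observation(centers, weights, [10.0, 10.0], radius=1.0)
```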

evoStream was used in (supardi2020evolutionary) to detect outliers in a data stream. The goal of this method is to treat distinct data objects as an outlier detection problem rather than as a categorization problem.

HDCStream (amini2014fast) (hybrid density-based clustering for data stream) was the first to combine grid-based algorithms with the concept of distance-based algorithms. In particular, it maintains a grid where dense cells can become micro-clusters as known from distance-based algorithms (see Section 4). Each observation in the stream is assigned to its closest micro-cluster if it lies within a radius threshold. Otherwise, it is inserted into the grid instead. Once a grid cell has accumulated sufficient density, its points are used to initialize a new micro-cluster. The cell is then no longer maintained, as its information has been transferred to the micro-cluster. At regular intervals, all micro-clusters and cells are evaluated and removed if their density has decayed below a respective threshold. Whenever a clustering request arrives, the micro-clusters are considered virtual points to which DBSCAN (ester1996density) is applied. The algorithm consists of three steps: (1) Merging or mapping: the new data point is added to an existing mini-cluster or mapped to the grid. (2) Pruning grids and mini-clusters: the grid cells, as well as the mini-cluster weights, are periodically checked at pruning time. The periods are defined based on the minimum time for a mini-cluster to be converted to an outlier. The mini-clusters with weights less than a threshold are discarded. (3) Forming final clusters: final clusters are created based on the mini-clusters that remain after pruning. Each mini-cluster is clustered as a virtual point using a modified DBSCAN.

FlockStream is a bio-inspired algorithm for clustering data streams kennedy2006swarm that simulates the behavior of a group of birds in flight. Boid is an abbreviation of bird-oid (meaning bird-like). These boids interact and follow three rules:

  • cohesion: to form a group, the boids move closer to each other

  • separation: two boids cannot be in the same place at the same time

  • alignment: to stay grouped, boids try to follow the same path

FlockStream uses agents to mimic the behavior of boids. Each point is associated with an agent. An agent can be of three types: basic, p-representative (potential micro-cluster), or o-representative (outlier micro-cluster, which can become p-representative if, as points are added, its weight exceeds a threshold). In the initialization phase, a set of basic agents is deployed in the space; agents that are highly similar approach one another (cohesion) and form a cluster, while the other agents separate. The Euclidean distance is used to calculate the dissimilarity between agents. Agents can leave one group to join another with more similar agents. At the end of this phase, a summary of each cluster is calculated, and the other two types of agents appear: p-representative and o-representative. In the second step, a batch of the data stream is inserted. In this phase, the agents are updated as follows:

  • if an o-representative or p-representative meets another representative and their distance is less than a threshold, they join to form a swarm (cluster)

  • if a basic agent A meets a representative R and their distance is lower than a threshold, A is absorbed by R

  • if a basic agent meets another basic agent and their dissimilarity is less than a threshold, they join to form an o-representative.

We list the merits and limitations of each algorithm below:

  • evoStream:
    - Merits: uses idle times to improve the clustering quality; outputs clusters at any time; detects outliers.
    - Limitations: requires the number of clusters to be set; not suitable for high-dimensional data.

  • DBStream:
    - Merits: uses the shared density between clusters to determine whether two clusters can be merged; robust to noise.
    - Limitations: several parameters need to be set; depends on the insertion order of the data points.

  • HDCStream:
    - Merits: handles outliers; improves the computation time and quality.
    - Limitations: unable to detect varying levels of density; cannot handle high-dimensional data.

  • FlockStream:
    - Merits: detects outliers; lower time complexity.
    - Limitations: unable to handle high-dimensional data.

3 Proposed Method

In this section, we introduce IMOC-Stream (Multi-Objective AntTree Clustering data stream). The algorithm is based on AntTree clustering and combines stream clustering and multi-objective clustering to create a Multi-objective stream clustering algorithm that satisfies two objective functions. It makes use of the hierarchical nature of AntTree to improve the clustering quality. We describe in the following sections the main properties of IMOC-Stream.

3.1 Clustering in a Streaming Context

We assume that the data stream consists of a sequence $X = \{x_1, x_2, \dots, x_n\}$ of $n$ (potentially infinite) elements, arriving at times $t_1, t_2, \dots, t_n$, where $x_i \in \mathbb{R}^d$. Since the most recent data points are more important and better reflect changes in the data distribution, we use temporal windows to consider only recent data for the clustering. A set of clustering solutions $S = \{s_1, \dots, s_p\}$ is generated and updated for each window, where $s_j$ is the $j$-th clustering solution and is represented by $K$ clusters $\{c_1, \dots, c_K\}$. Each cluster $c_k$ is represented by a prototype $w_k \in \mathbb{R}^d$, where $d$ is the dimension of the data. Each cluster is associated with a weight that decreases over time following a fading function.

When the first batch of data arrives in the first time window, we build the tree as a clustering solution according to Section 2.1, and this solution is stored in the Pareto-set. From the same batch of data, we initialize several additional solutions using K-means pelleg2000x with different values of $K$, GNG fritzke1995growing, and DBSCAN ester1996density. The parameter settings of these algorithms are reported in Table 2. The generated solutions are combined with the tree solution by the mutation and crossover operators, and the results are added to the solution set. We compute the objective function values for each solution in this set and store the non-dominated solutions in the Pareto-set.

Algorithm   Parameters               Source
GNG         = 30                     Smile package (https://haifengl.github.io)
DBSCAN      = 20, = 10               Smile package (https://haifengl.github.io)
K-means     K varies from 2 to 15    Clustering4Ever (https://github.com/Clustering4Ever)
Ant-tree                             Clustering4Ever (https://github.com/Clustering4Ever)
Table 2: Parameter settings for the algorithms used

After the initialization phase, and for each new window of data points, each point in the current window is assigned to the closest center in each clustering solution of the Pareto-set. The distance between the data points and the centers is the Euclidean distance. Note that, within a given clustering solution, a point can be assigned to only one cluster. Once all points have been assigned, we update the clustering solutions with the newly assigned points, compute the objective function values, and update the Pareto-set. If the system idles, the method combines the solutions in the Pareto-set using the genetic operators and calculates the objective values of the newly generated solutions. At the end of each iteration, the Pareto-set contains a set of non-dominated solutions. At the end of the process, a set of non-dominated clustering solutions is stored. These solutions are mathematically equally good. We use an internal quality measure, the Davies-Bouldin index davies1979cluster, to select the best solution from the Pareto-set.
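The assignment step and the final selection can be sketched together. The sketch assumes that the Davies-Bouldin index is evaluated on the points of the current window, that each solution is given as an array of prototypes, and that no two prototypes coincide; the function names `assign`, `davies_bouldin`, and `select_leader` are illustrative, not taken from the paper.

```python
import numpy as np

def assign(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Index of the closest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def davies_bouldin(points: np.ndarray, centers: np.ndarray) -> float:
    """Davies-Bouldin index of one solution on the given points (lower is better)."""
    labels = assign(points, centers)
    used = np.unique(labels)                 # prototypes that received at least one point
    if len(used) < 2:
        return float("inf")                  # undefined for fewer than two clusters
    scatter = {k: np.linalg.norm(points[labels == k] - centers[k], axis=1).mean()
               for k in used}
    worst = []
    for i in used:
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in used if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))

def select_leader(points: np.ndarray, pareto_set: list) -> int:
    """Position in the Pareto-set of the solution with the lowest index value."""
    scores = [davies_bouldin(points, centers) for centers in pareto_set]
    return int(np.argmin(scores))

window = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
solution_a = np.array([[0.0, 0.5], [10.0, 0.5]])   # one prototype per group of points
solution_b = np.array([[0.0, 0.0], [0.0, 1.0]])    # both prototypes in the same group
best = select_leader(window, [solution_a, solution_b])
```

Evaluating the index on the current window only is consistent with the single-pass constraint, since earlier points are no longer available.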

3.1.1 AntTree with Tree Aggregation

To deal with the memory constraints encountered when analyzing data streams, we propose a new representation of the tree to prevent storing all the data points and to reduce the memory allocation. The tree is initialized from the data points in the first window following the AntTree algorithm described in section 2.1. After placing all the points, we compute the prototypes of each cluster as the average of the points assigned to this cluster. All the points are discarded, and only the tree with the prototypes is stored in the memory. For the next windows, we assign the new data points to each cluster and update the prototype as follows:

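$$w_c^{(t+1)} = \frac{\lambda\, n_c^{(t)}\, w_c^{(t)} + m_c^{(t)}\, z_c^{(t)}}{\lambda\, n_c^{(t)} + m_c^{(t)}} \qquad (1)$$

where $w_c^{(t)}$ is the previous prototype, $n_c^{(t)}$ is the number of points assigned to the cluster so far, $z_c^{(t)}$ is the new prototype computed only from the current window, and $m_c^{(t)}$ is the number of points assigned to the cluster in the current window, so that $n_c^{(t+1)} = n_c^{(t)} + m_c^{(t)}$. $\lambda$ is the decay factor, which down-weights older data to give more importance to the most recent data. If $\lambda = 1$, all data are used from the beginning; if $\lambda = 0$, only the most recent data are used.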

If a point is not assigned to a cluster, it becomes a prototype of a newly created cluster. Figure 4 illustrates tree representation and aggregation.

Figure 4: Topological and Hierarchical representation and tree aggregation process. The circles represent the data points, the squares represent the prototypes and the triangles represent the new data points from the current window.
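The prototype update can be sketched as a decayed weighted average. The exact form used here is an assumption chosen to match the stated limit cases: with a decay factor of 1 the result is the mean over all data seen so far, and with a decay factor of 0 it is the prototype of the current window alone. The function name `update_prototype` and its argument names are hypothetical.

```python
import numpy as np

def update_prototype(w_old, n_old, z_new, m_new, decay):
    """Blend the stored prototype with the prototype of the current window.

    w_old: previous prototype; n_old: points assigned to the cluster so far;
    z_new: prototype computed from the current window only;
    m_new: points assigned to the cluster in the current window;
    decay: value in [0, 1] that down-weights past data.
    """
    w_old = np.asarray(w_old, dtype=float)
    z_new = np.asarray(z_new, dtype=float)
    denom = decay * n_old + m_new
    if denom == 0:                           # nothing to blend: keep the prototype
        return w_old, n_old
    w_next = (decay * n_old * w_old + m_new * z_new) / denom
    return w_next, n_old + m_new

# decay = 1: plain running mean over all points seen so far.
w_all, n_all = update_prototype([0.0, 0.0], 3, [4.0, 4.0], 1, decay=1.0)
# decay = 0: only the current window matters.
w_recent, _ = update_prototype([0.0, 0.0], 3, [4.0, 4.0], 1, decay=0.0)
```

Only the prototype and a counter are kept per cluster, which is what allows the data points themselves to be discarded after each window.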

3.1.2 Fading Function

Most data stream algorithms consider the most recent data as more important because they better reflect changes in the data distribution. We therefore use a fading function in which the weight of each cluster decreases exponentially with time, controlled by a decay factor parameter.

(2)

where is the number of points assigned to the cluster at the current time . If the weight of a node is below a threshold value, this cluster is considered outdated and removed.
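As an illustration of exponential fading and pruning, the sketch below assumes one common form, in which a weight is multiplied by 2^(-decay × elapsed time); the base, the parameter names, and the functions `fade` and `prune_outdated` are assumptions for illustration and are not the paper's exact definition.

```python
from typing import Dict

def fade(weight: float, decay: float, elapsed: float) -> float:
    """Assumed fading form: the weight halves every 1 / decay time units."""
    return weight * 2.0 ** (-decay * elapsed)

def prune_outdated(weights: Dict[str, float], decay: float,
                   elapsed: float, threshold: float) -> Dict[str, float]:
    """Fade every cluster weight and drop the clusters that fall below the threshold."""
    faded = {cid: fade(w, decay, elapsed) for cid, w in weights.items()}
    return {cid: w for cid, w in faded.items() if w >= threshold}

remaining = prune_outdated({"c1": 8.0, "c2": 1.0}, decay=1.0, elapsed=3.0, threshold=0.5)
```

With these hypothetical values, the heavier cluster survives three time units without new points while the lighter one is removed as outdated.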

3.2 Evolutionary Representation and Functions

Most of the Multi-Objective clustering methods use an evolutionary representation for the clustering solutions as their use of population enables the variation of solutions and makes it easier to keep a population of clustering solutions and apply genetic operators. However, the use of such representation requires the following concepts:

  • Choosing an evolutionary encoding to represent a clustering solution.

  • The generation of the initial population by an effective initialization scheme.

  • Suitable genetic operators to mix the solutions.

  • Choosing two or more objective functions as a fitness function to choose the non-dominated solutions.

  • Developing a technique to obtain a single clustering solution from the Pareto-set (leader selection method).

The choice of these components is crucial for the clustering quality and the algorithm scalability. In the next sections, we describe the components we chose after extensive experiments to deal with the requirements presented above.

3.2.1 Genetic Representation

Many representations were presented in the previous MOC methods mukhopadhyay2015survey. However, these representations are not suitable for data stream clustering, since data points cannot be stored and have to be processed in one pass. Therefore, we chose a new genetic representation that facilitates the clustering update as the data flows. Each clustering solution is represented by a chromosome, which is an array of $K \times d + 2$ elements, where $K$ is the number of clusters and $d$ is the dimension of the data. The first and second components are the objective values for this solution. The last $K \times d$ components represent the clusters. Each cluster is represented by a prototype of $d$ elements. Figure 5 illustrates the clustering representation and conversion.

Figure 5: Clustering solution representation and conversion.
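
The encoding described above can be sketched as follows. This is only an illustration under the assumption that the chromosome is a flat list holding the two objective values followed by the concatenated prototypes; the names `encode` and `decode` are hypothetical.

```python
from typing import List, Tuple

Prototype = List[float]


def encode(prototypes: List[Prototype], objectives: Tuple[float, float]) -> List[float]:
    """Flatten a clustering solution: two objective values, then every prototype."""
    chromosome = [objectives[0], objectives[1]]
    for prototype in prototypes:
        chromosome.extend(prototype)
    return chromosome


def decode(chromosome: List[float], dim: int) -> Tuple[List[Prototype], Tuple[float, float]]:
    """Recover the prototypes and objective values from a flat chromosome."""
    body = chromosome[2:]
    if len(body) % dim != 0:
        raise ValueError("chromosome length does not match the data dimension")
    prototypes = [body[i:i + dim] for i in range(0, len(body), dim)]
    return prototypes, (chromosome[0], chromosome[1])


# Hypothetical solution with three 2-dimensional prototypes.
solution = [[0.0, 1.0], [4.0, 4.5], [9.0, 2.0]]
chromosome = encode(solution, (12.3, 4.2))
decoded, objectives = decode(chromosome, dim=2)
```

Because the number of clusters is implied by the chromosome length, solutions with different numbers of clusters can coexist in the same population.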

3.2.2 Population Initialization

In each time window of the data stream, a set of clustering solutions is created and stored. Our algorithm does not require these solutions to have the same number of clusters. A first population is created from the first window using the AntTree algorithm, combined with other solutions generated by several algorithms (K-means pelleg2000x with different K, GNG fritzke1995growing, DBScan ester1996density). Those algorithms were chosen after extensive experimentation for their ability to perform a local search. The resulting solutions are encoded following the scheme described in Figure 5. We select the best solutions from this population to create new clustering solutions using the genetic operators Crossover and Mutation described in Section 3.2.3. After the first population is initialized, we compute the objective functions for each clustering solution and store the Pareto-optimal solutions in the Pareto-set. For each subsequent window of the data stream, the new data points belonging to the current window are used to update the solutions in the Pareto-set and to create new solutions. The Pareto-set is then updated with the non-dominated solutions. We describe the initialization and update scheme in Figure 6.

Figure 6: Initialization and update scheme.
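
To make the Pareto-set update concrete, the following sketch filters a population down to its non-dominated solutions. It assumes each solution is summarized by a pair (compactness, separateness), with compactness minimized and separateness maximized; the function names are hypothetical.

```python
from typing import List, Tuple

Objectives = Tuple[float, float]  # (compactness, separateness)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_set(solutions: List[Objectives]) -> List[Objectives]:
    """Return the solutions that no other solution dominates."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions if other is not s)
    ]


# Hypothetical objective values for four clustering solutions.
population = [(1.0, 5.0), (2.0, 6.0), (2.0, 4.0), (3.0, 6.0)]
front = pareto_set(population)
```

Here `(2.0, 4.0)` is dominated by `(1.0, 5.0)` and `(3.0, 6.0)` by `(2.0, 6.0)`, so only the first two solutions remain in the front.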

3.2.3 Genetic Functions

Genetic operators are essential for MOC methods as they enable the variety and diversity of the clustering solutions. For our method, we use two genetic operators: Crossover and Mutation, to explore more solutions. The use of those operators helps find a better solution by combining the optimal solutions obtained from the other algorithms.

  • Crossover: We use the single-point crossover whitley1994genetic in this paper due to its independence of the ordering of genes. The goal of the crossover operator is to create new clustering solutions from two parent solutions. First, we randomly select from the Pareto-set two solutions that have respectively and clusters. We then randomly choose a crossover point ; as the number of clusters may vary, must satisfy .

    The first resulting clustering solution from the crossover is composed of cluster centers from to of the solutions with cluster centers, and + to of the second clustering solution. The second resulting clustering solution is composed of the cluster centers to from the first solution and of cluster centers to from the second solution. Figure 7 explains the process of crossover of two clustering solutions.

    Figure 7: Crossover of two clustering solutions. The figure on the left represents the prototypes and the one on the right represents the topological clusters. The squares represent the prototypes and the circles are the data points. The data points are shown for illustration only; during the clustering process, no data point is kept in memory.
  • Mutation: We use the random resetting mutation operator mitchell1998introduction to change randomly some values of a cluster center of a clustering solution to explore global solutions. We select a clustering solution from the Pareto-set, then from each cluster center in , we randomly select position values. For a value , a number is generated and the value is updated as follows:

    The ’+’ or ’-’ signs occur with equal probability. Figure 8 illustrates the process of mutation of a clustering solution.

    Figure 8: Mutation of a clustering solution. = 50%

Both operators are applied during idle times on the solutions from the Pareto-set. The solutions are selected based on their fitness score, equal to (). We select clustering solutions and apply the genetic operators.
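
The two operators can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the crossover point is assumed to lie between 1 and the smaller number of clusters, children take a head from one parent and a tail from the other, and the mutation perturbs a value v to v ± δ with δ drawn uniformly from [0, 1). The update rule of the original mutation equation is not reproduced in this text, so that rule is hypothetical.

```python
import random
from typing import List, Tuple

Solution = List[List[float]]  # a list of cluster prototypes


def crossover(parent_a: Solution, parent_b: Solution,
              rng: random.Random) -> Tuple[Solution, Solution]:
    """Single-point crossover on the lists of cluster centers."""
    point = rng.randint(1, min(len(parent_a), len(parent_b)))
    child_1 = [c[:] for c in parent_a[:point]] + [c[:] for c in parent_b[point:]]
    child_2 = [c[:] for c in parent_b[:point]] + [c[:] for c in parent_a[point:]]
    return child_1, child_2


def mutate(solution: Solution, rate: float, rng: random.Random) -> Solution:
    """Perturb a fraction `rate` of the coordinates of every cluster center."""
    mutated = [center[:] for center in solution]
    for center in mutated:
        n_positions = max(1, round(rate * len(center)))
        for i in rng.sample(range(len(center)), n_positions):
            delta = rng.random()                        # assumed: delta in [0, 1)
            sign = 1.0 if rng.random() < 0.5 else -1.0  # '+' and '-' equally likely
            center[i] += sign * delta
    return mutated


rng = random.Random(0)
parent_a = [[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]]
parent_b = [[1.0, 1.0], [6.0, 4.0]]
child_1, child_2 = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rate=0.5, rng=rng)
```

Both functions copy the prototypes before modifying them, so the parents stored in the Pareto-set are left untouched and each operation runs in time linear in the chromosome length.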

3.3 Objective Functions

One of the important aspects of MOC is the choice of suitable objective functions that are to be optimized simultaneously. For each clustering solution, several quality measures exist. The goal is to have distinct clusters (separateness) that are as dense as possible in terms of the data points they contain (compactness). To satisfy these requirements, we introduce two objective functions and . The combination of both objective functions allows us to have arbitrarily shaped clusters.

  • Compactness: the compactness of a clustering solution reflects the overall intra-cluster size of the data and has to be minimized. The compactness of a clustering solution in a streaming context is computed as follows:

    (3)

    Where is the current window and is the index of the cluster where belongs. is the euclidean distance between the data point and . is the decay factor that decreases over time to give more importance to most recent data. The points of the previous windows are not kept, has been computed in the previous window with the previous prototype .

  • Separateness: the separateness of a clustering solution is the mean distance between clusters. It reflects how well the clusters are separated and should be maximized. The separateness of a cluster is the shortest distance between a data point in this cluster and another data point of its neighborhood belonging to another cluster. In a streaming context, the separateness is computed as follows:

    (4)

    Where is the neighborhood of the data point belonging to the cluster . The neighborhood of a node is determined through the AntTree method. The neighborhood of a cluster is the directly connected nodes to this one on the tree.
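
As a rough illustration of how compactness can be maintained without storing past points, the sketch below assumes a recursive form: the value from the previous window is multiplied by the decay factor, and the distances of the current window's points to their closest prototypes are added. Equation (3) is not reproduced in this text, so this form and the function names are assumptions.

```python
import math
from typing import List, Sequence


def nearest(point: Sequence[float], prototypes: List[Sequence[float]]) -> int:
    """Index of the prototype closest to the point (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda k: math.dist(point, prototypes[k]))


def update_compactness(previous: float, window: List[Sequence[float]],
                       prototypes: List[Sequence[float]], decay: float = 0.7) -> float:
    """Assumed update: decay * previous value + distances within the current window."""
    current = sum(math.dist(x, prototypes[nearest(x, prototypes)]) for x in window)
    return decay * previous + current


# Hypothetical prototypes and window of incoming points.
prototypes = [[0.0, 0.0], [10.0, 0.0]]
window = [[1.0, 0.0], [9.0, 0.0]]
compactness = update_compactness(previous=10.0, window=window, prototypes=prototypes)
```

Only the previous scalar value and the current window are needed, which matches the one-pass constraint of the streaming setting.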

3.4 Solution Selection

At the end of the online phase, a set of non-dominated solutions is stored in the Pareto-set. These non-dominated solutions are mathematically equally good. We use an internal quality measure, the Davies-Bouldin index (DBI) davies1979cluster, to select the best solution from the Pareto-set. An internal index is chosen because the data might not be labeled. We sort all the solutions by their fitness (internal measure values) and choose the best one as the output of the algorithm. The Davies-Bouldin index helps identify sets of clusters that are compact and well separated. It is calculated as:

(5)

is the distance between the data point , and its cluster and K is the number of clusters. DBI varies between 0 (best clustering) and +∞ (worst clustering).
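
For reference, the sketch below computes the Davies-Bouldin index in its standard textbook form: the average, over clusters, of the worst ratio between the summed within-cluster scatters and the distance between the two centroids. The paper's own notation for Equation (5) is not reproduced in this text, so the variable names are assumptions, and the example takes a batch of labeled points as input purely for illustration.

```python
import math
from typing import List, Sequence


def davies_bouldin(points: List[Sequence[float]], labels: List[int],
                   centroids: List[Sequence[float]]) -> float:
    """Standard DBI; assumes at least two clusters, none of them empty."""
    k = len(centroids)
    scatter = []
    for i in range(k):
        members = [p for p, label in zip(points, labels) if label == i]
        scatter.append(sum(math.dist(p, centroids[i]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k


# Hypothetical data: two tight clusters that are far apart.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
labels = [0, 0, 1, 1]
centroids = [[1.0, 0.0], [11.0, 0.0]]
dbi = davies_bouldin(points, labels, centroids)
```

Each cluster has an average scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2; lower values indicate more compact and better separated clusters.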

3.5 Improved MOC-Stream Algorithm

IMOC-Stream is an extension of multi-objective clustering for data streams that optimizes computation time and memory allocation. It starts by creating a first clustering solution using the AntTree algorithm. In contrast to the original algorithm, where all the data points are stored, we introduce a new tree aggregation method that stores only a synopsis of the data. The clustering solution is encoded and combined with solutions obtained by different algorithms to create a population of solutions. The objective function values are computed for each solution, and only the non-dominated solutions are added to the Pareto-set. Then, we apply crossover and mutation to the best solutions selected from the Pareto-set and add the obtained solutions to the population. For each time window, the next point from the stream is mapped into the tree, the prototypes are computed, and only the aggregated tree is stored. We update the weights of the nodes and remove the outdated ones. If the stream idles, we apply genetic operators to generate more solutions. At the end of each time window, we compute the objective function values, select the non-dominated solutions, and update the Pareto-set. In the offline phase, we compute the Davies-Bouldin index of each potential solution and select the optimal one as the output of the algorithm. In summary, IMOC-Stream is described in Algorithm 2.

Result: Optimal clustering solution
From the first window: initialize the tree using the AntTree algorithm and perform tree aggregation following Figure 4;
Generate several clustering solutions using K-means with different K, GNG, and DBScan with different parameters;
Encode the clustering solutions following the scheme in Figure 5;
Apply Crossover and Mutation following Figures 7 and 8 respectively. Add the new solutions to the population of chromosomes;
Compute the objective functions following Equations (3) and (4). Store the non-dominated solutions in the Pareto-set;
while there is data available do
       Map each point into the tree and compute prototypes following Equation (1);
       For each clustering solution in the Pareto-set, assign each point to the closest cluster;
       Update each cluster in each clustering solution using the newly assigned points as described in Equation (1);
       Update the weights of the nodes following Equation (2) and remove the outdated nodes;
       Compute the objective functions of the clustering solutions in the Pareto-set and of the newly generated solutions. Update the Pareto-set with the new non-dominated solutions;
       while idle do
              Select the best clustering solutions from the Pareto-set based on their objective values;
              Apply Crossover and Mutation following Figures 7 and 8 respectively. Add the new solutions to the population of clustering solutions;
       end while
end while
Select the best solution among the Pareto-set solutions as described in Section 3.4
Algorithm 2 Improved MOC-Stream Algorithm

4 Experiments and Discussion

4.1 Datasets and Quality Criteria

The IMOC-Stream method described in this article was implemented in the Scala programming language and will be available on the Clustering4Ever GitHub repository (https://github.com/Clustering4Ever/Clustering4Ever). We evaluated the clustering quality of IMOC-Stream on several real uci and synthetic (https://www.sites.google.com/site/nonstationaryarchive/) datasets. We describe the datasets in Table 3. The mutation rate is set to 20% and the number of clustering solutions selected for crossover and mutation is set to .

Dataset       Instances   Features  Classes  Window size
powersupply   29,928      2         24       100
HyperPlan     100,0000    10        5        1000
Covertype     581,102     10        23       1000
Sensor        2,219,802   4         54       10000
1CDT          16000       2         2        100
1CSurr        55280       2         2        1000
4CR           144000      2         1        1000
GEARS-2C-2D   200000      2         2        10000
Table 3: Description of datasets used in experimentation

For the quality measures, we used the Normalized Mutual Information (NMI) strehl2002cluster and the Adjusted Rand index (ARAND) hubert1985comparing. Both measures require the ground truth of the data to be available. NMI provides a measure that is independent of the number of clusters, in contrast to purity. It reaches its maximum value of 1 only when the two sets of labels have a perfect one-to-one correspondence. The NMI of a clustering solution is calculated as follows:

(6)

Where are true labels and are labels predicted by the algorithm. and is the entropy of the partition calculated as follows:

(7)

Where is the number of points assigned to the partition . ARAND index is a measure of agreement between two partitions: one given by the clustering process and the other defined by external criteria. Given the following contingency matrix, where represents the number of points assigned to both clusters and of partitions and :

Table 4: Contingency matrix between two partitions and of and clusters respectively.

The Adjusted RAND index is calculated as follows:

(8)
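
The two external measures can be computed from label counts alone, as in the sketch below. Equations (6) to (8) are not reproduced in this text; the code assumes the common square-root normalization of mutual information by the two entropies for NMI, and the usual pair-counting form of the Adjusted Rand index built from the contingency matrix. Function names are hypothetical, and at least two points are assumed.

```python
import math
from collections import Counter
from typing import Hashable, List


def entropy(labels: List[Hashable]) -> float:
    """Shannon entropy of a partition, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true: List[Hashable], pred: List[Hashable]) -> float:
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed normalization)."""
    n = len(true)
    joint, ct, cp = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum((nij / n) * math.log(nij * n / (ct[a] * cp[b]))
             for (a, b), nij in joint.items())
    denom = math.sqrt(entropy(true) * entropy(pred))
    return mi / denom if denom > 0 else 0.0


def adjusted_rand(true: List[Hashable], pred: List[Hashable]) -> float:
    """Adjusted Rand index from the contingency counts of the two partitions."""
    sum_ij = sum(math.comb(v, 2) for v in Counter(zip(true, pred)).values())
    sum_a = sum(math.comb(v, 2) for v in Counter(true).values())
    sum_b = sum(math.comb(v, 2) for v in Counter(pred).values())
    expected = sum_a * sum_b / math.comb(len(true), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
```

Both measures are invariant to renaming the clusters: `same_up_to_renaming` scores 1 on both, while `unrelated` gets an NMI of 0 and an Adjusted Rand index of -0.5.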

4.2 Experimental Settings

Assuming large high-dimensional data arrives as a continuous stream, IMOC-Stream divides the streaming data into batches and processes each batch continuously. The batch size depends on the available memory and the size of the original dataset; the size of the window for each dataset is shown in Table 3. We set the time interval between two batches to 1 second and the decay factor to 0.7.
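
The batching step can be sketched with a small generator that cuts any iterable stream into fixed-size windows without loading it entirely into memory. The function name `windows` is hypothetical, and the 1-second pause between batches is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def windows(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items from the stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical stream of five points cut into windows of two.
batches = list(windows(range(5), 2))
```

The last window may be shorter than the others when the stream length is not a multiple of the window size.
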
To show the effectiveness of our method, we compared it to five well-known stream algorithms: StreamKM++ de2011extending, DStream tu2009stream, DBStream hahsler2016clustering, DenStream cao2006density and CluStream aggarwal2003framework, from the R package streamMOA (https://github.com/mhahsler/streamMOA). We repeated our experiments with different initializations and chose those giving the best results. Table 5 shows the optimal parameter configurations.

Algorithms   Parameters   Initialization
DStream      0.9    0.001   1000   3
DBStream     1.8    0.001   1000   2.5
DenStream    0.4    1.605   0.275
CluStream    2
Table 5: Optimal parameter configurations for the algorithms used for the experimentation. For IMOC-Stream, the decay factor is fixed to 0.7.

4.3 Clustering Evaluation

The results of IMOC-Stream on the datasets described above, compared to the different algorithms, are reported in Tables 6 and 7. Each NMI and ARAND value is the average of ten runs. IMOC-Stream gives better results than the other methods in almost all cases; the only exception on the real datasets is the ARAND on HyperPlan, where DStream scores higher. These results are due to the fact that our method optimizes two objective functions to maximize intra-cluster similarity and minimize inter-cluster similarity at the same time, which gives compact and well-separated clusters. The use of different algorithms to create a population of solutions allows IMOC-Stream to explore better solutions and escape local minima. Another critical point is the use of the genetic operators to combine the best solutions and explore the potential local solutions. The other algorithms are sensitive to the initialization of their settings, which helps explain why our algorithm yields better results, since it does not require such input parameters. Finally, we noticed that the DStream algorithm gives better results than the other baseline algorithms used in this experimentation, since it is adapted to large datasets.

For synthetic datasets, IMOC-Stream also gave better results than the different stream algorithms, except for StreamKM++ on the 1CSurr dataset (where CluStream also reaches a higher ARAND). These results are due to the optimal choice of the number of clusters K for StreamKM++, which makes it find the exact number of clusters with synthetic datasets and gives better results. The number of clusters is not predefined in IMOC-Stream, but it still manages to find approximately the right number of clusters.

Dataset      Metrics  IMOC-Stream    StreamKM++     DStream        DBStream        DenStream      CluStream
powersupply  NMI      0.466 ± 0.03   0.232 ± 0.05   0.403 ± 0.06   0.056 ± 0.01    0.055 ± 0.01   0.196 ± 0.05
             ARAND    0.144 ± 0.03   0.034 ± 0.01   0.049 ± 0.01   0.001 ± 0.00    0.002 ± 0.00   0.032 ± 0.01
Sensor       NMI      0.723 ± 0.00   0.151 ± 0.03   0.274 ± 0.07   0.060 ± 0.01    0.032 ± 0.00   0.024 ± 0.00
             ARAND    0.192 ± 0.00   0.074 ± 0.02   0.034 ± 0.01   0.006 ± 0.00    0.032 ± 0.00   0.006 ± 0.00
Covertype    NMI      0.509 ± 0.03   0.113 ± 0.03   0.310 ± 0.06   0.048 ± 0.001   0.482 ± 0.12   0.295 ± 0.07
             ARAND    0.433 ± 0.10   0.165 ± 0.02   0.254 ± 0.08   0.002 ± 0.003   0.198 ± 0.06   0.339 ± 0.11
HyperPlan    NMI      0.168 ± 0.01   0.026 ± 0.00   0.140 ± 0.03   0.002 ± 0.00    0.026 ± 0.00   0.014 ± 0.01
             ARAND    0.041 ± 0.00   0.035 ± 0.00   0.093 ± 0.02   0.001 ± 0.00    0.027 ± 0.00   0.019 ± 0.00
Table 6: Comparing IMOC-Stream with different algorithms on real datasets. The first value is the average of 10 repetitions and the value after ± is the standard deviation.

Dataset      Metrics  IMOC-Stream    StreamKM++     DStream        DBStream        DenStream      CluStream
1CDT         NMI      0.990 ± 0.07   0.759 ± 0.03   0.691 ± 0.10   0.631 ± 0.28    0.208 ± 0.05   0.621 ± 0.06
             ARAND    0.970 ± 0.09   0.679 ± 0.02   0.667 ± 0.14   0.610 ± 0.30    0.086 ± 0.05   0.583 ± 0.09
1CSURR       NMI      0.481 ± 0.00   0.534 ± 0.12   0.136 ± 0.17   0.031 ± 0.02    0.150 ± 0.05   0.409 ± 0.1
             ARAND    0.248 ± 0.01   0.529 ± 0.17   0.041 ± 0.19   0.02 ± 0.07     0.017 ± 0.07   0.384 ± 0.11
4CR          NMI      0.957 ± 0.00   0.705 ± 0.01   0.804 ± 0.02   0.868 ± 0.03    0.183 ± 0.03   0.502 ± 0.03
             ARAND    0.954 ± 0.00   0.497 ± 0.02   0.793 ± 0.03   0.881 ± 0.02    0.006 ± 0.00   0.408 ± 0.02
GEARS_2C_2D  NMI      0.654 ± 0.02   0.543 ± 0.03   0.160 ± 0.12   0.001 ± 0.00    0.021 ± 0.02   0.301 ± 0.02
             ARAND    0.643 ± 0.01   0.449 ± 0.03   0.154 ± 0.17   0.0001 ± 0.00   0.010 ± 0.01   0.219 ± 0.02
Table 7: Comparing IMOC-Stream with different algorithms on synthetic datasets. The first value is the average of 10 repetitions and the value after ± is the standard deviation.

4.4 Clustering High Dimensional Data

The curse of dimensionality refers to various phenomena that arise when clustering data in high-dimensional spaces, and most clustering algorithms suffer from it. This is due to many factors, such as the high number of parameters to set or the algorithm’s high complexity. To demonstrate our method’s effectiveness on clustering high-dimensional datasets (HDDs), we tested it on 6 HDDs from DIMsets. The dimensions (number of features) of these datasets vary from 32 to 1024 while the number of instances is and the number of classes equal to . We compared our results with different stream algorithms based on the NMI and ARAND measures. The results are reported in Table 8. They show that our method outperforms all the other methods in terms of NMI and ARAND. We attribute this to the fact that our algorithm does not require parameter settings and uses linear genetic functions to enhance the quality, unlike the other algorithms. The predefined parameters and the use of costly processes and algorithms (like DBSCAN for DBStream and DenStream) make the other algorithms slower and less robust when dealing with HDDs. The genetic operators and the update of the solutions in our method run in linear time, so they add little to the computation time.

Dataset   Metrics  IMOC-Stream    StreamKM++     DStream        DBStream       DenStream      CluStream
dim032    NMI      0.600 ± 0.01   0.500 ± 0.04   0.051 ± 0.03   0.421 ± 0.06   0.062 ± 0.02   0.211 ± 0.02
          ARAND    0.227 ± 0.01   0.199 ± 0.04   0.021 ± 0.01   0.139 ± 0.02   0.003 ± 0.00   0.0215 ± 0.01
dim064    NMI      0.675 ± 0.01   0.546 ± 0.00   0.037 ± 0.01   0.522 ± 0.02   0.104 ± 0.04   0.184 ± 0.02
          ARAND    0.380 ± 0.00   0.252 ± 0.01   0.005 ± 0.04   0.339 ± 0.01   0.017 ± 0.01   0.012 ± 0.01
dim128    NMI      0.691 ± 0.01   0.571 ± 0.04   0.136 ± 0.01   0.531 ± 0.02   0.147 ± 0.05   0.191 ± 0.01
          ARAND    0.418 ± 0.02   0.386 ± 0.12   0.056 ± 0.02   0.321 ± 0.01   0.090 ± 0.03   0.003 ± 0.01
dim256    NMI      0.777 ± 0.01   0.575 ± 0.07   0.078 ± 0.01   0.391 ± 0.02   0.147 ± 0.03   0.171 ± 0.00
          ARAND    0.487 ± 0.00   0.377 ± 0.16   0.055 ± 0.03   0.265 ± 0.01   0.045 ± 0.01   0.004 ± 0.02
dim512    NMI      0.788 ± 0.01   0.606 ± 0.04   0.115 ± 0.01   0.329 ± 0.10   0.112 ± 0.12   0.145 ± 0.01
          ARAND    0.540 ± 0.00   0.538 ± 0.03   0.073 ± 0.02   0.478 ± 0.12   0.045 ± 0.09   0.002 ± 0.02
dim1024   NMI      0.855 ± 0.00   0.774 ± 0.02   0.112 ± 0.03   0.305 ± 0.09   0.110 ± 0.05   0.150 ± 0.00
          ARAND    0.717 ± 0.01   0.634 ± 0.02   0.020 ± 0.03   0.414 ± 0.08   0.041 ± 0.07   0.005 ± 0.01
Table 8: Comparing IMOC-Stream with different algorithms on HDD datasets. The first value is the average of 10 repetitions and the value after ± is the standard deviation.

4.5 Clustering Evolution

Figure 9 shows an example of the evolution of IMOC-Stream clustering on the 1CDT, 4CE-V1, 1CH and 2CDT datasets. Each line represents the evolution of the clustering for a particular dataset. These figures were generated during the clustering process; we picked three partitionings at random iterations for each dataset. For each time window, the distribution of the incoming data points changes. Thanks to its multi-objective capability and the use of the fading function, IMOC-Stream manages to recognize the structures of the data stream and to separate them clearly. It can also detect arbitrarily shaped, compact, and well-separated clusters. We note that the number of clusters does not necessarily stay the same, but the best one is chosen automatically.

Figure 9: Clustering evolution of 1CDT, 4CE-V1, 1CH, 2CDT datasets. Each color represents a cluster. Each line represents the evolution of a clustering with one dataset.

4.6 Arbitrary Shaped Clusters

Figure 10 shows the cluster detection for the t4.9k, Compound, and Path-based datasets (http://cs.joensuu.fi/sipu/datasets/). We can see from this figure that our method manages to find clusters of arbitrary shapes and provides a good separation of the clusters. The other streaming clustering methods are unable to find clusters of arbitrary shapes (only spherical clusters can be found). The IMOC-Stream method is also able to find noise points thanks to the use of a density clustering method (DBSCAN).

Figure 10: Examples of detection of arbitrarily shaped clusters by the IMOC algorithm.

4.7 Time and Memory Complexity

In this section, we analyze our method to show its improvement over Ant-Tree in terms of memory allocation, and over stream algorithms in terms of execution time. Figure 11 shows the execution times of IMOC-Stream and the other stream clustering algorithms. DBStream has the shortest execution time among the compared algorithms, yet our method is faster than all of them despite its evolutionary nature. IMOC-Stream also outperforms the other algorithms in most of the quality results shown above. Taken together, these results indicate that IMOC-Stream is non-dominated across all datasets: no other algorithm produces better results within equal or less time. We attribute this to the fact that our method uses idle times to improve the clustering solutions and relies on genetic functions with linear complexity instead of the costly functions used by the other algorithms.

Figure 11: Execution time in milliseconds of each algorithm for every dataset.

Figure 12 compares the memory allocation of our method and the Ant-tree algorithm. We observe that IMOC-Stream requires less memory than Ant-tree on all the datasets. This is because Ant-tree stores all the data points, making the complexity approximately , whereas IMOC-Stream stores only the synopsis, which is equal to ; when we add the other algorithms’ solutions, the memory complexity becomes , where m is the number of clustering solutions.

Figure 12: Memory allocation in kilobytes of IMOC-Stream and Ant-tree for every dataset.

5 Conclusion and Future Work

This paper presents IMOC-Stream, a new method for clustering data streams based on a multi-objective algorithm. Unlike single-objective clustering techniques, which employ only one objective function, IMOC-Stream employs two objective functions to find arbitrarily shaped clusters and enhance the clustering quality. IMOC-Stream uses a two-phase process: 1) online phase: creating several clustering solutions based on different algorithms and genetic operators; 2) offline phase: constructing an optimal partition from the discovered clusters. We applied our method to large stream datasets and compared it to several stream clustering algorithms. The experiments show the effectiveness of IMOC-Stream for detecting arbitrarily shaped, compact, and well-separated clusters with better execution time. Part of our future work will be to propose a parallel solution to reduce the execution time. Further research needs to be conducted on incorporating the Ant Colony algorithm, since its independent agents make it well suited to parallel execution. More experimentation needs to be conducted using Spark Streaming to test our method on a real stream. We also plan to conduct more studies and experiments using different standard optimization algorithms to improve the convergence rate.

Reproducibility

To facilitate further experiments and reproducible research, we provide our contributions through an open-source API that contains several clustering algorithms, including S2G-Stream (local and global version) and the 2S-SOM implemented in Spark/Scala, together with the API documentation, at Clustering4Ever (https://github.com/Clustering4Ever/Clustering4Ever).

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

A two-page paper from this work has been published as a poster in attaoui2020moc. The present paper is not only a longer version of that work but also introduces an improvement over it.

References