1 Introduction
Swarm Intelligence, or distributed intelligence, is the collective behavior of decentralized, self-organizing natural or artificial systems nayyar2018advances. It has attracted a lot of interest from researchers over the last two decades due to its dynamic and flexible nature and its high efficiency in solving nonlinear real-world problems nayyar2018introduction.
AntTree azzag2003anttree is a hierarchical clustering method that models how ants form living structures and uses this behavior to organize data into a tree built in a distributed manner. Intuitively, each ant/data point starts at a fixed support (the tree root). The behavior of the ants then consists either in moving over the structure or in clinging to it to extend it and allow other ants to attach in their turn. This behavior is driven in particular by the similarity between the data and the local structure of the tree. The result is a tree-like organization of the data whose properties allow us to determine a classification automatically and to obtain a visual overview of the tree.
In this work, the AntTree algorithm is adapted to deal with data streams, a kind of data that evolves and arrives in an unbounded stream. Analyzing a data stream implies time and space constraints: the process of data stream clustering consists of creating compact and well-separated partitions from dynamic streaming data in a single scan, using limited time and memory.
Most clustering techniques follow a single objective function. However, each objective function represents a different property of the clusters, such as compactness or separateness. When an algorithm assumes a homogeneous similarity measure over the entire data set, it is not robust to variations in cluster shape, size, dimensionality, and other characteristics handl2007evolutionary. Multi-Objective Clustering (MOC) methods law2004multiobjective retrieve clusters by applying two or more objective functions. They use a two-step process: 1) generate multiple clustering solutions and store the optimal ones; 2) construct an optimal partition based on the Pareto-set solutions. The following definitions are useful to understand MOC methods:
1.1 Definitions

Dominated solutions: a solution x1 is said to dominate a solution x2 if fi(x1) ≤ fi(x2) for all i = 1, …, m, and there exists j such that fj(x1) < fj(x2), where m is the number of objective functions and fi is the i-th objective function (assuming minimization).

Pareto-optimal solutions: a solution is called Pareto-optimal if it is not dominated by any other feasible solution. The set of non-dominated solutions is called the Pareto set.

Idle times: in the case of a slow stream, time delays between data points can appear, i.e., periods where no data point is available. Traditional algorithms stop and wait for new data points to process. Figure 1 illustrates the concept of idle times.
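To make the dominance and Pareto-set definitions concrete, the following minimal Python sketch (assuming all objectives are minimized; the function names are ours, not from the paper) filters a set of objective vectors down to its Pareto set:

```python
def dominates(a, b):
    """True if solution a dominates b: no worse on every objective
    (minimization) and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(solutions):
    """Keep only the non-dominated solutions."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

For example, among the objective vectors (1, 5), (2, 2), (3, 1), and (4, 4), only the last is dominated (by (2, 2)); the first three form the Pareto set.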
Main Contributions
In our previous work attaoui2020moc, we proposed a multi-objective stream clustering method. This method applies genetic operators in every iteration to improve the solutions, which is computationally costly. A second drawback of our previous method is that it calculates the distances between all clusters when trying to find the neighborhood of a particular cluster. This paper introduces an improved multi-objective clustering method based on data stream clustering and the AntTree clustering algorithm azzag2003anttree. The method reduces computation time by applying the genetic operators only during idle times instead of in every iteration. It optimizes memory allocation by using the AntTree algorithm to find a cluster's neighborhood, and it introduces a new aggregation approach for the AntTree algorithm that stores only a synopsis of the data instead of all the data points. The method presented in this paper has the following merits compared to other multi-objective and single-objective data stream clustering methods:

It uses the AntTree algorithm to give our method a hierarchical structure and make it easier to determine the clusters' neighborhoods. It also reduces memory allocation by modifying the AntTree algorithm so that it does not store data points.

It reduces computation time by using idle times to apply the genetic operators and enhance clustering quality.

It optimizes two objective functions to obtain high-quality solutions and arbitrarily shaped clusters.

It does not require specifying the number of clusters, and it uses a fading function so that the most recent data are considered more important and better reflect the changes in the data distribution.
This paper is organized as follows: in section 2, we present background methods. In section 3, we describe our method and its main features. In section 4, we present the experiments, report the results obtained, and compare them to well-known clustering methods. Finally, we conclude the paper.
2 Related Works
This section discusses previous work on data stream clustering problems and highlights the most relevant algorithms proposed in the literature. For stream clustering, only one multi-objective clustering method has been proposed: in paul2020online, the authors optimize multiple objectives capturing cluster compactness and feature relevancy. They use an evolutionary technique and optimize multiple objective functions simultaneously to determine the optimal subspace clusters. The generated clusters are allowed to contain overlapping objects.
The closest methods to MOC are Evolutionary algorithms since they use the same encoding of the solutions and the same process with a single objective function.
2.1 AntTree Clustering
The AntTree algorithm azzag2003anttree produces a hierarchical structure in an incremental manner, analogous to how ants join together. In this algorithm, each ant represents a single data point and moves in the structure according to its similarity with the other ants already connected in the tree under construction. The similarity Sim(i, j) between two ants i and j is given by the Euclidean distance between their data points. One should notice that this tree is not strictly equivalent to a dendrogram as used in standard hierarchical clustering techniques: each node in our tree corresponds to one data point, while this is not the case in general for dendrograms, where data correspond only to leaves.
Starting from the support, materialized by a fictitious node a0, the ants progressively fix themselves on this initial point, then successively on the ants attached to it, and so on until all the ants are attached to the structure. During the construction of the structure, each ant ai is either moving on the tree or connected to it. In the first case, ai is free to move to a neighbor of the ant on which it is located (or to the support).
In the second case, ai can no longer be released. Furthermore, each ant has only one outgoing link to another ant and cannot have more than a fixed number of incoming links from other ants (the tree has a bounded number of children per node). Initially, all ants are placed on the support. Each ant has a similarity threshold and a dissimilarity threshold, initially set to 1 and 0, respectively.
An ant ai will connect under an existing node of the tree (ant aj) if it is sufficiently similar to this node but also dissimilar enough to the children of that node: ai thus forms a subclass of aj that is different from the other (possibly already existing) subclasses of aj. Otherwise, ai moves randomly in the tree, looking for another location to fix itself. Each time the ant fails in its attempt to attach to the structure, it is made more tolerant in order to increase its chances of connecting at its next iteration: its similarity threshold is decreased, and its dissimilarity threshold is increased. The particular case of the support is treated as follows: an ant connects to the support a0 if it is sufficiently dissimilar to the other ants already connected directly to a0. This means that a new class has just been created at the highest level of the tree; this class must be as distinct as possible from the other classes already created.
The algorithm ends when all the ants are connected. The subtrees appearing at the first level of the tree, just below the support, are interpreted as the different classes. The properties of the tree can be analyzed visually and interactively (e.g., the classification error decreases as one goes down the tree). It is also possible to transform this tree into a dendrogram (by moving the data placed on internal nodes down to the leaves). The AntTree algorithm is presented in Algorithm 1 for a given ant ai.
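One attachment attempt can be sketched in Python as follows. This is a loose illustration rather than the exact algorithm: the similarity function (here 1/(1 + distance)), the threshold update rate, and all names are our own assumptions:

```python
import math

class Ant:
    """One ant = one data point, with its tolerance thresholds."""
    def __init__(self, data):
        self.data = data
        self.sim_threshold = 1.0     # lowered after each failed attempt
        self.dissim_threshold = 0.0  # raised after each failed attempt
        self.children = []

def similarity(a, b):
    """Map a Euclidean distance to a (0, 1] similarity (our choice)."""
    return 1.0 / (1.0 + math.dist(a.data, b.data))

def try_connect(ant, node, rate=0.1):
    """One attachment attempt: connect `ant` under `node` if it is similar
    enough to the node and dissimilar enough to the node's children;
    otherwise relax the ant's thresholds and report failure."""
    similar_enough = similarity(ant, node) >= ant.sim_threshold
    dissimilar_enough = all(similarity(ant, c) < 1.0 - ant.dissim_threshold
                            for c in node.children)
    if similar_enough and dissimilar_enough:
        node.children.append(ant)
        return True
    ant.sim_threshold -= rate
    ant.dissim_threshold += rate
    return False
```

With a strict initial threshold of 1, a first attempt typically fails and relaxes the thresholds, so the ant succeeds on a later pass, which mirrors the growing tolerance described above.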
2.2 Evolutionary MultiObjective Clustering Methods (MOC)
In the past decade, multi-objective evolutionary algorithms have been heavily used in clustering because of their effectiveness. However, there has been little dedicated effort to review all of these methods; the most prominent effort in this direction can be found in mukhopadhyay2015survey, in which many multi-objective clustering algorithms and techniques are presented. This section surveys the state of the art for a wide range of multi-objective clustering algorithms.
MOCK (handl2007evolutionary), Multi-Objective Clustering with automatic K-determination, consists of two main phases. In its initial clustering phase, MOCK uses a Multi-Objective Evolutionary Algorithm (MOEA) to optimize two complementary clustering objectives. The output of this first phase is a set of mutually non-dominated clustering solutions, each corresponding to a different trade-off between the two objectives. In the second phase, MOCK analyzes the shape of the trade-off curve and compares it to the trade-offs obtained for an appropriate null model (i.e., by clustering random data). Based on this analysis, the algorithm estimates the quality of all individual clustering solutions and determines a set of potentially promising ones. Often, a single solution is preferred, and in these cases the number of clusters inherent to the data set, K, is thus estimated implicitly. Figure 2 illustrates the Pareto set in the MOCK algorithm. An improved version of MOCK has been proposed (garza2017improved), which significantly decreases the computational overhead and reduces the search space.
An Ant Colony Optimization-based clustering method, ACOC (inkaya2015ant), combines connectivity, proximity, density, and distance information with the exploration and exploitation capabilities of ACO in a multi-objective framework. The proposed clustering methodology is capable of handling several challenging issues of the clustering problem, including solution evaluation, extraction of local properties, scalability, and the clustering task itself.
Multi-objective evolutionary algorithm with simultaneous clustering and classification, MOASCC (luo2016learning), uses a clustering process to enhance the performance of classification. To achieve this goal, two objective functions are adopted: a fuzzy clustering connectedness function and the classification error rate. A mutation operator is designed to make use of the feedback from both clustering and classification.
IMCPSO (gong2017improved) proposes an improved multi-objective clustering framework using particle swarm optimization. The authors use the overall deviation and the mean distance between clusters as objective functions. They introduce a clustering method to improve each particle (clustering solution) by finding a topological center, i.e., the point that has the maximum number of neighbors belonging to the same cluster. Figure 3 illustrates the use of topological centers to improve the clustering. Finally, the best particle is selected from the Pareto set based on the sparsity of the solution.
EMOKC (wang2018multi) uses the term bi-objective clustering to describe a MOC method with two objective functions. The method has two main steps: (i) constructing two conflicting objective functions, and (ii) solving the bi-objective optimization problem with an effective EMO (Evolutionary Multi-Objective) algorithm.
Another MOC algorithm, SOMDEAclust (saini2019sophisticated), proposes an efficient automated decomposition-based multi-objective clustering technique, which is a hybridization of Self-Organizing Maps (SOM) and the differential evolution algorithm. Two internal cluster validity indices, namely the Silhouette index (SI) and the PBM (Pakhira-Bandyopadhyay-Maulik) index, are used as objective functions. The SOM algorithm is used to create new solutions based on the neighborhood of each neuron.
AMOGA (dutta2019automatic), Automatic clustering by a Multi-Objective Genetic Algorithm, is a multi-objective clustering algorithm that handles numeric and categorical features. Each clustering solution is encoded as a gene in order to apply genetic operators. The method initializes a population using the K-prototypes algorithm and applies the GA operators crossover and mutation. AMOGA uses compactness and separateness as objective functions and different validity measures (DB index, purity, …) to select the best solution from the Pareto-optimal set.
The Multi-objective Gradient Evolution algorithm (kuo2020multi) extends the Gradient Evolution (GE) algorithm so that it can be applied to multi-objective problems. It applies Pareto ranking assignment to sort the vectors based on their fitness values. K-means is then applied to the Pareto-optimal solutions to obtain the final clustering.
The Combinatorial Multi-Objective Pigeon Optimization algorithm (CMOPIO) (chen2020multi) is based on a bio-inspired algorithm called Pigeon Optimization (PIO). In CMOPIO, pigeons interact only with the pigeons in their neighborhood; the update of a pigeon's position and velocity relies on its neighborhood rather than on the global best position. These improvements allow CMOPIO to identify a variety of Pareto-optimal clustering solutions.
Table 1 compares the multi-objective clustering algorithms.
2.3 Data Stream Clustering Methods
To the best of our knowledge, no multi-objective clustering method for data streams has been proposed. As discussed above, the closest methods to MOC are evolutionary algorithms. evoStream (carnein2018evostream) (Evolutionary Stream Clustering) makes use of an evolutionary algorithm to bridge the gap between the online and offline components. Evolutionary algorithms are inspired by natural evolution, where promising solutions are combined and slightly modified to create offspring, which can yield improved solutions. By iteratively selecting the best solutions, an evolutionary pressure is created that improves the result over time. evoStream uses this concept to iteratively enhance the macro-clusters through recombinations and small variations. Since macro-clusters are created incrementally, the evolutionary steps can be performed while the online component waits for new observations, i.e., when the algorithm would usually idle. As a result, the computational overhead of the offline part is removed, and clusters are available at any time. The online component is similar to DBSTREAM (hahsler2016clustering) but does not maintain a shared density since it is not necessary for reclustering.
evoStream is based on DBSTREAM (hahsler2016clustering) (Density-based Stream Clustering), which uses the shared density between two micro-clusters to decide whether they belong to the same macro-cluster. A new observation x is merged into the micro-clusters whose radius it falls within; the centers of all clusters that absorb the observation are then updated by moving them toward x. If the point is not assigned to any cluster, it is used to initialize a new micro-cluster. Additionally, the algorithm maintains the shared density between two micro-clusters as the density of points in the intersection of their radii, relative to the size of the intersection area. At regular intervals, it removes micro-clusters and shared densities whose weight has decayed below a respective threshold. In the offline component, micro-clusters with high shared density are merged into the same cluster.
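The radius-based insertion step of such an online component can be sketched roughly as follows (our simplification: a fixed radius, a fixed learning rate alpha, and no shared-density bookkeeping; names and parameters are ours, not from the cited papers):

```python
import math

def insert(point, clusters, radius=1.0, alpha=0.1):
    """Merge `point` into every micro-cluster whose center lies within
    `radius`, nudging those centers toward the point; otherwise open a
    new micro-cluster. `clusters` is a list of [center, weight] pairs."""
    absorbed = False
    for mc in clusters:
        center, weight = mc
        if math.dist(point, center) <= radius:
            # move the center a fraction alpha of the way toward the point
            mc[0] = tuple(c + alpha * (p - c) for c, p in zip(center, point))
            mc[1] = weight + 1
            absorbed = True
    if not absorbed:
        clusters.append([tuple(point), 1])
    return clusters
```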
evoStream was used in (supardi2020evolutionary) to detect outliers in a data stream. The goal of that method is to treat distinct data objects as an outlier-detection problem rather than a categorization problem.
HDCStream (amini2014fast) (hybrid density-based clustering for data streams) first combined grid-based algorithms with the concept of distance-based algorithms. In particular, it maintains a grid where dense cells can become micro-clusters as known from distance-based algorithms. Each observation in the stream is assigned to its closest micro-cluster if it lies within a radius threshold; otherwise, it is inserted into the grid instead. Once a grid cell has accumulated sufficient density, its points are used to initialize a new micro-cluster, and the cell is no longer maintained, as its information has been transferred to the micro-cluster. At regular intervals, all micro-clusters and cells are evaluated and removed if their density has decayed below a respective threshold. Whenever a clustering request arrives, the micro-clusters are considered virtual points to which DBSCAN (ester1996density) is applied. The algorithm consists of three steps: (1) Merging or mapping: the new data point is added to an existing mini-cluster or mapped to the grid. (2) Pruning grids and mini-clusters: the grid cells, as well as the mini-cluster weights, are periodically checked at pruning time; the periods are defined based on the minimum time for a mini-cluster to be converted to an outlier, and mini-clusters with weights less than a threshold are discarded. (3) Forming final clusters: final clusters are created based on the mini-clusters remaining after pruning. Each mini-cluster is clustered as a virtual point using a modified DBSCAN.
FlockStream kennedy2006swarm is a bio-inspired algorithm for clustering data streams that simulates the behavior of a group of birds in flight. Boid is the abbreviation of bird-oid (i.e., bird-like). The boids interact and follow certain rules:

cohesion: to form a group, the boids move closer to each other

separation: two boids cannot be in the same place at the same time

alignment: to stay grouped, the boids try to follow the same path
FlockStream uses agents to mimic the behavior of boids. Each point is associated with an agent. An agent can be of three types: basic, p-representative (potential micro-cluster), or o-representative (outlier micro-cluster, which can become p-representative if, by adding points, its weight exceeds a threshold). In the initialization phase, a set of basic agents is deployed in the space; agents with high similarity approach each other (cohesion) and form a cluster, while the other agents separate. The Euclidean distance is used to calculate the dissimilarity between agents. Agents can leave one group to join another with more similar agents. At the end of this phase, a summary of each cluster is computed, and the two other types of agents appear: p-representative and o-representative. In the second step, a batch of the data stream is inserted. In this phase, the agents are updated as follows:

if an o-representative or p-representative agent meets another representative and their distance is less than a threshold, they join to form a swarm (cluster)

if a basic agent A meets a representative R and their distance is lower than a threshold, A is absorbed by R

if a basic agent meets another basic agent and their dissimilarity is less than a threshold, they join to form an o-representative
We list the limitations and the merits of each algorithm in the following:

evoStream: Merits: uses idle times to improve the clustering quality; outputs clusters at any time; detects outliers. Limitations: requires setting the number of clusters; not suitable for high-dimensional data.

DBSTREAM: Merits: uses the shared density between clusters to determine whether two clusters can be merged; robust to noise. Limitations: several parameters need to be set; depends on the insertion order of the data points.

HDCStream: Merits: handles outliers; improves computation time and quality. Limitations: unable to detect varying levels of density; cannot handle high-dimensional data.

FlockStream: Merits: detects outliers; lower time complexity. Limitations: unable to handle high-dimensional data.
3 Proposed Method
In this section, we introduce IMOCStream, an improved multi-objective AntTree-based method for clustering data streams. The algorithm is based on AntTree clustering and combines stream clustering and multi-objective clustering to create a multi-objective stream clustering algorithm that satisfies two objective functions. It makes use of the hierarchical nature of AntTree to improve the clustering quality. We describe the main properties of IMOCStream in the following sections.
3.1 Clustering in a Streaming Context
We assume that the data stream consists of a (potentially infinite) sequence of elements x1, x2, …, arriving at times t1, t2, …, with t1 < t2 < …. Since the most recent data points are more important and better reflect the changes in the data distribution, we use temporal windows to consider only recent data for the clustering. A set of clustering solutions is generated and updated for each window, where each clustering solution is represented by K clusters C1, …, CK. Each cluster Ck is represented by a prototype ck in R^d, where d is the dimension of the data. Each cluster is associated with a weight that decreases over time following a fading function.
When the first batch of data arrives in the first time window, we create the tree as a clustering solution as described in section 2.1, and this solution is stored in the Pareto set. From the same batch of data, we initialize several solutions using K-means pelleg2000x with different values of K, GNG fritzke1995growing, and DBSCAN ester1996density. The parameter settings of these algorithms are reported in Table 2. The generated solutions are combined with the tree solution by the mutation and crossover operators, and the results are added to the solution set. We compute the objective function values for each solution and store the non-dominated solutions in the Pareto set.
Algorithm | Parameters | Source
GNG | = 30 | Smile Package (https://haifengl.github.io)
DBSCAN | | Smile Package (https://haifengl.github.io)
K-means | K varies from 2 to 15 | Clustering4Ever (https://github.com/Clustering4Ever)
AntTree | | Clustering4Ever (https://github.com/Clustering4Ever)
After the initialization phase, for each new window of data points, each point in the current window is assigned to the closest center in each clustering solution of the Pareto set. The distance between the data points and the centers is the Euclidean distance. Note that within a given clustering solution, a point can be assigned to only one cluster. After all points are assigned, we update the clustering solutions with the newly assigned points, compute the objective function values, and update the Pareto set. If the system idles, the method combines the solutions in the Pareto set using the genetic operators and calculates the objective values of the newly generated solutions. At the end of each iteration, the Pareto set contains a set of non-dominated solutions. At the end of the process, a set of non-dominated clustering solutions is stored. These solutions are equally good mathematically; we use the internal quality measure of Davies and Bouldin davies1979cluster to select the best solution from the Pareto set.
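The per-window assignment step can be sketched as follows (a simplified illustration; here a `solution` is just a list of prototypes, and the function name is ours):

```python
import math

def assign_window(window, solution):
    """Assign each point of the current window to its closest prototype
    and return per-cluster point lists (one entry per prototype)."""
    groups = [[] for _ in solution]
    for x in window:
        # index of the nearest prototype under Euclidean distance
        k = min(range(len(solution)), key=lambda i: math.dist(x, solution[i]))
        groups[k].append(x)
    return groups
```

Each point lands in exactly one group per solution, matching the single-assignment rule above.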
3.1.1 AntTree with Tree Aggregation
To deal with the memory constraints encountered when analyzing data streams, we propose a new representation of the tree to prevent storing all the data points and to reduce the memory allocation. The tree is initialized from the data points in the first window following the AntTree algorithm described in section 2.1. After placing all the points, we compute the prototypes of each cluster as the average of the points assigned to this cluster. All the points are discarded, and only the tree with the prototypes is stored in the memory. For the next windows, we assign the new data points to each cluster and update the prototype as follows:
(1)  c = (λ · n · c_prev + n_w · c_w) / (λ · n + n_w)

where c_prev is the previous prototype, n is the number of points previously assigned to the cluster, c_w is the prototype computed only from the current window, and n_w is the number of points assigned to the cluster in the current window. λ ∈ [0, 1] is the decay factor that decreases the importance of older data over time: if λ = 1, all data are used from the beginning; if λ = 0, only the most recent data are used.
If a point is not assigned to a cluster, it becomes a prototype of a newly created cluster. Figure 4 illustrates tree representation and aggregation.
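As an illustration, a λ-weighted merge of the stored prototype with the current window's prototype might look like this (a sketch under our reading of the update rule; the function and parameter names are ours):

```python
def update_prototype(prev_proto, n_prev, win_proto, n_win, lam=0.9):
    """Weighted merge of the stored prototype with the prototype of the
    current window; lam is the decay factor (1.0 = weigh all history,
    0.0 = keep only the current window)."""
    denom = lam * n_prev + n_win
    if denom == 0:
        return prev_proto
    return tuple((lam * n_prev * p + n_win * w) / denom
                 for p, w in zip(prev_proto, win_proto))
```

With lam = 1 the result is the exact running mean over all points seen so far; with lam = 0 the history is discarded entirely.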
3.1.2 Fading Function
Most data stream algorithms consider the most recent data as more important, since they better reflect the changes in the data distribution. For that purpose, we use a function in which the weight of each cluster decreases exponentially with time by introducing a decay factor λ > 0:

(2)  w(t) = n · 2^(−λ·t)

where n is the number of points assigned to the cluster at the current time t. If the weight of a node falls below a threshold value, the cluster is considered outdated and removed.
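Assuming the common exponential form w(t) = n · 2^(−λt) (our assumption; the paper's exact expression may differ), the fading weight and the pruning test reduce to:

```python
def fading_weight(n_points, elapsed, lam=0.1):
    """Exponentially decayed cluster weight: halves every 1/lam steps."""
    return n_points * 2 ** (-lam * elapsed)

def is_outdated(n_points, elapsed, threshold=0.5, lam=0.1):
    """A cluster is outdated once its faded weight drops below threshold."""
    return fading_weight(n_points, elapsed, lam) < threshold
```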
3.2 Evolutionary Representation and Functions
Most multi-objective clustering methods use an evolutionary representation for the clustering solutions, since maintaining a population enables the variation of solutions and makes it easy to keep a set of clustering solutions and apply genetic operators. However, such a representation requires the following components:

Choosing an evolutionary encoding to represent a clustering solution.

The generation of the initial population by an effective initialization scheme.

Suitable genetic operators to mix the solutions.

Choosing two or more objective functions as a fitness function to choose the nondominated solutions.

Developing a technique to obtain a single clustering solution from the Pareto set (leader selection method).
The choice of these components is crucial for the clustering quality and the algorithm scalability. In the next sections, we describe the components we chose after extensive experiments to deal with the requirements presented above.
3.2.1 Genetic Representation
Many representations were presented in previous MOC methods mukhopadhyay2015survey. However, these representations are not suitable for data stream clustering, since data points cannot be stored and have to be processed in one pass. Therefore, we chose a new genetic representation that facilitates the clustering update as the data flow. Each clustering solution with K clusters is represented by a chromosome, an array of size K × d + 2, where d is the dimension of the data. The first and second components are the two objective values for this solution. The remaining K × d components represent the clusters, each cluster being represented by a prototype of d elements. Figure 5 illustrates the clustering representation and conversion.
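The chromosome layout described above can be sketched as follows (helper names are ours):

```python
def encode(objectives, prototypes):
    """Flatten a clustering solution into a chromosome:
    [f1, f2, c1_1..c1_d, c2_1..c2_d, ...]."""
    chrom = list(objectives)
    for proto in prototypes:
        chrom.extend(proto)
    return chrom

def decode(chrom, d):
    """Split a chromosome back into objective values and d-dimensional
    prototypes."""
    objectives = tuple(chrom[:2])
    genes = chrom[2:]
    prototypes = [tuple(genes[i:i + d]) for i in range(0, len(genes), d)]
    return objectives, prototypes
```

Because the chromosome stores only objective values and prototypes, no raw data points need to be kept, which is what makes the representation stream-friendly.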
3.2.2 Population Initialization
In each time window of the data stream, a set of clustering solutions is created and stored. Our algorithm does not require these solutions to have the same number of clusters. A first population is created from the first window using the AntTree algorithm combined with other solutions generated by several algorithms (K-means pelleg2000x with different values of K, GNG fritzke1995growing, DBSCAN ester1996density). These algorithms were chosen after extensive experimentation due to their ability to perform a local search. The resulting solutions are encoded following the scheme described in Figure 5. We select the best solutions from this population to create new clustering solutions using the genetic operators crossover and mutation described in section 3.2.3. After the first population is initialized, we compute the objective functions for each clustering solution and store the Pareto-optimal solutions in the Pareto set. For each window of the data stream, the new data points belonging to the current window are used to update the solutions in the Pareto set and to create new solutions. The Pareto set is then updated with the non-dominated solutions. We describe the initialization and update scheme in Figure 6.
3.2.3 Genetic Functions
Genetic operators are essential for MOC methods as they enable the variety and diversity of the clustering solutions. For our method, we use two genetic operators: Crossover and Mutation, to explore more solutions. The use of those operators helps find a better solution by combining the optimal solutions obtained from the other algorithms.

Crossover: We use the single-point crossover whitley1994genetic in this paper due to its independence of the ordering of genes. The goal of the crossover operator is to create new clustering solutions from two parent solutions. First, we randomly select from the Pareto set two solutions that have K1 and K2 clusters, respectively. We randomly choose a crossover point p; as the number of clusters may vary, p must satisfy p < min(K1, K2).
The first clustering solution resulting from the crossover is composed of cluster centers 1 to p of the first solution and cluster centers p + 1 to K2 of the second solution. The second resulting clustering solution is composed of cluster centers p + 1 to K1 from the first solution and cluster centers 1 to p from the second solution. Figure 7 explains the crossover process for two clustering solutions.

Mutation: We use the random-resetting mutation operator mitchell1998introduction to randomly change some values of a cluster center of a clustering solution in order to explore global solutions. We select a clustering solution s from the Pareto set; then, from each cluster center in s, we randomly select some position values. For a value v, a number δ in [0, 1] is generated, and the value is updated as follows: v′ = v ± δ × v. The '+' and '−' signs occur with equal probability. Figure 8 illustrates the mutation process for a clustering solution.
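Both operators can be sketched on lists of prototypes as follows (a simplified illustration: the cut point p is passed explicitly for reproducibility, and the ±δ·v perturbation is our reading of the mutation rule):

```python
import random

def crossover(parent1, parent2, p):
    """Single-point crossover: swap the tails of two prototype lists at
    cut point p (p must be smaller than the shorter parent's length)."""
    child1 = parent1[:p] + parent2[p:]
    child2 = parent2[:p] + parent1[p:]
    return child1, child2

def mutate(solution, rng=random):
    """Random-resetting mutation: perturb one coordinate of one randomly
    chosen prototype by +/- delta * v, delta drawn uniformly in [0, 1]."""
    k = rng.randrange(len(solution))
    proto = list(solution[k])
    j = rng.randrange(len(proto))
    delta = rng.uniform(0.0, 1.0)
    sign = rng.choice((1.0, -1.0))
    proto[j] = proto[j] + sign * delta * proto[j] if proto[j] else sign * delta
    child = list(solution)
    child[k] = tuple(proto)
    return child
```

Note that the children of `crossover` can have different numbers of clusters than their parents, which is consistent with the variable-K encoding described above.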
Both operators are applied during idle times to the solutions of the Pareto set. The solutions are selected based on their fitness score, computed from the two objective values. We select the best clustering solutions and apply the genetic operators to them.
3.3 Objective Functions
One important aspect of MOC is the choice of suitable objective functions to be optimized simultaneously. Several quality measures exist for a clustering solution. The goal is to have distinct clusters (separateness) that are as dense as possible in terms of the data points they contain (compactness). To satisfy these requirements, we introduce two objective functions, compactness and separateness. Optimizing both objective functions together allows us to obtain arbitrarily shaped clusters.

Compactness: the compactness of a clustering solution reflects the overall intra-cluster deviation of the data and has to be minimized. In a streaming context, the compactness of a clustering solution is computed as follows:

(3)  Comp = λ · Comp_prev + Σ_{x ∈ W} d(x, c_k)

where W is the current window and k is the index of the cluster to which x belongs. d(x, c_k) is the Euclidean distance between the data point x and the prototype c_k. λ is the decay factor that decreases over time to give more importance to the most recent data. The points of the previous windows are not kept: Comp_prev was computed in the previous window with the previous prototypes.

Separateness: the separateness of a clustering solution is the mean distance between clusters. It reflects the inter-cluster dissimilarity and should be maximized. The separateness of a cluster is the shortest distance between a data point in this cluster and a data point of its neighborhood belonging to another cluster. In a streaming context, the separateness is computed as follows:
$$Sep_t = \frac{1}{K} \sum_{k=1}^{K} \min_{x_i \in C_k,\; x_j \in N(x_i) \setminus C_k} d(x_i, x_j) \qquad (4)$$
Where $N(x_i)$ is the neighborhood of the data point $x_i$ belonging to the cluster $C_k$. The neighborhood of a node is determined through the AntTree method: it consists of the nodes directly connected to it in the tree.
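The separateness computation above can be sketched as follows. In IMOCStream the adjacency comes from the AntTree structure; here it is supplied explicitly as a hypothetical chain, and all names are illustrative:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def separateness(points, labels, neighbors):
    """Mean, over clusters, of the shortest distance from a point of the
    cluster to a neighboring point belonging to a different cluster.
    `neighbors[i]` lists the indices adjacent to point i in the tree."""
    seps = []
    for c in sorted(set(labels)):
        dists = [euclidean(points[i], points[j])
                 for i in range(len(points)) if labels[i] == c
                 for j in neighbors[i] if labels[j] != c]
        if dists:
            seps.append(min(dists))
    return sum(seps) / len(seps) if seps else 0.0

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [0, 0, 1, 1]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # tree as adjacency lists
sep = separateness(points, labels, chain)
# both clusters' nearest cross-cluster neighbor distance is 4.0
```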
3.4 Solution Selection
At the end of the online phase, a set of non-dominated solutions is stored in the Pareto set. These non-dominated solutions are mathematically equally good. We use an internal quality measure, the Davies-Bouldin index davies1979cluster, to select the best solution from the Pareto set; an internal index is chosen because the data might not be labeled. We sort all the solutions by their internal-measure value and choose the best one as the output of the algorithm. The Davies-Bouldin index identifies sets of clusters that are compact and well separated. It is calculated as:
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)} \qquad (5)$$
Where $S_i$ is the mean distance between the data points of cluster $C_i$ and its center $c_i$, $d(c_i, c_j)$ is the distance between the centers of clusters $C_i$ and $C_j$, and $K$ is the number of clusters. DBI varies between 0 (best clustering) and $+\infty$ (worst clustering).
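The Davies-Bouldin index can be sketched directly from its definition; the example data are illustrative:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index: lower is better. S_i is the mean distance of
    cluster i's points to its center; the index averages, over clusters,
    the worst ratio (S_i + S_j) / d(c_i, c_j)."""
    K = len(centers)
    scatter = []
    for k in range(K):
        members = [p for p, l in zip(points, labels) if l == k]
        scatter.append(sum(euclidean(p, centers[k]) for p in members) / len(members))
    total = 0.0
    for i in range(K):
        total += max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                     for j in range(K) if j != i)
    return total / K

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (10.0, 1.0)]
dbi = davies_bouldin(points, labels, centers)
# S_0 = S_1 = 1, d(c_0, c_1) = 10, so DB = 0.2
```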
3.5 Improved MOCStream Algorithm
IMOCStream is an extension of multi-objective clustering for data streams that optimizes computation time and memory allocation. It starts by creating a first clustering solution using the AntTree algorithm. In contrast to the original algorithm, where all the data points are stored, we introduce a new tree aggregation method to store only a synopsis of the data. The clustering solution is encoded and combined with different solutions obtained by different algorithms to create a population of solutions. The objective function values are computed for each solution, and only the non-dominated solutions are added to the Pareto set. Then, we apply crossover and mutation to the best solutions selected from the Pareto set and add the resulting solutions to the population. For each time window, the next point from the stream is mapped into the tree, the prototypes are computed, and only the aggregated tree is stored. We update the weights of the nodes and remove the outdated ones. If the stream idles, we apply the genetic operators to generate more solutions. At the end of each time window, we compute the objective function values, select the non-dominated solutions, and update the Pareto set. In the offline phase, we compute the Davies-Bouldin index of each candidate solution and select the optimal one as the output of the algorithm. In summary, IMOCStream is described in Algorithm 2.
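The Pareto-set update performed at the end of each window — keeping only solutions that no other solution dominates on the two objectives (compactness to minimize, separateness to maximize) — can be sketched as follows; names and sample scores are illustrative:

```python
def dominates(a, b):
    """a dominates b when a is no worse on both objectives and strictly
    better on at least one. Objectives are the pair
    (compactness to minimize, separateness to maximize)."""
    comp_a, sep_a = a
    comp_b, sep_b = b
    return (comp_a <= comp_b and sep_a >= sep_b) and (comp_a < comp_b or sep_a > sep_b)

def pareto_set(scores):
    """Keep only the non-dominated (compactness, separateness) pairs."""
    return [s for s in scores if not any(dominates(o, s) for o in scores if o != s)]

scores = [(1.0, 5.0), (2.0, 6.0), (3.0, 4.0), (1.5, 5.5)]
front = pareto_set(scores)
# (3.0, 4.0) is dominated by (1.0, 5.0); the other three survive
```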
4 Experiments and Discussion
4.1 Datasets and Quality Criteria
The IMOCStream method described in this article was implemented in the Scala programming language and will be available in the Clustering4Ever GitHub repository (https://github.com/Clustering4Ever/Clustering4Ever). We evaluated the clustering quality of IMOCStream on several real uci and synthetic (https://www.sites.google.com/site/nonstationaryarchive/) datasets, described in Table 3. The mutation rate is set to 20%, and a fixed number of clustering solutions is selected for crossover and mutation.
Dataset      Instances   Features  Classes  Window
powersupply  29,928      2         24       100
HyperPlan    100,000     10        5        1000
Covertype    581,102     10        23       1000
Sensor       2,219,802   4         54       10000
1CDT         16,000      2         2        100
1CSurr       55,280      2         2        1000
4CR          144,000     2         1        1000
GEARS2C2D    200,000     2         2        10000
For the quality measures, we used the Normalized Mutual Information (NMI) strehl2002cluster and the Adjusted Rand index (ARAND) hubert1985comparing; both measures require the ground truth of the data to be available. Unlike purity, NMI provides a measure that is independent of the number of clusters. It reaches its maximum value of 1 only when the two sets of labels have a perfect one-to-one correspondence. The NMI of a clustering solution is calculated as follows:
$$NMI(Y, \hat{Y}) = \frac{2\, I(Y; \hat{Y})}{H(Y) + H(\hat{Y})} \qquad (6)$$
Where $Y$ are the true labels, $\hat{Y}$ are the labels predicted by the algorithm, and $I(Y; \hat{Y})$ is their mutual information. $H(\cdot)$ is the entropy of a partition, calculated as follows:
$$H(Y) = -\sum_{k} \frac{n_k}{N} \log \frac{n_k}{N} \qquad (7)$$
In Eq. (7), $n_k$ is the number of points assigned to the partition $k$ and $N$ is the total number of points. The ARAND index is a measure of agreement between two partitions: one given by the clustering process and the other defined by external criteria. Given the following contingency matrix, where $n_{ij}$ represents the number of points assigned to both cluster $U_i$ of partition $U$ and cluster $V_j$ of partition $V$, with row sums $a_i$ and column sums $b_j$:

        V_1    V_2    ...   V_C    Sums
U_1     n_11   n_12   ...   n_1C   a_1
U_2     n_21   n_22   ...   n_2C   a_2
...     ...    ...    ...   ...    ...
U_R     n_R1   n_R2   ...   n_RC   a_R
Sums    b_1    b_2    ...   b_C    N
The Adjusted Rand index is calculated as follows:
$$ARAND = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{N}{2}} \qquad (8)$$
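Both quality measures can be sketched compactly from their definitions. NMI is normalized here by the arithmetic mean of the two entropies, one common convention (the paper does not spell out its normalization); function names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_k (n_k / N) log(n_k / N)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true_labels, pred_labels):
    """NMI = 2 I(Y; Y_hat) / (H(Y) + H(Y_hat))."""
    n = len(true_labels)
    py, pc = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum((nij / n) * math.log((nij / n) / ((py[y] / n) * (pc[c] / n)))
             for (y, c), nij in joint.items())
    denom = entropy(true_labels) + entropy(pred_labels)
    return 2 * mi / denom if denom > 0 else 1.0

def comb2(v):
    """Number of unordered pairs among v items: C(v, 2)."""
    return v * (v - 1) // 2

def adjusted_rand(true_labels, pred_labels):
    """ARAND from the contingency counts n_ij and row/column sums a_i, b_j."""
    n = len(true_labels)
    nij = Counter(zip(true_labels, pred_labels))
    sum_ij = sum(comb2(v) for v in nij.values())
    sum_a = sum(comb2(v) for v in Counter(true_labels).values())
    sum_b = sum(comb2(v) for v in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

A perfect one-to-one relabeling (e.g. true `[0, 0, 1, 1]` vs. predicted `[1, 1, 0, 0]`) scores 1 on both measures, as stated above.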
4.2 Experimental Settings
Assuming large high-dimensional data arrives as a continuous stream, IMOCStream divides the streaming data into batches and processes each batch continuously. The batch size depends on the available memory and on the size of the original dataset; the window size for each dataset is shown in Table 3. We set the time interval between two batches to 1 second and the parameter to 0.7.
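The per-window node-weight update and removal of outdated nodes (Sect. 3.5) can be sketched as follows. Reading the 0.7 value as the decay factor is our assumption, and the pruning threshold is illustrative:

```python
def fade_weights(weights, decay=0.7, threshold=0.1):
    """Multiply every node weight by the decay factor at the end of a time
    window, then drop nodes whose weight falls below the pruning threshold
    (the 'outdated' nodes). `decay=0.7` is assumed from the experimental
    setting; `threshold` is an illustrative value."""
    faded = {node: w * decay for node, w in weights.items()}
    return {node: w for node, w in faded.items() if w >= threshold}

weights = {"n1": 1.0, "n2": 0.2, "n3": 0.12}
weights = fade_weights(weights)   # n3 fades to 0.084 and is pruned
```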
To show the effectiveness of our method, we compared it to five well-known stream algorithms (de2011extending, tu2009stream, hahsler2016clustering, cao2006density, aggarwal2003framework) from the R package streamMOA (https://github.com/mhahsler/streamMOA). We repeated our experiments with different initializations and chose those giving the best results. Table 5 shows the optimal parameter configurations.
Algorithms  Parameters  Initialization
DStream
DBStream
DenStream
CluStream   2
4.3 Clustering Evaluation
The results of IMOCStream on the datasets described above, compared to the different algorithms, are reported in Tables 6 and 7. Each NMI and ARAND value is the average over ten runs. IMOCStream gives better results than all the other methods. These results are due to the fact that our method optimizes two objective functions at the same time, maximizing intra-cluster similarity and minimizing inter-cluster similarity, which yields compact and well-separated clusters. The use of different algorithms to create a population of solutions allows IMOCStream to explore better solutions and escape local minima.
Another important point is the use of the genetic operators to combine the best solutions and explore promising local solutions. The other algorithms are sensitive to the initialization of their settings, which explains why our algorithm, which has no input parameters, yields better results. Finally, we noticed that the DStream algorithm gives better results than the other baseline algorithms in this experiment, since it is adapted to large datasets.
For synthetic datasets, IMOCStream also gave better results than the other stream algorithms, except for StreamKM++ on one dataset. This is due to the optimal choice of the parameter for StreamKM++, which lets it find the exact number of clusters on synthetic datasets. The number of clusters is not predefined in IMOCStream, but it still manages to find approximately the right number of clusters.
Dataset      Metrics  IMOCStream  StreamKM++  DStream  DBStream  DenStream  CluStream
powersupply  NMI
             ARAND
Sensor       NMI
             ARAND
Covertype    NMI
             ARAND
HyperPlan    NMI
             ARAND
± denotes the standard deviation.
Dataset      Metrics  IMOCStream  StreamKM++  DStream  DBStream  DenStream  CluStream
1CDT         NMI
             ARAND
1CSURR       NMI
             ARAND
4CR          NMI
             ARAND
GEARS_2C_2D  NMI
             ARAND
4.4 Clustering High Dimensional Data
The curse of dimensionality refers to various phenomena that arise when clustering data in high-dimensional spaces. Most clustering algorithms suffer from it, due to factors such as the high number of parameters to set or the algorithm's high complexity. To demonstrate our method's effectiveness on clustering high-dimensional datasets (HDD), we tested it on six HDDs from DIMsets. The dimensionality (number of features) of these datasets varies from 32 to 1024, while the number of instances and the number of classes are the same for all of them. We compared our results with the different stream algorithms based on the NMI and ARAND measures. The results, reported in Table 8, show that our method outperforms all the other methods in terms of NMI and ARAND. This is because our algorithm does not require parameter settings and uses linear-time genetic operators to enhance the quality, unlike the other algorithms. The predefined parameters and the use of costly processes and algorithms (like DBSCAN for DBStream and DenStream) make those algorithms slower and less robust when dealing with HDD. The genetic operators and the solution updates in our method are performed in linear time, making these functions inexpensive in computation time.
Dataset  Metrics  IMOCStream  StreamKM++  DStream  DBStream  DenStream  CluStream

dim032   NMI
         ARAND
dim064   NMI
         ARAND
dim128   NMI
         ARAND
dim256   NMI
         ARAND
dim512   NMI
         ARAND
dim1024  NMI
         ARAND
4.5 Clustering Evolution
Figure 9 shows examples of the evolution of IMOCStream clustering on the 1CDT, 4CEV1, 1CH, and 2CDT datasets. Each row represents the evolution of the clustering for a particular dataset. These figures are generated during the clustering process; we picked three partitionings at random iterations for each dataset. For each time window, the distribution of the incoming data points changes. With its multi-objective capability and the use of the fading function, IMOCStream manages to recognize the structures of the data stream and to separate these structures clearly. It can also detect arbitrarily shaped, compact, and well-separated clusters. We note that the number of clusters does not necessarily stay the same; the best one is chosen automatically.
4.6 Arbitrary Shaped Clusters
Figure LABEL:imocarbitrarily shows the cluster detection for the t4.9k, Compound, and Pathbased datasets (http://cs.joensuu.fi/sipu/datasets/). Our method manages to find clusters of arbitrary shapes and provides a good separation of the clusters, whereas the other streaming clustering methods can only find spherical clusters. IMOCStream is also able to identify noise points thanks to its use of a density-based clustering method (DBSCAN).
4.7 Time and Memory Complexity
In this section, we analyze our method to show its improvement over AntTree in terms of memory allocation and over the stream algorithms in terms of execution time. Figure 11 shows the execution times of IMOCStream and the other stream clustering algorithms. The DBStream algorithm has the shortest execution time, but our method is faster than the remaining stream clustering algorithms despite its evolutionary nature. At the same time, IMOCStream outperforms all the other algorithms in the quality results shown above. This indicates that IMOCStream is non-dominated across all datasets: no other algorithm produces better results within equal or less time. These results come from the fact that our method uses idle times to improve the clustering solutions and relies on genetic operators with linear complexity instead of costly functions, as the other algorithms do.
Figure 12 compares the memory allocation of our method and of the AntTree algorithm. We observe that IMOCStream requires less memory than AntTree on all the datasets. This is because AntTree stores all the data points, while IMOCStream stores only the synopsis; when the other algorithms' solutions are added, the memory grows only with the number of clustering solutions m.
5 Conclusion and Future Work
This paper presents a new method for data stream clustering based on a multi-objective algorithm, called IMOCStream. Unlike single-objective clustering techniques, which employ only one objective function, IMOCStream employs two objective functions to find arbitrarily shaped clusters and enhance the clustering quality. IMOCStream uses a two-phase process: 1) an online phase, creating several clustering solutions based on different algorithms and genetic operators; 2) an offline phase, constructing an optimal partition from the discovered clusters. We applied our method to large stream datasets and compared it to different stream clustering algorithms. The experiments show the effectiveness of IMOCStream in detecting arbitrarily shaped, compact, and well-separated clusters with better execution time. Part of our future work will be a parallel solution to minimize the execution time. Further research is needed on incorporating the Ant Colony algorithm, since its independent agents make it well suited to parallelization. More experimentation is needed with Spark Streaming to test our method on a real stream. We also plan to conduct further studies and experiments using different standard optimization algorithms to improve the convergence rate.
Reproducibility
To facilitate further experiments and reproducible research, we provide our contributions through an open-source API that contains several clustering algorithms, including S2GStream (local and global versions) and 2SSOM, implemented in Spark/Scala, together with the API documentation, at Clustering4Ever (https://github.com/Clustering4Ever/Clustering4Ever).
Conflict of Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
A two-page version of this work was published as a poster in attaoui2020moc. However, the present paper is not only a longer version of that work; it also introduces an improvement over it.