## 1 Introduction

One of the most relevant data analysis problems is clustering [18], which consists of partitioning the data into a predetermined number of disjoint subsets called clusters. Moreover, clustering is widely used in many applied areas, such as artificial intelligence, machine learning and pattern recognition [17, 19]. Among the wide variety of clustering methods, the $K$-means algorithm is one of the most popular [20]. In fact, it has been identified as one of the top-10 most important algorithms in data mining [28].

### 1.1 $K$-means Problem

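Given a data set of $d$-dimensional points of size $n$, $D=\{x_1,\dots,x_n\}\subset\mathbb{R}^d$, the $K$-means problem is defined as finding a set of $K$ centroids, $C=\{c_1,\dots,c_K\}\subset\mathbb{R}^d$, which minimizes the $K$-means error function:

$$E^{D}(C)=\sum_{x\in D}\min_{c\in C}\|x-c\|^{2} \tag{1}$$

where $\|\cdot\|$ denotes the Euclidean distance or $L_2$ norm.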
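The error function can be evaluated directly from its definition. The following is a minimal sketch in plain Python; the function name `kmeans_error` and the toy data are illustrative and not taken from the paper.

```python
def kmeans_error(points, centroids):
    """Sum, over all points, of the squared Euclidean distance to the closest centroid."""
    total = 0.0
    for x in points:
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centroids
        )
    return total


# Toy data: two points near (0, 1) and one point exactly on the second centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
print(kmeans_error(points, centroids))  # 2.0
```

Each of the first two points contributes a squared distance of 1 and the third contributes 0, which gives the total of 2.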

#### $K$-means Algorithm

The $K$-means problem is known to be NP-hard for $K>1$ and $d>1$ [26]. The most popular heuristic approach to this problem is Lloyd's algorithm [23]. Given a set of initial centroids, Lloyd's algorithm iterates two steps until convergence: 1) an assignment step and 2) an update step. In the assignment step, given a set of centroids, the set of points is partitioned into $K$ clusters by assigning each point to its closest centroid. Then, in the update step, the new set of centroids is obtained by computing the center of mass of the points in each cluster. This set of centroids minimizes the $K$-means error with respect to the given partition of the set of points. These two steps are repeated until a fixed point is reached, that is, until the assignment step no longer changes the partition. Each iteration of this process has $\mathcal{O}(n \cdot K \cdot d)$ time complexity. The combination of an initialization method plus Lloyd's algorithm is called a $K$-means algorithm. Many alternative initialization methods exist, which aim to improve this process by carefully selecting the initial centroids.

#### $K$-means Initialization

Regardless of all the benefits of the $K$-means algorithm, its behaviour strongly depends on the initial set of centroids [11, 12, 25]. Consequently, different alternative initializations have been proposed in the literature. One of the simplest yet most effective is Forgy's approach [27]. Forgy's initialization consists of choosing $K$ data points at random as initial centroids and assigning every other data point to its closest centroid. The main drawback of this approach is that it tends to choose data points located in dense regions of the space, so these regions tend to be over-represented. Recently, probabilistic seeding techniques that offer strong theoretical guarantees have been proposed. The $K$-means++ (KM++) [9] initialization iteratively selects $K$ points from the data set at random, where the probability of selecting a point is proportional to its squared distance to the closest previously selected centroid. This strategy has become one of the most prominent initializations since it guarantees an $\mathcal{O}(\log K)$-approximation in expectation. However, because KM++ has to pass $K$ times over the whole dataset, it has a computational complexity of $\mathcal{O}(n \cdot K \cdot d)$. As a consequence, other algorithms try to reduce the number of computed distances. For instance, in [2] an approximated KM++ is proposed that obtains the initial centroids in sublinear time using Markov chains. Other algorithms focus on reducing the converged error. In [7], the authors take converged centroids and split the densest cluster into two, applying $K$-means again and later fusing two clusters, in such a way that the error is reduced with respect to the previous clustering.

### 1.2 Streaming Data

Although the $K$-means problem deals with a fixed data set, its usage can be generalized to scenarios in which data evolves over time. One of these scenarios is streaming data (SD). We define SD as a set of data batches that arrive sequentially, where each batch is a set of $d$-dimensional points.

One of the main concerns when processing SD is how much data to store, since the volume of data increases indefinitely. Normally, a maximum number of stored batches is fixed; this bounds the time consumption and computational load of the clustering algorithm and keeps clustering tractable. Another main issue when dealing with SD is the concept drift phenomenon. Each batch is assumed to be drawn i.i.d. from an unknown probability distribution, and a concept drift occurs when this underlying distribution changes. In the presence of concept drifts there are two main approaches: active and passive mechanisms. On the one hand, an active mechanism dynamically adjusts the stored batches depending on whether a concept drift has been detected. On the other hand, a passive approach gives more importance to recent batches without detecting drifts. An example of a passive approach is the use of a sliding window of batches of fixed size [29, 21].

### 1.3 Contributions

In this paper, we formally define the Streaming $K$-means (SM) problem. We describe an active algorithm that is fully aware of when a concept drift occurs, and a passive one that tackles the problem through a surrogate error function. This surrogate error deals with the concept drift phenomenon by assigning exponentially decaying weights to older batches. We prove that the surrogate error is a good approximation to the SM error. The proposed passive algorithm minimizes the surrogate error and is based on a weighted $K$-means over batches. Its performance depends on the initialization applied each time a new batch arrives. Moreover, we present several initialization techniques that combine previous and novel information about the clusters, and we conduct experiments to compare them.

This paper is organized as follows. In Section 2 the streaming $K$-means problem is defined, which, as we will see, demands prior knowledge of when the last concept drift occurred. We then propose a passive approach and prove the suitability of our approximation. Next, in Section 3 we propose some appropriate initialization methods for the SM problem. In Section 4 we conduct experiments to compare the results of each initialization method. Finally, we discuss the main conclusions. The supplementary material contains more information about algorithm pseudocodes (Section B), datasets (Section E), experiments (Sections C and F) and proofs (Section D).

## 2 Streaming $K$-means Problem

In this section, we define the SM problem, a natural adaptation of the $K$-means problem for SD, where the objective is to minimize the SM error. The SM error function is formally presented in Definition 1:

###### Definition 1.

Given a set of batches, and set of centroids , the SM error function is defined as

(2) |

where is where the last concept drift has occurred, i.e., every batch share the same underlying distribution.

In order to compute the SM error function we need to know the batch in which the last concept drift occurred. Thus, the performance of an active approach to the problem strongly depends on the behavior of the detection mechanism implemented. On the one hand, if a false drift is detected, then previously computed clusters are forgotten unnecessarily. On the other hand, if a concept drift occurs but is not detected, then the previously computed centroids will be a bad initialization set and may lead to a bad clustering. In this work, we describe an active algorithm, which we call the Privileged SM algorithm (PSM). PSM is an ideal active approach to the problem because it knows in advance whether a concept drift occurs, and thus it can compute the SM error function. Clearly, PSM cannot be used in practice, but we use it as a reference in the experimental section, where we simulate streaming data with known concept drifts. Alternatively, it is possible to take a passive approach to the SM problem by developing an algorithm that does not require detecting concept drifts.

### 2.1 A Surrogate for SM Error

Here we propose a surrogate error function for the SM error function. This alternative function incorporates a forgetting mechanism based on a memory parameter, , which assigns an exponentially decreasing weight to batches based on their antiquity . In particular, the surrogate error function is defined as follows:

###### Definition 2.

Given a set of batches of data points, , the surrogate error function, for a given set of centroids , is defined as

(3) |

where is the total weighted mass of the set of batches .
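To make the forgetting mechanism concrete, the sketch below evaluates a surrogate of this kind. It assumes that a batch with antiquity `t` receives weight `rho**t`, that the weighted batch errors are summed, and that the result is normalized by the total weighted mass; the names `surrogate_error` and `rho` are illustrative and the exact form should be checked against Definition 2.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_error(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(sq_dist(x, c) for c in centroids) for x in points)


def surrogate_error(batches, centroids, rho):
    """Weighted error over batches; batches[t] has antiquity t (batches[0] is the newest).

    Assumption: weight rho**t per batch, normalized by the total weighted mass.
    """
    weighted_error = 0.0
    weighted_mass = 0.0
    for t, batch in enumerate(batches):
        w = rho ** t
        weighted_error += w * kmeans_error(batch, centroids)
        weighted_mass += w * len(batch)
    return weighted_error / weighted_mass


# The newest batch lies near the centroid; the older batch comes from a different concept.
batches = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
centroids = [(1.0,)]
print(surrogate_error(batches, centroids, rho=0.5))
print(surrogate_error(batches, centroids, rho=1.0))
```

With `rho=1.0` every batch counts equally, whereas smaller values of `rho` shrink the contribution of the older batch.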

The surrogate error is a weighted version of the -means error for SD. Furthermore, the following theorem illustrates the suitability of this alternative function. Without loss of generality, we consider for this theorem that all batches have the same number of data points (sizes), for .

###### Theorem 1.

Let be a point, be a set of batches of points in , where and denotes the antiquity of . Let the batches before the drift be i.i.d. according to , where . Let the batches after the drift be i.i.d according to , where for . Let us assume that is upper-bounded by , for and .

Then, with at least probability the difference satisfies:

(4) |

where

(5) |

For this theorem we do not assume any particular underlying distribution; the only assumption is that the squared distance of the data points to the given point is upper-bounded. More importantly, observe that, according to Theorem 1, the expected value of the surrogate error function tends to the SM error function exponentially fast as the number of batches since the last drift grows. In particular, the theorem shows that the surrogate function can be used to approximate the error for a single center; applying this result to every subgroup of points and its respective centroid therefore yields a good approximation of the SM error. Thus, we can deal with the SM problem by minimizing the surrogate error without needing to detect concept drifts.

Due to the exponential decrease of the weights as antiquity increases, the contribution of older batches to the approximated error rapidly becomes negligible. Therefore, in practice, we can compute an arbitrarily close approximation to the surrogate error function by considering only the most recent batches. This approximation addresses the issue of the indefinitely increasing volume of data.
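As a rough illustration of how many batches need to be stored, the sketch below assumes that a batch with antiquity `t` has weight `rho**t` and that a weight of at most `tol` is considered negligible. Both `tol` and the function name `batches_to_keep` are hypothetical and not part of the paper.

```python
import math


def batches_to_keep(rho, tol):
    """Smallest m such that every batch with antiquity t >= m has weight rho**t <= tol."""
    if not (0.0 < rho < 1.0 and 0.0 < tol < 1.0):
        raise ValueError("rho and tol must lie strictly between 0 and 1")
    return math.ceil(math.log(tol) / math.log(rho))


for rho in (0.5, 0.9):
    m = batches_to_keep(rho, tol=0.01)
    print(rho, m, rho ** m)
```

A forgetting parameter closer to 1 forgets more slowly, so more batches have to be kept before their weight falls below the tolerance.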

In Figure 1, we show how the difference between the SM error and the surrogate tends to zero as the number of batches since the drift increases. Panels (a) and (b) use different batch sizes: as the number of points in each batch increases, the bounds become narrower. Panels (a) and (c) use different values of the forgetting parameter. On the one hand, lower values of the forgetting parameter make the average difference between the SM error and the surrogate tend to zero faster. In other words, the surrogate has lower bias as an estimate of the SM error. On the other hand, lower values of the forgetting parameter imply broader bounds on the difference between the SM error and the surrogate function. Thus, the variance of the surrogate estimate is higher. Clearly, there is a trade-off between fast convergence and low variance when choosing the forgetting parameter.

## 3 Streaming Lloyd’s Algorithm

We propose the Forgetful SM (FSM) algorithm to deal with the SM problem in a passive way. FSM approximates the solution of the SM problem by minimizing the surrogate error function. When a new batch arrives, FSM runs an initialization procedure to find a set of initial centroids. Then, a weighted Lloyd's algorithm is carried out over the available set of batches. The running time of the weighted Lloyd's algorithm grows with the total number of points to be clustered. However, recall that we can compute an arbitrarily close approximation to the surrogate error function by discarding batches with a negligible weight. Doing so bounds the number of stored batches, so the cost of the weighted Lloyd's step of FSM depends on the number of stored batches and their average size rather than on the length of the stream.
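The weighted Lloyd's step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that every point carries the weight of its batch (for instance `rho**t` for a batch with antiquity `t`), and the function name and the empty-cluster rule are our own choices.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def weighted_lloyd(points, weights, centroids, max_iter=100):
    """Weighted Lloyd's algorithm; returns (centroids, assignment)."""
    centroids = [tuple(c) for c in centroids]
    dim = len(points[0])
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
            for x in points
        ]
        if new_assignment == assignment:
            break  # fixed point: the partition did not change
        assignment = new_assignment
        # Update step: weighted center of mass of each cluster.
        sums = [[0.0] * dim for _ in centroids]
        mass = [0.0] * len(centroids)
        for x, w, k in zip(points, weights, assignment):
            mass[k] += w
            for j in range(dim):
                sums[k][j] += w * x[j]
        for k in range(len(centroids)):
            if mass[k] > 0:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(s / mass[k] for s in sums[k])
    return centroids, assignment


points = [(0.0,), (3.0,), (10.0,), (12.0,)]
weights = [1.0, 0.5, 1.0, 1.0]  # the second point belongs to an older batch
final_centroids, final_assignment = weighted_lloyd(points, weights, [(1.0,), (9.0,)])
print(final_centroids, final_assignment)
```

Setting all weights to 1 recovers the standard Lloyd's algorithm.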

As we have mentioned before, initialization is a crucial part for good and fast convergence of Lloyd’s algorithm, and thus the performance and efficiency of FSM depends on its initialization procedure.

### 3.1 Initialization Step

Here we propose efficient procedures for the initialization step of FSM. Once a new batch is received, a straightforward initialization strategy is to use the previously converged set of centroids. We call this approach use-previous-centroids (UPC). UPC uses a set of centroids that is a local optimum for the past set of batches, which can be a good and efficient choice once a new batch is presented. An alternative straightforward initialization is to use the centroids obtained by applying a standard initialization procedure, such as KM++, to the newest batch. We call this approach initialize-with-current-batch (ICB). Clearly, ICB allows FSM to adapt rapidly when a concept drift occurs. However, this initialization takes into account neither the batches from the past nor the previously converged set of centroids. This could waste very valuable information, especially when a concept drift has not occurred for a long period of time.
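For reference, the KM++ seeding that ICB applies to the newest batch can be sketched as below. This is a plain-Python illustration of the standard seeding rule (sampling proportionally to the squared distance to the closest chosen centroid); the function name, the `rng` argument and the toy batch are assumptions made for the example.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeanspp_seeding(points, k, rng=None):
    """Select k initial centroids from `points` with the KM++ sampling rule."""
    rng = rng or random.Random()
    centroids = [rng.choice(points)]  # first centroid: uniformly at random
    # closest[i] is the squared distance from points[i] to its closest chosen centroid.
    closest = [sq_dist(x, centroids[0]) for x in points]
    while len(centroids) < k:
        if sum(closest) == 0.0:
            # Every point coincides with a chosen centroid; fall back to uniform sampling.
            new = rng.choice(points)
        else:
            new = rng.choices(points, weights=closest, k=1)[0]
        centroids.append(new)
        closest = [min(d, sq_dist(x, new)) for d, x in zip(closest, points)]
    return centroids


newest_batch = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeding(newest_batch, k=2, rng=random.Random(0))
print(seeds)
```

Points that have already been selected have zero sampling weight, so with distinct data points the seeds are distinct.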

### 3.2 Weighted $K$-means Initialization

We now propose two efficient initialization strategies that combine information from UPC and ICB, by minimizing an upper-bound to the surrogate error function. The next result defines an upper-bound for the surrogate error function that will allow us to determine a competitive initialization for the FSM algorithm.

###### Theorem 2.

Given two set of centroids and , for any set of centroids , the surrogate function can be upper-bounded as follows:

(6) |

where

(7) |

for , , where and are the weights related to each centroid and is a value independent of the set of centroids .

In words, Theorem 2 shows that the surrogate error is upper-bounded by a function of the candidate centroids plus a constant. In fact, observe that this function has the form of a weighted $K$-means error with the centroids of both sets as the data points, each with its associated weight. Hence, we propose an initialization procedure based on the weighted $K$-means algorithm over the union of both sets of centroids. We refer to this initialization as Weighted $K$-means Initialization (WI). The weighted $K$-means step itself operates on only $2K$ points, so each of its Lloyd iterations costs $\mathcal{O}(K^2 \cdot d)$.

### 3.3 Hungarian Initialization

An interesting analytical result can be obtained by combining Theorem 2 with one additional assumption. Assume that each centroid has a single pair of closest centroids, one from each of the two sets, and that these pairs are distinct for different centroids. We can give each centroid the same index as its match in one of the sets, but a different index, given by a permutation, may be needed for its match in the other set. Then, we can rewrite the upper bound given in Eq. 7 as follows:

(8) |

where the weights are those of the two matched centroids. The next theoretical result shows that, under this assumption, the upper bound can be minimized analytically with respect to the set of centroids.

###### Theorem 3.

Let be the function defined as in Eq. 8 for a set of centroids of size , where and are given, and they are the closest points to of the sets and . Then the set of centroids that minimize this function is given by:

for .

Theorem 3 shows that, just by making the one-to-one assumption, the optimal centroids can be expressed simply as a linear combination of the matched elements of both sets. Notice that with this assumption we achieve an analytical minimum of the upper bound.

#### Linear Sum Assignment Problem

If we want to compute the optimal centroids under the previous assumption, the matching permutation must be found. In order to do so, we use the result in Theorem 3 to rewrite Eq. 8:

(9) |

Hence, we define the matrix:

(10) |

and find the permutation for which the sum of the selected entries is minimal. This is a linear sum assignment problem, and we can use the Hungarian (or Kuhn-Munkres) algorithm [22] to determine this permutation with a computational complexity of $\mathcal{O}(K^3)$. Hence we propose another initialization method, named Hungarian Initialization (HI). HI first computes a set of optimized centroids over the new batch. Then the matrix is constructed and used to determine the matching permutation via the linear sum assignment problem. This way, the resulting sum is guaranteed to be the minimum value of the upper bound under the one-to-one assumption, and hence the new set of centroids can be computed as defined in Theorem 3. The computational complexity of HI is dominated by the more expensive of its two main steps: computing centroids over the new batch and solving the assignment problem.
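The matching step can be illustrated with a small sketch. It rests on two assumptions that should be checked against Eqs. 8 to 10: that each term of the bound is a weighted sum of two squared distances, so that its minimizer is the weighted mean of the matched pair, and that the cost of a pair is the value of that term at the minimizer. For brevity the permutation is found by brute force, which is only feasible for small $K$ and stands in for the Hungarian algorithm; all names are illustrative.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def merge_cost(c_prev, w_prev, c_new, w_new):
    """Minimum over c of w_prev*||c - c_prev||^2 + w_new*||c - c_new||^2."""
    return w_prev * w_new / (w_prev + w_new) * sq_dist(c_prev, c_new)


def matched_initialization(prev, w_prev, new, w_new):
    """Match previous and new centroids one-to-one and merge each matched pair."""
    k = len(prev)
    cost = [
        [merge_cost(prev[i], w_prev[i], new[j], w_new[j]) for j in range(k)]
        for i in range(k)
    ]
    # Brute-force linear sum assignment (stand-in for the Hungarian algorithm).
    best = min(
        permutations(range(k)),
        key=lambda s: sum(cost[i][s[i]] for i in range(k)),
    )
    merged = []
    for i, j in enumerate(best):
        total = w_prev[i] + w_new[j]
        merged.append(
            tuple((w_prev[i] * a + w_new[j] * b) / total for a, b in zip(prev[i], new[j]))
        )
    return merged, best


prev = [(0.0, 0.0), (10.0, 0.0)]
new = [(10.0, 2.0), (0.0, 4.0)]
merged, best = matched_initialization(prev, [3.0, 1.0], new, [1.0, 1.0])
print(best, merged)
```

In this toy case the first previous centroid is matched with the second new centroid, and because its weight is three times larger, the merged centroid stays closer to the previous one.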

## 4 Experimentation

In this section we analyse the performance of the FSM algorithm with the proposed initialization procedures: use-previous-centroids (UPC), initialize-with-current-batch (ICB), Hungarian initialization (HI) and weighted $K$-means initialization (WI). The converged SM error obtained by FSM with the different initialization strategies is compared with that of the gold-standard PSM.

We say that an -drift for a set of centroids occurs when the underlying distribution (the concept) changes to such that , where and are the expected -means errors of under and concepts, respectively. In order to control the strength of the drifts, the experiments are performed using simulated streaming data with -drifts generated using real datasets taken from the UCI Machine Learning Repository [1], for different values of .

### 4.1 Experimental Setup

#### Datasets.

The experiments have been carried out on 8 different datasets, simulated from real datasets from the UCI Machine Learning Repository [1]. The selected datasets have varying dimensions and numbers of data points; see Table 1 in the supplementary material. The simulated data consist of a sequence of batches of fixed size, and a concept drift takes place every 10 batches.

#### Procedure.

To analyze the behavior of the algorithms in streaming scenarios, we first perform a burn-in step by storing batches from the first concept. After this step, we start measuring the evolution of the performance of PSM and of FSM with the different initialization techniques. To compare them fairly, the two sets of centroids used as input (the previously converged centroids and those obtained from the current batch) are the same for every initialization procedure each time a new batch arrives. After the burn-in step, a stream of 100 batches is processed, with a concept drift every 10 batches. This procedure is repeated for each dataset and each value of the hyperparameters.

#### Measurements.

We have measured the quality of the solutions obtained by the different procedures in terms of the SM and surrogate error functions. In order to have comparable scores for different datasets, the scores (error values) obtained at initialization and at convergence are normalized. For each new batch, the score obtained with each algorithm is normalized with respect to the minimum score obtained over all algorithms. Using normalized scores allows us to summarize the results obtained for different algorithms on all the datasets in a single plot, which dramatically reduces the number of figures needed to display the results. In addition to the SM and surrogate error functions, we have measured the number of distances computed in Lloyd's algorithm and in the initialization. The computed distances were also normalized, in this case simply by dividing by the minimum. This way, the figures of Section 4.3 show how many times more distances have been computed compared to the fastest algorithm. The number of iterations of Lloyd's algorithm and the elapsed time were also measured; these are reported in Section F of the supplementary material.

#### Hyperparameters.

A key parameter is the forgetting parameter, since the surrogate function directly depends on it. Theorem 1 shows that the difference between the surrogate and the real SM error shrinks at a rate governed by this parameter, but the confidence interval grows as the parameter decreases; hence, finding a proper balance is necessary. Assuming that a sufficiently small difference is negligible, the forgetting parameter can be set by solving an equality that involves our prior knowledge about the (average) number of batches during which a concept is stable and the fraction of this period after which the difference becomes negligible. Intuitively, this fraction determines how fast the difference shrinks relative to the period between drifts. The magnitude of the concept drift and the number of clusters can affect how fast each algorithm adapts. For this reason, when generating streaming data, we use several values of the drift magnitude and of the number of clusters. Note that each configuration uses a different value of the forgetting parameter (see Table 2 in the supplementary material). For the sake of brevity, this paper shows results for a subset of these settings; further results are summarized in Section F of the supplementary material.

### 4.2 Initial and Converged Errors

Because the results did not vary too much for intermediate batches, we show measurements for the first and second batch (indexed by and ), and an intermediate and the last batch before the next concept drift (indexed by and ).

HI and WI show better initial surrogate errors than UPC and ICB when a concept drift occurs (see Figure 2), for every tested configuration. When a concept drift occurs, UPC performs poorly, given that its initial centroids were obtained by minimizing the surrogate error function for previous batches. For smaller values of the forgetting parameter, ICB gets better results than UPC when a drift occurs, since the previous batches contribute less to the surrogate error. Similarly, ICB improves over UPC as the magnitude of the drift increases, because the previously computed centroids become a worse approximation for the novel concept. As new batches arrive, we observe that UPC obtains the best initial surrogate errors, because the stored batches share the same underlying distribution and the previously converged centroids are a good initialization.

Figure 3 summarizes the surrogate error function of FSM at convergence. HI and WI stand out over the trivial initialization methods. Moreover, HI obtains median scores close to the best score for every tested configuration. In the previous figure we saw that WI obtained a better initialization error, but here HI obtains a lower converged error. The HI initialization is more restricted than WI, which explains its worse initialization error. However, this restriction seems reasonable, since the fixed points reached from HI have a better converged error. Furthermore, WI executes $K$-means over centroids and completely ignores the structure of the data points, which may lead to reassignments that increase the error. UPC shows a higher variance, especially for bigger drifts.

In Figure 4, we show the SM error at convergence. Here the results of PSM are shown as a reference. Observe that, in general, the medians of the converged SM error are comparable for every algorithm, especially when many batches of the same concept have already arrived. Recall that FSM does not minimize the SM error; we therefore conclude that the surrogate is a good approximation and that every initialization technique (except for ICB) works well. Even though PSM obtains the best scores when a drift occurs, after the next batch HI and WI already attain scores comparable to PSM in terms of medians. In terms of dispersion, HI and WI are even more stable (smaller variance) than PSM. We know from Theorem 1 that the surrogate error approximates the SM error better when more batches have arrived since the last concept drift; this can explain why, even though FSM does not explicitly minimize the SM error, its converged value is better than that of PSM (which knows when the last drift occurred). In the last batch before a concept drift occurs, FSM obtains scores comparable to PSM. This happens for every initialization method with the exception of ICB, which has a higher variance.

### 4.3 Computed Distances

The computational load of the methods considered in our experimental setting is dominated by the number of distance computations. Therefore, as is common practice in articles related to the $K$-means problem [2, 8], we use the number of computed distances to measure their computational performance.

Since UPC needs no extra computation for the initialization, it computes the fewest distances; we therefore use UPC as the reference in Figure 5, where the number of distances is shown relative to UPC's. Because distances are normalized by dividing by the minimum obtained over all algorithms, the Y axis shows how many times more distances have been computed compared to UPC. Considering every boxplot, we conclude that the medians of HI and WI are around 2; thus, in general, they compute twice as many distances as UPC.

## 5 Conclusions

In this work we have proposed a surrogate function for the SM error that can be computed without requiring concept drift detection. We have proved that the surrogate is a good approximation to the SM error, and that its quality improves as the number of batches from the same concept increases.

We have also presented novel initialization methods for the SM problem, in which information from previous iterations is used to construct more appropriate initial centroids. The conducted experiments have demonstrated the good performance of these methods, as well as the adequacy of the surrogate error.

We have performed a set of experiments using real data as a basis and simulated streaming scenarios with concept drifts. We have compared minimizing the surrogate error with minimizing the actual SM error, and we have analyzed the behavior of minimizing the surrogate for the proposed initialization procedures. In the last section, we have seen that the proposed initialization algorithms stand out over the trivial methods, at least in the converged real error. Using previously computed centroids proves to be the fastest method, although it performs badly when a drift happens. The other initialization methods require additional steps, which implies more computed distances and hence a longer elapsed time. However, this is a trade-off in exchange for a better response to concept drifts, more stable solutions and smaller error values, which is the main interest in the $K$-means problem.

## Appendix A Appendix

This is the supplementary material of the original paper, Passive Approach for the $K$-means Problem on Streaming Data. It is structured as follows: the first section contains the pseudocodes of the mentioned and proposed algorithms; then we briefly explain how the experiment illustrating Theorem 4 was performed; after that, the proofs of each theorem are given; the next section explains how we simulated concept drifts; and finally, further experimental results are displayed together with two tables, one showing the values of the forgetting parameter and the other the datasets used in our experiments.

## Appendix B Algorithm Pseudocodes

In this section we include pseudocodes for the algorithms mentioned in the original paper.

Algorithm 1 corresponds to Lloyd's algorithm. Given a set of initial centroids, Lloyd's algorithm iterates two steps until convergence: 1) an assignment step and 2) an update step. In the assignment step, given a set of centroids, the set of points is partitioned into $K$ clusters by assigning each point to its closest centroid. Then, the new set of centroids is obtained by computing the center of mass of the points in each cluster. This set of centroids minimizes the $K$-means error with respect to the given partition of the set of points. These two steps are repeated until a fixed point is reached, that is, until the assignment step no longer changes the partition. Each iteration of this process has $\mathcal{O}(n \cdot K \cdot d)$ time complexity.

Algorithm 2 describes an active algorithm, which we call Privileged SM algorithm (PSM). PSM is an ideal active approach to the problem because it knows in advance if a concept drift occurs, and thus it can compute the SM error function.

As an alternative to this approach, we propose the Forgetful SM algorithm (Algorithm 3), which proceeds similarly to PSM but minimizes the surrogate error function instead.

One of the initialization techniques is WI (Algorithm 4). Theorem 5 shows that the surrogate error is upper-bounded by a function of the candidate centroids plus a constant. In fact, this function has the form of a weighted $K$-means error with the centroids of both sets as the data points, each with its associated weight. Thus, this initialization technique computes a weighted $K$-means on the union of both sets of centroids and uses the computed centroids as initial centroids for FSM.

Algorithm 5 firstly computes a set of optimized centroids over the new batch . Then the matrix is constructed, which is used to determine the permutation that maps , via the linear sum assignment problem. This way, the sum is guaranteed to be the minimum value of , and hence the new set of centroids can be computed as defined in Theorem 6. The computational complexity of this algorithm is , which depends on whether (step 5) is bigger or smaller than (step 9).

## Appendix C Surrogate experiment

For this example, we stored 40 batches of fixed size from a specific concept, and then 20 batches from a drifted concept were added sequentially. Here, the centroid was set to the center of mass of the data points from the first concept (the data points of both concepts were generated beforehand and are chosen randomly for each batch), and the upper bound required by the theorem was obtained from the distance between this centroid and its farthest point. For each new batch, the number of batches since the drift increases by 1, and we compute the difference between both errors and their theoretical bounds (Eq. 11). Because the theorem gives a probabilistic result, we repeated the experiment many times, randomly selecting batches at each run. As the maximum confidence interval is given by a probability of 95%, we executed the experiment 20 times. This way, 95% of the experimental measures are plotted, by removing the maximum and minimum values obtained at each moment. Figure 1 shows the computed differences with a boxplot layout. Two confidence intervals are given in the figure, for probabilities 95% and 68%, which correspond to confidence-parameter values of 0.05 and 0.32, respectively.

## Appendix D Proofs

###### Theorem 4.

Let be a point, be a set of batches of points in , where and denotes the antiquity of . Let the batches before the drift be i.i.d. according to , where . Let the batches after the drift be i.i.d according to , where for . Let us assume that is upper-bounded by , for and .

Then, with at least probability the difference satisfies:

(11) |

where

(12) |

###### Proof.

is a r.v. distributed according to and with support in , for and , where for and for

The range of the support of is , for and . Thus we have that

(13) |

For any , by the Hoeffding’s inequality, we have that

(14) |

where

Equivalently,

Therefore,

(16) |

and thus we have that

Therefore, with at least probability , we have that , where

(17) |

which concludes the proof. ∎

In this proof, we have assumed that a drift occurred at a given batch and that the distance from each data point to the center is bounded. Using Hoeffding's inequality [15], we have shown that the difference is bounded and, moreover, that this bound shrinks as the number of batches since the drift increases, in other words, as new batches arrive. The width of the bound can be tuned with the confidence parameter via Eq. 17, which defines the confidence interval for the corresponding probability. These bounds can be tightened even more with bigger batches, as illustrated in Figure 1. In conclusion, our surrogate error function is a good approximation to the SM error when many batches have been stored since the last concept drift and each batch contains many data points, and the confidence intervals can be adjusted with the confidence parameter.
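The role of the batch size and of the confidence parameter can be illustrated with the textbook two-sided Hoeffding bound for the mean of independent bounded variables. The sketch below is a generic illustration and not Eq. 17, which additionally involves the batch weights; the function name and the numeric values are hypothetical.

```python
import math


def hoeffding_half_width(n, value_range, delta):
    """Half-width eps such that the mean of n independent variables, each taking values
    in an interval of length value_range, deviates from its expectation by more than
    eps with probability at most delta (two-sided Hoeffding bound)."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


for n in (100, 400):
    for delta in (0.05, 0.32):
        print(n, delta, hoeffding_half_width(n, value_range=1.0, delta=delta))
```

Quadrupling the number of points halves the width of the interval, and a smaller confidence parameter (a higher probability) widens it.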

###### Theorem 5.

Given two set of centroids and , for any set of centroids , the surrogate function can be upper-bounded as follows:

(18) |

where

(19) |

for , , where and are the weights related to each centroid and is a value independent of the set of centroids .

###### Proof.

First we show that:

(20) | |||||

Then observe that

(21) | |||||

Note that the last inequality holds as a consequence of the definition of , while equality would hold if there was no reassignments. We compute and as the closest centroids from to the previous centroids and the new centroids respectively. In order to obtain the desired form of the upper-bound of we shall recall how the centroids and are computed. With our notation is the mean value of the points in the set .

(22) | |||||

On the other hand, using the identity (once this identity is known to hold, it is straightforward to prove that it also holds for a weighted version), we obtain:

Note that the first term is independent of , so it is constant. Now we can develop the remaining term:
