# Online k-means Clustering

We study the problem of online clustering where a clustering algorithm has to assign a new point that arrives to one of $k$ clusters. The specific formulation we use is the $k$-means objective: at each time step the algorithm has to maintain a set of $k$ candidate centers and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the $k$-means objective ($C$) in hindsight. We show that provided the data lies in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centres at each round) implies an offline efficient algorithm for the $k$-means problem. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1+\epsilon)C$ and present a no-regret algorithm with runtime $O\big(T\,(\mathrm{poly}(\log(T), k, d, 1/\epsilon))^{k(d+O(1))}\big)$. Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as Follow The Leader, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.

## 1 Introduction

Clustering algorithms are one of the main tools of unsupervised learning and often form a key part of a data analysis pipeline. Unlabeled data is ubiquitous in the real world and discovering structure in such data is essential in many online applications. The focus of this work is on the online setting where data elements arrive one at a time and need to be assigned to a cluster (either new or existing) without the benefit of having observed the entire sequence. While several objective functions for clustering exist, in our work we will focus on the $k$-means objective. Most of our results can be easily generalized to most centre-based objectives.

The analysis of online algorithms comes in two flavours, bounding either the competitive ratio or the regret. The online algorithm makes irrevocable decisions and its performance is measured by the value of the objective function. The competitive ratio is the ratio between the value achieved by the online algorithm and that of the best offline solution (for minimization problems). In the case of clustering, without strong assumptions on the aspect ratio of the instance, no algorithms with non-trivial bounds on the competitive ratio can be designed. In regret analysis, one seeks to bound the difference between the objective value of the online algorithm and that of the best offline solution (in hindsight) by a function that grows sublinearly with the number of data elements. We consider regret analysis in this paper.

More precisely, in the case of online $k$-means clustering, the online algorithm at time $t$ maintains a set $C_t$ of $k$ candidate cluster centres, before observing the datum $x_t$ that arrives at time $t$. The loss incurred by the algorithm at time $t$ is $\ell(C_t, x_t) = \min_{c \in C_t} \|x_t - c\|_2^2$. The regret is the difference between the cumulative loss of the algorithm over $T$ time steps and the optimal fixed solution in hindsight, i.e.

$$\sum_{t=1}^{T} \ell(C_t, x_t) - \min_{C : |C| = k} \sum_{t=1}^{T} \min_{c \in C} \|x_t - c\|_2^2.$$
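The loss and regret defined above can be illustrated with a minimal Python sketch. Computing the true minimum over all sets of $k$ centres is the hard part of the problem, so this sketch takes the comparator set as an input instead of solving for it; the function names and the toy stream are illustrative assumptions, not part of the paper.

```python
import numpy as np

def kmeans_loss(centers, x):
    """Squared Euclidean distance from x to the closest center."""
    centers = np.asarray(centers, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.min(np.sum((centers - x) ** 2, axis=1)))

def regret_against(played_centers, points, comparator):
    """Cumulative online loss minus the loss of one fixed comparator set.

    Assumption: `comparator` is supplied by the caller; it equals the regret
    in the text only when it is an optimal k-means solution in hindsight.
    """
    online = sum(kmeans_loss(C, x) for C, x in zip(played_centers, points))
    fixed = sum(kmeans_loss(comparator, x) for x in points)
    return online - fixed

# Toy stream alternating between two corners of the unit square.
points = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
played = [[[0.5, 0.5], [0.5, 0.5]]] * len(points)   # always play the box centre
best_fixed = [[0.0, 0.0], [1.0, 1.0]]                # optimal for this stream
example_regret = regret_against(played, points, best_fixed)
```

In this toy stream the fixed pair of corners has zero loss, so the regret equals the cumulative loss of the played centres.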

### 1.1 Our Contributions

We consider the setting where the data all lie in the unit box in $\mathbb{R}^d$. We summarise our contributions below.

• A multiplicative weight-update algorithm (MWUA) over sets of $k$ candidate centres drawn from a uniform grid over the unit box achieves expected regret $\tilde{O}(\sqrt{T})$; here the $\tilde{O}$ notation hides factors that are poly-logarithmic in $T$ and polynomial in $k$ and $d$. The algorithm and its analysis follow standard lines, and the algorithm is computationally inefficient. Nevertheless, it establishes that achieving $\tilde{O}(\sqrt{T})$ regret is information-theoretically possible.

• We provide an online-to-offline reduction that shows that any online algorithm that runs in time at time , yields an offline algorithm that solves the $k$-means problem to additive accuracy in time polynomial in , , , , . In particular, for an offline instance of $k$-means with points in a bounded region with , an online algorithm with polynomial run time would yield a fully polynomial-time approximation scheme. We note that there exist hard instances for $k$-means for which and that it is known that $k$-means is -hard [2]. Furthermore, all known (approximation) FPT algorithms for $k$-means are exponential in at least one of the two parameters $k$ and $d$. This suggests that we need to relax performance requirements for efficient algorithms.

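• We consider a weaker notion of regret called $(1+\epsilon)$-regret. Let $C$ denote the loss of the best solution in hindsight; in the definition of regret, instead of comparing the cumulative loss of the algorithm with $C$, we compare it with $(1+\epsilon)C$. With this notion of regret, we provide an algorithm that achieves $\tilde{O}(\sqrt{T})$ $(1+\epsilon)$-regret and runs in time $O\big(T\,(\mathrm{poly}(\log(T), k, d, 1/\epsilon))^{k(d+O(1))}\big)$.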

• Finally, we consider online algorithms which have oracle access to a $k$-means solver. For instance, this allows us to implement Follow The Leader (FTL). We show that there exists a sequence of examples for which FTL has linear regret, and we confirm in simulations that this construction indeed results in linear regret. We also observe that FTL (using $k$-means++ as a proxy for the oracle) works rather well on real-world data.

### 1.2 Related Work

Clustering has been studied from various perspectives, e.g. combinatorial optimization and probabilistic modelling, and there are several widely used algorithms such as Lloyd’s local search algorithm with $k$-means++ seeding, spectral methods, the EM algorithm, etc. We will restrict discussion mainly to the clustering as combinatorial optimization viewpoint. The $k$-means objective is one of the family of centre-based objectives which uses the squared distance to the centre as a measure of variance. Framed as an optimization problem, the problem is -complete. As a result, theoretical work has focused on approximation and FPT algorithms.

A related model to the online framework is the streaming model. As in the online model, the data is received one at a time. The focus in the streaming framework is to have an extremely low memory footprint, and the algorithm is only required to propose a solution once the stream has been exhausted. In contrast, in the online setting the learning algorithm has to make a decision at each time step and incur a corresponding loss. Coresets are widely used in computational geometry to obtain approximation algorithms. A coreset for $k$-means is a mapping of the original data to a subset of the data, along with a weight function, such that the $k$-means cost of partitions of the data is preserved up to some small error using the given mapping and weights.

Online learning with experts and related problems have been widely studied, see e.g. Cesa-Bianchi and Lugosi [4] and references therein. The Multiplicative Weight Update Algorithm (MWUA) is a widely studied algorithm that may be used for regret minimization in the prediction with expert advice setting. It maintains a distribution over experts that changes as new data points arrive, which is used to sample an advice of some expert to be used as the next prediction, resulting in low regret. For a thorough survey see Arora et al. [1]. FTL is a simple online algorithm that always predicts the best solution for the data witnessed so far. It is known to admit low regret for some problems, namely, strongly convex objectives [14]. Variants of FTL that optimize a regularized objective have been successfully utilized for a wider range of settings.
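The weight-update loop described above can be sketched as follows. This is the exponential-weights (Hedge) form of the update, chosen here for brevity; the function name, the learning rate `eta`, and the two-expert toy losses are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def exponential_weights(loss_matrix, eta, seed=0):
    """Run exponential weights on losses in [0, 1].

    loss_matrix[t, i] is the loss of expert i at round t. At each round an
    expert is sampled from the current distribution, then every weight is
    multiplied by exp(-eta * loss).
    """
    rng = np.random.default_rng(seed)
    num_rounds, num_experts = loss_matrix.shape
    weights = np.ones(num_experts)
    picks = []
    for t in range(num_rounds):
        probs = weights / weights.sum()
        picks.append(int(rng.choice(num_experts, p=probs)))
        weights = weights * np.exp(-eta * loss_matrix[t])
    return picks, weights

# Expert 0 never errs, expert 1 always pays loss 1.
demo_losses = np.column_stack([np.zeros(50), np.ones(50)])
picks, final_weights = exponential_weights(demo_losses, eta=0.5)
```

After a few rounds almost all probability mass sits on the expert with zero loss, which is the mechanism behind the low-regret guarantee.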

The closest related work to ours is that of Dasgupta [9]. He defines the evaluation framework we use here, and presents a naïve greedy algorithm to address it, with no analysis. In addition, he combines algorithms by Charikar et al. [5] and Beygelzimer et al. [3] that together maintain a set of constant approximations of the $k$-centre objective at any time, for a range of values of $k$.

Choromanska and Monteleoni [8] study the online clustering problem in the presence of experts. The experts are batch clustering algorithms that output the centre closest to the next point at each step (hence provide implicit information on the next point in the stream). Using experts that have approximation guarantees for the batches they obtain an approximate regret guarantee of for the stream. Our setting differs in that we must commit to the next cluster centres strictly before the next point in the stream is observed, or any implicit information about it.

Li et al. [10] provide a generalized Bayesian adaptive online clustering algorithm (built upon the PAC-Bayes theory). They describe a Gibbs Sampling procedure of centres and prove it has a minimax sublinear regret. They present a reversible jump MCMC process to sample these centres with no theoretical mixing time analysis.

Moshkovitz [13] studies a similar problem that considers only data points as candidate cluster centres, and the offline solution is defined similarly (an analog of the $k$-medoids problem). Furthermore, the algorithm starts with an empty set of cluster centres and must output an incremental solution: cluster centres are only added to the set, and this is done in a streaming fashion. The loss is measured in hindsight, using the final set of cluster centres. They provide tight bounds for both adversarial and randomly ordered streams, and show that knowing the length of the stream reduces the number of centres required to obtain a constant approximation by a logarithmic factor.

Liberty et al. [11] handle a different definition of online– the data points are labeled in an online fashion but the loss is calculated according to the centroids of the final clusters. They allow clusters where is the aspect ratio of the data (the ratio between the diameter of the set and the closest distance between two points), and guarantee a constant competitive ratio when compared to the best clusters in hindsight.

Meyerson [12] studies online facility location, where one maintains a set of facilities at each time step, and suffers a loss which is the distance of the new point to the closest facility. Once a facility is located, it cannot be moved, and placing a new one incurs a loss. Meyerson presents an algorithm that has a constant competitive ratio on randomly ordered streams. For adversarial order he presents an algorithm with competitive ratio and provides a lower bound showing no algorithm can have a constant competitive ratio.

## 2 Basic Results

### 2.1 Preliminaries and Notation

To precisely define the online clustering problem we will consider points in the unit box . The point arriving at time $t$ will be denoted by $x_t$ and we use to denote the data received before time $t$. The learning algorithm must output a set $C_t$ of $k$ candidate centres using only . We refer to the set of all the candidate centres an algorithm is considering as sites. The loss incurred by the algorithm at time $t$ is $\ell(C_t, x_t) = \min_{c \in C_t} \|x_t - c\|_2^2$. The total loss of an algorithm up to time step $T$ is denoted by . The loss of the best $k$-means solution in hindsight after $T$ steps is denoted by $C$. The regret is defined as . Several of the algorithms we consider will pick cluster centres from a constrained set; we sometimes refer to any set of $k$ sites from such a constrained set as an expert. We define the $(1+\epsilon)$-approximate regret as .

The loss of the weighted -means problem is defined similarly, given a weight function , as .

We denote the best $k$-means solution, i.e. best cluster centres, for by , hence is the best $k$-means solution in hindsight. In our setting, the Follow-The-Leader (FTL) algorithm simply picks at time . We use $\tilde{O}$ to suppress factors that are poly-logarithmic in $T$ and polynomial in $k$ and $d$.
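As a rough illustration of FTL in this setting, the sketch below refits a $k$-means solution on all points seen so far and plays it for the next point. A few Lloyd iterations stand in for the exact oracle, so the solver is only a proxy; `lloyd`, `follow_the_leader`, the default first-round centres, and the toy stream are all assumptions made for this example.

```python
import numpy as np

def sq_loss(centers, x):
    """Squared distance from x to its closest center."""
    return float(np.min(np.sum((centers - x) ** 2, axis=1)))

def lloyd(data, k, iters=20, seed=0):
    """Heuristic k-means solver (a proxy for the oracle, not an exact one)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), size=k, replace=len(data) < k)
    centers = data[start].copy()
    for _ in range(iters):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:          # empty clusters keep their center
                centers[j] = members.mean(axis=0)
    return centers

def follow_the_leader(points, k, solver=lloyd):
    """Play the solver's output on past data; return the per-round losses."""
    points = np.asarray(points, dtype=float)
    default = np.full((k, points.shape[1]), 0.5)   # no data yet at round 1
    losses = []
    for t in range(len(points)):
        centers = default if t == 0 else solver(points[:t], k)
        losses.append(sq_loss(centers, points[t]))
    return losses

stream = [[0.0, 0.0], [1.0, 1.0]] * 3
ftl_losses = follow_the_leader(stream, k=2)
```

On this benign stream FTL pays only in the first two rounds; the worst-case behaviour discussed in the paper requires an adversarially constructed stream.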

### 2.2 MWUA with Grid Discretization

While the multiplicative weight update algorithm (MWUA) is very widely applicable, there are a couple of difficulties when it comes to applying it to our problem. In order to obtain a finite set of experts, we consider sites obtained by a -grid of . In order to obtain regret bounds that are for (typically ), we need to choose ; this means that the number of experts is exponentially large in $k$ and $d$ and polynomially large in $T$. However, since the regret of MWUA only has logarithmic dependence on the number of experts, this mainly incurs a computational, rather than statistical, cost. (There appears to be some statistical cost, in that we are only able to prove bounds on expected regret.) In Section 3, we develop a more data-adaptive version of the MWUA. This allows us to significantly reduce the number of sites required (we do not put sites in locations where there is no data) but also requires a much more intricate analysis. The price we pay for adaptivity and computational efficiency is that we are only able to get regret bounds for $(1+\epsilon)$-regret. However, the results in Section 2.3 show that this is (under computational conjectures) unavoidable.

###### Theorem 2.1.

Let be the set of sites and let be the set of experts (-centres chosen out of the sites). Then, for any , with , the MWUA with the expert set achieves regret ; the per round running time is .
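To make the size of the expert set concrete, here is a small sketch that builds grid sites in the unit box and enumerates experts as $k$-subsets of sites. The grid spacing `delta`, the inclusion of both box boundaries, and the choice to enumerate subsets without repetition are assumptions of this example rather than details confirmed by the text.

```python
import itertools
import math
import numpy as np

def grid_sites(delta, d):
    """All points of a uniform grid with spacing delta in [0, 1]^d."""
    axis = np.arange(0.0, 1.0 + 1e-12, delta)
    return np.array(list(itertools.product(axis, repeat=d)))

def grid_experts(sites, k):
    """Every expert is a set of k distinct site indices."""
    return list(itertools.combinations(range(len(sites)), k))

sites = grid_sites(0.5, 2)          # 3 grid values per axis in 2 dimensions
experts = grid_experts(sites, 2)    # all pairs of sites
num_experts = len(experts)
```

Even this coarse grid yields `math.comb(9, 2)` experts; the count grows rapidly as the spacing shrinks or as $k$ and $d$ increase, which is the computational cost the section refers to.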

### 2.3 Lower Bound

Given the disappointing runtime of the grid-MWUA algorithm, one may wonder whether there is a way to avoid explicitly storing a weight for each of the exponentially many experts and so speed up the MWUA algorithm. The following result gives evidence that, under complexity-theoretic assumptions, a significant speed-up is unlikely. In particular, a consequence of Theorem 2.2 is that for instances of $k$-means with data lying in a bounded region and , a per-round polynomial time online algorithm would imply a fully polynomial-time approximation scheme. Recall that $k$-means is -hard and the current best known algorithms are exponential in at least one of the two parameters $k$ and $d$.

###### Theorem 2.2.

Suppose there is an online -means clustering algorithm that achieves regret and runs at time in time . Then, for any , there is a randomized offline algorithm that given an instance of -means outputs a solution with cost at most

with constant probability and runs in time polynomial in

.

The lower bound presented here does not imply anything about the regret guarantees of Follow-The-Leader (FTL) with oracle access to the best offline solution at each step. This section shows that FTL incurs linear regret in the worst case.

###### Theorem 2.3.

FTL obtains regret in the worst case, for any fixed and any dimension.

Figure (a) shows the regret of FTL at any point in time if the stream halted at that point, which we refer to as the regret halted at $t$, for a stream that was generated according to the scheme described above with , along with a MWUA over the set of leaders , where the MWUA weight for any leader is calculated according to its historical loss, regardless of the time $t$ when it was introduced to the expert set, i.e. . The staircase-like line for FTL is caused by the fact that the specific data used makes FTL suffer a constant excess loss (w.r.t. the optimal solution) every several iterations, and negligible excess loss in the rest of the iterations. This demonstrates that the counterexample provided in Appendix A is viable and numerically stable without special care. The MWUA-FTL presents low asymptotic regret in this case, which is clearly sub-linear and possibly logarithmic in . The best intermediary $k$-means solution at each step was calculated analytically.

The previous section showed that there are worst case instances that make FTL perform badly. This section will present experimental results, on synthetic and real data sets, that suggest that FTL performs very well on natural data sets.

Figure (b) shows the regret halted at $t$ of FTL vs. for four different data sets, all of size 10000. The first is a random sample from MNIST, labelled , treated as a dimensional vector, normalized to a unit diameter box, using . The other three are Gaussian Mixture Models ( ) with 3 Gaussians () or 4 Gaussians () in two dimensions. The case was run with , where the 4 Gaussians are well separated (the distance between means is larger than 3 times the standard deviation), hence labelled . The two data sets are labelled for the well separated case, and for a case where the Gaussians are ill separated (the distance between means is a 0.7 fraction of the standard deviation). In all cases the best intermediary $k$-means solution was calculated using $k$-means++ with 300 iterations of local search, hence it is an approximation. The figure demonstrates a linear dependency between the regret and in these cases. All the standard deviations are set to .

## 3 Approximate Regret Minimization

We present an algorithm that aims to minimize approximate regret for the online clustering problem. The algorithm uses three main components:

1. The Incremental Coreset algorithm of Section 3.1, which maintains a monotone sequence of sets such that contains a weighted coreset for .

2. A Hierarchical Region Decomposition (Section 3.2) that corresponds to and provides a tree structure related to -approximations of $k$-means.

3. MWUA for tree structured expert sets (Section 3.4), referred to as Mass Tree MWUA (MTMW), that is given at every step, and outputs a choice of centers.

We present our main theorem and provide proof later on.

###### Theorem 3.1.

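Algorithm 1 has a regret of

$$\epsilon \cdot C + O\left(k\sqrt{d^3 T}\,\log\left(\frac{kT^3\sqrt{d}}{\epsilon^2}\right)\right)$$

and runtime of

$$T \cdot O\left(\sqrt{d}\,k^2 \log(T)\,\epsilon^{-2}\right)^{k(d+O(1))}.$$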

We now continue to describe the different components, and then combine them.

### 3.1 Incremental Coreset

The Incremental Coreset algorithm presented in this subsection receives an unweighted stream of points, one point in each time step, and maintains a monotone sequence of sets such that contains a weighted coreset for , for some given parameter . Formally, we have the following lemma, whose proof is provided in Appendix B.

###### Lemma 3.2.

For any time step , the algorithm described in 3.1.2 outputs a set of points such that it contains a coreset for , which we denote , and has size at most . Moreover, we have that .

#### 3.1.1 Maintaining an O(1)-Approximation for k-means

The first part of our algorithm is to maintain the sets of points , for any time step , representing possible sets of cluster centers where . We require that at any time, at least one set, denoted , induces a bicriteria -approximation to the -means problem, namely, contains at most centers and its loss is at most some constant times the loss of the best -means solution using at most centers. Moreover, for each , the sequence is an incremental clustering: First, it must be a monotone sequence, i.e. for any time step , . Furthermore, if a data point of the data stream is assigned to a center point at time , it remains assigned to in any for , i.e. until the end of the algorithm.

Each set which contains more than is said to be inactive and the algorithm stops adding centers to it. The remaining sets are said to be active.

To achieve this, we will use the algorithm of Charikar, O’Callaghan and Panigrahy [6], which we call , and whose performance guarantees are summarized by the following result, which follows immediately from Charikar et al. [6].

###### Theorem 3.3 (Combination of Lemma 1 and Corollary 1 in [6]).

With probability at least 1/2, at any time , one set maintained by is an -approximation to the -means problem which uses at most centers.

#### 3.1.2 Maintaining a Coreset for k-means

Our algorithm maintains a coreset based on each solution maintained by . It makes use of coresets through the coreset construction introduced by Chen [7] whose properties are summarized in the following theorem.

###### Theorem 3.4 (Thm. 5.5/3.6 in Chen [7]).

Given a set of points in a metric space and parameters and , one can compute a weighted set such that and is a -coreset of for -means clustering, with probability .

We now review the coreset construction of Chen. Given a bicriteria -approximation to the $k$-means problem, Chen’s algorithm works as follows. For each center , consider the points in whose closest center in is and proceed as follows. For each , we define the ring of to be the set of points of cluster that are at distance to . The coreset construction simply samples points among the points whose distance to is in the range (if the number of such points is below , simply take the whole set). This ensures that the total number of points in the coreset is , where is the maximum to minimum distance ratio (which can be assumed to be polynomial in without loss of generality).
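The ring-sampling idea can be sketched as follows for a single cluster. The geometric ring radii (doubling from a hypothetical `base_radius`), the per-ring sample size `per_ring`, and the reweighting of sampled points by ring size divided by sample size are assumptions of this sketch; the exact radii and sample sizes in Chen's construction are not recoverable from the text above.

```python
import numpy as np

def ring_sample(points, center, base_radius, per_ring, seed=0):
    """Sample up to per_ring points uniformly from each distance ring.

    Ring 0 holds points closer than base_radius; ring j >= 1 holds points at
    distance in [base_radius * 2**(j-1), base_radius * 2**j). Each sampled
    point is weighted so that the total weight of a ring equals its size.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
    scaled = np.maximum(dist, base_radius) / base_radius
    ring = np.where(dist < base_radius, 0,
                    np.floor(np.log2(scaled)).astype(int) + 1)
    sample, weights = [], []
    for j in np.unique(ring):
        idx = np.flatnonzero(ring == j)
        if len(idx) > per_ring:
            chosen = rng.choice(idx, size=per_ring, replace=False)
            w = len(idx) / per_ring
        else:                      # small ring: keep every point
            chosen, w = idx, 1.0
        sample.extend(points[chosen])
        weights.extend([w] * len(chosen))
    return np.array(sample), np.array(weights)

demo_points = np.random.default_rng(1).random((200, 2))
coreset_points, coreset_weights = ring_sample(
    demo_points, center=[0.5, 0.5], base_radius=0.05, per_ring=10)
```

Because points within one ring are at comparable distances from the center, a uniform sample with these weights approximates the ring's contribution to the cost.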

Our algorithm stores at each time a set of points of small size that contains a -coreset. Moreover, we have that the sets are incremental: .

To do so, our algorithm uses the bicriteria approximation algorithm of Section 3.1.1 as follows. For each solution stored by , the algorithm uses it to compute a coreset via a refinement of Chen’s construction. Consider first applying Chen’s construction to each solution maintained by . Since is incremental, whatever decisions we have made until time , both center openings and point assignments, will remain unchanged until the end. Thus, applying Chen’s construction seems possible. The only problem is that for a given set , a given center and a given ring of , we do not know in advance how many points are going to end up in the ring, and so we do not know what the sampling rate should be in order to sample elements uniformly.

To circumvent this issue, our algorithm proceeds as follows. For each set , for each center , for each , the algorithm maintains samples: one for each , which represents a “guess” on the number of points in the ring that will eventually arrive. More precisely, for a given time , let be the newly inserted point. The algorithm then considers each solution , and the center that is the closest to . If belongs to the ring, then is added to the set with probability for each if . Let denote the union over all centers , integers of . Let . For more details, the reader is referred to the proof of Lemma (3.2) in Appendix B.

### 3.2 Hierarchical Region Decomposition

A region decomposition is a partition of , each part of which is referred to as a region. A hierarchical region decomposition (HRD) is a sequence of region decompositions such that is a refinement of , for all . In other words, for all , for every region there exists a region such that .

As the hierarchical region decomposition only partitions existing regions, it allows us to naturally define a tree structure , rather than a DAG. There is a node in for each region of each . There is an edge from the node representing region to the node representing region if and there exists a such that and . We slightly abuse notation and refer to the node corresponding to region by . The bottom-level region decomposition is the region decomposition induced by the leaves of the tree. Moreover, given a hierarchical decomposition and a set of points of size , we define the representative regions of in as a sequence of multisets where with the correct multiplicity w.r.t. . Note that these correspond to a path in . We define the Approximate Centers of induced by as the sequence of multisets that consists of the centroids of the representative regions of in .

### 3.3 Adaptive Grid Hierarchical Region Decomposition

Given a sequence of points in , we describe an algorithm that maintains a hierarchical region decomposition with $d$-dimensional hypercube regions as follows. Let be a parameter s.t. is a power of 2. We require this in order to define an implicit grid with side length , i.e. diameter , such that it can be constructed from a single region containing the entire space by repeated halving in all dimensions. Denote . We refer to this implicit grid, along with the region tree structure that corresponds to this halving process, as the Full Grid and the Full Grid Tree. Consider a step , and . Denote the diameter of as , and the distance between and . Notice that if then .
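The repeated-halving construction can be illustrated with a short sketch. It uses only the fact that a $d$-dimensional hypercube of side $s$ has diameter $s\sqrt{d}$; the function names and the stopping threshold `target_diameter` are hypothetical and stand in for the refinement criterion, whose exact form is not shown here.

```python
import itertools
import math

def diameter(side, d):
    """Diameter of a d-dimensional hypercube with the given side length."""
    return side * math.sqrt(d)

def halve(corner, side, d):
    """Return the 2^d children (lower corner, side) of a hypercube region."""
    half = side / 2
    return [(tuple(c + b * half for c, b in zip(corner, bits)), half)
            for bits in itertools.product((0, 1), repeat=d)]

def region_containing(x, target_diameter, d):
    """Descend from the unit box by halving until the region is small enough."""
    corner, side, depth = (0.0,) * d, 1.0, 0
    while diameter(side, d) > target_diameter:
        half = side / 2
        corner = tuple(c + half if xi >= c + half else c
                       for c, xi in zip(corner, x))
        side, depth = half, depth + 1
    return corner, side, depth

children = halve((0.0, 0.0), 1.0, 2)
leaf_corner, leaf_side, leaf_depth = region_containing((0.3, 0.8), 0.5, 2)
```

Each halving divides the diameter by two, so the depth needed to reach a given diameter is logarithmic in the ratio of the initial diameter to the target.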

We define the refinement criteria induced by at time as , which takes the value true if and only if the diameter of is smaller than or equal to . At a given time , a new point is received and the hierarchical region decomposition obtained at the end of time , , is refined using the following procedure, which guarantees that all the new regions satisfy the refinement criteria induced by all the points at the corresponding insertion times. The pseudocode is given in Algorithm 2.

We now turn to proving Structural Properties of the hierarchical region decomposition that the algorithm maintains. The proof of the following lemma follows immediately from the definition.

###### Lemma 3.5.

Consider the hierarchical region decomposition produced by the algorithm at any time . Consider a region and let be the diameter of region , then the following holds. Either region belongs to or each child region of in has diameter at most .

###### Corollary 3.6.

Consider such that , a sequence of nested regions of length . We say that such a sequence cannot be refined more than times, i.e. . Lemma (3.5) along with the fact that the algorithm does not refine regions with diameter smaller than give us that

$$\Lambda \le -\log\left(\delta_T/\sqrt{d}\right) = -\log\left(\frac{\epsilon_{hrd}}{2T^3\sqrt{d}}\right)$$

The proof for the following Lemma is provided in B.

###### Lemma 3.7.

For any stream of length , using the above algorithm, we have that the total number of regions that are added at step is at most , hence the total number of regions in is .

###### Corollary 3.8.

Let be any region at step and be the set of regions that refine in the next time step, then we have that the log max branch is

$$\beta = \log \max_{t,R} |S| \le \log\left(\left(\frac{9\sqrt{d}}{\epsilon_{hrd}}\right)^{d} \log(T^3)\right)$$

Due to Lemma (3.7). Furthermore for sufficiently large we have that

We will now present a few properties relating to the approximation of the -means problem.

###### Lemma 3.9.

Let . Consider an online instance of the weighted -means problem where a new point and its weight are inserted at each time step. Let be the weighted point that arrives at time , such that its weight is bounded by . Consider the hierarchical region decomposition with parameter produced by the algorithm , for some time step .

Consider two multisets of centers such that for all , and are contained in the same region of the Region Decomposition of step . Then, the following holds.

We now extend this lemma to solutions for the -means problem. Given a set of centers and a hierarchical region decomposition , we associate a sequence of approximate centers for induced by by picking for each the approximation of induced by at step – the centroid of the region in that contains . Note that this is a multiset. The next lemma follows directly from applying Lemma (3.9) to these approximate centers, and summing over .

###### Lemma 3.10.

For the optimal set of candidate centers in hindsight and the approximate centers induced by the Hierarchical Region Decomposition at time step , for a weighted stream

As Corollary (3.6) gives that at most times (each of the regions may be refined times), and the loss is bounded by , then along with Lemma (3.10) we get the following corollary.

###### Corollary 3.11.

For the optimal set of candidate centers in hindsight and the approximate centers induced by the Hierarchical Region Decomposition at time step , for an unweighted stream

$$\sum_{t'=1}^{T} \ell(\tilde{S}_{t'}, x_{t'}) \le (1+\epsilon_{hrd})\,C + kd\Lambda + 2\epsilon_{hrd}$$

### 3.4 MTMW: MWUA for Tree Structured Experts

We present an algorithm, which we name Mass Tree MWUA (MTMW), that obtains low regret in the setting of Prediction from Expert Advice, as described in [1], for a set of experts with the tree structure described below. The algorithm is a modification of the Multiplicative Weights Algorithm, and we present a simple modification to the proof of Theorem (2.1) of Arora et al. [1] to obtain a regret bound.

Let denote a bounded loss function in . Consider , a tree whose leaves are all of depth and whose vertices correspond to expert predictions. The expert set is the set of all paths from the root to the leaves, denoted . We say that for a path , the prediction that is associated with at step is such that we can write the loss of the path w.r.t. the stream of elements as .

We associate a mass to any vertex in as follows. Define the mass of the root as , and the mass of any other node , denoted , as , where is its parent and is the out degree of . We define the mass of a path as the mass of the leaf node at the end of the path. Before we move on to prove the regret bound, we provide a useful lemma.

###### Lemma 3.12 (Preservation of Mass).

Let be a vertex in the tree, a subtree of with as root, and the leaves of , then
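A small sketch can make the mass bookkeeping concrete. It assumes that the root has mass 1 and that each child receives the mass of its parent divided by the parent's out-degree; this is one natural reading of the definition that is consistent with a preservation-of-mass property, but the exact formula is not recoverable from the text, so treat it as an assumption. The dictionary-based tree encoding is also hypothetical.

```python
def masses(children, root):
    """Assign mass 1 to the root and split each node's mass evenly among its children.

    `children` maps a node to the list of its children; leaves may be absent.
    """
    mass = {root: 1.0}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        for c in kids:
            mass[c] = mass[v] / len(kids)
            stack.append(c)
    return mass

# Toy tree: the root has two children with different out-degrees.
tree = {"root": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1"]}
mass = masses(tree, "root")
leaf_mass_under_a = mass["a1"] + mass["a2"] + mass["a3"]
total_leaf_mass = leaf_mass_under_a + mass["b1"]
```

Under this assumption the masses of the leaves below any vertex add up to the mass of that vertex, so the leaf masses form a probability distribution over root-to-leaf paths.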

We move on to bound the regret of the algorithm.

###### Theorem 3.13.

For a tree running MWUA over the expert set that corresponds to the paths of the final tree, , is possible even if the tree is known up to depth at any time step , provided the initial weight for path is modified to . The algorithm has a regret with respect to any path of

$$\sqrt{-T \ln(M(p))}$$

and has a time complexity of , the number of vertices of .

###### Corollary 3.14.

MTMW for a loss function bounded by obtains a regret of by using a normalized loss function
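The normalization step can be spelled out as a short derivation. Here $B$ is a hypothetical name for the bound on the loss (the symbol used in the paper is not shown above), and the regret bound $\sqrt{-T \ln(M(p))}$ of Theorem 3.13 is taken as stated for losses in $[0, 1]$.

```latex
% Assumption: B is an illustrative name for the loss bound, \ell \in [0, B].
\tilde{\ell} = \ell / B \in [0, 1]
\quad\Longrightarrow\quad
\sum_{t=1}^{T} \tilde{\ell}_t(\mathrm{MTMW}) - \sum_{t=1}^{T} \tilde{\ell}_t(p)
  \le \sqrt{-T \ln(M(p))}
\quad\Longrightarrow\quad
\sum_{t=1}^{T} \ell_t(\mathrm{MTMW}) - \sum_{t=1}^{T} \ell_t(p)
  \le B \sqrt{-T \ln(M(p))} .
```

In other words, rescaling the loss to $[0, 1]$ and multiplying the resulting bound back by $B$ costs only a factor of $B$ in the regret.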

### 3.5 Approximate Regret Bound

We will now combine the three components described above to form the final algorithm. First, we will show the main property of a Hierarchical Region Decomposition that is constructed according to the points that are added to form the sequence , which is analogous to Lemma (3.11). Next we define the k-tree structure that corresponds to a Hierarchical Region Decomposition, and show that MTMW performs well on this k-tree. Lastly we show that an intelligent choice of parameters for and allows Algorithm 1 to obtain our main result, Theorem (3.1).

###### Lemma 3.15.

Let be a Hierarchical Region Decomposition with parameter that was constructed according to . For the best cluster centers in hindsight and the approximate centers of induced by we have that

$$\sum_{t=1}^{T} \ell(\tilde{S}_t, x_t) \le \left(1 + \epsilon_c + 8(\epsilon_{hrd} + \epsilon_c)\,k\Lambda\right) C + dk\Lambda$$

Consider the region tree structure described in 3.2 that corresponds to the Hierarchical Region Decomposition at step defined in (3.15). We define a -region tree induced by as a level-wise

-tensor product of

, namely, a tree whose vertices at depth correspond to -tuples of vertices of level of . A directed edge from a vertex of level to vertex of level exists iff all the edges exists in for every . We define the -center tree induced by or -tree, as a tree with the same topology as the -region tree, but the vertices correspond to multiset of centroids, rather than tensor products of regions, i.e. the -region tree vertex corresponds to the -tree vertex where is the centroid of the given region. The representative regions of any set of cluster centers correspond to a path in the -region tree, and the approximate centers correspond to equivalent path in the -tree. An important thing to note is that Lemma (3.15) proves that there exists a path in the -tree such that the loss of the sequence of approximate centers it contains is close to as described therein.

We will now analyze the run of MTMW on the aforementioned k-tree. Let denote the k-tree path that corresponds to the best cluster centers in hindsight.

###### Lemma 3.16.

Using the definitions from Lemmas (3.6), (3.8) we have that -

Defining , s.t. , where is some constant, gives Theorem (3.1). The complete proof is provided in Appendix B.

## 4 Discussion

The online k-means clustering problem has been studied in a variety of settings. In the setting we use, we have shown that no efficient algorithm with sublinear regret exists, even in the Euclidean setting, under typical complexity-theoretic conjectures, due to k-means being -hard [2]. We have presented a no-regret algorithm with runtime that is exponential in , showing that the main obstacle in devising these algorithms is computational rather than information-theoretic.

We have shown that FTL with oracle access to the best clustering so far fails to guarantee sublinear regret in the worst case, but performs very well on natural datasets. This opens a door for further study. Specifically, what stability constraints on the data stream, such as well-separated clusters or data points drawn IID from well-behaved distributions, allow FTL to obtain logarithmic regret?

We presented an algorithm that obtains approximate regret with a runtime of using an adaptive variant of MWUA and an incremental coreset construction, which provides a theoretical upper bound for the approximate regret minimization problem. The next steps in this line of research will involve studying lower bounds for this approximate regret minimization problem and providing simpler algorithms, such as FTL with a regularizer fit for purpose. Another extension may reduce the dependency on the dimension by applying a dimensionality reduction to the data that preserves the k-means cost of clusters, such as the Johnson-Lindenstrauss transformation.

## References

• [1] S. Arora, E. Hazan, and S. Kale (2012) The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8 (1), pp. 121–164. Cited by: §A.1, §B.3, §1.2, §3.4.
• [2] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop (2015) The hardness of approximation of euclidean k-means. arXiv preprint arXiv:1502.03316. Cited by: §A.2, 2nd item, §4.
• [3] A. Beygelzimer, S. Kakade, and J. Langford (2006) Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pp. 97–104. Cited by: §1.2.
• [4] N. Cesa-Bianchi and G. Lugosi (2006) Prediction, learning, and games. Cambridge university press. Cited by: §1.2.
• [5] M. Charikar, C. Chekuri, T. Feder, and R. Motwani (2004) Incremental clustering and dynamic information retrieval. SIAM Journal on Computing 33 (6), pp. 1417–1440. Cited by: §1.2.
• [6] M. Charikar, L. O’Callaghan, and R. Panigrahy (2003) Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, pp. 30–39. Cited by: §3.1.1, Theorem 3.3.
• [7] K. Chen (2009) On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing 39 (3), pp. 923–947. Cited by: §3.1.2, Theorem 3.4.
• [8] A. Choromanska and C. Monteleoni (2012) Online clustering with experts. In Artificial Intelligence and Statistics, pp. 227–235. Cited by: §1.2.
• [9] S. Dasgupta (2008) Course notes, cse 291: topics in unsupervised learning. lecture 6: clustering in an online/streaming setting. Cited by: §1.2.
• [10] L. Li, B. Guedj, and S. Loustau (2018) A quasi-Bayesian perspective to online clustering. Electronic Journal of Statistics 12 (2), pp. 3071–3113. Cited by: §1.2.
• [11] E. Liberty, R. Sriharsha, and M. Sviridenko (2016) An algorithm for online k-means clustering. In 2016 Proceedings of the eighteenth workshop on algorithm engineering and experiments (ALENEX), pp. 81–89. Cited by: §1.2.
• [12] A. Meyerson (2001) Online facility location. In Proceedings 2001 IEEE International Conference on Cluster Computing, pp. 426–431. Cited by: §1.2.
• [13] M. Moshkovitz (2019) Unexpected effects of online k-means clustering. arXiv preprint arXiv:1908.06818. Cited by: §1.2.
• [14] S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4 (2), pp. 107–194. Cited by: §1.2.

## Appendix A Proofs of Results from Section 2

### A.1 MWUA

Let’s look at the projection of a set of centres to the set of sites . This projection is a multiset defined as . Let such that is the projected point corresponding to . We define the projection infinity distance of onto as the maximum distance between these pairs, i.e. , and denote it as .

The following lemma is folklore.

###### Lemma A.1.

Let be the centroid (mean) of a set of points , and another point in space. Then .
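The lemma's formula did not survive in the text above. As a hedged illustration, the Python check below verifies the standard centroid identity, which we assume is the statement intended: the sum of squared distances from the points to an arbitrary point equals the sum of squared distances to the centroid plus the number of points times the squared distance between the centroid and that point. The data, the point `c`, and the helper names are invented for this sketch.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def cost_to(points, c):
    """Sum of squared distances from every point to the single point c."""
    return sum(sq_dist(p, c) for p in points)

rng = random.Random(0)
P = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]
c = (0.3, -1.2, 2.0)          # arbitrary point, assumed for illustration
mu = centroid(P)

lhs = cost_to(P, c)
rhs = cost_to(P, mu) + len(P) * sq_dist(mu, c)
gap = abs(lhs - rhs)          # only floating-point error should remain
```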

###### Corollary A.2.

Let be a set of points, the set of optimal centers, and alternative centers such that then

We now turn to the main theorem.

###### Theorem A.3.

Let be the set of sites and let be the set of experts (k-centers chosen out of the sites). Then, for any , with , the MWUA with the expert set achieves regret ; the per-round running time is .

###### Proof.

We follow Section 3.9 of Arora et al. [1]. The grid distance results in sites, hence experts, which is . Denote the regret with respect to the grid experts as the grid-regret. Running a single step of MWUA requires sampling and a weight update, taking time , and the algorithm guarantees a grid-regret of at most .

Let be the closest grid sites to . Because , (A.2) gives . Hence the regret of the algorithm is at most , so choosing for yields an algorithm with regret and a per step time complexity of . ∎
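As a rough illustration of the grid-based MWUA analyzed above, the following Python sketch enumerates grid sites, treats every k-subset of sites as an expert, samples an expert proportionally to its weight, and applies the multiplicative update of Arora et al. [1]. The unit-cube domain, the normalization of losses by d, the use of sets instead of multisets, and the names `grid_sites`, `loss`, and `mwua_grid` are assumptions of this sketch rather than details taken from the proof.

```python
import itertools
import random

def grid_sites(d, delta):
    """All points of a grid with spacing delta in the unit cube [0, 1]^d."""
    steps = int(round(1 / delta)) + 1
    axis = [i * delta for i in range(steps)]
    return list(itertools.product(axis, repeat=d))

def loss(centers, x):
    """k-means loss of one point: squared distance to the closest center."""
    return min(sum((a - b) ** 2 for a, b in zip(c, x)) for c in centers)

def mwua_grid(stream, k, d, delta, eta, seed=0):
    """Run MWUA with one expert per k-subset of grid sites (eta <= 1/2)."""
    rng = random.Random(seed)
    experts = list(itertools.combinations(grid_sites(d, delta), k))
    weights = [1.0] * len(experts)
    total = 0.0
    for x in stream:
        # Commit to an expert before the point is revealed.
        chosen = rng.choices(experts, weights=weights, k=1)[0]
        total += loss(chosen, x)
        # Losses are divided by d, the squared diameter of [0, 1]^d,
        # so that each normalized loss lies in [0, 1].
        weights = [w * (1.0 - eta * loss(e, x) / d)
                   for w, e in zip(weights, experts)]
    return total, experts, weights

# Toy run (assumed data): points alternate between the two ends of [0, 1].
stream = [(0.0,), (1.0,)] * 20
total, experts, weights = mwua_grid(stream, k=2, d=1, delta=0.5, eta=0.5)
heaviest = experts[weights.index(max(weights))]
```

The number of experts grows as the number of sites to the power k, which is why the per-round cost of this direct implementation is exponential in k and d.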

### A.2 Reduction

The following is a proof for Theorem (2.2)

###### Proof.

We will reduce the offline problem with point set to the online problem by generating a stream of uniformly sampled points from and running on ; generates intermediate cluster centers , and we return the best one with respect to the entire set .

Denote the best offline k-means solution as . The optimal clustering for the stream may not coincide with the optimal offline clustering , but it must perform at least as well as on . Denote the internal randomness of . The regret guarantee gives

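$$\mathbb{E}_r\left[\mathbb{E}_X\left[\sum_{t=1}^{T} \ell(C_t, x_t) - \ell(C^*_{T+1}, X)\right]\right] \le T^{\alpha}$$

Using the linearity of expectation and noticing that $C^*_{T+1}$ does not depend on $r$:

$$\sum_{t=1}^{T} \mathbb{E}_r\left[\mathbb{E}_X\left[\ell(C_t, x_t)\right]\right] \le T^{\alpha} + \mathbb{E}_X\left[\ell(C^*_{T+1}, X)\right] \le T^{\alpha} + \mathbb{E}_X\left[\ell(C^*, X)\right]$$

Using $\forall C: \mathbb{E}_{x_t}\left[\ell(C, x_t)\right] = \frac{\ell(C, P)}{|P|}$:

$$\sum_{t=1}^{T} \mathbb{E}_r\left[\mathbb{E}_X\left[\frac{\ell(C_t, P)}{|P|}\right]\right] \le T^{\alpha} + \frac{T\mathcal{C}}{|P|}$$

Define $\epsilon_t$ s.t. $\mathbb{E}_r\left[\mathbb{E}_X\left[\ell(C_t, P)\right]\right] = (1 + \epsilon_t)\mathcal{C}$. Because $C^*$ is optimal, we know $\forall t: \epsilon_t \ge 0$.

$$\sum_{t=1}^{T} (1 + \epsilon_t)\frac{\mathcal{C}}{|P|} \le T^{\alpha} + \frac{T\mathcal{C}}{|P|}$$

Rearranging:

$$\sum_{t=1}^{T} \epsilon_t \le \frac{|P|\,T^{\alpha}}{\mathcal{C}}$$

Denote $\epsilon^* = \min_t \epsilon_t$:

$$T \cdot \epsilon^* \le \frac{|P|\,T^{\alpha}}{\mathcal{C}}, \qquad \epsilon^* \le \frac{|P|\,T^{\alpha - 1}}{\mathcal{C}}$$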

So provided one can choose s.t. is arbitrarily small. is a non-negative random variable, hence this condition suffices to produce an approximation algorithm with arbitrary for k-means. Awasthi et al. [2] show that this is NP-hard, finishing the proof. ∎
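The reduction in this proof can be sketched as follows: sample a stream uniformly from the offline point set, feed it to the online algorithm, score each intermediate set of centers on the entire point set, and return the best one. In the Python sketch below, `ToyOnlineLearner` is a hypothetical stand-in with a `predict`/`update` interface that we assume for illustration; it uses a running-mean update and is not claimed to be a no-regret algorithm.

```python
import random

def kmeans_cost(centers, points):
    """Sum over points of the squared distance to the closest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(c, p)) for c in centers)
               for p in points)

class ToyOnlineLearner:
    """Hypothetical placeholder for the online algorithm (NOT no-regret):
    moves the closest center towards each new point by a running-mean step."""
    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]
        self.counts = [1] * len(self.centers)

    def predict(self):
        return [tuple(c) for c in self.centers]

    def update(self, x):
        j = min(range(len(self.centers)),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(self.centers[i], x)))
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        self.centers[j] = [a + step * (b - a)
                           for a, b in zip(self.centers[j], x)]

def offline_from_online(P, learner, T, seed=0):
    """Return the intermediate centers with the lowest cost on all of P."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(T):
        centers = learner.predict()       # chosen before the point is revealed
        cost = kmeans_cost(centers, P)    # scored on the entire offline set
        if cost < best_cost:
            best, best_cost = centers, cost
        learner.update(rng.choice(P))     # next point, sampled uniformly from P
    return best, best_cost

# Toy offline instance (assumed data) on a line with k = 2.
P = [(0.0,), (0.1,), (0.9,), (1.0,)]
init = [(0.5,), (0.6,)]
best, best_cost = offline_from_online(P, ToyOnlineLearner(init), T=50)
```

Scoring every intermediate solution costs one pass over the point set per round, which keeps the reduction polynomial whenever the online algorithm itself is efficient.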

### A.3 FTL

The following is a proof for Theorem (2.3)

###### Proof.

We will present an algorithm that generates a stream of points on a line, for and any value of , such that the regret FTL obtains for the stream can be bounded from below by where is some constant. Extending the result to arbitrary can be done by contracting the bounding box where the algorithm generates points by a factor of , and adding data points outside of it in equally spaced locations, to force the creation of clusters, one for each location.

The stream will consist of points in 3 locations for . We will call a ()-clustering if it puts all the points at () in one cluster and the rest in the other cluster, and a ()-clustering puts all the points at () in one cluster and the rest in the other cluster. We will define a stream that has a () optimal clustering.

We will keep the number of points in and () equal up to a difference of 1 at any step by alternating between the two any time we put points in one of them. There exists such that if there is one point at () and points in each of then the optimal clustering is the ()-clustering and for points in each of the optimal clustering is the ()-clustering.

Our algorithm will start with a point in () making a ()-clustering. The next points will be added as follows: as long as is a ()-clustering, add a point to () or 0, balancing them; if has just become a ()-clustering, the next point will be (), inflicting an loss for FTL which depends only on hence it is , and making a ()-clustering again. Halt when we have points.

When the algorithm halts a ()-clustering is either the optimal clustering or is at most below the optimal clustering, so we will say that is a ()-clustering without loss of generality. FTL will jump from ()-clustering to ()-clustering times, and suffer a loss of at least the loss incurs at that step (because we balance () and , FTL moves the lower cluster away from the middle and hence suffers a slightly larger loss than at the next stage). This means that the regret is larger than which is bounded from below by a linear function in for a fixed . ∎
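To make the FTL rule used in this proof concrete, the Python sketch below plays, at every step, the optimal 2-means solution for the points seen so far on a line and measures the resulting regret. It does not reproduce the adversarial stream from the proof, whose locations are not recoverable from the text above. The exhaustive split search relies on the fact that an optimal clustering on a line splits the sorted points into contiguous groups, and skipping the loss of the very first round is a simplification assumed here.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sse(xs):
    """Sum of squared deviations of xs from their mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def best_2means_1d(points):
    """Optimal 2-means centers on a line, found by trying every split
    of the sorted points into a left and a right cluster."""
    xs = sorted(points)
    if len(xs) < 2:
        return (xs[0], xs[0])
    best_cost, best_centers = None, None
    for i in range(1, len(xs)):
        cost = sse(xs[:i]) + sse(xs[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, (mean(xs[:i]), mean(xs[i:]))
    return best_centers

def ftl_regret(stream):
    """FTL loss minus the cost of the best 2-clustering in hindsight.
    The first round is skipped because FTL has no data to follow yet."""
    ftl_loss = 0.0
    for t in range(1, len(stream)):
        centers = best_2means_1d(stream[:t])
        ftl_loss += min((stream[t] - c) ** 2 for c in centers)
    xs = sorted(stream)
    hindsight = min(sse(xs[:i]) + sse(xs[i:]) for i in range(1, len(xs)))
    return ftl_loss - hindsight

# Toy stream (assumed data): FTL pays only when the second location first appears.
regret = ftl_regret([0.0, 1.0, 0.0, 1.0])
```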

## Appendix B Proofs of Results from Section 3

### B.1 Incremental Coreset

The following is a proof for Lemma (3.2)

###### Proof.

Note that by the definition of the algorithm, each set contains at most centers. It follows that, by the definition of the algorithm, any is such that

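$$\left|A\left(S^{(t)}_i\right)\right| = \sum_{c \in S^{(t)}_i,\; j \in [\log T],\; u \in [\log T]} \left|A\left(S^{(t)}, c, j, u\right)\right| \le O\left(k \zeta \log^3 T\right)$$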

The overall bound follows from the fact that . Moreover, the fact that at any time , we have that for follows from Theorem (3.3) and the definition of the algorithm.

We then argue that contains a -coreset. Recall that there exists a set that induces a bicriteria -approximation to the k-means problem. Thus, for a given center , for any , the