1 Introduction
Clustering is an important unsupervised task used in various applications, including anomaly detection [23], recommender systems [30] and image segmentation [28]. The $k$-median clustering objective is particularly useful when the partition must be defined using centers from the data, as in some types of image categorization [12] and video summarization [17]. While clustering has classically been applied to fixed offline data, in recent years clustering on sequential data has become a topic of ongoing research, motivated by various applications where data is observed sequentially, such as detecting communities in social networks [3], online recommender systems [27] and online data summarization [5].
Previous work on clustering sequential data [e.g., 16, 4, 2] has typically focused on cases where the main limitation is memory: the clustering needs to be done on massive amounts of data, which cannot be kept in memory in full. In this work, we study sequential $k$-median clustering in a different setting, which we call the no-substitution setting. In this setting, centers need to be selected from the sequence of examples immediately when they are observed, and cannot be substituted later. This setting captures instances of clustering where the selection of a center involves an immediate action in the real world on the observed example. For instance, consider selecting a small set of users who will receive an expensive promotional gift, out of users that arrive at a shopping website. The goal is to find the users who will be the most effective in spreading the word about the product. This can be formalized as a $k$-median objective, with respect to a metric defined by connections between users.
We study the no-substitution setting in a general metric space, under the assumption that the data sequence is an i.i.d. sample from an unknown distribution, and the goal is to minimize the distribution risk of the selected centers. We provide an efficient algorithm, called SKM, which uses as a black box a given clustering algorithm for fixed data sets. We show that the multiplicative approximation factor obtained by SKM is twice the factor that can be obtained by an offline algorithm for sample-based clustering that uses the same black box. In addition, we provide an algorithm, called SKM2, that obtains the same approximation factor as the best (not necessarily efficient) offline algorithm. However, this algorithm has a computational complexity which is exponential in $k$, the number of clusters. Whether there exists an efficient no-substitution algorithm with the same approximation factor as an efficient offline algorithm is an open problem which we leave for future work.
Related Work
[7] studied sample-based $k$-median clustering in the offline setting. In this setting, the entire set of sampled data points is observed, and then the centers are selected from this sample. For the case of a general metric space, [7] provides uniform finite-sample bounds on the convergence of the sample risk to the distribution risk of any choice of centers from the sample.
Algorithms studying clustering on sequential data have mainly assumed a fixed data set and an adversarial ordering, under bounded memory. In this setting, the approximation is measured against the best possible clustering of the data set. [16] proposed the first single-pass constant approximation algorithm for the $k$-median objective with bounded memory. [4, 11, 2] develop algorithms for this setting using coreset constructions. [10] design algorithms based on the facility-location objective, using a procedure proposed in [26]. The latter also studies facility location under a random arrival order. [9] suggests a space-efficient technique to extend any sample-based offline coreset construction to the streaming (bounded-memory) model. [20] considers the streaming $k$-median problem under random arrival order. The different approaches obtain different trade-offs between memory and approximation guarantees. Unlike the no-substitution setting, these algorithms can repeatedly change their selection of centers, or simply select a center that has appeared sometime in the past.
[24] studies the online $k$-means objective with an arbitrary arrival order, in a setting where each observed point must either be allocated to an already-defined cluster or start a new cluster. This setting can be seen as a variant of the no-substitution setting, since a chosen center cannot be discarded later. However, the proposed algorithm selects centers, where is the sample size, and it is shown that in this adversarial setting, one must select more than elements to obtain a bounded approximation factor. [22] proposes an online $k$-median algorithm which maintains a constant approximate solution at any point in time, while minimizing the number of necessary recalculations of a clustering.
The no-substitution setting bears a resemblance to the secretary problem under a cardinality constraint. In this setting, a set of limited cardinality must be selected with no substitutions from a sequence of objects, so as to optimize a given objective. [6, 14, 19] study this setting when the objective is monotone and submodular. [5] suggest reformulating the $k$-median objective as a submodular function. However, this reformulation does not preserve the approximation ratio of the $k$-median objective. It also requires access to an oracle for function value calculations, which is not readily available in the sample-based sequential clustering setting. [29] study a more general problem of converting an offline algorithm to a no-substitution algorithm in an interactive setting.
2 Setting and Preliminaries
For an integer , denote . Let be a bounded metric space, and assume . For let . Assume a probability distribution over . Below, the random variable is always assumed to be drawn according to , unless explicitly noted otherwise. For , denote . A clustering is a set of points which represent the centers of the clusters. Given a probability distribution , the $k$-median risk of on is . For a finite set , is defined as the risk of on the uniform distribution over . We will generally assume an i.i.d. sample . For convenience of presentation, we treat as both a sequence and as a set interchangeably, ignoring the possibility of duplicate examples in the sample. These can be easily handled by using multisets, and taking the necessary precautions when selecting an element from . When a minimization with respect to is performed, we assume that ties are broken arbitrarily.
Denote by a specific optimal solution of the $k$-median clustering problem, where the minimization is over all possible clusterings in ; we assume for simplicity that such an optimizer always exists. Denote by a specific solution that minimizes the risk on using centers from . In the no-substitution $k$-median setting, the algorithm does not know the underlying distribution . It observes the i.i.d. sample in a sequence, and selects elements from as centers, to form the clustering . The centers can only be selected immediately after they are observed, before observing the next element in the sequence. Moreover, the centers cannot be substituted later. The objective is to obtain a small , compared to the optimal . An offline $k$-median algorithm takes as input a finite set of points from and outputs a clustering . We say that is a approximation offline $k$-median algorithm, for some , if for all input sets , .
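To make the risk notion concrete, the following is a minimal sketch of the empirical $k$-median risk of a set of centers on a finite sample: the average distance from each sample point to its nearest center. The names `empirical_risk` and `abs_dist` and the toy one-dimensional data are illustrative assumptions, not notation from the text.

```python
from typing import Callable, Iterable, Sequence, TypeVar

Point = TypeVar("Point")


def empirical_risk(
    centers: Iterable[Point],
    sample: Sequence[Point],
    dist: Callable[[Point, Point], float],
) -> float:
    """Average distance from each sample point to its nearest center."""
    centers = list(centers)
    if not centers or not sample:
        raise ValueError("need at least one center and one sample point")
    return sum(min(dist(x, c) for c in centers) for x in sample) / len(sample)


def abs_dist(a: float, b: float) -> float:
    """Toy metric on the real line (hypothetical choice for illustration)."""
    return abs(a - b)


sample = [0.0, 0.1, 0.2, 0.8, 0.9]
risk = empirical_risk([0.1, 0.9], sample, abs_dist)
print(risk)  # approximately 0.06
```

The distribution risk replaces the average over the sample by an expectation over the unknown distribution, which is why it can only be estimated from the observed sequence.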
For a nonnegative function , we denote by a function which is upper-bounded by for some universal constant , for any integer and , and for all sufficiently large .
3 An Efficient Algorithm: SKM
The first algorithm that we propose, called SKM, is designed to use an efficient offline $k$-median algorithm as a black box and is in itself efficient. SKM works in two phases. In the first phase, the incoming elements are observed and no element is selected. In the second phase, elements are selected based on the information gained in the first phase. This structure is similar to that of algorithms for the classical secretary problem [15, 13], in which one tries to select a single element with a maximal value.
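For readers unfamiliar with the secretary problem, the observe-then-select structure can be illustrated by the classical stopping rule. This is a sketch of that textbook rule, not of SKM itself; the function name `secretary_rule` and the cutoff of roughly n/e observations are the standard textbook choice and are assumptions here, not taken from the text above.

```python
import math
from typing import Sequence


def secretary_rule(values: Sequence[float]) -> float:
    """Observe a prefix without selecting, then take the first value
    that beats everything seen in the prefix."""
    n = len(values)
    if n == 0:
        raise ValueError("empty sequence")
    cutoff = int(n / math.e)  # length of the observation-only phase
    best_seen = max(values[:cutoff], default=float("-inf"))
    for v in values[cutoff:]:
        if v > best_seen:
            return v  # irrevocable selection
    return values[-1]  # forced to take the last element


print(secretary_rule([3, 1, 4, 1, 5, 9, 2, 6]))  # prints 4
```

As in the no-substitution setting, a selection cannot be revised once made, so the first phase is spent purely on estimation.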
SKM receives as input a confidence parameter , the number of clusters , the sequence size , and access to a black-box offline $k$-median algorithm . The main challenge in designing SKM is to define a selection rule for elements from the second phase, based on the information gained in the first phase. This information needs to have uniform finite-sample convergence properties so that the error of the solution can be bounded, and centers need to be found with a high probability. SKM constructs this rule by combining the solution of on the examples of the first phase with information on the distribution estimated from these examples.
Denote the set of elements observed in the first phase by , and those observed in the second phase by . Elements from are selected as centers if they are close to the centers selected by . Importantly, closeness is measured relative to the distribution of distances in : An element from is selected as a center if its distance to one of the centers selected by from is smaller than all but at most a fraction of the points in . For , define . This is the probability mass of points whose distance from is at most the distance of . For a set of points and , let be the fraction of the points in that are in .
For , let be some point . Denote the ball in with center and radius determined by by . SKM is listed in Alg. 1. The guarantee on SKM is provided in the following theorem.
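The two-phase selection rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Alg. 1: the sequence is split into two equal halves, the fraction `eps` is passed in directly instead of being derived from the input parameters, the radius of each ball is taken as an empirical quantile of first-phase distances, and at most one element is selected per ball. The names `skm_sketch` and `brute_force_k_median` are hypothetical, and the brute-force routine merely stands in for the black-box offline algorithm.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


def brute_force_k_median(points: Sequence[float], k: int) -> List[float]:
    """Toy stand-in for the black box: exhaustive search over k-subsets."""
    def total(cs):
        return sum(min(abs_dist(x, c) for c in cs) for x in points)
    return list(min(combinations(points, k), key=total))


def skm_sketch(
    stream: Sequence[float],
    k: int,
    eps: float,
    offline_algo: Callable[[Sequence[float], int], List[float]],
    dist: Callable[[float, float], float],
) -> List[float]:
    n = len(stream)
    first, second = stream[: n // 2], stream[n // 2:]

    # Phase 1: observe only. Run the black box and estimate, for each
    # returned center, the radius that captures about an eps fraction
    # of the first-phase points.
    centers = offline_algo(first, k)
    m = max(1, math.ceil(eps * len(first)))
    radii = [sorted(dist(c, x) for x in first)[m - 1] for c in centers]

    # Phase 2: select an element the moment it falls inside a ball
    # that has no selected element yet; selections are never revised.
    selected: List[float] = []
    open_balls = set(range(len(centers)))
    for x in second:
        for i in sorted(open_balls):
            if dist(centers[i], x) <= radii[i]:
                selected.append(x)
                open_balls.remove(i)
                break
        if not open_balls:
            break
    return selected


stream = [0, 1, 2, 7, 8, 9, 5, 10, 1, 4, 8, 3]
print(skm_sketch(stream, 2, 0.5, brute_force_k_median, abs_dist))  # prints [1, 8]
```

Measuring closeness by a quantile of the first-phase distances, rather than by a fixed radius, is what lets the rule adapt to how densely the distribution is concentrated around each center.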
Theorem 3.1.
Suppose that SKM is run with inputs , , and , where is a approximation offline $k$-median algorithm. For any and any distribution over , with a probability at least ,
Theorem 3.1 gives a range of tradeoffs between additive and multiplicative errors, depending on the value of . In particular, by setting and noting that , we get
This guarantee can be compared to the guarantees of an offline algorithm that uses the same $k$-median algorithm as a black box. As shown in [7], for , with a probability at least , for every clustering and for , Therefore,
Since we also have , it follows that Therefore, the additive errors of both guarantees have a similar dependence on , and . When , the additive errors go to zero, and we remain with the approximation factor of SKM, which is twice that of the offline algorithm.
To prove Theorem 3.1, we first prove that with a high probability, SKM succeeds in selecting centers from . This requires showing that the estimate of the mass of using is close to its true mass on the distribution. We use the following lemma, proved in the appendix using the empirical Bernstein’s inequality of [25]:
Lemma 3.2.
Let be i.i.d. random variables over with mean . Let be their empirical mean. Then, with a probability at least , .
This result is used in the proof of the following lemma. For readability, we denote the sizes of and by respectively.
Lemma 3.3.
For every distribution over , if then with a probability at least , SKM selects centers from .
Proof.
For , denote . Apply Lemma 3.2 by letting stand for the indicators for , , . It follows that with a probability at least , if , then , hence . By a union bound on the pairs in , we have that with a probability of , for all pairs ,
In particular, this holds for and , where are the centers returned by in SKM. Denote . By definition of , for all , . In addition, by definition of and , we have that .
Since , we have . Therefore, .
Therefore, with a probability at least , satisfies that for all , where we used . If this event holds for , then the probability over that is at most By a union bound, the probability that less than centers were selected from is at most .
Combining the two events, we conclude that the probability of selecting centers during SKM is at least . ∎
We now bound the risk of the output of SKM, under the assumption that indeed centers have been successfully selected.
The condition in step 6 of the algorithm guarantees that all the selected centers are in the balls around the centers returned by . The following two lemmas bound the risk that the selected centers induce compared to the original centers. The lemmas are formulated more generally to apply to a general distribution. The first lemma considers a single center. For a distribution over and , denote .
Lemma 3.4.
Let . Let be a distribution over . Let , such that . Then
Proof.
Denote . Using the triangle inequality, and letting , we have
To upper-bound , note that by the conditions on , . Therefore,
It follows that , which completes the proof.
∎
The lemma above provides a multiplicative upper bound on the risk obtained when replacing a center with another center . However, this upper bound is only useful if is small. In the general case, an additive error term cannot be avoided. For instance, suppose that the optimal clustering has a risk of zero, and there is at least one very small cluster. In this case, the algorithm might not succeed in choosing a good center for this cluster, and some additive error will ensue. The following lemma bounds the overall risk of the clustering when all centers are replaced.
Lemma 3.5.
Let and let be a distribution over . Let , and such that . Then for any ,
Proof.
Denote and . Let , and let be the conditional distribution of given . Distinguish between two types of clusters. If , then , where the second inequality follows from the assumption on . Thus . Since , . Therefore, . On the other hand, if , then
Therefore, Lemma 3.4 holds for , , and , implying that Since , we have . Therefore,
We thus have
which completes the proof. ∎
Using the results above, Theorem 3.1 can now be proved.
Proof of Theorem 3.1.
Recall that are independent i.i.d. samples of size drawn from . By Hoeffding’s inequality and the fact that we have that for any fixed clustering , By a union bound on all the clusterings in and on , we get that with a probability , all such clusterings satisfy
(3.1) 
where we used .
In addition, by Lemma 3.3, with a probability at least , SKM selects centers from . The two events thus hold simultaneously with a probability at least . Condition below on these events and let be the selected centers, ordered so that . Denote . Since , we have by definition of that . Therefore, Lemma 3.5 holds with set to the uniform distribution on , , and . Hence,
By the assumptions on and by Eq. (3.1),
In addition, Combining the inequalities and noting that , we get
The theorem follows by setting as defined in SKM. ∎
We have thus shown that SKM obtains an approximation factor at most twice that of an offline algorithm. If the black-box algorithm is efficient, SKM is also efficient. In the next section, we show that if efficiency limitations are removed, there is an algorithm for the no-substitution setting that obtains the same approximation factor as an optimal (possibly also inefficient) offline algorithm.
4 Obtaining the Optimal Approximation Factor: SKM2
If efficiency considerations are ignored, the offline algorithm can use a approximation algorithm with the best possible . It is well known [see, e.g., 16] that for any data set , , and that this upper bound is tight. Therefore, the lowest possible value for in a general metric space is . Using the bound of [7] discussed in Section 3, this gives the following guarantee for the offline algorithm:
We now give an algorithm for the no-substitution setting, which obtains the same approximation factor of , and a similar additive error to that of the offline algorithm. The algorithm, called SKM2, is listed in Alg. 2. It receives as input the confidence parameter , the number of clusters , and the sequence size . Similarly to SKM, it also works in two phases, where the first phase is used for estimation, and the second phase is used for selecting centers. The first phase is further split to subsequences . The second phase is denoted .
The main challenge in designing SKM2 is to make sure that elements are selected as centers only if it will later be possible, with a high probability, to select additional centers so that the final risk will be near-optimal. To this end, we define a recursive notion of goodness. For a set of size , we say that it is good if its risk on is lower than some threshold. For a set of size less than , it is good if there is a sufficient probability to find another element to add to this set, such that the augmented set is good. The following definition formalizes this.
Definition 4.1.
Let of size at most . Let and . The predicate good is defined as follows, with respect to the subsamples .

For of size , is good (or simply good) if .

For of size , define . is good if .
The algorithm sets the value of depending on the input parameters, and finds a value for such that is good. It then iteratively gets the examples, and adds the observed example as a center if the addition preserves the goodness of the solution collected so far. We show below that if is good for as defined in Alg. 2, then with a high probability SKM2 will succeed in selecting centers with a risk at most on , and that this will result in a near-optimal clustering.
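The recursive goodness predicate and the selection loop can be sketched as follows. This is a hedged illustration and not Alg. 2: it assumes one estimation subsample per set size (`subsamples[j]` is used to extend a set of size `j`), a separate `risk_sample` for evaluating full-size sets, and that the fraction threshold `alpha` and risk threshold `theta` are given, whereas the text has the algorithm compute these values. All names (`make_good`, `skm2_select`) are hypothetical.

```python
from functools import lru_cache
from typing import Callable, FrozenSet, List, Sequence


def make_good(
    subsamples: List[Sequence[float]],
    risk_sample: Sequence[float],
    k: int,
    alpha: float,
    theta: float,
    dist: Callable[[float, float], float],
) -> Callable[[FrozenSet[float]], bool]:
    def risk(centers: FrozenSet[float]) -> float:
        return sum(min(dist(x, c) for c in centers) for x in risk_sample) / len(risk_sample)

    @lru_cache(maxsize=None)
    def good(centers: FrozenSet[float]) -> bool:
        if len(centers) == k:
            # Base case: a full set is good if its estimated risk is low.
            return risk(centers) <= theta
        # Recursive case: a smaller set is good if a large enough fraction
        # of the next subsample extends it to a good set.
        pool = subsamples[len(centers)]
        hits = sum(1 for x in pool if x not in centers and good(centers | {x}))
        return hits / len(pool) >= alpha

    return good


def skm2_select(second_phase: Sequence[float], good, k: int) -> FrozenSet[float]:
    chosen: FrozenSet[float] = frozenset()
    for x in second_phase:
        if len(chosen) == k:
            break
        # Keep x only if the augmented set is still good.
        if x not in chosen and good(chosen | {x}):
            chosen = chosen | {x}
    return chosen


good = make_good(
    subsamples=[[1, 8, 5], [1, 8, 5]],
    risk_sample=[0, 1, 2, 7, 8, 9],
    k=2, alpha=0.3, theta=0.7,
    dist=lambda a, b: abs(a - b),
)
print(sorted(skm2_select([5, 9, 1, 4, 8], good, 2)))  # prints [1, 8]
```

Evaluating the predicate on the empty set already enumerates combinations across all the subsamples, which is consistent with the exponential computational cost attributed to SKM2.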
We prove the following result for SKM2.
Theorem 4.2.
Suppose that SKM2 is run with inputs and . For any and distribution over , with a probability at least ,
By setting and noting that , we get
As discussed above, this is the same multiplicative approximation factor as the optimal offline algorithm. The additive error is larger by a factor of .
We now prove Theorem 4.2. Note that by definition of goodness for of size , it follows that if SKM2 succeeds in selecting centers, then the solution it finds has a risk of at most on . We thus need to show that indeed centers are selected with a high probability, that is close to the optimal achievable risk, and that the risk on is close to the risk on . We use the following lemma, proved in the appendix based on Bernstein’s inequality.
Lemma 4.3.
Let be i.i.d. random variables in with mean . Let be the empirical mean. Then, with a probability at least , .
Denote the sizes of by respectively. First, we show that SKM2 selects centers with a high probability.
Lemma 4.4.
With a probability at least , by the end of the run SKM2 has collected centers.
Proof.
Let be the possible values of examined by the algorithm which are smaller than . Note that since for , the largest such that satisfies . Therefore, . By Lemma 3.2 and a union bound, with a probability at least , for any , , and of size ,
Condition below on this event. Let be the value selected by SKM2, let be the set of points collected by the algorithm until iteration , and let . If , then it is good by the definition of . Otherwise, it is good by the condition on line 6. Therefore, by definition, . This implies the LHS of the implication above, hence .
Therefore, conditioned on the event above, the probability that the next sample satisfies that is good is at least . Since this holds for all iterations until there are centers in , the probability that the algorithm collects less than centers is at most the probability of obtaining less than successes in independent experiments with a probability of success . Let be the empirical fraction of successes on experiments. By Lemma 4.3, since , with a probability , . Since , we have . Therefore, taking a union bound, we get that with a probability of at least , the algorithm selects centers. ∎
We now show that the value of selected by SKM2 is close to the optimal risk. By Hoeffding’s inequality and a union bound over the possible choices of , for all of size , with a probability , Call this event .
Lemma 4.5.
Let , . With a probability of , implies that the value of set by SKM2 satisfies .
Proof.
Let . For sets , denote by the collection of all sets of size that include exactly one element from each of . We start by showing that with a high probability, there exist sets such that for all , , , and Let be an optimal clustering for . For , let such that and . Let . Denote . By Lemma 4.3, since , we have that with a probability at least , for all , , as required.
We now show that . By the definition of , for any we have , where is defined before Lemma 3.4. Therefore, the conditions of Lemma 3.5 hold with , , and . Hence, for ,
Under , we get that for all ,
Lastly, we show that the existence of implies an upper bound on the value of set by the algorithm. First, we show that is good. This can be seen by induction on the definition of goodness: For , all are good since . Now, suppose that all sets for some are good, and let . Then, since for all we have , it follows that
Therefore, by definition, is good. By induction, we conclude that is also good. Clearly, is also good for any . Since the value selected by SKM2 is set to the smallest value such that is natural and is good, and since , we conclude that , as required. ∎
The proof of Theorem 4.2 is now immediate.
Proof of Theorem 4.2.
Assume that holds, as well as the events of Lemma 4.4 and Lemma 4.5. This occurs with a probability at least . By Lemma 4.4 the algorithm selects which is of size and is good. Thus, by the definition of goodness, . By , it follows that . By Lemma 4.5, The theorem follows by plugging in the values of and simplifying. ∎
5 Discussion
In this work, we showed that an approximation factor which is twice that of the sample-based offline algorithm can be obtained by an efficient no-substitution algorithm. We further showed that removing the efficiency requirement allows obtaining the same approximation factor as the best offline algorithm. It is an open question whether there is an efficient no-substitution algorithm with the same approximation factor as the best efficient offline algorithm.
The essential property of SKM2 that allows obtaining an improved approximation factor is the requirement that only centers which allow many possible choices of other centers are selected. In other words, the center choice should not be very sensitive to a small number of points in the sample. This type of stability, or robustness, has been previously studied for clustering algorithms in other contexts [see, e.g., 21, 1]. More generally, stability of algorithms is known to be an essential property in learning algorithms [8]. Thus, the relationship between stability of algorithms and success in the no-substitution setting is an interesting open problem.
References
[1] M. Ackerman, S. Ben-David, D. Loker, and S. Sabato. Clustering oligarchies. In Artificial Intelligence and Statistics, pages 66–74, 2013.
[2] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler. StreamKM++: A clustering algorithm for data streams. Journal of Experimental Algorithmics (JEA), 17:2–4, 2012.
[3] C. C. Aggarwal and P. S. Yu. Online analysis of community evolution in data streams. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 56–67. SIAM, 2005.
[4] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pages 10–18, 2009.
[5] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671–680. ACM, 2014.
[6] M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular secretary problem and extensions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 39–52. Springer, 2010.
[7] S. Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2-3):243–257, 2007.
[8] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[9] V. Braverman, D. Feldman, and H. Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
[10] M. Charikar, L. O’Callaghan, and R. Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 30–39. ACM, 2003.
[11] K. Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.
[12] D. Dueck and B. J. Frey. Non-metric affinity propagation for unsupervised image categorization. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[13] E. B. Dynkin. The optimum choice of the instant for stopping a Markov process. Soviet Mathematics, 4:627–629, 1963.
[14] M. Feldman, J. S. Naor, and R. Schwartz. Improved competitive ratios for submodular secretary problems. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 218–229. Springer, 2011.
[15] J. P. Gilbert and F. Mosteller. Recognizing the maximum of a sequence. In Selected Papers of Frederick Mosteller, pages 355–398. Springer, 2006.
[16] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 359–366. IEEE, 2000.
[17] Y. Hadi, F. Essannouni, and R. O. H. Thami. Video summarization by k-medoid clustering. In Proceedings of the 2006 ACM Symposium on Applied Computing, pages 1400–1401. ACM, 2006.
[18] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[19] T. Kesselheim and A. Tönnis. Submodular secretary problems: Cardinality, matching, and linear constraints. arXiv preprint arXiv:1607.08805, 2016.
[20] H. Lang. Online facility location against a t-bounded adversary. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1002–1014. Society for Industrial and Applied Mathematics, 2018.
[21] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[22] S. Lattanzi and S. Vassilvitskii. Consistent k-clustering. In International Conference on Machine Learning, pages 1975–1984, 2017.
[23] K. Leung and C. Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-Eighth Australasian Conference on Computer Science, Volume 38, pages 333–342. Australian Computer Society, Inc., 2005.
[24] E. Liberty, R. Sriharsha, and M. Sviridenko. An algorithm for online k-means clustering. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 81–89. SIAM, 2016.
[25] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
[26] A. Meyerson. Online facility location. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 426–431. IEEE, 2001.
[27] O. Nasraoui, J. Cerwinske, C. Rojas, and F. Gonzalez. Performance of recommendation systems in dynamic streaming environments. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 569–574. SIAM, 2007.
[28] H. Ng, S. Ong, K. Foong, P. Goh, and W. Nowinski. Medical image segmentation using k-means clustering and improved watershed algorithm. In 2006 IEEE Southwest Symposium on Image Analysis and Interpretation, pages 61–65. IEEE, 2006.
[29] S. Sabato and T. Hess. Interactive algorithms: Pool, stream and precognitive stream. Journal of Machine Learning Research, 18(229):1–39, 2018.
[30] A. Shepitsen, J. Gemmell, B. Mobasher, and R. Burke. Personalized recommendation in social tagging systems using hierarchical clustering. In Proceedings of the 2008 ACM Conference on Recommender Systems, pages 259–266. ACM, 2008.