Clustering is the problem of grouping elements into clusters such that each element is similar to the elements in the cluster assigned to it and not similar to elements in any other cluster. It is one of, if not the, primary problem in the area of machine learning known as Unsupervised Learning and no clustering problem is as famous and widely considered as the-Means problem: Given a multiset find centroids minimizing . Several NP-Hardness results exist for finding the optimal
-Means clustering in general, forcing one to turn towards heuristics.-Means is NP-hard even for and general dimension  and it is also NP-hard for and general . Even hardness of approximation results exist [15, 8]. In  the authors show there exists a such that it is NP-hard to approximate -Means to within a factor of optimal, and in  it is proved that . On the upper bound side the best known polynomial time approximation algorithm for -Means has an approximation factor of 6.357 . In practice, Lloyd’s algorithm is a popular iterative local search heuristic that starts from some random or arbitrary clustering. The running time of Lloyd’s algorithm is where is the number of rounds of the local search procedure. In theory, if Lloyd’s algorithm is run to convergence to a local minimum, could be exponential and there is no guarantee on how well the solution found approximates the optimal solution [6, 20]. Lloyd’s algorithm is often combined with the effective seeding technique for selecting initial centroids due to  that gives an expected approximation ratio for the initial clustering, which can then improved further by Lloyd’s algorithm.
For the one-dimensional case, the -Means problem is not NP-hard. In particular, there is an time and space dynamic programming solution for the 1D case, due to work by . The 1D -Means problem is encountered surprisingly often in practice, some examples being in data analysis in social networks, bioinformatics and retail market [5, 13, 18].
It is only natural to try other reasonable distance measures for the data considered and define different clustering problems. There are many other choices than the sum of squares of the Euclidian distances that define -Means. For instance, one could use any norm instead. The special case of is known as -Medians clustering and has also received considerable attention. The -Medians problems is also NP-hard and the best polynomial time approximation algorithms has an approximation factor of 2.633 .
In  the authors consider and define clustering with Bregman Divergences. Bregman Divergence generalize squared Euclidian distance and thus Bregman Clusterings include the
-Means problem, as well as a wide range of other clustering problems that can be defined from Bregman Divergences like e.g. clustering with Kullback–Leibler divergence as the cost. Interestingly, the heuristic local search algorithm for Bregman Clustering is basically the same approach as Lloyd’s algorithm for -Means. Clustering with Bregman Divergences is clearly NP-Hard as well since it includes -Means clustering. We refer the reader to  for more about the general problem. For the 1D version of the problem,  generalized the algorithm from  to the -Medians problems and Bregman Divergences achieving the same time and space bounds.
1.1 Our Results
In this paper we give theoretically and practically efficient algorithm for 1D clustering problems, in particular -Means.
The -Means clustering problem in 1D is defined as follows. Given and , find centroids minimizing the cost
The main results of this paper are new fast algorithms for 1D -Means. First we give an algorithm that computes the optimal -Means clustering that runs in time using optimal space, or time if the input is already sorted. The algorithm also computes the cost of the optimal clustering using clusters for all . This is relevant for instance for model selection of the right . This is an improvement by a factor of in time and in space compared to the existing solution (which also supports computing the cost for all . The constant factors hidden by the -notation are small and we expect this algorithm to be very efficient in practice. Second, we show how to compute an optimal -Means clustering in time using space. This algorithm is mainly of theoretical interest as we expect constants to be rather large. As opposed to the time algorithm, this algorithm does not compute the optimal costs for using clusters for all .
The time algorithm relates to a natural regularized version of -Means clustering where instead of specifying the number of clusters beforehand, we instead specify a cost of using an extra cluster and then minimize the cost of the clustering plus the cost of the number of clusters used. Formally, the problem is as follows: Given and , compute the optimal regularized clustering:
We show that this problem is solvable in time if the input is sorted.
In 1D, Lloyd’s algorithm can be implemented to run in time where is the number of rounds, and we expect that our algorithm can compute the optimal clustering for reasonable in essentially the same time as Lloyd’s algorithm can approximate it.
The -Medians problem is to compute a clustering that minimize the sum of absolute distances to the centroid, i.e. compute that minimize
Our algorithms generalize naturally to solve this problem in the same time bounds as for the -means problem.
Let be a differentiable real-valued strictly convex function. The Bregman Divergence induced by is defined as
Notice that the Bregman Divergence induced from , gives squared Euclidian Distance (k-Means). Bregman divergences are not metrics since they are not symmetric in general and the triangle inequality is not necessarily satisfied. They do however have many redeeming qualities, for instance Bregman Divergences are convex in the first argument, albeit not the second, see [9, 10] for a more comprenhensive treatment.
The Bregman Clustering problem as defined in  is to find centroids that minimize
where is a Bregman Divergence. For our case, where the inputs , we assume that computing a Bregman Divergence, i.e. evaluating and its derivative, takes constant time. We show that our algorithms naturally generalize to 1D clustering using any Bregman Divergence to define the cluster cost while still maintaing the same running time as for -Means.
An independent implementation of the time algorithm is available in the R package Ckmeans.1d.dp . The implementation is for -Means clustering, and uses space.
In Section 2 we describe the existing time algorithm for 1D -Means clustering that uses space. In Section 3 we show how to compute the same output as the old algorithm using only time and space. Then we show how to improve the running time to . Finally, in Section 4 we show how our new algorithms generalizes to different cluster costs than squared Euclidian distance.
2 The Dynamic Programming Algorithm
In this section, we describe the previous time and space algorithm presented in . We also introduce the definitions and notation we use in our new algorithm. We will always assume sorted input . If the input is not sorted, we start by sorting it in time. We also remark that there could be many ways of partitioning the point set and computing centroids that achieve the same cost. This is for instance the case if the input is identical points. The task at hand is to find any optimal solution.
Let be the cost of grouping into one cluster with the optimal choice of centroid, , the arithmetic mean of the points.
There is an space data structure that can compute in time for any using time preprocessing.
This is a standard application of prefix sums and works as follows. By definition,
With access to prefix sum arrays of and both the centroid and the cost is easily computed in constant time .
2.1 Algorithm Sketch
The algorithm computes the optimal clustering using clusters for all prefixes of input points , for , and for all using Dynamic Programming as follows.
Let be the cost of optimally clustering into clusters. For the cost of optimally clustering into one cluster is the cluster cost . That is, for all . This can be computed in time by Lemma 1.
Notice that is the cost of optimally clustering into clusters and is the cost of clustering into one cluster. This makes the first point in the last and rightmost cluster. Let be the argument that minimizes (1)
It is possible there exists multiple obtaining same minimal value for (1). To make the optimal clustering unique, such ties are broken in favour of smaller .
Notice is the first point in the rightmost cluster of the optimal clustering. Thus, given one can find the optimal solution by standard backtracking:
Here is the ’th cluster in the optimal clustering. One can naively compute each entry of and using (1) and (2). This takes time for each cell, thus and can be computed in time using space. This is exactly what is described in .
3 New Algorithms
The idea of the first new algorithm is simply to compute the tables and faster, by reducing the time to compute each row of and to time instead of time. This improvement exploits a monotonicity property of the values stored in a row of . This is explained in Section 3.1, resulting in an time and space solution, assuming sorted inputs. Section 3.2 then shows how to reduce the space usage to just while retaining running time. In Section 3.3 we show that the same property allows us to solve 1D -Means for in, time and linear space, and solve the regularized version of 1D -Means in time.
3.1 Faster Algorithm From Monotone Matrices
In this section we reduce the problem of computing a row of and to searching an implicitly defined matrix of a special form, which allows us to compute each row of and in linear time.
Define as the cost of the optimal clustering of using clusters, restricted to having the rightmost cluster (largest cluster center) contain the elements . For convenience, we define for as the cost of clustering into clusters, i.e. the last cluster is empty. This means that satisfies:
where by definition when (which is consistent with the definition in Section 2). We have that relates to as follows:
where ties are broken in favor of smaller (as defined in Section 2.1).
This means that when we compute a row of and , we are actually computing for all . We think of as an matrix with rows indexed by and columns indexed by . With this interpretation, computing the ’th row of and corresponds to computing for each row in , the column index that corresponds to the smallest value in row . In particular, the entries and correpond to the value and the index of the minimum entry in the ’th row of respectively. The problem of finding the minimum value in every row of a matrix has been studied before . First we need the definition of a monotone matrix.
In , the authors showed the following:
 Finding for each row of an arbitrary monotone matrix requires time, whereas if the matrix is totally monotone, the time is when and is when .
The fast algorithm for totally monotone matrices is known as the SMAWK algorithm and we will refer to it by that (cool) name.
Let’s relate this to the 1D -Means clustering problem. That is monotone means that if we consider the optimal clustering of the points with clusters, then if we start adding more points after , then the first (smallest) point in the last of the clusters can only increase (move right) in the new optimal clustering of . This sounds like it should be true for 1D -Means and it turns out it is. Thus, applying the algorithm for monotone matrices, we can fill a row of and in time leading to an time algorithm for 1D -Means, which is already a great improvement.
However, as we show below, the matrix induced by the 1D -Means problem is in fact totally monotone:
The matrix is totally monotone.
As  remarks, a matrix is totally monotone if all its submatrices are monotone. To prove that is totally monotone, we thus need to prove that for any two row indices with and two column indices with , it holds that if then .
Notice that these values correspond to the costs of clustering elements and , starting the rightmost cluster with element and respectively. Since , this is the same as proving that
which is true if we can prove that . Rearranging terms, what we need to prove is that for any and , it holds that:
This is the property known as the concave (concave for short) property [24, 12, 23] and has been used to significantly speed up algorithms, including Dynamic Programming algorithms, for for other problems. We start by handling the special case where . In this case, we have by definition that , thus we need to show that . This is the case since any point amongst is included in at most one of and (since ). Thus is the cost of taking two disjoint and consecutive subsets of the points and clustering the two sets using the optimal choice of centroid in each. Clearly this cost is less than clustering all the points using one centroid.
We now turn to the general case where . Let be the mean of and be the mean of and assume that (the other case is symmetric). Finally, let denote the cost of grouping the elements in a cluster with centroid . Split the cost into the cost of the elements and the cost of the elements as
We trivially get since is the cost using the optimal centroid. Secondly,
since and all elements are less than or equal to (since is the mean of points that all are greater than ). Combining the results, we see that:
This completes the proof. ∎
Computing an optimal -Means clustering of a sorted input of size for takes time.
By construction the cost of the optimal clustering is computed for all . If we store the table the cluster centers for any can be extracted in time.
3.2 Reducing Space Usage
In the following, we show how to reduce the space usage to just while maintaining running time using a space reduction technique of Hirschberg . First observe that each row of and only refers to the previous row. Thus one can clearly “forget”row when we are done computing row . The problem is that if we don’t store all of , we cannot backtrack and find the optimal solution. In the following, we present an algorithm that avoids the table entirely.
Our key observation is the following: Assume and that for every prefix , we have computed the optimal cost of clustering into clusters. Note that this is precisely the set of values stored in the ’th row of . Assume furthermore that we have computed the optimal cost of clustering every suffix into clusters. Let us denote these costs by for . Then clearly the optimal cost of clustering into clusters is given by:
Our main idea is to first compute row of and row of using linear space. From these two, we can compute the argument minimizing (4). We can then split the reporting of the optimal clustering into two recursive calls, one reporting the optimal clustering of points into clusters, and one call reporting the optimal clustering of into clusters. When the recursion bottoms out with , we can clearly report the optimal clustering using linear space and time as this is just the full set of points.
From Section 3.1 we already know how to compute row of using linear space: Simply call SMAWK to compute row of for , where we throw away row of (and don’t even store ) when we are done computing row . Now observe that table can be computed by taking our points and reversing their order by negating the values. This way we obtain a new ordered sequence of points where . Running SMAWK repeatedly for on the point set produces a table such that is the optimal cost of clustering into clusters. Since this cost is the same as clustering into clusters, we get that the ’th row of is identical to the ’th row of if we reverse the order of the entries.
To summarize our algorithm for reporting the optimal clustering, do as follows: Let be an initially empty output list of clusters. If , append to a cluster containing all points. Otherwise (), use SMAWK on and to compute row of and row of using linear space (by evicting row from memory when we have finished computing row ) and time. Compute the argument minimizing (4) in time. Evict row of and row of from memory. Recursively report the optimal clustering of points into clusters (which appends the output to ). When this terminates, recursively report the optimal clustering of points into clusters. When the algorithm terminates, contains the optimal clustering of into clusters.
At any given time, our algorithm uses only space. To see this, first note that we evict all memory used to compute the value minimizing (4) before recursing. Furthermore, we complete the first recursive call (and evict all memory used) before starting the second. Finally, for the recursion, we don’t need to make a copy of points . It suffices to remember that we are only working on the subset of inputs .
Now let denote the time used by the above algorithm to compute the optimal clustering of sorted points into clusters. Then there is some constant such that satisfies the recurrence:
and for :
We claim that satisfies . We prove the claim by induction in . The base case follows trivially by inspection of the formula for . For the inductive step , we use the induction hypothesis to conclude:
For , we have that , therefore:
Which is what we needed to prove.
Computing an optimal -Means clustering of a sorted input of size takes time and uses space.
Note to compute the cost of the optimal clustering for all we ensure that we never delete the last column of the cost matrix which requires an additional space.
3.3 Even Faster Algorithm
In this section we show that the concave property we proved for the cluster costs yields and algorithm for computing the optimal -Means clustering for one given in time. The result follows almost directly from . In  Schieber gives an algorithm with the aforementioned running time for the problem of finding the shortest path of fixed length in a directed acyclic graph with nodes where the weights, , satisfy the concave property and are represented as a function that returns the weight of a given edge in constant time.
Theorem 4 ().
Computing a minimum weight path of length between any two nodes in a directed acyclic graph of size where the weights satisfy the concave property takes time using space.
We reduce the 1D -Means problem to a directed graph problem as follows. Sort the input in time and let denote the sorted input sequence. For each input we associate a node and add an extra node . Now define the weight of the edge from to as the cost of clustering in one cluster, which is . Each edge weight is computed in constant time and by the proof of Lemma 2, particularly Equation 3, the edge weights satisfy the monge concave property. Finally, to compute the optimal clustering we use Schiebers algorithm to compute the lowest weight path with edges from to .
Computing an optimal -Means clustering of an input of size for given takes time using space.
It is relevant to briefly consider parts of Schiebers algorithm and how it relates to -Means clustering, in particular a regularized version of the problem. Schiebers algorithm relies crucially on algorithms that given a directed acyclic graph where the weights satisfy the concave property computes a minimum weight path in time [23, 14]. Note the only difference in this problem compared to above, is that the search is not restricted to paths of edges only.
3.3.1 Regularized Clustering
Consider a regularized version of the -Means clustering problem where we instad of providing the number of clusters specify the cost of a cluster and ask to minimize the cost of the clustering plus the penalty for each cluster used. For simplicity, assume all the input points are distinct.
If we set the optimal clustering has cost zero and use a cluster for each input point. If we let increase towards infinity, the optimal number of clusters used in the optimal solution monotonically decrease towards one (zero clusters is not well defined). Let be the smallest distance between points in the input. The optimal cost of using clusters is then . When it is less costly to use only clusters since the added clustering of using one less cluster is smaller than the cost of a cluster. Letting increase again will inevitably lead to a miminum value such that for only clusters is used in the optimal solution. Following the same pattern is the difference between the optimal cost using clusters and clusters. Continuing this yields the very interesting event sequence that encodes the only relevant choices for the regularization parameter. Note that the algorithm actually yields since it computes the optimal cost for all .
In the reduction to the directed graph problem, adding a cost of for each cluster used corresponds to adding to the weight of each edge. Note that the edge weights clearly still satisfy the concave property. Thus, solving the regularized version of -Means clustering correpoonds to finding the shortest path (of any length) in a directed acyclic graph where the weights satisfy the concave property. By the algorithms in [23, 14] this takes time.
Computing an optimal regularized 1D -Means clustering of a sorted input of size takes time.
Now notice if we actually use as the cost per cluster, or any there is an optimal solution using clusters which is an optimal -Means clustering. This means that if the inputs are integers, we can solve the 1D -Means problem by a simple application of binary search in time where is the universe size .
4 Extending to More Distance Measures
In the following we show how to generalize our algorithm to Bregman Divergences and sum of absolute distances while retaining the same running time and space usage.
4.1 Bregman Divergence and Bregman Clustering
In this section we show how our algorithm generalizes to any Bregman Divergence. First, let us remind ourselves what a Bregman Divergence and a Bregman Clustering is. Let be a differentiable real-valued strictly convex function. The Bregman Divergence defined by is defined as
The Bregman Clustering problem as defined in , is to find a clustering, , that minimize
Notice that the cluster center is the second argument of the Bregman Divergence. This is important since Bregman Divergences are not in general symmetric.
For the purpose of 1D clustering, we mention two important properties of Bregman Divergences. For any Bregman Divergence, the unique element that minimizes the summed distance to a multiset of elements is the mean of the elements, exactly as it was for squared Euclidian distance. This is in one sense the defining property of Bregman Divergences .
Linear Separators For Bregman Divergences.
For all Bregman divergences, the locus of points that are equidistant to two fixed points in terms of a Bregman divergence is given by
which corresponds to a hyperplane. Also, the pointssits on either side of the hyperplane and the Voronoi cells defined using Bregman divergences are connected.
This means, in particular, that between any two points in 1D, , there is a hyperplane (point) with and all points smaller than are closer to and all points larger than are closer to . We capture what we need from this observation in a simple “distance” lemma:
Given two fixed real numbers , then for any point , we have , and for any point we have
Computing Cluster Costs for Bregman Divergences.
Since the mean minizes Bregman Divergences, the centroids used in optimal clusterings are unchanged compared to the -Means case. The prefix sums idea used to implement the data structure used for Lemma 1 generalizes to Bregman Divergences as observed in  (under the name Summed Area Tables). The formula for computing the cost of grouping the points in one cluster is as follows. Let be the arithmetic mean of the points , then
It follows that the Bregman Divergence cost of a consecutive subset of input points and the centroid can be computed in in constant time with stored prefix sums for and .
Monge Concave - Totally Monotone Matrix.
The only properties we used in Section 3.1 to prove the monge concave property and that the matrix is totally monotone, is that the mean is the minimizer of the sum of distances to a multiset of points, and that
when and all elements in . This is clearly still true by Lemma 3.
It follows that the algorithms we specified for 1D -Means generalize to any Bregman Divergence.
4.2 -Medians - Clustering with sum of absolute values
For the -Medians problem we replace the the sum of squared Euclidian distances with the sum of absolute distances. Formally, the -Medians problem is to compute a clustering, , minimizing
Note that in 1D, all norms are the same and reduce to this case. Also note that the minimizing centroid for a cluster is no longer the mean of the points in that cluster, but the median. To solve this problem, we change the centroid to be the median, and if there an even number of points, we fix the median to be the exact middle point between the two middle elements, making the choice of centroid unique.
As for Bregman Divergences, we need to show that we can compute the cost with this new cost in constant time. Also, we need to compute the centroid in constant time and argue that the cost is monge moncave which implies the implicit matrix is totally monotone. The arguments are essentially the same, but for completeness we briefly cover them below.
Computing Cluster Costs for Absolute Distances.
Not surprisingly, using prefix sums still allow constant time computation of . Let , and compute the centroid as
which can be computed in constant time with access to a prefix sum table of . This was also observed in .
Monge Concave - Totally Monotone Matrix.
The monge concave and totally monotone matrix argument above for Bregman Divergences (and for squared Euclidian distance) remain valid since first of all, we still have as is the median of points all greater than . Furthermore, it still holds that when and all elements are less than or equal to , then:
It follows that the algorithms we specified for 1D -Means generalize to the 1D -Median problem.
We wish to thank Pawel Gawrychowski for pointing out important earlier work on concave property.
-  A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1):195–208, 1987.
-  A. Aggarwal, B. Schieber, and T. Tokuyama. Finding a minimum-weightk-link path in graphs with the concave monge property and applications. Discrete & Computational Geometry, 12(3):263–280, 1994.
-  S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. CoRR, abs/1612.07925, 2016.
-  D. Aloise, A. Deshpande, P. Hansen, and P. Popat. Np-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.
-  V. Arnaboldi, M. Conti, A. Passarella, and F. Pezzoni. Analysis of ego network structure in online social networks. In Privacy, security, risk and trust (PASSAT), 2012 international conference on and 2012 international confernece on social computing (SocialCom), pages 31–40. IEEE, 2012.
-  D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages 144–153, New York, NY, USA, 2006. ACM.
-  D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
-  P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop. The hardness of approximation of euclidean k-means. In L. Arge and J. Pach, editors, 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, volume 34 of LIPIcs, pages 754–767. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015.
-  A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. J. Mach. Learn. Res., 6:1705–1749, Dec. 2005.
-  J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
-  D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, June 1975.
-  D. S. Hirschberg and L. L. Larmore. The least weight subsequence problem. SIAM Journal on Computing, 16(4):628–638, 1987.
-  O. Jeske, M. Jogler, J. Petersen, J. Sikorski, and C. Jogler. From genome mining to phenotypic microarrays: Planctomycetes as source for novel bioactive molecules. Antonie Van Leeuwenhoek, 104(4):551–567, 2013.
-  M. M. Klawe. A simple linear time algorithm for concave one-dimensional dynamic programming. Technical report, Vancouver, BC, Canada, Canada, 1989.
-  E. Lee, M. Schmidt, and J. Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.
-  M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The Planar k-Means Problem is NP-Hard, pages 274–285. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
-  F. Nielsen and R. Nock. Optimal interval clustering: Application to bregman clustering and statistical mixture learning. IEEE Signal Process. Lett., 21:1289–1292, 2014.
D. Pennacchioli, M. Coscia, S. Rinzivillo, F. Giannotti, and D. Pedreschi.
The retail market as a complex system.
EPJ Data Science, 3(1):1, 2014.
-  B. Schieber. Computing a minimum weightk-link path in graphs with the concave monge property. Journal of Algorithms, 29(2):204 – 222, 1998.
-  A. Vattani. k-means requires exponentially many iterations even in the plane. Discrete & Computational Geometry, 45(4):596–616, 2011.
-  H. Wang and J. Song. Ckmeans.1d.dp: Optimal and fast univariate clustering; R package version 4.0.0., 2017.
-  H. Wang and M. Song. Ckmeans. 1d. dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2):29–33, 2011.
-  R. Wilber. The concave least-weight subsequence problem revisited. Journal of Algorithms, 9(3):418 – 425, 1988.
F. F. Yao.
Efficient dynamic programming using quadrangle inequalities.
Proceedings of the Twelfth Annual ACM Symposium on Theory of Computing, STOC ’80, pages 429–435, New York, NY, USA, 1980. ACM.