k-Means Clustering of Lines for Big Data
The k-means for lines is a set of k centers (points) that minimizes the sum of squared distances to a given set of n lines in R^d. This is a straightforward generalization of the k-means problem where the input is a set of n points. Related problems minimize sum of (non-squared) distances, other norms, m-estimators or ignore the t farthest points (outliers) from the k centers. We suggest the first provable PTAS algorithms for these problems that compute (1+epsilon)-approximation in time O(n (n)/epsilon^2) for any given epsilon ∈ (0, 1), and constant integers k, d, t ≥ 1, including support for streaming and distributed input. Experimental results on Amazon EC2 cloud and open source are also provided.
READ FULL TEXT