K-Means

What is K-Means Clustering?

K-means clustering is an unsupervised learning technique that groups unlabeled data by the similarity of their features, rather than by pre-defined categories. The variable K is the number of groups (clusters) to create. The goal is to split the data into K clusters and report the location of the center of mass (centroid) of each cluster. A new data point can then be assigned to a cluster (class) based on the closest center of mass.
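
As an illustration of that last step, the sketch below assigns a new point to the cluster with the closest center of mass. It assumes Euclidean distance and uses made-up centroid coordinates; the function name nearest_centroid is hypothetical and not taken from any library.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`.

    Assumption: "closest" means smallest Euclidean distance.
    """
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Hypothetical centers of mass for K = 2 clusters in a 2-feature space.
centroids = [(1.0, 1.0), (8.0, 9.0)]

# The new point lies much nearer the second centroid, so it gets index 1.
label = nearest_centroid((7.5, 8.0), centroids)
```

Any other distance function could be substituted for math.dist without changing the rest of the sketch.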

The main advantage of this approach is that it reduces human bias in defining the groups. Instead of a researcher creating the classification groups in advance, the machine forms its own clusters from empirical evidence in the data, rather than from assumptions.


How does K-Means Clustering Work?

The centroid of each cluster is a collection of feature values that defines the resulting group. Examining these centroid feature values helps to interpret qualitatively what kind of group each cluster represents. Starting from an initial set of K centroids, the algorithm iterates over the following steps:

  1. Data assignment: Each cluster is defined by its centroid (a central collection of feature values). Each data point is assigned to its nearest centroid, according to some choice of distance function.

  2. Centroid update: After all data points are assigned, the centroids are recalculated by taking the mean of all data points assigned to that cluster.

  3. Repeat: The assignment and update steps repeat until some stopping criterion is met, such as no data points changing clusters, the sum of the distances from the points to their centroids no longer decreasing, or some maximum number of iterations being reached.
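
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance, initial centroids picked at random from the data points, and two stopping criteria: no change in assignments, or a maximum iteration count. The names k_means, max_iters, and seed are hypothetical.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Assumption: initialize by choosing k distinct data points at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None

    for _ in range(max_iters):
        # 1. Data assignment: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # 3. Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels

        # 2. Centroid update: mean of the points assigned to each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )

    return centroids, labels

# Made-up example data: two visibly separated groups of 2-feature points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

The result depends on the initial centroids, so on less clearly separated data it is common to run the procedure several times with different seeds and keep the run with the smallest total distance from the points to their centroids.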