What is Statistical Classification?
Statistical classification is a broad supervised learning approach in which a program learns from known, labeled data how to assign new, unlabeled data to categories.
The algorithms that sort unlabeled data into labeled classes, or categories of information, are called classifiers. A simple practical example is a spam filter that scans incoming “raw” emails and classifies each one as either “spam” or “not-spam.” Classifiers are the concrete implementation of pattern recognition in many forms of machine learning.
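To make the spam filter example concrete, here is a minimal sketch of one possible classifier, a naive Bayes model with Laplace smoothing, written in plain Python. The training messages, the function names train and classify, and the choice of naive Bayes are all assumptions made for illustration; a real spam filter would use far more data and features.

```python
import math
from collections import Counter

# Hypothetical toy training set: (text, label) pairs invented for illustration.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "not-spam"),
    ("lunch on monday", "not-spam"),
]

def train(examples):
    """Count words per class and examples per class from labeled data."""
    word_counts = {}
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the label with the highest Laplace-smoothed log posterior."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(classify("win a cash prize", model))
print(classify("monday meeting", model))
```

The labeled examples play the role of the known data, and classify assigns one of the existing labels to text it has not seen before.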
Statistical Classification versus Cluster Analysis
In the unsupervised learning technique of cluster analysis, none of the training data is labeled. This gives the system maximum flexibility to create its own rules for grouping the data, in the hope of finding hidden patterns unknown to humans.
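For contrast, the sketch below groups unlabeled numbers using k-means, one common clustering method chosen here only as an example. The function name kmeans_1d, the sample points, and the starting centers are hypothetical. Note that no labels are supplied anywhere; the groups emerge from the data alone.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D points, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical unlabeled data with two visible clumps.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(points, centers=[0.0, 5.0])
print(groups)
```

The algorithm returns groups but no names for them; deciding what each cluster means is left to a human or a later processing step.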
Statistical classification is much more structured, with the categories essentially dictated by the human trainer ahead of time. The classifier's rules are dynamic though, including the ability to handle vague or unknown values, all tailored to the type of inputs being examined. Many classifiers also output probability estimates, which end users can combine with utility functions to adjust how data is ultimately classified.
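One way to read the point about probability estimates and utility functions is as an expected-utility decision: the classifier reports how likely each class is, and the user's utilities decide what to do about it. The sketch below assumes a hypothetical utility table in which blocking a legitimate email is ten times worse than delivering a spam message; the numbers and the name best_action are illustrative only.

```python
# Hypothetical utilities: outer keys are actions, inner keys are true classes.
UTILITY = {
    "deliver": {"spam": -1.0, "not-spam": 0.0},
    "block": {"spam": 0.0, "not-spam": -10.0},
}

def best_action(p_spam, utility=UTILITY):
    """Pick the action with the highest expected utility given P(spam)."""
    probs = {"spam": p_spam, "not-spam": 1.0 - p_spam}

    def expected(action):
        return sum(probs[c] * utility[action][c] for c in probs)

    return max(utility, key=expected)

print(best_action(0.80))
print(best_action(0.95))
```

With these assumed utilities, a message is blocked only when the estimated spam probability exceeds 10/11 (about 0.91), so changing the utility table shifts the decision threshold without retraining the classifier.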
In unsupervised learning, cluster analysis fills the corresponding role by assigning data to groups it discovers on its own, while in supervised or semi-supervised learning, classifiers are how the system characterizes and evaluates unlabeled data.