What is Unsupervised Learning?
Unsupervised learning is a kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision. This is in contrast to supervised learning techniques, such as classification or regression, where a model is given a training set of inputs and a set of observations, and must learn a mapping from the inputs to the observations. In unsupervised learning, only the inputs are available, and a model must look for interesting patterns in the data.
Unsupervised learning is sometimes also called knowledge discovery. Common unsupervised learning techniques include clustering and dimensionality reduction.
Unsupervised Learning vs Supervised Learning
Supervised Learning
The simplest kinds of machine learning algorithms are supervised learning algorithms. In supervised learning, a model is trained with data from a labeled dataset, in which each example consists of a set of features and a label. This is typically a table with multiple columns representing features, and a final column for the label. The model then learns to predict the label for unseen examples.
Unsupervised Learning
In unsupervised learning, a dataset is provided without labels, and a model learns useful properties of the structure of the dataset. We do not tell the model what it must learn, but allow it to find patterns and draw conclusions from the unlabeled data.
Unsupervised learning problems are generally harder than supervised learning problems, since there are no labels against which a model's output can be checked. Unsupervised learning tasks typically involve grouping similar examples together, dimensionality reduction, and density estimation.
Reinforcement Learning
In addition to unsupervised and supervised learning, there is a third kind of machine learning, called reinforcement learning. In reinforcement learning, as with unsupervised learning, there is no labeled data. Instead, a model learns over time by interacting with its environment. For example, if a robot is learning to walk, it can attempt different strategies of taking steps in different orders. If the robot walks successfully for longer, then a reward is assigned to the strategy that led to that result. Over time, a reinforcement learning model learns as a child does, by balancing exploration (trying new strategies) and exploitation (making use of known successful techniques).
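The balance between exploration and exploitation can be illustrated with a minimal sketch of an epsilon-greedy strategy. This is one simple, commonly used scheme and not the only one; the function name, the reward probabilities, and the parameter values below are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Choose among several strategies, each paying a reward of 1 with a hidden probability."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)       # how often each strategy was tried
    estimates = [0.0] * len(reward_probs)  # running average reward per strategy
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a strategy at random.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploitation: use the strategy that currently looks best.
            arm = max(range(len(estimates)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average reward observed for this strategy.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical success probabilities of three walking strategies.
estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
print(estimates, counts)
```

A larger epsilon makes the model explore more, while a smaller one makes it rely more on what it has already learned.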
Mathematical difference between unsupervised learning and supervised learning
Unsupervised learning generally involves observing several examples of a random vector x, and attempting to learn the probability distribution p(x), or some interesting properties of that distribution. In contrast, in supervised learning, the model observes several examples of a random vector x, each paired with a value or vector y, and learns to predict y from x.
The line between supervised and unsupervised learning is not always clear cut. In other words, they are not formally defined concepts, and many algorithms can be used to perform both tasks. It is sometimes possible to re-express a supervised learning problem as an unsupervised learning problem, and vice versa.
For example, the supervised learning problem of learning p(y | x) can be re-expressed via Bayes' theorem as the unsupervised problem of learning the joint distribution p(x, y), from which p(y | x) can then be inferred.
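As a rough illustration of this re-expression, the sketch below estimates a joint distribution p(x, y) by counting, and then recovers the conditional p(y | x) by dividing by the marginal p(x). It assumes small discrete variables; the function names and the toy data are hypothetical.

```python
from collections import Counter

def fit_joint(pairs):
    """Estimate the joint distribution p(x, y) from observed (x, y) pairs by counting."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def conditional(joint, x):
    """Recover p(y | x) as p(x, y) divided by the marginal p(x)."""
    marginal = sum(p for (xi, _), p in joint.items() if xi == x)
    if marginal == 0:
        raise ValueError(f"x={x!r} was never observed")
    return {y: p / marginal for (xi, y), p in joint.items() if xi == x}

# Hypothetical toy data: x is the weather, y is the activity chosen.
pairs = [("sunny", "walk"), ("sunny", "walk"), ("sunny", "read"),
         ("rainy", "read"), ("rainy", "read"), ("rainy", "walk")]
joint = fit_joint(pairs)
p_y_given_sunny = conditional(joint, "sunny")
prediction = max(p_y_given_sunny, key=p_y_given_sunny.get)
print(p_y_given_sunny, prediction)
```

The joint distribution is learned without treating y as a special target column; the prediction only appears when the conditional is read off afterwards.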
Nonetheless, the concepts of supervised and unsupervised learning are very useful divisions to have in practice. Traditionally, regression and classification problems are categorized under supervised learning, while density estimation, clustering, and dimensionality reduction are grouped under unsupervised learning.
Examples of Unsupervised Learning Techniques
Cluster analysis
Clustering is the task of grouping a set of items so that each item is assigned to the same group as other items that are similar to it. Clustering is commonly used for data exploration and data mining.
There is no single clustering algorithm; common choices include k-means clustering, hierarchical clustering, and mixture models.
The result of a cluster analysis of data, where the color of the dots indicates the cluster assigned to each item by a k-means clustering algorithm
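A minimal sketch of k-means clustering in plain Python may help make the idea concrete. It follows the standard alternation between assigning points to their nearest centroid and moving each centroid to the mean of its points. The function name, the random initialization, and the toy points are assumptions for illustration; library implementations are more careful about initialization and efficiency.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points to centroids and updating the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k data points chosen at random
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical data: two well-separated groups of two-dimensional points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the initialization.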
Anomaly Detection
Anomaly detection is the identification of rare observations that differ significantly from the majority of a dataset. These are called anomalies, or outliers.
Common anomaly detection algorithms include k-nearest neighbor and isolation forests.
Univariate Anomaly Detection
In univariate anomaly detection, a series of observations of a single variable x is given to an algorithm. The algorithm identifies any observation which is significantly different from the previous observations.
The simplest approach is to calculate the z-score of every observation, defined as the number of standard deviations by which it differs from the mean of all observations. Before running the algorithm, we decide how large a z-score must be for an observation to count as an anomaly.
Mathematical formula for the z-score
$$z = \frac{x - \mu}{\sigma}$$
z-Score Formula Symbols Explained
$x$ | The observation in question.
$\mu$ | The mean value of all observations.
$\sigma$ | The standard deviation of all observations.
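The z-score rule can be sketched in a few lines of Python. Here the z-score of an observation is its distance from the mean divided by the standard deviation of all observations. The function name, the threshold of 2.5, and the sensor readings are assumptions for illustration.

```python
import statistics

def zscore_anomalies(observations, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(observations)
    std = statistics.pstdev(observations)  # population standard deviation
    if std == 0:
        return []  # all observations are identical, so nothing stands out
    return [
        (i, x) for i, x in enumerate(observations)
        if abs(x - mean) / std > threshold
    ]

# Hypothetical readings with one value far from the rest.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
anomalies = zscore_anomalies(readings, threshold=2.5)
print(anomalies)  # [(9, 25.0)]
```

Note that the outlier itself inflates the mean and standard deviation, so with only a handful of observations a z-score cannot become very large. This is why the threshold in this small example is set below 3.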
Multivariate Anomaly Detection
Anomaly detection can also be done in a multivariate context.
For example, for two variables, regression can be used to find the relationship between them. An anomaly would be a value which lies far from the regression line.
100 observations of two variables, x and y. One observation is an outlier. A correctly chosen anomaly detection algorithm would identify this as an outlier while ignoring the other observations.
Note that in both univariate and multivariate anomaly detection, the model is not provided with labels telling it which training examples are anomalies. Instead, it is given a rule for measuring how similar observations are, and identifies by itself the observations which lie furthest from the majority.
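The regression-based approach for two variables can be sketched as follows: fit a least-squares line, measure how far each observation lies from it, and flag observations whose distance is unusually large. The function names, the threshold, and the data are assumptions for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_anomalies(xs, ys, threshold=2.5):
    """Return indices of points lying unusually far from the regression line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the line
    # Least-squares residuals have zero mean, so this ratio is a z-score.
    return [i for i, r in enumerate(residuals) if abs(r) / spread > threshold]

# Hypothetical data roughly following y = 2x + 1, with one outlier at index 5.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.0, 3.1, 4.9, 7.0, 9.1, 20.0, 13.0, 14.9, 17.1, 19.0]
outliers = residual_anomalies(xs, ys)
print(outliers)  # [5]
```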
Neural Networks for Unsupervised Learning
There are a number of neural network architectures which can perform unsupervised learning.
Autoencoder
An autoencoder is a neural network which is able to learn efficient data encodings by unsupervised learning. The autoencoder is given a dataset, such as a set of images, and is able to learn a low-dimensional representation of the data by learning to ignore noise in the data.
Generative Adversarial Network
Another well-known unsupervised neural network model is the generative adversarial network. Generative adversarial networks are able to learn to generate new data examples which share important characteristics of the training dataset. For example, a generative adversarial network can be trained on a set of millions of photographs, and learn to generate lifelike but non-existent human faces, which humans are unable to distinguish from authentic images.
Synthetic faces generated by the well-known generative adversarial network StyleGAN, which was trained in an unsupervised manner on the Flickr-Faces-HQ face dataset.
Unsupervised Learning and Transformers
The state of the art for natural language processing models is currently transformer neural networks. These are feedforward neural networks used for processing sequential data, such as text data.
Although the best-known use of transformers is for supervised learning techniques such as machine translation, transformers can also be trained using unsupervised learning to generate new sequences which are similar to the sequences in a training set. In particular, they can generate realistic text documents which look like they were written by a human.
Attention Mechanism and Unsupervised Learning
The central part of a transformer network architecture is the attention mechanism, which allows the neural network to focus on parts of the input sequence when generating an output token. In 2019, Baihan Lin of Columbia University, New York, proposed a design for an unsupervised attention mechanism which researchers can use for model selection, that is, it can learn to best automate the hyperparameter selection and feature engineering stage of data science.
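To illustrate how an attention mechanism focuses on parts of the input, the sketch below implements the scaled dot-product attention commonly used in transformers, for a single query. It is not the unsupervised attention mechanism proposed by Lin; the function names and the toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Score each input position by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the value vectors.
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return output, weights

# Hypothetical keys and values for a sequence of three input positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([4.0, 0.0], keys, values)
print(weights, output)
```

Positions whose keys align with the query receive most of the weight, which is the sense in which the network "focuses" on them.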
Example of Unsupervised Learning: K-means clustering
Let us consider the example of the Iris dataset. This is a table of data on 150 individual plants belonging to three species. For each plant, there are four measurements, and the plant is also annotated with a target, that is the species of the plant. The data can be easily represented in a table. Below are five rows of the table corresponding to the features and labels of five plants.
A typical use of a supervised learning algorithm here would be to generalize from the plants in the training dataset, and learn to predict the species of a new plant from its four measurements. This is a simple classification problem and can be done using any of many standard algorithms including decision trees, random forests, multiclass logistic regression, and many more.
Let us now consider an unsupervised learning scenario. We give an unsupervised learning algorithm only the four feature columns, and not the target column:
The model must identify patterns in the plant measurements without knowing the species of any of the plants.
We can run a clustering algorithm on the measurement data of the 150 plants, to discover if the plants will naturally cluster together into groups.
We choose the simplest clustering algorithm, k-means clustering. Perhaps k-means clustering can discover the three species without being given this information?
We can set k = 3, so that the k-means algorithm must discover 3 clusters. Passing the 150 plants into the k-means algorithm, the algorithm annotates the 150 plants as belonging to group 0, 1, or 2:
The correspondence between the discovered clusters and the true species is imperfect. Putting back the target value, we can see that of the three virginica examples, one was assigned to group 2 and two were assigned to group 0.
In fact, we can summarize the clustering algorithm's output with a confusion matrix. The x-axis shows the predicted class output by the k-means, while the y-axis shows the information about the true species, which was withheld from the clustering algorithm.
It appears that the k-means was able to discover setosa as a separate class without being given any prior information, but its performance was much less impressive on the other two species.
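A table of this kind can be built by counting how often each true species coincides with each cluster number. The sketch below assumes two parallel lists of labels; the function name and the eight example labels are hypothetical and are not the results of the Iris experiment described here.

```python
from collections import Counter

def contingency_table(true_labels, cluster_ids):
    """Count how many items of each true class fall into each cluster."""
    counts = Counter(zip(true_labels, cluster_ids))
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_ids))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical labels for eight plants (illustrative only).
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor",
           "virginica", "virginica", "virginica"]
clusters = [1, 1, 1, 0, 2, 0, 0, 2]
rows, cols, table = contingency_table(species, clusters)
for name, row in zip(rows, table):
    print(f"{name:<12}", row)
```

Because cluster numbers are arbitrary, a good clustering does not have to put its large counts on the diagonal; what matters is that each row is concentrated in a single column.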
Taking two of our four measurements, we can plot these on a scatter plot and show the true species, and the clusters discovered by k-means. In the graph view, the two groupings look remarkably similar, when the colors are chosen to match, although some outliers are visible:
This shows how a clustering algorithm can discover patterns in unlabeled data without any extra accompanying information. It is clear that the k-means algorithm would be very useful if the species information were not available.
Clustering is a very powerful tool, but its performance is limited compared to supervised learning techniques, since much less prior information is provided.
Applications of Unsupervised Learning
Unsupervised Learning for Anomaly Detection in Finance
With the ubiquity of credit cards, financial fraud has become a major problem because of the ease with which an individual's credit card details can be compromised. Unauthorized or fraudulent transactions can sometimes be recognized by a break from the user's normal pattern of usage, such as large volume transactions, or rapid buying sprees.
Credit card transaction data can be fed into a multivariate anomaly detection algorithm in the form of a series of features, such as transaction amount, transaction time of day, transaction location, and time since the previous transaction. The outliers can then be flagged to the bank as potentially fraudulent. In these cases, the bank can either unilaterally block the card or request the user to authenticate the transaction in another way.
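One way to score transactions, sketched below, is by the distance to their nearest neighbors: a transaction with no similar transactions nearby receives a high score. This is a minimal illustration of the k-nearest neighbor idea and not a description of any bank's actual system; the function names, the choice of features, and the transaction values are all hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Rescale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def knn_scores(rows, k=2):
    """Score each row by the distance to its k-th nearest neighbor."""
    scaled = standardize(rows)
    scores = []
    for i, p in enumerate(scaled):
        distances = sorted(math.dist(p, q) for j, q in enumerate(scaled) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical transactions: (amount, hour of day, minutes since previous transaction).
transactions = [
    (12.5, 9, 300), (40.0, 12, 180), (8.9, 13, 60), (25.0, 18, 240),
    (31.0, 19, 90), (15.0, 10, 400), (2200.0, 3, 2),
]
scores = knn_scores(transactions)
suspect = max(range(len(scores)), key=scores.__getitem__)
print(suspect, scores[suspect])
```

Standardizing the features first matters here, since otherwise the transaction amount, which has the largest numeric range, would dominate the distance.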
Anomaly detection is often better suited than classification to credit card fraud detection, because fraudulent transactions are extremely rare but nevertheless very important, and a classification approach might not cope as well with the class imbalance between fraudulent and non-fraudulent transactions.
The bank will have to decide where to draw the line, to weigh up the risk of inconvenience to the user resulting from blocking a card unnecessarily, versus the greater inconvenience of missing a fraudulent transaction.
Unsupervised Learning for Clustering Medical Data
In the medical field, large amounts of data are often available, but no labels are present. For example, devices such as CAT scanners, MRI scanners, and EKG machines produce streams of numbers, but these are entirely unlabeled. In these cases obtaining labeled data is difficult, costly, or impossible, and so supervised learning methods cannot be applied.
A number of clustering methods have been applied to datasets of neurological diseases, such as Alzheimer's disease. These datasets are typically a combination of clinical and biological features. The clustering techniques allow medical practitioners to identify patterns across patients which would otherwise be difficult to find by eye.
In 2019, a team of researchers in the UAE, Egypt, and Australia conducted a meta-study of clustering algorithms on Alzheimer's disease data, and reported that it was possible to identify subgroups which corresponded to the stage of the disease's progression. They compared k-means clustering, k-means-mode clustering, hierarchical agglomerative clustering, and multi-layer clustering, and found that all of the clustering algorithms investigated brought a new level of insight into the various subtypes of Alzheimer's patients.
Unsupervised Learning History
In the 1930s, the American anthropologists Harold Driver and Alfred Kroeber collected statistical data from a number of ethnographic analyses that they had carried out on Polynesian cultures. They were interested in a way of measuring the similarities between cultures, and of assigning cultures to groups based on those similarities. In 1932, they published a book titled Quantitative Expression of Cultural Relationships, which described their clustering algorithm. This was the birth of the field of cluster analysis.
Over the next ten years, the psychologists Joseph Zubin and Robert Tryon introduced cluster analysis to psychology, and it was soon used to classify personality traits.
In 1957, Stuart Lloyd at Bell Labs introduced the standard algorithm for k-means, using it for pulse-code modulation, which is a method of digitally representing sampled analog signals. Over time many iterations of the k-means algorithm, as well as other popular clustering algorithms, have been developed, and clustering has become widely used in data science across all industries in recent years.
The idea of anomaly detection for intrusion detection systems was formalized by the American information security researcher Dorothy Denning in 1986. At that time she was working for the nonprofit SRI International. Her intrusion detection system used a set of rules to identify intrusions (hacking attempts) on a system according to their statistical differences from typical users and events. Denning's design forms the base of many modern anomaly detection systems today.
Neural network-based unsupervised learning techniques such as generative adversarial networks and autoencoders have generally only come to prominence since the 2010s, as the computing power and data needed for neural networks to become widely used became available. For example, generative adversarial networks were initially proposed by the American researcher Ian Goodfellow and his colleagues in 2014, although the groundwork had been laid by others in previous years.
References
Murphy, Machine Learning: A Probabilistic Perspective (2012)
Goodfellow et al, Deep Learning (2016)
Driver and Kroeber, Quantitative Expression of Cultural Relationships (1932)
Aggarwal, Outlier Analysis (2017)
Alashwal et al, The Application of Unsupervised Clustering Methods to Alzheimer’s Disease (2019)
Lin, Constraining Implicit Space with MDL: Regularity Normalization as Unsupervised Attention (2019)