isotree
(Python, R, C++) Extended Isolation Forest, SCiForest, and variations, with some additions (outlier detection + similarity + NA imputation)
This work briefly explores the possibility of approximating spatial distance (alternatively, similarity) between data points using the Isolation Forest method envisioned for outlier detection. The logic is similar to that of isolation: the more similar or closer two points are, the more random splits it will take to separate them. The separation depth between two points can be standardized in the same way as the isolation depth, transforming it into a distance metric that is limited in range, centered, and in compliance with the axioms of distance. This metric presents some desirable properties such as being invariant to the scales of variables or being able to account for non-linear relationships between variables, which other metrics such as Euclidean or Mahalanobis distance do not. Extensions to the Isolation Forest method are also proposed for handling categorical variables and missing values, resulting in a more generalizable and robust metric.
This work explores the idea of using Isolation Forests ([4], [6]) and variations thereof ([3], [5]) for estimating how similar/closer or dissimilar/further two points are in an arbitrary feature space, based on an observed sample of data to which an Isolation Forest model is fit, and from which the distributions and relationships between different variables/dimensions are implicitly incorporated into this distance/similarity metric.
The premise is simple: if a set of data points is split into two branches recursively multiple times by choosing some variable and split point in that variable uniformly at random, the points that are more distant will on average be separated (put into different tree branches) with fewer splits (closer to the root node), while points that are more similar will require more splits to become separated.
The procedure is less efficient than simpler calculations such as Euclidean distance, but offers some advantages: this distance/similarity is invariant to the scale of each variable, having non-normal distributions does not present any issues, and potential correlations between variables in the distribution are taken into account, even if these correlations are not linear. Additionally, some small modifications to the Isolation Forest algorithm allow incorporation of categorical variables and handling of missing values in the procedure.
Isolation Forest (a.k.a. iForest) ([4], [6]) is an algorithm devised for outlier or anomaly detection based on the concept of isolation: if a set of data points is split according to a randomly chosen variable by finding a split point at random within the range of that variable in the data, assigning all points that are less than or equal to this threshold to one branch and the rest to the other, and this process is continued recursively on each branch, then outlier points will become isolated (put alone) in one branch quicker (with fewer splits, closer to the root node of the tree) than non-outlier points.
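The isolation idea can be illustrated with a minimal Python sketch. It is an illustration under simplifying assumptions (no sub-sampling, no depth limit, single-variable uniform splits), not the implementation from [4] or from the isotree library; the function and variable names are hypothetical.

```python
import random

def isolation_depth(points, target, rng, depth=0):
    """Count the random splits needed to leave `target` alone in its branch."""
    if len(points) <= 1:
        return depth
    # Only variables that still vary within this branch can be split.
    dims = [j for j in range(len(target))
            if min(p[j] for p in points) < max(p[j] for p in points)]
    if not dims:  # only exact duplicates of the target remain
        return depth
    j = rng.choice(dims)
    lo = min(p[j] for p in points)
    hi = max(p[j] for p in points)
    threshold = rng.uniform(lo, hi)
    # Follow only the branch that contains the target point.
    if target[j] <= threshold:
        branch = [p for p in points if p[j] <= threshold]
    else:
        branch = [p for p in points if p[j] > threshold]
    if len(branch) == len(points):  # degenerate threshold: draw again
        return isolation_depth(points, target, rng, depth)
    return isolation_depth(branch, target, rng, depth + 1)

rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data += [inlier, outlier]

def average_depth(target, n_trees=100):
    """Average isolation depth of `target` over independently grown random trees."""
    return sum(isolation_depth(data, target, rng) for _ in range(n_trees)) / n_trees

depth_inlier = average_depth(inlier)
depth_outlier = average_depth(outlier)
print(depth_inlier, depth_outlier)
```

In this synthetic example the point placed far from the Gaussian cloud is expected to be isolated after noticeably fewer splits than the point at its center.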
The idea can be extended to non-random splits based on the standard deviations of the variable being split that are obtained at each branch ([5]), and to splitting hyperplanes ([3], [5]), which, as shown in [3], can help to remove some biases that are introduced by the single-variable splitting process. As outliers can only be considered to be so if their average isolation depth is less than expected for a random data point, this procedure can be terminated before isolating every single point by stopping the process once it reaches the depth that a balanced binary tree would have. The remaining isolation depth for non-isolated points is then approximated by adding to the terminal depth the expected value of this depth if the process were continued with uniformly-random data and uniformly-random splits on the number of points that remain in that node.
If considering two random points in a subset, one can also think of separation instead of isolation as the binary trees are grown: if the points are split (assigned to different branches from a binary tree node) according to being smaller or greater than a random value within the range of some variable in the feature space, then the closer two points are in that dimension, the higher the probability that they will end up in the same branch if the split point is chosen at random, because the closer they are, the larger the number of possible split points under which they end up together. If some variable underwent a linear or affine transformation, each possible split on the points will still have the same probability as before, since this only depends on their relative position within the range of the variable.
If the process is repeated further, choosing a variable and split point at random in each branch that was obtained in the previous split, then again closer points in the new variable have a higher chance of ending up in the same branch, but this time it is conditioned on not having been separated in the previous split. In this regard, non-random splits that aim at finding the point that minimizes the standard deviations of the variable in the obtained branches could also do a better job at making clustered points appear even more similar, because splits will tend to separate clusters first (see [5] and [9]).
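The single-variable argument above can be made concrete with a short sketch. It assumes a split threshold drawn uniformly within the observed range of one variable, so that two points stay together unless the threshold falls between them; the helper names are hypothetical.

```python
def prob_same_branch(a, b, lo, hi):
    """Probability that a uniform threshold in [lo, hi] does not fall between a and b."""
    return 1.0 - abs(a - b) / (hi - lo)

lo, hi = 0.0, 10.0
p_close = prob_same_branch(4.0, 5.0, lo, hi)  # points 1 unit apart
p_far = prob_same_branch(1.0, 9.0, lo, hi)    # points 8 units apart

def affine(x, scale=1000.0, shift=-7.0):
    """An arbitrary affine rescaling of the variable."""
    return scale * x + shift

# The same pair of points after rescaling both the data and its range.
p_close_scaled = prob_same_branch(affine(4.0), affine(5.0), affine(lo), affine(hi))
print(p_close, p_far, p_close_scaled)
```

The closer pair has the higher probability of staying together, and that probability is unchanged by the rescaling because only relative positions within the range matter.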
If this procedure is repeated indefinitely from the beginning until each point becomes isolated, then the average separation depth between any two points across these random trees will be greater iff the points are closer (with closeness influenced by the data distribution), and having multiple trees will remove the large expected variability introduced by having to start by separating points according to one variable (that is, a large number of pairs is expected to become separated with the first split, which is typically not what happens with isolation as most points don’t end up isolated after the first split).
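The following Python sketch illustrates the average separation depth between pairs of points. It assumes single-variable uniform splits with no sub-sampling or depth limit, grows only the branches that contain the pair of interest, and uses hypothetical names throughout; it is not the isotree implementation.

```python
import random

def separation_depth(points, a, b, rng, depth=0):
    """Count random splits until `a` and `b` fall into different branches."""
    dims = [j for j in range(len(a))
            if min(p[j] for p in points) < max(p[j] for p in points)]
    if not dims:  # nothing left to split on
        return depth
    j = rng.choice(dims)
    lo = min(p[j] for p in points)
    hi = max(p[j] for p in points)
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[j] <= threshold]
    if len(left) == len(points):  # degenerate threshold: draw again
        return separation_depth(points, a, b, rng, depth)
    a_left, b_left = a[j] <= threshold, b[j] <= threshold
    if a_left != b_left:
        return depth + 1  # this split separates the pair
    branch = left if a_left else [p for p in points if p[j] > threshold]
    return separation_depth(branch, a, b, rng, depth + 1)

rng = random.Random(1)
a, b, c = (0.50, 0.50), (0.52, 0.51), (0.95, 0.05)
data = [(rng.random(), rng.random()) for _ in range(200)] + [a, b, c]

def average_separation(p, q, n_trees=200):
    """Average separation depth of the pair over independently grown trees."""
    return sum(separation_depth(data, p, q, rng) for _ in range(n_trees)) / n_trees

sep_close = average_separation(a, b)
sep_far = average_separation(a, c)
print(sep_close, sep_far)
```

In this example the nearby pair (a, b) is expected to need many more splits to separate than the distant pair (a, c).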
Just like with isolation, it's possible to calculate the expected separation depth between two random points if a set of points is split by a random tree procedure which would assign from the remaining points a random number of them to go to one branch and the remainder to the other. The expected separation depth under a randomly-built tree like this, with the same number of terminal nodes as points in the data or subset in a terminal node, can be calculated recursively by considering that, if two points are not separated right after a split, they will go together into yet another node, but of smaller remainder size, in which the procedure will be repeated again (see [2] for details). Writing $E[s(n)]$ for the expected separation depth between two random points among $n$, the formula is given by:
$$E[s(n)] = 1 + \frac{1}{n-1} \sum_{i=1}^{n-1} \left[ \frac{i(i-1)}{n(n-1)} E[s(i)] + \frac{(n-i)(n-i-1)}{n(n-1)} E[s(n-i)] \right]$$
With $E[s(1)] = 0$ (single point is already isolated) and $E[s(2)] = 1$ (two points always become separated in one split).
It can be more efficiently calculated by a recursion as follows:
$$E[s(n)] = E[s(n-1)] + \frac{3n - 4 - n \, E[s(n-1)]}{n(n-1)}$$
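A minimal Python sketch of both calculations follows. It assumes the recurrence implied by the description above: the number of points sent to one branch is uniform among the possible sizes, a pair of points continues together only if both fall in the same branch, and the base cases are 0 for one point and 1 for two points. The function names are hypothetical, and the single-pass update is derived from that assumed recurrence.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expected_depth_direct(n):
    """Expected separation depth of two random points among n, by direct summation."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    total = 0.0
    for i in range(1, n):  # i points go to one branch, n - i to the other
        both_left = i * (i - 1) / (n * (n - 1))
        both_right = (n - i) * (n - i - 1) / (n * (n - 1))
        total += (both_left * expected_depth_direct(i)
                  + both_right * expected_depth_direct(n - i))
    return 1.0 + total / (n - 1)

def expected_depth_recursive(n):
    """Same quantity, updated one sample size at a time in a single pass."""
    if n <= 1:
        return 0.0
    depth = 1.0  # value for two points
    for m in range(3, n + 1):
        depth += (3.0 * m - 4.0 - m * depth) / (m * (m - 1))
    return depth

for n in (2, 3, 4, 10, 50):
    print(n, expected_depth_direct(n), expected_depth_recursive(n))
```

The direct version needs a quadratic number of operations even with memoization, while the single-pass version is linear in the sample size, which is what makes it practical for large samples.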
As the sample size grows to infinity, the expected separation depth can be more easily approximated: this scenario with infinite discrete choices is equivalent to a scenario with a continuous number line, for which at each split, two points plus a split threshold are drawn according to a random uniform distribution, and will be separated if the threshold lies in between the two points. The probability of this happening in a uniform distribution is $1/3$ regardless of the range (each of the three draws is equally likely to be the middle one), and if they are not separated in that split, then the process starts again with a still-infinite sample, which will give the same probability of $2/3$ of not becoming separated, and thus the expected separation depth is $1 / (1/3) = 3$.
The fact that this number is constant for indefinitely large sample sizes comes in handy, as then one can assume that this is what it takes to separate two points in an arbitrary data sample, as will be explained next.
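The limiting argument can be checked numerically. The sketch below draws two points and a threshold independently and uniformly on the unit interval and estimates how often the threshold falls between the points; under this assumed setup the probability is one third, so the expected number of splits until separation is about three. The function name, trial count, and seed are arbitrary choices.

```python
import random

def estimate_separation_probability(n_trials=200_000, seed=0):
    """Monte Carlo estimate of P(threshold lies between two uniform points)."""
    rng = random.Random(seed)
    separated = 0
    for _ in range(n_trials):
        a, b, threshold = rng.random(), rng.random(), rng.random()
        if min(a, b) <= threshold < max(a, b):
            separated += 1
    return separated / n_trials

p_sep = estimate_separation_probability()
# Mean of a geometric distribution with success probability p_sep.
expected_depth = 1.0 / p_sep
print(p_sep, expected_depth)
```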
For determining isolation depth, when a tree node reaches the height limit with multiple points (as opposed to a single isolated point) or contains a set of points in which no further split is possible, the expected isolation depth for that remainder in [4] is approximated according to how many points ended up there, and this number is taken again at prediction time for new data points if they reach that terminal node in the tree. For separation, this is problematic: at the moment of determining the distances between points in a terminal node, the expected remaining separation depth will depend on the number of points that end up in that terminal node, making the distance between two points be affected by the presence of a third point when the trees are grown, or if this expectation were to be determined in the same way as isolation for new points. Fortunately, there would not be such a dependence on a third point when the time comes to use the already-fitted trees to estimate the distance between two new points if it can be assumed that at the end of each terminal node will lie an infinite sample of more data points drawn according to the same data distribution, because the estimated separation depth for an infinite sample is always $3$. So if the separation depth is calculated for a new sample of points, once two of them reach the same terminal node in a tree, their expected separation depth will be the same as if there were yet more points reaching that same terminal node.
Just like for the outlier score proposed in [4], the expectation in a randomly-built tree can also be used to produce a standardized metric, by comparing the obtained average separation depth between points ($\bar{s}$) against the expected separation depth in a randomly-built tree ($E[s]$) through a simple transformation such as $d = 2^{-\frac{\bar{s} - 1}{E[s] - 1}}$, subtracting the minimum separation (one split) from both numbers so as to make the metric be able to reach its maximum. This standardized metric, which measures dissimilarity due to the negative sign, presents some nice properties: on average, points should have a dissimilarity between them of about $0.5$, with points that are more similar than average having values closer to zero, and points that are more dissimilar than average having values closer to $1$.
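A small sketch of the standardization step, assuming a transformation of the form $2^{-(\bar{s} - 1)/(E[s] - 1)}$, which is one form consistent with the description above (base-2 exponent as in the outlier score of [4], with the minimum separation of one split subtracted from both depths). The function name and the default expected depth of 3 (the large-sample limit) are assumptions made for illustration.

```python
def standardized_distance(avg_sep_depth, expected_depth=3.0, min_depth=1.0):
    """Map an average separation depth to a dissimilarity in (0, 1]."""
    return 2.0 ** (-(avg_sep_depth - min_depth) / (expected_depth - min_depth))

d_average = standardized_distance(3.0)    # average depth equals the expectation
d_farthest = standardized_distance(1.0)   # always separated at the first split
d_very_close = standardized_distance(50.0)  # almost never separated
print(d_average, d_farthest, d_very_close)
```

Pairs separated at the first split in every tree reach the maximum of one, pairs whose average depth equals the expectation land at one half, and pairs that take very long to separate approach zero.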
This dissimilarity (from here on, distance) can be shown to be a proper metric distance under some extra assumptions:
It is bounded between zero and one, thus $d(a, b) \geq 0$.
If it is assumed that a single point is indivisible and thus its average separation depth infinite, then $d(a, a) = 0$.
Since these are binary trees, there is only one possible path between any two points in a given tree, each pair of points requires at least 1 split to be separated, and a point (a) cannot be separated from point (b) further down the tree than it is separated from point (c) if points (b) and (c) have already been separated earlier. This holds in every tree, which implies $d(a, b) \leq d(a, c) + d(c, b)$.
This analysis assumes that all points are unique, and the expected separation calculation would not work if duplicated points are passed down the tree as then they would be considered to have some positive distance even though they are the same. This can however be taken into account at the moment of growing the trees by assigning them a higher sampling weight, and duplicates can be filtered out before being passed through already-fitted trees.
It's possible to think of some simple extensions to the original Isolation Forest model for handling categorical variables as follows: a random subset of the present categories is assigned to one tree branch, while the rest are assigned to the other tree branch. When new points are passed down the tree, if they have a category that was not present in the original points from which the split was determined, they are divided heuristically, either by assigning them to the branch that had fewer points, or by assigning them to both branches but with a weight given by the proportion of points before the split that were assigned to each branch, with the results later combined according to these weights. This same trick can be used for handling missing data, providing better results than a-priori imputation (see [8]).
In the extended model ([3], [5]), which produces splits by more than one variable at a time, missing data and new categories can alternatively be imputed with the median of the sample or sub-sample from which the split was determined (i.e. only the points that reach that current level). In the case of categorical variables, each category would have its own coefficient to add to the linear combination, and the resulting numeric transformation under these coefficients will have some median in the original sample that can be used as imputation value.
The full procedure is described below:
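As a small illustration of the categorical and missing-value heuristic only (not the full procedure), the following Python sketch assigns a random subset of observed categories to one branch and sends unseen categories or missing values down both branches with weights proportional to the branch sizes. All names are hypothetical and the sketch covers a single split of a single variable.

```python
import random

def make_categorical_split(values, rng):
    """Randomly assign observed categories to two branches.

    Returns the left categories, the right categories, and the fraction
    of training points that went to the left branch.
    """
    categories = sorted(set(values))
    # Choose a non-empty proper subset of the categories for the left branch.
    n_left = rng.randint(1, len(categories) - 1)
    left = set(rng.sample(categories, n_left))
    right = set(categories) - left
    frac_left = sum(v in left for v in values) / len(values)
    return left, right, frac_left

def branch_weights(value, left, right, frac_left):
    """Weights (left, right) with which a point follows each branch."""
    if value in left:
        return (1.0, 0.0)
    if value in right:
        return (0.0, 1.0)
    # Unseen category or missing value: follow both branches proportionally.
    return (frac_left, 1.0 - frac_left)

colors = ["red"] * 5 + ["green"] * 3 + ["blue"] * 2
left, right, frac_left = make_categorical_split(colors, random.Random(0))
weights_unseen = branch_weights("purple", left, right, frac_left)
weights_missing = branch_weights(None, left, right, frac_left)
print(left, right, frac_left, weights_unseen)
```

Depths obtained from the two branches would then be combined using these weights, as the text describes.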
The metric proposed here (the implementation was made open source and freely available at https://github.com/david-cortes/isotree) was compared against typical distance metrics (Euclidean, Mahalanobis, Cosine) in terms of their (Pearson) correlation under randomly-generated data with different properties, using the single-variable and the extended model with two variables at a time, both of them with no sub-sampling, full-depth trees, 100 trees per model, and only-random splits.
The following comparisons take a randomly-generated matrix composed of several column vectors. Some of the values were later set randomly as missing for comparison purposes.
In these examples, the most suitable metric under each specific situation presented the highest correlation with the distance metric proposed here, even though the inverse was not always the case. The extended model shows a slight edge in most cases, which becomes a rather large edge in the case of missing values as it was able to maintain a higher correlation against the same distance obtained when the values are not missing. Both models were able to produce comparable within-group distances in a mirrored Gaussian mixture, which a distance such as Mahalanobis that takes the mixed covariance matrix cannot do.
As a more realistic comparison point, the metric proposed here for the extended model was also compared against Gower distance (calculated using the R package "cluster" with its default parameters, see [7]) under the hypothyroid dataset (https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease), which contains a mixture of numeric, boolean, and categorical variables, with missing values in several of them and non-normally-distributed numeric variables, this time with limited-depth trees and some non-random splits as in [5]. The (Pearson) correlation between these metrics stood at . Unfortunately, for this kind of data, it's very difficult to make a detailed comparison and/or determine which one produces the most desirable output, so the comparison was stopped at that.
This work introduced a metric distance between points in an arbitrary feature space which is obtained with the use of Isolation Forest models and is based on a sample from the data-generating distribution. Compared to more typical metrics such as Euclidean or Mahalanobis distance, this metric was shown to be more robust against different possible relationships between variables, to produce more desirable relative distances under mixed distributions, and to have other desirable properties such as being limited in range and having a threshold value that can be used to determine if two points are more similar than dissimilar. Some simple extensions to the Isolation Forest algorithm were proposed to allow calculations with missing values and categorical variables, which in the case of missing values was shown to provide highly-correlated results with the non-missing-data distance, and in the case of mixed numeric and categorical variables, was shown to correlate highly with Gower distance, while still being able to account for relationships between numeric and categorical variables.
[5] In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–290. Springer, 2010.
[7] cluster: Cluster Analysis Basics and Extensions, 2019. R package version 2.1.0.