Distance approximation using Isolation Forests

10/27/2019 · by David Cortes

This work briefly explores the possibility of approximating spatial distance (alternatively, similarity) between data points using the Isolation Forest method originally devised for outlier detection. The logic is similar to that of isolation: the more similar or closer two points are, the more random splits it takes to separate them. The separation depth between two points can be standardized in the same way as the isolation depth, transforming it into a distance metric that is limited in range, centered, and compliant with the axioms of distance. This metric presents some desirable properties, such as being invariant to the scales of the variables and being able to account for non-linear relationships between variables, which other metrics such as Euclidean or Mahalanobis distance do not. Extensions to the Isolation Forest method are also proposed for handling categorical variables and missing values, resulting in a more generalizable and robust metric.

1 Introduction

This work explores the idea of using Isolation Forests ([4], [6]) and variations thereof ([3], [5]) for estimating how similar/close or dissimilar/distant two points are in an arbitrary feature space, based on an observed sample of data to which an Isolation Forest model is fit, and from which the distributions of and relationships between the different variables/dimensions are implicitly incorporated into the resulting distance/similarity metric.

The premise is simple: if a set of data points is split into two branches recursively multiple times by choosing some variable and split point in that variable uniformly at random, the points that are more distant will on average be separated (put into different tree branches) with fewer splits (closer to the root node), while points that are more similar will require more splits to become separated.
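As a quick illustration of this premise (a minimal sketch for intuition only, not part of the reference implementation), the following Python snippet simulates the random splitting process on a one-dimensional range and compares the average separation depth of a close pair of points against a distant pair:

```python
import random

def separation_depth(a, b, lo, hi, depth=1):
    """Depth at which points a and b (both inside [lo, hi]) end up in different
    branches when split thresholds are drawn uniformly at random in the current range."""
    t = random.uniform(lo, hi)
    if (a <= t) != (b <= t):                 # this split separates them
        return depth
    if a <= t:                               # both go to the left branch
        return separation_depth(a, b, lo, t, depth + 1)
    return separation_depth(a, b, t, hi, depth + 1)   # both go to the right branch

random.seed(0)
trials = 10_000
for pair in [(0.50, 0.52), (0.10, 0.90)]:    # a close pair and a distant pair
    avg = sum(separation_depth(*pair, 0.0, 1.0) for _ in range(trials)) / trials
    print(pair, round(avg, 2))
# The close pair takes many more random splits to separate than the distant pair.
```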

The procedure is less efficient than simpler calculations such as Euclidean distance, but offers some advantages: this distance/similarity is invariant to the scale of each variable, non-normal distributions do not present any issues, and potential correlations between variables in the distribution are taken into account, even when these correlations are not linear. Additionally, some small modifications to the Isolation Forest algorithm allow incorporating categorical variables and handling missing values in the procedure.

2 Isolation Forests

Isolation Forest (a.k.a. iForest) ([4], [6]) is an algorithm devised for outlier or anomaly detection based on the concept of isolation: if a set of data points is split according to some random variable by choosing a split point at random within its observed range, assigning all points that are less than or equal to this threshold to one branch and the rest to the other, and this process is continued recursively on each branch, then outlier points will become isolated (left alone in a branch) sooner (with fewer splits, closer to the root node of the tree) than non-outlier points.

The idea can be extended to non-random splits based on the standard deviations of the variable being split that are obtained at each branch ([5]), and to splits by hyperplanes ([3], [5]), which, as shown in [3], can help to remove some of the biases introduced by the single-variable splitting process. Since outliers can only be considered as such if their average isolation depth is less than that expected for a random data point, the procedure can be terminated before isolating every single point by stopping once it reaches the depth that a balanced binary tree would have; the remaining isolation depth for non-isolated points is then approximated by adding to the terminal depth the expected isolation depth that would result from continuing the process on the number of points remaining at that node, assuming uniformly-random data and uniformly-random splits.

The average isolation depth obtained for a given point can be converted into a standardized outlier metric according to how it differs from the expected isolation depth for a random data point, which is given by $c(n) = 2H(n-1) - 2(n-1)/n$, where $H(i)$ is the $i$-th harmonic number - see [6] and [1] for details.
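For reference, a small sketch of this standardization as used for the outlier score in [4]/[6] (the score is $2^{-E[h(x)]/c(n)}$); the helper names here are illustrative:

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i: int) -> float:
    """Approximate the i-th harmonic number as ln(i) plus the Euler-Mascheroni constant."""
    return math.log(i) + EULER_GAMMA

def expected_isolation_depth(n: int) -> float:
    """c(n): expected isolation depth for a random point in a sample of size n."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def outlier_score(avg_depth: float, n: int) -> float:
    """Standardized outlier score from [4]: above 0.5 for points isolated unusually early."""
    return 2.0 ** (-avg_depth / expected_isolation_depth(n))

print(outlier_score(avg_depth=4.0, n=256))                          # ~0.76: shallower than expected
print(outlier_score(avg_depth=expected_isolation_depth(256), n=256))  # exactly 0.5: average point
```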

3 Separation depth and distance

If one considers two random points in a subset, one can also think of separation instead of isolation as the binary trees are grown: if the points are split (assigned to different branches of a binary tree node) according to being smaller or greater than a random value within the range of some variable in the feature space, then the closer two points are in that dimension, the higher the probability that they end up in the same branch when the split point is chosen at random, because the closer they are, the larger the number of possible split points under which they end up together. Moreover, if some variable undergoes a linear or affine transformation, each possible split on the points still has the same probability as before, since this probability depends only on their relative positions within the range of the variable.

If the process is repeated further, choosing a variable and split point at random in each branch that was obtained in the previous split, then again closer points in the new variable have a higher chance of ending up in the same branch, but this time it is conditioned on already not having been separated in the previous split. In this regard, non-random splits that aim at finding the point that minimizes the standard deviations of the variable in the obtained branches could also do a better job at making clustered points appear even more similar, due to the fact that splits will tend to separate clusters first (see [5] and [9]).

Figure 1: Example random tree on 1-d data points

If this procedure is repeated from the beginning until each point becomes isolated, then the closer two points are (with closeness influenced by the data distribution), the greater their average separation depth across these random trees will be, and having multiple trees removes the large variability introduced by the first split having to separate points according to a single variable (that is, a large number of pairs is expected to become separated by the very first split, which is typically not what happens with isolation, as most points do not end up isolated after the first split).

Just like with isolation, it is possible to calculate the expected separation depth between two random points when a set of points is split by a random-tree procedure that sends a random number of the remaining points to one branch and the remainder to the other. The expected separation depth $\bar{s}(n)$ under such a randomly-built tree, with as many terminal nodes as there are points in the data or in the sub-sample reaching a terminal node, can be calculated recursively by considering that, if two points are not separated right after a split, they go together into another node of smaller size, in which the procedure is repeated again - see [2] for details. Taking both the size of each branch and the assignment of points to branches as uniformly random, the recursion is:

$$\bar{s}(n) = 1 + \frac{1}{n-1}\sum_{a=1}^{n-1}\left[\frac{a(a-1)}{n(n-1)}\,\bar{s}(a) + \frac{(n-a)(n-a-1)}{n(n-1)}\,\bar{s}(n-a)\right]$$

with $\bar{s}(1) = 0$ (a single point is already isolated) and $\bar{s}(2) = 1$ (two points always become separated in one split).

It can be calculated more efficiently through an equivalent recursion (see [2] for details).

As the sample size grows to infinity, the expected separation depth can be approximated more easily: this scenario with infinitely many discrete choices is equivalent to a scenario on a continuous number line, in which at each split two points plus a split threshold are drawn according to a random uniform distribution, and the points become separated if the threshold lies in between them - the probability of this happening under a uniform distribution is 1/3 regardless of the range. If they are not separated in that split, the process starts again with a still-infinite sample, giving the same probability of 2/3 of not becoming separated at each subsequent split, and thus the expected separation depth is 3.
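A quick Monte Carlo check of this limiting argument (an illustrative sketch, not part of the reference implementation):

```python
import random

def splits_until_separated(rng: random.Random) -> int:
    """Simulate the continuous limit: two points and a threshold drawn uniformly
    on the current interval; recurse on the sub-interval that keeps both points
    until the threshold falls between them."""
    lo, hi = 0.0, 1.0
    x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
    depth = 0
    while True:
        depth += 1
        t = rng.uniform(lo, hi)
        if min(x, y) < t < max(x, y):
            return depth                     # separated at this split
        if t < min(x, y):                    # both points stay on the right side
            lo = t
        else:                                # both points stay on the left side
            hi = t

rng = random.Random(123)
depths = [splits_until_separated(rng) for _ in range(200_000)]
print(sum(d == 1 for d in depths) / len(depths))   # ~1/3: separated at the first split
print(sum(depths) / len(depths))                   # ~3: expected separation depth
```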

The fact that this number remains constant for indefinitely large sample sizes comes in handy: one can then assume that this is the expected number of splits it takes to separate two points within an arbitrary data sample, as will be explained next.

For determining isolation depth, when a tree node reaches the height limit while containing multiple points (as opposed to a single isolated point), or contains a set of points in which no further split is possible, the expected isolation depth for that remainder is approximated in [4] according to how many points ended up there, and this same number is used again at prediction time for new data points that reach that terminal node. For separation this is problematic: at the moment of determining distances between points in a terminal node, the expected remaining separation depth would depend on the number of points that end up in that terminal node, making the distance between two points be affected by the presence of a third point when the trees are grown, or if this expectation were determined in the same way as for isolation on new points. Fortunately, there is no such dependence on a third point when using the already-fitted trees to estimate the distance between two new points, provided it can be assumed that at the end of each terminal node lies an infinite sample of further points drawn from the same data distribution: since the estimated separation depth for an infinite sample is always 3, once two points from a new sample reach the same terminal node in a tree, their expected remaining separation depth is the same regardless of how many other points reach that same terminal node.

Just like for the outlier score proposed in [4], the expectation under a randomly-built tree can also be used to produce a standardized metric, by comparing the obtained average separation depth between two points against the expected separation depth in a randomly-built tree through a simple transformation such as $d(x_1, x_2) = 2^{-\frac{\bar{s}(x_1, x_2) - 1}{\bar{s}_n - 1}}$ (where $\bar{s}(x_1, x_2)$ is the average separation depth across the trees and $\bar{s}_n$ the expected separation depth in a randomly-built tree), subtracting the minimum separation depth (one split) from both numbers so that the metric is able to reach its maximum. This standardized metric, which measures dissimilarity due to the negative sign in the exponent, presents some nice properties: on average, points should have a dissimilarity between them of about 1/2 ($2^{-1}$), with points that are more similar than average having values closer to zero, and points that are more dissimilar than average having values closer to 1.
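A minimal sketch of this standardization, assuming the transformation written above and the large-sample value $\bar{s}_n = 3$ (so the denominator of the exponent is 2):

```python
def separation_distance(avg_separation_depth: float, expected_separation: float = 3.0) -> float:
    """Standardized dissimilarity in [0, 1]: ~0 for very similar points (large
    separation depth), 0.5 for an average pair, 1 when points always separate at the root."""
    return 2.0 ** (-(avg_separation_depth - 1.0) / (expected_separation - 1.0))

print(separation_distance(1.0))    # 1.0  -> maximally dissimilar
print(separation_distance(3.0))    # 0.5  -> average pair
print(separation_distance(15.0))   # close to 0 -> very similar
```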

This dissimilarity (from here on, distance) can be shown to be a proper metric distance under some extra assumptions:

  • It is bounded between zero and one ($d \to 0$ as the average separation depth grows to infinity, and $d = 1$ when two points are always separated at the first split), thus $d(x_1, x_2) \ge 0$.

  • If it is assumed that a single point is indivisible and thus its average separation depth with itself is infinite, then $d(x, x) = 0$.

  • Since these are binary trees, there is only one possible path between any two points in a given tree, each pair of points requires at least one split to be separated, and a point $a$ cannot be separated from point $b$ further down the tree than it is separated from point $c$ if points $b$ and $c$ have already been separated earlier; thus $s(a, b) \ge \min(s(a, c), s(b, c))$ in every tree, which implies $d(a, b) \le d(a, c) + d(b, c)$ (the triangle inequality).

This analysis assumes that all points are unique, and the expected separation calculation would not work if duplicated points are passed down the tree as then they would be considered to have some positive distance even though they are the same. This can however be taken into account at the moment of growing the trees by assigning them a higher sampling weight, and duplicates can be filtered out before being passed through already-fitted trees.

4 Categorical variables and missing values

It is possible to think of some simple extensions to the original Isolation Forest model for handling categorical variables as follows: a random subset of the present categories is assigned to one tree branch, while the rest are assigned to the other branch. When new points are passed down the tree and have a category that was not present among the points from which the split was determined, they are divided heuristically, either by assigning them to the branch that had fewer points, or by assigning them to both branches with a weight given by the proportion of points before the split that went to each branch, with the results later combined according to these weights. This same trick can be used for handling missing data, providing better results than a-priori imputation (see [8]). In the extended model ([3], [5]), which produces splits by more than one variable at a time, missing data and new categories can alternatively be imputed with the median of the sample or sub-sample from which the split was determined (i.e. only the points that reach that node). In the case of categorical variables, each category gets its own coefficient to add to the linear combination, and the resulting numeric transformation under these coefficients has some median in the original sample that can be used as the imputation value.
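As an illustration of the weighted-division heuristic for the single-variable case (a minimal sketch with hypothetical node fields, not the reference implementation), a point whose value is missing at a node can be sent down both branches, with its contribution weighted by the observed branch proportions; an unseen category would be handled the same way:

```python
def traverse_with_weights(node, x, weight=1.0):
    """Return a list of (terminal_node, weight) pairs for a single point x (a dict).
    `node` is assumed to have fields: is_terminal, variable, split_point, left,
    right, and prop_left (share of training points sent to the left branch)."""
    if node.is_terminal:
        return [(node, weight)]
    value = x.get(node.variable)              # None stands for a missing value
    if value is None:
        # missing value: follow both branches, splitting the point's weight
        # according to the proportions observed when the split was made
        return (traverse_with_weights(node.left, x, weight * node.prop_left)
                + traverse_with_weights(node.right, x, weight * (1.0 - node.prop_left)))
    if value <= node.split_point:
        return traverse_with_weights(node.left, x, weight)
    return traverse_with_weights(node.right, x, weight)

# toy usage with ad-hoc node objects
from types import SimpleNamespace as NS
leaf_a, leaf_b = NS(is_terminal=True, name="a"), NS(is_terminal=True, name="b")
root = NS(is_terminal=False, variable="x1", split_point=0.5, prop_left=0.7,
          left=leaf_a, right=leaf_b)
print(traverse_with_weights(root, {"x1": None}))   # [(leaf a, 0.7), (leaf b, 0.3)]
```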

The full procedure is described below:

Inputs: X (input data with dimensionality d), t (number of trees), m (sub-sample size), r (number of splitting dimensions), l_max (max depth)
      Output: Isolation Forest model F consisting of t trees

1: Initialize an empty set of trees F = {}
2: for i = 1, ..., t do
3:     Take a subsample X_s consisting of m points from X selected at random
4:     if r = 1 then
5:         Add single-variable tree: F <- F ∪ {iTreeEnh(X_s, l_max, 0, w = (1, ..., 1))}
6:     else
7:         Add extended tree: F <- F ∪ {iTreeExtEnh(X_s, l_max, 0, r)}
8: return F
Algorithm 1 iForestEnhanced

Inputs: X (input data points), l_max (max depth), l (current depth), w (weight of each point in X)
      Output: Tree node with left branch T_L, right branch T_R, proportion left p, chosen variable v, present categories c, and either split point z or split subset c_L

1: if |X| ≤ 1 or l ≥ l_max or all points in X are identical then
2:     Terminate procedure (return empty output)
3: else
4:     Choose a variable v at random from {1, ..., d} such that X_v has at least 2 different values (if not possible, terminate)
5:     if v is numeric then
6:         Choose a random split point z ~ Uniform(min(X_v), max(X_v))
7:         Determine subsets X_L = {x ∈ X : x_v ≤ z}, X_R = {x ∈ X : x_v > z}
8:         Set empty present categories c = {}
9:     else
10:        Determine present categories c = set of values present in X_v
11:        Choose a random subset c_L of categories from all possible subsets of c
12:        Determine subsets X_L = {x ∈ X : x_v ∈ c_L}, X_R = {x ∈ X : x_v ∉ c_L}
13:    Determine the proportion assigned to the first branch p = (Σ_{x ∈ X_L} w(x)) / (Σ_{x ∈ X_L ∪ X_R} w(x))
14:    Divide points with missing x_v s.t. they are added to both branches: X_L <- X_L ∪ X_NA, X_R <- X_R ∪ X_NA, with weights w_L(X_NA) = w(X_NA) · p, w_R(X_NA) = w(X_NA) · (1 − p)
15:    return tree node with left branch T_L = iTreeEnh(X_L, l_max, l + 1, w_L), right branch T_R = iTreeEnh(X_R, l_max, l + 1, w_R), left branch proportion p, chosen variable v, present categories c, and either split point z or split subset c_L
Algorithm 2 iTreeEnh
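For readers who prefer code, a stripped-down Python version of the single-variable tree (numeric columns only, no categorical or missing-value handling, uniformly random splits, depth limit) might look as follows; the class and function names are illustrative, not those of the reference implementation:

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    variable: Optional[int] = None        # column used for the split (None = terminal node)
    split_point: float = float("nan")
    prop_left: float = 0.0                # share of points sent to the left branch
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_single_variable_tree(X, max_depth, depth=0, rng=None):
    """Simplified version of Algorithm 2: numeric variables, uniformly random splits."""
    if rng is None:
        rng = np.random.default_rng()
    if len(X) <= 1 or depth >= max_depth or np.all(X == X[0]):
        return Node()                     # terminal node
    # pick a column that still has at least two distinct values
    candidates = [j for j in range(X.shape[1]) if X[:, j].min() < X[:, j].max()]
    if not candidates:
        return Node()
    j = rng.choice(candidates)
    z = rng.uniform(X[:, j].min(), X[:, j].max())
    go_left = X[:, j] <= z
    node = Node(variable=j, split_point=z, prop_left=go_left.mean())
    node.left = build_single_variable_tree(X[go_left], max_depth, depth + 1, rng)
    node.right = build_single_variable_tree(X[~go_left], max_depth, depth + 1, rng)
    return node

X = np.random.default_rng(0).normal(size=(100, 3))
tree = build_single_variable_tree(X, max_depth=10)
```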

Inputs: X (input data points), l_max (max depth), l (current depth), r (number of splitting dimensions)
      Output: Tree node with left branch T_L, right branch T_R, subset of variables u, chosen numeric coefficients z, categorical coefficients Z, imputation values R, split point q

1: if |X| ≤ 1 or l ≥ l_max or all points in X are identical then
2:     Terminate procedure (return empty output)
3: else
4:     Initialize linear combination e(x) = 0 for each point x in X
5:     Initialize empty sets of numeric coefficients z = {}, categorical coefficients Z = {}, and imputation values R = {}
6:     Choose a subset u of r variables at random from {1, ..., d} such that each chosen variable has at least 2 different values in X (fewer than r if not possible; terminate if no variable has at least 2 different values)
7:     for each numeric variable v in u do
8:         Draw a random coefficient z_v
9:         Standardize the coefficient as z_v <- z_v / sd(X_v)
10:        Update e(x) <- e(x) + z_v · x_v for each point x with non-missing x_v
11:        Add coefficient z_v to the set z
12:        Determine imputation value as r_v = z_v · median(X_v)
13:        Update e(x) <- e(x) + r_v for each point x with missing x_v
14:        Add imputation value r_v to the set R
15:    for each categorical variable v in u do
16:        For each category of X_v, choose a random coefficient z_{v,cat}
17:        Update e(x) <- e(x) + z_{v, x_v} for each point x with non-missing x_v
18:        Add the set of per-category coefficients to the set Z
19:        Determine imputation value as r_v = median({z_{v, x_v} : x in X, x_v not missing})
20:        Update e(x) <- e(x) + r_v for each point x with missing x_v
21:        Add imputation value r_v to the set R
22:    if e takes fewer than 2 different values then
23:        Terminate procedure (return empty output)
24:    Choose a random split point q ~ Uniform(min(e), max(e))
25:    Determine subsets X_L = {x ∈ X : e(x) ≤ q}, X_R = {x ∈ X : e(x) > q}
26:    return tree node with left branch T_L = iTreeExtEnh(X_L, l_max, l + 1, r), right branch T_R = iTreeExtEnh(X_R, l_max, l + 1, r), subset of variables u, numeric coefficients z, categorical coefficients Z, imputation values R, and split point q
Algorithm 3 iTreeExtEnh
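To make the extended split concrete, here is a self-contained Python sketch of how a single hyperplane split over mixed numeric/categorical columns with median imputation (as described above) could look. The coefficient distribution (standard normal) and the exact scaling are assumptions made for illustration, not necessarily the choices of the reference implementation, and constant columns are assumed to have been filtered out beforehand:

```python
import numpy as np

def random_extended_split(X_num, X_cat, rng=None):
    """Sketch of one extended (hyperplane) split over mixed columns.
    X_num: 2-D float array (NaN = missing); X_cat: 2-D object array (None = missing)."""
    if rng is None:
        rng = np.random.default_rng()
    combination = np.zeros(X_num.shape[0])

    # numeric columns: random coefficient scaled by 1/sd, median-based imputation
    for j in range(X_num.shape[1]):
        col = X_num[:, j]
        coef = rng.standard_normal() / np.nanstd(col)   # assumes non-constant column
        imput = coef * np.nanmedian(col)
        contrib = coef * col
        combination += np.where(np.isnan(contrib), imput, contrib)

    # categorical columns: one random coefficient per category; missing (or, at
    # prediction time, unseen) categories take the median of the contributions
    for j in range(X_cat.shape[1]):
        col = X_cat[:, j]
        coefs = {c: rng.standard_normal() for c in set(col) if c is not None}
        contrib = np.array([coefs[c] if c is not None else np.nan for c in col])
        imput = float(np.nanmedian(contrib))
        combination += np.where(np.isnan(contrib), imput, contrib)

    # random split threshold within the range of the linear combination
    threshold = rng.uniform(combination.min(), combination.max())
    return combination <= threshold, threshold

rng = np.random.default_rng(0)
Xn = rng.normal(size=(10, 2)); Xn[0, 0] = np.nan
Xc = np.array([["a", "b", "a", "c", "b", "a", "c", "b", "a", None]], dtype=object).T
left_mask, thr = random_extended_split(Xn, Xc, rng)
print(left_mask.sum(), "points go to the left branch")
```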

Inputs: X (input data consisting of n points), t (number of trees), F (isolation forest)
      Output: Distance matrix D

1: Initialize the pairwise sums of separation depths as S_{i,j} = 0 for every pair of points i ≠ j
2: if the trees are single-variable then
3:     Initialize weights w_i = 1 for every point in X
4: for each tree T in F do
5:     if the trees are single-variable then
6:         Update S <- TraverseTree(root(T), X, S, w)
7:     else
8:         Update S <- TraverseExtTree(root(T), X, S)
9: return the distance matrix D obtained by standardizing the average separation depths S_{i,j} / t as described in Section 3
Algorithm 4 SepDepth

Inputs: N (node of an iTreeEnh), X (input data), S (current sums of separation depths), w (weights for X)
      Output: Updated sums of separation depths S

1: if N is a terminal node then
2:     Update S_{i,j} <- S_{i,j} + 3 · w_i · w_j for each pair of points i ≠ j in X (3 being the expected remaining separation depth)
3:     return S
4: else
5:     Update S_{i,j} <- S_{i,j} + w_i · w_j for each pair of points i ≠ j in X
6:     if the chosen variable v in N is numeric then
7:         Determine subsets X_L = {x ∈ X : x_v ≤ z}, X_R = {x ∈ X : x_v > z}
8:         Divide points with missing x_v s.t. they are added to both branches, with weights w · p (left) and w · (1 − p) (right)
9:     else
10:        Determine subsets X_L = {x ∈ X : x_v ∈ c_L}, X_R = {x ∈ X : x_v ∉ c_L}
11:        Divide points with missing x_v or with categories not in c s.t. they are added to both branches, with weights w · p (left) and w · (1 − p) (right)
12:    return TraverseTree(N_L, X_L, TraverseTree(N_R, X_R, S, w_R), w_L)
Algorithm 5 TraverseTree

Inputs: N (node of an iTreeExtEnh), X (input data), S (current sums of separation depths)
      Output: Updated sums of separation depths S

1: if N is a terminal node then
2:     Update S_{i,j} <- S_{i,j} + 3 for each pair of points i ≠ j in X
3:     return S
4: else
5:     Update S_{i,j} <- S_{i,j} + 1 for each pair of points i ≠ j in X
6:     Initialize linear combination e(x) = 0 for each point x in X
7:     for each numeric variable v in u do
8:         Update e(x) <- e(x) + z_v · x_v (using the imputation value r_v if x_v is missing)
9:     for each categorical variable v in u do
10:        Update e(x) <- e(x) + z_{v, x_v} (using the imputation value r_v if x_v is missing or its category was not seen when the split was determined)
11:    Determine subsets X_L = {x ∈ X : e(x) ≤ q}, X_R = {x ∈ X : e(x) > q}
12:    return TraverseExtTree(N_L, X_L, TraverseExtTree(N_R, X_R, S))
Algorithm 6 TraverseExtTree
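Putting the above together in spirit, the following self-contained Python sketch grows random single-variable trees on numeric training data while accumulating pairwise separation depths on the fly (adding one to every pair that is still together at an internal node and the expected remainder of 3 at terminal nodes), then standardizes the averages into distances with the transformation from Section 3. It is a simplified illustration (no categorical variables, missing values, or sub-sampling), not the reference implementation:

```python
import numpy as np

def accumulate_separation(X, idx, S, max_depth, depth=0, rng=None):
    """Grow one random tree branch and add separation depths for the pairs in `idx`."""
    if rng is None:
        rng = np.random.default_rng()
    if len(idx) < 2:
        return
    terminal = depth >= max_depth or np.all(X[idx] == X[idx[0]])
    # pairs still together get +1 at an internal node, or +3 (expected remainder) at a terminal one
    increment = 3.0 if terminal else 1.0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            S[idx[a], idx[b]] += increment
            S[idx[b], idx[a]] += increment
    if terminal:
        return
    candidates = [j for j in range(X.shape[1]) if X[idx, j].min() < X[idx, j].max()]
    j = rng.choice(candidates)
    z = rng.uniform(X[idx, j].min(), X[idx, j].max())
    go_left = X[idx, j] <= z
    accumulate_separation(X, idx[go_left], S, max_depth, depth + 1, rng)
    accumulate_separation(X, idx[~go_left], S, max_depth, depth + 1, rng)

def separation_distance_matrix(X, ntrees=100, max_depth=12, seed=0):
    """Pairwise standardized separation distances in [0, 1] (~0.5 for an average pair)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    S = np.zeros((n, n))
    for _ in range(ntrees):
        accumulate_separation(X, np.arange(n), S, max_depth, rng=rng)
    D = 2.0 ** (-(S / ntrees - 1.0) / 2.0)   # standardization from Section 3
    np.fill_diagonal(D, 0.0)
    return D

X = np.random.default_rng(1).normal(size=(50, 3))
D = separation_distance_matrix(X)
print(D.shape, round(D.mean(), 2))
```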

5 Comparison to other distance metrics

The metric proposed here (for which an open-source implementation is freely available at https://github.com/david-cortes/isotree) was compared against typical distance metrics (Euclidean, Mahalanobis, Cosine) in terms of their (Pearson) correlation on randomly-generated data with different properties, using both the single-variable model and the extended model with two variables per split, both with no sub-sampling, full-depth trees, 100 trees per model, and only-random splits.

The following comparisons take a randomly-generated matrix composed of several column vectors. Some of the values were later set randomly as missing for comparison purposes.
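The comparison methodology is straightforward to reproduce in outline. The sketch below shows how the pairwise-correlation comparison itself can be set up (the exact data-generating settings of each table are not reproduced here, and the forest-based distance vector would come from an implementation such as the isotree package linked above or from the sketches in the previous sections):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in for the randomly-generated matrix

# condensed pairwise-distance vectors under the classical metrics
d_euc = pdist(X, metric="euclidean")
d_mah = pdist(X, metric="mahalanobis", VI=np.linalg.inv(np.cov(X, rowvar=False)))
d_cos = pdist(X, metric="cosine")

# d_iso would be the analogous condensed vector of the proposed forest-based
# distances for the same points; the reported numbers are Pearson correlations
# between such vectors, e.g.:
print(np.corrcoef(d_euc, d_mah)[0, 1])
```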

         Iso     IsoExt  Euc     Cos
Iso      -       0.944   0.951   0.622
IsoExt   0.944   -       0.968   0.62
Euc      0.951   0.968   -       0.628
Cos      0.622   0.62    0.628   -

Table 1: Independent variables with the same scale - this is the kind of case in which Euclidean distance is the most appropriate, and here it is equivalent to Mahalanobis distance due to the covariance matrix being an identity matrix.

         Iso     IsoExt  Euc     Mah     Cos
Iso      -       0.944   0.671   0.95    0.382
IsoExt   0.944   -       0.697   0.971   0.378
Euc      0.671   0.697   -       0.697   0.542
Mah      0.95    0.971   0.697   -       0.361
Cos      0.382   0.378   0.542   0.361   -

Table 2: Independent variables with different scales - here Euclidean distance will always weight the larger column heavier, but metrics such as Mahalanobis distance can easily overcome this difference.

         Iso     IsoExt  Euc     Mah     Cos     Euc*    Mah*    Cos*
Iso      -       0.962   0.657   0.768   0.605   0.924   0.924   0.551
IsoExt   0.962   -       0.72    0.832   0.619   0.929   0.93    0.522
Euc      0.657   0.72    -       0.916   0.177   0.563   0.562   0.234
Mah      0.768   0.832   0.916   -       0.454   0.761   0.76    0.383
Cos      0.605   0.619   0.177   0.454   -       0.747   0.747   0.756
Euc*     0.924   0.929   0.563   0.761   0.747   -               0.628
Mah*     0.924   0.93    0.562   0.76    0.747           -       0.628
Cos*     0.551   0.522   0.234   0.383   0.756   0.628   0.628   -

Table 3: Independent variables in the same scale, plus a non-linearly transformed copy of one of the columns (an asterisk denotes the same metric computed excluding the transformed column) - intuitively, the newly-added column, being just a deterministic transformation of an already-existing column, does not add any extra information, so an ideal distance metric should be very similar to the simple Euclidean distance without the new column.

            Iso    IsoExt  Euc    Mah    Cos    Iso(NA)  IsoExt(NA)  Euc(NA)  Mah(NA)  Cos(NA)
Iso         -      0.96    0.94   0.74   0.7    0.63     0.86        0.85     0.78     0.63
IsoExt      0.96   -       0.94   0.76   0.67   0.6      0.87        0.86     0.79     0.6
Euc         0.94   0.94    -      0.75   0.74   0.6      0.85        0.9      0.78     0.66
Mah         0.74   0.76    0.75   -      0.56   0.48     0.69        0.68     0.72     0.5
Cos         0.7    0.67    0.74   0.56   -      0.47     0.58        0.63     0.51     0.87
Iso(NA)     0.63   0.6     0.6    0.48   0.47   -        0.5         0.52     0.57     0.43
IsoExt(NA)  0.86   0.87    0.85   0.69   0.58   0.5      -           0.94     0.79     0.66
Euc(NA)     0.85   0.86    0.9    0.68   0.63   0.52     0.94        -        0.79     0.7
Mah(NA)     0.78   0.79    0.78   0.72   0.51   0.57     0.79        0.79     -        0.55
Cos(NA)     0.63   0.6     0.66   0.5    0.87   0.43     0.66        0.7      0.55     -

Table 4: Non-independent variables (the distribution parameters were randomly generated and do not represent anything meaningful) - this is the kind of scenario in which Mahalanobis distance is the most appropriate, as the variables are related only through their linear correlations, under a single unimodal distribution from which all of them are drawn. Additionally, a random 15% of the values was set as missing ('(NA)' rows/columns); for Euclidean, Mahalanobis, and Cosine distance, missing values were imputed with the column mean.

         Iso    IsoExt  Euc    Mah    Cos
Iso      -      0.97    0.95   0.89   0.69
IsoExt   0.97   -       0.96   0.87   0.76
Euc      0.95   0.96    -      0.9    0.71
Mah      0.89   0.87    0.9    -      0.55
Cos      0.69   0.76    0.71   0.55   -

Average pairwise distance by group pairing:

                 Iso    IsoExt  Euc    Mah    Cos
Within group 1   0.3    0.27    0.94   1.51   0.2
Within group 2   0.28   0.27    0.88   0.92   0.84
Between groups   0.54   0.58    1.96   2.26   1.35

Table 5: Gaussian mixture with non-independent variables and equal probability for each group. Here an ideal metric should make points within a group closer than points between groups, and should take the correlations internal to each group into account more than the mixed correlations (this is shown in the second table). The best reference here is Euclidean distance, but it still does not account for relationships between variables. Since the two groups share the same covariance matrix except for opposite signs in the off-diagonal entries, under an ideal metric the average distance between points within one group should be similar to the average distance between points within the other group.

Figure 2: Sample points from mixture used in example 5 (outlier regions are from extended model).

In these examples, the most suitable metric for each specific situation presented the highest correlation with the distance metric proposed here, even though the inverse was not always the case. The extended model shows a slight edge in most cases, which becomes a rather large edge in the case of missing values, as it was able to maintain a higher correlation with the same distance obtained when the values are not missing. Both models were able to produce comparable within-group distances in a mirrored Gaussian mixture, which a distance such as Mahalanobis computed on the mixed covariance matrix cannot do.

As a more realistic comparison point, the metric proposed here for the extended model was also compared against Gower distance (calculated using the R package "cluster" with its default parameters - see [7]) on the hypothyroid dataset (https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease), which contains a mixture of numeric, boolean, and categorical variables, with missing values in several of them and non-normally-distributed numeric variables, this time using limited-depth trees and some non-random splits as in [5]; the two metrics turned out to be highly (Pearson-)correlated. Unfortunately, for this kind of data it is very difficult to make a detailed comparison or to determine which metric produces the most desirable output, so the comparison was stopped there.

6 Conclusions

This work introduced a metric distance between points in an arbitrary feature space which is obtained with the use of Isolation Forest models and is based on a sample from the data-generating distribution. Compared to more typical metrics such as Euclidean or Mahalanobis distance, this metric was shown to be more robust against different possible relationships between variables, to produce more desirable relative distances under mixed distributions, and to have other desirable properties such as being limited in range and having a threshold value that can be used to determine if two points are more similar than dissimilar. Some simple extensions to the Isolation Forest algorithm were proposed to allow calculations with missing values and categorical variables, which in the case of missing values was shown to provide highly-correlated results with the non-missing-data distance, and in the case of mixed numeric and categorical variables, was shown to correlate highly with Gower distance, while still being able to account for relationships between numeric and categorical variables.

References