Anomaly Detection and Prototype Selection Using Polyhedron Curvature

04/05/2020 · Benyamin Ghojogh et al. · University of Waterloo

We propose a novel approach to anomaly detection, called Curvature Anomaly Detection (CAD), and its kernel version, Kernel CAD, based on the idea of polyhedron curvature. Using the nearest neighbors of a point, we consider every data point as the vertex of a polyhedron, where a more anomalous point has more curvature. We also propose inverse CAD (iCAD) and Kernel iCAD for instance ranking and prototype selection by looking at CAD from the opposite perspective. We define the concepts of anomaly landscape and anomaly path, and we demonstrate an application of them, namely image denoising. The proposed methods are straightforward and easy to implement. Our experiments on different benchmarks show that the proposed methods are effective for anomaly detection and prototype selection.


1 Introduction

Anomaly detection, instance ranking, and prototype selection are important tasks in data mining. Anomaly detection refers to finding outliers or anomalies which differ significantly from the normal data points [emmott2013systematic]. There exist many applications for anomaly detection such as fraud detection, intrusion detection, medical diagnosis, and damage detection [chandola2009anomaly].

Ranking data points (instances) according to their importance can be useful for a better representation of data, omitting dummy or noisy points, better discrimination of classes in classification tasks, etc. [ghojogh2018principal]. Prototype selection refers to finding the best data points in terms of representation of the data, discrimination of classes, information of points, etc. [garcia2012prototype]. It can also be useful for better storage and processing-time efficiency. Prototype selection can be done either by ranking the points and then discarding the less important ones, or by merely retaining a portion of the data and discarding the others.

In this paper, we propose Curvature Anomaly Detection (CAD) and inverse CAD (iCAD) for anomaly detection and prototype selection, respectively. We also propose their kernel versions, which are Kernel CAD (K-CAD) and Kernel iCAD (K-iCAD). The idea of the proposed algorithms is based on polyhedron curvature, where every point is imagined to be the vertex of a hypothetical polyhedron defined by its neighbors. We also define the anomaly landscape and the anomaly path, which can have different applications such as image denoising. In the following, we mention the related work for anomaly detection and prototype selection. Then, we explain the background for polyhedron curvature. Afterwards, the proposed CAD, K-CAD, iCAD, and K-iCAD are explained. Finally, the experiments are reported.

Anomaly Detection: Local Outlier Factor (LOF) [breunig2000lof] is one of the important anomaly detection algorithms. It defines a measure of local density for every data point according to its neighbors, compares the local density of every point with that of its neighbors, and thereby finds the anomalies. One-class SVM [scholkopf2000support] is another method; it estimates a function which is positive on the regions of data with high density and negative elsewhere, so the points with negative values of that function are considered anomalies. If the data are assumed to have a Gaussian distribution, as the most common distribution, an Elliptic Envelope (EE) can be fitted to the data [rousseeuw1999fast], and the points having low probability under the fitted envelope are considered anomalies. Isolation Forest [liu2008isolation] is an isolation-based anomaly detection method [liu2012isolation] which isolates the anomalies using an ensemble approach. The ensemble includes isolation trees, where the depth of the tree required to isolate a point is a measure of its normality.

Prototype Selection: Prototype selection [garcia2012prototype] is also referred to as instance ranking and numerosity reduction. Edited Nearest Neighbor (ENN) [wilson1972asymptotic] is one of the oldest prototype selection methods; it removes a point if most of its neighbors belong to another class. Decremental Reduction Optimization Procedure 3 (DROP3) [wilson2000reduction] has the opposite perspective and removes a point if its removal improves the $k$-Nearest Neighbor ($k$-NN) classification accuracy. Stratified Ordered Selection (SOS) [kalegele2012demand] starts with boundary points and then recursively finds the median points, noting that boundary and median points are informative. Shell Extraction (SE) [liu2017efficient] introduces a reduction sphere and removes the points falling inside this hyper-sphere in order to approximate the support vectors. Principal Sample Analysis (PSA) [ghojogh2018principal], which is extended for regression and clustering tasks in [ghojogh2019principal], considers the scatter of data as well as the regression of prototypes for better representation. Instance Ranking by Matrix Decomposition (IRMD) [ghojogh2019instance] decomposes the matrix of data and makes use of the bases of the decomposition; the points more similar to the bases are considered more important.

2 Background on Polyhedron Curvature

A polytope is a geometrical object in $\mathbb{R}^d$ whose faces are planar. The special cases of polytope in $\mathbb{R}^2$ and $\mathbb{R}^3$ are called polygon and polyhedron, respectively. Some examples of polyhedra are the tetrahedron, octahedron, and icosahedron, with four, eight, and twenty triangular faces, respectively, and the cube and dodecahedron, with six square and twelve pentagonal faces, respectively [coxeter1973regular]. Consider a polygon where $\kappa_i$ and $\theta_i = \pi - \kappa_i$ are the interior and exterior angles at the $i$-th vertex; we have $\sum_i \theta_i = 2\pi$. A similar analysis holds in $\mathbb{R}^3$ for Fig. 1-a. In this figure, a vertex of a polyhedron and its opposite cone are shown, where the opposite cone is defined to have faces perpendicular to the faces of the polyhedron at the vertex. The intersection of a unit sphere centered at the vertex and the opposite cone is shown in the figure. This intersection is a geodesic on the unit sphere. According to Thomas Harriot's theorem, proposed in 1603 [markvorsen1996curvature], if this geodesic on the unit sphere is a triangle with angles $\alpha_1, \alpha_2, \alpha_3$, its area is $E_3 = (\alpha_1 + \alpha_2 + \alpha_3) - \pi$. The generalization of this theorem from a geodesic triangle (3-gon) to an $n$-gon is $E_n = \sum_{i=1}^{n} \alpha_i - (n-2)\pi$ [markvorsen1996curvature], where the polyhedron has $n$ faces meeting at the vertex.

The Descartes’s angular defect at a vertex of a polyhedron is [descartes1890progymnasmata]: . The total defect of a polyhedron is defined as the summation of the defects over the vertices. It can be shown that the total defect of a polyhedron with vertices, edges, and faces is: . The term is Euler-Poincaré characteristic of the polyhedron [richeson2019euler, hilton1982descartes]; therefore, the total defect of a polyhedron is equal to its Euler-Poincaré characteristic. According to Fig. 1-b, the smaller angles result in sharper corner of the polyhedron. Therefore, we can consider the angular defect as the curvature of the vertex.

Figure 1: (a) Polyhedron vertex, unit sphere, and the opposite cone, (b) large and small curvature, (c) a point and its neighbors normalized on a unit hyper-sphere around it.

3 Anomaly Detection

3.1 Curvature Anomaly Detection

The main idea of the Curvature Anomaly Detection (CAD) method is as follows. Every data point is considered to be the vertex of a hypothetical polyhedron (see Fig. 1-a). For every point, we find its $k$-Nearest Neighbors ($k$-NN). The neighbors of the point (vertex) form the faces of a polyhedron meeting at that vertex. Then, the more curvature that point (vertex) has, the more anomalous it is, because it is far away from (different from) its neighbors. Therefore, the anomaly score is proportional to the curvature.

Since, according to the equation of the angular defect, the curvature is $2\pi$ minus the summation of the angles, we can consider the anomaly score to be inversely proportional to the summation of the angles. Without loss of generality, we assume the angles to be in the range $[0, \pi]$ (otherwise, we take the smaller angle). The smaller the angle between two edges of the polyhedron, the larger its cosine. As the anomaly score is inversely proportional to the angles, we can use the cosine for the anomaly score. We define the anomaly score to be the summation of the cosines of the angles of the polyhedron faces meeting at that point: $s(x_i) := \sum_j \cos\angle(e_j, e_{j+1})$, where $e_j := x_{i,j} - x_i$ is the $j$-th edge of the polyhedron passing through the vertex $x_i$, $x_{i,j}$ is the $j$-th neighbor of $x_i$, and $e_{j+1}$ denotes the next edge sharing the same polyhedron face with $e_j$.

Note that finding the pairs of edges which belong to the same face is difficult and time-consuming, so we relax this calculation to the summation of the cosines of the angles between all pairs of edges meeting at the vertex $x_i$:

$$s(x_i) := \sum_{j=1}^{k-1} \sum_{\ell=j+1}^{k} \frac{(x_{i,j} - x_i)^\top (x_{i,\ell} - x_i)}{\|x_{i,j} - x_i\|_2 \, \|x_{i,\ell} - x_i\|_2}, \qquad (1)$$

where $x_{i,j}$ and $x_{i,\ell}$ denote the $j$-th and $\ell$-th neighbors of $x_i$. In Eq. (1), we have omitted the redundant angles because of the symmetry of the inner product. Note that Eq. (1) implies that we normalize the neighbors of $x_i$ to fall on the unit hyper-sphere centered at $x_i$ and then compute their cosine similarities (see Fig. 1-c).

The mentioned relaxation is valid for the following reason. Take two edges meeting at the vertex . If the two edges belong to the same polyhedron face, the relaxation is exact. Consider the case where the two edges do not belong to the same face. These two edges are connected with a set of polyhedron faces. If we tweak one of the two edges to increase/decrease the angle between them, the angle of that edge with its neighbor edge on the same face also increases/decreases. Therefore, the changes in the additional angles of relaxation are consistent with the changes of the angles between the edges sharing the same faces.
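A minimal sketch of this scoring step, assuming scikit-learn is available for the $k$-NN search (the function name `cad_scores` and its defaults are ours, not from the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cad_scores(X, k=10):
    """Anomaly score of Eq. (1): for every point, sum the pairwise cosine
    similarities between the k edge vectors pointing to its neighbors."""
    n = X.shape[0]
    # k+1 neighbors because every point is returned as its own nearest neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    iu = np.triu_indices(k, 1)                      # count every pair of edges once
    scores = np.empty(n)
    for i in range(n):
        edges = X[idx[i, 1:]] - X[i]                # edge vectors x_{i,j} - x_i
        edges = edges / (np.linalg.norm(edges, axis=1, keepdims=True) + 1e-12)
        scores[i] = (edges @ edges.T)[iu].sum()     # sum of pairwise cosines
    return scores
```

Larger returned values correspond to more anomalous points.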

After scoring the data points, we can sort the points and find a suitable threshold visually using a scree plot of the scores. However, in order to find anomalies automatically, we apply K-means clustering, with two clusters, to the scores. The cluster with the larger mean is the cluster of anomalies because the higher the score, the more anomalous the point.

For finding anomalies among out-of-sample data, we find the $k$-NN of the out-of-sample point, where the neighbors are taken from the training points. Then, we calculate the anomaly score using Eq. (1). The K-means cluster whose mean is closer to the calculated score determines whether the point is normal or anomalous. It is noteworthy that one can see anomaly detection for out-of-sample data as novelty detection [pimentel2014review].
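A sketch of the automatic threshold and the out-of-sample rule, using scikit-learn's KMeans on scores such as those produced by the hypothetical `cad_scores` helper above (all function names here are ours):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def fit_anomaly_threshold(scores):
    """Cluster the training scores into two groups; the cluster whose mean
    is larger is treated as the anomaly cluster."""
    km = KMeans(n_clusters=2, n_init=10).fit(scores.reshape(-1, 1))
    return km, int(np.argmax(km.cluster_centers_.ravel()))

def score_out_of_sample(x_new, X_train, k=10):
    """CAD score of an out-of-sample point, with neighbors from the training set."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors(x_new.reshape(1, -1))
    edges = X_train[idx[0]] - x_new
    edges = edges / (np.linalg.norm(edges, axis=1, keepdims=True) + 1e-12)
    return (edges @ edges.T)[np.triu_indices(k, 1)].sum()

def is_anomaly(x_new, X_train, km, anomaly_cluster, k=10):
    """Assign the out-of-sample score to the closer K-means score cluster."""
    s = score_out_of_sample(x_new, X_train, k)
    return int(km.predict(np.array([[s]]))[0]) == anomaly_cluster
```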

3.2 Kernel Curvature Anomaly Detection

The pattern of normal and anomalous data might not be linear. Therefore, we propose Kernel CAD (K-CAD) to work on the data in the feature space. In K-CAD, the two stages of finding the $k$-NN and calculating the anomaly score are performed in the feature space. Let $\phi: \mathcal{X} \rightarrow \mathcal{H}$ be the pulling function mapping the data to the feature space $\mathcal{H}$; in other words, $x \mapsto \phi(x)$. Let $t$ denote the dimensionality of the feature space, i.e., $\phi(x) \in \mathbb{R}^t$ while $x \in \mathbb{R}^d$; note that we usually have $t \gg d$. The kernel over two vectors $x_1$ and $x_2$ is the inner product of their pulled data [hofmann2008kernel]: $k(x_1, x_2) := \phi(x_1)^\top \phi(x_2)$. The Euclidean distance in the feature space is [scholkopf2001kernel]: $\|\phi(x_1) - \phi(x_2)\|_2 = \sqrt{k(x_1, x_1) - 2\,k(x_1, x_2) + k(x_2, x_2)}$. Using this distance, we find the $k$-NN of the dataset in the feature space.

After finding the $k$-NN in the feature space, we calculate the score in the feature space. We pull the edge vectors $x_{i,j} - x_i$ and $x_{i,\ell} - x_i$ to the feature space, so $(x_{i,j} - x_i)^\top (x_{i,\ell} - x_i)$ is changed to $\phi(x_{i,j} - x_i)^\top \phi(x_{i,\ell} - x_i)$. Let $K_i \in \mathbb{R}^{k \times k}$ denote the kernel matrix over the neighbors of $x_i$, whose $(j, \ell)$-th element is $K_i(j, \ell) := \phi(x_{i,j} - x_i)^\top \phi(x_{i,\ell} - x_i)$. The vectors in Eq. (1) are normalized; in the feature space, this is equivalent to normalizing the kernel [ah2010normalized]. If $\breve{K}_i$ denotes the normalized kernel, $\breve{K}_i(j, \ell) := K_i(j, \ell) / \sqrt{K_i(j, j)\, K_i(\ell, \ell)}$, the anomaly score in the feature space is:

$$s(x_i) := \sum_{j=1}^{k-1} \sum_{\ell=j+1}^{k} \breve{K}_i(j, \ell), \qquad (2)$$

where $\breve{K}_i(j, \ell)$ is the $(j, \ell)$-th element of the normalized kernel. The K-means clustering and out-of-sample anomaly detection are performed similarly as in CAD.
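A sketch of this kernel variant using scikit-learn's `pairwise_kernels`; the helper name `kcad_scores` and its parameters are our own, and the kernel is evaluated on the edge vectors as described above:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

def kcad_scores(X, k=10, kernel="rbf", **kernel_params):
    """K-CAD sketch: k-NN via the kernel-induced distance, then Eq. (2) on
    the normalized kernel over the edge vectors of every point."""
    K = pairwise_kernels(X, metric=kernel, **kernel_params)
    d = np.diag(K)
    D2 = d[:, None] - 2.0 * K + d[None, :]          # squared feature-space distances
    iu = np.triu_indices(k, 1)
    scores = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        nbr = np.argsort(D2[i])[1:k + 1]            # skip the point itself
        Ki = pairwise_kernels(X[nbr] - X[i], metric=kernel, **kernel_params)
        diag = np.sqrt(np.abs(np.diag(Ki))) + 1e-12
        scores[i] = (Ki / np.outer(diag, diag))[iu].sum()   # normalized kernel, Eq. (2)
    return scores
```

For example, `kcad_scores(X, k=10, kernel="rbf", gamma=0.5)` returns one score per point.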

Our observations in experiments showed that the anomaly score in K-CAD is ranked inversely for some kernels, such as the Radial Basis Function (RBF), Laplacian, and polynomial (of different degrees) kernels, on various datasets. In other words, for example, in K-CAD with the linear (i.e., CAD), cosine, and sigmoid kernels, the more anomalous points have greater scores, but in K-CAD with the RBF, Laplacian, and polynomial kernels, smaller scores are assigned to the more anomalous points. We conjecture that the reason lies in the characteristics of the kernels; we defer further investigation of the reason to future work. In conclusion, for the mentioned kernels, we should either multiply the scores by $-1$ or take the K-means cluster with the smaller mean as the anomaly cluster.

3.3 Anomaly Landscape and Anomaly Paths

We define the anomaly landscape to be the landscape in the input space whose value at every point of the space is the anomaly score computed by Eq. (1) or (2). The point in the space can be either a training or an out-of-sample point, but the $k$-NN is obtained from the training data. We can have two types of anomaly landscape, where either all the training data points or merely the non-anomalous training points are used for the $k$-NN. In the latter type, the training phase of CAD or K-CAD is performed before calculating the anomaly landscape for the whole input space.
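For two-dimensional data, the anomaly landscape can be approximated by evaluating the out-of-sample score of Eq. (1) on a grid; a sketch (the helper name and grid parameters are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def anomaly_landscape(X_train, k=10, xlim=(-1, 1), ylim=(-1, 1), res=100):
    """Evaluate the CAD score of Eq. (1) on a 2-D grid of out-of-sample points,
    taking the k-NN of every grid point from the training data."""
    xs, ys = np.linspace(*xlim, res), np.linspace(*ylim, res)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors(grid)
    iu = np.triu_indices(k, 1)
    Z = np.empty(len(grid))
    for g in range(len(grid)):
        edges = X_train[idx[g]] - grid[g]
        edges = edges / (np.linalg.norm(edges, axis=1, keepdims=True) + 1e-12)
        Z[g] = (edges @ edges.T)[iu].sum()
    return gx, gy, Z.reshape(res, res)              # e.g. plt.contourf(gx, gy, Z)
```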

We also define the anomaly path as the path that an anomalous point has traversed from its not-known-yet normal version to become anomalous. Conversely, it is the path that an anomalous point should traverse to become normal. In other words, an anomaly path can be used to make a normal sample anomalous or vice versa. At every point on the path, we calculate the $k$-NN again because the neighbors may change slightly along the path. For the anomaly path, we use the second type of anomaly landscape, where following the path is like going up/down the mountains in this landscape. For finding the anomaly path of every anomalous point, we use gradient descent, where the gradient of Eq. (1) with respect to $x_i$ is used. Letting $a_j := x_{i,j} - x_i$, the gradient is:

$$\frac{\partial s(x_i)}{\partial x_i} = \sum_{j=1}^{k-1} \sum_{\ell=j+1}^{k} \left[ \frac{a_j^\top a_\ell}{\|a_j\|_2 \|a_\ell\|_2} \left( \frac{a_j}{\|a_j\|_2^2} + \frac{a_\ell}{\|a_\ell\|_2^2} \right) - \frac{a_j + a_\ell}{\|a_j\|_2 \|a_\ell\|_2} \right], \qquad (3)$$

whose derivation is omitted for brevity (see the supplementary material at the end of this article). The anomaly path can be computed in CAD but not in K-CAD because the gradient in K-CAD cannot be computed analytically. The anomaly path can have many applications, one of which is image denoising, as explained in our experiments.
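A sketch of the anomaly-path computation by gradient descent on the score, using the gradient form of Eq. (3) as reconstructed above; the function names, step size, and iteration count are our own choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cad_gradient(x, X_train, k=10):
    """Gradient of the CAD score of Eq. (1) with respect to the point x (Eq. (3)),
    with neighbors taken from the training data."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors(x.reshape(1, -1))
    A = X_train[idx[0]] - x                         # edge vectors a_j
    norms = np.linalg.norm(A, axis=1) + 1e-12
    grad = np.zeros_like(x, dtype=float)
    for j in range(k - 1):
        for l in range(j + 1, k):
            cos = A[j] @ A[l] / (norms[j] * norms[l])
            grad += cos * (A[j] / norms[j] ** 2 + A[l] / norms[l] ** 2)
            grad -= (A[j] + A[l]) / (norms[j] * norms[l])
    return grad

def anomaly_path(x0, X_train, k=10, lr=0.1, n_steps=50):
    """Walk a point toward the normal region by gradient descent on its anomaly
    score, re-finding the k-NN at every step."""
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - lr * cad_gradient(x, X_train, k)
        path.append(x.copy())
    return np.array(path)
```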

4 Prototype Selection

4.1 Inverse Curvature Anomaly Detection

As the anomaly detection is based on scores, we can view instance ranking and numerosity reduction from the opposite perspective of anomaly detection. Therefore, the ranking scores can be considered as the anomaly scores of Eq. (1) multiplied by $-1$. We sort the ranking scores in descending order; the data points with larger ranking scores are more important. As the order of the ranking scores is the inverse of the order of the anomaly scores, we name this method inverse CAD (iCAD).

Prototype selection can be performed in two approaches: (I) the data points are sorted and the portion of points having the best ranks is retained, or (II) a portion of the data points is retained as prototypes and the rest of the points are discarded. Some examples of the first approach are IRMD, PSA, SOS, and SE; DROP3 and ENN are examples of the second approach. iCAD can be used for both approaches. The first approach ranks the points by the ranking score. For the second approach, we apply K-means clustering, with two clusters, to the ranking scores and take the points of the cluster with the larger mean.
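A sketch of both approaches, reusing the hypothetical `cad_scores` helper from the CAD sketch (the function name `icad_prototypes` and the `retain` parameter are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def icad_prototypes(X, k=10, retain=None):
    """iCAD sketch: the ranking score is the negated CAD score. Either keep the
    top `retain` fraction (rank-based) or, if `retain` is None, keep the K-means
    cluster of scores with the larger mean (retaining-based)."""
    ranks = -cad_scores(X, k)              # hypothetical helper from the CAD sketch
    if retain is not None:
        order = np.argsort(-ranks)         # descending ranking score
        keep = order[:int(retain * len(X))]
    else:
        km = KMeans(n_clusters=2, n_init=10).fit(ranks.reshape(-1, 1))
        good = int(np.argmax(km.cluster_centers_.ravel()))
        keep = np.where(km.labels_ == good)[0]
    return keep                            # indices of retained prototypes
```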

4.2 Kernel Inverse Curvature Anomaly Detection

We can perform iCAD in the feature space to have Kernel iCAD (K-iCAD). The ranking score is again the anomaly score, now computed by Eq. (2), multiplied by $-1$ to reverse the order of the scores. Again, we have the two approaches where the points are either ranked or K-means is applied to the scores. Note that, for the reason mentioned before, we do not multiply by $-1$ for some kernels, including the RBF, Laplacian, and polynomial kernels. Note that iCAD and K-iCAD are task agnostic and can be used for data reduction in classification, regression, and clustering. For classification, we apply the method to every class separately, while in regression and clustering, the method is applied to the entire data.

5 Experiments

5.1 Experiments for Anomaly Detection

Synthetic Datasets: We examined CAD and K-CAD on three two-dimensional synthetic datasets, i.e., the two-moons data and two datasets of homogeneous and heterogeneous clusters. Figure 2 shows the results for CAD and K-CAD with the RBF and polynomial (degree three) kernels. As expected, the abnormal and core points are correctly detected as anomalous and normal points, respectively. The boundary points are detected as anomalies in CAD, while they are correctly recognized as normal points in K-CAD. In the heterogeneous clusters data, the larger cluster is correctly detected as normal in CAD but not in K-CAD; however, if the threshold is changed manually (rather than by K-means) in K-CAD, the larger cluster is also correctly recognized. As seen in this figure, the scores are reversed for the RBF and polynomial kernels, which is consistent with our earlier note. We also show the anomaly landscape and anomaly paths for CAD in Fig. 2. K-CAD does not have anomaly paths, as mentioned before. The landscapes in this figure are of the second type, and the paths are shown by red traces, which simulate climbing down the mountains in the landscape.

Figure 2: Anomaly detection, anomaly scores, anomaly landscape, and anomaly paths for synthetic datasets. In the gray and white plots, the gray and white colors show the regions determined as normal and anomaly, respectively. The gray-scale plots are the anomaly scores.

Real Datasets: We performed experiments on several real anomaly detection datasets. The datasets, which are taken from [web_anomaly_datasets], are speech, opt. digits, arrhythmia, wine, and musk, with 1.65%, 3%, 15%, 7.7%, and 3.2% proportions of anomalies, respectively. The sample sizes of these datasets are 3686, 5216, 452, 129, and 3062, and their dimensionalities are 400, 64, 274, 13, and 166, respectively. We compared CAD and K-CAD, with the RBF and polynomial (degree 3) kernels, to Isolation Forest, LOF, one-class SVM (RBF kernel), and EE. We used $k = 10$ in LOF, CAD, and K-CAD. The average Area Under the ROC Curve (AUC) and the average time for both the training and test phases over 10-fold Cross Validation (CV) are reported in Table 1. For the wine data, because of its small sample size, we used 2-fold CV. The system running the methods was an Intel Core i7, 3.60 GHz, with 32 GB of RAM. In most cases, K-CAD has better performance than CAD, although CAD is useful and effective, as we will see for the anomaly path and also instance ranking. For the speech and opt. digits datasets, the RBF kernel performs better than the polynomial kernel, and for the other datasets, the polynomial kernel is better. Mostly, K-CAD is faster in both the training and test phases because K-CAD uses the kernel matrix and normalizes it rather than computing element-wise cosines as in CAD. In the speech and opt. digits datasets, we outperform all the baseline methods in both training and test AUC rates. In the arrhythmia data, K-CAD with the polynomial kernel has better results than Isolation Forest. For the wine dataset, K-CAD with the polynomial kernel is better than Isolation Forest, SVM, and EE. In the musk data, K-CAD with both the RBF and polynomial kernels is better than Isolation Forest and SVM.

To examine the effect of $k$ in CAD and K-CAD, we report the results for $k \in \{3, 10, 20\}$ on the arrhythmia dataset in Fig. 5. For CAD, where the cosines are computed element-wise, the time increases with $k$, as expected. Overall, the accuracy, and especially the training AUC, increases in CAD with larger $k$. K-CAD is more robust to the change of $k$ in terms of both accuracy and time.

CAD K-CAD (rbf) K-CAD (poly) Iso Forest LOF SVM EE
Speech Train: Time: 14.84 ± 0.21  13.54 ± 0.21  13.87 ± 0.33  2.82 ± 0.06  6.53 ± 0.02  6.12 ± 0.02  23.35 ± 0.25
AUC: 34.78 ± 0.15  76.15 ± 2.08  63.69 ± 1.48  48.37 ± 1.38  53.99 ± 1.75  46.63 ± 1.59  49.16 ± 1.84
Test: Time: 7.16 ± 0.03  1.38 ± 0.03  1.42 ± 0.03  0.06 ± 0.01  7.31 ± 0.10  0.24 ± 0.01  0.01 ± 0.00
AUC: 42.07 ± 10.48  71.23 ± 13.18  56.15 ± 10.48  45.23 ± 12.17  53.53 ± 12.29  44.55 ± 12.09  47.25 ± 12.54
Opt. digits Train: Time: 13.27 ± 0.11  26.96 ± 0.31  25.86 ± 0.48  0.81 ± 0.01  1.84 ± 0.02  3.10 ± 0.01  0.96 ± 0.04
AUC: 32.67 ± 1.53  87.52 ± 1.76  77.79 ± 1.45  68.38 ± 4.64  60.84 ± 1.67  50.52 ± 3.81  39.04 ± 2.44
Test: Time: 11.24 ± 0.12  2.67 ± 0.01  2.59 ± 0.03  0.03 ± 0.01  2.13 ± 0.08  0.15 ± 0.00  0.01 ± 0.00
AUC: 26.28 ± 7.10  88.22 ± 5.62  79.72 ± 4.81  68.36 ± 8.11  61.12 ± 11.65  37.49 ± 7.41  38.84 ± 4.29
Arrhythmia Train: Time: 4.76 ± 0.02  2.75 ± 0.06  2.53 ± 0.03  0.20 ± 0.01  0.07 ± 0.00  0.13 ± 0.00  0.85 ± 0.02
AUC: 52.89 ± 0.96  48.87 ± 0.51  73.92 ± 1.12  62.43 ± 2.05  91.04 ± 0.66  88.56 ± 0.87  80.59 ± 0.65
Test: Time: 1.59 ± 0.01  0.30 ± 0.01  0.28 ± 0.00  0.02 ± 0.01  0.08 ± 0.00  0.01 ± 0.00  0.01 ± 0.00
AUC: 48.02 ± 9.06  48.56 ± 5.38  71.88 ± 9.23  63.07 ± 11.55  90.57 ± 5.47  90.03 ± 5.63  80.32 ± 4.73
Wine Train: Time: 0.28 ± 0.00  0.03 ± 0.03  0.03 ± 0.01  0.09 ± 0.00  0.01 ± 0.00  0.01 ± 0.00  0.05 ± 0.02
AUC: 25.59 ± 4.28  27.04 ± 10.66  92.11 ± 7.06  79.56 ± 10.59  98.70 ± 1.29  68.59 ± 4.25  59.56 ± 37.15
Test: Time: 0.18 ± 0.00  0.02 ± 0.00  0.01 ± 0.00  0.02 ± 0.00  0.01 ± 0.00  0.01 ± 0.00  0.01 ± 0.00
AUC: 23.65 ± 14.45  40.17 ± 13.12  86.97 ± 2.96  76.09 ± 10.11  92.58 ± 5.69  91.13 ± 3.83  57.70 ± 38.84
Musk Train: Time: 11.15 ± 0.20  9.67 ± 0.47  9.42 ± 0.03  0.98 ± 0.01  1.16 ± 0.03  3.16 ± 0.00  10.16 ± 0.32
AUC: 40.69 ± 2.97  69.68 ± 2.97  93.45 ± 1.46  99.91 ± 0.00  41.93 ± 3.34  57.99 ± 7.34  99.99 ± 0.00
Test: Time: 4.89 ± 0.02  0.98 ± 0.02  0.98 ± 0.01  0.03 ± 0.00  1.24 ± 0.01  0.17 ± 0.00  0.01 ± 0.00
AUC: 30.30 ± 10.37  50.00 ± 0.00  93.80 ± 3.77  99.95 ± 0.00  39.00 ± 10.55  5.71 ± 3.63  100 ± 0.00
Table 1: Comparison of anomaly detection methods. Rates are AUC percentage and times are in seconds.

Figure 3: Image denoising using anomaly paths: the leftmost image is the original image, and the first to fourth rows are for Gaussian noise, Gaussian blurring, salt & pepper impulse noise, and JPEG blocking, respectively. The numbers are the iteration indices.

An Application; Image Denoising: One of the applications of the anomaly path is image denoising where several similar reference images exist; for example, in video, where neighboring frames exist. For this experiment, we used the first 100 frames of the Frey face dataset. We selected one of the frames and applied different types of distortion to it, i.e., Gaussian noise, Gaussian blurring, salt & pepper impulse noise, and JPEG blocking, all with the same Mean Squared Error (MSE). For a more difficult experiment, we removed the non-distorted frame from the dataset. Figure 3 shows the iterations of denoising for the different noise types.
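A hedged sketch of such a denoising loop, reusing the hypothetical `anomaly_path` helper sketched in Section 3.3 (the frame vectorization and the parameter values are illustrative, not the paper's exact setup):

```python
import numpy as np

def denoise_frame(noisy, X_frames, k=10, lr=0.1, n_steps=30):
    """Treat the noisy frame (a vectorized image) as an anomalous point among
    the clean reference frames and walk it back along its anomaly path."""
    path = anomaly_path(np.asarray(noisy, float), np.asarray(X_frames, float),
                        k=k, lr=lr, n_steps=n_steps)
    return path[-1]          # final point on the path is taken as the denoised frame
```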

Figure 4: Instance ranking for synthetic datasets where larger markers are for more important data points. The first to third columns correspond to iCAD, K-iCAD with RBF kernel, and K-iCAD with polynomial kernel, respectively.
CAD K-CAD (rbf) K-CAD (poly)
k = 3 Train: Time: 0.55 ± 0.00  3.25 ± 0.15  3.06 ± 0.16
AUC: 37.01 ± 1.00  49.83 ± 0.63  63.61 ± 1.36
Test: Time: 1.12 ± 0.02  0.37 ± 0.01  0.35 ± 0.01
AUC: 45.51 ± 6.21  49.32 ± 1.69  61.67 ± 10.18
k = 10 Train: Time: 4.76 ± 0.02  2.75 ± 0.06  2.53 ± 0.03
AUC: 52.89 ± 0.96  48.87 ± 0.51  73.92 ± 1.12
Test: Time: 1.59 ± 0.01  0.30 ± 0.01  0.28 ± 0.00
AUC: 48.02 ± 9.06  48.56 ± 5.38  71.88 ± 9.23
k = 20 Train: Time: 16.56 ± 0.07  3.41 ± 0.13  3.24 ± 0.24
AUC: 56.87 ± 1.02  49.33 ± 0.91  76.83 ± 1.07
Test: Time: 2.88 ± 0.01  0.39 ± 0.01  0.36 ± 0.01
AUC: 47.90 ± 9.18  49.27 ± 7.06  74.88 ± 8.47
Figure 5: Comparison of CAD and K-CAD performance on the arrhythmia dataset for different values of $k$.

5.2 Experiments for Prototype Selection

Synthetic Datasets: The performances of iCAD and K-iCAD with the RBF and polynomial (degree 3) kernels are illustrated in Fig. 4 for the three synthetic datasets, where the larger markers show the more important points. We can see that the points are ranked as expected.

Real Datasets:

We performed experiments on several real datasets, i.e., the pima, image segment, Facebook metrics, and iris datasets, from the UCI machine learning repository. The first two datasets are used for classification, the third for regression, and the last one for clustering. The sample sizes of the datasets are 768, 2310, 500, and 150, and their dimensionalities are 8, 19, 19, and 4, respectively. The numbers of classes/clusters in pima, image segment, and iris are 2, 7, and 3, respectively. We used 1-Nearest Neighbor (1-NN), Linear Discriminant Analysis (LDA), SVM, Linear Regression (LR), Random Forest (RF), Multi-Layer Perceptron (MLP) with two hidden layers, and the K-means and Birch clustering methods in the experiments. Table 2 reports the average accuracy and time over 10-fold CV and the comparison to IRMD (with QR decomposition), PSA, SOS, SE, Sorted by Distance from Mean (SDM), ENN, DROP3, and No Reduction (NR).

Pima Dataset
iCAD K-iCAD (rbf) K-iCAD (poly) IRMD PSA SOS SE SDM iCAD K-iCAD (rbf) K-iCAD (poly) ENN DROP3 NR
Time: 1.98E0 4.41E1 4.14E1 9.52E3 5.87E0 2.13E2 1.59E 4.51E3 2.09E0 4.73E1 4.73E1 1.87E2 5.00E0
1NN 20% data: 70.69% 67.70% 61.97% 67.28% 66.53% 67.06% 44.40% 62.87% (40.79%) (4.35%) (12.76%) (69.40%) (13.44%) 68.48%
50% data: 70.83% 66.01% 64.96% 66.64% 64.44% 67.04% 50.64% 66.39% 65.23% 66.93% 65.88% 71.74% 64.06%
70% data: 69.26% 67.56% 65.74% 67.95% 65.75% 66.91% 65.62% 67.81%
LDA 20% data: 75.64% 67.82% 72.25% 70.81% 76.17% 73.56% 54.55% 65.09% 77.73%
50% data: 76.42% 76.03% 76.43% 73.81% 76.81% 76.43% 71.35% 73.68% 65.62% 71.74% 66.14% 77.07% 75.12%
70% data: 76.94% 76.82% 76.81% 75.38% 76.82% 76.56% 76.55% 74.98%
SVM 20% data: 63.28% 57.65% 62.50% 59.62% 57.68% 63.52% 42.04% 63.78% 64.57%
50% data: 65.22% 62.25% 57.94% 61.57% 61.57% 62.25% 48.15% 60.01% 64.84% 63.92% 63.93% 67.18% 53.12%
70% data: 62.76% 61.28% 56.40% 55.97% 60.42% 62.63% 55.23% 64.98%
Image Segment dataset
iCAD K-iCAD (rbf) K-iCAD (poly) IRMD PSA SOS SE SDM iCAD K-iCAD (rbf) K-iCAD (poly) ENN DROP3 NR
Time: 6.25E0 1.20E0 1.04E0 3.64E2 3.28E2 5.86E2 4.85E2 1.38E2 6.13E0 1.58E0 1.44E0 1.30E1 3.13E2
1NN 20% data: 90.34% 81.42% 83.72% 78.00% 90.25% 89.26% 83.41% 80.99% (9.54%) (3.97%) (3.14%) (95.25%) (6.78%) 96.45%
50% data: 93.41% 89.43% 92.85% 86.83% 94.76% 94.19% 84.71% 87.05% 47.22% 49.13% 49.43% 94.45% 60.34%
70% data: 94.84% 92.20% 95.75% 91.03% 96.06% 95.41% 90.69% 90.00%
LDA 20% data: 89.26% 85.97% 90.86% 82.90% 90.47% 90.51% 85.23% 86.36% 91.55%
50% data: 90.64% 90.04% 91.94% 86.32% 91.08% 91.16% 86.27% 87.83% 60.60% 58.70% 59.91% 67.57% 79.35%
70% data: 90.90% 90.99% 91.64% 88.26% 91.42% 91.34% 90.30% 89.48%
SVM 20% data: 73.41% 77.74% 76.40% 71.42% 78.44% 78.35% 71.73% 71.86% 85.23%
50% data: 80.90% 81.08% 80.30% 80.60% 77.18% 82.85% 70.60% 81.34% 39.43% 40.38% 38.70% 78.52% 55.10%
70% data: 84.71% 84.71% 83.67% 78.87% 82.20% 85.67% 86.79% 84.32%
Facebook Dataset
iCAD K-iCAD (rbf) K-iCAD (poly) IRMD PSA SOS SE SDM iCAD K-iCAD (rbf) K-iCAD (poly) ENN DROP3 NR
Time: 1.30E0 6.33E1 1.01E0 7.01E3 1.08E0 1.40E2 8.40E3 4.81E3 1.26E0 3.18E1 3.18E1
LR 20% data: 7.57E3 6.00E3 7.45E3 5.28E3 9.98E3 6.07E3 1.89E4 7.50E3 (49.96%) (6.95%) (34.81%) 5.81E3
50% data: 6.11E3 5.51E3 5.86E3 4.85E3 6.30E3 6.25E3 6.59E3 5.69E3 6.10E3 1.56E4 6.41E3
70% data: 5.70E3 6.00E3 5.72E3 4.95E3 5.55E3 5.90E3 5.83E3 5.30E3
RF 20% data: 8.54E3 7.87E3 6.85E3 5.53E3 7.12E3 6.56E3 1.09E4 6.58E3 6.17E3
50% data: 6.41E3 7.08E3 6.28E3 5.02E3 6.20E3 6.76E3 6.10E3 7.19E3 6.36E3 9.82E3 6.46E3
70% data: 5.92E3 6.95E3 6.19E3 5.03E3 5.86E3 6.26E3 6.32E3 6.03E3
MLP 20% data: 1.348E4 1.61E4 1.02E4 7.11E3 2.12E4 7.46E3 4.70E4 2.31E4 5.72E3
50% data: 6.10E3 5.44E3 5.57E3 4.75E3 6.93E3 6.02E3 5.62E3 6.64E3 6.11E3 5.06E4 8.31E3
70% data: 6.31E3 6.06E3 5.66E3 5.14E3 5.75E3 5.95E3 5.86E3 6.00E3
Iris Dataset
iCAD K-iCAD (rbf) K-iCAD (poly) IRMD PSA SOS SE SDM iCAD K-iCAD (rbf) K-iCAD (poly) ENN DROP3 NR
Time: 4.96E1 5.30E2 4.46E2 1.60E3 2.83E1 3.70E3 3.00E1 1.90E3 4.68E1 6.07E2 5.95E2
K-means 20% data: 6.87E1 1.56E1 2.34E1 1.32E1 6.18E1 6.92E1 5.02E1 4.89E1 (60.59%) (84.07%) (84.07%) 7.03E1
50% data: 6.98E1 7.13E1 8.33E1 5.93E1 7.01E1 7.38E1 4.85E1 5.77E1 7.34E1 8.06E1 8.20E1
70% data: 7.18E1 7.83E1 8.20E1 6.53E1 7.18E1 7.35E1 6.27E1 7.12E1
Birch 20% data: 6.09E1 0.00E0 1.26E1 1.07E1 6.97E1 6.41E1 5.10E1 2.90E1 5.93E1
50% data: 6.48E1 6.58E1 7.15E1 6.35E1 6.11E1 5.34E1 5.11E1 5.92E1 7.71E1 6.60E1 6.85E1
70% data: 6.67E1 7.17E1 7.37E1 6.61E1 6.48E1 5.79E1 6.23E1 6.04E1
Table 2: Comparison of instance selection methods. Classification, regression, and clustering rates are accuracy, mean absolute error, and adjusted Rand index (best is 1), respectively, and times are in seconds. The left and right column groups are for rank-based and retaining-based methods, respectively. Numbers in parentheses show the percentage of retained data.

iCAD and K-iCAD are reported in both the rank-based and retaining-based versions of prototype selection. For the pima and image segment datasets, iCAD and K-iCAD perform equally well, but in the other datasets, K-iCAD is mostly better. In terms of time, we outperform PSA and DROP3. In pima, we outperform all the other baselines. In image segment, we are better than IRMD, SE, and SDM. In the Facebook data, we are mostly better than SOS, SE, and SDM, and in some cases better than PSA. In the iris data, we outperform all the baselines. In some cases, we even outperform using the entire data. Among the retaining-based methods, mostly, K-iCAD with the RBF kernel retains the least data, then K-iCAD with the polynomial kernel, and then iCAD.

6 Conclusion and Future Direction

This paper proposed a new method for anomaly detection, named CAD, and its kernel version. The main idea was to consider every point as the vertex of a polyhedron formed with the help of its neighbors and to measure its curvature. Moreover, with an opposite view to CAD, iCAD and K-iCAD were proposed for prototype selection. Different experiments as well as an application in image denoising were also reported. As possible future work, we will try the idea of curvature for manifold embedding, to propose a curvature-preserving embedding method.

References

Supplementary Material: Proof of Eq. (3)

Let $a_j := x_{i,j} - x_i$, so that $\partial a_j / \partial x_i = -I$ and $s(x_i) = \sum_{j<\ell} \frac{a_j^\top a_\ell}{\|a_j\|_2 \|a_\ell\|_2}$. Then, for every pair $(j, \ell)$:

$$\frac{\partial}{\partial x_i} \left( \frac{a_j^\top a_\ell}{\|a_j\|_2 \|a_\ell\|_2} \right) = \frac{a_j^\top a_\ell}{\|a_j\|_2 \|a_\ell\|_2} \left( \frac{a_j}{\|a_j\|_2^2} + \frac{a_\ell}{\|a_\ell\|_2^2} \right) - \frac{a_j + a_\ell}{\|a_j\|_2 \|a_\ell\|_2},$$

because $\partial (a_j^\top a_\ell)/\partial x_i = -(a_j + a_\ell)$ and $\partial \|a_j\|_2 / \partial x_i = -a_j / \|a_j\|_2$. Summing over all pairs $j < \ell$ gives the proposed derivative. Q.E.D.