The Mean and Median Criterion for Automatic Kernel Bandwidth Selection for Support Vector Data Description

Support vector data description (SVDD) is a popular technique for detecting anomalies. The SVDD classifier partitions the whole space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, and the Gaussian kernel is a common choice for the kernel function. The Gaussian kernel has a bandwidth parameter, whose value is important for good results. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to underfitting, and the classifier fails to detect many anomalies. In this paper we present a new automatic, unsupervised method for selecting the Gaussian kernel bandwidth. The selected value can be computed quickly, and it is competitive with existing bandwidth selection methods.

I Introduction

Support vector data description (SVDD) is a machine-learning technique that is used for single-class classification and anomaly detection. It was first introduced by Tax and Duin [13]. SVDD’s mathematical formulation is almost identical to that of the one-class variant of support vector machines: one-class support vector machines (OCSVM), which is attributed to Schölkopf et al. [10]. The use of SVDD is popular in domains where the majority of data belongs to a single class and no distributional assumptions can be made. For example, SVDD is useful for analyzing sensor readings from reliable equipment, where almost all the readings describe the equipment’s normal state of operation.

Like other one-class classifiers, SVDD provides a geometric description of the observed data. The SVDD classifier assigns a distance to each point in the domain space, which measures the separation of that point from the training data. During scoring, any observation found to be at a large distance from the training data might be an anomaly, and the user might choose to generate an alert.

Several researchers have proposed using SVDD for multivariate process control [12, 2]. Other applications of SVDD involve machine condition monitoring [14, 16] and image classification [8].

I-A Mathematical Formulation

In this subsection we describe the mathematical formulation of SVDD; the description is based on [13].
Normal Data Description:
The SVDD model for normal data description builds a hypersphere that contains most of the data within a small radius. Given observations $x_1, \dots, x_n \in \mathbb{R}^m$, we solve the following optimization problem to obtain the SVDD data description.


Primal Form:
Objective:

$\min_{R, a, \xi} \; R^2 + C \sum_{i=1}^{n} \xi_i$ (1)

subject to:

$\|x_i - a\|^2 \le R^2 + \xi_i, \; \forall i$ (2)
$\xi_i \ge 0, \; \forall i$ (3)

where:
$x_i \in \mathbb{R}^m$, $i = 1, \dots, n$, represent the training data,
$R$ is the radius and represents a decision variable,
$\xi_i$ is the slack for each observation,
$a$ is the center (a decision variable),
$C = \frac{1}{nf}$ is the penalty constant that controls the trade-off between the volume and the errors, and
$f$ is the expected outlier fraction.
 
Dual Form:
The dual formulation is obtained using Lagrange multipliers.
Objective:

$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j)$ (4)

subject to:

$\sum_{i=1}^{n} \alpha_i = 1$ (5)
$0 \le \alpha_i \le C, \; \forall i$ (6)

where $\alpha_i \in \mathbb{R}$ are the Lagrange constants and $C = \frac{1}{nf}$ is the penalty constant.
 
Duality Information:
The position of observation $x_i$ is connected to the optimal $\alpha_i$, the radius of the sphere $R$, and the center of the sphere $a$ in the following manner:

Center position:

$a = \sum_{i=1}^{n} \alpha_i x_i$ (7)

Inside position:

$\|x_i - a\| < R \Rightarrow \alpha_i = 0$ (8)

Boundary position:

$\|x_i - a\| = R \Rightarrow 0 < \alpha_i < C$ (9)

Outside position:

$\|x_i - a\| > R \Rightarrow \alpha_i = C$ (10)

Any $x_i$ for which the corresponding $\alpha_i > 0$ is known as a support vector.

Let $SV_{<C}$ denote the set $\{x_i : 0 < \alpha_i < C\}$; then the radius of the hypersphere is calculated, for any $x_k \in SV_{<C}$, as follows:

$R^2 = (x_k \cdot x_k) - 2 \sum_{i} \alpha_i (x_i \cdot x_k) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j)$ (11)

The value of $R^2$ does not depend on the choice of $x_k \in SV_{<C}$.
Scoring:

For any point $z$, the distance $\operatorname{dist}^2(z)$ is calculated as follows:

$\operatorname{dist}^2(z) = (z \cdot z) - 2 \sum_{i} \alpha_i (x_i \cdot z) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j)$ (12)

Points whose $\operatorname{dist}^2(z) > R^2$ are designated as outliers.

The spherical data boundary can include a significant amount of space that has a sparse distribution of training observations. Using this model to score can lead to many false positives. Hence, instead of a spherical shape, a compact bounded outline around the data is often desired. Such an outline should approximate the shape of the single-class training data. Kernel functions make this possible.

Flexible Data Description:

The support vector data description is made flexible by replacing the inner product $(x_i \cdot x_j)$ with a suitable kernel function $K(x_i, x_j)$. The Gaussian kernel function used in this paper is defined as

$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2s^2} \right)$ (13)

where $s$ is the Gaussian bandwidth parameter.

The modified mathematical formulation of SVDD with a kernel function is as follows:

Objective:

$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$ (14)

Subject to:

$\sum_{i=1}^{n} \alpha_i = 1$ (15)
$0 \le \alpha_i \le C, \; \forall i$ (16)

In perfect analogy with the previous section, any $x_i$ with $\alpha_i = 0$ is an inside point, and any $x_i$ for which $\alpha_i > 0$ is called a support vector.

$SV_{<C}$ is similarly defined as $\{x_i : 0 < \alpha_i < C\}$, and the threshold $R^2$ is calculated, for any $x_k \in SV_{<C}$, as

$R^2 = K(x_k, x_k) - 2 \sum_{i} \alpha_i K(x_i, x_k) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$ (17)

The value of $R^2$ does not depend on which $x_k \in SV_{<C}$ is used.
Scoring: For any observation $z$, the distance is calculated as follows:

$\operatorname{dist}^2(z) = K(z, z) - 2 \sum_{i} \alpha_i K(x_i, z) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$ (18)

Any point $z$ for which $\operatorname{dist}^2(z) > R^2$ is designated as an outlier.
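To make the kernelized scoring rule concrete, here is a minimal pure-Python sketch. The uniform weights $\alpha_i = 1/n$ are only a stand-in that satisfies the dual constraints (they do not come from actually solving the SVDD dual), and the bandwidth value and toy data are arbitrary illustrations:

```python
import math

def gauss(x, y, s):
    """Gaussian kernel exp(-||x - y||^2 / (2 s^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * s * s))

def svdd_dist2(z, X, alpha, s):
    """dist^2(z) = K(z,z) - 2*sum_i alpha_i K(x_i,z) + sum_{i,j} alpha_i alpha_j K(x_i,x_j)."""
    cross = sum(a * gauss(x, z, s) for a, x in zip(alpha, X))
    w = sum(ai * aj * gauss(xi, xj, s)
            for ai, xi in zip(alpha, X)
            for aj, xj in zip(alpha, X))
    return gauss(z, z, s) - 2.0 * cross + w

X = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.4), (0.4, 0.3)]   # toy training data
alpha = [1.0 / len(X)] * len(X)   # stand-in weights, NOT a solved SVDD dual
# A point far from the training data gets a larger kernel distance than a nearby one.
assert svdd_dist2((10.0, 10.0), X, alpha, 1.0) > svdd_dist2((0.3, 0.2), X, alpha, 1.0)
```

With a true SVDD solution, the same function would be thresholded against $R^2$ to flag outliers.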

I-B Importance of the Kernel Bandwidth Value

In practice, SVDD is almost always computed by using the Gaussian kernel function, and it is important to set the value of the bandwidth parameter correctly. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to underfitting, and many anomalies cannot be detected by the classifier.

Because SVDD is an unsupervised learning technique, it is desirable to have an automatic, unsupervised bandwidth selection technique that does not depend on labeled data that separate the inliers from the outliers. In [4], Kakde et al. present the peak criterion, an unsupervised bandwidth selection technique, and show that it performs better than alternative unsupervised methods. However, determining the bandwidth suggested by the peak criterion requires that the SVDD solution be computed multiple times on the training data, for a list of bandwidth values that lie on a grid. Even though sampling techniques can speed up the computation (see [arXiv:1611.00058]), this method is still expensive. Moreover, to avoid unnecessary computations, it is necessary to initiate the grid search at a good starting value, and it is not immediately obvious what a good starting value is. In this paper we suggest two new criteria: the mean criterion and the median criterion. The mean criterion has a simple closed-form expression in terms of the training data.

We evaluated the mean criterion and the median criterion in multiple ways. We conducted simulation studies where we could objectively determine the quality of a particular bandwidth. We compared the results obtained from the mean and median criteria with those obtained from alternative methods on a wide range of data sets. The data were specially selected to probe potential weaknesses in the mean criterion.

Our results show that the mean criterion is competitive with the peak and median criteria for most data sets in our test suite. In addition, computation of the mean criterion is fast even when the data set is large. These properties make the mean criterion a good bandwidth selection technique. However, unsupervised bandwidth tuning is an extremely difficult problem, so it is quite possible that there is a class of data sets for which the mean criterion does not give good results.

The rest of the paper is organized as follows. Section II defines the mean and median criteria for bandwidth tuning, and the remaining sections compare the mean, median, and peak criteria with each other.

II The Mean Criterion for Bandwidth Selection

II-A Training Data That Have Distinct Observations

Assume we have a training data set that consists of $n$ distinct points $x_1, \dots, x_n$ in $\mathbb{R}^m$, and we want to determine a good kernel bandwidth value for training on this data set.

Given a candidate bandwidth $s$, let $K_s$ denote the kernel matrix whose element in position $(i, j)$ is $\exp(-\|x_i - x_j\|^2 / (2s^2))$. A tiny $s$ is not a good candidate for the kernel bandwidth, because as $s \to 0$, $K_s$ converges to the identity matrix of order $n$. When the kernel matrix is very close to the identity matrix, all observations in the original data set become support vectors. This indicates a case of severe overfitting.

So for a reasonable bandwidth value $s$, the corresponding kernel matrix $K_s$ must be sufficiently different from the identity matrix $I_n$. One way to ensure this would be to choose $s$ such that

$\|K_s - I_n\| \ge \theta$ (19)

where $\|\cdot\|$ is an appropriate matrix norm and $\theta$ is a tolerance factor. Larger values of $\theta$ ensure greater distance from the identity matrix.
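This condition can be checked numerically. The sketch below assumes the Gaussian parametrization $\exp(-\|x - y\|^2/(2s^2))$ and the Frobenius norm; the data points and the bandwidth values are illustrative only:

```python
import math

def gauss(x, y, s):
    """Gaussian kernel exp(-||x - y||^2 / (2 s^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * s * s))

def frob_dist_from_identity(X, s):
    """||K_s - I_n||_F; the diagonal of K_s is 1, so only off-diagonal terms contribute."""
    n = len(X)
    total = sum(gauss(X[i], X[j], s) ** 2
                for i in range(n) for j in range(n) if i != j)
    return math.sqrt(total)

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # illustrative data
# A tiny bandwidth drives K_s toward the identity matrix; a moderate one keeps it away.
assert frob_dist_from_identity(X, 0.01) < 1e-6
assert frob_dist_from_identity(X, 1.0) > 1.0
```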

It is easy to determine an $s$ that satisfies (19) when $\|\cdot\|$ is chosen as the Frobenius norm. The Frobenius norm of a matrix $A$ is defined as the square root of the sum of squares of all elements in the matrix; that is, $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$.

If $\|\cdot\|$ is the Frobenius norm, then

$\|K_s - I_n\|_F^2 = \sum_{i \ne j} \exp\left( -\frac{\|x_i - x_j\|^2}{s^2} \right)$

From the well-known inequality of arithmetic and geometric means, we have

$\frac{1}{n(n-1)} \sum_{i \ne j} \exp\left( -\frac{\|x_i - x_j\|^2}{s^2} \right) \ge \exp\left( -\frac{1}{n(n-1)} \sum_{i \ne j} \frac{\|x_i - x_j\|^2}{s^2} \right)$

so it is sufficient to choose an $s$ such that

$s^2 \ge \frac{\bar{D}^2}{\ln\left( n(n-1)/\theta^2 \right)}$ (20)

where

$\bar{D}^2 = \frac{1}{n(n-1)} \sum_{i \ne j} \|x_i - x_j\|^2$

Equation (20) suggests using

$\bar{s}^2 = \frac{\bar{D}^2}{\ln\left( n(n-1)/\theta^2 \right)}$ (21)

as the kernel bandwidth. The numerator of (20) contains the mean of the pairwise squared distances; this suggests creating new criteria by replacing it with another measure of central tendency of the squared distances in the numerator of (20). For example, we can have another criterion that suggests using

$\tilde{s}^2 = \frac{\operatorname{median}\{\|x_i - x_j\|^2 : i \ne j\}}{\ln\left( n(n-1)/\theta^2 \right)}$

as the bandwidth value.

We call using $\bar{s}$ as the bandwidth the mean criterion for bandwidth selection, and we call using $\tilde{s}$ as the bandwidth the median criterion for bandwidth selection.

The mean criterion bandwidth can be computed very quickly.

Let $\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_{ik}$ denote the column means, and let

$\sigma_k^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ik} - \mu_k)^2$

denote the column variances. Then it is easy to show that $\bar{D}^2 = \frac{2n}{n-1} \sum_{k=1}^{m} \sigma_k^2$, so the bandwidth suggested by the mean criterion is

$\bar{s}^2 = \frac{2n}{n-1} \cdot \frac{\sum_{k=1}^{m} \sigma_k^2}{\ln\left( n(n-1)/\theta^2 \right)}$ (22)

To see this, note the following:

$\sum_{i \ne j} \|x_i - x_j\|^2 = \sum_{k=1}^{m} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{ik} - x_{jk})^2 = \sum_{k=1}^{m} 2n \sum_{i=1}^{n} (x_{ik} - \mu_k)^2 = 2n^2 \sum_{k=1}^{m} \sigma_k^2$

The result is immediate.

Because the column variances can be calculated in one pass over the data, the computation of the mean criterion is an $O(nm)$ algorithm.
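The one-pass shortcut can be verified directly. The sketch below computes the mean-criterion bandwidth from column variances in $O(nm)$ and checks it against the brute-force $O(n^2 m)$ mean of pairwise squared distances; the tolerance default `theta=1.0` is an arbitrary placeholder, not a value recommended by the paper:

```python
import math

def mean_criterion_bandwidth(X, theta=1.0):
    """O(nm) mean-criterion bandwidth from column variances."""
    n, m = len(X), len(X[0])
    var_sum = 0.0
    for k in range(m):
        col = [row[k] for row in X]
        mu = sum(col) / n
        var_sum += sum((v - mu) ** 2 for v in col) / n   # sigma_k^2 with 1/n normalization
    dbar2 = 2.0 * n / (n - 1) * var_sum                  # mean pairwise squared distance
    return math.sqrt(dbar2 / math.log(n * (n - 1) / theta ** 2))

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (1.5, 0.5)]
n = len(X)
# Brute-force O(n^2 m) check of the identity behind the shortcut.
direct = sum(sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
             for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
s_mean = mean_criterion_bandwidth(X)
assert abs(s_mean - math.sqrt(direct / math.log(n * (n - 1)))) < 1e-9
```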

The computation needed for the median criterion cannot be simplified in the same way; however, one can take a sample from the data and use the sample median of the pairwise squared distances as an approximation to the full median.
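A sketch of the median criterion, with an optional pair-sampling approximation, follows. The tolerance default `theta=1.0`, the sample size, and the fixed seed are illustrative choices, not values from the paper:

```python
import math
import random
import statistics

def median_criterion_bandwidth(X, theta=1.0, sample_pairs=None, rng=None):
    """Median-criterion bandwidth; optionally approximate the median from sampled pairs."""
    n = len(X)
    if sample_pairs is None:
        d2 = [sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
              for i in range(n) for j in range(i + 1, n)]
    else:
        rng = rng or random.Random(0)
        d2 = []
        for _ in range(sample_pairs):
            i, j = rng.sample(range(n), 2)
            d2.append(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    return math.sqrt(statistics.median(d2) / math.log(n * (n - 1) / theta ** 2))

# Pairwise squared distances of the points 0, 1, 3 are {1, 4, 9}; their median is 4.
s_exact = median_criterion_bandwidth([(0.0,), (1.0,), (3.0,)])
assert abs(s_exact - math.sqrt(4.0 / math.log(6.0))) < 1e-12
```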

II-B Training Data That Have Repeated Observations

We now consider the case where we have repeated observations in the training data set. Let $\{z_1, \dots, z_p\}$ be the set of distinct points in our training data set in $\mathbb{R}^m$, and assume that $z_i$ is repeated $n_i$ times, with $n = \sum_{i=1}^{p} n_i$. In this case, the kernel matrix $K_s$ is a square matrix of order $n$. The kernel matrix can be partitioned into $p^2$ blocks, where the $(i, j)$ block is a matrix of order $n_i \times n_j$ given by $K(z_i, z_j) \mathbf{1}_{n_i} \mathbf{1}_{n_j}^{T}$, where $\mathbf{1}_k$ is a column vector of $k$ ones. As $s \to 0$, $K_s$ converges to $B$, a block diagonal matrix with diagonal blocks $\mathbf{1}_{n_i} \mathbf{1}_{n_i}^{T}$ for $i = 1, \dots, p$. In this case, we similarly seek an $s$ such that

$\|K_s - B\|_F \ge \theta$ (23)

Define:
$w_i = n_i / n$,
$\mu_k = \sum_{i=1}^{p} w_i z_{ik}$, and
$\sigma_k^2 = \sum_{i=1}^{p} w_i (z_{ik} - \mu_k)^2$.

In a manner similar to the previous section, we have

$\|K_s - B\|_F^2 = \sum_{i \ne j} n_i n_j \exp\left( -\frac{\|z_i - z_j\|^2}{s^2} \right)$

Using Jensen’s inequality,

$\sum_{i \ne j} n_i n_j \exp\left( -\frac{\|z_i - z_j\|^2}{s^2} \right) \ge N \exp\left( -\frac{1}{N} \sum_{i \ne j} n_i n_j \frac{\|z_i - z_j\|^2}{s^2} \right), \quad N = n^2 - \sum_{i=1}^{p} n_i^2$

As in the previous section, the preceding bound can be simplified and expressed in terms of the weighted column variances: any $s$ that satisfies the following inequality also satisfies (23).

$s^2 \ge \frac{2n^2 \sum_{k=1}^{m} \sigma_k^2}{N \ln\left( N/\theta^2 \right)}$ (24)

As expected, the right-hand side of (24) reduces to (22) when every $n_i = 1$. Equation (24) is derived for completeness; it will not be used in the rest of this article. We will use (21) throughout this article.

II-C Choice of $\theta$

The mean and median criteria depend on the parameter $\theta$. For the mean and median criteria to be effective, there should be an easy way to choose the value of $\theta$. Otherwise we will simply have replaced the difficult problem of choosing a bandwidth with another difficult problem of choosing $\theta$. In our investigations, we found a single default setting of $\theta$ that works for most cases; unless explicitly stated otherwise, that default value of $\theta$ is used throughout this article.

III Evaluating the Mean Criterion

III-A Alternative Criteria

In [1], Aggarwal suggests a simple default setting for the Gaussian kernel bandwidth, which translates directly to a value of $s$ for the kernel parametrized as in (13). We call this the criterion of [1]. We compare the mean and median criteria with the criterion of [1]. In addition, we compare the mean and median criteria with the peak criterion (see [4]). Since the peak criterion performs better than the alternative criteria mentioned in [4], we omit the other criteria that are mentioned there.

Iii-B Choice of Data sets

The mean and median criteria depend on the distribution of pairwise distances of the training data set. These methods might not work well if the distribution of the pairwise distances is skewed. So it is important to check the performance of the mean and median criteria on data sets that have a skewed distribution of pairwise distances.

The distribution of pairwise distances in data sets where the data lie in distinct clusters is typically multimodal and skewed. See Figure 3(a) for a data set that has three distinct clusters; the histogram of pairwise distances in Figure 3(b) indicates a skewed and multimodal distribution.

For this reason, we check the performance of the different criteria both on “connected” data (that is, data without any clusters) and on data sets that contain two or more clusters.

IV Comparing the Criteria on Two-Dimensional Data

IV-A Two-Dimensional Connected Data

IV-A1 Data Description

In this section, we compare the performance of the mean criterion, the median criterion, the criterion of [1], and the peak criterion on selected two-dimensional data. These data sets are connected; that is, there are no clusters in the data. Because the data are two-dimensional, we can visually evaluate the quality of the results. To evaluate the results, we obtain the data description provided by the different bandwidths, and then we score the bounding rectangle of the data by dividing it into a grid. The inlier region that is obtained from scoring should closely match the training data. Figure 1 displays the results for banana-shaped data, and Figure 2 displays the results for star-shaped data.

(a) Scatter plot of banana-shaped data
(b) Mean criterion result
(c) Median criterion result
(d) Peak criterion result
(e) Criterion of [1] result
Fig. 1: Results for banana data. The darkly shaded region is the inlier region.
(a) Scatter plot of star data
(b) Mean criterion result
(c) Median criterion result
(d) Peak criterion result
(e) Criterion of [1] result
Fig. 2: Results for star data. The darkly shaded region is the inlier region.

IV-A2 Conclusion

The scoring results indicate that the bandwidth values computed using the mean and median criteria provide a data description of good quality. Such a description is close to the one obtained using the peak criterion. The criterion of [1] does not work well for these data sets.

IV-B Two-Dimensional Disconnected Data

IV-B1 Data Description

In this section, we compare the performance of the mean criterion, the median criterion, the criterion of [1], and the peak criterion on selected two-dimensional data that lie in different clusters. Selecting the bandwidth for such data is usually more difficult than estimating the bandwidth for connected data. Because the data are two-dimensional, we can visually evaluate the quality of the results. To evaluate the results, we obtain the data description that is provided by the different bandwidths, and then we score the bounding rectangle of the data by dividing it into a grid. The inlier region obtained from scoring should closely match the training data. The following data sets are used in this section:

  1. The three-clusters data, which consists of three clusters [9]. Figure 3 displays the data and the scoring results.

  2. The refrigerant data, which consists of four clusters [3]. Figure 4 displays the data and the scoring results.

  3. A simulated “two-donuts and a munchkin” data set, which consists of two donut-shaped regions and a spherical region. Figure 5 displays the data and the scoring results.

(a) Scatter plot of three-clusters data
(b) Histogram of pairwise distances
(c) Mean criterion result
(d) Median criterion result
(e) Peak criterion result
(f) Criterion of [1] result
Fig. 3: Results for the three-clusters data. The darkly shaded region is the inlier region.
(a) Scatterplot
(b) Histogram of pairwise distances
(c) Mean criterion result
(d) Median criterion result
(e) Peak criterion result
(f) Criterion of [1] result
Fig. 4: Results for the refrigerant data. The darkly shaded region is the inlier region.
(a) Scatter plot
(b) Histogram of pairwise distances
(c) Mean criterion result
(d) Median criterion result
(e) Peak criterion result
(f) Criterion of [1] result
Fig. 5: Results for the two-donuts and munchkin data. The darkly shaded region is the inlier region.

IV-B2 Conclusion

The scoring results indicate that the bandwidth values that are computed using the mean and median criteria provide a data description of reasonably good quality for the three-clusters data set and for the two-donuts and munchkin data set. Such a description is close to the one obtained using the peak criterion.

For the refrigerant data set, the peak criterion significantly outperforms the other methods. The data description obtained from the peak criterion can separate out all four clusters, whereas the other methods merge two clusters that lie close to each other. Although the mean and median criteria do not perform as well as the peak criterion, any point in the inlier region is close to the training data, and the area of the region that is misclassified is small compared to the bounding region of the training data. So the result is still very reasonable.

The criterion of [1] again performs poorly on all these data sets, and it is not considered as a candidate in the remaining sections.

V Comparing the Criteria on High-Dimensional Data

The $F_1$ score is a common measure of a binary classifier’s accuracy. It is defined as

$F_1 = \frac{2\,TP}{2\,TP + FN + FP}$

where $TP$, $FN$, and $FP$ stand for the number of true positives, the number of false negatives, and the number of false positives, respectively. Comparing the different criteria for high-dimensional data is much more difficult than comparing them for two-dimensional data. For two-dimensional data, the quality of the result can be easily judged by looking at the plot of the scoring results, but this is not possible for high-dimensional data. For the purpose of evaluation, we selected labeled high-dimensional data that have a dominant class. We used SVDD on a subset of the dominant class to obtain a description of the dominant class, and then we scored the rest of the data to evaluate the criteria. We expect the points in the scoring data set that correspond to the dominant class to be classified as inliers and all other points to be classified as outliers. Because the data are labeled, we can also use cross validation to determine the bandwidth that best describes the dominant class in the sense of maximizing a measure of fit, such as the $F_1$ score. So in this section we compare the bandwidths suggested by the different unsupervised criteria with the bandwidth obtained through cross validation for various benchmark data sets. The results are summarized in Table I. The benchmark data sets used for the analysis are described in Sections V-A through V-E.
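The $F_1$ computation used throughout this section is a one-liner; a minimal sketch:

```python
def f1_score(tp, fn, fp):
    """F1 = 2*TP / (2*TP + FN + FP)."""
    return 2.0 * tp / (2.0 * tp + fn + fp)

assert f1_score(50, 0, 0) == 1.0            # a perfect classifier scores 1
assert abs(f1_score(1, 1, 1) - 0.5) < 1e-12  # precision = recall = 0.5 gives F1 = 0.5
```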

Data | Dimension/Nobs | Max $(F_1)(s)$ | Peak $(F_1)(s)$ | Mean $(F_1)(s)$ | Median $(F_1)(s)$
Metal Etch | 20/96 | (0.8)(0.69) | (0.56)(0.46) | (0.42)(0.24) | (0.43)(0.19)
Shuttle | 9/2000 | (0.96)(17) | (0.96)(14) | (0.95)(11) | (0.84)(5.75)
Spam | 57/1500 | (0.63)(50) | (0.63)(65) | (0.42)(0.24) | (0.43)(0.19)
Tennessee Eastman | 41/2000 | (0.19)(17) | (0.16)(8) | (0.14)(7.22) | (0.135)(6.04)
Intrusion | 45/11490 | (0.95)(7050) | (0.89)(5060) | (0.95)(9667) | (0.95)(356)
TABLE I: Results for High-Dimensional Data

In Table I, the "Max" column contains the $F_1$ score and the bandwidth obtained from cross validation, and the remaining columns contain the $F_1$ scores and the bandwidths suggested by the peak, mean, and median criteria, respectively.

Caveat: SVDD is a geometric classifier, so using labels in this manner is useful only if they geometrically separate the data. If the labels actually separate the data geometrically, the bandwidth obtained from cross validation will lead to a high $F_1$ score.

We now describe the data sets mentioned in Table I.

V-A Metal Etch Data

This data set consists of 20 process variables, an ID variable, and a timestamp variable from a metal wafer etcher. The data consist of measurements from 108 normal wafers and 21 faulty wafers. The training data set contains half of all the normal wafers, and the scoring data set contains the remaining observations. The data set used in this analysis is explained in [15] and can be obtained from [6].

V-B Shuttle Data

This data set consists of measurements made on a shuttle. The data set contains nine numeric attributes and one class attribute. Out of 58,000 total observations, 80% of the observations belong to class 1. A random sample of 2,000 observations belonging to class 1 was selected for training, and the remaining 56,000 observations were used for scoring. This data set is from the UC Irvine Machine Learning Repository [5].

V-C Spam Data

The spam data set consists of emails that were classified as spam or non-spam. Each record corresponds to an individual email. The total number of attributes is 57. Most attributes are frequencies of specific words. Training is performed using a subset of the non-spam observations. The remaining observations, which include both spam and non-spam emails, were used for scoring. This data set is from the UC Irvine Machine Learning Repository [5].

V-D Tennessee Eastman

The data set was generated using the MATLAB simulation code, which provides a model of an industrial chemical process. The data were generated for normal operations of the process and twenty faulty processes. Each observation consists of 41 variables, out of which 22 were measured continuously every 6 seconds on average and the remaining 19 were sampled at a specified interval of every 0.1 or 0.25 hours. From the simulated data, we created an analysis data set that uses the normal operations data of the first 90 minutes and data that correspond to faults 1 through 20. A data set that contains observations of normal operations was used for training. Scoring was performed to determine whether the model could accurately classify an observation as belonging to normal operation of the process. The MATLAB simulation code is available at [7].

V-E Intrusion Data

This data set contains multivariate data that characterize cyber attacks. It contains 45 attributes which include type of service, number of source bytes, number of failed logins, and number of files created. Out of the 24,156 observations, 22,981 were labeled no attack and 1,175 were labeled attack. Half of the no attack observations were used as training data and the remaining observations were used as scoring data. This data set can be obtained from [11].

V-F Conclusion

The bandwidths suggested by the mean criterion are similar to the bandwidth suggested by the median and peak criteria for many data sets. This similarity makes the mean criterion an attractive bandwidth selection criterion because it can be computed very quickly.

VI Simulation Study on Random Polygons

VI-A Design

In this section, we conduct a simulation study to compare the different bandwidth selection methods. The simulation study consists of training SVDD on randomly generated polygons. Given the number of vertices $V$, we first generate the vertices of the polygon in counterclockwise order as $(r_i \cos \theta_i, r_i \sin \theta_i)$, $i = 1, \dots, V$, where $\theta_1 \le \dots \le \theta_V$ are the order statistics of an iid uniform sample from the interval $[0, 2\pi)$ and the radii $r_i$ are uniformly chosen from an interval $[r_{\min}, r_{\max}]$.

We then sample uniformly from the interior of the polygon and compute the bandwidths that are suggested by the mean, median, and peak criteria. Because it is easy to determine whether a point actually lies inside a particular polygon, we can also use cross validation to determine the best bandwidth parameter. To do so, we divide the bounding rectangle of the polygon into a grid and label each point in the grid as an inside or outside point, depending on whether that point is inside or outside the polygon. We can choose the best bandwidth as the one that classifies the grid points as inside or outside points with the highest $F_1$ score. Figure 6 shows a typical polygon and the data that are generated from the polygon for fitting SVDD.

In our simulation study, we fix the interval $[r_{\min}, r_{\max}]$ from which the radii are drawn, and we generate polygons whose number of vertices varies from 5 to 30. For each vertex count, we generate 20 polygons. For each such polygon, we create a data set that consists of 600 points sampled from the interior of the polygon, and we use this sample to obtain the bandwidths given by cross validation and by the mean and median criteria. For each such polygon, we have the $F_1$ score $F_1^{*}$ that corresponds to the cross-validation bandwidth and the $F_1$ scores $F_1^{\text{mean}}$ and $F_1^{\text{median}}$ that correspond to the mean and median criteria. $F_1^{*}$ is the best possible $F_1$ score that can be attained by any bandwidth. At the end of the simulation we have a collection of “$F_1$ score ratios,” $F_1^{\text{mean}}/F_1^{*}$ and $F_1^{\text{median}}/F_1^{*}$, one for each polygon used in the simulation. If most of these values are close to $1$, this indicates that the bandwidths suggested by the mean and median criteria are competitive with the bandwidth that maximizes the $F_1$ score.
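The polygon generator described above can be sketched as follows. The radius interval defaults and the fixed seed are illustrative choices, not values from the paper:

```python
import math
import random

def random_polygon(num_vertices, r_min=0.5, r_max=1.0, rng=None):
    """Vertices in counterclockwise order: angles are the order statistics of an
    iid uniform sample on [0, 2*pi); radii are drawn uniformly from [r_min, r_max]."""
    rng = rng or random.Random(0)
    angles = sorted(rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_vertices))
    vertices = []
    for t in angles:
        r = rng.uniform(r_min, r_max)
        vertices.append((r * math.cos(t), r * math.sin(t)))
    return vertices

poly = random_polygon(7)
assert len(poly) == 7
# Every vertex lies in the annulus defined by the radius interval.
assert all(0.5 <= math.hypot(x, y) <= 1.0 for x, y in poly)
```

Interior points for the training sample can then be drawn by rejection sampling from the polygon's bounding rectangle.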

VI-B Results

The box-and-whisker plots in Figure 7 summarize the simulation study results. The X axis shows the number of vertices of the polygon, and the Y axis shows the distribution of the $F_1$ score ratios. The bottom and top of each box show the first and third quartile values, and the ends of the whiskers represent the minimum and maximum values of the $F_1$ score ratio. The diamond indicates the mean value, and the horizontal line in the box indicates the second quartile (the median). The plots show that the mean ratio is greater than 0.8 across all numbers of vertices, and the $F_1$ score ratios in the top three quartiles are greater than 0.9 across all numbers of vertices. As the complexity of the polygon increases with the number of vertices, the spread of the $F_1$ score ratios also increases.

(a) True polygon
(b) Sampled data
Fig. 6: Simulation study with random polygons
(a) Mean criterion
(b) Median criterion
Fig. 7: Simulation study results

VI-C Conclusion

The fact that the $F_1$ score ratios are always close to 1 suggests that the mean and median criteria generalize across different training data sets. However, a similar simulation performed for the peak criterion in [4] shows that the distribution of $F_1$ score ratios for the peak criterion is even more concentrated around 1, and that the minimum values of the $F_1$ score ratios are much higher than those for the mean and median criteria. This shows that the peak criterion generalizes better than the mean and median criteria.

VII Conclusion and Future Work

We proposed two new bandwidth selection criteria, the mean criterion and the median criterion, for the Gaussian kernel in SVDD training. The proposed criteria give results that are similar to those of the peak criterion for many data sets, and hence they are a good starting point for determining a suitable bandwidth for a particular data set. The suggested criteria might not be the most appropriate for data sets where the distribution of pairwise distances is highly skewed, and more research is needed to determine an appropriate bandwidth for such cases.

References