The Trace Criterion for Kernel Bandwidth Selection for Support Vector Data Description

Support vector data description (SVDD) is a popular anomaly detection technique. The SVDD classifier partitions the whole data space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, for which the Gaussian kernel is a common choice. The Gaussian kernel has a bandwidth parameter, and it is important to set the value of this parameter correctly for good results. A small bandwidth leads to overfitting such that the resulting SVDD classifier overestimates the number of anomalies, whereas a large bandwidth leads to underfitting and an inability to detect many anomalies. In this paper, we present a new unsupervised method for selecting the Gaussian kernel bandwidth. Our method, which exploits the low-rank representation of the kernel matrix to suggest a kernel bandwidth value, is competitive with existing bandwidth selection methods.


I Introduction

Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and anomaly detection. First introduced by Tax and Duin [11], SVDD has a mathematical formulation that is almost identical to that of the one-class variant of support vector machines: one-class support vector machines (OCSVM), which are attributed to Schölkopf et al. [9]. The use of SVDD is popular in domains in which the majority of data belongs to a single class and no distributional assumptions can be made. For example, SVDD is useful for analyzing sensor readings from reliable equipment for which almost all the readings describe the equipment's normal state of operation.

Like other one-class classifiers, SVDD provides a geometric description of the observed data. The SVDD classifier assigns a distance to each point in the domain space; the distance measures the separation of that point from the training data. During scoring, any observation found to be at a large distance from the training data might be an anomaly, and the user might choose to generate an alert.

Several researchers have proposed using SVDD for multivariate process control [10, 2]. Other applications of SVDD involve monitoring condition of machines [12, 14] and image classification [8].

I-A Mathematical Formulation

In this section, we describe the mathematical formulation of SVDD; the description is based on [11].
Normal Data Description:
The SVDD model for normal data description builds a hypersphere that contains most of the data within a small radius. Given observations $x_1, \ldots, x_n \in \mathbb{R}^m$, we need to solve the following optimization problem to obtain the SVDD data description.
Primal Form:
Objective:

$\min \; R^2 + C \sum_{i=1}^{n} \xi_i$   (1)

subject to:

$\|x_i - a\|^2 \le R^2 + \xi_i, \quad i = 1, \ldots, n$   (2)
$\xi_i \ge 0, \quad i = 1, \ldots, n$   (3)

where:
$x_i \in \mathbb{R}^m$, $i = 1, \ldots, n$, represent the training data,
$R$ is the radius and represents a decision variable,
$\xi_i$ is the slack for each variable,
$a$ is the center (a decision variable),
$C = \frac{1}{nf}$ is the penalty constant that controls the trade-off between the volume and the errors, and
$f$ is the expected outlier fraction.
 
Dual Form:
The dual formulation is obtained using Lagrange multipliers.
Objective:

$\max \; \sum_{i=1}^{n} \alpha_i (x_i \cdot x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$   (4)

subject to:

$\sum_{i=1}^{n} \alpha_i = 1$   (5)
$0 \le \alpha_i \le C, \quad i = 1, \ldots, n$   (6)

where $\alpha_i$ are the Lagrange constants, and $C = \frac{1}{nf}$ is the penalty constant.

Duality Information:
The position of an observation $x_i$ is connected to the optimal $\alpha_i$, the radius $R$ of the sphere, and the center $a$ of the sphere in the following manner:

Center position:

$a = \sum_{i=1}^{n} \alpha_i x_i$   (7)

Inside position:

$\|x_i - a\| < R \;\Rightarrow\; \alpha_i = 0$   (8)

Boundary position:

$\|x_i - a\| = R \;\Rightarrow\; 0 < \alpha_i < C$   (9)

Outside position:

$\|x_i - a\| > R \;\Rightarrow\; \alpha_i = C$   (10)

Any $x_i$ for which the corresponding $\alpha_i > 0$ is known as a support vector.

Let $SV_{<C}$ denote the set $\{x_k : 0 < \alpha_k < C\}$. Then the radius of the hypersphere is calculated as follows for any $x_k \in SV_{<C}$:

$R^2 = (x_k \cdot x_k) - 2\sum_{i=1}^{n} \alpha_i (x_i \cdot x_k) + \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$   (11)

The value of $R^2$ does not depend on the choice of $x_k \in SV_{<C}$.
Scoring:

For any point $z \in \mathbb{R}^m$, the distance $\mathrm{dist}^2(z)$ is calculated as follows:

$\mathrm{dist}^2(z) = (z \cdot z) - 2\sum_{i=1}^{n} \alpha_i (x_i \cdot z) + \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$   (12)

Points whose $\mathrm{dist}^2(z) > R^2$ are designated as outliers.

The spherical data boundary can include a significant amount of space that has a sparse distribution of training observations. Using this model to score can lead to a lot of false positives. Hence, instead of a spherical shape, a compact bounded outline around the data is often desired. Such an outline should approximate the shape of the single-class training data. This is possible by using kernel functions.

Flexible Data Description:

The support vector data description is made flexible by replacing the inner product $(x_i \cdot x_j)$ with a suitable kernel function $K(x_i, x_j)$. The Gaussian kernel function used in this paper is defined as

$K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2s^2} \right)$   (13)

where $s$ is the Gaussian bandwidth parameter.

The modified mathematical formulation of SVDD with a kernel function is as follows:

Objective:

$\max \; \sum_{i=1}^{n} \alpha_i K(x_i, x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$   (14)

subject to:

$\sum_{i=1}^{n} \alpha_i = 1$   (15)
$0 \le \alpha_i \le C, \quad i = 1, \ldots, n$   (16)

In perfect analogy with the previous section, any $x_i$ for which $\alpha_i = 0$ is an inside point, and any $x_i$ for which $\alpha_i > 0$ is called a support vector.

$SV_{<C}$ is similarly defined as $\{x_k : 0 < \alpha_k < C\}$, and the threshold $R^2$ is calculated as follows for any $x_k \in SV_{<C}$:

$R^2 = K(x_k, x_k) - 2\sum_{i=1}^{n} \alpha_i K(x_i, x_k) + \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$   (17)

The value of $R^2$ does not depend on which $x_k \in SV_{<C}$ is used.
Scoring: For any observation $z$, the distance $\mathrm{dist}^2(z)$ is calculated as follows:

$\mathrm{dist}^2(z) = K(z, z) - 2\sum_{i=1}^{n} \alpha_i K(x_i, z) + \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$   (18)

Any point $z$ for which $\mathrm{dist}^2(z) > R^2$ is designated as an outlier.
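The Gaussian-kernel formulation can be exercised with off-the-shelf tools. The sketch below is illustrative, not the authors' code: it relies on the known equivalence between Gaussian-kernel SVDD and OCSVM, so scikit-learn's OneClassSVM (with $\nu = f$ and $\gamma = 1/(2s^2)$) stands in for an SVDD solver; the data set and bandwidth value are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))   # toy single-class training data

s = 1.5                               # Gaussian bandwidth (illustrative choice)
f = 0.05                              # expected outlier fraction, so C = 1/(n f)

# For the Gaussian kernel, the SVDD and OCSVM solutions coincide,
# so OneClassSVM can stand in for an SVDD solver.
clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2 * s**2), nu=f)
clf.fit(X_train)

# Scoring: predict() returns +1 for inliers (dist^2 <= R^2) and -1 for outliers.
X_score = np.array([[0.0, 0.0], [8.0, 8.0]])
labels = clf.predict(X_score)
print(labels)
```

A point near the bulk of the training data scores as an inlier, while a point far away scores as an outlier.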

I-B Importance of the Kernel Bandwidth Value

In practice, SVDD is almost always computed by using the Gaussian kernel function, and it is important to set the value of the bandwidth parameter $s$ correctly. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to underfitting, and the classifier fails to detect many anomalies.
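The overfitting effect is easy to reproduce. The snippet below is a hedged illustration (again using the Gaussian-kernel OCSVM as an SVDD stand-in; the data and bandwidth values are arbitrary): a very small bandwidth flags almost every fresh in-distribution point as an anomaly, whereas a moderate bandwidth flags roughly the expected fraction.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 2))
X_fresh = rng.normal(size=(500, 2))   # new points from the same distribution

def flagged_fraction(s):
    # Fraction of fresh in-distribution points scored as outliers.
    clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2 * s**2), nu=0.05).fit(X_train)
    return float((clf.predict(X_fresh) == -1).mean())

small, moderate = flagged_fraction(0.01), flagged_fraction(1.0)
print(small, moderate)   # the tiny bandwidth flags far more "anomalies"
```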

Because SVDD is an unsupervised learning technique, it is desirable to have an automatic, unsupervised bandwidth selection technique that does not depend on labeled data that separate the inliers from the outliers. In [4], Kakde et al. present the peak criterion, which is an unsupervised bandwidth selection technique, and show that it performs better than alternative unsupervised methods. However, determining the bandwidth that is suggested by the peak criterion requires that the SVDD solution be computed multiple times for the training data, for a list of bandwidth values that lie on a grid. Even though sampling techniques can speed up the computation (see [7]), this method is still expensive. Moreover, it is also necessary to initiate the grid search at a good starting value in order to avoid unnecessary computation, and it is not immediately obvious what a good starting value is.

In [5], Liao et al. present the modified mean criterion for bandwidth selection. The suggested bandwidth has a closed-form expression in terms of the training data, and it can be computed very quickly. SVDD that is trained by using the modified mean criterion bandwidth is reasonably accurate for many data sets; however, it is less accurate than the peak criterion in general.

In this paper, we introduce the trace criterion for Gaussian bandwidth selection. The computation of the trace criterion consists of finding the inflection point of a smooth function of the bandwidth parameter. The computation terminates quickly when standard nonlinear optimization methods such as Newton-Raphson are used. This method is efficient for moderately large data sets.

Our results show that the trace criterion is competitive with the peak criterion for many data sets. Simulation studies suggest that this method is more accurate than the mean criterion for a certain class of high-dimensional data.

These properties make the trace criterion method a good bandwidth selection technique. However, unsupervised bandwidth tuning is an extremely difficult problem, so it is quite possible that there is a class of data sets for which the trace criterion does not give good results.

The rest of the paper is organized as follows. Section II defines the trace criterion for bandwidth tuning, and the remaining sections compare the mean, peak, and trace criteria with each other.

II The Trace Criterion for Bandwidth Selection

II-A Parameter tuning using inflection points

Parameter tuning for unsupervised learning methods such as clustering and one-class classification can be difficult when external validation data are not available, as is quite frequently the case. A popular method for parameter tuning in such cases is to look at the values of a "validation measure" as a function of the hyperparameter of interest and choose the value of the hyperparameter where the function has an inflection point [1]. For a specific example, consider the problem of determining the number of clusters $k$ for $k$-means: the sum of the squared distances of each point in the training data from the cluster center closest to it can be taken as a validation measure. If the number of clusters is held fixed, then a lower value of this sum indicates better clustering. Let $W_k$ denote the value of the sum for the clustering that is suggested by $k$-means with $k$ clusters. As $k$ increases, $W_k$ decreases, so you cannot choose the number of clusters as $\arg\min_k W_k$, because that would suggest as many clusters as the number of points in the training data. However, it is observed that the value of $k$ at which the function $k \mapsto W_k$ has an inflection point is quite frequently a good choice for the number of clusters. The inflection point of other validation measures, such as the silhouette coefficient, is also used to determine the number of clusters [1].
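This heuristic can be sketched concretely (an illustration of ours, not from the paper; the blob data, the range of $k$, and the crude second-difference elbow locator are all choices made for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three well-separated Gaussian blobs, so the elbow should appear near k = 3.
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

ks = range(1, 9)
# W_k: within-cluster sum of squares (KMeans exposes it as inertia_).
W = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Crude elbow locator: k with the largest discrete second difference of W_k.
# np.diff(W, n=2)[j] corresponds to curvature at k = j + 2.
elbow = int(np.argmax(np.diff(W, n=2))) + 2
print(elbow)
```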

We will now propose a validation measure for SVDD whose inflection point provides our suggested bandwidth.

II-B Validation measure for trace criterion

Assume that we have a training data set that consists of $n$ distinct points $x_1, \ldots, x_n$ in $\mathbb{R}^m$ and that we want to determine a good kernel bandwidth value for training on this data set. In kernel methods, including SVDD, the objective function to be optimized depends on how the data are transformed through the kernel matrix $K = [K(x_i, x_j)]_{i,j=1}^{n}$. Since this matrix has $n^2$ elements, where $n$ is the number of observations, it is impossible to work with the entire kernel matrix even for moderate values of $n$. The Gaussian kernel matrix is always positive semidefinite, and it typically has a rapidly decaying spectrum. The rapidly decaying spectrum can be exploited to create a low-rank positive semidefinite approximation $\tilde{K}$ of $K$ by replacing all the eigenvalues in the spectral decomposition of $K$ below a certain threshold by $0$ [13]. $\tilde{K}$ has a square root of low rank; that is, $\tilde{K} = \tilde{L}\tilde{L}^T$, where $\tilde{L}$ is an $n \times r$ matrix with $r \ll n$. Since $\tilde{K} \approx K$, $\tilde{L}$ can be considered an approximate low-rank square root of $K$. In many cases, computation of expressions that involve $K$ can be made tractable by replacing $K$ with $\tilde{L}\tilde{L}^T$. However, it is not feasible to compute $\tilde{L}$ by using the eigendecomposition of $K$ when $n$ is large. The Nyström methods form a class of popular methods to compute a low-rank representation of $K$ even when $n$ is large. We use a variant of the Nyström method as described in [15] to construct a validation measure whose inflection point will be suggested as the bandwidth.
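The rapid spectral decay is easy to verify numerically. The following sketch (synthetic data; the truncation threshold is our arbitrary choice) builds $\tilde{K}$ by zeroing the small eigenvalues and recovers its low-rank square root $\tilde{L}$:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 2))
s = 1.0

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * s**2))        # Gaussian kernel matrix (positive semidefinite)

w, V = np.linalg.eigh(K)            # spectral decomposition; the spectrum decays fast
keep = w > 1e-6 * w.max()           # drop tiny eigenvalues (threshold is illustrative)
K_tilde = (V[:, keep] * w[keep]) @ V[:, keep].T
L = V[:, keep] * np.sqrt(w[keep])   # low-rank square root: K_tilde = L @ L.T

print(keep.sum(), np.abs(K - K_tilde).max())  # rank << n, approximation error tiny
```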

Given data in $\mathbb{R}^m$ and an integer $r$, let $z_1, \ldots, z_r$ be distinct landmark points in $\mathbb{R}^m$. The landmark points are chosen so that they are evenly distributed within the training data. Following [15], we choose $z_1, \ldots, z_r$ as the cluster centers that are obtained from a $k$-means clustering of the training data with $r$ clusters.

From the theory of reproducing kernel Hilbert spaces (RKHS), we know that for any bandwidth $s > 0$ there is a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathbb{R}^m \to \mathcal{H}$ such that for any $x, y \in \mathbb{R}^m$ we have $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product on $\mathcal{H}$ and $\|\cdot\|$ denotes the norm on $\mathcal{H}$.

Let $V$ denote the subspace of $\mathcal{H}$ spanned by $\Phi(z_1), \ldots, \Phi(z_r)$. Note that $V$ is a finite-dimensional, and hence closed, subspace of $\mathcal{H}$ of dimension exactly $r$. Given $x \in \mathbb{R}^m$, let $P(x)$ denote the projection of $\Phi(x)$ on $V$.

In the Nyström methods, the kernel matrix $K$ is approximated by $[\langle P(x_i), P(x_j) \rangle]_{i,j=1}^{n}$, a rank-$r$ matrix that has a low-dimensional square root that can be easily computed.

We use the projected values for a different purpose: we use the discrepancy between $\Phi(x)$ and $P(x)$ for $x$ in the training data to come up with a validation measure.

From usual least squares arguments, we know that for any $x \in \mathbb{R}^m$, we have $P(x) = \sum_{j=1}^{r} \beta_j \Phi(z_j)$, where $\beta_1, \ldots, \beta_r$ are such that $\left\| \Phi(x) - \sum_{j=1}^{r} \beta_j \Phi(z_j) \right\|$ is minimized.

This implies the normal equations,

$K_z \beta = \kappa(x),$

where $K_z = [K(z_i, z_j)]_{i,j=1}^{r}$, $\kappa(x) = (K(z_1, x), \ldots, K(z_r, x))^T$, and $\beta = (\beta_1, \ldots, \beta_r)^T$, and implies that $\beta$ can be explicitly computed as

$\beta = K_z^{-1} \kappa(x).$

Since $P(x)$ and $\Phi(x) - P(x)$ are orthogonal, the squared norm of the residual is given by

$\|\Phi(x) - P(x)\|^2 = \|\Phi(x)\|^2 - \|P(x)\|^2 = K(x, x) - \kappa(x)^T K_z^{-1} \kappa(x).$

This shows that $\|P(x)\|^2 \le \|\Phi(x)\|^2 = K(x, x) = 1$. Let $p(x) = \|P(x)\|^2$; the closer $p(x)$ is to $1$, the lower the residual error is. From the preceding expression for the residual, you can see that

$p(x) = \kappa(x)^T K_z^{-1} \kappa(x).$


So, given training data $x_1, \ldots, x_n$, if we define

$T(s) = \sum_{i=1}^{n} p(x_i) = \sum_{i=1}^{n} \kappa(x_i)^T K_z^{-1} \kappa(x_i),$

then $T(s)$ is a measure of the loss of accuracy due to projection into $V$, where $p(x_i)$ lies between $0$ and $1$ for each $i$. Higher values of $T(s)$ indicate lower loss in precision due to projection into $V$. In particular, if $T(s) = n$, there is absolutely no loss of precision; that is, $P(x_i) = \Phi(x_i)$ for all $i$.
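These projection quantities can be checked numerically. The sketch below (synthetic data, not from the paper) computes $p(x_i) = \kappa(x_i)^T K_z^{-1} \kappa(x_i)$ with a linear solve and verifies that $p(x) \le 1$, with equality when $x$ is itself a landmark:

```python
import numpy as np

def gauss(X, Y, s):
    # Gaussian kernel matrix: entry (i, j) is exp(-||X_i - Y_j||^2 / (2 s^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

rng = np.random.default_rng(3)
Z = rng.normal(size=(10, 3))   # landmark points z_1, ..., z_r
X = rng.normal(size=(50, 3))   # training points x_1, ..., x_n
s = 1.0

Kz = gauss(Z, Z, s)            # r x r landmark kernel matrix
kap = gauss(Z, X, s)           # column i holds kappa(x_i)
# p(x_i) = kappa(x_i)^T Kz^{-1} kappa(x_i), one value per training point
p = np.einsum("ji,ji->i", kap, np.linalg.solve(Kz, kap))

print(p.min(), p.max())        # all values lie in (0, 1]
```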

It is empirically observed that $T(s)$ typically increases as $s$ increases (that is, $T'(s) \ge 0$). Moreover, $T'(s)$ is unimodal: it initially increases until it reaches its maximum value, and then decreases. So $T(s)$ typically has a well-defined inflection point, where $T'(s)$ takes its maximum value. See Figure 1 for the graph of a typical plot of $T(s)$ and $T'(s)$.

The trace criterion suggests using $\arg\max_{s} T'(s)$ as the bandwidth, which empirical observations suggest coincides with the inflection point of $T(s)$ in most cases.

Even though $T(s)$ is a scalar function, it is defined in terms of matrix operations, and we can come up with an explicit closed form for $T'(s)$ by using the usual rules of matrix differential calculus [6].

Let $K_z = [K(z_i, z_j)]_{i,j=1}^{r}$ denote the kernel matrix of the representative points $z_1, \ldots, z_r$. Its element-by-element derivative with respect to $s$ is given by the matrix

$K_z' = \left[ \frac{\|z_i - z_j\|^2}{s^3} K(z_i, z_j) \right]_{i,j=1}^{r}.$

For $x \in \mathbb{R}^m$, let

$\kappa(x) = (K(z_1, x), \ldots, K(z_r, x))^T$

denote a column vector. Then its element-by-element derivative is given by the column vector

$\kappa'(x) = \left( \frac{\|z_1 - x\|^2}{s^3} K(z_1, x), \ldots, \frac{\|z_r - x\|^2}{s^3} K(z_r, x) \right)^T.$

Then we have

$T(s) = \sum_{i=1}^{n} \kappa(x_i)^T K_z^{-1} \kappa(x_i)$, and hence

$T'(s) = \sum_{i=1}^{n} \left[ 2\, \kappa'(x_i)^T K_z^{-1} \kappa(x_i) - \kappa(x_i)^T K_z^{-1} K_z' K_z^{-1} \kappa(x_i) \right].$

Let $u_i = K_z^{-1} \kappa(x_i)$. Then the preceding equation simplifies to

$T'(s) = \sum_{i=1}^{n} \left[ 2\, \kappa'(x_i)^T u_i - u_i^T K_z' u_i \right].$   (19)

From (19), it is apparent that $T'(s)$ is differentiable on $(0, \infty)$, so we can use standard nonlinear optimization methods such as Newton-Raphson to maximize it.

The cost of evaluating $T'(s)$ is $O(nr^2 + r^3)$, which is tractable because $r \ll n$.

We have to choose the number of representative points, , for applying the trace criterion. We have observed that gives a good result for most data sets, and we have used the trace criterion with throughout this paper unless explicitly stated otherwise.
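A minimal end-to-end sketch of the criterion follows. It is an illustration under our own simplifications, not the paper's implementation: the landmark count $r$, the search grid, the small ridge added to $K_z$ for numerical stability, and the use of finite differences in place of Newton-Raphson are all choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def gauss(X, Y, s):
    # Gaussian kernel matrix with bandwidth s.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def T(X, Z, s):
    # Validation measure T(s) = sum_i kappa(x_i)^T Kz^{-1} kappa(x_i).
    Kz = gauss(Z, Z, s) + 1e-10 * np.eye(len(Z))  # tiny ridge for stability
    kap = gauss(Z, X, s)
    return np.einsum("ji,ji->i", kap, np.linalg.solve(Kz, kap)).sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
r = 20   # number of landmarks (illustrative choice)
Z = KMeans(n_clusters=r, n_init=10, random_state=0).fit(X).cluster_centers_

# Locate the inflection point of T by maximizing a finite-difference T'(s).
grid = np.linspace(0.05, 5.0, 200)
Tvals = np.array([T(X, Z, s) for s in grid])
dT = np.gradient(Tvals, grid)
s_trace = grid[np.argmax(dT)]   # bandwidth suggested by the criterion
print(s_trace)
```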

Fig. 1: Plot of a validation function and its derivative

III Evaluating the Trace Criterion

In this section, we use different data sets to evaluate the trace criterion and compare its performance with that of the modified mean [5] and peak [4] criteria. The evaluation is based on a series of data sets: starting from simple two-dimensional connected data, we gradually move toward more complex high-dimensional disconnected data. For our evaluation, we used both real-life and simulated data.
We observed that for the two-dimensional connected data and for the real-life data sets that are used in this evaluation, the performance of the modified mean and trace criteria is comparable. When the data were two-dimensional and disconnected (meaning that the data contained multiple disjoint clusters), the bandwidth value that was computed by using the trace criterion provided a much better data description than the modified mean criterion did (see the results in Figures 4 and 5).
We also evaluated the trace criterion by using high-dimensional hyperspheres and hypercubes. The training data sets consisted of 1 to 12 disjoint hyperspheres or hypercubes, and the dimension of the data varied from 5 to 40 in increments of 5. The scoring data used for evaluation consisted of observations from inside one or more hypercubes or hyperspheres, and observations that are just outside the training data, in close proximity to it, as seen in Figures 8, 10, 12, and 13. For this set of evaluations, we observed that the bandwidth value provided by the trace criterion produced markedly better performance than the modified mean criterion.

III-A Choice of Data Sets

We evaluate the trace criterion by using the following four types of data sets:

  • Connected two-dimensional data: These are data that have two variables and no clusters.

  • Disconnected two-dimensional data: These are data that have two variables and two or more clusters.

  • Higher-dimensional data: These refer to real-life data that have more than two variables.

  • High-dimensional simulated data: These data consist of one or more hyperspheres or hypercubes.

IV Comparison Using Two-Dimensional Connected Data

IV-A Data Description

This section compares the trace criterion with the peak and modified mean criteria by using two-dimensional connected data. Such data have no clusters. We use star-shaped data and banana-shaped data, and we train an SVDD model by using the bandwidth values obtained by the peak, modified mean, and trace criteria. To evaluate the results, we scored the bounding rectangle of the data by dividing it into a 200 × 200 grid. With a good bandwidth value, the inlier region that is obtained from scoring should match the geometry of the training data. Figures 2 and 3 display the results, along with the bandwidth values.
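The grid-scoring procedure can be sketched as follows (an illustration of ours: a noisy ring stands in for the banana- or star-shaped data, and the bandwidth and outlier fraction are arbitrary):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
# Stand-in for the connected training data: a noisy ring.
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.1, size=(300, 2))

s = 0.3  # bandwidth under evaluation (illustrative)
clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2 * s**2), nu=0.05).fit(X)

# Score the bounding rectangle of the data on a 200 x 200 grid.
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 200)
ys = np.linspace(X[:, 1].min(), X[:, 1].max(), 200)
gx, gy = np.meshgrid(xs, ys)
grid = np.c_[gx.ravel(), gy.ravel()]
inlier = clf.predict(grid).reshape(200, 200)  # +1 = inlier region, -1 = outlier
frac = (inlier == 1).mean()
print(frac)  # fraction of the rectangle classified as inlier
```

With a good bandwidth, the +1 region of this grid traces the shape of the ring rather than filling the whole rectangle.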

IV-B Scoring Results

The scoring results indicate that the bandwidth values obtained by using the modified mean and trace criteria provide a good-quality data description. The bandwidth value that is obtained by using the modified mean criterion is comparable to the one obtained by using the peak criterion.

(a) Training Data (b) Peak
(c) Modified Mean (d) Trace
Fig. 2: Results for two-dimensional connected data (Banana-shaped data)
(a) Training Data (b) Peak
(c) Modified Mean (d) Trace
Fig. 3: Results for Two-Dimensional Connected Data (Star-shaped Data)

V Comparison Using Two-Dimensional Disconnected Data

V-A Data Description

This section compares the trace criterion with the peak and modified mean criteria by using two-dimensional data that lie across multiple clusters. We have observed that computing a good bandwidth value for such data is more difficult than for connected data. Since the data are two-dimensional, we can visually judge the quality of the results. To evaluate the results, we scored the bounding rectangle of the data by dividing it into a 200 × 200 grid. With a good bandwidth value, the inlier region obtained from scoring should match the geometry of the training data. We use the following two data sets for the evaluation:

  • The refrigerant data which consist of four clusters [3]

  • A simulated “two-donut and a circle” data set, which consists of two donut-shaped clusters and one circular cluster

Figures 4 and 5 display the results, along with the bandwidth value.

V-B Scoring Results

The scoring results indicate that the bandwidth values obtained by using the peak and trace criteria provide a data description of reasonably good quality for both data sets. For the refrigerant data, the bandwidth values that are computed by the peak and trace criteria are close. The scoring results indicate that the data description that is obtained by using these values is able to separate all four clusters, whereas the description that is obtained by using the modified mean criterion bandwidth value merges the two clusters that lie close to each other. As indicated in Figure 4, the bandwidth value that is obtained by using the modified mean criterion is significantly larger than the one obtained by using the trace and peak criteria.

(a) Training Data (b) Peak
(c) Modified Mean (d) Trace
Fig. 4: Results for two-dimensional disconnected data (Refrigerant Data)
(a) Training Data (b) Peak
(c) Modified Mean (d) Trace
Fig. 5: Results for two-dimensional disconnected data (Two Donuts and a circle data)

VI Comparison Using High-Dimensional Data

Comparing the different criteria for high-dimensional data is much more difficult than comparing them for two-dimensional data. For two-dimensional data, the quality of the result can be easily judged by looking at a plot of the scoring results, but this is not possible for high-dimensional data. For the purpose of evaluation, we selected labeled high-dimensional data that have a dominant class. We used SVDD on a subset of the dominant class to obtain a description of the dominant class, and then we scored the rest of the data to evaluate the criteria. We expected the points in the scoring data set that correspond to the dominant class to be classified as inliers and all other points to be classified as outliers. Because the data are labeled, we could also use cross validation to determine the bandwidth that best describes the dominant class in the sense of maximizing a measure of fit, such as the $F_1$ score. So in this section we compare the bandwidths that are suggested by the different unsupervised criteria with the bandwidth that is obtained through cross validation for various benchmark data sets. The results are summarized in Table I. The benchmark data sets used for the analysis are described in Sections VI-A1 and VI-A2.

                 Shuttle        Tennessee Eastman
# Variables      9              41
# Observations   2,000          2,000
Max              0.96 (17)      0.19 (17)
Peak             0.96 (14)      0.16 (8)
Modified Mean    0.96 (17.2)    0.181 (11.37)
Trace            0.958 (13.1)   0.181 (11.22)
TABLE I: Results for high-dimensional data (measure value, with the corresponding bandwidth in parentheses)

VI-A Data Description

VI-A1 Shuttle Data

This data set consists of measurements made on a shuttle. The data set contains nine numeric attributes and one classification attribute. Of 58,000 total observations, 80% belong to class 1. A random sample of 2,000 observations belonging to class 1 was selected for training, and the remaining 56,000 observations were used for scoring. This data set is from the UC Irvine Machine Learning Repository.

VI-A2 Tennessee Eastman Data

The data set was generated by using MATLAB simulation code that provides a model of an industrial chemical process. The data were generated for normal operations of the process and for 20 faulty processes. Each observation consists of 41 variables, out of which 22 were measured continuously, every 6 seconds on average, and the remaining 19 were sampled at a specified interval of 0.1 or 0.25 hours. From the simulated data, we created an analysis data set that uses the normal operations data of the first 90 minutes and data that correspond to faults 1 through 20. A data set that contains observations of normal operations was used for training. Scoring was performed to determine whether the model could accurately classify an observation as belonging to normal operation of the process. The MATLAB simulation code is available at [7].

VI-B Scoring Results

The results outlined in Table I indicate that the measure values that were obtained by using all three criteria are equivalent and are in the neighborhood of the best value that can be obtained by cross validation.

VII Comparison Using Simulated Data

In this section, we present the results of a simulation study that we conducted to compare the performance of the trace criterion with that of the peak and modified mean criteria. The simulations were performed to generate training data with a known geometry. The data dimensions were varied between 2 and 10. We conducted three simulation studies; Table II provides details of these studies.

Simulation Data                                       Dimension                    # of data sets   Details
Two-dimensional polygons                              2                            600              Polygons with varying number of vertices and length of sides
Hypercubes                                            5 to 40 in increments of 5   400              Single hypercube used for training
Hyperspheres                                          5 to 40 in increments of 5   400              Single hypersphere used for training
High-dimensional disconnected data with hyperspheres  2 to 10                      880              Each training data set contains multiple disjoint hyperspheres, ranging from 2 to 12
High-dimensional disconnected data with hypercubes    2 to 10                      880              Each training data set contains multiple disjoint hypercubes, ranging from 2 to 12
TABLE II: Simulation study

VII-A Evaluation Using Two-Dimensional Polygons

In this section, we measure the performance of the trace criterion when it is applied to randomly generated polygons. Given the number of vertices, $v$, we generate the vertices of a random polygon in the anticlockwise sense as $(r_i \cos\theta_i, r_i \sin\theta_i)$, $i = 1, \ldots, v$. Here $\theta_1 \le \theta_2 \le \cdots \le \theta_v$ are the order statistics of an i.i.d. sample that is uniformly drawn from $[0, 2\pi)$. The radii $r_i$ are uniformly drawn from an interval $[r_{\min}, r_{\max}]$.

For this simulation, we fixed $r_{\min}$ and $r_{\max}$ and varied the number of vertices. We generated random polygons for each vertex size. Having determined a polygon, we randomly sampled points uniformly from the interior of the polygon and used the trace criterion and this sample to determine a bandwidth value. Figure 6 shows two random polygons.

(a) Number of Vertices = 5 (b) Number of Vertices = 25
Fig. 6: Random polygons
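One way to realize this order-statistics polygon construction is sketched below (an illustration under our reading of the construction; the radius interval and vertex count are arbitrary, and interior points are drawn by simple rejection sampling):

```python
import numpy as np
from matplotlib.path import Path

rng = np.random.default_rng(4)

def random_polygon(v, r_min=1.0, r_max=2.0):
    # Angles: order statistics of an i.i.d. uniform sample on [0, 2*pi).
    theta = np.sort(rng.uniform(0.0, 2.0 * np.pi, v))
    # Radii drawn uniformly from [r_min, r_max] (interval chosen for illustration).
    r = rng.uniform(r_min, r_max, v)
    # Sorted angles make the vertex sequence anticlockwise and the polygon simple.
    return np.c_[r * np.cos(theta), r * np.sin(theta)]

poly = random_polygon(25)

# Rejection-sample training points uniformly from the polygon's interior.
box_lo, box_hi = poly.min(axis=0), poly.max(axis=0)
cand = rng.uniform(box_lo, box_hi, size=(20000, 2))
inside = cand[Path(poly).contains_points(cand)]
print(len(inside))
```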

However, since we can easily determine whether a point lies in the interior of a polygon, we can also use cross validation to determine a good bandwidth value. To do so, we found the bounding rectangle of each of the polygons and divided it into a grid. We then labeled each point on this grid as an "inside" or an "outside" point. We then fit SVDD on the sampled data, scored the points on this grid for different values of $s$, and chose the value of $s$ that maximized the $F_1$ measure.

The performance of a bandwidth selection criterion can be measured by the $F_1$-measure ratio, which is defined as $F_1(s^*)/F_1^{\max}$, where $F_1(s^*)$ is the $F_1$ measure that is obtained when the value $s^*$ suggested by the criterion is used and $F_1^{\max}$ is the best possible value of the $F_1$ measure over all values of $s$. A value close to 1 indicates that a bandwidth selection criterion is competitive with cross validation. We have values of this ratio for each vertex size.

The box-and-whiskers plot in Figure 7 summarizes the simulation study results for the modified mean and trace criteria. The X axis shows the number of vertices of the polygon, and the Y axis shows the $F_1$-measure ratio. The bottom and the top of each box show the first and the third quartile values, and the ends of the whiskers represent the minimum and the maximum values of the $F_1$-measure ratio. The plot shows that the $F_1$-measure ratio is greater than 0.9 across all numbers of vertices. Because the complexity of the polygon increases as the number of vertices increases, we observed that the spread of the $F_1$-measure ratio increased slightly. The fact that the $F_1$-measure ratio is always greater than 0.9 provides evidence that both the trace criterion and the modified mean criterion generalize across different training data sets.

Fig. 7: Evaluation using random polygons

VII-B Evaluation Using Hyperspheres

In this section, we evaluate the trace criterion by using spherical data of varying dimensions. The observations in such spherical data (a hypersphere) are uniformly distributed. We use scoring to evaluate the quality of the data description that is obtained by using the trace criterion bandwidth value. The scoring data set consists of 50% inlier observations, which are uniformly distributed inside the training sphere, and 50% outlier observations, which are uniformly distributed outside the sphere. The points outside the sphere lie in a narrow annular ring, just outside the sphere. Figure 8 illustrates two variables in the training and scoring data. The rationale behind creating such a scoring data set is that if the bandwidth value is good, then the data description that is obtained by using it should be able to discriminate between observations that are inside the sphere and observations that are just outside it. We varied the hypersphere dimension from 5 to 40 in increments of 5. For each dimension, 50 pairs of training and scoring data sets were simulated. We computed the $F_1$ measure for each data set to determine the quality of the data description. Figure 9 shows a box-and-whiskers plot of the $F_1$ measure for various values of the data dimension. The $F_1$ measure decreases as the number of variables (the data dimension) increases from 5 to 40. Although the $F_1$ measure for the trace criterion is consistently above 0.9 for all simulated data sets across different dimensions, the measure dropped rapidly with increasing hypersphere dimension for the modified mean criterion. This observation confirms that a bandwidth value that is obtained by using the trace criterion provides a much better-quality data description than the modified mean criterion provides.

(a) Training Data, #obs=5,000
(b) Scoring Data, #obs=10,000
Fig. 8: Evaluation using hyperspheres
Fig. 9: Evaluation using hyperspheres
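Training and scoring sets of this kind can be generated with a standard construction (a sketch of ours, not the paper's code; the ring width of 1.0 to 1.1 is illustrative): directions come from normalized Gaussian vectors, and radii are drawn with density proportional to $r^{d-1}$ so that points are uniform in volume.

```python
import numpy as np

rng = np.random.default_rng(5)

def uniform_shell(n, d, r_lo, r_hi):
    # Uniform directions: normalize standard Gaussian vectors.
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Radii with density ~ r^(d-1): draw uniformly in r^d, then take the d-th root.
    r = rng.uniform(r_lo**d, r_hi**d, size=n) ** (1.0 / d)
    return u * r[:, None]

d = 10
train    = uniform_shell(5000, d, 0.0, 1.0)   # uniform inside the unit hypersphere
inliers  = uniform_shell(5000, d, 0.0, 1.0)   # scoring: inside the sphere
outliers = uniform_shell(5000, d, 1.0, 1.1)   # scoring: narrow annular ring outside
print(np.linalg.norm(train, axis=1).max())
```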

VII-C Evaluation Using Hypercubes

In this section, we evaluate the trace criterion by using cube-shaped data of varying dimensions. The observations in such cubic data (a hypercube) are uniformly distributed. We used scoring to evaluate the quality of the data description that was obtained by using the trace criterion bandwidth value. The scoring data set consists of 50% inlier observations, which are uniformly distributed inside the training cube, and 50% outlier observations, which are uniformly distributed outside the cube. The points outside the cube lie in a narrow frame, just outside the cube. Figure 10 illustrates two variables in the training and scoring data. The rationale behind creating such a scoring data set is that if the bandwidth value is good, then the data description that is obtained by using it should be able to discriminate between observations that are inside the cube and observations that are just outside it. We varied the hypercube dimension from 5 to 40 in increments of 5. For each dimension, 50 pairs of training and scoring data sets were simulated. We computed the $F_1$ measure for each data set to determine the quality of the data description. Figure 11 shows a box-and-whiskers plot of the $F_1$ measure for various values of the data dimension. The $F_1$ measure decreases as the number of variables (the data dimension) increases from 5 to 40. Although the $F_1$ measure for the trace criterion is consistently above 0.7 for all simulated data sets across different dimensions, the measure dropped rapidly with increasing hypercube dimension for the modified mean criterion. This observation confirms that a bandwidth value that is obtained by using the trace criterion provides a much better-quality data description than the modified mean criterion provides.

(a) Training Data, #obs=5,000
(b) Scoring Data, #obs=10,000
Fig. 10: Evaluation using hypercubes
Fig. 11: Evaluation using hypercubes

VII-D Evaluation Using High-Dimensional Disconnected Data

In this section, we use high-dimensional disconnected data to evaluate the performance of the trace criterion. The training data consist of two or more disjoint hyperspheres or hypercubes of the kind used earlier in Sections VII-B and VII-C. The details of the simulation study are presented below.

VII-D1 Evaluation Using Multiple Hyperspheres

We evaluated the trace and modified mean criteria by using data sets that contain multiple spheres. The number of spheres in a data set was varied between 2 and 12, and the data dimension was varied between 5 and 40 in increments of 5. For each combination of data dimension and number of spheres, we used different seed values to generate 10 different pairs of training and scoring data sets. Figure 12 illustrates sample training and scoring data sets that have five spheres and use two-dimensional data.

(a) Training Data
(b) Scoring Data
Fig. 12: Evaluation using multiple hyperspheres
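Training data of this disconnected type can be generated as in the following sketch. The sphere radius, the spacing of the centers, and the per-sphere sample size are assumptions; the paper does not specify them, only that the hyperspheres are disjoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(center, radius, n, rng):
    """Draw n points uniformly from a ball: a uniform random direction
    scaled by a radius whose density is proportional to r^(d-1)."""
    d = len(center)
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return center + directions * radii

def make_multi_sphere_data(n_spheres=5, dim=5, n_per_sphere=1000, rng=rng):
    """Training data drawn from disjoint hyperspheres whose centers are
    spaced four radii apart along the first axis, so the balls cannot
    overlap."""
    radius = 1.0
    data = []
    for k in range(n_spheres):
        center = np.zeros(dim)
        center[0] = k * 4.0 * radius
        data.append(sample_ball(center, radius, n_per_sphere, rng))
    return np.vstack(data)

train = make_multi_sphere_data()
```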

For each simulation run, we computed the measure. Table III(a) through Table III(k) provide box-and-whiskers plots of the measure for simulation results that were obtained using different numbers of hyperspheres. Each section of Table III provides the measure for different values of the data dimension. Table III(a) through Table III(c) show that when the number of hyperspheres is between 2 and 4, both the modified mean and the trace criteria provide values above 0.9, with the modified mean criterion performing slightly better than the trace criterion. But as the number of hyperspheres increases from 5 to 12, Table III(d) through Table III(k) indicate that the performance of the modified mean criterion drops compared to that of the trace criterion. The trace criterion consistently provides a measure higher than 0.9, whereas the measure that is obtained using the modified mean criterion is in the range of 0.7 to 0.8, depending on the number of hyperspheres.

(a) Number of hyperspheres = 2 (b) Number of hyperspheres = 3
(c) Number of hyperspheres = 4 (d) Number of hyperspheres = 5
(e) Number of hyperspheres = 6 (f) Number of hyperspheres = 7
(g) Number of hyperspheres = 8 (h) Number of hyperspheres = 9
(i) Number of hyperspheres = 10 (j) Number of hyperspheres = 11
(k) Number of hyperspheres = 12
TABLE III: Evaluation using multiple hyperspheres
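Evaluating a candidate bandwidth on such a labeled scoring set can be sketched as below. This relies on the standard equivalence between SVDD with a Gaussian kernel and the one-class SVM (here via scikit-learn's `OneClassSVM`), and it assumes the paper's measure is an F-measure on the outlier class; both the `nu` value and the gamma-bandwidth mapping are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

def evaluate_bandwidth(train, score_x, score_y, s, nu=0.05):
    """Fit a one-class SVM (equivalent to SVDD with a Gaussian kernel) at
    bandwidth s and evaluate the resulting data description on a labeled
    scoring set (+1 = inlier, -1 = outlier)."""
    # Map the Gaussian bandwidth s to sklearn's RBF parameter: gamma = 1/(2 s^2).
    clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * s**2), nu=nu)
    clf.fit(train)
    pred = clf.predict(score_x)  # +1 inside the description, -1 outside
    return f1_score(score_y, pred, pos_label=-1)
```

A good bandwidth should yield a description that flags the frame points outside the cube as outliers while accepting the interior points, which is exactly what this score rewards.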

VII-D2 Evaluation Using Multiple Hypercubes

We evaluated the trace and modified mean criteria by using data sets that contain multiple hypercubes. The number of hypercubes in a data set was varied between 2 and 12, and the data dimension was varied between 5 and 40 in increments of 5. For each combination of data dimension and number of hypercubes, we generated 10 different pairs of training and scoring data sets by using different seed values. Figure 13 illustrates sample training and scoring data sets that contain five cubes in two-dimensional data.

(a) Training Data
(b) Scoring Data
Fig. 13: Evaluation using multiple hypercubes

For each simulation run, we computed the measure. Table IV(a) through Table IV(k) provide box-and-whiskers plots of the measure for simulation results that were obtained using different numbers of hypercubes. Each section of Table IV provides the measure for different values of the data dimension. Table IV(a) through Table IV(c) show that when the number of hypercubes is between 2 and 4, both the modified mean and the trace criteria provide values above 0.9, with the modified mean criterion performing slightly better than the trace criterion. But as the number of hypercubes increases from 5 to 12, Table IV(d) through Table IV(k) indicate that the performance of the modified mean criterion drops compared to that of the trace criterion. The trace criterion consistently provides a measure higher than 0.9, whereas the measure that is obtained using the modified mean criterion is in the range of 0.7 to 0.8, depending on the number of hypercubes.

(a) Number of hypercubes = 2 (b) Number of hypercubes = 3
(c) Number of hypercubes = 4 (d) Number of hypercubes = 5
(e) Number of hypercubes = 6 (f) Number of hypercubes = 7
(g) Number of hypercubes = 8 (h) Number of hypercubes = 9
(i) Number of hypercubes = 10 (j) Number of hypercubes = 11
(k) Number of hypercubes = 12
TABLE IV: Evaluation using multiple hypercubes

VIII Conclusion

The trace criterion for computing the bandwidth value of a Gaussian kernel for SVDD, as proposed in this paper, exploits the low-rank representation of the kernel matrix to suggest a bandwidth value. Several evaluations that use synthetic and real-life data sets indicate that the bandwidth value that is obtained using the trace criterion provides results that are similar to or better than those of existing methods. The trace criterion method provides good bandwidth values even when the data are high-dimensional and disjoint.

Acknowledgement

The authors would like to thank Anne Baxter, Principal Technical Editor at SAS, for her assistance in creating this manuscript.

References