I. Introduction
Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and anomaly detection. First introduced by Tax and Duin [13], SVDD has a mathematical formulation that is almost identical to the one-class variant of support vector machines: one-class support vector machines (OCSVM), which is attributed to Schölkopf et al. [10]. The use of SVDD is popular in domains where the majority of data belongs to a single class and no distributional assumptions can be made. For example, SVDD is useful for analyzing sensor readings from reliable equipment, where almost all the readings describe the equipment's normal state of operation.

Like other one-class classifiers, SVDD provides a geometric description of the observed data. The SVDD classifier assigns a distance to each point in the domain space, which measures the separation of that point from the training data. During scoring, any observation found to be at a large distance from the training data might be an anomaly, and the user might choose to generate an alert.
Several researchers have proposed using SVDD for multivariate process control [12, 2]. Other applications of SVDD involve machine condition monitoring [14, 16] and image classification [8].
I-A. Mathematical Formulation
In this subsection we describe the mathematical formulation of SVDD; the description is based on [13].
Normal Data Description:
The SVDD model for normal data description builds a hypersphere that contains most of the data within a small radius. Given observations $x_i \in \mathbb{R}^m, i = 1, \ldots, n$, we need to solve the following optimization problem to obtain the SVDD data description.

Primal Form:

Objective:

(1) $\min \; R^2 + C \sum_{i=1}^{n} \xi_i$

subject to:

(2) $\|x_i - a\|^2 \le R^2 + \xi_i, \; i = 1, \ldots, n$

(3) $\xi_i \ge 0, \; i = 1, \ldots, n$

where:

$x_i \in \mathbb{R}^m, i = 1, \ldots, n,$ represent the training data,

$R$ is the radius and represents the decision variable,

$\xi_i$ is the slack for each observation,

$a$ is the center (a decision variable),

$C = \frac{1}{nf}$ is the penalty constant that controls the trade-off between the volume and the errors, and

$f$ is the expected outlier fraction.
Dual Form:
The dual formulation is obtained using Lagrange multipliers.
Objective:

(4) $\max \; \sum_{i=1}^{n} \alpha_i (x_i \cdot x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$

subject to:

(5) $\sum_{i=1}^{n} \alpha_i = 1$

(6) $0 \le \alpha_i \le C, \; i = 1, \ldots, n$

where $\alpha_i \in \mathbb{R}$ are the Lagrange constants and $C = \frac{1}{nf}$ is the penalty constant.
Duality Information:
The position of observation $x_i$ is connected to the optimal $\alpha_i$, the radius of the sphere $R$, and the center of the sphere $a$ in the following manner:

Center position:

(7) $a = \sum_{i=1}^{n} \alpha_i x_i$

Inside position:

(8) $\|x_i - a\|^2 < R^2 \Rightarrow \alpha_i = 0$

Boundary position:

(9) $\|x_i - a\|^2 = R^2 \Rightarrow 0 < \alpha_i < C$

Outside position:

(10) $\|x_i - a\|^2 > R^2 \Rightarrow \alpha_i = C$

Any $x_i$ for which the corresponding $\alpha_i > 0$ is known as a support vector.

Let $SV_{<C}$ denote the set $\{x_i : 0 < \alpha_i < C\}$; then the radius of the hypersphere is calculated, for any $x_k \in SV_{<C}$, as follows:

(11) $R^2 = (x_k \cdot x_k) - 2 \sum_{i=1}^{n} \alpha_i (x_i \cdot x_k) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$

The value of $R^2$ does not depend on the choice of $x_k \in SV_{<C}$.
Scoring:
For any point $z$, the distance $\mathrm{dist}^2(z)$ is calculated as follows:

(12) $\mathrm{dist}^2(z) = (z \cdot z) - 2 \sum_{i=1}^{n} \alpha_i (x_i \cdot z) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j (x_i \cdot x_j)$

Points whose $\mathrm{dist}^2(z) > R^2$ are designated as outliers.
The spherical data boundary can include a significant amount of space that has a sparse distribution of training observations. Using this model to score can lead to many false positives. Hence, instead of a spherical shape, a compact bounded outline around the data is often desired. Such an outline should approximate the shape of the single-class training data. This is possible by using kernel functions.
Flexible Data Description:
The support vector data description is made flexible by replacing the inner product $(x_i \cdot x_j)$ with a suitable kernel function $K(x_i, x_j)$. The Gaussian kernel function used in this paper is defined as

(13) $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2s^2}\right)$

where $s$ is the Gaussian bandwidth parameter.
The modified mathematical formulation of SVDD with a kernel function is as follows:
Objective:

(14) $\max \; \sum_{i=1}^{n} \alpha_i K(x_i, x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$

subject to:

(15) $\sum_{i=1}^{n} \alpha_i = 1$

(16) $0 \le \alpha_i \le C, \; i = 1, \ldots, n$

In perfect analogy with the previous section, any $x_i$ with $\alpha_i = 0$ is an inside point, and any $x_i$ for which $\alpha_i > 0$ is called a support vector. $SV_{<C}$ is similarly defined as $\{x_i : 0 < \alpha_i < C\}$, and the threshold $R^2$ is calculated, for any $x_k \in SV_{<C}$, as

(17) $R^2 = K(x_k, x_k) - 2 \sum_{i=1}^{n} \alpha_i K(x_i, x_k) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$

The value of $R^2$ does not depend on which $x_k \in SV_{<C}$ is used.
Scoring:
For any observation $z$, the distance $\mathrm{dist}^2(z)$ is calculated as follows:

(18) $\mathrm{dist}^2(z) = K(z, z) - 2 \sum_{i=1}^{n} \alpha_i K(x_i, z) + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$

Any point $z$ for which $\mathrm{dist}^2(z) > R^2$ is designated as an outlier.
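To make the kernelized formulation concrete, the following sketch (not the paper's implementation; the function names and the use of SciPy's general-purpose SLSQP solver are our own illustrative choices) solves the dual (14)–(16) for a Gaussian kernel and then applies the threshold (17) and scoring rule (18):

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, s):
    # K(x, y) = exp(-||x - y||^2 / (2 s^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s ** 2))

def svdd_train(X, s, f=0.1):
    """Solve the dual (14)-(16): maximize sum_i a_i K_ii - a'Ka
    subject to sum_i a_i = 1 and 0 <= a_i <= C, where C = 1/(n f)."""
    n = len(X)
    C = 1.0 / (n * f)
    K = gaussian_kernel(X, X, s)
    obj = lambda a: a @ K @ a - np.diag(K) @ a          # negated dual objective
    grad = lambda a: 2.0 * K @ a - np.diag(K)
    res = minimize(obj, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(0.0, C)] * n, method="SLSQP",
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    # Threshold R^2 from a support vector with 0 < alpha_k < C (Eq. 17)
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    k = sv[0] if len(sv) else int(np.argmax(alpha))
    R2 = K[k, k] - 2.0 * alpha @ K[:, k] + alpha @ K @ alpha
    return alpha, R2

def svdd_score(X, alpha, R2, s, Z):
    """Distance (18); a point z with dist^2(z) > R^2 is flagged as an outlier."""
    const = alpha @ gaussian_kernel(X, X, s) @ alpha
    dist2 = 1.0 - 2.0 * gaussian_kernel(Z, X, s) @ alpha + const  # K(z, z) = 1
    return dist2, dist2 > R2
```

A dedicated QP or SMO-style solver would scale better; SLSQP is used here only because it handles the box and equality constraints directly.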
I-B. Importance of the Kernel Bandwidth Value
In practice, SVDD is almost always computed by using the Gaussian kernel function, and it is important to set the value of the bandwidth parameter $s$ correctly. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to underfitting, and many anomalies cannot be detected by the classifier.
Because SVDD is an unsupervised learning technique, it is desirable to have an automatic, unsupervised bandwidth selection technique that does not depend on labeled data that separate the inliers from the outliers. In [4], Kakde et al. present the peak criterion, an unsupervised bandwidth selection technique, and show that it performs better than alternative unsupervised methods. However, determining the bandwidth suggested by the peak criterion requires that the SVDD solution be computed multiple times for the training data, for a list of bandwidth values that lie on a grid. Even though sampling techniques can speed up the computation (see arXiv:1611.00058), this method is still expensive. Moreover, to avoid unnecessary computation, it is necessary to initiate the grid search at a good starting value, and it is not immediately obvious what a good starting value is.

In this paper we suggest two new criteria: the mean criterion and the median criterion. The mean criterion has a simple closed-form expression in terms of the training data. We evaluated the mean criterion and the median criterion in multiple ways. We conducted simulation studies where we could objectively determine the quality of a particular bandwidth. We compared the results obtained from the mean and median criteria with those obtained from alternative methods on a wide range of data sets. The data were specially selected to probe potential weaknesses in the mean criterion.
Our results show that the mean criterion is competitive with the peak and median criteria for most data sets in our test suite. In addition, computation of the mean criterion is fast even when the data set is large. These properties make the mean criterion a good bandwidth selection technique. However, unsupervised bandwidth tuning is an extremely difficult problem, so it is quite possible that there is a class of data sets for which the mean criterion does not give good results.
The rest of the paper is organized as follows. Section II defines the mean and median criteria for bandwidth tuning, and the remaining sections compare the mean, median, and peak criteria with each other.
II. The Mean Criterion for Bandwidth Selection
II-A. Training Data That Have Distinct Observations
Assume we have a training data set that consists of $n$ distinct points $x_1, \ldots, x_n$ in $\mathbb{R}^p$, and we want to determine a good kernel bandwidth value for training this data set.

Given a candidate bandwidth $s$, let $K_s$ denote the kernel matrix whose element in position $(i, j)$ is $\exp\left(-\frac{\|x_i - x_j\|^2}{2s^2}\right)$. A tiny $s$ is not a good candidate for the kernel bandwidth, because as $s \to 0$, $K_s$ converges to the identity matrix $I_n$ of order $n$. When the kernel matrix is very close to the identity matrix, all observations in the original data set become support vectors. This indicates a case of severe overfitting. So for a reasonable bandwidth value $s$, the corresponding kernel matrix $K_s$ must be sufficiently different from the identity matrix $I_n$. One way to ensure this would be to choose $s$ such that

(19) $\|K_s - I_n\| \ge \delta \sqrt{n}$

where $\|\cdot\|$ is an appropriate matrix norm and $\delta > 0$ is a tolerance factor. Larger values of $\delta$ will ensure greater distance from the identity matrix.
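The limiting behavior that motivates this requirement, namely that $K_s$ approaches $I_n$ as $s \to 0$, is easy to check numerically. The following sketch (toy data and function names of our own choosing, not from the paper) computes the Frobenius distance between $K_s$ and $I_n$ for shrinking bandwidths:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                         # toy training data
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances

def frob_dist_from_identity(s):
    # K_s has entries exp(-||x_i - x_j||^2 / (2 s^2))
    K = np.exp(-d2 / (2.0 * s ** 2))
    return np.linalg.norm(K - np.eye(len(X)), "fro")

# As s -> 0, K_s -> I_n and the distance collapses (severe overfitting);
# larger bandwidths keep K_s far from the identity.
dists = [frob_dist_from_identity(s) for s in (2.0, 0.5, 0.1, 0.01)]
```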
It is easy to determine an $s$ that satisfies (19) when $\|\cdot\|$ is chosen as the Frobenius norm. The Frobenius norm of a matrix $A = (a_{ij})$ is defined as the square root of the sum of squares of all elements in the matrix; that is, $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$.

If $\|\cdot\|$ is the Frobenius norm, then

$\|K_s - I_n\|^2 = \sum_{i \ne j} \exp\left(-\frac{\|x_i - x_j\|^2}{s^2}\right).$

From the well-known inequality of arithmetic and geometric means, we have

$\frac{1}{n(n-1)} \sum_{i \ne j} \exp\left(-\frac{\|x_i - x_j\|^2}{s^2}\right) \ge \exp\left(-\frac{\bar{D}^2}{s^2}\right),$

so it is sufficient to choose an $s$ such that

(20) $s^2 \ge \frac{\bar{D}^2}{\ln\left(\frac{n-1}{\delta^2}\right)}$

where

$\bar{D}^2 = \frac{1}{n(n-1)} \sum_{i \ne j} \|x_i - x_j\|^2.$

Equation (20) suggests using

(21) $s_{mean}^2 = \frac{\bar{D}^2}{\ln\left(\frac{n-1}{\delta^2}\right)}$

as the kernel bandwidth. The numerator of (20) contains the mean of the pairwise squared distances; this suggests creating new criteria by replacing the mean with another measure of central tendency of the squared distances in the numerator of (20). For example, we can have another criterion that suggests using $s_{med}^2 = \mathrm{median}\{\|x_i - x_j\|^2 : i \ne j\} \,/\, \ln\left(\frac{n-1}{\delta^2}\right)$ as the bandwidth value.

We call using $s_{mean}$ as the bandwidth the mean criterion for bandwidth selection, and we call using $s_{med}$ as the bandwidth the median criterion for bandwidth selection.
The mean criterion bandwidth can be computed very quickly. Let $X = (x_{ik})$ denote the $n \times p$ data matrix, let $\bar{x}_k$ denote the column means, and let

$\sigma_k^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2$

denote the column variances. Then it is easy to show that $\bar{D}^2 = 2 \sum_{k=1}^{p} \sigma_k^2$. So the bandwidth suggested by the mean criterion is

(22) $s_{mean}^2 = \frac{2 \sum_{k=1}^{p} \sigma_k^2}{\ln\left(\frac{n-1}{\delta^2}\right)}.$

To see this, note the following:

$\sum_{i \ne j} \|x_i - x_j\|^2 = \sum_{i,j} \|x_i - x_j\|^2 = 2n \sum_{i=1}^{n} \|x_i - \bar{x}\|^2 = 2n(n-1) \sum_{k=1}^{p} \sigma_k^2.$

The result is immediate. Because the column variances can be calculated in one pass over the data, the computation of the mean criterion is an $O(np)$ algorithm.
The computation needed for the median criterion cannot be similarly simplified; however, one can take a sample from the data and use the sample median of the pairwise squared distances as an approximation to the median over the full data set.
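Both criteria can be sketched in a few lines. The following is an illustration (our own function names), with the mean criterion computed from column variances as $s_{mean}^2 = 2\sum_k \sigma_k^2 / \ln((n-1)/\delta^2)$, per (22), and the median criterion optionally approximated on a subsample; the tolerance $\delta$ is left as an explicit argument:

```python
import numpy as np

def mean_criterion_bandwidth(X, delta):
    """One pass over the n x p data: s^2 = 2 * sum_k var_k / ln((n-1)/delta^2)."""
    n = len(X)
    col_var = X.var(axis=0, ddof=1)          # unbiased column variances
    mean_d2 = 2.0 * col_var.sum()            # mean pairwise squared distance
    return np.sqrt(mean_d2 / np.log((n - 1) / delta ** 2))

def median_criterion_bandwidth(X, delta, sample=None, rng=None):
    """Median of pairwise squared distances, optionally approximated on a subsample."""
    n = len(X)
    if sample is not None:
        rng = rng or np.random.default_rng(0)
        X = X[rng.choice(len(X), size=sample, replace=False)]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    med_d2 = np.median(d2[~np.eye(len(X), dtype=bool)])  # off-diagonal entries only
    return np.sqrt(med_d2 / np.log((n - 1) / delta ** 2))
```

Note that the mean criterion never forms the pairwise distance matrix, while the median criterion needs it (or a sampled approximation of it).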
II-B. Training Data That Have Repeated Observations
We now consider the case where we have repeated observations in the training data set. Let $z_1, \ldots, z_m$ be the set of distinct points in our training data set in $\mathbb{R}^p$, and assume that $z_i$ is repeated $w_i$ times. Let $n = \sum_{i=1}^{m} w_i$. In this case, the kernel matrix $K_s$ is a square matrix of order $n$. The kernel matrix can be partitioned into $m \times m$ blocks, where the $(i, j)$ block is a matrix of order $w_i \times w_j$ given by $K(z_i, z_j)\,\mathbf{1}_{w_i} \mathbf{1}_{w_j}^{T}$, where $\mathbf{1}_w$ is a column vector of $w$ ones. As $s \to 0$, $K_s$ converges to $B_n$, a block diagonal matrix with diagonal blocks $\mathbf{1}_{w_i} \mathbf{1}_{w_i}^{T}$ for $i = 1, \ldots, m$. In this case, we similarly seek an $s$ such that

(23) $\|K_s - B_n\| \ge \delta \sqrt{n}.$

Define:

$\bar{z} = \frac{1}{n} \sum_{i=1}^{m} w_i z_i$ (the weighted mean),

$W = \sum_{i \ne j} w_i w_j$, and

$\bar{D}_w^2 = \frac{1}{W} \sum_{i \ne j} w_i w_j \|z_i - z_j\|^2$ (the weighted mean of the pairwise squared distances).

In a manner similar to the previous section, we have

$\|K_s - B_n\|_F^2 = \sum_{i \ne j} w_i w_j \exp\left(-\frac{\|z_i - z_j\|^2}{s^2}\right).$

Using Jensen's inequality,

$\sum_{i \ne j} w_i w_j \exp\left(-\frac{\|z_i - z_j\|^2}{s^2}\right) \ge W \exp\left(-\frac{\bar{D}_w^2}{s^2}\right).$

As in the previous section, the preceding bound can be simplified and expressed in terms of the weighted column variances $\sigma_{w,k}^2 = \frac{1}{n-1} \sum_{i=1}^{m} w_i (z_{ik} - \bar{z}_k)^2$. Any $s$ that satisfies the following inequality also satisfies (23):

(24) $s^2 \ge \frac{2n(n-1) \sum_{k=1}^{p} \sigma_{w,k}^2}{W \ln\left(\frac{W}{n\delta^2}\right)}.$
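For the weighted case, the identity that powers the simplification is $\sum_{i \ne j} w_i w_j \|z_i - z_j\|^2 = 2n \sum_i w_i \|z_i - \bar{z}\|^2$. The sketch below is our own illustration, not the paper's code: it reads the bound (24) as $s^2 \ge 2n(n-1)\sum_k \sigma_{w,k}^2 \,/\, (W \ln(W/(n\delta^2)))$ with $W = \sum_{i \ne j} w_i w_j$, and the $n-1$ normalization of the weighted column variance is an assumed convention matching the unweighted case.

```python
import numpy as np

def weighted_bound_bandwidth(Z, w, delta):
    """Smallest s allowed by the weighted bound, via weighted column variances."""
    n = w.sum()                                  # total observation count
    W = n ** 2 - (w ** 2).sum()                  # sum over i != j of w_i * w_j
    zbar = (w[:, None] * Z).sum(axis=0) / n      # weighted mean of the points
    wvar = (w[:, None] * (Z - zbar) ** 2).sum(axis=0) / (n - 1)
    s2 = 2.0 * n * (n - 1) * wvar.sum() / (W * np.log(W / (n * delta ** 2)))
    return np.sqrt(s2)
```

Like the unweighted mean criterion, this avoids forming the pairwise distance matrix over the distinct points.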
II-C. Choice of $\delta$

The mean and median criteria depend on the parameter $\delta$. For the mean and median criteria to be effective, there should be an easy way to choose the value of $\delta$. Otherwise we would simply have replaced the difficult problem of choosing a bandwidth with another difficult problem of choosing $\delta$. In our investigations, we noticed that a single fixed value of $\delta$ works for most cases. So unless explicitly stated otherwise, that fixed value of $\delta$ is used throughout this article.
III. Evaluating the Mean Criterion
III-A. Alternative Criteria
In [1], Aggarwal suggests a specific setting of the kernel parameter for the parametrization $\exp(-\|x-y\|^2/\lambda)$, which translates to a corresponding value of $s$ for the kernel parametrized as $\exp(-\|x-y\|^2/(2s^2))$. We call this Aggarwal's criterion, and we compare the mean and median criteria with it. In addition, we compare the mean and median criteria with the peak criterion (see [4]). Since the peak criterion performs better than the alternative criteria mentioned in [4], we omit those other criteria.
III-B. Choice of Data Sets
The mean and median criteria depend on the distribution of pairwise distances of the training data set. These methods might not work well if the distribution of the pairwise distances is skewed. So it is important to check the performance of the mean and median criteria on data sets that have a skewed distribution of pairwise distances.
The distribution of pairwise distances in data sets where the data lie in distinct clusters is typically multimodal and skewed. See Figure 4(a) for a data set that has three distinct clusters; the histogram of pairwise distances as seen in Figure 4(b) indicates a skewed and multimodal distribution.
For this reason, we check the performance of the different criteria on "connected" data (that is, data without any clusters) and on data sets where there are two or more clusters.
IV. Comparing the Criteria on Two-Dimensional Data
IV-A. Two-Dimensional Connected Data
IV-A.1. Data Description
In this section, we compare the performance of the mean, median, Aggarwal's, and peak criteria on selected two-dimensional data. These data sets are connected; that is, there are no clusters in the data. Because the data are two-dimensional, we can visually evaluate the quality of the results. To evaluate the results, we obtain the data description provided by the different bandwidths, and then we score the bounding rectangle of the data by dividing it into a grid. The inlier region that is obtained from scoring should closely match the training data. Figure 1 displays the results for a banana-shaped data set, and Figure 2 displays the results for a star-shaped data set.
IV-A.2. Conclusion
The scoring results indicate that the bandwidth values computed using the mean and median criteria provide a data description of good quality. Such a description is close to the one obtained using the peak criterion. Aggarwal's criterion does not work well for these data sets.
IV-B. Two-Dimensional Disconnected Data
IV-B.1. Data Description
In this section, we compare the performance of the mean, median, Aggarwal's, and peak criteria on selected two-dimensional data that lie in different clusters. Selecting the bandwidth for such data is usually more difficult than estimating the bandwidth of connected data. Because the data are two-dimensional, we can visually evaluate the quality of the results. To evaluate the results, we obtain the data description that is provided by the different bandwidths, and then we score the bounding rectangle of the data by dividing it into a grid. The inlier region obtained from scoring should closely match the training data. The following data sets are used in this section:

A simulated "two donuts and a munchkin" data set, which consists of two donut-shaped regions and a spherical region. Figure 5 displays the data and the scoring results.
IV-B.2. Conclusion
The scoring results indicate that the bandwidth values computed using the mean and median criteria provide a data description of reasonably good quality for the three-cluster data set and for the two-donuts-and-a-munchkin data set. Such a description is close to the one obtained using the peak criterion.

For the refrigerant data set, the peak criterion significantly outperforms the other methods. The data description obtained from the peak criterion can separate out all four clusters, whereas the other methods merge two clusters that lie close to each other. Although the mean and median criteria do not perform as well as the peak criterion, any point in the inlier region is close to the training data, and the area of the region that is misclassified is small compared to the bounding region of the training data. So the result is still very reasonable.

Aggarwal's criterion again performs poorly on all these data sets, and it is not considered as a candidate in the remaining sections.
V. Comparing the Criteria on High-Dimensional Data
The $F_1$ score is a common measure of a binary classifier's accuracy. It is defined as

$F_1 = \frac{2\,TP}{2\,TP + FN + FP}$

where $TP$, $FN$, and $FP$ stand for the number of true positives, the number of false negatives, and the number of false positives, respectively.

Comparing the different criteria for high-dimensional data is much more difficult than comparing them for two-dimensional data. For two-dimensional data, the quality of the result can be easily judged by looking at the plot of the scoring results. But this is not possible for high-dimensional data. For the purpose of evaluation, we selected labeled high-dimensional data that have a dominant class. We used SVDD on a subset of the dominant class to obtain a description of the dominant class, and then we scored the rest of the data to evaluate the criteria. We expect the points in the scoring data set that correspond to the dominant class to be classified as inliers and all other points to be classified as outliers. Because the data are labeled, we can also use cross validation to determine the bandwidth that best describes the dominant class in the sense of maximizing a measure of fit, such as the $F_1$ score. So in this section we compare the bandwidths suggested by the different unsupervised criteria with the bandwidth obtained through cross validation for various benchmark data sets. The results are summarized in Table I below. The benchmark data sets used for the analysis are described in Sections V-A through V-E.

Data | Dimension/Nobs | Max ($F_1$)($s$) | Peak ($F_1$)($s$) | Mean ($F_1$)($s$) | Median ($F_1$)($s$)
Metal Etch | 20/96 | (0.8)(0.69) | (0.56)(0.46) | (0.42)(0.24) | (0.43)(0.19)
Shuttle | 9/2000 | (0.96)(17) | (0.96)(14) | (0.95)(11) | (0.84)(5.75)
Spam | 57/1500 | (0.63)(50) | (0.63)(65) | (0.42)(0.24) | (0.43)(0.19)
Tennessee Eastman | 41/2000 | (0.19)(17) | (0.16)(8) | (0.14)(7.22) | (0.135)(6.04)
Intrusion | 45/11490 | (0.95)(7050) | (0.89)(5060) | (0.95)(9667) | (0.95)(356)
In Table I, the column labeled Max contains the $F_1$ score that corresponds to the cross-validation bandwidth together with that bandwidth value $s$, and the columns labeled Peak, Mean, and Median contain the $F_1$ scores and the bandwidths suggested by the peak, mean, and median criteria.
Caveat: SVDD is a geometric classifier, so using labels in this manner is useful only if they geometrically separate the data. If the labels actually separate the data geometrically, the bandwidth obtained from cross validation will lead to a high $F_1$ score.
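As a quick sanity check of the definition given at the start of this section, the $F_1$ computation is a one-liner (the function name is our own):

```python
def f1_score(tp, fn, fp):
    # F1 = 2*TP / (2*TP + FN + FP)
    return 2.0 * tp / (2.0 * tp + fn + fp)
```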
We now describe the data sets mentioned in Table I.
V-A. Metal Etch Data
This data set consists of 20 process variables, an ID variable, and a timestamp variable from a metal wafer etcher. The data consist of measurements from 108 normal wafers and 21 faulty wafers. The training data set contains half of all the normal wafers, and the scoring data set contains the remaining observations. The data set used in this analysis is explained in [15] and can be obtained from [6].
V-B. Shuttle Data
This data set consists of measurements made on a shuttle. The data set contains nine numeric attributes and one class attribute. Out of 58,000 total observations, 80% of the observations belong to class 1. A random sample of 2,000 observations belonging to class 1 was selected for training, and the remaining 56,000 observations were used for scoring. This data set is from the UC Irvine Machine Learning Repository [5].
V-C. Spam Data
The spam data set consists of emails that were classified as spam or not spam. Each record corresponds to an individual email. The total number of attributes is 57; most attributes are frequencies of specific words. Training is performed using a subset of the nonspam observations. The remaining observations, which include both spam and nonspam emails, were used for scoring. This data set is from the UC Irvine Machine Learning Repository [5].
V-D. Tennessee Eastman Data
The data set was generated using MATLAB simulation code that provides a model of an industrial chemical process. The data were generated for normal operations of the process and for twenty faulty processes. Each observation consists of 41 variables, of which 22 were measured continuously every 6 seconds on average and the remaining 19 were sampled at a specified interval of 0.1 or 0.25 hours. From the simulated data, we created an analysis data set that uses the normal operations data of the first 90 minutes and data that correspond to faults 1 through 20. A data set that contains observations of normal operations was used for training. Scoring was performed to determine whether the model could accurately classify an observation as belonging to normal operation of the process. The MATLAB simulation code is available at [7].
V-E. Intrusion Data
This data set contains multivariate data that characterize cyber attacks. It contains 45 attributes which include type of service, number of source bytes, number of failed logins, and number of files created. Out of the 24,156 observations, 22,981 were labeled no attack and 1,175 were labeled attack. Half of the no attack observations were used as training data and the remaining observations were used as scoring data. This data set can be obtained from [11].
V-F. Conclusion
The bandwidths suggested by the mean criterion are similar to the bandwidths suggested by the median and peak criteria for many data sets. This similarity makes the mean criterion an attractive bandwidth selection method because it can be computed very quickly.
VI. Simulation Study on Random Polygons
VI-A. Design
In this section, we conduct a simulation study to compare the different bandwidth selection methods. The simulation study consists of training SVDD on randomly generated polygons. Given the number of vertices $v$, we first generate the vertices of the polygon in counterclockwise direction as $(r_i \cos \theta_i, r_i \sin \theta_i), i = 1, \ldots, v$, where $\theta_1 \le \cdots \le \theta_v$ are the order statistics of an iid sample from the uniform distribution on $[0, 2\pi]$ and the $r_i$ are uniformly chosen from an interval $[a, b]$.

We then sample uniformly from the interior of the polygon and compute the bandwidths that are suggested by the mean, median, and peak criteria. Because it is easy to determine whether a point actually lies inside a particular polygon, we can also use cross validation to determine the best bandwidth parameter. To do so, we divide the bounding rectangle of this polygon into a grid and label each point in the grid as an inside or outside point, depending on whether that point is inside or outside the polygon. We choose the best bandwidth as the one that classifies the grid points as inside or outside points with the highest $F_1$ score. Figure 6 shows a typical polygon and the data that are generated from the polygon for fitting SVDD.

In our simulation study, we fix the values of $a$ and $b$, and we generate polygons whose number of vertices varies from 5 to 30. For a particular vertex count, we generate 20 polygons. For each such polygon, we create a data set that consists of 600 points sampled from the interior of the polygon, and we use this sample to obtain the bandwidth obtained through cross validation and the bandwidths given by the mean and median criteria. For each such polygon, we have the $F_1$ score that corresponds to the cross-validation bandwidth, $F_1^{cv}$, and the scores that correspond to the mean and median criteria, $F_1^{mean}$ and $F_1^{med}$, respectively. $F_1^{cv}$ is the best possible $F_1$ score that can be attained by any bandwidth. At the end of the simulation we have a collection of "$F_1$ score ratios," $F_1^{mean}/F_1^{cv}$ and $F_1^{med}/F_1^{cv}$, one for each polygon used in the simulation. If most of these values are close to 1, this indicates that the bandwidths suggested by the mean and median criteria are competitive with the bandwidth that maximizes the $F_1$ score.
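The polygon-generation and interior-sampling steps described above can be sketched as follows (the function names, the specific values of $a$ and $b$ used in the test, and the rejection-sampling approach are our own illustrative choices, not the paper's):

```python
import numpy as np

def random_polygon(v, a, b, rng):
    """Vertices (r_i cos t_i, r_i sin t_i) with t the order statistics
    of a uniform sample on [0, 2*pi) and r uniform on [a, b]."""
    t = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=v))
    r = rng.uniform(a, b, size=v)
    return np.column_stack([r * np.cos(t), r * np.sin(t)])

def inside(poly, pts):
    """Even-odd (ray-casting) point-in-polygon test, vectorized over pts."""
    x, y = pts[:, 0], pts[:, 1]
    flag = np.zeros(len(pts), dtype=bool)
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        cross = (y1 > y) != (y2 > y)            # edge straddles the ray
        xint = (x2 - x1) * (y - y1) / (y2 - y1) + x1
        flag ^= cross & (x < xint)
    return flag

def sample_interior(poly, m, rng):
    """Rejection-sample m points uniformly from the polygon's interior."""
    lo, hi = poly.min(0), poly.max(0)
    chunks = []
    while sum(len(c) for c in chunks) < m:
        cand = rng.uniform(lo, hi, size=(4 * m, 2))
        chunks.append(cand[inside(poly, cand)])
    return np.concatenate(chunks)[:m]
```

Because the angles are sorted, each polygon is star-shaped about the origin and therefore simple, so the even-odd test is well defined.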
VI-B. Results
The box-and-whiskers plots in Figure 7 summarize the simulation study results. The X axis shows the number of vertices of the polygon, and the box on the Y axis shows the distribution of the $F_1$ score ratios. The bottom and the top of each box show the first and third quartile values. The ends of the whiskers represent the minimum and maximum values of the $F_1$ score ratio. The diamond shape indicates the mean value, and the horizontal line in the box indicates the second quartile. The plots show that the $F_1$ score ratio is greater than 0.8 across all values of the number of vertices. The $F_1$ score ratio in the top three quartiles is greater than 0.9 across all values of the number of vertices. As the complexity of the polygon increases with an increasing number of vertices, the spread of the $F_1$ score ratio also increases.

VI-C. Conclusion
The fact that the $F_1$ score ratios are always close to 1 suggests that the mean and median criteria generalize across different training data sets. However, a similar simulation performed for the peak criterion in [4] shows that the distribution of $F_1$ score ratios for the peak criterion is even more concentrated around 1 and that the minimum values of the $F_1$ score ratios are much higher than those for the mean and median criteria. This shows that the peak criterion generalizes better than the mean and median criteria.
VII. Conclusion and Future Work
We proposed two new bandwidth selection criteria, the mean criterion and the median criterion, for the Gaussian kernel in SVDD training. The proposed criteria give results that are similar to those of the peak criterion for many data sets and hence are a good starting point for determining a suitable bandwidth for a particular data set. The suggested criteria might not be the most appropriate for data sets where the distribution of pairwise distances is highly skewed, and more research is needed to determine an appropriate bandwidth for such cases.
References
 [1] C.C. Aggarwal. Outlier Analysis. Springer New York, 2013.
 [2] Fatih Camci and Ratna Babu Chinnam. General support vector representation machine for one-class classification of nonstationary classes. Pattern Recognition, 41(10):3021–3034, 2008.
 [3] N. A. Heckert and James J. Filliben. NIST handbook 148: DATAPLOT reference manual, Volume I: Commands. http://www.itl.nist.gov/div898/software/dataplot/, 2000. [Online; accessed 4-August-2017].
 [4] D. Kakde, A. Chaudhuri, S. Kong, M. Jahja, H. Jiang, and J. Silva. Peak Criterion for Choosing Gaussian Kernel Bandwidth in Support Vector Data Description. In 2017 IEEE International Conference on Prognostics and Health Management (ICPHM) (PHM2017), 2017. [Online preprint.] Available: https://arxiv.org/abs/1602.05257.
 [5] M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
 [6] Eigenvector Research. Metal etch data for fault detection evaluation. http://www.eigenvector.com/data/Etch/. [Online; accessed 4-August-2017].
 [7] N. Lawrence Ricker. Tennessee Eastman challenge archive, MATLAB 7.x code. http://depts.washington.edu/control/LARRY/TE/download.html, 2002. [Online; accessed 21-March-2016].
 [8] Carolina Sanchez-Hernandez, Doreen S. Boyd, and Giles M. Foody. One-class classification for mapping a specific land-cover class: SVDD classification of fenland. IEEE Transactions on Geoscience and Remote Sensing, 45(4):1061–1073, 2007.
 [9] SAS Institute Inc., Cary, NC. SAS/STAT 14.2 User’s Guide, 2016.

 [10] Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems, pages 582–588, 2000.
 [11] SIGKDD. KDD intrusion data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999. [Online; accessed 4-August-2017].
 [12] Thuntee Sukchotrat, Seoung Bum Kim, and Fugee Tsung. One-class classification-based control charts for multivariate process monitoring. IIE Transactions, 42(2):107–120, 2009.
 [13] David M.J. Tax and Robert P.W. Duin. Support vector data description. Machine learning, 54(1):45–66, 2004.
 [14] Achmad Widodo and Bo-Suk Yang. Support vector machine in machine condition monitoring and fault diagnosis. Mechanical Systems and Signal Processing, 21(6):2560–2574, 2007.

 [15] Barry M. Wise, Neal B. Gallagher, Stephanie Watts Butler, Daniel D. White, and Gabriel G. Barna. A comparison of principal component analysis, multiway principal component analysis, trilinear decomposition and parallel factor analysis for fault detection in a semiconductor etch process. Journal of Chemometrics, 13(3-4):379–396, 1999.
 [16] Alexander Ypma, David M.J. Tax, and Robert P.W. Duin. Robust machine fault detection with independent component analysis and support vector data description. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 67–76. IEEE, 1999.