I. Introduction
In data classification, the overall goal is to define a model that can classify data into a predefined set of classes. During the training phase, the parameters of the classification model are estimated using samples belonging to the classes of interest. When the classification task involves two or more classes, model training requires a sufficient number of samples from each class and the corresponding class labels. However, in cases where we are interested in distinguishing one class from all other classes, applying multiclass classification methods is usually not appropriate.
In cases where both the class of interest (hereafter called the positive class) and all other classes (used to form the negative class) are sufficiently represented in the training set, class-specific models can be employed [1][2][3][4], whereas, if the negative class is not sufficiently represented, one-class models should be applied. The main conceptual difference between class-specific and one-class models is that the former try to discriminate the positive class from every other class, while the latter try to describe the positive class without exploiting information related to negative samples. This is why one-class models can be applied in problems where only the positive class can be sufficiently sampled, while the negative one is either too rare or too expensive to sample [5].
The one-class classification problem has been tackled mainly by three approaches: density estimation, reconstruction, and class boundary description [6]. For the density estimation approach, the Gaussian model, the mixture of Gaussians [7], and the Parzen density [8] are the most popular ones [9]. In reconstruction methods, the class is modelled by making assumptions about the process that generates the target data; examples include methods based on K-means clustering, learning vector quantization, and self-organizing maps [10]. In boundary description methods, a closed boundary around the target data is optimally formed. Support Vector Data Description (SVDD) [11] is one of the most popular boundary methods for one-class classification; it defines a hypersphere enclosing the target class, which can be made more flexible using kernel methods [11]. Other boundary methods have also been proposed for one-class classification. In [12], One-Class Support Vector Machine (OCSVM) is proposed, in which the objective is to define the hyperplane that discriminates the data from the origin with maximum margin. It has also been proven that the solutions of SVDD and OCSVM are equivalent for normalized data representations in the kernel space [13], [14]. In [15], Graph Embedded OCSVM (GEOCSVM) and Graph Embedded SVDD (GESVDD) are introduced as extensions of [12] and [11], respectively; these methods incorporate geometric class information, expressed by generic graph structures, into the OCSVM and SVDD optimization problems, where it acts as a regularizer on their solutions.
One-class classification has been used in many different applications. In [16], one-class classification is used for detecting faults in induction motors. In [17], one-class classification, particularly SVDD, is used in remote sensing for mapping a specific land-cover class, illustrated with an example of classification of a local government district in Cambridgeshire, England. In [18], an SVDD-based algorithm for target detection in hyperspectral images is developed. In [19], three different one-class classifiers, i.e., one-class Gaussian mixture, one-class SVM, and one-class nearest neighbor, are employed to label sound events as falls or as part of the daily routine of elderly people, based on sound signatures. In [20], one-class classification is used for video summarization based on human activities.
In this paper, we propose a novel method for generic one-class classification, namely Subspace Support Vector Data Description (SSVDD). SSVDD defines a model for the positive class in a low-dimensional feature space optimized for one-class classification. By allowing nonlinear data mappings, simple class models can be defined in the low-dimensional feature space that correspond to complex models in the original feature space. Such an approach allows us to simplify the information required for describing the class of interest, while at the same time providing good performance in nonlinear problems.
II. Subspace Support Vector Data Description
Let us assume that the class to be modeled is represented by a set of vectors $\mathbf{x}_i$, $i = 1, \dots, N$, living in a $D$-dimensional feature space (i.e., $\mathbf{x}_i \in \mathbb{R}^D$). Subspace Support Vector Data Description (SSVDD) tries to determine a $d$-dimensional feature space ($d < D$), in which the class can be optimally modeled. When a linear projection is considered, the objective is to determine a matrix $\mathbf{Q} \in \mathbb{R}^{d \times D}$, such that the data representations

$\mathbf{y}_i = \mathbf{Q} \mathbf{x}_i, \quad i = 1, \dots, N$  (1)

can be used in order to better model the class using a one-class classification model. We describe how nonlinear mappings can be exploited to this end using kernels in Subsection II-D.
The one-class classifier employed in this work is SVDD [11], which models the class by defining the hypersphere tightly enclosing the class. That is, given the data representations in the low-dimensional feature space, we want to determine the center $\mathbf{a}$ of the class and the corresponding radius $R$ by minimizing

$F(R, \mathbf{a}) = R^2$  (2)

such that all the training data are enclosed in the hypersphere, i.e.:

$\|\mathbf{Q}\mathbf{x}_i - \mathbf{a}\|_2^2 \le R^2, \quad \forall i \in \{1, \dots, N\}.$  (3)
In order to define a tighter class boundary (and possibly handle the situation of outliers in the training data), a relaxed version of the above criterion is solved by introducing a set of slack variables $\xi_i$. That is, the optimization function to minimize becomes

$F(R, \mathbf{a}) = R^2 + C \sum_{i=1}^{N} \xi_i$  (4)

under the constraints that most of the training data should lie inside the hypersphere, i.e.:

$\|\mathbf{Q}\mathbf{x}_i - \mathbf{a}\|_2^2 \le R^2 + \xi_i, \quad \forall i \in \{1, \dots, N\},$  (5)
$\xi_i \ge 0, \quad \forall i \in \{1, \dots, N\}.$  (6)
The parameter $C > 0$ in (4) is a regularization parameter which controls the trade-off between the volume of the hypersphere and the training error caused by allowing outliers in the class description. $C$ is inversely proportional to the fraction of expected outliers in the training set; thus, decreasing the value of $C$ allows more training samples to fall outside the class boundary.
The optimization problem in (4), under the constraints in (5) and (6), corresponds to the original SVDD optimization problem augmented with an additional parameter, the projection matrix $\mathbf{Q}$, that is used to define the optimal data representations for one-class classification. In order to find the optimal parameter values, we apply Lagrange-based optimization. The Lagrangian function is given by

$L = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( R^2 + \xi_i - \|\mathbf{Q}\mathbf{x}_i - \mathbf{a}\|_2^2 \right) - \sum_{i=1}^{N} \gamma_i \xi_i$  (7)

and should be maximized with respect to the Lagrange multipliers $\alpha_i \ge 0$ and $\gamma_i \ge 0$, and minimized with respect to the radius $R$, the center $\mathbf{a}$, the slack variables $\xi_i$, and the projection matrix $\mathbf{Q}$.
By setting the partial derivatives of (7) to zero, we get:

$\frac{\partial L}{\partial R} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 1,$  (8)

$\frac{\partial L}{\partial \mathbf{a}} = 0 \Rightarrow \mathbf{a} = \sum_{i=1}^{N} \alpha_i \mathbf{Q}\mathbf{x}_i,$  (9)

$\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow C - \alpha_i - \gamma_i = 0,$  (10)

$\frac{\partial L}{\partial \mathbf{Q}} = 2 \sum_{i=1}^{N} \alpha_i \left( \mathbf{Q}\mathbf{x}_i - \mathbf{a} \right) \mathbf{x}_i^T.$  (11)
From (8)-(11), we can observe that the optimization parameters $\boldsymbol{\alpha}$ and $\mathbf{Q}$ are interconnected and, thus, cannot be jointly optimized. In order to optimize (7) with respect to both $\boldsymbol{\alpha}$ and $\mathbf{Q}$, we apply an iterative optimization process where, at each step, we fix one parameter and optimize the other, as described in the following subsections.
II-A. Class description
Given a data projection matrix $\mathbf{Q}$, the data description step follows the standard SVDD-based solution. That is, substituting (1), (8), (9) and (10) in (7), we obtain

$L = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i^T \mathbf{Q}^T \mathbf{Q} \mathbf{x}_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{Q}^T \mathbf{Q} \mathbf{x}_j.$  (12)

Maximizing (12), subject to (8) and $0 \le \alpha_i \le C$, gives the set of $\alpha_i$ values. The samples corresponding to $\alpha_i > 0$ are the support vectors defining the data description. The samples corresponding to $0 < \alpha_i < C$ lie on the boundary of the corresponding hypersphere, while those outside the boundary correspond to $\alpha_i = C$. For the samples inside the boundary, the corresponding $\alpha_i$ values are equal to zero [11]. Here we should note that whether a sample is a support vector or not is affected by the selection of the data projection matrix $\mathbf{Q}$, which is optimized based on the process described next.
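For a fixed projection, maximizing (12) is a quadratic program in the $\alpha_i$ values. A minimal sketch of this class description step, assuming NumPy/SciPy and a generic SLSQP solver rather than a dedicated SVDD solver (function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(Y, C):
    """Solve the SVDD dual of (12) for fixed projected data Y (d x N).

    Maximizes sum_i a_i <y_i, y_i> - sum_ij a_i a_j <y_i, y_j>
    subject to 0 <= a_i <= C and sum_i a_i = 1.
    """
    N = Y.shape[1]
    G = Y.T @ Y                      # Gram matrix of the projected samples
    diag = np.diag(G)

    def neg_dual(a):                 # minimize the negative of (12)
        return -(a @ diag - a @ G @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
    bounds = [(0.0, C)] * N
    a0 = np.full(N, 1.0 / N)
    res = minimize(neg_dual, a0, bounds=bounds, constraints=cons)
    return res.x
```

Samples whose returned value is strictly between 0 and C are boundary support vectors; those at C lie outside the hypersphere.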
II-B. SVDD-based subspace learning
After determining the optimal set of $\alpha_i$ values, we optimize an augmented version of the Lagrangian function in (12):

$\tilde{L} = L + \beta \omega,$  (13)

where $\omega$ is a regularization term expressing the class variance in the low-dimensional space, having the form

$\omega = \mathrm{tr}\left( \mathbf{Q} \mathbf{X} \boldsymbol{\lambda} \boldsymbol{\lambda}^T \mathbf{X}^T \mathbf{Q}^T \right),$  (14)

$\beta$ is a regularization parameter controlling the importance of the regularization term in the update, $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]$, and $\mathrm{tr}(\cdot)$ is the trace operator. We additionally impose the constraint $\mathbf{Q}\mathbf{Q}^T = \mathbf{I}$, in order to obtain an orthogonal projection. $\boldsymbol{\lambda} \in \mathbb{R}^N$ is a vector controlling the contribution of each training sample to the regularization term and can take the following values:

- $\boldsymbol{\lambda} = \mathbf{0}$: In this case the regularization term becomes obsolete and $\mathbf{Q}$ is optimized using (12). This case is referred to as $\psi_1$ hereafter.

- $\boldsymbol{\lambda} = \mathbf{1}$: In this case all training samples contribute to the regularization term equally. That is, all samples are used in order to describe the variance of the class. This case is referred to as $\psi_2$ hereafter.

- $\boldsymbol{\lambda} = \boldsymbol{\alpha}$: In this case the samples belonging to the class boundary, as well as the outliers, are used to describe the class variance and regularize the update of $\mathbf{Q}$. This case is referred to as $\psi_3$ hereafter.

- $\boldsymbol{\lambda} = \boldsymbol{\alpha} \odot \mathbf{s}$, where $\mathbf{s}$ is a vector with values $s_i = 1$, if $\mathbf{x}_i$ is a support vector lying on the class boundary (i.e., $0 < \alpha_i < C$), and $s_i = 0$, otherwise. This case is referred to as $\psi_4$ hereafter.
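The four choices of the weighting vector above can be sketched as follows, in the order listed; `build_lambda` and its `case` argument are illustrative names, not from the paper:

```python
import numpy as np

def build_lambda(alpha, C, case, tol=1e-8):
    """Construct the weight vector lambda of (14) for the four cases above.

    alpha : dual variables from the SVDD step
    case  : 1..4, matching the four cases in order
    """
    alpha = np.asarray(alpha, dtype=float)
    if case == 1:                       # no regularization
        return np.zeros_like(alpha)
    if case == 2:                       # all samples, equal weight
        return np.ones_like(alpha)
    if case == 3:                       # boundary samples and outliers
        return alpha.copy()
    if case == 4:                       # boundary support vectors only
        s = ((alpha > tol) & (alpha < C - tol)).astype(float)
        return alpha * s
    raise ValueError("case must be 1..4")
```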
We update $\mathbf{Q}$ by using the gradient of $\tilde{L}$, i.e.:

$\mathbf{Q} \leftarrow \mathbf{Q} - \eta \left( \Delta L + \beta \, \Delta\omega \right),$  (15)

where $\Delta L$ is given by (11) and $\Delta\omega$ is the derivative of (14) with respect to $\mathbf{Q}$, i.e.:

$\Delta\omega = 2 \, \mathbf{Q} \mathbf{X} \boldsymbol{\lambda} \boldsymbol{\lambda}^T \mathbf{X}^T.$  (16)
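The gradient update of the projection matrix can be sketched as below. Here the gradient of (12) with respect to the projection matrix is written in matrix form (after substituting the center from (9)), and the gradient of the regularization term follows (16); function and variable names are illustrative:

```python
import numpy as np

def grad_step(Q, X, alpha, lam, beta, eta):
    """One gradient update of the projection matrix Q, following (15)-(16).

    dL/dQ = 2 Q X (diag(alpha) - alpha alpha^T) X^T   (gradient of (12))
    dw/dQ = 2 Q X lam lam^T X^T                       (gradient of (14))
    """
    A = np.diag(alpha) - np.outer(alpha, alpha)
    dL = 2.0 * Q @ X @ A @ X.T
    dw = 2.0 * Q @ X @ np.outer(lam, lam) @ X.T
    return Q - eta * (dL + beta * dw)
```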
II-C. SSVDD optimization
In order to define both an optimized data projection matrix $\mathbf{Q}$ and the optimal data description in the resulting subspace, we iteratively apply the two processing steps described in Subsections II-A and II-B, as described in Algorithm 1. The $\alpha_i$ values computed by maximizing (12) are used in (15) to update $\mathbf{Q}$ through a gradient step with a learning rate parameter $\eta$. The projection matrix $\mathbf{Q}$ is orthogonalized and normalized in each iteration, to enforce the orthogonality constraint before applying the data mapping.
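Putting the two steps together, the iterative training procedure can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the SVDD step is solved with a generic SLSQP solver, the weighting vector is fixed to all ones (the case where all samples contribute equally to the regularization term), and the hyperparameter defaults are arbitrary placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def train_ssvdd(X, d, C=0.2, beta=0.1, eta=0.1, n_iter=20, seed=0):
    """Sketch of the iterative optimization: alternate the SVDD step
    and the gradient update of the projection matrix.

    X : D x N training data (positive class only)
    Returns the projection Q (d x D) and the dual variables alpha.
    """
    D, N = X.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    Q = Q.T                                     # d x D, orthonormal rows

    for _ in range(n_iter):
        Y = Q @ X                               # map to the subspace, eq. (1)
        G = Y.T @ Y
        diag = np.diag(G)
        res = minimize(lambda a: -(a @ diag - a @ G @ a),
                       np.full(N, 1.0 / N),
                       bounds=[(0.0, C)] * N,
                       constraints=({'type': 'eq',
                                     'fun': lambda a: a.sum() - 1.0},))
        alpha = res.x                           # SVDD step, eq. (12)

        lam = np.ones(N)                        # all-ones weighting vector
        A = np.diag(alpha) - np.outer(alpha, alpha)
        grad = 2.0 * Q @ X @ (A + beta * np.outer(lam, lam)) @ X.T
        Q = Q - eta * grad                      # gradient step, eq. (15)
        Qt, _ = np.linalg.qr(Q.T)               # re-orthonormalize rows of Q
        Q = Qt.T
    return Q, alpha
```

The QR decomposition is one convenient way to realize the orthogonalization/normalization step; any procedure producing orthonormal rows would serve.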
II-D. Nonlinear data description
In order to exploit nonlinear mappings from $\mathbb{R}^D$ to $\mathbb{R}^d$ for one-class classification using the proposed SSVDD, we follow the standard kernel-based learning approach [13]. That is, the original data representations are nonlinearly mapped to the so-called kernel space $\mathcal{F}$ using a nonlinear function $\phi(\cdot)$, such that $\mathbf{x}_i \in \mathbb{R}^D \rightarrow \phi(\mathbf{x}_i) \in \mathcal{F}$. In $\mathcal{F}$, a linear projection of the training data to $\mathbb{R}^d$ is given by

$\mathbf{y}_i = \mathbf{W} \phi(\mathbf{x}_i),$  (17)

where $\mathbf{W}$ is a projection matrix of arbitrary dimensions [13]. In order to calculate the data representations $\mathbf{y}_i$, we employ the kernel trick, stating that $\mathbf{W}$ can be expressed as a linear combination of the training data representations in $\mathcal{F}$, leading to

$\mathbf{y}_i = \mathbf{A} \boldsymbol{\Phi}^T \phi(\mathbf{x}_i) = \mathbf{A} \mathbf{k}_i,$  (18)

where $\boldsymbol{\Phi} = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_N)]$ is a matrix formed by the training data representations in $\mathcal{F}$, $\mathbf{A} \in \mathbb{R}^{d \times N}$ is a matrix containing the reconstruction weights of $\mathbf{W}$ with respect to $\boldsymbol{\Phi}$, and $\mathbf{k}_i$ is the $i$-th column of the so-called kernel matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ having elements equal to $[\mathbf{K}]_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. In our experiments we use the RBF kernel, given by

$[\mathbf{K}]_{ij} = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\sigma^2} \right),$  (19)

where $\sigma$ is a hyperparameter scaling the Euclidean distance between $\mathbf{x}_i$ and $\mathbf{x}_j$.
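A minimal NumPy sketch of computing the RBF kernel matrix of (19) for a D x N data matrix (the function name is illustrative):

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Kernel matrix of (19): K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X : D x N data matrix; sigma is the scale hyperparameter.
    """
    sq = np.sum(X**2, axis=0)
    # pairwise squared Euclidean distances, clamped at zero for stability
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
```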
II-E. Test phase
During testing, a sample $\mathbf{x}$ is mapped to its representation $\mathbf{y}$ in the low-dimensional space using (1) (or (18) for the nonlinear case), and its distance from the hypersphere center is calculated as

$d(\mathbf{y}) = \|\mathbf{y} - \mathbf{a}\|_2.$  (22)

$\mathbf{x}$ is classified as positive when $d(\mathbf{y}) \le R$ and as negative otherwise.
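The test phase can be sketched as follows for the linear case. The center follows (9), and the radius is recovered as the distance of a boundary support vector (one with $0 < \alpha_i < C$) from the center, which is the standard SVDD practice [11]; function and variable names are illustrative:

```python
import numpy as np

def predict(Q, X_train, alpha, x_test, C, tol=1e-8):
    """Classify a test sample by its distance to the hypersphere center.

    The center is a = sum_i alpha_i Q x_i (eq. (9)); the radius R is the
    distance of any boundary support vector (0 < alpha_i < C) from a.
    Returns +1 (positive class) or -1 (negative).
    """
    Y = Q @ X_train
    a = Y @ alpha                               # hypersphere center, eq. (9)
    on_boundary = np.where((alpha > tol) & (alpha < C - tol))[0]
    R = np.linalg.norm(Y[:, on_boundary[0]] - a)
    d = np.linalg.norm(Q @ x_test - a)          # distance of eq. (22)
    return 1 if d <= R else -1
```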
Table I. Datasets used in the experiments.

No.  Dataset name  N  D  Target class

1  Balance scale  625  4  Left 
2  Iris  150  4  Irisvirginica 
3  Lenses  24  4  No contact lenses 
4  Seeds  210  7  Kama 
5  Haberman’s survival  306  3  Survived 
6  Qualitative bankruptcy  250  7  Bankrupt 
7  User knowledge modeling  403  5  Low 
8  Pima Indians diabetes  768  8  No diabetes 
9  Banknote authentication  1372  5  No 
10  TA evaluation  151  5  High 
11  PDelft pump  1500  64  Normal 
12  Vehicle Opel  864  18  Opel 
13  Sonar  208  60  Mines 
14  Breast Wisconsin  699  9  Malignant 
Table II. Average F1 measures over the five train/test partitions.

Dataset  1  2  3  4  5  6  7  8  9  10  11  12  13  14
Linear  
SVDD  0.703  0.762  0.609  0.774  0.834  0.686  0.634  0.791  0.764  0.485  0.846  0.853  0.625  0.958 
OCSVM  0.688  0.612  0.394  0.619  0.644  0.562  0.532  0.529  0.657  0.532  0.632  0.590  0.535  0.660 
SSVDD (ψ1)  0.907  0.899  0.620  0.756  0.836  0.692  0.960  0.786  0.908  0.482  0.856  0.855  0.618  0.957
SSVDD (ψ2)  0.898  0.897  0.724  0.827  0.839  0.720  0.957  0.793  0.889  0.502  0.857  0.855  0.599  0.960
SSVDD (ψ3)  0.896  0.881  0.649  0.798  0.841  0.722  0.946  0.787  0.886  0.507  0.856  0.855  0.633  0.960
SSVDD (ψ4)  0.896  0.868  0.694  0.778  0.821  0.715  0.954  0.784  0.852  0.458  0.857  0.854  0.638  0.953
Nonlinear  
SVDD  0.734  0.827  0.413  0.858  0.835  0.605  0.651  0.785  0.804  0.396  0.836  0.852  0.609  0.962 
OCSVM  0.544  0.673  0.523  0.444  0.743  0.550  0.409  0.786  0.700  0.274  0.661  0.679  0.530  0.630 
GESVDD  0.757  0.857  0.314  0.799  0.811  0.554  0.654  0.790  0.797  0.484  0.830  0.847  0.550  0.966 
GEOCSVM  0.815  0.869  0.398  0.800  0.816  0.594  0.658  0.667  0.930  0.498  0.613  0.788  0.593  0.962 
SSVDD (ψ1)  0.635  0.725  0.736  0.727  0.842  0.700  0.518  0.786  0.728  0.472  0.836  0.858  0.504  0.961
SSVDD (ψ2)  0.662  0.573  0.603  0.540  0.845  0.762  0.523  0.790  0.717  0.473  0.856  0.858  0.637  0.783
SSVDD (ψ3)  0.734  0.694  0.624  0.719  0.838  0.620  0.578  0.785  0.720  0.417  0.856  0.858  0.637  0.902
SSVDD (ψ4)  0.495  0.700  0.736  0.774  0.841  0.632  0.562  0.572  0.703  0.474  0.832  0.858  0.637  0.951
Table III. Standard deviations of the F1 measures over the five train/test partitions.

Dataset  1  2  3  4  5  6  7  8  9  10  11  12  13  14
Linear  
SVDD  0.014  0.041  0.152  0.041  0.009  0.072  0.032  0.009  0.010  0.025  0.005  0.003  0.033  0.002 
OCSVM  0.074  0.143  0.257  0.171  0.055  0.082  0.071  0.093  0.015  0.091  0.027  0.024  0.083  0.040 
SSVDD (ψ1)  0.022  0.034  0.154  0.041  0.007  0.046  0.017  0.009  0.026  0.088  0.002  0.004  0.022  0.004
SSVDD (ψ2)  0.026  0.032  0.136  0.052  0.017  0.012  0.019  0.006  0.031  0.049  0.001  0.003  0.058  0.012
SSVDD (ψ3)  0.029  0.061  0.157  0.057  0.009  0.016  0.016  0.016  0.039  0.052  0.001  0.003  0.048  0.004
SSVDD (ψ4)  0.024  0.063  0.118  0.030  0.038  0.016  0.031  0.013  0.110  0.078  0.001  0.003  0.022  0.016
Nonlinear  
SVDD  0.020  0.020  0.276  0.066  0.011  0.046  0.027  0.010  0.011  0.224  0.008  0.005  0.042  0.008 
OCSVM  0.164  0.158  0.331  0.275  0.139  0.107  0.233  0.014  0.073  0.173  0.114  0.146  0.075  0.354 
GESVDD  0.029  0.022  0.312  0.064  0.045  0.045  0.052  0.021  0.023  0.101  0.007  0.006  0.042  0.009 
GEOCSVM  0.039  0.056  0.368  0.071  0.026  0.131  0.058  0.261  0.019  0.063  0.188  0.121  0.090  0.009 
SSVDD (ψ1)  0.006  0.058  0.060  0.178  0.010  0.029  0.036  0.004  0.039  0.029  0.047  0.000  0.282  0.018
SSVDD (ψ2)  0.053  0.124  0.340  0.089  0.004  0.048  0.051  0.002  0.012  0.050  0.002  0.000  0.000  0.100
SSVDD (ψ3)  0.013  0.027  0.353  0.059  0.008  0.135  0.074  0.029  0.017  0.133  0.002  0.000  0.000  0.079
SSVDD (ψ4)  0.369  0.035  0.060  0.049  0.007  0.089  0.058  0.330  0.021  0.037  0.057  0.000  0.000  0.021
III. Experiments
III-A. Datasets, evaluation criteria and experimental setup
We performed experiments on the datasets listed in Table I. Datasets 1-10 were downloaded from the UCI website [21], while datasets 11-14 were downloaded from the TU Delft pattern recognition lab website [22]. The datasets with more than two classes were converted to a positive class and a negative class by considering the class with the majority of samples as the positive class and all others as the negative class. The last column of Table I shows the target class of each dataset.

In binary classification, a machine learning model can make two kinds of errors during testing: it can either wrongly predict a data sample from the positive class as negative, or a negative data sample as positive. In one-class classification, the focus is on the target class, and it is usually of greater interest to predict the positive class accurately. Recall, also called sensitivity, hit rate, or true positive rate, is the proportion of correctly classified positive samples during the test:
$\text{Recall} = \frac{TP}{P},$  (23)

where $TP$ is the total number of correctly classified positive samples and $P$ is the total number of positive samples in the data. Recall is used to evaluate classification results in cases where it is more important to predict the positive class accurately. Another metric used to evaluate machine learning algorithms is precision, which is the proportion of correctly classified samples among those classified into the positive class:

$\text{Precision} = \frac{TP}{TP + FP},$  (24)

where $FP$ denotes the false positives, i.e., the number of samples incorrectly predicted as positive during the test. A perfect precision score of 1.0 means that every sample classified as positive is from the positive class; in other words, a low precision score indicates a large number of false positives. The F1 measure takes into account both precision and recall. It is defined as their harmonic mean:
$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$  (25)

We use (25) for evaluating and comparing the performance of the proposed algorithm with the competing methods.
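As a concrete sketch of (23)-(25), assuming labels encoded as +1 (positive) and -1 (negative); the function name is illustrative:

```python
def f1_measure(y_true, y_pred):
    """F1 measure of (25), built from the recall (23) and precision (24)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    pos = sum(1 for t in y_true if t == 1)       # total positive samples P
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / pos if pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```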
To perform our experiments, we divided each dataset into training and test sets, selecting 70 percent of the data for training and the remaining 30 percent for testing. The random 70-30 split was repeated five times to assess the performance of each model robustly; thus, in total, we created five train/test (70%-30%) partitions for each dataset. The proportion of each class in each set follows the original proportions. For the datasets originally having more than two classes, the positive and negative class labels were assigned after the training and test subsets were created as described.
We selected the hyperparameters of the proposed method by 5-fold cross-validation over each training set, according to the best average F1 measure, and then used them to train the final model on the whole training set. Whenever we trained a model, only positive samples were used. The value of $\sigma$ was selected as a scaled version of the mean distance between the training samples; the scaling factor, as well as the values of $C$ and $\beta$, were selected from predefined ranges during cross-validation. The subspace dimensionality $d$ for datasets having more than 10-dimensional feature spaces was restricted to a maximum of 10, i.e., $d \in \{1, \dots, 10\}$. For datasets with $D \le 10$, we set $d \in \{1, \dots, D\}$.
We compared our results with the original SVDD (linear and kernel), OCSVM (linear and kernel), GEOCSVM and GESVDD. Their hyperparameters were selected using a similar 5-fold cross-validation approach, with the common hyperparameters selected from the ranges given above and the remaining ones as in the corresponding research papers.
III-B. Experimental results
Fig. 1 illustrates an example transformation of all the data samples of dataset 5 (Haberman's survival) from the original 3-dimensional feature space to a 2-dimensional feature space, using the nonlinear version of the proposed SSVDD method with one of the regularization constraints of Subsection II-B. The figure shows the capability of the proposed method to transform the data into a compact form that is more suitable to be enclosed by a hypersphere.
In Tables II and III, we report the average F1 measures and the corresponding standard deviations for the evaluated linear and nonlinear methods. The linear version of the proposed SSVDD clearly outperforms all other linear methods; only for dataset 10 does OCSVM achieve a higher performance. The nonlinear version of SSVDD outperformed the other nonlinear methods on datasets 3, 5, 6, 11, 12 and 13. For dataset 8, GESVDD and SSVDD achieved the same results, while GEOCSVM obtained the best results on datasets 1, 2, 7, 9 and 10. Compared to the baseline methods (SVDD and OCSVM), SSVDD shows a clear improvement.
For datasets 12 and 13, the nonlinear versions of SSVDD (with a single exception for dataset 13) have zero standard deviation. A closer inspection of the results shows that, in these cases, the obtained mapping and data description classify all the test samples as positive, due to the selection of small values for the hyperparameter $\sigma$ [23]. A hypersphere fitted more tightly on the training data could lead to more meaningful results; this could be achieved by restricting the range of the $\sigma$ values used during the cross-validation process applied on the training data for hyperparameter selection of the proposed method.
When comparing the different regularization terms used with the proposed method, the variant in which all training samples contribute equally to the regularization term achieves the best performance most often, with both the linear and nonlinear versions.
IV. Conclusion
In this paper, we proposed a new method for one-class classification. The proposed SSVDD method maps the original data to a lower-dimensional feature space that is more suitable for one-class classification. The method iteratively optimizes the mapping to the new subspace and the data description in that feature space. Both linear and nonlinear versions were defined, along with four different regularization terms.

We performed experiments on 14 different publicly available datasets. Our experiments showed that the proposed method yields better results than the baseline and competing one-class classification methods in the majority of cases. The constraint that uses all samples for describing the data variance led to the best results for SSVDD.

In the future, we intend to study the proposed SSVDD method with different kernels and to design new regularization terms. We will also evaluate a similar mapping approach in combination with other established one-class classification methods.
Acknowledgement
This work was supported by the NSF-TEKES Center for Visual and Decision Informatics project CoBotics, jointly sponsored by Tieto Oy Finland and CA Technologies.
References
 [1] C.-L. Liu and H. Sako, “Class-specific feature polynomial classifier for pattern classification and its application to handwritten numeral recognition,” Pattern Recognition, vol. 39, no. 4, pp. 669–681, 2006.
 [2] A. Iosifidis and M. Gabbouj, “Class-specific kernel discriminant analysis revisited: Further analysis and extensions,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4485–4496, 2017.
 [3] A. Iosifidis, A. Tefas, and I. Pitas, “Class-specific reference discriminant analysis with application in human behavior analysis,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 3, pp. 315–326, 2015.
 [4] A. Iosifidis and M. Gabbouj, “Scaling up class-specific kernel discriminant analysis for large-scale face verification,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 11, pp. 2453–2465, 2016.

 [5] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215–249, 2014.
 [6] D. M. J. Tax, “One-class classification: Concept-learning in the absence of counter-examples,” Ph.D. dissertation, Delft University of Technology, 2001.
 [7] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

 [8] E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
 [9] M. GhasemiGol, M. Sabzekar, R. Monsefi, M. Naghibzadeh, and H. S. Yazdi, “A new support vector data description with fuzzy constraints,” in 2010 International Conference on Intelligent Systems, Modelling and Simulation (ISMS). IEEE, 2010, pp. 10–14.
 [10] T. Kohonen, “Learning vector quantization,” in Self-Organizing Maps. Springer, 1995, pp. 175–189.
 [11] D. M. Tax and R. P. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
 [12] B. Schölkopf, R. Williamson, A. Smola, and J. Shawe-Taylor, “SV estimation of a distribution’s support,” Advances in Neural Information Processing Systems, vol. 12, 1999.
 [13] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
 [14] T. Le, D. Tran, W. Ma, and D. Sharma, “A unified model for support vector machine and support vector data description,” in The 2012 International Joint Conference on Neural Networks (IJCNN). IEEE, 2012, pp. 1–8.
 [15] V. Mygdalis, A. Iosifidis, A. Tefas, and I. Pitas, “Graph embedded one-class classifiers for media data classification,” Pattern Recognition, vol. 60, pp. 585–595, 2016.
 [16] R. Razavi-Far, M. Farajzadeh-Zanjani, S. Zare, M. Saif, and J. Zarei, “One-class classifiers for detecting faults in induction motors,” in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE). IEEE, 2017, pp. 1–5.
 [17] C. Sanchez-Hernandez, D. S. Boyd, and G. M. Foody, “One-class classification for mapping a specific land-cover class: SVDD classification of fenland,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 4, pp. 1061–1073, 2007.
 [18] W. Sakla, A. Chan, J. Ji, and A. Sakla, “An SVDD-based algorithm for target detection in hyperspectral imagery,” IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 2, pp. 384–388, 2011.
 [19] M. Popescu and A. Mahnot, “Acoustic fall detection using one-class classifiers,” in 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2009, pp. 3505–3508.
 [20] A. Iosifidis, V. Mygdalis, A. Tefas, and I. Pitas, “One-class classification based on extreme learning and geometric class information,” Neural Processing Letters, vol. 45, no. 2, pp. 577–592, 2017.
 [21] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [22] “Technical University Delft pattern recognition lab, one-class classifier.” [Online]. Available: http://homepage.tudelft.nl/n9d04/occ/index.html
 [23] W.-C. Chang, C.-P. Lee, and C.-J. Lin, “A revisit to support vector data description (SVDD),” Technical Report, 2013.