1 Introduction
One-class classification (OCC) addresses the problem of recognising patterns that adhere to a specific condition presumed as normal, and distinguishing them from any other object violating the normality criterion. OCC stands apart from the conventional two/multi-class classification paradigm in that it primarily uses observations from a single class, very often the target class, for training. One-class classification acts as an essential building block in a diverse range of practical systems including presentation attack detection in biometrics Fatemifar et al. (2021), health care Carrera et al. (2019), audio or video surveillance Rabaoui et al. (2008); Zhang et al. (2020), intrusion detection Nader et al. (2014), social networks Chaker et al. (2017), safety-critical systems Budalakoti et al. (2009), fraud detection Kamaruddin and Ravi (2016), insurance Sundarkumar et al. (2015), etc.
As with many other machine learning problems, state-of-the-art OCC algorithms are built on the premise of deep learning methodology Goodfellow et al. (2016); Ruff et al. (2018); Erfani et al. (2016), using massive labelled datasets, typically containing millions of samples. Although deep structures have led to breakthroughs in one-class learning and classification, their reliance on huge sets of data may pose certain limitations in practice. In this context, collecting sufficiently large sets of training observations for certain applications can be a challenge, hindering a full exploitation of the expressive capacity of deep networks. Even if sufficient data is gathered, labelling such huge amounts of data may be a bigger challenge. Whilst crowdsourcing may be considered an applicable strategy to label huge sets of data in some fields, for a variety of reasons including the level of knowledge required, data privacy, the time needed to produce accurate labels, etc., it may not serve as a viable option in domains such as defence, security, or healthcare. Although certain techniques such as active learning Settles (2012) or learning with privileged information Vapnik and Izmailov (2015) may be instrumental in reducing the quantity of labelled data required, they still demand the time and domain expertise of a human operator. In the absence of the large training sets required by deep nets, and specifically for small to moderate-sized datasets containing hundreds or thousands of training samples, kernel-based methods Cortes and Vapnik (1995) offer a very promising methodology for classification. Moreover, unlike deep networks that incorporate many heuristics with regard to their structure and the corresponding (hyper)parameters, kernel methods are built on solid foundations in optimisation and statistical learning theory
Vapnik (1998). The support vector data description (SVDD) approach Tax and Duin (2004), proposed as an adaptation of support vector machines to the one-class setting, presents a very popular kernel-based method for one-class classification. Although designed for the one-class setting, the SVDD approach does not require the training data to be exclusively and purely normal/positive, which is a quite appealing property in practical applications where the data is very often contaminated with noise and outliers. Furthermore, it provides an intuitive geometric characterisation of a predominantly positive dataset without making any specific assumption regarding the underlying distribution. Moreover, the SVDD decision-making process entails computing a simple distance between the centre of the target class and a test observation to label it as either normal (i.e. positive/target) or as an anomaly (i.e. negative/outlier). Finally, when large sets of training data are available, the SVDD method may be extended to a deep structure to directly learn features from the data for improved performance Ruff et al. (2018). These properties make SVDD a highly favoured method in a variety of one-class classification applications, where it serves as one of the most widely used techniques, if not the most widely used.
The underlying idea in the SVDD approach is to determine the smallest hypersphere enclosing the data. While its hard-margin formulation requires all positive target data to be strictly encapsulated within the inferred hyperspherical boundary, in practical situations a dataset may incorporate noisy/outlier samples. In the soft-margin SVDD approach, in order to take into account the possibility of a contaminated dataset and improve the generalisation capability of the model, the distance from each training object to the centre of the hypersphere need not be strictly smaller than the radius; larger distances are instead penalised. In order to encode and penalise violations of the hyperspherical decision boundary, the soft-margin variant introduces non-negative slack variables measuring the extent of violation of each object from the decision boundary. The optimisation problem is then modified to reflect such violations and penalise an ℓ1 norm term on the slacks. In other words, the conventional SVDD method, and also the standard two-class SVM classifier Cortes and Vapnik (1995), are founded on the idea of minimising an ℓ1 norm risk over the set of non-negative slack variables. In the context of two-class classification, very recently Vapnik and Izmailov (2021), the classical ℓ1 norm penalty term in the SVM formulation has been revisited to consider two alternative slack penalties, defined by the ℓ2 and the ℓ∞ norms, to formulate new SVM algorithms. A reformulation of the standard two-class SVM with the ℓ2 and ℓ∞ penalty terms has been verified to improve the classification performance, sometimes significantly Vapnik and Izmailov (2021).
In this work, we study the merits of different norm risks for "one-class" classification in the context of the SVDD approach. For this purpose, we consider a general ℓp norm (p ≥ 1) slack penalty term where p serves as a free parameter of the algorithm. As such, while in the standard SVDD method the slacks are penalised linearly, by introducing an ℓp norm function, non-linear cost functions of the slacks may be optimised where the degree of the non-linearity (i.e. p) may be tuned on the data. The reflection of the slack penalty term onto the dual-space formulation of the corresponding optimisation problem turns out to be a dual norm (ℓp*, with 1/p + 1/p* = 1) cost on the dual variables, thus providing the method with the capability to tune into the inherent sparsity of the problem.
1.1 Contributions
The major contributions of the current study may be summarised as listed below.

We generalise the SVDD formulation from an ℓ1 to an ℓp slack norm penalty function and illustrate that the proposed generalisation may lead to significant improvements in the performance of the algorithm;

We extend the proposed ℓp norm formulation from a pure one-class setting to the training scenario where labelled negative objects are also available and illustrate the merits offered by the proposed extension;

Based on Rademacher complexities, we theoretically study the generalisation performance of the proposed ℓp slack norm approach and derive bounds on its error;

We carry out an experimental evaluation of the proposed method on multiple OCC datasets and provide a comparison to the original SVDD method and its different variants, as well as other OCC techniques from the literature.
1.2 Organisation
The remainder of the paper is structured as follows. In Section 2, the relevant literature with a particular emphasis on different variants of the SVDD formalism is reviewed. In Section 3, once a short overview of the support vector data description (SVDD) approach Tax and Duin (2004) is given, we present our proposed ℓp slack norm SVDD approach for the pure one-class setting and then derive its extension for labelled negative training observations. Section 4 studies the generalisation error bound of the proposed approach based on Rademacher complexities. We present and analyse the results of an experimental evaluation of the proposed method in Section 5, where possible extensions of the proposed approach are also discussed. Finally, Section 6 concludes the paper.
2 Prior work
Although different categorisations of OCC methods exist in different studies Chandola et al. (2009); Khan and Madden (2014); Pimentel et al. (2014); Perera et al. (2021); Tax and Duin (2004), one-class classification techniques may be roughly identified as either generative or non-generative Kittler et al. (2014), the latter best represented by discriminative approaches. While in the generative techniques the objective is to model the underlying generative process of the data, the discriminative methods try to directly partition the observation space into different regions for classification. Discriminative approaches tend to yield better performance in practice since they try to explicitly solve the OCC problem without attempting to solve an intermediate and more general task of inferring the underlying distribution or generative process Vapnik (1998).
Generative OCC approaches encompass the methods that try to estimate the underlying distribution using, for example, Parzen windowing Bishop (1994), Gaussian distribution modelling Parra et al. (1996), or a mixture of distributions Platzer et al. (2008); Fatemifar et al. (2022). A different subcategory of generative approaches includes methods that, for decision making, use the residual of reconstructing a test sample with respect to a hypothesised model, some instances of which are the kernel principal component analysis (KPCA) and its variants Hoffmann (2007); Xiao et al. (2013), or the autoencoder-based techniques Abati et al. (2019).
Discriminative methods constitute a strong alternative to the generative one-class learners. As an instance, based on a variant of the Fisher classification principle, the kernel null-space method Bodesheim et al. (2013) tries to map positive objects onto a single point in a discriminative subspace, obtaining very competitive results compared with other alternatives Arashloo and Kittler (2021). Another successful discriminative one-class method focuses on the use of Gaussian Process (GP) priors Kemmler et al. (2013), trying to directly infer the a posteriori class probability of the target class. A further example of discriminative one-class learners is that of the nearest-neighbour-based approaches Tax and Duin (2000), where the normality of an object is decided based on its immediate neighbours. Among others, a widely applied discriminative one-class classification method is the support vector data description (SVDD) approach Tax and Duin (2004), which tries to estimate the smallest volume surrounding the positive objects. When labelled negative training objects exist, the decision boundary is refined by requiring the negative objects to lie outside the hyperspherical boundary. The soft version of this approach allows the positive and negative (if any) training objects to violate the boundary criterion, subject to a linear penalty on the extent of the violation (called slack), where a parameter controls the trade-off between the volume and such errors in the objective function. Due to its success in data description, its intuitive geometrical interpretation, and its ability to benefit from a kernel-based representation, the SVDD approach serves as a widely used technique in the OCC literature, motivating much subsequent research. As an instance, in Hu et al. (2021), based on the observation that the SVDD centre and volume are sensitive to the parameter controlling the trade-off between the errors (slacks) and the volume, a method called GL-SVDD is proposed where local and global probability densities are used to derive sample-adaptive errors by associating weights with the slacks corresponding to different objects. In Wang and Lai (2013), a different sample-specific weighting approach (P-SVDD), based on the position of the feature-space image, is proposed to adaptively regularise the complexity of the SVDD sphere. Other work Cha et al. (2014) (DW-SVDD) considers re-weighting sample errors in the objective function by utilising the relative density of each object with respect to the density distribution of normal samples. The authors in Lee et al. (2005) define a density-based distance between a sample point and the centre of the hypersphere to adjust the constraint set of the SVDD optimisation problem by re-weighting training objects. The work in Chen et al. (2015) considers a different linear sample-weighting scheme in the SVDD objective function by introducing the cut-off distance-based local density of objects. The work in Wu and Ye (2009) introduces a margin parameter to maximise the margin between the hypersphere and the non-target objects in an SVDD formulation and directly optimises the margin. The Euclidean (ℓ2) distance employed in the widely used Gaussian kernel function is reassessed in Nader et al. (2014) to see if other distances in the Gaussian kernel function may provide performance advantages. Apart from the research focused on improving the performance of the SVDD method in a one-class setting, there also exist studies where the SVDD approach is generalised to two Huang et al. (2011) or to multiple classes Turkoz et al. (2020).
Considering the body of work discussed above, one observes that the majority of the existing studies modify the slack error term by introducing an adaptive weighting for each data sample based on different cues. Clearly, a simple linear weighting scheme does not change the linearity of the objective function with respect to the slacks. The exceptions are the work in Tsang et al. (2005); Chang et al. (2015), where instead of an ℓ1 norm penalty, an ℓ2 norm penalty is considered over the slacks. As will be demonstrated in the subsequent sections, an ℓ2 slack penalty may not always yield an optimal performance for data description. The current study is a generalisation of the existing SVDD formulations as it considers an ℓp (p ≥ 1) slack norm penalty where p serves as a free parameter of the algorithm, allowing for different non-linear penalties to be optimised w.r.t. the slacks while at the same time providing the opportunity to tune into the inherent sparsity characteristics of the data.
3 Methodology
In this section, first, we briefly review the SVDD method Tax and Duin (2004) and then present the proposed approach.
3.1 Preliminaries
The Support Vector Data Description (SVDD) approach Tax and Duin (2004) tries to estimate the smallest hyperspherical volume that encloses normal/target data in some predetermined feature space. As a hypersphere is characterised by its centre a and its radius R, the learning problem in the SVDD method is defined as minimising the radius while requiring the hypersphere to encapsulate all normal objects x_i, that is

min_{R,a} R²   s.t.  ‖x_i − a‖² ≤ R², ∀i    (1)
In practice, however, the training data might be contaminated with noise and outliers. In order to handle possible outliers in the training set and derive a solution with a better generalisation capability, the objective function in the SVDD method is modified so that the distance from the centre to each training observation need not be strictly smaller than R; instead, larger distances are penalised. For this purpose, using non-negative slack variables ξ_i, the SVDD optimisation task is modified as

min_{R,a,ξ} R² + C Σ_i ξ_i   s.t.  ‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0, ∀i    (2)

where ξ denotes a vector collection of the ξ_i's, and the trade-off between the sum of errors (i.e. the ξ_i's) and the squared radius is controlled by the parameter C. The optimisation problem above corresponds to the case where only normal samples (and possibly a minority of noisy objects) are presumed to exist in the training set. When labelled negative training objects are also available, the learning problem in the SVDD method is modified to enforce positive samples to lie within the hypersphere while negative samples are encouraged to fall outside its boundary.
The SVDD objective function in Eq. 2 depends on an ℓ1 norm of the slack variables as Σ_i ξ_i = ‖ξ‖₁, and consequently, all errors/slacks are penalised linearly. Although penalising all errors linearly in their magnitudes is a plausible option, it is by no means the only possibility. As an instance, a different alternative may be to penalise only the maximum error/slack, which can be achieved by incorporating an ℓ∞ norm on the slacks as max_i ξ_i = ‖ξ‖_∞. Any other penalty which would lie between penalising all the slacks linearly and penalising only the maximum error may then be characterised using a general ℓp norm on the errors, i.e. via ‖ξ‖_p. In particular, introducing a variable norm parameter p opens the door to considering non-linear penalties on the errors, compared with the original SVDD method which is limited to a linear penalty on the slacks. From a dual-problem viewpoint, introducing an ℓp norm penalty on the slacks translates into sparsity-inducing dual ℓp* norms on the dual variables, which provides the opportunity to better accommodate the intrinsic sparsity of the problem. As such, in the proposed approach, we generalise the SVDD error function using an ℓp norm function of the slacks, discussed next.
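The interpolation described above can be made concrete with a short numerical sketch (illustrative only, not part of the paper's experiments): as p grows, ‖ξ‖_p moves from the sum of the slacks towards the maximum slack.

```python
# Illustration: the l_p slack penalty interpolates between the l_1 penalty
# (sum of slacks, p = 1) and the l_inf penalty (maximum slack, p -> infinity).
def lp_penalty(slacks, p):
    """Return ||xi||_p for a sequence of non-negative slack values."""
    return sum(s ** p for s in slacks) ** (1.0 / p)

slacks = [0.1, 0.5, 2.0]
print(lp_penalty(slacks, 1))    # l_1 penalty: sum of slacks (~2.6)
print(lp_penalty(slacks, 2))    # intermediate value
print(lp_penalty(slacks, 100))  # large p: close to max(slacks) = 2.0
```

For p = 1 all slacks contribute equally; for very large p the penalty is dominated by the worst violation, mirroring the ℓ∞ alternative mentioned above.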
3.2 ℓp slack norm SVDD
By replacing the ℓ1 norm term on the slack variables in Eq. 2 with an ℓp norm, the optimisation problem in the proposed approach is defined as

min_{R,a,ξ} R² + C ‖ξ‖_p^p   s.t.  ‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0, ∀i    (3)

where ‖ξ‖_p^p = Σ_i ξ_i^p.
In order to solve the optimisation problem above, the Lagrangian is formed as

L = R² + C Σ_i ξ_i^p − Σ_i α_i (R² + ξ_i − ‖x_i − a‖²) − Σ_i γ_i ξ_i    (4)

where the α_i's and γ_i's are non-negative Lagrange multipliers. In order to derive the dual function, the Lagrangian should be minimised with respect to the primal variables R, a, and ξ. Setting the partial derivatives of L w.r.t. R, a, and ξ_i to zero yields

Σ_i α_i = 1    (5a)
a = Σ_i α_i x_i    (5b)
C p ξ_i^{p−1} = α_i + γ_i, ∀i    (5c)
Substituting the relations above into L, the Lagrangian is obtained as

L = Σ_i α_i ⟨x_i, x_i⟩ − Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ − ((p−1)/(C^{p*−1} p^{p*})) Σ_i (α_i + γ_i)^{p*}    (6)

where p* = p/(p−1) denotes the conjugate of p, and α and γ denote vector collections of the α_i's and γ_i's. Furthermore, one can easily check that Slater's condition is satisfied, and thus the following complementary conditions also hold at the optimum:

γ_i ξ_i = 0, ∀i    (7a)
α_i (R² + ξ_i − ‖x_i − a‖²) = 0, ∀i    (7b)

Using Eq. 5c and Eq. 7a, it must hold that γ_i (α_i + γ_i) = 0, ∀i. Since α_i ≥ 0 and γ_i ≥ 0, one concludes that γ_i = 0, ∀i. As a result, the Lagrangian in Eq. 6 simplifies to

L = Σ_i α_i ⟨x_i, x_i⟩ − Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ − ((p−1)/(C^{p*−1} p^{p*})) ‖α‖_{p*}^{p*}    (8)
The dual problem entails maximising L in α:

max_α L(α)   s.t.  Σ_i α_i = 1,  α_i ≥ 0, ∀i    (9)

Note that, for p > 1, we have p* = p/(p−1) > 1, and consequently, the term ‖α‖_{p*}^{p*}, which enters the Lagrangian with a negative sign, is convex w.r.t. α. Note also that the other terms in L are either linear or concave quadratic functions of α (the Gram matrix being positive semi-definite), while the constraints are affine. As a result, the optimisation problem above is a convex optimisation task.
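A dual of this form can be handed to any off-the-shelf constrained solver. The sketch below (not the authors' implementation; the penalty coefficient `c` is a stand-in for the p-dependent constant in the derivation) maximises a dual of the same shape under the simplex-style constraints:

```python
# Minimal numerical sketch of an l_p-SVDD-style dual:
#   maximise  a^T diag(K) - a^T K a - c * sum(a ** pstar)
#   subject to sum(a) = 1, a >= 0.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # toy training samples
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * np.median(sq)))              # Gaussian kernel matrix

p = 2.0
pstar = p / (p - 1.0)                              # conjugate exponent of p
c = 0.1                                            # assumed penalty coefficient

def neg_dual(a):
    # Negated dual objective (solver minimises).
    return -(a @ np.diag(K) - a @ K @ a - c * np.sum(a ** pstar))

n = len(X)
res = minimize(neg_dual, np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1}])
alpha = res.x
# Kernel-space squared distances of training points to the learned centre.
centre_dist2 = np.diag(K) - 2 * K @ alpha + alpha @ K @ alpha
```

For p = 2 the extra term is simply a quadratic ridge on α; for other p > 1 the fractional power keeps the problem convex, as argued above.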
3.3 ℓp slack norm SVDD with negative samples
In the proposed ℓp slack norm approach, similar to the original SVDD method Tax and Duin (2004), when labelled non-target/negative training observations are available, they may be utilised to refine the description. In this case, as opposed to the positive samples that should be enclosed within the hypersphere, the non-target objects should lie outside its boundary. In what follows, the normal/positive samples are indexed by i and the negative objects by l. In order to allow for possible errors in both the positive and the negative training samples, slack variables ξ_i and ζ_l are introduced. The optimisation problem when labelled negative samples are available is then defined as

min_{R,a,ξ,ζ} R² + C₁ Σ_i ξ_i^p + C₂ Σ_l ζ_l^p
s.t.  ‖x_i − a‖² ≤ R² + ξ_i,  ξ_i ≥ 0, ∀i
      ‖x_l − a‖² ≥ R² − ζ_l,  ζ_l ≥ 0, ∀l    (10)
In the objective function above, while C₁ may be used to control the fraction of positive training objects that fall outside the hypersphere boundary, C₂ may be adjusted to regulate the fraction of negative training samples that will lie within the hypersphere. By introducing non-negative Lagrange multipliers α_i, γ_i, α_l, and γ_l, the Lagrangian of Eq. 10 is formed as

L = R² + C₁ Σ_i ξ_i^p + C₂ Σ_l ζ_l^p − Σ_i α_i (R² + ξ_i − ‖x_i − a‖²) − Σ_l α_l (‖x_l − a‖² − R² + ζ_l) − Σ_i γ_i ξ_i − Σ_l γ_l ζ_l    (11)

In order to form the dual function, the Lagrangian should be minimised w.r.t. R, a, the ξ_i's, and the ζ_l's. Setting the partial derivatives of L w.r.t. R, a, ξ_i, and ζ_l to zero yields

Σ_i α_i − Σ_l α_l = 1    (12a)
a = Σ_i α_i x_i − Σ_l α_l x_l    (12b)
C₁ p ξ_i^{p−1} = α_i + γ_i, ∀i    (12c)
C₂ p ζ_l^{p−1} = α_l + γ_l, ∀l    (12d)
Re-substituting the relations above into Eq. 11 gives

L = Σ_i α_i ⟨x_i, x_i⟩ − Σ_l α_l ⟨x_l, x_l⟩ − Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ + 2 Σ_{i,l} α_i α_l ⟨x_i, x_l⟩ − Σ_{l,m} α_l α_m ⟨x_l, x_m⟩ − ((p−1)/(C₁^{p*−1} p^{p*})) Σ_i (α_i + γ_i)^{p*} − ((p−1)/(C₂^{p*−1} p^{p*})) Σ_l (α_l + γ_l)^{p*}    (13)

where the multipliers α_i, γ_i are associated with the positive samples and α_l, γ_l with the negative samples. Since Slater's condition holds, the following complementary conditions are also satisfied at the optimum:

γ_i ξ_i = 0, ∀i    (14a)
γ_l ζ_l = 0, ∀l    (14b)
α_i (R² + ξ_i − ‖x_i − a‖²) = 0, ∀i    (14c)
α_l (‖x_l − a‖² − R² + ζ_l) = 0, ∀l    (14d)

Using Eqs. 12c and 14a, and also Eqs. 12d and 14b, one concludes that γ_i = 0, ∀i, and γ_l = 0, ∀l. As a result, the Lagrangian in Eq. 13 becomes

L = Σ_i α_i ⟨x_i, x_i⟩ − Σ_l α_l ⟨x_l, x_l⟩ − Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ + 2 Σ_{i,l} α_i α_l ⟨x_i, x_l⟩ − Σ_{l,m} α_l α_m ⟨x_l, x_m⟩ − ((p−1)/(C₁^{p*−1} p^{p*})) ‖α⁺‖_{p*}^{p*} − ((p−1)/(C₂^{p*−1} p^{p*})) ‖α⁻‖_{p*}^{p*}    (15)

where α⁺ and α⁻ collect the multipliers of the positive and the negative samples, respectively.
The dual problem then reads

max_α L   s.t.  Σ_i α_i − Σ_l α_l = 1,  α_i ≥ 0, ∀i,  α_l ≥ 0, ∀l    (16)

Since p > 1 leads to p* > 1, the ‖·‖_{p*}^{p*} terms, entering the Lagrangian with negative signs, are convex, while the remaining terms are either linear or concave quadratic functions and the constraint sets are affine. Consequently, the maximisation problem in Eq. 16 is a convex optimisation task.
3.4 Joint formulation
As discussed earlier, when only positive labelled training observations are available, one solves the optimisation problem in Eq. 9 with the Lagrangian given in Eq. 8. When, in addition to the positive training samples, labelled negative training objects are also available, the problem to be solved is the optimisation task in Eq. 16 with the corresponding Lagrangian given in Eq. 15. Although the optimisation tasks corresponding to the pure positive case and the second scenario may appear different, both optimisation problems can be expressed compactly using a joint formulation as follows. Let us assume that the vector y collects the labels of the training samples, where the label of a positive object is +1 while that of any non-target training sample is −1. Furthermore, suppose the Lagrange multipliers associated with the negative and positive samples are all collected into a single vector α. In order to reduce clutter in the formulation, let us further denote by κ the vector with entries ⟨x_k, x_k⟩, by G the Gram matrix with entries ⟨x_k, x_{k'}⟩, and by c the vector whose k-th entry is C₁ for a positive sample and C₂ for a negative one. With these definitions, the Lagrangian in Eq. 15 may be expressed as

L(α) = (y ∘ α)ᵀ κ − (y ∘ α)ᵀ G (y ∘ α) − ((p−1)/p^{p*}) Σ_k c_k^{−(p*−1)} α_k^{p*}    (17)

where ∘ denotes the Hadamard (element-wise) product. It may be easily verified that when only positive training samples are available, the Lagrangian above correctly recovers that of Eq. 8, while in the presence of labelled negative training objects it matches that of Eq. 15. As a result, in the proposed approach, the generic optimisation problem to solve can be expressed as

max_α L(α)   s.t.  yᵀα = 1,  α ≥ 0    (18)

where y is the vector of labels and the Lagrangian L(α) is given in Eq. 17.
3.5 Kernel space representation
In many practical applications, instead of a rigid boundary, a more elastic description is favoured. In such cases, a reproducing kernel Hilbert space representation may be adopted. Inspecting the Lagrangian in Eq. 17, it can be observed that the training samples only appear in terms of inner products, which facilitates deriving a kernel-space representation for the proposed approach. Since in the kernel space it holds that ⟨φ(x_k), φ(x_{k'})⟩ = k(x_k, x_{k'}), where k(·,·) is the kernel function and φ(·) the implicit feature mapping, the Lagrangian in the reproducing kernel Hilbert space may be written as

L(α) = (y ∘ α)ᵀ κ − (y ∘ α)ᵀ K (y ∘ α) − ((p−1)/p^{p*}) Σ_k c_k^{−(p*−1)} α_k^{p*}    (19)

where K denotes the kernel matrix and κ its diagonal. If, additionally, all objects have unit length in the feature space, i.e. if k(x_k, x_k) = 1, ∀k, one may further simplify the Lagrangian. For this purpose, note that for normalised feature vectors we have κ = 1, and since, due to the constraints imposed, it must hold that yᵀα = 1, the term (y ∘ α)ᵀκ = yᵀα = 1 is a constant and can be safely dropped from the objective function without affecting the result. As a result, the optimisation problem for unit-length features reads

max_α −(y ∘ α)ᵀ K (y ∘ α) − ((p−1)/p^{p*}) Σ_k c_k^{−(p*−1)} α_k^{p*}   s.t.  yᵀα = 1,  α ≥ 0    (20)
As a widely used kernel function, the Gaussian kernel, by definition, yields unit-length feature vectors in the kernel space, and so the formulation above is applicable.
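This property is immediate from the kernel's definition, as the following small check illustrates (illustrative snippet, not from the paper):

```python
# The Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 s^2)) gives
# k(x, x) = exp(0) = 1 for every x, i.e. unit-length images in feature space,
# so the simplified objective of Eq. (20) applies.
import math

def gaussian_kernel(x, y, s=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * s ** 2))

print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))  # -> 1.0 for any x
```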
3.6 Decision strategy
Similar to the conventional SVDD approach, for decision making in the proposed ℓp slack norm method, the distance of an object to the centre of the description is gauged and employed as a dissimilarity criterion. The squared distance of an object x to the centre of the hypersphere in the kernel space is

d²(x) = k(x, x) − 2 Σ_k y_k α_k k(x_k, x) + Σ_{k,k'} y_k y_{k'} α_k α_{k'} k(x_k, x_{k'})    (21)

In order to compute the radius of the description, note that, by virtue of the complementary conditions in Eqs. 14c and 14d, if for an object the corresponding Lagrange multiplier is non-zero, the corresponding boundary constraint must hold with equality. Hence, the radius of the description may be computed as

R² = d²(x_s) − y_s ξ_s    (22)

where s indexes an object whose corresponding Lagrange multiplier is non-zero and ξ_s denotes its slack, recoverable from the stationarity conditions (Eqs. 12c–12d). The objects whose distance to the centre of the hypersphere is larger than the radius (subject to some margin) are classified as novel.
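The decision rule above can be sketched as a small scoring function (an assumed interface for illustration; `alpha_y` denotes the label-signed multipliers y_k α_k):

```python
# Sketch of the kernel-space decision rule: score a test point by its squared
# distance to the learned centre (Eq. 21-style expansion) and compare with R^2.
import numpy as np

def svdd_decision(k_xx, k_xX, alpha_y, K, R2):
    """k_xx: k(x, x); k_xX: vector of k(x_k, x); alpha_y: y_k * alpha_k;
    K: training kernel matrix; R2: squared radius. True means 'novel'."""
    d2 = k_xx - 2.0 * alpha_y @ k_xX + alpha_y @ K @ alpha_y
    return d2 > R2
```

In practice the constant term alpha_y @ K @ alpha_y is precomputed once, so scoring a test point costs one kernel evaluation per training object with a non-zero multiplier.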
4 Generalisation error bound
In this section, using Rademacher complexities, we characterise the generalisation error bound for the proposed ℓp slack norm SVDD approach.
Theorem 1
Let us assume F corresponds to a class of kernel-based linear functions:

F = { x ↦ ⟨w, φ(x)⟩ : ‖w‖ ≤ B }    (23)

Then the empirical Rademacher complexity of the function class F over m samples x_1, …, x_m, denoted as R̂_m(F), is bounded as Shawe-Taylor and Cristianini (2004)

R̂_m(F) ≤ (2B/m) √(tr(K))    (24)

where tr(·) denotes the matrix trace, K stands for the kernel matrix associated with the feature mapping φ(·), and B is an upper bound on the norm of the weight vector w.
Next, we present the main theorem concerning the generalisation performance of the proposed approach.
Theorem 2
In the proposed approach, assuming that ρ > 0 is a margin parameter, with confidence greater than 1 − δ, a test point x is incorrectly classified with probability bounded as

P{ y (R² − d²(x)) < 0 } ≤ (1/(mρ)) Σ_k ξ_k + (2/ρ) R̂_m(H) + 3 √(ln(2/δ)/(2m))    (25)

where y is the ground-truth label for observation x, the ξ_k's denote the training slacks, and H is the class of hypothesis functions defined in Theorem 5.
For the proof of Theorem 2, we first review a few relevant results and then present the proof.
Theorem 3
Assume δ ∈ (0, 1) and suppose G is a class of functions mapping from Z to [0, 1]. Let z_1, …, z_m be independent samples drawn according to a probability distribution D. Then, with probability higher than 1 − δ over z_1, …, z_m, for each g ∈ G it holds that Shawe-Taylor and Cristianini (2004)

E[g(z)] ≤ Ê[g(z)] + 2 R̂_m(G) + 3 √(ln(2/δ)/(2m))    (26)

where Ê[g(z)] is the empirical expectation of g on the random sample set and R̂_m(G) denotes the empirical Rademacher complexity of the function class G.
Theorem 4
If ℓ : R → R is Lipschitz with constant L_ℓ and satisfies ℓ(0) = 0, then the empirical Rademacher complexity of the composition function class ℓ ∘ F satisfies R̂_m(ℓ ∘ F) ≤ 2 L_ℓ R̂_m(F) Shawe-Taylor and Cristianini (2004).
Towards the proof of Theorem 2, we present the following theorem.
Theorem 5
Let us consider h as the hypothesis function defined as h(x) = y (d²(x) − R²), where d²(x) measures the squared distance of sample x with label y to the centre of the hypersphere in the feature space (see Eq. 21). For some fixed margin ρ > 0, we define g as

g(h) = 1 if h > 0;  1 + h/ρ if −ρ ≤ h ≤ 0;  0 if h < −ρ    (27)

g is Lipschitz and satisfies the conditions of Theorem 4. Then, with probability higher than 1 − δ over the training samples, it holds that

E[g(h(x))] ≤ (1/m) Σ_k g(h(x_k)) + (2/ρ) R̂_m(H) + 3 √(ln(2/δ)/(2m))    (28)

where H denotes the class of hypothesis functions h.
Proof
We have

(29)

where w stands for a vector collection of the model parameters, so that h can be expressed as a linear function in a suitably augmented feature space. Note that g is Lipschitz with constant 1/ρ. As with the zero-one loss, the margin loss above penalises any misclassified object, but it also penalises objects that are classified correctly with low confidence. In order to derive an upper bound on ‖w‖, observe that the constraints of the dual problem bound the multipliers α, and consequently, we have

(30)

Since the kernel function is bounded, using Eq. 21 and the empirical Rademacher complexity bound of Theorem 1 Shawe-Taylor and Cristianini (2004), we have

(31)

and hence

(32)

As a result, the hypothesis class H has a bounded empirical Rademacher complexity. Next, using Theorem 4 and Theorem 1, we have

(33)

Using Eq. 29 and Eq. 33 in Theorem 3, the proof of Theorem 5 is complete. Since the probability of misclassification is bounded by the expected margin loss, i.e. P{y(R² − d²(x)) < 0} ≤ E[g(h(x))], using Theorem 5, the proof of Theorem 2 is also complete.
As may be observed from Eq. 25, the parameter p directly affects the expected loss on the training set (the first term on the RHS of the equation) and also controls the Rademacher complexity (the second term on the RHS of Eq. 25) of the proposed method. As the error probability varies as a function of p, the utility of a free norm parameter in the proposed approach is justified. Note that, depending on the data and the margin parameter ρ, setting p = 1 may not minimise the RHS in Eq. 25, and hence may lead to an increased probability of misclassification. In practice, the norm parameter p may be adjusted according to the characteristics of the data using cross-validation to optimise the performance or to control the false acceptance/rejection rate. Note also that, since the parameter p appears in the dual problem through the α_k^{p*} terms (see Eq. 20), it also affects the sparsity of α.
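To give a feel for the complexity term, the sketch below evaluates the trace-based quantity of Theorem 1 (illustrative; B is an assumed bound on the weight-vector norm). For a Gaussian kernel, tr(K) = m, so the term decays as O(1/√m):

```python
# Illustrative computation of the trace-based Rademacher term (2B/m)*sqrt(tr(K))
# from Theorem 1.  For a Gaussian kernel, tr(K) = m (unit diagonal), so the
# complexity term shrinks like 1/sqrt(m) as the training set grows.
import math

def rademacher_term(trace_K, m, B=1.0):
    return (2.0 * B / m) * math.sqrt(trace_K)

for m in (100, 400, 1600):
    print(m, rademacher_term(trace_K=m, m=m))  # halves each time m quadruples
```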
5 Experiments
In this section, an experimental evaluation of the proposed approach is conducted, where we provide a comparison to some other variants of the SVDD approach as well as to baseline approaches on multiple datasets. The rest of this section is organised as follows.

In Section 5.1, we visualise the decision boundaries inferred by the proposed approach for different values of p on synthetic data.

In Section 5.2, the implementation details, the experimental setup, and the standard datasets used in the experiments are discussed.

In Section 5.3, the results of an experimental evaluation of the proposed approach in a pure one-class setting (labelled negative objects unavailable) are presented and compared with other methods on multiple datasets.

Section 5.4 provides the results of an experimental evaluation of the proposed approach in the presence of negative training samples along with a comparison against other methods on multiple datasets.
5.1 Decision boundaries
In order to visualise the effect of the norm parameter p on the inferred decision boundaries, we randomly generate 100 normally distributed 2D samples with a mean of 2 and a standard deviation of 3 in each direction. Using a Gaussian kernel function, the proposed approach is then run to derive a description of the data. The experiment is repeated for different values of p, where p = 1 corresponds to the original SVDD method in Tax and Duin (2004). The decision boundaries superimposed on the data are visualised in Fig. 1. From the figure, it may be observed that for p = 1 the method has inferred a boundary which separates a region of relatively low density in the middle of the distribution from the rest of the 2D space. For the random data samples generated in this experiment with a mean of 2, this clearly indicates a case of overfitting. By increasing the norm parameter above 1, the decision boundary better covers the mean of the distribution. More specifically, while for smaller values of p the boundary is tighter, for larger values the description tends to encapsulate a higher percentage of the normal samples. As will be discussed in the following sections, in the proposed method, we tune the parameter p using the validation set corresponding to each dataset.
5.2 Implementation details
In the experiments that follow, the features are first standardised by subtracting the mean computed over all positive training samples and dividing by the corresponding standard deviation, followed by normalising each feature vector to unit norm. The positive samples are divided randomly into three non-overlapping subsets to form the training, validation, and test sets. Similarly, the negative samples are divided randomly into three disjoint subsets for training, validation, and testing purposes. In order to minimise possible effects of random data partitioning on the performance, we repeat the procedure above 10 times and record the mean along with the standard deviation of the performance over these 10 trials. We set the parameters of all methods on the corresponding validation subset of each dataset; for the proposed approach, these are p and C. In all experiments, we use a Gaussian kernel whose width is set to half of the average pairwise Euclidean distance among all training objects. As the dual problem in Eq. 20 is convex, one may use different algorithms for optimisation. In this work, we use CVX Grant and Boyd (2014), a package for solving convex programmes.
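The preprocessing and kernel-width heuristic described above can be sketched as follows (an illustrative reading of the setup, not the authors' code):

```python
# Sketch of the described preprocessing: standardise with positive-class
# statistics, unit-normalise each sample, then set the Gaussian kernel width
# to half the mean pairwise Euclidean distance among training objects.
import numpy as np

def preprocess(X, mean, std):
    Z = (X - mean) / std
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

rng = np.random.default_rng(1)
X_pos = rng.normal(size=(50, 4))                  # toy positive training set
mean, std = X_pos.mean(axis=0), X_pos.std(axis=0)
Z = preprocess(X_pos, mean, std)

dists = np.sqrt(((Z[:, None] - Z[None, :]) ** 2).sum(-1))
sigma = 0.5 * dists[np.triu_indices(len(Z), k=1)].mean()  # kernel width
```

At test time, the same `mean` and `std` computed on the positive training split would be reused, so no test statistics leak into the model.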
Abbrev.  Dataset  #total  #positive  dim.  Source 

D1  Iris (virginica)  150  50  4  UCI 
D2  Hepatitis (normal)  155  123  19  UCI 
D3  Ecoli (periplasm)  336  52  7  UCI 
D4  Concordia16 (digit 1)  4000  400  256  CENPARMI 
D5  Delft pump (5x1)  336  133  64  Delft 
D6  Balancescale (middle)  625  49  4  UCI 
D7  Wine (2)  178  71  13  UCI 
D8  Waveform (0)  900  300  21  UCI 
D9  Survival (5yr)  306  81  3  UCI 
D10  Housing (MEDV35)  506  48  13  Statlib 
D11  glass6  214  185  9  KEEL 
D12  haberman  306  225  3  KEEL 
D13  led7digit  443  406  7  KEEL 
D14  pima  768  500  8  KEEL 
D15  wisconsin  683  444  9  KEEL 
D16  yeast(05679vs4)  528  477  8  KEEL 
D17  cleveland(0vs4)  177  164  13  KEEL 
D18  Breast(benign)  699  458  9  UCI 
D19  Survival (5yr)  306  225  3  UCI 
D20  FMNIST (Class ”1”)  1926  1027  784  Zalando 
In order to evaluate the proposed approach, 20 benchmark databases from the UCI repository Dua and Graff (2017), TU Delft University Tax and Duin (2006), the KEEL repository Alcala-Fdez et al. (2011), CENPARMI Cho (1997), Statlib Harrison and Rubinfeld (1978), and Zalando Xiao et al. (2017) are used. The datasets used in the experiments correspond to different application domains from varied sources. The statistics of the datasets are reported in Table 1. For the evaluation of the proposed approach, we conduct two sets of experiments. The first set follows a pure one-class classification paradigm, i.e. only positive samples are used to train the models. In the second set of experiments, negative objects are also deployed for model training. For comparison, we report the performance of the original SVDD approach of Tax and Duin (2004), denoted as "ℓ1-SVDD", and also its alternative variant which considers squared errors in the objective function, denoted as "ℓ2-SVDD" Chang et al. (2015). The proposed approach is denoted as "ℓp-SVDD" in the corresponding tables. We also provide a comparison of the proposed ℓp-SVDD method to some linear re-weighting variants of the SVDD approach, including P-SVDD Wang and Lai (2013), DW-SVDD Cha et al. (2014), and GL-SVDD Hu et al. (2021), as well as state-of-the-art OCC techniques. In particular, we have included kernel-based one-class classifiers which are applicable to moderately-sized datasets: the kernel Gaussian Process method (GP) Kemmler et al. (2013), the Kernel Null Foley-Sammon Transform (KNFST) Lin et al. (2008); Arashloo and Kittler (2021), and the Kernel Principal Component Analysis (KPCA) Hoffmann (2007).
Following the common approach in the literature, and in order to facilitate the comparison of different methods independently of a specific operating threshold, we report performance in terms of the AUC measure, i.e. the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve characterises the true positive rate against the false positive rate at various operating thresholds. A higher AUC indicates a better-performing system.
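As a minimal illustration of this measure, the AUC can be computed directly from a model's decision scores; the labels and scores below are hypothetical, standing in for the output of any of the trained one-class models:

```python
# Minimal AUC/ROC sketch with hypothetical scores: y_true marks target
# (1) vs outlier (0) test samples, and `scores` are decision values
# where higher means "more normal".
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])

fpr, tpr, thr = roc_curve(y_true, scores)   # ROC operating points
auc = roc_auc_score(y_true, scores)         # area under the ROC curve
print(f"AUC = {auc:.3f}")                   # → AUC = 0.875
```

Equivalently, the AUC is the probability that a randomly drawn target sample scores higher than a randomly drawn outlier (here 14 of the 16 target/outlier pairs are ordered correctly, giving 0.875).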
Dataset  GP  KPCA  KNFST  P-SVDD  DW-SVDD  GL-SVDD  SVDD  SVDD-L2  ℓp-SVDD  

D1  68.03±16.51  78.17±22.69  68.51±16.46  78.98±15.58  67.42±16.75  67.70±16.39  67.42±16.75  67.37±16.52  
D2  58.78±4.37  66.46±6.75  58.65±4.39  62.51±5.98  59.36±4.37  61.20±6.06  59.98±4.31  59.48±4.26  
D3  61.30±11.33  53.06±13.66  61.32±11.15  53.98±12.31  61.67±11.66  61.08±10.49  61.08±11.36  61.20±11.46  
D4  92.96±2.60  94.10±1.73  92.94±2.62  93.24±2.50  93.31±2.54  93.21±2.53  94.36±1.84  93.58±2.42  
D5  87.87±3.43  86.07±3.29  87.85±3.45  84.78±3.92  87.85±3.45  88.02±3.33  87.85±3.45  87.90±3.40  
D6  94.52±3.57  90.77±2.46  94.66±3.52  92.94±3.16  94.68±3.54  94.64±3.54  94.70±3.55  94.60±3.56  
D7  60.82±11.03  72.30±11.10  60.71±10.99  65.64±9.04  61.12±11.57  64.12±10.85  60.79±11.04  60.90±11.19  
D8  64.15±4.66  61.47±1.90  64.10±4.64  64.09±5.91  65.37±5.09  65.39±5.41  64.91±5.53  65.10±5.48  
D9  53.41±18.99  47.37±14.10  59.66±16.34  52.29±19.26  58.28±21.15  66.40±12.11  64.43±14.87  60.72±15.26  
D10  87.73±6.00  78.70±8.49  87.76±6.02  81.28±11.89  87.77±6.11  87.82±6.03  87.77±6.11  87.78±6.08  
D11  87.81±10.11  95.44±2.59  87.55±10.65  95.48±3.53  86.90±10.06  96.58±2.18  94.24±2.53  90.68±5.67  
D12  59.70±5.90  70.66±5.79  55.54±5.51  59.75±7.79  55.84±7.12  58.34±7.41  66.02±4.94  
D13  62.63±18.55  67.49±7.79  62.34±18.01  66.60±15.73  41.54±6.43  66.77±14.25  62.84±13.62  69.04±10.99  
D14  51.66±5.36  51.46±5.32  53.47±5.42  54.93±4.78  56.04±4.46  61.24±4.95  59.83±3.75  
D15  52.70±12.37  95.88±1.05  53.51±12.56  65.21±12.28  47.10±14.97  70.68±10.68  93.39±1.73  68.75±10.83  
D16  57.74±9.93  63.71±7.39  58.56±8.80  62.94±8.48  61.02±10.33  63.26±9.98  60.93±9.65  61.52±10.68  
D17  51.49±13.99  79.13±9.71  51.78±14.04  57.93±13.99  51.09±13.90  51.85±14.47  51.09±13.90  51.35±14.06  
D18  38.11±2.53  45.74±2.93  41.15±5.13  42.64±5.04  42.07±5.14  41.22±5.62  41.99±5.16  41.76±5.67  
D19  55.16±13.36  37.01±8.20  44.70±11.81  51.49±13.02  61.38±14.80  61.01±9.08  60.28±9.47  58.64±11.56  
D20  94.82±1.47  94.86±10.9  94.78±1.49  94.96±1.23  95.37±0.87  96.37±0.79  96.48±1.14 
5.3 Pure one-class setting
In this setting, only positive objects are used for training. Table 2 reports the performances of the different methods, where the norm parameter is tuned on the validation subset of each dataset to maximise performance. A number of observations from the table are in order. First, on all datasets the proposed ℓp-SVDD approach yields a superior performance compared with its SVDD and SVDD-L2 counterparts. On some datasets, such as D1 and D14, the improvement offered by the proposed approach is substantial, while on others, such as D17, it is even larger. It should be noted that these performance improvements are obtained despite the fact that the validation sets of some datasets may not be very large, and hence may not be very good representatives of the entire distribution of samples for tuning the norm parameter. A more representative validation set would be expected to yield even further improvements. A statistical ranking of the different methods in the pure one-class setting using the Friedman test is provided in Table 3. From the table, it can be observed that the proposed ℓp-SVDD approach ranks best, while the standard SVDD approach ranks much worse, which underlines the significance of the proposed slack-norm approach. Furthermore, although the SVDD-L2 method provides some improvement over the original SVDD approach, its performance is still inferior to that of the proposed method. The second-best performing method (in terms of average ranking) is the sample-reweighting SVDD approach of Hu et al. (2021), which uses global and local statistics to linearly weight the slacks in the objective function.
Algorithm  Rank 

GP  6.50 
KPCA  5.30 
KNFST  6.70 
P-SVDD  5.85 
DW-SVDD  5.72 
GL-SVDD  3.65 
SVDD  5.20 
SVDD-L2  4.80 
ℓp-SVDD (this work) 
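The average ranks above come from the Friedman procedure: each method is ranked per dataset by AUC and the ranks are averaged across datasets. A minimal sketch of this computation, on a small hypothetical score matrix rather than the full Table 2 results, might look as follows:

```python
# Friedman-style ranking sketch: rows = datasets, columns = methods.
# The AUC values below are hypothetical placeholders, not Table 2.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

methods = ["GP", "KPCA", "KNFST", "GL-SVDD"]
auc = np.array([
    [68.0, 78.2, 68.5, 67.7],
    [58.8, 66.5, 58.7, 61.2],
    [61.3, 53.1, 61.3, 61.1],
    [93.0, 94.1, 92.9, 93.2],
])

# Rank methods within each dataset (rank 1 = highest AUC; ties averaged).
ranks = rankdata(-auc, axis=1)
avg_rank = ranks.mean(axis=0)
for m, r in zip(methods, avg_rank):
    print(f"{m}: average rank {r:.2f}")

# Friedman test for systematic differences between methods.
stat, p = friedmanchisquare(*auc.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3f}")
```

A lower average rank indicates a method that is more consistently among the best across datasets, which is the quantity reported in Tables 3 and 5.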
5.4 Training in the presence of negative data
In this second evaluation setting, in addition to positive objects, labelled negative samples are also used for training. Table 4 reports the performances of the different methods in this setting. Note that, as in the pure one-class case, the optimal value of the norm parameter for the proposed approach is determined on the validation set. From among the GP, KPCA, and KNFST approaches, only KNFST is able to directly deploy negative samples for training. To emphasise that a method uses negative objects for training, a negative superscript ("−") is appended to its name in the table. We also include the P-SVDD, DW-SVDD, and GL-SVDD approaches trained using both negative and positive samples, denoted as P-SVDD−, DW-SVDD−, and GL-SVDD−. From Table 4, it can be observed that on all datasets the proposed ℓp-SVDD− approach obtains a better performance than its SVDD− and SVDD-L2− counterparts. In particular, while on some datasets the SVDD− and SVDD-L2− approaches are unable to effectively utilise negative training samples, the proposed method can better benefit from such samples to refine the description for improved performance. Compared with the linear sample-reweighting methods P-SVDD−, DW-SVDD−, and GL-SVDD−, the proposed approach also performs better. An average ranking of the different methods in this evaluation setting is provided in Table 5, from which it may be seen that the proposed ℓp-SVDD− approach ranks best among the competitors. Notably, neither the SVDD− nor the SVDD-L2− method ranks second; the second-best performing method in this setting is KNFST− Lin et al. (2008); Arashloo and Kittler (2021).
Dataset  KNFST−  P-SVDD−  DW-SVDD−  GL-SVDD−  SVDD−  SVDD-L2−  ℓp-SVDD− 

D1  95.90±5.48  39.45±21.77  74.38±24.80  64.62±19.09  58.53±21.63  59.79±22.50  
D2  69.25±11.91  55.88±4.63  58.98±5.02  66.41±6.91  59.38±4.99  59.60±5.12  
D3  70.80±4.94  52.53±8.94  60.70±7.55  60.71±8.34  59.19±8.86  59.95±8.56  
D4  93.27±1.71  94.92±1.45  93.33±1.79  94.46±1.23  93.88±1.23  
D5  93.02±2.11  88.63±5.07  91.17±3.49  91.17±3.49  91.17±3.49  91.19±3.45  
D6  90.29±12.84  88.73±3.46  92.61±4.15  92.56±4.14  92.62±4.14  92.41±4.23  
D7  93.70±3.77  50.84±8.34  73.84±6.50  58.59±8.39  58.13±9.14  58.66±8.90  
D8  90.01±1.31  64.97±3.74  66.93±3.24  67.20±3.15  66.95±3.30  65.95±3.35  
D9  64.14±9.19  48.56±10.65  83.08±10.33  63.30±11.58  59.92±9.96  60.72±15.15  
D10  89.37±7.15  82.92±7.86  86.66±8.53  85.88±8.47  86.66±8.53  86.58±8.58  
D11  96.52±3.62  84.95±11.15  94.90±1.78  96.50±1.86  94.02±3.96  89.88±5.76  
D12  52.52±7.84  58.84±6.19  63.06±5.22  69.44±7.54  62.10±9.37  65.95±6.16  
D13  91.11±7.62  68.97±12.57  47.65±13.31  69.24±13.25  69.54±10.59  66.53±8.41 