# ℓ_p Slack Norm Support Vector Data Description

The support vector data description (SVDD) approach serves as a de facto standard for one-class classification where the learning task entails inferring the smallest hyper-sphere to enclose target objects while linearly penalising any errors/slacks via an ℓ_1-norm penalty term. In this study, we generalise this modelling formalism to a general ℓ_p-norm (p≥1) slack penalty function. By virtue of an ℓ_p slack norm, the proposed approach enables formulating a non-linear cost function with respect to slacks. From a dual problem perspective, the proposed method introduces a sparsity-inducing dual norm into the objective function, and thus, possesses a higher capacity to tune into the inherent sparsity of the problem for enhanced descriptive capability. A theoretical analysis based on Rademacher complexities characterises the generalisation performance of the proposed approach in terms of parameter p while the experimental results on several datasets confirm the merits of the proposed method compared to other alternatives.

## 1 Introduction

One-class classification (OCC) addresses the problem of recognising patterns that adhere to a specific condition presumed as normal, and distinguishing them from any other object violating the normality criterion. OCC stands apart from the conventional two-/multi-class classification paradigm in that it primarily uses observations from a single class, very often the target class, for training. One-class classification acts as an essential building block in a diverse range of practical systems including presentation attack detection in biometrics Fatemifar et al. (2021), health care Carrera et al. (2019), audio or video surveillance Rabaoui et al. (2008); Zhang et al. (2020), intrusion detection Nader et al. (2014), social networks Chaker et al. (2017), safety-critical systems Budalakoti et al. (2009), fraud detection Kamaruddin and Ravi (2016), insurance Sundarkumar et al. (2015), etc.

As with many other machine learning problems, state-of-the-art OCC algorithms are built on the premise of deep learning methodology Goodfellow et al. (2016); Ruff et al. (2018); Erfani et al. (2016) using massive labelled datasets, typically containing millions of samples. Although deep structures have led to breakthroughs in one-class learning and classification, their reliance on huge sets of data may pose certain limitations in practice. In this context, collecting sufficiently large sets of training observations for certain applications can be a challenge, hindering a full exploitation of the expressive capacity of deep networks. Even if sufficient data is gathered, labelling such huge amounts of data may be a bigger challenge. Whilst crowd-sourcing may be considered as an applicable strategy to label huge sets of data in some fields, for a variety of reasons including level of knowledge, data privacy, time required to produce accurate labels, etc., it may not serve as a viable option in domains such as defence, security, and healthcare. Although certain techniques such as active learning Settles (2012) or learning with privileged information Vapnik and Izmailov (2015) may be instrumental in reducing the quantity of necessary labelling/labelled data, they still demand the time and domain expertise of a human operator. In the absence of large training sets required by deep nets, and specifically for small to moderate-sized datasets containing hundreds or thousands of training samples, kernel-based methods Cortes and Vapnik (1995) offer a very promising methodology of classification. Moreover, unlike deep networks that incorporate many heuristics with regard to their structure and the corresponding (hyper)parameters, kernel methods are based on solid foundations and are characterised by strong bases in optimisation and statistical learning theory Vapnik (1998).

The support vector data description (SVDD) approach Tax and Duin (2004), which is proposed as an adaptation of support vector machines to the one-class setting, presents a very popular kernel-based method for one-class classification. Although designed for the one-class setting, the SVDD approach does not require the training data to be exclusively and purely normal/positive, which can be regarded as quite an appealing property in practical applications where the data is very often contaminated with noise and outliers. Furthermore, it provides an intuitive geometric characterisation of a predominantly positive dataset without making any specific assumption regarding the underlying distribution. Moreover, the SVDD decision making process entails computing a simple distance between the centre of the target class and a test observation to label it as either normal (i.e. positive/target) or as an anomaly (i.e. negative/outlier). And finally, when large sets of training data are available, the SVDD method may be extended to a deep structure to directly learn features from the data for improved performance Ruff et al. (2018). These properties make SVDD a highly favoured method of practice in a variety of one-class classification applications where it serves as one of the most widely used techniques, if not the most.

The underlying idea in the SVDD approach is to determine the smallest hyper-sphere enclosing the data. While its hard-margin formulation requires all positive target data to be strictly encapsulated within the inferred hyper-spherical boundary, in practical situations, a dataset may incorporate noisy/outlier samples. In the soft margin SVDD approach, in order to take into account the possibility of a contaminated dataset and improve the generalisation capability of the model, the distance from each training object to the centre of the hyper-sphere need not be strictly smaller than the radius but larger distances are penalised. In order to encode and penalise violations from the hyper-spherical decision boundary, in the soft margin variant, non-negative slack variables measuring the extent of violation of each object from the decision boundary are introduced. The optimisation problem is then modified to reflect such violations and penalise an $\ell_1$-norm term on the slacks. In other words, the conventional SVDD method, and also the standard two-class SVM classifier Cortes and Vapnik (1995), are founded on the idea of minimising an $\ell_1$-norm risk over the set of non-negative slack variables. In the context of two-class classification, very recently Vapnik and Izmailov (2021), the classical $\ell_1$-norm penalty term in the SVM formulation has been revisited to consider two alternative slack penalties defined by the $\ell_2$- and the $\ell_\infty$-norms to formulate new SVM algorithms. A reformulation of the standard two-class SVM to the $\ell_2$ and $\ell_\infty$ penalty terms has been verified to improve the classification performance, sometimes significantly Vapnik and Izmailov (2021).

In this work, we study the merits of different norm risks for "one-class" classification in the context of the SVDD approach. For this purpose, we consider a general $\ell_p$-norm ($p\ge 1$) slack penalty term where $p$ serves as a free parameter of the algorithm. As such, while in the standard SVDD method the slacks are penalised linearly, by introducing an $\ell_p$-norm function, non-linear cost functions of the slacks may be optimised where the degree of the non-linearity (i.e. $p$) may be tuned on the data. The reflection of the slack penalty term onto the dual space formulation of the corresponding optimisation problem turns out to be a dual $\ell_q$-norm ($q=p/(p-1)$) cost on the dual variables, thus providing the method with the capability to tune into the inherent sparsity of the problem.

### 1.1 Contributions

The major contributions of the current study may be summarised as listed below.

• We generalise the SVDD formulation from an $\ell_1$ to an $\ell_p$ slack norm penalty function and illustrate that the proposed generalisation may lead to significant improvements in the performance of the algorithm;

• We extend the proposed $\ell_p$-norm formulation from a pure one-class setting to the training scenario where labelled negative objects are also available and illustrate the merits offered by the proposed extension;

• Based on Rademacher complexities, we theoretically study the generalisation performance of the proposed $\ell_p$ slack norm approach and derive bounds on its error;

• And we carry out an experimental evaluation of the proposed method on multiple OCC datasets and provide a comparison to the original SVDD method and its different variants, as well as other OCC techniques from the literature.

### 1.2 Organisation

The remainder of the paper is structured as follows. In Section 2, the relevant literature with a particular emphasis on different variants of the SVDD formalism is reviewed. In Section 3, after a short overview of the support vector data description (SVDD) approach Tax and Duin (2004), we present our proposed $\ell_p$ slack norm SVDD approach for the pure one-class setting and then derive its extension for labelled negative training observations. Section 4 studies the generalisation error bound of the proposed approach based on Rademacher complexities. We present and analyse the results of an experimental evaluation of the proposed method in Section 5 where possible extensions of the proposed approach are also discussed. Finally, Section 6 concludes the paper.

## 2 Prior work

Although different categorisations of OCC methods exist in different studies Chandola et al. (2009); Khan and Madden (2014); Pimentel et al. (2014); Perera et al. (2021); Tax and Duin (2004), the one-class classification techniques may be roughly identified as either generative or non-generative Kittler et al. (2014), the latter best represented by discriminative approaches. While in the generative techniques, the objective is to model the underlying generative process of the data, the discriminative methods try to directly partition the observation space into different regions for classification. Discriminative approaches tend to yield better performance in practice since they try to explicitly solve the OCC problem without attempting to solve an intermediate and more general task of inferring the underlying distribution or generative process Vapnik (1998).

Generative OCC approaches encompass the methods that try to estimate the underlying distribution using, for example, Parzen windowing Bishop (1994), Gaussian distribution modelling Parra et al. (1996), or those which use a mixture of distributions Platzer et al. (2008); Fatemifar et al. (2022). A different sub-category of generative approaches includes methods that for decision making use the residual of reconstructing a test sample with respect to a hypothesised model, some instances of which are the kernel principal component analysis (KPCA) and its variants Hoffmann (2007); Xiao et al. (2013), or the autoencoder-based techniques Abati et al. (2019).

Discriminative methods constitute a strong alternative to the generative one-class learners. As an instance, based on a variant of the Fisher classification principle, the kernel null space method Bodesheim et al. (2013) tries to map positive objects onto a single point in a discriminative subspace, obtaining very competitive results compared to some other alternatives Arashloo and Kittler (2021). Another successful discriminative one-class method focuses on the use of Gaussian Process (GP) priors Kemmler et al. (2013), trying to directly infer the a posteriori class probability of the target class. A further example of discriminative one-class learners is that of the nearest neighbour-based approaches Tax and Duin (2000) where the normality of an object is decided based on its immediate neighbours.

Among others, a widely applied discriminative one-class classification method is the support vector data description (SVDD) approach Tax and Duin (2004) that tries to estimate the smallest volume surrounding the positive objects. In the case of the existence of labelled negative training objects, the decision boundary is refined by requiring the negative objects to lie outside the hyper-spherical boundary. The soft version of this approach allows the positive and negative (if any) training objects to violate the boundary criterion but subject to a linear penalty on the extent of the violation (called slack) where a parameter controls the trade-off between the volume and such errors in the objective function. Due to its success in data description, its intuitive geometrical interpretation and the ability to benefit from a kernel-based representation, the SVDD approach serves as a widely used technique in the OCC literature, motivating much subsequent research. As an instance, in Hu et al. (2021), based on the observation that the SVDD centre and the volume are sensitive to the parameter controlling the trade-off between the errors (slacks) and the volume, a method called GL-SVDD is proposed where local and global probability densities are used to derive sample-adaptive errors via associating weights to the slacks corresponding to different objects. In Wang and Lai (2013), a different sample-specific weighting approach (P-SVDD) based on the position of the feature space image is proposed to adaptively regularise the complexity of the SVDD sphere. Other work Cha et al. (2014) (DW-SVDD) considers re-weighting sample errors in the objective function by utilising the relative density of each object to the density distribution of normal samples. The authors in Lee et al. (2005) define a density-based distance between a sample point and the centre of the hyper-sphere to adjust the constraint set of the SVDD optimisation problem by re-weighting training objects. The work in Chen et al. (2015) considers a different linear sample weighting scheme in the SVDD objective function by introducing the cut-off distance-based local density of objects. The work in Wu and Ye (2009) introduces a margin parameter to maximise the margin between the hyper-sphere and the non-target objects in an SVDD formulation and directly optimises the margin. The Euclidean distance ($\ell_2$ distance) employed in the widely used Gaussian kernel function is reassessed in Nader et al. (2014) to see if other distances in the Gaussian kernel function may provide performance advantages. Apart from the research focused on improving the performance of the SVDD method in a one-class setting, there also exist other studies where the SVDD approach is generalised to two Huang et al. (2011), or to multiple classes Turkoz et al. (2020).

Considering the body of work discussed above, one observes that the majority of the existing studies try to modify the slack error term by introducing an adaptive weighting for each data sample based on different cues. Clearly, a simple linear weighting scheme does not change the linearity of the objective function with respect to the slacks. The exceptions to the studies above are the work in Tsang et al. (2005); Chang et al. (2015) where instead of an $\ell_1$-norm penalty, an $\ell_2$-norm penalty is considered over the slacks. As will be demonstrated in the subsequent sections, an $\ell_2$ slack penalty may not always yield an optimal performance for data description. The current study is a generalisation of the existing SVDD formulations as it considers an $\ell_p$ ($p\ge 1$) slack norm penalty where $p$ serves as a free parameter of the algorithm, allowing for different non-linear penalties to be optimised w.r.t. slacks while at the same time providing the opportunity to tune into the inherent sparsity characteristics of the data.

## 3 Methodology

In this section, first, we briefly review the SVDD method Tax and Duin (2004) and then present the proposed approach.

### 3.1 Preliminaries

The Support Vector Data Description (SVDD) approach Tax and Duin (2004) tries to estimate the smallest hyperspherical volume that encloses normal/target data in some pre-determined feature space. As a hypersphere is characterised by its centre $O$ and its radius $R$, the learning problem in the SVDD method is defined as minimising the radius while requiring the hypersphere to encapsulate all normal objects $x_i$’s, that is

$$\min_{R,O}\; E(R,O)=R^2 \quad \text{s.t.}\quad \|x_i-O\|_2^2\le R^2,\ \forall i \tag{1}$$

In practice, however, the training data might be contaminated with noise and outliers. In order to handle possible outliers in the training set and derive a solution with a better generalisation capability, the objective function in the SVDD method is modified so that the distance from the centre to each training observation need not be strictly smaller than $R$; rather, larger distances are penalised. For this purpose, using non-negative slack variables $\zeta_i$’s, the SVDD optimisation task is modified as

$$\min_{R,O,\zeta}\; E(R,O,\zeta)=R^2+c\sum_i\zeta_i \quad \text{s.t.}\quad \|x_i-O\|_2^2\le R^2+\zeta_i,\ \zeta_i\ge 0,\ \forall i \tag{2}$$

where $\zeta$ denotes a vector collection of $\zeta_i$’s and the trade-off between the sum of errors (i.e. $\zeta_i$’s) and the squared radius is controlled using parameter $c$. The optimisation problem above corresponds to the case where only normal samples (and possibly a minority of noisy objects) are presumed to exist in the training set. When labelled negative training objects are also available, the learning problem in the SVDD method is modified to enforce positive samples to lie within the hyper-sphere while negative samples are encouraged to fall outside its boundary.

The SVDD objective function in Eq. 2 depends on an $\ell_1$-norm of the slack variables as $\|\zeta\|_1=\sum_i\zeta_i$, and consequently, all errors/slacks are penalised linearly. Although penalising all errors linearly in their magnitudes is a plausible option, it is by no means the only possibility. As an instance, a different alternative may be to penalise only the maximum error/slack, which can be achieved by incorporating an $\ell_\infty$-norm on the slacks as $\|\zeta\|_\infty=\max_i\zeta_i$. Any other penalty which would lie between penalising all the slacks linearly and penalising only the maximum error may then be characterised using a general $\ell_p$-norm on the errors, i.e. via $\|\zeta\|_p^p=\sum_i\zeta_i^p$. In particular, introducing a variable norm parameter $p$ opens the door to consider non-linear penalties on the errors compared with the original SVDD method which is limited to a linear penalty on the slacks. From a dual problem viewpoint, introducing an $\ell_p$ norm penalty on the slacks translates into sparsity-inducing dual norms on the dual variables which provides the opportunity to better consider the intrinsic sparsity of the problem. As such, in the proposed approach, we generalise the SVDD error function using an $\ell_p$-norm function of slacks, discussed next.
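To make the difference between the penalties concrete, the following minimal sketch compares two slack vectors with the same $\ell_1$ cost under several values of $p$. The function name `slack_penalty` and the toy slack values are illustrative assumptions, not part of the original method description; the finite-$p$ case returns $\sum_i\zeta_i^p$ and the $p=\infty$ case returns the maximum slack.

```python
def slack_penalty(zeta, p):
    """Slack cost: sum_i zeta_i^p for finite p, max_i zeta_i for p = inf.

    Illustrative helper; zeta is assumed to hold non-negative slacks.
    """
    if p == float("inf"):
        return max(zeta)
    return sum(z ** p for z in zeta)


# Two hypothetical slack vectors with identical l_1 cost.
many_small = [0.5, 0.5, 0.5, 0.5]
one_large = [2.0, 0.0, 0.0, 0.0]

for p in (1, 2, 4, float("inf")):
    print(p, slack_penalty(many_small, p), slack_penalty(one_large, p))
```

For $p=1$ both vectors cost the same, whereas for larger $p$ the single large violation is penalised increasingly more heavily than several small ones.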

### 3.2 ℓ_p slack norm SVDD

By replacing the $\ell_1$-norm term on the slack variables in Eq. 2 with an $\ell_p$-norm, the optimisation problem in the proposed approach is defined as

$$\min_{R,O,\zeta}\; E(R,O,\zeta)=R^2+c\sum_i\zeta_i^p \quad \text{s.t.}\quad \|x_i-O\|_2^2\le R^2+\zeta_i,\ \zeta_i\ge 0,\ \forall i \tag{3}$$
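As a small illustration of Eq. 3, the sketch below evaluates the primal objective for a fixed candidate centre and squared radius in the input space. It is not a solver: the function name `lp_svdd_primal`, the toy data and the parameter values are assumptions made for this example, and the slacks are simply set to the smallest values that satisfy the constraints, $\zeta_i=\max(0,\|x_i-O\|_2^2-R^2)$.

```python
def lp_svdd_primal(X, centre, radius_sq, c, p):
    """Evaluate R^2 + c * sum_i zeta_i^p (Eq. 3) for a fixed centre and R^2.

    The slacks take the smallest feasible values,
    zeta_i = max(0, ||x_i - O||^2 - R^2). All names are illustrative.
    """
    slacks = []
    for x in X:
        dist_sq = sum((xk - ok) ** 2 for xk, ok in zip(x, centre))
        slacks.append(max(0.0, dist_sq - radius_sq))
    objective = radius_sq + c * sum(z ** p for z in slacks)
    return objective, slacks


# Hypothetical toy data; the last point plays the role of an outlier.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
for p in (1, 2, 4):
    value, slacks = lp_svdd_primal(X, centre=(0.0, 0.0), radius_sq=1.0, c=0.1, p=p)
    print(p, value, slacks)
```

With the sphere held fixed, increasing $p$ makes the outlier's contribution to the objective grow rapidly, which is the non-linear cost behaviour the formulation is designed to expose.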

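In order to solve the optimisation problem above, the Lagrangian is formed as

$$\mathcal{L}=R^2+c\sum_i\zeta_i^p-\sum_i\alpha_i\left[R^2+\zeta_i-\left(\|x_i\|_2^2-2O^\top x_i+\|O\|_2^2\right)\right]-\sum_i\gamma_i\zeta_i \tag{4}$$

where $\alpha_i$’s and $\gamma_i$’s are non-negative Lagrange multipliers. In order to derive the dual function, the Lagrangian should be minimised with respect to the primal variables $R$, $O$, $\zeta$. Setting the partial derivatives of $\mathcal{L}$ w.r.t. $R$, $O$, and $\zeta_i$ to zero yields

$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial R}=0 &\;\Rightarrow\; \sum_i\alpha_i=1 && \text{(5a)}\\
\frac{\partial\mathcal{L}}{\partial O}=0 &\;\Rightarrow\; O=\sum_i\alpha_ix_i && \text{(5b)}\\
\frac{\partial\mathcal{L}}{\partial\zeta_i}=0 &\;\Rightarrow\; \zeta_i=\left(\frac{\alpha_i+\gamma_i}{cp}\right)^{\frac{1}{p-1}} && \text{(5c)}
\end{aligned}$$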

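Substituting the relations above into $\mathcal{L}$, the Lagrangian is obtained as

$$\mathcal{L}=(cp)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha+\gamma\|_{p/(p-1)}^{p/(p-1)}+\sum_i\alpha_ix_i^\top x_i-\sum_i\sum_j\alpha_i\alpha_jx_i^\top x_j \tag{6}$$

where $\alpha$ and $\gamma$ denote vector collections of $\alpha_i$’s and $\gamma_i$’s. Furthermore, one can easily check that Slater’s condition is satisfied, and thus, the following complementary conditions also hold at the optimum:

$$\begin{aligned}
\gamma_i\zeta_i&=0,\ \forall i && \text{(7a)}\\
\alpha_i\left(R^2+\zeta_i-\|x_i-O\|_2^2\right)&=0,\ \forall i && \text{(7b)}
\end{aligned}$$

Using Eq. 5c and Eq. 7a, it must hold that $\gamma_i\left(\frac{\alpha_i+\gamma_i}{cp}\right)^{\frac{1}{p-1}}=0$. Since $\alpha_i\ge 0$ and $\gamma_i\ge 0$, one concludes that $\gamma_i=0$, $\forall i$. As a result, the Lagrangian in Eq. 6 would be simplified as

$$\mathcal{L}=(cp)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha\|_{p/(p-1)}^{p/(p-1)}+\sum_i\alpha_ix_i^\top x_i-\sum_i\sum_j\alpha_i\alpha_jx_i^\top x_j \tag{8}$$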
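The coefficient of the norm term can be checked with a short calculation. The worked step below substitutes the stationarity condition for $\zeta_i$ into the slack-dependent part of the Lagrangian; the shorthands $s_i=\alpha_i+\gamma_i$ and $q=p/(p-1)=1+\frac{1}{p-1}$ are introduced here only for brevity.

```latex
\begin{aligned}
c\,\zeta_i^{p}-s_i\zeta_i
 &= c\left(\frac{s_i}{cp}\right)^{q}-s_i\left(\frac{s_i}{cp}\right)^{\frac{1}{p-1}} \\
 &= \frac{1}{p}\,(cp)^{-\frac{1}{p-1}}\,s_i^{q}-(cp)^{-\frac{1}{p-1}}\,s_i^{q} \\
 &= (cp)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)s_i^{q}
\end{aligned}
```

Summing over $i$ gives the term $(cp)^{-\frac{1}{p-1}}(1/p-1)\|\alpha+\gamma\|_q^q$, and setting $\gamma=0$ leaves the same expression in $\alpha$ alone. The remaining terms follow from $\sum_i\alpha_i=1$, which cancels $R^2$, and from $O=\sum_i\alpha_ix_i$, which turns $\sum_i\alpha_i\|x_i-O\|_2^2$ into $\sum_i\alpha_ix_i^\top x_i-\sum_i\sum_j\alpha_i\alpha_jx_i^\top x_j$.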

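The dual problem entails maximising $\mathcal{L}$ in $\alpha$:

$$\max_{\alpha}\;\mathcal{L}\quad\text{s.t.}\quad \alpha\ge 0,\ \|\alpha\|_1=1 \tag{9}$$

Note that, for $p>1$, we have $p/(p-1)>1$, and consequently, the term $\|\alpha\|_{p/(p-1)}^{p/(p-1)}$ in the Lagrangian is convex w.r.t. $\alpha$; since it enters Eq. 8 with the negative coefficient $(cp)^{-\frac{1}{p-1}}(1/p-1)$, its contribution to $\mathcal{L}$ is concave. Note also that the other terms in $\mathcal{L}$ are either linear or negated positive semi-definite quadratic functions of $\alpha$, and hence are concave as well, while the constraints are affine. As a result, maximising $\mathcal{L}$ in Eq. 9 is a convex optimisation task.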
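Because the dual is a concave maximisation over the probability simplex, even a very simple first-order scheme can be used to experiment with it. The sketch below uses projected gradient ascent with a fixed step size; this solver choice, the linear kernel, the function names and the toy data are all assumptions made for illustration and are not the optimiser described in the source. It assumes $p>1$.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        t = (cumulative - 1.0) / k
        if uk - t > 0:
            theta = t  # the last k satisfying the condition fixes the threshold
    return [max(vi - theta, 0.0) for vi in v]


def linear_gram(X):
    """Gram matrix of inner products x_i^T x_j (linear kernel)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def dual_objective(alpha, K, c, p):
    """Lagrangian of Eq. 8 written with a Gram matrix K; requires p > 1."""
    q = p / (p - 1.0)
    coeff = (c * p) ** (-1.0 / (p - 1.0)) * (1.0 / p - 1.0)  # negative for p > 1
    n = len(alpha)
    norm_term = coeff * sum(a ** q for a in alpha)
    linear = sum(alpha[i] * K[i][i] for i in range(n))
    quad = sum(alpha[i] * alpha[j] * K[i][j] for i in range(n) for j in range(n))
    return norm_term + linear - quad


def solve_dual(K, c, p, steps=2000, lr=0.01):
    """Projected gradient ascent on Eq. 9 (illustrative, fixed step size)."""
    n = len(K)
    q = p / (p - 1.0)
    coeff = (c * p) ** (-1.0 / (p - 1.0)) * (1.0 / p - 1.0)
    alpha = [1.0 / n] * n  # feasible starting point
    for _ in range(steps):
        grad = []
        for i in range(n):
            k_alpha = sum(K[i][j] * alpha[j] for j in range(n))
            grad.append(coeff * q * alpha[i] ** (q - 1.0) + K[i][i] - 2.0 * k_alpha)
        alpha = project_to_simplex([a + lr * g for a, g in zip(alpha, grad)])
    return alpha


# Hypothetical toy data: four corners of a square and its midpoint.
X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
K = linear_gram(X)
alpha = solve_dual(K, c=1.0, p=2.0)
print(alpha, dual_objective(alpha, K, 1.0, 2.0))
```

The step size and iteration count are hand-picked for this toy problem; for other data or for $p>2$, where the gradient of the norm term is less well behaved near zero, a dedicated convex solver would be the safer choice.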

### 3.3 ℓ_p slack norm SVDD with negative samples

In the proposed $\ell_p$ slack norm approach, similar to the original SVDD method Tax and Duin (2004), when labelled non-target/negative training observations are available, they may be utilised to refine the description. In this case, as opposed to the positive samples that should be enclosed within the hypersphere, the non-target objects should lie outside its boundary. In what follows, the normal/positive samples are indexed by $i$, $j$ and the negative objects by $l$, $m$. In order to allow for possible errors in both the positive and the negative training samples, slack variables $\zeta_i$’s and $\zeta_l$’s are introduced. The optimisation problem when labelled negative samples are available is then defined as

$$\begin{aligned}
\min_{R,O,\zeta}\;& E(R,O,\zeta)=R^2+c_1\sum_i\zeta_i^p+c_2\sum_l\zeta_l^p\\
\text{s.t.}\;& \|x_i-O\|_2^2\le R^2+\zeta_i,\ \zeta_i\ge 0,\ \forall i,\\
& \|x_l-O\|_2^2\ge R^2-\zeta_l,\ \zeta_l\ge 0,\ \forall l
\end{aligned} \tag{10}$$

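In the objective function above, while $c_1$ may be used to control the fraction of positive training objects that fall outside the hypersphere boundary, $c_2$ may be adjusted to regulate the fraction of negative training samples that will lie within the hypersphere. By introducing non-negative Lagrange multipliers $\alpha_i$, $\alpha_l$, $\gamma_i$, $\gamma_l$, the Lagrangian of Eq. 10 is formed as

$$\begin{aligned}
\mathcal{L}={}&R^2+c_1\sum_i\zeta_i^p+c_2\sum_l\zeta_l^p-\sum_i\gamma_i\zeta_i-\sum_l\gamma_l\zeta_l\\
&-\sum_i\alpha_i\left[R^2+\zeta_i-\left(\|x_i\|_2^2-2O^\top x_i+\|O\|_2^2\right)\right]\\
&-\sum_l\alpha_l\left[\left(\|x_l\|_2^2-2O^\top x_l+\|O\|_2^2\right)-R^2+\zeta_l\right]
\end{aligned} \tag{11}$$

In order to form the dual function, the Lagrangian should be minimised w.r.t. $R$, $O$, $\zeta_i$’s, and $\zeta_l$’s. Setting the partial derivatives of $\mathcal{L}$ w.r.t. $R$, $O$, $\zeta_i$, and $\zeta_l$ to zero yields

$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial R}=0 &\;\Rightarrow\; \sum_i\alpha_i-\sum_l\alpha_l=1 && \text{(12a)}\\
\frac{\partial\mathcal{L}}{\partial O}=0 &\;\Rightarrow\; O=\sum_i\alpha_ix_i-\sum_l\alpha_lx_l && \text{(12b)}\\
\frac{\partial\mathcal{L}}{\partial\zeta_i}=0 &\;\Rightarrow\; \zeta_i=\left(\frac{\alpha_i+\gamma_i}{c_1p}\right)^{\frac{1}{p-1}} && \text{(12c)}\\
\frac{\partial\mathcal{L}}{\partial\zeta_l}=0 &\;\Rightarrow\; \zeta_l=\left(\frac{\alpha_l+\gamma_l}{c_2p}\right)^{\frac{1}{p-1}} && \text{(12d)}
\end{aligned}$$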

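Resubstituting the relations above into Eq. 11 gives

$$\begin{aligned}
\mathcal{L}={}&(c_1p)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha_T+\gamma_T\|_{p/(p-1)}^{p/(p-1)}+(c_2p)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha_N+\gamma_N\|_{p/(p-1)}^{p/(p-1)}\\
&+\sum_i\alpha_ix_i^\top x_i-\sum_l\alpha_lx_l^\top x_l-\sum_i\sum_j\alpha_i\alpha_jx_i^\top x_j-\sum_l\sum_m\alpha_l\alpha_mx_l^\top x_m\\
&+2\sum_i\sum_l\alpha_i\alpha_lx_i^\top x_l
\end{aligned} \tag{13}$$

where $\alpha_T$ and $\alpha_N$ respectively stand for vector collections of $\alpha_i$’s and $\alpha_l$’s. Similarly, $\gamma_T$ and $\gamma_N$ denote vector collections of $\gamma_i$’s and $\gamma_l$’s, respectively. Since Slater’s condition holds, the following complementary conditions are also satisfied at the optimum:

$$\begin{aligned}
\gamma_i\zeta_i&=0,\ \forall i && \text{(14a)}\\
\gamma_l\zeta_l&=0,\ \forall l && \text{(14b)}\\
\alpha_i\left(R^2+\zeta_i-\|x_i-O\|_2^2\right)&=0,\ \forall i && \text{(14c)}\\
\alpha_l\left(R^2-\zeta_l-\|x_l-O\|_2^2\right)&=0,\ \forall l && \text{(14d)}
\end{aligned}$$

Using Eqs. 12c and 14a, and also Eqs. 12d and 14b, one concludes that $\gamma_i=0$, $\forall i$ and $\gamma_l=0$, $\forall l$. As a result, the Lagrangian in Eq. 13 would be

$$\begin{aligned}
\mathcal{L}={}&(c_1p)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha_T\|_{p/(p-1)}^{p/(p-1)}+(c_2p)^{-\frac{1}{p-1}}\left(\frac{1}{p}-1\right)\|\alpha_N\|_{p/(p-1)}^{p/(p-1)}\\
&+\sum_i\alpha_ix_i^\top x_i-\sum_l\alpha_lx_l^\top x_l-\sum_i\sum_j\alpha_i\alpha_jx_i^\top x_j-\sum_l\sum_m\alpha_l\alpha_mx_l^\top x_m\\
&+2\sum_i\sum_l\alpha_i\alpha_lx_i^\top x_l
\end{aligned} \tag{15}$$

and the dual problem entails maximising it subject to the constraints on the multipliers:

$$\max_{\alpha_T,\alpha_N}\;\mathcal{L}\quad\text{s.t.}\quad \alpha_T\ge 0,\ \alpha_N\ge 0,\ \|\alpha_T\|_1-\|\alpha_N\|_1=1 \tag{16}$$

Since $p>1$ leads to $p/(p-1)>1$, the terms $\|\alpha_T\|_{p/(p-1)}^{p/(p-1)}$ and $\|\alpha_N\|_{p/(p-1)}^{p/(p-1)}$ in the Lagrangian are convex while the remaining terms are either linear or quadratic functions and the constraint sets are affine. Consequently, the maximisation problem in Eq. 16 is convex.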

### 3.4 Joint formulation

As discussed earlier, when only positive labelled training observations are available, in the proposed approach one solves the optimisation problem in Eq. 9 with the Lagrangian given in Eq. 8. When, in addition to the positive training samples, labelled negative training objects are also available, the problem to be solved is expressed as the optimisation task in Eq. 16 with the corresponding Lagrangian given in Eq. 15. Although the optimisation tasks corresponding to the pure positive case and to the second scenario where negative training samples are also available may appear different, both optimisation problems can be expressed compactly using a joint formulation as follows. Let us assume that vector $y$ corresponds to the labels of training samples where for positive objects the label is $+1$ while for any possible non-target training samples the corresponding label is $-1$. Furthermore, suppose the Lagrange multipliers associated with the negative and positive samples are all collected into a single vector $\alpha$. In order to reduce the clutter in the formulation, let us further denote $q=p/(p-1)$ and collect the constant factors of the norm terms into $\bar{c}_1$ and $\bar{c}_2$; matching Eq. 15 term by term (note that $1+y_i=2$ for positive and $1-y_i=2$ for negative objects) gives $\bar{c}_k=2^{-q}(1-1/p)(c_kp)^{-\frac{1}{p-1}}$, $k\in\{1,2\}$. With these definitions, the Lagrangian in Eq. 15 may be expressed as

$$\mathcal{L}=-\bar{c}_1\|\alpha\odot(1+y)\|_q^q-\bar{c}_2\|\alpha\odot(1-y)\|_q^q+\sum_i\alpha_iy_ix_i^\top x_i-\sum_{i,j}\alpha_iy_i\alpha_jy_jx_i^\top x_j \tag{17}$$

where $\odot$ denotes the Hadamard/elementwise product. It may be easily verified that when only positive training samples are available, the Lagrangian above correctly recovers that of Eq. 8 while in the existence of labelled negative training objects, it matches that of Eq. 15. As a result, in the proposed approach, the generic optimisation problem to solve can be expressed as

$$\max_{\alpha}\;\mathcal{L}\quad\text{s.t.}\quad \alpha\ge 0,\ y^\top\alpha=1 \tag{18}$$

where $y$ is the vector of labels and the Lagrangian is given as Eq. 17.
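The equivalence between the split and the joint norm terms can be checked numerically. The sketch below assumes the constants $\bar{c}_k=2^{-q}(1-1/p)(c_kp)^{-1/(p-1)}$ with $q=p/(p-1)$, which is the choice that makes the joint expression agree with the two separate norm terms; the function names and the toy multipliers are hypothetical.

```python
def split_norm_terms(alpha_t, alpha_n, c1, c2, p):
    """Norm terms of the split Lagrangian (positive and negative parts)."""
    q = p / (p - 1.0)
    k1 = (c1 * p) ** (-1.0 / (p - 1.0)) * (1.0 / p - 1.0)
    k2 = (c2 * p) ** (-1.0 / (p - 1.0)) * (1.0 / p - 1.0)
    return k1 * sum(a ** q for a in alpha_t) + k2 * sum(a ** q for a in alpha_n)


def joint_norm_terms(alpha, y, c1, c2, p):
    """Norm terms of the joint Lagrangian using labels y in {+1, -1}.

    The constants c1_bar and c2_bar are an assumption of this sketch,
    chosen so that the result matches split_norm_terms.
    """
    q = p / (p - 1.0)
    c1_bar = 2.0 ** (-q) * (1.0 - 1.0 / p) * (c1 * p) ** (-1.0 / (p - 1.0))
    c2_bar = 2.0 ** (-q) * (1.0 - 1.0 / p) * (c2 * p) ** (-1.0 / (p - 1.0))
    pos = sum((a * (1 + yi)) ** q for a, yi in zip(alpha, y))  # only y = +1 survives
    neg = sum((a * (1 - yi)) ** q for a, yi in zip(alpha, y))  # only y = -1 survives
    return -c1_bar * pos - c2_bar * neg


# Hypothetical multipliers satisfying sum(alpha_T) - sum(alpha_N) = 1.
alpha_t, alpha_n = [0.7, 0.6], [0.3]
alpha, y = alpha_t + alpha_n, [1, 1, -1]
print(split_norm_terms(alpha_t, alpha_n, 0.5, 2.0, 3.0))
print(joint_norm_terms(alpha, y, 0.5, 2.0, 3.0))
```

The masks $(1+y)$ and $(1-y)$ zero out the multipliers of the other class, so a single vector $\alpha$ and a label vector suffice to express both training scenarios.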

### 3.5 Kernel space representation

In many practical applications, instead of a rigid boundary, a more elastic description is favoured. In such cases, a reproducing kernel Hilbert space representation may be adopted. Inspecting the Lagrangian in Eq. 17, it can be observed that the training samples only appear in terms of inner products, which facilitates deriving a kernel-space representation for the proposed approach. Since in the kernel space it holds that $\phi(x_i)^\top\phi(x_j)=\kappa(x_i,x_j)$ where $\kappa(\cdot,\cdot)$ is the kernel function, the Lagrangian in the reproducing kernel Hilbert space may be written as

$$\mathcal{L}=-\bar{c}_1\|\alpha\odot(1+y)\|_q^q-\bar{c}_2\|\alpha\odot(1-y)\|_q^q+\sum_i\alpha_iy_i\kappa(x_i,x_i)-(\alpha\odot y)^\top K(\alpha\odot y) \tag{19}$$

where $K$ denotes the kernel matrix. If, additionally, all objects have unit length in the feature space, i.e. if $\kappa(x_i,x_i)=1$, $\forall i$, one may further simplify the Lagrangian. For this purpose, note that for normalised feature vectors we have $\kappa(x_i,x_i)=1$, and since due to the constraints imposed it must hold that $y^\top\alpha=\sum_i\alpha_iy_i=1$, the term $\sum_i\alpha_iy_i\kappa(x_i,x_i)$ is constant and can be safely dropped from the objective function without affecting the result. As a result, the optimisation problem for unit-length features shall be

$$\min_{\alpha}\;\bar{c}_1\|\alpha\odot(1+y)\|_q^q+\bar{c}_2\|\alpha\odot(1-y)\|_q^q+(\alpha\odot y)^\top K(\alpha\odot y)\quad\text{s.t.}\quad \alpha\ge 0,\ y^\top\alpha=1 \tag{20}$$

As a widely used kernel function, the Gaussian kernel yields, by definition, unit-length feature vectors in the kernel space, and the formulation above is applicable.
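The unit-length property is easy to confirm numerically. The sketch below builds a Gaussian kernel matrix for a few toy points and inspects its diagonal; the bandwidth parametrisation $\exp(-\|x-z\|_2^2/(2\sigma^2))$, the function names and the data are assumptions of this example.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)); sigma is illustrative."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(X, kernel):
    """Kernel matrix K with entries K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


X = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]
K = gram_matrix(X, gaussian_kernel)
print([K[i][i] for i in range(len(X))])  # every diagonal entry is kappa(x_i, x_i)
```

Since $\kappa(x,x)=\exp(0)=1$ for every $x$, each diagonal entry equals one regardless of the bandwidth, which is what allows the linear term to be dropped from the objective.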

### 3.6 Decision strategy

Similar to the conventional SVDD approach, for decision making in the proposed $\ell_p$ slack norm method, the distance of an object to the centre of the description is gauged and employed as a dissimilarity criterion. The squared distance of an object $z$ to the centre of the hypersphere in the kernel space is

$$f(z)=\|\phi(z)-\phi(O)\|_2^2=\kappa(z,z)-2\sum_i\alpha_iy_i\kappa(z,x_i)+(\alpha\odot y)^\top K(\alpha\odot y) \tag{21}$$

In order to compute the radius of the description, note that the complementary conditions in Eqs. 14c and 14d may be compactly represented as $\alpha_i\left(R^2+y_i\zeta_i-\|\phi(x_i)-\phi(O)\|_2^2\right)=0$, $\forall i$. As a result, if for an object $x_j$ the corresponding Lagrange multiplier is non-zero, it must hold that $R^2+y_j\zeta_j-\|\phi(x_j)-\phi(O)\|_2^2=0$, and hence, the radius of the description may be computed as

$$\begin{aligned}
R^2&=\|\phi(x_j)-\phi(O)\|_2^2-y_j\zeta_j\\
&=\kappa(x_j,x_j)-2\sum_i\alpha_iy_i\kappa(x_j,x_i)+(\alpha\odot y)^\top K(\alpha\odot y)-y_j\zeta_j
\end{aligned} \tag{22}$$

where $j$ indexes an object whose corresponding Lagrange multiplier is non-zero. The objects whose distance to the centre of the hyper-sphere is larger than the radius (subject to some margin) would be classified as novel.
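The decision rule can be sketched in a few lines once the multipliers are known. In the example below the multipliers are hand-picked rather than obtained by solving the dual, the slack of the reference object is assumed to be zero, and the Gaussian kernel parametrisation and all function names are illustrative assumptions.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel; the bandwidth parametrisation is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def distance_sq(z, alpha, y, X, kernel=gaussian_kernel):
    """Squared kernel-space distance f(z) of z to the centre (Eq. 21)."""
    w = [a * yi for a, yi in zip(alpha, y)]  # elementwise product alpha * y
    cross = sum(wi * kernel(z, xi) for wi, xi in zip(w, X))
    centre = sum(w[i] * w[j] * kernel(X[i], X[j])
                 for i in range(len(X)) for j in range(len(X)))
    return kernel(z, z) - 2.0 * cross + centre


def radius_sq(j, zeta_j, alpha, y, X, kernel=gaussian_kernel):
    """R^2 from an object j with a non-zero multiplier and slack zeta_j (Eq. 22)."""
    return distance_sq(X[j], alpha, y, X, kernel) - y[j] * zeta_j


def is_novel(z, r_sq, alpha, y, X, margin=0.0, kernel=gaussian_kernel):
    """Flag z as novel when f(z) exceeds R^2 by more than the margin."""
    return distance_sq(z, alpha, y, X, kernel) > r_sq + margin


# Hypothetical training objects and multipliers (not the output of a solver).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [1, 1, 1]
alpha = [1.0 / 3.0] * 3
r_sq = radius_sq(0, 0.0, alpha, y, X)  # object 0 assumed to lie on the boundary

print(is_novel((5.0, 5.0), r_sq, alpha, y, X))
print(is_novel((1.0 / 3.0, 1.0 / 3.0), r_sq, alpha, y, X))
```

Only kernel evaluations against training objects with non-zero multipliers contribute to `distance_sq`, so a sparser $\alpha$ directly reduces the cost of scoring a test object.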

## 4 Generalisation error bound

In this section, using the Rademacher complexities, we characterise the generalisation error bound for the proposed $\ell_p$ slack norm SVDD approach.

###### Theorem 1

Let us assume $\mathcal{F}$ corresponds to a class of kernel-based linear functions:

$$\mathcal{F}=\left\{x\to w^\top\phi(x),\ \|w\|_2\le B\right\} \tag{23}$$

then the empirical Rademacher complexity of function class $\mathcal{F}$ over $n$ samples $\{x_i\}_{i=1}^n$, denoted as $\hat{R}_n(\mathcal{F})$, is bounded as Shawe-Taylor and Cristianini (2004)

$$\hat{R}_n(\mathcal{F})\le\frac{2B}{n}\sqrt{\mathrm{tr}(K)}\le\frac{2BB_\kappa}{\sqrt{n}} \tag{24}$$

where $\mathrm{tr}(\cdot)$ denotes matrix trace, $K$ stands for the kernel matrix associated with the feature mapping $\phi(\cdot)$, and $B_\kappa^2$ is an upper bound on the kernel function $\kappa(\cdot,\cdot)$.
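Both bounds of Theorem 1 depend only on the diagonal of the kernel matrix. The following sketch evaluates them for a given kernel matrix, under the assumption that $B_\kappa$ is taken as the square root of the largest diagonal entry; the function name and the toy matrices are illustrative.

```python
import math


def rademacher_bounds(K, B):
    """Return the trace-based bound and the looser uniform bound.

    Assumes B_kappa = sqrt(max_i K[i][i]), so that tr(K) <= n * B_kappa^2.
    """
    n = len(K)
    trace = sum(K[i][i] for i in range(n))
    b_kappa = math.sqrt(max(K[i][i] for i in range(n)))
    trace_bound = 2.0 * B / n * math.sqrt(trace)
    uniform_bound = 2.0 * B * b_kappa / math.sqrt(n)
    return trace_bound, uniform_bound


# Unit diagonal (e.g. a Gaussian kernel): the two bounds coincide.
K_unit = [[1.0, 0.2, 0.1, 0.0],
          [0.2, 1.0, 0.3, 0.1],
          [0.1, 0.3, 1.0, 0.2],
          [0.0, 0.1, 0.2, 1.0]]
print(rademacher_bounds(K_unit, B=1.0))

# Unequal diagonal: the trace-based bound is tighter.
K_mixed = [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 2.25]]
print(rademacher_bounds(K_mixed, B=1.0))
```

For kernels with a constant diagonal the second inequality holds with equality, so nothing is lost by quoting the simpler $2BB_\kappa/\sqrt{n}$ form in that case.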

Next, we present the main theorem concerning the generalisation performance of the proposed approach.

###### Theorem 2

In the proposed approach, assuming that $\upsilon$ is a margin parameter, with confidence greater than $1-\Delta$, a test point $x$ is incorrectly classified with the probability bounded as

$$P\left[y\left(f(x)-R^2\right)>\upsilon\right]\le\frac{1}{n\upsilon^p}\|\zeta\|_p^p+\frac{4pBB_\kappa}{\upsilon^p\sqrt{n}}\left(B^2+3B_\kappa^2+R^2\right)^{p-1}+3\sqrt{\frac{\ln(2/\Delta)}{2n}} \tag{25}$$

where $y$ is the ground truth label for observation $x$.

For the proof of Theorem 2, first, we review a few relevant results and then present the proof.

###### Theorem 3

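Assume $\Delta\in(0,1)$ and suppose $\mathcal{G}$ is a function class from $\mathcal{X}$ to $[0,1]$. Let $\{x_i\}_{i=1}^n$ be independent samples that are drawn according to a probability distribution $\mathcal{D}$. Then with a probability higher than $1-\Delta$ over $\{x_i\}_{i=1}^n$, for each $g\in\mathcal{G}$ it holds that Shawe-Taylor and Cristianini (2004)

$$E_{\mathcal{D}}[g(x)]\le\hat{E}[g(x)]+\hat{R}_n(\mathcal{G})+3\sqrt{\frac{\ln(2/\Delta)}{2n}} \tag{26}$$

where $\hat{E}[g(x)]$ is the empirical expectation of $g$ on the random sample set and $\hat{R}_n(\mathcal{G})$ denotes the empirical Rademacher complexity of the function class $\mathcal{G}$.

###### Theorem 4

If $A:\mathbb{R}\to\mathbb{R}$ is $L$-Lipschitz and satisfies $A(0)=0$, then the empirical Rademacher complexity of the composition function class $A\circ\mathcal{F}$ satisfies $\hat{R}_n(A\circ\mathcal{F})\le 2L\hat{R}_n(\mathcal{F})$ Shawe-Taylor and Cristianini (2004).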

Towards the proof of Theorem 2, we present the following theorem.

###### Theorem 5

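Let us consider $h$ as the hypothesis function defined as $h(x)=f(x)-R^2$ where $f(x)$ measures the distance of sample $x$ with label $y$ to the centre of the hypersphere in the feature space (see Eq. 21). For some fixed margin $\upsilon$, we define $g$ as

$$g(x)=A(yh(x))=\begin{cases}0 & \text{if } yh(x)\le 0;\\ \left(yh(x)/\upsilon\right)^p & \text{if } 0\le yh(x)\le\upsilon;\\ 1 & \text{else.}\end{cases} \tag{27}$$

$A$ is $L$-Lipschitz and satisfies $A(0)=0$. Then with a probability higher than $1-\Delta$ over $\{x_i\}_{i=1}^n$ it holds

$$E_{\mathcal{D}}[g(x)]\le\frac{1}{\upsilon^pn}\|\zeta\|_p^p+\frac{4Bp}{n\upsilon^p}\left(B^2+3B_\kappa^2+R^2\right)^{p-1}\sqrt{\mathrm{tr}(K)}+3\sqrt{\frac{\ln(2/\Delta)}{2n}} \tag{28}$$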

Proof
We have

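$$\hat{E}[g(x)]=\frac{1}{n}\sum_ig(x_i)\le\frac{1}{n\upsilon^p}\sum_i\left(y_i\left(f(x_i)-R^2\right)\right)_+^p=\frac{1}{n\upsilon^p}\sum_i\zeta_i^p=\frac{1}{n\upsilon^p}\|\zeta\|_p^p \tag{29}$$

where $(\cdot)_+=\max(0,\cdot)$ and $\zeta$ stands for a vector collection of all $\zeta_i$’s. Note that $A(\cdot)$ is Lipschitz with some constant $L$. As with the zero-one loss, the margin loss above penalises any misclassified objects but also penalises when it correctly classifies an object with low confidence. In order to derive an upper bound on $L$, observe that on $0\le yh(x)\le\upsilon$ we have $A(yh(x))=\left(yh(x)/\upsilon\right)^p$ with $|yh(x)|=|f(x)-R^2|$, and consequently, we have

$$\left\|\frac{\partial A}{\partial(yh(x))}\right\|_2=\frac{p}{\upsilon^p}\left\|f(x)-R^2\right\|_2^{p-1}\le\frac{p}{\upsilon^p}\left(\|f(x)\|_2+R^2\right)^{p-1} \tag{30}$$

Since the kernel function is bounded by $B_\kappa^2$, using Eq. 21, and the fact that the squared norm of the centre satisfies $(\alpha\odot y)^\top K(\alpha\odot y)\le B^2$ Shawe-Taylor and Cristianini (2004), we have

$$\|f(x)\|_2\le B^2+3B_\kappa^2 \tag{31}$$

and hence

$$\left\|\frac{\partial A}{\partial(yh(x))}\right\|_2\le\frac{p}{\upsilon^p}\left(B^2+3B_\kappa^2+R^2\right)^{p-1} \tag{32}$$

As a result, $A$ is Lipschitz with constant $L=\frac{p}{\upsilon^p}\left(B^2+3B_\kappa^2+R^2\right)^{p-1}$.

Next, using Theorem 4 and Theorem 1, we have

$$\hat{R}_n(\mathcal{G})\le 2L\hat{R}_n(\mathcal{F})\le\frac{4BB_\kappa L}{\sqrt{n}}\le\frac{4pBB_\kappa}{\upsilon^p\sqrt{n}}\left(B^2+3B_\kappa^2+R^2\right)^{p-1} \tag{33}$$

Using Eq. 29 and Eq. 33 in Theorem 3, the proof of Theorem 5 is complete. Since $g(x)=1$ whenever $yh(x)>\upsilon$, we have $P\left[y\left(f(x)-R^2\right)>\upsilon\right]\le E_{\mathcal{D}}[g(x)]$, and using Theorem 5, the proof of Theorem 2 is completed.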

As may be observed from Eq. 25, parameter $p$ directly affects the expected loss on the training set (the first term on the RHS of the equation) and also controls the Rademacher complexity (the second term on the RHS of Eq. 25) of the proposed method. As the error probability varies as a function of $p$, the utility of a free norm parameter in the proposed approach is justified. Note that depending on the values of the other quantities in the bound and the margin parameter $\upsilon$, setting $p=1$ may not minimise the RHS in Eq. 25, and hence, may lead to an increased probability of misclassification in the proposed approach. In practice, the norm parameter $p$ may be adjusted according to the characteristics of the data using cross validation to optimise the performance or to control the false acceptance/rejection rate. Note also that, since parameter $p$ appears in the dual problem through the $\ell_q$-norm terms with $q=p/(p-1)$ (see Eq. 20), it also affects the sparsity of $\alpha$.
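To see how the three terms of the bound react to the norm parameter, the right-hand side of Eq. 25 can be evaluated directly. The sketch below does this for hand-picked, purely hypothetical values of the slacks and constants; with such toy numbers the bound may well exceed one and is only meant to show the dependence on $p$.

```python
import math


def error_bound(zeta, p, margin, B, B_kappa, R_sq, delta):
    """Right-hand side of Eq. 25 for given slacks and constants.

    All argument values used below are illustrative assumptions.
    """
    n = len(zeta)
    empirical = sum(z ** p for z in zeta) / (n * margin ** p)
    complexity = (4.0 * p * B * B_kappa / (margin ** p * math.sqrt(n))
                  * (B ** 2 + 3.0 * B_kappa ** 2 + R_sq) ** (p - 1))
    confidence = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return empirical + complexity + confidence


# Hypothetical slack vector: mostly zero with a few small violations.
zeta = [0.0] * 96 + [0.1, 0.2, 0.3, 0.5]
for p in (1.0, 1.5, 2.0, 3.0):
    print(p, error_bound(zeta, p, margin=2.0, B=0.5, B_kappa=0.5, R_sq=0.25, delta=0.05))
```

In an actual experiment the slacks themselves change with $p$, which is one reason the source recommends selecting $p$ by cross validation rather than from the bound alone.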

## 5 Experiments

In this section, an experimental evaluation of the proposed approach is conducted where we provide a comparison to some other variants of the SVDD approach as well as to baseline approaches on multiple datasets. The rest of this section is organised as detailed next.

• In Section 5.1, we visualise the decision boundaries inferred by the proposed approach for different values of p on synthetic data.

• In Section 5.2, the implementation details, the experimental set-up, and the standard datasets used in the experiments are discussed.

• In Section 5.3, the results of an experimental evaluation of the proposed approach in a pure one-class setting (labelled negative objects unavailable) are presented and compared with other methods on multiple datasets.

• Section 5.4 provides the results of an experimental evaluation of the proposed approach in the presence of negative training samples along with a comparison against other methods on multiple datasets.

### 5.1 Decision boundaries

In order to visualise the effect of the norm parameter p on the inferred decision boundaries, we randomly generate 100 normally distributed 2D samples with a mean of 2 and a standard deviation of 3 in each direction. Using a Gaussian kernel function, the proposed approach is then run to derive a description of the data. The experiment is repeated for different values of p, where p=1 corresponds to the original SVDD method in Tax and Duin (2004). The decision boundaries superimposed on the data are visualised in Fig. 1. From the figure, it may be observed that for the case of p=1 the method has inferred a boundary which separates a region of relatively low density in the middle of the distribution from the rest of the 2D space. For the random data samples generated in this experiment with a mean of 2, this clearly indicates a case of over-fitting. By increasing the norm parameter above 1, the decision boundary better covers the mean of the distribution. More specifically, while for smaller values of p the boundary is tighter, for larger values the description tends to encapsulate a higher percentage of the normal samples. As will be discussed in the following sections, in the proposed method, we tune parameter p using the validation set corresponding to each dataset.
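The synthetic set-up above can be sketched as follows. This is a minimal illustration, not the kernelised method of the paper: it works in the input space, fixes an illustrative centre and radius rather than optimising them, and assumes a slack penalty of the form R² + C Σ ξ_i^p. The function name `lp_slack_objective`, the value of `C`, and the 90% quantile radius are all hypothetical choices made only to show how p changes the cost assigned to the same slacks.

```python
import numpy as np

def lp_slack_objective(X, centre, radius_sq, C, p):
    """Assumed objective R^2 + C * sum_i xi_i^p, where
    xi_i = max(0, ||x_i - c||^2 - R^2) is the slack of sample i."""
    sq_dist = np.sum((X - centre) ** 2, axis=1)
    slacks = np.maximum(0.0, sq_dist - radius_sq)
    return radius_sq + C * np.sum(slacks ** p), slacks

# 100 normally distributed 2D samples, mean 2 and std 3 in each direction
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(100, 2))

# Illustrative (not optimised) sphere: sample mean as centre,
# radius chosen so that roughly 10% of the samples fall outside
centre = X.mean(axis=0)
sq_dist = np.sum((X - centre) ** 2, axis=1)
radius_sq = np.quantile(sq_dist, 0.9)

for p in (1, 2, 4):
    value, slacks = lp_slack_objective(X, centre, radius_sq, C=0.1, p=p)
    print(f"p={p}: objective={value:.2f}, samples with slack={np.count_nonzero(slacks)}")
```

For the same sphere, larger p penalises large slacks much more heavily than small ones, which is the non-linear cost behaviour the ℓ_p slack norm is meant to provide.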

### 5.2 Implementation details

In the experiments that follow, the features are first standardised by subtracting the mean computed over all positive training samples and then dividing by the standard deviation followed by normalising each feature vector to have a unit -norm. The positive samples are divided randomly into three non-overlapping subsets to form the training, validation, and the test sets. Similarly, the negative samples are divided randomly into three disjoint subsets for training, validation and testing purposes. In order to minimise possible effects of random data partitioning on the performance, we repeat the procedure above 10 times, and record the mean along with the standard deviation of the performance over these 10 trials. We set the parameters of all methods on the corresponding validation subset of each dataset. In particular, for the proposed approach and . In all experiments, we use a Gaussian kernel the width of which is set to half of the average pairwise Euclidean distance among all training objects. As the dual problem in Eq. 20 is convex, one may use different algorithms for optimisation. In this work, we use CVX Grant and Boyd (2014), a package for solving convex programmes.
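The preprocessing and kernel-width heuristic described above might be sketched as below. Several details are assumptions, since the text does not pin them down: the standard deviation is taken over the same positive training samples as the mean, the unit norm is taken to be the Euclidean norm, and the Gaussian kernel is written as exp(−‖x−y‖²/(2σ²)). All function names are hypothetical.

```python
import numpy as np

def standardise_and_normalise(X_train_pos, X):
    """Standardise X with statistics of the positive training samples,
    then scale each feature vector to unit Euclidean norm (assumed norm)."""
    mean = X_train_pos.mean(axis=0)
    std = X_train_pos.std(axis=0)
    std = np.where(std == 0, 1.0, std)        # guard against constant features
    Z = (X - mean) / std
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    return Z / norms

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between the rows of A and the rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.maximum(sq, 0.0)                # clip tiny negative round-off

def kernel_width(X_train):
    """Half of the average pairwise Euclidean distance among training objects."""
    d = np.sqrt(pairwise_sq_dists(X_train, X_train))
    upper = np.triu_indices(len(X_train), k=1)  # each unordered pair once
    return 0.5 * d[upper].mean()

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix under the assumed exp(-d^2 / (2 sigma^2)) form."""
    return np.exp(-pairwise_sq_dists(A, B) / (2.0 * sigma ** 2))

# Usage on placeholder data
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 5))
Z = standardise_and_normalise(X_raw, X_raw)
sigma = kernel_width(Z)
K = gaussian_kernel(Z, Z, sigma)
```

The resulting kernel matrix `K` is what a convex solver would consume when optimising the dual problem.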

In order to evaluate the proposed approach, 20 benchmark databases from the UCI repository Dua and Graff (2017), TU Delft Tax and Duin (2006), the KEEL repository Alcala-Fdez et al. (2011), CENPARMI Cho (1997), Statlib Harrison and Rubinfeld (1978), and Zalando Xiao et al. (2017) are used. The datasets correspond to different application domains from varied sources, and their statistics are reported in Table 1. For the evaluation of the proposed approach, we conduct two sets of experiments. The first set follows a pure one-class classification paradigm, i.e. only positive samples are used to train the models. In the second set of experiments, negative objects are also deployed for model training. For comparison, we report the performance of the original SVDD approach of Tax and Duin (2004), denoted as “ℓ_1-SVDD”, and also its alternative variant which considers squared errors in the objective function, denoted as “ℓ_2-SVDD” Chang et al. (2015). The proposed approach is denoted as “ℓ_p-SVDD” in the corresponding tables. We also provide a comparison of the proposed ℓ_p-SVDD method to some linear re-weighting variants of the SVDD approach including P-SVDD Wang and Lai (2013), DW-SVDD Cha et al. (2014), and GL-SVDD Hu et al. (2021) as well as state-of-the-art OCC techniques. In particular, we have included kernel-based one-class classifiers which are applicable to moderately-sized datasets. These are the kernel Gaussian Process method (GP) Kemmler et al. (2013), the Kernel Null Foley-Sammon Transform (KNFST) Lin et al. (2008); Arashloo and Kittler (2021), and the Kernel Principal Component Analysis (KPCA) Hoffmann (2007).

Following the common approach in the literature, and in order to compare the performances of different methods independently of a specific operating threshold, we report the performances in terms of the AUC measure, i.e. the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate against the false positive rate at various operating thresholds. A higher AUC indicates a better performance.
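As a small illustration of the measure, the AUC can be computed directly from its rank interpretation: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below assumes that a higher score means "more target-like"; the function name `auc_score` is hypothetical.

```python
import numpy as np

def auc_score(scores_pos, scores_neg):
    """AUC as the probability that a random positive outscores a random
    negative (ties count 0.5). Assumes higher score = more target-like."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = np.sum(pos > neg) + 0.5 * np.sum(pos == neg)
    return wins / (pos.size * neg.size)

# Three of the four (positive, negative) pairs are ordered correctly
print(auc_score([0.9, 0.3], [0.5, 0.1]))  # 0.75
```

This pairwise form is quadratic in the number of samples, which is acceptable for the moderately-sized datasets considered here.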

### 5.3 Pure one-class setting

In this setting, only positive objects are used for training. Table 2 reports the performances of different methods in this setting, where we set parameter p on the validation subset of each dataset to maximise the performance. A number of observations from the table are in order. First, on all datasets the proposed ℓ_p-SVDD approach yields a superior performance compared to its ℓ_1-SVDD and ℓ_2-SVDD variants. In particular, on some datasets such as D1 and D14, the improvement in the performance offered by the proposed approach is substantial while on some other datasets such as D17 the improvement is huge and reaches . It should be noted that the performance improvements offered by the proposed approach are obtained despite the fact that the validation sets of some datasets may not be very large, and hence, may not be well representative of the entire distribution of samples for tuning parameter p. It is expected that a more representative validation set would lead to even further improvements in the performance. A statistical ranking of different methods in the pure one-class setting using Friedman's test is provided in Table 3. From the table, it can be observed that the proposed ℓ_p-SVDD approach ranks the best among all approaches while the standard ℓ_1-SVDD approach ranks much worse, which underlines the significance of the proposed slack norm approach. Furthermore, although the ℓ_2-SVDD method provides some improvement with respect to the original ℓ_1-SVDD approach, its performance is still inferior to that of the proposed method. The second best performing method (in terms of average ranking) is the sample re-weighting SVDD approach presented in Hu et al. (2021), which uses global and local statistics to linearly weight slacks in the objective function.
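The average ranking used in a Friedman-type comparison can be sketched as follows. The AUC values in the example are toy numbers invented purely for illustration and are not taken from Table 2. The sketch assumes no ties within a dataset and uses the standard Friedman chi-square statistic; the function name is hypothetical.

```python
import numpy as np

def average_ranks_and_friedman(auc):
    """auc: array of shape (n_datasets, n_methods), higher is better.
    Returns the average rank of each method (1 = best) and the Friedman
    chi-square statistic. Assumes no ties within a dataset."""
    n, k = auc.shape
    # Double argsort converts scores to ranks; negate so the best gets rank 1
    ranks = np.argsort(np.argsort(-auc, axis=1), axis=1) + 1
    avg_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    return avg_ranks, chi2

# Toy AUCs for 4 datasets (rows) and 3 methods (columns); illustrative only
toy_auc = np.array([
    [0.90, 0.85, 0.80],
    [0.70, 0.75, 0.60],
    [0.95, 0.90, 0.85],
    [0.88, 0.80, 0.82],
])
avg_ranks, chi2 = average_ranks_and_friedman(toy_auc)
print(avg_ranks, chi2)
```

A lower average rank indicates a method that is more consistently among the best across datasets, which is how the rankings in such tables are read.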

### 5.4 Training in the presence of negative data

In this second evaluation setting, in addition to positive objects, labelled negative samples are also used for training. Table 4 reports the performances of different methods in this setting. Note that, as in the case of pure one-class learning, the optimal value of p for the proposed approach is determined on the validation set. Among the GP, KPCA and KNFST approaches, only the KNFST approach is able to directly deploy negative samples for training. In order to emphasise that a method uses negative objects for training, a negative exponent (“⁻”) is used in the table. We also include the P-SVDD, DW-SVDD, and GL-SVDD approaches trained using both negative and positive samples and denote them as P-SVDD⁻, DW-SVDD⁻, and GL-SVDD⁻. From Table 4, it can be observed that on all datasets the proposed ℓ_p-SVDD approach obtains a better performance than its ℓ_1-SVDD and ℓ_2-SVDD variants. In particular, while on some datasets the ℓ_1-SVDD and ℓ_2-SVDD approaches are unable to effectively utilise negative training samples, the proposed ℓ_p-SVDD method can better benefit from such samples to refine the description for improved performance. When compared with the linear sample re-weighting methods P-SVDD, DW-SVDD, and GL-SVDD, the proposed approach also performs better. An average ranking of different methods in this evaluation setting is provided in Table 5. From Table 5 it may be seen that the proposed ℓ_p-SVDD approach utilising negative objects for training ranks the best among all competitors. Furthermore, neither the ℓ_1-SVDD nor the ℓ_2-SVDD method using negative training samples ranks second. The second best performing method in this setting is the KNFST method Lin et al. (2008); Arashloo and Kittler (2021).