1 Introduction
One-class novelty detection refers to the recognition of abnormal patterns in data that was modelled as normal. Abnormal data, also known as outliers, anomalies or alien data, are patterns that belong to classes different from the normal class. The goal of the novelty detection field is to identify patterns that differ from the normal ones and classify them accordingly. Many machine learning techniques in the field of novelty and outlier detection aim to decide whether a new instance belongs to the same distribution as the training data or behaves so differently that it should be considered an outlier. Typically, outlier detection is unsupervised and the goal is to estimate the density of clusters to discover possible outliers. We focus on novelty detection with a semi-supervised approach, trained without outliers, whose goal is to decide whether a new observation is an outlier. The literature offers many conventional one-class classifiers that address the novelty detection problem, such as OCSVM
scholkopf2001estimating and MST_CD juszczak2009minimum; GrassaGCO19; DBLP_LAGRASSA. Due to the high complexity of some data types, such as images and audio signals, these novelty detection methods also suffer from poor performance on high-dimensional data, so dimensionality reduction techniques are required. To address this problem, techniques such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are commonly used for dimensionality reduction, as is classical feature selection based on statistical metrics. These approaches are task-dependent and need an expert supervisor. In contrast to traditional machine learning approaches, deep learning models such as GANs
schlegl2017unsupervised; perera2019ocgan and Deep One-Class (DOC) perera2019learning are able to extract these features independently of the particular task to be solved. Few deep learning approaches exist in the literature to solve novelty detection problems. Our focus in this paper is to investigate a generic method for one-class classification that uses a convolutional neural network as a deep feature extractor together with minimum spanning trees for pattern recognition. To the best of our knowledge, no previous work has used a convolutional neural network jointly with a graph-based model. In this work, we extend two of our previous graph-based works
GrassaGCO19; DBLP_LAGRASSA and use them as a one-class novelty detection approach exploiting deep features. Our work makes the following contributions:
We extend our previous works on MSTs and use them jointly with a convolutional neural network to solve novelty detection problems.

To prove the effectiveness and robustness of the proposed approach, we evaluate it on two well-known publicly available datasets, where we achieve the state of the art across many tasks.
2 Related work
In this section, we briefly introduce the main approaches used for novelty detection and highlight their advantages and disadvantages. In general, the problem of one-class classification is harder than the usual two-class classification problem. In two-class classification, the decision boundary is supported from both sides by examples of each class. In one-class classification only one side of the boundary is covered, and it is hard to find the best separation between the target and outlier classes. Anomaly detection and one-class classification are problems related to one-class novelty detection
chalapathy2019deep. Both have similar objectives: to detect out-of-class samples given a set of in-class samples. In one-class classification a hard label is expected to be assigned to a given image, so its performance is measured using detection accuracy and the F1 score. In contrast, novelty detection is only expected to associate a novelty score with a given image, so its performance is measured using a Receiver Operating Characteristic (ROC) curve. The supervised approach offers better performance than unsupervised novelty detection techniques gornitz2013toward. Models that use this approach learn a separating hyperplane, or a generic decision boundary, to discriminate data instances and then predict whether a test instance lies inside or outside this boundary. Deep models based on a supervised approach fail when the feature space is high-dimensional and nonlinear, and these methods require abundant training data from both classes (normal and abnormal), which is usually unavailable. By contrast, the unsupervised approach distinguishes the normal and abnormal classes without knowing the labels of the data instances. Usually, these methods are used to automate the process of data labelling. Autoencoders are used as unsupervised deep architectures in novelty detection
baldi2012autoencoders; hinton2006reducing. When labelled data are unavailable this approach offers good results, but it is often challenging to learn common features among instances in high-dimensional spaces with highly nonlinear data distributions. Semi-supervised approaches to novelty detection are widely used; they assume that all training instances belong to a single known class and the goal is to predict whether an object is normal or abnormal; examples include OCSVM, SVDD and others. The main idea of the one-class support vector method (OCSVM) is to separate all the data in the feature space F from the origin and maximize the distance from the separating hyperplane to the origin. In contrast with the traditional SVM, OCSVM learns a decision boundary that achieves maximum separation between the samples of the known class and the origin. A binary discriminative function assigns a positive label if the test sample belongs to the region, and a negative one if it lies outside the boundary. Instead of considering a hyperplane, SVDD
Tax2004 takes a spherical boundary around the training data. The goal is to minimize the volume of this sphere so that possible outliers lie outside it. OCSVM and SVDD are closely related: both can be adopted as novelty detection methods by taking the distance to the decision boundary as the novelty score. SVDD gives a higher target-acceptance rate (true positives) when there is a large variation in density among the normal-class instances; in such cases, it starts to reject the low-density target points as outliers. Furthermore, when the data distribution is highly nonlinear, the probability of a wrong prediction is high because it is not possible to trace a more detailed decision boundary around the training data. SVM is affected by the same problem: it does not perform well when the data overlap, and it is not suitable for large datasets. With the wide diffusion of deep learning, a new family of models known as hybrid models has emerged to solve novelty detection problems. Deep learning models are used as deep feature extractors whose output is fed to traditional, well-known machine learning algorithms such as the one-class support vector machine: autoencoder+OCSVM
andrews2016detecting, autoencoder+kNN
song2017hybrid, autoencoder+SVDD kim2015deep. The main advantage of this hybrid technique is that it reduces the curse of dimensionality and increases the discriminative power of the features through neural networks. A recently proposed approach is the one-class neural network (OCNN)
chalapathy2018anomaly; ruff2018deep; perera2019learning, which combines a deep neural network with the optimization of a data-enclosing boundary in the output space. Unlike hybrid models, these methods do not require additional data for the classification step and outperform them in terms of speed. An intuitive disadvantage is the computational time required for the training step and for model updates with high-dimensional input data. Another technique to approach novelty detection problems with neural networks is the GAN goodfellow2014generative. A generative adversarial network uses a discriminator to distinguish between generated and real data: when the discriminator recognizes that the input was generated, backpropagation updates only the generator weights, otherwise only the discriminator weights are updated. The discriminator can be used as an anomaly detector because its output separates two classes: the first represents elements that belong to the class, while the other represents elements that do not. Examples of GANs used for anomaly detection are AnoGAN schlegl2017unsupervised and OCGAN perera2019ocgan. Another neural network technique for novelty detection is the autoencoder. An autoencoder creates a compressed version of the input and then reconstructs the input from this representation. It is possible to evaluate how similar the decoded information is to the input, so a threshold can be set marchi2015novel: if the reconstruction error is greater than the threshold, the input is classified as novel. Using a variational autoencoder kingma2013auto it is possible to improve this evaluation, because it is trained on perturbed inputs, so reconstruction degrades only for inputs very different from the seen examples.
3 Proposed Method
In this paper, we propose a novel Deep Hybrid Model (DHM), called OCmst, that effectively exploits convolutional features for one-class novelty detection in image classification. Our goal is to label images never seen during training as belonging to the single analyzed class (0) or as anomalies (1). As graphically represented in Figure 1, this goal is achieved in two main steps: (a) a single MST is used to assign label 0, 1 or w; (b) two MSTs are used to resolve all the samples previously labeled w.
We use a generic convolutional neural network as feature extractor, trained only on samples with a single class label, in accordance with one-class classifiers. The training data are images transformed into deep features by a VGG19 pretrained on ImageNet. The deep features extracted from the pretrained CNN are used directly, without any transformation, and fed to our proposed OCmst method.
Going into detail, our OCmst model consists of two main steps. In the first step, when a new test sample x is given, we select the k training instances closest to x and create a complete graph over them, using the Euclidean distance as the weight of each edge. All the training samples belong to the same, normal class (the only known class). Subsequently, we use Kruskal's algorithm to find the minimum spanning tree over the selected instances. In contrast with our previous work GrassaGCO19, we use two different boundaries around each MST to create the decision boundary and establish whether a test sample lies inside the first boundary, inside the second, or outside both. In Figs. 26 and 27, we show in a 3-dimensional space a real case of OCmst on a toy dataset created to highlight the three possible scenarios (accepted/rejected/uncertain instance). If x lies only within the second boundary, we label it as uncertain; if it lies within the first boundary, we label it as normal; otherwise, the abnormal class label is assigned.
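The first step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the k nearest normal training samples to a test sample are selected, the complete Euclidean graph over them is built, and Kruskal's algorithm extracts the MST. All names (`kruskal_mst`, `local_mst`) and the union-find helper are illustrative.

```python
# Sketch of OCmst step 1: local MST over the k nearest normal samples.
import numpy as np

def kruskal_mst(points):
    """Return the MST edges [(i, j, w), ...] of the complete Euclidean graph."""
    n = len(points)
    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))
    def find(a):                      # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    mst = []
    for w, i, j in edges:             # Kruskal: cheapest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                  # adding (i, j) creates no cycle
            parent[ri] = rj
            mst.append((i, j, w))
    return mst

def local_mst(train, x, k):
    """MST over the k training samples closest to the test sample x."""
    idx = np.argsort(np.linalg.norm(train - x, axis=1))[:k]
    return train[idx], kruskal_mst(train[idx])

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4096))   # stand-in for VGG19 deep features
neigh, mst = local_mst(train, rng.normal(size=4096), k=10)
print(len(mst))                       # an MST over 10 nodes has 9 edges
```

In practice `train` would hold the 4096-dimensional deep features of the normal class.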
In the second step, each test sample labeled w (uncertain) in the previous step (see Fig. 1(b)) must be assigned one of the two labels: normal (0) or abnormal (1). For each such sample we select the nearest neighbours per class and create two MSTs to make the final prediction (see Fig. 31). In this phase we use the predicted labels for the abnormal class and the ground-truth labels for the normal class. As in GrassaGCO19, we use structures based on MSTs, where the basic elements of the classifier are not only the vertices but also the edges of the graph, giving a richer representation of the data.
Below is a summary of the basic idea of the work presented in GrassaGCO19. By considering the edges of the graph as target-class objects, additional virtual target objects are generated. The classification of a new object is based on its distance to the nearest vertex or edge. The key idea of this classifier is to trace a shape around the training set considering not only the training instances but also the edges of the graph, so as to exploit more structural information. Therefore, in the prediction phase, the classifier considers two elements:

the projection $p_e(x)$ of a point $x$ onto an edge $e$ defined by the vertices $x_i$ and $x_j$;

the minimum Euclidean distance between $x$ and $p_e(x)$.
The projection of $x$ onto an edge $e = (x_i, x_j)$ is defined as follows:
(1) $p_e(x) = x_i + \frac{(x - x_i)^\top (x_j - x_i)}{\|x_j - x_i\|^2}\,(x_j - x_i)$
We check whether $p_e(x)$ lies on the edge $e = (x_i, x_j)$ and, if so, compute the Euclidean distance between $x$ and $p_e(x)$; more formally, if the following condition is true
(2) $0 \le \frac{(x - x_i)^\top (x_j - x_i)}{\|x_j - x_i\|^2} \le 1$
then
(3) $d(x, e) = \|x - p_e(x)\|$
otherwise we compute the Euclidean distance of $x$ to the pair of vertices $(x_i, x_j)$, precisely:
(4) $d(x, e) = \min\left(\|x - x_i\|, \|x - x_j\|\right)$
Therefore, a new object $x$ is recognized by the classifier (see Algorithm 1) if it lies inside the decision boundary described below; otherwise, the object is considered an outlier. The decision of whether an object is recognized by the classifier is based on a threshold $\theta$ over the shape created during the training phase; more formally:
(5) $\min_{e \in E} d(x, e) \le \theta$
where $E$ is the set of MST edges, whose distances to $x$ are obtained from Eqs. 3 and 4.
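The point-to-edge distance of Eqs. 1-4 can be sketched in a few lines. This is an illustrative NumPy implementation (names are not from the original code): when the projection of $x$ falls between the two vertices, the distance to the projection is used; otherwise the distance to the nearer vertex.

```python
# Sketch of Eqs. 1-4: distance from a point x to an MST edge e = (vi, vj).
import numpy as np

def edge_distance(x, vi, vj):
    d = vj - vi
    t = np.dot(x - vi, d) / np.dot(d, d)   # projection coefficient (Eq. 2)
    if 0.0 <= t <= 1.0:
        p = vi + t * d                      # projection p_e(x) (Eq. 1)
        return np.linalg.norm(x - p)        # Eq. 3
    # projection falls outside the segment: use the nearer vertex (Eq. 4)
    return min(np.linalg.norm(x - vi), np.linalg.norm(x - vj))

print(edge_distance(np.array([0.5, 1.0]),
                    np.array([0.0, 0.0]),
                    np.array([1.0, 0.0])))  # 1.0 (projection falls on the edge)
```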
Differently from what is proposed in juszczak2009minimum, where the authors set the threshold as a quantile of the distribution of the edge weights in the given MST, in our approach we enrich this solution by introducing additional thresholds. In juszczak2009minimum, given the ordered edge weight values $w_{(1)} \le w_{(2)} \le \dots \le w_{(m)}$, the threshold is defined as $\theta = w_{(\lceil q\,m \rceil)}$, where $q \in [0, 1]$. For instance, with $q = 0.5$, $\theta$ is the median value of all edge weights in the MST. In our approach, we set two different thresholds $\theta_1 < \theta_2$ that delimit three decision regions, enabling three kinds of classification.
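As a sketch, the quantile-based threshold choice discussed above can be computed directly from the MST edge weights; picking two quantiles $q_1 < q_2$ yields the pair $\theta_1, \theta_2$. The weight values and quantiles below are illustrative, not the tuned ones.

```python
# Sketch: thresholds as quantiles of the sorted MST edge weights,
# following the scheme of juszczak2009minimum (q = 0.5 gives the median).
import numpy as np

weights = np.array([0.2, 0.4, 0.5, 0.9, 1.3])   # illustrative MST edge weights
theta1, theta2 = np.quantile(weights, [0.5, 0.9])
print(theta1)  # 0.5 (median edge weight)
```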
In the first step of our approach, defined in Algorithm 1, we create a single MST using only the normal class and then, to assign predicted labels to the test samples, we introduce the following discriminative function $f$:
(6) $f(x) = 0 \quad \text{if } \min_{e \in E} d(x, e) \le \theta_1$
(7) $f(x) = w \quad \text{if } \theta_1 < \min_{e \in E} d(x, e) \le \theta_2$
(8) $f(x) = 1 \quad \text{if } \min_{e \in E} d(x, e) > \theta_2$
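The three-way rule of Eqs. 6-8 reduces to a comparison of the minimum MST distance against the two thresholds; a minimal sketch (function name illustrative):

```python
# Sketch of Eqs. 6-8: three-way labelling from the minimum distance
# of a test sample to the local MST, with thresholds theta1 < theta2.
def discriminate(d_min, theta1, theta2):
    if d_min <= theta1:
        return 0      # inside the inner boundary: normal (Eq. 6)
    if d_min <= theta2:
        return 'w'    # border region: uncertain (Eq. 7)
    return 1          # outside both boundaries: abnormal (Eq. 8)

print([discriminate(d, 0.3, 0.8) for d in (0.1, 0.5, 1.2)])  # [0, 'w', 1]
```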
In Eq. 6 the test instance lies inside the inner boundary, so the MST assigns label 0 (object recognized) to $x$; in Eq. 8 we reject the object because it lies outside the boundary defined by $\theta_2$. To each instance inside the border region delimited by the thresholds $\theta_1$ and $\theta_2$ we assign the label $w$ (uncertain object), as defined in Eq. 7. Furthermore, differently from juszczak2009minimum, in our work we do not use all the instances of the normal class to create the minimum spanning tree: we select the $k$ instances of the training set closest to the test instance $x$ (see the graphical representation in Figure 1(a)) and create the MST from them. The main reason is to capture a local representation from an MST built on the neighbours of $x$ in the training set. After the first step, we obtain an array of predictions as follows:
(9) $\hat{y} = [y_1, y_2, \dots, y_n], \qquad y_i \in \{0, 1, w\}$
where label 0 represents an object recognized as normal and label 1 represents an abnormality. The label $w$ marks all the test samples inside the border region defined in Eq. 7 by the discriminative function. In the second step, described in Figure 1(b), we use a pair of MSTs, trained on the normal and abnormal classes, to resolve the ambiguity of all the uncertain instances labelled $w$. In our previous work GrassaGCO19, when both classifiers accepted or rejected a test instance, we simply searched for the per-class data distribution closest to the test object and made the final classification. In this work, we extend this function by introducing statistical metrics that benefit classification performance. Given two sets of data samples $X_0$ and $X_1$ containing the elements of each class nearest to the test sample $x$, we compute the standard deviations:
(10) $\sigma_c = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_c)^2}, \qquad c \in \{0, 1\}$
where $\mu_c$ is the mean value of the observations in class $c$.
Finally, given the minimum distance $d_c$ between a test sample and a node of the class-$c$ MST, as defined in Algorithm 2, rows 9-16, we define the zeta score as:
(11) $z_c = \frac{d_c}{\sigma_c}$
where $\sigma_c$ is the standard deviation of the corresponding class samples. This formulation means that the new observation is classified jointly using the distance from the appropriate MST and the standard deviation $\sigma_c$: we assign to the test object the label of the class that obtains the minimum score. More formally, we use the function:
(12) $\hat{y}(x) = \arg\min_{c \in \{0, 1\}} z_c$
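The resolution of an uncertain sample in Eqs. 10-12 can be sketched as follows. This is an illustrative simplification (names invented; the per-class distance here is taken to the MST nodes only, and the standard deviation is computed over the whole class neighbourhood): each class's minimum distance is normalized by that class's standard deviation, and the class with the smaller z score wins.

```python
# Sketch of Eqs. 10-12: deciding an uncertain sample with two class
# neighbourhoods (normal vs abnormal) via the z score d_c / sigma_c.
import numpy as np

def z_score(d, samples):
    sigma = samples.std()          # Eq. 10: std over the class neighbourhood
    return d / sigma               # Eq. 11

def final_label(x, normal, abnormal):
    d0 = np.linalg.norm(normal - x, axis=1).min()    # distance to normal MST nodes
    d1 = np.linalg.norm(abnormal - x, axis=1).min()  # distance to abnormal MST nodes
    z0, z1 = z_score(d0, normal), z_score(d1, abnormal)
    return 0 if z0 <= z1 else 1    # Eq. 12: argmin over the class z scores

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(20, 8))     # class-0 neighbourhood
abnormal = rng.normal(5.0, 1.0, size=(20, 8))   # class-1 neighbourhood
print(final_label(np.zeros(8), normal, abnormal))  # 0 (closer to the normal class)
```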
We can summarize the main differences with respect to our previous work as follows:

we use trained CNNs as deep feature extractors;

we introduce different levels of decision boundaries to track uncertain samples, and we use strongly rejected instances to create the abnormal class;

we introduce a new discriminative function for the case in which both MSTs accept or reject an instance.
4 Experimental Results
4.1 Datasets
To prove the effectiveness of the proposed method, we evaluate it on two well-known datasets: FashionMNIST xiao2017fashion and CIFAR10 krizhevsky2009learning. Figure 22 shows some examples taken from the two datasets. Below, we describe the datasets in detail.
FashionMNIST: a dataset containing 60000 instances for the training set and 10000 instances for the test set. The number of classes is 10 and each sample is a 28x28 grayscale image of clothing. This dataset is more challenging than MNIST lecun1998mnist
and represents a useful benchmark for machine learning algorithms. Looking at the differences, MNIST is too easy: CNNs can reach 99.7% accuracy on MNIST, while classic machine learning algorithms easily reach 97%. Furthermore, MNIST is not representative of modern computer vision problems.
Cifar10: consists of 60000 images in 10 classes (6000 per class), with training and test sizes of 50000 and 10000 respectively. Each sample is a 32x32 low-resolution color image. The 10 classes are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Cifar10 is more challenging than FashionMNIST due to the diverse content and complexity of its images. It is also widely used by researchers as a benchmark for classification tasks.
Although these two datasets are mainly used to study and compare supervised classification techniques, in this work we use them to study a novelty detection algorithm, working with the single classes against all the others.
4.2 Setup
In accordance with novelty detection problems, we use only one class of the training set, considering all its instances as the normal class, and perform the first step (see Algorithm 1) to discover strongly rejected outliers in the test set. We then use the rejected test samples to create the abnormal class and subsequently classify the uncertain ones. The test set is composed of 10000 instances, more precisely 1000 normal samples and 9000 abnormal samples. We used the AUC score, in accordance with the literature (ruff2018deep), to compare our approach with the published ones. The proposed model was implemented using the PyTorch framework paszke2017automatic, with a VGG19 simonyan2014very pretrained on ImageNet imagenet_cvpr09 as deep feature extractor. The dimensionality of the extracted features is 4096. We use these features as input for both the first and the second step of our OCmst model. The values of the two thresholds $\theta_1$ and $\theta_2$ were found experimentally using a validation set extracted from the training data of the two datasets. These parameters were then kept fixed for all the other experiments.
4.3 Results
To better understand the potential of the proposed model, we have conducted two groups of experiments:

parameter analysis

comparison with other methods proposed in the literature.
For the first group of experiments we extracted a validation set from the training set of the CIFAR10 dataset. We analyzed the parameter $k$ to understand its optimal value for all the other experiments. This parameter is very important because both the speed and the memory occupation of the entire novelty detection process depend on it. For example, on the Plane class of the CIFAR10 dataset (10000 test samples), increasing $k$ substantially increases the execution time. In Table 2 we report the AUC score achieved on CIFAR10 using a different value of $k$ for each experimental run. As reported in this table, we choose the value of $k$ with the best score.
In the second group of experiments we compare our approach with many others reported in the literature. In Table 1 we report the results of our method together with the results published in scholkopf2001estimating; bishop2006pattern; hadsell2006dimensionality; kingma2013auto; van2016conditional; schlegl2017unsupervised; abati2018and; ruff2018deep; perera2019ocgan on the CIFAR10 dataset. From this group of experiments we can see that our approach produces the best average results and the best absolute results on 6 classes out of 10. A similar experiment is shown in Table 3: in this case we used a different dataset, FashionMNIST, and compared our method with all the results published in schlachterdeep. Looking at the average results, our approach ranks fourth, but the results are still comparable with the best ones. We can conclude that OCmst shows the best performance on CIFAR10 and competitive results on FashionMNIST.
5 Conclusion
In this work we introduced the first graph-based hybrid model for novelty detection problems. Our method uses the deep features produced by a convolutional neural network to find a good decision boundary by exploiting minimum spanning tree structures. The proposed OCmst outperforms the state of the art in novelty detection on many classes of the CIFAR10 dataset, showing a higher AUC score than the other methods. On the FashionMNIST dataset we obtained competitive results. Our experiments prove the effectiveness of the proposed approach on two different datasets and highlight its advantages and disadvantages.
Acknowledgements.
The authors kindly acknowledge the NVIDIA gift of a Titan Xp GPU for this research.
6 Declaration of interest
The authors declare that they have no conflict of interest.