Deep One-Class Classification Using Data Splitting

02/04/2019
by Patrick Schlachter, et al.

This paper introduces a generic method that enables the use of conventional deep neural networks as end-to-end one-class classifiers. The method is based on splitting given data from one class into two subsets. In one-class classification, only samples of one normal class are available for training. The goal is a closed and tight decision boundary around the training samples, which conventionally trained neural networks are not able to provide. By splitting the data into typical and atypical normal subsets, the proposed method can use a binary loss and define additional distance constraints on the latent feature space. Various experiments on three well-known image datasets showed the effectiveness of the proposed method, which outperformed seven baseline models in 23 of 30 experiments.

1 Introduction

One-class classification describes special classification problems in which only samples from one class, the so-called normal class, are available for training. During inference, the task is to discriminate normal samples from samples of the other class, the so-called abnormal or anomaly class.

Conventional one-class classifiers such as the one-class support vector machine (OCSVM) [1] or the support vector data description (SVDD) [2] have limited performance on complex raw data such as natural images because of their sensitive hyperparameters, e.g. the trade-off and kernel parameters. Furthermore, they require hand-crafted features which are task-dependent and have to be carefully designed by experts.

In contrast to traditional machine learning methods, deep learning can benefit from huge amounts of data and achieves better performance in complex tasks such as image classification, natural language processing and speech recognition [3]. Accordingly, an obvious research direction is to use deep learning methods for one-class classification.

Figure 1: Examples of (a) typical and (b) atypical normal samples.
Figure 2: The architecture of the proposed method.

Indeed, there exist only a few deep learning approaches to one-class classification. One typical method is to train an autoencoder with normal samples only and to use the reconstruction error as an indication of class affiliation [4]. Much research has been conducted in this field. For instance, recent literature includes methods based on variational autoencoders [5, 6], additional regularization terms added to a mean squared error (MSE) cost function for higher robustness [7], and a combination of clustering and constraints on the latent space of an autoencoder [8]. However, all these error-based methods have limited performance on complex datasets, because the pixel-wise error does not always match human perception [9]. Hence, either the given normal samples have to be distributed closely to each other in the original data space, or feature engineering is needed before training an autoencoder. Moreover, the selected error threshold is crucial for the performance.

Apart from error-based methods, recent approaches use a deep network to replace the non-linear kernel of one-class classifiers [10, 11]. For instance, the state-of-the-art method Deep SVDD proposed by Ruff et al. [10] combines a deep network with an SVDD. In contrast to end-to-end models, Perera et al. [12] proposed a deep model to extract one-class features from raw data which are subsequently fed into a conventional one-class classifier. However, their method requires a multi-class reference dataset which is difficult to acquire for a given one-class problem. Moreover, the method can only guarantee a tight decision boundary if the reference dataset is highly correlated with the abnormal data. Finally, as the model is not an end-to-end model, the objectives for feature extraction and classification are disconnected.

In this paper, we focus on end-to-end models and introduce a generic method for one-class classification using an arbitrary deep neural network as a backbone. The key is to split the training samples of the normal class into two subsets, typical normal and atypical normal. Fig. 1 shows examples randomly sampled from these two subsets. By using a binary loss and applying distance constraints on the two subsets, the proposed method enables end-to-end training. Eventually, the output of the network can be interpreted as the probability that a given sample belongs to the normal class.

2 Proposed method

Our basic idea is to split the training data into two subsets, namely typical and atypical samples. We call this intra-class splitting (ICS). This enables the use of a binary loss during training and the definition of distance constraints between these subsets. Moreover, the tight and closed decision boundaries necessary for one-class classification can be achieved despite training a conventional deep neural network in an end-to-end manner.

Fig. 2 visualizes the proposed architecture. The final one-class classifier uses an arbitrary deep neural network as a backbone, in which the lower layers can be considered as a feature extraction subnetwork and the top layer corresponds to a classification subnetwork. In contrast to these two subnetworks, the distance subnetwork is only used during training to enforce constraints on the latent representations.

2.1 Intra-Class Splitting

In a given normal class, not all samples are representative of this class, as illustrated in Fig. 1. Therefore, it is assumed that a given normal dataset is composed of two disjoint subsets. The first subset consists of typical normal samples which are the most representative of the normal class and correspond to the majority of the given dataset. The second subset is considered to contain atypical normal samples.

An intuitive approach to splitting a given normal dataset is to train a neural network with a bottleneck structure, such as an autoencoder. Due to its compression-decompression process, only the most important information of the input data is retained. Accordingly, those samples which are reconstructed better contain more representative features. Hence, the first step of ICS is to train an autoencoder with all given normal samples using the MSE as the objective.
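As a minimal sketch of this first step (our illustrative choices, not the authors' code: image data normalized to [0, 1] and a much smaller architecture than the AlexNet-like encoder used later in the experiments), an autoencoder can be trained on the normal samples with an MSE objective:

```python
# Sketch of the first ICS step: train a small convolutional autoencoder on the
# normal samples only, with MSE as the reconstruction objective. The exact
# architecture here is illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(input_shape=(28, 28, 1)):
    x_in = layers.Input(shape=input_shape)
    h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x_in)
    h = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)   # bottleneck
    h = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(h)
    h = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(h)
    x_out = layers.Conv2D(input_shape[-1], 3, padding="same", activation="sigmoid")(h)
    return Model(x_in, x_out)

autoencoder = build_autoencoder()
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_normal, x_normal, batch_size=64, epochs=...)  # normal samples only
```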

After training, the autoencoder is used to obtain reconstructions of all training samples. Then, the similarity between the original data and the reconstructed data is calculated using a predefined similarity metric. For example, the structural similarity (SSIM) [13] is a possible similarity metric for image data.

Finally, according to a predefined ratio, the corresponding fraction of samples with the lowest similarity scores is considered as atypical normal samples, while the remaining samples are considered as typical normal samples.
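A possible implementation of the complete splitting step is sketched below, assuming a trained Keras autoencoder (e.g. the one sketched above) and image data in [0, 1]; the function name, the use of tf.image.ssim and the batch size are illustrative choices rather than the authors' implementation.

```python
# Sketch of intra-class splitting (ICS): reconstruct the normal samples with the
# trained autoencoder, score them with SSIM, and declare the worst-reconstructed
# fraction as atypical. Names and the default ratio are illustrative.
import numpy as np
import tensorflow as tf

def intra_class_split(autoencoder, x_normal, ratio=0.1):
    """Split normal samples into (typical, atypical) subsets.

    x_normal: array of shape (N, H, W, C) with values in [0, 1].
    ratio:    fraction of samples with the lowest similarity scores
              that is treated as atypical.
    """
    x_rec = autoencoder.predict(x_normal, batch_size=64)

    # Similarity between original and reconstruction; SSIM is one possible metric.
    scores = tf.image.ssim(tf.convert_to_tensor(x_normal, tf.float32),
                           tf.convert_to_tensor(x_rec, tf.float32),
                           max_val=1.0).numpy()

    # Lowest-similarity samples (worst reconstructions) are labelled atypical.
    n_atypical = int(ratio * len(x_normal))
    order = np.argsort(scores)  # ascending: lowest similarity first
    atypical_idx, typical_idx = order[:n_atypical], order[n_atypical:]
    return x_normal[typical_idx], x_normal[atypical_idx]
```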

Figure 3: Basic idea of intra-class splitting and one-class constraints.

2.2 One-Class Constraints

Based on ICS, three distance constraints on the desired latent representations are defined and visualized in Fig. 3:

  1. Small distances among typical normal samples. The typical normal samples represent the given normal class compactly. Thus, the latent representations of different typical normal samples should indicate similar high-level features. In other words, the latent representations of typical normal samples should have small distances to each other.

  2. Large distances between typical and atypical normal samples. Atypical normal samples are assumed to have high-level features more similar to the abnormal samples than those of typical normal samples. Therefore, the latent representations of atypical normal samples should be easily discriminable from those of typical normal samples, i.e. the latent representations of typical and atypical normal samples should have large distances.

  3. Large distances among atypical normal samples. The latent representations of atypical normal samples should have large distances among themselves in order to force the typical normal samples to be enclosed by the atypical ones. This is the key to a tight and closed decision boundary.

Since the term “distance” is not restricted to a specific distance metric, we allow a generic, differentiable distance function D as in [14, 15]. It is modeled by the distance subnetwork in Fig. 2. D takes two inputs and calculates a scalar value, normalized to the range [0, 1], as the distance between these two inputs. According to the above constraints, D is learned under the following criterion:

    D(z_i, z_j) \rightarrow \begin{cases} 0 & \text{if } x_i \text{ and } x_j \text{ are both typical normal samples} \\ 1 & \text{otherwise,} \end{cases}    (1)

where z_i and z_j denote the latent representations of two training samples x_i and x_j.

2.3 Training

After ICS, the three subnetworks from Fig. 2 are jointly trained with typical and atypical normal data. For each of the three desired constraints on the latent representations, a loss is defined. Then, the network is trained with these three losses step by step for a fixed number of iterations.

During the first step, the network is trained with a batch of typical normal samples to minimize the distances between their latent representations by minimizing the closeness loss

    L_\text{closeness} = D(z_i, z_j),    (2)

where z_i and z_j are the latent representations of two different typical normal samples.

Second, assigning the label “0” to typical normal samples and the label “1” to atypical normal samples enables training the network with a binary cross-entropy loss. We call this loss the intra-class loss:

    L_\text{intra} = -\big( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \big),    (3)

where y is the label for a given sample x and \hat{y} is the label of x predicted by the classification subnetwork. Thereby, the mapping from x to \hat{y} is realized by the one-class classifier consisting of the feature extraction and classification subnetworks. This loss implicitly maximizes the distances between the latent representations of typical and atypical normal samples.

Third, the network is trained with only atypical normal samples to maximize the distances among their latent representations. This is done by minimizing the dispersion loss

    L_\text{dispersion} = -D(z_i, z_j),    (4)

where z_i and z_j are the latent representations of two different atypical normal samples.
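The alternating use of the three losses can be illustrated by the following minimal sketch of one training step. This is our interpretation under stated assumptions, not the authors' implementation: `backbone` maps inputs to latent representations, `classifier_head` is the classification subnetwork, `dist_net` is the two-input distance subnetwork of Fig. 2, sample pairs are formed by shifting the mini-batch (tf.roll), and a single Adam optimizer is used.

```python
# Sketch of one alternating training step (closeness, intra-class, dispersion).
# All names and the batch-pairing strategy are illustrative assumptions.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
opt = tf.keras.optimizers.Adam()

def train_step(backbone, classifier_head, dist_net, x_typ, x_atyp):
    # Step 1: closeness loss, Eq. (2) -- pull latent codes of typical samples together.
    with tf.GradientTape() as tape:
        z_typ = backbone(x_typ, training=True)
        loss_close = tf.reduce_mean(dist_net([z_typ, tf.roll(z_typ, shift=1, axis=0)]))
    variables = backbone.trainable_variables + dist_net.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss_close, variables), variables))

    # Step 2: intra-class loss, Eq. (3) -- binary cross-entropy with labels
    # 0 (typical) and 1 (atypical).
    x = tf.concat([x_typ, x_atyp], axis=0)
    y = tf.concat([tf.zeros(len(x_typ)), tf.ones(len(x_atyp))], axis=0)
    with tf.GradientTape() as tape:
        y_hat = classifier_head(backbone(x, training=True), training=True)
        loss_intra = bce(y, tf.squeeze(y_hat))
    variables = backbone.trainable_variables + classifier_head.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss_intra, variables), variables))

    # Step 3: dispersion loss, Eq. (4) -- push latent codes of atypical samples apart.
    with tf.GradientTape() as tape:
        z_atyp = backbone(x_atyp, training=True)
        loss_disp = -tf.reduce_mean(dist_net([z_atyp, tf.roll(z_atyp, shift=1, axis=0)]))
    variables = backbone.trainable_variables + dist_net.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss_disp, variables), variables))

    return loss_close, loss_intra, loss_disp
```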

Normal Class OCSVM IF ImageNet SSIM DSVDD NaiveNN NNwICS Ours
Digit 0 98.2 (0.0) 96.2 (0.3) 71.1 (0.0) 98.7 (0.0) 98.0 (0.7) 96.8 (0.5) 96.9 (0.1) 98.9 (0.0)
Digit 1 99.2 (0.0) 99.4 (0.0) 88.9 (0.0) 99.8 (0.0) 99.7 (0.1) 74.8 (6.0) 98.7 (0.3) 99.8 (0.1)
Digit 2 82.1 (0.0) 73.0 (2.3) 58.5 (0.0) 82.6 (0.0) 91.7 (0.8) 67.5 (5.2) 85.4 (2.6) 91.7 (1.8)
Digit 3 86.1 (0.0) 82.6 (0.5) 63.2 (0.0) 90.6 (0.0) 91.9 (1.5) 67.1 (2.8) 95.1 (0.3) 96.6 (0.2)
Digit 4 94.8 (0.0) 87.9 (0.4) 70.6 (0.0) 76.6 (0.0) 94.9 (0.8) 94.1 (0.5) 77.5 (0.6) 86.5 (1.3)
Digit 5 77.4 (0.0) 73.4 (0.8) 60.7 (0.0) 92.3 (0.0) 88.5 (0.9) 55.8 (4.2) 85.5 (0.7) 88.9 (0.0)
Digit 6 94.8 (0.0) 85.8 (0.9) 67.6 (0.0) 96.6 (0.0) 98.3 (0.5) 89.7 (0.2) 96.3 (0.0) 98.8 (0.2)
Digit 7 93.4 (0.0) 91.4 (0.5) 71.0 (0.0) 96.0 (0.0) 94.6 (0.9) 68.2 (7.3) 94.0 (0.0) 96.1 (0.2)
Digit 8 90.2 (0.0) 73.9 (1.1) 64.0 (0.0) 80.2 (0.0) 93.9 (1.6) 77.7 (9.2) 91.0 (0.3) 95.0 (0.2)
Digit 9 92.8 (0.0) 87.5 (0.1) 71.8 (0.0) 79.5 (0.0) 96.5 (0.3) 81.8 (0.7) 87.4 (0.1) 90.0 (0.4)
T-shirt 86.1 (0.0) 86.8 (0.6) 58.1 (0.0) 83.7 (0.0) 79.1 (1.5) 62.9 (0.9) 85.1 (1.7) 88.3 (1.2)
Trouser 93.9 (0.0) 97.7 (0.1) 75.4 (0.0) 98.5 (0.0) 94.0 (1.3) 65.6 (4.5) 94.6 (0.1) 98.9 (0.2)
Pullover 85.6 (0.0) 87.1 (0.3) 58.1 (0.0) 87.2 (0.0) 83.0 (1.4) 73.6 (0.9) 82.6 (1.2) 88.2 (0.4)
Dress 85.9 (0.0) 90.1 (0.7) 60.1 (0.0) 89.2 (0.0) 82.9 (1.9) 70.0 (1.7) 89.1 (0.1) 92.1 (2.2)
Coat 84.6 (0.0) 89.8 (0.4) 58.3 (0.0) 87.3 (0.0) 87.0 (0.5) 80.8 (3.9) 85.8 (0.2) 90.2 (0.0)
Sandal 81.3 (0.0) 88.7 (0.2) 69.2 (0.0) 85.2 (0.0) 80.3 (4.6) 64.0 (9.4) 85.5 (0.0) 89.4 (1.4)
Shirt 78.6 (0.0) 79.7 (0.9) 57.3 (0.0) 75.3 (0.0) 74.9 (1.3) 71.8 (1.3) 75.6 (0.4) 78.3 (0.6)
Sneaker 97.6 (0.0) 98.0 (0.1) 75.5 (0.0) 97.8 (0.0) 94.2 (2.1) 92.0 (3.2) 94.9 (0.1) 98.3 (0.2)
Bag 79.5 (0.0) 88.3 (0.6) 61.9 (0.0) 81.6 (0.0) 79.1 (4.5) 72.9 (8.7) 82.0 (0.3) 88.6 (2.3)
Ankle boot 97.8 (0.0) 97.9 (0.1) 78.3 (0.0) 98.4 (0.0) 93.2 (2.4) 90.7 (0.1) 94.9 (0.3) 98.5 (0.1)
Airplane 61.9 (0.0) 66.7 (1.3) 53.3 (0.0) 75.6 (0.0) 61.7 (4.1) 63.8 (4.5) 62.7 (1.8) 76.8 (3.2)
Automobile 38.5 (0.0) 43.6 (1.3) 53.6 (0.0) 43.5 (0.0) 65.9 (2.1) 52.7 (0.9) 63.2 (0.8) 71.3 (0.2)
Bird 60.6 (0.0) 59.1 (0.3) 51.9 (0.0) 61.1 (0.0) 50.8 (0.8) 47.8 (0.4) 57.6 (0.4) 63.0 (0.8)
Cat 49.4 (0.0) 50.3 (0.5) 50.8 (0.0) 48.6 (0.0) 59.1 (1.4) 50.2 (4.3) 58.0 (0.2) 60.1 (3.4)
Deer 71.3 (0.0) 74.4 (0.2) 55.8 (0.0) 63.5 (0.0) 60.9 (1.1) 65.1 (1.8) 61.9 (0.1) 74.9 (0.9)
Dog 52.0 (0.0) 51.4 (0.3) 52.6 (0.0) 62.1 (0.0) 65.7 (2.5) 53.3 (0.7) 65.7 (0.2) 66.0 (1.1)
Frog 63.8 (0.0) 71.1 (0.5) 54.6 (0.0) 44.5 (0.0) 67.7 (2.6) 41.1 (3.4) 64.2 (2.5) 71.6 (0.8)
Horse 48.2 (0.0) 53.6 (0.3) 51.3 (0.0) 47.2 (0.0) 67.3 (0.9) 51.5 (0.4) 62.4 (2.0) 64.1 (1.6)
Ship 63.7 (0.0) 69.4 (0.6) 57.1 (0.0) 76.8 (0.0) 75.9 (1.2) 45.8 (5.0) 73.9 (1.1) 78.9 (0.5)
Truck 48.8 (0.0) 53.9 (1.0) 56.0 (0.0) 40.7 (0.0) 73.1 (1.2) 53.5 (3.2) 55.6 (2.2) 66.0 (2.5)
Table 1: AUC (standard deviation) in %.

3 Experiments

3.1 Setup

The proposed method was evaluated on three benchmark image datasets: MNIST [16], Fashion-MNIST [17] and CIFAR-10 [18]. All three datasets have ten different classes. Per dataset, one class was taken as the normal class and the remaining nine classes were considered as abnormal classes. Accordingly, the training set consisted of all training samples of the chosen normal class, i.e. 6000 samples for MNIST and Fashion-MNIST and 5000 samples for CIFAR-10. The test set was composed of 1000 normal samples and 9000 abnormal samples. Finally, the AUC [19] was used as the performance metric.
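As a small, self-contained illustration of this evaluation protocol (not taken from the paper), the AUC can be computed from the network's normal-class probabilities with scikit-learn; the scores below are random stand-ins for actual model outputs.

```python
# Toy AUC computation mirroring the test protocol: 1000 normal and 9000 abnormal
# test samples, scored by the predicted probability of being normal.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_normal = np.concatenate([rng.uniform(0.4, 1.0, 1000),   # stand-in scores for normal samples
                           rng.uniform(0.0, 0.6, 9000)])  # stand-in scores for abnormal samples
y_true = np.concatenate([np.ones(1000), np.zeros(9000)])  # 1 = normal, 0 = abnormal

print(f"AUC: {roc_auc_score(y_true, p_normal):.3f}")
```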

According to the literature, only a few prior works proposed state-of-the-art one-class classifiers. In this work, the following conventional and deep learning based models were selected as baselines: i) OCSVM [1] with an RBF kernel; ii) Isolation Forest (IF) [20]; iii) ImageNet + OCSVM: features extracted by a VGG19 [21] pretrained on ImageNet [22] were used as the input for an OCSVM; iv) Deep SVDD (DSVDD) [10]; v) an error-based classifier (SSIM) which directly took the SSIM between the reconstructions and the original data as the classification score. OCSVM, IF and DSVDD shared the settings from [10]. In addition, the following variants of the proposed method were considered as baseline models: vi) naive neural network without ICS (NaiveNN): a network with the same architecture as the proposed method was trained without the distance subnetwork or ICS; vii) neural network with ICS but without one-class constraints (NNwICS): the normal dataset was split into typical and atypical subsets, and after assigning two different labels to the subsets, the network was trained with these two subsets but without any constraints on the latent representations.

The concrete architecture of the autoencoder for ICS is arbitrary. In this work, the encoder had a structure similar to AlexNet [23], except that all dense layers were replaced by one convolutional layer. The decoder had a structure symmetrical to the encoder and utilized transposed convolutional layers for upsampling. The base architecture for the proposed method was AlexNet. The feature extraction subnetwork in Fig. 2 was composed of the layers from the input layer to the second last layer of the AlexNet, and the output layer was considered as the classification subnetwork. The distance subnetwork was composed of one subtraction layer and one dense layer. In particular, the subtraction layer calculated the element-wise difference, which was subsequently mapped to a scalar value by the dense layer. Note that these two subnetworks can be replaced by any other deeper networks.
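A minimal Keras sketch of such a distance subnetwork might look as follows; the latent dimension and the sigmoid activation used to keep the output in [0, 1] are assumptions consistent with the description above, not the authors' exact configuration.

```python
# Sketch of the distance subnetwork: a subtraction layer followed by one dense
# layer that maps the element-wise difference of two latent vectors to a scalar
# distance in [0, 1]. The latent dimension is an illustrative choice.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_distance_subnetwork(latent_dim=128):
    z_a = layers.Input(shape=(latent_dim,), name="z_a")
    z_b = layers.Input(shape=(latent_dim,), name="z_b")
    diff = layers.Subtract()([z_a, z_b])             # element-wise difference
    d = layers.Dense(1, activation="sigmoid")(diff)  # scalar distance in [0, 1]
    return Model(inputs=[z_a, z_b], outputs=d, name="distance_subnetwork")

dist_net = build_distance_subnetwork()
```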

The proposed model was implemented with TensorFlow [24]. We used SSIM as the similarity metric for ICS. Furthermore, the ratio for choosing atypical normal samples was set to 10% and the number of training iterations was 10000. Finally, the training mini-batch size was 64 and L2-regularization was used for every convolutional layer.

3.2 Results and Discussion

Table 1 shows the resulting AUCs in percent averaged over five different seeds for the initialization of the network. Compared to the baseline models, the proposed method performed best in 23 of 30 cases. Moreover, our method showed a better performance than the baseline models especially for the natural image dataset CIFAR-10. For example, the proposed method outperformed the recent state-of-the-art method DSVDD in 8 of 10 cases on CIFAR-10 with an average improvement of more than 11.4%.

Although all methods performed similarly on the simpler datasets, the proposed method still showed improved performance. For instance, on MNIST, our method achieved a 1.4% improvement over DSVDD and more than 6.5% improvement over OCSVM, IF, ImageNet and SSIM.

Considering the variants of the proposed method, the NaiveNN performed worst, as expected, because it tends to map all points from the original data space to the same label, which makes a correct classification challenging. This situation is tolerable on simple datasets. However, the NaiveNN cannot be used at all on the more complex dataset CIFAR-10. In contrast, NNwICS, a naive neural network with ICS, achieved higher AUCs and was comparable to or outperformed the other baseline models. In conclusion, the integration of ICS into neural networks can enhance the performance in one-class classification problems.

Finally, the proposed method was evaluated with different splitting ratios to judge its sensitivity. Fig. 4 shows the AUCs averaged over ten classes and four different initialization seeds for each dataset as a function of the ratio. The results indicate that each dataset has an optimal ratio of about 10%; choosing a smaller or larger value decreases the AUC.

Figure 4: AUC over the splitting ratio.

4 Conclusion

We proposed a novel method for one-class classification using deep learning. By splitting given normal data into typical and atypical normal subsets, it allows the introduction of a binary loss and additional distance constraints which enable end-to-end training of standard deep neural networks. The proposed method was evaluated in various experiments on image datasets. On average, it showed a distinct improvement over state-of-the-art approaches to one-class classification, especially for the complex dataset CIFAR-10. Future work includes extending the proposed method to larger network architectures and more complex datasets. Moreover, the proposed method may be transferred to the field of open set recognition. Finally, we are working on mathematical proofs for the significance of intra-class splitting.

References

  • [1] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, July 2001.
  • [2] David M.J. Tax and Robert P.W. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, Jan 2004.
  • [3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
  • [4] Mayu Sakurada and Takehisa Yairi, “Anomaly detection using autoencoders with nonlinear dimensionality reduction,” in Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014, p. 4.
  • [5] Jinwon An and Sungzoon Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, vol. 2, pp. 1–18, 2015.
  • [6] Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary set variational autoencoder for supervised anomaly detection,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 2366–2370.
  • [7] Chong Zhou and Randy C Paffenroth, “Anomaly detection with robust deep autoencoders,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 665–674.
  • [8] Caglar Aytekin, Xingyang Ni, Francesco Cricri, and Emre Aksu, “Clustering and unsupervised anomaly detection with L2-normalized deep auto-encoder representations,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–6.
  • [9] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886–893.
  • [10] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft, “Deep one-class classification,” in Proceedings of the 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds., Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 4393–4402, PMLR.
  • [11] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla, “Anomaly detection using one-class neural networks,” CoRR, vol. abs/1802.06360, 2018.
  • [12] Pramuditha Perera and Vishal M. Patel, “Learning deep features for one-class classification,” CoRR, vol. abs/1801.05365, 2018.
  • [13] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik, “Multi-scale structural similarity for image quality assessment,” in Proc. IEEE Asilomar Conference on Signals, Systems, and Computers, 2003, pp. 1398–1402.
  • [14] Alexey Dosovitskiy and Thomas Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
  • [15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
  • [16] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [17] Han Xiao, Kashif Rasul, and Roland Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” 2017.
  • [18] Alex Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., Citeseer, 2009.
  • [19] Andrew P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, July 1997.
  • [20] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation forest,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Washington, DC, USA, 2008, ICDM ’08, pp. 413–422, IEEE Computer Society.
  • [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105. Curran Associates, Inc., 2012.
  • [24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., “Tensorflow: a system for large-scale machine learning,” in Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation. USENIX Association, 2016, pp. 265–283.