1 Introduction
Anomaly detection is an important problem in machine learning, whose goal is to identify samples that exhibit substantial discrepancies from normal data
[3]. Recently, it has received much attention in studies related to deep neural networks (DNNs), along with tasks such as novelty detection and outofdistribution detection (OOD), as the unexpected behaviors of DNNs when shown samples from classes unseen in training, have been recognized as a critical problem. Typically, DNN classifiers can misclassify such samples as those from known classes with high confidence.
The softmax classifier employed in DNNs [10] is attributed as one of the causes of this problem. Alternatively, deep data description models [12]
, which describe each class of normal data with a hypersphere or a Gaussian component in the embedded space, have been proposed for outlier and OOD detection tasks. A data description models can naturally identify a test sample with low predicted probabilities over all known classes as an outlier or an OOD sample. The goal of the data description model is to capture the characteristics of known classes in a deep representation, such that each data is projected onto the proximity of a corresponding class center. Its training is thus a supervised learning, using a largescale labeled samples of indistribution classes.
In anomaly detection, meanwhile, the training data are given without labels and assumed to have come from a single data source. In practice, normal data often consist of multiple classes, making modeling difficult under an unsupervised setting, and the variation among classes substantially increases the difficulty of anomaly detection.
In this paper, we address this task utilizing a limited number of labeled examples from each normal class. This setting is related to that of fewshot learning in that it exploits a small number of labeled examples, but unlike semisupervised anomaly detection [13], it does not exploit examples of anomalies for training or validation.
The key intuition behind our approach, which differ to previous studies is to consider the interclass discrepancies among the normal class examples We derive the loss function from the information bottleneck principle
[17], which . As previous studies have built the detection models on discriminative and generative classifiers, the training using interclass discrepancies had not been considered with strong interests. However, we consider it critical for exploiting the smallscale labeled samples. The IB principle formulates the relation between the interclass discrepancy and the intraclass discrepancies which is similar to the data description loss function. We refer to the proposed model as eXemplar Data Description (XDD).In our empirical study, we setup an anomaly detection task using the MNIST dataset to identify samples from unseen classes as anomalies. The results show that XDD exhibits high detection performance and contributes to the segmentation of normal classes.
This paper is organized as follows. Section 2 describes the related studies on deep data description, deep anomaly detection, and fewshot learning.
2 Related Work
2.1 Deep Data Description
The support vector data description [15] aims to find a spherical decision boundary that encloses the normal data, in order to detect whether a new sample comes from the same distribution or is an outlier. DeepSVDD [12] has employed the embedding functions of deep neural networks (DNN) to capture the structure of normal data.
Deep Multiclass Data Description (MCDD) [10] was introduced as an extension of the Deep SVDD for outofdistribution (OOD) detection. A DNN is trained such that the embedding function maps the labeled data onto the close proximity of the centers of the corresponding class , to find Gaussian components which describes the training data in the embedded space .
Describing the component as a multinormal distribution
(1) 
The Deep MCDD loss is defined as a MAP loss of the generative classifier as
(2)  
where is the distance from the class centers given (1).
(3) 
2.2 GANbased Anomaly Detection
Using autoencoders to compute anomaly score has been an import approach for unsupervised anomaly detection
[3]. Given only the normal data, autoencoders can be trained to minimize the reconstruction loss, after which its reconstruction errors on test data, can be used as anomaly scores. This approach is easily extended to deep and convolutional autoencoders [4].Recently, anomaly detection using generative adversarial networks have emerged as a popular approach for deep unsupervised anomaly detection following the influential work of AnoGAN [14].
Generally, GANbased anomaly detection exploit the generator network, which learns the manifold of normal data distribution through its mapping function, to compute the reconstruction error of the test data based on the learned manifold. For example, the test data is reconstructed using SGD in AnoGAN and using a BiGAN architecture in EGBAD [20].
The above literatures have also reported promising results using the embedded space of the discriminator newtork, in which the distances are used as the anomaly scores. In this work, we compute the anomaly scores using only the distances in the embedded space of the discriminator network.
2.3 Fewshot Learning
Fewshot learning [18]
is the task of exploiting an additional set of labeled examples for a target task to adapt some prior knowledge acquired in different source tasks. Similar to transfer learning, one of its key benefits is reducing the data collection cost for a new task which is similar to one or more previous tasks.
The type of the target and the source tasks are primarily supervised learning, namely classification. The setting in which examples with labels of classes are given for the target task is called a wayshot classification. Many existing work on fewshot anomaly detection assumes a similar setting, but is set to a small number for anomalies [11].
In fewshot learning for OOD detection [6], the setting may be described as wayshot, i.e., examples of OOD classes are not given. Meanwhile, its source task is a supervised learning given sufficient labeled example for indistribution classes.
In this paper, we consider a setting where the source task is unsupervised learning given unlabled, normal data and the target task is conducted in a
wayshot setting, i.e., without examples of anomalies. In this setting, one is only required to prepare a limited number of labels for each class of normal data, which is a substantial relief of burden for practical applications. To our knowledge, there have not been a prior study on fewshot anomaly detection utilizing only labels of normal class examples.3 Exemplar Data Description
Let denote the source task input, which is a set of unlabeled data. The target task input comprises labeled examples from each class, . The label takes a value from .
The goal in both tasks, is to obtain an embedding function , but the source task is usually conducted as a pretraining to the target task. In this paper, we consider the source task to be an adversarial training in an unsupervised settnig.
In a basic generative adversarial networks, a discriminator network learns an embedding function and a discriminating function to classify between a real image and a generated image. We consider to be the deep feature space of a pretrained discriminator network and to be the pretrained parameters. We thus attemp to minimize an informationtheoretic loss with regards to and in the target task. The loss is derived from the information bottleneck principle [17, 2, 16], in the form of a ratedistortion function
(5) 
where is the mutual information and
denotes the random variables for the input, the deep features, and the class labels, respectively.
The first mutual information term in (5) quantifies the redundacy of representation , and the second term reflects the amount of discriminative information regarding preserved in . The objective of the rate distortion problem, or minimizing , thus is to find a sparse yet discriminative representation for predicting .
By definition, mutual information is the expected KL divergence between the marginal and the conditional distribution over . We model the conditional distribution as an isotropic normal distirbution centering on the embedding . We define a deviation hyper parameter with respect to the class label .
(6)  
The first term in the RHS of (5) is rewritten as follows as described in Appendix 0.A.
(7)  
(7) shows that primarily reflects the sum of standarized squared distances between all pairs of and . When we assume that the distances between the pairs from different classes are substantially smaller than those from the same class, we can consider this to be the interclass distance term.
The second mutual informatio term, , as derived in
, can be rewritten such that it is identical to data description model except for the constant term. It thus represents the intraclass similarity among the components.
(8)  
(8) is equivalent to the MAP loss except for the class bias which induces the samples to be mapped close to the class center.
By minimizing
, we thus induce the mapping which increases the distances among classes as well as closeness within the classes. Our intuition is that increasing the interclass distances in the embedded space is critical for the task at hand, because it can increase the robustness of the model estimated from smallerscale data, and also reduce the probability that anomalies are mapped close to the known class distributions.
4 Algorithm and Implementation Details
The training and testing of XDD is conducted in the following steps.

Pretraining (Generative Adversarial Networks)

XDD Training

Testing
In steps two and three, the training data does not contain the anomaly samples. The labeled normal class samples are given in addition to the unlabeled training samples in step two.
In [10] and [12], the data description loss was introduced both in a maxmargin and a oneclass forms. Following their examples, we define the interclass component (7) and the intraclass component (8) of the IB loss in maxmargin forms.
The interclass max margin loss is defined as follows.
(9) 
(9) penalizes the interclass pairs that are distant by less than . We set the threshold
at the maximum of the sum of mean and two standard deviations of the intraclass distances in each batch.
The maxmargin intraclass loss is defined as
(10) 
where is the radius of the component sphere. We set the at the median of the intraclass distances in each minibatch.
The GAN loss is minimized along with the intra and interclass distance losses to maintain the consistency of discriminator representation.
(11) 
The three terms in (11) differ substantially in scale, upon which we employ the blocked coordinate descent for training [5].
In testing, we compute the anomaly score of a test sample by kernel density estimation (KDE) using the projections of the labeled examples.
5 Experiments
We present an experimental result using the MNIST dataset. The architectures of the generator and discriminator networks are shown in Table 1
5.1 Setup
We adopt the standard setup from previous studies [1, 10] for anomaly detection using MNIST. For training, we exclude one digit as the target anomalous class from the original set and measure the detection performance on the test set using the area under the ROC and the precisionrecall curves. The fewshot setting is shot nineway problem, i.e., labeled samples of each class were randomly chosen.
5.2 Results
Fig. 1 summarizes the AUPRC performances of the XDD models on ten target anomaly class detection tasks. The axis indicates the mean AUPRCs in ten repetitions of . The number on the axis indicate the digit designated as the anomalous class. The blue bars indicate the means after pretraining and the orange bar indicate the means after XDD training. XDD training contributes to substantial improvements of AUPRC values in all tasks.
Fig. 1 shows
Figures 3 and 3 illustrate the lowdimensional projection of the test samples using SNE [8] after pretraining and after the XDD training. The samples are represented by triangles of colors corresponding to respective classes while the labeled examples are represented by black ’s and ’s. This result was obtained when the anomalous target class is eight (yellow).
References
 [1] Akcay, S., AtapourAbarghouei, A., Breckon, T.P.: Ganomaly: Semisupervised anomaly detection via adversarial training (2018)
 [2] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings. OpenReview.net (2017), https://openreview.net/forum?id=HyxQzBceg
 [3] Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Comput. Surv. 41(3), 1–58 (2009).

[4]
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016),
http://www.deeplearningbook.org  [5] Grippof, L., Sciandrone, M.: Globally convergent blockcoordinate techniques for unconstrained optimization. Optimization Methods & Software 10, 587–637 (1999)
 [6] Jeong, T., Kim, H.: Oodmaml: Metalearning for fewshot outofdistribution detection and classification. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 3907–3916. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/file/28e209b61a52482a0ae1cb9f5959c792Paper.pdf
 [7] Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Master’s thesis (2009), http://www.cs.toronto.edu/~kriz/learningfeatures2009TR.pdf
 [8] Kusner, M., Tyree, S., Weinberger, K.Q., Agrawal, K.: Stochastic neighbor compression. In: Jebara, T., Xing, E.P. (eds.) Proceedings of the 31st International Conference on Machine Learning (ICML14). pp. 622–630. JMLR Workshop and Conference Proceedings (2014), http://jmlr.org/proceedings/papers/v32/kusner14.pdf
 [9] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradientbased learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998).
 [10] Lee, D., Yu, S., Yu, H.: Multiclass data description for outofdistribution detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1362–1370. KDD ’20, Association for Computing Machinery, New York, NY, USA (2020). , https://doi.org/10.1145/3394486.3403189
 [11] Lu, Y., Yu, F., Reddy, M.K.K., Wang, Y.: Fewshot sceneadaptive anomaly detection (2020)
 [12] Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E., Kloft, M.: Deep oneclass classification. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4393–4402. PMLR, Stockholmsmässan, Stockholm Sweden (10–15 Jul 2018), http://proceedings.mlr.press/v80/ruff18a.html
 [13] Ruff, L., Vandermeulen, R.A., Görnitz, N., Binder, A., Müller, E., Müller, K., Kloft, M.: Deep semisupervised anomaly detection. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2630, 2020. OpenReview.net (2020), https://openreview.net/forum?id=HkgH0TEYwH
 [14] Schlegl, T., Seeböck, P., Waldstein, S.M., SchmidtErfurth, U., Langs, G.: Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In: Niethammer, M., Styner, M., Aylward, S.R., Zhu, H., Oguz, I., Yap, P., Shen, D. (eds.) Information Processing in Medical Imaging  25th International Conference, IPMI 2017, Boone, NC, USA, June 2530, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10265, pp. 146–157. Springer (2017). , https://doi.org/10.1007/9783319590509_12
 [15] Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54, 45–66 (January 2004). , http://portal.acm.org/citation.cfm?id=960091.960109
 [16] Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW). pp. 1–5 (2015).
 [17] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. Computing Research Repository(CoRR) physics/0004057 (2000)
 [18] Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: A survey on fewshot learning. ACM Comput. Surv. 53(3) (Jun 2020). , https://doi.org/10.1145/3386252
 [19] Xiao, H., Rasul, K., Vollgraf, R.: Fashionmnist: a novel image dataset for benchmarking machine learning algorithms (Aug 2017)
 [20] Zenati, H., Foo, C.S., Lecouat, B., Manek, G., Chandrasekhar, V.R.: Efficient ganbased anomaly detection (2019)
Appendix 0.A Information Bottleneck
The information bottleneck principle is a principle for extracting relevant information in an input varibale about an output variable . The relevant information is defined by the mutual information , assuming statistical dependence between the two variables. We further attempt to learn a compressed representation of , by discarding irrelevant features that do not contribute to the prediction of .
We denote the compressed representation by . In the context of this paper, is the embedding of by a deep learning model , such that . The compression rate and the distortion given and can also be measured by the mutual informations and , respectively.
Finding an optimal leads to a Lagrangian
(12) 
Minimizing (12) is seen as the ratedistortion problem where the Lagrangian multiplier represents the tradeoff between the compresssion rate and the distortion.
The mutual information can be rewritten as the KullbackLeibler divergence between the marginal and the conditional probability distributions
(13) 
We model the empirical distribution of by the average of the dirac delta functions,
(14) 
and the conditional distribution
as a normal distribution
(15)  
where is the deviation over the class .
(16)  
(16) represents the sum of distances between all pairs of . We assume to be small enough that for all pairs such that and .
Meanwhile, we model the class conditional probability over as follows.
(17)  
The mutual information can be rewritten as follows
(18)  
(18) is equivalent to the MAP loss function except for the class bias.