# Few-shot Deep Representation Learning based on Information Bottleneck Principle

In a standard anomaly detection problem, a detection model is trained in an unsupervised setting, under the assumption that the samples were generated from a single source of normal data. In practice, however, normal data often consist of multiple classes. In such settings, learning to differentiate anomalies from the discrepancies among normal classes without large-scale labeled data presents a significant challenge. In this work, we attempt to overcome this challenge by preparing a few examples from each normal class, which is not excessively costly. The above setting can also be described as few-shot learning over multiple normal classes, with the goal of learning a useful representation for anomaly detection. In order to utilize the limited labeled examples in training, we integrate the inter-class distances among the labeled examples in the deep feature space into the MAP loss. We derive their relation from an information-theoretic principle. Our empirical study shows that the proposed model improves the segmentation of normal classes in the deep feature space, which contributes to identifying examples of the anomaly class.

04/21/2021


## 1 Introduction

Anomaly detection is an important problem in machine learning, whose goal is to identify samples that exhibit substantial discrepancies from normal data [3]. Recently, it has received much attention in studies related to deep neural networks (DNNs), along with tasks such as novelty detection and out-of-distribution (OOD) detection, as the unexpected behaviors of DNNs when shown samples from classes unseen in training have been recognized as a critical problem. Typically, DNN classifiers can misclassify such samples as those from known classes with high confidence.

The softmax classifier employed in DNNs [10] is attributed as one of the causes of this problem. Alternatively, deep data description models [12], which describe each class of normal data with a hypersphere or a Gaussian component in the embedded space, have been proposed for outlier and OOD detection tasks. A data description model can naturally identify a test sample with low predicted probabilities over all known classes as an outlier or an OOD sample. The goal of the data description model is to capture the characteristics of known classes in a deep representation, such that each sample is projected onto the proximity of its corresponding class center. Its training is thus supervised learning, using large-scale labeled samples of the in-distribution classes.
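To make the data description idea concrete, the following sketch scores a test sample by its smallest standardized distance to any known class center. This is a minimal numpy illustration of the scoring rule, not the implementation of any cited model; the helper names are ours.

```python
import numpy as np

def fit_class_centers(features, labels):
    """Estimate a center and an isotropic deviation for each known class."""
    centers, sigmas = {}, {}
    for k in np.unique(labels):
        fk = features[labels == k]
        centers[k] = fk.mean(axis=0)
        sigmas[k] = fk.std() + 1e-8  # small floor avoids division by zero
    return centers, sigmas

def ood_score(x, centers, sigmas):
    """Smallest standardized squared distance to any class center.

    A large score means the sample is far from every known class,
    i.e., a likely outlier or OOD sample."""
    dists = [np.sum((x - centers[k]) ** 2) / (2 * sigmas[k] ** 2)
             for k in centers]
    return min(dists)
```

Thresholding this score then yields an outlier decision; the threshold itself is a separate modeling choice.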

In anomaly detection, meanwhile, the training data are given without labels and assumed to have come from a single data source. In practice, normal data often consist of multiple classes, making modeling difficult under an unsupervised setting, and the variation among classes substantially increases the difficulty of anomaly detection.

In this paper, we address this task utilizing a limited number of labeled examples from each normal class. This setting is related to that of few-shot learning in that it exploits a small number of labeled examples, but unlike semi-supervised anomaly detection [13], it does not exploit examples of anomalies for training or validation.

The key intuition behind our approach, which differs from previous studies, is to consider the inter-class discrepancies among the normal class examples. We derive the loss function from the information bottleneck (IB) principle [17]. As previous studies have built their detection models on discriminative and generative classifiers, training using inter-class discrepancies has not received strong interest. However, we consider it critical for exploiting the small-scale labeled samples. The IB principle formulates the relation between the inter-class discrepancy and the intra-class discrepancies, the latter being similar to the data description loss function. We refer to the proposed model as eXemplar Data Description (XDD).

In our empirical study, we set up an anomaly detection task using the MNIST dataset to identify samples from unseen classes as anomalies. The results show that XDD exhibits high detection performance and contributes to the segmentation of normal classes.

This paper is organized as follows. Section 2 describes the related studies on deep data description, deep anomaly detection, and few-shot learning.

## 2 Related Work

### 2.1 Deep Data Description

The support vector data description [15] aims to find a spherical decision boundary that encloses the normal data, in order to detect whether a new sample comes from the same distribution or is an outlier. Deep-SVDD [12] has employed the embedding functions of deep neural networks (DNN) to capture the structure of normal data.

Deep Multi-class Data Description (MCDD) [10] was introduced as an extension of Deep SVDD for out-of-distribution (OOD) detection. A DNN is trained such that the embedding function maps the labeled data onto the close proximity of the centers of the corresponding classes, in order to find Gaussian components which describe the training data in the embedded space.

Describing the component as a multivariate normal distribution,

$$P(z \mid y=k) = \mathcal{N}(f(x;W) \mid \mu_k, \sigma_k^2 I) \qquad (1)$$

The Deep MCDD loss is defined as the MAP loss of the generative classifier:

$$\mathcal{L}_{\mathrm{MCDD}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{P(y=y_i)\,P(x_i \mid y=y_i)}{\sum_{k'} P(y=k')\,P(x_i \mid y=k')} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(-D_{y_i}(x_i)+b_{y_i})}{\sum_{k=1}^{K}\exp(-D_k(x_i)+b_k)} \qquad (2)$$

where $D_k(x)$ is the distance from the class center given (1):

$$D_k(x) \approx \frac{\|f(x;W)-\mu_k\|^2}{2\sigma_k^2} + \log\sigma_k^d \qquad (3)$$

From equations (2) and (3), Deep MCDD training can be considered a minimization of intra-class distances with regard to the class-wise deviations in the embedded space.
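A minimal numpy sketch of (2) and (3), with the centers, deviations, and biases given as fixed parameters (in Deep MCDD they are learned jointly with the network; the function names here are ours):

```python
import numpy as np

def mcdd_distance(Z, mu, sigma):
    """D_k of Eq. (3): standardized squared distance plus log-deviation term."""
    d = Z.shape[-1]
    return np.sum((Z - mu) ** 2, axis=-1) / (2 * sigma ** 2) + d * np.log(sigma)

def mcdd_loss(Z, y, mus, sigmas, b):
    """MAP loss of Eq. (2) over a batch of embeddings Z with labels y."""
    K = len(mus)
    # per-class logits: -D_k(x) + b_k, shape (N, K)
    logits = np.stack([-mcdd_distance(Z, mus[k], sigmas[k]) + b[k]
                       for k in range(K)], axis=1)
    log_partition = np.log(np.exp(logits).sum(axis=1))
    return -np.mean(logits[np.arange(len(y)), y] - log_partition)
```

Minimizing this loss pulls each embedded sample toward its own class center relative to the others, which is the intra-class behavior the paper refers to.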

[10] shows the max-margin loss corresponding to (2) as

$$\sum_{k=1}^{K}\left[R_k^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\max\{0,\ \alpha_{ik}(\|f(x_i;W)-c_k\|^2 - R_k^2)\}\right] \qquad (4)$$

### 2.2 GAN-based Anomaly Detection

Using autoencoders to compute anomaly scores has been an important approach for unsupervised anomaly detection [3]. Given only the normal data, an autoencoder can be trained to minimize the reconstruction loss, after which its reconstruction errors on test data can be used as anomaly scores. This approach is easily extended to deep and convolutional autoencoders [4].

Recently, anomaly detection using generative adversarial networks (GANs) has emerged as a popular approach for deep unsupervised anomaly detection, following the influential work of AnoGAN [14].

Generally, GAN-based anomaly detection exploits the generator network, which learns the manifold of the normal data distribution through its mapping function, to compute the reconstruction error of the test data based on the learned manifold. For example, the test data is reconstructed using SGD in AnoGAN and using a BiGAN architecture in EGBAD [20].

The above studies have also reported promising results using the embedded space of the discriminator network, in which the distances are used as the anomaly scores. In this work, we compute the anomaly scores using only the distances in the embedded space of the discriminator network.

### 2.3 Few-shot Learning

Few-shot learning [18] is the task of exploiting an additional set of labeled examples for a target task to adapt prior knowledge acquired in different source tasks. Similar to transfer learning, one of its key benefits is reducing the data collection cost for a new task which is similar to one or more previous tasks.

The types of the target and the source tasks are primarily supervised learning, namely classification. The setting in which N labeled examples from each of K classes are given for the target task is called a K-way N-shot classification. Many existing works on few-shot anomaly detection assume a similar setting, but N is set to a small number for anomalies [11].

In few-shot learning for OOD detection [6], the setting may be described as K-way N-shot over the in-distribution classes only, i.e., examples of OOD classes are not given. Meanwhile, its source task is supervised learning given sufficient labeled examples of the in-distribution classes.

In this paper, we consider a setting where the source task is unsupervised learning given unlabeled normal data and the target task is conducted in a K-way N-shot setting, i.e., without examples of anomalies. In this setting, one is only required to prepare a limited number of labels for each class of normal data, which is a substantial relief of burden for practical applications. To our knowledge, there has not been a prior study on few-shot anomaly detection utilizing only labels of normal class examples.

## 3 Exemplar Data Description

Let the source task input be a set of unlabeled data. The target task input comprises N labeled examples from each of the K normal classes, and each label takes a value in {1, …, K}.

The goal in both tasks is to obtain an embedding function, but the source task is usually conducted as a pretraining for the target task. In this paper, we consider the source task to be adversarial training in an unsupervised setting.

In a basic generative adversarial network, a discriminator network learns an embedding function and a discriminating function to classify between a real image and a generated image. We take the deep feature space of a pretrained discriminator network as the embedded space and its pretrained parameters as the initialization. We thus attempt to minimize an information-theoretic loss over this space in the target task. The loss is derived from the information bottleneck principle [17, 2, 16], in the form of a rate-distortion function

$$\mathcal{L}_{\mathrm{IB}} = I(X;Z) - \beta I(Y;Z) \qquad (5)$$

where I(·;·) is the mutual information and X, Z, and Y denote the random variables for the input, the deep features, and the class labels, respectively.

The first mutual information term in (5) quantifies the redundancy of the representation Z, and the second term reflects the amount of discriminative information regarding Y preserved in Z. The objective of the rate-distortion problem, or minimizing (5), thus is to find a sparse yet discriminative representation for predicting Y.

By definition, mutual information is the expected KL divergence between the marginal and the conditional distribution over Z. We model the conditional distribution as an isotropic normal distribution centered on the embedding f(x_i), with a deviation hyperparameter σ_k defined with respect to the class label k.

$$p(z \mid x_i) = \mathcal{N}(z \mid f(x_i), \sigma_k^2 I) = \frac{1}{(2\pi\sigma_k^2)^{d/2}}\exp\left(-\frac{\|z - f(x_i)\|^2}{2\sigma_k^2}\right) \qquad (6)$$

The first term on the RHS of (5) is rewritten as follows, as derived in Appendix 0.A.

$$I(X;Z) = \mathbb{E}_{X,Z}\left[\log\frac{P(Z\mid X)}{P(Z)}\right] = \frac{1}{N^2}\sum_{z\in\mathcal{Z}}\sum_{i=1}^{N}\left(-\frac{\|z - f(x_i)\|^2}{2\sigma_k^2} + \log\sigma_k^d\right) + \mathrm{const.} \qquad (7)$$

(7) shows that I(X;Z) primarily reflects the sum of standardized squared distances between all pairs of embedded samples. When we assume that the distances between pairs from the same class are substantially smaller than those between pairs from different classes, we can consider this to be the inter-class distance term.
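The dominant pairwise-distance term of (7) can be computed directly. The following sketch (our own, using the standard squared-distance identity) illustrates why the term grows as class clusters move apart in the embedded space:

```python
import numpy as np

def pairwise_distance_term(Z, sigma):
    """Sum of standardized squared distances over all embedding pairs (cf. Eq. (7)).

    Uses ||z_i - z_j||^2 = ||z_i||^2 + ||z_j||^2 - 2 <z_i, z_j>
    to avoid an explicit double loop."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return float(d2.sum() / (2 * sigma ** 2 * len(Z) ** 2))
```

When intra-class distances are small, this sum is dominated by the inter-class pairs, matching the assumption stated above.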

The second mutual information term, I(Y;Z), as derived in Appendix 0.A, can be rewritten such that it is identical to the data description model except for a constant term. It thus represents the intra-class similarity among the components.

$$I(Y;Z) = \mathbb{E}_{Y,Z}\left[\log\frac{P(Z\mid Y)}{P(Z)}\right] = \frac{1}{N}\sum_{i=1}^{N}\sum_{y}\frac{n_y}{N}\log\frac{\exp\left(-\frac{\|z_i-\mu_y\|^2}{2\sigma_y^2} - \log\sigma_y^d\right)}{\sum_{y'}\exp\left(-\frac{\|z_i-\mu_{y'}\|^2}{2\sigma_{y'}^2} - \log\sigma_{y'}^d\right)} + \mathrm{const.} \qquad (8)$$

(8) is equivalent to the MAP loss (2) except for the class bias term; it induces the samples to be mapped close to their class centers.

By minimizing $\mathcal{L}_{\mathrm{IB}}$, we thus induce a mapping which increases the distances among classes as well as the closeness within classes. Our intuition is that increasing the inter-class distances in the embedded space is critical for the task at hand, because it can increase the robustness of a model estimated from smaller-scale data, and also reduce the probability that anomalies are mapped close to the known class distributions.

## 4 Algorithm and Implementation Details

The training and testing of XDD are conducted in the following steps.

1. GAN pre-training
2. XDD training
3. Testing

In steps two and three, the training data does not contain the anomaly samples. The labeled normal class samples are given in addition to the unlabeled training samples in step two.

In [10] and [12], the data description loss was introduced in both max-margin and one-class forms. Following their example, we define the inter-class component (7) and the intra-class component (8) of the IB loss in max-margin forms.

The inter-class max margin loss is defined as follows.

$$\mathcal{L}_{\mathrm{inter}} = \mathbb{E}_{\{j,k:\, y_j \neq y_k\}}\left[\max\{0,\ R_{\mathrm{dis}}^2 - \|f(p_j;W) - f(p_k;W)\|\}\right] \qquad (9)$$

(9) penalizes the inter-class pairs that are distant by less than the threshold. We set the threshold $R_{\mathrm{dis}}^2$ at the maximum of the sum of the mean and two standard deviations of the intra-class distances in each batch.
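A batch-level sketch of (9) with the threshold rule above, in pure numpy. This is a simplified stand-in for the training loss: the threshold is taken directly from the intra-class distance statistics of the batch, and the function name is ours.

```python
import numpy as np

def inter_class_margin_loss(Z, y):
    """Hinge penalty of Eq. (9) on different-class pairs closer than the
    batch threshold, set to the maximum over classes of (mean + 2*std)
    of the intra-class pairwise distances."""
    sq = np.sum(Z ** 2, axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))
    same = y[:, None] == y[None, :]
    stats = []
    for k in np.unique(y):
        dk = d[np.ix_(y == k, y == k)]
        dk = dk[~np.eye(len(dk), dtype=bool)]  # drop zero self-distances
        stats.append(dk.mean() + 2.0 * dk.std())
    r_dis2 = max(stats)                        # batch threshold
    inter = d[~same]                           # different-class pair distances
    return float(np.mean(np.maximum(0.0, r_dis2 - inter)))
```

Well-separated classes incur zero penalty, while overlapping classes are pushed apart.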

The max-margin intra-class loss is defined, analogously to (9) over the same-class pairs, as

$$\mathcal{L}_{\mathrm{intra}} = \mathbb{E}_{\{j,k:\, y_j = y_k\}}\left[\max\{0,\ \|f(p_j;W) - f(p_k;W)\| - R_{\mathrm{intra}}^2\}\right] \qquad (10)$$

where $R_{\mathrm{intra}}$ is the radius of the component sphere. We set $R_{\mathrm{intra}}^2$ at the median of the intra-class distances in each mini-batch.

The GAN loss is minimized along with the intra- and inter-class distance losses to maintain the consistency of discriminator representation.

$$\mathcal{L}_{\mathrm{XDD}} = \mathcal{L}_{\mathrm{intra}} + \mathcal{L}_{\mathrm{inter}} - V_{\mathrm{GAN}}(D;G) \qquad (11)$$

The three terms in (11) differ substantially in scale, for which we employ block coordinate descent for training [5].
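The training loop can be sketched as cycling gradient steps over one loss term at a time. This is our reading of the block-coordinate scheme of [5] applied to the three terms, with a numerical gradient standing in for backpropagation; it is an illustration, not the paper's exact loop.

```python
import numpy as np

def num_grad(loss, p, eps=1e-5):
    """Central-difference numerical gradient of a scalar loss at p."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e.flat[i] = eps
        g.flat[i] = (loss(p + e) - loss(p - e)) / (2 * eps)
    return g

def block_coordinate_descent(losses, params, grad, lr=0.1, epochs=50):
    """Alternate gradient steps on one loss term at a time.

    Cycling over the terms sidesteps hand-tuning mixture weights for
    terms of very different scales."""
    for _ in range(epochs):
        for loss in losses:            # one block / one term per inner step
            params = params - lr * grad(loss, params)
    return params
```

In the actual model the three `losses` would be the intra-class, inter-class, and GAN terms of (11), and `grad` would come from automatic differentiation.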

In testing, we compute the anomaly score of a test sample by kernel density estimation (KDE) using the projections of the labeled examples.
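The KDE scoring step might look as follows, assuming a Gaussian kernel over the embedded exemplars (the paper does not fix the kernel or bandwidth; `kde_score` is our name):

```python
import numpy as np

def kde_score(z, exemplars, bandwidth=1.0):
    """Negative log of a Gaussian-KDE density over the labeled exemplars'
    embeddings; a higher score means more anomalous."""
    d2 = np.sum((exemplars - z) ** 2, axis=1)
    logk = -d2 / (2 * bandwidth ** 2)
    # log-sum-exp for numerical stability with distant samples
    m = logk.max()
    log_density = m + np.log(np.exp(logk - m).sum()) - np.log(len(exemplars))
    return float(-log_density)
```

Samples far from every exemplar in the embedded space receive low density and hence a high anomaly score.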

## 5 Experiments

We present experimental results using the MNIST dataset. The architectures of the generator and discriminator networks are shown in Table 1.

### 5.1 Setup

We adopt the standard setup from previous studies [1, 10] for anomaly detection using MNIST. For training, we exclude one digit as the target anomalous class from the original set and measure the detection performance on the test set using the areas under the ROC and precision-recall curves. The few-shot setting is an N-shot nine-way problem, i.e., N labeled samples of each of the nine normal classes were randomly chosen.
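Given anomaly scores over the test set, the AUROC of this protocol can be computed from the rank-sum (Mann-Whitney) statistic; a small numpy sketch (assuming continuous scores without ties):

```python
import numpy as np

def auroc(scores, is_anomaly):
    """Area under the ROC curve via the rank-sum statistic.

    Equals the probability that a randomly chosen anomaly receives a
    higher score than a randomly chosen normal sample."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = ranks[is_anomaly]
    n_pos, n_neg = is_anomaly.sum(), (~is_anomaly).sum()
    return float((pos.sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```

The same scores can be fed to a precision-recall routine to obtain the AUPRC reported in Fig. 1.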

### 5.2 Results

Fig. 1 summarizes the AUPRC performances of the XDD models on the ten target-anomaly-class detection tasks. The vertical axis indicates the mean AUPRC over ten repetitions. The numbers on the horizontal axis indicate the digit designated as the anomalous class. The blue bars indicate the means after pre-training and the orange bars indicate the means after XDD training. XDD training contributes to substantial improvements of the AUPRC values in all tasks.

Fig. 3 illustrates the low-dimensional projections of the test samples using t-SNE [8], after pre-training and after XDD training. The samples are represented by triangles of colors corresponding to their respective classes, while the labeled examples are represented by black markers. This result was obtained when the anomalous target class is eight (yellow).

From the first projection, we can see that the separation among the normal classes in the embedded space is not distinct after pre-training. The second projection visually indicates that XDD training improves the separation between the normal classes, which in turn reduces the overlap with the target anomalous class.

## References

• [1] Akcay, S., Atapour-Abarghouei, A., Breckon, T.P.: Ganomaly: Semi-supervised anomaly detection via adversarial training (2018)
• [2] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net (2017), https://openreview.net/forum?id=HyxQzBceg
• [3] Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Comput. Surv. 41(3), 1–58 (2009).
• [4]

Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016),

http://www.deeplearningbook.org
• [5] Grippo, L., Sciandrone, M.: Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods & Software 10, 587–637 (1999)
• [6] Jeong, T., Kim, H.: Ood-maml: Meta-learning for few-shot out-of-distribution detection and classification. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 3907–3916. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/file/28e209b61a52482a0ae1cb9f5959c792-Paper.pdf
• [7] Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Master’s thesis (2009), http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
• [8] Kusner, M., Tyree, S., Weinberger, K.Q., Agrawal, K.: Stochastic neighbor compression. In: Jebara, T., Xing, E.P. (eds.) Proceedings of the 31st International Conference on Machine Learning (ICML-14). pp. 622–630. JMLR Workshop and Conference Proceedings (2014), http://jmlr.org/proceedings/papers/v32/kusner14.pdf
• [9] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998).
• [10] Lee, D., Yu, S., Yu, H.: Multi-class data description for out-of-distribution detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1362–1370. KDD ’20, Association for Computing Machinery, New York, NY, USA (2020). , https://doi.org/10.1145/3394486.3403189
• [11] Lu, Y., Yu, F., Reddy, M.K.K., Wang, Y.: Few-shot scene-adaptive anomaly detection (2020)
• [12] Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S.A., Binder, A., Müller, E., Kloft, M.: Deep one-class classification. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4393–4402. PMLR, Stockholmsmässan, Stockholm Sweden (10–15 Jul 2018), http://proceedings.mlr.press/v80/ruff18a.html
• [13] Ruff, L., Vandermeulen, R.A., Görnitz, N., Binder, A., Müller, E., Müller, K., Kloft, M.: Deep semi-supervised anomaly detection. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020), https://openreview.net/forum?id=HkgH0TEYwH
• [14] Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In: Niethammer, M., Styner, M., Aylward, S.R., Zhu, H., Oguz, I., Yap, P., Shen, D. (eds.) Information Processing in Medical Imaging - 25th International Conference, IPMI 2017, Boone, NC, USA, June 25-30, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10265, pp. 146–157. Springer (2017). , https://doi.org/10.1007/978-3-319-59050-9_12
• [15] Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54, 45–66 (January 2004). , http://portal.acm.org/citation.cfm?id=960091.960109
• [16] Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW). pp. 1–5 (2015).
• [17] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. Computing Research Repository(CoRR) physics/0004057 (2000)
• [18] Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 53(3) (Jun 2020). , https://doi.org/10.1145/3386252
• [19] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms (Aug 2017)
• [20] Zenati, H., Foo, C.S., Lecouat, B., Manek, G., Chandrasekhar, V.R.: Efficient gan-based anomaly detection (2019)

## Appendix 0.A Information Bottleneck

The information bottleneck principle is a principle for extracting the relevant information in an input variable X about an output variable Y. The relevant information is defined by the mutual information I(X;Y), assuming statistical dependence between the two variables. We further attempt to learn a compressed representation of X by discarding irrelevant features that do not contribute to the prediction of Y.

We denote the compressed representation by Z. In the context of this paper, Z is the embedding of X by a deep learning model f, such that z = f(x). The compression rate and the distortion given Z and Y can also be measured by the mutual informations I(X;Z) and I(Y;Z), respectively.

Finding an optimal Z leads to the Lagrangian

$$\mathcal{L} = I(X;Z) - \beta I(Z;Y) \qquad (12)$$

Minimizing (12) is seen as a rate-distortion problem where the Lagrange multiplier β represents the trade-off between the compression rate and the distortion.

The mutual information can be rewritten as the expected Kullback-Leibler divergence between the conditional and the marginal probability distributions

$$I(X;Z) = \mathbb{E}_{x}\left[D_{\mathrm{KL}}(p(z \mid x)\,\|\,p(z))\right] \qquad (13)$$

We model the empirical distribution of z by an average of Dirac delta functions,

$$p(z) = \frac{1}{N}\sum_{i=1}^{N}\delta(z - f(x_i)) \qquad (14)$$

and the conditional distribution as

$$p(z \mid x_i) = \mathcal{N}(z \mid f(x_i), \sigma_k^2 I) = \frac{1}{(2\pi\sigma_k^2)^{d/2}}\exp\left(-\frac{\|z - f(x_i)\|^2}{2\sigma_k^2}\right) \qquad (15)$$

where σ_k is the deviation over the class k.

$$\begin{aligned} I(X;Z) &= \mathbb{E}_{x,z}\left[\log\frac{p(z\mid x)}{p(z)}\right] \\ &= \iint \frac{1}{N}\sum_{i=1}^{N}\delta(z-f(x_i))\log\left[\frac{1}{(2\pi\sigma_k^2)^{d/2}}\exp\left(-\frac{\|z-f(x_i)\|^2}{2\sigma_k^2}\right)\right] dz\,dx \\ &\quad - \iint \frac{1}{N}\sum_{i=1}^{N}\delta(z-f(x_i))\log\left[\frac{1}{N}\sum_{i'=1}^{N}\delta(z-z_{i'})\right] dz\,dx \\ &= \int \frac{1}{N}\sum_{i=1}^{N}\log\left[\frac{1}{(2\pi\sigma_k^2)^{d/2}}\exp\left(-\frac{\|z-f(x_i)\|^2}{2\sigma_k^2}\right)\right] + \log\frac{1}{N}\, dx \\ &= \frac{1}{N^2}\sum_{z}\sum_{i=1}^{N}\left(-\frac{\|z-f(x_i)\|^2}{2\sigma_k^2}+\log\sigma_k^d\right)+\mathrm{const.} \qquad (16) \end{aligned}$$

(16) represents the sum of distances between all pairs of embedded samples. We assume the class-wise deviations to be small enough that the distances between same-class pairs are negligible.

Meanwhile, we model the class-conditional probability over z as follows.

$$p(z \mid y) = \mathcal{N}(z \mid \mu_y, \sigma_y^2 I) = \frac{1}{(2\pi\sigma_y^2)^{d/2}}\exp\left(-\frac{\|z-\mu_y\|^2}{2\sigma_y^2}\right) \qquad (17)$$

The mutual information I(Y;Z) can be rewritten as follows.

$$\begin{aligned} I(Y;Z) &= \mathbb{E}_{y,z}\left[\log\frac{p(z\mid y)}{p(z)}\right] \\ &= \iint \frac{1}{N}\sum_{i=1}^{N}\frac{n_y}{N}\delta(z-z_i)\log p(z\mid y)\, dy\,dz - \iint \frac{1}{N}\sum_{i=1}^{N}\frac{n_y}{N}\delta(z-z_i)\log p(z)\, dy\,dz \\ &= \frac{1}{N}\sum_{i=1}^{N}\sum_{y}\frac{n_y}{N}\log\frac{\exp\left(-\frac{\|z_i-\mu_y\|^2}{2\sigma_y^2}-\log\sigma_y^d\right)}{\sum_{y'}\exp\left(-\frac{\|z_i-\mu_{y'}\|^2}{2\sigma_{y'}^2}-\log\sigma_{y'}^d\right)}+\mathrm{const.} \qquad (18) \end{aligned}$$

(18) is equivalent to the MAP loss function except for the class bias.