Exploring Deep Anomaly Detection Methods Based on Capsule Net

07/15/2019 ∙ by Xiaoyan Li, et al. ∙ 2

In this paper, we develop and explore deep anomaly detection techniques based on the capsule network (CapsNet) for image data. Being able to encode the intrinsic spatial relationship between parts and a whole, CapsNet has been applied as both a classifier and a deep autoencoder. This inspires us to design a prediction-probability-based and a reconstruction-error-based normality score function for evaluating the "outlierness" of unseen images. Our results on three datasets demonstrate that the prediction-probability-based method performs consistently well, while the reconstruction-error-based approach is relatively sensitive to the similarity between labeled and unlabeled images. Furthermore, both of the CapsNet-based methods outperform the principled benchmark methods in many cases.




1 Introduction

As real-time tracking and diagnosis systems and autonomous control devices are in strong demand across many domains in the current era of the Internet of Things (IoT), smart cities, big data, and deep learning, anomaly detection (also known as outlier detection) is becoming increasingly critical. It aims at uncovering abnormal data points which may stand for novel or alarming events. Anomaly detection is of key importance in IoT systems, data centres, security platforms, and life science to diagnose system failure, detect intruders or attackers, and discover novel knowledge.

Prior to the use of deep learning approaches, statistical and heuristic methods were the main tools for anomaly detection in restricted application domains. Kernel methods, such as kernel density estimation (KDE) (Parzen, 1962; Kim & Scott, 2008) and support vector domain description (SVDD) (Tax & Duin, 1999), were the most successful at dealing with non-linearity of the input feature space through the kernel trick. Nowadays, however, tremendous amounts of data of various types (such as images, texts, and omics data) have been collected, posing new challenges in handling the scale and complexity of such data. With billions of samples in modern datasets, traditional methods become less effective. For example, conventional feature encoding and extraction methods are unable to capture informative factors from the input feature space of complex data.

Embracing the wealth of data, deep learning models have achieved significant successes in various discriminative and generative modelling tasks on modern data (Bengio et al., 2003; Hinton & Salakhutdinov, 2006; Krizhevsky et al., 2012; Graves et al., 2013; Kingma & Welling, 2014; Goodfellow et al., 2014; LeCun et al., 2015), which encourages the exploration of deep anomaly detection solutions. The core of deep learning is learning complex representations of the data at different levels in the latent space (Bengio et al., 2013). For example, convolutions on sequence data and embedding techniques on discrete data make it possible to encode visible examples as low-dimensional, continuous, dense vectors in a latent space, showing the advantage of distributed representation learning over alternative approaches. Furthermore, stochastic gradient descent using mini-batches makes the learning of deep networks scalable to large datasets. The past few years have witnessed progress made by the work reviewed in Appendix A. Accordingly, comparative studies, e.g. (Skvara et al., 2018), that do not take these two advantages of deep learning over classic methods into account could be biased.

The technique in (Golan & EI-Yaniv, 2018) is based on an insightful observation: learning to discriminate between many types of geometrically transformed images encourages learning of features that are useful for detecting novelties. Among all geometric transformations, they only considered compositions of horizontal flipping, translations, and rotations, resulting in 72 distinct transformations. Their main focus was identifying anomalous images in a pure single-class setting, even though they mentioned that their method may also be effective at distinguishing out-of-distribution samples from multiple-class data.

In fact, learning transformations is very challenging in computer vision tasks. The convolutional neural network (CNN) (LeCun et al., 1998), a hierarchy of convolution operations, has been widely used as a highly effective technique for classifying images. However, one arguable key limitation of CNNs is that their neurons do not sufficiently capture the properties of entities, such as position, orientation, and size, or their part-whole relationships. The capsule network (CapsNet) (Hinton et al., 2011; Sabour et al., 2017; Hinton et al., 2018), a novel and promising structure that may be more closely related to biological neural organization, has been proposed and has shown advantages in maintaining such information. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or part. It has been demonstrated that CapsNet is capable of preserving hierarchical pose (position, size, and orientation) relationships between image features. For a given image, CapsNet can automatically and dynamically model affine transformations and part-whole relationships using an iterative routing-by-agreement mechanism.

Inspired by these developments, we propose that, instead of using geometric transformations to learn distinct features of images, CapsNet can be employed to automatically learn transformations from the training examples, such that a test example that cannot be explained by the network should be viewed as an anomaly. Unlike (Golan & EI-Yaniv, 2018), our work concentrates on the cases where normal samples come from multiple classes. Our contributions are three-fold: (1) based on unique characteristics of CapsNet, we propose two normality score functions that work well; to the best of our knowledge, this is the first attempt to explore and test CapsNet for deep anomaly detection; (2) we provide insights and categorize existing ideas for deep anomaly detection into boundary-based and distribution-based families, paving the way for future studies; (3) we compare our methods with principled benchmark methods and assess their capacities for deep anomaly detection.

The rest of this paper is organized as follows. The CapsNet-based normality score functions are described in Section 2. In Section 3, the proposed methods are evaluated on three datasets in comparison with benchmarks. Insights into existing work and a gentle introduction to CapsNet are provided in Appendix A, while the benchmarks and datasets are described in Appendices B and C.

2 CapsNet-Based Normality Score Functions

Unlike the method in (Golan & EI-Yaniv, 2018), we do not need to manually transform each training image. Using CapsNet, transformations are expected to be learned automatically via an iterative routing-by-agreement mechanism. We can thus skip the stage of labeling each transformed image. After a CapsNet classifier is trained, unseen images (either normal or out-of-distribution) can be fed directly into the learned model. We present two normality score functions to determine the outlierness of these images.

(a) Encoder as classifier.
(b) Decoder as regularizer.
Figure 1: Architecture of the CapsNet designed in (Sabour et al., 2017) for MNIST data. The main structure displayed in (a) is a classifier but can also be viewed as an encoder. The reconstruction regularizer can be viewed as a decoder, as displayed in (b).

2.1 Prediction-Probability-Based Normality Score

The activation probabilities of the digit capsules at the last layer of a CapsNet indicate the probabilities of the input sample belonging to each class. However, unlike softmax probabilities in a CNN, the activation probabilities of all digit capsules do not necessarily sum to one. Assuming the network is trained sufficiently well, for a normal test image there should be one and only one probability close to “1”, representing the probability of this image belonging to its true class. In contrast, when an anomalous sample cannot be explained by the network, all activation probabilities of the digit capsules should be very low. This characteristic inspires us to define a normality score function based on prediction probabilities (PP):

$$n_{\mathrm{PP}}(\mathbf{x}) = \max_{k} \|\mathbf{v}_k\|_2 = \max_{k} p(y = k \mid \mathbf{x}), \tag{1}$$

where $\mathbf{x}$ is an input image, $\mathbf{v}_k$ represents the $k$-th digit capsule (see Appendix A.3), and $p(y = k \mid \mathbf{x})$ denotes the probability of $\mathbf{x}$ belonging to the $k$-th class. Hereafter, we simply call Equation (1) the PP score function. Since the threshold separating normal from anomalous samples is hard to decide, we follow convention and use the area under the receiver operating characteristic curve (auROC) to measure performance.
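As a minimal sketch of how such a score could be evaluated, the snippet below assumes the PP score is the length of the longest digit capsule and computes auROC directly as the fraction of (normal, anomalous) pairs that are ranked correctly. The function names `pp_score` and `auroc` and the toy capsule values are hypothetical and used only for illustration; this is not the authors' implementation.

```python
import math

def pp_score(digit_capsules):
    """Length of the longest digit capsule (assumed form of the PP score)."""
    return max(math.sqrt(sum(c * c for c in v)) for v in digit_capsules)

def auroc(normal_scores, anomalous_scores):
    """Fraction of (normal, anomalous) pairs where the normal sample scores
    higher; ties count as half. This equals the area under the ROC curve."""
    wins = 0.0
    for n in normal_scores:
        for a in anomalous_scores:
            if n > a:
                wins += 1.0
            elif n == a:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))

# Toy digit-capsule outputs (hypothetical: 3 classes, 2 dimensions each).
normal_image = [[0.05, 0.02], [0.90, 0.30], [0.01, 0.03]]     # one long capsule
anomalous_image = [[0.10, 0.05], [0.12, 0.08], [0.06, 0.02]]  # all capsules short

normal_scores = [pp_score(normal_image)]
anomalous_scores = [pp_score(anomalous_image)]
print(auroc(normal_scores, anomalous_scores))
```

Because auROC depends only on the ranking of scores, no threshold has to be chosen for this evaluation.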

2.2 Reconstruction-Error-Based Normality Score

In CapsNet, reconstruction error is used as a regularization term through a decoder component (Figure 1(b)). The classifier is also an encoder (Figure 1(a)) for disentangled representation learning. From this perspective, a CapsNet also functions as a deep autoencoder, which suggests scoring normality based on reconstruction error.

In this method, we use normalized squared error (NSE) to measure the quality of reconstructed images. The reason for using NSE instead of mean squared error (MSE) is that different objects in different images (e.g. MNIST digits) can have different numbers of nonzero pixels against a pure background. Because MSE averages over all pixels, including the background, it significantly dilutes the actual difference between the input image and the reconstructed image. For example, as a digit “1” occupies far fewer pixels than a digit “8”, under MSE the reconstruction loss of “1” is reduced to a greater extent than that of “8”, even though “1” may be reconstructed worse than “8”. The reconstruction error (RE) based normality score can thus be defined as:

$$n_{\mathrm{RE}}(\mathbf{x}) = -\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}{\|\mathbf{x}\|_2^2}, \tag{2}$$

where $\mathbf{x}$ represents an actual image and $\hat{\mathbf{x}}$ the reconstructed image. When the background in an image takes up a large portion of the space, and the values of background pixels are near zero, the advantage of NSE is more obvious. Hereafter, we refer to Equation (2) as the RE score function.
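The difference between MSE and NSE can be illustrated with a small sketch. It assumes NSE is the squared error divided by the squared norm of the input, and that the normality score is its negation so that larger values mean "more normal"; both the sign convention and the `eps` guard are assumptions made here. The toy "images" are hypothetical flattened pixel lists.

```python
def squared_error(x, x_hat):
    """Sum of squared pixel differences."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def mse(x, x_hat):
    """Mean squared error: averages over all pixels, background included."""
    return squared_error(x, x_hat) / len(x)

def nse(x, x_hat, eps=1e-12):
    """Normalized squared error: divides by the energy of the input image.
    eps is an assumed guard against all-zero inputs."""
    energy = sum(a * a for a in x)
    return squared_error(x, x_hat) / (energy + eps)

def re_score(x, x_hat):
    """Assumed RE normality score: negated NSE, so higher means more normal."""
    return -nse(x, x_hat)

# A thin object (2 foreground pixels) and a thick object (8 foreground pixels),
# each reconstructed with exactly one foreground pixel missing.
thin = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
thin_hat = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
thick = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
thick_hat = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# MSE cannot tell the two cases apart, although half of the thin object is lost.
print(mse(thin, thin_hat), mse(thick, thick_hat))
# NSE penalizes the thin object more, so its normality score is lower.
print(re_score(thin, thin_hat), re_score(thick, thick_hat))
```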

3 Experiments

3.1 Case Studies in Normality Score Functions

We looked into the performance of our two normality score functions through two case studies on the MNIST data.

3.1.1 Multi-Class Training Data and Single-Class Anomalous Digits

Figure 2 shows the performance of our methods when selecting digits “2” and “9” as abnormal samples, respectively, and the remaining digits as normal samples. When digit “2” was treated as the anomaly class, both PP and RE score functions achieved near-perfect auROCs (0.9841 and 0.9699). When “9” was treated as the anomaly class, however, the PP score function outperformed the RE score function by 0.12 in auROC. The reason can be seen in Figure 3, which depicts real digits and their reconstructions for both cases. Anomalous digit “9” was mostly reconstructed as digit “4”, while anomalous digit “2” was reconstructed as several different digits such as “1”, “3”, “6”, and “7”. As images of “4” and “9” are quite similar, the reconstruction errors of “9” images are small and their normality scores become very high, even though the true digits “9” and their reconstructed versions “4” are two different numbers.

(a) Digit 2 as anomalous class.
(b) Digit 9 as anomalous class.
Figure 2: ROC curves of detecting anomalous digits “2” (a) and “9” (b), respectively, using CapsNet.
(a) Digit 2 as anomalous class.
(b) Digit 9 as anomalous class.
Figure 3: Original digits (upper half) and reconstructed digits (lower half) when detecting anomalous digits “2” (a) and “9” (b), respectively, using CapsNet.

3.1.2 Multi-Class Training Data and Multiple Anomalous Digits

In this case, we tested our two CapsNet-based normality score functions by considering digits “0”, “3”, and “5” as abnormal digits, and the rest as normal. The results are displayed in Figure 4. Both methods achieved similar auROCs.

Figure 4: ROC curves of detecting abnormal digits “0”, “3”, and “5” using our two CapsNet-based score functions.

3.2 Comparison on MNIST Dataset

Our two methods and three benchmarks, CNN+OCSVM, DBN, and VAE (see Appendix B), were compared on MNIST data by treating each digit class in turn as the anomalous class and the rest as normal. Their performance in terms of auROC is displayed in Figure 5. One can see that our CapsNet-based PP method consistently works the best, while our CapsNet-based RE method is in general slightly inferior to the PP method but has results comparable to CNN+OCSVM. The two deep generative models (DBN and VAE) are not competitive with the CapsNet-based and deep hybrid methods.

Figure 5: Performance on the MNIST data.

3.2.1 Comparison on Fashion-MNIST Dataset

When comparing all five methods on the Fashion-MNIST data, all methods generally obtained lower results, which is reasonable because Fashion-MNIST samples are more complicated than MNIST samples. When using footwear (Sandals, Sneakers, or Ankle boots) as anomalous samples, the CapsNet(RE) method outperformed the CapsNet(PP) method. This is because the CapsNet classifier can confidently but wrongly assign an anomalous footwear sample (e.g. Sneaker) to one of the normal footwear classes (e.g. Sandal or Ankle boot), while reconstruction errors can pick up differences in details. When a topwear class was used as the source of anomalous samples, CapsNet(PP) worked better than CapsNet(RE). When trousers and bags were viewed as abnormal samples, both methods worked quite well without big differences in performance. The performances of CNN+OCSVM and VAE vary widely. In a few cases, they obtain results similar to those of the CapsNet-based methods. DBN did not perform impressively on the data.

Figure 6: Performance on the Fashion-MNIST data.

3.3 Comparison on Small-Norb Dataset

Small-Norb is a challenging dataset for all five methods. When using animals and cars as anomalies, only CapsNet(PP) performed reasonably well; the other methods behaved randomly. When trucks were used as anomalous samples, only CapsNet(PP) and CapsNet(RE) were able to behave non-randomly. In the case of planes as abnormal data, CapsNet(RE) and VAE worked the best. Only in the case of humans as anomalies did CNN+OCSVM reach 0.7 auROC.

Figure 7: Performance on the Small-Norb data.

4 Conclusions and Future Work

Many modern data-driven intelligent systems require more accurate anomaly detection techniques. In this paper, we explore novel solutions in consideration of CapsNet’s distinct characteristics. We devise two normality score functions based on CapsNet’s activation probability and reconstruction error respectively. Experiments on three image data sets show that both methods have complementary strengths and outperform existing solutions in many setups.

In this paper, we did not discuss classification-based anomaly detection methods, as they essentially treat an anomaly detection task as a two-category classification problem by using both normal and abnormal samples in the training process. In contrast, all out-of-distribution anomaly detection methods discussed in this paper take only normal samples for training, because an anomalous data point could unpredictably come from anywhere outside the normal data distribution, which is true in many application domains. We also did not discuss deep reinforcement learning methods for anomaly detection, but this is a new area worthy of future investigation. The performance of our normality score functions depends on CapsNet’s learning capacity on certain data. However, as mentioned in (Sabour et al., 2017), like deep generative models, current CapsNets do not perform very well when the backgrounds of images vary too much (such as in CIFAR-10) to be modelled. Hence combining our two normality score functions with other solutions mentioned in Appendix A may lead to a more robust solution.


References

  • An & Cho (2015) An, J. and Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Technical report, Data Mining Center, Seoul National University, Seoul, South Korea, 2015.
  • Andrews et al. (2016) Andrews, J. T. A., Morton, E. J., and Griffin, L. D. Detecting anomalous data using auto-encoders. International Journal of Machine Learning and Computing, 6(1):21–26, 2016.
  • Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • Erfani et al. (2016) Erfani, S. M., Rajasegarar, S., Karunasekera, S., and Leckie, C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134, 2016.
  • Golan & EI-Yaniv (2018) Golan, I. and EI-Yaniv, R. Deep anomaly detection using geometric transformations. arXiv preprint arXiv:1805.10917, 2018.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Graves et al. (2013) Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.
  • Hinton & Salakhutdinov (2006) Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
  • Hinton et al. (2006) Hinton, G., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
  • Hinton et al. (2011) Hinton, G., Krizhevsky, A., and Wang, S. Transforming auto-encoder. In International Conference on Artificial Neural Networks, pp. 44–51, 2011.
  • Hinton et al. (2018) Hinton, G., Sabour, S., and Frosst, N. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
  • Kim & Scott (2008) Kim, J. and Scott, C. Robust kernel density estimation. Journal of Machine Learning Research, 13:3381–3384, 05 2008.
  • Kingma & Welling (2014) Kingma, D. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. (2004) LeCun, Y., Huang, F., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2004.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521:436–444, 2015.
  • Li (2013) Li, Y. Sparse machine learning models in bioinformatics. PhD thesis, School of Computer Science, University of Windsor, Windsor, Ontario, Canada, 2013.
  • Li & Zhu (2018a) Li, Y. and Zhu, X. Exploring Helmholtz machine and deep belief net in the exponential family perspective. In ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models, July 2018a.
  • Li & Zhu (2018b) Li, Y. and Zhu, X. Exponential family restricted Boltzmann machines and annealed importance sampling. In International Joint Conference on Neural Networks (IJCNN), pp. 39–48, July 2018b.
  • Li et al. (2015) Li, Y., Oommen, B., Ngom, A., and Rueda, L. Pattern classification using a new border identification paradigm: The nearest border technique. Neurocomputing, 157:105–117, 2015.
  • Parzen (1962) Parzen, E. On estimation of a probability density function and mode. Ann. Math. Statist., 33(3):1065–1076, 1962.
  • Ruff et al. (2018) Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 4393–4402. PMLR, 2018.
  • Sabour et al. (2017) Sabour, S., Frosst, N., and Hinton, G. Dynamic routing between capsules. In Neural Information Processing Systems, pp. 3856–3866, 2017.
  • Schlegl et al. (2017) Schlegl, T., Seebock, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157. Springer, 2017.
  • Scholkopf et al. (2001) Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., and Williamson, B. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443–1471, 2001.
  • Skvara et al. (2018) Skvara, V., Pevny, T., and Smidl, V. Are generative deep models for novelty detection truly better? In ACM SIGKDD 2018 Workshop on Outlier Detection De-constructed. ACM, 2018.
  • Tax & Duin (1999) Tax, D. and Duin, R. Support vector domain description. Pattern Recognition Letters, 20:1191–1199, 1999.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.

Appendix A Insights into Existing Work

Instead of simply enumerating all existing work for deep anomaly detection, we categorize these solutions into two families, and provide insights into their characteristics, strengths and challenges.

A classical viewpoint regarding anomaly detection is that learning the boundary of the data mass is more effective and straightforward than learning the density distribution of the data, because (1) available data were too few to cover the distribution in many cases, and (2) it was much more complicated and difficult to model the data distribution using a generative model. However, in the big data era, massive amounts of data have become available in many domains; much of the data is structured (such as images, graphs, time series, and text); and deep generative models have achieved promising successes in modelling such data, offering a new avenue for exploring distribution-based methods for anomaly detection. These two categories, boundary-based and distribution-based, are discussed in turn below.

A.1 Boundary-Based Methods

Kernel-based support vector domain description (SVDD) or one-class support vector machine (OCSVM) methods, including hypersphere (Tax & Duin, 1999) and hyperplane (Scholkopf et al., 2001) models, were a successful family for anomaly detection in the pre-deep-learning era. The idea of the hypersphere-based SVDD is to map data points from the input space to a high-dimensional space and learn a hypersphere that captures the core mass of the data distribution. Any data point outside the hypersphere is viewed as an abnormal sample. Please see (Li, 2013; Li et al., 2015) for a systematic discussion of SVDD methods. It is quite natural to think of deep extensions of these methods to continue their success in the deep learning age. Two efforts pursue this aim: deep hybrid methods and one-class neural network models. In deep hybrid methods (e.g. VAE+OCSVM (Andrews et al., 2016) and DBN+OCSVM (Erfani et al., 2016)), a supervised (e.g. CNN or recurrent net) or unsupervised (e.g. deep belief net (DBN) or variational autoencoder (VAE)) neural network is first employed to learn embedding representations of samples in the hidden space; an SVDD method then takes these latent representations as inputs to detect abnormal data points. The one-class neural network models (e.g. one-class deep SVDD (Ruff et al., 2018)) learn a neural network and SVDD together by optimising an adapted SVDD objective in the primal form. Both classes of methods have pros and cons. Deep hybrid methods are pipelines that are very flexible in choosing and combining different representation learning models (or pretrained embeddings) and SVDD models. However, these methods face a scalability challenge due to the size of the kernel matrices in the dual form of SVDD. Deep SVDD models explicitly use deep neural networks as feature extractors in place of implicit kernel tricks, and are scalable to large data due to the use of SVDD’s primal form and stochastic gradient descent. Nevertheless, this strategy lacks flexibility in practice: a specific model needs to be built for each specific type of data.

A.2 Distribution-Based Methods

As mentioned above, deep generative models (DGMs), such as the deep belief net (DBN) (Hinton et al., 2006; Li & Zhu, 2018a) and the variational autoencoder (VAE) (Kingma & Welling, 2014), can be applied as unsupervised feature learning techniques. More importantly, since DGMs aim at modelling the joint distribution of visible and latent variables (that is, $p(\mathbf{x}, \mathbf{h})$), their likelihood $p(\mathbf{x})$, obtained by marginalising out $\mathbf{h}$, may serve as an abnormality descriptor. However, the exact likelihood can only be obtained in quite few generative models, such as exponential family restricted Boltzmann machines (RBMs) (Li & Zhu, 2018b). In many cases when a recognition component is used for approximate inference, only the variational/evidence lower bound (ELBO) of the log-likelihood is available. Unfortunately, the ELBO may be too loose to be a normality indicator. Oftentimes, the ELBOs of normal and abnormal samples fall indistinguishably into the same range. Luckily, this is not the end of the story. In an architecture with an encoder (recognition) component and a decoder (generative) component, reconstruction error can serve as an abnormality measure, based on the intuition that out-of-distribution samples are reconstructed badly (An & Cho, 2015). But in the case of generative adversarial net (GAN) based methods (Schlegl et al., 2017), an encoder is unavailable. For a query sample $\mathbf{x}$, a supervised learning process has to be executed to search for a hidden representation $\mathbf{h}$ such that the corresponding generated sample best approximates $\mathbf{x}$. The approximation error and the probability from the GAN’s discriminator can together indicate the extent of abnormality.

A.3 Capsule Nets

The concept of a capsule was first introduced in the transforming autoencoder for image modelling (Hinton et al., 2011). But its potential capacity was not demonstrated until the publication of the vector capsule net (CapsNet) (Sabour et al., 2017) and the matrix capsule net (Hinton et al., 2018). Since both models are conceptually very similar, we used vector CapsNet in our studies. The normality score functions described in Section 2 should also work for matrix CapsNet. Figure 1 shows an example of such a network for the MNIST data, where each primary capsule is a vector of 8 neurons and a digit capsule has 16 neurons. Each capsule is a sparse linear combination of all transformed vectors of lower capsules. A capsule can encapsulate pose information (such as position, orientation, scaling, and skewness) and instantiation parameters (such as color and texture). The connections and activations between a parent capsule and child capsules are dynamically determined, so transformations and part-whole relationships can be modeled, implying a revolutionary potential for structured data modeling (such as images, videos, and documents).

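Formally, the $j$-th capsule at layer $l+1$ can be computed as

$$\mathbf{v}_j^{(l+1)} = \mathrm{squash}\big(\mathbf{s}_j^{(l+1)}\big) = \frac{\|\mathbf{s}_j^{(l+1)}\|_2^2}{1 + \|\mathbf{s}_j^{(l+1)}\|_2^2} \, \frac{\mathbf{s}_j^{(l+1)}}{\|\mathbf{s}_j^{(l+1)}\|_2}, \tag{3}$$

where $\mathbf{s}_j^{(l+1)}$ is a sparse linear combination of pose vectors from the lower layer:

$$\mathbf{s}_j^{(l+1)} = \sum_{i=1}^{K_l} c_{ij} \hat{\mathbf{u}}_{j|i}, \tag{4}$$

where $K_l$ is the total number of capsules at level $l$, and a pose vector $\hat{\mathbf{u}}_{j|i}$ is transformed from a capsule:

$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{v}_i^{(l)}, \tag{5}$$

where $\mathbf{W}_{ij}$ is a transformation matrix to be learned using back-propagation, $\mathbf{v}_i^{(l)}$ is the $i$-th capsule at level $l$, and $c_{ij}$ is called a coupling coefficient and is computed using a softmax function

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{j'=1}^{K_{l+1}} \exp(b_{ij'})}, \tag{6}$$

where $b_{ij}$ is determined using a dynamic routing algorithm based on the inner product (cosine) $\hat{\mathbf{u}}_{j|i}^{\top} \mathbf{v}_j^{(l+1)}$.

The squash function in Equation 3 allows the length of a capsule, $\|\mathbf{v}_j^{(l+1)}\|_2$, to serve as its activation probability. Thus, there is no explicit activation unit in vector CapsNet.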
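The computation above can be sketched in a few lines of plain Python. The sketch follows the routing-by-agreement procedure of (Sabour et al., 2017) as we understand it: routing logits start at zero, coupling coefficients are a softmax over the upper capsules, and each logit is increased by the inner product between a prediction vector and the resulting upper capsule. The function names, the three routing iterations, the `eps` guard, and the toy prediction vectors are assumptions made for illustration.

```python
import math

def squash(s, eps=1e-9):
    """Shrink s so that its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(c * c for c in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * c for c in s]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for upper capsule j."""
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]  # routing logits
    v = []
    for _ in range(num_iters):
        c = [softmax(row) for row in b]  # coupling coefficients per lower capsule
        v = []
        for j in range(n_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_lower):
            for j in range(n_upper):
                # Agreement: inner product of prediction and upper capsule output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v

# Three lower capsules, two upper capsules, 2-D vectors (hypothetical values).
# Predictions for upper capsule 0 agree; those for upper capsule 1 conflict.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[0.9, 0.1], [0.0, 1.0]],
]
v = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(c * c for c in vec)) for vec in v]
print(lengths)  # capsule 0 should be the longer (more active) one
```

The capsule whose incoming predictions agree ends up with the larger length, which is what the PP score in Section 2.1 relies on.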

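The objective function is defined by a margin loss which uses the norms of digit capsules as prediction probabilities:

$$L = \sum_{k=1}^{K} L_k, \tag{7}$$

where $K$ is the total number of classes, and

$$L_k = T_k \max\big(0, m^{+} - \|\mathbf{v}_k\|_2\big)^2 + \lambda (1 - T_k) \max\big(0, \|\mathbf{v}_k\|_2 - m^{-}\big)^2, \tag{8}$$

where $y$ is the actual class label, $\mathbf{v}_k$ is a digit capsule, $T_k = 1$ if and only if $y = k$, $m^{+} = 0.9$, $m^{-} = 0.1$, and $\lambda = 0.5$.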
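A minimal sketch of the margin loss for a single sample is given below. It assumes the form used in (Sabour et al., 2017), with margins 0.9 and 0.1 and a down-weighting factor of 0.5 for absent classes; these defaults, the function name `margin_loss`, and the toy capsule values are assumptions for illustration.

```python
import math

def margin_loss(digit_capsules, label, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss of one sample; capsule lengths act as class probabilities.
    The default constants are assumed from Sabour et al. (2017)."""
    total = 0.0
    for k, v in enumerate(digit_capsules):
        length = math.sqrt(sum(c * c for c in v))
        if k == label:
            # Present class: penalize a capsule shorter than m_plus.
            total += max(0.0, m_plus - length) ** 2
        else:
            # Absent class: penalize a capsule longer than m_minus.
            total += lam * max(0.0, length - m_minus) ** 2
    return total

# Two classes with 2-D capsules (hypothetical values).
capsules = [[0.95, 0.0], [0.05, 0.0]]
print(margin_loss(capsules, label=0))  # confident and correct: no penalty
print(margin_loss(capsules, label=1))  # confident but wrong: both terms active
```

Because the loss treats each class independently, it does not force the capsule lengths to sum to one, which is why all of them can be small for an unexplained input.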

Dynamic routing is used between the primary capsule layer and the digit capsule layer. Thus, in Figure 1, the lower-level capsules dynamically connect to the 10 higher-level digit capsules. Since bottom-up activation paths form a tree-like structure and a very deep tree is impractical, CapsNet should not have many capsule layers.

The architecture of CapsNet used in this paper is identical to that of (Sabour et al., 2017). The number of epochs was set to for experiments on MNIST dataset, and for both Fashion-MNIST and Small-Norb datasets; batch size was chosen as for MNIST and Fashion-MNIST datasets and for Small-Norb dataset.

Appendix B Benchmarks

Our CapsNet-based anomaly detection methods were compared with three principled methods from the two families of methods discussed in Appendix A. These benchmarks are described below. Please note that the method presented in (Golan & EI-Yaniv, 2018) is unavailable for multi-class normal data, so we were unable to compare with it.

  • We implemented a deep hybrid method named CNN+OCSVM, in which a CNN was used to learn the latent representations and an OCSVM was used for anomaly detection. The CNN component has two convolutional layers (3×3 receptive fields in both layers, 32 and 64 feature maps respectively for the first and second layers, max-pooling with a pooling size of 2×2 after the second layer, ReLU activation function), one fully connected layer (128 units), and a softmax layer for class labels of normal samples. The outputs of the fully connected layer were the latent representations used for searching the hyperparameters and optimizing the model parameters of the OCSVM.

  • We employed a three-layer DBN to model data distribution and then measured reconstruction error to score abnormality. There are 500 units in each hidden layer. The model was layer-wise pretrained by RBMs. Bernoulli distribution was assumed for both visible and hidden layers. The pixel values were scaled to interval [0,1] as input to DBN. Reconstruction error of an inquiry sample was used to detect anomalous samples.

  • Similarly, a convolutional VAE was applied to capture the data distribution. The inference component (encoder) has the same architecture as the CNN component (disregarding the output layer) in CNN+OCSVM. The latent space size was set to 64. The generative component (decoder) mirrors the structure of the inference component. Reconstruction error was used to determine abnormality.

Appendix C Datasets

Our two methods and the three benchmarks were evaluated on three image datasets of increasing difficulty: MNIST, Fashion-MNIST, and Small-Norb. In the learning stage, all methods were trained using the same normal training examples. In the test stage, the same normal samples and anomalous data were used for all methods. We kept the normal and anomalous test data balanced, so that we can focus on comparison using auROC and leave the class imbalance issue for future investigation.

  • MNIST: It contains a training set of 60,000 28×28 grayscale digit images and a test set of 10,000 same-resolution grayscale examples from approximately 250 different writers (LeCun et al., 1998).

  • Fashion-MNIST: It is a dataset of Zalando’s article images, comprising 70,000 MNIST-like labeled 28×28 fashion images with 7,000 images per category (Xiao et al., 2017). The training set has 60,000 images and the test set has 10,000 images. The samples come from 10 classes: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.

  • Small-Norb: It contains 96×96 grayscale image pairs of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars (LeCun et al., 2004). The objects were imaged by two cameras under 6 lighting conditions, 9 elevations, and 18 azimuths. As in (Sabour et al., 2017), the images were resized to 48×48; random 32×32 crops of them were obtained during the training process. Central 32×32 patches of test images were used during testing.
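The evaluation protocol described above (training on normal classes only, testing on a balanced mix of normal and anomalous samples) can be sketched as follows. Balancing by random subsampling of the larger group, the function names, the fixed seed, and the toy label lists are all assumptions made for illustration; the text only states that the test data were kept balanced.

```python
import random

def split_by_class(labels, anomalous_classes):
    """Return indices of normal samples and indices of anomalous samples."""
    normal_idx = [i for i, y in enumerate(labels) if y not in anomalous_classes]
    anomalous_idx = [i for i, y in enumerate(labels) if y in anomalous_classes]
    return normal_idx, anomalous_idx

def balanced_test_indices(test_labels, anomalous_classes, seed=0):
    """Subsample the larger group so both groups have the same size (assumed)."""
    normal_idx, anomalous_idx = split_by_class(test_labels, anomalous_classes)
    n = min(len(normal_idx), len(anomalous_idx))
    rng = random.Random(seed)
    return sorted(rng.sample(normal_idx, n)), sorted(rng.sample(anomalous_idx, n))

# Toy label lists standing in for a 10-class dataset (hypothetical).
train_labels = [k % 10 for k in range(50)]
test_labels = [k % 10 for k in range(100)]
anomalous = {9}  # treat class 9 as the anomalous class

# Training uses normal classes only; anomalous training samples are discarded.
train_idx, _ = split_by_class(train_labels, anomalous)
test_normal, test_anomalous = balanced_test_indices(test_labels, anomalous)
print(len(train_idx), len(test_normal), len(test_anomalous))
```

Repeating this with each class in turn as `anomalous` gives the per-class comparisons reported in Section 3.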