Latent-Insensitive Autoencoders for Anomaly Detection and Class-Incremental Learning

Reconstruction-based approaches to anomaly detection tend to fall short when applied to complex datasets with target classes that possess high inter-class variance. Similar to the idea of self-taught learning used in transfer learning, many domains are rich with similar unlabeled datasets that could be leveraged as a proxy for out-of-distribution samples. In this paper we introduce the Latent-Insensitive Autoencoder (LIS-AE), where unlabeled data from a similar domain is utilized as negative examples to shape the latent layer (bottleneck) of a regular autoencoder such that it is only capable of reconstructing one task. Since the underlying goal of LIS-AE is to reconstruct only in-distribution samples, it is naturally applicable in the domain of class-incremental learning. We treat class-incremental learning as multiple anomaly detection tasks by adding a different latent layer for each class and using the other classes available in each task as negative examples to shape each latent layer. We test our model in multiple anomaly detection and class-incremental settings, presenting quantitative and qualitative analysis that showcases the accuracy and flexibility of our model for both anomaly detection and class-incremental learning.


1. Introduction

Anomaly detection is a classical machine learning field concerned with the identification of in-distribution and out-of-distribution samples. Unlike traditional multi-class classification, where the goal is to find decision boundaries between classes present in a given dataset, the goal of anomaly detection is to find one-versus-all boundaries for classes that are not in the dataset, which is significantly more challenging than standard classification. Autoencoders (Bengio et al., 2007) have been used extensively for anomaly detection (Zhou and Paffenroth, 2017; Zimek et al., 2012; Chalapathy et al., 2017) under the assumption that the reconstruction error incurred by anomalies is higher than that of normal samples (Hasan et al., 2016; Zong et al., 2018). However, it has been observed that this assumption might not hold, as standard autoencoders can generalize well even for anomalies (Gong et al., 2019; Zong et al., 2018). In practice, this issue becomes more relevant in two important settings: when the normal data is relatively complex and requires a high latent dimension for good reconstruction, and when anomalies share similar compositional features and come from a domain close to the normal data (Chandola et al., 2009).
To mitigate these issues we present the Latent-Insensitive Autoencoder (LIS-AE), a new class of autoencoders in which training is carried out in two phases. In the first phase, the model simply reconstructs the input as a standard autoencoder; in the second phase, the entire model except the latent layer is "frozen". We then train the model in a way that forces the latent layer to keep reconstructing only the target task. We use the concept of a negative dataset from one-class classification (Weston et al., 2006), whereby an auxiliary dataset of non-examples from similar domains is used as a proxy for out-of-distribution samples. We change the training objective such that the autoencoder keeps its low reconstruction error for the target dataset while pushing the error of the negative dataset to exceed a certain value.

Figure 1. The two phases of training. (a) The first phase: feature extraction. (b) The second phase starts by freezing the model except the latent layer; negative examples are used to fine-tune the latent layer to be responsive only to the positive dataset.

To examine the behaviour of our model under distributional shift, we test our model in Continual Learning (CL) settings. Continual learning, also known as incremental learning, is a machine learning paradigm aiming at developing models that can continually learn from a potentially infinite stream of classification tasks, incorporating new knowledge while retaining previously learned skills (Zenke et al., 2017). Moreover, a CL algorithm must admit additional constraints; most notably, access to all data is assumed not to be available during training, and the model is not allowed to replay all previous data. The main challenge of CL is that most current machine learning models are prone to what is known as catastrophic forgetting (Tan et al., 2018): the phenomenon of a machine learning model experiencing abrupt performance degradation on previously learned concepts when trained to acquire new skills (French, 1999). In other words, models tend to "fit" the most recent task and "forget" previously learned ones. To formalize the problem, we are given a stream of tasks $T_1, \dots, T_N$ where each task $T_t$ contains a dataset $D_t = \{(x_i, y_i)\}_{i=1}^{n_t}$ such that $x_i$ represents a sample and $y_i$ represents its target. The goal of continual learning models is to learn each task sequentially under the constraint that access to previous data is unavailable or limited. Depending on the availability of a task descriptor $t$ during test time, the problem is categorized further into two distinct test protocols, namely task-incremental and class-incremental settings. In task-incremental test settings, the model is presented with samples from unknown classes along with a task identifier $t$ that indicates which task each sample belongs to. More formally, the model is estimating $p(y \mid x, t)$. On the other hand, in class-incremental test settings, the model is estimating $p(y \mid x)$, since it is only presented with the unknown sample $x$ without any additional information about which task it belongs to. This in turn makes class-incremental settings significantly more challenging than task-incremental settings and explains why most of the work published in the field assumes that $t$ is presented during testing (Li and Hoiem, 2017; Lee et al., 2017; Rusu et al., 2016). To elicit the intuition behind our approach, we first break the problem of class-incremental learning into a more granular task-agnostic formulation: instead of assuming that the model is presented with a stream of tasks containing multiple classes, we assume that the model is presented with a stream of independent classes. One extreme solution that completely avoids catastrophic forgetting by construction is to learn a separate generative model for each class. However, learning generative models is typically not easy, and the number of models would grow with each new class, which makes this solution very inefficient. The other extreme is to use a single model for capacity efficiency and train on each class incrementally; however, the model would suffer greatly from catastrophic forgetting, and thus various regularizers are added. The spectrum of approaches between these two extremes is best captured by the so-called "stability-plasticity" dilemma (Mermillod et al., 2013). The decision to use autoencoders as part of the proposed model is dictated by, firstly, their unsupervised nature and, secondly, the observation that autoencoders are comparatively resilient to catastrophic forgetting (Thai et al., 2021). For CL, since the bottleneck layer is the most sensitive part of our model for a given task, we create a separate latent layer for each class and repeat the aforementioned process. Details of the architecture, training process, and experiments are discussed in the following sections.

2. Related Work

2.1. Anomaly Detection

Many reconstruction-based anomaly detection approaches have been proposed, starting with classical methods such as PCA (Pearson, 1901). Robust-PCA mitigates the outlier sensitivity of PCA by decomposing the data matrix into a sum of a low-rank and a sparse matrix, using the nuclear and $\ell_1$ norms as convex relaxations in the objective (Candès et al., 2011). Autoencoders address the issue of PCA only considering linear relations in feature space by introducing nonlinearities, benefiting from multiple layers of representations (Bourlard and Kamp, 1988). We elaborate further on the shortcomings of PCA and autoencoders in the theoretical section and use that to motivate our approach.

Other methods try to improve on base autoencoders by endowing the latent code with particular properties. The VAE (An and Cho, 2015) does so by constraining the latent code to follow a prior distribution (usually normal), which also allows sampling from the decoder. However, in the context of anomaly detection it introduces scaling issues, since minimizing the KL-divergence for the high latent dimensions required by complex tasks is quite challenging. Another approach is Replicator Neural Networks (RepNN) (Hawkins et al., 2002), an autoencoder with a staircase activation function placed on the output of the bottleneck (latent) layer. This is mainly used to quantize the latent code into a number of discrete values, which also aids in forming clusters (Williams et al., 2002). Unfortunately, a discrete staircase function is non-differentiable, which prevents learning via backpropagation. Instead, a differentiable approximation involving a sum of $N$ hyperbolic tangent (tanh) functions was introduced in place of the otherwise non-differentiable discrete staircase. However, as discussed in (Tóth and Gosztolya, 2004), despite the theoretical appeal of a quantized latent code via smooth approximation, in practice such an activation function makes it significantly harder for the gradient signal to flow. We also note that increasing the number of levels $N$ in the aforementioned tanh-sum approximation presents a significant overhead during training and testing, since $N$ activation functions have to be computed for each batch; moreover, it suffered from scaling issues similar to those of the VAE.
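For reference, the smooth staircase is typically written as a sum of shifted tanh terms. A representative form with $N$ levels and steepness parameter $a$ (the exact constants vary between formulations) is

$S(z) = \frac{1}{2} + \frac{1}{2(N-1)} \sum_{j=1}^{N-1} \tanh\left(a\left(z - \frac{j}{N}\right)\right),$

which approaches a discrete $N$-level staircase as $a$ grows, at the cost of near-zero gradients between steps, which is precisely the gradient-flow problem noted above.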
Another approach is memorizing the normality of a given dataset using a Memory-augmented Autoencoder (MemAE) (Gong et al., 2019). This approach limits the effective space of possible latent codes by constructing a memory module that takes the output of the encoder as an address and passes to the decoder the most relevant memory items from a stored reservoir of prototypical patterns learned during training.
Other, non-reconstruction-based approaches include one-class classification, which is tightly connected to anomaly detection in the sense that both problems are concerned with finding one-versus-all boundaries. One-Class SVM (OC-SVM) is a variation of the classical SVM algorithm (Cortes and Vapnik, 1995) where the objective is to find a hyperplane that best separates samples from outliers (Schölkopf et al., 2001). Support Vector Data Description (SVDD) (Tax and Duin, 2004) tries to find a circumscribing hypersphere that contains all samples while having an optimal margin for outliers. It is worth noting that for kernels where $k(x, x)$ is constant, such as the RBF and Laplacian kernels, OC-SVM and OC-SVDD learn identical decision functions (Lampert, 2009). To address the lack of representation learning and the poor computational scalability of OC-SVM and OC-SVDD, Deep SVDD (OC-DSVDD) employs a deep neural network that learns useful representations while mapping outputs to a hypersphere of minimum volume (Ruff et al., 2018). However, due to its sole reliance on optimizing for minimum volume, this approach is prone to hypersphere collapse, which leads to uninformative features (Perera and Patel, 2019).

Other approaches have been proposed where an auxiliary dataset of non-examples (negative dataset) is drawn from similar domains as a proxy for the otherwise intractable complement of the target class. In (Weston et al., 2006), a collection called the "Universum" allows learning representations useful to the domain of the problem via maximizing the number of contradictions on an equivalence class. Similar to OC-DSVDD, (Perera and Patel, 2019) leverages a labeled dataset from a close domain to fine-tune two pre-trained CNNs in order to learn good features. The goodness of these features is quantified by compactness (intra-class variance) for the target class and descriptiveness (cross-entropy) for the labeled dataset. Despite avoiding hypersphere collapse and outperforming OC-SVDD, this approach requires two pre-trained neural networks and a large labeled dataset along with the target dataset. Another approach that makes use of a large auxiliary dataset is Outlier Exposure (OE) (Hendrycks et al., 2018), a supervised approach that trains a standard neural network classifier while exposing it to a diverse set of non-examples on which the output of the classifier is optimized to follow a uniform distribution using an additional cross-entropy loss.

2.2. Continual Learning

Progressive Neural Networks (ProgNets) (Rusu et al., 2016) prevent catastrophic forgetting by dynamically adding separate subnetworks to accommodate new tasks while maintaining lateral connections that allow forward transfer of knowledge from previous tasks. The two major problems with this approach are that model parameters grow quadratically with each new task, and that the model assumes knowledge of task boundaries to determine which output to select. This, by definition, renders ProgNets unsuitable for class-incremental learning.

Another approach that adds separate networks to accommodate new tasks is Expert-Gate (Aljundi et al., 2017). One major advantage of Expert-Gate over ProgNets is that it does not require knowledge of task boundaries: it uses an undercomplete autoencoder-based gating mechanism to determine which subnetwork, called an "Expert", is to be active, using the reconstruction loss of the auxiliary autoencoders as a metric to determine which task the sample is drawn from. However, similar to ProgNets, this approach also suffers from complexity issues, since with each new task a new network is added along with a separate autoencoder to serve as its "gate". Another assumption that Expert-Gate relies on is that an autoencoder trained for a particular task will not generalize well to another similar task, so that a simple gate suffices to distinguish between complex tasks. However, such an assumption may not hold, as the autoencoder might generalize to new tasks, typically when the task is from a similar domain or when the autoencoder gate is trained on a task with relatively large intra-class or inter-class variance (Gong et al., 2019).

Pseudo-rehearsal via Generative Replay (GR) (Shin et al., 2017) is a bio-inspired approach that mimics the way the hippocampus in the human brain interacts with the neocortex when learning new concepts (Lavenex and Amaral, 2000). GR architectures have two main components: a deep generative module (usually a GAN (Goodfellow et al., 2014) or a VAE (Kingma and Welling, 2013)), and a "solver" module (usually a CNN (LeCun and Cortes, 2010)) responsible for solving new tasks. The generator is trained to output instances that follow the same data distribution, such that when a new task arrives, the solver trains on the synthetic data produced by the generator along with the new data in order to alleviate catastrophic forgetting. Despite their flexibility and biological appeal, generative-replay-based methods have three major disadvantages: training a generative model on a stream of changing synthetic and real data is challenging (Liu et al., 2020); GR models tend to fall short when dealing with complex datasets (Aljundi et al., 2019); and the time required to train the model on a new task increases linearly, since the model has to generate and rehearse previous tasks. Many variants have been proposed to address these issues. Conditional GANs operate in constant time but have lower accuracy (Lesort et al., 2019), while other variants depend on non-incremental pre-trained networks (van de Ven et al., 2020; Liu et al., 2020).

Regularization-based techniques such as elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), synaptic intelligence (SI) (Zenke et al., 2017) and incremental moment matching (IMM) (Lee et al., 2017) modify model parameters in a way that preserves weights important for previous tasks while finding less sensitive weights to accommodate new tasks. Despite the popularity of these methods, they tend not to perform well in class-incremental settings, since they were originally introduced for task-incremental learning (Van de Ven and Tolias, 2018).

3. Proposed Method

3.1. Architecture

An undercomplete deep autoencoder is a type of unsupervised feed-forward neural network for learning a lower-dimensional feature representation of a particular dataset via reconstructing the input back at the output layer. To prevent autoencoders from converging to trivial solutions such as the identity mapping, a bottleneck layer whose output $z$ has a dimension less than the dimension of the input $x$ is used. The forward pass is computed as follows:

(1) $e = f(x)$
(2) $z = g(e)$
(3) $\hat{x} = h(z)$

where $x$ is the input, $g$ is the bottleneck (latent) layer, and $f$ and $h$ are convolutional neural networks representing the encoder and decoder modules respectively. Typically, such models are trained to minimize the $L_2$-norm of the difference between the input $x$ and the reconstructed output $\hat{x}$. As previously discussed, the choice of the activation function of $g$ plays an important role in anomaly detection. Activation functions that quantize the latent code or encourage forming clusters are preferable. In our experiments, we find that confining the latent code to values between $-1$ and $1$ with a tanh activation function, as we maximize the loss over the negative dataset during the latent-shaping phase, has a regularizing effect. We also note that unbounded activation functions such as ReLU tend to perform poorly.
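To make the preceding description concrete, the following is a minimal PyTorch sketch of such an autoencoder for 28x28 grayscale inputs (e.g., MNIST). The channel counts, layer sizes, and latent dimension are illustrative assumptions, not the exact configuration used in our experiments.

import torch
import torch.nn as nn

class LISAE(nn.Module):
    # Sketch of LIS-AE for 28x28 grayscale inputs; sizes are illustrative.
    def __init__(self, latent_dim=64):
        super().__init__()
        # Encoder f: two strided convolutions with LeakyReLU non-linearities.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 28x28 -> 14x14
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 14x14 -> 7x7
            nn.LeakyReLU(0.2),
            nn.Flatten(),
        )
        # Latent layer g: fully connected with tanh, bounding codes in [-1, 1].
        self.latent = nn.Sequential(nn.Linear(64 * 7 * 7, latent_dim), nn.Tanh())
        # Decoder h: fully-connected layer, two deconvolutions, sigmoid output.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 7 * 7),
            nn.LeakyReLU(0.2),
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 14x14 -> 28x28
            nn.LeakyReLU(0.2),
            nn.Conv2d(16, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.latent(self.encoder(x)))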

3.2. Terminology

Positive Dataset ($D^+$): This is the dataset that contains the normal class(es), for example, the plane class from CIFAR-10.

Negative Dataset ($D^-$): This is a secondary unlabeled dataset containing negative examples from a similar domain as $D^+$. The choice of $D^-$ depends on $D^+$. For example, if $D^+$ is the digit 0 from MNIST, $D^-$ might be random strokes or another dataset with similar features such as Omniglot (Tang et al., 2017). It is important to note that the model should not be tested on $D^-$, since this would violate the assumption of not knowing the anomalies.

Feature Extraction Phase: This is the first phase of training. The model is simply trained to reconstruct its input. This may also include reconstructing negative examples in order to extract useful features, as in the case of continual learning.
Latent-Shaping Phase: This is the second phase of training. The encoder and decoder networks are frozen and only the latent layer is trained.

3.3. Training for Anomaly Detection

Input: Positive ($D^+$) and Negative ($D^-$) datasets
 // $f$: Encoder, $g$: Latent Layer, $h$: Decoder
Output: Trained model
 // Feature extraction phase
 // Sample mini-batches $x^+$ from $D^+$
for each mini-batch $x^+$ until convergence do
       $\hat{x}^+ \leftarrow h(g(f(x^+)))$ // backpropagation step
       Minimize $\|x^+ - \hat{x}^+\|_2^2$
end for
 // Latent-shaping phase: freeze $f$ and $h$, train only $g$
 // Sample mini-batches $x^+$ from $D^+$ and $x^-$ from $D^-$
for each pair $(x^+, x^-)$ until convergence do
       $\hat{x}^+ \leftarrow h(g(f(x^+)))$, $\hat{x}^- \leftarrow h(g(f(x^-)))$ // backpropagation step
       Minimize $\|x^+ - \hat{x}^+\|_2^2 + \alpha \max(0, \beta - \|x^- - \hat{x}^-\|_2^2)$
end for
Algorithm 1 Anomaly Detection Training

Given a dataset $D^+$ and a negative dataset $D^-$ from a similar domain to $D^+$, we divide the training process into two phases. The first phase reconstructs samples from $D^+$ by minimizing the loss function $\mathcal{L} = \|x^+ - \hat{x}^+\|_2^2$ until convergence, where $x^+$ is the input drawn from $D^+$ and $\hat{x}^+$ is the output of the autoencoder. In the second phase, we freeze the model except for the latent layer and minimize the following loss function:

(4) $\mathcal{L} = \|x^+ - \hat{x}^+\|_2^2 + \alpha \max\left(0,\; \beta - \|x^- - \hat{x}^-\|_2^2\right)$

where $x^-$ is a sampled batch from $D^-$, $\hat{x}^-$ is its reconstruction, $\alpha$ is a hyperparameter that controls the relative effect of the two parts of the loss function, and $\beta$ is another hyperparameter indicating that we are satisfied once the reconstruction error of the negative dataset exceeds a certain value.
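For illustration, the following is a minimal sketch of this latent-shaping phase for a model like the one sketched in section 3.1, assuming pos_loader and neg_loader yield (image, label) batches; the values of alpha, beta, and the learning rate are placeholders.

import torch
import torch.nn.functional as F

def latent_shaping_phase(model, pos_loader, neg_loader, alpha=1.0, beta=5.0, epochs=10):
    # Freeze the entire model except the latent layer (eq. 4 is optimized
    # with respect to the latent layer's parameters only).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.latent.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.latent.parameters(), lr=1e-3)

    for _ in range(epochs):
        for (x_pos, _), (x_neg, _) in zip(pos_loader, neg_loader):
            loss_pos = F.mse_loss(model(x_pos), x_pos)
            loss_neg = F.mse_loss(model(x_neg), x_neg)
            # Keep the positive error low while pushing the negative error
            # above beta; the hinge stops the maximization once beta is reached.
            loss = loss_pos + alpha * torch.clamp(beta - loss_neg, min=0.0)
            opt.zero_grad()
            loss.backward()
            opt.step()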

3.4. Predicting Anomalies

We use the reconstruction error $\ell(x) = \|x - \hat{x}\|_2^2$ to distinguish between anomalies and normal data, where $x$ is the test sample and $\hat{x}$ is the reconstructed output. More specifically, we set a threshold $\lambda$ such that if $\ell(x) > \lambda$ the input is considered anomalous.
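A minimal scoring sketch under these definitions; in practice the threshold $\lambda$ is chosen on held-out data or swept over to produce the AUC numbers reported in section 5:

import torch

@torch.no_grad()
def anomaly_scores(model, x):
    # Per-sample squared reconstruction error, used as the anomaly score.
    rec = model(x)
    return ((x - rec) ** 2).flatten(1).sum(dim=1)

def predict_anomalies(model, x, threshold):
    # A sample is flagged as anomalous when its score exceeds the threshold.
    return anomaly_scores(model, x) > threshold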

3.5. Training for Continual Learning

Figure 2. Training in class-incremental settings for a task containing two classes. (a) Feature extraction phase: the model reconstructs the input. (b) A new latent layer is added. (c-d) Freeze the model except the latent layers, split the dataset into positive and negative parts, select one latent layer, and proceed with training as described in section 3.3.

As can be seen in Algorithm 2, we are given a stream of sequential tasks $T = \{T_1, \dots, T_N\}$ with optional auxiliary negative datasets, where each task $T_t$ contains a dataset $D_t = \{(x_i, y_i)\}_{i=1}^{n_t}$, such that $x_i$ represents the $i$-th sample, $y_i$ represents its target, $C_t$ is the number of classes present in $T_t$, and $N$ is the total number of tasks. For a particular task $T_t$, the first phase of training on each class in $T_t$ is similar to the first phase described in Algorithm 1. For each latent layer $g_c$, $D_t$ is split into $D_c^+$ and $D_c^-$ such that $D_c^+ = \{x_i : y_i = c\}$ and $D_c^- = \{x_i : y_i \neq c\}$. For example, if task $T_t$ is to classify digits 8 and 9 from the MNIST dataset, we consider samples of the digit 9 as $D^-$ for the latent layer corresponding to digit 8, and samples of the digit 8 as $D^-$ for digit 9. Of course, this does not preclude the possibility of having an additional auxiliary negative dataset (further details on the effect of negative examples are given in section 5.2). We notice that by selecting a particular latent layer we end up with a scenario similar to the second phase described in section 3.3, and we resume training by minimizing the same loss function:

(5) $\mathcal{L} = \|x^+ - \hat{x}^+\|_2^2 + \alpha \max\left(0,\; \beta - \|x^- - \hat{x}^-\|_2^2\right)$

3.6. Prediction for Continual Learning

Given an unknown sample $x$ to be classified, we iterate over each latent layer as shown in fig. 3 and compute the reconstruction error. The latent layer with the minimum error determines the predicted class.
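The following sketch implements this prediction rule, assuming the model keeps a list of per-class latent layers (here model.latents) analogous to the single latent layer of section 3.1:

import torch

@torch.no_grad()
def predict_class(model, x):
    # Route x through every latent layer and pick the class whose latent
    # reconstructs it with the minimum error (fig. 3).
    feats = model.encoder(x)
    errors = []
    for latent in model.latents:  # one latent layer per learned class
        rec = model.decoder(latent(feats))
        errors.append(((x - rec) ** 2).flatten(1).sum(dim=1))
    return torch.stack(errors, dim=1).argmin(dim=1)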

Input: Tasks $T = \{T_1, \dots, T_N\}$, Auxiliary Negative Datasets (Optional)
 // $f$: Encoder, $g_c$: Latent Layer for class $c$, $h$: Decoder
Output: Trained model
for each task $T_t$ do
        // Feature extraction phase
        // Sample mini-batches $x$ from $D_t$
       for each mini-batch $x$ until convergence do
              $\hat{x} \leftarrow h(g(f(x)))$ // backpropagation step
             Minimize $\|x - \hat{x}\|_2^2$
       end for
        // Freeze $f$ and $h$; train only latent layers
       for each class $c$ in $T_t$ do
              // Latent-shaping phase: add and select latent layer $g_c$
              // Sample mini-batches $x^+$ from $D_c^+$ and $x^-$ from $D_c^-$
             for each pair $(x^+, x^-)$ until convergence do
                    $\hat{x}^+ \leftarrow h(g_c(f(x^+)))$, $\hat{x}^- \leftarrow h(g_c(f(x^-)))$ // backpropagation step
                   Minimize $\|x^+ - \hat{x}^+\|_2^2 + \alpha \max(0, \beta - \|x^- - \hat{x}^-\|_2^2)$
             end for
       end for
end for
Algorithm 2 Continual Learning Training

4. Theoretical Justification

4.1. Formulation

From the optimality of autoencoders (Bourlard and Kamp, 1988), we know that, absent non-linear activation functions, a linear autoencoder corresponds to singular value decomposition (SVD); henceforth, we use SVD interchangeably with linear autoencoders. Given an $m \times n$ data matrix $X$, we decompose any point $x$ into $x = x^{\parallel} + x^{\perp}$, where $x^{\parallel}$ lies in the principal subspace of the data and $x^{\perp}$ is its orthogonal complement component.

Figure 3. Prediction in incremental learning settings. The correct class corresponds to the latent code with the least reconstruction error.

We further decompose $X$ using SVD, $X = U \Sigma V^{\top}$, where $U$ and $V$ are orthonormal matrices and $\Sigma$ is a diagonal matrix such that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n \geq 0$. However, in practice the data is rarely separated this neatly, especially when dealing with a large number of samples of a high-dimensional dataset; therefore, we resort to reduced SVD, where we take the first $k$ columns of $U$, with the caveat that the choice of $k$ is a hyper-parameter.


The matrix $U$ can be divided thus: $U = [\,U_k \mid U_{\perp}\,]$, and from the Eckart–Young low-rank approximation theorem, the columns of $U_k$ span the best rank-$k$ approximation of the data while the columns of $U_{\perp}$ span its orthogonal complement.
A linear autoencoder with a $k$-dimensional latent layer is equivalent to the following transform:

$\hat{x} = DEx,$

where $D$ and $E$ represent the decoder and the encoder respectively (at the optimum, $D = U_k$ and $E = U_k^{\top}$). Furthermore, any data point $x$ can be represented as $x = U_k a + U_{\perp} b$, where $a$ and $b$ are $k$- and $(n-k)$-dimensional real vectors. By orthonormality, we have the following identities: $U_k^{\top} U_k = I_k$ and $U_{\perp}^{\top} U_k = 0$, where $I_k$ is a $k \times k$ identity matrix. As a shorthand, we write $U$ instead of $U_k$. Using these two identities, we rewrite the reconstruction loss as follows:

$\mathcal{L}(x) = \|x - DEx\|_2^2 = \|U_{\perp} b\|_2^2 = \|b\|_2^2 = \|x^{\perp}\|_2^2.$
We note that this loss function is agnostic to the direction of $x^{\perp}$ and is only concerned with its magnitude. The assumption for anomaly detection under this setting is that $\|x_a^{\perp}\| \gg \|x_p^{\perp}\|$, where $x_a^{\perp}$ and $x_p^{\perp}$ correspond to the orthogonal components of anomalies and positive data respectively. We posit that while this agnosticism is desirable for potential generality, it is not optimal for anomaly detection; hence, we modify the loss score to depend on the nature of $x^{\perp}$:

$\mathcal{L}'(x) = \|A\, x^{\perp}\|_2^2,$

where $A$ is an $n \times n$ matrix such that the loss is small for normal data but large for anomalies. In other words, we want $\|A x_p^{\perp}\|$ to be small and $\|A x_a^{\perp}\|$ to be large.
We define an orthonormal basis $W_a$ for the span of the anomalous orthogonal components and an orthonormal basis $W_p$ for the column space of $X_p^{\perp}$, where $X_p^{\perp}$ is the matrix of all positive orthogonal components $x_p^{\perp}$. We decompose $W_a$ further into $W_1$ and $W_2$, where the columns of $W_1$ are the basis vectors of $W_a$ that are not in the span of $W_p$ and the columns of $W_2$ are the remaining columns of $W_a$.
Since $x_a^{\perp}$ lies in the span of $W_a$, any $x_a^{\perp}$ can be written as $x_a^{\perp} = W_1 c_1 + W_2 c_2$, where $c_1$ and $c_2$ are real vectors.
Despite the fact that we do not have access to anomalies, we can utilize other negative examples from a similar domain and use their orthogonal components $x_n^{\perp}$ as a proxy for $x_a^{\perp}$. Since $x_n^{\perp}$ also decomposes over $W_1$ and $W_2$, maximizing $\|A x_n^{\perp}\|$ implies maximizing $\|A x_a^{\perp}\|$, assuming that $x_a^{\perp}$ and $x_n^{\perp}$ share the directions in $W_1$. The latter assumption hinges on the fact that $D^-$ is from a similar domain. Therefore, we end up with the goal of finding $A$ such that $\|A x_p^{\perp}\|$ is small and $\|A x_n^{\perp}\|$ is large.

In practice, we cannot maximize $\|A x_n^{\perp}\|$ indefinitely, and we are satisfied if it reaches a certain large value $\beta$:

$\min_{A} \; \|A x_p^{\perp}\|_2^2 + \alpha \max\left(0,\; \beta - \|A x_n^{\perp}\|_2^2\right).$
Figure 4. Comparison between Linear AE and LIS-AE. Digit-8 is the normal task. (a) Inputs. (b) Orthogonal vectors. (c) AE reconstruction of orthogonal vectors (zero images). (d) LIS-AE reconstruction of orthogonal vectors. (e) AE reconstruction of inputs. (f) LIS-AE reconstruction of inputs.

We notice that in order for this to work, the decoder has to be known and remain fixed (frozen). This suggests a two-phase training procedure where we first compute the decoder and encoder networks, and in the second phase the decoder is fixed while the encoder is modified using the new loss. In fig. 4, a linear version of LIS-AE is trained on digit-8 from MNIST with Omniglot as a negative dataset. We perform an orthogonal decomposition on each input by projecting it onto the digit-8 subspace to get its projection and orthogonal vectors. We then feed each vector separately to a regular linear AE and to linear LIS-AE. We observe that the regular autoencoder outputs zero images for the orthogonal part of each sample regardless of the class it belongs to. In the case of LIS-AE, however, the model behaves differently for the normal class than for anomalous classes.

We also notice that orthogonal projections do not form a semantically meaningful representation in pixel space. In order to gain a better representation we use a deep AE. For this non-linear case, we treat the middle part of the network as an inner linear autoencoder operating on a more semantically meaningful, transformed version of the data. This suggests a stacked autoencoder architecture where another loss term for the inner autoencoder is added in the first phase to make sure that the output of the layer after the latent layer is similar to the latent input. In the second phase we freeze the entire network except for the encoder of the inner autoencoder (the latent layer of the entire model) and minimize the reconstruction error of positive examples while maximizing the loss for negative examples. However, in our experiments we observed that adding these loss terms was not necessary, and a loss similar to the linear case produced similar results, since we are only considering reconstruction scores of the outer model. Therefore, we keep the entire network frozen except for the latent layer while directly minimizing the same loss as before (eq. 4).

4.2. Intuition

For concreteness, we consider the following simple, supervised case. Given a dataset of points $p = (x, y, z)^{\top} \in \mathbb{R}^3$ such that for each point the $y$ and $z$ components are small relative to $x$, we notice that most of the variance in the data is along the x-axis. Training a linear autoencoder with latent dimension $k = 1$ results in $D = (1, 0, 0)^{\top}$ and $E = (1, 0, 0)$, where $D$ and $E$ are the decoder and encoder networks respectively.
Given input $p = (x, y, z)^{\top}$, the loss score is $\mathcal{L}(p) = \|p - DEp\|_2^2 = y^2 + z^2$.
Training a LIS-AE on negative samples that have only nonzero values along the z-axis, we end up with the same $D$ and a new (modified) $E'$, where $c$ is a large number: $E$ has the form $(1, 0, 0)$ and $E'$ has the form $(1, 0, -c)$ with $c \gg 1$.
The new loss scores for $E$ and $E'$ are:

$\mathcal{L}(p) = y^2 + z^2, \qquad \mathcal{L}'(p) = \|p - DE'p\|_2^2 = y^2 + (1 + c^2)\, z^2.$

In the case of a regular Linear-AE (PCA), given $r > 0$, for each point $p$ on the cylinder $y^2 + z^2 = r^2$ the following holds: $\mathcal{L}(p) = r^2$, making, for instance, a normal-like sample with $(y, z) = (r, 0)$ and a negative-like sample with $(y, z) = (0, r)$ indistinguishable. In the case of LIS-AE, $\mathcal{L}'(p) = r^2$ holds only for the elliptic cylinder $y^2 + (1 + c^2) z^2 = r^2$, and since $c$ is a large number, the cross-section of the cylinder is squashed in the z dimension, resulting in a heavily penalized loss in the z dimension but a regular loss in the y dimension. In this case, the two samples become indistinguishable only for very small values of $z$.
We note that the new $E'$ is merely a rotated and stretched version of the old $E$ in the $xz$-plane. Thus, we can think of linear LIS-AE as a regular PCA with its eigenvectors (the rows of $E$) stretched and tilted in the directions of the orthogonal complement of the eigenspace. This is done in such a way that the column space of normal examples is kept invariant under the new transformation $DE'$. By itself, this formulation remains ill-posed, since there is an infinite number of solutions that do not necessarily help with anomaly detection. More formally, given $E$, we can choose any matrix $M$ whose rows lie in the orthogonal complement of the normal subspace and set $E' = E + M$, such that $DE'x = DEx$ for every normal $x$, since $Mx = 0$. However, this does not guarantee any advantage for anomaly detection on similar data; even worse, in practice this modification might result in slightly worse performance if done arbitrarily, since the model usually has to sacrifice some extreme samples from the normal data to balance the two losses. Thus, the negative dataset is used to properly determine the directions of the tilt, and the hyperparameters ($\alpha$ and $\beta$) determine the importance and amount of stretching (or shrinking), while changing the normal case as little as possible. For deep LIS-AE, the same analogy holds, albeit in a latent space.
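The toy example above is easy to verify numerically. A small sketch, with an assumed tilt magnitude $c = 10$:

import numpy as np

# Linear AE (PCA) fit to data dominated by x-axis variance: D = (1,0,0)^T, E = (1,0,0).
D = np.array([[1.0], [0.0], [0.0]])
E = np.array([[1.0, 0.0, 0.0]])
c = 10.0                               # illustrative tilt magnitude
E_tilted = np.array([[1.0, 0.0, -c]])  # E' after latent shaping

def loss(p, dec, enc):
    return float(np.sum((p - dec @ enc @ p) ** 2))

p_normal = np.array([[5.0], [0.3], [0.0]])    # deviates along y (normal-like)
p_negative = np.array([[5.0], [0.0], [0.3]])  # deviates along z (negative-like)

# PCA scores both points identically; the tilted encoder separates them.
print(loss(p_normal, D, E), loss(p_negative, D, E))                 # 0.09, 0.09
print(loss(p_normal, D, E_tilted), loss(p_negative, D, E_tilted))   # 0.09, ~9.09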

Deep architectures are not only useful for learning good representations, but can also learn a non-linear transformation with properties useful for our objective, such as linear separability of negative and positive samples. By adding a standard binary cross-entropy loss before the non-linear activation of the latent layer during the first phase, we ensure that the input of the latent layer is linearly separable for positive and negative examples during the second phase. This linearly-separable variant (LinSep-LIS-AE) almost always performs better than LIS-AE directly. We investigate the effect of this property on the second phase in section 5.2.
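A sketch of this first-phase modification, where sep_head is a hypothetical linear head over the encoder features (the latent layer's pre-activation input) and bce_weight is an illustrative value:

import torch
import torch.nn.functional as F

def linsep_phase_one_step(model, sep_head, opt, x_pos, x_neg, bce_weight=0.1):
    # Reconstruct both batches (standard first phase) while a binary
    # cross-entropy term makes the latent layer's input linearly separable
    # for positive versus negative examples.
    feats_pos = model.encoder(x_pos)
    feats_neg = model.encoder(x_neg)
    rec_loss = (F.mse_loss(model.decoder(model.latent(feats_pos)), x_pos)
                + F.mse_loss(model.decoder(model.latent(feats_neg)), x_neg))

    logits = torch.cat([sep_head(feats_pos), sep_head(feats_neg)]).squeeze(1)
    labels = torch.cat([torch.ones(len(x_pos)), torch.zeros(len(x_neg))])
    bce_loss = F.binary_cross_entropy_with_logits(logits, labels)

    loss = rec_loss + bce_weight * bce_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)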

5. Experiments

Model MNIST Fashion-MNIST 2-Class MNIST
KDE 0.9568 0.9183 0.9206
IF 0.8624 0.9144 0.73018
OC-SVM 0.9108 0.8608 0.8741
OC-DSVDD 0.9489 0.8577 0.8972
AnoGAN 0.9579 0.9098 0.8406
AnoGAN-FM 0.9544 0.9072 0.8353
Linear-AE 0.9412 0.8845 0.8915
VAE 0.9642 0.9092 0.9263
Mem-AE 0.9714 0.9131 0.9352
Sig-RepNN (N=4) 0.9661 0.9124 0.9261
AE 0.9601 0.9076 0.9221
LIS-AE 0.9768 0.9256 0.9457
Table 1. Average AUC for 10 tasks sampled from each of MNIST and Fashion-MNIST, and for 5 two-class tasks sampled from MNIST.

We report results on the following datasets: MNIST (LeCun and Cortes, 2010), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky et al., 2009). Results of our approach are compared to baseline models with the same capacity for autoencoder-based methods.

5.1. Anomaly Detection

In this section, we test LIS-AE for anomaly detection on image data in unsupervised settings. Given a standard classification dataset, we group a set of classes together into a new dataset and consider it the "normal" dataset. The classes that are in neither the normal nor the negative dataset are considered anomalies. During training, our model is presented only with the normal dataset and the additional negative dataset. We evaluate performance on test data comprising both the "normal" and "anomalous" groups.
For MNIST and Fashion-MNIST, the encoder network consists of two convolutional layers with LeakyReLU non-linearities followed by a fully-connected bottleneck layer with a tanh activation function. The decoder network consists of a fully-connected layer followed by a LeakyReLU, two deconvolution layers with LeakyReLU activation functions, and a final convolution layer with a sigmoid at the output. For SVHN and CIFAR-10 we use larger latent layers and higher-capacity networks of the same depth. It is worth noting that the choice of latent layer size has the largest effect on performance for all models (compared to other hyper-parameters). We report the best-performing latent dimension for all models.
In table (1), we compare LIS-AE with several autoencoder-based anomaly detection models as baselines, all of which share the exact same architecture. The most direct comparison is between LIS-AE and AE: not only do they have the same architecture, they have the exact same encoder and decoder weights, and their performance is simply measured before and after the latent-shaping phase. We use a variant of RepNN with a sigmoid activation function placed before the tanh staircase approximation described in section 2.2. This is mainly because "squashing" the input between 0 and 1 before passing it to the staircase function gives a more robust and easier-to-train network. We only report the best results for Sig-RepNN, with 4 activation levels. For anomaly GAN (AnoGAN) (Schlegl et al., 2017), we follow the implementation described in (Schlegl et al., 2019). We train a W-GAN (Gulrajani et al., 2017) with gradient penalty and report performance for two anomaly scores, namely the encoder-generator reconstruction loss and an additional feature-matching distance score in the discriminator feature space (AnoGAN-FM).

Model SVHN CIFAR-10
KDE 0.5648 0.5752
IF 0.5112 0.6097
OC-SVM 0.5047 0.5651
OC-DSVDD 0.5681 0.6411
AnoGAN 0.5598 0.5843
AnoGAN-FM 0.5645 0.5880
Linear-AE 0.5702 0.5753
VAE 0.5692 0.5734
Mem-AE 0.5720 0.5931
Sig-RepNN (N=4) 0.5684 0.5719
AE 0.5698 0.5703
LIS-AE 0.6886 0.8145
LinSep-LIS-AE 0.7701 0.8858
Sup. LIS-AE 0.7573 0.8384
Sup. LinSep-LIS-AE 0.8479 0.9170
Table 2. Average AUC for 10 anomaly detection tasks sampled from SVHN and CIFAR-10.

For AnoGAN, Linear-AE, AE, VAE, RepNN, MemAE, and LIS-AE, we use the reconstruction error such that if $\ell(x) > \lambda$ the input is considered an anomaly. Varying the threshold $\lambda$, we compute the area under the curve (AUC) as a measure of performance. Similarly, for OC-SVDD (equivalently, OC-SVM with an RBF kernel) and OC-DSVDD, we vary the inverse length scale $\gamma$ and use the predicted class label. For kernel density estimation (KDE) (Parzen, 1962), we vary the threshold over the log-likelihood scores. For Isolation Forest (IF) (Liu et al., 2008), we vary the threshold over the anomaly score calculated by the Isolation Forest algorithm.

Figure 5. Each line represents a trade-off between the accuracy on anomalies and on normal data for CIFAR-10. The top pane shows accuracies on tasks 0-4 and the bottom shows accuracies on tasks 5-9. Note that as the threshold value increases, the model favors accepting anomalies over misclassifying normal examples. LIS-AE gives a significant margin compared to the base AE.

The datasets tested in table (1) are MNIST and Fashion-MNIST. To train LIS-AE on MNIST we use Omniglot (Lake et al., 2015) as our negative dataset, since it shares similar compositional characteristics with MNIST. Since Omniglot is a relatively small dataset, we diversify the negative examples with various augmentation techniques, namely Gaussian blurring, random cropping, and horizontal and vertical flipping. We test two settings for MNIST. The first is a 1-class setting, where the normal dataset is one particular class and the rest of the dataset's classes are considered anomalies; the process is repeated for all classes and the average AUC over 10 classes is reported. The other setting is 2-class MNIST, where the normal dataset consists of two classes and the remaining classes are considered anomalies. For example, the first task contains digits 0 and 1 and the remaining digits are considered anomalies, the second task contains digits 2 and 3, and so forth. This setting is more challenging since there is more than one class present in the normal dataset; it is also very informative for continual learning methods that use autoencoders as gates to first identify task boundaries. For Fashion-MNIST, the choice of negative examples is different: we use the next class as the negative dataset and we do not include it with the anomalies (i.e., the remaining classes) during test time. This is also informative for continual learning, where we have a stream of sequential tasks. In the ablation section we test multiple negative datasets for Fashion-MNIST.
We note that LIS-AE achieves performance superior to all compared approaches; however, we also notice that these settings are comparatively easy and all tested models performed adequately, including classical non-deep approaches. In table (2), we show performance on SVHN and CIFAR-10, which are more complex datasets compared to MNIST and Fashion-MNIST. To train LIS-AE, we split each dataset into two datasets; each split is used as negative examples for the other one. Note that we only test on the remaining classes, which are in neither the normal nor the negative dataset. For example, the first dataset from CIFAR-10 has five classes, namely airplane, automobile, bird, cat and deer, while the second one has dog, frog, horse, ship and truck. Training on airplane as the first normal task, LIS-AE maximizes the loss for samples drawn from the negative dataset (dog, frog, ship and so forth). We then test its performance with airplane as the normal class and only automobile, bird, cat and deer as anomalies. Note that we do not test on dog, frog, or the other classes in the negative dataset. This process is repeated for all 10 classes and the average AUC is reported. As mentioned in section 4.2, we introduce LinSep-LIS-AE as an improvement over standard LIS-AE. The difference between the two models is only in the first phase, where a binary cross-entropy loss is added to ensure that positive and negative examples are linearly separable during the second phase. The last two entries of the table are supervised upper bounds for each variant where the negative dataset is the same as the outliers. In figure (6), we see that a standard AE tends to generalize well to other classes, which is not a desired property for anomaly detection. In contrast, LIS-AE only reconstructs normal data faithfully, which translates to the large performance gap we see in figure (5).

Figure 6. Top row: test input from CIFAR-10 and SVHN; middle row: the output of a standard AE (first phase); bottom row: the output of LIS-AE. Trained on the normal "car" and "digit-0" classes, LIS-AE only reconstructs samples of the normal class correctly.

We also notice that despite CIFAR-10 being more complex than SVHN, most reconstruction-based models perform better on CIFAR-10 than on SVHN. This is due to the fact that the differences between SVHN classes in terms of reconstruction are not as large, since they share similar compositional features and digits appear in samples from other classes, while CIFAR-10 classes vary significantly (e.g., digit-2 and digit-3 on a wall vs. truck and bird).

5.2. Ablation

Negative Positive Data
 Data MNIST Fashion SVHN CIFAR-10
None 0.9485 0.8740 0.5698 0.5703
Omni 0.9605 0.9013 - -
MNIST 0.9778 0.8942 - -
Fashion 0.9482 0.9106 - -
SVHN - - 0.6886 0.7065
CIFAR-10 - - 0.5481 0.8145
Same (Sup.) 0.9901 0.9623 0.7573 0.8384
Table 3. Average AUC for 10 anomaly detection tasks sampled from two 5-class MNIST, Fashion-MNIST, SVHN and CIFAR-10 splits, where a regular LIS-AE is trained with different negative splits.

In this section, we investigate the effect of the nature of the negative dataset and of the linear separability of positive and negative examples. In table (3) we train LIS-AE on different negative and positive datasets. Similar to table (2), we split each positive dataset into two datasets and follow the same settings as before, with the exception of the "None" and "Supervised" cases. The "None" case indicates that no negative examples have been used, whereas "Supervised" indicates that the outlier and negative datasets share the same classes. Note that this case is different from the case where the positive and negative datasets come from the same dataset. Unless stated otherwise, we only test on classes (outliers) that are in neither the positive nor the negative dataset. For example, when MNIST is used as the source for both positive and negative datasets, the positive data starts with class 0 and the negative dataset consists of classes 5 to 9, with classes 1 to 4 as the outliers. This process is repeated for all 10 classes present in each dataset and the average AUC is reported. Overall, using a negative dataset resulted in a significant increase in performance in every case except two important ones, namely when Fashion-MNIST and CIFAR-10 were used as negative datasets for MNIST and SVHN respectively. This could be explained by the fact that the model was not capable of reconstructing Fashion-MNIST and CIFAR-10 classes in the first place. Moreover, shaping the latent layer in a way that maximizes the loss for Fashion-MNIST and CIFAR-10 classes does not guarantee any advantage for anomaly detection of the similar digit classes present in MNIST and SVHN. This, coupled with the fact that in practice this process forces the model to ignore some samples from the normal dataset to balance the two losses, results in the performance degradation we observe in these two cases.

Table (4) is an excerpt of the complete table in the appendices, where we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR-10 dataset. We split CIFAR-10 into two separate datasets; the first split is used for selecting classes as negative datasets and the other split is used as outliers. For each class in CIFAR-10 we train eight models in different settings. The first setting is "None", where we train a standard autoencoder with no negative examples as the base model. The remaining seven settings differ in the second phase: we select one class as our negative dataset and test the model's performance on each individual class from the outlier dataset. The "combined" setting is similar to the setting described in section 5.1, where we combine all negative classes into one 5-class negative dataset. Note that these classes are not the same as the classes in the outlier test dataset, except for the final setting, which is an upper-bound supervised setting where the negative dataset is comprised of the classes that are in the outlier dataset, excluding the positive class. This process is then repeated for all 10 classes in CIFAR-10. Overall, we observe a significant performance increase over the base model, with the general trend that negative classes significantly increase anomaly detection performance for similar outliers. For example, the dog class drastically improves performance on the cat class but not so much on the plane class. However, we also notice two important exceptions. When the horse class is used as the negative dataset for the car class, we notice a significant performance increase on the relatively similar deer class, as expected; however, when the horse class is used as the negative dataset for the deer class itself, performance does not improve as in the first case and even degrades on the car class.

Positive Negative Outliers
 Class  Class Plane Car Bird Cat Deer avg
Car None 0.32 - 0.34 0.33 0.33 0.330
Dog 5 0.67 - 0.89 0.93 0.90 0.848
Frog 6 0.58 - 0.90 0.90 0.91 0.823
Horse 7 0.66 - 0.88 0.90 0.92 0.840
Ship 8 0.83 - 0.59 0.51 0.50 0.608
Truck 9 0.51 - 0.44 0.49 0.44 0.470
Comb. (5-9) 0.81 - 0.92 0.92 0.94 0.898
Sup. (0-4) 0.89 - 0.93 0.90 0.95 0.918
Deer None 0.56 0.80 0.52 0.54 - 0.605
Dog 5 0.72 0.85 0.63 0.80 - 0.750
Frog 6 0.66 0.86 0.60 0.75 - 0.718
Horse 7 0.71 0.58 0.58 0.71 - 0.645
Ship 8 0.93 0.94 0.63 0.72 - 0.805
Truck 9 0.84 0.97 0.62 0.72 - 0.773
Comb. (5-9) 0.87 0.95 0.61 0.73 - 0.790
Sup. (0-4) 0.93 0.97 0.63 0.72 - 0.813
Table 4. AUC for LIS-AE trained on individual positive and negative classes is reported.

Other notable examples of this observation can be found in the appendices, where, for instance, the dog class improves performance on cat outliers but causes noticeable degradation when used as the negative dataset for the cat class itself. We believe the reason is that in the first case the gained performance is due to these classes sharing similar compositional features and backgrounds, whereas in the second case the same property makes it difficult to balance the minimization and maximization losses during the latent-shaping phase. For example, car and truck images are so similar in this scenario that minimizing and maximizing the loss at the same time becomes contradictory. As posited in section 4.2, we mitigate this issue by adding a binary cross-entropy loss while training in the first phase to ensure that the input of the latent layer is linearly separable for positive and negative examples. Notice that unlike other approaches (Hendrycks et al., 2018; Perera and Patel, 2019), this does not require a labeled positive or negative dataset and relies only on the fact that we have two distinct datasets. This linear separability makes the second phase of training relatively easier and less contradictory. In table (5), we see that LinSep-LIS-AE mitigates this issue for the aforementioned cases and gives the AUC increase we observed in table (2).

5.3. Continual Learning

Positive Negative Outliers
 Class  Class Plane Car Bird Cat Deer avg
Car None 0.32 - 0.34 0.33 0.33 0.330
Dog 5 0.67 - 0.94 0.97 0.95 0.883
Frog 6 0.58 - 0.93 0.96 0.96 0.858
Horse 7 0.69 - 0.95 0.97 0.97 0.895
Ship 8 0.90 - 0.78 0.79 0.76 0.808
Truck 9 0.59 - 0.77 0.82 0.73 0.728
Comb. (5-9) 0.90 - 0.97 0.97 0.98 0.955
Sup. (0-4) 0.95 - 0.98 0.98 0.98 0.9725
Deer None 0.56 0.80 0.52 0.54 - 0.605
Dog 5 0.67 0.86 0.71 0.89 - 0.783
Frog 6 0.68 0.87 0.62 0.79 - 0.740
Horse 7 0.70 0.84 0.61 0.72 - 0.718
Ship 8 0.94 0.95 0.63 0.73 - 0.813
Truck 9 0.84 0.97 0.61 0.76 - 0.795
Comb. (5-9) 0.90 0.97 0.66 0.80 - 0.833
Sup. (0-4) 0.97 0.98 0.76 0.83 - 0.885
Table 5. AUC for LinSep-LIS-AE trained on individual positive and negative classes is reported.

A common evaluation method for CL settings is the split-dataset test, whereby a standard classification dataset is divided into $N$ disjoint tasks with a number of classes $C$ within each task, such that $N \times C$ is the total number of classes present in the dataset. The model is then presented with tasks one at a time and the final performance on all tasks is reported. In this section, we consider the MNIST and Fashion-MNIST datasets in two common variants of this setting, namely task-incremental and class-incremental settings. It is important to note that while these two datasets are now considered too simple a test for meaningful evaluation of classifiers in general, this does not hold for continual learning, since MNIST-like tasks are still extremely challenging for continual learning, especially in class-incremental settings. We compare LIS-AE with several CL approaches. We start by estimating a lower bound for each setting by simply training the classifier sequentially on tasks to assess forgetting and interference. For the lower-bound model, EWC, online EWC (Schwarz et al., 2018), SI, and deep generative replay (DGR), the classifier network has the same architecture as the encoder network used in LIS-AE. For Expert-Gate (EG), we use $N$ experts (classifiers) with the same encoder network architecture, where $N$ is the number of tasks. It is worth noting that the latent dimension of the autoencoder gates in EG differs from the latent dimension used in LIS-AE; therefore, we report the best accuracy for both EG and LIS-AE. For a fair comparison, we only include models that do not have an episodic memory (i.e., that do not store raw data).

We first test our model under the easy protocol, where the model is presented with an unknown sample along with a task identifier. Table (6) shows results for split-MNIST and split-Fashion-MNIST. For both datasets, we consider five folds: the first fold is comprised of classes 0 and 1, the second fold of classes 2 and 3, and so forth. During testing, the model is presented with a task identifier indicating which task the sample belongs to. For LIS-AE, this significantly reduces interference between latent layers from different tasks.

Model MNIST Fash-MNIST
lower bound 0.8819 0.7721
EWC 0.9864 0.9572
Online EWC 0.9904 0.9642
SI 0.9916 0.9682
DGR 0.9941 0.8842
LIS-AE 0.9961 0.9866
Table 6. Classification accuracy on split-MNIST and split-Fashion-MNIST tested in Task-Incremental settings.
Model MNIST Fash-MNIST
lower bound 0.1970 0.1821
EWC 0.1992 0.1902
Online EWC 0.1993 0.1891
SI 0.2101 0.1911
DGR 0.9124 0.7298
EG 0.9306 0.8024
LIS-AE 0.9453 0.7786
DL-LIS-AE 0.9814 0.8587
Table 7. Classification accuracy on split-MNIST and split-Fashion-MNIST tested in Class-Incremental settings.

In table (7) we turn to the more challenging class-incremental setting, where no task identifier is presented. We see that for MNIST the performance remains relatively close to the task-incremental performance; however, for Fashion-MNIST, noticeable interference manifests when task 3 is introduced, as shown in fig. 7. This is due to the ordering of classes in Fashion-MNIST: task 2 contains the coat class while task 3 contains the shirt class. Such interference was not significant in the task-incremental setting because the availability of the task label first distinguishes between the two similar classes. To give our model more flexibility in class-incremental settings, we introduce another variant of LIS-AE, namely Double-Latent LIS-AE (DL-LIS-AE), where instead of adding only one latent layer for each class as described in section 3.5, we add two consecutive latent layers with a tanh activation in the middle, which substantially improves performance. It is worth noting that despite the Expert-Gate model having five different autoencoders and five different classifier networks, a single base LIS-AE achieves similar performance. This is due to the fact that in our model, utilizing negative examples, even a negative dataset as simple as a single class, significantly improves performance, as discussed in section 5.2. Another advantage is that the added latent layers are, by definition, operating as gates, but over a better representation rather than in pixel space. We also note from table (1) that when solving anomaly detection tasks, the 2-class MNIST setting was significantly harder than the 1-class setting due to higher inter-class variance in the data.

Figure 7. Model accuracy on split-MNIST and split-Fashion-MNIST in class-incremental settings.

6. Conclusion

In this paper we introduced a novel autoencoder-based model called the Latent-Insensitive Autoencoder (LIS-AE). With the help of negative samples drawn from a domain similar to that of the normal data, we tune the weights of the bottleneck of a standard autoencoder such that the resulting model is able to reconstruct the target task while penalizing anomalous samples. We also presented theoretical justification for our two-phase training process and the latent-shaping loss function, along with a more powerful variant. Multiple ablation studies were conducted to explain the effect of the negative dataset. We showed that continual learning can be thought of as multiple anomaly detection problems, and leveraged this framing to extend the applications of our model beyond anomaly detection to tackle the challenging problem of class-incremental learning using a simple variant with multiple bottlenecks. We tested our model in a variety of anomaly detection and class-incremental settings with multiple datasets of varying degrees of complexity. Experimental results showed significant performance improvements over the compared methods. Future research will focus on further investigating the connection between continual learning and anomaly detection, forward and backward transfer of knowledge for continual learning, and possible ways of synthesizing negative examples for domains with limited data. We also hope to further study and employ various manifold learning approaches for latent space representation.

Acknowledgement

Artem Lenskiy was funded by Our Health in Our Hands (OHIOH), a strategic initiative of the Australian National University, which aims to transform healthcare by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and health care providers.

References

  • R. Aljundi, L. Caccia, E. Belilovsky, M. Caccia, M. Lin, L. Charlin, and T. Tuytelaars (2019) Online continual learning with maximally interfered retrieval. arXiv preprint arXiv:1908.04742. Cited by: §2.2.
  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375. Cited by: §2.2.
  • J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §2.1.
  • Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. (2007) Greedy layer-wise training of deep networks. Advances in neural information processing systems 19, pp. 153. Cited by: §1.
  • H. Bourlard and Y. Kamp (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics 59 (4), pp. 291–294. Cited by: §2.1, §4.1.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 1–37. Cited by: §2.1.
  • R. Chalapathy, A. K. Menon, and S. Chawla (2017) Robust, deep and inductive anomaly detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36–51. Cited by: §1.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: §1.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning 20 (3), pp. 273–297. Cited by: §2.1.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §1.
  • D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel (2019) Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705–1714. Cited by: Latent-Insensitive Autoencoders for Anomaly Detection and Class-Incremental Learning, §1, §2.1, §2.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §2.2.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028. Cited by: §5.1.
  • M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Cited by: §1.
  • S. Hawkins, H. He, G. Williams, and R. Baxter (2002) Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180. Cited by: §2.1.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §2.1, §5.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.2.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §5.1.
  • C. H. Lampert (2009) Kernel methods in computer vision. Now Publishers Inc. Cited by: §2.1.
  • P. Lavenex and D. G. Amaral (2000) Hippocampal-neocortical interaction: a hierarchy of associativity. Hippocampus 10 (4), pp. 420–430. Cited by: §2.2.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §2.2, §5.
  • S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming catastrophic forgetting by incremental moment matching. arXiv preprint arXiv:1703.08475. Cited by: §1, §2.2.
  • T. Lesort, A. Gepperth, A. Stoian, and D. Filliat (2019) Marginal replay vs conditional replay for continual learning. In International Conference on Artificial Neural Networks, pp. 466–480. Cited by: §2.2.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1.
  • F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 eighth ieee international conference on data mining, pp. 413–422. Cited by: §5.1.
  • X. Liu, C. Wu, M. Menta, L. Herranz, B. Raducanu, A. D. Bagdanov, S. Jui, and J. v. de Weijer (2020) Generative feature replay for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 226–227. Cited by: §2.2.
  • M. Mermillod, A. Bugaiska, and P. Bonin (2013) The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology 4, pp. 504. Cited by: §1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §5.
  • E. Parzen (1962)

    On estimation of a probability density function and mode

    .
    The annals of mathematical statistics 33 (3), pp. 1065–1076. Cited by: §5.1.
  • K. Pearson (1901) LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science 2 (11), pp. 559–572. Cited by: §2.1.
  • P. Perera and V. M. Patel (2019)

    Learning deep features for one-class classification

    .
    IEEE Transactions on Image Processing 28 (11), pp. 5450–5463. Cited by: Latent-Insensitive Autoencoders for Anomaly Detection and Class-Incremental Learning, §2.1, §2.1, §5.2.
  • R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng (2007) Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766. Cited by: Latent-Insensitive Autoencoders for Anomaly Detection and Class-Incremental Learning.
  • L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §2.1.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §1, §2.2.
  • T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth (2019)

    F-anogan: fast unsupervised anomaly detection with generative adversarial networks

    .
    Medical image analysis 54, pp. 30–44. Cited by: §5.1.
  • T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157. Cited by: §5.1.
  • B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation 13 (7), pp. 1443–1471. Cited by: §2.1.
  • J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537. Cited by: §5.3.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690. Cited by: §2.2.
  • C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu (2018) A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270–279. Cited by: §1.
  • T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei (2017) Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17 (2), pp. 336. Cited by: §3.2.
  • D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: §2.1.
  • A. Thai, S. Stojanov, I. Rehg, and J. M. Rehg (2021) Does continual learning= catastrophic forgetting?. arXiv preprint arXiv:2101.07295. Cited by: §1.
  • L. Tóth and G. Gosztolya (2004) Replicator neural networks for outlier modeling in segmental speech recognition. In International Symposium on Neural Networks, pp. 996–1001. Cited by: §2.1.
  • G. M. van de Ven, H. T. Siegelmann, and A. S. Tolias (2020) Brain-inspired replay for continual learning with artificial neural networks. Nature communications 11 (1), pp. 1–14. Cited by: §2.2.
  • G. M. Van de Ven and A. S. Tolias (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635. Cited by: §2.2.
  • J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik (2006) Inference with the universum. In Proceedings of the 23rd international conference on Machine learning, pp. 1009–1016. Cited by: §1, §2.1.
  • G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu (2002) A comparative study of rnn for outlier detection in data mining. In 2002 IEEE International Conference on Data Mining, 2002. Proceedings., pp. 709–712. Cited by: §2.1.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. Cited by: §1, §2.2.
  • C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 665–674. Cited by: §1.
  • A. Zimek, E. Schubert, and H. Kriegel (2012) A survey on unsupervised outlier detection in high-dimensional numerical data.

    Statistical Analysis and Data Mining: The ASA Data Science Journal

    5 (5), pp. 363–387.
    Cited by: §1.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018)

    Deep autoencoding gaussian mixture model for unsupervised anomaly detection

    .
    In International Conference on Learning Representations, Cited by: §1.

Appendix A Effect of individual classes as negative examples

As discussed in Section 5.2, we examine the effect of each class present in the negative dataset on anomaly-detection performance for the remaining test classes of CIFAR-10. The first table shows results for standard LIS-AE, while the second shows results for the LinSep-LIS-AE variant; a sketch of how such per-class scores might be computed follows the tables.

Positive Negative Outliers
 Class  Class Plane Car Bird Cat Deer avg
Plane None - 0.78 0.58 0.62 0.61 0.648
Dog 5 - 0.83 0.87 0.95 0.91 0.890
Frog 6 - 0.83 0.86 0.94 0.90 0.883
Horse 7 - 0.82 0.86 0.94 0.91 0.883
Ship 8 - 0.83 0.86 0.92 0.90 0.878
Truck 9 - 0.84 0.86 0.95 0.91 0.890
Comb. (5-9) - 0.83 0.85 0.93 0.91 0.880
Sup. (0-4) - 0.81 0.88 0.94 0.92 0.888
Car None 0.32 - 0.34 0.33 0.33 0.330
Dog 5 0.67 - 0.89 0.93 0.90 0.848
Frog 6 0.58 - 0.90 0.90 0.91 0.823
Horse 7 0.66 - 0.88 0.90 0.92 0.840
Ship 8 0.83 - 0.59 0.51 0.50 0.608
Truck 9 0.51 - 0.44 0.49 0.44 0.470
Comb. (5-9) 0.81 - 0.92 0.92 0.94 0.898
Sup. (0-4) 0.89 - 0.93 0.90 0.95 0.918
Bird None 0.52 0.78 - 0.54 0.52 0.590
Dog 5 0.59 0.75 - 0.71 0.49 0.635
Frog 6 0.53 0.78 - 0.69 0.54 0.635
Horse 7 0.63 0.80 - 0.66 0.57 0.665
Ship 8 0.86 0.89 - 0.61 0.45 0.703
Truck 9 0.76 0.94 - 0.64 0.72 0.765
Comb. (5-9) 0.78 0.90 - 0.62 0.49 0.698
Sup. (0-4) 0.82 0.94 - 0.60 0.48 0.710
Cat None 0.55 0.76 0.50 - 0.50 0.578
Dog 5 0.54 0.73 0.52 - 0.52 0.578
Frog 6 0.56 0.72 0.60 - 0.62 0.625
Horse 7 0.70 0.75 0.59 - 0.68 0.680
Ship 8 0.91 0.89 0.53 - 0.46 0.678
Truck 9 0.82 0.94 0.50 - 0.48 0.685
Comb. (5-9) 0.89 0.91 0.55 - 0.52 0.718
Sup. (0-4) 0.93 0.93 0.58 - 0.54 0.745
Deer None 0.56 0.80 0.52 0.54 - 0.605
Dog 5 0.72 0.85 0.63 0.80 - 0.750
Frog 6 0.66 0.86 0.60 0.75 - 0.718
Horse 7 0.71 0.58 0.58 0.71 - 0.645
Ship 8 0.93 0.94 0.63 0.72 - 0.805
Truck 9 0.84 0.97 0.62 0.72 - 0.773
Comb. (5-9) 0.87 0.95 0.61 0.73 - 0.790
Sup. (0-4) 0.93 0.97 0.62 0.72 - 0.810
Positive Negative Outliers
 Class  Class Dog Frog Horse Ship Truck avg
Dog None - 0.69 0.66 0.57 0.77 0.673
Plane 0 - 0.53 0.66 0.92 0.89 0.750
Car 1 - 0.56 0.68 0.95 0.91 0.775
Bird 2 - 0.63 0.63 0.77 0.78 0.703
Cat 3 - 0.67 0.65 0.66 0.81 0.698
Deer 4 - 0.73 0.69 0.70 0.76 0.720
Comb. (0-4) - 0.58 0.67 0.95 0.94 0.785
Sup. (5-9) - 0.56 0.73 0.95 0.95 0.798
Frog None 0.40 - 0.53 0.49 0.67 0.523
Plane 0 0.73 - 0.81 0.96 0.93 0.858
Car 1 0.74 - 0.83 0.96 0.97 0.875
Bird 2 0.80 - 0.84 0.91 0.85 0.850
Cat 3 0.84 - 0.80 0.87 0.86 0.843
Deer 4 0.75 - 0.86 0.88 0.87 0.840
Comb. (0-4) 0.75 - 0.84 0.97 0.95 0.877
Sup. (5-9) 0.82 - 0.88 0.97 0.96 0.907
Horse None 0.41 0.58 - 0.46 0.66 0.528
Plane 0 0.55 0.50 - 0.93 0.83 0.703
Car 1 0.56 0.58 - 0.90 0.93 0.743
Bird 2 0.62 0.73 - 0.80 0.71 0.715
Cat 3 0.77 0.76 - 0.65 0.66 0.710
Deer 4 0.62 0.83 - 0.57 0.60 0.655
Comb. (0-4) 0.51 0.57 - 0.89 0.88 0.713
Sup. (5-9) 0.59 0.66 - 0.95 0.92 0.780
Ship None 0.62 0.74 0.73 - 0.77 0.715
Plane 0 0.75 0.75 0.82 - 0.74 0.765
Car 1 0.84 0.89 0.90 - 0.88 0.878
Bird 2 0.94 0.96 0.94 - 0.78 0.905
Cat 3 0.95 0.95 0.93 - 0.80 0.908
Deer 4 0.92 0.96 0.94 - 0.78 0.900
Comb. (0-4) 0.95 0.97 0.96 - 0.83 0.928
Sup. (5-9) 0.94 0.96 0.96 - 0.88 0.935
Truck None 0.35 0.53 0.46 0.30 - 0.410
Plane 0 0.61 0.52 0.58 0.80 - 0.628
Car 1 0.51 0.57 0.53 0.47 - 0.520
Bird 2 0.92 0.90 0.84 0.73 - 0.848
Cat 3 0.95 0.90 0.82 0.61 - 0.820
Deer 4 0.91 0.92 0.86 0.63 - 0.830
Comb. (0-4) 0.93 0.91 0.83 0.76 - 0.858
Sup. (5-9) 0.94 0.95 0.90 0.78 - 0.893

Positive Negative Outliers
 Class  Class Plane Car Bird Cat Deer avg
Plane None - 0.78 0.58 0.62 0.61 0.648
Dog 5 - 0.84 0.88 0.97 0.93 0.905
Frog 6 - 0.84 0.86 0.95 0.93 0.895
Horse 7 - 0.83 0.85 0.95 0.92 0.888
Ship 8 - 0.80 0.57 0.74 0.56 0.668
Truck 9 - 0.92 0.71 0.87 0.74 0.810
Comb. (5-9) - 0.90 0.85 0.96 0.94 0.913
Sup. (0-4) - 0.93 0.90 0.96 0.95 0.935
Car None 0.32 - 0.34 0.33 0.33 0.330
Dog 5 0.67 - 0.94 0.97 0.95 0.883
Frog 6 0.58 - 0.93 0.96 0.96 0.858
Horse 7 0.69 - 0.95 0.97 0.97 0.895
Ship 8 0.90 - 0.78 0.79 0.76 0.808
Truck 9 0.59 - 0.77 0.82 0.73 0.728
Comb. (5-9) 0.90 - 0.97 0.97 0.98 0.955
Sup. (0-4) 0.95 - 0.98 0.98 0.98 0.973
Bird None 0.52 0.78 - 0.54 0.52 0.590
Dog 5 0.56 0.78 - 0.81 0.55 0.675
Frog 6 0.53 0.80 - 0.71 0.56 0.650
Horse 7 0.63 0.81 - 0.72 0.59 0.688
Ship 8 0.86 0.93 - 0.64 0.47 0.725
Truck 9 0.74 0.95 - 0.68 0.48 0.713
Comb. (5-9) 0.82 0.95 - 0.76 0.60 0.783
Sup. (0-4) 0.89 0.97 - 0.70 0.56 0.773
Cat None 0.55 0.76 0.50 - 0.50 0.578
Dog 5 0.50 0.71 0.51 - 0.46 0.545
Frog 6 0.52 0.76 0.58 - 0.64 0.625
Horse 7 0.65 0.79 0.56 - 0.64 0.660
Ship 8 0.93 0.92 0.56 - 0.48 0.723
Truck 9 0.81 0.96 0.52 - 0.48 0.693
Comb. (5-9) 0.88 0.95 0.62 - 0.68 0.783
Sup. (0-4) 0.95 0.97 0.75 - 0.76 0.860
Deer None 0.56 0.80 0.52 0.54 - 0.605
Dog 5 0.67 0.86 0.71 0.89 - 0.783
Frog 6 0.68 0.87 0.62 0.79 - 0.740
Horse 7 0.70 0.84 0.61 0.72 - 0.718
Ship 8 0.94 0.95 0.63 0.73 - 0.813
Truck 9 0.84 0.97 0.61 0.76 - 0.795
Comb. (5-9) 0.90 0.97 0.66 0.80 - 0.833
Sup. (0-4) 0.97 0.98 0.76 0.83 - 0.885
Positive Negative Outliers
 Class  Class Dog Frog Horse Ship Truck avg
Dog None - 0.69 0.66 0.57 0.77 0.673
Plane 0 - 0.62 0.68 0.98 0.94 0.805
Car 1 - 0.67 0.70 0.96 0.97 0.825
Bird 2 - 0.73 0.68 0.94 0.86 0.803
Cat 3 - 0.65 0.68 0.80 0.85 0.745
Deer 4 - 0.81 0.74 0.89 0.81 0.813
Comb. (0-4) - 0.80 0.76 0.97 0.96 0.874
Sup. (5-9) - 0.90 0.85 0.97 0.98 0.925
Frog None 0.40 - 0.53 0.49 0.67 0.523
Plane 0 0.70 - 0.58 0.98 0.96 0.841
Car 1 0.70 - 0.83 0.98 0.98 0.930
Bird 2 0.86 - 0.91 0.96 0.93 0.933
Cat 3 0.88 - 0.87 0.94 0.92 0.912
Deer 4 0.81 - 0.92 0.95 0.92 0.929
Comb. (0-4) 0.91 - 0.95 0.98 0.98 0.956
Sup. (5-9) 0.94 - 0.97 0.98 0.98 0.968
Horse None 0.41 0.58 - 0.46 0.66 0.531
Plane 0 0.53 0.59 - 0.96 0.87 0.806
Car 1 0.57 0.67 - 0.96 0.95 0.860
Bird 2 0.33 0.75 - 0.93 0.79 0.823
Cat 3 0.78 0.81 - 0.90 0.77 0.826
Deer 4 0.67 0.85 - 0.87 0.68 0.800
Comb. (0-4) 0.75 0.91 - 0.97 0.93 0.891
Sup. (5-9) 0.82 0.96 - 0.98 0.96 0.930
Ship None 0.62 0.74 0.73 - 0.77 0.717
Plane 0 0.84 0.89 0.93 - 0.85 0.890
Car 1 0.86 0.91 0.93 - 0.92 0.920
Bird 2 0.96 0.97 0.97 - 0.85 0.930
Cat 3 0.97 0.98 0.97 - 0.85 0.933
Deer 4 0.95 0.97 0.97 - 0.84 0.927
Comb. (0-4) 0.97 0.98 0.98 - 0.90 0.956
Sup. (5-9) 0.97 0.98 0.98 - 0.94 0.968
Truck None 0.35 0.53 0.46 0.30 - 0.412
Plane 0 0.69 0.70 0.67 0.87 - 0.733
Car 1 0.54 0.61 0.53 0.61 - 0.573
Bird 2 0.93 0.89 0.88 0.73 - 0.858
Cat 3 0.96 0.91 0.88 0.63 - 0.845
Deer 4 0.90 0.88 0.91 0.67 - 0.840
Comb. (0-4) 0.97 0.96 0.91 0.82 - 0.914
Sup. (5-9) 0.98 0.98 0.96 0.89 - 0.953
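
As a rough illustration of how a single cell in the tables above could be produced, the snippet below scores samples by reconstruction error and summarizes separability with ROC-AUC; both the metric and the `reconstruction_error` helper are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_score(reconstruction_error, positive_test, outlier_test):
    """One cell of the ablation grid: separability of one positive class
    from one outlier class, using per-sample reconstruction error as the
    anomaly score (assumed metric)."""
    err_pos = reconstruction_error(positive_test)  # low for normal samples
    err_out = reconstruction_error(outlier_test)   # ideally high for outliers
    scores = np.concatenate([err_pos, err_out])
    labels = np.concatenate([np.zeros_like(err_pos), np.ones_like(err_out)])
    return roc_auc_score(labels, scores)
```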

Appendix B Detailed Results for all 10 tasks

The following graphs show detailed results for some of the experiments in the various settings described in Sections 5.1 and 5.2. Each curve represents the trade-off between accuracy on anomalies and accuracy on normal data for each dataset. The two left panes show an upper-bound supervised setting in which the negative dataset is the same as the outliers. The top pane shows accuracies on tasks 0 to 4 and the bottom pane on tasks 5 to 9. Note that as the threshold value increases, the model favors accepting anomalies over misclassifying normal examples. In almost all cases, LIS-AE maintains a significant margin over the standard AE.
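
Each such curve can be reproduced with a simple threshold sweep over per-sample reconstruction errors; the following is a minimal sketch, assuming the error arrays `err_pos` (normal test data) and `err_out` (anomalies) have already been computed.

```python
import numpy as np

def tradeoff_curve(err_pos, err_out, n_thresholds=100):
    """Sweep an error threshold; a sample is flagged as an anomaly when
    its reconstruction error exceeds the threshold."""
    lo = min(err_pos.min(), err_out.min())
    hi = max(err_pos.max(), err_out.max())
    thresholds = np.linspace(lo, hi, n_thresholds)
    acc_normal = np.array([(err_pos <= t).mean() for t in thresholds])  # rises with t
    acc_anomaly = np.array([(err_out > t).mean() for t in thresholds])  # falls with t
    return thresholds, acc_normal, acc_anomaly
```

Higher thresholds accept more normal samples but also let more anomalies through, which is exactly the trade-off traced by the curves below.
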
Figure: Standard LIS-AE trained on CIFAR-10 classes. Left: outliers as negative dataset (supervised). Right: SVHN as negative dataset.

Figure: Results of the LinSep-LIS-AE variant on SVHN. Left: outliers as negative dataset (supervised). Right: unsupervised.

Figure: MNIST classes as positive datasets. Left: outliers as negative dataset (supervised). Right: Omniglot as negative dataset.

Figure: Fashion-MNIST classes as positive datasets. Left: outliers as negative dataset (supervised). Right: Omniglot as negative dataset.