1. Introduction
Anomaly detection is a classical machine learning field concerned with distinguishing in-distribution from out-of-distribution samples. Unlike traditional multi-class classification, where the goal is to find decision boundaries between classes present in a given dataset, the goal of anomaly detection is to find one-versus-all boundaries against classes that are not in the dataset, which is significantly more challenging than standard classification. Autoencoders (Bengio et al., 2007) have been used extensively for anomaly detection (Zhou and Paffenroth, 2017; Zimek et al., 2012; Chalapathy et al., 2017) under the assumption that the reconstruction error incurred by anomalies is higher than that of normal samples (Hasan et al., 2016; Zong et al., 2018). However, it has been observed that this assumption might not hold, as standard autoencoders can generalize well even to anomalies (Gong et al., 2019; Zong et al., 2018). In practice, this issue becomes more relevant in two important settings: when the normal data is relatively complex and requires a high latent dimension for good reconstruction, and when anomalies share similar compositional features and come from a domain close to the normal data (Chandola et al., 2009).

To mitigate these issues we present the Latent-Insensitive Autoencoder (LIS-AE), a new class of autoencoders whose training is carried out in two phases. In the first phase, the model simply reconstructs the input as a standard autoencoder; in the second phase, the entire model except the latent layer is "frozen", and we train the model in a way that forces the latent layer to keep reconstructing only the target task. We borrow the concept of a negative dataset from one-class classification (Weston et al., 2006), whereby an auxiliary dataset of non-examples from similar domains is used as a proxy for out-of-distribution samples. We change the training objective such that the autoencoder keeps its low reconstruction error on the target dataset while pushing the error on the negative dataset to exceed a certain value.
Figure: The first diagram shows the feature extraction phase.
To examine the behaviour of our model under distributional shift, we also test it in Continual Learning (CL) settings. Continual learning, also known as incremental learning, is a machine learning paradigm aiming at developing models that can continually learn from a potentially infinite stream of classification tasks by incorporating new knowledge while retaining previously learned skills (Zenke et al., 2017). Moreover, a CL algorithm must satisfy additional constraints; most notably, access to all data is assumed not to be available during training, and the model is not allowed to replay all previous data. The main challenge of CL is that most current machine learning models are prone to what is known as catastrophic forgetting (Tan et al., 2018): the phenomenon of a model experiencing abrupt performance degradation on previously learned concepts when trained to acquire new skills (French, 1999). In other words, models tend to "fit" the most recent task and "forget" previously learned ones. To formalize the problem, we are given a stream of tasks $T_1, T_2, \dots, T_N$, where each task $T_i$ contains a dataset $D_i = \{(x_j, y_j)\}_{j=1}^{n_i}$ such that $x_j$ represents a sample and $y_j$ represents its target. The goal of continual learning models is to learn each task sequentially under the constraint that access to previous data is unavailable or limited. Depending on the availability of a task descriptor $t$ during test time, the problem is categorized further into two distinct test protocols, namely, task-incremental and class-incremental settings. In task-incremental settings, the model is presented with samples from unknown classes along with a task identifier $t$ that indicates which task each sample belongs to. More formally, the model estimates $p(y \mid x, t)$. In class-incremental settings, on the other hand, the model estimates $p(y \mid x)$, since it is only presented with the unknown sample $x$ without any additional information about which task it belongs to. This makes class-incremental settings significantly more challenging than task-incremental settings and explains why most of the work published in the field assumes that $t$ is presented during testing (Li and Hoiem, 2017; Lee et al., 2017; Rusu et al., 2016). To elicit the intuition behind our approach, we first break the problem of class-incremental learning into a more granular task-agnostic formulation: instead of assuming that the model is presented with a stream of tasks containing multiple classes, we assume that it is presented with a stream of independent classes. One extreme solution that avoids catastrophic forgetting entirely by construction is to learn a separate generative model for each class. However, learning generative models is typically not easy, and the number of models would grow with each new class, which makes this solution very inefficient. The other extreme is to use a single model for capacity efficiency and train on each class incrementally; however, such a model would suffer greatly from catastrophic forgetting, and thus various regularizers are added. The spectrum of approaches between these two extremes is best captured by the so-called "stability-plasticity" dilemma (Mermillod et al., 2013). The decision to use autoencoders as part of the proposed model is dictated firstly by their unsupervised nature, and secondly by the observation that autoencoders are comparatively resilient to catastrophic forgetting (Thai et al., 2021). For CL, since the bottleneck layer is the most task-sensitive part of our model, we create a separate latent layer for each class and repeat the aforementioned process. Details of the architecture, training process, and experiments are discussed in the following sections.

2. Related Work
2.1. Anomaly Detection
Many reconstruction-based anomaly detection approaches have been proposed, starting with classical methods such as PCA (Pearson, 1901). Robust-PCA mitigates the issue of outlier sensitivity in PCA by decomposing the data matrix into a sum of a low-rank matrix and a sparse matrix, using the nuclear norm and the $\ell_1$ norm as convex relaxations in the objective (Candès et al., 2011). Autoencoders address the limitation of PCA capturing only linear relations in feature space by introducing nonlinearities, benefiting from multiple layers of representations (Bourlard and Kamp, 1988). We elaborate further on the shortcomings of PCA and autoencoders in the theoretical section and use them to motivate our approach.

Other methods improve on base autoencoders by endowing the latent code with particular properties. The VAE (An and Cho, 2015) does so by forcing the latent code to follow a prior distribution (usually normal), which also allows sampling from the decoder. However, in the context of anomaly detection it introduces scaling issues, since minimizing the KL-divergence for the high latent dimensions required by complex tasks is quite challenging. Another approach is Replicator Neural Networks (RepNN) (Hawkins et al., 2002), an autoencoder with a staircase activation function positioned at the output of the bottleneck (latent) layer. This is mainly used to quantize the latent code into a number of discrete values, which also aids in forming clusters (Williams et al., 2002). Unfortunately, a discrete staircase function is non-differentiable, which prevents learning via backpropagation. Instead, a differentiable approximation involving a sum of $N$ hyperbolic tangent (tanh) functions was introduced in place of the otherwise non-differentiable discrete staircase function. However, as discussed in (Tóth and Gosztolya, 2004), despite the theoretical appeal of having a quantized latent code via a smooth approximation, in practice such an activation function makes it significantly harder for the gradient signal to flow. We also note that increasing the number of levels $N$ in the aforementioned tanh-sum approximation presents significant overhead during training and testing, since $N$ activation functions have to be computed for each batch; moreover, it suffers from scaling issues similar to those of the VAE.

Another approach is memorizing the normality of a given dataset using a Memory-augmented Autoencoder (Gong et al., 2019). This approach limits the effective space of possible latent codes by constructing a memory module that takes the output of the encoder as an address and passes to the decoder the most relevant memory items from a stored reservoir of prototypical patterns learned during training.
Other, non-reconstruction-based approaches include one-class classification, which is tightly connected to anomaly detection in the sense that both problems are concerned with finding one-versus-all boundaries. One-Class SVM (OC-SVM) is a variation of the classical SVM algorithm (Cortes and Vapnik, 1995) whose objective is to find a hyperplane that best separates samples from outliers (Schölkopf et al., 2001). Support Vector Data Description (SVDD) (Tax and Duin, 2004) instead finds a circumscribing hypersphere that contains all samples while having optimal margin for outliers. It is worth noting that for kernels $k$ with constant $k(x, x)$, such as the RBF and Laplacian kernels, OC-SVM and OC-SVDD learn identical decision functions (Lampert, 2009). To address the lack of representation learning and the poor computational scalability of OC-SVM and OC-SVDD, Deep SVDD (OC-DSVDD) employs a deep neural network that learns useful representations while mapping outputs to a hypersphere of minimum volume (Ruff et al., 2018). However, due to its sole reliance on optimizing for minimum volume, this approach is prone to hypersphere collapse, which leads to uninformative features (Perera and Patel, 2019).

Other approaches use an auxiliary dataset of non-examples (a negative dataset) drawn from similar domains as a proxy for the otherwise intractable complement of the target class. In (Weston et al., 2006), such a collection, called the "Universum", allows learning representations useful for the domain of the problem by maximizing the number of contradictions on an equivalence class. Similar to OC-DSVDD, (Perera and Patel, 2019) leverages a labeled dataset from a close domain to fine-tune two pre-trained CNNs in order to learn good features. The quality of these features is quantified by compactness (intra-class variance) on the target class and descriptiveness (cross-entropy) on the labeled dataset. Despite avoiding hypersphere collapse and outperforming OC-SVDD, this approach requires two pre-trained neural networks and a large labeled dataset in addition to the target dataset. Another approach that makes use of a large auxiliary dataset is Outlier Exposure (OE) (Hendrycks et al., 2018), a supervised approach that trains a standard neural network classifier while exposing it to a diverse set of non-examples, on which the output of the classifier is optimized to follow a uniform distribution using an additional cross-entropy loss.
2.2. Continual Learning
Progressive Neural Networks (ProgNets) (Rusu et al., 2016) prevent catastrophic forgetting by dynamically adding separate subnetworks to accommodate new tasks while maintaining lateral connections that allow forward transfer of knowledge from previous tasks. The two major problems with this approach are that model parameters grow quadratically with each new task and that the model assumes knowledge of task boundaries to determine which output to select. This, by definition, renders ProgNets unsuitable for class-incremental learning.
Another approach that adds separate networks to accommodate new tasks is Expert-Gate (Aljundi et al., 2017). One major advantage of Expert-Gate over ProgNets is that it does not require knowledge of task boundaries: it uses an undercomplete autoencoder-based gating mechanism, with the reconstruction loss of auxiliary autoencoders serving as a metric for which task a sample is drawn from, to determine which subnetwork (called an "expert") should be active. However, similar to ProgNets, this approach suffers from complexity issues, since with each new task a new network is added along with a separate autoencoder to serve as its "gate". Expert-Gate also relies on the assumption that an autoencoder trained on a particular task will not generalize well to another similar task, so that a simple gate suffices to distinguish between complex tasks. Such an assumption may not hold, as the autoencoder might generalize to new tasks, typically when the task is from a similar domain or the autoencoder gate is trained on a task with relatively large intra-class or inter-class variance (Gong et al., 2019).
Pseudo-Rehearsal via Generative-Replay (GR) (Shin et al., 2017) is a bio-inspired approach that mimics the way the hippocampus in the human brain interacts with the neocortex when learning new concepts (Lavenex and Amaral, 2000). GR architectures have two main components: a deep generative module (usually a GAN (Goodfellow et al., 2014) or a VAE (Kingma and Welling, 2013)) and a "solver" module (usually a CNN (LeCun and Cortes, 2010)) responsible for solving new tasks. The generator is trained to output instances that follow the same data distribution, such that when a new task arrives, the solver trains on the synthetic data produced by the generator along with the new data in order to alleviate catastrophic forgetting. Despite their flexibility and biological appeal, generative replay-based methods have three major disadvantages: training a generative model on a stream of changing synthetic and real data is challenging (Liu et al., 2020); GR models tend to fall short when dealing with complex datasets (Aljundi et al., 2019); and the time required to train the model on a new task increases linearly, since the model has to generate and rehearse all previous tasks. Many variants have been proposed to address these issues: conditional GANs operate in constant time but have lower accuracy (Lesort et al., 2019), while other variants depend on non-incremental pre-trained networks (van de Ven et al., 2020; Liu et al., 2020).
Regularization-based techniques such as elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), synaptic intelligence (SI) (Zenke et al., 2017), and incremental moment matching (IMM) (Lee et al., 2017) modify model parameters in a way that preserves weights important for previous tasks while finding less sensitive weights to accommodate new tasks. Despite the popularity of these methods, they tend not to perform well in class-incremental settings, since they were originally introduced for task-incremental learning (Van de Ven and Tolias, 2018).

3. Proposed Method
3.1. Architecture
An undercomplete deep autoencoder is a type of unsupervised feed-forward neural network for learning a lower-dimensional feature representation of a particular dataset by reconstructing the input at the output layer. To prevent autoencoders from converging to trivial solutions such as the identity mapping, a bottleneck layer with output $z$ is used, such that the dimension of $z$ is less than the dimension of the input $x$. The forward pass is computed as:

$$h = E(x) \qquad (1)$$

$$z = g(W_z h + b_z) \qquad (2)$$

$$\hat{x} = D(z) \qquad (3)$$

where $x$ is the input, $z$ is the output of the bottleneck (latent) layer with activation function $g$, and $E$ and $D$ are convolutional neural networks representing the encoder and decoder modules respectively. Typically, such models are trained to minimize the $\ell_2$-norm of the difference between the input $x$ and the reconstructed output $\hat{x}$. As previously discussed, the choice of the activation function of the latent layer plays an important role in anomaly detection: activation functions that quantize the latent code or encourage forming clusters are preferable. In our experiments, we find that confining the latent code to values between $-1$ and $1$ with a tanh activation function, while we maximize the loss over the negative dataset during the latent-shaping phase, has a regularizing effect. We also note that unbounded activation functions such as ReLU tend to perform poorly.
3.2. Terminology
Positive Dataset ($D^{+}$): The dataset that contains the normal class(es), for example, the plane class from CIFAR-10.

Negative Dataset ($D^{-}$): A secondary unlabeled dataset containing negative examples from a similar domain as $D^{+}$. The choice of $D^{-}$ depends on $D^{+}$. For example, if $D^{+}$ is the digit 0 from MNIST, $D^{-}$ might be random strokes or another dataset with similar features such as Omniglot (Tang et al., 2017). It is important to note that the model should not be tested on $D^{-}$, since this would violate the assumption of not knowing anomalies.

Feature Extraction Phase: The first phase of training, in which the model is simply trained to reconstruct its input. This may also include reconstructing negative examples in order to extract useful features in the case of continual learning.

Latent-Shaping Phase: The second phase of training, in which the encoder and decoder networks are frozen and only the latent layer is active.
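As a sketch of how the latent-shaping phase can be set up, assuming the `LISAE` module from the previous section: all parameters are frozen except those of the latent layer, and the optimizer only receives the latent parameters.

```python
# Latent-shaping phase: freeze encoder and decoder, keep only the latent layer trainable.
model = LISAE()
for p in model.parameters():
    p.requires_grad = False
for p in model.latent.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(model.latent.parameters(), lr=1e-3)
```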
3.3. Training for Anomaly Detection
Given a dataset $D^{+}$ and a negative dataset $D^{-}$ from a similar domain, we divide the training process into two phases. In the first phase, we reconstruct samples from $D^{+}$ by minimizing the loss $\|x^{+} - \hat{x}^{+}\|_2^2$ until convergence, where $x^{+}$ is the input drawn from $D^{+}$ and $\hat{x}^{+}$ is the output of the autoencoder. In the second phase, we freeze the model except for the latent layer and minimize the following loss function:

$$\mathcal{L} = \|x^{+} - \hat{x}^{+}\|_2^2 + \lambda \, \max\!\left(0, \; \beta - \|x^{-} - \hat{x}^{-}\|_2^2\right) \qquad (4)$$

where $x^{-}$ is a sampled batch from $D^{-}$, $\hat{x}^{-}$ is its reconstruction, $\lambda$ is a hyperparameter that controls the relative weight of the two parts of the loss, and $\beta$ is another hyperparameter indicating that we are satisfied once the reconstruction error of the negative dataset exceeds a certain value.

3.4. Predicting Anomalies
We use the reconstruction error $\|x - \hat{x}\|_2^2$ to distinguish between anomalies and normal data, where $x$ is the test sample and $\hat{x}$ is the reconstructed output. More specifically, we set a threshold $\tau$ such that if $\|x - \hat{x}\|_2^2 > \tau$ the input is considered anomalous.
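The following sketch illustrates the two training phases and the thresholding rule, assuming the hinge form of eq. (4) reconstructed above; `lam` and `beta` correspond to the hyperparameters $\lambda$ and $\beta$, and the default values shown are placeholders, not tuned settings.

```python
import torch
import torch.nn.functional as F

def phase1_step(model, x_pos, optimizer):
    """Feature-extraction phase: plain reconstruction of positive samples."""
    loss = F.mse_loss(model(x_pos), x_pos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def phase2_step(model, x_pos, x_neg, optimizer, lam=1.0, beta=10.0):
    """Latent-shaping phase (eq. 4): keep positive error low while pushing the
    negative error above beta. Only the latent layer should be trainable here."""
    pos_err = F.mse_loss(model(x_pos), x_pos)
    neg_err = F.mse_loss(model(x_neg), x_neg)
    loss = pos_err + lam * torch.clamp(beta - neg_err, min=0.0)  # hinge on negative error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def is_anomaly(model, x, tau):
    """Flag samples as anomalous if their reconstruction error exceeds tau."""
    with torch.no_grad():
        err = ((model(x) - x) ** 2).flatten(1).mean(dim=1)
    return err > tau
```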
3.5. Training for Continual Learning
As outlined in Algorithm 2, we are given a stream of sequential tasks $T_1, \dots, T_N$ with optional auxiliary negative datasets, where each task $T_i$ contains a dataset $D_i = \{(x_j, y_j)\}_{j=1}^{n_i}$ such that $x_j$ represents the $j$-th sample, $y_j \in \{1, \dots, C_i\}$ represents its target, $C_i$ is the number of classes present in $T_i$, and $N$ is the total number of tasks. For a particular task $T_i$, the first phase of training each class in $T_i$ is similar to the first phase described in Algorithm 1. For each class-specific latent layer, $D_i$ is split into a positive set $D_c^{+}$ containing the samples of class $c$ and a negative set $D_c^{-}$ containing the samples of the other classes in $T_i$. For example, if task $T_i$ is to classify digits 8 and 9 from the MNIST dataset, we consider samples of the digit 9 as $D^{-}$ for the latent layer corresponding to digit 8, and samples of the digit 8 as $D^{-}$ for digit 9. Of course, this does not preclude the possibility of having an additional auxiliary negative dataset (further details on the effect of negative examples are given in section 5.2). We notice that by selecting a particular latent layer we end up with a scenario similar to the second phase described in section 3.3, and we resume training by minimizing the same loss function:

$$\mathcal{L} = \|x^{+} - \hat{x}^{+}\|_2^2 + \lambda \, \max\!\left(0, \; \beta - \|x^{-} - \hat{x}^{-}\|_2^2\right) \qquad (5)$$
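As an illustration of the per-class procedure, the sketch below assumes a hypothetical `model.set_latent(c)` method that swaps in the latent layer of class `c` (the actual bookkeeping is implementation-specific) and reuses `phase2_step` from the anomaly detection sketch; `task_data` is assumed to map each class label to a tensor of its samples.

```python
def train_task(model, task_data, optimizers, aux_negative=None):
    """Latent-shaping for one task: each class gets its own latent layer, and the
    other classes of the same task (plus an optional auxiliary set) act as D^-."""
    classes = list(task_data.keys())
    for c in classes:
        model.set_latent(c)                         # activate latent layer for class c
        x_pos = task_data[c]                        # D^+ for class c
        negatives = [task_data[o] for o in classes if o != c]
        if aux_negative is not None:
            negatives.append(aux_negative)
        x_neg = torch.cat(negatives)                # D^- for class c
        phase2_step(model, x_pos, x_neg, optimizers[c])
```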
3.6. Prediction for Continual Learning
Given an unknown sample to be classified, we iterate over each latent layer as shown in fig. 3 and compute the reconstruction error. The latent layer with the minimum error determines the predicted class.
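A sketch of this prediction rule, under the same hypothetical `set_latent` interface as above: the reconstruction error is computed once per latent layer and the class with the minimum error is returned.

```python
def predict(model, x, classes):
    """Class-incremental prediction: pick the latent layer with minimum error."""
    errors = {}
    with torch.no_grad():
        for c in classes:
            model.set_latent(c)                     # swap in latent layer for class c
            errors[c] = ((model(x) - x) ** 2).mean().item()
    return min(errors, key=errors.get)
```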
4. Theoretical Justification
4.1. Formulation
From the optimality of autoencoders (Bourlard and Kamp, 1988), we know that, absent a non-linear activation function, a linear autoencoder corresponds to singular value decomposition (SVD); henceforth, we use SVD interchangeably with linear autoencoders. Given an $n \times d$ data matrix $X$, we decompose each sample $x$ into $x = x_{\parallel} + x_{\perp}$, where $x_{\parallel}$ lies in the subspace $\mathcal{S}$ spanned by the normal data and $x_{\perp}$ lies in its orthogonal complement $\mathcal{S}^{\perp}$. We further decompose $X$ using SVD:

$$X = U \Sigma V^{\top}$$

where $U$ and $V$ are orthonormal matrices and $\Sigma$ is a diagonal matrix such that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d$. However, in practice the data is rarely separated this neatly, especially when dealing with a large number of samples of a high-dimensional dataset; therefore, we resort to reduced SVD, where we take the first $k$ columns of $U$, with the caveat that the choice of $k$ is a hyper-parameter. The matrix $U$ can thus be divided as $U = [\, U_k \mid \bar{U}_k \,]$, and from the Eckart–Young low-rank approximation theorem, the columns of $U_k$ form an approximate basis of $\mathcal{S}$ and the columns of $\bar{U}_k$ an approximate basis of $\mathcal{S}^{\perp}$.
A linear autoencoder with a $k$-dimensional latent layer is equivalent to the transform

$$\hat{x} = U_k U_k^{\top} x$$

where $U_k$ and $U_k^{\top}$ represent the decoder and the encoder respectively. Furthermore, any data point can be represented as $x = U_k a + \bar{U}_k b$, where $a$ and $b$ are $k$- and $(d-k)$-dimensional real vectors. By orthonormality, we have the identities $U_k^{\top} U_k = I$ and $U_k^{\top} \bar{U}_k = 0$, where $I$ is the $k \times k$ identity matrix. As a shorthand, we write $U$ and $\bar{U}$ instead of $U_k$ and $\bar{U}_k$. Using these two identities, we rewrite the reconstruction loss as follows:

$$\mathcal{L}(x) = \|x - U U^{\top} x\|_2^2 = \|U a + \bar{U} b - U a\|_2^2 = \|\bar{U} b\|_2^2 = \|b\|_2^2$$

We note that this loss is agnostic to the direction of the orthogonal component $b$ and is only concerned with its magnitude. The assumption for anomaly detection under this setting is that $\|b^{-}\| > \|b^{+}\|$, where $b^{-}$ and $b^{+}$ correspond to the orthogonal components of anomalies and positive data respectively. We posit that while this agnosticism is desirable for potential generality, it is not optimal for anomaly detection; hence, we modify the loss score to depend on the direction of $b$:
$$\mathcal{L}_{LIS}(x) = \|x - U Z x\|_2^2$$

where $Z$ is a $k \times d$ matrix (the modified encoder) such that the loss is small for normal data but large for anomalies. In other words, we want $\|x^{+} - U Z x^{+}\|_2^2$ to be small and $\|x^{-} - U Z x^{-}\|_2^2$ to be large. We define $\beta^{+}$ as an orthonormal basis for the column space of $B^{+}$ and $\beta$ as an orthonormal basis for $\mathcal{S}^{\perp}$, where $B^{+}$ is the matrix of all positive orthogonal components $x_{\perp}^{+}$. We decompose $\beta$ further into $\beta^{-}$ and $\beta^{+}$, where the columns of $\beta^{-}$ are the basis vectors of $\mathcal{S}^{\perp}$ that are not in the span of $\beta^{+}$ and the columns of $\beta^{+}$ are the remaining columns of $\beta$. Since $\mathbb{R}^d = \mathcal{S} \oplus \mathrm{span}(\beta^{+}) \oplus \mathrm{span}(\beta^{-})$, any $x$ can be written as $x = U a + \beta^{+} c + \beta^{-} e$, where $a$, $c$, and $e$ are real vectors. Despite the fact that we do not have access to anomalies, we can utilize other negative examples from a similar domain and use them as a proxy. Since negative examples lie largely outside $\mathcal{S}$, maximizing the reconstruction loss on the negative dataset implies maximizing the loss along the directions of $\beta^{-}$, assuming that the orthogonal components of negative examples are not contained in the span of $\beta^{+}$. The latter assumption hinges on the fact that $D^{-}$ is from a similar domain. Therefore, we end up with the goal of finding $Z$ such that $\|x^{+} - U Z x^{+}\|_2^2$ is small and the loss along $\beta^{-}$ directions is large.
In practice, we cannot maximize the negative loss indefinitely, and we are satisfied if it reaches a certain large value $\beta$:

$$\min_{Z} \;\; \mathbb{E}_{x^{+} \sim D^{+}} \|x^{+} - U Z x^{+}\|_2^2 \;+\; \lambda \, \mathbb{E}_{x^{-} \sim D^{-}} \max\!\left(0, \; \beta - \|x^{-} - U Z x^{-}\|_2^2\right)$$
We notice that in order for this to work, the decoder has to be known and remain fixed (frozen). This suggests a two-phase training where we first compute the decoder and encoder networks, and in the second phase the decoder is fixed while the encoder is modified using the new loss. In fig. 4, a linear version of LIS-AE is trained on digit 8 from MNIST with Omniglot as a negative dataset. We perform an orthogonal decomposition of each input by projecting it onto the digit-8 subspace to obtain its projection and orthogonal vectors. We then feed each vector separately to a regular linear AE and a linear LIS-AE. We observe that the regular autoencoder outputs zero images for the orthogonal part of each sample regardless of the class it belongs to. In the case of LIS-AE, however, the model behaves differently for the normal class than for anomalous classes.
We also notice that orthogonal projections do not form a semantically meaningful representation in pixel space. In order to gain a better representation we use a deep AE. For this non-linear case, we treat the middle part of the network as an inner linear autoencoder operating on a more semantically meaningful transformed version of the data. This suggests a stacked autoencoder architecture where another loss term for the inner autoencoder is added in the first phase to ensure that the output of the layer after the latent layer is similar to the latent input. In the second phase we freeze the entire network except for the encoder of the inner autoencoder (the latent layer of the entire model) and minimize the reconstruction error of positive examples while maximizing the loss for negative examples. However, in our experiments we observed that adding these loss terms was not necessary, and a loss similar to the linear case produced similar results, since we only consider reconstruction scores of the outer model. Therefore, we keep the entire network frozen except for the latent layer while directly minimizing the same loss as before (eq. 4).
4.2. Intuition
For concreteness, we consider the following simple supervised case. Given a dataset $D \subset \mathbb{R}^3$ of points $(x, y, z)$ in which most of the variance is along the x-axis, training a linear autoencoder with latent dimension $k = 1$ results in a decoder $U = (1, 0, 0)^{\top}$ and an encoder $E = U^{\top} = (1, 0, 0)$. Given an input $v = (x, y, z)^{\top}$, the loss score is

$$\mathcal{L}_{AE}(v) = \|v - U U^{\top} v\|_2^2 = y^2 + z^2$$

Training a LIS-AE on negative samples that have nonzero values only along the z-axis, we end up with the same decoder $U$ and a new (modified) encoder $Z = (1, 0, \alpha)$, where $\alpha$ is a large number. The original reconstruction map $U U^{\top}$ sends $v$ to $(x, 0, 0)^{\top}$, whereas the new map $U Z$ sends $v$ to $(x + \alpha z, 0, 0)^{\top}$. The new loss scores for the regular linear AE and for LIS-AE are:

$$\mathcal{L}_{AE}(v) = y^2 + z^2, \qquad \mathcal{L}_{LIS}(v) = \|v - U Z v\|_2^2 = y^2 + (1 + \alpha^2) z^2$$

In the case of a regular linear AE (PCA), given a fixed loss value $r^2$, every point on the cylinder $y^2 + z^2 = r^2$ receives the same score $\mathcal{L}_{AE} = r^2$, making a sample with its off-subspace energy along the y-axis and a sample with the same energy along the z-axis indistinguishable. In the case of LIS-AE, $\mathcal{L}_{LIS} = r^2$ holds only on the elliptic cylinder $y^2 + (1 + \alpha^2) z^2 = r^2$, and since $\alpha$ is a large number, the cross-section of the cylinder is squashed in the z dimension, resulting in a heavily penalized loss in the z dimension but a regular loss in the y dimension. In this case, the two samples become indistinguishable only for very small values of $z$.
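This toy example can be checked numerically; the NumPy sketch below reproduces the loss scores above with an illustrative $\alpha = 10$: the plain PCA loss assigns the same score to a y-direction sample and a z-direction sample of equal magnitude, while the latent-shaped encoder separates them.

```python
import numpy as np

alpha = 10.0                                   # large stretch factor along z (illustrative)
U = np.array([[1.0], [0.0], [0.0]])            # frozen decoder: first principal direction
E = U.T                                        # original encoder: PCA projection
Z = np.array([[1.0, 0.0, alpha]])              # latent-shaped encoder

def loss(dec, enc, v):
    """Reconstruction loss ||v - dec @ enc @ v||^2."""
    return float(np.sum((v - dec @ enc @ v) ** 2))

v_normal  = np.array([1.0, 0.5, 0.0])          # off-subspace energy along y ("normal" direction)
v_anomaly = np.array([1.0, 0.0, 0.5])          # same energy, but along z ("anomalous" direction)

print(loss(U, E, v_normal), loss(U, E, v_anomaly))   # 0.25 0.25  -> indistinguishable
print(loss(U, Z, v_normal), loss(U, Z, v_anomaly))   # 0.25 25.25 -> separated
```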
We note that the new map $U Z$ is merely a rotated and stretched version of the old $U U^{\top}$ in the xz-plane. Thus, we can think of linear LIS-AE as a regular PCA with its eigenvectors (columns of $U$) stretched and tilted in the directions of the orthogonal complement of the eigenspace. This is done in such a way that keeps the column space of normal examples invariant under the new transformation $U Z$. By itself, this formulation remains ill-posed, since there is an infinite number of solutions that do not necessarily help with anomaly detection. More formally, given positive samples of the form $x^{+} = U a$, we can choose any matrix $Z$ satisfying $Z U = I$, since then $U Z x^{+} = U Z U a = U a = x^{+}$. However, this does not guarantee any advantage for anomaly detection on similar data; even worse, in practice this modification might result in slightly worse performance if done arbitrarily, since the model usually has to sacrifice some extreme samples from the normal data to balance the two losses. Thus, the negative dataset is used to properly determine the directions of the tilt, and the hyperparameters ($\lambda$ and $\beta$) determine the importance and amount of stretching (or shrinking) while changing the normal case as little as possible. For deep LIS-AE, the same analogy holds, albeit in a latent space.

Deep architectures are not only useful for learning good representations, but can also learn a non-linear transformation with properties useful for our objective, such as linear separability of negative and positive samples. By adding a standard binary cross-entropy loss before the non-linear activation of the latent layer during the first phase, we ensure that the input of the latent layer is linearly separable for positive and negative examples during the second phase. This linearly-separable variant (LinSep-LIS-AE) almost always performs better than directly using LIS-AE. We investigate the effect of this property on the second phase in section 5.2.
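A sketch of the modified first phase for LinSep-LIS-AE, where `lin_head` is an illustrative auxiliary linear classifier on the latent pre-activation (its parameters are assumed to be included in the optimizer and it can be discarded after training) and `gamma` is an assumed weighting hyperparameter:

```python
def linsep_phase1_step(model, lin_head, x_pos, x_neg, optimizer, gamma=1.0):
    """Phase 1 with an extra BCE term that pushes positive and negative examples
    to be linearly separable at the latent pre-activation (before the tanh)."""
    x = torch.cat([x_pos, x_neg])
    pre = model.latent(model.encoder(x))            # latent pre-activation
    out = model.decoder(torch.tanh(pre))
    labels = torch.cat([torch.ones(len(x_pos)), torch.zeros(len(x_neg))])
    recon = F.mse_loss(out, x)
    sep = F.binary_cross_entropy_with_logits(lin_head(pre).squeeze(1), labels)
    loss = recon + gamma * sep
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that `lin_head` would be a `nn.Linear(latent_dim_input, 1)` over the pre-activation; only the fact that two distinct datasets exist is used, no class labels within them.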
5. Experiments
Model | MNIST | Fashion-MNIST | 2-Class MNIST |
---|---|---|---|
KDE | 0.9568 | 0.9183 | 0.9206 |
IF | 0.8624 | 0.9144 | 0.73018 |
OC-SVM | 0.9108 | 0.8608 | 0.8741 |
OC-DSVDD | 0.9489 | 0.8577 | 0.8972 |
AnoGAN | 0.9579 | 0.9098 | 0.8406 |
AnoGAN-FM | 0.9544 | 0.9072 | 0.8353 |
Linear-AE | 0.9412 | 0.8845 | 0.8915 |
VAE | 0.9642 | 0.9092 | 0.9263 |
Mem-AE | 0.9714 | 0.9131 | 0.9352 |
Sig-RepNN (N=4) | 0.9661 | 0.9124 | 0.9261 |
AE | 0.9601 | 0.9076 | 0.9221 |
LIS-AE | 0.9768 | 0.9256 | 0.9457 |
We report results on the following datasets: MNIST (LeCun and Cortes, 2010), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011) and CIFAR-10 (Krizhevsky et al., 2009). Results of our approach are compared to baseline models with the same capacity for autoencoder-based methods.

5.1. Anomaly Detection
In this section, we test LIS-AE for anomaly detection on image data in unsupervised settings. Given a standard classification dataset, we group a set of classes together into a new dataset and consider it the "normal" dataset. The classes that are neither in the normal nor in the negative dataset are considered anomalies. During training, our model is presented only with the normal dataset and the additional negative dataset. We evaluate performance on test data comprised of both the "normal" and "anomalous" groups.
For MNIST and Fashion-MNIST, the encoder network consists of two convolutional layers with LeakyReLU non-linearities followed by a fully-connected bottleneck layer with a tanh activation function. The decoder network consists of a fully-connected layer followed by a LeakyReLU, two deconvolution layers with LeakyReLU activations, and a final convolution layer with a sigmoid at the output. For SVHN and CIFAR-10 we use larger latent layers and higher-capacity networks of the same depth. It is worth noting that the choice of latent layer size has the largest effect on performance for all models (compared to other hyper-parameters). We report the best-performing latent dimension for all models.
In table (1), we compare LIS-AE with several autoencoder-based anomaly detection models as baselines, all of which share the exact same architecture. The most direct comparison is between LIS-AE and AE: not only do they have the same architecture, they have the exact same encoder and decoder weights, and their performance is simply measured before and after the latent-shaping phase. We use a variant of RepNN with a sigmoid activation function placed before the tanh staircase-function approximation described in section 2.2. This is mainly because "squashing" the input between 0 and 1 before passing it to the staircase function gives a more robust and easier-to-train network. We only report the best results for Sig-RepNN, with 4 activation levels. For anomaly GAN (AnoGAN) (Schlegl et al., 2017), we follow the implementation described in (Schlegl et al., 2019): we train a W-GAN (Gulrajani et al., 2017) with gradient penalty and report performance for two anomaly scores, namely, the encoder-generator reconstruction loss and an additional feature-matching distance score in the discriminator feature space (AnoGAN-FM).

Model | SVHN | CIFAR-10
---|---|---|
KDE | 0.5648 | 0.5752 |
IF | 0.5112 | 0.6097 |
OC-SVM | 0.5047 | 0.5651 |
OC-DSVDD | 0.5681 | 0.6411 |
AnoGAN | 0.5598 | 0.5843 |
AnoGAN-FM | 0.5645 | 0.5880 |
Linear-AE | 0.5702 | 0.5753 |
VAE | 0.5692 | 0.5734 |
Mem-AE | 0.5720 | 0.5931 |
Sig-RepNN (N=4) | 0.5684 | 0.5719 |
AE | 0.5698 | 0.5703 |
LIS-AE | 0.6886 | 0.8145 |
LinSep-LIS-AE | 0.7701 | 0.8858 |
Sup. LIS-AE | 0.7573 | 0.8384 |
Sup. LinSep-LIS-AE | 0.8479 | 0.9170 |
For AnoGAN, Linear-AE, AE, VAE, RepNN, MemAE, and LIS-AE, we use the reconstruction error such that if $\|x - \hat{x}\|_2^2 > \tau$ the input is considered an anomaly. Varying the threshold $\tau$, we compute the area under the ROC curve (AUC) as a measure of performance. Similarly, for OC-SVDD (equivalently, OC-SVM with RBF kernel) and OC-DSVDD, we vary the inverse length scale $\gamma$ and use the predicted class label. For kernel density estimation (KDE) (Parzen, 1962), we vary the threshold over the log-likelihood scores. For Isolation Forest (IF) (Liu et al., 2008), we vary the threshold over the anomaly score calculated by the Isolation Forest algorithm.
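For reference, this threshold-free evaluation can be sketched as follows, scoring samples by reconstruction error and computing AUC with scikit-learn (the helper and its signature are ours, not taken from the compared implementations):

```python
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

def auc_from_errors(model, x_normal, x_anomalous):
    """AUC over per-sample reconstruction errors (higher error = more anomalous)."""
    with torch.no_grad():
        e_norm = ((model(x_normal) - x_normal) ** 2).flatten(1).mean(1)
        e_anom = ((model(x_anomalous) - x_anomalous) ** 2).flatten(1).mean(1)
    scores = torch.cat([e_norm, e_anom]).cpu().numpy()
    labels = np.concatenate([np.zeros(len(e_norm)), np.ones(len(e_anom))])
    return roc_auc_score(labels, scores)
```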
The datasets tested in table (1) are MNIST and Fashion-MNIST. To train LIS-AE on MNIST, we use Omniglot (Lake et al., 2015) as our negative dataset since it shares similar compositional characteristics with MNIST. Since Omniglot is a relatively small dataset, we diversify the negative examples with various augmentation techniques, namely Gaussian blurring, random cropping, and horizontal and vertical flipping. We test two settings for MNIST: a 1-class setting, where the normal dataset is one particular class and the rest of the dataset's classes are considered anomalies. The process is repeated for all classes and the average AUC over 10 classes is reported. The other setting is 2-class MNIST, where the normal dataset consists of two classes and the remaining classes are considered anomalies. For example, the first task contains digits 0 and 1 and the remaining digits are considered anomalies, the second task contains digits 2 and 3, and so forth. This setting is more challenging since there is more than one class present in the normal dataset; it is also very informative for continual learning methods that use autoencoders as gates to first identify task boundaries.
For Fashion-MNIST, the choice of negative examples is different: we use the next class as the negative dataset and we do not include it with the anomalies (i.e., the remaining classes) at test time. This is also informative for continual learning, where we have a stream of sequential tasks. In the ablation section we test multiple negative datasets for Fashion-MNIST.
We note that LIS-AE achieves superior performance to all compared approaches; however, we also notice that these settings are comparatively easy, and all tested models performed adequately, including classical non-deep approaches.
In table (2), we show performance on SVHN and CIFAR-10, which are more complex datasets compared to MNIST and Fashion-MNIST. To train LIS-AE, we split each dataset into two halves; each half is used as negative examples for the other. Note that we only test on the remaining classes, which are neither in the normal nor in the negative dataset. For example, the first split of CIFAR-10 has five classes, namely airplane, automobile, bird, cat and deer, while the second has dog, frog, horse, ship and truck.

Training on airplane as the first normal task, LIS-AE maximizes the loss for samples drawn from the negative dataset (dog, frog, ship and so forth). We then test its performance with airplane as the normal class and only automobile, bird, cat and deer as anomalies. Note that we do not test on dog, frog, or the other classes in the negative dataset. This process is repeated for all 10 classes and the average AUC is reported. As mentioned in section 4.2, we introduce LinSep-LIS-AE as an improvement over standard LIS-AE. The difference between the two models is only in the first phase, where a binary cross-entropy loss is added to ensure that positive and negative examples are linearly separable during the second phase. The last two entries of the table are supervised upper bounds for each variant, where the negative dataset is the same as the outliers. In figure (6), we see that a standard AE generalizes well to other classes, which is not a desired property for anomaly detection. In contrast, LIS-AE only reconstructs normal data faithfully, which translates to the large performance gap we see in figure (5).
We also notice that despite CIFAR-10 being more complex than SVHN, most reconstruction-based models perform better on CIFAR-10 than on SVHN. This is because the difference between SVHN classes in terms of reconstruction is not as large: they share similar compositional features and often appear in samples from other classes, while CIFAR-10 classes vary significantly (e.g., digit 2 and digit 3 on a wall vs. truck and bird).
5.2. Ablation
Negative Data | MNIST | Fashion | SVHN | CIFAR-10
---|---|---|---|---
None | 0.9485 | 0.8740 | 0.5698 | 0.5703 |
Omni | 0.9605 | 0.9013 | - | - |
MNIST | 0.9778 | 0.8942 | - | - |
Fashion | 0.9482 | 0.9106 | - | - |
SVHN | - | - | 0.6886 | 0.7065 |
CIFAR-10 | - | - | 0.5481 | 0.8145 |
Same (Sup.) | 0.9901 | 0.9623 | 0.7573 | 0.8384 |
In this section, we investigate the effect of the nature of the negative dataset and of the linear separability of positive and negative examples. In table (3) we train LIS-AE on different negative and positive datasets. Similar to table (2), we split each positive dataset into two halves and follow the same settings as before, with the exception of the "None" and "Supervised" cases. The "None" case indicates that no negative examples have been used, whereas "Supervised" indicates that the outlier and negative datasets share the same classes. Note that this case is different from the case where the positive and negative datasets come from the same dataset. Unless stated otherwise, we only test on classes (outliers) that are neither in the positive nor in the negative dataset. For example, when MNIST is used as a source for both positive and negative datasets, the positive data starts with class 0 and the negative dataset consists of classes 5 to 9, while the outliers are classes 1 to 4. This process is repeated for all 10 classes present in each dataset and the average AUC is reported. Overall, using a negative dataset resulted in a significant increase in performance in every case except two, namely, when Fashion-MNIST and CIFAR-10 were used as negative datasets for MNIST and SVHN respectively. This could be explained by the fact that the model was not capable of reconstructing Fashion-MNIST and CIFAR-10 classes in the first place; moreover, shaping the latent layer in a way that maximizes the loss for Fashion-MNIST and CIFAR-10 classes does not guarantee any advantage for anomaly detection of the similar digit classes present in MNIST and SVHN. This, coupled with the fact that the process in practice forces the model to ignore some samples from the normal dataset to balance the two losses, results in the performance degradation we observe in these two cases.
Table (4) is an excerpt of the complete table in the appendices, where we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from CIFAR-10. We split CIFAR-10 into two separate halves; the first is used for selecting classes as negative datasets and the other is used as outliers. For each class in CIFAR-10 we train eight models in different settings. The first setting is "None", where we train a standard autoencoder with no negative examples as the base model. The remaining seven settings differ in the second phase: we select one class as our negative dataset and test the model's performance on each individual class from the outlier dataset. The combined setting is similar to the setting described in section 5.1, where we combine all negative classes into one 5-class negative dataset. Note that these classes are not the same as the classes in the outlier test dataset, except in the final setting, an upper-bound supervised setting where the negative dataset is comprised of the classes in the outlier dataset except for the positive class. This process is then repeated for all 10 classes in CIFAR-10. Overall, we observe a significant performance increase over the base model, with the general trend that a negative class significantly increases anomaly detection performance for similar outliers. For example, the dog class drastically improves performance on the cat class but not so much on the plane class. However, we also notice two important exceptions. When the horse class is used as the negative dataset for the car class, we see a significant performance increase for the relatively similar deer class, as expected; however, when the horse class is used as the negative dataset for the deer class itself, the performance does not improve as in the first case and even degrades for the car class.
Positive Class | Negative Class | Plane | Car | Bird | Cat | Deer | avg
---|---|---|---|---|---|---|---
Car | None | 0.32 | - | 0.34 | 0.33 | 0.33 | 0.330
 | Dog 5 | 0.67 | - | 0.89 | 0.93 | 0.90 | 0.848
 | Frog 6 | 0.58 | - | 0.90 | 0.90 | 0.91 | 0.823
 | Horse 7 | 0.66 | - | 0.88 | 0.90 | 0.92 | 0.840
 | Ship 8 | 0.83 | - | 0.59 | 0.51 | 0.50 | 0.608
 | Truck 9 | 0.51 | - | 0.44 | 0.49 | 0.44 | 0.470
 | Comb. (5-9) | 0.81 | - | 0.92 | 0.92 | 0.94 | 0.898
 | Sup. (0-4) | 0.89 | - | 0.93 | 0.90 | 0.95 | 0.918
Deer | None | 0.56 | 0.80 | 0.52 | 0.54 | - | 0.605
 | Dog 5 | 0.72 | 0.85 | 0.63 | 0.80 | - | 0.750
 | Frog 6 | 0.66 | 0.86 | 0.60 | 0.75 | - | 0.718
 | Horse 7 | 0.71 | 0.58 | 0.58 | 0.71 | - | 0.645
 | Ship 8 | 0.93 | 0.94 | 0.63 | 0.72 | - | 0.805
 | Truck 9 | 0.84 | 0.97 | 0.62 | 0.72 | - | 0.773
 | Comb. (5-9) | 0.87 | 0.95 | 0.61 | 0.73 | - | 0.790
 | Sup. (0-4) | 0.93 | 0.97 | 0.62 | 0.72 | - | 0.813
Other notable examples of this observation can be found in the appendices, where, for instance, the dog class improves performance on cat outliers but causes a noticeable degradation when used as the negative dataset for the cat class itself. We believe the reason is that in the first case the gained performance is due to these classes sharing similar compositional features and backgrounds, whereas in the second case the same property makes it difficult to balance the minimization and maximization losses during the latent-shaping phase. For example, car and truck images are so similar in this scenario that minimizing and maximizing the loss at the same time becomes contradictory. As posited in section 4.2, we mitigate this issue by adding a binary cross-entropy loss while training in the first phase to ensure that the input of the latent layer is linearly separable for positive and negative examples. Notice that unlike other approaches (Hendrycks et al., 2018; Perera and Patel, 2019), this does not require a labeled positive or negative dataset and relies only on the fact that we have two distinct datasets. This linear separability makes the second phase of training relatively easier and less contradictory. In table (5), we see that LinSep-LIS-AE mitigates this issue for the aforementioned cases and gives the AUC increase we observed in table (2).
5.3. Continual Learning
Positive Class | Negative Class | Plane | Car | Bird | Cat | Deer | avg
---|---|---|---|---|---|---|---
Car | None | 0.32 | - | 0.34 | 0.33 | 0.33 | 0.330
 | Dog 5 | 0.67 | - | 0.94 | 0.97 | 0.95 | 0.883
 | Frog 6 | 0.58 | - | 0.93 | 0.96 | 0.96 | 0.858
 | Horse 7 | 0.69 | - | 0.95 | 0.97 | 0.97 | 0.895
 | Ship 8 | 0.90 | - | 0.78 | 0.79 | 0.76 | 0.808
 | Truck 9 | 0.59 | - | 0.77 | 0.82 | 0.73 | 0.728
 | Comb. (5-9) | 0.90 | - | 0.97 | 0.97 | 0.98 | 0.955
 | Sup. (0-4) | 0.95 | - | 0.98 | 0.98 | 0.98 | 0.9725
Deer | None | 0.56 | 0.80 | 0.52 | 0.54 | - | 0.605
 | Dog 5 | 0.67 | 0.86 | 0.71 | 0.89 | - | 0.783
 | Frog 6 | 0.68 | 0.87 | 0.62 | 0.79 | - | 0.740
 | Horse 7 | 0.70 | 0.84 | 0.61 | 0.72 | - | 0.718
 | Ship 8 | 0.94 | 0.95 | 0.63 | 0.73 | - | 0.813
 | Truck 9 | 0.84 | 0.97 | 0.61 | 0.76 | - | 0.795
 | Comb. (5-9) | 0.90 | 0.97 | 0.66 | 0.80 | - | 0.833
 | Sup. (0-4) | 0.97 | 0.98 | 0.76 | 0.83 | - | 0.885
A common evaluation method for CL settings is the split-dataset test, whereby a standard classification dataset is divided into $N$ disjoint tasks, each containing $C$ classes, such that $N \times C$ is the total number of classes present in the dataset. The model is then presented with tasks one at a time and the final performance over all tasks is reported. In this section, we consider the MNIST and Fashion-MNIST datasets in two common variants of this setting, namely, task-incremental and class-incremental settings. It is important to note that while these two datasets are now considered too simple a test for meaningful evaluation of classifiers in general, this does not hold for continual learning: MNIST-like tasks are still extremely challenging in continual learning, especially in class-incremental settings. We compare LIS-AE with several CL approaches. We start by estimating a lower bound for each setting by simply training the classifier sequentially on tasks to assess forgetting and interference. For the lower-bound model, EWC, online EWC (Schwarz et al., 2018), SI, and deep generative replay (DGR), the classifier network has the same architecture as the encoder network used in LIS-AE. For Expert-Gate (EG), we use $N$ experts (classifiers) with the same encoder network architecture, where $N$ is the number of tasks. It is worth noting that the latent dimension of the autoencoder gates in EG differs from the latent dimension used in LIS-AE; therefore, we report the best accuracy for both EG and LIS-AE. For fair comparison, we only include models that do not have an episodic memory (i.e., do not store raw data).
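For concreteness, the split-dataset protocol can be sketched as follows for split-MNIST with two classes per task; the helper name and its defaults are illustrative.

```python
from torchvision import datasets, transforms
from torch.utils.data import Subset

def make_split_tasks(root="./data", classes_per_task=2):
    """Build split-MNIST: 5 disjoint tasks of 2 classes each (0-1, 2-3, ..., 8-9)."""
    mnist = datasets.MNIST(root, train=True, download=True,
                           transform=transforms.ToTensor())
    tasks = []
    for t in range(10 // classes_per_task):
        task_classes = set(range(t * classes_per_task, (t + 1) * classes_per_task))
        idx = [i for i, y in enumerate(mnist.targets) if int(y) in task_classes]
        tasks.append(Subset(mnist, idx))          # one disjoint task per fold
    return tasks
```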
We first test our model under the easy protocol, where the model is presented with an unknown sample along with a task identifier. Table (6) shows results for split-MNIST and split-Fashion-MNIST. For both datasets, we consider five folds: the first fold is comprised of classes 0 and 1, the second of classes 2 and 3, and so forth. During testing, the model is presented with a task identifier indicating which task the sample belongs to. For LIS-AE, this significantly reduces interference between latent layers from different tasks.
Model | MNIST | Fash-MNIST |
---|---|---|
lower bound | 0.8819 | 0.7721 |
EWC | 0.9864 | 0.9572 |
Online EWC | 0.9904 | 0.9642 |
SI | 0.9916 | 0.9682 |
DGR | 0.9941 | 0.8842 |
LIS-AE | 0.9961 | 0.9866 |
Model | MNIST | Fash-MNIST |
---|---|---|
lower bound | 0.1970 | 0.1821 |
EWC | 0.1992 | 0.1902 |
Online EWC | 0.1993 | 0.1891 |
SI | 0.2101 | 0.1911 |
DGR | 0.9124 | 0.7298 |
EG | 0.9306 | 0.8024 |
LIS-AE | 0.9453 | 0.7786 |
DL-LIS-AE | 0.9814 | 0.8587 |
In table (7) we turn to the more challenging class-incremental setting, where no task identifier is presented. We see that for MNIST the performance remains relatively close to the task-incremental performance; however, for Fashion-MNIST, noticeable interference manifests when task 3 is introduced, as shown in fig. 7. This is due to the ordering of classes in Fashion-MNIST: task 2 contains the coat class while task 3 contains the shirt class. Such interference was not significant in the task-incremental setting because the task label first distinguishes between the two similar classes. To give our model more flexibility in class-incremental settings, we introduce another variant of LIS-AE, namely Double-Latent LIS-AE (DL-LIS-AE), where instead of adding only one latent layer for each class as described in section 3.5, we add two consecutive latent layers with a tanh activation in the middle, which substantially improves performance. It is worth noting that despite the Expert-Gate model having five different autoencoders and five different classifier networks, a single base LIS-AE achieves similar performance. This is because in our model, utilizing negative examples, even as simple as a one-class negative dataset, significantly improves performance, as discussed in section 5.2. Another advantage is that the added latent layers are, by definition, operating as gates, but over a better representation rather than in pixel space. We also note from table (1) that when solving anomaly detection tasks, the 2-class MNIST setting was significantly harder than the 1-class setting due to higher inter-class variance in the data.
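A sketch of the double-latent block; the wiring and sizes are illustrative (we only assume two consecutive layers with a tanh in the middle, as stated above), with `in_dim` matching the encoder output of the earlier sketch.

```python
import torch.nn as nn

def make_double_latent(in_dim=32 * 7 * 7, mid_dim=64, latent_dim=32):
    """DL-LIS-AE per-class latent block: two consecutive linear layers with a
    tanh in between; the final tanh keeps the code bounded as in the base model."""
    return nn.Sequential(
        nn.Linear(in_dim, mid_dim),
        nn.Tanh(),
        nn.Linear(mid_dim, latent_dim),
        nn.Tanh(),
    )
```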
6. Conclusion
In this paper we introduced a novel autoencoder-based model called the Latent-Insensitive Autoencoder (LIS-AE). With the help of negative samples drawn from a domain similar to that of the normal data, we tune the weights of the bottleneck of a standard autoencoder such that the resulting model reconstructs the target task while penalizing anomalous samples. We presented theoretical justification for our two-phase training process and the latent-shaping loss function, along with a more powerful variant, and conducted multiple ablation studies to explain the effect of the negative dataset. We showed that continual learning can be thought of as multiple anomaly detection problems and leveraged this framing to extend the applications of our model beyond anomaly detection to tackle the challenging problem of class-incremental learning using a simple variant with multiple bottlenecks. We tested our model in a variety of anomaly detection and class-incremental settings with multiple datasets of varying degrees of complexity, and experimental results showed significant performance improvement over compared methods. Future research will focus on further investigating the connection between continual learning and anomaly detection, forward and backward transfer of knowledge in continual learning, and possible ways of synthesizing negative examples for domains with limited data. We also hope to further study and employ various manifold learning approaches for latent-space representation.
Acknowledgement
Artem Lenskiy was funded by Our Health in Our Hands (OHIOH), a strategic initiative of the Australian National University, which aims to transform healthcare by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and health care providers.
References
- Online continual learning with maximally interfered retrieval. arXiv preprint arXiv:1908.04742.
- Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375.
- Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18.
- Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19, pp. 153.
- Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59 (4), pp. 291–294.
- Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 1–37.
- Robust, deep and inductive anomaly detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36–51.
- Anomaly detection: a survey. ACM Computing Surveys (CSUR) 41 (3), pp. 1–58.
- Support-vector networks. Machine Learning 20 (3), pp. 273–297.
- Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135.
- Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705–1714.
- Generative adversarial nets. Advances in Neural Information Processing Systems 27.
- Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.
- Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 733–742.
- Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180.
- Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
- Learning multiple layers of features from tiny images.
- Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
- Kernel methods in computer vision. Now Publishers Inc.
- Hippocampal-neocortical interaction: a hierarchy of associativity. Hippocampus 10 (4), pp. 420–430.
- MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
- Overcoming catastrophic forgetting by incremental moment matching. arXiv preprint arXiv:1703.08475.
- Marginal replay vs conditional replay for continual learning. In International Conference on Artificial Neural Networks, pp. 466–480.
- Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
- Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
- Generative feature replay for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 226–227.
- The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology 4, pp. 504.
- Reading digits in natural images with unsupervised feature learning.
- On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33 (3), pp. 1065–1076.
- LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572.
- Learning deep features for one-class classification. IEEE Transactions on Image Processing 28 (11), pp. 5450–5463.
- Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pp. 759–766.
- Deep one-class classification. In International Conference on Machine Learning, pp. 4393–4402.
- Progressive neural networks. arXiv preprint arXiv:1606.04671.
- f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, pp. 30–44.
- Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157.
- Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471.
- Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537.
- Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690.
- A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pp. 270–279.
- Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17 (2), pp. 336.
- Support vector data description. Machine Learning 54 (1), pp. 45–66.
- Does continual learning = catastrophic forgetting? arXiv preprint arXiv:2101.07295.
- Replicator neural networks for outlier modeling in segmental speech recognition. In International Symposium on Neural Networks, pp. 996–1001.
- Brain-inspired replay for continual learning with artificial neural networks. Nature Communications 11 (1), pp. 1–14.
- Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.
- Inference with the universum. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1009–1016.
- A comparative study of RNN for outlier detection in data mining. In 2002 IEEE International Conference on Data Mining, Proceedings, pp. 709–712.
- Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995.
- Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674.
- A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 5 (5), pp. 363–387.
- Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.
Appendix A Effect of individual classes as negative examples
As discussed in section 5.2, we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR-10 dataset. The first table shows results for standard LIS-AE while the second table shows results for LinSep-LIS-AE.
Positive Class | Negative Class | Plane | Car | Bird | Cat | Deer | avg
---|---|---|---|---|---|---|---
Plane | None | - | 0.78 | 0.58 | 0.62 | 0.61 | 0.648
 | Dog 5 | - | 0.83 | 0.87 | 0.95 | 0.91 | 0.890
 | Frog 6 | - | 0.83 | 0.86 | 0.94 | 0.90 | 0.883
 | Horse 7 | - | 0.82 | 0.86 | 0.94 | 0.91 | 0.883
 | Ship 8 | - | 0.83 | 0.86 | 0.92 | 0.90 | 0.878
 | Truck 9 | - | 0.84 | 0.86 | 0.95 | 0.91 | 0.890
 | Comb. (5-9) | - | 0.83 | 0.85 | 0.93 | 0.91 | 0.880
 | Sup. (0-4) | - | 0.81 | 0.88 | 0.94 | 0.92 | 0.888
Car | None | 0.32 | - | 0.34 | 0.33 | 0.33 | 0.330
 | Dog 5 | 0.67 | - | 0.89 | 0.93 | 0.90 | 0.848
 | Frog 6 | 0.58 | - | 0.90 | 0.90 | 0.91 | 0.823
 | Horse 7 | 0.66 | - | 0.88 | 0.90 | 0.92 | 0.840
 | Ship 8 | 0.83 | - | 0.59 | 0.51 | 0.50 | 0.608
 | Truck 9 | 0.51 | - | 0.44 | 0.49 | 0.44 | 0.470
 | Comb. (5-9) | 0.81 | - | 0.92 | 0.92 | 0.94 | 0.898
 | Sup. (0-4) | 0.89 | - | 0.93 | 0.90 | 0.95 | 0.918
Bird | None | 0.52 | 0.78 | - | 0.54 | 0.52 | 0.590
 | Dog 5 | 0.59 | 0.75 | - | 0.71 | 0.49 | 0.635
 | Frog 6 | 0.53 | 0.78 | - | 0.69 | 0.54 | 0.635
 | Horse 7 | 0.63 | 0.80 | - | 0.66 | 0.57 | 0.665
 | Ship 8 | 0.86 | 0.89 | - | 0.61 | 0.45 | 0.703
 | Truck 9 | 0.76 | 0.94 | - | 0.64 | 0.72 | 0.765
 | Comb. (5-9) | 0.78 | 0.90 | - | 0.62 | 0.49 | 0.698
 | Sup. (0-4) | 0.82 | 0.94 | - | 0.60 | 0.48 | 0.710
Cat | None | 0.55 | 0.76 | 0.50 | - | 0.50 | 0.578
 | Dog 5 | 0.54 | 0.73 | 0.52 | - | 0.52 | 0.578
 | Frog 6 | 0.56 | 0.72 | 0.60 | - | 0.62 | 0.625
 | Horse 7 | 0.70 | 0.75 | 0.59 | - | 0.68 | 0.680
 | Ship 8 | 0.91 | 0.89 | 0.53 | - | 0.46 | 0.678
 | Truck 9 | 0.82 | 0.94 | 0.50 | - | 0.48 | 0.685
 | Comb. (5-9) | 0.89 | 0.91 | 0.55 | - | 0.52 | 0.718
 | Sup. (0-4) | 0.93 | 0.93 | 0.58 | - | 0.54 | 0.745
Deer | None | 0.56 | 0.80 | 0.52 | 0.54 | - | 0.605
 | Dog 5 | 0.72 | 0.85 | 0.63 | 0.80 | - | 0.750
 | Frog 6 | 0.66 | 0.86 | 0.60 | 0.75 | - | 0.718
 | Horse 7 | 0.71 | 0.58 | 0.58 | 0.71 | - | 0.645
 | Ship 8 | 0.93 | 0.94 | 0.63 | 0.72 | - | 0.805
 | Truck 9 | 0.84 | 0.97 | 0.62 | 0.72 | - | 0.773
 | Comb. (5-9) | 0.87 | 0.95 | 0.61 | 0.73 | - | 0.790
 | Sup. (0-4) | 0.93 | 0.97 | 0.62 | 0.72 | - | 0.810
Positive Class | Negative Class | Dog | Frog | Horse | Ship | Truck | avg
---|---|---|---|---|---|---|---
Dog | None | - | 0.69 | 0.66 | 0.57 | 0.77 | 0.673
 | Plane 0 | - | 0.53 | 0.66 | 0.92 | 0.89 | 0.750
 | Car 1 | - | 0.56 | 0.68 | 0.95 | 0.91 | 0.775
 | Bird 2 | - | 0.63 | 0.63 | 0.77 | 0.78 | 0.703
 | Cat 3 | - | 0.67 | 0.65 | 0.66 | 0.81 | 0.698
 | Deer 4 | - | 0.73 | 0.69 | 0.70 | 0.76 | 0.720
 | Comb. (0-4) | - | 0.58 | 0.67 | 0.95 | 0.94 | 0.785
 | Sup. (5-9) | - | 0.56 | 0.73 | 0.95 | 0.95 | 0.798
Frog | None | 0.40 | - | 0.53 | 0.49 | 0.67 | 0.523
 | Plane 0 | 0.73 | - | 0.81 | 0.96 | 0.93 | 0.858
 | Car 1 | 0.74 | - | 0.83 | 0.96 | 0.97 | 0.875
 | Bird 2 | 0.80 | - | 0.84 | 0.91 | 0.85 | 0.850
 | Cat 3 | 0.84 | - | 0.80 | 0.87 | 0.86 | 0.843
 | Deer 4 | 0.75 | - | 0.86 | 0.88 | 0.87 | 0.840
 | Comb. (0-4) | 0.75 | - | 0.84 | 0.97 | 0.95 | 0.877
 | Sup. (5-9) | 0.82 | - | 0.88 | 0.97 | 0.96 | 0.907
Horse | None | 0.41 | 0.58 | - | 0.46 | 0.66 | 0.528
 | Plane 0 | 0.55 | 0.50 | - | 0.93 | 0.83 | 0.703
 | Car 1 | 0.56 | 0.58 | - | 0.90 | 0.93 | 0.743
 | Bird 2 | 0.62 | 0.73 | - | 0.80 | 0.71 | 0.715
 | Cat 3 | 0.77 | 0.76 | - | 0.65 | 0.66 | 0.710
 | Deer 4 | 0.62 | 0.83 | - | 0.57 | 0.60 | 0.655
 | Comb. (0-4) | 0.51 | 0.57 | - | 0.89 | 0.88 | 0.713
 | Sup. (5-9) | 0.59 | 0.66 | - | 0.95 | 0.92 | 0.780
Ship | None | 0.62 | 0.74 | 0.73 | - | 0.77 | 0.715
 | Plane 0 | 0.75 | 0.75 | 0.82 | - | 0.74 | 0.765
 | Car 1 | 0.84 | 0.89 | 0.90 | - | 0.88 | 0.878
 | Bird 2 | 0.94 | 0.96 | 0.94 | - | 0.78 | 0.905
 | Cat 3 | 0.95 | 0.95 | 0.93 | - | 0.80 | 0.908
 | Deer 4 | 0.92 | 0.96 | 0.94 | - | 0.78 | 0.900
 | Comb. (0-4) | 0.95 | 0.97 | 0.96 | - | 0.83 | 0.928
 | Sup. (5-9) | 0.94 | 0.96 | 0.96 | - | 0.88 | 0.935
Truck | None | 0.35 | 0.53 | 0.46 | 0.30 | - | 0.410
 | Plane 0 | 0.61 | 0.52 | 0.58 | 0.80 | - | 0.628
 | Car 1 | 0.51 | 0.57 | 0.53 | 0.47 | - | 0.520
 | Bird 2 | 0.92 | 0.90 | 0.84 | 0.73 | - | 0.848
 | Cat 3 | 0.95 | 0.90 | 0.82 | 0.61 | - | 0.820
 | Deer 4 | 0.91 | 0.92 | 0.86 | 0.63 | - | 0.830
 | Comb. (0-4) | 0.93 | 0.91 | 0.83 | 0.76 | - | 0.858
 | Sup. (5-9) | 0.94 | 0.95 | 0.90 | 0.78 | - | 0.893
Appendix B Detailed Results for all 10 tasks
The following graphs show detailed results for some of the experiments in the various settings described in sections 5.1 and 5.2. Each curve represents a trade-off between accuracy on anomalies and accuracy on normal data for each dataset. The two left panes show an upper-bound supervised setting where the negative dataset is the same as the outliers. The top pane shows accuracies on tasks 0 to 4 and the bottom shows accuracies on tasks 5 to 9. Note that as the threshold value increases, the model favors accepting anomalies over misclassifying normal examples. In almost all cases, we observe that LIS-AE gives a significant margin compared to a normal AE.
Standard LIS-AE trained on CIFAR-10 classes. Left, outliers as negative dataset (Supervised). Right, SVHN as negative dataset.
Results of LinSep-LIS-AE variant on SVHN. Left, outliers as negative dataset (Supervised). Right, unsupervised.
MNIST classes as Positive datasets. Left, outliers as negative dataset (Supervised). Right, Omniglot as negative dataset.
Fashion-MNIST classes as Positive datasets. Left, outliers as negative dataset (Supervised). Right, Omniglot as negative dataset. |