1. Introduction
Anomaly detection is a classical machine learning field concerned with distinguishing in-distribution from out-of-distribution samples. Unlike traditional multi-label classification, where the goal is to find decision boundaries between classes present in a given dataset, the goal of anomaly detection is to find one-versus-all boundaries against classes that are not in the dataset, which is significantly more challenging. Autoencoders (Bengio et al., 2007) have been used extensively for anomaly detection (Zhou and Paffenroth, 2017; Zimek et al., 2012; Chalapathy et al., 2017) under the assumption that the reconstruction error incurred by anomalies is higher than that of normal samples (Hasan et al., 2016; Zong et al., 2018). However, it has been observed that this assumption might not hold, as standard autoencoders may generalize well even to anomalies (Gong et al., 2019; Zong et al., 2018). In practice, this issue becomes more relevant in two important settings: when the normal data is relatively complex and therefore requires a high latent dimension for good reconstruction, and when anomalies share compositional features with the normal data and come from a close domain (Chandola et al., 2009).
To mitigate these issues we present the Latent-Insensitive Autoencoder (LISAE), a new class of autoencoders whose training is carried out in two phases. In the first phase, the model simply reconstructs the input like a standard autoencoder. In the second phase, the entire model except the latent layer is “frozen”, and we train the latent layer so that it keeps reconstructing only the target task. We use the concept of a negative dataset from one-class classification (Weston et al., 2006), whereby an auxiliary dataset of non-examples from similar domains is used as a proxy for out-of-distribution samples. We change the training objective so that the autoencoder keeps its low reconstruction error on the target dataset while pushing the error on the negative dataset above a certain value.
To examine the behaviour of our model under distributional shift, we test it in Continual Learning (CL) settings. Continual learning, also known as incremental learning, is a machine learning paradigm that aims to develop models able to learn continually from a potentially infinite stream of classification tasks, incorporating new knowledge while retaining previously learned skills (Zenke et al., 2017). Moreover, a CL algorithm must satisfy additional constraints: most notably, access to all data is assumed not to be available during training, and the model is not allowed to replay all previous data. The main challenge of CL is that most current machine learning models are prone to catastrophic forgetting (Tan et al., 2018), the phenomenon in which a model experiences abrupt performance degradation on previously learned concepts when trained to acquire new skills (French, 1999). In other words, models tend to “fit” the most recent task and “forget” previously learned ones.
To formalize the problem, we are given a stream of tasks, where each task contains a dataset of pairs in which the first element represents a sample and the second represents its target. The goal of continual learning models is to learn each task sequentially under the constraint that access to previous data is unavailable or limited. Depending on the availability of a task descriptor at test time, the problem is further categorized into two distinct test protocols: task-incremental and class-incremental settings. In task-incremental settings, the model is presented with samples from unknown classes along with a task identifier $T$ that indicates which task each sample belongs to. More formally, the model is estimating $P(Y \mid X, T)$, the distribution of the target $Y$ given the sample $X$ and the task identifier $T$. In class-incremental settings, on the other hand, the model is estimating $P(Y \mid X)$, since it is only presented with the unknown $X$ without any additional information about which task the samples belong to. This makes class-incremental settings significantly more challenging than task-incremental settings, and explains why most published work in the field assumes that $T$ is presented during testing (Li and Hoiem, 2017; Lee et al., 2017; Rusu et al., 2016).
To elicit the intuition behind our approach, we first break the problem of class-incremental learning into a more granular task-agnostic formulation: instead of assuming that the model is presented with a stream of tasks containing multiple classes, we assume that it is presented with a stream of independent classes. One extreme solution that avoids catastrophic forgetting by construction is to learn a separate generative model for each class. However, learning generative models is typically not easy even in normal settings, and the number of models would grow with each new class, which makes this solution very inefficient. The other extreme is to use a single model, which is efficient in capacity, and train it on each class incrementally; however, such a model would suffer greatly from catastrophic forgetting, so various regularizers are added. The spectrum of approaches between these two extremes is best captured by the so-called “stability-plasticity” dilemma (Mermillod et al., 2013). Our decision to use autoencoders is dictated first by their unsupervised nature and second by evidence that autoencoders are comparatively resilient to catastrophic forgetting (Thai et al., 2021). For CL, since the bottleneck layer is the most sensitive part of our model for a given task, we create a separate latent layer for each class and repeat the aforementioned process.
The architecture, training process, and experiments are discussed in detail in the following sections.
2. Related Work
2.1. Anomaly Detection
Many reconstruction-based anomaly detection approaches have been proposed, starting with classical methods such as PCA (Pearson, 1901). Robust PCA mitigates the outlier sensitivity of PCA by decomposing the data matrix into the sum of a low-rank matrix and a sparse matrix, using the nuclear norm and the $\ell_1$ norm as convex relaxations in the objective (Candès et al., 2011). Autoencoders address the issue of PCA only considering linear relations in feature space by introducing nonlinearities, benefiting from multiple layers of representation (Bourlard and Kamp, 1988). We elaborate further on the shortcomings of PCA and autoencoders in the theoretical section and use them to motivate our approach.
Other methods try to improve on base autoencoders by endowing the latent code with particular properties. The VAE (An and Cho, 2015) does so by forcing the latent code to follow a prior distribution (usually normal), which also allows sampling from the decoder. However, in the context of anomaly detection, it introduces scaling issues, since minimizing the KL divergence for the high latent dimensions required by complex tasks is quite challenging. Another approach is Replicator Neural Networks (RepNN) (Hawkins et al., 2002), an autoencoder with a staircase activation function positioned on the output of the bottleneck (latent) layer. The staircase quantizes the latent code into a number of discrete values, which also aids in forming clusters (Williams et al., 2002). Unfortunately, a discrete staircase function is non-differentiable, which prevents learning via backpropagation. Instead, a differentiable approximation involving a sum of hyperbolic tangent (tanh) functions was introduced in its place. However, as discussed in (Tóth and Gosztolya, 2004), despite the theoretical appeal of a quantized latent code obtained via smooth approximation, in practice such an activation function makes it significantly harder for the gradient signal to flow. We also note that increasing the number of levels in the tanh-sum approximation adds significant overhead during training and testing, since the activation functions have to be computed for each batch; moreover, it suffers from scaling issues similar to those of VAE.
Another approach is to memorize the normality of a given dataset using a Memory-augmented Autoencoder (Gong et al., 2019). This approach limits the effective space of possible latent codes by constructing a memory module that takes the output of the encoder as an address and passes to the decoder the most relevant memory items from a stored reservoir of prototypical patterns learned during training.
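For illustration, the tanh-sum staircase approximation attributed to RepNN above can be sketched as follows. The sketch uses one commonly cited form of this activation, in which N levels are obtained from N - 1 shifted tanh terms; the function name `tanh_staircase`, the steepness value `a = 100`, and the assumption that inputs lie in [0, 1] are choices made for this sketch and are not taken from the text.

```python
import math

def tanh_staircase(theta, levels=4, a=100.0):
    """Smooth staircase with `levels` plateaus for inputs in [0, 1].

    Each shifted tanh contributes one step located at j / levels;
    larger `a` makes the steps sharper (and the gradients flatter).
    """
    steps = sum(math.tanh(a * (theta - j / levels)) for j in range(1, levels))
    return 0.5 + steps / (2.0 * (levels - 1))

# Inputs that fall on the same plateau map to nearly the same quantized value.
quantized = [tanh_staircase(t) for t in (0.05, 0.30, 0.60, 0.95)]
```

Away from the step locations the derivative of each tanh term is close to zero, which is one way to see why the gradient signal is hard to propagate through such an activation.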
Non-reconstruction-based approaches include one-class classification, which is tightly connected to anomaly detection in the sense that both problems are concerned with finding one-versus-all boundaries. One-Class SVM (OCSVM) is a variation of the classical SVM algorithm (Cortes and Vapnik, 1995) where the objective is to find a hyperplane that best separates samples from outliers (Schölkopf et al., 2001). Support Vector Data Description (SVDD) (Tax and Duin, 2004) tries to find a circumscribing hypersphere that contains all samples while having an optimal margin for outliers. It is worth noting that for kernels where $k(x, x) = 1$, such as RBF and Laplacian, OCSVM and OCSVDD learn identical decision functions (Lampert, 2009). To address the lack of representation learning and the poor computational scalability of OCSVM and OCSVDD, Deep SVDD (OCDSVDD) employs a deep neural network that learns useful representations while mapping outputs to a hypersphere of minimum volume (Ruff et al., 2018). However, because it relies solely on optimizing for minimum volume, this approach is prone to hypersphere collapse, which leads to uninformative features (Perera and Patel, 2019).
Other approaches use an auxiliary dataset of non-examples (a negative dataset) drawn from similar domains as a proxy for the otherwise intractable complement of the target class. In (Weston et al., 2006), a collection called the “Universum” allows learning useful representations for the domain of the problem by maximizing the number of contradictions on an equivalence class. Similar to OCDSVDD, (Perera and Patel, 2019) leverages a labeled dataset from a close domain to fine-tune two pretrained CNNs in order to learn good new features. The goodness of these features is quantified by compactness (intra-class variance) for the target class and descriptiveness (cross-entropy) for the labeled dataset. Despite avoiding hypersphere collapse and outperforming OCSVDD, this approach requires two pretrained neural networks and a large labeled dataset in addition to the target dataset. Another approach that also makes use of a large auxiliary dataset is Outlier Exposure (OE) (Hendrycks et al., 2018), a supervised approach that trains a standard neural network classifier while exposing it to a diverse set of non-examples, on which the output of the classifier is optimized to follow a uniform distribution using an additional cross-entropy loss.
2.2. Continual Learning
Progressive Neural Networks (ProgNets) (Rusu et al., 2016) prevent catastrophic forgetting by dynamically adding separate subnetworks to accommodate new tasks while maintaining lateral connections to allow forward transfer of knowledge from previous tasks. The two major problems with this approach are that the number of model parameters grows quadratically with the number of tasks and that the model assumes knowledge of task boundaries to determine which output to select. By definition, the latter renders ProgNets unsuitable for class-incremental learning.
Another approach that adds separate networks to accommodate new tasks is ExpertGate (Aljundi et al., 2017). One major advantage of ExpertGate over ProgNets is that it does not require knowledge of task boundaries: it uses an undercomplete autoencoder-based gating mechanism that selects which subnetwork, called an “Expert”, is active, using the reconstruction loss of the auxiliary autoencoders as a metric to determine which task the sample is drawn from. However, like ProgNets, this approach also suffers from complexity issues, since each new task adds a new network together with a separate autoencoder to serve as its “gate”. ExpertGate also relies on the assumption that an autoencoder for a particular task will not generalize well to another similar task, so that a simple gate suffices to distinguish between complex tasks. However, this assumption may not hold, as the autoencoder might generalize to new tasks, typically when the task is from a similar domain or the autoencoder gate is trained on a task with relatively large intra-class or inter-class variance (Gong et al., 2019).
Pseudo-Rehearsal via Generative Replay (GR) (Shin et al., 2017) is a bio-inspired approach that mimics the way the hippocampus in the human brain interacts with the neocortex when learning new concepts (Lavenex and Amaral, 2000). GR architectures have two main components: a deep generative module (usually a GAN (Goodfellow et al., 2014) or a VAE (Kingma and Welling, 2013)) and a “solver” module (usually a CNN (LeCun and Cortes, 2010)) responsible for solving new tasks. The generator is trained to output instances that follow the same data distribution, so that when a new task arrives, the solver trains on the synthetic data produced by the generator along with the new data in order to alleviate catastrophic forgetting. Despite their flexibility and biological appeal, Generative Replay-based methods have three major disadvantages: training a generative model on a stream of changing synthetic and real data is challenging (Liu et al., 2020); GR models tend to fall short when dealing with complex datasets (Aljundi et al., 2019); and the time required to train the model on a new task increases linearly, since the model has to generate and rehearse previous tasks. Many variants have been proposed to address these issues. Conditional GANs operate in constant time but have lower accuracy (Lesort et al., 2019), while others depend on non-incremental pretrained networks (van de Ven et al., 2020; Liu et al., 2020).
Regularization-based techniques such as elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), synaptic intelligence (SI) (Zenke et al., 2017), and incremental moment matching (IMM) (Lee et al., 2017) modify model parameters in a way that preserves weights important for previous tasks while finding less sensitive weights to accommodate new tasks. Despite the popularity of these methods, they tend not to perform well in class-incremental settings, since they were originally introduced for task-incremental learning (Van de Ven and Tolias, 2018).
3. Proposed Method
3.1. Architecture
An undercomplete deep autoencoder is a type of unsupervised feed-forward neural network that learns a lower-dimensional feature representation of a particular dataset by reconstructing the input at the output layer. To prevent autoencoders from converging to trivial solutions such as the identity mapping, a bottleneck layer with output $z$ is used such that its dimension is less than the dimension of the input $x$. The forward pass is computed as follows:
(1) $e = f_{enc}(x)$
(2) $z = h(e)$
(3) $\hat{x} = f_{dec}(z)$
where $x$ is the input, $h$ is the bottleneck layer, and $f_{enc}$ and $f_{dec}$ are convolutional neural networks representing the encoder and the decoder modules respectively. Typically, such models are trained to minimize the norm of the difference between the input $x$ and the reconstructed output $\hat{x}$. As previously discussed, the choice of the activation function of $h$ plays an important role in anomaly detection. Activation functions that quantize the latent code or encourage forming clusters are preferable. In our experiments, we find that confining the latent code to values between $-1$ and $1$ with a tanh activation function has a regularizing effect as we maximize the loss over the negative dataset during the latent-shaping phase. We also note that unbounded activation functions such as ReLU tend to perform poorly.
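The encoder, bottleneck, and decoder stages described above can be sketched in a few lines. This is a minimal illustration only: small dense layers stand in for the convolutional encoder and decoder, the weights are random rather than trained, and all names (`linear`, `init_layer`, `forward`, the layer sizes) are hypothetical.

```python
import math
import random

def linear(weights, bias, vec):
    """Affine map: `weights` is a list of rows, `bias` and `vec` are lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero bias (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, params):
    """Encoder -> tanh bottleneck -> decoder; returns (latent code, output)."""
    e = linear(*params["encoder"], x)                         # encoder features
    z = [math.tanh(v) for v in linear(*params["latent"], e)]  # code in (-1, 1)
    x_hat = linear(*params["decoder"], z)                     # reconstruction
    return z, x_hat

def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

rng = random.Random(0)
input_dim, hidden_dim, latent_dim = 8, 6, 3   # latent_dim < input_dim
params = {
    "encoder": init_layer(hidden_dim, input_dim, rng),
    "latent": init_layer(latent_dim, hidden_dim, rng),
    "decoder": init_layer(input_dim, latent_dim, rng),
}
x = [rng.random() for _ in range(input_dim)]
z, x_hat = forward(x, params)
error = reconstruction_error(x, x_hat)
```

Keeping the bottleneck as its own entry in `params` mirrors the idea that it is the only part of the network updated later, while the encoder and decoder entries stay fixed.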
3.2. Terminology
Positive Dataset: This is the dataset that contains the normal class(es), for example, the plane class from CIFAR10.
Negative Dataset: This is a secondary unlabeled dataset containing negative examples from a domain similar to that of the positive dataset. The choice of negative dataset depends on the positive dataset. For example, if the positive dataset is the digit 0 from MNIST, the negative dataset might be random strokes or another dataset with similar features such as Omniglot (Tang et al., 2017). It is important to note that the model should not be tested on the negative dataset, since this violates the assumption of not knowing anomalies.
Feature Extraction Phase: This is the first phase of training. The model is simply trained to reconstruct its input. In the case of continual learning, this may also include reconstructing negative examples in order to extract useful features.
Latent-Shaping Phase: This is the second phase of training. The encoder and decoder networks are frozen and only the latent layer is active.
3.3. Training for Anomaly Detection
Given a positive dataset and a negative dataset from a similar domain, we divide the training process into two phases. The first phase reconstructs samples from the positive dataset by minimizing the reconstruction loss between the input drawn from the positive dataset and the output of the autoencoder until convergence. In the second phase, we freeze the model except for the latent layer and minimize the following loss function:
(4)
where the negative term is computed on a batch sampled from the negative dataset and its reconstruction, one hyperparameter controls the relative effect of the two parts of the loss function, and another hyperparameter indicates that we are satisfied once the reconstruction error on the negative dataset exceeds a certain value.
3.4. Predicting Anomalies
We use the reconstruction error between the test sample and its reconstructed output to distinguish between anomalies and normal data. More specifically, we set a threshold such that a sample is considered anomalous if its reconstruction error exceeds the threshold.
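The decision rule can be sketched as a small function. The squared Euclidean error and the names `reconstruction_error` and `is_anomaly` are assumptions made for illustration; any reconstruction norm used during training could be substituted.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def is_anomaly(x, x_hat, threshold):
    """Flag the sample as anomalous when its error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold

# Toy usage: a faithful reconstruction versus a poor one.
sample = [0.2, 0.8, 0.5]
good_reconstruction = [0.21, 0.79, 0.5]
poor_reconstruction = [0.9, 0.1, 0.0]
flags = (
    is_anomaly(sample, good_reconstruction, threshold=0.1),
    is_anomaly(sample, poor_reconstruction, threshold=0.1),
)
```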
3.5. Training for Continual Learning
As can be seen in Algorithm 2, we are given a stream of sequential tasks with optional auxiliary negative datasets, where each task contains a dataset of samples and their targets and may contain several classes. For a particular task, the first phase of training for each class in the task is similar to the first phase described in Algorithm 1. For each latent layer, the task's dataset is split into a positive dataset containing the samples of the class assigned to that latent layer and a negative dataset containing the samples of the remaining classes in the task. For example, if the task is to classify digits 8 and 9 from the MNIST dataset, we consider samples of the digit 9 as the negative dataset for the latent layer corresponding to digit 8, and samples of the digit 8 as the negative dataset for digit 9. Of course, this does not preclude the possibility of having an additional auxiliary negative dataset (further details on the effect of negative examples are given in section 5.2). We notice that by selecting a particular latent layer we end up with a scenario similar to the second phase described in section 3.3, and we resume training by minimizing the same loss function:
(5) 
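To make the latent-shaping objective concrete, the following sketch shows one plausible form consistent with the description in sections 3.3 and 3.5: a reconstruction term on positive samples plus a weighted hinge term that stops contributing once the error on negative samples exceeds a margin. The hinge form, the mean squared error, and the names `beta` and `gamma` are assumptions of this sketch, not a statement of the exact equations.

```python
def mean_squared_error(batch, reconstructions):
    """Average squared error over a batch of equally sized vectors."""
    total = 0.0
    for x, x_hat in zip(batch, reconstructions):
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return total / len(batch)

def latent_shaping_loss(pos, pos_hat, neg, neg_hat, beta=1.0, gamma=1.0):
    """Keep positive error low while pushing negative error up to `gamma`.

    beta  : weight balancing the two parts of the loss (assumed name)
    gamma : margin after which negative samples stop contributing (assumed name)
    """
    positive_term = mean_squared_error(pos, pos_hat)
    negative_term = max(0.0, gamma - mean_squared_error(neg, neg_hat))
    return positive_term + beta * negative_term

pos, pos_hat = [[1.0, 0.0]], [[1.0, 0.0]]          # perfectly reconstructed
neg = [[0.0, 1.0]]
loss_when_neg_reconstructed = latent_shaping_loss(pos, pos_hat, neg, [[0.0, 1.0]])
loss_when_neg_rejected = latent_shaping_loss(pos, pos_hat, neg, [[2.0, -1.0]])
```

In a full implementation only the latent layer's parameters would receive gradients from this loss, with the encoder and decoder held fixed.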
3.6. Prediction for Continual Learning
Given an unknown sample to be classified, we iterate over each latent layer as shown in fig. 3 and compute the reconstruction error. The class corresponding to the latent layer with the minimum error is taken as the prediction.
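The prediction rule amounts to an argmin over per-class reconstruction errors. In the sketch below, each class is represented by a hypothetical callable that reconstructs a sample through the shared encoder and decoder with that class's latent layer; the toy reconstructors and the name `predict_class` are illustrative stand-ins for trained models.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def predict_class(x, reconstructors):
    """Return the label whose latent layer reconstructs `x` best, plus all errors.

    reconstructors: dict mapping class label -> callable(x) -> x_hat
    """
    errors = {label: reconstruction_error(x, reconstruct(x))
              for label, reconstruct in reconstructors.items()}
    best_label = min(errors, key=errors.get)
    return best_label, errors

# Toy stand-ins: each "latent layer" only preserves one coordinate.
reconstructors = {
    "class_a": lambda x: [x[0], 0.0],
    "class_b": lambda x: [0.0, x[1]],
}
label, errors = predict_class([0.9, 0.1], reconstructors)
```

Because every latent layer has to be evaluated, prediction cost grows linearly with the number of classes seen so far.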
4. Theoretical Justification
4.1. Formulation
From the optimality of autoencoders (Bourlard and Kamp, 1988), we know that, absent a nonlinear activation function, a linear autoencoder corresponds to singular value decomposition (SVD); henceforth, we use SVD interchangeably with linear autoencoders. Given an data matrix , we decompose into , where and is its orthogonal complement . We further decompose using SVD:
where and are orthonormal matrices and is a diagonal matrix such that . However, in practice it is rarely separated this neatly, especially when dealing with a large number of samples of a high-dimensional dataset; therefore, we resort to reduced SVD, where we take the first columns of with the caveat that the choice of is a hyperparameter.
The matrix can be divided thus: , and from Eckart–Young lowrank approximation theorem, columns of and columns of .
A linear autoencoder with dimensional latent layer is equivalent to the following transform:
Where and represent the decoder and the encoder respectively. Furthermore, any data point can be represented as , where and are () and dimensional real vectors. By orthonormality, we have the following identities: and , where is an
identity matrix. As a shorthand, we write
instead of . Using these two identities, we rewrite the reconstruction loss as follows:
We note that the loss function is agnostic to the nature of and is only concerned with its magnitude. The assumption for anomaly detection under this setting is that , where and correspond to orthogonal components for anomalies and positive data respectively. We posit that while this agnosticism is desirable for potential generality, it is not optimal for anomaly detection; hence, we modify the loss score to depend on the nature of :
where is an matrix such that the loss is small for normal data but large for anomalies. In other words, we want and to be large.
We define orthonormal basis for and orthonormal basis for , where is the matrix of all positive orthogonal components .
We decompose further into and where columns of are the basis of that are not in and columns of are the remaining columns of .
Since , any can be written as , where , and are real vectors.
Despite the fact that we do not have access to , we can utilize other negative examples from a similar domain and use as a proxy for . Since , maximizing implies maximizing assuming that . The latter assumption hinges on the fact that is from a similar domain. Therefore, we end up with the goal of finding such that and is large.
In practice, we cannot maximize indefinitely and we are satisfied if it reaches a certain large :






We notice that in order for this to work, the decoder has to be known and remain fixed (frozen). This suggests a two-phase training scheme where we first compute the decoder and encoder networks, and in the second phase the decoder is fixed while the encoder is modified using the new loss. In fig. 4, a linear version of LISAE is trained on digit 8 from MNIST with Omniglot as a negative dataset. We perform orthogonal decomposition on each input by projecting it onto the digit-8 subspace to obtain its projection and orthogonal vectors. We then feed each vector separately to a regular linear AE and a linear LISAE. We observe that the regular autoencoder outputs zero images for the orthogonal part of each sample regardless of the class it belongs to. LISAE, however, behaves differently for the normal class than for anomalous classes.
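The observation that a regular linear autoencoder maps the orthogonal part of any sample to zero can be checked on a toy example. The sketch assumes the linear AE is the orthogonal projector onto a subspace with an orthonormal basis; the three-dimensional basis and the sample below are hypothetical stand-ins for the digit-8 subspace and an input image.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_autoencoder(x, basis):
    """Encode with the basis vectors, then decode with the same vectors.

    For an orthonormal basis this is the orthogonal projection onto its span.
    """
    code = [dot(x, u) for u in basis]                      # encoder
    return [sum(c * u[i] for c, u in zip(code, basis))     # decoder
            for i in range(len(x))]

def orthogonal_split(x, basis):
    """Split x into its projection on the subspace and the orthogonal remainder."""
    parallel = linear_autoencoder(x, basis)
    orthogonal = [a - b for a, b in zip(x, parallel)]
    return parallel, orthogonal

s = 1.0 / math.sqrt(2.0)
basis = [(s, s, 0.0), (0.0, 0.0, 1.0)]   # orthonormal basis of the "normal" subspace
x = (2.0, 0.0, 3.0)

parallel, orthogonal = orthogonal_split(x, basis)
output_for_parallel = linear_autoencoder(parallel, basis)      # reproduces `parallel`
output_for_orthogonal = linear_autoencoder(orthogonal, basis)  # all zeros
```

The zero output holds for every input, normal or anomalous, which is why the regular linear AE can only score the magnitude of the orthogonal part and not its direction.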
We also notice that orthogonal projections do not form a semantically meaningful representation in pixel space. To obtain a better representation we use a deep AE. For this nonlinear case, we treat the middle part of the network as an inner linear autoencoder operating on a more semantically meaningful transformed version of the data. This suggests a stacked autoencoder architecture where another loss term for the inner autoencoder is added in the first phase to make sure that the output of the layer after the latent layer is similar to the latent input. In the second phase we freeze the entire network except for the encoder of the inner autoencoder (the latent layer of the entire model) and minimize the reconstruction error of positive examples while maximizing the loss for negative examples. However, in our experiments we observed that adding these loss terms was not necessary, and a loss similar to the linear case produced similar results, since we only consider reconstruction scores of the outer model. Therefore, we keep the entire network frozen except for the latent layer while directly minimizing the same loss as before (eq. 4).
4.2. Intuition
For concreteness, we consider the following simple, supervised case. Given a dataset such that for each :
We notice that most of the variance in the data is along the x-axis. Training a linear autoencoder with latent dimension results in and where and are the decoder and encoder networks respectively.
Given input ,
The loss score .
Training a LISAE on negative samples that have nonzero values only along the z-axis, we end up with the same and a new (modified) , where is a large number.
has the form ,
has the form where .
The new loss scores for and are:
In the case of regular LinearAE (PCA), given , for each point the cylinder: the following holds: , making the two samples indistinguishable.
In the case of LISAE, holds only for the elliptic cylinder , and since is a large number, the crosssection of the cylinder is squashed in the z dimension resulting in heavily penalized loss in the z dimension but a regular loss in the y dimension. In this case, the two samples become indistinguishable only for very small values of .
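The contrast between the circular and the elliptic cylinder can be checked numerically. The sketch below assumes a one-dimensional latent code, a decoder along the x-axis, a regular encoder equal to the decoder, and a modified encoder that additionally picks up the z-coordinate scaled by a large constant `c`. These specific vectors are one choice consistent with the description above, not necessarily the exact matrices of the example.

```python
def linear_ae_error(point, encoder, decoder):
    """Squared reconstruction error of a linear AE with a 1-D latent code."""
    code = sum(e * p for e, p in zip(encoder, point))   # scalar latent code
    reconstruction = [d * code for d in decoder]
    return sum((p - r) ** 2 for p, r in zip(point, reconstruction))

decoder = (1.0, 0.0, 0.0)          # shared by both models (kept frozen)
pca_encoder = (1.0, 0.0, 0.0)      # regular linear AE
c = 10.0                           # "large number" (assumed value)
lisae_encoder = (1.0, 0.0, c)      # assumed modified encoder

deviates_in_y = (0.5, 1.0, 0.0)    # off-axis along y
deviates_in_z = (0.5, 0.0, 1.0)    # off-axis along z, like the negative samples
on_axis = (0.5, 0.0, 0.0)          # lies in the normal subspace

pca_errors = [linear_ae_error(p, pca_encoder, decoder)
              for p in (deviates_in_y, deviates_in_z)]
lisae_errors = [linear_ae_error(p, lisae_encoder, decoder)
                for p in (deviates_in_y, deviates_in_z)]
on_axis_error = linear_ae_error(on_axis, lisae_encoder, decoder)
```

Under these assumptions the regular AE error is y^2 + z^2, so the two off-axis points score the same, while the modified encoder yields y^2 + (1 + c^2) z^2, penalizing the z direction heavily and leaving points with z = 0 unchanged.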
We note that the new is merely a rotated and stretched version of the old in the plane. Thus, we can think of Linear LISAE as a regular PCA with its eigenvectors (columns of ) stretched and tilted in the directions of the orthogonal complement of the eigenspace. This is done in such a way that keeps the column space of normal examples invariant under the new transformation . By itself, this formulation remains ill-posed, since there is an infinite number of solutions that do not necessarily help with anomaly detection. More formally, given , we can choose any matrix such that since . However, this does not guarantee any advantage for anomaly detection on similar data; even worse, in practice this modification process might result in slightly worse performance if done arbitrarily, since the model usually has to sacrifice some extreme samples from the normal data to balance the two losses. Thus, the negative dataset is used to properly determine the directions of the tilt, and the hyperparameters ( and ) determine the importance and amount of stretching (or shrinking) while changing the normal case as little as possible. For deep LISAE, the same analogy holds, albeit in a latent space.
Deep architectures are not only useful for learning good representations, but can also learn a nonlinear transformation with useful properties for our objective, such as linear separability of negative and positive samples. By adding a standard binary cross-entropy loss before the nonlinear activation of the latent layer during the first phase, we ensure that the input of the latent layer is linearly separable for positive and negative examples during the second phase. This linearly separable variant (LinSepLISAE) almost always performs better than directly using LISAE. We investigate the effect of this property on the second phase in section 5.2.
5. Experiments
Table 1. AUC for anomaly detection on MNIST, FashionMNIST, and 2-Class MNIST.
Model            MNIST    FashionMNIST   2-Class MNIST
KDE              0.9568   0.9183         0.9206
IF               0.8624   0.9144         0.73018
OCSVM            0.9108   0.8608         0.8741
OCDSVDD          0.9489   0.8577         0.8972
AnoGAN           0.9579   0.9098         0.8406
AnoGANFM         0.9544   0.9072         0.8353
LinearAE         0.9412   0.8845         0.8915
VAE              0.9642   0.9092         0.9263
MemAE            0.9714   0.9131         0.9352
SigRepNN (N=4)   0.9661   0.9124         0.9261
AE               0.9601   0.9076         0.9221
LISAE            0.9768   0.9256         0.9457
We report results on the following datasets: MNIST (LeCun and Cortes, 2010), FashionMNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR10 (Krizhevsky et al., 2009). For autoencoder-based methods, results of our approach are compared to baseline models with the same capacity.
5.1. Anomaly Detection
In this section, we test LISAE for anomaly detection on image data in unsupervised settings. Given a standard classification dataset, we group a set of classes together into a new dataset and consider it the “normal” dataset. The remaining classes that are in neither the normal nor the negative dataset are considered anomalies. During training, our model is presented only with the normal dataset and the additional negative dataset. We evaluate performance on test data comprising both the “normal” and “anomalous” groups.
For MNIST and FashionMNIST, the encoder network consists of two convolutional layers with LeakyReLU nonlinearities followed by a fully-connected bottleneck layer with a tanh activation function. The decoder network consists of a fully-connected layer followed by a LeakyReLU, two deconvolution layers with LeakyReLU activation functions, and a final convolution layer with a sigmoid at the output. For SVHN and CIFAR10 we use larger latent layers and higher-capacity networks of the same depth. It is worth noting that, compared to other hyperparameters, the choice of latent layer size has the largest effect on performance for all models. We report the best-performing latent dimension for all models.
In table (1), we compare LISAE with several autoencoder-based anomaly detection models as baselines, all of which share the exact same architecture. The most direct comparison is between LISAE and AE, since they share not only the same architecture but also the exact same encoder and decoder weights; their performance is simply measured before and after the latent-shaping phase. We use a variant of RepNN with a sigmoid activation function placed before the tanh staircase function approximation described in section 2.1. We do so because “squashing” the input between 0 and 1 before passing it to the staircase function gives a more robust and easier-to-train network. We only report the best results, obtained for SigRepNN with 4 activation levels. For anomaly GAN (AnoGAN) (Schlegl et al., 2017), we follow the implementation described in (Schlegl et al., 2019). We train a WGAN with gradient penalty (Gulrajani et al., 2017) and report performance for two anomaly scores: the encoder-generator reconstruction loss and an additional feature-matching distance score in the discriminator feature space (AnoGANFM).
Table 2. AUC for anomaly detection on SVHN and CIFAR10.
Model               SVHN     CIFAR10
KDE                 0.5648   0.5752
IF                  0.5112   0.6097
OCSVM               0.5047   0.5651
OCDSVDD             0.5681   0.6411
AnoGAN              0.5598   0.5843
AnoGANFM            0.5645   0.5880
LinearAE            0.5702   0.5753
VAE                 0.5692   0.5734
MemAE               0.5720   0.5931
SigRepNN (N=4)      0.5684   0.5719
AE                  0.5698   0.5703
LISAE               0.6886   0.8145
LinSepLISAE         0.7701   0.8858
Sup. LISAE          0.7573   0.8384
Sup. LinSepLISAE    0.8479   0.9170
For AnoGAN, LinearAE, AE, VAE, RepNN, MemAE, and LISAE, we use the reconstruction error as the anomaly score, such that an input is considered an anomaly if its error exceeds a threshold. By varying the threshold, we compute the area under the curve (AUC) as a measure of performance. Similarly, for OCSVDD (equivalently OCSVM with an RBF kernel) and OCDSVDD, we vary the inverse length scale and use the predicted class label. For kernel density estimation (KDE) (Parzen, 1962), we vary the threshold over the log-likelihood scores. For Isolation Forest (IF) (Liu et al., 2008), we vary the threshold over the anomaly score calculated by the Isolation Forest algorithm.
The datasets tested in table (1) are MNIST and FashionMNIST. To train LISAE on MNIST we use Omniglot (Lake et al., 2015) as our negative dataset, since it shares similar compositional characteristics with MNIST. Since Omniglot is a relatively small dataset, we diversify the negative examples with various augmentation techniques, namely Gaussian blurring, random cropping, and horizontal and vertical flipping. We test two settings for MNIST. The first is a 1-class setting, where the normal dataset is one particular class and the rest of the classes are considered anomalies; the process is repeated for all classes and the average AUC over the 10 classes is reported. The second is 2-class MNIST, where the normal dataset consists of two classes and the remaining classes are considered anomalies. For example, the first task contains digits 0 and 1 with the remaining digits considered anomalies, the second task contains digits 2 and 3, and so forth. This setting is more challenging, since more than one class is present in the normal dataset, and it is also very informative for continual learning methods that use autoencoders as gates to first identify task boundaries.
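The AUC obtained by sweeping the threshold over reconstruction errors, as described earlier in this section, can be computed without an explicit sweep: the area under the ROC curve equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below uses this pairwise form; the function name `roc_auc` and the toy scores are illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for anomaly scores.

    scores: anomaly score per sample (e.g. reconstruction error)
    labels: 1 for anomalous samples, 0 for normal samples
    """
    anomalous = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not anomalous or not normal:
        raise ValueError("both normal and anomalous samples are required")
    wins = 0.0
    for a in anomalous:
        for n in normal:
            if a > n:
                wins += 1.0      # anomaly correctly ranked above normal
            elif a == n:
                wins += 0.5      # ties count as half
    return wins / (len(anomalous) * len(normal))

labels = [0, 0, 1, 1]
perfect = roc_auc([0.1, 0.2, 0.8, 0.9], labels)   # anomalies always score higher
partial = roc_auc([0.1, 0.8, 0.2, 0.9], labels)   # one normal outranks one anomaly
```

For the averaged results, this quantity would be computed once per normal class (or per task) and the mean taken over all of them.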
For FashionMNIST, the choice of negative examples is different. We use the next class as the negative dataset and we do not include it with the anomalies (i.e. the remaining classes) at test time. This is also informative for continual learning, where we have a stream of sequential tasks. In the ablation section we test multiple negative datasets for FashionMNIST.
We note that LISAE achieves superior performance to all compared approaches. However, we also notice that these settings are comparatively easy and that all tested models, including classical non-deep approaches, performed adequately.
In table (2), we show performance on SVHN and CIFAR10, which are more complex datasets than MNIST and FashionMNIST. To train LISAE, we split each dataset into two halves, and each half is used as the source of negative examples for the other. Note that we only test on the remaining classes, which are in neither the normal nor the negative dataset. For example, the first split of CIFAR10 has five classes, namely airplane, automobile, bird, cat and deer, while the second one has dog, frog, horse, ship and truck.
Training on airplane as the first normal task, LISAE maximizes the loss for samples drawn from the negative dataset (dog, frog, ship and so forth). We then test its performance on airplane as the normal class and only on automobile, bird, cat and deer as anomalies. Note that we do not test on dog, frog and the other classes in the negative dataset. This process is repeated for all 10 classes and the average AUC is reported. As mentioned in section 4.2, we introduce LinSepLISAE as an improvement over standard LISAE. The difference between the two models is only in the first phase, where a binary cross-entropy loss is added to ensure that positive and negative examples are linearly separable during the second phase. The last two entries of the table are supervised upper bounds for each variant, where the negative dataset is the same as the outliers. In figure (6), we see that a standard AE is prone to generalizing well to other classes, which is not a desired property for anomaly detection. In contrast, LISAE only reconstructs normal data faithfully, which translates to the large performance gap we see in figure (5).
We also notice that despite CIFAR10 being more complex than SVHN, most reconstruction-based models perform better on CIFAR10 than on SVHN. This is because SVHN classes differ less in terms of reconstruction: they share similar compositional features, and digits from other classes do appear in a sample, whereas CIFAR10 classes vary significantly (e.g. digit 2 and digit 3 on a wall vs. truck and bird).
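The evaluation protocol above can be sketched as follows. This is an illustration only: the class ordering follows the two CIFAR10 splits named in the text, while the function name `split_protocol` is hypothetical.

```python
# Class order follows the two five-class splits described in the text.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def split_protocol(normal_class, num_classes=10):
    """Return (negative_classes, outlier_classes) for one normal class.

    The negative dataset is the half that does not contain the normal
    class; the test outliers are the other classes of the normal class's
    own half, so negatives and test outliers never overlap.
    """
    half = num_classes // 2
    first = list(range(half))
    second = list(range(half, num_classes))
    own, other = (first, second) if normal_class in first else (second, first)
    outliers = [c for c in own if c != normal_class]
    return other, outliers

# Airplane (class 0) as the normal class.
negatives, outliers = split_protocol(0)
negative_names = [CIFAR10_CLASSES[c] for c in negatives]
outlier_names = [CIFAR10_CLASSES[c] for c in outliers]
```

For airplane this yields dog, frog, horse, ship and truck as negatives and automobile, bird, cat and deer as test anomalies, matching the example in the text.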
5.2. Ablation
Negative data   Positive data
                MNIST    Fashion  SVHN     CIFAR10
None            0.9485   0.8740   0.5698   0.5703
Omni            0.9605   0.9013   -        -
MNIST           0.9778   0.8942   -        -
Fashion         0.9482   0.9106   -        -
SVHN            -        -        0.6886   0.7065
CIFAR10         -        -        0.5481   0.8145
Same (Sup.)     0.9901   0.9623   0.7573   0.8384
In this section, we investigate the effect of the nature of the negative dataset and of the linear separability of positive and negative examples. In table (3) we train LISAE on different negative and positive datasets. Similar to table (2), we split each positive dataset into two datasets and follow the same settings as before, with the exception of the "None" and "Supervised" cases. The "None" case indicates that no negative examples have been used, whereas the "Supervised" case indicates that the outliers and the negative dataset share the same classes. Note that this case is different from the case where the positive and negative datasets come from the same dataset. Unless stated otherwise, we only test on classes (outliers) that are in neither the positive nor the negative dataset. For example, when MNIST is used as the source for both positive and negative datasets and the positive class is 0, the negative dataset consists of classes 5 to 9 and the outliers are classes 1 to 4. This process is repeated for all 10 classes present in each dataset and the average AUC is reported. Overall, using a negative dataset resulted in a significant increase in performance in every case except two, namely when FashionMNIST and CIFAR10 were used as negative datasets for MNIST and SVHN respectively. This could be explained by the fact that the model was not capable of reconstructing FashionMNIST and CIFAR10 classes in the first place. Moreover, shaping the latent layer in a way that maximizes the loss for FashionMNIST and CIFAR10 classes does not guarantee any advantage for detecting anomalies among the similar digit classes present in MNIST and SVHN. In addition, this process in practice forces the model to ignore some samples from the normal dataset in order to balance the two losses. Together, these effects result in the performance degradation we observe in these two cases.
Table (4) is an excerpt of the complete table in the appendices, where we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR10 dataset. We split CIFAR10 into two separate datasets: the first split is used for selecting classes as negative datasets and the other split is used as outliers. For each class in CIFAR10 we train eight models in different settings. The first setting is "None", where we train a standard autoencoder with no negative examples as the base model. The remaining seven settings differ in the second phase. In the single-class settings, we select one class as our negative dataset and test the model performance on each individual class from the outlier dataset. The combined setting is similar to the setting described in section 5.1, where we combine all negative classes in one 5-class negative dataset. Note that these classes are not the same as the classes in the outlier test dataset, except for the final setting, which is an upper-bound supervised setting where the negative dataset is comprised of the classes that are in the outlier dataset, excluding the positive class. This process is then repeated for all 10 classes in CIFAR10. Overall, we observe a significant performance increase over the base model, with the general trend that negative classes significantly increase anomaly detection performance for similar outliers. For example, the dog class drastically improves performance on the cat class but not so much on the plane class. However, we also notice important exceptions. When the horse class is used as the negative dataset for the car class, we notice a significant performance increase for the relatively similar deer class, as expected. When the horse class is used as the negative dataset for the deer class itself, however, the performance does not improve as in the first case and even degrades for the car class.
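To make the comparison against the "None" baseline explicit, the short sketch below recomputes the AUC change for every positive/negative pair listed in table (3). The numbers are copied from that table; the variable and function names are hypothetical.

```python
# AUC values copied from table (3); keys are dataset names as used there.
NONE_BASELINE = {"MNIST": 0.9485, "Fashion": 0.8740,
                 "SVHN": 0.5698, "CIFAR10": 0.5703}
WITH_NEGATIVES = {
    "Omni":    {"MNIST": 0.9605, "Fashion": 0.9013},
    "MNIST":   {"MNIST": 0.9778, "Fashion": 0.8942},
    "Fashion": {"MNIST": 0.9482, "Fashion": 0.9106},
    "SVHN":    {"SVHN": 0.6886, "CIFAR10": 0.7065},
    "CIFAR10": {"SVHN": 0.5481, "CIFAR10": 0.8145},
}

def auc_gains():
    """Map (negative, positive) -> AUC change relative to no negatives."""
    return {
        (negative, positive): round(auc - NONE_BASELINE[positive], 4)
        for negative, row in WITH_NEGATIVES.items()
        for positive, auc in row.items()
    }

gains = auc_gains()
# Pairs where adding the negative dataset lowered the AUC.
degraded = sorted(pair for pair, gain in gains.items() if gain < 0)
```

The only pairs with a negative change are FashionMNIST as negatives for MNIST and CIFAR10 as negatives for SVHN, which are the two exceptions discussed in the text.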
Positive  Negative     Outliers
class     class        Plane  Car    Bird   Cat    Deer   avg
Car       None         0.32   -      0.34   0.33   0.33   0.330
Car       Dog (5)      0.67   -      0.89   0.93   0.90   0.848
Car       Frog (6)     0.58   -      0.90   0.90   0.91   0.823
Car       Horse (7)    0.66   -      0.88   0.90   0.92   0.840
Car       Ship (8)     0.83   -      0.59   0.51   0.50   0.608
Car       Truck (9)    0.51   -      0.44   0.49   0.44   0.470
Car       Comb. (5-9)  0.81   -      0.92   0.92   0.94   0.898
Car       Sup. (0-4)   0.89   -      0.93   0.90   0.95   0.918
Deer      None         0.56   0.80   0.52   0.54   -      0.605
Deer      Dog (5)      0.72   0.85   0.63   0.80   -      0.750
Deer      Frog (6)     0.66   0.86   0.60   0.75   -      0.718
Deer      Horse (7)    0.71   0.58   0.58   0.71   -      0.645
Deer      Ship (8)     0.93   0.94   0.63   0.72   -      0.805
Deer      Truck (9)    0.84   0.97   0.62   0.72   -      0.773
Deer      Comb. (5-9)  0.87   0.95   0.61   0.73   -      0.790
Deer      Sup. (0-4)   0.93   0.97   0.63   0.72   -      0.813
Other notable examples of this observation can be found in the appendices where, for instance, the dog class improves performance on cat outliers, but causes noticeable degradation when used as the negative dataset for the cat class itself. We believe that in the first case, the gained performance is due to the fact that these classes share similar compositional features and backgrounds. In the second case, however, the same property makes it difficult to balance the minimization and maximization losses during the latent-shaping phase. For example, car and truck images are so similar in this scenario that minimizing and maximizing the loss at the same time becomes contradictory. As posited in section 4.2, we mitigate this issue by adding a binary cross-entropy loss while training in the first phase to ensure that the input of the latent layer is linearly separable for positive and negative examples. Notice that unlike other approaches (Hendrycks et al., 2018; Perera and Patel, 2019), this does not require a labeled positive or negative dataset and relies only on the fact that we have two distinct datasets. This linear separability makes the second phase of training relatively easier and less contradictory. In table (5), we see that LinSepLISAE mitigates this issue for the aforementioned cases and gives the AUC increase we observed in table (2).
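A rough sketch of what such a first-phase objective could look like is given below. It is an assumption-laden illustration, not the formulation of section 4.2: the linear head (`w`, `b`), the weighting factor `bce_weight`, and the choice of mean squared error for reconstruction are all hypothetical. The only supervision used is which of the two datasets a sample came from.

```python
import numpy as np

def first_phase_loss(x, x_hat, h, is_positive, w, b, bce_weight=1.0):
    """Reconstruction loss plus a linear-separability term (illustrative).

    x, x_hat    : inputs and reconstructions, shape (n, d)
    h           : inputs to the latent layer, shape (n, k)
    is_positive : 1 for samples from the positive dataset, 0 for negatives
    w, b        : hypothetical linear classifier on h, shapes (k,) and ()
    """
    recon = np.mean((x - x_hat) ** 2)
    logits = h @ w + b
    y = np.asarray(is_positive, dtype=float)
    # Numerically stable binary cross-entropy on logits:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    bce = np.mean(np.maximum(logits, 0.0) - logits * y
                  + np.log1p(np.exp(-np.abs(logits))))
    return float(recon + bce_weight * bce)

# Toy check: perfectly reconstructed inputs whose latent inputs are well
# separated by the linear head give a loss close to zero.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
h = np.array([[5.0], [-5.0]])
separable = first_phase_loss(x, x, h, np.array([1, 0]), np.array([2.0]), 0.0)
mislabeled = first_phase_loss(x, x, h, np.array([0, 1]), np.array([2.0]), 0.0)
```

Driving the cross-entropy term down requires the two datasets to be separable by a hyperplane at the input of the latent layer, which is the property the text relies on in the second phase.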
5.3. Continual Learning
Positive  Negative     Outliers
class     class        Plane  Car    Bird   Cat    Deer   avg
Car       None         0.32   -      0.34   0.33   0.33   0.330
Car       Dog (5)      0.67   -      0.94   0.97   0.95   0.883
Car       Frog (6)     0.58   -      0.93   0.96   0.96   0.858
Car       Horse (7)    0.69   -      0.95   0.97   0.97   0.895
Car       Ship (8)     0.90   -      0.78   0.79   0.76   0.808
Car       Truck (9)    0.59   -      0.77   0.82   0.73   0.728
Car       Comb. (5-9)  0.90   -      0.97   0.97   0.98   0.955
Car       Sup. (0-4)   0.95   -      0.98   0.98   0.98   0.9725
Deer      None         0.56   0.80   0.52   0.54   -      0.605
Deer      Dog (5)      0.67   0.86   0.71   0.89   -      0.783
Deer      Frog (6)     0.68   0.87   0.62   0.79   -      0.740
Deer      Horse (7)    0.70   0.84   0.61   0.72   -      0.718
Deer      Ship (8)     0.94   0.95   0.63   0.73   -      0.813
Deer      Truck (9)    0.84   0.97   0.61   0.76   -      0.795
Deer      Comb. (5-9)  0.90   0.97   0.66   0.80   -      0.833
Deer      Sup. (0-4)   0.97   0.98   0.76   0.83   -      0.885
A common evaluation method for CL settings is split-dataset tests, whereby a standard classification dataset is divided into disjoint tasks, each containing a number of classes, such that the tasks together cover all classes present in the dataset. The model is then presented with the tasks one at a time and the final performance on all tasks is reported. In this section, we consider the MNIST and FashionMNIST datasets in two common variants of the aforementioned setting, namely the task-incremental and class-incremental settings. It is important to note that while these two datasets are now considered too simple for meaningful evaluation of classifiers in general, this does not hold for continual learning, since MNIST-like tasks are still extremely challenging for continual learning, especially in class-incremental settings. We compare LISAE with several CL approaches. We start by estimating a lower bound for each setting by simply training the classifier sequentially on the tasks to assess forgetting and interference. For the lower-bound model, EWC, online EWC (Schwarz et al., 2018), SI, and deep generative replay (DGR), the classifier network has the same architecture as the encoder network used in LISAE. For ExpertGates (EG), we use as many experts (classifiers) as there are tasks, each with the same encoder network architecture. It is worth noting that the latent dimension of the autoencoder gates in EG differs from the latent dimension used in LISAE. Therefore, we report the best accuracy for EG and LISAE. For a fair comparison, we only include models that do not have an episodic memory (i.e. do not store raw data).
We first test our model under the easier task-incremental protocol, where the model is presented with an unknown sample along with a task identifier. Table (6) shows results for split-MNIST and split-FashionMNIST. For both datasets, we consider five folds: for MNIST, for example, the first fold comprises digits 0 and 1, the second fold digits 2 and 3, and so forth. During testing, the model is presented with a task identifier indicating which task the sample belongs to. For LISAE, this significantly reduces interference between latent layers from different tasks.
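The split into five two-class folds can be written down in a few lines. The helper names below are hypothetical and serve only to illustrate the split-dataset protocol.

```python
def make_split_tasks(num_classes=10, classes_per_task=2):
    """Partition class labels 0..num_classes-1 into disjoint, ordered tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be divisible by classes_per_task")
    return [list(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]

def task_of(label, tasks):
    """Return the index of the task that contains the given class label."""
    for task_id, classes in enumerate(tasks):
        if label in classes:
            return task_id
    raise ValueError(f"label {label} is not part of any task")

tasks = make_split_tasks()        # five folds: (0, 1), (2, 3), ...
task_id_for_7 = task_of(7, tasks)  # task identifier given to the model
```

In the task-incremental protocol the value returned by `task_of` is supplied with each test sample, whereas in the class-incremental protocol it has to be inferred by the model.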
Model         MNIST    FashMNIST
lower bound   0.8819   0.7721
EWC           0.9864   0.9572
Online EWC    0.9904   0.9642
SI            0.9916   0.9682
DGR           0.9941   0.8842
LISAE         0.9961   0.9866

Model         MNIST    FashMNIST
lower bound   0.1970   0.1821
EWC           0.1992   0.1902
Online EWC    0.1993   0.1891
SI            0.2101   0.1911
DGR           0.9124   0.7298
EG            0.9306   0.8024
LISAE         0.9453   0.7786
DLLISAE       0.9814   0.8587
In table (7) we turn to the more challenging class-incremental setting, where no task identifier is presented. We see that for MNIST, the performance remains relatively close to the task-incremental performance. For FashionMNIST, however, noticeable interference appears when task 3 is introduced, as shown in fig. 7. This is due to the ordering of classes in FashionMNIST: task 2 contains the class coat while task 3 contains the class shirt. Such interference was not significantly manifested in the task-incremental setting because the task label first distinguishes between the two similar classes. To give our model more flexibility in class-incremental settings, we introduce another variant of LISAE, namely Double-Latent LISAE (DLLISAE), where instead of adding only one latent layer for each class as described in section 3.5, we add two consecutive latent layers with a tanh activation in between, which substantially improves performance. It is worth noting that although the ExpertGates model has five different autoencoders and five different classifier networks, a single base LISAE achieves similar performance. This is because in our model, utilizing negative examples, even as simple as a one-class negative dataset, significantly improves performance, as discussed in section 5.2. Another advantage is that the added latent layers are, by definition, operating as gates, but on a better representation rather than in pixel space. We also note that in table (1), when solving anomaly detection tasks, the 2-class MNIST setting was significantly harder than the 1-class setting due to higher inter-class variance in the data.
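One way to picture gating without a task identifier is to route each sample to the task whose gate reconstructs it best. The sketch below is an assumption about how such routing could work, not a description of the mechanism in section 3.5; `select_task` and the toy reconstructors are hypothetical stand-ins for trained per-task latent layers.

```python
import numpy as np

def select_task(x, reconstructors):
    """Pick the task whose gate yields the lowest reconstruction error."""
    errors = np.array([np.mean((x - r(x)) ** 2) for r in reconstructors])
    return int(np.argmin(errors)), errors

def make_toy_reconstructor(prototype):
    """Hypothetical stand-in for a trained gate.

    It pulls every input halfway toward a fixed prototype, so inputs close
    to the prototype are reconstructed with the smallest error.
    """
    prototype = np.asarray(prototype, dtype=float)
    return lambda x: 0.5 * (x + prototype)

gates = [make_toy_reconstructor(p) for p in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0])]
task_id, errors = select_task(np.array([0.9, 1.2]), gates)
```

Under this view, two tasks with very similar classes (such as coat and shirt) produce nearly equal errors, which is one plausible reading of the interference reported for FashionMNIST.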
6. Conclusion
In this paper we introduced a novel autoencoder-based model called LatentInsensitive Autoencoder (LISAE). With the help of negative samples drawn from a domain similar to that of the normal data, we tune the weights of the bottleneck part of a standard autoencoder such that the resulting model is able to reconstruct the target task while penalizing anomalous samples. We also presented theoretical justification for the reasoning behind our two-phase training process and the latent-shaping loss function, along with a more powerful variant. Multiple ablation studies were conducted. We showed that continual learning can be thought of as multiple anomaly detection problems and leveraged this framing to extend the applications of our model beyond anomaly detection to tackle the challenging problem of class-incremental learning using a simple variant with multiple bottlenecks. We tested our model in a variety of anomaly detection and class-incremental settings with multiple datasets of varying degrees of complexity. Experimental results showed significant performance improvement over compared methods. Future research will focus on further investigating the connection between continual learning and anomaly detection, forward and backward transfer of knowledge for continual learning, and possible ways of synthesizing negative examples for domains with limited data. We also hope to further study and employ various manifold learning approaches for latent space representation.
Acknowledgement
Artem Lenskiy was funded by Our Health in Our Hands (OHIOH), a strategic initiative of the Australian National University, which aims to transform healthcare by developing new personalised health technologies and solutions in collaboration with patients, clinicians, and health care providers.
References
Online continual learning with maximally interfered retrieval. arXiv preprint arXiv:1908.04742.
Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375.
Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18.
Greedy layer-wise training of deep networks. Advances in neural information processing systems 19, pp. 153.
Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics 59 (4), pp. 291–294.
Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 1–37.
Robust, deep and inductive anomaly detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36–51.
Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58.
Support-vector networks. Machine learning 20 (3), pp. 273–297.
Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135.
Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1705–1714.
Generative adversarial nets. Advances in neural information processing systems 27.
Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.
Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742.
Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180.
Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606.
Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526.
Learning multiple layers of features from tiny images.
Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
Kernel methods in computer vision. Now Publishers Inc.
Hippocampal-neocortical interaction: a hierarchy of associativity. Hippocampus 10 (4), pp. 420–430.
MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
Overcoming catastrophic forgetting by incremental moment matching. arXiv preprint arXiv:1703.08475.
Marginal replay vs conditional replay for continual learning. In International Conference on Artificial Neural Networks, pp. 466–480.
Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947.
Isolation forest. In 2008 eighth ieee international conference on data mining, pp. 413–422.
Generative feature replay for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 226–227.
The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology 4, pp. 504.
Reading digits in natural images with unsupervised feature learning.
On estimation of a probability density function and mode. The annals of mathematical statistics 33 (3), pp. 1065–1076.
LIII. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science 2 (11), pp. 559–572.
Learning deep features for one-class classification. IEEE Transactions on Image Processing 28 (11), pp. 5450–5463.
Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766.
Deep one-class classification. In International conference on machine learning, pp. 4393–4402.
Progressive neural networks. arXiv preprint arXiv:1606.04671.
f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, pp. 30–44.
Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pp. 146–157.
Estimating the support of a high-dimensional distribution. Neural computation 13 (7), pp. 1443–1471.
Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537.
Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690.
A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270–279.
Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17 (2), pp. 336.
Support vector data description. Machine learning 54 (1), pp. 45–66.
Does continual learning= catastrophic forgetting?. arXiv preprint arXiv:2101.07295.
Replicator neural networks for outlier modeling in segmental speech recognition. In International Symposium on Neural Networks, pp. 996–1001.
Brain-inspired replay for continual learning with artificial neural networks. Nature communications 11 (1), pp. 1–14.
Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.
Inference with the universum. In Proceedings of the 23rd international conference on Machine learning, pp. 1009–1016.
A comparative study of rnn for outlier detection in data mining. In 2002 IEEE International Conference on Data Mining, 2002. Proceedings., pp. 709–712.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995.
Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 665–674.
A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 5 (5), pp. 363–387.
Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations.
Appendix A Effect of individual classes as negative examples
As discussed in section 5.2, we examine the effect of each class present in the negative dataset on anomaly detection performance for other test classes from the CIFAR10 dataset. The first table shows results for standard LISAE while the second table shows results for LinSepLISAE.
Positive  Negative     Outliers
class     class        Plane  Car    Bird   Cat    Deer   avg
Plane     None         -      0.78   0.58   0.62   0.61   0.648
Plane     Dog (5)      -      0.83   0.87   0.95   0.91   0.890
Plane     Frog (6)     -      0.83   0.86   0.94   0.90   0.883
Plane     Horse (7)    -      0.82   0.86   0.94   0.91   0.883
Plane     Ship (8)     -      0.83   0.86   0.92   0.90   0.878
Plane     Truck (9)    -      0.84   0.86   0.95   0.91   0.890
Plane     Comb. (5-9)  -      0.83   0.85   0.93   0.91   0.880
Plane     Sup. (0-4)   -      0.81   0.88   0.94   0.92   0.888
Car       None         0.32   -      0.34   0.33   0.33   0.330
Car       Dog (5)      0.67   -      0.89   0.93   0.90   0.848
Car       Frog (6)     0.58   -      0.90   0.90   0.91   0.823
Car       Horse (7)    0.66   -      0.88   0.90   0.92   0.840
Car       Ship (8)     0.83   -      0.59   0.51   0.50   0.608
Car       Truck (9)    0.51   -      0.44   0.49   0.44   0.470
Car       Comb. (5-9)  0.81   -      0.92   0.92   0.94   0.898
Car       Sup. (0-4)   0.89   -      0.93   0.90   0.95   0.918
Bird      None         0.52   0.78   -      0.54   0.52   0.590
Bird      Dog (5)      0.59   0.75   -      0.71   0.49   0.635
Bird      Frog (6)     0.53   0.78   -      0.69   0.54   0.635
Bird      Horse (7)    0.63   0.80   -      0.66   0.57   0.665
Bird      Ship (8)     0.86   0.89   -      0.61   0.45   0.703
Bird      Truck (9)    0.76   0.94   -      0.64   0.72   0.765
Bird      Comb. (5-9)  0.78   0.90   -      0.62   0.49   0.698
Bird      Sup. (0-4)   0.82   0.94   -      0.60   0.48   0.710
Cat       None         0.55   0.76   0.50   -      0.50   0.578
Cat       Dog (5)      0.54   0.73   0.52   -      0.52   0.578
Cat       Frog (6)     0.56   0.72   0.60   -      0.62   0.625
Cat       Horse (7)    0.70   0.75   0.59   -      0.68   0.680
Cat       Ship (8)     0.91   0.89   0.53   -      0.46   0.678
Cat       Truck (9)    0.82   0.94   0.50   -      0.48   0.685
Cat       Comb. (5-9)  0.89   0.91   0.55   -      0.52   0.718
Cat       Sup. (0-4)   0.93   0.93   0.58   -      0.54   0.745
Deer      None         0.56   0.80   0.52   0.54   -      0.605
Deer      Dog (5)      0.72   0.85   0.63   0.80   -      0.750
Deer      Frog (6)     0.66   0.86   0.60   0.75   -      0.718
Deer      Horse (7)    0.71   0.58   0.58   0.71   -      0.645
Deer      Ship (8)     0.93   0.94   0.63   0.72   -      0.805
Deer      Truck (9)    0.84   0.97   0.62   0.72   -      0.773
Deer      Comb. (5-9)  0.87   0.95   0.61   0.73   -      0.790
Deer      Sup. (0-4)   0.93   0.97   0.62   0.72   -      0.810

Positive  Negative     Outliers
class     class        Dog    Frog   Horse  Ship   Truck  avg
Dog       None         -      0.69   0.66   0.57   0.77   0.673
Dog       Plane (0)    -      0.53   0.66   0.92   0.89   0.750
Dog       Car (1)      -      0.56   0.68   0.95   0.91   0.775
Dog       Bird (2)     -      0.63   0.63   0.77   0.78   0.703
Dog       Cat (3)      -      0.67   0.65   0.66   0.81   0.698
Dog       Deer (4)     -      0.73   0.69   0.70   0.76   0.720
Dog       Comb. (0-4)  -      0.58   0.67   0.95   0.94   0.785
Dog       Sup. (5-9)   -      0.56   0.73   0.95   0.95   0.798
Frog      None         0.40   -      0.53   0.49   0.67   0.523
Frog      Plane (0)    0.73   -      0.81   0.96   0.93   0.858
Frog      Car (1)      0.74   -      0.83   0.96   0.97   0.875
Frog      Bird (2)     0.80   -      0.84   0.91   0.85   0.850
Frog      Cat (3)      0.84   -      0.80   0.87   0.86   0.843
Frog      Deer (4)     0.75   -      0.86   0.88   0.87   0.840
Frog      Comb. (0-4)  0.75   -      0.84   0.97   0.95   0.877
Frog      Sup. (5-9)   0.82   -      0.88   0.97   0.96   0.907
Horse     None         0.41   0.58   -      0.46   0.66   0.528
Horse     Plane (0)    0.55   0.50   -      0.93   0.83   0.703
Horse     Car (1)      0.56   0.58   -      0.90   0.93   0.743
Horse     Bird (2)     0.62   0.73   -      0.80   0.71   0.715
Horse     Cat (3)      0.77   0.76   -      0.65   0.66   0.710
Horse     Deer (4)     0.62   0.83   -      0.57   0.60   0.655
Horse     Comb. (0-4)  0.51   0.57   -      0.89   0.88   0.713
Horse     Sup. (5-9)   0.59   0.66   -      0.95   0.92   0.780
Ship      None         0.62   0.74   0.73   -      0.77   0.715
Ship      Plane (0)    0.75   0.75   0.82   -      0.74   0.765
Ship      Car (1)      0.84   0.89   0.90   -      0.88   0.878
Ship      Bird (2)     0.94   0.96   0.94   -      0.78   0.905
Ship      Cat (3)      0.95   0.95   0.93   -      0.80   0.908
Ship      Deer (4)     0.92   0.96   0.94   -      0.78   0.900
Ship      Comb. (0-4)  0.95   0.97   0.96   -      0.83   0.928
Ship      Sup. (5-9)   0.94   0.96   0.96   -      0.88   0.935
Truck     None         0.35   0.53   0.46   0.30   -      0.410
Truck     Plane (0)    0.61   0.52   0.58   0.80   -      0.628
Truck     Car (1)      0.51   0.57   0.53   0.47   -      0.520
Truck     Bird (2)     0.92   0.90   0.84   0.73   -      0.848
Truck     Cat (3)      0.95   0.90   0.82   0.61   -      0.820
Truck     Deer (4)     0.91   0.92   0.86   0.63   -      0.830
Truck     Comb. (0-4)  0.93   0.91   0.83   0.76   -      0.858
Truck     Sup. (5-9)   0.94   0.95   0.90   0.78   -      0.893


Appendix B Detailed Results for all 10 tasks
The following graphs are detailed results for some experiments in the various settings described in sections 5.1 and 5.2. Each curve represents a trade-off between accuracy on anomalies and accuracy on normal data for each dataset. The two left panes show an upper-bound supervised setting where the negative dataset is the same as the outliers. The top pane shows accuracies on tasks 0 to 4 and the bottom pane shows accuracies on tasks 5 to 9. Note that as the threshold value increases, the model favors accepting anomalies over misclassifying normal examples. In almost all cases, we observe that LISAE gives a significant margin compared to a standard AE.
Standard LISAE trained on CIFAR10 classes. Left, outliers as negative dataset (Supervised). Right, SVHN as negative dataset.
Results of LinSepLISAE variant on SVHN. Left, outliers as negative dataset (Supervised). Right, unsupervised. 
MNIST classes as Positive datasets. Left, outliers as negative dataset (Supervised). Right, Omniglot as negative dataset. 
FashionMNIST classes as Positive datasets. Left, outliers as negative dataset (Supervised). Right, Omniglot as negative dataset. 