Anomaly detection refers to identifying data points, events, or observations that deviate from the normal behavior of a dataset Ben-Gal (2005). Recognizing anomalous data is a crucial task in various fields, such as identifying fraud in credit card transactions Raj and Portia (2011), discriminating fraudulent claims in insurance or health care Ukil et al. (2016), detecting intrusions for cyber-security Goh et al. (2017), and monitoring rare objects or events in computer vision and surveillance videos Sultani et al. (2018). Previous approaches Chandola et al. (2009) include classical methods such as the local outlier factor (LOF) Breunig et al. (2000) and the local search algorithm (LSA) He et al. (2005).
In the natural language processing (NLP) field, existing approaches mainly concentrate on recognizing specific text anomaly patterns such as hate speech Davidson et al. (2017); Badjatiya et al. (2017); Waseem and Hovy (2016) or spam text Laorden et al. (2014); Miller et al. (2014); Savage et al. (2014), but not on general, indistinct anomaly patterns such as domain deviation and quality variation. Detecting text anomalies is inherently hard in complex real-world applications, as the anomaly patterns are enormous, diverse, and non-trivial to define.
To tackle general anomalies, Tax and Duin (2004) propose a support vector data description (SVDD) method, which aims to fit a spherically shaped description of the training data (from a well-sampled normal class). However, without robust feature engineering, it can be difficult for SVDD to build a good data descriptor, especially for complex unstructured data. To this end, deep anomaly detection methods have been proposed and significantly improve performance Chalapathy and Chawla (2019), such as deep one-class support vector data description (OC-SVDD) Ruff et al. (2018), the one-class convolutional neural network (OC-CNN) Oza and Patel (2018), anomaly detection with deep auto-encoders Zhou and Paffenroth (2017), and the adversarially learned one-class classifier Sabokrou et al. (2018). These approaches are designed for image data and cannot be directly applied to natural language data. Here, to tackle text anomaly detection end-to-end, we aim to build a deep network in the NLP domain that automatically discriminates anomalies from normal data.
In this paper, motivated by the idea of deep OC-SVDD, we propose an anomaly-injected deep SVDD (AI-SVDD) framework, which provides a systematic solution for text anomaly detection. The motivation for AI-SVDD is illustrated in Figure 1: in the 2-D feature space, anomalies may be dispersed anywhere due to the enormous variety of anomaly patterns. To detect such anomalies, a common approach is to design a model that maps normal data into a compact data hypersphere, i.e., the red circle. However, without enough labeled anomalies, such a model can be sensitive to outliers or mislabeled points (the upper-right bigger blue circle) and make errors (many unobserved anomalies are detected as normal). Even if we label a small portion of anomalies and treat the problem as binary classification, a classifier can barely produce satisfactory results (see the blue line in Figure 1), as it is essentially hard to cover the entire anomaly distribution. Hence, we need a robust model to differentiate normal data from diverse, indistinct anomalies. The proposed approach not only learns a more compact data description by updating the center (deep OC-SVDD uses a prefixed center), but also adopts a small amount of labeled anomaly data to better discriminate anomalous texts from normal ones. More generally, the proposed AI-SVDD reduces to the one-class case when all labels are the same.
The contributions are as follows: (i) we introduce a novel AI-SVDD deep learning framework for anomaly detection; (ii) we analyze the difference between deep OC-SVDD and the proposed AI-SVDD; (iii) we propose a systematic pipeline that combines the AI-SVDD objective with a deep NLP model, together with a solution to the resulting quadratic equality constrained optimization problem; (iv) we provide results and analysis on two potential application scenarios and one real-world application scenario, and we demonstrate that the proposed deep AI-SVDD outperforms existing works.
Anomaly detection in general is a broad topic. In this section, we review some existing text anomaly detection methods and introduce a series of deep learning based approaches that are closely related to our work.
To detect anomalies in text, Manevitz and Yousef (2001) propose a one-class support vector machine (OC-SVM) approach that treats document classification in a one-against-all fashion. Similarly, Liu et al. (2002) introduce a partially supervised classification method that classifies positive-class documents against all others. For streaming text data, Mahapatra et al. (2012) propose an LDA-based text clustering algorithm and consider including contextual patterns as side information. These traditional text anomaly detection approaches utilize feature engineering techniques, such as tf-idf or n-gram features, to generate text representations. There are also works that use end-to-end models. For example, Larson et al. (2019) employ sentence embeddings and rank samples by their distances to the mean of the embeddings to detect outliers in a dialogue system; Nedelchev et al. (2020) propose a recurrent neural network (RNN) encoder-decoder to model dialogue quality and treat dialogue evaluation as an anomaly detection task; Zhuang et al. (2017) and Ruff et al. (2019b) focus on identifying outlier documents, and both utilize vector representations of words in the model.
Next, we closely review the deep support vector data description (SVDD) approaches for anomaly detection. In Ruff et al. (2018), an SVDD objective combined with deep learning models, such as CNNs, is introduced to tackle high-dimensional, data-rich scenarios. This approach minimizes the volume of a hypersphere that encloses the network representations of the data and shows significant anomaly detection performance improvements on the MNIST LeCun et al. (1998) and CIFAR-10 Krizhevsky et al. (2009) datasets in the field of computer vision. However, unlike the original SVDD objective, which jointly minimizes the hypersphere and updates the center in the feature space Tax and Duin (2004), deep OC-SVDD assumes the center of the hypersphere is known and fixed during the learning process. Moreover, all of the aforementioned traditional and end-to-end approaches are unsupervised.
In addition, to the best of our knowledge, there is limited work dealing with anomaly detection in a supervised fashion. To increase discrimination power, a deep semi-supervised anomaly detection approach is introduced in Ruff et al. (2019a). It utilizes a subset of labeled data (verified by a domain expert as being normal or anomalous) together with a large number of unlabeled samples and shows a slight performance improvement over the OC-SVDD approach. A clear drawback of their approach is that the learned network could unexpectedly project outlier data points inside the hypersphere, as they assume the majority of the unlabeled data belongs to the normal class, but the quadratic objective is sensitive to outliers by nature.
To this end, we propose an improved deep anomaly detection framework. It jointly minimizes the volume of a hypersphere enclosing the normal-class data and updates the center; meanwhile, our method further adopts a small proportion of labeled anomaly data to enhance the discrimination power.
Deep OC-SVDD and Revisions
Background and motivation
Ruff et al. (2018)
introduce a deep one-class classification framework, which aims to train a neural network that minimizes the volume of a data-enclosing hypersphere centered on a predetermined projection center $c$. Formally, for some input space $\mathcal{X} \subseteq \mathbb{R}^d$ and latent (feature) space $\mathcal{F} \subseteq \mathbb{R}^p$, let $\phi(\cdot; \mathcal{W}): \mathcal{X} \to \mathcal{F}$ be a neural network with $L$ hidden layers and the corresponding set of weights $\mathcal{W} = \{W^1, \dots, W^L\}$. The deep OC-SVDD objective can be formulated as:

$$\min_{\mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi(x_i; \mathcal{W}) - c \rVert^2 + \frac{\lambda}{2} \sum_{l=1}^{L} \lVert W^l \rVert_F^2, \tag{1}$$

where $x_1, \dots, x_n$ are samples on $\mathcal{X}$, $c \in \mathcal{F}$, $\lVert \cdot \rVert_F^2$ is the squared Frobenius (Euclidean) norm for a matrix, and $\lambda > 0$ is the regularization hyper-parameter.
Rather than minimizing the volume of a data-enclosing hypersphere around a fixed center, we can further add the center $c$ into the optimization, and the objective becomes:

$$\min_{\mathcal{W}, c} \; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi(x_i; \mathcal{W}) - c \rVert^2 + \frac{\lambda}{2} \sum_{l=1}^{L} \lVert W^l \rVert_F^2. \tag{2}$$
In their original work, $c$ is essentially the center of the data points under the neural network representation. That is, when jointly optimizing $\mathcal{W}$ and $c$ in Eq. (2), an optimal value of $c$ can be calculated as $c^* = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i; \mathcal{W})$. After plugging $c^*$ back into Eq. (2), we can further obtain the following new objective (please check the Appendix for more details):

$$\min_{\mathcal{W}} \; \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert \phi(x_i; \mathcal{W}) - \phi(x_j; \mathcal{W}) \rVert^2 + \frac{\lambda}{2} \sum_{l=1}^{L} \lVert W^l \rVert_F^2. \tag{3}$$
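As a sanity check, the equivalence between the centered and pairwise forms can be verified numerically. The following is an illustrative numpy sketch (not the paper's implementation), where `Z` stands in for the network embeddings $\phi(x_i; \mathcal{W})$:

```python
import numpy as np

def oc_loss(Z, c):
    """Mean squared Euclidean distance of embeddings Z (n x d) to center c."""
    return np.mean(np.sum((Z - c) ** 2, axis=1))

def pairwise_loss(Z):
    """Equivalent objective after eliminating c: (1 / 2n^2) * sum_ij ||z_i - z_j||^2."""
    n = Z.shape[0]
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2) / (2 * n ** 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 8))
c_star = Z.mean(axis=0)  # the optimal center is the embedding mean
assert np.isclose(oc_loss(Z, c_star), pairwise_loss(Z))
```

The identity holds because $\frac{1}{2n^2}\sum_{i,j}\lVert z_i - z_j\rVert^2 = \frac{1}{n}\sum_i \lVert z_i - \bar{z}\rVert^2$ for any point set.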
As a preliminary exploration of Eq. (3), if we consider a vanilla network with a single linear layer, we can show that the optimal network weights are all zero, so the network merely transforms all data points onto the origin. The details are shown in the supplemental material.
Instead of minimizing Eq. (2), which suffers from learning nothing, in this paper we consider the following one-class objective, which retains the joint minimization over $\mathcal{W}$ and $c$ under a set of quadratic equality constraints on the network weights:

$$\min_{\mathcal{W}, c} \; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi(x_i; \mathcal{W}) - c \rVert^2 \quad \text{s.t. quadratic equality constraints on } \{W^l\}. \tag{4}$$
Since Eq. (4) minimizes over the center in addition to Eq. (1), the resulting hypersphere volume is no larger than that of Eq. (1). As a result, by introducing a joint minimization over the center $c$, we can transform the input space into an even more compact hypersphere.
Moreover, the one-class objective in Eq. (4) can be further extended to a more general scenario where labeled anomaly data are introduced. Therefore, we propose the AI-SVDD objective and show the differences in the section "AI-SVDD for Anomaly Detection". Note that the new objective equals the one-class objective when all labels are the same (i.e., all data are normal).
AI-SVDD for Anomaly Detection
In this section, we propose to extend the aforementioned one-class objective in Eq. (4) to a more general case, where the goal is to pull the normal class close to a center while keeping the anomaly class far from the center. Accordingly, when additional class labels are available, anomaly detection can be more targeted and is likely to be more effective and robust.
Deep AI-SVDD objective
In our deep AI-SVDD setting, $n$ labeled samples $\{(x_i, y_i)\}_{i=1}^{n}$ are given, where each $y_i \in \{+1, -1\}$ is the anomaly label ($y_i = +1$ denotes the normal class, $y_i = -1$ denotes an anomaly) for the sample $x_i$. We aim to jointly find a center $c$ and the parameter set $\mathcal{W}$ of a deep neural network with $L$ layers, such that each positive instance is close to the center and each negative instance is far from the center:

$$\min_{\mathcal{W}, c} \; \frac{1}{n} \sum_{i=1}^{n} y_i \lVert \phi(x_i; \mathcal{W}) - c \rVert^2 \quad \text{s.t. quadratic equality constraints on } \{W^l\}. \tag{5}$$

Solving for $c$ first, we have $c^* = \sum_{i=1}^{n} y_i \phi(x_i; \mathcal{W}) / \sum_{i=1}^{n} y_i$. If we plug $c^*$ back into Eq. (5) and simplify it (please check the Appendix for more details), we have the following new objective:

$$\min_{\mathcal{W}} \; \frac{1}{2n \sum_{i} y_i} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lVert \phi(x_i; \mathcal{W}) - \phi(x_j; \mathcal{W}) \rVert^2 \quad \text{s.t. the same constraints.} \tag{6}$$
Note that in our proposed objective, the new center $c^*$ learned via Eq. (5) takes advantage of the anomaly data in two ways: (i) a normal sample and an anomaly sample that are close in the embedding space cancel each other and do not contribute to the final center, which improves robustness compared with the one-class case; (ii) the new center is pushed away from the anomaly points, whereas the simple average in the one-class case can drag the center towards the anomalies when the training data is polluted with anomaly samples. Please check the supplemental material for more details.
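The cancellation behavior of the label-weighted center can be illustrated with a short numpy sketch (the name `ai_center` is ours, not the paper's):

```python
import numpy as np

def ai_center(Z, y):
    """Label-weighted center c* = sum_i y_i * z_i / sum_i y_i, y_i in {+1, -1}."""
    return (y[:, None] * Z).sum(axis=0) / y.sum()

rng = np.random.default_rng(1)
Z_norm = rng.normal(size=(10, 4))
y = np.ones(10)
# With only normal labels, the center reduces to the plain embedding mean.
assert np.allclose(ai_center(Z_norm, y), Z_norm.mean(axis=0))

# Place an anomaly exactly on top of one normal point: the pair cancels, and
# the signed center equals the mean of the remaining nine normal embeddings.
Z = np.vstack([Z_norm, Z_norm[0]])
y2 = np.append(y, -1.0)
assert np.allclose(ai_center(Z, y2), Z_norm[1:].mean(axis=0))
```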
The difference between our proposed objective in Eq. (5) and the deep OC-SVDD objective in Eq. (1) lies not only in updating the center but also in injecting labeled anomaly data into the objective. The purpose is to obtain a more compact enclosing hypersphere of the data while increasing the discrimination of anomalies from normal data. We illustrate how the different objectives perform using a vanilla network on toy synthetic data in the supplemental material.
The system and training
To detect text anomalies, our target is to build the deep AI-SVDD system pipeline. We denote Eq. (1) as the OC-Loss and the first part of Eq. (5) without constraints as the BC-Loss. The goal is to optimize the deep network by minimizing the BC-Loss. The system diagram for anomaly detection is shown in Figure 2.
The system pipeline consists of three stages: the pretraining stage, the training stage, and the test stage. In the pretraining stage, an auto-encoder is learned to transform the data from the original input space into a latent feature space. In the training stage, only the encoder part of the network is trained on the BC-Loss or OC-Loss. In the test stage, the trained network is used to generate latent features for the test data.
To capture a good language representation of the text, we take advantage of pretrained bidirectional encoder representations from transformers (BERT) to extract initial features of the text, which serve as the input data of the system. We then introduce an auto-encoder architecture in the pretraining stage to learn a latent encoding from the given data by minimizing the mean squared error (MSE) between the original inputs (BERT embeddings) and the reconstructed ones. A multilayer perceptron (MLP) network is employed in both the encoder and the decoder to transform the input into a latent space with a reduced dimension, where the hidden size and latent dimension can be selected through cross-validation Browne (2000).
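As a rough illustration of the pretraining stage, the sketch below trains a minimal linear auto-encoder on random stand-ins for BERT embeddings by gradient descent on the MSE; the dimensions, learning rate, and single-linear-layer encoder/decoder are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 32, 4, 200
X = rng.normal(size=(n, d))            # stand-in for BERT sentence embeddings
W_enc = 0.1 * rng.normal(size=(d, k))  # encoder weights (input dim -> latent dim)
W_dec = 0.1 * rng.normal(size=(k, d))  # decoder weights (latent dim -> input dim)

def mse(A, B):
    return np.mean((A - B) ** 2)

loss0 = mse(X, X @ W_enc @ W_dec)
lr = 0.1
for _ in range(300):                   # plain gradient descent on the MSE
    Z = X @ W_enc                      # latent codes
    R = Z @ W_dec                      # reconstruction
    G = 2 * (R - X) / (n * d)          # dMSE/dR
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
assert mse(X, X @ W_enc @ W_dec) < loss0   # reconstruction error decreased
```

After pretraining, only the encoder half (`W_enc` here) would be kept and fine-tuned on the OC-Loss or BC-Loss.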
Since our proposed BC-Loss does not involve the center $c$, unlike the OC-Loss in Eq. (1), we do not need the pretraining stage. We design the pretraining stage specifically for the OC-Loss, as it is a necessary step to obtain a good prefixed center for the deep OC-SVDD method.
Network weights minimization with quadratic equality constraints: Since the BC-Loss contains a set of quadratic equality constraints, we rely on the projected gradient descent method Bubeck (2014) to solve the minimization problem. For each layer $l$, we first update the network parameters with a backward error-propagation step as in Eq. (7), and then perform a projection onto the constraint set as in Eq. (8):

$$W^l_{t+\frac{1}{2}} = W^l_t - \eta \nabla \mathcal{L}(W^l_t), \tag{7}$$

$$W^l_{t+1} = \mathrm{Proj}_{\mathcal{C}}\big(W^l_{t+\frac{1}{2}}\big), \tag{8}$$

where $\eta$ is the learning rate, $W^l_t$ is the learned $l$-th layer network weights at iteration $t$, $W^l_{t+\frac{1}{2}}$ is the intermediate weights after the gradient step, $\mathcal{C}$ is the constraint set, and $\nabla \mathcal{L}(W^l_t)$ is the gradient of the loss function evaluated at $W^l_t$. The projected gradient descent method has a convergence rate similar to that of gradient descent on the unconstrained objective Bubeck (2014).
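A minimal sketch of one projected-gradient-descent step is shown below. The paper's exact constraint set is not reproduced here; for illustration we assume a unit Frobenius-norm constraint per layer, whose projection is a simple rescaling:

```python
import numpy as np

def project_unit_frobenius(W):
    """Closest point to W on the set {W : ||W||_F = 1} (assumed constraint)."""
    return W / np.linalg.norm(W)

def pgd_step(W, grad, lr):
    W_half = W - lr * grad                 # unconstrained gradient step, as in Eq. (7)
    return project_unit_frobenius(W_half)  # projection onto the constraint, as in Eq. (8)

rng = np.random.default_rng(0)
W = project_unit_frobenius(rng.normal(size=(5, 3)))
W_next = pgd_step(W, rng.normal(size=(5, 3)), lr=0.1)
assert np.isclose(np.linalg.norm(W_next), 1.0)  # constraint holds after the step
```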
In the inference stage, once the network is trained, the score for a test point $x$ is given by the squared Euclidean distance from $\phi(x; \mathcal{W}^*)$ to the center $c^*$ obtained from the training data:

$$s(x) = \lVert \phi(x; \mathcal{W}^*) - c^* \rVert^2. \tag{9}$$
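In code, the inference-time scoring of Eq. (9) amounts to a squared-distance computation (an illustrative sketch; `anomaly_score` is our name, and `z` stands for the trained network's embedding of the test point):

```python
import numpy as np

def anomaly_score(z, c):
    """Squared Euclidean distance from the test embedding z to the center c."""
    return float(np.sum((z - c) ** 2))

c = np.zeros(4)
assert anomaly_score(np.array([3.0, 4.0, 0.0, 0.0]), c) == 25.0
assert anomaly_score(np.zeros(4), c) == 0.0  # points at the center score lowest
```

Test points are then ranked by this score, with larger values flagged as more anomalous.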
In the following experiments (code will be released upon publication), we consider two potential application scenarios: one is to detect sentences irrelevant to a topic, such as movie reviews, and the other is to detect mislabeled text data that does not belong to a known class type. Finally, we employ a real-world medical document dataset for the application of title quality filtering. We consider the following baseline methods and comparing method:
The baselines we consider fall into two categories: (i) the BERT-rank method proposed in Larson et al. (2019), which first converts sentences into embeddings and uses the training-data embeddings to compute the center. To capture a good representation of the text data, we use the pretrained BERT base cased model Devlin et al. (2018) to generate the embeddings. In the test stage, we rank the test data by the Euclidean distance of the embedding to the center, from high to low, for evaluation. (ii) Traditional outlier detection methods, namely one-class SVM (OC-SVM) Manevitz and Yousef (2001), isolation forest (ISF) Liu et al. (2008), and local outlier factor (LOF) Breunig et al. (2000), using BERT embeddings as input.
The comparing method we consider is the deep OC-SVDD approach proposed in Ruff et al. (2018). Note that the work in Ruff et al. (2018) focuses on the computer vision domain and cannot be directly applied to text data. Here, we employ the training pipeline described in the section "The system and training" and learn a deep model by minimizing the loss in Eq. (1).
For a comparison of text outlier detection approaches, we consider the following metrics:
Mean Average Precision (MAP) captures the overall precision of detecting all anomalies, defined as $\mathrm{MAP} = \frac{1}{K} \sum_{k=1}^{K} \frac{k}{\mathrm{pos}(a_k)}$, where $K$ is the total number of anomalies, $a_k$ is the $k$-th anomaly ordered by the Euclidean distance to the center in descending order, and $\mathrm{pos}(a_k)$ is the position of the $k$-th anomaly among all test examples.
Recall@k captures the percentage of anomalies retrieved within the top $k\%$ of the test data samples, i.e., the fraction of all $K$ anomalies that appear among the top-ranked $\lceil kN/100 \rceil$ samples, where $N$ is the total number of test samples.
AUC captures the area under the receiver operating characteristic (ROC) curve of the true positive rate (TPR) against the false positive rate (FPR), obtained by scanning through various thresholds. Here the threshold is applied to the Euclidean distance from a test sample to the center, as defined in Eq. (9).
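The ranking-based metrics above can be sketched on ranked anomaly scores as follows (illustrative numpy code; here `labels` are 1 for anomaly and 0 for normal, and ranking is by descending score):

```python
import numpy as np

def ranked_labels(scores, labels):
    """Labels reordered by descending anomaly score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    return np.asarray(labels)[order]

def mean_average_precision(scores, labels):
    r = ranked_labels(scores, labels)
    pos = np.flatnonzero(r) + 1                 # 1-based positions of the anomalies
    return np.mean(np.arange(1, len(pos) + 1) / pos)

def recall_at_pct(scores, labels, pct):
    """Fraction of anomalies found in the top pct% of the ranking."""
    r = ranked_labels(scores, labels)
    top = int(np.ceil(len(r) * pct / 100))
    return r[:top].sum() / r.sum()

scores = [0.9, 0.8, 0.1, 0.7, 0.2]
labels = [1, 0, 0, 1, 0]                        # anomalies ranked 1st and 3rd
assert np.isclose(mean_average_precision(scores, labels), (1/1 + 2/3) / 2)
assert recall_at_pct(scores, labels, 40) == 0.5
```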
In the topic change application, we concentrate on exploring how different levels of data pollution affect detection performance. The BERT-rank method and the deep OC-SVDD approach are included for comparison. We qualitatively analyze the results through ROC curves, along with detailed numerical evaluations on all metrics mentioned above. In the applications of text anomaly detection and medical text quality filtering, we compare our AI-SVDD approach with all aforementioned baselines and the comparing method quantitatively.
Text quality filtering
To detect whether a response is irrelevant to a specific topic, we use the IMDB movie review data Maas et al. (2011) as the normal-class (main topic) text data. We treat the wikitext-2 data Merity et al. (2016), which contains various topics, as the anomaly text data. For both the IMDB and wikitext data, we use the Stanford NLP toolbox Qi et al. (2020) to split paragraphs into sentences. As a result, for the normal part of the data, the processed IMDB dataset provides the sentences for training, model development, and testing; for anomalies, we randomly select sentences from the wikitext data.
To investigate how data pollution affects model performance, we consider a series of pollution proportions. More specifically, we generate the training data by combining all IMDB training sentences with anomaly sentences randomly selected from the wikitext data according to the pollution proportion. The development and test datasets are generated using a single pollution level for simplicity (also randomly selected from the remaining wikitext sentences). We generate the training, development, and test datasets only once per pollution proportion, since the normal data samples are fixed and the anomaly sample size is a small proportion (generating training data multiple times cannot cover the full anomaly distribution, and the aforementioned methods do not concentrate on fitting the anomaly data). In this experiment, we run our proposed deep AI-SVDD approach together with the BERT-rank method and the deep OC-SVDD approach multiple times on the differently polluted training data. We report the results on the test dataset in terms of ROC, MAP, Recall@5 and AUC, summarized in Figure 3 and Table 1.
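A pollution setup of this kind can be sketched as follows (illustrative code with made-up placeholder sentences; the sampling scheme is an assumption, not the paper's exact procedure):

```python
import random

def make_polluted_train(normal, anomalies, pollution_pct, seed=0):
    """Combine all normal sentences with pollution_pct% (of the normal size)
    randomly drawn anomaly sentences; labels are +1 normal, -1 anomaly."""
    n_anom = int(len(normal) * pollution_pct / 100)
    rng = random.Random(seed)
    sampled = rng.sample(anomalies, n_anom)
    data = [(s, +1) for s in normal] + [(s, -1) for s in sampled]
    rng.shuffle(data)
    return data

normal = [f"review {i}" for i in range(100)]      # placeholder normal sentences
anomalies = [f"wiki {i}" for i in range(50)]      # placeholder anomaly sentences
train = make_polluted_train(normal, anomalies, pollution_pct=5)
assert len(train) == 105
assert sum(1 for _, y in train if y == -1) == 5   # 5% pollution
```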
Result and analysis
Figure 3 shows the ROC curves of the different methods under different pollution proportion settings. We observe that the AUC of the BERT-rank method does not change much across pollution proportions, while the AUC of the OC-SVDD method reduces gradually as the data pollution proportion increases. In contrast, the AUC of our proposed approach increases slightly as the data pollution proportion grows, since our AI-SVDD model takes advantage of the labeled anomalies. Figure 3(d) illustrates the ROC curves of the three methods on polluted training data. The results show that our proposed approach achieves a clear increase in AUC compared with both OC-SVDD and the BERT-rank method. In addition, to reach the same TPR, our AI-SVDD method introduces considerably fewer false alarms (lower FPR) than both BERT-rank and OC-SVDD.
Table 1 (excerpt): BERT-rank results (standard deviations in parentheses) across increasing pollution proportions. Results of the BERT-rank method are consistent without variation, as it only calculates the average of the embeddings of all training samples, which does not change across different runs.
|MAP||24.9 (0.0)||24.6 (0.0)||24.6 (0.0)||24.3 (0.0)||23.7 (0.0)|
|Recall@5||25.0 (0.0)||25.0 (0.0)||25.0 (0.0)||22.9 (0.0)||22.9 (0.0)|
|AUC||86.6 (0.0)||86.6 (0.0)||86.5 (0.0)||86.3 (0.0)||85.9 (0.0)|
Table 1 shows the detection performance in terms of MAP, Recall@5 and AUC on the polluted test data for the different methods. For the baseline BERT-rank method, the performance on all metrics decreases slightly as the pollution proportion in the training data increases, and this method in general underperforms on the anomaly detection task compared with the OC-SVDD and AI-SVDD methods. When the data is not polluted, deep OC-SVDD outperforms the other methods, but its performance reduces gradually as the pollution proportion increases; on the contrary, the MAP, Recall@5 and AUC scores of our proposed deep AI-SVDD method increase significantly as the data pollution proportion grows. At the highest pollution proportion, our deep AI-SVDD method achieves the best overall performance in terms of average MAP, Recall@5 and AUC.
Note that when the data is not polluted, our approach is the same as the revised one-class objective in Eq. (4), which differs from the OC-SVDD objective in how we treat the projection center. From Table 1, we can observe a clear performance gap between these two objectives, which we suspect is mainly due to two reasons: (i) deep OC-SVDD involves a pretraining network that needs to be tuned carefully to find a good prefixed center before optimizing the OC-Loss; such a method is very sensitive to the choice of the center, and when the center is randomly initialized, we can easily observe a significant performance decrease. (ii) Our AI-SVDD objective minimizes the average pairwise squared distance of all training data in the latent space; without labeled anomaly points, the model can severely overfit the training data.
Unknown class detection
Table 2 reports results on the polluted test data for various polluted training datasets. Since there is no randomness in OC-SVM, LOF and BERT-rank, the standard deviations of these methods are zero.
To evaluate the ability to detect unknown classes, we conduct experiments on three datasets covering different domains: news (News Topic Categorization uses AG's news corpus Zhang et al. (2015)), online reviews (Review Categorization uses the Yelp dataset Zhang et al. (2015)), and biomedical papers (Abstract Role Categorization uses the RCT dataset Dernoncourt and Lee (2017)). We create text anomaly datasets Larson et al. (2019); Ruff et al. (2019b) in the fashion of a one-class classification setup. For each dataset we study, one class is used as normal and all the other classes are considered anomalies. Examples from the normal class are relabelled as "normal" and examples from the remaining classes are relabelled as "anomaly". Detailed descriptions of how the data for the unknown class detection task are generated in each domain are as follows. News Topic Categorization is a topic classification task. We use AG's news corpus Zhang et al. (2015), which groups the news into four categories: 'world', 'sports', 'business' and 'sci/tech'. The 'sports' topic is chosen as normal; the other topics are used as anomalies. Since the news contains both headlines and content, we use the headlines to detect text anomalies, removing invalid headline text to obtain the normal training, development, and test samples. Review Categorization is a task to predict the star rating a user has given based on their text review. We use the Yelp dataset Zhang et al. (2015). There are five labels in total, from 1 to 5 stars. The 5-star class is chosen as normal and the 1-4-star classes are used as anomalies. Since the dataset is large, we filter out text samples that are over 50 words or fewer than 3 words and use the filtered text as input, yielding the normal training, development, and test samples. Abstract Role Categorization is a task to predict the role of a text in abstracts. We use the RCT dataset Dernoncourt and Lee (2017), in which the role is labelled with five classes: 'background', 'objective', 'method', 'results' and 'conclusions'. The 'conclusions' role is considered normal, and the others are considered anomalies. We use the raw text to distinguish anomalies from normal samples.
To mimic the real situation where only a small number of anomalous texts exist, we follow a setting similar to the experiments in the previous section. Specifically, we adopt the same range of pollution proportions to generate the training datasets, and the same pollution setting is used for the development and test datasets. Anomaly samples are selected uniformly from all classes other than the normal class.
Result and analysis
Table 2 summarizes the anomaly detection performance in terms of MAP, Recall@5 and AUC on the polluted test data for the different approaches. We observe from the table that all of the unsupervised methods (OC-SVM, ISF, LOF, BERT-rank, OC-SVDD) exhibit decreasing performance as the training pollution increases, while our proposed deep AI-SVDD approach demonstrates increasing performance as the pollution proportion grows. For the AG dataset, the LOF method with BERT embeddings achieves the highest performance on all metrics at low pollution levels, while our deep AI-SVDD approach beats all the others at higher training pollution levels. For the Yelp dataset, at low pollution levels the highest values for the different metrics are sparsely located among the various unsupervised methods, but our AI-SVDD approach shows consistently the highest performance on all metrics at higher training pollution levels. For the RCT dataset, our AI-SVDD method outperforms all other models in terms of MAP and AUC at higher pollution levels, and shows competitive performance in terms of Recall@5 compared with LOF.
Medical document anomaly detection
Lastly, we evaluate our approach on a real-world text data quality control task. Specifically, our target is to identify qualified question answering (QA) pairs from a set of multi-source, multi-form medical text data in Chinese. Disqualified samples include low-quality QA pairs (e.g., the question is incomplete or does not exhibit a clear intention) and non-QA data such as scientific essays. In this task, we focus only on the title or question part of the data. The original crowdsourced data contains over 10 million samples that need to be examined, but only a small portion of the data is labeled. This labeled multi-source data contains normal samples and anomaly samples drawn from several data sources.
To demonstrate the performance of the proposed approach and the competing methods on various test datasets, we use different random splits of the labeled data samples. To maintain a similar class ratio between training and test data, we combine a fixed proportion of the normal samples and anomaly samples as the training data, and the remaining normal and anomaly samples are used as the test data. We denote each random split as one Monte-Carlo (MC) run.
Result and analysis
Table 3 shows the mean and standard deviation of the anomaly detection performance of the various methods in terms of MAP, Recall@5 and AUC over independent MC runs. We observe from the table that the traditional outlier detection approaches (OC-SVM, ISF, LOF), even with BERT embeddings, do not show promising results on any metric, especially OC-SVM. The BERT-rank method competes with the traditional approaches and shows improved performance. In addition, the deep OC-SVDD method further improves upon BERT-rank in terms of MAP, Recall@5 and AUC. Moreover, our proposed AI-SVDD approach outperforms all comparing methods on all metrics in this medical text anomaly detection task; in particular, it improves the average MAP, Recall@5 and AUC compared with the OC-SVDD approach.
Discussion and future work
We want to emphasize that our approach is not a purely supervised classification method but a variant of the deep SVDD approach (which is purely unsupervised). The main idea is that when a small number of negative examples (which belong to the anomaly class and should be rejected) are available, they can be incorporated into training to improve the data description. Although the objective uses a small amount of anomaly labels $y$, our objective still focuses on one class and fits a good data description to capture its distribution. In the proposed experiments, we only compare our approach with traditional outlier detection approaches and several other deep SVDD approaches. Due to space limitations, and because it is slightly beyond the focus of this work, we did not compare our approach with classification approaches; this remains future work. We believe that the proposed method can beat the performance of a classification approach, especially in the scenario where the majority of the negative test samples are unobserved and not drawn from the same distribution as in training (see Figure 1 for an illustrative example). Because the negative samples neither share a consistent pattern nor come from a single distribution, it is hard to model both positive and negative samples to train a classifier.
In this work, we developed a novel deep AI-SVDD model that aims to discriminate normal-class data from anomalies or outliers. We carefully examined the deep OC-SVDD objective and proposed a revision with a center-updating mechanism. To tackle the text anomaly detection task, we proposed a systematic deep model with a solution for the constrained minimization. We employed several baseline methods and developed a competing deep OC-SVDD model from scratch. To evaluate our proposed approach, we conducted experiments on three different applications with textual datasets. The experimental results demonstrate that our proposed AI-SVDD approach is promising and provides competitive results when the data is polluted. AI-SVDD outperforms the various comparing approaches on the three proposed applications.
We would like to give our special thanks to Xingyuan Pan, who received his Ph.D. from the University of Utah and is now an applied scientist at Amazon. He provided meaningful suggestions and tremendous help with this work.
Appendix A Appendix
For simplicity, we omit the range information in sum expressions in the discussions below, i.e., for a sample index $i$, $\sum_i$ stands for $\sum_{i=1}^{n}$, and for a layer index $l$, $\sum_l$ stands for $\sum_{l=1}^{L}$.
Proof of equality between Eqs. (2) & (3) in main paper
Since $c^* = \frac{1}{n}\sum_i \phi(x_i; \mathcal{W})$ in Eq. (2) in the main paper, after plugging it back into Eq. (2) we can obtain:

$$\min_{\mathcal{W}} \; \frac{1}{n} \sum_i \Big\lVert \phi(x_i; \mathcal{W}) - \frac{1}{n}\sum_j \phi(x_j; \mathcal{W}) \Big\rVert^2 + \frac{\lambda}{2} \sum_l \lVert W^l \rVert_F^2. \tag{10}$$

Let $z_i = \phi(x_i; \mathcal{W})$ and $\bar{z} = \frac{1}{n}\sum_i z_i$. Denote $J_1$ the first term of (10); then it can be simplified as:

$$J_1 = \frac{1}{n}\sum_i \lVert z_i \rVert^2 - \lVert \bar{z} \rVert^2.$$

Since $\sum_{i}\sum_{j} \lVert z_i - z_j \rVert^2 = 2n \sum_i \lVert z_i \rVert^2 - 2 \lVert \sum_i z_i \rVert^2$, substituting it into the first term of (10) yields:

$$J_1 = \frac{1}{2n^2} \sum_{i}\sum_{j} \lVert z_i - z_j \rVert^2.$$

Therefore, with $c = c^*$, minimizing Eq. (2) is equivalent to minimizing Eq. (3) in the main paper.
Proof of equality between Eqs. (5) & (6) in main paper
Since in Eq. (5) in the main paper, after plugging it back into Eq. (6), we obtain:
If we denote , and , we can simplify it as follows:
Since , and , substituting them into the first term of (A) yields:
Therefore, with , minimizing Eq. (5) is equivalent to minimizing Eq. (6) in the main paper.
We tune the hyper-parameters of the deep OC-SVDD network by setting the hidden size , latent size , batch size , and regularization term . For our proposed AI-SVDD network, we search over a larger range for each hyper-parameter: hidden size , latent size , and batch size . Following the cross-validation mechanism, we find the optimal hyper-parameters for each application using the non-polluted training data, with the corresponding development data for validation. For all three applications, the learning rate is fixed as and the number of epochs is fixed to be . For the text topic change application, the optimal setting for the deep OC-SVDD network is and the optimal setting for our deep AI-SVDD network is . For the text anomaly detection application, the optimal setting for the deep OC-SVDD network on the AG dataset is , and on both the Yelp and RCT-20k datasets it is . The optimal setting for our deep AI-SVDD network with the AG dataset is , with the Yelp dataset is , and with the RCT-20k dataset is . For the medical document title anomaly detection application, the optimal setting for the deep OC-SVDD network is and the optimal setting for our deep AI-SVDD network is .
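The tuning procedure above (a grid over hidden size, latent size, and batch size, validated on the development split) can be sketched generically. Here `train_and_score` is a hypothetical callback standing in for model training plus development-set evaluation; the grid values are placeholders, not the paper's (elided) ones.

```python
import itertools

def grid_search(train_and_score, grid):
    """Exhaustive search over a hyper-parameter grid.

    train_and_score : callable taking a config dict, returning a
                      development-set score (higher is better)
    grid            : dict mapping hyper-parameter name -> list of values
    """
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_score(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

In practice the callback would train the OC-SVDD or AI-SVDD network on the non-polluted training data and score it on the development data.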
Appendix B Supplemental Material
Case study–I: Vanilla network
For preliminary exploration purposes, we consider a very simple neural network with a single linear layer. Let be the weight to be learned in Eq. (1) in the main paper and ; then we have:
An optimal solution of Eq. (12) can be obtained by .
Similarly, for Eq. (3) in the main paper, let be the weight to be learned and ; then problem (3) is equivalent to:
Taking the derivative of Eq. (13) w.r.t. and setting the gradient to zero, we have:
Let ; then an optimal solution of Eq. (13), , falls into the null space of . However, since is full-rank, the optimal in this case can only be zero. In other words, the network learns nothing and simply projects all data points onto a single point.
To avoid the aforementioned problem, we consider replacing the regularization term in Eq. (13) with a constraint instead; the problem can then be redefined as:
Since , by applying the Cauchy-Schwarz inequality of Bhatia and Davis (1995), we can easily show that the transformed feature points () have a more compact enclosure than the original data points.
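The collapse argument can be checked numerically. The sketch below, with made-up data, shows that the L2-penalized objective attains its global minimum at the zero weight, while the unit-norm constraint forces a non-trivial direction (the smallest-eigenvalue direction of the sample covariance) whose projection is more compact than the original cloud.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])  # anisotropic 2-D cloud
Xc = X - X.mean(axis=0)                               # center the data

def regularized_obj(w, lam=0.1):
    # mean squared distance of projected points to the projected mean
    # (zero after centering), plus an L2 penalty as in Eq. (13)
    return np.mean((Xc @ w) ** 2) + lam * np.dot(w, w)

# With the constraint ||w|| = 1 instead of the penalty, the minimizer is
# the smallest-eigenvalue eigenvector of the sample covariance: no collapse.
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
w_star = eigvecs[:, 0]                  # unit-norm constrained minimizer
```

The trivial solution `w = 0` drives the penalized objective to zero, whereas `w_star` keeps a genuine transformation while still shrinking the enclosure.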
Case study–II: Comparison of different objectives
In this part, we aim to illustrate the differences between the one-class objectives in Eqs. (12) & (14) and compare them with our proposed AI-SVDD objective. Following the vanilla network in Eq. (12), we again employ a simple single-linear-layer network ; the objective (6) then becomes:
To compare and illustrate the differences among the objectives, we generate a small number of data points from a Gaussian distribution. Next, we obtain the optimal linear transformation for Eq. (12), and solve the constrained optimization in Eq. (14) with the optimal and . Minimizing Eq. (B) w.r.t. , we also obtain the optimal transformation and the corresponding optimal center . To simulate anomaly points, we randomly generate points sampled from a circle centered at with radius . To explore the discriminative power in a harder and more realistic situation, we carefully design the anomaly points to be close to some of the edge points (located 2 away from the Gaussian center).
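The toy setup can be reproduced with a sketch like the following; the sample sizes, the Gaussian parameters, and the circle radius are illustrative stand-ins, since the exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Normal class: a small 2-D Gaussian cloud (illustrative size and scale).
n_normal = 50
normal = rng.normal(loc=0.0, scale=1.0, size=(n_normal, 2))

# Anomalies: points sampled on a circle around the Gaussian center, so
# that they lie close to the edge points of the normal cloud.
n_anom = 10
radius = 2.0  # hypothetical radius
angles = rng.uniform(0.0, 2 * np.pi, size=n_anom)
anomalies = radius * np.column_stack([np.cos(angles), np.sin(angles)])
```

Fitting the objectives of Eqs. (12), (14), and (B) on `normal` (and, for the AI-SVDD objective, on a held-out sample of `anomalies`) then reproduces the comparison shown in Figure 4.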
In Figure 4, we illustrate the original data space in the top-left sub-figure, where the blue dots denote normal points and the green crosses denote anomaly points. The linearly transformed data points ( s and s, where s denote the anomaly points) are shown in the top-right of Figure 4, and the linearly transformed data points ( s and s, where is obtained by minimizing Eq. (14)) are shown in the bottom-left. In Figure 4, the data hypersphere of Eq. (14) in the feature space, , is much smaller than the data hypersphere of Eq. (12), . In addition, the ratio between the average squared distance from the anomaly points to the center and that of the normal points is larger for Eq. (14) than for Eq. (12).
Since the AI-SVDD objective includes anomaly data in training, we additionally generate anomaly points at random from the same anomaly distribution and use them for model training. In Figure 4, we plot the transformed points in the bottom-right sub-figure. The data hypersphere of Eq. (B) in the feature space, , is very close to the data hypersphere of Eq. (14). In addition, the ratio between the average squared distance from the anomaly points to the center and that of the normal points is even larger for Eq. (B) than for Eq. (14).
Hence, our proposed AI-SVDD model learns a transformation that yields a more compact enclosing data hypersphere (compared to Eq. (12)) and, at the same time, discriminates the anomalies from the normal points with a larger distance ratio (compared to Eq. (14)). Since the formulation in Eq. (5) of the main paper naturally places the anomalies farther from the center than the normal points, the distance between the anomalies and the normal points becomes larger.
Illustration of the center difference
To illustrate the difference between the centers of Eq. (5) and Eq. (4) in the main paper, and to show how the labels affect the center, we randomly generate Gaussian points (normal-class points) from , and randomly sample bad data points from , which are far from the normal-class points. Meanwhile, we also randomly generate anomaly-class points: points from , points from , a point from , and a point from . In Figure 5, the blue dots denote the normal class, the green dots denote the bad data in the normal class, and the green crosses denote the anomaly class.
The one-class center in Eq. (4) is the plain average of all feature points, shown as the red dot. The binary-class center in Eq. (5) is not a plain average; it is shown as the purple dot, while the yellow dot is the ground-truth normal-class center. From Figure 5, we can observe that the one-class center is far from the ground truth, which means it is sensitive to the bad points; such bad points may occur naturally, since the data itself might already contain a small number of anomalies or outliers. Moreover, if we use the one-class center to perform detection, the performance would be low, since it cannot distinguish some of the anomalies from the normal class; see Figure 5 for an example. In contrast, since the binary-class center is computed from the difference between all of the normal-class points and all of the anomaly points, points that are close to each other but carry different labels do not contribute to the final center. Therefore, the center in the binary-class objective Eq. (5) is more robust to outliers in the data than the one-class center in Eq. (4) in the main paper.
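The robustness argument can be checked with a small numerical sketch (the data layout and sizes are illustrative): the plain average is dragged toward mislabelled outliers, while the signed average lets nearby points with opposite labels cancel.

```python
import numpy as np

rng = np.random.default_rng(1)

# Normal-class points around the origin, plus a few "bad" points far away
# that leaked into the normal-labelled data.
normals = rng.normal(size=(100, 2))
bad = np.full((5, 2), 8.0)            # outliers mislabelled as normal
pos = np.vstack([normals, bad])

# A few labeled anomalies placed near the bad points (illustrative choice).
neg = rng.normal(loc=8.0, scale=0.5, size=(5, 2))

# One-class center (Eq. (4) style): plain average of all positive points.
c_one = pos.mean(axis=0)

# Binary-class center (Eq. (5) style): signed average, so nearby points
# with opposite labels cancel instead of dragging the center away.
c_bin = (pos.sum(axis=0) - neg.sum(axis=0)) / (len(pos) + len(neg))
```

Comparing both centers against the ground-truth normal-class center (the origin here) shows the signed center staying substantially closer.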
- Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759–760. Cited by: Introduction.
- Outlier detection. In Data mining and knowledge discovery handbook, pp. 131–146. Cited by: Introduction.
- A cauchy-schwarz inequality for operators with applications. Linear algebra and its applications 223, pp. 119–129. Cited by: Appendix B.
- LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104. Cited by: Introduction, item ii.
- Cross-validation methods. Journal of mathematical psychology 44 (1), pp. 108–132. Cited by: The system and training.
- Convex optimization: algorithms and complexity. arXiv preprint arXiv:1405.4980. Cited by: The system and training.
- Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: Introduction.
- Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 1–58. Cited by: Introduction.
- Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009. Cited by: Introduction.
- PubMed 200k rct: a dataset for sequential sentence classification in medical abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 308–313. Cited by: Dataset.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: item i.
- Anomaly detection in cyber physical systems using recurrent neural networks. In 2017 IEEE 18th International Symposium on High Assurance Systems Engineering (HASE), pp. 140–145. Cited by: Introduction.
- An optimization model for outlier detection in categorical data. In International Conference on Intelligent Computing, pp. 400–409. Cited by: Introduction.
- Learning multiple layers of features from tiny images. Cited by: Related Work.
- Study on the effectiveness of anomaly detection for spam filtering. Information Sciences 277, pp. 421–444. Cited by: Introduction.
- Outlier detection for improved data quality and diversity in dialog systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 517–527. Cited by: Related Work, item i, Dataset.
- The mnist database. URL http://yann.lecun.com/exdb/mnist. Cited by: Related Work.
- Partially supervised classification of text documents. In ICML, Vol. 2, pp. 387–394. Cited by: Related Work.
- Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. Cited by: item ii.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Cited by: Dataset.
- Contextual anomaly detection in text data. Algorithms 5 (4), pp. 469–489. Cited by: Related Work.
- One-class svms for document classification. Journal of Machine Learning Research 2 (Dec), pp. 139–154. Cited by: Introduction, Related Work, item ii.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: Dataset.
- Twitter spammer detection using data stream clustering. Information Sciences 260, pp. 64–73. Cited by: Introduction.
- Treating dialogue quality evaluation as an anomaly detection problem. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 508–512. Cited by: Related Work.
- One-class convolutional neural network. IEEE Signal Processing Letters 26 (2), pp. 277–281. Cited by: Introduction.
- Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Cited by: Dataset.
- Analysis on credit card fraud detection methods. In 2011 International Conference on Computer, Communication and Electrical Technology (ICCCET), pp. 152–156. Cited by: Introduction.
- Deep semi-supervised anomaly detection. In International Conference on Learning Representations, Cited by: Related Work.
- Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: Introduction, Related Work, Background and motivation, Figure 3, Experiments.
- Self-attentive, multi-context one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4061–4071. Cited by: Related Work, Dataset.
- Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3379–3388. Cited by: Introduction.
- Anomaly detection in online social networks. Social Networks 39, pp. 62–70. Cited by: Introduction.
- Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6479–6488. Cited by: Introduction.
- Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: Introduction, Related Work.
- IoT healthcare analytics: the importance of anomaly detection. In 2016 IEEE 30th international conference on advanced information networking and applications (AINA), pp. 994–997. Cited by: Introduction.
- Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL student research workshop, pp. 88–93. Cited by: Introduction.
- Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: Dataset.
- Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 665–674. Cited by: Introduction.
- Identifying semantically deviating outlier documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2748–2757. Cited by: Related Work.