Weakly-supervised Deep Anomaly Detection with Pairwise Relation Learning

10/30/2019 ∙ by Guansong Pang, et al.

This paper studies a rarely explored but critical anomaly detection problem: weakly-supervised anomaly detection with limited labeled anomalies and a large unlabeled data set. This problem is very important because it (i) enables anomaly-informed modeling, which helps identify anomalies of interest and address the notoriously high false-positive rates of unsupervised anomaly detection, and (ii) eliminates the reliance on large-scale and complete labeled anomaly data in fully-supervised settings. However, the problem is especially challenging since we have only limited labeled data for a single class, and moreover, the seen anomalies often cannot cover all types of anomalies (i.e., unseen anomalies). We address this problem by formulating it as a pairwise relation learning task. Particularly, our approach defines a two-stream ordinal regression network to learn the relation of randomly selected instance pairs, i.e., whether the instance pair contains labeled anomalies or just unlabeled data instances. The resulting model leverages both the labeled and unlabeled data to effectively augment the data and learn generalized representations of both normality and abnormality. Extensive empirical results show that our approach (i) significantly outperforms state-of-the-art competing methods in detecting both seen and unseen anomalies and (ii) is substantially more data-efficient.








Anomaly detection aims at identifying exceptional data instances that deviate significantly from the majority of data instances, which can offer important insights into broad applications, such as identifying fraudulent transactions or insider trading, detecting network intrusions, and early detection of diseases. Numerous anomaly detection methods have been introduced [Aggarwal2017a, Chen et al.2017, Schlegl et al.2017, Zenati et al.2018, Pang et al.2018, Ruff et al.2018], of which most are unsupervised methods. The popularity of the unsupervised methods is mainly because they avoid the prohibitive cost of labeling large-scale anomaly data. However, since they do not have any prior knowledge of the anomalies of interest, many anomalies they identify are data noise or uninteresting data instances, leading to high false positives or low detection recall.

To address this issue, we study a rarely explored but critical anomaly detection problem, i.e., weakly-supervised anomaly detection with only a few (e.g., several to dozens of) labeled anomalies and a large-scale unlabeled data set. The limited labeled anomalies can be leveraged as our prior knowledge of the anomalies to learn anomaly-informed detection models that reduce false positives [Aggarwal2017b]. Compared to fully-supervised anomaly detection techniques that require large-scale and complete labeled anomaly data, weakly-supervised anomaly detection is much more practical: although labeling large-scale anomaly data that covers all types of anomalies is too costly (if not impossible), a small set of labeled anomalies can often be made available at affordable or trivial cost in many real-world anomaly detection applications. These labeled anomalies may come from a deployed detection system, e.g., a few successfully detected network intrusion records, or they may be from end-users, such as a small number of fraudulent credit card transactions that are reported by bank clients.

However, this problem is especially challenging. This is because we have only limited labeled data for the anomaly class, and moreover, anomalies typically stem from unknown events. Therefore, the limited labeled anomalies often, if not always, cannot cover all types of anomalies. As a result, when these limited anomalies are used as supervision information, it is very challenging to generalize detection models to detect both seen and unseen anomalies (i.e., new types of anomalies that are unseen during training).

There have been a few early explorations in this research line using traditional methods [McGlohon et al.2009, Tamersoy et al.2014, Zhang et al.2018] or recently emerged deep anomaly detection methods [Pang et al.2018], but they leverage the labeled anomalies as auxiliary data to enhance an existing anomaly measure or perform a classification-based anomaly detection. This leads to insufficient exploitation of the labeled data and/or poor generalization to unseen anomalies. (Deep anomaly detection methods refer to any methods that exploit deep learning techniques [Goodfellow et al.2016] to learn feature representations or anomaly scores for anomaly detection.)

Additionally, note that well established semi-supervised anomaly detection [Noto et al.2012, Görnitz et al.2013, Ienco et al.2017] focuses on learning patterns of the normal class using labeled normal data, which is fundamentally different from our task in terms of the given data and problem. These differences are summarized in Table 1 (see Related Work for detailed discussions).

| Approach | Training Data | Problem |
|---|---|---|
| Semi-supervised | Large normal data, large unlabeled data | Learn patterns of labeled normal data |
| Weakly-supervised | Limited (incomplete) anomaly data, large unlabeled data | Learn patterns of labeled anomalies that also generalize well to unseen anomalies |

Table 1: Weakly- and Semi-supervised Approaches

This paper introduces a novel deep anomaly detection formulation and its instantiation, namely PRIor knowledge-driven Ordinal Regression network (PRIOR), for the weakly-supervised setting. We formulate the problem as a pairwise relation learning task, in which a two-stream ordinal regression network is defined to learn the relation of instance pairs, i.e., to discriminate whether an instance pair contains labeled anomalies or just unlabeled instances. Particularly, PRIOR first generates unordered pairs of instances from the small labeled anomaly set and a large unlabeled data set, and then augments the pairs with a synthetic ordinal class feature, in which larger scalar class labels are assigned to the instance pairs that contain at least one labeled anomaly than to the other pairs. PRIOR further feeds these pair samples into the regression network to directly learn their anomaly scores. At the testing stage, PRIOR pairs test instances with the training instances and uses the regression network to infer their anomaly scores.

Unlike previous deep methods [Chen et al.2017, Zong et al.2018, Ruff et al.2018] that aim to improve existing anomaly measures with new feature representations, PRIOR uses the added ordinal feature to directly learn an anomaly measure, achieving more efficient use of the labeled data. PRIOR assigns large anomaly scores to instance pairs when they share similar characteristics to labeled anomalies or deviate from the interactions between unlabeled instance pairs. This helps identify both seen and unseen anomalies. Accordingly, this paper makes the following main contributions.

  • We propose a novel formulation that combines tailored data augmentation and ordinal regression to perform weakly-supervised deep anomaly detection via pairwise relation learning. This yields anomaly-informed detection models with good generalizability.

  • A novel method, namely PRIOR, is instantiated from our formulation to directly learn anomaly scores, which optimizes the scores in an end-to-end fashion, achieving well optimized anomaly scores and data-efficient learning.

Our empirical results on nine real-world data sets show that PRIOR (i) significantly outperforms four state-of-the-art contenders, e.g., at least 27% average improvement in precision-recall rates, (ii) obtains substantially better data efficiency, e.g., it requires 50%-88% fewer labeled anomalies to perform comparably well to, or substantially better than, the best contenders, and (iii) achieves substantially better generalizability to unseen anomalies than its contenders.

Pairwise Relation Learning

Problem Formulation

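Given a set of training data instances $\mathcal{X} = \{x_1, \dots, x_N, x_{N+1}, \dots, x_{N+K}\}$ with $x_i \in \mathbb{R}^D$, where $\mathcal{U} = \{x_1, \dots, x_N\}$ is a large unlabeled data set and $\mathcal{A} = \{x_{N+1}, \dots, x_{N+K}\}$ with $K \ll N$ is a very small set of labeled anomalies that provide some prior knowledge of anomalies, our goal is to learn an anomaly scoring function $\phi: \mathcal{X} \to \mathbb{R}$ that assigns anomaly scores to data instances such that $\phi(x_i) > \phi(x_j)$ if $x_i$ is an anomaly and $x_j$ is a normal data instance.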

To obtain more labeled data and perform an end-to-end anomaly score learning, we formulate the problem as a pairwise relation learning task, in which we aim to assign substantially larger anomaly scores to the data instance pairs that contain at least one anomaly than to the other instance pairs. This can be achieved by ordinal regression techniques [Gutiérrez et al.2016]. Specifically, let $\mathcal{P} = \{(x_i, x_j, y_{ij})\}$ be a set of unordered pairs of instances with artificial class labels, where each pair $\{x_i, x_j\}$ belongs to one of three possible combinations: $\{a, a\}$, $\{a, u\}$ and $\{u, u\}$ ($a \in \mathcal{A}$ and $u \in \mathcal{U}$), and $y$ is an ordinal class feature with decreasing value assignments to the respective $\{a, a\}$, $\{a, u\}$ and $\{u, u\}$ pairs, i.e., $y_{\{a,a\}} > y_{\{a,u\}} > y_{\{u,u\}}$. The learning of anomaly scores can then be transformed into learning a ternary ordinal regression function $\phi: \mathcal{P} \to \mathbb{R}$.

We then instantiate the formulation into the method PRIOR for deep anomaly detection. As shown in Figure 1, PRIOR consists of three modules: data augmentation, end-to-end anomaly score learner and ordinal regression. The data augmentation generates the labeled instance pair set $\mathcal{P}$. The anomaly scoring function $\phi$ is a composition of a feature learner $\psi$ and an anomaly score learner $\eta$, which can be trained in an end-to-end manner with an ordinal regression loss function.

Figure 1: PRIOR - Prior Knowledge-driven Two-stream Ordinal Regression Networks. It takes $\{a, a\}$, $\{a, u\}$ and $\{u, u\}$ instance pairs as inputs and learns to assign larger anomaly scores to the $\{a, a\}$ and $\{a, u\}$ instance pairs, which contain at least one labeled anomaly, than to the $\{u, u\}$ pairs.

Tailored Data Augmentation

Different from popular editing-based augmentation [Zhang and LeCun2015, Perez and Wang2017], two augmentation methods tailored for our problem are used to substantially extend the labeled data: (i) we first generate a set of instance pairs with instances randomly sampled from the small labeled anomaly set $\mathcal{A}$ and the large unlabeled data set $\mathcal{U}$, and categorize the pairs into three classes $\{a, a\}$, $\{a, u\}$ and $\{u, u\}$ based on the sources that the instances of each pair are sampled from, where $a \in \mathcal{A}$ and $u \in \mathcal{U}$; and (ii) a synthetic ordinal class feature $y$ is then added, in which the instance pairs of the three classes are assigned scalar class labels such that $y_{\{a,a\}} = c_1$, $y_{\{a,u\}} = c_2$, and $y_{\{u,u\}} = c_3$ with $c_1 > c_2 > c_3$. By doing so, we efficiently synthesize $\mathcal{A}$ and $\mathcal{U}$ to produce a large labeled data set $\mathcal{P}$.

More importantly, the pair set $\mathcal{P}$ contains critical information for discriminating anomalies from normal data instances. Since anomalies are rare data instances, the unlabeled data set $\mathcal{U}$ is often dominated by normal data instances. As a result, most $\{u, u\}$ pairs consist of normal instances only. Thus, with $\mathcal{P}$ as data inputs, PRIOR is fed with training samples that contain key information for discriminating the $\{a, a\}$ and $\{a, u\}$ pairs, which contain at least one anomaly, from the $\{u, u\}$ pairs consisting of normal instances only. Note that a few $\{u, u\}$ pairs may contain anomalies due to potential anomaly contamination in $\mathcal{U}$, leading to some noisy pairs in the training data, but we found that the end-to-end learning of anomaly scores enabled PRIOR to be trained very data-efficiently, which effectively overcame the negative effects brought by these noisy pairs.
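The pair generation described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_pairs`, the equal number of pairs per class, and the label values 2/1/0 are all hypothetical. The only property taken from the text is that the three labels decrease from anomaly-anomaly pairs to unlabeled-unlabeled pairs.

```python
import random

# Hypothetical ordinal labels. The text only requires that the labels
# decrease from {a,a} to {a,u} to {u,u} with clear margins between them.
LABELS = {"aa": 2.0, "au": 1.0, "uu": 0.0}


def make_pairs(anomalies, unlabeled, n_per_class, seed=0):
    """Build labeled instance pairs (x_i, x_j, y) for the three pair classes."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_per_class):
        # {a, a}: both instances come from the labeled anomaly set.
        a1, a2 = rng.sample(anomalies, 2)
        pairs.append((a1, a2, LABELS["aa"]))
        # {a, u}: one labeled anomaly and one unlabeled instance,
        # stored with the anomaly first.
        pairs.append((rng.choice(anomalies), rng.choice(unlabeled), LABELS["au"]))
        # {u, u}: both instances come from the unlabeled set.
        u1, u2 = rng.sample(unlabeled, 2)
        pairs.append((u1, u2, LABELS["uu"]))
    return pairs


# Toy data: 3 labeled anomalies and 20 unlabeled instances in 2 dimensions.
anoms = [[9.0, 9.0], [8.0, 9.5], [9.5, 8.0]]
unl = [[float(i), float(i % 3)] for i in range(20)]
pairs = make_pairs(anoms, unl, n_per_class=5)
print(len(pairs))  # 15 pairs built from only 3 labeled anomalies
```

Even with three labeled anomalies, two thirds of the generated pairs carry anomaly information, which is the sense in which the augmentation enlarges the labeled data.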

End-to-end Anomaly Score Learner

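An end-to-end anomaly score learner is then defined to take pairs of data instances as inputs and directly output the anomaly scores of the pairs. Let $\mathcal{Z} \subseteq \mathbb{R}^M$ be an intermediate representation space. We define a two-stream anomaly scoring network $\phi((\cdot, \cdot); \Theta): (\mathcal{X}, \mathcal{X}) \to \mathbb{R}$ as a composition of a feature learner $\psi(\cdot; \Theta_r): \mathcal{X} \to \mathcal{Z}$ and an anomaly scoring function $\eta((\cdot, \cdot); \Theta_s): (\mathcal{Z}, \mathcal{Z}) \to \mathbb{R}$, where $\Theta = \{\Theta_r, \Theta_s\}$. Specifically, $\psi$ is a neural feature learner with $H$ hidden layers and weight matrices $\Theta_r = \{W^1, \dots, W^H\}$:

$$z = \psi(x; \Theta_r), \tag{1}$$

where $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. Different network structures can be used here, such as multilayer perceptrons for multidimensional data, convolutional networks for image data, or recurrent networks for sequence data [Goodfellow et al.2016].

$\eta$ is defined as an anomaly score learner which uses a linear neural unit in the output layer to compute the anomaly scores based on the intermediate representations:

$$\eta((z_i, z_j); \Theta_s) = \sum_{k=1}^{M} w_k z_{ik} + \sum_{l=1}^{M} w_{M+l} z_{jl} + w_{2M+1}, \tag{2}$$

where $z_i, z_j \in \mathcal{Z}$, $(z_i, z_j)$ is a concatenation operation of $z_i$ and $z_j$, and $\Theta_s = \{w\}$ ($w_{2M+1}$ is the bias term). As shown in Figure 1, to reduce the optimization complexity, PRIOR uses a two-stream network with the shared weight parameters $\Theta_r$ to learn the representations $z_i$ and $z_j$.

Lastly, $\phi$ can be formally represented as

$$\phi((x_i, x_j); \Theta) = \eta\big((\psi(x_i; \Theta_r), \psi(x_j; \Theta_r)); \Theta_s\big), \tag{3}$$

which can be trained in an end-to-end fashion to directly map original data inputs to scalar anomaly scores.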

Note that the concatenation results in an ordered pair $(z_i, z_j)$. This does not affect the unordered pairs from the $\{a, a\}$ and $\{u, u\}$ classes since the instances of these pairs come from the same data set (i.e., $\mathcal{A}$ or $\mathcal{U}$), but it may have adverse effects on our training with the $\{a, u\}$ pairs. We address this problem by transforming all unordered $\{a, u\}$ pairs into ordered pairs $(a, u)$ before training.
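The forward pass of such a two-stream scorer can be illustrated with a small pure-Python sketch. It assumes a single hidden ReLU layer with a bias vector (the hidden-layer bias and all names such as `feature_learner` and `pair_score` are assumptions made for illustration) and random, untrained weights, so the printed scores carry no meaning beyond showing the data flow.

```python
import random


def relu(v):
    return [max(0.0, t) for t in v]


def feature_learner(x, W, b):
    """One hidden ReLU layer mapping a D-dim input to an M-dim representation."""
    return relu([sum(w_k * x_k for w_k, x_k in zip(row, x)) + b_m
                 for row, b_m in zip(W, b)])


def pair_score(x_i, x_j, W, b, w_out, bias_out):
    """Shared-weight features of both instances, concatenated, then a linear unit."""
    z = feature_learner(x_i, W, b) + feature_learner(x_j, W, b)  # list concatenation
    return sum(w * z_k for w, z_k in zip(w_out, z)) + bias_out


rng = random.Random(0)
D, M = 4, 3                                   # input and representation sizes
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
b = [0.0] * M
w_out = [rng.uniform(-1, 1) for _ in range(2 * M)]

x_a = [1.0, 0.5, -0.2, 2.0]
x_u = [0.1, 0.0, 0.3, -1.0]
print(pair_score(x_a, x_u, W, b, w_out, 0.0))
print(pair_score(x_u, x_a, W, b, w_out, 0.0))  # generally differs: the input pair is ordered
```

Because the output unit has separate weights for the first and second halves of the concatenation, swapping the two inputs usually changes the score, which is why a fixed ordering of mixed pairs is needed.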

Loss Functions for Ordinal Regression

PRIOR then feeds the ordinal class labels $y$ and the anomaly scores output by $\phi$ into an ordinal regression objective to optimize the scores, which aims to assign substantially larger anomaly scores to the $\{a, a\}$ and $\{a, u\}$ instance pairs than to the $\{u, u\}$ instance pairs. This fulfills a prior knowledge-driven direct optimization of the anomaly scores, which is expected to yield more data-efficient learning and better-optimized anomaly scores than the current deep methods that focus on optimizing feature representations in an unsupervised way.

Let $y_{ij}$ be the ordinal label of an instance pair $(x_i, x_j)$. We define the loss as below to guide the optimization:

$$L\big(\phi((x_i, x_j); \Theta), y_{ij}\big) = \big|y_{ij} - \phi((x_i, x_j); \Theta)\big|. \tag{4}$$

The absolute loss is used to reduce the effect of potential noisy pairs due to the anomaly contamination in $\mathcal{U}$. By default, the class labels of the three pair types are set to values with large gaps between them, to enforce large margins among the anomaly scores of the three types of instance pairs. PRIOR also works very well with other value assignments as long as there are reasonably large margins among the ordinal class labels.

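Therefore, the overall objective function can be written as

$$\arg\min_{\Theta} \; \frac{1}{|\mathcal{B}|} \sum_{(x_i, x_j, y_{ij}) \in \mathcal{B}} \big|y_{ij} - \phi((x_i, x_j); \Theta)\big| + \lambda R(\Theta), \tag{5}$$

where $\mathcal{B}$ is a sample batch from $\mathcal{P}$ and $R(\Theta)$ is a regularization term with the hyperparameter $\lambda$. These optimization settings are provided in the experiments section.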
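As a small worked illustration of this objective, the sketch below evaluates the batch-averaged absolute loss plus a regularization term for given pair scores. The function name, the toy numbers, and in particular the choice of a squared L2 penalty for the regularizer are assumptions made only for this example.

```python
def ordinal_regression_objective(scores, labels, weights, lam):
    """Mean absolute deviation between pair scores and ordinal labels, plus lam * R."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    data_term = sum(abs(y - s) for s, y in zip(scores, labels)) / len(scores)
    # Assumption: R is a squared L2 penalty over all network weights.
    reg_term = sum(w * w for w in weights)
    return data_term + lam * reg_term


# Three pairs, one per class, with hypothetical labels 2, 1 and 0.
scores = [1.8, 1.1, 0.3]
labels = [2.0, 1.0, 0.0]
value = ordinal_regression_objective(scores, labels, weights=[0.5, -0.5], lam=0.01)
print(round(value, 3))  # 0.205
```

With an absolute loss, a mislabeled pair whose score is off by 2 contributes 2 to the sum, whereas a squared loss would contribute 4, which is one way to see why the absolute loss is less sensitive to noisy pairs.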

Anomaly Detection Using PRIOR

Training Stage

Algorithm 1 presents the procedure of training PRIOR. Step 1 first extends the data $\mathcal{X}$ into a set of instance pairs with ordinal class labels, $\mathcal{P}$. After a uniform Glorot weight initialization [Glorot and Bengio2010] in Step 2, PRIOR performs stochastic gradient descent (SGD) based optimization to learn $\Theta$ in Steps 3-9 and obtains the optimized $\phi$ in Step 10. Particularly, stratified random sampling is used in Step 5 to ensure the sample balance of the three classes in the batch $\mathcal{B}$. Step 6 performs the forward propagation of the network and computes the loss. Step 7 then uses the loss to perform gradient descent steps.

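Input: $\mathcal{X}$ with $\mathcal{A}$ and $\mathcal{U}$
Output: $\phi$ - an anomaly score mapping
1:  Augment the training data with $\mathcal{A}$ and $\mathcal{U}$ to produce $\mathcal{P}$
2:  Randomly initialize $\Theta$
3:  for each epoch do
4:     for each batch do
5:         $\mathcal{B}$ ← Randomly sample data instance pairs from $\mathcal{P}$
6:         Perform forward propagation on $\mathcal{B}$ and compute the loss
7:         Perform a gradient descent step w.r.t. the parameters in $\Theta$
8:     end for
9:  end for
10:  return $\phi$
Algorithm 1 Training PRIOR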
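Step 5 of the algorithm relies on stratified sampling to keep the three pair classes balanced in each batch. A minimal sketch of such a sampler is given below. The function name, the dictionary layout of the pair pool, and the choice to sample with replacement are assumptions for illustration, not details given in the text.

```python
import random


def stratified_batch(pairs_by_class, batch_size, rng):
    """Draw a batch with (nearly) equal numbers of pairs from each pair class."""
    classes = sorted(pairs_by_class)
    per_class, remainder = divmod(batch_size, len(classes))
    batch = []
    for k, name in enumerate(classes):
        n = per_class + (1 if k < remainder else 0)
        # Sample with replacement so that a small class can still fill its share.
        batch.extend(rng.choices(pairs_by_class[name], k=n))
    rng.shuffle(batch)
    return batch


# Toy pool of pairs keyed by class, each pair stored as (x_i, x_j, label).
pool = {
    "aa": [("a1", "a2", 2.0)],
    "au": [("a1", "u1", 1.0), ("a2", "u2", 1.0)],
    "uu": [("u1", "u2", 0.0), ("u2", "u3", 0.0), ("u1", "u3", 0.0)],
}
batch = stratified_batch(pool, batch_size=9, rng=random.Random(0))
print(sorted(label for _, _, label in batch))  # three pairs of each label
```

Without this balancing, uniformly sampled batches would be dominated by unlabeled-unlabeled pairs, since the unlabeled set is far larger than the labeled anomaly set.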

Testing Stage

At the testing stage, given a test instance $x_k$, PRIOR first pairs it with data instances randomly sampled from $\mathcal{A}$ and $\mathcal{U}$, and then defines its anomaly score as

$$s_{x_k} = \frac{1}{2E} \left[ \sum_{i=1}^{E} \phi((a_i, x_k); \Theta^*) + \sum_{j=1}^{E} \phi((x_k, u_j); \Theta^*) \right], \tag{6}$$

where $\Theta^*$ are the parameters of a trained $\phi$, and $a_i$ and $u_j$ are randomly sampled from the respective $\mathcal{A}$ and $\mathcal{U}$. $s_{x_k}$ can be interpreted as an ensemble of the anomaly scores of a set of $x_k$-oriented pairs. Due to the loss in Eqn. (4), $s_{x_k}$ is optimized to be greater than $s_{x_{k'}}$ given that $x_k$ is an anomaly and $x_{k'}$ is a normal instance. The ensemble scores are employed to achieve stable anomaly scoring. PRIOR can perform very stably as long as the ensemble size $E$ is sufficiently large, due to the central limit theorem. A fixed ensemble size is used here.
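The test-time scoring can be sketched as follows. The sketch pairs the test instance with randomly drawn labeled anomalies (anomaly first) and with randomly drawn unlabeled instances (test instance first) and averages the pair scores. The function name `anomaly_score`, the argument names, and the toy `pair_scorer` are hypothetical. In practice `pair_scorer` would be the trained network.

```python
import random


def anomaly_score(x, anomalies, unlabeled, pair_scorer, ensemble_size, seed=0):
    """Average the pair scores of x against sampled anomalies and unlabeled instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(ensemble_size):
        a = rng.choice(anomalies)   # labeled anomaly, placed first in the pair
        u = rng.choice(unlabeled)   # unlabeled instance, placed second in the pair
        total += pair_scorer(a, x) + pair_scorer(x, u)
    return total / (2 * ensemble_size)


# Toy stand-in for a trained pair scorer: larger feature sums give larger scores.
def toy_scorer(p, q):
    return sum(p) + sum(q)


labeled_anomalies = [[5.0]]
unlabeled_data = [[0.0]]
print(anomaly_score([5.0], labeled_anomalies, unlabeled_data, toy_scorer, ensemble_size=4))  # 7.5
print(anomaly_score([0.0], labeled_anomalies, unlabeled_data, toy_scorer, ensemble_size=4))  # 2.5
```

In this toy setting the anomaly-like instance receives the larger averaged score, which mirrors the ordering that the trained scorer is optimized to produce.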

Theoretical Foundation of PRIOR

Informative Data Augmentation

Augmenting $\mathcal{X}$ to $\mathcal{P}$ substantially increases not only the total number of training samples but also the proportion of labeled anomaly data. Specifically, given $K$ labeled anomalies and $N$ unlabeled data instances, the labeled anomaly data accounts for only $\frac{K}{N+K}$ of the original data. After our data augmentation, the labeled anomaly data (i.e., the instance pairs that contain at least one labeled anomaly) accounts for $\frac{n_{aa} + n_{au}}{|\mathcal{P}|}$, where $n_{aa}$ and $n_{au}$ denote the number of the $\{a, a\}$ and $\{a, u\}$ pairs, respectively. After some transformations, we can see that the augmented labeled anomaly data remarkably extends the original proportion.

More importantly, since PRIOR uses a two-stream network in its representation learner $\psi$, it still performs feature learning in the original data space, but with substantially more labeled anomaly data, which helps learn more expressive intermediate feature representations.

Optimizing Anomaly Scores with Regression

This section shows that the anomaly scores defined in Eqn. (6) are well optimized via the regression to ensure that anomalies have substantially larger anomaly scores than normal data instances. Specifically, given a test instance $x_k$, the first term in Eqn. (6) is equivalent to distinguishing the $\{a, a\}$ pairs from the $\{a, u\}$ pairs. Thus, the term enforces the pair of a labeled anomaly and $x_k$ to have an anomaly score close to the $\{a, a\}$ label $c_1$ if $x_k$ is an anomaly, and likewise, close to the $\{a, u\}$ label $c_2$ if $x_k$ is a normal instance. Since $c_1 > c_2$, this term guarantees that $x_k$ has a larger anomaly score when it is an anomaly than when it is normal. Similarly, the second term is equivalent to distinguishing the $\{a, u\}$ pairs from the $\{u, u\}$ pairs, and fulfills the same guarantee as the first term. An average aggregation of these two terms is used to achieve statistically stable scoring results.

Also, this ternary ordinal regression empowers PRIOR to learn the abstractions of not only the anomalous behaviors but also the normal behaviors, so PRIOR is expected to assign large anomaly scores to unseen anomalies as long as they clearly deviate from the normality abstraction.

The above analysis assumes that an unlabeled instance $u$ is mostly a normal instance because of the rarity of anomalies in real-world applications, and we found empirically that the SGD-based optimization could work very well with this assumption.


Data Sets

As shown in Table 2, nine widely-used publicly available real-world data sets are used, which are from diverse domains, e.g., intrusion detection, fraud detection, and disease detection [Moustafa and Slay2015, Liu et al.2012, Pang et al.2018, Randhawa et al.2018]. (donors and fraud are publicly available at https://www.kaggle.com/, w7a and news20 are available at https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/, backdoor is available at https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/, celeba is at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, and the other data sets are accessible at https://archive.ics.uci.edu/ml/datasets/.) Specifically, the donors data is taken from KDD Cup 2014 for predicting excitement of donation projects, with exceptionally exciting projects used as anomalies (6% of all data instances). The census data is extracted from the US census bureau database, in which we aim to detect the rare high-income persons (6%). fraud is for fraudulent credit card transaction detection, with fraudulent transactions (0.2%) as anomalies. celeba contains more than 200K celebrity images, each with 40 attribute annotations. We use the bald attribute as our detection target, in which the scarce bald celebrities (3%) are treated as anomalies and the other 39 attributes form the feature space. The backdoor data is for network attack detection with the backdoor attacks as anomalies (2.4%) against the ‘normal’ class, which is derived from a widely-used intrusion detection data set called UNSW-NB15 [Moustafa and Slay2015]. w7a is a web page classification data set, with the minority classes (3%) as anomalies. campaign is a data set of bank marketing campaigns, with the rare successful campaigning records (11.3%) as anomalies. news20 is one of the most popular text classification corpora, which is converted into anomaly detection data via random downsampling of the minority class (5%) based on [Liu et al.2012, Pang et al.2018]. thyroid is for disease detection, in which the anomalies are the hypothyroid patients (7.4%).

Four of these data sets contain real anomalies, including donors, fraud, backdoor and thyroid. The other five data sets contain semantically real anomalies, i.e., they are rare and very different from the majority of data instances. This provides a good testbed for anomaly detection evaluation.

| Data | Size | D | f1 | f2 | AUC-ROC: PRIOR | AUC-ROC: REPEN | AUC-ROC: DSVDD | AUC-ROC: FSNet | AUC-ROC: iForest | AUC-PR: PRIOR | AUC-PR: REPEN | AUC-PR: DSVDD | AUC-PR: FSNet | AUC-PR: iForest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| donors | 619,326 | 10 | 0.01% | 0.08% | 1.000±0.000 | 0.975±0.005 | 0.995±0.005 | 0.997±0.002 | 0.874±0.015 | 1.000±0.000 | 0.508±0.048 | 0.846±0.114 | 0.994±0.002 | 0.221±0.025 |
| census | 299,285 | 500 | 0.01% | 0.16% | 0.814±0.007 | 0.794±0.005 | 0.835±0.014 | 0.732±0.020 | 0.624±0.020 | 0.312±0.005 | 0.164±0.003 | 0.291±0.008 | 0.193±0.019 | 0.076±0.004 |
| fraud | 284,807 | 29 | 0.01% | 6.10% | 0.979±0.002 | 0.972±0.003 | 0.977±0.001 | 0.734±0.046 | 0.953±0.002 | 0.686±0.006 | 0.674±0.004 | 0.688±0.004 | 0.043±0.021 | 0.254±0.043 |
| celeba | 202,599 | 39 | 0.02% | 0.66% | 0.947±0.002 | 0.894±0.005 | 0.944±0.003 | 0.808±0.027 | 0.698±0.020 | 0.261±0.006 | 0.161±0.006 | 0.261±0.008 | 0.085±0.012 | 0.065±0.006 |
| backdoor | 95,329 | 196 | 0.04% | 1.29% | 0.970±0.003 | 0.878±0.007 | 0.952±0.018 | 0.928±0.019 | 0.752±0.021 | 0.886±0.002 | 0.116±0.003 | 0.856±0.016 | 0.573±0.167 | 0.051±0.005 |
| w7a | 49,749 | 300 | 0.08% | 2.03% | 0.813±0.005 | 0.718±0.010 | 0.780±0.011 | 0.700±0.017 | 0.417±0.010 | 0.442±0.009 | 0.104±0.010 | 0.102±0.016 | 0.057±0.004 | 0.024±0.001 |
| campaign | 41,188 | 62 | 0.10% | 0.65% | 0.818±0.011 | 0.723±0.006 | 0.748±0.019 | 0.623±0.024 | 0.731±0.015 | 0.403±0.014 | 0.330±0.009 | 0.349±0.023 | 0.193±0.012 | 0.328±0.022 |
| news20 | 10,523 | 1M | 0.37% | 5.70% | 0.938±0.010 | 0.885±0.003 | 0.887±0.000 | 0.578±0.050 | 0.328±0.016 | 0.643±0.012 | 0.222±0.004 | 0.253±0.001 | 0.082±0.010 | 0.035±0.002 |
| thyroid | 7,200 | 21 | 0.55% | 5.62% | 0.801±0.004 | 0.580±0.016 | 0.749±0.011 | 0.564±0.017 | 0.688±0.020 | 0.301±0.013 | 0.093±0.005 | 0.241±0.009 | 0.116±0.014 | 0.166±0.017 |
| Average | | | | | 0.898±0.005 | 0.824±0.007 | 0.874±0.009 | 0.740±0.025 | 0.674±0.016 | 0.548±0.008 | 0.264±0.010 | 0.432±0.022 | 0.260±0.029 | 0.136±0.014 |
| P-value | | | | | - | 0.004 | 0.039 | 0.004 | 0.004 | - | 0.004 | 0.016 | 0.004 | 0.004 |

Table 2: AUC-ROC and AUC-PR Results (mean ± std). ‘Size’ is the overall data size, D is the dimension, and f1 and f2 denote the percentages that the labeled anomalies comprise of the training data and of the anomalies, respectively. ‘1M’ denotes news20 has 1,355,191 features.

Competing Methods and Parameter Settings

PRIOR is compared with four methods, including REPEN [Pang et al.2018], adaptive Deep SVDD (DSVDD) [Ruff et al.2018], prototypical networks (denoted as FSNet) [Snell et al.2017], and iForest [Liu et al.2012], which are chosen because they are the respective state-of-the-art in four relevant areas: weakly-supervised deep anomaly detection, feature learning for anomaly detection, classification-based approach, and unsupervised anomaly detection. The original DSVDD cannot make use of the labeled anomalies. We adapted it based on [Tax and Duin2004] to enforce a large margin between the one-class center and the labeled anomalies while minimizing the center-oriented hypersphere. This adaptation improves the accuracy of DSVDD by over 30%.

Since our experiments focus on unordered multidimensional data, multilayer perceptron networks are used. Similar to the findings in REPEN [Pang et al.2018], we empirically found that all deep methods perform better with one hidden layer than with deeper architectures. This may be due to the limited amount of available labeled data. Following REPEN, one hidden layer with 20 neural units is used in all deep methods. The ReLU activation function is used. A norm-based regularizer with a fixed hyperparameter $\lambda$ is applied to avoid overfitting. The RMSprop optimizer is used with a fixed learning rate. All deep detectors are trained using 50 epochs, with 20 batches per epoch. The batch size is probed in {8, 16, 32, 64, 128, 256, 512}. The best fits, 512 in PRIOR and DSVDD, 256 in REPEN and FSNet, are used by default.

Performance Evaluation Metrics

Two popular and complementary metrics, the Area Under Receiver Operating Characteristic Curve (AUC-ROC) and Area Under Precision-Recall Curve (AUC-PR) [Boyd et al.2013], are used. AUC-ROC summarizes the ROC curve of true positives against false positives, which often presents an overoptimistic view of the detection performance due to the class-imbalance nature of anomaly detection data; whereas AUC-PR is a summarization of the curve of precision and recall w.r.t. the anomalies, which focuses on the performance on the anomaly class only and is often more practical. Larger AUC-ROC (AUC-PR) indicates better performance.
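To make the difference between the two metrics concrete, the sketch below computes both on a tiny imbalanced example using only the standard library. AUC-ROC is computed through its pairwise-ranking interpretation, and AUC-PR is estimated by average precision, which is one common estimator and is an assumption here since the exact estimator used in the experiments is not stated. Function names are hypothetical.

```python
def auc_roc(labels, scores):
    """Probability that a random anomaly outscores a random normal (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Assumes distinct scores and at least one anomaly (label 1).
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Two anomalies among ten instances; one normal instance is ranked second.
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.05, 0.15, 0.25, 0.35]
print(auc_roc(labels, scores))                       # 0.9375
print(round(average_precision(labels, scores), 4))   # 0.8333
```

A single false positive near the top of the ranking barely moves AUC-ROC but costs much more in AUC-PR, which is why the latter is the stricter measure on rare-anomaly data.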

The reported AUC-ROC and AUC-PR are averaged results over 10 independent runs. The paired Wilcoxon signed rank test [Woolson2007] is used to examine the significance of the performance of PRIOR against its competing methods.
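The following sketch shows how an exact two-sided paired Wilcoxon signed rank p-value can be obtained by enumerating all sign patterns, which is feasible for nine data sets (512 patterns). It is applied to the mean AUC-ROC values of PRIOR and REPEN from Table 2. Treating the nine per-data-set means as the paired samples is an assumption about how the test was run, and the function name is hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y, decimals=6):
    """Exact two-sided paired Wilcoxon signed rank p-value by enumerating signs."""
    diffs = [round(a - b, decimals) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]            # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):                           # average ranks over tied magnitudes
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    null = [sum(r for r, s in zip(ranks, signs) if s)
            for signs in product((0, 1), repeat=len(ranks))]
    lower = sum(w <= w_plus for w in null) / len(null)
    upper = sum(w >= w_plus for w in null) / len(null)
    return min(1.0, 2 * min(lower, upper))


# Mean AUC-ROC per data set from Table 2 (rows donors to thyroid).
prior = [1.000, 0.814, 0.979, 0.947, 0.970, 0.813, 0.818, 0.938, 0.801]
repen = [0.975, 0.794, 0.972, 0.894, 0.878, 0.718, 0.723, 0.885, 0.580]
print(wilcoxon_signed_rank_exact(prior, repen))  # 0.00390625
```

Since PRIOR is ahead on all nine data sets, only the two most extreme of the 512 sign patterns are at least as extreme as the observed one, giving 2/512, which rounds to the 0.004 shown in the P-value row of Table 2.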

Default Experiment Settings

To replicate the real-life scenarios where a few labeled anomalies together with large unlabeled data are available, the anomalies and normal instances are first split into two subsets, with 80% of the data as training data and the other 20% as the test set. We then combine some randomly selected anomalies with the normal training data to produce an anomaly-contaminated unlabeled data set $\mathcal{U}$. We further randomly sample a few anomalies from the anomaly class to form the prior knowledge, i.e., the labeled anomaly set $\mathcal{A}$.
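The data preparation described above can be sketched as follows. Several details are assumptions made for illustration rather than facts from the text: the contamination level is interpreted as the anomaly fraction of the final unlabeled set, the labeled anomalies and the contaminating anomalies are drawn without overlap, and the function and argument names are hypothetical.

```python
import random


def make_training_sets(normals, anomalies, contamination, n_labeled, seed=0):
    """Return (labeled anomalies, contaminated unlabeled set, test set)."""
    rng = random.Random(seed)
    normals, anomalies = normals[:], anomalies[:]
    rng.shuffle(normals)
    rng.shuffle(anomalies)
    # 80% / 20% split applied separately to normal instances and anomalies.
    n_tr, a_tr = int(0.8 * len(normals)), int(0.8 * len(anomalies))
    train_norm, test_norm = normals[:n_tr], normals[n_tr:]
    train_anom, test_anom = anomalies[:a_tr], anomalies[a_tr:]
    labeled = train_anom[:n_labeled]
    # Assumption: contamination is the anomaly fraction of the unlabeled set.
    n_contam = round(contamination * len(train_norm) / (1 - contamination))
    # Assumption: contaminating anomalies do not overlap with the labeled ones.
    contaminants = train_anom[n_labeled:n_labeled + n_contam]
    unlabeled = train_norm + contaminants
    rng.shuffle(unlabeled)
    return labeled, unlabeled, test_norm + test_anom


normals = [("n", i) for i in range(980)]
anomalies = [("a", i) for i in range(100)]
labeled, unlabeled, test = make_training_sets(normals, anomalies, 0.02, n_labeled=30)
print(len(labeled), len(unlabeled), len(test))  # 30 800 216
```

In this toy run, 16 of the 800 unlabeled instances are hidden anomalies, i.e., a 2% contamination level under the interpretation assumed above.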

Under this setting we evaluate the performance of PRIOR w.r.t. its effectiveness in real-world data, data efficiency, robustness, and generalizability to unseen anomalies.

Effectiveness in Real-world Data Sets

Experiment Settings

We first evaluate the effectiveness of PRIOR in a wide range of real-world data. A consistent setting, 2% anomaly contamination and 30 labeled anomalies, is used across all data sets to gain insights into the performance in different real-life applications.


The AUC-ROC and AUC-PR results on the nine real-world data sets are shown in Table 2. PRIOR obtains the best performance on eight data sets in both metrics, with comparable performance on the data set where it ranks second. On average, PRIOR performs substantially better than REPEN (9%), DSVDD (3%), FSNet (21%) and iForest (33%) in terms of AUC-ROC; more impressively, it achieves more substantial improvement in AUC-PR, i.e., 108% w.r.t. REPEN, 27% w.r.t. DSVDD, 111% w.r.t. FSNet, and 304% w.r.t. iForest. All of the improvements are significant at the 95% or 99% confidence level. It is often very challenging to achieve large AUC-PR due to the rareness of anomalies. Such impressive AUC-PR improvement shows the superior capability of PRIOR in reducing false positives. This is very encouraging when considering that the labeled anomalies account for only 0.005%-0.6% of training data and 0.08%-6% of the anomaly class (see the two percentage columns in Table 2).

The superiority of PRIOR is mainly due to its regression-based end-to-end anomaly score learning, which leverages the augmented instance pairs to efficiently learn the abstraction of both normality and abnormality, achieving more accurate anomaly scores than its counterparts that focus on feature learning. It is interesting that on census PRIOR outperforms DSVDD in AUC-PR but underperforms it in AUC-ROC, which may be because the ternary ordinal regression allows PRIOR to put more emphasis on detecting anomalies while DSVDD focuses on modeling the normal class only.

Data Efficiency

Experiment Settings

This section examines the data efficiency by inspecting the performance w.r.t. different numbers of labeled anomalies, ranging from 5 to 120, with the contamination level fixed to 2%.


The AUC-PR results for the data efficiency test are shown in Figure 2. Similar results can also be observed in AUC-ROC. The unsupervised method iForest is insensitive to the labeled data and used as the baseline. The performance of all deep methods generally increases with increasing labeled data. However, the increased anomalies do not always help due to the heterogeneous anomalous behaviors taken by different anomalies. PRIOR is more stable and has better generalizability in such cases.

Figure 2: AUC-PR w.r.t. the Number of Labeled Anomalies

PRIOR is the most data-efficient method, obtaining the best average AUC-PR w.r.t. different numbers of labeled anomalies. Impressively, PRIOR can use 88% fewer labeled anomalies and still achieve much better AUC-PR than the best contender DSVDD on w7a, news20 and thyroid; and it requires 50% less labeled data to obtain comparably good performance to the best contender FSNet on donors. This difference is mainly due to the direct optimization of anomaly scores in PRIOR against the indirect optimization in its counterparts.

Compared to the unsupervised iForest, even when very few (e.g., 5) labeled anomalies are used, the improvement of the weakly-supervised methods is very substantial on nearly all data sets, e.g., the average improvement of PRIOR using 5 labeled anomalies is more than 410%. This shows the power of combining deep models with a few labeled anomalies. On campaign, which may have very intricate anomalies, the deep methods need slightly more labeled anomalies to achieve a similar improvement.

Robustness w.r.t. Anomaly Contamination

Experiment Settings

We examine the robustness w.r.t. different anomaly contamination levels, {0%, 2%, 5%, 10%, 20%}, with the number of labeled anomalies fixed to 30.


The AUC-PR results for the robustness are presented in Figure 3. PRIOR achieves the best robustness w.r.t. different contamination levels. Similar to the results based on 2% contaminated data in Table 2, PRIOR also significantly outperforms all four contenders by a large margin in the other contamination levels. Particularly, the substantial improvement of PRIOR persists consistently in different contamination levels on campaign, w7a, news20 and thyroid; PRIOR performs comparably well to the best competing method in the other data sets. The demonstrated robustness of PRIOR benefits from its efficiency of leveraging the limited labeled anomalies, i.e., its greater data efficiency helps defy the noisiness of the contaminated instances.

Figure 3: AUC-PR w.r.t. Various Contamination Levels (%)

Effectiveness of Detecting Unseen Anomalies

Experiment Settings

Although the above results, especially on campaign, w7a, news20 and thyroid, show impressive generalizability of PRIOR, this section explicitly evaluates its generalizability to unseen anomalies. The experiment is performed on the UNSW-NB15 data, from which our backdoor data set is derived. UNSW-NB15 is chosen because it contains various different types of real anomalies, including eight other network attacks such as DoS, worms, reconnaissance and shellcode [Moustafa and Slay2015] besides backdoor attacks. Detection models are trained on the training data of the backdoor data. They are then evaluated on the test data of backdoor but replacing the backdoor attacks with randomly selected unseen anomalies from the eight types of new network attacks. We vary the number of the added unseen anomalies from 64 up to 1,024 and guarantee all the eight types of attacks are evenly presented in each test set.


The performance results are shown in Figure 4. In terms of both AUC-ROC and AUC-PR, PRIOR performs consistently and substantially better than all the competing methods in identifying the unseen anomalies across all cases. As discussed in our theoretical analysis, in addition to the patterns of labeled anomalies, PRIOR also learns the abstractions of normal instances due to the ternary ordinal regression in its pairwise relation learning. Therefore, even if the unseen anomalies demonstrate different abnormal behaviors from the seen anomalies, PRIOR would assign large anomaly scores to them when they are different from the normal instances. This is the main driving force underlying the superior performance of PRIOR here.

Note that since the normal instances remain the same in all the test sets with only the anomalies changed, AUC-ROC is relatively stable on the five different test sets. However, on the right panel, increasing the number of anomalies enlarges the anomaly proportion, making it easier to obtain better detection recall, and thus, better AUC-PR performance.
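The different behavior of the two measures can be made explicit with a short derivation. The symbols are introduced here for illustration and are not taken from the paper: π is the proportion of anomalies in the test set, and TPR(t) and FPR(t) are the true and false positive rates at a score threshold t. We assume that adding more anomalies of the same kinds leaves TPR(t) and FPR(t) roughly unchanged, so the ROC curve does not depend on π, whereas precision does:

```latex
\[
\mathrm{Precision}(t)
  = \frac{\pi\,\mathrm{TPR}(t)}{\pi\,\mathrm{TPR}(t) + (1-\pi)\,\mathrm{FPR}(t)},
\qquad
\frac{\partial\,\mathrm{Precision}(t)}{\partial \pi}
  = \frac{\mathrm{TPR}(t)\,\mathrm{FPR}(t)}
         {\bigl(\pi\,\mathrm{TPR}(t) + (1-\pi)\,\mathrm{FPR}(t)\bigr)^{2}}
  \;\ge\; 0 .
\]
```

For example, with TPR(t) = 0.8 and FPR(t) = 0.01, precision is roughly 0.45 at π = 0.01 and roughly 0.90 at π = 0.1, although the corresponding point on the ROC curve is identical in both cases.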

Figure 4: AUC-ROC and AUC-PR (mean±std) w.r.t. No. Unseen Anomalies from Eight Types of Network Intrusions

Related Work

Deep Anomaly Detection

Traditional anomaly detection approaches are often ineffective on high-dimensional or non-linearly separable data due to the curse of dimensionality and their deficiency in capturing non-linear relations [Aggarwal2017a]. Deep anomaly detection has shown promising results in handling such complex data; most of these methods learn representations separately from the anomaly measures [Hawkins et al.2002, Chen et al.2017, Schlegl et al.2017, Zenati et al.2018]. Very recent methods [Zong et al.2018, Pang et al.2018, Ruff et al.2018] unify representation learning and anomaly detection to learn features that are better suited to specific anomaly detectors. However, all of these methods have an optimization objective focusing on feature representations, so the anomaly scores output by the downstream anomaly measures are optimized only indirectly. By contrast, PRIOR learns anomaly scores end to end and thus optimizes them directly.

Our concurrent work DevNet [Pang et al.2019] addresses a similar problem but employs a completely different approach from this study. Here we formulate the problem as a pairwise relation learning task and address it using deep ternary ordinal regression without any assumption on the data distribution, whereas DevNet assumes that the anomaly scores of data instances follow a Gaussian distribution and leverages this prior to learn the anomaly scores by fitting them to a Z-Score-based distribution. In terms of empirical performance, DevNet and PRIOR perform comparably well on the eight data sets they share.
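To make the contrast concrete, the following is a rough sketch of what a Z-Score-based deviation against a Gaussian prior could look like. It is not DevNet's actual code: the function name `zscore_deviation`, the choice of a standard Gaussian as the prior, and the number of reference scores `n_ref` are assumptions made for this illustration.

```python
import numpy as np

def zscore_deviation(scores, n_ref=5000, seed=0):
    """Express anomaly scores as Z-Scores relative to reference scores
    sampled from an assumed standard Gaussian prior (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    ref = rng.standard_normal(n_ref)  # reference scores from the assumed prior
    return (scores - ref.mean()) / ref.std()

# A score near the prior mean has a deviation near 0; a score far in the
# upper tail has a large positive deviation.
print(zscore_deviation(np.array([0.0, 5.0])))
```

A method built on such a deviation depends on the Gaussian assumption for the scores, whereas regressing pair scores onto ordinal targets, as PRIOR does, needs no such reference distribution.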

Weakly- and Semi-supervised Anomaly Detection

Many studies leverage labeled normal instances to learn patterns of the normal class; these are commonly referred to as semi-supervised anomaly detection methods [Noto et al.2012, Görnitz et al.2013, Ienco et al.2017]. The semi-supervised setting is relevant to our problem because both labeled and unlabeled data are available, but the two tasks are fundamentally different in their training data and in the nature of the problem. Specifically, we have only a few labeled anomalies rather than large-scale labeled normal data, and the anomalies often come from different distributions or manifolds, so the limited labeled anomalies can hardly cover all types of anomalies. Therefore, instead of modeling the labeled normal data, our problem requires learning patterns of the labeled anomalies that also generalize well to unseen anomalies. A few studies [McGlohon et al.2009, Tamersoy et al.2014, Zhang et al.2018, Pang et al.2018] show that such limited labeled anomalies can be leveraged to substantially improve detection accuracy. However, these studies exploit the labeled anomalies to enhance anomaly scoring via label propagation [McGlohon et al.2009, Tamersoy et al.2014], representation learning [Pang et al.2018] or classification models [Zhang et al.2018], and they fail to sufficiently utilize the labeled data and/or to detect unseen anomalies.

This research line is also relevant to few-shot classification [Fei-Fei et al.2006, Vinyals et al.2016, Snell et al.2017] and positive and unlabeled (PU) learning [Li and Liu2003, Elkan and Noto2008, Sansone et al.2018] because only limited labeled positive instances (anomalies) are available. However, our problem is very different: these two areas implicitly assume that the few labeled instances share the same intrinsic class structure as the other instances within the same class (e.g., the anomaly class), whereas the few labeled anomalies and the unseen/unknown anomalies may have diverse class structures. This presents significant challenges to the techniques of both areas.

Conclusions


This paper introduces a novel formulation and its instantiation to devise ordinal regression networks for weakly-supervised deep anomaly detection. Our approach achieves significant improvement over four state-of-the-art competing methods in terms of effectiveness on real-world data, data efficiency, robustness, and generalizability to unseen anomalies. This in turn justifies the effectiveness of combining pairing-based data augmentation with ordinal regression-based anomaly score learning.

References


  • [Aggarwal2017a] Charu C Aggarwal. Outlier analysis. Springer, 2017.
  • [Aggarwal2017b] Charu C Aggarwal. Supervised outlier detection. In Outlier Analysis, pages 219–248. 2017.
  • [Boyd et al.2013] Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: point estimates and confidence intervals. In ECML/PKDD, pages 451–466. Springer, 2013.
  • [Chen et al.2017] Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. Outlier detection with autoencoder ensembles. In SDM, pages 90–98. SIAM, 2017.
  • [Elkan and Noto2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In KDD, pages 213–220. ACM, 2008.
  • [Fei-Fei et al.2006] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
  • [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
  • [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [Görnitz et al.2013] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
  • [Gutiérrez et al.2016] Pedro Antonio Gutiérrez, Maria Perez-Ortiz, Javier Sanchez-Monedero, Francisco Fernandez-Navarro, and Cesar Hervas-Martinez. Ordinal regression methods: survey and experimental study. IEEE Transactions on Knowledge and Data Engineering, 28(1):127–146, 2016.
  • [Hawkins et al.2002] Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using replicator neural networks. In DaWaK, 2002.
  • [Ienco et al.2017] Dino Ienco, Ruggero G Pensa, and Rosa Meo. A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Transactions on Neural Networks and Learning Systems, 28(5):1017–1029, 2017.
  • [Li and Liu2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, volume 3, pages 587–592, 2003.
  • [Liu et al.2012] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3, 2012.
  • [McGlohon et al.2009] Mary McGlohon, Stephen Bay, Markus G Anderle, David M Steier, and Christos Faloutsos. SNARE: A link analytic system for graph labeling and risk detection. In KDD, pages 1265–1274. ACM, 2009.
  • [Moustafa and Slay2015] Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems. In Military Communications and Information Systems Conference, 2015, pages 1–6, 2015.
  • [Noto et al.2012] Keith Noto, Carla Brodley, and Donna Slonim. FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1):109–133, 2012.
  • [Pang et al.2018] Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In KDD, pages 2041–2050, 2018.
  • [Pang et al.2019] Guansong Pang, Chunhua Shen, and Anton van den Hengel. Deep anomaly detection with deviation networks. In KDD, pages 353–362. ACM, 2019.
  • [Perez and Wang2017] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint:1712.04621, 2017.
  • [Randhawa et al.2018] Kuldeep Randhawa, Chu Kiong Loo, Manjeevan Seera, Chee Peng Lim, and Asoke K Nandi. Credit card fraud detection using adaboost and majority voting. IEEE Access, 6:14277–14284, 2018.
  • [Ruff et al.2018] Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, pages 4390–4399, 2018.
  • [Sansone et al.2018] Emanuele Sansone, Francesco GB De Natale, and Zhi-Hua Zhou. Efficient training for positive unlabeled learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [Schlegl et al.2017] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI, pages 146–157. Springer, Cham, 2017.
  • [Snell et al.2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
  • [Tamersoy et al.2014] Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. Guilt by association: Large scale malware detection by mining file-relation graphs. In KDD, pages 1524–1533, 2014.
  • [Tax and Duin2004] David MJ Tax and Robert PW Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
  • [Vinyals et al.2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
  • [Woolson2007] RF Woolson. Wilcoxon signed-rank test. Wiley Encyclopedia of Clinical Trials, pages 1–3, 2007.
  • [Zenati et al.2018] Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, and Vijay Chandrasekhar. Adversarially learned anomaly detection. In ICDM, pages 727–736. IEEE, 2018.
  • [Zhang and LeCun2015] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint:1502.01710, 2015.
  • [Zhang et al.2018] Ya-Lin Zhang, Longfei Li, Jun Zhou, Xiaolong Li, and Zhi-Hua Zhou. Anomaly detection with partially observed anomalies. In WWW Companion, pages 639–646, 2018.
  • [Zong et al.2018] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In ICLR, 2018.