Introduction
Anomaly detection aims to identify exceptional data instances that deviate significantly from the majority of the data, which can offer important insights in broad applications, such as identifying fraudulent transactions or insider trading, detecting network intrusions, and early detection of diseases. Numerous anomaly detection methods have been introduced [Aggarwal2017a, Chen et al.2017, Schlegl et al.2017, Zenati et al.2018, Pang et al.2018, Ruff et al.2018], most of which are unsupervised. The popularity of unsupervised methods is largely because they avoid the prohibitive cost of labeling large-scale anomaly data. However, since they have no prior knowledge of the anomalies of interest, many of the anomalies they identify are data noise or uninteresting data instances, leading to high false positives or low detection recall.
To address this issue, we study a rarely explored but critical anomaly detection problem, i.e., weakly-supervised anomaly detection with a few (e.g., several to dozens of) labeled anomalies and a large-scale unlabeled data set. The limited labeled anomalies can be leveraged as our prior knowledge of the anomalies to learn anomaly-informed detection models that reduce false positives [Aggarwal2017b]. Compared to fully-supervised anomaly detection techniques that require large-scale, complete labeled anomaly data, weakly-supervised anomaly detection is much more practical: although labeling large-scale anomaly data that covers all types of anomalies is too costly (if not impossible), a small set of labeled anomalies can often be made available at affordable or trivial cost in many real-world anomaly detection applications. These labeled anomalies may come from a deployed detection system, e.g., a few successfully detected network intrusion records, or they may come from end-users, such as a small number of fraudulent credit card transactions reported by bank clients.
However, this problem is especially challenging. We have only limited labeled data for the anomaly class, and, moreover, anomalies typically stem from unknown events; as a result, the limited labeled anomalies often, if not always, cannot cover all types of anomalies. Using these limited anomalies as supervision therefore poses significant challenges to generalizing detection models to both seen and unseen anomalies (i.e., new types of anomalies not encountered during training).
There have been a few early explorations in this research line using traditional methods [McGlohon et al.2009, Tamersoy et al.2014, Zhang et al.2018] or recently emerged deep anomaly detection methods [Pang et al.2018] (deep anomaly detection refers to any method that exploits deep learning techniques [Goodfellow et al.2016] to learn feature representations or anomaly scores for anomaly detection). However, these methods leverage the labeled anomalies only as auxiliary data to enhance an existing anomaly measure or to perform classification-based anomaly detection, which leads to insufficient exploitation of the labeled data and/or poor generalization to unseen anomalies. Additionally, note that well-established semi-supervised anomaly detection [Noto et al.2012, Görnitz et al.2013, Ienco et al.2017] focuses on learning patterns of the normal class using labeled normal data, which is fundamentally different from our task in terms of the given data and the problem. These differences are summarized in Table 1 (see Related Work for detailed discussions).
Table 1: Semi-supervised vs. weakly-supervised anomaly detection.

| Approach | Training Data | Problem |
| --- | --- | --- |
| Semi-supervised | Large normal data; large unlabeled data | Learn patterns of the labeled normal data |
| Weakly-supervised | Limited (incomplete) anomaly data; large unlabeled data | Learn patterns of the labeled anomalies that also generalize well to unseen anomalies |
This paper introduces a novel deep anomaly detection formulation and its instantiation, namely the PRIor knowledge-driven Ordinal Regression network (PRIOR), for the weakly-supervised setting. We formulate the problem as a pairwise relation learning task, in which a two-stream ordinal regression network is defined to learn the relation of instance pairs, i.e., to discriminate whether an instance pair contains labeled anomalies or only unlabeled instances. Particularly, PRIOR first generates unordered pairs of instances from the small labeled anomaly set and the large unlabeled data set, and then augments the pairs with a synthetic ordinal class feature, in which larger scalar class labels are assigned to the instance pairs that contain at least one labeled anomaly than to the other pairs. PRIOR then feeds these pair samples into the regression network to directly learn their anomaly scores. At the testing stage, PRIOR pairs test instances with the training instances and uses the regression network to infer their anomaly scores.
Unlike previous deep methods [Chen et al.2017, Zong et al.2018, Ruff et al.2018] that aim to improve existing anomaly measures with new feature representations, PRIOR uses the added ordinal feature to directly learn an anomaly measure, achieving more efficient use of the labeled data. PRIOR assigns large anomaly scores to instance pairs when they share similar characteristics with labeled anomalies or deviate from the interactions between unlabeled instance pairs. This helps identify both seen and unseen anomalies. Accordingly, this paper makes the following main contributions.
- We propose a novel formulation that combines tailored data augmentation and ordinal regression to perform weakly-supervised deep anomaly detection via pairwise relation learning. This yields anomaly-informed detection models with good generalizability.
- A novel method, namely PRIOR, is instantiated from our formulation to directly learn anomaly scores. It optimizes the scores in an end-to-end fashion, achieving well-optimized anomaly scores and data-efficient learning.
- Our empirical results on nine real-world data sets show that PRIOR (i) significantly outperforms four state-of-the-art contenders, e.g., by at least 27% average improvement in precision-recall rates, (ii) obtains substantially better data efficiency, e.g., it requires 50%-88% fewer labeled anomalies to perform comparably to, or substantially better than, the best contenders, and (iii) achieves substantially better generalizability to unseen anomalies than its contenders.
Pairwise Relation Learning
Problem Formulation
Given a set of training data instances $\mathcal{X} = \mathcal{A} \cup \mathcal{U}$, where $\mathcal{U}$ is a large unlabeled data set and $\mathcal{A}$ with $|\mathcal{A}| \ll |\mathcal{U}|$ is a very small set of labeled anomalies that provide some prior knowledge of anomalies, our goal is to learn an anomaly scoring function $\tau : \mathcal{X} \mapsto \mathbb{R}$ that assigns anomaly scores to data instances in a way that we have $\tau(\mathbf{x}_i) > \tau(\mathbf{x}_j)$ if $\mathbf{x}_i$ is an anomaly and $\mathbf{x}_j$ is a normal data instance.
To obtain more labeled data and perform end-to-end anomaly score learning, we formulate the problem as a pairwise relation learning task, in which we aim to assign substantially larger anomaly scores to the data instance pairs that contain at least one anomaly than to the other instance pairs. This can be achieved by ordinal regression techniques [Gutiérrez et al.2016]. Specifically, let $\mathcal{B} = \{(\{\mathbf{x}_i, \mathbf{x}_j\}, y_{ij})\}$ be a set of unordered pairs of instances with artificial class labels, where each pair belongs to one of three different types of possible combinations: $aa$, $au$ and $uu$ ($\mathbf{x}_i$ and $\mathbf{x}_j$ drawn from $\mathcal{A}$ and/or $\mathcal{U}$), and $y_{ij} \in \{y_{aa}, y_{au}, y_{uu}\}$ is an ordinal class feature with decreasing value assignments to the respective $aa$, $au$ and $uu$ pairs, i.e., $y_{aa} > y_{au} > y_{uu}$; then the learning of anomaly scores can be transformed into learning a ternary ordinal regression function $\phi(\cdot; \Theta) : \mathcal{B} \mapsto \mathbb{R}$.
We then instantiate the formulation into the method PRIOR for deep anomaly detection. As shown in Figure 1, PRIOR consists of three modules: data augmentation, an end-to-end anomaly score learner, and ordinal regression. The data augmentation module generates the labeled instance pair set $\mathcal{B}$. The anomaly scoring function $\phi$ is a composition of a feature learner $\psi(\cdot; \Theta_r)$ and an anomaly score learner $\eta(\cdot; \Theta_s)$, which can be trained in an end-to-end manner with an ordinal regression loss function.
Tailored Data Augmentation
Different from popular editing-based augmentation [Zhang and LeCun2015, Perez and Wang2017], two augmentation methods tailored for our problem are used to substantially extend the labeled data: (i) we first generate a set of instance pairs with instances randomly sampled from the small labeled anomaly set $\mathcal{A}$ and the large unlabeled data set $\mathcal{U}$, and categorize the pairs into three classes $\mathcal{B}_{aa}$, $\mathcal{B}_{au}$ and $\mathcal{B}_{uu}$ based on the sources from which the instances of each pair are sampled, where $\mathcal{B}_{aa} \subset \mathcal{A} \times \mathcal{A}$, $\mathcal{B}_{au} \subset \mathcal{A} \times \mathcal{U}$ and $\mathcal{B}_{uu} \subset \mathcal{U} \times \mathcal{U}$; and (ii) a synthetic ordinal class feature $y$ is then added, in which the instance pairs of the three classes are assigned scalar class labels such that $y_{aa} > y_{au} > y_{uu}$. By doing so, we efficiently synthesize $\mathcal{A}$ and $\mathcal{U}$ to produce a large labeled data set $\mathcal{B} = \mathcal{B}_{aa} \cup \mathcal{B}_{au} \cup \mathcal{B}_{uu}$.
More importantly, $\mathcal{B}$ contains critical information for discriminating anomalies from normal data instances. This is due to the fact that, since anomalies are rare, the unlabeled data set $\mathcal{U}$ is often dominated by normal data instances; as a result, most $\mathcal{B}_{uu}$ pairs consist of normal instances only. Thus, with $\mathcal{B}$ as data inputs, PRIOR is fed with training samples that contain key information for discriminating the $\mathcal{B}_{aa}$ and $\mathcal{B}_{au}$ pairs, which contain at least one anomaly, from the $\mathcal{B}_{uu}$ pairs consisting of normal instances only. Note that a few $\mathcal{B}_{uu}$ pairs may contain anomalies due to potential anomaly contamination in $\mathcal{U}$, leading to some noisy pairs in the training data, but we found that the end-to-end learning of anomaly scores enables PRIOR to be trained very data-efficiently, which effectively overcomes the negative effects of these noisy pairs.
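To make the augmentation step concrete, the following is a minimal sketch of the pair generation described above. The ordinal label values (`y_aa`, `y_au`, `y_uu`) and the per-class sampling size are placeholders, not the paper's default settings; the anomaly-first ordering of the mixed pairs anticipates the ordered-pair transformation discussed later.

```python
import random

def generate_pairs(anomalies, unlabeled, n_per_class,
                   y_aa=2.0, y_au=1.0, y_uu=0.0, seed=0):
    """Sample instance pairs from the labeled anomaly set A and the unlabeled
    set U, attaching a synthetic ordinal class label to each pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_per_class):
        # aa pair: both instances are labeled anomalies
        pairs.append(((rng.choice(anomalies), rng.choice(anomalies)), y_aa))
        # au pair: the labeled anomaly is placed first so that the ordering
        # is consistent when the two representations are concatenated later
        pairs.append(((rng.choice(anomalies), rng.choice(unlabeled)), y_au))
        # uu pair: both instances come from the (mostly normal) unlabeled set
        pairs.append(((rng.choice(unlabeled), rng.choice(unlabeled)), y_uu))
    rng.shuffle(pairs)
    return pairs
```

A call such as `generate_pairs(A, U, 10000)` thus turns a handful of labeled anomalies into tens of thousands of labeled pair samples.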
Endtoend Anomaly Score Learner
An end-to-end anomaly score learner is then defined to take pairs of data instances as inputs and directly output the anomaly scores of the pairs. Let $\mathcal{Q}$ be an intermediate representation space. We define a two-stream anomaly scoring network $\phi(\cdot; \Theta)$ as a composition of a feature learner $\psi(\cdot; \Theta_r) : \mathcal{X} \mapsto \mathcal{Q}$ and an anomaly scoring function $\eta(\cdot; \Theta_s) : \mathcal{Q} \times \mathcal{Q} \mapsto \mathbb{R}$, where $\Theta = \{\Theta_r, \Theta_s\}$. Specifically, $\psi$ is a neural feature learner with $H \in \mathbb{N}$ hidden layers and weight matrices $\Theta_r = \{\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(H)}\}$:
$$\mathbf{z} = \psi(\mathbf{x}; \Theta_r), \qquad (1)$$
where $\mathbf{x} \in \mathcal{X}$ and $\mathbf{z} \in \mathcal{Q}$. Different network structures can be used here, such as multilayer perceptrons for multi-dimensional data, convolutional networks for image data, or recurrent networks for sequence data [Goodfellow et al.2016]. $\eta(\cdot; \Theta_s)$ is defined as an anomaly score learner which uses a linear neural unit in the output layer to compute the anomaly scores based on the intermediate representations:
$$\eta(\mathbf{z}_i, \mathbf{z}_j; \Theta_s) = \mathbf{w}^{\top}(\mathbf{z}_i \oplus \mathbf{z}_j) + b, \qquad (2)$$
where $\mathbf{z}_i, \mathbf{z}_j \in \mathcal{Q}$, $\oplus$ is a concatenation operation of $\mathbf{z}_i$ and $\mathbf{z}_j$, and $\Theta_s = \{\mathbf{w}, b\}$ ($b$ is the bias term). As shown in Figure 1, to reduce the optimization complexity, PRIOR uses a two-stream network with shared weight parameters $\Theta_r$ to learn the representations $\mathbf{z}_i$ and $\mathbf{z}_j$.
Lastly, $\phi$ can be formally represented as
$$\phi(\{\mathbf{x}_i, \mathbf{x}_j\}; \Theta) = \eta\big(\psi(\mathbf{x}_i; \Theta_r), \psi(\mathbf{x}_j; \Theta_r); \Theta_s\big), \qquad (3)$$
which can be trained in an end-to-end fashion to directly map original data inputs to scalar anomaly scores.
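The two-stream scorer can be sketched in a few lines of NumPy. The one-hidden-layer architecture, ReLU activation, and Glorot-style initialization follow the experimental settings described later; the hidden size and every function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, d_hidden):
    """Glorot-style uniform initialization for the shared feature learner
    psi (one hidden layer) and the linear scoring unit eta."""
    s = np.sqrt(6.0 / (d_in + d_hidden))
    return {
        "W": rng.uniform(-s, s, (d_hidden, d_in)),  # shared feature-learner weights
        "b": np.zeros(d_hidden),                    # hidden-layer bias
        "w": rng.uniform(-s, s, 2 * d_hidden),      # scoring weights over the concatenation
        "c": 0.0,                                   # scoring bias
    }

def psi(x, p):
    """Shared feature learner: one ReLU hidden layer (cf. Eqn. (1))."""
    return np.maximum(p["W"] @ x + p["b"], 0.0)

def score_pair(x_i, x_j, p):
    """Two-stream pair scorer: a linear unit over the concatenated
    representations of the two instances (cf. Eqns. (2)-(3))."""
    z = np.concatenate([psi(x_i, p), psi(x_j, p)])
    return float(p["w"] @ z + p["c"])
```

Note that both streams call the same `psi` with the same parameters, which is exactly the weight-sharing design of the two-stream network.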
Note that the concatenation results in an ordered pair $(\mathbf{z}_i, \mathbf{z}_j)$. This does not affect the unordered pairs from the $\mathcal{B}_{aa}$ and $\mathcal{B}_{uu}$ classes, since the instances of these pairs come from the same data set (i.e., $\mathcal{A}$ or $\mathcal{U}$), but it may produce inverse effects on our training with the $\mathcal{B}_{au}$ pairs. We address this problem by transforming all unordered $\mathcal{B}_{au}$ pairs into consistently ordered pairs before training.

Loss Functions for Ordinal Regression
PRIOR then feeds the ordinal class labels and the anomaly scores output by $\phi$ into an ordinal regression objective to optimize the scores, aiming to assign substantially larger anomaly scores to the $\mathcal{B}_{aa}$ and $\mathcal{B}_{au}$ instance pairs than to the $\mathcal{B}_{uu}$ instance pairs. This fulfills a prior knowledge-driven direct optimization of the anomaly scores, which is expected to yield more data-efficient learning and better-optimized anomaly scores than current deep methods that focus on optimizing feature representations in an unsupervised way.
Let $y_{ij}$ be the ordinal label of an instance pair $\{\mathbf{x}_i, \mathbf{x}_j\}$. We define the loss below to guide the optimization:
$$L\big(\phi(\{\mathbf{x}_i, \mathbf{x}_j\}; \Theta), y_{ij}\big) = \big|\, y_{ij} - \phi(\{\mathbf{x}_i, \mathbf{x}_j\}; \Theta) \,\big|. \qquad (4)$$
The absolute loss is used to reduce the effect of potentially noisy pairs caused by the anomaly contamination in $\mathcal{U}$. The three class labels $y_{aa}$, $y_{au}$ and $y_{uu}$ are assigned decreasing values by default to enforce large margins among the anomaly scores of the three types of instance pairs. PRIOR also works very well with other value assignments, as long as there are reasonably large margins among the ordinal class labels.
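The absolute loss of Eqn. (4) is straightforward; a minimal sketch over a batch of pair scores follows (averaging over the batch is an assumption here, matching the batch-wise objective below):

```python
def ordinal_abs_loss(pred_scores, ordinal_labels):
    """Absolute (L1) ordinal regression loss, averaged over a batch.
    The L1 form is less sensitive to noisy uu pairs that secretly
    contain contaminating anomalies than a squared loss would be."""
    assert len(pred_scores) == len(ordinal_labels)
    return sum(abs(y - s) for s, y in zip(pred_scores, ordinal_labels)) / len(pred_scores)
```

The loss is zero exactly when every pair's score matches its ordinal class label, which is what drives the scores of the three pair types apart.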
Therefore, the overall objective function can be written as
$$\operatorname*{arg\,min}_{\Theta} \; \sum_{\{\mathbf{x}_i, \mathbf{x}_j\} \in \mathcal{B}_s} \big|\, y_{ij} - \phi(\{\mathbf{x}_i, \mathbf{x}_j\}; \Theta) \,\big| + \lambda R(\Theta), \qquad (5)$$
where $\mathcal{B}_s$ is a sample batch from $\mathcal{B}$ and $R(\Theta)$ is a regularization term with the hyperparameter $\lambda$. These optimization settings are provided in the experiments section.

Anomaly Detection Using PRIOR
Training Stage
Algorithm 1 presents the procedure of training PRIOR. Step 1 first extends the data into a set of instance pairs with ordinal class labels, $\mathcal{B}$. After a uniform Glorot weight initialization [Glorot and Bengio2010] in Step 2, PRIOR performs stochastic gradient descent (SGD) based optimization to learn $\Theta$ in Steps 3-9 and obtains the optimized $\Theta^{*}$ in Step 10. Particularly, stratified random sampling is used in Step 5 to ensure the sample balance of the three classes in each batch. Step 6 performs the forward propagation of the network and computes the loss. Step 7 then uses the loss to perform gradient descent steps.

Testing Stage
At the testing stage, given a test instance $\mathbf{x}$, PRIOR first pairs it with data instances randomly sampled from $\mathcal{A}$ and $\mathcal{U}$, and then defines its anomaly score as
$$\tau(\mathbf{x}) = \frac{1}{2E} \sum_{k=1}^{E} \Big[ \phi(\{\mathbf{a}_k, \mathbf{x}\}; \Theta^{*}) + \phi(\{\mathbf{u}_k, \mathbf{x}\}; \Theta^{*}) \Big], \qquad (6)$$
where $\Theta^{*}$ are the parameters of a trained $\phi$, and $\mathbf{a}_k$ and $\mathbf{u}_k$ are randomly sampled from the respective $\mathcal{A}$ and $\mathcal{U}$. $\tau(\mathbf{x})$ can be interpreted as an ensemble of the anomaly scores of a set of oriented pairs. Due to the loss in Eqn. (4), $\tau(\mathbf{x}_i)$ is optimized to be greater than $\tau(\mathbf{x}_j)$ given that $\mathbf{x}_i$ is an anomaly and $\mathbf{x}_j$ is a normal instance. The ensemble scores are employed to achieve stable anomaly scoring: PRIOR performs very stably as long as the ensemble size $E$ is sufficiently large, due to the central limit theorem; a fixed $E$ is used here.

Theoretical Foundation of PRIOR
Informative Data Augmentation
Augmenting $\mathcal{A}$ to $\mathcal{B}$ substantially increases not only the total number of training samples but also the proportion of labeled anomaly data. Specifically, given $m$ labeled anomalies and $n$ unlabeled data instances, the labeled anomaly data accounts for $\frac{m}{m+n}$ only. After our data augmentation, the labeled anomaly data (i.e., the instance pairs that contain at least one labeled anomaly) accounts for $\frac{n_{aa} + n_{au}}{n_{aa} + n_{au} + n_{uu}}$, where $n_{aa}$, $n_{au}$ and $n_{uu}$ denote the number of the $\mathcal{B}_{aa}$, $\mathcal{B}_{au}$ and $\mathcal{B}_{uu}$ pairs, respectively. After some transformations, we can see that the augmented labeled anomaly data remarkably extends the original proportion by a factor of about $\frac{(n_{aa} + n_{au})(m+n)}{(n_{aa} + n_{au} + n_{uu})\, m}$.
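A quick arithmetic check illustrates the magnitude of this boost; the numbers below (30 labeled anomalies, 5,000 unlabeled instances, equal pair counts per class) are hypothetical, not taken from the experiments:

```python
m, n = 30, 5000                  # hypothetical: labeled anomalies, unlabeled instances
p_before = m / (m + n)           # original labeled-anomaly proportion

n_aa = n_au = n_uu = 10000       # hypothetical pair counts per class
# proportion of pairs containing at least one labeled anomaly
p_after = (n_aa + n_au) / (n_aa + n_au + n_uu)

factor = p_after / p_before      # how much the proportion is extended
print(round(p_before, 4), round(p_after, 4), round(factor, 1))
```

With equal pair counts per class, two thirds of the training pairs carry anomaly supervision, a boost of roughly two orders of magnitude over the raw proportion.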
More importantly, since PRIOR uses a two-stream network with shared parameters in its representation learner $\psi$, it still performs feature learning in the original data space, but with substantially more labeled anomaly data, which helps learn more expressive intermediate feature representations.
Optimizing Anomaly Scores with Regression
This section shows that the anomaly score defined in Eqn. (6) is well optimized via the regression to ensure that anomalies have substantially larger anomaly scores than normal data instances. Specifically, given a test instance $\mathbf{x}$, the first term in Eqn. (6) is equivalent to distinguishing the $\mathcal{B}_{aa}$ pairs from the $\mathcal{B}_{au}$ pairs. Thus, the term enforces $\mathbf{x}$ to have an anomaly score close to $y_{aa}$ if $\mathbf{x}$ is an anomaly, and likewise, close to $y_{au}$ if $\mathbf{x}$ is a normal instance. Since $y_{aa} > y_{au}$, this term can well guarantee that $\mathbf{x}$ has a larger anomaly score when it is an anomaly than when it is normal. Similarly, the second term is equivalent to distinguishing the $\mathcal{B}_{au}$ pairs from the $\mathcal{B}_{uu}$ pairs and fulfills the same guarantee as the first term. An average aggregation of these two terms is used to achieve statistically stable scoring results.
Also, this ternary ordinal regression empowers PRIOR to learn the abstractions of not only the anomalous behaviors but also the normal behaviors, so PRIOR is expected to assign large anomaly scores to unseen anomalies as long as they clearly deviate from the normality abstraction.
The above analysis assumes the sampled unlabeled instance is mostly a normal instance, because of the rarity of anomalies in real-world applications, and we found empirically that the SGD-based optimization works very well under this assumption.
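The test-time ensemble scoring of Eqn. (6) that this analysis concerns can be sketched as follows. `score_pair` stands in for the trained pair network $\phi(\cdot;\Theta^{*})$, and the ensemble size default is a placeholder, not the paper's setting:

```python
import random

def prior_score(x, anomalies, unlabeled, score_pair, n_ensemble=30, seed=0):
    """Test-time anomaly score: pair x with sampled labeled anomalies and
    unlabeled instances, then average the two streams of pair scores."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_ensemble):
        a = rng.choice(anomalies)   # probes the aa-vs-au margin for x
        u = rng.choice(unlabeled)   # probes the au-vs-uu margin for x
        total += score_pair(a, x) + score_pair(u, x)
    return total / (2 * n_ensemble)
```

With a trained scorer, anomalous test instances push both pair types toward the larger ordinal labels, so their averaged score exceeds that of normal instances.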
Experiments
Data Sets
As shown in Table 2, nine widely-used, publicly available real-world data sets are used (donors and fraud are available at https://www.kaggle.com/; w7a and news20 at https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/; backdoor at https://www.unsw.adfa.edu.au/unswcanberracyber/cybersecurity/ADFANB15Datasets/; celeba at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html; and the other data sets at https://archive.ics.uci.edu/ml/datasets/). They are from diverse domains, e.g., intrusion detection, fraud detection, and disease detection [Moustafa and Slay2015, Liu et al.2012, Pang et al.2018, Randhawa et al.2018]. Specifically, the donors data is taken from KDD Cup 2014 for predicting the excitement of donation projects, with exceptionally exciting projects used as anomalies (6% of all data instances). The census data is extracted from the US census bureau database, in which we aim to detect rare high-income persons (6%). fraud is for fraudulent credit card transaction detection, with fraudulent transactions (0.2%) as anomalies. celeba contains more than 200K celebrity images, each with 40 attribute annotations; we use the bald attribute as our detection target, in which the scarce bald celebrities (3%) are treated as anomalies and the other 39 attributes form the feature space. The backdoor data is for network attack detection with backdoor attacks as anomalies (2.4%) against the 'normal' class; it is derived from a widely-used intrusion detection data set called UNSW-NB15 [Moustafa and Slay2015]. w7a is a web page classification data set, with the minority classes (3%) as anomalies. campaign is a data set of bank marketing campaigns, with rarely successful campaigning records (11.3%) as anomalies. news20 is one of the most popular text classification corpora, converted into anomaly detection data via random down-sampling of the minority class (5%), following [Liu et al.2012, Pang et al.2018]. thyroid is for disease detection, in which the anomalies are the hypothyroid patients (7.4%).

Four of these data sets contain real anomalies: donors, fraud, backdoor and thyroid. The other five data sets contain semantically real anomalies, i.e., instances that are rare and very different from the majority of data instances. This provides a good testbed for anomaly detection evaluation.
Table 2: Data characteristics and AUC-ROC / AUC-PR performance (mean±std). "A% (train)" and "A% (anom.)" denote the labeled anomalies as a proportion of the training data and of the anomaly class, respectively.

| Data | Size | D | A% (train) | A% (anom.) | PRIOR (ROC) | REPEN (ROC) | DSVDD (ROC) | FSNet (ROC) | iForest (ROC) | PRIOR (PR) | REPEN (PR) | DSVDD (PR) | FSNet (PR) | iForest (PR) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| donors | 619,326 | 10 | 0.01% | 0.08% | 1.000±0.000 | 0.975±0.005 | 0.995±0.005 | 0.997±0.002 | 0.874±0.015 | 1.000±0.000 | 0.508±0.048 | 0.846±0.114 | 0.994±0.002 | 0.221±0.025 |
| census | 299,285 | 500 | 0.01% | 0.16% | 0.814±0.007 | 0.794±0.005 | 0.835±0.014 | 0.732±0.020 | 0.624±0.020 | 0.312±0.005 | 0.164±0.003 | 0.291±0.008 | 0.193±0.019 | 0.076±0.004 |
| fraud | 284,807 | 29 | 0.01% | 6.10% | 0.979±0.002 | 0.972±0.003 | 0.977±0.001 | 0.734±0.046 | 0.953±0.002 | 0.686±0.006 | 0.674±0.004 | 0.688±0.004 | 0.043±0.021 | 0.254±0.043 |
| celeba | 202,599 | 39 | 0.02% | 0.66% | 0.947±0.002 | 0.894±0.005 | 0.944±0.003 | 0.808±0.027 | 0.698±0.020 | 0.261±0.006 | 0.161±0.006 | 0.261±0.008 | 0.085±0.012 | 0.065±0.006 |
| backdoor | 95,329 | 196 | 0.04% | 1.29% | 0.970±0.003 | 0.878±0.007 | 0.952±0.018 | 0.928±0.019 | 0.752±0.021 | 0.886±0.002 | 0.116±0.003 | 0.856±0.016 | 0.573±0.167 | 0.051±0.005 |
| w7a | 49,749 | 300 | 0.08% | 2.03% | 0.813±0.005 | 0.718±0.010 | 0.780±0.011 | 0.700±0.017 | 0.417±0.010 | 0.442±0.009 | 0.104±0.010 | 0.102±0.016 | 0.057±0.004 | 0.024±0.001 |
| campaign | 41,188 | 62 | 0.10% | 0.65% | 0.818±0.011 | 0.723±0.006 | 0.748±0.019 | 0.623±0.024 | 0.731±0.015 | 0.403±0.014 | 0.330±0.009 | 0.349±0.023 | 0.193±0.012 | 0.328±0.022 |
| news20 | 10,523 | 1M | 0.37% | 5.70% | 0.938±0.010 | 0.885±0.003 | 0.887±0.000 | 0.578±0.050 | 0.328±0.016 | 0.643±0.012 | 0.222±0.004 | 0.253±0.001 | 0.082±0.010 | 0.035±0.002 |
| thyroid | 7,200 | 21 | 0.55% | 5.62% | 0.801±0.004 | 0.580±0.016 | 0.749±0.011 | 0.564±0.017 | 0.688±0.020 | 0.301±0.013 | 0.093±0.005 | 0.241±0.009 | 0.116±0.014 | 0.166±0.017 |
| Average | - | - | - | - | 0.898±0.005 | 0.824±0.007 | 0.874±0.009 | 0.740±0.025 | 0.674±0.016 | 0.548±0.008 | 0.264±0.010 | 0.432±0.022 | 0.260±0.029 | 0.136±0.014 |
| P-value | - | - | - | - | - | 0.004 | 0.039 | 0.004 | 0.004 | - | 0.004 | 0.016 | 0.004 | 0.004 |
Competing Methods and Parameter Settings
PRIOR is compared with four methods: REPEN [Pang et al.2018], adaptive Deep SVDD (DSVDD) [Ruff et al.2018], prototypical networks (denoted as FSNet) [Snell et al.2017], and iForest [Liu et al.2012]. They are chosen because they are the respective state-of-the-art in four relevant areas: weakly-supervised deep anomaly detection, feature learning for anomaly detection, classification-based approaches, and unsupervised anomaly detection. The original DSVDD cannot make use of the labeled anomalies; we adapted it based on [Tax and Duin2004] to enforce a large margin between the one-class center and the labeled anomalies while minimizing the center-oriented hypersphere. This adaptation improves DSVDD's accuracy by over 30%.
Since our experiments focus on unordered multi-dimensional data, multilayer perceptron networks are used. Similar to the findings in REPEN [Pang et al.2018], we empirically found that all deep methods perform best with an architecture of one hidden layer rather than deeper architectures, which may be due to the limited amount of available labeled data. Following REPEN, one hidden layer with 20 neural units is used in all deep methods, with the ReLU activation function. A norm-based regularizer with the hyperparameter $\lambda$ is applied to avoid overfitting. The RMSprop optimizer is used. All deep detectors are trained for 50 epochs, with 20 batches per epoch. The batch size is probed in {8, 16, 32, 64, 128, 256, 512}; the best fits, 512 for PRIOR and DSVDD and 256 for REPEN and FSNet, are used by default.

Performance Evaluation Metrics
Two popular and complementary metrics, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) [Boyd et al.2013], are used. AUC-ROC summarizes the ROC curve of true positives against false positives, which often presents an over-optimistic view of the detection performance due to the class-imbalanced nature of anomaly detection data; AUC-PR, by contrast, summarizes the curve of precision and recall w.r.t. the anomalies, focusing on the performance on the anomaly class only, and is often more practical. Larger AUC-ROC (AUC-PR) indicates better performance.
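AUC-ROC has a useful probabilistic reading: it equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal instance. A minimal pure-Python computation via the Mann-Whitney U statistic (illustrative; not the evaluation code used in the experiments) makes this concrete:

```python
def auc_roc(scores, labels):
    """AUC-ROC via the Mann-Whitney U statistic: the fraction of
    (anomaly, normal) score pairs ranked correctly, ties counting 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomaly scores
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal scores
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the statistic only compares ranks between the two classes, a heavily imbalanced normal class does not distort it, which is exactly why AUC-ROC can look over-optimistic relative to AUC-PR on rare-anomaly data.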
The reported AUC-ROC and AUC-PR are averaged results over 10 independent runs. The paired Wilcoxon signed-rank test [Woolson2007] is used to examine the significance of the performance of PRIOR against its competing methods.
Default Experiment Settings
To replicate real-life scenarios where a few labeled anomalies and large unlabeled data are available, the anomalies and normal instances are first split into two subsets, with 80% of the data used as the training set and the other 20% as the test set. We then combine some randomly selected anomalies with the normal training data to produce an anomaly-contaminated unlabeled data set $\mathcal{U}$. We further randomly sample a few anomalies from the anomaly class to form the prior knowledge, i.e., the labeled anomaly set $\mathcal{A}$.
Under this setting we evaluate the performance of PRIOR w.r.t. its effectiveness in realworld data, data efficiency, robustness, and generalizability to unseen anomalies.
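The data preparation above can be sketched as follows; the function and parameter names are hypothetical, and the defaults mirror the 2% contamination and 30 labeled anomalies used in parts of the experiments:

```python
import random

def prepare_split(anomalies, normals, contamination=0.02, n_labeled=30, seed=0):
    """Build the weakly-supervised setting: an anomaly-contaminated unlabeled
    training set U, a small labeled anomaly set A, and a held-out test set."""
    rng = random.Random(seed)
    anomalies, normals = anomalies[:], normals[:]
    rng.shuffle(anomalies)
    rng.shuffle(normals)

    # 80/20 train/test split of each class
    cut_a, cut_n = int(0.8 * len(anomalies)), int(0.8 * len(normals))
    a_tr, a_te = anomalies[:cut_a], anomalies[cut_a:]
    n_tr, n_te = normals[:cut_n], normals[cut_n:]

    # contaminate the unlabeled set with a small fraction of hidden anomalies
    n_contam = int(contamination * len(n_tr))
    unlabeled = n_tr + a_tr[:n_contam]

    # a few of the remaining training anomalies form the prior knowledge A
    labeled = a_tr[n_contam:n_contam + n_labeled]
    return labeled, unlabeled, (a_te, n_te)
```

Keeping the contaminating anomalies disjoint from the labeled set ensures the labeled anomalies carry genuine extra supervision rather than duplicating instances already hidden in $\mathcal{U}$.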
Effectiveness in Real-world Data Sets
Experiment Settings
We first evaluate the effectiveness of PRIOR in a wide range of real-world data. A consistent setting, 2% anomaly contamination and a fixed small number of labeled anomalies, is used across all data sets to gain insights into the performance in different real-life applications.
Results
The AUC-ROC and AUC-PR results on the nine real-world data sets are shown in Table 2. PRIOR obtains the best performance on eight data sets in both metrics, with highly competitive performance on the remaining data set, where it ranks second. On average, PRIOR performs substantially better than REPEN (9%), DSVDD (3%), FSNet (21%) and iForest (33%) in terms of AUC-ROC; more impressively, it achieves even more substantial improvement in AUC-PR, i.e., 108% w.r.t. REPEN, 27% w.r.t. DSVDD, 111% w.r.t. FSNet, and 304% w.r.t. iForest. All of the improvements are significant at the 95% or 99% confidence level. It is often very challenging to achieve large AUC-PR due to the rareness of anomalies; such impressive AUC-PR improvement shows the superior capability of PRIOR in reducing false positives. This is very encouraging considering that the labeled anomalies account for only 0.005%-0.6% of the training data and 0.08%-6% of the anomaly class (see the data characteristics in Table 2).
The superiority of PRIOR is mainly due to its regression-based end-to-end anomaly score learning, which leverages the augmented instance pairs to efficiently learn abstractions of both normality and abnormality, achieving more accurate anomaly scores than its counterparts that focus on feature learning. It is interesting that on census PRIOR outperforms DSVDD in AUC-PR but underperforms in AUC-ROC; this may be because the ternary ordinal regression allows PRIOR to place more emphasis on detecting anomalies, whereas DSVDD focuses on modeling the normal class only.
Data Efficiency
Experiment Settings
This section examines the data efficiency by inspecting the performance w.r.t. different numbers of labeled anomalies, ranging from 5 to 120, with the contamination level fixed at 2%.
Results
The AUC-PR results for the data efficiency test are shown in Figure 2; similar results are observed in AUC-ROC. The unsupervised method iForest is insensitive to the labeled data and is used as the baseline. The performance of all deep methods generally increases with more labeled data. However, additional anomalies do not always help, due to the heterogeneous anomalous behaviors exhibited by different anomalies. PRIOR is more stable and generalizes better in such cases.
PRIOR is the most data-efficient method, obtaining the best average AUC-PR across different numbers of labeled anomalies. Impressively, PRIOR can use 88% fewer labeled anomalies yet achieve much better AUC-PR than the best contender DSVDD on w7a, news20 and thyroid; and it requires 50% fewer labeled anomalies to obtain performance comparable to, or better than, the best contender FSNet on donors. This difference is mainly due to the direct optimization of anomaly scores in PRIOR, against the indirect optimization in its counterparts.
Compared to the unsupervised iForest, even when very few (e.g., five) labeled anomalies are used, the improvement of the weakly-supervised methods is very substantial on nearly all data sets, e.g., the average improvement of PRIOR using five labeled anomalies is more than 410%. This shows the power of combining deep models with a few labeled anomalies. On campaign, which may contain very intricate anomalies, the deep methods need slightly more labeled anomalies to achieve similar improvement.
Robustness w.r.t. Anomaly Contamination
Experiment Settings
We examine the robustness w.r.t. different anomaly contamination levels, {0%, 2%, 5%, 10%, 20%}, with the number of labeled anomalies fixed to 30.
Results
The AUC-PR results for the robustness test are presented in Figure 3. PRIOR achieves the best robustness w.r.t. different contamination levels. Similar to the results based on 2% contaminated data in Table 2, PRIOR also significantly outperforms all four contenders by a large margin at the other contamination levels. Particularly, the substantial improvement of PRIOR persists consistently across contamination levels on campaign, w7a, news20 and thyroid; PRIOR performs comparably well to the best competing method on the other data sets. The demonstrated robustness of PRIOR benefits from its efficiency in leveraging the limited labeled anomalies, i.e., its greater data efficiency helps counteract the noise introduced by the contaminated instances.
Effectiveness of Detecting Unseen Anomalies
Experiment Settings
Although the above results, especially on campaign, w7a, news20 and thyroid, show impressive generalizability of PRIOR, this section explicitly evaluates its generalizability to unseen anomalies. The experiment is performed on the UNSW-NB15 data, from which our backdoor data set is derived. UNSW-NB15 is chosen because it contains diverse types of real anomalies, including, besides backdoor attacks, eight other network attacks such as DoS, worms, reconnaissance and shellcode [Moustafa and Slay2015]. Detection models are trained on the training data of the backdoor data set. They are then evaluated on the test data of backdoor, but with the backdoor attacks replaced by randomly selected unseen anomalies from the eight types of new network attacks. We vary the number of added unseen anomalies from 64 up to 1,024 and guarantee that all eight types of attacks are evenly represented in each test set.
Results
The performance results are shown in Figure 4. In terms of both AUCROC and AUCPR, PRIOR performs consistently and substantially better than all the competing methods in identifying the unseen anomalies across all cases. As discussed in our theoretical analysis, in addition to the patterns of labeled anomalies, PRIOR also learns the abstractions of normal instances due to the ternary ordinal regression in its pairwise relation learning. Therefore, even if the unseen anomalies demonstrate different abnormal behaviors from the seen anomalies, PRIOR would assign large anomaly scores to them when they are different from the normal instances. This is the main driving force underlying the superior performance of PRIOR here.
Note that since the normal instances remain the same in all the test sets with only the anomalies changed, AUCROC is relatively stable on the five different test sets. However, on the right panel, increasing the number of anomalies enlarges the anomaly proportion, making it easier to obtain better detection recall, and thus, better AUCPR performance.
Related Work
Deep Anomaly Detection
Traditional anomaly detection approaches are often ineffective in high-dimensional or non-linearly separable data due to the curse of dimensionality and their deficiency in capturing non-linear relations [Aggarwal2017a]. Deep anomaly detection has shown promising results in handling such complex data; most methods learn representations separately from anomaly measures [Hawkins et al.2002, Chen et al.2017, Schlegl et al.2017, Zenati et al.2018]. Very recent methods [Zong et al.2018, Pang et al.2018, Ruff et al.2018] unify representation learning and anomaly detection to learn features better suited to specific anomaly detectors. However, all of these methods have an optimization objective focused on feature representations, so the anomaly scores output by the downstream anomaly measures are optimized only indirectly. By contrast, PRIOR performs end-to-end learning of anomaly scores, fulfilling a direct optimization of the scores. Our concurrent work DevNet [Pang et al.2019] addresses a similar problem but employs a completely different approach from this study: here we formulate the problem as a pairwise relation learning task and address it using deep ternary ordinal regression without any assumption on the data distribution, whereas DevNet assumes the anomaly scores of data instances follow a Gaussian distribution and leverages this prior to learn anomaly scores by fitting a Z-score-based deviation. In terms of empirical performance, DevNet and PRIOR perform comparably well on the eight shared data sets.
Weakly- and Semi-supervised Anomaly Detection
Many methods have been introduced to leverage labeled normal instances to learn patterns of the normal class, commonly referred to as semi-supervised anomaly detection methods [Noto et al.2012, Görnitz et al.2013, Ienco et al.2017]. The semi-supervised setting is relevant to our problem because of the availability of both labeled and unlabeled data, but they are two fundamentally different tasks due to the difference in training data and problem nature. Specifically, we have only a few labeled anomalies rather than large-scale labeled normal data; and since anomalies are often from different distributions or manifolds, such limited labeled anomaly data can hardly cover all types of anomalies. Therefore, instead of modeling labeled normal data, our problem requires learning patterns of the labeled anomalies that also generalize well to unseen anomalies. A few studies [McGlohon et al.2009, Tamersoy et al.2014, Zhang et al.2018, Pang et al.2018] show that these limited labeled anomalies can be leveraged to substantially improve detection accuracy. However, these studies exploit the labeled anomalies to enhance anomaly scoring via label propagation [McGlohon et al.2009, Tamersoy et al.2014], representation learning [Pang et al.2018] or classification models [Zhang et al.2018], failing to sufficiently utilize the labeled data and/or detect unseen anomalies.
This research line is also relevant to few-shot classification [Fei-Fei et al.2006, Vinyals et al.2016, Snell et al.2017] and positive and unlabeled (PU) learning [Li and Liu2003, Elkan and Noto2008, Sansone et al.2018] because of the availability of limited labeled positive instances (anomalies). However, they differ in a critical way: both areas implicitly assume that the few labeled instances share the same intrinsic class structure as the other instances of the same class (e.g., the anomaly class), whereas the few labeled anomalies and the unseen/unknown anomalies may have very different class structures. This presents significant challenges to the techniques of both areas.
Conclusions
This paper introduces a novel formulation and its instantiation to devise ordinal regression networks for weakly-supervised deep anomaly detection. Our approach achieves significant improvement over four state-of-the-art competing methods in terms of effectiveness on real-world data, data efficiency, robustness, and generalizability to unseen anomalies. This in turn justifies the effectiveness of synthesizing the pairing-based data augmentation with the ordinal-regression-based anomaly score learning.
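The pairing-based data augmentation underlying this synthesis can be illustrated with a minimal sketch: training samples are instance pairs, and each pair type receives an ordinal target satisfying c_aa > c_au > c_uu (anomaly-anomaly highest, unlabeled-unlabeled lowest). The function name and the concrete target values below are hypothetical, chosen only to show the ternary ordering, not the paper's exact constants.

```python
import random

def make_ordinal_pairs(anomalies, unlabeled, n_pairs=6, targets=(8, 4, 0), seed=0):
    """Sketch of pairing-based augmentation for ternary ordinal regression.

    Builds (anomaly, anomaly), (anomaly, unlabeled), and (unlabeled, unlabeled)
    pairs, labeled with decreasing ordinal targets; a regression network is then
    trained to map each pair to its target, yielding anomaly scores end-to-end.
    """
    rnd = random.Random(seed)  # fixed seed for reproducible sampling
    pairs = []
    for _ in range(n_pairs):
        pairs.append(((rnd.choice(anomalies), rnd.choice(anomalies)), targets[0]))
        pairs.append(((rnd.choice(anomalies), rnd.choice(unlabeled)), targets[1]))
        pairs.append(((rnd.choice(unlabeled), rnd.choice(unlabeled)), targets[2]))
    return pairs
```

Because every labeled anomaly can be paired with many unlabeled instances, even a handful of labeled anomalies yields a large augmented training set, which is what makes the approach data-efficient.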
References
[Aggarwal2017a] Charu C Aggarwal. Outlier analysis. Springer, 2017.
[Aggarwal2017b] Charu C Aggarwal. Supervised outlier detection. In Outlier Analysis, pages 219–248. Springer, 2017.
[Boyd et al.2013] Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: point estimates and confidence intervals. In ECML/PKDD, pages 451–466. Springer, 2013.
[Chen et al.2017] Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. Outlier detection with autoencoder ensembles. In SDM, pages 90–98. SIAM, 2017.
[Elkan and Noto2008] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In KDD, pages 213–220. ACM, 2008.
[Fei-Fei et al.2006] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.
[Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
[Görnitz et al.2013] Nico Görnitz, Marius Kloft, Konrad Rieck, and Ulf Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235–262, 2013.
[Gutiérrez et al.2016] Pedro Antonio Gutiérrez, Maria Perez-Ortiz, Javier Sanchez-Monedero, Francisco Fernandez-Navarro, and Cesar Hervas-Martinez. Ordinal regression methods: survey and experimental study. IEEE Transactions on Knowledge and Data Engineering, 28(1):127–146, 2016.
[Hawkins et al.2002] Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using replicator neural networks. In DaWaK, 2002.
[Ienco et al.2017] Dino Ienco, Ruggero G Pensa, and Rosa Meo. A semi-supervised approach to the detection and characterization of outliers in categorical data. IEEE Transactions on Neural Networks and Learning Systems, 28(5):1017–1029, 2017.
[Li and Liu2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, volume 3, pages 587–592, 2003.
[Liu et al.2012] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3, 2012.
[McGlohon et al.2009] Mary McGlohon, Stephen Bay, Markus G Anderle, David M Steier, and Christos Faloutsos. SNARE: A link analytic system for graph labeling and risk detection. In KDD, pages 1265–1274. ACM, 2009.
[Moustafa and Slay2015] Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems. In Military Communications and Information Systems Conference, pages 1–6, 2015.
[Noto et al.2012] Keith Noto, Carla Brodley, and Donna Slonim. FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1):109–133, 2012.
[Pang et al.2018] Guansong Pang, Longbing Cao, Ling Chen, and Huan Liu. Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In KDD, pages 2041–2050, 2018.
[Pang et al.2019] Guansong Pang, Chunhua Shen, and Anton van den Hengel. Deep anomaly detection with deviation networks. In KDD, pages 353–362. ACM, 2019.
[Perez and Wang2017] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint:1712.04621, 2017.
[Randhawa et al.2018] Kuldeep Randhawa, Chu Kiong Loo, Manjeevan Seera, Chee Peng Lim, and Asoke K Nandi. Credit card fraud detection using AdaBoost and majority voting. IEEE Access, 6:14277–14284, 2018.
[Ruff et al.2018] Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In ICML, pages 4390–4399, 2018.
[Sansone et al.2018] Emanuele Sansone, Francesco GB De Natale, and Zhi-Hua Zhou. Efficient training for positive unlabeled learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[Schlegl et al.2017] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI, pages 146–157. Springer, 2017.
[Snell et al.2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
[Tamersoy et al.2014] Acar Tamersoy, Kevin Roundy, and Duen Horng Chau. Guilt by association: Large scale malware detection by mining file-relation graphs. In KDD, pages 1524–1533, 2014.
[Tax and Duin2004] David MJ Tax and Robert PW Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
[Vinyals et al.2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
[Woolson2007] RF Woolson. Wilcoxon signed-rank test. Wiley Encyclopedia of Clinical Trials, pages 1–3, 2007.
[Zenati et al.2018] Houssam Zenati, Manon Romain, Chuan-Sheng Foo, Bruno Lecouat, and Vijay Chandrasekhar. Adversarially learned anomaly detection. In ICDM, pages 727–736. IEEE, 2018.
[Zhang and LeCun2015] Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint:1502.01710, 2015.
[Zhang et al.2018] Ya-Lin Zhang, Longfei Li, Jun Zhou, Xiaolong Li, and Zhi-Hua Zhou. Anomaly detection with partially observed anomalies. In WWW Companion, pages 639–646, 2018.
[Zong et al.2018] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR, 2018.