MIDAS
Detecting Microcluster Anomalies in Edge Streams
Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? Existing approaches aim to detect individually surprising edges. In this work, we propose MIDAS, which focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, such as lockstep behavior, including denial of service attacks in network traffic data. MIDAS has the following properties: (a) it detects microcluster anomalies while providing theoretical guarantees about its false positive probability; (b) it is online, thus processing each edge in constant time and constant memory, and also processes the data 108-505 times faster than state-of-the-art approaches; (c) it provides higher accuracy (in terms of AUC) than state-of-the-art approaches.
Anomaly detection in graphs is a critical problem for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. This has been a well-researched problem, with the majority of the proposed approaches [1, 4, 9, 10, 11, 17] focusing on static graphs. However, many real-world graphs are dynamic in nature, and methods based on static connections may miss temporal characteristics of the graphs and anomalies.
Among the methods focusing on dynamic graphs, most aggregate edges into graph snapshots [6, 21, 20, 12, 19, 8]. However, to minimize the effect of malicious activities and start recovery as soon as possible, we need to detect anomalies in real time or near real time, i.e. to identify whether an incoming edge is anomalous as soon as we receive it. In addition, since the number of vertices can increase as we process the stream of edges, we need an algorithm which uses constant memory in graph size.
Moreover, fraudulent or anomalous events in many applications occur in microclusters, or suddenly arriving groups of suspiciously similar edges, e.g. denial of service attacks in network traffic data and lockstep behavior. However, existing methods which process edge streams in an online manner, including [7, 14], aim to detect individually surprising edges, not microclusters, and can thus miss large amounts of suspicious activity.
In this work, we propose Midas, which detects microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, in edge streams, using constant time and memory. In addition, by using a principled hypothesis testing framework, Midas provides theoretical bounds on the false positive probability, which these methods do not provide.
Our main contributions are as follows:
Streaming Microcluster Detection: We propose a novel streaming approach for detecting microcluster anomalies, requiring constant time and memory.
Theoretical Guarantees: In Theorem 1, we show guarantees on the false positive probability of Midas.
Effectiveness: Our experimental results show that Midas outperforms baseline approaches in accuracy (in terms of AUC), and processes the data faster than baseline approaches.
Reproducibility: Our code and datasets are publicly available at https://github.com/bhatiasiddharth/MIDAS.
In this section, we review previous approaches to detect anomalous signs on static and dynamic graphs.
See [2] for an extensive survey on graph-based anomaly detection.
Anomaly detection in static graphs can be classified by which anomalous entities (nodes, edges, subgraphs, etc.) are spotted.
Anomalous node detection: [1] extracts egonet-based features and finds empirical patterns with respect to the features. Then, it identifies nodes whose egonets deviate from the patterns, including the count of triangles, total weight, and principal eigenvalues. [10] computes node features, including degree and authoritativeness [11], then spots nodes whose neighbors are notably close in the feature space.
Anomaly detection in graph streams uses as input a series of graph snapshots over time. We categorize them similarly according to the type of anomaly detected:
Anomalous node detection: [21] approximates the adjacency matrix of the current snapshot based on incremental matrix factorization, then spots nodes corresponding to rows with high reconstruction error.
Anomaly detection in edge streams uses as input a stream of edges over time. We categorize them according to the type of anomaly detected:
Anomalous node detection: Given an edge stream, [24] detects nodes whose egonets suddenly and significantly change.
Anomalous subgraph detection: Given an edge stream, [18] identifies dense subtensors created within a short time.
Only the two methods in the last category (SedanSpot [7] and RHSS [14]) are applicable to our task, as they operate on edge streams and output a score per edge. However, as shown in Table 1, neither method aims to detect microclusters or provides guarantees on false positive probability.
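Table 1: Comparison of relevant edge stream anomaly detection approaches.
Property | SedanSpot [7] | RHSS [14] | Midas
Microcluster Detection | no | no | yes
Guarantee on False Positive Probability | no | no | yes
Constant Memory |  |  | yes
Constant Update Time |  |  | yes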
Let $\mathcal{E} = \{e_1, e_2, \cdots\}$ be a stream of edges from a time-evolving graph $\mathcal{G}$. Each arriving edge is a tuple $e_i = (u_i, v_i, t_i)$ consisting of a source node $u_i \in \mathcal{V}$, a destination node $v_i \in \mathcal{V}$, and a time of occurrence $t_i$, which is the time at which the edge was added to the graph. For example, in a network traffic stream, an edge $e_i$ could represent a connection made from a source IP address $u_i$ to a destination IP address $v_i$ at time $t_i$. We do not assume that the set of vertices $\mathcal{V}$ is known a priori: for example, new IP addresses or user IDs may be created over the course of the stream.
We model $\mathcal{G}$ as a directed graph. Undirected graphs can simply be handled by treating an incoming undirected edge $(u_i, v_i, t_i)$ as two simultaneous directed edges, one in either direction.
We also allow $\mathcal{G}$ to be a multigraph: edges can be created multiple times between the same pair of nodes. Edges are allowed to arrive simultaneously, i.e. $t_{i+1} \ge t_i$, since in many applications the $t_i$ are given in the form of discrete time ticks.
The desired properties of our algorithm are as follows:
Microcluster Detection: It should detect suddenly appearing bursts of activity which share many repeated nodes or edges, which we refer to as microclusters.
Guarantees on False Positive Probability: Given any user-specified probability level $\epsilon$, the algorithm should be adjustable so as to provide a false positive probability of at most $\epsilon$ (e.g. by adjusting a threshold that depends on $\epsilon$). Moreover, while guarantees on the false positive probability rely on assumptions about the data distribution, we aim to make our assumptions as weak as possible.
Constant Memory and Update Time: For scalability in the streaming setting, the algorithm should run in constant memory and constant update time per newly arriving edge. Thus, its memory usage and update time should not grow with the length of the stream, or the number of nodes in the graph.
Next, we describe our Midas and Midas-R approaches. The following provides an overview:
Streaming Hypothesis Testing Approach: We describe our Midas algorithm, which uses streaming data structures within a hypothesis testing-based framework, allowing us to obtain guarantees on false positive probability.
Detection and Guarantees: We describe our decision procedure for determining whether a point is anomalous, and our guarantees on false positive probability.
Incorporating Relations: We extend our approach to the Midas-R algorithm, which incorporates relationships between edges temporally and spatially. (We use 'spatially' in a graph sense, i.e. connecting nearby nodes, not to refer to any other continuous spatial dimension.)
Consider the example in Figure 1 of a single source-destination pair $(u, v)$, which shows a large burst of activity at a single time tick. This burst is the simplest example of a microcluster, as it consists of a large group of edges which are very similar to one another (in fact identical), both spatially (i.e. in terms of the nodes they connect) and temporally.
In an offline setting, there are many time-series methods which could detect such bursts of activity. However, in an online setting, recall that we want memory usage to be bounded, so we cannot keep track of even a single such time series. Moreover, there are many such source-destination pairs, and the set of sources and destinations is not fixed a priori.
To circumvent these problems, we maintain two types of Count-Min-Sketch (CMS) [5] data structures. Assume we are at a particular fixed time tick $t$ in the stream; we treat time as a discrete variable for simplicity. Let $s_{uv}$ be the total number of edges from $u$ to $v$ up to the current time. Then, we use a single CMS data structure to approximately maintain all such counts (for all edges $uv$) in constant memory: at any time, we can query the data structure to obtain an approximate count $\hat{s}_{uv}$.
Secondly, let $a_{uv}$ be the number of edges from $u$ to $v$ in the current time tick (but not including past time ticks). We keep track of $a_{uv}$ using a similar CMS data structure, the only difference being that we reset this CMS data structure every time we transition to the next time tick. Hence, this CMS data structure provides approximate counts $\hat{a}_{uv}$ for the number of edges from $u$ to $v$ in the current time tick $t$.
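To make the two counters concrete, the following is a minimal Python sketch of a count-min sketch and of how a total counter and a per-tick counter could be maintained side by side. The class name `CountMinSketch`, its parameters, and the salted use of Python's built-in `hash` are illustrative assumptions, not the hashing scheme of the authors' C++ implementation.

```python
import random


class CountMinSketch:
    """Fixed-size approximate counter: `rows` hash rows of `buckets` cells each."""

    def __init__(self, rows: int, buckets: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.buckets = buckets
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(rows)]
        self.table = [[0] * buckets for _ in range(rows)]

    def _cells(self, key):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, key)) % self.buckets

    def add(self, key, amount: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += amount

    def query(self, key) -> int:
        # Collisions can only inflate a cell, so the minimum is the tightest estimate.
        return min(self.table[row][col] for row, col in self._cells(key))

    def clear(self) -> None:
        self.table = [[0] * self.buckets for _ in self.salts]


total = CountMinSketch(rows=2, buckets=1024)    # approximates the total counts
current = CountMinSketch(rows=2, buckets=1024)  # approximates the per-tick counts
last_tick = 1
for u, v, t in [(1, 2, 1), (1, 2, 1), (1, 2, 2)]:
    if t != last_tick:       # entering a new time tick: reset per-tick counts
        current.clear()
        last_tick = t
    total.add((u, v))
    current.add((u, v))

print(total.query((1, 2)), current.query((1, 2)))  # 3 1
```

Memory use depends only on `rows * buckets`, not on the number of distinct nodes or edges seen, which is what allows the counts to be kept in constant memory.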
Given approximate counts $\hat{s}_{uv}$ and $\hat{a}_{uv}$, how can we detect microclusters? Moreover, how can we do this in a principled framework that allows for theoretical guarantees?
Fix a particular source and destination pair of nodes, $(u, v)$, as in Figure 1. One approach would be to assume that the time series in Figure 1 follows a particular generative model: for example, a Gaussian distribution. We could then find the mean and standard deviation of this Gaussian distribution. Then, at time $t$, we could compute the Gaussian likelihood of the number of edge occurrences in the current time tick, and declare an anomaly if this likelihood is below a specified threshold.
However, this requires a restrictive Gaussian assumption, which can lead to excessive false positives or negatives if the data follows a very different distribution. Instead, we use a weaker assumption: that the mean level (i.e. the average rate at which edges appear) in the current time tick $t$ is the same as the mean level before the current time tick. Note that this avoids assuming any particular distribution for each time tick, and also avoids a strict assumption of stationarity over time.
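Hence, we can divide the past edges into two classes: the current time tick $t$ and all past time ticks ($< t$). Recalling our previous notation, the number of events at $t$ is $a_{uv}$, while the number of edges in past time ticks is $s_{uv} - a_{uv}$.
Under the chi-squared goodness-of-fit test, the chi-squared statistic is defined as the sum over categories of $\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$. In this case, our categories are $t$ and $< t$. Under our mean level assumption, since we have $s_{uv}$ total edges (for this source-destination pair), the expected number at $t$ is $\frac{s_{uv}}{t}$, and the expected number for $< t$ is the remainder, i.e. $\frac{t-1}{t} s_{uv}$. Thus the chi-squared statistic is:
$$X^2 = \frac{\left(a_{uv} - \frac{s_{uv}}{t}\right)^2}{\frac{s_{uv}}{t}} + \frac{\left((s_{uv} - a_{uv}) - \frac{t-1}{t} s_{uv}\right)^2}{\frac{t-1}{t} s_{uv}} = \left(a_{uv} - \frac{s_{uv}}{t}\right)^2 \frac{t^2}{s_{uv}(t-1)}$$
Note that both $a_{uv}$ and $s_{uv}$ can be estimated by our CMS data structures, obtaining approximations $\hat{a}_{uv}$ and $\hat{s}_{uv}$ respectively. This leads to our following anomaly score, using which we can evaluate a newly arriving edge with source-destination pair $(u, v)$:
Given a newly arriving edge $(u, v, t)$, our anomaly score is computed as:
$$\text{score}((u, v, t)) = \left(\hat{a}_{uv} - \frac{\hat{s}_{uv}}{t}\right)^2 \frac{t^2}{\hat{s}_{uv}(t-1)} \qquad (1)$$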
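As a quick numerical check of this statistic, the sketch below computes the two-category chi-squared sum directly and compares it with the closed form $(a - s/t)^2 \, t^2 / (s(t-1))$, where $a$ is the count in the current tick, $s$ the total count, and $t$ the index of the current tick (assumed to start at 1). The function names are illustrative, and exact counts are used in place of CMS estimates.

```python
def chi_squared_two_category(a: float, s: float, t: int) -> float:
    """Sum of (observed - expected)^2 / expected over 'tick t' and 'ticks before t'."""
    expected_now = s / t
    expected_past = s * (t - 1) / t
    observed_now = a
    observed_past = s - a
    return ((observed_now - expected_now) ** 2 / expected_now
            + (observed_past - expected_past) ** 2 / expected_past)


def closed_form(a: float, s: float, t: int) -> float:
    """Simplified form of the same statistic."""
    return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))


# 100 edges overall, 50 of them in tick 10: far above the expected 10 per tick.
print(round(chi_squared_two_category(50, 100, 10), 2))  # 177.78
print(round(closed_form(50, 100, 10), 2))               # 177.78
# 10 of 100 edges in tick 10 matches the mean level exactly.
print(closed_form(10, 100, 10))                         # 0.0
```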
Algorithm 1 summarizes our Midas algorithm.
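The streaming procedure can be sketched in Python as follows. This is a hedged illustration of the steps described in the text (update both counters, query them, evaluate the score), not the authors' C++ code. The `Sketch` and `Midas` class names, the default sizes, and the choice to return a score of 0 in the first time tick (where the statistic is undefined) are assumptions made for the example.

```python
import random


class Sketch:
    """Minimal count-min sketch: a fixed rows x buckets table of counters."""

    def __init__(self, rows, buckets, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(rows)]
        self.buckets = buckets
        self.table = [[0.0] * buckets for _ in range(rows)]

    def add(self, key):
        for row, salt in zip(self.table, self.salts):
            row[hash((salt, key)) % self.buckets] += 1.0

    def query(self, key):
        return min(row[hash((salt, key)) % self.buckets]
                   for row, salt in zip(self.table, self.salts))

    def clear(self):
        self.table = [[0.0] * self.buckets for _ in self.salts]


class Midas:
    def __init__(self, rows=2, buckets=1024):
        self.total = Sketch(rows, buckets)    # approximates total counts per pair
        self.current = Sketch(rows, buckets)  # approximates counts in the current tick
        self.tick = 1

    def score(self, u, v, t):
        """Update both sketches with edge (u, v, t) and return its anomaly score."""
        if t > self.tick:           # new time tick: forget per-tick counts
            self.current.clear()
            self.tick = t
        self.total.add((u, v))
        self.current.add((u, v))
        a_hat = self.current.query((u, v))
        s_hat = self.total.query((u, v))
        if t <= 1:                  # assumption: no history yet, so score 0
            return 0.0
        return (a_hat - s_hat / t) ** 2 * t ** 2 / (s_hat * (t - 1))


detector = Midas()
steady = [detector.score(1, 2, t) for t in range(1, 10)]  # one edge per tick
burst = [detector.score(1, 2, 10) for _ in range(20)]     # 20 edges in tick 10
print(max(steady), round(burst[-1], 2))  # 0.0 112.03
```

Each call touches a fixed number of counters, so the per-edge cost does not depend on the length of the stream; the per-tick reset costs time proportional to the sketch size only.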
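While Algorithm 1 computes an anomaly score for each edge, it does not provide a binary decision for whether an edge is anomalous or not. We want a decision procedure that provides binary decisions and a guarantee on the false positive probability: i.e. given a user-defined threshold $\epsilon$, the probability of a false positive should be at most $\epsilon$. Intuitively, the key idea is to combine the approximation guarantees of CMS data structures with properties of a chi-squared random variable.
The key property of CMS data structures we use is that given any $\epsilon$ and $\nu$, for appropriately chosen CMS data structure sizes, with probability at least $1 - \frac{\epsilon}{2}$, the estimates $\hat{a}_{uv}$ of the current-tick counts $a_{uv}$ satisfy:
$$\hat{a}_{uv} \le a_{uv} + \nu \cdot N_t \qquad (2)$$
where $N_t$ is the total number of edges at time $t$. Since CMS data structures can only overestimate the true counts, we additionally have, for the total counts $s_{uv}$ and their estimates $\hat{s}_{uv}$,
$$s_{uv} \le \hat{s}_{uv} \qquad (3)$$
Define an adjusted version of our earlier score:
$$\tilde{a}_{uv} = \hat{a}_{uv} - \nu N_t \qquad (4)$$
To obtain its probabilistic guarantee, our decision procedure computes $\tilde{a}_{uv}$, and uses it to compute an adjusted version of our earlier statistic:
$$\tilde{X}^2 = \left(\tilde{a}_{uv} - \frac{\hat{s}_{uv}}{t}\right)^2 \frac{t^2}{\hat{s}_{uv}(t-1)} \qquad (5)$$
Then our main guarantee is as follows:
Theorem 1. Let $\chi^2_{1-\epsilon/2}(1)$ be the $1 - \epsilon/2$ quantile of a chi-squared random variable with 1 degree of freedom. Then:
$$P\left(\tilde{X}^2 > \chi^2_{1-\epsilon/2}(1)\right) < \epsilon \qquad (6)$$
In other words, using $\tilde{X}^2$ as our test statistic and threshold $\chi^2_{1-\epsilon/2}(1)$ results in a false positive probability of at most $\epsilon$.
Proof. Recall that
$$X^2 = \left(a_{uv} - \frac{s_{uv}}{t}\right)^2 \frac{t^2}{s_{uv}(t-1)} \qquad (7)$$
was defined so that it has a chi-squared distribution. Thus:
$$P\left(X^2 \le \chi^2_{1-\epsilon/2}(1)\right) = 1 - \epsilon/2 \qquad (8)$$
At the same time, by the CMS guarantees we have:
$$P\left(\hat{a}_{uv} \le a_{uv} + \nu \cdot N_t\right) \ge 1 - \epsilon/2 \qquad (9)$$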
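A possible decision procedure built on this guarantee is sketched below. It assumes the adjusted count is the CMS estimate of the current-tick count minus $\nu N_t$, and that the threshold is the $1 - \epsilon/2$ quantile of a chi-squared variable with 1 degree of freedom; both are our reading of the surrounding text. The function names and the example inputs are hypothetical. The quantile is obtained from the standard normal quantile, since a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi2_quantile_1dof(p: float) -> float:
    """p-quantile of a chi-squared variable with 1 degree of freedom.

    If Z is standard normal, P(Z^2 <= x) = 2 * Phi(sqrt(x)) - 1, so the
    p-quantile is the square of the normal (1 + p) / 2 quantile.
    """
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def is_anomalous(a_hat, s_hat, t, n_t, eps, nu):
    """Flag an edge when the adjusted statistic exceeds the chi-squared threshold.

    a_hat, s_hat: CMS estimates of the current-tick and total counts.
    n_t: total number of edges at time t; eps: target false positive level;
    nu: CMS approximation error. Requires t > 1 and s_hat > 0.
    """
    a_tilde = a_hat - nu * n_t  # adjusted current count
    x2_tilde = (a_tilde - s_hat / t) ** 2 * t ** 2 / (s_hat * (t - 1))
    return x2_tilde > chi2_quantile_1dof(1.0 - eps / 2.0)


print(round(chi2_quantile_1dof(0.95), 2))                             # 3.84
print(is_anomalous(50, 100, t=10, n_t=1000, eps=0.05, nu=0.001))      # True
print(is_anomalous(10, 100, t=10, n_t=1000, eps=0.05, nu=0.001))      # False
```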
In this section, we describe our Midas-R approach, which considers edges in a relational manner: that is, it aims to group together edges which are nearby, either temporally or spatially.
Rather than just counting edges in the same time tick (as we do in Midas), we want to allow for some temporal flexibility: i.e. edges in the recent past should also count toward the current time tick, but modified by a reduced weight. A simple and efficient way to do this using our CMS data structures is as follows: at the end of every time tick, rather than resetting the CMS data structure for the current-tick counts $a_{uv}$, we reduce all its counts by a fixed fraction $\alpha \in (0, 1)$. This allows past edges to count toward the current time tick, with a diminishing weight.
We would like to catch large groups of spatially nearby edges: e.g. a single source IP address suddenly creating a large number of edges to many destinations, or a small group of nodes suddenly creating an abnormally large number of edges between them. A simple intuition we use is that in either of these two cases, we expect to observe nodes with a sudden appearance of a large number of edges. Hence, we can use CMS data structures to keep track of edge counts like before, except counting all edges adjacent to any node $u$. Specifically, we create CMS counters $\hat{a}_u$ and $\hat{s}_u$ to approximate the current and total edge counts adjacent to node $u$. Given each incoming edge $(u, v)$, we can then compute three anomalousness scores: one for edge $(u, v)$, as in our previous algorithm; one for node $u$; and one for node $v$. Finally, we combine the three scores by taking their maximum value. Another possibility for aggregating the three scores is to take their sum. Algorithm 2 summarizes the resulting Midas-R algorithm.
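The two extensions can be sketched together in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the decay is applied by multiplying the per-tick counters by `alpha` once whenever the tick changes, a single pair of node sketches counts edges adjacent to either endpoint, the default `alpha=0.5` is a placeholder, and the score is defined as 0 when there is no history. All class and function names are hypothetical.

```python
import random


class DecayingSketch:
    """Minimal count-min sketch whose counters can be scaled down in place."""

    def __init__(self, rows, buckets, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(rows)]
        self.buckets = buckets
        self.table = [[0.0] * buckets for _ in range(rows)]

    def _cell(self, salt, key):
        return hash((salt, key)) % self.buckets

    def add(self, key):
        for row, salt in zip(self.table, self.salts):
            row[self._cell(salt, key)] += 1.0

    def query(self, key):
        return min(row[self._cell(salt, key)]
                   for row, salt in zip(self.table, self.salts))

    def decay(self, alpha):
        for row in self.table:
            for i in range(self.buckets):
                row[i] *= alpha


def chi2_score(a, s, t):
    """Two-category chi-squared statistic; taken as 0 when there is no history."""
    if s == 0 or t <= 1:
        return 0.0
    return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))


class MidasR:
    def __init__(self, rows=2, buckets=1024, alpha=0.5):
        self.alpha = alpha  # placeholder decay factor, not a value from the paper
        self.edge_total = DecayingSketch(rows, buckets)
        self.edge_current = DecayingSketch(rows, buckets)
        self.node_total = DecayingSketch(rows, buckets)
        self.node_current = DecayingSketch(rows, buckets)
        self.tick = 1

    def score(self, u, v, t):
        if t > self.tick:
            # Temporal relation: shrink per-tick counts instead of resetting them.
            self.edge_current.decay(self.alpha)
            self.node_current.decay(self.alpha)
            self.tick = t
        self.edge_total.add((u, v))
        self.edge_current.add((u, v))
        for node in (u, v):
            # Spatial relation: count every edge adjacent to each endpoint.
            self.node_total.add(node)
            self.node_current.add(node)
        return max(
            chi2_score(self.edge_current.query((u, v)), self.edge_total.query((u, v)), t),
            chi2_score(self.node_current.query(u), self.node_total.query(u), t),
            chi2_score(self.node_current.query(v), self.node_total.query(v), t),
        )
```

In a fan-out burst where one source contacts many distinct destinations, each source-destination pair is new, so its edge-level count stays small while the node-level count of the source keeps growing; taking the maximum lets the node term raise the score of such edges.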
In terms of memory, both Midas and Midas-R only need to maintain the CMS data structures over time, which are proportional to $O(wb)$, where $w$ and $b$ are the number of hash functions and the number of buckets in the CMS data structures; this is bounded with respect to the data size.
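To give a feel for the sizes involved, the sketch below applies the standard count-min sketch sizing rule: $\lceil e/\nu \rceil$ buckets and $\lceil \ln(1/\delta) \rceil$ hash functions give an additive error of at most $\nu N$ with probability at least $1 - \delta$. Setting $\delta = \epsilon/2$ is our assumption about how that rule would be matched to a target false positive level $\epsilon$; the helper names and the example values are hypothetical.

```python
import math


def cms_dimensions(eps, nu):
    """Hash functions w and buckets b for error <= nu * N with prob. >= 1 - eps / 2."""
    rows = math.ceil(math.log(2.0 / eps))  # w hash functions
    buckets = math.ceil(math.e / nu)       # b buckets per hash function
    return rows, buckets


def total_counters(rows, buckets, num_sketches):
    """Number of counters held in memory across all sketches."""
    return rows * buckets * num_sketches


w, b = cms_dimensions(eps=0.05, nu=0.001)
# Two sketches (total and per-tick edge counts) for the edge-only variant.
print(w, b, total_counters(w, b, num_sketches=2))  # 4 2719 21752
```

The result depends only on the chosen error parameters, not on the number of nodes or the length of the stream.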
In this section, we evaluate the performance of Midas and Midas-R compared to SedanSpot on dynamic graphs. We aim to answer the following questions:
Q1. Accuracy: How accurately does Midas detect real-world anomalies compared to baselines, as evaluated using the ground truth labels?
Q2. Scalability: How does it scale with input stream length? How does the time needed to process each input compare to baseline approaches?
Q3. Real-World Effectiveness: Does it detect meaningful anomalies in case studies on Twitter graphs?
DARPA [13] has IP-IP communications between source IPs and destination IPs over a period measured in minutes. Each communication is a directed edge (srcIP, dstIP, timestamp, attack), where the ground truth attack label indicates whether the communication is an attack or not.
TwitterSecurity [15, 16] has tweet samples for four months (May-Aug 2014) containing Department of Homeland Security keywords related to terrorism or domestic security. Entity-entity co-mention temporal graphs are built on a daily basis.
TwitterWorldCup [15, 16] has tweet samples for the World Cup season (June to July). The tweets are filtered by popular/official World Cup hashtags, such as #worldcup, #fifa, #brazil, etc. Similar to TwitterSecurity, entity-entity co-mention temporal graphs are constructed, here at a sample rate measured in minutes.
As described in our Related Work, only RHSS and SedanSpot operate on edge streams and provide a score for each edge. SedanSpot uses personalised PageRank to detect anomalies in sublinear space and constant time per edge. However, RHSS was evaluated in [7] on the DARPA dataset and found to have an AUC lower than chance. Hence, we only compare with SedanSpot.
All the methods output an anomaly score per edge (higher is more anomalous). We calculate the True Positive Rate (TPR) and False Positive Rate (FPR) and plot the ROC curve (TPR vs FPR). We also report the Area under the ROC curve (AUC) and Average Precision Score.
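For reference, both metrics can be computed from per-edge scores and binary ground truth labels as in the following Python sketch. It is a plain illustration of the definitions, not the evaluation code used in the experiments: the AUC is computed with a quadratic-time pairwise comparison, and the average precision assumes distinct scores (ties are broken by input order). The example labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Probability that a random anomaly outscores a random normal edge (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true anomaly."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))                      # 0.75
print(round(average_precision(labels, scores), 4))  # 0.8333
```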
All experiments are carried out on a machine with an Intel Core processor. We implement Midas and Midas-R in C++. We use a fixed number of hash functions for the CMS data structures, and we set the number of CMS buckets according to the desired approximation error. For Midas-R, we additionally set a temporal decay factor. We used an open-sourced implementation of SedanSpot, provided by the authors, following the parameter settings suggested in the original paper.
Figure 2 plots the ROC curve for Midas-R, Midas and SedanSpot. Figure 3 (top) plots accuracy (AUC) vs. running time (log scale, in seconds, excluding I/O). We see that Midas achieves a much higher accuracy than the baseline, while also running significantly faster. Midas-R achieves the highest accuracy, also improving on the baseline in both accuracy and speed.
Figure 3 (bottom) plots the average precision score vs. running time. We see that Midas is more precise than the baseline. Midas-R achieves the highest average precision score and is more precise than SedanSpot.
We see that Midas and Midas-R greatly outperform SedanSpot on both accuracy and precision metrics.
Figure 4 shows the scalability of Midas and Midas-R. We plot the wall-clock time needed to run on the (chronologically) first edges of the DARPA dataset. This confirms the linear scalability of Midas and Midas-R with respect to the number of edges in the input dynamic graph, due to their constant processing time per edge. Note that both Midas and Midas-R process these edges fast enough to allow real-time anomaly detection.
Figure 5 plots the number of edges (in millions) against the time to process each edge for the DARPA dataset, for both Midas and Midas-R.
Table 2 shows the time it takes SedanSpot, Midas and Midas-R to run on the TwitterWorldCup, TwitterSecurity and DARPA datasets. On all three datasets, both Midas-R and Midas are faster than SedanSpot.
SedanSpot requires several subprocesses (hashing, random-walking, reordering, sampling, etc.), resulting in a large computation time. Midas and Midas-R are both scalable and fast.
Table 2: Running time (in seconds).
Dataset | SedanSpot | Midas | Midas-R
TwitterWorldCup |  |  |
TwitterSecurity |  |  |
DARPA |  |  |
We measure anomaly scores using Midas, Midas-R and SedanSpot on the TwitterSecurity dataset. Figure 6 plots anomaly scores vs. day (during the four months of 2014). To visualise, we aggregate edges occurring in each day by taking the max anomalousness score per day. Anomalies correspond to major world news such as the Mpeketoni attack or the Soma Mine explosion. Midas and Midas-R show similar trends, whereas SedanSpot misses some anomalous events and outputs many high scores unrelated to any true events. This is also reflected in the low accuracy and precision of SedanSpot in Figure 3. The anomalies detected by Midas and Midas-R coincide with major events in the TwitterSecurity timeline as follows:
13-05-2014. Turkey Mine Accident, Hundreds Dead.
24-05-2014. Raid.
30-05-2014. Attack/Ambush.
03-06-2014. Suicide Bombing.
09-06-2014. Suicide/Truck Bombings.
10-06-2014. Iraqi Militants Seized Large Regions.
11-06-2014. Kidnapping.
15-06-2014. Attack.
26-06-2014. Suicide Bombing/Shootout/Raid.
03-07-2014. Israel Conflicts with Hamas in Gaza.
18-07-2014. Airplane with 298 Onboard was Shot Down over Ukraine.
30-07-2014. Ebola Virus Outbreak.
This shows the effectiveness of Midas and Midas-R for catching real-world anomalies.
Microcluster anomalies: Figure 7 corresponds to one of the events in the TwitterSecurity dataset. All single edges are equivalent to 444 edges and double edges are equivalent to 888 edges between the nodes. This suddenly arriving (within 1 day) group of suspiciously similar edges is an example of a microcluster anomaly which Midas-R detects, but SedanSpot misses.
In this paper, we proposed Midas and Midas-R for microcluster-based detection of anomalies in edge streams. Future work could consider more general types of data, including heterogeneous graphs or tensors. Our contributions are as follows:
Streaming Microcluster Detection: We propose a novel streaming approach for detecting microcluster anomalies, requiring constant time and memory.
Theoretical Guarantees: In Theorem 1, we show guarantees on the false positive probability of Midas.
Effectiveness: Our experimental results show that Midas outperforms baseline approaches in accuracy (in terms of AUC), and processes the data faster than baseline approaches.
This work was supported in part by NUS ODPRT Grant R252000A81133.
Autopart: parameter-free graph partitioning and outlier detection. In PKDD.