HashNWalk: Hash and Random Walk Based Anomaly Detection in Hyperedge Streams

Sequences of group interactions, such as emails, online discussions, and co-authorships, are ubiquitous, and they are naturally represented as a stream of hyperedges. Despite their broad potential applications, anomaly detection in hypergraphs (i.e., sets of hyperedges) has received surprisingly little attention compared to that in graphs. While it is tempting to reduce hypergraphs to graphs and apply existing graph-based methods, our experiments show that taking the higher-order structures of hypergraphs into consideration is worthwhile. We propose HashNWalk, an incremental algorithm that detects anomalies in a stream of hyperedges. It maintains and updates a constant-size summary of the structural and temporal information about the stream. Using the summary, which is in the form of a proximity matrix, HashNWalk measures the anomalousness of each new hyperedge as it appears. HashNWalk is (a) Fast: it processes each hyperedge in near real-time and billions of hyperedges within a few hours, (b) Space Efficient: the size of the maintained summary is a predefined constant, (c) Effective: it successfully detects anomalous hyperedges in real-world hypergraphs.

1 Introduction

Figure 1: Strengths of HashNWalk. (a) Speed and preciseness: HashNWalk spots anomalous hyperedges rapidly and precisely in a real-world hypergraph. (b) Scalability: The total runtime of HashNWalk is linear in the size of the input hyperedge stream. (c) Effectiveness: HashNWalk detects interesting patents. Patent 1 cited multiple patents that have not been cited together before, and patents 5-7 cited almost the same set of patents. See Section 5 for details.

A variety of real-world graphs, including computer networks, online social networks, and hyperlink networks, have been targets of attacks. Distributed denial-of-service attacks block the availability by causing an unexpected traffic jam on the target machine. In addition, fake connections in online social networks degrade the quality of recommendations, and those in hyperlink networks manipulate the centrality of webpages. Due to its importance and necessity in real-world applications, anomaly detection in graphs has received considerable attention. To detect nodes, edges, and/or subgraphs deviating from structural and temporal patterns in graphs, various numerical measures of the deviation have been proposed with search algorithms [akoglu2010oddball, hooi2016fraudar, shin2016corescope]. As many real-world graphs evolve over time, detecting anomalies in real-time, as they appear, is desirable [bhatia2020midas, eswaran2018sedanspot].

While graphs model pairwise interactions, interactions in many real-world systems are groupwise (collaborations of co-authors, group interactions on online Q&A platforms, co-purchases of items, etc). Such a groupwise interaction is naturally represented as a hyperedge, i.e., a set of an arbitrary number of nodes. A hypergraph, which is a set of hyperedges, is an indispensable extension of a graph, which can only describe pairwise relations. Moreover, many of such real-world hypergraphs evolve over time (e.g., emails exchanged continuously between sets of users, co-authorships established over time, and daily records of co-purchased items), and thus they are typically modeled as a stream of hyperedges.

Despite the great interest in anomaly detection in graphs, the same problem in hypergraphs has been largely unexplored. High-order relationships represented by hyperedges exhibit structural and temporal properties distinguished from those in graphs and hence raise unique technical challenges. Thus, instead of simply decomposing hyperedges into pairwise edges and applying graph-based methods, it is required to take the underlying high-order structures into consideration for anomaly detection in hypergraphs.

To this end, we propose HashNWalk, an online algorithm for detecting anomalous hyperedges. HashNWalk maintains a constant-size summary that tracks structural and temporal patterns in high-order interactions in the input stream. Specifically, HashNWalk incorporates so-called edge-dependent node weights [chitra2019random] into random walks on hypergraphs to estimate the proximity between nodes while capturing high-order information. Furthermore, we develop an incremental update scheme that processes each hyperedge as it appears.

The designed hypergraph summary is used to score the anomalousness of any new hyperedge in the stream. While the definition of anomaly depends on the context, in this work, we focus on two intuitive aspects: unexpectedness and burstiness. We assume that unexpected hyperedges consist of unnatural combinations of nodes, and bursty hyperedges suddenly appear in a short period of time. Based on the information in the form of a hypergraph summary, we formally define two anomaly score metrics that effectively capture these properties. We empirically show that HashNWalk is effective in detecting anomalous hyperedges in (semi-)real hypergraphs.

In summary, our contributions are as follows:

  • Fast: It takes a very short time for HashNWalk to process each new hyperedge. Specifically, in our experimental setting, it processed 1.4 billion hyperedges within 2.5 hours.

  • Space Efficient: The user can bound the size of the summary, which HashNWalk maintains.

  • Accurate: HashNWalk successfully detects anomalous hyperedges. Numerically, it achieves higher AUROC than its state-of-the-art competitors.

Reproducibility: The source code and datasets are available at https://github.com/geonlee0325/HashNWalk.

2 Related Works

We discuss prior works on the three topics relevant to this paper: (a) anomaly detection in (hyper)graphs; (b) summarization of edge streams; and (c) hypergraphs and applications.

Anomaly Detection in Graphs & Hypergraphs: The problem of detecting anomalous nodes, edges, and/or subgraphs has been extensively studied for both static and dynamic graphs [akoglu2015graph]. In static graphs, nodes whose ego-nets are structurally different from others [akoglu2010oddball], edges whose removal significantly reduces the encoding cost [chakrabarti2004autopart], or subgraphs whose density is abnormally high [beutel2013copycatch, hooi2016fraudar, shin2016corescope] are assumed to be anomalies. In dynamic graphs, temporal edges are assumed to be anomalous if they connect sparsely connected parts in graphs [eswaran2018sedanspot] or are unlikely to appear according to underlying models [aggarwal2011outlier, yoon2019fast, bhatia2020midas, belth2020mining]. In addition, dense subgraphs generated within a short time are considered to be anomalous [shin2017densealert, eswaran2018spotlight]. Recently, embedding based methods have shown to be effective in detecting anomalies in graphs [yu2018netwalk, chang2020f].

On the other hand, detecting anomalies in hypergraphs is relatively unexplored. Anomalous nodes in the hypergraph have been the targets of detection by using scan statistics on hypergraphs [park2009anomaly] or training a classifier based on the high-order structural features of the nodes [leontjeva2012fraud]. The anomalousness of unseen hyperedges is measured based on how likely the combinations of nodes are drawn from the distribution of anomalous co-occurrences, which is assumed to be uniform, instead of the distribution of nominal ones [silva2008hypergraph]. Approximate frequencies of structurally similar hyperedges obtained by locality sensitive hashing are used to score the anomalousness of hyperedges in the hyperedge stream [ranshous2017efficient]. In this paper, we compare ours with the methods that detect anomalous interactions in online settings, i.e., anomaly detectors designed for edge streams [bhatia2020midas, eswaran2018sedanspot, chang2020f] and hyperedge streams [ranshous2017efficient].

Summarization of Edge Streams: Summarization aims to reduce the size of a given graph while approximately maintaining its structural properties. It has been particularly demanded in the context of real-time processing of streaming edges. In [bhatia2020midas], a count-min-sketch is maintained for approximate frequencies of edges. Edge frequencies have been used to answer queries regarding structural properties of graphs [zhao2011gsketch, tang2016graph]. In [bandyopadhyay2016topological], local properties, such as the number of triangles and neighborhood overlap, are estimated by maintaining topological information of a given graph.

Hypergraphs and Applications: Hypergraphs appear in numerous fields, including bioinformatics [hwang2008learning], circuit design [karypis1999multilevel], computer vision [huang2009video, kim2020hypergraph], natural language processing [ding2020more], social network analysis [yang2019revisiting], and recommendation [mao2019multiobjective]. Structural properties [benson2018simplicial, do2020structural, lee2020hypergraph, lee2021how, choe2022midas] and dynamical properties [benson2018simplicial, benson2018sequences, kook2020evolution, lee2021thyme+, choo2022persistence] of such real-world hypergraphs have been studied extensively.

3 Preliminaries

In this section, we introduce notations and preliminaries.

3.1 Notations and Concepts

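Hypergraphs: A hypergraph $G = (V, E)$ consists of a set of nodes $V = \{v_1, \ldots, v_{|V|}\}$ and a set of hyperedges $E = \{e_1, \ldots, e_{|E|}\}$. Each hyperedge $e \in E$ is a non-empty subset of an arbitrary number of nodes. We can represent $G$ by its incidence matrix $H \in \{0, 1\}^{|E| \times |V|}$, where each entry $H_{ij}$ is $1$ if $v_j \in e_i$ and $0$ otherwise. A hyperedge stream $\{(e_i, t_i)\}_{i=1}^{\infty}$ is a sequence of hyperedges where each hyperedge $e_i$ arrives at time $t_i$. For any $i$ and $j$, if $i < j$, then $t_i \le t_j$.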

Clique Expansion and Information Loss: Clique expansion [zhou2007learning], where each hyperedge $e \in E$ is converted to a clique composed of the nodes in $e$, is one of the most common ways of transforming a hypergraph into an ordinary pairwise graph. Clique expansion suffers from the loss of information on high-order interactions. That is, in general, a hypergraph is not uniquely identifiable from its clique expansion. Exponentially many non-isomorphic hypergraphs are reduced to identical clique expansions.
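The information loss can be seen on a toy case. The sketch below is illustrative only and assumes an unweighted clique expansion (edge multiplicities are ignored); the function name `clique_expansion` and the node labels are hypothetical.

```python
from itertools import combinations


def clique_expansion(hyperedges):
    """Return the set of unordered node pairs that co-occur in some hyperedge."""
    edges = set()
    for e in hyperedges:
        # Every pair of nodes inside a hyperedge becomes one pairwise edge.
        for u, v in combinations(sorted(e), 2):
            edges.add((u, v))
    return edges


# One 3-node group interaction versus three separate pairwise interactions.
one_triple = [{"a", "b", "c"}]
three_pairs = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

same_expansion = clique_expansion(one_triple) == clique_expansion(three_pairs)
print(same_expansion)
```

Both hypergraphs reduce to the same triangle, so a graph-based detector working on the expansion cannot tell them apart.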

Random Walks on Hypergraphs: A random walk on a hypergraph $G$ is formulated in [chitra2019random] as follows. If the current node is $u$, (1) select a hyperedge $e$ that contains the node $u$ (i.e., $u \in e$) with probability proportional to $\omega(e)$ and (2) select a node $v \in e$ with probability proportional to $\gamma_e(v)$ and walk to node $v$. The weight $\omega(e)$ is the weight of the hyperedge $e$, and the weight $\gamma_e(v)$ is the weight of node $v$ with respect to the hyperedge $e$. The weight $\gamma_e(v)$ is edge-independent if it is identical for every hyperedge $e$; otherwise, it is edge-dependent. If all node weights are edge-independent, then a random walk on $G$ becomes equivalent to a random walk on its clique expansion [chitra2019random]. However, if node weights are edge-dependent, random walks on hypergraphs are generally irreversible. That is, they may not be the same as random walks on any undirected graphs. In this sense, if edge-dependent weights are available, random walks are capable of exploiting high-order information beyond clique expansions and are thus empirically useful in many machine learning tasks [hayashi2020hypergraph].

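Transition Matrix: To incorporate edge-dependent node weights, the incidence matrix $H$ is generalized to a weighted incidence matrix $R \in \mathbb{R}_{\ge 0}^{|E| \times |V|}$ where each entry $R_{ij}$ is $\gamma_{e_i}(v_j)$ if $v_j \in e_i$ and $0$ otherwise. Then, the transition probability matrix $P \in \mathbb{R}^{|V| \times |V|}$ of a random walk on the hypergraph $G$ is written as $P = D_V^{-1} W D_E^{-1} R$, where $W \in \mathbb{R}^{|V| \times |E|}$ denotes the hyperedge-weight matrix where each entry $W_{ij}$ is $\omega(e_j)$ if $v_i \in e_j$ and 0 otherwise. The matrices $D_V \in \mathbb{R}^{|V| \times |V|}$ and $D_E \in \mathbb{R}^{|E| \times |E|}$ are diagonal matrices of node degrees and hyperedge weights, respectively. That is, if we let $q_V \in \mathbb{R}^{|V|}$ and $q_E \in \mathbb{R}^{|E|}$ be the vectors whose entries are all ones, then $D_V = \mathrm{diag}(W q_E)$ and $D_E = \mathrm{diag}(R q_V)$.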
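As a small worked example of the transition matrix described above, the following sketch builds it entry by entry rather than through the matrix product. It assumes the two-step walk of this section (pick a hyperedge containing the current node in proportion to its weight, then pick a node in that hyperedge in proportion to its edge-dependent weight). The function name and the toy weights are hypothetical.

```python
def transition_matrix(nodes, hyperedges, omega, gamma):
    """Transition probabilities of a hypergraph random walk.

    nodes: list of node labels; hyperedges: list of sets of nodes;
    omega: list of hyperedge weights; gamma: list of {node: weight} dicts,
    one per hyperedge (edge-dependent node weights).
    """
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    P = [[0.0] * n for _ in range(n)]

    # Node degree: total weight of the hyperedges that contain the node.
    degree = {u: 0.0 for u in nodes}
    for e, w in zip(hyperedges, omega):
        for u in e:
            degree[u] += w

    for e, w, g in zip(hyperedges, omega, gamma):
        delta = sum(g[v] for v in e)  # sum of node weights inside e
        for u in e:
            for v in e:
                # Step 1: choose e from u; step 2: choose v inside e.
                P[index[u]][index[v]] += (w / degree[u]) * (g[v] / delta)
    return P


nodes = ["a", "b", "c"]
hyperedges = [{"a", "b"}, {"a", "b", "c"}]
omega = [1.0, 1.0]
gamma = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0, "c": 2.0}]
P = transition_matrix(nodes, hyperedges, omega, gamma)
```

Each row of `P` sums to one, and the larger weight of node `c` in the second hyperedge makes the walk from `a` reach `c` more often than an unweighted walk would.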

3.2 Problem Description

The problem that we address in this paper is as follows.

Problem 1.

Given a stream of hyperedges, detect anomalous hyperedges, whose structural or temporal properties deviate from general patterns, in near real-time using constant space.

While the definition of anomalous hyperedges depends on the context, we focus on two intuitive perspectives. In one aspect, a hyperedge is anomalous if it consists of an unexpected subset of nodes. That is, we aim to detect hyperedges composed of unusual combinations of nodes. In the other aspect, we aim to identify a set of similar hyperedges that appear in bursts as an anomaly. The sudden emergence of similar interactions often indicates malicious behavior harmful in many applications. In addition, for time-critical applications, we aim to detect such anomalous hyperedges in near real-time, as they appear, using bounded space. While one might be tempted to reduce hyperedges to subgraphs and solve the problem as anomalous subgraph detection, this discards the high-order information of the hyperedges. Also, existing works on anomalous subgraphs assume static graphs [hooi2016fraudar] or detect only the single most anomalous subgraph [shin2017densealert], while we aim to score every hyperedge in the stream.

4 Proposed Method

Figure 2: Outline of HashNWalk. (a) A new hyperedge arrives in the input hyperedge stream. (b) Nodes are merged into supernodes with edge-dependent weights by hashing, and hyperedges, including the new one, are represented as $M$-dimensional vectors ($M = 3$ in this example). (c) The hypergraph summary $\tilde{P}$ is composed of a matrix $S$ and a vector $T$, and it is incrementally updated in response to the new hyperedge. (d) Based on the summary $\tilde{P}$, which is immediately obtainable from $S$ and $T$ (Eq. (2)), the anomalousness of the new hyperedge is measured using the proposed scoring functions (Eq. (5)).

In this section, we propose HashNWalk (Algorithm 1), which is a fast and space-efficient algorithm for detecting anomalies in a hyperedge stream. Our main focus is speed and space efficiency since HashNWalk is expected to process a potentially infinite stream. As illustrated in Figure 2, it maintains a concise and informative summary of a hyperedge stream (Sect. 4.1), which is incrementally updated as each new hyperedge arrives (Sect. 4.2). Once the summary is updated, anomalous hyperedges are identified immediately based on two principled metrics (Sect. 4.3). While HashNWalk is based on multiple summaries (Sect. 4.4), we assume that it consists of a single summary for ease of explanation.

4.1 Hypergraph Summarization

Hyperedge Representation: We describe how to concisely represent each hyperedge using constant space. Hyperedges, by definition, are flexible in their sizes, and it is non-trivial to represent each hyperedge using the same amount of space. To this end, we map each node into one of $M$ different values using a hash function $h : V \to \{1, \ldots, M\}$. We consider each hash value as a supernode that contains the nodes with the same hash value. Due to hash collisions, a hyperedge may contain a supernode multiple times, and the number of occurrences becomes the weight of the supernode with respect to the hyperedge. Formally, we represent each hyperedge $e$ of any size as an $M$-dimensional vector $m(e) \in \mathbb{Z}^{M}$, whose $j$th element indicates the number of the nodes that are contained in $e$ and mapped into the hash value $j$ (i.e., $m_j(e) := \sum_{v \in e} \mathbb{1}(h(v) = j)$). It is also interpreted as the weight of the supernode $\tilde{v}_j$ with respect to the hyperedge $e$. We denote by $\tilde{e}$ the set of supernodes that hyperedge $e$ contains, i.e., $\tilde{e} := \{\tilde{v}_j : m_j(e) > 0\}$. Note that a hyperedge of any size is represented as a fixed-size vector, whose size is user-controlled. In addition, the edge-dependent weights of supernodes can be utilized by random walks (see Section 3.1). If we use a constant-time hash function and a sparse vector format, for each hyperedge $e$, the time complexity of generating the vector $m(e)$ is $O(|e|)$, as stated in Lemma 1.

Lemma 1 (Time Complexity of Generating $m(e)$).

Given a hyperedge $e$, it takes $O(|e|)$ time to generate the vector $m(e)$.

Proof. Creating a zero vector in a sparse format (e.g., a hash table) and incrementing $m_{h(v)}(e)$ for every node $v \in e$ takes $O(|e|)$ time. ∎
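The hashing step can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes integer node identifiers and uses a modulo as a stand-in hash function, and the name `summarize_hyperedge` and supernode indices starting at 0 are hypothetical choices.

```python
from collections import Counter


def summarize_hyperedge(hyperedge, num_supernodes):
    """Map a hyperedge to a sparse supernode-count vector in O(|e|) time.

    The modulo below is only a placeholder for a constant-time hash function.
    """
    counts = Counter()
    for node in hyperedge:
        supernode = node % num_supernodes  # placeholder hash h(v)
        counts[supernode] += 1             # collisions raise the supernode weight
    return dict(counts)


# Nodes 3, 8, and 13 collide in supernode 3, so its weight in this hyperedge is 3.
vector = summarize_hyperedge([3, 7, 8, 13], num_supernodes=5)
print(vector)
```

The counts always sum to the hyperedge size, which is why the sum of supernode weights of a hyperedge equals its number of nodes.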

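Input: (1) hyperedge stream $\{(e_i, t_i)\}_{i=1}^{\infty}$, (2) number of supernodes $M$, (3) number of hash functions $K$, (4) time-decaying parameter $\alpha$
Output: stream of anomaly scores $\{y_i\}_{i=1}^{\infty}$
$S \in \mathbb{R}^{M \times M}$ and $T \in \mathbb{R}^{M}$  ▷ Initialize to zeros
for each hyperedge $(e_i, t_i)$ do
  $m(e_i) \leftarrow$ summarize $e_i$ via hashing  ▷ Sect. 4.1
  update $S$ and $T$  ▷ Sect. 4.2
  $y_i \leftarrow \text{score}(e_i)$  ▷ Sect. 4.3
end for
return $\{y_i\}_{i=1}^{\infty}$
Algorithm 1 HashNWalk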

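Hypergraph Summary: Below, we describe how to summarize the entire hypergraph for rapid and accurate anomaly detection. We note that the key building block for identifying anomalous hyperedges of both types (i.e., unexpected ones and similar ones in bursts) is to estimate the proximity or structural similarity between nodes. Thus, we summarize the input hypergraph in the form of proximity between supernodes, and we extend random walks to measure the proximity. Our summary is based on random walks extended with edge-dependent supernode weights and hyperedge weights, and we use the transition probabilities as their approximation for rapid updates (see Section 4.2). Specifically, we summarize the input hypergraph as a matrix $\tilde{P} = \tilde{D}_V^{-1} \tilde{W} \tilde{D}_E^{-1} \tilde{R} \in \mathbb{R}^{M \times M}$, where $\tilde{R} \in \mathbb{R}^{|E| \times M}$ is the weighted incidence matrix where each entry $\tilde{R}_{ij}$ is $\tilde{\gamma}_{e_i}(\tilde{v}_j)$ if $\tilde{v}_j \in \tilde{e}_i$ and 0 otherwise. The matrix $\tilde{W} \in \mathbb{R}^{M \times |E|}$ denotes the hyperedge-weight matrix where $\tilde{W}_{ji}$ is $\omega(e_i)$ if $\tilde{v}_j \in \tilde{e}_i$ and 0 otherwise. The matrices $\tilde{D}_V \in \mathbb{R}^{M \times M}$ and $\tilde{D}_E \in \mathbb{R}^{|E| \times |E|}$ are diagonal matrices of supernode degrees and hyperedge weights, respectively. Then, $\tilde{P}$ is the transition probability matrix where each entry $\tilde{P}_{uv}$ is the transition probability from supernode $\tilde{u}$ to $\tilde{v}$:

$$\tilde{P}_{uv} = \sum_{e_i \in E \,:\, \tilde{u} \in \tilde{e}_i} \frac{\omega(e_i)}{\tilde{W}_u} \cdot \frac{\tilde{\gamma}_{e_i}(\tilde{v})}{\tilde{R}_{e_i}} \qquad (1)$$

where $\tilde{W}_u$ is the weighted degree of $\tilde{u}$, i.e., $\tilde{W}_u := \sum_{e_i \in E \,:\, \tilde{u} \in \tilde{e}_i} \omega(e_i)$, and $\tilde{R}_{e_i}$ is the sum of the weights of the supernodes in $\tilde{e}_i$, i.e., $\tilde{R}_{e_i} := \sum_{\tilde{v} \in \tilde{e}_i} \tilde{\gamma}_{e_i}(\tilde{v})$.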

Edge-Dependent Supernode Weights: If edge-dependent supernode weights are available, random walks utilize high-order information beyond clique expansions. Such weights are naturally obtained from the aforementioned vector representation of hyperedges. That is, we use the number of the occurrences of each supernode $\tilde{v}_j$ in each hyperedge $e$ as the weight of $\tilde{v}_j$ with respect to $e$. Formally, $\tilde{\gamma}_e(\tilde{v}_j) = m_j(e)$, and thus $\tilde{R}_e = \sum_{j=1}^{M} m_j(e) = |e|$.

Time-Decaying Hyperedge Weights: In order to facilitate identifying recent bursts of similar hyperedges, which are one of our focuses, we emphasize recent hyperedges with large weights. Specifically, at current time $t$, we define the weight of each hyperedge $e_i$, which arrived at time $t_i$, as $\omega(e_i) = \kappa(t - t_i) = \alpha^{t - t_i}$, where $\kappa(x) := \alpha^{x}$ is a kernel function for quantifying time decay and $\alpha \in [0, 1)$ is a hyperparameter that determines the degree of emphasis. Specifically, a smaller $\alpha$ places more emphasis on recent hyperedges.

4.2 Incremental Update

Challenges: Constructing $\tilde{P}$ from scratch, which takes time that grows with the number of hyperedges, is undesirable when immediate responses to anomalies are demanded. In addition, when hyperedges are streamed indefinitely, materializing $\tilde{W}$, $\tilde{D}_E$, and $\tilde{R}$, which are used to compute $\tilde{P}$, is prohibitive since their sizes are proportional to the number of hyperedges.

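Proposed Update Scheme: We present an incremental algorithm for efficiently but exactly updating $\tilde{P}$ in response to a new hyperedge. The proposed update scheme maintains only $\tilde{P}$, whose size is controllable by the user, without materializing any larger matrix. Assume $m$ hyperedges $e_1, \ldots, e_m$ have arrived, and let $\tilde{P}^{(m)}_{uv}$ be the proximity from supernode $\tilde{u}$ to supernode $\tilde{v}$ in them. We introduce a matrix $S \in \mathbb{R}^{M \times M}$ and a vector $T \in \mathbb{R}^{M}$, and for any supernodes $\tilde{u}$ and $\tilde{v}$, their entries when the hyperedge $e_m$ arrives at time $t_m$ are

$$S^{(m)}_{uv} := \sum_{i=1}^{m} \alpha^{-t_i} \cdot \mathbb{1}(\tilde{u} \in \tilde{e}_i) \cdot \frac{\tilde{\gamma}_{e_i}(\tilde{v})}{\tilde{R}_{e_i}}, \qquad T^{(m)}_{u} := \sum_{i=1}^{m} \alpha^{-t_i} \cdot \mathbb{1}(\tilde{u} \in \tilde{e}_i).$$

Then, based on Eq. (1) and the predefined hyperedge weight function $\omega(e_i) = \alpha^{t_m - t_i}$, $\tilde{P}^{(m)}_{uv}$ is written as

$$\tilde{P}^{(m)}_{uv} = \frac{\alpha^{t_m} \cdot S^{(m)}_{uv}}{\alpha^{t_m} \cdot T^{(m)}_{u}} = \frac{S^{(m)}_{uv}}{T^{(m)}_{u}}. \qquad (2)$$

Instead of directly tracking the proximity matrix $\tilde{P}$, we track the aforementioned $S$ and $T$, whose entries are initialized to zero. Each entry $S_{uv}$ and $T_u$ can be updated in constant time, as presented in Lemmas 2 and 3, and once they are updated, we can compute $\tilde{P}_{uv}$ in $O(1)$ time by Eq. (2), if necessary.

Lemma 2 (Updating $S$).

For any $m \ge 0$, when the hyperedge $e_{m+1}$ arrives at $t_{m+1}$, Eq. (3) holds.

$$S^{(m+1)}_{uv} = S^{(m)}_{uv} + \alpha^{-t_{m+1}} \cdot \mathbb{1}(\tilde{u} \in \tilde{e}_{m+1}) \cdot \frac{\tilde{\gamma}_{e_{m+1}}(\tilde{v})}{\tilde{R}_{e_{m+1}}} \qquad (3)$$

Lemma 3 (Updating $T$).

For any $m \ge 0$, when the hyperedge $e_{m+1}$ arrives at $t_{m+1}$, Eq. (4) holds.

$$T^{(m+1)}_{u} = T^{(m)}_{u} + \alpha^{-t_{m+1}} \cdot \mathbb{1}(\tilde{u} \in \tilde{e}_{m+1}) \qquad (4)$$

Lemma 2 and Lemma 3 are immediate from the definitions of $S$ and $T$.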

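Complexity: Notably, if $\tilde{u} \notin \tilde{e}_{m+1}$, $\mathbb{1}(\tilde{u} \in \tilde{e}_{m+1}) = 0$ holds and if $\tilde{v} \notin \tilde{e}_{m+1}$, $\tilde{\gamma}_{e_{m+1}}(\tilde{v}) = 0$ holds. Thus, if $\tilde{u}$ or $\tilde{v}$ is not included in the new hyperedge (i.e., $\tilde{u} \notin \tilde{e}_{m+1}$ or $\tilde{v} \notin \tilde{e}_{m+1}$), $S_{uv}$ remains the same (i.e., $S^{(m+1)}_{uv} = S^{(m)}_{uv}$) and thus does not need any update. Similarly, $T_u$ does not change if $\tilde{u}$ is not included in $\tilde{e}_{m+1}$. These facts significantly reduce the update time of the summary, enabling near real-time processing of each hyperedge. To sum up, in response to a new hyperedge, HashNWalk updates the summary in a short time using constant space, as stated in Lemmas 4 and 5, respectively.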

Lemma 4 (Update Time Per Hyperedge).

Given the sparse vector representation $m(e)$ of a hyperedge $e$, updating $S$ and $T$ using Eq. (3) and Eq. (4) takes $O(\min(M, |e|)^2)$ time.

Proof. The number of supernodes in $\tilde{e}$ is $|\tilde{e}|$, which is at most the number of nodes $|e|$ and the number of supernodes $M$, and thus $|\tilde{e}| \le \min(M, |e|)$. Then, $|\tilde{e}|^2$ elements of $S$ and $|\tilde{e}|$ elements of $T$ are updated by Eq. (3) and Eq. (4), and the update time is constant per element. Therefore, the total time complexity is $O(\min(M, |e|)^2)$. ∎

Lemma 5 (Constant Space).

The maintained summary $\tilde{P}$ takes $O(M^2)$ space.

Proof. The matrix $S$ and the vector $T$ require $O(M^2)$ and $O(M)$ space, respectively. ∎
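The incremental update can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the time-decaying weight of a hyperedge is `alpha ** (t_now - t_i)`, that each new hyperedge adds `alpha ** (-t)` to the accumulators of the supernodes it contains, and it stores the accumulators in dictionaries rather than a dense matrix. The class name `Summary` and its methods are hypothetical.

```python
from collections import defaultdict


class Summary:
    """Sparse accumulators S (pairs of supernodes) and T (single supernodes)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.S = defaultdict(float)  # (u, v) -> accumulated weighted proximity
        self.T = defaultdict(float)  # u -> accumulated weighted degree

    def update(self, vector, t):
        """vector: {supernode: count} for the new hyperedge; t: arrival time."""
        size = sum(vector.values())   # sum of supernode weights equals |e|
        w = self.alpha ** (-t)        # assumed decay factor for this arrival
        # Only pairs of supernodes inside the hyperedge are touched.
        for u in vector:
            self.T[u] += w
            for v, count in vector.items():
                self.S[(u, v)] += w * count / size

    def proximity(self, u, v):
        """Transition probability from supernode u to v, or 0.0 if u is unseen."""
        t_u = self.T.get(u, 0.0)
        return self.S.get((u, v), 0.0) / t_u if t_u > 0 else 0.0


summary = Summary(alpha=0.5)
summary.update({0: 2, 1: 1}, t=0)
summary.update({0: 1, 2: 1}, t=1)
row_0 = [summary.proximity(0, v) for v in (0, 1, 2)]
```

In this toy run the older hyperedge counts half as much as the newer one, and the proximities from supernode 0 still sum to one. A practical implementation would need to guard against `alpha ** (-t)` overflowing for large timestamps, for example by periodic rescaling; that detail is left out here.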

4.3 Anomaly Detection

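Hyperedge Anomaly Score: We now propose an online anomalous hyperedge detector, which is based on the structural and temporal information captured in the summary $\tilde{P}$. We evaluate each newly arriving hyperedge by measuring a hyperedge anomaly score defined in Definition 1.

Definition 1 (Hyperedge Anomaly Score).

Given a newly arriving hyperedge $e_i$ at time $t_i$, its anomaly score is defined as

$$\text{score}(e_i) = \underset{(\tilde{u}, \tilde{v}) \in \tilde{e}_i \times \tilde{e}_i}{\text{aggregate}} \; d_{\tilde{u}}^{\,\beta} \cdot \log \frac{a_{uv}}{s_{uv}} \qquad (5)$$

where $d_{\tilde{u}}$ is the number of occurrences of $\tilde{u}$ at time $t_i$, $\beta$ is a hyperparameter for the importance of the occurrences, $a_{uv} = \tilde{\gamma}_{e_i}(\tilde{v}) / \tilde{R}_{e_i}$, and $s_{uv}$ is $\tilde{P}_{uv}$ just before $t_i$. Intuitively, $a_{uv}$ and $s_{uv}$ are the “observed” proximity (i.e., the proximity in the current hyperedge) and “expected” proximity (i.e., the proximity in all past hyperedges appearing before $t_i$) from supernode $\tilde{u}$ to supernode $\tilde{v}$, respectively.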

Note that the relationships between all pairs of supernodes in the hyperedge, including the pairs of the same supernode, are taken into consideration, and they are aggregated using any aggregation function. The hyperparameter $\beta$ and the aggregate function can be controlled to capture various types of anomalies. For the two types of anomalies, we define scoring functions $\text{score}_U$ and $\text{score}_B$ as described below.

Unexpectedness ($\text{score}_U$): Intuitively, $a_{uv} / s_{uv}$ in Eq. (5) measures how much the proximity from the supernode $\tilde{u}$ to $\tilde{v}$ in the new hyperedge deviates from the proximity in the past hyperedges. Specifically, the ratio $a_{uv} / s_{uv}$ is high if two supernodes $\tilde{u}$ and $\tilde{v}$ that have been far from each other in past hyperedges unexpectedly co-appear with high proximity in the new hyperedge. Thus, in $\text{score}_U$, which is the anomaly score for identifying unexpected hyperedges, we focus on the ratio by setting $\beta = 0$. In order to detect any such unexpected pairs of supernodes in the hyperedge, $\text{score}_U$ uses the maximum ratio as the final score (i.e., $\text{aggregate} = \max$).

Burstiness ($\text{score}_B$): In order to detect similar hyperedges that appear in bursts, the number of occurrences of supernodes is taken into consideration. Supernodes, by definition, are subsets of nodes, and similar hyperedges tend to share many supernodes. If a large number of similar hyperedges appear in a short period of time, then the occurrences of the supernodes in them tend to increase accordingly. Thus, in $\text{score}_B$, which is the anomaly score for identifying recent bursts of similar hyperedges, we set $\beta$ to a positive number (specifically, $\beta = 1$ in this work) to take such occurrences (i.e., $d_{\tilde{u}}$ in Eq. (5)) into consideration, in addition to unexpectedness (i.e., $a_{uv} / s_{uv}$ in Eq. (5)). We reflect the degrees of all supernodes in the hyperedge by averaging the scores from all supernode pairs (i.e., $\text{aggregate} = \text{mean}$).

Complementarity of the Anomaly Scores: While the only differences between $\text{score}_U$ and $\text{score}_B$ are the consideration of the current degree of supernodes (i.e., $d_{\tilde{u}}^{\,\beta}$) and the aggregation methods, the differences play an important role in identifying specific types of anomalies (see Section 5.2).

Complexity: For each new hyperedge $e$, HashNWalk computes $\text{score}(e)$ in a short time, as stated in Lemma 6.

Lemma 6 (Scoring Time Per Hyperedge).

Given the hypergraph summary $\tilde{P}$ and a hyperedge $e$ in the form of a vector $m(e)$, computing $\text{score}(e)$ takes $O(\min(M, |e|)^2)$ time.

Proof. The number of supernodes in $\tilde{e}$ is $O(\min(M, |e|))$. We maintain and update the current degrees of supernodes, which takes $O(\min(M, |e|))$ time for each new hyperedge $e$. There are $O(\min(M, |e|)^2)$ pairs of supernodes in $\tilde{e}$, and the computation for each supernode pair in Eq. (5) takes $O(1)$ time. Hence, the total time complexity is $O(\min(M, |e|)^2)$. ∎

Theorem 1 (Total Time Per Hyperedge).

HashNWalk takes $O(|e| + \min(M, |e|)^2)$ time to process a hyperedge $e$.

Proof. Theorem 1 follows from Lemmas 1, 4, and 6. ∎
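A rough sketch of the scoring step is given below. It rests on several assumptions that are not confirmed by the text: each supernode pair contributes the occurrence count raised to a power times the logarithm of the observed-to-expected proximity ratio, the unexpectedness variant uses exponent 0 with a maximum, the burstiness variant uses exponent 1 with a mean, and a small constant `eps` replaces an expected proximity of zero. The function and parameter names are hypothetical.

```python
import math


def anomaly_score(vector, expected, occurrences, beta, aggregate, eps=1e-12):
    """Score one hyperedge from its supernode-count vector.

    vector: {supernode: count}; expected(u, v): proximity before this hyperedge;
    occurrences: {supernode: number of occurrences so far}.
    """
    size = sum(vector.values())
    terms = []
    for u in vector:
        for v, count in vector.items():
            observed = count / size                 # proximity inside this hyperedge
            past = max(expected(u, v), eps)         # assumed smoothing for unseen pairs
            weight = occurrences.get(u, 0) ** beta  # beta = 0 ignores occurrences
            terms.append(weight * math.log(observed / past))
    return aggregate(terms)


def mean(values):
    return sum(values) / len(values)


vector = {0: 1, 1: 1}
expected = lambda u, v: 0.25  # toy summary: every pair had proximity 0.25
occurrences = {0: 3, 1: 1}

score_u = anomaly_score(vector, expected, occurrences, beta=0, aggregate=max)
score_b = anomaly_score(vector, expected, occurrences, beta=1, aggregate=mean)
```

The double loop visits every ordered pair of supernodes in the hyperedge once, which matches the quadratic per-hyperedge scoring cost stated in Lemma 6.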

4.4 Using Multiple Summaries (Optional)

Multiple hash functions can be used in HashNWalk to improve its accuracy at the expense of speed and space. Specifically, if we use $K$ hash functions, maintain $K$ summaries, and compute scores independently, then the space and time complexities become $K$ times those with one hash function. Given $K$ hyperedge anomaly scores from $K$ different summaries, we use the maximum one as the final score, although any other aggregation function can be used instead.

5 Experiments

We review our experiments to answer Q1-Q4:

  Q1. Performance: How rapidly and accurately does HashNWalk detect anomalous hyperedges?

  Q2. Discovery: What meaningful events can HashNWalk detect in real-world hypergraph streams?

  Q3. Scalability: How does the total runtime of HashNWalk change with respect to the input stream size?

  Q4. Parameter Analysis: How do the parameters of HashNWalk affect its performance?

5.1 Experimental Settings

Dataset          Nodes        Hyperedges    Avg. hyperedge size   Max. hyperedge size
Email-Enron      143          10,885        2.472                 37
Transaction      284,807      284,807       5.99                  6
DBLP             1,930,378    3,700,681     2.790                 280
Cite-patent      4,641,021    1,696,554     18.103                2,076
Tags-overflow    49,998       14,458,875    2.968                 5
Table 1: Five real-world hypergraphs.

Datasets: We used five different real-world datasets in Table 1. They are described in detail in later subsections.

Machines: We ran F-FADE on a workstation with an Intel Xeon 4210 CPU, 256GB RAM, and RTX2080Ti GPUs. We ran the others on a desktop with an Intel Core i9-10900KF CPU and 64GB RAM.

(a) Transaction
(b) SemiU
(c) SemiB
Figure 3: HashNWalk is accurate (in terms of AUROC and Prec.@100) and fast. For example, in the Transaction dataset, HashNWalk achieves higher AUROC with faster speed, compared to F-FADE.

Baselines: We consider four streaming algorithms for anomaly detection in graphs and hypergraphs as competitors:

  • SedanSpot [eswaran2018sedanspot]: Given a stream of edges, it aims to detect unexpected edges, i.e., edges that connect nodes from sparsely connected parts of the graph, based on personalized PageRank scores.

  • Midas [bhatia2020midas]: Given a stream of edges, it aims to detect similar edges in bursts. To this end, it uses the Count-Min-Sketch.

  • F-FADE [chang2020f]: Given a stream of edges, it uses frequency-based matrix factorization and computes the likelihood-based anomaly score of each edge that combines unexpectedness and burstiness.

  • LSH [ranshous2017efficient]: Given a stream of hyperedges, it computes the unexpectedness of each one using its approximate frequency so far.

For graph-based anomaly detection methods, we transform hypergraphs into graphs via clique expansion (Section 3.1). That is, each hyperedge $e_i$ is reduced to $\binom{|e_i|}{2}$ pairwise edges, and the timestamp $t_i$ is assigned to each edge. The anomaly score of the hyperedge is computed by aggregating the anomaly scores of the pairwise edges, using the best one among arithmetic/geometric mean, sum, and maximum.
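The baseline adaptation described above can be sketched as follows. The code is illustrative: `edge_score` stands for whatever per-edge score a graph-based detector returns and is a hypothetical callback, the geometric mean assumes strictly positive edge scores, and hyperedges with fewer than two nodes are rejected because they produce no pairwise edges.

```python
import math
from itertools import combinations


def hyperedge_score_from_edges(hyperedge, edge_score, how="max"):
    """Aggregate pairwise edge scores of a clique-expanded hyperedge."""
    pairs = list(combinations(sorted(hyperedge), 2))
    if not pairs:
        raise ValueError("a hyperedge needs at least two nodes to yield edges")
    scores = [edge_score(u, v) for u, v in pairs]
    if how == "sum":
        return sum(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "gmean":
        # Geometric mean; assumes every edge score is strictly positive.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    if how == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation: {how}")


toy_edge_score = lambda u, v: float(u + v)  # stand-in for a real edge detector
hyperedge = {1, 2, 3}
by_sum = hyperedge_score_from_edges(hyperedge, toy_edge_score, "sum")
by_mean = hyperedge_score_from_edges(hyperedge, toy_edge_score, "mean")
by_max = hyperedge_score_from_edges(hyperedge, toy_edge_score, "max")
by_gmean = hyperedge_score_from_edges(hyperedge, toy_edge_score, "gmean")
```

A hyperedge of size 3 yields 3 pairwise edges, so the number of edge-level scores to aggregate grows quadratically with hyperedge size.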

Implementation: We implemented HashNWalk and LSH in C++ and Python, respectively. For the others, we used their open-source implementations. SedanSpot and Midas are implemented in C++, and F-FADE is implemented in Python.

Evaluation: Given anomaly scores of hyperedges, we measure AUROC and Precision@$k$ (i.e., the ratio of true positives among the $k$ hyperedges with the highest scores).
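For concreteness, the two evaluation measures can be computed as in the sketch below. It is a plain reference version, not the evaluation script used in the experiments: AUROC is computed by direct pairwise comparison (quadratic time, ties counted as one half), and ties in Precision@$k$ are broken by input order. The function names and toy data are hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scored items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def auroc(scores, labels):
    """Probability that a random anomaly is scored above a random normal item."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.1, 0.8, 0.4]
labels = [1, 0, 0, 1]  # 1 marks an anomalous hyperedge
p_at_2 = precision_at_k(scores, labels, 2)
area = auroc(scores, labels)
```

For streams with millions of hyperedges, a rank-based AUROC computation would be preferable to the pairwise loop shown here.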

5.2 Q1. Performance Comparison

We consider three hypergraphs: Transaction, SemiU, and SemiB. Transaction [dal2015calibrating] is a real-world hypergraph of credit card transactions. Each timestamped transaction is described by a dimensional feature vector. There exist frauds, which account for of the entire transactions. For each transaction, we generate a hyperedge by grouping it with nearest transactions that occurred previously. Thus, each node is a transaction, and each hyperedge is a set of transactions that are similar to each other.

In Email-Enron, each node is an email account and each hyperedge is the set of the sender and receivers. The timestamp of each hyperedge is when the email was sent. We consider two scenarios InjectionU and InjectionB, where we generate two semi-real hypergraphs SemiU and SemiB by injecting 200 unexpected and bursty hyperedges, respectively, in Email-Enron. The two injection scenarios are designed as follows:

  • InjectionU: Injecting unexpected hyperedges.
    (1) Select a hyperedge uniformly at random.
    (2) Create a hyperedge by replacing nodes in with random ones, and set their timestamp to .
    (3) Repeat (1)-(2) times to generate hyperedges.

  • InjectionB: Injecting bursty hyperedges.
    (1) Select a time uniformly at random.
    (2) Sample a set of nodes uniformly at random.
    (3) Create uniform random subsets of at time . Their sizes are chosen uniformly at random from .
    (4) Repeat (1)-(3) times to generate hyperedges.

All anomalies are injected after time , and thus all methods are evaluated from time where we set (). In InjectionU, we set . In InjectionB, we set , , and .

Accuracy: In Transaction, we use and set , , and . In SemiU and SemiB, we use $\text{score}_U$ and $\text{score}_B$, respectively, and commonly set , , and . As discussed later, these summaries take up less space than the original hypergraphs. As shown in Figure 3, HashNWalk accurately detects anomalous hyperedges in real and semi-real hypergraphs. Notably, while most methods fail to find any anomalous hyperedges in their top (see Figure 1(a) in Section 1) or top (Figure 3(a)) hyperedges with the highest anomaly scores, HashNWalk is successful. In addition, HashNWalk accurately detects both unexpected and bursty hyperedges. Note that while several competitors successfully spot bursty hyperedges in SemiB, most of them fail to spot unexpected ones in SemiU. Specifically, in SemiU, HashNWalk achieves higher precision@100 with faster speed than SedanSpot.

                    SemiU                 SemiB
Score               AUROC    Prec.@100    AUROC    Prec.@100
$\text{score}_U$    0.951    0.815        0.802    0.740
$\text{score}_B$    0.916    0.090        0.997    1.000
Table 2: The two proposed hyperedge anomaly scoring metrics $\text{score}_U$ and $\text{score}_B$ complement each other.

Speed: As seen in Figure 3, HashNWalk is one of the fastest methods among the considered ones. Notably, in SemiU, HashNWalk is faster than the second most accurate method.

Space Usage: We analyze the amount of space used by HashNWalk. Let and be the numbers of bits to encode an integer and a floating-point number, respectively, and we assume . The size of the original hypergraph is the sum of the hyperedge sizes, and precisely, bits are required to encode the hypergraph. As described in Lemma 5 in Section 4.2, for each hash function, HashNWalk tracks a matrix and a vector , and thus it requires bits with hash functions. We set and in Transaction; and and in SemiU and SemiB. As a result, HashNWalk requires about and of the space required for the original hypergraphs, in Transaction and semi-real hypergraphs, respectively. For competitors, we conduct hyperparameter tuning, including configurations requiring more space than ours (see https://github.com/geonlee0325/HashNWalk for details).

Complementarity of the two scores: As seen in Table 2, while the unexpectedness score is effective in detecting unexpected hyperedges, it shows relatively low accuracy in detecting bursty hyperedges. The opposite holds for the burstiness score. The results indicate that the two metrics are complementary.
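For readers unfamiliar with the two evaluation measures reported in Table 2, the following is a minimal sketch of how precision@k and AUROC can be computed from anomaly scores and binary ground-truth labels. It is a generic illustration, not the evaluation script used in the experiments; the function names and the toy data are hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true anomalies (label 1) among the k highest-scored items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def auroc(scores, labels):
    """Probability that a random anomaly is scored above a random normal item
    (ties count as one half). Quadratic time; fine for small examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: five hyperedges, two of which are labeled anomalous.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0]
print(precision_at_k(toy_scores, toy_labels, k=2))
print(auroc(toy_scores, toy_labels))
```

Precision@k only looks at the top of the ranking, whereas AUROC summarizes the whole ranking, which is why a score can have a high AUROC and still a low precision@100, as in the second row of Table 2.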

5.3 Q2. Discovery

Here, we share the results of case studies conducted on the DBLP, Cite-patent, and Tags-overflow datasets.

Discoveries in Co-authorship Hypergraph: DBLP contains bibliographic information on computer science publications. Each node represents an author, and each hyperedge consists of the authors of a publication. The timestamp of a hyperedge is the year of publication. Here, we investigate how authors co-work with different researchers. For each author who has published at least 100 papers, we compute the average unexpectedness and burstiness scores of the hyperedges that contain the author. We then analyze the authors for whom the ratio of the two averages is the highest or the lowest. Intuitively, authors with low ratios tend to co-work in a bursty manner with expected co-authors, while those with high ratios tend to work steadily with unexpected co-authors. Surprisingly, Dr. Bill Hancock, whose ratio is the lowest, published 186 papers all alone. Furthermore, Dr. Hancock published 139 papers in 2000. On the other hand, Dr. Seymour Ginsburg, whose ratio is the highest, published 114 papers from 1958 to 1999 (2.7 papers per year). In addition, 18 co-authors (out of 38) co-authored only one paper with Dr. Ginsburg. In fact, the two averages of most authors are clustered, as seen in Figure 4(a), and the top authors are those with the largest (or the smallest) slope. We further conduct case studies on two specific authors, Dr. Shinji Sakamoto and Dr. King-Sun Fu, whose co-working patterns are very different. As seen in Figure 4(b), while Dr. Sakamoto collaborated on most papers with a few researchers, Dr. Fu enjoyed co-working with many new researchers. These findings support our intuition behind the two proposed measures.
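The per-author analysis described above can be sketched as follows. This is only an illustration under stated assumptions: each hyperedge is given as a triple of its authors, its unexpectedness score, and its burstiness score, and the ratio is taken as average unexpectedness divided by average burstiness (the direction is inferred from the interpretation of low and high ratios in the text). The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def author_ratios(scored_hyperedges, min_papers=100):
    """Map each sufficiently prolific author to
    (average unexpectedness) / (average burstiness) over their hyperedges."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # [sum_unexp, sum_burst, count]
    for authors, unexpectedness, burstiness in scored_hyperedges:
        for author in set(authors):
            acc = totals[author]
            acc[0] += unexpectedness
            acc[1] += burstiness
            acc[2] += 1
    ratios = {}
    for author, (sum_u, sum_b, count) in totals.items():
        # Skip authors with too few papers or a zero average burstiness.
        if count >= min_papers and sum_b > 0:
            ratios[author] = (sum_u / count) / (sum_b / count)
    return ratios


# Toy data with a lowered threshold so the example is small.
toy = [
    (["A", "B"], 0.8, 0.2),
    (["A"], 0.4, 0.4),
    (["B", "C"], 0.1, 0.9),
]
ratios = author_ratios(toy, min_papers=2)
for author in sorted(ratios, key=ratios.get):
    print(author, round(ratios[author], 3))
```

Sorting the resulting dictionary by value gives the authors at the two extremes, which correspond to the kind of cases examined above.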

Figure 4: Case studies on the DBLP dataset. (a) Some authors deviate from the general pattern. (b) Dr. Fu and Dr. Sakamoto differ in their co-working patterns.
Figure 5: Case study on the Cite-patent dataset. Unexpected & bursty hyperedges have different properties.

Discoveries in Patent Citation Hypergraph: We use Cite-patent [tang2012patentminer], which is a citation hypergraph where each node is a patent and each hyperedge is the set of patents cited by a patent. The timestamp of each hyperedge is the year of the citation, which ranges from 2000 to 2012. Using HashNWalk, we extract some hyperedges with high unexpectedness or burstiness scores. Then, we represent each hyperedge as a binary vector indicating which nodes belong to the hyperedge. We visualize the hyperedges after reducing the dimension of the vectors via t-SNE in Figure 5. While unexpected hyperedges are spread out, bursty hyperedges are closely located, indicating that structurally similar hyperedges arrive in bursts. In addition, we closely examine the citation patterns of suspicious patents detected by HashNWalk. As seen in Figure 1(c), patents with unexpected or bursty citations are effectively detected.
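The binary-vector representation used for the visualization can be sketched as below. The sketch assumes one coordinate per distinct node appearing in the selected hyperedges; the function name and the patent identifiers are hypothetical, and the dimensionality-reduction step itself is left out.

```python
def to_indicator_vectors(hyperedges):
    """Encode each hyperedge as a 0/1 vector with one coordinate per node."""
    nodes = sorted({node for edge in hyperedges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    vectors = []
    for edge in hyperedges:
        vec = [0] * len(nodes)
        for node in edge:
            vec[index[node]] = 1  # mark membership of this node
        vectors.append(vec)
    return nodes, vectors


# Two toy hyperedges over hypothetical patent identifiers.
nodes, vectors = to_indicator_vectors([{"p1", "p2"}, {"p2", "p3", "p4"}])
print(nodes)
print(vectors)
```

The resulting rows could then be passed to any t-SNE implementation to obtain two-dimensional coordinates for plotting.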

Discoveries in Online Q&A Site: We share the results of a case study using Tags-overflow. In the dataset, nodes are tags, and each hyperedge is the set of tags attached to a question. Hyperedges with high unexpectedness scores (i.e., sets of unexpected keywords) include: {channel, ignore, antlr, hidden, whitespace}, {sifr, glyph, stling, text-styling, embedding}, and {retro-computing, boot, floppy, amiga}. Hyperedges with high burstiness scores (i.e., sets of bursty keywords) include: {python, javascript}, {java, adobe, javascript}, and {c#, java}. Notably, sets of unpopular tags tend to have high unexpectedness, while those containing popular keywords, such as python and javascript, have high burstiness.

5.4 Q3. Scalability

To evaluate the scalability of HashNWalk, we measure how rapidly it updates the hypergraph summary and computes the anomaly scores as the number of hyperedges grows. To this end, we upscale Email-Enron, which originally consists of 10,885 hyperedges, by to times, and measure the total runtime of HashNWalk. As seen in Figure 1(b), the total runtime is linear in the number of hyperedges, which is consistent with our theoretical analysis (Theorem 1 in Section 4). That is, the time taken to process each hyperedge is nearly constant. Notably, HashNWalk is scalable enough to process a stream of billions of hyperedges within a few hours.

5.5 Q4. Parameter Analysis

We evaluate HashNWalk under different parameter settings, and the results are shown in Figure 6. In most cases, the accuracy is positively correlated with the number of supernodes and the number of hash functions. Intuitively, a larger number of supernodes and hash functions collectively reduce the variance due to the randomness introduced by the hash functions. However, since the space usage depends on these parameters, the trade-off between accuracy and space usage should be considered. In addition, properly setting the time-decaying parameter improves the accuracy, as shown in Figure 6, which indicates that not only the structural information but also the temporal information is critical in detecting anomalies in hyperedge streams.
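To make the roles of the two structural parameters concrete, the sketch below maps the nodes of a hyperedge to supernodes under several independent hash functions. It is an illustration only: it assumes integer node identifiers and uses a simple seeded family of the form ((a * x + b) mod p) mod M, which is a stand-in chosen for this example and not necessarily the hash family used by HashNWalk. All names are hypothetical.

```python
import random

PRIME = 2_147_483_647  # the Mersenne prime 2**31 - 1


def make_hash_functions(num_hashes, num_supernodes, seed=0):
    """Return num_hashes functions, each mapping an integer node id
    to a supernode id in range(num_supernodes)."""
    rng = random.Random(seed)

    def make(a, b):
        return lambda node: ((a * node + b) % PRIME) % num_supernodes

    return [
        make(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hashes)
    ]


def supernode_sets(hyperedge, hash_functions):
    """For each hash function, the sorted set of supernodes the hyperedge touches."""
    return [sorted({h(node) for node in hyperedge}) for h in hash_functions]


hash_functions = make_hash_functions(num_hashes=4, num_supernodes=20)
print(supernode_sets([3, 17, 42], hash_functions))
```

With more supernodes, fewer distinct nodes collide under a single hash function, and with more hash functions, a collision under one function is less likely to repeat under the others; both effects come at the cost of a larger summary.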

Figure 6: The performance of HashNWalk on the Transaction dataset depends on the number of supernodes, the number of hash functions, and the time-decaying parameter. Panels (a), (b), and (c) each show the effect of one parameter.

6 Conclusion

In this work, we propose HashNWalk, an online anomaly detector for hyperedge streams. HashNWalk maintains a random-walk-based hypergraph summary in constant space and updates it incrementally in near real-time. Using the summary, HashNWalk computes two anomaly scores that are effective in identifying (a) hyperedges composed of unexpected combinations of nodes and (b) those appearing in bursts. Our experiments demonstrate the speed, accuracy, and effectiveness of HashNWalk on (semi-)real datasets. The source code and datasets are publicly available at https://github.com/geonlee0325/HashNWalk.

Acknowledgements:

This work was supported by National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References