Concept Drift and Anomaly Detection in Graph Streams

06/21/2017 · Daniele Zambon, et al. · Università della Svizzera italiana (USI), Politecnico di Milano, University of Exeter

Graph representations offer powerful and intuitive ways to describe data in a multitude of application domains. Here, we consider stochastic processes generating graphs and propose a methodology for detecting changes in stationarity of such processes. The methodology is general and considers a process generating attributed graphs with a variable number of vertices/edges, without the need to assume a one-to-one correspondence between vertices at different time steps. The methodology acts by embedding every graph of the stream into a vector domain, where a conventional multivariate change detection procedure can be easily applied. We ground the soundness of our proposal by proving several theoretical results. In addition, we provide a specific implementation of the methodology and evaluate its effectiveness on several detection problems involving attributed graphs representing biological molecules and drawings. Experimental results are compared with suitable baseline methods, demonstrating the competitiveness of our approach.


Nomenclature

$\mathcal{G}$: graph domain
$d(\cdot,\cdot)$: graph distance on $\mathcal{G}$
$g_t$: generic graph in $\mathcal{G}$ generated at time $t$
$G$: set of graphs
$\mathcal{D}$: dissimilarity domain
$\delta(\cdot,\cdot)$: distance on $\mathcal{D}$
$y_t$: dissimilarity representation of generic graph $g_t$
$Y$: set of dissimilarity representations of $G$
$h(\cdot)$: dissimilarity representation map
$R$: set of prototypes
$\mathcal{P}$: stochastic process generating graphs
$\tau$, $\hat{\tau}$: change time and its estimate
$(\mathcal{G}, \mathcal{S}, Q)$: probability space for $g_t$ in the nominal regime
$(\mathcal{D}, \mathcal{B}, F)$: probability space for $y_t$ in the nominal regime
$H_0$, $H_1$: null and alternative hypotheses of a statistical test
$S$: statistic used by the change detection test
$\gamma$: threshold for the change detection test
$\alpha$: significance level of the change detection test
$\bar{y}$, $\mathbb{E}[y]$: sample mean and expected value w.r.t. $F$
$\hat{\mu}$, $\mu$: Fréchet sample and population means w.r.t. $Q$

I Introduction

Learning in non-stationary environments is becoming a hot research topic, as proven by the increasing body of literature on the subject, e.g., see [7296710, gama2014survey] for a survey. Within this learning framework, the detection of changes in stationarity of the data-generating process is of particular relevance. This can be achieved by means of either passive approaches [elwell2011incremental], which follow a pure on-line adaptation strategy, or active ones [alippi2008justI, alippi2009just], which enable learning only as a proactive reaction to a detected change in stationarity. In this paper, we follow the latter strategy, though many of our results are general and can be suitably integrated into passive learning approaches as well.

Most change detection mechanisms have been proposed for numeric i.i.d. sequences and rely on either change point methods or change detection tests. Both are statistical tests; the former work off-line over a finite number of samples [hawkins2003changepoint], while the latter employ a sequential analysis of incoming observations [basseville1993detection] to detect possible changes. These techniques were originally designed for univariate, normally distributed variables, and only later developments extended the methodology to non-Gaussian distributions [alippi2006adaptive, ross2012two] and multivariate datastreams [zamba2006multivariate, golosnoy2009multivariate].

A field somewhat related to change detection tests is one-class classification (e.g., see [eocc_mi] and references therein). There, the idea is to model only the nominal state of a given system and detect non-nominal conditions (e.g., outliers, anomalies, or faults) by means of inference mechanisms. However, one-class classifiers typically process data batches with no specific presentation order, while change detection problems are sequential in nature.

Graphs nowadays play an important role in describing dynamic systems, a role boosted also by recent theoretical frameworks for performing signal processing on graphs [7407399, shuman2013emerging] and for analysing temporal (complex) networks [holme2012temporal, holme2015modern, masuda2016guide]. However, very few works address the problem of detecting changes in stationarity in streams (i.e., sequences) of graphs [wilson2016modeling, barnett2016change, masoller2015quantifying], and, to the best of our knowledge, none of them tackles the problem by considering a generic family of graphs (e.g., graphs with a varying number of vertices/edges and arbitrary attributes on them). In our opinion, the reason behind this lack of research lies in the difficulty of defining a sound statistical framework for generic graphs which, once solved, would also permit the detection of changes in stationarity of time-dependent graphs. In fact, statistics grounds on concepts like average and expectation, which are not standard concepts in graph domains. Fortunately, recent studies [jain2016geometry, jain2016statistical] have provided some basic mathematical tools that allow us to move forward in this direction, hence addressing the problem of detecting changes in stationarity in sequences of graphs.

A key problem in analysing generic graphs is assessing their dissimilarity, a well-known hard problem [gm_survey]. The literature proposes two main approaches for designing such a measure [foggia2012graph, emmert2016fifty, roy2014modeling]. In the first case, graphs are analysed in their original domain $\mathcal{G}$, whereas the second approach consists of mapping (either explicitly or implicitly) graphs to numeric vectors. A well-known family of algorithms for assessing the dissimilarity between graphs relies on the Graph Edit Distance (GED) [bunke1983inexact]. More specifically, GED algorithms count and weight the edit operations needed to make two input graphs equal. Other techniques take advantage of kernel functions [7947106] or spectral graph theory [qiu2006graph, bai2015quantum], or assess graph similarity by looking for recurring motifs [costa2010fast]. The computational complexity associated with the graph matching problem inspired researchers to develop heuristics and approximations, e.g., see [fankhauser2011speeding, fischer2015approximation, bougleux2017graph] and references therein.

I-A Problem formulation

In this paper, we consider sequences of attributed graphs, i.e., directed or undirected labelled graphs characterized by a variable number of vertices and edges [jain2016geometry]. Attributed graphs associate vertices and edges with generic labels, e.g., scalars, vectors, categorical values, and user-defined data structures. In addition, multiple attributes can be associated with the same vertex/edge, whenever requested by the application. By considering attributed graphs, we position ourselves in a very general framework covering most application scenarios. However, such generality requires a new operational framework, since the assumptions commonly made in the literature to keep the mathematics amenable, e.g., graphs with a fixed number of vertices and/or scalar attributes, cannot be accepted anymore. In order to cover all applications modellable through attributed graphs, we propose the following general problem formulation for change detection.

Given a generic pre-metric distance $d(\cdot,\cdot)$ on $\mathcal{G}$, we construct a $\sigma$-algebra $\mathcal{S}$ containing at least all open balls of $(\mathcal{G}, d)$ and associate a generic probability measure $Q$ with it. The resulting probability space $(\mathcal{G}, \mathcal{S}, Q)$ allows us to consider graphs as realisations of a structured random variable $g$ on $\mathcal{G}$. Define $\mathcal{P}$ to be the process generating a generic graph $g_t$ at time $t$ according to a stationary probability distribution $Q$ (nominal distribution). We say that a change in stationarity occurs at (unknown) time $\tau$ when, from time $\tau$ on, $\mathcal{P}$ starts generating graphs according to a non-nominal distribution $\tilde{Q} \neq Q$, i.e.,

$$g_t \sim \begin{cases} Q & \text{if } t < \tau, \\ \tilde{Q} & \text{if } t \geq \tau. \end{cases}$$

In this paper, we focus on persistent (abrupt) changes in stationarity affecting the population mean. However, our methodology is general and can potentially detect other types of change, including drifts and transient anomalies lasting for a reasonable lapse of time.

I-B Contribution and paper organization

A schematic description of the proposed methodology to design change detection tests for attributed graphs is shown in Figure 1; it consists of two steps: (i) mapping each graph $g_t$ to a numeric vector $y_t$ through a prototype-based embedding, and (ii) using a multivariate change detection test operating on the $y$-stream to detect changes in stationarity.

The novelty content of the paper can be summarized as:

  • A methodology to detect changes in stationarity in streams of attributed graphs. To the best of our knowledge, this is the first research contribution tackling change detection problems in streams of varying-size graphs with non-identified vertices and user-defined vertex/edge attributes;

  • A method derived from the methodology to detect changes in stationarity in attributed graphs. We stress that the user can design his/her own change detection method by taking advantage of the proposed methodology;

  • A set of theoretical results grounding the proposed methodology on a firm basis.

The proposed methodology is general and advances the few existing approaches for change detection in graph sequences mostly relying on the extraction and processing of topological features of fixed-size graphs, e.g., see [ranshous2015anomaly].

It is worth emphasising that the proposed approach assumes neither one-to-one nor partial correspondence between vertices/edges across time steps (i.e., vertices do not need to be uniquely identified). This fact has important practical implications in several applications. As a very relevant example, we refer to the problem of identifying neurons in extra-cellular brain recordings based on their activity [Rossant2016]. In fact, each electrode usually records the activity of a cluster of neurons, and single neurons need to be disentangled by a procedure called spike sorting. Hence, a precise identification of neurons (vertices) is virtually impossible in such an experimental setting, stressing the importance of methods that do not require one-to-one correspondence between vertices over time.

The remainder of this paper is structured as follows. Section II contextualizes our contribution and discusses related works. Section III-A presents the proposed methodology for change detection in generic streams of graphs; theoretical results are sketched in Section III-B, with related proofs given in Appendix C. A specific implementation of the methodology is presented in Section IV, with related proofs in Appendix D. Section V shows experimental results conducted on datasets of attributed graphs. Finally, Section VI draws conclusions and offers future directions. Appendices A and B provide further technical details regarding the problem formulation.

II Related work

The relatively new field of temporal networks deals with graph-like structures that undergo events across time [holme2015modern, masuda2016guide]. Such events mostly take the form of instantaneous or persistent contacts between pairs of vertices. With such structures, one can study dynamics taking place on a network, like epidemic and information spreading, and/or dynamics of the network itself, i.e., structural changes affecting vertices and edges over time. Further relevant directions in temporal networks include understanding the (hidden) driving mechanisms and generative models [holme2012temporal].

The literature on statistical inference for time-varying graphs (or networks) is rather limited [holme2015modern, jain2016statistical], especially when dealing with attributed graphs and non-identified vertices. Among the many problems, anomaly detection in graphs emerged as particularly relevant, as a consequence of the ever-growing possibility to monitor and collect data coming from natural and man-made systems of various size. An overview of approaches for anomaly and change detection on time-variant graphs is reported in [ranshous2015anomaly, akoglu2015graph], where the authors distinguish the level of influence of a change. They identify changes affecting vertices and edges, or involving entire sub-networks of different size; this type of change usually concerns static networks, where the topology is often fixed. Other changes have a global influence, or might not be ascribed to specific vertices or edges.

We report that there are several applications in which the vertices are labelled in such a way that, from one time step to another, we are always able to create at least a partial one-to-one correspondence (identified vertices). This case arises, e.g., when the identity of vertices plays a crucial role and must be preserved over time. Here, we put ourselves in the more general scenario where vertices are not necessarily one-to-one identifiable through time.

Within the anomaly detection context, only a few works tackle the problem in a classical change detection framework. Among the published works on detecting changes in stationarity, we mention Barnett et al. [barnett2016change], whose paper deals with the problem of monitoring correlation networks by means of a change point method. In particular, at every time step $t$, the authors construct the covariance matrix computed from the signals up to time $t$ and the covariance matrix of the remaining data; as the statistic for the change point model, they adopt the Frobenius norm between the two covariance matrices. The authors evaluate their method on functional magnetic resonance imaging data and stock returns. A different way to approach the problem consists in modelling the network-generating process within a probabilistic framework. Graphs with a fixed vertex set and disjoint communities can be described by the degree-corrected stochastic block model, where some parameters represent the tendency of single vertices to be connected and of communities to interact. This model has been adopted by Wilson et al. for monitoring the U.S. Senate co-voting network [wilson2016modeling]. As a monitoring strategy, they consider the standard deviation of each community and then apply an exponentially weighted moving average control chart. A further example of a change point method for fixed-size graphs combines a generative hierarchical random graph model with a Bayesian hypothesis test [peel2015detecting].

III Change detection in a stream of graphs

The structure of the section is as follows. Section III-A describes the proposed methodology at a high level to ease the understanding. In Section III-B, we present theoretical results grounding our proposal; their proofs are given in the appendices.

Figure 1: The diagram represents the fundamental steps of the methodology. At the top of the figure, the stochastic process $\mathcal{P}$ generates over time a stream $g_1, g_2, g_3, \dots$ of graphs. The embedding procedure is described in the bottom-left corner: the embedding $y_t$ of graph $g_t$ is computed by considering the dissimilarity w.r.t. each prototype graph in $R$, and returns the embedded vector (here 3-dimensional) lying in the dissimilarity space $\mathcal{D}$. The embedding procedure proceeds over time and generates the multivariate vector stream $y_1, y_2, y_3, \dots$. A change detection method (bottom-right) is applied to the $w$-th window extracted from the $y$-stream to evaluate whether a change was detected or not; the procedure then iterates with the acquisition of a new graph.

III-A The proposed methodology

The methodology operates on an input sequence of attributed graphs and, as sketched in Figure 1, performs two steps: (i) map (embed) input graphs to a vector domain $\mathcal{D}$; the embedding is carried out by means of the dissimilarity representation $h: \mathcal{G} \to \mathcal{D}$, which embeds a generic graph $g$ into a vector $y = h(g)$; (ii) once a multivariate i.i.d. vector stream is formed, carry out change detection by inspecting this numerical sequence with the designer's favourite method. The two phases are detailed in the sequel.

III-A1 Dissimilarity representation

The embedding of a generic graph $g \in \mathcal{G}$ is achieved by computing the dissimilarity between $g$ and the prototype graphs of a set $R = \{r_1, \dots, r_M\} \subset \mathcal{G}$:

$$y = h(g) = \big[d(g, r_1), \dots, d(g, r_M)\big]^\top. \quad (1)$$

The vector $y$ is referred to as the dissimilarity representation of $g$. The set $R$ has to be suitably chosen so as to induce informative embedding vectors. For a detailed discussion about dissimilarity representations and possible ways to select prototypes, we suggest [pkekalska+duin2005].

In order to make the mathematics more amenable, here we assume $d(\cdot,\cdot)$ to be a metric distance; nevertheless, in practical applications, one can choose more general dissimilarity measures.
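As a minimal sketch of step (i), assuming a user-provided graph distance function (e.g., a GED implementation), the embedding of Equation (1) can be written as follows; names such as `graph_distance` are illustrative, not part of any specific library.

```python
import numpy as np

def embed(g, prototypes, graph_distance):
    # Dissimilarity representation of Equation (1):
    # y = [d(g, r_1), ..., d(g, r_M)].
    return np.array([graph_distance(g, r) for r in prototypes])

def embed_stream(graphs, prototypes, graph_distance):
    # Embed every graph of the stream, producing the y-stream in R^M.
    return np.stack([embed(g, prototypes, graph_distance) for g in graphs])
```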

III-A2 Multivariate vector stream

At time step $t$, the process $\mathcal{P}$ generates graph $g_t$, and the map $h(\cdot)$ embeds $g_t$ onto vector $y_t$, inducing a multivariate stream whose elements lie in $\mathcal{D}$. Figure 1 depicts this continuous embedding of the graph process.

Under the nominal condition for process $\mathcal{P}$, graphs $g_t$ are i.i.d. and drawn from probability space $(\mathcal{G}, \mathcal{S}, Q)$. Consequently, the vectors $y_t$ are i.i.d. as well. We now define a second probability space $(\mathcal{D}, \mathcal{B}, F)$ associated with the embedded vectors $y_t$; in particular, here we propose to take for $\mathcal{B}$ the Borel $\sigma$-algebra generated by all open sets of $\mathcal{D}$. $F$ is the push-forward probability function of $Q$ by means of $h(\cdot)$, namely

$$F(B) = Q\big(h^{-1}(B)\big), \quad \forall\, B \in \mathcal{B}. \quad (2)$$

With such a choice of $\mathcal{B}$, we demonstrate in Appendix A that $F$ is a probability measure on $(\mathcal{D}, \mathcal{B})$.

III-A3 Change detection test

By observing the i.i.d. vector stream $y_1, y_2, \dots$ over time, we propose a multivariate change detection procedure to infer whether a change has occurred in the vector stream and, in turn, in the graph stream.

The change detection test is the statistical hypothesis test

$$H_0: \ \mathbb{E}[y_t] = \mu_0 \ \ \forall\, t, \qquad H_1: \ \exists\, \tau \ \text{s.t.} \ \mathbb{E}[y_t] \neq \mu_0 \ \ \forall\, t \geq \tau,$$

where $\mu_0$ is the expected value during the nominal (stationary) regime and $\mathbb{E}[\cdot]$ is the expectation operator. The statistic $S$, which is applied to windows of the vector stream, is user-defined and is required to increase when the process becomes non-stationary.

Often, the test comes with a threshold $\gamma$, so that hypothesis $H_1$ is accepted at time $t$ if

$$S_t > \gamma, \quad (3)$$

and the estimated change time is

$$\hat{\tau} = \min\{\, t : S_t > \gamma \,\}.$$

Whenever the distribution of $S$ under hypothesis $H_0$ is available, or can be estimated, the threshold $\gamma$ can be related to a user-defined significance level $\alpha$ so that $P(S_t > \gamma \mid H_0) = \alpha$.
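The following sketch illustrates step (ii) under these definitions: a generic sequential test that flags the first window whose statistic exceeds the threshold. The statistic and threshold are left abstract, so any choice complying with the above requirements can be plugged in.

```python
def detect_change(statistics, thresholds):
    # statistics[t] is S_t computed on the t-th window of the y-stream;
    # thresholds[t] is the (possibly time-dependent) threshold gamma_t.
    # Returns the estimated change time (Equation (3)), or None.
    for t, (S, gamma) in enumerate(zip(statistics, thresholds)):
        if S > gamma:
            return t  # hat-tau: first time the statistic exceeds gamma
    return None
```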

III-B Theoretical results

In this section, we show some theoretical results related to the methodology presented in Section III-A. In particular, we prove the following claims:

  1. (C1) Once a change is detected in the dissimilarity space according to a significance level $\alpha_{\mathcal{D}}$, then a change occurs, in probability, with a related significance level $\alpha_{\mathcal{G}}$ also in the graph domain;

  2. (C2) If a change occurs in the graph domain with a significance level $\alpha_{\mathcal{G}}$, then, with a related significance level $\alpha_{\mathcal{D}}$, a change occurs also in the dissimilarity space.

Figure 2: Conceptual scheme of the proposed methodology highlighting the theoretical results. The key point is to show that objects close in the graph domain $\mathcal{G}$, on which metric $d$ is defined, are, under certain conditions, also close in the embedded domain $\mathcal{D}$ controlled by metric $\delta$. It follows that if objects in the embedding domain are distant in probability, then the related graphs in $\mathcal{G}$ are distant as well, hence indicating a change in stationarity according to a given false positive rate.

Figure 2 depicts the central idea behind the methodology. Through transformation $h(\cdot)$, we map graphs to vectors. In the transformed space, we consider the expectation $\mathbb{E}[y]$ and the sample mean $\bar{y}$ associated with the set $Y$ obtained by embedding the graph set $G$, and design a hypothesis test of the form

$$\delta\big(\bar{y}, \mathbb{E}[y]\big) \geq \gamma_{\mathcal{D}}, \quad (4)$$

where $\gamma_{\mathcal{D}}$ is a positive threshold and hypothesis $H_0$ is associated with a nominal, change-free condition. In this paper, we relate the test (4) to a corresponding test $d(\hat{\mu}, \mu) \geq \gamma_{\mathcal{G}}$ operating in the graph domain, where $\mu$ and $\hat{\mu}$ are, respectively, the population and sample means defined according to Fréchet [frechet1948elements]; for further details about Fréchet statistics refer to Appendix B.

Define $\alpha_{\mathcal{D}}$ and $\alpha_{\mathcal{G}}$ to be two significance levels such that

$$\alpha_{\mathcal{D}} = P\big(\delta(\bar{y}, \mathbb{E}[y]) \geq \gamma_{\mathcal{D}}\big), \qquad \alpha_{\mathcal{G}} = P\big(d(\hat{\mu}, \mu) \geq \gamma_{\mathcal{G}}\big). \quad (5)$$

In the sequel, we relate the threshold $\gamma_{\mathcal{D}}$ to $\gamma_{\mathcal{G}}$, so that the significance levels $\alpha_{\mathcal{D}}$ and $\alpha_{\mathcal{G}}$ are in turn related to each other.

In order to address our problem, we introduce mild assumptions to obtain closed-form expressions; such assumptions are satisfied in most applications. More specifically:

  • (A1) We assume that the attributed graph space $(\mathcal{G}, d)$ and the dissimilarity space $(\mathcal{D}, \delta)$ are metric spaces; in particular, $(\mathcal{G}, d)$ is chosen as a graph alignment space [jain2016statistical], i.e., a general metric space of attributed graphs, and $\delta$ has to be induced by a norm.

  • (A2) We put ourselves in the conditions of [jain2016statistical] in order to take advantage of the results therein; specifically, we assume that the Fréchet function is finite on $\mathcal{G}$, and that there exists a sufficiently asymmetric graph such that the support of $Q$ is contained in a cone around it. In this way, we are under the hypotheses of Theorems 4.1 and 4.2 of [jain2016statistical], which ensure the existence and uniqueness of the Fréchet population and sample means in $\mathcal{G}$.

  • (A3) The embedding function $h(\cdot)$ is bilipschitz, i.e., there exist two constants $c_1, c_2 > 0$ such that, for any pair $g_1, g_2 \in \mathcal{G}$,

    $$d(g_1, g_2) \geq c_1\, \delta\big(h(g_1), h(g_2)\big), \quad (6)$$
    $$d(g_1, g_2) \leq c_2\, \delta\big(h(g_1), h(g_2)\big). \quad (7)$$

Let $F_{\mathcal{G}}$ and $F_{\mathcal{D}}$ be the cumulative distribution functions (CDFs) of $d(\hat{\mu}, \mu)$ and $\delta(\bar{y}, \mathbb{E}[y])$, respectively. Proposition 1 bounds the distribution $F_{\mathcal{G}}$ in terms of $F_{\mathcal{D}}$. This yields the possibility of deriving significance levels and thresholds in Equation (5) that are related to each other.

In order to prove Proposition 1, we make use of two auxiliary results, Lemmas 1 and 2. At first, we need to comment that, although in general $h(\mu) \neq \mathbb{E}[y]$ and $h(\hat{\mu}) \neq \bar{y}$, the differences are bounded in practice, as shown by Lemma 1.

Lemma 1.

Considering a set $G$ of i.i.d. random graphs and the associated embedded set $Y = h(G)$, there exists a constant $c \geq 0$ such that

$$\delta\big(h(\mu), \mathbb{E}[y]\big) \leq c \quad \text{and} \quad \delta\big(h(\hat{\mu}), \bar{y}\big) \leq c.$$

Lemma 2 is used to derive bounds on the marginal distributions from bounds on the joint distributions; these are useful in Proposition 1 to relate the threshold and the significance level in the graph space with the ones in the dissimilarity space.

Lemma 2.

Consider a random variable $x$ and two statistics $s_1(x)$ and $s_2(x)$ with associated CDFs $F_1$ and $F_2$, respectively. If $\ell(\cdot)$ is an increasing and bijective function and $c \geq 0$ is a constant such that $s_1(x) \leq \ell\big(s_2(x)\big) + c$, then

$$F_1\big(\ell(\gamma) + c\big) \geq F_2(\gamma) \quad \text{for every } \gamma.$$

Claims (C1) and (C2) now follow from the relation between the CDFs $F_{\mathcal{G}}$ and $F_{\mathcal{D}}$ proven by the subsequent Proposition 1. In particular, regarding Claim (C1), Proposition 1 provides a criterion for setting a specific threshold for the test (4) for which we can state that the graph-domain statistic is unexpectedly large with a significance level at most $\alpha_{\mathcal{G}}$; similarly, we obtain Claim (C2).

Proposition 1.

Consider a sample $G$ of i.i.d. graphs. Under Assumptions (A1)–(A3), if $F_{\mathcal{G}}$ and $F_{\mathcal{D}}$ are the CDFs of the statistics $d(\hat{\mu}, \mu)$ and $\delta(\bar{y}, \mathbb{E}[y])$, respectively, then for every $\gamma \geq 0$ there exist two values $\gamma_1$ and $\gamma_2$, depending on $\gamma$, such that

$$F_{\mathcal{D}}(\gamma_1) \leq F_{\mathcal{G}}(\gamma) \leq F_{\mathcal{D}}(\gamma_2).$$

The proofs are given in Appendix C. Proposition 1 allows us to state the major claim (C1): given a sample $Y$, for any significance level $\alpha_{\mathcal{D}}$ and threshold $\gamma_{\mathcal{D}}$ as in (5), we can set up a threshold $\gamma_{\mathcal{G}}$ such that

$$P\big(d(\hat{\mu}, \mu) < \gamma_{\mathcal{G}}\big) \geq P\big(\delta(\bar{y}, \mathbb{E}[y]) < \gamma_{\mathcal{D}}\big) = 1 - \alpha_{\mathcal{D}}. \quad (8)$$

Equation (8) states that, if no change occurs in $\mathcal{D}$ with confidence level $1 - \alpha_{\mathcal{D}}$, then a change will not be detected in $\mathcal{G}$ according to threshold $\gamma_{\mathcal{G}}$ at least with confidence level $1 - \alpha_{\mathcal{D}}$. Indeed, this proves (C1) by contraposition. Similarly, Proposition 1 allows us to prove Claim (C2): for any $\alpha_{\mathcal{G}}$ and $\gamma_{\mathcal{G}}$ as in (5), we can set a threshold $\gamma_{\mathcal{D}}$ so that

$$P\big(\delta(\bar{y}, \mathbb{E}[y]) < \gamma_{\mathcal{D}}\big) \geq 1 - \alpha_{\mathcal{G}}.$$

IV Implementations of the methodology

This section describes two examples showing how to implement the proposed methodology and derive actual change detection tests. In Section IV-A we present a specific method for generic families of graphs, whereas in Section IV-B we further specialize the methodological results to a special case considering graphs with identified vertices.

IV-A Change detection based on the central limit theorem

Here, we consider specific techniques for prototype selection and for the change detection test. For the sake of clarity, we keep the subsection structure of Section III-A. Both the prototype selection and the change detection test require a training phase; for this reason, the first observed graphs of the stream serve as a training set, which we assume to be entirely drawn under nominal conditions.

IV-A1 Dissimilarity representation

Since the change detection method operates in the dissimilarity space, we need to define the embedding that, at each time step, maps a generic graph $g_t$ to a vector $y_t$.

We comment that the embedding is completely defined once the graph distance $d(\cdot,\cdot)$ and the prototype set $R$ are chosen. Here, we adopt a metric GED as graph distance, since it meets Assumption (A1) of $(\mathcal{G}, d)$ being a graph alignment space.

Many approaches have been proposed to deal with the prototype selection problem, e.g., see [riesen+bunke2010] and references therein. While the proposed methodology is general and one can choose any solution to this problem, here we adopt the k-Centres method [riesen2007graph]. The method selects prototypes so as to cover training data with balls of equal radius. Specifically, the algorithm operates as follows:

  1. select $M$ random prototypes $R = \{r_1, \dots, r_M\}$ from the training set;

  2. for each prototype $r_i$, consider the cluster $C_i$ of all training graphs $g$ such that $d(g, r_i) \leq d(g, r_j)$ for every $j \neq i$;

  3. for each cluster $C_i$, update prototype $r_i$ with a graph minimising the maximal distance $\max_{g \in C_i} d(g, \cdot)$ from the graphs in $C_i$;

  4. if the prototype set did not change from the previous iteration step then exit, otherwise go to step 2.

In order to improve the robustness of the k-Centres algorithm, we repeat steps 1–4 with randomised initial conditions and select as the final prototype set the one, among those found at each repetition, attaining the smallest covering radius; a code sketch is given below.

IV-A2 Multivariate vector stream

Every time the process $\mathcal{P}$ generates a graph $g_t$, we embed it as $y_t = h(g_t)$ by using the prototype set identified with the k-Centres approach. This operation results in a multivariate vector stream, on which we apply the change detection test.

IV-A3 Change detection test

We consider here a variation of the cumulative sum (CUSUM) test [alippi2006adaptive] to design the change detection test. CUSUM is based on the cumulative sum chart [page1954continuous], has been proven to be asymptotically optimal [basseville1993detection], and allows for a simple graphical interpretation of the results [manly2000cumulative]. Here, the CUSUM is adapted to the multivariate case. However, we remark that, in principle, any change detection test can be used on the embedded sequence.

We batch the observed embedding vectors into non-overlapping samples of length $n$, where the index $w$ represents the $w$-th data window. For each window, we compare the sample mean $\bar{y}_0$ estimated on the training set with the sample mean $\bar{y}_w$ estimated on the $w$-th window, and compute the discrepancy

$$s_w = \delta(\bar{y}_0, \bar{y}_w).$$

By assuming the vectors $y_t$ to be i.i.d., and given a sufficiently large window size $n$ and training-set size, the central limit theorem grants that $\bar{y}_0$ and $\bar{y}_w$ are (approximately) normally distributed. In particular, under the nominal regime, $\bar{y}_0$ and $\bar{y}_w$ share the same expectation, while their covariance matrices scale inversely with the respective sample sizes.

As a specific choice for $\delta(\cdot,\cdot)$, we adopt the Mahalanobis distance, i.e., $s_w = \delta(\bar{y}_0, \bar{y}_w)$ with

$$\delta(\bar{y}_0, \bar{y}_w) = \sqrt{(\bar{y}_0 - \bar{y}_w)^\top\, \Sigma_\delta^{-1}\, (\bar{y}_0 - \bar{y}_w)}, \quad (9)$$

where $\Sigma_\delta$ is the covariance matrix of the difference $\bar{y}_0 - \bar{y}_w$. In our implementation, we consider as covariance matrix the unbiased sample estimator computed on the training set.

For each stationary window $w$, the squared Mahalanobis distance $s_w^2$ is (approximately) distributed as a $\chi^2$ with $M$ degrees of freedom.

The final statistic, inspired by the CUSUM test, is defined as

$$S_w = \max\big(0,\; S_{w-1} + s_w - q\big), \qquad S_0 = 0, \quad (10)$$

where $q$ is an offset parameter. The increment $s_w - q$ grows with the discrepancy in (9); positive values support the hypothesis that a change occurred, whereas negative values suggest that the system is still operating under nominal conditions. This behaviour resembles that of the original CUSUM increment associated with the log-likelihood ratio. In particular, the parameter $q$ can be used to tune the sensitivity of the change detection: if $q$ is the $\beta$-quantile of the nominal distribution of $s_w$, then $s_w - q$ produces a negative increment with probability $\beta$ and a positive one with probability $1 - \beta$.
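A minimal sketch of this test, assuming Equation (9) and the CUSUM recursion (10) as reconstructed above; `cov_inv` is the inverse of the estimated covariance matrix, and all names are illustrative.

```python
import numpy as np

def mahalanobis(a, b, cov_inv):
    # Discrepancy s_w of Equation (9).
    diff = a - b
    return float(np.sqrt(diff @ cov_inv @ diff))

def cusum_statistics(y_stream, y0_mean, cov_inv, n, q):
    # Yields (w, S_w) following Equation (10), computed over
    # non-overlapping windows of length n; q is the offset (e.g., a
    # quantile of the nominal distribution of s_w).
    S = 0.0
    for w in range(len(y_stream) // n):
        window = y_stream[w * n:(w + 1) * n]
        s_w = mahalanobis(y0_mean, window.mean(axis=0), cov_inv)
        S = max(0.0, S + s_w - q)
        yield w, S
```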

The last parameter to be defined is the threshold, as requested in (3). Since the non-nominal distribution is unknown, a common criterion suggests controlling the rate of false alarms $\alpha$. In sequential detection tests, a related criterion requires a specific average run length under the null hypothesis (ARL0) [frisen2003statistical]. ARL0 is connected to the false positive rate in the sense that setting a false alarm rate of $\alpha$ yields an ARL0 of $1/\alpha$. Since we propose a sequential test, statistic $S_w$ depends on the statistics at preceding times; as a consequence, since we wish to keep $\alpha$ fixed, we end up with a time-dependent threshold $\gamma_w$. As done in [zamba2006multivariate, golosnoy2009multivariate], we numerically determine the thresholds through Monte Carlo sampling, by simulating a large number of processes of repeated independent realisations. Threshold $\gamma_w$ is then taken as the quantile of order $1 - 1/\mathrm{ARL0}$ of the estimated distribution of $S_w$.
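A sketch of this Monte Carlo estimation, under the assumption (used in Section V-A3) that, in the nominal regime, $s_w$ can be simulated as the square root of a $\chi^2(M)$ variable; function and parameter names are illustrative.

```python
import numpy as np

def estimate_thresholds(M, n_windows, n_runs, arl0, q, seed=None):
    # Simulate n_runs independent CUSUM paths under H0 and return, for
    # each window w, the empirical quantile of order 1 - 1/ARL0 of S_w.
    rng = np.random.default_rng(seed)
    s = np.sqrt(rng.chisquare(df=M, size=(n_runs, n_windows)))
    S = np.zeros(n_runs)
    thresholds = np.empty(n_windows)
    for w in range(n_windows):
        S = np.maximum(0.0, S + s[:, w] - q)
        thresholds[w] = np.quantile(S, 1.0 - 1.0 / arl0)
    return thresholds
```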

We point out that, when setting a significance level for the random variable $S_w$, we are implicitly conditioning on the event that no alarm has been raised up to window $w - 1$; in fact, when $S_w$ exceeds the threshold, we raise an alarm and reconfigure the detection procedure.

IV-A4 Theoretical results

The choice of the Mahalanobis distance ensures that almost all the assumptions of Section III-B are met. In particular, the Mahalanobis distance meets the requirements of Assumption (A1). Then, the following Lemma 3 provides a lower bound of the form (6) in Assumption (A3); specifically, the lemma shows that, up to a positive factor, the distance between two graphs is larger than the one between the associated embedding vectors.

From these properties, we can apply Proposition 1 and state that Claim (C1) holds. Hence, for any significance level $\alpha$, we can set a specific threshold yielding a confidence level of at least $1 - \alpha$.

Lemma 3.

For any two graphs $g_1, g_2 \in \mathcal{G}$, we have that

$$\delta\big(h(g_1), h(g_2)\big) \leq \sqrt{\frac{M}{\lambda_{\min}}}\; d(g_1, g_2),$$

where $\lambda_{\min}$ is the smallest eigenvalue of the covariance matrix appearing in (9).

The proof is reported in Appendix D. Distance (9) is well-defined only when the covariance matrix is positive definite, a condition implying that the selected prototypes are not redundant.

IV-B A special case: graphs with identified vertices

Here we take into account the particular scenario where the attribute function assigns numerical attributes to the vertices and edges of a graph, and the vertex set is a subset of a predefined finite set of identified vertices. The peculiarity of this space resides in the fact that any vertex permutation of a graph leads to a different graph. Many real-world applications fall in this setting, for instance correlation graphs obtained from signals coming from sensors, or graphs generated by transportation networks.

We show an example of a method for this setting which complies with the methodology and satisfies Assumption (A3). This fact follows from the existence of an injective map from $\mathcal{G}$ to a set of matrices. Indeed, we represent each graph with its weighted adjacency matrix, whose row/column indices univocally correspond to the identified vertices. By endowing the set of adjacency matrices with the Frobenius distance (a graph alignment distance, as formally shown in Appendix D-B), this map is an isometry.

Since the co-domain is a Euclidean space, we compute a matrix $Z$ whose columns constitute a $p$-dimensional vector configuration related to the prototype graphs; this is done via Classical Scaling [pkekalska+duin2005], so that $\|z_i - z_j\|_2 = d(r_i, r_j)$ for all pairs of prototypes $r_i, r_j$. As usual, we consider the smallest possible $p$ that preserves the data structure as much as possible. Successively, for any dissimilarity vector $y$, we define $y^*$ to be a linear transformation of the vector $y^{(2)}$ obtained by squaring all the components of $y$, namely $y^* = -\tfrac{1}{2} (Z^\dagger)^\top J\, y^{(2)}$, where $J = I - \tfrac{1}{M}\mathbf{1}\mathbf{1}^\top$ is the centering matrix. We apply the same procedure of Section IV-A, considering $y^*$ instead of $y$: the matrix in (9) is derived from the covariance matrix of $y^*$ (non-singular as a consequence of the selection of $p$ and of the absence of isolated points), and the statistic is the Mahalanobis distance between the corresponding sample means.
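The classical scaling step can be sketched as follows (a standard Torgerson construction; the out-of-sample transformation of $y$ then reuses the returned configuration).

```python
import numpy as np

def classical_scaling(D, p):
    # D: M x M matrix of pairwise prototype distances.
    # Returns Z (p x M): columns are the p-dimensional configurations
    # whose Euclidean distances approximate D.
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the configuration
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:p]         # keep the top-p eigenpairs
    return (vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))).T
```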

Considering this space and the above transformation, we claim that the following Lemma 4 holds, proving the fulfilment of Assumption (A3).

Lemma 4.

For any positive definite covariance matrix, it holds that

$$c_1\, \delta(y_1^*, y_2^*) \leq d(g_1, g_2) \leq c_2\, \delta(y_1^*, y_2^*),$$

where $c_1 = \sqrt{\lambda_p}$ and $c_2 = \sqrt{\lambda_1}$, and $\lambda_i$ is the $i$-th eigenvalue of the covariance matrix in descending order of magnitude.

The proof is reported in Appendix D.

V Experiments

The proposed methodology can operate on very general families of graphs. Besides the theoretical foundations around which the paper is built, we provide experimental results showing the effectiveness of the proposed approach on real change detection problems in streams of graphs. In particular, we consider the method introduced in Section IV-A as an instance of our methodology. The source code for replicating the experiments is available in [zambon2017cdg].

V-A Experiment description

V-A1 Data

The experimental evaluation is performed on the well-known IAM benchmarking databases [riesen+bunke2008]. The IAM datasets contain attributed graphs representing drawings and biochemical compounds. Here we consider the Letters, Mutagenicity, and AIDS datasets.

The Letters dataset contains 2-dimensional geometric graphs. As such, each vertex is characterized by a real vector representing its location on a plane. The edges define lines such that the planar representation of each graph resembles a Latin-script letter. The dataset is composed of 15 classes, one for each letter (the database [riesen+bunke2008] considers only non-curved letters; hence, e.g., letters A and E are included, whereas B and C are excluded), each containing 150 instances.

Conversely, the Mutagenicity and AIDS datasets contain biological molecules. Molecules are represented as graphs by considering each atom as a vertex and each chemical link as an edge. Each vertex is attributed with a chemical symbol, whereas the edges are labelled according to the valence of the link. Both datasets contain two classes of graphs: mutagenic/non-mutagenic for Mutagenicity, and active/inactive for AIDS. The two datasets are imbalanced in terms of class size: Mutagenicity has 2401 mutagenic and 1963 non-mutagenic molecules, while AIDS contains 400 active and 1600 inactive molecules.

We considered these datasets because they contain different types of graphs with variegated attributes (numerical and categorical). We refer the reader to [riesen+bunke2008] and references therein for a more in-depth discussion about the datasets.

V-A2 Simulating the generating process

For each experiment in Table I, we consider two collections of graphs containing all possible observations in the nominal and non-nominal regimes, respectively. Each collection is composed of graphs belonging to one or more predefined classes of the dataset under investigation. The collections have to be different, but they do not need to be disjoint; as such, some graphs may belong to both collections.

Next, we simulate the process by bootstrapping graphs from the first collection up to the predefined change time $\tau$; this is the nominal regime. After $\tau$, we bootstrap objects from the second collection, hence modelling a change.
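A minimal sketch of this simulation, assuming the two collections are given as lists; names are illustrative.

```python
import numpy as np

def simulate_stream(nominal, non_nominal, tau, length, seed=None):
    # Bootstrap (sample with replacement) from `nominal` before the
    # change time tau, and from `non_nominal` afterwards.
    rng = np.random.default_rng(seed)
    pick = lambda pool: pool[rng.integers(len(pool))]
    return [pick(nominal) if t < tau else pick(non_nominal)
            for t in range(length)]
```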

Regarding the molecular datasets, we consider two distinct experiments: for Mutagenicity (AIDS), we set the non-mutagenic (inactive) class as the nominal collection and the mutagenic (active) class as the non-nominal one. On the other hand, for the Letter dataset we design four different experiments depending on which classes populate the collections. Table I reports the settings of all the experiments, and the next section describes the relevant parameters.

Experiment ID   Dataset        Nominal collection   Non-nominal collection
L-D2            Letter         A, E                 F, H
L-D5            Letter         A, E, F, H, I        K, L, M, N, T
L-O             Letter         A, E, F, H           F, H, I, K
L-S             Letter         A, E, F, H, I        F, H, I
MUT             Mutagenicity   non-mutagenic        mutagenic
AIDS            AIDS           inactive             active

Table I: Experimental settings. The first column contains an identifier for each experiment; in particular, for the Letter dataset, 'D', 'O', and 'S' stand for disjoint, overlapping, and subset, respectively. The second column reports the dataset involved, and the third and fourth columns show the sets of classes from which nominal/non-nominal graphs are extracted. The collections of letters are selected in alphabetical order.

V-A3 Parameters setting

For all experiments, the offset $q$ is set to the third quartile of the nominal distribution of $s_w$. The time-dependent threshold has been numerically estimated by Monte Carlo sampling: we drew one million processes of i.i.d. random variables, obtained by taking the square root of i.i.d. random variables distributed as $\chi^2(M)$. For each obtained process (stream), we computed the sequence of cumulative sums as in (10) and estimated the threshold as the quantile of order $1 - 1/\mathrm{ARL0}$, with ARL0 $= 200$ (windows).

We divided the training set into two disjoint subsets, used during the prototype selection and the change detection learning phases, respectively. After training, we generated a stream of graphs for the operational phase, with the change forced at a predefined time $\tau$. As for the distance $d(\cdot,\cdot)$, we considered the bipartite GED implemented in [riesen2013novel], selecting the Volgenant and Jonker assignment algorithm. The other GED parameters are set according to the type of graphs under analysis, i.e., for geometric graphs we consider the Euclidean distance between numerical attributes, and a binary (0/1) distance for categorical attributes. The k-Centres procedure is repeated 20 times.

We believe that the selected parameter settings are reasonable. Nevertheless, a proper investigation of their impact w.r.t. performance metrics is performed in a companion paper [zambon2017detecting], hence it is outside the focus of the present one.

V-A4 Figures of merit

We assess the performance of the proposed methodology by means of the figures of merit described here. Such measurements are obtained by replicating each experiment one hundred times; we report the average of the observed measures with their estimated 95% confidence interval (95CI) or standard deviation (std).

First of all, we consider the observed ARL0, introduced in Section IV-A3, and the delay of detection (DoD). Both are computed as the average time lapse between consecutive alarms, limited to the alarms raised before and after the change time $\tau$, respectively. From them, we estimate the rate of detected changes (DCR) as the fraction of simulations in which the DoD is less than the observed ARL0. Finally, we also consider the estimated rate of false alarms within 1000 samples (FA1000), computed as the ratio between the number of raised false alarms and the number of thousands of time steps under the nominal condition.

We point out that the measures ARL0, DoD, and FA1000 are computed with the window as the unitary time step.
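As a sketch of how these figures can be computed for a single replicate, assuming `alarms` holds the window indices at which the test fired and `tau` the change window (an illustrative helper, not taken from the paper):

```python
import numpy as np

def run_metrics(alarms, tau):
    # Average time lapse between consecutive alarms, before (ARL0)
    # and after (DoD) the change window tau; FA1000 counts false
    # alarms per thousand nominal time steps.
    alarms = np.asarray(alarms)
    pre = np.concatenate(([0], alarms[alarms < tau]))
    post = np.concatenate(([tau], alarms[alarms >= tau]))
    arl0 = np.diff(pre).mean() if pre.size > 1 else np.inf
    dod = np.diff(post).mean() if post.size > 1 else np.inf
    fa1000 = 1000.0 * (pre.size - 1) / tau
    return arl0, dod, fa1000

# DCR over replicates: fraction of runs where DoD < observed ARL0.
```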

V-A5 Baseline methods

As previously mentioned, state-of-the-art change detection methods for graph streams usually assume a given topology with a fixed number of vertices and/or simple numeric features for vertices/edges. As reported in [ranshous2015anomaly], when considering a variable topology, a common methodology for anomaly detection on graphs consists of extracting some topological features at each time step and then applying a more general anomaly detector to the resulting numeric sequence. Accordingly, in addition to the method proposed in Section IV, we consider two baseline methods for comparison, based on two topological features: the density of edges and the spectral gap of the Laplacian matrix [chung1994]. These particular choices can be justified by considering that both features are suitable for describing graphs with a variable number of vertices and edges. We implemented two CUSUM-like change detection tests as in Section IV, where the statistic is now computed on the scalar feature sequence, with the nominal feature distribution numerically estimated in the training phase.
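The two features can be computed as sketched below for simple undirected graphs given as adjacency matrices; we take the spectral gap as the difference between the two smallest Laplacian eigenvalues, one common convention.

```python
import numpy as np

def edge_density(A):
    # Fraction of vertex pairs connected by an edge.
    n = A.shape[0]
    return A[np.triu_indices(n, k=1)].mean() if n > 1 else 0.0

def spectral_gap(A):
    # Gap between the two smallest eigenvalues of the Laplacian L = D - A.
    L = np.diag(A.sum(axis=1)) - A
    eig = np.sort(np.linalg.eigvalsh(L))
    return eig[1] - eig[0]
```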

In addition, we consider a further baseline implemented as a degenerate case of our method, obtained by selecting a single prototype ($M = 1$). This last baseline is introduced to show that the strength of our methodology resides also in the embedding procedure, and not only in the graph distance $d(\cdot,\cdot)$.

V-B Results on IAM graph database

Experiment   M    n    DCR mean [95CI]        ARL0 mean [95CI]   DoD mean [95CI]    FA1000 mean (std)
L-D2         4    5    1.000 [1.000, 1.000]   189 [103, 333]     26 [8, 66]         1.135 (0.368)
L-D5         4    5    0.380 [0.290, 0.480]   175 [91, 271]      411 [9, 2149]      1.207 (0.390)
L-D5         4    25   0.970 [0.930, 1.000]   183 [88, 306]      41 [1, 184]        0.233 (0.086)
L-D5         4    125  1.000 [1.000, 1.000]   305 [61, 1148]     2 [1, 5]           0.045 (0.038)
L-D5         8    25   0.990 [0.970, 1.000]   175 [93, 372]      10 [1, 54]         0.245 (0.078)
L-D5         8    125  1.000 [1.000, 1.000]   255 [50, 928]      1 [1, 1]           0.054 (0.047)
L-O          4    5    0.230 [0.150, 0.310]   191 [120, 344]     410 [21, 1136]     1.096 (0.318)
L-O          4    25   0.790 [0.710, 0.870]   205 [101, 355]     123 [3, 723]       0.205 (0.070)
L-O          4    125  0.940 [0.890, 0.980]   291 [55, 725]      37 [1, 405]        0.045 (0.035)
L-O          8    25   0.980 [0.950, 1.000]   163 [92, 306]      24 [2, 132]        0.260 (0.080)
L-O          8    125  1.000 [1.000, 1.000]   238 [52, 620]      2 [1, 5]           0.052 (0.053)
L-S          4    5    0.720 [0.630, 0.810]   164 [90, 359]      132 [33, 349]      1.321 (0.418)
L-S          4    25   1.000 [1.000, 1.000]   193 [103, 372]     28 [4, 123]        0.223 (0.075)
AIDS         4    5    1.000 [1.000, 1.000]   175 [100, 337]     1 [1, 1]           1.221 (0.364)
MUT          4    5    0.050 [0.010, 0.100]   37 [18, 89]        70 [39, 124]       6.294 (2.159)
MUT          4    25   0.800 [0.720, 0.880]   52 [19, 145]       53 [4, 348]        1.075 (0.521)
MUT          4    125  0.980 [0.950, 1.000]   153 [8, 820]       5 [1, 22]          0.276 (0.284)
MUT          8    25   0.920 [0.860, 0.970]   42 [15, 91]        15 [2, 75]         1.242 (0.680)
MUT          8    125  1.000 [1.000, 1.000]   112 [9, 540]       2 [1, 4]           0.292 (0.267)
Table II: Results attained by our method for different numbers of prototypes ($M$) and window sizes ($n$). A DCR is considered statistically larger than 0.99 when its 95CI lies above 0.99; significant increasing trends of the DCR with $n$ and with $M$ are discussed in the text.
Experiment   Method   n    DCR mean [95CI]        ARL0 mean [95CI]   DoD mean [95CI]    FA1000 mean (std)
L-D2         M1       25   1.000 [1.000, 1.000]   238 [106, 594]     15 [2, 87]         0.194 (0.082)
L-D2         den      1    1.000 [1.000, 1.000]   194 [68, 556]      41 [22, 104]       6.475 (3.337)
L-D2         SG       1    0.850 [0.780, 0.920]   210 [75, 394]      133 [55, 274]      6.460 (3.342)
L-D5         M1       25   0.780 [0.700, 0.860]   222 [96, 407]      85 [2, 542]        0.200 (0.098)
L-D5         den      1    0.180 [0.110, 0.260]   219 [70, 485]      369 [91, 1147]     5.775 (3.346)
L-D5         SG       1    1.000 [1.000, 1.000]   209 [74, 674]      33 [18, 53]        6.058 (2.971)
L-O          M1       25   0.500 [0.400, 0.600]   218 [100, 459]     304 [7, 1199]      0.203 (0.082)
L-O          den      1    1.000 [1.000, 1.000]   194 [70, 549]      5 [4, 5]           6.853 (3.428)
L-O          SG       1    0.130 [0.070, 0.200]   221 [89, 613]      352 [115, 973]     5.558 (2.555)
L-S          M1       25   0.880 [0.810, 0.940]   230 [116, 469]     93 [4, 636]        0.189 (0.066)
L-S          den      1    1.000 [1.000, 1.000]   188 [79, 385]      32 [20, 48]        6.237 (2.777)
L-S          SG       1    0.080 [0.030, 0.140]   189 [91, 455]      308 [124, 771]     6.100 (2.247)
AIDS         M1       25   1.000 [1.000, 1.000]   229 [95, 489]      1 [1, 1]           0.198 (0.081)
AIDS         den      1    1.000 [1.000, 1.000]   207 [67, 505]      4 [2, 6]           6.312 (3.245)
AIDS         SG       1    1.000 [1.000, 1.000]   197 [79, 443]      93 [48, 152]       6.077 (2.744)
MUT          M1       25   0.260 [0.180, 0.350]   154 [24, 575]      484 [13, 2208]     0.506 (0.441)
MUT          den      1    0.230 [0.150, 0.310]   194 [69, 381]      267 [96, 741]      6.225 (3.175)
MUT          SG       1    0.030 [0.000, 0.070]   208 [72, 437]      688 [134, 2342]    5.940 (2.997)
Table III: Results attained by the baseline methods: graph edge density (den), spectral gap of the Laplacian (SG), and the degenerate implementation of our methodology with a single prototype (M1). Significantly better results (95% confidence) are discussed in the text.

For the sake of readability, we show the results for our method and for the baselines in two different tables, namely Table II and Table III.

In all the experiments of Table I, there is a parameter setting achieving a detection rate (DCR) statistically larger than 0.99; indeed, in Table II the 95% confidence interval (95CI) of the DCR lies above 0.99. Looking at Table II in more detail, we notice that increasing both the window size $n$ and the number of prototypes $M$ yields a higher DCR. In particular, this can be seen in the L-O experiments, where all DCR estimates have disjoint 95CIs (i.e., the differences are statistically significant). The same phenomenon appears also for L-D5, L-S, and MUT. As far as the other figures of merit are concerned, we do not observe statistical evidence of any trend. Still, with the exception of a few cases in MUT, all 95CIs related to ARL0 contain the target value ARL0 = 200; hence, we may say that the threshold estimation described in Section V-A3 worked as expected.

Here, we limit the analysis to the proposed parameter settings, since we already reach the highest possible DCR, achieving one hundred detections out of one hundred. We believe that, by increasing the window size, the false alarms would further decrease, as our method relies on the central limit theorem. Concerning the number of prototypes, we point out that, in the current implementation of the methodology, the number of parameters to be estimated scales quadratically with $M$ (owing to the covariance matrix); accordingly, we would need to increase the number of training samples.

The second experimental analysis addresses the performance assessment of the three baselines of Section V-A5 on the experiments of Table I. The results reported in Table III show that, in some cases, the considered baselines achieve sound performance, comparable to that shown in Table II. Comparing Table III with Table II by intersecting the 95% confidence intervals, we notice that there is always a configuration of the proposed method attaining a DCR statistically equivalent or superior to the baselines. In particular, the proposed method performs significantly better than the baselines on the MUT dataset.

Finally, Table III shows that the baseline based on edge density is significantly more accurate in terms of DCR than the one based on the spectral gap in almost all experiments (the 95% confidence intervals do not intersect). The edge-density baseline also performed better than the degenerate method (M1) with a single prototype on L-O and L-S; conversely, M1 outperforms it on L-D5.

VI Conclusions

In this paper, we proposed a methodology for detecting changes in stationarity in streams of graphs. Our approach allows us to handle general families of attributed graphs and is not limited to graphs with a fixed number of vertices or with a (partial) one-to-one correspondence between vertices at different time steps (uniquely identified vertices). The proposed methodology consists of an embedding step, which maps input graphs onto numeric vectors, bringing the change detection problem back to a more manageable setting.

We designed and tested a specific method as an instance of the proposed methodology. The embedding has been implemented here as a dissimilarity space representation, relying on a suitable set of prototype graphs (which, in our case, also provide a characterization of the nominal condition) and on a dissimilarity measure between graphs, here implemented as a graph edit distance. The method then computes the Mahalanobis distance between the mean values in two windows of embedded graphs and adopts a CUSUM-like procedure to identify changes in stationarity.

We provided theoretical results proving that, under suitable assumptions, the methodology is able to relate the significance level with which changes are detected in the dissimilarity space to a significance level with which changes occurred in the graph domain; the converse has been proven as well. We also showed that our methodology can handle more basic, yet relevant, scenarios with uniquely identified vertices.

Finally, we performed experiments on IAM graph datasets showing that the methodology can be applied both to geometric graphs (2-dimensional drawings) and graphs with categorical attributes (molecules), as instances of possible data encountered in real-world applications. Results show that the proposed method attains at least comparable (and often better) results w.r.t. other change detectors for graph streams.

In conclusion, we believe that the proposed methodology opens the way to designing sound change detection methods for sequences of attributed graphs with possibly time-varying topology and non-identified vertices. In future studies, we plan to work on real-world applications and focus on the automatic optimization of relevant parameters affecting the performance.

Appendix A Correspondence between probability spaces

Consider the measurable space $(\mathcal{D}, \mathcal{B})$ introduced in Section III-A2, and the probability space $(\mathcal{G}, \mathcal{S}, Q)$ of Section I-A.

Let us define the preimage function $h^{-1}(B) = \{g \in \mathcal{G} : h(g) \in B\}$, with $B \in \mathcal{B}$, and consider the smallest $\sigma$-algebra $\mathcal{S}$ containing all open balls of $(\mathcal{G}, d)$ (a ball being defined as the set of all graphs whose distance from a reference graph is smaller than a given radius) and all preimage sets $h^{-1}(B)$ for any $B \in \mathcal{B}$, together with a generic probability measure $Q$ on $(\mathcal{G}, \mathcal{S})$. Then, we can define the function $F$ as in Equation (2).

The triple $(\mathcal{D}, \mathcal{B}, F)$ is a probability space. The following three properties prove that $F$ is a probability measure on $(\mathcal{D}, \mathcal{B})$:

  • $F(B) = Q(h^{-1}(B)) \geq 0$ for any $B \in \mathcal{B}$;

  • $F(\mathcal{D}) = Q(h^{-1}(\mathcal{D})) = Q(\mathcal{G}) = 1$; and

  • for any countable collection $\{B_i\}$ of pairwise disjoint sets of $\mathcal{B}$, the preimages $h^{-1}(B_i)$ are pairwise disjoint as well, hence $F(\cup_i B_i) = Q(\cup_i h^{-1}(B_i)) = \sum_i Q(h^{-1}(B_i)) = \sum_i F(B_i)$.

Notice also that, indicating with $h(\mathcal{G})$ the image set of $h$, we have $F(B) = 0$ for any $B \in \mathcal{B}$ such that $B \cap h(\mathcal{G}) = \emptyset$.

Appendix B Fréchet mean

Given a probability space $(\mathcal{G}, \mathcal{S}, Q)$ defined over a metric space $(\mathcal{G}, d)$, we consider a random sample $G = \{g_1, \dots, g_n\}$.

B-A Definition of Fréchet mean and variation

For any object $g' \in \mathcal{G}$, let us define the sample and population Fréchet functions

$$\hat{\rho}(g') = \frac{1}{n} \sum_{i=1}^{n} d(g', g_i)^2, \qquad \rho(g') = \mathbb{E}\big[d(g', g)^2\big].$$

A Fréchet sample (population) mean is any object attaining the minimum of $\hat{\rho}(\cdot)$ ($\rho(\cdot)$). We point out that the minimum might not exist in $\mathcal{G}$ and, if it does, it can be attained by multiple objects. Whenever the minimum exists and is unique, we refer to it as $\hat{\mu}$ ($\mu$). In addition, we define the Fréchet sample (population) variation as the infimum of $\hat{\rho}(\cdot)$ ($\rho(\cdot)$).
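Since minimising the Fréchet function over the whole graph space is generally intractable, a common practical surrogate restricts the search to a candidate set (e.g., the sample itself); a sketch under this assumption:

```python
import numpy as np

def frechet_sample_mean(candidates, sample, dist):
    # argmin over `candidates` of the sample Fréchet function
    # rho_hat(m) = (1/n) * sum_i dist(m, g_i)^2.
    costs = [np.mean([dist(m, g) ** 2 for g in sample]) for m in candidates]
    return candidates[int(np.argmin(costs))]
```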

B-B Fréchet mean in Euclidean spaces

In the case of a set $\mathcal{D} \subseteq \mathbb{R}^M$ and distance $\delta(y, y') = \|y - y'\|_2$, the space is Euclidean. Firstly, we show that $\mu = \mathbb{E}[y]$ and $\hat{\mu} = \bar{y}$; then, we show that

$$\mathbb{E}\big[\|\bar{y} - \mathbb{E}[y]\|_2^2\big] = \frac{1}{n}\, \mathbb{E}\big[\|y - \mathbb{E}[y]\|_2^2\big]. \quad (11)$$

1. The following equality holds for any $z \in \mathbb{R}^M$:

$$\rho(z) = \mathbb{E}\big[\|y - z\|_2^2\big] = \mathbb{E}\big[\|y - \mathbb{E}[y]\|_2^2\big] + \|\mathbb{E}[y] - z\|_2^2, \quad (12)$$

as we will show below, in Part 3. This result proves that the minimum is attained at the expectation, i.e., $\mu = \mathbb{E}[y]$.

2. Similarly, in the 'sample' case,

$$\hat{\rho}(z) = \frac{1}{n} \sum_{i=1}^{n} \|y_i - z\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} \|y_i - \bar{y}\|_2^2 + \|\bar{y} - z\|_2^2,$$

proving that $\hat{\mu} = \bar{y}$.

3. The results of Parts 1 and 2 are derived from the following expansion, valid for any $z$:

$$\|y - z\|_2^2 = \|y - \mathbb{E}[y]\|_2^2 + 2\,(y - \mathbb{E}[y])^\top (\mathbb{E}[y] - z) + \|\mathbb{E}[y] - z\|_2^2,$$

whose middle term vanishes when taking the expectation (and, with $\bar{y}$ in place of $\mathbb{E}[y]$, when taking the sample average).

4. Let us move on to proving the second result. Notice that

$$\bar{y} - \mathbb{E}[y] = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \mathbb{E}[y]\big).$$

5. Then,

$$\mathbb{E}\big[\|\bar{y} - \mathbb{E}[y]\|_2^2\big] = \frac{1}{n^2} \sum_{i,j} \mathbb{E}\big[(y_i - \mathbb{E}[y])^\top (y_j - \mathbb{E}[y])\big].$$

Now, thanks to the independence of the observations, the cross terms ($i \neq j$) vanish, and each of the $n$ remaining terms equals $\mathbb{E}\big[\|y - \mathbb{E}[y]\|_2^2\big]$, which proves (11).