Segmenting Dynamic Network Data

Networks and graphs arise naturally in many complex systems, and they often exhibit dynamic behavior that can be modeled using dynamic networks. Two major research problems in dynamic networks are (1) community detection, which aims to find specific sub-structures within the networks, and (2) change point detection, which tries to find the time points at which sub-structures change. This paper proposes a new methodology that solves both problems simultaneously, using a model selection framework in which the Minimum Description Length (MDL) Principle is used as the objective criterion to be minimized. The derived detection algorithm is compatible with many existing methods, and is supported by empirical results and data analysis.


1 Introduction

Networks and graphs arise naturally in many complex systems, such as social networks, the World Wide Web, climate data, and biological systems. These networks normally encode relationships between two subjects within the system. For example, in the widely known Facebook friendship network, a connection is established between two users if they are friends on Facebook. Data of this type normally come in two forms: either a single network that encodes all information at a particular time point (a static network), or a sequence of networks that captures the dynamic behavior of the system (a dynamic network). These networks can further be classified as undirected or directed, and as weighted or unweighted. Unless stated otherwise, all networks described below are undirected, unweighted networks.

Analysis of static networks has been a popular research subject in both statistics and the social sciences. Of the many research topics one might pursue, community detection is arguably the most common choice. In brief, the goal of community detection is to locate highly dense sub-networks within the entire network (Newman, 2004; Fortunato, 2010). Among the many developed methods, modularity (Newman, 2006) and statistical model based approaches (Holland et al., 1983) have drawn much attention. Many algorithms have since been developed based on these ideas, including the well-known Louvain method of Blondel et al. (2008), Infomap by Rosvall et al. (2009), and fast modularity by Clauset et al. (2004), among others. A further important area of research in static network analysis is determining the number of communities within a network (Airoldi et al., 2008; Saldaña et al., 2017).

While static network methods aim at analyzing individual snapshots of network data, dynamic network analysis tries to analyze a sequence of networks simultaneously. Oftentimes one is interested in how the network evolves. There are two main areas of research for dynamic networks: consensus clustering, where one tries to find a community structure that fits well for all the snapshots in the data sequence, and change point detection, where one aims at locating the time points at which community structures change.

In terms of consensus clustering, several main techniques have been developed in the literature, and they are closely related to static network community detection methods. These include sum graphs and average Louvain (Aynaud & Guillaume, 2011), which start by constructing a special graph that captures the topology of all snapshots in a given graph sequence and then apply any static community detection method to this summary graph. This assumes that the discovered structure fits well for all snapshots in the sequence. The summary graph can be constructed in many ways; the simplest is to add up the adjacency matrices of the snapshots to create a new matrix that represents this summary graph (see Section 4 for more details). Another detection method, by Lancichinetti & Fortunato (2012), aims to find a partition for the sequence using the individual partitions of the snapshots. That is, using the community structure of each snapshot as input, the method constructs an adjacency matrix that captures the community assignment relationships between the nodes across all snapshots, and conducts community detection on this consensus matrix.
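As an illustration of the sum-graph idea, the following sketch adds up the snapshot adjacency matrices, binarizes the result, and runs a static community detection routine on the summary graph (numpy and networkx are assumed to be available; the use of the Clauset-Newman-Moore modularity routine here is an illustrative choice, not the specific method of Aynaud & Guillaume).

import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def consensus_communities(adj_list):
    # Add up the snapshot adjacency matrices to form the summary graph,
    # binarize it (simple undirected network), and drop self-loops.
    summary = (np.sum(adj_list, axis=0) > 0).astype(int)
    np.fill_diagonal(summary, 0)
    G = nx.from_numpy_array(summary)
    # Any static community detection method can be applied here.
    return list(greedy_modularity_communities(G))

# Illustrative usage with three random 5-node snapshots.
rng = np.random.default_rng(0)
snaps = []
for _ in range(3):
    upper = np.triu((rng.random((5, 5)) < 0.4).astype(int), 1)
    snaps.append(upper + upper.T)
print(consensus_communities(snaps))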

As for change point detection, several well-known methods take different approaches to the problem. They include GraphScope (Sun et al., 2007), Multi-Step (Aynaud & Guillaume, 2011), the generalized hierarchical random graph (GHRG) (Peel & Clauset, 2015), and SCOUT (Hulovatyy & Milenković, 2016). GraphScope works by sequentially evaluating the next snapshot, to see whether the community structure of the next snapshot matches well with the one from the current segment according to an evaluation criterion derived from the Minimum Description Length Principle. This method also works well for online change point detection, i.e. streaming data. However, the algorithm assumes the nodes can be partitioned into two sets, sources and sinks, and finds the partition within each set. Multi-Step, on the other hand, starts by assuming each snapshot belongs to its own segment. At each iteration, the two snapshots that are most similar (measured by an averaged modularity quantity) are grouped together, similar to a hierarchical clustering approach. GHRG first assumes a parametric model on the individual networks and a fixed-length moving window of snapshots, and statistically tests whether a given time point in the window is a change point. Lastly, SCOUT works by finding the set of change points and community structures that minimizes an objective criterion derived from the Akaike Information Criterion (AIC) (Akaike, 1974) or the Bayesian Information Criterion (BIC) (Schwarz, 1978). Hulovatyy & Milenković (2016) also derived three change point search algorithms, based on exhaustive search (with dynamic programming to speed up computation), top-down search, and bottom-up search. Users can either pre-specify the number of change points and have the algorithm search over the restricted space, or let the algorithm determine the number of change points.

This paper proposes to conduct change point detection and community detection simultaneously using the Minimum Description Length (MDL) Principle (Rissanen, 1989, 2007). In short, the detection problem is cast as a model selection problem, where one selects the number of change points and the community assignments by minimizing an objective criterion. Note that although GraphScope also uses the MDL principle as its objective criterion, its model assumptions are different from the ones made in this paper. Also, unlike many of the existing papers, this paper provides a thorough analysis of the proposed method via simulated data, to assess the accuracy of the method when the ground truth is known, an important validation step. It is shown that, when the underlying model is correctly specified, the proposed method is able to detect change points with very high accuracy. Even when the model is misspecified, the proposed method can still capture the change points, while competitor methods tend to overestimate the number of change points in this scenario.

The rest of the paper is organized as follows. Section 2 formally defines the problem. Sections 3 and 4 introduce the proposed methodology. Section 5 presents an empirical analysis of the proposed methodology and Section 6 concludes.

2 Setting

2.1 Notations

Denote a sequence of graphs of length $T$ as $\{G_1, \ldots, G_T\}$. Each graph $G_t$ consists of a vertex set $V_t$ and an edge set $E_t$, where the node degree of each $v \in V_t$ is at least 1. Note that there is no restriction on the size of the vertex sets: $V_s$ can be different from $V_t$ for $s \neq t$, implying that the graphs $G_s$ and $G_t$ can have different sizes – a quite natural assumption for time-evolving networks. For example, in the popular Enron email dataset (Priebe et al., 2005), each graph represents the email communication pattern between employees over one week. The nodes represent employees of the company, and an edge between two nodes means there is at least one email communication between the two employees within the time frame of the graph. It is possible that some employees have no email connection with the subjects of interest in the data set at some time $t$, hence these employees will be missing from $G_t$, and may show up again at another time. Denote the overall node set as $V = \bigcup_{t=1}^{T} V_t$, with $|V| = N$.

In general, each graph $G_t$ can be represented as a binary adjacency matrix $A^{(t)}$ of dimension $N \times N$, where $A^{(t)}_{uv} = 1$ represents a connection between the nodes $u$ and $v$, and $A^{(t)}_{uv} = 0$ otherwise. If $|V_t| < N$, one can simply insert rows and columns of 0 at the appropriate locations so that the row and column arrangements of all matrices have the same meaning. Note that $\sum_{v} A^{(t)}_{uv} = 0$ means that no edge is connected to node $u$ (i.e. node $u$ is a singleton) at time $t$. Of interest are the nodes with $\sum_{v} A^{(t)}_{uv} > 0$, but for simplicity of notation and computation, all adjacency matrices are fixed at the same size. As stated in the Introduction, this paper focuses on simple undirected networks, hence $A^{(t)}_{uv} = A^{(t)}_{vu} = 1$ if there is at least one connection between nodes $u$ and $v$ at time $t$. The graphs are also assumed to have no self-loops, i.e. $A^{(t)}_{uu} = 0$.
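For concreteness, a minimal numpy sketch of this padding step (the node labelling scheme and helper name are illustrative assumptions):

import numpy as np

def align_to_common_nodes(edge_lists, all_nodes):
    # Build one N x N binary adjacency matrix per snapshot, inserting
    # rows/columns of zeros for nodes that are absent at that time.
    index = {v: i for i, v in enumerate(sorted(all_nodes))}
    N = len(index)
    mats = []
    for edges in edge_lists:
        A = np.zeros((N, N), dtype=int)
        for u, v in edges:
            if u != v:                      # no self-loops
                A[index[u], index[v]] = 1
                A[index[v], index[u]] = 1   # undirected
        mats.append(A)
    return mats

# Illustrative usage: two snapshots over the overall node set {a, b, c, d}.
mats = align_to_common_nodes([[("a", "b"), ("b", "c")], [("c", "d")]],
                             {"a", "b", "c", "d"})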

2.2 Problem Statement

Suppose the sequence of graphs can be segmented into $m+1$ segments, with the graphs in each segment satisfying some homogeneity properties. For $j = 1, \ldots, m+1$, define the graph segment $\mathcal{G}_j = \{G_{\tau_{j-1}}, G_{\tau_{j-1}+1}, \ldots, G_{\tau_j - 1}\}$, using the conventions $\tau_0 = 1$ and $\tau_{m+1} = T + 1$. The problem of change point detection in dynamic networks can then be defined as follows:

Problem 2.1.

Given a sequence of graphs $\{G_1, \ldots, G_T\}$, find the locations $\tau_1 < \tau_2 < \cdots < \tau_m$ such that the community structure of each resulting graph segment is homogeneous but different from the community structure of any adjoining graph segment.

The time points $\tau_1, \ldots, \tau_m$ are called change point locations. It is important to note that, as mentioned above, the number of nodes within each graph can be different, even within the same time segment. However, if a change in node size is considered a change in community structure, this can easily result in segments consisting of one graph each. Hence a more robust definition of ‘change’ is needed in order to prevent overestimating the number of segments.

Definition 2.1 (Community structure within segment).

A community structure $C_j = \{C_{j,1}, \ldots, C_{j,k_j}\}$ for segment $j$ is a partition of the node set $V_j = \bigcup_{t=\tau_{j-1}}^{\tau_j - 1} V_t$ into $k_j$ non-overlapping sets. The sets $C_{j,1}, \ldots, C_{j,k_j}$ are called communities.

It is possible that some nodes might not show up in all of the graphs in the segment. However, if the original community structure is strong, adding nodes to the existing network can only strengthen the existing communities unless the new nodes introduce a large number of new connections. Similarly, removing certain nodes will not significantly weaken the existing structure unless the removed nodes play a central role in their communities. Hence this is a valid definition of community assignments. Because of this, for simplicity the overall node set $V$ will be used in place of $V_j$ in what follows.

3 Change Point Detection and Community Detection Using MDL

This section describes the modeling procedure of the dynamic network and introduces the proposed methodology for change point and community detection. The statistical model used for the individual networks is presented first.

3.1 The Stochastic Block Model

Many statistical models have been proposed to analyze network data, with the Stochastic Block Model (SBM) the most widely used. Below we briefly review the non-degree-corrected SBM.

Recall that the adjacency matrix $A$ is a symmetric binary matrix, with 1 representing the existence of a connection between two nodes. Given the community assignment vector $c = (c_1, \ldots, c_N)$ and link probabilities $P_{qr}$ between communities $q$ and $r$, one can model the edges with a Bernoulli distribution: $A_{uv} \sim \mathrm{Bernoulli}(P_{c_u c_v})$, where $c_u$ and $c_v$ are the community assignments of nodes $u$ and $v$, and $P$ is a symmetric matrix with $P_{qr} = P_{rq}$. The standard assumption entails that $P_{qr}$ should be large if $q = r$, i.e. if two nodes belong to the same community, there is a high probability of an edge existing between the two nodes. This results in denser intra-community connections than inter-community connections. Extending this notation to the segmented setting mentioned above gives $A^{(t)}_{uv} \sim \mathrm{Bernoulli}(P^{(t)}_{c_u c_v})$ if $t$ belongs to the $j$th segment, where $c = c^{(j)}$ is the community assignment for segment $j$. Note that the link probabilities are not assumed to remain the same throughout a given segment.
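To make the model concrete, here is a minimal simulation sketch of one snapshot from this Bernoulli SBM (the particular assignment vector and probability matrix are illustrative choices):

import numpy as np

def simulate_sbm(assignments, P, seed=None):
    # Draw a symmetric binary adjacency matrix with
    # A[u, v] ~ Bernoulli(P[c_u, c_v]) and zero diagonal (no self-loops).
    rng = np.random.default_rng(seed)
    c = np.asarray(assignments)
    probs = P[np.ix_(c, c)]                       # pairwise edge probabilities
    upper = np.triu(rng.random(probs.shape) < probs, 1)
    A = upper.astype(int)
    return A + A.T                                # symmetrize

# Illustrative usage: 30 nodes in 3 communities, dense within, sparse between.
c = np.repeat([0, 1, 2], 10)
P = np.full((3, 3), 0.1)
np.fill_diagonal(P, 0.8)
A = simulate_sbm(c, P, seed=1)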

The estimation of the link probabilities can be solved via the maximum likelihood method. Suppose the community assignment $c$ at time $t$ is known, where $c_u \in \{1, \ldots, k\}$. The log-likelihood function is then

(1)   $\ell(P^{(t)}) = \sum_{u < v} \left[ A^{(t)}_{uv} \log P^{(t)}_{c_u c_v} + \left(1 - A^{(t)}_{uv}\right) \log\left(1 - P^{(t)}_{c_u c_v}\right) \right]$

(2)   $\ell(P^{(t)}) = \sum_{q \le r} \left[ E^{(t)}_{qr} \log P^{(t)}_{qr} + \left(n^{(t)}_{qr} - E^{(t)}_{qr}\right) \log\left(1 - P^{(t)}_{qr}\right) \right]$

Equation (1) gives the representation when the edges are assumed to have Bernoulli distributions. Equation (2) results from aggregating all edges between a given pair of communities $q \le r$ into one group, using $n^{(t)}_{qr}$ as the total number of possible edges between communities $q$ and $r$, and $E^{(t)}_{qr}$ as the number of observed edges between communities $q$ and $r$. The parameters can then be estimated by finding the $P^{(t)}_{qr}$ that maximize Equation (2).
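The maximizer of (2) is the empirical edge proportion $\hat{P}^{(t)}_{qr} = E^{(t)}_{qr} / n^{(t)}_{qr}$. A small numpy sketch of the block counts, the estimates, and the profile log-likelihood for one snapshot (variable names are illustrative):

import numpy as np

def block_loglik(A, c):
    # E_qr: observed edges between communities q and r;
    # n_qr: possible edges; P_hat = E_qr / n_qr maximizes Equation (2).
    c = np.asarray(c)
    k = int(c.max()) + 1
    P_hat = np.zeros((k, k))
    loglik = 0.0
    for q in range(k):
        for r in range(q, k):
            rows = np.where(c == q)[0]
            cols = np.where(c == r)[0]
            block = A[np.ix_(rows, cols)]
            if q == r:
                n_qr = len(rows) * (len(rows) - 1) / 2
                E_qr = np.triu(block, 1).sum()
            else:
                n_qr = len(rows) * len(cols)
                E_qr = block.sum()
            p = E_qr / n_qr if n_qr > 0 else 0.0
            P_hat[q, r] = P_hat[r, q] = p
            if 0.0 < p < 1.0:            # blocks with p in {0, 1} contribute 0
                loglik += E_qr * np.log(p) + (n_qr - E_qr) * np.log(1.0 - p)
    return P_hat, loglik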

3.2 The MDL Principle

Using the SBM as the base model for the graphs, one can write down a complete likelihood for modeling the change points and the community assignments for each segment (call this the segmented time-evolving network). As seen in Section 3.1, the estimation of the link probabilities is trivial if the change point locations and community assignments are given. However, the estimation of the community structures and change points is less straightforward. In terms of community detection, various algorithms and objective criteria have been proposed to solve the problem (see Introduction). If the change point locations are known, one can easily adopt the existing methods to derive the community assignments. The rest of this section will apply the MDL principle to derive an estimate for the change point locations as well as community assignments for each segment.

The MDL principle is a model selection criterion. When applying the MDL principle, the “best” model is defined as the one allowing the greatest compression of the observed data. That is, the “best” model enables us to store the data in a computer with the shortest code length. There are several versions of MDL, and the “two-part” variant will be used here (see (3)). The first part encodes the fitted model being considered, denoted by $\hat{\mathcal{F}}$, and the second part encodes the residuals left unexplained by the fitted model, denoted by $\hat{e}$. Denoting by $\mathrm{CL}(z)$ the code length of an object $z$, the total code length is

(3)   $\mathrm{CL}(\text{“data”}) = \mathrm{CL}(\hat{\mathcal{F}}) + \mathrm{CL}(\hat{e} \mid \hat{\mathcal{F}})$

The goal is to find the model that minimizes (3). Readers can refer to Lee (2001) for more examples of how to apply the two-part MDL to different models. To use (3) for finding the best segmentation as well as community assignments for a given evolving network sequence, the two terms on the right-hand side of (3) need to be calculated.

To fit a model for the segmented time-evolving network, one needs to first identify the change point locations. Once the locations are determined, one can proceed to estimate the community assignments as well as the link probabilities. Denote by $C_j$ the community assignment for the $j$th segment, and $C = (C_1, \ldots, C_{m+1})$. Since $\hat{\mathcal{F}}$ is completely characterized by $m$, $\tau = (\tau_1, \ldots, \tau_m)$, $C$ and the estimated link probabilities $\hat{P}$, the code length of $\hat{\mathcal{F}}$ can be decomposed into

(4)   $\mathrm{CL}(\hat{\mathcal{F}}) = \mathrm{CL}(m) + \mathrm{CL}(\tau) + \mathrm{CL}(C) + \mathrm{CL}(\hat{P})$

According to Rissanen (1989), it requires approximately $\log_2 I$ bits to encode an integer $I$ if its upper bound is unknown, and $\log_2 I_U$ bits if $I$ is bounded from above by $I_U$. Hence $\mathrm{CL}(m)$, the code length for the number of change points, translates to $\log_2(m+1)$, where the additional 1 is to differentiate between $m = 0$ (no change point) and $m \ge 1$. To encode the change point locations $\tau_1, \ldots, \tau_m$, one can encode the distances between consecutive change points rather than the locations themselves. Hence $\mathrm{CL}(\tau) = \sum_{j=1}^{m+1} \log_2 n_j$, where $n_j = \tau_j - \tau_{j-1}$ is the length of the $j$th segment.

Once the change points are encoded, one can encode the community structures and link probabilities, i.e. the networks themselves. Recall from Definition 2.1 that the goal is to partition each node set into $k_j$ non-overlapping communities. Therefore, $\mathrm{CL}(C) = \sum_{j=1}^{m+1} \left( \log_2 k_j + N \log_2 k_j \right)$, where the first term encodes the number of communities for the $j$th segment ($k_j$), and the second term encodes the community assignment for each node. Lastly, by Rissanen (1989), it takes $\tfrac{1}{2} \log_2 n$ bits to encode a maximum likelihood estimate of a parameter computed from $n$ observations. Hence, $\mathrm{CL}(\hat{P}) = \sum_{j=1}^{m+1} \sum_{t=\tau_{j-1}}^{\tau_j - 1} \sum_{q \le r} \tfrac{1}{2} \log_2 n^{(t)}_{qr}$. Putting everything together, $\mathrm{CL}(\hat{\mathcal{F}})$ is then

(5)   $\mathrm{CL}(\hat{\mathcal{F}}) = \log_2(m+1) + \sum_{j=1}^{m+1} \log_2 n_j + \sum_{j=1}^{m+1} (N + 1) \log_2 k_j + \sum_{j=1}^{m+1} \sum_{t=\tau_{j-1}}^{\tau_j - 1} \sum_{q \le r} \tfrac{1}{2} \log_2 n^{(t)}_{qr}$
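For concreteness, the sketch below tallies the model cost in (5) as reconstructed above; the exact bookkeeping of the original derivation may differ, and the argument layout and helper name are illustrative assumptions.

import numpy as np

def model_code_length(seg_lengths, n_communities, possible_edges, N):
    # seg_lengths[j]    : length n_j of the j-th segment
    # n_communities[j]  : number of communities k_j in the j-th segment
    # possible_edges[j] : list over snapshots in segment j of arrays holding
    #                     n_qr for every community pair q <= r
    m = len(seg_lengths) - 1                          # number of change points
    cl = np.log2(m + 1)                               # CL(m)
    cl += sum(np.log2(n_j) for n_j in seg_lengths)    # CL(tau): segment lengths
    cl += sum((N + 1) * np.log2(k_j)                  # CL(C): k_j plus one label per node
              for k_j in n_communities if k_j > 0)
    for snaps in possible_edges:                      # CL(P_hat): 0.5*log2(n) bits per MLE
        for n_qr in snaps:
            n_qr = np.asarray(n_qr, dtype=float)
            cl += 0.5 * np.log2(n_qr[n_qr > 0]).sum()
    return cl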

To obtain the second term of (3), one can use the result of Rissanen (1989) that the code length of the residuals is the negative of the log-likelihood of the fitted model $\hat{\mathcal{F}}$. With the assumption that, given the community structures and link probabilities, each $A^{(t)}_{uv}$ follows a Bernoulli distribution,

(6)   $\mathrm{CL}(\hat{e} \mid \hat{\mathcal{F}}) = -\sum_{j=1}^{m+1} \sum_{t=\tau_{j-1}}^{\tau_j - 1} \sum_{q \le r} \left[ E^{(t)}_{qr} \log_2 \hat{P}^{(t)}_{qr} + \left(n^{(t)}_{qr} - E^{(t)}_{qr}\right) \log_2\left(1 - \hat{P}^{(t)}_{qr}\right) \right]$

Combining (5) and (6), the proposed MDL criterion for estimating the change point locations and community structures is

(7)   $\mathrm{MDL}(m, \tau, C) = \mathrm{CL}(\hat{\mathcal{F}}) + \mathrm{CL}(\hat{e} \mid \hat{\mathcal{F}})$

The goal is to find the change point locations and community assignments that minimize (7).

4 Change Point and Community Assignment Search

As pointed out in Section 3.1, the estimates of the link probabilities are easy to obtain if the change points and community assignments are known. However, the estimation of $\tau$ and $C$ is non-trivial. This section describes the procedure for estimating these two sets of parameters, which combine to estimate the segmented time-evolving network.

4.1 Community Detection

The procedure for community detection within a given segment of networks is described first. Recall from Definition 2.1 that the goal is to find, for the $j$th segment, a partition $C_j$ such that each node in $V$ belongs to exactly one community. However, it is possible that some nodes only appear in certain snapshots within the segment. Hence the community search procedure should be robust enough to deal with this problem.

Consider the set of adjacency matrices $\{A^{(\tau_{j-1})}, \ldots, A^{(\tau_j - 1)}\}$ for the $j$th segment. These matrices (which represent networks) can be aggregated by simply adding them up. The resulting matrix forms a super network that overlays all the networks between times $\tau_{j-1}$ and $\tau_j - 1$, and community detection can be conducted over this super network. Since only simple undirected networks are considered, all values larger than 1 in the aggregated adjacency matrix are replaced by 1.

As seen in the Introduction, community detection has been a popular research area in the past few decades, and many fast algorithms have been developed for the task. However, most of these algorithms aim at maximizing the modularity of the network, hence they cannot be applied directly here since the objective function of interest is the MDL criterion. Nonetheless, one can still borrow ideas from the algorithmic portion of these methodologies.

The Louvain method of Blondel et al. (2008) is known to be one of the fastest community detection algorithms for static networks. It works in the following way. First, each node is assigned to its own community. In the first iteration, each node (in some random order) is moved to a neighboring community if there is a positive gain in modularity. If there are multiple neighboring communities with positive gain, the one with maximum gain is picked. This is repeated for all nodes, possibly multiple times per node, until no modularity gain is achieved. Then the newly formed communities are treated as nodes and the merging procedure is repeated until no modularity gain is achieved (at this step a neighboring community is a group of vertices that has at least one connection with the current community). This method is fast and suitable for large graphs. However, it might be prone to overestimating the number of communities since it is a bottom-up search method. Also, the number of communities is usually much smaller than the number of nodes, hence it seems unnecessary to initialize with one community per node.

Instead of a bottom-up search, a top-down algorithm for detecting communities is proposed here. The main idea is to recursively split the network into smaller communities until no further improvement can be achieved. The algorithm starts by randomly assigning each node to one of two communities. In the first iteration, each node (in some random order) is switched to the opposite community if the switch leads to a decrease in MDL value. This is repeated until no switch causes a decrease in MDL value. Then the same procedure is repeated on each sub-community until no further split can be found.

To prevent overestimating the number of communities, a merging step is conducted after the splits. At each iteration, each community is merged with a neighboring community if there is a drop in MDL value, and the one with the biggest drop is picked if there are multiple such communities. This is repeated for all communities. One can think of this procedure as a top-down search (splitting communities) followed by a bottom-up search (merging communities). The entire procedure can be repeated after the merge step to avoid being trapped in a local optimum.

Notice that since all the segments are assumed to be independent of each other, there is no need to calculate the entire MDL value (7) when conducting the community search. Instead, one can consider the sub-MDL criterion for the $j$th segment,

(8)   $\mathrm{MDL}_j(C_j) = (N + 1) \log_2 k_j + \sum_{t=\tau_{j-1}}^{\tau_j - 1} \sum_{q \le r} \left[ \tfrac{1}{2} \log_2 n^{(t)}_{qr} - E^{(t)}_{qr} \log_2 \hat{P}^{(t)}_{qr} - \left(n^{(t)}_{qr} - E^{(t)}_{qr}\right) \log_2\left(1 - \hat{P}^{(t)}_{qr}\right) \right]$

when performing the splitting and merging steps described in the previous paragraphs. This also means that all the segments can be searched simultaneously, which speeds up computation. Algorithm 1 lays out the community assignment search procedure.

1:  Randomly assign each node to one of two communities. To speed up the initialization process, existing methods can be used to identify the two communities.
2:  Calculate the sub-MDL value using (8). Denote this value by $M^*$.
3:  while there is a drop in $M^*$ do
4:     for each node in $V$ do
5:         Switch the community assignment if the value of (8) is lowered. Update $M^*$.
6:     end for
7:  end while
8:  if no new community is found then
9:     Stop.
10:  end if
11:  while there is a drop in $M^*$ do
12:     for each community found do
13:         Repeat steps 1-6, but restricted to the nodes of that community.
14:     end for
15:     Update $M^*$.
16:  end while
17:  Merge communities until there is no drop in $M^*$.
18:  Repeat steps 11-17 until there is no drop in $M^*$.
19:  Return the community assignments $C_j$.
Algorithm 1: Community Detection for the $j$th Segment
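To illustrate the splitting step of Algorithm 1, here is a schematic sketch; the callable objective stands in for the sub-MDL value (8) evaluated at a candidate assignment and is assumed to be supplied by the user.

import numpy as np

def split_into_two(nodes, objective, seed=None, max_sweeps=50):
    # Start from a random two-way assignment and switch single nodes
    # between the two groups as long as the objective value drops.
    rng = np.random.default_rng(seed)
    nodes = list(nodes)
    assign = dict(zip(nodes, rng.integers(0, 2, size=len(nodes))))
    best = objective(assign)
    for _ in range(max_sweeps):
        improved = False
        for v in rng.permutation(nodes):
            assign[v] = 1 - assign[v]          # try switching node v
            val = objective(assign)
            if val < best:
                best, improved = val, True     # keep the switch
            else:
                assign[v] = 1 - assign[v]      # undo the switch
        if not improved:
            break
    return assign, best

The recursion over sub-communities and the subsequent merging pass then follow steps 11-18 of Algorithm 1.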

4.2 Change Point Detection

Change point detection algorithms for networks usually involve a top-down search, a bottom-up search, or an exhaustive search. An exhaustive search aims at finding the set of change point locations that minimizes the objective criterion by enumerating all possible combinations of change points. By doing so, the solution is guaranteed to be a global minimum, but the computation becomes intractable once $T$ is large (with $T$ snapshots, there are $2^{T-1}$ combinations to loop through). One can use dynamic programming to reduce the computational complexity, but one still needs to search through a large solution space before finalizing a global solution.

Both top-down and bottom-up searches are greedy algorithms, and their computation can still be demanding. For a top-down search, one starts with the entire sequence of graphs and finds the single location that minimizes the objective criterion (and yields a decrease in the criterion value). Then one finds the next location that minimizes the objective criterion (with the first location already in the model), and repeats until no further change point can be found. By doing so, one needs to go through on the order of $T$ evaluations at each iteration. A bottom-up search, on the other hand, starts by assuming each location is a change point, and merges the adjacent segments such that the objective function is minimized. This procedure is repeated until no further merge can be found.

This paper proposes a top-down search for finding the change point locations. However, instead of naively testing each location for the possibility of being a change location, a screening process is first conducted to select a set of candidate change locations. Then each candidate location (in a specific order) is checked to see whether it is a change point or not. The details of the search algorithm are described below.

The screening process is conducted as follows. First, calculate the distance between each pair of consecutive adjacency matrices. The distance used is the 1-norm of the difference between the two matrices, normalized by the geometric mean of their 1-norms:

(9)   $d_t = \dfrac{\left\| a^{(t+1)} - a^{(t)} \right\|_1}{\sqrt{\left\| a^{(t)} \right\|_1 \left\| a^{(t+1)} \right\|_1}}, \qquad t = 1, \ldots, T-1,$

where $a^{(t)}$ is the vector form of $A^{(t)}$. The idea is that if the community structure between two consecutive networks does not change, then regardless of the differences in link probabilities, the edge pattern should remain roughly the same, hence the distance should be relatively small. Therefore, a large value of $d_t$ is an indicator that there is a change in the community structure between times $t$ and $t+1$. Set the locations whose distances are above the median value of the $d_t$'s as the candidate change locations. This is equivalent to assuming that the maximum number of change points is roughly $T/2$, which is a reasonable assumption in most situations.
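A sketch of this screening step based on the distance (9) as written above; the convention that a large $d_t$ flags a candidate change point at $t+1$ (the start of the new segment) is an assumption.

import numpy as np

def candidate_change_points(adj_list):
    # d_t = ||a_{t+1} - a_t||_1 / sqrt(||a_t||_1 * ||a_{t+1}||_1)
    d = []
    for t in range(len(adj_list) - 1):
        a0 = adj_list[t].astype(float).ravel()
        a1 = adj_list[t + 1].astype(float).ravel()
        norm = np.sqrt(a0.sum() * a1.sum())      # geometric mean of 1-norms (0/1 entries)
        d.append(np.abs(a1 - a0).sum() / norm)
    d = np.array(d)
    keep = np.where(d > np.median(d))[0]          # screen at the median
    order = keep[np.argsort(d[keep])[::-1]]       # largest distance first
    return order + 1                              # candidate change point locations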

Once the candidate locations are determined, order them by their $d_t$ values from largest to smallest. Starting with the first candidate location, segment the data into two pieces, conduct the community search within each segment, and calculate the MDL value (7). If this value is smaller than the MDL value with no segmentation, set this candidate as a change location; otherwise move on to the next candidate location and repeat. Every time a change location is found, remove the location from the candidate set and reset the search procedure, with the previously selected locations kept in the estimated model. Doing this requires at most $O(T^2)$ evaluations of (7). Even though this can be large if $T$ is large, oftentimes the search procedure stops after a few iterations.

To prevent overestimating the number of change points, a merging step is conducted on the selected change points (if any). There are two cases to consider: (1) at least one change point was selected in the previous step, and (2) no change point was selected in the previous step. For case (1), merge the segments at the selected change locations, starting from the last selected change point, and recalculate the MDL value. If there is a decrease in the MDL value, keep the merge, otherwise ignore it, and move on to the next selected change point, until all estimated change points have been tested. For case (2), use the candidate locations (in reverse order) as estimated change points, and perform the merging step. One can view this as a bottom-up search strategy. Algorithm 2 lays out the change point search procedure.

1:  Calculate the consecutive distances $d_1, \ldots, d_{T-1}$ using (9) for the adjacency matrices $A^{(1)}, \ldots, A^{(T)}$.
2:  Set $S$ to the candidate locations whose distances exceed the median. Order $S$ according to the values of $d_t$, from largest to smallest. Set $\hat{\tau} = \emptyset$.
3:  Calculate the MDL value (7) when there is no change point. Denote it by $M_0$.
4:  for each $s \in S$ do
5:     Segment the network sequence at time $s$ (given change points at $\hat{\tau}$) and conduct community detection with Algorithm 1.
6:     Calculate the MDL value (7) using the segmented model. Denote it by $M_1$.
7:     if $M_1 < M_0$ then
8:         $\hat{\tau} \leftarrow \hat{\tau} \cup \{s\}$, $S \leftarrow S \setminus \{s\}$, $M_0 \leftarrow M_1$. Restart the for loop.
9:     end if
10:  end for
11:  if $\hat{\tau} = \emptyset$ then
12:     Set $\hat{\tau} = S$, in reverse order. Update $M_0$ using $\hat{\tau}$ as change points.
13:  else
14:     Set $\hat{\tau}$ to the selected change points, ordered from last selected to first.
15:  end if
16:  for each $\tau \in \hat{\tau}$ do
17:     Merge the consecutive segments at $\tau$ and conduct community detection with Algorithm 1 (given change points at $\hat{\tau} \setminus \{\tau\}$).
18:     Calculate the MDL value (7) using the merged model. Denote it by $M_1$.
19:     if $M_1 < M_0$ then
20:         $\hat{\tau} \leftarrow \hat{\tau} \setminus \{\tau\}$, $M_0 \leftarrow M_1$. Restart the for loop.
21:     end if
22:  end for
23:  Return $\hat{\tau}$ and the community structures obtained with $\hat{\tau}$ as change points.
Algorithm 2: Change Point Detection in Dynamic Networks
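Schematically, the acceptance loop of Algorithm 2 (steps 3-10) can be organized as below; fit_mdl(change_points) is a placeholder that segments the sequence at the given points, runs Algorithm 1 within each segment, and returns the MDL value (7).

def greedy_change_point_search(candidates, fit_mdl):
    # Accept a candidate whenever adding it lowers the MDL value,
    # then restart the scan over the remaining candidates.
    selected = []
    best = fit_mdl(selected)            # MDL value with no change point
    remaining = list(candidates)        # ordered by decreasing d_t
    restart = True
    while restart:
        restart = False
        for s in list(remaining):
            val = fit_mdl(sorted(selected + [s]))
            if val < best:
                selected.append(s)
                remaining.remove(s)
                best = val
                restart = True          # reset the search procedure
                break
    return sorted(selected), best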

5 Empirical Analysis

To assess the performance of the proposed methodology, multiple simulation settings are considered. An application to a real data set is also presented to showcase the practical use of the proposed method.

5.1 Simulation

This section focuses on analyzing the performance of the proposed method on synthetic data. Of the four settings compared, three involved networks generated according to the SBM discussed in Section 3.1, with each snapshot independent of the others. The last setting involved networks with correlated edges, as studied by Saldaña et al. (2017). Change point detection results were compared with the Multi-Step change point detection algorithm of Aynaud & Guillaume (2011) and the SCOUT algorithm of Hulovatyy & Milenković (2016). Publicly available implementations of both algorithms were used. Table 1 shows a summary of each setting; detailed descriptions can be found in the Appendix. Figures 1-4 show the histograms of the estimated change point locations for Settings 1 through 4, respectively. All settings were repeated for 100 trials.

Setting Correlated Edges Sparse/Dense Number of Change Points # of Nodes Per Network Remarks
1 No Dense 5 280 - 300 Networks within the same segment have same edge probabilities
2 No Dense 4 280 - 300 Each graph has different probability
3 No Sparse 4 380 - 400 Each graph has different probability
4 Yes Dense 5 380 - 400 Correlated edges; see Saldaña et al. (2017).
Table 1: Settings for simulations. Detailed descriptions can be found in the Appendix.
Figure 1: Estimated change point locations for Setting 1 over 100 trials. For SCOUT (with BIC) - using BIC to select the number of change points; (Restricted) - restricting to the known number of change points.
Figure 2: Estimated change point locations for Setting 2.
Figure 3: Estimated change point locations for Setting 3.
Figure 4: Estimated change point locations for Setting 4.

As listed in Table 1, three of the settings involved dense networks while the remaining one involved sparse networks. A network is considered dense if the intra-community edge probabilities range between 0.7 and 0.9 and the inter-community edge probabilities between 0.05 and 0.30, while a network is considered sparse if the probabilities are between 0.35 and 0.40 and between 0.05 and 0.10, respectively. Networks within the same segment in Setting 1 have the same link probabilities, while the networks in the other settings have different link probabilities even within the same segment.

From Figures 1-4, one can see that the proposed MDL method outperforms Multi-Step in almost all cases. While the proposed method overestimated the number of change points in Setting 4, Multi-Step consistently overestimated across settings. As for the SCOUT method, even though it always estimates the correct change point locations when the number of change points is known, it was not able to correctly identify the number of change points once this restriction was relaxed. As the number of change points is often unknown in real data, it is more reasonable to compare with the results of the automatic selection case (with BIC).

To also evaluate the performance of the proposed community detection algorithm, the normalized mutual information (NMI) was used. In brief, NMI is a criterion for evaluating the agreement between two clusterings. For snapshot $t$, let $x$ denote the true community assignment and $y$ the estimated one. The NMI is defined as

(10)   $\mathrm{NMI}(x, y) = \dfrac{2\, I(x, y)}{H(x) + H(y)}$

Denote by $P(q)$ the proportion of nodes in community $q$ under $x$, by $P(r)$ the proportion of nodes in community $r$ under $y$, and by $P(q, r)$ the proportion of nodes in community $q$ under $x$ and community $r$ under $y$. The quantities $H$ (entropy) and $I$ (mutual information) are then defined as

(11)   $H(x) = -\sum_{q} P(q) \log P(q)$

(12)   $I(x, y) = \sum_{q} \sum_{r} P(q, r) \log \dfrac{P(q, r)}{P(q)\, P(r)}$

The overall NMI for the sequence of networks is defined as the mean of all individual NMIs: $\overline{\mathrm{NMI}} = T^{-1} \sum_{t=1}^{T} \mathrm{NMI}_t$. Notice that NMI ranges between 0 and 1, where 0 means the estimated community structure is no better than a random guess, while 1 means it matches the truth perfectly. Table 2 presents the community detection results of the proposed algorithm, as well as the detection results from SCOUT. To guarantee that the community detection results are comparable with the truth, all estimations were restricted to the true change point locations, i.e. no change point detection was performed.

Settings 1 2 3 4
MDL 1.00 1.00 0.55 0.83
SCOUT 0.09 0.11 0.10 0.10
Table 2: Community detection results. Results show averages over 100 trials.
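For reference, the per-snapshot NMI and its average can be computed with scikit-learn; its default arithmetic-mean normalization corresponds to the form written in (10) above.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_nmi(true_labels, est_labels):
    # true_labels[t] and est_labels[t] hold one community label per node
    # for snapshot t; the overall score is the mean over snapshots.
    scores = [normalized_mutual_info_score(x, y)
              for x, y in zip(true_labels, est_labels)]
    return float(np.mean(scores))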

5.2 Data Analysis

In this application, the World Trade Web (WTW), also known as the International Trade Network (ITN), is considered. This data set is publicly available from Gleditsch (2002). In brief, it captures the trading flow between 196 countries from 1948 to 2000, recording the total amount of imports and exports between each pair of countries in each year. Several papers have been published on analyses of the WTW, including Tzekina et al. (2008), Bhattacharya et al. (2007), Bhattacharya et al. (2008) and Barigozzi et al. (2011). Since the import/export information is given, many of these analyses treated the trade network as a directed, weighted network, where the weights represent the amount of goods going from country A to country B. For this analysis, however, since the focus of this paper is on undirected networks, the data set was modified such that an edge exists between two countries if there is any trading between them. As there is data for each year from 1948 to 2000 (53 years), it is straightforward to consider this as a dynamic network. Table 3 shows a summary of the data set.

# of nodes # of edges (mean ± SD) Time span Duration
196 5736 ± 2804 53 years 1 year
Table 3: Summary of data set
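A sketch of the preprocessing used to obtain the dynamic network, assuming the raw trade flows are in a pandas DataFrame with hypothetical columns year, country_a, country_b, imports, and exports.

import numpy as np

def trade_to_dynamic_network(df, countries):
    # One symmetric binary adjacency matrix per year: an edge is present
    # between two countries if any trade (import or export) is recorded.
    index = {c: i for i, c in enumerate(sorted(countries))}
    mats = {}
    for year, grp in df.groupby("year"):
        A = np.zeros((len(index), len(index)), dtype=int)
        traded = grp[(grp["imports"] > 0) | (grp["exports"] > 0)]
        for _, row in traded.iterrows():
            i, j = index[row["country_a"]], index[row["country_b"]]
            if i != j:
                A[i, j] = A[j, i] = 1
        mats[year] = A
    return mats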

The proposed algorithm detected 5 change points in this data set. For comparison, the SCOUT algorithm (with BIC to select the number of change points) was also applied to the data set, and it detected 5 change points as well. The results are listed in Table 4 below.

Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6
MDL 1948 - 1959 1960 - 1965 1966 - 1974 1975 - 1980 1981 - 1990 1991 - 2000
SCOUT 1948 - 1961 1962 - 1972 1973 - 1980 1981 - 1990 1991 - 1992 1993 - 2000
Table 4: Segments determined by proposed algorithm and SCOUT.

Only the community assignments of the proposed method will be investigated here. Figures 5-10 show the trading communities for the six detected segments. Since multiple communities have been detected for each segment, only the top 7 largest communities will be analyzed for each time period (the top 7 communities cover a majority of the countries in most cases). For each map, blue denotes the largest community, green the second largest, then yellow, red, pink, orange, and purple, for the third to seventh largest communities, respectively. One can see that the largest community consists of all the largest nations in the world, including the US, Canada, China, Russia, and many others.

Figure 5: Segment 1: 1948 - 1959. Largest 7 communities detected using proposed algorithm. Blue denotes the largest, green the second largest, then yellow, red, pink, orange, and purple for the third to seventh largest communities, respectively.
Figure 6: Segment 2: 1960 - 1965.
Figure 7: Segment 3: 1966 - 1974.
Figure 8: Segment 4: 1975 - 1980.
Figure 9: Segment 5: 1981 - 1990.
Figure 10: Segment 6: 1991 - 2000.

The change points located by the two algorithms are similar, hence the following analysis focuses on the change points detected by the proposed method. During the first segment (1948 to 1959), a majority of the countries in Africa were not involved in any trading. Moreover, one can see that only the largest countries in the world were involved in trading communities. This can be explained by the lack of data for such an early period, as well as the fact that many countries were still in their developing phase.

Starting from the second segment (1960 to 1965), most countries in Africa started to get involved in trading, but mostly among themselves. One possible event that triggered this behavior is that, during the 1960s, many countries in Africa gained independence. During the third segment (1966 to 1974), most trading behavior in Africa remained stable. An interesting change occurred among the countries in Southeast Asia. In particular, Indonesia broke off from the large community and formed a new group with several other countries. Historically, mass killings occurred in Indonesia between 1965 and 1966 due to an anti-communist campaign, which could be part of the reason for this change. Only minor adjustments of the communities occurred in the fourth segment (1975 to 1980); in particular, Indonesia rejoined the large community. Indonesia invaded East Timor in the mid 1970s and was obtaining weapons from the US and other countries, which helps explain its merge with the largest community. Throughout the last two segments, countries in Africa and South America started to join the trade network of the large community. In the sixth segment (1991 to 2000), almost all countries in South America had joined (except for Suriname).

6 Conclusion

This paper presented a new methodology for analyzing dynamic network data. By assuming each individual network follows a Stochastic Block Model, an objective criterion based on the Minimum Description Length Principle was derived for detecting change points and community structures in dynamic networks. Simulations showed promising results for the proposed algorithm, and a data analysis confirmed that the proposed methodology is able to detect major changes.

Appendix

This section provides the details of the simulation settings.

Setting 1: Table 5 lists the specification for this setting. The sequence length is $T = 30$ for this and all following settings. The number of nodes in each snapshot ranged between 280 and 300. The community sizes were specified according to the ratios listed in the column ‘Community Size Ratio’: the ratios (1/3, 1/3, 1/3) mean there are three communities, each containing roughly 1/3 of the total nodes of the graph. The link probabilities are listed in the ‘Link Probability’ column, with $P_w$ representing the probability of an edge existing within a community, and $P_b$ the probability of an edge existing between two communities. Note the quantities satisfy the assumption $P_w > P_b$. For this setting, all networks within the same segment had the same within- and between-community link probabilities. The true segments are listed in the column ‘Time Points’.

Segment Time Points Community Size Ratio Link Probability # of Nodes
1 1 - 5 1/3, 1/3, 1/3 280 - 300
2 6 - 13 1 280 - 300
3 14 - 16 1/4, 1/4, 1/4, 1/4 280 - 300
4 17 - 22 2/3, 1/3 280 - 300
5 23 - 28 1/5, 1/5, 1/10, 3/10, 1/5 280 - 300
6 29 - 30 3/10, 2/5, 3/10 280 - 300
Table 5: Specifications for Setting 1.

Setting 2: The previous setting assumed the link probabilities remain the same within each segment. However, this is not necessarily a valid assumption for real world data. This setting therefore allows each graph to have different intra- and inter-community link probabilities. For each graph, the intra- and inter-community link probabilities were drawn independently from Uniform distributions. The rest of the specifications are listed in Table 6.

Segment Time Points Community Size Ratio # of Nodes
1 1 - 12 1/3, 1/3, 1/3 280 - 300
2 13 - 21 1/3, 2/3 280 - 300
3 22 - 22 3/4, 1/4 280 - 300
4 23 - 27 3/10, 2/5, 3/10 280 - 300
5 28 - 30 1/5, 3/10, 1/5, 3/10 280 - 300
Table 6: Specifications for Setting 2.

Setting 3: Both settings considered so far consist of dense networks. Oftentimes, however, observed networks have a sparse structure. Instead of high link probabilities, this setting used the sparse ranges described in Section 5.1 (within-community probabilities between 0.35 and 0.40, and between-community probabilities between 0.05 and 0.10). The rest of the specifications are listed in Table 7.

Segment Time Points Community Size Ratio # of Nodes
1 1 - 8 1/3, 1/3, 1/3 380 - 400
2 9 - 11 1/4, 3/4 380 - 400
3 12 - 16 1/2, 1/2 380 - 400
4 17 - 21 3/4, 1/4 380 - 400
5 22 - 30 3/10, 2/5, 3/10 380 - 400
Table 7: Specifications for Setting 3.

Setting 4: To test the robustness of the proposed method under misspecification, Setting 4 involved networks with correlated edges. Such networks have been studied by Saldaña et al. (2017), where a parameter controls the correlation between network edges. A nonzero correlation was used here, together with a dense setting. The specifications of this setting are listed in Table 8.

Segment Time Points Community Size Ratio # of Nodes
1 1 - 5 1/2, 1/2 380 - 400
2 6 - 11 1/3, 1/3, 1/3 380 - 400
3 12 - 19 3/4, 1/4 380 - 400
4 20 - 24 1/2, 1/2 380 - 400
5 24 - 25 3/4, 1/4 380 - 400
6 26 - 30 2/5, 1/5, 2/5 380 - 400
Table 8: Specifications for Setting 4.

References

  • Airoldi et al. (2008) Airoldi, E. M., Blei, D. M., Fienberg, S. E. & Xing, E. P. (2008), ‘Mixed membership stochastic blockmodels’, Journal of Machine Learning Research 9, 1981 – 2014.
  • Akaike (1974) Akaike, H. (1974), ‘A new look at the statistical model identification’, IEEE Transactions on Automatic Control 19, 716–723.
  • Aynaud & Guillaume (2011) Aynaud, T. & Guillaume, J.-L. (2011), ‘Multi-step community detection and hierarchical time segmentation in evolving networks’, Proceedings of the SNA-KDD workshop .
  • Barigozzi et al. (2011) Barigozzi, M., Fagiolo, G. & Mangioni, G. (2011), ‘Identifying the community structure of the international-trade multi-network’, Physica A: Statistical Mechanics and its Applications 390, 2051 – 2066.
  • Bhattacharya et al. (2007) Bhattacharya, K., Mukherjee, G. & Manna, S. S. (2007), ‘The international trade network’, In: Chatterjee A., Chakrabarti B.K. (eds) Econophysics of Markets and Business Networks pp. 139 – 147.
  • Bhattacharya et al. (2008) Bhattacharya, K., Mukherjee, G., Saramäki, J. & Manna, S. S. (2008), ‘The international trade network: Weighted network analysis and modeling’, Journal of Statistical Mechanics: Theory and Experiment 2008, P02002.
  • Blondel et al. (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. (2008), ‘Fast unfolding of communities in large networks’, Journal of Statistical Mechanics: Theory and Experiment 10.
  • Clauset et al. (2004) Clauset, A., Newman, M. E. J. & Moore, C. (2004), ‘Finding community structure in very large networks’, Physical Review E 70, 066111.
  • Fortunato (2010) Fortunato, S. (2010), ‘Community detection in graphs’, Physics Reports 486, 75 – 174.
  • Gleditsch (2002) Gleditsch, K. S. (2002), ‘Expanded trade and gdp data’.
    http://ksgleditsch.com/exptradegdp.html
  • Holland et al. (1983) Holland, P. W., Laskey, K. B. & Leinhardt, S. (1983), ‘Stochastic blockmodels: first steps’, Social Networks 5, 109 – 137.
  • Hulovatyy & Milenković (2016) Hulovatyy, Y. & Milenković, T. (2016), ‘Scout: simultaneous time segmentation and community detection in dynamic networks’, Scientific Reports 6.
  • Lancichinetti & Fortunato (2012) Lancichinetti, A. & Fortunato, S. (2012), ‘Consensus clustering in complex networks’, Scientific Reports 2.
  • Lee (2001) Lee, T. C. M. (2001), ‘An introduction to coding theory and the two-part minimum description length principle’, International Statistical Review 69, 169–183.
  • Newman (2004) Newman, M. E. J. (2004), ‘Detecting community structure in networks’, The European Physical Journal B 38, 321 – 330.
  • Newman (2006) Newman, M. E. J. (2006), ‘Modularity and community structure in networks’, Proc Natl Acad Sci USA 103, 8577 – 8582.
  • Peel & Clauset (2015) Peel, L. & Clauset, A. (2015), ‘Detecting change points in the large-scale structure of evolving networks’, AAAI Conference on Artificial Intelligence.
  • Priebe et al. (2005) Priebe, C. E., Conroy, J. M. & Marchette, D. J. (2005), ‘Scan statistics on enron graphs’, Computational & Mathematical Organization Theory 11, 229 – 247.
  • Rissanen (1989) Rissanen, J. (1989), Stochastic Complexity in Statistical Inquiry, World Scientific Publishing Co.
  • Rissanen (2007) Rissanen, J. (2007), Information and Complexity in Statistical Modeling, Springer.
  • Rosvall et al. (2009) Rosvall, M., Axelsson, D. & Bergstrom, C. T. (2009), ‘The map equation’, European Physical Journal Special Topics 1, 13 – 23.
  • Saldaña et al. (2017) Saldaña, D. F., Yu, Y. & Feng, Y. (2017), ‘How many communities are there?’, Journal of Computational and Graphical Statistics 1, 171 – 181.
  • Schwarz (1978) Schwarz, G. (1978), ‘Estimating the dimension of a model’, The Annals of Statistics 6, 461 – 464.
  • Sun et al. (2007) Sun, J., Yu, P. S., Papadimitriou, S. & Faloutsos, C. (2007), ‘Graphscope: Parameter-free mining of large time-evolving graphs’, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 687 – 696.
  • Tzekina et al. (2008) Tzekina, I., Danthi, K. & Rockmore, D. N. (2008), ‘Evolution of community structure in the world trade web’, The European Physical Journal B 63, 541 – 545.