Chronnet: a network-based model for spatiotemporal data analysis

04/23/2020 ∙ by Leonardo N. Ferreira, et al. ∙ 2

The amount and size of spatiotemporal data sets from different domains have been rapidly increasing in recent years, which demands the development of robust and fast methods to analyze and extract information from them. In this paper, we propose a network-based model for spatiotemporal data analysis called chronnet. It consists of dividing a geometrical space into grid cells represented by nodes connected chronologically. The main goal of this model is to represent consecutive recurrent events between cells with strong links in the network. This representation permits the use of network science and graph mining tools to extract information from spatiotemporal data. The chronnet construction process is fast, which makes it suitable for large data sets. We describe how to use our model considering artificial and real data. For this purpose, we propose an artificial spatiotemporal data set generator to show how chronnets capture not just simple statistics, but also frequent patterns, spatial changes, outliers, and spatiotemporal clusters. Additionally, we analyze a real-world data set composed of global fire detections, in which we describe the frequency of fire events, outlier fire detections, and the seasonal activity, using a single chronnet.

Introduction

Large amounts of spatiotemporal data are collected from several domains, including georeferenced climate variables, epidemic outbreaks, crime events, social media, traffic, and transportation dynamics, among many others. Analyzing and mining this kind of data is of great importance for advancing the state of the art in many scientific problems and real applications. Nevertheless, data with spatial and temporal characteristics have different properties in comparison to the relational sources studied in the classical data mining literature [1]: they present temporal or spatial dependencies, in which instances are not independent and identically distributed. This means that samples can be structurally related in some spatial regions or temporal moments. They are also non-static, i.e., the instances can change their class attribute depending on time and location. Thus, traditional data mining methods are not the ideal tools for spatiotemporal data and can lead to poor performance and poor interpretation of results [2, 1].

An emerging field, called spatiotemporal data mining (STDM), proposes novel methods and algorithms for analyzing this type of data. STDM has brought new challenges and opportunities in several domains with practical significance and applicability, in terms of novel scientific problems. Some of the main STDM tasks are clustering, anomaly detection, frequent pattern mining, relationship mining, change detection, and predictive learning [1]. Clustering problems focus on finding regions with similar spatiotemporal activities or phenomena, like crimes [3], infection locations [4], or wildfires [5]. Spatiotemporal anomalies, or outliers, are samples that deviate greatly from the natural local behavior of their neighborhood with respect to both space and time. Examples include works that analyzed burst activity in location-based social networks like Twitter [6], detected abnormal activity or behavior in surveillance videos of crowds [7], or discovered the appearance of revolving water masses (eddies) in ocean data [8]. Frequent pattern mining aims to find repeating events, sequential patterns, or motifs [9] in one or more locations. It is related to another task, relationship mining, which involves the search for correlations [10], causal interactions [11], event synchronization [12], or event coincidence [13] between pairs or groups of different regions. Change detection involves searching for significant deviations in a time series [14] or determining the spatial extent and temporal window in which a change has occurred, for example, the transformation of natural vegetation by practices such as deforestation, agricultural expansion, and urbanization [15]. Finally, the forecasting task applies models that capture the temporal relationships in one or more time series and uses them to make predictions [16].

Real-world applications produce a variety of spatiotemporal data types, which can be classified as [1]: (i) event data, a tuple with a variable associated with the observation and the precise location and time of the occurrence; (ii) trajectory data, describing the path traced by a mass or object moving in space over time; (iii) time series, a set of measurements or observations at a specific location over time; (iv) spatial maps, which are measurements or states of a set of spatial locations at a specific time; and (v) raster data, the spatial collection of time series from fixed cells in a grid. In particular, one of the challenges in STDM research is proposing novel structures to study spatiotemporal data [1]. Accordingly, recent works have employed networks as a way of representing and analyzing spatiotemporal information [5, 12, 17, 18]. The interest is justified by the benefits that a network representation provides, such as the ability to describe sub-manifolds in dimensional space and to capture dynamic and topological structures, including hierarchical structures (communities) and global or local patterns, independently of the data distribution [18, 19]. The network-based representation of spatiotemporal data allows us to discover different patterns, from local-range spatial dependencies to distant locations with high correlations, e.g., teleconnections in climate sciences [20, 12, 17].

Functional or correlation networks [21] have been widely applied to map spatiotemporal data into networks. This method connects nodes, i.e., time series from spatial grid cells, according to their similarity. Correlation networks have been applied in a diversity of areas: in Earth sciences, for studying global climate phenomena [20, 12, 17, 22, 23], such as evaluating the impact of El Niño around the world [22, 23]; in bioinformatics, for understanding gene expression [24] or brain anomalies and reorganization after accidental strokes [25]; and in finance, for finding the dynamics of stock markets [26], among others. One drawback of correlation networks appears when short time series are considered: the statistical significance of the correlations becomes questionable, which may result in spurious links in the network. Additionally, in the case of event-based data, it is common to have a high proportion of "no events", i.e., values equal to zero throughout time, which clearly affects the correlation results. Alternative methods have been proposed to construct networks [18, 19, 27], like the visibility property in time series [27], a technique that was applied to detect anomalies in annual hurricane time series [28]. However, this method produces single and static representations of the time series, discarding the spatial location.

Abe and Suzuki [29] studied seismic events and proposed constructing the network by dividing a specific geographic region into small cubic cells, where a cell becomes a node only if an earthquake happens therein. Then, nodes are linked according to the sequence of the events. In a further study, Ferreira et al. [30] performed similar analyses on earthquake data but considering time windows among seismic activities. More recently, Vega-Oliveros et al. [5] analyzed spatiotemporal wildfire events in the Amazon basin by constructing the network according to the co-occurrence of events in a grid division of the area. As a result, they presented two network versions of the data: a single directed and weighted network, and a temporal undirected network. Although the aforementioned works helped to identify valuable information in a couple of domains, the results and the proposed methods are very particular to the adopted problem and application, and they do not generalize to the variety of domains with spatiotemporal events.

Here, we provide a broad method for representing spatiotemporal events on networks, as a structure for dealing with commonly studied data mining problems, like clustering, predictive learning, pattern mining, anomaly detection, change detection, and relationship mining. The idea is to spatially divide the geometrical space into grid cells that are represented by nodes. Links are established between the nodes in chronological order, i.e., nodes are connected if successive events occur in the respective cells. Therefore, the links represent events in chronological order, and we call the resulting network, for the sake of simplicity, a chronnet. In this way, we generalize the previously reported mechanisms of connection [29, 30, 5], transforming the spatiotemporal data set into a network, as explored in other domains like text mining [31, 32] and social media [33, 34].

In this way, recurrent consecutive events are represented by strong (high weighted) links. The network constructed by this process has non-trivial characteristics that follow the original spatiotemporal event distribution, being scale-free and small-world for a power-law event distribution, Poisson for a uniform event distribution, and exponential in both cases. Chronnets allow the detection of a new type of spatiotemporal community and permit the identification of influential nodes and outliers. More importantly, the method is computationally efficient: it has linear complexity, its computation is easy to distribute, and it can be adapted to the domain, e.g., by using a different number of sequence connections, a different grid division, or time windows in the case of temporal analyses.

The construction method was tested and analyzed using artificial and real data. We propose a simple data set generator to explain how the topology of the chronnet captures different spatial and temporal characteristics. Specifically, we used the proposed data set generator to experimentally describe how local, mesoscale, and global chronnet measures can be used to extract information from spatiotemporal data. We also applied our model to a real data set composed of global fire events as a case study. The results show that our model can be used to describe the frequency of fire events, regions with outlier activity, and the fire seasonal activity using a single chronnet and complex network measures.

Chronnets: Chronological networks

Our method aims to transform a spatiotemporal data set into a network whose links represent events in chronological order, simply named chronnet. Let us consider a spatiotemporal data set composed of a time-ordered sequence of events, where each event is represented by its location and time. Note that an event may also provide more measurements as additional information about the occurrence. The method for constructing the chronnet from the data set consists of three steps. In the first step, the studied geometrical area is divided into grid cells, and each grid cell is represented by a node. The second step consists of defining a time length for the network construction. The time length divides the spatiotemporal data set into time windows. If the time length covers the whole period of the data set, the whole data set is used to generate a single network. If it is shorter, each time window is used to create a layer (snapshot) of a temporal network. The final step comprises the network connections.

Formally, a directed link is established from one node to another if two consecutive events, within a time window of a predetermined size, occur in the cells represented by the first and the second node, respectively. A threshold parameter may be considered to limit the maximum distance between linked cells. In this case, the two nodes are connected iff the distance between the two consecutive events is smaller than the threshold. The threshold parameter can be used to prevent events occurring in distant regions from being connected.

The process slides over the entire data set from the first to the last event. The link weight between two nodes is the number of consecutive events between them. Note that a link from a node to itself is a self-loop, used to count consecutive events that occur in the same cell. The main idea behind this construction process is to map recurrent consecutive events into strong (high weighted) links. The steps to construct a chronnet with a sliding window are illustrated in Fig. 1.

Figure 1: Chronnet construction method. (A) Given a time-ordered spatiotemporal data set, the first step consists of dividing the area of study into grid cells. (B) A spatial representation of the data set, using links between cells to indicate the temporal order in which the events occur within the sliding window. (C) In a chronnet, nodes represent grid cells and links stand for co-occurring events between cells. Link weights indicate the number of consecutive events observed between cells. (Color Online)

In the following, we summarize the chronnet construction process:

  1. Grid division: the spatiotemporal data is divided into grid cells. Each cell is represented by a node in a chronnet;

  2. Time window length: the whole period from the data set can be used to construct a single static network or the data set can be divided into time windows used to create layers (or snapshots) of a temporal network;

  3. Link construction: a directed link between two nodes is formed if two events co-occur within the sliding window in the cells represented by them. Self-loops are used to denote consecutive events in the same cell.
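The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: events are `(x, y, t)` tuples, the grid is square with a hypothetical `cell_size` parameter, the whole data set produces a single static network, and only directly consecutive events are linked. The function name `build_chronnet` is chosen here for illustration.

```python
from collections import Counter

def build_chronnet(events, cell_size):
    """Return directed link weights {(cell_a, cell_b): count} for (x, y, t) events."""
    # Events must be processed in chronological order.
    ordered = sorted(events, key=lambda e: e[2])
    # Step 1: grid division. Each event is mapped to the square cell containing it.
    cells = [(int(x // cell_size), int(y // cell_size)) for x, y, _ in ordered]
    # Step 2: a single static network is assumed, so all events are used.
    # Step 3: link the cell of each event to the cell of the next event.
    weights = Counter()
    for a, b in zip(cells, cells[1:]):
        weights[(a, b)] += 1  # a == b produces a self-loop
    return weights

events = [(0.5, 0.5, 0), (1.5, 0.5, 1), (0.2, 0.1, 2), (1.9, 0.9, 3), (1.1, 0.3, 4)]
net = build_chronnet(events, cell_size=1.0)
```

Here the recurrent sequence from cell `(0, 0)` to cell `(1, 0)` receives weight 2, while the two consecutive events inside cell `(1, 0)` produce a self-loop.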

Some remarks on this construction method are in order. First, it has one parameter, the sliding window size, and one domain decision, the grid division, which sets the scale of coarse-graining. Since there is no a priori rule to determine the grid size, it is a decision related to the application domain. Increasing the resolution affects the final network size and sparsity. On the other hand, lowering the resolution helps to remove redundancies in the observations, but may also result in a loss of spatial information. The same happens with the sliding window: the larger the window, the more events are temporally aggregated and the denser the network. Both the sliding window and the grid size are primary decisions, independent of whether a temporal or a single-network analysis is performed.

Chronnets are by definition directed weighted graphs. In this paper, for simplicity and without loss of generality, we focus on analyzing undirected weighted chronnets constructed from consecutive events without distance restriction. In an undirected chronnet, link weights represent the total number of events that co-occurred between two cells independently of which node appears first in time, i.e., the sum of in- and out-link weights. In the rest of this paper, we use the term chronnet to refer to an undirected weighted network.
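The conversion from a directed to an undirected chronnet amounts to adding the weights of the two opposite links between each pair of nodes. The sketch below assumes the directed network is stored as a dictionary of `(source, target)` pairs, a representation chosen here for illustration; self-loops are kept with their original weight.

```python
from collections import Counter

def to_undirected(directed_weights):
    """Sum the weights of opposite directed links into a single undirected link."""
    undirected = Counter()
    for (a, b), w in directed_weights.items():
        key = (a, b) if a <= b else (b, a)  # canonical ordering of the node pair
        undirected[key] += w
    return undirected

directed = {("A", "B"): 2, ("B", "A"): 1, ("B", "B"): 3, ("C", "A"): 4}
undirected = to_undirected(directed)
```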

For simplicity, we consider in this paper that the data set does not have parallel (same timestamp) events. However, this is not a limitation of the model. Parallel events can be treated by establishing links between all the combinations of different nodes with events in consecutive timestamps. Considering a time window and two sets of nodes whose cells have parallel events at two consecutive timestamps, a link is established between all the combinations of different nodes between both sets. We also consider only the two spatial coordinates and the time of the events in the construction method, but we emphasize that our proposed approach is not limited to these three variables. It can be extended to more variables by applying the same grid division process to them.
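One possible reading of the treatment of parallel events is sketched below. The assumptions are ours: events are already mapped to cells and given as `(cell, t)` pairs, repeated events of the same cell at the same timestamp are collapsed, and pairs formed by the same node are skipped because the text links combinations of different nodes.

```python
from collections import Counter
from itertools import groupby, product

def build_with_parallel_events(events):
    """Link every cell active at one timestamp to every different cell active at the next."""
    ordered = sorted(events, key=lambda e: e[1])
    # One set of active cells per distinct timestamp, in chronological order.
    groups = [{cell for cell, _ in grp}
              for _, grp in groupby(ordered, key=lambda e: e[1])]
    weights = Counter()
    for earlier, later in zip(groups, groups[1:]):
        for a, b in product(earlier, later):
            if a != b:  # only combinations of different nodes are linked
                weights[(a, b)] += 1
    return weights

events = [("A", 0), ("B", 0), ("C", 1), ("A", 1), ("B", 2)]
links = build_with_parallel_events(events)
```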

Weak links in chronnets represent sporadic consecutive events between two cells that might not represent temporal patterns. These links may appear as a result of different factors. One common reason is inaccurate or wrong records generated by distinct problems in the data gathering process [35], leading to spurious links in the chronnet. To minimize these problems, we consider an optional pruning step that consists of removing links whose weights are equal to or lower than a pruning threshold. Pruning weak links removes some of these wrong records from the analyses and can be used as an outlier removal step. This process can also be used to transform weighted chronnets into unweighted ones. In this case, only those links whose weights are higher than the threshold remain in the network.
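The pruning step is simple enough to express directly. In this sketch the chronnet is assumed to be a dictionary from node pairs to weights, and the function names are illustrative.

```python
def prune(weights, threshold):
    """Remove links whose weight is equal to or lower than the threshold."""
    return {link: w for link, w in weights.items() if w > threshold}

def to_unweighted(weights, threshold):
    """Keep only the links that survive pruning and drop their weights."""
    return set(prune(weights, threshold))

weights = {("A", "B"): 7, ("B", "C"): 1, ("A", "C"): 2, ("C", "C"): 3}
pruned = prune(weights, threshold=2)
unweighted = to_unweighted(weights, threshold=2)
```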

An advantage of the construction process is its low time complexity. The time complexity for finding the cells in which the events occur (grid division) is linear in the number of events if a square grid is considered. Assuming no parallel events and that the events were recorded in chronological order (sorted) in the data set, the time complexity for constructing the chronnet is also linear in the total number of events. If the data set is unsorted, the time complexity is bounded by the time complexity of the sorting algorithm. Even in the unsorted case, the time complexity is low, considering the fast sorting algorithms available. This process can be easily parallelized by breaking the sorted data set into smaller data sets, constructing smaller chronnets, and merging them [36]. The parallelization can considerably speed up the construction process. Therefore, our method is suitable for modeling large-scale spatiotemporal data sets.
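The split-and-merge idea can be illustrated as follows. This is a sketch under our own assumptions rather than the procedure of [36]: the input is the chronologically sorted list of cells, chunks overlap by one event so that the link crossing each boundary is not lost, and the chunks are processed in a plain loop, although each call to `count_links` is independent and could run on a separate worker.

```python
from collections import Counter

def count_links(cells):
    """Count links between consecutive cells in one sorted chunk."""
    return Counter(zip(cells, cells[1:]))

def build_in_chunks(cells, chunk_size):
    """Build partial chronnets per chunk and merge them by adding link weights."""
    total = Counter()
    for start in range(0, len(cells) - 1, chunk_size):
        # The extra event at the end of the slice keeps the boundary link.
        chunk = cells[start:start + chunk_size + 1]
        total += count_links(chunk)
    return total

cells = ["A", "B", "A", "B", "B", "C", "A", "B"]
merged = build_in_chunks(cells, chunk_size=3)
```

The merged result equals the network built in a single pass, which is the property that makes the parallel construction safe.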

Network characterization

Degree and strength

The degree of a node in a chronnet represents the number of other cells whose events occur consecutively in time with respect to the cell of that node. Higher degree nodes (hubs) represent cells with relatively longer periods of activity occurring simultaneously with many other cells. The activity period in a hub cell should be sufficiently longer than the others to allow the connection to several other cells.

The strength is the sum of the weights of the links of a node and represents how frequently its events happened after or before those of any neighbor. A high node strength indicates a recurrent link between this node and at least one of its neighbors. High strength accounts for a higher number of events.

The degree distribution $P(k)$ is the fraction of nodes with the same degree $k$, which can be seen, in the thermodynamic limit, as the probability of randomly selecting a node with degree $k$ [37]. Recalling that in chronnets an edge represents consecutive events between two cells, the degree distribution represents how spatially distributed the events are over time. The more uniform the degree distribution, the more spatially homogeneous the occurrence of events. When the chronnet exhibits a heavy-tailed degree distribution, e.g., following a power law of the form $P(k) \sim k^{-\gamma}$, most of the cells have a low degree, i.e., they present few events that co-occurred with other nodes. On the other hand, a few cells have very high degrees (hubs), which means that these nodes present a larger number of spatiotemporal events that co-occurred with many other nodes of the grid. The interpretations are similar for the strength distribution, but considering the frequency or recurrence with which a link appears between two nodes.
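Degree, strength, and the degree distribution can be computed directly from the link weights. The sketch below assumes an undirected chronnet stored with one entry per link; as a choice made here, self-loops are excluded from the degree (which counts other cells) and added once to the strength.

```python
from collections import Counter, defaultdict

def degree_and_strength(weights):
    """Return (degree, strength) dictionaries for an undirected weighted network."""
    neighbors = defaultdict(set)
    strength = Counter()
    for (a, b), w in weights.items():
        strength[a] += w
        if a != b:
            strength[b] += w
            neighbors[a].add(b)
            neighbors[b].add(a)
    degree = {node: len(neighbors[node]) for node in strength}
    return degree, dict(strength)

def degree_distribution(degree):
    """Fraction of nodes having each degree value."""
    counts = Counter(degree.values())
    return {k: c / len(degree) for k, c in counts.items()}

weights = {("A", "B"): 5, ("A", "C"): 1, ("A", "D"): 1, ("B", "B"): 2}
degree, strength = degree_and_strength(weights)
distribution = degree_distribution(degree)
```

In this toy network node A is the hub (degree 3), while B has degree 1 but the same strength as A because of its recurrent link and self-loop.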

Paths

A path length in chronnets represents the number of consecutive events in a route between two nodes, i.e., how many consecutive historical events separate two cells. The simplest case is a distance of one between two nodes, meaning that both nodes were activated consecutively at some point. A distance of two means that an intermediate, third node was necessary for a consecutive cascade of events to occur between the two nodes, and so on for longer paths. Therefore, the shortest path distance is the length of the minimum sequence of historical cascade activations that separates two nodes.

The average shortest path corresponds to the expected shortest path distance for an event activation to occur between any pair of nodes. The diameter, in turn, is the longest of the shortest path distances, i.e., the longest cascade of events that historically co-occurred between nodes in the chronnet.
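Both quantities follow from a breadth-first search started at every node. The sketch assumes a connected, undirected chronnet given as an adjacency dictionary and measures distances in number of links, ignoring weights.

```python
from collections import deque

def bfs_distances(adj, source):
    """Number of links on the shortest path from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adj):
    """Return (average shortest path, diameter) over all ordered node pairs."""
    lengths = [d for s in adj
               for t, d in bfs_distances(adj, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
average, diameter = path_statistics(adj)
```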

Centrality measures

Centrality measures quantify how influential or central nodes are, which can be defined in many ways [38]. It is important to mention that many centrality measures do not take into account the link weights. These measures are inappropriate for chronnets since they consider that strong links (recurrent events) and weak links have the same influence. Therefore, two approaches can be considered: to use centrality measures for weighted graphs [39], or to prune low weight links and then apply centrality measures for unweighted networks [38].

One of the most straightforward definitions, the degree centrality, assumes that the most relevant nodes are those with the highest degrees. Considering a node $v_i$ and its degree $k_i$, the degree centrality is $C_D(v_i) = k_i$. As previously discussed, high degree nodes in chronnets represent cells whose events occur consecutively to those of many other cells. Another common measure is the betweenness centrality, which quantifies how many times a node is traversed when considering the shortest paths between all node combinations in the network. It is defined as $C_B(v_i) = \sum_{j \neq i \neq k} \sigma_{jk}(v_i) / \sigma_{jk}$, where $\sigma_{jk}(v_i)$ is the number of shortest paths between nodes $v_j$ and $v_k$ that pass through $v_i$ and $\sigma_{jk}$ is the total number of shortest paths between $v_j$ and $v_k$. The betweenness centrality can detect nodes connecting communities (densely connected nodes with few connections between groups). The closeness centrality is defined as the inverse of the sum of distances between a node $v_i$ and all the other nodes $v_j$, i.e., $C_C(v_i) = 1 / \sum_{j \neq i} d(v_i, v_j)$. In chronnets, the closeness centrality can be used to find those nodes whose consecutive events occur with fewer time steps to all the other ones. Some centrality measures were also extended to weighted graphs [39]. The closeness centrality, for example, takes the weights into account when calculating the distance between nodes.
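As an illustration of a weighted centrality, the sketch below computes closeness with Dijkstra's algorithm. The choice of link length is an assumption made here, not taken from the text: each link has length `1 / weight`, so that strong (recurrent) links bring nodes closer together. The network is assumed to be connected and stored as nested dictionaries.

```python
import heapq

def dijkstra(adj, source):
    """Shortest weighted distances from source, with link length 1 / weight."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            candidate = d + 1.0 / w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist

def closeness(adj):
    """Inverse of the sum of distances from each node to all the others."""
    return {u: 1.0 / sum(d for v, d in dijkstra(adj, u).items() if v != u)
            for u in adj}

adj = {"A": {"B": 4, "C": 1}, "B": {"A": 4, "C": 2}, "C": {"A": 1, "B": 2}}
scores = closeness(adj)
```

In this example the weak direct link between A and C (length 1.0) is bypassed by the path through B (length 0.75), which makes B the most central node.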

Transitivity and community structure

Transitivity accounts for the proportion of cycles of order three on the network, which is a common property found in real-world networks [37]. In chronnets, a triangle means a recurrent or reinforcement pattern of cycle activation between three nodes. If the events occur at random positions, we have a very low transitivity coefficient in the chronnet. In the case of weighted and directed networks, it can represent strong causal or reinforcement behavior, e.g., a modus operandi, or seasonal fire progression [5].
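A minimal way to compute the transitivity is to count, over all nodes, the pairs of neighbors (connected triples) and the fraction of those pairs that are themselves linked. The sketch assumes an unweighted, undirected network without self-loops, given as a dictionary of neighbor sets.

```python
from itertools import combinations

def transitivity(adj):
    """Fraction of connected triples that are closed into triangles."""
    closed = 0
    triples = 0
    for node, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            triples += 1
            if v in adj[u]:
                closed += 1  # each triangle is counted once per corner
    return closed / triples if triples else 0.0

adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
value = transitivity(adj)
```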

A common pattern in spatiotemporal data sets is the cluster structure [40, 1]. In networks, the analogous structure is called communities, which are groups of densely connected nodes that have few connections between groups (also called partitions) [41]. The nodes in a chronnet that correspond to an active region (multiple events) tend to be connected, forming a community. If the region of activity changes to another region, another community emerges. Thus, communities in chronnets represent groups of cells (regions) whose events appear consecutively in an interval.

Spatiotemporal data mining

We present methodologies to apply chronnets in some spatiotemporal data mining tasks [1].

Frequent pattern and relationship mining

The chronnet construction method itself is a form of detecting frequent patterns. The method concentrates on a pattern: consecutive events between different locations. This pattern is represented by strong links that indicate repeating consecutive events between different regions. Since we consider different regions, it can also be used for spatiotemporal relationship mining. Strong relationships in terms of consecutive events can be studied by pruning low-weight links or focusing on the strongest links in the chronnet.

Outlier detection

In a chronnet, cells with long periods of activity are mapped to high degree nodes, and cells with a high number of events are mapped to high strength nodes. These nodes represent cells whose activity is much higher than that of the other ones and can, therefore, be treated as outliers. The degree and/or strength can be used to detect these outlier nodes, which may be removed from the network when convenient.

Clustering

In data sets with spatiotemporal clusters, the events in a cluster tend to generate strong links between the nodes representing the geometrical region where the events occur, resulting in community structures in chronnets. Different community detection methods have been proposed to find a single partition or a hierarchical clustering structure, commonly represented as a dendrogram [42, 41]. Hierarchical community structures are particularly interesting since they allow an arbitrary number of communities to be defined. Other advantages are the low time complexity and the reduced or absent number of algorithm parameters, making them suitable for large data sets.

After the communities are detected, they can be used to find spatiotemporal clusters in the data set. Given a data set composed of events, the cluster of each event is defined as the community of the node that represents the cell where the event occurred. This process results in a temporal sequence of communities defining the spatiotemporal clustering of the events. Hierarchical community detection algorithms can be used to divide the nodes into an arbitrary number of communities, making it possible to divide the space into different numbers of regions. An advantage of community detection algorithms over traditional clustering algorithms comes from their partition strategy, which normally considers macro, mesoscale, and micro characteristics of the network [41, 43]. This strategy tends to find clusters of different forms and sizes, which might provide better clustering results.
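The mapping from communities to event clusters is a simple lookup. In the sketch below, `community_of` stands for the output of any community detection method and is a hypothetical assignment written by hand for illustration.

```python
def event_clusters(event_cells, community_of):
    """Map each event, given by its cell, to the community of that cell's node."""
    return [community_of[cell] for cell in event_cells]

# Hypothetical result of a community detection method: node (cell) -> community id.
community_of = {"A": 0, "B": 0, "C": 1, "D": 1}
# Cells of the events in chronological order.
event_cells = ["A", "B", "A", "C", "D", "C", "B"]
clusters = event_clusters(event_cells, community_of)
```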

Outliers in this sequence can be treated by applying a correction function to each element, parameterized by an odd natural number that defines a time window size. This function assigns to every element the value found in the time window centered on that element iff all the values in the window are the same, i.e., if the number of unique elements in the window is one.
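One way to read the correction function is sketched below. The interpretation is ours: the window is centered on the element but excludes it, so an isolated label surrounded by identical labels is replaced by them; elements too close to the borders to have a full window are left unchanged, and a window size of at least 3 is required.

```python
def correct_outliers(labels, w):
    """Replace a label by its neighbors' label when all neighbors in the window agree."""
    if w < 3 or w % 2 == 0:
        raise ValueError("w must be an odd number greater than or equal to 3")
    labels = list(labels)
    half = w // 2
    corrected = list(labels)
    for t in range(half, len(labels) - half):
        # Neighbors on both sides of position t, excluding the element itself.
        window = labels[t - half:t] + labels[t + 1:t + half + 1]
        if len(set(window)) == 1:
            corrected[t] = window[0]
    return corrected

labels = [1, 1, 2, 1, 1, 2, 2, 2]
smoothed = correct_outliers(labels, w=3)
```

The isolated label at position 2 is replaced, while the genuine transition between positions 4 and 5 is preserved.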

Change detection

The community structure can also be used to find temporal changes in spatiotemporal data sets composed of spatial clusters. Temporal changes can be detected by tracking alterations in the communities where the events occur over time. Following the same approach described for clustering, we consider a data set composed of events and the respective time series of communities, which gives the community of the node where each event occurs. The time intervals in the time series where the events persist in a specific community correspond to a particular region with a considerably high number of events. When the community changes, it represents a time point where the events start to appear in another region, indicating a spatial change. Therefore, a time index in the data set is considered a change point if the community at that index differs from the community of the previous event.
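Under the reading that a change point is an index whose community differs from the previous one, detection reduces to a single pass over the community time series. The function name is illustrative.

```python
def change_points(labels):
    """Indices where the community differs from that of the previous event."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

community_series = [1, 1, 1, 2, 2, 3]
changes = change_points(community_series)
```

In practice the series would first be smoothed with the correction function described for clustering, so that isolated outlier labels do not produce spurious change points.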

Experimental results

Settings

To analyze and test our method, we use three kinds of data: an artificial data set generator, dynamical systems, and a real-world data application. The artificial data sets were constructed with specific characteristics to show the potential of the method and the real data set is a case study where we show different known phenomena via the proposed model. The three approaches are described in the following.

  • Artificial data set generator: We propose a simple spatiotemporal data set generator composed of a square grid of columns and rows. A probability matrix, which has the same size as the grid, defines the likelihood that each cell generates an event at each time step, up to an arbitrary time length. Different data sets can be generated by modifying the three parameters: the grid size, the probability matrix, and the time length. Figure 2 exemplifies the data generation process.

    Figure 2: Spatiotemporal event generator. An example of a data set constructed considering a grid, an exponential probability distribution (cells’ colors), and a given time length. Each grid represents all the generated events (black dots) until that time. (Color Online)
  • Dynamical systems: We used data extracted from the Lorenz and Rössler equations using well-known combinations of parameters that generate chaotic trajectories in both equations [44]. The data set was constructed by sampling the values in the trajectories considering a fixed time step.

  • Real data set: We use a global fire detection data set constructed with data from the Moderate Resolution Imaging Spectroradiometer (MODIS) that runs on board NASA’s Terra and Aqua satellites [45]. Specifically, we used the Global Fire Location Product (MCD14ML) Collection 6 from 2003 to 2018 [46]. This data set is composed of global active fire detections with geographic location, date, confidence, type, and additional information for each fire detected by the MODIS sensors. We consider only those fire records with confidence higher than 75%. We opt for dividing the globe with a hexagonal grid due to its advantages over a traditional rectangular longitude-latitude one [47]. Hexagonal grids generate cells with a more uniform coverage area and avoid distortions. In our experiments, we used 21872 hexagonal grid cells of approximately 23322 km² each to cover the whole globe. Since many cells cover regions without fire, e.g., the oceans or poles, we discard them, resulting in 5467 cells with at least one fire detection in the historical period. The “type” column in the data set defines whether a detection is a vegetation fire or an outlier generated by active volcanoes, offshore sources, or other static land sources. In our experiments, we used all the types to test whether our method can detect these outlier activities. The data set is freely accessible [48].
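The artificial generator described in the first item above can be sketched as follows. The details are assumptions made for illustration: at every time step each cell independently generates an event with the probability stored in the matrix, events are `(row, column, t)` tuples, and events generated in the same time step are emitted in random order. The `seed` parameter is added only for reproducibility.

```python
import random

def generate_events(prob_matrix, time_length, seed=None):
    """Generate (row, col, t) events from a per-cell probability matrix."""
    rng = random.Random(seed)
    events = []
    for t in range(time_length):
        # Each cell fires independently with its own probability.
        step = [(row, col, t)
                for row, probs in enumerate(prob_matrix)
                for col, p in enumerate(probs)
                if rng.random() < p]
        rng.shuffle(step)  # arbitrary order for events in the same time step
        events.extend(step)
    return events

prob_matrix = [[1.0, 0.0],
               [0.0, 0.5]]
events = generate_events(prob_matrix, time_length=100, seed=42)
```

Changing the matrix to a uniform, power-law, or exponential profile yields the three kinds of data sets discussed in the experiments.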

Artificial data sets and dynamical systems

In this section, we construct data sets to generate static chronnets, i.e., the whole data set is used to construct a single chronnet. We employ this single network to characterize the whole historical activity using network theory. We provide interpretations for the network measures in the context of chronnets and show how these measures can be used to extract information from spatiotemporal data.

Figure 3 presents three spatiotemporal data sets generated using the approach described in Fig. 2. Each data set was constructed considering a different probability distribution: uniform, power-law, and exponential. For each of them, we also present the undirected chronnets constructed and their degree distributions. The uniform probability distribution generates a Gaussian degree distribution, the power-law distribution creates a scale-free [37] chronnet, and the exponential probability distribution generates a chronnet with an exponential degree distribution. Scale-free networks are characterized by the presence of many low degree nodes and just a few high degree nodes (hubs). Exponential degree distributions have a fast decay, which makes the hubs less numerous than would be expected in a scale-free network. The presence of hubs reveals interesting features that can be explored. In summary, the degree distribution of a chronnet captures the probability distribution of event generation.

Figure 3: Chronnets constructed with different spatiotemporal data sets generated with three probability distributions: (A) uniform, (D) power-law, and (G) exponential. For simplicity purposes, we plot part of the data set. For each distribution, we illustrate the respective networks (B, E, and H) and degree distributions (C, F, and I). (Color Online)

Figure 4 illustrates an artificial data set generated with a probability distribution following a power law. In this case, the maximum probability and the total time of generation are higher than in the previous case (Fig. 3-D). When the maximum probability and/or the total time are high enough, the resulting data set has a large number of events and the resulting chronnet tends to be highly connected. In these situations, the node strength provides more information than the node degree. In Fig. 4-B, we show that the strength distribution can capture the power-law probability. From the strength distribution, it is possible to observe that chronnets have weak nodes. In Fig. 4-C, we show the influence of pruning weak links on the topology of the chronnet, considering the same artificial data set from Fig. 4. As previously described, the resulting chronnet for this data set is densely connected, which does not bring much information. However, it is possible to reveal the original probability distribution by pruning weak links from the network. The pruned chronnet has a degree distribution that follows a power law, similar to the probability distribution used to construct the data set.

Figure 4: Data sets with a high probability of events generation and/or long periods might result in densely connected chronnets. (A) An example of a data set with a power-law distribution of event generation that creates a highly connected network. In these cases, the strength distribution of the resulting chronnets (B) provides a better description of the network. Another possibility is to consider a pruning threshold () to remove weak links and to analyze the degree distribution. (C) Shows the degree distributions for 1, 2, 5, and 9 represented by different colors and marks. The blue lines (B and C) represent the power-law fitting. In both cases (B and C), it is possible to observe the original power-law distribution from the original data set (A). (Color Online)

The pruning process has a strong influence on the network. The edge density () decays fast even with small values of . For example, in Fig. 4 when , only 6% of the original links remain. This fast decay reflects the strength distribution, which shows that the great majority of nodes have low strength, so even a very low threshold removes many links. Therefore, the pruning threshold cannot be high; otherwise, it will remove too many links and might disconnect the network. When the pruning threshold is low, it is possible to remove weak links but still preserve part of the strongest links and the nodes in the largest component. These results show, for a specific data set, how the pruning affects the chronnet and how it can be used to reveal interesting topological features. Other data sets may respond differently to the pruning process, which means that the pruning should be analyzed case by case.
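The effect of pruning on edge density can be checked with a few lines of code. The sketch below assumes the chronnet is stored as a dictionary mapping undirected links to weights. Both function names are hypothetical, and the choice of removing links whose weight is less than or equal to the threshold (rather than strictly less) is an assumption.

```python
def prune_weak_links(weights, threshold):
    """Keep only links with weight strictly greater than `threshold`."""
    return {edge: w for edge, w in weights.items() if w > threshold}


def edge_density(weights, n_nodes):
    """Fraction of the possible undirected links (no self-loops) that exist."""
    n_edges = sum(1 for a, b in weights if a != b)
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy chronnet with four nodes and four weighted links.
toy = {("a", "b"): 1, ("a", "c"): 5, ("b", "c"): 2, ("c", "d"): 9}
for tau in (0, 1, 2, 5):
    pruned = prune_weak_links(toy, tau)
    print(tau, round(edge_density(pruned, n_nodes=4), 3))
```

Sweeping the threshold in this way shows how quickly the density falls for a given data set, which is the kind of per-data-set analysis suggested above.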

Important nodes in chronnets can be detected by applying network centrality measures [38]. In Fig. 5, we present four data sets with some specific temporal characteristics. The first data set was constructed using the power-law probability distribution illustrated in Fig. 3-D. The resulting chronnet (5-E) shows that the cells with higher event generation probability () tend to generate a high number of connections. These hubs can be captured by the degree centrality. The second data set (5-B) was constructed by generating Gaussian distributions in two fixed positions ( and ) alternated in time, forming spatiotemporal clusters. The resulting chronnet (5-F) is composed of two communities. By definition, communities have many intra-community links but few inter-community connections. This feature makes the betweenness higher in links and nodes that connect communities. Therefore, the betweenness centrality can be used to detect nodes connecting communities. The third and fourth data sets (5-C and D) were constructed by sampling two coordinates of the Lorenz and Rössler systems, respectively, in chaotic regimes. The resulting chronnet for the Lorenz equations (5-G) has a topology that reproduces the characteristic two-loop form of the trajectory connected by a single node. This node has the shortest average distance to the other nodes, which indicates that the represented cell has the shortest temporal distance to all the other cells when considering all the combinations of cells. This central node has the highest closeness centrality. The chronnet for the Rössler system (5-H) also reproduces in its topology some characteristics of the trajectory. Most of the time, it oscillates in the -dimension, leading to stronger links between the nodes that represent these cells. Nodes with stronger links have higher closeness centrality when the weights are considered [39], a measure that can be used to find these nodes.
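As an illustration of the closeness argument for the Lorenz chronnet, the sketch below computes unweighted closeness centrality with a breadth-first search on a toy graph made of two loops that share one node. The toy graph and the function name are assumptions for this example. The weighted variant of [39] is not reproduced here.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Unweighted closeness: (reachable nodes) / (sum of distances to them).

    `adjacency` maps each node to the set of its neighbours in an
    undirected graph.
    """
    result = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Two triangles (loops) joined by node 0, loosely mimicking the two-loop shape.
two_loops = {
    0: {1, 2, 3, 4},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {0, 3},
}
scores = closeness_centrality(two_loops)
central = max(scores, key=scores.get)  # node 0, the junction of the two loops
```

In this toy graph the junction node is at distance 1 from every other node, so it receives the highest closeness, which mirrors the role of the single connecting node described above.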

Figure 5: Node centrality in chronnets. The upper panels (A-D) illustrate the data sets used to construct the chronnets illustrated below them (E-H), respectively. Only parts of the data sets are shown for illustrative reasons. (A) Data set constructed considering the power-law probability illustrated in Fig. 3-D and . (B) Data set constructed by generating 40 Gaussians ( ) that alternate in two different locations (). (C) Data set constructed by sampling ( and ) the and coordinates of one realization of the Lorenz system [44] with parameters = 10, = 8/3, and = 28, resulting in a chaotic trajectory. (D) Data set constructed by sampling ( and ) the and coordinates of one realization of the Rössler system [44] with parameters = 0.2, = 0.2, and = 5.7, also resulting in a chaotic trajectory. The chronnets were constructed considering grid sizes (E and F) , (G) , and (H) . Weak links were pruned from all the chronnets except for the Rössler system (H) considering (E and F) and (G) . The node sizes and colors represent the respective centrality measure in the legend below each network. (Color online)

In Fig. 6, we illustrate the community detection process. First, we generate a spatiotemporal data set composed of four different periods. Different event probability distributions () were arbitrarily chosen in such a way that they create four temporal clusters (Fig. 6-B) but also generate outlier events with a smaller probability. We constructed the chronnet and applied the Fast greedy algorithm [42] to find hierarchical community structures (Fig. 6-C). The best partition (best modularity) for the chronnet is formed by four clusters (Fig. 6-D), which correspond exactly to the four periods used during the data set construction. By choosing the number of communities in the dendrogram, it is possible to use chronnets to cluster a spatiotemporal data set into distinct spatial regions. For example, the same chronnet can be divided into two communities by cutting the dendrogram into two clusters. This partition splits the time into two intervals that correspond to the first two and the last two periods.

Figure 6: Community structure in chronnets. (A) We propose an artificial data set constructed by changing the event probability distribution in four intervals with total time . Each of the four small grids illustrates the probabilities and the generated events in those intervals. A few outlier events were generated, represented by points outside the region with higher probability. (B) The resulting data set is obtained by merging the events generated during the four periods, represented by colors. (C) Dendrogram obtained by the Fast greedy algorithm [42]. Different colors represent the four communities achieved with the best modularity division. (D) The resulting chronnet is presented in a grid layout that corresponds to the same positions in the data set grid. (E) Spatiotemporal clustering using ST-DBSCAN [49] where “” marks events that were incorrectly clustered. (Color Online)

We applied our spatiotemporal data clustering method to the artificial data set presented in Fig. 6 and compared it with the clustering result of ST-DBSCAN, a well-known clustering algorithm for spatiotemporal data [49]. Since the data set has a few outlier events, we considered a small threshold for our method. For ST-DBSCAN, we tested all the combinations of the three parameters: {0, 10, 20, ..., 3000}, {0, 10, 20, ..., 3000}, and {0, 5, 10, ..., 200}. Our results show that the proposed method can correctly cluster all the points, while ST-DBSCAN achieves a maximum adjusted Rand index of 0.94. As illustrated in Fig. 6-E, the ST-DBSCAN algorithm mainly fails to cluster the outlier events. This result shows that the proposed method can successfully find temporal clusters.
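For readers unfamiliar with the adjusted Rand index used in this comparison, the following sketch computes it from two label sequences using the standard pair-counting formula. The function name is hypothetical, and the code assumes at least two labeled points. It is shown only to clarify the metric and is not the evaluation code used in the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same points.

    A value of 1 means identical partitions (up to label renaming) and
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_true)
    # Pairs of points grouped together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of points grouped together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns 1.0 because the two partitions differ only in label names.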

Figure 7: Detecting spatial changes with chronnets. (A) Artificial data set composed of groups of points sampled from Gaussian distributions in three spatial locations (point color) alternating in time. (B) The resulting chronnet is composed of three communities (node colors) obtained by the Fast greedy community detection [42]. (C) A time series with the node community where each event from the data set is located. (Color online)

Spatial changes can also be detected using the community structure in chronnets. As described in the Methods section, change points are defined as changes in the communities where the events occur in time. In Fig. 7, we illustrate this process using an artificial data set composed of groups of points forming spatiotemporal Gaussians that alternate in time. The resulting chronnet (7-B) is composed of three communities that represent the three spatial regions where the events occur. For every event in the data set, we construct a time series (7-C) with the node community where each event is located. Step changes in this time series represent spatial changes in the data set.
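The step-change rule described above is simple enough to sketch directly. The function below assumes that each event has already been mapped to the community of its cell and that the events are sorted by time. The function name and the toy labels are illustrative only.

```python
def spatial_change_points(times, communities):
    """Return the times at which the community label changes.

    `times` and `communities` are parallel sequences sorted by time,
    where `communities[i]` is the community of the cell of event i.
    """
    return [
        t
        for prev, cur, t in zip(communities, communities[1:], times[1:])
        if cur != prev
    ]


# Events alternate between three regions, labeled by community.
times = [0, 1, 2, 3, 4, 5]
labels = ["A", "A", "B", "B", "C", "A"]
changes = spatial_change_points(times, labels)  # [2, 4, 5]
```

Each returned time marks a step in the community time series, which corresponds to a spatial change in the data set.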

Real case application: global fire spots

Here, we apply chronnets to analyze a real spatiotemporal data set composed of fire detections. As in the previous analysis with the artificial data sets, we use a single chronnet and network measures to describe the historical fire activity (2003 - 2018). To remove some weak links, we considered a small pruning threshold, which keeps 32% of the original links.

In Fig. 8, we present the degree and strength distributions for the resulting chronnet. High-degree nodes account for cells whose fire activity occurs consecutively to that of many other cells, while the strength captures the number of events. Here, the strength does not capture the total number of events since links were pruned. In general, regions with high degrees tend to have high strengths. Both measures show the regions with higher fire activity, e.g., Brazil, large parts of Africa, China, and northern Australia. This pattern was observed in previous studies [50, 16].

Figure 8: The (A) strength and (B) degree for the chronnet constructed with the wildfire data set (2003 - 2018). A small pruning threshold was considered. The red circles (A) mark the 2% of higher degree nodes. The inner plots show the probability distributions. Red and blue lines illustrate the fit of power-law and log-normal distributions, respectively [51]. (Color Online)

The degree distribution follows a power-law () after a low-degree saturation. Power-law degree distributions are characterized by the presence of many low-degree nodes and a few high-degree nodes (hubs). The degree exponents show that the distributions decay fast enough to make the hubs less numerous than would be expected in scale-free networks [37]. Therefore, we cannot affirm that the network is scale-free, but the presence of hubs reveals interesting features that can be explored. In the fire activity data set, hubs represent cells with periods of fire activity much longer than the other ones. In some cases, these uncommonly long activities are due to false wildfire detections, as in regions with hot bare soils [52]. In other cases, hubs represent volcanoes or gas flares from oil and gas exploration. The outlier cells found by our method correspond to the cells with the highest number of fire detections marked as outliers in the original data set. This result is expected and shows the accuracy of the method.

The strength distribution can be partially fitted as a log-normal ( = 7.21 and = 1.79) with a power-law ( = 4.20) tail. The distribution shows that there is a high number of nodes with low strength ( ), which mainly represent cells in medium and high latitudes (e.g., the Patagonia region and the Nordic countries) with very low fire activity. After this low-strength saturation, the number of nodes decreases following a log-normal distribution until a transition point ( 5.1 ), after which the number of nodes decays faster than the fitted log-normal distribution and is better represented by a power-law. This interval comprises the regions with medium strength ( ), like Russia and the USA, and high-strength nodes ( > ), which correspond to high fire activity regions occurring mainly in the tropics, like Brazil, Central America, northern Australia, the Indochinese Peninsula, and large parts of Sub-Saharan Africa. The latter is where the highest strength nodes appear (power-law tail), corresponding to the region in the southern Democratic Republic of the Congo and northern Angola as well as Southern Sudan and the Central African Republic, already described in previous works [50, 16].

As demonstrated, simple chronnet measures like the degree and strength can be used to describe the distribution of the events in the data set. Other network measures can be used to describe different aspects of the data set. The average path length = 3.21 indicates that, on average, only a few steps of consecutive events separate any pair of cells. The transitivity = 0.34 shows that there is a clustering structure in the network, which is an expected result since close regions tend to have similar climate and land use. Since the average path length is small but the transitivity is not, the resulting chronnet presents the small-world feature [53, 54, 37]. The presence of hubs is largely responsible for decreasing the average path length. It means that, for any pair of nodes on the globe, the average number of edges, i.e., consecutive events, needed for wildfires to co-occur is low.
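The two global measures used in the small-world argument can be computed as in the sketch below, which works on an unweighted adjacency dictionary. The function names and the toy graph are assumptions for illustration. The average path length is taken over reachable pairs only, and transitivity is the ratio of closed triples to all connected triples.

```python
from collections import deque
from itertools import combinations


def average_path_length(adjacency):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def transitivity(adjacency):
    """Closed triples divided by all connected triples."""
    closed, triples = 0, 0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        triples += k * (k - 1) // 2
        # A triple centred on `node` is closed if its two ends are linked.
        closed += sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return closed / triples if triples else 0.0


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
apl = average_path_length(toy)
trans = transitivity(toy)
```

A short average path length combined with a transitivity well above that of a comparable random graph is the usual signature of the small-world property discussed above.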

Another capability of chronnets is to represent, as communities, the groups of cells whose activity occurs in the same period of time. Fig. 9 illustrates community structures for the wildfire data set. Most of the communities are formed by nodes that represent relatively near geographical regions. The spatial distribution of the communities can be explained by the similarity of climate and land use in nearby regions, which tend to have fire seasons in the same period of time [16]. However, some communities are formed by distant regions. For example, the southwest-most community in South America (light green) also covers a small region in southwestern Australia. It is important to note that we have presented the results for just one network partition (38 communities). Other algorithms, such as the fast greedy algorithm [42], return a hierarchy of communities that can be used to divide the area of study into an arbitrary number of regions. In the fire data set, this hierarchy can be used to study the fire seasons at different geographical scales.

Figure 9: Communities in the pruned network. Only the strongest 20% of links were considered, and the 1% of nodes with the largest degree were removed (outliers). Communities were obtained using the label propagation method [55]. Colors represent the 38 communities. (Color online)

Conclusion

In this paper, we present a network-based representation model for spatiotemporal data analyses called chronnets whose main goal is to capture recurrent consecutive events in a data set. In this way, we transform spatiotemporal events into a network with a construction method that is fast (linear time complexity) and thus can be applied to large data sets. The model permits the use of different tools from network science and graph mining to extract information from spatiotemporal data sets. It can be used not just to describe, but also to compare data sets.

Our results show that the method can extract simple statistics as well as intricate and hidden information. Local network measures, such as the degree and strength, can describe the event activities in each cell and can be used to detect cells with outlier activities, while centrality measures can be applied to detect important nodes. Global network measures describe the data set as a whole and can reveal features that cannot be detected by simply looking at single cells. For example, high transitivity indicates a high tendency of consecutive events to form triangles and cluster together. In these cases, community detection algorithms can be used to find communities, which represent groups of cells whose activity occurs in the same period of time. We also show how to apply our model to some spatiotemporal data mining tasks, like frequent pattern detection, spatial change identification, clustering, and outlier detection.

The aforementioned capabilities of the model were experimentally demonstrated using toy data sets generated by a simple probabilistic process proposed here. The model was also applied to a real data set of global active fire detections, in which we described the frequency of fire events, outlier fire detections, and the seasonal activity, using a single chronnet. Some interesting aspects were found in the real data set: (i) the average path length is small, which indicates that, on average, only a few steps of consecutive events separate any pair of cells; (ii) the chronnet presents a clustering structure with relatively high transitivity, which is an expected result since close regions tend to have similar climate and land use, and which, together with (i), gives the chronnet the small-world property; (iii) the presence of hubs is largely responsible for decreasing the average path length, which indicates that, for any pair of nodes on the globe, the average number of edges, i.e., consecutive events, needed for wildfires to co-occur is low. Therefore, we conclude that chronnets are a robust and fast model for spatiotemporal data analysis.

This work can be extended in many directions. The construction method can be improved by introducing control parameters to capture not just consecutive events but also different lags, which would permit the detection of temporal patterns at other time scales. More variables can also be considered by applying the same grid division process used to transform the spatial coordinates into nodes. Another direction for future work is the analysis of chronnets from a temporal network perspective. Different graph mining methods can also be used with chronnets to propose new pattern recognition techniques. The temporal patterns captured by chronnets can be used to propose new machine learning tools or forecasting methods. For the wildfire application in particular, new analyses focusing on specific regions with high fire intensity [16], such as the Amazon forest, South America, or Africa, can be developed. Furthermore, other spatiotemporal data sets may also be analyzed with chronnets.

References

  • [1] Atluri, G., Karpatne, A. & Kumar, V. Spatio-temporal data mining: A survey of problems and methods. ACM Comput. Surv. 51, 83:1–83:41 (2018).
  • [2] Jiang, Z., Shekhar, S., Zhou, X., Knight, J. K. & Corcoran, J. Focal-test-based spatial decision tree learning. IEEE Trans. Knowl. Data Eng. 27, 1547–1559 (2015).
  • [3] Eftelioglu, E., Shekhar, S., Kang, J. M. & Farah, C. C. Ring-shaped hotspot detection. IEEE Transactions on Knowledge and Data Engineering 28, 3367–3381 (2016).
  • [4] Glatman-Freedman, A. et al. Near real-time space-time cluster analysis for detection of enteric disease outbreaks in a community setting. Journal of Infection 73, 99 – 106 (2016).
  • [5] Vega-Oliveros, D. A. et al. From spatio-temporal data to chronological networks: An application to wildfire analysis. In 34th ACM/SIGAPP Symposium on Applied Computing, SAC ’19, 675–682 (ACM, New York, NY, USA, 2019).
  • [6] Lappas, T., Vieira, M. R., Gunopulos, D. & Tsotras, V. J. Stem: A spatio-temporal miner for bursty activity. In ACM SIGMOD International Conference on Management of Data, SIGMOD’13, 1021–1024 (ACM, New York, NY, USA, 2013).
  • [7] Li, W., Mahadevan, V. & Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 18–32 (2014).
  • [8] Faghmous, J., Le, M., Uluyol, M., Kumar, V. & Chatterjee, S. A parameter-free spatio-temporal pattern mining model to catalog global ocean dynamics. 151–160 (2013).
  • [9] Torkamani, S. & Lohweg, V. Survey on time series motif discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7, e1199 (2017).
  • [10] Ludescher, J. et al. Very early warning of next El Niño. Proceedings of the National Academy of Sciences of the United States of America 111, 2064–6 (2014).
  • [11] Ebert-Uphoff, I. & Deng, Y. Causal discovery in the geosciences – using synthetic data to learn how to interpret results. Computers & geosciences 99, 50–60 (2017).
  • [12] Boers, N. et al. Complex networks reveal global pattern of extreme-rainfall teleconnections. Nature 566, 373 (2019).
  • [13] Donges, J. F., Schleussner, C. F., Siegmund, J. F. & Donner, R. V. Event coincidence analysis for quantifying statistical interrelationships between event time series. The European Physical Journal Special Topics 225, 471–487 (2016).
  • [14] Lu, D., Mausel, P., Brondízio, E. & Moran, E. Change detection techniques. International Journal of Remote Sensing 25, 2365–2401 (2004).
  • [15] Salmon, B. P. et al. Unsupervised land cover change detection: Meaningful sequential time series analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 4, 327–335 (2011).
  • [16] Ferreira, L. N., Vega-Oliveros, D. A., Zhao, L., Cardoso, M. F. & Macau, E. E. Global fire season severity analysis and forecasting. Computers & Geosciences 134, 104339 (2020).
  • [17] Falasca, F., Bracco, A., Nenes, A. & Fountalis, I. Dimensionality reduction and network inference for climate data using δ-MAPS: Application to the CESM large ensemble sea surface temperature. Journal of Advances in Modeling Earth Systems 11, 1479–1515 (2019).
  • [18] Zou, Y., Donner, R. V., Marwan, N., Donges, J. F. & Kurths, J. Complex network approaches to nonlinear time series analysis. Physics Reports 787, 1 – 97 (2019).
  • [19] Berton, L., de Andrade Lopes, A. & Vega-Oliveros, D. A. A comparison of graph construction methods for semi-supervised learning. In 2018 International Joint Conference on Neural Networks (IJCNN), IJCNN’18, 1–8 (IEEE, 2018).
  • [20] Zhou, D., Gozolchiani, A., Ashkenazy, Y. & Havlin, S. Teleconnection Paths via Climate Network Direct Link Detection. Physical Review Letters 115, 268501 (2015).
  • [21] Donner, R. V., Wiedermann, M. & Donges, J. F. Complex Network Techniques for Climatological Data Analysis, 159–183 (Cambridge University Press, 2017).
  • [22] Fan, J., Meng, J., Ashkenazy, Y., Havlin, S. & Schellnhuber, H. J. Network analysis reveals strongly localized impacts of El Niño. Proceedings of the National Academy of Sciences of the United States of America 114, 7543–7548 (2017).
  • [23] Meng, J., Fan, J., Ashkenazy, Y., Bunde, A. & Havlin, S. Forecasting the magnitude and onset of El Niño based on climate network. New Journal of Physics 20, 043036 (2018).
  • [24] Farkas, I., Jeong, H., Vicsek, T., Barabási, A.-L. & Oltvai, Z. The topology of the transcription regulatory network in the yeast, Saccharomyces cerevisiae. Physica A: Statistical Mechanics and its Applications 318, 601–612 (2003).
  • [25] Zou, Y. et al. Brain anomaly networks uncover heterogeneous functional reorganization patterns after stroke. NeuroImage: Clinical 20 (2018).
  • [26] Bialonski, S., Wendler, M. & Lehnertz, K. Unraveling Spurious Properties of Interaction Networks with Tailored Random Networks. PLoS ONE 6, e22826 (2011).
  • [27] Lacasa, L., Luque, B., Ballesteros, F., Luque, J. & Nuño, J. C. From time series to complex networks: the visibility graph. Proceedings of the National Academy of Sciences of the United States of America 105, 4972–5 (2008).
  • [28] Elsner, J. B., Jagger, T. H. & Fogarty, E. A. Visibility network of United States hurricanes. Geophysical Research Letters 36, L16702 (2009).
  • [29] Abe, S. & Suzuki, N. Scale-free network of earthquakes. Europhysics Letters (EPL) 65, 581–586 (2004).
  • [30] Ferreira, D., Ribeiro, J., Papa, A. & Menezes, R. Towards evidence of long-range correlations in shallow seismic activities. EPL 121, 58003 (2018).
  • [31] Ohsawa, Y., Benson, N. E. & Yachida, M. Keygraph: automatic indexing by co-occurrence graph based on building construction metaphor. In IEEE Int. Forum on Research and Technology Advances in Digital Libraries, 12–18 (1998).
  • [32] Vega-Oliveros, D. A., Gomes, P. S., Milios, E. E. & Berton, L. A multi-centrality index for graph-based keyword extraction. Information Processing & Management 56, 102063 (2019).
  • [33] Ozgur, A. & Bingol, H. Social network of co-occurrence in news articles. 688–695 (2004).
  • [34] Wang, R., Liu, W. & Gao, S. Hashtags and information virality in networked social movement: Examining hashtag co-occurrence patterns. Online Information Review 40, 850–866 (2016).
  • [35] Han, J., Pei, J. & Kamber, M. Data mining: concepts and techniques (Elsevier, 2011).
  • [36] Ferreira, L. N. Chronnets implementation in R. https://github.com/lnferreira/chronnets (2019).
  • [37] Barabási, A. & Pósfai, M. Network Science (Cambridge University Press, 2016).
  • [38] Newman, M. Networks: An Introduction (Oxford University Press, Inc., New York, NY, USA, 2010).
  • [39] Opsahl, T., Agneessens, F. & Skvoretz, J. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks 32, 245 – 251 (2010).
  • [40] Kisilevich, S., Mansmann, F., Nanni, M. & Rinzivillo, S. Spatio-temporal clustering, 855–874 (Springer US, Boston, MA, 2010).
  • [41] Fortunato, S. Community detection in graphs. Physics Reports 486, 75–174 (2010).
  • [42] Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004).
  • [43] Ferreira, L. N. & Zhao, L. Time series clustering via community detection in networks. Information Sciences 326, 227 – 242 (2016).
  • [44] Strogatz, S. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering (CRC Press, 2018).
  • [45] Justice, C. et al. An overview of modis land data processing and product status. Remote Sensing of Environment 83, 3 – 15 (2002).
  • [46] Giglio, L., Schroeder, W. & Justice, C. O. The collection 6 modis active fire detection algorithm and fire products. Remote Sensing of Environment 178, 31 – 41 (2016).
  • [47] Barnes, R. dggridR: Discrete Global Grids for R (2017). R package version 2.0.1.
  • [48] Giglio, L., Schroeder, W., Hall, J. V. & Justice, C. O. MODIS Collection 6 Active Fire Product User’s Guide Revision B. http://modis-fire.umd.edu/files/MODIS_C6_Fire_User_Guide_B.pdf (2019). [Online; accessed 01-March-2019].
  • [49] Birant, D. & Kut, A. ST-DBSCAN: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering 60, 208 – 221 (2007).
  • [50] Andela, N. et al. The global fire atlas of individual fire size, duration, speed and direction. Earth System Science Data 11, 529–552 (2019).
  • [51] Clauset, A., Shalizi, C. & Newman, M. Power-law distributions in empirical data. SIAM Review 51, 661–703 (2009).
  • [52] Oom, D. & Pereira, J. M. Exploratory spatial data analysis of global MODIS active fire data. International Journal of Applied Earth Observation and Geoinformation 21, 326–340 (2012).
  • [53] Watts, D. J. & Strogatz, S. H. Collective dynamics of ’small-world’ networks. Nature 393, 440–442 (1998).
  • [54] Amaral, L. A. N., Scala, A., Barthélémy, M. & Stanley, H. E. Classes of small-world networks. Proceedings of the National Academy of Sciences 97, 11149–11152 (2000).
  • [55] Raghavan, U. N., Albert, R. & Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007).

Acknowledgments

This research is supported by the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) under Grant No.: 2015/50122-0 and the German Research Council (DFG-GRTK) Grant No.: 1740/2. L.N.F. acknowledges FAPESP Grant No.: 2019/00157-3 and 2017/05831-9. D.A.V.O acknowledges FAPESP Grants 2016/23698-1, 2018/01722-3, and 2018/24260-5. M.G.Q acknowledges FAPESP Grant 2016/16291-2 and CNPq Grant 313426/2018-0. This research was developed using computational resources from the Center for Mathematical Sciences Applied to Industry (CeMEAI) funded by FAPESP (grant 2013/07375-0).

Author contributions statement

L.N.F, D.A.V.O, M.C, M.F.C, and M.G.Q developed the chronnet model. L.N.F proposed the data set generator, conceived and conducted the experiments. L.N.F and D.A.V.O analyzed the results and wrote the paper. L.Z and E.E.N.M contributed with ideas to improve the work. All authors reviewed the manuscript.

Additional information

Competing interests

The authors declare no competing financial and/or non-financial interests.

Code

To make reproducibility easier and to allow quick use of our method, we freely share online an implementation in the R programming language [36]. It includes the code used to generate the chronnets, the toy case generator, and other analysis functions.