Mining Sub-Interval Relationships In Time Series Data

02/16/2018
by Saurabh Agrawal, et al.

Time-series data is being increasingly collected and studied in several areas such as neuroscience, climate science, transportation, and social media. Discovery of complex patterns of relationships between individual time series using data-driven approaches can improve our understanding of real-world systems. While traditional approaches typically study relationships between two entire time series, many interesting relationships in real-world applications exist in small sub-intervals of time while remaining absent or feeble during other sub-intervals. In this paper, we define the notion of a sub-interval relationship (SIR) to capture interactions between two time series that are prominent only in certain sub-intervals of time. We propose a novel and efficient approach to find the most interesting SIR in a pair of time series. We evaluate our proposed approach on two real-world datasets from the climate science and neuroscience domains and demonstrate its scalability and computational efficiency. We further evaluate the discovered SIRs using a randomization-based procedure. Our results indicate the existence of several such relationships that are statistically significant, some of which also have a physical interpretation.


1 Introduction

Time series data, which consists of a sequence of observations collected at regular intervals of time, is ubiquitous in many real-world applications. Mining relationships in time series data is of immense interest to several disciplines such as neuroscience, climate science, and transportation. For example, in climate science, relationships are studied between time series of physical variables such as Sea Level Pressure and temperature observed at different locations on the globe. Such relationships, commonly known as 'teleconnections', capture the underlying processes of the Earth's climate system [kawale2013graph]. Similarly, in neuroscience, relationships are studied between activities recorded at different regions of the brain over time [atluri2016brain, atluri2015connectivity]. Studying such relationships can help us improve our understanding of real-world systems, which, in turn, could play a crucial role in devising solutions for problems such as mental disorders or climate change.

Figure 1: An example of a Sub-Interval Relationship from climate science. (a) Two ends of the El Niño Southern Oscillation. (b) Area-averaged Sea Level Pressure (SLP) monthly time series of two regions.

Most of the existing work on mining time series relationships assumes the relation to be present for the entire duration of the two time series. However, many interesting relationships in real-world applications are intermittent in nature, i.e., they are highly prominent only in certain sub-intervals of time and absent or feeble in the rest of the sub-intervals. As a motivating example, consider the pair of monthly Sea Level Pressure anomaly time series during 1979-2014 in Figure 1(b), observed at the two distinct regions of the globe shown in Figure 1(a). The full-length correlation between the two time series is -0.25. However, as shown in the lower panel of Figure 1(b), there exist multiple sub-intervals where the correlation between the two time series is stronger than -0.7. As we discuss later in Section 5.5, this example is the outcome of a well-known climate phenomenon called ENSO (El Niño Southern Oscillation) [glantz2001currents], which is characterized by negative correlations between the surface temperatures observed near Australia and the Pacific Ocean [glantz2001currents] and is known for impacting various weather events such as floods, droughts, and forest fires [siegert2001increased, ward2014strong]. The sub-intervals shown in the lower panel correspond to the two extreme phases of ENSO, 'El Niño' and 'La Niña', when its impact on the global climate is amplified. Similar examples are also known to exist in other domains such as neuroscience [atluri2014discovering] and stock market data [li2016efficient].

Inspired by such examples, we propose to formally study sub-interval relationship (SIR) between two time-series. Studying SIRs in time series data poses several challenges. First, we need to formally define the notion of an SIR and devise necessary interestingness measures to characterize them. Second, given a pair of time series, a computationally efficient method needs to be designed in order to find all the sub-intervals that show strong relationships. Third, the validity of discovered relationships needs to be evaluated.

For a given pair of time series, we define an SIR as the set of all disjoint time intervals, each of which is sufficiently long and captures a strong relationship between the two time series. The constraint on the length of an interval is important to filter out very small intervals that are more likely to capture spurious signals. An SIR with more and longer such intervals can be treated as more interesting. Thus, the interestingness of an SIR can be measured as the total sum-length of all included intervals.

The next challenge is then to obtain the most interesting SIR in a given pair of time series. As we describe later in Section 4.1, the above problem can be solved exactly by a standard dynamic programming (DP) approach. However, the time complexity of this approach is quadratic in the length of the time series, which makes it computationally infeasible for long time series. To the best of our knowledge, none of the existing works in the literature solves our exact problem formulation, and they are therefore not applicable (discussed further in Section 3). In this work, we propose a novel and efficient approach called Partitioned Dynamic Programming (PDP) to find the most interesting SIR in a given pair of time series. The basic idea of our approach is to partition the original problem into multiple sub-problems, each of which can be solved independently by standard dynamic programming. We show that our approach is guaranteed to find the optimal solution and has a time complexity that is practically linear in the length of the time series. Finally, to evaluate the validity of discovered SIRs, we propose a randomization-based procedure to assess their statistical significance.

2 Definitions and Problem Formulation

2.1 Definitions and Notations

We begin by presenting some basic terms and notations that will be used throughout the paper.

  1. A time series X = (x_1, x_2, ..., x_N) is a sequence of observations made at consecutive and regular timestamps.

  2. A time interval I = [a, b] represents the set of consecutive timestamps {a, a+1, ..., b}. A time interval with a single timestamp t is given by [t, t].

  3. The length of a time interval I = [a, b] represents the number of timestamps included in the interval and is denoted by len(I) = b - a + 1. The length of a single timestamp is equal to 1, i.e., len([t, t]) = 1.

  4. A segment of a time series X in a time interval I contains all the observations of X corresponding to the timestamps of the given interval.

  5. The full-length relationship between two time series X and Y refers to the strength of the relationship computed using all the observations of the two time series and is denoted by rel(X, Y).

  6. The local relationship between two time series X and Y in a time interval I = [a, b] refers to the strength of the relationship computed using only the observations in that time interval and is denoted by rel_I(X, Y) or, for brevity, rel[a, b]. Certain relationship measures (e.g., Euclidean distance) can also be computed over a singleton timestamp t, which is denoted by rel[t].

Following these ideas, we present our formal definition of sub-interval relationships (SIR) between a pair of time series.

Definition 2.1

A Sub-Interval Relationship (SIR) between two time series X and Y refers to a set S of non-overlapping time intervals such that every interval I in S

  • captures a strong relationship, i.e., rel_I(X, Y) ≥ δ, and

  • is of length at least l_min, i.e., len(I) ≥ l_min,

where δ and l_min are user-specified thresholds.

The choice of thresholds depends on the type of SIRs that are of interest to a user. For instance, setting a higher l_min and a lower δ results in SIRs with longer intervals of milder relationships, and vice versa.

Note that our definition of SIR is quite general and not specific to any relationship measure. Further, the above definition can also be used with distance measures (e.g., Euclidean distance), except that the inequality constraint in the first bullet flips, i.e., rel_I(X, Y) ≤ δ.

2.2 Problem Formulation

Our next goal is to capture the most interesting SIR for a given pair of time series. Intuitively, an SIR is likely to be more interesting if the set of selected intervals covers a large fraction of the timestamps. Therefore, we propose to measure the interestingness of an SIR by its sum-length (SL), which is equal to the sum of the lengths of all the selected time intervals. The problem then requires us to find the set of long and strong time intervals with maximum sum-length.

Formally, for a given pair of time series X and Y, our goal is to determine the most interesting SIR as the set S of non-overlapping time intervals with maximum sum-length such that for every interval I in S, rel_I(X, Y) ≥ δ and len(I) ≥ l_min,

where δ and l_min are user-specified thresholds. In the remainder of the paper, we will refer to the set with maximum sum-length as the 'optimal set'.
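Using the notation introduced above (the symbol names are our shorthand for the quantities defined in Section 2.1), the search for the optimal set can be written compactly as the following constrained optimization problem:

\max_{S = \{I_1, \dots, I_k\}} \ \mathrm{SL}(S) = \sum_{j=1}^{k} \mathrm{len}(I_j)
\quad \text{subject to} \quad
I_j \cap I_{j'} = \emptyset \ (j \neq j'), \quad
\mathrm{rel}_{I_j}(X, Y) \ge \delta, \quad
\mathrm{len}(I_j) \ge l_{\min} \ \text{for all } j.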

3 Related Work

Learning the similarity between two time series has been well studied in many different settings. However, quite surprisingly, our problem formulation has never been studied in any of the existing works. The most prevalent line of work has been on designing various similarity measures (e.g., Euclidean distance, Pearson correlation, dynamic time warping) for analyzing full-length time series [kawale2013graph, keogh2002exact, liao2005clustering]. Another part of the related work is devoted to various longest common subsequence (LCS) matching problems [das1997finding, chen2007spade, faloutsos1994fast]. Other related works focus on all-pairs-similarity-search and motif discovery [yeh2016matrix, zhu2016matrix], in which the fundamental problem is to efficiently find all the subsequences of a time series that are the closest match (or match within a threshold) to a given query sequence, which represents either a pattern of high interest (e.g., a motif) or a subsequence taken from another time series. The key difference between LCS/motif discovery and our problem formulation is that the former allows matching between observations of the two time series at different timestamps, at the cost of higher computational complexity. In contrast, we restrict ourselves to lock-step relationship measures (i.e., a time point in one time series can only be compared with the same time point in the other time series), which can be computed much more efficiently and have direct relevance to various domains, including climate science and neuroscience.

The closest setting related to our work was formulated in [li2013discovering, li2016efficient], where the longest subsequence whose overall correlation exceeds a given threshold is sought, but without any constraints on the sub-intervals. Like ours, their work also stresses the importance of efficiently computing subsequence correlations, but it relies heavily on reducing the constant factors involved in the correlation computation and therefore remains quadratic in the length of the time series. A tangential body of work can also be found in graphical models such as hidden semi-Markov models (HSMMs) and segmental hidden Markov models [liang2011improved, kim2006segmental]. However, such approaches often lack optimality guarantees due to the non-convexity of the objective function.

4 Methodology

Our problem formulation can be solved by two approaches: (i) a classical approach based on dynamic programming, and (ii) our proposed approach, Partitioned Dynamic Programming, which is an extension of the classical dynamic programming approach. We now describe the two approaches in further detail.

4.1 Classical Approach: Dynamic Programming

The problem of finding the optimal set can be treated as a classical DP problem of weighted interval scheduling [kleinberg2005algorithm] where the goal is to determine a schedule of jobs such that no two jobs conflict in time and the total sum of the weights of all the selected jobs is maximized. In our problem, we can treat every time-interval that meets the minimum strength and length criteria as a job and the interval length as the weight of the job. We could then use DP to find the set of intervals with maximum possible sum-length.

The key requirement for dynamic programming is to decompose the larger problem into sub-problems such that each sub-problem can itself be solved recursively. Let OPT(j) denote the optimal set for the sub-problem spanning timestamps 1 to j, and let SL(j) denote the sum-length of OPT(j). Then the recursive relation between the sum-lengths of the optimal sets of different sub-problems is given as

SL(j) = max{ SL(j-1), max_{i : rel[i, j] ≥ δ, j - i + 1 ≥ l_min} [ SL(i-1) + (j - i + 1) ] }.

In other words, the optimal set OPT(j) is either exactly the same as OPT(j-1), or it can be obtained by adding an interval [i, j] to the optimal set OPT(i-1) for one of the values of i. By applying the above recursion, the optimal sets for all the sub-problems and for the original problem can be obtained as described in Algorithm 1.

1: Input: Time series X, Y; Parameters: δ, l_min
2: Output: A set of time intervals that form an SIR between X and Y
3: W ← a matrix that stores the weight of each interval [i, j] as its length, i.e., W[i, j] = j - i + 1
4: W[i, j] ← 0 for every interval [i, j] such that rel[i, j] < δ or j - i + 1 < l_min
5: OPT(j) ← the optimal set for the sub-problem spanning timestamps between 1 and j, initialized to ∅
6: SL(j) ← sum-length of OPT(j), initialized to zero
7: N ← length of the input time series
8: for j = 1 to N do
9:       i* ← argmax_{i ≤ j} { SL(i-1) + W[i, j] }
10:      best ← SL(i*-1) + W[i*, j]
11:      if best > SL(j-1) then
12:            OPT(j) ← OPT(i*-1) ∪ {[i*, j]}; SL(j) ← best
13:      else
14:            OPT(j) ← OPT(j-1); SL(j) ← SL(j-1)
15:      end if
16: end for
17: return OPT(N)
Algorithm 1 Dynamic Programming to find SIR between a pair of time series
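As a concrete illustration of this weighted-interval-scheduling view, the following Python sketch implements the recursion directly; the callable rel(i, j), the helper name dp_sir, and the brute-force inner loop are our own simplifications rather than the authors' implementation.

def dp_sir(rel, N, delta, l_min):
    # rel(i, j): local relationship strength of the interval [i, j] (1-indexed, inclusive).
    # Returns the optimal set of disjoint, strong, sufficiently long intervals and its sum-length.
    SL = [0] * (N + 1)                    # SL[j]: maximum sum-length using timestamps 1..j
    OPT = [[] for _ in range(N + 1)]      # OPT[j]: set of intervals achieving SL[j]
    for j in range(1, N + 1):
        best_len, best_set = SL[j - 1], OPT[j - 1]      # option 1: no selected interval ends at j
        for i in range(1, j - l_min + 2):               # option 2: a selected interval [i, j] ends at j
            if rel(i, j) >= delta:
                cand = SL[i - 1] + (j - i + 1)
                if cand > best_len:
                    best_len, best_set = cand, OPT[i - 1] + [(i, j)]
        SL[j], OPT[j] = best_len, best_set
    return OPT[N], SL[N]

Each timestamp j scans all admissible start points i, which is what makes the plain DP quadratic; in practice, rel(i, j) would be maintained incrementally (see Property 2 in Section 4.2) rather than recomputed from scratch.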
1: Input: Time series X, Y; Threshold on interval strength: δ
2: Output: SIR between X and Y
3: Step 1:
4:       LW ← all timestamps that satisfy left-weakness
5: Step 2:
6:       RW ← all timestamps that satisfy right-weakness
7: Step 3:
8:       P ← LW AND RW (points of partition)
9: Obtain all the partitions between every two adjacent points of partition
10: Use Algorithm 1 to solve the sub-problem for each partition
11: return the union of the optimal sets of all the sub-problems
Algorithm 2 Partitioned Dynamic Programming

4.1.1 Time Complexity

For any pair of time series, dynamic programming needs to solve sub-problems of sizes 1, 2, ..., N, each of which takes O(j) time, where j is the length of the sub-problem. Therefore, both the average-case and the worst-case time complexity of dynamic programming is O(N^2).

4.2 Proposed Approach (Partitioned Dynamic Programming)

As discussed above, the time complexity of DP is O(N^2), which could be computationally infeasible for longer time series. A potential approach to reduce the computational cost is to partition the original problem into multiple sub-problems of constant size and to solve each of them independently using DP. The optimal set for the original problem could then be obtained by taking the union of the optimal sets obtained for all the sub-problems. The computational cost of this approach depends on the sizes of the different sub-problems. If the size of each sub-problem is smaller than a constant K, then the computational cost would be O(NK), which is faster than DP by a factor on the order of N/K. However, a key challenge in this approach is to partition the problem prudently so that no interesting interval gets fragmented across two partitions; otherwise, such an interval could be lost if its fragments are not sufficiently long or strong to meet the user-specified thresholds.

To this end, in this section we propose a novel approach called Partitioned Dynamic Programming (PDP) that is significantly more efficient than Dynamic Programming (DP) and is guaranteed to find the optimal set. PDP follows the above idea and breaks the original problem into multiple sub-problems such that each one of them can be solved independently using DP. The key step in PDP is to identify safe points of partition, where the problem can be split without compromising the optimality of the solution. We note, however, that unlike DP, PDP is applicable only to those relationship measures that satisfy the following three properties:

Property 1: The relationship measure can be computed over a single timestamp.
Property 2: If rel[a, b] is known, then rel[a, b+1] (and, analogously, rel[a-1, b]) can be computed in constant time.
Property 3: For a given pair of time series, let I1 = [a, b] and I2 = [b+1, c] be two adjacent time intervals. If rel(I1) < δ and rel(I2) < δ, then rel([a, c]) < δ.
The above three properties are satisfied by various measures that we discuss in more detail in Section 4.3.

From Property 3, it follows that an interval formed by the union of two adjacent weak intervals can never be strong. Thus, a timestamp t can be considered a 'point of partition' if:

  1. none of the intervals ending at t-1 is strong, i.e., rel[i, t-1] < δ for every i ≤ t-1. We refer to this condition as the left-weakness condition.

  2. none of the intervals beginning at t is strong, i.e., rel[t, j] < δ for every j ≥ t. We refer to this condition as the right-weakness condition.

The two conditions above ensure that every interval ending at t-1 or beginning at t is weak; by Property 3, no strong interval that spans across t can possibly exist. Therefore, no interesting interval is in danger of getting fragmented if the problem is partitioned at t. Following this idea, we propose a partitioning scheme to find the points of partition before applying the dynamic programming module to each of the partitions.

As summarized in Algorithm 2, PDP comprises three major steps. In Step 1, we find all the timestamps that satisfy the left-weakness condition. In Step 2, we identify all the timestamps that satisfy the right-weakness condition. Finally, in Step 3, all the timestamps that satisfy both left-weakness and right-weakness are taken as the points of partition. The original problem is then partitioned at the obtained points of partition, and the resulting sub-problems are solved independently using the dynamic programming module described in Algorithm 1.
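A compact sketch of this three-step scheme is given below. It reuses the dp_sir routine sketched after Algorithm 1 and assumes helper functions find_left_weak and find_right_weak (the left-to-right scan is sketched after Algorithm 3; the right-weakness scan is its mirror image); a point of partition t is treated here as a cut placed just before timestamp t.

def pdp_sir(rel, N, delta, l_min):
    LW = find_left_weak(rel, N, delta)     # LW[t] == True iff t satisfies left-weakness
    RW = find_right_weak(rel, N, delta)    # RW[t] == True iff t satisfies right-weakness
    cuts = [t for t in range(1, N + 1) if LW[t] and RW[t]]   # Step 3: points of partition
    result = []
    boundaries = [1] + cuts + [N + 1]
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        if b <= a:
            continue                       # empty partition (e.g., timestamp 1 is itself a cut)
        # Solve the sub-problem on timestamps a..b-1 with plain DP, shifting indices to 1..(b-a).
        sub_rel = lambda i, j, a=a: rel(i + a - 1, j + a - 1)
        opt, _ = dp_sir(sub_rel, b - a, delta, l_min)
        result.extend([(i + a - 1, j + a - 1) for (i, j) in opt])
    return result

Because no strong interval can span a cut, solving each partition separately and taking the union of the per-partition solutions preserves the optimal set.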

We next explain the two procedures used in Step 1 and Step 2 to obtain points that satisfy the left-weakness and right-weakness respectively.

4.2.1 Finding timestamps with left-weakness

To find the timestamps with left-weakness, we perform a left-to-right scan of the timestamps as summarized in Algorithm 3. We begin our scan from the leftmost timestamp and find the first timestamp s such that [s, s] is strong, i.e., rel[s] ≥ δ. We next show that all the timestamps 1, ..., s satisfy left-weakness using the following lemma. Lemma 4.2.1: Consider a timestamp t that satisfies left-weakness. If rel[t] < δ, then the timestamp t+1 also satisfies left-weakness. (Proof in supplementary.) Since there are no timestamps to the left of the first timestamp, it trivially satisfies left-weakness. By recursively applying Lemma 4.2.1 to the timestamps 1, ..., s-1, we get that each of the timestamps 1, ..., s satisfies left-weakness.

We then continue our scan beyond s to find the first timestamp e such that the interval [s, e] is weak, i.e., rel[s, e] < δ (lines 16-22). This also means that for every timestamp t in {s+1, ..., e}, the interval [s, t-1] is strong, and therefore t violates left-weakness. We next show that the timestamp e+1 satisfies left-weakness as follows.

Consider a set of timestamps s, s+1, ..., e such that rel[s, t] ≥ δ for every t with s ≤ t < e, while rel[s, e] < δ. If s satisfies left-weakness, then the timestamp e+1 also satisfies left-weakness. (Proof in supplementary.)

1: Input: Time series X, Y; Threshold on interval strength: δ
2: Output: A boolean array LW s.t. LW[t] = TRUE iff t satisfies left-weakness
3: LW ← N x 1 boolean array s.t. LW[i] = FALSE    (N: length of the time series)
4: LW[1] = TRUE
5: scan_finish = FALSE
6: t = 1
7: while NOT scan_finish do
8:       while rel[t] < δ do
9:             LW[t+1] = TRUE
10:            t = t+1
11:            if t ≥ N then
12:                  scan_finish = TRUE
13:                  break
14:            end if
15:      end while
16:      s = t
17:      while NOT scan_finish AND rel[s, t] ≥ δ do
18:            t = t+1
19:            if t ≥ N then
20:                  scan_finish = TRUE
21:            end if
22:      end while
23: end while
24: return LW
Algorithm 3 Find timestamps that satisfy left-weakness

We further continue our scan and repeat the above steps (lines 8-22) to find all the timestamps that satisfy left-weakness. In summary, the above procedure essentially finds streaks of points that satisfy or violate left-weakness in a single scan. A similar procedure can be followed to find the timestamps that satisfy right-weakness, except that the scan proceeds leftwards, starting from the rightmost timestamp. An illustration of PDP is provided in the supplementary material.
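A direct Python transcription of this scan (our reading of Algorithm 3, with rel(i, j) as a callable that returns the relationship over the interval [i, j] and is cheap to extend by Property 2) could look as follows:

def find_left_weak(rel, N, delta):
    # LW[t] == True iff timestamp t satisfies left-weakness (1-indexed; index 0 unused).
    LW = [False] * (N + 2)
    LW[1] = True                       # nothing lies to the left of timestamp 1
    t, scan_finish = 1, False
    while not scan_finish:
        # Streak of weak single timestamps: each passes left-weakness on to its successor.
        while rel(t, t) < delta:
            LW[t + 1] = True
            t += 1
            if t >= N:
                scan_finish = True
                break
        s = t                          # first strong single timestamp of the current streak
        # Advance t until the interval [s, t] turns weak (or the series ends).
        while not scan_finish and rel(s, t) >= delta:
            t += 1
            if t >= N:
                scan_finish = True
    return LW

The mirror-image right-to-left scan yields find_right_weak; with Property 2, each rel query in both scans can be answered in constant time, which gives the O(N) cost used in the analysis below.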

4.2.2 Time Complexity

There are three major steps in PDP. In the first step, we use Algorithm 3 to find the points that satisfy left-weakness. Notice that each timestamp is visited exactly once in the scan, and under the assumption of Property 2, the computation of rel[s, t] in line 17 of Algorithm 3 takes constant time. Therefore, the time complexity of Algorithm 3 is O(N). By similar arguments, the complexity of Step 2 is also O(N). Thus, the time complexity of finding the points of partition is O(N).

The total time complexity of PDP is therefore O(N) plus the cost of solving the sub-problems, which is governed by K, the average computational cost of the sub-problem corresponding to a partition. The worst-case complexity is O(N^2), corresponding to the cases where either no point of partition can be found or the largest partition obtained is of size O(N). However, the best-case complexity is O(N), attained when all the partitions obtained are of constant size, invariant to the overall length of the time series. In practice, this brings the average computational complexity of PDP close to O(N), as we demonstrate in the evaluation section.

4.3 Measures that qualify for PDP

In this section, we discuss the relationship measures that satisfy the three properties stated in Section 4.2 and therefore qualify for PDP:

Mean Square Error (MSE). This measures the distance between two time series as the mean of the squares of the differences in their observations. Specifically, the MSE between two time series X and Y in an interval I = [a, b] is given by

MSE_I(X, Y) = (1 / len(I)) * Σ_{t=a..b} (x_t - y_t)^2.

Average Product (AP). Average Product computes the mean of the element-wise product between the observations of the two time series. Specifically, the Average Product of two time series X and Y in an interval I = [a, b] is given by

AP_I(X, Y) = (1 / len(I)) * Σ_{t=a..b} x_t * y_t.

Both MSE and AP satisfy all three properties (proofs in supplementary). Notice that AP shares similarities with the covariance measure. Specifically, for two normalized time series (zero mean, unit variance), AP in an interval measures the covariance of the two time series with respect to their full-length means. Further, under the assumption that the two time series are the outcome of a stationary process, the mean and variance of the two time series remain constant in any sub-interval, in which case AP is also equal to the Pearson correlation in any sub-interval.
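To illustrate Property 2 for these two measures, the following sketch (our own illustration, not the authors' code) keeps running sums so that growing an interval by one timestamp updates both MSE and AP in constant time:

class IntervalStats:
    # Running statistics of an interval [a, b] over two aligned series x and y (0-indexed lists).
    def __init__(self, x, y, a):
        self.x, self.y = x, y
        self.a, self.b = a, a
        self.sum_sq_diff = (x[a] - y[a]) ** 2    # accumulator for MSE
        self.sum_prod = x[a] * y[a]              # accumulator for AP

    def extend_right(self):
        # O(1) update when the interval grows from [a, b] to [a, b+1] (Property 2).
        self.b += 1
        self.sum_sq_diff += (self.x[self.b] - self.y[self.b]) ** 2
        self.sum_prod += self.x[self.b] * self.y[self.b]

    def length(self):
        return self.b - self.a + 1

    def mse(self):
        return self.sum_sq_diff / self.length()

    def ap(self):
        return self.sum_prod / self.length()

Property 1 holds because both accumulators are well defined for a single timestamp, and Property 3 holds because each measure is an average, so the value on the union of two adjacent intervals lies between the values on the two parts.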

5 Results and Evaluation

5.1 Data and Pre-processing

5.1.1 Global Sea Level Pressure (SLP) Data

We used the monthly SLP dataset provided by the NCEP/National Center for Atmospheric Research (NCAR) Reanalysis Project [kistler2001ncep], which is available for 1979-2014 (36 years x 12 = 432 timestamps) at a spatial resolution of 2.5 x 2.5 degrees (10512 grid points, also referred to as locations). For each time series, we followed the standard pre-processing steps used in climate science to remove the annual seasonality and linear trends [kawale2013graph]. Relationships in spatio-temporal data are preferably studied between regions (sets of spatially contiguous locations) as opposed to individual locations, since they are more reliable and stable over time. Therefore, for every location l, we grew a homogeneous and spatially contiguous region around it by including all the locations that showed a strong positive correlation of at least 0.85 to l, thereby resulting in 10512 regions. For every region, we then generated its representative time series as the z-scored average time series of the constituent locations, thus obtaining 10512 region time series.

5.1.2 Brain fMRI Data

Functional Magnetic Resonance Imaging (fMRI) data measures the amount of oxygen consumed at every 2x2x2 mm voxel in the brain and is known to indicate the amount of activity occurring at any location. We used neuroimaging data collected in a study on 50 subjects [ramsay2013affective]. In this study, participants viewed 30-second video clips interleaved with 30-second resting periods while fMRI scans were being acquired. The temporal resolution of the scan is two seconds and the total duration of the scan was 480 seconds, so every time series has 240 observations. A number of fMRI pre-processing steps, including motion correction, unwarping, and filtering, were performed and are described in detail in [ramsay2013affective]. In addition, we grouped the 2x2x2 mm voxels into 90 anatomical regions of the brain based on an Automated Anatomical Labeling atlas [tzourio2002automated]. The resultant data matrix for each subject was of size 240 x 90. Furthermore, we observed that some of the time series show much higher variability within the intervals of resting states compared to the video-watching states. Note that the similarity measures proposed in Section 4.3 are sensitive to the scaling of the time series. Therefore, to avoid an unwanted bias toward either of the two states, we performed z-scoring on every resting and video-watching time interval.

5.2 Experimental Setup

We used the Average Product (AP) measure to capture sub-interval positive correlations in the fMRI dataset, whereas for the SLP dataset, we studied sub-interval negative correlations using the negative Average Product (nAP), which is exactly equal to the negative of the AP measure.

5.2.1 Choice of parameters

Our problem formulation requires inputs for two parameters: l_min, the minimum length, and δ, the minimum strength of the relationship in a sub-interval that can be selected in an SIR. The combination of the two parameters determines the type of SIRs obtained in the search. In climate science, a physical phenomenon typically shows up as a strong signal that lasts for at least six months; hence we chose an l_min of six months for the SLP data. Similarly, in the fMRI data, we are interested in relationships that might be prominent only during the resting or video-watching time periods. Since the length of each time period was 30 seconds, l_min was set to 20 seconds (equivalent to 10 timestamps), a slightly smaller value chosen to avoid noise factors. The other parameter, δ, was set to a high value of 1 for both experiments. Further analysis of parameter sensitivity shows robustness (see supplementary).

5.2.2 Selecting Candidate Pairs

Pairs of time series with strong full-length correlations are likely to show strong correlations for most of their observations, and thus the concept of an SIR has limited relevance for such pairs. Therefore, in this paper, we limit our analysis to pairs of time series that have weaker full-length relationships. We refer to such pairs as 'candidate pairs', to which we applied our proposed approach. For the brain fMRI dataset, all pairs of regions that have a full-length correlation weaker than 0.25 in magnitude in one of the subjects were selected as candidate pairs; we selected 1331 candidate pairs in this fashion. Similarly, for the SLP dataset, we first obtained a set of candidate pairs that have a full-length correlation weaker than 0.25 in magnitude. Due to spatial autocorrelation, many pairs in this set are redundant with each other and need to be discarded. We treated a pair of time series as redundant with another pair if their corresponding time series are highly similar. We removed redundant pairs by following a simple procedure described in the supplementary material.

5.3 Computational Evaluation

We evaluated PDP against DP based on scalability, i.e., how their computational costs vary with the length of the given time series, using the 14837 candidate pairs of the SLP dataset. To obtain longer time series, we used global Sea Level Pressure data simulated by GFDL-CM3, a coupled physical model that provides simulations for an entire century. We obtained nine time windows of different sizes, each starting in 1901 and ending in 1910, 1920, ..., 1990, and for each time window we obtained a set of 14837 pairs of time series, yielding nine sets in total. On each of these nine sets, we measured the total computational time taken by DP and PDP, as shown in Figure 2. The X-axis of this figure indicates the total length of the time series (in months) for the nine time windows, whereas the Y-axis indicates the total computational time in seconds (log scale) taken to obtain SIRs in all the candidate pairs. As expected, the computational cost of DP (blue curve) follows a quadratic function of the length of the time series. In contrast, the cost of PDP (red curve) increases linearly with the length of the time series. As explained earlier in Section 4.2.2, the complexity of PDP is governed by K, the average cost of solving the sub-problems corresponding to each partition, which are in turn solved using DP. With all the partitions of constant size, the computational time of DP for solving every sub-problem is also constant, and therefore the resultant complexity of PDP reduces to O(N), which makes it scalable to longer time series without losing optimality.

Figure 2: Computational time (Y-axis) of DP and PDP for different lengths of time series (X-axis). Note that the Y-axis is on a logarithmic scale.
(a) SLP Dataset
(b) fMRI Dataset
Figure 3: All the candidate pairs, with their full-length correlations on the X-axis and the normalized sum-lengths of their corresponding SIRs on the Y-axis. Each point corresponds to a candidate pair, and its color is based on the p-value of its SIR.
Figure 4: A set of four SIRs that synchronize with each other in more than 40 timestamps for 10 subjects. Shown above are the time series from the scan of one of the subjects.

5.4 Evaluation of Discovered SIRs

The lack of ground truth prevents us from performing a systematic domain evaluation on all the candidate pairs. Therefore, we evaluate the validity of the discovered SIRs by analyzing the statistical significance of their sum-lengths. We followed a randomization-based approach that addresses the question: can the sum-length of a given SIR between X and Y be achieved by replacing one of the two time series with a random time series whose full-length correlation with the fixed time series is preserved? If this happens for very few random time series, then the SIR obtained between X and Y can potentially be considered statistically significant. The approach was set up in the following fashion: for each discovered SIR in a pair of time series X and Y, without loss of generality we replaced the time series Y by a random time series Y_rand and recomputed the sum-length of the optimal set corresponding to the SIR formed between X and Y_rand. The random time series is generated such that its full-length correlation with the fixed time series is preserved, i.e., rel(X, Y_rand) = rel(X, Y). This process was repeated 1000 times, and the resulting distribution of sum-lengths was used to compute the p-value of the original sum-length of the optimal set corresponding to the SIR between X and Y. Specifically, the p-value was computed as the fraction of the random time series that formed an SIR of higher sum-length with X.
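The sketch below illustrates one standard way to carry out such a test; the orthogonalization trick for generating a surrogate with a prescribed full-length correlation and the helper find_sir (which stands in for the sum-length returned by the DP/PDP search) are our own choices and not necessarily the authors' exact procedure.

import numpy as np

def surrogate_with_fixed_correlation(x, r, rng):
    # Returns a random series whose sample correlation with x equals r (up to floating point):
    # mix the z-scored x with a random component made orthogonal to it.
    x = (x - x.mean()) / x.std()
    z = rng.standard_normal(len(x))
    z = z - z.mean()
    z = z - (z @ x) / (x @ x) * x          # remove the component of z along x
    z = z / z.std()
    return r * x + np.sqrt(1.0 - r ** 2) * z

def sir_p_value(x, y, sum_length, find_sir, n_rand=1000, seed=0):
    # p-value: fraction of surrogates (correlation with x preserved) whose SIR has a larger sum-length.
    rng = np.random.default_rng(seed)
    r = np.corrcoef(x, y)[0, 1]
    exceed = 0
    for _ in range(n_rand):
        y_rand = surrogate_with_fixed_correlation(x, r, rng)
        if find_sir(x, y_rand) > sum_length:   # find_sir returns the sum-length of the optimal set
            exceed += 1
    return exceed / n_rand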

Using the above procedure, we evaluated the statistical significance of all the SIRs discovered in the SLP and brain fMRI datasets. Figure 3 shows two scatter plots of the strength of the full-length correlation (X-axis) against the normalized sum-length of the SIR (Y-axis) for all the candidate pairs of the SLP and brain fMRI datasets. To emphasize the statistical significance at different levels, we depict each pair in Figure 3 in blue, red, or black according to the p-value of its SIR, from least to most significant. For the SLP dataset, the red and black points together constitute a total of 1044 statistically significant SIRs, with the black points marking the most highly significant pairs. Similarly, Figure 3(b) shows that 648 out of the 1331 candidate pairs were found to be statistically significant in the fMRI dataset. Notice that the significance of an SIR depends on both its sum-length and the full-length correlation of the pair. In the fMRI (SLP) dataset, pairs with weaker positive (negative) full-length correlations showed significant SIRs at smaller sum-lengths compared to pairs with stronger positive (negative) full-length correlations. This is expected because, in a pair with positive (negative) full-length correlation, the timestamps with high negative (positive) AP are expected to be rarer, and finding them together in long sub-intervals is even more unlikely, compared to the cases where such timestamps are more numerous.

Figure 5: For every interval of size 6 months, the plot indicates the proportion of (i) the 1044 significant SIRs (red curve) and (ii) all 14837 candidate SIRs (blue curve) that included the given interval.

5.5 Applications and Domain Insights

While SIRs are interesting by themselves for finding non-trivial relationships that are otherwise difficult to reveal using full-length relationship measures, studying them collectively in a time series dataset can provide further insights about the data. Here we discuss two potential approaches for their collective analysis.

5.5.1 Finding Anomalous Intervals

A potential application of this work is to detect anomalous time intervals that experience an unusually high number of relationships. Specifically, for every interval I, one can obtain a score that indicates the proportion of candidate pairs that were 'active' during the entire interval I. Intervals included in an unusually high number of SIRs could potentially indicate the occurrence of a special event. Applying this idea to the SIRs of the SLP dataset, we obtained the scores for all possible intervals of size 6 months, as shown in Figure 5. It can be seen that the scores are anomalously high for the intervals Sept 1982-Mar 1983, Sept 1988-Mar 1989, Aug 1997-Feb 1998, and Sept 2009-Mar 2010. All of these intervals are known to have experienced the strongest El Niño and La Niña events since 1979 [ensoyears]. During these events, the climate behaves quite differently from the general climatology; new wave patterns emerge that synchronize regions that are otherwise unrelated to each other.
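A minimal sketch of this scoring (our own formulation, assuming sirs maps each candidate pair to the list of intervals selected in its SIR) is:

def interval_scores(sirs, N, window=6):
    # sirs: dict mapping a candidate pair to a list of selected intervals (a, b), 1-indexed and inclusive.
    # Returns, for every window [t, t+window-1], the fraction of pairs active during the entire window.
    scores = {}
    for t in range(1, N - window + 2):
        lo, hi = t, t + window - 1
        active = sum(1 for intervals in sirs.values()
                     if any(a <= lo and hi <= b for (a, b) in intervals))
        scores[(lo, hi)] = active / len(sirs)
    return scores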

5.5.2 Discovery of Associated SIRs

Another potential utility of this work is to explore interesting connections that could exist between different pairs of regions that showed significant SIRs. For instance, there could exist certain candidate pairs that show SIRs simultaneously on multiple occasions. We refer to such sets as "associated SIRs" due to their strong association in multiple intervals of time. Such associated SIRs can be found by applying frequent pattern mining to a given pool of pairs with SIRs, where each pair is treated as an item and each timestamp as a transaction; a small sketch of this construction follows.
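One way to set up the transactions (a sketch with our own helper names; any off-the-shelf frequent-itemset miner can then be run on the result) is:

def build_transactions(sirs, N):
    # One transaction per timestamp; its items are the candidate pairs whose SIRs cover that timestamp.
    transactions = [set() for _ in range(N + 1)]   # index 0 unused
    for pair, intervals in sirs.items():
        for (a, b) in intervals:
            for t in range(a, b + 1):
                transactions[t].add(pair)
    return transactions[1:]

def support(transactions, pair_set):
    # Number of timestamps during which all the pairs in pair_set are simultaneously active.
    return sum(1 for items in transactions if pair_set <= items)

Sets of pairs with high support correspond to the associated SIRs described above.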

Using this approach, we discovered a set of four SIRs (shown in Figure 4) that share highly similar intervals in the block-design fMRI dataset, where the subject is exposed to interleaved video and rest periods. It is interesting to note that these SIRs are formed among parietal and temporal regions that are known to be activated by visual and auditory stimuli, respectively [downar2001effect]. It is also worth noting that one of the brain regions, the Cingulum, which is known to be involved in integrating memory with other regions of the brain [de2005fmri], participates in three of the four SIRs. While the regions involved in these SIRs are related to the video task at hand, it is surprising that these SIRs mostly capture intervals from the 'rest' segments, and not the 'video' segments that are known to simultaneously activate the involved regions. While these observations need to be validated by domain scientists, this example underscores the utility of the proposed approach for discovering SIRs in brain fMRI data.

6 Conclusion

In this paper, we defined the notion of a sub-interval relationship (SIR) to capture interactions between two time series that are intermittent in nature and are prominent only in certain sub-intervals of time. We proposed a novel approach to find the most interesting SIR in a pair of time series that is guaranteed to find the optimal SIR. We also demonstrated the scalability of our approach and showed that, in practice, it scales nearly linearly with the length of the time series, compared to the quadratic scaling of the standard dynamic programming approach. We further exhibited the utility of SIRs in two real-world applications, climate science and neuroscience, to obtain useful domain insights.

References