1 Introduction
Mining relationships in time series data is of immense interest to several disciplines such as neuroscience, climate science, and transportation. For example, in climate science, relationships are studied between time series of physical variables such as Sea Level Pressure, temperature, etc., observed at different locations on the globe. Such relationships, commonly known as 'teleconnections', capture the underlying processes of the Earth's climate system (Kawale et al., 2013). Similarly, in neuroscience, relationships are studied between activities recorded at different regions of the brain over time (Atluri et al., 2016, 2015). Studying such relationships can help us improve our understanding of real-world systems, which in turn could play a crucial role in devising solutions for problems such as mental disorders or climate change.
Most of the existing work on mining time series relationships assumes the relationship to be present for the entire duration of the two time series. The most prevalent line of work designs similarity measures (e.g., Euclidean distance, Pearson correlation, dynamic time warping) for analyzing full-length time series (Kawale et al., 2013; Keogh, 2002; Liao, 2005). Another line of related work devises variants of the longest common subsequence (LCS) matching problem (Das et al., 1997; Chen et al., 2007; Faloutsos et al., 1994). Other related works focus on all-pairs-similarity-search and motif discovery (Yeh et al., 2016; Zhu et al., 2016).
However, many interesting relationships in real-world applications are intermittent in nature, i.e., they are highly prominent only in certain sub-intervals of time and absent or feeble in the remaining sub-intervals. As a motivating example, consider the pair of monthly Sea Level Pressure anomaly time series during 1979–2014 in Figure 1(b), observed at two distinct regions on the globe shown in Figure 1(a). The full-length correlation between the two time series is weak. However, as shown in the lower panel of Figure 1(b), there exist multiple sub-intervals where the correlation between the two time series is considerably stronger. As we discuss later in Section 4.4, this example is the outcome of a well-known climate phenomenon called ENSO (El Niño Southern Oscillation) (Glantz, 2001), which is characterized by negative correlations between the surface temperatures observed near Australia and the Pacific Ocean (Glantz, 2001) and is known for impacting various weather events such as floods, droughts, and forest fires (Siegert et al., 2001; Ward et al., 2014). The sub-intervals shown in the lower panel correspond to the two extreme phases of ENSO, 'El Niño' and 'La Niña', when its impact on global climate is amplified. Similar examples are also known to exist in other domains such as neuroscience (Atluri et al., 2014) and stock market data (Li et al., 2016).
Inspired by such real-world examples, we formally define the notion of a sub-interval relationship (SIR) and devise the necessary interestingness measures to characterize it. We propose a novel and efficient approach called Partitioned Dynamic Programming (PDP) to find the most interesting SIR in a given pair of time series. We show that our approach is guaranteed to find the optimal solution and has a time complexity that is practically linear in the length of the time series.
2 Definitions and Problem Formulation
Definition 2.1
A Sub-Interval Relationship (SIR) between two time series $X$ and $Y$ refers to a set $S = \{I_1, \ldots, I_k\}$ of non-overlapping time intervals such that every interval $I \in S$:

- captures a strong relationship between $X$ and $Y$, i.e. $R(X_I, Y_I) \geq \delta$,

- is of length at least $l$, i.e. $|I| \geq l$,

where $\delta$ and $l$ are user-specified thresholds, and $R$ refers to a similarity measure.
The choice of thresholds depends on the type of SIRs that are of interest to a user. For instance, setting a higher $l$ and a lower $\delta$ results in SIRs with longer intervals of milder relationships, and vice versa.
Problem Formulation: Intuitively, an SIR is likely to be more reliable if the set of selected intervals covers a larger fraction of the timestamps. Therefore, we measure the interestingness of an SIR by its sum-length, the sum of the lengths of all the selected sub-intervals. The problem then requires us to find the set of long and strong time intervals with maximum sum-length. Formally, for a given pair of time series $(X, Y)$, the goal is to determine the optimal SIR, i.e. the one with the largest sum-length.
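To make the definition and the objective concrete, the following is a minimal Python sketch (function names are ours, not from the paper) that validates a candidate SIR against the two conditions and computes its sum-length:

```python
def is_valid_sir(intervals, strength, delta, min_len):
    """Check whether a set of intervals forms a valid SIR.

    intervals: list of (start, end) pairs with inclusive endpoints.
    strength: function mapping an interval (start, end) to R(X_I, Y_I).
    delta, min_len: user-specified strength and length thresholds.
    """
    # Intervals must be non-overlapping.
    ordered = sorted(intervals)
    for (_, e1), (s2, _) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            return False
    # Every interval must be long enough and strong enough.
    return all(e - s + 1 >= min_len and strength((s, e)) >= delta
               for s, e in intervals)

def sum_length(intervals):
    """Interestingness of an SIR: total number of covered timestamps."""
    return sum(e - s + 1 for s, e in intervals)
```

The objective is then to find, among all valid SIRs, one maximizing `sum_length`.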
3 Methodology
Our problem formulation can be solved by two approaches: (i) a classical approach based on dynamic programming, and (ii) our proposed approach, Partitioned Dynamic Programming, which is an extension of the classical dynamic programming.
3.1 Classical Approach: Dynamic Programming
The problem of finding the optimal set can be treated as the classical DP problem of weighted interval scheduling (Kleinberg & Tardos, 2005), where the goal is to determine a schedule of jobs such that no two jobs conflict in time and the total sum of the weights of the selected jobs is maximized. In our problem, we treat every time interval that meets the minimum strength and length criteria as a job, with the interval length as the weight of the job. We can then use DP to find the set of intervals with the maximum possible sum-length.
Time Complexity: It can be shown that both the average-case and worst-case time complexity of the DP approach is $O(n^2)$, where $n$ is the length of the time series.
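The weighted-interval-scheduling DP above can be sketched as follows (a standard textbook formulation, not the authors' code; the candidate intervals are assumed to be precomputed):

```python
from bisect import bisect_right

def max_sum_length(candidates):
    """Weighted interval scheduling by dynamic programming.

    candidates: (start, end) intervals (inclusive endpoints) that
    already meet the minimum strength/length criteria; the weight of
    each interval is its length, matching the sum-length objective.
    Returns (best sum-length, chosen non-overlapping intervals).
    """
    jobs = sorted(candidates, key=lambda iv: iv[1])  # sort by end time
    ends = [e for _, e in jobs]
    # p[j]: number of jobs ending strictly before job j starts.
    p = [bisect_right(ends, s - 1) for s, _ in jobs]
    best = [0] * (len(jobs) + 1)        # best[j]: optimum over first j jobs
    for j, (s, e) in enumerate(jobs):
        take = (e - s + 1) + best[p[j]]  # take job j + best compatible prefix
        best[j + 1] = max(best[j], take)
    # Backtrack to recover the selected intervals.
    chosen, j = [], len(jobs)
    while j > 0:
        if best[j] == best[j - 1]:       # job j-1 not taken
            j -= 1
        else:
            chosen.append(jobs[j - 1])
            j = p[j - 1]
    return best[-1], chosen[::-1]
```

Since there can be up to $O(n^2)$ candidate intervals, this is the source of DP's quadratic cost on a full-length series.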
3.2 Proposed Approach (Partitioned Dynamic Programming)
DP's quadratic complexity poses a serious challenge for the long time series commonly found in the climate and neuroscience domains. A potential way to reduce the computational cost is to partition the original problem into multiple subproblems of constant size and solve each of them independently using DP. The optimal set for the original problem could then be obtained by taking the union of the optimal sets obtained for all subproblems. The computational cost would then depend on the sizes of the subproblems: if the size of each subproblem is smaller than a constant $k$, the cost would be $O(nk)$, faster than DP by a factor of $n/k$. However, a key challenge in this approach is to partition the problem prudently so that no interesting interval gets fragmented across two partitions; otherwise it could be lost if its fragments are not sufficiently long or strong to meet the user-specified thresholds.
To this end, we propose a novel approach called Partitioned Dynamic Programming (PDP) that is significantly more efficient than Dynamic Programming (DP) and is still guaranteed to find the optimal set. PDP follows the above idea and breaks the original problem into multiple subproblems, each of which can be solved independently using DP. The key step in PDP is to identify safe points of partition, where the problem can be split without compromising the optimality of the solution. However, PDP is applicable only to relationship measures that satisfy the following three properties:
Property 1: The relationship measure $R$ can be computed over a single timestamp.
Property 2: If $R(X_{[a,b]}, Y_{[a,b]})$ is known, then $R(X_{[a,b+1]}, Y_{[a,b+1]})$ and $R(X_{[a-1,b]}, Y_{[a-1,b]})$ can be computed in constant time.
Property 3: For a given pair of time series, let $I_1$ and $I_2$ be two adjacent time intervals, $R_1 = R(X_{I_1}, Y_{I_1})$, and $R_2 = R(X_{I_2}, Y_{I_2})$; then $\min(R_1, R_2) \leq R(X_{I_1 \cup I_2}, Y_{I_1 \cup I_2}) \leq \max(R_1, R_2)$.
The above three properties are satisfied by various measures that we discuss in more detail in Section 3.3.
From Property 3, it follows that an interval formed by the union of two adjacent weak intervals $I_1$ and $I_2$ can never be strong. Thus, a timestamp $t$ can be considered a 'point of partition' if:

- none of the intervals ending at $t$ is strong, i.e. $R(X_{[i,t]}, Y_{[i,t]}) < \delta$ for all $i \leq t$. We refer to this condition as the left-weakness condition.

- none of the intervals beginning from $t$ is strong, i.e. $R(X_{[t,j]}, Y_{[t,j]}) < \delta$ for all $j \geq t$. We refer to this condition as the right-weakness condition.

The two conditions above ensure that all the intervals ending at or beginning from $t$ are weak; therefore no strong interval that subsumes $t$ can exist. Hence, no interesting interval is in danger of being fragmented if the problem is partitioned at $t$. Following this idea, we propose a partitioning scheme that finds the points of partition before applying the dynamic programming module to each of the partitions.
PDP comprises three major steps. In step 1, we find all timestamps that satisfy the left-weakness condition. In step 2, we identify all timestamps that satisfy the right-weakness condition. Finally, in step 3, all timestamps that satisfy both left-weakness and right-weakness are taken as the points of partition. The original problem is then partitioned at the obtained points of partition, and the resulting subproblems are solved independently using the DP module described in Section 3.1.
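The three steps can be sketched schematically as follows (our own skeleton, assuming boolean left-/right-weakness arrays and a per-partition DP solver as black boxes):

```python
def pdp(left_weak, right_weak, solve_dp):
    """Partitioned Dynamic Programming skeleton.

    left_weak, right_weak: boolean lists over timestamps (steps 1, 2).
    solve_dp: function mapping a partition (start, end), inclusive,
              to its optimal list of intervals (step 3, Section 3.1).
    """
    n = len(left_weak)
    # Timestamps satisfying both conditions are safe partition points.
    cuts = [t for t in range(n) if left_weak[t] and right_weak[t]]
    # Solve each partition independently and union the solutions.
    solution, start = [], 0
    for c in cuts + [n - 1]:
        if start <= c:
            solution.extend(solve_dp(start, c))
        start = c + 1
    return solution
```

Because no strong interval subsumes a partition point, solving the partitions independently loses no candidate interval, which is what preserves optimality.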
3.2.1 Finding timestamps with left-weakness:
To find timestamps with left-weakness, we perform a left-to-right scan of the timestamps as follows. We begin our scan from the leftmost timestamp to find the first timestamp $s$ such that $R(X_{[s,s]}, Y_{[s,s]}) \geq \delta$. We next show that all the timestamps $1, \ldots, s-1$ satisfy left-weakness using the following lemma.
Lemma 1
Consider a timestamp $t$ that satisfies left-weakness. If $R(X_{[t+1,t+1]}, Y_{[t+1,t+1]}) < \delta$, then $t+1$ also satisfies left-weakness.
Since there are no timestamps to the left of the first timestamp, it trivially satisfies left-weakness. By recursively applying Lemma 1 to timestamps $2, \ldots, s-1$, we conclude that each of them satisfies left-weakness.
We then continue our scan beyond $s$ to find the first timestamp $e > s$ such that the interval $[s, e]$ is weak, i.e. $R(X_{[s,e]}, Y_{[s,e]}) < \delta$. This also means that for every timestamp $t \in [s, e-1]$, the interval $[s, t]$ is strong, and therefore $t$ violates left-weakness. We next claim that timestamp $e$ satisfies left-weakness, based on the following lemma.
Lemma 2
Consider a set of timestamps $s, \ldots, e$ such that $R(X_{[s,t]}, Y_{[s,t]}) \geq \delta$ for all $t \in [s, e-1]$, while $R(X_{[s,e]}, Y_{[s,e]}) < \delta$. If timestamp $s-1$ satisfies left-weakness, then timestamp $e$ also satisfies left-weakness.
We further continue our scan and repeat the above steps to find all the timestamps that satisfy left-weakness. In summary, the above procedure finds streaks of timestamps that satisfy or violate left-weakness in a single scan. A similar procedure can be followed to find timestamps that satisfy right-weakness, except that the scan proceeds leftwards, starting from the rightmost timestamp.
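The single-scan procedure can be sketched as follows, using the Average Product measure of Section 3.3 and its constant-time running-sum update (Property 2); variable names are ours:

```python
def scan_left_weakness(x, y, delta):
    """One left-to-right pass marking timestamps that satisfy
    left-weakness under the Average Product measure
    R([a, b]) = mean of x[t] * y[t] over [a, b] (inclusive)."""
    n = len(x)
    weak = [False] * n
    t = 0
    while t < n:
        if x[t] * y[t] < delta:
            # Single timestamp is weak: by Lemma 1, left-weakness
            # carries over from the previous timestamp.
            weak[t] = True
            t += 1
        else:
            # Streak starting at s: while [s, t] stays strong, every
            # such t violates left-weakness ...
            s, total = t, 0.0
            while t < n:
                total += x[t] * y[t]          # O(1) update (Property 2)
                if total / (t - s + 1) < delta:
                    weak[t] = True            # ... until [s, e] turns
                    t += 1                    # weak: Lemma 2 applies to e
                    break
                t += 1
    return weak
```

The mirror-image pass over reversed inputs yields the right-weakness marks.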
3.2.2 Time Complexity
There are three major steps in PDP. In step 1, we scan all the timestamps to find the ones that satisfy left-weakness. Each timestamp is visited exactly once in the scan, and under Property 2, each update of $R$ takes constant time; therefore, the time complexity of step 1 is $O(n)$. Similarly, the complexity of step 2 (finding timestamps that satisfy right-weakness) is also $O(n)$. Step 3 solves the problem for each partition using standard DP, which takes $O(k^2)$ time per partition. The total time complexity of PDP is therefore $O(nk)$, where $k$ is the length of the largest partition. In most cases, the threshold on relationship strength in each sub-interval is set to a very high value, which typically prevents any partition from growing beyond a constant that is invariant to the length of the time series. As a result, the time complexity of PDP turns out to be $O(n)$.
3.3 Measures That Qualify For PDP
The following popular relationship measures satisfy all three properties discussed above:
1) Mean Square Error (MSE) is calculated for a given pair of time series over an interval $I$ as $MSE(X_I, Y_I) = \frac{1}{|I|}\sum_{t \in I}(x_t - y_t)^2$.
2) Average Product (AP) is given by $AP(X_I, Y_I) = \frac{1}{|I|}\sum_{t \in I} x_t y_t$.
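Both measures are per-timestamp averages, which is what makes the three properties hold. As an illustration (our own helper, not from the paper), AP queries over arbitrary intervals can be answered in constant time after one prefix-sum pass:

```python
from itertools import accumulate

def ap_oracle(x, y):
    """Return a function R(a, b) giving the Average Product of x and y
    over the inclusive interval [a, b] in O(1), via prefix sums over
    the pointwise products."""
    prefix = [0.0] + list(accumulate(xi * yi for xi, yi in zip(x, y)))
    def R(a, b):
        return (prefix[b + 1] - prefix[a]) / (b - a + 1)
    return R
```

Since the AP of a union of two adjacent intervals is the length-weighted mean of the APs of its parts, it always lies between their minimum and maximum, which is exactly Property 3.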
4 Results and Evaluation
4.1 Data and Preprocessing
Global Sea Level Pressure (SLP) Data: We used the monthly SLP dataset provided by the NCEP/National Center for Atmospheric Research (NCAR) Reanalysis Project (Kistler et al., 2001), which is available for 1979–2014 (36 years × 12 months = 432 timestamps) at a spatial resolution of 2.5 × 2.5 degrees (10512 grid points, also referred to as locations). In total, we sampled 14837 pairs of regions from the entire globe whose time series have a full-length correlation weaker than 0.25 in magnitude.
4.2 Experimental Setup
Similarity Measure: Negative correlations have been widely studied in climate science. Hence, we adopt the negative Average Product (nAP), which is exactly the negative of the AP measure.
Choice of parameters: Our problem formulation requires two input parameters: $l$, the minimum length, and $\delta$, the minimum strength of the relationship in every sub-interval of an SIR. In climate science, a physical phenomenon typically shows up as a strong signal that lasts for at least six months, hence we chose $l = 6$ months for the SLP data. The other parameter, $\delta$, was set to a high value of 1.
4.3 Computational Evaluation
We evaluate PDP against DP based on their computational costs and their scalability to datasets with long time series. To generate datasets of different sizes, we used global SLP data simulated by GFDL-CM3, a coupled physical climate model that provides simulations for an entire century. From these simulations we obtained nine time windows of different sizes, each starting in 1901 and ending in 1910, 1920, ..., 1990, and for each time window we obtained a set of 14837 pairs of time series. Figure 2 shows the total computational time taken by DP and PDP to find SIRs in all the datasets. As can be seen, the computing time of DP (blue curve) follows a quadratic function of the length of the time series, whereas that of PDP (red curve) increases linearly with the length of the time series. This is no surprise, because the time complexity of PDP is $O(nk)$, where the constant $k$ is governed by the length of the largest partition. Typically, the size of the sub-intervals exhibiting strong relationships does not exceed a constant, and therefore $k$ is independent of the length of the time series, which makes PDP linear in runtime.
4.4 Applications and Domain Insights
Finding Anomalous Intervals: A potential application of this work is to detect anomalous time intervals that experience an unusually high number of relationships. Specifically, for every interval $I$, one can obtain a score that indicates the proportion of candidate pairs that were 'active' during the entire interval $I$. Intervals included in an unusually high number of SIRs could indicate the occurrence of a special event. Applying this idea to the SIRs of the SLP dataset, we obtained the scores for all possible intervals of size 6 months, as shown in Figure 3. It can be seen that the scores are anomalously high for the intervals Sept 1982–Mar 1983, Sept 1988–Mar 1989, Aug 1997–Feb 1998, and Sept 2009–Mar 2010. All of the above intervals are known to have experienced the strongest El Niño and La Niña events since 1979 (ens, ). During these events, the climate behaves quite differently from the general climatology: new wave patterns emerge that synchronize regions that are otherwise unrelated to each other.
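The scoring idea can be sketched as follows (a hypothetical helper of our own; the paper applies it with 6-month windows over the SLP SIRs):

```python
def activity_scores(sirs, n, w):
    """For each window [t, t + w - 1], return the fraction of candidate
    pairs whose SIR contains the entire window.

    sirs: one list of (start, end) intervals (inclusive) per pair.
    n: number of timestamps; w: window size (e.g., 6 months).
    """
    scores = []
    for t in range(n - w + 1):
        # A pair is 'active' if some selected interval covers the window.
        active = sum(any(s <= t and t + w - 1 <= e for s, e in sir)
                     for sir in sirs)
        scores.append(active / len(sirs))
    return scores
```

Windows whose score is far above the typical level are then flagged as anomalous.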
5 Conclusion
In this paper, we defined the notion of a sub-interval relationship (SIR) to capture interactions between two time series that are intermittent in nature and prominent only in certain sub-intervals of time. We proposed a fast, optimal algorithm to find the most interesting SIR in a pair of time series. We further demonstrated the utility of SIRs in climate science applications and obtained useful domain insights.
References
(1) El Niño and La Niña years and intensities. http://ggweather.com/enso/oni.htm.
Atluri et al. (2014) Atluri, G., Steinbach, M., Lim, K., MacDonald III, A., and Kumar, V. Discovering groups of time series with similar behavior in multiple small intervals of time. In SIAM International Conference on Data Mining, 2014.
Atluri et al. (2015) Atluri, G., Steinbach, M., Lim, K. O., Kumar, V., and MacDonald, A. Connectivity cluster analysis for discovering discriminative subnetworks in schizophrenia. Human Brain Mapping, 2015.
Atluri et al. (2016) Atluri, G., MacDonald III, A., Lim, K. O., and Kumar, V. The brain-network paradigm: Using functional imaging data to study how the brain works. Computer, 2016.
Chen et al. (2007) Chen, Y., Nascimento, M. A., Ooi, B. C., and Tung, A. K. SpADe: On shape-based pattern detection in streaming time series. In IEEE 23rd International Conference on Data Engineering (ICDE), 2007.
 Das et al. (1997) Das, G., Gunopulos, D., and Mannila, H. Finding similar time series. In European Symposium on Principles of Data Mining and Knowledge Discovery, 1997.
Faloutsos et al. (1994) Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. Fast subsequence matching in time-series databases. ACM, 1994.
Glantz (2001) Glantz, M. H. Currents of Change: Impacts of El Niño and La Niña on Climate and Society. 2001.
Kawale et al. (2013) Kawale, J., Liess, S., Kumar, A., Steinbach, M., Snyder, P., Kumar, V., Ganguly, A. R., Samatova, N. F., and Semazzi, F. A graph-based approach to find teleconnections in climate data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2013.
Keogh (2002) Keogh, E. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment, 2002.
Kistler et al. (2001) Kistler, R., Collins, W., Saha, S., White, G., Woollen, J., Kalnay, E., et al. The NCEP-NCAR 50-year reanalysis: Monthly means CD-ROM and documentation. Bulletin of the American Meteorological Society, 2001.
Kleinberg & Tardos (2005) Kleinberg, J. and Tardos, E. Algorithm Design. Addison-Wesley, Boston, MA, 2005.
 Li et al. (2016) Li, Y., Yiu, M. L., Gong, Z., et al. Efficient discovery of longestlasting correlation in sequence databases. The VLDB Journal, 2016.
Liao (2005) Liao, T. W. Clustering of time series data: a survey. Pattern Recognition, 2005.
Siegert et al. (2001) Siegert, F., Ruecker, G., Hinrichs, A., and Hoffmann, A. Increased damage from fires in logged forests during droughts caused by El Niño. Nature, 2001.
Ward et al. (2014) Ward, P. J., Jongman, B., Kummu, M., Dettinger, M. D., Weiland, F. C. S., and Winsemius, H. C. Strong influence of El Niño Southern Oscillation on flood risk around the world. Proceedings of the National Academy of Sciences, 2014.
Yeh et al. (2016) Yeh, C.-C. M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H. A., Silva, D. F., Mueen, A., and Keogh, E. Matrix Profile I: All pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In IEEE 16th International Conference on Data Mining, 2016.
Zhu et al. (2016) Zhu, Y., Zimmerman, Z., Senobari, N. S., Yeh, C.-C. M., Funning, G., Mueen, A., Brisk, P., and Keogh, E. Matrix Profile II: Exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In IEEE 16th International Conference on Data Mining, 2016.
6 Lemma Proofs
Proof of Lemma 1: Consider the two adjacent intervals $[i, t]$ and $[t+1, t+1]$ for any $i \leq t$. Since $t$ satisfies left-weakness, $R(X_{[i,t]}, Y_{[i,t]}) < \delta$. Also, $R(X_{[t+1,t+1]}, Y_{[t+1,t+1]}) < \delta$; therefore, from Property 3, $R(X_{[i,t+1]}, Y_{[i,t+1]}) < \delta$. Thus, every interval ending at $t+1$ is weak, and $t+1$ satisfies left-weakness.
Proof of Lemma 2: Following the definition of left-weakness, it suffices to show that $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i \leq e$. We prove this in two parts: in the first part, we show that $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i \in [s, e]$, while in the second, we show that $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i < s$.

Part 1: For any $i \in (s, e]$, consider the two adjacent intervals $[s, i-1]$ and $[i, e]$. Let $R_1 = R(X_{[s,i-1]}, Y_{[s,i-1]})$ and $R_2 = R(X_{[i,e]}, Y_{[i,e]})$. Then from Property 3, we get $\min(R_1, R_2) \leq R(X_{[s,e]}, Y_{[s,e]})$. Since $i-1 \in [s, e-1]$, we have $R_1 \geq \delta$. Since $R(X_{[s,e]}, Y_{[s,e]}) < \delta \leq R_1$, the minimum must be attained by $R_2$, and we get $R_2 \leq R(X_{[s,e]}, Y_{[s,e]}) < \delta$. Together with the case $i = s$, where $R(X_{[s,e]}, Y_{[s,e]}) < \delta$ holds by assumption, this gives

(1) $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i \in [s, e]$.

Part 2: We know that $s-1$ satisfies left-weakness. Therefore, by definition, $R(X_{[i,s-1]}, Y_{[i,s-1]}) < \delta$ for all $i \leq s-1$. We also have $R(X_{[s,e]}, Y_{[s,e]}) < \delta$. Thus, applying Property 3 to the adjacent intervals $[i, s-1]$ and $[s, e]$, we get

(2) $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i < s$.

Together, Eqs. (1) and (2) give $R(X_{[i,e]}, Y_{[i,e]}) < \delta$ for all $i \leq e$, i.e., $e$ satisfies left-weakness.