In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets and lead to novel domain insights.READ FULL TEXT VIEW PDF
In many domains, understanding the relationships between time series is essential for obtaining actionable insights. For instance, in climate science, pressure dipoles, which are pairs of locations with strong negative correlations in their Sea Level Pressure time series, have been extensively studied, and have been linked with anomalous weather events all over the globe such as forest fires, hurricanes etc.[1, 2, 3]. Similarly, in neuroscience, researchers have discovered pairs of brain regions that exhibit positive correlations between their activity time series. These correlated time series represent brain regions exhibiting synergistic activity .
In this work, we define a novel class of linear relationships, called multipoles
, that involve more than two time series, also referred to as variables. We say that a set of variables is a multipole if i) the variables show strong linear dependence, and ii) each variable makes a significant contribution to the linear dependence, i.e., excluding any of the variables from the set significantly weakens the strength of the linear dependence among the remaining variables. We define linear dependence in terms of the variance of a linear combination of the standardized (zero mean, unit variance) vectors. We measure the strength of the linear dependence of a set of time series by the variance of theirleast variant linear combination, i.e., a (unit-length) linear combination that has the minimum variance. The smaller the variance of this linear combination, the more constant (less variable) the linear combination, and thus higher the linear dependence, and vice-versa. We define the contribution of each variable to the linear dependence as the reduction in the strength of the linear dependence of the smaller set when this variable is removed. We define linear gain as the minimum contribution of included variables to the linear dependence.
We next illustrate multipoles with a real-world example. Consider a set of three time series , , and , shown in Figure 1(a), which capture the traffic volume on three different roads in Minnesota as shown in Figure 1. The three bottommost plots of Figure 1(b) show the least variant linear combination for each possible pair, while the top plot of Figure 1(b) shows the least variant linear combination for all three. Note that the linear combination of all three time series has a variance of only .Thus, we say that there exists a strong, although not perfect, linear dependence among the above three time series.
Further, note that each of the three time series forms a crucial component of the relationship, as excluding any one of them will significantly weaken the strength of the linear dependence among the remaining two time series. For example, if is excluded from the set, the variance of the least variant linear combination of and turns out to be , which is much higher than the variance of , thus indicating a significant contribution from . Similarly, if we exclude or instead of , the variance of least variant combinations become and . (See the three bottommost plots of Figure 1(b).) The linear gain for is .
An explanation for the relationships illustrated above can be provided by the notion of conservation of flow. Two of the time series, and , were observed on the roads that act as major tributaries to the highway where is being observed. All the southbound traffic coming from the tributaries is likely to merge at the highway, thus leading to a strong linear dependence among the three time series. Omitting any single time series leads to a weaker linear dependence because, unlike the highway, traffic flow in the tributaries significantly changes during weekends (see peaks and drops every Sunday), which weakens the strength of their pairwise linear relationships with highway. However, the simultaneous rise and fall in traffic on both tributaries complements each other, resulting in a stronger linear dependence among the three time series.
Multipoles can also be seen in other domains. For instance, we used the techniques presented in this paper to find several previously unknown multipole relationships between climate variables observed at more than two distant locations. Similarly, in neuroscience, we found novel multipole relationships between different brain regions that are triggered by specific visual and auditory stimuli. Some of these relationships were found to have insightful domain interpretations. (See Section 6.)
The relationships and conditional independence of two or more variables have been widely studied in different contexts using a variety of techniques including regression models [5, 6, 7, 8], PCA-based approaches[9, 10], structure learning methods [11, 12], correlation network analysis [13, 14, 15, 16], etc. However, as discussed in Section 3, multipole relationships as defined here can be viewed as novel in the sense that none of these previous approaches can be used to find the type of relationship represented by multipoles in the data.
A naïve approach to find all multipoles in a time series dataset would be to enumerate all possible combinations of time series and measure the strength of their linear dependence and linear gain. However, this is computationally infeasible due to the combinatorial nature of the search space. In this paper, we present an efficient approach to find multipoles.
This approach formulates the multipole-search problem as a clique-enumeration problem in a correlation network, where each node represents a time series, and the weight of an edge between two nodes represents the strength of the linear correlation between the corresponding time series. Our proposed problem formulation is motivated by the following two key empirical observations: the upper limit on the linear gain of a multipole is dependent on i) the size of multipole, and ii) the maximum correlation strength among two variables in a multipole.
Leveraging these observations, we propose a novel Clique Based Multipole Search (CoMEt) approach to find most interesting multipoles in a time series dataset. The central idea of the approach is to identify and restrict the search for multipoles to family of subsets, which we refer to as ‘promising candidates‘ that are more likely to exhibit multipole relationships with stronger linear gain between their members. Using the above empirical observations, we show that most promising candidates of multipoles with high linear gain appear in a correlation network either as negative cliques, i.e. sets with all pairwise correlations being negative, or negative-equivalent cliques, that can be transformed into negative cliques by flipping the signs of one or more of its member variables (see Definitions 10 and 11 for further details). The number of promising candidates typically turns out to be much smaller in scenarios where multipoles with high linear gain are desired, thereby contributing to the remarkably high computational efficiency of our approach, although with some loss of completeness in the final output.
Furthermore, we propose CoMEtExtended, a more generalized version of CoMEt approach, where we redefine what constitutes a promising candidate. In particular, CoMEtExtended involves an additional parameter that can be tuned to expand or prune the scope of promising candidates beyond negative cliques and negative-equivalent cliques, which allows one to achieve a better trade-off between computational efficiency and completeness at different thresholds of linear gain (see Section 4.4 for further details).
Our paper makes several key contributions: 1) We formally define multipole, a novel relationship in time series data and devise measures to quantify its interestingness. 2) We formulate a novel and computationally efficient pattern-mining approach to find most interesting multipoles in a time series dataset. 3) Further, we propose an empirical framework to evaluate discovered multipoles that includes an empirical procedure to assess the statistical significance of multipoles. 4) Using our proposed framework, we demonstrate the relevance of multipoles to two scientific domains: climate science and neuroscience.
Let denote a set of standardized (zero mean, unit variance) time series observed over consecutive timestamps. (We will use the terms ‘time series‘ and ‘variables‘ interchangeably in what follows.) Also, let X be the corresponding data matrix and let be the covariance matrix. Since all the variables (time series) are standardized, the correlation matrix and covariance matrices will be exactly the same for . We next define a few measures on a set of variables which will be eventually used to formally define a multipole.
A Normalized Linear Combination (NLC) refers to a linear combination with normalized weights. Specifically, for a given vector with , a normalized linear combination of variables in the set, , is given as .
Given a set of variables , let be the Least Variant Normalized Linear Combination (LVNLC) of variables, i.e., the NLC that has the least variance across the observations. Formally,
In linear algebra, a set , , …, of variables is said to be linearly dependent if there exists a linear combination of the variables that equals the 0 vector. Since we have subtracted the mean from our vectors (time series), any linear combination will also have a mean of zero, and thus the variance of a linear combination of is equivalent to finding the L2 norm of the linear combination. Hence, the LVNLC of finds the linear combination of that is the closest approximation to the 0 vector in terms of the L2 norm. If the linear combination of our time series (vectors) is perfectly constant, i.e., the time series vectors are linearly dependent, the variance of LVNLC will be exactly zero. In the other extreme case, when all the variables are mutually orthogonal to each other, the variance of LVNLC will be equal to 1. (This is because we have standardized the times series in to have unit variance.) Thus, variance of LVNLC can be used as an inverse indicator of the strength of linear dependence. Based on this observation, we define the linear dependence of a set as follows:
The linear dependence, , of a set of vectors is given by .
Relation between and least eigenvalue of correlation matrix:
Notice that on performing the eigenvalue decomposition of correlation matrix
of the variables in the set, the eigenvalues so obtained are equal to the variances of the projections of the data along their corresponding eigenvectors. Sincecorresponds to the direction of least variance, is nothing but the eigenvector corresponding to the least eigenvalue of . Thus, the variance of is exactly equal to the least eigenvalue of and therefore,
Before proceeding further, the following two properties of linear dependence are noteworthy:
For any set of standardized variables, .
Since we assumed all the variables to have unit variance, the sum of all the eigenvalues of the covariance matrix is exactly equal to , the number of variables. Therefore,
Further, since is positive semi-definite, we get
The linear dependence of a set is always less than or equal to that of its supersets.
Consider a superset of S and let the corresponding data matrix be . Let , where , denote the linear coefficients corresponding to the LVNLC . Consider , where . Then clearly,
Let be the LVNLC for set . Then, by definition of LVNLC and the above equality, we get
Although linear dependence indicates a strong relationship among the variables, it does not exclude the presence of irrelevant variables in the set. For instance, let be a set of linearly dependent variables with the linear relation being . Although the four variables are linearly dependent, is an irrelevant variable and can be pruned from without weakening the linear dependence among remaining variables. Hence, to avoid irrelevant variables in the pattern, we next propose a measure called linear gain that checks the minimum contribution from all member variables to the linear dependence of the set.
The linear gain of a set with is measured as the gain in the linear dependence of with respect to one of its proper subsets that has strongest linear dependence. Mathematically, we can write linear gain of as
From Lemma 2, we get that the linear dependence of a set is always greater than that of its subsets, which implies that the linear gain of a set will always be positive. Furthermore, we can say that the subset with strongest linear dependence will be of size . Thus, the linear gain can be more precisely written as
Higher values of linear gain imply that a more significant drop in linear dependence would be observed if any one of the variables are excluded from the set, thereby ensuring that no irrelevant variables are included. Furthermore, a high threshold on linear gain will avoid redundancies in the set. For instance, in the example from traffic data described in previous section, suppose we insert into the set , where comes from a sensor close to the sensor for , on the the same road. Since is almost a duplicate of , then the linear dependence of the resultant set would be almost , since there would exist a linear combination with near perfect linear dependence. However, that would also imply that many of its subsets, e.g. , would have near perfect linear dependence. Hence by definition, the linear gain of will be very close to zero. More generally, a high threshold on linear gain will also avoid multicollinearity in the set. For instance, consider a set that consists of two independent subsets and of perfectly linearly dependent variables. By definition, the linear gain of such a set will be 0, and hence will be discarded.
Using the above definitions, we next present the formal definition of a multipole.
A multipole refers to the set of variables with such that and , where and are user-specified thresholds.
We next define the notion of maximality in a multipole.
In a set of multipoles, a multipole is considered to be maximal if none of its supersets are in .
A maximal multipole is likely to capture the underlying signal more comprehensively compared to its subsets. Hence, all non-maximal multipoles could potentially be pruned in the final output of the search.
Using the above definitions, we formulate the multipole-discovery problem as the following:
Given and , find the set of all maximal multipoles in a given time series dataset.
In this section, we present an overview of different techniques that have been applied to study relationships between two or more variables and discuss their similarities and differences with multipoles.
Eigenanalysis based-approaches such as Principal Component Analysis (PCA)[9, 10]17] commonly focus on finding linear combinations of variables that capture the dominant global signals of variability in the data. Thus, they are interested in the largest eigenvalues and often treat the smaller eigenvalues as noise and discard them. If we consider a dataset of the traffic time series at all roads in the city and apply PCA on it, it is expected to capture dominant global patterns of variability such as work-home traffic patterns, patterns influenced by social events, etc. In contrast, multipoles capture the least-variant local linear dependencies among small subsets of vectors. (Roads in this example.) This corresponds to subsets of vectors whose least eigenvalue (of the correlation matrix) is small, i.e., close to 0.
Regression models, such as ordinary least squares (OLS) and its regularized variants , e.g. LASSO, are used to find a linear combination of independent variables that predicts the given dependent variable with high accuracy [6, 7]. The independent variables can therefore be considered as showing a strong linear dependence with the dependent variable. However, such techniques do not have a notion of gain and thus are not designed to find multipole relationships. To illustrate, consider once again the highway time series example. If we use LASSO to find a set of predictors for the highway time series , LASSO would always include the time series that is most strongly correlated with as the predictor, say , collected at another sensor on the same highway. Consequently, it will miss all multipole patterns that include but not , many of which otherwise could be capturing interesting and non-trivial relationships of with distant road stations.
Error-In-Variables Models: Error in Variables (EIV) models are a special class of regression models that account for uncertainties in the measurements of both dependent and independent variables (unlike standard regression models, which assume that independent variables are measured accurately). Like EIV models, the definition of multipoles does not create any distinctions among the participating variables. One of the multivariable linear EIV models, named Total Least Squares (TLS) has striking similarities with the proposed definition of multipoles . In particular, TLS focuses on learning a linear combination of a given set of dependent and independent variables to minimize the joint residual error in all the variables, which is exactly same as finding a linear combination of variables with highest linear dependence. Like multipoles, the solution to TLS is obtained by computing the eigenvector corresponding to the least eigenvalue of the covariance matrix of the given set of variables. TLS does not have a notion of linear gain, although it can be shown that a given set of variables obtains a high linear gain only if 1) TLS obtains a unique solution, and 2) all the regression coefficients obtained in the solution of TLS are significantly higher than zero in magnitude. Thus, applying TLS on a given set of variables could be an alternative approch to evaluate the goodness of a multipole relationship formed between the variables of the set. However, TLS does not provide any approach for searching through a large set of variables, e.g., the time series that capture temperature on the Earth’s surface. Thus, it cannot not be used as a tool to find all the subsets of variables forming multipole patterns from a larger dataset.
Another stream of related work in machine learning literature is that of structure learning methods that learn the structure of stochastic dependencies among variables in a dataset in the form of a graphical model calledMarkov network[11, 12], which is a graph where each node represents a variable and follows the pairwise Markovian property, according to which it is independent of any non-neighboring node in the network conditioned on all of its neighboring nodes. Markov networks are typically studied to infer the conditional independence between different subsets of variables using various statistical inference techniques . In contrast the multipole relationship focuses on mutual dependence.
Correlation Networks: Linear relationships in time series data have also been studied in past using correlation networks, where each time series represent a node, and the weight of an edge between any two nodes represents the strength of the linear correlation between the corresponding time series. Correlation networks have been used in past for studying a variety of patterns, the most popular being ‘community‘, which refers to a group of nodes (time series) with strong mutual positive correlations [13, 14]. If considered as a potential multipole, a community would have very low linear gain since its time series are highly similar, i.e., they show considerable collinearity. In contrast, time series in a multipole with high linear gain cannot be highly similar.
Some works, including our own , have further studied pairs of negatively correlated communities, which form dipoles. Multipoles often have negative correlations among vectors—see Section 4.1. Nonetheless, those links can be relatively weak, i.e., not meaningful dipoles. Further, there is no guarantee that a dipole will show up as part of a multipole pattern. We also recently defined tripoles , but a multipole is not a generalization of a tripole. A tripole consists of a root and a pair of leaf time series, such that the sum of the leaf time series shows much stronger correlation with the root compared to either of their individual correlations with the root. Thus, as with regression, one of the variables (root or dependent variable) has a special role, which is not the case for multipoles. Further, tripoles are restricted to only one linear combination (i.e. sum of leaves), whereas multipoles allow arbitrary normalized linear combination to attain linear dependence. More importantly, there does not seem to be a way to generalize the tripole concept beyond three time series in a way that would facilitate efficient search for such patterns in a large data sets.
The combinatorial aspect of the problem makes it extremely challenging to come up with an approach that is both computationally efficient and guarantees completeness of the search. A brute-force approach would examine all subsets of size , varying from to
, the total number of time series in the dataset. While such an approach guarantees completeness of the search, it will easily become computationally infeasible even for very small datasets due to its exponential time complexity. To give an estimate, one of our real-world datasets is quite small in size and has only 171 time series. However, performing a brute-force search on a regular desktop over all subsets of i) size 4 takes about 4 hours, ii) size 5 takes more than 5 days, and so on for subsets of size beyond 5.
In this paper, we propose a correlation graph-based approach to capture most interesting multipoles in a computationally efficient manner. Our approach is primarily motivated by some empirical observations that indicate a direct relationship between the linear gain of a set and the strengths of pairwise correlations between the members of the set. Leveraging these observations, our approach identifies and restricts the search for multipoles to a family of subsets, which we refer to as ‘promising candidates‘ that are more likely to exhibit multipole relationships with stronger linear gain between their members. Such a family of subsets are usually much rarer in the data, thereby contributing to the remarkably high computational efficiency of our approach, although with some loss of completeness in the final output of multipoles.
Before describing our approach in detail, we first present the empirical observations that we will use to identify promising candidates.
One of our key empirical observations relates the maximum pairwise correlation of two variables in a multipole to the linear gain. To more easily demonstrate that connection, we need to first define the notion of self-cancellation on a set.
A set is said to be self-canceling if all the weights in its LVNLC are non-negative.
Any given set of variables can be converted into its self-canceling version , by flipping the signs of all the members in that have negative weights in LVNLC of . Note that flipping the signs of one or more variables does not affect the eigenvalues of their correlation matrix. Hence, and will have exactly the same linear gain. This self-canceling version represents a canonical form.
Let denote the highest pairwise correlation observed in the self-canceling version of . Then, for any set . we observed an empirical relationship between and linear gain . Specifically, Figure 3 shows scatter plots between (X-axis) and (Y-axis) for more than a million correlation matrices of sizes for . Each of these matrices were generated by sampling all pairwise correlations between
using a uniform distribution. Among the generated matrices, only the ones that satisfy positive semi-definiteness were considered to be valid correlation matrices. The implementation of the procedure can be accessed at this URL111https://github.com/15saurabh16/Multipoles.
From Figure 3, we make two key observations:
First key observation: The linear gain of a set tends to be higher when all the pairwise correlations in are negative. More precisely, for all sets of size that form multipoles with linear gain at least , there exists an upper bound on . Hence, if we are interested in finding multipoles with linear gain at least , we can potentially discard all those sets that have at least one pairwise correlation larger than in their self-canceling versions.
Second key observation: The maximum possible linear gain of a multipole of size is . (In Figure 3, we can see that for equal to 3,4, and 5, the maximum linear gain is 0.5, 0.33, and 0.25, respectively.) The maximum linear gain corresponds to the set where all the pairwise correlations in its self-canceling version equals . This implies that the linear gain is smaller for larger multipoles. Hence, if we are only interested in finding multipoles with linear gain at least , we can safely ignore all sets beyond of size . Both of the above two observations were found to empirically hold true also for sets of sizes beyond 5.
Consider a set of variables. Let be the self-canceling set obtained from . Based on the observations presented in previous subsection, we next define a promising candidate for a multipole as the following:
A set is said to be a promising candidate if i) , where is the user-specified threshold on linear gain, and ii) maximum pairwise correlation in .
Recall from section 4.1
that flipping the signs of one or more variables in a set does not alter the eigenvalues of the underlying correlation matrix. Therefore, all promising candidates belong to a family of sets that are transformable into a set where all pairwise correlations are negative. Such a family of sets can be classified into two types: i) Negative Cliques, and ii) Negative Equivalent Cliques.
A negative clique refers to a set where all the pairwise correlations between its members are negative.
The terminology is motivated from the appearance of such sets in a correlation graph, where each vertex represents a variable and the weight of an edge is equal to . Such sets would appear as a clique of negative edges in the correlation graph. It can be shown that for all negative cliques, , and hence satisfy the requirement of all pairwise correlations in to be negative. Therefore, a negative clique is a promising candidate.
A negative-equivalent clique refers to a set that can be transformed into a negative clique by flipping the signs of one or more variables in .
Since every negative-equivalent clique can be transformed into a negative clique using sign-flipping operations, and negative cliques are promising candidates, hence every negative-equivalent clique is also a promising candidate.
All negative-equivalent cliques can be identified using the following lemma.
A set is a negative-equivalent clique iff it can be partitioned into two negative cliques and such that the all the cross correlations between members of and are non-negative.
Consider a negative clique , where all the pairwise correlations are negative. Let be partitioned arbitrarily into non-empty subsets and . Hence, and will also be negative cliques. Without loss of generality, let us flip the signs of all variables in to obtain . Due to this transformation, all the cross-correlations between and will flip the signs and hence will become non-negative. Furthermore, every pairwise correlation within remains negative, since . Also correlations within remain negative. As a result, the negative equivalent set can be partitioned into two negative cliques and with all non-negative cross-correlations between them. Since was generated using an arbitrary partitioning of an arbitrary negative clique , this holds true for all negative equivalent cliques.
To prove converse, consider a set that can be partitioned into two negative cliques and such that all pairwise cross-correlations between and are non-negative. Now, let us flip the signs of all variables in any one of the two partitions say (w.l.o.g) to form . Then by similar arguments as above, all pairwise correlations within and will remain negative, while the cross correlations between and flip their signs from positive to negative. Hence, all the pairwise correlations in the transformed set are negative, which implies that it is a negative clique, and hence, is a negative equivalent clique.
Leveraging the above empirical observations and the concept of promising candidates discussed in previous section, we propose our Clique Based Multipole SEarch (CoMEt) to find multipoles. The central idea of CoMEt is to find all promising candidates for multipoles (negative cliques and negative-equivalent cliques) and then check each of them to obtain true multipoles. To find all promising candidates, we first construct a graph such that every promising candidate forms a clique in it. We then obtain all maximal promising candidates by enumerating all the maximal cliques of the constructed graph. Note that maximal clique-enumeration problem is NP-complete in general. However, the cliques of our interest tend to be rare in the graphs generated from real-world datasets, which allows us to recover most of the promising candidates in much less computing time. For each maximal clique obtained, we examine its subcliques and select all those that form true multipoles. Finally we eliminate all duplicate and non-maximal multipoles to obtain the final set of maximal multipoles.
Algorithm 1 summarizes the CoMEt approach. We begin by finding all maximal promising candidates in line 3, viz. maximal negative cliques and maximal negative-equivalent cliques using Algorithm 2. Among the obtained maximal promising candidates, we then obtain all the multipoles in line 3 using Algorithm 3. Finally, in line 5, we eliminate duplicate and non-maximal multipoles using Algorithm 4. We next describe each of the modules used in the different steps of CoMEt.
This module is used to find all maximal promising candidates in the data that include all negative and negative-equivalent cliques. Algorithm 2 summarizes this module. The key idea is to construct a correlation graph (network) (lines 1-4) such that a set of nodes of size in would form a clique iff it is a negative clique or negative-equivalent clique. To accomplish that, we first construct two identical graphs and , such that for both graphs, a set of nodes will form a clique iff it is a negative clique. To obtain such graphs, we begin by creating a set of nodes such that each node corresponds to a time series in the given dataset . For any pair of nodes , an edge is drawn iff . Let denote the set of all such edges. Then is the desired correlation graph where a clique will be formed among a set of nodes iff their corresponding variables in form a negative clique. Similarly, an identical correlation graph can be constructed on a set of nodes , where each node corresponds to a time series in the given dataset .
Next, to include edges for non-negative correlations in all negative-equivalent cliques, we construct a set of cross-edges between nodes of and . Specifically, we connect each node to all the nodes such that . As a result of this operation, for any negative-equivalent clique , where and are its two negative subcliques, all the non-negative correlations across and are now included. Hence, in the resultant graph , every negative-equivalent clique will also appear as a clique. Also note that every clique of size in would be a promising candidate; it would either be a negative clique (if or ), or a negative-equivalent clique (if it includes nodes from both or ).
From the resultant graph , all the maximal promising candidates could be obtained by enumerating all maximal cliques in graph using any of the standard clique-enumeration algorithms. In this work, we used an efficient algorithm proposed in  (line 11) to enumerate maximal cliques in sparse graphs (implementation provided by authors in ).
Note that every maximal promising candidate will result in formation of two maximal cliques in . For instance, every negative clique will form two cliques: one in and each. Similarly, a negative-equivalent clique , where and are two negative sub-cliques, will result in formation of two cliques: and in . For every maximal promising candidate, exactly one of the maximal cliques is retained (line 12). Finally, the set of retained maximal cliques is returned as the set of all maximal promising candidates.
This procedure is applied to each of the obtained maximal promising candidates to extract all multipole relationships. As summarized in Algorithm 3, we begin by checking if the given candidate forms a multipole by comparing its linear dependence and linear gain with user-specified thresholds and , respectively. If so, then is added to the set of discovered multipoles and we move on to the next promising candidate. Otherwise, it could be possible that one or more of the subsets of might form a multipole that satisfies the thresholds. However, if turns out to be lower than the threshold , then by Lemma 2, all the subsets of would also have weaker linear dependence than and thus, could be safely ignored. Therefore, only if , do we perform an exhaustive search on all subsets of of sizes and select all the ones that satisfy the thresholds. The range of sizes of subsets is derived based on the observation made in section 4.1, which states that for a given threshold on linear gain, all sets of sizes beyond can be safely discarded.
Applying the above procedure to all maximal promising candidates might result in inclusion of several non-maximal multipoles in the output. Furthermore, a multipole could be generated multiple times from different maximal promising candidates. Hence, in the final step of CoMEt, we eliminate all non-maximal and duplicate multipoles using the module described below.
This module is called in line 5 of Algorithm 1 to eliminate all non-maximal and duplicate multipoles that are generated in the previous step. As described in Algorithm 4, we first initialize two empty sets (lines 1-2): i) that collects the final set of non-redundant multipoles, and ii) , that maintains a collection of all the subsets of multipoles that are included in the final set at any point of time. We then scan through all the multipoles in input set . For each multipole in , we first check if it is present in the set, and if not, insert it into the final output set as well as insert all the subsets of in that are not previously present in it (lines 5-8). We then remove from and repeat the entire procedure in lines (4-9) until is empty. All the multipoles in the resultant set are consequently distinct and maximal.
In certain cases, the completeness and computational efficiency of CoMEt might be unsatisfactory. For instance, at lower thresholds of linear gain, the obtained set of promising candidates in CoMEt could potentially miss some of the interesting multipoles. On the other hand, at high thresholds of linear gain, there could potentially be many false positives among promising candidates, which would compromise the computational efficiency. Hence, to overcome these limitations of CoMEt, we further propose CoMEtExtended, a generalized version of CoMEt, where we redefine what constitutes a promising candidate and allow user to expand or prune the scope of promising candidates beyond negative cliques and negative-equivalent cliques in different scenarios. The ability to adjust the search space helps CoMEtExtended in achieving a better trade-off between computational efficiency and completeness at different thresholds of linear gain.
We first begin with redefining the notion of a promising candidate. Specifically, instead of enforcing that all the pairwise correlations in be negative, we require all of them to be below a threshold , where . The definitions of negative cliques and negative-equivalent cliques also need to be modified accordingly so that they can include all the promising candidates. To accomplish that, we first relax the notion of a ’negative’ and ’non-negative’ edge to ’pseudo negative’ and ’pseudo non-negative’ edge respectively and define them as the following:
For a given , a pseudo negative edge refers to the edge formed between a pair of nodes only if the correlation between the variables represented by the two nodes is less than or equal to .
For a given , a pseudo non-negative edge refer to the edge formed between a pair of nodes only if the correlation between the variables represented by the two nodes is greater than or equal to .
Using the above defined notions of pseudo-negative and pseudo non-negative edges222Note that for , all the edges with correlation strengths will be treated as both pseudo negative and pseudo non-negative edges. The overlap between the two classes of edges is essential for Lemma 4 to hold which we will discuss shortly., we redefine the notion of negative and negative equivalent cliques as pseudo-negative and pseudo negative equivalent cliques respectively as the following:
For a given , a pseudo negative clique refers to a set where all the pairwise correlations between its members are pseudo-negative, i.e., less than or equal to .
A pseudo negative-equivalent clique refers to a set that can be transformed into a pseudo negative clique by flipping the signs of one or more variables in .
A pseudo negative-equivalent clique can be identified using the following lemma which is a general version of Lemma 3.
A set is a pseudo negative-equivalent clique iff it can be partitioned into two pseudo negative cliques and such that the all the cross correlations between members of and are pseudo non-negative.
Consider a pseudo negative clique , where all the pairwise correlations are . Let be partitioned arbitrarily into non-empty subsets and . Hence, and will also be pseudo negative cliques. Without loss of generality, let us flip the signs of all variables in to obtain . Due to this transformation, all the cross-correlations between and will flip the signs and hence will become . Furthermore, every pairwise correlation within remain pseudo negative, since . Similarly, correlations within remain pseudo negative. As a result, the pseudo negative equivalent set can be partitioned into two pseudo negative cliques and with all non-negative cross-correlations between them. Since was generated using an arbitrary partitioning of an arbitrary pseudo negative clique , this holds true for all pseudo negative-equivalent cliques.
To prove converse, consider a set that can be partitioned into two pseudo negative cliques and such that all pairwise cross-correlations between and are pseudo non-negative, i.e. . Now, let us flip the signs of all variables in any one of the two partitions say (w.l.o.g) to form . Then by similar arguments as above, all pairwise correlations within and will remain , while the cross correlations between and flip their signs and will now . Hence, all the pairwise correlations in the transformed set are pseudo negative, which implies that it is a pseudo negative clique, and hence, is a pseudo negative-equivalent clique.
Using the above definitions and results, we now propose CoMEtExtended which is the more general version of CoMEt. The steps of CoMEtExtended are exactly the same as that of CoMEt, the only difference being in the lines 2 and 7 of Algorithm 2 where the threshold of 0 on pairwise correlations is replaced by and , respectively. By incorporating in thresholds, we ensure that every clique in the resultant graph is either a pseudo negative clique or a pseudo negative-equivalent clique. Note that setting to zero gives us the original definition of promising candidates, whereas setting to a higher positive value would include more sets as promising candidates and hence lead to recovery of more of the missed multipoles. However, it should be noted that setting to positive values would increase the number of candidate cliques and thus increase the computational cost. In fact, setting to 1 would turn CoMEtExtended into an exhaustive brute-force search, which guarantees completeness, but would be computationally infeasible. On the other hand, setting to a high negative value would reduce the search space and hence leads to much faster recovery of the multipoles with a very high linear gain in larger datasets.
In this section, we discuss results and computational evaluation of proposed approach CoMEtExtended. Specifically, we evaluate the completeness of CoMEtExtended against a regularized linear regression-based baseline, analyze trade-off between completeness and computational efficiency of CoMEtExtended at multiple parameter settings, study the scalability of CoMEtExtended, analyze the statistical significance of the multipoles, and evaluate their utility using real-world datasets from climate science and neuroscience domains. All experiments were run on a computer with 20 processors, each processor being Intel(R) Xeon(R) CPU E5-2470 0 running at 2.30GHz with a total shared RAM of 100 GB, running Linux version 2.6.32-696. We begin by describing all the datasets along with the data pre-processing steps that were applied to each of them.
We used monthly Sea Level Pressure (SLP) dataset provided by NCEP/National Center for Atmospheric Research (NCAR) Reanalysis Project , which is available from 1979-2014 (36 years) at a spatial resolution of 2.5 2.5 degree (10512 grid points, also referred to as locations). In this paper, we constructed SLP time series for each location using only the the months of winter season (December, January, and February) from each year, thereby resulting in 108 observations in every time series. For each of the time series, we followed the standard pre-processing steps followed in climate science to remove the annual seasonality and linear trends .
Relationships in climate datasets are preferably studied between regions (sets of spatially contiguous locations) as opposed to individual locations because of i) spatial autocorrelation, due to which locations in a spatial neighborhood have highly similar time series that will lead to discovery of redundant relationships, and ii) the fact that the relationships studied across regions are likely to be more reliable and stable over time. Therefore, we next converted the given location-based time series dataset into a region-based time series dataset using the procedure described in Algorithm 5. Specifically, treating each of 10512 locations as a centre, we grew a spatially contiguous region around it by including up to the top 50 locations that were most strongly correlated to and spatially contiguous. For each region, a representative time series is obtained as the normalized area-weighted average time series of all the included locations. Note that there would be a high degree of redundancy among the regions due to spatial autocorrelation and mutual spatial-overlaps among them. Therefore, we followed steps elaborated in lines 7-16 of Algorithm 5, to obtain a subset of representative time series of 171 non-redundant regions that are non-redundant and cover entire globe. In addition, for purposes of validation of obtained multipoles, we used monthly Hadley Center SLP (HadSLP2) observational data available for years prior to 1979 to obtain the time series of these regions.
We used neuroimaging data collected at the University of Utah as part of a reproducibility study . In this study, a set of 50 functional-Magnetic Resonance Imaging (fMRI) scans of one subject were acquired while the subject was involved in an audio-visual task (watching cartoons). Another set of 50 fMRI scans were collected from the same subject while the subject was resting. The spatial resolution and the temporal resolution of every scan was 3mm 3mm 3mm and 2 secs, respectively. A number of fMRI pre-processing steps—described in —were performed including motion correction, unwarping, and filtering. In addition, we used an Automated Anatomical Labeling Atlas , which maps grey matter locations to 90 anatomical regions, to compute a mean time series of each brain region from each scan. As a result, we obtained a set of 90 time series for each of the 100 fMRI scans. We applied our approach to find multipoles in one of the 50 audio-visual fMRI scans, while the other 49 scans were used for evaluation purposes that we will describe later in this section.
Two user-specified thresholds, minimum linear gain () and minimum linear dependence (), are needed for discovering multipoles. The choice of values for these parameters needs to be determined based on domain knowledge and the availability of computational resources to find multipoles. In particular, a relaxed linear gain threshold will increase the search space of multipoles and thus will require more computational time for search and evaluation. Similarly, a lower threshold of will result in a larger number of discovered multipoles to be analyzed further by domain experts.
In this work, we performed computational evaluation and scalability analysis at different combinations of values of and respectively. Statistical significance analysis was performed on multipoles obtained at and for both SLP and brain fMRI datasets.
|Total Multipoles in||Completeness|
|(0.4,0.2)||1264||5.14 %||99%, =|
|Total Multipoles in||Completeness|
Comparison of CoMEtExtended to another approach is difficult, since it defines local patterns in terms of linear algebra concepts and finds such patterns by searching for negative cliques in a similarity graph. None of the related works mentioned in Section 3 do the same thing. Nonetheless, to provide some comparison, we evaluate the completeness of the search of CoMEtExtended with respect to a LASSO-based baseline approach (LAB) that is described as follows:
LASSO-based baseline approach (LAB): LASSO is a variant of regularized linear regression that obtains a subset of variables from a larger set that could be linearly combined to predict the given predictand with high accuracy. Specifically, given a predictand and a set of predictors , it learns a sparse set of regression coefficients by minimizing the following objective function:
is the hyperparameter that can be tuned to control the number of non-zero regression coefficients in the solution. In this baseline, we use LASSO to find potential candidates that could form multipoles. Specifically, in a dataset ofvariables , for any variable , we consider the remaining set of variables as predictors and apply LASSO find a subset of predictors from the remaining variables that could be linearly combined to best predict . The combined set of and the predictors with non-zero regression coefficients is then added to the set of potential candidates of multipoles. For each variable , we obtain up to different solutions of LASSO by varying such that every solution yields a different number of predictors with non-zero regression coefficients. By applying the above procedure to all possible , we obtain upto potential candidates of multipoles in total. For each the obtained candidates, we next compute linear dependence and linear gain and discard all those that do not satisfy the given thresholds and .
We next evaluate the completeness of CoMEtExtended with that of LAB.The completeness of an algorithm is measured as the fraction of multipoles of a complete set (a set that includes all the multipoles present in the data) that it finds from the data. An ideal approach to generate a complete set of multipoles will be to perform an exhaustive brute-force search over all possible subsets of variables and select those subsets that show linear dependence and linear gain . However, in practice, such an approach is computationally infeasible even for small-size datasets. For instance, in case of SLP data that has 171 time series, the estimated time on our computer to examine all subsets of size 5 is more than 5 days, size 6 is more than 144 days, and so on. Hence, for the purposes of evaluation, we generated a pseudo-complete set of multipoles that includes unique and non-redundant multipoles generated by the following approaches: i) an exhaustive brute-force search over all subsets of variables upto size 4 for SLP, and 5 for fMRI dataset, ii) A random-search approach that is run for 24 hours to examine random subsets of variables of size , where for SLP, and for fMRI datasets, and select the ones that show linear dependence and linear gain , iii) LAB approach, and iv) CoMEtExtended approach, where the choice of parameters is set so as to keep its computational time within 90 minutes.
Tables I and II summarize the results of completeness evaluation on SLP and fMRI datasets respectively. The first column in both the tables indicate the total size of the pseudo-complete set of multipoles at different combinations of parameters and . Second and third columns indicate the completeness of LAB and CoMEtExtended approach. As can be seen in both tables, LAB recovers a negligible or very tiny fraction of multipoles in all the parameter settings. For instance, in the SLP dataset, LAB recovers less than 13.3% of total multipoles in pseudo-complete set in all the parameter settings. The completeness is negligible () for cases where . Further, in the fMRI dataset, LAB is able to recover only one multipole for cases where , whereas for the remaining other cases where , LAB is unable to find any multipole. This strongly indicates that linear regression-based approaches like LASSO are not suitable for finding multipoles in the data.
In contrast, the completeness of CoMEtExtended is more than 80 % for all the parameter settings. Moreover, the completeness of CoMEtExtended increases at higher thresholds of linear gain and is close to 100 % for , which shows that our approach is less likely to miss multipoles with high linear gain. This is in concordance with our empirical observations made in Figure 3, according to which, whenever linear gain is high, the largest pairwise correlation in the self-canceling version of all of the multipoles tends to be much lower and approaches strong negative values. As a result, they are more likely to be captured as promising candidates by our approach. Further, by varying parameter in CoMEtExtended, one could achieve a desired trade-off between completeness and efficiency at various parameter settings. (See Supplemental material for detailed analysis).
In summary, CoMEtExtended outperforms the LASSO-based baseline approach in completeness at different thresholds on linear dependence and linear gain. Further, it is much more efficient and relatively complete in finding multipoles at higher thresholds of linear gain, compared to brute-force and the LASSO-based baseline approach.
As is common with many pattern finding techniques, such as frequent pattern mining, the CoMEt algorithm is inherently exponential. (For a detailed analysis of the the time complexity of the the three parts of the algorithm, see the Supplemental material.) However, as with association analysis, adjusting the parameter settings—in this case, and —can make the pattern search quite tractable in many situations. To illustrate, we next discuss the scalability of our approach using SLP data, i.e., how does the computational time vary with the size of datasets (number of time series). To obtain datasets of different sizes, we first generated 10 additional seasonal datasets for different seasons, each season being a set of three consecutive months: ((Jan.,Feb.,Mar.), (Feb.,Mar.,Apr.) ,…(Oct. Nov. Dec.)). Each dataset consists of time series from the same 171 regions that we chose for our original SLP dataset. We then generated 10 datasets of sizes 171*k, where by merging k of the above 10 seasonal datasets.
Figure 4(a) shows the total computational time of CoMEtExtended on all of the above datasets, at and for different values of in range . As can be seen in figure, the computing time increases with the increase in the size of the datasets at different rates depending on . For stronger negative values of , the computing time increases almost linearly with the increase in size of datasets, which highlights its scalability. However, as approaches zero, the scalability is weak, and the computing time increases dramatically for bigger datasets. Similar observations are also made at other parameter settings, e.g. at and , as shown in Figures 4(b) and 4(c) respectively. The observed loss in scalability could be attributed to the typical distribution of pairwise correlations in the correlation graph of any time series dataset, as shown in Figure 4(d) for one of the SLP datasets. The distribution is bell-shaped with major fraction of edges having strengths close to zero in magnitude. Consequently, as approaches zero, the number of cliques found in Step 1 of the algorithm increase exponentially. Note that many of the cliques that have all or most of the edges being weak are expected to have weak linear dependence among their variables, and hence are unlikely to form a multipole. By setting to stronger negative values, we avoid such cliques and save a lot of computing time. However, that also leads to missing some of the interesting cliques where only one or two edges were close to zero.333The exact number of missing cliques couldn’t be computed due to the absence of ground truth and computational intractability of brute-force approach
Such cliques could potentially be recovered by heuristic approaches, which could be an interesting direction to pursue for future work.
One of the key challenges of this work is distinguishing between reliable and spurious multipoles, i.e., those multipole patterns that arise due to random variation, from the large number of discovered multipoles. Domain validation is an ideal approach to evaluate multipoles, but most of the multipole relationships discovered in this work are currently unknown to domain scientists. Thus, in our work, we used an empirical evaluation framework that consists of two steps. The first step involves a procedure for estimating the statistical significance of a multipole. This procedure is then used in the second step to assess the reproducibility of a discovered multipole in multiple datasets, where each dataset is collected during a time period different from that of original dataset used for finding multipoles. Intuitively, spurious multipoles are less likely to reproduce in time periods that were not used for finding them. In contrast, multipoles that do reproduce with high statistical significance are more likely to be patterns that are outcome of a real phenomenon, and thus they would be ideal candidates for further investigation by domain experts.
To filter spurious multipoles, it is important to answer the following two questions: i) how likely is it that the observed level of linear dependence, , of a multipole is due to chance? and ii) does every member in contribute significantly to the linear dependence of ?
To address the first question, we generate a null distribution of linear dependence by randomly generating 100,000 sets of time series and evaluating each set for its level of linear dependence. Each of these randomly generated sets is created by sampling time series from different time periods. For instance, for our SLP investigations, a random set of size is constructed by sampling a time series from any of the nine time windows of HadSLP2 data. Similarly, for brain fMRI data, a random set of time series is constructed by sampling a time series from any of the 50 scans. Generating a random set in this manner is an approximation to independently generating time series while ensuring that the general underlying nature of domain time series (e.g. autocorrelation, periodicity etc.) is retained in the randomly generated data. Using the resultant null distribution, we then determine the statistical significance of the multipoles we originally found. We evaluate at a 0.01 level of significance.
We next describe our approach to assess the second point, i.e, the significance of the contribution of each of the variables in a given set to its linear dependence. Specifically, to assess the significance of the contribution of time series , we replace it with a random time series that is sampled from an independent dataset in a manner similar to the procedure above. We then compute the linear dependence of the resultant set, which we call . If the contribution of is not spurious, it would be unlikely for a randomly chosen time series to replicate it in which case, . We repeat the above process 1000 times and compute the fraction of the population for which holds true. This fraction is our significance level. We again use a significance level of 0.01. The above procedure is repeated for each of the members.
In this step, we estimate the reproducibility of a given multipole in multiple time periods. A multipole is considered to be reproducible in a dataset , if, at a 0.01 level of significance, it is found to have statistically significant linear dependence, as well as a statistically significant contribution from each of its members. Specifically, for each multipole discovered in 1979-2014 SLP dataset, we computed its linear gain and linear dependence in HadSLP2 data in 9 time windows 1901-1936, 1906-1941,…,1941-1976. Likewise, for multipoles discovered in one of the 50 brain fMRI scans, we studied their reproducibility in the remaining 49 scans.
Figure 5(a) shows number of multipoles reproduced in different numbers of HadSLP2 time windows during 1901-1976. More than 40% of multipoles reproduced in all 9 time windows. Similarly, Figure 5(b) shows number of multipoles from fMRI data reproduced in different number of fMRI scans taken while the subject was watching a video. At least 25% of multipoles reproduced in more than 10 other scans. Higher reproducibility of multipoles suggests that they are more robust to noise in the data and unlikely to be spurious.
Results discussed in the previous section indicate the existence of several multipole relationships with high reproducibility in multiple independent time periods, which makes a compelling case for their connection to underlying physical phenomena that might be currently unknown to domain scientists, but could potentially be discovered by domain experts upon further analysis. In this section, we present case studies on the physical interpretation of two of the discovered multipoles in SLP and brain fMRI data.
One of the multipoles in SLP data was found between the four regions shown in Figure 7. The time series of the four regions show negative correlations with each other, resulting in a multipole relationship with linear dependence of and a linear gain of during 1979-2014. Further, as indicated in Figure 7(a), the multipole was found to be reproducible in 7 out of 9 time windows during the period of 1901-1976, showing strong linear dependence (red curve) and linear gain (indicated by gap between red and black curves). This multipole appears to be strongly related to the well-known climate phenomenon known as the El-Nino Southern Oscillation (ENSO) as indicated by regions and . This phenomenon appears not only in the tropical Pacific Ocean, but also has large-scale impacts on regional climate outside the tropics . More recently, a connection between the West Siberian Plain and ENSO was discovered as a tripole relationship (as defined in ) between three regions, that are re-captured as regions , , and in above multipole. This phenomenon was attributed to a wave train that originates from the sub-tropical Atlantic and propagates north-eastwards towards the north of the West Siberian Plain, where it is deflected southeastwards and reaches all the way to the central Pacific Ocean, where the two centres of action of ENSO are located. Notably, region in the northern Atlantic Ocean is located near the proposed location of origin of the wave train. This finding can be further used to study the detailed path of the wave train and to attribute weather and climate characteristics over wave-affected regions to their potential ENSO origins.
Evaluation of Climate Models: Multipoles, being potential representatives of physical processes, could serve as useful benchmarks to evaluate various climate models that are often used to study climate change under different greenhouse gas emission scenarios. In particular, climate models can be evaluated based on their ability to reproduce the physical processes represented by these multipoles. For instance, Figure 8 compares the linear dependence and linear gain of the above multipole across multiple time windows obtained in observations data (HadSLP2) and a couple of climate models used in the IPCC (Intergovernmental Panel on Climate Change) CMIP5 (Couple Model Intercomparison Project Phase 5) evaluation. It can be seen that climate models differ in their ability to simulate the multipole effectively. Specifically, climate model MPI-ESM-MR is able to reproduce multipole with statistically significance in at least 4 time windows, whereas the other model BNU-ESM could not reproduce the given multipole at all.
The notion of multipoles proposed in this paper is highly suited for capturing complex signaling relationships in the brain. For example, one of the multipoles we discovered in one of the brain fMRI scans captures a relationship between three brain regions: Right Frontal Inferior (), Right Parietal Inferior(), and Right Temporal Pole Superior(). This multipole was found to be reproducible in 23 out of 50 ‘task‘ scans (collected while the subject was watching a cartoon video) while only in 5 of the 50 other ‘resting state‘ scans (collected while subject was resting). This relationship is interesting for multiple reasons. First, the parietal and temporal regions are known to be the first recipients of the video and audio stimuli , respectively, that are presented to the subject, in our case in the form of cartoons. Second, the dominance of the above multipole relationship in task scans in comparison to the resting scans provides strong evidence for signaling from parietal and temporal regions to the frontal region, that appears to be triggered due to visual and auditory stimuli. While some existing studies  have observed activity in the frontal region due to visual and auditory stimuli, no direct signaling pathway between visual and auditor cortices with the frontal region is established. In summary, the multipole framework could serve as a promising tool for discovering previously unknown signaling pathways that exist in the brain.
In this paper, we introduced and formally studied a novel class of multivariate linear relationships called multipoles in time series data. A multipole corresponds to a set of time series for which the least eigenvalue of their correlation matrix is much lower than that of any of its subset of time series. We proposed a novel and computationally efficient correlation network-based approach to find multipoles in the data. In particular, we identified the utility of negative cliques in a correlation network. Furthermore, we demonstrated the utility of multipoles in discovering novel physical processes in neuroscience and climate science. One of the most useful extensions of this work could be to extend the notion of multipoles to non-linear relationships and other similarity measures. Another generalization could be time-lagged mutipole-relationships.
We would like to thank Siddhant Agrawal, Department of Mathematics, University of Michigan for invaluable discussions. This work was supported by NSF grants IIS-1029771 and IIS-1319749 and NASA grant 14-CMAC14-0010. Access to the computing facilities was provided by the University of Minnesota Supercomputing Institute.
Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 6, no. 3, pp. 158–179, 2013.