1 Introduction
Nonstationary processes are the rule rather than the exception in many scientific disciplines such as epidemiology, biology, sociology, economics, or finance. In recent years, there has been a surge of interest in the analysis of problems described by large sets of interrelated variables with few observations over time, often involving complex nonlinear and nonstationary behaviours. Examples of such problems include the longitudinal spread of obesity in social networks (christakis2007spread), disease modelling from time-varying inter- and intracellular relationships (barabasi2011network), behavioural responses to losses of loved ones within social groups (bond2017complex), and the linkage between climate change and the global financial system (battiston2017climate). All such analyses rely on the statistical assessment of the similarity between, or the relationship amongst, noisy time series that exhibit temporal memory. The ability to test the statistical significance of homogeneity and dependence between random processes that cannot be assumed to be independent and identically distributed (i.i.d.) is therefore of fundamental importance in many fields.
Kernel-based methods provide a popular framework for homogeneity and independence tests by embedding probability distributions in reproducing kernel Hilbert spaces (muandet2017kernel, Section 2.2). Of particular interest are the kernel-based two-sample statistic maximum mean discrepancy (mmd) (gretton2007kernel), which is used to assess whether two samples were drawn from the same distribution, hence testing for homogeneity; and the related Hilbert-Schmidt independence criterion (hsic) (gretton2008kernel), which is used to assess dependence between two random variables, thus testing for independence. These methods are nonparametric, i.e., they do not make any assumptions on the underlying distribution or the type of dependence. However, in their original form, both mmd and hsic assume access to a sample of i.i.d. observations, an assumption that is often violated for temporally dependent data such as random processes. Extensions of mmd and hsic to random processes have been proposed (besserve2013statistical; chwialkowski2014wild). Yet, these methods require the random process to be stationary, meaning that its distribution does not change over time. Whilst it is sometimes possible to approximately achieve stationarity with preprocessing techniques such as (seasonal) differencing or square root and power transformations, such approaches become cumbersome and notoriously difficult, particularly with large sets of variables. The stationarity assumption can therefore pose severe limitations in many application areas where multiple nonstationary processes must be taken into consideration. When studying the relationships of climate change to the global financial system, for example, factors such as greenhouse gas emissions, stock market indices, government spending, and corporate profits would have to be transformed or assumed to be stationary over time.
In this paper, we show how the kernel-based statistics mmd and hsic can be applied to nonstationary random processes. At the heart of our proposed approach is the simple, yet effective idea that realisations of a random process in the form of temporally dependent measurements (i.e., the observed time series) can be viewed as independent samples from a multivariate probability distribution, provided that they are observed at the same points in time, i.e., over the same temporal grid. mmd and hsic can then be applied to these distributions to test for homogeneity and independence, respectively.
The remainder of this paper is structured as follows. After discussing related work in section 2, we introduce our applications of two-sample and independence testing with mmd and hsic to nonstationary random processes in section 3. We then carry out experiments on multiple synthetic datasets in section 4 and demonstrate that the proposed tests have higher power compared with current functional or multivariate two-sample and independence tests under the same conditions. We provide an example application of our proposed methods to a socioeconomic dataset in section 5 and conclude the paper with a brief discussion in section 6.
2 Related work
Two-sample and independence tests on stochastic processes have been widely studied in recent years. Under the stationarity assumption, besserve2013statistical investigate how the kernel cross-spectral density operator may be used to test for independence, and chwialkowski2014wild formulate a wild bootstrap-based approach for both two-sample and independence tests, which outperforms besserve2013statistical in various experiments. The wild bootstrap in chwialkowski2014wild approximates the null hypothesis by assuming there exists a time lag beyond which pairs of measurements are independent. This method is applicable to test for instantaneous homogeneity and independence in stationary processes, but requires further assumptions to investigate noninstantaneous cases: a maximum lag must be defined as the largest absolute lag for the test. This results in multiple hypothesis testing requiring adjustment by a Bonferroni correction. Further, davis2018applications have applied distance correlation (szekely2007measuring), a hsic-related statistic, to independence testing on stationary random processes.
Beyond the stationarity assumption, two-sample testing in the functional data analysis literature has mostly focused on differences of mean (Horvth2012) or covariance structures (FREMDT2012; Panaretos2010). However, pomann2016two have developed a two-sample test for distributions based on generalisations of a finite-dimensional test by utilising functional principal component analysis, and wynne2020kernel have derived kernels over functions to be used with mmd for the two-sample test. Independence testing for functional data using kernels was recently proposed in Grecki2018, but assumes the samples lie on a finite-dimensional subspace of the function space, an assumption not required in our work. Moreover, zhang2018large have developed computationally efficient methods to test for independence on high-dimensional distributions and large sample sizes by using eigenvalues of centred kernel matrices to approximate the distribution under the null hypothesis instead of simulating a large number of permutations.
3 mmd and hsic for nonstationary random processes
3.1 Notation and assumptions
Let X and Y denote two nonstationary stochastic processes with probability laws P_X and P_Y, respectively. We assume that we observe n_x independent realisations of X and n_y independent realisations of Y in the form of time series measured at T_x and T_y time points, respectively. Said differently, the data samples are a set of nonstationary time series arriving over the same temporal grid of length T_x, and similarly for Y with grid length T_y. Note that the measurements are not independent across time.^1
^1 We use the terms 'sample' and 'realisation' interchangeably to denote the observed time series, and use the term 'measurement' to denote the temporally dependent vectors of which they are composed.
We may view the realisations of X and Y as samples of multivariate probability distributions of dimension T_x and T_y, respectively, which are independent at any given point in time. Consequently, we can represent these distributions by their mean embeddings \mu_X and \mu_Y in rkhss, and use these to conduct kernel-based two-sample and independence tests. Given a characteristic kernel, i.e., one whose mean embedding captures all information of a distribution (sriperumbudur2010hilbert), the dependence between measurements in time is captured by the ordering of the variables, and the mapping of any probability distribution into the rkhs is unique, since the mean embedding of a characteristic kernel is injective (sriperumbudur2011universality).
For homogeneity testing (H_0: P_X = P_Y), we use the kernel-based mmd statistic and require an equal number of measurements, T_x = T_y, but allow different sample sizes, n_x and n_y. For independence testing (H_0: P_XY = P_X P_Y), we employ the related hsic, and in this case the number of measurements can differ, but we require the same number of realisations, n_x = n_y. We now describe how two-sample and independence tests can be performed under these assumptions.
3.2 mmd for nonstationary random processes
Let k be a characteristic kernel, such as the Gaussian kernel k(x, x') = \exp(-\lVert x - x' \rVert^2 / (2\sigma^2)), which uniquely maps P_X and P_Y to its associated rkhs \mathcal{H}_k via the mean embeddings \mu_X and \mu_Y (muandet2017kernel, Section 2.1). The mmd between P_X and P_Y in \mathcal{H}_k is defined as (gretton2007kernel):
\mathrm{mmd}^2(P_X, P_Y) = \lVert \mu_X - \mu_Y \rVert_{\mathcal{H}_k}^2 (1)
Given samples \{x_i\}_{i=1}^{n_x} \sim P_X and \{y_j\}_{j=1}^{n_y} \sim P_Y, \mathrm{mmd}^2 can then be approximated by the following unbiased estimator (gretton2007kernel):
\widehat{\mathrm{mmd}}^2 = \frac{1}{n_x (n_x - 1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n_y (n_y - 1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n_x n_y} \sum_{i, j} k(x_i, y_j) (2)
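A minimal sketch of the estimator in (2), together with the permutation-based threshold described in the following paragraph, might look as follows. The Gaussian kernel, bandwidth, and array layout (rows are realisations, columns are time points) are illustrative assumptions.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X (n x T) and
    Y (m x T) under a Gaussian kernel. Each row is one realisation,
    i.e. a T-dimensional vector of temporally ordered measurements."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth**2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    np.fill_diagonal(Kxx, 0.0)  # drop i == j terms (unbiasedness)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()

def mmd_permutation_test(X, Y, alpha=0.05, n_perm=200, bandwidth=1.0, seed=0):
    """Approximate the null distribution by randomly re-partitioning the
    pooled realisations, then compare the observed statistic to the
    empirical (1 - alpha)-quantile of the permuted statistics."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    n = len(X)
    stat = mmd2_unbiased(X, Y, bandwidth)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(mmd2_unbiased(pooled[idx[:n]], pooled[idx[n:]], bandwidth))
    threshold = float(np.quantile(null, 1 - alpha))
    return stat, threshold, stat > threshold
```

Since each realisation is treated as one high-dimensional vector, temporal dependence within a realisation poses no problem: independence is only required across realisations.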
Henceforth, we drop the implied arguments (P_X, P_Y) for ease of notation.
Using \widehat{\mathrm{mmd}}^2 as a test statistic, one can construct a statistical two-sample test of the null hypothesis H_0: P_X = P_Y against the alternative hypothesis H_1: P_X \neq P_Y (gretton2012kernel). Let \alpha be the significance level of the test, i.e., the maximum allowable probability of falsely rejecting H_0 and hence an upper bound on the type-I error. Given \alpha, the threshold for the test statistic can be approximated with a permutation test as follows. We first generate randomly permuted partitions of the set of all realisations with sizes commensurate with n_x and n_y. We then compute the statistic on each permuted partition and sort the results in descending order. Finally, we select the statistic at the position determined by the significance level as our empirical threshold. The null hypothesis is then rejected if the observed statistic exceeds this threshold. For a computationally less expensive (but generally less accurate) option, the inverse cumulative density function of the Gamma distribution can be computed to approximate the null distribution (gretton2009fast).
3.3 hsic for nonstationary random processes
Let P_{XY} denote the joint distribution of X and Y, and let \mathcal{H}_k and \mathcal{H}_l be separable rkhss with characteristic kernels k and l, respectively. hsic is then defined as the mmd between P_{XY} and the product of marginals P_X P_Y (gretton2008kernel):
\mathrm{hsic}(X, Y) = \lVert \mu_{P_{XY}} - \mu_{P_X} \otimes \mu_{P_Y} \rVert_{\mathcal{H}_k \otimes \mathcal{H}_l}^2 (3)
Here, \otimes denotes the tensor product. Recall that we assume an equal number of realisations n = n_x = n_y for both processes, and let K and L be the kernel matrices with entries K_{ij} = k(x_i, x_j) and L_{ij} = l(y_i, y_j), respectively. Given i.i.d. samples \{(x_i, y_i)\}_{i=1}^{n} \sim P_{XY}, an unbiased empirical estimator of hsic is given by (song2012feature, Theorem 2):
\widehat{\mathrm{hsic}} = \frac{1}{n(n-3)} \left[ \mathrm{tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^\top \tilde{K} \mathbf{1} \, \mathbf{1}^\top \tilde{L} \mathbf{1}}{(n-1)(n-2)} - \frac{2}{n-2} \mathbf{1}^\top \tilde{K} \tilde{L} \mathbf{1} \right] (4)
where \tilde{K} and \tilde{L} equal K and L with their diagonal entries set to zero, and \mathbf{1} is the vector of ones. To ease our notation, we henceforth omit the implied arguments X and Y.
To test for statistical significance, we define the null hypothesis H_0: P_{XY} = P_X P_Y and the alternative H_1: P_{XY} \neq P_X P_Y. We broadly repeat the procedure outlined in section 3.2 by bootstrapping the distribution under H_0 via permutations, with the distinction that we only permute the samples of one process whilst those of the other are kept unchanged (gretton2008kernel). The statistic is then computed for each permutation, and the empirical threshold is taken as the statistic at the position determined by the significance level. The null hypothesis is rejected if the observed statistic exceeds this threshold.
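The estimator in (4) and the permutation procedure above can be sketched as follows. The Gaussian kernels, bandwidths, and array layout are illustrative assumptions; note that only the realisations of the second process are permuted.

```python
import numpy as np

def _gaussian_gram(A, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * bandwidth**2))

def hsic_unbiased(X, Y, bw_x=1.0, bw_y=1.0):
    """Unbiased HSIC estimator of (4) for paired samples X (n x T_x) and
    Y (n x T_y); row i of X and row i of Y form one paired realisation."""
    n = len(X)
    K = _gaussian_gram(X, bw_x)
    L = _gaussian_gram(Y, bw_y)
    np.fill_diagonal(K, 0.0)  # K-tilde: zeroed diagonal
    np.fill_diagonal(L, 0.0)  # L-tilde: zeroed diagonal
    KL = K @ L
    term1 = np.trace(KL)
    term2 = K.sum() * L.sum() / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * KL.sum()  # equals 1^T K-tilde L-tilde 1
    return (term1 + term2 - term3) / (n * (n - 3))

def hsic_permutation_test(X, Y, alpha=0.05, n_perm=200, bw_x=1.0, bw_y=1.0, seed=0):
    """Permute only the realisations of Y (X kept fixed) to bootstrap the
    distribution of the statistic under independence."""
    rng = np.random.default_rng(seed)
    stat = hsic_unbiased(X, Y, bw_x, bw_y)
    null = [hsic_unbiased(X, Y[rng.permutation(len(Y))], bw_x, bw_y)
            for _ in range(n_perm)]
    threshold = float(np.quantile(null, 1 - alpha))
    return stat, threshold, stat > threshold
```

Permuting Y breaks the pairing between realisations, and hence any dependence, while leaving both marginal distributions untouched.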
3.4 Maximising the test power
The power of both mmd-based two-sample and hsic-based independence tests is prone to decay in high-dimensional spaces (ramdas2015decreasing; reddi2015high), as in our setting where each measurement point in time is treated as a separate dimension. Hence, we describe here how a kernel can be chosen to maximise the test power, i.e., the probability of correctly rejecting H_0 given that it is false. First, note that under H_1, both \widehat{\mathrm{mmd}}^2 (gretton2012kernel, Corollary 16) and \widehat{\mathrm{hsic}} (gretton2008kernel, Theorem 1) are asymptotically Gaussian:
\sqrt{n} \, (\widehat{\mathrm{mmd}}^2 - \mathrm{mmd}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{\mathrm{mmd}}^2) (5)
\sqrt{n} \, (\widehat{\mathrm{hsic}} - \mathrm{hsic}) \xrightarrow{d} \mathcal{N}(0, \sigma_{\mathrm{hsic}}^2) (6)
where \sigma_{\mathrm{mmd}}^2 and \sigma_{\mathrm{hsic}}^2 denote the asymptotic variances of \widehat{\mathrm{mmd}}^2 and \widehat{\mathrm{hsic}}, respectively (Serfling2002Atom, Section 5.5.1 (A)). Given a significance level \alpha, we define the test thresholds r_{\mathrm{mmd}} and r_{\mathrm{hsic}} and reject H_0 if \widehat{\mathrm{mmd}}^2 > r_{\mathrm{mmd}} or \widehat{\mathrm{hsic}} > r_{\mathrm{hsic}}. Following sutherland2016generative, the test power is defined in terms of the distributions under H_1, with equal sample sizes n, as:
\Pr\left( n \, \widehat{\mathrm{mmd}}^2 > r_{\mathrm{mmd}} \right) \rightarrow \Phi\left( \frac{\sqrt{n} \, \mathrm{mmd}^2}{\sigma_{\mathrm{mmd}}} - \frac{r_{\mathrm{mmd}}}{\sqrt{n} \, \sigma_{\mathrm{mmd}}} \right) (7)
\Pr\left( n \, \widehat{\mathrm{hsic}} > r_{\mathrm{hsic}} \right) \rightarrow \Phi\left( \frac{\sqrt{n} \, \mathrm{hsic}}{\sigma_{\mathrm{hsic}}} - \frac{r_{\mathrm{hsic}}}{\sqrt{n} \, \sigma_{\mathrm{hsic}}} \right) (8)
where \Phi is the cumulative density function of the standard Gaussian distribution, and where the second term inside \Phi vanishes with increasing sample size. To maximise the test power, we maximise the argument of \Phi, which we approximate by maximising \mathrm{mmd}^2 and minimising \sigma_{\mathrm{mmd}} for (7), and similarly for (8). The empirical unbiased variance estimator for (7) was derived in sutherland2016generative, and we use (song2012feature, Theorem 5) for the variance in (8). We perform this optimisation by splitting our samples into training and testing sets, of which we take the former to learn the kernel hyperparameters and the latter to conduct the final hypothesis test with the learnt kernel.
4 Experimental results on synthetic data
To evaluate our proposed tests empirically, we first apply our homogeneity and independence tests to various nonstationary synthetic datasets. We report test performance as the percentage of rejections of the null hypothesis H_0, which becomes the test power once H_0 is false, by repeating the experiments over independently generated synthetic datasets (trials). We provide the corresponding confidence intervals.
4.1 Homogeneity tests with mmd
Setup.
We evaluate our mmd-based homogeneity test against shifts in mean and variance of two nonstationary stochastic processes X and Y by establishing whether they are correctly accepted or rejected under the null hypothesis H_0: P_X = P_Y. For ease of comparison, we adopt the experimental protocol of pomann2016two and consider two stochastic processes based on a linear mixed effects model. We generate independent samples x_i and y_i on an equally spaced temporal grid of length T,
x_i(t) = \sum_{k=1}^{2} A_{ik} \phi_k(t) + \epsilon_i(t), \qquad y_i(t) = \sum_{k=1}^{2} B_{ik} \phi_k(t) + \eta_i(t) (9)
with Fourier basis functions \phi_1 and \phi_2. The coefficients A_{ik} and B_{ik} and the additive noises \epsilon_i and \eta_i are all independent Gaussian-distributed random variables with means and variances specified below.
We evaluate the test power against varying values of shifts in mean and variance as follows:
- Mean shift: the first basis function coefficients of the two processes are sampled with a shifted mean, whilst the remaining coefficients and the additive noises are sampled identically for both processes.
- Variance shift: the coefficient means are kept equal, and a shift is introduced in the variance of the first basis function coefficients; the second coefficients and the noises are sampled identically for both processes.
The shift coefficients for mean and variance, respectively, determine the departure from the null hypothesis: setting them to zero means H_0 is true, whereas nonzero values mean H_0 is false. Although this is not a necessity, we set the number of independent samples of X and Y to be equal, n_x = n_y. To test for statistical significance, we follow the procedure described in section 3.2 and perform permutation tests for varying shift values and different sample sizes.
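A sketch of a data generator in the spirit of (9) is given below; the basis functions, coefficient distributions, and noise scale are illustrative stand-ins for the (elided) specification of pomann2016two.

```python
import numpy as np

def generate_samples(n, T, mean_shift=0.0, var_shift=0.0, seed=0):
    """Draw n realisations on an equally spaced grid of T points from a
    linear mixed-effects model with two Fourier basis functions, in the
    spirit of (9). Parameter values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, T)
    phi1, phi2 = np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)
    # The first coefficient carries the mean and/or variance shift;
    # zero shifts correspond to the null hypothesis H_0.
    a1 = rng.normal(mean_shift, np.sqrt(1.0 + var_shift), n)
    a2 = rng.normal(0.0, 1.0, n)
    noise = rng.normal(0.0, 0.25, (n, T))
    return a1[:, None] * phi1[None, :] + a2[:, None] * phi2[None, :] + noise
```

Generating one batch with zero shifts and one with a nonzero shift yields the H_0-true and H_0-false settings used to estimate type-I error and power, respectively.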
Baseline results without test power optimisation.
Our baseline results are obtained with a Gaussian kernel with bandwidth equal to the median distance between observations of the aggregated samples. Figure 1 shows how our method (solid lines) compares to pomann2016two (dashed lines) for discrete time points. For all sample sizes, the typeI error rate lies at or below the allowable probability of false rejection , and our method significantly outperforms pomann2016two
for nearly all levels of mean and variance shifts. Both shifts become easier to detect for larger sample sizes. Particularly strong improvements are achieved for mean shifts: our method makes no typeII errors for
on samples, whereas pomann2016two only reach such performance with samples and . We obtain similar test power results (see Appendix A.1) for coarser realisations with over the same interval .Results of the optimised test.
Next, we apply the method described in section 3.4 to maximise the test power. Specifically, we search for the Gaussian kernel bandwidth (over the spaces defined in Table 1 in Appendix A.2) that maximises the argument of \Phi in our approximation of (7) on the training samples. For demonstrative purposes, we split our dataset equally into training and testing sets, although other ratios may lead to higher test power. Figure 2 shows the results of the optimised test (dotted lines) against the baseline results (solid lines) and the results of pomann2016two (dashed lines). We find that the test power is significantly improved by our optimisation for the detection of mean shifts; for instance, the test power rises fourfold compared to our baseline method for some shift levels, and the optimised test reaches zero type-II errors at smaller shifts than both our baseline test and pomann2016two. In its current form, however, our optimisation does not yield higher test power for the detection of variance shifts, a fact that we discuss in section 6.
4.2 Independence tests with hsic
Setup.
To test for independence, the null hypothesis is H_0: P_{XY} = P_X P_Y. We assume we observe measurements over temporal grids of length T_x and T_y, respectively. To measure type-I and type-II error rates, we use the following experimental protocols, partly adopted from zhang2018large, gretton2008kernel, and gretton2005kernel:
- Linear dependence: x_i is generated as in (9), and the samples of the second process are a linear function of the first, as in zhang2018large.
- Dependence through a shared coefficient: x_i and y_i are generated as in (9) with independently sampled coefficients, as in the mean shift experiments of section 4.1, but where the two stochastic processes now share the second basis function coefficient.
- Dependence through rotation: we start by generating independent x_i and y_i as in (9), but with the basis function coefficients drawn from (i) student-t, (ii) uniform, or (iii) exponential distributions (gretton2005kernel, Table 3). We next multiply by a rotation matrix with angle \theta to generate new rotated samples, which we then test for independence. Clearly, for \theta = 0 our samples are independent, and as \theta is increased their dependence becomes easier to detect (see (gretton2008kernel, Section 4) and Figure 7 for implementation details).
Statistical significance is computed using permutations of the samples of one process, whilst the other is kept fixed, to approximate the distribution under H_0. Test power is calculated for varying \theta and different sample sizes.
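The rotation mixing in the last protocol can be sketched as follows; \theta = 0 leaves the pair unchanged (and hence independent), and a rotation preserves the pairwise squared norms.

```python
import numpy as np

def rotate_pair(A, B, theta):
    """Mix two independent coefficient samples A and B (equal-length 1-D
    arrays) by a 2-D rotation of angle theta. theta = 0 returns the pair
    unchanged; larger angles induce detectable dependence."""
    c, s = np.cos(theta), np.sin(theta)
    return c * A - s * B, s * A + c * B
```

Note that for coefficients with equal variances the rotated pair remains uncorrelated, so the induced dependence is invisible to linear statistics but detectable by hsic.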
Baseline results without test power optimisation.
Our baseline results are computed using a Gaussian kernel with bandwidth equal to the median distance between measurements in the corresponding sample. Figure 3 (left) shows the results of our test on the linear dependence experiments, which demonstrate how dependencies between individual points in time and an entire time series can be detected. We compare our method to: (i) SubCorr, a statistic explicitly aimed at linear dependence and based on the Pearson correlation coefficient; and (ii) SubHSIC. For both of these methods, the distribution under H_0 is also approximated via permutations. We find that SubCorr outperforms the other methods for some sample sizes, and SubHSIC achieves comparable results to our method. The results for other grid lengths (see Appendix A.1) are similar.
Figure 3 (right) displays the power of our independence test for the case of dependence through a shared coefficient, for varying sample sizes and numbers of measurements. We compare our results to two spectral methods (zhang2018large) that approximate the distribution under H_0 using eigenvalues of the centred kernel matrices of X and Y: spectral hsic uses the unbiased estimator (4) as the test statistic with the eigenvalue-based null distribution; and spectral random Fourier feature (rff) uses a test statistic induced by a number of rffs that approximate the kernel matrices of X and Y. Our method and spectral hsic achieve an improvement in test power compared to spectral rff. For small numbers of samples, our method outperforms spectral hsic, which converges to the performance of our method with increasing sample size, as we would expect (gretton2009fast, Theorem 1).
Figure 4 shows the rotation dependence experiments, where \theta = 0 corresponds to the null hypothesis (independence) and \theta > 0 to the alternative. The distribution hyperparameters are detailed in Appendix A.3, and we set the two grid lengths to be equal, although equality is not required. As expected, dependence is easier to detect with increasing \theta. We observe that denser temporal measurements do not result in enhanced test power. Note that the test power is highly dependent on the distribution of the basis function coefficients.
Results of the optimised test.
The test power maximisation was applied to the rotation dependence experiments by searching for optimal Gaussian kernel bandwidths for both processes over predefined intervals (specified in Appendix A.2). Figure 4 shows that the test power is improved when the basis function coefficients are drawn from uniform distributions: the percentage of rejected H_0 is higher for intermediate angles \theta, and for larger angles it levels off at the same level achieved by our baseline method. With our current train-test split, our optimised test does not improve the test power if the basis function coefficients are drawn from student-t or exponential distributions.
5 Application to a socioeconomic dataset
As a further illustration, we apply our method to the United Nations' socioeconomic Sustainable Development Goals (see Appendix A.4 for details). Specifically, we investigate whether some so-called Targets of the 17 sdgs have been homogeneous over recent years across low- and high-income countries, and whether certain sdgs in African countries exhibit dependence over the same period. In both settings, we assume countries are independent.
For our homogeneity tests, we classify countries into low- and high-income according to worldbank2020, and use temporal data of the Targets for which WBdata provides data for both groups of countries. Applying our baseline method without test power optimisation, we find that, of the Targets for which data are available, only a few have had homogeneous trajectories in low- and high-income countries. For instance, whereas the 'death rate due to road traffic injuries' (Target 3.6) has been homogeneous between these two groups, the 'fight the epidemics of AIDS, tuberculosis, malaria and others' (Target 3.3) has not been homogeneous in low- and high-income countries.
For our independence tests, we consider temporal data from African countries and test any two Targets for pairwise independence. Of the total possible pairwise combinations, the null hypothesis of independence is rejected for a fraction of them. As an illustration, we examine the dependencies of the 'implementation of national social protection systems' (Target 1.3) with 'economic growth' (Target 8.1) and the 'proportion of informally employed workers' (Target 8.3). Applying our baseline method, we accept the null hypothesis of independence between Targets 1.3 and 8.1, i.e., we find that the 'implementation of national social protection systems' has been independent of economic growth. In contrast, we find that Target 1.3 has been dependent on the 'proportion of informally employed workers' (Target 8.3).
6 Discussion and conclusion
Building on ideas from functional data analysis, we have presented approaches to testing for homogeneity and independence between two nonstationary random processes with the kernel-based statistics mmd and hsic. We view independent realisations of the underlying processes as samples from multivariate probability distributions to which mmd and hsic can be applied. Our tests are shown to outperform current state-of-the-art methods in a range of experiments. Furthermore, we optimise the test power over the choice of kernel and achieve improved results in most settings. However, we also observe that our optimisation procedure does not always yield an increase in test power. We leave the investigation of this behaviour open for future research, with the possibility of defining search spaces and step sizes over kernel hyperparameters differently, or of choosing a gradient-based approach for optimisation (sutherland2016generative). Our results show that small numbers of independent realisations can already achieve high test power, and that denser measurements over the same time period do not necessarily lead to enhanced test power.
The proposed tests can be of interest in many areas where nonstationary and nonlinear multivariate temporal datasets constitute the norm, as illustrated by our application to test for homogeneity and independence between the United Nations' Sustainable Development Goals measured in different countries over recent years.
References
Appendix A Appendix
A.1 Results for realisations with varying number of time points, T
A.1.1 mmd
We show here the results for mean and variance shifts for a representative setting; the results are similar for all tested sample sizes.
A.1.2 hsic
Experiments for linear dependence and dependence through a shared second basis function coefficient for various grid lengths. We find that the granularity of measurements over time does not influence the test power significantly.
A.2 Test power maximisation
A.2.1 mmd
For the mean shift experiments with mmd, we predefine a linear search space of values for the Gaussian kernel bandwidth, and similarly for the variance shift experiments (both stated in Table 1). These search spaces resulted from extensive manual explorations for all shifts and sample sizes. We acknowledge that the test power may be further improved with search spaces of finer granularity.
A.2.2 hsic
We define search intervals for both bandwidths across all angles \theta, but different ones for the student-t, uniform, and exponential distributions. For the student-t and exponential distributions, both bandwidths were chosen as evenly spaced numbers on a linear scale; for the uniform distributions, they were chosen as evenly spaced numbers on a linear scale over a different range. These search spaces resulted from extensive manual explorations for all angles and distributions. We acknowledge that the test power may be further improved with search spaces of finer granularity.
A.3 Distribution specifications for basis function coefficients in rotation mixing

Distribution | Fourier basis function coefficients
Exponential |
Student-t |
Uniform |
A.4 SDG dataset
Data of the Indicators measuring the progress of the Targets of the sdgs can be found at WBdata. Each of these Indicators measures the progress towards a specific Target. For instance, an Indicator for Target 1.1, 'by 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.90 a day', is the 'proportion of population below the international poverty line, by gender, age, employment status and geographical location (urban/rural)'. Each of the Targets belongs to one specific Goal (e.g., Target 1.1 belongs to Goal 1, 'end poverty in all its forms everywhere'). There are 17 such Goals, which are commonly referred to as the Sustainable Development Goals (sdgs). We compute averages over all Indicators belonging to one Target for our analyses in Section 5.
The dataset of WBdata has many missing values, especially for the time span 2000-2005. We impute these values using a weighted average across countries (where data is available), with weights inversely proportional to the Euclidean distance between indicators.
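A sketch of this imputation scheme follows; the feature-matrix layout (one indicator vector per country) and the small constant guarding against division by zero are assumptions.

```python
import numpy as np

def impute_weighted(values, features):
    """Fill missing entries (NaN) in `values` (one value per country) with
    a weighted average over the observed countries, with weights inversely
    proportional to the Euclidean distance between the countries' indicator
    vectors `features` (n_countries x n_indicators)."""
    values = np.asarray(values, dtype=float)
    features = np.asarray(features, dtype=float)
    out = values.copy()
    observed = ~np.isnan(values)
    for i in np.where(~observed)[0]:
        # Distances from the missing country to every observed country.
        d = np.linalg.norm(features[observed] - features[i], axis=1)
        w = 1.0 / (d + 1e-9)  # epsilon avoids division by zero
        out[i] = np.sum(w * values[observed]) / w.sum()
    return out
```

Countries with similar indicator profiles thus contribute most to each imputed value, which keeps the filled-in entries close to their nearest neighbours in indicator space.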