Statistical modelling allows for the learning of the relationship between two variables, where the said relationship is responsible for generating the data available on the variables. Thus, let
be a random variable that represents a behavioural or structural parameter of the system, andis another variable that bears influence on s.t. , where the functional relation that we seek to learn, is itself a random structure, endowed with information about the error made in predicting the values of (or ) at which the noise-included measurement of (or ) has been realised. Such a function can be modelled as a realisation from an adequately chosen stochastic process. In general, either or both variables could be tensor-valued, such that, data comprising measurements of either variable, is then shaped as a hypercuboid. Typically, the structure/behaviour of a system is parametrised using a set of scalar-valued parameters, (say number of such parameters), which can, in principle be collated into a -dimensional vector. Then is typically, the system parameter vector. The other, observed variable , can be tensor-valued in general. There are hypercuboidally-shaped data that show up in real-world applications, (Mardia and Goodall, 1993; Bijma et al., 2005; Werner et al., 2008; Theobald and Wuttke, 2008; Barton and Fuhrmann, 1993)
. For example, in computer vision, the image of one person might be a matrix of dimensions, i.e. image with resolution of pixels by pixels. Then, repetition across persons inflates the data to a cuboidally-shaped dataset. Examples of handling high-dimensional datasets within computer vision exist (Dryden et al., 2009; Fu, 2016; Pang et al., 2016; Wang, 2011; Qiang and Fei, 2011). In health care, the number of health parameters of patients, when charted across
time-points, again generates a high-dimensional data, which gets further enhanced, if the experiment involves tracking for changes acrossgroups of patients each, where each such group is identified by the level of intervention (Chari, Coe, Vucic, Lockwood and Lam, 2010; Clarke et al., 2008; Oberg et al., 2015; Chari, Thu, Wilson, Lockwood, Lonergan, Coe, Malloff, Gazdar, Lam, Garnis et al., 2010; Sarkar, 2015; Wang et al., 2015; Fan, 2017). Again, in ecological datasets, there could be spatial locations at each of which, traits of species could be tracked, giving rise to a high-dimensional data (Leitao et al., 2015; Warton, 2011; Dunstan et al., 2013).
It is a shortcoming of the traditional modelling strategies that we treated these groupings in the data as independent–or for that matter, even the variation in parameter values of any group across the time points, is ignored, and a mere snapshot of each group is traditionally considered, one at a time. In this work, we advance a method for the consideration of parameters across all relevant levels of measurement, within one integrated framework, to enable the learning of correlations across all such levels, thus permitting the prediction of the system parameter vector, with meaningful uncertainties, and avoid information loss associated with categorisation of data.
While discussing the generic methodology that helps address the problem of learning the inter-variable relationship , given general hypercuboid-shaped data, we focus on developing such learning when this data displays discontinuities. In such a learning exercise, the inter-variable functional relation , needs to be modelled using a high-dimensional stochastic process (a tensor-variate Gaussian Process, for example), the covariance function of which is non-stationary. The correlation between a pair of data slices, (defined by two such measured values of , each realised at two distinct values of the system parameter
), is sometimes parametrically modelled as a function of the distance between the values of the system parameter at which these slices are realised, i.e. “similarity” in values ofcan be modelled as a function of “similarity” in the corresponding values. However, if there are discontinuities in the data, then such a mapping between “similarities” in and no longer holds. Instead, discontinuities in data call for a model of the correlation that adapts to the discontinuities in the data. We present such correlation modelling in this paper, by modelling each scalar-valued hyperparameter of the correlation structure of the high-dimensional stochastic process, as a random function of the sample path of that process; this random function then, can itself be modelled as a realisation of a scalar-variate stochastic process–a scalar-variate Gaussian Process (GP) for example (Section 2).
Thus, the learning of is double-layered, in which multiple scalar-variate GPs inform a high-dimensional (tensor-variate) GP. Importantly, we show below (Section 3.3) that no more than 2 such layers in the learning strategy suffice. Thus, the data on the observable can be shown to be sampled from a compound tensor-variate and multiple scalar-variate Gaussian Processes.
Acknowledgement of non-stationarity in correlation learning is not new (Paciorek and Schervish, 2004). In some approaches, a transformation of the space of the input variable is suggested, to accommodate non-stationarity (Sampson and Guttorp, 1992; Snoek et al., 2014; Schmidt and O’Hagan, 2003). When faced with learning the dynamically varying covariance structure of time-dependent data, others have resorted to learning such a covariance, using Generalised Wishart Process (Wilson and Ghahramani, 2011). In another approach, latent parameters that bear information on non-stationarity, have been modelled with GPs and learnt simultaneously with the sought function (Tolvanen et al., 2014), while others have used multiple GPs to capture the non-stationarity (Gramacy, 2005; Heinonen et al., 2016). However, what has not been presented, is a template for including non-stationarity in high-dimensional data, by nesting lower-dimensional Gaussian Processes with distinct covariances, within a tensor-variate GP (Section 2 and Section 3), using a Metropolis-within-Gibbs inference scheme (Section 4), to perform with-uncertainties learning of a high-dimensional function, given discontinuities that show up in the hypercuboidally-shaped datasets in general, and illustration of the method on a cuboidally-shaped, real-world dataset (Section 5, Section 6). This is what we introduce in this paper. Our model is capacitated to learn the temporally-evolving covariance of time-dependent data (Section 3.2), if such is the data at hand, but the focus of our interest is to follow the learning of the sought tensor-valued functional relation between a system parameter vector and a tensor-valued observable, with inverse Bayesian prediction of the system parameter values, at which test data on the observable is measured (Section 7, Section 6). Additionally, flexibility of our model design permits both inverse and forward predictions. So we also predict new data at chosen system parameter values given our model and results, and perform model checking, by comparing such generated data against the empirically observed data (Section 3 of Supplementary Materials).
Let system parameter vector , be affected by observable , where is (-th ordered) tensor-valued in general, i.e. is , . That bears influence on suggests the relationship where .
We define functional relationship , between and , as a “tensor-valued function”, with -number of component functions, where these components suffer inter-correlations. Thus, the learning of is equivalent to learning the component functions, inclusive of learning the correlation amongst these component functions.
Inverse of , is defined as the tensor-valued function of same dimensionalities as , comprising inverses of each component function of , assuming inverse of each component function exists.
The inversion of the sought function –where –allows for the forward prediction of given a measured value of , as well as for the inverse prediction of the value of at which a given measurement of is recorded. It may be queried: why do we undertake the seemingly more difficult learning of the tensor-valued (that outputs the tensor ), than of the vector-valued (that outputs the vector ). We do this, because we want to retain the capacity of predicting both new data at a given value of the system parameter (), as well as predict the system parameter at which a new measurement of the observable is realised.
If we had set ourselves the task of learning , where , i.e. is a “vector-valued” function, and therefore lower dimensional with fewer number of component functions than the tensor-valued –we could not have predicted value of at a given . The -dimensional vector-valued inverse function cannot yield a value of the number of components of the tensor at this given , if .
The learning of the function , uses the training data . Conventional prediction of , at which test data on is realised, suggests: .
However, this there is no objective way to include the uncertainties learnt in the learning of the function , to propagate into the uncertainty of this prediction. This underpins an advantage of Bayesian prediction of one variable, given test data on the other, subsequent to learning of using training data .
Conventional fitting methods (such as fitting with splines, etc), also fumble when measurements of both/either of the r.v.s and , are accompanied by measurement errors; in light of this, it becomes difficult to infer the function that fits the data the best. In fact, the uncertainty in the learning of the sought function is also then difficult to quantify.
Secondly, there is no organic way of quantifying the smoothness of the sought , in th econventional approach. Ideally, we would prefer to learn this smoothness from the data itself. However, there is nothing intrinsic to the fitting-with-splines/wavelets method that can in principle, quantity the smoothness of the curve, given a training data.
Lastly, when is an r.v. that is no longer a scalar, but higher-dimensional (say tensor-valued in general), fitting with splines/wavelets starts to become useless, since in such cases of sought tensor-valued function (in general), the component functions of are correlated, but methods such as parametric fitting approaches, cannot capture such correlation, given the training data. As we have remarked above, such correlation amongst the components functions of is the same correlation structure amongst the components of the tensor-valued –so in principle, the sought correlation can be learnt from the training data.
In light of this, we identify a relevant Stochastic Process that can give a general, non-restrictive description of the sought function
–a Gaussian Process for example. The joint probability density of a set of realisations of a sampled, is then driven by the Process under consideration, where each such realisation of the function, equals a value of the output variable . Thus, the joint also represents the likelihood of the Process parameters given the relevant set of values of
, i.e. the data. We impose judiciously chosen priors, to write the posterior probability density of the Process parameters given the data. Generating samples from this posterior then allows for the identification of the 95HPD credible regions on these Process parameters, i.e. on the learnt function . It is possible to learn the smoothness of the function generated from this Process, via kernel-based parameterisation of the covariance structure of the GP under consideration. Thus, we focus on the pursuit of adequate covariance kernel parametrisation.
When possible, covariance matrices of the GP that is invoked to model the sought function , are kernel-parametrised using stationary-looking kernel functions, hyperparameters of which are modelled as dependent on the sample paths (or rather sample functions) of this GP. We show below (Lemma 3.1) that such a model can address the anticipated discontinuities in data.
As LHS of equation is -th ordered tensor-valued, is tensor-variate function of equal dimensionalities. So we model as a realisation from a tensor-variate GP.
Modelling as sampled from a tensor-variate GP, where the
-th ordered tensor-valued variable , we get that the
joint probability of the set of values of sampled function , at each of the design
points (that reside within the training data ), follows
the -variate Tensor Normal distribution
-variate Tensor Normal distribution(Kolda and Bader, 2009; Richter et al., 2008; McCullagh, 1987; Manceur and Dutilleul, 2013):
where mean of this density is a -th ordered mean tensor of dimensions , and is the -dimensional, -th covariance matrix; . In other words, likelihood of given is the -variate Tensor Normal density:
where observed values of the -th dimensional tensor-valued are collated to form the -th ordered tensor . The notation in Equation 2.1 presents the -mode product of a matrix and a tensor (Oseledets, 2011). Here is the unique square-root of the positive definite covariance matrix , i.e. .
One example of a computational algorithm that can be invoked to realise such a square root of a matrix, is Cholesky decomposition111 The covariance tensor of this -th order Tensor Normal distribution, has been decomposed into different covariance matrices by Tucker decomposition, (Hoff et al., 2011; Manceur and Dutilleul, 2013; Kolda and Bader, 2009; Xu and Yan, 2015), to yield the number of covariance matrices, , where the -th covariance matrix is an -dimensional square matrix, . As Hoff (1997); Manceur and Dutilleul (2013) suggest, a -th ordered random tensor can be decomposed to a -th ordered tensor and number of covariance matrices by Tucker product, according to , It can be proved that all tensors can be decomposed into a set of covariance matrices (Xu et al., 2011), though not uniquely. This may cause difficulty in finding the correct combination of covariance matrices that present the correlation structure of the data at hand. One way to solve this problem is to use priors for the respective covariance parameters..
We employ this likelihood in Equation 2.1
to write the joint posterior probability density of the mean tensor and covariance matrices, given the data. But prior to doing that, we identify those parameters–if any–that can be estimated in a pre-processing stage of the inference, in order to reduce the computational burden of inference. Also, it would be useful to find ways of (kernel-based) parametrisation of the sought covariance matrices, thereby reducing the number of parameters that we need to learn. To this effect, we undertake the estimation of the mean tensor is. It is empirically estimated as the sample mean of the sample , s.t. repetitions of form the value of . However, if necessary, the mean tensor itself can be regarded as a random variable and learnt from the data (Chakrabarty et al., 2015), The modelling of the covariance structure of this GP is discussed in the following subsection.
Ultimately, we want to predict the value of one variable, at which a new or test data on the other variable is observed.
To perform inverse prediction of value of the input variable , at which test data on is realised, we will
sample from the posterior probability density of given the test data , and (modal) values of the unknowns that parametrise the covariance matrices the high-dimensional GP invoked to model , subsequent to learning the marginals of each such unknown given the training data, using MCMC.
sample from the joint posterior probability density of and all other unknowns parameters of this high-dimensional GP, given training, as well as test data, using MCMC.
Computational speed of the first approach, is higher, as marginal distributions of the GP parameters are learnt separately. When the training data is small, or if the training data is not representative of the test data at hand, the learning of via the second method may affect the learning of the GP parameters.
2.1 3 ways of learning covariance matrices
Let the -th element of -th covariance matrix be ; , .
At a given , bears information about covariance amongst the -th and -th slices of the -th ordered data tensor , s.t. -dimensional -th “slice” of data tensor is measured value of -th ordered tensor-valued , where the -th slice is realised at the -th design point .
The covariance between the -th and -th slices of data decreases as the slices get increasingly more disparate, i.e. with increasing . In fact, we can model as a decreasing function of this disparity , where is the covariance kernel function, computed at the -th and -th values of input variable . In such a model, the number of distinct unknown parameters involved in the learning of reduces from , to the number of hyper-parameters that parametrise the kernel function .
However, kernel parametrisation is not always possible.
–Firstly, this parametrisation may cause information loss and this may not be acceptable
(Aston and Kirch, 2012).
–Again, we will necessarily avoid kernel parametrisation, when we cannot find input parameters, at which the corresponding slices in the data are realised.
In such situations,
–we can learn the elements of the covariance matrix directly using MCMC, though direct learning of all distinct elements of is feasible, as long as total number of all unknowns learnt by MCMC .
–we can use an empirical estimation for the covariance matrix . We collapse each of the number of -th ordered tensor-shaped slices of the data, onto the -th axis in the space of , where we can choose any one value of from . This will reduce each slice to a -dimensional vector, so that is covariance computed using the -th and -th such -dimensional vectors.
Indeed such an empirical estimate of any covariance matrix is easily generated, but it indulges in linearisation amongst the different dimensionalities of the observable , causing loss of information about the covariance structure amongst the components of these high-dimensional slices. This approach is inadequate when the sample size is small because the sample-based estimate will tend to be incorrect; indeed discontinuities and steep gradients in the data, especially in small-sample and high-dimensional data, will render such estimates of the covariance structure incorrect. Importantly, such an approach does not leave any scope for identifying the smoothness in the function that represents the functional relationship between the input and output variables. Lastly, the uncertainties in the estimated covariance structure of the GP remain inadequately known.
We model the covariance matrices as
–or learnt directly using MCMC.
An accompanying computational worry is the inversion of any of the covariance matrices; for a covariance matrix that is an -dimensional matrix, the computational order for matrix inversion is well known to be (Knuth, 1997).
3 Kernel parametrisation
Kernel parametrisation of a covariance matrix, when undertaken, uses an Squared Exponential (SQE) covariance kernel
where is a diagonal matrix, the diagonal elements of which are the length scale hyperparameters that tell us how quickly correlation fades away in each of the -directions in input space , s.t. the inverse matrix is also diagonal, with the diagonal elements given as , where is the smoothness hyperparameter along the -th direction in , . We learn these unknown parameters from the data.
Here is the global amplitude, that is subsumed as a scale factor, in one of the other covariance matrices, distinct elements of which are learnt directly using MCMC.
We avoid using a model for the kernel in which amplitude depends on the locations at which covariance is computed, i.e. the model: , and use a model endowed with a global amplitude . This helps avoid learning a very large number () of amplitude parameters directly from MCMC.
A loose interpretation of this amlitude modelling is that we have scaled all local amplitudes to be using the global factor (), and these scaled local amplitudes are then subsumed into the argument of the exponential in the RHS of the last equation, s.t. the reciprocal of the correlation length scales, that are originally interpreted as the elements of the diagonal matrix , are now interpreted as the smoothing parameters modulated by such local amplitudes. This interpretation is loose, since the same smoothness parameter cannot accommodate all (scaled by a global factor) local amplitudes, for all .
3.1 Including non-stationarity, by modelling hyperparameters of covariance kernels as realisations of Stochastic Process
By definition of the kernel function we choose, (Equation 3.1), all functions sampled from the tensor-variate GP, are endowed with the same length scale hyperparameters , and global amplitude . However, the data on the output variable is not continuous, i.e. similarity between and does not imply similarity between and , computed in a universal way . Indeed, then a stationary definition of the correlation for all pairs of points in the function domain, is wrong. One way to generalise the model for the covariance kernel is to suggest that the hyperparameters vary as random functions of the sample path.
For , with and , if the map is a Lipschitz-continuous map over the bound set , where absolute value of correlation between and is
then the vector of correlation hyperparameters is finite, and each element of is -dependent, i.e.
For , where is a bounded subset of , and , the mapping is a defined to be Lipschitz-continuous map, i.e.
–for constant , s.t. the infinum over all such
constants is the finite Lipschitz constant for ;
– and are metric spaces.
Let metric be the norm:
and the metric be defined as (square root of the logarithm of) the inverse of the correlation:
–where correlation being a measure of affinity,
, transforms this affinity into a squared distance for this
correlation model; so the transformation to a metric is undertaken;
–and the given kernel-parametrised correlation is:
Then for the map to be Lipschitz-continuous, we require:
where the vector of correlation hyperparameters, , is finite given finite .
Given discontinuities in the data on , the function is not expected to obey the Lipschitz criterion defined in inequation 3.2 globally. We anticipate sample function to be locally or globally discontinuous.
Sample function can be s.t.
, s.t. finite Lipschitz constant , for which . Here the bounded set .
, with , s.t. , but ; . In such a case, the Lipschitz constant used for the sample function is defined to be
If each function in the set is
–either globally Lipschitz, or is as described in Case II,
–and Case I does not hold true, then
a finite , where
where is the -th Lipschitz constant defined for the -th
sample function , ,
i.e. a finite Lipschitz constant for all sample functions.
a universal correlation hyperparameter vector for all sample functions (=, by Theorem 3.1).
Following on from Lemma 3.1,
if for any
Case I holds, finite maxima of
does not exist,
a finite Lipschitz constant for all sample functions,
a universal correlation hyperparameter vector , for all sample functions,
i.e. we need to model correlation hyperparameters to vary with the sample function.
Above, are hyperparameters of the correlation kernel; they are interpreted as the reciprocals of the length-scales , i.e. .
If the map is Lipschitz-continuous, (i.e. if hyperparameters are -dependent, by Theorem 3.1), then by Kerkheim’s Theorem (Kerkheim, 1994), is differentiable almost everywhere in ; this is a generalisation of Rademacher’s Theorem to metric differentials (see Theorem 1.17 in Hajlasz (2014)). However, in our case, the function is not necessarily differentiable given discontinuities in the data on the observable , and therefore, is not necessarily Lipschitz.
For , with and ,
where is a sample function of a tensor-variate GP.
Thus, in this updated model, -th component of correlation
modelled as randomly varying with the sample function, , of the
tensor-variate GP, .
In the Metropolis-within-Gibbs-based inference that we undertake, one sample function of the tensor-variate GP generated, in every iteration, that we model above
|as randomly varying with the sample path of the tensor-variate GP,|
where this scalar-valued random function , is modelled as a realisation from a scalar-variate GP.
Scalar-variate GP that is sampled from, is independent of the GP that is sampled from; . In addition, parameters that define the correlation function of the generative scalar-variate GP can vary, namely the amplitude and scale of one such GP might be different from another. Thus, scalar-valued functions sampled from GPs with varying correlation parameters and –even for the same value–should be marked by these descriptor variables and .
We update the relationship between iteration number and correlation length scale hyperparameter in the -th direction in input space to be:
– the amplitude variable of the SQE-looking covariance function of
the scalar-variate GP that is a realisation of.
takes the value ;
– the length scale variable of the SQE-looking covariance function of the scalar-variate GP that is a realisation of; .
Then the scalar-variate GPs that and are sampled from, have distinct correlation functions if . Here .
Current value of correlation length scale hyperparameter , acknowledges information on only the past number of iterations as in:
where is an unknown constant that we learn from the data, during the first iterations.
is a realisation from a scalar-variate GP, the joint probability distribution ofnumber of values of the function –at a given –is Multivariate Normal, with -dimensional mean vector and -dimensional covariance matrix , i.e.
Here is the number of iterations that we look back at, to collect the dynamically-varying “look back-data” that is employed to learn parameters of the scalar-variate GP that is modelled with.
The mean vector is empirically estimated as the mean of the dynamically varying look back-data, s.t. at the -th iteration it is estimated as a -dimensional vector with each component .
-dimensional covariance matrix is dependent on the iteration-number and this is now acknowledged in the notation to state: .
In the -th iteration, upon the empirical estimation of the mean as given above, it is subtracted from the “look back-data” so that the subsequent mean-subtracted look back-data is . It is indeed this mean-subtracted sample that we use.
In light of this declared usage of the mean-subtracted “look back-data” , we update the likelihood over what is declared in Equation 3.6, to
3.2 Temporally-evolving covariance matrix
The dynamically varying covariance matrix of the Multivariate Normal likelihood in Equation 3.7, at iteration number , is
the number of iterations we look back to is ;
is the covariance kernel parametrising the covariance function of the scalar-variate GP that generates the scalar-valued function , at the vector of descriptor variables, s.t. ;
is a positive definite square scale matrix of dimensionality , containing the amplitudes of this covariance function;
, with the space of input variable -dimensional.
The covariance kernel that parametrises the covariance function of the scalar-variate GP that generates , is s.t. =1 .
In a general model, at each iteration, a new value of the vector of descriptor variables in the -th direction in the space of the input variable , is generated, s.t. in the -th iteration, it is ;
Now, , where is the Delta function.
sample estimate of is
is the -th element of matrix .
This definition of the covariance holds since mean of the r.v. is 0, as we have sampled the function from a zero-mean scalar-variate GP.
Let be a -dimensional diagonal matrix, the -th diagonal element of which is . Then factorising the scale matrix , is diagonal with the -th diagonal element ; . This is defined for every .
Then at iteration number , we define the current covariance matrix
Then is distributed according to the Wishart distribution w.p, and (Eaton, 1990), i.e. the dynamically-varying covariance matrix is:
If interest lies in learning the covariance matrix at any time point, we could proceed to inference here from, in attempt of the learning of the unknown parameters of this process given the lookback-data .
Our learning scheme then would then involve compounding a Tensor-Variate GP and a .
The above would be a delineated route to recover the temporal variation in the correlation structure of time series data (as studied, for example by Wilson and Ghahramani (2011)).
In our study, the focus is on high-dimensional data that display discontinuities, and on learning the relationship between the observable that generates such data, and the system parameter –with the ulterior aim being parameter value prediction. So learning the time-varying covariance matrix is not the focus of our method development.
We want to learn given training data . The underlying motivation is to sample a new from a scalar-variate GP, at new values of , to subsequently sample a new tensor-valued function , from the tensor-normal GP, at a new value of its -dimensional correlation length scale hyperparameter vector .
3.3 2-layers suffice
One immediate concern that can be raised is the reason for limiting the layering of our learning scheme to only 2. It may be argued that just as we ascribe stochasticity to the length scales that parametrise the correlation structure of the tensor-variate GP that models , we need to do the same to the descriptor variables that parametrise the correlation structure of the scalar-variate GP that model . Following this argument, we would need to hold –or at least model the scale –to be dependent on the sample path of the scalar-variate GP, i.e. set to be dependent on .
However, we show below that a global choice of is possible irrespective of the sampled function , given that is always continuous (a standard result). In contrast, the function not being necessarily Lipschitz (see Remark 3.3), implies that the correlation kernel hyperparameters , are -dependent, .
Given , with and , the map is a Lipschitz-continuous map, . Here
The proof of this standard theorem is provided in Section 4 of the supplementary Materials.
For any sampled function realised from a scalar-variate GP that has a covariance function that is kernel-parametrised with an SQE kernel function, parametrised by amplitude and scale hyperparameters, the Lipschitz constant that defines the Lipschitz-continuity of , is -dependent, and is given by the reciprocal of the scale hyperparameter, s.t. the set of values of scale hyperparameters, for each of the samples of taken from the scalar-variate GP, admits a finite minima.
Distance between and is given by metric
s.t. , and is finite (since live in a bound set, and is continuous). The parametrised model of the correlation is
s.t. , where is the scale hyperparameter.
Now, Lipschitz-continuity of implies
where the Lipschitz constant is -dependent (Theorem 3.1). As , where is a known finite integer, and as is defined as (using definition of ), exists for , and is finite. We get
As is any point in , exists for all points in .