1 Introduction
Statistical modelling allows us to learn the relationship between two variables, where the said relationship is responsible for generating the data available on these variables. Thus, let one random variable represent a behavioural or structural parameter of the system (the system parameter), and let another variable (the observable) bear influence on it, such that the functional relation between the two, which we seek to learn, is itself a random structure, endowed with information about the error made in predicting the value of either variable at which a noise-included measurement of the other has been realised. Such a function can be modelled as a realisation from an adequately chosen stochastic process. In general, either or both variables could be tensor-valued, such that the data comprising measurements of either variable is then shaped as a hypercuboid. Typically, the structure/behaviour of a system is parametrised using a set of scalar-valued parameters, which can, in principle, be collated into a vector; this vector is typically the system parameter vector. The other, observed variable can be tensor-valued in general. Hypercuboidally-shaped data show up in real-world applications (Mardia and Goodall, 1993; Bijma et al., 2005; Werner et al., 2008; Theobald and Wuttke, 2008; Barton and Fuhrmann, 1993). For example, in computer vision, the image of one person might be a matrix, i.e. an image with a given resolution in pixels by pixels. Then, repetition across persons inflates the data to a cuboidally-shaped dataset. Examples of handling high-dimensional datasets within computer vision exist (Dryden et al., 2009; Fu, 2016; Pang et al., 2016; Wang, 2011; Qiang and Fei, 2011). In health care, the health parameters of patients, when charted across time points, again generate high-dimensional data, which gets further enhanced if the experiment involves tracking for changes across groups of patients, where each such group is identified by the level of intervention (Chari, Coe, Vucic, Lockwood and Lam, 2010; Clarke et al., 2008; Oberg et al., 2015; Chari, Thu, Wilson, Lockwood, Lonergan, Coe, Malloff, Gazdar, Lam, Garnis et al., 2010; Sarkar, 2015; Wang et al., 2015; Fan, 2017). Again, in ecological datasets, there could be spatial locations at each of which traits of species are tracked, giving rise to high-dimensional data (Leitao et al., 2015; Warton, 2011; Dunstan et al., 2013). It is a shortcoming of traditional modelling strategies that these groupings in the data are treated as independent; indeed, even the variation in parameter values of any group across time points is ignored, and a mere snapshot of each group is traditionally considered, one at a time. In this work, we advance a method for the consideration of parameters across all relevant levels of measurement, within one integrated framework, to enable the learning of correlations across all such levels, thus permitting the prediction of the system parameter vector with meaningful uncertainties, and avoiding the information loss associated with categorisation of data.
While discussing the generic methodology that helps address the problem of learning the inter-variable relationship, given general hypercuboid-shaped data, we focus on developing such learning when this data displays discontinuities. In such a learning exercise, the inter-variable functional relation needs to be modelled using a high-dimensional stochastic process (a tensor-variate Gaussian Process, for example), the covariance function of which is non-stationary. The correlation between a pair of data slices (defined by two measured values of the observable, realised at two distinct values of the system parameter) is sometimes parametrically modelled as a function of the distance between the values of the system parameter at which these slices are realised, i.e. "similarity" in values of the observable can be modelled as a function of "similarity" in the corresponding system parameter values. However, if there are discontinuities in the data, then such a mapping between the two "similarities" no longer holds. Instead, discontinuities in the data call for a model of the correlation that adapts to these discontinuities. We present such correlation modelling in this paper, by modelling each scalar-valued hyperparameter of the correlation structure of the high-dimensional stochastic process as a random function of the sample path of that process; this random function can itself be modelled as a realisation of a scalar-variate stochastic process, a scalar-variate Gaussian Process (GP) for example (Section 2). Thus, the learning of the sought function is double-layered, in which multiple scalar-variate GPs inform a high-dimensional (tensor-variate) GP. Importantly, we show below (Section 3.3) that no more than 2 such layers in the learning strategy suffice. Thus, the data on the observable can be shown to be sampled from a compound of a tensor-variate and multiple scalar-variate Gaussian Processes.
Acknowledgement of non-stationarity in correlation learning is not new (Paciorek and Schervish, 2004). In some approaches, a transformation of the space of the input variable is suggested, to accommodate non-stationarity (Sampson and Guttorp, 1992; Snoek et al., 2014; Schmidt and O'Hagan, 2003). When faced with learning the dynamically varying covariance structure of time-dependent data, others have resorted to learning such a covariance using the Generalised Wishart Process (Wilson and Ghahramani, 2011). In another approach, latent parameters that bear information on non-stationarity have been modelled with GPs and learnt simultaneously with the sought function (Tolvanen et al., 2014), while others have used multiple GPs to capture the non-stationarity (Gramacy, 2005; Heinonen et al., 2016). However, what has not been presented is a template for including non-stationarity in high-dimensional data, by nesting lower-dimensional Gaussian Processes with distinct covariances within a tensor-variate GP (Section 2 and Section 3), using a Metropolis-within-Gibbs inference scheme (Section 4), to perform with-uncertainty learning of a high-dimensional function, given discontinuities that show up in hypercuboidally-shaped datasets in general, with illustration of the method on a cuboidally-shaped, real-world dataset (Section 5, Section 6). This is what we introduce in this paper. Our model is capacitated to learn the temporally-evolving covariance of time-dependent data (Section 3.2), if such is the data at hand, but the focus of our interest is to follow the learning of the sought tensor-valued functional relation between a system parameter vector and a tensor-valued observable, with inverse Bayesian prediction of the system parameter values at which test data on the observable is measured (Section 7, Section 6). Additionally, the flexibility of our model design permits both inverse and forward predictions. So we also predict new data at chosen system parameter values, given our model and results, and perform model checking by comparing such generated data against the empirically observed data (Section 3 of the Supplementary Materials).
2 Model
Let the system parameter vector be affected by the observable, which is in general tensor-valued (of some given order). That the observable bears influence on the system parameter suggests a functional relationship between the two, which we seek to learn.
Definition 2.1.
We define the functional relationship between the system parameter and the observable as a "tensor-valued function", with as many component functions as there are components of the observable, where these components suffer inter-correlations. Thus, the learning of this function is equivalent to learning its component functions, inclusive of learning the correlation amongst these component functions.
The inverse of this function is defined as the tensor-valued function of the same dimensionalities, comprising the inverses of each of its component functions, assuming the inverse of each component function exists.
The inversion of the sought function allows for the forward prediction of the observable given a measured value of the system parameter, as well as for the inverse prediction of the value of the system parameter at which a given measurement of the observable is recorded. It may be queried: why do we undertake the seemingly more difficult learning of the tensor-valued function (that outputs the tensor-valued observable), rather than of the vector-valued function (that outputs the system parameter vector)? We do this because we want to retain the capacity of predicting both new data at a given value of the system parameter, as well as the system parameter at which a new measurement of the observable is realised.
Remark 2.1.
If we had set ourselves the task of learning the "vector-valued" function that outputs the system parameter vector, a function that is lower-dimensional, with fewer component functions than the tensor-valued function, then we could not have predicted the value of the observable at a given system parameter value. The vector-valued inverse function cannot yield values for all components of the tensor-valued observable at this given system parameter value, if the number of such components exceeds the dimensionality of the system parameter vector.
The learning of the function uses the training data. Conventional prediction of the system parameter value at which test data on the observable is realised would then apply the learnt inverse function to that test data.

However, there is no objective way for the uncertainties incurred in the learning of the function to propagate into the uncertainty of this prediction. This underpins an advantage of Bayesian prediction of one variable, given test data on the other, subsequent to learning the functional relationship using training data.

Conventional fitting methods (such as fitting with splines, etc.) also fumble when measurements of both or either of the r.v.s are accompanied by measurement errors; in light of this, it becomes difficult to infer the function that fits the data best. In fact, the uncertainty in the learning of the sought function is then also difficult to quantify.

Secondly, there is no organic way of quantifying the smoothness of the sought function in the conventional approach. Ideally, we would prefer to learn this smoothness from the data itself. However, there is nothing intrinsic to the fitting-with-splines/wavelets method that can, in principle, quantify the smoothness of the curve, given a training data set.

Lastly, when the observable is an r.v. that is no longer a scalar, but higher-dimensional (tensor-valued in general), fitting with splines/wavelets becomes useless, since in such cases the component functions of the sought tensor-valued function are correlated, and methods such as parametric fitting approaches cannot capture such correlation, given the training data. As we have remarked above, the correlation amongst the component functions of the sought function is the same correlation structure as amongst the components of the tensor-valued observable, so in principle, the sought correlation can be learnt from the training data.
In light of this, we identify a relevant Stochastic Process that can give a general, non-restrictive description of the sought function, a Gaussian Process for example. The joint probability density of a set of realisations of a function sampled from this Process is then driven by the Process under consideration, where each such realisation of the function equals a value of the output variable. Thus, this joint density also represents the likelihood of the Process parameters, given the relevant set of values of the output variable, i.e. the data. We impose judiciously chosen priors, to write the posterior probability density of the Process parameters given the data. Generating samples from this posterior then allows for the identification of the 95% HPD credible regions on these Process parameters, i.e. on the learnt function. It is possible to learn the smoothness of the function generated from this Process, via kernel-based parametrisation of the covariance structure of the GP under consideration. Thus, we focus on the pursuit of adequate covariance kernel parametrisation.
Proposition 2.1.
When possible, covariance matrices of the GP that is invoked to model the sought function are kernel-parametrised using stationary-looking kernel functions, the hyperparameters of which are modelled as dependent on the sample paths (or rather sample functions) of this GP. We show below (Lemma 3.1) that such a model can address the anticipated discontinuities in data.
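To make the link between the covariance kernel and smoothness concrete, here is a minimal scalar sketch (our own toy values and function names, not the paper's computation): sample paths drawn from a zero-mean GP with a squared-exponential kernel become visibly rougher as the kernel's length scale shrinks, which is exactly the property that kernel-based parametrisation lets us learn from data.

```python
import numpy as np

# illustrative sketch; names and toy values are our own, not the paper's notation
def sqe_kernel(x, lengthscale, amplitude=1.0, jitter=1e-6):
    diff = x[:, None] - x[None, :]
    return amplitude * np.exp(-(diff / lengthscale) ** 2) + jitter * np.eye(len(x))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
for lengthscale in (0.3, 0.05):
    K = sqe_kernel(x, lengthscale)
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(x))   # one GP sample path
    # mean absolute increment between neighbouring grid points: larger means a rougher path
    print(lengthscale, np.mean(np.abs(np.diff(f))))
```

The smaller length scale yields the larger mean increment, i.e. the rougher sample path.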
As the LHS of the relationship between the observable and the system parameter is tensor-valued, the sought function is a tensor-variate function of equal dimensionalities. So we model it as a realisation from a tensor-variate GP.
Definition 2.2.
Modelling the sought function as sampled from a tensor-variate GP, we get that the joint probability of the set of its values at each of the design points (that reside within the training data) follows a Tensor Normal distribution (Kolda and Bader, 2009; Richter et al., 2008; McCullagh, 1987; Manceur and Dutilleul, 2013). The mean of this density is a mean tensor of the same order and dimensions as the data tensor, and the covariance structure comprises one covariance matrix per mode of that tensor. Writing the data tensor as $\mathcal{D}$, the mean tensor as $\mathcal{M}$, and the $j$-th mode covariance matrix as the $m_j\times m_j$-dimensional $\boldsymbol{\Sigma}_j$, $j=1,\ldots,k$, the likelihood of the mean tensor and covariance matrices given the data is the Tensor Normal density:

$$f(\mathcal{D}\mid\mathcal{M},\boldsymbol{\Sigma}_1,\ldots,\boldsymbol{\Sigma}_k)=\frac{\exp\left(-\frac{1}{2}\left\lVert(\mathcal{D}-\mathcal{M})\times_1\boldsymbol{\Sigma}_1^{-1/2}\times_2\cdots\times_k\boldsymbol{\Sigma}_k^{-1/2}\right\rVert^2\right)}{(2\pi)^{m/2}\prod_{j=1}^{k}\lvert\boldsymbol{\Sigma}_j\rvert^{m/(2m_j)}},\qquad m=\prod_{j=1}^{k}m_j,\qquad(2.1)$$

where the observed values of the tensor-valued observable are collated to form the data tensor $\mathcal{D}$. The notation $\times_j$ in Equation 2.1 denotes the $j$-mode product of a matrix and a tensor (Oseledets, 2011). Here $\boldsymbol{\Sigma}_j^{1/2}$ is the unique square root of the positive definite covariance matrix $\boldsymbol{\Sigma}_j$, i.e. $\boldsymbol{\Sigma}_j=\boldsymbol{\Sigma}_j^{1/2}\boldsymbol{\Sigma}_j^{1/2}$.
One example of a computational algorithm that can be invoked to realise such a square root of a matrix is Cholesky decomposition. The covariance tensor of this Tensor Normal distribution has been decomposed into different covariance matrices by Tucker decomposition (Hoff et al., 2011; Manceur and Dutilleul, 2013; Kolda and Bader, 2009; Xu and Yan, 2015), to yield one covariance matrix per mode, where the $j$-th covariance matrix $\boldsymbol{\Sigma}_j$ is an $m_j\times m_j$-dimensional square matrix. As Hoff (1997) and Manceur and Dutilleul (2013) suggest, a random tensor of a given order can be decomposed into a tensor of the same order and a set of covariance matrices by the Tucker product. It can be proved that all tensors can be decomposed into a set of covariance matrices (Xu et al., 2011), though not uniquely. This may cause difficulty in finding the correct combination of covariance matrices that present the correlation structure of the data at hand. One way to solve this problem is to use priors on the respective covariance parameters.
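As a rough illustration of how the density in Equation 2.1 can be evaluated in practice, the sketch below (in Python; the names D, M, Sigmas and the toy shapes are our assumptions, not the paper's notation) whitens the residual tensor by successive mode products with inverted Cholesky factors; since each covariance matrix equals the product of its Cholesky factor with its transpose, the resulting quadratic form is the same as with the symmetric square root.

```python
import numpy as np

def mode_product(T, A, mode):
    """j-mode product: multiply matrix A into tensor T along axis `mode`."""
    return np.moveaxis(np.tensordot(A, T, axes=(1, mode)), 0, mode)

def tensor_normal_loglik(D, M, Sigmas):
    """Log of a Tensor Normal density in the style of Eq. 2.1.

    D, M   : data and mean tensors of identical shape.
    Sigmas : list of per-mode covariance matrices; Sigmas[j] is m_j x m_j.
    """
    R = D - M
    m = D.size
    logdet_term = 0.0
    for j, S in enumerate(Sigmas):
        L = np.linalg.cholesky(S)               # S = L L^T; used here as the square root
        R = mode_product(R, np.linalg.inv(L), j)  # whiten the residual along mode j
        logdet_term += (m / S.shape[0]) * np.linalg.slogdet(S)[1]
    return -0.5 * (np.sum(R ** 2) + logdet_term + m * np.log(2.0 * np.pi))

# toy example with an order-3 data tensor; shapes and values are illustrative only
rng = np.random.default_rng(0)
D = rng.standard_normal((3, 4, 2))
M = np.zeros_like(D)
Sigmas = []
for mj in D.shape:
    A = rng.standard_normal((mj, mj))
    Sigmas.append(A @ A.T + mj * np.eye(mj))    # positive definite mode covariances
print(tensor_normal_loglik(D, M, Sigmas))
```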
We employ the likelihood in Equation 2.1 to write the joint posterior probability density of the mean tensor and covariance matrices, given the data. But prior to doing that, we identify those parameters, if any, that can be estimated in a pre-processing stage of the inference, in order to reduce the computational burden of inference. It would also be useful to find ways of (kernel-based) parametrisation of the sought covariance matrices, thereby reducing the number of parameters that we need to learn. To this effect, we undertake the estimation of the mean tensor first: it is empirically estimated as the sample mean over the repeated measurements that make up the data tensor. However, if necessary, the mean tensor itself can be regarded as a random variable and learnt from the data (Chakrabarty et al., 2015). The modelling of the covariance structure of this GP is discussed in the following subsection. Ultimately, we want to predict the value of one variable at which new or test data on the other variable is observed.
Proposition 2.2.
To perform inverse prediction of the value of the input variable at which test data on the observable is realised, we will either

– sample from the posterior probability density of this input value, given the test data and the (modal) values of the unknowns that parametrise the covariance matrices of the high-dimensional GP invoked to model the sought function, subsequent to learning the marginals of each such unknown given the training data, using MCMC; or

– sample from the joint posterior probability density of this input value and all other unknown parameters of this high-dimensional GP, given the training as well as the test data, using MCMC.

The computational speed of the first approach is higher, as the marginal distributions of the GP parameters are learnt separately. When the training data is small, or if the training data is not representative of the test data at hand, the learning of the test input via the second method may affect the learning of the GP parameters. A toy scalar illustration of the first route is sketched below.
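The sketch uses a deliberately simplified scalar setting (one-dimensional input and output, fixed GP hyperparameters, uniform prior on [0, 1]; all names and values are ours, not the paper's): a random-walk Metropolis sampler over the test input, with each proposal scored by the GP predictive density of the test observation.

```python
import numpy as np

def sqe(x1, x2, amp=1.0, ell=0.3):
    return amp * np.exp(-((x1[:, None] - x2[None, :]) / ell) ** 2)

def gp_predictive(s_star, s_train, v_train, noise=1e-3, amp=1.0, ell=0.3):
    """Scalar-GP predictive mean and variance at s_star, hyperparameters held fixed."""
    K = sqe(s_train, s_train, amp, ell) + noise * np.eye(len(s_train))
    k_star = sqe(np.array([s_star]), s_train, amp, ell)[0]
    mean = k_star @ np.linalg.solve(K, v_train)
    var = amp + noise - k_star @ np.linalg.solve(K, k_star)
    return mean, var

def metropolis_inverse(v_test, s_train, v_train, n_iter=5000, step=0.1, seed=1):
    """Random-walk Metropolis over s*, scoring proposals by the GP predictive density."""
    rng = np.random.default_rng(seed)
    def logpost(s):
        if not 0.0 <= s <= 1.0:                  # assumed uniform prior support: [0, 1]
            return -np.inf
        mu, var = gp_predictive(s, s_train, v_train)
        return -0.5 * ((v_test - mu) ** 2 / var + np.log(2.0 * np.pi * var))
    s_curr, samples = 0.5, []
    lp_curr = logpost(s_curr)
    for _ in range(n_iter):
        s_prop = s_curr + step * rng.standard_normal()
        lp_prop = logpost(s_prop)
        if np.log(rng.uniform()) < lp_prop - lp_curr:
            s_curr, lp_curr = s_prop, lp_prop
        samples.append(s_curr)
    return np.array(samples)

# toy, monotonic training curve, then inversion of one test observation
s_train = np.linspace(0.0, 1.0, 20)
v_train = np.exp(s_train)
post = metropolis_inverse(v_test=np.exp(0.37), s_train=s_train, v_train=v_train)
print(post[1000:].mean(), np.percentile(post[1000:], [2.5, 97.5]))
```

The posterior samples concentrate around the input value at which the test observation was generated, with a spread reflecting the noise level and the GP predictive uncertainty.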
2.1 Three ways of learning covariance matrices
Let the $(p,q)$-th element of the $j$-th covariance matrix $\boldsymbol{\Sigma}_j$ (see Equation 2.1) be $\sigma^{(j)}_{pq}$; $p,q=1,\ldots,m_j$; $j=1,\ldots,k$.
Definition 2.3.
At a given $j$, $\sigma^{(j)}_{pq}$ bears information about the covariance between the $p$-th and $q$-th slices of the data tensor, taken along the $j$-th mode; in particular, for the mode whose slices are indexed by the design points, each such "slice" of the data tensor is a measured value of the tensor-valued observable, the $p$-th slice being realised at the $p$-th design point.
The covariance between the $p$-th and $q$-th slices of data decreases as the slices get increasingly more disparate, i.e. with increasing distance between the design points at which these slices are realised. In fact, we can model $\sigma^{(j)}_{pq}$ as a decreasing function of this disparity, $\sigma^{(j)}_{pq}=K(\boldsymbol{s}_p,\boldsymbol{s}_q)$, where $K(\cdot,\cdot)$ is the covariance kernel function, computed at the $p$-th and $q$-th values of the input variable. In such a model, the number of distinct unknown parameters involved in the learning of this covariance matrix reduces from the number of its distinct elements to the number of hyperparameters that parametrise the kernel function $K(\cdot,\cdot)$.
However, kernel parametrisation is not always possible.
– Firstly, this parametrisation may cause information loss, and this may not be acceptable (Aston and Kirch, 2012).

– Again, we will necessarily avoid kernel parametrisation when we cannot identify input parameter values at which the corresponding slices in the data are realised.

In such situations,

– we can learn the elements of the covariance matrix directly using MCMC, though direct learning of all distinct elements of the matrix is feasible only as long as the total number of unknowns learnt by MCMC remains within computationally acceptable limits;

– we can use an empirical estimate of the covariance matrix. We collapse each of the tensor-shaped slices of the data onto one chosen axis in the space of the observable, thereby reducing each slice to a vector, so that the $(p,q)$-th element of the covariance matrix is the covariance computed using the $p$-th and $q$-th such vectors.
Indeed such an empirical estimate of any covariance matrix is easily generated, but it indulges in linearisation amongst the different dimensionalities of the observable, causing loss of information about the covariance structure amongst the components of these high-dimensional slices. This approach is inadequate when the sample size is small, because the sample-based estimate will tend to be incorrect; indeed discontinuities and steep gradients in the data, especially in small-sample and high-dimensional data, will render such estimates of the covariance structure incorrect. Importantly, such an approach does not leave any scope for identifying the smoothness in the function that represents the functional relationship between the input and output variables. Lastly, the uncertainties in the estimated covariance structure of the GP remain inadequately known.
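A minimal sketch of such an empirical estimate, for an illustrative 3-way data tensor (the function and variable names, shapes and the choice of collapsed axis are our assumptions, not the paper's):

```python
import numpy as np

def empirical_slice_covariance(data, index=0):
    """Empirical covariance matrix across the slices of a 3-way data tensor.

    data  : array of shape (n, d1, d2); slice p is data[p], realised at the
            p-th design point.
    index : the single position kept along the last axis, reducing each
            d1 x d2 slice to a d1-dimensional vector.
    """
    vectors = data[:, :, index]          # one d1-dimensional vector per slice
    return np.cov(vectors)               # (n, n); element (p, q) uses slices p and q

rng = np.random.default_rng(3)
data = rng.standard_normal((5, 6, 4))    # 5 slices, each a 6 x 4 matrix
Sigma_hat = empirical_slice_covariance(data)
print(Sigma_hat.shape)                   # (5, 5): one row and column per slice
```

As the surrounding text notes, discarding all but one column of each slice is precisely where the linearisation and information loss enter.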
Proposition 2.3.
We model the covariance matrices as
– kernel-parametrised,
– or empirically estimated,
– or learnt directly using MCMC.
An accompanying computational worry is the inversion of any of the covariance matrices; for an $n\times n$-dimensional covariance matrix, the computational order of matrix inversion is well known to be $\mathcal{O}(n^3)$ (Knuth, 1997).
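Although not prescribed by the method itself, a standard practical mitigation is to factorise each covariance matrix once (for example by Cholesky) and reuse the factor for solves and log-determinants, rather than forming an explicit inverse; a small sketch with our own toy matrix follows.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(7)
n = 200
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # a toy positive definite covariance matrix
y = rng.standard_normal(n)

# factorise once, then reuse for every quadratic form and determinant needed
c, low = cho_factor(Sigma)
quad = y @ cho_solve((c, low), y)        # y^T Sigma^{-1} y without computing Sigma^{-1}
logdet = 2.0 * np.sum(np.log(np.diag(c)))

# the same quantities via the explicit inverse, for comparison
quad_inv = y @ np.linalg.inv(Sigma) @ y
print(np.isclose(quad, quad_inv), logdet)
```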
3 Kernel parametrisation
Proposition 3.1.
Kernel parametrisation of a covariance matrix, when undertaken, uses a Squared Exponential (SQE) covariance kernel; writing the input (system parameter) values at the $p$-th and $q$-th design points as $\boldsymbol{s}_p$ and $\boldsymbol{s}_q$,

$$K(\boldsymbol{s}_p,\boldsymbol{s}_q)=a\exp\left(-(\boldsymbol{s}_p-\boldsymbol{s}_q)^{T}\mathbf{Q}^{-1}(\boldsymbol{s}_p-\boldsymbol{s}_q)\right),\qquad(3.1)$$

where $\mathbf{Q}$ is a diagonal matrix, the diagonal elements of which are the length scale hyperparameters that tell us how quickly correlation fades away in each of the directions in input space, s.t. the inverse matrix $\mathbf{Q}^{-1}$ is also diagonal, with diagonal elements given as the reciprocals of these length scales, i.e. as the smoothness hyperparameters along each direction of the input space. We learn these unknown hyperparameters from the data.
Here $a$ is the global amplitude, which is subsumed as a scale factor in one of the other covariance matrices, the distinct elements of which are learnt directly using MCMC.
Remark 3.1.
We avoid using a model for the kernel in which the amplitude depends on the locations at which the covariance is computed, and instead use a model endowed with the global amplitude $a$. This helps avoid learning a very large number of amplitude parameters directly from MCMC.
A loose interpretation of this amplitude modelling is that we have scaled all local amplitudes using the global factor $a$, and these scaled local amplitudes are then subsumed into the argument of the exponential on the RHS of Equation 3.1, s.t. the reciprocals of the correlation length scales, originally interpreted as the elements of the diagonal matrix $\mathbf{Q}^{-1}$, are now interpreted as smoothing parameters modulated by such local amplitudes. This interpretation is loose, since the same smoothness parameter cannot accommodate all local amplitudes (scaled by a global factor), for all pairs of locations.
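A small sketch of building a covariance matrix from the SQE kernel of Equation 3.1, with a diagonal length-scale matrix and a single global amplitude (array names and toy values are our own):

```python
import numpy as np

def sqe_covariance(S, lengthscales, amplitude=1.0, jitter=1e-6):
    """SQE covariance matrix (Eq. 3.1 style) over design points S of shape (n, d)."""
    inv_q = 1.0 / np.asarray(lengthscales)           # reciprocals = smoothness hyperparameters
    diff = S[:, None, :] - S[None, :, :]              # pairwise differences, shape (n, n, d)
    sq_dist = np.sum(diff ** 2 * inv_q, axis=-1)      # (s_p - s_q)^T Q^{-1} (s_p - s_q)
    return amplitude * np.exp(-sq_dist) + jitter * np.eye(len(S))

S = np.random.default_rng(0).uniform(size=(10, 2))    # 10 design points in a 2-D input space
K = sqe_covariance(S, lengthscales=[0.5, 0.1], amplitude=2.0)
print(K.shape, np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > 0))
```

A small jitter term is added to the diagonal purely for numerical stability; it is not part of the kernel model.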
3.1 Including non-stationarity, by modelling hyperparameters of covariance kernels as realisations of a Stochastic Process
By definition of the kernel function we choose (Equation 3.1), all functions sampled from the tensor-variate GP are endowed with the same length scale hyperparameters and global amplitude. However, the data on the output variable is not continuous, i.e. similarity between two input values does not imply similarity between the corresponding output values, computed in a universal way across the input space. Indeed, a stationary definition of the correlation for all pairs of points in the function domain is then wrong. One way to generalise the model for the covariance kernel is to suggest that the hyperparameters vary as random functions of the sample path.
Theorem 3.1.
For a function sampled from the GP, defined over a bounded subset of the input space, if the map from an input value to the corresponding function value is a Lipschitz-continuous map over this bounded set, where the absolute value of the correlation between the function values at any two inputs is given by the kernel model of Equation 3.1, then the vector of correlation hyperparameters is finite, and each element of this vector is dependent on the sampled function.
Proof.
Write the sampled function as $f(\cdot)$ and its inputs as $\boldsymbol{s},\boldsymbol{s}'\in\mathcal{X}$, where $\mathcal{X}$ is a bounded subset of the input space. The mapping $\boldsymbol{s}\longmapsto f(\boldsymbol{s})$ is defined to be a Lipschitz-continuous map, i.e.

$$d_2\left(f(\boldsymbol{s}),f(\boldsymbol{s}')\right)\leq L\,d_1(\boldsymbol{s},\boldsymbol{s}')\quad\forall\,\boldsymbol{s},\boldsymbol{s}'\in\mathcal{X},\qquad(3.2)$$

– for a constant $L\geq 0$, s.t. the infimum over all such constants is the finite Lipschitz constant for $f(\cdot)$;
– $(\mathcal{X},d_1)$ and $(f(\mathcal{X}),d_2)$ are metric spaces.

Let the metric $d_1$ be the Euclidean norm: $d_1(\boldsymbol{s},\boldsymbol{s}')=\lVert\boldsymbol{s}-\boldsymbol{s}'\rVert_2$, and let the metric $d_2$ be defined as (the square root of the logarithm of) the inverse of the correlation:

$$d_2\left(f(\boldsymbol{s}),f(\boldsymbol{s}')\right)=\sqrt{\log\frac{1}{\left|\mathrm{corr}\left(f(\boldsymbol{s}),f(\boldsymbol{s}')\right)\right|}},$$

– where, the correlation being a measure of affinity, the logarithm of its inverse transforms this affinity into a squared distance for this correlation model; so the transformation to a metric is undertaken;
– and the given kernel-parametrised correlation (cf. Equation 3.1) is

$$\left|\mathrm{corr}\left(f(\boldsymbol{s}),f(\boldsymbol{s}')\right)\right|=\exp\left(-(\boldsymbol{s}-\boldsymbol{s}')^{T}\mathbf{Q}^{-1}(\boldsymbol{s}-\boldsymbol{s}')\right),$$

so that

$$d_2\left(f(\boldsymbol{s}),f(\boldsymbol{s}')\right)=\sqrt{(\boldsymbol{s}-\boldsymbol{s}')^{T}\mathbf{Q}^{-1}(\boldsymbol{s}-\boldsymbol{s}')}.$$

Then for the map to be Lipschitz-continuous, we require

$$\sqrt{(\boldsymbol{s}-\boldsymbol{s}')^{T}\mathbf{Q}^{-1}(\boldsymbol{s}-\boldsymbol{s}')}\leq L\,\lVert\boldsymbol{s}-\boldsymbol{s}'\rVert_2\quad\forall\,\boldsymbol{s},\boldsymbol{s}'\in\mathcal{X},\qquad(3.3)$$

where the vector of correlation hyperparameters (the diagonal of $\mathbf{Q}^{-1}$) is finite given finite $L$.
∎
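A quick numerical check of the transformation used in this proof, under the same SQE correlation model (the dimensionality and hyperparameter values below are our own illustrative choices): the metric defined as the square root of the log-inverse-correlation reduces to the $\mathbf{Q}^{-1}$-weighted distance, which is bounded by the square root of the largest smoothness hyperparameter times the Euclidean distance, so a finite Lipschitz constant caps the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 3
inv_q = np.array([4.0, 0.25, 1.0])       # smoothness hyperparameters: diagonal of Q^{-1}

def corr(s1, s2):
    return np.exp(-np.sum(inv_q * (s1 - s2) ** 2))

def d2(s1, s2):                            # metric: sqrt(log(1 / correlation))
    return np.sqrt(np.log(1.0 / corr(s1, s2)))

L = np.sqrt(inv_q.max())                   # candidate Lipschitz constant from the hyperparameters
for _ in range(5):
    s1, s2 = rng.uniform(size=d), rng.uniform(size=d)
    weighted = np.sqrt(np.sum(inv_q * (s1 - s2) ** 2))
    euclid = np.linalg.norm(s1 - s2)
    print(np.isclose(d2(s1, s2), weighted), d2(s1, s2) <= L * euclid + 1e-12)
```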
Given discontinuities in the data on the observable, the sampled function is not expected to obey the Lipschitz criterion defined in Inequation 3.2 globally. We anticipate a sample function to be locally or globally discontinuous.
Lemma 3.1.
A sample function can be s.t.

– Case I: there exists no finite Lipschitz constant for which the criterion of Inequation 3.2 holds over the bounded set on which the function is defined.

– Case II: there exist isolated points of discontinuity, away from which the criterion of Inequation 3.2 holds, but at which it does not. In such a case, the Lipschitz constant used for the sample function is defined to be the one that holds on the set excluding these points of discontinuity.

If each function in the set of sample functions is
– either globally Lipschitz, or is as described in Case II,
– and Case I does not hold true for any of them, then
there exists a finite maximum of the Lipschitz constants defined for the individual sample functions, i.e.
a finite Lipschitz constant exists that is valid for all sample functions, and therefore
a universal correlation hyperparameter vector exists for all sample functions (by Theorem 3.1).
Lemma 3.2.
Following on from Lemma 3.1, if for any sample function Case I holds, then a finite maximum of the Lipschitz constants of the sample functions does not exist, so there is no finite Lipschitz constant valid for all sample functions, and therefore no universal correlation hyperparameter vector for all sample functions, i.e. we need to model the correlation hyperparameters to vary with the sample function.
Remark 3.2.
Above, the correlation hyperparameters referred to are the hyperparameters of the correlation kernel; they are interpreted as the reciprocals of the length scales, i.e. as the diagonal elements of $\mathbf{Q}^{-1}$ in Equation 3.1.
Remark 3.3.
If the map from the input to the sampled function value is Lipschitz-continuous (i.e. if the hyperparameters are finite and dependent on the sampled function, by Theorem 3.1), then by Kerkheim's Theorem (Kerkheim, 1994), the sampled function is differentiable almost everywhere in its domain; this is a generalisation of Rademacher's Theorem to metric differentials (see Theorem 1.17 in Hajlasz (2014)). However, in our case, the function is not necessarily differentiable, given discontinuities in the data on the observable, and therefore is not necessarily Lipschitz.
Thus, Theorem 3.1 and Lemma 3.1 negate the usage of a universal correlation length scale that is independent of the sampled function, in anticipation of discontinuities in the sample functions.
Proposition 3.2.
Each component of the correlation length scale hyperparameter vector is modelled as a random function of the sample function of the tensor-variate GP.
Thus, in this updated model, every component of the correlation hyperparameter vector is modelled as randomly varying with the sample function of the tensor-variate GP.
In the Metropolis-within-Gibbs-based inference that we undertake, one sample function of the tensor-variate GP is generated in every iteration, so that each hyperparameter that we model above as randomly varying with the sample path of the tensor-variate GP can be viewed as a scalar-valued random function of the iteration number, where this scalar-valued random function is modelled as a realisation from a scalar-variate GP.
The scalar-variate GP that the hyperparameter along one direction of the input space is sampled from is independent of the scalar-variate GP that the hyperparameter along any other direction is sampled from. In addition, the parameters that define the correlation function of the generative scalar-variate GP can vary, namely the amplitude and scale of one such GP might be different from those of another. Thus, scalar-valued functions sampled from GPs with varying correlation parameters, even for the same value of the argument, should be marked by these descriptor variables, namely the amplitude and the scale.
Proposition 3.3.
We update the relationship between the iteration number and the correlation length scale hyperparameter in each direction of the input space, to be such that this hyperparameter, viewed as a function of the iteration number, is a realisation of a scalar-variate GP with an SQE-looking covariance function, where
– the amplitude variable of the SQE-looking covariance function of the scalar-variate GP that this hyperparameter function is a realisation of takes a value specific to that direction and iteration;
– the length scale variable of the SQE-looking covariance function of that scalar-variate GP is likewise specific to it.
Then the scalar-variate GPs that the hyperparameter functions along two distinct directions are sampled from have distinct correlation functions if these descriptor variables differ.
Proposition 3.4.
The current value of the correlation length scale hyperparameter acknowledges information on only a fixed number of past iterations, as in:

(3.5)

where the constant that controls this look-back is unknown, and we learn it from the data during the initial iterations.
As the correlation length scale hyperparameter, viewed as a function of the iteration number, is a realisation from a scalar-variate GP, the joint probability distribution of its values over the look-back window, at a given iteration, is Multivariate Normal, with a mean vector and a covariance matrix whose dimensionality equals the number of looked-back iterations, i.e.

(3.6)
Definition 3.1.
Here the look-back length is the number of iterations that we look back at, to collect the dynamically-varying "look-back data" that is employed to learn the parameters of the scalar-variate GP that the hyperparameter function is modelled with.

– The mean vector is empirically estimated as the mean of the dynamically-varying look-back data, s.t. at the current iteration it is estimated as a vector, of dimensionality equal to the look-back length, each component of which equals the sample mean of the looked-back values.

– The covariance matrix is dependent on the iteration number, and this is now acknowledged in the notation.

In each iteration, upon the empirical estimation of the mean as given above, it is subtracted from the "look-back data", so that the subsequent mean-subtracted look-back data is what we use.
Definition 3.2.
In light of this declared usage of the mean-subtracted "look-back data", we update the likelihood over what is declared in Equation 3.6, to the corresponding zero-mean Multivariate Normal density of this mean-subtracted look-back data:

(3.7)
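A rough sketch of this look-back construction (an assumed look-back length, an SQE-like kernel over iteration indices, and variable names of our own; the paper's own covariance for Equation 3.7 is derived in the next subsection):

```python
import numpy as np
from scipy.stats import multivariate_normal

def lookback_loglik(ell_history, t, m, amp=1.0, scale=5.0, jitter=1e-6):
    """Zero-mean MVN log-likelihood (Eq. 3.7 style) of the mean-subtracted
    look-back data: the m values of the hyperparameter over iterations
    t-m, ..., t-1, with an SQE-like covariance over iteration indices."""
    window = np.asarray(ell_history[t - m:t])            # the look-back data
    centred = window - window.mean()                      # empirical mean subtracted
    idx = np.arange(m, dtype=float)
    cov = amp * np.exp(-((idx[:, None] - idx[None, :]) / scale) ** 2)
    cov += jitter * np.eye(m)
    return multivariate_normal(mean=np.zeros(m), cov=cov).logpdf(centred)

# toy history of a slowly drifting length scale hyperparameter over 300 iterations
rng = np.random.default_rng(11)
history = 1.0 + 0.01 * np.cumsum(rng.standard_normal(300))
print(lookback_loglik(history, t=300, m=20))
```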
3.2 Temporally-evolving covariance matrix
Theorem 3.2.
The dynamically varying covariance matrix of the Multivariate Normal likelihood in Equation 3.7, at a given iteration number, is distributed according to a Wishart distribution, the parameters of which are expressed in terms of the following:
– the number of iterations we look back to, i.e. the look-back length;
– the covariance kernel parametrising the covariance function of the scalar-variate GP that generates the scalar-valued hyperparameter function, evaluated at the vector of descriptor variables (the amplitude and scale);
– a positive definite square scale matrix, of dimensionality equal to the look-back length, containing the amplitudes of this covariance function;
– one such construction for each direction of the (finite-dimensional) space of the input variable.
Proof.
The covariance kernel that parametrises the covariance function of the scalar-variate GP that generates the hyperparameter function is normalised, i.e. it equals 1 when evaluated at coincident arguments.
In a general model, at each iteration, a new value of the vector of descriptor variables in each direction of the space of the input variable is generated.
The covariance between values of the sampled hyperparameter function can then be expressed using a Delta function, and its sample estimate is provided by the corresponding element of the empirical covariance matrix of the look-back data. This definition of the covariance holds since the mean of the relevant r.v. is 0, as we have sampled the function from a zero-mean scalar-variate GP.
Let a diagonal matrix be defined, the diagonal elements of which are the relevant amplitudes. Then, upon factorising the scale matrix, the remaining factor is diagonal as well; this is defined for every iteration.
Then, at a given iteration number, we define the current covariance matrix accordingly, and it is distributed according to the Wishart distribution, with degrees of freedom given by the look-back length and with the aforementioned scale matrix (Eaton, 1990), i.e. the dynamically-varying covariance matrix is Wishart-distributed.
∎
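As a quick numerical reminder of the standard fact invoked at the end of this proof (toy dimensions and names are our own): a sum of outer products of independent zero-mean multivariate normal draws with covariance V follows a Wishart distribution with scale matrix V, so its expectation is the number of draws times V.

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, n_rep = 3, 20, 4000                        # dimension, degrees of freedom, Monte Carlo size
A = rng.standard_normal((p, p))
V = A @ A.T + p * np.eye(p)                      # scale matrix of the Wishart

# each replicate: sum over m draws of outer products x x^T, with x ~ N(0, V)
acc = np.zeros((p, p))
for _ in range(n_rep):
    X = rng.multivariate_normal(np.zeros(p), V, size=m)
    acc += X.T @ X
print(np.allclose(acc / n_rep, m * V, rtol=0.05))  # E[W] = m * V for W ~ Wishart_p(m, V)
```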
Remark 3.4.
If interest lies in learning the covariance matrix at any time point, we could proceed to inference from here, in an attempt to learn the unknown parameters of this process given the look-back data.
Our learning scheme would then involve compounding a Tensor-Variate GP and a Generalised Wishart Process.
The above would be a delineated route to recover the temporal variation in the correlation structure of time-series data (as studied, for example, by Wilson and Ghahramani (2011)).
Remark 3.5.
In our study, the focus is on high-dimensional data that display discontinuities, and on learning the relationship between the observable that generates such data and the system parameter, with the ulterior aim being parameter value prediction. So learning the time-varying covariance matrix is not the focus of our method development.
We want to learn the sought function given the training data. The underlying motivation is to sample a new value of the correlation length scale hyperparameter from a scalar-variate GP, in order to subsequently sample a new tensor-valued function from the tensor-normal GP, at the new value of its correlation length scale hyperparameter vector.
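To make the two-layer idea concrete, the toy sketch below (scalar output, one input dimension; names, values, and the log-scale device that keeps the sampled length scale positive are our own simplifications, not the paper's notation) first draws a path of the length scale hyperparameter from a scalar-variate GP over iteration number, and then, at each iteration, draws a new sample function from a GP whose SQE kernel uses that iteration's length scale.

```python
import numpy as np

rng = np.random.default_rng(2)

def sqe(x1, x2, amp, ell):
    return amp * np.exp(-((x1[:, None] - x2[None, :]) / ell) ** 2)

def draw_gp_sample(x, amp, ell, jitter=1e-6):
    """One sample path of a zero-mean GP with an SQE kernel, evaluated at x."""
    K = sqe(x, x, amp, ell) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

# inner layer: the length scale hyperparameter as a random function of iteration number
iterations = np.arange(10, dtype=float)
log_ell_path = draw_gp_sample(iterations, amp=0.2, ell=3.0)   # smooth path on the log scale
ell_per_iter = np.exp(log_ell_path)                           # our device to keep it positive

# outer layer: at each iteration, a new sample function drawn with that iteration's length scale
s_grid = np.linspace(0.0, 1.0, 50)
sample_functions = [draw_gp_sample(s_grid, amp=1.0, ell=ell_t) for ell_t in ell_per_iter]
print(ell_per_iter.round(2))
print(np.array(sample_functions).shape)                       # (10, 50): one path per iteration
```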
3.3 Two layers suffice
One immediate concern that can be raised is the reason for limiting the layering of our learning scheme to only 2. It may be argued that, just as we ascribe stochasticity to the length scales that parametrise the correlation structure of the tensor-variate GP that models the sought function, we need to do the same to the descriptor variables that parametrise the correlation structure of the scalar-variate GP that models each hyperparameter. Following this argument, we would need to hold the descriptor variables, or at least the scale, to be dependent on the sample path of the scalar-variate GP.
However, we show below that a global choice of the scale is possible, irrespective of the sampled scalar-valued function, given that such a function is always continuous (a standard result). In contrast, the outer function not being necessarily Lipschitz (see Remark 3.3) implies that its correlation kernel hyperparameters are dependent on the sampled function.
Theorem 3.3.
Given a function sampled from the scalar-variate GP under consideration, defined over a bounded set of arguments, the map from an argument to the corresponding function value is a Lipschitz-continuous map.
The proof of this standard theorem is provided in Section 4 of the Supplementary Materials.
Theorem 3.4.
For any sampled function realised from a scalar-variate GP that has a covariance function kernel-parametrised with an SQE kernel function, parametrised by amplitude and scale hyperparameters, the Lipschitz constant that defines the Lipschitz-continuity of this function is sample-dependent, and is given by the reciprocal of the scale hyperparameter, s.t. the set of values of the scale hyperparameters, over the samples taken from the scalar-variate GP, admits a finite minimum.
Proof.
For a function $g(\cdot)$ sampled from the scalar-variate GP, with arguments $x,x'$ that live in a bounded set, the map $x\longmapsto g(x)$ is a Lipschitz-continuous map (Theorem 12.1). The distance between any two arguments is given by the metric $|x-x'|$.
The distance between $g(x)$ and $g(x')$ is given by the metric $\sqrt{\log\left(1/\left|\mathrm{corr}\left(g(x),g(x')\right)\right|\right)}$,
which is finite (since the arguments live in a bounded set, and $g(\cdot)$ is continuous). The parametrised model of the correlation is

$$\left|\mathrm{corr}\left(g(x),g(x')\right)\right|=\exp\left(-\frac{(x-x')^{2}}{\delta^{2}}\right),$$

s.t. the metric on the range is $|x-x'|/\delta$, where $\delta$ is the scale hyperparameter.
Now, Lipschitz-continuity of $g(\cdot)$ implies

$$\sqrt{\log\frac{1}{\left|\mathrm{corr}\left(g(x),g(x')\right)\right|}}\leq L_g\,|x-x'|,\qquad(3.8)$$

where the Lipschitz constant $L_g$ is $g$-dependent (Theorem 3.1). As the metric on the range equals $|x-x'|/\delta$, the smallest constant for which Inequation 3.8 holds exists and is finite, and we get

$$L_g=\frac{1}{\delta}.\qquad(3.9)$$

As $x$ is any point in the bounded set, this holds for all points in that set.
Let the set of scale hyperparameter values, over all sample functions drawn from the scalar-variate GP, be