1 Introduction
This research is motivated by the analysis of a time-consuming computer code in nuclear engineering,
depending on both continuous and categorical inputs, one of them having more than 90 levels.
The final motivation is an inversion problem.
However, due to the heavy computational cost, a direct usage of the simulator is hardly possible.
A realistic approach is to use a statistical emulator or metamodel.
Thus, as a first step, we investigate the metamodelling of such computer code.
More precisely, we consider Gaussian process (GP) regression models, also called kriging models
(Sacks et al., 1989; Rasmussen and Williams, 2006),
which have been successfully used in sequential metamodel-based strategies for uncertainty quantification
(see e.g. Chevalier et al., 2014).
Whereas there is a flourishing literature on GP regression,
the part concerned with categorical inputs remains quite limited.
We refer to Zhang and Notz (2015) for a review.
As for continuous inputs, covariance functions or kernels
are usually built by combining 1-dimensional ones,
most often by multiplication or, more rarely, by addition (Deng et al., 2017).
The question then comes down to constructing a valid kernel on a finite set,
which is a positive semidefinite matrix.
Some effort has been spent on parameterization of general covariance matrices
(Pinheiro and Bates, 1996) and parsimonious parameterizations of smaller classes (Pinheiro and Bates, 2009).
Some block forms have also been proposed (Qian et al., 2007), in order to deal with a potentially large number of levels.
However, their validity (in terms of positive definiteness) was not investigated.
Furthermore, to the best of our knowledge,
applications in GP regression are limited to categorical inputs with very few levels, typically less than 5.
Guided by the application, we investigate more deeply the so-called group kernels cited in Qian et al. (2007),
defined by block covariance matrices with constant covariances between pairs of blocks and within blocks.
We exploit the hierarchy group/level by revisiting a nested Bayesian linear model
where the response term is a sum of a group effect and a level effect.
The level effects are assumed to sum to zero, which allows recovering negative within-group correlations.
This model leads to a parameterization which is automatically positive definite.
Interestingly, the assumption on the within-group blocks can be relaxed,
and we obtain a parameterization of a wider class of valid group kernels.
The positive definiteness condition is also made explicit:
it is equivalent to the positive definiteness of the smaller covariance matrix
obtained by replacing each block by its average.
As mentioned above, this work has some connections with Bayesian linear models
as well as linear mixed effect models (see e.g. Lindley and Smith (1972), Smith (1973))
in a hierarchical view.
Other related works concern hierarchical GPs with a tree structure.
For instance, particular forms of group kernels are obtained in multiresolution GP models
(Fox and Dunson, 2012; Park and Choi, 2010).
Such models usually assume that children are conditionally independent given the mother.
This is not the case in our model, due to the condition that the level effects sum to zero.
The paper is structured as follows. Section 2 gives some background on GP regression with mixed categorical and continuous inputs. Section 3 presents new findings on group kernels. Section 4 provides illustrations on synthetic examples. Section 5 is devoted to the application which motivated this work. Section 6 gives some conclusions and perspectives for future research.
2 Background and notations
2.1 GPs with continuous and categorical variables
We consider a set of continuous variables defined on a hypercubic domain ,
and a set of categorical variables with levels.
Without loss of generality, we assume that and that, for each ,
the levels of are numbered .
We denote , , and .
We consider GP regression models defined on the product space
and written as:
(1) 
where , and are respectively the trend, the GP part and a noise term. There exists a wide variety of trend functions, as in linear models. Our main focus here is on the centered GP , characterized by its kernel
Kernels on can be obtained by combining kernels on and kernels on . Standard valid combinations are the product, sum or ANOVA. Thus if denotes a kernel for the continuous variables , a kernel for the categorical ones , examples of valid kernels for are written:
For conciseness, we will denote by one of the operations: sum, product or ANOVA. The three formulas above can then be summarized by:
(2) 
Then, in turn, and can be defined by applying these operations to 1-dimensional kernels. For continuous variables, famous 1-dimensional kernels include the squared exponential or Matérn (Rasmussen and Williams, 2006). We denote by such kernels (). For a categorical variable, notice that, as a positive semidefinite function on a finite space, a kernel is a positive semidefinite matrix. We denote by the matrix of size corresponding to kernels for (). Thus, examples of expressions for and are written:
(3)  
(4) 
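As a small illustration of these combinations, here is a sketch with a hypothetical 3-level categorical kernel matrix and a squared exponential continuous kernel; the ANOVA form shown is one common convention, and all names and values are ours:

```python
import numpy as np

def k_cont(x1, x2, lengthscale=0.5):
    # Squared exponential kernel for a continuous input (illustrative choice).
    return np.exp(-0.5 * (x1 - x2) ** 2 / lengthscale ** 2)

# Hypothetical categorical kernel: a positive semidefinite 3 x 3 matrix
# whose (u, v) entry is the covariance between levels u and v.
T = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

def k_prod(x1, u1, x2, u2):
    # Tensor-product combination.
    return k_cont(x1, x2) * T[u1, u2]

def k_sum(x1, u1, x2, u2):
    # Additive combination.
    return k_cont(x1, x2) + T[u1, u2]

def k_anova(x1, u1, x2, u2):
    # ANOVA combination: product of (1 + kernel) factors.
    return (1.0 + k_cont(x1, x2)) * (1.0 + T[u1, u2])
```

At identical inputs, the product kernel returns the level variance, while the ANOVA kernel multiplies the inflated factors.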
The formulation given by Equations (2), (3), (4) is not the most general one, since kernels are not always obtained by combining 1-dimensional ones. Nevertheless, it encompasses the GP models used in the literature of computer experiments with categorical inputs. It generalizes the tensor-product kernels, very often used, and the sum used recently by Deng et al. (2017) on the categorical part. It also contains the heteroscedastic case, since the matrices
are not assumed to have a constant diagonal, contrary to most existing works (Zhang and Notz, 2015). This will be useful in the application of Section 5, where the variance of the material is level dependent.
Remark 1.
Combining kernels needs some care to obtain identifiable models. For instance, the product of kernels with (), is a kernel depending on only one variance parameter . The GP model is identifiable for this new parameter, but not for the initial parameters .
2.2 1-dimensional kernels for categorical variables
We consider here a single categorical variable with levels . We recall that a kernel for is then a by positive semidefinite matrix .
2.2.1 Kernels for ordinal variables
A categorical variable with ordered levels is called ordinal. In this case, the levels can be viewed as a discretization of a continuous variable. Thus a GP on can be obtained from a 1-dimensional GP on the interval by using a non-decreasing transformation (also called warping):
Consequently, the covariance matrix can be written:
(5) 
When depends on the distance , then
depends on the distance between the levels , distorted by .
In the general case, is piecewise-linear and defined by
parameters. However, a parsimonious parameterization may be preferred, based on the cdf of a flexible probability distribution such as the Normal or the Beta. We refer to
(McCullagh, 1980) for examples in regression and to (Qian et al., 2007) for illustrations in computer experiments. There is also some flexibility in the choice of the continuous kernel . The standard squared exponential or Matérn kernels are admissible, but induce positive correlations between levels. In order to allow negative correlations, one may choose, for instance, the cosine correlation kernel on :
(6) 
where is a fixed parameter tuning the minimal correlation value. Indeed, (6) defines a decreasing function of from to . It is a valid covariance function, obtained by choosing a nonnegative Dirac measure in Bochner's theorem for real-valued stationary kernels:
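A numerical sketch of such a warped ordinal kernel (Eq. (5)): the warping positions and the cosine-type form below, which decreases from 1 at distance 0 to a minimal correlation c at distance 1, are illustrative assumptions; the form is written as a nonnegative mixture of the frequencies 0 and pi, so that Bochner's theorem guarantees validity:

```python
import numpy as np

def k_cos(h, c=-0.5):
    # Cosine-type correlation on [0, 1]: decreases from 1 at h = 0 to the
    # minimal value c at h = 1.  It is positive semidefinite, being a
    # nonnegative mixture of cos(0 * h) and cos(pi * h) (Bochner's theorem).
    return 0.5 * (1 + c) + 0.5 * (1 - c) * np.cos(np.pi * h)

def ordinal_cov(knots, c=-0.5):
    # Covariance matrix in the spirit of Eq. (5): T[u, v] = k(|F(u) - F(v)|),
    # where `knots` are the (hypothetical) warped positions of the levels.
    F = np.asarray(knots)
    return k_cos(np.abs(F[:, None] - F[None, :]), c)

# Four ordered levels with an uneven, non-decreasing warping.
T = ordinal_cov([0.0, 0.1, 0.5, 1.0])
```

Here the most distant levels reach the minimal correlation c = -0.5, while the matrix remains positive semidefinite.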
2.2.2 Kernels for nominal variables
For simplicity we present here the homoscedastic case, i.e. when has a constant diagonal. It extends immediately to situations where the variance depends on the level, by considering the correlation matrix.
General parametric covariance matrices.
There are several parameterizations of positive definite matrices based on the spectral and Cholesky decompositions. The spectral decomposition of is written
(7) 
where is diagonal and orthogonal. Standard parameterizations of involve the Cayley transform, Eulerian angles, Householder transformations or Givens rotations, as detailed in (Khuri and Good, 1989) and (Shepard et al., 2015). Another general parameterization of is provided by the Cholesky decomposition:
(8) 
where is lower triangular. When the variance does not depend on the level , the columns of have the same norm and represent points on a sphere in . A spherical parameterization of is then possible with one variance term and angles, representing correlations between levels (see e.g. Pinheiro and Bates, 1996).
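A sketch of the spherical parameterization described above, here for a correlation matrix: each row of the Cholesky factor is a point on the unit sphere, built from angles (the angle values below are arbitrary):

```python
import numpy as np

def spherical_corr(angles):
    # Build an m x m correlation matrix from m(m-1)/2 angles in (0, pi),
    # via the spherical parameterization of the Cholesky factor: each row
    # of L is a point on the unit sphere, so that diag(L @ L.T) = 1.
    m = int((1 + np.sqrt(1 + 8 * len(angles))) / 2)
    theta = np.zeros((m, m))
    idx = 0
    for i in range(1, m):
        for j in range(i):
            theta[i, j] = angles[idx]
            idx += 1
    L = np.zeros((m, m))
    L[0, 0] = 1.0
    for i in range(1, m):
        s = 1.0
        for j in range(i):
            L[i, j] = np.cos(theta[i, j]) * s
            s *= np.sin(theta[i, j])
        L[i, i] = s
    return L @ L.T

# Three levels, hence three angles.
R = spherical_corr(np.array([np.pi / 3, np.pi / 2, 2 * np.pi / 3]))
```

Angles strictly inside (0, pi) yield a positive definite matrix; a variance parameter per level can then be multiplied in for the heteroscedastic case.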
Parsimonious parameterizations.
The general parameterizations of described above require parameters. More parsimonious ones can be used, up to additional model assumptions. Among the simplest forms, the compound symmetry (CS) covariance matrix, often called exchangeable, assumes a common correlation for all levels (see e.g. Pinheiro and Bates, 2009). The CS matrix with variance and covariance is defined by:
(9) 
This generalizes the kernel obtained by substituting the Gower distance (Gower, 1982)
into the exponential kernel, corresponding to .
The CS covariance matrix treats all pairs of levels equally, which is an important limitation, especially when .
More flexibility is obtained by considering groups of levels.
Assume that the levels of are partitioned in groups
and denote by the group number corresponding to a level .
Then a desired parameterization of is given by the block matrix (see e.g. Qian et al. (2007)):
(10) 
where for all , the terms are within-group correlations, and () are between-group correlations. Notice that additional conditions on the ’s are necessary to ensure that is a valid covariance matrix, as developed in the next section.
3 Generalized compound symmetry block covariance matrices
We consider the framework of Section 2.2.2 where denotes a categorical variable whose levels are partitioned in groups of various sizes . Without loss of generality, we assume that . We are interested in parsimonious parameterizations of the covariance matrix , written in block form:
(11) 
where the diagonal blocks contain within-group covariances, and the off-diagonal blocks are constant matrices containing between-group covariances. We denote:
where is the by matrix of ones.
This means that the between-group covariances depend only on groups (and not on levels).
Although block matrices of the form (11) may be covariance matrices, they are not positive semidefinite in general. A necessary condition is that all diagonal blocks are positive semidefinite. But it is not sufficient. In order to provide a full characterization, we will ask a little more, namely that they remain positive semidefinite when removing the mean:
(12) 
where is the matrix of ones of size and is the average of the coefficients.
This condition will appear naturally in Subsection 3.3.
Notice that valid CS covariance matrices satisfy it.
Indeed, if is a positive semidefinite matrix with variance and covariance , then
where verifies , which is positive semidefinite.
For this reason, block matrices of the form (11) satisfying (12) will be called
Generalized Compound Symmetry (GCS) matrices.
In particular, the class of GCS block matrices contains the block matrices of the form (10).
The rest of the section is organized as follows.
Subsection 3.1 shows how valid CS covariance matrices can be parameterized by a Gaussian model.
The correspondence is ensured thanks to a centering condition on the level effects.
Subsection 3.2 gives material on centered covariance matrices.
Subsection 3.3 contains the main results.
It extends the model of Subsection 3.1 to a GCS block matrix.
This gives a proper characterization of positive semidefinite GCS block matrices,
as well as a parameterization which automatically fulfills the positive semidefinite conditions.
Subsection 3.4 indicates connections with related works.
Finally, in Subsection 3.5, we collect together the details of our parameterization,
for ease of reference.
3.1 A Gaussian model for CS covariance matrices
We first focus on the case of a CS matrix. The following additional notations will be used: for a given integer ,
is the identity matrix of size , and is the vector of ones of size .
We denote by
(13) 
the CS matrix with a common variance term and a common covariance term . It is well-known that is positive definite if and only if
(14) 
For instance, one can check that the eigenvalues of
are , with multiplicity 1 (eigenvector ), and , with multiplicity (eigenspace ). Notice that a CS matrix is positive definite for a range of negative values of its correlation term. Then we consider the following Gaussian model:
(15) 
where with , and
are i.i.d. random variables from
, with , assumed to be independent of . A direct computation shows that the covariance matrix of is the CS covariance matrix . Clearly, this characterizes the subclass of positive definite CS covariance matrices such that is nonnegative. The full parameterization, including negative values of in the range , can be obtained by restricting the average of the level effects to be zero, as detailed in the next proposition.
Proposition 1.
When and are related as in (15), the covariance of conditional on zero average errors is a CS matrix with variance and covariance . Conversely, given a CS covariance matrix with variance and covariance , there exists a representation (15) such that is the covariance of conditional on zero average errors where and .
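The proposition can be checked numerically by direct Gaussian conditioning; the variances below are arbitrary illustrative values:

```python
import numpy as np

n, s2_mu, s2 = 4, 0.5, 1.2   # hypothetical group-mean and level-effect variances

I, J = np.eye(n), np.ones((n, n))
# Covariance of the errors conditional on their sum being zero
# (Gaussian conditioning of N(0, s2 * I) on the linear constraint 1'eps = 0):
cov_eps = s2 * (I - J / n)
# Covariance of y = mu * 1 + eps, with mu ~ N(0, s2_mu) independent of eps:
cov_y = s2_mu * J + cov_eps

# cov_y is compound symmetry with variance s2_mu + s2 * (n - 1) / n and
# covariance s2_mu - s2 / n, which becomes negative when s2 > n * s2_mu.
v = s2_mu + s2 * (n - 1) / n
c = s2_mu - s2 / n
```

This shows how the zero-sum constraint on the level effects produces the negative part of the admissible covariance range.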
3.2 Parameterization of centered covariance matrices
The usage of Model (15) to describe CS covariance matrices involves Gaussian vectors that sum to zero. This is linked to centered covariance matrices, i.e. covariance matrices such that , as detailed in the next proposition. We further give a parameterization of centered covariance matrices.
Proposition 2.
Let be a covariance matrix of size . Then is centered iff there exists a Gaussian vector on such that . In that case, let be a matrix whose columns form an orthonormal basis of . Then is written in a unique way
(16) 
where is a covariance matrix of size .
In particular if is a centered CS covariance matrix, then ,
and we can choose .
3.3 A hierarchical Gaussian model for GCS block covariance matrices
Let us now return to the general case, where the levels of are partitioned in groups. It will be convenient to use the hierarchical notation , indicating that belongs to the group . Then, we consider the following hierarchical Gaussian model:
(17) 
where for each the random variable represents the effect of the group ,
and the random variables represent the effects of the levels in this group.
We assume that is normal ,
and all the are normal .
We also assume that are independent,
and independent of .
Notice that, up to centering conditions on that will be considered next,
is the mean of group .
Hence, is interpreted as the covariance of the group means.
Similarly, is the within-group effect around the group mean.
This justifies the notations and .
As an extension of Prop. 1, the next results show that (17) gives a one-to-one parameterization of valid GCS block covariance matrices, under the additional assumption that the average of level effects is zero in each group.
Theorem 1.
The covariance matrix of conditional on is a GCS block matrix with, for all :
(18) 
where is a centered positive semidefinite matrix equal to . Conversely, let be a positive semidefinite GCS block matrix. Then there exists a representation (17) such that is the covariance of conditional on zero average errors , with:
where is the matrix obtained by averaging each block of .
Corollary 1.
Positive semidefinite GCS block matrices with CS diagonal blocks
exactly correspond to covariance matrices of in (17)
conditional on the constraints
when .
As a byproduct, we obtain a simple condition for checking the positive definiteness of GCS block matrices. Interestingly, it only involves a small matrix whose size is the number of groups.
Theorem 2.
Let be a GCS block matrix. Then:
(i) is positive semidefinite if and only if is positive semidefinite;
(ii) is positive definite if and only if is positive definite and the diagonal blocks are positive definite for all .
Furthermore, we have
(19) 
where is the matrix
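Theorem 2 can be illustrated numerically: build a GCS matrix, replace each block by its average to get the small matrix, and compare positive semidefiniteness (all block values below are arbitrary):

```python
import numpy as np

def cs(n, v, c):
    # Compound symmetry matrix: variance v, covariance c.
    return c * np.ones((n, n)) + (v - c) * np.eye(n)

def averaged_matrix(T, sizes):
    # Replace each block of T by its average: the small G x G matrix of
    # Theorem 2.
    G = len(sizes)
    e = np.cumsum([0] + list(sizes))
    return np.array([[T[e[g]:e[g+1], e[h]:e[h+1]].mean() for h in range(G)]
                     for g in range(G)])

def is_psd(M, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def gcs(b12):
    # Two groups (sizes 2 and 3) with CS diagonal blocks and constant
    # between-group covariance b12.
    return np.block([[cs(2, 1.0, 0.4), b12 * np.ones((2, 3))],
                     [b12 * np.ones((3, 2)), cs(3, 1.0, 0.5)]])

T_ok, T_bad = gcs(0.3), gcs(0.9)
```

For both matrices, positive semidefiniteness of the full matrix agrees with that of its 2 x 2 averaged version.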
Remark 2.
All the results depend on the conditional distribution . Thus there is some flexibility in the choice of , since several matrices can lead to the same conditional covariance matrix .
Remark 3 (Groups of size 1).
Theorem 1 is still valid for groups of size 1.
Indeed if , then is degenerate and equal to 0.
Thus is positive semidefinite.
We end this section with an “exclusion property” for groups with strong negative correlations. A positive semidefinite GCS covariance matrix can exhibit negative within-group correlations, but this induces limitations on the between-group correlations. More precisely, the following result shows that if a group has the strongest possible negative within-group correlations, then it must be independent of the others.
Proposition 3 (Exclusion property for groups with minimal correlation).
Let be a GCS covariance matrix, and let be a centered Gaussian vector such that . Let be a group number, and denote by (resp. ) the subvector extracted from whose coordinates are (resp. are not) in group . Assume that is such that . Then is independent of .
The condition is linked to minimal correlations. Indeed, since is positive semidefinite, . The limit case is obtained when the negative terms of are large enough to compensate the positive ones. As an example, if is a positive semidefinite CS covariance matrix with variance and minimal negative covariance , then .
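A numerical illustration of the exclusion property, with one group at the minimal admissible within-group covariance (group sizes and values are arbitrary):

```python
import numpy as np

def cs(n, v, c):
    # Compound symmetry matrix: variance v, covariance c.
    return c * np.ones((n, n)) + (v - c) * np.eye(n)

def is_psd(M, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

n1, n2, v = 3, 3, 1.0
c_min = -v / (n1 - 1)     # strongest admissible negative covariance

def gcs(b12):
    # Group 1 has the minimal within-group covariance; b12 is the constant
    # between-group covariance.
    return np.block([[cs(n1, v, c_min), b12 * np.ones((n1, n2))],
                     [b12 * np.ones((n2, n1)), cs(n2, v, 0.5)]])
```

The block average of the first diagonal block is then zero, so any nonzero between-group covariance destroys positive semidefiniteness: the group must be independent of the rest.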
3.4 Related works
The hierarchical model (17) shares similarities with two-way Bayesian models and linear mixed effect models (see e.g. Lindley and Smith (1972)), with Gaussian priors for the effects and . The centering constraints are also standard identifiability conditions in such models. Furthermore, the particular case of CS covariance matrices corresponds to the exchangeability assumption on the corresponding random variables. Typically, in the framework of linear modelling, Model (17) could be written as
with an additional grand mean and errors .
However, if the framework is similar, the goal is different. In linear modelling, the aim is to quantify the effects by estimating their posterior distribution . On the other hand, we aim at investigating the form of the covariance matrix of the response part , or, equivalently, the covariance matrix of the likelihood.
3.5 Guideline for practical usage
The results of the previous sections show that valid GCS block covariance matrices can be parameterized by a family of covariance matrices of smaller sizes. This includes the case where the diagonal blocks are CS covariance matrices. The algorithm is summarized below.

1. Generate a covariance matrix of size G.
2. For all : if , set ; else:
(a) generate a covariance matrix of size ;
(b) compute a centered matrix , where is a by matrix whose columns form an orthonormal basis of .
3. For all , compute the within-group blocks and the between-group blocks by Eq. (18).
In steps 1 and 2, the generator covariance matrices and can be general,
and obtained by one of the parameterizations of §2.2.2.
A direct application of Theorem 2 also shows that is invertible
if and only if and the ’s are invertible (cf. Appendix for details).
Furthermore, some specific forms, such as CS matrices, can be chosen.
Depending on the number of groups and their sizes, different levels of parsimony can be obtained.
Table 1 summarizes some possibilities.
Note that the parameterization is designed for a general covariance matrix,
and does not handle additional constraints, such as being a correlation matrix.
In these situations, one can take advantage of the economical condition
of Theorem 2:
positive semidefiniteness of is always equivalent to
positive semidefiniteness of the small matrix of size .
This can be handled more easily by nonlinear or semidefinite programming algorithms.
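The generation algorithm of this section can be sketched as follows; the random generators are placeholders, and any valid covariance generators for the group matrix and the within-group matrices may be substituted:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cov(n):
    # Placeholder generator of a random covariance matrix of size n.
    L = rng.normal(size=(n, n))
    return L @ L.T + 1e-6 * np.eye(n)

def ones_complement_basis(n):
    # n x (n - 1) matrix whose columns form an orthonormal basis of the
    # hyperplane orthogonal to the vector of ones.
    Q, _ = np.linalg.qr(np.eye(n) - np.ones((n, n)) / n)
    return Q[:, : n - 1]

def generate_gcs(sizes):
    # Steps of Section 3.5: a G x G covariance B for the group part, plus
    # one centered within-group matrix per group of size > 1.
    G = len(sizes)
    B = random_cov(G)
    e = np.cumsum([0] + list(sizes))
    T = np.empty((e[-1], e[-1]))
    for g in range(G):
        for h in range(G):
            T[e[g]:e[g+1], e[h]:e[h+1]] = B[g, h]   # constant blocks
    for g, ng in enumerate(sizes):
        if ng > 1:
            A = ones_complement_basis(ng)
            W = A @ random_cov(ng - 1) @ A.T        # centered, PSD
            T[e[g]:e[g+1], e[g]:e[g+1]] += W
    return T

T = generate_gcs([1, 3, 4])   # three groups, including one of size 1
```

By construction, the result is positive semidefinite, its between-group blocks are constant, and each diagonal block minus its average remains positive semidefinite, i.e. condition (12) holds.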
Parametric setting  Resulting form of  Number of parameters
General  
General  General  
General  General  General
4 Examples
Example 1
Consider the deterministic function
with , and .
As visible in Figure 1, there are two groups of curves corresponding to levels
and , with strong within-group correlations and strong negative between-group correlations.
We aim at reconstructing with five GP models based on level grouping.
The first one uses a CS covariance matrix, corresponding to a single group.
The second one considers the two groups and .
The third model, based on the five groups , , , , , has two variants: (a) when the inter-group correlation is constant and (b) in the general case.
The fourth model uses the spherical parameterization of , leading to 13 groups, and the last one considers an ordinal parameterization for .
We also compare the results with an ordinal kernel obtained by mapping a piecewise-linear cdf
into the cosine kernel of Eq. (6) with (see Section 2.2.1).
Notice that the choice of is motivated by recovering negative correlations,
and has no link with the sinusoidal form of the curves of the example.
In order to benefit from the strong link between levels,
we use a design that spreads out the points between levels.
For instance, the information given by may be useful to estimate
at for a different level , without computing .
More precisely, we have used a (random) sliced Latin hypercube design (SLHD) (Qian, 2012)
with points per level, for a total budget of points.
All parameters are estimated by maximum likelihood.
As the likelihood surface may be multimodal,
we have launched several optimizations with different starting points chosen at random in the domain.
Model accuracy is measured over a test set formed by a regular grid of size , in terms of the criterion.
This criterion has an expression similar to that of , but is computed on the test set:
(20) 
where the denote the observations (on the test set), their mean, and the predictions. It is negative if the model performs worse than the mean, positive otherwise, and tends to 1 when the predictions are close to the true values. Finally, the process is repeated times, in order to assess the sensitivity of the results to the design.
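A minimal implementation of this predictivity criterion (the function name is ours):

```python
import numpy as np

def q2(y_test, y_pred):
    # Predictivity coefficient on a test set: 1 - SS_res / SS_tot,
    # where SS_tot uses the mean of the test observations.
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give 1, predicting the test mean gives 0, and predictions worse than the mean give a negative value.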
Results.
The estimated correlation parameters are shown in Figure 2.
Information about the plots used is given in the last section.
The correlation structure that can be intuited from Figure 1
is well recovered with two groups and five groups,
with different between-group correlations.
The model with thirteen groups involves the estimation of parameters, which is hard to achieve, especially with points.
This is visible in the erratic values of the estimated correlations, which do not seem meaningful.
On the contrary, considering only one group, or five groups with a common between-group correlation,
oversimplifies the correlation structure.
Finally, the model using an ordinal kernel recovers the two groups of curves,
as well as the strong negative correlation between them,
which is made possible by the choice of the 1-dimensional kernel used in the warping.
In Figure 3, we can see that the best tradeoff between prediction accuracy and parsimony is obtained with two groups, even though grouping reduces the number of observations per group. Notice the rather good performance of the ordinal model, at the cost of a larger number of parameters.
Example 2
We now provide a second example, in order to illustrate the ability of the hierarchical model (17) to deal with negative within-group correlations. We consider the deterministic function given by:
with , .
As visible in Figure 4, the levels can be split into two groups:
a group of almost linear functions (levels ), and a group of damped sinusoidal functions (levels ).
Within the latter group, there are strong negative correlations between levels and .
Hence, the levels could also be split into three groups.
In this section, we briefly compare the corresponding GP models:

The first model considers the two groups and . The within-group structure is CS for the first group (linear functions), but a general structure is chosen for the second one (sinusoids), in order to capture its complex covariance structure.

The second model is based on the three groups , and . The within-group structure is CS, and the between-group covariance is general.
For simplicity, we consider a single stratified design of experiments
extracted from a sequence of regularly spaced points, with points per level.
The other settings are the same as in Example 1.
The estimated correlation parameters are shown in Figure 5. The correlation structure that can be intuited from Figure 4 is well recovered by the two models. However, in the case of two groups, the estimated between-group correlation is nearly zero. This is an illustration of the exclusion property (Prop. 3). Indeed, due to the strong negative (estimated) correlations within the second group, we have , which induces a small correlation with the other group. In this example, the model with three groups may be more appropriate, which seems confirmed by the larger value of (compared to for two groups). Nevertheless, it is nice to see that, starting from a smaller number of groups, the correlation plot detects the two subgroups of sinusoids.
Discussion
It is remarkable that complex correlation structures can be recovered with few data points per level.
Indeed, even if the correlation structure can be intuited from the whole curves ,
this complete information is of course not available,
and the estimation is done with only three points per level in the previous examples.
The reasons may be twofold.
Firstly, the model is parametric, which may provide more accurate estimation with few data,
provided that the model is relevant (in particular if the correct groups are given).
Secondly, the global amount of information available may be large enough,
since the small number of points per level is compensated by the quite large number of levels.
Consequently, when there are few levels, one may need to increase the number of points per level
in order to reach an acceptable level of accuracy.
The sinusoidal form of the toy functions has been chosen to illustrate the power of GP models in modelling complex functions. Other complex forms could have been used. An alternative would have been to use sample paths of GP models with group kernels rather than deterministic functions. Finally, notice that correlation plots must be used with care in general, since they do not account for the estimation error on correlations.
5 Application in nuclear engineering
5.1 Position of the problem
As presented in the Introduction, this research is originally motivated by the solving of an inverse problem confronting experimental measurements in nuclear engineering with time-consuming numerical simulations. More precisely, this analysis concerns the identification of the mass of that is present in a particular waste container, using a non-destructive nuclear detection technique such as gamma spectrometry (Knoll, 2010). In that case, at each energy level ,
(21) 
where is the quantity of interest provided by the gamma transmitter, and is the attenuation coefficient, which depends on the source environment denoted by . In practice, only discrete values of are of interest, corresponding to the natural energy levels of :
(22) 
Then, based on previous studies (Guillot, 2015), the real source environment is parameterized by the following input variables:

- An equivalent geometric shape for the nuclear waste: sphere (‘sph’), cylinder (‘cyl’) or parallelepiped (‘par’).
- An equivalent material for this waste, characterized by its chemical element, with atomic number in {1, …, 94}.
- The bulk density of the waste, in .
- The distance of measurement between the container and the measurement device, in cm.
- The mean width and lateral surfaces (in logarithmic scale) crossed by a gamma ray during the rotation of the object.
After normalization, the characteristics of the input space can be summarized in Table 2.
Name of the input  Variation domain
Distance  [0,1]
Density  [0,1]
Width  [0,1]
Surface  [0,1]
Energy  
Shape  {sph, cyl, par}
Chemical element  
To recapture the notation of the previous sections, let and be the vectors gathering respectively the continuous and categorical inputs, and . For a given value of , Monte Carlo simulation codes such as MCNP (Goorley et al., 2013) can be used to model the measured scene and approach the value of . The mass can then be obtained as the solution of the following optimization problem:
(23) 
where is the classical Euclidean norm, and respectively gather the values of and at the six values of that are used for the measurements. To solve (23), it is therefore necessary to compute at a large number of points. However, each evaluation of the MCNP code can be extremely demanding (from several minutes to several hours of CPU time for one evaluation). Thus, surrogate models have to be introduced to emulate the function , which is now investigated in the framework of Gaussian process regression. We refer to Clement et al. (2018) for the second step, namely the treatment of the inversion problem.
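To fix ideas, here is a deliberately simplified sketch of the inversion step, assuming, purely for illustration, a surrogate whose predicted signal is proportional to the mass at each of the six energies; under that assumption the least-squares problem (23) has a closed form:

```python
import numpy as np

# Hypothetical surrogate output: signal per unit mass at the six energies
# (values invented for illustration; the real simulator is far more complex).
s = np.array([0.9, 1.4, 2.1, 2.6, 3.0, 3.2])
m_true = 2.5
y_obs = m_true * s     # noiseless synthetic observations

# argmin_m || y_obs - m * s ||^2 has the closed form <y_obs, s> / <s, s>:
m_hat = float(y_obs @ s / (s @ s))
```

In the actual application the dependence on the mass is not linear, and the minimization is carried out numerically over the surrogate predictions.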
5.2 Model settings
For pedagogical purposes, a dataset of large size has been computed with the MCNP code.
The construction of the design of experiments was guided by the categorical inputs,
such that each of the combinations of levels appears times.
It was completed by a Latin hypercube of size to define the values of the four continuous inputs.
From this full dataset, a training set of size is extracted by selecting at random observations per chemical element.
The remaining points serve as test set.
Model settings are now motivated by a graphical analysis. In Figure 6, the output is displayed as a function of the energy and the geometric shape. We observe that successive energy levels correspond to close values. This fact confirms that the energy is ordinal, and we use the warped kernel defined by Eq. (5). The influence of the geometric shape is less obvious, and we have chosen an exchangeable (CS) covariance structure for it.
In Figure 7, is displayed as a function of the chemical elements, ordered by atomic number. Two important facts are the high number of levels and the heteroscedasticity. For this purpose, the chemical elements are divided into groups, provided by expert knowledge and represented by colors. This partition suggests using a group kernel of the form (11), where the within-group blocks are CS covariance matrices. In order to handle heteroscedasticity, the variance of is assumed to depend on the group number .
The influence of the continuous variables can be observed in panels (not represented), and does not reveal useful information for our purpose.
A Matérn kernel is set for all continuous inputs,
as we expect the output to be a regular function of the continuous inputs.
Indeed, for this kernel, the corresponding Gaussian process is twice mean-square differentiable.
Finally, three candidate kernels for are obtained by combining the kernels of input variables defined above,
by sum, product or ANOVA (see Section 2).
5.3 Results
Following the model settings detailed above,
Figure 8, Panel 3, presents the results obtained with random designs of size and three operations on kernels.
Furthermore, we have implemented three other kernels for the chemical element, in order to compare other model choices for this categorical input.
In the first panel, we grouped all the 94 levels into a single group.
In the second one, we kept the 5-group kernel but forced the between-group covariances to have a common value.
Finally, in the fourth panel, we considered that the levels were ordered by their atomic number,
and used the warped kernel of Eq. (5) with a Normal transform.
First, comparing the three operations on kernels,
we remark that in all the panels, additive kernels provide the worst results.
This suggests the existence of interactions between different inputs of the simulator.
Second, the ANOVA combination yields slight improvements over the standard tensor-product,
both in terms of accuracy and of stability with respect to the design choice.
Now, comparing the four panels, we see that gathering the levels into a single group is the least efficient strategy. The 5-group kernel performs very well, especially when the between-group covariances vary freely: constraining them to be equal degrades the result. Surprisingly, the ordinal kernel gives the best performance here. Indeed, for this application it was not intuitive to the experts that the chemical element could be viewed as an ordinal variable, simply sorted by its atomic number. This is confirmed by the correlation plots of Figure 9, corresponding to a model with a median score: the estimated correlations between levels seem to decrease as the difference between levels increases, an indication that the levels may indeed be ordered by their atomic number.
Finally, we report several post-processing results. First, the estimated transformation of the energy levels (Figure 10, left) is concave and flat near high values, which corresponds to the behaviour observed in Figure 6 (left panel). In addition, the last three levels lead to similar results (Figure 10, right). This corresponds to the fact that when the energy is high, the gamma ray almost always crosses the nuclear waste, leading to a high output value. Second, the estimated correlation among the sphere, the cylinder and the parallelepiped is very high (Figure 11). This justifies using a covariance structure for that categorical input, rather than three independent GP models, one per level.
6 Conclusion
In the framework of GP regression with both continuous and categorical inputs,
we focus on problems where categorical inputs may have a potentially large number of levels,
partitioned into groups of various sizes.
We provide new results about parsimonious block covariance matrices,
defined by a few within-group and between-group covariance parameters.
We revisit a two-way nested Bayesian linear model,
where the response term is defined as the sum of a group effect and a level effect.
We obtain a flexible parameterization of block covariance matrices
which automatically satisfy the positive definiteness conditions.
As a particular case, we recover situations where the within-group covariance structures are compound symmetry,
with possible negative correlations.
Furthermore, we show that the positive definiteness of a given block covariance matrix
can be checked by verifying that the small matrix, whose size is the number of groups,
obtained by averaging each block is positive definite.
This criterion can be useful if the proposed block matrix has a desirable constraint,
such as homoscedasticity, which is not directly handled by the proposed parameterization.
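This criterion lends itself to a direct numerical illustration (a hedged sketch in Python; the group sizes, variances and covariances below are arbitrary values chosen for the example, not taken from the paper):

```python
import numpy as np

def cs_block(n, v, c):
    """Compound symmetry matrix: variance v on the diagonal, covariance c elsewhere."""
    return c * np.ones((n, n)) + (v - c) * np.eye(n)

def average_blocks(T, sizes):
    """Collapse a block matrix into the small G x G matrix of block averages."""
    G = len(sizes)
    idx = np.cumsum([0] + list(sizes))
    S = np.empty((G, G))
    for g in range(G):
        for h in range(G):
            S[g, h] = T[idx[g]:idx[g + 1], idx[h]:idx[h + 1]].mean()
    return S

def is_pd(M, tol=1e-10):
    """Check positive definiteness via the spectrum of a symmetric matrix."""
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

# A block matrix with CS diagonal blocks and constant between-group blocks
sizes = [3, 2]
T = np.block([
    [cs_block(3, 1.0, 0.4), 0.2 * np.ones((3, 2))],
    [0.2 * np.ones((2, 3)), cs_block(2, 1.0, 0.3)],
])
S = average_blocks(T, sizes)
print(S)
print(is_pd(S), is_pd(T))
```

Here both the averaged matrix and the full block matrix are positive definite; the numbers are only meant to demonstrate the averaging operation, not to replace the precise conditions stated in the paper.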
We apply these findings to several toy functions as well as to an application in nuclear engineering,
with 4 continuous inputs and 3 categorical inputs, one of them having 94 levels
corresponding to atomic numbers in Mendeleev's table.
In this application, 5 groups were defined by experts.
The results, measured in terms of prediction accuracy, outperform
those obtained with simpler assumptions, such as gathering all levels into a single group.
It is gratifying that our nominal method performs almost as well as
using the appropriate order with a warped kernel.
There are several perspectives for this work. First, one future direction is to find a data-driven technique to recover groups of levels, which is harder when only a small number of observations is available. Similarly, if there is an order between levels, can we infer it from the data? Second, the trend of GP models with mixed continuous and categorical inputs could be made more complex, in the same vein as existing work on GP models with continuous inputs.
Software and acknowledgements
Implementations have been done with the R packages mixgp and kergp (Deville et al., 2015).
Illustrations use ggplot2 (Wickham, 2009) and corrplot (Wei and Simko, 2016).
This research was conducted within the framework of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (CNRS, Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse) around advanced methods for Computer Experiments. The authors thank the participants for fruitful discussions.
They are also grateful to the Isaac Newton Institute for Mathematical Sciences, Cambridge,
for support and hospitality during the programme UNQ when work on this paper was undertaken
(EPSRC grant number EP/K032208/1).
Appendix
Proof of Proposition 1.
The vector is a centered Gaussian vector with covariance matrix
Hence the conditional distribution of knowing is a centered Gaussian vector with covariance matrix
Then, by using the independence between and the ’s, we deduce
We recognize the CS covariance matrix with
and .
As a covariance matrix, it is positive semidefinite.
Furthermore, we have and ,
and the conditions of positive definiteness (14) are satisfied.
Conversely, let be a positive definite CS matrix .
Then we have ,
and we can define and .
From the direct part of the proof, we then obtain that the covariance matrix
of
is .
∎
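Since the displayed equations of this proof are not reproduced in this excerpt, the underlying construction can be sketched with generic notation (our symbols, not necessarily the paper's). Take a group effect $\mu \sim \mathcal{N}(0, v_\mu)$ and i.i.d. level effects $\lambda_1,\dots,\lambda_L \sim \mathcal{N}(0, v_\lambda)$, independent of $\mu$, condition on $\bar\lambda = 0$, and set $\eta_j = \mu + \lambda_j$. Then:

```latex
\operatorname{Cov}(\lambda_j, \lambda_k \mid \bar\lambda = 0)
  = v_\lambda \Bigl(\delta_{jk} - \tfrac{1}{L}\Bigr),
\qquad
\operatorname{Cov}(\eta_j, \eta_k \mid \bar\lambda = 0)
  = v_\mu + v_\lambda \Bigl(\delta_{jk} - \tfrac{1}{L}\Bigr),
```

which is a CS matrix with variance $v = v_\mu + v_\lambda(1 - 1/L)$ and covariance $c = v_\mu - v_\lambda/L$; letting $v_\mu \to 0$ drives $c/v$ to the lower bound $-1/(L-1)$, so negative correlations are attainable.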
Proof of Proposition 2.
The first part of the proposition is obtained by remarking that if ,
then .
Thus, assuming that is centered, is equivalent to with probability .
For the second part, notice that means that is orthogonal to .
Thus, one can write the expansion of in the orthonormal basis defined by .
Denoting by the vector of coordinates, we have .
This gives , and (16) follows with .
To prove uniqueness, observe that, by definition, .
Starting from , and multiplying by on the left and by on the right,
we get , showing that is unique.
Now, let . Since , we obtain
As a byproduct, notice that resubstituting into
gives .
Finally, if , then the properties of conditional Gaussian vectors
lead immediately to .
∎
Proof of Theorem 1.
The expressions of and are obtained directly by using the independence assumptions about and the ’s. Notice that , the covariance matrix of knowing , is centered by Proposition 2. This gives , which is positive semidefinite. Hence is a GCS block matrix.
Conversely, let be a positive semidefinite GCS block matrix. Let be the matrix obtained from by averaging each block. Then is also a positive semidefinite matrix. Indeed, since is positive semidefinite, it is the covariance matrix of some vector . Then is the covariance matrix of , the vector obtained from by averaging by group: . Thus there exists a centered Gaussian vector whose covariance matrix is
Now, for , define
Observe that , and by assumption is positive semidefinite. Hence, from Proposition 2, there exists a centered Gaussian vector such that
We can assume that are independent, and and are independent. Finally, we set . By the direct part of the proof and (18), we obtain that is the covariance matrix of conditional on . ∎
Proof of Corollary 1.
Let be a positive semidefinite GCS block matrix with CS diagonal blocks. Then the diagonal CS matrices are positive semidefinite, leading to . Thus,