Group kernels for Gaussian process metamodels with categorical inputs

02/07/2018
by Olivier Roustant, et al.

Gaussian processes (GP) are widely used as a metamodel for emulating time-consuming computer codes. We focus on problems involving categorical inputs, with a potentially large number L of levels (typically several tens), partitioned in G << L groups of various sizes. Parsimonious covariance functions, or kernels, can then be defined by block covariance matrices T with constant covariances between pairs of blocks and within blocks. However, little is said about the positive definiteness of such matrices, which may limit their practical usage. In this paper, we exploit the hierarchy group/level and provide a parameterization of valid block matrices T, based on a nested Bayesian linear model. The same model can be used when the assumption within blocks is relaxed, giving a flexible parametric family of valid covariance matrices with constant covariances between pairs of blocks. As a by-product, we show that the positive definiteness of T is equivalent to the positive definiteness of a small matrix of size G, obtained by averaging each block. We illustrate with an application in nuclear engineering, where one of the categorical inputs is the atomic number in Mendeleev's periodic table and has more than 90 levels.


1 Introduction

This research is motivated by the analysis of a time-consuming computer code in nuclear engineering, depending on both continuous and categorical inputs, one of them having more than 90 levels. The final motivation is an inversion problem. However, due to the heavy computational cost, a direct usage of the simulator is hardly possible. A realistic approach is to use a statistical emulator or metamodel. Thus, as a first step, we investigate the metamodelling of such a computer code. More precisely, we consider Gaussian process (GP) regression models, also called kriging models (Sacks et al., 1989; Rasmussen and Williams, 2006), which have been successfully used in sequential metamodel-based strategies for uncertainty quantification (see e.g. Chevalier et al., 2014).

Whereas there is a flourishing literature on GP regression, the part concerned with categorical inputs remains quite limited. We refer to (Zhang and Notz, 2015) for a review. As for continuous inputs, covariance functions or kernels are usually built by combination of 1-dimensional ones, most often by multiplication or, more rarely, by addition (Deng et al., 2017). The question then comes down to constructing a valid kernel on a finite set, which amounts to choosing a positive semidefinite matrix. Some effort has been spent on the parameterization of general covariance matrices (Pinheiro and Bates, 1996) and on parsimonious parameterizations of smaller classes (Pinheiro and Bates, 2009). Some block forms have also been proposed (Qian et al., 2007), in order to deal with a potentially large number of levels. However, their validity (in terms of positive definiteness) was not investigated. Furthermore, to the best of our knowledge, applications in GP regression are limited to categorical inputs with very few levels, typically less than 5.

Guided by the application, we investigate more deeply the so-called group kernels cited in Qian et al. (2007), defined by block covariance matrices with constant covariances between pairs of blocks and within blocks. We exploit the hierarchy group/level by revisiting a nested Bayesian linear model where the response term is a sum of a group effect and a level effect. The level effects are assumed to sum to zero, which allows recovering negative within-group correlations. This model leads to a parameterization of the block matrix which is automatically positive definite. Interestingly, the assumption on the within-group blocks can be relaxed, and we obtain a parameterization of a wider class of valid group kernels. The positive definiteness condition is also made explicit: it is equivalent to the positive definiteness of the smaller covariance matrix obtained by replacing each block by its average.

As mentioned above, this work has some connections with Bayesian linear models as well as linear mixed effect models (see e.g. Lindley and Smith (1972), Smith (1973)) in a hierarchical view. Other related works concern hierarchical GPs with a tree structure. For instance, particular forms of group kernels are obtained in multiresolution GP models (Fox and Dunson, 2012; Park and Choi, 2010). Such models usually assume that children are conditionally independent given the mother. This is not the case in our model, due to the condition that the level effects sum to zero.

The paper is structured as follows. Section 2 gives some background on GP regression with mixed categorical and continuous inputs. Section 3 presents new findings on group kernels. Section 4 provides illustrations on synthetic examples. Section 5 is devoted to the application which motivated this work. Section 6 gives some conclusions and perspectives for future research.

2 Background and notations

2.1 GPs with continuous and categorical variables

We consider a set of continuous variables defined on a hypercubic domain , and a set of categorical variables with levels. Without loss of generality, we assume that and that, for each , the levels of are numbered . We denote , , and .

We consider GP regression models defined on the product space

and written as:

(1)

where , and are respectively the trend, the GP part and a noise term. There exists a wide variety of trend functions, as in linear models. Our main focus here is on the centered GP, characterized by its kernel

Kernels on can be obtained by combining kernels on and kernels on . Standard valid combinations are the product, sum or ANOVA. Thus, if denotes a kernel for the continuous variables , and a kernel for the categorical ones , examples of valid kernels for are written:

For conciseness, we will denote by one of the operations: sum, product or ANOVA. The three formulas above can then be summarized by:

(2)

Then, in turn, and can be defined by applying these operations to 1-dimensional kernels. For continuous variables, well-known 1-dimensional kernels include the squared exponential and Matérn kernels (Rasmussen and Williams, 2006). We denote by such kernels (). For a categorical variable, notice that, as a positive semidefinite function on a finite space, a kernel is a positive semidefinite matrix. We denote by the matrix of size corresponding to the kernels for (). Thus, examples of expressions for and are written:

(3)
(4)

The formulation given by Equations (2), (3), (4) is not the most general one, since kernels are not always obtained by combining 1-dimensional ones. Nevertheless, it encompasses the GP models used in the literature of computer experiments with categorical inputs. It generalizes the tensor-product kernels, very often used, and the sum used recently by (Deng et al., 2017) on the categorical part. It also contains the heteroscedastic case, since the matrices are not assumed to have a constant diagonal, contrary to most existing works (Zhang and Notz, 2015). This will be useful in the application of Section 5, where the variance of the material is level dependent.
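To make the combination rules of Eq. (2) concrete, here is a minimal Python sketch (not the authors' implementation, which relies on the R packages cited at the end of the paper) combining a continuous kernel with a categorical covariance matrix by product or sum. The squared-exponential choice and all parameter values are illustrative assumptions; the ANOVA combination is omitted.

```python
# Illustrative sketch: combining a continuous kernel with a categorical
# covariance matrix by product or sum, in the spirit of Eq. (2).
import numpy as np

def k_cont(x1, x2, lengthscale=0.3):
    """Squared-exponential kernel between two continuous input vectors (hypothetical choice)."""
    d2 = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def k_mixed(w1, w2, T, op="prod"):
    """Kernel on a mixed input w = (x, u): x continuous, u a categorical level index."""
    x1, u1 = w1
    x2, u2 = w2
    kc = k_cont(x1, x2)          # continuous part
    kd = T[u1, u2]               # categorical part: entry of a psd matrix T
    if op == "prod":
        return kc * kd
    if op == "sum":
        return kc + kd
    raise ValueError("op must be 'prod' or 'sum'")

# Toy usage with a 3-level compound-symmetry matrix (variance 1, covariance 0.5)
T = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
print(k_mixed(([0.2, 0.7], 0), ([0.3, 0.6], 2), T, op="prod"))
```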

Remark 1.

Combining kernels needs some care to obtain identifiable models. For instance, the product of kernels, each having its own variance parameter, is a kernel depending on only one overall variance parameter (the product of the individual ones). The GP model is identifiable for this new parameter, but not for the initial parameters.

2.2 1-dimensional kernels for categorical variables

We consider here a single categorical variable with L levels. We recall that a kernel for such a variable is then an L-by-L positive semidefinite matrix.

2.2.1 Kernels for ordinal variables

A categorical variable with ordered levels is called ordinal. In this case, the levels can be viewed as a discretization of a continuous variable. Thus a GP on can be obtained from a 1-dimensional GP on the interval by using a non-decreasing transformation (also called warping):

Consequently, the covariance matrix can be written:

(5)

When the continuous kernel depends on the distance between its arguments, then the resulting covariance depends on the distance between the levels, distorted by the warping.
In the general case, the warping is piecewise-linear and defined by a number of parameters growing with the number of levels. However, a parsimonious parameterization may be preferred, based on the cdf of a flexible probability distribution such as the Normal or the Beta. We refer to (McCullagh, 1980) for examples in regression and to (Qian et al., 2007) for illustrations in computer experiments.
There is also some flexibility in the choice of the continuous kernel. The standard Squared-Exponential or Matérn kernels are admissible, but induce positive correlations between levels. In order to allow negative correlations, one may choose, for instance, the cosine correlation kernel:

(6)

where a fixed parameter tunes the minimal correlation value. Indeed, (6) defines a decreasing function of the distance, from 1 to the minimal correlation value. It is a valid covariance function, obtained by choosing a non-negative Dirac measure as the spectral measure in Bochner's theorem for real-valued stationary kernels.
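As an illustration of the warped construction of Eq. (5), the following sketch builds the covariance matrix of an ordinal variable by mapping the levels through a Normal cdf warping and applying a 1-dimensional kernel to the warped distances. The exponential kernel and all parameter values are placeholders, not the paper's choices; the cosine kernel of Eq. (6) could be substituted to allow negative correlations.

```python
# Sketch of a warped (ordinal) kernel in the spirit of Eq. (5); all settings hypothetical.
import numpy as np
from scipy.stats import norm

def ordinal_cov(L, mu=0.5, sd=0.2, theta=0.5, var=1.0):
    """Covariance matrix for an ordinal variable with L levels, via a warping."""
    levels = np.linspace(0.0, 1.0, L)          # levels rescaled to [0, 1]
    F = norm.cdf(levels, loc=mu, scale=sd)     # non-decreasing warping (Normal cdf)
    h = np.abs(F[:, None] - F[None, :])        # warped distances between levels
    return var * np.exp(-h / theta)            # 1-d exponential kernel on warped distances

T = ordinal_cov(6)
print(np.round(T, 2))
```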

2.2.2 Kernels for nominal variables

For simplicity we present here the homoscedastic case, i.e. when has a constant diagonal. It is immediately extended to situations where the variance depends on the level, by considering the correlation matrix.

General parametric covariance matrices.

There are several parameterizations of positive-definite matrices, based on the spectral and Cholesky decompositions. The spectral decomposition of is written

(7)

where is diagonal and orthogonal. Standard parameterizations of involve the Cayley transform, Eulerian angles, Householder transformations or Givens rotations, as detailed in (Khuri and Good, 1989) and (Shepard et al., 2015). Another general parameterization of is provided by the Cholesky decomposition:

(8)

where is lower triangular. When the variance does not depend on the level , the columns of have the same norm and represent points on a sphere in . A spherical parameterization of is then possible with one variance term and angles, representing correlations between levels (see e.g. Pinheiro and Bates, 1996).
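The spherical parameterization can be sketched as follows: each row of a lower-triangular factor is a unit vector defined by angles, so that the resulting matrix has unit diagonal and is a valid correlation matrix. The angle layout and variable names below are ours, not the paper's.

```python
# A minimal sketch of the spherical (hypersphere) parameterization of a correlation matrix.
import numpy as np

def corr_from_angles(theta, L):
    """Build an L x L correlation matrix from L(L-1)/2 angles in (0, pi)."""
    assert len(theta) == L * (L - 1) // 2
    W = np.zeros((L, L))
    W[0, 0] = 1.0
    idx = 0
    for i in range(1, L):
        prod_sin = 1.0
        for j in range(i):
            W[i, j] = np.cos(theta[idx]) * prod_sin
            prod_sin *= np.sin(theta[idx])
            idx += 1
        W[i, i] = prod_sin                     # rows have unit norm by construction
    return W @ W.T

R = corr_from_angles(np.full(6, np.pi / 3), L=4)   # 4 levels -> 6 angles
print(np.round(R, 3), np.all(np.linalg.eigvalsh(R) > 0))
```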

Parsimonious parameterizations.

The general parameterizations described above require a number of parameters that grows quadratically with the number of levels. More parsimonious ones can be used, up to additional model assumptions. Among the simplest forms, the compound symmetry (CS) - often called exchangeable - covariance matrix assumes a common correlation for all levels (see e.g. Pinheiro and Bates, 2009). The CS matrix with variance and covariance is defined by:

(9)

This generalizes the kernel obtained by substituting the Gower distance (Gower, 1982) into the exponential kernel, corresponding to .
The CS covariance matrix treats all pairs of levels equally, which is an important limitation, especially when the number of levels is large. More flexibility is obtained by considering groups of levels. Assume that the levels are partitioned in groups and denote by the group number corresponding to a given level. Then a desired parameterization is given by the block matrix (see e.g. Qian et al. (2007)):

(10)

where for all , the terms are within-group correlations, and () are between-group correlations. Notice that additional conditions on these correlations are necessary to ensure that the block matrix is a valid covariance matrix, as developed in the next section.
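A small sketch of the block form (10), with one within-group correlation per group and one between-group correlation per pair of groups. As stressed above, such a matrix is not automatically positive semidefinite, so its validity must be checked (for instance with the averaging criterion of Theorem 2 in Section 3). Names and numerical values are illustrative.

```python
# Illustrative construction of a block correlation matrix of the form (10).
import numpy as np

def block_corr(group_of_level, within, between):
    """group_of_level: group index per level; within[g]: within-group correlation;
    between[g, h]: between-group correlation."""
    g = np.asarray(group_of_level)
    L = len(g)
    T = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            if i == j:
                T[i, j] = 1.0
            elif g[i] == g[j]:
                T[i, j] = within[g[i]]
            else:
                T[i, j] = between[g[i], g[j]]
    return T

groups = [0, 0, 0, 1, 1, 2]                      # 6 levels in 3 groups
within = np.array([0.8, 0.6, 1.0])
between = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]])
T = block_corr(groups, within, between)
print(np.all(np.linalg.eigvalsh(T) >= -1e-12))   # may be False for other parameter choices
```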

3 Generalized compound symmetry block covariance matrices

We consider the framework of Section 2.2.2 where denotes a categorical variable whose levels are partitioned in groups of various sizes . Without loss of generality, we assume that . We are interested in parsimonious parameterizations of the covariance matrix , written in block form:

(11)

where the diagonal blocks contain within-group covariances, and the off-diagonal blocks are constant matrices containing between-group covariances. We denote:

where is the by matrix of ones. This means that the between-group covariances only depend on the groups (and not on the levels).

Although block matrices of the form (11) are natural candidates for covariance matrices, they are not positive semidefinite in general. A necessary condition is that all diagonal blocks are positive semidefinite, but it is not sufficient. In order to provide a full characterization, we will ask a little more, namely that they remain positive semidefinite when removing the mean:

(12)

where is the matrix of ones of size and is the average of the coefficients. This condition will appear naturally in Subsection 3.3. Notice that valid CS covariance matrices satisfy it. Indeed, if is a positive semidefinite matrix with variance and covariance , then where verifies , which is positive semidefinite. For this reason, we will call Generalized Compound Symmetry (GCS) matrices the block matrices of the form (11) verifying (12). In particular, the class of GCS block matrices contains block matrices of the form (10).

The rest of the section is organized as follows. Subsection 3.1 shows how valid CS covariance matrices can be parameterized by a Gaussian model. The correspondence is ensured, thanks to a centering condition on the effects of the levels. Subsection 3.2 gives material on centered covariance matrices. Subsection 3.3 contains the main results. It extends the model of Subsection 3.1 to a GCS block matrix. This gives a proper characterization of positive semidefinite GCS block matrices, as well as a parameterization which automatically fulfills the positive semidefinite conditions. Subsection 3.4 indicates connections with related works. Finally, in Subsection 3.5, we collect together the details of our parameterization, for ease of reference.

3.1 A Gaussian model for CS covariance matrices

We first focus on the case of a CS matrix. The following additional notations will be used: for a given integer n, I_n is the identity matrix of size n and 1_n is the vector of ones of size n. We denote by

(13)   CS_n(v, c) = (v - c) I_n + c 1_n 1_n^T

the CS matrix of size n with a common variance term v and a common covariance term c. It is well-known that CS_n(v, c) is positive definite if and only if

(14)   -v / (n - 1) < c < v.

For instance, one can check that the eigenvalues of CS_n(v, c) are v + (n - 1)c with multiplicity 1 (eigenvector 1_n) and v - c with multiplicity n - 1 (eigenspace orthogonal to 1_n). Notice that a CS matrix is positive definite for a range of negative values of its correlation term.
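These facts are easy to check numerically; the following sketch (notation ours) builds a CS matrix and compares its eigenvalues with v + (n - 1)c and v - c, and checks the positive definiteness condition (14).

```python
# Numerical check of the CS eigenvalue facts recalled above.
import numpy as np

def cs_matrix(n, v, c):
    """Compound symmetry matrix: v on the diagonal, c elsewhere."""
    return (v - c) * np.eye(n) + c * np.ones((n, n))

n, v, c = 5, 1.0, -0.2                      # a negative covariance is allowed
T = cs_matrix(n, v, c)
print(np.sort(np.linalg.eigvalsh(T)))       # compare with v + (n - 1) c = 0.2 and v - c = 1.2
print(-v / (n - 1) < c < v)                 # positive definiteness condition (14)
```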

Then we consider the following Gaussian model:

(15)

where the common effect is a Gaussian random variable and the level effects are i.i.d. Gaussian random variables, assumed to be independent of the common effect.

A direct computation shows that the covariance matrix of the resulting vector is a CS covariance matrix. Clearly this characterizes the subclass of positive definite CS covariance matrices whose covariance term is non-negative. The full parameterization, including negative values of the covariance in the admissible range, can be obtained by restricting the average of level effects to be zero, as detailed in the next proposition.

Proposition 1.

When and are related as in (15), the covariance of conditional on zero average errors is a CS matrix with variance and covariance . Conversely, given a CS covariance matrix  with variance and covariance , there exists a representation (15) such that is the covariance of conditional on zero average errors where and .

3.2 Parameterization of centered covariance matrices

The usage of Model (15) to describe CS covariance matrices involves Gaussian vectors that sum to zero. This is linked to centered covariance matrices, i.e. covariance matrices such that , as detailed in the next proposition. We further give a parameterization of centered covariance matrices.

Proposition 2.

Let be a covariance matrix of size . Then, is centered iff there exists a Gaussian vector on such that . In that case, let be a matrix whose columns form an orthonormal basis of . Then is written in a unique way

(16)

where is a covariance matrix of size .
In particular if is a centered CS covariance matrix, then , and we can choose .

The choice of the matrix in Prop. 2 is free; it can be obtained, for instance, by normalizing the columns of a Helmert contrast matrix (Venables and Ripley, 2002, §6.2).
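The following sketch illustrates Proposition 2 and this choice of basis: a normalized Helmert contrast matrix provides orthonormal columns orthogonal to the vector of ones, and any covariance matrix of size n - 1 then yields a centered covariance matrix as in (16). The example matrix is arbitrary.

```python
# Sketch of Proposition 2: centered covariance matrices via normalized Helmert contrasts.
import numpy as np

def helmert_basis(n):
    """n x (n-1) matrix with orthonormal columns, each orthogonal to the ones vector."""
    A = np.zeros((n, n - 1))
    for j in range(1, n):
        A[:j, j - 1] = -1.0
        A[j, j - 1] = j
        A[:, j - 1] /= np.sqrt(j * (j + 1))
    return A

n = 4
A = helmert_basis(n)
Sigma = np.array([[1.0, 0.3, 0.0],          # any psd matrix of size n - 1 (arbitrary example)
                  [0.3, 0.5, 0.1],
                  [0.0, 0.1, 0.2]])
M = A @ Sigma @ A.T                          # centered covariance matrix as in (16)
print(np.allclose(M @ np.ones(n), 0.0))      # M is centered: M 1 = 0
```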

3.3 A hierarchical Gaussian model for GCS block covariance matrices

Let us now return to the general case, where the levels of are partitioned in groups. It will be convenient to use the hierarchical notation , indicating that belongs to the group . Then, we consider the following hierarchical Gaussian model:

(17)

where for each the random variable represents the effect of the group, and the random variables represent the effects of the levels in this group. We assume that the group effects are normal, and all the level effects are normal. We also assume that the level effects are independent, and independent of the group effects.
Notice that, up to centering conditions that will be considered next, the group effect is the mean of the group. Hence, its covariance is interpreted as the between-group means covariance. Similarly, the level effect is the within-group effect around the group mean. This justifies the notations used.

As an extension of Prop. 1, the next results show that (17) gives a one-to-one parameterization of valid GCS block covariance matrices, under the additional assumption that the average of level effects is zero in each group.

Theorem 1.

The covariance matrix of conditional on is a GCS block matrix with, for all :

(18)

where is a centered positive semidefinite matrix equal to . Conversely, let be a positive semidefinite GCS block matrix. Then there exists a representation (17) such that is the covariance of conditional on zero average errors , with:

where is the matrix obtained by averaging each block of .

Corollary 1.

Positive semidefinite GCS block matrices with CS diagonal blocks exactly correspond to covariance matrices of in (17) conditional on the constraints when .

As a by-product, we obtain a simple condition for checking the positive definiteness of GCS block matrices. Interestingly, it only involves a small matrix whose size is the number of groups.

Theorem 2.

Let be a GCS block matrix. Then

  1. is positive semidefinite if and only if is positive semidefinite.

  2. is positive definite if and only if is positive definite and the diagonal blocks are positive definite for all .

Furthermore, we have

(19)

where is the matrix
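A minimal numerical illustration of Theorem 2: for a GCS block matrix (here with CS diagonal blocks), positive semidefiniteness can be checked on the small G x G matrix obtained by averaging each block. The construction, names and values below are illustrative assumptions, not the authors' code.

```python
# Illustration of the averaging criterion of Theorem 2 on a GCS block matrix.
import numpy as np

def gcs_from_cs_blocks(sizes, within, between):
    """Block matrix with CS diagonal blocks (variance 1, covariance within[g]) and
    constant between-group blocks between[g, h]."""
    G = len(sizes)
    blocks = [[np.full((sizes[a], sizes[b]), between[a, b]) for b in range(G)] for a in range(G)]
    for a in range(G):
        n = sizes[a]
        blocks[a][a] = (1.0 - within[a]) * np.eye(n) + within[a] * np.ones((n, n))
    return np.block(blocks)

def block_average(T, sizes):
    """G x G matrix obtained by averaging each block of T."""
    edges = np.cumsum([0] + list(sizes))
    G = len(sizes)
    return np.array([[T[edges[a]:edges[a + 1], edges[b]:edges[b + 1]].mean()
                      for b in range(G)] for a in range(G)])

sizes = [3, 2, 4]
within = np.array([0.7, -0.4, 0.5])                        # negative within-group correlation allowed
between = np.array([[1.0, 0.2, 0.1], [0.2, 1.0, 0.3], [0.1, 0.3, 1.0]])
T = gcs_from_cs_blocks(sizes, within, between)
B = block_average(T, sizes)
print(np.all(np.linalg.eigvalsh(B) >= -1e-12))             # check on the small averaged matrix
print(np.all(np.linalg.eigvalsh(T) >= -1e-12))             # agrees with the direct check on T
```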

Remark 2.

All the results depend on the conditional distribution . Thus there is some flexibility in the choice of , since several matrices can lead to the same conditional covariance matrix .

Remark 3 (Groups of size 1).

Theorem 1 is still valid for groups of size 1. Indeed if , then is degenerate and equal to 0. Thus is positive semidefinite.

We end this section with an “exclusion property” for groups with strong negative correlation. A positive semidefinite GCS covariance matrix can exhibit negative within-group correlations, but this induces limitations on the between-group correlations. More precisely, the following result shows that if a group has the strongest possible negative within-group correlations, then it must be independent of the others.

Proposition 3 (Exclusion property for groups with minimal correlation).

Let be a GCS covariance matrix, and let be a centered Gaussian vector such that . Let be a group number, and denote by (resp. ) the subvector extracted from whose coordinates are (resp. are not) in group . Assume that is such that . Then is independent of .

The condition is linked to minimal correlations. Indeed, since is positive semidefinite, . The limit case is obtained when negative terms of are large enough to compensate positive ones. As an example, if is a positive semidefinite CS covariance matrix with variance and minimal negative covariance , then .

3.4 Related works

The hierarchical model (17) shares similarities with two-way Bayesian models and linear mixed effect models (see e.g. Lindley and Smith (1972)), with Gaussian priors for the effects and . The centering constraints are also standard identifiability conditions in such models. Furthermore, the particular case of CS covariance matrices corresponds to the exchangeable assumption of the corresponding random variables. Typically, in the framework of linear modelling, Model (17) could be written as

with an additional grand mean and error terms.

However, while the framework is similar, the goal is different. In linear modelling, the aim is to quantify the effects by estimating their posterior distribution. On the other hand, we aim at investigating the form of the covariance matrix of the response part or, equivalently, the covariance matrix involved in the likelihood.

3.5 Guideline for practical usage

The results of the previous sections show that valid GCS block covariance matrices can be parameterized by a family of covariance matrices of smaller sizes. It contains the case where diagonal blocks are CS covariance matrices. The algorithm is summarized below.

  1. Generate a covariance matrix of size G.

  2. For all ,
    If , set , else:

    • Generate a covariance matrix of size .

    • Compute a centered matrix , where is a by matrix whose columns form an orthonormal basis of .

  3. For all , compute the within-group blocks and between-group blocks by Eq. (18).

In steps 1 and 2, the generator covariance matrices and can be general, and obtained by one of the parameterizations of §2.2.2. A direct application of Theorem 2 also shows that is invertible if and only if and the ’s are invertible (cf. Appendix for details).
Furthermore, some specific form, such as CS matrices, can be chosen. Depending on the number of groups and their sizes, different levels of parsimony can be obtained. Table 1 summarizes some possibilities.
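A possible implementation of this guideline is sketched below (in Python, rather than the R packages used by the authors); the random generators for steps 1 and 2 are only one convenient choice, and all names are ours. The resulting matrix is positive semidefinite by construction.

```python
# Sketch of the guideline of Section 3.5: generate a valid GCS block covariance matrix.
import numpy as np

def helmert_basis(n):
    """n x (n-1) matrix whose columns are an orthonormal basis of the orthogonal
    complement of the ones vector (normalized Helmert contrasts)."""
    A = np.zeros((n, n - 1))
    for j in range(1, n):
        A[:j, j - 1] = -1.0
        A[j, j - 1] = j
        A[:, j - 1] /= np.sqrt(j * (j + 1))
    return A

def random_cov(m, rng):
    """A generic covariance matrix of size m (Wishart-like draw; one possible generator)."""
    X = rng.standard_normal((m + 2, m))
    return X.T @ X / (m + 2)

def gcs_covariance(sizes, rng):
    """Valid GCS block covariance matrix built as in steps 1-3."""
    G = len(sizes)
    B = random_cov(G, rng)                                  # step 1: G x G between-group matrix
    blocks = [[B[a, b] * np.ones((sizes[a], sizes[b])) for b in range(G)] for a in range(G)]
    for a, n in enumerate(sizes):                           # step 2: centered within-group parts
        if n > 1:
            A = helmert_basis(n)
            blocks[a][a] = blocks[a][a] + A @ random_cov(n - 1, rng) @ A.T
    return np.block(blocks)                                 # step 3: assemble the blocks (Eq. (18))

rng = np.random.default_rng(0)
T = gcs_covariance([4, 3, 1], rng)
print(np.all(np.linalg.eigvalsh(T) >= -1e-10))              # positive semidefinite by construction
```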

Note that the parameterization is designed for a general covariance matrix, and does not directly handle additional constraints, such as being a correlation matrix. In these situations, one can take advantage of the economical condition of Theorem 2: positive semidefiniteness of the full matrix is always equivalent to positive semidefiniteness of the small matrix of size G obtained by block averaging. This can be handled more easily by non-linear or semidefinite programming algorithms.

Table 1: Parameterization details for some valid GCS block covariance matrices (parametric setting, resulting form, and number of parameters).

4 Example

Example 1

Consider the deterministic function

with , and .

Figure 1: Test function of Example 1. Bullets represent design points.

As visible in Figure 1, there are two groups of curves corresponding to levels and with strong within-group correlations, and strong negative between-group correlations.

Figure 2: Estimated correlation kernel, based on a representative design of experiments (design with median criterion value). Panels: 1 group, 2 groups, 5 groups (a), 5 groups (b), 13 groups, ordinal.

We aim at reconstructing with five GP models based on level grouping. The first one uses a CS covariance matrix, corresponding to a single group. The second one considers the two groups and . The third model, based on the five groups , , , , , has two variants: (a) when the inter-group correlation is constant and (b) in the general case. The fourth model uses the spherical parameterization, leading to 13 groups, and the last one considers an ordinal parameterization. The ordinal kernel is obtained by mapping a piecewise linear cdf into the cosine kernel of Eq. (6) (see Section 2.2.1). Notice that the choice of this kernel is motivated by recovering negative correlations, and has no link with the sinusoidal form of the curves of the example.

In order to benefit from the strong link between levels, we use a design that spreads out the points between levels. For instance, the information given by may be useful to estimate at for a different level , without computing . More precisely, we have used a (random) sliced Latin hypercube design (SLHD) (Qian, 2012) with points per level, for a total budget of points.
All parameters are estimated by maximum likelihood. As the likelihood surface may be multimodal, we have launched several optimizations with different starting points chosen at random in the domain.
Model accuracy is measured over a test set formed by a regular grid, in terms of the Q2 criterion. The Q2 criterion has an expression similar to that of R2, but is computed on the test set:

(20)   Q2 = 1 - Σ_i (y_i - ŷ_i)² / Σ_i (y_i - ȳ)²

where the y_i denote the observations (on the test set), ȳ their mean, and the ŷ_i the predictions. It is negative if the model performs worse than the mean, positive otherwise, and tends to 1 when predictions are close to the true values. Finally, the process is repeated several times, in order to assess the sensitivity of the result to the design.
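For reference, a minimal implementation of the criterion (20), under the assumption that it is the usual Q2 formula recalled above:

```python
# Sketch of the Q2 accuracy criterion of Eq. (20), computed on a test set.
import numpy as np

def q2(y_test, y_pred):
    """1 - sum (y - yhat)^2 / sum (y - ybar)^2, computed on the test set."""
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(q2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))   # close to 1 for accurate predictions
```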

Results.

The estimated correlation parameters are shown in Figure 2. Information about the plots used is given in the last section. The correlation structure that can be intuited from Figure 1 is well recovered with two groups and five groups, with different between-group correlations. The model with thirteen groups involves the estimation of many parameters, which is hard to achieve, especially with so few points. This is visible in the erratic values of the estimated correlations, which do not seem meaningful. On the contrary, considering only one group, or five groups with a common between-group correlation, oversimplifies the correlation structure. Finally, the model using an ordinal kernel recovers the two groups of curves, as well as the strong negative correlation between them, which is made possible by the choice of the 1-dimensional kernel used in the warping.

In Figure 3, we can see that the best tradeoff between prediction accuracy and parsimony is obtained with two groups, even though grouping reduces the number of observations per group. Notice the rather good performance of the ordinal model, at the cost of a larger number of parameters.

Figure 3: Q2 of six GP models, based on repetitions of the design. From left to right: 1 group (CS structure), 2 groups, 5 groups (a: common between-group covariance), 5 groups (b: general), 13 groups, ordinal. Number of parameters used (boxplot order): 5, 7, 10, 19, 90, 16.

Example 2

We now provide a second example, in order to illustrate the ability of the hierarchical model (17) to deal with negative within-group correlations. We consider the deterministic function given by:

with , . As visible in Figure 4, the levels can be split into two groups: a group of almost linear functions (levels ), and a group of damped sinusoidal functions (levels ). Within the latter group, there are strong negative correlations between levels and . Hence, the levels could also be split into three groups.

Figure 4: Test function of Example 2. Bullets represent design points.

In this section, we briefly compare the corresponding GP models:

  • The first model considers the two groups and . The within-group structure is CS for the first group (linear functions). But a general structure is chosen for the second one (sinusoids), in order to capture its complex covariance structure.

  • The second model is based on the three groups , and . The within-group structure is CS, and the between-group covariance is general.

For simplicity, we consider a single stratified design of experiments extracted from a sequence of regularly spaced points, with points per level. The other settings are the same as in Example 1.

The estimated correlation parameters are shown in Figure 5. The correlation structure that can be intuited from Figure 4 is well recovered by the two models. However, in the case of two groups, the estimated between-group correlation is nearly zero. This is an illustration of the exclusion property (Prop. 3). Indeed, due to the strong negative (estimated) correlations within the second group, the averaged within-group covariance is close to zero, which induces a small correlation with the other group. In this example, the model with three groups may be more appropriate, which seems confirmed by the larger value of Q2 (compared to the two-group model). Nevertheless, it is nice to see that, starting from a smaller number of groups, the correlation plot detects the two subgroups of sinusoids.

Figure 5: Estimated correlation kernel for the two GP models (2 groups and 3 groups).

Discussion

It is remarkable that complex correlation structures can be recovered with few data points per level. Indeed, even if the correlation structure can be intuited from the whole curves, this complete information is of course not available, and the estimation is done with only three points per level in the previous examples. The reasons may be twofold. Firstly, the model is parametric, which may provide more accurate estimation with few data, provided that the model is relevant (in particular if the correct groups are given). Secondly, the global amount of information available may be large enough, since the small number of points per level is compensated by the quite large number of levels. Consequently, when there are few levels, one may need to increase the number of points per level in order to reach an acceptable level of accuracy.

The sinusoidal form of the toy functions has been chosen to illustrate the power of GP models in modelling complex functions. Other complex forms could have been used. An alternative would have been to use sample paths of GP models with group kernels rather than deterministic functions. Finally, notice that correlation plots should be interpreted with care in general, since they do not account for the estimation error on correlations.

5 Application in nuclear engineering

5.1 Position of the problem

As presented in the Introduction, this research is originally motivated by the solution of an inverse problem confronting experimental measurements in nuclear engineering with time-consuming numerical simulations. More precisely, this analysis concerns the identification of the mass of a given radioactive material present in a particular waste container, using a non-destructive nuclear detection technique such as gamma spectrometry (Knoll, 2010). In that case, at each energy level,

(21)

where is the quantity of interest provided by the gamma transmitter, and is the attenuation coefficient, which depends on the source environment denoted by . In practice, only discrete values of are of interest, corresponding to the natural energy levels of :

(22)

Then, based on previous studies (Guillot, 2015), the real source environment is parameterized by the following input variables:

  • An equivalent geometric shape for the nuclear waste: sphere (‘sph’), cylinder (‘cyl’) or parallelepiped (‘par’).

  • An equivalent material for this waste, characterized by its chemical element with atomic number in {1, …, 94},

  • The bulk density of the waste, in ,

  • The distance of measurement between the container and the measurement device, in cm,

  • The mean width and lateral surfaces (in logarithmic scale) crossed by a gamma ray during the rotation of the object.

After normalization, the characteristics of the input space can be summed up in Table 2.

Name of the input	Variation domain
Distance	[0,1]
Density	[0,1]
Width	[0,1]
Surface	[0,1]
Energy	6 discrete levels
Shape	{sph, cyl, par}
Chemical element	{1, …, 94}
Table 2: Description of the input variables for the nuclear application.

With the notation of the previous sections, let and be the vectors gathering respectively the continuous and categorical inputs, and . For a given value of , Monte Carlo simulation codes such as MCNP (Goorley et al., 2013) can be used to model the measured scene and approach the value of . The mass can then be sought as the solution of the following optimization problem:

(23)

where is the classical Euclidean norm, and the two vectors respectively gather the measured values and the simulated values at the six energy values used for the measurements. To solve (23), it is therefore necessary to compute the simulator output at a large number of points. However, each evaluation of the MCNP code can be extremely demanding (between several minutes and several hours of CPU time for one evaluation). Thus, surrogate models have to be introduced to emulate the corresponding function, which is now investigated in the framework of Gaussian process regression. We refer to Clement et al. (2018) for the second step, namely the treatment of the inversion problem.

5.2 Model settings

For pedagogical purposes, a dataset of large size has been computed with the MCNP code. The construction of the design of experiments was guided by the categorical inputs, such that each combination of levels appears a fixed number of times. It was completed by a Latin hypercube to define the values of the four continuous inputs.
From this full dataset, a training set is extracted by selecting at random a fixed number of observations per chemical element. The remaining points serve as test set.

Figure 6: Y as a function of the energy (left) and the geometric shape (right).

Model settings are now motivated by a graphical analysis. In Figure 6, the output is displayed as a function of the energy and the geometric shape. We observe that successive energy levels correspond to close output values. This confirms that the energy is ordinal, and we use the warped kernel defined by Eq. (5). The influence of the geometric shape is less obvious, and we have chosen an exchangeable (CS) covariance structure for it.

In Figure 7, the output is displayed as a function of the chemical elements, ordered by atomic number. Two important facts are the high number of levels and heteroscedasticity. To cope with the high number of levels, the chemical elements are divided into groups, provided by expert knowledge and represented by colors. This partition suggests using a group kernel of the form (11), where the within-group blocks are CS covariance matrices. In order to handle heteroscedasticity, the variance is assumed to depend on the group number.

Figure 7: Y as a function of the chemical element, ordered by atomic number.

The influence of the continuous variables can be observed on panels (not represented), and does not reveal useful information for our purpose. A Matérn 5/2 kernel is set for all continuous inputs, as we expect the output to be a regular function of the continuous inputs. Indeed, for this kernel, the corresponding Gaussian process is two times mean-square differentiable.
Finally, three candidate kernels for are obtained by combining the kernels of input variables defined above, by sum, product or ANOVA (see Section 2).

5.3 Results

Following the model settings detailed above, Figure 8, Panel 3, presents the results obtained with random designs of size and three operations on kernels. Furthermore, we have implemented three other kernels for the chemical element, in order to compare other model choices for this categorical input. In the first panel, we grouped all the 94 levels in a single group. In the second one, we kept the 5-group kernel but forced the between-group covariances to have a common value. Finally, in the fourth panel, we considered that the levels were ordered by their atomic number, and used the warped kernel of Eq. (5) with a Normal transform.

Figure 8: Q2 of several GP models, based on random designs, corresponding to different model choices for the chemical element. First panel: single group. Second panel: 5 groups, with a common between-group covariance. Third panel: 5 groups. Fourth panel: ordered levels. For each panel, three combinations of kernels are tested: sum (S), product (P) or ANOVA (A). Total number of parameters used (panel order): ‘prod’ = (12, 21, 30, 14), ‘sum’ = ‘prod’ + 6, ‘anova’ = ‘prod’ + 7.

First, comparing the three operations on kernels, we remark that in all the panels, additive kernels provide the worst results. This suggests the existence of interactions between different inputs of the simulator. Second, the ANOVA combination produces slight improvements, compared to the standard tensor-product, both in terms of accuracy and stability with respect to design choice.

Now, comparing the four panels, we see that gathering the levels in a single group is the least efficient strategy. The 5-group kernel gives very good performance, especially when the between-group covariances vary freely: constraining them to be equal degrades the result. Surprisingly here, the ordinal kernel gives the best performance. Indeed, for this application it was not intuitive to the experts that the chemical element can be viewed as an ordinal variable, simply sorted by its atomic number. This is confirmed by the correlation plots of Figure 9, corresponding to a model with a median score. We can see that the estimated correlations between levels seem to decrease as the difference between levels increases, an indication that the levels may be ordered by their atomic number.
Finally, we report several post-processing results. First, the estimated transformation of energy levels (Figure 10, left) is concave and flat near high values, which corresponds to the behaviour observed in Figure 6 (left panel). In addition, the last three levels lead to similar results (Figure 10, right). This corresponds to the fact that when the energy is high, the gamma ray almost always crosses the nuclear waste, leading to a high value for the output. Second, the estimated correlation among the sphere, the cylinder and the parallelepiped is very high (Figure 11). This justifies considering a covariance structure for that categorical input, rather than using three independent GP models for the three levels.

Figure 9: Estimated correlation kernel for the chemical element, with a common between-group covariance parameter (left, 5 groups (a)) or different ones (right, 5 groups (b)).
Figure 10: Estimated correlation kernel for the energy: estimated warping (left) and correlation structure (right).
Figure 11: Estimated correlation kernel for the geometric shape.

6 Conclusion

In the framework of GP regression with both continuous and categorical inputs, we focus on problems where categorical inputs may have a potentially large number of levels , partitioned in groups of various sizes. We provide new results about parsimonious block covariance matrices, defined by a few within- and between-group covariance parameters.

We revisit a two-way nested Bayesian linear model, where the response term is defined as a sum of a group effect and a level effect. We obtain a flexible parameterization of block covariance matrices which automatically satisfy the positive definiteness conditions. As a particular case, we recover situations where the within-group covariance structures are compound symmetry, with possible negative correlations. Furthermore, we show that the positive definiteness of a given block covariance matrix can be checked by verifying that the small matrix of size obtained by averaging each block is positive definite. This criterion can be useful if the proposed block matrix has a desirable constraint, such as homoscedasticity, which is not directly handled by the proposed parameterization.

We apply these findings to several toy functions as well as to an application in nuclear engineering, with 4 continuous inputs and 3 categorical inputs, one of them having 94 levels corresponding to atomic numbers in Mendeleev’s periodic table. In this application, 5 groups were defined by experts. The results, measured in terms of prediction accuracy, outperform those obtained with simpler assumptions, such as gathering all levels into a single group. It is gratifying that our nominal method performs almost as well as using the appropriate order with a warped kernel.

There are several perspectives for this work. First, one future direction is to find a data-driven technique to recover groups of levels, made more difficult when a small number of observations is available. Similarly, if there is an order between levels, can we infer it from the data? Second, the trend of the GP models with mixed continuous and categorical inputs could be made more complex, in the same vein as works on GP models with continuous inputs.

Software and acknowledgements

Implementations have been done with the R packages mixgp and kergp (Deville et al., 2015). Illustrations use ggplot2 (Wickham, 2009) and corrplot (Wei and Simko, 2016).
This research was conducted within the frame of the Chair in Applied Mathematics OQUAIDO, gathering partners in technological research (BRGM, CEA, IFPEN, IRSN, Safran, Storengy) and academia (CNRS, Ecole Centrale de Lyon, Mines Saint-Etienne, University of Grenoble, University of Nice, University of Toulouse) around advanced methods for Computer Experiments. The authors thank the participants for fruitful discussions. They are also grateful to the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme UNQ when work on this paper was undertaken (EPSRC grant number EP/K032208/1).

Appendix

Proof of Proposition 1.

The vector is a centered Gaussian vector with covariance matrix

Hence the conditional distribution of knowing is a centered Gaussian vector with covariance matrix

Then, by using the independence between and the ’s, we deduce

We recognize the CS covariance matrix with and . As a covariance matrix, it is positive semidefinite. Furthermore, we have and , and the conditions of positive definiteness (14) are satisfied.
Conversely, let be a positive definite CS matrix . Then we have , and we can define and . From the direct part of the proof, we then obtain that the covariance matrix of is . ∎

Proof of Proposition 2.

The first part of the proposition is obtained by remarking that if , then . Thus, assuming that is centered, is equivalent to with probability .
For the second part, notice that means that is orthogonal to . Thus, one can write the expansion of in the orthonormal basis defined by . Denoting by the -vector of coordinates, we have . This gives , and (16) follows with .
To prove uniqueness, observe that, by definition, . Starting from , and multiplying by on the left and by on the right, we get , showing that is unique.
Now, let . Since , we obtain

As a by-product, notice that resubstituting into gives .
Finally, if , then the properties of conditional Gaussian vectors lead immediately to . ∎

Proof of Theorem 1.

The expressions of and are obtained directly by using the independence assumptions about and the ’s. Notice that , the covariance matrix of knowing , is centered by Proposition 2. This gives , which is positive semidefinite. Hence is a GCS block matrix.

Conversely, let be a positive semidefinite GCS block matrix. Let be the matrix obtained from by averaging each block. Then is also a positive semidefinite matrix. Indeed, since is positive semidefinite, it is the covariance matrix of some vector . Then is the covariance matrix of , the vector obtained from by averaging by group: . Thus there exists a centered Gaussian vector whose covariance matrix is

Now, for , define

Observe that , and by assumption is positive semidefinite. Hence, from Proposition 2, there exists a centered Gaussian vector such that

We can assume that are independent, and and are independent. Finally, we set . By the direct sense and (18), we obtain that is the covariance matrix of conditional on

Proof of Corollary 1.

Let be a positive semidefinite GCS block matrix with CS diagonal blocks. Then the diagonal CS matrices are positive semidefinite, leading to . Thus,