Neural network-based, structure-preserving entropy closures for the Boltzmann moment system

This work presents neural network-based minimal entropy closures for the moment system of the Boltzmann equation that preserve the inherent structure of the underlying system of partial differential equations, such as entropy dissipation and hyperbolicity. The described method embeds the convexity of the moment-to-entropy map in the neural network approximation to preserve the structure of the minimal entropy closure. Two techniques are used to implement the method. The first approach approximates the map between moments and the minimal entropy of the moment system and is convex by design. The second approach approximates the map between moments and the Lagrange multipliers of the dual of the minimal entropy optimization problem, which represent the gradients of the entropy with respect to the moments, and is enforced to be monotonic by the introduction of a penalty function. We derive an error bound for the generalization gap of convex neural networks which are trained in Sobolev norm and use the results to construct data sampling methods for neural network training. Numerical experiments are conducted which show that neural network-based entropy closures provide a significant speedup for kinetic solvers while maintaining a sufficient level of accuracy. The code for the described implementations can be found in the GitHub repositories.

1 Introduction

In many applications, a macroscopic description of the physical system is no longer applicable and one has to rely on a more general description, which is given by kinetic equations such as the Boltzmann equation. Examples include neutron transport [NeutronTransport], radiative transport [Chahine1987FoundationsOR], semiconductors [Markowich1990SemiconductorE] and rarefied gas dynamics [BoltzmannApplications]. The Boltzmann equation is a high dimensional integro-differential equation, with a phase space depending on space and particle velocity. This high dimensionality of the phase space presents a severe computational challenge for large scale numerical simulations.
Several methods for phase space reduction have been proposed to solve the Boltzmann equation, including the discrete ordinate/velocity methods [NeutronTransport, Camminady2019RayEM, xiao2020velocity, XIAO2021110689, xiao2017well] and moment methods [AlldredgeFrankHauck, AlldredgeHauckTits, GarretHauck, KRISTOPHERGARRETT2015573, Levermore]. Discrete ordinate methods evaluate the velocity space at specific points, which yields a system of equations coupled only by the integral scattering operator. While computationally efficient, these methods suffer from numerical artifacts, which are called ray effects [Camminady2019RayEM]. Moment methods eliminate the dependency of the phase space on the velocity variable by computing the moment hierarchy of the Boltzmann equation. Due to the structure of the advection term, the resulting moment system is typically unclosed. One distinguishes moment methods according to the modelling of their closure. The classical closure uses a simple truncation that results in a system of linear hyperbolic equations. The main drawback of this method is its numerical artifacts, specifically large oscillations of the particle density, which may even result in negative particle concentrations. This effect is particularly pronounced in the streaming particle regime [Brunner2002FormsOA].
A moment closure which preserves important physical and mathematical properties [Levermore1996MomentCH] of the Boltzmann equation, such as entropy dissipation, hyperbolicity, and the H-theorem, is constructed by solving a convex constrained optimization problem based on the entropy minimization principle [AlldredgeHauckTits, Levermore]. This minimal entropy closure is accurate in the diffusive limit [GOUDON2013579] and, unlike the classical truncation closure, it is also accurate in the streaming limit [DUBROCA1999915]. Although the minimal entropy closure is methodically superior to the truncation closure, it is by far more expensive to compute. Garrett et al. [GarretHauck] have demonstrated that, in a high performance implementation, the solution of the entropy minimization problem consumes the dominant share of the computational time of the whole solver. This motivates the development of a neural network surrogate model to accelerate the entropy closure.

Several machine learning inspired methods have been proposed recently. The authors of [huang2021machine] close the moment system by learning the spatial gradient of the highest order moment. In [Han21983], the authors pursue two strategies. First, they use an encoder-decoder network to generate generalized moments and then learn the moment closure of the system with its dynamics in mind. Second, they directly learn the correction term to the Euler equations. In [huang2020learning], Galilean invariant machine learning methods for partial differential equations are developed using the conservation dissipation formalism. Using convolutional networks, a closure for the one dimensional Euler-Poisson system was constructed in [bois2020neural]. In [XIAO2021110521], a dense neural network was used to model the deviation from the Maxwellian in the collision term of the Boltzmann equation. The authors of [Maulik2020neural] use neural networks to reproduce physical properties of known magnetized plasma closures. In [ma20machine], fully connected, dense and discrete Fourier transform networks are used to learn the Hammett-Perkins Landau fluid closure. Physics-informed neural networks were employed to solve forward and inverse problems via the Boltzmann-BGK formulation to model flows in continuum and rarefied regimes in [lou2020physicsinformed], to solve the radiative transfer equation in [mishra2021physics] and the phonon Boltzmann equation in [li2021physicsinformed]. In [porteous2021datadriven], the authors propose a data-driven surrogate model of the minimal entropy closure using convex splines and empirically convex neural networks. To ensure convexity at the training data points, the authors penalize the network whenever the Hessian of the network output is not symmetric positive definite.
The goal of this work is to construct structure-preserving deep neural network surrogate models for the entropy closure of the moment system of the Boltzmann equation. The motivation is the established result of [AlldredgeFrankHauck] that convex approximations to the entropy preserve important mathematical properties of the moment system. The first proposed neural network maps a given moment to its minimal mathematical entropy. In contrast to the work proposed in [porteous2021datadriven], the neural network is input convex by design, using the methods of Amos et al. [Amos2017InputCN]. By this ansatz, the learned closure automatically inherits all structural properties of the entropy closure for the moment system, since any convex approximation to the entropy preserves said properties [AlldredgeFrankHauck]. The derivative of the network with respect to the moments maps to the corresponding optimal Lagrange multipliers of the entropy minimization problem. We train the neural network on the output, the Lagrange multiplier and additionally on the reconstructed moments, whereas in [porteous2021datadriven], the authors train on the output, the reconstructed moments and the Hessian of the network. The second approach in this work is a monotonic neural network that maps the moments directly to the Lagrange multipliers of the entropy minimization problem. We use a penalty function to train the neural network to be monotonic and otherwise use the same loss as in the input convex approach.
The remainder of this paper is structured as follows. In Section 2, we give a brief introduction to the Boltzmann equation. We review the moment system with minimal entropy closure and examine its benefits and shortcomings. In Section 3, we present our findings for structure-preserving, neural network-based surrogate models for the minimal entropy closure. To this end, we describe two structure-preserving neural network architectures and their integration into an end-to-end numerical solver. We show that the intrinsic structure of the moment system is preserved. Additionally, we analyze the data-to-solution map of the minimal entropy closure and perform a dimension reduction. In Section 4, we review characterizations of the boundary of the set of feasible moments for the minimal entropy closure. Afterwards, we propose an error bound for the generalization gap of the gradient of input convex neural networks trained in Sobolev norm. Then, we propose a sampling strategy to generate training data for closures in arbitrary spatial dimension and moment order, based on the analysis of the generalization gap. Lastly, Section 5 presents a range of numerical studies which show that the neural entropy closure is computationally more efficient than the reference solution. Furthermore, a series of synthetic tests as well as simulation tests are conducted to inspect the numerical accuracy of the neural network-based entropy closures.

2 Kinetic Theory

2.1 Kinetic equations

Classical kinetic theory is profoundly built upon the Boltzmann equation, which describes the space-time evolution of the one-particle kinetic density function in a many-particle system

(1)

The phase space consists of time t, space x, and particle velocity v. The left-hand side of the equation describes particle transport, where the advection operator describes the movement of the particle density with velocity v in the spatial directions. The integral operator on the right-hand side of the equation models interactions of the particles with the background medium and collisions with other particles. If the particles only collide with a background material, one can model this behavior with the linear Boltzmann collision operator

(2)

where the collision kernel models the strength of collisions at different velocities. If the interactions among particles are considered, the collision operator becomes nonlinear. For example, the two-body collision results in

(3)

where the pre-collision velocities of the two colliding particles, the corresponding post-collision velocities, and the unit sphere appear. The right-hand side is a fivefold integral involving the so-called deflection angle. In the following, we use the notation

(4)

to define integrals over velocity space.
Well-posedness of Eq. (1) requires appropriate initial and boundary conditions. The Boltzmann equation is a first-principles model based on direct modeling. It possesses some key structural properties, which are intricately related to the physical processes and its mathematical existence and uniqueness theory. We briefly review some of these properties, where we follow [AlldredgeFrankHauck, Levermore1996MomentCH]. First, the time evolution of the solution is invariant in range, i.e. if the initial kinetic density takes values in a given range, then the solution takes values in the same range for all later times. In particular, this implies non-negativity of the kinetic density. Second, if a function of the velocity is a collision invariant fulfilling

(5)

the equation

(6)

is a local conservation law. Third, for each fixed velocity direction, the advection operator, i.e. the left-hand side of Eq. (1), is hyperbolic in space and time. Fourth, there is a twice continuously differentiable, strictly convex function, called the kinetic entropy density, with the property

(7)

Applied to Eq. (1), we get the local entropy dissipation law

(8)

A standard choice of the kinetic entropy density is specified in Section 3.1. Lastly, the solution fulfills the H-theorem, i.e. equilibria are characterized by any of the three equivalent statements,

(9)
(10)
(11)

where the set appearing in the last statement denotes the linear span of all collision invariants.

2.2 Moment methods for kinetic equations

The Boltzmann equation is an integro-differential equation defined on a seven-dimensional phase space. With the nonlinear fivefold collision integral, it is challenging to solve accurately and efficiently. The well-known moment method encodes the velocity dependence of the Boltzmann equation by multiplication with a vector of velocity dependent basis functions, consisting of polynomials up to a given order, and subsequent integration over the velocity domain. In one spatial dimension, the number of basis functions usually equals the polynomial order plus one, whereas in higher spatial dimensions it equals the total number of basis functions up to the given order. The solution of the resulting moment equation is the moment vector, calculated by

(12)

Common choices for the basis functions are monomials or spherical harmonics, depending on the application. Typically, they include the collision invariants defined in Eq. (5). The moment vector satisfies the system of transport equations

(13)

which is called the moment system. By construction, the advection operator depends on the full kinetic density and thus the moment system is unclosed. Moment methods aim to find a meaningful closure for this system. Since the kinetic equation dissipates entropy and fulfills a local entropy dissipation law, one can close the system by choosing, out of the set of all kinetic densities that reproduce the given moments and satisfy the invariant range condition, the one with minimal entropy. The minimal entropy closure can thus be formulated as a constrained optimization problem for a given vector of moments.

(14)

The minimal value of the objective function defines the entropy of the moment system, and the minimizer of Eq. (14) is used to close the moment system

(15)

The set of all moments corresponding to admissible kinetic densities is called the realizable set

(16)

It is the set of all moments corresponding to kinetic densities that fulfill the invariant range condition of the kinetic equation. A solution of the minimal entropy problem does not always exist [Hauck2008ConvexDA]. However, if a solution exists, it is unique and of the form

(17)

where the Lagrange multiplier is the solution of the convex dual problem

(18)

in which the Legendre dual of the kinetic entropy density appears. By the strong duality of the minimal entropy problem, the maximum of (18) equals the minimum of (14), and at the optimal point we can write

(19)

The resulting twice differentiable and convex function serves as the entropy of the moment system [AlldredgeFrankHauck]. We can recover the moment by using the first-order optimality conditions

(20)

which also yields Eq. (17). This gives the inverse of the solution map of the dual optimization problem. Furthermore, the derivative of the moment system entropy recovers the optimal Lagrange multiplier of Eq. (18),

(21)

The minimal entropy closure also conserves the structural properties of the Boltzmann equation listed above. We present these properties for the moment system for the sake of completeness, where we follow [AlldredgeFrankHauck, Levermore]. First, the invariant range property of the solution translates to the requirement that the moments remain in the realizable set for all times. Second, if a moment basis function is a collision invariant, then

(22)

is a local conservation law. Third, one can write Eq. (15) as a symmetric hyperbolic conservation law in the Lagrange multipliers. Fourth, the moment system entropy and the corresponding entropy flux form an entropy/entropy-flux pair compatible with the advection operator and yield a semi-discrete version of the entropy dissipation law.

(23)

Note that convexity of the entropy is crucial for the entropy dissipation property. Lastly, the moment system fulfills the H-theorem, which states the equivalence of the following statements

(24)
(25)
(26)

A numerical method to solve the moment system therefore consists of an iterative discretization scheme for the moment system (15) and a Newton optimizer for the dual minimal entropy optimization problem in Eq. (18). The former can be a finite volume or discontinuous Galerkin scheme, for example [KRISTOPHERGARRETT2015573]. The drawback of the method is the high computational cost associated with the Newton solver. The optimization problem in Eq. (18) needs to be solved in each grid cell at every time step of the kinetic solver, and the computational effort grows over-proportionally with the order of the moment basis, so that the optimizer consumes an ever larger share of the total computation time as basis functions are added, as Garrett et al. have demonstrated in a computational study [KRISTOPHERGARRETT2015573]. Furthermore, the optimization problem is ill-conditioned if the moments are near the boundary of the realizable set [AlldredgeHauckTits]. At the boundary, the Hessian of the objective function becomes singular and the kinetic density degenerates to a sum of delta functions [Curto_recursiveness].
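To make the role of the Newton solver concrete, the following is a minimal sketch of the dual iteration for a first-order closure in one space dimension, assuming a monomial basis (1, v) on the velocity interval [-1, 1], the Maxwell-Boltzmann entropy and Gauss-Legendre quadrature. The function names, tolerances and the absence of a line search are illustrative simplifications, not the solver used in this work.

```python
import numpy as np

# Gauss-Legendre quadrature on the velocity interval [-1, 1] (illustrative choice)
v, w = np.polynomial.legendre.leggauss(30)
m = np.stack([np.ones_like(v), v])           # monomial basis (1, v) at the quadrature points

def moments(alpha):
    """u(alpha) = <m exp(alpha . m)>, moments of the reconstructed density (Maxwell-Boltzmann)."""
    f = np.exp(alpha @ m)
    return m @ (w * f)

def hessian(alpha):
    """H(alpha) = <m m^T exp(alpha . m)>, symmetric positive definite inside the realizable set."""
    f = np.exp(alpha @ m)
    return (m * (w * f)) @ m.T

def newton_closure(u, tol=1e-10, max_iter=100):
    """Solve the dual problem: find alpha such that <m exp(alpha . m)> = u (no line search, for brevity)."""
    alpha = np.zeros(len(u))
    for _ in range(max_iter):
        g = moments(alpha) - u               # residual, i.e. gradient of the negated dual objective
        if np.linalg.norm(g) < tol:
            break
        alpha -= np.linalg.solve(hessian(alpha), g)
    return alpha

u = np.array([1.0, 0.3])                     # a realizable first-order moment vector (|u_1| < u_0)
alpha = newton_closure(u)
print(alpha, moments(alpha))                 # the reconstructed moments should match u
```

Each iteration assembles and inverts the Hessian, which is exactly the per-cell, per-time-step cost that the neural network surrogate introduced below is meant to replace.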

3 Structure-preserving entropy closures using neural networks

The following section tackles the challenge of solving the minimal entropy closure in a computationally efficient manner while preserving the key structural properties of the Boltzmann equation. We propose two neural network architectures which map a given moment vector to the solution of the minimal entropy problem, replacing the costly Newton solver that is used in traditional numerical methods. Whereas the Newton solver requires the inversion of a near-singular Hessian matrix multiple times, the evaluation of a neural network needs only a comparatively small number of fast tensor operations to compute the closure.


A neural network is a parameterized mapping from an input to the network prediction. Typically, a network is a concatenation of layers, where each layer is a nonlinear parameterized function of its input values. The precise structure of the network depends on basic architectural design choices as well as many hyperparameters. A simple multi-layer neural network is a concatenation of layers consisting of nonlinear (activation) functions applied to weighted sums of the previous layer's output. Such a network can be described in tensor formulation as follows.

(27)
(28)
(29)

where each layer has a weight matrix and a corresponding bias vector. In the following, we denote the set of all trainable parameters of the network, i.e. the weights and biases, as the network parameters. Usually, one chooses a set of training data points and evaluates the network output using a loss function, for example the mean squared error between prediction and data

(30)

Then one can set up the process of finding suitable weights, called training of the network, as an optimization problem

(31)

The optimization is often carried out with gradient-based algorithms, such as stochastic gradient descent [robbins1951] or related methods such as Adam [kingma2017adam], which we use in this work.
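To fix notation, the following is a minimal Keras sketch of the generic setup in Eqs. (27)-(31): a small fully connected network trained on a mean squared error loss with Adam. The data, layer sizes and hyperparameters are placeholders and not related to the entropy closure networks described below.

```python
import numpy as np
import tensorflow as tf

# placeholder training data (x_i, y_i); the actual training data is described in Section 4
x_train = np.random.rand(1024, 2).astype("float32")
y_train = np.sin(x_train.sum(axis=1, keepdims=True))

# a simple multi-layer network: affine maps followed by nonlinear activations, cf. Eqs. (27)-(29)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                # linear output layer for regression
])

# mean squared error loss, cf. Eq. (30), minimized over the weights and biases, cf. Eq. (31)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.fit(x_train, y_train, batch_size=64, epochs=5, verbose=0)
```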

3.1 Data structure and normalization

The structure of the underlying data is crucial for the construction of meaningful machine learning models. In the following, we consider the primal and dual minimal entropy closure optimization problems, review the characterization of the realizable set as well as a dimension reduction, and finally describe helpful relations between the moments, the Lagrange multipliers, the entropy functional and the corresponding variables of reduced dimensionality.
The minimal entropy optimization problem in Eq. (18) and the set of realizable moments have been studied in detail in [AlldredgeFrankHauck, Levermore, Curto_recursiveness, Hauck2008ConvexDA, Junk, Junk1999, JunkUnterreiter2001, Pavan2011GeneralEA]. The characterization of the realizable set uses the fact that it is uniquely defined by its boundaries [Kren1977TheMM]. First, we remark that the realizable set of the entropy closure problem is generally an unbounded convex cone. To see this, consider the moment of order zero, which can attain any positive value for admissible kinetic densities. For a fixed moment of order zero, the subset of the corresponding realizable moments of higher order is bounded and convex [Kershaw1976FluxLN, Monreal_210538]. Consequently, we consider the normalized realizable set and the reduced normalized realizable set

(32)
(33)

which are both bounded and convex [Kershaw1976FluxLN, Monreal_210538]. This approach is also used in computational studies of numerical solvers for the minimal entropy closure problem [AlldredgeHauckTits]. We denote normalized moments and reduced normalized moments as

(34)
(35)

We establish some relations between the Lagrange multiplier of the original moment, the Lagrange multiplier of the normalized moment, and the Lagrange multiplier of the reduced normalized moment,

(36)
(37)

We define the reduced moment basis, which contains all basis functions of order greater than zero, as

(38)

since the basis function of order zero is constant. For the computations, we choose the Maxwell-Boltzmann entropy and a monomial basis; however, the relations can be computed analogously for other choices of entropy function and moment basis. The Maxwell-Boltzmann entropy has the following definition, Legendre dual and derivatives.

(39)
(40)
(41)
(42)
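Since the display equations above are not reproduced here, we state the standard forms for reference, writing η for the kinetic entropy density and η_* for its Legendre dual; these are the usual Maxwell-Boltzmann choices and are given as an assumption consistent with the surrounding text.

```latex
\eta(f) = f \ln f - f, \qquad \eta'(f) = \ln f, \qquad
\eta_*(y) = \exp(y),   \qquad \eta_*'(y) = \exp(y).
```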

In one spatial dimension, the monomial basis is given by the powers of the velocity variable up to the chosen order. Assuming knowledge of the Lagrange multiplier of the reduced normalized moment, we can derive an expression for the remaining zero-order component using the definition of the moment of order zero,

(43)

which we can transform to

(44)

using the Maxwell-Boltzmann entropy. This yields the complete Lagrange multiplier of the normalized moment vector. Finally, we use a scaling relation [AlldredgeHauckTits] to recover the Lagrange multiplier of the original moment vector by considering again the definition of the normalized moment

(45)

and multiply both sides with

(46)

which yields the original Lagrange multiplier

(47)

This also implies that the higher-order components of the Lagrange multiplier are not affected by the normalization. For completeness, the entropy of the normalized moments and the entropy of the original moments are related by

(48)
(49)
(50)
(51)

where we use Eqs. (19) and (47). These scaling relations enable a dimension reduction for faster neural network training. Furthermore, we use these relations to integrate the neural network models, which are trained on reduced normalized moments, into the kinetic solver, which operates on the original moments.
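As an illustration of this bookkeeping, the following sketch assumes the Maxwell-Boltzmann entropy, a monomial basis (1, v) on [-1, 1] and a first-order closure; the function and variable names are illustrative, and the exact scaling relations should be taken from Eqs. (44) and (47) rather than from this sketch.

```python
import numpy as np

v, w = np.polynomial.legendre.leggauss(30)   # quadrature on the velocity interval [-1, 1]
basis = np.stack([np.ones_like(v), v])       # monomial basis (1, v)

def normalize(u):
    """Normalized moment u/u_0 and reduced normalized moment (zero-order entry dropped)."""
    u_n = u / u[0]
    return u_n, u_n[1:]

def alpha0_from_reduced(alpha_r):
    """Zero-order multiplier from the order-zero condition <exp(alpha . m)> = 1 (first-order case)."""
    return -np.log(w @ np.exp(alpha_r * basis[1]))

def rescale_alpha(alpha_n, u0):
    """Scaling back to the original moment u = u0 * u_n: only the zero-order entry shifts by ln(u0)."""
    alpha = alpha_n.copy()
    alpha[0] += np.log(u0)
    return alpha

u = np.array([2.0, 0.6])                     # an original moment vector with u_0 = 2
u_n, u_r = normalize(u)
```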

3.2 Neural network approximations to the entropy functional

In the following, we present two ideas for neural network entropy closures, which are based on the results of [AlldredgeFrankHauck], where the authors propose a regularized entropy closure with a corresponding regularized entropy and a regularization parameter,

(52)

In the limit of vanishing regularization, the regularized closure recovers the original minimal entropy closure. The regularized entropy, which is twice differentiable and convex, acts as an entropy for the regularized moment system. Furthermore, the authors have shown that this approximation to the original entropy satisfies the conditions for the mathematical properties of the moment system presented in Section 2.2, most importantly hyperbolicity, entropy dissipation and the H-theorem. In a similar manner, we present a twice differentiable, convex approximation to the moment-to-entropy map of the normalized moment system. A neural network approximation constructed with these properties in mind preserves the structural properties of the moment system. Assuming the neural network is trained, i.e. it approximates the entropy sufficiently well, we have the following relations,

(53)
(54)
(55)
(56)
(57)

by using Eq. (17), Eq. (21), Eq. (44) and the definition of the moment vector.
The second neural network closure for the dual minimal entropy problem in Eq. (18) makes use of the following characterization of multivariate convex functions via monotonicity of their gradients [berkovitz2003convexity]. Let K be a convex set. A map g on K is monotonic if and only if (g(x) − g(y)) · (x − y) ≥ 0 for all x, y in K. Let f be differentiable on K. Then f is convex if and only if its gradient is monotonic. As a consequence, if the map from the normalized moments to the Lagrange multipliers is monotonic, then the corresponding entropy functional is convex in the moments. A trained monotonic neural network that approximates the moment-to-Lagrange-multiplier map fulfills the following relations,

(58)
(59)
(60)
(61)
(62)

We briefly examine the structural properties of a convex neural-network-based entropy closure. The invariant range property depends solely on the range of the reconstructed kinetic density. By definition of the Maxwell-Boltzmann entropy, the reconstruction is an exponential and therefore positive, so the neural-network-based entropy closure is of invariant range. Interchanging the entropy functional with a neural network does not affect the conservation property of the moment system. Consider the hyperbolicity requirement: in order to define the Legendre dual of the approximated entropy, it must be convex. In the proof of the hyperbolicity property, which is conducted in [Levermore1996MomentCH] with the Lagrange multipliers, respectively the moments, as system variables, the Hessian of the entropy must be symmetric positive definite. As a consequence, the neural network must be strictly convex in its input. Strict convexity of the entropy functional is the crucial requirement for the related properties of entropy dissipation and the H-theorem [Levermore1996MomentCH] as well.

3.2.1 Input convex neural network approximation of the entropy functional

Convex neural networks have been investigated in [Amos2017InputCN], where the authors propose several deep neural networks that are strictly convex with respect to the input variables by design. The design is guided by the following principles [boyd_vandenberghe_2004], which yield sufficient conditions to build a convex function. First, a positive sum of convex functions is convex. Second, the composition of a convex function that is non-decreasing in each argument with componentwise convex functions is convex. Applying these conditions to the definition of a layer of a neural network, Eq. (27), yields that all entries of the weight matrices must be non-negative in all layers except the first, and that the activation function of each layer must be convex. The authors of [chen2019optimal] have shown that such a network architecture with ReLU activations is dense in the space of convex functions. They first show that an input convex network can represent a maximum of affine functions, which in turn can approximate any convex function in the limit of infinitely many layers. However, in practice it turns out that very deep networks with non-negative weights are difficult to train. The authors of [Amos2017InputCN] therefore modify the definition of a hidden layer in Eq. (27) to

(63)
(64)

where the weights acting on the previous layer's output must be non-negative, while the weights acting directly on the network input may attain arbitrary values. We choose the strictly convex softplus function

softplus(y) = ln(1 + exp(y))    (65)

as the layer activation function for the hidden layers and a linear activation for the last layer, since we are dealing with a regression task. This leads to an at least twice continuously differentiable neural network. Non-negativity can be achieved by projecting the corresponding weights onto the non-negative half-space after each weight update. Next, we modify the first layer in Eq. (64) to include two preprocessing layers. We first zero-center the input data with respect to the mean vector of the training data set, then we decorrelate the channels of the input vector with respect to the covariance matrix of the training data,

(66)
(67)
(68)

where the eigenvector matrix of the covariance matrix of the training data set is used for the decorrelation. The first two operations and the weight multiplication of the dense layer are a concatenation of linear operations and thus do not destroy convexity either. Centering and decorrelation of the input data accelerate training, since the gradient of the first layer scales directly with the mean of the input data; a nonzero mean may therefore cause zig-zagging of the gradient vector [LeCun2012]. Lastly, we rescale and center the entropy function values of the training data. Note that in the following we switch to the notation corresponding to the entropy closure problem. We scale the entropy function values to a fixed target interval via

(69)

which is equivalent to a shift and scale layer after the output layer of the neural network. Thus the gradient of the scaled neural network output needs to be re-scaled to recover the original gradient,

(70)

Both operations are linear with a positive scaling factor and thus do not break convexity.
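As an illustration, the following is a minimal Keras sketch of an input convex layer stack in the spirit of [Amos2017InputCN]: non-negative kernel constraints on the weights acting on the previous layer, unconstrained passthrough weights acting on the input, and softplus activations. The preprocessing and output-scaling layers are omitted, all layer sizes are placeholders, and this is not the exact architecture used in this work.

```python
import tensorflow as tf

class ICNNBlock(tf.keras.layers.Layer):
    """One hidden layer z_{k+1} = softplus(W_z z_k + W_x x + b); W_z is kept non-negative
    so that convexity with respect to the network input x is preserved."""
    def __init__(self, units):
        super().__init__()
        self.wz = tf.keras.layers.Dense(units, use_bias=False,
                                        kernel_constraint=tf.keras.constraints.NonNeg())
        self.wx = tf.keras.layers.Dense(units)          # unconstrained passthrough from the input
    def call(self, z, x):
        return tf.math.softplus(self.wz(z) + self.wx(x))

class ICNN(tf.keras.Model):
    """Scalar-valued, input convex network."""
    def __init__(self, units=32, depth=4):
        super().__init__()
        self.first = tf.keras.layers.Dense(units, activation=tf.math.softplus)  # first layer unconstrained
        self.blocks = [ICNNBlock(units) for _ in range(depth)]
        self.out = tf.keras.layers.Dense(1, kernel_constraint=tf.keras.constraints.NonNeg())
    def call(self, x):
        z = self.first(x)
        for block in self.blocks:
            z = block(z, x)
        return self.out(z)                   # linear, non-negatively weighted output keeps convexity
```

Keras applies the non-negativity constraint as a projection after each weight update, which matches the projection step described above.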

Figure 1: Input convex neural network closure. Model input vectors are depicted in blue. Red vectors are outputs on which the model is trained. When the trained model is employed, the yellow solution vector is used to construct the flux for the kinetic solver.

We briefly describe the workflow of the neural network at training and execution time, which is illustrated in Fig. 1. For training a given input convex neural network architecture, we use a training data set of reduced normalized moments with corresponding Lagrange multipliers and entropy values, where we first scale the entropy values according to Eq. (69) and compute the mean and covariance of the moments for the shift and decorrelation layers. After a forward pass through the modified input convex neural network, we obtain the predicted entropy, and by automatic differentiation through the network with respect to its input we obtain the predicted reduced Lagrange multiplier, which we rescale using Eq. (70). Using Eq. (44) we reconstruct the zero-order component and therefore the complete Lagrange multiplier with Eq. (36). The normalized moments and the reduced normalized moments are then reconstructed using Eq. (45). The training loss function is evaluated as the mean squared error of the entropy, the Lagrange multiplier and the reconstructed moments,

(71)

A weighting parameter is used to scale the loss in the entropy to the same range as the losses in the Lagrange multiplier and the moments. Training the neural network on the Lagrange multiplier corresponds to fitting the neural network approximation to the entropy functional in Sobolev norm. The authors of [SobolevTraining] found that neural network models trained on the additional knowledge of derivative information achieve lower approximation errors and generalize better.
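A sketch of such a combined loss is given below, using TensorFlow as an illustrative framework: the Lagrange multiplier prediction is obtained by differentiating the convex network with respect to its input, the moment is reconstructed by quadrature under the Maxwell-Boltzmann assumption, and the mean squared errors are summed. For brevity, the reduced/normalized bookkeeping of the actual workflow is omitted, and all names and weights are placeholders.

```python
import numpy as np
import tensorflow as tf

v_q, w_quad = np.polynomial.legendre.leggauss(30)
m_q = tf.constant(np.stack([np.ones_like(v_q), v_q]), dtype=tf.float32)   # basis at quadrature points
w_quad = tf.constant(w_quad, dtype=tf.float32)

def sobolev_loss(model, u_batch, h_batch, alpha_batch, weight_h=1.0):
    u_batch = tf.convert_to_tensor(u_batch, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(u_batch)
        h_pred = model(u_batch)                      # predicted entropy
    alpha_pred = tape.gradient(h_pred, u_batch)      # predicted multipliers via automatic differentiation
    # reconstruct the moments u = <m exp(alpha . m)> (Maxwell-Boltzmann entropy assumed)
    f_q = tf.exp(tf.matmul(alpha_pred, m_q))         # (batch, n_quad)
    u_rec = tf.matmul(f_q * w_quad, tf.transpose(m_q))
    mse = tf.keras.losses.MeanSquaredError()
    return (weight_h * mse(h_batch, h_pred)
            + mse(alpha_batch, alpha_pred)
            + mse(u_batch, u_rec))
```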
When integrating the neural network into the kinetic solver, we gather the moments of all grid cells of the spatial domain from the current iteration of the finite volume scheme. The moments are first normalized in the sense of Eq. (34), then the predicted Lagrange multipliers are obtained in the same manner as in the training workflow. Afterwards, we use Eqs. (44) and (47) to obtain the Lagrange multipliers corresponding to the non-normalized moments. Finally, Eq. (17) yields the closure of the moment system, from which the numerical flux for the finite volume scheme can be computed.

3.2.2 Monotone neural network approximation of the Lagrange multiplier

No particular design choices are made for this neural network to enforce monotonicity, since the characterization of monotonic functions is not constructive; to the best of our knowledge, there exists no constructive definition of multidimensional monotonic functions. Instead, we construct an additional loss function that drives the network towards monotonicity during training. This is an important difference from the first approach, where the network is convex even in the untrained state and on unseen data points.

Definition 1 (Monotonicity Loss).

Consider a neural network that maps moments to Lagrange multipliers and let a training data set be given. The monotonicity loss is defined as

(72)

The ReLU function is defined as usual,

ReLU(x) = max(x, 0)    (73)

The monotonicity loss checks the monotonicity property pairwise for all data points of the training data set. If the dot product between the difference of the predicted Lagrange multipliers and the difference of the corresponding moments is negative, the property is violated and the value of the loss is increased accordingly. This is a linear penalty function and can be strengthened by composition with a monomial function. Note that we only validate the monotonicity of the network pointwise on a subset of the training data. As a consequence, the mathematical structure of the resulting moment closure is only preserved in an empirical sense, i.e. if the realizable set and, more importantly, the set of Lagrange multipliers is sampled densely. The resulting neural network architecture is illustrated in Fig. 2. Normalization as well as the mean shift and decorrelation layers in Eqs. (66) and (67) are implemented analogously to the input convex neural network. The core network architecture consists of a number of ResNet blocks. The ResNet architecture has been successfully employed in many neural network designs for a variety of applications and was first presented in [HeZRS15]. The ResNet blocks used in this work read as

(74a)
(74b)
(74c)
(74d)
(74e)
(74f)
(74g)

with the idea that the skip connection in Eq. (74g) mitigates the vanishing gradient problem of deep neural networks. Furthermore, we include a batch normalization (BN) layer in front of each activation, which reduces the internal covariate shift [IoffeS15] that many deep neural network architectures suffer from and that slows down training. Batch normalization applies the following two transformations pointwise to the previous layer's output,

(75)
(76)

where the scale and shift parameters are trainable weights and the mean and variance are computed over the current batch of training data.
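For illustration, a sketch of one such residual block with batch normalization in Keras is given below; the width, the choice of activation and the exact placement of the skip connection are assumptions, since Eqs. (74) are not reproduced above.

```python
import tensorflow as tf

def resnet_block(x, units):
    """Dense -> BatchNorm -> activation, twice, plus a skip connection from the block input;
    assumes the block input already has `units` channels so that the addition is well defined."""
    z = tf.keras.layers.Dense(units)(x)
    z = tf.keras.layers.BatchNormalization()(z)      # normalize, then learn scale and shift
    z = tf.keras.layers.Activation("relu")(z)        # activation choice is illustrative
    z = tf.keras.layers.Dense(units)(z)
    z = tf.keras.layers.BatchNormalization()(z)
    z = tf.keras.layers.Activation("relu")(z)
    return tf.keras.layers.Add()([x, z])             # skip connection
```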


One transforms the network output to the values of interest analogously to the input convex network design. The entropy functional is computed directly from the moments and the predicted Lagrange multipliers using Eq. (19). Training data rescaling and integration into the kinetic solver follow the ideas of the input convex network design. The batchwise monotonicity loss is calculated using the moments and the predicted Lagrange multipliers, i.e. the gradients of the convex entropy functional. The loss function for the network training becomes

(77)
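The monotonicity penalty of Definition 1 can be sketched as follows: for every pair of training points, the dot product of the difference in predicted Lagrange multipliers and the difference in moments is computed, and negative values are penalized through a ReLU before being added to the mean squared error terms of Eq. (77). Averaging instead of summing over pairs is an illustrative choice.

```python
import tensorflow as tf

def monotonicity_loss(u_batch, alpha_batch):
    """Pairwise penalty ReLU(-(alpha_i - alpha_j) . (u_i - u_j)), averaged over all pairs of the batch."""
    du = u_batch[:, None, :] - u_batch[None, :, :]           # pairwise moment differences
    dalpha = alpha_batch[:, None, :] - alpha_batch[None, :, :]
    dots = tf.reduce_sum(dalpha * du, axis=-1)               # pairwise dot products
    return tf.reduce_mean(tf.nn.relu(-dots))                 # only violated pairs contribute
```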
Figure 2: Monotonic neural network closure. Model input vectors are depicted in blue. Red vectors are outputs on which the model is trained. When the trained model is employed, the yellow solution vector is used to construct the flux for the kinetic solver.

4 Training Data and the generalization gap

In this section, we present methods to generate training data for the neural network entropy closures and construct a local bound on the generalization gap for the approximated gradient of an input convex neural network.

4.1 Data generation

In contrast to many applications of neural networks, the minimal entropy closure is a self-contained problem with a clear data-to-solution map. Furthermore, the set of potential inputs to the neural network is bounded and convex. This provides more options to sample training data than in common machine learning applications. The training data distribution heavily influences the trained model and therefore the generalization gap [math_understanding_weinan]. The generalization gap is defined as

(78)

where the parameters are those that minimize the training loss. The generalization gap describes the performance difference of the trained neural network between the training data set and real-world data, i.e. its performance on unseen data. Thus we are left with a modelling decision about the data generation.
In related work [porteous2021datadriven], the Lagrange multipliers are sampled from a uniform grid in a cube, and then Eqs. (45) and (44) are used to reconstruct the corresponding moments. In [SADR2020109644], the authors sample analogously to [porteous2021datadriven] before reconstructing the kinetic density using Eq. (17). However, they iteratively update the samples until the reconstructed kinetic density has zero mean and unit variance, and then compute the moments of this kinetic density.
A popular method for generating data for neural network models that interact with numerical differential equation solvers is to generate the training data by direct simulation, see e.g. [huang2021machine, XIAO2021110521, Lei_2020]. The idea of this approach is to concentrate the training data generation on regions that are likely to occur in practice. This is done by running simulation configurations similar to those expected in the application case. One expects that the model then performs better in a corresponding simulation test case than a model trained on general data. However, the model error might be much higher when encountering out-of-sample data, and the model is specifically tailored to a small range of application configurations. Another way to sample data using simulation is to use a Fourier series with random coefficients [huang2021machine2] to generate initial and boundary conditions.
In the following, we present two data sampling strategies that take advantage of the structure of the data-to-solution map. We investigate the generalization gap for the gradient prediction of input convex neural networks and derive a local error bound for the predicted Lagrange multipliers on unseen data. A further point of interest is the control over the distance to the boundary of the realizable set for a given sampling method.

4.2 The boundary of the normalized realizable set

The entropy minimization problem of Eq. (18) becomes increasingly difficult to solve near the boundary of the realizable set [AlldredgeHauckTits]. Close to the boundary, the condition number of the Hessian matrix of the entropy functional in Eq. (19) can become arbitrarily large, which causes numerical solvers to fail rather unforgivingly. This situation appears for moments of highly anisotropic distributions, vacuum states, or in the presence of strong sources [AlldredgeHauckTits]. At the boundary, the Hessian matrix is singular and the minimal entropy problem has no solution. In the space of Lagrange multipliers, this translates to multipliers growing beyond all bounds, which leads to numerical instabilities when reconstructing the kinetic density. The simplest case of the minimal entropy closure already exhibits these difficulties. Fig. 3a) shows the map from the normalized moment to its Lagrange multiplier and Fig. 3b) the minimal entropy functional. Since the Lagrange multiplier is the gradient of the entropy with respect to the moment, the minimal entropy functional becomes steeper as the moment approaches the boundary of the realizable set.

Figure 3: Data-to-solution maps for the minimal entropy closure: a) the Lagrange multiplier over the normalized moment, b) the minimal entropy functional over the normalized moment.

Note that both network architectures require the computation of Eqs. (44) and (45) and thus need to evaluate the exponential during training time, which makes them highly vulnerable to numerical overflow, especially in the first iterations of the training process. A further critical issue for the training of neural networks is that a wide range of output values causes the exploding gradient problem during the gradient-based weight updates. No matter whether we sample the Lagrange multipliers and then compute the moments or vice versa, a sampling strategy must incorporate a meaningful measure of the distance to the boundary of the realizable set.
Let us first consider proximity to the boundary directly in moment space. There exist extensive studies on the characterization of the boundary, and we use results by Kershaw [Kershaw1976FluxLN] and Monreal [Monreal_210538]. For the Maxwell-Boltzmann entropy and a monomial basis, the normalized realizable set can be described in one spatial dimension, up to moderate order, using the inequalities

(79a)
(79b)
(79c)
(79d)

whereas moments of higher order can be characterized using the more general results in [Curto_recursiveness]. Equation (79) gives direct control over the boundary, since equality in one or more of the inequalities characterizes the boundary of the normalized realizable set. In this case, the distance measure to the boundary is the norm distance. An example of normalized moments with norm boundary distance is shown in Fig. 4a), and the corresponding Lagrange multipliers are shown in Fig. 4b). Note that in Fig. 4a), c) and e), the boundary of the realizable set is displayed by the dotted black line. More general results for arbitrarily high-order moments in one spatial dimension can be found in [Kershaw1976FluxLN]. In three spatial dimensions, necessary and sufficient conditions have been constructed in [Monreal_210538] for moments of low order, but a full characterization of the realizable set remains an open problem [lasserre].
From a numerical point of view, it is interesting to construct a notion of distance to the boundary directly in the space of Lagrange multipliers, since the magnitude of the multipliers has implications for the numerical stability of the neural network training process. A first idea consists of a norm bound on the Lagrange multipliers [AlldredgeFrankHauck, porteous2021datadriven, SADR2020109644], which yields a convex subset of Lagrange multipliers. Fig. 4d) shows a uniform distribution of norm-bounded Lagrange multipliers, and Fig. 4c) displays the corresponding reconstructed moments.

Figure 4: Scatter plots of data points generated from different sampling strategies: a) moments and b) Lagrange multipliers obtained from uniform grid sampling of the moments; c) moments and d) Lagrange multipliers obtained from uniform grid sampling of the Lagrange multipliers with a norm bound; e) moments and f) Lagrange multipliers obtained from uniform low-discrepancy sampling of the Lagrange multipliers with an eigenvalue bound. The color bar indicates the value of the minimal entropy functional.

However, this approach gives no control over the boundary distance and produces a very biased data distribution of the moments. A comparison of Fig. 4c) and d) shows that two thirds of the sampled moments are concentrated in two regions, which correspond to high entropy values and are colored yellow. In contrast, Fig. 4a) and b) show that there are no samples in certain regions, since the corresponding moments are too close to the boundary. In conclusion, the second sampling strategy does not produce data with uniform distance to the boundary of the realizable set.
Another approach is to use the condition number of the Hessian of the dual objective with respect to the Lagrange multipliers directly. Since the Hessian of Eq. (18) with respect to the Lagrange multipliers,

(80)

is symmetric and positive definite, the condition number is the ratio of the largest and the smallest eigenvalue. The Hessian becomes singular at the boundary of the realizable set, so the smallest possible eigenvalue is zero, and we use the smallest eigenvalue to measure the distance to the boundary of the realizable set. Figure 4e) and f) show a uniform sampling where the Lagrange multipliers are sampled subject to a lower bound on the smallest eigenvalue. Note that, on the one hand, the near-boundary region of the realizable set is sampled more densely than the interior, compare Fig. 4a) and e), whereas there is no over-representation of particular regions, and the set of sampled Lagrange multipliers, see Fig. 4f), is similar in shape to the Lagrange multipliers in Fig. 4b).
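The eigenvalue-based sampling can be sketched as follows for a first-order closure in one spatial dimension with a monomial basis and the Maxwell-Boltzmann entropy; plain uniform draws replace the low-discrepancy sequence, and the bound, quadrature order and all names are illustrative.

```python
import numpy as np

v, w = np.polynomial.legendre.leggauss(30)
m = np.stack([np.ones_like(v), v])                   # monomial basis (1, v)

def sample_with_eigenvalue_bound(n_samples, alpha_max=40.0, tau=1e-4, seed=None):
    """Draw reduced multipliers, recover the zero-order entry from the normalization condition,
    and keep a sample only if the smallest Hessian eigenvalue stays above tau."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        a1 = rng.uniform(-alpha_max, alpha_max)      # reduced multiplier (plain uniform draw here)
        a0 = -np.log(w @ np.exp(a1 * v))             # normalization: <exp(alpha . m)> = 1
        alpha = np.array([a0, a1])
        f = np.exp(alpha @ m)
        H = (m * (w * f)) @ m.T                      # Hessian <m m^T exp(alpha . m)>
        if np.linalg.eigvalsh(H).min() >= tau:
            u = m @ (w * f)                          # normalized moment
            h = w @ (f * np.log(f) - f)              # Maxwell-Boltzmann entropy of the sample
            samples.append((u, alpha, h))
    return samples
```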

4.3 Generalization gap for input convex neural networks trained in Sobolev norm

In this section, we present our findings on the generalization gap for the derivative approximation of a convex neural network that approximates a convex function. The network is trained (at least) in Sobolev norm, i.e. the training loss reads

(81)

when evaluating the loss over the whole data set. In the following, we assume that the network is trained, i.e. the training loss is approximately zero. Thus we have

(82)

Furthermore, let the sampling domain be convex and bounded, and let the neural network be convex by design. We are interested in the generalization gap of the network derivative with respect to its input variable. To this end, we consider the local generalization gap of the neural network for a set of training data points together with an additional point, which we call the point of interest, and their convex hull. Without loss of generality, we assume the point of interest to be the origin; if this does not hold, one can consider a correspondingly shifted setting instead. Using the characterization of a monotonic function, we define the set

(83)
Figure 5: Illustration of the convex hull of the training points (left) and the set of feasible gradients (right). The normal vectors to the faces are given by the vectors of the training points.

which is the dual polygon defined by the gradients at the sampling points and the point of interest; it can be seen in Fig. 5. It contains all values which the gradient of a convex function with fixed gradients at the sampling points can attain at the point of interest.

Theorem 1.

Let the approximated function be convex and let the point of interest lie in the interior of the convex hull of the training points. Then the set of feasible gradients is a bounded polyhedron whose faces are defined by the training points, with corresponding vertices.

Proof.

The proof is structured in two parts. First, we show that the vertices are well defined if the point of interest is an element of the interior of the convex hull of the training points. Second, we show that all vertices belong to the set of feasible gradients. Thus any convex combination of the vertices is in the set, and therefore the set is a (bounded) polyhedron with these vertices.