1 Introduction
How heat propagates is a classical topic. Mathematically, a parabolic-type heat equation has long been regarded as the model that describes heat conductance. It was recently discovered that the underlying ab initio model should be the phonon transport Boltzmann equation [hua2017experimental]. One can formally derive that this phonon transport equation degenerates to the heat equation in the macroscale regime under the assumption that the Fourier law holds true, namely that the rate of heat conductance through a material is proportional to the negative gradient of the temperature.
Rigorously deriving the heat equation from kinetic equations such as the linear Boltzmann equation or the radiative transfer equation is a standard process in the so-called parabolic regime [bardos1984diffusion]. That means, for light propagation (using the radiative transfer equation), we can link the diffusion effect with the light scattering. However, no such derivation exists for the phonon system. One difficulty we encounter comes from the fact that phonons, unlike photons, have all frequencies coupled. In particular, the collision operator for the phonon transport equation contributes an equilibrium term that satisfies a Bose-Einstein distribution in the frequency domain: it allows energy contained in one frequency to transfer to another. Furthermore, the speed in the phonon transport term involves not only the velocity , but also the group velocity, which in turn depends positively on the frequency . These differences make the derivation of the diffusion limit (or the parabolic derivation) for the phonon transport equation less straightforward.

On the formal level, passing from the ab initio Boltzmann model to the heat equation requires the validity of the Fourier law, an assumption that breaks down when kinetic phenomena dominate. This is especially true at interfaces where different solid materials meet. Material discontinuities lead to thermal phonon reflections. On the macroscopic level, this is observed as thermal boundary resistance (TBR), and is reflected by a temperature drop at the interface [cahill2003nanoscale]. TBR exists at the interface between two dissimilar materials due to differences in phonon states on each side of the interface. Defects and roughness further enhance phonon reflections and the TBR effect. Such effects can hardly be explained, or measured directly, at the level of the heat equation.
As scientists came to the realization that the underlying ab initio model is the Boltzmann equation rather than the heat equation, more and more experimental studies have been conducted to reveal the model parameters and properties of the Boltzmann equation. In recent years, a large body of experimental work has been devoted to understanding the heat conductance inside a material or at the interface of two solids, with the hope that the collected data can help in designing materials with better heat conductance or certain desired heat properties [lyeo2006thermal, norris2009examining, wang2011thermal].
This is a classical inverse problem: measurements are taken to infer the material properties. In [hua2017experimental] the authors essentially used techniques similar to least squares with a sparsity-promoting penalty term. As the first investigation into this problem, the approach gives a rough estimate of the parameter, but mathematically it is very difficult to evaluate the accuracy of the recovery. In this article, we study this problem from a more rigorous mathematical viewpoint. We also confine ourselves to the optimization framework. Given a certain amount of data (a certain number of experiments, and a certain number of measurements in each experiment), we formulate the problem as a least-squares problem that minimizes the mismatch. We study whether this minimization problem is well-posed, and how to find its solution numerically in an efficient way. More specifically, we first formulate the optimization problem, apply the stochastic gradient descent (SGD) method, a state-of-the-art optimization technique, and then derive the problem's Fréchet derivative. This allows us to investigate the associated convergence property. We demonstrate that the system satisfies a maximum principle, which allows us to justify that the objective function is convex in a small region around the optimal point. This in turn means the SGD approach converges. Finally, we apply the derived method to two examples in the numerical section.
We should note that we take a practical route in this paper and focus on the numerical properties of the problem. Investigating the well-posedness of the inverse problem, such as proving the uniqueness and the stability of the reconstruction, requires much more delicate PDE analysis and is left for future investigation.
We should also note that inverse problems for kinetic equations, though relatively rare in the literature, are not completely new. The research has mostly focused on the reconstruction of optical parameters in the radiative transfer equation. In [CS96], the authors pioneered the problem assuming the entire albedo operator is known. The stability was later proved in [Romanov96, Wang, bal2010stability, LaiLiUhlmann] in different scenarios. In [bal2008inverse, time_harmonic, BalMonard_time_harmonic] the authors revised the measurement-taking mechanism and assumed only the intensity is known. They furthermore studied the time-harmonic perturbation in the hope of stabilizing the inverse problem. The connection of optical tomography to the Calderón problem was drawn in [LaiLiUhlmann], where the stability and its dependence on the Knudsen number are investigated. See the reviews in [Ren_review, arridge1999optical]. There are many other imaging techniques that involve the coupling of the radiative transfer equation with other equations, leading to bioluminescence tomography, quantitative photoacoustic tomography, and acousto-optic imaging [ChungSchotland, Ren_fPAT, BalChungSchotland, BalSchotland2010, BalTamasan], among many others. Outside the realm of the radiative transfer equation, the inverse problem has also been explored for the Vlasov-Poisson-Boltzmann system that characterizes the dynamics of electrons in semiconductors [Gamba]. The authors of that paper investigated how to use Neumann data and the albedo operator at the surface of semiconductors to infer the deterioration in the interior of the semiconductor. Recently, in [CarlemanEstimate] the authors used a Carleman estimate to infer the parameters in general transport-type equations in the kinetic framework.
There is also abundant research on numerical inverse problems. For kinetic equations, multiple numerical strategies have been investigated, including both gradient-based and Hessian-based methods [arridge1999optical, Arridge_couple, dehghani1999photon]. For general PDE-based inverse problems, SGD has been a very popular technique for finding the minimizer. This is because most PDE-based inverse problems are eventually formulated as minimization problems whose loss function is a sum of many smaller loss functions. This is exactly the format in which SGD outperforms other optimization methods
[chen2018online]. The paper is organized as follows. In Section 2, we present the model equation and its linearization (around the room temperature). The formulation of the inverse problem is also described in this section. To solve this optimization problem numerically, we choose the stochastic gradient descent method, which requires the computation of the Fréchet derivative in each iteration. In Section 3 we discuss the self-adjoint property of the collision operator and derive the Fréchet derivative. This allows us to summarize the application of SGD at the end of that section. We discuss properties of the optimization formulation and the use of SGD in our setting in Section 4. In particular, we discuss the maximum principle in Section 4.1, while Section 4.2 is dedicated to the properties of SGD applied to this particular problem; there we show the Lipschitz continuity of the Fréchet derivative. We conduct two sets of numerical experiments: first, assuming that the reflection coefficient is parametrized by a finite set of variables, and second, assuming the reflection coefficient is a simple smooth function without any prior parametrization. SGD gives good convergence in both cases, and the results are shown in Section 5.
2 Model and the inverse problem setup
In this section we present the phonon transport equation and its linearization. We also set up the inverse problem as a PDEconstrained minimization problem.
2.1 Phonon transport equation and linearization
The phonon transport equation can be categorized as a BGK-type kinetic equation. Denote the density of phonons (viewed as particles in this paper) at time on the phase space at location , transport speed and frequency . Since has a constant amplitude, we normalize it and set it to be . In labs the experiments are set up to be plane-symmetric, and the problem becomes pseudo-1D, meaning , and . The equation then reads:
(1) 
The two terms on the left describe the phonon transport with velocity and group velocity . The term on the right is called the collision term, with representing the relaxation time. It can be viewed as the analogue of the BGK (Bhatnagar-Gross-Krook) operator in the classical Boltzmann equation for rarefied gas. is the Bose-Einstein distribution, also termed the Maxwellian, and is defined by:
In the formula, is the Boltzmann constant, is the reduced Planck constant, and is the phonon density of states. The profile is a constant in and approximately exponential in (for big enough), and the rate of the exponential decay is uniquely determined by temperature .
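For concreteness, the Bose-Einstein occupation number underlying this equilibrium can be evaluated as follows. This is a minimal sketch using the standard SI values of the physical constants; the density-of-states factor is deliberately omitted:

```python
import math

HBAR = 1.054_571_817e-34  # reduced Planck constant [J s]
KB = 1.380_649e-23        # Boltzmann constant [J/K]

def bose_einstein(omega, T):
    """Equilibrium phonon occupation number at angular frequency omega
    [rad/s] and temperature T [K]: 1 / (exp(hbar*omega/(kB*T)) - 1)."""
    x = HBAR * omega / (KB * T)
    return 1.0 / math.expm1(x)

# For hbar*omega/(kB*T) large, the occupation decays approximately
# exponentially in omega, with a rate set by the temperature T.
```

The sketch shows the two qualitative features used in the text: the profile decays (roughly exponentially) in the frequency, and the decay rate is uniquely determined by the temperature.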
The system preserves temperature, meaning the right hand side vanishes when the zeroth moment is being taken. This uniquely determines the temperature of the system, namely:
In experiments, the temperature is typically kept around the room temperature and the variation is rather small [hua2017experimental]. This allows us to linearize the system around the room temperature. Denote the room temperature to be , and the associated Maxwellian is denoted
We linearize equation (1) by subtracting and adding back, and call . Then, by a straightforward calculation, since has no dependence on and :
(2) 
For and evaluated at very similar and , we approximate
where higher-order terms in are eliminated. Noting that
(3) 
we rewrite equation (2) to:
(4) 
Similar to the nonlinear case, conservation law requires the zeroth moment of the right hand side to be zero, which amounts to
where we use the bracket notation .
For simplicity, we nondimensionalize the system, and set , and . We also make the approximation that and , as suggested in [PRB] and [hua2017experimental]. We now finally arrive at
(5) 
with
(6) 
Clearly, the Maxwellian is normalized: . We also call the right-hand side the linearized BGK operator
(7) 
We note that the Maxwellian here is not a traditional one. It has a relatively heavy tail compared to , which is typically used as an approximation. The polynomial weight coming from the linearization process fattens it further. In Figure 2 we depict the Maxwellian and its comparison to some approximations:
In practice, to handle the situation in Figure 1, phonons carry the heat transfer, and the model equation is used in both the aluminum and silicon regions; i.e., we write and for the phonon densities in the two regions respectively. Considering that the boundary condition for the aluminum component is of incoming type and the boundary condition at the interface is of reflective type, we have the following model (denote the location of the surface of the aluminum by and the interface by ):
(8)  
In the model, and are the phonon density distributions within aluminum and silicon respectively. At , a laser beam injects heat into the aluminum, and this serves as the incoming boundary condition at , which we denote by . Due to the large size of the silicon, its equation can be viewed as supported on the half-domain. At , the interface, we model the reflection with a coefficient , meaning that a portion of the phonon density is reflected back with the flipped sign . According to [hua2017experimental], this reflection coefficient depends only on , the frequency. It is the coefficient that determines how much heat propagates into the silicon, and it is the coefficient we need to reconstruct from lab measurements.
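To make this concrete, the following is a minimal numerical sketch of a transport equation of this type: a two-direction first-order upwind discretization with a linear BGK relaxation, incoming data on the left boundary, and a reflection coefficient on the right. The grid parameters and the scalar (frequency-independent) reflection coefficient `alpha` are illustrative simplifications, not the full frequency-dependent model (8):

```python
import numpy as np

def solve_transport(phi_in, alpha, nx=100, nt=400, L=1.0, T=1.0, v=1.0, tau=0.1):
    """First-order upwind sketch of a two-direction (mu = +1/-1) transport
    equation on [0, L] with linear BGK relaxation, an incoming boundary
    condition phi_in(t) at x = 0 for the right-moving population, and a
    reflection coefficient alpha at x = L (a fraction alpha of the
    right-moving density is reflected into the left-moving one).
    All parameters here are illustrative, not taken from the paper."""
    dx, dt = L / nx, T / nt
    assert v * dt / dx <= 1.0, "CFL condition"
    up = np.zeros(nx)   # right-moving density, mu = +1
    um = np.zeros(nx)   # left-moving density,  mu = -1
    surface_temperature = []
    for n in range(nt):
        t = n * dt
        # upwind transport step for both directions
        up_new, um_new = up.copy(), um.copy()
        up_new[1:] -= v * dt / dx * (up[1:] - up[:-1])
        up_new[0] = phi_in(t)            # incoming data at x = 0
        um_new[:-1] -= v * dt / dx * (um[:-1] - um[1:])
        um_new[-1] = alpha * up[-1]      # reflection at the interface x = L
        # BGK relaxation toward the local mean (conserves the zeroth moment)
        mean = 0.5 * (up_new + um_new)
        up = up_new + dt / tau * (mean - up_new)
        um = um_new + dt / tau * (mean - um_new)
        surface_temperature.append(0.5 * (up[0] + um[0]))
    return np.array(surface_temperature)
```

Because every update is a convex combination of data bounded by the incoming boundary values, the discrete solution inherits the maximum principle discussed in Section 4.1.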
2.2 Formulating the inverse problem
The experiments are non-intrusive, in the sense that the materials are untouched and the data is collected only on the surface . The quantity we can tune is the incoming boundary condition in (8), and the data we can collect is the temperature at the surface, namely as a function of and . In labs we can send many different profiles of into the system and measure for each of them.
Accordingly, the forward map is:
(9) 
It maps the incoming data configuration to the temperature at the surface as a function of time. The subscript represents the dependence of the map. In the forward setting, is assumed to be known; equation (8) can then be solved for any given to produce the outcome . In reality, is unknown, and we probe the system with a number of different configurations of and measure the corresponding to infer .
While the to-be-reconstructed is a function of , chosen from , an infinite-dimensional function space, there are also infinitely many configurations of . At this moment, whether tuning gives a unique reconstruction of is an open well-posedness question that we plan to investigate in the near future. In the current paper we focus mainly on the numerical and practical setting: suppose a finite number of experiments are conducted with finitely many configurations of imposed, and in each experiment the measurement is taken at discrete times; how do we reconstruct ?
We denote by the number of configurations of the boundary condition in different experiments, namely
is injected into (8) in different rounds of experiments. We also take measurements on the surface through test functions. Denoting by the test functions, the data points are then:
When we choose the temperature is simply taken at discrete time .
We denote by the data collected in the th experiment, measured with the th test function, with additive noise; then, for indices
where is the solution to (8) equipped with boundary condition , assuming the reflection coefficient is . The noise term is inevitable in experiments.
A standard approach to reconstruct is to formulate a PDEconstrained minimization problem. We look for a function that minimizes the mismatch between the produced and the measured data:
(10)  
s.t. 
Here is merely a constant and does not affect the minimizing configuration, but including it puts the formulation into the framework that SGD handles.
In a concise form, define the loss function
(11) 
then the PDEconstrained minimization problem can also be written as:
(12) 
We denote the minimizer to be .
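Schematically, the loss (11) is a finite sum of small per-measurement losses, which is exactly the structure SGD exploits. A sketch of its assembly, with a hypothetical `forward_map` callable standing in for the PDE solve and simple nested lists as data containers:

```python
def loss_components(forward_map, eta, boundary_conditions, data):
    """Assemble the least-squares mismatch (10)/(11) as a list of small
    losses, one per (experiment, measurement) pair.  `forward_map(eta, phi)`
    returns the simulated measurement vector for boundary condition phi
    under reflection coefficient eta; `data[i]` holds the corresponding
    noisy measurements.  All names here are illustrative placeholders."""
    losses = []
    for phi, y in zip(boundary_conditions, data):
        pred = forward_map(eta, phi)           # one forward PDE solve
        for k in range(len(y)):
            losses.append(0.5 * (pred[k] - y[k]) ** 2)
    return losses  # the total objective is the (normalized) sum of these
```

Keeping the per-measurement losses separate, rather than summing immediately, is what allows SGD to sample one term at a time in Section 3.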
Remark 1
If some prior information about is known, it can be built into the minimization formulation as a relaxation term. For example, if it is known ahead of time that should be close to in some sense, then the minimization is modified to
where the norm is chosen according to properties from physics and is the relaxation coefficient. We assume in this paper that we do not have such prior information.
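As a small illustration of such a relaxed objective, with the squared L2 norm as one possible (illustrative) choice of penalty:

```python
import numpy as np

def regularized_loss(mismatch, eta, eta_prior, beta):
    """Objective with the relaxation term of Remark 1: the data mismatch
    plus beta * ||eta - eta_prior||_2^2.  The L2 norm is one illustrative
    choice; the text leaves the norm to be dictated by the physics."""
    eta = np.asarray(eta, dtype=float)
    eta_prior = np.asarray(eta_prior, dtype=float)
    return mismatch + beta * np.sum((eta - eta_prior) ** 2)
```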
3 Stochastic gradient descent
There are many approaches to solving PDE-constrained minimization problems such as (10), or equivalently (12). Most approaches are either gradient-based or Hessian-based. While gradient-based methods usually converge more slowly than methods that incorporate Hessians, the cost of evaluating Hessians is very high. For PDE-constrained minimization problems, every data point provides a different Hessian term, and there are many data points. Furthermore, if the to-be-reconstructed parameter is a function, the Hessian is infinite-dimensional, which makes it a very large matrix upon discretization. This cost is beyond what a typical computer can afford. Thus, we choose gradient-based methods.
Among all gradient-based optimization methods, stochastic gradient descent (SGD) has gained ground in the past decade. It is a method that originated in the 90s (with a history tracing back to [robbins1951stochastic]), and it gradually became popular in the new data-science era. As a typical example of probabilistic-type algorithms, it sacrifices a certain level of accuracy in exchange for efficiency. Unlike the standard gradient descent method, SGD does require a certain form of the objective function: is an average of many smaller objective functions, meaning with a fairly large . If a classical gradient descent method is used, then per iteration, to update from , one needs to compute the gradient of at this specific , and that amounts to computing gradients: . For a large , the cost is very high. For PDE-constrained minimization in particular, each of these is a Fréchet derivative, usually obtained through two PDE solves: one for the forward problem and one for the adjoint. gradients thus mean PDE solves at each iteration. This cost is prohibitively high.
The idea of SGD is that in each iteration, only one is randomly chosen as a fair representative of , and this single gradient is viewed as a surrogate of the entire . So per iteration, instead of computing gradients, one only computes . This significantly reduces the cost of each iteration. However, replacing by a random representative introduces extra error, and as a consequence a larger number of iterations is required. This leads to a delicate error analysis. In the past few years, the question of when SGD outperforms traditional GD has been a popular topic, and many variations and extensions of SGD have been proposed [needell2014stochastic, le2012stochastic, zhao2015stochastic]. Despite the rich literature on SGD as a standalone algorithm, its performance in the PDE-constrained minimization setting is largely unexplored. We refer the readers to the review papers in general settings [bottou2010large, bottou2018optimization].
We apply the SGD method to our problem. That is to say, at each iteration, we randomly choose one data pair and use it to adjust the evaluation of . More specifically, at iteration , instead of using all loss functions, one selects a multi-index at random and performs gradient descent determined by this particular data pair:
(13) 
Here is the step size, to be adjusted according to the user's preference.
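A generic sketch of the iteration (13), decoupled from the PDE solves: `grad_single` and the step-size schedule `step` are placeholders, to be supplied by the forward/adjoint machinery of this section.

```python
import random

def sgd(grad_single, eta0, n_terms, n_iters, step):
    """Plain SGD iteration: at each step draw one data index at random and
    descend along that single gradient, as in (13).  `grad_single(eta, i)`
    returns the gradient of the i-th small loss at eta; `step(k)` is the
    step-size schedule.  Illustrative sketch, not tied to the PDE solver."""
    eta = list(eta0)
    for k in range(n_iters):
        i = random.randrange(n_terms)            # fair random representative
        g = grad_single(eta, i)
        eta = [e - step(k) * gi for e, gi in zip(eta, g)]
    return eta
```

With `n_terms = 1` the loop reduces to classical gradient descent; the cost saving appears when `n_terms` is large and each `grad_single` call hides two PDE solves.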
In formula (13), the evaluation of is straightforward: according to equation (11), it amounts to setting in (8) to , computing the solution with the incoming data as the boundary condition, and testing the solution at with . How to compute , however, is less clear. Considering that is a given data point and has no dependence on , this is the Fréchet derivative of the map with respect to at this particular :
(14)  
The computation of this derivative requires the computation of the adjoint equation. Before stating the result, we first note that the collision operator is self-adjoint with respect to the weight .
Lemma 1
Proof
The proof amounts to a direct calculation. Expanding the left-hand side, we have
which concludes the lemma.
Now we make the Fréchet derivative explicit in the following theorem.
Theorem 1
For a fixed , the Fréchet derivative at can be computed as:
(15) 
where is the solution to (8) with reflection coefficient and boundary condition , and is the solution to the adjoint equation with as the boundary condition:
(16) 
Proof
This is a standard calculus-of-variations argument. Let be perturbed by . Denote the corresponding solution to (8) with incoming data by , and let . Then, since and solve (8) with the same incoming data but different reflection coefficients, one derives the equation for :
(17) 
where we have ignored the higher-order term . In this equation, serves as the source term at the boundary.
Now we multiply equation (17) by and equation (16) by and add them up. Since the operator is self-adjoint with respect to , the right-hand side cancels out and we obtain:
Integrating this equation further in time and space, and noticing that has trivial initial data and has trivial final-state data, the term drops out and we have:
At , plugging in , we have
where we used the definition of .
At , we note that the terms with all cancel out by the reflection condition, and the integral reduces to
Letting , the two formulas lead to:
Remark 2
We note that the derivation is completely formal. Indeed, the regularity of the solution is not yet known when the incoming data for has a singularity at , making it not even . Numerically one can smooth out the boundary condition by adding a mollifier, that is, by convolving the boundary condition with a narrow Gaussian. In the discrete setting, this Dirac delta is replaced by a Kronecker delta.
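A minimal sketch of such a mollification on a discrete time grid; the Gaussian width is a user-chosen parameter, and the normalization keeps the discrete integral equal to one so that the mollified profile is a consistent stand-in for the delta:

```python
import numpy as np

def mollify_delta(t_grid, t0, width):
    """Replace a Dirac delta at time t0 in the boundary data by a narrow
    normalized Gaussian sampled on a uniform time grid.  `width` is the
    mollification parameter (an illustrative choice)."""
    g = np.exp(-0.5 * ((t_grid - t0) / width) ** 2)
    dt = t_grid[1] - t_grid[0]
    return g / (g.sum() * dt)   # normalize: discrete integral equals 1
```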
We now summarize the SGD algorithm in Algorithm 1.
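The forward-solve/adjoint-solve structure of the gradient in Algorithm 1 can be illustrated on a tiny discrete linear model. The sketch below computes the adjoint-state gradient of a scalar least-squares loss and compares it against a finite-difference check; the matrix system is purely illustrative and is not the phonon equations themselves:

```python
import numpy as np

def adjoint_gradient_demo():
    """Discrete analogue of the forward/adjoint gradient: the state solves
    A u = b + eta * e, the loss is 0.5 * (c.u - d)^2, and the gradient is
    obtained from one forward solve and one adjoint solve (A^T lam = c),
    exactly mirroring the two-PDE-solve cost noted in the text."""
    rng = np.random.default_rng(0)
    n = 5
    A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    e = rng.standard_normal(n)   # direction in which eta enters the source
    c = rng.standard_normal(n)   # measurement functional
    d, eta = 0.3, 0.7

    def loss(eta):
        u = np.linalg.solve(A, b + eta * e)     # forward solve
        return 0.5 * (c @ u - d) ** 2

    u = np.linalg.solve(A, b + eta * e)         # forward solve
    lam = np.linalg.solve(A.T, c)               # adjoint solve
    grad = (c @ u - d) * (lam @ e)              # adjoint-state gradient

    h = 1e-6                                    # finite-difference check
    fd = (loss(eta + h) - loss(eta - h)) / (2 * h)
    return grad, fd
```

Agreement of the two numbers is the standard sanity check before trusting an adjoint gradient inside the SGD loop.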
4 Properties of the equation and the minimization process
In this section we study some properties of the equation and the convergence of the associated optimization problem solved with SGD. In particular, since the equation is a typical kinetic model with a slightly modified transport term and a linear BGK-type collision operator, one would expect that some good properties that hold true for general kinetic equations remain valid in this specific situation. Below we first study the maximum principle, which gives us a natural bound on the solution. This result will be used repeatedly in showing the convergence of SGD.
4.1 Maximum Principle
The maximum principle states that the solution in the interior is pointwise controlled by the size of the boundary and initial data. It is a classical result for diffusion-type equations. In the kinetic framework, it holds true for the radiative transfer equation, which has a linear BGK collision operator, but it is not true for the general BGK-type Boltzmann equation: if the initial/boundary data is specifically chosen, the Maxwellian term can drag the solution to a peak with a value beyond the norm of the given data. In these cases, an extra constant is expected.
We assume the initial condition to be zero and rewrite the equation as:
(18) 
where is a function bounded between and . We arrive at the following theorem.
Theorem 2
The equation (18) satisfies the maximum principle, namely, there exists a constant independent of so that
Proof
The proof of the theorem comes from a direct calculation. We first decompose equation (18). Let , with the ballistic part carrying the boundary condition and the remainder; one can then write:
(19) 
and
(20) 
Note that in this decomposition, and satisfy the same reflective boundary condition at , but serves as a source term in the collision term of . Both equations can be explicitly calculated. Define the characteristics:
then for every fixed and , along the trajectory we have:
(21) 
which further gives
(22) 
and
(23) 
For , the characteristic propagates to the right. Depending on the values of and , the trajectory, when traced back, hits either the initial data or the boundary data. Let , and consider ; we have:

If the trajectory hits the boundary data, for we have , and
(24) 
If the trajectory hits the initial data, then and , with the solution being
(25)
Similarly, for , if the trajectory hits the boundary data, then for we have . Define :
(26) 
We multiply (24) by and use to obtain:
(27) 
A similar inequality can be derived for (25) as well. Since the bound in (25) has no dependence on , we ignore its contribution. Taking moments on both sides of (27), we have:
(28) 
where we used the notation . Similarly, multiplying (26) by and taking moments gives
(29) 
Here . Combining (28) and (29), we conclude
(30) 
with , .
4.2 Convergence for noisy data
The convergence of SGD has been a popular topic of study in recent years, especially in the machine learning community [bottou2018optimization, li2019convergence, shamir2013stochastic]. It has been proved that with properly chosen step sizes, convergence is guaranteed for convex problems. For nonconvex problems, the situation is significantly harder, and one only seeks points where , where is the objective function. However, it is hard to distinguish local and global minimizers, and a point where could also be a saddle point, or even a maximizer. See the recent reviews in [bottou2018optimization].

The situation is slightly different for PDE-constrained minimization problems. In this setup, the objective function enjoys a special structure: every is the square of the mismatch between one forward map and the corresponding data point, namely . The forward map itself is typically unlikely to be convex, but the outer-layer function is quadratic and helps shape the Hessian structure. In particular, since:
where stands for the Hessian term. Then, due to the term, has a better chance of being positive definite compared to , especially when the data is almost clean.
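This can be seen in a small numerical illustration for scalar data: the Hessian of 0.5 (F - y)^2 is the positive semidefinite Gauss-Newton term (the outer product of the gradient of F with itself) plus the residual-weighted Hessian of F, so a small residual perturbs the spectrum of the semidefinite part only slightly. All matrices below are illustrative:

```python
import numpy as np

def residual_hessian(gradF, hessF, residual):
    """Hessian of 0.5 * (F(eta) - y)^2 for scalar data:
    grad F (grad F)^T + residual * Hess F.  The first (Gauss-Newton) term
    is always positive semidefinite; for small residuals it dominates."""
    gradF = np.asarray(gradF, dtype=float)
    return np.outer(gradF, gradF) + residual * np.asarray(hessF, dtype=float)

g = np.array([1.0, 2.0])
H_F = np.array([[0.0, 1.0], [1.0, 0.0]])   # an indefinite Hess F
small = residual_hessian(g, H_F, 0.01)     # near-clean data: tiny residual
large = residual_hessian(g, H_F, 10.0)     # large residual: Hess F dominates
# By Weyl's inequality, the smallest eigenvalue of `small` is at least
# -0.01 * ||H_F||, while `large` is strongly indefinite.
```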
Such structure changes the convergence result. Indeed, in [jin2020convergence] the authors investigated the performance of SGD applied to problems with PDE constraints. We cite the theorem below (with notation adjusted to fit our setting).
Theorem 3
Let with be a forward map that maps to , with and being two Hilbert spaces whose norms are denoted by . Suppose is nonempty. Denote by the real data perturbed by a noise that is controlled by . The loss function is defined as:
(31) 
Suppose for every fixed :

The operator is continuous, with a continuous and uniformly bounded Fréchet derivative on .

There exists an such that for any ,
(32)
If the Fréchet derivative is uniformly bounded in , and the existence of holds uniformly for all , then, with properly chosen step sizes, SGD converges.
We note that in the original theorem, the assumptions are made on viewed as a vector. We state the assumptions componentwise, but we do require the boundedness and the existence of to hold uniformly in . Note also that the theorem views as a vector, hence the notation and . When viewed as a function of , the term reads .

To show that the method works in our setting, we essentially need to justify the two assumptions on a properly chosen domain. This is exactly what we will show in Proposition 1 and Proposition 2. In particular, we will show, for each , that the Fréchet derivative is Lipschitz continuous with a Lipschitz constant depending on and . By choosing properly bounded, the Lipschitz constants are bounded; we thus have the boundedness of , uniformly in and . The second assumption is much more delicate: it essentially requires the gradient to represent a good approximation of the local change, and the second derivative to be relatively weak, at least in a small region. This is shown in Proposition 2, where we specify the region of such control.
Now we prove the two propositions.
Proposition 1
For a fixed , is Lipschitz continuous, meaning there is an depending on and such that
Proof
We omit the subscript in throughout the proof. Recalling the definition of , we have:
(33)  
where and solve (16), the adjoint equation, equipped with the same boundary and initial data (). The reflection coefficients are and , respectively. solve (8), the forward problem, with as the reflection coefficient (also equipped with the same boundary and initial data ).
Calling the differences and , we rewrite the formula as:
(34)  
Here