The use of machine learning algorithms has gained a lot of interest in the past decade [14, 15, 31]. Besides the data science problems like clustering, regression, image recognition or pattern formation there are novel applications in the field of engineering as e.g. for production processes [21, 29, 32]. In this study we focus on deep residual neural networks (ResNet) which date back to the 1970s and have been heavily influenced by the pioneering work of Werbos . ResNet have been successfully applied to a variety of applications such as image recognition , robotics  or classification . More recently, also applications to mathematical problems in numerical analysis [25, 26, 34] and optimal control  have been studied.
A ResNet can be shortly summarized as follows: Given inputs, which are usually measurements, the ResNet propagates those to a final state. This final state is usually called output and aims to fit a given target. In order to solve this optimization procedure, parameters of the ResNet needs to be optimized and this step is called training. The parameters are distinguished as weights and biases. For the training of ResNets backpropagation algorithms are frequently used[35, 36].
The purpose of this work is to use kinetic theory in order to gain insights on performance of ResNet. There have been made several attempts to describe ResNet by differential equations [2, 19, 27]. For example in 2] the time continuous version of a ResNet is studied and different temporal discretization schemes are discussed. There are also studies on application of kinetic methods to ResNet [1, 20, 30]. For example in 
the authors consider the limit of infinitely many neurons and gradient steps in the case of one hidden layer. They are able to proof a central limit theorem and show that the fluctuations of the neural network at the mean field limit are normally distributed.
In this work we do not consider the limit of infinitely many neurons. Instead we fix the number of neurons to the number of input features, and we call the resulting network simplified residual neural network (SimResNet). Then, we derive the time continuous limit of the SimResNet model which leads to a possible large system of ODEs. We consider the mean field limit in the number of inputs (or measurements) deriving a hyperbolic PDE. Throughout this study we assume that the bias and weights are optimized and given. The purpose of this approach is to analyze the forward propagation of the derived mean field neural network model with given weights and bias. We especially focus on aggregation and clustering phenomena, which we study with the help of the corresponding moment model. Furthermore, we compute steady states and perform a sensitivity analysis. The quantity of interest of the sensitivity analysis is an operator called loss function. With the help of the sensitivities we are able to deduce an update formula for the bias and weights in the case of a change in the input or target distribution.
In addition we study Boltzmann type equations with noisy neural network dynamics as extension to the mean field formulation. Long time behavior of such Boltzmann type equations can be conveniently studied in the grazing limit regime. This asymptotic limit naturally leads to Fokker-Planck type equations where it is possible to obtain non trivial steady state distributions [23, 24, 33]. The study of the aggregation phenomena gives us conditions on the shape of the weights and bias. In addition we gain information on the simulation time needed to reach a desired target. The novel update algorithm for the weight and bias seems to give a large performance increase in the case of a shifted target or initial condition. Finally, our Fokker-Planck asymptotics indicate that a stochastic SimResNet model performs well in several applications.
The outline of the paper is as follows. In Section 2 we define the microscopic ResNet model and the time continuous limit. In Section 3 we introduce first the SimResNet and then we derive the mean field neural network model. The corresponding Boltzmann type neural network model is presented in Section 4, with an asymptotic limit leading to a Fokker-Planck equation. We analyze the kinetic neural network formulations with respect to steady states and study qualitative properties with the help of a moment model and a sensitivity analysis. In Section 5 we conduct several numerical test cases which validate our previous analysis. Especially, we conduct two classical machine learning tasks, namely a clustering problem and a regression problem. We conclude the paper in Section 6 with a brief conclusion and an outlook on future research perspectives.
2 Time Continuous ResNet
We assume that the input signal consists of features. A feature is one type of measured data as e.g. temperature of a tool, length or width of a vehicle, color intensity of pixels of an image. Without loss of generality we assume that the value of each feature is one dimensional and thus the input signals are given by . Here, denotes the number of measurements or input signals. In the following, we assume that the number of neurons is identical in each layer, corresponding to a fully connected ResNet. Namely, we consider layers and in each layer the number of neurons is given by , where is the number of neurons for one feature. The microscopic model which defines the time evolution of the activation energy of each neuron , fixed input signal and bias reads :
for each fixed and . Here, denotes the activation function which is applied component wise. Examples for the activation function are given by the identity function
, the so-called ReLU function
, the sigmoid functionand the hyperbolic tangent function . In general (1) can be written in compact formulation by suitably collecting all the weights in an extend matrix . In particular, we have that , for each .
In (1) we formulated a neural network by introducing a parameter . In a classical ResNet . Here, instead, we introduce a time discrete concept which corresponds to the layer discretization. More precisely the time step is defined as and, in this way, we can see (1) as an explicit Euler discretization of an underlying time continuous model. As similar modeling approach with respect to the layers has been introduced in .
A crucial part in applying a neural network is the training of the network. By training one aims to minimize the distance of the output of the neural network at some fixed time to the target . Mathematically speaking one aims to solve the minimization problem
where we use the squared distance between the target and the output to defined the loss function. Other choices are certainly possible 
. The procedure can be computationally expensive on the given training set. Most famous examples of such an optimization are so called back propagation algorithms or ensemble Kalman filters[9, 16, 35]. In the following we assume that the bias and weights are given and the neural network is already trained.
In the following we aim to consider the continuum limit which corresponds to and . In this limit, (1) is given by:
for each fixed and .
Existence and uniqueness of a solution is guaranteed as long the activation function satisfies a Lipschitz condition.
3 Mean Field Formulations of SimResNet
In this section we derive a time continuous PDE model for the forward propagation of residual neural networks. We follow a Liouville type approach for infinitely many measurements or inputs in order to obtain a mean field equation.
We assume a single neuron for each feature. Thus equation (2) becomes
for each fixed , and where . This simplification reduces the complexity of neural networks drastically. This special form is not only beneficial for the kinetic formulation of a neural network but especially reduces the costs in the training stage of a neural networks. This novel method has been successfully applied to an engineering application . We refer to this formulation as SimResNet.
3.2 Mean Field Limit
We perform the kinetic limit in the number of measurements . Since the dimension of is directly related to the dimension of the variable of the kinetic distribution function, for practical purposes we should consider moderate . The mean field model corresponding to the dynamic (3) is
is the compactly supported probability distribution function with known and normalized initial conditions
Observe that corresponds to the distribution of the measured features and that (4) preserves the mass, i.e. , . The derivation is classical and we refer to Golse et al. [8, 11] for details. If fulfills
Then the hyperbolic conservation law (4) admits a unique weak entropy solution in the sense of Krûzkov  for initial data . The mean field neural network (4) can be solved pointwise by the method of characteristics.
For a test function the steady state equation reads
If is a Dirac delta function located at , equation (6) is satisfied only if is the solution to the system . ∎
With the help of the empirical measure it is straightforward to connect the solution of the large particle dynamics to the PDE. The empirical measure defined by the solution vectoris given by
A straightforward calculation shows that is a weak solution of the weak form of the model (4), provided that the initial distribution is given by
In order to show that the microscopic dynamics converge to the mean field limit we use the Wasserstein distance and the Dobrushin inequality. We follow the presentation in . The convergence is obtained in the space of probability measures using the 1-Wasserstein distance, which is defined as follows:
Let and two probability measures on . Then the 1-Wasserstein distance is defined by
where is the space of probability measures on such that the marginals are and i.e.
We only sketch the main steps of the proof and refer to [8, 11] for details. As first step we define the characteristic equations of the mean field neural network model (4). These characteristic equations are measure dependent and one usually uses the push-forward operator in order to be able to derive the Dobrushin stability estimate. The existence of the solution of the characteristic equations ban be shown with the help of the Lipschitz constant and the corresponding fixed point operator. As next step one considers the distance of two measures and again uses the fixed point operator and the Lipschitz continuity in order to bound the distance. Then one can apply the Gronwall inequality and obtains the Dobrushin stability estimate. ∎
3.3 Properties of the One Feature Mean Field Equation
In the case of one feature, i.e. , the mean field equation (4) reduces to
Our subsequent analysis is performed in this simple case.
We first define the -th moment,
, and variance of our mean field model by
Clearly, the possibility to obtain a moment model is solely determined by the shape of the activation function .
In the following we aim to characterize concentration phenomena of our solution with respect to the functions and . Therefore, we study the expected value and energy of our mean field model. In particular, we are interested in the characterization of several concentration phenomena that we define below.
We say that the solution to equation (8) is characterized by
energy bound if
holds at a fixed time ;
energy decay if
holds for any . This means that the energy is decreasing with respect to time;
holds at a fixed time , where denotes the variance;
holds for any . This means that the variance is decreasing with respect to time.
We observe that if the first moment is conserved in time, then definition of energy bound is equivalent to concentration, and definition of energy decay is equivalent to aggregation.
3.3.1 Identity as activation function
A simple computation reveals that the 0-th moment is conserved and holds for all times . For the moments we obtain
Notice that the -th moment only depends on the -th moment. It is then possible to solve the moment equations iteratively with the help of the separation of variables formula obtaining
Let us define
Assume that the bias is identical to zero, namely , . Then we obtain
energy bound if at a fixed time ; and
energy decay if and only if for all ; and
aggregation if and only if . In particular the steady state is distributed as a Dirac delta centered at .
If then (11) simplifies to
and thus the first and the second moment are identical except the given initial conditions. Then we can easily apply the definitions of energy bound, energy decay and aggregation to prove the statement. ∎
Assume that the bias is identical zero, namely , . Then
if energy bound exists at a time we have concentration; and
if energy decay holds we have aggregation.
We start proving the first statement. First we observe that (12) is still true, since by hypothesis we assume that the bias is zero. Due to the definition of concentration we need to verify that for a fixed time . Using the definition of the variance, concentration is implied by
We have aggregation if for any . This is equivalent to
The choice is suitable for clustering problems at the origin, independently on the initial first moment. Conservation of the first moment is possible by choosing . See the following results.
Let the bias be . Then the first moment is conserved in time and we obtain
a energy bound if at a fixed time ; and an
aggregation phenomena if holds for all . The steady state is Dirac delta centered at .
The solution formula for the second moment is
Then we have energy bound if
which is satisfied assuming that at a fixed time . For the second statement we observe that delta aggregation is also implied by , for all times. Or, equivalently, by , for all times. We have
if and only if for all times. ∎
We aim to discuss the impact of the variance on aggregation and concentration phenomena. This is especially interesting if we are not interested in the long time behavior but rather aim to know if for some tolerance and time . In applications this level would be determined by the variance of the target distribution.
If the bias is identical to zero, namely , , then the energy at time is below tolerance if
is satisfied. Similarly, we have that the variance is below the level if
Similarly, if the bias fulfills , then the energy at time is below the level if
is satisfied provided that holds. Similarly, the variance at time is below the level if
is satisfied provided that holds.
In general it is not possible to obtain a closed moment model in the case of the sigmoid or hyperbolic tangent activation function Nevertheless one might approximate the sigmoid or tanh activation function by the linear part of their series expansion.
In the case of the ReLU activation function we decompose the moments in two parts
and we define
where and .
Let us define , then using the Leibniz integration rule we compute
Consequently, the evolution equation for the -th moment cannot be expressed by a closed formula since it depends on the partial moment on and boundary conditions:
In the case of constant weights and bias the equality holds. Notice also that the above discussion simplifies in the case of vanishing bias, and it becomes equivalent to the case when the activation function is the identity function. In fact, if , , then the set switches to be or , depending on the sign of the weight . Moreover, thanks to (13) we immediately obtain that holds true and thus is satisfied for all . Hence, the evolution equation (16) reduces to the case (10) and same computations can be performed.
3.4 Sensitivity Analysis
The goal is to compute the sensitivity of a quantity of interest with respect to some parameter. The quantity of interest is the distance of function at finite time to the target distribution . We assume the ResNet has been trained using the loss function .
We may expect that training was expensive and will not necessarily be done again if input or target changes. Therefore, it is of interest if the trained network (namely and ) can be reused if or changes. We propose to compute the corresponding sensitivities of with respect to the weight and bias. This in turn can be used to apply a gradient step on . We use adjoint calculus to adjust to the modified , i.e. the formal first order optimality condition for Lagrange multiplier reads:
where is the derivative of the activation function. It is amed that is differentiable, i.e. or . Adjustment of optimal weights and bias can be then obtained via gradient step. After an update of initial data or target maybe summarized by:
4 Boltzmann type Equations
The mean field models in the previous section can be obtained as suitable asymptotic limit of Boltzmann type space homogeneous integro-differential kinetic equations. In particular, the case of one-dimensional feature can be derived from a linear Boltzmann type equation.
In fact, in the one-dimensional case the system of ODEs (3) can be recast as the following linear interaction rule:
where, by kinetic terminology, and are the post- and pre-collision states, respectively. The corresponding weak form of the Boltzmann type equation reads
where represents the interaction rate. In the present homogeneous setting, influences only the relaxation speed to equilibrium and thus, without loss of generality, in the following we take .
An advantage of the Boltzmann type description (18) in comparison to the mean field equation is the possibility to study different asymptotic scales leading to non-trivial steady states. The choice of weights , bias and of the activation function may be obtained by fitting the target distribution.
In the the case of the identity activation function the first moment can be computed explicitly and a Fokker-Planck type asymptotic equation can be derived. In order to study steady-state profiles for arbitrary activation functions, we choose the following approach: we add noise to the microscopic interaction rule (17) leading in the asymptotic limit to a Fokker-Planck equation.
4.1 Fokker-Planck Neural Network Model
Let be a small parameter, weighting for the strength of the interactions. Modify the interaction (17) as
is a random variable with mean zero and variance, and is a diffusion function. In the classical grazing collision limit , , we recover the following Fokker-Planck equation for the probability distribution
where we define the interaction operator and the diffusive operator as
The advantage in computing the grazing collision limit is that the classical integral formulation of the Boltzmann collision term is replaced by differential operators. This allows a simple analytical characterization of the steady state solution of (20). Provided the target can be well fitted by a steady state distribution of the Fokker-Planck neural network model, the weight, the bias and the activation function are immediately determined. This is a huge computational advantage in comparison the the classical training of neural networks. In the following we present examples of steady states of the Fokker-Planck neural network model.
4.1.1 Steady state characterization
Steady state solution can be easily found as
The constant is determined by mass conservation, i.e. . The existence and explicit shape of the steady state is determined by the specific choice of activation function , diffusion function and parameters .
If the target is distributed as a Gaussian, then choosing and yields a suitable approximation since the steady state (21) is given by
provided that . Condition on mass conservation leads to
with to guarantee converge of .
If the target is distributed as an inverse Gamma, then choosing and yields a suitable approximation since the steady state (21) is given by
with and normalization constant
where denotes the Gamma function. Notice that we have to assume that and hold in order to obtain a distribution.
Let and , and w. l. o. g. we assume so that is identical zero on the set . The steady state on is given by and can be extended on by the Pareto distribution:
If activation and diffusion function
it is possible to obtain a generalized Gamma distribution as steady state. This specific model has been discussed in and the exponential convergence of the solution to the steady state has been proven in . This may motivate to choose the novel activation function provided that the data is given by a generalized Gamma distribution.
5 Numerical Experiments
In this section we present two classical applications of machine learning algorithms, namely a clustering and regression problem. Furthermore, we validate the theoretical observations of the moment model. In addition we test our weight and bias update derived in the sensitivity analysis. Finally, we show that the Fokker-Planck type neural network is able to fit non trivial continuous probability distributions. The weights and biases are given and constant in time. For the simulations we solve the PDEs models presented in this work by using a third order finite volume scheme , briefly reviewed below.
All the cases we consider for the numerical experiments can be recast in the following compact formulation:
with and given constants. Application of the method of lines to (22) on discrete cells leads to the system of ODEs
where is the approximation of the cell average of the exact solution in the cell at time .
Here, approximates with suitable accuracy and is computed as a function of the boundary extrapolated data, i.e.
and is a consistent and monotone numerical flux, evaluated on two estimates of the solution at the cell interface, i.e . We focus on the class of central schemes, in particular we consider a local Lax-Friedrichs flux. In order to construct a third-order scheme the values at the cell boundaries are computed with the third-order CWENO reconstruction [4, 18].
The term is a high-order approximation to the diffusion term in (22). In the examples below we use the explicit fourth-order central differencing employed in  for convective-diffusion equations with a general dissipation flux, and which uses point-values reconstructions computed with the CWENO polynomial.
Finally, is the numerical source term which is typically approximated as , where and are the nodes and weights of a quadrature formula on . the four point Gaussian quadrature of order seven. We employ three point Gaussian quadrature formula matching the order of the scheme.
System (22) is finally solved by the classical third-order (strong stability preserving) SSP Runge-Kutta with three stages . At each Runge-Kutta stage, the cell averages are used to compute the reconstructions via the CWENO procedure and the boundary extrapolated data are fed into the local Lax-Friedrichs numerical flux. The initial data are computed with the three point Gaussian quadrature. The time step is chosen in an adaptive fashion and all the simulations are run with a CFL of .
5.1 Moment Model
We have solved the mean field neural network model in order to compute the corresponding moments. In this section we choose the initial condition to be the following Gaussian probability distribution:
In section 3.3 we extensively discussed several aggregation phenomena of the moment model. As predicted by Proposition 2 we obtain with the choice a decay to zero of the energy, expected value and the variance as depicted in Figure 1. Furthermore, we have plotted in Figure 1 the case with which guarantees the conservation of the first moment as discussed in Proposition 3. Figure 2 illustrates the time needed in order to reach a desired energy or variance level, which has been studied in Corollary 2 and LABEL:cor3.
5.2 Machine Learning Applications
We present the kinetic approach for a classification and regression problem. In these section the activation function is chosen to be the hyperbolic tangent.
5.2.1 Classification Problem
Consider a classification problem as follows. We measure a quantity (e.g. length of a car) and the object must be sorted with respect of the type (e.g. car or truck). The task of the neural network is to determine the type given a measurement. A training set might look like Table 1.
A probabilistic interpretation of the length measure can be obtained by a normalized histogram. Thus, the histogram shows the frequency of a classifier of certain measurement (see Figure 3
). The continuous approximation of this histogram is the input of our mean field neural network model. The type could be a binary variable. In terms of our kinetic model the target is characterized by two Dirac delta distributions located at the binary values. Therefore, we introduce zero flux boundary conditions on the numerical scheme. The kinetic variable describes the distribution of the measurements (e.g. the length of vehicles). At final time we interpret this length as the decision of being a car or a truck. Thus, we introduce two measurements that determine the type. On the particle level we obtain the convergence of the neurons two these clusters (see Figure4). The solution to the mean field neural network model is depicted in Figure 5.
5.2.2 Regression Problem
We may have given measurements at fixed locations. These measurements might be disturbed possibly due to measurement errors as in Figure 6. The task of the neural network is to find the fit of those data points. It is not possible to solve this task with our model in one dimension since we have proven in Proposition 1 that the mean field neural network model only performs clustering tasks in the case of the identical and hyperbolic tangent activation function.
Therefore we transform the problem and assume a linear fit and aim to learn the slope of this fit by a neural network. The data is used to generate empirical slopes. These slopes are given by a probabilistic interpretation as in the histogram in Figure 7, being the input of the model. The target is a Dirac delta distribution located at . The solution converges to the target, see Figure 8. Thus, the location of the Dirac delta gives the correct slope of the graph in Figure 6.
5.3 Sensitivity Analysis and Update of Weights and Bias
We aim to present the benefit of the sensitivity analysis. We consider the regression problem as introduced in the previous section. The initial data is Gaussian with mean and variance equal to one, the target distribution is a Dirac delta centered at and the weights and bias are .
As Figure 8 shows, is close to the target. Then, we introduce as new target a Dirac delta centered at and use adjoint calculus with fixed step size to update the weights. The result of the mean field neural network model for different number of gradient steps are presented in Figure 9. Thus, the gradient step can be used in order to update weights and bias in case of changing the initial input or the target.
5.4 Fokker-Planck Type Neural Network
In this example we consider a standard normal Gaussian distribution as target and the input is uniformly distributed on . As presented in Section 4.1.1 the Fokker-Planck type neural network model is able to obtain the Gaussian distribution as steady state. This directly leads us to the choice of the weight, bias and activation function. We need to choose the identity as activation function. This approach allows us to drive the initial input to the given target by simply fitting the two parameters and .
Recall that, on the contrary, and as proven in Proposition 1, the mean field neural network can perform in the case of a hyperbolic tangent or identity activation function only clustering tasks. This means that for large times the distribution approaches a Dirac delta distribution and consequently it is not possible to fit a Gaussian distributed target by using the deterministic SimResNet model.
Starting from the classical formulation of a residual neural network, we have derived a simplified residual neural network and the corresponding time continuous limit. Then, we have taken the mean field limit in the number of measurements, thus switching from a microscopic perspective on the level of neurons to a probabilistic interpretation. We have analyzed solutions and steady states of the novel model. Furthermore, we have investigated the sensitivity of the loss function with respect to the weight and bias. Finally, we have derived a Boltzmann description of the simplified residual neural network and extended it to the case of a noisy setting. As consequence, non trivial steady states have been obtained for the limiting Fokker-Planck type model. In the last section we have validated our analysis and have presented simple machine learning applications, namely regression and classification problems.
Our study may yield insights in order to understand the performance of residual neural networks. E.g. the moment analysis gives practical estimates on the simulation time and how to choose bias and weight. The gradient algorithm derived form the sensitivity analysis provides us with an update formula to recompute bias and weight: after changes in initial or target conditions. Probably most interestingly, we have seen that our mean field neural network model has in special situations only a Dirac delta function as unique steady state. In these cases the mean field neural network performs clustering tasks. In comparison to the mean field model, the Fokker-Planck neural network model is able to exhibit non-trivial steady states.
This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC-2023 Internet of Production – 390621612 and supported also by DFG HE5386/15.
M. Herty and T. Trimborn acknowledge the support by the ERS Prep Fund - Simulation and Data Science. The work was partially funded by the Excellence Initiative of the German federal and state governments.
-  D. Araújo, R. I. Oliveira, and D. Yukimura. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud.
Neural ordinary differential equations.In Advances in neural information processing systems, pages 6571–6583, 2018.
-  R. M. Colombo, M. Mercier, and M. D. Rosini. Stability and total variation estimates on general scalar balance laws. Commun. Math. Sci., 7(1):37–65, 2009.
-  I. Cravero, G. Puppo, M. Semplice, and G. Visconti. CWENO: uniformly accurate reconstructions for balance laws. Math. Comp., 87(312):1689–1719, 2018.
-  G. Dimarco and G. Toscani. Kinetic modeling of alcohol consumption. arXiv preprint arXiv:1902.08198, 2019.
-  H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller. Data augmentation using synthetic data for time series classification with deep residual networks. arXiv preprint arXiv:1808.02455, 2018.
-  C. Gebhardt and T. Trimborn. Simplified ResNet approach for data driven prediction of microstructure-fatigue relationship. In preparation.
-  F. Golse. On the dynamics of large particle systems in the mean field limit. In Macroscopic and large scale phenomena: coarse graining, mean field limits and ergodicity, pages 1–144. Springer, 2016.
-  E. Haber, F. Lucka, and L. Ruthotto. Never look back - A modified EnKF method and its application to the training of neural networks without back propagation. Preprint arXiv:1805.08034, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. , pages 770–778, 2015.
-  P.-E. Jabin. A review of the mean field limits for vlasov equations. Kinetic & Related Models, 7(4):661–711, 2014.
-  K. Janocha and W. M. Czarnecki. On loss functions for deep neural networks in classifications. Preprint arXiv:1702.05659v1, 2017.
-  G.-S. Jiang and C.-W. Shu. Efficient implementation of weighted ENO schemes. J. Comput. Phys., 126:202–228, 1996.
-  M. I. Jordan and T. M. Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
A. V. Joshi.
Machine learning and artificial intelligence, 2019.
-  N. B. Kovachki and A. M. Stuart. Ensemble Kalman inversion: a derivative-free technique for machine learning tasks. Inverse Probl., 35(9):095005, 2019.
-  A. Kurganov and D. Levy. A third-order semidiscrete central scheme for conservation laws and convection-diffusion equations. SIAM J. Sci. Comput., 22(4):1461–1488, 2000.
-  D. Levy, G. Puppo, and G. Russo. Compact central WENO schemes for multidimensional conservation laws. SIAM J. Sci. Comput., 22(2):656–672, 2000.
-  Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.
-  S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
-  S. C. Onar, A. Ustundag, Ç. Kadaifci, and B. Oztaysi. The changing role of engineering education in industry 4.0 era. In Industry 4.0: Managing The Digital Transformation, pages 137–151. Springer, 2018.
-  F. Otto and C. Villani. Generalization of an inequality by talagrand and links with the logarithmic sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
-  L. Pareschi and G. Toscani. Self-Similarity and Power-Like Tails in Nonconservative Kinetic Models. J. Stat. Phys., 124(2-4):747–779, 2006.
-  L. Pareschi and G. Toscani. Interacting Multiagent Systems. Kinetic equations and Monte Carlo methods. Oxford University Press, 2013.
-  D. Ray and J. S. Hesthaven. An artificial neural network as a troubled-cell indicator. J. Comput. Phys., 367(15):166–191, 2018.
-  D. Ray and J. S. Hesthaven. Detecting troubled-cells on two-dimensional unstructured grids using a neural network. J. Comput. Phys., 397, 2019. To appear.
-  L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.
-  L. Ruthotto, S. Osher, W. Li, L. Nurbekyan, and S. W. Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Preprint arXiv:1912.01825, 2019.
-  R. Schmitt and G. Schuh. Advances in Production Research: Proceedings of the 8th Congress of the German Academic Association for Production Technology (WGP), Aachen, November 19-20, 2018. Springer, 2018.
-  J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 2019.
-  R. J. Solomonoff. Machine learning-past and future. Dartmouth, NH, July, 2006.
-  H. Tercan, T. Al Khawli, U. Eppelt, C. Büscher, T. Meisen, and S. Jeschke. Improving the laser cutting process design by machine learning techniques. Production Engineering, 11(2):195–203, 2017.
-  G. Toscani. Kinetic models of opinion formation. Commun. Math. Sci., 3(4):481–496, 2006.
-  Q. Wang, J. S. Hesthaven, and D. Ray. Non-intrusive reduced order modelling of unsteady flows using artificial neural networks with application to a combustion problem. J. Comput. Phys., 384:289–307, 2019.
-  K. Watanabe and S. G. Tzafestas. Learning algorithms for neural networks with the Kalman filters. J. Intell. Robot. Syst., 3(4):305–319, 1990.
-  P. J. Werbos. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting, volume 1. John Wiley & Sons, 1994.
-  Z. Wu, C. Shen, and A. Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119–133, 2019.
-  A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.