Kinetic Theory for Residual Neural Networks

01/07/2020 · M. Herty et al.

Deep residual neural networks (ResNets) perform very well in many data science applications. We use kinetic theory to improve the understanding of ResNets and of existing methods. A microscopic simplified residual neural network (SimResNet) model is studied in the limit of infinitely many inputs. This leads to kinetic formulations of the SimResNet, which we analyze with respect to sensitivities and steady states. Aggregation phenomena in the case of a linear activation function are also studied. In addition, the analysis is validated by numerical experiments. In particular, results on a clustering and a regression problem are presented.


1 Introduction

The use of machine learning algorithms has gained a lot of interest in the past decade [14, 15, 31]. Besides data science problems like clustering, regression, image recognition or pattern formation, there are novel applications in the field of engineering, e.g. for production processes [21, 29, 32]. In this study we focus on deep residual neural networks (ResNets), which build on neural network research dating back to the 1970s and have been heavily influenced by the pioneering work of Werbos [36]. ResNets have been successfully applied to a variety of applications such as image recognition [37], robotics [38] or classification [6]. More recently, applications to mathematical problems in numerical analysis [25, 26, 34] and optimal control [28] have also been studied.

A ResNet can be briefly summarized as follows: given inputs, which are usually measurements, the ResNet propagates those to a final state. This final state is usually called the output and aims to fit a given target. In order to solve this optimization problem, the parameters of the ResNet need to be optimized, a step called training. The parameters are distinguished as weights and biases. For the training of ResNets, backpropagation algorithms are frequently used [35, 36].

The purpose of this work is to use kinetic theory in order to gain insights into the performance of ResNets. Several attempts have been made to describe ResNets by differential equations [2, 19, 27]. For example, in [27] the connection of deep convolutional neural networks to partial differential equations (PDEs) is derived. In [2] the time continuous version of a ResNet is studied and different temporal discretization schemes are discussed. There are also studies on the application of kinetic methods to ResNets [1, 20, 30]. For example, in [30] the authors consider the limit of infinitely many neurons and gradient steps in the case of one hidden layer. They are able to prove a central limit theorem and show that the fluctuations of the neural network around the mean field limit are normally distributed.

In this work we do not consider the limit of infinitely many neurons. Instead we fix the number of neurons to the number of input features, and we call the resulting network a simplified residual neural network (SimResNet). Then, we derive the time continuous limit of the SimResNet model, which leads to a possibly large system of ODEs. We consider the mean field limit in the number of inputs (or measurements), deriving a hyperbolic PDE. Throughout this study we assume that the bias and weights are optimized and given. The purpose of this approach is to analyze the forward propagation of the derived mean field neural network model with given weights and bias. We especially focus on aggregation and clustering phenomena, which we study with the help of the corresponding moment model. Furthermore, we compute steady states and perform a sensitivity analysis. The quantity of interest of the sensitivity analysis is an operator called the loss function. With the help of the sensitivities we are able to deduce an update formula for the bias and weights in the case of a change in the input or target distribution.

In addition we study Boltzmann type equations with noisy neural network dynamics as an extension of the mean field formulation. The long time behavior of such Boltzmann type equations can be conveniently studied in the grazing limit regime. This asymptotic limit naturally leads to Fokker-Planck type equations for which it is possible to obtain non-trivial steady state distributions [23, 24, 33]. The study of the aggregation phenomena gives us conditions on the shape of the weights and bias. In addition we gain information on the simulation time needed to reach a desired target. The novel update algorithm for the weights and bias seems to give a large performance increase in the case of a shifted target or initial condition. Finally, our Fokker-Planck asymptotics indicate that a stochastic SimResNet model performs well in several applications.

The outline of the paper is as follows. In Section 2 we define the microscopic ResNet model and the time continuous limit. In Section 3 we first introduce the SimResNet and then derive the mean field neural network model. The corresponding Boltzmann type neural network model is presented in Section 4, with an asymptotic limit leading to a Fokker-Planck equation. We analyze the kinetic neural network formulations with respect to steady states and study qualitative properties with the help of a moment model and a sensitivity analysis. In Section 5 we conduct several numerical test cases which validate our previous analysis. In particular, we consider two classical machine learning tasks, namely a clustering problem and a regression problem. We close the paper in Section 6 with a brief conclusion and an outlook on future research perspectives.

2 Time Continuous ResNet

We assume that the input signal consists of features. A feature is one type of measured data, e.g. the temperature of a tool, the length or width of a vehicle, or the color intensity of the pixels of an image. Without loss of generality we assume that the value of each feature is one dimensional and thus the input signals are given by . Here, denotes the number of measurements or input signals. In the following, we assume that the number of neurons is identical in each layer, corresponding to a fully connected ResNet. Namely, we consider layers and in each layer the number of neurons is given by , where is the number of neurons for one feature. The microscopic model which defines the time evolution of the activation energy of each neuron , fixed input signal and bias reads [10]:

(1)

for each fixed and . Here, denotes the activation function, which is applied component-wise. Examples for the activation function are the identity function x ↦ x, the so-called ReLU function x ↦ max(0, x), the sigmoid function x ↦ 1/(1 + exp(−x)) and the hyperbolic tangent function x ↦ tanh(x). In general, (1) can be written in compact form by suitably collecting all the weights in an extended matrix . In particular, we have that , for each .

In (1) we formulated a neural network by introducing a parameter . In a classical ResNet . Here, instead, we introduce a time discrete concept which corresponds to the layer discretization. More precisely, the time step is defined as and, in this way, we can see (1) as an explicit Euler discretization of an underlying time continuous model. A similar modeling approach with respect to the layers has been introduced in [27].
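
Since the explicit form of (1) is not reproduced above, the following minimal sketch illustrates the interpretation as an explicit Euler step under the commonly assumed residual update x^{k+1} = x^k + Δt σ(W_k x^k + b_k); the variable names, the placement of weights and bias, and the parameter choices are ours and purely illustrative.

```python
import numpy as np

def resnet_forward(x0, weights, biases, dt, activation=np.tanh):
    """Propagate an input through K residual layers.

    Assumed layer rule: x^{k+1} = x^k + dt * activation(W_k x^k + b_k),
    i.e. one explicit Euler step per layer of an underlying ODE.
    """
    x = np.asarray(x0, dtype=float)
    for W, b in zip(weights, biases):
        x = x + dt * activation(W @ x + b)
    return x

# Illustrative example: 4 layers acting on a 3-dimensional input.
rng = np.random.default_rng(0)
K, n = 4, 3
weights = [rng.normal(scale=0.5, size=(n, n)) for _ in range(K)]
biases = [rng.normal(scale=0.1, size=n) for _ in range(K)]
print(resnet_forward(rng.normal(size=n), weights, biases, dt=1.0 / K))
```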

A crucial part in applying a neural network is the training of the network. By training one aims to minimize the distance of the output of the neural network at some fixed time to the target . Mathematically speaking, one aims to solve the minimization problem

where we use the squared distance between the target and the output to define the loss function. Other choices are certainly possible [12]. The procedure can be computationally expensive on the given training set. The most prominent examples of such optimization methods are so-called backpropagation algorithms and ensemble Kalman filters [9, 16, 35]. In the following we assume that the bias and weights are given and the neural network is already trained.

In the following we aim to consider the continuum limit which corresponds to and . In this limit, (1) is given by:

(2)

for each fixed and .

Existence and uniqueness of a solution is guaranteed as long as the activation function satisfies a Lipschitz condition.

3 Mean Field Formulations of SimResNet

In this section we derive a time continuous PDE model for the forward propagation of residual neural networks. We follow a Liouville type approach for infinitely many measurements or inputs in order to obtain a mean field equation.

3.1 SimResNet

We assume a single neuron for each feature. Thus equation  (2) becomes

(3)

for each fixed , and where . This simplification reduces the complexity of neural networks drastically. This special form is not only beneficial for the kinetic formulation of a neural network but especially reduces the costs in the training stage of a neural network. This novel method has been successfully applied to an engineering application [7]. We refer to this formulation as SimResNet.
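
As a concrete illustration of this reduction, the sketch below integrates the SimResNet dynamics for a batch of scalar measurements by explicit Euler, assuming the form dx_i/dt = σ(w(t) x_i + b(t)); the function names and the constant weight and bias are illustrative, not taken from the paper.

```python
import numpy as np

def simresnet_trajectories(x0, w, b, t_final, n_steps, activation=np.tanh):
    """Explicit Euler integration of the (assumed) SimResNet dynamics
    dx_i/dt = activation(w(t) * x_i + b(t)) for every measurement x_i."""
    x = np.array(x0, dtype=float)
    dt = t_final / n_steps
    history = [x.copy()]
    for k in range(n_steps):
        t = k * dt
        x = x + dt * activation(w(t) * x + b(t))
        history.append(x.copy())
    return np.array(history)

# Illustrative example: 200 scalar measurements driven towards a common value.
rng = np.random.default_rng(1)
x0 = rng.uniform(3.0, 8.0, size=200)
traj = simresnet_trajectories(x0, w=lambda t: -1.0, b=lambda t: 5.0,
                              t_final=10.0, n_steps=200)
print(traj[-1].mean(), traj[-1].std())   # the measurements concentrate
```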

3.2 Mean Field Limit

We perform the kinetic limit in the number of measurements . Since the dimension of is directly related to the dimension of the variable of the kinetic distribution function, for practical purposes we should consider moderate . The mean field model corresponding to the dynamics (3) is

(4)

where is the compactly supported probability distribution function with known and normalized initial conditions

Observe that corresponds to the distribution of the measured features and that (4) preserves the mass, i.e. , . The derivation is classical and we refer to Golse et al. [8, 11] for details. If fulfills

(5)

then the hyperbolic conservation law (4) admits a unique weak entropy solution in the sense of Kružkov [3] for initial data . The mean field neural network (4) can be solved pointwise by the method of characteristics.

Proposition 1.

Let be a compactly supported weak solution of the mean field equation (4). Consider the case of the identity activation function or the hyperbolic tangent activation function . Assume and . Then

is a steady state solution of (4) in the sense of distributions provided that solves .

Proof.

For a test function the steady state equation reads

(6)

If is a Dirac delta function located at , equation (6) is satisfied only if is the solution to the system . ∎

With the help of the empirical measure it is straightforward to connect the solution of the large particle dynamics to the PDE. The empirical measure defined by the solution vector is given by

A straightforward calculation shows that is a weak solution of the model (4), provided that the initial distribution is given by

In order to show that the microscopic dynamics converge to the mean field limit we use the Wasserstein distance and the Dobrushin inequality. We follow the presentation in [8]. The convergence is obtained in the space of probability measures using the 1-Wasserstein distance, which is defined as follows:

Definition 1.

Let and be two probability measures on . Then the 1-Wasserstein distance is defined by

(7)

where is the space of probability measures on such that the marginals are and , i.e.
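
For one-dimensional states, the 1-Wasserstein distance between two empirical measures can be evaluated directly from samples, e.g. with SciPy; the following snippet only illustrates how the distance appearing in the Dobrushin estimate can be monitored numerically, with purely illustrative data.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
particles = rng.uniform(3.0, 8.0, size=500)             # states of the particle system
reference = rng.normal(loc=5.5, scale=1.0, size=5000)   # samples of a reference measure

# In 1D the 1-Wasserstein distance reduces to the L1 distance of the quantile
# functions, which scipy evaluates from the two sample sets.
print(wasserstein_distance(particles, reference))
```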

Theorem 1.

We assume that the activation function of the microscopic system (3) is Lipschitz continuous with Lipschitz constant . Let the initial condition of the Cauchy problem (4) be a probability measure with finite first moment and

holds. Then the Dobrushin stability estimate

is satisfied and

holds as .

Proof.

We only sketch the main steps of the proof and refer to [8, 11] for details. As a first step we define the characteristic equations of the mean field neural network model (4). These characteristic equations are measure dependent and one usually uses the push-forward operator in order to be able to derive the Dobrushin stability estimate. The existence of the solution of the characteristic equations can be shown with the help of the Lipschitz constant and the corresponding fixed point operator. As a next step one considers the distance of two measures and again uses the fixed point operator and the Lipschitz continuity in order to bound the distance. Then one can apply the Gronwall inequality and obtains the Dobrushin stability estimate. ∎

3.3 Properties of the One Feature Mean Field Equation

In the case of one feature, i.e. , the mean field equation (4) reduces to

(8)

Our subsequent analysis is performed in this simple case.
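
For orientation, and under the assumption that the one-feature dynamics has the form ẋ = σ(w(t) x + b(t)) suggested by (3), the mean field equation (8) is the associated continuity equation; the notation f, σ, w, b below is ours and serves only to fix ideas for the moment computations that follow.

```latex
% Assumed form of the one-feature mean field equation (8) and its characteristics
\partial_t f(t,x) + \partial_x\big(\sigma(w(t)\,x + b(t))\, f(t,x)\big) = 0, \qquad f(0,x) = f_0(x),
\qquad \dot{X}(t) = \sigma\big(w(t)\,X(t) + b(t)\big).
```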

We first define the -th moment, , and the variance of our mean field model by

(9)

Clearly, the possibility to obtain a closed moment model is solely determined by the shape of the activation function .

In the following we aim to characterize concentration phenomena of our solution with respect to the functions and . Therefore, we study the expected value and energy of our mean field model. In particular, we are interested in the characterization of several concentration phenomena that we define below.

Definition 2.

We say that the solution to equation (8) is characterized by

  • energy bound if

    holds at a fixed time ;

  • energy decay if

    holds for any . This means that the energy is decreasing with respect to time;

  • concentration if

    holds at a fixed time , where denotes the variance;

  • aggregation if

    holds for any . This means that the variance is decreasing with respect to time.

We observe that if the first moment is conserved in time, then the definition of energy bound is equivalent to concentration, and the definition of energy decay is equivalent to aggregation.

3.3.1 Identity as activation function

A simple computation reveals that the 0-th moment is conserved and holds for all times . For the moments we obtain

(10)

Notice that the -th moment only depends on the -th moment. It is then possible to solve the moment equations iteratively with the help of the separation-of-variables formula, obtaining

(11)
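
To make the iterative structure explicit, assume the linear transport field σ(w(t)x + b(t)) = w(t)x + b(t); multiplying the (assumed) equation (8) by x^k and integrating by parts then gives, in our notation, the moment hierarchy and its variation-of-constants solution, consistent with (10) and (11):

```latex
% Moment hierarchy for the identity activation (our notation)
\frac{\mathrm{d}}{\mathrm{d}t} m_k(t) = k\,w(t)\,m_k(t) + k\,b(t)\,m_{k-1}(t),
\qquad m_k(t) := \int x^k f(t,x)\,\mathrm{d}x,

m_k(t) = e^{k\int_0^t w(s)\,\mathrm{d}s}\Big( m_k(0)
       + k \int_0^t b(s)\,m_{k-1}(s)\, e^{-k\int_0^s w(r)\,\mathrm{d}r}\,\mathrm{d}s \Big).
```

In particular, the zeroth moment is conserved, and for a vanishing bias every moment simply evolves as m_k(t) = m_k(0) exp(k ∫_0^t w(s) ds), which is the structure exploited in Proposition 2.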

Let us define

Proposition 2.

Assume that the bias is identical to zero, namely , . Then we obtain

  1. energy bound if at a fixed time ; and

  2. energy decay if and only if for all ; and

  3. aggregation if and only if . In particular the steady state is distributed as a Dirac delta centered at .

Proof.

If then (11) simplifies to

(12)

and thus the first and the second moment are identical except for the given initial conditions. Then we can easily apply the definitions of energy bound, energy decay and aggregation to prove the statement. ∎

Corollary 1.

Assume that the bias is identically zero, namely , . Then

  1. if energy bound exists at a time we have concentration; and

  2. if energy decay holds we have aggregation.

Proof.

We start by proving the first statement. First we observe that (12) still holds, since by hypothesis the bias is zero. Due to the definition of concentration we need to verify that for a fixed time . Using the definition of the variance, concentration is implied by

We have aggregation if for any . This is equivalent to

The choice is suitable for clustering problems at the origin, independently of the initial first moment. Conservation of the first moment is possible by choosing . See the following results.

Proposition 3.

Let the bias be . Then the first moment is conserved in time and we obtain

  1. an energy bound if at a fixed time ; and

  2. an aggregation phenomenon if holds for all . The steady state is a Dirac delta centered at .

Proof.

The solution formula for the second moment is

Then we have energy bound if

which is satisfied assuming that at a fixed time . For the second statement we observe that delta aggregation is also implied by , for all times. Or, equivalently, by , for all times. We have

if and only if for all times. ∎

We aim to discuss the impact of the variance on aggregation and concentration phenomena. This is especially interesting if we are not interested in the long time behavior but rather aim to know if for some tolerance and time . In applications this level would be determined by the variance of the target distribution.

Corollary 2.

If the bias is identical to zero, namely , , then the energy at time is below tolerance if

is satisfied. Similarly, we have that the variance is below the level if

holds.

Similarly, if the bias fulfills , then the energy at time is below the level if

is satisfied provided that holds. Similarly, the variance at time is below the level if

is satisfied provided that holds.

Remark 1.

In general it is not possible to obtain a closed moment model in the case of the sigmoid or hyperbolic tangent activation function. Nevertheless, one might approximate the sigmoid or tanh activation function by the linear part of its series expansion.

Remark 2.

In the case of the ReLU activation function we decompose the moments into two parts

and we define

where and .

Let us define . Then, using the Leibniz integration rule, we compute

(13)

and

(14)
(15)

Consequently, the evolution equation for the -th moment cannot be expressed by a closed formula since it depends on the partial moment on and boundary conditions:

(16)

In the case of constant weights and bias the equality holds. Notice also that the above discussion simplifies in the case of a vanishing bias, and it becomes equivalent to the case when the activation function is the identity function. In fact, if , , then the set is either or , depending on the sign of the weight . Moreover, thanks to (13) we immediately obtain that holds true and thus is satisfied for all . Hence, the evolution equation (16) reduces to the case (10) and the same computations can be performed.

3.4 Sensitivity Analysis

The goal is to compute the sensitivity of a quantity of interest with respect to some parameter. The quantity of interest is the distance of the function at finite time to the target distribution . We assume the ResNet has been trained using the loss function .

We may expect that training was expensive and will not necessarily be repeated if the input or target changes. Therefore, it is of interest whether the trained network (namely and ) can be reused if or changes. We propose to compute the corresponding sensitivities of with respect to the weight and bias. This in turn can be used to apply a gradient step on . We use adjoint calculus to adjust to the modified , i.e. the formal first order optimality condition for the Lagrange multiplier reads:

where is the derivative of the activation function. It is assumed that is differentiable, i.e. or . An adjustment of the optimal weights and bias can then be obtained via a gradient step. An update after a change of the initial data or the target may be summarized by the following steps (a code sketch follows the list):

  • initially select optimized weights

  • update by new initial data and target

  • compute

    with step size

  • repeat until

    for a given tolerance is reached
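
Since the adjoint system is not reproduced above, the following sketch only illustrates the update loop itself: a terminal loss for the scalar SimResNet dynamics in the assumed form used earlier, and plain gradient steps on the weight and bias, with finite differences standing in for the adjoint gradient. All names, the choice of loss, and the parameters are illustrative.

```python
import numpy as np

def loss(w, b, x0, target, t_final=10.0, n_steps=200, activation=np.tanh):
    """Terminal squared distance between the propagated measurements and target
    samples (sorted, i.e. compared as empirical distributions). A simple
    stand-in for the loss functional; all names are illustrative."""
    x = np.array(x0, dtype=float)
    dt = t_final / n_steps
    for _ in range(n_steps):
        x = x + dt * activation(w * x + b)
    return np.mean((np.sort(x) - np.sort(target)) ** 2)

def update_weights(w, b, x0, target, step=0.1, tol=1e-3, max_iter=200, eps=1e-5):
    """Gradient steps on (w, b); finite differences replace the adjoint gradient."""
    for _ in range(max_iter):
        J = loss(w, b, x0, target)
        if J < tol:
            break
        dJdw = (loss(w + eps, b, x0, target) - J) / eps
        dJdb = (loss(w, b + eps, x0, target) - J) / eps
        w, b = w - step * dJdw, b - step * dJdb
    return w, b

# Illustrative example: reuse a previously trained (w, b) after the target moved.
rng = np.random.default_rng(3)
x0 = rng.uniform(3.0, 8.0, size=100)
print(update_weights(-1.0, 5.0, x0, target=np.full(100, 6.0)))
```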

4 Boltzmann type Equations

The mean field models in the previous section can be obtained as a suitable asymptotic limit of Boltzmann type space homogeneous integro-differential kinetic equations. In particular, the case of a one-dimensional feature can be derived from a linear Boltzmann type equation.

In fact, in the one-dimensional case the system of ODEs (3) can be recast as the following linear interaction rule:

(17)

where, in kinetic terminology, and are the post- and pre-collision states, respectively. The corresponding weak form of the Boltzmann type equation reads

(18)

where represents the interaction rate. In the present homogeneous setting, influences only the relaxation speed to equilibrium and thus, without loss of generality, in the following we take .

The one-dimensional Boltzmann type equation (18) leads to the one-dimensional mean field equation (8) by suitable scaling [23, 24, 33].

An advantage of the Boltzmann type description (18) in comparison to the mean field equation is the possibility to study different asymptotic scales leading to non-trivial steady states. The choice of weights , bias and of the activation function may be obtained by fitting the target distribution.

In order to obtain non-trivial steady states of model (18) consider self-similar solutions [24]:

In the case of the identity activation function the first moment can be computed explicitly and a Fokker-Planck type asymptotic equation can be derived. In order to study steady-state profiles for arbitrary activation functions, we choose the following approach: we add noise to the microscopic interaction rule (17), leading in the asymptotic limit to a Fokker-Planck equation.

4.1 Fokker-Planck Neural Network Model

Let be a small parameter weighting the strength of the interactions. Modify the interaction (17) as

(19)

where is a random variable with mean zero and variance , and is a diffusion function. In the classical grazing collision limit , , we recover the following Fokker-Planck equation for the probability distribution

(20)

where we define the interaction operator and the diffusive operator as

The advantage of computing the grazing collision limit is that the classical integral formulation of the Boltzmann collision term is replaced by differential operators. This allows a simple analytical characterization of the steady state solution of (20). Provided the target can be well fitted by a steady state distribution of the Fokker-Planck neural network model, the weight, the bias and the activation function are immediately determined. This is a huge computational advantage in comparison to the classical training of neural networks. In the following we present examples of steady states of the Fokker-Planck neural network model.

4.1.1 Steady state characterization

The steady state solution can be easily found as

(21)

The constant is determined by mass conservation, i.e. . The existence and explicit shape of the steady state are determined by the specific choice of the activation function , the diffusion function and the parameters .
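
Since the explicit formulas are not reproduced here, we record the generic computation for reference: assuming the Fokker-Planck equation (20) has the form ∂_t f = −∂_x(σ(wx + b) f) + (λ/2) ∂²_x(D(x)² f) with noise strength λ, the zero-flux steady state (21) reads, in our notation,

```latex
% Generic zero-flux steady state (our notation, under the assumed form of (20))
f^{\infty}(x) = \frac{C}{D(x)^{2}}
  \exp\!\left( \frac{2}{\lambda} \int^{x} \frac{\sigma(w\,s + b)}{D(s)^{2}}\,\mathrm{d}s \right).
```

As a consistency check, for the identity activation and a constant diffusion function this profile is a Gaussian centered at −b/w (for w < 0), in agreement with the Gaussian example below, while the choice D(x) = x produces an inverse Gamma profile.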

If the target is distributed as a Gaussian, then choosing and yields a suitable approximation since the steady state (21) is given by

provided that . The condition on mass conservation leads to

with to guarantee convergence of .

If the target is distributed as an inverse Gamma, then choosing and yields a suitable approximation since the steady state (21) is given by

with and normalization constant

where denotes the Gamma function. Notice that we have to assume that and hold in order to obtain a distribution.

Let and , and w.l.o.g. we assume so that is identically zero on the set . The steady state on is given by and can be extended to by the Pareto distribution:

If the activation and diffusion function are chosen as

it is possible to obtain a generalized Gamma distribution as steady state. This specific model has been discussed in [5] and the exponential convergence of the solution to the steady state has been proven in [22]. This may motivate choosing the novel activation function provided that the data follows a generalized Gamma distribution.
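
The steady states above can be checked by direct Monte Carlo simulation of the noisy microscopic rule. Since the interaction rule (19) is not reproduced explicitly, the sketch below uses the assumed generic form x' = x + ε σ(wx + b) + η D(x) with Var(η) = λε; all parameters are illustrative.

```python
import numpy as np

def simulate_noisy_simresnet(x0, w, b, D, eps=0.01, lam=0.5, t_final=20.0,
                             activation=np.tanh, rng=None):
    """Direct simulation of the (assumed) noisy interaction rule
    x' = x + eps * activation(w*x + b) + eta * D(x),  Var(eta) = lam * eps.
    For small eps the histogram of the samples should approach the
    Fokker-Planck steady state."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(int(t_final / eps)):
        eta = rng.normal(0.0, np.sqrt(lam * eps), size=x.shape)
        x = x + eps * activation(w * x + b) + eta * D(x)
    return x

# Illustrative check of the Gaussian case: identity activation, constant diffusion.
rng = np.random.default_rng(4)
samples = simulate_noisy_simresnet(rng.uniform(-1.0, 1.0, size=20000),
                                   w=-1.0, b=0.0, D=lambda x: np.ones_like(x),
                                   activation=lambda x: x, rng=rng)
print(samples.mean(), samples.var())   # approximately 0 and lam / (2 |w|) = 0.25
```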

5 Numerical Experiments

In this section we present two classical applications of machine learning algorithms, namely a clustering and a regression problem. Furthermore, we validate the theoretical observations of the moment model. In addition we test our weight and bias update derived in the sensitivity analysis. Finally, we show that the Fokker-Planck type neural network is able to fit non-trivial continuous probability distributions. The weights and biases are given and constant in time. For the simulations we solve the PDE models presented in this work by using a third order finite volume scheme [4], briefly reviewed below.

All the cases we consider for the numerical experiments can be recast in the following compact formulation:

(22)

with and given constants. Application of the method of lines to (22) on discrete cells leads to the system of ODEs

(23)

where is the approximation of the cell average of the exact solution in the cell at time .

Here, approximates with suitable accuracy and is computed as a function of the boundary extrapolated data, i.e.

and is a consistent and monotone numerical flux, evaluated on two estimates of the solution at the cell interface, i.e. . We focus on the class of central schemes; in particular we consider a local Lax-Friedrichs flux. In order to construct a third-order scheme the values at the cell boundaries are computed with the third-order CWENO reconstruction [4, 18].

The term is a high-order approximation to the diffusion term in (22). In the examples below we use the explicit fourth-order central differencing employed in [17] for convection-diffusion equations with a general dissipation flux, which uses point-value reconstructions computed with the CWENO polynomial.

Finally, is the numerical source term, which is typically approximated as , where and are the nodes and weights of a quadrature formula on the cell, e.g. the four-point Gaussian quadrature of order seven. We employ the three-point Gaussian quadrature formula, matching the order of the scheme.

System (22) is finally solved by the classical third-order strong stability preserving (SSP) Runge-Kutta scheme with three stages [13]. At each Runge-Kutta stage, the cell averages are used to compute the reconstructions via the CWENO procedure and the boundary extrapolated data are fed into the local Lax-Friedrichs numerical flux. The initial data are computed with the three-point Gaussian quadrature. The time step is chosen in an adaptive fashion and all the simulations are run with a CFL of .
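
The sketch below is a deliberately simplified, first-order stand-in for the scheme just described (local Lax-Friedrichs flux and central diffusion with forward Euler instead of CWENO and SSP-RK3); it is only meant to show the structure of a finite volume solver for the assumed form of (22), and all parameter choices are illustrative.

```python
import numpy as np

def solve_mean_field(f0, flux_speed, diffusion, x, t_final, cfl=0.45):
    """First-order finite volume solver for
        f_t + (a(x, t) f)_x = 0.5 * (D(x)^2 f)_xx      (assumed form of (22)),
    with a local Lax-Friedrichs flux, central differencing for the diffusion
    term and zero-flux boundaries. A simplified stand-in for the third-order
    CWENO / SSP-RK3 scheme described in the text."""
    dx = x[1] - x[0]
    f = np.array(f0, dtype=float)
    t = 0.0
    while t < t_final:
        a = flux_speed(x, t)
        d2 = diffusion(x) ** 2
        dt = cfl * dx / max(np.max(np.abs(a)) + np.max(d2) / dx, 1e-14)
        dt = min(dt, t_final - t)
        # local Lax-Friedrichs flux at the interior cell interfaces
        F = 0.5 * (a[:-1] * f[:-1] + a[1:] * f[1:]) \
            - 0.5 * np.maximum(np.abs(a[:-1]), np.abs(a[1:])) * (f[1:] - f[:-1])
        F = np.concatenate(([0.0], F, [0.0]))            # zero-flux boundaries
        g = d2 * f                                       # diffusion by central differences
        lap = np.concatenate(([0.0], (g[2:] - 2.0 * g[1:-1] + g[:-2]) / dx**2, [0.0]))
        f = f - dt / dx * (F[1:] - F[:-1]) + 0.5 * dt * lap
        t += dt
    return f

# Illustrative example: pure transport of a uniform density under tanh(-x + 5).
x = np.linspace(0.0, 10.0, 400)
f0 = np.where((x > 3.0) & (x < 8.0), 0.2, 0.0)
f_end = solve_mean_field(f0, lambda x, t: np.tanh(-x + 5.0),
                         lambda x: np.zeros_like(x), x, t_final=5.0)
print(f_end.sum() * (x[1] - x[0]))                       # mass is approximately conserved
```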

5.1 Moment Model

We have solved the mean field neural network model in order to compute the corresponding moments. In this section we choose the initial condition to be the following Gaussian probability distribution:

Figure 1: LHS: Moments of our PDE model with . RHS: Moments of our PDE model with .
Figure 2: LHS: The energy and variance plotted against the desired values with . RHS: The energy and variance plotted against the desired values with .

In Section 3.3 we extensively discussed several aggregation phenomena of the moment model. As predicted by Proposition 2, we obtain with the choice a decay to zero of the energy, the expected value and the variance, as depicted in Figure 1. Furthermore, we have plotted in Figure 1 the case with , which guarantees the conservation of the first moment as discussed in Proposition 3. Figure 2 illustrates the time needed in order to reach a desired energy or variance level, which has been studied in Corollary 2.

5.2 Machine Learning Applications

We present the kinetic approach for a classification and a regression problem. In this section the activation function is chosen to be the hyperbolic tangent.

5.2.1 Classification Problem

Consider a classification problem as follows. We measure a quantity (e.g. the length of a car) and the object must be sorted with respect to its type (e.g. car or truck). The task of the neural network is to determine the type given a measurement. A training set might look like Table 1.

Quantity Classifier
3 car
3.5 car
5.5 truck
7 truck
4.5 car
8 truck
Table 1: Example of input data.
Figure 3: We consider vehicles with measured length between and . Mathematically, we sampled realizations of a uniformly distributed random variable between and . LHS: Length of vehicles plotted against the number of measurements. RHS: Histogram of the measured length of the vehicles.
Figure 4: Trajectories of the neuron activation energies in the case of measurements.
Figure 5: Solution of the mean field neural network model at different time steps. The initial value is a uniform distribution on and the weight and bias are chosen as .

A probabilistic interpretation of the length measurements can be obtained by a normalized histogram. Thus, the histogram shows the frequency of a classifier for a certain measurement (see Figure 3). The continuous approximation of this histogram is the input of our mean field neural network model. The type could be a binary variable. In terms of our kinetic model the target is characterized by two Dirac delta distributions located at the binary values. Therefore, we introduce zero flux boundary conditions in the numerical scheme. The kinetic variable describes the distribution of the measurements (e.g. the length of vehicles). At the final time we interpret this length as the decision of being a car or a truck. Thus, we introduce two measurements that determine the type. On the particle level we obtain the convergence of the neurons to these clusters (see Figure 4). The solution to the mean field neural network model is depicted in Figure 5.

5.2.2 Regression Problem

Assume we are given measurements at fixed locations. These measurements might be disturbed, possibly due to measurement errors, as in Figure 6. The task of the neural network is to find a fit of those data points. It is not possible to solve this task with our model in one dimension since we have proven in Proposition 1 that the mean field neural network model only performs clustering tasks in the case of the identity and hyperbolic tangent activation functions.

Figure 6: LHS: Regression problem with 50 measurements at fixed positions around the linear slope. Measurement errors are standard Gaussian distributed. RHS: Numerical slopes computed from the previous measurements.
Figure 7: LHS: Histogram of numerically computed slopes with 100 measurements. RHS: Time continuous Gaussian distribution with mean one and variance one.

Figure 8: Solution of the mean field neural network model at different time steps with weight and bias .

Therefore, we transform the problem: we assume a linear fit and aim to learn the slope of this fit with a neural network. The data is used to generate empirical slopes. These slopes are given a probabilistic interpretation by the histogram in Figure 7, which is the input of the model. The target is a Dirac delta distribution located at . The solution converges to the target, see Figure 8. Thus, the location of the Dirac delta gives the correct slope of the graph in Figure 6.

5.3 Sensitivity Analysis and Update of Weights and Bias

We aim to present the benefit of the sensitivity analysis. We consider the regression problem as introduced in the previous section. The initial data is Gaussian with mean and variance equal to one, the target distribution is a Dirac delta centered at and the weights and bias are .

Figure 9: Results of the mean field neural network model with updated weights and bias in the case of a novel target.

As Figure 8 shows, is close to the target. Then, we introduce as new target a Dirac delta centered at and use adjoint calculus with fixed step size to update the weights. The results of the mean field neural network model for different numbers of gradient steps are presented in Figure 9. Thus, the gradient step can be used to update the weights and bias in case the initial input or the target changes.

5.4 Fokker-Planck Type Neural Network

In this example we consider a standard normal Gaussian distribution as target and the input is uniformly distributed on . As presented in Section 4.1.1 the Fokker-Planck type neural network model is able to obtain the Gaussian distribution as steady state. This directly leads us to the choice of the weight, bias and activation function. We need to choose the identity as activation function. This approach allows us to drive the initial input to the given target by simply fitting the two parameters and .

Recall that, on the contrary, as proven in Proposition 1, the mean field neural network can only perform clustering tasks in the case of a hyperbolic tangent or identity activation function. This means that for large times the distribution approaches a Dirac delta distribution and consequently it is not possible to fit a Gaussian distributed target by using the deterministic SimResNet model.
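
Concretely, under the assumed Fokker-Planck form and steady state recorded in Section 4.1.1, with identity activation, constant diffusion D ≡ 1 and noise strength λ, the Gaussian steady state has the following mean and variance, which can be matched to a target N(μ, s²); this is a consistency check in our notation rather than the authors' stated parameter values.

```latex
% Matching the Gaussian steady state (identity activation, D \equiv 1, w < 0) to a target N(\mu, s^2)
\text{mean}(f^{\infty}) = -\frac{b}{w}, \qquad
\text{var}(f^{\infty})  = \frac{\lambda}{2\,|w|}
\quad\Longrightarrow\quad
w = -\frac{\lambda}{2 s^{2}}, \qquad b = \mu\,\frac{\lambda}{2 s^{2}} .
```

For the standard normal target of this example this gives b = 0 and w = −λ/2 for the chosen noise strength.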

Figure 10: LHS: Histogram of 100 measurements uniformly distributed between . RHS: Histogram of the standard Gaussian distributed target values.
Figure 11: Solution of the Fokker-Planck neural network model at different times. Here, we have chosen the identity as activation function with weight , bias and diffusion function .

The histogram of sampled data is depicted in Figure 10. The solution of the Fokker-Planck neural network model for different time steps is presented in Figure 11 showing the convergence to the given target.

6 Conclusion

Starting from the classical formulation of a residual neural network, we have derived a simplified residual neural network and the corresponding time continuous limit. Then, we have taken the mean field limit in the number of measurements, thus switching from a microscopic perspective on the level of neurons to a probabilistic interpretation. We have analyzed solutions and steady states of the novel model. Furthermore, we have investigated the sensitivity of the loss function with respect to the weight and bias. Finally, we have derived a Boltzmann description of the simplified residual neural network and extended it to the case of a noisy setting. As a consequence, non-trivial steady states have been obtained for the limiting Fokker-Planck type model. In the last section we have validated our analysis and presented simple machine learning applications, namely regression and classification problems.

Our study may yield insights that help to understand the performance of residual neural networks. For example, the moment analysis gives practical estimates on the simulation time and on how to choose bias and weight. The gradient algorithm derived from the sensitivity analysis provides us with an update formula to recompute bias and weight after changes in the initial or target conditions. Probably most interestingly, we have seen that our mean field neural network model has in special situations only a Dirac delta function as unique steady state. In these cases the mean field neural network performs clustering tasks. In comparison to the mean field model, the Fokker-Planck neural network model is able to exhibit non-trivial steady states.

Acknowledgments

This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC-2023 Internet of Production – 390621612 and supported also by DFG HE5386/15.

M. Herty and T. Trimborn acknowledge the support by the ERS Prep Fund - Simulation and Data Science. The work was partially funded by the Excellence Initiative of the German federal and state governments.

References