1 Introduction
Let a stochastic process with the following properties be denoted as :
(1) 
where
(2) 
(3) 
where all the parameters including and are known in advanced. Then is normal with the mean
(4) 
and the variance
(5) 
Since the distribution is normal, best unbiased estimate is also normally distributed, and should not need moments beyond the first two. The formulation above is a kernel of an existing problem of short rate computation within two factor lognormal models.
Given the distribution of we can deduce the distribution . The problem can be formulated as follows: given the distribution find the distribution . Under the method of moments , since this is a theoretical measure. We train a neural network to test, whether on an average sample it can outperform the theoretical prediction.
The same Neural network can then outperform the Method of moments in a set generated from different parameters data.
2 Generating training and validation data
2.1 Introduction
In order to train the neural network to perform the same functions as stochastic equations, a large data series generated. In order to improve the precision of computation in line with the possible evolution of paths, we use a branching technique, which increases the number of unique paths by a multiple of 4 at every timestep (discretization of continuous process).
2.2 Generating data from Stochastic equations
We generate a large dataset (5,000 simulations based on individual realisations of the model, for each simulation). This is achieved by branching each of the outcomes in the previous step by multiple of 4 (i.e. 1st step has 4 results per simulation, the next 16 and so on, until 12th)  see Appendix 2 for more information. In order to derive formulation for our neural network we will have to reduce the model to percentile outcomes.
Given formulation of 2factor BlackKarasinski model as:
(6)  
(7) 
where
and are constants, define a new state variable where:
Then we can rewrite the original equation as:
The above, is a classic formulation of 2 factor HullWhite model. This then allows [4] the G2++ formulation of the model, where the state variables are decoupled. The new formulation will have the following terms:
(8) 
Such that (8) is a deterministic equation relative to time t.
Reformulation of the above into two independent Vasicek equations is obvious. The attempt to estimate the original series from the previous data (i.e. decomposition and recomposition independent series) creates additional estimation error with its own noise. Hence in order to reduce the method derived error we will attempt to do the estimation on the joint series.
Explicitly this would mean that each percentile value of the outcome would need to be adjusted by the joint drift over the time step, and then amended for the standard deviation. However, having tested various possible drift functions, we came to conclusion that not introducing any drift to the previous percentile values gives the most stringent version of the test.
3 Method of Moments
3.1 Introduction
Typically intractable short rate are approximated by binomial trees [2] which rely on theoretical moments of the function. We use the moments of theoretical distribution to compare the results generated by running the model in the process described in Appendix 2.
3.2 Estimating next timestep with Method of Moments
Method of moments in this paper approach describes a theoretical measure: based on the theoretical standard deviation and mean of the distribution we derive the quantiles of the distribution values. The expectations for the next step are easily derived using basic Euler scheme:
Standard deviation over a particular time step are derived from orthogonal decomposition of the above model into two independent Vasicek processes. This approach is standard when approximating the distribution of path with binomial trees. This results in the following expression:
Where
(9)  
(10)  
(11)  
(12)  
(13) 
Given a realization of the process at step , we can index it by in order to arrive at the expected values over the timestep .
Having adjusted for the mean, we can then compute the realized percentiles from the theoretical standard deviation of the distribution – effectively deriving the percentiles from the theoretical first two moments of the distribution. is an array of for at .
Where denotes an 2D array shifted by +1 along the t.
Where is defined as the variance and covariance of the two processes
4 Model formulation  Neural Network
4.1 Introduction
In order to show the possibility of additional information captured by looking at distribution as a whole we use the simplest neural net possible, and attempt not to overfit the data. We verify the results against the mean scenario, and scale it by standard deviation of resulting from stochastic errors between all scenarios for each percentile.
We can create a state space representation of required function of recurrent neural network as such:
where optimising for D matricies and vectors with stochastic gradient descend method [3] . Where is a 10 by 200 matrix, and is a 200 by a 10 matrix, and is a 10 parameter vector, and is a 200 parameter vector.
4.2 Neural Network Specification
We will then train a simple (single layer, 10 nodes) neural network to derive additional information (relying on the values of other percentiles) about the distribution from the percentiles of the previous timestep at time t (i.e. n= 200 values corresponding to 0.5% to 100% quantiles of the distribution), to predict the outcomes at time t+1.
Such that the trained neural network is able to produce outcomes that lie within 1 standard deviation (this is relative error in graphs below) from true average of the process in Figure 1. 1. Further refinements can be made to reduce the error (which is almost constant, relative to t).
We used MLPRegressors ([6]
). This is a single hidden layer “Multilayer Perceptron regressor”
[5], and limited the number of nodes in a single hidden layer to 10 regressors.The network uses 10 sigmoid nodes to regress the input (in form of percentile values) against the output (percentile values at t+1). It fits the weights and intercepts to inputs to produce outputs, and this is then repeated for each incremental month from 2nd to 12th, for all 5,000 scenarios.
The notable effect of this is constraining the amount of information carried from one example to the next. It is also useful to note, that when the number of regressors increases the fit is increasingly aligned with the method of moments.
The graphs below present the forecast error on average for each of the quantiles at 3rd, 6th, 9th and 12th timestep standardized by standard error for the 5,000 scenarios, on that dataset.
Figure 1 goes here
5 Simulation
The original hypothesis is that the previous value of distribution carries no bearing on the way the distribution looks at the next time step if this statement is true then the neural network (we train on data at t=0) should not be able to learn any features that would enable it to beat method of moments.
However, here the variance of the process increases with predetermined amount sigma (as per theoretical distribution), so if we have previous distribution correctly standardized against the theoretical distribution in the previous step, the distribution of the next step would be best predicted by the percentile outcomes of the previous step.
As both processes utilize Brownian motion, we can use these percentiles directly to predict the future standardized percentiles. By relying on sigmoid/logistic function, we can use the fact that the normal distribution is relatively well approximated by the logistic equation in terms of CDF and derive the new values of CDF.
time t=  Neural  NN out of  Method of  MofM out 

Network  sample  Moments  of sample  
2  0.05753  0.0072  0.13586  0.63372 
3  0.02049  0.0213  0.06481  0.06242 
4  0.00290  0.00333  0.06060  0.06080 
5  0.00565  0.00574  0.03220  0.03186 
6  0.00268  0.00216  0.00999  0.01015 
7  0.00298  0.00253  0.00621  0.00623 
8  0.00299  0.00194  0.00439  0.00431 
9  0.0015  0.0012  0.00313  0.00323 
10  0.00072  0.00059  0.00252  0.00253 
11  0.0007  0.00085  0.00209  0.00207 
12  0.00136  0.00157  0.00168  0.00173 
We evaluate the quality of fit on the mean of the process (which is not part of training dataset), in order to remove stochastic error from the evaluation process. We also standardize the estimation by stochastic error of each percentile. In order to verify the veracity of the method we test it against a different dataset (not used for training, and with 1000 trails on different values) to verify that the method does indeed produce superior performance to the method of moments. This result holds across different time points across the 12 month evolution of the process.
6 Conclusion
Compared with method of moments the neural network produces lower RMSE. Due to constant changes in the calibration results this method is superior, because it’s more accurate and faster.
Two things of note: the errors appears to have a constant structure which suggests training a time dependent neural net, or indeed a simple regression may further reduce the error. The fit of the method of moments is better around the mean, hence a further process can be added to place more reliance on theoretical mean around that point.
7 Appendix 1
Table of calibrated values
Here the mean reversion parameters are the log of
Variable  Period 1  Period 2  Validation 

0.1759  0.1776  0.1776  
0.0785  0.0819  0.0785  
0.3423  0.3407  0.3407  
0.2242  0.2177  0.2242  
0.0377  0.0377  0.0377  
Rate at t=0  0.0307  0.0394  0.0307 
8 Appendix 2
Data generation
The dataset for training the model is generated by running the full formulation of the original equation of Black Karasinski and Euler discretization scheme [1] reformulating equation to give:
at monthly timestep with four new results for generating from two Normal nonsymmetric shocks for each node (total of 8 random numbers per every time steps), over 12 monthly timesteps. We record the percentiles over each time step. We do not record the full dataset because at the last timestep there are unique interest rate paths with floating rate precision for each trial which amounts to over 4GB of storage. We then simplify this data set by condensing this dataset to its percentiles (total of 200 from 0.5% to 100%)
We take advantage of the fact that values the are closer to the point of origin require less dispersion  I.e. there are fewer significantly different paths, but this number grows with time.
9 Data availability statement
Data available on request from the primary author
References
 [1] (1996) The law of the euler scheme for stochastic differential equations. Probability theory and related fields 104 (1), pp. 43–60. Cited by: §8.
 [2] (1998) Binomial term structure models. Mathematica in Education and Research 7, pp. 11–19. Cited by: §3.1.
 [3] (2010) Stochastic gradient descent. Note: https://leon.bottou.org/projects/sgdAccessed: 20100930 Cited by: §4.1.
 [4] (2001) Interest rate models  theory and practice. Springer Finance, Springer. External Links: ISBN 9783540417729, LCCN 01034206, Link Cited by: §2.2.

[5]
(1990)
Connectionist learning procedures. artificial intelligence, 40 13: 185 234, 1989. reprinted in j. carbonell, editor,”
. Machine Learning: Paradigms and Methods”, MIT Press. Cited by: §4.2.  [6] (2011) Scikitlearn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.