Let a stochastic process with the following properties be denoted as :
where all the parameters including and are known in advanced. Then is normal with the mean
and the variance
Since the distribution is normal, best unbiased estimate is also normally distributed, and should not need moments beyond the first two. The formulation above is a kernel of an existing problem of short rate computation within two factor log-normal models.
Given the distribution of we can deduce the distribution . The problem can be formulated as follows: given the distribution find the distribution . Under the method of moments , since this is a theoretical measure. We train a neural network to test, whether on an average sample it can outperform the theoretical prediction.
The same Neural network can then outperform the Method of moments in a set generated from different parameters data.
2 Generating training and validation data
In order to train the neural network to perform the same functions as stochastic equations, a large data series generated. In order to improve the precision of computation in line with the possible evolution of paths, we use a branching technique, which increases the number of unique paths by a multiple of 4 at every timestep (discretization of continuous process).
2.2 Generating data from Stochastic equations
We generate a large dataset (5,000 simulations based on individual realisations of the model, for each simulation). This is achieved by branching each of the outcomes in the previous step by multiple of 4 (i.e. 1st step has 4 results per simulation, the next 16 and so on, until 12th) - see Appendix 2 for more information. In order to derive formulation for our neural network we will have to reduce the model to percentile outcomes.
Given formulation of 2-factor Black-Karasinski model as:
and are constants, define a new state variable where:
Then we can re-write the original equation as:
The above, is a classic formulation of 2 factor Hull-White model. This then allows  the G2++ formulation of the model, where the state variables are decoupled. The new formulation will have the following terms:
Such that (8) is a deterministic equation relative to time t.
Reformulation of the above into two independent Vasicek equations is obvious. The attempt to estimate the original series from the previous data (i.e. decomposition and re-composition independent series) creates additional estimation error with its own noise. Hence in order to reduce the method derived error we will attempt to do the estimation on the joint series.
Explicitly this would mean that each percentile value of the outcome would need to be adjusted by the joint drift over the time step, and then amended for the standard deviation. However, having tested various possible drift functions, we came to conclusion that not introducing any drift to the previous percentile values gives the most stringent version of the test.
3 Method of Moments
Typically intractable short rate are approximated by binomial trees  which rely on theoretical moments of the function. We use the moments of theoretical distribution to compare the results generated by running the model in the process described in Appendix 2.
3.2 Estimating next timestep with Method of Moments
Method of moments in this paper approach describes a theoretical measure: based on the theoretical standard deviation and mean of the distribution we derive the quantiles of the distribution values. The expectations for the next step are easily derived using basic Euler scheme:
Standard deviation over a particular time step are derived from orthogonal decomposition of the above model into two independent Vasicek processes. This approach is standard when approximating the distribution of path with binomial trees. This results in the following expression:
Given a realization of the process at step , we can index it by in order to arrive at the expected values over the timestep .
Having adjusted for the mean, we can then compute the realized percentiles from the theoretical standard deviation of the distribution – effectively deriving the percentiles from the theoretical first two moments of the distribution. is an array of for at .
Where denotes an 2D array shifted by +1 along the t.
Where is defined as the variance and covariance of the two processes
4 Model formulation - Neural Network
In order to show the possibility of additional information captured by looking at distribution as a whole we use the simplest neural net possible, and attempt not to overfit the data. We verify the results against the mean scenario, and scale it by standard deviation of resulting from stochastic errors between all scenarios for each percentile.
We can create a state space representation of required function of recurrent neural network as such:
where optimising for D matricies and vectors with stochastic gradient descend method  . Where is a 10 by 200 matrix, and is a 200 by a 10 matrix, and is a 10 parameter vector, and is a 200 parameter vector.
4.2 Neural Network Specification
We will then train a simple (single layer, 10 nodes) neural network to derive additional information (relying on the values of other percentiles) about the distribution from the percentiles of the previous timestep at time t (i.e. n= 200 values corresponding to 0.5% to 100% quantiles of the distribution), to predict the outcomes at time t+1.
Such that the trained neural network is able to produce outcomes that lie within 1 standard deviation (this is relative error in graphs below) from true average of the process in Figure 1. 1. Further refinements can be made to reduce the error (which is almost constant, relative to t).
We used MLPRegressors (
). This is a single hidden layer “Multilayer Perceptron regressor”, and limited the number of nodes in a single hidden layer to 10 regressors.
The network uses 10 sigmoid nodes to regress the input (in form of percentile values) against the output (percentile values at t+1). It fits the weights and intercepts to inputs to produce outputs, and this is then repeated for each incremental month from 2nd to 12th, for all 5,000 scenarios.
The notable effect of this is constraining the amount of information carried from one example to the next. It is also useful to note, that when the number of regressors increases the fit is increasingly aligned with the method of moments.
The graphs below present the forecast error on average for each of the quantiles at 3rd, 6th, 9th and 12th timestep standardized by standard error for the 5,000 scenarios, on that dataset.
Figure 1 goes here
The original hypothesis is that the previous value of distribution carries no bearing on the way the distribution looks at the next time step if this statement is true then the neural network (we train on data at t=0) should not be able to learn any features that would enable it to beat method of moments.
However, here the variance of the process increases with predetermined amount sigma (as per theoretical distribution), so if we have previous distribution correctly standardized against the theoretical distribution in the previous step, the distribution of the next step would be best predicted by the percentile outcomes of the previous step.
As both processes utilize Brownian motion, we can use these percentiles directly to predict the future standardized percentiles. By relying on sigmoid/logistic function, we can use the fact that the normal distribution is relatively well approximated by the logistic equation in terms of CDF and derive the new values of CDF.
|time t=||Neural||NN out of||Method of||MofM out|
We evaluate the quality of fit on the mean of the process (which is not part of training dataset), in order to remove stochastic error from the evaluation process. We also standardize the estimation by stochastic error of each percentile. In order to verify the veracity of the method we test it against a different dataset (not used for training, and with 1000 trails on different values) to verify that the method does indeed produce superior performance to the method of moments. This result holds across different time points across the 12 month evolution of the process.
Compared with method of moments the neural network produces lower RMSE. Due to constant changes in the calibration results this method is superior, because it’s more accurate and faster.
Two things of note: the errors appears to have a constant structure which suggests training a time dependent neural net, or indeed a simple regression may further reduce the error. The fit of the method of moments is better around the mean, hence a further process can be added to place more reliance on theoretical mean around that point.
7 Appendix 1
Table of calibrated values
Here the mean reversion parameters are the log of
|Variable||Period 1||Period 2||Validation|
|Rate at t=0||0.0307||0.0394||0.0307|
8 Appendix 2
The dataset for training the model is generated by running the full formulation of the original equation of Black Karasinski and Euler discretization scheme  reformulating equation to give:
at monthly timestep with four new results for generating from two Normal non-symmetric shocks for each node (total of 8 random numbers per every time steps), over 12 monthly timesteps. We record the percentiles over each time step. We do not record the full dataset because at the last timestep there are unique interest rate paths with floating rate precision for each trial which amounts to over 4GB of storage. We then simplify this data set by condensing this dataset to its percentiles (total of 200 from 0.5% to 100%)
We take advantage of the fact that values the are closer to the point of origin require less dispersion - I.e. there are fewer significantly different paths, but this number grows with time.
9 Data availability statement
Data available on request from the primary author
-  (1996) The law of the euler scheme for stochastic differential equations. Probability theory and related fields 104 (1), pp. 43–60. Cited by: §8.
-  (1998) Binomial term structure models. Mathematica in Education and Research 7, pp. 11–19. Cited by: §3.1.
-  (2010) Stochastic gradient descent. Note: https://leon.bottou.org/projects/sgdAccessed: 2010-09-30 Cited by: §4.1.
-  (2001) Interest rate models - theory and practice. Springer Finance, Springer. External Links: Cited by: §2.2.
Connectionist learning procedures. artificial intelligence, 40 1-3: 185 234, 1989. reprinted in j. carbonell, editor,”. Machine Learning: Paradigms and Methods”, MIT Press. Cited by: §4.2.
-  (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.2.