1 Introduction
Neural networks can often have inputs that are uncertain. Input uncertainty can come from measurement error, adversarial noise (Szegedy et al., 2013), or even from output feedback. The majority of work on the subject of uncertainty in neural networks revolves around uncertainty in the model, not necessarily uncertainty in the input itself. Bayesian neural networks perform inference using a prior over the weights of the network, while dropout (Gal and Ghahramani, 2016)
samples iterations of the network with a probabilistic mask to build the posterior distribution of the model. In both cases the output uncertainty emerges as an explicit function of the model uncertainty, while the uncertainty of the input is classified as aleatoric and not explicity propagated through the model. For nonparametric probabilistic methods, such as Gaussian process regression
(Rasmussen, 2004), special care must be taken for multistep prediction with uncertain inputs (Girard et al., 2003). In this paper, we focus on addressing the problem of uncertainty propagation through the function, which is a popular choice for activation function in neural networks, in particular the Echo State Network.In light of the challenges in input uncertainty propagation and their role in recurrent neural networks, we aim to contribute the following:

A theoretical and numerical analysis for 3 methods of propagating input uncertainty through the activation function, with an extension to other nonlinear activation functions.

A new method, named the Probabilistic Echo State Network (PESN), which aims to reduce the time required to achieve the echo state property.
Related Works: There exists recent work on propagating input uncertainty through feedforward bayesian neural networks 5. Here, the authors perform approximate inference to propagate uncertain inputs through feedforward neural networks for classification tasks. We differentiate our work in three fundamental ways: 1) we consider the tanh function which is not addressed in (Shekhovtsov et al., 2018), 2) a novel contribution utilizing splines to propagate gaussian input uncertainty through continuous activations and 3) a general focus on improving the reservoir computing framework. To the best of the authors’ knowledge we are the first to utilize a spline approximation to the integrand of an expectation of a gaussian in order to perform approximate inference. Similar ideas include (Langrock et al., 2015)
, where the authors utilize bsplines to approximate the density in order to perform approximate inference. Work in improving reservoir computing has traditionally focused on the structure of the reservoir, such as optimizing hyperparameters
(Jaeger et al., 2007) or finding the minimum reservoir size (Rodan and Tino, 2011). This work is one of the first to attempt to reduce the time required to converge to the echo state property.1.1 Reservoir Computing
Reservoir computing (RC) is a paradigm for training recurrent neural networks (RNNs) (Lukoševičius and Jaeger, 2009). It was introduced in early 2000’s by Jaeger under the name ‘Echo State Networks’ (ESN) for timeseries predictions (Jaeger and Haas, 2004), and by Maass under the name ‘Liquid State Machines’ (Maass et al., 2004)
for modeling computation in biological networks of neurons. Despite being developed from very different communities, these two approaches are largely mathematically identical. In this work we focus on ESN since we aim to improve the efficiency of RNN learning for engineering applications.
ESN differs from other RNNs in terms of its training scheme. Generally, RC consists of two steps: 1) drive a network with sparse and fixed connections with an input and output sequence. 2) Train the output (or readout) layer so that the network output is similar to the teacher output. The readout layer is usually trained using regression techniques such as linear regression
(Lukoševičius and Jaeger, 2009) and Gaussian process regression (Chatzis and Demiris, 2011). One problem is determining the initial state of the hidden layer (reservoir). According to the echo state property (Jaeger, 2001), the effect of initial conditions can be ‘washed out’ therefore the state can be initialized randomly. However, there are three main drawbacks: first, a significant amount of training data is wasted because the initial period of training run needs to be discarded. Second, at test time, an initial input sequence needs to be fed into the network before using it for prediction tasks. Third, the state forgetting property is usually not guaranteed so the performance of the ESN may still depend on the initial state.2 Propagating Uncertainty through the tanh
In our work we analyze three distinct methods to propagate a gaussian input through the
activation function. The simplest and most well known method is simply MonteCarlo (MC) sampling where we sample the input distribution then pass each sample through the activation to compute an estimate of the moments. While easy to implement and understand, Monte Carlo has an obvious drawback in terms of computational time. The second method utilizes a well known approximation to the
activation function in the form of the logistic cumulative distribution function. The connection between the logistic and gaussian distributions are utilized to approximate the mean of the activation output. The variance approximation fits the moments of the Gaussian pdf to the function
in order make the expectation tractable. The cross covariance terms are ignored in our derivation. While this method is certainly faster, the accuracy is limited by our approximations. The final method leverages spline approximations and analytical expressions for the expectation of polynomial functions. This method strikes a balance between computational complexity and accuracy by adjusting the width of the spline mesh. We compare absolute error of the moment approximations, computational complexity of the two analytical and spline methods, and provide error bounds for the spline approximation.2.1 Analytical Approximation to Mean and Variance
First we relate the
function with the logistic cumulative distribution function CDF, and approximate the logistic distribution with an appropriate gaussian distribution. We assume that the dimensions of the logistically distributed random vector
are independent given the location parameter . The expectation and variance of given are respectively. We can use moment matching to approximate with a gaussian(1)  
(2)  
(3) 
Here the logistic distribution is parameterized by the location and the scale . The function is a function of a logistic CDF with location and scale . Additionally let us assume that . The mean of the distribution, , is given by:
(4) 
Next, we ignore the cross covariance terms between the elementwise , and directly compute the diagonals of the output covariance matrix. The authors in (Bassett, 1998) investigated different approximations to the error function. One preliminary solution was the use of the tanh function to approximate the error function. We can take advantage of this approximation and the appropriate scaling factors to approximate the function . We use a Gaussian pdf where and . Then we can again apply similar algebraic manipulations to compute the variance terms. The full derivation for the mean and variance expressions can be found in the supplementary material.
(5) 
(6) 
2.2 Spline Approximation to Mean and Variance
In this section, we present a formulation for computing the moments of the transformation utilizing a spline approximation of the argument of the expectation. Again let us assume that , the expectation and variance of the elementwise are given by the following equations:
(7)  
(8)  
Let us assume that we are dealing with scalar inputs to the activation function, since the vector case is handled elementwise and we are ignoring the offdiagonal elements of the input variance. Instead of direct numeric integration, we propose taking advantage of three facts.

The tails of the Gaussian pdf approach zero, thus when multiplied against a function with constant tails, such as the , the integrand also goes to zero. Thus we can approximate the integral over the real line using an integral over some compact subset centered around the mean of the input.

The function is continuous so we can uniformly approximate it with polynomials on the compact subset of integration.

We can derive analytic forms for the definite integral of a product of polynomials and exponentials.
As a result, we can derive expressions for the mean and variance in terms of these analytic integrals. Let be a scalar valued function that is continuous on the real interval . Additionally, let
be a piecewise continuous cubic polynomial that interpolates
with nodes, where each node is the beginning of intervals , where and .(9)  
(10)  
(11)  
(12) 
Within the inner sum, the four terms can be computed analytically utilizing expressions for the integral of the product between polynomials and exponentials. We provide the full expressions for polynomials up to 3rd order in the supplementary material. The nodes and coefficients can be computed prior to evaluation of the network, and the integrals are computed once at every layer. This method is in fact general for any continuous function
whose tails are bounded, or polynomial outside of a compact set. Additionally, we can move forward to approximate higher moments of the output distribution and get expressions for the skew and kurtosis, at the cost of having spline approximations for
and . We provide expressions for higher moments and for alternative activation functions in the supplement. The mean and variance are given below, and Algorithm 1 describes how to use this technique to compute the moments of the activation.(13)  
(14)  
(15)  
(16) 
2.3 Computational Complexity Analysis
Next we look at the computational complexity of both the analytical and spline method in terms of the number of hidden states in the network. Let D be the number of hidden states in the network and let N be the number of mesh points for the spline approximation. Since the spline method makes frequent use of the error function, to approximate the operation count, we use the error function approximation introduced in (Abramowitz and Stegun, 1964). We can see from Table 1 that if we use the spline approximation of , the time complexity is (ND), while the analytical approximation of has the complexity of (D). If we choose a fixed number of Monte Carlo samples , then the Monte Carlo method is also linear with respect to reservoir since the activation is elementwise and cross correlations are ignored. All 3 are linear in reservoir size and in practice we can see the this behavior of all 3 methods in Figure 2.
2.4 Error Bounds for Mean and Variance Estimation
In this section we provide an upper bound on the mean and variance estimates as a function of the input mean, input variance, the number of nodes in the mesh, and the location of the mesh in space. Let , and define as the maximum interval length in the mesh spanning some compact set . Then for a cubic spline approximation , we can place an upper bound on the maximum error over the interval (De Boor, 1978).
(17) 
Let us use . Given the fact that , we can compute an error bound over the mean of the output distribution by applying the triangle inequality.
(18)  
where , , . Let us additionally compute a bound for the variance. Let denote the true value of the mean, be the approximated value of the mean. Additionally, let be our polynomial approximation of . Then we go to the expression for variance Equation (78):
(19)  
is the spline error bound where the unknown function is . Additionally, we know that and that since we are dealing with the activation function, the activation mean is upper bounded by 1. Utilizing this method allows us to design the number of uniformly spaced mesh points and the size of the interval according to a desired accuracy. For reference, we utilize the interval [10, 10] for the mean of the activation, and use mesh points, resulting in an error bound of for the mean of an input distribution with mean and variance . Again, the full derivation for both error bounds can be found in the supplement.
3 Probabilistic Echo State Networks
The proposed algorithm, Probabilistic Echo State Networks (PESN) derives its internal equations from the deterministic echo state network (Jaeger, 2001). Let be input to the network, be the output of the network, is the hidden state at time . The deterministic echo state network is characterized by 5 hyperparameters: reservoir size , leak rate , noise magnitude , sparsity fraction , and spectral radius .
(20)  
(21) 
Here is Gaussian noise with variance 1. The sparsity fraction refers to the fraction on nonzero elements in the matrix , while the spectral radius parameter is the spectral radius of . For the echo state network, we randomly generate the internal weight matrices , , and , and we only train the readout weight matrix . In our case, training is accomplished using batch least squares. At test time, we recursively update the readout matrix using recursive least squares regression. (Simon, 2006)
For probabilistic echo state networks, assume that the input to the networks are Gaussian. We use moment matching to propagate the Gaussian uncertainty through the nonlinearity in the network, thus we retrieve a Gaussian output. Our notation is as follows: a Gaussian random vector has a mean and covariance matrix . We assume that all Gaussian vectors in our echo state network are additionally jointly Gaussian, so that linear combinations of these Gaussian vectors produce another Gaussian vector with appropriate mean vector and covariance matrix. Define the input as , the previous target as and the hidden state as .
(22)  
Using our elementwise approximation, we drop the offdiagonal terms in and compute the approximate mean and variance using Equations (77) and (78). We also ignore the cross correlation terms, which will cause underestimation of uncertainty.
(24)  
(25)  
(26)  
(27)  
(28) 
Here are the five hyperparameters for the echo state network, and are input and output data, generated or collected from experiments, and are the lengths of the train and test data, is a forgetting factor, is where I
is the identity matrix. Algorithm
2 is used for training. Algorithm 3 is used for single and multistep prediction for the probabilistic echo state network.4 Experimental Evaluation
4.1 Numerical Analysis of Moment Approximation
First we compare the absolute error between the mean and variance estimated from Monte Carlo against the mean and variance from spline method and analytic method. We evaluate our moment propagation on a range input means and select variance values to assess accuracy. At each point, we have an input Gaussian distribution determined by a mean and variance. We sample this input distribution and compute an estimated mean and variance through Monte Carlo, then compare our two methods. Figure 4 shows that the spline method results in a more accurate approximation, and as the input saturates (mean approaches 5 and 5), the error quickly drops. In general we see the analytic approximation over estimates the variance of the output. Full contour plots of the spline and analytical comparison for a grid of both means and variance are available in the supplementary materials. We see in Figure 3 how computational time effects absolute error of moments. The analytic method is constant, but has comparatively high error compared to the others. The spline method has higher initial computational cost, but converges in error quickly.
4.2 Effect of Probabilistic Hidden State on Washout Period
In Table 2 we numerically analyze the difference in multistep prediction error between the probabilistic echo state and the deterministic echo state algorithms as a function of the number of washout points. Each of the 50 trials were run with varying washout lengths (in timesteps), where we ran 50 samples of the deterministic echo state algorithm and compared the average absolute error between the ground truth and predicted state. The table shows the the probabilistic echo state network outperforms the montecarlo deterministic echo state in absolute error for the majority of washout lengths. When the Monte Carlo deterministic ESN outperforms the PESN, it does so by a fairly small margin. This implies that the PESN reduces the time required to attain the “echo state property” that is characteristic for echo state networks.




Washout Length (timesteps)  Shannon Entropy Statistics  
mean  min  max  
P  MC  P  MC  P  MC  
1  5.42  5.54  5.42  5.551  
10  5.61  5.69  5.62  5.70  
20  5.82  5.86  5.84  5.87  
30  5.88  5.93  5.89  5.94  
40  5.77  5.92  5.78  5.93  
50  5.62  5.86  5.62  5.88  
100  5.35  5.70  5.35  5.72  
200  4.94  5.41  4.95  5.45 
In an attempt to quantify the level of synchronization in the hidden states of both the ESN and the PESN, we can create a probability distribution of the values of the reservoir values during washout. Synchronization would imply that there is less uncertainty in the values of the hidden state, thus we can quantify this via the Shannon Entropy of this probability distribution. Table 3 lists the Shannon entropy values for hidden state trajectories. Intuitively, the entropy is small for every washout length, which implies there is less randomness to “wash out”.
4.3 Model Learning
We test the PESN on the task of learning dynamics of cart pole and Gazebo ARDrone. We compare the PESN against Monte Carlo estimates obtained with the deterministic ESN. The PESN successfully propagates uncertainty for single step predictions and for up to 20 timesteps for multi step prediction. The results, found in the supplementary material Section 3.3, demonstrate the ability of our method to capture input uncertainty as it passes through the recurrent network.
5 Conclusions
In this work, we investigate two different methods of propagating uncertainty through the function. We demonstrate that the analytical method is the fastest way to propagate uncertainty through the function, however it tends to over estimate uncertainty. The spline method is more general and has tuneable accuracy, however the computational cost is higher. Utilizing these new developments, we propose the probabilistic echo state network (PESN), which attempts to solve a fundamental problem of reservoir computing, which is the time required to fulfill the asymptotic behavior of the echo state property. We are able demonstrate, for multi step regression tasks, the PESN has better performance than a deterministic ESN (as measured by absolute error). The better performance is accompanied by a lower shannon entropy in the distribution of hidden state value distribution, which we believe indicates faster convergence to the echo state property. We additionally test the PESN method on ARDrone data collected from the Gazebo simulator, to demonstrate that the method can scale to higher dimensional systems and still propagate uncertainty effectively. However, these methods are not without their drawbacks. For the spline method, input, output scaling and spectral radius of the PESN must be tuned fit inside interval so that we satisfy our moment error bounds. The multistep regression is heavily dependent on incremental updating of the output weights in order to maintain low absolute error. Both the spline and analytic methods are tractable techniques to propagate uncertainty through the function, and the resulting Probabilistic Echo State network is able to improve the convergence to the echo state property.
References
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.

Rasmussen (2004)
Carl Edward Rasmussen.
Gaussian processes in machine learning.
In Advanced lectures on machine learning, pages 63–71. Springer, 2004.  Girard et al. (2003) Agathe Girard, Carl Edward Rasmussen, Joaquin Quinonero Candela, and Roderick MurraySmith. Gaussian process priors with uncertain inputs application to multiplestep ahead time series forecasting. In Advances in neural information processing systems, pages 545–552, 2003.
 Shekhovtsov et al. (2018) Alexander Shekhovtsov, Boris Flach, and Michal Busta. Feedforward uncertainty propagation in belief and neural networks. arXiv preprint arXiv:1803.10590, 2018.

Langrock et al. (2015)
Roland Langrock, Thomas Kneib, Alexander Sohn, and Stacy L DeRuiter.
Nonparametric inference in hidden markov models using psplines.
Biometrics, 71(2):520–528, 2015.  Jaeger et al. (2007) Herbert Jaeger, Mantas Lukoševičius, Dan Popovici, and Udo Siewert. Optimization and applications of echo state networks with leakyintegrator neurons. Neural networks, 20(3):335–352, 2007.
 Rodan and Tino (2011) Ali Rodan and Peter Tino. Minimum complexity echo state network. IEEE transactions on neural networks, 22(1):131–144, 2011.
 Lukoševičius and Jaeger (2009) Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
 Jaeger and Haas (2004) Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
 Maass et al. (2004) Wolfgang Maass, Thomas Natschläger, and Henry Markram. Computational models for generic cortical microcircuits. Computational Neuroscience: A Comprehensive Approach, pages 575–605, 2004.
 Chatzis and Demiris (2011) Sotirios P Chatzis and Yiannis Demiris. Echo state gaussian process. IEEE Transactions on Neural Networks, 22(9):1435–1445, 2011.
 Jaeger (2001) Herbert Jaeger. The echo state approach to analysing and training recurrent neural networkswith an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
 Bassett (1998) Bruce Bassett. e x2 dx and the kink soliton. Computers & Mathematics with applications, 36(4):37–45, 1998.
 Abramowitz and Stegun (1964) Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Courier Corporation, 1964.
 Press et al. (2007) William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical recipes 3rd edition: the art of scientific computing. Cambridge University Press, 2007.
 De Boor (1978) Carl De Boor. A practical guide to splines. SpringerVerlag New York, 1978.
 Simon (2006) Dan Simon. Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley & Sons, 2006.
 Petersen et al. (2008) Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
 Engel et al. (2012) J. Engel, J. Sturm, and D. Cremers. Camerabased navigation of a lowcost quadrocopter. In Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
6 Supplementary Material
This is the supplementary material for the paper on Propagating Uncertainty through the Tanh function for Reservoir Computing. The supplement is organized as follows: Section 6.1 includes derivations for both the analytic and spline methods, derivations for the error bound for mean and variance of the spline tanh approximation. Section 6.2 repeats the equations for the Probabilistic Echo State Network (PESN). Section 6.3 contains additional numerical examples. These include propagation of uncertainty through a feedforward neural network for regression, dynamical systems regression, and activation function comparisons.
6.1 Propagating Uncertainty through Tanh
6.1.1 Analytical Approximation of Mean and Variance
First we present some useful identities and approximations: The cumulative distribution function (CDF) of a logistic distribution over a random variable
is:(29)  
(30) 
Here the logistic distribution is parameterized by the location and the scale . The function can be written as follows and relation can be derived:
(31)  
(32)  
(33)  
(34)  
(35)  
(36) 
Thus the function is a function of a logistic CDF with location (the input to the ) and scale . Let us assume that the dimensions of the logistically distributed random vector are independent given the location parameter . From the definition of the logistic distribution, the mean and variance of the conditional distribution are:
(37) 
We also make use of the formula for the product of two Gaussian distributions(Petersen et al. [2008]).
(38) 
where
(39)  
(40)  
(41) 
Let us assume that . We want to compute the moments of the distribution: . We will make use of moment matching to approximate a logistic distribution with a Gaussian, and vice versa.
(42) 
Mean: First we compute the mean of .
(43)  
(44)  
(45)  
(46)  
(47)  
(48)  
(49)  
In Equation (6.1.1), we use the product rule of Gaussians, and specify: , , , and . Then we are left with the constant coefficient in front of a Gaussian pdf of the input . As a result, the constant coefficient is not a function of the input, thus the integral over this Gaussian pdf integrates to 1.
(51)  
(52)  
(53) 
Thus, given the input mean and diagonal covariance function, we can compute the posterior mean of an elementwise tanh operation.
Variance: The author in Bassett [1998] investigated different approximations to the error function. One preliminary solution was the use of the function to approximate the error function. Using this fact, we see that with the appropriate scaling factors, the Gaussian pdf can be an approximation to the function . Again, we are ignoring cross correlation terms, thus the output variance will be diagonal and we can compute each term separately.
erf(x)  (54)  
(55)  
(56) 
where and .
(57)  
(58)  
(59)  
(60)  
6.1.2 Spline Approximation of Mean and Variance
In this section, we present a formulation for computing the moments of the transformation utilizing a spline approximation of the argument of the expectation. Again let us assume that , the expectation and variance of the elementwise are given by the following equations:
(63)  
(64)  
In both cases the integrands are not easily integrable, and depend on the mean and variance of the input. Let us assume that we are dealing with scalar inputs to the activation function, since the vector case is handled elementwise and we are ignoring the offdiagonal elements of the input variance. Instead of direct numeric integration, we propose taking advantage of three facts. First, we know that the tails of the Gaussian pdf approach zero, thus when multiplied against a function with constant tails, the integrand also goes to zero. The , both have constant tails. For the and activation functions, the negative tail is 0, and the positive tail is , which has an analytical form when integrated against a Gaussian. Thus we can approximate the integral over the real line using an integral over some compact subset centered around the mean of the input. Second, most activation functions are continuous, and on the compact subset of integration, we can uniformly approximate these functions with polynomials, in fact we will utilize piecewise cubic polynomials for this approximation. Finally, we can derive analytic forms to the integrals of polynomials multiplied against exponentials. The results are expressions for both the output mean and variance that are functions of the exponential function, error function, and input mean and variance. Let be a scalar valued function that is continuous on the real interval . Additionally, let be a piecewise continuous cubic polynomial that interpolates with nodes, where each node is the beginning of intervals , where and .
(65)  
(66)  
(67)  
(68) 
Within the inner sum, the four terms can be computed analytically, however let us first make a change of variables to simplify the expressions. Let , then . Additionally we should transform our bounds: and . Next we can compute the integrals:
(69)  
(70) 
Comments
There are no comments yet.