Recurrent neural networks (RNNs) are state-of-the-art nonparametric methods for sequence learning that map an input sequence to an output sequence by predicting the next time steps. Training RNNs with the backpropagation through time algorithm is challenging due to vanishing and exploding gradients, where the norm of the backpropagated error gradient can decrease or increase exponentially, hindering the network in capturing long-term dependencies.
Three main solutions have been proposed in the literature to improve RNN training: modifications of the training algorithm, different weight initialization schemes, or modifications of the network architecture. In the first approach, advanced optimization techniques such as the Hessian-free method or regularized loss functions are applied to improve the backpropagation through time algorithm for learning long sequences. The second approach is to properly initialize the RNN weight matrices, for example, to be identity or orthogonal, to alleviate the long-term dependency problem. The third approach is to employ nonlinear reset units in the RNN architecture to store information for a long time, for instance, using long short-term memory (LSTM) networks or gated recurrent units (GRUs).
LSTM networks, the most successful type of RNNs, use a gated architecture that replaces the hidden unit with a memory cell to efficiently capture long-term temporal dependencies by storing and retrieving sequence information over time. The memory cell provides a feedback path, and three nonlinear (multiplicative) reset units keep the backpropagated error signal nearly constant. The input and output gates learn to incorporate new information into the cell and to control the cell's contribution to the output, respectively. A forget gate additionally learns to remember or forget the stored information over time by scaling the cell content. Therefore, in contrast to vanilla RNNs, LSTM units by design allow gradients to flow unchanged; nevertheless, they can still suffer from instabilities (the exploding gradient problem) when trained on long sequences.
In this paper, a simple, yet robust initialization method is proposed to tackle the training instabilities in LSTM networks. The idea is based on normalized random initialization of the network weights with the property that the input and output signals have the same variance. The proposed method is applied to standard LSTMs [9, 5] for univariate time series regression using data from the UCR Time Series Archive and to LSTMs robust to missing values for multivariate disease progression modeling in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort using volumetric magnetic resonance imaging (MRI) measurements.
2 Related Work
Since deep neural network training is achieved by solving a nonconvex optimization problem, mostly in a stochastic way, a suitable random weight initialization scheme is important for fast convergence and stability. Otherwise, the magnitudes of the input signal and error gradients at different layers can exponentially decrease or increase, leading to an ill-conditioned problem. Standard initialization of weights using zero-mean uniform or Gaussian distributions, with heuristically chosen variances or a variance that depends on the input layer size, has been widely used in previous studies. However, studies on initialization, for instance, using unsupervised pre-training, showed its importance as a regularizer for the optimization procedure to robustly reach a local minimum and to improve generalization.
Accordingly, training difficulties have been investigated based on the variance of the responses in each layer when the singular values of the Jacobian are not equal to one, and a normalized initialization of uniform weights with a variance of 2/(n_in + n_out), where n_in and n_out are the sizes of the previous and current layers, was suggested assuming that the activation functions are identity and/or hyperbolic tangent. Likewise, a scaled initialization method has been developed to train deep rectified models from scratch using zero-mean Gaussian weights whose variances are 2/n_in.
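These two classical schemes can be sketched in NumPy as follows; the function names and signatures are illustrative, not taken from any of the cited works:

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # Normalized initialization: uniform weights whose variance is
    # 2 / (n_in + n_out). Since Var(U(-a, a)) = a^2 / 3, the bound is
    # a = sqrt(6 / (n_in + n_out)).
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    # Scaled initialization for rectified units: zero-mean Gaussian
    # weights with variance 2 / n_in.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```

Both functions return a weight matrix shaped (n_out, n_in), matching the convention of multiplying the input vector from the left.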
To resolve the long-term temporal dependency problem in RNNs, which can be seen as deep networks when unfolded through time, the (scaled) identity matrix has been applied to initialize the hidden (recurrent) weight matrix so that, in the absence of the current inputs, the previous hidden state is output in RNNs composed of rectified linear units (ReLU). Alternatively, (nearly) orthogonal matrices and scaled positive-definite weight matrices have been used to address vanishing and exploding gradients in RNNs by preserving the gradient norm during backpropagation.
As can be seen, different initialization methods have been proposed to deal with the training convergence problem in deep neural networks including RNNs, assuming that LSTMs by design can handle the issue. Hence, the abovementioned initialization methods, e.g., orthogonal recurrent weight matrices together with current input weight matrices drawn i.i.d. from zero-mean Gaussian distributions with scaled variances, have also been applied to LSTMs. However, as noted before, LSTMs can still suffer from instability with improper initialization due to the stochastic nature of the optimization and the use of multiplicative gates and feedback signals.
3 The Proposed Initialization
To address training instability and slow convergence in LSTMs, we propose a scaled random weight initialization method that aims to keep the variance of the network input and output in the same range. Let x_t^k be the k-th observation of a d-dimensional input vector at time t. The feedforward pass of an LSTM network can be expressed as

f_t^k = φ(W_f x_t^k + R_f h_{t−1}^k + b_f),
i_t^k = φ(W_i x_t^k + R_i h_{t−1}^k + b_i),
g_t^k = ψ(W_g x_t^k + R_g h_{t−1}^k + b_g),
c_t^k = f_t^k ⊙ c_{t−1}^k + i_t^k ⊙ g_t^k,
o_t^k = φ(W_o x_t^k + R_o h_{t−1}^k + b_o),
h_t^k = o_t^k ⊙ θ(c_t^k),

where f_t^k, i_t^k, g_t^k, c_t^k, o_t^k, and h_t^k are the k-th observations of the forget gate, input gate, modulation gate, cell state, output gate, and hidden output at time t, respectively, and m is the number of output units. Also, W_f, W_i, W_g, W_o ∈ R^{m×d} are weight matrices containing the connecting weights from the input to the gates and cell, R_f, R_i, R_g, R_o ∈ R^{m×m} are weight matrices containing the connecting weights from the recurrent input to the gates and cell, b_f, b_i, b_g, b_o ∈ R^m denote the corresponding biases of the neurons, and ⊙ is the Hadamard product. Finally, φ, ψ, and θ are nonlinear activation functions allocated to the gates, input modulation, and hidden output, respectively. Note that, in a regression problem, m = d, and h_t^k is an estimation of x_{t+1}^k. The regression assumptions can still be applied to sequence-to-sequence or sequence-to-label learning problems simply by adding a fully-connected layer with m input nodes and a desired number of output units.
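The feedforward pass above can be sketched in NumPy as follows; the dictionary-based parameter layout and function names are our own illustration, and the defaults for φ, ψ, and θ are the commonly used sigmoid/tanh choices rather than a prescription from this paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b,
              phi=sigmoid, psi=np.tanh, theta=np.tanh):
    """One feedforward step of an LSTM. W, R, and b are dicts keyed by
    'f', 'i', 'g', 'o' holding the input weights, recurrent weights,
    and biases of each unit, respectively."""
    f = phi(W["f"] @ x + R["f"] @ h_prev + b["f"])   # forget gate
    i = phi(W["i"] @ x + R["i"] @ h_prev + b["i"])   # input gate
    g = psi(W["g"] @ x + R["g"] @ h_prev + b["g"])   # input modulation
    c = f * c_prev + i * g                           # cell state update
    o = phi(W["o"] @ x + R["o"] @ h_prev + b["o"])   # output gate
    h = o * theta(c)                                 # hidden output
    return h, c
```

For a regression problem with m = d, the same step is applied along the sequence, feeding each h back in as the next recurrent input.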
Assume that all of the weight matrices are independently initialized with zero-mean i.i.d. random values drawn from a symmetric distribution. The goal is to derive the condition(s) on the initialization of the weights to achieve Var(h_t^k) = Var(x_t^k). Since the weights are independent of the input, assuming an exact estimation for the recurrent value, i.e., h_{t−1}^k = x_t^k, and mutually independent zero-mean input features sharing the same distribution, the variance of the forget gate can be calculated as

Var(f_t^k) = d Var(w_f) Var(x_t^k) + m Var(u_f) Var(h_{t−1}^k) = m (Var(w_f) + Var(u_f)) Var(x_t^k),

where w_f and u_f are the elements of W_f and R_f, respectively, and the last equality uses m = d. The bias is canceled out in the variance calculation as it is an independent constant initialized to zero. Moreover, the equality holds under the assumption that φ is an identity function. We will discuss other commonly used functions in LSTM units in Section 3.2.
Variance calculations for the input, modulation, and output gates can be performed in the same way as for the forget gate. That is to say,

Var(i_t^k) = m (Var(w_i) + Var(u_i)) Var(x_t^k),
Var(g_t^k) = m (Var(w_g) + Var(u_g)) Var(x_t^k),
Var(o_t^k) = m (Var(w_o) + Var(u_o)) Var(x_t^k),

where w_i, u_i, w_g, u_g, w_o, and u_o are the elements of W_i, R_i, W_g, R_g, W_o, and R_o, respectively.
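The gate-variance relation can be checked numerically. The sketch below draws fresh inputs and weights with illustrative variance values, applies the exact-estimation assumption (recurrent input equal to the current input) and an identity gate activation, and compares the empirical variance of the forget-gate output against m (Var(w_f) + Var(u_f)) Var(x):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                              # input dimension (= m in regression)
var_w, var_u = 0.003, 0.005          # illustrative variances for W_f and R_f

samples = []
for _ in range(400):
    x = rng.standard_normal(d)                     # standardized input, Var(x) = 1
    Wf = rng.normal(0, np.sqrt(var_w), (d, d))
    Uf = rng.normal(0, np.sqrt(var_u), (d, d))
    samples.append(Wf @ x + Uf @ x)                # identity phi, h_{t-1} = x

emp = np.concatenate(samples).var()
pred = d * (var_w + var_u)           # m (Var(w_f) + Var(u_f)) Var(x)
print(emp, pred)
```

The empirical and predicted values agree closely, confirming the derivation under the stated independence assumptions.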
The cell state formula is a form of the stochastic recurrence equation, also known as a growing perpetuity, in which the moments of the cell state are time varying. Therefore, one tractable way to stabilize the network training is to initialize the cell state to zero, i.e., c_0^k = 0. Accordingly,

Var(c_t^k) = Var(f_t^k) Var(c_{t−1}^k) + Var(i_t^k) Var(g_t^k),

where the above equation is obtained based on the zero-mean assumption and the independence assumption between all of the gates and the cell state, to avoid terms containing covariance matrices in the last expression. Also, note that Var(c_1^k) = Var(i_1^k) Var(g_1^k).
Finally, the variance of the network output is computed as

Var(h_t^k) = Var(o_t^k) Var(c_t^k),

where the last equality is obtained assuming that θ is an identity activation function and that the output gate and the cell state are independent. Considering all of the calculated variances with a standardized input (Var(x_t^k) = 1) and setting Var(h_t^k) = Var(x_t^k), the required condition can be summarized as

1/m³ = (Var(w_i) + Var(u_i)) (Var(w_g) + Var(u_g)) (Var(w_o) + Var(u_o)),

where the right-hand side of the above equation is the product of the variances of the weights connected to the input, modulation, and output gates.
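The end-to-end behavior can be checked by simulation. The sketch below uses the section's assumptions (identity activations, standardized inputs, recurrent input equal to the current input, zero initial cell state) and sets every weight variance to 1/(2m), one admissible choice that makes the product of the three gate terms equal to 1/m³; this equal split is our illustration, not a prescribed configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                      # input dimension (= number of output units m)
sigma2 = 1.0 / (2 * d)       # Var(w) = Var(u) = 1/(2m) for every gate, so
                             # each gate factor m (Var(w) + Var(u)) equals 1

vars_h = []
for _ in range(500):
    x = rng.standard_normal(d)           # standardized input, Var(x) = 1
    h_prev = x                           # exact-estimation assumption
    W = {k: rng.normal(0, np.sqrt(sigma2), (d, d)) for k in "igo"}
    U = {k: rng.normal(0, np.sqrt(sigma2), (d, d)) for k in "igo"}
    i = W["i"] @ x + U["i"] @ h_prev     # identity activations throughout
    g = W["g"] @ x + U["g"] @ h_prev
    o = W["o"] @ x + U["o"] @ h_prev
    c = i * g                            # cell state at t = 1 (c_0 = 0)
    h = o * c                            # network output
    vars_h.append(h.var())

print(np.mean(vars_h))                   # stays in the same range as Var(x)
```

The mean output variance remains on the order of the input variance (small deviations stem from the independence approximations in the derivation), whereas unscaled weights would make it blow up or vanish with the dimension.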
Similar to the feedforward pass, initialization conditions can be derived to ensure that the variance of the backpropagated gradient remains unchanged, where the loss function is defined based on the actual target and the network output. However, as shown in previous work, initialization that properly scales the forward signal is equivalent to initialization that properly scales the backward signal, and since the numbers of units in the input and output of the LSTM network are the same, similar conditions for weight initialization are obtained using backpropagation.
3.1 Peephole Connections
In general, LSTMs can be extended by exposing the internal cell state to the multiplicative gates through so-called peephole connections. These cell-to-gate connections allow the gates to inspect the current cell state even when the output gate is closed, and consequently help improve performance, especially when the task involves precise durations of intervals. The feedforward pass of the peephole LSTM can be formulated as

f_t^k = φ(W_f x_t^k + R_f h_{t−1}^k + P_f c_{t−1}^k + b_f),
i_t^k = φ(W_i x_t^k + R_i h_{t−1}^k + P_i c_{t−1}^k + b_i),
g_t^k = ψ(W_g x_t^k + R_g h_{t−1}^k + b_g),
c_t^k = f_t^k ⊙ c_{t−1}^k + i_t^k ⊙ g_t^k,
o_t^k = φ(W_o x_t^k + R_o h_{t−1}^k + P_o c_t^k + b_o),
h_t^k = o_t^k ⊙ θ(c_t^k),

where P_f, P_i, P_o ∈ R^{m×m} are diagonal peephole weight matrices. Hence, each gate will only look at its corresponding cell state. To achieve Var(h_t^k) = Var(x_t^k), all the assumptions applied to the traditional LSTM are used for the peephole LSTM. Assuming that the peephole matrices are independent of the input and the cell state and are independently initialized with zero-mean i.i.d. random values drawn from a symmetric distribution, the variances can be calculated as
where p_f, p_i, and p_o are the diagonal elements of P_f, P_i, and P_o, respectively. Since the discriminant is always positive considering nonzero variances, there are two possible real solutions for Equation (8). However, with a positive discriminant and based on the sign of the product of the roots, one of the real solutions is negative, which cannot be accepted as a variance. Therefore, the desired solution to Equation (8) is obtained as
where the coefficients are defined in terms of the gate and peephole weight variances. The two possible solutions for Equation (10) are obtained from the quadratic formula in terms of its discriminant. Here, assuming a nonnegative discriminant and based on the signs of the sum and product of the roots, both real solutions could be positive and acceptable, provided that the coefficient constraints hold. However, to achieve a simple solution for initialization, one can choose the variances so as to produce repeated real positive roots for the problem. Therefore, the real solution to Equation (10) can be obtained as
3.2 Nonlinear Activation Functions
All the abovementioned equations were obtained under the assumption that the activation functions are identity functions. In general, symmetric functions with zero intercepts, such as the identity and hyperbolic tangent, are suggested for ψ and θ, respectively, and the logistic sigmoid is suggested for φ. Both the hyperbolic tangent and the logistic sigmoid are nonlinear functions that can be linearly approximated using a Taylor series expansion. The former has a zero intercept, and its expansion about zero leads to an identity function (tanh(x) ≈ x). The latter, however, has a nonzero intercept, and its Taylor series about zero is approximated as σ(x) ≈ 1/2 + x/4. Therefore, the sigmoid function approximately increases the input signal mean by 1/2 and scales its variance by 1/16. Note that the nonzero mean of the sigmoid can induce large singular values in the Hessian matrix, resulting in saturation of the top layers and preventing gradients from flowing backward to learn useful features in the lower layers. Using the suggested activation functions in the gates, the variance calculations for the traditional LSTM network are updated as follows based on the aforementioned Taylor series expansion
where the last equation is obtained bearing in mind that

Var(XY) = Var(X)Var(Y) + Var(X)E[Y]² + Var(Y)E[X]²

for two independent random variables X and Y, and considering the updated means and variances of the sigmoid gate outputs. Finally, the updated rule for initialization of a traditional LSTM network using Equation (7) can be written as
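The sigmoid statistics quoted above are easy to verify numerically for a small-variance, zero-mean input:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.2, 1_000_000)   # zero-mean input with Var(x) = 0.04
s = 1.0 / (1.0 + np.exp(-x))          # logistic sigmoid

print(s.mean())            # close to 1/2: the nonzero intercept
print(s.var() / x.var())   # close to 1/16 = 0.0625: the variance scaling
```

The approximation degrades as the input variance grows, since the higher-order Taylor terms of the sigmoid are no longer negligible.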
Applying the same suggested activation functions to the peephole LSTM network generalizes the variance calculations as follows
3.3 Initialization Summary
The proposed initialization rule can be summarized as follows:
Standardize the input data to have a zero mean and unit variance per feature, and initialize the LSTM network biases to zero.
Initialize the weight matrices randomly using zero-mean i.i.d. Gaussian distributions with variances satisfying one of the following equations:
Equation (1), if using the traditional LSTM network based on identity or hyperbolic tangent functions.
Equation (12), if using the peephole LSTM network based on identity or hyperbolic tangent functions.
Equation (13), if using the traditional LSTM network based on identity or hyperbolic tangent for input modulation and cell activation, and logistic sigmoid functions in the gates.
Equation (14), if using the peephole LSTM network based on identity or hyperbolic tangent for input modulation and cell activation, and logistic sigmoid functions in the gates.
Note that the variances need to be selected subject to the conditions specified in the selected equation. For example, when using a peephole LSTM, and, correspondingly, Equation (12) or (14), there are eleven variances to fix: Var(w_f), Var(w_i), Var(w_g), Var(w_o), Var(u_f), Var(u_i), Var(u_g), Var(u_o), Var(p_f), Var(p_i), and Var(p_o).
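The summarized procedure can be sketched as follows for a traditional LSTM in the identity/tanh setting; the equal split of each gate's variance budget between input and recurrent weights is our assumption for illustration, not the only admissible configuration:

```python
import numpy as np

def init_lstm_weights(d, rng, gates=("f", "i", "g", "o")):
    """Sketch of the proposed scheme for a traditional LSTM with
    identity/tanh activations: biases are zero, and the weights are
    zero-mean Gaussian with per-gate variances satisfying the product
    condition via an equal split, Var(w) = Var(u) = 1/(2d).
    (The equal split is one admissible choice among many.)"""
    std = np.sqrt(1.0 / (2 * d))
    W = {k: rng.normal(0.0, std, (d, d)) for k in gates}  # input weights
    U = {k: rng.normal(0.0, std, (d, d)) for k in gates}  # recurrent weights
    b = {k: np.zeros(d) for k in gates}                   # zero biases
    return W, U, b
```

The input data would be standardized to zero mean and unit variance per feature before training, as the first step of the summary requires.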
4 Experiments and Results
4.1 Data
Both univariate and multivariate data are used to study the effect of initialization on LSTM training.
The following three univariate datasets are obtained from the UCR Time Series Archive  due to having the largest training samples size: ElectricDevices with 16,637 samples (8,926 for training and 7,711 for test) of sequence length 96; FordA with 4,921 samples (3,601 for training and 1,320 for test) of sequence length 500; and Crop with 24,000 samples (7,200 for training and 16,800 for test) of sequence length 46.
The multivariate dataset, ADNI, focuses on disease progression modeling and is obtained from the ADNI cohort. It consists of yearly measurements for 383 subjects (332 for training and 51 for test) of sequence length 3 to 10 with normal cognition, mild cognitive impairment, or Alzheimer’s disease. The multivariate feature set consists of T1-weighted MRI volumetric measurements of the ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex, all normalized for intracranial volume.
4.2 Experimental Setup
The proposed initialization method is assessed using a peephole LSTM applied to the univariate data (d = 1) for time series regression and a peephole LSTM robust to missing values applied to the multivariate data (d = 6) for disease progression modeling. In both cases, an identity function and a hyperbolic tangent are used for ψ and θ, respectively, a logistic sigmoid is used for φ, and the network biases are initialized to zero. Therefore, the variance selection for the weight matrices is performed using Equation (14), and weight values are drawn from zero-mean i.i.d. Gaussian distributions. Four different configurations of the variances are inspected, as illustrated in Table 1.
The input data is standardized to have a zero mean and unit variance per feature dimension. Moreover, the batch size is set to a fraction of the training samples (a subset of which is used for validation to tune the optimization hyperparameters), and the first to penultimate time points are used to estimate the second to last time points per observation. The L2-norm is used as the loss function, and momentum batch gradient descent with L2 regularization is applied to optimize the network parameters. The optimization hyperparameters, i.e., the learning rate, momentum weight, and weight decay, were selected according to the validation set error across the different experiments.
The proposed approach is compared with two state-of-the-art initialization techniques applied to the same LSTM networks, assuming zero biases and using the same optimization setup: normalized, with all weight matrices drawn i.i.d. from zero-mean Gaussian distributions with variances scaled by the layer sizes; and orthogonal, the same as normalized but with orthogonal recurrent weight matrices.
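For reference, orthogonal recurrent weight matrices for such baselines are commonly obtained from the QR decomposition of a square Gaussian matrix; this is one standard recipe, and the exact procedure used in the compared work may differ:

```python
import numpy as np

def orthogonal(n, rng):
    # QR decomposition of a zero-mean Gaussian matrix yields an
    # orthogonal factor Q; fixing the column signs with the diagonal
    # of R makes the result uniformly distributed over orthogonal
    # matrices (Haar measure).
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))
```

The returned matrix Q satisfies Q Qᵀ = I, so repeated multiplication by it preserves the norm of the backpropagated gradient.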
Figure 1 compares the training loss of the proposed and state-of-the-art initialization methods applied to the univariate and multivariate datasets. As can be seen, the proposed method with any configuration outperforms the prevalent initialization techniques in all experiments, either by achieving a lower loss (ElectricDevices and FordA) or by faster convergence to the same loss (Crop and ADNI).
To further investigate the influence of initialization on performance, we also evaluate the generalization error on the test set. Table 2 reports the test mean square error (MSE) in predicting the feature values per dataset for the utilized initialization methods. As can be seen, the proposed initialization method with any configuration achieves superior results to the prevalent initialization approaches, which illustrates the generalizability of the proposed method.
More interestingly, the fourth configuration of the proposed method, in which the recurrent weights are assigned a larger variance than the current input weights, outperforms all the other methods in almost all of the experiments.
5 Conclusion
In this paper, a robust initialization method was proposed for LSTM networks to address training instability and slow convergence. The proposed method is based on scaled random weight initialization aiming to keep the variance of the network input and output signals in the same range, subject to a number of assumptions simplifying the initialization conditions. The proposed method was applied to univariate and multivariate time series regression datasets and outperformed two state-of-the-art initialization methods in all cases.
The obtained conditions can be optimized for eight or eleven unknowns when using a traditional LSTM or a peephole LSTM, respectively. In this work, different configurations of the variances were inspected to confirm the proposed assumptions for initializing the network weights. Moreover, the proposed method can be used for sequence-to-sequence and sequence-to-label learning paradigms by connecting a fully-connected layer with a desired output size to the LSTM network output. It should also be noted that the initialization conditions need to be properly modified when using activation functions other than the hyperbolic tangent, identity, or logistic sigmoid in the gates.
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820.
References
- Buraczewski, D., Damek, E., Mikosch, T. (2016) Stochastic Models with Power-Law Tails. Springer.
- Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y. (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734.
- Dau, H. A., et al. (2018) The UCR Time Series Archive. CoRR abs/1810.07758.
- Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., Vincent, P. (2009) The difficulty of training deep architectures and the effect of unsupervised pre-training. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 153–160.
- Gers, F. A., Schraudolph, N. N., Schmidhuber, J. (2002) Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3, pp. 115–143.
- Mehdipour Ghazi, M., et al. (2019) Training recurrent neural networks robust to incomplete data: application to Alzheimer’s disease progression modeling. Medical Image Analysis 53, pp. 39–46.
- Glorot, X., Bengio, Y. (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- He, K., Zhang, X., Ren, S., Sun, J. (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 1026–1034.
- Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J. (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks.
- Hochreiter, S., Schmidhuber, J. (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Le, Q. V., Jaitly, N., Hinton, G. E. (2015) A simple way to initialize recurrent networks of rectified linear units. CoRR abs/1504.00941.
- Martens, J., Sutskever, I. (2011) Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the International Conference on Machine Learning, pp. 1033–1040.
- Petersen, R. C., et al. (2010) Alzheimer’s Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74, pp. 201–209.
- Sutskever, I., Vinyals, O., Le, Q. V. (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- Talathi, S. S., Vartak, A. (2015) Improving performance of recurrent neural network with ReLU nonlinearity. CoRR abs/1511.03771.
- Trinh, T. H., Dai, A. M., Luong, M.-T., Le, Q. V. (2018) Learning longer-term dependencies in RNNs with auxiliary losses. CoRR abs/1803.00144.
- Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C. (2017) On orthogonality and learning recurrent networks with long term dependencies. CoRR abs/1702.00071.