1 Introduction
Dynamic sequences of discrete tokens are abundant and we encounter them on a daily basis. Examples of discrete tokens are characters or words in a text, notes in a musical composition, pixels in an image, actions in a reinforcement learning agent, web pages one visits, tracks one listens to on a music streaming service etc. Each of these tokens appears in a sequence, in which there is often a strong correlation between consecutive or nearby tokens. For example, the similarity between neighboring pixels in an image is very large since they often share similar shades of colors. Words in sentences, or characters in words, are also correlated because of the underlying semantics and language characteristics.
In this paper only discrete tokens are considered, as opposed to sequences of real-valued samples such as stock prices, analog audio signals, word embeddings, etc., but our methodology is also applicable to these kinds of sequences. A sequence of discrete tokens can be presented to a machine learning model that is designed to assess the probability of the next token in the sequence by modeling $p(x_{t+1} \mid x_1, \ldots, x_t)$, in which $x_t$ is the $t$'th token in the sequence. These kinds of models go by the names of autoregressive models Gregor:2013ti, recurrent or recursive models, dynamical systems, etc. In the field of natural language processing (NLP) they are called language models, in which each token stands for a separate word or $n$-gram Kim:2015vh. Since these models give us a probability distribution of the next token in the sequence, a sample from this distribution can be drawn and thus a new token for the sequence is generated. By recursively applying this generation step, entire new sequences can be generated. In NLP, for example, a language model is not only capable of assessing proper language utterances but also of generating new and unseen text.
One particular type of generative model that has become popular in the past years is the recurrent neural network (RNN). In a regular neural network a fixed-dimensional feature representation is transformed into another feature representation through a nonlinear function; multiple instances of such feature transformations are applied to calculate the final output of the neural network. In a recurrent neural network this process of feature transformations is also repeated in time: at every time step a new input is processed and an output is produced, which makes it suitable for modeling time series, language utterances, etc. These dynamic sequences can be of variable length, and RNNs are able to effectively model semantically rich representations of these sequences. For example, in 2013, Graves showed that RNNs are capable of generating structured text, such as a Wikipedia article, and even continuous handwriting Graves:2013ua. From then on these models have shown great potential at modeling the temporal dynamics of text, speech, and audio signals Karpathy:2015wu; Sercu:2016ub; VanDenOord:2016uo. Recurrent neural networks can also effectively generate new images on a per-pixel basis, as was shown by Gregor et al. with DRAW Gregor:2015up and by van den Oord et al. with PixelRNNs Oord:2016um. Next to this, in the context of recommender systems, RNNs have been used successfully to model user behavior on online services and to recommend new items to consume Tan:2016vy; Hidasi:2015uq.
Despite the fact that RNNs are abundant in scientific literature and industry, there is not much consensus on how to efficiently train these kinds of models, and, to the best of our knowledge, there are no focused contributions in literature that tackle this question. The choice of training algorithm very often depends on the deep learning framework at hand, while in fact there are multiple factors that influence the RNN performance, and those are often ignored or overlooked. Merity et al. have pointed out before that “[the] training of RNN models […] has fundamental tradeoffs that are rarely discussed” Merity:2016wg. The goal of this paper is to study a number of widely applicable training and sampling techniques for RNNs, along with their respective (dis)advantages and trade-offs. These will be tested on a variety of datasets, neural network architectures, and parameter settings, in order to gain insights into which algorithm is best suited.
In the next section the concept of RNNs is introduced and, more specifically, character-level RNNs, and how these models are trained. In Section 3 four different training and sampling methods for RNNs are detailed. After that, in Section 4, we will present experimental results on the accuracy, efficiency and performance of the different methods. We will also present a set of take-home recommendations and a range of future research tracks. Finally, the conclusions are listed in Section 6. Table 1 gives an overview of the symbols that will be used throughout this paper in order of appearance.
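Table 1: Symbols used throughout this paper, in order of appearance.
Symbol  Description
$x_t$, $\mathbf{x}_t$  Input token, input vector
$h_t$, $\mathbf{h}_t$  Hidden state, hidden state vector
$y_t$, $\mathbf{y}_t$  Output token, output vector
$f$, $g$  Parameterized and differentiable functions
$\mathcal{L}$  Loss function
$\eta$  Learning rate
$w$, $\mathbf{W}$  Arbitrary weight, arbitrary weight matrix
$k_1$  Number of time steps after which one truncated BPTT operation is performed
$k_2$  Number of time steps over which gradients are backpropagated in truncated BPTT
$\sigma$  Nonlinear activation function
$\varsigma$  Sigmoid function: $\varsigma(x) = 1 / (1 + e^{-x})$
$\mathcal{R}$  RNN model
$\mathcal{R}(x)$  Output function of RNN model
$\mathcal{R}_h$  Hidden state of RNN model
$\odot$  Elementwise vector multiplication operator
$\oplus$  Sequence concatenation operator
$\mathcal{D}$, $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$  Dataset, train set, test set
$V$  Ordered set of tokens appearing in a dataset (‘vocabulary’)
$L$  Number of recurrent layers in an RNN model
$H$  Dimensionality of the recurrent layers in an RNN model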
2 Character-level Recurrent Neural Networks
As mentioned in the introduction, this paper mainly focuses on dynamic sequences of discrete tokens. Generating and modeling such sequences is the core application of a specific type of recurrent neural network: the character-level recurrent neural network. Recurrent neural networks (RNNs) are designed to maintain a summary of the past sequence in their memory or so-called hidden state, which is updated whenever a new input token is presented. This summary is used to make a prediction about the next token in the sequence, i.e. the model $p(x_{t+1} \mid h_t)$, in which the hidden state $h_t$ of the RNN is a function of the past sequence $x_1, \ldots, x_t$. Formally, we have:
$$h_t = f(h_{t-1}, x_t), \qquad y_t = g(h_t) \tag{1}$$
Given an adequate initialization of the hidden state $h_0$ and trained parameterized functions $f$ and $g$, the previous scheme can be used to generate an infinite sequence of tokens. In character-level RNNs specifically, all input tokens are discrete and $g$ is a stochastic function that produces a probability mass function over all possible tokens. To produce the next token in the sequence, one can sample from this mass function or simply pick the token with the highest probability:
$$x_{t+1} = \arg\max_{x} \, y_t(x) \tag{2}$$
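The two ways of producing the next token can be sketched in a few lines of Python. This is only an illustration: the distribution `pmf` below is a made-up stand-in for the output of a trained RNN, and the helper names `sample_token` and `greedy_token` are hypothetical.

```python
import random


def sample_token(pmf, rng):
    """Draw one token from a dict that maps token -> probability."""
    tokens = list(pmf)
    weights = [pmf[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy_token(pmf):
    """Pick the token with the highest probability."""
    return max(pmf, key=pmf.get)


# Hypothetical output distribution of an RNN after some prefix (assumption).
pmf = {"a": 0.1, "e": 0.6, "t": 0.3}

rng = random.Random(0)  # fixed seed so that the draws are repeatable
greedy = greedy_token(pmf)
draws = [sample_token(pmf, rng) for _ in range(1000)]
```

Greedy selection always returns the same token for a given distribution, whereas sampling returns each token roughly in proportion to its probability, which is what makes generated sequences vary from run to run.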
Since the hidden state $h_t$ at time step $t$ is only dependent on the tokens up to time $t$ and not on future tokens, a character-level RNN can be regarded as the following fully probabilistic generative model Sutskever:2013wo:
$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}) \tag{3}$$
In this, we have used the slice notation $x_{i:j}$, which means $(x_i, x_{i+1}, \ldots, x_j)$. As a side comment, even though their name only refers to character-based language models, character-level RNNs are fit to model a wide variety of discrete sequences, for which we refer the reader to the introduction section.
2.1 Truncated backpropagation through time
Regular feedforward neural networks are trained using the backpropagation algorithm Rumelhart:1988we. In this, a certain input is first propagated through the network to compute the output. This is called the forward pass. The output is then compared to a ground truth label using a differentiable loss function. In the backward pass the gradients of the loss with respect to all the parameters in the network are computed by application of the chain rule. Finally, all parameters are updated using a gradient-based optimization procedure such as gradient descent Goodfellow:2016wc. In neural network terminology, the parameters of the network are also called the weights. If the loss function between the network output and the ground truth label is denoted by $\mathcal{L}$, and the vector of all weights in the network by $\mathbf{w}$, a standard update rule in gradient descent is given by:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L} \tag{4}$$
Here, $\eta$ is the so-called learning rate, which controls the size of the steps taken with each weight update. In practice, input samples will be organized in equally sized batches sampled from the training dataset for which the loss is averaged or summed, leading to less noisy updates. This is called minibatch gradient descent. Other gradient descent flavors such as RMSprop and Adam further extend Equation (4), making the optimization procedure even more robust Kingma:2015ku.
In recurrent neural networks a new input is applied for every time step, and the output at a certain time step is dependent on all previous inputs, as was shown in Equation (3). This means that the loss at time step $t$ needs to be backpropagated all the way back to the inputs applied at time step 1. This procedure is therefore called backpropagation through time (BPTT) Sutskever:2013wo. If the sequence is very long, BPTT quickly becomes inefficient: backpropagating through 100 time steps can be compared to backpropagating through a 100-layer deep feedforward neural network. Unlike with feedforward networks, however, in RNNs the weights are shared across time. This can best be seen if we unroll the RNN to visualize the separate time steps, see Figure 1.
To scale backpropagation through time for use with long sequences, the gradient update is often halted after having traversed a fixed number of time steps. Such a procedure is called truncated backpropagation through time. Apart from stopping the gradient updates from backpropagating all the way to the beginning of the sequence, we also limit the frequency of such updates. For a given training sequence, truncated BPTT then proceeds as follows. Every time step a new token is processed by the RNN, and whenever $k_1$ tokens have been processed in the so-called forward pass (so that the hidden state has been updated $k_1$ times), truncated BPTT is initiated by backpropagating the gradients for $k_2$ time steps. Here, by analogy with Sutskever Sutskever:2013wo, we have denoted the number of time steps between performing truncated BPTT by $k_1$ and the length of the BPTT by $k_2$. We will keep using these parameters throughout this paper. A visual explanatory example of truncated BPTT can be found in Figure 2, which shows how every two ($k_1 = 2$) time steps the gradients are backpropagated for three ($k_2 = 3$) time steps. Note that, in order to remain as data efficient as possible, $k_1$ should preferably be less than or equal to $k_2$, since otherwise some data points would be skipped during training.
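The interplay of the two truncated BPTT parameters can be made concrete with a small scheduling sketch. It assumes the common `k1`/`k2` naming (steps between updates and backpropagation length) and 1-indexed time steps; the helper names are hypothetical, and no actual gradients are computed.

```python
def tbptt_schedule(seq_len, k1, k2):
    """Return (t, first_step) pairs: after processing token t (1-indexed),
    gradients are backpropagated over time steps first_step..t."""
    schedule = []
    for t in range(1, seq_len + 1):
        if t % k1 == 0:  # a truncated BPTT pass is triggered every k1 steps
            first_step = max(1, t - k2 + 1)  # go back at most k2 steps
            schedule.append((t, first_step))
    return schedule


def uncovered_steps(seq_len, k1, k2):
    """Time steps that never fall inside any backpropagation window."""
    covered = set()
    for t, first_step in tbptt_schedule(seq_len, k1, k2):
        covered.update(range(first_step, t + 1))
    return [t for t in range(1, seq_len + 1) if t not in covered]


schedule = tbptt_schedule(seq_len=8, k1=2, k2=3)
gaps_ok = uncovered_steps(seq_len=8, k1=2, k2=3)   # k1 <= k2
gaps_bad = uncovered_steps(seq_len=8, k1=4, k2=2)  # k1 > k2
```

With `k1=2` and `k2=3` every time step is part of at least one window, while `k1=4` and `k2=2` leaves steps 1, 2, 5 and 6 outside all windows, which illustrates why `k1` is preferably not larger than `k2`.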
2.2 Common RNN layers
As mentioned before, RNNs keep a summary of the past sequence encoded in a hidden state representation. Whenever a new input is presented to the RNN, this hidden state gets updated. The way in which the update happens depends on the internal dynamics of the RNN. The simplest version of an RNN is an extension of a feedforward neural network, in which matrix multiplications are used to perform input transformations:
$$\mathbf{x}^{(\ell+1)} = \sigma\left(\mathbf{W}^{(\ell)} \mathbf{x}^{(\ell)} + \mathbf{b}^{(\ell)}\right) \tag{5}$$
in which $\mathbf{x}^{(\ell)}$ is the vector representation at the $\ell$'th layer of the neural network, $\mathbf{W}^{(\ell)}$ is the matrix containing the weights of this layer, and $\mathbf{b}^{(\ell)}$ is the vector of biases. The function $\sigma$ is a nonlinear transformation function, also called activation function; often used examples are the logistic sigmoid function or rectified linear units (ReLU) and their variants. In the case of RNNs, the hidden state update is rewritten in the style of Equation (2) as follows:
$$\mathbf{h}_t = \sigma\left(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{b}\right) \tag{6}$$
The output of the RNN can be computed with Equation (5) using $\mathbf{h}_t$ as input vector.
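As an illustration of the simple recurrent update described above, the following sketch applies one affine transformation to the input, one to the previous hidden state, and a nonlinearity on their sum. The weight values, the `tanh` activation and the names `W_x`, `W_h` and `b` are assumptions chosen for the example, not settings taken from the experiments.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector (list of numbers)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, W_x, W_h, b, activation=math.tanh):
    """One hidden state update: activation(W_x x + W_h h_prev + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(W_x, x), matvec(W_h, h_prev), b)]
    return [activation(z) for z in pre]


# Toy parameters (assumptions): 3-dimensional one-hot inputs, 2 hidden units.
W_x = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
h = [0.0, 0.0]  # initial hidden state
states = []
for x in inputs:
    h = rnn_step(x, h, W_x, W_h, b)  # the same weights are reused at every step
    states.append(h)
```

The loop makes the weight sharing across time explicit: only the hidden state changes from step to step.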
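To help the RNN model long-term dependencies and to counter the vanishing gradient problem Hochreiter:2001uj, several extensions on Equation (6) have been proposed. The best known examples are long short-term memories (LSTMs) and, more recently, gated recurrent units (GRUs), which both have comparable characteristics and similar performance on a variety of tasks Hochreiter:1997fq; Greff:2015wv; Chung:2014wf. Both LSTMs and GRUs incorporate a gating mechanism which controls to what extent the new input is stored in memory and the old memory is forgotten. For this purpose the LSTM introduces an input ($\mathbf{i}_t$) and forget ($\mathbf{f}_t$) gate, as well as an output gate ($\mathbf{o}_t$):
$$\begin{aligned}
\mathbf{i}_t &= \varsigma\left(\mathbf{W}_{xi} \mathbf{x}_t + \mathbf{W}_{hi} \mathbf{h}_{t-1} + \mathbf{b}_i\right) \\
\mathbf{f}_t &= \varsigma\left(\mathbf{W}_{xf} \mathbf{x}_t + \mathbf{W}_{hf} \mathbf{h}_{t-1} + \mathbf{b}_f\right) \\
\mathbf{o}_t &= \varsigma\left(\mathbf{W}_{xo} \mathbf{x}_t + \mathbf{W}_{ho} \mathbf{h}_{t-1} + \mathbf{b}_o\right) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh\left(\mathbf{W}_{xc} \mathbf{x}_t + \mathbf{W}_{hc} \mathbf{h}_{t-1} + \mathbf{b}_c\right) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh\left(\mathbf{c}_t\right)
\end{aligned} \tag{7}$$
Here, $\varsigma$ denotes the sigmoid function and the symbol $\odot$ stands for the elementwise vector multiplication. Note that the LSTM uses two types of memories: a hidden state $\mathbf{h}_t$ and a so-called cell state $\mathbf{c}_t$. Compared to LSTMs, GRUs do not have this extra cell state and only incorporate two gates: a reset and update gate Cho:2014uo. This reduces the overall number of parameters and generally speeds up learning. The choice of LSTMs versus GRUs is dependent on the application at hand, but in the experiments of this paper we will use LSTMs as these are currently the most widely used RNN layers.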
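The gating mechanism can be illustrated with a minimal LSTM step in plain Python. This follows the commonly used LSTM formulation (sigmoid gates, tanh for the cell and hidden state updates) and is a sketch under that assumption; the parameter layout in `params` and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b with plain lists (one row per output unit)."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate
    # Old memory is scaled by the forget gate, new content by the input gate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# With all-zero weights every gate equals 0.5 and the candidate equals 0.
zero_params = {name: ([[0.0]], [[0.0]], [0.0]) for name in "ifoc"}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[2.0], params=zero_params)
```

In the all-zero example the cell state is simply halved by the forget gate, which shows how the gates control how much of the old memory is retained.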
Just as multiple layers are usually stacked onto one another in feedforward neural networks, recurrent layers can be stacked as well. In that case, the output at time step $t$ is fed to the input of the next recurrent layer, also at time step $t$. Each layer thus processes the sequence of outputs produced by the previous layer. This will, of course, significantly slow down BPTT.
3 Training and sampling schemes for character-level RNNs
In this section four schemes are presented on how to train character-level RNNs and how to sample new tokens from them. The task of the RNN model is independent of these schemes: its purpose is to predict the next symbol or character in a sequence given all previous tokens. The training and sampling schemes are thus merely a practical means to solve the same task, and in later sections the effect of the used scheme on the performance and efficiency of the RNN model is studied. We already point out that the four schemes presented are among the most basic and practical methods to train and sample from RNNs, but of course many more combinations or variants could be devised. In the discussion of the different schemes a general character-level RNN will be denoted by $\mathcal{R}$, and $\mathcal{R}(x)$ is the output of the RNN when token $x$ is applied at its input. This output is a vector that represents a probability distribution across all characters. For notational convenience, we write the $i$'th hidden state of the RNN as $\mathcal{R}_h^{(i)}$.
3.1 High-level overview
As mentioned before, we will isolate four different schemes on how to train RNNs and how to sample new tokens from them. Each scheme fits in the truncated BPTT framework, and is in fact a practical approximation of the original algorithm. So whenever we use the $k_1$ and $k_2$ parameters, these refer to the definitions we gave in Section 2.1. It is also important to keep in mind that the task for all schemes is the same, namely to predict the next token in a given sequence.
To help understand the mechanisms of each scheme, we visualized them schematically in Figures 6, 6, 6 and 6. In the training procedures we have drawn the output tokens for which a loss is defined. We see that, for example, the main difference of scheme 2 compared to scheme 1 is that we only compute a loss for the final output token instead of for all output tokens during training. Regarding the sampling procedures, in the first two schemes a new token is always sampled starting from the same initial hidden state, which is colored light gray. We call this principle ‘windowed sampling’. In schemes 3 and 4, on the other hand, sampling a new token is based on the current hidden state and by applying the previous token at the input of the RNN. This sampling procedure is called ‘progressive sampling’. In the training procedure of scheme 4 we observe a similar technique, in which the hidden state is carried across subsequent sequences. In the next subsections we will give details on all training and sampling procedures, after which we go over the practical details of the different schemes one by one. We mention that the schemes are described without batching, while in practice minibatch training is usually done, as motivated in Section 2.1. The schemes, however, are easily transferred to a batched setting.
3.2 Training algorithms
We have isolated three different training procedures for character-level RNNs. A first algorithm is called multi-loss training, and a rudimentary outline of this is shown in Algorithm 1. The input sequences all have length $k_2$ and are taken successively from the train set by skipping every $k_1$ characters. For each input token a loss is calculated at the output of the RNN. When the entire sequence is processed, the average of all losses is calculated and used to update the RNN weights. We also reset the hidden state of the RNN for each new training sequence. This initial hidden state will be learned through backpropagation together with the weights of the RNN. In practice, for every input sequence of $k_2$ characters, the initial hidden state will be the same. Note that for LSTMs, the hidden state comprises both the hidden and cell vectors from Equation (7). The truncated_BPTT procedure on lines 10 and 12 calculates the gradients of the loss with respect to all weights in the RNN using backpropagation. The optimize procedure on lines 11 and 13 then uses these gradients to update the weights using (a variant of) Equation (4).
In single-loss training, instead of defining a loss on all outputs of the RNN (which forces the RNN to make good predictions for the first few tokens of the sequence), we only define a loss on the final predicted token in the sequence. The complete training procedure is shown in Algorithm 2. The difference with Algorithm 1 is that the innermost loop does not aggregate the loss for every RNN output. Now the loss is calculated outside this loop on line 9, only for the final RNN output.
In both the multi-loss and single-loss procedures we always start training on a sequence from an initial hidden state that is learned. In conditional multi-loss training, on the other hand, the multi-loss training procedure is adapted to reuse the hidden state across different sequences. Such an approach leans much closer to the original truncated BPTT algorithm than when the initial state is always reset. The outline of the training method is given in Algorithm 3. Since we are using truncated BPTT, the procedure requires meticulous bookkeeping of the hidden state at every time step, which can be observed in lines 11–12. This is especially true when we work in a minibatch setting where we also need to keep track of how the subsequent batches are constructed.
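The differences between the three training procedures can be summarized in a small sketch that only deals with data slicing and loss placement, not with an actual RNN. It assumes that training windows contain `k2` input tokens and start every `k1` tokens; the function names and the scheme labels are hypothetical.

```python
def training_windows(train, k1, k2):
    """Yield (inputs, targets): k2 input tokens and the k2 tokens that follow
    them one position later, with window starts spaced k1 tokens apart."""
    for start in range(0, len(train) - k2, k1):
        yield train[start:start + k2], train[start + 1:start + k2 + 1]


def loss_positions(scheme, k2):
    """Positions within a window whose outputs contribute to the loss."""
    if scheme == "single-loss":
        return [k2 - 1]          # only the final prediction is scored
    return list(range(k2))       # multi-loss variants score every output


def carries_hidden_state(scheme):
    """Only conditional multi-loss training reuses the hidden state."""
    return scheme == "conditional multi-loss"


windows = list(training_windows("hello world", k1=2, k2=4))
```

For the toy string above, the first window pairs the inputs `hell` with the targets `ello`. Which hidden state is handed over between windows in conditional multi-loss training is left to the training loop and is not shown here.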
3.3 Sampling algorithms
Next to the training algorithms we have explained in the previous section, we also discern two different sampling procedures. These are used to generate new and previously unseen sequences. Both procedures have in common that sampling is started with a seed sequence of tokens that is fed to the RNN. This is done in order to appropriately bootstrap the RNN’s hidden state. After the seed sequence has been processed, the two procedures start to differ.
In so-called windowed sampling the next token of the sequence is sampled from the RNN after applying the seed sequence. This newly sampled token is concatenated at the end of the sequence. After this, the hidden state of the RNN is reset to its learned representation. Sampling the next token proceeds in the same way: we take the last $k_2$ tokens from the sequence that have been sampled thus far, we feed them to the RNN, the next token is sampled and appended to the sequence, and the hidden state of the RNN is reset. The entire windowed sampling procedure is sketched in Algorithm 4. On line 6 of the algorithm, we have used the symbol $\oplus$ to indicate sequence concatenation.
In progressive sampling the next token in a sequence is always sampled given the current hidden state and the previously sampled token. That is, a token is applied at the input of the RNN, which updates its hidden state, and then the next token is sampled at the RNN output. The initial hidden state is therefore never reset. This is the most intuitive way of sampling from an RNN. The entire sampling procedure is given in Algorithm 5. On lines 1–3 the RNN is bootstrapped using the initial hidden state and the seed sequence. On the following lines, new tokens are continuously sampled from the RNN one token at a time, so the inner loop from Algorithm 4 is no longer needed.
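The two sampling procedures can be contrasted with a toy stand-in for a trained RNN. The `ToyRNN` class below is purely hypothetical: it predicts the next letter in a fixed cycle and counts how many inputs it processes, which is enough to compare the cost of both procedures. The window length is assumed to equal the seed length.

```python
class ToyRNN:
    """Stand-in for a trained RNN (assumption): predicts the next letter in a
    fixed cycle and counts the number of processed inputs."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.steps = 0
        self.hidden = None

    def reset(self):
        self.hidden = None  # a real model would restore its learned state

    def step(self, token):
        self.steps += 1
        self.hidden = token
        index = (self.vocab.index(token) + 1) % len(self.vocab)
        return self.vocab[index]  # deterministic 'sample' for the example


def windowed_sampling(rnn, seed, n_new):
    seq, window = list(seed), len(seed)
    for _ in range(n_new):
        rnn.reset()                    # always restart from the initial state
        for token in seq[-window:]:    # re-feed the last `window` tokens
            out = rnn.step(token)
        seq.append(out)
    return "".join(seq)


def progressive_sampling(rnn, seed, n_new):
    seq = list(seed)
    rnn.reset()
    for token in seq:                  # bootstrap the hidden state once
        out = rnn.step(token)
    for _ in range(n_new):
        seq.append(out)
        out = rnn.step(out)            # one input per newly sampled token
    return "".join(seq)


windowed_rnn, progressive_rnn = ToyRNN("abcd"), ToyRNN("abcd")
windowed_text = windowed_sampling(windowed_rnn, "abc", n_new=5)
progressive_text = progressive_sampling(progressive_rnn, "abc", n_new=5)
```

Both procedures produce the same text for this deterministic toy model, but the windowed variant processes 15 inputs against 8 for the progressive one; the gap grows with the window length.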
3.4 Scheme 1 – Multi-loss training, windowed sampling
In a first scheme, multi-loss training (Algorithm 1) is combined with a windowed sampling procedure (Algorithm 4). The main advantage of using this scheme is that there is no need for hidden state bookkeeping across sequences. Especially during training this can be cumbersome in a batched version of the algorithm. One disadvantage is that sampling becomes slower as $k_2$ increases: to sample one new token, $k_2$ inputs need to be processed. If the RNN model contains many layers, this can lead to scalability issues. Another disadvantage is that a loss is defined on all outputs of the RNN during training. That is, we force the RNN to produce good token candidates after having seen only one or a few input tokens. This can lead to a short-sighted RNN model that mostly looks at the more recent history to make a prediction for the next token. In scheme 2 this potential issue is solved using single-loss training.
3.5 Scheme 2 – Single-loss training, windowed sampling
In the second scheme, the multi-loss training procedure of scheme 1 is replaced by the single-loss equivalent of Algorithm 2. The main advantage is that we allow the hidden state of the RNN a certain burn-in period, so that predictions can be made using more knowledge from the past sequence. Burning in the hidden state also enables the RNN to learn long-term dependencies in the data, because we only make a prediction after having seen $k_2$ tokens. The potential drawback is that learning is slower, since only one signal is backpropagated for every sequence compared to $k_2$ signals in the first scheme. The sampling algorithm, on the other hand, is the same as in the first scheme, and now almost perfectly reflects how the RNN has been trained, i.e. by only considering the final token for each input sequence.
3.6 Scheme 3 – Multi-loss training, progressive sampling
In scheme number 3, we go back to the multi-loss training procedure of scheme 1, but now the progressive sampling from Algorithm 5 is used instead of windowed sampling. One drawback of the sampling method in scheme 1 is that it is not very scalable for large values of $k_2$, since we need to feed a sequence of $k_2$ tokens to the RNN for every token that is sampled. In progressive sampling, on the other hand, the next token is sampled immediately for every new input token. This way, the sampling of new sequences is sped up by a factor of approximately $k_2$, which is the main advantage of this scheme.
3.7 Scheme 4 – Conditional multi-loss training, progressive sampling
In scheme 3 we still use standard multi-loss training, which resets the hidden state for every train sequence. Scheme 4 replaces this by the conditional multi-loss training procedure from Algorithm 3, while maintaining the progressive sampling algorithm. One of the main disadvantages of using this particular training algorithm is its requirement to keep track of the hidden states across train sequences and to carefully select these train sequences from the dataset, which can be hard in minibatch settings. Next to this, whenever the RNN weights are updated, the hidden state from before the update is reused, which can potentially lead to unstable learning behavior. On the plus side, we are able to learn dependencies between tokens that are more than $k_2$ time steps away, since the hidden state is remembered in between train sequences. Also, the need to learn an appropriate initial hidden state is eliminated, which can lead to a small speedup in learning.
3.8 Literature overview
We will now go over some of the works in literature that have used RNNs for language modeling, on both character and word level. Most of the works that are listed describe or have described state-of-the-art results on famous benchmarks such as the Penn Treebank dataset Marcus:1993wd; Mikolov:2010wx, WikiText2 Merity:2016wg and the Hutter Prize datasets Hutter:AXQ_crEu. The first two datasets are mainly used to benchmark word-level language models, while the Hutter Prize datasets are generally used for character-level evaluation. Some papers, however, also train character-level models on the Penn Treebank dataset. It is our purpose to give the reader a high-level idea of what schemes are being used in existing literature. We do not intend to give a complete overview of the literature on RNNs for language modeling. Instead, we focus on highly cited works that have, at some point, reported state-of-the-art results on some of the above-mentioned benchmarks. In this, attention is given to the most recent literature in the field.
The overview can be found in Table 2. A distinction is made between character-level models, word-level models and models that are applied on both levels. At the bottom, three different applications are listed that have used RNNs to model various sequential problems. We immediately notice that only 5 out of the 22 investigated papers explicitly mention training details regarding loss aggregation or hidden state propagation. In the other cases we had to go through the source code manually to infer the training and sampling scheme. If there was no source code available, we contacted the authors directly to ask for more details. Whenever we could not find information in the paper, the source code or through the authors, we have marked it with ‘Unknown’.
Scheme number 4 is by far the most popular in recent literature, but scheme number 3 is also widely used. As mentioned previously, the main difference between these two schemes is whether the transfer of the hidden state between subsequent training sequences takes place or not. There seems to be no clear consensus on this topic among researchers. The older works from 2012 and 2013 by Graves Graves:2013ua and Mikolov et al. Mikolov:2012bw (and by extension, most of the older works on RNNs) do not transfer the hidden state, while the community seems to be transitioning towards explicitly doing this. Although there exists no literature describing the advantages and disadvantages of both methods, we can think of some possible explanations for this. First, while going through multiple source code repositories, we have noticed that source code is often reused by copying and adapting from previous work. This causes architectural and computational designs to transfer from previous work into other works. Another possible cause lies with the evolution of deep learning frameworks. TensorFlow, Keras and PyTorch have made it fairly easy to train RNNs with hidden state transfer, while this was less straightforward or required more effort in older frameworks, such as Theano and Lasagne.
Table 2: Training and sampling schemes used in existing literature, and the source from which the scheme was inferred.
Reference  Model type  Scheme  Information source
(Graves, 2013) Graves:2013ua  Character-level  3  Author communication
(Wu, 2016) Wu:2016vm  Character-level  Unknown  
(Ha, 2016) Ha:2016ua  Character-level  Unknown  
(Cooijmans, 2016) Cooijmans:2016te  Character-level  3  Author communication
(Krause, 2016) BenKrause:2016um  Character-level  4  Author communication
(Chung, 2016) Chung:2016tma  Character-level  4  Paper
(Mujika, 2017) Mujika:2017uj  Character-level  4  Paper
(Zilly, 2017) Zilly:2017wg  Word & character-level  4  Source code
(Melis, 2017) Melis:2017vx  Word & character-level  4  Paper
(Mikolov, 2012) Mikolov:2012bw  Word-level  3  Source code
(Zaremba, 2014) Zaremba:2014up  Word-level  4  Paper
(Kim, 2015) Kim:2015vh  Word-level  4  Source code
(Gal, 2016) Gal:2016ti  Word-level  3  Source code
(Merity, 2016) Merity:2016wg  Word-level  2/4 (?)  Author communication
(Bradbury, 2016) Bradbury:2016ul  Word-level  4 (?)  Author communication
(Zoph, 2016) Zoph:2016jq  Word-level  4  Author communication
(Inan, 2016) Inan:2016wq  Word-level  4  Source code
(Merity, 2017) Merity:2017vl  Word-level  4  Source code
(Yang, 2017) Yang:2017ur  Word-level  3  Author communication
(Sturm, 2016) Sturm:2016tv  Music notes  3  Author communication
(Saon, 2016) Saon:2016vu  Phonemes  3  Author communication
(De Boom, 2017) DeBoom:2017jo  Playlist tracks  2  Paper
To conclude this concise overview, we have shown that there is a need for clarity and transparency in literature concerning training and sampling details for RNNs, not only in the interest of reproducibility, but also to raise awareness in the research community. This paper is a first attempt at calling attention to the different training and sampling schemes for RNNs, and to the trade-offs each of them poses. In the next section, each of the schemes is evaluated thoroughly in a number of experimental settings.
4 Evaluation
In this section, all training and sampling schemes are evaluated in a variety of settings. As mentioned before, the task in each of these settings is the same: predicting the next token or character in a sequence given the previous tokens or characters. To perform the evaluation we will use four datasets with different characteristics: English text, Finnish text, C code, and classical music. Next to this, we will vary the RNN architecture (such as the number of recurrent layers and the hidden state size) as well as the truncated BPTT parameters. Through these evaluations, we will give some recommendations on how to train and sample from character-level RNNs.
4.1 Experimental setup
The central part of our experiments is the RNN model. For this, we construct a standard architecture with some parameters that we can vary. The input of the RNN is a one-hot representation of the current character in the sequence, i.e. a vector of zeros with length $|V|$, with $V$ the ordered set of all characters in the dataset, except for a single one at the current character’s position in $V$. Next, $L$ recurrent LSTM layers are added, each with a hidden state dimensionality of $H$. In the experiments, the parameters $L$ and $H$ will be varied. Finally, we add two fully connected dense layers, one with a fixed dimensionality of 1,024 neurons, and the final dense layer again has dimensionality $|V|$. At this final layer a softmax function is applied in order to arrive at a probability distribution across all possible next characters. The complete architecture is summarized in Table 3, including nonlinear activation functions and extra details.
Table 3: The RNN architecture used in the experiments.
Layer  Type (no. of dimensions)  Nonlinearity and details
  Input ($|V|$)  
1 to $L$  LSTM ($H$)  sigmoid (gates); tanh (hidden and cell state update); orthogonal initialization, gradient clipping at 50.0
$L$ + 1  Fully connected dense (1,024)  leaky ReLU, leakiness , glorot uniform initialization
$L$ + 2  Fully connected dense ($|V|$)  softmax, glorot uniform initialization
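The one-hot input representation described above can be illustrated as follows. The helper names are hypothetical, and the vocabulary is assumed to be ordered by sorting the unique characters, which is only one possible ordering.

```python
def build_vocabulary(text):
    """Ordered set of all characters that appear in the dataset."""
    return sorted(set(text))


def one_hot(char, vocab):
    """Vector of zeros of length len(vocab) with a single one at the
    position of `char` in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab.index(char)] = 1
    return vector


vocab = build_vocabulary("hello")
encoded = [one_hot(char, vocab) for char in "hello"]
```

For the toy string `hello` the vocabulary has four entries, so every character is represented by a four-dimensional vector.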
To train the RNN model we will use one of the schemes outlined in Section 3. As is common practice in deep learning and gradient-based optimization, multiple training sequences are grouped in batches. Each sequence in such a batch has a length of $k_2 + 1$ tokens, from which the first $k_2$ are used as input to the RNN, and the next token is used as ground truth signal for every input token. In this paper, a batch size of 64 sequences is used across all experiments. To ensure a diverse mix of sequences in each batch, we pick sequences at equidistant offsets, which we increase by $k_1$ for every new batch. More specifically, for every $i$’th batch, 64 sequences are sampled at the following offsets in the train set $\mathcal{D}_{\text{train}}$:
(8) 
The entire train set is also circularly shifted after each training epoch. Since in scheme 4 the hidden state is transferred across different batches, this batching method allows us to fairly compare all four schemes.
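The batching idea can be sketched as follows. This is one plausible reading of the description (equidistant starting points that all move forward by `k1` for each new batch) and not necessarily the exact expression used in the experiments; the function name, the integer spacing and the wrap-around with a modulo are assumptions made for the example.

```python
def batch_offsets(batch_index, train_len, k1, batch_size=64):
    """Start offsets of the sequences in one batch: equidistant points in the
    train set, all shifted by k1 for every new batch (assumed form)."""
    spacing = train_len // batch_size  # distance between sequences in a batch
    return [
        (j * spacing + batch_index * k1) % train_len  # wrap-around is assumed
        for j in range(batch_size)
    ]


first = batch_offsets(batch_index=0, train_len=640, k1=5, batch_size=4)
second = batch_offsets(batch_index=1, train_len=640, k1=5, batch_size=4)
```

Under this reading, the sequence at position `j` of one batch continues `k1` tokens further in the next batch, which is what allows a hidden state to be carried over per batch position in scheme 4.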
At regular points during training the performance of the RNN is evaluated with data from the test set . For this we will use the perplexity measure, which is widely used in evaluating language models:
(9) 
In this formula is the ’th token in the sequence and is the total number of tokens. The better a model is at predicting the next token, the lower its perplexity measure. In the context of RNNs, the conditional probability in Equation (9) is approximated using the hidden state of the RNN, as was shown in Equation (3). In practice, the hidden state of the RNN is bootstrapped with characters and perplexity is calculated on all subsequent characters in the test set.
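For reference, perplexity in its standard form is the exponential of the average negative log-probability that the model assigns to the true next tokens. The sketch below implements this standard definition in plain Python; the function name and the use of natural logarithms are choices made for this example (any logarithm base gives the same value as long as the matching exponentiation is used).

```python
import math


def perplexity(probabilities):
    """Perplexity from the probabilities assigned to the true next tokens.

    `probabilities[i]` is the model's probability for the token that actually
    occurred at position i, given all preceding tokens.
    """
    n = len(probabilities)
    average_negative_log = -sum(math.log(p) for p in probabilities) / n
    return math.exp(average_negative_log)


uniform_guess = perplexity([0.25] * 8)   # a model that always spreads mass over 4 options
perfect_model = perplexity([1.0, 1.0])   # a model that is always certain and correct
```

A model that assigns probability 1/4 to every true token has a perplexity of 4, which matches the intuition of perplexity as an effective number of equally likely choices.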
In the experiments below, every RNN is trained with 12,800 batches of 64 sequences using the batching method described above. For all schemes and experiments the standard categorical cross-entropy loss function is used, which is the negative inner product between the log output probability vector and the one-hot vector of the target token:
(10) 
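The categorical cross-entropy for a single prediction can be sketched in a few lines of plain Python. The helper names are hypothetical; the softmax is included only to produce a valid probability vector for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_one_hot):
    """Negative inner product of the log-probabilities and the one-hot target."""
    # Only positions with a non-zero target contribute, which also avoids log(0)
    # for classes that are not the target.
    return -sum(t * math.log(p)
                for p, t in zip(probabilities, target_one_hot) if t)


probabilities = softmax([0.0, 0.0, 0.0, 0.0])
loss = cross_entropy(probabilities, [0, 0, 1, 0])
```

Because the target is one-hot, the loss reduces to the negative log-probability of the correct token; exponentiating the average of this loss over all tokens gives the standard perplexity measure.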
During training we report perplexity on the test set at logarithmically spaced intervals. All RNN models are trained five times with always set to 100 (unless explicitly indicated otherwise), and we choose . For every new configuration we reinitialize all network weights and random generators to the same initial values. As the optimization algorithm we use Adam with a learning rate of 0.001 throughout the experiments.
All experiments are performed on a single machine with a 12-core Intel Xeon E5-2620 2.40 GHz CPU and an Nvidia Tesla K40c GPU. We use a combination of Theano 0.9 and Lasagne 0.2 as implementation tools, powered by cuDNN 5.0.
4.2 Datasets
In the experiments the performance of each scheme is evaluated on four datasets (available for download at https://github.com/cedricdeboom/characterlevelrnndatasets). The characteristics of these datasets are as follows.

English: we compiled all plays by William Shakespeare from the Project Gutenberg website (www.gutenberg.org) in one dataset. The plays follow each other in random order. The total number of characters is 6,347,705 with 85 unique characters.

Finnish: this language is very different from English. On the Gutenberg website we gathered all texts from Finnish playwrights Juhani Aho and Eino Leino. This results in a dataset of 10,976,530 characters, of which 106 are unique.

Linux: we saved all C code from the Linux kernel (github.com/torvalds/linux/tree/master/kernel) and gathered the files together. On November 22, 2016, the entire kernel contained 6,546,665 characters, and 97 of them are unique.

Music: we created this dataset by extracting music notes from MIDI files. When notes are played simultaneously in the MIDI file, we extract them from low to high, so that we obtain a single sequence of subsequent notes. We downloaded all piano compositions by Bach, Beethoven, Chopin and Haydn from Classical Archives (www.classicalarchives.com), removed duplicate compositions, and gathered a dataset of 1,553,852 notes, of which there are 90 unique ones.
After cyclically permuting each dataset over a randomly chosen offset, we extract the last 11,100 characters to compile a test set. All remaining characters form the train set.
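The split described above can be sketched as follows. The function name `split_dataset` and the `seed` argument are hypothetical and only serve this example; the default test size of 11,100 tokens is taken from the text.

```python
import random


def split_dataset(tokens, test_size=11100, seed=0):
    """Cyclically permute `tokens` over a random offset and split off a test set.

    Works for strings and lists. Returns (train, test), where `test` holds the
    last `test_size` tokens of the permuted sequence.
    """
    if not 0 < test_size < len(tokens):
        raise ValueError("test_size must be between 1 and len(tokens) - 1")
    rng = random.Random(seed)
    offset = rng.randrange(len(tokens))
    rotated = tokens[offset:] + tokens[:offset]
    return rotated[:-test_size], rotated[-test_size:]


train, test = split_dataset("abcdefghij", test_size=3, seed=1)
```

Since the permutation is cyclic, the concatenation of the train and test parts is always a rotation of the original sequence, so the local order of the tokens is preserved everywhere except at a single cut point.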
4.3 Experiments
Several experiments are now performed to evaluate the predictive performance of RNNs that have been trained with different configurations. In a first round of experiments the architecture of the RNN models is varied. More specifically, we set the number of recurrent LSTM layers to 1 or 2, and we also change the hidden state size to either 128 or 512. Figure 7 shows plots for these different RNN architectures, trained using scheme 1 and on all four datasets. For every architecture we have plotted five lines for the different settings of mentioned above. In all plots we see that the RNNs with (green and yellow) initially perform better, but the RNNs with (green and blue) learn somewhat faster in the long term. At 12,800 batches there is no clear difference in performance anymore between the architectures. On the Music dataset and architectures with we observe some overfitting. If we add of dropout to the final two dense layers Srivastava:2014ww , this overfitting is already greatly reduced, but still observable (not shown in the graph). For the Finnish dataset and architectures with we notice a bump around 1,000 train sequences, which is present for all configurations. This bump is lowered if we reduce the learning rate to 0.0001 or use a different, non-momentum-based optimizer such as RMSProp, but it remains an artefact of both the dataset and architecture.
We also perform the same experiments with scheme 2, for which the results are shown in Figure 8. The same behavior with respect to the architectural differences is observed as in the first scheme. But now the networks converge somewhat slower, which can be seen especially for the Music dataset by comparing Figures 7(d) and 8(d). On the plus side, the performance curves are smoother than for scheme 1. Both effects can be explained by the fact that there is only one loss signal at the end of each training sequence, which makes learning slower, but the backpropagated gradient is of higher quality, since more historical characters are taken into account. From Figures 7 and 8 we conclude that the RNN architecture indeed influences the efficiency of the training procedure, but that the same effect is observed globally across datasets and training schemes. The best architecture for all four datasets has parameters and , i.e. the green plots. This specific architecture will therefore be used in the next experiments.
Next, all schemes are compared on the different datasets. As mentioned above, the best architecture from the previous experiments is used. The perplexity plots are gathered in Figure 9. We see that schemes 1 and 2 are very robust across datasets, but also across different settings of the training parameters, since all lines lie very close to each other. Scheme 1 is also the best performing in terms of perplexity. The performance of scheme 2 is overall worse compared to scheme 1, which is probably due to the fact that learning occurs more slowly, as argued before. The performance of scheme 3 is comparable to the first scheme, but very slightly worse and less robust. Since the training procedure of schemes 1 and 3 is the same, we hypothesize that the sampling procedure of scheme 3 sometimes has difficulties recovering from errors, which can be carried across many time steps. Another reason is that the RNN has not learned to make predictions for sequences longer than the training sequences. We also mention that for schemes 1, 2 and 3 we experimented with randomly shuffling all training sequences instead of circularly shifting the train set, as explained in Section 4.1, but this did not lead to different observations. In scheme 4 the hidden state is transferred across sequences during training, which appears to solve this problem, at least for some configurations of the training parameters. All configurations for scheme 4 start with the same performance as for scheme 3, but after around 200 train batches (i.e. 12,800 train sequences in the graph) some configurations start diverging, for which we cannot isolate any consistent explanation. From the figures we see that this behavior is also heavily dependent on the dataset; the difference between e.g. the Finnish and Music datasets is notable.
It is also interesting to take a look at a comparison between performances on different datasets for the same scheme. These curves are plotted in Figure 10. For all schemes we notice that the performance on the English, Finnish and Linux datasets is almost equal; only the Music dataset seems harder to model with the same RNN architecture. What we also observe is that scheme 1 is very robust against changes in training parameters, since all curves lie very close to each other. There is more variance in this for scheme 2, even more for scheme 3, and it is highest for scheme 4.
At this point we would also like to discuss data efficiency. For small values of , we use less data at a particular point in the training process compared to larger values of . This is important when data resources are scarce. From Figure 10 it is noticeable that, at least for the same scheme, the lines for different values of lie very close to each other. From these experiments, a general conclusion could be to use a small value of in order to be as data efficient as possible. The choice of , after all, seems to have less impact than the choice of training scheme. Additionally, using a small value of improves label reuse in the multi-loss training algorithms. This can approximately be quantified by , i.e. the number of times a label is reused in the training process.
Up until now we have been comparing the performance of different RNN models and schemes in terms of the number of train sequences used up until a certain point in time. But the models can also be compared in terms of absolute training and sampling time, which will give us an overview of which configurations are the fastest. In the next experiment, we calculate the average training time per batch and sampling time for a single token on the English dataset. We will vary the scheme that we use for training, as well as the RNN architecture. Concerning the training parameter, there will be almost no difference in training time, so it is kept fixed across all measurements. The numbers are shown in Table 4. It is no surprise that schemes 1, 3 and 4 have almost equal training time per batch, while scheme 2 trains significantly faster since we only need to compute one softmax output for each training sequence. It is however noticeable that the more complex the RNN architecture, the smaller the relative difference in training time, with a decrease of 25% for the simplest architecture and just 9% for the most complex one. Regarding the sampling times, we see that the 3rd and 4th schemes are faster by a factor of 10 up to 20 compared to schemes 1 and 2, since there is no need to propagate an entire sequence through the RNN to sample a new token.
Scheme 1 | 60.5 / 7.0 | 138.8 / 21.7 | 106.5 / 13.7 | 267.4 / 43.3
Scheme 2 | 46.5 / 7.0 | 114.6 / 21.7 | 93.1 / 13.7 | 245.8 / 43.3
Scheme 3 | 60.5 / 0.7 | 138.8 / 1.2 | 106.5 / 1.1 | 267.5 / 2.2
Scheme 4 | 60.6 / 0.7 | 138.9 / 1.2 | 106.6 / 1.1 | 267.6 / 2.2
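The sampling speed-up mentioned above can be checked directly from the numbers in Table 4. The short calculation below copies the rows for schemes 1 and 3, each entry being a (training time per batch, sampling time per token) pair for one RNN architecture, and divides the sampling times; the variable names are chosen for this example only.

```python
# (training time per batch, sampling time per token) per architecture, from Table 4.
timings = {
    "Scheme 1": [(60.5, 7.0), (138.8, 21.7), (106.5, 13.7), (267.4, 43.3)],
    "Scheme 3": [(60.5, 0.7), (138.8, 1.2), (106.5, 1.1), (267.5, 2.2)],
}

# Ratio of the sampling time of scheme 1 to that of scheme 3, per architecture.
speedups = [
    scheme_1[1] / scheme_3[1]
    for scheme_1, scheme_3 in zip(timings["Scheme 1"], timings["Scheme 3"])
]

for factor in speedups:
    print(f"sampling speed-up: {factor:.1f}x")
```

The ratios all fall in the range of roughly 10 to 20, and they grow with the sampling time of scheme 1, i.e. with the cost of a full forward pass.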
We also compare the performance of the different schemes with respect to changes in the sequence length. For each scheme we perform five experiments, for which the sequence length is set successively to 20, 40, 60, 80 and 100. After setting the sequence length, the parameter is set to , and , rounded to the nearest integer. Every experiment is performed on the Music dataset, since, based on previous experiments, we expect to gain most insights on it. We report the model perplexity on the test set as a function of the elapsed training time, and we train again for a total of 12,800 batches. The results are shown in Figure 11, in which the axis is clipped to a maximum of 80 to achieve the most informative view. We see that the smaller the sequence length, the faster we have trained all batches, since it leads to a shorter BPTT. The first scheme is again the most robust against changes in the sequence length. Only the shortest sequence lengths behave more noisily in the first 10 seconds of training, but all configurations are able to reach a similar optimal perplexity. The second scheme trains much slower than scheme 1, and experiences instability problems for small sequence lengths of 20 and 40. The configurations with are all very stable, but have not yet fully converged after 12,800 batches. For scheme 3 we see almost the same behavior as in scheme 1, with all configurations reaching the same optimal perplexity. But, just as we saw before, the robustness against changes in is worse. This is especially true for small values of , as shown by the blue lines in Figure 10(c). Finally, for scheme 4 we see that almost all configurations are unstable and behave very noisily. Two configurations with even achieve a final perplexity of around 350; lowering for small values of seems to help in this case.
We conclude this experimentation section with a few recommendations. We found that the global behavior of the different schemes is nearly independent of the dataset used. This is good news, since we do not have to tune the learning and sampling procedure to the dataset at hand. In this respect, we arrive at the following conclusions:

In terms of training schemes, the multi-loss approach (schemes 1 and 3) is recommended. Compared to the single-loss approach (scheme 2), multi-loss training is more efficient. When considering the total train time, the faster individual iterations of the single-loss approach cannot compensate for the benefit of combining the loss over multiple positions in the sequence.

Our general recommendation is to avoid training procedures in which the hidden state is transferred between input sequences (scheme 4). Training is as efficient as the multi-loss approach without transferred hidden states (scheme 3), but less robust. On noisy datasets, such as the Music dataset in our experiments, transferring hidden states is likely to cause unstable behavior.

On the sampling side, there is a trade-off between windowed sampling and progressive sampling. By comparing schemes 1 and 3, it is seen that windowed sampling is more robust than progressive sampling. However, the latter is more efficient by construction, as it samples the next character based on the current one and the hidden state, instead of each time performing a forward pass over a (possibly long) sequence as in the windowed sampling approach.
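The difference between windowed and progressive sampling can be made concrete with a small sketch. It uses a toy single-layer tanh RNN with random weights as a stand-in for a trained LSTM model; the class `TinyRNN`, all function names and the `calls` counter are hypothetical and only serve to count how many RNN steps each sampling procedure needs.

```python
import math
import random


class TinyRNN:
    """Toy tanh RNN with random weights (a stand-in for a trained LSTM)."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = rand(hidden_size, vocab_size)
        self.w_rec = rand(hidden_size, hidden_size)
        self.w_out = rand(vocab_size, hidden_size)
        self.hidden_size = hidden_size
        self.calls = 0  # number of RNN steps executed so far

    def initial_state(self):
        return [0.0] * self.hidden_size

    def step(self, hidden, token):
        """Consume one token index; return new hidden state and next-token probabilities."""
        self.calls += 1
        new_hidden = [
            math.tanh(self.w_in[i][token]
                      + sum(w * h for w, h in zip(self.w_rec[i], hidden)))
            for i in range(self.hidden_size)
        ]
        logits = [sum(w * h for w, h in zip(row, new_hidden)) for row in self.w_out]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return new_hidden, [e / total for e in exps]


def generate_windowed(model, seed_tokens, n, window, rng):
    """Windowed sampling: reset the state and re-read the last `window` tokens each time."""
    tokens = list(seed_tokens)  # assumed to be non-empty
    for _ in range(n):
        hidden = model.initial_state()
        for token in tokens[-window:]:
            hidden, probs = model.step(hidden, token)
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def generate_progressive(model, seed_tokens, n, rng):
    """Progressive sampling: carry the hidden state and feed only the latest token."""
    tokens = list(seed_tokens)  # assumed to be non-empty
    hidden = model.initial_state()
    for token in tokens[:-1]:  # bootstrap the hidden state on the seed
        hidden, _ = model.step(hidden, token)
    for _ in range(n):
        hidden, probs = model.step(hidden, tokens[-1])
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


model = TinyRNN(vocab_size=6, hidden_size=4, seed=1)
windowed = generate_windowed(model, [0, 1, 2, 3, 4], n=10, window=5, rng=random.Random(0))
windowed_steps = model.calls
model.calls = 0
progressive = generate_progressive(model, [0, 1, 2, 3, 4], n=10, rng=random.Random(0))
progressive_steps = model.calls
```

In this toy setting, generating 10 tokens with a window of 5 costs 50 RNN steps with windowed sampling and 14 steps with progressive sampling (4 to bootstrap on the seed, then one per token), which illustrates why the gap in sampling time grows with the window length.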
5 Future research tracks
We include one final section on future research tracks in the area of training and sampling procedures for character-level RNNs. In this paper we have made an attempt at isolating the four most common schemes that have been or are being used in literature. There are however multiple hybrid combinations that can be identified and investigated in the future. The most straightforward extension is an intermediate form between single- and multi-loss training. For example, an extra parameter could be identified, for which , that defines the number of time steps for which the loss is calculated and aggregated. The edge cases and correspond respectively to the single-loss and multi-loss training procedures. One other possibility is to decay the loss at each time step (linearly or exponentially) and combine these through a linear combination to calculate the final loss. For a single training sequence this results in:
with or for resp. linear and exponential decay. Consequently, the resulting gradient is scaled similarly, thereby reducing the contribution of the first few tokens in the sequence to the total loss.
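Both hybrid forms can be viewed as a weighted sum of the per-time-step losses. The sketch below illustrates this idea with hypothetical weighting functions; the exact weights are assumptions made for this example (a linear ramp towards the end of the sequence, and a geometric decay with a hypothetical `rate` away from the end), not the definitions used in the formula above.

```python
def loss_weights(num_steps, mode="multi", last_steps=None, rate=0.9):
    """Per-time-step loss weights; index 0 is the first token of the sequence."""
    if mode == "multi":        # every position contributes equally
        return [1.0] * num_steps
    if mode == "single":       # only the final position contributes
        return [0.0] * (num_steps - 1) + [1.0]
    if mode == "last":         # hybrid: only the last `last_steps` positions
        return [0.0] * (num_steps - last_steps) + [1.0] * last_steps
    if mode == "linear":       # assumed form: weight grows linearly towards the end
        return [(t + 1) / num_steps for t in range(num_steps)]
    if mode == "exponential":  # assumed form: weight decays away from the end
        return [rate ** (num_steps - 1 - t) for t in range(num_steps)]
    raise ValueError(f"unknown mode: {mode}")


def combined_loss(step_losses, weights):
    """Linear combination of the per-time-step losses."""
    return sum(w * loss for w, loss in zip(weights, step_losses))


step_losses = [1.0, 2.0, 3.0, 4.0]
single = combined_loss(step_losses, loss_weights(4, "single"))
multi = combined_loss(step_losses, loss_weights(4, "multi"))
linear = combined_loss(step_losses, loss_weights(4, "linear"))
```

With `mode="last"`, setting `last_steps` to 1 or to the full sequence length recovers the single-loss and multi-loss weightings respectively, so the intermediate values interpolate between the two training procedures.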
6 Conclusion
We explained the concept of character-level RNNs and how such models are typically trained using truncated backpropagation through time. We then introduced four schemes to train character-level RNNs and to sample new tokens from such models. These schemes differ in how they approximate the truncated backpropagation through time paradigm: how the RNN outputs are combined in the final loss, and whether the hidden state of the RNN is remembered or reset for each new input sequence. After that, we evaluated each scheme against different datasets and RNN configurations in terms of predictive performance and training time. We showed that our conclusions remain valid across all these different experimental settings.
Perhaps the most surprising result of the study is that conditional multi-loss training, in which the hidden state is carried across training sequences, often leads to unstable training behavior depending on the dataset. This contrasts sharply with the observation that this training procedure is used most often in literature, although it requires meticulous bookkeeping of the hidden state and a carefully designed batching method. Single-loss training is, compared to multi-loss, slower regarding the number of used train sequences. An advantage of single-loss training, however, is that we encourage the network to make predictions on a long-term basis, since we only backpropagate one loss defined at the end of a sequence.
We saw that progressive sampling is slightly less robust to changes in training parameters compared to windowed sampling, especially for datasets that are more difficult to model, as we showed with the Music dataset. The main advantage of progressive sampling is its speed: in our measurements it was roughly 10 to 20 times faster than windowed sampling.
Conflicts of interest
Funding: the hardware used to perform the experiments in this paper was funded by Nvidia.
Conflict of interest: Cedric De Boom is funded by a PhD grant of the Research Foundation - Flanders (FWO). The other authors declare that they have no conflicts of interest.
References
 (1) Bradbury, J., Merity, S., Xiong, C., Socher, R.: Quasi-Recurrent Neural Networks. arXiv.org (2016)
 (2) Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv.org (2014)
 (3) Chung, J., Ahn, S., Bengio, Y.: Hierarchical Multiscale Recurrent Neural Networks. arXiv.org (2016)
 (4) Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv.org (2014)

 (5) Cooijmans, T., Ballas, N., Laurent, C., Courville, A.: Recurrent Batch Normalization. arXiv.org (2016)
 (6) De Boom, C., Agrawal, R., Hansen, S., Kumar, E., Yon, R., Chen, C.-W., Demeester, T., Dhoedt, B.: Large-scale user modeling with recurrent neural networks for music discovery on multiple time scales (2017)
 (7) Gal, Y., Ghahramani, Z.: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS (2016)
 (8) Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
 (9) Graves, A.: Generating Sequences With Recurrent Neural Networks. arXiv.org (2013)
 (10) Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A Search Space Odyssey. arXiv.org (2015)
 (11) Gregor, K., Danihelka, I., Graves, A., Wierstra, D.: DRAW: A Recurrent Neural Network For Image Generation. arXiv.org (2015)
 (12) Gregor, K., Danihelka, I., Mnih, A., Blundell, C., Wierstra, D.: Deep AutoRegressive Networks. arXiv.org (2013)
 (13) Ha, D., Dai, A., Le, Q.V.: HyperNetworks. arXiv.org (2016)
 (14) Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based Recommendations with Recurrent Neural Networks. arXiv.org (2016)
 (15) Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001)
 (16) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (1997)
 (17) Hutter, M.: The Human Knowledge Compression Contest (2012)

 (18) Inan, H., Khosravi, K., Socher, R.: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. CoRR (2016)
 (19) Karpathy, A., Johnson, J., Fei-Fei, L.: Visualizing and Understanding Recurrent Networks. arXiv.org (2015)
 (20) Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-Aware Neural Language Models. arXiv.org (2015)
 (21) Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. In: ICLR (2015)
 (22) Krause, B., Lu, L., Murray, I., Renals, S.: Multiplicative LSTM for sequence modelling. arXiv.org (2016)
 (23) Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics (1993)
 (24) Melis, G., Dyer, C., Blunsom, P.: On the State of the Art of Evaluation in Neural Language Models. CoRR (2017)
 (25) Merity, S., Keskar, N.S., Socher, R.: Regularizing and Optimizing LSTM Language Models. CoRR (2017)
 (26) Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer Sentinel Mixture Models. arXiv.org (2016)
 (27) Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech (2010)
 (28) Mikolov, T., Zweig, G.: Context dependent recurrent neural network language model. In: 2012 IEEE Spoken Language Technology Workshop (SLT) (2012)
 (29) Mujika, A., Meier, F., Steger, A.: Fast-Slow Recurrent Neural Networks. arXiv.org (2017)
 (30) Oord, A.v.d., Kalchbrenner, N., Kavukcuoglu, K.: Pixel Recurrent Neural Networks (2016)
 (31) Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cognitive modeling (1988)
 (32) Saon, G., Sercu, T., Rennie, S., Kuo, H.-K.J.: The IBM 2016 English Conversational Telephone Speech Recognition System. arXiv.org (2016)

 (33) Sercu, T., Goel, V.: Advances in Very Deep Convolutional Neural Networks for LVCSR. In: Interspeech (2016)
 (34) Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (2014)
 (35) Sturm, B.L., Santos, J.F., Ben-Tal, O., Korshunova, I.: Music transcription modelling and composition using deep learning. arXiv.org (2016)
 (36) Sutskever, I.: Training recurrent neural networks. Ph.D. thesis (2013)
 (37) Tan, Y.K., Xu, X., Liu, Y.: Improved Recurrent Neural Networks for Session-based Recommendations. arXiv.org (2016)
 (38) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A Generative Model for Raw Audio. arXiv.org (2016)
 (39) Wu, Y., Zhang, S., Zhang, Y., Bengio, Y., Salakhutdinov, R.: On Multiplicative Integration with Recurrent Neural Networks. arXiv.org (2016)
 (40) Yang, Z., Dai, Z., Salakhutdinov, R., Cohen, W.W.: Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. arXiv.org (2017)
 (41) Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent Neural Network Regularization. arXiv.org (2014)
 (42) Zilly, J.G., Srivastava, R.K., Koutník, J., Schmidhuber, J.: Recurrent Highway Networks. ICML (2017)
 (43) Zoph, B., Le, Q.V.: Neural Architecture Search with Reinforcement Learning. CoRR (2016)