1 Introduction
Deep generative models for sequential data are an active field of research. Generation of text, in particular, remains a challenging and relevant area [Hu2017Towards]. Recurrent neural networks (RNNs) are a common model class, and are typically trained via maximum likelihood [bowmanVVDJB15] or adversarially [yuZWY16; fedusGD2018]. For conditional text generation, the sequence-to-sequence architecture of [sutskeverVL2014] has proven to be an excellent starting point, leading to significant improvements across a range of tasks, including machine translation [bahdanauCB14; vaswaniSPUJGKP17], summarization [rushCW15], sentence compression [filippova2015sentence] and dialogue systems [serban2016building]. Similarly, RNN language models have been used with success in speech recognition [mikolov2010recurrent; graves2014towards]. In all these tasks, generation is conditioned on information that severely narrows down the set of likely sequences. The role of the model is then largely to distribute probability mass within relatively constrained sets of candidates.
Our interest is, by contrast, in unconditional or free generation of text via RNNs. We take as our point of departure the shortcomings of existing model architectures and training methodologies developed for conditional tasks. These arise from increased demands on both accuracy and coverage. Generating grammatical and coherent text is considerably more difficult without reliance on an acoustic signal or a source sentence, which may constrain, if not determine, much of the sentence structure. Moreover, failure to sufficiently capture the variety and variability of the data may not surface in conditional tasks, yet it is a key desideratum in unconditional text generation.
The de facto standard model for text generation is based on the RNN architecture originally proposed by [graves2013generating] and incorporated as a decoder network in [sutskeverVL2014]. It evolves a continuous state vector, emitting one symbol at a time, which is then fed back into the state evolution – a property that characterizes the broader class of autoregressive models. However, even in a conditional setting, these RNNs are difficult to train without substituting previously generated words with ground-truth observations during training, a technique generally referred to as teacher forcing [williams1989learning]. This approach is known to cause biases [ranzato2015sequence; goyalLZZCB16] that can be detrimental to test-time performance, where such nudging is not available and where state trajectories can go astray, requiring ad hoc fixes like beam search [wisemanR16] or scheduled sampling [bengioVJS15]. Nevertheless, teacher forcing has been carried over to unconditional generation [bowmanVVDJB15].

Another drawback of autoregressive feedback [graves2013generating] is the dual use of a single source of stochasticity. The probabilistic output selection has to account for the local variability of the next-token distribution. In addition, it also has to inject a sufficient amount of entropy into the evolution of the state space sequence, which is otherwise deterministic. Such noise injection is known to compete with the explanatory power of autoregressive feedback mechanisms and may result in degenerate, near-deterministic models [bowmanVVDJB15]. As a consequence, a variety of papers have proposed deep stochastic state sequence models that combine stochastic and deterministic dependencies, e.g. [chung2015recurrent; fraccaro2016SPW], or that make use of auxiliary latent variables [goyalSCKB17], auxiliary losses [shabanian17], or annealing schedules [bowmanVVDJB15]. No canonical architecture has emerged so far, and it remains unclear how the stochasticity in these models can be interpreted and measured.
In this paper, we propose a stochastic sequence model that preserves the Markov structure of standard state space models by cleanly separating the stochasticity in the state evolution, injected via a white noise process, from the randomness in the local token generation. We train our model using variational inference (VI) and build upon recent advances in normalizing flows [rezendeM15; kingmaSW16] to define sufficiently rich stochastic state transition functions for both generation and inference. Our main goal is to investigate the fundamental question of how far one can push such an approach in text generation, and to better understand the role of stochasticity. For that reason, we have used the most basic problem of text generation as our testbed: word morphology, i.e. the mechanisms underlying the formation of words from characters. This enables us to empirically compare our model to autoregressive RNNs on several metrics that are intractable in more complex tasks such as word sequence modeling.

2 Model
We argue that text generation is subject to two sorts of uncertainty: uncertainty about plausible long-term continuations and uncertainty about the emission of the current token. The first reflects the entropy of all things considered “natural language”, the second reflects symbolic entropy at a fixed position that arises from ambiguity, (near-)analogies, or a lack of contextual constraints. As a consequence, we cast the emission of a token as a fundamental trade-off between committing to and forgetting information.
2.1 State space model
Let us define a state space model with transition function $F$,

(1)  $z_t = F(z_{t-1}, \xi_t), \qquad \xi_t \sim \mathcal{N}(0, I).$

$F$ is deterministic, yet driven by a white noise process $\xi_{1:T}$, and, starting from some initial state $z_0$, defines a homogeneous stochastic process. A local observation model $p(x_t \mid z_t)$ generates symbols $x_t$ and is typically realized by a softmax layer with symbol embeddings.
The marginal probability of a symbol sequence $x_{1:T}$ is obtained by integrating out the state sequence $z_{1:T}$,

(2)  $p(x_{1:T}) = \int p(z_{1:T}) \prod_{t=1}^{T} p(x_t \mid z_t)\, dz_{1:T}.$

Here $p(z_{1:T})$ is defined implicitly by driving $F$ with noise, as we will explain in more detail below.^{1}

^{1} For ease of exposition, we assume fixed-length sequences, although in practice one works with end-of-sequence tokens and variable-length sequences.

In contrast to common RNN architectures, we have defined $F$ to not include an autoregressive input, such as the previously emitted symbol $x_{t-1}$, making potential biases as in teacher forcing a non-issue. Furthermore, this implements our assumption about the role of entropy and information in generation. The information about the local outcome under $p(x_t \mid z_t)$ is not considered in the transition to the next state, as there is no feedback. Thus in this model, all entropy about possible sequence continuations must arise from the noise process $\xi_{1:T}$, which cannot be ignored in a successfully trained model.
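To make the generative process concrete, here is a minimal numpy sketch of a toy instance of such a model. The tanh transition, random weights, and dimensions are illustrative assumptions, not the parametrization used in the paper:

```python
import numpy as np

# Toy state space model: a deterministic transition driven by white noise,
# and a softmax observation model over a small symbol vocabulary.
rng = np.random.default_rng(0)
STATE, VOCAB, T = 4, 6, 5

W_z = rng.normal(scale=0.5, size=(STATE, STATE))   # hypothetical weights
W_xi = rng.normal(scale=0.5, size=(STATE, STATE))
W_out = rng.normal(scale=0.5, size=(VOCAB, STATE))

def F(z, xi):
    """Deterministic transition, driven by the noise input xi."""
    return np.tanh(W_z @ z + W_xi @ xi)

def emit_probs(z):
    """Softmax observation model p(x_t | z_t)."""
    logits = W_out @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_sequence(T):
    z = np.zeros(STATE)                 # some fixed initial state z_0
    xs = []
    for _ in range(T):
        xi = rng.standard_normal(STATE)     # white noise xi_t ~ N(0, I)
        z = F(z, xi)                        # deterministic transition
        xs.append(rng.choice(VOCAB, p=emit_probs(z)))  # sample x_t
    return xs

seq = sample_sequence(T)
```

Note that there is no feedback of `xs` into `F`: all entropy over continuations enters through the noise vectors.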
The implied generative procedure follows directly from the chain rule. To sample a sequence of observations we (i) sample a white noise sequence $\xi_{1:T}$, (ii) deterministically compute $z_t$ from $z_{t-1}$ and $\xi_t$ via $F$, and (iii) sample $x_t$ from the observation model $p(x_t \mid z_t)$. The remainder of this section focuses on how we can define a sufficiently powerful family of state evolution functions and how variational inference can be used for training.

2.2 Variational inference
Model-based variational inference (VI) allows us to approximate the marginalization in Eq. (2) by posterior expectations with regard to an inference model $q(z_{1:T} \mid x_{1:T})$. It is easy to verify that the true posterior obeys the conditional independences $z_t \perp x_{1:t-1} \mid z_{t-1}$, which informs our design of the inference model, cf. [fraccaro2016SPW]:

(3)  $q(z_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, x_{t:T}).$
This is to say, the previous state is a sufficient summary of the past. Jensen’s inequality then directly implies the evidence lower bound (ELBO)

(4)  $\log p(x_{1:T}) \geq \mathbb{E}_{q}\left[ \log \frac{p(x_{1:T}, z_{1:T})}{q(z_{1:T} \mid x_{1:T})} \right]$

(5)  $\phantom{\log p(x_{1:T})} = \sum_{t=1}^{T} \mathbb{E}_{q}\big[ \log p(x_t \mid z_t) \big] - \mathbb{E}_{q}\big[ \mathrm{KL}\big( q(z_t \mid z_{t-1}, x_{t:T}) \,\|\, p(z_t \mid z_{t-1}) \big) \big].$

This is a well-known form, which highlights the per-step balance between prediction quality and the discrepancy between the transition probabilities of the unconditioned generative and the data-conditioned inference models [fraccaroSPW2016; chung2015recurrent]. Intuitively, the inference model breaks down the long-range dependencies and provides a local training signal to the generative model for a single-step transition and a single output generation.
Using VI successfully for generating symbol sequences requires parametrizing powerful yet tractable next-state transitions. As a minimum requirement, forward sampling and log-likelihood computation need to be available. Extensions of VAEs [rezendeM15; kingmaSW16] have shown that for non-sequential models, under certain conditions, an invertible function $f$ can shape a moderately complex distribution over a noise variable $\xi$ into a highly complex one over $z = f(\xi)$, while still providing the operations necessary for efficient VI. The authors show that a bound similar to Eq. (5) can be obtained by using the law of the unconscious statistician [rezendeM15] and a density transformation to express the discrepancy between generative and inference model in terms of $\xi$ instead of $z$:

(6)  $-\mathrm{KL}\big( q(z) \,\|\, p(z) \big) = \mathbb{E}_{q(\xi)}\left[ \log p(f(\xi)) - \log q(\xi) + \log \left| \det \tfrac{\partial f}{\partial \xi} \right| \right].$

This allows the inference model to work with an implicit latent distribution at the price of computing the Jacobian determinant of $f$. Luckily, there are many choices of $f$ for which this can be done efficiently [rezendeM15; dinhSB16].
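The density transformation used here can be checked numerically on a toy affine flow; the matrix and offset below are arbitrary illustrative values, not a learned flow:

```python
import numpy as np

# Change of variables on a toy 2-d affine flow z = A @ xi + b: the implied
# density q_z(z) = q_xi(xi) / |det A| must match the exact Gaussian
# N(b, A A^T) that this flow induces from standard normal noise.
A = np.array([[2.0, 0.0], [0.5, 1.5]])
b = np.array([0.3, -0.2])

def log_q_xi(xi):
    """Log-density of the standard normal base noise."""
    return -0.5 * xi @ xi - xi.size / 2 * np.log(2 * np.pi)

def log_q_z(z):
    """Log-density of z via the change-of-variables formula."""
    xi = np.linalg.solve(A, z - b)            # invert the flow
    log_det = np.log(abs(np.linalg.det(A)))   # Jacobian correction
    return log_q_xi(xi) - log_det

def log_mvn(z, mean, cov):
    """Reference: exact multivariate normal log-density."""
    diff = z - mean
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + np.log(np.linalg.det(cov))
                   + z.size * np.log(2 * np.pi))

z = np.array([1.0, 0.7])
```

Both parametrizations agree at every point, which is exactly what lets the inference model work with an implicit latent density.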
2.3 Training through coupled transition functions
We propose to use two separate transition functions $F$ and $G$ for the inference and the generative model, respectively. Using results from flow-based VAEs, we derive an ELBO that reveals the intrinsic coupling of the two and expresses their relation as a part of the objective that is determined solely by the data. A shared transition model constitutes a special case.
Two-Flow ELBO

For a transition function as in Eq. (1), fix $z_{t-1}$ and define the restriction $F_{z_{t-1}}(\xi) := F(z_{t-1}, \xi)$. We require that for any $z_{t-1}$, $F_{z_{t-1}}$ is a diffeomorphism and thus has a differentiable inverse. In fact, as we work with (possibly) different transitions $G$ and $F$ for generation and inference, we have restrictions $G_{z_{t-1}}$ and $F_{z_{t-1}}$, respectively. For better readability we will omit the conditioning variable $z_{t-1}$ in the sequel.
By combining the per-step decomposition in (5) with the flow-based ELBO from (6), we get (implicitly setting $z_t = F(\xi_t)$):

(7)  $\sum_{t=1}^{T} \mathbb{E}_{q(\xi_t)}\left[ \log p(x_t \mid z_t) + \log \frac{p(z_t \mid z_{t-1})}{q(\xi_t \mid x)} + \log \left| \det J_{F}(\xi_t) \right| \right].$
As our generative model also uses a flow $G$ to transform a simple noise density into a distribution on $z_t$, it is more natural to use the (simple) density in noise space. Performing another change of variables, this time on the density of the generative model, we get

(8)  $\log p(z_t \mid z_{t-1}) = \log \mathcal{N}(\xi'_t; 0, I) - \log \left| \det J_{G}(\xi'_t) \right|, \qquad \xi'_t := (G^{-1} \circ F)(\xi_t),$

where now $\mathcal{N}(\xi'_t; 0, I)$ is simply the (multivariate) standard normal density, as $G$ does not depend on the data $x$, whereas $F$ does. We have introduced the new noise variable $\xi'_t$ to highlight the importance of the transformation $G^{-1} \circ F$, which is a combined flow of the forward inference flow and the inverse generative flow. Essentially, it follows the suggested distribution of the inference model into the latent state space and back into the noise space of the generative model with its uninformative distribution. Putting this back into Eq. (7) and exploiting the fact that the Jacobians can be combined via $\log|\det J_{G^{-1} \circ F}| = \log|\det J_F| - \log|\det J_G|$, we finally get

(9)  $\sum_{t=1}^{T} \mathbb{E}_{q(\xi_t)}\left[ \log p(x_t \mid z_t) + \log \frac{\mathcal{N}(\xi'_t; 0, I)}{q(\xi_t \mid x)} + \log \left| \det J_{G^{-1} \circ F}(\xi_t) \right| \right].$
Interpretation
Naïvely employing the model-based ELBO approach, one has to learn two independently parametrized transition models, one informed about the future and one not. Matching the two then becomes an integral part of the objective. However, since the transition model encapsulates most of the model complexity, this introduces redundancy where the learning problem is most challenging. Nevertheless, the generative and inference models do address the transition problem from very different angles. Therefore, forcing both to use the exact same transition model might limit flexibility during training and result in an inferior generative model. Thus our model casts $F$ and $G$ as independently parametrized functions that are coupled through the objective by treating them as proper transformations of an underlying white noise process.^{2} We can think of $G^{-1} \circ F$ as a stochastic bottleneck with the observation model attached to the middle layer. Removing the middle layer collapses the bottleneck and prohibits learning a compression.

^{2} Note that identifying $G^{-1} \circ F$ as an invertible function allows us to perform a backwards density transformation which cancels the regularizing terms. This is akin to any flow objective (e.g. see Equation (15) in [rezendeM15]), where applying the transformation additionally to the prior cancels out the Jacobian term.
Special cases
Additive Gaussian noise can be seen as the simplest form of $F$ or, alternatively, as a generative model without a flow (i.e. $G = \mathrm{id}$). Of course, repeated addition of noise does not provide a meaningful latent trajectory. Finally, note that for $F = G$ we have $\xi'_t = \xi_t$, the numerator in the second term becomes a simple prior probability $\mathcal{N}(\xi_t; 0, I)$, and the determinant reduces to a constant. We now explore possible candidates for the flows in $F$ and $G$.

2.4 Families of transition functions
Since the Jacobian of a composed function factorizes, a flow is often composed of a chain of individual invertible functions [rezendeM15]. We experiment with individual functions of the form

(10)  $F(z_{t-1}, \xi_t) = b(z_{t-1}) + A(z_{t-1})\, \xi_t,$

where $b$ is a multilayer MLP and $A$ is a neural network mapping $z_{t-1}$ to a lower-triangular matrix with non-zero diagonal entries. Again, we use MLPs for this mapping and clip the diagonal away from zero by a margin given by a hyperparameter. The lower-triangular structure allows computing the determinant in $O(d)$ and stable inversion of the mapping by substitution in $O(d^2)$. As a special case, we also consider restricting $A$ to diagonal matrices. Finally, we experiment with a conditional variant of the Real NVP flow [dinhSB16].
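A sketch of this lower-triangular affine flow; the MLP outputs $b(\cdot)$ and $A(\cdot)$ are replaced by fixed random toy values, and the clipping margin is an illustrative choice:

```python
import numpy as np

# Lower-triangular affine flow z = b + A @ xi, with the diagonal of A
# clipped away from zero so the flow stays invertible.
rng = np.random.default_rng(2)
d, eps = 3, 0.1

b = rng.normal(size=d)                 # stand-in for the MLP output b(.)
A = np.tril(rng.normal(size=(d, d)))   # stand-in for the matrix output A(.)
diag = A.diagonal().copy()
diag = np.where(diag >= 0, np.maximum(diag, eps), np.minimum(diag, -eps))
np.fill_diagonal(A, diag)              # clip diagonal away from zero

def forward(xi):
    return b + A @ xi

def log_abs_det():
    # Triangular structure: |det A| is the product of the diagonal, O(d).
    return np.sum(np.log(np.abs(A.diagonal())))

def inverse(z):
    # Stable inversion by substitution, O(d^2) for triangular A.
    return np.linalg.solve(A, z - b)

xi = rng.standard_normal(d)
z = forward(xi)
```

Inverting the flow recovers the noise exactly, and the cheap log-determinant matches the full determinant computation.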
Computing $G^{-1}$ is central to our objective, and we found that, depending on the flow, parametrizing the inverse directly results in more stable and efficient training.
2.5 Inference network
So far we have only motivated the factorization of the inference network but otherwise treated it as a black box. Recall that sampling from the inference network amounts to sampling $\xi_t \sim q(\xi_t \mid x)$ and then performing the deterministic transition $z_t = F(z_{t-1}, \xi_t)$. We observe much better training stability when conditioning $q$ on the data only and modeling interaction with the previous state $z_{t-1}$ exclusively through $F$. This coincides with our intuition that the two inputs to a transition function provide semantically orthogonal contributions.
We follow existing work [dinhSB16] and choose $q(\xi_t \mid x_{t:T})$ as the density of a normal distribution with diagonal covariance matrix. We follow the idea of [fraccaro2016SPW] and incorporate the variable-length sequence by conditioning on the state of an RNN running backwards in time across the data. We embed the symbols in a vector space and use a GRU cell to produce a sequence of hidden states, where the hidden state $h_t$ has digested the tokens $x_{t:T}$. $h_t$ then parametrizes the mean and covariance matrix of $q(\xi_t \mid x_{t:T})$.

2.6 Optimization

Except in very specific and simple cases, for instance a Kalman filter, it will not be possible to efficiently compute the expectations in Eq. (5) exactly. Instead, we sample in every timestep, as is common practice for sequential ELBOs [fraccaroSPW2016; goyalSCKB17]. The reparametrization trick allows pushing all necessary gradients through these expectations to optimize the bound via stochastic gradient-based optimization techniques such as Adam [kingmaB14].

2.7 Extension: Importance-weighted ELBO for tracking the generative model
Conceptually, there are two ways one can imagine an inference network proposing state sequences $z_{1:T}$ for a given sequence $x_{1:T}$: either, as described above, by digesting $x_{1:T}$ right-to-left and proposing states left-to-right; or by iteratively proposing a next state taking into account the last state proposed and the deterministic generative mechanism $G$. The latter allows the inference network to peek at the states that $G$ could generate from $z_{t-1}$ before proposing an actual target $z_t$. This allows the inference model to track a multimodal generative model without requiring the inference model to match its expressiveness. As a consequence, this might offer the possibility to learn multimodal generative models without the need to employ complex multimodal distributions in the inference model.
Our extension is built on importance-weighted autoencoders (IWAE) [burda15]. The IWAE ELBO is derived by writing the log marginal as a Monte Carlo estimate over $K$ samples before applying Jensen’s inequality. The result is an ELBO and corresponding gradients of the form^{3}

(11)  $\log p(x) \geq \mathbb{E}_{\xi^{1:K} \sim q}\left[ \log \frac{1}{K} \sum_{k=1}^{K} w_k \right], \qquad w_k = \frac{p(x, z^k)}{q(z^k \mid x)}.$

^{3} Here we have tacitly assumed that $q$ can be rewritten using the reparametrization trick so that the expectation can be expressed with respect to some parameter-free base distribution. See [burda15] for a detailed derivation of the gradients in (11).

The authors motivate (11) as a weighting mechanism relieving the inference model from explaining the data well with every sample. We use the symmetry of this argument to let the inference model condition on potential next states from the generative model without requiring every such state to allow the inference model to make a good proposal. In other words, the sampled outputs of $G$ become a vectorized representation of the generative model to condition on. In our sequential model, computing the weights exactly is intractable, as it would require rolling out the network until time $T$. Instead, we limit the horizon to only one timestep. Although this biases the estimate of the weights and consequently the ELBO, longer horizons did not show empirical benefits. When proceeding to timestep $t+1$, we choose the new hidden state by sampling proportionally to the weights. Algorithm 1 summarizes the steps carried out at time $t$, and a more detailed derivation of the bound is given in Appendix A.
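The per-step resampling step can be sketched as follows; the log-weights are illustrative stand-ins for $\log p - \log q$ values from a trained model, not real model outputs:

```python
import numpy as np

# K candidate proposals receive importance weights; the next state index is
# drawn proportionally to them, and the per-step IWAE term is the log of
# the average (unnormalized) weight, computed in log-space for stability.
rng = np.random.default_rng(3)
K = 5
log_w = rng.normal(size=K)              # toy log-weights log p_k - log q_k

shift = log_w.max()                     # log-sum-exp stabilization
w = np.exp(log_w - shift)
w_norm = w / w.sum()

k = rng.choice(K, p=w_norm)             # index of the state kept for t+1
iwae_term = np.log(w.mean()) + shift    # log (1/K) sum_k w_k
```

The resampled index concentrates computation on proposals the generative model can actually explain, while the averaged term keeps the bound valid.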
3 Related Work
Our work intersects with work directly addressing teacher forcing, mostly on language modelling and translation (where models are mostly not state space models), and with stochastic state space models (which are typically autoregressive and do not address teacher forcing).

Early work on addressing teacher forcing focused on mitigating its biases by adapting the RNN training procedure to partly rely on the model’s own predictions during training [bengioVJS15; ranzatoCAZ15]. Recently, the problem has been addressed for conditional generation within an adversarial framework [goyalLZZCB16] and in various learning-to-search frameworks [wisemanR16; leblondAOL17]. However, by design these models do not perform stochastic state transitions.
There have been proposals for hybrid architectures that augment the deterministic RNN state sequences by chains of random variables [chung2015recurrent; fraccaro2016SPW]. However, these approaches largely patch up the output feedback mechanism to allow for better modeling of local correlations, leaving the deterministic skeleton of the RNN state sequence untouched. A recent evolution of deep stochastic sequence models has developed models of ever-increasing complexity, including intertwined stochastic and deterministic state sequences [chung2015recurrent; fraccaro2016SPW], additional auxiliary latent variables [goyalSCKB17], auxiliary losses [shabanian17], and annealing schedules [bowmanVVDJB15]. At the same time, it often remains unclear how the stochasticity in these models can be interpreted and measured.

Closest in spirit to our transition functions is the work of Karl et al. [karl16KSBvS] on generation with external control inputs. In contrast to us, they use a simple mixture of linear transition functions and work around the use of density transformations, akin to [bayer2014]. In our unconditional regime, we found that relating the stochasticity in the states explicitly to the stochasticity of the driving noise is key to successful training. Finally, variational conditioning mechanisms similar in spirit to ours have seen great success in image generation [gregorDGW15].
Among generative unconditional sequential models, GANs are as of today the most prominent architecture [yuZWY16; Kusner16; fedusGD2018; cheLZHLSB17]. To the best of our knowledge, our model is the first non-autoregressive model for sequence generation in a maximum likelihood framework.
4 Evaluation
Naturally, the quality of a generative model must be measured in terms of the quality of its outputs. However, we also put special emphasis on investigating whether the stochasticity inherent in our model operates as advertised.
4.1 Data Inspection
Evaluating generative models of text is a field of ongoing research, and currently used methods range from simple data-space statistics to expensive human evaluation [fedusGD2018]. We argue that for morphology, and in particular for non-autoregressive models, there is an interesting middle ground: compared to the space of all sentences, the space of all words has moderate cardinality, which allows us to estimate the data distribution by unigram word frequencies. As a consequence, we can reliably approximate the cross-entropy, which naturally generalizes data-space metrics to probabilistic models and addresses both overgeneralization (assigning non-zero probability to non-existing words) and overconfidence (distributing high probability mass only among a few words).
This metric can be computed for all models that operate by first stochastically generating a sequence of hidden states and then defining a distribution over the data space given the state sequence. For our model, we approximate the marginal by a Monte Carlo estimate of (2),

(12)  $p(x_{1:T}) \approx \frac{1}{S} \sum_{s=1}^{S} \prod_{t=1}^{T} p(x_t \mid z_t^{(s)}), \qquad z_{1:T}^{(s)} \sim p(z_{1:T}).$

Note that sampling from $p(z_{1:T})$ boils down to sampling independent standard normal noise vectors and then applying $G$. In particular, the non-autoregressive property of our model allows us to estimate the probability of every word in some set using $S$ samples each, while using only $S$ independent trajectories overall.
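This trajectory-sharing estimator can be sketched as follows; the toy transition and observation weights are illustrative, not a trained model:

```python
import numpy as np
from itertools import product

# S state trajectories are sampled once and reused to score every word,
# which is possible precisely because there is no output feedback.
rng = np.random.default_rng(4)
STATE, VOCAB, T, S = 4, 5, 3, 50

W_z = rng.normal(scale=0.5, size=(STATE, STATE))
W_xi = rng.normal(scale=0.5, size=(STATE, STATE))
W_out = rng.normal(scale=0.5, size=(VOCAB, STATE))

def emit_probs(z):
    logits = W_out @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rollout():
    """One trajectory of per-step output distributions, driven by noise."""
    z, dists = np.zeros(STATE), []
    for _ in range(T):
        z = np.tanh(W_z @ z + W_xi @ rng.standard_normal(STATE))
        dists.append(emit_probs(z))
    return dists

trajectories = [rollout() for _ in range(S)]

def p_word(word):
    """Monte Carlo estimate of the word's marginal probability."""
    return np.mean([np.prod([dist[s] for dist, s in zip(traj, word)])
                    for traj in trajectories])

# Sanity check: probabilities over all length-T words sum to one.
total = sum(p_word(w) for w in product(range(VOCAB), repeat=T))
```

An autoregressive model would instead need fresh rollouts per word, since its states depend on the emitted symbols.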
Finally, we include two data-space metrics as an intuitive, yet less accurate measure. From a collection of generated words, we estimate (i) the fraction of words that are in the training vocabulary and (ii) the fraction of unique words that are in the training vocabulary (unique).^{4}

^{4} Note that for both data-space metrics there is a trivial generation system that achieves a ‘perfect’ score. Hence, both must be taken into account at the same time to judge performance.
4.2 Entropy Inspection
We want to go beyond the usual evaluation of existing work on stochastic sequence models and also assess the quality of our noise model. In particular, we are interested in how much of the information contained in a state about the output is due to the corresponding noise vector $\xi_t$. This is quantified by the mutual information between the noise $\xi_t$ and the observation $x_t$, given the noise $\xi_{1:t-1}$ that defined the prefix up to time $t$. Since $z_t$ is a deterministic function of $\xi_{1:t}$, we write

(13)  $I(x_t; \xi_t \mid \xi_{1:t-1}) = H(x_t \mid \xi_{1:t-1}) - H(x_t \mid z_t)$

to quantify the dependence between noise and observation at one timestep. For a model ignoring the noise variables, knowledge of $\xi_t$ does not reduce the uncertainty about $x_t$, so that $I(x_t; \xi_t \mid \xi_{1:t-1}) = 0$. We can use Monte Carlo estimates for all expectations in (13).
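A sketch of this Monte Carlo estimate for a single timestep; the toy map from the noise sample to an output distribution is an assumption standing in for the trained transition and observation model:

```python
import numpy as np

# I = H(x_t | prefix) - E[ H(x_t | z_t) ]: entropy of the noise-averaged
# output distribution minus the average conditional entropy.
rng = np.random.default_rng(5)
N = 4000

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def emit_probs(xi):
    """Toy p(x_t | z_t), where z_t depends on the noise sample xi."""
    logits = np.array([xi, -xi, 0.5 * xi, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Noise-dependent model: conditional entropies vs. entropy of the mixture.
samples = np.array([emit_probs(rng.standard_normal()) for _ in range(N)])
mi = entropy(samples.mean(axis=0)) - np.mean([entropy(p) for p in samples])

# Noise-ignoring model: every noise sample yields the same distribution,
# so the estimated mutual information vanishes.
const = np.full(4, 0.25)
mi_const = entropy(const) - entropy(const)
```

By concavity of entropy, the estimate is non-negative, and it is strictly positive exactly when the noise actually moves the output distribution.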
5 Experiments
5.1 Dataset and baseline
For our experiments, we use the BooksCorpus [kiros2015skip; zhu2015aligning], a freely available collection of novels comprising almost 1B tokens, out of which 1.3M are unique. To filter out artefacts and some very uncommon words found in fiction, we restrict the vocabulary to words within a length limit that contain only letters and occur at least 10 times, resulting in a 143K vocabulary. Besides the standard 10% test-train split at the word level, we also perform a second, alternative split at the vocabulary level. That is, 10 percent of the words, chosen regardless of their frequency, are unique to the test set. This is motivated by the fact that even a small test set under the former regime will contain only very few, very unlikely words unique to the test set. However, generalization to unseen words is the essence of morphology. As an additional metric measuring generalization in this scenario, we evaluate the generated output under Witten-Bell discounted character n-gram models trained on either the whole corpus or the test data only.
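The described vocabulary filtering can be sketched as below; the minimum count comes from the text, while the maximum word length is a placeholder assumption since the exact bound is not reproduced here:

```python
import re
from collections import Counter

# Keep tokens that contain only letters, occur at least MIN_COUNT times,
# and fall within a length limit. MAX_LEN is an illustrative placeholder.
MIN_COUNT, MAX_LEN = 10, 12

def build_vocab(tokens):
    counts = Counter(tokens)
    return {w for w, c in counts.items()
            if c >= MIN_COUNT
            and len(w) <= MAX_LEN
            and re.fullmatch(r"[A-Za-z]+", w)}

tokens = ["the"] * 12 + ["cat"] * 10 + ["x9z"] * 50 + ["rare"] * 2
vocab = build_vocab(tokens)
```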
Our baseline is a GRU cell trained with the standard RNN training procedure with teacher forcing.^{5} Hidden state size and embedding size are identical to our model’s.

^{5} It should be noted that, despite the greatly reduced vocabulary in character-level generation, RNN training without teacher forcing still fails miserably on our data.
5.2 Model parametrization
We stick to a standard softmax observation model and instead focus the model design on different combinations of flows for $F$ and $G$. We investigate the flow in Equation (10), denoted as tril, its diagonal version diag, and a simple identity id. We denote repeated application of (independently parametrized) flows by chaining. For the importance-weighted version we use $K \in \{2, 5, 10\}$ samples. In addition, for $G$ we experiment with a sequence of Real NVP flows (two internal hidden layers of size 8 each). Furthermore, we investigate deviating from the factorization (3) by using a bidirectional RNN conditioning on all of $x_{1:T}$ in every timestep. Finally, for the best-performing configuration, we also investigate larger state sizes.
5.3 Results
Table 1 shows the results for the standard split. By ± we indicate mean and standard deviation across 5, or 10 for IWAE, identical runs. The data-space metrics require manually trading off precision and coverage. We observe that two layers of the tril flow improve performance. Furthermore, importance weighting significantly improves the results across all metrics, with diminishing returns for larger K. Its effectiveness is also confirmed by an increase in variance across the weights during training, which can be attributed to the significance of the noise model (see 5.4 for more details). We found training with Real NVP to be very unstable. We attribute its relatively poor performance to the sequential VI setting, which deviates heavily from what it was designed for, and leave adaptations for future work.

Model    unique  

tril  12.13.11  11.99.11  0.18.00  0.43.03  0.95.04 
tril, K=2  11.76.12  11.82.12  0.16.01  0.46.02  1.06.16 
tril, K=5  11.46.05  11.51.05  0.16.01  0.48.02  1.08.13 
tril, K=10  11.43.05  11.47.05  0.16.01  0.49.02  1.12.12 
tril  11.91.08  11.86.13  0.17.01  0.45.02  0.89.07 
tril, K=2  11.55.09  11.61.09  0.16.00  0.47.01  1.00.13 
tril, K=5  11.42.07  11.46.06  0.16.00  0.49.01  1.20.12 
tril, K=10  11.33.05  11.38.06  0.16.00  0.49.01  1.28.13 
tril, K=10, bidi  11.33.09  11.39.10  0.16.01  0.48.00  1.25.16 
tril, K=10  11.21  11.43  0.15  0.48  1.43 
tril, K=10  11.27  11.13  0.15  0.50  1.31 
realNVP[2,3,4,5,6,7]  11.77  11.81  0.12  0.53  0.94 
baseline8d  12.92  12.97  0.13  0.53  – 
baseline16d  12.55  12.60  0.14  0.62  – 
oracle-train  7.0  7.02^{7}  0.27  1.0  – 

^{7} Note that the training-set oracle is not optimal for the test set. The entropy of the test set is 6.80.
Interestingly, our standard inference model is on par with the equivalently parametrized bidirectional inference model, suggesting that historic information can be sufficiently stored in the states and confirming d-separation as the right principle for inference design.
The poor cross-entropy achieved by the baseline can partly be explained by the fact that autoregressive RNNs are trained on conditional next-word predictions. Estimating the real data-space distribution would require aggregating over all possible sequences. However, the data-space metrics clearly show that the performance gap cannot solely be attributed to this.
Table 2 shows that generalization under the alternative split is indeed harder, but the cross-entropy results carry over from the standard setting. Here we sample trajectories and extract the argmax from the observation model, which more closely resembles the procedure of the baseline. Under n-gram perplexity, both models are on par, with a slight advantage of the baseline on longer n-grams and slightly better generalization of our proposed model.
n-gram from train+test  n-gram from test  

Model  
tril, K=10  11.56  12.27  10.4  12.8  20.9  30.7  13.1  21.9  49.6  81.1 
baseline8d  12.90  13.67  11.4  12.1  17.5  24.8  14.5  22.7  48.3  80.5 
oracle-train  –  –  10.1  6.7  4.8  4.1  13.2  15.7  21.4  26.4 
oracle-test  –  –  9.5  6.0  4.5  3.9  7.9  4.1  2.9  2.6 
To give more insight into how the transition functions influence the results, Table 0(a) presents an exhaustive overview of all combinations of our simple flows. We observe that a powerful generative flow is essential for a successful model, while the inference flow can remain relatively simple – yet overly simplistic choices, such as id, degrade performance. Choosing the generative flow slightly more powerful than the inference flow emerges as a successful pattern.


5.4 Noise Model Analysis
We approximate the entropy terms in (13) with Monte Carlo samples. In addition, we consider the average mutual information across all timesteps. Figure 5.4 shows how this average, along with the symbolic entropy, changes during training. Remember that in a non-autoregressive model, the latter corresponds to information that cannot be recovered in later timesteps. Over the course of training, more and more information is driven by the noise and absorbed into states where it can be stored.

Figures 1 and 0(b) show the average mutual information for all trained models. In addition, Figure 5.4 shows a boxplot of the per-timestep mutual information for the configuration tril, K=10. As initial tokens are more important to remember, it should not come as a surprise that the mutual information is largest at the first timesteps and decreases over time, yet with increased variance.
6 Conclusion
In this paper we have shown how a deep state space model can be defined and trained with the help of variational flows. The recurrent mechanism is driven purely by a simple white noise process and does not require autoregressive conditioning on previously generated symbols. In addition, we have shown how an importance-weighted conditioning mechanism integrated into the objective allows shifting stochastic complexity from the inference to the generative model. The result is a highly flexible framework for sequence generation with an extremely simple overall architecture, a measurable notion of latent information, and no need for pre-training, annealing, or auxiliary losses. We believe that pushing the boundaries of non-autoregressive modeling is key to understanding stochastic text generation and can open the door to related fields such as particle filtering [naesseth2017; maddisonLTHNMDT17].
References
 [BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [BGS15] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.
 [BO14] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 [BVJS15] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099, 2015.
 [BVV15] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015.
 [CKD15] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
 [CLZ17] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximumlikelihood augmented discrete generative adversarial networks. CoRR, abs/1702.07983, 2017.
 [DSB16] Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
 [FAC15] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. Sentence compression by deletion with LSTMs. In EMNLP 2015, pages 360–368, 2015.
 [FGD18] William Fedus, Ian J. Goodfellow, and Andrew M. Dai. Maskgan: Better text generation via filling in the ______. CoRR, abs/1801.07736, 2018.
 [FSPW16] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207, 2016.
 [GDGW15] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
 [GJ14] Alex Graves and Navdeep Jaitly. Towards endtoend speech recognition with recurrent neural networks. In ICML 2014, pages 1764–1772, 2014.
 [GLZ16] Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS 2016, pages 4601–4609, 2016.
 [Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 [GSC17] Anirudh Goyal, Alessandro Sordoni, MarcAlexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Zforcing: Training stochastic recurrent networks. In NIPS 2017, pages 6716–6726, 2017.

 [HYX17] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In International Conference on Machine Learning (ICML), 2017.
 [JKMHL16] Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint, November 2016.
 [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [KSBvdS17] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In ICLR, 2017.
 [KSW16] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.
 [KZS15] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.
 [LAOL17] Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. SEARNN: Training RNNs with global-local losses. CoRR, abs/1706.04499, 2017.
 [MKB10] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 2010.
 [MLT17] Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. CoRR, abs/1705.09279, 2017.
 [NLRB17] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential monte carlo. arXiv preprint arXiv:1705.11140, 2017.
 [RCAZ15] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.
 [RCW15] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685, 2015.
 [RM15] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML 2015, pages 1530–1538, 2015.
 [SATB17] Samira Shabanian, Devansh Arpit, Adam Trischler, and Yoshua Bengio. Variational Bi-LSTMs. arXiv preprint, November 2017.
 [SSB16] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784, 2016.
 [SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, Cambridge, MA, USA, 2014. MIT Press.
 [VSP17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
 [WR16] Sam Wiseman and Alexander M. Rush. Sequencetosequence learning as beamsearch optimization. CoRR, abs/1606.02960, 2016.
 [WZ89] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
 [YZWY16] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016.
 [ZKZ15] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.
Appendix A Detailed derivation of the weighted ELBO
We simplify the notation and write the distribution of the inference model over a subsequence without making the dependency on the previous state and the data explicit. Furthermore, we use a shorthand for a set of samples drawn from the inference model. Finally, we let a single symbol summarize all parameters of both the generative and the inference model.
The key idea is to write the marginal as a nested expectation
(14) 
and observe that we can perform an MC estimate with respect to only
(15) 
The same argument applies to the integrand in the ELBO. Now we can repeat the IWAE argument from [BGS15] for the outer expectation
(16)  
(17)  
(18)  
(19) 
where we have used the above factorization in (17), MC sampling in (18) and Jensen’s inequality in (19). Now we can identify
(21) 
and use the log-derivative trick to derive gradients
(22) 
Again, we have omitted carrying out the reparametrization trick explicitly when moving the gradient into the expectation and refer to the original paper for a more rigorous version. The gradient of the logarithm decomposes into two terms,
(23)  
(24) 
The first is the contribution to our original ELBO, normalized by the IWAE MC weights. The second is identical to our starting point in (16), but shifted by one timestep and conditioned on the current sample. Iterating the above argument over all timesteps yields the desired bound.
To allow tractable gradient computation using the importance-weighted bound, we use two simplifications. First, we limit the computation of the weights to a finite horizon of size 1, which reduces them to only the first factor in (21). Second, we forward only a single sample to the next timestep to remain in the usual single-sample sequential ELBO regime (which is important as each step depends on the previous state). That is, we sample one candidate proportional to the weights. A more sophisticated solution would be to incorporate techniques from particle filtering, which maintain a fixed-size sample population that is updated over time.
2 Model
We argue that text generation is subject to two sorts of uncertainty: uncertainty about plausible long-term continuations and uncertainty about the emission of the current token. The first reflects the entropy of all things considered “natural language”, the second reflects symbolic entropy at a fixed position that arises from ambiguity, (near-)analogies, or a lack of contextual constraints. As a consequence, we cast the emission of a token as a fundamental trade-off between committing to and forgetting about information.
2.1 State space model
Let us define a state space model with transition function

(1) s_t = F(s_{t-1}, ξ_t), ξ_t ∼ N(0, I).

F is deterministic, yet driven by a white noise process ξ_t, and, starting from some initial state s_0, defines a homogeneous stochastic process. A local observation model p(w_t | s_t) generates symbols
and is typically realized by a softmax layer with symbol embeddings.
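To make the generative mechanism concrete, here is a minimal sketch of ancestral sampling from such a state space model. The transition F, the emission weights, and all sizes are illustrative toy stand-ins, not the paper's trained networks.

```python
import numpy as np

# Toy instantiation of the state space model: a deterministic transition F
# driven by white noise, and a softmax observation model over symbols.
rng = np.random.default_rng(0)
STATE, VOCAB, T = 4, 6, 5

W_s = 0.1 * rng.normal(size=(STATE, STATE))   # toy transition weights
W_o = rng.normal(size=(VOCAB, STATE))         # toy emission weights

def F(s_prev, xi):
    # deterministic state update driven purely by noise -- no symbol feedback
    return np.tanh(W_s @ s_prev + xi)

def emit_probs(s):
    # softmax observation model p(w | s)
    logits = W_o @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_sequence(T):
    s = np.zeros(STATE)                       # initial state s_0
    words = []
    for _ in range(T):
        xi = rng.normal(size=STATE)           # (i) sample white noise
        s = F(s, xi)                          # (ii) deterministic transition
        words.append(int(rng.choice(VOCAB, p=emit_probs(s))))  # (iii) emit
    return words

seq = sample_sequence(T)
```

Because the emitted symbol never feeds back into F, all variability across continuations must come from the noise sequence, mirroring the argument above.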
The marginal probability of a symbol sequence is obtained by integrating out the states,

(2) p(w_1, …, w_T) = E_{s_1, …, s_T} [ ∏_t p(w_t | s_t) ].

Here the distribution over state sequences is defined implicitly by driving F with noise, as we explain in more detail below.^1 In contrast to common RNN architectures, we have defined F to not include an autoregressive input such as w_{t-1}, making potential biases as in teacher-forcing a non-issue. Furthermore, this implements our assumption about the role of entropy and information for generation. The information about the local outcome under p(w_t | s_t) is not considered in the transition to the next state, as there is no feedback. Thus, in this model, all entropy about possible sequence continuations must arise from the noise process ξ_t, which cannot be ignored in a successfully trained model.
^1 For ease of exposition, we assume fixed-length sequences, although in practice one works with end-of-sequence tokens and variable-length sequences.
The implied generative procedure follows directly from the chain rule. To sample a sequence of observations we (i) sample a white noise sequence ξ_1, …, ξ_T, (ii) deterministically compute each state s_t from s_{t-1} and ξ_t via F, and (iii) sample each w_t from the observation model p(w_t | s_t). The remainder of this section focuses on how we can define a sufficiently powerful family of state evolution functions and how variational inference can be used for training.
2.2 Variational inference
Model-based variational inference (VI) allows us to approximate the marginalization in Eq. (2) by posterior expectations with regard to an inference model q. It is easy to verify that the true posterior factorizes over timesteps given the previous state, which informs our design of the inference model, cf. fraccaro2016SPW :

(3) q(s_1, …, s_T | w_1, …, w_T) = ∏_t q(s_t | s_{t-1}, w_t, …, w_T)
This is to say, the previous state is a sufficient summary of the past. Jensen’s inequality then directly implies the evidence lower bound (ELBO)

(4) log p(w_1, …, w_T) ≥ E_q [ log p(w_1, …, w_T, s_1, …, s_T) − log q(s_1, …, s_T | w_1, …, w_T) ]
(5) = Σ_t E_q [ log p(w_t | s_t) ] − E_q [ KL( q(s_t | s_{t-1}, w_t, …, w_T) ‖ p(s_t | s_{t-1}) ) ].

This is a well-known form, which highlights the per-step balance between prediction quality and the discrepancy between the transition probabilities of the unconditioned generative and the data-conditioned inference models fraccaroSPW2016 ; chung2015recurrent . Intuitively, the inference model breaks down the long-range dependencies and provides a local training signal to the generative model for a single step transition and a single output generation.
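For intuition, the per-step balance in Eq. (5) can be sketched for the simplest case of diagonal Gaussian transition densities; the closed-form KL and all numbers below are illustrative stand-ins rather than the paper's actual parametrization.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), closed form
    return 0.5 * float(np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def step_elbo(log_lik, mu_q, var_q, mu_p, var_p):
    # prediction quality minus discrepancy between the data-conditioned
    # inference transition q and the unconditioned generative transition p
    return log_lik - gaussian_kl(mu_q, var_q, mu_p, var_p)

mu_q, var_q = np.array([0.1, -0.2]), np.array([0.5, 0.5])   # toy inference step
mu_p, var_p = np.zeros(2), np.ones(2)                        # toy generative step
val = step_elbo(-2.3, mu_q, var_q, mu_p, var_p)              # one summand of Eq. (5)
```

The KL term is non-negative, so each step's contribution is at most the log-likelihood of the emitted symbol.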
Using VI successfully for generating symbol sequences requires parametrizing powerful yet tractable next-state transitions. As a minimum requirement, forward sampling and log-likelihood computation need to be available. Extensions of VAEs rezendeM15 ; kingmaSW16 have shown that for non-sequential models, under certain conditions, an invertible function can shape a moderately complex distribution over the noise into a highly complex one over the latent state, while still providing the operations necessary for efficient VI. The authors show that a bound similar to Eq. (5) can be obtained by using the law of the unconscious statistician rezendeM15 and a density transformation to express the discrepancy between generative and inference model in terms of the noise instead of the state
(6) KL( q(s_t | ·) ‖ p(s_t | ·) ) = E_{q(ξ_t)} [ log q(ξ_t) − log |det J_F(ξ_t)| − log p(F(ξ_t)) ]
This allows the inference model to work with an implicit latent distribution at the price of computing the Jacobian determinant of the flow. Luckily, there are many choices of flow for which this can be done efficiently rezendeM15 ; dinhSB16 .
2.3 Training through coupled transition functions
We propose to use two separate transition functions for the inference and the generative model, respectively. Using results from flow-based VAEs, we derive an ELBO that reveals the intrinsic coupling of both and expresses the relation of the two as a part of the objective that is determined solely by the data. A shared transition model constitutes a special case.
Two-Flow ELBO
For a transition function F as in Eq. (1), fix the previous state s and define the restriction F_s(ξ) = F(s, ξ). We require that for any s, F_s is a diffeomorphism and thus has a differentiable inverse. In fact, as we work with (possibly) different transition functions for generation and inference, we have two such restrictions, one for each model. For better readability we will omit the conditioning variable in the sequel.
By combining the per-step decomposition in (5) with the flow-based ELBO from (6), we get (implicitly suppressing the conditioning on the previous state):
(7) 
As our generative model also uses a flow to transform the noise into a distribution on the state space, it is more natural to use the (simple) density in noise space. Performing another change of variables, this time on the density of the generative model, we get

(8)

where the resulting density is simply the (multivariate) standard normal, as the generative noise distribution does not depend on the data, whereas the inference distribution does. We have introduced a new noise variable to highlight the importance of the combined transformation, a composition of the forward inference flow and the inverse generative flow. Essentially, it follows the suggested distribution of the inference model into the latent state space and back into the noise space of the generative model with its uninformative distribution. Putting this back into Eq. (7) and exploiting the fact that the Jacobians of composed maps multiply, we finally get
(9) 
Interpretation
Naïvely employing the model-based ELBO approach, one has to learn two independently parametrized transition models, one informed about the future and one not. Matching the two then becomes an integral part of the objective. However, since the transition model encapsulates most of the model complexity, this introduces redundancy where the learning problem is most challenging. Nevertheless, the generative and inference models do address the transition problem from very different angles. Therefore, forcing both to use the exact same transition model might limit flexibility during training and result in an inferior generative model. Thus our model casts the two transition functions as independently parametrized functions that are coupled through the objective by treating them as proper transformations of an underlying white noise process.^2 We can think of this construction as a stochastic bottleneck with the observation model attached to the middle layer. Removing the middle layer collapses the bottleneck and prohibits learning compression.
^2 Note that identifying the combined transformation as an invertible function allows us to perform a backwards density transformation which cancels the regularizing terms. This is akin to any flow objective (e.g. see equation (15) in rezendeM15 ) where applying the transformation additionally to the prior cancels out the Jacobian term.
Special cases
Additive Gaussian noise can be seen as the simplest form of transition flow or, alternatively, as a generative model without flow. Of course, repeated addition of noise does not provide a meaningful latent trajectory. Finally, note that when the two flows coincide, the numerator in the second term becomes a simple prior probability, whereas the determinant reduces to a constant. We now explore possible candidates for the flows.
2.4 Families of transition functions
Since the Jacobian of a composed function factorizes, a flow is often composed of a chain of individual invertible functions rezendeM15 . We experiment with individual functions of the form

(10) ξ ↦ μ + L ξ

where μ is produced by a multi-layer MLP and L by a neural network mapping the previous state to a lower-triangular matrix with non-zero diagonal entries. Again, we use MLPs for this mapping and clip the diagonal away from zero by some hyperparameter. The lower-triangular structure allows computing the determinant in linear time and stable inversion of the mapping by substitution in quadratic time. As a special case we also consider the case where L is restricted to diagonal matrices. Finally, we experiment with a conditional variant of the Real NVP flow dinhSB16 .
Computing the inverse flow is central to our objective, and we found that, depending on the flow, directly parametrizing the inverse results in more stable and efficient training.
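A minimal sketch of the affine flow family of Eq. (10), with fixed toy values in place of the networks producing the shift and the lower-triangular matrix, illustrates why the triangular structure is convenient:

```python
import numpy as np

def clip_diagonal(L, eps=0.1):
    # keep |L_ii| >= eps so the flow stays invertible
    d = np.diag(L).copy()
    small = np.abs(d) < eps
    d[small] = np.where(d[small] >= 0, eps, -eps)
    out = L.copy()
    np.fill_diagonal(out, d)
    return out

def flow_forward(xi, mu, L):
    return mu + L @ xi                       # xi -> mu + L xi, as in Eq. (10)

def flow_inverse(s, mu, L):
    # stable inversion by forward substitution, O(d^2)
    b, d = s - mu, len(s)
    xi = np.zeros(d)
    for i in range(d):
        xi[i] = (b[i] - L[i, :i] @ xi[:i]) / L[i, i]
    return xi

def log_abs_det(L):
    # determinant of a triangular matrix: product of its diagonal, O(d)
    return float(np.sum(np.log(np.abs(np.diag(L)))))

rng = np.random.default_rng(0)
d = 4
L = clip_diagonal(np.tril(rng.normal(size=(d, d))))  # toy L(s)
mu, xi = rng.normal(size=d), rng.normal(size=d)      # toy mu(s) and noise
s = flow_forward(xi, mu, L)
```

The round trip `flow_inverse(flow_forward(xi, …), …)` recovers the noise exactly, which is what makes log-likelihood computation and sampling both tractable.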
2.5 Inference network
So far we have only motivated the factorization of the inference network but treated it as a black box otherwise. Remember that sampling from the inference network amounts to sampling a noise vector and then performing the deterministic transition. We observe much better training stability when conditioning the noise distribution on the data only and modeling the interaction with the previous state exclusively through the transition function. This coincides with our intuition that the two inputs to a transition function provide semantically orthogonal contributions.
We follow existing work dinhSB16 and choose the noise distribution of the inference model as the density of a normal distribution with diagonal covariance matrix. We follow the idea of fraccaro2016SPW and incorporate the variable-length sequence by conditioning on the state of an RNN running backwards in time across the sequence. We embed the symbols in a vector space and use a GRU cell to produce a sequence of hidden states, where the hidden state at time t has digested tokens w_t, …, w_T. This hidden state then parametrizes the mean and covariance matrix of the inference distribution.
2.6 Optimization
Except in very specific and simple cases, for instance a Kalman filter, it will not be possible to efficiently compute the expectations in Eq. (5) exactly. Instead, we sample in every timestep, as is common practice for sequential ELBOs fraccaroSPW2016 ; goyalSCKB17 . The reparametrization trick allows pushing all necessary gradients through these expectations to optimize the bound via stochastic gradient-based optimization techniques such as Adam kingmaB14 .
2.7 Extension: Importance-weighted ELBO for tracking the generative model
Conceptually, there are two ways an inference network can propose state sequences for a given sentence. Either, as described above, by digesting the sentence right-to-left and proposing states left-to-right. Or, by iteratively proposing each state taking into account the last state proposed and the generative deterministic mechanism F. The latter allows the inference network to peek at states the generative model could reach before proposing an actual target. This allows the inference model to track a multimodal generative model without the need to match its expressiveness. As a consequence, this might offer the possibility to learn multimodal generative models without the need to employ complex multimodal distributions in the inference model.
Our extension is built on importance weighted autoencoders (IWAE) burda15 . The IWAE ELBO is derived by writing the log marginal as a Monte Carlo estimate before using Jensen’s inequality. The result is an ELBO and corresponding gradients of the form^3

(11) L_K = E_{s^1, …, s^K ∼ q} [ log (1/K) Σ_k p(w, s^k) / q(s^k | w) ]

^3 Here we have tacitly assumed that the expectation can be rewritten using the reparametrization trick so that it is expressed with respect to some parameter-free base distribution. See burda15 for a detailed derivation of the gradients in (11).
The authors motivate (11) as a weighting mechanism relieving the inference model from explaining the data well with every sample. We will use the symmetry of this argument to let the inference model condition on potential next states from the generative model without requiring every such state to admit a good proposal. In other words, the sampled outputs of the generative transition become a vectorized representation for the inference model to condition on. In our sequential model, computing the weights exactly is intractable, as it would require rolling out the network until the end of the sequence. Instead, we limit the horizon to only one timestep. Although this biases the estimate of the weights and consequently the ELBO, longer horizons did not empirically show benefits. When proceeding to the next timestep, we choose the new hidden state by sampling one of the candidates with probability proportional to its weight. Algorithm 1 summarizes the steps carried out at time t (to not overload the notation, we drop some indices), and a more detailed derivation of the bound is given in Appendix A.
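The one-step weighting scheme can be sketched as follows; the candidate states and log-weights are random placeholders here, since computing the true weights requires the model's densities.

```python
import numpy as np

def iwae_step(log_weights, candidates, rng):
    # normalize importance weights in a numerically stable way
    lw = log_weights - log_weights.max()
    w = np.exp(lw)
    w /= w.sum()
    # single-sample forwarding: resample one candidate state proportional
    # to its weight and carry only that state to the next timestep
    idx = rng.choice(len(candidates), p=w)
    return candidates[idx], w

rng = np.random.default_rng(1)
K, dim = 4, 3
candidates = rng.normal(size=(K, dim))   # K proposed next states (placeholders)
log_weights = rng.normal(size=K)         # stand-in one-step importance weights
state, w = iwae_step(log_weights, candidates, rng)
```

Carrying a single resampled state forward keeps the computation in the single-sample sequential ELBO regime; a particle-filter-style population would generalize this step.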
3 Related Work
Our work intersects with work directly addressing teacher-forcing, mostly on language modelling and translation (which are mostly not state space models), and with stochastic state space models (which are typically autoregressive and do not address teacher-forcing).
Early work on addressing teacher-forcing has focused on mitigating its biases by adapting the RNN training procedure to partly rely on the model’s prediction during training bengioVJS15 ; ranzatoCAZ15 . Recently, the problem has been addressed for conditional generation within an adversarial framework goyalLZZCB16 and in various learning-to-search frameworks wisemanR16 ; leblondAOL17 . However, by design these models do not perform stochastic state transitions.
There have been proposals for hybrid architectures that augment the deterministic RNN state sequences by chains of random variables chung2015recurrent ; fraccaro2016SPW . However, these approaches largely patch up the output feedback mechanism to allow for better modeling of local correlations, leaving the deterministic skeleton of the RNN state sequence untouched. A recent evolution of deep stochastic sequence models has developed models of ever increasing complexity, including intertwined stochastic and deterministic state sequences chung2015recurrent ; fraccaro2016SPW , additional auxiliary latent variables goyalSCKB17 , auxiliary losses shabanian17 , and annealing schedules bowmanVVDJB15 . At the same time, it often remains unclear how the stochasticity in these models can be interpreted and measured. Closest in spirit to our transition functions is work by Karl et al. karl16KSBvS on generation with external control inputs. In contrast to us, they use a simple mixture of linear transition functions and work around using density transformations akin to bayer2014 . In our unconditional regime we found that relating the stochasticity in the states explicitly to the stochasticity in the noise process is key to successful training. Finally, variational conditioning mechanisms similar in spirit to ours have seen great success in image generation gregorDGW15 .
Among generative unconditional sequential models, GANs are as of today the most prominent architecture yuZWY16 ; Kusner16 ; fedusGD2018 ; cheLZHLSB17 . To the best of our knowledge, our model is the first non-autoregressive model for sequence generation in a maximum likelihood framework.
4 Evaluation
Naturally, the quality of a generative model must be measured in terms of the quality of its outputs. However, we also put special emphasis on investigating whether the stochasticity inherent in our model operates as advertised.
4.1 Data Inspection
Evaluating generative models of text is a field of ongoing research, and currently used methods range from simple data-space statistics to expensive human evaluation fedusGD2018 . We argue that for morphology, and in particular for non-autoregressive models, there is an interesting middle ground: compared to the space of all sentences, the space of all words has still moderate cardinality, which allows us to estimate the data distribution by unigram word frequencies. As a consequence, we can reliably approximate the cross-entropy, which naturally generalizes data-space metrics to probabilistic models and addresses both overgeneralization (assigning non-zero probability to non-existing words) and overconfidence (distributing high probability mass only among a few words).
This metric can be addressed by all models which operate by first stochastically generating a sequence of hidden states and then defining a distribution over the data space given the state sequence. For our model we approximate the marginal by a Monte Carlo estimate of (2),

(12) p(w_1, …, w_T) ≈ (1/N) Σ_{n=1}^{N} ∏_t p(w_t | s_t^{(n)}).

Note that sampling a state trajectory boils down to sampling independent standard normals and then applying F. In particular, the non-autoregressive property of our model allows us to estimate the probabilities of all words in a given set using N samples each, while using only N independent trajectories overall.
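The estimator in Eq. (12) can be sketched as follows. The transition and emission below are toy stand-ins; the point is that one shared set of N trajectories serves the estimates for every word.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB, STATE, T, N = 6, 4, 3, 100
W_o = rng.normal(size=(VOCAB, STATE))    # toy emission weights

def emit_probs(s):
    logits = W_o @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_trajectory():
    # stand-in for driving F with independent standard normal noise
    s, states = np.zeros(STATE), []
    for _ in range(T):
        s = np.tanh(s + rng.normal(size=STATE))
        states.append(s)
    return states

def word_prob(word, trajectories):
    # MC estimate of Eq. (12): average the product of per-step emission
    # probabilities over the shared trajectories
    vals = [np.prod([emit_probs(s)[tok] for s, tok in zip(traj, word)])
            for traj in trajectories]
    return float(np.mean(vals))

trajectories = [sample_trajectory() for _ in range(N)]
p = word_prob([0, 1, 2], trajectories)   # estimate for one length-3 "word"
```

Summed over all possible token sequences of length T, these estimates total one, so the estimator defines a proper distribution for any N.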
Finally, we include two data-space metrics as an intuitive, yet less accurate measure. From a collection of generated words, we estimate (i) the fraction of words that are in the training vocabulary and (ii) the fraction of unique words that are in the training vocabulary.^4
^4 Note that for both data-space metrics there is a trivial generation system that achieves a ‘perfect’ score. Hence, both must be taken into account at the same time to judge performance.
4.2 Entropy Inspection
We want to go beyond the usual evaluation of existing work on stochastic sequence models and also assess the quality of our noise model. In particular, we are interested in how much of the information contained in a state about the output is due to the corresponding noise vector ξ_t. This is quantified by the mutual information between the noise and the observation, given the noise that defined the prefix up to time t. Since s_t is a deterministic function of ξ_1, …, ξ_t, we write

(13) I(w_t ; ξ_t | ξ_1, …, ξ_{t-1}) = H(w_t | ξ_1, …, ξ_{t-1}) − H(w_t | ξ_1, …, ξ_t)

to quantify the dependence between noise and observation at one timestep. For a model ignoring the noise variables, knowledge of ξ_t does not reduce the uncertainty about w_t, so that the mutual information vanishes. We can use Monte Carlo estimates for all expectations in (13).
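A sketch of this Monte Carlo estimate, with a toy emission model standing in for the trained network: with the prefix noise fixed, the entropy of the noise-averaged output distribution is compared against the average per-noise entropy.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(3)
VOCAB, STATE, M = 6, 4, 200
W_o = 3.0 * rng.normal(size=(VOCAB, STATE))  # toy emission weights
s_prev = rng.normal(size=STATE)              # state fixed by the prefix noise

def emit_probs(s):
    logits = W_o @ s
    e = np.exp(logits - logits.max())
    return e / e.sum()

# sample M current-step noise vectors and record p(w_t | s_t) for each
dists = np.stack([emit_probs(np.tanh(s_prev + rng.normal(size=STATE)))
                  for _ in range(M)])
marginal = dists.mean(axis=0)                # output distribution with xi_t averaged out
# Eq. (13): H(w_t | prefix) minus the expected conditional entropy
mi_estimate = entropy(marginal) - float(np.mean([entropy(d) for d in dists]))
```

For a model that ignores its noise variables, every per-noise distribution coincides with the marginal and the estimate collapses to zero; by concavity of entropy the estimate is never negative.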
5 Experiments
5.1 Dataset and baseline
For our experiments, we use the BooksCorpus kiros2015skip ; zhu2015aligning , a freely available collection of novels comprising almost 1B tokens, out of which 1.3M are unique. To filter out artefacts and some very uncommon words found in fiction, we restrict the vocabulary to words of bounded length with at least 10 occurrences that only contain letters, resulting in a 143K vocabulary. Besides the standard 10% test-train split at the word level, we also perform a second, alternative split at the vocabulary level. That means 10 percent of the words, chosen regardless of their frequency, are unique to the test set. This is motivated by the fact that even a small test set under the former regime will contain only very few, very unlikely words unique to the test set. However, generalization to unseen words is the essence of morphology. As an additional metric to measure generalization in this scenario, we evaluate the generated output under Witten-Bell discounted character n-gram models trained on either the whole corpus or the test data only.
Our baseline is a GRU cell and the standard RNN training procedure with teacher-forcing.^5 Hidden state size and embedding size are identical to our model’s.
^5 It should be noted that despite the greatly reduced vocabulary in character-level generation, RNN training without teacher-forcing for our data still fails miserably.
5.2 Model parametrization
We stick to a standard softmax observation model and instead focus the model design on different combinations of flows for the generative and the inference model. We investigate the flow in Equation (10), denoted as tril, its diagonal version diag, and a simple identity id. We denote repeated application of (independently parametrized) flows by exponents. For the weighted version we vary the number of samples K. In addition, we experiment with a sequence of Real NVPs with masking dimensions (two internal hidden layers of size 8 each). Furthermore, we investigate deviating from the factorization (3) by using a bidirectional RNN conditioning on all tokens in every timestep. Finally, for the best performing configuration, we also investigate different state sizes.
5.3 Results
Table 1 shows the results for the standard split. We report mean and standard deviation across 5 or 10 (for IWAE) identical runs.^6 The data-space metrics require manually trading off precision and coverage. We observe that two layers of the tril flow improve performance. Furthermore, importance weighting significantly improves the results across all metrics, with diminishing returns at larger K. Its effectiveness is also confirmed by an increase in variance across the weights during training, which can be attributed to the significance of the noise model (see 5.4 for more details). We found training with Real NVP to be very unstable. We attribute the relatively poor performance of NVP to the sequential VI setting, which deviates heavily from what it was designed for, and keep adaptations for future work.
^6 Single best model with : achieved and .

Model | | | | | unique
tril | 12.13±.11 | 11.99±.11 | 0.18±.00 | 0.43±.03 | 0.95±.04
tril, K=2 | 11.76±.12 | 11.82±.12 | 0.16