Consistency of a Recurrent Language Model With Respect to Incomplete Decoding

Sean Welleck et al. (02/06/2020)

Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm, meaning that the algorithm can yield an infinite-length sequence that has zero probability under the model. We prove that commonly used incomplete decoding algorithms - greedy search, beam search, top-k sampling, and nucleus sampling - are inconsistent, despite the fact that recurrent language models are trained to produce sequences of finite length. Based on these insights, we propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model. Empirical results show that inconsistency occurs in practice, and that the proposed methods prevent inconsistency.


1 Introduction

Neural sequence models trained with maximum likelihood estimation (MLE) have become a standard approach to modeling sequences in a variety of natural language applications such as machine translation (Bahdanau et al., 2015), dialogue modeling (Vinyals et al., 2015), and language modeling (Radford et al., 2018). Despite this success, MLE-trained neural sequence models have been shown to exhibit issues such as length bias (Sountsov and Sarawagi, 2016; Stahlberg and Byrne, 2019) and degenerate repetition (Holtzman et al., 2019). These issues are suspected to be related to the maximum likelihood objective’s local normalization, which results in a discrepancy between the learned model’s distribution and the distribution induced by the decoding algorithm used to generate sequences (Lafferty et al., 2001; Andor et al., 2016). This has prompted the development of alternative decoding methods (Wu et al., 2016; Holtzman et al., 2019) and training objectives (Murray and Chiang, 2018; Welleck et al., 2019). In this paper, we formalize and study this discrepancy between the model and the decoding algorithm.

We begin by formally defining recurrent neural language models, a family that encompasses neural models used in practice, such as recurrent neural networks (Elman, 1990; Cho et al., 2014; Hochreiter and Schmidhuber, 1997) and transformers (Vaswani et al., 2017). Next, we formally define a decoding algorithm, a function that induces a distribution over sequences given a recurrent language model and a context distribution, which is used to obtain probable sequences from a model. In this paper, we show that the distribution induced by a decoding algorithm can contradict this intended use; instead, the decoding algorithm may return improbable, infinite-length sequences.

Our main finding is that a sequence which receives zero probability under a recurrent language model’s distribution can receive nonzero probability under the distribution induced by a decoding algorithm. This occurs when the recurrent language model always ranks the sequence termination token outside of the set of tokens considered at each decoding step, yielding an infinite-length, zero probability sequence. This holds whenever the decoding algorithm is incomplete, in the sense that the algorithm excludes tokens from consideration at each step of decoding, which is the case for common methods such as greedy search, beam search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2019). We formalize our main finding using the notion of consistency (Chen et al., 2017) – whether a distribution assigns probability mass only to finite sequences – and prove that a consistent recurrent language model paired with an incomplete decoding algorithm can induce an inconsistent sequence distribution.

Based on the insight that inconsistency occurs due to the behavior of the termination token under incomplete decoding, we develop two methods for addressing inconsistency. First, we propose consistent sampling methods which guarantee that the termination token is not excluded from selection during decoding. Second, we introduce a self-terminating recurrent language model which ensures that the termination token is eventually ranked above all others, guaranteeing consistency under incomplete decoding.

To empirically measure inconsistency, we decode sequences from trained recurrent language models and measure the proportion of sequences with lengths far exceeding the maximum training sequence length. Our experiments on the Wikitext2 dataset (Merity et al., 2016) suggest that inconsistency occurs in practice when using incomplete decoding methods, while the proposed consistent sampling methods and self-terminating model parameterization prevent inconsistency and maintain language modeling quality.

The theoretical analysis reveals defects of existing decoding algorithms, providing a way to develop future models, inference procedures, and learning algorithms. We present methods related to sampling and model parameterization, but other directions remain, which we leave to future work; we close with directions related to sequence-level learning.

2 Background

We begin our discussion by establishing background definitions. First, we define a sequence, which is the main object of our investigation.

Definition 2.1 (Sequence).

A sequence $Y$ is an ordered collection of items from a predefined finite vocabulary $V$. A sequence of finite length always ends with a special token $\langle\text{eos}\rangle \in V$ that only appears at the end of a sequence.

Each model we consider generates a sequence conditioned on context information, such as a prefix in sentence completion. To consider this, we define a context distribution.

Definition 2.2 (Context distribution).

A context distribution $p(C)$ is a probability distribution defined over a set $\mathcal{C}$. An element $C \in \mathcal{C}$ is called a context.

2.1 Recurrent Language Models

A recurrent language model is an autoregressive model of a sequence distribution, where each conditional probability is parameterized with a neural network. Importantly, we assume that all tokens in a sequence are dependent on each other under a recurrent language model. This allows us to avoid cases in which the model degenerates to a Markovian language model, such as an $n$-gram model with a finite $n$.

Definition 2.3 (Recurrent language model).

A recurrent language model $p_\theta$ is a neural network that computes the following conditional probability at each time step $t$:

$$p_\theta(y_t = v \mid y_{<t}, C) = \frac{\exp\left(u_v^\top h_{t-1} + c_v\right)}{\sum_{v' \in V} \exp\left(u_{v'}^\top h_{t-1} + c_{v'}\right)},$$

where $h_t = f_\theta(y_t, h_{t-1})$ and $h_0 = g_\theta(C)$, and $u$, $c$, $\theta$ are parameters. A recurrent language model thereby computes the probability of a sequence $Y = (y_1, \ldots, y_T)$ by

$$p_\theta(Y \mid C) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, C),$$

where $y_{<t} = (y_1, \ldots, y_{t-1})$. This distribution satisfies

$$\sum_{v \in V} p_\theta(y_t = v \mid y_{<t}, C) = 1 \quad \text{for every } t \text{ and } y_{<t}.$$

Practical variants of the recurrent language model differ by the choice of transition function $f_\theta$ (Elman, 1990; Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Vaswani et al., 2017). The use of softmax (Bridle, 1990) implies that every unique token in the vocabulary is considered at every location of a sequence.

Remark 2.1.

Under the conditional distribution of a recurrent language model, every token $v \in V$ is assigned a positive probability. This implies that $0 < p_\theta(v \mid y_{<t}, C) < 1$. In addition, it follows that any finite sequence is probable under a recurrent language model given any context, i.e., $p_\theta(Y \mid C) > 0$ for any sequence $Y$ of finite length.

2.2 Decoding Algorithms

Because it is intractable to decode the most probable sequence, it is necessary in practice to use an approximate decoding algorithm.

Definition 2.4 (Decoding algorithm).

A decoding algorithm $\mathcal{F}(p_\theta, C)$ is a function that generates a sequence $\tilde{Y}$ given a recurrent language model $p_\theta$ and a context $C$. Let $q_{\mathcal{F}}$ denote the distribution induced by the decoding algorithm $\mathcal{F}$.

We consider two families of decoding algorithms. In our analysis we only consider decoding algorithms that decode in a single pass, forward in time, without modifying previously selected tokens.

Stochastic decoding.

The first family consists of stochastic algorithms. Among them, ancestral sampling is asymptotically unbiased and can be used for finding the most probable sequence, although it requires a substantial number of samples to achieve a low-variance estimate.

Definition 2.5 (Ancestral sampling).

Ancestral sampling generates a sequence from a recurrent language model $p_\theta$ given a context $C$ by recursively sampling from $p_\theta(y_t \mid \tilde{y}_{<t}, C)$ until $\tilde{y}_t = \langle\text{eos}\rangle$:

$$\tilde{y}_t \sim p_\theta(y_t \mid \tilde{y}_{<t}, C).$$
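As a concrete illustration of the recursion, here is a minimal Python sketch of ancestral sampling. It assumes a hypothetical `next_token_probs(prefix, context)` function returning the model's conditional distribution $p_\theta(\cdot \mid \tilde{y}_{<t}, C)$ as a dictionary over the vocabulary; the names and the toy model are illustrative, not the paper's implementation.

```python
import random

EOS = "<eos>"

def ancestral_sample(next_token_probs, context, max_steps=None):
    """Recursively sample y_t ~ p(. | y_<t, C) until <eos> is sampled.

    `next_token_probs(prefix, context)` is assumed to return a dict mapping
    each token in V to its conditional probability. `max_steps` is only a
    safety cap for this sketch; ancestral sampling imposes no length limit.
    """
    prefix = []
    while True:
        probs = next_token_probs(prefix, context)
        tokens, weights = zip(*probs.items())
        token = random.choices(tokens, weights=weights, k=1)[0]
        prefix.append(token)
        if token == EOS:
            return prefix
        if max_steps is not None and len(prefix) >= max_steps:
            return prefix  # truncated only for illustration

# Toy model: a fixed conditional distribution over a 3-token vocabulary.
def toy_next_token_probs(prefix, context):
    return {"a": 0.5, "b": 0.3, EOS: 0.2}

print(ancestral_sample(toy_next_token_probs, context=None))
```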

To avoid this high variance, two approximate stochastic decoding algorithms have recently been proposed and tested with recurrent language models. Top-k sampling considers only a subset of the k most probable tokens from the vocabulary at a time, while nucleus sampling considers only the minimal subset of most probable tokens whose total probability is higher than a predefined threshold.

Definition 2.6 (Top-k sampling (Fan et al., 2018)).

Top-k sampling generates a sequence from a recurrent language model $p_\theta$ given a context $C$ by recursively sampling from the following proposal distribution:

$$q(v) \propto \begin{cases} p_\theta(y_t = v \mid \tilde{y}_{<t}, C) & \text{if } v \in V_k, \\ 0 & \text{otherwise,} \end{cases}$$

where $V_k \subseteq V$ is the set of the $k$ most probable tokens under $p_\theta(y_t \mid \tilde{y}_{<t}, C)$.

Definition 2.7 (Nucleus sampling (Holtzman et al., 2019)).

Nucleus sampling generates a sequence from a recurrent language model $p_\theta$ given a context $C$ by recursively sampling from the following proposal distribution. Let $v_1, v_2, \ldots, v_{|V|}$ denote the tokens in $V$ sorted such that $p_\theta(y_t = v_i \mid \tilde{y}_{<t}, C) \ge p_\theta(y_t = v_j \mid \tilde{y}_{<t}, C)$ for all $i < j$, and define

$$q(v) \propto \begin{cases} p_\theta(y_t = v \mid \tilde{y}_{<t}, C) & \text{if } v \in V_\mu, \\ 0 & \text{otherwise,} \end{cases}$$

where $V_\mu = \{v_1, \ldots, v_{k_\mu}\}$ with $k_\mu = \min\big\{k : \sum_{i=1}^{k} p_\theta(y_t = v_i \mid \tilde{y}_{<t}, C) \ge \mu\big\}$.
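The following sketch contrasts the two proposal distributions, assuming the per-step token probabilities are given as a dictionary (an illustrative interface, not the authors' code); both keep a subset of tokens and renormalize.

```python
def topk_proposal(probs, k):
    """Keep only the k most probable tokens and renormalize (Definition 2.6)."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {v: p / total for v, p in top}

def nucleus_proposal(probs, mu):
    """Keep the smallest set of most probable tokens whose total probability
    reaches the threshold mu, then renormalize (Definition 2.7)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for v, p in ranked:
        kept.append((v, p))
        cumulative += p
        if cumulative >= mu:
            break
    total = sum(p for _, p in kept)
    return {v: p / total for v, p in kept}

# Example: <eos> is dropped by both proposals at this step.
step_probs = {"a": 0.5, "b": 0.3, "<eos>": 0.2}
print(topk_proposal(step_probs, k=2))        # {'a': 0.625, 'b': 0.375}
print(nucleus_proposal(step_probs, mu=0.7))  # {'a': 0.625, 'b': 0.375}
```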

Deterministic decoding.

The other family consists of deterministic decoding algorithms, where a token is selected deterministically according to a rule at each decoding step. The most naive algorithm, called greedy decoding, simply takes the most probable token at each step.

Definition 2.8 (Greedy decoding).

Greedy decoding generates a sequence from a recurrent language model $p_\theta$ given a context $C$ by recursively selecting the most likely token until $\tilde{y}_t = \langle\text{eos}\rangle$:

$$\tilde{y}_t = \underset{v \in V}{\arg\max}\; p_\theta(y_t = v \mid \tilde{y}_{<t}, C).$$

In contrast to greedy decoding, beam search operates on the level of partial sequences or prefixes.

Definition 2.9 (Prefix).

A prefix $\rho_t$ is an ordered collection of items from $V$. The score of a prefix is

$$s(\rho_t) = \sum_{\tau = 1}^{t} \log p_\theta\big(y_\tau = \rho_t[\tau] \mid \rho_t[{<}\tau],\, C\big),$$

where $\rho_t[\tau]$ is the token at time $\tau$ of $\rho_t$.

Starting from a set of empty prefixes, at each iteration a new prefix set is formed by expanding each prefix, then choosing the $k$ highest-scoring expanded prefixes.

Definition 2.10 (Beam search).

Beam search with width $k$, $k \ge 1$, generates a sequence from a recurrent language model $p_\theta$ by maintaining a size-$k$ prefix set $P_t^{\text{top}}$. Starting with $P_0^{\text{top}} = \{\emptyset\}$, at each iteration $t$ beam search forms a new prefix set $P_t$ by expanding the current set, $P_t = \bigcup_{\rho \in P_{t-1}^{\text{top}}} \{\rho \circ v \mid v \in V\}$ (where $\circ$ is concatenation), then choosing the $k$ highest-scoring elements,

$$P_t^{\text{top}} = \underset{P' \subseteq P_t,\; |P'| = k}{\arg\max} \sum_{\rho \in P'} s(\rho).$$

Any $\rho \in P_t^{\text{top}}$ ending with $\langle\text{eos}\rangle$ is restricted from being expanded further, and is added to a set $S$. Beam search ends when $S$ contains $k$ sequences, and returns the highest-scoring sequence in $S$.
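A compact sketch of this procedure is shown below, using the same hypothetical `next_token_probs` interface as above and summed log-probability scores from Definition 2.9; it mirrors the definition rather than any particular library implementation.

```python
import math

EOS = "<eos>"

def beam_search(next_token_probs, context, k, max_steps=100):
    """Width-k beam search over prefixes scored by summed log-probabilities.

    Prefixes ending with <eos> are moved to `finished` and not expanded;
    search stops once k of them are collected (`max_steps` is a safety cap
    for this sketch only).
    """
    beams = [([], 0.0)]            # (prefix, score)
    finished = []                  # the set S of terminated sequences
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            for token, p in next_token_probs(prefix, context).items():
                candidates.append((prefix + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))   # no further expansion
            else:
                beams.append((prefix, score))
        if len(finished) >= k or not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])[0]        # highest-scoring sequence

def toy_next_token_probs(prefix, context):
    # <eos> becomes more likely as the prefix grows, so the search terminates.
    p_eos = min(0.9, 0.1 * (len(prefix) + 1))
    return {"a": (1 - p_eos) * 0.6, "b": (1 - p_eos) * 0.4, EOS: p_eos}

print(beam_search(toy_next_token_probs, context=None, k=2))
```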

Incompleteness.

Other than ancestral sampling, the decoding algorithms above are incomplete in that they only consider a strict subset of the full vocabulary at each time step, aside from the trivial case of $k = |V|$. (Nucleus sampling is incomplete when, for every context and prefix, the probability of the least probable token is at most $1 - \mu$.)

Definition 2.11 (Incomplete Decoding).

A decoding algorithm $\mathcal{F}$ is incomplete when, for each context $C$ and prefix $y_{<t}$, there is a strict subset $V'_t \subsetneq V$ such that

$$\sum_{v \in V'_t} q_{\mathcal{F}}(y_t = v \mid y_{<t}, C) = 1.$$

3 Consistency of a Decoding Algorithm

Definition of consistency.

A recurrent language model may assign a positive probability to an infinitely long sequence, in which case we call the model inconsistent. This notion of consistency was raised and analyzed earlier, for instance by Booth and Thompson (1973) and Chen et al. (2017), in terms of whether the distribution induced by $p_\theta$ is concentrated on finite sequences. We extend their definition to account for the context $C$.

Definition 3.1 (Consistency of a recurrent language model).

A recurrent language model $p_\theta$ is consistent under a context distribution $p(C)$ if $p_\theta(|Y| = \infty) = 0$. Otherwise, the recurrent language model is said to be inconsistent.

Any sequence decoded from a consistent model for a given probable context is guaranteed to terminate.

Lemma 3.1.

If a recurrent language model is consistent, then $p_\theta(|Y| = \infty \mid C) = 0$ for any probable context $C$. (Proofs of Lemmas 3.1–3.3 are in Appendix A.)

Next, we establish a practical condition under which a recurrent language model is consistent.

Lemma 3.2.

A recurrent language model is consistent if $\|h_t\|_p$ is uniformly bounded for some $p \ge 1$.

Proof sketch.

If $\|h_t\|_p$ is uniformly bounded, then every logit $u_v^\top h_t + c_v$ is bounded, hence $p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C) > \xi > 0$ for a constant $\xi$. Thus $p_\theta(|Y| = \infty) \le \lim_{t \to \infty} (1 - \xi)^t = 0$, meaning that $p_\theta$ is consistent. ∎
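To make the sketch concrete, here is one illustrative instantiation of the bound (the constant $B$ and the resulting $\xi$ are assumptions for illustration, not the exact constants of the proof in Appendix A): if every logit satisfies $|u_v^\top h_t + c_v| \le B$, then the softmax gives

$$p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C) \ge \frac{e^{-B}}{|V|\, e^{B}} = \frac{e^{-2B}}{|V|} =: \xi > 0,$$

so the probability of not terminating within $t$ steps is at most $(1-\xi)^t$, which vanishes as $t \to \infty$.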

Although this condition is practical because layer normalization or bounded activation functions (Elman, 1990; Cho et al., 2014; Vaswani et al., 2017) result in bounded $\|h_t\|_p$, we show that even if a recurrent language model is consistent, a decoding algorithm may produce an infinite-length sequence. We formalize this discrepancy using the consistency of a decoding algorithm.

Definition 3.2 (Consistency of a decoding algorithm).

A decoding algorithm $\mathcal{F}$ is consistent with respect to a consistent recurrent language model $p_\theta$ under a context distribution $p(C)$ if the decoding algorithm $\mathcal{F}$ preserves the consistency of the model $p_\theta$, that is, $q_{\mathcal{F}}(|Y| = \infty) = 0$.

When a consistent recurrent language model $p_\theta$ and a decoding algorithm $\mathcal{F}$ induce a consistent distribution $q_{\mathcal{F}}$, we say that $p_\theta$ paired with $\mathcal{F}$ is consistent. For instance, any consistent recurrent language model paired with ancestral sampling is consistent, because the induced distribution is the same as the distribution of the original model. We also have an analogue of Lemma 3.1.

Lemma 3.3.

A consistent decoding algorithm with respect to a consistent recurrent language model decodes only probable sequences. That is, if $q_{\mathcal{F}}(Y \mid C) > 0$, then $p_\theta(Y \mid C) > 0$ for any probable context $C$.

Inconsistency of incomplete decoding.

Any incomplete decoding algorithm (Definition 2.11) can be inconsistent regardless of the context distribution, because there is a recurrent language model that places $\langle\text{eos}\rangle$ outside of $V'_t$ at every step of decoding. To show this, we construct a consistent recurrent language model whose distribution induced by an incomplete decoding algorithm is inconsistent.

Theorem 3.4 (Inconsistency of an incomplete decoding algorithm).

There exists a consistent recurrent language model $p_\theta$ from which an incomplete decoding algorithm $\mathcal{F}$, that considers only up to the $(|V|-1)$ most likely tokens according to $p_\theta(y_t \mid y_{<t}, C)$ at each step $t$, finds a sequence $\tilde{Y}$ whose probability under $p_\theta$ is 0 for any context distribution.

Proof.

We prove this theorem by constructing a recurrent network. We define the recurrent function $f_\theta$ as

$$h_t = f_\theta(y_t, h_{t-1}) = \sigma\!\left(W h_{t-1} + I x_t\right),$$

where $x_t$ is a one-hot representation of $y_t$, $\sigma$ is the element-wise sigmoid so that every entry of $h_t$ is positive, and $I$ is an identity matrix of size $|V| \times |V|$. $W$ is constructed to consist of positive values only, and $h_0 = g_\theta(C)$ is chosen to have positive entries bounded by 1. Because each element of $h_t$ is bounded by 1, the constructed recurrent language model is consistent by Lemma 3.2.

For $v \neq \langle\text{eos}\rangle$, we set $u_v$ (see Definition 2.3) to be

$$u_v = \bar{u} + e_v,$$

where all elements of $\bar{u}$ are positive and $e_v$ is a one-hot representation of $v$. $c_v$ is set to zero. Next, let

$$u_{\langle\text{eos}\rangle} = -\bar{u},$$

where all elements of $-\bar{u}$ are negative, with $c_{\langle\text{eos}\rangle}$ also set to zero.

This defines a valid recurrent language model (Definition 2.3), since the conditional distribution at each time $t$ is influenced by all the previous tokens. More specifically, the logit of a token $v$ is influenced by the indicators $\mathbb{1}(y_{t'} = v')$ for all $t' < t$ and $v' \in V$, where $\mathbb{1}$ is an indicator function.

This recurrent language model always outputs positive logits for non-$\langle\text{eos}\rangle$ tokens, and outputs negative logits for the $\langle\text{eos}\rangle$ token. This implies $p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C) < p_\theta(v \mid y_{<t}, C)$ for all $v \neq \langle\text{eos}\rangle$. This means that $\langle\text{eos}\rangle$ is always ranked last at each time step, so an incomplete decoding algorithm that considers at most the $(|V|-1)$ most probable tokens at each time step from $p_\theta(y_t \mid y_{<t}, C)$ cannot decode $\langle\text{eos}\rangle$ and thus always decodes an infinitely long sequence $\tilde{Y}$.

The log-probability of this infinitely long sequence is

$$\log p_\theta(\tilde{Y} \mid C) = \sum_{t=1}^{\infty} \log p_\theta(\tilde{y}_t \mid \tilde{y}_{<t}, C).$$

For any $t$,

$$\log p_\theta(\tilde{y}_t \mid \tilde{y}_{<t}, C) \le \log\left(1 - p_\theta(\langle\text{eos}\rangle \mid \tilde{y}_{<t}, C)\right) \le \log(1 - \xi) < 0,$$

where $\xi = \inf_t p_\theta(\langle\text{eos}\rangle \mid \tilde{y}_{<t}, C) > 0$ by the boundedness of the logits. The last inequality holds because $\log$ is increasing and $1 - p_\theta(\langle\text{eos}\rangle \mid \tilde{y}_{<t}, C) \le 1 - \xi$. Therefore, the log-probability diverges to $-\infty$ as $t \to \infty$, and thus $p_\theta(\tilde{Y} \mid C) = 0$, which implies the decoding algorithm is inconsistent by Lemma 3.3. ∎

Greedy decoding, beam search, top-k sampling, and nucleus sampling are all inconsistent according to this theorem; there are consistent models that induce inconsistent distributions when paired with these decoding algorithms.
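As a toy numerical illustration (hypothetical probabilities, not the construction used in the proof): if a model always assigns $\langle\text{eos}\rangle$ the smallest conditional probability, any decoding algorithm that keeps fewer than $|V|$ tokens per step can never select it, and the log-probability of the decoded sequence diverges.

```python
import math

EOS = "<eos>"

# Hypothetical model: <eos> gets the smallest probability at every step.
def next_token_probs(prefix, context):
    return {"a": 0.6, "b": 0.3, EOS: 0.1}

def topk_support(probs, k):
    """Tokens that top-k sampling can ever select at this step."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return set(v for v, _ in ranked[:k])

# With k=2 (< |V|=3), <eos> is excluded at every step, so decoding never terminates.
probs = next_token_probs([], None)
assert EOS not in topk_support(probs, k=2)

# The decoded sequence's log-probability under the model diverges:
# each step contributes at most log(1 - p(<eos>)) = log(0.9) < 0.
steps = 10_000
log_prob_bound = steps * math.log(1 - probs[EOS])
print(f"after {steps} steps, log p <= {log_prob_bound:.1f}")  # -> -inf as steps grows
```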

4 Fixing the Inconsistency

In this section, we consider two ways to prevent inconsistency arising from incomplete decoding algorithms. First, we introduce consistent versions of top-k and nucleus sampling. Second, we introduce the self-terminating recurrent language model, which is consistent when paired with any of the decoding algorithms considered in this paper.

4.1 Consistent Sampling Algorithms

The proof of Theorem 3.4 suggests that inconsistency of incomplete decoding algorithms arises from the fact that $\langle\text{eos}\rangle$ may be excluded indefinitely from the set of top-ranked tokens. We propose a simple modification to top-k and nucleus sampling that forces $\langle\text{eos}\rangle$ to be included at each step of decoding. First, we give a condition for when a particular model paired with a decoding algorithm is consistent.

Theorem 4.1.

Let $p_\theta$ be a consistent recurrent language model. If a decoding algorithm $\mathcal{F}$ satisfies $q_{\mathcal{F}}(\langle\text{eos}\rangle \mid y_{<t}, C) \ge p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C)$ for every prefix $y_{<t}$ and context $C$, then the decoding algorithm $\mathcal{F}$ is consistent with respect to the model $p_\theta$.

Proof.

Let $\mathcal{P}_t$ denote the set of all prefixes of length $t$ that do not contain $\langle\text{eos}\rangle$. For $t \ge 1$,

$$q_{\mathcal{F}}(|Y| > t+1 \mid C) = \sum_{\rho \in \mathcal{P}_t} q_{\mathcal{F}}(\rho \mid C)\left(1 - q_{\mathcal{F}}(\langle\text{eos}\rangle \mid \rho, C)\right) \le \sum_{\rho \in \mathcal{P}_t} q_{\mathcal{F}}(\rho \mid C)\left(1 - p_\theta(\langle\text{eos}\rangle \mid \rho, C)\right).$$

Taking the limit $t \to \infty$ and the expectation over $C$ on both sides, we have

$$q_{\mathcal{F}}(|Y| = \infty) \le \mathbb{E}_C\left[\lim_{t \to \infty} \sum_{\rho \in \mathcal{P}_t} q_{\mathcal{F}}(\rho \mid C)\left(1 - p_\theta(\langle\text{eos}\rangle \mid \rho, C)\right)\right] = 0,$$

from which the decoding algorithm is consistent. ∎

We define consistent variants of top-k and nucleus sampling which satisfy this condition.

Definition 4.1 (Consistent top-k sampling).

Consistent top-k sampling is top-k sampling with the following modified proposal distribution:

$$q(v) \propto \begin{cases} p_\theta(y_t = v \mid \tilde{y}_{<t}, C) & \text{if } v \in V'_k, \\ 0 & \text{otherwise,} \end{cases}$$

where $V'_k = V_k \cup \{\langle\text{eos}\rangle\}$.

Definition 4.2 (Consistent nucleus sampling).

Consistent nucleus sampling is nucleus sampling with the following modified proposal distribution:

$$q(v) \propto \begin{cases} p_\theta(y_t = v \mid \tilde{y}_{<t}, C) & \text{if } v \in V_\mu \cup \{\langle\text{eos}\rangle\}, \\ 0 & \text{otherwise.} \end{cases}$$

The induced probability of $\langle\text{eos}\rangle$ under these two algorithms is always equal to or larger than the model’s probability. By Theorem 4.1, these algorithms are consistent with respect to any consistent recurrent language model.
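A sketch of the modification (same illustrative dictionary interface as before): the only change relative to the incomplete proposals of Section 2.2 is that $\langle\text{eos}\rangle$ is added back to the retained set before renormalizing.

```python
EOS = "<eos>"

def consistent_topk_proposal(probs, k):
    """Top-k proposal with <eos> forced into the retained set (Definition 4.1)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    kept[EOS] = probs[EOS]                    # union with {<eos>}
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}

def consistent_nucleus_proposal(probs, mu):
    """Nucleus proposal with <eos> forced into the retained set (Definition 4.2)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for v, p in ranked:
        kept[v] = p
        cumulative += p
        if cumulative >= mu:
            break
    kept[EOS] = probs[EOS]
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}

step_probs = {"a": 0.5, "b": 0.3, EOS: 0.2}
print(consistent_topk_proposal(step_probs, k=2))       # <eos> keeps probability 0.2
print(consistent_nucleus_proposal(step_probs, mu=0.7))
```

Because the retained mass is at most 1, the induced $\langle\text{eos}\rangle$ probability is never smaller than the model's, which is exactly the condition required by Theorem 4.1.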

4.2 A Self-Terminating Recurrent Language Model

Although these consistent sampling algorithms can be used with any recurrent language model, their stochastic nature may not be suitable for finding a single, highly probable sequence. To avoid this limitation, we propose the self-terminating recurrent language model (STRLM).

Definition 4.3 (Self-terminating recurrent language model).

A self-terminating recurrent language model computes the following conditional probability at each time step:

$$p_\theta(v \mid y_{<t}, C) = \begin{cases} 1 - \alpha_t, & v = \langle\text{eos}\rangle, \\ \alpha_t \cdot \dfrac{\exp\left(u_v^\top h_{t-1} + c_v\right)}{\sum_{v' \neq \langle\text{eos}\rangle} \exp\left(u_{v'}^\top h_{t-1} + c_{v'}\right)}, & \text{otherwise,} \end{cases}$$

where

$$\alpha_t = \sigma\!\left(u_{\langle\text{eos}\rangle}^\top h_{t-1} + c_{\langle\text{eos}\rangle}\right) \alpha_{t-1}, \qquad \alpha_0 = 1,$$

with $\sigma : \mathbb{R} \to (0, 1-\epsilon)$ and $\epsilon \in (0, 1)$. $h_t$ is computed as in the original recurrent language model.

The underlying idea is that the probability of $\langle\text{eos}\rangle$ increases monotonically. The model is consistent when paired with greedy decoding.
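The sketch below shows one way such an output layer can be computed, following the parameterization written out in Definition 4.3; the function names, the capped sigmoid, and the toy logits are illustrative assumptions rather than the authors' implementation.

```python
import math

EOS = "<eos>"

def sigmoid_capped(x, eps):
    """A squashing function with range (0, 1 - eps)."""
    return (1.0 - eps) / (1.0 + math.exp(-x))

def self_terminating_step(logits, eos_logit, prev_non_eos_mass, eps):
    """One step of a self-terminating output layer.

    `logits` maps non-<eos> tokens to their scores, `eos_logit` is the score
    for <eos>, and `prev_non_eos_mass` is 1 - p(<eos>) from the previous step
    (1.0 at the first step). Returns (token distribution, new non-<eos> mass).
    """
    alpha = sigmoid_capped(eos_logit, eps) * prev_non_eos_mass
    z = sum(math.exp(s) for s in logits.values())
    probs = {v: alpha * math.exp(s) / z for v, s in logits.items()}
    probs[EOS] = 1.0 - alpha
    return probs, alpha

# The <eos> probability is lower-bounded by 1 - (1 - eps)^t, so it eventually
# exceeds 1/2 and greedy decoding must terminate.
eps, mass = 0.05, 1.0
for t in range(1, 60):
    probs, mass = self_terminating_step(
        {"a": 1.0, "b": 0.5}, eos_logit=3.0, prev_non_eos_mass=mass, eps=eps)
    if max(probs, key=probs.get) == EOS:
        print(f"greedy decoding selects {EOS} at step {t}")  # terminates after a handful of steps
        break
```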

Theorem 4.2.

Greedy decoding is consistent with respect to any self-terminating recurrent language model.

Proof.

Let $a_t$ denote $\alpha_t$ and let $p_t^{\langle\text{eos}\rangle}$ denote $p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C)$. By Definition 4.3 we have

$$a_t \le (1-\epsilon)\, a_{t-1} \le (1-\epsilon)^t, \qquad \text{hence} \qquad p_t^{\langle\text{eos}\rangle} = 1 - a_t \ge 1 - (1-\epsilon)^t.$$

Take $t_{1/2} = \left\lceil \log_{1-\epsilon} \tfrac{1}{2} \right\rceil$. We then have $p_t^{\langle\text{eos}\rangle} > \tfrac{1}{2}$ for all $t > t_{1/2}$, which implies that $\langle\text{eos}\rangle$ is always the most probable token after time step $t_{1/2}$. Hence, the sequence length is at most $t_{1/2} + 1$ with probability 1. ∎
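As an illustrative calculation with a hypothetical $\epsilon = 0.1$ (not a value taken from the experiments): the bound gives $p_t^{\langle\text{eos}\rangle} \ge 1 - 0.9^t$, and since $0.9^7 \approx 0.478 < \tfrac{1}{2}$, the $\langle\text{eos}\rangle$ token is guaranteed to be the most probable token from step 7 onward, so greedy decoding terminates after at most 7 tokens regardless of the rest of the model.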

Beam search is also consistent with respect to any self-terminating recurrent language model according to a similar argument; see Appendix B for the proof.

5 Empirical Validation

The theoretical results rely on the existence of a model that results in inconsistency; it remains to be shown that inconsistency with respect to incomplete decoding occurs with recurrent language models encountered in practice. Moreover, while the proposed consistent sampling methods and self-terminating recurrent language model carry theoretical guarantees in terms of consistency, we must check whether they retain language modeling quality. To do so, we perform two experiments using a sequence completion task. In each experiment, we use the beginning of a sequence as context, then decode continuations from a trained recurrent language model and measure the proportion of non-terminated sequences in order to approximately measure inconsistency. The first experiment (§5.1) shows that inconsistency occurs in practice, and the second experiment (§5.2) shows the effectiveness of the proposed approaches.

Sequence completion.

We evaluate recurrent language models on a sequence completion task, which has previously been used to evaluate the effectiveness of sequence models, e.g. Sutskever et al. (2011); Graves (2013); Radford et al. (2018); Holtzman et al. (2019); Welleck et al. (2019). Sequence completion is a general setting for studying the behavior of language models, encompassing machine translation (Bahdanau et al., 2015), story generation (Fan et al., 2018), and dialogue modeling (Vinyals et al., 2015). The task consists of decoding a continuation $\hat{Y} \sim \mathcal{F}(p_\theta, C)$ given a length-$k$ prefix $C = (c_1, \ldots, c_k)$, resulting in a completion $(c_1, \ldots, c_k, \hat{y}_1, \ldots, \hat{y}_T)$.

Dataset.

We use the Wikitext2 dataset (Merity et al., 2016) consisting of paragraphs from Wikipedia, since it has frequently been used to evaluate language models (Grave et al., 2017; Melis et al., 2018; Merity et al., 2018). We split each paragraph into sentences using Spacy (https://spacy.io/), resulting in roughly 100k sequences (78,274 train, 8,464 valid, 9,708 test). We split each sequence, using the first $k$ tokens as a context and the remaining tokens as a continuation. To ensure that each sequence contains a prefix, we prepend padding tokens to make it at least length $k$. Special $\langle\text{bos}\rangle$ and $\langle\text{eos}\rangle$ tokens are then inserted at the beginning and end of every sequence. Our experiments use a fixed prefix length $k$ for all sequences. We model sequences at the word level with a vocabulary size of 33,182. The average training sequence length is 24 tokens, with a maximum of 137.

Context distribution.

We define empirical context distributions with prefixes from the train, valid, and test sets:

$$p(C; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{C' \in \mathcal{D}} \mathbb{1}\left(C = C'\right),$$

where $\mathcal{D}$ is a dataset split.

Evaluation metrics.

We use finite sequences to approximately measure the consistency of a model paired with a decoding algorithm, since decoding an infinite-length sequence is impossible. We use the proportion of decoded continuations that are longer than a predefined limit,

$$r_L = \frac{1}{|\mathcal{D}|} \sum_{C \in \mathcal{D}} \mathbb{1}\left(|\hat{Y}^{(C)}| \ge L\right),$$

where $\hat{Y}^{(C)} \sim \mathcal{F}(p_\theta, C)$ for each context $C$ in $\mathcal{D}$. We call $r_L$ the non-termination ratio of the decoding algorithm $\mathcal{F}$ for an underlying model and context distribution. A value of $r_L$ greater than zero means that some sequences did not terminate within $L$ steps. When $L$ is infinity, a non-zero $r_L$ implies that the model paired with the decoding algorithm is inconsistent. In practice, we use a finite $L$ that is substantially larger than the maximum training sequence length, and we interpret a non-zero $r_L$ as evidence that the model paired with the decoding algorithm is inconsistent. We use an $L$ that is more than 10 times the maximum training sequence length.
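A sketch of the measurement, with a hypothetical `decode(context, max_steps)` function standing in for any of the decoding algorithms above paired with a trained model:

```python
EOS = "<eos>"

def non_termination_ratio(decode, contexts, limit):
    """Fraction of decoded continuations that fail to emit <eos> within
    `limit` steps (the r_L statistic above)."""
    non_terminated = 0
    for context in contexts:
        continuation = decode(context, max_steps=limit)
        if EOS not in continuation:
            non_terminated += 1
    return non_terminated / len(contexts)

# Example with a dummy decoder that never terminates on one of three contexts.
def dummy_decode(context, max_steps):
    if context == "bad prefix":
        return ["a"] * max_steps          # no <eos> within the limit
    return ["a", "b", EOS]

print(non_termination_ratio(dummy_decode, ["ok", "bad prefix", "ok"], limit=500))  # 0.333...
```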

In each experiment, we report the mean and standard deviation of metrics across 10 independent initializations. Unless specified otherwise, we report metrics using the test context distribution, since the train, valid, and randomly generated context distributions had similar results.

Training.

We train recurrent language models for sequence completion with maximum likelihood, using the following loss on each sequence $Y = (c_1, \ldots, c_k, y_{k+1}, \ldots, y_T)$:

$$\mathcal{L}(\theta; Y) = -\sum_{t=k+1}^{T} \log p_\theta(y_t \mid y_{<t}).$$

This amounts to running the full training sequence through a recurrent model and zeroing the loss for the first $k$ tokens, so that the first $k$ steps correspond to learning a hidden state that encodes the context. Each model is trained on a single Nvidia P40 GPU for up to 100 epochs, stopping early when validation perplexity does not decrease for 10 consecutive epochs.
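A sketch of the per-sequence loss with the context positions masked out, using a hypothetical `log_prob(token, prefix)` interface for the model:

```python
import math

def sequence_loss(log_prob, tokens, context_length):
    """Negative log-likelihood of a training sequence, with the loss zeroed
    out for the first `context_length` tokens so that those steps only build
    up the hidden state that encodes the context."""
    loss = 0.0
    for t, token in enumerate(tokens):
        if t < context_length:
            continue                      # context positions contribute no loss
        loss -= log_prob(token, tokens[:t])
    return loss

# Toy example: a uniform model over a 4-token vocabulary.
def uniform_log_prob(token, prefix):
    return math.log(1 / 4)

tokens = ["<bos>", "the", "cat", "sat", "<eos>"]
print(sequence_loss(uniform_log_prob, tokens, context_length=2))  # 3 * log(4)
```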

Models.

We consider recurrent neural networks with hyperbolic tangent activations (tanh-RNN) (Elman, 1990) and LSTM units (LSTM-RNN) (Hochreiter and Schmidhuber, 1997). We perform an initial hyper-parameter sweep and select the best set of hyper-parameters for each of tanh-RNN and LSTM-RNN based on the validation perplexities (see Appendix C for the hyper-parameter ranges).

With this best set of hyper-parameters, we train each of these models with 10 different initializations. The choice of tanh and LSTM RNNs implies that all of the recurrent language models that we train are consistent according to Lemma 3.2. Our LSTM models achieve test perplexity similar to that reported in previous work (Merity et al., 2018); see Appendix C for further details.

Additionally, we train self-terminating tanh-RNN and LSTM-RNN variants (Definition 4.3) at various values of $\epsilon$, which controls a lower bound on the termination probability at each step, using the hyper-parameters selected in the preceding grid search.

                 tanh-RNN       LSTM-RNN
ancestral        0.00 ± 0.0     0.00 ± 0.0
greedy           6.07 ± 5.6     1.03 ± 0.3
beam-2           1.21 ± 0.3     0.07 ± 0.1
beam-4           0.29 ± 0.1     0.00 ± 0.0
topk-2           0.84 ± 0.8     0.00 ± 0.0
topk-4           0.02 ± 0.0     0.00 ± 0.0
nucleus-0.2      2.49 ± 0.2     0.76 ± 0.3
nucleus-0.4      0.32 ± 0.1     0.22 ± 0.1
Table 1: Non-termination ratio $r_L$ (%) of decoded sequences using ancestral sampling and incomplete decoding methods (mean ± std over 10 runs).
                           tanh-RNN       LSTM-RNN
consistent topk-2          0.00 ± 0.0     0.00 ± 0.0
consistent topk-4          0.00 ± 0.0     0.00 ± 0.0
consistent nucleus-0.2     0.00 ± 0.0     0.01 ± 0.0
consistent nucleus-0.4     0.00 ± 0.0     0.01 ± 0.0
Table 2: Non-termination ratio $r_L$ (%) of decoded sequences using consistent sampling methods (mean ± std over 10 runs).

5.1 Inconsistency of Recurrent Language Models

In this experiment, we demonstrate evidence of inconsistency with incomplete decoding methods (Theorem 3.4).

Table 1 shows non-termination ratios for the recurrent language models using the incomplete decoding algorithms considered in this work, along with ancestral sampling. Decoding with ancestral sampling always resulted in sequences that terminated within $L$ steps, since the induced distribution is the same as that of the consistent model. On the other hand, the non-zero non-termination ratios for the incomplete decoding algorithms suggest inconsistency with respect to each algorithm, providing evidence for Theorem 3.4.

In particular, greedy search, beam search, and nucleus sampling yielded non-terminating sequences with both the tanh and LSTM RNNs. Using greedy decoding, roughly 6% of all contexts resulted in a non-terminating continuation with the tanh-RNN, and roughly 1% with the LSTM-RNN. Nucleus sampling also produced non-terminating sequences with the tanh-RNN (2.49%, nuc-0.2) and LSTM-RNN (0.76%, nuc-0.2), with the amount of non-termination decreasing as $\mu$ increased (see Definition 2.7), likely due to $\langle\text{eos}\rangle$ having a higher chance of being included in $V_\mu$. Top-k sampling resulted in non-terminating sequences with the tanh-RNN, but not with the LSTM-RNN, implying that $\langle\text{eos}\rangle$ was ranked within the top $k$ positions on at least one timestep during each decoding. Beam search produced non-terminating sequences with both the tanh-RNN (beam-2,4) and LSTM-RNN (beam-2) models. This means that $\langle\text{eos}\rangle$ was outside of the top $k$ tokens (determined by the beam width) considered at each step, since in our experiments we terminated the beam search when a single beam prefix contained $\langle\text{eos}\rangle$. With the LSTM-RNN, a larger beam width (beam-4) prevented non-termination.

Prefix He had a guest @-@ starring role on the television
nucleus film the website , with whom he wrote to the title of The Englishwoman ’s Domestic Magazine .
c-nucleus film the website , but he did not be a new sequel .
Prefix Somewhere between 29 and 67 species are recognised in the
nucleus  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  
c-nucleus  , with the exception of an average of 6 @.@ 4 in ( 1 @.@ 6 mm ) .
Prefix The Civil War saw more ironclads built by both sides
nucleus and towns , including  , the British Empire ,  ,  ,  ,  ,  ,  ,  ,  ,  
c-nucleus and towns , including  , the British Empire ,  ,  ,  ,  ,  
Table 3: Example continuations using nucleus and consistent nucleus sampling with the LSTM-RNN.
Prefix With 2 : 45 to go in the game ,
Baseline the team was able to gain a first down .
STRLM the Wolfpack was unable to gain a first down .
Prefix As of 2012 , she is a horse riding teacher
Baseline , and a  , and a  , and a  , and a  , and a  , and a  , and a  , and a  , and a  ,
STRLM , and a member of the   .
Prefix Nintendo Power said they enjoyed Block Ball and its number
Baseline of songs , including the ”  ” , ”  ” , ”  ” , …, ” Ode to a Nightingale ” , ” Ode to a Nightingale ”, ” Ode to a
STRLM of songs , including the ”  ” , ”  ” ,  ” , ”
Table 4: Example continuations with the LSTM-RNN and a self-terminating LSTM-RNN.

5.2 Consistency of the Proposed Methods

In this experiment, we evaluate the consistent variants of top-k and nucleus sampling (§4.1) as well as the self-terminating recurrent language model (§4.2) in terms of consistency and language modeling quality.

Consistent sampling.

Table 2 shows that consistent nucleus and top-k sampling (§4.1) resulted in only terminating sequences, except for a few cases that we attribute to the finite limit $L$ used to measure the non-termination ratio. The example continuations in Table 3 show that consistent sampling tends to preserve language modeling quality on prefixes that led to termination with the baseline (first row). On prefixes that led to non-termination with the baseline (second & third rows), the quality tends to improve since the continuation now terminates. Since the model’s non-$\langle\text{eos}\rangle$ token probabilities at each step are only modified by a multiplicative constant, the sampling process can still enter a repetitive cycle (e.g. when the constant is close to 1), though the cycle is guaranteed to eventually terminate.

Self-terminating RNN.

As seen in Table 5, the self-terminating recurrent language models with sufficiently large $\epsilon$ are consistent with respect to greedy decoding, at the expense of perplexity compared to the vanilla model. The value of $\epsilon$ from Definition 4.3, which controls a lower bound on the termination probability at each step, influences both $r_L$ and perplexity. When $\epsilon$ is too large, perplexity degrades. When $\epsilon$ is too small, the lower bound grows slowly, so $\langle\text{eos}\rangle$ is not guaranteed to be top-ranked within $L$ steps, and the metrics resemble the baseline's. An intermediate $\epsilon$ balanced consistency and language modeling quality, with a zero non-termination ratio and perplexity within 3 points of the baseline.

For the example decoded sequences in Table 4, generation quality is similar when both the self-terminating and baseline models terminate (first row). For prefixes that led to non-termination with the baseline, the self-terminating variant can yield a finite sequence with reasonable quality (second row). This suggests that some cases of degenerate repetition (Holtzman et al., 2019; Welleck et al., 2019) may be attributed to inconsistency. However, in other cases the self-terminating model enters a repetitive (but finite) cycle that resembles the baseline (third row), showing that consistency does not necessarily eliminate degenerate repetition.

                                r_L (%)        perplexity
tanh-RNN   self-terminating     0.00 ± 0.0     150.07 ± 2.7
           self-terminating     0.00 ± 0.0     138.01 ± 0.6
           self-terminating     1.04 ± 0.6     138.67 ± 1.8
           baseline             6.07 ± 5.6     136.57 ± 1.8
LSTM-RNN   self-terminating     0.00 ± 0.0     101.24 ± 0.3
           self-terminating     0.00 ± 0.0      94.33 ± 0.6
           self-terminating     0.94 ± 0.5      94.15 ± 0.8
           baseline             1.03 ± 0.3      91.86 ± 0.4
Table 5: Non-termination ratio $r_L$ (%) of greedy-decoded sequences and test perplexity for self-terminating recurrent language models trained with different values of $\epsilon$, compared with the baseline models.

6 Future Directions

The methods we proposed in this paper have focused on how to resolve inconsistency from the viewpoint of decoding algorithms or model parameterization. Another approach is to address the issue of inconsistency in the learning phase.

One interesting direction is to investigate whether maximum likelihood learning is a cause of inconsistency. Given a training set $\{(C^{(n)}, Y^{(n)})\}_{n=1}^{N}$ drawn from a data distribution, maximum likelihood learning solves:

$$\hat{\theta} = \underset{\theta}{\arg\max} \sum_{n=1}^{N} \log p_\theta\left(Y^{(n)} \mid C^{(n)}\right) - \lambda \mathcal{R}(\theta),$$

where $\mathcal{R}(\theta)$ is a regularizer and $\lambda$ is a regularization weight.

Inconsistency may arise from the lack of decoding in solving this optimization problem. Maximum likelihood learning fits the model using the data distribution, whereas a decoded sequence from the trained model follows the distribution induced by a decoding algorithm. Based on this discrepancy, we make a strong conjecture: we cannot be guaranteed to obtain a good consistent sequence generator using maximum likelihood learning and greedy decoding. Sequence-level learning, however, uses a decoding algorithm during training (Ranzato et al., 2016; Bahdanau et al., 2016). We hypothesize that sequence-level learning can result in a good sequence generator that is consistent with respect to incomplete decoding.

7 Conclusion

We extended the notion of consistency of a recurrent language model put forward by Chen et al. (2017) to incorporate a decoding algorithm, and used it to analyze the discrepancy between a model and the distribution induced by a decoding algorithm. We proved that incomplete decoding is inconsistent, and proposed two methods to prevent this: consistent decoding and the self-terminating recurrent language model. Using a sequence completion task, we confirmed that empirical inconsistency occurs in practice, and that each method prevents inconsistency while maintaining the quality of generated sequences. We suspect the absence of decoding in maximum likelihood estimation as a cause behind this inconsistency, and suggest investigating sequence-level learning as an alternative in the future.

Acknowledgements

We thank Chris Dyer, Noah Smith and Kevin Knight for valuable discussions. This work was supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Research (Improving Deep Learning using Latent Structure). KC thanks eBay and NVIDIA for their support.

References

  • D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins (2016) Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pp. 2442–2452.
  • D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015).
  • T. L. Booth and R. A. Thompson (1973) Applying probability measures to abstract languages. IEEE Transactions on Computers C-22 (5), pp. 442–450.
  • J. S. Bridle (1990) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236.
  • Y. Chen, S. Gilroy, A. Maletti, J. May, and K. Knight (2017) Recurrent neural networks as weighted language recognizers. arXiv preprint arXiv:1711.05408.
  • K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
  • E. Grave, A. Joulin, and N. Usunier (2017) Improving neural language models with a continuous cache. In 5th International Conference on Learning Representations (ICLR 2017).
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
  • J. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001).
  • G. Melis, C. Dyer, and P. Blunsom (2018) On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations (ICLR 2018).
  • S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations (ICLR 2018).
  • S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  • K. Murray and D. Chiang (2018) Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 212–223.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language models are unsupervised multitask learners.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2016) Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations (ICLR 2016).
  • P. Sountsov and S. Sarawagi (2016) Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1516–1525.
  • F. Stahlberg and B. Byrne (2019) On NMT search errors and model errors: cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3354–3360.
  • I. Sutskever, J. Martens, and G. Hinton (2011) Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011).
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems.
  • O. Vinyals and Q. V. Le (2015) A neural conversational model. In ICML Deep Learning Workshop.
  • S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston (2019) Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Appendix A Proof of Lemmas in Section 3

Lemma 3.1.

If a recurrent language model is consistent, then $p_\theta(|Y| = \infty \mid C) = 0$ for any probable context $C$.

Proof.

Suppose there exists a probable context $\tilde{C}$ such that $p_\theta(|Y| = \infty \mid \tilde{C}) > 0$. Then

$$p_\theta(|Y| = \infty) = \mathbb{E}_{C \sim p(C)}\left[p_\theta(|Y| = \infty \mid C)\right] \ge p(\tilde{C})\, p_\theta(|Y| = \infty \mid \tilde{C}) > 0,$$

which contradicts the consistency of the model $p_\theta$. ∎

Lemma 3.2.

A recurrent language model is consistent if $\|h_t\|_p$ is uniformly bounded for some $p \ge 1$.

Proof.

Let $B$ be an upper bound such that $\|h_t\|_p \le B$ for all $t$. Let $q$ be the conjugate of $p$ satisfying $\tfrac{1}{p} + \tfrac{1}{q} = 1$. Then we have from Hölder's inequality, for all $t$ and $v \in V$,

$$\left|u_v^\top h_t + c_v\right| \le \|u_v\|_q \|h_t\|_p + |c_v| \le B \|u_v\|_q + |c_v| \le B',$$

where $B' = B \max_{v' \in V} \|u_{v'}\|_q + \max_{v' \in V} |c_{v'}|$. Note that

$$p_\theta(\langle\text{eos}\rangle \mid y_{<t}, C) = \frac{\exp\left(u_{\langle\text{eos}\rangle}^\top h_{t-1} + c_{\langle\text{eos}\rangle}\right)}{\sum_{v' \in V} \exp\left(u_{v'}^\top h_{t-1} + c_{v'}\right)} \ge \frac{e^{-B'}}{|V|\, e^{B'}} = \xi,$$

where $\xi = e^{-2B'} / |V| > 0$. For a given $t$ and context $C$,

$$p_\theta(|Y| > t \mid C) = \sum_{\substack{y_{\le t}\,:\, \langle\text{eos}\rangle \notin y_{\le t}}} \;\prod_{t'=1}^{t} p_\theta(y_{t'} \mid y_{<t'}, C) \le (1 - \xi)^t,$$

and it follows that $p_\theta(|Y| > t \mid C) \le (1 - \xi)^t$ for the strictly positive constant $\xi$. Then

$$p_\theta(|Y| = \infty) = \mathbb{E}_C\left[\lim_{t \to \infty} p_\theta(|Y| > t \mid C)\right] \le \mathbb{E}_C\left[\lim_{t \to \infty} (1 - \xi)^t\right] = 0,$$

and hence $p_\theta$ is consistent. ∎

Lemma 3.3.

A consistent decoding algorithm with respect to a consistent recurrent language model decodes only probable sequences. That is, if $q_{\mathcal{F}}(Y \mid C) > 0$, then $p_\theta(Y \mid C) > 0$ for any probable context $C$.

Proof.

Suppose there exists a sequence $\tilde{Y}$ decoded by $\mathcal{F}$ and a probable context $\tilde{C}$ such that $q_{\mathcal{F}}(\tilde{Y} \mid \tilde{C}) > 0$ but $p_\theta(\tilde{Y} \mid \tilde{C}) = 0$. By Remark 2.1, the sequence $\tilde{Y}$ must be of infinite length, and thus $q_{\mathcal{F}}(|Y| = \infty \mid \tilde{C}) > 0$, which contradicts the consistency of $\mathcal{F}$ by Lemma 3.1. ∎

Appendix B Consistency of STRLM

Theorem 4.3.

Beam search with width $k$, $k \ge 1$, is consistent with respect to any STRLM.

Proof.

Let $S_\rho$ denote the size-$k$ set of sequences kept by beam search that start with a prefix $\rho$.

Take $t_{1/2}$ as in the proof of Theorem 4.2. Suppose that there exists at least one prefix $\rho \in P_t^{\text{top}}$ with $t > t_{1/2}$ which does not end with $\langle\text{eos}\rangle$.

We first want to show that $\rho$ induces at most $k$ more steps in beam search with width $k$, that is, that every sequence in $S_\rho$ has length at most $t + k$.

We know from the proof of Theorem 4.2 that an STRLM satisfies: for any context $C$ and $t' > t_{1/2}$,

$$p_\theta(\langle\text{eos}\rangle \mid y_{<t'}, C) > \frac{1}{2}.$$

For any subsequence $\rho' = \rho \circ (v_1, \ldots, v_j)$ with $v_1, \ldots, v_j \neq \langle\text{eos}\rangle$ and $j \ge 1$,

$$p_\theta(\rho' \mid C) \le p_\theta(\rho \mid C) \prod_{i=1}^{j} \left(1 - p_\theta(\langle\text{eos}\rangle \mid \rho \circ v_{<i}, C)\right) < \frac{1}{2}\, p_\theta(\rho \mid C) < p_\theta(\rho \circ \langle\text{eos}\rangle \mid C).$$

Thus, $\rho \circ \langle\text{eos}\rangle$ is the most probable sequence among sequences starting with the prefix $\rho$, and it follows that $\rho \circ \langle\text{eos}\rangle \in S_\rho$.

Thus, in $P_{t+1}^{\text{top}}$, at least one prefix ending with $\langle\text{eos}\rangle$ appears, namely the highest-scoring expansion, which is of the form $\rho \circ \langle\text{eos}\rangle$ for some $\rho \in P_t^{\text{top}}$. By the same argument, at each subsequent step at least one sequence ending with $\langle\text{eos}\rangle$ is added to $S$, and therefore by time step $t_{1/2} + k$, $k$ sequences ending with $\langle\text{eos}\rangle$ are in $S$.

Note that the result set $S$ (Definition 2.10) satisfies $|S| = k$ when beam search ends. Since each context $C$ thus induces only sequences of length at most $t_{1/2} + k + 1$, we have

$$q_{\mathcal{F}}(|Y| = \infty \mid C) = 0.$$

Taking the expectation over $C$ yields the consistency of the model $p_\theta$ paired with beam search. ∎

Appendix C Additional Details and Results

Prefix In his ”  with Francis  ” series of
Baseline the song , ” The  ” , ” The  ” , ” The  ” , ” The  ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Avalanche ” , ” The Tallest Man ” , ” The Avalanche ” , ” The Tallest Man ” , ” The Avalanche ” …
STRLM the same name , the song was released in the United States on November 18 , 2003 .
Prefix Tú ” as one of the most important songs on
Baseline the album , and ” The  ” , ” The  ” , ” The  ” , ” The  ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Tallest Man ” , ” The Avalanche ” , ” The Tallest Man ” , ” The Avalanche ” , ” The Tallest Man ” , ” The Avalanche ” , ” The Tallest Man ” …
STRLM the album , the song was released on the Billboard Hot 100 , and was certified gold by the Recording Industry Association of America ( RIAA ) .
Prefix Pound , who believed that the ” Hellenic hardness ”
Baseline of the poem was ”  ” , and ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ”  ” , ” ”  ” , ”  ” , ” ”  ” , ” ”  ” , ” ”  ” , ” ”  ” , ” ”  ” , …
STRLM was ” a  of the  of the  ” .
Prefix Due to industrial waste and automobiles , Manila suffers from
Baseline the   ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,