Neural sequential text generation models have become the standard in NLP applications such as language modelling, NLG, and machine translation. When enough data is available, these models can be trained end-to-end with impressive results. Generally, inference and training proceed in an auto-regressive manner, namely, the next decoded symbol is predicted by a locally normalized conditional distribution (the “softmax”). This has several advantages: (i) the probability of the sequence is already normalized, by the chain rule over local decisions; (ii) max-likelihood (ML) training is easy, because the negative log-likelihood of the full sequence is simply the sum of local CE (cross-entropy) losses; (iii) exact sampling of full sequences from the model distribution is directly obtained through a sequence of local sampling decisions.
However, these autoregressive models (AMs) tend to suffer from a form of myopia. They have difficulty accounting for global properties of the predicted sequences, from overlooking certain aspects of the semantic input in NLG to duplicating linguistic material or producing “hallucinations” in MT, and generally through being unable to account for long-distance consistency requirements that would be obvious for a human reader. (To borrow terminology from Reinforcement Learning (RL) Sutton and Barto (2018), such NLP models work by “imitation learning”, without any representation of “objectives” to be realized.) While this defect can be mitigated in the presence of large training sets, it can become serious when this condition is not met.
The main contributions of this paper are as follows.
First, we propose a hybrid seq2seq formalization, the Global Autoregressive Model (GAM), that combines a local autoregressive component with a global log-linear component, allowing the use of a priori features to compensate for the lack of training data. GAMs are related both to the class of Energy-Based Models (EBM) and to that of Exponential Families (EF), and inherit some important properties from those: an intimate relationship between training and sampling (EBM); the identity of empirical and model expectations at maximum likelihood; concavity of the log-likelihood (EF).
Second, we propose a training procedure in two steps. In the first step, we train a GAM by maximum likelihood; the resulting model is unnormalized, however, and therefore unsuitable for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation.
Third, we demonstrate the ability of GAMs to be data-efficient, namely, to exploit the original data better than a standard autoregressive model. In order to clarify the core techniques and issues, we design a simple class of synthetic data, consisting of random binary strings containing “motifs” (specific substrings) that we can manipulate in different ways. We show that, in limited data conditions, GAMs are able to exploit the features to obtain final autoregressive models that perform better than the original ones.
The remainder of the paper is structured as follows. In Section 2, we provide some background about autoregressive models, energy-based models, and log-linear models. In Section 3, we introduce GAMs. In Section 4, we describe our focus on synthetic data. In Section 5, we explain our training procedure. In Section 6, we comment on related work. In Section 7, we describe our experiments. In Section 8, we provide an analysis of our results. We conclude with a discussion in Section 9. Note that some additional explanations and experiments are provided in the Supplementary Material, indicated by [SM].
Autoregressive models (AM)
These are currently the standard for neural seq2seq processing, with such representatives as RNN/LSTMs Hochreiter and Schmidhuber (1997); Sutskever et al. (2014), ConvS2S Gehring et al. (2017), and Transformer Vaswani et al. (2017). Formally, they are defined through a distribution $r_\eta(x|C)$, where $C$ is an input (aka Context, e.g. a source sentence in Machine Translation (MT)), and $x$ is a target sequence (e.g. a target sentence in MT). We have:
$$r_\eta(x|C) \doteq \prod_i s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C),$$
where each $s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C)$ is a normalized conditional probability over the next symbol of the sequence, computed by a neural network (NN) with parameters $\eta$. The local normalization of the incremental probabilities implies the overall normalization of the distribution $r_\eta(x|C)$, and consequently, the possibility of directly sampling from it and evaluating the likelihood of training sequences.
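The following is a minimal sketch of these two consequences of local normalization, using a toy hand-written conditional in place of a neural network. The function names and the transition rule are assumptions made for illustration only.

```python
import itertools
import math
import random

def next_symbol_probs(prefix):
    """Toy locally normalized conditional s(x_i | x_1..x_{i-1}) over {0, 1}.

    Hypothetical rule (not from the paper): a 1 is more likely right after a 1.
    """
    p_one = 0.8 if prefix and prefix[-1] == 1 else 0.3
    return {0: 1.0 - p_one, 1: p_one}

def log_prob(x):
    """log r(x): sum of the local log-probabilities (chain rule)."""
    return sum(math.log(next_symbol_probs(x[:i])[x[i]]) for i in range(len(x)))

def sample(length, rng):
    """Exact sampling from r through a sequence of local sampling decisions."""
    x = []
    for _ in range(length):
        p_one = next_symbol_probs(x)[1]
        x.append(1 if rng.random() < p_one else 0)
    return tuple(x)

# Local normalization implies global normalization over all strings of length n.
n = 5
total = sum(math.exp(log_prob(x)) for x in itertools.product((0, 1), repeat=n))
drawn = sample(n, random.Random(0))
```

Here `total` sums the probabilities of all $2^5$ strings and equals 1 up to rounding, without any explicit partition function being computed.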
Energy-Based Models (EBM)
EBMs are a generic class of models, characterized by an energy function $U_\eta(x)$ computed by a NN parametrized by $\eta$ LeCun et al. (2006). Equivalently, they can be seen as directly defining a potential (an unnormalized probability distribution) $P_\eta(x) = e^{-U_\eta(x)}$, and indirectly the normalized distribution $p_\eta(x) = \frac{1}{Z_\eta} P_\eta(x)$, with $Z_\eta = \sum_x P_\eta(x)$. A fundamental property of these models is that, for max-likelihood training, the SGD updates can be computed through the formula (see LeCun et al., 2006, p. 15, and [SM] for a derivation):
$$\nabla_\eta \log p_\eta(x) = \nabla_\eta \log P_\eta(x) - E_{x' \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x') \tag{1}$$
which, in principle, reduces the problem of training with unnormalized potentials to the problem of sampling from them.
Log-Linear Models / Exponential Families
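Log-linear models Jebara (2013) are the conditional version of exponential families Jordan (2010). In the discrete case, their general form is:
$$p_\lambda(x|C) = \frac{1}{Z_\lambda(C)}\, \mu(x;C)\, e^{\langle \lambda(C), \phi(x;C) \rangle},$$
with $Z_\lambda(C) = \sum_x \mu(x;C)\, e^{\langle \lambda(C), \phi(x;C) \rangle}$. Here $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by scalar product with a real vector of weights $\lambda(C)$ of the same dimension; $\mu(x;C)$ is an arbitrary “base measure”, which is fixed. These models, which allow prior knowledge to be introduced through features and have nice formal properties (see below), were mainstream in NLP before the revival of neural approaches.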
3 Proposal: GAMs
We now define Global Autoregressive Models (GAMs). These are hybrid seq2seq models that exploit both local autoregressive properties and global properties of the full target sequence. A GAM is an unnormalized distribution $P_\eta(x|C)$ over sequences $x$, parametrized by a vector $\eta = (\eta_1, \eta_2)$:
$$P_\eta(x|C) \doteq r_{\eta_1}(x|C) \cdot e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}.$$
Here $r_{\eta_1}(x|C)$ is an autoregressive seq2seq model for generating $x$ from input $C$, parametrized by $\eta_1$; $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by a scalar product with a real vector $\lambda_{\eta_2}(C)$ of the same dimension, computed over the input $C$ by a network parametrized by $\eta_2$. The normalized distribution associated with the GAM is $p_\eta(x|C) = P_\eta(x|C) / Z_\eta(C)$, where $Z_\eta(C) = \sum_x P_\eta(x|C)$.
GAMs appear promising for the following reasons:
Features provide a simple way to draw attention of the model to potentially useful aspects that may be difficult for the AM component to discover on its own from limited data.
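GAMs are an instance of EBMs, where the potential $P_\eta(x|C)$ is the product of an AM potential $r_{\eta_1}(x|C)$ with a “log-linear” potential $e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}$. Here the gradient relative to the log-linear part takes the especially simple form:
$$\nabla_{\eta_2} \log P_\eta(x|C) = \phi(x;C) \cdot \nabla_{\eta_2} \lambda_{\eta_2}(C).$$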
Log-linear models, on their own, while great at expressing prior knowledge, are not as good as AM models at discovering unforeseen regularities in the data. Also, they are typically problematic to train from a log-likelihood perspective, because sampling from them is often infeasible. GAMs address the first issue through the autoregressive component $r_{\eta_1}$, and alleviate the second issue by permitting the use of $r_{\eta_1}$ as a powerful “proposal” (aka “surrogate”) distribution in importance sampling and related approaches, as we will see.
4 Experimental focus
While the motivation for GAMs ultimately lies in practical NLP applications such as those evoked earlier, in this paper we aim to understand some of their capabilities and training techniques in simple and controllable conditions. We focus on the unconditional (i.e. language modelling) case, and on synthetic data. Our setup is as follows:
We consider an underlying process $p_{\mathrm{true}}$ that generates binary sequences according to a well-defined and flexible mechanism. In this paper we use PFSAs (Probabilistic Finite State Automata) to impose the presence or absence of sub-strings (“motifs”) anywhere in the generated data, exploiting the intersection properties of automata.
Due to the dynamic programming properties of PFSAs, it is possible to compute the true entropy of the process (see [SM]), as well as other quantities (partition functions, mean sequence length); it is also possible to generate training ($D$), validation ($V$), and test data ($T$) in arbitrary quantities.
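We employ an unconditional GAM of the simple form:
$$P_\lambda(x) \doteq r(x) \cdot e^{\langle \lambda, \phi(x) \rangle},$$
where the autoregressive model $r$ is trained on $D$ and then kept fixed, and where the weight vector $\lambda$ is then trained on top of $r$, also on $D$.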
It should be noted that with $r$ fixed in this way, this formulation exactly corresponds to the definition of an exponential family Jordan (2010), with $r$ as base measure. In such models, we have two important properties: (i) the log-likelihood of the data is concave relative to the parameters $\lambda$, and thus a local maximum is also global; (ii) the max-likelihood value $\lambda^*$ has the property that the model expectation $E_{x \sim p_{\lambda^*}(\cdot)}\, \phi(x)$ is equal to the empirical expectation $|D|^{-1} \sum_{x \in D} \phi(x)$ (“Moment Matching” property of exponential families).
We are especially interested in the relative data-efficiency of the GAM $P_\lambda$ compared to the AM $r$: namely the ability of the GAM to recover a lower-perplexity approximation of $p_{\mathrm{true}}$ than $r$, especially in small training-set conditions.
5 Training procedure
We consider a two-stage training procedure (see Fig. 1).
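Training-1 consists in training the model $P_\lambda$ on $D$. This is done by first training $r$ on $D$ in the standard way (by cross-entropy) and then by training $\lambda$ by SGD with the formula (adapted from (3)):
$$\nabla_\lambda \log p_\lambda(x) = \phi(x) - E_{x' \sim p_\lambda(\cdot)}\, \phi(x'). \tag{5}$$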
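As a minimal sketch of this update, the toy example below fits a single weight on a space small enough that the model moment can be computed exactly by enumeration, so no sampling is needed. The string length, the motif, the uniform proposal, the target moment of 0.9, and the step size are all assumptions chosen for illustration; they are not the paper's settings.

```python
import itertools
import math

N, MOTIF = 6, "11"  # toy string length and motif (illustrative values)
STRINGS = ["".join(bits) for bits in itertools.product("01", repeat=N)]

def phi(x):
    """Single binary feature: 1.0 if the motif occurs in x, else 0.0."""
    return 1.0 if MOTIF in x else 0.0

def model_moment(lam):
    """Exact E_{x ~ p_lambda} phi(x), with p_lambda(x) ∝ r(x) exp(lam * phi(x)).

    r is taken to be uniform (white noise), so it cancels in the ratio.
    """
    weights = [math.exp(lam * phi(x)) for x in STRINGS]
    return sum(w * phi(x) for w, x in zip(weights, STRINGS)) / sum(weights)

def fit_lambda(empirical_moment, lr=2.0, steps=500):
    """Gradient ascent on the log-likelihood: lam += lr * (empirical - model)."""
    lam = 0.0
    for _ in range(steps):
        lam += lr * (empirical_moment - model_moment(lam))
    return lam

empirical = 0.9  # assumed fraction of training strings containing the motif
lam_star = fit_lambda(empirical)
```

At convergence `model_moment(lam_star)` coincides with `empirical`, which is the moment-matching property in this one-feature case. In realistic settings the enumeration in `model_moment` is unavailable and has to be replaced by a Monte-Carlo estimate.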
The main difficulty then consists in computing an estimate of the model moments $E_{x \sim p_\lambda(\cdot)}\, \phi(x)$. In our experiments, we compare two Monte-Carlo approaches Robert and Casella (2005) for addressing this problem: (i) Rejection Sampling (rs), using $r$ as the proposal distribution and (ii) Self-Normalized Importance Sampling (snis) (Y. Bengio and J. S. Senecal, 2008), also using $r$ as the proposal.
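Rejection sampling is performed as follows. We use $r(x)$ as the proposal, and $P_\lambda(x) = r(x)\, e^{\langle \lambda, \phi(x) \rangle}$ as the unnormalized target distribution; for any specific $\lambda$, because our features are bounded between $0$ and $1$, we can easily upper-bound the ratio $P_\lambda(x) / r(x) = e^{\langle \lambda, \phi(x) \rangle}$ by a number $\beta$; we then sample $x$ from $r$, compute the ratio $\rho(x) = P_\lambda(x) / (\beta\, r(x)) \le 1$, and accept $x$ with probability $\rho(x)$. The accepted samples are unbiased samples from $p_\lambda(x)$ and can be used to estimate model moments.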
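The procedure can be sketched as follows on a toy problem. The white-noise proposal, the single motif feature, the weight value, and the particular bound (the exponential of the sum of the positive weights, which is valid whenever features lie in $[0, 1]$) are assumptions made for this illustration.

```python
import itertools
import math
import random

N, MOTIF = 6, "11"  # toy string length and motif (illustrative values)

def phi(x):
    """Feature vector with one binary feature: does the motif occur in x?"""
    return [1.0 if MOTIF in x else 0.0]

def sample_r(rng):
    """Proposal r: white-noise binary strings of length N."""
    return "".join(rng.choice("01") for _ in range(N))

def rejection_sample(lam, n_proposals, rng):
    """Draw from p_lambda(x) ∝ r(x) exp(<lam, phi(x)>) using r as the proposal."""
    # Features lie in [0, 1], so exp(<lam, phi>) <= beta = exp(sum of positive weights).
    log_beta = sum(max(weight, 0.0) for weight in lam)
    accepted = []
    for _ in range(n_proposals):
        x = sample_r(rng)
        log_rho = sum(w * f for w, f in zip(lam, phi(x))) - log_beta
        if rng.random() < math.exp(log_rho):  # accept with probability rho <= 1
            accepted.append(x)
    return accepted

def exact_moment(lam):
    """Exact E_{p_lambda} phi by enumeration (only feasible for tiny N)."""
    xs = ["".join(b) for b in itertools.product("01", repeat=N)]
    ws = [math.exp(sum(w * f for w, f in zip(lam, phi(x)))) for x in xs]
    return sum(wi * phi(x)[0] for wi, x in zip(ws, xs)) / sum(ws)

rng = random.Random(0)
samples = rejection_sample([2.0], 20000, rng)
estimate = sum(phi(x)[0] for x in samples) / len(samples)
exact = exact_moment([2.0])
```

With a positive weight on the motif feature, strings lacking the motif are accepted only with probability $e^{-2}$, so the empirical motif frequency among accepted samples approaches the exact model moment.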
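Snis also uses the proposal distribution $r$, but does not require an upper bound, and is directly oriented towards the computation of expectations. In this case, we sample a number of points $x_1, \ldots, x_N$ from $r$, compute “importance ratios” $w(x_i) = P_\lambda(x_i) / r(x_i)$, and estimate $E_{x \sim p_\lambda(\cdot)}\, \phi(x)$ through $\hat{E} = \sum_i w(x_i)\, \phi(x_i) / \sum_i w(x_i)$. The estimate is biased for a given $N$, but consistent (that is, it converges to the true $E$ for $N \to \infty$).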
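A corresponding sketch of the self-normalized estimator, on the same kind of toy problem, is given below. As before, the white-noise proposal, the single motif feature, and the weight value are illustrative assumptions.

```python
import itertools
import math
import random

N, MOTIF = 6, "11"  # toy string length and motif (illustrative values)

def phi(x):
    """Feature vector with one binary feature: does the motif occur in x?"""
    return [1.0 if MOTIF in x else 0.0]

def sample_r(rng):
    """Proposal r: white-noise binary strings of length N."""
    return "".join(rng.choice("01") for _ in range(N))

def snis_moments(lam, n_samples, rng):
    """Self-normalized importance sampling estimate of E_{p_lambda} phi."""
    xs = [sample_r(rng) for _ in range(n_samples)]
    feats = [phi(x) for x in xs]
    # Importance ratio P_lambda(x) / r(x) = exp(<lam, phi(x)>): no bound needed.
    ratios = [math.exp(sum(w * f for w, f in zip(lam, fx))) for fx in feats]
    total = sum(ratios)
    return [sum(r_i * fx[k] for r_i, fx in zip(ratios, feats)) / total
            for k in range(len(lam))]

def exact_moment(lam):
    """Exact E_{p_lambda} phi by enumeration (only feasible for tiny N)."""
    xs = ["".join(b) for b in itertools.product("01", repeat=N)]
    ws = [math.exp(sum(w * f for w, f in zip(lam, phi(x)))) for x in xs]
    return sum(wi * phi(x)[0] for wi, x in zip(ws, xs)) / sum(ws)

rng = random.Random(0)
estimate = snis_moments([2.0], 20000, rng)
exact = exact_moment([2.0])
```

Unlike rejection sampling, every proposal contributes to the estimate, which is why this regime can be faster when the proposal is weak.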
While Training-1 results in a well-defined model $P_\lambda(x)$, which may fit the data closely in principle, we should not conclude that $P_\lambda(x)$ is convenient to use for inference, namely, in language modeling, for efficiently sampling from its normalized version $p_\lambda(x)$; just as seriously, because of the partition factor $Z_\lambda$, it is also not obvious how to evaluate the perplexity of $P_\lambda(x)$ on test data. In order to do both, one approach consists in using a distillation technique Hinton et al. (2015), where, during training, one expends generous time towards producing a set of samples from $p_\lambda$, for instance by Monte-Carlo (e.g. Rejection Sampling) techniques, and where this set (which may be arbitrarily larger than the original $D$) is in turn used to train a new autoregressive model $\pi_\theta(x)$, which can then be used directly for sampling or for computing data likelihood. This is the approach that we use in our current experiments, again using the original $r(x)$ as a proposal distribution.
In the case of small $|D|$, the proposal distribution $r$ is weak and as a result the distillation process, based on rejection sampling, can be slow. To address this issue, we also consider a cyclical training regime that updates the proposal distribution after distilling each batch of samples, with the intention of reducing the rejection rate. Once the process of distillation is finished, we use the aggregated samples to train the final $\pi_\theta$. The two-stage training procedure is a variant of the cyclical one, with a fixed proposal (see Algorithm 1 for more details).
6 Related Work
Hoang et al. (2018), working in an NMT context, have a similar motivation to ours. They first train an autoregressive seq2seq model (Transformer in their case) on bilingual data, then attempt to control global properties of the generated sequences through the introduction of a priori features. They interpolate the training of the autoregressive model with training of a Moment Matching component which tries to equate the feature expectations of the model with those of the data. Contrary to our approach, they do not directly try to maximize likelihood in an integrated model.
Andor et al. (2016) consider transition-based neural networks, and contrast local to global normalization of decision sequences, showing how the global approach avoids the label bias problem in such tasks as tagging or parsing. They focus on inference as maximization, e.g. finding the best sequence of tags for a sequence of words, and consistent with that objective, their training procedure exploits a beam-search approximation. By contrast, our focus is on inference as sampling in a language modelling perspective, on the complementarity between auto-regressive models and log-linear models, and on the relations between training and sampling in energy-based models.
We conduct a series of experiments on synthetic data to illustrate our approach.
To assess the impact of GAMs, we focus on distributions that are likely to be well approximated by the AM in the presence of large data. The first class of distributions is obtained through a PFSA that filters binary strings of fixed length $n$, 0’s and 1’s being equally probable (white-noise strings), through the condition that they contain a specific substring (“motif”) anywhere; here the relative frequency of sequences containing the motif among all sequences varies from to .
We also consider mixtures of two PFSAs (motif/anti-motif): the first (with mixture prob. ) produces white-noise strings containing the motif and the second (with mixture prob. ) strings excluding the motif.
From these processes we produce a training set , of size varying between and , a validation set of size (but never smaller than or bigger than ) and a test set of fixed size .
In a real world scenario, prior knowledge about the true process will involve, along with predictive features, a number of noisy and useless features. By training the parameters to match the empirical moments, the GAM will learn to distinguish between these types. In order to simulate this situation we consider feature vectors over our artificial data that involve both types.
With $x$ the full string and $m$ the fixed motif used in constructing the training data, we consider variations among the 7 binary features in the set $F$:
$$F = \{m, m_{+0}, m_{/2}, d_0, d_1, d_2, d_3\},$$
where the feature $m$ indicates whether the motif $m$ appears in $x$, $m_{+0}$ whether the motif followed by a zero (“super-motif”) appears in $x$, and $m_{/2}$ whether an initial section of the motif (“sub-motif”, roughly half the size of $m$) appears in $x$. These three features are chosen because they have some correlation with the process for generating the training data. By contrast, the four remaining features are “distractors”: $d_0$ indicates whether $x$ begins with a fixed bit, and $d_1$ (resp. $d_2$, $d_3$) whether a certain random, but fixed, string of similar length to $m$ (resp. of larger length, of smaller length) appears in $x$. We test different configurations of these features for training $\lambda$, and document the use/non-use of features with a bit-vector of length 7, for instance means that all features are exploited, apart from .
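A feature extractor of this kind can be sketched as below. The polarity of each feature (1 when the pattern is present), the choice of `"1"` as the initial bit for the first distractor, the example motif, the distractor strings, and the example mask are all assumptions introduced for illustration, since the text does not fix them.

```python
def make_feature_fn(motif, distractors, first_bit="1"):
    """Build phi(x) returning the 7 binary features as a list of 0/1 ints.

    distractors: three fixed strings (similar, larger, smaller length than motif).
    Polarity (1 = pattern present) and first_bit are illustrative assumptions.
    """
    super_motif = motif + "0"                        # motif followed by a zero
    sub_motif = motif[: max(1, len(motif) // 2)]     # initial half of the motif

    def phi(x):
        return [
            int(motif in x),
            int(super_motif in x),
            int(sub_motif in x),
            int(x.startswith(first_bit)),
            *(int(d in x) for d in distractors),
        ]

    return phi

def apply_mask(features, mask):
    """Keep only the features whose position is marked '1' in the bit-vector."""
    return [f for f, keep in zip(features, mask) if keep == "1"]

# Hypothetical motif and distractor strings, chosen only for the example.
phi = make_feature_fn("1011", ["0110", "001100", "01"])
full = phi("0010110")
selected = apply_mask(full, "1001111")
```

The mask plays the role of the bit-vector that documents which features are used in a given configuration.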
The AMs are implemented in PyTorch Paszke et al. (2017) using a 2-layer LSTM Hochreiter and Schmidhuber (1997) with hidden-state size 200. The input is presented through one-hot encodings over the vocabulary. These LSTMs are optimized with Adam Kingma and Ba (2014), with learning rate , and with early stopping (patience ) over a validation set.
Training: Two-Stage and Cyclical
The implementation is described in (Algorithm 1). Here we provide some additional details.
Training-1 For training $P_\lambda$ we test two regimes in Eq. 5, namely rs and snis; in both cases, we first train $r$ on whatever $D$ is available, and use it as the proposal distribution. During rs, we compute the model’s expectation over 10 accepted samples, update the $\lambda$’s according to (5), and iterate. During snis, we keep a buffer of the last samples from $r$ to compute the weighted average of the feature moments. For the training of $\lambda$’s, we use a basic SGD optimization with learning rate . To assess the quality of $P_\lambda$ for early stopping during training, we use the distance between the empirical and model moments:
Training-2 and Cyclical Training When distilling from in Training-2, we use a single proposal , and systematically produce a distilled dataset of size , which corresponds to the highest value of among those considered for training . In Cyclical Training, the distillation process is performed in several stages, with an evolving for improving the rejection rate.
We conduct experiments to compare the cross-entropy (measured in nats) of the initial AM $r$ relative to the test set $T$ with that of the final AM $\pi_\theta$, also relative to $T$; we vary the size $|D|$ of the training set, the regimes (tReg) for Training-1 (rs or snis), the features employed, and the rarity of the motifs. Figure 2 depicts the resulting curves at the end of the two-stage training (plain lines).
Here we show only a few experiments (a more extensive set is provided in the [SM]).
We observe that, for a small dataset size , there is a big gap between the CE of and the CE of . As increases, these cross-entropies become closer to one another, but a large gap persists for .
We note that the presence of the “fully-predictive” motif feature $m$ results in a final AM $\pi_\theta$ that has CE very close to the theoretical entropy, even in low $|D|$ regimes, where the initial AM $r$ on its own is very weak. (The CE of a model relative to the true underlying process (approximated by the test set $T$) can never be below the entropy of this process, due to the KL-divergence being non-negative.) Thus, not only is the distilled AM much better than the initial AM, but this is an indication that $P_\lambda$ itself (for which the cross-entropy is more difficult to compute exactly) is a good approximation of the true process.
By contrast, if the motif feature $m$ is absent, then, while the final AM $\pi_\theta$ is still better than the initial AM $r$ in low $|D|$ regimes, it cannot reach the theoretical entropy in such regimes, because features such as the super-motif $m_{+0}$ and the sub-motif $m_{/2}$ can only partially model the data. With large $|D|$, on the other hand, $r$ on its own does a good job at predicting the data, and $P_\lambda$ adds little on top of its $r$ component.
Finally, we note that the two regimes for training $\lambda$, rs and snis, result in final AMs $\pi_\theta$ with similar accuracies.
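The lower bound on cross-entropy invoked in these observations follows from a standard decomposition, sketched here with $q$ standing for an arbitrary model distribution (a symbol introduced only for this illustration) and $p_{\mathrm{true}}$ for the true process:

```latex
\begin{align*}
\mathrm{CE}(p_{\mathrm{true}}, q)
  &= -\sum_x p_{\mathrm{true}}(x) \log q(x) \\
  &= -\sum_x p_{\mathrm{true}}(x) \log p_{\mathrm{true}}(x)
     + \sum_x p_{\mathrm{true}}(x) \log \frac{p_{\mathrm{true}}(x)}{q(x)} \\
  &= H(p_{\mathrm{true}}) + D_{\mathrm{KL}}(p_{\mathrm{true}} \,\|\, q)
  \;\ge\; H(p_{\mathrm{true}}).
\end{align*}
```

Equality holds only when $q$ coincides with $p_{\mathrm{true}}$, so a cross-entropy close to the theoretical entropy indicates a small KL-divergence from the true process.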
In order to assess the predictive properties of obtained AMs, we also compare the frequency of motifs in strings sampled from and from ( samples in total).
From Figure 2 we see that when we vary $|D|$, the frequency of motifs (dashed lines) is aligned with the CE performance. Namely, the final AM $\pi_\theta$ produces a higher fraction of strings with motif than the initial AM $r$ when $|D|$ is small ().
Detailed illustration To provide more intuition, we give an illustration from one experiment in Table 2.
| | True process | Initial AM $r$ | Final AM $\pi_\theta$ |
|---|---|---|---|
| CEs | 0.45 | 0.56 | 0.47 |
| motif freqs | 1.0 | 0.045 | 0.959 |
Mixture vs pure
In our experiments, the strings in (motif-anti-motif) contain a motif with . However, if not all of the samples in contain the motif, then the motif feature itself is not fully predictive. It can be seen in panel (d) of Figure 2 that the achieved with trained on mixture has consistent behaviour with the results obtained on the pure of panels (a,b,c).
Regimes in Training-1
For training the GAM we consider two methods, rs and snis. As described in the previous sections, their impact on $P_\lambda$ leads to final AMs $\pi_\theta$ that have similar CEs and motif frequencies. Despite such resemblance in terms of accuracy, these two methods differ in terms of speed (see Table 1). Namely, when $r$ is close to white noise due to small $|D|$, then for the rare events rs rejects most samples not containing the motif, due to the effect of the log-linear term and the negative value of the component of $\lambda$ corresponding to the motif feature, while snis is able to exploit all samples. Despite being faster than rs, snis remains competitive in terms of CE.
Cyclical vs two-stage training
We conducted a small experiment to compare the performance of cyclical training with two-stage training in terms of speed and accuracy for a fixed motif and features (see [SM] Table 4, Figure 3). We observed that the CEs of the obtained final AMs $\pi_\theta$ were about the same for different values of $|D|$ and Training-1 regimes. On the other hand, there was no systematic improvement in the training speed of one method over the other.
The basic idea behind GAMs is very simple. First, we extend the representational power of the autoregressive model $r$ by multiplying it by a log-linear potential, obtaining an unnormalized model $P_\lambda$ (Training-1). Then we try to “project” this extended representation again to an autoregressive model $\pi_\theta$ (Training-2). Our results showed that, under favorable prior knowledge conditions, the final $\pi_\theta$ was able to perform as well, when trained on small data, as the standard $r$, trained on large data. During our experiments, we noticed that training $P_\lambda$ was actually easier than training $\pi_\theta$ from it. Intuitively, the small number of parameters to be fitted in the log-linear model requires less work and fewer data than the training of an autoregressive component. (At a deeper level, there are extreme situations where the $P_\lambda$ obtained at the end of Training-1 can perfectly represent the true process, but where no autoregressive model can actually fit $P_\lambda$: one way to obtain such situations consists in generating binary strings that satisfy a certain cryptographic predicate, associated with a specific feature; the importance of this feature can be easily detected through Training-1, but an autoregressive model has no chance of generalizing from distilled or true data, even in large quantities.)
It is interesting to relate our study to certain aspects of Reinforcement Learning (RL).
First, consider Training-2. There, we have a “score” $P_\lambda(x)$ that we are trying to approximate through an autoregressive model $\pi_\theta(x)$, which is basically a sequential “policy”. The main difference with RL is that we are not trying to find a policy that maximizes the score (which would be a bad idea for language modelling, as it would tend to concentrate the mass on a few sequences), but one that approximates $P_\lambda$ in a distributional sense; our current distillation technique is only one way to approach this problem, but other techniques more in the spirit of RL are possible, a direction that we leave for future work.
Second, consider Training-1. Our approach, consisting in suggesting to the model a number of prior features, might look too easy and suspicious. But notice that in RL, one would typically directly provide to the model an externally defined reward, a very strong form of prior knowledge. Here, instead, we “only” indicate to the model which features it might attend to, and Training-1 then determines the “reward” through max-likelihood, a milder form of prior knowledge, more respectful of what the data has to say.
- Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally Normalized Transition-Based Neural Networks.
- Bellemare et al. (2017) Marc G. Bellemare, Will Dabney, and Rémi Munos. 2017. A Distributional Perspective on Reinforcement Learning. arXiv preprint arXiv:1707.06887.
- Carrasco (1997) Rafael C. Carrasco. 1997. Accurate computation of the relative entropy between stochastic regular grammars. Theoretical Informatics and Applications, 31:437–444.
- Cortes et al. (2008) Corinna Cortes, Mehryar Mohri, Ashish Rastogi, and Michael Riley. 2008. On the computation of the relative entropy of probabilistic automata. Int. J. Found. Comput. Sci., 19(1):219–242.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
- Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
- Hoang et al. (2018) Cong Duy Vu Hoang, Ioan Calapodescu, and Marc Dymetman. 2018. Moment Matching Training for Neural Machine Translation: A Preliminary Study.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Jebara (2013) Tony Jebara. 2013. Log-Linear Models, Logistic Regression and Conditional Random Fields.
- Jordan (2010) Michael I. Jordan. 2010. Chapter 8 The exponential family : Basics.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang. 2006. A Tutorial on Energy-Based Learning. Predicting Structured Data, pages 191–246.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- Robert and Casella (2005) Christian P. Robert and George Casella. 2005. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
- Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction, second edition. The MIT Press.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
- Y. Bengio and J. S. Senecal (2008) Y. Bengio and J. S. Senecal. 2008. Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model. IEEE Transactions on Neural Networks, 19(4):713–722.
Appendix A Supplementary Material
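A.1 SGD in Energy-based models
Formula (1) is fundamental for studying the SGD behavior of Energy-based models, and for convenience, we provide a derivation here.
If we define $Z_\eta = \sum_x P_\eta(x)$, we find that:
$$\nabla_\eta \log Z_\eta = \frac{1}{Z_\eta} \sum_x \nabla_\eta P_\eta(x) = \frac{1}{Z_\eta} \sum_x P_\eta(x)\, \nabla_\eta \log P_\eta(x) = \sum_x p_\eta(x)\, \nabla_\eta \log P_\eta(x) = E_{x \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x).$$
We then have:
$$\nabla_\eta \log p_\eta(x) = \nabla_\eta \log P_\eta(x) - \nabla_\eta \log Z_\eta = \nabla_\eta \log P_\eta(x) - E_{x' \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x').$$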
A.2 The relevance of finite state automata; connections and differences with Reinforcement Learning
The way our synthetic data is produced through FSA’s may look contrived, but there are good motivations for using automata in such a study as ours.
Consider the following problem: you are given some RNN that produces sequences $x$ over a vocabulary $\mathcal{V}$, with probabilities $p(x)$, but you would like to filter out sequences that do not contain a specific symbol $a$, while preserving the relative probabilities of sequences provided by the RNN: $p'(x) \propto p(x)$ if $x$ contains $a$, and $p'(x) = 0$ otherwise. There appears to be no obvious way to realize $p'$ through an RNN, apart from techniques similar to what we have been describing in our discussion of Training-2.
The situation is completely different with FSA’s. If you have a PFSA (Probabilistic FSA) generating sequences $x$, then you can intersect it with an automaton that accepts all sequences containing at least one $a$, and re-normalize the intersection through dynamic programming, obtaining a new PFSA that generates the filtered distribution. (This is exactly what we do to produce training data for our experiments, but using a binary sequence (motif $m$) instead of a single symbol $a$.) Such dynamic programming, with the capacity to anticipate properties that need to be satisfied on the global sequence, is unavailable in the RNN world.
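The effect of such an intersection followed by re-normalization can be sketched for the simplest case, white-noise binary strings of fixed length filtered by a motif. The sketch below does not build explicit automata; it tracks a motif-matching state and uses backward counts as the dynamic program. The function names, the motif, and the string length are assumptions for illustration.

```python
import random

def kmp_next(motif, k, bit):
    """Longest prefix of motif that is a suffix of motif[:k] + bit."""
    s = motif[:k] + bit
    for length in range(min(len(s), len(motif)), -1, -1):
        if s.endswith(motif[:length]):
            return length
    return 0

def step(motif, k, bit):
    """Matching-state transition; state len(motif) ('motif seen') is absorbing."""
    return k if k == len(motif) else kmp_next(motif, k, bit)

def completion_counts(motif, n):
    """counts[t][k]: number of ways to extend a length-t prefix in state k
    into a length-n string that contains the motif."""
    m = len(motif)
    counts = [[0] * (m + 1) for _ in range(n + 1)]
    counts[n][m] = 1  # only the 'motif seen' state is accepting at the end
    for t in range(n - 1, -1, -1):
        for k in range(m + 1):
            counts[t][k] = sum(counts[t + 1][step(motif, k, b)] for b in "01")
    return counts

def sample_with_motif(motif, n, rng, counts=None):
    """Sample uniformly among length-n binary strings containing the motif,
    one symbol at a time, with locally re-normalized probabilities."""
    counts = counts or completion_counts(motif, n)
    x, k = "", 0
    for t in range(n):
        p_zero = counts[t + 1][step(motif, k, "0")] / counts[t][k]
        bit = "0" if rng.random() < p_zero else "1"
        x += bit
        k = step(motif, k, bit)
    return x

counts = completion_counts("11", 6)
rng = random.Random(1)
samples = [sample_with_motif("11", 6, rng, counts) for _ in range(200)]
```

The backward counts let each local decision anticipate whether the motif can still be completed, which is the kind of look-ahead that a plain autoregressive sampler lacks. Here `counts[0][0] / 2**6` also gives the relative frequency of motif-containing strings among all strings.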
With RNNs, the situation is reminiscent of RL, with a reward $R(x)$ associated with having observed an $a$ during the production of the sequence. But a standard RL approach would mean that we would try to maximize $R(x)$, without taking into consideration the original $p(x)$ that we are filtering from. To be correct, we need to find a policy $\pi_\theta$ (similar to an RNN), that tries to approximate $p'$ in a distributional sense, not in a maximization sense (see (Bellemare et al., 2017) for related considerations). This is what we try to do in Training-2, using motifs as our main case-study, instead of a single symbol (which would not make sense for binary strings).
The advantages of using PFSAs in our study are multiple. They provide a well-understood comparison point to the more complex techniques that need to be deployed for autoregressive models. From an operational viewpoint, they also permit, through dynamic programming, to perform various calculations of interest for our study, such as sampling datasets of arbitrary size and computing exact entropy and partition functions that can serve as comparison points for the results obtained with GAMs. In the present paper, we only exploited PFSA’s in the context of motifs, but they provide a much larger class of models that could serve to expand our understanding of sequence-based energy based models.
A.2.1 Computing the Entropy of a PFSA
As mentioned earlier, one advantage of using weighted finite-state automata for generating synthetic data is that some important quantities, such as entropy, mean sequence length, or partition function can be computed by dynamic programming.
Here we only derive a simple iterative method for computing the entropy of a PFSA; the other computations are very similar. (For another technique, and for extensions to the computation of relative entropy, see Carrasco (1997).)
We consider a PFSA with transitions of the form $(q, l, q', w)$, where $q, q'$ are states, $l$ is the label of the transition from $q$ to $q'$ (in our case $l \in \{0, 1\}$), and $w$ is the probability of the transition. The fact that the automaton is probabilistic, instead of simply weighted, means that the sum of the $w$’s associated with transitions starting at $q$ is equal to $1$. We further assume that the automaton is deterministic, namely that $q$ and $l$ uniquely determine the next state $q'$. (The case of non-deterministic probabilistic automata appears much more difficult Cortes et al. (2008).)
The entropy of a state $q$ is defined as $H(q) \doteq -\sum_x p_q(x) \log p_q(x)$, where $x$ denotes a sequence of labels that, starting from $q$, ends in a final state of the automaton, for which $p_q(x)$ is computed in the obvious way. The entropy of the automaton as a whole is then defined as $H(q_s)$, where $q_s$ is the initial state of the automaton.
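Lemma. The entropies of states satisfy the fixpoint equation:
$$H(q) = \sum_{(q, l, q', w)} w \left( H(q') - \log w \right). \tag{7}$$
Proof. Let’s denote by $q_l$ the state obtained from $q$ by following label $l$, and by $w_l$ the probability of the corresponding transition. We have:
$$H(q) = -\sum_l \sum_x w_l\, p_{q_l}(x) \log\left( w_l\, p_{q_l}(x) \right) = -\sum_l w_l \log w_l \sum_x p_{q_l}(x) - \sum_l w_l \sum_x p_{q_l}(x) \log p_{q_l}(x) = \sum_l w_l \left( H(q_l) - \log w_l \right),$$
where the last step uses $\sum_x p_{q_l}(x) = 1$.
It is possible to show that the state entropies actually correspond to the least fixpoint of equation (7), and this allows a simple iterative algorithm for computing the state entropies: at time $t = 0$, for all states $q$, we define $H_0(q) = 0$, and then we iterate until convergence:
$$H_{t+1}(q) = \sum_{(q, l, q', w)} w \left( H_t(q') - \log w \right).$$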
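The iteration can be sketched in a few lines. The dictionary encoding of the automaton, the convention that final states have no outgoing transitions, the stopping tolerance, and the two toy automata are assumptions made for this illustration.

```python
import math

def pfsa_entropy(transitions, start, tol=1e-12, max_iter=10000):
    """Entropy (in nats) of a deterministic PFSA by fixpoint iteration.

    transitions: dict mapping each state to a list of (label, next_state, prob);
    final states map to an empty list (assumed encoding).
    """
    states = set(transitions)
    for arcs in transitions.values():
        states.update(nxt for _, nxt, _ in arcs)
    h = {q: 0.0 for q in states}  # H_0(q) = 0 for all states
    for _ in range(max_iter):
        new_h = {
            q: sum(w * (h[nxt] - math.log(w))
                   for _, nxt, w in transitions.get(q, []) if w > 0)
            for q in states
        }
        delta = max(abs(new_h[q] - h[q]) for q in states)
        h = new_h
        if delta < tol:
            break
    return h[start]

# White-noise strings of length 3: states 0..3, state 3 is final.
white_noise = {t: [("0", t + 1, 0.5), ("1", t + 1, 0.5)] for t in range(3)}
white_noise[3] = []
h_white = pfsa_entropy(white_noise, 0)

# Automaton with a loop: emit 0 and stay, or emit 1 and stop, each with prob. 0.5.
looping = {"A": [("0", "A", 0.5), ("1", "F", 0.5)], "F": []}
h_loop = pfsa_entropy(looping, "A")
```

For the acyclic automaton the iteration becomes exact after as many sweeps as the string length; for the looping one it converges geometrically, consistent with the least-fixpoint characterization.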
A.3 Additional Experiments and Results
(See next pages)