Consider a situation where one has samples from a true distribution $p$ over a set $\mathcal{X}$ and one wishes to learn to generate similar samples, such as learning to generate English sentences from a large English text corpus. One seeks an approximation $q$ of $p$ which is “close” in some sense and from which samples can efficiently be generated. A common approach to fit these models is Maximum Likelihood Estimation (MLE), which, given a training set from $p$ and a parametrized distribution $q_\theta$, seeks parameters $\theta$ that maximize the likelihood $q_\theta$ assigns to the training set. MLE has long been one of the most popular methods for fitting generative models of sequential data, such as language, where autoregressive neural language models generate remarkably realistic text, e.g., GPT-3 (brown2020language) and PaLM (PaLM2022). MLE generally involves computing likelihoods,
which can be more challenging in some domains than others, e.g., it may be more difficult to estimate the probability of a (high-dimensional, real-valued) image than a (discrete-valued) sentence.
An alternative approach, Generative Adversarial Networks (GANs), has become popular across several domains, particularly Computer Vision, owing to breakthrough realism in the images they output (e.g., G2014; ZGMO2019). GANs employ an adversarial approach to generation through a zero-sum game between a generator and a distinguisher, in which the generator produces samples which the distinguisher tries to distinguish from real samples from $p$. Often, both the generator and the distinguisher are differentiable neural networks, though this min-max approach of choosing a model whose outputs are nearly indistinguishable from real examples might be considered for any families of generative models and distinguishers. A major advantage of GANs (particularly for images) is that they can be used for generation without requiring computing likelihoods. This advantage is not significant for many sequential models such as language models, where computing likelihoods is not difficult.
In contrast, the adversarial approach has yet to demonstrate significant improvements in some other domains such as Natural Language Processing (NLP). One well-known barrier to NLP GANs is that language models produce discrete outputs (words), so they are not naturally differentiable (goodfellow2017nips). However, despite numerous works circumventing this limitation and adapting GANs to text generation (yu2017seqgan; press2017language; guo2018long; dautume2019training), adversarial models have yet to achieve the popularity or performance gains that were seen for images. In particular, language GANs have been shown to under-perform MLE in terms of quality (tevet2019evaluating) while suffering from a lack of diversity due to mode collapse (caccia2018language), a well-known issue with GANs in other domains.
1.1 Likelihood and Distinguishability: Two sides of the same coin?
In this work, we suggest a different, fundamental barrier to adopting GANs in domains where MLE is prevalent: the adversarial approach of minimizing distinguishability is effectively a roundabout method of maximizing likelihood on observed data, and hence employing MLE directly can be more efficient. This is the case in NLP where, unlike computer vision, a measure of likelihood called perplexity has been the prevailing metric for training and evaluating language models for decades. We show how GANs boost likelihood in the spirit of, and inspired by, the celebrated related work of FHT2000, which showed how boosting can be viewed as an iterative approach to logistic regression.
Consider a large finite or countably infinite set $\mathcal{X}$ and a family $\mathcal{Q}$ of probability distributions over $\mathcal{X}$. For language, these might be $n$-gram models or neural models. Also consider a family $\mathcal{F}$ of distinguishers $f : \mathcal{X} \to [0,1]$ that aim to distinguish random examples drawn from a distribution $q \in \mathcal{Q}$ from those sampled from $p$. For any such classifier $f$, we call the difference
$$\mathrm{adv}(f) := \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)]$$
the distinguishability advantage of $f$ because it quantifies the accuracy of $f$ at the task of identifying “fake” examples. A perfect distinguisher would thus have $\mathrm{adv}(f) = 1$, while an $f$ which predicts at random has $\mathrm{adv}(f) = 0$. More formally, imagine picking $y \in \{0, 1\}$ uniformly at random and picking a random example $x$ from $p$ if $y = 1$ and from $q$ if $y = 0$. The (randomized) binary classifier that predicts, for any $x$, label $1$ with probability $f(x)$, has (expected) accuracy:
$$\frac{1}{2} + \frac{1}{2}\,\mathrm{adv}(f).$$
Given a family $\mathcal{F}$, we define the distinguishability of $q$ from $p$ to be $\mathrm{adv}_{\mathcal{F}}(p, q) := \sup_{f \in \mathcal{F}} \mathrm{adv}(f)$. Distinguishability is known to be a lower bound on total variation distance (also called statistical distance), a measure of distance between distributions that is difficult to estimate directly for large domains (sriperumbudur2012empirical). The “Bayes-optimal” distinguisher simply predicts $1$ iff $p(x) > q(x)$, and has advantage equal to the total variation distance (see, e.g., hashimoto2019unifying). Clearly $\mathrm{adv}_{\mathcal{F}}(p, p) = 0$, i.e., $p$ is indistinguishable from itself. Motivated by this observation, numerous adversarial approaches to approximating $p$ have attempted to minimize the distinguishability $\mathrm{adv}_{\mathcal{F}}(p, q)$ over $q \in \mathcal{Q}$. If $p \in \mathcal{Q}$, then distinguishability is minimized at $q = p$.
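To make these definitions concrete, the following sketch (our own toy example, not from the paper) computes the advantage of a distinguisher on a small discrete domain and verifies that the Bayes-optimal distinguisher attains exactly the total variation distance, while a random guesser has zero advantage:

```python
def advantage(f, p, q):
    """E_p[f] - E_q[f] for a distinguisher f mapping elements to [0, 1]."""
    return sum(p[x] * f(x) for x in p) - sum(q[x] * f(x) for x in q)

def total_variation(p, q):
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

# Two toy distributions over a three-element domain (illustrative numbers).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}

bayes = lambda x: 1.0 if p[x] > q[x] else 0.0   # predicts "real" iff p(x) > q(x)
random_guess = lambda x: 0.5                    # ignores the input entirely

# Bayes-optimal advantage equals total variation; random guessing has none.
assert abs(advantage(bayes, p, q) - total_variation(p, q)) < 1e-12
assert abs(advantage(random_guess, p, q)) < 1e-12
```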
Example where maximizing likelihood $\neq$ minimizing distinguishability.
When $p \notin \mathcal{Q}$, minimizing distinguishability among $q \in \mathcal{Q}$ may be different from maximizing the likelihood of the samples. For instance, consider modeling the age in years of humans (say the entire population on earth) as a uniform distribution over $\{0, 1, \ldots, N\}$. Now, the $N$ which maximizes likelihood would be the age of the oldest person, which is $N = 119$ at the time of this article; any smaller $N$ would assign zero probability to the 119-year-old and thus to the entire population. However, this distribution is very distinguishable from the true distribution: for instance, it assigns probability greater than $15\%$ to being over 100 years old, which is extremely unlikely among today’s population. A smaller $N$ would yield less distinguishable samples. While it may therefore seem that distinguishability and likelihood are inherently different criteria, as we shall see this is an artificial limitation due to the weakness of the family $\mathcal{Q}$.
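The age example can be simulated directly; the data below is hypothetical and only meant to illustrate the gap between the likelihood-maximizing uniform model and distinguishability:

```python
import math
import random

# Hypothetical age data: almost everyone under 100, plus one 119-year-old.
random.seed(0)
ages = [random.randint(0, 90) for _ in range(10_000)] + [119]

def log_likelihood(N, data):
    """Log-likelihood of the uniform model over {0, ..., N}."""
    if max(data) > N:
        return float("-inf")   # some observation gets zero probability
    return -len(data) * math.log(N + 1)

mle_N = max(ages)              # the MLE: the age of the oldest person
assert mle_N == 119
assert log_likelihood(119, ages) > log_likelihood(150, ages)
assert log_likelihood(118, ages) == float("-inf")

# Yet the MLE model is highly distinguishable: it predicts ~16% of people
# are over 100, while essentially nobody in the data is.
model_over_100 = (119 - 100) / 120
data_over_100 = sum(a > 100 for a in ages) / len(ages)
assert model_over_100 > 0.15 and data_over_100 < 0.01
```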
Of course, the (in)equivalence depends on the families of distinguishers and of probability distributions. We give two results showing that maximizing likelihood and minimizing distinguishability are equivalent as long as $\mathcal{F}$ and $\mathcal{Q}$ are similar in power, even when $p \notin \mathcal{Q}$. First, we consider families $\mathcal{Q}$ that are “log-linear” over some set of functions, which include n-gram models and neural networks whose top layer is a softmax, among others. The equivalence in this case is particularly simple and serves to illustrate how MLE can be a simpler way to reach the same optimum. In this case, $\mathcal{F}$ and $\mathcal{Q}$ are naturally paired.
Maximizing likelihood $=$ minimizing distinguishability for log-linear $\mathcal{Q}$.
In the above age example, the family of distributions $q_\theta(x) \propto e^{\theta x}$ over ages $x \in \{0, 1, \ldots, 119\}$ is an example of a log-linear family. We show that if $q$ can be distinguished from the population distribution $p$ by a function $f$, then folding $f$ into $q$ yields a new model in $\mathcal{Q}$ with greater likelihood. In practice, one only has a sample of the true distribution (not the entire population), and maximizing log-likelihood is an approximation of minimizing the KL divergence $\mathrm{KL}(p \,\|\, q)$. We give formal statements about minimizing KL divergence as well.
The conclusion of this first observation is that if a GAN were to converge within a log-linear family (and making GANs converge is often not an easy feat in practice), it would converge to the MLE.
General polynomial-time reduction.
Our second result is a polynomial-time reduction from likelihood maximization to next-token distinguishability, without the log-linear requirement. We consider the common case of (unidirectional) sequential models that predict the next token based on the previous tokens, which have several practical advantages including being efficient to compute: the probability of a sequence is simply the product of the conditional probabilities of each subsequent word given the previous words. Many state-of-the-art transformer language models such as GPT-3 take this form. Achieving an efficient reduction is challenging due to the normalization requirement of computing partition functions. In order to achieve a polynomial-time reduction, we consider a notion of next-token distinguishability, where the game is as follows: a prefix of tokens is chosen, based on which the generator generates a token to follow the prefix. Given the actual next token and the generated next token, the distinguisher aims to identify which is which. Algorithm 1 leverages a next-token distinguisher to iteratively increase likelihood. In particular, given any target $\epsilon > 0$, Theorem 1 shows that Algorithm 1 will terminate and output an efficiently computable model which is nearly (to within $\epsilon$) indistinguishable from the truth, and it runs in time polynomial in $1/\epsilon$.
If $p \in \mathcal{Q}$ and one has an optimal distinguisher, one will eventually converge to a model close to $p$, as has been discussed heavily in the literature. However, our results are also meaningful in the more realistic case where one has imperfect distinguishers.
Contributions. The main contributions of this paper are:
showing that, although in general minimizing distinguishability and maximizing likelihood seem to be different, they are in fact closely related,
introducing a new model of next-token distinguishability that is necessary to make the reduction efficient, and
offering a new perspective on why GANs are less popular in NLP and other sequential domains than they are for images.
Organization. We begin by summarizing related work on GANs, especially for text. We then illustrate how GANs can be overkill for the simple case of n-gram models in Section 3. Section 4 covers log-linear models. Section 5 gives explicit bounds on general reductions between maximizing likelihood and minimizing distinguishability. Section 6 shows how the reduction can be efficiently computed in the case of sequential inputs, from which we propose a simple polynomial time algorithm that provably finds a distribution which is nearly-indistinguishable with respect to a given class of discriminator functions. Finally, we discuss the relevance of our work in Section 7.
2 Related Work
Several approaches to generative modeling have been investigated, especially in the context of images. In particular, impressive results have been obtained recently with variational autoencoders, GANs, normalizing flows, autoregressive models, diffusion processes, and score/energy-based models (VAEs; G2014; NormalizingFlows; PixelCNN2016; Diffusion2020; song2021scorebased; score2021; NeuralODEs). Generally, training approaches are either adversarial, or rely on MLE, contrastive divergence estimation, or score matching (song2021maximum). Some connections have begun to emerge between these models, and alternate training procedures have been advocated (CZSLPCB2020; song2021scorebased; yair2021contrastive).
GANs for text. Since their introduction (G2014), there has been interest in adapting GANs to text data. The driving motivation was that —up until very recently— samples generated by traditional (likelihood-based) models had been easy to distinguish from human-generated text, and the success of image GANs at generating realistic-looking samples suggested a possible avenue to improve the quality of their natural language counterparts.
The first and most significant challenge in adapting GANs to text arises from the very nature of this data. Goodfellow (goodfellow2017nips) points out that GANs require the generator to be differentiable, which poses a challenge for discrete text representations such as one-hot word or character encodings. Two of the most popular approaches to circumvent this obstacle are policy gradient techniques (e.g., REINFORCE (williams1992simple)), which when applied to language modeling nevertheless often require maximum likelihood pre-training (che2017maximum; yu2017seqgan), and the Gumbel-Softmax approximation (kusner2016gans). The few adversarial methods that do not require pre-training (e.g., press2017language; rajeswar2017adversarial) have failed to show significant promise in all but a few artificial tasks.
This nascent but active line of work seemed to suggest for a period of time that GANs might provide a breakthrough in text generation. This promise did not fully materialize; instead, the most recent breakthroughs came from very large transformer-based architectures like GPT (radford2018improving; radford2019language; brown2020language) and PaLM (PaLM2022), which are trained with traditional cross-entropy (MLE) objectives.
Yet the question of how GAN-based methods for text compare with likelihood-based ones still garners significant interest, and while various works have provided an empirical comparison between them —with most of these suggesting the advantage of MLE-based ones (caccia2018language)— theoretical explanations have been less explored.
Relating objectives via divergences. The connection between maximum likelihood estimation, distinguishability and divergences between probability distributions has been explored before. For example, it is well known that maximizing likelihood is equivalent to minimizing the KL divergence between certain families of fitted and reference distributions, though this is not the only divergence for which such a connection exists (rigollet2018entropic). On the other hand, from the moment GANs were introduced, G2014 noted that, assuming a perfect discriminator, the adversarial objective corresponds to minimizing a Jensen-Shannon divergence. Furthermore, the minimal discrimination error is also directly related to the total variation distance (see, e.g., hashimoto2019unifying). Relatedly, for exponential families the gradient of the KL divergence is known to be related to the discrepancy between distributions (TH2015). While conceptually similar to this line of work, here we instead give an explicit reduction that shows how distinguishability and (log-)likelihood are in direct correspondence.
Pinsker’s inequality is a well-known result linking KL divergence and total variation distance (TVD): $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p \,\|\, q)}$. While related, this inequality is not directly relevant to the context of this work. First, while total variation provides an upper bound on distinguishability, it is not computable in general, so it is rarely used as a training objective for generative models. Second, being one-sided (reverse Pinsker inequalities exist only for particular cases, and they too are very loose in general (sason2015reverse)), it does not imply that reducing TVD reduces KL divergence. Furthermore, Pinsker’s is in general a very loose inequality, particularly for the direction of KL divergence that is equivalent to MLE (i.e., $\mathrm{KL}(p \,\|\, q)$), since $q(x) = 0$ for even a single $x$ with $p(x) > 0$ leads to unbounded KL divergence. In contrast, in this work we provide a reduction directly linking the two criteria of interest: distinguishability and maximum likelihood.
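A tiny numeric example (ours, not from the paper) shows why no general reverse Pinsker inequality can hold: a model that assigns zero probability to a single rare outcome has tiny total variation distance from the truth but infinite KL divergence:

```python
import math

def total_variation(p, q):
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def kl_divergence(p, q):
    """KL(p || q); infinite whenever q misses a point in p's support."""
    return sum(px * math.log(px / q[x]) if q[x] > 0 else float("inf")
               for x, px in p.items() if px > 0)

p = {"common": 0.99, "rare": 0.01}   # truth: a rare event with probability 1%
q = {"common": 1.00, "rare": 0.00}   # model: ignores the rare event entirely

assert total_variation(p, q) < 0.011     # samples are nearly indistinguishable
assert math.isinf(kl_divergence(p, q))   # yet the KL divergence is unbounded
```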
Log-linear language models. In this work we focus our analysis on log-linear models (LMP2001; MFP2000), which are widely used in natural language processing (often known in that community as Maximum Entropy –MaxEnt– models) for various tasks. In particular, these models have been a cornerstone of both neural (BDVJ2003; mikolov2013efficient) and non-neural (rosenfeld1994adaptive; khudanpur2000maximum) language modeling.
Boosting. The reduction shown here bears resemblance to boosting. It is well-known that boosting can be analyzed through the lens of maximum likelihood (FHT2000), while LL2002 formalized the equivalence of AdaBoost and maximum likelihood training for exponential models. More recently, boosting has been applied to generative adversarial models (tolstikhin2017adagan; grover2018boosted), albeit with a different approach and objective than the connection drawn in this work.
3 Illustration: GANs for n-gram language models
To illustrate our main point, consider first the simplest model of language: a unigram model, where the probability of each word is independent and the end-of-sentence token has a given probability as well. If $\theta_w$ represents the log-probability of word $w$, then the log-probability of a sentence $s = (w_1, \ldots, w_k)$ is given by:
$$\log q_\theta(s) = \sum_{i=1}^{k} \theta_{w_i} + \theta_{\mathrm{EOS}}.$$
The MLE parameters can be computed in linear time by simply counting word frequencies.
A more roundabout approach to fitting a unigram language model would be to start with any initial unigram model $q$, generate random samples from $q$ and compare them to those from $p$. One could then distinguish the two by finding a word that appears significantly more often in one than in the other. For example, if one generates text from the model and finds that the word “the” occurs much more often in text generated from $p$, one would then update the parameters by increasing $\theta_{\text{the}}$ (and decreasing $\theta_w$ for all other words to keep a probability distribution). As we shall see later, if this more involved procedure converged, it would necessarily converge to the same maximum-likelihood estimator $\hat{\theta}$.
A similar argument applies to any $n$-gram model in which the probability of each subsequent word is determined only by the previous $n-1$ words. This is also optimized by frequency counts (a variety of “smoothing” techniques, e.g., adding 1 to counts, also known as Laplace smoothing (good1953population; kneser1995improved), are often used as a form of regularization on top of these counts). Distinguishers could similarly be used to find a model that is indistinguishable from $p$ according to $n$-gram frequencies, but again this would simply converge to the MLE parameters.
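The frequency-counting estimator described above can be sketched in a few lines (the toy corpus and vocabulary are ours; `add_k=1` gives the Laplace-smoothed variant):

```python
from collections import Counter

def unigram_mle(corpus, vocab, add_k=0.0):
    """MLE unigram probabilities by counting; add_k=1 gives Laplace smoothing."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values()) + add_k * len(vocab)
    return {w: (counts[w] + add_k) / total for w in vocab}

corpus = [["the", "cat"], ["the", "dog"]]
vocab = ["the", "cat", "dog", "fish"]

mle = unigram_mle(corpus, vocab)
assert mle["the"] == 0.5 and mle["fish"] == 0.0   # unseen word gets zero mass

smoothed = unigram_mle(corpus, vocab, add_k=1.0)
assert smoothed["fish"] > 0.0                     # smoothing avoids zeros
assert abs(sum(smoothed.values()) - 1.0) < 1e-12  # still a distribution
```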
4 Equivalence for log-linear models
In this section, we show that there is one optimal log-linear model that both minimizes distinguishability and maximizes log-likelihood. Consider a log-linear model with $d$ bounded features $f = (f_1, \ldots, f_d)$, $f_i : \mathcal{X} \to [0, 1]$. The model predicts
$$q_\theta(x) = \frac{e^{\theta \cdot f(x)}}{Z_\theta},$$
where $\cdot$ denotes inner product, $\theta \in \mathbb{R}^d$ is a parameter vector and $Z_\theta = \sum_{x \in \mathcal{X}} e^{\theta \cdot f(x)}$ is a normalizing constant called the partition function.
In the unigram example, the features would be word counts normalized by dividing by the maximum allowed sentence length (to ensure $f_i(x) \in [0, 1]$). In a neural network, the features would correspond to the top layer, and the model computes a softmax. Multiple strategies have been studied for computing or estimating the partition function (see, e.g., desjardins2011tracking).
As discussed earlier, these feature functions can also be thought of as classifiers that distinguish examples drawn from $p$ from those drawn from $q_\theta$, and the advantage of $f_i$ is $\mathrm{adv}(f_i) = \mathbb{E}_p[f_i] - \mathbb{E}_{q_\theta}[f_i]$. The advantage vector is $\mathrm{adv}(f) := \mathbb{E}_p[f] - \mathbb{E}_{q_\theta}[f] \in \mathbb{R}^d$. Note that a feature with negative advantage can be used for distinguishing by using the reverse classifier $1 - f_i$ as a distinguisher, which has the opposite advantage $-\mathrm{adv}(f_i)$.
The gradient of the KL divergence $\mathrm{KL}(p \,\|\, q_\theta)$ with respect to $\theta$ is the negated advantage vector, i.e., for all $i$:
$$\frac{\partial}{\partial \theta_i}\,\mathrm{KL}(p \,\|\, q_\theta) = \mathbb{E}_{q_\theta}[f_i] - \mathbb{E}_p[f_i] = -\mathrm{adv}(f_i).$$
The above straightforward calculation is well known, as is the fact that $\mathrm{KL}(p \,\|\, q_\theta)$ is convex in $\theta$. However, we interpret this fact in the context of GANs: searching for a $\theta$ which gives a zero gradient for the KL divergence is equivalent to finding a $q_\theta$ which is indistinguishable with respect to each $f_i$. While a number of GANs have been designed in various architectures to solve the seemingly more complex problem of minimizing $\mathrm{adv}_{\mathcal{F}}(p, q_\theta)$, it can generally be more efficient to maximize likelihood, which (approximately) minimizes the KL divergence.
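The relationship between the gradient and the advantage vector can be checked numerically on a toy log-linear model (the domain, features, and distributions below are illustrative): the finite-difference gradient of the expected log-likelihood $\mathbb{E}_p[\log q_\theta]$ matches the advantage vector coordinate-wise (equivalently, the KL gradient is its negation).

```python
import math

domain = [0, 1, 2, 3]
feats = lambda x: (x / 3.0, (x % 2) * 1.0)       # two bounded features in [0, 1]
p = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}             # "true" distribution
theta = [0.5, -0.2]

def q_theta(theta):
    w = {x: math.exp(sum(t * f for t, f in zip(theta, feats(x)))) for x in domain}
    Z = sum(w.values())                           # partition function by enumeration
    return {x: v / Z for x, v in w.items()}

def expected_log_lik(theta):
    q = q_theta(theta)
    return sum(p[x] * math.log(q[x]) for x in domain)

# advantage vector: E_p[f_i] - E_{q_theta}[f_i]
q = q_theta(theta)
adv = [sum(p[x] * feats(x)[i] for x in domain)
       - sum(q[x] * feats(x)[i] for x in domain) for i in range(2)]

# central finite differences of E_p[log q_theta] match the advantage vector
eps = 1e-6
for i in range(2):
    hi = list(theta); hi[i] += eps
    lo = list(theta); lo[i] -= eps
    grad_i = (expected_log_lik(hi) - expected_log_lik(lo)) / (2 * eps)
    assert abs(grad_i - adv[i]) < 1e-6
```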
5 Distinguishability is equivalent to increasing likelihood for general $\mathcal{F}$, $\mathcal{Q}$
In this section, we show how reducing log-loss is equivalent to distinguishing real and generated examples. This is the basis behind a single step of our main algorithm (the reduction in this section is efficient for a single step, but the increase in runtime would lead to a general exponential-time algorithm). The bounds here are in terms of log-loss, as measured on a specific sample, rather than the abstract KL divergence quantity of the previous section, which cannot be computed exactly using a finite sample. In particular, we show how, if one can distinguish a given distribution from the sample, then one can decrease that distribution’s log-loss, and vice versa.
For the remainder of this section, we drop the subscript $\theta$ from the variable denoting the fitted distribution to avoid cluttering the notation. Fix a sample $x_1, \ldots, x_m$ of training examples drawn from $p$, and define the log-loss of $q$ to be:
$$\hat{L}(q) := \frac{1}{m} \sum_{j=1}^{m} \log \frac{1}{q(x_j)},$$
where we use a hat on $\hat{L}$ to denote that the loss is estimated on a (finite) training set. Likewise, $\hat{\mathbb{E}}$ denotes the empirical expectation $\hat{\mathbb{E}}[g] := \frac{1}{m}\sum_{j=1}^{m} g(x_j)$. Note that the expected log-loss over training sets is known as the cross-entropy
$$\mathbb{E}\big[\hat{L}(q)\big] = \mathbb{E}_{x \sim p}\left[\log \frac{1}{q(x)}\right],$$
and hence the expected difference in log-loss between two candidate distributions $q_1$ and $q_2$ is equal to the difference
$$\mathrm{KL}(p \,\|\, q_1) - \mathrm{KL}(p \,\|\, q_2),$$
so minimizing log-loss approximately minimizes the KL divergence. Also, we define the training advantage of a distinguisher $f$ to be:
$$\widehat{\mathrm{adv}}(f) := \hat{\mathbb{E}}[f] - \mathbb{E}_{x \sim q}[f(x)], \qquad (2)$$
which is independent of $p$, depending only on the training sample, and can thus be estimated to arbitrary accuracy using samples generated from $q$. The lemmas below show how one can use a distinguisher to reduce log-loss on the same training sample, and how to use a distribution with a lower log-loss to distinguish the two distributions.
Let $\epsilon \in (0, 1]$ and suppose $f$ has training advantage $\widehat{\mathrm{adv}}(f) \ge \epsilon$. Then the probability distribution $q'(x) := q(x)\, e^{\frac{\epsilon}{2} f(x)} / Z$, where $Z = \mathbb{E}_{x \sim q}\big[e^{\frac{\epsilon}{2} f(x)}\big]$, has lower log-loss:
$$\hat{L}(q') \le \hat{L}(q) - \frac{\epsilon^2}{4}.$$
Before we give the proof, we note that if $f$ is computed by a neural network and $q$ is computed as a neural network with a softmax at the top layer, i.e., $q(x) \propto e^{g(x)}$ where $g$ is some neural network, then $q'$ is naturally represented as the combined neural network $g + \frac{\epsilon}{2} f$ with a softmax at the top layer.
Proof (Lemma 1).
Since $\hat{L}(q') = \hat{L}(q) - \frac{\epsilon}{2}\hat{\mathbb{E}}[f] + \log Z$ and $\widehat{\mathrm{adv}}(f) \ge \epsilon$ by assumption, it remains to show $\log Z \le \frac{\epsilon}{2}\,\mathbb{E}_{x \sim q}[f(x)] + \frac{\epsilon^2}{4}$. Using the bound $e^z \le 1 + z + z^2$ for any $z \le 1$, we get that,
$$Z = \mathbb{E}_{x \sim q}\Big[e^{\frac{\epsilon}{2} f(x)}\Big] \le 1 + \frac{\epsilon}{2}\,\mathbb{E}_{x \sim q}[f(x)] + \frac{\epsilon^2}{4},$$
where we have used the fact that $f(x)^2 \le f(x) \le 1$ and, to get to the last line, we use $\log(1 + z) \le z$ for $z > -1$. Since $\widehat{\mathrm{adv}}(f) \ge \epsilon$, the last quantity is at most $\hat{L}(q) - \frac{\epsilon}{2}\epsilon + \frac{\epsilon^2}{4}$, which gives $\hat{L}(q') \le \hat{L}(q) - \frac{\epsilon^2}{4}$. ∎
This means that if we can distinguish $q$ from the training sample, then we can simply reduce log-loss by down-weighting suspicious samples identified by the distinguisher $f$. The difference between this statement and Observation 1 is analogous to the difference between boosting and logistic regression (FHT2000). In logistic regression, one typically fixes the set of features in advance, whereas in boosting this is not possible if there are infinitely many possible classifiers.
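As a sanity check (our toy numbers, with step size $\mathrm{adv}/2$ as one instance of the lemma's multiplicative reweighting), folding a distinguisher with positive training advantage into the model does reduce log-loss:

```python
import math

def log_loss(q, sample):
    return sum(-math.log(q[x]) for x in sample) / len(sample)

# Toy setup (ours): the model q underweights "a", which the sample contains often.
q = {"a": 0.2, "b": 0.8}
sample = ["a", "a", "a", "b"]            # empirical frequency of "a" is 0.75

f = lambda x: 1.0 if x == "a" else 0.0   # distinguisher flags the underweighted word
emp = sum(f(x) for x in sample) / len(sample)
adv = emp - sum(q[x] * f(x) for x in q)  # training advantage: 0.75 - 0.2 = 0.55
assert adv > 0.5

# Lemma-1-style update: q'(x) proportional to q(x) * exp((adv/2) * f(x))
Z = sum(q[x] * math.exp(adv / 2 * f(x)) for x in q)
q_new = {x: q[x] * math.exp(adv / 2 * f(x)) / Z for x in q}

assert log_loss(q_new, sample) < log_loss(q, sample)   # log-loss went down
```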
Conversely, we next show that if has a lower log-loss than on the training samples, then we can distinguish from these samples.
For any constant $c > 1$ and distributions $q_1, q_2$ such that $\frac{1}{c} \le \frac{q_1(x)}{q_2(x)} \le c$ for all $x$, the distinguisher $f$ defined by,
$$f(x) := \frac{1}{2} + \frac{1}{2 \log c} \log \frac{q_1(x)}{q_2(x)},$$
has a training advantage of,
$$\widehat{\mathrm{adv}}(f) \ge \frac{\hat{L}(q_2) - \hat{L}(q_1)}{2 \log c}.$$
Proof (Lemma 2).
Let $g(x) := \log \frac{q_1(x)}{q_2(x)}$, so that $f = \frac{1}{2} + \frac{g}{2 \log c}$. By Jensen’s inequality,
$$\mathbb{E}_{x \sim q_2}[g(x)] \le \log \mathbb{E}_{x \sim q_2}\left[\frac{q_1(x)}{q_2(x)}\right] = 0.$$
Since $\hat{\mathbb{E}}[g] = \hat{L}(q_2) - \hat{L}(q_1)$, the training advantage of $f$ is that of $g$ scaled by a factor of $\frac{1}{2 \log c}$. Finally, it is straightforward to verify that $f(x) \in [0, 1]$ by our assumptions on the ratio between $q_1$ and $q_2$. ∎
Importantly, due to the logarithmic dependence on $c$, the above lemma is meaningful even if $q_1$ and $q_2$ are exponentially far apart, so long as they have the same support.
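Lemma 2's construction can likewise be verified on a toy pair of models (our numbers); the log-ratio distinguisher stays in $[0, 1]$ and its advantage meets the stated bound:

```python
import math

q1 = {"a": 0.6, "b": 0.4}    # model with the lower log-loss on the sample
q2 = {"a": 0.3, "b": 0.7}    # model to be distinguished from the sample
sample = ["a", "a", "b", "a"]

# ratio bound c and the log-ratio distinguisher from Lemma 2
c = max(max(q1[x] / q2[x] for x in q1), max(q2[x] / q1[x] for x in q1))
f = lambda x: 0.5 + math.log(q1[x] / q2[x]) / (2 * math.log(c))
assert all(0.0 <= f(x) <= 1.0 for x in q1)   # a valid distinguisher

log_loss = lambda q: sum(-math.log(q[x]) for x in sample) / len(sample)
adv = sum(f(x) for x in sample) / len(sample) - sum(q2[x] * f(x) for x in q2)

# training advantage is at least (L(q2) - L(q1)) / (2 log c)
assert adv >= (log_loss(q2) - log_loss(q1)) / (2 * math.log(c)) - 1e-12
```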
Lemma 1 implies a reduction between the problem of distinguishing with nontrivial advantage and the problem of non-trivially reducing log-loss. Note that iteratively applying the reduction requires repeated computation of the normalization terms $Z$ over $\mathcal{X}$, and computing such partition functions is an area of active research: it is known how to do this efficiently for some classes and not for others. The next section gives an efficient reduction for (unidirectional) sequential models.
6 Efficient Reduction for Sequential Models
This section gives an efficient reduction from distinguishing to MLE for sequential models. This requires showing how one can efficiently compute the normalization terms (partition function) on a token-by-token basis for black-box sequential (e.g., language or auto-regressive) models. The key insight for efficiency is that, rather than distinguishing entire sequences from and , one distinguishes the conditional next-token predictions. In particular, rather than generating entire sequences from , one can generate next-token predictions on all sequence prefixes in the training data.
Clearly, evaluating a neural network over all sequences is infeasible. However, in many applications such as NLP, the inputs are sequences $x = (x_1, \ldots, x_T)$, where every token $x_t$ is taken from a large discrete vocabulary $V$. In such cases, the combinatorial nature of the data makes density estimation intractable unless the likelihood computation is broken into small sequential steps by representing the overall probability as the product of next-token conditional probabilities, $q(x) = \prod_{t=1}^{T} q(x_t \mid x_{1:t-1})$.
In this section we show how a natural extension of the framework described above allows us to achieve an efficient reduction for this common type of sequential model. To do so, we define a simple generalization of the training advantage criterion (2), which now relies on a step-wise distinguisher operating on variable-length sequences. Formally, we consider a language $\mathcal{X} = V^T$ of $T$-length sequences (padding can be used to handle sequences of variable length) of tokens taken from a vocabulary $V$, and distinguisher functions $f : \bigcup_{t \le T} V^t \to [0, 1]$, i.e., functions which can take subsequences of any size as input. Given a sample of sequences $x^{(1)}, \ldots, x^{(m)}$, we say that $f$ has generalized training advantage given by
$$\widehat{\mathrm{adv}}(f) := \frac{1}{mT} \sum_{j=1}^{m} \sum_{t=1}^{T} \left( f\big(x^{(j)}_{1:t}\big) - \mathbb{E}_{x'_t \sim q(\cdot \mid x^{(j)}_{1:t-1})}\Big[ f\big(x^{(j)}_{1:t-1}\, x'_t\big) \Big] \right),$$
where, by convention, $x_{1:0}$ is the empty sequence. This criterion can be interpreted as follows. For every length $t$, $f$ is tasked with distinguishing the subsequence consisting of the first $t$ tokens in a true sequence sampled from $p$ from another $t$-length sequence in which the last element is replaced by a randomly selected token from the alternate distribution $q$.
Let $\epsilon \in (0, 1]$ and suppose $f$ has generalized training advantage $\widehat{\mathrm{adv}}(f) \ge \epsilon$. We define a distribution $q'$ through its conditional probabilities as:
$$q'(x_t \mid x_{1:t-1}) := \frac{q(x_t \mid x_{1:t-1})\, e^{\frac{\epsilon}{2} f(x_{1:t})}}{Z_{x_{1:t-1}}},$$
where now $Z_{x_{1:t-1}} = \sum_{x_t \in V} q(x_t \mid x_{1:t-1})\, e^{\frac{\epsilon}{2} f(x_{1:t})}$. Then $q'$ incurs lower log-loss than $q$ (with the log-loss averaged over the $mT$ next-token predictions):
$$\hat{L}(q') \le \hat{L}(q) - \frac{\epsilon^2}{4}.$$
The proof is deferred to Appendix A.
Next, we use Lemma 3 repeatedly to derive a simple algorithm that, given access to non-trivial weak distinguishers, returns a distribution that is nearly indistinguishable (by that class) from the true distribution $p$. Formally, let $\mathcal{F}$ be a class of distinguishers. We assume access to an oracle $O$ which, for any $q$, returns a distinguisher $f \in \mathcal{F}$. In practice, such as in a typical GAN training setting, one can think of this oracle as being approximated by the subroutine that trains the discriminator. We say that $q$ is $\epsilon$-indistinguishable by oracle $O$ if its output $f$ has advantage $\widehat{\mathrm{adv}}(f) < \epsilon$. We do not need to assume that $O$ is optimal in any sense.
Let $q_0$ be a language model and let $\epsilon \in (0, 1]$. Algorithm 1 returns a distribution $q$ which is $\epsilon$-indistinguishable from $p$ by oracle $O$. It runs in $O\big(\frac{L_0}{\epsilon^2}(R + C m T V)\big)$ time, where $L_0$ is the log-loss of $q_0$, $R$ is the runtime of oracle $O$, $C$ is the complexity of evaluating any distinguisher on a single input, $V$ is the vocabulary size, $T$ is the sequence length and $m$ is the number of training sequences.
The fact that Algorithm 1 terminates with a distribution which is -indistinguishable by is immediate from the stopping criterion.
Now, for the runtime analysis, note that, by construction, the iterates $q_0, q_1, \ldots$ produced before termination have training advantage at least $\epsilon$. Thus, by Lemma 3, the algorithm makes at least $\frac{\epsilon^2}{4}$ improvement in log-loss in each iteration. Therefore, the total number of iterations is at most $\frac{4 L_0}{\epsilon^2}$, where $L_0$ is the log-loss of the initial model. Each iteration of Algorithm 1 requires calling oracle $O$ once, and updating each of the next-token probabilities of $q$ for each of the $m$ training sequences and each sequence length $t \le T$. Each of these updates involves evaluating $f$, at cost $O(C)$, plus an $O(V)$ partition normalization. Putting these together, we conclude that each iteration has $O(R + C m T V)$ complexity.
Combining the two arguments above, we conclude that Algorithm 1 has a total runtime of $O\big(\frac{L_0}{\epsilon^2}(R + C m T V)\big)$. ∎
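As an illustrative sketch (not the paper's Algorithm 1 verbatim), the loop structure can be instantiated for the simplest single-token case, with an exhaustive-search "oracle" over indicator distinguishers and their complements; all names and the toy data are ours. Consistent with the paper's thesis, the boosted model converges to the empirical (MLE) frequencies:

```python
import math

def boost_until_indistinguishable(q, sample, vocab, eps):
    """Sketch of the iterative reduction for a single-token (unigram) model:
    repeatedly query an 'oracle' for the distinguisher with the best training
    advantage and fold it into q, stopping once no advantage reaches eps."""
    def advantage(f):
        emp = sum(f(x) for x in sample) / len(sample)   # empirical E[f]
        model = sum(q[x] * f(x) for x in vocab)         # model E[f]
        return emp - model

    while True:
        # 'oracle': exhaustive search over indicator distinguishers and complements
        best_f, best_adv = None, 0.0
        for w in vocab:
            for f in (lambda x, w=w: float(x == w), lambda x, w=w: float(x != w)):
                a = advantage(f)
                if a > best_adv:
                    best_f, best_adv = f, a
        if best_adv < eps:
            return q   # eps-indistinguishable by this oracle
        # Lemma-1-style multiplicative update with step best_adv / 2
        step = best_adv / 2
        Z = sum(q[x] * math.exp(step * best_f(x)) for x in vocab)
        q = {x: q[x] * math.exp(step * best_f(x)) / Z for x in vocab}

vocab = ["a", "b", "c"]
sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
q0 = {w: 1.0 / 3 for w in vocab}
q = boost_until_indistinguishable(q0, sample, vocab, eps=0.01)

# the boosted model approaches the empirical (MLE) frequencies 0.6 / 0.3 / 0.1
assert abs(q["a"] - 0.6) < 0.02
assert abs(q["b"] - 0.3) < 0.02
```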
7 Discussion and conclusions
In this work, we have argued that minimizing log-loss (i.e., KL divergence) and minimizing statistical distinguishability are tightly related goals. Specifically, if the families of distinguishers and probability distributions are of similar power, then one can use a distinguisher to reduce log-loss. This means that in applications where it is natural to fit models by minimizing log-loss, doing so directly is likely to be a more efficient means of fitting a model. This is the case for n-gram language models (and other sequential tasks), for which perplexity (a measure of likelihood) is easy to compute, naturally meaningful, and for which sampling is efficient. Thus, for a long time, minimizing log-loss has been the objective with which most state-of-the-art models are trained. For such models, Lemma 1 implies that if one can distinguish the model from samples by a neural network, then one can construct a larger neural network with lower log-loss. Hence, one may prefer to simply train a larger model in the first place.
The contribution of this work is conceptual and theoretical, and as such, any nuanced discussion of the potential harms or benefits of its impact is inextricably tied to the applications where it might be put to use. We first make a few general observations regarding its immediate impact, and then discuss in a more informal manner the downstream ramifications these might have in applications.
This work revolves around comparing two training paradigms: maximum likelihood and adversarial learning. We believe that the maxim of ‘choosing the right tool for the job’ applies in this context too, and can have important downstream consequences. For example, the amount of resources consumed by training large generative models has been growing substantially over the past few years (amodei2018ai). This is particularly true for Natural Language Processing, where state-of-the-art models are increasingly large and trained on increasingly larger datasets, leading to striking computational and environmental costs (strubell2019energy). The key takeaway offered by this work, namely that training certain generative models like language models through adversarial methods is less efficient than doing so via likelihood maximization, could potentially lead to significant savings of these resources by steering practitioners away from adversarial approaches. On the other hand, it is not our intention for this work to have the opposite, but equally undesirable, effect of dissuading practitioners from choosing adversarial training approaches whenever those are a sensible choice.
Appendix A Proof of Lemma 3
We proceed analogously to the proof of Lemma 1. We first note that
Let us use the short-hand notation . Subtracting the two equalities above we obtain
which, after adding and subtracting and rearranging terms, yields
By assumption we have $\widehat{\mathrm{adv}}(f) \ge \epsilon$, so it remains to show that the second term is upper bounded by $\frac{\epsilon^2}{4}$. Using, as before, the bound $e^z \le 1 + z + z^2$ for every $z \le 1$, we get that, for every $t$:
where the last inequality follows again from the fact that $\log(1 + z) \le z$ for any $z > -1$. Therefore, the sum over these terms is upper bounded by $\frac{\epsilon^2}{4}$, which, combined with (7), yields the desired result.