1 Introduction
Consider a situation where one has samples from a true distribution $p$ over a set $\mathcal{X}$ and one wishes to learn to generate similar samples, such as learning to generate English sentences from a large English text corpus. One seeks an approximation $q$ of $p$ which is "close" in some sense and from which samples can efficiently be generated. A common approach to fit these models is Maximum Likelihood Estimation (MLE), which, given a training set drawn from $p$ and a parametrized distribution $q_\theta$, seeks parameters $\theta$ that maximize the likelihood $q_\theta$ assigns to the training set. MLE has long been one of the most popular methods for fitting generative models of sequential data, such as language, where autoregressive neural language models generate remarkably realistic text, e.g., GPT-3 (brown2020language) and PaLM (PaLM2022). MLE generally involves computing likelihoods, which can be more challenging in some domains than others, e.g., it may be more difficult to estimate the probability of a (high-dimensional, real-valued) image than a (discrete-valued) sentence.
An alternative approach, Generative Adversarial Networks (GANs), has become popular across several domains, particularly Computer Vision, owing to breakthrough realism in the images they output
(e.g., G2014; ZGMO2019). GANs employ an adversarial approach to generation through a zero-sum game between a generator and a distinguisher, in which the generator produces samples that the distinguisher tries to distinguish from real samples drawn from the true distribution. Often, both the generator and the distinguisher are differentiable neural networks, though this min-max approach of choosing a model whose outputs are nearly indistinguishable from real examples might be considered for any families of generative models and distinguishers. A major advantage of GANs (particularly for images) is that they can be used for generation without requiring computing likelihoods. This advantage is not significant for many sequential models such as language models, where computing likelihoods is not difficult. In contrast, the adversarial approach has yet to demonstrate significant improvements in some other domains such as Natural Language Processing (NLP). One well-known barrier to NLP GANs is that language models produce discrete outputs (words), so they are not naturally differentiable
(goodfellow2017nips). However, despite numerous works circumventing this limitation and adapting GANs to text generation (yu2017seqgan; press2017language; guo2018long; dautume2019training), adversarial models have yet to achieve the same popularity or performance gains that were seen for images. In particular, language GANs have been shown to underperform MLE in terms of quality (tevet2019evaluating) while facing a lack of diversity due to mode collapse (caccia2018language), a well-known issue with GANs in other domains.
1.1 Likelihood and Distinguishability: Two sides of the same coin?
In this work, we suggest a different, fundamental barrier to adopting GANs in domains where MLE is prevalent: the adversarial approach of minimizing distinguishability is effectively a roundabout method of maximizing likelihood on observed data, and hence employing MLE directly can be more efficient. This is the case in NLP where, unlike in computer vision, a measure of likelihood called perplexity has been the prevailing metric for training and evaluating language models for decades. We show how GANs boost likelihood in the spirit of, and inspired by, the celebrated work of FHT2000
that showed how boosting can be viewed as an iterative approach for logistic regression.
Consider a large finite or countably infinite set $\mathcal{X}$ and a family $\mathcal{Q}$ of probability distributions over $\mathcal{X}$. For language, these might be $n$-gram models or neural models. Also consider a family $\mathcal{F}$ of distinguishers $f:\mathcal{X}\to[0,1]$ that aim to distinguish random examples drawn from a distribution $q\in\mathcal{Q}$ from those sampled from $p$. For any such classifier $f$, we call the difference $a(f)=\mathbb{E}_{x\sim q}[f(x)]-\mathbb{E}_{x\sim p}[f(x)]$ the distinguishability advantage of $f$ because it quantifies the accuracy of $f$ at the task of identifying "fake" examples. A perfect distinguisher would thus have $a(f)=1$, while one which predicts at random has $a(f)=0$. More formally, imagine picking $y\in\{0,1\}$ uniformly at random and picking a random example $x$ from $p$ if $y=1$ and from $q$ if $y=0$. The (randomized) binary classifier that predicts, for any $x$, $y=0$ with probability $f(x)$, has (expected) accuracy $\tfrac12+\tfrac12 a(f)$.

Given a family $\mathcal{F}$, we define the distinguishability of $q$ from $p$ to be $d_{\mathcal{F}}(p,q)=\sup_{f\in\mathcal{F}}a(f)$. Distinguishability is known to be a lower bound on total variation distance (also called statistical distance), a measure of distance between distributions that is difficult to estimate directly for large domains (sriperumbudur2012empirical). The "Bayes-optimal" distinguisher simply predicts "fake" iff $q(x)>p(x)$, and has advantage equal to the total variation distance (see, e.g., hashimoto2019unifying). Clearly $d_{\mathcal{F}}(p,p)=0$, i.e., $p$ is indistinguishable from itself. Motivated by this observation, numerous adversarial approaches to approximating $p$ have attempted to minimize the distinguishability $d_{\mathcal{F}}(p,q)$. If $p\in\mathcal{Q}$, then $d_{\mathcal{F}}(p,q)$ is minimized at $q=p$.
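As a concrete numeric illustration (the two distributions below are arbitrary toy choices, and the sketch adopts the sign convention that a distinguisher fires on "fake" samples), the Bayes-optimal distinguisher's advantage coincides with the total variation distance:

```python
# Toy illustration: the Bayes-optimal distinguisher fires exactly where
# q(x) > p(x), and its advantage a(f) = E_{x~q}[f(x)] - E_{x~p}[f(x)]
# equals the total variation distance between p and q.
p = {"a": 0.5, "b": 0.3, "c": 0.2}   # "true" distribution
q = {"a": 0.2, "b": 0.3, "c": 0.5}   # model distribution

def advantage(f, p, q):
    """Advantage a(f) of a distinguisher f : X -> [0, 1]."""
    return sum(q[x] * f(x) for x in q) - sum(p[x] * f(x) for x in p)

bayes_opt = lambda x: 1.0 if q[x] > p[x] else 0.0
tv = 0.5 * sum(abs(p[x] - q[x]) for x in p)

assert abs(advantage(bayes_opt, p, q) - tv) < 1e-12   # both equal 0.3
```

Any weaker family of distinguishers can only achieve a smaller supremum, which is why distinguishability lower-bounds total variation.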
Example where maximizing likelihood $\neq$ minimizing distinguishability.
When $p\notin\mathcal{Q}$, minimizing distinguishability among $q\in\mathcal{Q}$ may be different than maximizing the likelihood of $q$. For instance, consider modeling the age in years of humans (say the entire population on earth) as a uniform distribution over $\{0,1,\ldots,n\}$. Now, the $n$ which maximizes likelihood would be the age of the oldest person, 119 at the time of this article: any smaller $n$ would assign zero probability to the 119-year-old and thus to the entire population. However, this distribution is very distinguishable from the true distribution; for instance, it assigns probability $19/120\approx 1/6$ to being over 100 years old, which is extremely unlikely among today's population. A smaller $n$ would yield less distinguishable samples. While it may therefore seem that distinguishability and likelihood are inherently different criteria, as we shall see, this is an artificial limitation due to the weakness of the family $\mathcal{Q}$.

Of course, the (in)equivalence depends on the families $\mathcal{F}$ of distinguishers and $\mathcal{Q}$ of probability distributions. We give two results showing that maximizing likelihood and minimizing distinguishability are equivalent as long as $\mathcal{F}$ and $\mathcal{Q}$ are similar in power, even when $p\notin\mathcal{Q}$. First, we consider families $\mathcal{Q}$ that are "log-linear" over some set of functions, which include $n$-gram models and neural networks whose top layer is a softmax, among others. The equivalence in this case is particularly simple and serves to illustrate how MLE can be a simpler way to reach the same optimum. In this case, $\mathcal{F}$ and $\mathcal{Q}$ are naturally paired.
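The age example can be checked numerically. The synthetic "population" below (roughly exponential ages plus one 119-year-old) is an assumption for illustration only:

```python
import math
import random

# Synthetic population: roughly exponential ages plus one 119-year-old outlier.
random.seed(0)
ages = [min(int(random.expovariate(1 / 30)), 110) for _ in range(10_000)] + [119]

def log_likelihood(n, data):
    """Log-likelihood of the data under Uniform{0, ..., n}; -inf if any age exceeds n."""
    if max(data) > n:
        return -math.inf
    return -len(data) * math.log(n + 1)

def tv_to_empirical(n, data):
    """Total variation between Uniform{0, ..., n} and the empirical age distribution."""
    counts = {}
    for a in data:
        counts[a] = counts.get(a, 0) + 1
    support = set(range(n + 1)) | set(counts)
    u = lambda a: 1 / (n + 1) if a <= n else 0.0
    return 0.5 * sum(abs(u(a) - counts.get(a, 0) / len(data)) for a in support)

# Likelihood is maximized at n = 119 (the oldest individual) ...
assert max(range(60, 200), key=lambda n: log_likelihood(n, ages)) == 119
# ... yet a smaller n is less distinguishable from the data:
assert tv_to_empirical(80, ages) < tv_to_empirical(119, ages)
```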
Maximizing likelihood $=$ minimizing distinguishability for log-linear $\mathcal{Q}$.
In the above age example, the family of uniform distributions over $\{0,1,\ldots,n\}$ for $n\in\mathbb{N}$ is an example of a log-linear family. We show that if $q\in\mathcal{Q}$ can be distinguished from the population distribution by a function $f$, then folding $f$ into $q$ yields a new model in $\mathcal{Q}$ with greater likelihood. In practice, one only has a sample of the true distribution (not the entire population), and maximizing log-likelihood is an approximation of minimizing the KL divergence $\mathrm{KL}(p\,\|\,q)$. We give formal statements about minimizing KL divergence as well.

The conclusion of this first observation is that if a GAN were to converge within a log-linear family (and making GANs converge is often not an easy feat in practice), it would converge to the MLE.
General polynomial-time reduction.
Our second result is a polynomial-time reduction from likelihood maximization to next-token distinguishability, without the log-linear requirement. We consider the common case of (unidirectional) sequential models that predict the next token based on the previous tokens, which have several practical advantages, including being efficient to compute: the probability of a sequence is simply the product of the conditional probabilities of each subsequent word given the previous words. Many state-of-the-art transformer language models such as GPT-3 take this form. Achieving an efficient reduction is challenging due to the normalization requirement of computing partition functions. In order to achieve a polynomial-time reduction, we consider a notion of next-token distinguishability, where the game is as follows: a prefix of tokens is chosen, based on which the generator generates a token to follow the prefix. Given the actual next token and the generated next token, the distinguisher aims to identify which is which. Algorithm 1 leverages a next-token distinguisher to iteratively increase likelihood. In particular, given any target $\delta$, Theorem 1 shows that Algorithm 1 will terminate and output an efficiently computable model which is nearly (to within $\delta$) indistinguishable from the truth, and it runs in time polynomial in $1/\delta$.
If $p\in\mathcal{Q}$ and one has an optimal distinguisher, one will eventually converge to a model close to $p$, as has been discussed heavily in the literature. However, our results are also meaningful in the more realistic case where one has imperfect distinguishers.
Contributions. The main contributions of this paper are:

showing that, although in general minimizing distinguishability and maximizing likelihood seem to be different, they are in fact closely related,

introducing a new model of next-token distinguishability that is necessary to make the reduction efficient, and

offering a new perspective on why GANs are less popular in NLP and other sequential domains than they are for images.
Organization. We begin by summarizing related work on GANs, especially for text. We then illustrate how GANs can be overkill for the simple case of $n$-gram models in Section 3. Section 4 covers log-linear models. Section 5 gives explicit bounds on general reductions between maximizing likelihood and minimizing distinguishability. Section 6 shows how the reduction can be computed efficiently in the case of sequential inputs, from which we derive a simple polynomial-time algorithm that provably finds a distribution which is nearly indistinguishable with respect to a given class of discriminator functions. Finally, we discuss the relevance of our work in Section 7.
2 Related Work
Generative models. Several approaches to generative modeling have been investigated, especially in the context of images. In particular, impressive results have been obtained recently with variational autoencoders, GANs, normalizing flows, autoregressive models, diffusion processes, and score/energy-based models
(VAEs; G2014; NormalizingFlows; PixelCNN2016; Diffusion2020; song2021scorebased; score2021; NeuralODEs). Generally, training approaches are either adversarial, or rely on MLE, contrastive divergence estimation, or score matching
(song2021maximum). Some connections have begun to emerge between these models, and alternate training procedures have been advocated (CZSLPCB2020; song2021scorebased; yair2021contrastive).

GANs for text. Since their introduction (G2014), there has been interest in adapting GANs to text data. The driving motivation was that, until very recently, samples generated by traditional (likelihood-based) models had been easy to distinguish from human-generated text, and the success of image GANs at generating realistic-looking samples suggested a possible avenue to improve the quality of their natural language counterparts.
The first and most significant challenge in adapting GANs to text arises from the very nature of this data. Goodfellow (goodfellow2017nips) points out that GANs require the generator to be differentiable, which poses a challenge for discrete text representations such as one-hot word or character representations. Two of the most popular approaches to circumvent this obstacle are policy gradient techniques (e.g., REINFORCE (williams1992simple)), which when applied to language modeling nevertheless often require maximum-likelihood pretraining (che2017maximum; yu2017seqgan), and the Gumbel-Softmax approximation (kusner2016gans). The few adversarial methods that do not require pretraining (e.g., press2017language; rajeswar2017adversarial) have failed to show significant promise in all but a few artificial tasks.
This nascent but active line of work seemed to suggest for a period of time that GANs might provide a breakthrough in text generation. This promise did not fully materialize, and instead the most recent breakthrough came from models built on very large transformer-based architectures like GPT (radford2018improving; radford2019language; brown2020language) or PaLM (PaLM2022), which are trained with traditional cross-entropy (MLE) objectives.
Yet the question of how GAN-based methods for text compare with likelihood-based ones still garners significant interest, and while various works have provided empirical comparisons between them, most suggesting an advantage for MLE-based methods (caccia2018language), theoretical explanations have been less explored.
Relating objectives via divergences. The connection between maximum likelihood estimation, distinguishability and divergences between probability distributions has been explored before. For example, it is well known that maximizing likelihood is equivalent to minimizing the KL divergence between certain families of fitted and reference distributions, though this is not the only divergence for which such a connection exists (rigollet2018entropic)
. On the other hand, from the moment GANs were introduced, G2014 noted that, assuming a perfect discriminator, the adversarial objective corresponds to minimizing a Jensen-Shannon divergence. Furthermore, the minimal discrimination error is also directly related to the total variation distance (see, e.g., hashimoto2019unifying). For exponential families, the gradient of the KL divergence is known to be related to the discrepancy between distributions (TH2015). While conceptually similar to this line of work, here we instead give an explicit reduction that shows how distinguishability and (log-)likelihood are in direct correspondence.

Pinsker's inequality is a well-known result linking KL divergence and total variation distance (TVD): $\mathrm{TV}(p,q)\le\sqrt{\tfrac12\,\mathrm{KL}(p\,\|\,q)}$. While related, this inequality is not directly relevant to the context of this work. First, while total variation provides an upper bound on distinguishability, it is not computable in general, so it is rarely used as a training objective for generative models. Second, being one-sided (reverse Pinsker inequalities exist only for particular cases, and they too are very loose in general (sason2015reverse)), it does not imply that reducing TVD reduces KL divergence. Furthermore, Pinsker's is in general a very loose inequality, particularly for the direction of KL divergence that is equivalent to MLE (i.e., $\mathrm{KL}(p\,\|\,q)$), since $q(x)=0$ for even a single $x$ with $p(x)>0$ leads to unbounded KL divergence. In contrast, in this work we provide a direct reduction linking the two criteria of interest: distinguishability and maximum likelihood.
Log-linear language models. In this work we focus our analysis on log-linear models (LMP2001; MFP2000), which are widely used in natural language processing (where they are often known as Maximum Entropy, or MaxEnt, models) for various tasks. In particular, these models have been a cornerstone of both neural (BDVJ2003; mikolov2013efficient) and non-neural (rosenfeld1994adaptive; khudanpur2000maximum) language modeling.
Boosting. The reduction shown here bears resemblance to boosting. It is well-known that boosting can be analyzed through the lens of maximum likelihood (FHT2000), and LL2002 formalized the equivalence of AdaBoost and maximum-likelihood training for exponential models. More recently, boosting has been applied to generative adversarial models (tolstikhin2017adagan; grover2018boosted), albeit with a different approach and objective than the connection drawn in this work.
3 Illustration: GANs for n-gram language models
To illustrate our main point, consider first the simplest model of language: a unigram model, where the probability of each word is independent and the end-of-sentence token has a given probability as well. If $\theta_w$ represents the log-probability of word $w$, then the log-probability of a sentence $s=(w_1,\ldots,w_k)$ is given by:
$$\log q_\theta(s)=\sum_{i=1}^{k}\theta_{w_i}+\theta_{\mathrm{EOS}}.$$
The MLE parameters $\hat\theta$ can be computed in linear time by simply counting word frequencies.
A more roundabout approach to fitting a unigram language model would be to start with any initial unigram model $q$, generate random samples from $q$, and compare them to those from $p$. One could then distinguish the two by finding a word that appears significantly more often in one than in the other. For example, if one generates text from the model and finds that the word "the" occurs much more often in text drawn from $p$, one would then update the parameters by increasing $\theta_{\text{the}}$ (and decreasing $\theta_w$ for all other words to keep a probability distribution). As we shall see later, if this more involved procedure converged, it would necessarily converge to the same maximum-likelihood estimator $\hat\theta$.
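Both routes can be sketched in a few lines (the corpus is a toy placeholder): direct MLE is one pass of counting, while the roundabout adversarial route repeatedly nudges the word whose model frequency deviates most from the data, and drifts to the same answer.

```python
import math
from collections import Counter

# Direct MLE for a unigram model: normalized word frequencies (toy corpus).
corpus = ["the cat sat on the mat", "the dog sat"]
tokens = " ".join(corpus).split()
counts = Counter(tokens)
mle = {w: c / len(tokens) for w, c in counts.items()}

# The roundabout "GAN-style" route: start uniform, repeatedly find the word
# whose model frequency deviates most from the data (a crude distinguisher)
# and nudge its log-probability, renormalizing each time.
vocab = list(counts)
theta = {w: 0.0 for w in vocab}
for _ in range(2000):
    z = sum(math.exp(theta[w]) for w in vocab)
    q = {w: math.exp(theta[w]) / z for w in vocab}
    w_star = max(vocab, key=lambda w: abs(mle[w] - q[w]))
    theta[w_star] += 0.05 if mle[w_star] > q[w_star] else -0.05

z = sum(math.exp(theta[w]) for w in vocab)
q = {w: math.exp(theta[w]) / z for w in vocab}
assert all(abs(q[w] - mle[w]) < 0.03 for w in vocab)  # converged to ~MLE
```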
A similar argument applies to any $n$-gram model, in which the probability of each subsequent word is determined only by the previous $n-1$ words. This is also optimized by frequency counts (a variety of "smoothing" techniques, e.g., adding 1 to counts, also known as Laplace smoothing (good1953population; kneser1995improved), are often used as a form of regularization on top of these counts). Distinguishers could similarly be used to find a model that is indistinguishable from $p$ according to $n$-gram frequencies, but again this would simply converge to the MLE parameters.
4 Equivalence for log-linear models
In this section, we show that there is one optimal log-linear model that both minimizes distinguishability and maximizes log-likelihood. Consider a log-linear model with features $\phi(x)=(\phi_1(x),\ldots,\phi_d(x))$, i.e., bounded features $\phi_i:\mathcal{X}\to[0,1]$. The model predicts
$$q_\theta(x)=\frac{e^{\langle\theta,\phi(x)\rangle}}{Z_\theta}, \qquad (1)$$
where $\langle\cdot,\cdot\rangle$ denotes inner product, $\theta\in\mathbb{R}^d$ is a parameter vector, and $Z_\theta=\sum_{x\in\mathcal{X}}e^{\langle\theta,\phi(x)\rangle}$ is a normalizing constant called the partition function.

In the unigram example, the features would be word counts normalized by dividing by the maximum allowed sentence length (to ensure $\phi_i(x)\in[0,1]$). In a neural network, the features would correspond to the top layer, and $q_\theta$ computes a softmax. Multiple strategies have been studied for computing or estimating the partition function (see, e.g., desjardins2011tracking).
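A minimal concrete instance (the domain and features below are illustrative placeholders): the model exponentiates a linear score in the features and divides by the partition function:

```python
import math

# Log-linear model q_theta(x) ∝ exp(<theta, phi(x)>) over a tiny domain.
X = ["aa", "ab", "ba", "bb"]

def phi(x):
    """Two bounded features in [0, 1]: normalized counts of 'a' and 'b'."""
    return (x.count("a") / 2, x.count("b") / 2)

def q(theta):
    scores = {x: math.exp(sum(t * f for t, f in zip(theta, phi(x)))) for x in X}
    Z = sum(scores.values())          # the partition function
    return {x: s / Z for x, s in scores.items()}

dist = q((1.0, -1.0))
assert abs(sum(dist.values()) - 1.0) < 1e-12   # properly normalized
assert dist["aa"] > dist["bb"]                 # 'a'-heavy strings upweighted
```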
As discussed earlier, these feature functions can also be thought of as classifiers that distinguish examples drawn from $q_\theta$ from those drawn from $p$, and the advantage of $\phi_i$ is $a_i=\mathbb{E}_{x\sim q_\theta}[\phi_i(x)]-\mathbb{E}_{x\sim p}[\phi_i(x)]$. The advantage vector is $a=(a_1,\ldots,a_d)$. Note that a negative advantage $a_i<0$ can be used for distinguishing by using the reverse classifier $1-\phi_i$ as a distinguisher, which has the opposite advantage $-a_i$.
Observation 1.
The gradient of $\mathrm{KL}(p\,\|\,q_\theta)$ with respect to $\theta$ is the advantage vector $a$, i.e., for all $i$:
$$\frac{\partial}{\partial\theta_i}\,\mathrm{KL}(p\,\|\,q_\theta)=a_i.$$
The above straightforward calculation is well-known, as is the fact that $\mathrm{KL}(p\,\|\,q_\theta)$ is convex in $\theta$. However, we interpret this fact in the context of GANs: searching for $\theta$ which gives a zero gradient for the KL divergence is equivalent to finding $q_\theta$ which is indistinguishable with respect to each $\phi_i$. While a number of GANs have been designed in various architectures that solve the seemingly more complex problem of minimizing distinguishability, it can generally be more efficient to maximize likelihood, which (approximately) minimizes the KL divergence.
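Observation 1 can be checked numerically by finite differences; the toy domain, true distribution, and features below are arbitrary choices, and the sketch uses the sign convention $a_i=\mathbb{E}_{q_\theta}[\phi_i]-\mathbb{E}_p[\phi_i]$ from above:

```python
import math

# Finite-difference check: for a log-linear family, d/d theta_i of
# KL(p || q_theta) equals a_i = E_{q_theta}[phi_i] - E_p[phi_i].
X = [0, 1, 2]
p = {0: 0.5, 1: 0.3, 2: 0.2}
phi = lambda x: (x / 2.0, 1.0 if x == 1 else 0.0)   # bounded features in [0, 1]

def q(theta):
    s = {x: math.exp(theta[0] * phi(x)[0] + theta[1] * phi(x)[1]) for x in X}
    Z = sum(s.values())
    return {x: v / Z for x, v in s.items()}

def kl(p, qd):
    return sum(p[x] * math.log(p[x] / qd[x]) for x in X)

theta = [0.7, -0.4]
qt = q(theta)
for i in range(2):
    eps = 1e-6
    tp, tm = list(theta), list(theta)
    tp[i] += eps
    tm[i] -= eps
    grad_fd = (kl(p, q(tp)) - kl(p, q(tm))) / (2 * eps)   # numeric gradient
    adv = sum(qt[x] * phi(x)[i] for x in X) - sum(p[x] * phi(x)[i] for x in X)
    assert abs(grad_fd - adv) < 1e-5
```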
5 Distinguishability is equivalent to increasing likelihood for general $\mathcal{F}$, $\mathcal{Q}$
In this section, we show how reducing log-loss is equivalent to distinguishing real and generated examples. This is the basis of a single step of our main algorithm (the reduction in this section is efficient for a single step, but the increase in runtime would lead to a general exponential-time algorithm). The bounds here are in terms of log-loss, as measured on a specific sample, rather than the abstract KL divergence quantity of the previous section, which cannot be computed exactly from a finite sample. In particular, we show how, if one can distinguish a given distribution from the sample, then one can decrease that distribution's log-loss, and vice versa.
For the remainder of this section, we drop $\theta$ from the variable denoting the fitted distribution to avoid cluttering the notation. Fix a sample $S=(x_1,\ldots,x_m)$ of training examples drawn from $p$, and define the log-loss to be:
$$\hat L(q)=\hat{\mathbb{E}}\Big[\log\frac{1}{q(x)}\Big]=\frac{1}{m}\sum_{i=1}^{m}\log\frac{1}{q(x_i)},$$
where we use a hat on $\hat L$ to denote that the loss is estimated on a (finite) training set $S$. Likewise, $\hat{\mathbb{E}}$ denotes the empirical expectation $\hat{\mathbb{E}}[g]=\frac{1}{m}\sum_{i=1}^{m}g(x_i)$. Note that the expected log-loss over training sets is known as the cross-entropy, and hence the expected difference in log-loss between two candidate distributions $q_1$ and $q_2$ is equal to the difference $\mathrm{KL}(p\,\|\,q_1)-\mathrm{KL}(p\,\|\,q_2)$, so minimizing log-loss approximately minimizes the KL divergence. Also, we define the training advantage of a distinguisher $f$ to be:
$$\hat a(f)=\mathbb{E}_{x\sim q}[f(x)]-\hat{\mathbb{E}}[f(x)], \qquad (2)$$
which is independent of $p$, depending only on the sample and $q$, and can thus be estimated to arbitrary accuracy using samples generated from $q$. The lemmas below show how one can use a distinguisher to reduce log-loss on the same training sample, and how to use a distribution with a lower log-loss to distinguish the two distributions.
Lemma 1.
Let $\epsilon\in(0,1]$ and suppose $f:\mathcal{X}\to[0,1]$ has training advantage $\hat a(f)\ge\epsilon$. Then, the probability distribution $q'$ with $q'(x)=q(x)e^{-\epsilon f(x)}/Z$, where $Z=\sum_{x}q(x)e^{-\epsilon f(x)}$, has lower log-loss:
$$\hat L(q')\le\hat L(q)-\epsilon\,\hat a(f)+\epsilon^2/2\le\hat L(q)-\epsilon^2/2.$$
Before we give the proof, we note that if $f$ is computed by a neural network and $q$ is computed as a neural network with a softmax at the top layer, i.e., $q(x)\propto e^{g(x)}$ where $g$ is some neural network, then $q'$ is naturally represented as the combined neural network $g(x)-\epsilon f(x)$ with a softmax at the top layer.
Proof (Lemma 1).
$$\hat L(q')=\hat{\mathbb{E}}\Big[\log\frac{1}{q'(x)}\Big]=\hat L(q)+\epsilon\,\hat{\mathbb{E}}[f]+\log Z. \qquad (3)$$
Since $\hat{\mathbb{E}}[f]=\mathbb{E}_{x\sim q}[f]-\hat a(f)$ by the definition of training advantage, it remains to show $\log Z\le-\epsilon\,\mathbb{E}_{x\sim q}[f]+\epsilon^2/2$. Using the bound $e^{-z}\le 1-z+z^2/2$ for any $z\ge 0$, we get that
$$Z=\sum_x q(x)e^{-\epsilon f(x)}\le 1-\epsilon\,\mathbb{E}_{x\sim q}[f]+\tfrac{\epsilon^2}{2}\,\mathbb{E}_{x\sim q}[f^2]\le 1-\epsilon\,\mathbb{E}_{x\sim q}[f]+\tfrac{\epsilon^2}{2},$$
where we have used the fact that $0\le f(x)\le 1$. Finally, $\log(1+u)\le u$ gives $\log Z\le-\epsilon\,\mathbb{E}_{x\sim q}[f]+\epsilon^2/2$, which together with (3) gives $\hat L(q')\le\hat L(q)-\epsilon\,\hat a(f)+\epsilon^2/2$. ∎
This means that if we can distinguish $q$ from $p$, then we can simply reduce log-loss by downweighting suspicious samples that satisfy the distinguisher $f$. The difference between this statement and Observation 1 is analogous to the difference between boosting and logistic regression (FHT2000). In logistic regression, one typically fixes the set of features in advance, whereas in boosting this is not possible if there are infinitely many possible classifiers.
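A numeric sanity check of this reweighting idea (the distributions and the distinguisher below are toy choices, not from the analysis): when $f$ has positive training advantage, i.e., it fires more often on model samples than on training data, downweighting via $q'(x)\propto q(x)e^{-\epsilon f(x)}$ lowers the empirical log-loss.

```python
import math
import random

random.seed(1)
X = list(range(5))
p_weights = [0.4, 0.3, 0.15, 0.1, 0.05]      # "true" distribution
q = {x: 1 / len(X) for x in X}               # uniform initial model
sample = random.choices(X, weights=p_weights, k=5000)

f = lambda x: 1.0 if x >= 3 else 0.0         # fires where q over-weights
emp = lambda g: sum(g(x) for x in sample) / len(sample)
adv = sum(q[x] * f(x) for x in X) - emp(f)   # training advantage, as in (2)
assert adv > 0

def logloss(model):
    return -sum(math.log(model[x]) for x in sample) / len(sample)

eps = adv                                    # step size on the order of the advantage
w = {x: q[x] * math.exp(-eps * f(x)) for x in X}
Z = sum(w.values())
q_new = {x: v / Z for x, v in w.items()}
assert logloss(q_new) < logloss(q)           # log-loss strictly decreases
```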
Conversely, we next show that if some $q'$ has a lower log-loss than $q$ on the training samples, then we can use it to distinguish $q$ from these samples.
Lemma 2.
For any constant $c>1$ and distributions $q,q'$ such that $c^{-1}\le q'(x)/q(x)\le c$ for all $x$, the distinguisher $f$ defined by
$$f(x)=\frac{1}{2}\Big(1-\log_c\frac{q'(x)}{q(x)}\Big)$$
has a training advantage of
$$\hat a(f)\ge\frac{\hat L(q)-\hat L(q')}{2\ln c}.$$
Proof (Lemma 2).
Let $g(x)=\ln\frac{q'(x)}{q(x)}$. By Jensen's inequality,
$$\mathbb{E}_{x\sim q}[g]\le\ln\mathbb{E}_{x\sim q}\Big[\frac{q'(x)}{q(x)}\Big]=\ln 1=0.$$
Since $\hat{\mathbb{E}}[g]=\hat L(q)-\hat L(q')$, the training advantage of $f$ is that of $g$ scaled by a factor of $-\frac{1}{2\ln c}$:
$$\hat a(f)=\mathbb{E}_{x\sim q}[f]-\hat{\mathbb{E}}[f]=\frac{1}{2\ln c}\big(\hat{\mathbb{E}}[g]-\mathbb{E}_{x\sim q}[g]\big)\ge\frac{\hat L(q)-\hat L(q')}{2\ln c}.$$
Finally, it is straightforward to verify that $f(x)\in[0,1]$ by our assumptions on the ratio between $q'$ and $q$. ∎
Importantly, due to the logarithmic dependence on $c$, the above lemma is meaningful even if $q$ and $q'$ are exponentially far apart, so long as they have the same support.
Lemma 1 implies a reduction between the problem of distinguishing with nontrivial advantage and that of nontrivially reducing log-loss for log-linear families. Note that iteratively applying the reduction requires repeated computation of the normalization terms over $\mathcal{X}$, and computing such partition functions is an area of active research, where it is known how to do it efficiently for some classes and not for others. The next section gives an efficient reduction for (unidirectional) sequential models.
6 Efficient Reduction for Sequential Models
This section gives an efficient reduction from distinguishing to MLE for sequential models. This requires showing how one can efficiently compute the normalization terms (partition functions) on a token-by-token basis for black-box sequential (e.g., language or autoregressive) models. The key insight for efficiency is that, rather than distinguishing entire sequences from $p$ and $q$, one distinguishes the conditional next-token predictions. In particular, rather than generating entire sequences from $q$, one can generate next-token predictions on all sequence prefixes in the training data.
Clearly, evaluating a neural network over all sequences is infeasible. However, in many applications such as NLP, the inputs are sequences $x=(x_1,\ldots,x_T)$, where every token is taken from a large discrete vocabulary. In such cases, the combinatorial nature of the data makes density estimation intractable unless the likelihood computation is broken into small sequential steps, by representing the overall probability as the product of per-token conditional probabilities.
In this section we show how a natural extension of the framework described above allows us to achieve an efficient reduction for this common type of sequential model. To do so, we define a simple generalization of the training advantage criterion (2), which now relies on a step-wise distinguisher operating on variable-length sequences. Formally, we consider a language $\mathcal{X}=V^T$ of length-$T$ sequences of tokens taken from a vocabulary $V$ (padding can be used to handle sequences of variable length), and distinguisher functions $f:\bigcup_{t\le T}V^t\to[0,1]$, i.e., functions which can take subsequences of any size as input. Given a sample of $m$ sequences $x^{(1)},\ldots,x^{(m)}$, we say that $f$ has generalized training advantage $\hat a(f)$ given by
$$\hat a(f)=\frac{1}{mT}\sum_{j=1}^{m}\sum_{t=1}^{T}\Big(\mathbb{E}_{v\sim q(\cdot\mid x^{(j)}_{<t})}\big[f(x^{(j)}_{<t}\,v)\big]-f\big(x^{(j)}_{1:t}\big)\Big), \qquad (4)$$
where, by convention, $x^{(j)}_{<1}$ denotes the empty sequence, so that $q(\cdot\mid x^{(j)}_{<1})$ is the model's distribution over the first token. This criterion can be interpreted as follows. For every length $t$, $f$ is tasked with distinguishing a subsequence consisting of the first $t$ tokens in a true sequence sampled from $p$ from another length-$t$ sequence in which the last element is replaced by a randomly selected token from the alternate distribution $q$.
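This criterion is cheap to estimate from data. The sketch below (vocabulary, sample, and model are toy placeholders) averages, over every prefix in the sample, the distinguisher's expected value on a model-drawn next token minus its value on the true one; the sign convention, an assumption of this sketch, makes the advantage positive when $f$ fires more on model-generated tokens than on real ones:

```python
# Estimating a generalized (next-token) training advantage on a toy corpus.
V = ["a", "b"]
sample = ["ab", "aa", "ab"]          # m = 3 sequences of length T = 2

def q_next(prefix, v):
    """Toy conditional model q(v | prefix): slightly prefers 'a' everywhere."""
    return 0.6 if v == "a" else 0.4

def f(seq):
    """A [0,1]-valued step-wise distinguisher: fires on the pattern 'aa',
    which the model over-produces relative to the sample."""
    return 1.0 if seq.endswith("aa") else 0.0

def gen_advantage(f, sample):
    m, T = len(sample), len(sample[0])
    total = 0.0
    for x in sample:
        for t in range(1, T + 1):
            # Expected f on a model-generated next token for this prefix ...
            fake = sum(q_next(x[:t - 1], v) * f(x[:t - 1] + v) for v in V)
            # ... minus f on the true first-t-tokens subsequence.
            real = f(x[:t])
            total += fake - real
    return total / (m * T)

assert abs(gen_advantage(f, sample) - 0.8 / 6) < 1e-12
```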
Lemma 3.
Let $\epsilon\in(0,1]$ and suppose $f$ has generalized training advantage $\hat a(f)\ge\epsilon$. We define a distribution $q'$ through its conditional probabilities as:
$$q'(v\mid x_{<t})=\frac{q(v\mid x_{<t})\,e^{-\epsilon f(x_{<t}\,v)}}{Z_{x_{<t}}},$$
where now $Z_{x_{<t}}=\sum_{v\in V}q(v\mid x_{<t})\,e^{-\epsilon f(x_{<t}\,v)}$. Then $q'$ incurs lower log-loss than $q$:
$$\hat L(q')\le\hat L(q)-T\big(\epsilon\,\hat a(f)-\epsilon^2/2\big).$$
The proof is deferred to Appendix A.
Next, we use Lemma 3 repeatedly to derive a simple algorithm that, given access to nontrivial weak distinguishers, returns a distribution that is nearly indistinguishable (by that class) from the true distribution $p$. Formally, let $\mathcal{F}$ be a class of distinguishers. We assume access to an oracle $O$ which for any $q$ returns a distinguisher $f\in\mathcal{F}$. In practice, such as in a typical GAN training setting, one could think of this oracle as being approximated by the subroutine that trains the discriminator. We say that $q$ is $\delta$-indistinguishable by oracle $O$ if its output $f$ has advantage $\hat a(f)\le\delta$. We do not need to assume that $O$ is optimal in any sense.
[Algorithm 1]
Theorem 1.
Let $q_0$ be a language model and let $\delta>0$. Algorithm 1 returns a distribution $q$ which is $\delta$-indistinguishable from $p$ by oracle $O$. It runs in time polynomial in $\hat L_0$, $1/\delta$, $T_O$, $T_f$, $|V|$, $T$, and $m$, where $\hat L_0$ is the log-loss of $q_0$, $T_O$ is the runtime of oracle $O$, $T_f$ is the complexity of evaluating any distinguisher on a single input, $|V|$ is the vocabulary size, $T$ is the sequence length, and $m$ is the number of training sequences.
Proof.
The fact that Algorithm 1 terminates with a distribution which is $\delta$-indistinguishable by $O$ is immediate from the stopping criterion.
Now, for the runtime analysis, note that, by construction, the iterates $q_i$ have training advantage greater than $\delta$. Thus, by Lemma 3, the algorithm decreases the log-loss by at least a fixed polynomial in $\delta$ in each iteration. Therefore, the total number of iterations is at most polynomial in $\hat L_0$ and $1/\delta$, where $\hat L_0$ is the log-loss of the initial model. Each iteration of Algorithm 1 requires calling oracle $O$ once, evaluating the returned distinguisher, and updating each of the next-token probabilities of $q$ for each sequence length $t\le T$ on each of the $m$ training sequences. Each of these updates involves evaluating $f$ plus an $O(|V|)$ partition normalization. Putting these together, we conclude that each iteration runs in time polynomial in the stated parameters.
Combining the two arguments above, we conclude the claimed total runtime for Algorithm 1. ∎
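A schematic rendering of the loop in Algorithm 1 (the oracle interface, the fixed step size `eps`, and all names here are illustrative assumptions, not the paper's exact pseudocode): while the oracle returns a next-token distinguisher with advantage above `delta`, fold it into the conditional probabilities and renormalize token by token, as in Lemma 3.

```python
import math

# Sketch of the boosting loop: improve the model until no distinguisher
# returned by the oracle has advantage above delta.
def boost(q_next, oracle, V, delta, eps, max_iters=100):
    """q_next(prefix, v) -> conditional probability; oracle(q_next) -> (f, adv)."""
    for _ in range(max_iters):
        f, adv = oracle(q_next)
        if adv <= delta:
            return q_next                  # delta-indistinguishable: done
        prev = q_next
        def q_next(prefix, v, prev=prev, f=f):
            # Downweight next tokens the distinguisher flags, then renormalize.
            w = {u: prev(prefix, u) * math.exp(-eps * f(prefix + u)) for u in V}
            return w[v] / sum(w.values())  # per-prefix partition function
    return q_next

# Tiny usage example: one round against a distinguisher that flags 'a'.
uniform = lambda prefix, v: 0.5
calls = []
def oracle(q):
    calls.append(1)
    if len(calls) == 1:
        return (lambda seq: 1.0 if seq.endswith("a") else 0.0), 0.2
    return (lambda seq: 0.0), 0.0

out = boost(uniform, oracle, ["a", "b"], delta=0.01, eps=0.5)
# Mass shifts away from 'a': 0.5*exp(-0.5) / (0.5*exp(-0.5) + 0.5) ≈ 0.38
assert out("", "a") < 0.4 and out("", "b") > 0.6
```

In a real GAN-style setting, the oracle would be approximated by the subroutine that trains the discriminator, and the step size would be chosen from the measured advantage.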
7 Discussion and conclusions
In this work, we have argued that minimizing log-loss (i.e., KL divergence) and minimizing statistical distinguishability are tightly related goals. Specifically, if the families of distinguishers and probability distributions are of similar power, then one can use a distinguisher to reduce log-loss. This means that in applications where it is natural to fit models by minimizing log-loss, doing so directly is likely to be a more efficient means of fitting a model. This is the case for $n$-gram language models (and other sequential tasks), for which perplexity (a measure of likelihood) is easy to compute and naturally meaningful, and which allow for efficient sampling. Thus, for a long time, minimizing log-loss has been the objective with which most state-of-the-art models are trained. For such models, Lemma 1 implies that if one can distinguish the model from samples by a neural network, then one can construct a larger neural network with lower log-loss. Hence, one may prefer to simply train a larger model in the first place.
Broader Impact
The contribution of this work is conceptual and theoretical, and as such, any nuanced discussion of the potential harms or benefits of its impact is irremediably tied to the applications where it might be put to use. We first make a few general observations regarding its immediate impact, and then discuss in a more informal manner the downstream ramifications these might have in applications.
This work revolves around comparing two training paradigms: maximum likelihood and adversarial learning. We believe that the maxim of 'choosing the right tool for the job' applies in this context too, and can have important downstream consequences. For example, the amount of resources consumed by training large generative models has grown substantially over the past few years (amodei2018ai). This is particularly true for Natural Language Processing, where state-of-the-art models are increasingly large and trained on increasingly large datasets, leading to striking computational and environmental costs (strubell2019energy). The key takeaway offered by this work, namely that training certain generative models like language models through adversarial methods is less efficient than doing so via likelihood maximization, could potentially lead to significant savings of these resources by steering practitioners away from adversarial approaches. On the other hand, it is not our intention for this work to lead to the opposite (but equally undesirable) effect of dissuading practitioners from choosing adversarial training approaches whenever those are a sensible choice.
Appendix A Proof of Lemma 3
We proceed analogously as in the proof of Lemma 1. We first note that
and
Let us use the shorthand notation . Subtracting the two equalities above we obtain
which, after adding and subtracting and rearranging terms, yields
(5)  
(6)  
(7) 
By assumption we have , so it remains to show that the second term is upper bounded by . Using, as before, the bound for every , we get that, for every :
where the last inequality follows again from the fact that for any . Therefore, the sum over these terms is upper bounded by , which, combined with (7), yields the desired result.