1 Introduction
Text generation is an essential task for many NLP applications, such as machine writing (Zhang et al., 2017a), machine translation (Bahdanau et al., 2014), image captioning (Rennie et al., 2017) and dialogue systems (Li et al., 2017). Text generation models work by either explicitly modeling the probability distribution of text (Mikolov et al., 2010; Yu et al., 2017), or implicitly learning a generator which maps noise data to text (Zhang et al., 2017b; Chen et al., 2018). Both approaches aim at generating text with the same distribution as the given text data.
To achieve the distribution-fitting goal, divergence metrics are usually applied as the training objective for text generation models; such metrics take the minimal value 0 if and only if the model distribution exactly recovers the real text distribution. Typical choices include the Kullback-Leibler divergence by maximum likelihood estimation (MLE) (Mikolov et al., 2010), and the Jensen-Shannon divergence or Wasserstein distance by adversarial training (Yu et al., 2017; Gulrajani et al., 2017). However, during evaluation, divergence-based metrics fail to distinguish two underfitting cases from each other: the low-quality case that generates unrealistic text, and the low-diversity case that generates dull and repeated text. As such, quality and diversity metrics are introduced to help with model diagnosis, such as BLEU (Papineni et al., 2002) and Self-BLEU (Zhu et al., 2018). High generation quality requires the model to generate realistic samples, i.e. generated samples are free of grammatical or logical errors. High generation diversity requires the model to generate diverse samples, i.e. generated samples are less likely to be duplicates and contain diverse unique patterns.
Despite the popular application of quality-diversity metrics in the evaluation of text generation models (Chen et al., 2018; Lu et al., 2018b; Fedus et al., 2018; Alihosseini et al., 2019), the relationship between such evaluation and the distribution-fitting goal is still not clear. It seems to be a tacit consensus in recent works that a model with both higher quality and higher diversity also better fits the real text distribution (Caccia et al., 2018; Li et al., 2019; d’Autume et al., 2019). However, this assumption is yet to be verified. This is critical since a potential inequivalence may result in misleading evaluation conclusions. In this paper, we try to answer this question under the unconditional text generation setting by a theoretical approach.
To bridge the gap between the distribution-fitting goal and quality-diversity evaluation, we require the optimal solutions from divergence minimization to be consistent with those of quality-diversity maximization. As such, we first give a general definition of quality and diversity. Then, we study a Multi-Objective Programming (MOP) problem which maximizes quality and diversity simultaneously. We prove there exists a family of Pareto-optimal solutions for this MOP problem, i.e. solutions which cannot be outperformed in terms of both quality and diversity. Then we prove the real distribution belongs to this Pareto-optimal family if and only if quality-diversity metrics are used in pairs with strong restrictions. Under such condition, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
For quality-diversity metrics used in practice, we show that the widely applied BLEU/Self-BLEU metric pair fails to match any divergence metric. This is highlighted by a counter-intuitive observation that real text samples are significantly outperformed by manually constructed models over both BLEU and Self-BLEU. Therefore, we further propose Coverage Rate (CR) and Negative Repetition Rate (NRR) as substitutes based on the above theoretical analysis. Experiments show that CR/NRR act well as quality/diversity metrics respectively, while a linear combination of CR/NRR acts well as a divergence metric.
2 Related Work
To evaluate the performance of text generation models, many evaluation metrics have been designed from different perspectives. Early neural text generation models use Perplexity (PPL) to show how well a language model fits the training data (Mikolov et al., 2010). This is a divergence-based metric, and is still adopted in recent works (Fedus et al., 2018; Lu et al., 2018a; Subramanian et al., 2018). Calculation of PPL may be intractable for implicit models, so other divergence-based metrics are also practical choices, such as Kernel Density Estimation (Zhang et al., 2017b), Word Mover Distance (Lu et al., 2018a), MS-Jaccard (Alihosseini et al., 2019), and Frechet Distance (Semeniuta et al., 2018; Alihosseini et al., 2019; d’Autume et al., 2019). However, divergence metrics provide limited information for model diagnosis, and may not correlate well with task performance (Chen et al., 1998; Fedus et al., 2018). Therefore, the quality and diversity of generated text are further considered as complementary metrics, which are also practical requirements in real applications (Zhang et al., 2018; Hashimoto et al., 2019; Gao et al., 2019).
For quality metrics, the evaluation is closely related to the ground truth distribution. Yu et al. (2017) propose to use Negative Log-Likelihood where the real distribution is known in advance, which measures the average log-probability of generated samples over the real distribution. If the real distribution is not explicitly given, BLEU (Papineni et al., 2002) and ROUGE (Lin & Och, 2004) are usually applied, which measure the n-gram overlap between generated samples and a set of reference ground truth samples. For diversity metrics, the evaluation is performed within the model itself. Li et al. (2015) proposed Distinct as a diversity metric, which calculates the ratio of unique n-grams in generated samples. Zhu et al. (2018) proposed Self-BLEU, which is similar to BLEU but uses generated samples as the reference set.
There was a time when only quality metrics were applied for evaluation, such as in the works on SeqGAN (Yu et al., 2017), RankGAN (Lin et al., 2017), and LeakGAN (Guo et al., 2017). However, after observing the quality-diversity tradeoff problem, Zhu et al. (2018) suggest using a hybrid of both quality and diversity metrics, such as BLEU and Self-BLEU. This suggestion is widely adopted by many analytical works (Lu et al., 2018b; Caccia et al., 2018; Semeniuta et al., 2018; Alihosseini et al., 2019), as well as newly proposed methods, such as FM-GAN (Chen et al., 2018), DDR (Li et al., 2019), and ScratchGAN (d’Autume et al., 2019). Despite the prevailing application of quality-diversity evaluation, its relationship with divergence metrics remains unclear, which poses great uncertainty for evaluation conclusions. Our work helps to build bridges between quality-diversity and divergence, and provides guidance for choosing appropriate quality-diversity metrics.
3 Definition of Quality and Diversity
Currently there is no unified definition for quality and diversity in text generation, which brings great challenges for further theoretical studies. In fact, it is not easy to define a general form of quality and diversity due to various understandings of these two aspects. Thus before moving on to further analysis, we first try to give a general form of quality and diversity in a mathematical view, though it may not be comprehensive enough to cover all possible understandings.
3.1 A General Form of Quality and Diversity
Text data is usually discrete, so we make the following notations. Assume the vocabulary size is , and the maximum length is , then the distribution of text data can be described by a categorical distribution with size . We denote the real distribution and the generated model distribution as and , respectively.
In general, the Quality of a text generation model measures how likely the generated text is to be realistic text in a human’s view. Since the value of real probability can be viewed as reflecting the realistic degree of a text , the expectation of some function over could be used to quantify quality. For example, in works of Yu et al. (2017) and Nie et al. (2018), Log-Likelihood (LL) is used as the quality metric, where . Following this idea, we propose a general form of quality, i.e., , where is a function over .
Similarly, the Diversity of a text generation model measures how much difference there is among generated texts. From the viewpoint of information, Shannon-Entropy (SE) of can be used as a natural diversity metric, where . From another point of view, a text should be less likely to be generated again if the diversity is high. This idea has been adopted in biology to evaluate the diversity of a biocoenosis, named the Simpson’s Diversity Index (SDI), where . Summarizing these two different understandings, we obtain a general form of diversity, i.e. .
To this end, we propose a general form of quality and diversity metrics as follows:
where is denoted as and as .
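The inline formulas in this subsection did not survive extraction, so the following is only a minimal sketch of the three named metrics on a toy categorical distribution. It assumes the symbols P for the real distribution and G for the model distribution, takes Log-Likelihood as the expectation of log P(x) under G, Shannon-Entropy in its standard form, and the Simpson's Diversity Index in the common "one minus repeat probability" form; the function names and toy values are illustrative, not taken from the paper.

```python
import math

def quality(G, P, f):
    """General-form quality: expectation of f(P(x)) for x drawn from G."""
    # Assumes P(x) > 0 wherever G(x) > 0 when f is the logarithm.
    return sum(g * f(p) for g, p in zip(G, P) if g > 0)

def shannon_entropy(G):
    """Shannon-Entropy of G, using the convention 0 * log 0 = 0."""
    return -sum(g * math.log(g) for g in G if g > 0)

def simpson_index(G):
    """Simpson's Diversity Index: 1 minus the chance that two draws coincide."""
    return 1.0 - sum(g * g for g in G)

P = [0.5, 0.3, 0.2]  # toy real distribution (illustrative values)
G = [0.4, 0.4, 0.2]  # toy model distribution (illustrative values)

ll = quality(G, P, math.log)  # Log-Likelihood quality
se = shannon_entropy(G)       # entropy-based diversity
sdi = simpson_index(G)        # repetition-based diversity
```

Passing a different function as `f` (for example the identity) gives other members of the same quality family without changing the rest of the code.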
3.2 The Rationality of Quality and Diversity
To guarantee and are rational quality and diversity metrics, we need to discuss the conditions of and . Without loss of generality, we first assume that is differentiable and is twice differentiable. Further, the following requirements are necessary for rational quality and diversity:

Generating more samples with higher real probability yields higher overall quality;

Distributing the probability more equally yields higher overall diversity.
Mathematically, these two requirements can be formalized as the following two properties:
1. If , then for , there is for any .
2. If , then for , there is for any .
Then we can obtain the conditions of and by the following theorem:
Theorem 1.
The following conditions are both sufficient and necessary to satisfy Properties 1-2: For any s.t. and , we have and .
According to Theorem 1, it is necessary for to be strictly monotonically increasing and to be strictly concave for . For simplicity, we only consider the cases where such properties hold for , thus get a sufficient condition:

is strictly monotonically increasing for ;

is strictly concave for .
Under this condition, we can see that a model with highest quality will distribute all its density to text with highest real probability, and a model with highest diversity will be uniform, which are consistent with human understandings.
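The two requirements can be checked numerically on a toy example. The sketch below assumes the Log-Likelihood / Shannon-Entropy instance of the general form and the symbols P (real) and G (model); `move_mass` is a hypothetical helper that shifts a small amount of probability between two texts.

```python
import math

def log_likelihood(G, P):
    """Quality: expected log real-probability of samples drawn from G."""
    return sum(g * math.log(p) for g, p in zip(G, P) if g > 0)

def shannon_entropy(G):
    """Diversity: Shannon-Entropy of G."""
    return -sum(g * math.log(g) for g in G if g > 0)

def move_mass(G, src, dst, eps):
    """Return a copy of G with eps probability moved from index src to dst."""
    H = list(G)
    H[src] -= eps
    H[dst] += eps
    return H

P = [0.6, 0.3, 0.1]  # toy real distribution
G = [0.2, 0.5, 0.3]  # toy model distribution

# Requirement 1: shift mass toward a text with higher real probability.
toward_likely = move_mass(G, src=2, dst=0, eps=0.1)
# Requirement 2: shift mass from a heavier text to a lighter one (more even).
more_even = move_mass(G, src=1, dst=0, eps=0.1)

quality_gain = log_likelihood(toward_likely, P) - log_likelihood(G, P)
diversity_gain = shannon_entropy(more_even) - shannon_entropy(G)
```

Both gains are positive here, and the extremes behave as the text describes: a point mass on the most likely text maximizes Log-Likelihood, while the uniform distribution maximizes entropy.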
4 Analysis of Quality-Diversity Evaluation
In this section, we show how and to what extent the quality-diversity evaluation can reflect the distribution-fitting goal. The key idea is to solve the Multi-Objective Programming (MOP) problem which tries to maximize quality and diversity simultaneously. We give the structure of all the Pareto-optima of this MOP problem, which constitute the Pareto-frontier. Then we prove the ground truth distribution lies on this frontier if and only if and are paired according to a given rule. Under such condition, a linear combination of quality and diversity constitutes a divergence metric, which means the quality-diversity evaluation is sufficient to reflect the distribution-fitting goal.
4.1 The MOP Problem
We consider the following MOP problem:
The goal is to maximize both quality and diversity, while keeping a legal distribution. The optimal solutions of an MOP problem are called Pareto-optima, which means no other solution can beat them consistently over all objectives.
We give definitions of the terminology of Pareto-optimality below:
Definition 1.
For two distributions and , if one of the following conditions is satisfied, we say that is dominated by .

and ;

and .
A solution is called a Pareto-optimum if it is not dominated by any . The set containing all the Pareto-optima is called the Pareto-frontier.
Intuitively, a Pareto-optimum is a solution for which no distribution can achieve both higher quality and higher diversity. All the Pareto-optima constitute the Pareto-frontier. The Pareto-frontier may collapse into one solution which leads to a global optimum, e.g. if is uniform, the unique optimal solution would be . However, it is often the case that the objectives in an MOP problem cannot reach their optima simultaneously, which results in a family of optimal solutions. Therefore, the structure of the Pareto-frontier under a non-uniform is what we care about.
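The two conditions of Definition 1 were lost in extraction; the sketch below assumes the standard Pareto dominance relation (no worse on both objectives and strictly better on at least one) and applies it to a finite list of (quality, diversity) pairs. The function names and the toy points are illustrative.

```python
def dominates(a, b):
    """True if point a = (quality, diversity) dominates point b."""
    return (a[0] >= b[0] and a[1] > b[1]) or (a[0] > b[0] and a[1] >= b[1])

def pareto_frontier(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy (quality, diversity) scores for five hypothetical models.
scores = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
frontier = pareto_frontier(scores)
```

On these toy scores the first three models form the frontier, and each step along it trades quality for diversity. The quadratic scan is fine for a handful of models; the MOP problem in the text ranges over all distributions, which is why Theorem 2 characterizes the frontier analytically instead.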
4.2 The Pareto-frontier
We show the structure of the Pareto-frontier by giving the following theorem:
Theorem 2.
For a distribution , if is not uniform, then:
(1) The following condition is both sufficient and necessary for to be a Pareto-optimum: there exist real values and such that for any , there is
where
(2) corresponds to , i.e. is fixed once is fixed. If for all , then is strictly monotonically increasing w.r.t. . If for all , then is strictly monotonically decreasing w.r.t. .
(3) Denote a Pareto-optimum as , then for any : if , there is and ; if , there is ; where , and , , , # denotes the cardinality of a set.
According to Theorem 2, different s lead to different distributions, so we can change from to and get a family of optimal solutions with different quality and diversity. As such, for a non-uniform , the Pareto-frontier is a family of distributions.
We can see that quality and diversity act as a tradeoff if we want to maximize them at the same time. Since all distributions in the Pareto-frontier are Pareto-optima, trying to improve one metric for an optimum will at best lead to another optimum, thus inevitably causing the other metric to drop. This result provides support for the quality-diversity tradeoff problem observed in previous works (Zhu et al., 2018; Caccia et al., 2018).
We show the result of Theorem 2 here on a special case. When we pair Log-Likelihood (LL) with Shannon-Entropy (SE), the corresponding Pareto-optima can be written as
we have , and . These Pareto-optima were previously used as quality-diversity tradeoff solutions by Li et al. (2019).
An illustration of the Pareto-frontier on a toy distribution is shown in Figure 1. We can see that quality and diversity are negatively correlated for solutions in the Pareto-frontier. Note that the ground truth distribution lies exactly on the frontier in this LL-SE case, which can be checked by setting . We will then show this is the key to the relation between quality-diversity metrics and divergence metrics.
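The closed form of the LL-SE Pareto-optima was lost in extraction. As a hedged sketch: maximizing a weighted sum of Log-Likelihood and Shannon-Entropy under the normalization constraint gives, by a Lagrange-multiplier argument, a model proportional to a power of the real distribution. The code below assumes that form, with the exponent written as `beta` (an assumed symbol name), and traces the resulting tradeoff curve on a toy distribution.

```python
import math

def power_family(P, beta):
    """Model proportional to P(x) ** beta; beta = 1 recovers P itself."""
    weights = [p ** beta for p in P]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(G, P):
    return sum(g * math.log(p) for g, p in zip(G, P) if g > 0)

def shannon_entropy(G):
    return -sum(g * math.log(g) for g in G if g > 0)

P = [0.5, 0.3, 0.2]                  # toy non-uniform real distribution
betas = [0.0, 0.5, 1.0, 2.0, 4.0]    # 0 gives uniform, large values sharpen

models = [power_family(P, b) for b in betas]
qualities = [log_likelihood(G, P) for G in models]
diversities = [shannon_entropy(G) for G in models]
```

As the exponent grows, quality rises and diversity falls, which reproduces the negative correlation described for Figure 1, and the real distribution sits on the curve at exponent 1.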
4.3 Relationship with Divergence
To bridge the gap between the distribution-fitting goal and quality-diversity evaluation, it is necessary for the optimal solutions from divergence minimization to be consistent with those from quality-diversity maximization. Since is the optimal solution with minimum divergence and the above Pareto-frontier is the set of optimal solutions with maximal quality and diversity, we require to be in the Pareto-frontier. Theoretical results are shown in the following theorem:
Theorem 3.
The following condition is both sufficient and necessary for to be a Pareto-optimum for any : there exist and such that
If the above condition is satisfied, then corresponds to a Pareto-optimum with and , and it is the only distribution that maximizes with , and becomes a divergence metric.
We find that if quality and diversity metrics are carefully chosen, namely is the integral of an affine transformation of , we can get a divergence metric by a linear combination of these two metrics.
The LL-SE case satisfies the condition in Theorem 3. Under this special case, there is , and
which is exactly the Reverse KL divergence if the constant is ignored. This linearly combined divergence metric can be viewed as a tangent line of the Pareto-frontier curve in Figure 1, and the real distribution is the tangent point.
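The displayed identity was lost in extraction, but the underlying fact is standard: the sum of Log-Likelihood and Shannon-Entropy equals the negative Kullback-Leibler divergence from the model to the real distribution. A minimal numerical check, assuming the symbols P (real) and G (model) and toy values:

```python
import math

def log_likelihood(G, P):
    return sum(g * math.log(p) for g, p in zip(G, P) if g > 0)

def shannon_entropy(G):
    return -sum(g * math.log(g) for g in G if g > 0)

def reverse_kl(G, P):
    """KL(G || P): expectation under G of log(G(x) / P(x))."""
    return sum(g * math.log(g / p) for g, p in zip(G, P) if g > 0)

P = [0.5, 0.3, 0.2]  # toy real distribution
G = [0.4, 0.4, 0.2]  # toy model distribution

combined = log_likelihood(G, P) + shannon_entropy(G)
gap = combined + reverse_kl(G, P)  # zero up to rounding if the identity holds
```

Because the divergence is zero only when the model equals the real distribution, the combined score is maximized exactly there, which is the sense in which this pair yields a divergence metric.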
Since such condition is also necessary, the real distribution is unlikely to be a Pareto-optimum if we use casually chosen metrics. This means there would be a distribution achieving both higher quality and higher diversity than the ground truth, which is implausible. Therefore, if the condition in Theorem 3 is not satisfied, it would be unlikely that a combination of quality and diversity measures the divergence.
We can now conclude that a hybrid of quality-diversity evaluation is sufficient to reflect the distribution-fitting goal. However, specific metrics should be chosen carefully, in order to avoid a potential violation of this property. Suppose the property is violated severely, featured by a huge gap between the ground truth distribution and the Pareto-frontier; then a model which perfectly fits the real distribution would be significantly outperformed by another model over both quality and diversity, resulting in misleading conclusions.
Therefore, in the next section, we will examine the existence of the gap for quality-diversity metrics used in practice, and provide suggestions on the choice of quality-diversity metrics.
5 Options for Quality-Diversity Metrics
It is yet to be examined whether existing quality-diversity metrics are sufficient to reflect the distribution-fitting goal. For metrics satisfying our general form defined in Section 3.1, conclusions can be drawn directly by applying Theorem 3. For example, the Log-likelihood (LL) is widely used as a quality metric, which corresponds to NLL-oracle (Yu et al., 2017) and Reverse PPL (Subramanian et al., 2018). As proved above, LL satisfies the condition in Theorem 3 if it is paired with Shannon Entropy (SE). Consequently, it is safe to use LL-SE together as in the work of Alihosseini et al. (2019).
However, for most scenarios with real text data, the general form of quality-diversity in Section 3.1 is intractable to calculate as the ground truth distribution is unknown; this includes the LL-SE pair. Practical metrics (e.g. BLEU and Self-BLEU) thus usually fall outside this framework, and Theorem 3 cannot be applied directly. In order to make a judgement on such metrics, we suggest considering the compatibility between divergence and the quality-diversity metric pair. We say a pair of quality-diversity metrics is divergence-compatible if the real distribution is a Pareto-optimum under the MOP problem maximizing both metrics. Such compatibility is a necessary condition for the existence of a corresponding divergence metric which is strictly monotonically decreasing w.r.t. both quality and diversity.
5.1 BLEU and Self-BLEU
BLEU (Papineni et al., 2002) and Self-BLEU (Zhu et al., 2018) are common metrics for quality and diversity evaluation, respectively. Intuitively, BLEU measures the n-gram overlap between a candidate set of generated text and a reference set of real text, while Self-BLEU is the average BLEU score of each generated text with the other candidates as reference. A high BLEU score means that n-grams in generated text are more likely to appear in real text, thus BLEU can be used as a quality metric. Similarly, a high Self-BLEU score means that generated texts are similar to each other in terms of n-grams, thus Negative Self-BLEU (abbreviated as NSBLEU) can be used as a diversity metric.
The expression of BLEU on a candidate set is:
where is the Brevity Penalty which penalizes short sentences, and denotes the maximum n-gram order.
is a precision term, which measures the proportion of n-grams in the candidate set that also appear in the reference set. BLEU is the geometric mean of for all , multiplied by a penalty term.
The expression of BLEU does not seem to satisfy the general form of quality/diversity defined in Section 3.1. However, in some special cases the general form is still satisfied, upon which we show some symptoms indicating the incompatibility of BLEU-NSBLEU. Assume the lengths of text are all , so that and . In this case, BLEU contains only one term, i.e. . Then for candidate set and reference set , the expectation of BLEU and NSBLEU over generated distribution and real distribution would be
Such expressions satisfy the general form with
The condition in Theorem 3 would be satisfied if and only if and , which becomes and . However, the size of the reference set is usually far more than , in which case the BLEU-NSBLEU metric pair would be divergence-incompatible.
Though the above analysis is done on a special case, such results imply a potential incompatibility for general BLEU-NSBLEU metric pairs. We will confirm this incompatibility by an empirical approach in Section 6.
Metrics  QDisc    DRate(%)  QDisc    DRate(%)  QDisc    DRate(%)
BS1      0.01287  2.55      0.01509  3.29      0.01063  3.15
BS2      0.02384  9.41      0.01699  4.27      0.01146  1.71
BS3      2.090    0.01      6.045    0.19      3.878    0.05
5.2 The Proposed Metric Pair
To avoid possible misleading conclusions in practice, we suggest using a divergence-compatible quality-diversity metric pair.
Since the real probability is required in under the general form in Section 3.1, calculation of most quality metrics is intractable on real text data. The only exception is the case with , paired with . The linearity of can avoid the explicit form of by sampling from real data, i.e. . We name the corresponding quality metric Coverage Rate (CR), and the diversity metric Negative Repetition Rate (NRR). Even so, we observe a large variance while estimating CR and NRR on real text data. This is mainly because of the extremely large space of text of . Therefore, estimations of CR/NRR are highly inaccurate in the text space.
We thus suggest calculating CR-NRR in n-gram space rather than in text space. Derive the n-gram distribution and from text distribution and , so that
where denotes the set of all possible n-grams. In practice, and can be estimated by the empirical distribution, i.e. count the number of target n-grams and divide by the total number. Note that if calculated by the longest n-gram with , and would exactly recover the original CR and NRR metrics in text space, and thus can be viewed as a generalized form. In the rest of this paper, we use CR-NRR as a default notation in the n-gram space unless explicitly stated.
In the n-gram space, calculation of metric pairs with other / functions also becomes possible. However, metrics such as LL-SE suffer from another smoothing problem on real text data, i.e. their values go to infinity if some n-grams do not appear in the candidate set or reference set. Therefore, we still suggest using CR-NRR as a first choice.
Though there is a conversion from the text space to the n-gram space, CR/NRR can still reflect quality/diversity. The metric measures the average probability for an n-gram in the candidate set to appear in the reference set, and thus is an indicator of quality. Similarly, measures the average probability for an n-gram to appear again in two consecutive sampling processes over the candidate set, and thus is an indicator of diversity.
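The description above can be turned into a small estimator. This is a sketch under stated assumptions: the defining formulas were lost in extraction, so CR is taken here as the sum over n-grams of the candidate probability times the reference probability, and NRR as the negative sum of squared candidate probabilities, with no extra scaling constant. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def ngram_distribution(sentences, n):
    """Empirical n-gram distribution of a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())  # assumes at least one n-gram exists
    return {gram: c / total for gram, c in counts.items()}

def coverage_rate(g_dist, p_dist):
    """Average reference probability of an n-gram drawn from the candidates."""
    return sum(g * p_dist.get(gram, 0.0) for gram, g in g_dist.items())

def negative_repetition_rate(g_dist):
    """Negative probability that two independent candidate draws coincide."""
    return -sum(g * g for g in g_dist.values())

candidates = [["a", "b", "c"], ["a", "b", "d"]]  # toy generated texts
references = [["a", "b", "c"]]                   # toy real texts

g_dist = ngram_distribution(candidates, n=2)
p_dist = ngram_distribution(references, n=2)
cr = coverage_rate(g_dist, p_dist)
nrr = negative_repetition_rate(g_dist)
```

Each set is scanned once to count n-grams, and the two sums run over the distinct candidate n-grams, so the cost grows linearly with the amount of text rather than with the product of the two set sizes.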
We then check the divergence-compatibility of CR-NRR evaluation. Firstly, CR-NRR is divergence-compatible w.r.t. distributions in the n-gram space, according to Theorem 3. We name the corresponding divergence metric CR-NRR Divergence (CND), where
and
Secondly, CR-NRR is also divergence-compatible w.r.t. distributions in the text space. Assume is dominated by under CR-NRR evaluation, which means would also be dominated by . This contradicts the compatibility in n-gram space, so the compatibility in text space also holds.
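The exact definition of CND was lost in extraction, so the following only illustrates why a linear combination of CR and NRR can act as a divergence. Assuming CR is the sum of products of the two n-gram probabilities and NRR is the negative sum of squared candidate probabilities, the combination "minus two CR minus NRR" differs from the squared Euclidean distance between the two n-gram distributions only by a term that depends on the reference alone. The distributions below are toy values, and the scaling is an assumption rather than the paper's stated form.

```python
def coverage_rate(g_dist, p_dist):
    return sum(g * p_dist.get(gram, 0.0) for gram, g in g_dist.items())

def negative_repetition_rate(g_dist):
    return -sum(g * g for g in g_dist.values())

def squared_distance(g_dist, p_dist):
    """Squared Euclidean distance between two sparse distributions."""
    keys = set(g_dist) | set(p_dist)
    return sum((g_dist.get(k, 0.0) - p_dist.get(k, 0.0)) ** 2 for k in keys)

g_dist = {"ab": 0.5, "bc": 0.25, "bd": 0.25}  # toy candidate n-gram distribution
p_dist = {"ab": 0.5, "bc": 0.5}               # toy reference n-gram distribution

reference_term = sum(p * p for p in p_dist.values())  # constant w.r.t. the model
combination = (reference_term
               - 2.0 * coverage_rate(g_dist, p_dist)
               - negative_repetition_rate(g_dist))
distance = squared_distance(g_dist, p_dist)
```

Under these assumptions the combination is non-negative and vanishes only when the candidate n-gram distribution equals the reference one, which is the behavior expected of a divergence.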
In addition to the divergence-compatibility property, CR-NRR is also easy to compute. It does not require the explicit value of or , and thus can be applied to implicit models similarly to BLEU-NSBLEU. Moreover, the time complexity of the CR-NRR algorithm is , which is much lower than that of BLEU-NSBLEU with , where and denote the size of the candidate and reference sets respectively. To conclude, we suggest using CR-NRR in n-gram space for quality-diversity evaluation, instead of BLEU-NSBLEU.
6 Experiments
In this section, we perform a compatibility analysis of BLEU-NSBLEU, compared with CR-NRR, on both synthetic data and real text data. We show that BLEU-NSBLEU is significantly divergence-incompatible, by observing that ground truth text data are clearly outperformed over both BLEU and NSBLEU by some manually constructed models. We also show that CR/NRR are representative for quality/diversity evaluation respectively, while CND is representative for divergence evaluation.
To measure the degree of incompatibility, we calculate the Quality Discrepancy (QDisc) and Discrepancy Rate (DRate):
Intuitively, we try to find a model with the best quality whose diversity is no lower than that of the real distribution. QDisc then measures the difference between this model and the real distribution in terms of quality. DRate measures the ratio between QDisc and the total range of quality over all Pareto-optima. A metric pair is divergence-compatible if and only if .
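The defining formulas of QDisc and DRate were lost in extraction; the sketch below follows the verbal description only. It assumes a tradeoff curve given as a finite list of (quality, diversity) points on which quality falls as diversity rises, reads off the curve's quality at the diversity of the real data by linear interpolation, and reports DRate in percent. All names and numbers are illustrative.

```python
def interpolate_quality(curve, d_real):
    """Quality of the curve at diversity d_real, by linear interpolation."""
    pts = sorted(curve, key=lambda p: p[1])  # sort by diversity
    for (q0, d0), (q1, d1) in zip(pts, pts[1:]):
        if d0 <= d_real <= d1:
            if d1 == d0:
                return max(q0, q1)
            t = (d_real - d0) / (d1 - d0)
            return q0 + t * (q1 - q0)
    raise ValueError("d_real lies outside the diversity range of the curve")

def qdisc_and_drate(curve, real_point):
    """Quality gap at matched diversity, and that gap as a percentage of the
    quality range spanned by the curve."""
    q_real, d_real = real_point
    qdisc = interpolate_quality(curve, d_real) - q_real
    qualities = [q for q, _ in curve]
    drate = 100.0 * qdisc / (max(qualities) - min(qualities))
    return qdisc, drate

curve = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]  # toy (quality, diversity) curve
real = (0.4, 0.5)                              # toy scores of the real data
qdisc, drate = qdisc_and_drate(curve, real)
```

A positive QDisc means some model on the curve matches the diversity of the real data while scoring higher quality, which is the symptom of incompatibility the section measures.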
6.1 Experiments on Synthetic Data


         MSCOCO                                WMT
Metrics  QDisc  DRate(%)  SelfRatio  RefRatio  QDisc  DRate(%)  SelfRatio  RefRatio
BS3      0.090  9.0       0.152      0.81      0.117  11.7      0.242      0.88
BS4      0.162  16.2      0.274      1.46      0.211  21.1      0.437      1.59
BS5      0.211  21.1      0.350      1.99      0.258  25.8      0.528      1.97
CN3      1.07   0.053     0.0092     0.087     3.45   0.098     0.0125     0.358
CN4      1.18   0.079     0.0095     0.125     3.25   0.103     0.0116     0.489
CN5      1.33   0.098     0.0082     0.207     2.57   0.079     0.0089     0.689
We first run experiments on synthetic data rather than real text data, in order to get the precise values of all metrics. Under this setting, the generated distribution and real distribution are explicitly given in advance, which eliminates the possible variance from sampling. The synthetic data are texts with length using a pseudo vocabulary . We construct the real distribution using an oracle LSTM model as in SeqGAN (Yu et al., 2017), whose weights are randomly sampled from a Gaussian distribution with . Different standard deviations are applied to get several synthetic real distributions with different levels of entropy, i.e. a distribution with smaller is flatter and of higher entropy, and a distribution with larger is sharper and of lower entropy.
Calculation of QDisc and DRate can be achieved by a simple binary-search algorithm if the exact form of the Pareto-frontier is known. However, for the BLEU-NSBLEU metric pair, the frontier is unknown since Theorem 2 cannot be applied in this case. Consequently, we opt to use an optimization-based method for the estimation of QDisc. We try to solve the following optimization problem using stochastic gradient descent (SGD) with momentum:
where is a penalty term to discourage the case where diversity is lower than that of the real distribution . We set in our experiments, so that , and the denominator in DRate is also calculated through this optimization-based method.
For the BLEU metric with candidate set size and reference set size , the expectation can be directly calculated by
The time complexity (number of terms) of such calculation is . This is intolerable for the above optimization problem even in a text space of normal size. As a result, we set , and apply SGD under the TensorFlow framework (a slight increase of any parameter will consume intolerably more time, and is not necessary for the conclusions).
We use CNn and BSn as abbreviations for CR-NRR and BLEU-NSBLEU with n-gram, respectively. We report the QDisc and DRate of BLEU-NSBLEU in Table 1. Note that the reported QDisc values are lower bounds, since the optimization-based method does not guarantee a global optimum. These non-zero QDisc values provide clear support for the incompatibility of BLEU-NSBLEU. We can also see that such discrepancy is significant in some cases, e.g. and for BS2 on data with . A QDisc value of 0.02 means that we cannot confidently claim that one model is better than another when the quality gap is below 0.02, which is already a clear gap for BLEU. We also run similar experiments for CR-NRR. However, no positive lower bound is observed, which is in accordance with our theory.
6.2 Experiments on Real Text Data
The significance of quality discrepancy varies from case to case, so we examine the discrepancies on real text data. We use two public datasets, the MSCOCO Image Caption dataset (Chen et al., 2015) and the EMNLP2017 WMT News dataset (http://statmt.org/wmt17/translation-task.html). We use 50,000 sentences as the candidate set and another 50,000 as the reference set for each dataset (see the appendix for detailed configurations).
To provide an estimation of QDisc and DRate, we manually construct a family of strong models. We mix the empirical distribution with a truncated uniform distribution under different proportions, i.e. . During text generation, a random text from the reference set is sampled with probability , otherwise a text with random tokens of length is constructed with probability . We set in our experiments. (We observe that a shorter noise length leads to a stronger model, which is helpful for a better estimation of QDisc. Thus we set to equal the highest n-gram order in the evaluation metrics.)
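The sampling procedure of this mixture model can be sketched as follows. The mixing symbol and its exact convention were lost in extraction, so `noise_prob` (the probability of emitting random tokens), `noise_len`, and the toy vocabulary are all assumed names and values.

```python
import random

def sample_mixture(reference, vocab, noise_prob, noise_len, rng):
    """Draw one text: uniform random tokens with probability noise_prob,
    otherwise a copy of a random text from the reference set."""
    if rng.random() < noise_prob:
        return [rng.choice(vocab) for _ in range(noise_len)]
    return list(rng.choice(reference))

rng = random.Random(0)  # seeded generator for reproducibility
reference = [["a", "cat", "sat"], ["the", "dog", "ran", "home"]]
vocab = ["a", "cat", "sat", "the", "dog", "ran", "home"]

# Sweeping noise_prob between 0 and 1 gives a family of models that trades
# quality for diversity.
samples = [sample_mixture(reference, vocab, 0.3, 3, rng) for _ in range(5)]
```

With the noise probability at zero the model only copies reference texts, and at one it emits pure noise; intermediate values trace the curve against which the real data are compared.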
We estimate QDisc by a linear interpolation between the two closest points on the curve w.r.t. the quality of real data. For the denominator of DRate in BLEU-NSBLEU, we use directly, since is reached for highest quality with , and for highest diversity with . For CR-NRR, CR goes to when diversity is maximized with . As for the maximal value of CR, we estimate it by using a single reference sentence as candidate and selecting the one with maximal CR value.
For a clearer view of the significance of quality discrepancy, we introduce two additional metrics: SelfRatio and RefRatio. SelfRatio calculates the ratio between QDisc and the quality of the candidate set. RefRatio calculates the ratio between QDisc and the quality difference of and . The evaluation results of BLEU-NSBLEU and CR-NRR under gram are shown in Figure 2.
We can see that real data stay close to the CR-NRR curve, while a much larger gap is observed between real data and the BLEU-NSBLEU curve. We give the values of QDisc, DRate, SelfRatio, and RefRatio in Table 2. BLEU-NSBLEU shows a significant incompatibility, with QDisc values ranging from 0.090 to 0.258. Such a huge discrepancy in BLEU is unacceptable in real applications, e.g. we cannot claim one model is better than another even if it achieves higher NSBLEU and significantly higher BLEU. As a result, we suggest not using BLEU-NSBLEU in order to avoid misleading conclusions. CR-NRR also shows a small positive discrepancy, which is due to the inevitable difference between the empirical distributions of the candidate set and reference set. However, the discrepancy caused by such distribution difference is far smaller than that of BLEU-NSBLEU in terms of DRate, SelfRatio, and RefRatio.


Next we show how CR/NRR/CND behave on real text data. We apply a temperature sweep on an RNN-based language model (RNNLM) pretrained by maximum likelihood estimation, which is a quick way to get a family of models with a quality-diversity tradeoff according to Caccia et al. (2018). The RNNLM consists of an embedding layer, an LSTM layer, and a fully-connected output layer. The embedding dimension and number of hidden nodes are both set to 128. We train the model using the Adam (Kingma & Ba, 2014) optimizer with learning rate for 30 epochs. As temperature grows, the model becomes closer to uniform, so that quality decreases and diversity increases, and minimal divergence is taken near . Results are shown in Figure 3, where we can see CR/NRR/CND are representative for quality/diversity/divergence respectively, which clearly fits our expectations. Therefore, we suggest using CR-NRR for quality-diversity evaluation.
7 Discussion
Our above conclusions are mainly drawn under the unconditional text generation setting, however, qualitydiversity evaluation is also getting great attentions under conditional text generation settings, such as dialogue system (Vijayakumar et al., 2016), machine translation (Shen et al., 2019) and image captioning (Ippolito et al., 2019). In this section, we give a brief discussion about qualitydiversity evaluation under conditional text generation settings.
Due to different formalization of quality and diversity metrics, our conclusions cannot be directly transferred to conditional text generation settings. Under these settings, the quality of text under condition is still defined as monotonically increasing w.r.t. the real conditional probability . So that the overall quality metric becomes the expectation of text quality over and , which is the case for BLEU. Meanwhile, diversity metrics have two different understandings. One is defined as the average diversity of conditional model distribution under different , such as PairwiseBLEU (Shen et al., 2019). The other is define as the diversity of marginal model distribution , such as Distinct (Li et al., 2015). Formalization of both quality and diversity metrics depart from ours in Section 3.1, and may result in different conclusions, thus require further separate analysis. Though such analyses are not covered here, our work provides a paradigm for future theoretical analysis, including metric definition, Paretooptimality analysis, and divergencecompatibility judgement.
Another difference lies in the view of the task goal. While the goal of unconditional text generation is to design models that better fit the text distribution, in conditional text generation better human evaluation results are viewed as the final goal in most cases. In these cases, the main focus would therefore be designing metrics that better reflect human evaluation, as well as designing training objectives that achieve better evaluation results. Whether human evaluation is compatible with divergence also remains an open question. We regard these as our future work.
8 Conclusion
In this paper, we give a theoretical analysis of the relation between quality-diversity evaluation and the distribution-fitting goal. We show that when using properly paired quality-diversity metrics, i.e. is the integral of an affine transformation of , a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. For metrics used in practice, we show that the commonly used BLEU and Self-BLEU metric pair fails to reflect the distribution-fitting goal. As a substitute, we suggest using CR-NRR as the quality-diversity metric pair.
Acknowledgement
This work was supported by the Beijing Academy of Artificial Intelligence (BAAI) under Grants No. BAAI2019ZD0306 and BAAI2020ZJ0303, the National Natural Science Foundation of China (NSFC) under Grants No. 61722211, 61773362, 61872338, 61902381, and 61906180, the Youth Innovation Promotion Association CAS under Grants No. 20144310 and 2016102, the National Key R&D Program of China under Grants No. 2016QY02D0405, and the Lenovo-CAS Joint Lab Youth Scientist Project.
References
 Alihosseini et al. (2019) Alihosseini, D., Montahaei, E., and Baghshah, M. S. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp. 90–98, 2019.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Caccia et al. (2018) Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. Language gans falling short. arXiv preprint arXiv:1811.02549, 2018.
 Chen et al. (2018) Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, pp. 4666–4677, 2018.
 Chen et al. (1998) Chen, S. F., Beeferman, D., and Rosenfeld, R. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, pp. 275–280. Citeseer, 1998.
 Chen et al. (2015) Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
 d’Autume et al. (2019) d’Autume, C. d. M., Rosca, M., Rae, J., and Mohamed, S. Training language gans from scratch. arXiv preprint arXiv:1905.09922, 2019.
 Fedus et al. (2018) Fedus, W., Goodfellow, I., and Dai, A. M. Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736, 2018.
 Gao et al. (2019) Gao, X., Lee, S., Zhang, Y., Brockett, C., Galley, M., Gao, J., and Dolan, B. Jointly optimizing diversity and relevance in neural response generation. arXiv preprint arXiv:1902.11205, 2019.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777, 2017.
 Guo et al. (2017) Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624, 2017.
 Hashimoto et al. (2019) Hashimoto, T. B., Zhang, H., and Liang, P. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792, 2019.
 Ippolito et al. (2019) Ippolito, D., Kriz, R., Sedoc, J., Kustikova, M., and Callison-Burch, C. Comparison of diverse decoding methods from conditional language models. pp. 3752–3762, 2019.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Li et al. (2015) Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
 Li et al. (2017) Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
 Li et al. (2019) Li, J., Lan, Y., Guo, J., Xu, J., and Cheng, X. Differentiated distribution recovery for neural text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6682–6689, 2019.
 Lin & Och (2004) Lin, C.-Y. and Och, F. Looking for a few good metrics: Rouge and its evaluation. In Ntcir Workshop, 2004.
 Lin et al. (2017) Lin, K., Li, D., He, X., Zhang, Z., and Sun, M.-T. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165, 2017.
 Lu et al. (2018a) Lu, S., Yu, L., Zhang, W., and Yu, Y. Cot: Cooperative training for generative modeling of discrete data. arXiv preprint arXiv:1804.03782, 2018a.
 Lu et al. (2018b) Lu, S., Zhu, Y., Zhang, W., Wang, J., and Yu, Y. Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133, 2018b.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 Nie et al. (2018) Nie, W., Narodytska, N., and Patel, A. Relgan: Relational generative adversarial networks for text generation. 2018.
 Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
 Rennie et al. (2017) Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. Self-critical sequence training for image captioning. In CVPR, volume 1, pp. 3, 2017.
 Semeniuta et al. (2018) Semeniuta, S., Severyn, A., and Gelly, S. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936, 2018.
 Shen et al. (2019) Shen, T., Ott, M., Auli, M., and Ranzato, M. Mixture models for diverse machine translation: Tricks of the trade. arXiv: Computation and Language, 2019.
 Subramanian et al. (2018) Subramanian, S., Mudumba, S. R., Sordoni, A., Trischler, A., Courville, A. C., and Pal, C. Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems, pp. 7551–7563, 2018.
 Vijayakumar et al. (2016) Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D. J., and Batra, D. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv: Artificial Intelligence, 2016.
 Yu et al. (2017) Yu, L., Zhang, W., Wang, J., and Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858, 2017.
 Zhang et al. (2018) Zhang, H., Lan, Y., Guo, J., Xu, J., and Cheng, X. Tailored sequence to sequence models to different conversation scenarios. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1479–1488, 2018.
 Zhang et al. (2017a) Zhang, J., Feng, Y., Wang, D., Wang, Y., Abel, A., Zhang, S., and Zhang, A. Flexible and creative chinese poetry generation using neural memory. arXiv preprint arXiv:1705.03773, 2017a.
 Zhang et al. (2017b) Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017b.
 Zhu et al. (2018) Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1097–1100. ACM, 2018.
Appendix
Appendix A Preliminaries
Before starting the proofs, we first introduce some preliminaries on the constrained convex optimization problem. Assume , , and are continuously differentiable functions defined on , and consider the constrained convex optimization problem defined as follows:
(1)  
The optimal solutions of the above problem are given by the Lagrange multiplier approach, as shown in the following theorem:
Theorem 4.
Assume and are convex, are affine, and are strictly feasible (there exists one satisfying for all ). Define the Lagrange function as:
where . Then the following conditions are both sufficient and necessary for to be a solution of problem 1.
(2)  
The conditions in Equation 2 are called the Karush-Kuhn-Tucker (KKT) conditions.
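For reference, a standard textbook statement of the problem, the Lagrange function, and the KKT conditions is sketched below. The symbols are assumptions chosen for illustration ($f$ for the objective, $g_i$ for the inequality constraints, $h_j$ for the equality constraints, $\mu_i$ and $\lambda_j$ for the multipliers, $x^*$ for the candidate solution) and may differ from the notation of Equations 1 and 2.

```latex
\begin{aligned}
&\min_{x \in \mathbb{R}^n} f(x)
  \quad \text{s.t.} \quad g_i(x) \le 0,\ i = 1, \dots, m;
  \qquad h_j(x) = 0,\ j = 1, \dots, k \\
&L(x, \mu, \lambda) = f(x) + \sum_{i=1}^{m} \mu_i\, g_i(x) + \sum_{j=1}^{k} \lambda_j\, h_j(x),
  \qquad \mu_i \ge 0 \\
&\nabla_x L(x^*, \mu, \lambda) = 0, \quad
  g_i(x^*) \le 0, \quad
  h_j(x^*) = 0, \quad
  \mu_i \ge 0, \quad
  \mu_i\, g_i(x^*) = 0
\end{aligned}
```

The last line lists, in order, stationarity, primal feasibility for both constraint types, dual feasibility, and complementary slackness. Strict feasibility (a point with $g_i(x) < 0$ for all $i$) is the Slater-type condition that makes these conditions necessary as well as sufficient in the convex case.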
Appendix B Proof of Theorem
For property 1, from , we get . We then get the conclusion by setting and .
For property 2, is true for any . Denote and , then we have for any . Since , we need for . Then, since is true for any . Set and and we get for any and .
Appendix C Lemmas
We give two lemmas to support the proof of Theorem and Theorem .
c.1 Lemma
Lemma 1.
If is a Pareto-optimum, then the following conditions are satisfied: if , then ; if , then .
If , assume , we can construct where for all and . As such, but . This means is dominated by , which contradicts the fact that is a Pareto-optimum. So .
If , assume , and we can further assume . Again we construct where for all and . Clearly we have , and . Since is strictly concave, we have , which means is dominated by . This is a contradiction, so .
c.2 Lemma
Lemma 2.
Assume and , then the distribution that maximizes satisfies , and .
Define the optimization problem as follows:
Again we first check that the prerequisites of KKT are all satisfied. is linear and is convex w.r.t. ; is affine w.r.t. ; since all can be positive, the inequalities are all strictly feasible.
The Lagrange function is:
Applying KKT, we get the following conditions for an optimal solution:
For , there is , so
for , there is , so
Denoting and and combining the two cases, we get:
The above derivation is both sufficient and necessary, which completes the proof.
Appendix D Proof of Theorem
We give the proofs for three conclusions individually.
d.1 Conclusion
Here we only consider the case with , and the case where will be incorporated into conclusion 3. We try to find a distribution with the highest diversity while quality is not lower than . Define a convex optimization problem as follows:
For to be a Pareto-optimum, it is necessary for to be a solution of the above problem. We therefore solve this problem next.
We first check that the prerequisites of KKT are all satisfied. is convex w.r.t. ; is affine w.r.t. ; and are convex (linear) w.r.t. ; since all can be positive and , the inequalities are all strictly feasible.
The Lagrange function is:
Applying KKT, we get the following conditions for an optimal solution:
Since we need to be a solution, we have
For , there is , so ; for , there is , so . Denoting and and combining the two cases, we get:
where
We have now obtained a necessary condition for to be a Pareto-optimum. To make it sufficient, we further require that, for any two distributions of this form, neither dominates the other. This property can be proved by combining conclusion and .
d.2 Conclusion
We separate the proof into two parts: (1) corresponds to ; (2) the monotonicity of w.r.t. .
(1) The sum of all should be . Denote
Since is strictly monotonically decreasing, is monotonically nonincreasing w.r.t. . If , there is a term which is strictly monotonically decreasing w.r.t. , in which case is strictly monotonically decreasing w.r.t. . Also, is continuous w.r.t. since is continuous. When
there is
so ; when
there is
so . From the above analysis, the value of can reach or exceed . Combined with the monotonicity of , there exists one and only one that satisfies , leading to a valid distribution.
(2) Define as above. Since represents the total probability of a distribution, there should be , thus .
where . By the condition , we get
Since , if for all , we get , thus is strictly monotonically increasing w.r.t. . Similarly, if for all , we get , thus is strictly monotonically decreasing w.r.t. .
d.3 Conclusion
We also separate the proof into two parts: (1) the uniqueness of ; (2) the monotonicity of and w.r.t. .
(1) Since is not uniform, we can denote , , as they are in the theorem. According to Lemma 1, since is the largest one, the corresponding is also the largest one, which means