Recently, natural language processing (NLP) has enjoyed unprecedented progress, largely due to developments in deep learning. In this regard, considerable attention in the NLP community is devoted to topics related to automatic question answering (QA). In comparison, the inverse task, automatic question generation (QG), has received significantly less attention. Although QG Du et al. (2017); Serban et al. (2016); Pan et al. (2019); Kim et al. (2019) leads a somewhat niche existence, it has a plethora of potential applications, such as improving the training of QA systems and helping to create material in the educational domain Chen et al. (2018). Automatically creating questions, and tailoring them to individual learners, can relieve educators of this tedious task, improving the educational experience of both teachers and students.
In this paper, we consider collaborative learning of QA and QG. The key idea of this work is that, as question answering and generation are naturally related tasks, leveraging their connection should be mutually beneficial in terms of performance as well as in reducing the amount of labeled data, e.g. for training a QA system. The proposed solution builds upon two recent variants of the self-attention Transformer network architecture Vaswani et al. (2017), namely GPT-2 Radford et al. (2019) and BERT Devlin et al. (2018).
Essentially, the Transformer architecture consists of two main building blocks, stacked encoders and decoders, which are connected in a cascaded fashion. Each encoder is further divided into two components: a self-attention layer and a feed-forward neural network. Self-attention allows for attending to the encodings of specific words and therefore establishes a focus context w.r.t. each word. The decoder has an additional encoder-decoder attention layer that sits between the self-attention layer and the feed-forward network. This allows the decoder to attend to specific parts of the input sequence. Compared to the original Transformer architecture, GPT-2 discards the encoder blocks, reducing it to a decoder stack. It provides traditional language model functionality, predicting the next word in a sequence given the history. Consequently, it is naturally applicable to general generative tasks. However, since it is not optimized for question generation, there is no guarantee that the generated questions are valid and answerable. In contrast, the latter model, BERT, is a masked language model. It only allows for predicting a masked-out word conditioned on both its left and right context, notably establishing word embeddings in a context-specific and bi-directional manner. What is more, BERT is trained for discriminative QA by applying a specific regression head. Specifically, it predicts the answer text span in the given paragraph for a given question. Beyond QA, BERT has shown extreme versatility in terms of applicability to numerous downstream tasks. In comparison to the conventional Transformer network, and in contrast to GPT-2, it discards the decoder blocks, reducing it to a pure encoder stack.
This work relates to recent studies which attempt to improve the performance of a discriminative QA model with generative QG models Lewis and Fan (2019); Wang et al. (2017); Duan et al. (2017); Dong et al. (2017); Yang et al. (2017); Harrison and Walker (2018); Tang et al. (2018). These works regard QA as the primary task and use auxiliary tasks, such as question generation and question paraphrasing, to improve it. Whereas this is one part of our goal, our other goal is to improve the QG model with the QA system and to further improve both tasks in a loop. Another key difference is that all these methods require very similar models for generation and answering, whereas our method is built on top of two different Transformer architectures for answering and generation, coupling the best of both worlds. Specifically, we propose a Transformer network model that combines the Transformer decoder model GPT-2 with the Transformer encoder model BERT.
In a similar work, Tang et al. (2017a) consider question generation and question answering as dual tasks, learning them jointly using the entire dataset. Specifically, they employ a triplet-like loss by sampling negative and positive pairs of questions and answers, injecting the duality into the optimization procedure via a regularization term. However, in contrast to our proposed approach, semi-supervised learning is not considered. Manual question generation is a laborious and tedious task. Therefore, employing QG in a semi-supervised setup is very desirable. The most related work Yang et al. (2017) proposes to use a GRU-based encoder/decoder architecture to generate questions based on unlabeled text. The generated questions are then combined with the human questions through a domain adaptation pipeline for training QA models. This approach employs a heuristic for computing possible answers, whereas the proposed approach requires weak labels in terms of answer spans, although in theory it could also operate on heuristically generated annotations.
Previous studies have shown that the task of question generation often exhibits linguistic variance that is semantically admissible; this renders it inappropriate to judge a generated question solely by matching it against a ground truth sentence. Therefore, as another contribution of this paper, we propose to assess the quality of QG systems using the performance of a QA network trained on the generated questions as a surrogate measure.
The proposed approach is evaluated on the SQuAD dataset Rajpurkar et al. (2016) for both the “question answering” and “question generation” tasks. In both tasks we show improvement as a result of our proposed generation & answering collaboration framework. This study opens up avenues for exploiting inexpensive QG solutions similar to ours to achieve performance gains in the QA task.
The contributions are two-fold: First, we leverage question generation by tying together GPT-2 and BERT in an end-to-end trainable fashion, facilitating semi-supervised learning. Second, we propose to use QA as a surrogate measure for assessing question generation quality.
Question answering and question generation are intrinsically linked. Therefore, it is natural to combine them in order to improve the desired task. Hence, the core of the proposed method consists of learning a question generation network by making use of the feedback of a question answering network.
Here we propose to employ GPT-2’s language model for question generation and BERT for question answering. We first briefly discuss the inner workings of GPT-2 and BERT. This is followed by elaborating on how we adapt GPT-2’s language model for question generation. Subsequently, we explain in detail how to merge question generation and question answering in a collaborative framework by combining GPT-2 and BERT.
In this section we briefly review the GPT-2 Radford et al. (2019) and BERT Devlin et al. (2018) models. Both are variants of the self-attention “Transformer” network architecture Vaswani et al. (2017). The former provides a traditional language model, predicting the next word in a sequence given the history. In contrast, the latter is a masked language model. It allows for predicting a masked-out word conditioned on both its left and right context. Another key technical innovation in BERT is its bidirectional training, which yields context-specific word embeddings. Context-specific embeddings are in stark contrast to static word embeddings such as word2vec Mikolov et al. (2013). In this regard, BERT can easily be fine-tuned for a plethora of different downstream tasks. This can largely be attributed to the self-attention mechanism in the Transformer, which facilitates BERT’s generic applicability. Another interesting aspect of BERT is that it does not have an explicit notion of word order beyond marking each word with its absolute-position embedding.
Wang and Cho (2019) also showed that BERT is a combination of a Markov random field language model and pseudo log-likelihood training. As a consequence, similar to a traditional language model, this formulation automatically allows for Gibbs sampling of sequences. Technically, GPT-2 and BERT are opposite slices of the Transformer stack. That is, GPT-2 incorporates the decoder stack of the Transformer architecture, whereas BERT consists of the Transformer encoder stack.
2.2 Question Generation and Answering
For question generation with GPT-2, we follow the standard strategy for text generation as proposed in the original paper Radford et al. (2019). Given the natural sequential ordering of the language model, the joint probability of a sequence $s = (s_1, \ldots, s_n)$ can be factorized into a product of conditional probabilities

$$p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1}), \tag{1}$$

largely following Jelinek and Mercer (1980); Bengio et al. (2003). This in turn allows for efficient sampling strategies such as sequential top-k sampling Fan et al. (2018); Radford et al. (2019) (here, k=1). At each sampling step, the model computes the probability of each word in the entire vocabulary being the next word. This is followed by randomly sampling from the k most likely candidates. Sampling is discontinued when a maximum sequence length is reached or a special terminal symbol is produced, e.g. “?” for questions.
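The sampling procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vocabulary and the `toy_next_logits` stand-in for a language model are all assumptions made for illustration. With k=1 the procedure reduces to greedy decoding.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the retained candidates (shifted for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

def generate(next_logits, vocab, k=1, max_len=20, stop="?"):
    """Generate tokens until `stop` is produced or `max_len` is reached."""
    tokens = []
    while len(tokens) < max_len:
        idx = top_k_sample(next_logits(tokens), k)
        tokens.append(vocab[idx])
        if tokens[-1] == stop:
            break
    return tokens

# Hypothetical stand-in for a language model: it always prefers the next
# word of one fixed question, regardless of context.
vocab = ["who", "won", "the", "game", "?"]

def toy_next_logits(history):
    target = min(len(history), len(vocab) - 1)
    return [5.0 if i == target else 0.0 for i in range(len(vocab))]

question = generate(toy_next_logits, vocab, k=1)
print(" ".join(question))
```

In a real system `next_logits` would wrap a forward pass of the fine-tuned language model; larger values of k trade determinism for diversity.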
|QA-QG-Dual Tang et al. (2017a)||-||-||-||5.03||-|
|LM-init Radford et al. (2019)||24.85||17.85||11.06||6.85||33.56|
|Our Proposed Method||31.46||19.50||12.41||7.84||34.51|
However, in order to be tailored to the specifics of questions, a number of extensions have to be made. Specifically, as GPT-2 is trained for general text generation, a fine-tuning stage has to be included, which allows for the conditional generation of questions given an annotated possible answer. To this end, the model’s tags are augmented by special tokens delimiting answers during training. Thus, during the training phase, a question context $C_j$ is provided together with $m_j$ answer-question tuples $\{(A_{j,i}, Q_{j,i})\}_{i=1}^{m_j}$, whereby $m_j$ varies from context to context, and $A_{j,i}$, $Q_{j,i}$ denote the groundtruth answer and question, respectively. Furthermore, we denote the length of the groundtruth question $Q_{j,i} = (q^{j,i}_1, \ldots, q^{j,i}_{n_{j,i}})$ as $n_{j,i}$. Then, during the optimization, we maximize the likelihood over all contexts and their respective tuple sets, denoted as

$$\mathcal{T} = \Big\{ \big(C_j, \{(A_{j,i}, Q_{j,i})\}_{i=1}^{m_j}\big) \Big\}_{j=1}^{N}, \tag{2}$$

where $N$ denotes the context cardinality. Thus, factorizing over all contexts, we yield

$$p(\mathcal{T}) = \prod_{j=1}^{N} \prod_{i=1}^{m_j} \prod_{k=1}^{n_{j,i}} p\big(q^{j,i}_k \mid q^{j,i}_1, \ldots, q^{j,i}_{k-1}, C_j, A_{j,i}\big), \tag{3}$$

where, in contrast to Eq. 1, conditioning is extended by a context $C_j$ and a specific answer $A_{j,i}$ in the context $C_j$.
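As a rough illustration of the conditional objective, the following sketch marks an answer span inside a context with delimiter tokens and sums the log-probabilities of the question tokens given that conditioned prefix. The delimiter strings `<ans>`, `</ans>` and `<q>`, the helper names and the uniform toy model are assumptions for illustration only; the source does not specify the actual special tokens.

```python
import math

# Hypothetical delimiter tokens; the actual special tokens are not given in the text.
ANS_START, ANS_END, Q_START = "<ans>", "</ans>", "<q>"

def build_prompt(context_tokens, ans_start, ans_end):
    """Mark the answer span [ans_start, ans_end) in the context and open a question."""
    return (context_tokens[:ans_start] + [ANS_START]
            + context_tokens[ans_start:ans_end] + [ANS_END]
            + context_tokens[ans_end:] + [Q_START])

def question_log_likelihood(prob_fn, prompt, question_tokens):
    """Sum log p(token | previous question tokens, context, answer) over the question."""
    total, history = 0.0, list(prompt)
    for tok in question_tokens:
        total += math.log(prob_fn(history, tok))
        history.append(tok)
    return total

context = "the broncos defeated the patriots".split()
prompt = build_prompt(context, 4, 5)

# Toy model (assumption): every token has probability 1/4 regardless of history.
uniform = lambda history, tok: 0.25
ll = question_log_likelihood(uniform, prompt, ["who", "lost", "?"])
print(prompt, ll)
```

Maximizing this quantity summed over all contexts and answer-question tuples corresponds to the fine-tuning objective described above; in practice `prob_fn` would be the softmax output of the language model.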
The fine-tuning step yields a model that allows for basic QG (i.e., LM-init in the experiment). However, in order to boost the performance with increased diversity in generation output, we have a subsequent downstream optimization step that ties together question generation with a QA module. Details about the collaborative downstream optimization are explained in the next section.
2.3 Collaborative Generation & Answering
The models trained to this point consist of a rudimentary GPT-2 network for question generation and a BERT network for question answering, respectively. The next step entails fine-tuning both models in tandem in an end-to-end fashion. The underlying idea is to exploit the duality of the tasks in order to increase the diversity of the question generation, specifically capitalizing on the QA power of BERT. Thereby, the QA module is employed statically for the task of question answering, whereas GPT-2’s task is to generate questions, which are improved over time. That is, backpropagation is only performed w.r.t. the weights of the QG module, namely the GPT-2 language model weights, whereas the weights of the QA module remain unchanged. Technically, it is also possible to perform backpropagation w.r.t. the weights of the QA module; however, this increases the risk of drift and unstable behaviour (loss oscillations) during optimization, such that regularization becomes non-trivial. Furthermore, experiments indicate that a short fine-tuning step is sufficient for the QA network to serve as a feedback mechanism.
Specifically, given a context with an annotated answer, a question is generated using GPT-2, exactly as discussed in the previous section. The difference lies in the subsequent steps. The context together with the newly generated question (without answer annotation) is given to the pre-trained QA network. The BERT QA network in turn generates an answer span, which is compared with the groundtruth. A question that cannot be answered by the QA system, i.e. one that yields an incorrect answer span, indicates sub-optimal wording or a semantic mismatch. Therefore, we backpropagate the loss incurred by these samples w.r.t. the GPT-2 language model. Effectively, this leads to a division of the tuple set $\mathcal{T}$ (see Eq. 2) into two parts during optimization. Namely, we obtain

$$\mathcal{T} = \mathcal{T}^{-} \cup \mathcal{T}^{+}.$$

The set $\mathcal{T}^{-}$ contains contexts and answers for which the generated question cannot be answered, and the set $\mathcal{T}^{+}$ contains context-answer pairs that are answerable. These sets represent a performance snapshot of the system at the current iteration. Thus, during each round of the optimization we try to continuously shrink the cardinality of $\mathcal{T}^{-}$, i.e. minimizing $|\mathcal{T}^{-}|$ and thus minimizing the number of questions that are answered incorrectly. At the same time, in order to avoid catastrophic forgetting, we keep probing the previously correctly answered questions (known as a replay mechanism in continual learning Shin et al. (2017)), continuously sampling from $\mathcal{T}^{+}$ and trying to maximize $|\mathcal{T}^{+}|$. In case a previously answered question cannot be answered anymore, it moves back to the set of unanswerable questions, such that at any time the following holds: $|\mathcal{T}| = |\mathcal{T}^{-}| + |\mathcal{T}^{+}|$. See Fig. 2 for an illustration of the optimization pipeline.
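The bookkeeping of answerable and unanswerable examples can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' training loop: `generate_q`, `answer_q` and `update_qg` are hypothetical callables standing in for the GPT-2 generator, the frozen BERT QA module and a gradient step on the generator, and the sketch replays every answerable example in each round instead of sampling from them.

```python
def collaborative_round(tuples, unanswerable, answerable,
                        generate_q, answer_q, update_qg):
    """One optimization round over the two-way partition of the tuple set."""
    # Failing examples are revisited; answerable ones are replayed.
    for idx in sorted(unanswerable | answerable):
        context, answer = tuples[idx]
        question = generate_q(context, answer)
        predicted = answer_q(context, question)   # QA module stays frozen
        if predicted == answer:
            unanswerable.discard(idx)
            answerable.add(idx)
        else:
            answerable.discard(idx)
            unanswerable.add(idx)
            update_qg(context, answer, question)  # only the generator is updated
    return unanswerable, answerable

# Toy stand-ins (assumptions): the generator "improves" after one update.
tuples = {0: ("ctx a", "ans a"), 1: ("ctx b", "ans b")}
skill = {"updates": 0}

def generate_q(context, answer):
    good = context == "ctx a" or skill["updates"] >= 1
    return f"good question about {answer}" if good else "vague question"

def answer_q(context, question):
    prefix = "good question about "
    return question[len(prefix):] if question.startswith(prefix) else ""

def update_qg(context, answer, question):
    skill["updates"] += 1

unanswerable, answerable = set(tuples), set()
history = []
for _ in range(2):
    collaborative_round(tuples, unanswerable, answerable,
                        generate_q, answer_q, update_qg)
    history.append((set(unanswerable), set(answerable)))
print(history)
```

The two sets stay disjoint and always cover all examples, so an example that stops being answerable simply moves back to the unanswerable set in the next round.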
|Labeling rate||Method||Dev F1||Test F1||Test EM|
|0.1||Gen + GAN Ganin and Lempitsky (2015)||0.4897||0.4373||0.2885|
|0.1||Gen + dual He et al. (2016)||0.5036||0.4555||0.3005|
|0.1||Gen + domain Yang et al. (2017)||0.5234||0.4703||0.3145|
|0.1||Gen + domain + adv Yang et al. (2017)||0.5313||0.4802||0.3218|
|0.1||Our Proposed Method||0.6931||0.6391||0.4741|
|0.2||Gen + GAN Ganin and Lempitsky (2015)||0.5525||0.5037||0.3470|
|0.2||Gen + dual He et al. (2016)||0.5720||0.5192||0.3612|
|0.2||Gen + domain Yang et al. (2017)||0.5749||0.5216||0.3658|
|0.2||Gen + domain + adv Yang et al. (2017)||0.5867||0.5394||0.3781|
|0.2||Our Proposed Method||0.7614||0.7053||0.5476|
|0.5||Gen + GAN Ganin and Lempitsky (2015)||0.6110||0.5590||0.4044|
|0.5||Gen + dual He et al. (2016)||0.6368||0.5746||0.4163|
|0.5||Gen + domain Yang et al. (2017)||0.6378||0.5826||0.4261|
|0.5||Gen + domain + adv Yang et al. (2017)||0.6375||0.5831||0.4267|
|0.5||Our Proposed Method||0.8185||0.7564||0.6056|
|0.9||Gen + GAN Ganin and Lempitsky (2015)||0.6396||0.5874||0.4317|
|0.9||Gen + dual He et al. (2016)||0.6511||0.5892||0.4340|
|0.9||Gen + domain Yang et al. (2017)||0.6611||0.6102||0.4573|
|0.9||Gen + domain + adv Yang et al. (2017)||0.6585||0.6043||0.4497|
|0.9||Our Proposed Method||0.8409||0.7755||0.6282|
3 Experiments & Results
In order to assess the performance, a number of experiments were conducted. First, the quality of the generated questions is analysed in terms of language generation metrics such as BLEU and ROUGE. Due to the deficiency of these scores, as will be discussed, QA is employed as an indirect surrogate measure. Next, the behaviour of QG in a semi-supervised setup is analysed, simulating the feasibility in small-data regime problems. Finally, we perform an ablation study analyzing the importance of the QA component used during training.
Experiments are performed on the Stanford Question Answering Dataset (SQuAD) v1.1 Rajpurkar et al. (2016). It consists of a collection of more than 100k question/answer pairs w.r.t. paragraphs from Wikipedia articles that were acquired by crowdsourcing. We employ a data split which divides the training data set equally into two parts. The first part (SP2) is used for supervised pre-training of the QG and QA models. The second half (SP1) is used for evaluation purposes.
Initialization of the model entails pre-training of both BERT and GPT-2. Whereas BERT is fine-tuned for the task of question answering, GPT-2 is fine-tuned for question generation. Both are fine-tuned for 2 epochs.
3.1 Quality of Generated Questions
For question generation, we evaluated the quality of the generated questions by comparing them with the groundtruth questions in the dataset using the standard language generation metrics BLEU and ROUGE (Tab. 1).
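To make the lexical nature of such metrics concrete, the following is a minimal sketch of sentence-level BLEU with a single reference, uniform n-gram weights, a brevity penalty and no smoothing. It is an illustration only and not the evaluation script used for the reported numbers; the whitespace tokenization and function names are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)

generated = "what team did the broncos defeat in the afc championship game ?".split()
groundtruth = "who won super bowl xlix ?".split()
print(bleu(generated, groundtruth, max_n=1), bleu(generated, groundtruth))
```

Because the two questions share only the question mark, the unigram score is small and the 4-gram score collapses to zero, even though both questions may be valid for the same answer.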
Figure 3 shows some qualitative results of the question generation. As can be seen, the generated sentences have high diversity and differ significantly from the groundtruth. The last example shows a failure case of our method, which is due to the very fine-grained nature of the question. Overall, the questions generated by the proposed approach are of good quality. Therefore, the proposed approach is applicable in the low-data regime, compensating for the absence of annotations in large corpora. The generated questions are also of higher quality than those obtained with the vanilla GPT-2 language model, suggesting that learning from BERT in the feedback loop provides further language cues. This may be attributed to the strength of the context-specific embeddings of BERT, which allow for establishing complex relationships in sentences as well as a rich semantic representation that can be exploited by QA.
|LM-init Radford et al. (2019)||67.51||77.15|
|Our Method (GPT-2)||70.61||79.73|
|Our Method (BERT)||75.37||84.42|
|LM-init Radford et al. (2019)||67.51||77.15|
3.2 Quantitative Analysis using Surrogate Measure
In order to gauge the performance of automatic QG systems, it is very important to have good metrics at hand. Scores such as BLEU and ROUGE are only of limited use here, as they mainly measure the lexical similarity between the generated question and the ground truth Tang et al. (2017b). This can be attributed to the fact that these metrics often originate from a specific application domain, e.g. BLEU from translation. As a result, they are of limited use for other NLP applications. Specifically, in the context of question generation they tend to be inadequate, as they are unable to capture whether a generated question is semantically valid. A desirable evaluation system should also be able to judge whether the generated question can be answered from the input sentence, even if the generated question uses totally different words to express the meaning. As an example, taking the first sample from Tab. 3, “What team did the broncos defeat in the AFC championship game?” is a perfectly reasonable question given the answer “New England Patriots” and the specific context. However, it scores very low in terms of BLEU against the groundtruth “Who won Super Bowl XLIX?”, highlighting the deficiency of these scores for the task of assessing QG. Motivated by this, we propose to train a QA system on the questions generated by the QG system, and to utilize the performance of the former model as a surrogate measure for the latter. For this purpose, we train a BERT QA model on generated questions, as well as on a combination of generated questions and groundtruth questions. The underlying idea is that if the model is able to generate questions with high diversity, using words with low lexical similarity to the groundtruth, the QA system is also improved. Incorporating QA entails multiple aspects, namely semantic information as well as answerability, and therefore provides complementary cues.
Specifically, the QA network becomes more robust as it learns to generalize better. In terms of BERT, this implies a widening of the semantic spectrum of the context-specific embedding driving the QA model. For all the evaluations of the surrogate metric, BERT was trained for 2 epochs.
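The surrogate measure is ultimately a QA score, which on SQuAD is usually reported as exact match (EM) and token-level F1. The following sketch shows how these two scores are commonly computed for a single prediction; it mirrors the normalization steps of the widely used SQuAD evaluation procedure (lowercasing, removing punctuation and articles) but is a simplified illustration with assumed function names, not the official script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the New England Patriots", "New England Patriots")
f1 = f1_score("Patriots", "New England Patriots")
print(em, f1)
```

Averaging these per-example scores over a held-out set for a QA model trained on generated questions gives the surrogate quality estimate for the generator.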
As can be seen in Tab. 3, the performance in question answering using questions generated with BERT in the feedback loop almost reaches the groundtruth benchmark performance. This is followed by using GPT-2 in the feedback loop and, by a large margin, by using the GPT-2 language model directly for question generation. At the same time, the strong performance suggests that there is sufficient diversity in the generated questions and that the questions are semantically correct. This is further highlighted by the simultaneously low scores for the BLEU and ROUGE metrics in Tab. 1. Scores of similarly low BLEU magnitude, albeit on a slightly different data split, are reported by Tang et al. (2017b) for their related approach.
To better analyze the power of our question generation in terms of semantic diversity, we provide additional groundtruth data for learning the QA model. The rationale is to check whether the generated questions are also beneficial in the presence of groundtruth data. To this end, a BERT QA model is trained on the entire training set of SQuAD, where half of the training data is fully supervised (i.e., contains question/answer pairs), while the other half is not annotated with questions; we use our proposed method to generate questions for the latter part. Finally, performance is evaluated on the development set of SQuAD. As can be seen in Tab. 4, the performance in this setup using the questions generated by our method almost reaches the fully supervised baseline, which has been trained on the whole training set in a fully supervised manner. The small margin between our QA method and the fully supervised baseline suggests the applicability of our proposed framework to low-data regime scenarios. The large margin between our proposed method and the initialization model clearly shows the semantic diversity of the generated questions compared to the groundtruth questions.
3.3 Semi-Supervised Learning
In this experiment we study the performance in a semi-supervised setup at different labeling rates. For the sake of transparency, we conducted experiments following the protocol of Yang et al. (2017) with the associated customized SQuAD split. Specifically, we evaluate the performance at labeling rates of 10%, 20%, 50% and 90% of the 50k unlabeled data corpus, with the remaining 10% used for testing. The results are reported in Table 2. As can be seen in the table, the proposed approach outperforms the state-of-the-art semi-supervised QA approach of Yang et al. (2017) by a large margin at any labeling rate. The larger the labeling rate is, the larger the margin. However, it should be noted that a part of the margin can be attributed to the fact that the approach of Yang et al. (2017) employs a heuristic to extract answer candidates from which questions are generated, rather than employing groundtruth answer spans. The results further suggest that the approach already starts to saturate at a labeling rate of around 50%. That is, the gain in accuracy beyond that labeling rate drops significantly, indicating the high quality and diversity of the generated questions. Generally, using the proposed approach in a semi-supervised setting seems feasible, given that even at very low labeling rates such as 10%, the proposed approach reaches good accuracy not too far off from the upper bound.
3.4 Ablation Study of QA component
The proposed approach employs BERT as the QA component for co-training the QG component. In this section we analyze the effect of the type of QA system used in collaboration with the QG system. Specifically, we replace BERT with a variant of GPT-2 that emulates BERT. In order to achieve question answering that emulates span regression in a BERT-like fashion, the GPT-2 architecture has to be altered. To this end, a “regression head” has to be added at the top of the GPT-2 stack. As a result, one obtains a trainable QA head layer that associates log-likelihoods with each context token, indicating the probability of being a span delimiter. It should be noted that in the original GPT-2, answers were generated using the language model instead. Table 3 shows the results of using the proposed approach that employs BERT as the QA module during training vs. using the modified GPT-2 QA variant. All models were fine-tuned for 2 epochs. In both cases, a separate BERT QA module was used as the surrogate evaluation metric, as introduced in the previous subsection. The results suggest that using BERT leads to much better performance than GPT-2 QA. This can be attributed to the fact that GPT-2 features a language model whose backbone is not optimized for multiple downstream tasks. Therefore, it can provide only limited and unreliable feedback in terms of the diversity of the generated questions. In contrast, BERT with its context-specific embeddings allows for robust and reliable QA.
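A span-prediction head of the kind described above outputs one start score and one end score per context token, and the answer is the span with the best combined score. The following sketch shows only this decoding step under simple assumptions (additive scores, a hypothetical `max_len` limit, hand-picked toy scores); it does not reproduce the head architecture or training of either model.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans that start at s and contain at most max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = "the broncos defeated the new england patriots".split()
# Toy per-token scores (assumption): the head favours "new" as start, "patriots" as end.
start = [0.1, 0.2, 0.0, 0.3, 2.5, 0.4, 0.2]
end = [0.0, 0.1, 0.2, 0.1, 0.3, 0.6, 2.8]

s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

The predicted span can then be compared with the groundtruth span to decide whether a generated question counts as answerable.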
In this paper, a simple yet effective approach for question generation is presented. To this end, we leverage question generation by tying together GPT-2 and BERT in an end-to-end trainable fashion, facilitating semi-supervised learning. The generated questions are of high quality, showing high semantic similarity w.r.t. groundtruth data. Furthermore, the conducted experiments suggest that the proposed approach allows for the generation of questions while significantly reducing the burden of full annotation. BERT as the QA module in the feedback loop shows the best performance, which may be attributed to the bi-directional context-specific embeddings providing a powerful feedback mechanism. Additionally, as BLEU and similar scores are weak metrics for assessing generative power, we proposed to use BERT QA as a surrogate measure for assessing question generation quality. Future work will focus on further improving question generation as well as on reducing the requirements for answer annotations. Completely eliminating the answer annotations could pave the way towards fully unsupervised question generation.
- A neural probabilistic language model. J. Mach. Learn. Res. 3, pp. 1137–1155. Cited by: §2.2.
- LearningQ: a large-scale dataset for educational question generation. In Twelfth International AAAI Conference on Web and Social Media. Cited by: §1.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds, §1, §2.1.
- Learning to paraphrase for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 875–886. Cited by: §1.
- Learning to ask: neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106. Cited by: §1.
- Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 866–874. Cited by: §1.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 889–898. Cited by: §2.2.
- Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1180–1189. Cited by: Table 2.
- Neural generation of diverse questions using answer focus, contextual and linguistic features. arXiv preprint arXiv:1809.02637. Cited by: §1.
- Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 820–828. Cited by: Table 2.
- Interpolated estimation of Markov source parameters from sparse data. In Proceedings, Workshop on Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal (Eds.), pp. 381–397. Cited by: §2.2.
- Improving neural question generation using answer separation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6602–6609. Cited by: §1.
- Generative question answering: learning to answer the whole question. In International Conference on Learning Representations. Cited by: §1.
- Efficient estimation of word representations in vector space. Cited by: §2.1.
- Towards a better metric for evaluating question generation systems. EMNLP. Cited by: §1.
- Recent advances in neural question generation. arXiv preprint arXiv:1905.08949. Cited by: §1.
- Language models are unsupervised multitask learners. Cited by: Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds, §1, §2.1, §2.2, Table 1, Table 3, Table 4.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. Cited by: Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds, §1, §3.
- Generating factoid questions with recurrent neural networks: the 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807. Cited by: §1.
- Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §2.3.
- Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027. Cited by: §1, Table 1.
- Question answering and question generation as dual tasks. CoRR abs/1706.02027. Cited by: §3.2, §3.2.
- Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1564–1574. Cited by: §1.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2.1.
- BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR abs/1902.04094. Cited by: §2.1.
- IRGAN: a minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 515–524. Cited by: §1.
- Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1040–1050. Cited by: §1, Table 2, §3.3.
- BERTScore: evaluating text generation with BERT. arXiv preprint arXiv:1904.09675. Cited by: §1.