Natural language interface (NLI) applications such as personal assistants (e.g., Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana) and search engines (e.g., Google) have become an integral part of our everyday life. Among the many features of NLI applications, text auto-completion, which aims to suggest words, phrases, and sentences that complete the user's textual input, is a common but key feature. Smart Reply Kannan et al. (2016) and Smart Compose Chen et al. (2019) are two recent systems that provide contextual assistance to help users complete everyday text such as emails and search engine queries.
While recent advances in deep neural models have shown impressive performance on the text auto-completion task, these models generally require a large amount of everyday text and a huge amount of computing power to generate adequate suggestions Chen et al. (2019). The challenge is compounded when we perform auto-completion in specific domains such as academic writing, which require a large training corpus of specialized expertise. Table 1 illustrates the difficulty of domain-specific auto-completion given the same amount of supervision.
A potential solution to address the challenges in text auto-completion is exploiting a decoder-only Transformer model such as GPT-2 Radford et al. (2019). The model performs well at constructing syntactically sound sentences from a partial query. However, GPT-2 requires a huge fine-tuning effort to construct sentences in expert domains. Figure 1 shows examples of GPT-2 auto-completion suggestions for computer science sentences before fine-tuning. Recently, text-to-text transformers such as BART Lewis et al. (2020) and T5 Raffel et al. (2020) have demonstrated great potential in natural language generation (NLG) tasks by using masked-span infilling as a pre-training objective. However, similar to GPT-2, these models also require huge fine-tuning efforts to perform domain-specific text auto-completion.
This paper aims to address this research gap by proposing an intermediate training strategy Pruksachatkun et al. (2020); Zhou et al. (2021) that incrementally trains a pre-trained text-to-text transformer to provide better auto-completion suggestions and to adapt quickly to the expert domain during fine-tuning. As shown in Figure 2, the core of our intermediate training strategy is a simple self-supervised objective called Next Phrase Prediction (NPP), which has two major steps: Phrase Extraction (Section 3.1) and Generative Question Answering (Section 3.2). The first step extracts high-quality phrases via constituency parsing, which allows the framework to utilize complete phrases rather than arbitrary fragments of the sentence. Next, the pre-trained language model is guided to choose the correct next phrase among other phrases of the same type (e.g., noun phrase, verb phrase) in the sentence. For example, the sentence "She bought a top and bottom from that strange little shop." has two noun phrases, "a top and bottom" and "that strange little shop". If the partial query is "She bought", the model is guided to choose the complete noun phrase "a top and bottom" as its next phrase.
To the best of our knowledge, this is the first work to propose an intermediate training strategy for improving language models' performance on the text auto-completion task. Through extensive experiments, we demonstrate that our proposed approach improves a text-to-text transformer's performance on the auto-completion task and enables quick adaptation to expert domains of text auto-completion.
In this section, we first formalize the auto-completion problem, and then introduce the workflow of our intermediate training strategy.
2.1 Problem Statement
Given a partial query q, an auto-completion model returns a completion c, where c is a syntactic and semantic extension of q. Specifically, the tokens of q form a prefix of a full sentence s, and the tokens of c form a suffix of s; q followed by c constitutes the full sentence s. We evaluate the auto-completion model's performance on two attributes: (a) the soundness of c, and (b) the semantic similarity of s to the ground truth.
The workflow consists of two main steps, starting from a pre-trained T5 model: (i) applying the proposed self-supervised NPP objective for intermediate-task training, and (ii) fine-tuning on the target auto-completion task.
3 Next Phrase Prediction
The key idea of the next phrase prediction (NPP) objective is to train a text-to-text transformer to complete the partial query with adequate phrases. The underlying intuition of our proposed approach is as follows: (1) Phrases tend to express meaning beyond simple word concatenation. For example, a noun phrase such as "Recurrent Neural Network" is constructed from three different words ("Recurrent", "Neural", "Network"), where each word has its own meaning. (2) Common phrases tend to be used as units in text. For instance, the prepositional phrase "in this paper" frequently appears in the academic writing domain. Existing language models neglect such characteristics of phrases and are trained to predict the next word or span of text; text auto-completion can instead be improved by performing phrase-level completion as an intermediate training strategy that makes the most of phrases. Specifically, NPP involves two main steps: (i) Phrase Extraction, and (ii) Generative Question Answering (QA).
3.1 Phrase Extraction
We begin by extracting phrases via constituency parsing to retrieve high-quality phrases. Given an input sentence, we first conduct constituency parsing using AllenNLP Gardner et al. (2018) and extract the noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). The extracted phrases are grouped into sets according to their types, denoted S_NP, S_VP, and S_PP, respectively. For each phrase type, we keep only the nodes that do not have a child node of the same phrase type. For example, the sentence "She wants to eat pie." has three VPs:

(1) wants to eat pie (VP) → wants (VBZ) + to eat pie (VP)

(2) to eat pie (VP) → to (TO) + eat pie (VP)

(3) eat pie (VP)

To construct S_VP for this sentence, we only consider "eat pie", which avoids word overlap between phrases.
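The filtering rule above can be sketched in a self-contained way. The paper parses raw text with AllenNLP's constituency parser; here we assume a bracketed parse string is already available (a simplifying assumption) and only show the phrase-extraction step: keep a node only if it has no child of the same phrase type.

```python
import re

# Phrase types collected in Step 3.1.
PHRASE_TYPES = ("NP", "VP", "PP")

def read_tree(s):
    """Parse a bracketed constituency string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)

    def helper(i):
        label = tokens[i + 1]  # the token after "(" is the node label
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1

    tree, _ = helper(0)
    return tree

def leaves(node):
    """Collect the words under a node, left to right."""
    _, children = node
    words = []
    for c in children:
        words.extend(leaves(c) if isinstance(c, tuple) else [c])
    return words

def extract_phrases(tree):
    """Group phrases by type, keeping only the lowest node of each type."""
    found = {t: [] for t in PHRASE_TYPES}

    def visit(node):
        label, children = node
        subs = [c for c in children if isinstance(c, tuple)]
        # Skip nodes like "wants to eat pie (VP)" that contain a child
        # subtree of the same phrase type ("to eat pie (VP)").
        if label in found and not any(s[0] == label for s in subs):
            found[label].append(" ".join(leaves(node)))
        for s in subs:
            visit(s)

    visit(tree)
    return found

parse = ("(S (NP (PRP She)) (VP (VBZ wants)"
         " (VP (TO to) (VP (VB eat) (NP (NN pie))))) (. .))")
phrases = extract_phrases(read_tree(parse))
print(phrases["VP"])  # only the lowest VP, "eat pie", survives
```

On the running example, the two outer VPs are discarded and S_VP contains only "eat pie", matching the filtering described above.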
3.2 Generative QA
After retrieving the phrases, we train the language model to predict the correct next phrase in a generative QA setting Khashabi et al. (2020). Specifically, from the NP, VP, and PP sets, we randomly choose a set S that contains more than two phrases. To formulate the Generative QA task with the selected S, we construct both the question and the answer: the answer a is a randomly chosen phrase from S whose preceding context forms the partial query q, and the question is composed of q together with all phrases in S as answer choices. The model is trained to output the correct phrase a given the partial query q and the answer choices S. Figure 2 shows a real example of this format, with the NP set chosen as S, "a top and bottom" as the answer a, and "She bought" as the partial query q.
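The example construction above can be sketched as follows. The exact textual template (the ellipsis marker and the "(A)/(B)" choice labels) is an assumption, loosely following the UnifiedQA multiple-choice input style; the paper does not spell out the literal format.

```python
# Build one NPP training example in a generative QA format: the partial
# query plus labeled answer choices form the source text, and the correct
# next phrase is the target. Template details are illustrative assumptions.
def build_npp_example(partial_query, phrase_set, answer):
    assert answer in phrase_set
    choices = " ".join(
        f"({chr(ord('A') + i)}) {p}" for i, p in enumerate(phrase_set)
    )
    source = f"{partial_query} ... \\n {choices}"
    return source, answer

src, tgt = build_npp_example(
    "She bought",
    ["a top and bottom", "that strange little shop"],
    "a top and bottom",
)
print(src)
```

For the running example, the source pairs the partial query "She bought" with both noun phrases as choices, and the target is the phrase that actually extends the query.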
4.1 Details for intermediate training
We train a pre-trained T5-base model with NPP. We randomly sample 1M sentences from the English Wikipedia corpus¹, which is also used for pre-training BERT and its variants, as the source data for NPP. The corpus has about 1.2B tokens, which is considerably less than the 34B tokens used in T5 and the 10B tokens used in GPT-2².

¹ https://dumps.wikimedia.org/enwiki/latest/
² Assuming the average token size is four characters.
4.2 Target Dataset
To show the effectiveness of our proposed method, we utilize two domains of text corpus to create the text auto-completion datasets:
Email: We utilize the Enron email corpus³ for the general domain; it consists of English emails collected from internal communication within a large business organization.

³ https://www.cs.cmu.edu/~enron/
Table 2 summarizes the statistics of the datasets used in our experiments. For data processing, we first extract the sentences from these corpora. We split each sentence into (partial query, completion) pairs at every word boundary, and the model is trained to predict the completion, i.e., the remaining phrase of the sentence, from the partial query. Note that this formulation is used when fine-tuning the base models.
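The pair construction described above can be sketched as a minimal version that splits at whitespace (the actual preprocessing may tokenize differently):

```python
# Split a sentence into (partial query, completion) pairs at every word
# boundary, as described above. Whitespace tokenization is an assumption.
def make_pairs(sentence):
    words = sentence.split()
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]

pairs = make_pairs("She bought a top and bottom")
print(len(pairs))  # a six-word sentence yields 5 pairs
```

Each pair then serves as one (input, target) training instance during fine-tuning.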
Model / Metrics | Emails (BLEU / METEOR / CIDEr / SPICE) | Academic Writing (BLEU / METEOR / CIDEr / SPICE)
GPT-2 Radford et al. (2019) | 1.1 / 6.6 / 26.4 / 3.3 | 0.6 / 6.0 / 23.6 / 2.6
T5 Raffel et al. (2020) | 2.8 / 6.8 / 39.8 / 4.2 | 2.2 / 7.5 / 50.3 / 3.9
Partial query | Original text | T5 | NPP+T5
Building large OCR databases is a time | consuming and tedious work . | challenging . | consuming task .
vpi is part of the ieee programming | language interface standard . | system . | language .
a connection between the kalman | filter is developed . | et al . | filter is established .
appendix provides a complete listing | of code for the systems . | of the apl libraries . | of the tools and techniques used in this paper .
automatic target | recognition is an important task . | selection is based on a set of criteria . | detection is a key feature of this approach .
4.3 Base models
We compare our proposed approach with other pre-trained language generation models. We fine-tune the following models on our training data in a sequence-to-sequence format: (1) GPT-2 Radford et al. (2019) is the pre-trained GPT-2 large model, which has 774M parameters. For fine-tuning, we condition the model on the partial query concatenated with its completion. For inference, we sample from the fine-tuned model with beam search after prompting it with the partial query, clean the samples via post-processing, and use the first sample as the output sentence. (2) T5 Raffel et al. (2020) is the pre-trained T5-base model, which has 220M parameters. For fine-tuning, we prepend the prefix "generate next phrase:" to the partial query and feed it into the model to generate the completion. (3) NSP+T5 is T5-base intermediately trained with Next Sentence Prediction (NSP), the objective used in BERT Devlin et al. (2019) pre-training. (4) NPP+T5 is T5-base intermediately trained with Next Phrase Prediction (NPP), our proposed objective.
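As a concrete illustration of the baselines' input formats, the sketch below builds the fine-tuning inputs. The "generate next phrase:" prefix for T5 is stated above; the GPT-2 separator token is a hypothetical placeholder, since the exact delimiter is not specified here.

```python
# Input formatting for the two fine-tuned baselines described above.
def t5_input(partial_query):
    # T5 receives the task prefix plus the partial query.
    return "generate next phrase: " + partial_query

def gpt2_training_text(partial_query, completion, sep="<|sep|>"):
    # GPT-2 is conditioned on the concatenated query/completion pair;
    # the separator token "<|sep|>" is a hypothetical stand-in.
    return f"{partial_query} {sep} {completion}"

print(t5_input("She bought"))
```

At inference time, only the partial-query side of each format is fed to the model.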
4.4 Evaluation Metrics
To evaluate the syntactic and semantic soundness of the generated sentences, we adopt several widely used automatic metrics: BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), CIDEr Vedantam et al. (2015), and SPICE Anderson et al. (2016). These metrics assess whether the model generates semantically sound, domain-focused sentences by measuring surface similarities and associations between the system generations and the original text.
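As a minimal illustration of the surface-overlap idea behind these metrics, the sketch below computes clipped unigram precision, the basic building block of BLEU. The experiments use the standard metric implementations; this hand-rolled version is for exposition only.

```python
from collections import Counter

# Clipped unigram precision: the fraction of candidate words that also
# appear in the reference, where each reference word can be matched at
# most as many times as it occurs there. Full BLEU combines this idea
# over higher-order n-grams with a brevity penalty.
def clipped_unigram_precision(candidate, reference):
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

# A generated completion versus the original text, taken from the
# generations table: 2 of 3 candidate tokens overlap the reference.
p = clipped_unigram_precision("consuming task .", "consuming and tedious work .")
print(round(p, 3))  # 0.667
```

A higher score indicates more surface overlap with the reference, which is the signal all four metrics build on in different ways.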
4.5 Experimental Results
Table 3 shows the experimental results of text auto-completion on the email and academic-writing datasets. We observe that the model intermediately trained with our objective outperforms the base models on both datasets.
Specifically, our approach, NPP+T5, outperforms NSP+T5 by a margin of 0.2 to 0.3 in BLEU/METEOR/SPICE, suggesting that predicting the next phrase is more effective than predicting the next sentence for text auto-completion. Moreover, NPP+T5 outperforms GPT-2 even though it has fewer than half as many parameters. These results demonstrate the flexibility of our proposed approach, which can serve as a "plug-and-play" objective for any text-to-text transformer and enhance its performance on the text auto-completion task.
Table 4 compares the suggestions generated by T5 and NPP+T5 for the same partial queries. We observe that the completions produced by NPP+T5 are generally more acceptable in terms of semantic similarity to the original text.
In this paper, we propose a novel intermediate training strategy that encourages the model to complete the partial query with enriched phrases, thereby improving the performance of the text auto-completion system. Our approach enhances a state-of-the-art language model's performance by intermediately training it with our Next Phrase Prediction self-supervised objective. Preliminary experiments show that our approach outperforms the baselines on auto-completion in the email and academic-writing domains with only around 1.2B tokens of training. For future work, we aim to evaluate our approach on text auto-completion in more writing domains and to develop a demonstration system that better showcases it.
- SPICE: semantic propositional image caption evaluation. In ECCV. Cited by: §4.4.
- METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72. Cited by: §4.4.
- Gmail smart compose: real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2287–2295. Cited by: §1, §1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: §4.3.
- AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6. Cited by: §3.1.
- Smart reply: automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 955–964. Cited by: §1.
- UnifiedQA: crossing format boundaries with a single QA system. In Findings of EMNLP. Cited by: §3.2.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. Cited by: §1.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. Cited by: §4.4.
- Intermediate-task transfer learning with pretrained language models: when and why does it work?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5231–5247. Cited by: §1.
- Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1, §4.3, Table 3.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §1, §4.3, Table 3.
- Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. Cited by: 2nd item.
- CIDEr: consensus-based image description evaluation. In CVPR, pp. 4566–4575. Cited by: §4.4.
- Pre-training text-to-text transformers for concept-centric common sense. In ICLR. Cited by: §1.