Traditional approaches to conditional natural language generation consider each generation task (e.g., summarisation, paraphrasing, generative question answering) to be an independent task and build a task-specific model. However, there are fundamental similarities between all of these tasks. In an encoder-decoder model, when the generated language from multiple tasks is in the same language, the structure of this language does not change across different tasks and the generation module (i.e., the decoder) should not have to relearn how to put words together into sentences for every new task. When the input for all tasks is from the same language, the model should benefit from sharing a common language understanding module (i.e., the encoder).
Recent papers attempt to design a single model to perform many language generation tasks (Raffel et al., 2019; Lewis et al., 2019; Radford et al., 2019). Such a universal model generally performs worse on a particular task than their task-specific counterpart, but the model is able to perform many diverse tasks. In this paper, we endeavour to design a better universal language generation model. Our motivating hypothesis is that modelling tasks explicitly in a latent task embedding space allows a universal model (for natural language generation on multiple tasks) to represent a highly multimodal distribution of examples from diverse tasks better, which ultimately leads to better performance.111We use tasks and datasets interchangeably and consider each dataset to be its own task.
We present a multitask generative model (§2) that takes a sequence input and outputs another sequence . Our model is based on an encoder-decoder model that is augmented with a latent task space. Each point in the latent task space can be considered as a continuous representation of a “skill”, which is used to customise the encoder-decoder model to perform a specific task for a given input. We model the latent task space as a mixture of Gaussians where each training task (dataset) is represented by a Gaussian. We provide the model with a weak supervision in the form of a dataset identifier (i.e., which dataset a particular example comes from). This results in an inductive bias that encourages examples which form a specific dataset to be solved with shared underlying skills. In order to generate , our model first samples a skill variable from the mixture of Gaussians and outputs conditioned on both and . Crucially, our model is able to generate different outputs given the same input by changing . We show how to train (§2.2) and use this model for new examples (§2.3).
We evaluate our model in both the multitask and few-shot settings. We collect various datasets spanning across multiple tasks (e.g., summarisation, dialogue, question answering, etc.),222We limit our scope to English-to-English language generation tasks. We also only consider tasks where the input conditioning context is also natural language. We leave exploration on multilingual generation and incorporating image-to-text tasks to future work. which we discuss in detail in §3.1. Automated evaluation of natural language generation models is notoriously hard, and comparing models across multiple tasks is even more difficult. We borrow the example of GLUE (Wang et al., 2019) and report a normalised score across multiple tasks with multiple different metrics. We demonstrate that multitask learning using a latent embedding space improves overall performance across tasks on average—with sometimes dramatic results (§3.3). In the few-shot learning setup, we show that model adaptation based on inference in the latent task space performs comparably to and is more robust than standard fine-tuning based parameter adaptation (§3.4). Finally, we probe the latent space and discover that the model clusters training tasks in a natural way (Figure. 7).
Given a collection of text-to-text datasets indexed by , we consider the problem of generating a (natural language) output sequence given a (natural language) input sequence , where indexes an example in a dataset, and and index words in and respectively. For example, for generative question-answering tasks, we concatenate the context and question to form and predict the answer as . For other tasks (i.e., summarisation, dialogue, and paraphrasing), we use the input context as and predict a summary, a continuation of a dialogue, and a paraphrase respectively. Importantly, our model needs to be able to perform different tasks (i.e., use different skills) given a dataset identifier and an input .
We present a hierarchical generative model with a generative story as follows:
Given a dataset identifier and an input example :
Sample an example specific skill representation from .
Set to be the start of sentence symbol.
Sample from .
Figure (a)a shows a graphical model depiction of our model. We discuss our model architecture and the training and inference methods in the following.
For generating outputs, we use the same vocabulary as the encoder input (produced by the SentencePiece tokenizer). We reuse the encoder word embedding layer to embed words for the decoder. This word embedding layer is also used as the final softmax layer of our decoder—following previous work byInan et al. (2016), Press and Wolf (2017), and others. We use a decoder-specific transformer that attends to encoded representations of the input and representations of previously generated outputs to produce a next target word . Note that the first word that the decoder conditions on is always the start of sentence symbol, and we concatenate the skill embedding vector to the word embeddings of previously generated words when computing contextualised representations of previously generated words using the decoder-specific transformer.
Latent task space.
A naïve design of the latent space would be to assume a structureless Gaussian prior and hope that examples which require similar skills to solve naturally cluster in this latent space. However, with such a unimodal distribution, we have little control over exactly what features the model will rely on to perform the clustering.
To encourage the model to cluster examples from the same dataset together, we use a mixture of Gaussians prior:
where denotes a dataset. Each of the Gaussians has two learnable parameters: the mean and covariance matrix (assumed to be diagonal). Each of the first Gaussians represents a dataset that exists in our training set. The final -th Gaussian is used to accommodate unseen datasets which can be used in a certain evaluation setup (e.g., when the model is presented with examples from a new dataset at test time). We defer discussion of test time to §2.3.
The objective function that we want to maximise is:
This objective function is intractable due to the integration over , so we resort to (amortized) variational inference (Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014).
We introduce a variational distribution
, which takes the form of a normal distribution
. We use a feed-forward neural network that takes as input, (by concatenating them as one sequence), and a one-hot vector that represents the dataset the example comes from (i.e., a one-hot vector of the size of ) to parametrise and .333We restrict the posterior covariance matrix to be a diagonal matrix as well.
At training time, we always observe the dataset label , so and when computing the prior in Eq. 1. However, recall that our latent task space is composed of Gaussians. For 90% of the training examples from all tasks, we keep their original dataset label . For the remaining 10% (chosen randomly), we assign these examples to the -th Gaussian to allow the model to accommodate for unseen datasets (i.e., examples which come from datasets that are not in the training datasets) at evaluation time. The goal is to allow the the final Gaussian to be a generalist that can be adapted quickly to a new dataset. We describe this procedure in detail in §2.3. The final variational lower bound of the above likelihood function that we optimise to train our model is:
We use a single sample from to approximate the expectation and compute the KL between and in closed form.
Recall that our model always takes a weak supervision in the form of which dataset a test example comes from—either a particular training dataset or a new unseen dataset. We have three evaluation modes: in-domain evaluation, zero-shot evaluation, and few-shot adaptation.
In this setup, we evaluate on an example from a dataset that we have seen at training time. This is a classic multitask evaluation scenario. Given a test example , we use the mean of the Gaussian associated with this dataset to obtain and decode to generate our prediction . We use this evaluation mode in §3.3.
For zero-shot evaluation, the model is given an example from a new unseen dataset. We simply use the mean of the extra -th Gaussian to obtain and decode to generate our prediction . We use this evaluation mode to establish baseline zero-shot performance and compare it with few-shot learning in §2.3.
One of the crucial parts of machine learning is rapidly adapting an existing model to a new dataset, often with limited examples. In this setup, we are given a few examples from a new dataset that has not been seen at training time. The model must then make predictions for more examples drawn from the same distribution.
One approach towards few-shot adaptation is to optimise the parameters of the model on the few-shot data, typically via gradient descent. This effectively retrains (fine-tunes) the model on the newly presented training data, and is currently the dominant paradigm for transfer learning. However, fine-tuning all the parameters of the model with few examples is known to be sensitive to the hyperparameters of the tuning procedure. It also carries the risk ofcatastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where prior knowledge learnt on the training tasks is not retained and performance on them dramatically decreases.
On the other hand, our model explicitly learns latent skills which can be used across tasks. Our approach is to perform inference on what latent skills are used in the few-shot examples, and then make predictions using these inferred latent skills. As we do not change any parameter values of our base encoder-decoder model, we expect this method to be more robust than standard gradient-based adaptation.
Specifically, given a new set of training examples from a new unseen dataset: , we first compute for each new example . We then compute a posterior mean of the new dataset by averaging the posterior mean of each new example . We finally use as the skill representation to generate our prediction for all test examples in this new dataset.
A possible limitation of this approach is that our inference network
might not generalise well to examples from the new task. One possible way to improve our estimateis to optimise:
using the inferred as the initialisation point. This procedure improves our estimate of given an example since and the proportionality constant does not depend on . Furthermore, the prior acts as a natural regulariser that prevents our estimate from deviating too far from previously learnt values. Similar to the above, after doing this optimisation for all new training examples (independently), we then average over all of the few-shot training examples to get , which we then use to generate predictions for the few-shot test data.
We show results with both of our few-adaptation methods—which we refer to as Infer and Infer++—and compare them with a standard fine-tuning based approach in §3.4.
3.1 Tasks and Datasets
|Dataset name||# Train||# Test||Input len||Output len|
Our experiments focus on monolingual text-to-text language generation. We collect a diverse set of tasks and datasets to evaluate our model. We show descriptive statistics of our datasets in Table1 and discuss them below.
We use the following summarisation datasets:
We perform all question answering tasks as a language generation task, rather than span extraction. As we are primarily interested in generating correct answers, we remove all questions which are unanswerable in the MSMARCO and SQuAD datasets.
In the multitask scenario, we train our model on two further categories of tasks: paraphrasing and dialogue. However, we do not evaluate on them. We include three paraphrase corpora: ParaNMT (Wieting and Gimpel, 2018), Quora Question Pairs444https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs, and MRPC (Dolan and Brockett, 2005); we use positive paraphrase examples from paraphrase identification corpora, and generate each sentence in the paraphrase pair from the other one. We also include two dialogue corpora: the Reddit conversational corpus (Al-Rfou et al., 2016), and OpenSubtitles (Lison and Tiedemann, 2016).
3.2 Implementation Details
We train all models on a cluster of 8 Nvidia V100s, with a batch size of 32 for each GPU. We use 6 transformer layers in the encoder and decoder, and use a word embedding size and transformer hidden layer size of 512. The dimensionality of the latent skill space was 64. For training, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate , and optimise for steps. These values are chosen using the development set of each dataset; if a model does not have a development set, we take examples from the head of the training set, where is as shown in Table 1.
Our SentencePiece tokenizer is trained on a random sample of 1 million randomly selected inputs and outputs from the entire set of training examples, and we keep a vocabulary of 24,000 tokens. All contexts are truncated to a maximum of 268 tokens, and all outputs are truncated to 256 tokens. For QA tasks, the question is truncated to at most 32 tokens.
For models with latent variables, we also tune the weight of the KL term (the beta parameter as in Higgins et al. (2018)) and anneal the KL term from 0 to maximum over the course of model training. We find that a beta term of 0.5 and annealing the KL term linearly over 100,000 model steps works the best across all datasets.
3.3 Multitask Evaluation
We first evaluate whether our model can transfer knowledge across multiple tasks without task interference. We compare our model—denoted by Full—to three baseline models which have a simpler latent variable hierarchy:
NoDataset: our first baseline removes the dataset index from the latent variable. Here, all examples for all datasets are generated from the same Gaussian prior with zero mean and identity covariance (see Figure (b)b for a graphical model overview).
NoLatent: the second baseline removes the latent variable , but keeps the conditioning on the dataset index. This can be viewed as collapsing each component in the latent mixture model to a point mass in the latent skill space (Figure (c)c). This model is analogous to a sequence-to-sequence model that is augmented with a trainable task embedding (one for each task) and is trained on multiple tasks.
Base: the last ablation removes all conditioning, and generates each output conditioning only on the input (Figure (c)c). This model is analogous to a standard sequence-to-sequence model that is trained on multiple tasks without any knowledge of the task beyond what exists in the input.
For each model, we also train it on each task independently (the single task setup, where we train a separate model for each dataset) to assess whether the model benefits from multitask training.
Many different metrics have been proposed for evaluating natural language generation systems. The most popular ones are overlap measures based on a reference, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) for summarisation, and and exact match for question answering (Rajpurkar et al., 2016). However, comparing model performance across tasks and across metrics is difficult, as these metrics all take different ranges of values for different metrics and different tasks.
Motivated by the GLUE score proposed by Wang et al. (2019), which aims to compare the performance of models across a range of different tasks, we propose a normalised score to compare multitask language generation systems. For each task, we first report the maximum score for each metric achieved across all of our models. We then report all model results for all metrics as percentages of the maximum score for that metric for that task. This facilitates comparison across tasks, as now the ranges of the metrics for each task are comparable. We report the best scores for each metric and each summarisation task in Table 2, as well as which models achieved the best score. For summarisation, the metrics we use are ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. For question answering, we evaluate using only, as exact match penalises generative question answering models harshly.
Previous multitask natural language generation models often underperform a single-task baseline trained on each dataset separately (McCann et al., 2018; Radford et al., 2019). However, our results in Table 2 and Table 3 clearly demonstrate that our model (Full) which is trained on multiple tasks is the best performing model. In terms of absolute scores (Table 2) on each dataset and each metric (nine datasets, four summarisation metrics, one question answering metric), multitask Full is the best model on 10 out of 27 cases. The second best model (absolute scores) is a single task NoLatent, which is the best model in 7 cases. Note that this single task model does not generalise to other tasks and can only perform well on a specific dataset.
In terms of aggregated scores (Table 3), multitask Full again performs the best. The second best model under this metric is the multitask Base model. Overall results from both NoDataset and NoLatent show that if either task information or the latent variable is removed, the multitask performance of the full model drops significantly. This indicates that modelling latent skills in a continuous shared space is beneficial for performing multiple tasks with one model.
Comparing the results of task-specific and multitask models for question answering datasets, it is evident that multitask training significantly helps for question answering. To evaluate whether this effect is simply down to the models sharing information among question answering tasks only, we train each model on only those tasks, and find worse performance compared to training on all tasks for every model—the best normalised scores (comparable to numbers in Table 3) we see in this condition are 87.08, 72.27 and 79.75 on MSMARCO, NewsQA and SQuAD respectively. Our results show that training a model on summarisation tasks can help improve question answering—Arumae and Liu (2019) previously observe the reverse direction of this phenomenon.
We visualise each of the Gaussian means from our prior (projected into a two-dimensional space using PCA) learnt by our full model in Figure 7. Our model appears to group datasets mainly by the domain the datasets come from, although the two non-news QA datasets also form a cluster, despite being from very different domains. The plot indicates that our model clusters tasks that require similar skills in a meaningful way.
3.4 Few-Shot Learning
|News QA: zero-shot 32.44|
|Infer++||32.59 1.36||33.00 0.39||33.10 0.15|
|Gradient||32.98 0.74||33.94 1.22||33.52 2.11|
|NYT: zero-shot 29.04|
|Infer++||32.35 0.93||32.64 0.41||32.67 0.28|
|Gradient||29.36 2.73||31.59 1.71||33.55 2.05|
|TL;DR: zero-shot 11.38|
|Infer++||10.71 1.02||10.94 0.28||11.00 0.13|
|Gradient||11.15 0.37||10.71 0.51||10.32 0.75|
Few shot results for each adaptation method, including the standard deviations across hyperparameters. We report ROUGE-1 only for this experiment. Our best inference-based adaptation method achieves comparable or superior performance to gradient-based adaptation. In addition, our adaptation methods are more stable across hyperparameter values, as shown by their generally smaller standard deviations. For the TL;DR dataset, none of the adaptation methods is able to make use of the extra example to improve upon the zero-shot performance.
We next evaluate our model in the few-shot setting. In this evaluation, the model is trained on a set of training tasks and then given a few examples from a new task not seen at training. The model needs to adapt to the new task using these few-shot examples and is evaluated on test examples from the new task.
For the initial training stage, we train our model on all tasks other than the NYT, TL;DR and NewsQA datasets. These held-out datasets are chosen to provide an example of a task similar to those in the training tasks (NYT), and two examples of datasets which require some generalisation (TL;DR and NewsQA). For the NewsQA dataset, the model has seen news domain summarisation data and out-of-domain QA data in the initial training stage, and must generalise to news QA. For TL;DR, the model has seen Reddit dialogue data and out-of-domain summarisation data, and must generalise to Reddit summarisation. The summarisation style of TL;DR is highly abstractive, which gives a harder adaptation task.
We use the Full model which performs the best in the previous experiment and compare three few-shot adaptation methods:
Infer: our basic inference-based adaptation (§2.3) that uses the output of the inference network directly.
Gradient: a baseline method that uses the fixed prior mean of to generate each example, and fine-tunes all of the model parameters using gradient descent. We use learning rates in the set and steps per batch of examples in the set .
We present batches of 5 few-shot training examples to our model, and evaluate our model after 5, 50 and 250 examples. We also show the zero-shot performance to provide an idea of whether the model can make use of the few-shot examples to improve its performance. As we do not have a separate development set in the few-shot scenario to tune hyperparameters, we average the test set performance across all hyperparameters over 5 different consecutive model checkpoints—to reduce the high variance of the gradient-based fine-tuning method—and report results in Table4.
Our results show that our best inference-based adaptation method compares favourably to standard fine-tuning. Infer does not provide any additional improvement over the zero-shot result. This suggests that the inference network does not generalise well to out-of-domain data. However, Infer++ is able to improve over the zero-shot model performance (i.e., it is able to make use of the few-shot examples to perform the task better) for 2 of the 3 held out datasets. The only dataset it does not improve on is the TL;DR summarisation corpus, which is a difficult adaptation task and no adaptation method works on this dataset. Inference-based adaptation appears more stable to hyperparameter choices; the standard deviation of the inference-based methods is generally much lower than the gradient-based methods.
4 Related Work
Natural language generation.
Traditional natural language generation approaches have focused on rule-based or template-based generators conditioning on a structured logical representation of the input (Reiter and Dale, 2000)
. With significant advances in deep learning, learning-based approaches conditioning on a wide range of input modalities have been explored(Vinyals et al., 2015; Hinton et al., 2012). In this paper, we focus on conditional language generation tasks where the input is also natural language. Such tasks include those in our training set and others (e.g., text simplification). Many learning-based approaches have been employed to tackle this problem (Kalchbrenner and Blunsom, 2013; Durrett et al., 2016; Zhang and Lapata, 2017).
Our model is a latent variable conditional language model. Unconditional latent variable language models, such as those presented in Bowman et al. (2016), have been shown to learn useful representations for tasks like discourse parsing (Ji et al., 2016) and semi-supervised sentence classification (Yang et al., 2017)
. Further, they have been shown to learn smooth latent spaces that afford interpolation. Conditional latent variable language models have previously been used for open-domain dialogue(Cao and Clark, 2017; Serban et al., 2017) and paraphrasing (Narayan et al., 2016).
Multitask learning for language.
investigate multitask learning with neural models for NLP by sharing a feature extraction neural network across multiple different tasks. They show that powerful neural network feature extractors, trained using a language modelling objective on unlabelled data, can improve performance at specific natural language processing tasks such as part-of-speech tagging and named entity recognition.word2vec (Mikolov et al., 2013) introduced a much simpler training objective for learning word representations, which showed promise when used in many downstream tasks. As computational power increased, methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)—which learn representations of words in context—currently hold the state-of-the-art on many natural language processing tasks.
For language generation specifically, both multitask learning (Guo et al., 2018) and transfer learning (Liu and Lapata, 2019) have been investigated. A recent finding shows that large language models trained on a diverse collection of natural language data can generate coherent high-quality long-form text (Radford et al., 2019). However, it remains difficult to directly control the output of such a model for a particular task. Recent approaches include fine-tuning such models directly on a downstream task (Gehrmann et al., 2019), ensembling a large unconditional language model with a smaller auxiliary conditional model (Dathathri et al., 2020), or using it as the source model in a noisy channel decomposition (Yee et al., 2019; Yu et al., 2019).
Few-shot learning and skill modelling.
Adapting a model to a new task using relatively few examples is a long standing goal of machine learning. One approach is to optimise the fine-tuning objective at training time, whether via gradient descent (Andrychowicz et al., 2016; Finn et al., 2017) or a matching objective (Vinyals et al., 2016). Another approach is to treat few-shot learning as inference in a Bayesian model (Gordon et al., 2019; Ravi and Beatson, 2019). Hausman et al. (2018) present a similar model to us in the context of learning transferrable skills for robotic control tasks. In addition, Garnelo et al. (2018) parametrise a model to directly estimate distributions given few-shot data. Recent work has considered using mixture-of-Gaussian latent variables in the context of continual unsupervised representation learning, where each component represents a cluster of related examples (Rao et al., 2019).
We present a generative model for multitask language generation which augments an encoder-decoder model with a task embedding space for modelling latent skills. We show that the resulting model can perform multiple language generation tasks simultaneously better compared to models which do not use task information or only learns a pointwise task embedding. We also show that our model can generalise to unseen tasks with few-shot examples by inference and adaptation in the latent space, and that this inference procedure is competitive with a standard fine-tuning method that adapts all model parameters in terms of performance and is more stable across hyperparameter choices.
The main limitations of our model are that we need to fix the number of tasks in advance and need to observe the dataset identifier. Exciting avenues for future work include adding the ability to continually grow our model (e.g., by assuming a Dirichlet process prior over the tasks) and designing better ways to incorporate unlabelled data.
We thank Angeliki Lazaridou for helpful comments on an earlier draft of this paper and the Language group at DeepMind for valuable discussions.
- Conversational contextual cues: the case of personalization and history for response ranking. CoRR abs/1606.00372. External Links: Cited by: §3.1.
- Learning to learn by gradient descent by gradient descent. In Proc. of NeurIPS, Cited by: §4.
Guiding extractive summarization with question-answering rewards. In Proc. of NAACL-HLT, Cited by: §3.3.
- MS MARCO: a Human Generated MAchine Reading COmprehension Dataset. CoRR abs/1611.09268. External Links: Cited by: 1st item.
- Generating sentences from a continuous space. In Proc. of CoNLL, Cited by: §4.
- Latent variable dialogue models and their diversity. In Proc. of EACL: Volume 2, Short Papers, Cited by: §4.
- Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §4.
- Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537. Cited by: §4.
- Plug and play language models: a simple approach to controlled text generation. In Proc. of ICLR, Cited by: §4.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Cited by: §4.
- Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, Cited by: §3.1.
- Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL, Cited by: 4th item, §4.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. of ICML, Cited by: §4.
- Neural processes. CoRR abs/1807.01622. External Links: Cited by: §4.
- Generating abstractive summaries with finetuned language models. In Proc. of INLG, Cited by: §4.
- Meta-learning probabilistic inference for prediction. In Proc. of ICLR, Cited by: §4.
- NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proc. of NAACL-HLT, Cited by: 3rd item.
- Dynamic multi-level multi-task learning for sentence simplification. In Proc. of COLING, Cited by: §4.
- Learning an embedding space for transferable robot skills. In Proc. of ICLR.
- Teaching machines to read and comprehend. In Proc. of NeurIPS.
- beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proc. of ICLR.
- Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
- Tying word vectors and word classifiers: A loss framework for language modeling. CoRR abs/1611.01462.
- A latent variable recurrent neural network for discourse-driven language models. In Proc. of NAACL-HLT.
- Recurrent continuous translation models. In Proc. of EMNLP.
- Auto-encoding variational Bayes. CoRR abs/1312.6114.
- Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- WikiHow: A large scale text summarization dataset. CoRR abs/1810.09305.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. of EMNLP: System Demonstrations.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461.
- ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
- OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proc. of LREC.
- Text summarization with pretrained encoders. In Proc. of EMNLP-IJCNLP.
- The natural language decathlon: Multitask learning as question answering. CoRR abs/1806.08730.
- Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, G. H. Bower (Ed.), Vol. 24, pp. 109–165.
- Distributed representations of words and phrases and their compositionality. In Proc. of NeurIPS.
- Paraphrase generation from latent-variable PCFGs for semantic parsing. In Proc. of INLG.
- BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL.
- Deep contextualized word representations. In Proc. of NAACL-HLT.
- Using the output embedding to improve language models. In Proc. of EACL: Volume 2, Short Papers.
- Language models are unsupervised multitask learners. OpenAI Technical Report.
- Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP.
- Continual unsupervised representation learning. In Proc. of NeurIPS.
- Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review 97 (2), pp. 285–308.
- Amortized Bayesian meta-learning. In Proc. of ICLR.
- Building Natural Language Generation Systems. Cambridge University Press, USA.
- Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML.
- A neural attention model for abstractive sentence summarization. In Proc. of EMNLP.
- Get to the point: Summarization with pointer-generator networks. In Proc. of ACL.
- A hierarchical latent variable encoder-decoder model for generating dialogues. In Proc. of AAAI.
- LeafNATS: An open-source toolkit and live demo system for neural abstractive text summarization. In Proc. of NAACL: System Demonstrations.
- Doubly stochastic variational Bayes for non-conjugate inference. In Proc. of ICML.
- NewsQA: A machine comprehension dataset. In Proc. of the Workshop on Representation Learning for NLP.
- Show and tell: A neural image caption generator. In Proc. of CVPR.
- Matching networks for one shot learning. In Proc. of NeurIPS.
- TL;DR: Mining Reddit to learn automatic summarization. In Proc. of the Workshop on New Frontiers in Summarization.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR.
- ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proc. of ACL.
- Neural extractive text summarization with syntactic compression. In Proc. of EMNLP-IJCNLP.
- Improved variational autoencoders for text modeling using dilated convolutions. In Proc. of ICML.
- Simple and effective noisy channel modeling for neural machine translation. In Proc. of EMNLP-IJCNLP.
- Putting machine translation in context with the noisy channel model. CoRR abs/1910.00553.
- PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR abs/1912.08777.
- Sentence simplification with deep reinforcement learning. In Proc. of EMNLP.
Appendix A Complete Results
Appendix B Summarisation Style Transfer
In this section, we report the results of the summarisation style transfer experiment. We use the Full model and consider two evaluation setups:
Inspecting the samples, we observe that the model has learnt different summarisation styles as a result of the different training data. The summaries generated using the Newsroom skill representation often consist of the article lede, whereas summaries generated with the NYT skill representation consist of extracted phrases that are distributed more evenly throughout the article. We note that while the summaries generated using a skill representation that is not intended for a dataset (i.e., the Newsroom representation for NYT and vice versa) score lower on standard metrics than the summaries generated using the “correct” skill representation (35.9 R1 vs. 44.27 R1 on NYT and 28.1 R1 vs. 28.62 R1 on Newsroom), the resulting summaries are still valid summaries, as indicated by the reasonably high scores. This suggests that our model learns useful summarisation skills that generalise to other domains.
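The R1 scores above are ROUGE-1 (Lin, 2004). As a minimal illustration of what is being compared, the following sketch computes unigram-overlap ROUGE-1 F1 (assuming, as is common, that F1 is the reported variant); it omits the stemming and bootstrap resampling of the official ROUGE toolkit, so it is an approximation rather than a drop-in replacement.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each unigram counts at most min(cand, ref) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Applied to a generated summary and its reference, this yields the kind of per-example score that, averaged over a test set, produces the R1 numbers reported above.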
Article: WHEN people die in train collisions, like the 2 engineers and a passenger killed in a commuter train crash in Secaucus, N.J., or the 11 on a Maryland commuter train that hit an Amtrak train in Silver Spring, both in early 1996, there are national headlines and detailed safety investigations. But the attention might be better directed elsewhere: Far more people, more than 400 last year, died in the less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles. The Federal Railroad Administration counted more than 4,000 car-train collisions last year. “Something comes over the fax every minute of every day,” David Bolger, a spokesman for the agency, said with only slight exaggeration. His agency is running a long-term campaign to reduce the toll. The Transportation Department’s goal, set in 1991, is to reduce by 25 percent by 2001 the number of places where tracks cross roads. So far the number is down by a little more than 6 percent, to 268,000, with three years to go. The theory is that if crossings are combined, towns or highway departments can afford to pay for improvements at the ones that survive. Those costs are high. The Transportation Department says a simple “crossbuck …

Reference summary: Matthew L Wald comments on growing concern about train-car collisions, noting that Federal Railroad Administration counted more than 4,000 car-train collisions last year; Transportation Dept goal, set in 1991, is to reduce by 25 percnet [sic] by 2001 number of places where tracks cross roads; so far, that number is down by little more than 6 percent, to 268,000; photo (M)

NYT summary style: Federal Railroad Administration has counted more than 4,000 car-train collisions in 1996, with more than 400 deaths in less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles; Transportation Department’s goal is to reduce by 25 percent by 2001 number of places where tracks cross roads; theory is that if crossings are combined, towns or highway departments can afford to pay for improvements at ones that survive; those costs are high; photo (Special section, Cars) (M)

Newsroom summary style: WHEN people die in train collisions, like the 2 engineers and a passenger killed in a commuter train crash in Secaucus, N.J., or the 11 on a Maryland commuter train that hit an Amtrak train in Silver Spring, both in early 1996, there are national headlines and detailed safety investigations. But the attention might be better directed elsewhere: Far more people, more than 400 last year, died in the less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles. The Federal Railroad Administration counted more than 4,000 car-train collisions last year. “Something comes over the fax every minute of every day,” David Bolger, a spokesman for the agency, said with only slight exaggeration.
Article: Months ago, Alaska posted a list of its 1,600 paroled sex offenders on the World Wide Web. California provides information on about 64,000 offenders on CD-ROM at any county sheriff’s office or big-city police department. And three months ago, Connecticut began letting local police stations release the names, addresses and pictures of paroled offenders to anyone who calls. But in New Jersey, the state that inspired the nationwide movement for public warnings about the presence of sex offenders, information is held far more tightly. Since New Jersey’s community-notification law was hurriedly enacted in 1994 amid the outcry that followed the killing of Megan Kanka of Hamilton Township, it has faced an onslaught of constitutional challenges, and nearly three years of review and revision by Federal and state judges. Indeed, the original “Megan’s Law” is soon to emerge from legal limbo as one of the most restricted in the nation, with tight limits on which offenders are identified, who has access to the information and how those facts may be passed on. By the end of this month, when most of New Jersey’s 21 county prosecutors are expected to begin issuing names and addresses of offenders, only about 600 parolees will be listed, those classified as posing a moderate or high risk of committing new …

Reference summary: New Jersey, which inspired other states to make known the names of paroled sex offenders, has faced onslaught of constitutional challenges to its own Megan’s law, and now the law is emerging from nearly three years of judicial review and revision as one of most restricted in nation; it sets tight limits on which offenders are identified, who has access to information and how those facts may be passed on (M)

NYT summary style: New Jersey, which inspired nationwide movement for public warnings about presence of sex offenders, has faced onslaught of constitutional challenges, and nearly three years of review and revision by Federal and state judges; original ‘Megan’s Law’ is soon to emerge from legal limbo as one of most restricted in nation, with tight limits on which offenders are identified, who has access to information and how those facts may be passed on; photo (M)

Newsroom summary style: Months ago, Alaska posted a list of its 1,600 paroled sex offenders on the World Wide Web. California provides information on about 64,000 offenders on CD-ROM at any county sheriff’s office or big-city police department. And three months ago, Connecticut began letting local police stations release the names, addresses and pictures of paroled offenders to anyone who calls. But in New Jersey, the state that inspired the nationwide movement for public warnings about the presence of sex offenders, information is held far more tightly.
Article: On the road … Pilgrim and Ambassador, 1977, by Raghubir Singh. See it in Everything Was Moving at the Barbican, London August is a quiet month for photography shows, so here’s a preview of some of the exhibition highlights for the next few months. The most anticipated London show is surely Tate Modern’s ambitious double header William Klein/Daido Moriyama, which opens on 10 October. Taking the cities of New York and Tokyo as its starting point, the show contrasts the approaches of two pioneers of impressionistic urban photography. It considers the influence of Klein’s seminal 1956 book, Life Is Good and Good for You in New York, on Japanese photography, and Moriyama in particular. The prodigiously productive Moriyama was a founder of the radical Provoke movement in Japan and, alongside previously unseen vintage prints, the exhibition explores photography’s role in the representation of protest movements and civil unrest. This is an ambitious show that will be a chance for many of us to see lots of Moriyama’s images outside of book form for the first time. I, for one, cannot wait. The other big London exhibition is the Barbican’s group show, Everything Was Moving: Photography from the 60s …

Reference summary: From an ambitious survey of 1960s-70s photography in London to Kohei Yoshiyuki’s controversial work in Liverpool and Amsterdam’s Unseen Photo Fair, there’s a lot to see. By Sean O’Hagan

NYT summary style: Barbican, London August, previews Tate Modern’s ambitious double header William Klein/Daido Moriyama, which opens on 10 October; show contrasts approaches of two pioneers of impressionistic urban photography; it considers influence of Klein’s seminal 1956 book, Life Is Good and Good for You in New York, on Japanese photography, and Moriyama in particular; other big London exhibition is Barbican’s group show, Everything Was Moving: Photography from 60s; photos (M)

Newsroom summary style: Two pioneers of impressionistic urban photography are the latest in London’s show, Everything Was Moving: Photography from the 60s, says Jonathan Jones!!!
Article: BANGALORE, India, June 4 The world’s biggest computer services company could not have chosen a more appropriate setting to lay out its strategy for staying on top. A building housing I.B.M.’s software laboratory and application service teams on the company’s corporate campus in Bangalore, India. On Tuesday, on the expansive grounds of the Bangalore Palace, a colonial-era mansion once inhabited by a maharajah, the chairman and chief executive of I.B.M., Samuel J. Palmisano, will address 10,000 Indian employees. He will share the stage with A. P. J. Abdul Kalam, India’s president, and Sunil Mittal, chairman of the country’s largest cellular services provider, Bharti Tele-Ventures. An additional 6,500 employees will look in on the town hall-style meeting by satellite from other Indian cities. On the same day, Mr. Palmisano and other top executives will meet here with investment analysts and local customers to showcase I.B.M.’s global integration capabilities in a briefing customarily held in New York. During the week, the company will lead the 50 analysts on a tour of its Indian operations. The meetings are more than an exercise in public and investor relations. They …

Reference summary: India provides I.B.M. with its fastest-growing market and a crucial base for delivering services to much of the world.

NYT summary style: IBM chairman and chief executive Samuel J Palmisano will address 10,000 Indian employees on expansive grounds of Bangalore Palace, colonial-era mansion once inhabited by maharajah, will meet with investment analysts and local customers to showcase IBM’s global integration capabilities in briefing customarily held in New York; during week, company will lead 50 analysts on tour of its Indian operations; meetings are more than exercise in public and investor relations; photo (M)

Newsroom summary style: The world’s biggest computer services company could not have chosen a more appropriate setting to lay out its strategy for staying on top.