Modelling Latent Skills for Multitask Language Generation

02/21/2020, by Kris Cao et al.

We present a generative model for multitask conditional language generation. Our guiding hypothesis is that a shared set of latent skills underlies many disparate language generation tasks, and that explicitly modelling these skills in a task embedding space can help with both positive transfer across tasks and with efficient adaptation to new tasks. We instantiate this task embedding space as a latent variable in a latent variable sequence-to-sequence model. We evaluate this hypothesis by curating a series of monolingual text-to-text language generation datasets - covering a broad range of tasks and domains - and comparing the performance of models both in the multitask and few-shot regimes. We show that our latent task variable model outperforms other sequence-to-sequence baselines on average across tasks in the multitask setting. In the few-shot learning setting on an unseen test dataset (i.e., a new task), we demonstrate that model adaptation based on inference in the latent task space is more robust than standard fine-tuning based parameter adaptation and performs comparably in terms of overall performance. Finally, we examine the latent task representations learnt by our model and show that they cluster tasks in a natural way.


1 Introduction

Traditional approaches to conditional natural language generation treat each generation task (e.g., summarisation, paraphrasing, generative question answering) independently and build a task-specific model. However, there are fundamental similarities between these tasks. In an encoder-decoder model, when the output of multiple tasks is in the same language, the structure of that language does not change across tasks, so the generation module (i.e., the decoder) should not have to relearn how to put words together into sentences for every new task. Likewise, when the input for all tasks is in the same language, the model should benefit from sharing a common language understanding module (i.e., the encoder).

Recent papers attempt to design a single model that performs many language generation tasks (Raffel et al., 2019; Lewis et al., 2019; Radford et al., 2019). Such a universal model generally performs worse on a particular task than its task-specific counterpart, but it is able to perform many diverse tasks. In this paper, we endeavour to design a better universal language generation model. Our motivating hypothesis is that modelling tasks explicitly in a latent task embedding space allows a universal model for natural language generation to better represent the highly multimodal distribution of examples arising from diverse tasks, which ultimately leads to better performance.[1]

[1] We use tasks and datasets interchangeably and consider each dataset to be its own task.

We present a multitask generative model (§2) that takes an input sequence x and outputs another sequence y. Our model is based on an encoder-decoder model that is augmented with a latent task space. Each point in the latent task space can be considered as a continuous representation of a “skill”, which is used to customise the encoder-decoder model to perform a specific task for a given input. We model the latent task space as a mixture of Gaussians, where each training task (dataset) is represented by a Gaussian. We provide the model with weak supervision in the form of a dataset identifier d (i.e., which dataset a particular example comes from). This results in an inductive bias that encourages examples which form a specific dataset to be solved with shared underlying skills. In order to generate y, our model first samples a skill variable z from the mixture of Gaussians and outputs y conditioned on both x and z. Crucially, our model is able to generate different outputs given the same input by changing z. We show how to train (§2.2) and use this model for new examples (§2.3).

Figure 5: A graphical model depiction of our model, shown in panel (a) (Full). We also show graphical model representations of three simpler variants, (b) NoDataset, (c) NoLatent, and (d) Base, that we use as baselines in our experiments. Darkly shaded variables are always observed, lightly shaded variables are observed only at training time, and others are latent.

We evaluate our model in both the multitask and few-shot settings. We collect various datasets spanning multiple tasks (e.g., summarisation, dialogue, question answering, etc.),[2] which we discuss in detail in §3.1. Automated evaluation of natural language generation models is notoriously hard, and comparing models across multiple tasks is even more difficult. We follow the example of GLUE (Wang et al., 2019) and report a normalised score across multiple tasks with multiple different metrics. We demonstrate that multitask learning using a latent embedding space improves overall performance across tasks on average—with sometimes dramatic results (§3.3). In the few-shot learning setup, we show that model adaptation based on inference in the latent task space performs comparably to and is more robust than standard fine-tuning based parameter adaptation (§3.4). Finally, we probe the latent space and discover that the model clusters training tasks in a natural way (Figure 7).

[2] We limit our scope to English-to-English language generation tasks. We also only consider tasks where the input conditioning context is natural language. We leave exploration of multilingual generation and incorporation of image-to-text tasks to future work.

2 Model

Given a collection of text-to-text datasets indexed by d, we consider the problem of generating a (natural language) output sequence y given a (natural language) input sequence x, where i indexes an example in a dataset, and j and t index words in x and y respectively. For example, for generative question-answering tasks, we concatenate the context and question to form x and predict the answer as y. For other tasks (i.e., summarisation, dialogue, and paraphrasing), we use the input context as x and predict a summary, a continuation of a dialogue, and a paraphrase respectively. Importantly, our model needs to be able to perform different tasks (i.e., use different skills) given a dataset identifier d and an input x.

We present a hierarchical generative model with a generative story as follows:

  • Given a dataset identifier d and an input example x:

    • Sample an example-specific skill representation z from p(z | d) = N(z; μ_d, Σ_d).

    • Set y_0 to be the start-of-sentence symbol.

    • While y_t is not the end-of-sentence symbol:

      • Sample y_{t+1} from p(y_{t+1} | y_{≤t}, x, z).

Figure 5(a) shows a graphical model depiction of our model. We discuss our model architecture and the training and inference methods in the following sections.
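The generative story above can be made concrete with a small sampling loop. The sketch below is illustrative only and assumes two hypothetical helpers that are not named in the paper: sample_skill, which draws z from the mixture component for dataset d, and next_token_dist, which returns the decoder's distribution over the next token.

```python
import numpy as np

def generate(x_tokens, dataset_id, sample_skill, next_token_dist,
             bos_id=1, eos_id=2, max_len=256, rng=None):
    """Sample an output sequence y given input x and dataset identifier d."""
    rng = rng or np.random.default_rng(0)
    z = sample_skill(dataset_id)                 # z ~ N(mu_d, Sigma_d)
    y = [bos_id]                                 # y_0 = start-of-sentence symbol
    while y[-1] != eos_id and len(y) < max_len:
        probs = next_token_dist(x_tokens, y, z)  # p(y_{t+1} | y_{<=t}, x, z)
        y.append(int(rng.choice(len(probs), p=probs)))
    return y
```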

2.1 Architecture

Figure 6: An illustration of our model architecture. We sample a skill representation z and concatenate it with the embedded tokens of the context. We then pass the context through a transformer network to obtain a representation of the context. We then generate the output token by token, conditioning on the input and previously generated tokens, and feed in the skill representation alongside the previously generated tokens as well. We only show attention in the first layer of the decoder for clarity.

We use an encoder-decoder neural network to model p(y | x, z), illustrated in Figure 6.

Encoder.

We tokenize the input with a SentencePiece tokenizer (Kudo and Richardson, 2018), embed it with a word embedding layer, concatenate a skill embedding vector z (described in detail below) to each input word, and use a transformer encoder to obtain a sequence of contextualised input representations.

Decoder.

For generating outputs, we use the same vocabulary as the encoder input (produced by the SentencePiece tokenizer). We reuse the encoder word embedding layer to embed words for the decoder. This word embedding layer is also used as the final softmax layer of our decoder—following previous work by Inan et al. (2016), Press and Wolf (2017), and others. We use a decoder-specific transformer that attends to encoded representations of the input and representations of previously generated outputs to produce the next target word y_t. Note that the first word that the decoder conditions on is always the start-of-sentence symbol, and we concatenate the skill embedding vector z to the word embeddings of previously generated words when computing their contextualised representations using the decoder-specific transformer.
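As a rough illustration of how the skill vector enters the network, the snippet below concatenates z to every token embedding before the transformer, following the description above. The function name and array shapes are our own assumptions, not the authors' code.

```python
import numpy as np

def build_encoder_inputs(token_ids, embedding_matrix, z):
    """Concatenate the skill vector z to every token embedding.

    token_ids: (T,) int array; embedding_matrix: (V, E); z: (Z,).
    Returns a (T, E + Z) array to be fed to the transformer encoder.
    """
    token_embs = embedding_matrix[np.asarray(token_ids)]    # (T, E)
    z_tiled = np.tile(z, (token_embs.shape[0], 1))          # (T, Z)
    return np.concatenate([token_embs, z_tiled], axis=-1)
```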

Latent task space.

A naïve design of the latent space would be to assume a structureless Gaussian prior and hope that examples which require similar skills to solve naturally cluster in this latent space. However, with such a unimodal distribution, we have little control over exactly what features the model will rely on to perform the clustering.

To encourage the model to cluster examples from the same dataset together, we use a mixture of Gaussians prior:

p(z) = ∑_{d=1}^{K+1} p(d) N(z; μ_d, Σ_d),    (1)

where d denotes a dataset and K is the number of training datasets. Each of the K + 1 Gaussians has two learnable parameters: the mean μ_d and covariance matrix Σ_d (assumed to be diagonal). Each of the first K Gaussians represents a dataset that exists in our training set. The final (K+1)-th Gaussian is used to accommodate unseen datasets, which arise in certain evaluation setups (e.g., when the model is presented with examples from a new dataset at test time). We defer discussion of test time to §2.3.
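A minimal sketch of such a prior, assuming K training datasets plus one extra component for unseen data and diagonal covariances as stated above; the class and its parameter initialisation are hypothetical, not the authors' implementation.

```python
import numpy as np

class GaussianMixturePrior:
    """Sketch of the prior in Eq. 1: K diagonal Gaussians for the training
    datasets plus one extra component for unseen datasets."""

    def __init__(self, num_datasets, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.means = rng.normal(scale=0.1, size=(num_datasets + 1, dim))
        self.log_vars = np.zeros((num_datasets + 1, dim))  # diagonal covariances

    def sample(self, dataset_id, rng=None):
        """Draw z ~ N(mu_d, diag(exp(log_vars_d))) for a given dataset id."""
        rng = rng or np.random.default_rng(0)
        mu = self.means[dataset_id]
        std = np.exp(0.5 * self.log_vars[dataset_id])
        return mu + std * rng.standard_normal(mu.shape)
```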

2.2 Training

The objective function that we want to maximise is the log marginal likelihood:

log p(y | x, d) = log ∫ p(y | x, z) p(z | d) dz.

This objective function is intractable due to the integration over z, so we resort to (amortised) variational inference (Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014).

We introduce a variational distribution q(z | x, y, d), which takes the form of a normal distribution N(z; μ_q, Σ_q). We use a feed-forward neural network that takes as input x and y (concatenated as one sequence) and a one-hot vector that represents the dataset the example comes from (i.e., a one-hot vector of size K + 1) to parametrise μ_q and Σ_q.[3]

[3] We restrict the posterior covariance matrix Σ_q to be a diagonal matrix as well.

At training time, we always observe the dataset label d, so we set p(d′ = d) = 1 and p(d′ ≠ d) = 0 when computing the prior in Eq. 1. However, recall that our latent task space is composed of K + 1 Gaussians. For 90% of the training examples from all tasks, we keep their original dataset label d. For the remaining 10% (chosen randomly), we assign these examples to the (K+1)-th Gaussian to allow the model to accommodate unseen datasets (i.e., examples which come from datasets that are not in the training datasets) at evaluation time. The goal is to allow the final Gaussian to be a generalist that can be adapted quickly to a new dataset. We describe this procedure in detail in §2.3. The final variational lower bound of the above likelihood function that we optimise to train our model is:

log p(y | x, d) ≥ E_{q(z | x, y, d)}[log p(y | x, z)] − KL(q(z | x, y, d) ‖ p(z | d)).

We use a single sample from q(z | x, y, d) to approximate the expectation and compute the KL between q(z | x, y, d) and p(z | d) in closed form.
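The single-sample estimator might look roughly like the sketch below, which uses the reparameterisation trick and the closed-form KL between diagonal Gaussians. The helper names (log_lik_fn and the posterior/prior parameter arguments) are placeholders for the outputs of the decoder, inference network, and prior, not the authors' implementation; beta defaults to 1, with the KL weighting and annealing of §3.2 left to the caller.

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag(e^logvar_q)) || N(mu_p, diag(e^logvar_p)) )."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo_single_sample(log_lik_fn, mu_q, logvar_q, mu_p, logvar_p,
                       beta=1.0, rng=None):
    """One-sample Monte Carlo estimate of the variational lower bound.

    log_lik_fn(z) stands in for log p(y | x, z); (mu_q, logvar_q) come from the
    inference network and (mu_p, logvar_p) from the mixture component for d.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(np.shape(mu_q))
    z = mu_q + np.exp(0.5 * logvar_q) * eps      # reparameterised sample
    return log_lik_fn(z) - beta * diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```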

2.3 Predictions

Recall that our model always receives weak supervision in the form of which dataset a test example comes from—either a particular training dataset or a new unseen dataset. We have three evaluation modes: in-domain evaluation, zero-shot evaluation, and few-shot adaptation.

In-domain evaluation.

In this setup, we evaluate on an example from a dataset that we have seen at training time. This is a classic multitask evaluation scenario. Given a test example x, we use the mean μ_d of the Gaussian associated with this dataset to obtain z and decode to generate our prediction ŷ. We use this evaluation mode in §3.3.

Zero-shot transfer.

For zero-shot evaluation, the model is given an example from a new unseen dataset. We simply use the mean of the extra (K+1)-th Gaussian to obtain z and decode to generate our prediction ŷ. We use this evaluation mode to establish baseline zero-shot performance and compare it with few-shot learning in §3.4.

Few-shot adaptation.

A crucial capability in machine learning is rapidly adapting an existing model to a new dataset, often with limited examples. In this setup, we are given a few examples from a new dataset that has not been seen at training time. The model must then make predictions for more examples drawn from the same distribution.

One approach towards few-shot adaptation is to optimise the parameters of the model on the few-shot data, typically via gradient descent. This effectively retrains (fine-tunes) the model on the newly presented training data, and is currently the dominant paradigm for transfer learning. However, fine-tuning all the parameters of the model with few examples is known to be sensitive to the hyperparameters of the tuning procedure. It also carries the risk of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where prior knowledge learnt on the training tasks is not retained and performance on them dramatically decreases.

On the other hand, our model explicitly learns latent skills which can be used across tasks. Our approach is to perform inference on what latent skills are used in the few-shot examples, and then make predictions using these inferred latent skills. As we do not change any parameter values of our base encoder-decoder model, we expect this method to be more robust than standard gradient-based adaptation.

Specifically, given a set of new training examples {(x_n, y_n)}_{n=1}^N from a new unseen dataset, we first compute the approximate posterior q(z | x_n, y_n, d = K+1) for each new example n. We then compute a posterior mean for the new dataset, z̄, by averaging the posterior means of the new examples. We finally use z̄ as the skill representation to generate our predictions for all test examples in this new dataset.
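A minimal sketch of this Infer procedure, assuming a hypothetical posterior_mean_fn that stands in for the inference network's mean output:

```python
import numpy as np

def infer_adaptation(few_shot_examples, posterior_mean_fn, unknown_id):
    """Average the per-example posterior means to get one skill vector.

    few_shot_examples: iterable of (x, y) pairs from the new dataset;
    posterior_mean_fn(x, y, d) returns the inference network's mean for z;
    unknown_id is the index of the extra (K+1)-th mixture component.
    """
    means = [posterior_mean_fn(x, y, unknown_id) for x, y in few_shot_examples]
    return np.mean(means, axis=0)  # shared skill representation for the new task
```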

A possible limitation of this approach is that our inference network q(z | x, y, d) might not generalise well to examples from the new task. One possible way to improve our estimate of z is to optimise:

z* = argmax_z [ log p(y | x, z) + log p(z | d = K+1) ],    (2)

using the inferred posterior mean as the initialisation point. This procedure improves our estimate of z given an example since p(z | x, y, d) ∝ p(y | x, z) p(z | d) and the proportionality constant does not depend on z. Furthermore, the prior acts as a natural regulariser that prevents our estimate from deviating too far from previously learnt values. Similar to the above, after performing this optimisation for all new training examples (independently), we average the refined estimates over all of the few-shot training examples to get z̄, which we then use to generate predictions for the few-shot test data.
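This refinement amounts to a few steps of gradient ascent on the log joint in Eq. 2, starting from the inferred posterior mean. The sketch below assumes a hypothetical grad_log_joint callable (in practice this gradient would come from backpropagation through the decoder and the prior); it is illustrative, not the authors' code.

```python
import numpy as np

def infer_plus_plus(z_init, grad_log_joint, lr=1e-3, steps=100):
    """Refine z by gradient ascent on log p(y | x, z) + log p(z | d = K+1).

    z_init is the inference network's posterior mean for one few-shot example;
    grad_log_joint(z) returns the gradient of the log joint with respect to z.
    """
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        z = z + lr * grad_log_joint(z)
    return z
```

The refined estimates for the individual few-shot examples would then be averaged, as in the basic procedure above.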

We show results with both of our few-shot adaptation methods—which we refer to as Infer and Infer++—and compare them with a standard fine-tuning based approach in §3.4.

3 Experiments

3.1 Tasks and Datasets

Dataset name # Train # Test Input len Output len
Gigaword 3,803,920 189,651 31.35 8.23
CNN/DM 287,226 13,368 791.38 55.17
NEWSROOM 995,033 108,837 659.08 26.82
NYT 137,778 17,222 995.22 80.48
TL;DR 3,077,981 6,400 211.50 25.89
Wikihow 168,128 6,000 508.26 52.35
MSMARCO 502,932 55,578 67.66 12.89
NewsQA 76,560 4,341 608.92 4.04
SQuAD 86,821 5,928 129.86 3.16
Table 1: Summary statistics of each of our evaluation datasets. The length of the input/output is defined as the number of whitespace-separated tokens after preprocessing.

Our experiments focus on monolingual text-to-text language generation. We collect a diverse set of tasks and datasets to evaluate our model. We show descriptive statistics of our datasets in Table 1 and discuss them below.

Summarisation.

We use the following summarisation datasets:

  • Gigaword news headline generation (Rush et al., 2015). We reprocess the dataset to remove UNK tokens.

  • CNN/Daily Mail news article summarisation (Hermann et al., 2015; See et al., 2017).

  • NEWSROOM news article summarisation (Grusky et al., 2018).

  • NYT news article summarisation (Durrett et al., 2016). We use the splits provided in Xu and Durrett (2019).

  • TL;DR Reddit article summarisation (Völske et al., 2017).

  • Wikihow instructional article summarisation (Koupaee and Wang, 2018).

Question Answering.

We treat all question answering tasks as language generation tasks, rather than span extraction. As we are primarily interested in generating correct answers, we remove all unanswerable questions from the MSMARCO and SQuAD datasets.

  • MSMARCO web article QA (Bajaj et al., 2016). We form the context by concatenating all passages that the annotator selected as useful.

  • NewsQA news article QA (Trischler et al., 2017). We use the consensus answer.

  • SQuAD Wikipedia QA (Rajpurkar et al., 2016).

Others.

In the multitask scenario, we train our model on two further categories of tasks: paraphrasing and dialogue. However, we do not evaluate on them. We include three paraphrase corpora: ParaNMT (Wieting and Gimpel, 2018), Quora Question Pairs444https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs, and MRPC (Dolan and Brockett, 2005); we use positive paraphrase examples from paraphrase identification corpora, and generate each sentence in the paraphrase pair from the other one. We also include two dialogue corpora: the Reddit conversational corpus (Al-Rfou et al., 2016), and OpenSubtitles (Lison and Tiedemann, 2016).

3.2 Implementation Details

We train all models on a cluster of 8 Nvidia V100s, with a batch size of 32 per GPU. We use 6 transformer layers in both the encoder and the decoder, and a word embedding size and transformer hidden layer size of 512. The dimensionality of the latent skill space is 64. For training, we use the Adam optimizer (Kingma and Ba, 2014); the learning rate and the number of optimisation steps are chosen using the development set of each dataset. If a dataset does not have a development set, we take examples from the head of the training set to serve as one (see Table 1).

Our SentencePiece tokenizer is trained on 1 million randomly selected inputs and outputs from the entire set of training examples, and we keep a vocabulary of 24,000 tokens. All contexts are truncated to a maximum of 268 tokens, and all outputs are truncated to 256 tokens. For QA tasks, the question is truncated to at most 32 tokens.

For models with latent variables, we also tune the weight of the KL term (the beta parameter of Higgins et al. (2018)) and anneal the KL term from 0 to its maximum value over the course of model training. We find that a beta value of 0.5, with the KL term annealed linearly over the first 100,000 model steps, works best across all datasets.
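For concreteness, this schedule corresponds to a simple linear ramp; the helper below is an illustrative sketch using the constants from this section.

```python
def kl_weight(step, beta_max=0.5, anneal_steps=100_000):
    """Linear KL annealing: the weight grows from 0 to beta_max over the
    first anneal_steps updates, then stays constant."""
    return beta_max * min(step / anneal_steps, 1.0)
```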

3.3 Multitask Evaluation

We first evaluate whether our model can transfer knowledge across multiple tasks without task interference. We compare our model—denoted by Full—to three baseline models which have a simpler latent variable hierarchy:

  • NoDataset: our first baseline removes the dataset index d from the latent variable prior. Here, all examples from all datasets are generated from the same Gaussian prior with zero mean and identity covariance (see Figure 5(b) for a graphical model overview).

  • NoLatent: the second baseline removes the latent variable z, but keeps the conditioning on the dataset index. This can be viewed as collapsing each component in the latent mixture model to a point mass in the latent skill space (Figure 5(c)). This model is analogous to a sequence-to-sequence model that is augmented with a trainable task embedding (one for each task) and is trained on multiple tasks.

  • Base: the last ablation removes all conditioning, and generates each output conditioning only on the input (Figure 5(d)). This model is analogous to a standard sequence-to-sequence model that is trained on multiple tasks without any knowledge of the task beyond what exists in the input.

For each model, we also train it on each task independently (the single task setup, where we train a separate model for each dataset) to assess whether the model benefits from multitask training.

Evaluation metrics.

Many different metrics have been proposed for evaluating natural language generation systems. The most popular ones are overlap measures based on a reference, such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) for summarisation, and F1 and exact match for question answering (Rajpurkar et al., 2016). However, comparing model performance across tasks and across metrics is difficult, as these metrics take values in different ranges for different tasks.

Motivated by the GLUE score proposed by Wang et al. (2019), which aims to compare the performance of models across a range of different tasks, we propose a normalised score to compare multitask language generation systems. For each task and metric, we first find the maximum score achieved across all of our models. We then report every model's result as a percentage of that maximum. This facilitates comparison across tasks, as the ranges of the metrics for each task become comparable. We report the best scores for each metric and each summarisation task in Table 2, as well as which models achieved the best score. For summarisation, the metrics we use are ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. For question answering, we evaluate using F1 only, as exact match penalises generative question answering models harshly.
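A small sketch of this normalisation, assuming raw scores are stored in a nested dictionary keyed by model and (task, metric); the data layout is our own choice, not part of the paper.

```python
def normalised_scores(raw):
    """Report each score as a percentage of the best score for that task/metric.

    raw: dict mapping model name -> {(task, metric): score}.
    """
    best = {}
    for scores in raw.values():
        for key, value in scores.items():
            best[key] = max(best.get(key, float("-inf")), value)
    return {
        model: {key: 100.0 * value / best[key] for key, value in scores.items()}
        for model, scores in raw.items()
    }
```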

R1 R2 RL BLEU
Gigaword 50.14c,s 26.56c,s 47.28c,s 23.20c,s
CNN/DM 41.82b,s 16.83a,s 28.23b,s 13.67a,m
Newsroom 34.35c,s 24.50c,s 31.90c,s 39.45b,s
NYT 44.79a,m 28.32a,m 36.65a,m 30.01a,m
TL;DR 14.27a,s 2.14a,s 10.54a,s 2.18a,s
Wikihow 26.73a,m 8.13a,m 20.48a,m 7.28a,m
MSMARCO 65.46a,s
NewsQA 49.00c,m
SQuAD 73.81a,m
Table 2: The best results for each task and each metric. Superscripts (a, b, c, d) refer to which model achieves that score in the notation of Figure 5, i.e., (Full, NoDataset, NoLatent, Base) respectively; (s, m) refer to whether the model was trained in the single-task or multitask setting. These serve as the normalisation constants for the scores we report in Table 3. We show the full results for all models in Appendix A. The current state-of-the-art R1 for CNN/DM is 44.14 (Zhang et al., 2019); for Newsroom, 39.91 (Shi et al., 2019); and for NYT, 45.50 (Xu and Durrett, 2019). Our model's results are in the range of these state-of-the-art numbers.
Single task Multitask
Dataset Full NoDataset NoLatent Base Full NoDataset NoLatent Base
Gigaword 90.04 94.86 100.00 99.30 87.57 74.23 91.01 90.96
CNN/DM 98.21 99.74 87.81 86.72 98.21 89.43 85.31 91.53
Newsroom 83.40 97.95 97.92 79.13 92.77 74.85 80.47 82.36
NYT 98.96 98.47 91.21 93.10 100.00 82.92 94.72 96.76
TL;DR 100.00 94.47 83.02 57.97 87.54 73.08 61.52 76.16
Wikihow 88.25 88.50 58.41 67.64 100.00 85.10 63.95 63.21
MSMARCO 100.00 91.44 79.40 84.73 94.55 82.59 94.53 98.74
NewsQA 22.57 25.55 18.08 21.43 99.38 86.69 100.00 98.72
SQuAD 18.58 17.91 12.36 13.02 100.00 89.67 98.98 99.26
Average (summ.) 93.14 95.67 86.39 80.64 94.35 79.94 79.50 83.50
Average (QA) 47.05 44.97 36.61 39.73 97.98 86.32 97.84 98.91
Average (all) 77.78 78.77 69.80 67.01 95.56 82.06 85.61 88.63
Table 3: Single and multitask results for our evaluation datasets, reported using the aggregate metric described in §3.3. Note that our full model results for summarisation improve with multitask training, compared to the ablated models. Multitask training seems to be uniformly beneficial for QA tasks across all models.
Figure 7: PCA plot of the task means learnt by our latent skill model. Each point represents the mean of a Gaussian component of our prior p(z), with the colour denoting the nature of the task and the label text denoting the domain of the dataset; “unknown” represents the mean of the extra (K+1)-th Gaussian that we use for unseen datasets. Note that all news domain datasets form a cluster, as do tasks with Reddit data.

Results.

Previous multitask natural language generation models often underperform single-task baselines trained on each dataset separately (McCann et al., 2018; Radford et al., 2019). However, our results in Table 2 and Table 3 clearly demonstrate that our model (Full) trained on multiple tasks is the best performing model. In terms of absolute scores (Table 2) on each dataset and each metric (nine datasets, four summarisation metrics, one question answering metric), multitask Full is the best model in 10 out of 27 cases. The second best model by absolute scores is single-task NoLatent, which is the best model in 7 cases. Note that this single-task model does not generalise to other tasks and can only perform well on a specific dataset.

In terms of aggregated scores (Table 3), multitask Full again performs the best. The second best model under this metric is the multitask Base model. The results for NoDataset and NoLatent show that removing either the task information or the latent variable causes multitask performance to drop significantly relative to the full model. This indicates that modelling latent skills in a continuous shared space is beneficial for performing multiple tasks with one model.

Comparing the results of task-specific and multitask models on the question answering datasets, it is evident that multitask training significantly helps question answering. To evaluate whether this effect is simply due to the models sharing information among question answering tasks only, we train each model on only those tasks, and find worse performance compared to training on all tasks for every model—the best normalised scores (comparable to the numbers in Table 3) we see in this condition are 87.08, 72.27 and 79.75 on MSMARCO, NewsQA and SQuAD respectively. Our results show that training a model on summarisation tasks can help improve question answering—Arumae and Liu (2019) previously observed the reverse direction of this phenomenon.

We visualise each of the Gaussian means from our prior (projected into a two-dimensional space using PCA) learnt by our full model in Figure 7. Our model appears to group datasets mainly by the domain the datasets come from, although the two non-news QA datasets also form a cluster, despite being from very different domains. The plot indicates that our model clusters tasks that require similar skills in a meaningful way.

3.4 Few-Shot Learning

# examples 5 50 250
NewsQA: zero-shot 32.44
Infer 32.45 32.45 32.45
Infer++ 32.59 ±1.36 33.00 ±0.39 33.10 ±0.15
Gradient 32.98 ±0.74 33.94 ±1.22 33.52 ±2.11
NYT: zero-shot 29.04
Infer 29.06 29.06 29.06
Infer++ 32.35 ±0.93 32.64 ±0.41 32.67 ±0.28
Gradient 29.36 ±2.73 31.59 ±1.71 33.55 ±2.05
TL;DR: zero-shot 11.38
Infer 11.37 11.37 11.37
Infer++ 10.71 ±1.02 10.94 ±0.28 11.00 ±0.13
Gradient 11.15 ±0.37 10.71 ±0.51 10.32 ±0.75
Table 4: Few-shot results for each adaptation method, including standard deviations across hyperparameters. We report ROUGE-1 only for this experiment. Our best inference-based adaptation method achieves comparable or superior performance to gradient-based adaptation. In addition, our adaptation methods are more stable across hyperparameter values, as shown by their generally smaller standard deviations. For the TL;DR dataset, none of the adaptation methods is able to make use of the extra examples to improve upon the zero-shot performance.

We next evaluate our model in the few-shot setting. In this evaluation, the model is trained on a set of training tasks and then given a few examples from a new task not seen at training. The model needs to adapt to the new task using these few-shot examples and is evaluated on test examples from the new task.

For the initial training stage, we train our model on all tasks other than the NYT, TL;DR and NewsQA datasets. These held-out datasets are chosen to provide one example of a task similar to the training tasks (NYT), and two examples of datasets which require some generalisation (TL;DR and NewsQA). For the NewsQA dataset, the model has seen news domain summarisation data and out-of-domain QA data in the initial training stage, and must generalise to news QA. For TL;DR, the model has seen Reddit dialogue data and out-of-domain summarisation data, and must generalise to Reddit summarisation. The summarisation style of TL;DR is highly abstractive, which makes it a harder adaptation task.

We use the Full model which performs the best in the previous experiment and compare three few-shot adaptation methods:

  • Infer: our basic inference-based adaptation (§2.3) that uses the output of the inference network directly.

  • Infer++: our improved inference-based adaptation (§2.3) that refines its estimates further by optimising Eq. 2. We use gradient descent to optimise Eq. 2, sweeping over a small set of learning rates and maximum numbers of steps.

  • Gradient: a baseline method that uses the fixed prior mean of the extra (K+1)-th Gaussian to generate each example, and fine-tunes all of the model parameters using gradient descent. We sweep over a small set of learning rates and numbers of steps per batch of examples.

We present batches of 5 few-shot training examples to our model, and evaluate it after 5, 50 and 250 examples. We also show the zero-shot performance to provide an idea of whether the model can make use of the few-shot examples to improve its performance. As we do not have a separate development set in the few-shot scenario to tune hyperparameters, we average the test set performance across all hyperparameters over 5 different consecutive model checkpoints—to reduce the high variance of the gradient-based fine-tuning method—and report results in Table 4.

Our results show that our best inference-based adaptation method compares favourably to standard fine-tuning. Infer does not provide any additional improvement over the zero-shot result, which suggests that the inference network does not generalise well to out-of-domain data. However, Infer++ is able to improve over the zero-shot model performance (i.e., it is able to make use of the few-shot examples to perform the task better) on 2 of the 3 held-out datasets. The only dataset it does not improve on is the TL;DR summarisation corpus, a difficult adaptation task on which no adaptation method succeeds. Inference-based adaptation also appears more stable to hyperparameter choices; the standard deviations of the inference-based methods are generally much lower than those of the gradient-based method.

4 Related Work

Natural language generation.

Traditional natural language generation approaches have focused on rule-based or template-based generators conditioning on a structured logical representation of the input (Reiter and Dale, 2000). With significant advances in deep learning, learning-based approaches conditioning on a wide range of input modalities have been explored (Vinyals et al., 2015; Hinton et al., 2012). In this paper, we focus on conditional language generation tasks where the input is also natural language. Such tasks include those in our training set and others (e.g., text simplification). Many learning-based approaches have been employed to tackle this problem (Kalchbrenner and Blunsom, 2013; Durrett et al., 2016; Zhang and Lapata, 2017).

Our model is a latent variable conditional language model. Unconditional latent variable language models, such as those presented in Bowman et al. (2016), have been shown to learn useful representations for tasks like discourse parsing (Ji et al., 2016) and semi-supervised sentence classification (Yang et al., 2017). Further, they have been shown to learn smooth latent spaces that afford interpolation. Conditional latent variable language models have previously been used for open-domain dialogue (Cao and Clark, 2017; Serban et al., 2017) and paraphrasing (Narayan et al., 2016).

Multitask learning for language.

Multitask learning, sensu lato, tries to improve the performance of a model on a particular task by leveraging the information contained in similar tasks (Caruana, 1997). Collobert et al. (2011) investigate multitask learning with neural models for NLP by sharing a feature extraction neural network across multiple different tasks. They show that powerful neural network feature extractors, trained using a language modelling objective on unlabelled data, can improve performance at specific natural language processing tasks such as part-of-speech tagging and named entity recognition.

word2vec (Mikolov et al., 2013) introduced a much simpler training objective for learning word representations, which showed promise when used in many downstream tasks. As computational power increased, methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)—which learn representations of words in context—currently hold the state-of-the-art on many natural language processing tasks.

For language generation specifically, both multitask learning (Guo et al., 2018) and transfer learning (Liu and Lapata, 2019) have been investigated. A recent finding shows that large language models trained on a diverse collection of natural language data can generate coherent high-quality long-form text (Radford et al., 2019). However, it remains difficult to directly control the output of such a model for a particular task. Recent approaches include fine-tuning such models directly on a downstream task (Gehrmann et al., 2019), ensembling a large unconditional language model with a smaller auxiliary conditional model (Dathathri et al., 2020), or using it as the source model in a noisy channel decomposition (Yee et al., 2019; Yu et al., 2019).

Few-shot learning and skill modelling.

Adapting a model to a new task using relatively few examples is a long-standing goal of machine learning. One approach is to optimise the fine-tuning objective at training time, whether via gradient descent (Andrychowicz et al., 2016; Finn et al., 2017) or a matching objective (Vinyals et al., 2016). Another approach is to treat few-shot learning as inference in a Bayesian model (Gordon et al., 2019; Ravi and Beatson, 2019). Hausman et al. (2018) present a model similar to ours in the context of learning transferable skills for robotic control tasks. In addition, Garnelo et al. (2018) parametrise a model to directly estimate distributions given few-shot data. Recent work has considered using mixture-of-Gaussian latent variables in the context of continual unsupervised representation learning, where each component represents a cluster of related examples (Rao et al., 2019).

5 Conclusion

We present a generative model for multitask language generation which augments an encoder-decoder model with a task embedding space for modelling latent skills. We show that the resulting model performs multiple language generation tasks simultaneously better than models which do not use task information or which only learn a pointwise task embedding. We also show that our model can generalise to unseen tasks with few-shot examples through inference and adaptation in the latent space, and that this inference procedure is competitive in performance with a standard fine-tuning method that adapts all model parameters, while being more stable across hyperparameter choices.

The main limitations of our model are that we need to fix the number of tasks in advance and need to observe the dataset identifier. Exciting avenues for future work include adding the ability to continually grow our model (e.g., by assuming a Dirichlet process prior over the tasks) and designing better ways to incorporate unlabelled data.

Acknowledgements

We thank Angeliki Lazaridou for helpful comments on an earlier draft of this paper and the Language group at DeepMind for valuable discussions.

References

  • R. Al-Rfou, M. Pickett, J. Snaider, Y. Sung, B. Strope, and R. Kurzweil (2016) Conversational contextual cues: the case of personalization and history for response ranking. CoRR abs/1606.00372. External Links: 1606.00372 Cited by: §3.1.
  • M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. In Proc. of NeurIPS, Cited by: §4.
  • K. Arumae and F. Liu (2019) Guiding extractive summarization with question-answering rewards. In Proc. of NAACL-HLT, Cited by: §3.3.
  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang (2016) MS MARCO: a Human Generated MAchine Reading COmprehension Dataset. CoRR abs/1611.09268. External Links: 1611.09268 Cited by: 1st item.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proc. of CoNLL, Cited by: §4.
  • K. Cao and S. Clark (2017) Latent variable dialogue models and their diversity. In Proc. of EACL: Volume 2, Short Papers, Cited by: §4.
  • R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §4.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537. Cited by: §4.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2020) Plug and play language models: a simple approach to controlled text generation. In Proc. of ICLR, Cited by: §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Cited by: §4.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, Cited by: §3.1.
  • G. Durrett, T. Berg-Kirkpatrick, and D. Klein (2016) Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL, Cited by: 4th item, §4.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. of ICML, Cited by: §4.
  • M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. A. Eslami, and Y. W. Teh (2018) Neural processes. CoRR abs/1807.01622. External Links: 1807.01622 Cited by: §4.
  • S. Gehrmann, Z. Ziegler, and A. Rush (2019) Generating abstractive summaries with finetuned language models. In Proc. of INLG, Cited by: §4.
  • J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. Turner (2019) Meta-learning probabilistic inference for prediction. In Proc. of ICLR, Cited by: §4.
  • M. Grusky, M. Naaman, and Y. Artzi (2018) NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proc. of NAACL-HLT, Cited by: 3rd item.
  • H. Guo, R. Pasunuru, and M. Bansal (2018) Dynamic multi-level multi-task learning for sentence simplification. In Proc. of COLING, Cited by: §4.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. In Proc. of ICLR, Cited by: §4.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Proc. of NeurIPS, Cited by: 2nd item.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2018) Beta-vae: learning basic visual concepts with a constrained variational framework.. In Proc. of ICLR, Cited by: §3.2.
  • G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §4.
  • H. Inan, K. Khosravi, and R. Socher (2016) Tying word vectors and word classifiers: A loss framework for language modeling. CoRR abs/1611.01462. External Links: 1611.01462 Cited by: §2.1.
  • Y. Ji, G. Haffari, and J. Eisenstein (2016) A latent variable recurrent neural network for discourse-driven language models. In Proc. of NAACL-HLT, Cited by: §4.
  • N. Kalchbrenner and P. Blunsom (2013) Recurrent continuous translation models. In Proc. of EMNLP, Cited by: §4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. CoRR abs/1312.6114. External Links: 1312.6114 Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: §3.2.
  • M. Koupaee and W. Y. Wang (2018) WikiHow: a large scale text summarization dataset. CoRR abs/1810.09305. External Links: 1810.09305 Cited by: 6th item.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. of EMNLP: System Demonstrations, Cited by: §2.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461. External Links: 1910.13461 Cited by: §1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Cited by: §3.3.
  • P. Lison and J. Tiedemann (2016) OpenSubtitles2016: extracting large parallel corpora from movie and tv subtitles. In Proc. of LREC, Cited by: §3.1.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. In Proc. of EMNLP-IJCNLP, Cited by: §4.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. CoRR abs/1806.08730. Cited by: §3.3.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, G. H. Bower (Ed.), Vol. 24, pp. 109–165. External Links: ISSN 0079-7421 Cited by: §2.3.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proc. of NeurIPS, Cited by: §4.
  • S. Narayan, S. Reddy, and S. B. Cohen (2016) Paraphrase generation from latent-variable PCFGs for semantic parsing. In Proc. of INLG, Cited by: §4.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, Cited by: §3.3.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL-HLT, Cited by: §4.
  • O. Press and L. Wolf (2017) Using the output embedding to improve language models. In Proc. of EACL: Volume 2, Short Papers, Cited by: §2.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. External Links: Link Cited by: §1, §3.3, §4.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683. External Links: 1910.10683 Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, Cited by: 3rd item, §3.3.
  • D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell (2019) Continual unsupervised representation learning. In Proc. of NeurIPS, Cited by: §4.
  • R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.. Psychological Review 97 (2), pp. 285–308. Cited by: §2.3.
  • S. Ravi and A. Beatson (2019) Amortized bayesian meta-learning. In Proc. of ICLR, Cited by: §4.
  • E. Reiter and R. Dale (2000) Building natural language generation systems. Cambridge University Press, USA. Cited by: §4.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proc. of ICML, Cited by: §2.2.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In Proc. of EMNLP, Cited by: 1st item.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proc. of ACL, Cited by: 2nd item.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Proc. of AAAI, Cited by: §4.
  • T. Shi, P. Wang, and C. K. Reddy (2019) LeafNATS: an open-source toolkit and live demo system for neural abstractive text summarization. In Proc. of NAACL: System Demonstrations, Cited by: Table 2.
  • M. Titsias and M. Lázaro-Gredilla (2014) Doubly stochastic variational bayes for non-conjugate inference. In Proc. of ICML, Cited by: §2.2.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2017) NewsQA: a machine comprehension dataset. In Proc. of Workshop on Representation Learning for NLP, Cited by: 2nd item.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proc. of CVPR, Cited by: §4.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Proc. of NeurIPS, Cited by: §4.
  • M. Völske, M. Potthast, S. Syed, and B. Stein (2017) TL;DR: mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Cited by: 5th item.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proc. of ICLR, Cited by: §1, §3.3.
  • J. Wieting and K. Gimpel (2018) ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proc. of ACL, Cited by: §3.1.
  • J. Xu and G. Durrett (2019) Neural extractive text summarization with syntactic compression. In Proc. of EMNLP-IJCNLP, Cited by: 4th item, Table 2.
  • Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proc. of ICML, Cited by: §4.
  • K. Yee, Y. Dauphin, and M. Auli (2019) Simple and effective noisy channel modeling for neural machine translation. In Proc. of EMNLP-IJCNLP, Cited by: §4.
  • L. Yu, L. Sartran, W. Stokowiec, W. Ling, L. Kong, P. Blunsom, and C. Dyer (2019) Putting machine translation in context with the noisy channel model. CoRR abs/1910.00553. External Links: 1910.00553 Cited by: §4.
  • J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu (2019) PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. CoRR abs/1912.08777. External Links: 1912.08777 Cited by: Table 2.
  • X. Zhang and M. Lapata (2017) Sentence simplification with deep reinforcement learning. In Proc. of EMNLP, Cited by: §4.

Appendix A Complete Results

We show the complete results on each dataset for Full, NoDataset, NoLatent, and Base in Table 5, Table 6, Table 7, and Table 8 respectively.

Single task Multitask
Dataset R1 R2 RL BLEU F1 R1 R2 RL BLEU F1
Gigaword 46.95 23.50 44.01 19.71 45.80 22.56 43.15 19.20
CNN/DM 41.53 16.84 27.87 12.96 40.52 16.29 28.01 13.68
Newsroom 28.62 18.99 26.15 35.82 32.76 23.10 30.09 34.35
NYT 44.27 27.88 36.52 29.67 44.79 28.32 36.65 30.01
TL;DR 14.27 2.14 10.54 2.18 13.43 1.72 9.98 1.76
Wikihow 26.00 6.33 18.39 6.41 26.73 8.13 20.48 7.28
MSMARCO 65.46 61.90
NewsQA 11.06 48.69
SQuAD 13.71 73.81
Table 5: Results for the Full model across all metrics and all tasks in the single and multitask conditions.
Single task Multitask
Dataset R1 R2 RL BLEU F1 R1 R2 RL BLEU F1
Gigaword 48.59 24.73 45.74 21.51 40.96 18.28 38.79 14.93
CNN/DM 41.82 16.72 28.23 13.63 40.83 14.52 26.37 10.99
Newsroom 33.35 24.19 30.61 39.45 29.67 19.91 26.65 19.01
NYT 44.64 27.44 36.12 29.64 40.25 22.64 31.32 22.93
TL;DR 13.99 2.01 10.36 1.91 13.12 1.42 9.17 1.02
Wikihow 25.64 6.49 18.86 6.27 25.88 6.10 17.68 5.99
MSMARCO 59.86 54.07
NewsQA 12.52 42.48
SQuAD 13.22 66.18
Table 6: Results for the NoDataset model across all metrics and all tasks in the single and multitask conditions.
Single task Multitask
Dataset R1 R2 RL BLEU F1 R1 R2 RL BLEU F1
Gigaword 50.14 26.56 47.28 23.20 47.09 23.77 44.37 20.14
CNN/DM 37.83 15.05 25.67 11.00 36.45 14.69 25.39 10.51
Newsroom 34.35 24.50 31.90 36.17 31.09 22.20 28.85 19.85
NYT 42.38 26.52 34.56 24.68 43.40 27.62 35.94 25.92
TL;DR 11.13 1.93 9.45 1.62 9.76 1.44 8.29 0.69
Wikihow 17.97 4.47 13.63 3.27 18.51 6.25 15.69 2.41
MSMARCO 51.98 61.89
NewsQA 8.86 49.00
SQuAD 9.13 73.96
Table 7: Results for the NoLatent model across all metrics and all tasks in the single and multitask conditions.
Single task Multitask
Dataset R1 R2 RL BLEU F1 R1 R2 RL BLEU F1
Gigaword 49.96 26.32 47.14 22.91 46.92 23.72 44.32 20.23
CNN/DM 37.54 14.83 25.50 10.76 38.13 15.52 26.54 12.13
Newsroom 30.09 21.15 27.79 21.88 30.93 21.64 28.55 24.27
NYT 42.73 26.41 34.84 26.60 43.90 27.96 36.25 27.42
TL;DR 9.07 1.34 7.75 0.70 10.92 1.63 9.00 1.45
Wikihow 20.70 5.11 15.42 4.00 17.88 5.93 15.13 2.85
MSMARCO 55.46 64.64
NewsQA 10.50 48.37
SQuAD 9.61 73.26
Table 8: Results for the Base model across all metrics and all tasks in the single and multitask conditions.

Appendix B Summarisation Style Transfer

In this section, we report the results of a summarisation style transfer experiment. We use the Full model and consider two evaluation setups:

  • We take articles from the NYT development set and compare reference summaries, summaries generated from the prior mean μ_NYT corresponding to the NYT dataset, and summaries generated from the prior mean μ_Newsroom corresponding to the Newsroom dataset. We show representative samples in Table 9 and Table 10.

  • We take articles from the Newsroom development set and compare the same selection of summaries as above. We show representative samples in Table 11 and Table 12.

Inspecting the samples, we can observe that the model has learnt different summarisation styles as a result of the different training data. The summaries generated using μ_Newsroom often seem to consist of the article lede, whereas summaries generated with μ_NYT consist of extracted phrases which are more evenly distributed throughout the article. We note that while the summaries generated using a skill representation that is not intended for a dataset (i.e., μ_Newsroom for NYT articles and vice versa) score lower on standard metrics than the summaries generated using the “correct” skill representation (35.9 R1 vs. 44.27 R1 on NYT and 28.1 R1 vs. 28.62 on Newsroom), the resulting summaries are still valid summaries, as indicated by the reasonably high scores. This suggests that our model learns useful summarisation skills that can be generalised to other domains.

Article: WHEN people die in train collisions, like the 2 engineers and a passenger killed in a commuter train crash in Secaucus, N.J., or the 11 on a Maryland commuter train that hit an Amtrak train in Silver Spring, both in early 1996, there are national headlines and detailed safety investigations. But the attention might be better directed elsewhere: Far more people, more than 400 last year, died in the less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles. The Federal Railroad Administration counted more than 4,000 car-train collisions last year. “Something comes over the fax every minute of every day,” David Bolger, a spokesman for the agency, said with only slight exaggeration. His agency is running a long-term campaign to reduce the toll. The Transportation Department’s goal, set in 1991, is to reduce by 25 percent by 2001 the number of places where tracks cross roads. So far the number is down by a little more than 6 percent, to 268,000, with three years to go. The theory is that if crossings are combined, towns or highway departments can afford to pay for improvements at the ones that survive. Those costs are high. The Transportation Department says a simple “crossbuck …
Reference summary: Matthew L Wald comments on growing concern about train-car collisions, noting that Federal Railroad Administration counted more than 4,000 car-train collisions last year; Transportation Dept goal, set in 1991, is to reduce by 25 percnet [sic] by 2001 number of places where tracks cross roads; so far, that number is down by little more than 6 percent, to 268,000; photo (M)
NYT summary style: Federal Railroad Administration has counted more than 4,000 car-train collisions in 1996, with more than 400 deaths in less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles; Transportation Department’s goal is to reduce by 25 percent by 2001 number of places where tracks cross roads; theory is that if crossings are combined, towns or highway departments can afford to pay for improvements at ones that survive; those costs are high; photo (Special section, Cars) (M)
Newsroom summary style: WHEN people die in train collisions, like the 2 engineers and a passenger killed in a commuter train crash in Secaucus, N.J., or the 11 on a Maryland commuter train that hit an Amtrak train in Silver Spring, both in early 1996, there are national headlines and detailed safety investigations. But the attention might be better directed elsewhere: Far more people, more than 400 last year, died in the less noted but far more numerous accidents in which trains collided not with other trains, but with motor vehicles. The Federal Railroad Administration counted more than 4,000 car-train collisions last year. “Something comes over the fax every minute of every day,” David Bolger, a spokesman for the agency, said with only slight exaggeration.
Table 9: An NYT article with a reference summary, and two model-generated summaries using μ_NYT and μ_Newsroom.
Article: Months ago, Alaska posted a list of its 1,600 paroled sex offenders on the World Wide Web. California provides information on about 64,000 offenders on CD-ROM at any county sheriff’s office or big-city police department. And three months ago, Connecticut began letting local police stations release the names, addresses and pictures of paroled offenders to anyone who calls. But in New Jersey, the state that inspired the nationwide movement for public warnings about the presence of sex offenders, information is held far more tightly. Since New Jersey’s community-notification law was hurriedly enacted in 1994 amid the outcry that followed the killing of Megan Kanka of Hamilton Township, it has faced an onslaught of constitutional challenges, and nearly three years of review and revision by Federal and state judges. Indeed, the original “Megan’s Law” is soon to emerge from legal limbo as one of the most restricted in the nation, with tight limits on which offenders are identified, who has access to the information and how those facts may be passed on. By the end of this month, when most of New Jersey’s 21 county prosecutors are expected to begin issuing names and addresses of offenders, only about 600 parolees will be listed, those classified as posing a moderate or high risk of committing new …
Reference summary: New Jersey, which inspired other states to make known the names of paroled sex offenders, has faced onslaught of constitutional challenges to its own Megan’s law, and now the law is emerging from nearly three years of judicial review and revision as one of most restricted in nation; it sets tight limits on which offenders are identified, who has access to information and how those facts may be passed on (M)
NYT summary style: New Jersey, which inspired nationwide movement for public warnings about presence of sex offenders, has faced onslaught of constitutional challenges, and nearly three years of review and revision by Federal and state judges; original ‘Megan’s Law’ is soon to emerge from legal limbo as one of most restricted in nation, with tight limits on which offenders are identified, who has access to information and how those facts may be passed on; photo (M)
Newsroom summary style: Months ago, Alaska posted a list of its 1,600 paroled sex offenders on the World Wide Web. California provides information on about 64,000 offenders on CD-ROM at any county sheriff’s office or big-city police department. And three months ago, Connecticut began letting local police stations release the names, addresses and pictures of paroled offenders to anyone who calls. But in New Jersey, the state that inspired the nationwide movement for public warnings about the presence of sex offenders, information is held far more tightly.
Table 10: An NYT article with a reference summary, and two model-generated summaries using μ_NYT and μ_Newsroom.
Article: On the road … Pilgrim and Ambassador, 1977, by Raghubir Singh. See it in Everything Was Moving at the Barbican, London August is a quiet month for photography shows, so here’s a preview of some of the exhibition highlights for the next few months. The most anticipated London show is surely Tate Modern’s ambitious double header William Klein/Daido Moriyama, which opens on 10 October. Taking the cities of New York and Tokyo as its starting point, the show contrasts the approaches of two pioneers of impressionistic urban photography. It considers the influence of Klein’s seminal 1956 book, Life Is Good and Good for You in New York, on Japanese photography, and Moriyama in particular. The prodigiously productive Moriyama was a founder of the radical Provoke movement in Japan and, alongside previously unseen vintage prints, the exhibition explores photography’s role in the representation of protest movements and civil unrest. This is an ambitious show that will be a chance for many of us to see lots of Moriyama’s images outside of book form for the first time. I, for one, cannot wait. The other big London exhibition is the Barbican’s group show, Everything Was Moving: Photography from the 60s …
Reference summary: From an ambitious survey of 1960s-70s photography in London to Kohei Yoshiyuki’s controversial work in Liverpool and Amsterdam’s Unseen Photo Fair, there’s a lot to see. By Sean O’Hagan
NYT summary style: Barbican, London August, previews Tate Modern’s ambitious double header William Klein/Daido Moriyama, which opens on 10 October; show contrasts approaches of two pioneers of impressionistic urban photography; it considers influence of Klein’s seminal 1956 book, Life Is Good and Good for You in New York, on Japanese photography, and Moriyama in particular; other big London exhibition is Barbican’s group show, Everything Was Moving: Photography from 60s; photos (M)
Newsroom summary style: Two pioneers of impressionistic urban photography are the latest in London’s show, Everything Was Moving: Photography from the 60s, says Jonathan Jones!!!
Table 11: A Newsroom article with a reference summary, and two model-generated summaries using μ_NYT and μ_Newsroom.
Article: BANGALORE, India, June 4 The world’s biggest computer services company could not have chosen a more appropriate setting to lay out its strategy for staying on top. A building housing I.B.M.’s software laboratory and application service teams on the company’s corporate campus in Bangalore, India. On Tuesday, on the expansive grounds of the Bangalore Palace, a colonial-era mansion once inhabited by a maharajah, the chairman and chief executive of I.B.M., Samuel J. Palmisano, will address 10,000 Indian employees. He will share the stage with A. P. J. Abdul Kalam, India’s president, and Sunil Mittal, chairman of the country’s largest cellular services provider, Bharti Tele-Ventures. An additional 6,500 employees will look in on the town hall-style meeting by satellite from other Indian cities. On the same day, Mr. Palmisano and other top executives will meet here with investment analysts and local customers to showcase I.B.M.’s global integration capabilities in a briefing customarily held in New York. During the week, the company will lead the 50 analysts on a tour of its Indian operations. The meetings are more than an exercise in public and investor relations. They …
Reference summary: India provides I.B.M. with its fastest-growing market and a crucial base for delivering services to much of the world.
NYT summary style: IBM chairman and chief executive Samuel J Palmisano will address 10,000 Indian employees on expansive grounds of Bangalore Palace, colonial-era mansion once inhabited by maharajah, will meet with investment analysts and local customers to showcase IBM’s global integration capabilities in briefing customarily held in New York; during week, company will lead 50 analysts on tour of its Indian operations; meetings are more than exercise in public and investor relations; photo (M)
Newsroom summary style: The world’s biggest computer services company could not have chosen a more appropriate setting to lay out its strategy for staying on top.
Table 12: A Newsroom article with a reference summary, and two model-generated summaries using μ_NYT and μ_Newsroom.