Computer generation of stories and other kinds of creative writing is a challenging endeavor. It entangles two difficult tasks: the generation of fluent natural language and the generation of a coherent storyline. In the recent year, neural language models have made tremendous progress with respect to fluency Bahdanau et al. (2015); Vaswani et al. (2017); Bengio et al. (2003); Devlin et al. (2019), but coherency is still a major challenge See et al. (2019). The generation of coherent stories has recently been addressed with additional conditioning: Fan et al. (2018) suggest conditioning on a story prompt, Clark et al. (2018) propose collaboration between a generative model and a human writer, and Guan et al. (2019) suggest attending to a commonsense graph relevant to the story plot. Conditioning based on a generated story plan (Martin et al., 2018; Fan et al., 2019; Yao et al., 2019), a sequence of images (Chandu et al., 2019) or character roles (Liu et al., 2020) have also been considered.
Our work is orthogonal to these efforts. Rather than considering additional conditioning, we propose a model which takes as input several sentences of context and selects the best next sentence within a large set of fluent candidate sentences. We leverage pre-trained BERT embeddings (Devlin et al., 2019) to build this sentence-level language model. Given the embeddings of the previous sentences of the story, our model learns to predict a likely embedding of the next sentence.
This task isolates the modeling of long-range dependencies from the prediction of individual words, which has several advantages. First, since our model only needs to determine how well each candidate sentence would fit as a coherent continuation to the story, it does not spend capacity and time to learn fluency. Second, our model does not manipulate individual words but full sentences, which allows us to consider tens of thousands of candidate sentences at a time. This contrasts with prior work (Logeswaran and Lee, 2018) where the need to learn token-level representations limited the number of candidate next sentences that could be considered to a few hundred. Third, we can rely on compact model architectures that train quickly because we take advantage of strong semantic representation from a pre-trained bidirectional language model, BERT, as our sentence embeddings. Of course, these benefits also imply that our sentence representation is limited to the information extracted by the pre-trained model. Nevertheless, we show that our model achieves state-of-the-art accuracy among unsupervised approaches on the Story Cloze task: predicting which of two sentences coherently ends a short story.
Our work also opens up the possibility of ranking thousands of candidate sentences from a large literature repository. On the ROC Stories dataset, we observe that training with a large number of candidates is key for selecting the most coherent ending among a large set of candidates at test time. We also show preliminary results on the efficacy of our method for ranking candidate next sentence on the Toronto Book Corpus (Kiros et al., 2015), a much larger book dataset. We envision that our methods for scoring many candidate next sentences by their coherence with the context might be useful to downstream generation tasks where it is possible to generate many fluent continuations of a text, but it remains an unsolved problem how to refine and choose the best of them. To encourage this exploration, we release our code and models111Code for ROC Stories experiments can be found at https://github.com/google-research/google-research/tree/master/better_storylines..
2 Proposed Method
We propose a sentence-level language model: our model estimates
, the probability distribution for sentencegiven the previous sentences, . Since it is intractable to marginalize over all possible candidate next sentences, we consider a finite but large set of valid, fluent sentences. Without loss of generality, we can consider as an integer index into that set of possible next sentences. This strategy resembles negative sampling in word2vec Mikolov et al. (2013).
Our model represents sentences with pre-computed vector embeddings. Specifically, sentences are represented by the mean of the 768-dimensional contextual word embeddings of the second-to-last layer of BERT(Devlin et al., 2019). This representation has shown to encode more transferable features compare to other layers (Liu et al., 2019). Alternative sentence representations were considered, including embeddings from the universal sentence encoder (Cer et al., 2018) and a weighted mean of the BERT embeddings using inverse document frequency weighting (Zhang et al., 2019). None of these alternatives improved our results however.
Motivated by simplicity, we consider a classical multi-layer perceptron (MLP)which takes as input the context sentence embeddings concatenated into a single vector. At the output layer, we perform a softmax operation. If we represent candidate sentences by the embeddings , our model estimates the probability that is the next sentence by the softmax
where is the output of the MLP given context , and is the partition function. At train time, the candidate set consists of the correct next sentence along with distractor sentences. The distractors can either be static (the same set used throughout training) or dynamic (picked at random from a larger set for each train batch). In this case, the “vocabulary” of next values to choose from changes with each train step, similar to negative sampling Mikolov et al. (2013). At test time, novel sentences can be embedded with BERT and scored by our model.
Like a classical language model, we optimize for the likelihood of the true next sentence’s embedding. However, when training we found that the sentences from the context () often ended up being given very high scores by our model. Inspired by work in sentence reordering (Lapata, 2003; Logeswaran and Lee, 2018), we incorporated an auxiliary loss, which we refer to as CSLoss, that only includes the context sentences in the distractor set.
Lastly, we consider a residual variant of the MLP (referred to as resMLP) with skip connection between layers, as described in He et al. (2016). The residual model trains faster and sometimes achieves higher accuracy than the non-residual model. Though we experimented with recurrent (Sundermeyer et al., 2012) and self-attention (Vaswani et al., 2017) models, we did not observe improvements, perhaps because the input to our model is already the high-dimensional output of a large mask language model. We leave deeper architecture exploration, which will be especially critical as context length is extended, to future work.
3 Experimental Setup
We first describe our experiments on the ROC Stories dataset of short 5-sentence stories before showing our setup on the larger Toronto Book Corpus.
|Valid 2016||Test 2016||Valid 2018||Test 2018|
|Alternatives||Peng et al. (2017)||–||62.3||–||–|
|Schenk and Chiarcos (2017)||62.9||63.2||–||–|
|Lang. Models||Schwartz et al. (2017)||–||67.7||–||–|
|GPT-2 (Radford et al., 2019)||54.5||55.4||53.8||–|
|GPT-2 + finetuning||59.0||59.9||59.0||–|
3.1 ROC Stories
Dataset Our experiments use the ROC Stories dataset, which consists of stories focusing on common sense Mostafazadeh et al. (2016). The training set has 98k stories, with five sentences each. The validation and test sets each contain 1.8k stories consisting of four sentences followed by two alternative endings: one ending is coherent with the context; the other is not. The dataset was introduced for the Story Cloze task, inspired by Taylor (1953), where the goal is to select the coherent ending. While the dataset and task were introduced as a way to probe for coherence and commonsense in models trained only on the unlabeled portion, most research derived from this dataset focuses on a supervised setting, using the validation set as a smaller, labeled training set (Chaturvedi et al., 2017; Sun et al., 2019; Cui et al., 2019; Li et al., 2019; Zhou et al., 2019). Our work is faithful to the original task objective. We train solely on the training set, i.e. the model never sees incoherent endings at training time.
Model We consider two models, an MLP and a residual MLP. They take as input the previous sentences represented as the concatenation of their embeddings. Alternative context aggregation strategies were considered with recurrent (Sundermeyer et al., 2012) and attention (Vaswani et al., 2017) architectures, without strong empirical advantages. The models maps its input to a vector which is compared to a set of candidate sentence embeddings via dot product. The embedding of the true next sentence should receive the highest score. For each example, we consider all other fifth sentences in the training set (96k in total) as the candidate set.
The input of our model is 3,072 dimensional, i.e. 4 context sentences represented by 768 dimensional BERT embeddings. After an architecture search, our best MLP has 3 layers of 1,024 units, and our best resMLP has a single residual layer with hidden size of 1,024. Both contain just over 6M trainable parameters. Both apply dropout with a rate of 0.5 after each ReLU, and layer normalization is performed on the concatenated context sentence embedding passed in as input to the network and on the final predicted embedding for the next sentence. For the Story Cloze task, the two architectures achieve similar validation accuracy, but when considering more than two distractors, the resMLP significantly outperforms the standard MLP. The resMLP also converges quicker than the MLP. Training to convergence takes under 2 hours for each model on a Tesla V100.
3.2 Toronto Book Corpus
Dataset ROC Stories contains only self-contained five-sentence stories, focusing on everyday life scenarios. They contain no dialog and very little flowery, expository language. Ideally our method would also be successful at scoring potential continuations to more naturally-written stories. To this end, we test out our approach on excerpts from the Toronto Book Corpus (Kiros et al., 2015), a dataset of self-published novels. The dataset contains over 7,000 unique books totalling over 45 million sentences. Since these stories are much longer than the ROC Stories ones and many of the sentences are uninformative (nearly 5% of sentences are 3 words or shorter, and 14% are 5 words or shorter), we double the context length to 8 sentences.
Model In addition to experimenting with a similar residual MLP architecture to the one used on ROC Stories, we also ran experiments with a Transformer model (Vaswani et al., 2017). The residual MLP architecture contains 2 residual layers with hidden size of 1024 (11M params total). The transformer has 4 self-attention layers with hidden size of 768, filter size of 2048 and 8 attention heads (22M params total). While the residual MLP is trained to predict the 9th sentence given the previous 8 sentences, the Transformer is trained to predict each next sentence given the previous sentences in a sequence of length 10 sentences. However, we only evaluate the Transformer on the task of predicting the 9th sentence so that evaluation results are directly comparable to the residual MLP.
For each batch during training, 2k distractors are randomly selected from the train set. Like with ROC Stories, we experiment with an auxiliary loss where just sentences from the context were used as distractors. Table 3 reports the results.
We evaluate on the Story Cloze task, a binary classification task, as well as on the task of ranking a large set of possible next sentences.
4.1 Story Cloze Task
Table 1 shows that our method outperforms unsupervised alternatives. The introduction of the CSLoss which considers only context sentences as candidates improves accuracy compared to only using a loss over all possible fifth sentences.
For comparison, we include the accuracies of the best unsupervised methods in the literature. Schenk and Chiarcos (2017) construct negative examples for their binary classification task by pairing contexts with random fifth sentences selected from the training set. Peng et al. (2017) train a language model to predict a representation of the semantic frame, entities, and sentiment of the fifth sentence given the representations of the previous sentences, then take the more likely fifth sentence. We achieve higher accuracy without relying on a task-specific architecture.
Table 1 also shows that picking the ending that is more likely according to a word-level language model, in our case GPT-2’s 355M parameter model, does not yield very high accuracies, even when the language model is finetuned on ROC Stories text (Radford et al., 2019). Lastly, we also include the accuracy reported by Schwartz et al. (2017)
, where a logistic classifier is trained to combine multiple language model scores.
It is worth noting that state-of-the-art on the Story Cloze task is over 90% accuracy (Li et al., 2019; Cui et al., 2019) for semi-supervised settings. The methods achieving this level of performance are not comparable to our unsupervised approach as they require training on the labeled validation set. The language model approach from Schwartz et al. (2017) also falls into this category.
4.2 Ranking Many Sentences on ROC Stories
For generation and suggestion scenarios, it is useful to be able to surface the best next sentence out of hundreds or thousands of candidates. In Table 3, we show the performance of our method on the 2018 validation set when all 98,161 fifth sentences in the training set plus all 1,571 correct 5th sentences in the 2018 validation are considered as candidate endings. Top-10 accuracy is highest, at 10.3%, when training a residual MLP without CSLoss.
Interestingly, strong performance on the Story Cloze task does not necessarily translate to strong performance on the large-scale ranking task. The CSLoss improves performance on the Story Cloze task but hurts it for large-scale ranking.
In Figure 1, we show how large-scale ranking performance improves as the size of the train-time distractor set is increased. However, on the Story Cloze task, the number of training distractors has no significant impact on performance. Even when only a single distractor is randomly chosen at each step of training, our method achieves over 70% 2016 test accuracy. It seems that training for the goal of detecting the true next sentence out of a very diverse candidate set is useful at test time only when the set of distractors at test time is similarly large and diverse. The many-distractors training regime might be less useful for the Story Cloze task since the two candidate endings are designed to be quite topically similar to each other.
Some qualitative examples are shown in Table 4. The failure examples showcase a side-effect of relying on pre-trained sentence embeddings: if common names like “Becky” or “Laura” or sports such as “fishing” and “golf” are close to each other in embedding space, our model will fail to distinguish between them.
|Context: My family got up one morning while on vacation. We loaded our boat onto a trailer and drove to the beach. After loading up from the dock, we took off on our boat. After only a few minutes on the sea, dolphins began to swim by us.|
|GT: (22.89) We played with them for a while and then returned to the dock.|
|(25.06) We were definitely lucky to see them and it made the trip more fun!|
|(24.31) They loved everything about that trip and vowed to do it again!|
|(23.76) We were sad to come home but excited to plan our next vacation.|
|(23.72) It was one of our best vacations ever!|
|Context: Ellen wanted to be smart. She started reading the dictionary. She learned two hundred new words the first day. Ellen felt smart and educated.|
|GT: (30.23) She couldn’t wait to use the new words.|
|(30.23) She couldn’t wait to use the new words.|
|(29.78) She felt like a new woman when she was done!|
|(29.01) She decided to go back to speaking like her normal self!|
|(28.95) She felt like a new girl!|
|Context: It was a very cold night. Becky was shivering from the cold air. She needed to cover up before she caught a cold. She wrapped up in her favorite blanket.|
|GT: (18.717398) Becky finally got warm.|
|(39.09) Laura ended up shivering, wrapped in a blanket for hours.|
|(36.71) After being cold all day, the warmth felt so good.|
|(33.77) Sam was able to bundle up and stay cozy all winter.|
|(33.38) The breeze felt good on her wet shirt.|
|Context: Benjamin enjoyed going fishing with his grandfather as a kid. They would pick a new location to go to every summer. Benjamin liked seeing who would catch the biggest fish. Even after his grandfather passed he continued the tradition.|
|GT: (26.65) He now takes his own grandchildren to create memories for themselves.|
|(34.71) Greg grew to love golfing and is now his favorite thing to do.|
|(33.82) It was a tradition Tim continues with his own family.|
|(33.63) Alex learned to be grateful of his family’s unique tradition.|
|(33.40) Tom was sad that he would have to let his son down.|
4.3 Ranking Many Sentences on Toronto Book Corpus
When evaluating with 100k distractors, about as many as our ROC Stories large-scale ranking task, P@10 is at best 7.1%, compared with 22.7% for ROC Stories. We suspect that this task would benefit from longer contexts and better selection of distractors. In particular, a qualitative evaluation of the data highlighted the presence of a large quantify of short, generic sentences in the high ranking sentences (e.g. “he said.” and “Yes.”). We see reducing the density of such sentences at training time as a potential for improvement.
In addition, further investigation is necessary into why the Transformer did not work as well as the residual MLP. The use of variable sequence length architectures like the Transformer will become more critical as the input sequence length is increased beyond what an MLP can easily handle.
This work introduces a sentence-level language model which takes a sequence of sentences as context and predicts a distribution over a finite set of candidate next sentences. It takes advantage of pre-trained BERT embeddings to avoid having to learn token-level fluency, allowing the model to focus solely on the coherence of the sentence sequences. Our results on the Story Cloze task highlight the advantage of this strategy over word-level language models. At train time, our model considers much larger amounts of text per update than typical token-level language models. We show that this strategy allows our model to surface appropriate endings to short stories out of a large set of candidates.
As future work, we plan to further evaluate the impact of different sequential architectures, longer contexts, alternative sentence embeddings, and cleverer selection of distractors. Inspired by deliberation networks and automatic post editing methods (Xia et al., 2017; Freitag et al., 2019), we ultimately want to apply our model to two-step generation, first selecting a sentence from a large set before refining it to fit the context.
This research is based upon work supported in part by U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
- Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representation (ICLR), Cited by: §1.
A neural probabilistic language model.
Journal of Machine Learning Research (JMLR)3 (Feb), pp. 1137–1155. Cited by: §1.
- Universal sentence encoder. arXiv preprint arXiv:1803.11175. Cited by: §2.
- Storyboarding of recipes: grounded contextual generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1.
Story comprehension for predicting what happens next.
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1603–1614. Cited by: §3.1.
- Creative writing with a machine in the loop: case studies on slogans and stories. In 23rd International Conference on Intelligent User Interfaces, pp. 329–340. Cited by: §1.
- Discriminative sentence modeling for story ending prediction. CoRR abs/1912.09008. Cited by: §3.1, §4.1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §1, §2.
- Strategies for structuring story generation. In Association for Computational Linguistics, ACL, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2650–2660. Cited by: §1.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 889–898. External Links: Cited by: §1.
- APE at scale and its implications on MT evaluation biases. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), Florence, Italy, pp. 34–44. External Links: Cited by: §5.
Story ending generation with incremental encoding and commonsense knowledge.
The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, pp. 6473–6480. Cited by: §1.
- Deep residual learning for image recognition. In , pp. 770–778. Cited by: §2.
- Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: §1, §3.2.
- Probabilistic text structuring: experiments with sentence ordering. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL ’03, Stroudsburg, PA, USA, pp. 545–552. External Links: Cited by: §2.
- Story ending prediction by transferable BERT. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 1800–1806. Cited by: §3.1, §4.1.
- A character-centric neural model for automated story generation. In AAAI Conference on Artificial Intelligence, Cited by: §1.
- Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1073–1094. Cited by: §2.
- An efficient framework for learning sentence representations. In International Conference on Learning Representations, Cited by: §1, §2.
- Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
- Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Cited by: §2, §2.
- A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. External Links: Cited by: §3.1.
- A joint model for semantic sequences: frames, entities, sentiments. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 173–183. Cited by: Table 1, §4.1.
- Language models are unsupervised multitask learners. Cited by: Table 1, §4.1.
- Resource-lean modeling of coherence in commonsense stories. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pp. 68–73. Cited by: Table 1, §4.1.
- Story cloze task: uw nlp system. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pp. 52–55. Cited by: Table 1, §4.1, §4.1.
- Do massively pretrained language models make better storytellers?. In Conference on Natural Language Learning CONLL, Cited by: §1.
- Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 2633–2643. External Links: Cited by: §3.1.
LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, Cited by: §2, §3.1.
- “Cloze procedure”: a new tool for measuring readability. Journalism Bulletin 30 (4), pp. 415–433. Cited by: §3.1.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2, §3.1, §3.2.
- Deliberation networks: sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1784–1794. External Links: Cited by: §5.
- Plan-and-write: towards better automatic storytelling. In AAAI Conference on Artificial Intelligence, Cited by: §1.
- Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675. Cited by: §2.
- Story ending selection by finding hints from pairwise candidate endings. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 27 (4), pp. 719–729. Cited by: §3.1.