Shaping the Narrative Arc: An Information-Theoretic Approach to Collaborative Dialogue

01/31/2019 ∙ by Kory W. Mathewson, et al. ∙ 8

We consider the problem of designing an artificial agent capable of interacting with humans in collaborative dialogue to produce creative, engaging narratives. In this task, the goal is to establish universe details, and to collaborate on an interesting story in that universe, through a series of natural dialogue exchanges. Our model can augment any probabilistic conversational agent by allowing it to reason about universe information established and what potential next utterances might reveal. Ideally, with each utterance, agents would reveal just enough information to add specificity and reduce ambiguity without limiting the conversation. We empirically show that our model allows control over the rate at which the agent reveals information and that doing so significantly improves accuracy in predicting the next line of dialogues from movies. We close with a case-study with four professional theatre performers, who preferred interactions with our model-augmented agent over an unaugmented agent.

1 Introduction

Designing and building computational models that generate meaningful dialogue for human interaction, in an interesting and engaging manner, is a challenging open problem. As personal digital assistants increase in popularity, proper conversational capabilities may allow them to provide creative, playful, and helpful interactions. Conversational agents can be effective in health care (Bickmore & Giorgino, 2006), for example by supporting cognitive-behavioral therapy for treating depression (Fitzpatrick et al., 2017), helping patients with chronic pain (Miner et al., 2017), and supporting reminiscence (Nikitina et al., 2018). These applications require systems capable of understanding and collaboration.

What makes some dialogues more interesting than others? Interesting collaborative dialogue constructs knowledge iteratively (Swain, 2000) and depends on each speaker bringing information to the conversation (Sawyer, 2003). Interestingness is also subjective and difficult to directly optimize via numerical methods (Li et al., 2016; Venkatesh et al., 2018).

Rule-based conversational models have existed for over 50 years (Weizenbaum, 1966). These methods are limited by the hand-tuning and engineering needed to predict and handle possible inputs. Generative language models maximize the likelihood of an utterance (e.g. a sentence or sequence of words) (Graves, 2013). These models can predict the likelihood of an utterance by treating sentences as sequences of words, sub-word units, characters, and/or tokens (Sennrich et al., 2015). This objective can result in generated sentences which are grammatically correct, and bear a semantic relationship to the surrounding context, but lack global consistency (Liu et al., 2018).

Our work generates interesting dialogue by using a narrative arc to incrementally construct shared knowledge. A narrative arc defines evolving qualities of emotion, tension, or topic over a story (Bizzocchi, 2007). We draw inspiration from improvised theater, where actors collaborate in real time to develop narrative based on thematic constraints (Johnstone, 1979). Improvised theater is a unique storytelling medium which relies on collaborative dialogue in which each utterance must carry significant information (Swain, 2000). We appeal to the two golden rules of improvised dialogue: 1) accept (i.e. be consistent with the dialogue thus far) and 2) reveal (i.e. progress the dialogue with new information) (Sawyer, 2003; Johnstone, 1979).

In this work, we propose a new method to modulate a conversation model, which accepts input utterances by generating consistent and revealing responses. Our approach combines a conversational model with a topic classifier, or universe model. We borrow the term universe from improvised theater where it is used to describe the world-as-we-know-it (Johnstone, 1979; McLeod, 2000; Raby, 2010). The universe encompasses associations surrounding the dramatic world, and is motivated by the possible world semantics theory (Kripke, 1963).

We identify two modes of operation for our shaping method: revealing and concealing. Revealing dialogue adds additional information about the current universe. Generating utterances which progress a scene with new information is the primary goal of our approach. Concealing dialogue avoids exposing new information about the universe.

The universe model characterizes the information revealed by each utterance in a sequence. We refer to this information profile across utterances as the narrative arc. By tuning the revealingness we can selectively choose utterances to shape the narrative arc to produce more interesting and engaging dialogue. We argue that a balance between revealing and concealing is required for interesting and engaging collaborative dialogue (Swain, 2000). Both over-specification and ambiguity are undesirable (Sawyer, 2003; Johnstone, 1979). We hypothesize that there is an ideal region of information revelation which our method can expose in existing text-based narratives such as movie scripts.

2 Shaping the Narrative Arc

In this section, we present a mechanism for shaping the narrative arc inspired by combining methods exploring entropy in textual documents (Shannon, 1951) with the Simple Shapes of Stories described by Vonnegut (from a K. Vonnegut lecture: goo.gl/JuEDVR). We describe concepts of conversation and universe models. Then, we show how these combine to describe a narrative arc. Finally, we show how the narrative arc can be used to generate interesting dialogue.

2.1 The Conversation Model

A conversation model accepts an input utterance and generates one, or several, output utterance(s). The conversation model maintains local coherence by conditioning output generation on the input. We write $\mathcal{X}$ to denote the set of possible utterances (i.e. sequences of words); in this work, $\mathcal{X}$ is a collection of English sentences. A sequence of successive utterances is a dialogue, denoted $x_{1:t} = (x_1, \dots, x_t)$. A conversation model yields a probability $p(x_t \mid x_{1:t-1})$ of an utterance $x_t$ given a dialogue $x_{1:t-1}$.

We focus on dialogue generation using three retrieval-based conversation models. The first two models are based on the OpenSubtitles dataset (Lison et al., 2018). Pre-processing details are included in the supplementary material. When queried with an input line $x_t$, a model returns $K$ candidate responses:

  • Baseline Random model: sample $K$ lines from $\mathcal{X}$.

  • Deep neural network model (DNN): we embed all the lines in $\mathcal{X}$ into a latent semantic space using the Universal Sentence Encoder (Cer et al., 2018). We encode the input line $x_t$ into the same space, and return the $K$ approximate nearest neighbours using the $L_2$ norm as the distance metric.

  • Books: Similar to the DNN model, responds with semantically-related nearest neighbor lines from literature, filtered for offensive content (books.google.com/talktobooks).
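The retrieval step shared by the DNN and Books models can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `embed` function (a letter-count vector) merely stands in for a learned sentence encoder, the search is exact rather than approximate, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(line):
    """Toy stand-in for a sentence encoder (assumption): 26-dim letter counts."""
    counts = Counter(ch for ch in line.lower() if "a" <= ch <= "z")
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, corpus, k):
    """Return the k corpus lines whose embeddings are closest to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda line: l2_distance(q, embed(line)))
    return ranked[:k]


corpus = ["the cat sat", "a dog barked", "the cat sat down", "quantum physics"]
candidates = retrieve("the cat sat", corpus, k=2)
print(candidates)
```

With a real encoder, the corpus embeddings would be computed once and indexed, so that only the query needs to be embedded at run time.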

2.2 The Universe Model

The universe model measures how each successive utterance of a dialogue influences the probability distribution over universes. For a given utterance, the universe model calculates a probability distribution over universes. For a sequence of utterances, we use recursive universe belief propagation (Sec. 2.3) to update the posterior over the course of a dialogue. Revealing dialogue would concentrate probability mass on a single universe, and concealing dialogue would distribute probability mass uniformly across a set of universes. The shape of this sequence of posteriors is the narrative arc (Sec. 2.4). We investigated reveal/conceal dynamics using three different universe models based on probabilistic topic classifiers.

  • Newsgroups: Using the newsgroup classification dataset, we filter out stop-words, create frequency vectors, and use the TF-IDF (term frequency / inverse document frequency) (Salton & Buckley, 1988) word weighting scheme to account for word importance in the corpus. We train a naïve Bayes classifier on 5 aggregate topic universes (Computers, Recreation, Religion, Science, and Talk) (Joachims, 1996).

  • Movies: Naïve Bayes classifier, trained similarly to Newsgroups, using a collected dataset of film synopses and one of 10 corresponding genres (Drama, Comedy, Horror, Action, Crime, Romantic Comedy, Romance, Thriller, Film Adaptation and Silent Film) from Wikipedia data (Hoang, 2018).

  • DeepMoji: Deep neural network that takes input text and outputs a distribution over a set of 8 aggregated emoji universes: (Sad, Mad, Meh, Nervous, Glad, Music, Love, and Miscellaneous) (Felbo et al., 2017). The authors’ pretrained model was used (github.com/bfelbo/DeepMoji).
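To make the idea of a topic-classifier universe model concrete, the sketch below trains a tiny multinomial naïve Bayes classifier and returns a distribution over universes for one utterance. It is an illustration only: it uses raw word counts with add-one smoothing instead of TF-IDF weighting, skips stop-word filtering, and the two-document corpus, function names, and `smoothing` parameter are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs, smoothing=1.0):
    """Collect per-universe word counts from (text, universe) pairs."""
    counts = defaultdict(Counter)
    vocab = set()
    for text, universe in labelled_docs:
        words = text.lower().split()
        counts[universe].update(words)
        vocab.update(words)
    return {"counts": dict(counts), "vocab": vocab, "smoothing": smoothing}


def universe_distribution(model, utterance):
    """Return p(u | utterance) under a uniform prior over universes."""
    vocab_size = len(model["vocab"])
    smoothing = model["smoothing"]
    log_scores = {}
    for universe, counts in model["counts"].items():
        total = sum(counts.values())
        log_like = 0.0
        for word in utterance.lower().split():
            if word not in model["vocab"]:
                continue  # words unseen in training carry no evidence
            log_like += math.log(
                (counts[word] + smoothing) / (total + smoothing * vocab_size)
            )
        log_scores[universe] = log_like  # the uniform prior only adds a constant
    # Normalize in log space for numerical stability.
    peak = max(log_scores.values())
    unnorm = {u: math.exp(s - peak) for u, s in log_scores.items()}
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


docs = [
    ("the telescope measured the star", "Science"),
    ("the team won the match", "Recreation"),
]
model = train_naive_bayes(docs)
dist = universe_distribution(model, "the star")
print(dist)
```

In this toy corpus the word "star" occurs only in the Science document, so the utterance shifts probability mass toward Science.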

Figure 1: The narrative arcs of a synthetic dialogue (a), using the Newsgroups universe model (b) and Movies universe model (c). This dialogue is likely SCIENCE or TALK under the Newsgroups model, and DRAMA or COMEDY under the Movie genres model.

2.3 Recursive Universe Belief Propagation

We desire a means by which we can update the universe belief incrementally as evidence is accumulated with each successive utterance in a dialogue. We begin by defining the notion of a universe model as a means of modelling the dynamics of information revelation. Consider a finite set of universes, $\mathcal{U}$. The role of a universe model is to assess the compatibility of an utterance with a given discrete universe, $u \in \mathcal{U}$. Given such a model, we develop a method to update the agent’s posterior universe distribution over a sequence of utterances. For each universe $u$, the universe model assigns a likelihood $p(x_t \mid x_{1:t-1}, u)$ to an utterance $x_t$, conditioned on a dialogue $x_{1:t-1}$.

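The universe model iteratively updates a posterior belief over universes, in a similar spirit to prediction with expert forecasters (Cesa-Bianchi & Lugosi, 2006). The probability of a given universe depends on iteratively combining evidence in support of that universe. We define the posterior probability over universes $u \in \mathcal{U}$ given a sequence of utterances $x_{1:t}$ as:

$$p(u \mid x_{1:t}) = \frac{p(x_t \mid x_{1:t-1}, u)\, p(u \mid x_{1:t-1})}{p(x_t \mid x_{1:t-1})}$$

where $p(u \mid x_{1:t-1})$ is the prior probability, $p(x_t \mid x_{1:t-1}, u)$ is the likelihood of utterance $x_t$ conditioned on the past dialogue and universe, and $p(x_t \mid x_{1:t-1})$ is the likelihood of utterance $x_t$ under the conversation model.

Let $p_0(u)$ be an initially uniform distribution over universes, or the universe model’s prior. We can marginalize out the universe if the evidence is consistent over all hypotheses. To illustrate the relationship between utterance likelihood and universe, we can explicitly write the marginal likelihood as:

$$p(x_t \mid x_{1:t-1}) = \sum_{u' \in \mathcal{U}} p(x_t \mid x_{1:t-1}, u')\, p(u' \mid x_{1:t-1})$$

Thus, the posterior is updated recursively as:

$$p(u \mid x_{1:t}) = \frac{p(x_t \mid x_{1:t-1}, u)\, p(u \mid x_{1:t-1})}{\sum_{u' \in \mathcal{U}} p(x_t \mid x_{1:t-1}, u')\, p(u' \mid x_{1:t-1})} \tag{1}$$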

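In practice, it may be convenient to use the output of a probabilistic classifier in lieu of a likelihood function conditioned on past utterances and universe, $p(x_t \mid x_{1:t-1}, u)$. Universe classifiers can be trained separately from language models, and provide a complementary signal if model input distributions overlap. This assumption is justified when both models work with similar training corpus vocabularies. We view the probability distribution over universes output by the universe model as derived from a joint distribution $p(u, x_t)$ of the universe $u$ and utterance $x_t$. With $p(u)$ as the prior distribution over universes, the conditional probability is:

$$p(u \mid x_t) = \frac{p(x_t \mid u)\, p(u)}{p(x_t)}$$

We can substitute $p(u \mid x_t)$ for $p(x_t \mid x_{1:t-1}, u)$ in Eq. 1 by assuming conditional independence (i.e., $p(x_t \mid x_{1:t-1}, u) = p(x_t \mid u)$), uniform prior distribution (i.e., $p(u) = 1/|\mathcal{U}|$) and constant marginal probability (i.e., $p(x_t) = c$ for a constant $c$). These assumptions are justified when the probabilistic topic classifier is a naïve Bayes classifier (Bishop, 2006) with uniform prior. Thus, the substitution follows the following steps:

$$\begin{aligned}
p(x_t \mid x_{1:t-1}, u) &= p(x_t \mid u) && \text{[cond. independence]} \\
&= \frac{p(u \mid x_t)\, p(x_t)}{p(u)} && \text{[Bayes' theorem]} \\
&\propto p(u \mid x_t)\, p(x_t) && \text{[uniform prior]} \\
&\propto p(u \mid x_t) && \text{[const. marginal]}
\end{aligned}$$

Eq. 1 thus becomes:

$$p(u \mid x_{1:t}) = \frac{p(u \mid x_t)\, p(u \mid x_{1:t-1})}{\sum_{u' \in \mathcal{U}} p(u' \mid x_t)\, p(u' \mid x_{1:t-1})} \tag{2}$$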
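The recursive update with a classifier in place of the likelihood can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the classifier outputs are hypothetical hand-picked numbers, the prior is uniform as in the text, and the function names are assumptions. The list of posteriors it returns is the sequence of beliefs over the dialogue.

```python
def update_belief(prior, classifier_probs):
    """One recursion step: posterior is proportional to classifier output times prior."""
    unnorm = {u: classifier_probs[u] * prior[u] for u in prior}
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


def belief_sequence(per_utterance_probs, universes):
    """Posteriors after each utterance, starting from a uniform prior."""
    belief = {u: 1.0 / len(universes) for u in universes}
    sequence = []
    for probs in per_utterance_probs:
        belief = update_belief(belief, probs)
        sequence.append(belief)
    return sequence


universes = ["Science", "Talk"]
# Hypothetical classifier outputs p(u | x_t) for two successive utterances.
outputs = [
    {"Science": 0.8, "Talk": 0.2},
    {"Science": 0.6, "Talk": 0.4},
]
arc = belief_sequence(outputs, universes)
print(arc)
```

Two utterances that each mildly favour Science compound: the belief in Science rises from 0.8 after the first utterance to 6/7 after the second.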

2.4 The Narrative Arc

As defined in Eq. 1, the posterior $p(u \mid x_{1:t})$ is a function of the dialogue $x_{1:t}$. We define the narrative arc as the sequence of universe distributions iteratively calculated for the dialogue. The arc depicts the evolution of a belief over a set of universes. The narrative arc function maps a dialogue $x_{1:t}$ to $t$ successive points in $\Delta(\mathcal{U})$, where $\Delta(\mathcal{U})$ is a probability simplex over $\mathcal{U}$. We discuss three properties of the narrative arc of the synthetic dialogue shown in Fig. 1:

1. Utterances affect the arc in varying degrees. “My favorite scientist and academic is Albert Einstein” is similarly likely under Science and Talk, and less likely under the Recreation universe (bottom green line). Different utterances should have different effects on the posterior $p(u \mid x_{1:t})$.

2. A concentrating posterior signals a revealing dialogue. A dialogue which emphasizes scientific content, for example, should see $p(\text{Science} \mid x_{1:t}) \to 1$. Conversely, we would expect a concealing dialogue to spread the posterior across multiple universes.

3. A universe model is a perspective on dialogue. Different universe models can expose different aspects of the same dialogue. Replacing the Newsgroups universe model by a Movies universe model suggests the dialogue is from a Drama and/or Comedy universe. This dialogue would be considered revealing under both universe models.

Figure 2: First 20 lines of Romeo and Juliet modeled with Newsgroups (top), Movies (middle), and DeepMoji (bottom) universe models.

In this section, the universe model was applied to a fixed dialogue $x_{1:t}$, but the model also provides a criterion for favoring utterances when generating dialogue.

2.5 Generating Dialogue with the Narrative Arc

The entropy of the posterior is given by:

$$H(x_{1:t}) = -\sum_{u \in \mathcal{U}} p(u \mid x_{1:t}) \log p(u \mid x_{1:t})$$

Then, the entropy change due to a new utterance, $x_t$, given the past dialogue, $x_{1:t-1}$, is defined as:

$$\Delta_{\mathrm{ent}}(x_t, x_{1:t-1}) = H(x_{1:t-1}) - H(x_{1:t})$$

The term $\Delta_{\mathrm{ent}}$ measures how much a given utterance $x_t$ changes the entropy of the posterior, given the previous utterances $x_{1:t-1}$. A positive value of $\Delta_{\mathrm{ent}}$ is a reduction in entropy (i.e. revealing). Conversely, a negative value of $\Delta_{\mathrm{ent}}$ is an increase in entropy (i.e. concealing). We define the score of an utterance $x_t$, with respect to a dialogue, $x_{1:t-1}$, as:

$$s(x_t \mid x_{1:t-1}) = \exp\big(\Delta_{\mathrm{ent}}(x_t, x_{1:t-1})\big)$$

The exponential function is a convenient way to ensure strict positivity and preserve the ordering of scored candidates. We use our entropy-based score function to modulate the sampling of a base conversation model, $p$, toward $q$, which depends on the change in entropy due to the new utterance.

$$q(x_t \mid x_{1:t-1}) \propto p(x_t \mid x_{1:t-1})\, s(x_t \mid x_{1:t-1})^{\alpha} \tag{3}$$

If $\alpha = 0$, $q = p$ and candidates are sampled according to $p$. If $\alpha \neq 0$, $p$ is modulated by the score $s$. Modulation mode depends on the value of $\alpha$:

  • $\alpha > 0$ (reveal): modulate towards revealing the universe. The probability of utterances likely under the universe with highest probability is increased.

  • $\alpha < 0$ (conceal): modulate towards concealing the universe. The probability of utterances likely under multiple unlikely universes is increased. Utterances not supporting the likely universe are made more likely.
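The entropy-based score and the modulation of Eq. 3 can be illustrated on a finite candidate set. The sketch below is a minimal example under stated assumptions: posteriors are plain dictionaries, the modulated distribution is normalized over the listed candidates only, the modulation exponent is called `alpha`, and the function names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a distribution given as {universe: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def score(prev_posterior, new_posterior):
    """Exponential of the entropy reduction caused by the new utterance."""
    return math.exp(entropy(prev_posterior) - entropy(new_posterior))


def modulate(base_probs, scores, alpha):
    """Reweight base probabilities by score**alpha and renormalize."""
    unnorm = [p * (s ** alpha) for p, s in zip(base_probs, scores)]
    z = sum(unnorm)
    return [w / z for w in unnorm]


prev = {"Science": 0.5, "Talk": 0.5}
after_revealing = {"Science": 1.0, "Talk": 0.0}  # entropy drops to zero
after_neutral = {"Science": 0.5, "Talk": 0.5}    # entropy unchanged

scores = [score(prev, after_revealing), score(prev, after_neutral)]
base = [0.5, 0.5]
for alpha in (1.0, 0.0, -1.0):
    print(alpha, modulate(base, scores, alpha))
```

A positive exponent favours the candidate that concentrates the posterior, a zero exponent leaves the base probabilities unchanged, and a negative exponent favours the candidate that keeps the posterior spread out.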

We use these two modulations for filtering samples from our base conversation model. We filter via one of two methods for sampling from an unnormalized distribution: greedy sampling and rejection sampling. Greedy sampling scores a set of samples from the conversation model and selects the candidate with the maximum score. Scoring a large set of candidates can be time intensive. Rejection sampling (Alg. 1) can sample from the desired unknown modulated distribution online (Murphy, 2012). Additional details on rejection sampling are included in the supplementary material. As the entropy function is bounded, the utterance score is bounded. In practice, we set a max score and weigh all utterance scores above the threshold equally. Both filtering methods have benefits. Rejection sampling provides a smoother distribution and does not require scoring a large set of candidates. Greedy sampling is less sensitive to the range of score values across different utterances.

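  Given: conversation model $p$, scoring function $s$, first line $x_1$, length $T$, max score $M$, max samples $N$
  Return: dialogue $x_{1:T}$
  for $t$ in $2, \dots, T$ do
     while step $\le N$ do
        sample $x \sim p(\cdot \mid x_{1:t-1})$, sample $r \sim \mathrm{Uniform}(0, 1)$
        if $r \le s(x \mid x_{1:t-1})^{\alpha} / M$ then
            append $x$ to $x_{1:t-1}$; break
        end if
     end while
  end for
Algorithm 1 Generating dialogue with rejection sampling.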
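A runnable sketch of dialogue generation by rejection sampling is given below. It is an illustration under stated assumptions, not the authors' implementation: the conversation model is any callable that proposes one candidate line, the scorer is any callable returning a positive score, the clipping is applied to the modulated score, and the fallback used when the sample budget is exhausted (keeping the last proposal) is a choice made here for the example. The modulation exponent is named `alpha` and all other names are hypothetical.

```python
import random


def generate_dialogue(sample_line, score_fn, first_line, length,
                      alpha, max_score, max_samples, seed=0):
    """Grow a dialogue line by line, accepting proposals by rejection sampling."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    rng = random.Random(seed)
    dialogue = [first_line]
    while len(dialogue) < length:
        accepted = None
        for _ in range(max_samples):
            candidate = sample_line(dialogue, rng)
            # Clip the modulated score so the acceptance ratio stays in [0, 1].
            weight = min(score_fn(candidate, dialogue) ** alpha, max_score)
            if rng.random() <= weight / max_score:
                accepted = candidate
                break
        if accepted is None:
            accepted = candidate  # assumption: fall back to the last proposal
        dialogue.append(accepted)
    return dialogue


lines = ["I study physics.", "Okay.", "Hmm."]


def toy_sampler(dialogue, rng):
    """Hypothetical conversation model: uniform choice over three lines."""
    return rng.choice(lines)


def toy_score(candidate, dialogue):
    """Hypothetical scorer: lines mentioning physics count as more revealing."""
    return 2.0 if "physics" in candidate else 1.0


out = generate_dialogue(toy_sampler, toy_score, "Hello.", length=5,
                        alpha=1.0, max_score=2.0, max_samples=20)
print(out)
```

With these toy components, the physics line is accepted on every proposal while the other lines are accepted about half of the time, so it appears more often in the generated dialogue than under the unmodulated sampler.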

3 Evaluation

3.1 Narrative Arc of Existing Dialogues

In Fig. 2, we visualize the narrative arc underlying the first 20 lines of Shakespeare’s Romeo and Juliet using three universe models: 1) Newsgroups, 2) Movies, and 3) DeepMoji.

Fig. 2 illustrates the entropy-reducing nature of good dialogue by showing us the underlying, evolving, narrative arc. Under the Newsgroups universe model, the dialogue evolves toward a talk-centric universe. Under the Movies model, the same dialogue balances between comedy and drama before shifting towards drama. Finally, using the DeepMoji universe model, a developing ambiguity between DeepMoji universes Sadness and Love is uncovered. This supports the hypothesis that existing dialogues exhibit underlying narrative arcs conditioned on universe models. Additional samples exposing narrative arc dynamics are presented in the supplementary material.

Figure 3: Narrative arcs over 10 utterances at increasing $\alpha$ values: concealing (top), neutral (mid), revealing (bottom). On the right are utterances generated by each model after priming (bold). Dotted red line indicates the start of narrative arc shaping.

3.2 Shaping the Narrative Arc

In this section, we demonstrate that our method is able to modulate conversation models toward generation of revealing or concealing dialogues. Linguistic quality and semantic consistency of utterances are determined by the language underlying the conversation model. Here, we emphasize evaluation of narrative arc shaping.

We use the DNN conversation model to test how preferential selection, induced by our score function, can modulate information introduced into the conversation. In Fig. 3 we present characteristic narrative arcs and dialogues using concealing (top), neutral (middle), and revealing (bottom) modes. Each generation was primed with the first two lines from Romeo and Juliet (shown in bold in Fig. 3).

A significant difference is exposed between concealing (top), which tends toward a high-entropy, uniform universe distribution, and revealing (bottom), where the probability of Drama tends toward 1. Drama remains the most likely universe (and visible on all plots) as it was supported by the first two lines and subsequent utterances did not significantly shift the distribution. Fig. 3 also shows the dialogue generated by the model. Concealing utterances do not add information to the dialogue, whereas revealing utterances incorporate new information over the course of the dialogue.

We next evaluate our method’s ability to generate concealing/revealing dialogue by measuring the entropy under both an objective universe (i.e. the universe model used for scoring in generation) and a test universe not used for scoring. We use the Newsgroups universe model for objective scoring and the Movies model for testing. A random conversation model is used to generate response candidates.

We generate 20 conversations following a process similar to Algorithm 1 but using greedy sampling. Each conversation starts with a random dialogue starter line to encourage diversity and then 19 lines are sampled from the conversation model using the narrative arc function. This approximates the length of a medium-duration improvised conversation (Sawyer, 2003).

Results are presented in Fig. 4. There is a significant difference between the entropy under the objective and testing universes, but each model exhibits similar dynamics over the dialogues. We conclude that concealing dialogue can conceal under multiple universes, and revealing dialogue can reveal information under multiple universe models.

The revealing/concealing dynamics of each utterance may be related to measurable lexicographical qualities such as words per sentence (WPS). We analyzed the language used in lines from each model and found a significant difference in WPS between utterances selected by the revealing model and utterances selected by the concealing model.

Figure 4: Revealing and Concealing across Universe Models. Dialogue generated to be (a) revealing ($\alpha > 0$) under the objective model Newsgroups is revealing under the testing Movies universe. The same is true for (b) concealing ($\alpha < 0$) dialogue. Data shown are means and standard deviation (shaded) over 20 runs of random conversation model.

3.3 Predicting the Next Best Line

We next test the system’s ability to add information to improve performance on a prediction task. Given a sequence of gold-standard conversational utterances and a list of next utterance candidates (i.e. the ground truth and distractors), can the universe model be used to improve accuracy of predicting the ground truth?

Evaluation compares top-$k$ accuracy and mean reciprocal rank (MRR) over samples in a held out test set. Accuracy measures the likelihood that the system scores the ground truth within the top-$k$ candidates against the distractors. MRR compares average ground truth ranking across conditions. A text2text Transformer language model was trained on the OpenSubtitles dataset (Lison et al., 2018) to predict an output line given a set of input lines (Vaswani et al., 2017). Additional details are in the supplementary material.
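Both metrics follow directly from the rank of the ground truth among the candidates. The sketch below shows one way to compute them; the candidate scores are made-up numbers, higher scores are taken to mean better candidates, ties are resolved in favour of the ground truth, and the function names are hypothetical.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the ground truth; a higher score means a better candidate."""
    truth = scores[truth_index]
    better = sum(1 for i, s in enumerate(scores) if i != truth_index and s > truth)
    return 1 + better


def top_k_accuracy(ranks, k):
    """Fraction of examples whose ground truth is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all examples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Each example: (scores for all candidates, index of the ground truth).
examples = [
    ([0.1, 0.7, 0.2], 1),
    ([0.5, 0.3, 0.2], 2),
    ([0.4, 0.1, 0.5], 0),
]
ranks = [rank_of_truth(scores, idx) for scores, idx in examples]
print(ranks, top_k_accuracy(ranks, 2), mean_reciprocal_rank(ranks))
```

When candidates are scored by perplexity, where lower is better, the scores would be negated before ranking.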

The trained Transformer model was used to assign a perplexity score for output line candidates given an input context. For each unique subtitle file in the validation and test sets, the concatenation of the first $N$ lines serves as input context and line $N+1$ is the ground truth output to be predicted. Negative candidates are randomly selected from lines in the respective corresponding data segment (i.e. validation or test sets), and thus may not be from the same source file as the input context lines.

The perplexity under the trained conversation model serves as the unmodulated probability $p$ (Eq. 3) of selection in the prediction task. The input sequence is then passed, line-by-line, through a Newsgroups universe model and a score is assigned to each candidate relative to the change in entropy of the evolving posterior. The value of $\alpha$ is modulated over evenly spaced values. The accuracy of predicting the ground truth in the top-$k$ candidates and the MRR of the ground truth are computed.

The results on the validation set are shown in Fig. 5. By selecting the correct $\alpha$ value, the likelihood of correctly selecting utterances revealing an incremental amount of information increases significantly. Note the shape of the curve as $\alpha$ changes. As hypothesized, there exists a region of $\alpha$ values where the ‘right’ amount of universe information is revealed. This region corresponds to the notion that each line of dialogue will reveal some, but not too much, information about the universe. As $\alpha$ continues to increase, the accuracy decreases below the neutral baseline. The top-$k$ accuracy of prediction increases when the universe model boosts the probabilities of appropriately revealing dialogue.

The validation set is used to set the optimal $\alpha$, which is then used to score samples in the test set; results are presented in Table 1. Two additional models are included for comparison. T2T@1 uses only the single line preceding the ground truth as context. Unigram assigns a perplexity to output candidates by building a unigram language model using the input lines as a corpus. A smoothing factor is used for out-of-vocabulary words. Additionally, a random conversation baseline model is included. For each model tested, information from the universe model significantly improves the predictive accuracy on this task.

CM        UM        Top3Acc   MRR
T2T@5     NG        0.520     0.456*
T2T@5     Neutral   0.507     0.444
T2T@1     NG        0.483     0.428*
T2T@1     Neutral   0.469     0.412
Unigram   NG        0.366     0.337*
Unigram   Neutral   0.296     0.290
Random    Neutral   0.302     0.294
Table 1: Results for predicting the next line. CM is the conversation model, UM is the universe model, Top3Acc is the accuracy of predicting the ground truth in the top 3 candidates, and MRR is the mean reciprocal rank of the ground truth. Unigram CM calculates the perplexity of each candidate given the input lines as training corpus. T2T@N is a Tensor2Tensor Transformer model which uses the previous N lines as an input to predict the output and NG is the Newsgroups universe. A Neutral universe model represents no modulation, which is equivalent to $\alpha = 0$. * indicates a significant difference under a Student’s t-test comparing MRR to the Neutral model.

Figure 5: Information revelation region as $\alpha$ varies for (left) top-$k$ accuracy and (right) MRR in the universe model modulated prediction task.
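The Unigram baseline can be sketched as a unigram language model built from the context lines and used to compute the perplexity of a candidate. The example below is only an illustration: additive smoothing is assumed as the smoothing scheme, the value `smoothing=0.1` is an arbitrary placeholder rather than the value used in the experiments, tokenization is whitespace splitting, and the function name is hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(candidate, context_lines, smoothing=0.1):
    """Perplexity of a non-empty candidate under a smoothed unigram model."""
    counts = Counter(w for line in context_lines for w in line.lower().split())
    words = candidate.lower().split()
    # Include candidate words in the vocabulary so unseen words get some mass.
    vocab = set(counts) | set(words)
    total = sum(counts.values()) + smoothing * len(vocab)
    log_prob = sum(math.log((counts[w] + smoothing) / total) for w in words)
    return math.exp(-log_prob / len(words))


context = ["the cat sat", "the cat ran"]
familiar = unigram_perplexity("the cat", context)
unfamiliar = unigram_perplexity("a dog", context)
print(familiar, unfamiliar)
```

A candidate that reuses words from the context receives a lower perplexity than one made entirely of out-of-vocabulary words, which is the signal this baseline relies on.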

3.4 Interactive Collaborative Dialogue

Finally, as a practical implementation case-study, we tested how this system performs in collaborative dialogue through interaction with humans. Four expert improvisational theatre performers engaged with the system in text-based conversations. Each conversation consisted of five utterance-response pairs for a total of ten utterances (i.e. an average length of a short-duration improvised scene (Sawyer, 2003)). Subjects are native English speakers with 5+ years of professional performance experience and are familiar with shared narrative development and collaborative dialogue. Each interacted with revealing, concealing, and neutral models in a randomized order unknown to them. Transcripts of actor-system dialogues and qualitative feedback are included in supplementary material.

As described in Sec. 2.1, this experiment used the Books conversation model and the DeepMoji universe model. Following the interactions, each performer was asked the following question: “please rank the conversations from 1 (most engaging) to 3 (least engaging)”. Engagingness was defined to align with the notions of revealing and concealing in this work. An agent is engaging for shared scene development if it brings just enough information to add specificity and reduce ambiguity but not limit the conversation.

Three of the four performers ranked the revealing model ($\alpha > 0$) as the most engaging. Those three performers ranked the neutral model ($\alpha = 0$) as being less engaging due to being “too random”. All subjects ranked the concealing model ($\alpha < 0$) as being least engaging and not bringing enough information to the scene. These results support the hypothesis that $\alpha$ can effectively modulate the engagingness of collaborative dialogue in human-machine interaction.

4 Related Work

Collaborative dialogue between humans and machines has been proposed as a grand challenge in artificial intelligence (Mathewson & Mirowski, 2017b; Martin et al., 2016; Brown, 2008). Previous methods have used hard coded rules, decision trees, and event representations to generate novel narrative chains (Martin et al., 2017). We used a deep neural network-based generative language model enhanced with universe model information in the context of improvised theatre (Mathewson & Mirowski, 2017a).

While neural response generation systems provide a trainable end-to-end system for language generation, these methods are prone to providing generic, unspecific responses (Li et al., 2015). Recent advances have improved generated responses by optimizing sentence encoding and decoding jointly, post-generation candidate re-scoring (Bordes et al., 2016; Vinyals & Le, 2015; Sordoni et al., 2015), reinforcement learning (Li et al., 2016), hierarchical models for distilling extended context (Serban et al., 2016), and auxiliary training objectives, such as maximizing mutual information (Li et al., 2015), and personality specificity and consistency (Li et al., 2016; Zhang et al., 2018). In future work, universe models and conversational models could be trained jointly.

Our work is related to the controlled generation of text using disentangled latent representations (Hu et al., 2017; Zhou et al., 2017; Asghar et al., 2018). Previous work has used a topic-transition generative adversarial network to enforce smoothness of transition of subsequent utterances (Liang et al., 2017). These methods use neural encoder-decoders and generate responses given an input sequence and a desired target class for the response.

Other, recent work has aimed to improve candidates returned by retrieval-based conversation models (Weston et al., 2018). These methods utilize a conversation model to find similar prototypes using embedding distances and refine prototypes with a sequence-to-sequence model (Guu et al., 2017). We do not refine candidates from the conversation model, rather we sample and select using a scoring function defined by the revealing and concealing parameter.

Similar to universe models, topic models or lexical fields have been shown capable of tracking general subjects of a text (Blei et al., 2003; Geeraerts, 2010). Dynamic topic models characterize the evolution of topics over a set of documents over time (Blei & Lafferty, 2006). Our work differs in that we generate dialogue using the evolving probabilistic belief during a single conversation, as opposed to tracking topical shifts over longer timescales. Using a probabilistic classifier for narrative tracking has been explored previously (Mohammad, 2011; Reagan et al., 2016). These works used sentiment classifiers to track emotion and plot arcs through narratives. We extend these works by using probabilistic universe models for collaborative dialogue generation.

5 Discussion and Conclusion

While innovations have improved the linguistic quality, semantic alignment, and consistency of utterances generated by neural models, generated conversations still lack interestingness and engagingness. Our work generates engaging dialogue by shaping the underlying narrative arc as opposed to improving the training of generative language models. The methods presented are agnostic to both the universe and the conversational model used. Using rules from improvised theatre, we quantitatively define the evolution of interesting and engaging dialogue.

In this work we focus on genre, emoji, and topic-based universe models. Other universe models to be explored involve causality of events, directions of relationships, or audience reaction prediction (Riedl & Young, 2006; Knight et al., 2011; Trabasso & Sperry, 1985; Cook, 1928; Eger et al., 2015). While this work explores the interaction between a base conversation model and a universe model, this method could be compatible with image or video generation.

The main contribution of this work is the computational formalization of the narrative arc, an information-theoretic framework for collaborative dialogue interaction. The framework fills a gap in previous research by connecting the utterance-level improvements of language models with the conversation-level improvements of universe tracking. This is done by sampling candidates from a conversational model using a universe model and the narrative arc. We illustrate narrative arcs underlying popular dialogues and show how universe models can be combined with conversation models to generate interesting dialogue. We present empirical results showing how the narrative arc can improve accuracy on a next line prediction task. Finally, we present an expert user-study to validate our model.

References

6 Supplementary Material

Appendix A Data Processing Details

OpenSubtitles were used as conversation model data (http://opus.nlpl.eu/OpenSubtitles.php). The dataset was preprocessed by removing duplicate movie subtitle files, lines under 10 characters, and duplicate lines, resulting in 68,719,885 unique lines. The text2text Transformer model from Google’s open source implementation was used for training (https://github.com/tensorflow/tensor2tensor). As several files in the dataset cover the same uniquely identified movie or television show, duplicates were removed by keeping only the subtitle file with the most lines for each unique ID. The dataset was split into training IDs, validation IDs, and testing IDs. The data was prepared for training by removing empty lines and duplicate lines, and by substituting non-Unicode characters. A vocabulary was built using the training set. After cleaning, subtitle files with less than lines were excluded. Training data was formatted into input/response pairs. The training data was split into training examples and evaluation examples. Validation and testing subtitle sets were held out to measure task accuracy on unseen data. The hyperparameters of the Transformer model were set as follows: hidden size of , filter size of , batch size of , heads, and a dropout factor of was used for regularization. The model was trained for steps, to convergence, with final negative log-perplexity of on evaluation set.
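The cleaning steps described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, duplicate lines are removed per file (the text does not say whether deduplication was per file or global), consecutive cleaned lines are assumed to form an input/response pair, and the minimum file length is a required argument because its value is not stated here.

```python
def keep_longest_per_id(files):
    """For each unique ID, keep only the subtitle file with the most lines.

    `files` is an iterable of (movie_id, list_of_lines) tuples.
    """
    best = {}
    for movie_id, lines in files:
        if movie_id not in best or len(lines) > len(best[movie_id]):
            best[movie_id] = lines
    return best


def clean_lines(lines, min_chars=10):
    """Drop empty lines, lines under `min_chars` characters, and duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


def build_pairs(files, min_lines):
    """Return (input, response) pairs from consecutive cleaned lines.

    Files with fewer than `min_lines` cleaned lines are excluded.
    """
    pairs = []
    for movie_id, lines in sorted(keep_longest_per_id(files).items()):
        cleaned = clean_lines(lines)
        if len(cleaned) < min_lines:
            continue
        pairs.extend(zip(cleaned, cleaned[1:]))
    return pairs


# Toy data: two files share the ID "tt1"; "tt2" is too short after cleaning.
files = [
    ("tt1", ["Hello there, how are you?", "ok", "I am fine, thank you."]),
    ("tt1", ["Hello there, how are you?", "I am fine, thank you.",
             "Glad to hear that, truly.", "Hello there, how are you?"]),
    ("tt2", ["Too short", ""]),
]
pairs = build_pairs(files, min_lines=2)
```

Splitting by ID before forming pairs, as the text describes, keeps every line of a given film in a single partition, so validation and test accuracy are measured on unseen films rather than unseen lines from familiar ones.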

Appendix B Rejection Sampling

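Rejection sampling is a means of online sampling that allows for sampling from an unknown distribution. Suppose we are given an unnormalized distribution $\tilde{p}$ over $\mathcal{X}$ which we can query (i.e., evaluate $\tilde{p}(x)$) but not necessarily integrate over. Write $Z = \sum_{x \in \mathcal{X}} \tilde{p}(x)$ for its normalizer and $p = \tilde{p}/Z$ for the normalized distribution. Let $q$ be a proper distribution over $\mathcal{X}$ such that $\tilde{p}$ is dominated by $Mq$, $M > 0$:

$$\tilde{p}(x) \le M q(x) \quad \text{for all } x \in \mathcal{X}. \tag{4}$$

The rejection sampling algorithm to obtain a sample from an unnormalized distribution $\tilde{p}$ using samples from a proper distribution $q$ proceeds as follows:

  1. Sample $x \sim q$ and sample $u \sim \mathrm{Uniform}[0, 1]$,

  2. If $u \le \tilde{p}(x) / (M q(x))$, accept $x$ as a sample drawn from $p$, otherwise reject the sample and go to 1.

This algorithm will take an average of $M/Z$ iterations to obtain a sample. Let $X$ be the random element returned by this procedure.

Proposition 1.

Rejection sampling samples $x$ with probability $p(x)$.

Proof.

Write $x_1, x_2, \dots$ for the sequence of sampled symbols, and write $A_i$ to denote the event that the symbol $x_i$ is accepted (at which point we stop the process). Then

$$\Pr(X = x) = \sum_{i=1}^{\infty} \Pr(x_i = x,\, A_i,\, \neg A_1, \dots, \neg A_{i-1}).$$

Now, this process is memoryless (in a sense we should make a little more formal) and

$$\Pr(X = x) = \sum_{i=1}^{\infty} \Pr(x_1 = x,\, A_1)\,\bigl(1 - \Pr(A_1)\bigr)^{i-1}.$$

Since $\Pr(A_1) > 0$, the geometric sum converges and

$$\Pr(X = x) = \frac{\Pr(x_1 = x,\, A_1)}{\Pr(A_1)},$$

which is the conditional probability of $x$ given $A_1$. Now

$$\Pr(x_1 = x,\, A_1) = q(x)\,\frac{\tilde{p}(x)}{M q(x)} = \frac{\tilde{p}(x)}{M},$$

where we used (4) to guarantee that $\tilde{p}(x) / (M q(x)) \le 1$. But then

$$\Pr(A_1) = \sum_{x' \in \mathcal{X}} \frac{\tilde{p}(x')}{M} = \frac{Z}{M}.$$

We conclude that

$$\Pr(X = x) = \frac{\tilde{p}(x)/M}{Z/M} = \frac{\tilde{p}(x)}{Z} = p(x),$$

as desired. ∎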
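The two-step procedure in this appendix can be sketched in Python on a small discrete example. This is an illustrative sketch only: the names `p_tilde` (unnormalized target), `q_sample` and `q_prob` (proposal sampler and its probability), and `M` (dominating constant) are chosen here, and the three-symbol target with a uniform proposal is a toy assumption, not a distribution from the paper.

```python
import random


def rejection_sample(p_tilde, q_sample, q_prob, M, rng):
    """Draw one sample from p(x) proportional to p_tilde(x).

    Requires p_tilde(x) <= M * q_prob(x) for every x.
    Returns (sample, number_of_proposals_used).
    """
    proposals = 0
    while True:
        proposals += 1
        x = q_sample(rng)          # step 1: propose x from q
        u = rng.random()           # step 1: uniform draw in [0, 1)
        if u * M * q_prob(x) <= p_tilde(x):   # step 2: accept test
            return x, proposals


# Toy unnormalized target over three symbols; its normalizer is 10.
weights = {"a": 1.0, "b": 3.0, "c": 6.0}
symbols = list(weights)


def q_prob(x):
    """Uniform proposal probability."""
    return 1.0 / len(symbols)


def q_sample(rng):
    """Draw a symbol uniformly at random."""
    return rng.choice(symbols)


# Smallest constant M with weights[x] <= M * q_prob(x) for all x.
M = max(weights.values()) * len(symbols)

rng = random.Random(0)
draws = [rejection_sample(weights.get, q_sample, q_prob, M, rng)
         for _ in range(20000)]
freq = {s: sum(1 for x, _ in draws if x == s) / len(draws) for s in symbols}
mean_proposals = sum(n for _, n in draws) / len(draws)
```

In this toy setting the empirical frequencies in `freq` should approach the normalized weights (0.1, 0.3, 0.6), and `mean_proposals` should approach the dominating constant divided by the normalizer, here 18/10. A tighter constant `M` wastes fewer proposals, which is the practical reason to pick a proposal close in shape to the target.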

Appendix C Additional Narrative Arcs Underlying Existing Dialogues

We present an additional set of narrative arcs underlying 20-utterance scenes, as shown in Figure 2 and discussed in Section 3.1. In the paper body we report results on the Romeo and Juliet scene.

  1. “Argument Clinic” Monty Python’s Flying Circus, Episode 29 (1972). First 20 lines.

  2. “Blade Runner” Hampton Fancher and David Peoples. Adapted from a novel by Philip K. Dick (1982). Last 20 lines of director’s cut movie script.

  3. “The Great Dictator” Charlie Chaplin. Los Angeles, CA: United Artists (1940). First 20 lines of Charlie Chaplin’s speech.

  4. “Dead Parrot Sketch” Monty Python’s Flying Circus, Episode 8 (1969). First 20 lines.

  5. “Romeo and Juliet” William Shakespeare. Vol. 1. JB Lippincott, (1871). Act 1, Scene 1, first 20 lines.

Figure 6: Narrative Arc of “Argument Clinic”
Figure 7: Narrative Arc of “Blade Runner”
Figure 8: Narrative Arc of “The Great Dictator”
Figure 9: Narrative Arc of “Dead Parrot Sketch”

Appendix D Expert Improvisor Conversation Transcripts

D.1 Subject1, Mode: Conceal

H: Good morning dad, is breakfast ready?
System: Breakfast is ready.
Human: Great! What are we having?
System: We are having Lamb Chops and peas for dinner.
Human: Yummmm. But what are we having for breakfast?
System: some breakfast.
Human: Ok, thanks dad. I have a big ball game later, I would love if you and mom came to watch.
System: Likewise, Dad will fire up the chainsaw and be ready to cut through the big limbs at the joints as you work.
Human: So I guess you’re not coming to the game. You work too much to be good father.
System: I’m doing the best I can as a father, and I hope that’s good enough.
Human: It’s not. I’m moving out unless you change.
System: If you will leave me to it, I will pack up my things and move out as quickly as I can so as not to inconvenience you.

D.2 Subject1, Mode: Neutral

H: Rosanne, thanks for meeting me under the oak tree
System: Keryn would also like to thank her cousins, Dana, Jason, Staci, and Ryan, and her aunt Marge Naegele for continually supporting her decisions even when they thought she was crazy.
Human: It means the world that you all came to meet here today as we prepare for another one of my crazy ideas
System: We shall all be here at four o’clock.
Human: Who else is coming?
System: A lot of people are coming.
Human: How many?
System: How many and where?
Human: We need at least 200 here at the oak tree
System: Gonna need at least three or four hundred.
Human: You’re right. How fast can they get here?
System: however fast they are travelling.
Human: Well I can’t argue there. Prepare the tamborines. We all need tamborines for my crazy idea.
System: But we’re depositing the money—you’ll need a chequebook.

D.3 Subject1, Mode: Reveal

H: Welcome to tim hortons, what can I get you?
System: I’ll give you doughnuts and coffee with milk.
Human: Great work Jason, you’re doing great on your first day on the job
System: First day Welcome the employee!
Human: That’s right Jason. You’re very welcome here. If you do a good job, you’ll be a manager in no time.
System: Being responsible and knowledgeable is possible in a participative environment if the manager is a member of the overall team.
Human: Wow. Jason you continue to surprise me. Where did you learn that?
System: “The next time I go up I’ll get the information for you,” Jason told him.
Human: Jason? Are you ok?
System: JASON: I’m glad you’re feeling better.
Human: You’re starting to scare me
System: And I frighten you.

D.4 Subject2, Mode: Conceal

H: Stacey! How are you? I’m excited that we’re getting our nails done for Sams wedding tomorrow!
System: I have to tell you I’m so excited for you!
Human: Thank you! Maid of Honour is a real honour! Did you remember to get them a gift?
System: I am so, so glad you’re my maid of honor.
Human: And?
System: About what?
Human: Did you remember to get your parents a gift? They are hosting your wedding after all.
System: Otherwise, plan to give them their gifts at the rehearsal dinner.
Human: Oh that’s a good idea! Hopefully they won’t need it earlier.
System: They also need it in time so they can get it in the fridge to chill.

D.5 Subject2, Mode: Neutral

H: Good Morning! I was wondering if you had any butcher knives for sale?
System: I had an extra one, roast carving size, sent me by the then president of Macy’s last summer, but when Mr. Case sent me two of his sets I gave this odd knife to my part-time cook.
Human: Wow! May I see it? I’m getting a… gift… for my boyfriend.
System: I sent it to you because you are now a man and it’s time for you to look at the future and fulfill the promise I gave to you in the desert—that I’d see you married.”
Human: Why would I want to make you happy? We haven’t spoken in years!
System: Because I saw how kind your heart was?
Human: I have missed you. I don’t know if I have what it takes. Am I ready to commit? Am I ready to take care of someone other than myself?
System: I am ready, willing, and worthy to speak with you right here and right now.
Human: Will you teach me the butcher ways. So that I may follow in your steps?
System: I do it the easy way: I ask the butcher to score it; his knives are better than mine, for a start.

D.6 Subject2, Mode: Reveal

H: Crikey Mate! Did you put the shrimp on the barby yet?
System: We started the shrimp over a hot fire, then finished them over low heat to allow them to cook through gently.
Human: Well that sounds delicious! We need to impress the Boss tonight! Did Karen make her cheese dip too? I can’t get enough of that!
System: She makes some and we sit at the table again.
Human: Sounds great Greg! Listen, I really appreciate you hosting this party, and I hope you’re not too upset that I got the promotion over you.
System: (11) If Greg came to the party, all hell would break loose.
Human: Yeah I can’t stand that guy. Why have two Gregs in the office! You are all we need, am I right?
System: He’s replacing Laura as our new office (assistant).
Human: Sounds like a plan. Well thanks again. Maybe next month you’ll get the promotion!
System: And you may depend on me to know my place if you get promoted.

D.7 Subject3, Mode: Conceal

H: Phew! Good run today Meredith.
System: Meredith felt guilty even thinking it, but yesterday had been a good day.
Human: Yesterday was such a good day that Meredith never wanted it to end.
System: Now she really needed Meredith to go away.
Human: She needed to go somewhere new, somewhere for a fresh start. “I’m moving to Brazil!” She shouted.
System: Ava made a last-minute decision to return to America by way of Spain, fitting in another visit to Madrid, taking a few days to see some of her new friends and acquaintances in that city of which she had grown so fond.
Human: See Ava and Meredith had always wanted to travel together, and this could be there last chance, before they were married and settled down.
System: During the summer of 1957, when Lorenzo Jr. was fourteen and Rani Meredith was nine, the family embarked on a rare train trip to visit relatives in Atlantic City, Philadelphia, New York, and New Bedford, Massachusetts.
Human: That was the last time the two friends had been close, so a trip was long overdue.
System: The trip to Dinard was ‘a good opportunity for ending the fatal friendship that had sprung up between us, and ending it without bitterness’.

D.8 Subject3, Mode: Neutral

H: Tell me where the gold is! You can’t lie anymore.
System: Gold was found in the bed of the Pactolus.
Human: Ah so you admit it. You stole from King Pactolus.
System: I didn’t admit it but I didn’t deny it.
Human: Is there a difference? Once a thief, always a thief.
System: The same is a thief and a robber.—The former of these words means the petty thief who commits the smaller or unobserved robbery.
Human: But your robbery was observed. You decided to steal from a royal. Something that will cost you dearly.
System: Yet I desired to commit robbery, and did it.
Human: So you are a free man. And you shall die a free man.
System: You are already free! the

D.9 Subject3, Mode: Reveal

H: Oh no! Mom is gonna be home soon!
System: And now you can’t come home.
Human: Please, just let me inside! It’s getting colder out here!
System: We’ll go inside now.
Human: Okay thank you! jeez, you’re a crazy older brother.
System: Sometimes being a brother can mean you get mixed up in a bit of all of this.
Human: Yeah totally. At least we can be ready for bed now before mom gets back.
System: Do remember mother’s coming in the middle of the day, and she’s much more likely to be early than late.
Human: Exactly! That’s why I’m worried, let’s hurry!
System: Hurry up then!

D.10 Subject4, Mode: Conceal

H: Master, I did not mean to disturb you but your estranged brother has unfortunately passed away.
System: You did not mean any harm, and I owe you no grudge.
Human: Thank you, Master. I have brought you your three favourite things to cheer you up on this sad occasion: chocolate, red wine and a copy of this month’s Teen Vogue.
System: I am so very glad, as I do not know how I could have endured the thought of you & Edward enjoying Oxford life & myself cut off from it all for another year.
Human: Without you, Master, I don’t believe I could enjoy anything in this world. You have done so much for Edward and I; we can’t thank you enough. Because of this, we will forever be in your debt.
System: I thank you for accepting it; and I believe that I may safely promise that you will find the contents of the smaller packet of great importance and advantage to yourself and those dear to you.
Human: I will open this small gift from you in front of you so that you may see my reaction. Good Master! This necklace is beautiful. Are you sure you wish to give this gift to me on the day of your brother’s passing? I am but a lowly servant.
System: Thank you, Lord, that You can use our marriage to call out our unique gifts and abilities.
Human: The Good Lord has graced us with his blessing on this day. And you have blessed Edward and I will your compassion and grace. Before I depart, is there anything else you require, Master?
System: Thou thought’st to help me, and such thanks I give As one near death to those that wish him live.

D.11 Subject4, Mode: Neutral

H: Listen here, Karen- y’all don’t come into my bar and tell me what the heck to put on my menu. ya hear?
System: I’m not thrilled about our menu, but I don’t want to hurt Thom’s feelings.
Human: Thom is supposed to be a professional. A pro-fess-ion-al. He has to take that dang feedback and take it with a smile on his face. Just like you do about the decor.
System: A long day of sessions produced little that Thom was happy with; his dissatisfaction can clearly be seen in the studio sequence that appeared in Meeting People Is Easy, as he complains to Nigel Godrich that they’ve been working for ages and still only two instruments—bass and one guitar—sound passable.
Human: Thom’s bar was designed to feature bands from across the nation, but working with Nigel made it difficult to play in a band and run a bar.
System: The amount of travel created logistical problems, as the distance between venues had not been considered when scheduling each performance.
Human: Bands were consistently arriving late to perform or leaving their sets early. And to top it all off, the food catered to the bands from the bar left something to be desired.
System: A step up, the Kaiserkellar had a dance floor, a better PA, and another band on the bill to take some of the weight; although, because they had to start earlier and finish later, they ended up playing for six hours most nights.
Human: And six hours was much too long for such an amateur band and tempers were runnning high. Nigel, Thom and Karen just couldn’t seem to get along.
System: Indeed, they continued to work astonishingly hard as a live band - but now on a truly international scale.

D.12 Subject4, Mode: Reveal

H: Danielle, Kyle asked me to the prom but I can’t afford a dress.
System: If you dress her in her new “Prom Pinks” outfit, satin splendor with matching coat, gold belt, clutch, and hair bow included, so long as you don’t lift her dress, right?—who’s to know.
Human: You’re absolutely right! We can borrow from my sister’s closet and no one will know it’s hers unless they see her name tag under the dress.
System: And when she finds it, she’ll hide the stuff even better.
Human: We need to be sneaky about it. We could use a bed sheet and hang it on the hangers and she’ll never know. Let’s do it! You grab the bedsheets and I’ll grab the dress.
System: The pinch of the question is—why, after having provided the substitute dress, does she hide the smeared night-gown, instead of destroying it?
Human: She’s embarassed! Kyle asked me to the prom instead of her after their date went horribly wrong. She dropped ketchup all over herself.
System: She thought that maybe she should just cancel the date and forget about the Junior Prom altogether.
Human: She should have but she didn’t. Plus, Kyle is the Captain of the Chess club. He’s the smartest guy around- I couldn’t say no. Was I wrong to do that?
System: Kyle would have lost out on the opportunity to get to know someone who could be a mentor or valuable resource.

Appendix E Expert Improvisor Qualitative Assessment

Qualitative assessments were shared in free-form discussion with expert improvisors during and following the experimental conversations. Words in [brackets] were implied in the conversation. When a comment directly addressed a model’s alpha setting, that model’s alpha parameter is included in bold, in parentheses.

E.1 Subject/Model Specific Comments/Assessment

  • (Subject1, Mode: Reveal): The system brought context to the scene.

  • (Subject1, Mode: Neutral): The system was comically literal.

  • (Subject1, Mode: Conceal): This setting was reactive though felt flip-floppy.

  • (Subject2, Mode: Reveal): It is making more sense this time, seems to make more sense deeper in conversation when it has more context.

  • (Subject2, Mode: Conceal): The system is making assumptions and while there is a chance for conflict it prefers to answer questions vaguely than admit guilt.

  • (Subject3, Mode: Neutral): This is a cool conversation, like it wanted to guide the conversation.

  • (Subject4, Mode: Reveal): I am most impressed when it makes big choices that feel ‘right’.

  • (Subject4, Mode: Reveal): The system is good at picking a specific thing and running with it.

  • (Subject4, Mode: Neutral): Felt like I was improvising with an improvisor who had their own ideas and doesn’t want to accommodate or listen.

  • (Subject4, Mode: Conceal): I loved the attention to detail.

E.2 General Interaction Comments/Assessment

  • When I gave things that were specific, it would give me specifics back. It gives you as much as you put in. It is as though you are improvising with yourself.

  • Sometimes there is too much information in the longer offers.

  • It responds and makes offers but they seldom have ‘conflict’, interesting but not ‘heightening’.

  • It is very comfortable narrating.

  • It doesn’t have memory, so it feels like I am following the scene.

  • It has adopted my style of speaking, and my linguistic choices.

  • It seems to enjoy providing names and backstory.

  • It doesn’t know the details I am not providing, it doesn’t know the details I am implying.

  • The offers that the system gives can further the scene.

  • It felt workshoppy, like a good improv tool to practice improv for new improvisors.

  • Sometimes it becomes a narrator, these moments are less fun for me as an improvisor.

  • I don’t know if the system knows how long I want the scene to be.

  • Speed helps in the system because then you are not judging it.

  • The system is not distracted by cheap laughs and references, it stays focused on the topic, it makes you do good improvisation.