Jurassic is (almost) All You Need: Few-Shot Meaning-to-Text Generation for Open-Domain Dialogue

10/15/2021 ∙ by Lena Reed, et al. ∙ University of California Santa Cruz

One challenge with open-domain dialogue systems is the need to produce high-quality responses on any topic. We aim to improve the quality and coverage of Athena, an Alexa Prize dialogue system. We utilize Athena's response generators (RGs) to create training data for two new neural Meaning-to-Text RGs, Athena-GPT-Neo and Athena-Jurassic, for the movies, music, TV, sports, and video game domains. We conduct few-shot experiments, both within and cross-domain, with different tuning set sizes (2, 3, 10), prompt formats, and meaning representations (MRs) for sets of WikiData KG triples, and dialogue acts with 14 possible attribute combinations. Our evaluation uses BLEURT and human evaluation metrics, and shows that with 10-shot tuning, Athena-Jurassic's performance is significantly better for coherence and semantic accuracy. Experiments with 2-shot tuning on completely novel MRs result in a huge performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41 and whose untrue hallucination rate increases to 12%. Experiments with dialogue acts for video games show that with 10-shot tuning, both models learn to control dialogue acts, but Athena-Jurassic has significantly higher coherence and only 4% untrue hallucinations, suggesting it can reliably produce outputs of high enough quality for live systems with real users. To our knowledge, these are the first results demonstrating that few-shot tuning on a massive language model can create NLGs that generalize to new domains, and produce high-quality, semantically-controlled, conversational responses directly from MRs and KG triples.


1 Introduction

One challenge with open-domain dialogue systems is the need to respond to users’ utterances on any topic with high-quality responses. To handle this challenge, a common approach is to use an ensemble of response generators (RGs) and then train a ranker to select from a pool of possible responses curry2018alana; shalyminov2018neural; harrison2020athena; paranjape2020neural; chen2018gunrock; fang2018sounding18; gabriel2020further. The ensemble of RGs can use a variety of generation techniques. One type of RG generates responses directly from the dialogue context, using a pre-trained model such as GPT-2 that is possibly tuned on additional conversational data, such as Topical Chat or Empathetic Dialogues gopalakrishnan2019topical; paranjape2020neural. Knowledge-grounded response generation is a variant of this approach, where knowledge in the form of text is available during generation to control the utterance content and veracity hedayatnia2020policy; yavuz2019deepcopy. Template-based RGs are also quite common, where templates are hand-written and then populated with values from a database or knowledge graph (KG). A final type is Meaning-to-Text (M2T) RGs trained from parallel corpora of meaning representations (MRs) and reference utterances (also sometimes called semantically-conditioned natural language generation). These M2T engines are typically used for task-oriented responses on specific (sets of) topics, such as restaurants, sports, movies, hotels, laptops, TVs, and video games, inter alia gardent2017creating; tran2017neural; KedzieMcKeown19; zhao2020bridging; colin2016webnlg; zhou2020design; oraby2019curate; reedetal20; harrison2019maximizing; Novikovaetal17; walker2018exploring; eric2019multiwoz; budzianowski2018multiwoz; cervone2019natural. Our aim in this paper is to improve the quality and the coverage of the M2T RGs in Athena, a dialogue system that has been a finalist in the Alexa Prize for the last two years harrison2020athena; patilathena2021.

A1: Do you have a favorite musician?
U1: yeah, taylor swift.
A2: Ok, Taylor Swift. Wow! Taylor Swift is very prolific! She has 114 songs, that’s a lot!
(Taylor Swift, numTracks, 114)
U2: yeah, that is a lot
A3: I like Taylor Swift’s song, 22. It came out in 2013. Do you like that song?
(Taylor Swift, song, 22), (22, date, 2013)
U3: yeah, it’s a good one.
A4: Right? This is interesting, Taylor Swift sings the song Bad Blood with Kendrick Lamar, want to hear more about Kendrick Lamar?
(Bad Blood, performer, Taylor Swift), (Bad Blood, performer, Kendrick Lamar)
U4: not really
A5: One of my favorite musicians is P!nk. Isn’t this interesting, P!nk won a Grammy Award for Best Pop Collaboration with Vocals. Do you want to hear more?
(P!nk, award, Grammy Award for Best Pop Collaboration with Vocals)
Figure 1: A conversation about music between Athena (A) and a user (U). The music KG-RG interweaves template-based utterances based on KG triples with other template-based utterances. Response segments generated using the KG triples are highlighted in orange. For each segment, the KG triples used by the RG as the input MR are shown below the response.

There are two types of M2T RGs in Athena. One set is template-based, with MRs that are sets of triples from the WikiData knowledge graph (KG-RGs). These KG-RGs cover the movies, sports, music, and TV topics. One advantage of using WikiData is the automatic updates to its knowledge by its editors. Another benefit is the ability to traverse the KG to new relations or entities, in order to say more things about a topic in a dialogue moon2019opendialkg. A third benefit is that Athena represents named entities and pronouns with their WikiData IDs, providing a direct link to the KG patilathena2021. Figure 1 provides a conversation with the Music KG-RG that shows how the knowledge triples from WikiData are realized with templates. (In accordance with the Alexa Prize rules, the conversations shared in Figure 1 and Figure 3 are between Athena and our team or UCSC undergraduates, rather than real users.) The other RG, for the video games topic, is based on the Viggo corpus juraska2019viggo, a parallel corpus of MRs and crowd-sourced realizations. A conversation with the Viggo RG is shown in Figure 3 below.

Both of these RGs require substantial human effort. The KG-RGs require two steps: (1) sets of interesting and frequently populated KG relations must be selected by hand moon2019opendialkg; patilathena2021, and (2) templates must be hand-written to realize them. This means that they currently cover a limited set of relations, namely those that are populated frequently enough to make writing templates worthwhile. As previous work on dialogue generation has shown, even combinations of existing relations typically require multiple additional templates to be written rambow2001evaluating; rambow2001natural; walker2002automatically. Table 1 lists the existing KG-RG entities and relations, as well as the novel KG-RG relations and entities that we experiment with below using 2-shot tuning.

Topic | Entities | Relations
Movies | Movies, Actors, Directors, Awards* | cast, voiceCast, spouse, childrenNum, genre, award, director*, work*, date*, screenWriter*, producer*
Music | Musicians, Bands, Awards*, Songs*, Albums* | performer (song and album), numTracks, genre, award, memberOf, instrument, label, date*, show*, work*
Sports | Athletes, Sports, Awards* | team, position, participant (tournament, leagues), spouse, childrenNum, award, height, date*, work*, ranking*, duration*, reviewScoreBy*, disciplineCompetedIn*, numMatches*, numAwards*, draftedBy*, draftPicknum*, startTime*
TV | Shows, Assessment* | characterRole*, cast, role, creator, director, genre, award, narrativeLocation*, mainSubject*, assessment*, assessmentOutcome*, hasPart*, occupation*, derivativeWork*, startTime*, endTime*, filmingLocation*, setInPeriod*, numSeasons*, numEpisodes*
Table 1: The KG topics, entities and relations in the Athena-KG-Synthetic corpus. A * indicates novel entities and relations, not in the training corpus, that are also tested in Section 4.

The neural Viggo RG required constructing dialogue acts and meaning representations, populating their content with information from IGDB, crowd-sourcing responses, and cleaning the resulting corpus, in order to experiment with methods for improving the fluency and semantic accuracy of responses juraska2019viggo; juraskawalker21. Thus the ability to reliably generate high-quality responses directly from MRs via neural NLGs would transform the use of M2T NLGs in dialogue systems colin2016webnlg; Lebretetal16; WisemanSR17; eric2019multiwoz; rastogi2020towards; zhao2020bridging.

Topic Example
Movies (Wonder Woman, director, Patty Jenkins)
I believe I read that Wonder Woman is directed by Patty Jenkins. Sometimes a director’s unique voice really comes through in their work. Do you think that happened in this case?
Music (Rihanna, record label, Def Jam Records)
Here’s another musician who worked for the same label Def Jam Records, called Rihanna. Want to hear about them?
Sports (Lebron James, member of sports team, Los Angeles Lakers), (Lebron James, position played on team/specialty, power forward)
Lebron James has played on many famous teams such as the Los Angeles Lakers, and played many positions like power forward.
TV (Lost, genre, paranormal television program), (Lost, genre, drama television)
Lost is considered both a paranormal television program and a drama television. What’s your opinion of paranormal television programs or drama television shows?
Figure 2: Sets of KG triples used as meaning representations, and the corresponding utterances. Dialogue Act types are not represented in the MRs but note that some KG triples are realized with multiple dialogue acts, combining statements, opinions and questions.

We utilize Athena’s current RGs to create training data for two new neural Meaning-to-Text RGs, Athena-GPT-Neo and Athena-Jurassic black2021gpt; lieberetal21. We conduct few-shot experiments, where we systematically vary within- and cross-domain training, tuning set sizes (2, 3, 10), prompt formats, and meaning representations. We expect these NLGs to generalize beyond their training data radford2019language; ham2020end; budzianowski2019hello; pang-etal-2020-towards. We evaluate the results using both bleurt and human evaluation. Our results show that, with 10-shot training, both Athena-GPT-Neo and Athena-Jurassic generally produce coherent outputs, but that for within-domain experiments, Athena-Jurassic’s performance is significantly better for the human evaluation metrics of coherence and semantic accuracy. Experiments with 2-shot tuning on completely novel MRs result in a huge performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41 and whose untrue hallucinations increase to 12%. Experiments with the Viggo video games corpus show that, with 10-shot training, both Athena-GPT-Neo and Athena-Jurassic can learn to control the dialogue acts realized, but Athena-Jurassic has significantly higher coherence, mainly because Athena-GPT-Neo produces some redundant and repetitive utterances. Again, Athena-GPT-Neo produces untrue hallucinations in 12% of its outputs. We use the human evaluation to examine whether the bleurt scores are meaningful, with results showing that bleurt scores correlate well with semantic accuracy, but not with coherence or the other human metrics. Our results suggest that Athena-Jurassic reliably produces outputs of high enough quality to be used in live systems with real users. To our knowledge, this is the first paper to show that few-shot tuning on a large language model can create NLGs that generalize to new domains and produce high-quality, semantically-controlled, conversational responses directly from MRs and KG triples.

2 Datasets

We use two datasets to train neural RGs for Athena. One is based on Athena’s KG RGs for the domains of music, movies, sports, and TV. Figure 2 provides KG triples (entity, relation, entity) and responses produced by the current KG-RGs. Each response is designed to continue the conversation with one or more statements, opinions, and/or question dialogue acts about an entity. However, the MRs represent the triples and do not explicitly represent dialogue acts. Some responses are based on traversing the KG links to a related entity, and introducing that entity, as illustrated in the music example about Rihanna, where the record label Def Jam Records is used to link two music artists. A music conversation was given in Figure 1 above. Both existing and novel KG-RG entities and relations are in Table 1.

A1: …Well, one of my all-time favorites is Control. I have already completed it twice! Have you played it by any chance?
U1: no not yet
A2: Obviously. There are just too many fun games for one person to play in a single lifetime. At any rate, Control is an excellent third-person action-adventure shooter with multiplayer. It was released in 2019 by Remedy Entertainment for PC, PlayStation, and Xbox. Do you think you would potentially like it?
inform(name[Control], release_year[2019], developer[Remedy Entertainment], rating[excellent], genres[shooter, action-adventure], has_multiplayer[yes], player_perspective[third person], platforms[PC, PlayStation, Xbox])
U2: absolutely i was actually hoping to play it sometime soon
A3: Alright! Have you played any adventure games by Remedy Entertainment, like Alan Wake?
suggest(name[Alan Wake], developer[Remedy Entertainment], genres[adventure])
U3: yes i did i loved the dark atmosphere in alan wake
A4: Yeah! I remember playing Alan Wake some time ago, and it wasn’t bad at all. Do you like multiplayer games?
request_attribute(has_multiplayer[])
Figure 3: A conversation between Athena(A) and a user (U) about video games, exhibiting a smooth interweaving of neural and template-based responses. Response segments generated by the neural model are highlighted in orange. For each such segment, the MR constructed by the RG as input for the model is shown below the response.

To train new neural KG-RGs, we create a new corpus, Athena-KG-Synthetic, of sets of knowledge triples and their template-based responses. (Viggo and the Athena-KG-Synthetic corpus are available from nlds.soe.ucsc.edu.) We select five template categories and their paraphrases from the movies RG, two from music, three from sports, and two from TV. We query WikiData for thousands of KG triples to populate the templates and then split the resulting dataset into 32K training instances, 3,558 development instances, and a test set of 100 instances for each template category.
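As a rough illustration of this pipeline (not the exact queries or templates used), the sketch below pulls (film, director, person) triples from the public WikiData SPARQL endpoint and fills the movies template shown in Figure 2; the SPARQL query, property choices, and template wording are our own assumptions.

```python
# Sketch: populate a hypothetical movies template with WikiData triples.
# The SPARQL query, template wording, and property choice (P57 = director)
# are illustrative assumptions, not the exact pipeline used for the corpus.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?filmLabel ?directorLabel WHERE {
  ?film wdt:P31 wd:Q11424 ;        # instance of: film
        wdt:P57 ?director .        # director
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

TEMPLATE = ("I believe I read that {film} is directed by {director}. "
            "Sometimes a director's unique voice really comes through in their work. "
            "Do you think that happened in this case?")

def build_instances():
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "athena-kg-demo/0.1"})
    rows = resp.json()["results"]["bindings"]
    instances = []
    for row in rows:
        film = row["filmLabel"]["value"]
        director = row["directorLabel"]["value"]
        mr = f"({film}, director, {director})"                 # linearized KG triple
        text = TEMPLATE.format(film=film, director=director)   # templated reference
        instances.append((mr, text))
    return instances

if __name__ == "__main__":
    for mr, text in build_instances()[:3]:
        print(mr, "->", text)
```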

The second dataset is the Viggo dataset. The Viggo RG combines responses generated from templates with those generated from meaning representations, as shown in Figure 3. The set of dialogue acts (DAs) is carefully constructed to be conversational and engage the user, rather than being purely informative juraska2019viggo; juraskawalker21. We directly use the ViGGO corpus’s training, development, and test sets. Each corpus instance uses one of Viggo’s 9 dialogue acts, such as verify attribute, request explanation, or recommend. Most DAs are compatible with many combinations of content slots, drawn from 14 video game attributes, yielding hundreds of response types juraskawalker21. Figure 4 shows four DAs with various slot combinations.

confirm(name [Hellblade: Senua’s Sacrifice], release_year [2017], developer [Ninja Theory])
Oh, do you mean the 2017 game from Ninja Theory, Hellblade: Senua’s Sacrifice?
suggest(name [Half-Life 2], genres [shooter], player_perspective [first person])
Do you also enjoy playing first-person shooters, such as Half-Life 2?
give_opinion(name [SpellForce 3], rating [poor], genres [real-time strategy, role-playing], player_perspective [bird view])
I think that SpellForce 3 is one of the worst games I’ve ever played. Trying to combine the real-time strategy and role-playing genres just doesn’t work, and the bird’s eye view makes it near impossible to play.
verify_attribute(name [Little Big Adventure], rating [average], has_multiplayer [no], platforms [PlayStation])
I recall that you were not that fond of Little Big Adventure. Does single-player gaming on the PlayStation quickly get boring for you?
Figure 4: Viggo structured MRs (gray rows) and the corresponding reference utterances (with slot mentions in bold). Dialogue Act types are indicated in italics at the beginning of the MRs.
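To make the structure of these MRs concrete, the following is a small parsing sketch (our own illustration, not code from the ViGGO release) that splits an MR like those in Figures 3 and 4 into its dialogue act and slot values.

```python
# Sketch: parse a Viggo-style MR string into (dialogue_act, {slot: value}).
# Illustrative only; not taken from the ViGGO corpus tooling.
import re

def parse_viggo_mr(mr: str):
    """Split e.g. "suggest(name [Alan Wake], developer [Remedy Entertainment])"
    into ("suggest", {"name": "Alan Wake", "developer": "Remedy Entertainment"})."""
    head, _, body = mr.partition("(")
    dialogue_act = head.strip()
    slots = {}
    # Each slot looks like: slot_name [value]; empty brackets yield an empty string.
    for name, value in re.findall(r"(\w+)\s*\[([^\]]*)\]", body):
        slots[name] = value.strip()
    return dialogue_act, slots

if __name__ == "__main__":
    da, slots = parse_viggo_mr(
        "suggest(name [Alan Wake], developer [Remedy Entertainment], genres [adventure])")
    print(da)     # suggest
    print(slots)  # {'name': 'Alan Wake', 'developer': 'Remedy Entertainment', 'genres': 'adventure'}
```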

3 Experimental Setup

We utilize the models GPT-Neo and Jurassic-1 Jumbo black2021gpt; lieberetal21. GPT-Neo is a transformer-based language model that is available on GitHub in sizes of up to 2.7 billion parameters. GPT-Neo was created as an open-source alternative to GPT-3. Like the earlier GPT-2 and GPT-3 models, GPT-Neo predicts the next word given the previous words in the text. The team from EleutherAI generated an open-source dataset, The Pile gao2020pile, comparable to what OpenAI used for its GPT models. The dataset is 825GB and contains data from 22 diverse sources, such as academic sources (ArXiv, PubMed), GitHub, and Wikipedia. GPT-Neo has a vocabulary size of approximately 50K tokens. The EleutherAI team provides three models (125M, 1.3B, 2.7B), trained as masked auto-regressive models using cross-entropy loss. When compared to the closest GPT-3 model (GPT-3 Ada), GPT-Neo 2.7B had better performance on all linguistic and scientific reasoning benchmarks (HellaSwag, PIQA, WinoGrande, MathQA, PubMedQA). We use GPT-Neo 1.3B, which has promising performance for its size. (Experiments with GPT-2-small showed that models trained on the 32K training set did not generalize to unseen relations within the training domain, such as from the director to the screenwriter relation, nor did these models generalize across domains.)
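To make the setup concrete, the sketch below shows one way to run few-shot generation with the public GPT-Neo 1.3B checkpoint via the Hugging Face transformers library; the prompt reuses the two S2S-style tuning examples that also appear in Figure 6 below, and the decoding settings (sampling at temperature 0.7, mirroring the Jurassic-1 setting described below) are assumptions rather than the exact configuration used in our experiments.

```python
# Sketch: few-shot prompting with the public GPT-Neo 1.3B checkpoint
# (Hugging Face transformers). The prompt content, stop handling, and
# temperature value are illustrative assumptions, not the exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Two S2S-style tuning examples followed by the test MR on its own line.
prompt = (
    "Starship = song = We Built This City — We Built This City = genre = pop rock\n"
    "Starship plays pop rock like the song We Built This City. Do you like that genre?\n"
    "\n"
    "Scream = cast member = Liev Schreiber\n"
    "Liev Schreiber was really good in Scream, don't you agree?\n"
    "\n"
    "Wonder Woman = director = Patty Jenkins\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # sampling rather than greedy decoding
    temperature=0.7,         # promotes varied output
    max_new_tokens=60,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
# Keep only the first generated line as the response for the test MR.
print(completion.strip().split("\n")[0])
```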

[prompt]: confirm = yes — name = Tony Hawk’s Pro Skater 3 — release_year = 2001 — genres = sport
[sentence]: Gotcha! So you’re referring to the Tony Hawk’s Pro Skater 3 sports game, which was released in 2001?
Figure 5: Input format similar to QA

Jurassic-1 is also an auto-regressive transformer-based language model, which achieves state-of-the-art performance on a set of common-sense and QA zero-shot and few-shot tasks lieberetal21; zellers2019hellaswag; sakaguchi2020winogrande. AI21 Labs has released two versions, J1-Large with 7.5B parameters and J1-Jumbo with 178B parameters. Jurassic-1 is pre-trained on 300B tokens taken from publicly available resources, and has a larger vocabulary than other similar models, with 250K tokens, achieved by including n-gram phrases as tokens along with the standard unigram and subword tokens. Jurassic-1’s architecture also attempts to optimize the depth-width tradeoff levine2020depth; lieberetal21. The paper claims that Jurassic-1 can predict text from a broader set of domains than GPT-3, and is superior to GPT-3 in few-shot settings, due to its ability to fit more examples into a prompt. We use temperature 0.7 to promote interesting and varied output; the effect of temperature is illustrated by the multiple outputs in Figure 6.

Starship = song = We Built This City — We Built This City = genre = pop rock
Starship plays pop rock like the song We Built This City. Do you like that genre?
Scream = cast member = Liev Schreiber
Liev Schreiber was really good in Scream, don’t you agree?
name=Babbo — eatType = bistro — food = French — customerRating = outstanding
- Babbo’s bistro and restaurant serves French cuisine. The food is outstanding, according to customer reviews.
- Babbo is an outstanding French bistro in NY. Do you like French food?
- Babbo is a bistro. French food is the best. Babbo has outstanding food.
- Babbo restaurant is rated outstanding. I heard they serve great food.
Figure 6: S2S Input format. Jurassic example, generating multiple outputs (in italics) with no in-domain training.

We have two large training sets (Section 2), but we focus on GPT-Neo and Jurassic-1 few-shot (2, 3, 10) experiments with two tuning formats, since format matters for Jurassic-1 lieberetal21. In the QA format in Figure 5, the tuning instances consist of input MRs marked as the prompt and the response marked as the sentence. The S2S format, shown at the top of Figure 6, simply separates the MR and text into two lines, with an empty line separating instances. In Figure 6, the 2-shot tuning prompts are from the music and movies domains, and the test item is from the restaurant domain. We generate multiple outputs, shown in italics, to illustrate the effect of temperature. All of the outputs are natural and coherent. Only the last output, Babbo restaurant is rated outstanding. I heard they serve great food, fails to realize all the MR attributes, missing eatType=bistro and food=French.
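A minimal sketch of how the two prompt formats can be assembled from (MR, reference) tuning pairs is shown below; the exact field labels and spacing are assumptions based on Figures 5 and 6.

```python
# Sketch: assemble few-shot tuning prompts in the two formats compared here.
# Field labels and whitespace conventions are assumptions based on Figures 5 and 6.
def qa_prompt(examples, test_mr):
    """QA format: each instance is a [prompt]/[sentence] pair (Figure 5)."""
    lines = []
    for mr, text in examples:
        lines.append(f"[prompt]: {mr}")
        lines.append(f"[sentence]: {text}")
    lines.append(f"[prompt]: {test_mr}")
    lines.append("[sentence]:")
    return "\n".join(lines)

def s2s_prompt(examples, test_mr):
    """S2S format: MR and text on adjacent lines, blank line between instances (Figure 6)."""
    blocks = [f"{mr}\n{text}" for mr, text in examples]
    blocks.append(test_mr)
    return "\n\n".join(blocks)

examples = [
    ("Starship = song = We Built This City — We Built This City = genre = pop rock",
     "Starship plays pop rock like the song We Built This City. Do you like that genre?"),
    ("Scream = cast member = Liev Schreiber",
     "Liev Schreiber was really good in Scream, don't you agree?"),
]
print(s2s_prompt(examples,
                 "name=Babbo — eatType = bistro — food = French — customerRating = outstanding"))
```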

The 2-shot experiments are intended to create a challenging task for testing the models’ ability to generalize. In addition to 2-shot tuning with the two examples in Figure 6, the test set consists of novel hand-crafted MRs that are currently not in Athena, which in some cases also use rare relations. The goal is to test how well the models do at realizing responses directly from the WikiData KG, without any domain-specific or relation-specific training. Figure 6 illustrates a good case of generalization to the restaurant domain. Table 1 indicates with a * those entities and relations corresponding to the novel MRs in our test set, and example novel MRs for each topic domain are in Figure 7.

ID | Topic | Novel MR (novel relations shown in the triples)
M1 | Movies | (Despicable Me, screen writer, Cinco Paul)
M2 | Music | (The Beach Boys, song, Cotton Fields), (Cotton Fields, date, 1970)
M3 | TV | (Desperate Housewives, narrative location, Fairview)
M4 | Sports | (Muhammad Ali, significant event, lighting the Olympic cauldron), (lighting the Olympic cauldron, of, 1996 Summer Olympics)
Figure 7: Sample Novel Test Meaning Representations used to test 2-shot tuning

For evaluation metrics, we use bleurt along with human evaluation of the following metrics: (1) coherence: the output makes sense and is natural; (2) semantic accuracy: triples realized divided by total triples for the KG-RGs, and attributes realized divided by total attributes for Viggo; (3) good hallucinations: additional true information, not specified in the MR, is added to the utterance from the LM’s own knowledge; (4) bad hallucinations: additional false information is added to the utterance from the LM’s own knowledge; (5) dialogue act accuracy: whether the output utterance matches the dialogue act specified for Viggo, as exemplified in the outputs in Figure 4; and (6) question added: whether a question that was not specified in the MR or by the dialogue act is added to the end of the response, as seen in the second example output in Figure 6. Remember that no dialogue acts are specified by the MRs for the Athena KG-RGs, but that some of the Viggo dialogue acts, such as suggest, are typically realized as questions or include a question. For the 2-shot experiments with the novel MRs, there are no reference utterances and bleurt scores cannot be calculated, so we use only the human evaluation metrics.
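As a concrete reading of metric (2), the small helper below tallies semantic accuracy as realized MR items over total MR items; the annotation record format is our own illustration, and whether the score is micro- or macro-averaged across outputs is an implementation choice (this sketch micro-averages).

```python
# Sketch: compute semantic accuracy as realized items / total items in the MR.
# The annotation record format is an illustrative assumption.
def semantic_accuracy(annotations):
    """annotations: list of dicts with 'total' MR items and 'realized' count."""
    realized = sum(a["realized"] for a in annotations)
    total = sum(a["total"] for a in annotations)
    return realized / total if total else 0.0

# Example: the M1 output in Figure 8 realizes 2 of 3 triples,
# and the M3 output realizes 3 of 3.
outputs = [
    {"id": "M1", "total": 3, "realized": 2},
    {"id": "M3", "total": 3, "realized": 3},
]
print(round(semantic_accuracy(outputs), 2))  # 0.83
```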

It is important to note that bleurt scores by themselves are not intended to mean anything: they are only useful for comparing models sellam2020bleurt. In addition, bleurt, like other n-gram scoring metrics, does not account for stylistic variation, which is often desirable oraby2019curate; harrison2019maximizing. Also, previous work shows that the correlation of bleurt to human ratings of naturalness varies across conversational domains yeh2021comprehensive. However, that work was based on crowd-sourced open-domain dialogues where both sides of the dialogue were produced by humans. Here it might be expected that bleurt would be a good predictor of semantic accuracy. Therefore we use bleurt as a first indicator of a model’s performance, and use bleurt scores to decide whether to perform human evaluation on a model’s output. Then we examine whether the bleurt scores are highly correlated with the human metrics for coherence and semantic accuracy.

4 Experimental Results

We report results for all the KG-RG topics and for Viggo, with both GPT-Neo and Jurassic-1. The models were also trained and tested with both the QA format in Figure 5 and the S2S format in Figure 6. For the KG-RG topics, we also experiment with all possible cross-domain combinations of training and test sets.

Table 2: bleurt scores for testing within and across the tuning domain for Athena-Jurassic and Athena GPT-Neo. Inputs in S2S/QA format, tuning on 10 instances of each topic.

Few-Shot Knowledge-Graph Response Generation. For each topic (movies, music, sports, TV), we randomly select ten instances for training and 50 for testing (200 test items in total). We tune Jurassic-1 and GPT-Neo on each training set and then test each model on all four topics to examine both within- and cross-domain few-shot performance. Table 2 provides the bleurt results for both Athena-GPT-Neo and Athena-Jurassic and for both S2S and QA formats. Rows indicate the training domain, while columns indicate test domains. The diagonal of each subtable reports within-domain performance. The average bleurt scores over all topics for each training set are in the last column of each subtable, and averages for each input format (S2S or QA) are also included.

As expected, the within-domain results (highlighted in yellow) show that the models perform best when tuned on their own domain. The best results for in-domain training are for sports, with an average bleurt score of 0.23 for the S2S format for Jurassic, 0.26 for the S2S format for GPT-Neo, and 0.21 for the QA format for GPT-Neo. The performance for within-domain testing on the TV domain is also good, with a score of 0.22 for the QA format for GPT-Neo and a score of 0.17 for Jurassic in the S2S format. Interestingly, a specific topic’s training data sometimes performs as well as or better on another topic than on its own (highlighted in turquoise), e.g., GPT-Neo S2S trained on TV performs better on movies than on TV, and Jurassic QA, when trained on music, performs better on TV. This could arise because two domains are similar (TV and movies) or because one domain is easier, e.g., the averages across the columns of each section suggest that TV is easier.

The averages also clearly indicate that, for Jurassic, the S2S format works better, with large differences across all topic columns and topic diagonals, and an overall S2S average of -0.34 compared to -0.47 for QA (p < .01). For GPT-Neo, the overall difference between S2S (-0.33) and QA (-0.37) is not significant, and the story is more complex because GPT-Neo QA works well for both TV (0.22) and sports (0.21). The differences between S2S and QA are not significant for TV or movies, but GPT-Neo S2S is significantly better than GPT-Neo QA for music and sports.

A comparison of bleurt scores for S2S for Jurassic vs. GPT-Neo for each training topic shows that GPT-Neo is significantly better for movies (p = .007), Jurassic is significantly better for music (p = .005), GPT-Neo shows a trend toward being better for TV (p = .07), and there are no differences for sports (p = .87). However, a paired t-test comparing bleurt scores across all topics for both GPT-Neo and Jurassic shows that overall, the differences are not significant.
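For reference, the paired comparison described above corresponds to a standard paired t-test over per-item bleurt scores, as in the sketch below; the score arrays are invented placeholders, not our actual results.

```python
# Sketch: paired t-test comparing per-item BLEURT scores of two models.
# The score arrays here are invented placeholders, not actual results.
from scipy import stats

bleurt_jurassic = [-0.31, -0.42, -0.28, -0.55, -0.36, -0.40]
bleurt_gpt_neo  = [-0.35, -0.39, -0.30, -0.52, -0.41, -0.33]

t_stat, p_value = stats.ttest_rel(bleurt_jurassic, bleurt_gpt_neo)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```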

Since the overall differences between GPT-Neo S2S and GPT-Neo QA are not significant, we focus the human evaluation on comparing Athena-Jurassic to Athena-GPT-Neo with the S2S format. This allows us to directly compare the human metrics for the two models while the prompt format is held fixed. We restrict the annotation to the within-domain testing. We sampled 30 of the 50 test examples for each topic (240 examples in total across the two models). Three experts familiar with Athena labeled each output for coherence, semantic accuracy, good and bad extra information (hallucinations), and whether a question was added to the end of the response (remember that no dialogue acts were specified in the Athena-KG MRs). We also counted the number of words in each output to measure some aspects of the style of the outputs.

Table 3: Human Metrics for GPT-Neo vs. Jurassic per Topic.

Table 3 presents the results for the human metrics, showing that the average coherence (Coher) for Athena-GPT-Neo is significantly lower than Athena-Jurassic (p = .002), as well as the semantic accuracy (SemAcc) (p = .002). Athena-GPT-Neo hallucinates much more (total 33%) than Athena-Jurassic (total 18%). Although 22% of Athena-GPT-Neo’s hallucinations are good, 11% of the utterances include at least one bad hallucination, in addition to failing to realize the semantic triples included in the MRs.

Both models frequently add questions, which can be considered as either a feature or a bug. They are usually quite natural, but the Athena KG MRs do not specify dialogue acts. Thus, there is no way to control whether the questions appear or not. The models use a similar number of words. Still, given the low semantic accuracy of Athena-GPT-Neo, many words in the GPT-Neo outputs must be due to hallucinations.

The human evaluation suggests large differences between Athena-GPT-Neo and Athena-Jurassic in terms of output quality, but there were no significant differences in overall bleurt scores between GPT-Neo S2S and Jurassic S2S. We therefore test whether bleurt scores are meaningful proxies for the human metrics of coherence and semantic accuracy. The correlations across all topics for each model indicate that bleurt scores do correlate with semantic accuracy, with a correlation of 0.65 for Athena-GPT-Neo and 0.46 for Athena-Jurassic. Coherence is not correlated with bleurt, with correlations of 0.02 for Athena-GPT-Neo and -0.05 for Athena-Jurassic.
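Once per-output bleurt scores are aligned with the human annotations, such correlations are straightforward to compute; a minimal sketch, assuming Pearson correlation and using invented placeholder values:

```python
# Sketch: Pearson correlation between per-output BLEURT scores and a human metric.
# The aligned score lists below are invented placeholders.
from scipy.stats import pearsonr

bleurt_scores     = [-0.10, -0.45, 0.05, -0.60, -0.25]
semantic_accuracy = [1.0, 0.5, 1.0, 0.33, 0.67]   # triples realized / total, per output

r, p = pearsonr(bleurt_scores, semantic_accuracy)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```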

ID | Topic | Input MR | Output
Athena-GPT-Neo
M1 | Movies | (Peter Capaldi, award, BAFTA Award for Best Short Film), (BAFTA Award for Best Short Film, show, 47th British Academy Film Awards), (BAFTA Award for Best Short Film, work, Franz Kafka’s It’s a Wonderful Life) | I think it’s really great when a talented actor wins an award. do you think Peter Capaldi deserved to win a BAFTA Award for Best Short Film in 1980, for Franz Kafka’s It’s a Wonderful Life?
M2 | Music | (Kellie Pickler, song, Red High Heels), (Red High Heels, genre, country music) | Kellie Pickler is a country singer, and she’s also a rapper. Do you know her songs?
Athena-Jurassic
M3 | TV | (Saturday Night Live, award, Primetime Emmy Award for Outstanding Variety Sketch Series), (Saturday Night Live, date, 2019) | Saturday Night Live won a Primetime Emmy Award for Outstanding Variety Sketch Series in 2019. How does the fact that it got this award affect your opinion of the show?
M4 | Sports | (Len Ford, member of sports team, Los Angeles Dons), (Len Ford, position played on team, end) | Did you know that Len Ford has played as a part of famous teams, such as the Los Angeles Dons, and played positions such as end.
Figure 8: Jurassic-1 and GPT-Neo outputs for within-domain tuning with 10 instances.

Figure 8 shows examples for the two S2S models for each domain when tuned on within-domain instances, which illustrate the strengths and weaknesses between models. The Athena-GPT-Neo output for M1 was labeled a 3 for coherence. However, it leaves out the triple (BAFTA Award for Best Short Film, show, 47th British Academy Film Awards). It also includes the bad hallucination that Peter Capaldi is an actor, when in fact he wrote and directed the film. In addition, the 47th British Academy Film Awards honored the best films of 1993, so Peter Capaldi won this award in 1994, not in 1980. The semantic accuracy annotation indicates that 2/3 triples are correct, the output includes two bad hallucinations, and the output includes a question. Similarly, the GPT-Neo output for M2 shows that GPT-Neo knows that Kellie Pickler is a rapper, knowledge that was not included in the MR. This was hand-annotated as a good hallucination. However, this output fails to realize the triple (Kellie Pickler, song, Red High Heels), so semantic accuracy was 1/2 triples.

(Starship = song = We Built This City — We Built This City = genre = pop rock)
Starship plays pop rock like the song We Built This City. Do you like that genre?
(Planet of the Apes = cast member = Felix Silla)
I heard Felix Silla starred in a good movie, called Planet of the Apes.
Figure 9: Two tuning instances used to prompt Jurassic-1 for 2-Shot Novel MR Experiments

The Athena-Jurassic output for M3 was labeled as a 3 for coherence, and it includes a question. The output correctly realizes all the triples, so it was marked as semantically perfect (3/3 triples realized). The output for M4 is also labeled as a 3 for coherence. It also correctly realizes all the triples (2/2), which are realized as a Did you know question. This output would not be annotated as including an additional question, since the material in the Did you know question is part of the specified content in the MR.

2-Shot Training on Novel Entities and Relations.

Table 4: Human evaluation for 2-shot Novel Athena-Jurassic vs. Athena-GPT-Neo, trained with S2S format.

We also performed 2-shot experiments using the two tuning instances for movies and music in Figure 9. Because realizing each relation or set of relations requires a template to be written for Athena’s current KG-RGs, Athena has no templates for relations that are sparsely populated. Thus, we test 80 MRs composed of entities, relations, or combinations of relations that are novel to Athena, as indicated by a * in Table 1. We only tune with the S2S format, since the results in Table 2 show that the S2S format is clearly better for Jurassic and there are no differences for GPT-Neo.

ID | Domain | Input MR | Output
Athena-GPT-Neo
N1 | Movies | (The Black Cauldron, genre, film based on literature) | The Black Cauldron is a film based on the novel of the same name by Stephen King.
N2 | Music | (Dua Lipa, award, Grammy Award for Best New Artist), (Grammy Award for Best New Artist, show, 61st Annual Grammy Awards) | Dua Lipa is a singer-songwriter from the Philippines. She is known for her hit songs “I’m Not the One” and “I’m Not the One (Remix)”.
N3 | TV | (Psych, cast member, James Roday), (James Roday, character role, Shawn Spencer) | I heard James Roday starred in a good movie, called Psych.
N4 | Sports | (Tom Brady, award, Associated Press Athlete of the Year), (Associated Press Athlete of the Year, date, 2007) | Tom Brady is the most valuable player in the NFL. He is also the most valuable player in the NFL.
Athena-Jurassic
N5 | Movies | (Ghost Rider, producer, David S. Goyer) | Have you heard of a movie called Ghost Rider? I heard it was produced by David S. Goyer.
N6 | Music | (Adele, award, Grammy Award for Song of the Year), (Grammy Award for Song of the Year, show, 54th Annual Grammy Awards), (Grammy Award for Song of the Year, work, Rolling in the Deep) | Adele won an award for Song of the Year at the 54th Annual Grammy Awards. Do you like her song Rolling in the Deep?
N7 | TV | (Elementary, assessment, Bechdel test), (Bechdel test, assessment outcome, passes) | Elementary passes the Bechdel test because at least two of the named women talk to each other about something other than a man.
N8 | Sports | (Usain Bolt, record held, Men’s 100 metres world record progression), (Usain Bolt, race time, 9.58 seconds), (Usain Bolt, start time, 2009) | Usain Bolt holds the record for the men’s 100 metres. He broke the previous record, which was 9.69 seconds, at the 2009 World Championships in Berlin, Germany.
Figure 10: Sample outputs for novel unseen relations and entities for both Jurassic-1 and GPT-Neo

Since there are no reference utterances for the novel MRs, bleurt scores cannot be calculated. Table 4 provides the results of the human evaluation, which indicate stark differences between Athena-GPT-Neo and Athena-Jurassic, reflecting the reduced training data and the rareness of the KG relations in the novel MRs. The novel test MRs are also harder for both types of models, but the performance of Athena-GPT-Neo declines more sharply. Compare the mean coherence of 2.58 and semantic accuracy of 0.41 for Athena-GPT-Neo in Table 4 to the mean coherence of 2.83 and semantic accuracy of 0.77 for Athena-GPT-Neo in Table 3. Then compare the mean coherence of 2.80 and semantic accuracy of 0.72 for Athena-Jurassic in Table 4, to the mean coherence of 2.96 and a semantic accuracy of 0.89 for Athena-Jurassic in Table 3. When testing with the novel MRs, Athena-GPT-Neo only adds questions 7% of the time, but Athena-Jurassic maintains a higher level and adds questions 47% of the time. When testing with the novel MRs, GPT-Neo hallucinates bad information in 12% of turns, while Jurassic only does so in 4% of turns. This supports the claim by Lieber et al. that Jurassic should generalize better lieberetal21 .

Figure 10 provides novel MRs and outputs for all four topics for both Athena-GPT-Neo and Athena-Jurassic, which illustrate the differences between the models. In N1, Athena-GPT-Neo produces a coherent and semantically accurate utterance about The Black Cauldron, but it includes a bad hallucination of Stephen King as the author, when Lloyd Alexander is the author. In N2, Athena-GPT-Neo again produces a coherent utterance, but the content of that utterance does not include any of the triples in the MR, only matching the name of the singer, Dua Lipa. In N3, the output is again coherent, but it fails to realize the triple (James Roday, character role, Shawn Spencer). Example N4 illustrates how GPT-Neo sometimes produces redundant or logically inconsistent outputs: here it says the same thing about Tom Brady twice, and in other cases it repeats itself many times, e.g., Friday Night Lights is a movie about a small town in Texas that is run by a family of criminals. The town is run by a family of criminals. … (repeated four times). In other cases, Athena-GPT-Neo contradicts itself. There are no examples from Athena-Jurassic that are redundant or logically inconsistent. In future work, these categories could be added to the human metrics, even though they occur rarely.

Figure 10 also shows that Athena-Jurassic’s 2-shot outputs are remarkably good. In N5, the output is coherent, semantically correct, and stylistically interesting. In N6, all three triples are realized correctly, and the last triple is embedded into a question, which seems very natural. In N7, Athena-Jurassic realizes all the content in the MR, but also produces a good hallucination, defining what the Bechdel test actually is. In N8, Athena-Jurassic seems to know a lot about Usain Bolt: it does not actually realize the triple (Usain Bolt, race time, 9.58 seconds), but provides the race time for the previous record, and produces a good hallucination of the event at which this happened, namely the 2009 World Championships.

Few-Shot Response Generation for Viggo Video Games.

Table 5: bleurt for Viggo comparing GPT-Neo and Jurassic, S2S format vs. QA format, and 3 training instances vs. 10.

We also experiment with few-shot training with the Viggo corpus, with a focus on the realization of dialogue acts. Athena KG MRs do not specify the dialogue act, and thus the use of questions cannot be controlled. The dialogue acts in Viggo are shown in Figure 3 and Figure 4. The Viggo experiments compare training with GPT-Neo and Jurassic, for both S2S and QA formats, and compare 3-shot tuning to 10-shot tuning per dialogue act. All experiments use a randomly selected set of 100 Viggo test items.

Table 6: Human metrics for Viggo comparing GPT-Neo and Jurassic

Table 5 provides the bleurt scores for these variations in the training setup. The QA row in the table for 3-shot training shows that, unlike for the Athena KG domains, the QA format performs much better than the S2S format. Thus we conducted the 10-shot training with only the QA format. The bleurt scores for Athena-Jurassic for 10-shot training are clearly much better than any of the other results. Table 6 provides the results of the human evaluation, showing that Athena-GPT-Neo does better at matching the dialogue acts (DA) and achieves higher semantic accuracy (0.83) than Athena-Jurassic. Overall, however, both models are good at controlling dialogue acts. Table 6 also shows that the coherence of Athena-GPT-Neo is poor compared to Athena-Jurassic. Athena-Jurassic demonstrates the knowledge that Jurassic-1 has about video games by producing far more good hallucinations (29%), while producing bad hallucinations in only 2% of outputs, compared to 12% for Athena-GPT-Neo.

ID | Input MR | Output
Athena-GPT-Neo
V1 | confirm = yes, name = Tony Hawk’s Pro Skater 3, release_year = 2001, genres = sport | You mean the Tony Hawk’s Pro Skater 3 that has got a sport genre?
V2 | give_opinion = yes, name = Might & Magic: Heroes VI, rating = average, player_perspective = bird view, platforms = PC | Might & Magic: Heroes VI is a solid game. I like that it has a multiplayer and, since it’s only rated E (for Everyone), I can play it with friends and younger siblings.
Athena-Jurassic
V3 | confirm = yes, name = Tony Hawk’s Pro Skater 3, release_year = 2001, genres = sport | Gotcha! So you’re referring to the Tony Hawk’s Pro Skater 3 sports game, which was released in 2001?
V4 | give_opinion = yes, name = Sid Meier’s Civilization V, rating = good, available_on_steam = yes, has_linux_release = yes | 4X games can be really fun to play, and Civilization V is a pretty good example. It also happens to be available on Linux, and that’s a huge plus for me.
Figure 11: Viggo outputs with 10-shot training per dialogue act for QA format

Sample outputs for both Viggo models are shown in Figure 11. Both models produce good outputs much of the time. The low coherence of Athena-GPT-Neo primarily arises from redundancy, i.e., repeating the same information twice. Compare example V1 to example V3. Both models get the confirm dialogue act correct. However, the Athena-Jurassic output in V3 is more varied and interesting and includes all the attributes specified in the MR. The Athena-GPT-Neo output in V1 omits the release year attribute and uses an odd phrasing, that has got a sport genre, which results in a lower coherence score. Example V2 illustrates Athena-GPT-Neo’s knowledge about video games, with a good hallucination that the game is rated E (for Everyone), and that this means it can be played with younger siblings. Example V4 illustrates Athena-Jurassic failing to realize some of the MR attributes, such as the availability on Steam and the full name of the game. However, the language is again very natural, e.g., that’s a huge plus for me.

5 Conclusion

We trained new neural models for semantically-controlled meaning-to-text (M2T) natural language generation (NLG) to improve the quality and the coverage of the M2T response generators in Athena, an open-domain dialogue system that has been a finalist in the Alexa Prize for the last two years harrison2020athena; patilathena2021. A major challenge for such systems is the need to produce high-quality responses on any topic. We created Athena-GPT-Neo and Athena-Jurassic using GPT-Neo black2021gpt and Jurassic-1 lieberetal21, by experimenting with few-shot (2, 3, 10) tuning for Athena’s knowledge-graph domains of movies, music, TV, and sports, and with the Viggo corpus’s dialogue-act-based MRs for video games. We also experimented with multiple prompt formats and with testing both within and across domains. The ability to reliably generate high-quality responses directly from MRs via neural NLGs will greatly facilitate the use of M2T NLGs in dialogue systems. To our knowledge, these are the first results demonstrating that few-shot tuning on a massive language model can create M2T NLGs that generalize well to new semantic domains.

Athena-Jurassic produces high-quality, semantically-controlled, conversational responses directly from MRs and KG triples. These results confirm the choice that the Jurassic-1 creators made to use a larger vocabulary with phrasal tokens, and less depth and more width, in order to create a large language model that generalizes better lieberetal21 ; levine2020depth . Our results show that both Athena-GPT-Neo and Athena-Jurassic generally produce coherent output with 10-shot within-domain training, but that Athena-Jurassic is significantly better for both coherence and semantic accuracy. While we have not tested whether real-time response generation is possible, we believe the responses are of high enough quality to be used in settings with real human users, such as the Alexa Prize gabriel2020further ; Venkatesh2017 ; patilathena2021 ; harrison2020athena .

We also showed that Athena-Jurassic performs well in 2-shot tuning tests, using completely novel sets of KG triples with relations and entities never seen in tuning. These novel MRs are not currently included in Athena, because the relations are rare, and creating templates for novel relations or sets of relations is typically not worth the human effort rambow2001evaluating; rambow2001natural. For example, the MR in M4 in Figure 7 describes the event of Muhammad Ali lighting the Olympic cauldron in 1996, a rarely populated event for the athlete entity type. Athena-Jurassic achieves a semantic accuracy of 2.72 out of 3 for MRs like this in our challenging 2-shot setting.

In experiments with the KG response generators in Athena, we found that in almost half the responses, Athena-Jurassic adds questions to the end of the response, which are typically quite natural. However, the use of questions cannot be controlled because the KG-RG meaning representations do not specify dialogue acts. Thus we also experimented with few-shot tuning for controlling dialogue acts using the MRs in the Viggo video games corpus. We showed that both Athena-GPT-Neo and Athena-Jurassic can learn to control dialogue acts with 10-shot tuning per dialogue act. However, Athena-Jurassic again performs significantly better on the human metrics of coherence and semantic accuracy. Interestingly, Athena-GPT-Neo often successfully produces the form or syntax of the dialogue act, e.g., a verify-attribute dialogue act, while getting very few of the MR attributes correct. For example, Athena-GPT-Neo produces You said you liked Assassin’s Creed Chronicles: India. Do you think it would have been better to make it a single-player only game? when the reference utterance is So I know you said you hated Assassin’s Creed Chronicles: India. Do you think all of Climax Studios side view games are as bad?. Here, Athena-GPT-Neo only gets the name attribute correct, and misses the attributes that the game is single-player, that the user rating is poor, and that the developer is Climax Studios.

We also presented automatic evaluation results using bleurt for cross-domain testing. Some of the bleurt results are very good, and suggest that training across domains with 10-shot tuning may also produce high quality utterances. Our results also show that bleurt scores have good correlation with the human metric of semantic accuracy, but not coherence. Future work should evaluate these cross-domain results with human metrics. It would also be valuable to experiment with the large number of recently proposed automatic evaluation metrics to test whether there are better metrics than bleurt for doing automatic evaluation in this task setting howcroft2020twenty ; yeh2021comprehensive . Many recently proposed automatic metrics rely on evaluating outputs within a dialogue context, which typically is not available in M2T NLG experiments. However there are also novel reference free metrics that could be tested in this setting.

There are many other possibilities with both the WikiData knowledge graph RGs and with corpora such as Viggo for tuning and testing regimes that we have not yet experimented with or fully evaluated. We also plan to carry out future experiments on a number of other challenging problems for NLG reedetal20 ; oraby2019curate ; oraby2018neural ; harrison2019maximizing .


Bibliography

  • [1] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large scale autoregressive language modeling with Mesh-TensorFlow, 2021.
  • [2] Paweł Budzianowski and Ivan Vulić. Hello, it’s gpt-2-how can i help you? towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, 2019.
  • [3] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, 2018.
  • [4] Alessandra Cervone, Chandra Khatri, Rahul Goel, Behnam Hedayatnia, Anu Venkatesh, Dilek Hakkani-Tür, and Raefer Gabriel. Natural language generation at scale: A case study for open domain question answering. In INLG, 2019.
  • [5] Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, et al. Gunrock: Building a human-like social bot by leveraging large scale real user data. Alexa Prize Proceedings, 2018.
  • [6] Emilie Colin, Claire Gardent, Yassine Mrabet, Shashi Narayan, and Laura Perez-Beltrachini. The webnlg challenge: Generating text from dbpedia data. In Proceedings of the 9th International Natural Language Generation Conference, pages 163–167, 2016.
  • [7] Amanda Cercas Curry, Ioannis Papaioannou, Alessandro Suglia, Shubham Agarwal, Igor Shalyminov, Xinnuo Xu, Ondřej Dušek, Arash Eshghi, Ioannis Konstas, Verena Rieser, et al. Alana v2: Entertaining and informative open-domain social dialogue using ontologies and entity linking. Alexa Prize Proceedings, 2018.
  • [8] Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyag Gao, and Dilek Hakkani-Tur. Multiwoz 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669, 2019.
  • [9] Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A Smith, and Mari Ostendorf. Sounding board: A user-centric and content-driven social chatbot. NAACL HLT 2018, page 96, 2018.
  • [10] Raefer Gabriel, Yang Liu, Anna Gottardi, Mihail Eric, Anju Khatri, Anjali Chadha, Qinlang Chen, Behnam Hedayatnia, Pankaj Rajan, Ali Binici, et al. Further advances in open domain dialog systems in the third alexa prize socialbot grand challenge. Alexa Prize Proceedings, 2020.
  • [11] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  • [12] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. Creating training corpora for nlg micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, 2017.
  • [13] Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. Topical-chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895, 2019.
  • [14] Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, and Kee-Eung Kim. End-to-end neural pipeline for goal-oriented dialogue systems using gpt-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 583–592, 2020.
  • [15] Vrindavan Harrison, Juraj Juraska, Wen Cui, Lena Reed, Kevin K Bowden, Jiaqi Wu, Brian Schwarzmann, Abteen Ebrahimi, Rishi Rajasekaran, Nikhil Varghese, et al. Athena: Constructing dialogues dynamically with discourse constraints. Proceedings of the Alexa Prize 2020, 2020.
  • [16] Vrindavan Harrison, Lena Reed, Shereen Oraby, and Marilyn Walker. Maximizing stylistic control and semantic accuracy in nlg: Personality variation and discourse contrast. DSNNLG 2019, page 1, 2019.
  • [17] Behnam Hedayatnia, Seokhwan Kim, Yang Liu, Karthik Gopalakrishnan, Mihail Eric, and Dilek Hakkani-Tur. Policy-driven neural response generation for knowledge-grounded dialogue systems. arXiv preprint arXiv:2005.12529, 2020.
  • [18] David M Howcroft, Anja Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, and Verena Rieser. Twenty years of confusion in human evaluation: Nlg needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 169–182, 2020.
  • [19] Juraj Juraska, Kevin K Bowden, and Marilyn Walker. ViGGO: A video game corpus for data-to-text generation in open-domain conversation. In Proceedings of the 12th International Conference on Natural Language Generation, 2019.
  • [20] Juraj Juraska and Marilyn Walker. Attention is indeed all you need: Semantically attention-guided decoding for data-to-text nlg. In International Conference on Natural Language Generation, INLG, 2021.
  • [21] Chris Kedzie and Kathleen McKeown. A good sample is hard to find: Noise injection sampling and self-training for neural language generation models. In Proceedings of the 12th International Conference on Natural Language Generation, 2019.
  • [22] Rémi Lebret, David Grangier, and Michael Auli. Generating text from structured data with application to the biography domain. In Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [23] Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. The depth-to-width interplay in self-attention. arXiv preprint arXiv:2006.12467, 2020.
  • [24] Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, 2021.
  • [25] Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, 2019.
  • [26] J. Novikova, O. Dušek, and V. Rieser. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Conference, 2017.
  • [27] Omkar Patil, Lena Reed, Kevin K. Bowden, Juraj Juraska, Wen Cui, Vrindavan Harrison, Rishi Rajasekaran, Angela Ramirez, Cecilia Li, Eduardo Zamora, Phillip Lee, Jeshwanth Bheemanpally, Rohan Pandey, Adwait Ratnaparkhi, and Marilyn Walker. Athena 2.0: Discourse and user modeling in open domain dialogue. Proc. of the Alexa Prize, 2021.
  • [28] Shereen Oraby, Vrindavan Harrison, Abteen Ebrahimi, and Marilyn Walker. Curate and generate: A corpus and method for joint control of semantics and style in neural nlg. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5938–5951, 2019.
  • [29] Shereen Oraby, Lena Reed, TS Sharath, Shubhangi Tandon, and Marilyn Walker. Neural multivoice models for expressing novel personalities in dialog. Proc. Interspeech 2018, pages 3057–3061, 2018.
  • [30] Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629, Online, July 2020. Association for Computational Linguistics.
  • [31] Ashwin Paranjape, Abigail See, Kathleen Kenealy, Haojun Li, Amelia Hardy, Peng Qi, Kaushik Ram Sadagopan, Nguyet Minh Phu, Dilara Soylu, and Christopher D Manning. Neural generation meets real people: Towards emotionally engaging mixed-initiative conversations. arXiv preprint arXiv:2008.12348, 2020.
  • [32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [33] Owen Rambow, Srinivas Bangalore, and Marilyn Walker. Natural language generation in dialog systems. In Proceedings of the First International Conference on Human Language Technology Research, 2001.
  • [34] Owen Rambow, Monica Rogati, and Marilyn Walker. Evaluating a trainable sentence planner for a spoken dialogue system. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 434–441, 2001.
  • [35] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8689–8696, 2020.
  • [36] Lena Reed, Vrindavan Harrison, Shereen Oraby, Dilek Hakkani-Tur, and Marilyn Walker. Learning from mistakes: Combining ontologies via self-training for dialogue generation. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2020), 2020.
  • [37] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020.
  • [38] Thibault Sellam, Dipanjan Das, and Ankur Parikh. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, 2020.
  • [39] Igor Shalyminov, Ondřej Dušek, and Oliver Lemon. Neural response ranking for social conversation: A data-efficient approach. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 1–8, 2018.
  • [40] Van-Khanh Tran, Minh Le Nguyen, and Satoshi Tojo. Neural-based natural language generation in dialogue using rnn encoder-decoder with semantic aggregation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 231–240, 2017.
  • [41] Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, and Ming Cheng. On evaluating and comparing conversational agents.
  • [42] M.A. Walker, I. Langkilde-Geary, H.W. Hastie, J.H. Wright, A.L. Gorin, et al. Automatically training a problematic dialogue predictor for a spoken dialogue system. Journal of Artificial Intelligence Research, 16(1):293–319, 2002.
  • [43] Marilyn Walker, Albry Smither, Shereen Oraby, Vrindavan Harrison, and Hadar Shemtov. Exploring conversational language generation for rich content about hotels. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • [44] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. Challenges in data-to-document generation. CoRR, abs/1707.08052, 2017.
  • [45] Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Hakkani-Tür, and Amazon Alexa AI. Deepcopy: Grounded response generation with hierarchical pointer networks. In 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 122, 2019.
  • [46] Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. A comprehensive assessment of dialog evaluation metrics. arXiv preprint arXiv:2106.03706, 2021.
  • [47] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  • [48] Chao Zhao, Marilyn Walker, and Snigdha Chaturvedi. Bridging the structural gap between encoding and decoding for data-to-text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2481–2491, 2020.
  • [49] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1):53–93, 2020.