Log In Sign Up

Zero-Shot Controlled Generation with Encoder-Decoder Transformers

by   Devamanyu Hazarika, et al.

Controlling neural network-based models for natural language generation (NLG) has broad applications in numerous areas such as machine translation, document summarization, and dialog systems. Approaches that enable such control in a zero-shot manner would be of great importance as, among other reasons, they remove the need for additional annotated data and training. In this work, we propose novel approaches for controlling encoder-decoder transformer-based NLG models in zero-shot. This is done by introducing three control knobs, namely, attention biasing, decoder mixing, and context augmentation, that are applied to these models at generation time. These knobs control the generation process by directly manipulating trained NLG models (e.g., biasing cross-attention layers) to realize the desired attributes in the generated outputs. We show that not only are these NLG models robust to such manipulations, but also their behavior could be controlled without an impact on their generation performance. These results, to the best of our knowledge, are the first of their kind. Through these control knobs, we also investigate the role of transformer decoder's self-attention module and show strong evidence that its primary role is maintaining fluency of sentences generated by these models. Based on this hypothesis, we show that alternative architectures for transformer decoders could be viable options. We also study how this hypothesis could lead to more efficient ways for training encoder-decoder transformer models.


page 1

page 2

page 3

page 4


Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems

Transformer models have achieved state-of-the-art results in a wide rang...

Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types

We present a neural model for question generation from knowledge base tr...

Graph-Aware Transformer: Is Attention All Graphs Need?

Graphs are the natural data structure to represent relational and struct...

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

The idea of combining multiple languages' recordings to train a single a...

Positional Encoding to Control Output Sequence Length

Neural encoder-decoder models have been successful in natural language g...

Turning transformer attention weights into zero-shot sequence labelers

We demonstrate how transformer-based models can be redesigned in order t...

Attending to Mathematical Language with Transformers

Mathematical expressions were generated, evaluated and used to train neu...

1 Introduction

Natural language generation (NLG) aims at producing fluent and coherent sentences and phrases in different problem settings such as dialog systems Huang et al. (2020), machine translation Yang et al. (2020), text summarization Syed et al. (2021). Due to their outstanding power in recognizing patterns and generalization, neural network-based models have dominated NLG research in the past decade. Most recently, the majority of the research in NLG leverages transformers Vaswani et al. (2017) and specifically, transformer decoders to generate natural language Radford et al. (2019); Brown et al. (2020); Lewis et al. (2020)

. As a general paradigm, in these approaches, natural language is generated autoregressively one token at a time, and each token is generated based on an inferred probability distribution over all possible tokens. Although these statistical approaches to NLG have proven to be highly effective, their stochastic nature and complex architectures make them difficult to control in order for them to reflect any set of desired attributes in the output. These attributes could range from persona, sentiment, condolence, dialog acts, questions, for dialog response generation 

Niu and Bansal (2018); Zhang et al. (2018); See et al. (2019); Madotto et al. (2020b) to story ending control for story generation Peng et al. (2018) or formality and politeness control for drafting emails Madaan et al. (2020), amongst others.

Figure 1: Performing zero-shot controlled generation on an encoder-decoder transformer at inference time using control knobs. The control knobs influence the generation process in such a way that the generated output has the desired attributes (e.g., asking questions).
Figure 2: Control knobs for zero-shot controlled NLG: (A) attention biasing knob for cross-attention, (B) decoder mixing knob, (C) context augmentation knob, and (D) attention biasing knob for self-attention

In general, being able to control an NLG model in a zero-shot fashion would be highly instrumental since such zero-shot control would not require large amounts of annotated data, nor would it require any fine-tuning of parameters of the NLG model or auxiliary attribute models to guide the generation. In this work, we introduce new zero-shot approaches for controlling NLG models based on encoder-decoder transformer architectures (that in the interest of brevity we refer to as EDT-NLG). The high-level idea of these approaches is to explicitly manipulate the transformers within the trained EDT-NLG models to achieve the desired attributes at generation time (see Figure 1). More specifically, we introduce a set of three control knobs, namely, attention biasing, decoder mixing and context augmentation that could be used to control the generation of EDT-NLG models. These knobs could be modulated to achieve varying degrees of control in generation. The attention biasing knob modulates the amount of attention paid to different parts of the attention context (i.e., what the attention module attends to). The decoder mixing knob works based on the idea that different decoder transformers with different learned behaviors such as input reconstruction (in auto-encoders), summarization, response generation, etc., could be combined at generation time to achieve mix of these behaviors in the generation process. The context augmentation knob works based on introducing additional context on the encoder side to generate as per desired attributes. Here, we note that there is no control-specific training in these approaches and no gradient update is involved in applying these knobs. An overview of how these control knobs function is shown in Figure 2. In the next sections, we will describe in detail how these knobs work, and through computational results, we show that these knobs are highly effective in zero-shot controlling of the generation in EDT-NLG models.

With that in mind, it is quite unexpected and counter-intuitive that according to the experiments results (shown in § 5 and § 6), manipulation of trained attention layers and transformers in general, through control knobs, does not derail EDT-NLG models. This robustness of EDT-NLG models to the manipulations introduced through control knobs raises many new questions. One of such questions is about the limits of such manipulations and when these manipulations cause the models to break down. Additionally, what are some of the implications of the robustness of these models to such manipulations? We address some of these questions in § 7, where we show strong evidence that in EDT-NLG models fluency of the generation is managed by the decoder self-attention. Based on these results, we investigate alternative architectures for transformer decoders, as well as approaches for more efficient training of EDT-NLG models.

To summarize, this work’s contributions are as follows:

  • [leftmargin=*]

  • We propose a set of control knobs that can control EDT-NLG models during generation in a zero-shot manner, i.e., without training for controlled generation or using any gradient-based optimization during inference.

  • We explore the application of the proposed control knobs for knowledge-grounded response generation and find that these control knobs can achieve zero-shot controlled generations, for a wide variety of attributes.

  • We put forth and analyze the hypothesis that in EDT-NLG models fluency of generation is managed by decoder self-attention. Based on this analysis, we also explore alternative architectures for transformer decoders and propose efficient ways of training EDT-NLG models.

The control knobs introduced in this paper could be generalized to any EDT-NLG model for any NLG task. Moreover, the attention biasing knob is generic to any attention mechanism within or outside of a transformer-based architecture and could be applied to other attention-based applications such as computer vision and multi-modal problems. However, to demonstrate the efficacy of the control knobs, in this work we focus on a specific family of NLG tasks, namely knowledge-grounded open-domain Neural Response Generation (K-NRG). We use K-NRG to present the ideas, experiments, and computational results.

2 Related Works

Numerous works in the literature focus on controlling neural network-based NLG models Prabhumoye et al. (2020). These approaches fall under two major categories. The first category focuses on using data annotated with the desired attributes to train the NLG model such that it is able to generate with the same attributes Keskar et al. (2019); Wu et al. (2020); Smith et al. (2020); See et al. (2019). The drawback of this approach is that for every set of desired attributes, annotated training datasets are required, which makes this approach difficult to scale. It also makes it difficult for trained models to generalize to other forms of control. Furthermore, often times, there is a different dataset per desired attribute, and it is not clear how to combine multiple attributes as they may override the effect of each other.

The second category involves approaches that do not alter the model parameters but rather modify the decoding strategies during inference. To achieve the desired control, these approaches nudge or reweigh the output distribution towards the desired directions, either using discriminators Holtzman et al. (2018) or bag-of-words that are indicative of the attributes Ghazvininejad et al. (2017); Baheti et al. (2018); See et al. (2019). However, these kinds of decoding strategies have been observed to be brittle, particularly for tasks like dialog response generation See et al. (2019). Another set of approaches within this category leverage auxiliary models that can detect the desired attributes. Termed as Plug-and-Play Language Models (PPLM) Dathathri et al. (2020); Madotto et al. (2020a), these approaches utilize the gradients from the auxiliary attribute detection model, using which the generative model performs optimization and increases the likelihood of receiving a high score from the auxiliary model. In these methods training auxiliary models still require annotated data that could be expensive to acquire. Moreover, approaches like PPLM that employ gradient updates at generation time are computationally expensive to generate. Note that while in the first category of controlled NLG approaches, gradient updates are used at training time, in the second category, gradient updates are rather indirect and occur at generation time only.

In contrast to the above categories, the goal of this work is controlled NLG in zero-shot. In the recent years, in machine learning in general, there has been an increased emphasis on zero-shot approaches that do not require any specific gradient-based optimization (neither during training nor during inference). In particular, prompt-based approaches have been proposed that prime massive language models, like GPT-3 

Brown et al. (2020), with few-shot supervised examples of a specific task. These approaches have led to favorable results where the model is able to adapt to new tasks without any fine-tuning. However, to the best of our knowledge, there is no work in the literature focused on controlling the output within an NLG task in a zero-shot setting Wu et al. (2020) through priming of language models. And also, to the best of our knowledge, this work is the first to propose approaches for zero-shot controlled NLG.

Figure 3: Overall architecture of encoder-decoder transformer-based model for K-NRG problem.

3 Control Knobs for Zero-Shot Controlled NLG

This section discusses the details of control knobs, namely attention biasing, decoder mixing, and context augmentation, that are proposed for zero-shot controlled NLG. These knobs are intuitively designed such that the weights and outputs of attention layers in a trained NLG model are modified at inference time to achieve desired attributes in the generated outputs.

Figure 4: Details of the attention biasing knob for cross-attention. Values 1 and 5 are example attention bias values.

As a high-level summary of how the control knobs for zero-shot controlled NLG work, consider a trained EDT-NLG model , where represents the parameters of the model, and is conditioned on a context . The generation process using this model could be represented as sampling a response from , i.e., . Generating with additional desired attributes, e.g., positive sentiment, could be interpreted as introducing an additional condition to the sampling process. The control knobs manually modify to such that samples from on average manifest the desired attributes significantly more.

3.1 Attention Biasing

We describe how the attention biasing knob works for cross-attention in EDT models. Consider the cross-attention layer in the decoder as shown in Figure 4. At generation time step

, the decoder attends to the context in the following manner: the query vector is first multiplied by the key matrix, and the result goes through a Softmax operation that outputs a discrete probability distribution, also referred to as

attention distribution. Attention distribution is then used (through multiplication with the value matrix) to determine in some sense how much attention should be paid to each one of the context tokens Daniluk et al. (2017).

The idea of the attention biasing knob is forcing an attention module to attend to some parts of its context more or less than it usually would, by directly adjusting the attention distribution. We do this through element-wise multiplication of a bias vector with the attention distribution and then normalizing the results so that the outcome is still a probability distribution (referred to as biased attention distribution). As an example, in Figure 4 the cross-attention context has two parts, and the attention process is biased by multiplying the attention to the first part by some value (for example, 5) and then normalizing the outcome to retrieve a probability distribution.

More formally, given embedded context , attention matrices , , , and the embedding for , cross-attention output for is:

In this notation, biased cross-attention could be defined as:


where function normalizes a given positive vector to have the element-wise sum of 1, is the bias vector at time step , and represents element-wise vector multiplication.

Attention biasing for self-attention works similarly to its cross-attention counterpart, which we discuss at length in § 7. Note that in this work, vector is not a learned parameter, and it is manually set so that, similar to probing, the effects of intuitive designs for this vector could be analyzed.

Biasing of attention modules has been employed in applications such as machine translation, to achieve local or focused attention. These include learning local windows of attention using strategies like gaussian-based biases Luong et al. (2015); Yang et al. (2018), hard-coded biases You et al. (2020), etc. Attention biases could also be induced using relative embeddings Shaw et al. (2018). Another popular way to bias attention is by learning differentiable masks Nguyen et al. (2020); Fan et al. (2021). Similar to these works, the attention biasing knob also biases the attention distribution, but unlike these works, in the attention biasing knob the bias is applied in zero-shot and on a continuous scale. While zero-shot biases have been studied in the probing literature to understand the influence of attentions on model’s classifications Serrano and Smith (2019), to our knowledge, zero-shot attention biasing for controlled generation is an unexplored avenue.

Figure 5: Example of decoder mixing with two decoders. For every decoding time step and decoder layer, a convex combination is applied to the output of the two decoder blocks.

3.2 Decoder Mixing

The decoder mixing knob, as the name suggests, mixes two or more trained transformer decoders at every layer. Inspired by cross-stitch networks for multi-task learning Misra et al. (2016), the mixing in this knob is done through convex combination of the output of the transformer decoders for every decoder layer. Figure 5 shows how decoder mixing works for two decoders.

The intuition behind this control knob is combining different behaviors that are learned in different decoders. For instance, consider a BART model in which the decoder has learned to reconstruct the input to the model and, and as a result, could be thought of as having a copying (of the context) behavior. Also consider another EDT-NLG model in which the decoder has learned to generate a response given a dialog, hence it has a responding behavior. In this example, the decoder mixing knob could be used for combining these two behaviors to generate responses where parts of the context (e.g., knowledge) is incorporated (more or less copied) in the generated response.

Combining multiple decoders has not been extensively explored in the literature. One notable work is Niu and Bansal (2018), where the authors propose late fusion of two decoders by merging the probability distribution predicted from each decoder. The main difference between the late fusion approach and the decoder mixing knob is that in this knob the mixing is done at every layer of the decoder and not at the output layer.

In the decoder mixing knob, at every generation time step and decoder layer , two or more transformer decoders are applied followed by a convex combination of the output of these decoders. More formally, if there are transformer decoders and transformer decoder is represented as function for all , the output of decoding at time step for the input is equal to where and . This mixing process is repeated across all the layers of the decoders. We refer to the vector of values as the decoder mixing vector and represent it as .

3.3 Context Augmentation

Figure 6: Example of creating control codes for questions.

In the context augmentation knob, we apply modifications to the input of the EDT-NLG model in order to push the model to manifest the desired attributes in the generations. We explain how the context augmentation knob works through an example. Imagine that the desired attribute for the output of the model is asking a question, i.e., inquisitive generation. In other words, we would like to increase the likelihood of the model’s output including a question. To this end, we first sample a set of question sentences (e.g., by choosing sentences that end with a question mark) from any text corpora. We call these sentences control phrases. We then feed each control phrase to the encoder of the EDT-NLG model to get an embedding for it111The embedding of a particular control phrase is the contextual sequence of token representations generated by the encoder of the encoder-decoder model.. We then take the average of these embeddings across all control phrases (Figure 6). We refer to this average as control code. The control code is then concatenated to the encoded context as shown in part (C) of Figure 2.

The control code, being the average of embedded control phrases, is designed to capture the shared concepts among the control phrases. The role of averaging in creating control codes is to maintain the shared concepts within the control phrases (which is question in our example) and smoothing out other concepts such as topic. This concept of an average control code is inspired from the prototypes in Snell et al. (2017).

4 Overview of Experiments

We conduct extensive experiments and ablations to understand the efficacy of the control knobs introduced in the previous section. This section presents an overview of the experiments and establishes the different settings under which they are conducted.

4.1 Preliminaries: Knowledge-Grounded Open-Domain NRG

A: Hi! do you like to dance?
B: I love to dance a lot. How about you?
A: I am really bad, but it is a good time.
“Bruce Lee was also a great dancer and
that he won the Hong Kong Cha-Cha
Championship in 1958.”
Next Turn
B: Hmm. Dancing is a lot of fun.
B: Dancing is a lot of fun. Even Bruce Lee
was a great dancer and has won competitions.
B: Dancing is a lot of fun. Did you know
that Bruce Lee was a great dancer?
Table 1: An example of knowledge-grounded response generation from Topical-Chat dataset Gopalakrishnan et al. (2019). For the provided conversational context, multiple styles of responses (such as more informative or inquisitive) are appropriate.

We use the K-NRG problem in our experiments to study the efficacy of the control knobs for controlled generation. In this problem, the input comprises the previous dialog turns (also referred to as dialog history) and one or more knowledge snippets related to the dialog. The task is to generate the next turn of the dialog. Table 1 shows an example of the previous turns of a dialog and a provided knowledge snippet, as well as the successive turn of the dialog.

For the K-NRG problem, we train encoder-decoder transformer-based NRG (EDT-NRG) models. More specifically, at every turn, the previous dialog turns is concatenated to the provided knowledge snippet , and the result , collectively referred to as dialog context, is encoded by the encoder (see § A.1). The decoder is prompted with the start of the sentence token and in an auto-regressive manner generates one token () at a time, based on cross-attention to the encoded dialog context and self-attention to the previously generated tokens () until a special end-of-sentence token is generated, i.e.

where is the EDT-NRG model with parameters .

We use the setting introduced in the Topical-Chat dataset Gopalakrishnan et al. (2019) which includes dialogues between two Mechanical Turk workers (a.k.a. Turkers). Based on the previous work Hedayatnia et al. (2020), we choose the setting where for each turn in the dialog, the knowledge snippet that is the most similar to the ground truth response is selected using TF-IDF and is provided as additional input.

For the NRG model, we use BART as the pre-trained encoder-decoder transformer model (EDT) Lewis et al. (2020). In particular, we choose the smaller BART-base model for two reasons. First, smaller models require significantly less compute resources and are more economical with a much less carbon footprint. Second, they are more challenging for zero-shot control as previous results highlight the difficulty to achieve zero or few-shot capabilities in smaller models Schick and Schütze (2020). Full details over the fine-tuning procedure of the Bart-base model on the Topical-Chat dataset is provided in Appendix A.

We evaluate the efficacy of the control knobs over the two frequent and rare test sets from the Topical-Chat dataset. As the name suggests, the frequent test set contains entities in the dialogs that frequently appear in the training set, whereas the rare test set contains entities that are not frequent in the training data.

4.2 Goals of the Experiments

The goals of experiments in this work are two-fold. First, we examine whether the proposed control knobs effectively control the generation process to generate according to desired attributes. Second, we examine whether applying the knobs would cause negative impacts on the generation output. Specifically, we examine the impact of the control knobs on fluency and relevance of the generated response. Fluency refers to the grammatical and syntactical correctness of generated responses. Relevance refers to appropriateness of a response given the history of the dialog See et al. (2019); Shin et al. (2019); Rashkin et al. (2019).

It should be re-emphasized that our primary goal is to explore the zero-shot controllability of EDT-NLG models using the control knobs. As such, we do not intend to propose a general-purpose response generator that leverages such controllable generations to improve the overall conversation experience. Such models would need appropriate dialog policies, such as dialog act-based policies Hedayatnia et al. (2020); Sankar and Ravi (2019) or other content planners Wu et al. (2020), and we leave the exploration of these as future work.

Due to the differences between the attention biasing and decoder mixing knobs on the one hand, and the context augmentation knob on the other hand, we split the experiments into two sections. In § 5 we present the experiments for cross-attention biasing and decoder mixing knobs. In § 6 we discuss the experiments for the context augmentation knob. After the experiment sections, we discuss in details self-attention biasing, alternative architectures for transformer decoders, and more efficient training of EDT-NLG models in § 7. All of our experiments are done across five runs to account for variability in the token sampling procedure.

5 Experiments: Attention Biasing and Decoder Mixing

5.1 Experiments Setup

In this section, we study the effects of applying cross-attention biasing (§ 3.1) and decoder mixing knobs (§ 3.2) for zero-shot controlled NLG. We focus our experiments on controlling informativeness in generated responses for the Topical-Chat problem setup introduced in § 4.1. As a reminder, in this setup, input is composed of dialog history and a knowledge snippet , i.e., .

5.1.1 Cross-Attention Biasing Profiles

We first apply the attention biasing knob for cross-attention modules on the transformer decoder of an EDT-NRG model fine-tuned for Topical-Chat. As the dialog context is a sequence with two parts, knowledge snippet and dialog history (Figure 4), the bias vector at generation time step could be represented as the row vector which is the concatenation of two bias row vectors and , i.e., . Although these vectors can be composed of different elements, meaning that at time the attention bias value for token of knowledge snippet could be different from that of token, we simplify the setup by assigning one attention bias value for knowledge () and another for dialog history () for each generation time step . In other words:

where, and represent the total number of tokens in knowledge and dialog history, respectively. As a hypothetical example if has 3 tokens and has 4 tokens, and at time step we give attention bias value of 5 to knowledge () and 1 to dialog history (), then .

Following this notation, we design different biasing profiles to explore the extent of controlled generations we can achieve from biasing cross-attention through the attention biasing knob. We experiment with three different biasing profiles, namely:

Figure 7: Cross-attention biasing profiles: Dialog, Knowledge, and Gradual Knowledge (see § 5.1).

where the decoder cross-attention is biased towards the dialog history across all generation time steps. In other words, for all , we set the biases such that is larger than , and more specifically, we set them as .


where the decoder cross-attention is biased towards the knowledge snippet across all generation time steps. In other words, for all , we set the biases such that is larger than , and more specifically, we set them as .222All of our biasing profiles are shared across the multiple heads of attention layers. Exploring head-specific biasing is left as a future work.

Gradual Knowledge

where initially decoder cross-attention is biased more towards the dialog history, and as the generation time step progresses, the biasing gradually shifts towards the knowledge snippet. The motivation for this design comes from the typical nature of human conversations, where it often is appropriate to start the response by addressing the last utterance of the other party. In this biasing profile, knowledge bias value increases linearly (with slope ) from up to a certain threshold. Meanwhile the dialog bias is kept at a constant value throughout the generation time steps. For our experiments, we set the parameters of this cross-attention biasing profile as follows: , , and .

Figure 7 presents sample representations of these three biasing profiles. Note that dialog and knowledge biasing knobs are rather extreme and mimic a gating strategy between knowledge and dialog history. In contrast, the gradual knowledge biasing knob is based on a predominant response structure in human conversations.

5.1.2 Decoder Mixing Setup

We explore two profiles for the decoder mixing control knob, both of which use the Topical-Chat fine-tuned EDT-NRG model’s decoder and the pre-trained BART’s decoder for decoder mixing. For the first profile, we set the decoder mixing vector which performs an averaging operation for the output of pre-trained and fine-tuned decoders at each decoder layer (§ 3.2). In the second profile, we set which gives the values 0.7 to the Topical-Chat fine-tuned EDT-NRG decoder and 0.3 to the pre-trained BART decoder in order to put more emphasis on generating proper responses and slightly less emphasis on copying from the knowledge snippet. The latter bias profile essentially applies a smaller bias compared to the former one.

5.2 Evaluation

To evaluate the generated responses for the dialog context , we setup both automatic and human evaluations of the responses to measure their informativeness, fluency, and relevance. Due to the high cost of human evaluations we only conduct them for a subset of our experiments.


To evaluate informativeness of responses, we use BLEU, ROUGE, and METEOR as automatic metrics for comparing a generated response with the provided knowledge snippet () in the dialog context. As automatic metrics on their own are not entirely reliable for evaluation of informativeness of the responses Belz and Reiter (2006); van der Lee et al. (2019), we also perform human evaluation of the generated responses. For this purpose we define a new taxonomy (presented in Table 2) over five levels of informativeness. The goal of these levels is to capture the amount of manifestation of provided knowledge in the response. Details on the setup of human evaluation is provided in § B.2.


As was mentioned in § 4.2 we also examine if applying the control knobs to create more informative responses would impact the fluency of the generated responses negatively. To that end we set up human evaluations in which annotators make a yes or no decision on the question “Does the language of the response seem correct?”. Moreover, as an automatic metric for fluency, we also measure the perplexity of the models calculated with respect to ground-truth human response (PPL). In § B.1, we discuss additional automatic metrics for fluency.


To evaluate the relevance of responses, in human evaluation we ask annotators the following question: “Regardless of its factual correctness, how appropriate is the response to the conversation?”. This score is filled on a Likert scale of 1-5.

Level Taxonomy
Does NOT include anything from the provided knowledge
and does NOT provide any facts.
Does NOT include anything from the provided knowledge
but includes some other facts or opinions (made up or not).
Includes some words from the provided knowledge,
but makes up facts.
Indirectly uses provided knowledge, without making up facts.
Directly uses provided knowledge, without making up facts.
Table 2: Proposed taxonomy to evaluate informativeness in responses. We prioritize knowledge-oriented responses over un-informative responses (Levels 2-5 vs. 1). Within levels 2-5, we prefer responses that adhere to the provided knowledge (4 and 5) over responses that mention hallucinated facts (2 and 3).
Knob Bias Profile Fluency Relevance Informativeness
PPL Human Eval Human Eval BLEU ROUGE METEOR Human Eval
Freq Rare Freq Rare Freq Rare Freq Rare
Base Model 9.66 9.88 0.796 3.76 0.09 0.16 0.22 0.28 0.28 0.36 3.43
A Dialog 10.15 10.39 - - 0.03 0.10 0.13 0.20 0.16 0.26 -
Knowledge 10.20 10.59 0.786 3.81 0.14 0.26 0.28 0.38 0.36 0.49 3.84
Gradual Knowledge 10.03 10.38 0.788 3.83 0.13 0.22 0.26 0.34 0.34 0.45 3.80
B 11.78 12.02 0.785 3.87 0.23 0.25 0.35 0.38 0.45 0.48 3.68
Gradual Knowledge,
12.59 13.00 0.751 3.89 0.28 0.34 0.41 0.47 0.53 0.61 3.82
Table 3: Effect of attention biasing and decoder mixing control knobs on the informativeness of responses on the Topical-Chat frequent and rare test sets. “A” represents the attention biasing knob and “B” represents the decoder mixing knob. Results are averaged over 5 inference runs with random seeds. Numbers in boldface (for BLEU, ROUGE, METEOR

, and human evaluation) represent statistically significant difference with respect to Base Model as per both pairwise Tukey’s HSD test and two-tailed unpaired t-test (both with

). Model is significantly different from base for but not for . One-way ANOVA test across the five human evaluated models have a statistically significant difference with 1.6e-8

. Complete tables, with additional metrics and standard deviations are provided in 

§ B.1, and further discussion on human evaluation setup and results are provided in § B.2.
A: Hello - how are you doing today?
B: Hello, I am doing good. How are you?
What do you think of countries having
prime ministers?
A: I’m doing good thanks for asking.
I think it’s different. What about you?
B: It is different, I agree. I am not sure
how much power they have. It seems like they
can do a little more than a president can but
I am not certain.
A: I saw the president of the UK has a cat with a
government title of chief mouser.
“Broadly speaking, a “politician” can be
anyone who seeks to achieve political
power in any bureaucratic institution.”
Dialog Bias
B: I wonder how many people are in that
position. I think they can be more than the
president though. They can do whatever
they want.
Knowledge Bias (Informative)
B: That is pretty cool.
I wonder what kind of job that is.
Politicians can be anyone who seeks to achieve
political power in any bureaucratic institution.
Table 4: Example of control between dialog-oriented vs. knowledge-oriented responses. The responses generated with the Dialog and Knowledge biasing profiles, respectively (see  § 5.1).

5.3 Results

Table 3 summarizes the results of applying attention biasing (Knob A) and decoder mixing knobs (Knob B) for controlling the informativeness of generated responses. Note that numbers in boldface represent statistically significant difference from “Base Model”, which is the BART-base model fine-tuned on Topical-Chat training set. In this table, fluency, relevance, and informativeness (column families) of responses generated by applying no control knob (“Base Model” row), cross-attention biasing knob (rows A), decoder mixing knob (row B), and both attention biasing and decoder mixing knobs (row A+B) are measured. For attention biasing experiments (rows A), we use the three bias profiles that are discussed in § 5.1.1 (Figure 7).

From the informativeness columns, we can see that using the attention biasing knob for biasing the cross-attention towards dialog (row A - Dialog profile) causes the automatic metrics (BLEU, ROUGE, and METEOR) to drop, indicating that the provided knowledge is incorporated less in the responses, as expected. On the other hand, when the attention biasing knob is used to bias the cross-attention towards the provided knowledge snippet, we see that compared to the base model, these metrics are significantly higher. This trend also appears in the human evaluation, where we see that the informative scores are significantly higher for all the rows corresponding to the attention biasing knob (rows A). Specifically, we see that using the bias profile “Knowledge”, the human evaluation score for informativeness is 3.84, which is significantly larger than the 3.43 that the base model achieves.

Regarding fluency and relevance, while we see an increase in perplexity as the model is biased with different profiles, the human evaluations do not show any statistically significant difference between the variants and the base model. This indicates that while the attention biasing knob works well in generating more informative responses, it does not negatively impact the fluency and relevance of the responses.

Similar conclusions could be made for the decoder mixing knob (row B) as well as the combination of attention biasing and decoder mixing knobs (row A+B). It is notable that for the decoder mixing knob (row B) we see that automatic informativeness scores are higher than those of the attention biasing knobs (rows A), but the human evaluation informativeness score for the attention biasing knobs is higher than that of decoder mixing. This perhaps highlights the value of human evaluation for measuring subjective factors such as informativeness. Table 4 presents an example from the test set, where we demonstrate how the two cross-attention biasing profiles control the informativeness of the output response.

Knob Bias Profile Knowledge First Knowledge Last
Base Model 9.66 0.22 9.65 0.21
B 11.78 0.35 11.38 0.17
12.59 0.41 11.46 0.28
Table 5: Effect of positioning the knowledge snippet in the context: beginning vs. end. We observe that this positioning has an effect on the decoder-mixing knob. Results shown here are on the frequent testing set.

One point to note here is that since in the experiments knowledge snippets come before the dialog history in the cross-attention context, it is likely that it is easier for the pre-trained decoder in the decoder mixing knob to incorporate the knowledge snippet into the response. If this order is switched and the dialog history comes before the knowledge snippet in the dialog context, using cross-attention biasing in the pre-trained decoder could be leveraged to copy from the knowledge snippet more than the dialog history. We see this effect in Table 5, where applying only the decoder mixing knob, when knowledge is placed at the end of the dialog context, causes the informativeness to go down, instead of going up. The results could be improved using the attention biasing knob along with the decoder mixing knob (row A+B where ROUGE increases from 0.17 to 0.28). However, the overall informativeness is significantly lower than the variants where knowledge is placed in the beginning of the dialog context. This indicates that ordering of the input could be crucial for the decoder mixing knob.

5.3.1 Effect of Bias Amount

We also examine how the amount of bias in attention biasing (bias value) and decoder mixing knobs () impact informativeness (through ROUGE) and fluency (through perplexity) of generated responses. Table 6 demonstrates the effect of varying the amount of bias of the control knobs. Throughout the experiments on attention biasing (rows A) we keep the value of for cross-attention biasing (amount of bias towards dialog history) at 1 in order to measure the effect of different values for , which represent the amount of bias towards the provided knowledge snippets. From the table, we note that higher values of result in higher ROUGE which could be interpreted as higher incorporation of knowledge into the generated response. On the other hand, as is increased, perplexity of the model also goes up slightly. But if we compare these perplexity numbers with those in Table 3 we can argue that the generated sequences with are still reasonably fluent333Due to high cost of human evaluation, we run them only for a subset of our experiments that cover the primary results..

For rows B in Table 6 we compare two different decoder mixing profiles, one with and one with . The first profile corresponds to the case where we give 0.7 weight to the Topical-Chat fine-tuned decoder and 0.3 to pre-trained BART decoder in decoder mixing at generation time. As is expected, we see that when a higher weight is given to the pre-trained decoder the incorporation of the knowledge snippet into the generated response is higher (higher ROUGE) which is likely to be due to the higher reconstruction (copying input) behavior of pre-trained BART that results in copying more content from the knowledge snippets into the generated responses. Similar to the attention biasing experiments (rows A), we see that the fluency of the models are likely to be not significantly impacted in these decoder mixing experiments according to the perplexity values.

Knob Test frequent Test rare
1 1 9.66 0.22 9.88 0.28
A 2 1 9.78 0.25 10.05 0.32
5 1 10.20 0.28 10.59 0.38
10 1 10.70 0.32 11.18 0.41
50 1 12.23 0.38 12.90 0.49
B [0.7, 0.3] 10.47 0.24 10.68 0.31
[0.5, 0.5] 11.78 0.35 12.02 0.38
Table 6: Effect of varying intensities of biasing for Attention biasing (A) and decoder mixing (B) knobs. For A, we keep the biasing profile to be Knowledge (see  § 5.1).
Knob Biased Decoder Layers Test frequent Test rare
PPL Rouge PPL Rouge
Base Model 9.66 0.22 9.88 0.28
A Bottom 3 9.73 0.22 9.95 0.28
Top 3 10.11 0.28 10.48 0.37
All Layers 10.20 0.28 10.59 0.38
B Bottom 3 10.23 0.20 10.45 0.25
Top 3 11.03 0.24 11.23 0.29
All Layers 11.78 0.35 12.02 0.38
AB Bottom 3 10.77 0.29 11.05 0.36
Top 3 11.58 0.34 12.01 0.42
All Layers 13.04 0.43 13.57 0.51
Table 7: Effect of varying intensities of biasing through attention biasing (A) or decoder mixing (B) knobs. For A the biasing profile is Knowledge (see  § 5.1) and for B, .

5.3.2 Layer-Specific Biasing

Up to this point, we apply the control knobs on all of the transformer decoder layers. Here, we also examine the sensitivity of different layers of the transformer decoder to these knobs. To that end, we consider the settings where the control knobs are applied either to the top half or the bottom half of the transformer decoder layers and compare these to the general setting of applying the knob to all transformer decoder layers. The results are summarized in Table 7. We can see that for the attention biasing knob (rows A), applying the knob on the bottom three layers has no significant impact on perplexity and ROUGE compared to the base model (first row). On the other hand, applying this knob on the top three layers has almost the same effect on perplexity and ROUGE as applying the knob to all layers has. The results for the decoder mixing knob (rows B) show that applying this knob only on the bottom three layers deteriorates ROUGE compared to the base model. Application of this knob on the top three layers has a very small impact on ROUGE compared to the base model, but when this knob is applied to all the layers, we see significant improvement in ROUGE. Finally, when both of the attention biasing and decoder mixing knobs are applied (rows A+B) to the bottom three, top three, and all the layers, we see that ROUGE increases in that same order.

6 Experiments: Context Augmentation

This section explores controlled generation using the context augmentation knob (§ 3.3). We also investigate how combining the context augmentation knob with attention biasing and decoder mixing knobs could further improve our results.

6.1 Experiments Setup

For demonstrating how the context augmentation knob works and how different parameters and settings impact the results of this knob in ablation studies, in this section, we focus the majority of experiments on controlling the generation of questions (“wh” questions, yes or no questions, etc.) in dialog responses. Generating questions is an essential skill towards making dialogs more inquisitive and consequently improving their engagingness with the user See et al. (2019). In addition to question generation, later in the section, we also show the applicability of the context augmentation knob for controlling other desired attributes in a dialog response, including incorporating feedback-oriented sentences in the response (§ 6.4.2) and making responses more positive in sentiment (§ 6.4.3).

As was discussed in § 3.3, control phrases are used for the context augmentation knob. For applying this knob to generate questions we randomly sample 1000 questions from the Topical-Chat training set and use them as control phrases. We then generate the control code by encoding the control phrases using a pre-trained BART encoder and averaging the encodings over the 1000 samples.

Since in this section we also experiment with combining the context augmentation knob with attention biasing and decoder mixing knobs, we discuss the profiles used for these two knobs here. Regarding attention biasing, unlike the earlier experiments in § 5, the embedding of dialog context in this section is prepended with the control code (§ 3.3). The resulting augmented dialog context is then segmented as follows. There are two values and that define the bias vector . In this section we use the profile where for all and for , which means that the cross-attention is biased towards the control code for the first decoder time steps444The results also hold with other time steps than ., while there is no cross-attention biasing for the remaining time steps. For the decoder mixing knob, we use .

6.2 Evaluation

As the initial and ending parts of a dialog typically include greetings and salutations, we sample a subset of test samples from the Topical-Chat test sets by focusing on more central turns in the dialog. In particular, we randomly sample dialog contexts ( from each frequent and rare splits of the test set) with five previous dialog turns and use this consolidated test set to evaluate the efficacy of the context augmentation knob. For that, similar to See et al. (2019), we use “?” as an indicator for questions, which we find to act as a strong proxy for questions555We rarely find cases where a question does not have a “?” or a question-marked sentence is not a question..

As each response turn can be composed of multiple sentences, we calculate the number of questions either at the sentence-level (counting every sentence with a question mark666We use NLTK to perform sentence segmentation of the generated response:, or turn-level (counting the indicator that a turn has at least one question). We repeat each experiment across five runs, to account for variability in the token sampling procedure. We report both the mean and standard deviation over these counts in the respective tables discussed next. We also measure fluency and relevance through human evaluations similar to the setup that was used earlier in § 5.2.

6.3 Results

Table 8 summarizes the results of biasing the responses towards more questions. C in this table and all the following tables represent the context augmentation knob, and A, B represent cross-attention biasing and decoder mixing knobs, respectively. The first row of this table also represents the base case where no biasing knob is applied. From the numbers, one could note that the fluency, that is evaluated by human annotators, does not change much, and any changes are statistically insignificant. For relevance, which is also evaluated by human annotators, although larger differences are observed, they are still not statistically significant777For Tukey’s HSD test between base model ‘-’ and ‘C+A+B’ on relevance, we get which indicates the difference is not statistically significant. These scores have standard deviations of about , which could be a reason for no statistically significant difference. Details provided in § C.1. The simple intuition that two different responses (one including and the other not including a question) could both be relevant responses, could be the reason why the relevance measure is not significantly impacted, even though significantly more questions are generated in the responses.

In terms of the number of questions generated, we can see that using context augmentation alone (row C) does not generate more questions than the base case (row None). However, when this knob is combined with the attention biasing knob (row C+A) or with both attention biasing and decoder mixing knobs (row C+A+B), the number of questions generated is quite larger. From these numbers it appears that cross-attention biasing is key for the context augmentation knob to work. Notably, the combination of context encoding and decoder mixing (row C+B) has the inverse effect of generating fewer questions. This could be due to the pre-trained BART decoder in decoder mixing not being able to use its reconstruction (copying) capabilities on the control codes, which are not sequences of token embeddings because of the averaging operation.

Another notable observation is that despite the control knobs aid in increasing the number of questions by 40%, not all the responses are becoming questions. This could be indicative of these knobs being used by the model to generate questions whenever it makes sense to have a question in the response, and not forcing the model to generate questions at all times, whether it is appropriate or not; which is a desirable feature of the knobs.

Knobs Fluency Relevance # of Questions
None 0.86 3.70 58.24.2 61.45.2
C 0.88 3.62 58.65.4 60.06.3
CA 0.87 3.56 70.43.7 72.64.1
CB 0.85 3.61 50.23.1 59.66.2
CAB 0.85 3.55 83.44.8 100.04.7
Table 8: Control over number of generated questions for different combinations of knobs. Configurations of A, B, and C are defined in § 6.1. Note that sentence-level numbers are higher than turn-level as each turn can have multiple questions. We report the mean and standard deviation across five generations with varying random seeds over the 200 dialog contexts from the consolidated test set (§ 6.2). Statistical significance details are provided in § C.1. Qualitative examples are presented in § C.5. Reliability of the proxy-based ‘?’ marker for counting questions is established through a human evaluation detailed in § C.4.

6.3.1 Effect of Number of Control Phrases

In the previous experiments, we used a set of 1000 control phrases for the context augmentation knob. In the next experiment, we vary that number and the results are shown in Table 9. Again, we see that context augmentation alone (row C) and the combination of context augmentation and decoder mixing knobs (row C+B) do not result in more generated questions (at turn level). However, most notably, when the three knobs are combined (row C+A+B) with ten control phrases, the average number of generated questions is 147.4, which is almost three times more than the base case (row None). Also, as the number of control phrases increases, the number of generated questions decreases. This could be due to the averaging operator of calculating control codes which might be causing the question aspect of the control phrases not to be the only prominent shared feature between the control phrases.

Knobs Size of Biasing Set
10 100 1K 10K
Table 9: Effect of number of question control phrases on the number of generated questions at turn-level. Configurations of A, B, and C are defined in § 6.1.

6.3.2 Effect of Encoder for Control Phrases

So far in this section, we have used pre-trained BART’s encoder for creating the control codes from control phrases for the context augmentation knob. The next set of experiments show how using Topical-Chat fine-tuned EDT-NRG encoder instead would impact the results. The results of these experiments are shown in Table 10. We see from these numbers that for cases where the decoder mixing knob is used (rows C+B and C+A+B) using Topical-Chat fine-tuned EDT-NRG encoder for building the control code results in much fewer questions in the generated responses. One potential explanation for this could be that the decoder mixing knob’s contributions are hindered by using an encoder that it has not been associated with before.

On the other hand, when the decoder mixing knob is not involved (rows C and C+A) we see that using the Topical-Chat fine-tuned EDT-NRG encoder is increasing the number of generated questions which again could be due to the familiarity of the decoder with the encoder used for the context augmentation knob.

Knobs Biasing Encoder
None 58.24.2 56.46.8
C 58.65.4 67.02.6
CA 70.43.7 74.07.2
CB 50.23.1 41.83.6
CAB 83.44.8 54.23.8
Table 10: Comparing control over number of generated questions (at turn-level) between pre-trained and fine-tuned encoder for generating control codes. Configurations of A, B, and C are defined in §§ 6.1 and 6.1.

6.3.3 Effect of Source of Control Phrases

In this experiment, we evaluate the impact of changing the source of these questions. More specifically, we sample 1000 questions from the SQuAD Rajpurkar et al. (2018) dataset for creating the control code from the context augmentation knob. The results in Table 11 show that there is no conclusive and significant difference between the two sources (Topical-Chat and SQuAD) of biasing phrase in terms of the final number of generated questions, which suggests that the source of control phrases might not be an important factor, particularly for questions. Moreover, this could also be due to the smoothing out of domain-specific features from the averaging operation in the context augmentation knob.

Knobs Biasing Encoder # of Questions
Topical-Chat SQuAD
Table 11: Comparing control over the number of generated questions (turn-level) between in-domain (Topical-Chat) and out-of-domain (SQuAD) control phrases. Configurations of A, B, and C are defined in §§ 6.1 and 6.1.

6.4 Context Augmentation for Other Attributes

In the previous results, we have shown zero-shot controlled generation using the proposed control knobs (specifically the context augmentation knob) for generating questions. One question here is whether such control in generation can be observed for more specific types of questions or over concepts beyond questions, such as other dialog acts like feedback or semantic aspects like sentiment. Next, we explore the answer to these questions.

6.4.1 Fine-Grained Question Control

In this section, we look into the ability of the control knobs to generate fine-grained question types. We consider the ISO-based Dialog Act Scheme in Mezza et al. (2018), and in particular, we choose the question types from the subset used in Hedayatnia et al. (2020). These include PropQ, ChoiceQ, and SetQ question types. Table 12 explains what these three types of questions are using examples.

Definition Example
Do you know what the University
of Iowa’s locker room is?
SetQ Wh-question
What about you?
ChoiceQ Or-question
Or does it become a problem?
Table 12: Fine-grained question types considered for control.
Evaluation Approach.

For evaluating the accuracy of generating these fine-grained questions, we initially used the off-the-shelf SVM-based dialog-act classifier proposed in

Mezza et al. (2018). However, we found that this model has a slow inference rate, and as a result, we trained an RNN model with a similar training setup as the SVM model. We use this RNN-based model as the primary evaluator of our generated responses. To establish the performance of this model, we conduct human evaluations on a set of sentences (full details in § C.3). The dialog acts tagged by this model achieves F1 score of 0.83, which indicates that this model is a relatively reliable tool for evaluating the generated responses.

Control Phrases.

While it is possible to curate random examples of fine-grained questions from the Topical-Chat training set, we take a different approach here. We sample the most frequent phrases of these question types from the training set and curate small sets of these questions’ prefixes. For example, for PropQ we curate control phrases that include “Do you like”, “Do you know”, “Have you ever”, “Are you a”, etc. The goal of this approach is two-fold. First, we aim to show that we can achieve controlled generation even with a very small set of control phrases. Second, to show that there is no particular requirement for the control phrases to be well-formed questions. As seen in the results below, we observe that incomplete sentences also work as effective control phrases.

Knobs Biasing Code Predictions
PropQ SetQ ChoiceQ Feedback
None None 30.8 10.8 0.0 72.4
C 43.4 12.0 0.0 64.8
CA 87.6 10.2 0.0 38.4
CB 68.2 18.8 0.0 57.4
CAB PropQ 183.2 6.2 1.0 26.4
C 35.2 13.2 0.0 65.0
CA 42.8 23.6 0.0 56.4
CB 29.0 34.2 0.0 59.6
CAB SetQ 37.4 105.2 1.0 43.4
C 33.6 12.4 1.0 67.8
CA 50.0 15.4 0.0 52.8
CB 29.8 18.8 1.5 67.0
CAB ChoiceQ 91.6 20.0 2.6 45.2
C 31.2 9.4 0.0 71.0
CA 33.8 10.0 0.0 85.6
CB 16.6 9.0 1.0 97.2
CAB Feedback 16.0 13.4 0.0 141.4
Table 13: Comparing control over the number of questions (at sentence-level) in 200 responses when biased with different fine-grained question types. Configurations of A, B, and C are defined in § 6.1. Examples of each of the question types and Feedback could be found in Table 14.
A: do you watch the NFL?
B: I sure do. How about yourself?
A: Me too. Did you know women can play
in the NFL too?
B: Did not know there, are there limitations though?
A: Nope as long as you qualify,
you are allowed to play.
“the NFL uses tracking chips embedded in players

shoulder pads to record a players every move

on the field. They are also located inside footballs to
track: velocity, rotation, ball speed and location.”
B: Do you know much about tracking? Apparently
they use tracking chips in the shoulder pads and the
balls to record a player’s movements.
B: What do you think of the tracking chips
they put in the players?
B: That is awesome. Did you know they are putting
tracking chips in the players shoulder pads to
record their movements?
Table 14: A sample dialog from the consolidated test set with the controlled generations by Knobs C+A+B. Additional examples (randomly sampled) are provided in § C.5.

The results are summarized in Table 13. We can see that generating PropQ and SetQ questions could be successfully accomplished by using control knobs. Specifically, PropQ questions are generated significantly more compared to SetQ questions. ChoiceQ questions, however, are not being generated using the control knobs. One reason for this could be that such questions are quite rare in the training set of the Topical-Chat, and as a result, the model has not learned how to generate them. The other factor could be that the control phrases for ChoiceQ are not quite representative of what ChoiceQ questions are. A few sample generations for each one of these question types are shown in Table 14.

In terms of precision of control, from the numbers presented in Table 13 we can also study how precise the control knobs are. More specifically, we want to determine when the goal is generating more SetQ questions, how much difference is observed in the number of generated PropQ questions. In general, we see that the models largely adhere to the provided additional context. The only place where the precision is poor is ChoiceQ, which means conditioning with ChoiceQ does not improve the number of ChoiceQ questions generated, but it increases the number of generated PropQ questions. This could be due to the similarity between ChoiceQ and PropQ questions in general (see the control phrases in the § C.2).

6.4.2 Generating Feedback Responses

Beyond questions, we show that the proposed control knobs are also effective in creating other dialog acts such as feedback. We evaluate the controllability of feedback acts using the same RNN-based evaluator (§ C.3). In the sampled example in Table 13, we can see that using feedback control codes helps with generating significantly more responses that are providing feedback for the previous turn.

6.4.3 Sentiment

We also investigate the use of the context augmentation knob to generate more positive responses. For control phrases we use “That’s awesome”, “That’s cool”, “Oh that is great”, “It’s great to”, and “It’s wonderful to”. The results are shown in Table 15

. Here again, we see that using the context augmentation knob alone (row C) does not result in statistically significant improvements in the positivity of sentiment of the generated responses (measured using an off-the-shelf sentiment classifier

888 However, similar to the previous experiments, when the context augmentation knob is combined with attention biasing and decoder mixing knobs (row C+A+B), we see the most significant increase in the average sentiment scores. It should be noted that the base model (row None) already has a very high average sentiment score (around 0.57) which is indicating that the majority of the responses created by the base model are already positive. This could potentially explain why the increase in the average positivity of the sentiment, although significant, is not very high.

Knobs positive p-value
None 0.56970.016 -
C 0.55650.007 0.1851
CA 0.57200.017 0.8513
CB 0.61130.021 0.0155
CAB 0.65080.022 0.0005
Table 15: Sentiment scores (1positive and 0negative) averaged over 5 runs. Models significantly different from base model (based on a two-tailed unpaired t-test) are highlighted using boldface. Configurations of A, B, and C are defined in §§ 6.1 and 6.1. Refer to § C.5 for some qualitative examples.

7 Deeper Dive into Attention Mechanisms in Encoder-Decoder Transformers

So far in this work, we have shown the feasibility and efficacy of zero-shot controlled NLG by directly manipulating the internal workings of trained encoder-decoder transformer models at generation time through the proposed control knobs in § 3. Note that this approach to controlled generation does not require any costly training or gradient-based optimization steps during inference. Although we present results for K-NRG, the control knobs could be used for zero-shot control of any EDT-NLG model.

The counter-intuitive fact that trained encoder-decoder transformer models could go through such drastic manipulations and not only not get fully derailed by them, but also generate sentences with the desired attributes raises many questions. In this section, we try to address some of these questions. Moreover, we believe this observation to have consequences beyond the controlled NLG problem, including more compute efficient approaches towards training these models that we also discuss in this section.

7.1 Manipulating Self-Attention

So far we have studied the application of the attention biasing knob on cross-attention modules in encoder-decoder transformer models to control the generation, but we have not yet applied this knob on self-attention modules (D in Figure 2).

7.1.1 Self-Attention Biasing

Fluency Samples of Model Responses
{FT} 9.66
- I agree. I think it’s funny that the highest score ever was 222-0.
That must have been a humiliating defeat.
{PT} 10.52
- I do like the Patriots. What about you?
{FT,PT} 9.84
- I did not know that. I wonder if they are allowed
to eat in restaurants.
{FT,FT,FT,FT} 9.68
- Not really, I think they have some pretty good movies, I don’t know.
{FT,FT,FT,FT,PT} 9.65
- I did not know that! I really love the batman character. Did you know
he was originally named Bat-Man?
Table 16: Mixing multiple self-attention decoder blocks. Here, FT represents a fine-tuned self-attention block and PT represents the pre-trained self-attention block. For each decoder layer, we perform a convex combination of the participating self-attention functions as per Figure 8. The last column represents generated responses (randomly selected).

We investigated the question of whether the attention biasing knob could be applied to self-attention in a similar way that it was successfully applied to cross-attention. Through our experiments we found that the answer to this question is probably negative. For instance, in a series of experiments we tried to use the attention biasing knob on self-attention modules of the decoder so that the model pays more attention to tokens immediately preceding the present generation time step999We applied a linear decay bias that from to for preceding time steps ,

, …. This profile in some aspects is similar an n-gram language model where token

is primarily conditioned on its immediately preceding n-gram tokens., and we notice that this would cause the generated sentences to be not fluent anymore. More concretely, the average perplexity of generated sentences, measured using a pre-trained GPT-2 model used as a proxy for fluency, is 142.6 compared to the same model with no self-attention biasing that gives the average perplexity of 40.2. This increase in perplexity shows itself as many grammatical and syntactical mistakes in the generated sentences.

This experiment along with several other similar failed experiments that we ran on biasing the self-attention modules in the decoder of encoder-decoder transformer models raises the hypothesis that perhaps decoder self-attention in these models is primarily responsible for fluency of the generated sentences, and that is why manipulation of self-attention results in loss of fluency. More formally:

Hypothesis 1

In EDT-NLG models, fluency of the generation is managed by decoder self-attention.

It should be noted that the observation in § 5 that biasing cross-attention, while decoder self-attention remains intact, does not negatively impact fluency of generations is another strong evidence for this hypothesis to be correct.

7.1.2 Self-Attention Mixing

Inspired by creftypecap 1, one could ask whether decoder self-attention in EDT-NLG works independently of the task that the model is trained on. In other words, would it be possible to replace the self-attention modules of one trained EDT-NLG model with those of another trained EDT-NLG model and still get fluent generation out of these models?

To examine this, we take two BART models, one is fine-tuned for the Topical-Chat task (here is referred to as fine-tuned BART), and the other is the original pre-trained BART model. Note that the fine-tuned BART is trained to generate responses for a given dialog history and a knowledge snippet, whereas the pre-trained BART is trained to reconstruct an input sentence. We replace the parameters of the decoder self-attention of fine-tuned BART with the parameters of the pre-trained BART, and we generate for the Topical-Chat task using the resulting model. The result is generated responses that are surprisingly fluent (row {PT} in Table 16) with perplexity 10.53 which is only slightly higher than the perplexity of the Topical-Chat fine-tuned model which is 9.66 (row {FT} in Table 16). It is important to note that these self-attention modules are significantly different from one another, in that the average Frobenius norm of the difference between Q, K, and V matrices are 5.25, 5.60, and 5.19, respectively, where the average norm of the matrices are 61.10, 61.15, and 33.56, respectively. This could be interpreted as significant difference between the two self-attention modules. This result is quite surprising in that we are replacing all of the parameters of self-attention modules of one trained encoder-decoder transformer model (EDT-NRG for Topical-Chat) with the parameters of another model that has an identical architecture but is trained for a completely different task, and we still see that the performance of the resulting model is not impacted significantly.

Next we examine the perplexity of generations for the case where the self-attention modules for fine-tuned BART and pre-trained BART are combined. The architecture is shown if Figure 8. In this architecture at every generation time step two different self-attention modules (one from Topical-Chat fine-tuned BART and one from pre-trained BART) are run and the results are combined through a convex combination. The combined self-attention output then goes through the rest of layers of fine-tuned BART. From the results, shown in row {FT,PT} in Table 16, we see that the average generation perplexity in this setting is 9.84 which is very close to this metric for fine-tuned BART, which indicates that the model is generating fluent responses.

Figure 8: Mixing self-attention of fine-tuned BART for Topical-Chat dataset and pre-trained BART

We next combine four different (trained with different random seeds) fine-tuned BART models’ self-attention modules (row {FT,FT,FT,FT} in Table 16), and also four different fine-tuned BART and pre-trained BART models’ self-attentions modules (row {FT,FT,FT,FT, PT} in Table 16), in the same way that is depicted in Figure 8. We see that the generations from both of these models have a perplexity that is very close to fine-tuned BART’s generations perplexity, which again is an indicator that these models with combined self-attention modules generate fluently. It should be noted that the self-attention modules of the four different fine-tuned BART models are significantly different from each other in that the average Frobenius Norm of the difference between Q, K, and V matrices are 6.79, 6.98, and 6.29 respectively101010The average is calculating among differences of pairs of the same matrices from different models.

We can see from the results that convex combinations (average to be more specific) of decoder self-attention from models that are trained to generate fluent sentences also generates fluent sentences. Moreover, adding random noise to a decoder self-attention module that is trained to generate fluent sentences results in large hits to the fluency of the generated results. These observations would establish additional evidence for Hypothesis 1. Also, fluency of convex combination of trained decoder self-attention suggests that, intuitively speaking, there might be a flat surface of fluency in the space of parameters of these encoder-decoder transformer models. It is important to note that if the encoder-decoder transformer architecture is replaced with a transformer decoder architecture, majority of what was discussed above will not hold true. As to why, it is important to notice that for a transformer decoder trained on the Topical-Chat dataset, the decoder self-attention is responsible both for maintaining fluency of the generated response as well as its relevance to the previous turns of the conversation.

7.2 Rethinking Transformer Decoder Architectures

If Hypothesis 1 holds, one question that arises is are there alternative layouts for transformer decoder that could better facilitate this separation of roles between decoder self-attention and decoder cross-attention? Another natural question here is that if the hypothesis is correct and in a pre-trained encoder-decoder transformer architecture decoder self-attention is already able to generate fluently, can we freeze self-attention parameters during fine-tuning of these pre-trained models? In this section we address these two questions.

7.2.1 Parallel Self- and Cross-Attention for Transformer Decoders

Figure 9: Parallel self- and cross-attention in a transformer decoder layer.
Decoder PPL Sample Responses
Freq Rare
Sequential 9.66 9.88 -
Parallel 10.22 10.76
- It was released in 2017. I was excited
to see it. I am excited to see what the
new Avengers movie will be about.
- He has been good for a long time.
I think he is going to join the Giants.
Table 17: Parallel decoder self- and cross-attention EDT-NRG performance on the Topical-Chat Dataset.

In the current architecture of transformer decoder self-attention, cross-attention, and feed forward layers are sequentially chained together across multiple layers of stacked transformer decoders (Figure 2). If Hypothesis 1

holds and, as a result, self-attention and cross-attention models in encoder-decoder transformer models have different roles, could a different architecture, in which decoder self-attention and cross-attention are linked not sequentially but in parallel, work better for NLG tasks? The intuition behind this alternative architecture would be somewhat adding more separation between self-attention and cross-attention in the sense that the output of one is not directly the input of the other. To examine this idea, we take a pre-trained BART and change the architecture of its transformer decoder so that the self-attention and cross-attention modules are linked together in a parallel manner instead of the original sequential way. Figure

9 shows this alternative transformer decoder architecture. Note that we use the original pre-trained BART values, which were trained in the standard sequential decoder, for the parameters of both self- and cross- attention modules for initialization. We then fine-tune this model with the Topical-Chat task. Table 17 presents the results where the model does not observe a significant degradation in terms of perplexity. These results along with some qualitative review of the generations suggest positive indications towards the feasibility of such alternate parallel architectures for transformer decoders.

7.2.2 Fine-Tuning Only Cross-Attention

Trained EDT-NLG models could generate fluently. If Hypothesis 1 holds and fluency of the output in these models is managed by decoder self-attention, then if we freeze the decoder self-attention parameters during fine-tuning of these models for other NLG tasks (e.g., NRG for Topical-Chat), would the fine-tuned model still perform well? In a set of experiments, we study this question. More specifically, we freeze all parameters of the transformer decoder except the cross-attention parameters of a pre-trained BART model and we fine-tune the rest of the parameters on the Topical-Chat dataset. The top half of Table 18 presents the results and compares it with the case where all of the pre-trained BART model are fine-tuned (first row). In the case that decoder self-attention parameters of pre-trained BART (all decoder self-attention parameters except for cross-attention parameters) are fixed and the rest of the parameters are fine-tuned (second row) we see that the impact on perplexity is very small111111We train three independent models with random seeds for both variants. According to two-tailed unpaired t-test, the difference is not statistically signification (p-values are 0.11 and 0.25 for frequent and rare test sets, respectively).

For the third and fourth rows of the table, the parameters of the decoder self-attention of pre-trained BART are replaced with random values. We see that random self-attention parameters when fixed during training cause very high perplexity values (fourth row) and the generations are no longer fluent. This experiment was conducted to ensure that if Hypothesis 1 is correct, not any random decoder self-attention could result in fluent generations. On the other hand, when the decoder self-attention parameters of pre-trained BART are set to random values, but they are trained during fine-tuning (third row), fluency of the model to some extend is regained.

PPL ( better)
Dec-Self-Attn. (Test-freq) (Test-rare)
Bart-large 406M 406M - full 9.310.05 9.370.02
254M (37.43%) - only cross-attn. 9.400.06 9.400.01
Bart-large 406M 406M full 11.29 11.51
254M (37.43%) only cross-attn. 18.07 17.93
Bart-large 381M 381M (6.15%) - full 9.71 9.88
(top-6) 228M (43.84%) - only cross-attn. 12.7 12.9
Bart-large 381M 381M (6.15%) - full 10.78 11.65
(bottom-6) 228M (43.84%) - only cross-attn. 25.08 26.97
Bart-large 381M 381M (6.15%) - full 9.88 10.14
(alternate-6) 228M (43.84%) - only cross-attn. 10.32 10.40
Table 18: Training BART-large models with varying initialization (of decoder’s self-attention) and decoder self-attention freezing (freezing all decoder parameters except cross-attention and shared embedding matrices) strategies. All are relative changes with respect to the BART large model size.

7.2.3 Removing Some of the Decoder Cross-Attention Modules

We also apply another set of modifications to the BART model in which we remove the cross-attention of some of the transformer decoder layers. In 3 different settings we remove the cross-attention for top half of the decoders, bottom half of the decoders, and every other decoder. The results of these settings are presented in the bottom half of Table 18. For the case where only the top half of the decoder layers keep their cross-attention modules (“top-6” rows in Table 18) we see that when all the parameters are fine-tuned (first row of “top-6” rows), the model still performs quite well when compared to the original model. Remember that these models have approximately 8% less trainable parameters compared to the original BART model. In this setting when all but cross-attention weights on the decoder side are not fine-tuned (second row of “top-6” rows), perplexity goes even higher, but is still somewhat acceptable. When only the bottom half of the transformer decoders keep their cross-attention (“bottom-6” rows in Table 18), the performance of the fine-tuned model (first row of “bottom-6” rows) model is worse than the case where top half of the decoder transformers kept their cross-attention, but the performance remains in the acceptable range. However, in the case where decoder self-attention parameters are also fixed during training (second row of “bottom-6” rows), we see a large hit to the performance and the generations are no longer fluent. Finally in the case where every other transformer decoder keeps its cross-attention (“alternate-6” rows in Table 18) produces the most interesting results. We can see that in this case even when decoder self-attention parameters are fixed during fine-tuning the perplexity on the test-rare set is 10.40 which is only slightly higher than 9.36 which is for the case where all the transformer decoder cross-attention modules are kept and all the parameters are fine-tuned. This result is interesting because in this case on the decoder side none of the self-attention parameters are trained; nor are half of the cross-attention parameters are even in the model. In fact, in this case the number of trainable variables is only 56% of the trainable variables in the base model and the model’s performance is strikingly high.

7.2.4 Efficient Training of Encoder-Decoder Transformer Models

The numbers in Table 18 suggest that there could be more efficient ways of training EDT models for NLG applications. In the experiments that are summarized in this Table, we see that on the one hand, freezing parameters of decoder self-attention modules to pre-trained values does not hugely impact the performance of the model. Note that this finding is aligned with Hypothesis 1. Freezing these parameters during training (or fine-tuning) would mean that gradients for these parameters need not to be kept, tracked, or communicated between GPUs or comput nodes, which would result in significant savings in compute resources. On the other hand, dropping cross-attention modules from some of the transformer decoders would also result in the same savings and reduction in the model size, which results in savings during both training and inference time.

8 Conclusion

In this work, we propose novel approaches to controlling NLG models that are based on encoder-decoder transformers. In these approaches, we manually intervene in the internal computations of these EDT-NLG models at generation time to achieve the control goals in a zero-shot manner. These manual interventions are applied through three proposed control knobs: attention biasing, decoder mixing, and context augmentation knobs. Some aspects of applying these knobs on the EDT-NLG models are quite counter-intuitive. Most prominently, the fact that we can manually intervene in computations of these NLG models in rather intrusive ways without derailing the generation process entirely, comes as a surprise. Building on this observation, we then see that in most cases, intuition-based design of manual interventions produce results that are aligned with the intuition behind the design.

One notable aspect of the results of experiments on the application of these knobs is that using the combination of these knobs leads to the most favorable results. This was specifically pronounced for the context augmentation knob, where applying it alone would result in little to no control in the generation process. However, when combined with the other two knobs, it would result in a large increase in controllability. The context augmentation knob could be thought of as an alternative way of prompting generative language models Brown et al. (2020). But it is known that prompting fails to achieve the control goals in models that are not enormous in size Schick and Schütze (2020). For instance, we know that prompting works very well for GPT-3, but not necessarily for GPT-2. In this work, we show that by adding cross-attention biasing and decoder mixing to the context augmentation knob (which could be thought of as an alternative to prompting) we can achieve zero-shot controllability for models that are orders of magnitude smaller than GPT-3 (e.g., BART-base).

When used for controlling the generation, we see that while applying the attention biasing knob on cross-attention achieves the desired outcome, it turns out that applying this knob on self-attention results in loss of fluency. That leads us to the hypothesis that decoder self-attention in EDT-NLG models is responsible for fluency of the generations. We examine this hypothesis in several ways, and all the evidence points towards the hypothesis being correct. Inspired by this hypothesis, we propose alternative architectures for transformer decoders that are significantly more compute efficient during both training and generation.

In this work, we show given a control goal, how to generate according to the goal in a zero-shot fashion. One obvious direction for future research is how to develop these control goals, especially in the context of dialog systems, and building models that can both determine the control goals and generate according to them in an end-to-end fashion. As another future research direction, more computational studies are needed to determine how to benefit from the proposed results in designing more compute-efficient encoder-decoder transformer models.

Application of transformers, and attention in general, goes far beyond NLG and even NLP, and these mechanisms are heavily employed in machine vision, multi-modal learning, and more. Zero-shot biasing of attention through the attention biasing knob that is introduced in this work could potentially be useful in these areas not only for generation, but also other tasks such as classification, segmentation, etc. Also, it should be noted that the attention biasing knob is not limited to the attention mechanism in the context of transformers and could be applied to any attention mechanism, whether it is part of a transformer or not.

9 Acknowledgements

We are truly grateful for valuable feedbacks from Gokhan Tur, Yang Liu, Behnam Hedayatnia, Nicole Chartier, and Mohit Bansal.


  • A. Baheti, A. Ritter, J. Li, and B. Dolan (2018) Generating more interesting responses in neural conversation models with distributional constraints. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018

    , E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.),
    pp. 3970–3980. External Links: Link, Document Cited by: §2.
  • A. Belz and E. Reiter (2006) Comparing automatic and human evaluation of NLG systems. In EACL 2006, 11st Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy, D. McCarthy and S. Wintner (Eds.), External Links: Link Cited by: §5.2.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §2, §8.
  • M. Daniluk, T. Rocktäschel, J. Welbl, and S. Riedel (2017) Frustratingly short attention spans in neural language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §3.1.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2020) Plug and play language models: A simple approach to controlled text generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §2.
  • Z. Fan, Y. Gong, D. Liu, Z. Wei, S. Wang, J. Jiao, N. Duan, R. Zhang, and X. Huang (2021) Mask attention networks: rethinking and strengthen transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1692–1701. Cited by: §3.1.
  • M. Ghazvininejad, X. Shi, J. Priyadarshi, and K. Knight (2017) Hafez: an interactive poetry generation system. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, System Demonstrations, M. Bansal and H. Ji (Eds.), pp. 43–48. External Links: Link, Document Cited by: §2.
  • K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, and D. Hakkani-Tür (2019) Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Proc. Interspeech 2019, pp. 1891–1895. External Links: Document, Link Cited by: §4.1, Table 1.
  • B. Hedayatnia, K. Gopalakrishnan, S. Kim, Y. Liu, M. Eric, and D. Hakkani-Tür (2020) Policy-driven neural response generation for knowledge-grounded dialog systems. In Proceedings of the 13th International Conference on Natural Language Generation, INLG 2020, Dublin, Ireland, December 15-18, 2020, B. Davis, Y. Graham, J. D. Kelleher, and Y. Sripada (Eds.), pp. 412–421. External Links: Link Cited by: §A.2, §B.2, §4.1, §4.2, §6.4.1, footnote 15.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §A.2.
  • A. Holtzman, J. Buys, M. Forbes, A. Bosselut, D. Golub, and Y. Choi (2018) Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 1638–1649. External Links: Link, Document Cited by: §2.
  • M. Huang, X. Zhu, and J. Gao (2020) Challenges in building intelligent open-domain dialog systems. ACM Trans. Inf. Syst. 38 (3), pp. 21:1–21:32. External Links: Link, Document Cited by: §1.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: A conditional transformer language model for controllable generation. CoRR abs/1909.05858. External Links: Link, 1909.05858 Cited by: §2.
  • D. Khashabi, G. Stanovsky, J. Bragg, N. Lourie, J. Kasai, Y. Choi, N. A. Smith, and D. S. Weld (2021) GENIE: A leaderboard for human-in-the-loop evaluation of text generation. CoRR abs/2101.06561. External Links: Link, 2101.06561 Cited by: §B.2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 7871–7880. External Links: Link, Document Cited by: Appendix A, §1, §4.1.
  • M. Luong, H. Pham, and C. D. Manning (2015)

    Effective approaches to attention-based neural machine translation

    In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §3.1.
  • A. Madaan, A. Setlur, T. Parekh, B. Póczos, G. Neubig, Y. Yang, R. Salakhutdinov, A. W. Black, and S. Prabhumoye (2020) Politeness transfer: A tag and generate approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 1869–1881. External Links: Link, Document Cited by: §1.
  • A. Madotto, E. Ishii, Z. Lin, S. Dathathri, and P. Fung (2020a) Plug-and-play conversational models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 2422–2433. External Links: Link, Document Cited by: §2.
  • A. Madotto, Z. Lin, Y. Bang, and P. Fung (2020b) The adapter-bot: all-in-one controllable conversational model. CoRR abs/2008.12579. External Links: Link, 2008.12579 Cited by: §1.
  • S. Mezza, A. Cervone, E. A. Stepanov, G. Tortoreto, and G. Riccardi (2018)

    ISO-standard domain-independent dialogue act tagging for conversational agents

    In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), pp. 3539–3551. External Links: Link Cited by: §C.3, §6.4.1, §6.4.1.
  • I. Misra, A. Shrivastava, A. Gupta, and M. Hebert (2016) Cross-stitch networks for multi-task learning. In

    2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016

    pp. 3994–4003. External Links: Link, Document Cited by: §3.2.
  • T. Nguyen, X. Nguyen, S. Joty, and X. Li (2020) Differentiable window for dynamic local attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6589–6599. Cited by: §3.1.
  • T. Niu and M. Bansal (2018) Polite dialogue generation without parallel data. Trans. Assoc. Comput. Linguistics 6, pp. 373–389. External Links: Link Cited by: §1, §3.2.
  • N. Peng, M. Ghazvininejad, J. May, and K. Knight (2018) Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, New Orleans, Louisiana, pp. 43–49. External Links: Link, Document Cited by: §1.
  • S. Prabhumoye, A. W. Black, and R. Salakhutdinov (2020) Exploring controllable text generation techniques. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, D. Scott, N. Bel, and C. Zong (Eds.), pp. 1–14. External Links: Link, Document Cited by: §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, I. Gurevych and Y. Miyao (Eds.), pp. 784–789. External Links: Link, Document Cited by: §6.3.3.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 5370–5381. External Links: Link, Document Cited by: §4.2.
  • C. Sankar and S. Ravi (2019)

    Deep reinforcement learning for modeling chit-chat dialog with discrete attributes

    In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, SIGdial 2019, Stockholm, Sweden, September 11-13, 2019, S. Nakamura, M. Gasic, I. Zuckerman, G. Skantze, M. Nakano, A. Papangelis, S. Ultes, and K. Yoshino (Eds.), pp. 1–10. External Links: Link, Document Cited by: §4.2.
  • T. Schick and H. Schütze (2020) It’s not just size that matters: small language models are also few-shot learners. CoRR abs/2009.07118. External Links: Link, 2009.07118 Cited by: §4.1, §8.
  • A. See, S. Roller, D. Kiela, and J. Weston (2019) What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 1702–1723. External Links: Link, Document Cited by: §1, §2, §2, §4.2, §6.1, §6.2.
  • S. Serrano and N. A. Smith (2019) Is attention interpretable?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2931–2951. Cited by: §3.1.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468. Cited by: §3.1.
  • J. Shin, P. Xu, A. Madotto, and P. Fung (2019) HappyBot: generating empathetic dialogue responses by improving user experience look-ahead. CoRR abs/1906.08487. External Links: Link, 1906.08487 Cited by: §4.2.
  • E. M. Smith, D. Gonzalez-Rico, E. Dinan, and Y. Boureau (2020) Controlling style in generated dialogue. CoRR abs/2009.10855. External Links: Link, 2009.10855 Cited by: §2.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4077–4087. External Links: Link Cited by: §3.3.
  • A. A. Syed, F. L. Gaol, and T. Matsuo (2021) A survey of the state-of-the-art models in neural abstractive text summarization. IEEE Access 9, pp. 13248–13265. External Links: Link, Document Cited by: §1.
  • C. van der Lee, A. Gatt, E. van Miltenburg, S. Wubben, and E. Krahmer (2019) Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019, K. van Deemter, C. Lin, and H. Takamura (Eds.), pp. 355–368. External Links: Link, Document Cited by: §B.2, §5.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §1.
  • Z. Wu, M. Galley, C. Brockett, Y. Zhang, X. Gao, C. Quirk, R. Koncel-Kedziorski, J. Gao, H. Hajishirzi, M. Ostendorf, and B. Dolan (2020) A controllable model of grounded response generation. CoRR abs/2005.00613. External Links: Link, 2005.00613 Cited by: §2, §2, §4.2.
  • B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao, and T. Zhang (2018) Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4449–4458. Cited by: §3.1.
  • S. Yang, Y. Wang, and X. Chu (2020)

    A survey of deep learning techniques for neural machine translation

    CoRR abs/2002.07526. External Links: Link, 2002.07526 Cited by: §1.
  • W. You, S. Sun, and M. Iyyer (2020) Hard-coded gaussian attention for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7689–7700. Cited by: §3.1.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: I have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 2204–2213. External Links: Link, Document Cited by: §1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §B.1.

Appendix A Model Details

All our experiments, except for § 7.2.3, utilize the BART-base model Lewis et al. (2020)121212 Below, we detail the input format with respect to the K-NRG problem.

a.1 Formatting the Input

As mentioned in § 4.1, our input comprises a knowledge snippet and the dialog history . Here, dialog history is the last five turns in the dialog, with respect to the response. To prepare the input, we assign a fixed number of tokens for each section in the input. We call each section a bucket. If the actual number of tokens of an input section is less than the total tokens assigned for that bucket, we pad the input to infill the empty tokens. In particular, we provide tokens for the knowledge snippet and tokens for each turn in the dialog history.

We start the input sequence with the special token , followed by the knowledge snippet’s bucket. Next, we include the dialog history, whose turns use alternate start symbols: . Overall, our input comprises tokens, 33 knowledge tokens plus 26 turn tokens for each of the 5 turns. On the decoder side, for teacher-forcing, we provide the human response as the input, along with the start token .

a.2 Training Details and Hyper-Parameters

For training the models, we follow the simple maximum likelihood-based training using ground-truth human responses. It should be re-emphasized that we do not use any of the control knobs during training. Thus, for fine-tuning the BART model on the Topical-Chat data, we train the model for a maximum of 10 epochs with early stopping (patience = 1). The early stopping metric is applied on the average perplexity of the validation set (frequent split). We train with a batch size of 5, gradient accumulation of 4, and learning rate of


For inference, we follow Hedayatnia et al. (2020) and utilize nucleus sampling Holtzman et al. (2020) with a top-p value of . Top-k is set to and temperature is set to . The maximum length of the responses is set to tokens. We experiment with other values of top-p, but do not observe significant changes in control.

Appendix B Informativeness Experiments

b.1 Additional Metrics

In this section, we present the extended results with respect to Table 3. First, we detail the automatic metrics that we consider for the informativeness experiments.

Comparing with Human Responses.

We test the quality of the responses by calculating automatic metrics with respect to the ground-truth human responses. The set of metrics include, perplexity (PPL), Unigram F1 (F1), BLEU, ROUGE, METEOR, and also the model-based BertScore Zhang et al. (2020) (B-Score).

Comparing with Knowledge Snippet.

To compare the amount of knowledge incorporated into the response, we calculate the above metrics with the knowledge snippet as the reference. We call these metrics, F1, BLEU, ROUGE, and METEOR131313In both the settings, we use BLEU-4 and ROUGE-L as the respective metrics..

Table 19 and Table 20 present the overall results for automatic metrics across both frequent and rare test sets, respectively. The results in these additional metrics follow similar trends to the metrics discussed in § 5.3.

Base Model vs. Fluency Relevance Informativeness
p-value SSD p-value SSD p-value SSD
Knob A (Knowledge) 0.90 No 0.79 No 0.001 Yes
Knob A (Gradual Knowledge) 0.90 No 0.90 No 0.001 Yes
Knob B () 0.9 No 0.13 No 0.001 Yes
Knob AB (Gradual Knowledge, ) 0.32 No 0.052 No 0.001 Yes
Table 21: Comparing variants to the base model for statistically significant mean difference in human evaluation scores as per Tukey’s HSD test. SSD refers to Statistically Significant Difference between the models for .

b.2 Human Evaluation Details

For the human evaluation of informativeness, relevance and fluency, we utilize the questionnaire discussed in § 5.2. We randomly sample 200 dialog instances from the combined test sets of frequent and rare splits in Topical-Chat (100 each). Each instance has the dialog history () with five dialog turns and the provided knowledge snippet (). However, we notice that the top-selected knowledge snippet for a particular dialog context may not always be entirely relevant for the response. This would affect the human evaluations as we specifically ask the annotators to prefer responses where facts from the knowledge snippet is manifested in generated response. Thus, we first filter the test sets before sampling the 200 instances. Specifically, we calculate the ROUGE metric between the knowledge snippet and the human response, and only consider the set of dialog contexts that have a higher value than the mean ROUGE value of 0.2141414Mean ROUGE between knowledge snippet and human response over the Topical-Chat training set is 0.2.. This filtration ensures that the knowledge snippet is relevant to the dialog context and thus can be a good test bed for measuring control over informativeness.

We use Amazon Mechanical Turk as the annotation platform and appoint three annotators per response sample across all model variants. To ensure high quality for annotations, we opt for annotators that are familiar with dialog evaluation and have a high overall performance as Turkers ( or higher approval rate and more than approved HITs).


The main results of the average Likert-based scores are summarized in Table 3. For fluency, relevance, and informativeness, the respective inter-annotator agreement (IAA) using Krippendorff’s alpha are as follows: , , . As relevance and informativeness are scored on a wider scale of 1 to 5, we categorize this 5-scale Likert scale into three bins comprising the values [1,2], [3], and [4,5]. As seen in the IAA values, we achieve high agreements for fluency. For relevance and informativeness, our IAA scores are similar to Hedayatnia et al. (2020) where the annotations were on a ranking-based format and not Likert-based. It is known in the literature that Likert-based annotations, due to factors like personal bias of annotators, are prone to have lower IAA scores van der Lee et al. (2019). Having said that, we choose this process as it provides a descent average value of each model variant Khashabi et al. (2021). Comparing the mean statistics of the variants, we perform statistical significance tests between all the variant pairs using the Tukey’s HSD test. We find that compared to the base model (no control knobs applied), none of the controlled variants have fluency or relevance scores that are statistically significant in difference. In contrast, all the variants achieve statistically significantly higher informativeness scores. This highlights that the variants are able to improve on informativeness without compromising on fluency and relevance.

Appendix C Context Augmentation Experiments

c.1 Human Evaluation Details

Similar to Table 21, we perform statistical tests for the variants introduced in § 6. The results are summarized in Table 22. As seen in the Table, we do not find any statistically significant difference between the controlled variants when compared to the base model in both fluency and relevance.

Base Model vs. Fluency Relevance
p-value SSD p-value SSD
C 0.623 No 0.802 No
CA 0.871 No 0.263 No
CB 0.900 No 0.900 No
CAB 0.900 No 0.199 No
Table 22: Comparing variants to the base model for statistically significant mean difference in human evaluation scores. SSD = “Yes” means Statistically Significant Difference between the models for .
PropQ SetQ ChoiceQ
Do you like How are you Or are you
Do you know How much do you Or do you
Do you watch How can you Is it just
Do you have What do you Is there a reason
Have you ever What kind of Do you think or
Are you a What did you think
Why is that
Why do you
Table 23: Control phrases for fine-grained question types.

c.2 Control Phrases

Table 23 presents the control phrases that we use for the respective fine-grained question generation.

c.3 Human Evaluation for Control Classifier

For investigating the reliability of the RNN-based control classifier, we proceed to check its accuracy with respect to human ground truths. We start by sampling sentences from the test set and ask two human annotators to annotate each sentence with the reduced tag-set: PropQ, ChoiceQ, SetQ, Feedback, Salutation, Statement, and Others. Here, the category Others collate infrequent dialog acts, such as Directives151515Full tag-set is available in Hedayatnia et al. (2020).

The annotators get a very high inter-annotator agreement with Krippendorff’s alpha . For the conflicts, we employ a third annotator to break the ties. With this, we get the ground truth annotations over the sentences. Table 24 demonstrates some of the sentences with the human annotation.

Sentence Dialog Act
Do you know what the University of
Iowa’s locker room is?
What about you?
Or does it become a problem?
I haven’t seen that one, but I have heard
that he tried to retire the first time.
Wow that is a lot.
I hope you have a good day too!
Table 24: Samples of sentences from Topical-Chat annotated with dialog acts.

Next, we automatically annotate the sentences with both the off-the-shelf SVM Mezza et al. (2018) and our RNN-based taggers. Table 25 shows the classification results, where the RNN-based classifier achieves a higher F1-score () than the SVM (). Notably, the F1 (and particularly the precision) of the question categories are very high which establishes this classifier as a reliable control evaluator.

Dialog Acts SVM RNN
Precision Recall F1 Precision Recall F1
PropQ 1.00 0.84 0.91 0.98 1.00 0.99
SetQ 0.88 0.80 0.83 1.00 0.91 0.95
ChoiceQ 1.00 1.00 1.00 1.00 0.82 0.90
Statement 0.60 0.63 0.62 0.53 0.76 0.62
Feedback 0.90 0.52 0.66 0.81 0.69 0.75
Salutation 0.86 0.93 0.89 0.93 0.96 0.95
Other 0.00 0.00 0.00 0.00 0.00 0.00
Accuracy - - 0.72 - - 0.83
Weighted Avg. 0.85 0.72 0.77 0.85 0.83 0.84
Table 25: F1 score of SVM and RNN model predictions over human annotated ground truth dialog acts. Numbers in boldface represent the higher F1-score between SVM and RNN models.

c.4 Human Evaluation for Detecting Questions Based on ‘?’

We measure the reliability of using ‘?’ as a proxy for valid questions. For this, we perform a human evaluation similar to § C.3, where we ask two annotators to identify questions. In particular, we take the generations from the CAB model in Table 8 and ask the annotators to mark a response as ‘yes’ if it contains a valid question and ‘no’ otherwise. As this is an objective question, we find only four disagreements between the annotators, which we resolve after discussing them. This provides the ground-truth question annotation. We compare this ground truth with the proxy-based approach that we employ, and present the results in Table 26. The numbers show very high similarity between ‘?’ based marker and human annotations. Particularly, the ‘?’ based marker obtains a 0.99 precision which demonstrates that the marker does not falsely count questions as positive. For recall, it has 0.87 which upon investigations we notice that they are all of the type “I wonder” where there is no explicit question mark. An example of such a question is: “I wonder if it was the NFL’s tracking chips or the tracking chips embedded in players shoulder pads to record a players every move on the field.”. Additionally, the Krippendorff’s alpha between the ‘?’ based and human annotations are 0.935. These results show that, in general, the ‘?’ based counts that are shown in Table 8 and related tables are highly reliable.

isQuestion? ‘?’ marker for Question
Precision Recall F1
yes 0.99 0.87 0.92
no 0.90 0.99 0.94
Accuracy - - 0.94
Weighted Avg. 0.94 0.94 0.93
Table 26: Comparison of proxy-based vs. human-based question detection.

c.5 Qualitative Examples

Table 27 presents a few dialog instances from the test set that we randomly sample. We show the generations by the variants detailed in § 6.1.

Table 28 presents examples for the fine-grained question types, and feedback and sentiment attributes. The generations are by the CAB model. While we observe the attributes in the generated samples, for the positive sentiment in the second example, we notice that the model does not generate with positive sentiment when the context is not appropriate for it. This is an important skill in grounded tasks like dialog response generation.

A: hi how are you? do you play video games?
B: I’m doing well how are you? Yes I like playing
video games, especially on Xbox!
A: I have not played Xbox or PS.
whhich do you prefer? why?
B: I like xbox more because honestly
it’s the only system I’ve had.
I remember getting one way back in 2001.
A: Microsoft gets many things wrong, but
ive heard good things about xbox games.
i played games as a kid, but now, im not interested.
What games do you like?
“there are controllers created for quadriplegics
so they can play XBOX 360 and other consoles.”
Base Model
B: I agree with you. I’ve heard that they
created controllers for quadriplegics so they can play
xbox 360 and other consoles.
B: I agree with you on that one!
I’m not surprised that they created controllers for quadriplegics,
I didn’t know that.
B: Did you know there are controllers created for
quadriplegics so they can play xbox 360 and other consoles?
B: Did you know that there
are controllers created for quadriplegics so they can play xbox 360
and other consoles? That’s really cool!
B: I agree with that! There are controllers created
for quadriplegics so they can play
XBOX 360 and other consoles.
I wonder how many of those are on the first one?
A: Good morning! Have you heard of Stan Lee?
Did you know that he did cameos in
all of the Marvel movies?
B: I have! He was such a good writer!
I love looking for him in his films!
A: I do too and I agree, he was a good writer.
Sometimes he even speaks, like in Captain America,
the Winter Soldier.
B: I didn’t know that! What else was he in?
A: My favorites are the Iron Man appearances.
It seems like he had fun with those because
in one Tony Stark thinks he’s Hugh Hefner
and in another he thinks he’s Larry King!
“Luckily, the titular character
prevents him from getting hit by a bus.”
Base Model
B: That’s hilarious! I love that character!
I loved him in The Iron Man!
B: That’s a good one!
Did you know he was the titular character?
B: I never knew that. I wonder if it is the same
character that killed off the Hulk?
B: That’s a lot of fun! I just can’t
believe he was actually doing that!
B: I love the first few Iron Man appearances,

but the last one was a little odd.

I wonder if he ever got hit by a bus?
Table 27: Randomly sampled dialogs along with the generations by control knobs from § 6.1. Questions in the turns are highlighted in green.
A: Did you know that the University of Iowa’s
locker room is painted pink? I wonder why?
B: I think I did hear something about that.
I imagine it is an attempt to psych the other team out.
A: So, it would be in the visiting team’s locker
room but not their own?
B: Right. Teams do all kinds of things to bother
the competition. I’ve heard of teams having heated benches
in the winter for themselves but not for the visitors.
A: I would hate a cold bench. Then again,
I wouldn’t want to be some place that
cold or watching football.
“That the highest score ever in a
football game occurred in 1916 when Georgia Tech
defeated Cumberland 222-0”
B: Do you know what the highest score ever
was in a football game?
B: What kind of football do you think they play?
I bet they score like 222 points
B: Yeah, that’s pretty extreme.
I’d rather not go to Georgia Tech.
They beat Cumberland 222-0.
That’s pretty crazy.
B: That’s right. Georgia Tech beat Cumberland 222-0 in 1916.
It was a good game.
They must have been exhausted.
A: Hi, do you watch nfl?
B: Yea, it’s crazy the Seahawks released Richard Sherman
A: yeah that is. did you know women can play too
B: I’ve heard there is no official rule barring female
players in the NFL, I wonder if there will ever be one
A: yeah me too. did you hear about that
rule for new orleans cheerleaders?
“New Orleans Saints cheerleaders are
forbidden from eating in the same restaurant as any NFL player
and if they are already dining at a restaurant and an NFL player
comes in after, the cheerleaders are required to leave.”
B: Do you think they can get away with it?
Or is it just a rule? I mean, if
they’re eating at the same restaurant as the players, then it’s okay.
B: What is the rule? It seems like
a lot of sexist stuff to me. I’d be happy
if the cheerleaders were allowed to eat at the
same restaurant as the players
B: That’s pretty sexist.
I wonder if they’ll ever get a job in the NFL
and get fired for being sexist and homophobic
B: That’s crazy, I wonder how they treat
the cheerleaders in the NFL
Table 28: Randomly sampled dialogs controlled by different fine-grained attributes (Table 13). Interestingly, the model does not generate a positive response when it is not appropriate to the context.