Counterfactual Language Model Adaptation for Suggesting Phrases

10/04/2017 · by Kenneth C. Arnold, et al.

Mobile devices use language models to suggest words and phrases for use in text entry. Traditional language models are based on contextual word frequency in a static corpus of text. However, certain types of phrases, when offered to writers as suggestions, may be systematically chosen more often than their frequency would predict. In this paper, we propose the task of generating suggestions that writers accept, a related but distinct task to making accurate predictions. Although this task is fundamentally interactive, we propose a counterfactual setting that permits offline training and evaluation. We find that even a simple language model can capture text characteristics that improve acceptability.




1 Introduction

Intelligent systems help us write by proactively suggesting words or phrases while we type. These systems often build on a language model that picks the most likely phrases based on the previous words in context, in an attempt to increase entry speed and accuracy. However, recent work Arnold et al. (2016) has shown that writers appreciate suggestions with creative wording, and can find phrases suggested based on frequency alone to be boring. For example, at the beginning of a restaurant review, “I love this place” is a reasonable prediction, but a review writer might prefer a suggestion of a much less likely phrase such as “This was truly a wonderful experience”; they may simply not have thought of this more enthusiastic phrase. Figure 1 shows another example.

Figure 1: We adapt a language model to offer suggestions during text composition. In the example above, even though the middle suggestion is predicted to be about 1,000 times more likely than the one on the right, a user prefers the right one.

We propose a new task for NLP research: generate suggestions for writers. Doing well at this task requires not only innovation in language generation but also interaction with people: suggestions must be evaluated by presenting them to actual writers. Since writing is a highly contextual creative process, traditional batch methods for training and evaluating human-facing systems are insufficient: asking someone whether they think something would make a good suggestion in a given context is very different from presenting them with a suggestion in a natural writing context and observing their response. But if evaluating every proposed parameter adjustment required interactive feedback from writers, research progress would be slow and limited to those with the resources to run large-scale writing experiments.

In this paper we propose a hybrid approach: we maintain a natural human-centered objective, but introduce a proxy task that provides an unbiased estimate of expected performance on human evaluations. Our approach involves developing a stochastic baseline system (which we call the reference policy), logging data from how writers interact with it, then estimating the performance of candidate policies by comparing how they would behave with how the reference policy did behave in the contexts logged. As long as the behavior of the candidate policy is not too different from that of the reference policy (in a sense that we formalize), this approach replaces complex human-in-the-loop evaluation with a simple convex optimization problem.

This paper demonstrates our approach: we collected data of how humans use suggestions made by a reference policy while writing reviews of a well-known restaurant. We then used logged interaction data to optimize a simple discriminative language model, and find that even this simple model generates better suggestions than a baseline trained without interaction data. We also ran simulations to validate the estimation approach under a known model of human behavior.

Our contributions are summarized below:

  • We present a new NLP task of phrase suggestion for writing. (Code and data are available at …)

  • We show how to use counterfactual learning for goal-directed training of language models from interaction data.

  • We show that a simple discriminative language model can be trained with offline interaction data to generate better suggestions in unseen contexts.

2 Related Work

Language models have a long history and play an important role in many NLP applications Sordoni et al. (2015); Rambow et al. (2001); Mani (2001); Johnson et al. (2016). However, these models do not model human preferences from interactions. Existing deployed keyboards use n-gram language models Quinn and Zhai (2016); Kneser and Ney (1995), or sometimes neural language models Kim et al. (2016), trained to predict the next word given recent context. Recent advances in language modeling have increased the accuracy of these predictions by using additional context Mikolov and Zweig (2012). But as argued in Arnold et al. (2016), these increases in accuracy do not necessarily translate into better suggestions.

The difference between suggestion and prediction is more pronounced when showing phrases rather than just words. Prior work has extended predictive language modeling to phrase prediction Nandi and Jagadish (2007) and sentence completion Bickel et al. (2005), but does not directly model human preferences. Google’s “Smart Reply” email response suggestion system Kannan et al. (2016) avoids showing a likely predicted response if it is too similar to one of the options already presented, but the approach is heuristic, based on a priori similarity. Search engine query completion also generates phrases that can function as suggestions, but is typically trained to predict what query is made (e.g., Jiang et al. (2014)).

3 Counterfactual Learning for Generating Suggestions

We consider the task of generating good words and phrases to present to writers. We choose a pragmatic quality measure: a suggestion system is good if it generates suggestions that writers accept. Let π denote a suggestion system, characterized by π(x | c), the probability that π will suggest the word or phrase x when in context c (e.g., words typed so far). (Our notation follows Swaminathan and Joachims (2015) but uses “reward” rather than “loss.”) Since π has the form of a contextual language model, we will refer to it as a “model.” We consider deploying π in an interactive interface such as Figure 1, which suggests phrases using a familiar predictive typing interface. Let δ denote a reward that a system receives from that interaction; in our case, the number of words accepted. (Our setting admits alternative rewards, such as the speed at which a sentence was written, or an annotator’s rating of quality.) We define the overall quality of a suggestion system by its expected reward E[δ] over all contexts.

Counterfactual learning allows us to evaluate and ultimately learn models that differ from those that were deployed to collect the data, so we can deploy a single model and improve it based on the data collected Swaminathan and Joachims (2015). Intuitively, if we deploy a model and observe what actions it takes and what feedback it gets, we could improve the model by making it more likely to suggest the phrases that got good feedback.

Suppose we deploy a reference model π_0 (some other literature calls π_0 a logging policy) and log a dataset

D = {(c_1, x_1, δ_1, p_1), (c_2, x_2, δ_2, p_2), …, (c_n, x_n, δ_n, p_n)}

of contexts (words typed so far), actions (phrases suggested), rewards, and propensities respectively, where p_i = π_0(x_i | c_i). Now consider deploying an alternative model π_θ (we will show an example as Eq. (1) below). We can obtain an unbiased estimate of the reward that π_θ would incur using importance sampling:

R̂(π_θ) = (1/n) Σ_i δ_i · π_θ(x_i | c_i) / p_i.

However, the variance of this estimate can be unbounded, because the importance weights π_θ(x_i | c_i) / p_i can be arbitrarily large for small p_i. Like Ionides (2008), we clip the importance weights to a maximum M:

R̂^M(π_θ) = (1/n) Σ_i δ_i · min(M, π_θ(x_i | c_i) / p_i).

The improved model can be learned by optimizing

θ̂ = argmax_θ R̂^M(π_θ).

This optimization problem is convex and differentiable; we solve it with BFGS. (We use the BFGS implementation in SciPy.)
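As a concrete illustration, the clipped estimator and its optimization can be sketched as follows. This is a minimal sketch under assumptions, not the paper's implementation: the data layout (per-context candidate feature matrices for a log-linear policy) and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def clipped_ips_reward(theta, cand_feats, actions, rewards, propensities, M=10.0):
    """Clipped importance-sampling estimate of a log-linear policy's reward.

    cand_feats[i] is a (num_candidates, num_features) array of feature
    vectors for every candidate suggestion in context i; actions[i] is the
    index of the suggestion actually shown by the reference policy, and
    propensities[i] is the probability with which it was shown.
    """
    total = 0.0
    for feats, a, reward, p in zip(cand_feats, actions, rewards, propensities):
        logits = feats @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                      # pi_theta(x | c) over candidates
        total += reward * min(M, probs[a] / p)    # clipped importance weight
    return total / len(actions)

def fit_policy(cand_feats, actions, rewards, propensities, dim, M=10.0):
    """Maximize the clipped estimate (minimize its negation) with BFGS."""
    objective = lambda th: -clipped_ips_reward(th, cand_feats, actions,
                                               rewards, propensities, M)
    return minimize(objective, np.zeros(dim), method="BFGS").x
```

The clipping constant M doubles as regularization: it bounds how much credit the estimate extends to behavior far from what the reference policy actually logged.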

4 Demonstration Using Discriminative Language Modeling

We now demonstrate how counterfactual learning can be used to evaluate and optimize the acceptability of suggestions made by a language model. We start with a traditional predictive language model p of any form, trained by maximum likelihood on a given corpus. (The model may take any form, but n-gram Heafield et al. (2013) and neural language models (e.g., Kim et al. (2016)) are common, and it may be unconditional or conditioned on some source features such as application, document, or topic context.) This model can be used for generation: sampling from the model yields words or phrases that match the frequency statistics of the corpus. However, rather than offering representative samples from p, most deployed systems instead sample from p(w | c)^(1/τ), renormalized, where τ is a “temperature” parameter; τ = 1 corresponds to sampling based on p (soft-max), while τ → 0 corresponds to greedy maximum likelihood generation (hard-max), which many deployed keyboards use Quinn and Zhai (2016). The effect is to skew the sampling distribution towards more probable words. This choice is based on a heuristic assumption that writers desire more probable suggestions; what if writers instead find common phrases to be overly cliché and favor more descriptive phrases? To capture these potential effects, we add features that can emphasize various characteristics of the generated text, then use counterfactual learning to assign weights to those features that result in suggestions that writers prefer.
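Temperature reweighting of a fixed next-word distribution can be sketched as below; this is a generic illustration, not code from the paper.

```python
import numpy as np

def temperature_sample(probs, tau, rng):
    """Sample a word index from p(w)^(1/tau), renormalized.

    tau = 1 reproduces the base distribution (soft-max sampling);
    tau -> 0 approaches greedy arg-max generation (hard-max).
    """
    if tau == 0:
        return int(np.argmax(probs))              # hard-max: most likely word
    skewed = np.asarray(probs, dtype=float) ** (1.0 / tau)
    skewed /= skewed.sum()                        # renormalize the skewed distribution
    return int(rng.choice(len(skewed), p=skewed))
```

Lowering τ below 1 concentrates mass on the head of the distribution, which is exactly the "more probable suggestions" heuristic questioned above.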

We consider locally-normalized log-linear language models of the form

π_θ(w_i | c) = exp(θ · f(w_i, c)) / Σ_{w′} exp(θ · f(w′, c)),    (1)

where s = w_1 … w_m is a phrase and f(w_i, c) is a feature vector for a candidate word w_i given its context c. (c is shorthand for the document context together with the preceding words w_1 … w_{i−1}.) Models of this form are commonly used in sequence labeling tasks, where they are called Max-Entropy Markov Models McCallum et al. (2000). Our approach generalizes to other models such as conditional random fields Lafferty et al. (2001).
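A minimal sketch of such a locally-normalized log-linear next-word distribution, with hand-rolled features standing in for the paper's feature extractor; the six-letter long-word cutoff and all names here are illustrative assumptions.

```python
import numpy as np

def next_word_probs(theta, candidates, lm_logprobs, pos_tags, pos_vocab, long_len=6):
    """pi_theta(w | c): softmax over theta . f(w, c) for candidate words.

    Features per candidate: base-LM log-likelihood, a long-word indicator,
    and a one-hot encoding of the word's POS tag.
    """
    feats = []
    for word, lp, tag in zip(candidates, lm_logprobs, pos_tags):
        one_hot = [1.0 if tag == t else 0.0 for t in pos_vocab]
        feats.append([lp, 1.0 if len(word) >= long_len else 0.0] + one_hot)
    logits = np.array(feats) @ theta
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    return probs / probs.sum()
```

With weight 1 on the LM feature and all other weights zero, this recovers the base model's renormalized distribution; raising the long-word weight shifts mass toward longer candidates.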

The feature vector f can include a variety of features. By changing feature weights, we obtain language models with different characteristics. To illustrate, we describe a model with three features below. The first feature (LM) is the log likelihood under a base 5-gram language model trained on the Yelp Dataset (we used only restaurant reviews), with Kneser-Ney smoothing Heafield et al. (2013). The second and third features “bonus” two characteristics of w_i: long-word is a binary indicator of long word length (the length cutoff is chosen arbitrarily), and POS is a one-hot encoding of its most common POS tag. Table 1 shows examples of phrases generated with different feature weights.

Note that if we set the weight vector to zero except for a weight of 1/τ on LM, the model reduces to sampling from the base language model with “temperature” τ. The fitted weights of the log-linear model in our experiments are shown in the supplementary material.
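This reduction can be checked numerically: a log-linear model whose only nonzero weight is 1/τ, on the LM log-likelihood feature, defines the same distribution as temperature sampling at τ. The following is a self-contained sanity check, not code from the paper.

```python
import numpy as np

def loglinear_probs(lm_weight, lm_logprobs):
    """Log-linear model with a single feature: the base-LM log-likelihood."""
    logits = lm_weight * np.asarray(lm_logprobs, dtype=float)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def temperature_probs(probs, tau):
    """Base distribution raised to the power 1/tau and renormalized."""
    skewed = np.asarray(probs, dtype=float) ** (1.0 / tau)
    return skewed / skewed.sum()
```

Since exp((1/τ) · log p(w)) = p(w)^(1/τ), the two normalized distributions coincide term by term.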

LM weight = 1, all other weights zero:
i didn’t see a sign for; i am a huge sucker for

LM weight = 1, long-word bonus = 1.0:
another restaurant especially during sporting events

LM weight = 1, POS adjective bonus = 3.0:
great local bar and traditional southern

Table 1: Example phrases generated by the log-linear language model under various parameters. The context is the beginning-of-review token; all text is lowercased. Some phrases are not fully grammatical, but writers can accept a prefix.

Reference model π_0.

In counterfactual estimation, we deploy one reference model π_0 to learn another, π_θ, but weight truncation will prevent π_θ from deviating too far from π_0. So π_0 must offer a broad range of types of suggestions, but they must be of sufficient quality that some are ultimately chosen. To balance these concerns, we use temperature sampling with a temperature τ = 1/2:

π_0(w_i | c) ∝ p(w_i | c)^2.

We use our reference model π_0 to generate 6-word suggestions one word at a time, so the propensity π_0(x | c) is the product of the conditional probabilities of each word.
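Word-at-a-time generation with the propensity accumulated as a product of per-word sampling probabilities can be sketched as follows; `step_probs_fn` is an assumed interface to the base language model, and all names are illustrative.

```python
import numpy as np

def generate_suggestion(step_probs_fn, context, length=6, tau=0.5, rng=None):
    """Sample a phrase one word at a time from the temperature-skewed LM.

    Returns the phrase and its propensity: the product of the probabilities
    with which each word was actually sampled, logged for later reweighting.
    """
    rng = rng or np.random.default_rng()
    phrase, propensity = [], 1.0
    for _ in range(length):
        words, probs = step_probs_fn(context + phrase)  # condition on words so far
        skewed = np.asarray(probs, dtype=float) ** (1.0 / tau)
        skewed /= skewed.sum()
        idx = int(rng.choice(len(words), p=skewed))
        phrase.append(words[idx])
        propensity *= skewed[idx]
    return phrase, propensity
```

Logging the propensity alongside each shown phrase is what makes the later importance-sampling correction possible.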

Figure 2: Example reviews. A colored background indicates that the word was inserted by accepting a suggestion. Consecutive words with the same color were inserted as part of a phrase.

4.1 Simulation Experiment

We present an illustrative model of suggestion acceptance behavior, and simulate acceptance behavior under that model to validate our methodology. Our method successfully learns a suggestion model that fits the writer's preferences.

Desirability Model.

We model the behavior of a writer using the interface in Fig. 1, which displays 3 suggestions at a time. At each timestep they can choose to accept one of the 3 suggestions {s_1, s_2, s_3}, or reject the suggestions by tapping a key. Let p_i denote the likelihood of suggestion s_i under a predictive model, and let p_∅ denote the probability of any other word. Let a_i denote the writer's probability of choosing the corresponding suggestion, and a_∅ denote the probability of rejecting the suggestions offered. If the writer decided exactly what to write before interacting with the system and used suggestions for optimal efficiency, then a_i would equal p_i. But suppose the writer finds certain suggestions desirable. Let d(s) give the desirability of a suggestion; e.g., d(s) could be the number of long words in suggestion s. We model their behavior by adding the desirabilities to the log probabilities of each suggestion:

a_i = exp(log p_i + d(s_i)) / Z,    a_∅ = p_∅ / Z,

where Z = p_∅ + Σ_j exp(log p_j + d(s_j)). The net effect is to move probability mass from the “reject” action to suggestions that are close enough to what the writer wanted to say but desirable.
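The acceptance model above can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def acceptance_probs(pred_probs, desirabilities, reject_mass):
    """Writer's probability of accepting each shown suggestion, or rejecting.

    pred_probs[i] is the predictive likelihood of suggestion i under the
    writer's intent; reject_mass is the probability of all other words.
    Desirability d_i is added to log p_i before renormalizing, which moves
    mass from the reject action toward desirable suggestions.
    """
    boosted = np.exp(np.log(np.asarray(pred_probs, dtype=float))
                     + np.asarray(desirabilities, dtype=float))
    z = boosted.sum() + reject_mass               # normalizing constant Z
    return boosted / z, reject_mass / z
```

With zero desirability this reduces to optimal-efficiency behavior (accept with exactly the predictive probabilities); positive desirability on a suggestion raises its acceptance probability and lowers the rejection probability.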

Figure 3: We simulated learning a model based on the behavior of a writer who prefers long words, then presented suggestions from that learned model to the simulated writer. The model learned to make desirable predictions by optimizing the counterfactual estimated reward. Regularization causes that estimate to be conservative; the reward actually achieved by the model exceeded the estimate.

Experiment Settings and Results.

We sample 10% of the reviews in the Yelp Dataset, hold them out from training the base language model, and split them into an equal-sized training set and test set. We randomly sample suggestion locations from the training set, cut off the text at each location, and pretend to retype it. We generate three phrases from the reference model π_0, then allow the simulated author to pick one phrase, subject to their preference as modeled by the desirability model. We learn a customized language model π_θ and then evaluate it on an additional 500 sentences from the test set.

For an illustrative example, we set the desirability d(s) to the number of long words in the suggestion, multiplied by 10. Figure 3 shows that counterfactual learning quickly finds model parameters that make suggestions that are more likely to be accepted, and the counterfactual estimates are not only useful for learning but also correlate well with the actual improvement. In fact, since weight truncation (controlled by M) acts as regularization, the counterfactual estimate consistently underestimates the actual reward.

4.2 Experiments with Human Writers

We recruited 74 workers through MTurk to write reviews of Chipotle Mexican Grill using the interface in Fig. 1 from Arnold et al. (2016). For the sake of simplicity, we assumed that all human writers have the same preferences. Based on pilot experiments, Chipotle was chosen as a restaurant that many crowd workers had dined at. User feedback was largely positive, and users generally understood the suggestions' intent. The users' engagement with the suggestions varied greatly: some loved the suggestions, and nearly their entire review consisted of words entered via suggestions, while others used very few suggestions. Several users reported that the suggestions helped them find the words to express an idea, or gave them ideas of what to write. We did not systematically enforce quality, but informally we find that most reviews written were grammatical and sensible, which indicates that participants evaluated suggestions before taking them. The dataset contains 74 restaurant reviews typed with phrase suggestions, with a mean length of 69.3 words (s.d. 25.7). In total, this data comprises 5125 words, along with almost 30k suggestions made (including mid-word).

Estimated Generation Performance.

We learn an improved suggestion model π_θ by maximizing the estimated expected reward R̂^M(π_θ). We fix M and evaluate the performance of the learned parameters on held-out data using 5-fold cross-validation. Figure 4 shows that while the estimated performance of the new model does vary with the M used when estimating the expected reward, the relationships are consistent: the fitted model consistently receives the highest expected reward, followed by an ablated model that can only adjust the temperature parameter τ, and both outperform the reference model (with τ = 1/2). The fitted model weights suggest that the workers seemed to prefer long words and pronouns, and eschewed punctuation.

Figure 4: The customized model consistently improves expected reward over baselines (the reference LM, and the best “temperature” reweighting of the LM) on held-out data. Although the result is an estimate computed using weight truncation, the improvement holds for all reasonable values of M.

5 Discussion

Our model assumed all writers have the same preferences. Modeling variations between writers, such as in style or vocabulary, could improve performance, as has been done in other domains (e.g., Lee et al. (2017)). Each review in our dataset was written by a different writer, so our dataset could be used to evaluate online personalization approaches.

Our task of crowdsourced reviews of a single restaurant may not be representative of other tasks or populations of users. However, the predictive language model is a replaceable component, and a stronger model that incorporates more context (e.g., Sordoni et al. (2015)) could improve our baselines and extend our approach to other domains.

Future work can improve on the simple discriminative language model presented here to increase grammaticality and relevance, and thus acceptability, of the suggestions that the customized language models generate.


Acknowledgments

Kai-Wei Chang was supported in part by National Science Foundation Grant IIS-1657193. Part of the work was done while Kai-Wei Chang and Kenneth C. Arnold visited Microsoft Research, Cambridge.


Appendix A Supplemental Material

A.1 Experiment details

Crowd workers were U.S.-based Mechanical Turk workers who were paid $3.50 to write a review of Chipotle using the keyboard interface illustrated in Figure 1. They could elect to use the interface on either a smartphone or a personal computer. In the former case, the interaction was natural, as it mimicked a standard keyboard. In the latter case, users clicked on the on-screen keys with a mouse to simulate taps. (There did not seem to be significant differences between these two groups.) The instructions are given below:

Go to <URL> on your computer or phone.
Try out our new keyboard by pretending you're writing a restaurant review. For this part we just want you to play around -- it doesn't matter what you type as long as you understand how it works. Click the submit button and enter the code here: ____________
Did you use your phone or did you use your computer?
How would you describe the new keyboard to a friend? How do you use the phrase suggestions?
Next, please go back to <URL> on your computer or phone (reload if necessary).
Now please use the keyboard to write a fun review for Chipotle, the infamous chain Mexican restaurant. The ideal review is well written (entertaining, colorful, interesting), and has specific details about Chipotle menu items, service, atmosphere, etc. Please do not randomly click on nonsense suggestions -- we all know Chipotle doesn't serve pizza or burgers. We will bonus our favorite review!

A.2 Qualitative feedback

We include quotes from feedback from participants who used the system with suggestions generated by the reference model π_0, i.e., 5-grams with temperature 1/2. Some users found the suggestions accurate at predicting what they intended to say, some found them useful in shaping one's thoughts or finding the “right words,” and others found them overly simplistic or irrelevant.

  • “The phrases were helpful in giving ideas of where I wanted to go next in my writing, more as a jumping off point than word for word. I’ve always liked predictive text so the phrases are the next level of what I never knew I wanted.”

  • “Kind of easy to review but they also sometimes went totally tangent directions to the thoughts that I was trying to accomplish.”

  • “I was surprised how well the words matched up with what I was expecting to type.”

  • “I did like the phrase suggestions very much. They really came in handy when you knew what you wanted to say, but just couldn’t find the right words.”

  • “I thought they were very easy to use and helped me shape my thoughts as well! I think they may have been a bit too simple in their own, but became more creative with my input.”

A.3 Fitted model weights

The following table gives the fitted weights for each feature in the log-linear model, averaged across dataset folds.

mean 2.04 0.92 -1.16 1.03 1.45 0.45 0.91 0.36 0.96 0.87 1.68 0.23 0.79
std 0.16 0.14 0.26 0.61 0.38 0.55 0.26 0.22 0.14 0.27 0.20 1.00 0.32