
"It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems

by   Victor S. Bursztyn, et al.

Conversations aimed at determining good recommendations are iterative in nature. People often express their preferences in terms of a critique of the current recommendation (e.g., "It doesn't look good for a date"), requiring some degree of common sense for a preference to be inferred. In this work, we present a method for transforming a user critique into a positive preference (e.g., "I prefer more romantic") in order to retrieve reviews pertaining to potentially better recommendations (e.g., "Perfect for a romantic dinner"). We leverage a large neural language model (LM) in a few-shot setting to perform critique-to-preference transformation, and we test two methods for retrieving recommendations: one that matches embeddings, and another that fine-tunes an LM for the task. We instantiate this approach in the restaurant domain and evaluate it using a new dataset of restaurant critiques. In an ablation study, we show that utilizing critique-to-preference transformation improves recommendations, and that there are at least three general cases that explain this improved performance.





1 Introduction

Conversational recommendation systems (CRSs) are dialog-based systems that aim to refine a set of options over multiple turns of a conversation, envisioning more natural interactions and better user modeling than in non-conversational approaches.

However, the resulting dialogs still do not necessarily reflect how real conversations unfold. Most CRSs fall into two categories: they either frame the problem as a slot-filling task within a predefined feature space, such as sun2018conversational; zhang2018towards; budzianowski-etal-2018-multiwoz, which is closer to how people make decisions but not as flexible as real conversations; or they elicit preferences by asking users to rate specific items, such as christakopoulou2016towards, which is independent of a feature space but not as natural to users.

Figure 1: An example of our system transforming a critique into a positive preference and then using a customer testimonial to sell the user on a new option.

When we examine situations involving real human agents lyu2021workflow, decisions typically require multiple rounds of recommendations by the agent and critiques by the user, with the agent continuously improving the recommendations based upon user preferences that can be inferred from such critiques.

These inferences can be compared to the types of common sense inferences that have been studied recently with LMs davison-etal-2019-commonsense; majumder-etal-2020-like; jiang2021m. However, the use of LMs for critique interpretation remains underexplored, despite the important role of critiques in communicating preferences, a very natural real-world task. Working in the restaurant domain, we prompt GPT3 brown2020language to transform a free-form critique (e.g., “It doesn’t look good for a date”) into a positive preference (e.g., “I prefer more romantic”) that better captures the user’s needs. Compared with most previous work on common sense inference, which relies on manually constructed question sets, our task presents an opportunity to study common sense inference within a naturally arising, real-world application.

We test the effect of our novel critique interpretation method on the quality of recommendations using two different methods: one that matches the embedding of an input statement (e.g., “I prefer more romantic”) to persuasive arguments found in customer reviews (e.g., “Perfect for a romantic dinner”); and another one that fine-tunes BERT devlin2018bert in using an input statement to rank a given set of arguments.

Our work differs from previous critiquing-based systems that strongly limit the types of critiques that can be used chen2012critiquing and aligns with a recent trend in the CRS literature towards more open-ended interactions radlinski2019coached; byrne2019taskmaster. To the best of our knowledge, penha2020does are the closest prior work investigating whether BERT can be used for recommendations by trying to infer related items and genres. Here, we focus specifically on critique-to-preference inferences, aiming at more natural dialogs and better recommendations.

Our contributions are the following: 1. We propose a critique interpretation method that does not limit the feature space a priori; 2. We demonstrate that transforming critiques into preferences improves recommendations more than twofold when matching embeddings and by 19-59% when fine-tuning an LM to rank recommendations, and we present three possible explanations for this; and 3. We release a new dataset of user critiques in the restaurant domain, contributing a new applied task where common sense has great practical value.

2 Methods

In this section, we describe three methods: a critique interpretation method (2.1), an embeddings-based recommender (2.2.1), and an LM-based recommender (2.2.2).

2.1 Critique Interpretation

Critique interpretation is the task of transforming a free-form critique into a positive preference. Our critique interpretation method uses GPT3 in a few-shot setting similarly to brown2020language, which can be represented in a 3-shot version as follows:

To prime GPT3 for our task, we include ten examples in its prompt, five related to food and five to the atmosphere (the full prompt is available online; the footnote link is not preserved). We then append the critique that we would like to transform followed by the string “I prefer”, which conditions GPT3 to generate a positive preference. In our experiments, positive preferences are sampled using OpenAI’s Completion API (the DaVinci model, temperature = 0.7, top p = 1.0, response length = 20, and no penalties).

Besides not requiring a hand-crafted feature set, this method is also capable of more flexible interpretation of language, such as transforming “How come they only serve that much?”—with no clearly negative words—into “I prefer larger portions.”
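The prompt construction described above can be sketched as follows. This is a minimal illustration, not the authors' prompt: the "Critique:" / "Preference:" layout and the helper name `build_prompt` are assumptions, and only three critique-preference pairs quoted in this paper are used in place of the ten examples of the real prompt. The only property taken from the text is that the prompt ends with the string "I prefer".

```python
# Critique -> preference pairs quoted in the paper (the real prompt has ten examples).
FEW_SHOT_EXAMPLES = [
    ("It doesn't look good for a date", "more romantic"),
    ("How come they only serve that much?", "larger portions"),
    ("It looks too casual.", "a fancier place"),
]


def build_prompt(critique, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt ending in 'I prefer' (the layout is an assumption)."""
    lines = []
    for example_critique, preference in examples:
        lines.append(f"Critique: {example_critique}")
        lines.append(f"Preference: I prefer {preference}.")
        lines.append("")  # blank line between examples
    # The new critique goes last, followed by the conditioning string.
    lines.append(f"Critique: {critique}")
    lines.append("Preference: I prefer")
    return "\n".join(lines)


prompt = build_prompt("It has a freaking band!")
print(prompt)
```

The resulting string would then be sent to a text-completion model with the sampling settings listed above, and the generated continuation appended to "I prefer" to form the positive preference.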

2.2 Content-based Recommendations

2.2.1 Recommendation Search

Our embeddings-based recommender takes a preference statement and searches for persuasive arguments in customer reviews. As seen in Figure 1, we can define a persuasive argument as a review sentence that conveys clearly positive sentiment while being as specific as possible w.r.t. the user’s preferences.

To incorporate this definition in the recommender, we first parse sentences in customer reviews using spaCy honnibal2017spacy and use EmoNet abdul2017emonet to keep the sentences with at least a minimum amount of “joy” as our set of argument candidates.

Then we use the Universal Sentence Encoder cer2018universal to calculate the similarity of all these argument candidates w.r.t. a given user preference. We calculate the cosine similarity between their representations in this embedding space, select the argument with maximum alignment, and recommend the associated restaurant.
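The search step can be sketched as an argmax over cosine similarities. This is a minimal sketch under stated assumptions: the three-dimensional vectors below are toy stand-ins for sentence embeddings (the paper uses the Universal Sentence Encoder), and the function names and restaurant names are hypothetical.

```python
import numpy as np


def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and each row of `candidates`."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query


def recommend(preference_vec, argument_vecs, restaurants):
    """Return the restaurant whose argument is most aligned with the preference."""
    scores = cosine_similarity(preference_vec, argument_vecs)
    best = int(np.argmax(scores))
    return restaurants[best], float(scores[best])


# Toy embeddings (assumed values, for illustration only).
preference_vec = np.array([1.0, 0.0, 0.2])
argument_vecs = np.array([
    [0.9, 0.1, 0.1],  # argument from "Bistro A"
    [0.0, 1.0, 0.0],  # argument from "Diner B"
    [0.1, 0.2, 0.9],  # argument from "Cafe C"
])
restaurants = ["Bistro A", "Diner B", "Cafe C"]

restaurant, score = recommend(preference_vec, argument_vecs, restaurants)
print(restaurant, round(score, 3))
```

In a real system each row of `argument_vecs` would be the embedding of a review sentence that passed the sentiment filter, stored alongside the restaurant it came from.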

As with critique interpretation, the recommender can take any natural language statement as input to search for potential recommendations. We distinguish two variants: one that uses an inferred positive preference as input (“I prefer more romantic”) and one that directly uses a critique (“It doesn’t look good for a date”). In our first ablation study, we use the critique-input variant as a baseline to test the efficacy of the preference-input variant in retrieving better recommendations.

2.2.2 Recommendation Ranking

Besides using pretrained embeddings to search for recommendations from customer reviews, we design a more computationally intensive method that fine-tunes BERT to rank a set of arguments considering a given input statement.

We use the currently top performing open-source solution han2020learning; TensorflowRankingKDD2019 on the MSMARCO passage ranking leaderboard to fine-tune three versions of BERT: one uses a positive preference as input (“I prefer more romantic”), one uses a critique (“It doesn’t look good for a date”), and one uses a concatenation of both a critique and a preference (“It doesn’t look good for a date. I prefer more romantic”). Hypothetically, the more powerful LM method could learn to satisfy the user’s preferences without the need for critique interpretation; this would show up as the critique-input version performing on par with the other two.

In our experiments, BERT-Base is fine-tuned for 10,000 steps, with learning rate = , maximum sequence length = 512, and softmax loss, using an NVIDIA Quadro RTX 8000 for 3-6h per run (when ranking 15 and 30 arguments, respectively) and two runs per model (2-fold cross validation).
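The three input variants and the softmax loss can be illustrated with a short sketch. It makes two assumptions: the helper `make_query` and its variant names are hypothetical, and the loss shown is one common listwise form (graded labels weighting the log-softmax of the scores), which may differ in detail from the implementation used in the experiments.

```python
import numpy as np


def make_query(critique, preference, variant):
    """Build the input statement for one of the three fine-tuned versions."""
    if variant == "preference":
        return preference
    if variant == "critique":
        return critique
    if variant == "both":
        return f"{critique} {preference}"
    raise ValueError(f"unknown variant: {variant}")


def softmax_ranking_loss(scores, labels):
    """Listwise softmax cross-entropy: -sum(label_i * log_softmax(score)_i)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    return float(-(labels * log_softmax).sum())


query = make_query("It doesn't look good for a date.", "I prefer more romantic", "both")

# Relevance scores from 3 (very relevant) to 1 (irrelevant) for three arguments.
labels = [3, 1, 1]
good_loss = softmax_ranking_loss([2.0, 0.5, -1.0], labels)  # best argument scored highest
bad_loss = softmax_ranking_loss([-1.0, 0.5, 2.0], labels)   # best argument scored lowest
print(query, good_loss, bad_loss)
```

The loss is lower when the model assigns its highest score to the most relevant argument, which is the behavior fine-tuning is meant to encourage.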

3 Evaluation

We run two ablation studies to evaluate the hypothesis that critique interpretation would be beneficial to the overall recommendation approach. First, we analyze our embeddings-based recommender to check whether using an inferred preference as input outperforms using the critique directly. Second, we analyze our LM fine-tuning-based recommender to check whether the preference-input version or the concatenated-input version outperforms the critique-input version. Finally, we discuss qualitative differences between the tested arms.

Test case 1: “It looks too casual.”
Positive preference: “I prefer a fancier place.”
Arguments in extracted order (only five of the six cells survive, so ranks cannot be assigned): “Very cheesy, very fresh!”; “Very kid friendly.”; “Elegant, upscale and classy place for a special occasion.”; “The best around here.”; “Superior restaurant, the only place I will have a dim sum.”

Test case 2: “It has a freaking band!”
Positive preference: “I prefer a more quiet place.”
Without critique interpretation: Rank #1 “It has an awesome”; Rank #2 “It has an awesome”; Rank #3 “It has a great”
With critique interpretation: Rank #1 “Excellent spot to spend time alone or talk business.”; Rank #2 “Good ambiance.”; Rank #3 “Great place to be at night.”

Test case 3: “I don’t really like seafood.”
Positive preference: “I prefer beef or chicken.”
Without critique interpretation: Rank #1 “Everything delicious with an exception of of the shrimps.”; Rank #2 “I found that I do not enjoy tuna, but my mom thought it was excellent.”; Rank #3 “For dinner, I enjoyed the scallops one night and the sea bass the second.”
With critique interpretation: Rank #1 “I only eat Beef Brisket here because is delicious!”; Rank #2 “Chicken flautas are always”; Rank #3 “Chicken moist and tender.”

Table 1: Three test cases with the top 3 arguments retrieved without and with critique interpretation (accurate marked in bold).

3.1 Data

Our methods were instantiated in a system comprising 15 restaurants selected from two of the largest metropolitan areas in the United States, covering a variety of price ranges and cuisines. For each restaurant, up to 100 four- or five-star customer reviews were collected from Google Places. This resulted in a total of 1455 reviews comprising 5744 sentences, 2865 of which pass the threshold for being identified as positive review sentences.

We compiled a set of user critiques from two sources: a set of 46 unique critiques from user studies that were conducted to test an earlier system prototype bursztyn2021, and 294 additional critiques adapted from the Circa dataset louis2020d. Circa was designed to study indirect answers to yes-no questions, such as “Are you a big meat eater?” answered with “I prefer leafy greens”, from which the critique “I’m not a big meat eater” can be generated. We ended up with a total of 340 individual critiques after examining 1205 similar examples.

We generated a positive preference for each individual critique using our critique interpretation method in 2.1, without discarding any critiques. Our method yielded accurate preferences for 298 critiques (87.6%). For the remaining 42, we found GPT3 mostly undecided and vague (e.g., “Jalapeños are my limit” generates “I prefer food without jalapeños”). In our experiments, for these edge cases, we kept the best of three trials, but we believe that results using just the first generation would have been qualitatively similar.

The 340 critiques were randomly combined into pairs and triples in order to simulate longer conversations, i.e., two- and three-round critiques. We sampled 340 pairs and 340 triples, substituting only exceptional combinations that contained contradictory statements (e.g., “I’m not a big meat eater.” paired with “I’m not in the mood for vegetables.”), for a total of 1020 critiques. Compound critiques, as well as their corresponding preferences, were concatenated into single statements.
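The construction of compound critiques can be sketched as random sampling followed by concatenation. This is an illustrative sketch only: the function name `sample_compounds` and the fixed seed are assumptions, the three critique-preference pairs are the ones shown in Table 1, and the manual substitution of contradictory combinations is not modeled.

```python
import random


def sample_compounds(items, size, count, seed=0):
    """Sample `count` combinations of `size` (critique, preference) pairs and concatenate them."""
    rng = random.Random(seed)
    compounds = []
    for _ in range(count):
        chosen = rng.sample(items, size)  # distinct critiques within one combination
        critique = " ".join(c for c, _ in chosen)
        preference = " ".join(p for _, p in chosen)
        compounds.append((critique, preference))
    return compounds


items = [
    ("It looks too casual.", "I prefer a fancier place."),
    ("It has a freaking band!", "I prefer a more quiet place."),
    ("I don't really like seafood.", "I prefer beef or chicken."),
]

pairs = sample_compounds(items, size=2, count=5)
triples = sample_compounds(items, size=3, count=5)
print(pairs[0])
```

Each compound keeps the critique and its preference aligned, so either side can be used as the input statement in the later experiments.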

This curated dataset of 1020 restaurant critiques and inferred preferences is made available to the research community.

3.2 Measurements

To evaluate our embeddings-based method, we use critiques as input to the variant without critique interpretation and their positive preferences as input to the variant with it. For each query we retrieve the top 3 arguments, which are labeled as accurate or inaccurate by a human judge (illustrated in Table 1). To measure labeling consistency, a second human annotator redundantly labeled a sample of 100 arguments, resulting in a Cohen’s Kappa of 0.71, which indicates strong agreement.
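For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below uses the standard definition; the ten labels are invented for illustration and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of each annotator's marginal rate, summed over labels.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels (1 = accurate, 0 = inaccurate); illustrative values only.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on 8 of 10 items and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.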

We then measure Precision@1, Precision@2, and Precision@3 in Table 2 for the embeddings-based method with and without critique interpretation.

To train and evaluate the BERT-based method, we retrieve the top 15 arguments from each of the two embeddings-based variants (critique input and preference input) for 100 queries. Each argument receives a score from 3 (very relevant) to 1 (irrelevant). Again, a second human annotator relabeled 100 arguments for a Cohen’s Kappa of 0.73, also indicating strong agreement.

We design three ranking tasks: the first consists of ranking the 15 arguments originally retrieved with critiques as input, hence closer to critiques in the embedding space; the second consists of ranking the 15 arguments originally retrieved with preferences as input, hence closer to preferences; and the third consists of ranking both sets, i.e., 30 arguments. For each task we train the three BERT versions (preference input, critique input, and concatenated input). We then measure nDCG@1, nDCG@3, nDCG@5, and nDCG@10 in Table 3, averaged after 2-fold cross validation.
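nDCG@k rewards rankings that place highly relevant arguments near the top, normalized by the best achievable ordering. The sketch below uses the exponential gain 2^rel - 1 with a log2 rank discount; this gain choice is an assumption, since the exact nDCG variant is not stated, and the relevance lists are invented.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k items (exponential gain, assumed)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(position + 1)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance scores (3 = very relevant, 1 = irrelevant) in ranked order.
perfect = ndcg_at_k([3, 2, 1], 3)
reversed_order = ndcg_at_k([1, 2, 3], 3)
print(perfect, round(reversed_order, 3))
```

A ranking already sorted by relevance scores 1.0, and any other ordering of the same items scores lower.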

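Input to the embeddings-based recommender            Precision@1  Precision@2  Precision@3
Critique (without critique interpretation)           0.256        0.251        0.250
Positive preference (with critique interpretation)   0.574        0.546        0.525
Table 2: Precision@1, 2, and 3 for the embeddings-based recommender without and with critique interpretation.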
model   nDCG@1  nDCG@3  nDCG@5  nDCG@10
        0.617   0.674   0.723   0.811
        0.731   0.753   0.773   0.858
        0.726   0.740   0.773   0.844
        0.676   0.754   0.805   0.865
        0.729   0.761   0.774   0.856
        0.805   0.772   0.808   0.863
        0.498   0.537   0.605   0.660
        0.790   0.754   0.758   0.791
        0.686   0.663   0.685   0.746
Table 3: nDCG@1, 3, 5, and 10 for the BERT-based ranker on each task.

3.3 Results

We found that using the positive preferences yields substantial improvements in information retrieval. For the embeddings-based recommender, in Table 2, critique interpretation increases Precision@1 by 124%, Precision@2 by 118%, and Precision@3 by 110%. This gap is also present, with marginal variations, when separately analyzing single-, two-, and three-round critiques. For the BERT-based ranker, in Table 3, the preference-input version outperforms the critique-input version by 19% on nDCG@1 even on the first task (ranking arguments retrieved with critiques), where the critique-input version could have an edge. This gap persists on the second task (19%), increases on the third task (59%), and tends to narrow towards nDCG@10. Overall, we found strong evidence in support of our hypothesis.
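The relative improvements reported for Table 2 follow directly from the precision values. The short check below recomputes them; the pairing of the lower row with critique input and the higher row with preference input is inferred from the reported percentages.

```python
# (without, with) critique interpretation, taken from Table 2.
table2 = {
    1: (0.256, 0.574),
    2: (0.251, 0.546),
    3: (0.250, 0.525),
}


def relative_gain(baseline, improved):
    """Percentage increase of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100


gains = {k: round(relative_gain(base, new)) for k, (base, new) in table2.items()}
for k, gain in gains.items():
    print(f"Precision@{k}: +{gain}%")
```

For example, Precision@1 rises from 0.256 to 0.574, and (0.574 - 0.256) / 0.256 is about 1.24, i.e., a 124% increase.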

Table 1 shows three examples in which the use of positive preferences was clearly beneficial. These examples represent three critique patterns that cause systematic errors if critique interpretation is turned off: 1. When the user implies a preference for a feature using the polar opposite (e.g., “It looks too casual” implying “I prefer a fancier place”); 2. When the user draws on common sense to express a preference (“It has a freaking band!” implying “I prefer a more quiet place”); and 3. When the user implies a filter within a set of related features (e.g., “I don’t really like seafood” implying preference for alternatives in the meat category).

Analyzing the results of the embeddings-based recommender with and without critique interpretation for the 340 single-round critiques, we found 170 cases where the variant with critique interpretation outperformed the one without. Within these, 40 belong to the first pattern (24%), 78 to the second (46%), and 38 to the third (22%) (the full analysis is available online; the footnote link is not preserved). A common trait behind the three patterns is that critiques can be lexically very distinct from their corresponding preference statements, and critique interpretation helps to bridge this gap.

4 Conclusion & Future Work

In this paper, we presented an open-ended approach to content-based recommendations for CRS. We developed a novel critique interpretation method that uses GPT3 to infer positive preferences from free-form critiques. We also developed two methods for retrieving recommendations: one that matches embeddings and another that fine-tunes BERT for the task. We ran two ablation studies to test if transforming critiques into positive preferences would yield better recommendations, confirming that it improves performance across both methods. Finally, we described three critique patterns that cause systematic errors in recommendation search if critique interpretation is turned off.

For future work, we will strive to use critiques to identify and remove unsuitable restaurants; we speculate that the sparsity of customer reviews generally makes it harder to “rule out” than to “rule in.” We will also study other issues such as when to ask clarification questions to resolve ambiguity in the scope of a critique.


We would like to thank reviewers for their helpful feedback. This work was supported in part by gift funding from Adobe Research and by NSF grant IIS-2006851.