Zero-shot Query Contextualization for Conversational Search

by   Antonios Minas Krasakis, et al.
University of Amsterdam

Current conversational passage retrieval systems cast conversational search into ad-hoc search by using an intermediate query resolution step that places the user's question in the context of the conversation. While the proposed methods have proven effective, they still assume the availability of large-scale question resolution and conversational search datasets. To waive the dependency on the availability of such data, we adapt a pre-trained token-level dense retriever on ad-hoc search data to perform conversational search with no additional fine-tuning. The proposed method allows us to contextualize the user question within the conversation history, while restricting the matching to the question and potential answers only. Our experiments demonstrate the effectiveness of the proposed approach. We also perform an analysis that provides insights into how contextualization works in the latent space; in essence, it introduces a bias towards salient terms from the conversation.




1. Introduction

The introduction of commercial voice assistants along with advances in natural language understanding have enabled users to interact with retrieval systems in richer and more natural ways through conversations. While those interactions can ultimately lead to increased user satisfaction, they are inherently complex to handle as they require an understanding of the entire dialogue semantics by the retrieval system. Hence, the retrieval of relevant passages within the context of a conversation has risen as a promising research direction  (dalton2020trec; krasakis2020analysing; borisov2021keyword; meng2021initiative).

Understanding the semantics of language has been empowered by the availability of large-scale datasets in a variety of tasks (choi2018quac; nguyen2016ms; yates2021pretrained), which are lacking when it comes to conversational retrieval. Constructing a large and diverse conversational retrieval dataset can be quite challenging. Conversational queries are tail queries. As conversations evolve, multi-turn queries are likely to be unique and therefore cannot be aggregated for anonymisation, making it unlikely that publicly available resources can be built from real user interactions. Therefore, datasets need to be built using human experts in controlled environments. However, this approach leads to small-scale datasets (dalton2020trec; dalton2020trec1; dalton2022trec; 10.1145/3432726) and requires explicit conversation development instructions which bias the nature of the constructed dataset and hurt the generalizability of models to new types of conversations (adlakha2021topiocqa).

On the other hand there is a plethora of data resources for ad-hoc retrieval, e.g.  craswell2020orcas. Therefore, most conversational retrieval approaches so far introduce a query rewriting step, which essentially decomposes the conversational search problem into a query resolution problem and an ad-hoc retrieval problem. Regarding query resolution, the majority of methods perform an explicit query re-write attempting to place the user’s question in the context of the conversation, by either expanding queries with terms from recent history  (voskarides2020query), or rewriting the full question using a sequence-to-sequence model  (vakulenko2021comparison; lintrec; ferreira2021open; lin2021multi; yu2020few).  yu2021few learns to better encode the user’s question in a latent space so that the learnt embeddings are close to human rewritten questions, while  lin2021contextualized uses human rewritten questions to generate large-scale pseudo-relevance labels and bring the user’s question embeddings closer to the pseudo-relevant passage embeddings. In all cases, supervision is necessary and it is performed against CANARD (elgohary2019can), which consists of synthetic resolutions of conversational questions. The only approaches that do not use supervision simply expand the user’s question by extracting general informative terms from the conversation history  (lin2021multi; borisov2021keyword).

In this work we pose the following key research question: to what extent can we transfer knowledge from ad-hoc retrieval to the domain of conversational retrieval, where data scarcity is and will likely remain an imminent problem? To answer this question we adapt ColBERT, the state-of-the-art BERT-based token-level dense retriever pre-trained on ad-hoc search data. We propose Zero-shot Conversational Contextualization (ZeCo), a variant of ColBERT which on the one hand contextualizes all embeddings within the conversation, but on the other hand matches only the contextualized terms of the last user question with potential answers (Figure 1). As such, our approach is zero-shot in the conversational domain: it does not use any conversational search data, neither rewritten queries nor relevance judgements, to retrieve relevant passages. It also differs from the aforementioned unsupervised keyword extraction works, since it focuses on contextualizing embeddings rather than adding terms to queries. In this work we aim to answer the following research questions:

  1. RQ1: Can zero-shot contextualization of conversational questions improve dense passage retrieval?

  2. RQ2: How does zero-shot contextualization change the last turn’s question embeddings?

  3. RQ3: How is zero-shot contextualization affected by the anaphora phenomena found in conversations?

To the best of our knowledge this is one of the few efforts towards zero-shot conversational search. Our approach remains agnostic to and unbiased by small conversational datasets, and it can prove particularly useful in privacy-sensitive settings (e.g., the medical domain), where annotating rewrites of conversational questions is not an option. Further, our method is orthogonal to, and can be applied in combination with, existing query resolution techniques.

We open-source our code for reproducibility and future research purposes.


2. Methodology

In this section, we describe our zero-shot dense retriever for conversational retrieval. Our approach consists of two main components: an encoder that produces token embeddings of a document or query and a matching component that compares query and document token embeddings to produce a relevance score.

2.1. Task & Notation

Let $q_i$ be the user utterance/query to the system at the $i$-th turn, and $p_i$ the corresponding canonical passage response provided by the dataset. We formulate our passage retrieval task as follows: given the last user utterance $q_n$ and the previous context of the conversation at turn $n$, $\{q_1, p_1, \ldots, q_{n-1}, p_{n-1}\}$, we produce a ranking of the $k$ passages from a collection that are most likely to satisfy the user's information need.

2.2. Token-level Dense Retrieval

In this section we briefly describe ColBERT (khattab2020colbert), a dense retrieval model that serves as our query and document encoder. In contrast to other dense retrievers that construct global query and document representations (e.g., DPR (karpukhin2020dense) or ANCE (xiong2020approximate)), ColBERT generates embeddings of all input tokens. This allows us to perform matching at the token level. To generate token embeddings, ColBERT passes each token through multiple attention layers in a transformer encoder architecture, which contextualizes each token with respect to its surroundings (vaswani2017attention; devlin2018bert). We use $E_q$ to denote the embeddings produced for a query $q$ and $E_d$ to denote the embeddings produced for a document $d$. To compute a query-document score, ColBERT performs a soft-match between the embeddings of a query token and a document token by taking their inner product. Specifically, each query token is matched with the most similar document token and the summation is computed:

$$ s(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} E_{q_i} \cdot E_{d_j} \qquad (1) $$

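For concreteness, this late-interaction scoring can be sketched in a few lines of NumPy (a simplified illustration of Eq. 1; the actual ColBERT implementation batches this on GPU over a pruned candidate set):

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """ColBERT late interaction (Eq. 1): soft-match every query token
    embedding to its most similar document token embedding via inner
    product, then sum the per-token maxima.

    Q: (num_query_tokens, dim) query token embeddings
    D: (num_doc_tokens, dim) document token embeddings
    """
    sim = Q @ D.T                        # pairwise inner products, shape (|q|, |d|)
    return float(sim.max(axis=1).sum())  # best document match per query token
```

Ranking a collection then amounts to sorting candidate documents by this score for a given query.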
2.3. Conversational Token-level Dense Retrieval

Figure 1. Zero-shot Conversational Dense Retriever (figure adapted from  (khattab2020colbert) with permission)

Our approach extends the idea of token contextualization to multi-turn conversations. When dealing with conversations, it is crucial for each turn to be contextualized with respect to the conversation, because conversational queries have continuity and often exhibit ellipsis or anaphora referring to previous turns (vakulenko2021question; yu2020few; voskarides2020query).

In practice, ColBERT serves as our query and document encoder. We encode documents in the usual way. However, to encode a query at turn $n$, we concatenate the conversational context with the last query utterance $q_n$ before generating contextualized query token embeddings $E_{q_{1:n}}$:

$$ E_{q_{1:n}} = \mathrm{Encode}(q_1 \circ q_2 \circ \cdots \circ q_n) \qquad (2) $$
While $E_{q_{1:n}}$ constitutes token embeddings of the entire conversation (i.e., of the input to the encoder), our goal is to perform ranking based only on tokens from the last utterance $q_n$. To do so, we (1) replace $E_q$ with $E_{q_{1:n}}$ in the token-level matching function (Eq. 1) and (2) restrict the outer summation to the token positions $i \in q_n$, so that only query tokens from the last turn contribute to the score:

$$ s(q_{1:n}, d) = \sum_{i \in q_n} \max_{j=1}^{|d|} E_{q_{1:n},i} \cdot E_{d_j} \qquad (3) $$
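A minimal sketch of this scoring step (the function name and the boolean-mask interface are our own illustration, not the paper's code): the whole conversation is encoded jointly, but only the positions belonging to the last utterance enter the MaxSim sum.

```python
import numpy as np

def zeco_score(E_conv: np.ndarray, last_turn_mask: np.ndarray,
               D: np.ndarray) -> float:
    """Score a document with conversation-contextualized query embeddings,
    letting only last-utterance tokens contribute to the sum.

    E_conv:         (L, dim) token embeddings of the concatenated conversation
    last_turn_mask: (L,) boolean, True for tokens of the last utterance q_n
    D:              (num_doc_tokens, dim) document token embeddings
    """
    Q_last = E_conv[last_turn_mask]      # contextualized, but last turn only
    sim = Q_last @ D.T                   # (|q_n|, |d|) inner products
    return float(sim.max(axis=1).sum())  # MaxSim over last-turn tokens only
```

The same mask mechanism can also zero out special and expansion tokens, as described in Section 3.2.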
Note that this approach of contextualizing with respect to the conversation history avoids the need for resolution supervision from conversational tasks. Instead, it relies on the pre-training of three different tasks: (a) masked language modelling and (b) next sentence prediction (the pre-training of BERT (devlin2018bert)), and (c) the ad-hoc ranking task (the pre-training of ColBERT (khattab2020colbert)).

                      CAsT’19          CAsT’20          CAsT’21
                   NDCG@3  R@100    NDCG@3  R@100    NDCG@3  R@100
ColBERT
  last-turn         0.214   0.157    0.155   0.124    0.140   0.154
  all-history       0.190   0.165    0.150   0.166    0.237   0.265
  ZeCo (ours)       0.238   0.216    0.176   0.200    0.234   0.267
  human             0.430   0.363    0.443   0.408    0.431   0.403
ConvDR (yu2021few)
  zero-shot         0.247   0.183    0.150   0.150      -       -
  few-shot          0.466   0.362    0.340   0.345    0.361   0.376
  human             0.461   0.389    0.422   0.454    0.548   0.451

Table 1. Effectiveness of zero-shot embedding contextualization on the TREC CAsT datasets. Bold font indicates the best-performing zero-shot model. Superscripts indicate statistically significant improvements (paired t-test) of ZeCo over the zero-shot models: last-turn, all-history, and ConvDR zero-shot.

3. Experimental Setup

3.1. Datasets and Evaluation

We test our approach on the TREC CAsT ’19, ’20 and ’21 (dalton2020trec; dalton2020trec1; dalton2022trec) datasets. Each dataset consists of about 25 conversations, with an average of 10 turns per conversation. CAsT ’20 and ’21 include canonical passage responses to previous questions, which the user can refer to or give feedback on. The corpus consists of the MS MARCO Passages and Documents, Wikipedia, and Washington Post news articles (nguyen2016ms; petroni2020kilt; dietz2017trec). TREC CAsT ’19 and ’20 include relevance judgements at the passage level, whereas CAsT ’21 provides them at the document level. For CAsT ’21, we split the documents into passages and score each document based on its highest-scored passage (dai2019deeper).
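The document-level scoring for CAsT ’21 can be sketched as follows (the window size is a hypothetical choice for illustration; the paper does not state its splitting parameters):

```python
def split_into_passages(tokens, size=180):
    """Split a document's token list into consecutive fixed-size passages.
    The size of 180 tokens is an assumed value, not from the paper."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def doc_score(passage_scores):
    """MaxP-style aggregation (dai2019deeper): a document is scored
    by its highest-scoring passage."""
    return max(passage_scores)
```

Each passage is scored independently by the retriever, and the document inherits its best passage's score.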

To quantify retrieval performance we use two metrics: NDCG@3 and Recall (R@100). The former quantifies effectiveness at the top ranks, which is important for the user, while the latter expresses the ability of a first-stage ranker to retrieve relevant passages that can later be re-ranked by a more effective second-stage ranker.
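The two metrics can be computed as follows (a plain sketch with graded-relevance lists; official evaluation uses trec_eval):

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=3):
    """NDCG@k. ranked_rels: relevance grades of the returned ranking, in
    rank order. all_rels: all judged grades for the query, used to build
    the ideal DCG."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of the relevant passages found in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```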

3.2. Methods & Baselines


ColBERT was trained to contextualize tokens through (a) the self-supervision of BERT's masked language modelling and next-sentence prediction (devlin2018bert) and (b) training to optimize ad-hoc ranking (khattab2020colbert). In our experiments, we use the weights of a ColBERT retriever pre-trained on the MSMarco passage ranking dataset (nguyen2016ms). We use ColBERT v1, although our method remains applicable to v2 (santhanam2021colbertv2), whose main novelty is optimizations that reduce the index size. To avoid any spill-over effects we perform matching only on the query tokens; we deactivate matching on the [CLS], query indicator ([Q]), and expansion ([MASK]) tokens used in the original work (khattab2020colbert).


To assess the effect of our contextualization method (ZeCo), we compare with the following baselines:


ColBERT-based:

  • last-turn: embeddings are not contextualized in the conversation (i.e., only $q_n$ is encoded in Eq. (2)).

  • all-history: embeddings are contextualized in the conversation, and the matching function includes terms across the entire history (i.e., all tokens of $q_{1:n}$).

  • human: oracle, using human rewrites.

ConvDR-based (yu2021few):

  • zero-shot: no supervision from conversational tasks/data.

  • few-shot: trained with knowledge distillation on query rewrites.

  • human: oracle, using human rewrites.

4. Results and Analysis

To answer RQ1, which asks whether the last user utterance (question) can be effectively contextualized with respect to the conversation history, we compare the performance of the non-contextualized utterance (last-turn) with our contextualized approach (ZeCo) in Table 1. It is clear that contextualization helps in all cases, yielding substantial relative improvements especially in terms of Recall. We further observe that our approach significantly outperforms the all-history baseline, which uses the entire conversation as the query, on the first two datasets, and yields comparable performance on CAsT’21. We hypothesize that all-history's improved performance on CAsT’21 is due to its document-level annotations, with one document satisfying multiple turns of the conversation. We also observe that all-history performs worse than last-turn regarding NDCG@3 but better regarding R@100. Furthermore, ZeCo outperforms the zero-shot ConvDR in most cases, especially with respect to R@100. Last, while the supervised versions of ConvDR clearly outperform ZeCo in NDCG@3, ZeCo remains competitive in terms of Recall.

Next, to answer RQ2, we consider the effect of contextualization on the user's query by looking into how it changes the last turn's query embeddings.

What are the most influenced terms? To assess the effect of conversational contextualization (ZeCo), we measure the token embedding changes in the user's query and report the terms with the largest average change in Table 2. We define the embedding change of a term $t$ as the cosine distance between its embedding before and after contextualization: $\delta(t) = 1 - \cos(E_t^{last}, E_t^{conv})$. We observe that terms indicating anaphora ('they', 'it', etc.), punctuation symbols and special tokens are the ones most influenced. This is expected, since users often use anaphoras referring to previous conversation rounds. Regarding punctuation and special tokens, one plausible explanation is that a global representation of a turn is aggregated in those tokens after contextualization.
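This change measure can be written directly as a cosine distance (a sketch; extracting the two embeddings of the same token from the encoder is assumed):

```python
import numpy as np

def embedding_change(e_before: np.ndarray, e_after: np.ndarray) -> float:
    """Cosine distance between a token's embedding before (encoded within
    the last turn only) and after conversational contextualization."""
    cos = float(e_before @ e_after /
                (np.linalg.norm(e_before) * np.linalg.norm(e_after)))
    return 1.0 - cos
```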

term         frequency   avg(δ)
they              60     0.501
it               196     0.480
                 934     0.464
?                873     0.458
that              52     0.440
.                192     0.424
macro-avg                0.185
micro-avg                0.323

Table 2. Average change of frequent token embeddings after zero-shot contextualization (all CAsT datasets).
query                                        closest match   sim    Δsim   match context (last-turn)               match context (ZeCo)
what is the first sign of it?                cancer          0.52   +0.31  tell me about lung cancer.              what causes throat cancer?
What is the role of positivism in it?        sociology       0.44   +0.67  what is taught in sociology?            what is taught in sociology?
what technological developments enabled it?                  0.54   +0.42  … the origins of popular music?         … the origins of popular music?
what is the evidence for it?                 bronze age      0.36    0.00  tell me about the bronze age collapse.  tell me about the bronze age collapse.
why did ben franklin want it to be
the national symbol?                         turkeys         0.22   +0.29  where are turkeys from?                 where are turkeys from?
what is it                                   story film      0.39   +0.12  the neverending story film.             the neverending story film.

Table 3. Examples of best term matches of anaphora terms in conversation history (before/after query contextualization).

How do terms change when contextualized? To illustrate how contextualization changes term embeddings, in Table 3 we focus on a highly influenced anaphora term (’it’) and match it to the most similar token embeddings from the conversation history. We observe that in certain cases, zero-shot contextualization resolves anaphoras successfully, bringing anaphora embeddings very close to the referred term (“sociology”, “popular music”, “throat cancer”). The first row shows one noteworthy example where the matching term is always cancer, but contextualization allows it to resolve to the correct embedding of throat cancer instead of lung cancer. Lastly, we see cases where embeddings come closer to punctuation symbols, indicating that those might preserve some sort of global query representation, or a multi-token concept (e.g., “the neverending story film .”).

To answer RQ3, we quantitatively explore whether contextualization brings anaphora terms closer to their corresponding resolutions and how this affects ranking. Bringing anaphora terms closer to resolutions is crucial for conversational search. We identify those terms automatically using the human rewrites and define the effect of contextualization in bringing anaphora embeddings ($E_{ana}$) closer to resolution embeddings ($E_{res}$) as:

$$ \Delta_{sim} = \cos(E_{ana}^{conv}, E_{res}) - \cos(E_{ana}^{last}, E_{res}) $$

where anaphoras are contextualized within the last turn only ($E_{ana}^{last}$) or the entire conversation ($E_{ana}^{conv}$). We encode resolutions ($E_{res}$) independently to ensure they retain their original representations. On queries with multi-token anaphoras or resolutions, we pick the highest match. In Figure 2 we show the scatter plot of this $\Delta_{sim}$ against the change in Recall. In most cases, contextualization improves Recall and brings anaphoras closer to resolutions ($\Delta_{sim} > 0$). Further, Recall correlates with this change in similarity towards resolutions (Pearson correlation).
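The Δsim computation above can be sketched as follows (function names are ours; multi-token resolutions take the best-matching pair, as described):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta_sim(ana_last, ana_conv, resolution_tokens):
    """Change in similarity of an anaphora embedding towards its resolution.

    ana_last:          anaphora contextualized within the last turn only
    ana_conv:          anaphora contextualized within the full conversation
    resolution_tokens: independently encoded resolution term embeddings;
                       for multi-token resolutions the highest match is used.
    """
    best = lambda e: max(cosine(e, r) for r in resolution_tokens)
    return best(ana_conv) - best(ana_last)
```

A positive value means contextualization moved the anaphora embedding towards its resolution.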

Figure 2. Correlation between the change in Recall and the similarity change of anaphoras towards resolutions (CAsT ’19)

             sim(anaphora, resolution)   sim(anaphora, random term)
last-turn            0.178                        0.255
ZeCo                 0.372                        0.204

Table 4. Embedding similarities between anaphoras and resolution or random terms from the conversations (CAsT ’19).

To examine whether anaphora terms coming closer to resolutions is simply a by-product of being encoded together, in Table 4 we measure similarities between anaphoras and (a) resolution terms vs. (b) random terms from the same conversation. For consistency, we also encode the random terms independently. When anaphoras are contextualized within only the last turn, they are on average more similar to random terms than to resolutions. However, our method (ZeCo) brings anaphoras closer to their resolutions, while pushing them away from other (random) words of the same conversation. This confirms that resolutions have a high impact on contextualizing anaphoras, in contrast to random conversation words. The mechanism behind this effect requires further investigation: it could be that simply the lower frequency of resolution terms has an effect here, but it is also possible that pre-trained transformers have certain co-reference resolution capabilities (e.g., relating 'it' to a noun). Regardless, it is evident that our method induces a bias towards salient terms from the conversation, leading to improved ranking performance.

5. Conclusions

In this paper, we explore the possibility of performing conversational search in a zero-shot setting, by contextualizing the last user query with respect to the conversation history. We show that this method is highly effective for first-stage ranking, yielding consistent and significant improvements in Recall. Further, it is suitable for privacy-sensitive settings, and can be combined with existing query rewriting techniques. In addition, we shed light on how zero-shot contextualization changes the last turn's embeddings and show that biasing them towards the previous conversation can help retrieval, since it brings them closer to the conversation topic and salient terms. For future work we aim to explore zero-shot re-ranking and to extend this work to few-shot training.

This research was supported by the NWO Innovational Research Incentives Scheme Vidi (016.Vidi.189.039), the NWO Smart Culture - Big Data / Digital Humanities (314-99-301), and the H2020-EU.3.4. - SOCIETAL CHALLENGES - Smart, Green And Integrated Transport (814961). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.