Recent advances in the field of Natural Language Understanding (NLU) (Devlin et al., 2018; Adiwardana et al., 2020; Brown et al., 2020) have enabled natural language interfaces to help users find information beyond what typical search engines provide, through systems such as open domain and task-oriented dialogue engines (Li et al., 2018, 2020) and conversational recommenders (Christakopoulou et al., 2016), among others. However, most existing systems still exhibit one or both of the following limitations: (1) answers are typically constrained to relatively simple and primarily factoid-style requests in natural language (Kwiatkowski et al., 2019; Soleimani et al., 2021), as is the case with search engines; and (2) they require the availability of inferred user preferences (Kostric et al., 2021).
However, user information needs, when expressed using natural language, can be inherently complex and contain many interdependent constraints, as is shown in Figure 1. When issuing such requests, users may be considered to be in exploratory mode; they are looking for suggestions to pick from, rather than a single concrete answer. The task becomes especially challenging since most real applications (Christakopoulou, 2018) need to constantly deal with cold-start users (Kiseleva et al., 2016a; Sepliarskaia et al., 2018), for whom little to no preferential knowledge is known a priori. This may be due to infrequent visits, rapid changes in user preferences (Bernardi et al., 2015; Kiseleva et al., 2014, 2015), or general privacy-preserving constraints. In this work, we aim to bridge the described gap of processing complex information-seeking requests in natural language from unknown users by developing a new type of application, which will work as illustrated in Figure 1. Concretely, our proposed solution is capable of jointly processing complex natural language requests, inferring user preferences, and suggesting new ones for users to explore, given real-time interactions with the Interactive Agent (IA).
One of the major bottlenecks in tackling the proposed problem of processing complex information-seeking is the lack of an existing interactive system to collect data and observe user interactions. Therefore, we designed a platform, which we call Pluto, which allows users to submit complex information-seeking requests. Using Pluto, we leverage human agents in the loop to help users accomplish their informational needs while collecting data on complex search behavior and user interactions (e.g., Holzinger, 2016; Li et al., 2016).
Furthermore, we propose a novel IA that seeks to replace human agents in the loop in order to scale out Pluto to a significantly broader audience while simultaneously making responses a near real-time experience. The proposed IA contains an NLU unit that extracts a semantic representation of the complex request. It also integrates a novel score that estimates the completeness of a user’s intent at each interactive step. Based on the semantic representation and completion score, the IA interacts with users through a Reinforcement Learning (RL) loop that guides them through the process of specifying intents and expressing new ones. The proposed model leverages a user interface to present a ranked list of suggested intents that users may not have previously thought about, or even known about. Online user feedback from these interactions is used to automatically improve and update the reinforcement learner’s policies.
Another important aspect we touch on is a simple, straightforward evaluation of the proposed approach. We adopt pre-retrieval metrics (e.g., Sarnikar et al., 2014; Roitman et al., 2019) as a means to evaluate the extent to which the refinement of the complex request afforded by the IA better represents the actual user intent, or narrows down the search space. Our evaluation demonstrates that a better formulated complex request results in a more reliable and accurate retrieval process. For the retrieval phase, we break down the complex request based on the contained slots and generate a list of queries from the user intent, slots, and location. A search engine API is used to extract relevant documents, after which a GPT-3 based ranker re-ranks the final results based on the actual slot values or aspects. The final re-ranker considers the user preferences through the aspect values for the slots in the reformulated query.
To summarize, the main contributions of this work are:
Designing a novel interactive platform to collect data for handling complex information-seeking tasks, which enables integration with humans-in-the-loop for initial processing of the user requests and search engines to retrieve relevant suggestions in response to refined user requests (Section 3)
Proposing a hybrid model, which we name Interactive Agent (IA), consisting of a Natural Language Understanding (NLU) component and a Reinforcement Learning (RL) component. This model, inspired by conversational agents, encourages and empowers users to explicitly describe their search intents so that they may be more easily satisfied (Section 5)
Suggesting an evaluation metric, the Intent Completion Score (ICS), that estimates the degree to which the intent is expressed completely at each step. This metric is used to decide whether to continue the interactive loop, so that users can express the maximum preferential information in a minimum number of steps (Section 7.2)
2. Background and Related Work
Our work is relevant to four broad strands of research: multi-armed bandits, search engines, language as an interface for interactive systems, and exploratory search and trails. We review each below.
Contextual bandits for recommendation
Multi-armed bandits are a classical exploration-exploitation framework from Reinforcement Learning (RL), where user feedback is available in each iteration (Parapar and Radlinski, 2021; Cortes, 2018; Li et al., 2010). They are becoming popular for online applications such as ad ranking and recommendation systems (e.g., Ban and He, 2021; Joachims et al., 2020), where information about user preferences is unavailable (Felício et al., 2017), as with cold-start users (Bernardi et al., 2015; Kiseleva et al., 2016a). Parapar and Radlinski (2021) proposed a multi-armed bandit model for personalized recommendations by diversifying the user preferences. Others examined the application of contextual bandit models in healthcare, finance, dynamic pricing, and anomaly detection (Bouneffouf and Rish, 2019).
Our work applies the contextual bandit paradigm to the new problem of interactive intent modeling for complex information-seeking tasks.
Commonly used search engines such as Google and Bing provide platforms focusing on the document retrieval process through search sessions (Hassan et al., 2010; Kiseleva et al., 2014, 2015; Ageev et al., 2011). Developing retrieval models that can extract the most relevant documents from an extensive collection has been well-studied (Croft et al., 2010) for decades. The developed retrieval models focus on retrieving the most relevant documents to the search query concerning the user’s textual and contextual information within or across search sessions (Kotov et al., 2011). Although extracting relevant documents is necessary, it is not always sufficient, especially when the users have a complex information-seeking task (Ingwersen and Järvelin, 2006).
Language as an interface for interactions
NLU has been an important direction for human-computer interaction and information search for decades (Woods et al., 1972; Codd, 1974; Hendrix et al., 1978). The recent impressive advances in the capabilities of NLU (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; Adiwardana et al., 2020; Roller et al., 2020; Brown et al., 2020), powered by large-scale deep learning, together with increasing demand for new applications, have led to a major resurgence of natural language interfaces in the form of virtual assistants, dialog systems, conversational search, semantic parsing, and question answering systems (Liu and Lane, 2017, 2018; Dinan et al., 2020; Zhang et al., 2019). The scope of natural language interfaces has been significantly expanding from databases (Copestake and Jones, 1990) to knowledge bases (Berant et al., 2013), robots (Tellex et al., 2011), virtual assistants (Kiseleva et al., 2016c, b), and other various forms of interaction (Fast et al., 2018; Desai et al., 2016; Young et al., 2013). Recently, the community has focused on continuous learning through interactions, including systems that learn a new task from instructions (Li et al., 2020), assess their uncertainty (Yao et al., 2019), and ask for feedback from humans in case of uncertainty (Aliannejadi et al., 2021, 2020) or to correct possible mistakes (Elgohary et al., 2020).
The interfaces discussed above model user intent through a step-by-step process. This mechanism is effective; however, given the current state of dialogue manager design, such systems struggle to model long-term dependencies. A slight mistake can cascade to higher levels, which causes user dissatisfaction. Also, conversational search systems do not allow users to express their entire intent at once, resulting in an incomplete representation of user intent.
Exploratory search and trails
Exploratory search refers to an information-seeking process in which the system assists the searcher in understanding the information space for iterative exploration and retrieval of information (Ruotsalo et al., 2018; Hassan Awadallah et al., 2014; White et al., 2008). The anomalous state of knowledge hypothesis is the main motivation behind the demand for information-seeking search systems. According to this hypothesis, users usually cannot accurately formulate their information need because they are missing some essential information (Liu and Belkin, 2015; White and Roth, 2009). In such cases, the system should assist the user in specifying their intent (Marchionini, 2006). Odijk et al. (2015) show that in a large portion of searches, users either struggle to formulate their query or explore in order to navigate to a particular Web page. New search interfaces need to be designed to support searchers through their information-seeking process (Villa et al., 2009). Trails are another group of tools that were developed to guide users in accomplishing search tasks. Olston and Chi (2003) proposed ScentTrails, which leverages an interface that combines browsing and searching and highlights potentially relevant hyperlinks. WebWatcher (Joachims et al., 1997), like ScentTrails, underlined the relevant hyperlinks and improved the model based on the implicit feedback collected during previous tours.
To summarize, the key distinctions of our work compared to previous efforts are as follows. Similar to exploratory search, trails, and conversational search, our model proposes an iterative information-seeking process and designs an interface for user interactions to guide struggling users and help them better understand the information space. Unlike exploratory search, trails, and conversational search, which focus only on user interaction modeling and limit users to issuing short and imprecise queries and utterances, our model provides a platform for users to express their information needs in the form of long and complex requests. Users can utilize this capability to express their intent more accurately and prune significant parts of the search space for the exploratory search process. Adding this capability requires an advanced NLU step and different machine learning components to understand and guide the user through the search process. To this end, the proposed system has two new components, an intent ontology and an intent profile, for partitioning the information space so that the IA can be more effective in exploring the search space.
3. Pluto: data collection infrastructure
Since the proposed problem is novel and requires non-trivial user interaction data, we designed a new pipeline to collect such data. Users of Pluto were supplied with a consent form explaining that their requests and interactions would be viewed by human agents and some members of the development team. Further, the human agents in the loop also consented to have their interactions with the system recorded. All data and interactions were anonymized, and no personal identifiers of users or agents were retained in any of the modeling and experimentation in our work. This section presents the details of the developed infrastructure, which is called Pluto. It leverages human-in-the-loop for data collection and curation. Pluto is comprised of two main phases:
Phase 1: Refinement of the complex user request in natural language;
Phase 2: Refinement of the retrieved list of suggestions.
Complex user request refinement.
When a user issues a request in natural language to express their complex information needs, which potentially has many expressed constraints (see Figure 1 for several mentioned in the example), GPT-3 (Brown et al., 2020) is leveraged to understand the request’s intent and identify explicitly mentioned aspects.
Once GPT-3 has identified these aspects, they will be used as the initial set for the request. To further expand this set, this phase will identify an additional list of aspects to be presented to users as a supplemental set of relevant considerations for their request.
As stated, Pluto has integrated human-in-the-loop into its pipeline. The goal of human agents is to intervene at certain stages of the system to offer human judgment. One such intervention occurs when agents review users’ requests, at which point they can correct the aspects this phase identified in the request as well as add new ones to better serve the user’s needs.
In the second phase, Pluto performs two tasks. First, it receives the slots selected by the user for processing and suggests additional slots (so as to further narrow down the request, with the aid of the user). These new slots can be generated via GPT-3 or by intervention from the human agents. Second, Pluto leverages the search engine to produce a series of suggestions that meet the slots for the request as well as the new slot proposal. GPT-3 is leveraged at this stage to help determine which potential suggestions meet which aspects of the request, so that the system can rank them. Human agents then make final decisions on which suggestions to present to users. Once that is done, users can either accept the suggestions if they are satisfactory, or request another iteration of the retrieval phase. When users request another iteration, they may either change the language of the request or add/remove aspects from it (including the newly suggested ones). Additionally, for any iteration of this phase, users can provide feedback that is captured via a form to help refine the system.
Finally, human agents are responsible for another very valuable and essential contribution: intent and aspect curation. In either of the phases described above, GPT-3 may suggest various aspects and intents that are sometimes not relevant or useful. All of these are considered entries into the dynamic intent ontology; human agents then curate them. Intents and aspects that the agents consider to be of higher quality are given more weight when suggesting aspects in either of the two phases.
Next, we give a formal problem description and then elaborate on the proposed interactive agent.
4. Problem description
In this section, we formalize the problem of interactive intent modeling for supporting complex information-seeking tasks.
We begin by defining the notation used:
User request: a complex information-seeking task expressed in natural language, which contains multiple functional desiderata, preferences, and conditions (e.g., Figure 1).
Request topic: the topic that the request belongs to, e.g., “activity” or “service”, drawn from the list of all existing topics.
User intent: for each topic, a list of user intents can be defined. Intents identify what a user wants to find. For example, in Figure 1, the user request has an “activity” topic with a “hiking” intent. This definition allows identical intents to appear in different domains; together they form the list of all intents.
Slot and aspect: for each specific topic and user intent, a list of slots is defined that describes features and properties of the intent within that topic, and an aspect (value) is a restriction on a slot. For example, in Figure 1, “date” is a slot related to “hiking”, with the aspect value “May 9th to May 29th, 2021”.
Intent completion score: a score that estimates the completeness of the user intent at a given interaction step.
Semantic representation: an information frame that gives an abstract representation of the user request in terms of its topic, intent, and slots.
Intent ontology: the graph structure representing relations among the defined domains, intents, and slots.
Intent profile: the list of all conditional distributions of slots with respect to a topic and an intent. It can change over time through user interactions with a specific intent and topic. Figure 3 shows the different steps to generate the intent profile.
List of retrieved suggestions: the list of suggestions retrieved in response to the user request.
This section provides a high-level problem formulation. The desired IA aims to map a request expressing a complex information-seeking task to a set of relevant suggestions, as illustrated in Figure 2. The proposed model is comprised of three main components:
Natural Language Understanding (NLU) component: consists of a topic classifier, an intent classifier, and a slot tagger, which extract the topic, the user intent, and a list of slots, respectively. The unit leverages GPT-3 to improve and generalize the predictions for unseen slots. Finally, the NLU component generates the semantic representation for a complex request.
Interactive intent modeling component: an iterative model leveraging contextual multi-armed bandits (Cortes, 2018) that receives the semantic representation and the context for the request from the NLU unit and predicts the most relevant set of slots for the request.
Retrieval component: generates a sequence of sub-queries based on the list of slots and their corresponding aspects. Relevant documents are retrieved and ranked by GPT-3 to provide the final list of retrieved suggestions.
To summarize, this section formally defined the problem that we intend to solve in the next section (Algorithm 1).
5. Method Description
This section presents a detailed description of the proposed strategy for modeling the Interactive Agent (IA).
5.1. Creating intent profile
Based on the intent ontology created in Section 3 and historical user interactions with each topic, intent, and slot, a dynamic intent profile can be formed. To do so, for each individual topic, intent, and slot, the intent profile stores a conditional probability, which can be updated in real time using new user interactions with that (topic, intent, slot) triple. The conditional probability (Eq. 1) is computed for the i-th slot of a given intent and topic, where the index ranges over the slots that the intent ontology defines for that intent and topic.
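The profile update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: it assumes that the conditional probability of a slot is its interaction count divided by the total count over all slots of the same (topic, intent) pair, and the names `IntentProfile`, `record`, and `probability` are hypothetical.

```python
from collections import defaultdict


class IntentProfile:
    """Toy intent profile: per-(topic, intent) slot counts (assumed form)."""

    def __init__(self):
        # counts[(topic, intent)][slot] = number of observed user selections
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, topic, intent, slot):
        """Update the profile with one new user interaction."""
        self.counts[(topic, intent)][slot] += 1

    def probability(self, topic, intent, slot):
        """Estimate P(slot | topic, intent) as a relative frequency."""
        slots = self.counts.get((topic, intent), {})
        total = sum(slots.values())
        return slots.get(slot, 0) / total if total else 0.0


profile = IntentProfile()
for s in ["date", "date", "difficulty", "parking"]:
    profile.record("activity", "hiking", s)
print(profile.probability("activity", "hiking", "date"))  # 0.5
```

Because each interaction only increments a counter, the estimate can be refreshed in real time as new interactions arrive.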
5.2. NLU component
The NLU unit contains three main components: (1) a topic classifier, (2) an intent classifier, and (3) a slot tagger. For each incoming complex request, this unit generates a semantic representation from the predicted topic, intent, and slots.
To generate the semantic representation, we leveraged GPT-3 (Brown et al., 2020), a very large language model trained on massive amounts of textual data that has proven capable of natural language generalization and task-agnostic reasoning. One of the hallmarks of GPT-3 is its ability to generate realistic natural language outputs from few or even no training examples (few-shot and zero-shot learning).
Intent Completion Score
We propose the Intent Completion Score (ICS) to manage the number of interactive steps. The ICS value is calculated from the semantic representation and the dynamic intent profile. The initial ICS value equals the sum of the conditional probabilities of all slots in the request. In the following steps, the ICS is updated with the new slots that the user selects.
The score therefore depends on two counts: the number of slots explicitly mentioned in the request and the number of slots selected through the interactive steps. The conditional probabilities are those stored in the intent profile (Eq. 1).
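A plausible form of the score, reconstructed from the verbal description above rather than taken from the original equation, is given below. All symbols are assumptions introduced for illustration: $t$ and $u$ denote the topic and intent, $s_i$ the $m$ slots explicitly mentioned in the request, $\hat{s}_j$ the $k$ slots selected during the interactive steps, and $P(\cdot \mid t, u)$ the conditional probability stored in the intent profile.

```latex
\mathrm{ICS}_0 = \sum_{i=1}^{m} P(s_i \mid t, u),
\qquad
\mathrm{ICS}_k = \mathrm{ICS}_0 + \sum_{j=1}^{k} P(\hat{s}_j \mid t, u)
```

Under this reading, the score never decreases across interaction steps, since each newly selected slot contributes a non-negative probability.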
5.3. Interactive user intent modeling
We leveraged contextual multi-armed bandits to model online user interactions. In each iteration, the system interacts with users, receives user feedback, and updates its policies. Multi-armed bandits (Barraza-Urbina and Glowacka, 2020) are a type of RL model in which rewards are available immediately after the agent interacts with the environment. Contextual multi-armed bandits extend multi-armed bandits by also modeling the context of the environment when predicting the next step. They suit the interactive agent because users can provide feedback to the agent at each step. We trained a separate contextual multi-armed bandit for each (topic, intent) pair. The corresponding bandit model is then invoked at inference time, based on the semantic representation of the request. One of the main design decisions for contextual bandits is how to represent the context. To this end, we propose three different methods, described below.
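The interaction loop described above can be illustrated with a small, self-contained sketch. It is not the implementation used in the paper, which relies on an external contextual bandits library: here an epsilon-greedy policy with one linear reward model per arm stands in for those policies, and the class `EpsilonGreedyBandit`, its parameters, and the example slots are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Tiny contextual bandit: one linear reward model per arm (slot)."""

    def __init__(self, arms, dim, epsilon=0.1, lr=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        self.weights = {arm: [0.0] * dim for arm in self.arms}

    def _score(self, arm, context):
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def select(self, context):
        """Explore with probability epsilon, otherwise pick the best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self._score(arm, context))

    def update(self, arm, context, reward):
        """One SGD step on the squared error of the predicted reward."""
        error = reward - self._score(arm, context)
        self.weights[arm] = [
            w + self.lr * error * x for w, x in zip(self.weights[arm], context)
        ]


# One bandit per (topic, intent) pair, looked up at inference time.
bandits = {
    ("activity", "hiking"): EpsilonGreedyBandit(
        arms=["date", "difficulty", "parking"], dim=3, epsilon=0.0
    )
}
bandit = bandits[("activity", "hiking")]
context = [1.0, 0.0, 0.0]  # e.g. "date" is already part of the request
bandit.update("difficulty", context, reward=1.0)  # user accepted a suggestion
print(bandit.select(context))  # difficulty
```

Setting `epsilon=0.0` makes the example deterministic; in an online setting a non-zero value keeps the agent exploring slots it has rarely suggested.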
The first method uses a one-hot encoding of the semantic representation. During the interactions with our agent, the one-hot representation is updated by adding newly selected slots. As a result, the size of the context equals the number of slots for each specific intent.
Here, the context at each interaction step is the one-hot vector of the slots collected up to that step; it is defined in terms of the total number of slots at that step and the set of slots belonging to the intent.
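A minimal sketch of this one-hot context follows. The function name `one_hot_context` and the slot vocabulary are hypothetical and serve only to illustrate how the vector grows as the user selects slots.

```python
def one_hot_context(intent_slots, collected_slots):
    """Return a 0/1 vector over the intent's slots, 1 where a slot is collected."""
    collected = set(collected_slots)
    return [1 if slot in collected else 0 for slot in intent_slots]


# Hypothetical slot vocabulary for a "hiking" intent.
hiking_slots = ["date", "difficulty", "distance", "parking"]

context = one_hot_context(hiking_slots, ["date"])
print(context)  # [1, 0, 0, 0]

# After the user selects "parking" in the next interaction step:
context = one_hot_context(hiking_slots, ["date", "parking"])
print(context)  # [1, 0, 0, 1]
```

The vector length is fixed by the number of slots defined for the intent, which matches the statement that the context size equals the number of slots per intent.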
In the second method, the request representation is concatenated with the one-hot representation of the slots to enrich the context representation. We used the Google Universal Sentence Encoder (USE) (Cer et al., 2018), trained with a deep averaging network (DAN) encoder, to encode each request into a 512-dimensional vector.
As before, the one-hot vector encodes the slots collected up to the current interaction step.
Inspired by session-based recommender systems (Wu and Yan, 2017), the third method uses a deep learning model (Figure 4) to extract the slot representations. Users were excluded from the model, as we focused only on intent modeling independent of the user. The goal is to predict the list of slots most likely to be selected by the user, given the input request and the explicitly mentioned slots in the semantic representation.
The model consists of (1) an embedding layer, (2) a representation layer, and (3) a prediction layer. We used sigmoid cross-entropy to compute the loss since the task is a multi-label problem: a subset of slots is predicted for an input list of slots and the request representation. Finally, max-pooling is applied across all the slot embeddings, and the result is concatenated with the request embedding vector to represent the context.
Here, each slot embedding is specific to its intent and topic, and the slots collected up to the current step are identified by their one-hot vector.
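The pooling and concatenation step can be sketched as below. This is an illustration only: the embeddings are toy values rather than outputs of the trained model, the concatenation order (request embedding first) is an assumption, and the helper names `max_pool` and `build_context` are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def build_context(request_embedding, slot_embeddings, collected_slots):
    """Concatenate the request embedding with max-pooled slot embeddings."""
    selected = [slot_embeddings[s] for s in collected_slots]
    return list(request_embedding) + max_pool(selected)


# Toy 2-dimensional embeddings; real ones would come from the trained model.
slot_embeddings = {"date": [0.2, 0.9], "parking": [0.7, 0.1]}
request_embedding = [0.5, -0.3, 0.1]

print(build_context(request_embedding, slot_embeddings, ["date", "parking"]))
# [0.5, -0.3, 0.1, 0.7, 0.9]
```

Max-pooling keeps the context size independent of how many slots have been collected, which is what allows the same bandit input dimension at every interaction step.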
Threshold to stop iterations:
We leverage the ICS, whose value increases steadily through the interactions, to stop the contextual bandit iterations: when the score becomes greater than a threshold, the contextual bandit model stops iterating. The threshold varies per (topic, intent) pair. Hence, for each pair we set the threshold to the mean plus the standard deviation of the slot distribution within that pair.
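One possible reading of this stopping rule is sketched below. It assumes that the "slot distribution" refers to the conditional probabilities of the slots for a given pair and that the population standard deviation is used; both are assumptions, as are the function names and the example probabilities.

```python
from statistics import mean, pstdev


def stop_threshold(slot_probabilities):
    """Mean plus standard deviation of the slot probabilities for one pair."""
    values = list(slot_probabilities)
    return mean(values) + pstdev(values)


def should_stop(ics, slot_probabilities):
    """Stop iterating once the score exceeds the pair-specific threshold."""
    return ics > stop_threshold(slot_probabilities)


# Hypothetical slot distribution for one (topic, intent) pair.
probs = {"date": 0.4, "difficulty": 0.3, "distance": 0.2, "parking": 0.1}

ics = probs["parking"]  # only "parking" is mentioned in the request
print(should_stop(ics, probs.values()))  # False

ics += probs["difficulty"]  # the user selects "difficulty"
print(should_stop(ics, probs.values()))  # True
```

Because the threshold is derived from each pair's own distribution, intents with many low-probability slots and intents with a few dominant slots stop after different amounts of collected probability.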
5.4. Retrieval Component
To extract the final recommendations for the users, we use a retrieval engine that consists of two main components: (1) search retrieval and (2) ranking. For the retrieval part, we need to collect a corpus that is representative of the search space on the Web. We can then compute the pre-retrieval metrics discussed in Section 7.2 for both the initial requests and the reformulated requests at inference time.
To generate the corpus, we need to issue a series of queries to a search engine that will capture the search space of the web. Algorithm 3 shows the steps we used to generate these queries and collect the corpus. In essence, we leveraged a pool of sub-queries derived from the internal intent ontology. To create these sub-queries, we use the idea of request refinement using request sub-topics (Nallapati and Shah, 2006), and generated a list of them by combining each selected topic/intent with the set of aspects we have associated with it. Finally, these queries were issued to the Bing Web Search API, and the top 100 results (consisting of the page’s title, URL, and snippet) for each query were added to the corpus.
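The sub-query generation step can be sketched as follows. This is not Algorithm 3 itself, which is not reproduced in this text; it only illustrates combining each intent with its associated aspects. The ontology fragment, the optional `location` argument, and the function name are hypothetical, and the call to the web search API is deliberately left out.

```python
def generate_sub_queries(ontology, location=None):
    """Combine each intent with each of its aspects into one sub-query."""
    queries = []
    for (_topic, intent), aspects in ontology.items():
        for aspect in aspects:
            parts = [intent, aspect]
            if location:
                parts.append(location)
            queries.append(" ".join(parts))
    return queries


# Hypothetical fragment of an intent ontology.
ontology = {
    ("activity", "hiking"): ["dog friendly", "easy trail"],
    ("service", "electrician"): ["licensed"],
}

for query in generate_sub_queries(ontology, location="Seattle"):
    print(query)
# hiking dog friendly Seattle
# hiking easy trail Seattle
# electrician licensed Seattle
```

Each generated string would then be issued to the search API, and the returned titles, URLs, and snippets would be added to the corpus.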
A few-shot GPT-3 model, which has been fine-tuned on a limited number of training samples, is deployed on the pool of potential suggestions extracted from the Web Search API. The GPT-3 ranker then ranks all the potential suggestions with respect to the evolved user intent and the actual aspect values, thereby taking the user preferences into account in the final ranking.
To evaluate the proposed interactive model, we leveraged the real data collected through user interaction with Pluto. We collected more than user requests with user interactions for training, and user requests with 13,840 interactions for testing. In Section 6.1, we describe a crowd-sourcing procedure that is designed to collect annotated data, which is used to train and test the slot tagger in the Natural Language Understanding (NLU) unit. Section 6.2 describes the interactive data collected via Pluto (Section 3). The datasets cannot be shared publicly due to privacy concerns; however, we believe that the dataset collection can be reproduced from the descriptions presented here.
6.1. Dataset Collected for NLU Unit
To collect the data for training and evaluating the NLU model, we use a crowd-sourcing platform that provides an easy way for researchers to build interfaces for data collection and labeling. Using the platform, we developed a simple interface that presented annotators with a natural language request paired with up to five possible slots. Annotators were then asked to mark the relevant slots and were given the opportunity to highlight the sections of the request that expressed the corresponding aspect for each slot.
The set of requests and slots presented to annotators was created from a seed set of requests, where each request was paired with all the slots from the subsuming intent. Three annotators then used the interface to map slots to requests as appropriate.
Evaluating quality of the collected dataset.
Requests were randomly selected from two different topics and 14 user intents (Table 1). We chose only two topics because all of the selected intents belong to these two topics. Three different human annotators manually labeled these requests through the data collection interface described in the previous section. Table 1 presents Krippendorff’s alpha scores (Krippendorff, 2011) across all the intents. A score greater than or equal to 0.667 is often considered to indicate acceptable reliability. The results demonstrate an acceptable agreement among all annotators, except for the “hike” intent, which shows a moderate agreement (Krippendorff, 2011). On inspection, we noticed that the slots for the “hike” intent overlap, meaning that some slots refer to the same thing with different textual representations. These semantic overlaps remained even after normalization with clustering, and they sometimes confuse annotators.
| Topic | Intent | Krippendorff’s alpha (dist) | Topic | Intent | Krippendorff’s alpha (dist) |
|---|---|---|---|---|---|
| Service | restaurants | 0.74 (12%) | Service | appliance | 0.71 (11%) |
| Service | electrician | 0.79 (13%) | Service | hotel | 0.71 (2%) |
| Service | landscaping | 0.67 (16%) | Service | handyman | 0.75 (2%) |
| Activity | hike | 0.58 (10%) | Service | cleaners | 0.69 (4%) |
| Activity | general | 0.74 (8%) | Service | remodeling | 0.82 (3%) |
| Activity | springbreak | 1.00 (5%) | Activity | daytrip | 0.73 (2%) |
| Activity | campground | 0.74 (6%) | Activity | summercamp | 0.75 (6%) |
6.2. Dataset Collected via Pluto
The data for training and evaluating our proposed model was collected from six months of Pluto’s proprietary interactive logs, described in Section 3. We used the first five months to form the training set and reserved the last month for testing. Since GPT-3 is a generative model, the suggested slots during data collection may not be expressed identically, despite representing the same underlying intent (e.g., “access to parking” and “parking availability”). To address this issue, we used the Universal Sentence Encoder (Cer et al., 2018) to softly match each generated slot to an existing slot: the existing slot with the lowest cosine distance is considered the target slot.
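The soft-matching step can be illustrated with a short sketch. The vectors below are made-up 3-dimensional stand-ins for sentence-encoder embeddings, and the function names are hypothetical; only the nearest-neighbour rule under cosine distance is taken from the description above.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def soft_match(generated_vec, known_slot_vecs):
    """Return the known slot whose embedding is closest to the generated one."""
    return min(
        known_slot_vecs,
        key=lambda slot: cosine_distance(generated_vec, known_slot_vecs[slot]),
    )


# Toy vectors standing in for sentence-encoder embeddings.
known = {
    "parking availability": [0.9, 0.1, 0.0],
    "trail difficulty": [0.0, 0.2, 0.9],
}
generated = [0.8, 0.2, 0.1]  # made-up embedding of "access to parking"
print(soft_match(generated, known))  # parking availability
```

In practice one might also reject matches whose distance exceeds a cut-off, so that genuinely new slots are not forced onto an unrelated existing slot; the text does not say whether such a cut-off was used.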
Pluto is capable of covering hundreds of different user intents. In this study, however, we selected the most frequent search intents in the logs because we observed a sharp drop-off in frequency after that. Table 1 lists the intents with their corresponding topics. Each sample in the collected interactive dataset includes two sets of slots with no intersection between them. The selected slots are the slots that the user selects during the interaction with the interactive agent. We collected more than user requests with user interactions for training, and user requests with 13,840 interactions for testing.
6.3. Corpus Collection
To generate the corpus, we need to issue a series of queries to a search engine that will capture the search space of the web. Algorithm 3 shows the steps used to generate these queries and collect the corpus.
7. Experimental Setup and Results
Baseline: Popularity Method
The popularity-based method is a heuristic, suggesting the next set of related slots based on overall frequency (popularity) in the intent profile. The order of suggestions can change over time as some slots become more popular for specific intents.
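A minimal sketch of such a popularity baseline is shown below. The counts, the cut-off `k`, and the function name are hypothetical; the sketch only assumes that slots already present in the request are not suggested again.

```python
from collections import Counter


def popularity_suggestions(slot_counts, already_present, k=3):
    """Suggest the k most frequent slots that are not already in the request."""
    present = set(already_present)
    ranked = [slot for slot, _ in Counter(slot_counts).most_common()]
    return [slot for slot in ranked if slot not in present][:k]


# Hypothetical selection counts for one intent in the intent profile.
counts = {"date": 40, "difficulty": 25, "parking": 15, "distance": 10}
print(popularity_suggestions(counts, already_present=["date"], k=2))
# ['difficulty', 'parking']
```

Since the ranking depends only on the counts, the suggestion order shifts automatically as some slots become more popular for an intent.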
The models proposed in this paper are unique in modeling the user intent, and they can serve as baselines for future research in this area, since current conversational and exploratory search models (Louvan and Magnini, 2020; White and Roth, 2009; Kostric et al., 2021) are not applicable to the described task. For convenience, we summarize the methods compared in the experimental results as follows:
Group 1: Contextual Multi-armed Bandit Policies
We report the results on different policies for contextual bandit models, including “Bootstrapped Upper Confidence Bound”, “Bootstrapped Thompson Sampling”, “Epsilon Greedy”, “Adaptive Greedy”, and “SoftMax Explorer”, which have been extensively investigated in (Cortes, 2018). The library used to implement the policies is available at https://contextual-bandits.readthedocs.io/en/latest/.
Group 2: Different context representation:
We report the results for the three proposed context representations described in Section 5.3.
7.2. Evaluation Metrics
Evaluating complex search tasks has always been challenging. Since the task is not supervised and there is no available dataset or labels, we could not directly evaluate the results. In addition, our goal is to refine requests so that they lead to better suggestions. Therefore, we propose to employ Query Performance Prediction (QPP) metrics for evaluation purposes. The QPP task is defined as predicting the performance of a retrieval method on a given input request (Carmel and Yom-Tov, 2010; Cronen-Townsend et al., 2002; He and Ounis, 2004). In other words, query performance predictors predict the quality of the retrieved items with respect to the query. QPP methods have been used in different applications such as query reformulation, query routing, and intelligent systems (Sarnikar et al., 2014; Roitman et al., 2019). QPP methods are a promising indicator of retrieval performance and are categorized into pre-retrieval and post-retrieval methods (Carmel and Yom-Tov, 2010).
Post-retrieval QPP methods generally show superior performance compared to pre-retrieval ones, whereas pre-retrieval QPP methods are used more often in real-life applications and can address more practical problems, since their prediction occurs before retrieval.
In addition, almost all post-retrieval methods rely on the relevance scores of the retrieved list of documents, and in our case the relevance scores were not available from the search engine API; thus, we employed only pre-retrieval QPP methods for this work’s evaluation. Accordingly, we predict and compare the performance of the initial complex requests and of our reformulated requests using SOTA pre-retrieval QPP methods, which have been shown to correlate highly with the performance of retrieval methods on different corpora (Hashemi et al., 2019; Arabzadeh et al., 2020a, b; Zhao et al., 2008; Hauff et al., 2008, 2009; Carmel and Kurland, 2012; He and Ounis, 2004). The intuition behind evaluating our proposed method with pre-retrieval QPP methods is that they have been shown to be a promising indicator of performance. Therefore, we can compare the predicted performance of the initial complex request and of our reformulated request and predict which one is more likely to perform better. Simply put, higher QPP values mean that there is a higher probability that the request will be easily satisfied, and lower QPP values indicate a higher chance of poor retrieval results.
In the following, we elaborate on the SOTA pre-retrieval QPP methods that showed promising performance over different corpora and query sets, and we leveraged them for evaluating this work.
Simplified Clarity Score (SCS): SCS is a specificity-based QPP method, which captures the intuition that the more specific a query is, the more likely a system is to satisfy it (He and Ounis, 2004; Plachouras et al., 2004). SCS measures the KL divergence between the query language model and the corpus language model, thereby capturing how well the query is distinguishable from the corpus.
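As an illustration, SCS can be sketched with maximum-likelihood estimates for both language models, i.e., the sum over query terms of P(w|Q) · log2(P(w|Q) / P(w|C)). The sketch below is not the implementation used in this work; the function name, the toy collection statistics, and the choice to skip terms unseen in the collection are assumptions made for the example.

```python
import math
from collections import Counter


def simplified_clarity_score(query_terms, collection_tf, collection_len):
    """KL divergence (in bits) between the query and collection language models.

    collection_tf maps a term to its total frequency in the collection, and
    collection_len is the total number of term occurrences in the collection.
    """
    counts = Counter(query_terms)
    query_len = len(query_terms)
    score = 0.0
    for term, qtf in counts.items():
        ctf = collection_tf.get(term, 0)
        if ctf == 0:
            # Assumption: terms unseen in the collection are skipped,
            # since their contribution would otherwise be unbounded.
            continue
        p_query = qtf / query_len
        p_coll = ctf / collection_len
        score += p_query * math.log2(p_query / p_coll)
    return score


# Toy statistics (hypothetical): a 1000-token collection.
tf = {"cheap": 40, "pizza": 10, "the": 500}
print(simplified_clarity_score(["cheap", "pizza"], tf, 1000))
print(simplified_clarity_score(["the"], tf, 1000))
```

A query made of rare terms diverges more from the collection model than a query made of very frequent terms, so it receives a higher score.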
Similarity of Corpus and Query (SCQ): SCQ leverages the intuition that if a query is more similar to the collection, there is a higher potential to find an answer in the collection (Zhao et al., 2008). Concretely, the metric measures the similarity between the collection and the query for each term and then aggregates over the query, reporting the average of the individual query-term scores.
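A minimal sketch of the averaged SCQ score is given below, using the per-term form (1 + ln cf(t)) · ln(1 + N / df(t)) from Zhao et al. (2008), where cf(t) is the collection frequency of term t, df(t) its document frequency, and N the number of documents. The function names and the toy statistics are hypothetical, and scoring unseen terms as zero is an assumption of this sketch.

```python
import math


def scq_term(coll_freq, doc_freq, num_docs):
    """Per-term SCQ: (1 + ln cf) * ln(1 + N / df)."""
    if coll_freq == 0 or doc_freq == 0:
        # Assumption: a term absent from the collection contributes zero.
        return 0.0
    return (1.0 + math.log(coll_freq)) * math.log(1.0 + num_docs / doc_freq)


def average_scq(query_terms, coll_freqs, doc_freqs, num_docs):
    """Average the per-term SCQ scores over all query terms."""
    if not query_terms:
        return 0.0
    scores = [
        scq_term(coll_freqs.get(t, 0), doc_freqs.get(t, 0), num_docs)
        for t in query_terms
    ]
    return sum(scores) / len(scores)


# Toy statistics (hypothetical): 1000 documents.
cf = {"pizza": 120, "vegan": 15}
df = {"pizza": 80, "vegan": 12}
print(average_scq(["vegan", "pizza"], cf, df, 1000))
```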
Neural embedding-based QPPs (Neural-CC): Neural embedding-based QPP metrics have shown excellent performance on several Information Retrieval (IR) benchmarks. They go beyond the traditional term-frequency-based QPP metrics and capture the semantic aspects of terms (Zamani et al., 2018; Roy et al., 2019; Arabzadeh et al., 2019, 2020a, 2020b; Khodabakhsh and Bagheri, 2021; Roitman, 2020; Hashemi et al., 2019). We adapted one of the recently proposed QPP metrics, which builds a network between query terms and their most similar neighbors in the embedding space. Similar to (He and Ounis, 2004; Plachouras et al., 2004), this metric is based on query specificity. The intuition behind this metric is that specific queries play a more central and vital role in their neighborhood network than more generic ones. Here, as suggested in (Arabzadeh et al., 2020a, b, 2019), we adopted the Closeness Centrality (CC) of query terms within their neighborhood network, which has been shown to have the highest correlation across different IR benchmarks.
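The idea of closeness centrality within a term's neighborhood network can be sketched as follows. This is only one plausible construction and may differ from the cited work: it assumes a fully connected network over the term and its k nearest neighbors, uses 1 minus cosine similarity as the edge length, and averages the per-term scores over the query. All names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_closeness(term, embeddings, k=3):
    """Closeness centrality of `term` in the network of its k nearest neighbors."""
    ego = embeddings[term]
    ranked = sorted(
        (w for w in embeddings if w != term),
        key=lambda w: cosine(ego, embeddings[w]),
        reverse=True,
    )
    nodes = [term] + ranked[:k]
    n = len(nodes)
    if n < 2:
        return 0.0
    # Assumption: edge length is 1 - cosine similarity, clipped at zero.
    dist = [
        [0.0 if i == j else
         max(0.0, 1.0 - cosine(embeddings[nodes[i]], embeddings[nodes[j]]))
         for j in range(n)]
        for i in range(n)
    ]
    # Floyd-Warshall all-pairs shortest paths on the small network.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    total = sum(dist[0][1:])
    return (n - 1) / total if total > 0 else float("inf")


def neural_cc(query_terms, embeddings, k=3):
    """Average closeness over the query terms that have an embedding."""
    scores = [term_closeness(t, embeddings, k) for t in query_terms if t in embeddings]
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2-d embeddings (hypothetical).
vectors = {"sushi": [1.0, 0.1], "ramen": [0.9, 0.2], "food": [0.5, 0.5], "car": [0.0, 1.0]}
print(neural_cc(["sushi"], vectors, k=2))
```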
For contextual bandits and GPT-3 models, the default parameters for the available libraries were used, and no parameter tuning was done. To train the model described in Section 7.1, we use an Adam optimizer with a learning rate of
, a mini-batch of size 8 for training, and embeddings of size 100 for both words and aspects. A dropout rate of 0.5 is applied at the fully connected and ReLU layers to reduce potential overfitting.
7.3. Experimental Results
We compare the results of the QPP metrics on our best policies and on popular attributes with those of the original requests in Figure 5, where we report the percentage difference w.r.t. the full form; that is, to what extent the QPP metrics predict that the reformulated requests are likely to perform better than the original ones. We examine the difference between the average QPP metric values of the requests reformulated with the best policy (adaptive active greedy) and those of the full form of the requests. In addition, we compared the requests reformulated with popular attributes against the full form of the requests and report the results in the same figure. As shown in Figure 5, the adaptive active greedy policy showed improvements on all three QPP metrics and on all intents. The bars in Figure 5 can be interpreted as the percentage of predicted improvement for the reformulated requests compared to the full form of the requests. For instance, for the restaurants intent, the SCQ, SCS, and neural embedding QPP methods improved by 3.3%, 3.1%, and 22.5%, respectively. We measure statistical significance using a paired t-test at a significance level of 0.05. We note that while the improvements made by the adaptive active greedy policy were consistently statistically significant on all intents according to the SCQ and neural embedding QPP metrics, the gains were statistically significant on only four intents according to the SCS metric: “Restaurants”, “Landscaping”, “Home cleaners”, and “Home Remodeling.”
It should be noted that while QPP methods are potential indicators of performance, every QPP method focuses on a different quality aspect of the query. Therefore, they do not necessarily agree on the predicted performance across different queries, corpora, or retrieval methods. This observation has been made for different SOTA QPP methods and various well-known corpora, such as the TREC corpora or MS MARCO, and their associated query sets (Carmel and Yom-Tov, 2010; Arabzadeh et al., 2021; Hashemi et al., 2019). Thus, we conclude that the level of agreement could strengthen our confidence in the query performance prediction. In other words, the more QPP metrics agree on query performance, the more confidence we have in that prediction. In addition, we can interpret each QPP metric’s prediction based on the intuition behind it. For example, the SCS method relies on the query clarity level, while the SCQ method relies on the existence of potential answers in the corpus. When two QPP methods do not agree on a query’s performance, we interpret this as the query satisfying the intuition behind one of the QPP methods while failing to satisfy that of the other. For example, take the ‘activity’ intent in Figure 5, for which the SCQ method showed significant improvement but the SCS method did not. We interpret this observation as follows: the clarity of the query did not increase significantly when refined by our method; however, the query was expanded such that the number of potential answers in the corpus increased.
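The two quantities used in this comparison can be sketched as follows, assuming that the percentage difference is the relative change of the mean QPP value and that significance is assessed on per-request score pairs. The function names and the toy scores are hypothetical; the sketch returns only the t statistic and the degrees of freedom, which would then be compared against the critical value for the 0.05 level using a t-table or a statistics library.

```python
import math
from statistics import mean, stdev


def percentage_difference(original_scores, reformulated_scores):
    """Relative change (in percent) of the mean QPP value w.r.t. the original requests."""
    base = mean(original_scores)
    return 100.0 * (mean(reformulated_scores) - base) / abs(base)


def paired_t_statistic(original_scores, reformulated_scores):
    """t statistic and degrees of freedom for paired per-request scores.

    Requires at least two pairs whose differences are not all identical.
    """
    if len(original_scores) != len(reformulated_scores):
        raise ValueError("paired scores must have the same length")
    diffs = [r - o for o, r in zip(original_scores, reformulated_scores)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy per-request QPP values (hypothetical).
original = [1.0, 2.0, 3.0]
reformulated = [2.0, 4.0, 3.0]
print(percentage_difference(original, reformulated))
print(paired_t_statistic(original, reformulated))
```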
To evaluate the topic and intent classifiers, we used the test set described in Section 6.2 and achieved accuracies of 99.3% and 95.2% for topics and intents, respectively. For evaluating the slot tagger, we leveraged the annotated data collected by three different judges described in Section 6.1, performing 4-fold cross-validation, and achieved a macro-F1 of 0.75 across all the intents and slots. The results for slot tagging are promising despite the challenges, e.g., a small amount of labeled data, a large number of slots per intent, and overlapping slots across user intents. The results indicate the ability of GPT-3 to generalize in few-shot learning.
7.4. Ablation Analysis
Broad vs. Specific:
Studying a system’s performance in depth on a per-query basis can reveal where the system fails, i.e., which queries it fails to answer and which groups of queries it handles easily. Thus, it could potentially lead to future improvements to the system. As such, exploring query characteristics has always attracted considerable attention in the IR and NLP communities, because query broadness has been shown to be a crucial factor that could potentially lead to unsatisfactory retrieval (Song et al., 2009; Clarke et al., 2009; Sanderson, 2008; Nel et al., 2019; Min et al., 2020). Here, we separately study the performance of our proposed method on two groups of queries, broad and specific. We are interested in examining whether our proposed method can address both kinds of requests consistently. We define broad requests as those with less complex information-seeking tasks and fewer preferences expressed; they are short and contain a small number of slots/values ( 3), hence requiring more steps for the RL model to refine the user intent. In contrast, specific requests are defined as longer ones that contain many slots/values, so users need fewer steps to finalize their intent.
Figure 6 shows the evaluation results for broad and specific requests. As the figure demonstrates, although all the employed QPP metrics agree that both types of requests have been improved, Adaptive Active Greedy performs relatively better on broad queries than on specific ones. This is expected, because specific requests are more complex than broad ones and more criteria must be addressed to satisfy them. Moreover, suggesting the popular slots has a deteriorating effect on all the metrics across the intents for the specific requests, which indicates a challenging reformulation process, while the proposed model improves the QPP on all metrics.
Different Context:
We compare the three proposed contexts described in Section 5.3 in terms of the percentage difference in performance predicted by the QPP metrics w.r.t. the original form of the requests; Figure 7 shows the results on the top-5 most popular intents. The results show that all three proposed contexts outperform the original representation across all metrics and intents. We observe that the QPP metrics do not consistently agree on the predicted performance of these three methods. While Neural-CC predicts that methods 3 and 2 for defining the context perform better than method 1, we also noticed that SCS and SCQ sometimes behave in the opposite way. We hypothesize that this difference arises because Neural-CC is based on neural embeddings, while SCS and SCQ are based on term frequency and corpus statistics. Therefore, each group might capture different aspects of the requests. Although all the proposed contexts significantly outperform the original query, we cannot conclude which one among them outperforms the others.
Policy evaluation for contextual bandit model:
We performed an experiment for policy evaluation on contextual bandits, selecting the popular intents for off-policy evaluation. Offline assessment of contextual bandits is complicated because the policies are designed to interact with online environments. There are multiple methods for policy evaluation in the offline setting, such as Rejection Sampling (RS) (Li et al., 2010) and Normalised Capped Importance Sampling (NCIS) (Gilotte et al., 2018). All the results are reported based on the best-arm performance. Since the system can expose users to multiple slots, the final performance in the proposed setting will be much better than the reported results. Table 2 shows the results for three different models.
According to the results, RS sometimes underestimates the performance on intents like “restaurant” and “appliance repair” while overestimating it on other intents such as “hike.” The NCIS method provides a more accurate and realistic estimate.
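For reference, the two offline estimators can be sketched as follows. The rejection sampling sketch assumes that the logged actions were chosen uniformly at random and keeps only the rounds in which the evaluated policy agrees with the log; the NCIS sketch weights each logged reward by the capped ratio of target to logging probabilities and normalises by the sum of the capped weights. The function names, the log format, the cap value, and the toy data are hypothetical and are not taken from the experiments reported here.

```python
def rejection_sampling_estimate(logged, target_policy):
    """Mean reward over logged rounds where the target policy picks the logged action.

    logged: iterable of (context, action, reward), assumed to come from a
    uniformly random logging policy.
    """
    kept = [reward for context, action, reward in logged
            if target_policy(context) == action]
    return sum(kept) / len(kept) if kept else float("nan")


def ncis_estimate(logged, target_prob, cap=10.0):
    """Normalised capped importance sampling estimate of the target policy's reward.

    logged: iterable of (context, action, reward, logging_prob).
    target_prob(context, action) is the target policy's action probability.
    """
    weighted_reward = 0.0
    weight_sum = 0.0
    for context, action, reward, logging_prob in logged:
        weight = min(target_prob(context, action) / logging_prob, cap)
        weighted_reward += weight * reward
        weight_sum += weight
    return weighted_reward / weight_sum if weight_sum > 0 else float("nan")


# Toy log (hypothetical): two arms, uniform logging policy.
log_rs = [("u1", "price", 1.0), ("u1", "cuisine", 0.0), ("u2", "price", 0.0)]
log_ncis = [(c, a, r, 0.5) for c, a, r in log_rs]


def always_price(context):
    return "price"


def price_prob(context, action):
    return 1.0 if action == "price" else 0.0


print(rejection_sampling_estimate(log_rs, always_price))
print(ncis_estimate(log_ncis, price_prob, cap=10.0))
```

Capping bounds the influence of rounds with very small logging probabilities, which is one reason such an estimator can be less volatile than an uncapped one.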
8. Conclusion and Future Work
This paper proposed a novel application of natural language interfaces, allowing cold-start users to submit and receive responses for complex information-seeking requests expressed in natural language. Unlike traditional search engines, where a single most relevant result is expected, users of our system are presented with a set of suggestions for further exploration. We have designed and deployed a system that permitted us to conduct initial data collection and that enables potential future online experimentation using the A/B testing paradigm. To complement this platform for complex user requests, we developed a novel interactive agent based on contextual bandits that guides users to express their initial request more clearly by refining their intents and adding new preferential desiderata. During this guided interaction, an NLU unit is used to build a structured semantic representation of the request. The system also uses a proposed Completion Intent Score (CIS) that estimates the degree to which the intent is fully expressed at each interaction step. When the system determines that an optimal request has been expressed, it leverages a search API to retrieve a list of suggestions. To demonstrate the efficacy of the proposed modeling paradigm, we have adopted various pre-retrieval metrics that capture the extent to which guided interactions with our system yield better retrieval results. In a suite of experiments, we demonstrated that our method significantly outperforms several robust baseline approaches.
In future work, we plan to design an online experiment that will involve business metrics, such as user satisfaction and the ratio of returning users, and that will interactively collect ratings for the list of suggestions made by our system. This will allow us to learn from language and rating data jointly. Another possible direction is designing intent ontologies in a more complex hierarchical form.
- Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1, §2.
- Find it if you can: a game for modeling different types of web search success using interaction data. In SIGIR, Note: To related work at WWW 2014 Cited by: §2.
- ConvAI3: generating clarifying questions for open-domain dialogue systems (clariq). Cited by: §2.
- Building and evaluating open-domain dialogue corpora with clarifying questions. arXiv preprint arXiv:2109.05794. Cited by: §2.
- BERT-qpp: contextualized pre-trained transformers for query performance prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 2857–2861. Cited by: §7.3.
- Neural embedding-based specificity metrics for pre-retrieval query performance prediction. Information Processing & Management 57 (4), pp. 102248. Cited by: §7.2, §7.2.
- Neural embedding-based metrics for pre-retrieval query performance prediction. Advances in Information Retrieval 12036, pp. 78. Cited by: §7.2, §7.2.
- Geometric estimation of specificity within embedding spaces. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2109–2112. Cited by: §7.2.
- Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pp. 2335–2346. Cited by: §2.
- Introduction to bandits in recommender systems. In Fourteenth ACM Conference on Recommender Systems, pp. 748–750. Cited by: §5.3.
- Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1533–1544. Cited by: §2.
- The continuous cold start problem in e-commerce recommender systems. arXiv preprint arXiv:1508.01177. Cited by: §1, §2.
- A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040. Cited by: §2.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §2, §3, §5.2.
- Query performance prediction for ir. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 1196–1197. Cited by: §7.2.
- Estimating the query difficulty for information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services 2 (1), pp. 1–89. Cited by: §7.2, §7.3.
- Universal sentence encoder. arXiv preprint arXiv:1803.11175. Cited by: §5.3, §6.2.
- Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 815–824. Cited by: §1.
- Towards recommendation systems with real-world constraints. Ph.D. Thesis, University of Minnesota. Cited by: §1.
- Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §2.
- An effectiveness measure for ambiguous and underspecified queries. In Conference on the Theory of Information Retrieval, pp. 188–199. Cited by: §7.4.
- Seven steps to rendezvous with the casual user. IBM Corporation. Cited by: §2.
- Natural language interfaces to databases. Cited by: §2.
- Adapting multi-armed bandits policies to contextual bandits scenarios. arXiv preprint arXiv:1811.04383. Cited by: §2, item 2, §7.1.
- Search engines: information retrieval in practice. Vol. 520, Addison-Wesley Reading. Cited by: §2.
- Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 299–306. Cited by: §7.2.
- Program synthesis using natural language. In Proceedings of the 38th International Conference on Software Engineering, pp. 345–356. Cited by: §2.
- Bert: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Cited by: §1, §2.
- The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pp. 187–208. Cited by: §2.
- Speak to your parser: interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2065–2077. Cited by: §2.
- Iris: a conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 473. Cited by: §2.
- A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 32–40. Cited by: §2.
- Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 198–206. Cited by: §7.4.
- Performance prediction for non-factoid question answering. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 55–58. Cited by: §7.2, §7.2, §7.3.
- Beyond dcg: user behavior as a predictor of a successful search. In WSDM, pp. 221–230. Cited by: §2.
- Supporting complex search tasks. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pp. 829–838. Cited by: §2.
- The combination and evaluation of query performance prediction methods. In European Conference on Information Retrieval, pp. 301–312. Cited by: §7.2.
- A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM conference on Information and knowledge management, pp. 1419–1420. Cited by: §7.2.
- Inferring query performance using pre-retrieval predictors. In International symposium on string processing and information retrieval, pp. 43–54. Cited by: §7.2, §7.2, §7.2, §7.2.
- Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS) 3 (2), pp. 105–147. Cited by: §2.
- Interactive machine learning for health informatics: when do we need the human-in-the-loop?. Brain Informatics 3 (2), pp. 119–131. Cited by: §1.
- The turn: integration of information seeking and retrieval in context. Vol. 18, Springer Science & Business Media. Cited by: §2.
- Webwatcher: a tour guide for the world wide web. In IJCAI (1), pp. 770–777. Cited by: §2.
- REVEAL 2020: bandit and reinforcement learning from user interactions. In Fourteenth ACM Conference on Recommender Systems, pp. 628–629. Cited by: §2.
- Semantics-enabled query performance prediction for ad hoc table retrieval. Information Processing & Management 58 (1), pp. 102399. Cited by: §7.2.
- Modelling and detecting changes in user satisfaction. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1449–1458. Cited by: §1, §2.
- Behavioral dynamics from the serp’s perspective: what are failed serps and how to fix them?. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1561–1570. Cited by: §1, §2.
- Beyond movie recommendations: solving the continuous cold start problem in e-commerce recommendations. arXiv preprint arXiv:1607.07904. Cited by: §1, §2.
- Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 45–54. Cited by: §2.
- Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pp. 121–130. Cited by: §2.
- Soliciting user preferences in conversational recommender systems via usage-related questions. In Fifteenth ACM Conference on Recommender Systems, pp. 724–729. Cited by: §1, §7.1.
- Modeling and analysis of cross-session search tasks. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 5–14. Cited by: §2.
- Computing krippendorff’s alpha-reliability. Cited by: §6.1.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: §1.
- Dialogue learning with human-in-the-loop. ICLR. Cited by: §1.
- A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §2, §7.4.
- Interactive task learning from GUI-grounded natural language instructions and demonstrations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Cited by: §2.
- Dialogue generation: from imitation learning to inverse reinforcement learning. arXiv preprint arXiv:1812.03509. Cited by: §1.
- Guided dialog policy learning without adversarial learning in the loop. arXiv preprint arXiv:2004.03267. Cited by: §1.
- Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 482–489. Cited by: §2.
- Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pp. 350–359. Cited by: §2.
- Personalizing information retrieval for multi-session tasks: examining the roles of task stage, task type, and topic knowledge on the interpretation of dwell time as an indicator of document usefulness. Journal of the Association for Information Science and Technology 66 (1), pp. 58–81. Cited by: §2.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §2.
- Recent neural methods on slot filling and intent classification for task-oriented dialogue systems: a survey. arXiv preprint arXiv:2011.00564. Cited by: §7.1.
- Exploratory search: from finding to understanding. Communications of the ACM 49 (4), pp. 41–46. Cited by: §2.
- AmbigQA: answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645. Cited by: §7.4.
- Evaluating the quality of query refinement suggestions in information retrieval. Technical report MASSACHUSETTS UNIV AMHERST CENTER FOR INTELLIGENT INFORMATION RETRIEVAL. Cited by: §5.4.
- The effect of search engine, search term and occasion on brain-computer interface metrics for emotions when ambiguous search queries are used.. In CHIRA, pp. 28–39. Cited by: §7.4.
- Struggling and success in web search. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1551–1560. Cited by: §2.
- ScentTrails: integrating browsing and searching on the web. ACM Transactions on Computer-Human Interaction (TOCHI) 10 (3), pp. 177–197. Cited by: §2.
- Diverse user preference elicitation with multi-armed bandits. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 130–138. Cited by: §2.
- University of glasgow at trec 2004: experiments in web, robust, and terabyte tracks with terrier.. In TREC, Cited by: §7.2, §7.2.
- A study of query performance prediction for answer quality determination. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 43–46. Cited by: §1, §7.2.
- ICTIR tutorial: modern query performance prediction: theory and practice. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pp. 195–196. Cited by: §7.2.
- Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. Cited by: §2.
- Estimating gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Information Processing & Management 56 (3), pp. 1026–1045. Cited by: §7.2.
- Interactive intent modeling for exploratory search. ACM Transactions on Information Systems (TOIS) 36 (4), pp. 1–46. Cited by: §2.
- Ambiguous queries: test collections need more sense. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 499–506. Cited by: §7.4.
- Query-performance prediction for effective query routing in domain-specific repositories. Journal of the Association for Information Science and Technology 65 (8), pp. 1597–1614. Cited by: §1, §7.2.
- Preference elicitation as an optimization problem. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 172–180. Cited by: §1.
- NLQuAD: A non-factoid long question answering data set. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 1245–1255. Cited by: §1.
- Identification of ambiguous queries in web search. Information Processing & Management 45 (2), pp. 216–229. Cited by: §7.4.
- Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence. Cited by: §2.
- An aspectual interface for supporting complex search tasks. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 379–386. Cited by: §2.
- Evaluating exploratory search systems. Information Processing and Management 44 (2), pp. 433. Cited by: §2.
- Exploratory search: beyond the query-response paradigm. Synthesis lectures on information concepts, retrieval, and services 1 (1), pp. 1–98. Cited by: §2, §7.1.
- The lunar sciences natural language information system: final report. BBN Report 2378. Cited by: §2.
- Session-aware information embedding for e-commerce product recommendation. In Proceedings of the 2017 ACM on conference on information and knowledge management, pp. 2379–2382. Cited by: §5.3.
- Model-based interactive semantic parsing: a unified framework and a text-to-SQL case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5447–5458. Cited by: §2.
- Pomdp-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179. Cited by: §2.
- Neural query performance prediction using weak supervision from multiple signals. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 105–114. Cited by: §7.2.
- Dialogpt: large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. Cited by: §2.
- Effective pre-retrieval query performance prediction using similarity and variability evidence. In European conference on information retrieval, pp. 52–64. Cited by: §7.2, §7.2.