This repository will host the code used in the ICTIR 2016 publication on "Lexical Query Modeling in Session Search".
Lexical query modeling has been the leading paradigm for session search. In this paper, we analyze TREC session query logs and compare the performance of different lexical matching approaches for session search. Naive methods based on term frequency weighting perform on par with specialized session models. In addition, we investigate the viability of lexical query models in the setting of session search. We give important insights into the potential and limitations of lexical query modeling for session search and propose future directions for the field of session search.
Many complex information seeking tasks, such as planning a trip or buying a car, cannot sufficiently be expressed in a single query. These multi-faceted tasks are exploratory, comprehensive, survey-like or comparative in nature and require multiple search iterations to be adequately answered. Donato et al. note that 10% of user sessions (more than 25% of query volume) consist of such complex information needs.
The TREC Session Track created an environment for researchers “to test whether systems can improve their performance for a given query by using previous queries and user interactions with the retrieval system.” The track’s existence led to an increasing number of methods aimed at improving session search. Yang et al. introduce the Query Change Model (QCM), which uses lexical editing changes between consecutive queries, in addition to query terms occurring in previously retrieved documents, to improve session search. They heuristically construct a lexicon-based query model for every query in a session. Query models are then linearly combined for every document, based on query recency or document satisfaction [10, 3], into a session-wide lexical query model. However, there has been a clear trend towards the use of supervised learning [3, 16, 12] and external data sources [6, 11]. Guan et al. perform lexical query expansion by adding higher-order n-grams to queries by mining document snippets. In addition, they expand query representations by including anchor texts of previously top-ranked documents in the session. Carterette et al. expand document representations by including incoming anchor texts. Luo et al. introduce a linear point-wise learning-to-rank model that predicts relevance given a document and query change features. They incorporate document-independent session features in their ranker.
The use of machine-learned ranking and the expansion of query and document representations is meant to address a specific instance of a wider problem in information retrieval, namely the query-document mismatch. In this paper, we analyze the session query logs made available by TREC and compare the performance of different lexical query modeling approaches for session search, taking into account session length. (An open-source implementation of our testbed for evaluating session search is available at https://github.com/cvangysel/sesh.) In addition, we investigate the viability of lexical query models in a session search setting.
The main purpose of this paper is to investigate the potential of lexical methods in session search and provide foundations for future research. We ask the following questions: (1) Increasingly complex methods for session search are being developed, but how do naive methods perform? (2) How well can lexical methods perform? (3) Can we solve the session search task using lexical matching only?
We define a search session s as a sequence of interactions between user and search engine, where q_i denotes a user-issued query consisting of terms t_{i,1}, …, t_{i,|q_i|} and r_i denotes a result page consisting of documents d_{i,1}, …, d_{i,k} returned by the search engine (also referred to as SERP). The goal, then, is to return a SERP r_n given a query q_n and the session history that maximizes the user’s utility function.
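This interaction structure can be made concrete in a few lines of Python. The class and field names below are our own illustration of the definition, not part of the TREC Session track's data format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Interaction:
    """One round of the session: a user query and the returned SERP."""
    query: List[str]                               # query terms t_1, ..., t_n
    serp: List[str] = field(default_factory=list)  # ranked document ids

@dataclass
class Session:
    """A sequence of interactions; the last query still awaits its SERP."""
    interactions: List[Interaction]

    @property
    def current_query(self) -> List[str]:
        return self.interactions[-1].query

# A two-interaction session: the first query already has a SERP,
# the second is the current query for which a SERP must be produced.
session = Session([
    Interaction(["cheap", "flights"], ["d3", "d7"]),
    Interaction(["cheap", "flights", "europe"]),
])
```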
In this work, we formalize session search by modeling an observed session s as a query model θ_s parameterized by φ_{s,t_1}, …, φ_{s,t_m}, where φ_{s,t} denotes the weight associated with term t (specified below). Documents are then ranked in decreasing order of

    score(d, s) = Σ_t φ_{s,t} · log P(t | θ_d),

where θ_d is a lexical model of document d, which can be a language model (LM), a vector space model or a specialized model using hand-engineered features. Query model θ_s is a function of the query models θ_{q_i} of the interactions in the session (e.g., for a uniform aggregation scheme, θ_s = (1/n) Σ_i θ_{q_i}). Existing session search methods [16, 6] can be expressed in this formalism as follows:
Terms in a query are weighted according to their frequency in the query (i.e., φ_{q_i,t} becomes the frequency of term t in q_i). Queries that are part of the same session are then aggregated uniformly over a subset of queries. In this work, we consider the following subsets: the first query, the last query and the concatenation of all queries in a session. Using the last query corresponds to the official baseline of the TREC Session track.
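The three term frequency (TF) variants reduce to a simple counting scheme; a minimal sketch, with function and variable names of our own choosing:

```python
from collections import Counter
from typing import Dict, List

def tf_query_model(queries: List[List[str]], variant: str = "all") -> Dict[str, int]:
    """Term frequency query model: the weight of term t is its frequency
    in the selected subset of the session's queries."""
    if variant == "first":
        selected = queries[:1]
    elif variant == "last":
        selected = queries[-1:]
    elif variant == "all":       # concatenation of all queries in the session
        selected = queries
    else:
        raise ValueError(f"unknown variant: {variant}")
    model: Counter = Counter()
    for query in selected:
        model.update(query)
    return dict(model)

queries = [["used", "cars"], ["used", "cars", "prices"]]
tf_query_model(queries, "all")   # {'used': 2, 'cars': 2, 'prices': 1}
```

Note how the "all" variant naturally emphasizes terms repeated across queries, which foreshadows the theme-term effect discussed in the results.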
Nugget is a method for effective structured query formulation for session search. Queries q_i, part of session s, are expanded using higher-order n-grams occurring in both q_i and snippets of the top-k documents in the previous interaction, d_{i−1,1}, …, d_{i−1,k}. This effectively expands the vocabulary by additionally considering n-grams next to unigram terms. The query models of individual queries in the session are then aggregated using one of the aggregation schemes. Nugget is primarily targeted at resolving the query-document mismatch by incorporating structure and external data and does not model query transitions. The method can be extended to include external evidence by expanding q_i to include anchor texts pointing to (clicked) documents in previous SERPs.
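The core expansion step, selecting n-grams shared between the query and the previous SERP's snippets, can be sketched as follows. This is a deliberate simplification of Nugget's strict expansion strategy, with our own function names; the full method also handles anchor texts and click data:

```python
from typing import List, Set, Tuple

def ngrams(terms: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams in a term sequence."""
    return {tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)}

def nugget_expand(query: List[str],
                  snippets: List[List[str]],
                  max_n: int = 2) -> Set[Tuple[str, ...]]:
    """Expand a query with higher-order n-grams that occur both in the
    query itself and in snippets of the previous interaction's top documents."""
    expansion: Set[Tuple[str, ...]] = set()
    for n in range(2, max_n + 1):
        query_grams = ngrams(query, n)
        for snippet in snippets:
            expansion |= query_grams & ngrams(snippet, n)
    return expansion

nugget_expand(["distance", "learning", "programs"],
              [["top", "distance", "learning", "degrees"]])
# {('distance', 'learning')}
```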
QCM uses syntactic editing changes between consecutive queries in addition to query changes and previous SERPs to enhance session search. In QCM [16, Section 6.3], document model θ_d is provided by a language model with Dirichlet smoothing and the query model at interaction i, φ_{q_i}, in session s is given by

    φ_{q_i,t} = 1 + α · (1 − P(t | D_{i−1}))   if t ∈ q_theme,
    φ_{q_i,t} = 1 − β · P(t | D_{i−1})         if t ∈ +Δq and P(t | D_{i−1}) > 0,
    φ_{q_i,t} = 1 − ε · idf(t)                 if t ∈ +Δq and P(t | D_{i−1}) = 0,
    φ_{q_i,t} = −δ · P(t | D_{i−1})            if t ∈ −Δq,

where q_theme are the session’s theme terms, +Δq (−Δq, resp.) are the added (removed) terms, P(t | D_{i−1}) denotes the probability of t occurring in SAT clicks, idf(t) is the inverse document frequency of term t and α, β, ε, δ are parameters. The φ_{q_i} are then aggregated into θ_s using one of the aggregation schemes, such as the uniform aggregation scheme (i.e., the sum of the φ_{q_i}).
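QCM's per-term weighting amounts to a four-way case analysis. The sketch below is our own simplification (function and argument names are ours); the parameters are passed explicitly since their tuned values are reported in [16, 12]:

```python
from typing import Dict, Set

def qcm_weight(term: str,
               theme_terms: Set[str], added: Set[str], removed: Set[str],
               p_sat: Dict[str, float], idf: Dict[str, float],
               alpha: float, beta: float, epsilon: float, delta: float) -> float:
    """Weight of one term in the QCM query model at interaction i.
    p_sat maps a term to its probability in the previous interaction's
    SAT-clicked documents; idf maps a term to its inverse document frequency."""
    p = p_sat.get(term, 0.0)
    if term in theme_terms:
        # theme terms are boosted, more so when absent from SAT clicks
        return 1.0 + alpha * (1.0 - p)
    if term in added:
        # added terms: discount by SAT-click probability, or fall back to idf
        return 1.0 - beta * p if p > 0.0 else 1.0 - epsilon * idf.get(term, 0.0)
    if term in removed:
        # removed terms receive a negative weight
        return -delta * p
    return 0.0
```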
In §4, we analyze the methods listed above in terms of their ability to handle sessions of different lengths and contextual history.
|                              | 2011                      | 2012                      | 2013                      | 2014                      |
| Sessions                     | 76                        | 98                        | 87                        | 100 (1,021 total)         |
| Queries per session          | 3.68 ± 1.79; M=3.00       | 3.03 ± 1.57; M=2.00       | 5.08 ± 3.60; M=4.00       | 4.34 ± 2.22; M=4.00       |
| Unique terms per session     | 7.01 ± 3.28; M=6.50       | 5.76 ± 2.95; M=5.00       | 8.86 ± 4.38; M=8.00       | 7.79 ± 4.08; M=7.00       |
| Sessions per topic           | 1.23 ± 0.46; M=1.00       | 2.04 ± 0.98; M=2.00       | 2.18 ± 0.93; M=2.00       | 20.95 ± 4.81; M=21.00     |
| Document judgments per topic | 313.11 ± 114.63; M=292.00 | 372.10 ± 162.63; M=336.50 | 268.00 ± 116.86; M=247.00 | 332.33 ± 149.03; M=322.00 |
|                              | ClueWeb09 (2011/2012)                                 | ClueWeb12 (2013/2014)                                 |
| Document length              | 1,096.18 ± 1,502.45                                   | 649.07 ± 1,635.29                                     |
| Terms                        | 34,015,925 (23,303,541,122 total)                     | 23,575,957 (10,191,892,325 total)                     |

Table 1: Overview of the 2011, 2012, 2013 and 2014 TREC Session tracks. For the 2014 track, we report the total number of sessions in addition to those sessions with judgments. We report the mean and standard deviation (±) where appropriate; M denotes the median.
We evaluate the lexical query modeling methods listed in §2 on the session search task (G1) of the TREC Session track from 2011 to 2014. We report performance on each track edition independently and on the track aggregate. Given a query, the task is to improve retrieval performance by using previous queries and user interactions with the retrieval system. To accomplish this, we first retrieve the 2,000 most relevant documents for the given query and then re-rank these documents using the methods described in §2. We use the “Category B” subsets of ClueWeb09 (2011/2012) and ClueWeb12 (2013/2014) as document collections. Both collections consist of approximately 50 million documents. Spam documents are removed before indexing by filtering out documents with spam scores (GroupX and Fusion, respectively) below 70. Table 1 shows an overview of the benchmarks and document collections.
To measure retrieval effectiveness, we report Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) in addition to Mean Reciprocal Rank (MRR). The relevance judgments of the tracks were converted from topic-centric to session-centric according to the mappings provided by the track organizers. (We take into account the mapping between judgments and actual relevance grades for the 2012 edition.) Evaluation measures are then computed using TREC’s official evaluation tool, trec_eval (https://github.com/usnistgov/trec_eval).
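For reference, both measures can be written down compactly. These are simplified textbook formulations with our own function names; the reported numbers come from trec_eval, whose gain and discount conventions may differ:

```python
import math
from typing import List

def ndcg_at_k(gains: List[float], k: int = 10) -> float:
    """NDCG@k over a ranked list of relevance grades (linear gain,
    log2 discount); the ideal ranking sorts grades in decreasing order."""
    def dcg(g: List[float]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevant: List[bool]) -> float:
    """Reciprocal rank of the first relevant document (0 if none)."""
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0
```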
We compare the lexical query model methods outlined in §2. All methods compute weights for lexical entities (e.g., unigram terms) on a per-session basis, construct a structured Indri query and query the document collection using pyndri (https://github.com/cvangysel/pyndri). For fair comparison, we use Indri’s default smoothing configuration (i.e., Dirichlet smoothing) and uniform query aggregation for all methods (different from the smoothing used for QCM in [16]). This allows us to separate query aggregation techniques from query modeling approaches in the case of session search.
For Nugget, we use the default parameter configuration with the strict expansion method. We report the performance of Nugget without the use of external resources (RL2), with anchor texts (RL3) and with click data (RL4). For QCM, we use the parameter configuration described in [16, 12].
In addition to the methods above, we report the performance of an oracle that always ranks in decreasing order of ground-truth relevance. This oracle will give us an upper-bound on the achievable ranking performance.
We investigate the maximally achievable performance by weighting query terms. Inspired by Bendersky et al., we optimize NDCG@10 for every session using a grid search over the term weight space. We sweep the weight of every term over a fixed range (inclusive) in constant increments, resulting in a fixed number of weight assignments per term. Due to the exponential time complexity of the grid search, we limit our analysis to the 230 sessions with seven unique query terms or less (see Table 1). This experiment will tell us the maximally achievable retrieval performance in session search by re-weighting lexical terms only.
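The grid search itself is a straightforward exhaustive sweep. A minimal sketch with names of our own choosing; the grid values here are illustrative and not the increments used in the experiment:

```python
import itertools
from typing import Callable, Dict, List, Optional, Tuple

def best_term_weights(terms: List[str],
                      evaluate: Callable[[Dict[str, float]], float],
                      grid: Tuple[float, ...] = (0.0, 0.5, 1.0)
                      ) -> Tuple[Optional[Dict[str, float]], float]:
    """Exhaustive grid search over per-term weights, keeping the assignment
    that maximizes the evaluation measure (e.g., NDCG@10 for the session).
    Runtime is |grid| ** |terms|, hence the restriction to short sessions."""
    best_score, best_weights = float("-inf"), None
    for combo in itertools.product(grid, repeat=len(terms)):
        weights = dict(zip(terms, combo))
        score = evaluate(weights)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```

In the actual experiment, `evaluate` would re-rank the retrieved documents under the weighted query model and compute NDCG@10 against the session's judgments.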
[Table 2: Retrieval performance (NDCG@10 and MRR) of TF (first query), TF (last query) and TF (all queries), alongside the specialized session search methods, on the 2011–2014 TREC Session tracks.]
In this section, we report and discuss our experimental results. Of special interest to us are the methods that perform lexical matching based on a user’s queries in a single session: QCM, Nugget (RL2) and the three variants of TF. Table 2 shows the methods’ performance on the TREC Session track editions from 2011 to 2014. No single method consistently outperforms the other methods. Interestingly enough, the methods based on term frequency (TF) perform quite competitively compared to the specialized session search methods (Nugget and QCM). In addition, the TF variant using all queries in a session even outperforms Nugget (RL2) on the 2011 and 2014 editions and QCM on nearly all editions. Using the concatenation of all queries in a session, while being an obvious baseline, has not received much attention in recent literature or by TREC. In addition, note that the best-performing (unsupervised) TF method achieves better results than the supervised method of Luo et al. on the 2012 and 2013 tracks. Fig. 1 depicts the boxplot of the NDCG@10 distribution over all track editions (2011–2014). The term frequency approach using all queries achieves the highest mean/median overall. Given this peculiar finding, where a generic retrieval model performs better than specialized session search models, we continue with an analysis of the TREC Session search logs.
In Fig. 2 we investigate the effect of varying session lengths in the session logs. The distribution of session lengths is shown in the top row of Fig. 2. For the 2011–2013 track editions, most sessions consisted of only two queries. The mode of the 2014 edition lies at 5 queries per session. If we examine the performance of the methods on a per-session-length basis, we observe that the TF methods perform well for short sessions. This does not come as a surprise, as for these sessions there is only a limited history that specialized methods can use. However, the TF method using the concatenation of all queries still performs competitively for longer sessions. This can be explained by the fact that as queries are aggregated over time, a better representation of the user’s information need is created. This aggregated representation naturally emphasizes important theme terms of the session, which is a key component of QCM.
[Table 3: Comparison of frequency-based term weighting (TF, all queries), ideal term weighting and the ground-truth oracle on sessions with seven unique query terms or less.]
How do these methods perform as the search session progresses? Fig. 3 shows the performance on sessions of length five after every user interaction, when using all queries in a session (Fig. 3(a)) and when using only the previous query (Fig. 3(b)). We can see that NDCG@10 increases as the session progresses for all methods. Beyond half of the session, the session search methods outperform retrieving according to the last query in the session. We see that, for longer sessions, specialized methods (Nugget, QCM) outperform generic term frequency models. This comes as no surprise. Bennett et al. note that users tend to reformulate and adapt their information needs based on observed results, and this is essentially the observation upon which QCM builds.
Fig. 1 and Table 2 reveal a large NDCG@10 gap between the compared methods and the ground-truth oracle. How can we bridge this gap? Table 3 shows a comparison between frequency-based term weighting, the ideal term weighting (§3.4) and the ground-truth oracle (§3.3) for all sessions consisting of 7 unique terms or less (§3.4). We make two important observations. There is still plenty of room for improvement using lexical query modeling only. Relatively speaking, around half of the gap between weighting according to term frequency and the ground-truth can be bridged by predicting better term weights. However, the other half of the performance gap cannot be bridged using lexical matching only, but instead requires a notion of semantic matching.
We have shown that naive frequency-based term weighting methods perform on par with specialized session search methods on the TREC Session track (2011–2014). This is due to the fact that shorter sessions are more prominent in the session query logs. On longer sessions, specialized models are able to exploit session history more effectively. Future work should focus on creating benchmarks consisting of longer sessions with complex information needs. Perhaps more importantly, we have looked at the viability of lexical query matching in session search. There is still much room for improvement by re-weighting query terms. However, the query-document mismatch is prevalent in session search and methods restricted to lexical query modeling face a very strict performance ceiling. Future work should focus on better lexical query models for session search, in addition to semantic matching and tracking the dynamics of contextualized semantics in search.

Acknowledgments. This work was supported by the Google Faculty Research Award and the Bloomberg Research Grant programs. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsors. The authors would like to thank Daan Odijk, David Graus and the anonymous reviewers for their valuable comments and suggestions.
Do you want to take notes? Identifying research missions in Yahoo! Search Pad. In WWW, pages 321–330. ACM, 2010.
The query change model: Modeling session search as a Markov decision process. TOIS, 33(4):20:1–20:33, May 2015.