CIKM 2021: Contrastive Learning of User Behavior Sequence for Context-Aware Document Ranking
Context information in search sessions has proven to be useful for capturing user search intent. Existing studies explored user behavior sequences in sessions in different ways to enhance query suggestion or document ranking. However, a user behavior sequence has often been viewed as a definite and exact signal reflecting a user's behavior. In reality, it is highly variable: user's queries for the same intent can vary, and different documents can be clicked. To learn a more robust representation of the user behavior sequence, we propose a method based on contrastive learning, which takes into account the possible variations in user's behavior sequences. Specifically, we propose three data augmentation strategies to generate similar variants of user behavior sequences and contrast them with other sequences. In so doing, the model is forced to be more robust regarding the possible variations. The optimized sequence representation is incorporated into document ranking. Experiments on two real query log datasets show that our proposed model outperforms the state-of-the-art methods significantly, which demonstrates the effectiveness of our method for context-aware document ranking.
Search engines have evolved from one-shot searches to consecutive search interactions with users (Agichtein et al., 2012). To fulfil complex information needs, users issue a sequence of queries and examine and interact with some of the results. A user's historical behavior or interaction history in a session is known to be very useful for understanding the user's information needs and for ranking documents (Jones and Klinkner, 2008; Bennett et al., 2012; Ge et al., 2018; Zhou et al., 2021).
Various studies have exploited user behavior data for different purposes. For example, by analyzing search logs, researchers found that a user's search history provides useful information for understanding user intent during search sessions (Bennett et al., 2012). To utilize historical user behavior in document ranking, some early work explored query expansion and learning-to-rank techniques (Bennett et al., 2012; Carterette et al., 2016; Gysel et al., 2016; Shen et al., 2005). More recently, various neural structures have been used to model the user behavior sequence. For example, a recurrent neural network (RNN) was proposed to model the historical queries and suggest the next query (Sordoni et al., 2015). This structure has been extended to model both the historical queries and clicked documents, leading to further improvement in document ranking (Ahmad et al., 2019, 2018). Pre-trained language models have also been exploited to encode contextual information from user behavior sequences and have achieved promising results (Qu et al., 2020).
All these studies tried to learn a prediction or representation model to capture the information hidden in the sequences. However, user behavior sequences have been viewed as definite and exact: an observed sequence is used as a positive sample, and any unseen sequence is either not used or viewed as a negative sample. This strict view does not reflect the flexible nature of user behavior in a session. Indeed, when interacting with a search engine, users do not have a definitive interaction pattern, nor a fixed query for an information need. All these are flexible and vary greatly from one user to another, and from one search context to another. Similarly, users' click behaviors are not definitive either: one can click on different documents for the same information need, and can also click on irrelevant documents. High variation is inherent in users' interactions with a search engine. This characteristic has not been explicitly addressed in previous studies, which typically relied on a large amount of log data, hoping that strong patterns would emerge while accidental variations (or noise) would be discarded. This is true to some extent when we have a large amount of log data and are only interested in the common patterns shared by users. However, models strictly relying on log data cannot fully capture the nuances in user behaviors or cope with the variations. A better approach is to view the data as what they are: samples of possible query formulations and interactions, of which many more are not shown in the logs.
To tackle this problem, in this work, we propose a data augmentation approach to generate possible variations from a search log. More specifically, we use three strategies to mask some terms in a query or document, delete some queries or documents, or reorder the sequence. These strategies reflect some typical variations in user’s behavior sequences. The generated behavior sequences can be considered similar to the observed ones. We have, therefore, automatically tagged user behavior sequences in terms of similarity, which are precious for model training. In addition, we can generate more training data from search logs, which has always been a critical issue for research in this area. Based on the augmented data, we utilize contrastive learning to extract what is similar and dissimilar. More specifically, the contrastive model tries to pull the similar sequences (generated variants) closer and to distinguish them from semantically unrelated ones. Compared to the existing approaches based on search logs, we expect that contrastive learning can better cope with the inherent variations and generate more robust models to deal with new behavior sequences.
Contrastive learning is implemented with a pre-trained language model BERT (Devlin et al., 2019) through encoding a sequence and its variants into a contextualized representation with a contrastive loss. The document ranking is then learned by a linear projection on top of the optimized sequence representation. With both the original sequences and corresponding variants modeled in the representation, the final ranking function can not only address the context information thoroughly, but also learn to cope with the inherent variations, hence generating better ranking results during prediction.
We conduct experiments on two large-scale real-world search log datasets (AOL and Tiangong-ST). Experimental results show that our proposed method outperforms the existing methods (including those exploiting search logs) significantly, which demonstrates the effectiveness of our approach.
Our contributions are three-fold:
(1) We design three different data augmentation strategies to construct similar sequences of observed user behavior sequences, which modify the original sequence at term, query/document, and behavior levels.
(2) We propose a self-supervised task with a contrastive learning objective based on the augmented behavior sequences to capture what is hidden behind the sequences and their variants, and to distinguish them from other unrelated sequences.
(3) Experiments on two large-scale real-world search log datasets confirm the effectiveness of our method. This study shows that contrastive learning with automatically augmented search logs is an effective way to alleviate the shortage of log data in IR research.
Context information in sessions has been shown to be useful in modeling user intent in search tasks (Jones and Klinkner, 2008; Bennett et al., 2012; Ge et al., 2018; Zhou et al., 2020a). Early studies focused on extracting contextual features from users' search activities so as to characterize their search intent. For example, some keywords were extracted from users' historical queries and clicked documents and used to rerank the documents for the current query (Shen et al., 2005). Statistical features and rule-based features were also introduced to quantify or characterize context information (White et al., 2010; Xiang et al., 2010). However, these methods often rely on manually extracted features or handcrafted rules, which limits their application in different retrieval tasks.
Later, researchers started to build predictive models for users' search intent or future behavior. For example, a hidden Markov model was employed to model the evolution of users' search intent; both document ranking and query suggestion were then conducted based on the predicted user intent (Cao et al., 2009). Reinforcement learning has also been applied to model user interactions in search tasks (Guan et al., 2013; Luo et al., 2014). Unfortunately, the predefined model space or state transition structure limits the learning of rich user-system interactions.
The development of neural networks generated various solutions for context-aware document ranking. Some researchers proposed a hierarchical neural structure with RNNs to model historical queries and suggest the next query (Sordoni et al., 2015). This model was further extended with the attention mechanism to better represent sessions and capture user-level search behavior (Chen et al., 2018). Recently, researchers found that jointly learning query suggestion and document ranking can boost the model's performance on both tasks (Ahmad et al., 2018). In addition to leveraging historical queries, the historical clicked documents are also reported to be helpful in both query suggestion and document ranking (Ahmad et al., 2019).
More recently, large-scale pretrained language models, such as BERT (Devlin et al., 2019), have achieved great performance on many NLP and IR tasks (Liu et al., 2019; Khashabi et al., 2020; Khattab and Zaharia, 2020; Ma et al., 2021). Qu et al. (2020) proposed to concatenate all historical queries, clicked documents, and unclicked documents as a long sequence and leveraged BERT as an encoder to compute their term-level representations. These representations were further combined with relative position embeddings and human behavior embeddings through another transformer-based structure to get the final representations. The ranking score is computed based on the representation of the special “[CLS]” token.
Our framework is also based on BERT, but we use contrastive learning to pretrain the model in a self-supervised manner. Theoretically, this strategy better leverages the available training data, which can also be applied to existing methods.
Contrastive learning aims at learning effective representations by pulling semantically close neighbors together and pushing apart non-neighbors. It has been widely applied in computer vision (Zhuang et al., 2019; Tian et al., 2020; Chen et al., 2020) and NLP tasks (Fang and Xie, 2020; Gunel et al., 2020; Wu et al., 2020; Gao et al., 2021) and has proven its high efficiency in leveraging the training data without the need for annotation. What is required in contrastive learning is to identify semantically close neighbors. In visual representation, neighbors are commonly generated by two random transformations of the same image (such as flipping, cropping, rotation, and distortion) (Dosovitskiy et al., 2014; Chen et al., 2020). Similarly, in text representation, data augmentation techniques such as word deletion, reordering, and substitution are applied to derive similar texts from a given text sequence (Meng et al., 2021; Wu et al., 2020). Although the principle of contrastive learning is well accepted, the ways to implement it are still under exploration, with the general guiding principles of alignment and uniformity (Wang and Isola, 2020).
As for pre-training, Chang et al. (2020) designed several paragraph-level pre-training tasks and showed that Transformer models pre-trained with them can improve over the widely used BM25 (Robertson and Zaragoza, 2009). Ma et al. (2021) constructed a representative word prediction (ROP) task for pre-training BERT. Experimental results showed that the BERT model pre-trained with the ROP and masked language model (MLM) tasks achieves great performance on ad-hoc retrieval. Our proposed sequence representation optimization stage can be treated as a pre-training stage because it is trained before document ranking (our main task). However, as we do not use external datasets, we do not categorize our method as a pre-training approach.
In this paper, we propose a contrastive learning objective to optimize the sequence representation for improving the downstream document ranking task. This first attempt opens the door to more future studies on applying contrastive learning to IR.
Context-aware document ranking aims at using the historical user behavior sequence and the current query to rank a set of candidate documents. In this work, we design a new framework for this task. Our framework aims at optimizing the representation of the user behavior sequence before learning document ranking. As shown in Figure 1, our framework can be divided into two stages: (1) sequence representation optimization and (2) document ranking. In the first stage, we design a self-supervised task with contrastive learning objective to optimize the sequence representation. In the second stage, our model uses the optimized sequence representation and further learns the ranking model. We call our framework COCA – COntrastive learning for Context-Aware document ranking.
Before introducing the task and the model, we first define the important concepts and notations. We represent a user's search history as a sequence of queries {q_1, ..., q_n}, where each query q_i is associated with a submission timestamp t_i and a corresponding list of returned documents D_i. Each query is represented as the original text string that the user submitted to the search engine, and the queries are ordered according to their timestamps. Each document d has two attributes: its text content and a click label y (y = 1 if it is clicked). In general, user clicks serve as a good proxy of relevance feedback (Joachims et al., 2005, 2007; Ahmad et al., 2019; Qu et al., 2020). Given all available historical queries and clicked documents up to n turns, we denote the user behavior sequence as S_n = {q_1, d_1, ..., q_n, d_n} (following previous studies (Qu et al., 2020), we use only one clicked document per query to construct the sequence). As reported in (Ahmad et al., 2019; Qu et al., 2020), unclicked documents are less helpful and may even introduce noise, so they are not included in the user behavior sequence.
With the above concepts and notations, we briefly introduce the two stages in COCA as follows.
(1) Sequence Representation Optimization. As shown in the left side of Figure 1, our target in this stage is to obtain a better representation of the user behavior sequence. To achieve this, we first construct two augmented sequences S' and S'' from the original sequence S with randomly selected augmentation strategies (Section 3.3.1). Such a pair of sequences is considered to be similar. Then a BERT encoder is applied to obtain the representations of these two sequences (Section 3.3.2). With the contrastive loss, the model learns to pull them close and push them away from the other sequences in the same minibatch (Section 3.3.3). By comparing the two augmented sequences, the BERT encoder is forced to learn a more generalized and robust sequence representation.
(2) Document Ranking. As shown in the right side of Figure 1, in this stage we aim to rank the relevant documents as high as possible. Given the current query q_n and the historical behavior sequence S_{n-1}, we treat the concatenation of S_{n-1} and q_n as one sequence and the candidate document d as another sequence. Then, we concatenate them together and use the BERT encoder trained in the first stage to generate a representation. The final ranking score is obtained by a linear projection on the representation. A cross-entropy loss is applied between the predicted ranking score and the click label y.
The user behavior sequence contains abundant information about the user intent. To optimize the representation of the user behavior sequence, we propose a self-supervised approach. Specifically, we apply a contrastive learning objective to pull close the representation of similar sequences and push apart different ones. The similar sequences are created by the three augmentation strategies described below.
Inspired by the existing data augmentation strategies in NLP and image processing, we propose three strategies to construct similar sequences, namely term mask, query/document deletion, and behavior reordering (shown in Figure 2). These strategies correspond to three levels of variation in user behaviors, i.e., term level, query/document level, and user behavior level.
(a) Term Mask.
In natural language processing, the "word mask" or "word dropout" technique has been widely applied to avoid overfitting. It has been shown to improve the robustness of sentence representations, e.g., in sentence generation (Bowman et al., 2016) and question answering (G et al., 2018). Inspired by this, we propose to apply a random term mask operation over the user behavior sequence (including query terms and document terms) as one of the augmentation strategies for contrastive learning.
With the term-level augmentation strategy, we can obtain various user behavior sequences similar to the original one. The similar sequences only have minor differences in some terms. This aims to simulate real search situations where users may issue slightly different queries for searching the same target, and a document may satisfy similar information needs. By contrasting similar sequences with others, the model can learn the importance of different terms in both queries and documents. Besides, it can also help the model learn a more generalized sequence representation by avoiding relying too much on specific terms.
Specifically, for a user behavior sequence S, we first represent it as a term sequence [w_1, ..., w_T], where T is the total number of terms. Then, we randomly mask a proportion γ of the terms, giving a set of masked positions {m_1, ..., m_⌈γT⌉}, where m_i is the index of a term to be masked. If a term is masked, it is replaced by a special token "[T_MASK]", which is similar to the token "[MASK]" used in BERT (Devlin et al., 2019). Therefore, we formulate this augmentation strategy as a function f_mask over the user behavior sequence:

f_mask(S) = [w_1, ..., [T_MASK], ..., w_T]
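To make the operation concrete, the term mask can be sketched in a few lines of Python (a simplified sketch; the function name, the use of `random.sample`, and the seed argument are our own assumptions rather than the paper's released implementation):

```python
import random

def term_mask(terms, ratio=0.6, seed=None):
    """Replace a random proportion of terms with the special "[T_MASK]" token.

    terms: the flattened term sequence [w_1, ..., w_T] of a behavior sequence.
    ratio: the mask proportion (the paper tunes it and settles on 0.6).
    """
    rng = random.Random(seed)
    n_mask = int(len(terms) * ratio)
    # Choose n_mask distinct positions to mask.
    masked = set(rng.sample(range(len(terms)), n_mask))
    return ["[T_MASK]" if i in masked else t for i, t in enumerate(terms)]
```

The sequence length is preserved, so positional information in the encoder is unaffected; only the masked terms' content is hidden.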
(b) Query/Document Deletion. Random crop (deletion) is a common data augmentation strategy in computer vision to increase the variety of images (Chen et al., 2020; Tian et al., 2020). This operation can create a random subset of an original image and help the model generalize better. Inspired by this, we propose a query/document deletion augmentation operation for contrastive learning.
The query/document deletion strategy can improve the learning of sequence representation in two respects. First, after deletion, the resulting user behavior sequence becomes a similar sequence differing in some queries or documents, which reflects a type of variation in real query logs. By contrasting these similar sequences with others, the models are trained to learn the influence of the deleted queries or documents. Second, the generated incomplete sequence provides a partial view of the original sequence, which forces the model to learn a more robust representation without relying on complete information.
Specifically, for a user behavior sequence S, we treat each query and document as a sub-sequence and represent the sequence as [s_1, ..., s_{2n}]. Then, we randomly delete a proportion η of sub-sequences, where the indices of the sub-sequences to be deleted are chosen at random. Different from the term mask strategy, if a query or document is deleted, the whole sub-sequence is replaced by a single special token "[DEL]". This augmentation strategy is formulated as a function f_del on S:

f_del(S) = [s_1, ..., [DEL], ..., s_{2n}]
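A corresponding Python sketch of the deletion operation follows (again a simplified illustration under our own naming assumptions; the released code may differ):

```python
import random

def qd_deletion(subseqs, ratio=0.6, seed=None):
    """Replace a random proportion of query/document sub-sequences with ["[DEL]"].

    subseqs: a list of sub-sequences [s_1, ..., s_2n], each a list of terms
             (alternating queries and clicked documents).
    ratio:   the deletion proportion (the paper tunes it and settles on 0.6).
    """
    rng = random.Random(seed)
    n_del = int(len(subseqs) * ratio)
    deleted = set(rng.sample(range(len(subseqs)), n_del))
    # Each deleted query/document collapses to a single "[DEL]" token.
    return [["[DEL]"] if i in deleted else s for i, s in enumerate(subseqs)]
```

Unlike term masking, deletion shortens the content: a whole query or document disappears, leaving only a placeholder marking that something was there.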
(c) Behavior Reordering. Many tasks assume the strict order of the sequence, e.g., natural language generation (Kikuchi et al., 2016; Vaswani et al., 2017) and text coherence modeling (Li and Hovy, 2014; Li and Jurafsky, 2017; Zhu et al., 2021). However, we observe that the user search behavior sequence is much more flexible. For example, when users only have a vague search intent, they will issue several queries in a random order to obtain related information before making their real intent clear (Guan et al., 2013). Besides, sometimes users may issue a repeated query when they miss some information, which is called re-finding behavior (Ma et al., 2020; Zhou et al., 2020b). Under this circumstance, we cannot assume the order of the queries is strict. To prevent the model from relying too much on the order information and make the model more robust to the newly issued query, we propose a behavior reordering strategy for contrastive learning. Different from the former two strategies, user behavior reordering does not reduce the information contained in the sequence. Models can focus on learning content representation in queries and documents rather than merely “remembering” their relative order.
For a user behavior sequence S, we treat each query q_i and its corresponding clicked document d_i as a behavior sub-sequence and denote it as b_i = (q_i, d_i), where i ∈ [1, n]. Then, we randomly select two behavior sub-sequences and switch their positions; this operation is conducted r times, where r is determined by the reordering ratio. Considering the randomly selected i-th pairwise position as (u_i, v_i), we switch b_{u_i} and b_{v_i}, which can be formulated as a function f_reorder on S:

f_reorder(S) = [b_1, ..., b_{v_i}, ..., b_{u_i}, ..., b_n]
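The reordering operation can be sketched as follows (a minimal illustration; `n_swaps` plays the role of r above, and the function name is our own):

```python
import random

def behavior_reorder(behaviors, n_swaps=1, seed=None):
    """Randomly swap n_swaps pairs of (query, clicked document) behaviors.

    behaviors: a list of behavior sub-sequences [b_1, ..., b_n],
               each b_i a (query, document) pair.
    """
    rng = random.Random(seed)
    out = list(behaviors)  # copy so the original sequence is untouched
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Note that the content is fully preserved: only the order of behaviors changes, so the model must represent what was searched rather than memorize when.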
Previous work has shown the effectiveness of applying BERT (Devlin et al., 2019) for sequence representation (Dai and Callan, 2019; Nogueira and Cho, 2019; Qu et al., 2020; Zhu et al., 2021; Su et al., 2021). In our framework, we also use the pre-trained BERT as an encoder to represent the augmented user behavior sequences (shown in the left side of Figure 1). For a user behavior sequence S, following the design of vanilla BERT, we add the special tokens "[CLS]" and "[SEP]" at the head and tail of the sequence, respectively. Besides, to further indicate the end of each query/document, we append a special token "[EOS]" to it. Therefore, the input sequence is represented as:

[CLS] q_1 [EOS] d_1 [EOS] ... q_n [EOS] d_n [EOS] [SEP]
Then, the embedding of each token, the positional embedding, and the segment embedding (please refer to the original BERT paper (Devlin et al., 2019) for more details about these embeddings) are added together and input to BERT to obtain the contextualized representation. The output of BERT is a sequence of representations for all tokens, and we use the representation of the "[CLS]" token as the sequence representation:

h = W · BERT(S)_[CLS]

where h ∈ R^d, and W is a linear projection.
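To illustrate the input format, a behavior sequence can be assembled into the token list fed to the encoder as follows (a hypothetical helper; tokenization details in the released code may differ):

```python
def build_input(session):
    """Assemble BERT input tokens for a user behavior sequence.

    session: a list of (query_terms, clicked_doc_terms) pairs, one per turn.
    Returns: [CLS] q_1 [EOS] d_1 [EOS] ... q_n [EOS] d_n [EOS] [SEP]
    """
    tokens = ["[CLS]"]
    for q_terms, d_terms in session:
        tokens += q_terms + ["[EOS]"] + d_terms + ["[EOS]"]
    tokens.append("[SEP]")
    return tokens
```

In practice, "[EOS]" would be registered as an additional special token in the tokenizer's vocabulary so it is never split into sub-words.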
We apply a contrastive learning objective to optimize the user behavior sequence representation. A contrastive loss function is defined for the contrastive prediction task, i.e., trying to predict the positive augmentation pair in the set A. We construct A by augmenting every sequence in a minibatch twice, with the strategy of each augmentation randomly selected from our proposed three. Assuming a minibatch of size N, we obtain a set A of size 2N. The two augmented sequences derived from the same user behavior sequence form a positive pair, while all other sequences from the same minibatch are regarded as negative samples for them. Following previous work (Chen et al., 2020; Gao et al., 2021; Wu et al., 2020; Fang and Xie, 2020), the contrastive learning loss for a positive pair (i, j) is defined as:

ℓ(i, j) = −log [ exp(sim(h_i, h_j)/τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(sim(h_i, h_k)/τ) ]

where 1[k ≠ i] is an indicator function judging whether k ≠ i, sim(·, ·) is a similarity function between two representations, and τ is a hyperparameter representing the temperature. The overall contrastive learning loss is defined over all positive pairs in a minibatch:

L_cl = (1 / 2N) Σ_{i=1}^{2N} Σ_{j=1}^{2N} m_{ij} · ℓ(i, j)

where m_{ij} = 1 when (i, j) is a positive pair, and m_{ij} = 0 otherwise.
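This is the NT-Xent objective popularized by SimCLR (Chen et al., 2020). A minimal pure-Python sketch, assuming cosine similarity and a batch layout where elements 2k and 2k+1 are the two views of example k (both are our assumptions, not taken from the paper's code):

```python
import math

def nt_xent_loss(vecs, tau=0.1):
    """NT-Xent contrastive loss over a batch of 2N augmented views.

    vecs: a list of 2N representation vectors; vecs[2k] and vecs[2k+1]
          are the two augmented views of example k.
    tau:  the temperature hyperparameter.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n2 = len(vecs)
    total = 0.0
    for i in range(n2):
        j = i ^ 1  # index of i's positive partner (0<->1, 2<->3, ...)
        # Denominator sums over all other views in the batch, excluding i itself.
        denom = sum(math.exp(cos(vecs[i], vecs[k]) / tau)
                    for k in range(n2) if k != i)
        total += -math.log(math.exp(cos(vecs[i], vecs[j]) / tau) / denom)
    return total / n2
```

A real implementation would vectorize this over the similarity matrix, but the loop form makes the positive/negative structure explicit.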
From another perspective, the contrastive learning stage can be viewed as a kind of domain-specific post-training for pre-trained language models. As these contextualized language models are usually pre-trained on general corpora, such as the Toronto Books Corpus and Wikipedia, it is less effective to directly fine-tune them on our downstream ranking task if there is a domain shift. Our contrastive learning stage can help the model with domain adaptation to further improve the ranking task. This strategy has been shown to be effective in various tasks including reading comprehension (Xu et al., 2019) and dialogue generation (Whang et al., 2020; Gu et al., 2020).
In the previous step, the BERT encoder has been optimized with the contrastive learning objective. We now incorporate this BERT encoding to learn the context-aware document ranking task.
Previous studies have applied BERT for ranking in a manner of sequence pair classification (Dai and Callan, 2019; Nogueira and Cho, 2019; Qu et al., 2020). Different from the first stage, the ranking stage aims at measuring the relationship between the historical user behavior sequence S_{n-1}, the current query q_n, and a candidate document d. Therefore, we treat the concatenation of S_{n-1} and q_n as one sequence and d as another sequence, and the input is represented as:

[CLS] q_1 [EOS] d_1 [EOS] ... q_n [EOS] [SEP] d [EOS] [SEP]
Afterwards, the embedding of each token, the positional embedding, and the segment embedding are added together and input to BERT. Note that the input contains two sequences, so we set their segment embeddings to 0 and 1, respectively, to distinguish them. The output representation of "[CLS]" is used as the sequence representation to calculate the ranking score:

score(S_{n-1}, q_n, d) = W_r · BERT(I)_[CLS]

where the score is a scalar, and W_r is a linear projection mapping the representation into a score.
We conduct experiments on two public datasets: the AOL search log (Pass et al., 2006) and the Tiangong-ST query log (Chen et al., 2019). (We understand that the AOL dataset should normally not be used in experiments. We choose to use it here because it contains real human clicks, which fits our experiments well. The MS MARCO Conversational Search dataset may be another possible dataset, but its sessions are artificially constructed rather than real search logs, so we do not use it.)
For the AOL search log, we use the version provided by Ahmad et al. (2019). The dataset contains a large number of sessions, and each session consists of several queries. In the training and validation sets, there are five candidate documents for each query in the session. In the test set, 50 documents retrieved by BM25 (Robertson and Zaragoza, 2009) are used as candidates for each query in the session. All queries have at least one satisfied click in this dataset, and if there is more than one clicked document, we use the first one in the list to construct the user behavior sequence.
The Tiangong-ST dataset is collected from a Chinese commercial search engine. It contains web search session data extracted from an 18-day search log. Each query in the dataset has 10 candidate documents. In the training and validation sets, we use the clicked documents as the satisfied clicks. Some queries may have no satisfied click; in that case, we use a special token "[Empty]" for padding. For testing, the last query of each session is manually annotated with relevance scores, while the other (previous) queries in the session have only click labels. Therefore, we construct two test sets based on the original test data as follows:
(1) Tiangong-ST-Click: In this test set, we only use the previous queries (i.e., without the last query) and their candidate documents. Similar to the AOL dataset, in this test scenario, all documents are labeled with "click" or "unclick", and the model is asked to rank the clicked documents as high as possible. Note that queries with no clicked document are not used for testing.
(2) Tiangong-ST-Human: In this test set, only the last query with human annotated relevance score is used. The score ranges from 0 to 4. More details can be found in (Chen et al., 2019).
The statistics of both datasets are shown in Table 1. Following previous studies (Gao et al., 2010; Huang et al., 2013, 2018; Ahmad et al., 2019), to reduce memory requirements and speed up training, we only use the document title as its content.
Evaluation Metrics. We use Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain at position k (NDCG@k) as evaluation metrics. For Tiangong-ST-Human, since the candidate documents are provided by a commercial search engine, the number of irrelevant documents is expected to be limited. Hence, as suggested by the authors of the dataset (Chen et al., 2019), we only evaluate the results with NDCG@k. All evaluation results are calculated by TREC's evaluation tool (trec_eval) (Gysel and de Rijke, 2018).
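For reference, MRR and NDCG@k for a single ranked list can be computed as below (a simplified sketch using the linear-gain DCG formulation; trec_eval's tie-breaking and per-query averaging are not reproduced here):

```python
import math

def mrr(rels):
    """Reciprocal rank of the first relevant result; rels is the
    relevance of the ranked list, top first (0 = not relevant)."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(rels, k):
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1))."""
    def dcg(rs):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rs, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0
```

With binary click labels (as on AOL and Tiangong-ST-Click), the linear-gain and exponential-gain DCG variants coincide, so the choice of formulation only matters for the graded Tiangong-ST-Human labels.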
AOL (Train / Valid / Test):

| Statistic | Train | Valid | Test |
|---|---|---|---|
| Avg. # Query per Session | 2.58 | 2.58 | 2.59 |
| Avg. # Document per Query | 5 | 5 | 50 |
| Avg. Query Len | 2.86 | 2.85 | 2.9 |
| Avg. Document Len | 7.27 | 7.29 | 7.08 |
| Avg. # Clicks per Query | 1.08 | 1.08 | 1.11 |

Tiangong-ST (Train / Valid / Test):

| Statistic | Train | Valid | Test |
|---|---|---|---|
| Avg. # Query per Session | 2.41 | 2.51 | 3.21 |
| Avg. # Document per Query | 10 | 10 | 10 |
| Avg. Query Len | 2.89 | 1.83 | 3.46 |
| Avg. Document Len | 8.25 | 6.99 | 9.18 |
| Avg. # Clicks per Query | 0.94 | 0.53 | (3.65) |
Table 2 note: COCA achieves significant improvements over all existing methods in a paired t-test with p-value < 0.01.
We compare our method with several baseline methods, including those for (1) ad-hoc ranking and (2) context-aware ranking.
(1) Ad-hoc ranking methods. These methods do not use context information (historical queries and documents); only the current query is used for ranking documents. KNRM (Xiong et al., 2017) performs fine-grained interaction between the current query and a candidate document to obtain a matching matrix; the ranking features and scores are then calculated by a kernel pooling method. ARC-I (Hu et al., 2014) is a representation-based method that encodes the query and the document separately with CNNs and computes the matching score by an MLP. ARC-II (Hu et al., 2014) is an interaction-based method: a matching map is constructed from the query and document, based on which matching features are extracted by CNNs; the score is also computed by an MLP. Duet (Mitra et al., 2017) computes local and distributed representations of the query and document by several layers of CNNs and MLPs, then integrates both interaction-based and representation-based features to compute ranking scores.
(2) Context-aware ranking methods. These methods can leverage both context information and current query to rank candidate documents.
M-NSRF (Ahmad et al., 2018) is a multi-task model, which jointly predicts the next query and ranks corresponding documents. The historical queries in a session are encoded by a recurrent neural network (RNN). The ranking score is computed based on the query representation, history representation, and document representation.
M-Match-Tensor (Ahmad et al., 2018) is similar to M-NSRF but learns a contextual representation for each word in the queries and documents. The computation of the ranking score is based on the word-level representations. CARS (Ahmad et al., 2019) is also a multi-task model, which learns query suggestion and document ranking simultaneously. Different from M-NSRF, this method also models the clicked documents in the history through an RNN. An attention mechanism is applied to compute representations for each query and document. The final ranking score is computed based on the representations of historical queries, clicked documents, the current query, and candidate documents. (One may notice slight discrepancies between our results and those of the original CARS paper. This is due to different tie-breaking strategies in evaluation: following (Qu et al., 2020), we use trec_eval, while the authors of CARS use an author-implemented evaluation.) HBA-Transformer (Qu et al., 2020) (henceforth denoted as HBA) concatenates historical queries, clicked documents, and unclicked documents into a long sequence and applies BERT (Devlin et al., 2019) to encode them into representations. Then, a higher-level transformer structure with behavior embedding and relative position embedding is employed to further enhance the representation. Finally, the representation of the first token ("[CLS]") is used to calculate the ranking score. This is the state-of-the-art method for the context-aware document ranking task. It is the most similar to our approach, but without contrastive learning.
We use PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2019) to implement our model. The pre-trained BERT is provided by Huggingface (https://huggingface.co/bert-base-uncased). The maximum number of tokens in both stages is set to 128; sequences with more than 128 tokens are truncated by popping query-document pairs from the head. We use the AdamW (Loshchilov and Hutter, 2019)
optimizer in both stages. In the sequence representation optimization stage, both the term mask ratio and the query/document deletion ratio are tuned from 0.1 to 0.9 and set to 0.6. As for behavior reordering, only one pair of positions is switched, because sessions are short (2.5 queries per session on average). The three strategies are randomly selected. Note that the reordering strategy can only be applied to sessions with more than one query. The batch size is set to 128 and the temperature to 0.1. We train the model for four epochs with a learning rate of 5e-5. In the document ranking stage, we apply a dropout layer on the sequence representation with a rate of 0.1. The learning rate is set to 5e-5 and linearly decayed during training. We train the model for three epochs. All hyperparameters are tuned based on performance on the validation set. Our code is released on GitHub at https://github.com/DaoD/COCA.
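For illustration, the three augmentation strategies could be sketched as follows on a toy session. The placeholder tokens, the uniform per-item application of the 0.6 ratio, and the random pair swap are simplifying assumptions rather than the exact implementation:

```python
import random

RATIO = 0.6  # tuned mask/deletion ratio from the text, applied per item here

def term_mask(tokens, rng):
    # TM: replace a fraction of terms with a mask placeholder
    return ["[MASK]" if rng.random() < RATIO else t for t in tokens]

def query_doc_deletion(items, rng):
    # QDD: replace a fraction of whole queries/documents with a placeholder
    return ["[DEL]" if rng.random() < RATIO else x for x in items]

def behavior_reorder(items, rng):
    # BR: switch one pair of positions (only for sessions with >1 query)
    items = list(items)
    if len(items) > 1:
        i, j = rng.sample(range(len(items)), 2)
        items[i], items[j] = items[j], items[i]
    return items

rng = random.Random(42)
session = ["q1 d1", "q2 d2", "q3 d3"]
augmented = behavior_reorder(session, rng)
```

In training, two augmented views of the same sequence form a positive pair, while views of other sequences in the batch serve as negatives.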
The experimental results are shown in Table 2. We find that COCA outperforms all existing methods, which clearly demonstrates the superiority of our approach. Based on the results, we make the following observations.
(1) Among all models, COCA achieves the best results, which demonstrates its effectiveness in modeling user behavior sequences through contrastive learning. In general, the context-aware document ranking models perform better than ad-hoc ranking models. For example, on the AOL dataset, the weak contextualized model M-NSRF still outperforms the strong ad-hoc ranking model KNRM. This indicates that modeling the user behavior sequence is beneficial for understanding user intent and improving ranking results.
(2) Compared with the RNN-based multi-task learning models (M-NSRF, M-Match-Tensor, and CARS), the BERT-based methods (HBA and COCA) achieve better performance. Specifically, on the AOL dataset, HBA and COCA improve the results by more than 15% on all metrics. It is worth noting that HBA and COCA learn document ranking independently, without supervision signals from the query suggestion task. This result reflects the clear advantage of applying pre-trained language models (e.g., BERT) to document ranking.
(3) HBA is the state-of-the-art method for context-aware document ranking. It designs complex structures over a BERT encoder to model user behavior in various aspects, including an intra-behavior attention over clicked and skipped documents, an inter-behavior attention over all turns, and an embedding indicating their relative positions. In contrast, our COCA only applies a standard BERT encoder and achieves significantly better performance (paired t-test, p-value < 0.01). Both MAP and MRR are improved by about 4%. The key difference between the two is the contrastive learning we use, so the improvements of COCA over HBA directly reflect the advantage of contrastive learning for behavior sequence representation.
(4) Intriguingly, the improvements of COCA on AOL and Tiangong-ST-Click are much more significant than those on the Tiangong-ST-Human test set. There are two potential reasons: (a) COCA is trained on data with click labels rather than relevance labels, and the construction of the user behavior sequence is also based on click labels; the model is therefore better at predicting click-based scores than relevance scores. (b) According to our statistics, more than 77.4% of documents are labeled as relevant (i.e., their annotated relevance scores are larger than 1), so the base score is very high: even the basic model ARC-I achieves 0.7088 and 0.8691 in terms of NDCG@1 and NDCG@10. Without more accurate relevance labels for training, it is difficult for our model to further improve relevance ranking.
We further investigate the following research questions.
|Strategies|MAP|MRR|NDCG@1|NDCG@3|NDCG@10|
|TM + QDD|0.5492|0.5592|0.4005|0.5467|0.6155|
|TM + BR|0.5448|0.5550|0.3963|0.5414|0.6115|
|QDD + BR|0.5473|0.5576|0.3995|0.5444|0.6132|
To study the effectiveness of our proposed sequence augmentation strategies, we test the performance on AOL with different combinations of strategies. The results are shown in Table 3. "None" means that we use the original BERT parameters for document ranking, without our proposed sequence optimization stage. We denote the term mask strategy as "TM", query/document deletion as "QDD", and behavior reordering as "BR". Note that the reordering strategy can only be applied to sequences with more than two query-document pairs, and thus cannot work independently.
First, compared with using no sequence optimization stage, optimizing the sequence representation with any combination of our proposed strategies is helpful. This clearly demonstrates that our method is effective in building a more robust representation. Second, the term mask strategy works best: this single strategy improves MAP by around 2.5%. This implies that learning from user behavior sequences with similar queries and documents is very useful for document ranking. Finally, it is interesting that combining the term mask and behavior reordering strategies (i.e., "TM + BR") degrades performance compared with using the term mask strategy alone. Examining the sequence representation optimization process, we find that the contrastive learning loss in this case is very low and the prediction accuracy very high, indicating that the model easily overfits this combination and cannot learn a good sequence representation.
As reported in recent work (Chen et al., 2020), the temperature and batch size are two important hyperparameters in contrastive learning. To investigate their impact, we train our model under different settings and test the resulting performance. In addition to evaluating ranking performance, we also compute the loss value (cross-entropy, CE) and prediction accuracy of the contrastive prediction task. The results are shown in Table 4.
Considering temperature: according to Equation (9), a higher temperature causes a higher loss, which is consistent with our results. However, a lower contrastive loss does not always lead to better performance; indeed, a temperature of 0.1 is the best choice for the document ranking task. It is therefore important to select a proper temperature for contrastive learning. Similar observations are reported in other recent studies (Chen et al., 2020; Gao et al., 2021).
As for batch size, we can see that contrastive learning benefits from larger batch sizes. According to a recent study (Chen et al., 2020), larger batch sizes provide more negative examples, which facilitates convergence. Due to our limited hardware resources, the largest batch size we can handle is 128; we speculate that a larger batch size could bring further improvements.
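Both effects can be seen in a minimal in-batch contrastive loss sketch. This is a generic InfoNCE form rather than necessarily the paper's exact Equation (9): the other augmented sequences in the batch act as negatives, a lower temperature sharpens the softmax over similarities, and adding negatives (i.e., a larger batch) raises the loss and task difficulty:

```python
import math

def info_nce_loss(sims, pos_idx, temperature):
    """sims: similarities of one anchor to all candidates in the batch;
    pos_idx: index of the positive (augmented) view of the anchor."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[pos_idx]  # -log softmax of the positive

sims = [0.9, 0.2, 0.1]  # positive first, then two in-batch negatives
sharp = info_nce_loss(sims, 0, temperature=0.1)
flat = info_nce_loss(sims, 0, temperature=1.0)
# more negatives (a larger batch) -> a larger denominator and a higher loss
more_negs = info_nce_loss(sims + [0.1, 0.2], 0, temperature=1.0)
```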
To understand the impact of the session length on the final ranking performance, we categorize the sessions in the test set into three bins:
(1) Short sessions (with 1-2 queries) - 77.13% of the test set;
(2) Medium sessions (with 3-4 queries) - 18.19% of the test set;
(3) Long sessions (with 5+ queries) - 4.69% of the test set.
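The binning above can be expressed as a simple lookup (illustrative only; the bin names are those used in the text):

```python
def session_bin(num_queries):
    # group a session by its number of queries into the three bins above
    if num_queries <= 2:
        return "short"
    if num_queries <= 4:
        return "medium"
    return "long"

bins = [session_bin(n) for n in (1, 2, 3, 4, 5, 7)]
```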
As we also consider sessions with only one query, the proportion of short sessions is higher than that reported in (Ahmad et al., 2019).
We compare COCA with Duet, CARS, and HBA on the AOL dataset and show the results in terms of MAP and NDCG@3 on the left side of Figure 3. First, it is evident that COCA outperforms all context-aware baseline methods on all three bins of sessions, which suggests COCA's advantage in learning search context. Second, the ad-hoc ranking method Duet performs worse than the context-aware ranking methods, demonstrating once again that modeling historical user behavior is essential for improving document ranking performance. Third, COCA performs relatively worse in long sessions than in short sessions. We hypothesize that longer sessions are intrinsically more difficult, and a similar trend in the baseline methods supports this; a long session may contain more noise or exploratory search. This is also reflected by the larger improvement of COCA over the ad-hoc baseline ranker Duet in short sessions than in long sessions (37.10% vs. 26.83% in terms of MAP). This result implies that it may be useful to model the immediate search context rather than the whole context.
It is important to study how the modeled search context helps document ranking as a search session progresses. We compare COCA with CARS and HBA at individual query positions in short (S), medium (M), and long (L) sessions. The results are reported on the right side of Figure 3. Due to limited space, long sessions with more than seven queries are not presented.
It is noticeable that the ranking performance improves steadily as a search session progresses, i.e., as more search context becomes available for predicting the next click. Both COCA and HBA benefit from this, while COCA improves faster by better exploiting the context. In contrast, the performance of CARS is unstable. This implies that BERT-based methods are much more effective at modeling search context. One interesting finding is that, as search sessions get longer (e.g., from L4 to L7), the gain of COCA diminishes. We attribute this to the noisier nature of long sessions.
As reported by recent studies (Chen et al., 2020; Gao et al., 2021), the amount of data for contrastive learning has a great impact on the downstream task (document ranking in our case). We investigate this influence by training the model with different proportions of data and different numbers of epochs. As a comparison, we also show the performance of COCA without the sequence representation optimization stage (denoted as "None").
We first reduce the amount of training data used for contrastive learning (all models are trained for four epochs, with different amounts of data). The results are shown on the left side of Figure 4. It is clear that contrastive learning benefits from a larger amount of data. Surprisingly, our proposed sequence representation optimization stage still works with only 20% of the training data. This demonstrates the potential and effectiveness of learning better sequence representations for context-aware document ranking. We also train COCA with different numbers of epochs in the sequence optimization stage. The document ranking performance is shown on the right side of Figure 4. The results suggest that contrastive learning also benefits from more training epochs. In our implementation, the data augmentation strategies are randomly selected in different epochs, so the sequence representation can be learned more fully. Beyond four epochs, the performance plateaus without further improvement, so four epochs is the best choice in our experiments.
In this work, we aimed at learning better representations of user behavior sequences for context-aware document ranking. A self-supervised task with a contrastive learning objective is introduced to optimize the sequence representation before learning document ranking. To construct positive pairs for contrastive learning, we proposed three data augmentation strategies at the term, query/document, and user behavior levels. These strategies improve the generalization and robustness of the sequence representation, which is then used in the document ranking task. We conducted comprehensive experiments on two large-scale search log datasets. The results clearly show that our proposed method is very effective; in particular, our method with contrastive learning outperforms the close competitor HBA, which lacks it.
This is the first attempt to utilize contrastive learning in IR and much remains to be explored. For example, it may be more appropriate to exploit recent history instead of the whole history. Query and document weighting in the history could also be a promising avenue.
Towards context-aware search by learning a very large variable length hidden markov model from search logs. In Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl (Eds.). ACM, 191–200. https://doi.org/10.1145/1526709.1526736
Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1597–1607. http://proceedings.mlr.press/v119/chen20j.html
Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, COMAD/CODS 2018, Goa, India, January 11-13, 2018, Sayan Ranu, Niloy Ganguly, Raghu Ramakrishnan, Sunita Sarawagi, and Shourya Roy (Eds.). ACM, 348–351. https://doi.org/10.1145/3152494.3167988
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA. IEEE Computer Society, 1735–1742. https://doi.org/10.1109/CVPR.2006.100
PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 8024–8035. https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Local Aggregation for Unsupervised Learning of Visual Embeddings. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 6001–6011. https://doi.org/10.1109/ICCV.2019.00610