1 Introduction
In real web search scenarios, many queries are ambiguous or multifaceted. For instance, the query "apple" may refer to the fruit or to the IT company, and "rocket" may denote a launch vehicle or the Houston Rockets basketball team. To satisfy users with different information needs, search result diversification approaches, which provide search results covering a wide range of subtopics for a query, have been widely studied. These approaches rank documents or webpages by taking both relevance and information novelty (diversification) into consideration.
A majority of traditional methods for search result diversification are heuristic methods with manually defined functions
[2, 5, 13, 14, 15, 16]. Their key rationale is that the subsequent document should be "different" from the ones already ranked. As a representative work, maximal marginal relevance (MMR) [2] formulates the construction of a diverse ranking as a process of sequential document selection. In MMR, the marginal relevance is defined as a sum of the query-document relevance and the maximal document distance (as novelty) under a predefined document distance function. Recently, to avoid heuristic methods with manually defined evaluation functions, machine learning methods have been proposed and applied to search result diversification
[21, 22, 24, 25, 28]. The basic idea is to automatically learn a diverse ranking model from labeled training data. Typical approaches include relational learning to rank (R-LTR) [28] and its variations [21, 22, 24]. In [21, 28], the novelty of a document with respect to the previously selected documents is encoded as a set of handcrafted novelty features. In [22], neural tensor networks are extended to model the novelty among documents. However, all these methods model the utility of a candidate document based either on carefully designed heuristics or on handcrafted relevance and novelty features; the utility perceived from the preceding documents is not fully utilized. To address this, the latest work in search result diversification, the Markov decision process diversification model (MDP-DIV) [23], formalizes the construction of a diverse ranking as a sequential decision-making process and models it with a Markov decision process (MDP). A reinforcement learning technique, the REINFORCE policy gradient algorithm
[18], is adopted to adjust the model parameters. MDP-DIV outperforms the state-of-the-art baselines on the TREC benchmark datasets. However, its low convergence rate, often requiring tens of thousands of iterations, is unacceptable, especially for industrial applications.
In this paper, we aim to improve MDP-DIV by speeding up its convergence rate while maintaining its accuracy. The primary reasons for the low convergence rate are the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the relevance of all remaining documents, which forms a huge search space; on the other hand, data scarcity compels the agent to perform more "trial and error" interactions with the environment. To address these problems, we propose the MDP-DIV-kNN and MDP-DIV-NTN methods. MDP-DIV-kNN adopts a nearest neighbor strategy to linearly reduce the action space at each position: specifically, it removes the nearest neighbors of the most recently selected action (document). In contrast, MDP-DIV-NTN employs a pre-trained diversification neural tensor network (NTN-DIV) as an evaluation model and combines its results with the MDP to produce the final ranking list. There are two instantiations of MDP-DIV-NTN: MDP-DIV-NTN(D) directly filters the pre-ranked list, while MDP-DIV-NTN(E) sequentially models the novelty of a candidate document with respect to the previously selected documents. The main contributions of this paper can be summarized as follows:

We analyze the reasons for the slow convergence of MDP-DIV, and find that it is mainly due to the large action space and data scarcity.

We propose the MDP-DIV-kNN and MDP-DIV-NTN methods, which improve the convergence rate while maintaining the accuracy of MDP-DIV for search result diversification.

Extensive experiments are carried out on the TREC 2009–2012 Web Track benchmark datasets, and the results demonstrate that the proposed methods indeed accelerate MDP-DIV and outperform the state-of-the-art competitors.
The remainder of the paper is structured as follows. In Section 2, we briefly review related work. In Section 3, the Markov decision process, MDP-DIV, and NTN-DIV are introduced as preliminaries. The proposed methods are presented in Section 4. Experimental results are provided in Section 5 to demonstrate the effectiveness of the proposed methods.
2 Related Work
2.1 Search Result Diversification
One of the key problems in search result diversification is diverse ranking. Formalizing the construction of a diverse ranking as a process of sequential document selection is common practice. This ranking strategy provides a more rational way to model the utility of a candidate document, which depends not only on the document itself but also on the preceding documents. Existing approaches can be classified into two categories, namely heuristic methods
[2, 5, 6, 7, 16] and machine learning methods [21, 22, 23, 24, 28].
The representative work of the first kind is the maximal marginal relevance (MMR) [2] criterion, which guides the design of diverse ranking models. In MMR, the sequential document selection is based on the marginal relevance score, a linear combination of a query-document relevance score and a document novelty score. A variation of MMR is the probabilistic latent MMR model proposed by Guo and Sanner [6]. PM2 [5] tackles the problem from the perspective of proportionality. xQuAD [16] explicitly models the relationships between the documents retrieved for the query and the coverage of possible sub-queries. The authors of [7] propose to combine implicit and explicit topic representations for constructing a diverse ranking. All these methods model the utility of a candidate document based on carefully designed heuristics with manually defined evaluation functions. However, it is hard to design a unified similarity function for different tasks.
Recently, machine learning approaches have been proposed for the search result diversification problem. The ranking score for diverse ranking is based on a linear combination of relevance features and novelty features, and the parameters can be automatically adjusted from the training data. Zhu et al. [28] propose the relational learning to rank (R-LTR) framework, which constructs the diverse ranking model by optimizing an objective function. With different definitions of the objective functions and optimization techniques, different diverse ranking algorithms have been proposed [21, 22, 24]. Xia et al. [21] learn a maximal marginal relevance model via directly optimizing diversity evaluation measures. The authors of [22] utilize a neural tensor network to model the novelty relations. To avoid handcrafted features and to fully utilize the utility in preceding documents, Xia et al. [23] propose to adapt reinforcement learning techniques, formalizing diverse ranking as a process of sequential decision making modeled with an MDP, where the parameters are trained by the REINFORCE policy gradient algorithm [18].
2.2 Reinforcement Learning for Information Retrieval
Reinforcement learning (RL) techniques are widely used in information retrieval (IR) applications. The aforementioned MDP for diverse ranking [23] is a representative work of this kind. Moreover, MDPs can also be applied to learning to rank problems [20], in which the proposed MDPRank model utilizes an MDP to directly optimize NDCG at all ranking positions. Wang et al. [19] propose a game-theoretical minimax game to iteratively optimize generative and discriminative retrieval models, in which the generative retrieval model is optimized by the REINFORCE policy gradient algorithm. In session search, Luo et al. [10] propose to utilize a partially observable Markov decision process (POMDP) to model session search as a dual-agent stochastic game, constructing a win-win search framework. The authors of [27] propose log-based document re-ranking, modeled as a POMDP, to improve ranking performance. Moreover, RL techniques are also utilized in recommender systems. For instance, Shani et al. [17] designed an MDP-based recommender system which employs a strong initial model to converge quickly. The multi-armed bandits technique has also been utilized for diverse ranking [12]. Lu and Yang [9] propose a neural-optimized POMDP model for building a collaborative filtering recommender system.
Recent advances in reinforcement learning have pushed IR research one step further, with promising performance delivered by models such as MDP-DIV and MDPRank. However, MDP-DIV suffers from very slow convergence, which hinders its usability in real applications. In this paper, we aim to improve MDP-DIV by speeding up its convergence rate without sacrificing much accuracy.
3 Preliminaries
3.1 Markov Decision Process
The search result diversification problem considered in this paper can be formulated as a continuous-state Markov decision process (MDP) [11, 18], a standard framework for sequential decision making. An MDP comprises states, actions, transitions, rewards, and a policy, and can be represented by a tuple ⟨S, A, T, R, π⟩:
States. S is the set of states. In [23], a state is defined as a tuple consisting of the preceding ranked documents, the candidate documents, and the utility that the agent perceives from the preceding documents as well as the query.
Actions. A is a discrete set of actions that an agent can take. The possible actions at time step t depend on the current state s_t, denoted as A(s_t).
Transition. T(s_t, a_t) is the state transition function, which maps a state s_t into a new state s_{t+1} in response to the selected action a_t.
Reward. R(s, a) is the immediate reward, also known as reinforcement. It gives the agent an immediate reward for taking action a in state s.
Policy. π describes the behavior of an agent; it is a mapping from states to actions. Generally speaking, π is optimized so that moving through the state space achieves the optimal long-term discounted reward.
The agent interacts with the environment at each time step. At time step t, the agent receives the environment's state s_t and selects an action a_t ∈ A(s_t), where A(s_t) is the set of actions available in state s_t. As a consequence of the action taken, the agent receives a numerical reward r_t = R(s_t, a_t), and the state simultaneously changes to s_{t+1} at the next time step.
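The interaction loop above can be sketched generically. In the toy instantiation below, everything concrete (the documents, their relevance grades, and the greedy policy) is hypothetical and only illustrates the state/action/reward/transition interfaces:

```python
def run_episode(init_state, available_actions, transition, reward, policy):
    """Roll out one episode: at step t the agent observes s_t, picks a_t
    from A(s_t) via the policy, receives r_t = R(s_t, a_t), and the
    environment moves to s_{t+1} = T(s_t, a_t)."""
    s, trajectory = init_state, []
    while True:
        actions = available_actions(s)
        if not actions:
            return trajectory
        a = policy(s, actions)
        trajectory.append((s, a, reward(s, a)))
        s = transition(s, a)

# Toy ranking instantiation: a state is the tuple of already-ranked doc ids.
docs = {"d1": 3, "d2": 1, "d3": 2}                      # hypothetical doc -> relevance grade
available = lambda s: sorted(d for d in docs if d not in s)
transition = lambda s, a: s + (a,)
reward = lambda s, a: docs[a]                           # immediate reward = relevance gain
greedy = lambda s, actions: max(actions, key=docs.get)  # a stand-in policy

trajectory = run_episode((), available, transition, reward, greedy)
ranking = [a for _, a, _ in trajectory]                 # ["d1", "d3", "d2"]
```

The greedy stand-in policy simply ranks by decreasing relevance; a learned policy would replace it while the episode machinery stays unchanged.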
3.2 MDP-DIV
MDP-DIV, proposed by Xia et al. [23], is the first approach that utilizes reinforcement learning techniques for search result diversification. The construction of a diverse ranking is formalized as a process of sequential decision making, modeled with a continuous-state Markov decision process (MDP). The user's perceived utility is treated as part of the MDP state.
More specifically, at time step t, the agent receives the environment's state s_t, which models the user's dynamic perceived utility, starting from the first ranking position. Based on the received state, the agent chooses an action according to the policy it has learned so far. The policy in MDP-DIV is formulated as a softmax-type function that maps the current state to a probability distribution over the possible actions. From the selected action (document), the user perceives some additional utility, the immediate reward, defined as the quality improvement of the selected documents in terms of
DCG or subtopic recall (S-recall), two widely used metrics in search result diversification. Then the system transits to a new state: the transition function, which maps the old state and the selected document to a new state, is implemented in a recurrent manner. The REINFORCE policy gradient algorithm [18] is adopted to adjust the model parameters so as to maximize the expected long-term discounted reward.
The end-to-end MDP-DIV model unifies relevance and novelty as the criterion for selecting documents, directly optimizes a diversity evaluation measure, and outperforms the state-of-the-art baselines on the TREC benchmark datasets. However, its low convergence rate, requiring tens of thousands of iterations in the training phase, is unacceptable, especially for industrial applications. The reasons are two-fold: (i) In the training stage, for the decision at each ranking position, the agent has to go through the whole remaining candidate set, which introduces high computational complexity. Suppose we are given N training queries, each associated with M retrieved documents (for ease of explanation, we assume each query has the same number of documents). Since position t requires scoring the M − t + 1 remaining candidates, the diverse ranking process costs N · M(M + 1)/2 query-document relevance evaluations for just one iteration. Moreover, the reinforcement learning process often needs a large number of iterations to converge. It is therefore a catastrophe if we face a large discrete action space, i.e., if M is large. (ii) The retrieved documents are too scarce to train on, which means the agent has to perform more "trial and error" interactions with the environment. For instance, more than 70% of the data utilized in MDP-DIV is unlabeled (i.e., covers no subtopic). Worse still, some queries are associated with completely irrelevant (unlabeled) documents.
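Two of the quantities above are easy to make concrete. The sketch below uses hypothetical relevance grades and one common DCG formulation (MDP-DIV's exact reward definition may differ in detail); it shows the immediate reward as a per-position DCG gain, and checks the M(M+1)/2 evaluation count for one query:

```python
import math

def dcg_gain(rel, rank):
    """DCG contribution of a document with graded relevance `rel`
    placed at 1-based position `rank` (one common DCG variant)."""
    return (2 ** rel - 1) / math.log2(rank + 1)

# Immediate rewards for a hypothetical ranking: the document chosen at
# step t lands at position t, and its reward is its DCG gain there.
rels = [3, 2, 0, 1]
rewards = [dcg_gain(r, pos) for pos, r in enumerate(rels, start=1)]
episode_return = sum(rewards)      # equals DCG@4 of the whole ranking

# Evaluation cost of one episode: position t scores the M - t + 1
# remaining candidates, so one query costs M(M+1)/2 evaluations.
M = 50
evals_per_query = sum(M - t + 1 for t in range(1, M + 1))   # M(M+1)/2
```

With N such queries, one training iteration costs N · M(M+1)/2 evaluations, which is why a large M is so damaging.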
3.3 NTN-DIV
The NTN-DIV model, proposed by Xia et al. [22], models document novelty with neural tensor networks. Intuitively, a neural tensor network models the relationship between two entities with a bilinear tensor product. This idea naturally extends to modeling the novelty of a document with respect to the other documents for search result diversification. Suppose we are given a set of candidate documents, each characterized by a preliminary representation obtained from an embedding model such as doc2vec. Let x_c denote the representation of a candidate document and x_S a representation of the set of already-ranked documents. With z hidden tensor slices, the ranking function can be defined as in Eq. (1):
(1)    f(x_c, x_S) = w_r^T x_c + u^T tanh(x_c^T W^{[1:z]} x_S)
where the first term is the relevance score (to learn end-to-end, embedding features are used instead of handcrafted relevance features), with w_r weighting the embedding features; the second term is the novelty score computed by the neural tensor network. Specifically, W^{[1:z]} is a three-way tensor with z slices representing the relationships among documents, where the bilinear form x_c^T W^{[i]} x_S stands for the i-th feature of the relationship between the candidate document and the selected documents, and u weights the importance of the slices of the tensor. The primary merit of using a neural tensor network to model document novelty is that the tensor relates the candidate document and the selected documents multiplicatively, instead of going through a predefined similarity function or a linear combination of novelty features. To the best of our knowledge, NTN-DIV is the best existing approach for search result diversification apart from MDP-DIV.
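The bilinear scoring of Eq. (1) can be sketched in a few lines. The tiny dimensions, the particular weights, and the use of a single pooled vector x_s for the selected set are illustrative assumptions, not the exact NTN-DIV configuration:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bilinear(x, W_slice, y):
    """x^T W y for one tensor slice; W_slice is a list of rows."""
    return dot(x, [dot(row, y) for row in W_slice])

def ntn_score(x_c, x_s, w_r, W, u):
    """Relevance term w_r^T x_c plus a tensor novelty term
    u^T tanh(x_c^T W^{[1:z]} x_s), in the spirit of Eq. (1)."""
    relevance = dot(w_r, x_c)
    slice_features = [math.tanh(bilinear(x_c, W_i, x_s)) for W_i in W]
    return relevance + dot(u, slice_features)

# Toy 2-d embeddings and a z = 2 slice tensor (all values hypothetical).
x_c, x_s = [1.0, 0.0], [0.0, 1.0]
w_r, u = [0.5, 0.5], [1.0, 1.0]
W = [[[1.0, 0.0], [0.0, 0.0]],      # slice 1
     [[0.0, 1.0], [0.0, 0.0]]]      # slice 2
score = ntn_score(x_c, x_s, w_r, W, u)   # = 0.5 + tanh(1)
```

Each slice contributes one multiplicative "relationship feature" between the candidate and the selected documents, which is exactly the expressiveness a linear novelty-feature combination lacks.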
4 Methodology
As discussed above, the large action space and data scarcity lead to a low convergence rate, so we propose two strategies to deal with this issue. The first is a nearest neighbor strategy, which discards the nearest neighbors of the most recently selected action (document). The second relies on the pre-trained NTN-DIV [22] model, employing it as an evaluation model and combining its results with the MDP to produce the final ranking. The two strategies are realized by the proposed MDP-DIV-kNN and MDP-DIV-NTN methods, respectively. Both methods are based on the original MDP-DIV and differ from each other only in the episode sampling procedure. Suppose we are given labeled training data in which each query q is associated with a set of retrieved documents D and a binary label matrix Y over the documents: Y_{ij} = 1 if document d_i contains the j-th subtopic of q, and 0 otherwise. The reward function is based on DCG. As an overview of our approaches, we first summarize the main procedure in Algorithm 1. Similar to the MDP-DIV model, our approaches work in an iterative manner. The main improvement comes from step 4, where two different sampling methods are developed to efficiently search the action space. Next, we elaborate on the two methods.
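Although Algorithm 1 is not reproduced here, its shared iterative structure can be sketched as an episode sampler with a pluggable step 4. The uniform-random hook below is a hypothetical placeholder for the two sampling methods described next:

```python
import random

def sample_episode(candidates, sample_step):
    """Skeleton of the shared procedure: repeatedly apply `sample_step`
    (the methods' step 4), which picks the next document and returns
    the, possibly shrunken, action space for the next position."""
    ranked, remaining = [], list(candidates)
    while remaining:
        chosen, remaining = sample_step(ranked, remaining)
        ranked.append(chosen)
    return ranked

def uniform_step(ranked, remaining):
    """Placeholder hook: uniform-random choice, no action-space shrinking."""
    d = random.choice(remaining)
    return d, [x for x in remaining if x != d]

random.seed(0)
episode = sample_episode(["d1", "d2", "d3", "d4"], uniform_step)
```

MDP-DIV-kNN and MDP-DIV-NTN differ only in what their `sample_step` returns as the shrunken `remaining` set; the outer loop and the REINFORCE update around it stay identical.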
4.1 k Nearest Neighbors Strategy
The action evaluation is a parameterized function that takes both state and action as input. Hence, each time an action is selected, M evaluations have to be performed first, where M is the size of the action space. This quickly becomes intractable, especially if the parameterized function is costly to evaluate. In MDP-DIV, the policy is defined as a normalized softmax function whose input is the bilinear product of the perceived utility and the candidate document, as in Eq. (2):
(2)    π(a_t | s_t) = exp(x_{a_t}^T U s_t) / Z
where U is the parameter matrix of the bilinear product and Z is the normalization factor. The perceived utility of information s_t is computed in a recurrent manner, as in Eq. (3):
(3)    s_{t+1} = σ(V x_{a_t} + W s_t)
where V is the document-state transformation matrix that adds the newly perceived utility from the most recently selected document, and W is the state-state transformation matrix that determines how much utility is retained across time steps. Generally speaking, at each time step the utility perceived by the user must take all previously selected documents into account; the later the position, the more complicated the computation. Unfortunately, the execution complexity grows quadratically with M, which makes this approach inefficient and motivates us to reduce the computational complexity.
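The policy of Eq. (2) and the recurrent update of Eq. (3) can be sketched with plain nested lists. The identity matrices and the tanh nonlinearity below are illustrative choices, not MDP-DIV's trained parameters:

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def policy_probs(U, s, candidate_embs):
    """Eq. (2) sketch: p(a | s) = exp(x_a^T U s) / Z over the candidates."""
    Us = matvec(U, s)
    scores = [math.exp(sum(xi * ui for xi, ui in zip(x, Us)))
              for x in candidate_embs]
    Z = sum(scores)                       # normalization factor
    return [sc / Z for sc in scores]

def next_utility(V, W, x_a, s):
    """Eq. (3) sketch: s_{t+1} = sigma(V x_a + W s_t), with sigma = tanh."""
    return [math.tanh(a + b) for a, b in zip(matvec(V, x_a), matvec(W, s))]

I2 = [[1.0, 0.0], [0.0, 1.0]]             # identity stand-ins for U, V, W
probs = policy_probs(I2, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
s_next = next_utility(I2, I2, [1.0, 0.0], [0.0, 1.0])
```

Note that `policy_probs` is called once per selection, touching every remaining candidate, which is exactly the per-step cost the next subsection shrinks.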
Since the complexity of MDP-DIV is closely related to M, it is natural to "shrink" the action space to reduce the complexity. To keep the accuracy from degrading, the shrinking strategy should guarantee that: (i) it can smartly prune redundant (highly similar) actions; (ii) the shrunken action evaluation nearly generalizes over the full action set. For search result diversification, our goal is to return the documents most relevant to the query while simultaneously ensuring their diversity. Therefore, consider the following situation: if documents d_i and d_j are highly alike and both closely relevant to the query, can we just return d_i (or d_j)? The answer is positive, because learning about d_i also informs us about d_j; moreover, to guarantee the diversity of the selected documents, returning both is not a reasonable choice. We therefore propose a nearest neighbor based strategy (MDP-DIV-kNN) to reduce the complexity of MDP-DIV. Its basic idea is to discard the nearest neighbors of the most recently selected action (document) at each time step; the strategy is instantiated in Algorithm 2. Each time we take an action a_t, we simultaneously remove the k nearest neighbors of a_t from the action space, where the neighbors are computed on the document embeddings (all queries and documents are embedded with the doc2vec [8] model) under the Euclidean distance of Eq. (4):
(4)    dist(d_i, d_j) = ||x_i − x_j||_2
The kNN lookup is a more lightweight operation than the action evaluation, although both are linear in the size of the action space. The kNN-based strategy therefore offers three merits: (i) it yields sub-quadratic complexity with respect to the action space; (ii) it avoids the heavy cost of evaluating all actions while retaining generalization over actions; (iii) it directly promotes the diversity of the selected documents.
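A minimal sketch of the pruning step at the core of Algorithm 2, with hypothetical 2-d embeddings standing in for the doc2vec vectors:

```python
import math

def euclidean(x, y):
    """Eq. (4): Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def prune_k_nearest(action_space, embeddings, selected, k):
    """After taking action `selected`, drop it together with its k
    nearest neighbors from the remaining action space."""
    rest = [d for d in action_space if d != selected]
    rest.sort(key=lambda d: euclidean(embeddings[d], embeddings[selected]))
    return rest[k:]          # everything beyond the k closest survives

emb = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [5.0, 5.0], "d": [3.0, 0.0]}
remaining = prune_k_nearest(["a", "b", "c", "d"], emb, selected="a", k=1)
# "b", the single nearest neighbor of "a", is discarded along with "a".
```

Each selection removes k + 1 documents instead of 1, so the action space shrinks roughly (k + 1)× faster across an episode.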
4.2 Pre-trained NTN-DIV Strategy
The other method we propose to speed up the convergence of MDP-DIV uses a pre-trained diversity ranking model. As discussed above, the large action space and data scarcity lead to a low convergence rate. The nearest neighbors strategy reduces the action space at each position by filtering out the nearest neighbors of the most recently selected action (document). This strategy efficiently shrinks the action space and speeds up convergence, but it cannot deal with data scarcity: in the early training phase, once a document is selected we delete its nearest neighbors, yet we cannot be sure that the selected document is relevant to the query or is the right one for the current position. To deal with this problem, we propose the MDP-DIV-NTN method, with two instantiations, MDP-DIV-NTN(D) and MDP-DIV-NTN(E), to improve the performance of MDP-DIV.
The first instantiation uses the pre-trained NTN-DIV model to rank the candidate set first and then lets MDP-DIV take actions within a prefix of the pre-ranked list. As a result, the action space is reduced, since the NTN-DIV model provides accurate candidates with good diversity. MDP-DIV-NTN(D) offers two merits: (i) it directly shrinks the candidate set, i.e., the action space of MDP-DIV; (ii) it straightforwardly removes part of the irrelevant documents (those covering no subtopics). Although MDP-DIV-NTN(D) is effective, it may lose some information because the NTN-DIV model is not perfectly accurate.
The second variant is more precise: we apply the pre-trained NTN-DIV model at each position, i.e., each time an action is taken. Similar to the kNN strategy, we summarize its sampling procedure in Algorithm 3. At each time step of training, once the agent chooses a document, we use the pre-trained NTN-DIV model to find the documents that are novel with respect to the previously selected documents and, at the same time, relevant to the query. At the next time step, the agent only needs to learn on the filtered candidate set. This approach provides further advantages: (i) it precisely shrinks the action space; (ii) it accurately removes the irrelevant documents.
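The per-step filtering of Algorithm 3 can be sketched as re-scoring and truncating the candidate set after every selection. The toy scorer below (relevance minus a subtopic-redundancy penalty, with made-up labels) is a hypothetical stand-in for the pre-trained NTN-DIV model:

```python
def filter_candidates(candidates, ranked, scorer, top_n):
    """Keep only the top_n candidates under a relevance-plus-novelty
    score computed against the already-ranked documents."""
    return sorted(candidates, key=lambda d: scorer(d, ranked), reverse=True)[:top_n]

# Hypothetical per-document subtopic labels and relevance scores.
subtopics = {"d1": {1}, "d2": {1, 2}, "d3": {3}, "d4": set()}
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.6, "d4": 0.1}

def toy_scorer(d, ranked):
    """Stand-in for NTN-DIV: relevance minus a redundancy penalty for
    subtopics already covered by the ranked documents."""
    covered = set().union(*(subtopics[r] for r in ranked)) if ranked else set()
    return relevance[d] - 0.5 * len(subtopics[d] & covered)

# After "d1" is chosen, "d2" is penalized for repeating subtopic 1,
# so the novel "d3" heads the filtered action space.
kept = filter_candidates(["d2", "d3", "d4"], ["d1"], toy_scorer, top_n=2)
```

The agent then samples its next action only from `kept`, which is how the irrelevant tail (here "d4") never reaches the policy at all.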
However, training the NTN-DIV model with the original implementation (https://github.com/sweetalyssum/DiverseNTN) is time-consuming, because it runs sequentially on a CPU. To accelerate the training, we re-implement the model with TensorFlow [1] on an NVIDIA Tesla K80 GPU, since all the tensor products can be computed in parallel. We thereby obtain slightly better performance with less than 30 minutes of training, instead of more than 5 hours for the original CPU version. We also note that NTN-DIV is trained offline, so its GPU implementation brings no convergence improvement for MDP-DIV-NTN itself.
5 Experimental Study
5.1 Datasets and Evaluation Metrics
The dataset, provided by the authors of [23] (the datasets and source code are available at https://github.com/sweetalyssum/RL4SRD), is a combination of four TREC benchmark datasets: the TREC 2009–2012 Web Tracks. Document retrieval is carried out on the ClueWeb09 Category B collection (http://lemurproject.org/clueweb09/), which comprises 50 million English web documents. We note that the large number of parameters in MDP-DIV requires a lot of labeled data to train, which is why the four benchmark datasets are merged. In total, there are 200 queries, each with several subtopics identified by the TREC assessors. Moreover, the documents' relevance labels are made at the subtopic level and are binary, with 0 denoting irrelevant and 1 denoting relevant.
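Because relevance is judged per subtopic, coverage metrics such as the subtopic recall used in the evaluation below reduce to simple set operations over these binary labels; a minimal sketch with a hypothetical ranking:

```python
def s_recall(doc_subtopics, n_subtopics, k):
    """Fraction of a query's n_subtopics covered by at least one of the
    top-k documents; doc_subtopics lists, in rank order, the set of
    subtopics each document is labeled relevant to."""
    top = doc_subtopics[:k]
    covered = set().union(*top) if top else set()
    return len(covered) / n_subtopics

# Hypothetical ranking over a query with 4 subtopics.
ranking = [{1}, {1, 2}, set(), {3}]
r_at_2 = s_recall(ranking, 4, k=2)   # subtopics {1, 2} covered -> 0.5
r_at_4 = s_recall(ranking, 4, k=4)   # subtopics {1, 2, 3} covered -> 0.75
```

The second document adds subtopic 2 but repeats subtopic 1, illustrating why redundancy-penalizing metrics (α-NDCG, ERR-IA) are reported alongside pure coverage.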
We employ three widely used evaluation metrics to assess the diverse ranking models: NDCG [4], subtopic recall [26] (denoted "S-recall"), and ERR-IA [3]. NDCG and ERR-IA adopt the default settings of the official TREC evaluation program (http://trec.nist.gov/data/web/12/ndeval.c); they measure the relevance and diversity of a ranking list by explicitly rewarding diversity and penalizing redundancy at each rank. The parameter α in these two metrics is set to 0.5. The traditional diversity metric S-recall measures the coverage rate of the retrieved subtopics for each query. All measures are computed over the top k search results (k = 5 and k = 10).
5.2 Experimental Setup
All experiments are conducted with 5-fold cross-validation. Since the authors do not provide their split, we randomly re-split the queries into five even subsets. For each fold, three subsets are used for training, one for validation, and the remaining one for testing. Moreover, for a fair comparison, we run each fold five times, and the reported results are averages and standard deviations over the 25 trials in total. All experiments are performed on an Intel Xeon E5 v4 server with an NVIDIA Tesla K80 GPU and over 256 GB of memory.
We compare the proposed methods with the latest state-of-the-art baselines in search result diversification, NTN-DIV [22] and MDP-DIV [23]. We do not compare against conventional models because previous studies have shown their performance to be inferior [22, 23].
NTN-DIV: As mentioned in Section 3.3, this state-of-the-art method computes a ranking score by taking both relevance and novelty into account with a neural tensor network. To speed up training, we implement this method with TensorFlow on a GPU, which is much faster than the original CPU version. The number of tensor slices is 100.
MDP-DIV: As introduced in Section 3.2, this is the latest state-of-the-art MDP-based method. We set the parameters following [23], because the datasets utilized are exactly the same as in [23]. As our methods employ DCG as the reward function, the DCG version of MDP-DIV is adopted for a fair comparison.
MDP-DIV-kNN: The parameter k is set to 10, 20, and 30, denoted as MDP-DIV-kNN(10), MDP-DIV-kNN(20), and MDP-DIV-kNN(30), respectively. The other parameters follow the settings of MDP-DIV.
MDP-DIV-NTN: The number of tensor slices of the pre-trained NTN-DIV model is 100, and the learning rate is 0.009. The pre-ranked list in MDP-DIV-NTN(D) and in MDP-DIV-NTN(E) is set to the same size. Again, the other parameters follow the settings of MDP-DIV.
In the experiments, the query and document vectors are the embeddings generated by the Doc2vec model, trained on all the documents in the Web Track datasets. When training the Doc2vec model, the embedding dimension is set to 100, the learning rate to 0.025, and the window size to 8.
5.3 Results and Analysis
Table 1: Performance comparison on the TREC Web Track datasets (training time in hours).
Method | NDCG@10 | NDCG@5 | S-recall@10 | S-recall@5 | ERR-IA@10 | ERR-IA@5 | time (h)
NTN-DIV (GPU) | 0.4617 | 0.4124 | 0.6205 | 0.5140 | 0.3446 | 0.3186 | 0.5
MDP-DIV | 0.4874 | 0.4480 | 0.6639 | 0.5599 | 0.3697 | 0.3477 | 65
MDP-DIV-kNN(10) | 0.4915 | 0.4462 | 0.6731 | 0.5435 | 0.3725 | 0.3539 | 43
MDP-DIV-kNN(20) | 0.4869 | 0.4461 | 0.6582 | 0.5463 | 0.3723 | 0.3506 | 25
MDP-DIV-kNN(30) | 0.4844 | 0.4464 | 0.6489 | 0.5467 | 0.3721 | 0.3517 | 16
MDP-DIV-NTN(D) | 0.4912 | 0.4470 | 0.6738 | 0.5464 | 0.3727 | 0.3493 | 26
MDP-DIV-NTN(E) | 0.4937 | 0.4485 | 0.6795 | 0.5627 | 0.3735 | 0.3497 | 53
Performance Comparison for Search Result Diversification. Table 1 shows the performance of all methods on the TREC Web Track datasets. From the table, we see that the re-implemented GPU version of NTN-DIV needs only half an hour to train, far less than all the other methods; however, its accuracy is significantly inferior to the other approaches.
Compared to the original MDP-DIV, the proposed MDP-DIV-kNN and MDP-DIV-NTN methods are all faster, with barely degraded or even slightly better accuracy. Among the MDP-DIV-kNN methods, the fastest is MDP-DIV-kNN(30), which discards the most actions (k = 30 neighbors per step) via the nearest neighbor strategy; it takes 16 hours to train, about 4× faster than MDP-DIV (which takes 65 hours). Moreover, MDP-DIV-kNN(10) shows the best accuracy among the three: it is slightly better than the original MDP-DIV, while the other two (MDP-DIV-kNN(20) and MDP-DIV-kNN(30)) are slightly worse. The reasons are two-fold: (i) the nearest neighbors strategy helps produce a more diverse ranking list; (ii) filtering nearest neighbors may also cause information loss, and the larger the k, the more information is lost. The performance is therefore a trade-off between complexity and accuracy.
As for the MDP-DIV-NTN methods, their performance is better than MDP-DIV's. For MDP-DIV-NTN(D), the pre-trained NTN-DIV model provides a pre-ranked list which shrinks the action space and filters out part of the irrelevant documents. For MDP-DIV-NTN(E), at each time step we model the novelty of a candidate document based on both the query and the preceding selected documents, which yields a more accurate pre-ranked list; hence its accuracy is better not only than MDP-DIV's but also than MDP-DIV-NTN(D)'s. However, invoking the GPU at each time step costs extra time, which is why MDP-DIV-NTN(E) (taking 53 hours) does not run as fast as MDP-DIV-NTN(D) (taking 26 hours).
In Figure 1, we report error bars for the compared methods. From the figure, all approaches show relatively consistent standard deviations, indicating that the proposed methods achieve stably better or comparable performance relative to NTN-DIV and MDP-DIV.
Next, we present some results to analyze the efficiency and effectiveness of the proposed methods in detail.
Efficiency Analysis. To analyze efficiency, we draw a shaded-line plot in Figure 2 showing the time cost to reach a given NDCG@10 level, based on 5-fold cross-validation. The horizontal axis is the NDCG@10 performance, and the vertical axis is the time cost to achieve it; each curve shows the average time cost and the shading the standard deviation. From the figure, the proposed MDP-DIV-kNN and MDP-DIV-NTN methods all train faster than the original MDP-DIV. In particular, as k increases, the MDP-DIV-kNN models converge faster; although some final accuracy is sacrificed, it remains relatively acceptable. MDP-DIV-NTN(D) trains faster than the other models until NDCG@10 reaches 0.48; beyond 0.48, further improvement becomes very time-consuming, and after convergence MDP-DIV-NTN(D) performs worse than MDP-DIV-NTN(E), which achieves the best accuracy.
Compared to the original MDP-DIV, to reach an NDCG@10 of 0.48, MDP-DIV-kNN(30) and MDP-DIV-NTN(D) are almost 3 times faster, MDP-DIV-kNN(20) is 1.4 times faster, MDP-DIV-kNN(10) is 0.4 times faster, and MDP-DIV-NTN(E) is 0.54 times faster. From these observations we draw the following conclusions: (i) the proposed MDP-DIV-kNN and MDP-DIV-NTN models fulfill the goal of accelerating the convergence of MDP-DIV without much accuracy sacrifice; (ii) the MDP-DIV-kNN methods converge fast with relatively acceptable accuracy, while the MDP-DIV-NTN methods converge fast and show better accuracy than MDP-DIV.
Effectiveness Analysis. A further improvement comes from accuracy. Figure 3 shows a shaded-line plot of NDCG@10 against the number of iterations. From this figure, we observe that during the first 2000 iterations, MDP-DIV-NTN(E) improves NDCG@10 by up to 0.05 over MDP-DIV; as training goes on, the improvement becomes gentler, and when both methods converge, MDP-DIV-NTN(E) still delivers better performance. In summary: (i) the proposed MDP-DIV-NTN(E) converges faster than the original MDP-DIV; (ii) MDP-DIV-NTN(E) reaches high performance within the first 2000 iterations, and its converged performance is also better. The fast convergence comes from using the offline NTN-DIV model to shrink the search space and filter out part of the irrelevant documents.
6 Conclusion
In this paper, we aim to improve the performance of MDP-DIV by speeding up its convergence without much sacrifice in accuracy. Through analysis, we find that the slow convergence of MDP-DIV is mainly due to two reasons: the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the query-document relevance for the whole candidate set, which results in a huge search space for the MDP; on the other hand, due to data scarcity, the agent has to perform more “trial and error” interactions with the environment. To tackle these problems, we propose the MDP-DIV-kNN and MDP-DIV-NTN methods. Experimental results demonstrate that the two proposed methods indeed accelerate the convergence of MDP-DIV, while the resulting accuracies barely degrade, or even improve.
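The action-space cost mentioned above comes from the sequential selection loop itself: every position re-scores all remaining candidates given the documents already ranked (the MDP state). A generic greedy sketch of that loop is shown below; the toy MMR-style `utility` in the test stands in for the learned policy's scoring function, which is not reproduced here:

```python
def greedy_sequential_rank(candidates, utility, k=10):
    """At each of k positions, evaluate utility(d, ranked) for every
    remaining candidate d and pick the best. The cost is
    O(k * |candidates|) utility evaluations, which is why pruning the
    candidate set directly speeds up training."""
    ranked, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda d: utility(d, ranked))
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With a utility that penalizes redundancy (relevance minus a penalty for topics already covered), the loop interleaves topics rather than ranking purely by relevance.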
Acknowledgement. This research was supported in part by NSFC under Grant Nos. 61602132 and 61572158, and the Shenzhen Science and Technology Program under Grant No. JCYJ20160330163900579.
References
 [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
 [2] Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. pp. 335–336. SIGIR ’98, ACM (1998)
 [3] Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based diversification of web search results: metrics and algorithms. Information Retrieval 14(6), 572–592 (2011)
 [4] Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. pp. 659–666. SIGIR ’08, ACM (2008)
 [5] Dang, V., Croft, W.B.: Diversity by proportionality: an election-based approach to search result diversification. pp. 65–74. SIGIR ’12, ACM (2012)
 [6] Guo, S., Sanner, S.: Probabilistic latent maximal marginal relevance. pp. 833–834. SIGIR ’10, ACM (2010)
 [7] He, J., Hollink, V., de Vries, A.: Combining implicit and explicit topic representations for result diversification. pp. 851–860. SIGIR ’12, ACM (2012)
 [8] Le, Q., Mikolov, T.: Distributed representations of sentences and documents. ICML ’14, vol. 32, pp. 1188–1196. PMLR, Beijing, China (2014)
 [9] Lu, Z., Yang, Q.: Partially observable Markov decision process for recommender systems. arXiv preprint arXiv:1608.07793 (2016)
 [10] Luo, J., Zhang, S., Yang, H.: Win-win search: Dual-agent stochastic game in session search. pp. 587–596. SIGIR ’14, ACM (2014)
 [11] Puterman, M.L.: Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons (2014)
 [12] Radlinski, F., Kleinberg, R., Joachims, T.: Learning diverse rankings with multi-armed bandits. pp. 784–791. ICML ’08, ACM (2008)
 [13] Rafiei, D., Bharat, K., Shukla, A.: Diversifying web search results. pp. 781–790. WWW ’10, ACM (2010)
 [14] Raman, K., Shivaswamy, P., Joachims, T.: Online learning to diversify from implicit feedback. pp. 705–713. SIGKDD ’12, ACM (2012)
 [15] Santos, R.L., Macdonald, C., Ounis, I.: Exploiting query reformulations for web search result diversification. pp. 881–890. WWW ’10, ACM (2010)
 [16] Santos, R.L., Peng, J., Macdonald, C., Ounis, I.: Explicit search result diversification through subqueries. In: ECIR. vol. 5993, pp. 87–99. Springer (2010)
 [17] Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. Journal of Machine Learning Research 6(Sep), 1265–1295 (2005)
 [18] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, vol. 1. MIT press Cambridge (1998)
 [19] Wang, J., Yu, L., Zhang, W., Gong, Y., Xu, Y., Wang, B., Zhang, P., Zhang, D.: IRGAN: A minimax game for unifying generative and discriminative information retrieval models. pp. 515–524. SIGIR ’17, ACM, New York, NY, USA (2017)
 [20] Wei, Z., Xu, J., Lan, Y., Guo, J., Cheng, X.: Reinforcement learning to rank with Markov decision process. pp. 945–948. SIGIR ’17, ACM, New York, NY, USA (2017)
 [21] Xia, L., Xu, J., Lan, Y., Guo, J., Cheng, X.: Learning maximal marginal relevance model via directly optimizing diversity evaluation measures. pp. 113–122. SIGIR ’15, ACM (2015)
 [22] Xia, L., Xu, J., Lan, Y., Guo, J., Cheng, X.: Modeling document novelty with neural tensor network for search result diversification. pp. 395–404. SIGIR ’16, ACM (2016)
 [23] Xia, L., Xu, J., Lan, Y., Guo, J., Zeng, W., Cheng, X.: Adapting markov decision process for search result diversification. pp. 535–544. SIGIR ’17, ACM, New York, NY, USA (2017)
 [24] Xu, J., Xia, L., Lan, Y., Guo, J., Cheng, X.: Directly optimize diversity evaluation measures: A new approach to search result diversification. ACM Transactions on Intelligent Systems and Technology (TIST) 8(3), 41 (2017)
 [25] Yu, H.T., Jatowt, A., Blanco, R., Joho, H., Jose, J., Chen, L., Yuan, F.: A concise integer linear programming formulation for implicit search result diversification. pp. 191–200. WSDM ’17, ACM (2017)
 [26] Zhai, C.X., Cohen, W.W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. pp. 10–17. SIGIR ’03, ACM (2003)
 [27] Zhang, S., Luo, J., Yang, H.: A POMDP model for content-free document re-ranking. pp. 1139–1142. SIGIR ’14, ACM (2014)
 [28] Zhu, Y., Lan, Y., Guo, J., Cheng, X., Niu, S.: Learning for search result diversification. pp. 293–302. SIGIR ’14, ACM (2014)