Optimizing Gross Merchandise Volume via DNN-MAB Dynamic Ranking Paradigm

08/14/2017 ∙ by Yan Yan, et al. ∙ JD.com, Inc.

With the transition from traditional `brick-and-mortar' shopping to online mobile shopping in the Web 2.0 era, the recommender system plays a critical role in E-Commerce and E-Retail. This is especially true when designing such a system for more than 236 million daily active users. The ranking strategy, the key module of the recommender system, needs to be precise, accurate, and responsive in estimating customers' intents. We propose a dynamic ranking paradigm, named DNN-MAB, composed of a pairwise deep neural network (DNN) pre-ranker connected to a revised multi-armed bandit (MAB) dynamic post-ranker. By taking into account explicit and implicit user feedback such as impressions, clicks, and conversions, DNN-MAB adjusts the DNN pre-ranking scores to help customers locate the items they are most interested in, so that they convert quickly and frequently. To the best of our knowledge, frameworks like DNN-MAB have not been discussed in the previous literature for either E-Commerce or machine learning audiences. In practice, DNN-MAB has been deployed to production, where it outperforms other state-of-the-art models by significantly lifting gross merchandise volume (GMV), the objective metric at JD.




1 Introduction

Effectively generating the right list of items for customers is key to the success of E-Commerce websites and applications. How fast and how frequently customers converge (place orders) decides whether an E-Commerce company thrives. The recommender system, an information filtering, rating, and recommending engine that helps users reach, in real time, a small set of items that describe and satisfy their purchasing needs, emerged to solve this challenge.

Founded in 2004, JD has quickly become one of the most popular E-Commerce websites in China. Customers come to JD to discover, browse, and purchase items sold by JD itself as well as by hundreds of thousands of government-certified E-Retailers. Every day, tens of millions of users generate billions of query requests and place tens of millions of orders. At the same time, statistics show that, on average, active users visit only a limited number of items from specific positions such as the front page and the topic-driven recommendation sections in JD's app (Fig.1 presents two types of layouts). Therefore, despite the huge serving load, the number of ranked items actually presented to each user is small, so the recommender system at JD must provide robust, agile, and accurate recommendations to make sure conversions happen frequently in every user's item browsing experience.

The ranking module decides how the final list of items from retrieval should be generated and positioned so that the items that interest customers most are placed first. As a result, the ranking problem is always the central issue of the recommender system. In the E-Commerce recommendation use case, we mine each customer's information, including search history, clicks, orders, etc., to model the customer's purchasing intent across the whole platform.

Traditional ranking modules in recommender systems are not efficient at learning users' intents and behavioral feedback while users browse the ranked items. This is because most ranking results in E-Commerce apps are delivered as a waterfall stream. Traditional ranking approaches usually have difficulty incorporating users' real-time feedback, so the results are either not precise enough or, in extreme cases, no longer legitimate (e.g., the static ranking results still promote items that users just clicked or placed orders on, based on recent behavioral histories). In this paper, we propose an innovative dynamic ranking paradigm. By combining a deep learning model with a multi-armed bandit algorithm, the framework learns customers' real-time feedback and changes the ranking results accordingly, reflecting users' current purchasing intents and improving overall recommender system performance.

Figure 1: two types of mobile page layouts of item rankings

1.1 Contributions

This paper has two primary contributions in research and industry:

  1. an innovative ranking paradigm for solving the dynamic ranking problem by combining the pairwise deep neural network and multi-armed bandit algorithms;

  2. a revised Thompson sampling algorithm with a brand-new initialization strategy for the production use case that enables customers' fast convergence.

1.2 Organization

The rest of this paper is organized as follows: Sec.2 briefly describes the recommender system used at JD and discusses the ranking module in detail, with Sec.2.3 covering the learning-to-rank DNN pre-ranker and Sec.2.4 the MAB Thompson sampling post-ranker; Sec.3 first presents a simple case study that simulates the real world to test how Thompson sampling compares with other popular MAB algorithms, and then reports the proposed DNN-MAB's performance on different metrics; Sec.4 discusses prominent related work on recommender systems, learning-to-rank via deep neural networks, and multi-armed bandit models; we summarize our work and point out future directions in Sec.5.

2 Formulation

2.1 System design

Figure 2: the recommender system flowchart

Our recommender system in Fig.2 includes three main functional modules: the item-retrieval module, the post-retrieval module, and the ranking module. Users typically trigger the recommender system from the mobile application entry point or from our mobile websites. After the system identifies a user, it issues query requests to the user and item databases to fetch profile information such as gender, geo location, purchasing history, and price sensitivity. This information, combined with the items the user has recently clicked or put in carts, serves as the input to the retrieval system. Next, the retrieval system selects a large pool of candidate items related to the input.

The post-retrieval module filters out items that are not suitable for recommendation, including items that users have already purchased, items containing sensitive content, and other disqualified items.

The ranking module compares all candidate items provided by the post-retrieval module and generates the final top-k item sublist. This list should be optimized so that the items users are most interested in are placed at the top positions. Generally this is achieved by sophisticated machine learning models such as Factorization Machines [Rendle2010], pairwise learning-to-rank [Burges et al.2005, Severyn and Moschitti2015], listwise learning-to-rank [Cao et al.2007], etc. In our use case, the ranking module is implemented by a pairwise-loss learning-to-rank DNN and a revised MAB Thompson sampling.

2.2 Input data

We now formally state the ranking problem. Assume each sample is represented as a d-dimensional feature vector x_i coming from a certain category c_i, together with its price p_i. Given a set of items S and a template triggering item q, a subset of items is selected from S and ordered so that the expectation of the sublist's GMV is maximized. Here, the indicator function takes value 1 when the user places an order on the item and 0 otherwise.
2.3 Learning-to-rank DNN pre-ranker

Figure 3: the learning-to-rank DNN pre-ranker

The learning-to-rank DNN model computes a pairwise loss for ranking different items based upon the label information of whether users have ordered or not ordered certain items. The simplified model structure is shown in Fig.3. It is implemented by two mirroring multi-layer DNNs: one input layer that takes sparse item feature vectors, three fully-connected layers that generalize the item features, and one output layer [Basak et al.2007, Cherkassky and Ma2004] that produces the pre-ranker regression scores. Note that both DNNs share the same set of parameters.


The loss function in Eq.3 is inspired by SVM-rank [Elisseeff et al.2001], where the two inputs are the feature vectors of a pair of items, one labeled positive and one labeled negative. Each pair is generated such that exactly one of the two items is from the positive class. The margin is a tuning parameter ensuring that better separability between the two classes is preferred, and the pairwise weighting function emphasizes the losses from pairs of greater value.

In the training phase, positive and negative item pairs are fed into both sides of the DNN model, as shown in Fig.3 (L). Since both DNNs share the same parameters, the learning-to-rank DNN learns from the pair-labeling difference and aims to find a scoring scheme that correctly classifies items with the largest margin.

In the testing phase, each item passes through the predictor (Fig.3 (R)) and is evaluated and scored by the learning-to-rank DNN. The pre-ranker scores then serve as the candidate scores for the MAB post-ranker.

2.4 Multi-armed bandit post-ranker

The reasons for designing the MAB post-ranker are mainly the following three:

  1. the real-time ‘click’ and ‘order’ labels represent the user’s current purchasing intent among many other recent intents, and should be emphasized in rankings;

  2. the real-time ‘no-action’ labels indicate item categories that the user is not currently interested in, which should be de-emphasized in rankings;

  3. users tend to click items under the same category and place orders by comparing them over different attributes.

The first and second reasons are well discussed in [Radlinski et al.2008], which notes that static rankings contain many redundant results. The MAB post-ranker emphasizes items that users are potentially interested in by referencing other clicked items, de-emphasizes items that users intentionally ignore, and meanwhile explores items from different categories to diversify the ranking results.

We follow the problem settings of the contextual multi-armed bandit problem in [Langford and Zhang2008], and define the problem as follows:

Definition 2.4.1 (Contextual bandit in DNN-MAB)

Given K arms and a set of items scored by the learning-to-rank DNN, where each item belongs to exactly one of the K arms, the player pulls one arm at each round t, so that an item from that arm is picked and placed at position t, and the reward at round t is observed. Ideally, we would like to choose the actions so that the total rewards are maximized.

Definition 2.4.2 (Rewards in DNN-MAB)

The expected reward is defined as the total GMV generated from the listed items that users place orders on. In general it translates into the company’s operating revenue:


where the ranked sublist is co-decided by both the DNN and the MAB.

The revised Thompson sampling algorithm is triggered after the learning-to-rank DNN pre-ranker, and we describe it in Algo.1. The main idea of Algo.1 is to take the pre-ranker’s output as the initial static ranking and fine-tune the local orders via users’ online feedback, so that the final ranking reflects the current user purchasing intent. Some important parameters are highlighted as follows: one parameter adjusts the intensity of negative feedback, alleviating the potential issue that most no-action items are treated as negatives; other weights control how much the pre-ranking scores may be changed; per arm, the algorithm tracks the set of items not yet selected, the set of items presented but not clicked by users during post-ranking, and the set of items presented and clicked, together with their cardinalities.

Algorithm 1 post-ranker: DNN-MAB Thompson sampling
  INITIALIZATION:
    for each user, for each arm: initialize the arm’s beta distribution parameters from the user’s recent behaviors
  At round-t MAB ranking:
    for each arm: draw a sample from the arm’s estimated beta distribution and update the adjusted scores of its items
    pick the arm with the largest sample and expose the item with the highest adjusted score in that arm
  FEEDBACK:
    if the exposed item is not clicked: update the β parameter of its arm
    if the exposed item is clicked: update the α parameter of its arm

At round t, DNN-MAB Thompson sampling randomly draws samples based upon the estimated beta distributions, and then selects the arm with the largest sample and, within that arm, the item with the largest adjusted score. If the item is clicked after exposure, the algorithm updates the α parameter of the beta distribution in that arm; otherwise the algorithm updates the β parameter.
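The per-round loop above can be sketched in a few lines of Python. The class name, the default prior parameters, and the `gamma` damping value are illustrative assumptions; in the actual system the per-arm priors come from the personalized initialization and the pre-ranker scores, which are omitted here for brevity.

```python
import random

class RevisedThompsonPostRanker:
    """Maintains one Beta posterior over click probability per arm
    (item category). Prior values and gamma are assumptions, not the
    paper's production settings."""

    def __init__(self, arms, alpha0=1.0, beta0=1.0, gamma=0.5):
        self.alpha = {a: alpha0 for a in arms}
        self.beta = {a: beta0 for a in arms}
        self.gamma = gamma  # damps negative (no-click) feedback

    def pick_arm(self):
        # draw one sample per arm, pick the arm with the largest sample
        samples = {a: random.betavariate(self.alpha[a], self.beta[a])
                   for a in self.alpha}
        return max(samples, key=samples.get)

    def feedback(self, arm, clicked):
        if clicked:
            self.alpha[arm] += 1.0
        else:
            # most un-clicked impressions are only weak negatives,
            # so discount them by gamma instead of a full unit update
            self.beta[arm] += self.gamma
```

Repeatedly calling `pick_arm` and `feedback` concentrates the posteriors on the arms the user actually clicks, which is exactly the dynamic re-ranking behavior described above.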

3 Experimental Evaluation

3.1 Case study

Before discussing the real-system online A/B test, we first walk through a simple case study to evaluate different bandit algorithms’ performance under our use cases. We picked three state-of-the-art bandit algorithms: ε-greedy [Watkins1989], Upper Confidence Bound or UCB [Auer et al.2002], and Thompson sampling. Specifically, we simulate two versions of Thompson sampling: 1. the revised Thompson sampling with the specially designed initialization (revised-Thompson, Algo.1); 2. the normal Thompson sampling (normal-Thompson). Random selection is also performed as a naive baseline.

In our simulation, we design several arms and simply set each item’s reward to 1 if the user clicks and 0 if the user does not. We define the ‘click’ action by presetting a thresholding probability: once an item is selected, we randomly generate another probability from the true (unknown) beta distribution. If the generated probability exceeds the threshold, we assume the ‘click’ action happens; otherwise we assume the customer is not interested in the item selected at this round.

We repeat the simulation many times, each run lasting a large number of rounds. The average performance is shown in Fig.4 & 5. The left subfigures of Fig.4 & 5 show the cumulative gains / regrets, and the right ones are their zoom-ins. As shown, ε-greedy remains sub-optimal in both rewards and regrets, UCB and normal-Thompson perform almost equally well, and revised-Thompson performs best, beating UCB and normal-Thompson with faster convergence. This is because revised-Thompson’s initialization phase personalizes the arms based upon the user information; hence, revised-Thompson converges in fewer steps than the standard MAB algorithms. Random selection unsurprisingly performs the worst among the five. Implementation-wise, revised-Thompson is also straightforward, and the overall system latency remains low (reported in Sec.3.4). With the above arguments, revised-Thompson becomes the choice for our post-ranker module.
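A stripped-down version of this simulation, comparing Thompson sampling against the random baseline on Bernoulli arms, can be written as follows. The arm probabilities, round count, and seed are arbitrary choices for illustration, not the values used in the paper's experiments.

```python
import random

def simulate(policy, probs, rounds=2000, seed=7):
    """Cumulative reward of a bandit policy on Bernoulli-reward arms.
    `policy` is 'thompson' (Beta(1,1) priors) or 'random'."""
    rng = random.Random(seed)
    k = len(probs)
    alpha = [1.0] * k  # Beta posterior parameters per arm
    beta = [1.0] * k
    total = 0
    for _ in range(rounds):
        if policy == 'random':
            arm = rng.randrange(k)
        else:  # Thompson sampling: sample each posterior, take argmax
            samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
            arm = samples.index(max(samples))
        reward = 1 if rng.random() < probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total

# Thompson sampling concentrates on the best arm and collects far more reward
print(simulate('thompson', [0.1, 0.3, 0.8]), simulate('random', [0.1, 0.3, 0.8]))
```

Averaging such runs over many seeds reproduces the qualitative gap between the bandit policies and the random baseline shown in Fig.4 & 5.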

Figure 4: multi-armed bandit algorithms rewards simulations
Figure 5: multi-armed bandit algorithms regrets simulations

3.2 Experiment setup

JD processes billions of requests on a daily basis, so any new model about to launch has to be evaluated on JD’s online testing platform. The platform divides the real traffic into buckets; each test bucket receives a small fraction of the total traffic, and the remainder is held by the control bucket.

We deploy the proposed dynamic ranking paradigm on this platform for seven days, and track the following metrics: GMV, order numbers, and overall (Eq.1) and page-wise (Eq.5) discounted cumulative gains (DCG).


Since the item lists are presented to users page by page and each page contains a handful of items, page-wise DCG is a suitable metric for evaluating how the revised-Thompson module functions in the application and how much of the observed gain should be credited to it (we use in the evaluation).
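For concreteness, the overall and page-wise DCG computations used in the evaluation can be sketched as below; the standard log2 discount is assumed, since the paper's exact formulas (Eq.1, Eq.5) were not preserved in this copy.

```python
import math

def dcg(gains):
    """Standard discounted cumulative gain over one ranked list:
    sum of gain_i / log2(i + 2) for 0-indexed position i."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def pagewise_dcg(gains, page_size):
    """Split a ranked list into consecutive pages and score each page
    separately, mirroring the page-by-page presentation to users."""
    return [dcg(gains[i:i + page_size])
            for i in range(0, len(gains), page_size)]

# a single relevant item at the top of a page gets the undiscounted gain
print(dcg([1.0, 0.0, 0.0]))  # -> 1.0
```

Comparing `pagewise_dcg` values between two rankers, page by page, yields exactly the kind of per-page gains reported in Tab.4 and Fig.6.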

3.3 Production performance

Figure 6: page-wise DCG gain: DNN-MAB vs. DNN-rt-ft

We report the performance of DNN-MAB against DNN-rt-ft as the baseline in Tab.1. DNN-rt-ft utilizes the DNN pre-ranker with both offline user feedback and online browsing and click signals as features, training a near-line model that is better than models taking offline signals only. On average, DNN-MAB’s daily GMV has increased 16.69% over DNN-rt-ft. Performance gains of this magnitude sustained over seven days are considered statistically significant in the real production system. The DNN-MAB paradigm has clearly proved its superiority over the current production DNN-rt-ft ranking strategy. To emphasize the importance of the parameter initialization and the feedback revision in Thompson sampling, we also report DNN + normal-Thompson in Tab.2. Due to space limitations, we do not go into details but simply state our conclusion: DNN + normal-Thompson in general will not beat DNN-rt-ft, because it cannot learn users’ online behaviors quickly enough to improve the ranking quality.

Figure 7: 7-day page-wise DNN-MAB DCG percentage gain

We also report the overall DCG gains in Tab.3 and the page-wise DCG gains in Tab.4. At first glance, DNN-MAB beats the production baseline consistently in terms of both overall and page-wise DCG. Taking a closer look at the page-wise comparison (Fig.6) and the DNN-MAB page-wise gains in percentage (Fig.7), we find that revised-Thompson effectively learns the users’ intent, since it takes each user’s recent behaviors for the personalized initialization and keeps tracking the real-time user browsing signals for the dynamic ranking adjustment. Although the page-wise percentage gain is modest on the first page (+1.47%), the dynamic ranking gains are maximized on the next two pages (+9.96% and +8.90%) and then diminish as users browse more and more pages. In the end, DNN-MAB and DNN-rt-ft end up with similar page-wise performances on the last two pages (+1.54% and +1.34%).

Date Day1 Day2 Day3 Day4
GMV +22.87% +45.45% +20.20% +2.73%
Orders -2.14% -1.57% +5.18% +0.42%
Date Day5 Day6 Day7 Summary
GMV +0.91% +23.15% +1.50% +16.69%
Orders -2.79% +4.19% +2.20% +0.78%
Table 1: GMV and orders gain / loss for DNN-MAB
Date Day1 Day2 Day3 Day4
GMV -12.08% -9.33% -4.74% -3.24%
Orders 0.30% -4.72% -1.34% -0.67%
Date Day5 Day6 Day7 Summary
GMV -18.31% -7.49% -1.43% -8.08%
Orders -10.67% -4.69% -0.81% -3.23%
Table 2: GMV and orders gain / loss for DNN + normal-Thompson

3.4 System specifications

Our current DNN-MAB ranking system is maintained by hundreds of Linux servers (we cannot release the exact number of operating servers due to company confidentiality). The system serves a high average qps (queries per second) with substantially higher peaks, and the overall recommendation end-to-end response latency, including both retrieval and ranking phases, stays within a millisecond-level budget.

Date DNN-MAB DNN-rt-ft Gain
Day1 5.470 5.180 +5.60%
Day2 5.303 4.811 +10.23%
Day3 5.434 5.281 +2.90%
Day4 5.443 5.340 +1.93%
Day5 4.865 4.789 +1.59%
Day6 5.873 5.491 +6.96%
Day7 7.045 6.884 +2.34%
Average 5.633 5.397 +4.37%
Table 3: DCG online A/B test: DNN-MAB vs. DNN-rt-ft
Page DNN-MAB DNN-rt-ft Gain
Page-0 8.164 8.046 +1.47%
Page-1 5.177 4.708 +9.96%
Page-2 4.844 4.448 +8.90%
Page-3 4.602 4.279 +7.55%
Page-4 4.171 3.920 +6.40%
Page-5 4.062 3.798 +6.95%
Page-6 3.957 3.897 +1.54%
Page-7 3.584 3.536 +1.34%
Table 4: page-wise DCG A/B test: DNN-MAB vs. DNN-rt-ft

4 Related Work

4.1 Recommender system

The recommender system is key to the success of E-Commerce websites as well as other indexing service providers, such as Alibaba, Ebay, Google, Baidu, Youtube, etc. Efforts from different parties on how recommender systems should be designed include [Linden et al.2003, Davidson et al.2010, Schafer et al.2001, Sarwar et al.2000]. There are in general two streams of work in recommender system research: content-based approaches and item-based approaches. Item-based approaches represent users and items as a huge user-by-item matrix and focus on learning the underlying relations between items. Item-based approaches such as [Sarwar et al.2001, Rendle et al.2010] have all received enormous success, yet they suffer from issues like cold start, scalability, and plasticity. Content-based approaches treat the problem as a query-indexing problem, which in general scales better and performs well for users who do not have many previous actions on record, but tends to have query generalization issues for users with long behavior histories. [Burke2002, Lops et al.2011] both provide thorough surveys on this topic, and readers should refer to them for in-depth details.

4.2 Learning-to-rank via deep neural networks

Learning-to-rank, which emerged in the late 90s, has always been an interesting research topic in information retrieval. Approaches to this problem can be summarized into two main threads depending on the loss functions they use: pairwise loss and listwise loss. Pairwise approaches formulate ranking as a classification problem: item pairs are generated by picking samples from the positive and negative classes, and the goal of the learning-to-rank model is to correctly categorize the item pairs into the binary classes so that the defined loss is minimized. Research in this thread includes [Freund et al.2003, Cao et al.2006]. Listwise approaches, on the other hand, formulate the ranking problem as a classification problem over permutations; the loss is minimized only if the whole list is perfectly ranked. Successful listwise approaches include ListNet [Cao et al.2007] and RankCosine [Qin et al.2008]. For more in-depth discussion of learning-to-rank, please refer to [Liu and others2009].

With the growing popularity of deep learning, people have started to use different deep learning structures to tackle the learning-to-rank problem. The works most similar to our pre-ranker are perhaps [Severyn and Moschitti2015, Kalchbrenner et al.2014, Kim2014], which utilize convolutional neural networks for ranking texts in natural language processing problems.

4.3 Contextual multi-armed bandit problems

The multi-armed bandit problem has been well studied and discussed in the literature, e.g., [Lai and Robbins1985, Even-Dar et al.2006, Auer et al.2002]. The basic MAB setup is to select items from arms consecutively, with feedback, so that the total expected regret is minimized. Thompson sampling, dating back to [Thompson1933], despite its simplicity, has proved quite efficient in production [Tang et al.2013, Chapelle and Li2011] and is surprisingly straightforward to implement. Other works on Thompson sampling models include [Scott2010].

5 Conclusion

We proposed a dynamic ranking framework called DNN-MAB, composed of a learning-to-rank DNN pre-ranker and a revised-Thompson post-ranker. This effective ranking paradigm takes both user and item information into consideration so that the DNN reaches decent static ranking performance. Meanwhile, by tracking real-time user feedback, the revised Thompson sampling adjusts the pre-rankings, further boosting the objective metrics. To our knowledge, such a ranking framework has not been discussed in previous research. Real production tests show that both GMV and DCG have been significantly improved. However, for the sake of model simplicity, we have not paid much attention in the learning-to-rank DNN pre-ranker to position bias, one important factor affecting ranking performance [Radlinski et al.2008], since bringing in the listwise loss would introduce scalability issues in our use cases. Meanwhile, we have not optimized the proposed paradigm for other user behaviors such as ‘clicks’ and ‘orders’ either (we do observe negative movements in order numbers on several days, reported in Tab.1). That said, how to optimize multiple KPIs simultaneously remains a big challenge. We plan to further improve our ranking models along these research paths in the future.


Acknowledgments

We are thankful to Dali Yang, Huisi Ou, Sulong Xu, Jincheng Wang, Lu Bai, Lixing Bo, as well as anonymous reviewers for their helpful comments. This research has been supported in part by JD Business Growth BU and JD Santa Clara Research Center.


References

  • [Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [Basak et al.2007] Debasish Basak, Srimanta Pal, and Dipak Chandra Patranabis. Support vector regression. Neural Information Processing-Letters and Reviews, 11(10):203–224, 2007.
  • [Burges et al.2005] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.
  • [Burke2002] Robin Burke. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction, 12(4):331–370, 2002.
  • [Cao et al.2006] Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. Adapting ranking svm to document retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193. ACM, 2006.
  • [Cao et al.2007] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pages 129–136. ACM, 2007.
  • [Chapelle and Li2011] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • [Cherkassky and Ma2004] Vladimir Cherkassky and Yunqian Ma. Practical selection of svm parameters and noise estimation for svm regression. Neural networks, 17(1):113–126, 2004.
  • [Davidson et al.2010] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. The youtube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pages 293–296. ACM, 2010.
  • [Elisseeff et al.2001] André Elisseeff, Jason Weston, et al. A kernel method for multi-labelled classification. In NIPS, volume 14, pages 681–687, 2001.
  • [Even-Dar et al.2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun):1079–1105, 2006.
  • [Freund et al.2003] Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of machine learning research, 4(Nov):933–969, 2003.
  • [Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
  • [Kim2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
  • [Lai and Robbins1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • [Langford and Zhang2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
  • [Linden et al.2003] Greg Linden, Brent Smith, and Jeremy York. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 7(1):76–80, 2003.
  • [Liu and others2009] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225–331, 2009.
  • [Lops et al.2011] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011.
  • [Qin et al.2008] Tao Qin, Xu-Dong Zhang, Ming-Feng Tsai, De-Sheng Wang, Tie-Yan Liu, and Hang Li. Query-level loss functions for information retrieval. Information Processing & Management, 44(2):838–855, 2008.
  • [Radlinski et al.2008] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning, pages 784–791. ACM, 2008.
  • [Rendle et al.2010] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web, pages 811–820. ACM, 2010.
  • [Rendle2010] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
  • [Sarwar et al.2000] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM conference on Electronic commerce, pages 158–167. ACM, 2000.
  • [Sarwar et al.2001] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pages 285–295. ACM, 2001.
  • [Schafer et al.2001] J Ben Schafer, Joseph A Konstan, and John Riedl. E-commerce recommendation applications. In Applications of Data Mining to Electronic Commerce, pages 115–153. Springer, 2001.
  • [Scott2010] Steven L Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
  • [Severyn and Moschitti2015] Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 373–382. ACM, 2015.
  • [Tang et al.2013] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1587–1594. ACM, 2013.
  • [Thompson1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [Watkins1989] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.