Three Methods for Training on Bandit Feedback

04/24/2019
by Dmytro Mykhaylov, et al.

There are three quite distinct ways to train a machine learning model on recommender system logs. The first method is to model the reward prediction for each possible recommendation to the user; at scoring time, the best recommendation is then found by computing an argmax over the personalized reward predictions. This method obeys principles such as the conditionality principle and the likelihood principle. A second method is useful when the model does not fit reality and underfits: in this case, we can use the fact that we know the distribution of historical recommendations (concentrated on previously identified good actions, with some exploration) to adjust the errors in the fit so that they are evenly distributed over all actions. Finally, the inverse propensity score can be used to produce an estimate of the decision rule's expected performance. The latter two methods violate the conditionality and likelihood principles but have been shown to perform well in certain settings. In this paper we review the literature around this fundamental, yet often overlooked, choice and run experiments using the RecoGym simulation environment.
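The third method mentioned above, estimating a decision rule's expected reward via the inverse propensity score, can be sketched roughly as follows. This is an illustrative toy setup, not code from the paper: the uniform logging policy, the synthetic reward structure, and the function name `ips_value` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical logged bandit feedback: for each impression we know the action
# taken, the logging policy's propensity for that action, and the reward.
rng = np.random.default_rng(0)
n, k = 1000, 5                                # impressions, possible actions
actions = rng.integers(0, k, size=n)          # uniform logging policy (assumed)
propensities = np.full(n, 1.0 / k)            # P(logged action) under that policy
rewards = (actions == 3).astype(float)        # toy world: action 3 always pays off

def ips_value(target_probs, actions, propensities, rewards):
    """Inverse-propensity-score estimate of a target policy's expected reward.

    target_probs[i, a] is the probability the target policy assigns to
    action a in context i; the importance weight reweights each logged
    reward by target probability over logging propensity.
    """
    weights = target_probs[np.arange(len(actions)), actions] / propensities
    return float(np.mean(weights * rewards))

# Candidate decision rule: always recommend action 3.
target = np.zeros((n, k))
target[:, 3] = 1.0
print(ips_value(target, actions, propensities, rewards))  # close to 1.0 here
```

Because the logging policy explored all actions with nonzero probability, the reweighted average is an unbiased estimate of the candidate policy's value, which is what makes off-policy evaluation from logs possible at all.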


research
09/18/2019

Learning from Bandit Feedback: An Overview of the State-of-the-art

In machine learning we often try to optimise a decision rule that would ...
research
08/10/2022

A Scalable Probabilistic Model for Reward Optimizing Slate Recommendation

We introduce Probabilistic Rank and Reward model (PRR), a scalable proba...
research
02/11/2021

Freudian and Newtonian Recurrent Cell for Sequential Recommendation

A sequential recommender system aims to recommend attractive items to us...
research
07/26/2021

Combining Reward and Rank Signals for Slate Recommendation

We consider the problem of slate recommendation, where the recommender s...
research
08/01/2018

Off-Policy Evaluation and Learning from Logged Bandit Feedback: Error Reduction via Surrogate Policy

When learning from a batch of logged bandit feedback, the discrepancy be...
research
08/28/2020

BLOB : A Probabilistic Model for Recommendation that Combines Organic and Bandit Signals

A common task for recommender systems is to build a profile of the intere...
research
04/18/2018

Highly Relevant Routing Recommendation Systems for Handling Few Data Using MDL Principle

Many classification algorithms existing today suffer in handling many so...
