Impatient Bandits: Optimizing Recommendations for the Long-Term Without Delay

07/19/2023
by Thomas M. McDonald, et al.

Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with improving users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, hurting the rate at which learning happens, whereas measuring short-term proxy rewards reflects the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that takes advantage of this new predictive model. The algorithm quickly learns to identify content aligned with long-term success by carefully balancing exploration and exploitation. We apply our approach to a podcast recommendation problem, where we seek to identify shows that users engage with repeatedly over two months. We empirically validate that our approach results in substantially better performance compared to approaches that either optimize for short-term proxies or wait for the long-term outcome to be fully realized.
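
The two steps described in the abstract can be illustrated with a toy sketch. This is not the authors' model: it assumes a scalar Gaussian belief over each arm's mean long-term reward, a naive linear extrapolation of partial outcomes, and Thompson sampling as the bandit rule. The names `posterior`, `thompson_select`, `HORIZON_DAYS`, and all numeric defaults are hypothetical choices made for illustration.

```python
import random

HORIZON_DAYS = 60  # assumed length of the long-term window (about two months)


def posterior(observations, prior_mean=10.0, prior_var=100.0,
              noise_var=100.0, horizon=HORIZON_DAYS):
    """Gaussian belief over one arm's mean long-term reward.

    observations: list of (partial_sum, days_seen) pairs, one per user.
    Assumption: a partial sum is extrapolated linearly to the full horizon,
    and its noise variance grows as fewer days have been observed.
    The belief is recomputed from the prior each time, so a user whose
    trace grows longer is never counted twice.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for partial_sum, days_seen in observations:
        if days_seen <= 0:
            continue  # nothing observed yet for this user
        days_seen = min(days_seen, horizon)
        estimate = partial_sum * horizon / days_seen
        obs_var = noise_var * horizon / days_seen
        precision += 1.0 / obs_var
        weighted_sum += estimate / obs_var
    return weighted_sum / precision, 1.0 / precision


def thompson_select(arms, rng):
    """Pick the arm whose sampled long-term reward is highest.

    arms: dict mapping arm name to its list of observations.
    """
    def draw(observations):
        mean, var = posterior(observations)
        return rng.gauss(mean, var ** 0.5)

    return max(arms, key=lambda name: draw(arms[name]))


if __name__ == "__main__":
    rng = random.Random(0)
    arms = {
        "show_a": [(12.0, 14), (40.0, 60)],  # one partial, one full trace
        "show_b": [(1.0, 7)],                # only a short partial trace
    }
    print(thompson_select(arms, rng))
```

In this sketch, a user observed for only part of the window still sharpens the belief, but less than a fully observed user does. That mirrors the trade-off the abstract describes between waiting for the full reward and relying on short-term proxies.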

