Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow

07/01/2021
by Wenjun Zeng, et al.

In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Unlike the general sequential decision making process, these use cases have a simpler flow: upon seeing the recommended content on each page, customers can only return feedback by moving forward in the process or dropping out of it before reaching a termination state. We refer to this type of problem as sequential decision making in linear-flow. We propose to formulate the problem as an MDP with Bandits, where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions analytically through exact dynamic programming. This formulation allows us to leverage TS's efficiency in balancing exploration and exploitation and the Bandit's convenience in modeling actions' incompatibility. In a simulation study, we observe that the proposed MDP with Bandits algorithm outperforms Q-learning with ϵ-greedy and decreasing ϵ, independent Bandits, and interaction Bandits. We also find that the proposed algorithm's performance is the most robust to changes in the strength of across-page interdependence.
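As a rough illustration of the idea described in the abstract, the following is a minimal sketch (not the authors' implementation) of Thompson sampling combined with exact dynamic programming over a linear flow. It assumes independent Beta posteriors over the per-page, per-action "move forward" probability; the array shapes, priors, and the simple product-form value function are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pages, n_actions = 3, 4
# Beta(alpha, beta) posterior for each (page, action) pair's
# probability that the customer moves forward after seeing the content.
alpha = np.ones((n_pages, n_actions))
beta = np.ones((n_pages, n_actions))

def recommend():
    """Thompson sampling + exact DP: pick one action per page."""
    # Sample transition (move-forward) probabilities from the posterior.
    p = rng.beta(alpha, beta)            # shape (n_pages, n_actions)
    # DP backwards: the value of an action at page i is its sampled
    # move-forward probability times the value of the remaining flow.
    value = 1.0                          # reward for completing the flow
    plan = [0] * n_pages
    for i in range(n_pages - 1, -1, -1):
        q = p[i] * value                 # expected value of each action
        plan[i] = int(np.argmax(q))
        value = q[plan[i]]
    return plan

def update(page, action, moved_forward):
    """Conjugate Beta update after observing one page's feedback."""
    if moved_forward:
        alpha[page, action] += 1
    else:
        beta[page, action] += 1

plan = recommend()                       # one recommended action per page
```

In a real system, `recommend` would be called once per customer session and `update` once per observed page transition; modeling across-page interdependence (as the paper's interaction Bandits comparison suggests) would require a richer posterior than the independent Beta priors used here.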


