Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation

09/17/2021, by Haruka Kiyohara, et al.

In recommender systems (RecSys) and real-time bidding (RTB) for online advertisements, we often try to optimize sequential decision making using bandit and reinforcement learning (RL) techniques. In these applications, offline reinforcement learning (offline RL) and off-policy evaluation (OPE) are beneficial because they enable safe policy optimization using only logged data, without any risky online interaction. In this position paper, we explore the potential of using simulation to accelerate practical research of offline RL and OPE, particularly in RecSys and RTB. Specifically, we discuss how simulation can help us conduct empirical research of offline RL and OPE, and we take the position that we should effectively use simulations in this empirical research. To refute the counterclaim that experiments using only real-world data are preferable, we first point out the underlying risks and reproducibility issues of real-world experiments. Then, we describe how these issues can be addressed with simulations. Moreover, we show how to incorporate the benefits of both real-world and simulation-based experiments to defend our position. Finally, we also present an open challenge regarding public simulation platforms, which needs to be resolved to further facilitate practical research of offline RL and OPE in RecSys and RTB. As a possible solution to this issue, we describe our ongoing open source project and its potential use case. We believe that building and utilizing simulation-based evaluation platforms for offline RL and OPE will be of great interest and relevance to the RecSys and RTB community.


1. Introduction

In recommender systems (RecSys) and real-time bidding (RTB) for online advertisements, we often use sequential decision making algorithms to increase sales or to enhance user satisfaction. For this purpose, interactive bandit and reinforcement learning (RL) are considered powerful tools. The RecSys/RTB research communities have studied many applications of bandit and RL and demonstrated their effectiveness in a wide variety of settings (Zhao et al., 2019; Zhao et al., 2021; Zhao et al., 2018c, 2017, b; Ie et al., 2019; Cai et al., 2017; Wu et al., 2018; Zhao et al., 2018a; Jin et al., 2018; Hao et al., 2020; Zou et al., 2019). However, deploying RL policies in real-world systems is often difficult because it requires risky online interactions. Specifically, when we use an adaptive policy and learn it in the real environment, a large amount of exploration is needed before the policy acquires near-optimal decision making (Levine et al., 2020). Such non-optimal exploration is harmful because it may damage sales or user satisfaction (Xiao and Wang, 2021). Moreover, we often use online A/B testing to evaluate how well a policy works in the real environment. However, this also involves high stakes, because an untested new policy may perform poorly on the system (Gilotte et al., 2018). Therefore, online deployment of RL policies is often limited due to these risk concerns, despite the potential benefits of a successful deployment.

Emerging paradigms such as offline reinforcement learning (offline RL) and off-policy evaluation (OPE) try to tackle these issues in a data-driven manner (Levine et al., 2020). In offline RL and OPE, we aim to learn and evaluate a new policy using only previously logged data, without any risky online interaction. The major benefit of offline RL and OPE is that we can obtain a new policy that is likely to perform well in a completely safe manner, by 1) learning a new policy using only the logged data (offline RL), and 2) estimating the policy performance using the logged data to guarantee safety in deployment (OPE). The potential to reduce the risks of deploying RL policies is attracting researchers' interest. There are many works on offline RL (Levine et al., 2020; Fujimoto et al., 2019b; Kumar et al., 2019, 2020; Fu et al., 2020; Gulcehre et al., 2020; Agarwal et al., 2020; Argenson and Dulac-Arnold, 2021; Yu et al., 2020; Kidambi et al., 2020; Paine et al., 2020; Gulcehre et al., 2021; Yang and Nachum, 2021; Chen et al., 2019b; Yu et al., 2021) and OPE (Beygelzimer and Langford, 2009; Precup et al., 2000; Strehl et al., 2010; Dudík et al., 2014; Wang et al., 2017; Su et al., 2020; Jiang and Li, 2016; Thomas and Brunskill, 2016; Saito et al., 2020; Voloshin et al., 2019; Fu et al., 2021; Le et al., 2019), as well as on their applicability in RecSys practice (Gilotte et al., 2018; Gruson et al., 2019; Santana et al., 2020; Xiao and Wang, 2021; Ma et al., 2020; Chen et al., 2019a; Rohde et al., 2018; Mazoure et al., 2021; Saito et al., 2021).

Discussion topic. In this paper, we discuss how simulation studies can help accelerate offline RL/OPE research, especially in RecSys/RTB. In particular, we focus on the roles of simulations in the evaluation of offline RL/OPE, because empirical research is essential for researchers to compare offline RL/OPE methods and analyze their failure cases, which in turn leads to new and challenging research directions (Fu et al., 2021; Saito et al., 2020; Voloshin et al., 2019). Moreover, validating the performance of offline RL policies and the accuracy of OPE estimators is crucial to ensure their applicability in real-life situations (Fu et al., 2021).

Our position. We take the position that we should effectively use simulations for the evaluation of offline RL and OPE. Against the counterposition that only real-world data should be used in experiments, we first show how difficult it is to conduct comprehensive and reproducible experiments with real-world data alone. Then, we demonstrate the advantages of simulation-based experiments and how real-world and simulation-based experiments are both important, from different perspectives. Finally, by presenting our ongoing open source project and its expected use case, we show how a simulation platform can assist future offline RL/OPE research in RecSys/RTB.

2. Preliminaries

In (general) RL, we have $T$ total timesteps to optimize our decision making (the special case of $T=1$ is called the contextual bandit problem). At every timestep $t$, the decision maker first observes a state $s_t \in \mathcal{S}$ and decides which action $a_t \in \mathcal{A}$ to take according to the policy $\pi(a_t \mid s_t)$. Then, the decision maker receives a reward $r_t \sim P_r(r_t \mid s_t, a_t)$ and observes the state transition $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t)$, where $P_r$ and $\mathcal{T}$ are unknown probability distributions. For example, in a RecSys setting, $(s_t, a_t, r_t)$ can be user features, an item that the system recommends to the user, and the user's click indicator, respectively. Here, the objective of RL is to obtain a policy that maximizes the following policy performance (i.e., expected total reward): $V(\pi) := \mathbb{E}_{\tau \sim p_\pi(\tau)} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]$, where $\gamma \in (0, 1]$ is a discount factor and $\mathbb{E}_{\tau \sim p_\pi(\tau)}[\cdot]$ is the expectation over the trajectory distribution $p_\pi(\tau)$ induced by $\pi$.

Let us suppose there is a logged dataset $\mathcal{D}$ collected by a behavior policy $\pi_b$ as follows: $\mathcal{D} := \{ \{ (s_t^{(i)}, a_t^{(i)}, r_t^{(i)}) \}_{t=1}^{T} \}_{i=1}^{n}$, where the dataset consists of $n$ trajectories. In offline RL, we aim to learn a new policy $\pi$ that maximizes the policy performance $V(\pi)$ using only $\mathcal{D}$. In OPE, the goal is to evaluate, or estimate, the policy performance of a new (evaluation) policy $\pi_e$ using an OPE estimator $\hat{V}$ and $\mathcal{D}$ as $\hat{V}(\pi_e; \mathcal{D}) \approx V(\pi_e)$. To succeed in offline RL/OPE, it is essential to address the distribution shift between the new policy $\pi_e$ and the behavior policy $\pi_b$. Therefore, various algorithms and estimators have been proposed for that purpose (Agarwal et al., 2020; Levine et al., 2020; Fujimoto et al., 2019b; Kumar et al., 2019, 2020; Beygelzimer and Langford, 2009; Argenson and Dulac-Arnold, 2021; Yu et al., 2020; Kidambi et al., 2020; Paine et al., 2020; Gulcehre et al., 2021; Yang and Nachum, 2021; Chen et al., 2019b; Yu et al., 2021; Precup et al., 2000; Strehl et al., 2010; Dudík et al., 2014; Wang et al., 2017; Su et al., 2020; Jiang and Li, 2016; Thomas and Brunskill, 2016; Le et al., 2019). To evaluate and compare these methods in empirical research, we need access to both the logged dataset $\mathcal{D}$ and the ground-truth policy performance $V(\pi_e)$ of the evaluation policy $\pi_e$. In the following sections, we discuss how we can obtain $\mathcal{D}$ and $V(\pi_e)$ to conduct experiments of offline RL/OPE using both real-world and simulation-based synthetic data.
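To make this setup concrete, the following is a minimal sketch (in Python, not taken from the paper) of one standard OPE estimator, trajectory-wise importance sampling, applied to a logged dataset of the form defined above; the data representation and function names are our own illustrative choices.

```python
import numpy as np

def trajectory_wise_is(logged_dataset, evaluation_policy, gamma=1.0):
    """Trajectory-wise importance sampling (IS) estimate of V(pi_e).

    `logged_dataset` is a list of trajectories, where each trajectory is a
    list of (state, action, reward, pi_b_prob) tuples logged under the
    behavior policy pi_b, with pi_b_prob = pi_b(a_t | s_t).
    `evaluation_policy(state, action)` returns pi_e(a | s).
    """
    estimates = []
    for trajectory in logged_dataset:
        importance_weight = 1.0
        discounted_return = 0.0
        for t, (state, action, reward, pi_b_prob) in enumerate(trajectory):
            importance_weight *= evaluation_policy(state, action) / pi_b_prob
            discounted_return += (gamma ** t) * reward
        estimates.append(importance_weight * discounted_return)
    return float(np.mean(estimates))
```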

3. Is the use of real-world data sufficient to facilitate offline RL?

In this section, we discuss the advantages and drawbacks of the counterclaim: only real-world data should be used in experiments of offline RL and OPE.

We can implement a real-world offline RL/OPE experiment by running (at least) two different policies in the real-world environment. First, a behavior policy $\pi_b$ collects a logged dataset $\mathcal{D}$. Then, for the evaluation of offline RL/OPE, we need to approximate $V(\pi_e)$ by an on-policy estimate of the policy performance, i.e., the average discounted return observed when deploying the evaluation policy $\pi_e$ in the online environment. The advantage of real-world experiments compared to simulation is that they are informative, in the sense that the experimental results are expected to generalize to real-world applications (Saito et al., 2020).
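For reference, the on-policy estimate of $V(\pi_e)$ used in such online experiments is simply a Monte Carlo average of observed returns. A minimal sketch, assuming a gym-style environment interface (an assumption of ours, not part of the paper), is:

```python
import numpy as np

def on_policy_estimate(env, evaluation_policy, n_trajectories=1000, gamma=1.0):
    """Approximate V(pi_e) by running pi_e and averaging discounted returns.

    `env` is assumed to expose gym-style reset()/step() methods, and
    `evaluation_policy(state)` samples an action.
    """
    returns = []
    for _ in range(n_trajectories):
        state = env.reset()
        done, discount, total_return = False, 1.0, 0.0
        while not done:
            action = evaluation_policy(state)
            state, reward, done, _ = env.step(action)
            total_return += discount * reward
            discount *= gamma
        returns.append(total_return)
    return float(np.mean(returns))
```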

However, there are two critical drawbacks to empirical studies using only real-world data. The first issue is the risky data collection process and the resulting limits on how comprehensive the experiments can be. Real-world experiments always necessitate a high-cost data collection process, because online interactions can be harmful until the performance of the data collection policies ($\pi_b$ and $\pi_e$) is guaranteed (Gilotte et al., 2018). Therefore, it is difficult to deploy a variety of policies due to this risk concern, and the empirical findings available from real-world experiments are often limited. For example, when evaluating offline RL algorithms, we often want to know how well the algorithms learn from different logged datasets, such as one collected by a sub-optimal policy (Fu et al., 2020; Gulcehre et al., 2020). However, deploying such a sub-optimal behavior policy is often demanding because it may damage sales or user satisfaction (Levine et al., 2020; Qin et al., 2021). Moreover, in the evaluation of OPE estimators, researchers are often curious about how the divergence between the behavior and evaluation policies affects the accuracy of the performance estimation (Voloshin et al., 2019; Saito et al., 2021). Nonetheless, deploying such largely different evaluation policies is challenging, as there is huge uncertainty in their performance (Levine et al., 2020).

The second issue is the lack of reproducibility. Due to confidentiality concerns and data collection costs in RecSys/RTB practice, there is only one public real-world dataset for OPE research (Open Bandit Dataset (Saito et al., 2020)). It is also difficult to release a public real-world dataset for offline RL, because evaluating a new policy requires access to the environment (Fu et al., 2020; Gulcehre et al., 2020). Therefore, conducting a reliable and comprehensive experiment is extremely difficult using only real-world data, which we argue is a bottleneck of current offline RL/OPE research in RecSys/RTB practice.

4. How can simulations accelerate offline RL research?

In this section, we describe how simulations can help evaluate offline RL/OPE methods together with real-world data.

An alternative way to conduct experiments is to build a simulation platform and use it as a substitute for the real environment. Specifically, we can first deploy the behavior policy $\pi_b$ to the simulation environment and obtain a synthetic dataset $\mathcal{D}$. Then, we can calculate the ground-truth policy performance $V(\pi_e)$, or approximate it by on-policy estimation when the ground-truth calculation is computationally intensive. The important point here is that the whole experimental procedure does not require any risky online interaction in the real environment.
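A minimal sketch of this simulation-based procedure is shown below; it reuses `on_policy_estimate` from the earlier sketch, and all helper and interface names (e.g., `behavior_policy.sample_with_prob`) are illustrative assumptions rather than part of any existing package.

```python
def collect_logged_data(sim_env, behavior_policy, n_trajectories):
    """Roll out pi_b in the simulator and log (s, a, r, pi_b(a|s)) tuples."""
    dataset = []
    for _ in range(n_trajectories):
        state, done, trajectory = sim_env.reset(), False, []
        while not done:
            # assumed interface: returns a sampled action and its probability
            action, action_prob = behavior_policy.sample_with_prob(state)
            next_state, reward, done, _ = sim_env.step(action)
            trajectory.append((state, action, reward, action_prob))
            state = next_state
        dataset.append(trajectory)
    return dataset

def run_simulation_experiment(sim_env, behavior_policy, evaluation_policy,
                              ope_estimator, n_logged=10000, n_rollouts=10000):
    """Compare an OPE estimate against the simulator's (approximate)
    ground-truth V(pi_e); no real-world interaction is involved."""
    logged_dataset = collect_logged_data(sim_env, behavior_policy, n_logged)
    ground_truth = on_policy_estimate(sim_env, evaluation_policy, n_rollouts)
    ope_estimate = ope_estimator(logged_dataset, evaluation_policy)
    return ope_estimate, ground_truth
```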

Since policy deployment in the simulation platform is always safe, we can gain abundant findings from simulation research (Fujimoto et al., 2019a; Voloshin et al., 2019; Fu et al., 2021; Xi et al., 2021). For example, in the evaluation of offline RL, we can easily deploy a sub-optimal behavior policy, which is often difficult in real-world experiments (Fu et al., 2020; Gulcehre et al., 2020). Moreover, we can also analyze the learning process of offline RL by deploying a new policy several times at different training checkpoints, which is challenging in real-world experiments due to risks and deployment costs (Matsushima et al., 2021). In addition, we can test how well an OPE estimator identifies evaluation policies that perform poorly, which is crucial to avoid failures in practical scenarios (McInerney et al., 2020).
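As an illustration of how such an analysis might be scored, the sketch below compares an OPE estimator's estimates against ground-truth values over a set of candidate policies, using rank correlation and a simple regret metric; this is one common style of evaluation in the benchmarking papers cited above, with the details chosen by us.

```python
import numpy as np
from scipy.stats import spearmanr

def score_ope_estimator(ope_estimates, ground_truth_values):
    """Score an OPE estimator over several candidate evaluation policies.

    Both arguments are arrays aligned by candidate policy.  Rank correlation
    measures whether the estimator orders the policies correctly; regret@1
    measures how much performance is lost by deploying the policy the
    estimator ranks best instead of the truly best one.
    """
    ope_estimates = np.asarray(ope_estimates, dtype=float)
    ground_truth_values = np.asarray(ground_truth_values, dtype=float)
    rank_correlation, _pvalue = spearmanr(ope_estimates, ground_truth_values)
    regret_at_1 = ground_truth_values.max() - ground_truth_values[ope_estimates.argmax()]
    return {"rank_correlation": float(rank_correlation), "regret@1": float(regret_at_1)}
```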

Furthermore, we can also tackle the reproducibility issue of real-world experiments by making simulation platforms publicly available. Using an open-access simulation platform, researchers can easily reproduce experimental results, which leads to a reliable comparison of existing works (Fu et al., 2020; Gulcehre et al., 2020). Therefore, simulation-based experiments are beneficial in enabling reproducible, comprehensive studies of offline RL/OPE.

Although simulation-based empirical research overcomes the drawbacks of real-world experiments, it should also be noted that simulation-based experiments suffer from a simulation gap (Zhao et al., 2020). Specifically, to model the real environment, we need function approximations of the unknown probability distributions (i.e., $P_r$ and $\mathcal{T}$). Unfortunately, this introduces an inevitable modeling bias, which may lead to less informative results.

However, since real-world and simulation-based experiments have different advantages, we can leverage both for different purposes, as shown in Table 1 (see also the workflow sketch after the table). Specifically, we can first conduct comprehensive simulation-based experiments to see how configuration changes affect the performance of offline RL/OPE methods, and thereby discuss both the advantages and the limitations of these methods in a reproducible manner. We can then verify whether offline RL policies and OPE estimators work in a real-life scenario using real-world experiments with limited online interactions. Here, by performing preliminary experiments on a simulation platform and removing policies that are likely to perform poorly in advance, we can implement the real-world experiments in a less risky manner. Thus, we argue that we should effectively use simulations in the empirical research of offline RL and OPE.

experiment | reality | safety | reproducibility | usage
real-world | ✓ | ✗ | ✗ | performance verification in the real world
simulation-based | ✗ (simulation gap) | ✓ | ✓ | comprehensive study
Table 1. Comparison of the advantages and usage of real-world and simulation-based experiments
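To illustrate the combined workflow suggested above (a sketch under our own assumed helper names, not a prescribed pipeline), one can first screen candidate policies safely in the simulator and only deploy the survivors to a limited real-world test; `real_ab_test` below stands for whatever online evaluation routine is available and is purely hypothetical, and `on_policy_estimate` is reused from the earlier sketch.

```python
def screen_then_deploy(sim_env, candidate_policies, safety_threshold, real_ab_test):
    """Filter out policies that are likely to perform poorly using safe
    simulation-based estimates, then verify only the remaining candidates
    with a limited real-world A/B test."""
    survivors = []
    for policy in candidate_policies:
        # safe: this rollout happens entirely inside the simulator
        simulated_value = on_policy_estimate(sim_env, policy)
        if simulated_value >= safety_threshold:
            survivors.append(policy)
    # only the pre-screened policies incur real-world deployment risk
    return [(policy, real_ab_test(policy)) for policy in survivors]
```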

5. Towards practical research of offline RL in RecSys and RTB

In this section, we discuss how we can further accelerate offline RL/OPE research in RecSys/RTB practice.

The benefits of simulation-based experiments have indeed pushed offline RL/OPE research forward. Specifically, many research papers (Kumar et al., 2020; Argenson and Dulac-Arnold, 2021; Yu et al., 2020; Kidambi et al., 2020; Paine et al., 2020; Gulcehre et al., 2021; Yang and Nachum, 2021; Chen et al., 2019b; Yu et al., 2021; Fu et al., 2021) have been published using a variety of simulated control tasks and their standardized synthetic datasets collected by diverse policies (Fu et al., 2020; Gulcehre et al., 2020). Moreover, simulation-based benchmark experiments play an important role in helping researchers discuss both the advantages and the limitations of existing offline RL and OPE methods (Fujimoto et al., 2019a; Xi et al., 2021; Voloshin et al., 2019; Fu et al., 2021).

Practical applications, however, are still limited, especially for offline RL (e.g., Qin et al., 2021; Zhan et al., 2021; Xiao and Wang, 2021; Ma et al., 2020; Chen et al., 2019a; Mazoure et al., 2021). We attribute this to the lack of application-specific simulation environments that provide useful insights for specific research questions. For example, RecSys/RTB are unique in their huge action spaces and their highly stochastic, delayed rewards (Ma et al., 2020; Cai et al., 2017; Zou et al., 2019). Therefore, we need to build simulation platforms that imitate such specific characteristics to better understand the empirical performance of offline RL/OPE methods in these particular situations.
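For instance, the highly stochastic and delayed rewards mentioned above could be captured in a simulator roughly as in the sketch below (an illustrative model of our own, not the reward model of any existing platform): a won impression yields a conversion reward only with small probability, and only after a random delay.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

def sample_delayed_reward(won_impression, conversion_rate=0.002,
                          value_per_conversion=1.0, max_delay=100):
    """Return (reward, delay_in_timesteps) for a single auction outcome.

    Rewards are sparse and stochastic (most impressions yield nothing) and,
    when they do occur, arrive only after a random delay -- two
    characteristics an RTB simulator needs to reproduce.
    """
    if not won_impression or rng.random() > conversion_rate:
        return 0.0, 0
    delay = int(rng.integers(1, max_delay + 1))
    return value_per_conversion, delay
```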

In the RecSys setting, there are two dominant simulation platforms well-designed for offline RL/OPE research: OpenBanditPipeline (OBP) (Saito et al., 2020) and RecoGym (Rohde et al., 2018). They are both beneficial in enabling simulation-based experiments in a fully offline manner. Moreover, OBP is helpful in practice because it provides streamlined implementations of the experimental procedure and modules to preprocess real-world data. However, their limitation is that they are unable to handle RL policies; specifically, OBP handles only contextual bandits, and RecoGym combines contextual bandits with non-Markov interactions called organic sessions. Therefore, with these currently available packages, it is difficult to evaluate offline RL and OPE methods for RL policies, which are the most relevant to real-world sequential decision making (Xiao and Wang, 2021). Moreover, there is no such simulation platform for RTB. There is thus a need to build a simulation-based evaluation platform for offline RL and OPE in RecSys/RTB settings.

Figure 1. Overview of our simulation platform and its workflow

Motivated by the above necessity, we are developing an open-source simulation platform for the RTB setting. Our design principle is to provide an easy-to-use platform for users. Below, we present an expected use case and describe how to utilize our platform in offline RL/OPE empirical research.

We aim to support both simulation-based and real-world experiments, as described in Section 4. The solid arrows in Figure 1 show the workflow of comprehensive simulation-based experiments on our platform. The key feature of the platform is that it leaves design choices to researchers, such as which behavior and evaluation policies to use and which offline RL and OPE methods to test. Moreover, researchers can also customize the environmental configurations of the simulation platform, such as the action space $\mathcal{A}$ and the total timesteps $T$, to see how configuration changes affect the performance of offline RL/OPE. After a detailed investigation in simulation-based experiments, we can also verify whether the offline RL/OPE methods work in real-life scenarios with a limited number of online interactions. Our platform also provides streamlined implementations and data processing modules to assist real-world experiments, as shown by the dotted arrows in Figure 1. In addition, the platform allows researchers to identify a safe policy in advance using our semi-synthetic simulation, which replicates the real environment based on an original real-world dataset. The results of such a semi-synthetic simulation may help reduce the risks in real-world experiments.

Finally, since we plan to release the platform publicly, the research community can engage in our project to make the simulation platform a more diverse and more practically relevant benchmark. Moreover, we plan to extend our platform to the RecSys setting. These additional efforts will allow researchers to easily get involved in the empirical research of offline RL/OPE in RecSys/RTB.

References