
Generative Inverse Deep Reinforcement Learning for Online Recommendation

Deep reinforcement learning enables an agent to capture users' interests dynamically through interactions with the environment, and it has attracted great interest in recommendation research. Deep reinforcement learning uses a reward function to learn users' interests and to control the learning process. However, most reward functions are manually designed; they are either unrealistic or too imprecise to reflect the high variety, dimensionality, and non-linearity of the recommendation problem. This makes it difficult for the agent to learn an optimal policy that generates the most satisfactory recommendations. To address this issue, we propose a novel generative inverse reinforcement learning approach, namely InvRec, which automatically extracts the reward function from users' behaviors for online recommendation. We conduct experiments on an online platform, VirtualTB, and compare with several state-of-the-art methods to demonstrate the feasibility and effectiveness of our proposed approach.




Introduction

Deep reinforcement learning (DRL) is promising for recommendation systems, given its ability to learn optimal strategies from interactions and generate recommendations that best fit users' dynamic preferences. DRL-based recommendation systems fall into three categories: deep Q-learning based methods, policy gradient based methods, and hybrid methods. Deep Q-learning aims to find the best step by maximizing a Q-value over all possible actions. Zheng et al. (2018) first introduced DRL into recommendation systems for news recommendation; then, Chen et al. (2018) introduced a robust Q-learning method to handle dynamic environments for online recommendation. However, Q-learning based methods suffer from the agent stuck problem: Q-learning requires a maximization over the action space, which becomes infeasible when the action space is extremely large. Policy gradient based methods can mitigate the agent stuck problem Chen et al. (2019a). Such methods use the average reward as the guideline, yet they may treat bad actions as good actions, making the algorithm hard to converge Pan et al. (2019). In comparison, hybrid methods combine the advantages of Q-learning and policy gradient. As a popular algorithm among hybrid methods, the actor-critic network Konda and Tsitsiklis (2000) adopts policy gradient on an actor network and Q-learning on a critic network to achieve a Nash equilibrium between the two networks. To date, actor-critic networks have been widely applied to DRL-based recommendation systems Chen et al. (2019b); Liu et al. (2020); Zhao et al. (2020); Chen et al. (2020).

Despite their differences, all existing DRL-based methods rely on well-designed, context-dependent reward functions, as shown in the general workflow of a DRL-based recommendation system in Fig 1 (a). However, in many cases, the reward function cannot be easily defined, due to dynamic environments and the various factors that affect users' interests Shang et al. (2019). Thus, the existing methods may suffer from limited generalization ability. Besides, they generally take into account both the user's preference and the user's actions (e.g., click-through, rating or implicit feedback) in the reward function, under the assumptions that the reward is determined by the currently chosen item and that the user's action is unaffected by the recommended items. Such assumptions, however, no longer hold for online recommendation Shang et al. (2019). Another challenge is that using reinforcement learning to find the recommendation policy from scratch might be time-consuming Finn et al. (2016); Levine and Koltun (2012). In a recommendation problem, the state space (i.e., all candidate items in which users might be interested) and the action space (i.e., all the actions for candidate items) might be huge; traditional reinforcement learning based methods iterate over all the possible combinations to figure out the best policy, which is arduous.

Figure 1: (a) General workflow of current reinforcement learning based recommendation systems, which rely on a reward function to guide the agent. (b) The inverse reinforcement learning based approach, which does not require a reward function for guidance. Red indicates how each approach updates the policy.

To address the above challenges, we aim to enable the agent to infer a reward function from users' behaviors via inverse reinforcement learning and to learn the recommendation policy directly from user behaviors in an efficient and adaptive way, as shown in Fig 1 (b). To this end, we propose a generative inverse reinforcement learning approach for adaptively inferring an implicit reward function from user behaviors. Specifically, the approach measures the performance of the current recommendation policy and updates the current policy toward the expert policy through the discriminator, thus alleviating the need to define a reward function for online recommendation. In particular, we transform inverse deep reinforcement learning (DRL) into a generator to augment a diverse set of state-action pairs. Under this generative strategy, our method can achieve better generalization ability under complex online recommendation conditions. In a nutshell, we make the following contributions:

  • We propose generative inverse reinforcement learning to automatically learn the reward function for online recommendation. To the best of our knowledge, this is the first work to decouple the reward function and the agent for online recommendation.

  • We design a novel actor-discriminator network module that takes a discriminator as the critic network and a novel actor-critic network as the actor network, to implement the proposed framework. The module is model-free and can be easily generalized to a variety of scenarios.

  • We conduct experiments on a virtual online platform, VirtualTB, to demonstrate the feasibility and effectiveness of the proposed approach. Our proposed method achieves a higher click-through rate than several state-of-the-art methods.

Problem Formulation

Online Recommendation. Online recommendation differs from offline recommendation in that it deals with real-time interactions between users and the recommendation system. The system needs to analyse users' behavior and update the recommendation policy dynamically. The objective is to find a solution that best reflects those interactions and apply it to the recommendation policy.

Reinforcement Learning-based Recommendation

Reinforcement learning based recommendation systems learn from interactions through a Markov Decision Process (MDP). Given a recommendation problem consisting of a set of users $\mathcal{U}$, a set of items $\mathcal{I}$, and users' demographic information $\mathcal{D}$, the MDP can be represented as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ denotes the state space, which is the combination of the subsets of $\mathcal{I}$; $\mathcal{A}$ is the action space, which represents the agent's selection during recommendation based on the state space $\mathcal{S}$; $\mathcal{P}$ is the set of transition probabilities for state transfer based on the action received; $\mathcal{R}$ is a set of rewards received from users, which are used to evaluate the action taken by the recommendation system, with each reward being a binary value that indicates the user's click; and $\gamma$ is a discount factor for balancing the future reward and the current reward.

Given a user $u \in \mathcal{U}$ and the state $s \in \mathcal{S}$ observed by the agent (or the recommendation system), which includes a subset of the item set $\mathcal{I}$ and the user's demographic information $d \in \mathcal{D}$, a typical recommendation iteration for user $u$ goes as follows. First, the agent takes an action $a \in \mathcal{A}$ based on the recommendation policy $\pi$ under the observed state $s$ and receives the corresponding reward $r \in \mathcal{R}$. Then, the agent generates a new policy $\pi'$ based on the received reward and determines the new state $s'$ based on the probability distribution $P(s' \mid s, a) \in \mathcal{P}$. The cumulative reward after $k$ iterations is as follows:

$$R_k = \sum_{t=0}^{k} \gamma^{t} r_t$$
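To make the discounted cumulative reward concrete, the following is a minimal Python sketch. It assumes the standard discounted sum of per-step rewards; the function name `discounted_return` and the sample click sequence are illustrative and not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for t = 0 .. len(rewards) - 1."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's scheme).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical binary click rewards over four recommendation steps.
clicks = [1, 0, 1, 1]
print(discounted_return(clicks, gamma=0.9))
```

A smaller discount factor makes the agent weigh immediate clicks more heavily than later ones.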

Inverse Reinforcement Learning-based Online Recommendation. We propose an inverse reinforcement learning approach to online recommendation that does not require a predefined reward function. We aim to optimize the current policy to make recommendations that are most suitable for the user.

Figure 2: The proposed framework. The actor-critic network generates the policy $\pi$ based on the current state $s$. The discriminator takes the state-action pair $(s, a)$ and the expert state-action pair $(s_E, a_E)$ as input. The goal of the actor-critic network is to generate a policy $\pi$ that the discriminator classifies as an expert policy, whereas the discriminator aims to distinguish the policy generated by the actor-critic network from the expert policy. The input of GAE is the value from the critic network and the output of the discriminator, the details of which are presented later. The circled part marks one update episode on the agent.

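Specifically, we model online recommendation as an MDP with a finite state set $\mathcal{S}$, a set of actions $\mathcal{A}$, transition probabilities $\mathcal{P}$, and a discount factor $\gamma$. Suppose there exists an expert policy $\pi_E$ that can master any state $s \in \mathcal{S}$. The recommendation then turns into the optimization problem of finding the policy $\pi$ that best approximates the expert policy $\pi_E$ across the cost function class $\mathcal{C}$, by the following objective function Abbeel and Ng (2004):

$$\min_{\pi} \max_{c \in \mathcal{C}} \; \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{1}$$

The cost function class $\mathcal{C}$ is restricted to convex sets defined by the linear combination of a few basis functions $\{f_1, f_2, \dots, f_k\}$. Hence, the corresponding feature vector for the state-action pair $(s, a)$ can be represented as $f(s,a) = [f_1(s,a), f_2(s,a), \dots, f_k(s,a)]$. The expectation $\mathbb{E}_{\pi}[c(s,a)]$ is defined as (on the $\gamma$-discounted infinite horizon):

$$\mathbb{E}_{\pi}[c(s,a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right] \tag{2}$$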
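As an illustration of the discounted expectation with a linear cost, the sketch below estimates the discounted feature expectation from sampled trajectories and then evaluates a linear cost on it. It assumes finite trajectories (a truncation of the infinite horizon) and a cost of the form $c(s,a) = \sum_i w_i f_i(s,a)$; the names `discounted_feature_expectation`, `expected_cost`, and `toy_features` are hypothetical.

```python
def discounted_feature_expectation(trajectories, feature_fn, gamma):
    """Monte Carlo estimate of E[sum_t gamma**t * f(s_t, a_t)].

    trajectories: non-empty list of trajectories, each a list of (state, action).
    feature_fn:   maps (state, action) to a list of basis-function values.
    """
    totals = None
    for trajectory in trajectories:
        for t, (state, action) in enumerate(trajectory):
            features = feature_fn(state, action)
            if totals is None:
                totals = [0.0] * len(features)
            for i, value in enumerate(features):
                totals[i] += (gamma ** t) * value
    return [value / len(trajectories) for value in totals]


def expected_cost(weights, feature_expectation):
    """Linear cost: the expectation of c is the weighted sum of feature expectations."""
    return sum(w * f for w, f in zip(weights, feature_expectation))


def toy_features(state, action):
    # Hypothetical basis functions: a match indicator and a constant bias.
    return [1.0 if state == action else 0.0, 1.0]


policy_trajs = [[(0, 0), (1, 0)], [(1, 1), (0, 0)]]
mu = discounted_feature_expectation(policy_trajs, toy_features, gamma=0.5)
print(mu, expected_cost([0.6, 0.8], mu))
```

Because the cost is linear in the features, comparing two policies under any cost in the class reduces to comparing their feature expectations, which is what feature expectation matching exploits.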



The overall structure of our proposed approach is illustrated in Fig 2. It consists of three main components: policy approximation, policy generation, and the discriminative actor-critic network. Policy approximation provides the theoretical approach to approximating the learned recommendation policy $\pi$ with the expert policy $\pi_E$; policy generation increases the diversity of the recommendation policy; and the discriminative actor-critic network constitutes the main structure of our approach. Besides the above components, we present the optimization method and the corresponding training algorithm, which limits the update step in optimizing the loss function to ensure that a new policy achieves better performance than the old one.

Policy Approximation

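The policy approximation component aims to make the learned recommendation policy $\pi$ and the expert policy $\pi_E$ as similar as possible. To this end, we look for a cost function $c \in \mathcal{C}$ that delivers the best expert policy among all the policies on the latent space. According to Eq.(1), the cost function class $\mathcal{C}$ consists of convex sets, which have a linear format Abbeel and Ng (2004) and a convex format Syed et al. (2008), respectively:

$$\mathcal{C}_{l} = \Big\{ \sum_i w_i f_i : \|w\|_2 \le 1 \Big\} \tag{3}$$

$$\mathcal{C}_{c} = \Big\{ \sum_i w_i f_i : \sum_i w_i = 1, \; w_i \ge 0 \;\; \forall i \Big\} \tag{4}$$

The corresponding objective functions are as follows:

$$\min_{\pi} \big\| \mathbb{E}_{\pi}[f(s,a)] - \mathbb{E}_{\pi_E}[f(s,a)] \big\|_2 \tag{5}$$

$$\min_{\pi} \max_{i} \; \mathbb{E}_{\pi}[f_i(s,a)] - \mathbb{E}_{\pi_E}[f_i(s,a)] \tag{6}$$

In particular, Eq.(5) minimizes the distance between the state-action pairs, known as feature expectation matching Abbeel and Ng (2004). Eq.(6) minimizes the worst-case excess cost among the functions $f_i$ Syed and Schapire (2008). An issue with such methods is the ambiguity in Eq.(1): many candidate policies can approximate the expert if we only compare the features Ziebart et al. (2008). We resolve the ambiguity by introducing the following $\gamma$-discounted causal entropy Bloem and Bambos (2014) into Eq.(1):

$$H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(a \mid s)] \tag{7}$$

We thereby rewrite Eq.(1) into

$$\min_{\pi} \; -H(\pi) + \max_{c \in \mathcal{C}} \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{8}$$

and define the reinforcement learning process according to Ziebart et al. (2008):

$$RL(c) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \tag{9}$$

Suppose $\Pi$ is the policy set. We define the loss function $c$ such that the expert policy has a lower cost and other policies have a higher cost. Referring to Eq.(8), we define the maximum causal entropy inverse reinforcement learning Ziebart et al. (2010) as follows:

$$IRL(\pi_E) = \arg\max_{c \in \mathcal{C}} \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \tag{10}$$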


Policy Generation

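We regard policy generation as the problem of matching two occupancy measures and solve it by training a Generative Adversarial Network (GAN) Goodfellow et al. (2014). The occupancy measure $\rho_{\pi}$ can be defined as:

$$\rho_{\pi}(s,a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi) \tag{11}$$

Since the generator aims to generate a policy as similar to the expert policy as possible, we use GAIL Ho and Ermon (2016) to bridge inverse reinforcement learning and GAN by making an analogy from occupancy matching to distribution matching. Specifically, we introduce a GA regularizer to restrict the entropy function:

$$\psi_{GA}(c) = \begin{cases} \mathbb{E}_{\pi_E}[g(c(s,a))] & \text{if } c < 0 \\ +\infty & \text{otherwise} \end{cases} \tag{12}$$

where $g(x)$ is defined as:

$$g(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases} \tag{13}$$

By introducing the GA regularizer, we can directly measure the difference between the policy $\pi$ and the expert policy $\pi_E$ without needing to know the reward function. We use the loss function from the discriminator $D$ as $c$ in Eq.(10). The convex conjugate of the regularizer,

$$\psi_{GA}^{*}(\rho_{\pi} - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] \tag{14}$$

represents the negative log loss of the binary classification that distinguishes the policies $\pi$ and $\pi_E$ via state-action pairs. The optimum of Eq.(14) is equivalent, up to a constant shift, to the Jensen-Shannon divergence Nguyen et al. (2009):

$$D_{JS}(\rho_{\pi}, \rho_{\pi_E}) \triangleq D_{KL}\Big(\rho_{\pi} \,\Big\|\, \frac{\rho_{\pi} + \rho_{\pi_E}}{2}\Big) + D_{KL}\Big(\rho_{\pi_E} \,\Big\|\, \frac{\rho_{\pi} + \rho_{\pi_E}}{2}\Big) \tag{15}$$

Finally, we obtain the inverse reinforcement learning definition by substituting the GA regularizer into Eq.(8):

$$\min_{\pi} \; \psi_{GA}^{*}(\rho_{\pi} - \rho_{\pi_E}) - \lambda H(\pi) = D_{JS}(\rho_{\pi}, \rho_{\pi_E}) - \lambda H(\pi) \tag{16}$$

where $\lambda$ is a factor with $\lambda \ge 0$. Note that Eq.(16) has the same goal as the GAN, i.e., finding the squared metric between distributions. More specifically, we have the following equivalence for Eq.(16):

$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \tag{17}$$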
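The link between the discriminator's optimal classification loss and the Jensen-Shannon divergence can be checked numerically. The sketch below is a toy illustration only: it assumes three state-action pairs, strictly positive occupancy values normalized to sum to one, and the GAIL sign convention of Ho and Ermon (2016) in which the discriminator maximizes $\mathbb{E}_{\pi}[\log D] + \mathbb{E}_{\pi_E}[\log(1-D)]$. All function names and numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def optimal_discriminator(rho_policy, rho_expert):
    """Pointwise maximizer D*(s, a) = rho_pi / (rho_pi + rho_E)."""
    return [p / (p + q) for p, q in zip(rho_policy, rho_expert)]


def discriminator_objective(d, rho_policy, rho_expert):
    """E_pi[log D] + E_piE[log(1 - D)] over a finite set of state-action pairs."""
    return sum(p * math.log(di) + q * math.log(1.0 - di)
               for di, p, q in zip(d, rho_policy, rho_expert))


# Toy normalized occupancy measures over three state-action pairs.
rho_policy = [0.7, 0.2, 0.1]
rho_expert = [0.2, 0.3, 0.5]
mixture = [(p + q) / 2 for p, q in zip(rho_policy, rho_expert)]

d_star = optimal_discriminator(rho_policy, rho_expert)
best_value = discriminator_objective(d_star, rho_policy, rho_expert)
# Sum of the two KL terms (twice the usual, 1/2-weighted JS divergence).
kl_sum = kl_divergence(rho_policy, mixture) + kl_divergence(rho_expert, mixture)

# The optimum equals the KL sum shifted by the constant -2 log 2.
print(best_value, kl_sum - 2 * math.log(2))
```

The two printed numbers agree, which is the "up to a constant shift" equivalence: minimizing the discriminator's best achievable objective over policies is the same as minimizing the divergence between the two occupancy measures.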


Discriminative Actor-Critic Network

The discriminative actor-critic network aims to map online recommendation into an inverse reinforcement learning framework.

Specifically, we take the advantage actor-critic network, a variant of actor-critic, to constitute the main structure of our approach. Within this network, the actor uses the policy gradient to update the policy, and the critic uses Q-learning to evaluate the policy and provide feedback Konda and Tsitsiklis (2000).

Given the user's profile at timestamp $t$ (i.e., the item list $I_t$) and the optional demographic information $d$ (which is used to generate the state $s_t$), the environment embeds the user's recent interest and the user's features into the latent space via a neural network Chen et al. (2020); Liu et al. (2020). Once the actor network gets the state $s_t$ from the environment, it feeds the state to a network with two fully-connected layers with ReLU as the activation function. The final layer outputs the target policy function $\pi_{\theta}$ parameterized by $\theta$, which will be updated together with the discriminator $D$. Then, the critic network takes the input from the actor network with the current policy $\pi_{\theta}$, which can be used for sampling to get the trajectory $(s_t, a_t)$. We concatenate the state-action pair and feed it into two fully-connected layers with ReLU as the activation function. The output of the critic network is a value $V$, which is used to calculate the advantage $A$, a value used for optimization (to be discussed later).

As mentioned above, the discriminator $D$ is the key component of our approach. To build an end-to-end model and better approximate the expert policy $\pi_E$, we parameterize the policy as $\pi_{\theta}$ and clip the output of the discriminator so that $D_w : \mathcal{S} \times \mathcal{A} \to (0, 1)$ with weight $w$. The loss function of $D_w$ is denoted by $\mathcal{L}_D$. Besides, we use Adam Kingma and Ba (2014) to optimize the weight $w$ (the optimization for $\theta$ will be introduced later). Here, the discriminator can be treated as a local cost function that guides the policy update. Specifically, the policy will move toward expert-like regions (divided by $D_w$) in the latent space by minimizing the loss function $\mathcal{L}_D$, i.e., finding a point for Eq.(17) such that the equation output is minimal.
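The clipping and the classification loss of the discriminator can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: it follows the GAIL sign convention of Ho and Ermon (2016), where the discriminator output is pushed toward 1 on policy-generated pairs and toward 0 on expert pairs, and the clipping bound `eps` is a hypothetical choice.

```python
import math


def clip_probability(value, eps=1e-6):
    """Keep a discriminator output strictly inside (0, 1) so the logs stay finite."""
    return max(eps, min(1.0 - eps, value))


def discriminator_loss(policy_outputs, expert_outputs, eps=1e-6):
    """Negative of E_pi[log D] + E_piE[log(1 - D)], estimated on two batches.

    policy_outputs: D(s, a) evaluated on pairs sampled from the current policy.
    expert_outputs: D(s, a) evaluated on pairs sampled from the expert policy.
    """
    policy_term = sum(math.log(clip_probability(d, eps))
                      for d in policy_outputs) / len(policy_outputs)
    expert_term = sum(math.log(1.0 - clip_probability(d, eps))
                      for d in expert_outputs) / len(expert_outputs)
    return -(policy_term + expert_term)


# A discriminator that separates the two batches has a lower loss
# than one that outputs 0.5 everywhere.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```

In a neural implementation the same quantity would be minimized with respect to the discriminator weights by a gradient-based optimizer such as Adam, while the policy is updated to make its pairs harder to tell apart from the expert's.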

Policy Optimization

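We use the actor-critic network as a policy network to be trained jointly with the discriminator. Therefore, the actor-critic network needs to update the policy parameter $\theta$ based on the discriminator. During this process, we limit the agent's step size to ensure that the new policy is better than the old one. Specifically, we use trust region policy optimization (TRPO) Schulman et al. (2015a) to update the policy parameter $\theta$ and formulate the TRPO problem as follows:

$$\max_{\theta} \; \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t)\big)\right] \le \delta \tag{18}$$

where $A_t$ is the advantage function calculated by Generalized Advantage Estimation (GAE) Schulman et al. (2015b) below:

$$A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l} \, \delta^{V}_{t+l}, \qquad \delta^{V}_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \tag{19}$$

where the reward $r_{t+l}$ is the $l$-step's test reward at timestamp $t$, and $\lambda$ here denotes the GAE control parameter (distinct from the factor in Eq.(16)). The reward has two components: the reward returned by the environment and the bonus reward calculated from the discriminator output $D_w(s_t, a_t)$. Considering the massive computation load of updating the policy with TRPO via optimizing Eq.(18), we use Proximal Policy Optimization (PPO) Schulman et al. (2017) with the objective function below to update the policy:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t, \; \mathrm{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, 1 - \epsilon, 1 + \epsilon\right) A_t\right)\right] \tag{20}$$

where $\epsilon$ is the clipping parameter, which represents the maximum percentage of change that can be updated at a time.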
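The two quantities used in this update, the GAE advantage and the PPO clipped surrogate, can be illustrated with a short sketch. It assumes the standard definitions from Schulman et al. (2015b) and Schulman et al. (2017), a finite trajectory (so the infinite sum is truncated at the trajectory end), and a list of critic values with one extra bootstrap entry; the function names are hypothetical.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one finite trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one more entry than rewards (bootstrap value last).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals with weight gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratios, advantages, epsilon):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)


advantages = gae_advantages([1, 0, 1], [0.5, 0.4, 0.3, 0.0], gamma=0.99, lam=0.97)
print(advantages)
print(ppo_clip_objective([1.5, 0.5], [1.0, -1.0], epsilon=0.2))
```

With `lam = 0` the estimate reduces to the one-step residual (low variance, higher bias), while `lam = 1` recovers the discounted return minus the value baseline (low bias, higher variance). The clipped objective removes any incentive to move the probability ratio outside the interval `[1 - epsilon, 1 + epsilon]`.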

The training procedure involves two components: the discriminator and the actor-critic network. The training algorithm is illustrated in Algorithm 1. Specifically, for the discriminator, we use Adam as the optimizer and follow the gradient of Eq.(17) with respect to the weight $w$:

$$\mathbb{E}_{\tau_i}\big[\nabla_w \log D_w(s,a)\big] + \mathbb{E}_{\tau_E}\big[\nabla_w \log(1 - D_w(s,a))\big] \tag{21}$$

Algorithm 1: Training algorithm for our model
input : Expert policy $\pi_E$, current state $s$
Sample expert trajectories $\tau_E \sim \pi_E$;
Initialize discriminator parameter $w$;
Initialize policy parameter $\theta$;
Initialize clipping parameter $\epsilon$;
for each iteration $i$ do
      Sample trajectories $\tau_i \sim \pi_{\theta}$;
      Update the parameter $w$ by the gradient in Eq.(21);
      for each PPO epoch do
            Get the trajectories on policy $\pi_{\theta}$;
            Estimate the advantage $A_t$ using Eq.(19);
            Compute the policy update by taking a step of minibatch SGD (via Adam)
      end for
end for


We report an experimental evaluation of our model on a real-world online retail environment, VirtualTB Shi et al. (2019), on OpenAI Gym.

Figure 3: Experimental results, where (a) is the CTR with 95% confidence interval, and (b) is the average reward received at each step after 1500 iterations with 95% confidence interval. For comparison, we have added the performance of the expert policy to (a). The results reported in (c) are for the ablation study.

Virtual TaoBao

VirtualTB is a dynamic environment for testing the feasibility of a recommendation agent. It enables a customized agent to interact with it and receive the corresponding rewards. On VirtualTB, each customer has 11 static attributes as the demographic information, which are encoded into an 88-dimensional space with binary values. The customers have multiple dynamic interests that are encoded into a 3-dimensional space and may change over the interaction process. Each item has several attributes, e.g., price and sales volume, which are encoded into a 27-dimensional space.

Baseline methods

  • IRecGAN Bai et al. (2019): An online recommendation method that employs reinforcement learning and GAN.

  • PGCR Pan et al. (2019): A policy gradient based method for contextual recommendation.

  • GAUM Chen et al. (2019c): A deep Q-learning based method that employs GAN and cascade Q-learning for recommendation.

  • KGRL Chen et al. (2020): An actor-critic based method for interactive recommendation, a variant of online recommendation.

Note that GAUM and PGCR are not designed for online recommendation, and KGRL requires a knowledge graph as side information, which is unavailable in the gym environment. Hence, we only keep the network structures and put those networks into the VirtualTB platform for testing.

Evaluation Metric and Experimental Environment

The experiments are conducted in the OpenAI Gym environment, where the reward can be readily obtained for each episode. Since each episode may have a different number of steps, it is difficult to determine when users will end the session. For this reason, we choose the click-through rate (CTR) to represent the performance, which is defined as:

$$CTR = \frac{r_{episode}}{10 \times N_{step}}$$

where $10$ means that the user is interested in all 10 items recommended in a single page, $r_{episode}$ is the reward received per episode, and $N_{step}$ is the number of steps included in one episode.
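A minimal sketch of this metric follows. It assumes, as the description of the terms suggests, that CTR is the episode reward divided by the maximum reward attainable in the episode, namely 10 clicks per page times the number of steps; the function name and the sample numbers are hypothetical.

```python
def click_through_rate(episode_reward, num_steps, items_per_page=10):
    """CTR = episode reward / (items per page * number of steps in the episode)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return episode_reward / (items_per_page * num_steps)


# Hypothetical episode: 32 clicks collected over 5 pages of 10 items each.
print(click_through_rate(episode_reward=32, num_steps=5))
```

Normalizing by the episode length makes episodes with different numbers of steps comparable.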

The model is implemented using PyTorch Paszke et al. (2019), and the experiments are carried out on a server with two 12-core/24-thread Intel(R) Xeon(R) CPU E5-2697 v2 CPUs, 6 NVIDIA TITAN X Pascal GPUs, and 2 NVIDIA TITAN RTX GPUs, with a total of 768 GiB of memory.

Expert Policy Acquisition

In this part, we introduce the strategy for acquiring the expert policy for VirtualTB. There is no official expert policy in VirtualTB, and it is unrealistic to manually create an expert policy for this virtual environment, where the source data is not available. Hence, we follow a strategy similar to that of Gao et al. (2018) and generate a set of expert policies from a pre-trained expert policy network $\pi_E$. We design an actor-critic network with the same actor and critic network structure as our model, but without the advantage. The critic network of $\pi_E$ is used to calculate the $Q$-value by adopting deep Q-learning. We adopt Deep Deterministic Policy Gradients (DDPG) Lillicrap et al. (2015) to train $\pi_E$.

PPO ε \ GAE λ   0.94            0.95            0.96            0.97            0.98            0.99
0.05            0.630 ± 0.063   0.632 ± 0.064   0.633 ± 0.062   0.630 ± 0.059   0.626 ± 0.060   0.629 ± 0.059
0.10            0.632 ± 0.062   0.635 ± 0.060   0.636 ± 0.061   0.636 ± 0.058   0.634 ± 0.061   0.633 ± 0.060
0.15            0.633 ± 0.060   0.635 ± 0.061   0.639 ± 0.061   0.640 ± 0.057   0.639 ± 0.059   0.638 ± 0.061
0.20            0.634 ± 0.060   0.636 ± 0.060   0.641 ± 0.063   0.643 ± 0.061   0.643 ± 0.063   0.641 ± 0.058
0.25            0.631 ± 0.061   0.635 ± 0.059   0.636 ± 0.060   0.637 ± 0.060   0.636 ± 0.061   0.634 ± 0.059
0.30            0.630 ± 0.059   0.631 ± 0.061   0.632 ± 0.060   0.630 ± 0.059   0.630 ± 0.058   0.629 ± 0.050
Table 1: CTR for different parameter settings of GAE (columns, $\lambda$) and PPO (rows, clipping parameter $\epsilon$), reported as mean ± 95% confidence interval

Hyper Parameters Setting

For the policy network $\pi_E$, we set the DDPG parameters as: , size of hidden layer is , the size of replay buffer is and the number of episodes is set to . For the Ornstein-Uhlenbeck noise, scale is , . For our approach, the number of episodes is set to , hidden size of the advantage actor-critic network is , hidden size for the discriminator is , learning rate is , factor is , mini batch size is and the epoch of PPO is . For the generalized advantage estimation, we set the discount factor to , and . For a fair comparison, all the baseline methods are trained under the same conditions. For easy recognition, we set one iteration as 100 episodes.


Full results are reported in Fig 3. Our approach generally outperforms all four state-of-the-art methods. Specifically, our method gets the best result among all the baseline methods after iterations, which demonstrates the feasibility of the proposed approach. A possible reason for the poor performance of KGRL is that KGRL maintains a local knowledge graph inside the model and actively interacts with the environment. Because the experiments are conducted on an online platform that does not provide the side information KGRL needs to generate its knowledge graph, KGRL performs worse than the other baseline methods.

Impact of Key Parameters

We are interested in how the control parameter $\lambda$ of GAE and the clipping parameter $\epsilon$ of PPO affect the performance. These two key parameters significantly affect the generalized advantage estimation and the proximal policy optimization process. The parameter $\lambda$ is used to make a compromise between bias and variance and is normally selected from $[0.94, 0.99]$ with step 0.01. The clipping parameter $\epsilon$ determines the percentage that needs to be clipped and is normally kept small to control the optimization step size. The results are reported in Table 1. For fairness, we report the CTR at iteration 2000. Observe that when $\lambda = 0.97$ and $\epsilon = 0.20$, the model achieves the best result, which is $0.643 \pm 0.061$ ($\lambda = 0.98$ yields the same mean with a slightly wider interval). These two values are also used as our default setting. More details about the model can be found in the Supplementary Materials.

Ablation Study

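In this part, we investigate the effect of the GAE. We use two different optimization strategies to optimize the proposed model, namely DDPG and the Adaptive KL Penalty Coefficient. The Adaptive KL Penalty Coefficient is a simplified version of PPO, which can be defined as:

$$L^{KLPEN}(\theta) = \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t - \beta \, D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t)\big)\right]$$

The update rule for $\beta$, with $d = \mathbb{E}_t\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_{\theta}(\cdot \mid s_t)\big)\right]$, is:

$$\beta \leftarrow \begin{cases} \beta / b & \text{if } d < d_{targ} / a \\ \beta \times b & \text{if } d > d_{targ} \times a \\ \beta & \text{otherwise} \end{cases}$$

The parameters $a$, $b$ and $d_{targ}$ are determined by experiments; the selection process is reported in the Supplementary Materials due to the space limit. The result of the ablation study can be found in Fig 3 (c).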
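The adaptive update of the KL penalty coefficient can be sketched in a few lines. This follows the rule described by Schulman et al. (2017); the default thresholds `a = 1.5` and `b = 2.0` are the heuristic values from that paper and are used here only as placeholders, since the values selected for this model are reported in the Supplementary Materials. The function name is hypothetical.

```python
def update_kl_coefficient(beta, kl, kl_target, a=1.5, b=2.0):
    """Adapt the KL penalty coefficient after one policy update.

    beta:      current penalty coefficient.
    kl:        measured mean KL divergence between the old and the new policy.
    kl_target: desired KL divergence per update (d_targ).
    """
    if kl < kl_target / a:
        return beta / b   # The update was too timid: weaken the penalty.
    if kl > kl_target * a:
        return beta * b   # The update was too aggressive: strengthen the penalty.
    return beta           # Within tolerance: keep the coefficient.


beta = 1.0
for measured_kl in [0.001, 0.05, 0.01]:
    beta = update_kl_coefficient(beta, measured_kl, kl_target=0.01)
    print(beta)
```

Unlike the clipped objective, this variant keeps the step size in check through a soft penalty whose strength is tuned from the observed divergence.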


This study provides a new approach to reinforcement learning based online recommendation that does not require a predefined reward function. As a result, our work can be applied in various real-world recommendation scenarios where the reward function is hard to define manually or is highly domain-dependent. The proposed method offers fundamental support for inverse reinforcement learning based recommendation systems. Given a few user behaviors, the proposed method can extract an adaptive, unknown reward function and automatically find the optimal strategies for generating recommendations that best fit users' interests. Our empirical evaluation testifies to its competitive performance against existing state-of-the-art reinforcement learning based methods. Our model has implications for, and may accelerate, the application of reinforcement learning in practice where a complex environment exists.

Related Work

We briefly review previous studies related to deep reinforcement learning (DRL)-based recommendation. All those methods are based on an MDP or a partially observable MDP (POMDP). POMDP-based methods can be further divided into three categories: value function estimation Hauskrecht (1997), policy optimization Poupart and Boutilier (2005), and stochastic sampling Kearns et al. (2002). Due to the high computational and representational complexity of POMDP-based methods, MDP-based methods are relatively more popular in academia.

MDP-based DRL methods for recommendation can be grouped into three approaches: deep Q-learning based, policy gradient based, and actor-critic based methods. Zheng et al. (2018) adopt deep Q-learning for news recommendation by using the user's historical record as the state. Zou et al. (2020) improve the structure of deep Q-learning to achieve more robust results. Pan et al. (2019) apply the policy gradient to learn the optimal recommendation strategies. Wang et al. (2020) introduce the knowledge graph into the policy gradient for sequential recommendation. However, Q-learning may get stuck because of the max operation, and policy gradient requires large-scale data to speed up convergence and only updates once per episode. Hence, actor-critic uses the Q-value to conduct the policy gradient per step instead of per episode. Zhao et al. (2017) adopt actor-critic methods to conduct list-wise recommendation in a simulated environment. Chen et al. (2020); Zhao et al. (2020) utilize the knowledge graph as side information embedded into the state-action space to increase the model's capability on the actor-critic network. In addition, Liu et al. (2020) propose to produce recommendations by learning a state-action embedding within the DRL framework.

Furthermore, Chen et al. (2019c) integrate the generative adversarial network with the reinforcement learning structure to generate users' attributes, so that more side information is available to boost the performance of reinforcement learning based recommendation systems. Shang et al. (2019) propose a multi-agent based DRL method for environment reconstruction that takes the environmental confounders into account.

Conclusion and Future Work

In this paper, we propose a new approach, InvRec, for online recommendation. Our proposed approach is designed to overcome the drawbacks caused by an inaccurate reward function. The proposed model is built upon the advantage actor-critic network with generative adversarial imitation learning. We evaluate our method on the online platform VirtualTB, where our model achieves good performance. We also compare our method with a few state-of-the-art methods in three different categories: deep Q-learning based, policy gradient based, and actor-critic network based methods. The results demonstrate the proposed approach's feasibility and superior performance.

This study provides a good initial attempt at applying deep inverse reinforcement learning to online recommendation systems. However, a few shortcomings remain that are not addressed in this paper, such as the sample inefficiency problem of imitation learning Kostrikov et al. (2019). Low sample efficiency leads to longer training time and the need for a larger dataset. Random sampling also affects the performance. Possible solutions include using off-policy methods instead of on-policy methods, and finding an optimal sampling strategy such that the agent gets the same expert trajectories when facing a state that has come up before.


  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In

    Proceedings of the twenty-first international conference on Machine learning

    pp. 1. Cited by: Problem Formulation, Policy Approximation, Policy Approximation.
  • X. Bai, J. Guan, and H. Wang (2019) A model-based reinforcement learning with adversarial training for online recommendation. In Advances in Neural Information Processing Systems, pp. 10735–10746. Cited by: 1st item.
  • M. Bloem and N. Bambos (2014) Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, pp. 4911–4916. Cited by: Policy Approximation.
  • H. Chen, X. Dai, H. Cai, W. Zhang, X. Wang, R. Tang, Y. Zhang, and Y. Yu (2019a) Large-scale interactive recommendation with tree-structured policy gradient. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 33, pp. 3312–3320. Cited by: Introduction.
  • M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2019b) Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. Cited by: Introduction.
  • S. Chen, Y. Yu, Q. Da, J. Tan, H. Huang, and H. Tang (2018) Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1187–1196. Cited by: Introduction.
  • X. Chen, C. Huang, L. Yao, X. Wang, W. Liu, and W. Zhang (2020) Knowledge-guided deep reinforcement learning for interactive recommendation. arXiv preprint arXiv:2004.08068. Cited by: Introduction, Discriminative Actor-Critic Network, 4th item, Related Work.
  • X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song (2019c) Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning, pp. 1052–1061. Cited by: 3rd item, Related Work.
  • C. Finn, S. Levine, and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In International conference on machine learning, pp. 49–58. Cited by: Introduction.
  • Y. Gao, H. Xu, J. Lin, F. Yu, S. Levine, and T. Darrell (2018) Reinforcement learning from imperfect demonstrations. External Links: Link Cited by: Expert Policy Acquisition.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: Policy Generation.
  • M. Hauskrecht (1997) Incremental methods for computing bounds in partially observable markov decision processes. In AAAI/IAAI, pp. 734–739. Cited by: Related Work.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: Policy Generation.
  • M. Kearns, Y. Mansour, and A. Y. Ng (2002) A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning 49 (2-3), pp. 193–208. Cited by: Related Work.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Discriminative Actor-Critic Network.
  • V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: Introduction, Discriminative Actor-Critic Network.
  • I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson (2019) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations. Cited by: Conclusion and Future Work.
  • S. Levine and V. Koltun (2012) Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617. Cited by: Introduction.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Expert Policy Acquisition.
  • F. Liu, H. Guo, X. Li, R. Tang, Y. Ye, and X. He (2020) End-to-end deep reinforcement learning based recommendation with supervised embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 384–392. Cited by: Introduction, Discriminative Actor-Critic Network, Related Work.
  • X. Nguyen, M. J. Wainwright, M. I. Jordan, et al. (2009) On surrogate loss functions and f-divergences. The Annals of Statistics 37 (2), pp. 876–904. Cited by: Policy Generation.
  • F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He (2019) Policy gradients for contextual recommendations. In The World Wide Web Conference, pp. 1421–1431. Cited by: Introduction, 2nd item, Related Work.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: Evaluation Metric and Experimental Environment.
  • P. Poupart and C. Boutilier (2005) VDCBPI: an approximate scalable algorithm for large POMDPs. In Advances in Neural Information Processing Systems, pp. 1081–1088. Cited by: Related Work.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015a) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: Policy Optimization.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015b) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: Policy Optimization.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Policy Optimization.
  • W. Shang, Y. Yu, Q. Li, Z. Qin, Y. Meng, and J. Ye (2019) Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 566–576. Cited by: Introduction, Related Work.
  • J. Shi, Y. Yu, Q. Da, S. Chen, and A. Zeng (2019) Virtual-taobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4902–4909. Cited by: Experiments.
  • U. Syed, M. Bowling, and R. E. Schapire (2008) Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039. Cited by: Policy Approximation.
  • U. Syed and R. E. Schapire (2008) A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456. Cited by: Policy Approximation.
  • P. Wang, Y. Fan, L. Xia, W. X. Zhao, S. Niu, and J. Huang (2020) KERL: a knowledge-guided reinforcement learning model for sequential recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 209–218. Cited by: Related Work.
  • K. Zhao, X. Wang, Y. Zhang, L. Zhao, Z. Liu, C. Xing, and X. Xie (2020) Leveraging demonstrations for reinforcement recommendation reasoning over knowledge graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 239–248. Cited by: Introduction, Related Work.
  • X. Zhao, L. Zhang, L. Xia, Z. Ding, D. Yin, and J. Tang (2017) Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209. Cited by: Related Work.
  • G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: a deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176. Cited by: Introduction, Related Work.
  • B. D. Ziebart, J. A. Bagnell, and A. K. Dey (2010) Modeling interaction via the principle of maximum causal entropy. Cited by: Policy Approximation.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: Policy Approximation.
  • L. Zou, L. Xia, P. Du, Z. Zhang, T. Bai, W. Liu, J. Nie, and D. Yin (2020) Pseudo Dyna-Q: a reinforcement learning framework for interactive recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 816–824. Cited by: Related Work.