Introduction
Deep reinforcement learning (DRL) is promising for recommendation systems, given its ability to learn optimal strategies from interactions for generating recommendations that best fit users’ dynamic preferences. DRL-based recommendation systems fall into three categories: deep Q-learning based methods, policy gradient based methods, and hybrid methods. Deep Q-learning aims to find the best step by maximizing a Q-value over all possible actions. Zheng et al. (2018) first introduced DRL into recommendation systems for news recommendation; then, Chen et al. (2018) introduced a robust Q-learning to handle dynamic environments for online recommendation. However, Q-learning based methods suffer from the agent stuck problem, i.e., Q-learning requires the maximization operation over the action space, which becomes infeasible when the action space is extremely large. Policy gradient based methods can mitigate the agent stuck problem Chen et al. (2019a). Such methods use the average reward as the guideline, yet they may treat bad actions as good actions, making the algorithm hard to converge Pan et al. (2019). In comparison, hybrid methods combine the advantages of Q-learning and policy gradient. As a popular algorithm among hybrid methods, the actor-critic network Konda and Tsitsiklis (2000) adopts policy gradient on an actor network and Q-learning on a critic network to achieve Nash equilibrium on both networks. So far, the actor-critic network has been widely applied to DRL-based recommendation systems Chen et al. (2019b); Liu et al. (2020); Zhao et al. (2020); Chen et al. (2020).
Despite their differences, all existing DRL-based methods rely on well-designed context-dependent reward functions, as shown in the general workflow of a DRL-based recommendation system in Fig 1 (a). However, in many cases, the reward function cannot be easily defined, due to dynamic environments and various factors that affect users’ interest Shang et al. (2019). Thus, the existing methods may suffer from limited generalization ability. Besides, they generally take into account both the user’s preference and the user’s actions (e.g., click-through, rating or implicit feedback) in the reward function, under the assumptions that the reward is determined by the currently chosen item and that the user’s action is unaffected by the recommended items. Such assumptions, however, no longer hold for online recommendation Shang et al. (2019). Another challenge is that using reinforcement learning to find the recommendation policy from scratch might be time-consuming Finn et al. (2016); Levine and Koltun (2012). In a recommendation problem, the state space (i.e., all candidate items in which users might be interested) and the action space (i.e., all the actions for candidate items) might be huge. Traditional reinforcement learning based methods iterate over all the possible combinations to figure out the best policy, which is arduous.
To address the above challenges, we aim to enable the agent to infer a reward function from users’ behaviors via inverse reinforcement learning and learn the recommendation policy directly from user behaviors in an efficient and adaptive way, as shown in Fig 1 (b). To this end, we propose a generative inverse reinforcement learning approach for adaptively inferring an implicit reward function from user behaviors. Specifically, the approach can measure the performance of the current recommendation policy and update the current policy with the expert policy in the discriminator, thus alleviating the need to define a reward function for online recommendation. In particular, we transform inverse deep reinforcement learning into a generator to augment a diverse set of state-action pairs. Under this generative strategy, our method can achieve better generalization ability under complex online recommendation conditions. In a nutshell, we make the following contributions:

We propose generative inverse reinforcement learning to automatically learn the reward function for online recommendation. To the best of our knowledge, this is the first work to decouple the reward function and the agent for online recommendation.

We design a novel actor-discriminator network module that takes a discriminator as the critic network and a novel actor-critic network as the actor network, to implement the proposed framework. The module is model-free and can be easily generalized to a variety of scenarios.

We conduct experiments on a virtual online platform, VirtualTB, to demonstrate the feasibility and effectiveness of the proposed approach. Our proposed method achieves a higher click-through rate than several state-of-the-art methods.
Problem Formulation
Online Recommendation
Online recommendation differs from offline recommendation in dealing with real-time interactions between users and the recommendation system. The system needs to analyse users’ behavior and update the recommendation policy dynamically. The objective is to find a solution that best reflects those interactions and apply it to the recommendation policy.
Reinforcement Learning-based Recommendation
Reinforcement learning based recommendation systems learn from interactions through a Markov Decision Process (MDP). Given a recommendation problem consisting of a set of users $\mathcal{U}$, a set of items $\mathcal{I}$, and users’ demographic information $\mathcal{D}$, the MDP can be represented as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ denotes the state space, which is the combination of the subsets of $\mathcal{I}$; $\mathcal{A}$ is the action space, which represents the agent’s selection during recommendation based on the state space $\mathcal{S}$; $\mathcal{P}$ is the set of transition probabilities for state transfer based on the action received; $\mathcal{R}$ is a set of rewards received from users, which are used to evaluate the action taken by the recommendation system, with each reward being a binary value to indicate the user’s click; and $\gamma$ is a discount factor for balancing the future reward and the current reward.
Given a user $u \in \mathcal{U}$ and the state $s_t$ observed by the agent (or the recommendation system), which includes a subset of the item set $\mathcal{I}$ and the user’s demographic information, a typical recommendation iteration for user $u$ goes as follows. First, the agent makes an action $a_t$ based on the recommendation policy $\pi$ under the observed state $s_t$ and receives the corresponding reward $r_t$. Then, the agent generates a new policy based on the received reward and determines the new state $s_{t+1}$ based on the probability distribution $P(s_{t+1} \mid s_t, a_t)$. The cumulative reward after $k$ iterations is as follows:
$$R = \sum_{i=0}^{k} \gamma^{i} r_{i}$$
Inverse Reinforcement Learning-based Online Recommendation
We propose inverse reinforcement learning without predefining a reward function for online recommendation. We aim to optimize the current policy to make recommendations that are most suitable for the user.
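Specifically, we model online recommendation as an MDP with a finite state set $\mathcal{S}$, a set of actions $\mathcal{A}$, transition probabilities $\mathcal{P}$, and a discount factor $\gamma$. Suppose there exists an expert policy $\pi_E$ that can master any state $s \in \mathcal{S}$. The recommendation then turns into the optimization problem of finding the policy $\pi$ that best approximates the expert policy $\pi_E$ across the cost function class $\mathcal{C}$, by the following objective function Abbeel and Ng (2004):
$$\min_{\pi} \max_{c \in \mathcal{C}} \; \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{1}$$
The cost function class $\mathcal{C}$ is restricted to convex sets defined by the linear combination of a few basis functions $\{f_1, f_2, \dots, f_k\}$. Hence, the corresponding feature vector for the state-action pair $(s,a)$ can be represented as $f(s,a) = [f_1(s,a), f_2(s,a), \dots, f_k(s,a)]$. The expectation $\mathbb{E}_{\pi}[c(s,a)]$ is defined as (on discounted infinite horizon):
$$\mathbb{E}_{\pi}[c(s,a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right] \tag{2}$$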
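As an illustration of the discounted expectation used in this formulation, the following minimal sketch estimates it by Monte Carlo averaging over sampled trajectories and evaluates the gap between a learned policy and the expert for one fixed cost function. The function names, the trajectory representation (a list of state-action pairs), and the finite-horizon truncation are assumptions made for the example, not part of the proposed method.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration: a state is a tuple of floats,
# an action is an integer item index.
State = Tuple[float, ...]
Action = int
Trajectory = Sequence[Tuple[State, Action]]
CostFn = Callable[[State, Action], float]


def discounted_cost(trajectory: Trajectory, cost: CostFn, gamma: float) -> float:
    """Discounted sum of costs along one trajectory: sum_t gamma^t * c(s_t, a_t)."""
    return sum((gamma ** t) * cost(s, a) for t, (s, a) in enumerate(trajectory))


def expected_cost(trajectories: Sequence[Trajectory], cost: CostFn, gamma: float) -> float:
    """Monte Carlo estimate of the discounted expected cost under a policy."""
    if not trajectories:
        raise ValueError("at least one trajectory is required")
    total = sum(discounted_cost(tr, cost, gamma) for tr in trajectories)
    return total / len(trajectories)


def imitation_gap(policy_trajs: Sequence[Trajectory],
                  expert_trajs: Sequence[Trajectory],
                  cost: CostFn, gamma: float) -> float:
    """Expected cost of the learned policy minus that of the expert, for a fixed cost."""
    return expected_cost(policy_trajs, cost, gamma) - expected_cost(expert_trajs, cost, gamma)
```

A gap of zero for every cost function in the class would mean the learned policy is indistinguishable from the expert under that class; the inner maximization in the objective searches for the cost that makes this gap largest.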
Methodology
The overall structure of our proposed approach is illustrated in Fig 2. It consists of three main components: policy approximation, policy generation, and the discriminative actor-critic network. Policy approximation provides the theoretical approach to aligning the learned recommendation policy $\pi$ with the expert policy $\pi_E$; policy generation increases the diversity of the recommendation policy; and the discriminative actor-critic network constitutes the main structure of our approach. Besides the above components, we will present the optimization method and the corresponding training algorithm, which aims to limit the update step in optimizing the loss function to ensure that a new policy achieves better performance than the old one.
Policy Approximation
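The policy approximation component aims to make the learned recommendation policy $\pi$ and the expert policy $\pi_E$ as similar as possible. To this end, we look for a cost function that delivers the best expert policy among all the policies on the latent space. According to Eq.(1), the cost function class $\mathcal{C}$ consists of convex sets, which have a linear format Abbeel and Ng (2004) and a convex format Syed et al. (2008), respectively:
$$\mathcal{C}_{l} = \Big\{ \sum_i w_i f_i : \|w\|_2 \le 1 \Big\} \tag{3}$$
$$\mathcal{C}_{c} = \Big\{ \sum_i w_i f_i : \sum_i w_i = 1,\; w_i \ge 0 \;\; \forall i \Big\} \tag{4}$$
where $w_i$ is the weight of the basis function $f_i$. The corresponding objective functions are as follows:
$$\min_{\pi} \big\| \mathbb{E}_{\pi}[f(s,a)] - \mathbb{E}_{\pi_E}[f(s,a)] \big\|_2 \tag{5}$$
$$\min_{\pi} \max_{i} \; \mathbb{E}_{\pi}[f_i(s,a)] - \mathbb{E}_{\pi_E}[f_i(s,a)] \tag{6}$$
In particular, Eq.(5) minimizes the distance between the feature expectations of the state-action pairs, known as feature expectation matching Abbeel and Ng (2004). Eq.(6) minimizes the worst-case excess cost among the basis functions Syed and Schapire (2008). An issue with such methods is the ambiguity in Eq.(1): many candidate policies can approximate the expert if we only compare the features Ziebart et al. (2008). We resolve the ambiguity by introducing the following discounted causal entropy Bloem and Bambos (2014) into Eq.(1):
$$H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a \mid s)] \tag{7}$$
We thereby rewrite Eq.(1) into
$$\max_{c \in \mathcal{C}} \Big( \min_{\pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \tag{8}$$
and define the reinforcement learning process according to Ziebart et al. (2008):
$$\mathrm{RL}(c) = \arg\min_{\pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \tag{9}$$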
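To make the two matching criteria above concrete, the sketch below computes discounted feature expectations from sampled trajectories and then evaluates both the L2 distance used in feature expectation matching and the worst-case excess over individual basis functions. The trajectory format and all function names are assumptions introduced for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a trajectory is a list of (state, action) pairs,
# and `features` maps one pair to a fixed-length list of basis-function values.
Pair = Tuple[float, float]
FeatureFn = Callable[[float, float], List[float]]


def feature_expectation(trajectories: Sequence[Sequence[Pair]],
                        features: FeatureFn, gamma: float) -> List[float]:
    """Average over trajectories of sum_t gamma^t * f(s_t, a_t)."""
    if not trajectories:
        raise ValueError("at least one trajectory is required")
    total: List[float] = []
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            f = features(s, a)
            if not total:
                total = [0.0] * len(f)
            for i, value in enumerate(f):
                total[i] += (gamma ** t) * value
    return [x / len(trajectories) for x in total]


def l2_gap(mu_policy: Sequence[float], mu_expert: Sequence[float]) -> float:
    """Feature expectation matching: L2 distance between expected feature vectors."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(mu_policy, mu_expert)))


def worst_case_gap(mu_policy: Sequence[float], mu_expert: Sequence[float]) -> float:
    """Largest excess of the policy over the expert on any single basis function."""
    return max(p - e for p, e in zip(mu_policy, mu_expert))
```

Both quantities only compare features, which is why several different policies can reach the same value; this is the ambiguity that the causal entropy term is introduced to resolve.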
Policy Generation
We regard policy generation as the problem of matching two occupancy measures and solve it by training a Generative Adversarial Network (GAN) Goodfellow et al. (2014). The occupancy measure can be defined as:
$$\rho_{\pi}(s,a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi) \tag{11}$$
Since the generator aims to generate the policy as similar to the expert policy as possible, we use GAIL Ho and Ermon (2016) to bridge inverse reinforcement learning and GAN by making an analogy from the occupancy matching to distribution matching. Specifically, we introduce a GA regularizer to restrict the entropy function:
(12) 
where is defined as:
(13) 
By introducing the GA regularizer, we can directly measure the difference between the policy and the expert policy without needing to know the reward function. We use the loss function from the discriminator as in Eq.(10). We represent the negative log loss for the binary classification to distinguish the policy $\pi$ and the expert policy $\pi_E$ via state-action pairs. The optimum of Eq.(14) is equivalent to the Jensen-Shannon divergence Nguyen et al. (2009):
(14) 
(15) 
Finally, we obtain the inverse reinforcement learning definition by substituting the GA regularizer into Eq.(8):
(16) 
where is a factor with . Note that Eq.(16) has the same goal as the GAN, i.e., finding the squared metric between distributions. More specifically, we have the following equivalence for Eq.(16):
(17) 
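The binary classification view of the discriminator can be sketched as follows. This is a minimal illustration that assumes the labeling convention of Ho and Ermon (2016), where the discriminator output is the estimated probability that a state-action pair was produced by the learned policy rather than by the expert; the function name and the numerical floor `eps` are choices made for the example.

```python
import math
from typing import Sequence


def discriminator_loss(d_policy: Sequence[float],
                       d_expert: Sequence[float],
                       eps: float = 1e-8) -> float:
    """Negative log loss of a binary classifier separating policy and expert pairs.

    d_policy: discriminator outputs in (0, 1) on pairs sampled from the learned policy.
    d_expert: discriminator outputs in (0, 1) on pairs sampled from the expert.
    Assumed convention: output near 1 means "generated by the learned policy".
    """
    if not d_policy or not d_expert:
        raise ValueError("both batches must be non-empty")
    # Average log-likelihood of labeling policy pairs as 1.
    policy_term = sum(math.log(max(d, eps)) for d in d_policy) / len(d_policy)
    # Average log-likelihood of labeling expert pairs as 0.
    expert_term = sum(math.log(max(1.0 - d, eps)) for d in d_expert) / len(d_expert)
    return -(policy_term + expert_term)
```

When the discriminator outputs 0.5 on every pair, it can no longer tell the two sources apart and the loss equals 2 log 2, the value at which the two occupancy measures coincide under this classifier.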
Discriminative Actor-Critic Network
The discriminative actor-critic network aims to map online recommendation into an inverse reinforcement learning framework.
Specifically, we adopt the advantage actor-critic network, a variant of the actor-critic network, as the main structure of our approach. Within this network, the actor uses the policy gradient to update the policy, and the critic uses Q-learning to evaluate the policy and provide feedback Konda and Tsitsiklis (2000).
Given the user’s profile at timestamp $t$ (i.e., the item list) and the optional demographic information (which is used to generate the state $s_t$), the environment embeds the user’s recent interest and the user’s features into the latent space via a neural network Chen et al. (2020); Liu et al. (2020). Once the actor network gets the state $s_t$ from the environment, it feeds the state to a network with two fully-connected layers with ReLU as the activation function. The final layer outputs the target policy function $\pi_\theta$ parameterized by $\theta$, which will be updated together with the discriminator $D$. Then, the critic network takes the input from the actor network with the current policy $\pi_\theta$, which can be used for sampling to get the trajectory. We concatenate the state-action pair and feed it into two fully-connected layers with ReLU as the activation function. The output of the critic network is a value that will be used to calculate the advantage, which is used for optimization (to be discussed later).
As aforementioned, the discriminator $D$ is the key component of our approach. To build an end-to-end model and better approximate the expert policy $\pi_E$, we parameterize the policy as $\pi_\theta$ and clip the output of the discriminator so that $D: \mathcal{S} \times \mathcal{A} \to (0, 1)$ with weight $w$. The discriminator has its own loss function. Besides, we use Adam Kingma and Ba (2014) to optimize the weight $w$ (the optimization for $\theta$ will be introduced later). Here, the discriminator can be treated as a local cost function to guide the policy update. Specifically, the policy will move toward expert-like regions (divided by $D$) in the latent space by minimizing this loss function, i.e., finding a point $(\pi, D)$ for Eq.(17) such that the equation output is minimal.
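The layer structure described above can be sketched as a plain forward pass. This is a minimal illustration only: it assumes two ReLU hidden layers followed by a linear output head, a hidden size of 16, and state and action dimensions of 91 and 27 (chosen here purely for the example); the class and helper names are hypothetical, and no training logic is included.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Uniform initialization in [-1/sqrt(n_in), 1/sqrt(n_in)] with zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    """Affine map: one dot product per output unit, plus its bias."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


class MLP:
    """Two fully-connected ReLU layers followed by a linear output layer (assumed layout)."""

    def __init__(self, n_in: int, hidden: int, n_out: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_layers = [init_layer(n_in, hidden, rng), init_layer(hidden, hidden, rng)]
        self.output_layer = init_layer(hidden, n_out, rng)

    def __call__(self, x: List[float]) -> List[float]:
        for layer in self.hidden_layers:
            x = relu(linear(x, layer))
        return linear(x, self.output_layer)


# Actor: state -> action vector. Critic: concatenated (state, action) -> scalar value.
STATE_DIM, ACTION_DIM, HIDDEN = 91, 27, 16  # illustrative sizes, not the paper's settings
actor = MLP(STATE_DIM, HIDDEN, ACTION_DIM, seed=0)
critic = MLP(STATE_DIM + ACTION_DIM, HIDDEN, 1, seed=1)
```

A discriminator over state-action pairs could reuse the same `MLP` shape with its scalar output squashed into (0, 1), which is one way to realize the clipping mentioned above.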
Policy Optimization
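We use the actor-critic network as a policy network to be trained jointly with the discriminator. Therefore, the actor-critic network needs to update the policy parameter $\theta$ based on the discriminator. During this process, we aim to limit the agent’s step size to ensure the new policy is better than the old one. Specifically, we use trust region policy optimization (TRPO) Schulman et al. (2015a) to update the policy parameter and formulate the TRPO problem as follows:
$$\max_{\theta} \; \mathbb{E}_t\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right] \quad \text{s.t.} \quad \mathbb{E}_t\left[ \mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\big] \right] \le \delta \tag{18}$$
where $\hat{A}_t$ is the advantage function calculated by Generalized Advantage Estimation (GAE) Schulman et al. (2015b) below:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{19}$$
where the reward $r_t$ is the step’s reward at timestamp $t$. The reward has two components: the reward returned by the environment and the bonus reward calculated from the output of the discriminator $D$. Considering the massive computation load of updating the policy via optimizing Eq.(18), we use Proximal Policy Optimization (PPO) Schulman et al. (2017) with the objective function below to update the policy:
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \big) \right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \tag{20}$$
where $\epsilon$ is the clipping parameter, which represents the maximum percentage of change that can be updated at a time.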
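The two quantities that drive this update, the GAE advantage and the clipped PPO surrogate, can be sketched as below. The code follows the standard formulations of Schulman et al. (2015b, 2017); the function names are illustrative, the rewards passed in are assumed to already combine the environment reward and the discriminator bonus, and `values` is assumed to carry one extra bootstrap entry for the state after the last step.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Generalized Advantage Estimation over one finite trajectory.

    rewards: r_0 .. r_{T-1}.
    values:  V(s_0) .. V(s_T), i.e. len(rewards) + 1 entries (last one bootstraps).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(ratios: Sequence[float], advantages: Sequence[float],
                       epsilon: float) -> float:
    """Mean clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equally long")
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

With a clipping parameter of 0.2, a probability ratio of 1.5 on a positive advantage is credited only as 1.2, which is how the objective keeps each policy update close to the previous policy without solving the constrained TRPO problem.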
The training procedure involves two components: the discriminator and the actor-critic network. The training algorithm is illustrated in Algorithm 1. Specifically, for the discriminator, we use Adam as the optimizer to find the gradient for Eq.(17) for weight :
(21) 
Experiments
We report the experimental evaluation of our model on a real-world online retail environment, VirtualTB Shi et al. (2019), on OpenAI Gym (https://gym.openai.com).
Fig 3: Experimental results, where (a) is the CTR with 95% confidence interval, and (b) is the average reward received each step after 1500 iterations with 95% confidence interval. For comparison, we have added the performance of the expert policy to (a). The results reported in (c) are for the ablation study.
Virtual TaoBao
VirtualTB is a dynamic environment for testing the feasibility of a recommendation agent. It enables a customized agent to interact with it and receive the corresponding rewards. On VirtualTB, each customer has 11 static attributes as the demographic information, which are encoded into an 88-dimensional space with binary values. The customers have multiple dynamic interests that are encoded into a 3-dimensional space and may change over the interaction process. Each item has several attributes, e.g., price and sales volume, which are encoded into a 27-dimensional space.
Baseline methods

IRecGAN Bai et al. (2019): An online recommendation method that employs reinforcement learning and GAN.

PGCR Pan et al. (2019): A policy gradient based method for contextual recommendation.

GAUM Chen et al. (2019c): A deep Q-learning based method that employs GAN and cascade Q-learning for recommendation.

KGRL Chen et al. (2020): An actor-critic based method for interactive recommendation, a variant of online recommendation.
Note that GAUM and PGCR are not designed for online recommendation, and KGRL requires a knowledge graph as side information, which is unavailable in the gym environment. Hence, we only keep the network structures and put those networks into the VirtualTB platform for testing.
Evaluation Metric and Experimental Environment
The experiments are conducted in the OpenAI Gym environment, where the reward can be readily obtained for each episode. Since each episode may have a different number of steps, it is difficult to determine when users will end the session. For this reason, we choose the click-through rate (CTR) to represent the performance, which is defined as:
$$\mathrm{CTR} = \frac{r_{episode}}{10 \times N_{step}} \tag{22}$$
where 10 means that the user is interested in all 10 items that are recommended in a single page, $r_{episode}$ is the reward received per episode, and $N_{step}$ is the number of steps included in one episode.
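As a small worked example of this metric, the sketch below computes the CTR of one episode under the assumption that it is the episode reward divided by the maximum attainable reward, i.e. 10 clicks per step over all steps of the episode. The function and parameter names are illustrative.

```python
def click_through_rate(episode_reward: float, num_steps: int,
                       items_per_page: int = 10) -> float:
    """CTR of one episode: reward received divided by the maximum attainable reward.

    Assumes each step shows `items_per_page` items and each click counts as one
    unit of reward, so the best possible episode reward is items_per_page * num_steps.
    """
    if num_steps <= 0:
        raise ValueError("an episode must contain at least one step")
    return episode_reward / (items_per_page * num_steps)


# Example: 30 clicks collected over a 5-step episode.
example_ctr = click_through_rate(episode_reward=30, num_steps=5)
```

Normalizing by the number of steps is what makes episodes of different lengths comparable, which is the reason given above for preferring CTR over the raw episode reward.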
Expert Policy Acquisition
In this part, we introduce the strategy for acquiring the expert policy for VirtualTB. There is no official expert policy in VirtualTB, and it is unrealistic to manually create one from this virtual environment, where the source data is not available. Hence, we follow a strategy similar to that of Gao et al. (2018) and generate a set of expert policies from a pre-trained expert policy network. We design an actor-critic network with the same actor and critic network structure as our model, but without the advantage. The critic network of this expert network calculates the value by adopting deep Q-learning. We adopt Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015) to train the expert network.
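Table 1: CTR for different values of the GAE parameter λ (columns) and the PPO clipping parameter ε (rows).
PPO ε \ GAE λ | 0.94 | 0.95 | 0.96 | 0.97 | 0.98 | 0.99
0.05 | 0.630 ± 0.063 | 0.632 ± 0.064 | 0.633 ± 0.062 | 0.630 ± 0.059 | 0.626 ± 0.060 | 0.629 ± 0.059
0.10 | 0.632 ± 0.062 | 0.635 ± 0.060 | 0.636 ± 0.061 | 0.636 ± 0.058 | 0.634 ± 0.061 | 0.633 ± 0.060
0.15 | 0.633 ± 0.060 | 0.635 ± 0.061 | 0.639 ± 0.061 | 0.640 ± 0.057 | 0.639 ± 0.059 | 0.638 ± 0.061
0.20 | 0.634 ± 0.060 | 0.636 ± 0.060 | 0.641 ± 0.063 | 0.643 ± 0.061 | 0.643 ± 0.063 | 0.641 ± 0.058
0.25 | 0.631 ± 0.061 | 0.635 ± 0.059 | 0.636 ± 0.060 | 0.637 ± 0.060 | 0.636 ± 0.061 | 0.634 ± 0.059
0.30 | 0.630 ± 0.059 | 0.631 ± 0.061 | 0.632 ± 0.060 | 0.630 ± 0.059 | 0.630 ± 0.058 | 0.629 ± 0.050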
Hyperparameter Settings
For the policy network , we set the DDPG parameters as: , the size of the hidden layer is , the size of the replay buffer is , and the number of episodes is set to . For the Ornstein-Uhlenbeck noise, the scale is , . For our approach, the number of episodes is set to , the hidden size of the advantage actor-critic network is , the hidden size of the discriminator is , the learning rate is , the factor is , the mini-batch size is , and the number of PPO epochs is . For the generalized advantage estimation, we set the discount factor to , and . For a fair comparison, all the baseline methods are trained under the same conditions. For easy recognition, we set one iteration as 100 episodes.
Results
Full results are reported in Fig 3. Our approach generally outperforms all four state-of-the-art methods. Specifically, our method achieves the best result among all the baseline methods after iterations. This demonstrates the feasibility of the proposed approach.
A possible reason for the poor performance of KGRL is that KGRL maintains a local knowledge graph inside the model and actively interacts with the environment. Because the experiments are conducted on an online platform that does not provide the side information KGRL needs to generate its knowledge graph, KGRL performs worse than the other baseline methods.
Impact of Key Parameters
We are interested in how the control parameter $\lambda$ of GAE and the clipping parameter $\epsilon$ of PPO affect the performance. These two key parameters significantly affect the generalized advantage estimation and the proximal policy optimization process. The parameter $\lambda$ is used to make a compromise between bias and variance and is normally selected from $[0.94, 0.99]$ with step 0.01. The clipping parameter $\epsilon$ determines the percentage to be clipped and is normally kept small to control the optimization step size. The results are reported in Table 1. For fairness, we report the CTR at iteration 2000. Observe that the best CTR in Table 1, 0.643, is obtained at $\epsilon = 0.20$ with $\lambda = 0.97$ or $\lambda = 0.98$, with the smaller deviation at $\lambda = 0.97$. The best-performing values are also used as our default setting. More details about the model can be found in the Supplementary Materials.
Ablation Study
In this part, we investigate the effect of the GAE. We use two different optimization strategies to optimize the proposed model which are DDPG and Adaptive KL Penalty Coefficient. The Adaptive KL Penalty Coefficient is the simplified version of the PPO which can be defined as:
(23) 
The update rule for is:
The parameters of the update rule are determined by experiments; the selection process is reported in the Supplementary Materials due to the space limit. The results of the ablation study can be found in Fig 3 (c).
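For reference, the adaptive KL penalty update as commonly described by Schulman et al. (2017) can be sketched as follows. The tolerance of 1.5 and the scaling factor of 2 are the heuristic defaults from that work and are assumptions here, since the settings used in this study are selected experimentally; the function and argument names are illustrative.

```python
def update_kl_coefficient(beta: float, kl: float, kl_target: float,
                          tolerance: float = 1.5, scale: float = 2.0) -> float:
    """Adapt the KL penalty coefficient after one policy update.

    beta:      current penalty coefficient.
    kl:        measured mean KL divergence between the old and the new policy.
    kl_target: desired KL divergence per update.
    tolerance, scale: heuristic constants (assumed defaults, not this study's values).
    """
    if kl < kl_target / tolerance:
        return beta / scale   # update was too timid: relax the penalty
    if kl > kl_target * tolerance:
        return beta * scale   # update moved too far: strengthen the penalty
    return beta               # within the tolerated band: keep the coefficient
```

The penalized objective then subtracts `beta` times the measured KL divergence from the surrogate, so a larger coefficient pulls the next update closer to the previous policy.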
Discussion
This study provides a new approach for reinforcement learning based online recommendation, without the need to define a reward function. Our work can therefore be applied in various real-world recommendation scenarios where the reward function is hard to define manually or is highly domain-dependent. The proposed method offers fundamental support for inverse reinforcement learning based recommendation systems. Given a few user behaviors, the proposed method can extract an adaptive, unknown reward function so as to automatically find the optimal strategies for generating recommendations that best fit users’ interests. Our empirical evaluation testifies to its competitive performance against existing state-of-the-art reinforcement learning based methods. Our model has implications for, and may accelerate progress in, applying reinforcement learning in practice where a complex environment exists.
Related Work
We briefly review previous studies related to deep reinforcement learning (DRL)-based recommendation. All those methods are MDP-based or partially observable MDP-based (POMDP). POMDP-based methods can be further divided into three categories: value function estimation Hauskrecht (1997), policy optimization Poupart and Boutilier (2005), and stochastic sampling Kearns et al. (2002). Due to the high computational and representational complexity of the POMDP-based methods, MDP-based methods are relatively more popular in academia.
MDP-based DRL methods for recommendation can be grouped into three approaches: deep Q-learning based, policy gradient based, and actor-critic based methods. Zheng et al. (2018) adopt deep Q-learning for news recommendation by using the user’s historical record as the state. Zou et al. (2020) improve the structure of deep Q-learning to achieve more robust results. Pan et al. (2019) apply the policy gradient to learn the optimal recommendation strategies. Wang et al. (2020) introduce the knowledge graph into the policy gradient for sequential recommendation. However, Q-learning may get stuck because of the max operation, and policy gradient requires large-scale data to boost the convergence speed and only updates once per episode. Hence, actor-critic uses the Q-value to conduct the policy gradient per step instead of per episode. Zhao et al. (2017) adopt the actor-critic method to conduct list-wise recommendation in a simulated environment. Chen et al. (2020); Zhao et al. (2020) utilize the knowledge graph as side information embedded into the state-action space to increase the model’s capability on the actor-critic network. In addition, Liu et al. (2020) propose to produce recommendations by learning a state-action embedding within the DRL framework.
Furthermore, Chen et al. (2019c) integrate the generative adversarial network with the reinforcement learning structure to generate users’ attributes so that more side information is available to boost the performance of the reinforcement learning based recommendation system. Shang et al. (2019) propose a multi-agent based DRL method for environment reconstruction that takes the environmental confounders into account.
Conclusion and Future Work
In this paper, we propose a new approach, InvRec, for online recommendation. Our proposed approach is designed to overcome the drawback caused by an inaccurate reward function. The proposed model is built upon the advantage actor-critic network with generative adversarial imitation learning. We evaluate our method on the online platform VirtualTB, where our model achieves good performance. We also compare our method with a few state-of-the-art methods in three different categories: deep Q-learning based, policy gradient based, and actor-critic network based methods. The results demonstrate the proposed approach’s feasibility and superior performance.
This study provides a good initial attempt at applying deep inverse reinforcement learning to online recommendation systems. However, there remain a few shortcomings that are not addressed in this paper, such as the sample inefficiency problem of imitation learning Kostrikov et al. (2019). Sample inefficiency leads to longer training time and the need for a larger dataset. Random sampling would also affect the performance. Possible solutions include using off-policy methods instead of on-policy ones, and finding an optimal sampling strategy such that the agent gets the same expert trajectories when facing a state that has come up before.
References

Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1.
A model-based reinforcement learning with adversarial training for online recommendation. In Advances in Neural Information Processing Systems, pp. 10735–10746.
Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, pp. 4911–4916.
Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3312–3320.
Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464.
Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1187–1196.
Knowledge-guided deep reinforcement learning for interactive recommendation. arXiv preprint arXiv:2004.08068.
Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning, pp. 1052–1061.
Guided cost learning: deep inverse optimal control via policy optimization. In International conference on machine learning, pp. 49–58.
Reinforcement learning from imperfect demonstrations.
Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
Incremental methods for computing bounds in partially observable markov decision processes. In AAAI/IAAI, pp. 734–739.
Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573.
A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning 49 (2-3), pp. 193–208.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014.
Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations.
Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
End-to-end deep reinforcement learning based recommendation with supervised embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 384–392.
On surrogate loss functions and f-divergences. The Annals of Statistics 37 (2), pp. 876–904.
Policy gradients for contextual recommendations. In The World Wide Web Conference, pp. 1421–1431.
Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037.
VDCBPI: an approximate scalable algorithm for large pomdps. In Advances in Neural Information Processing Systems, pp. 1081–1088.
Trust region policy optimization. In International conference on machine learning, pp. 1889–1897.
High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 566–576.
Virtual-taobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4902–4909.
Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039.
A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pp. 1449–1456.
KERL: a knowledge-guided reinforcement learning model for sequential recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 209–218.
Leveraging demonstrations for reinforcement recommendation reasoning over knowledge graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 239–248.
Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209.
DRN: a deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176.
Modeling interaction via the principle of maximum causal entropy.
Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438.
Pseudo dyna-q: a reinforcement learning framework for interactive recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 816–824.