Inverse Reinforcement Learning for Marketing

12/13/2017 ∙ by Igor Halperin, et al. ∙ 0

Learning customer preferences from an observed behaviour is an important topic in the marketing literature. Structural models typically model forward-looking customers or firms as utility-maximizing agents whose utility is estimated using methods of Stochastic Optimal Control. We suggest an alternative approach to study dynamic consumer demand, based on Inverse Reinforcement Learning (IRL). We develop a version of the Maximum Entropy IRL that leads to a highly tractable model formulation that amounts to low-dimensional convex optimization in the search for optimal model parameters. Using simulations of consumer demand, we show that observational noise for identical customers can be easily confused with an apparent consumer heterogeneity.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Understanding customer choices, demand and preferences, with customers being consumers or firms, is an eternal theme in the marketing literature. In particular, structural models for marketing build models of consumers or firms by modelling them as utility-maximizing rational agents (see e.g. [1] for a review). Unlike ”reduced-form” (purely statistical) models, structural models aim to dissect true consumer choices and demand preferences from effects induced by particular marketing campaigns, thus enabling promoting new products and offers, whose attractiveness to consumer could then be assessed based on the learned consumer utility.

In particular, in the area of consumer demand research, one can distinguish between static demand and dynamic demand. This paper deals with learning a consumer demand utility function in a dynamic, multi-period setting, where customers can be both strategic and non-strategic in choosing their optimal consumption over some pre-defined period of time (a week, a month, a year etc.). Such setting is relevant for marketing different recurrent utility-like plans and services such as cloud computing plans, internet data plans, utility plans (electricity, gas, phone), and so on.

Structural models approach such problems by modeling forward-looking consumers as rational agents maximizing their streams of expected utilities of consumption over a planning horizon rather than their one-step utility. Structural models typically specify a model for a consumer utility, and then estimate such model using methods of Dynamic Programming and Stochastic Optimal Control. The fact that such models are typically computationally heavy, as they often involve a repeated solution of a Bellman optimality equation, is one of the main stumbling blocks for a wide industrial-level deployment of structural models.

We propose an alternative approach to the problem of learning consumer demand utility in a dynamic, multi-period setting, which is based on Inverse Reinforcement Learning (IRL). While IRL is widely used in robotics over years [2], more recently it has been applied in other areas as well, in particular to study human behaviour, see e.g. [3]. We are not aware, however, of any literature that would apply Inverse Reinforcement Learning specifically for problems in marketing.

The main contribution of this paper is a new version of the Maximum Entropy IRL method (Ziebart 2008), that leads to a highly tractable convex optimization problem for optimal model parameters. Our model enables an easy simulation, which makes it possible to use it to study finite-sample properties of estimators for optimal parameters of the consumer utility. In particular, we use simulations to demonstrate that due to finite-sample effects, consumers with identical demand utilities might easily be mistaken for heterogeneous consumers.

1.1 Related work

Learning customers preferences given their observed behaviour is an active research topic for psychology, marketing, statistical decision making, optimal control, and Artificial Intelligence (AI) communities. Depending on the field, it is usually referred to as the problem of customer choice in the marketing and psychology literature, preference elicitation in the statistical decision making literature, and Inverse Reinforcement Learning in the AI literature. In the specific context of learning the consumer dynamic demand utility, previous research largely follows the Stochastic Optimal Control (SOC) approach. In particular, a recent paper by Xu et. al.

[4] develops a structural SOC-based model for estimation of mobile phone users’ preferences using their observed data daily consumption. On the side of Inverse Reinforcement Learning, our framework is rooted in The Maximum Entropy IRL (MaxEnt-IRL) [5, 6] method. Other relevant references to the Maximum Entropy IRL are Refs. [7, 8, 9].

1.2 Outline of our approach

Similarly to Xu et. al. [4], a framework presented here focuses on consumption data. While our method can be applied to a number of different business settings as outlined in the introduction (such as cloud plans, data plans, utility plans etc.), we follow Ref.  [4] and consider a consumption utility of mobile phone users, to facilitate a direct comparison with their approach.

Our model parametrizes users utility (reward) function in terms of a low number of free parameters (that include, in particular, the user price sensitivity), and then estimates these parameters given users’ histories of data consumption. Unlike Ref. [4]

, we do not follow the Stochastic Optimal Control approach, but instead rely on IRL methods developed for similar tasks in the AI and Machine Learning communities. More specifically, we develop a model based on a highly tractable version of a popular Maximum Entropy (MaxEnt) IRL method

[5, 6].

Our approach has a number of important advantages over the model of Ref. [4]. First, our model estimation is much simpler, and amounts to a convex optimization problem with 5 variables, which can be easily handled using standard off-she-shelf optimization software. This enables a very efficient numerical implementation of our model. In contrast, the model of Xu et. al. relies on Monte Carlo for the model estimation. Second, our model is much easier to generalize, if needed, by adding additional features. Third, tractability of our model allows us to investigate the impact of finite-sample ”observational noise” on estimated model parameters. This issue was not addressed in Ref. [4] that suggested a substantial users’ heterogeneity based on estimation of their model with relatively short (9 months) history for a small number of users. Last but not least, our approach is general enough to be applied, with proper modifications, to learning of customer preferences in other similar settings as suggested above.

The rest of the paper is organized as follows. In Sect. 2 we present our model. In Sect. 3, we show how the estimated model can be used for counterfactual simulations and design of marketing strategies. Sect. 4 presents numerical experiments.

2 Model formulation

2.1 User utility function

Consider a customer that purchased a single-service plan with the monthly price , initial quota , and price to be paid for the unit of consumption upon breaching the monthly quota on the plan111For plans that do not allow breaching the quota , the present formalism still applies by setting the price to infinity.. We specify a single-step utility (reward) function of a customer at time (where is a lenght of a payment period, e.g. a month) as follows:


Here is the daily consumption on day , is the remaining allowance at the start of day , and is the number of remaining days until the end of the billing cycle, and we use a short notation for any . The fourth term in Eq.(1) is proportional to the payment made by the customer once the monthly quota is exhausted. Parameter gives the price sensitivity of the customer, while parameters specify the dependence of the user reward on the state-action variables . Finally, the last term gives the reward received upon zero consumption at time (here is an indicator function that is equal to one if , and is zero otherwise). Model calibration amounts to estimation of parameters given the history of user’s consumption.

Note that the reward (1) can be equivalently written as follows (here ):



(here stands for the empirical mean of ), and the following set of basis functions is used:


Our definition of the user reward given by Eq.(1) is similar to the definition proposed in Ref. [4], but differs from it in four aspects. First, we add a possible bi-linear dependence of the reward on daily consumption and the number of the remaining days on the plan. Second, we do not scale parameter to as in Ref. [4] (this is because our framework is not, and should not be, scale-invariant). Third, we add a reward for zero consumption, given by the term in Eq.(1)222We note that our specification of no-consumption behavior is more flexible than what suggested by Xu et. al. [4], who interpret zero-consumption events as observations of a Gaussian process censored at zero. Instead, our model unties the link between the zero-consumption and non-zero consumption events at the price of introducing an additional free parameter .. Last, and most importantly, we do not add a random daily ”user shock” to the value of , as was done in Ref. [4]. The fundamental reason for the presence of such ”private user shocks” in the user utility function in the approach of Ref. [4]

is that in the setting of classical Markov decision process (MDP) problems where the dynamics are typically stochastic but both policy and rewards are


, a demonstrated sub-optimal (instead of an optimal) behavior can lead to diverging solutions for model parameters, and/or assign zero probabilities to demonstrated trajectories

333 The classical MDP problems deal with a completely observed Markov process, where an optimal deterministic policy always exists. Therefore, classical Stochastic Optimal Control methods typically work with deterministic policies..

It is exactly this problem that is addressed by introducing private (i.e. unobserved to the modeller) shocks in the structural model approach adopted in Ref. [4]: the reward function is augmented by an additional random term with some parametrized ”volatility” function (in Ref.[4], the particular form was used), while the exercised policy that gives the next value of is a deterministic function of . All model estimations in this framework are therefore based on a Monte Carlo simulation of paths of the private shocks , and then using them to generate paths of the observables at each time step.

Instead of relying on the structural models’ paradigm, we follow the ideas of the Maximum Entropy approach to Inverse Reinforcement Learning (IRL), which is probabilistic and assigns probabilities to observed paths [5, 6]. Due to its probabilistic specification, this approach does not require introducing a random shock to the utility function in order to reconcile the model with a possible sub-optimal behavior. As will be clear below, our approach has a number of advantages over the method of Ref. [4]. Most importantly, it does not need Monte Carlo to estimate parameters of the user utility, and instead relies on a straightforward Maximum Likelihood Estimation (MLE) with a convex negative log-likelihood function with 5 variables, which can be done very efficiently using the standard off-the-shelf convex optimization software. Moreover, our model enables, if needed, easily generalizations of the reward function by adding more of basis functions, while keeping the rest of the methodology intact.

2.2 Inverse Optimal Control and Inverse Reinforcement Learning

The problem of estimating the reward (traditionally referred to as the inter-temporal utility function in economics and econometrics literature) given the observed behavior is a problem of inverse optimal control. In direct optimal control, the objective is to optimize the strategy (i.e. the consumption policy ) such that the expected total reward (total utility) of the user is maximized, assuming the the dynamics of consumption are known, or estimated independently. The (direct) Reinforcement Learning (RL) addresses the same problem, but without knowledge of the dynamics and instead relying on samples from the system. In contrast, in the Inverse Optimal Control (IOC) or Inverse Reinforcement Learning (IRL) formulations, the problem is to find the reward given the observed behavior (which can be obtained in an off-line or on-line setting). While in the IOC setting the dynamics are assumed to be known, in the IRL approaches the dynamics are unknown, and only samples obtained under these dynamics are available.

Note that Ref. [4] uses a two-step structural approach to estimating the consumption model which first estimates the empirical policy and then finds structural parameters of the utility function that are consistent with this empirically ”observed” optimal policy. A similar approach that tries to simultaneously estimate both the reward and policy function compatible with this reward is sometimes employed in the IRL literature. On the contrary, our Maximum Entropy IRL model is much simpler, as in our setting the reward parameters automatically fix the user policy function, due to the fact that the cumulative consumption process is deterministic given the daily consumption.

2.3 Maximum Entropy IRL and Relative Entropy IRL

The Maximum Entropy IRL (MaxEnt-IRL) [5, 6] method is currently the most popular approach to IRL. The Maximum Entropy argument is applied in our setting to a single-step (daily) transition probability . The MaxEnt solution is to require that this distribution should match empirical counts

along such one-step paths, but otherwise should stay as close as possible to a uniform distribution. Quantitatively, the last condition is imposed as the condition of minimization of the Kallback-Leibler relative entropy between the distribution sought and a uniform distribution. We use an extension of the MaxEnt-IRL called Relative Entropy IRL

[7] which replaces the uniform distribution in the MaxEnt method by a non-uniform benchmark (or ”prior”) distribution . This produces the exponential single-step transition probability:


where is a state-dependent normalization factor


We note that most applications of MaxEnt IRL deal with multi-step trajectories as prime objects, and define the partition function on the space of trajectories. While the fist applications of MaxEnt IRL calculated exactly for small discrete state-action spaces as in [5], for large or continuous state-action spaces such calculation can only be done approximately using approximate dynamic programming, or other methods. For example, the Relative Entropy IRL approach of Bourarias et. al. [7] uses importance sampling from a reference (”background”) policy distribution to calculate 444Furthermore, the reference distribution can be adapted to the estimated path distribution, as is done in the Guided Cost Search algorithm of Ref. [9].. It is this calculation that poses the main computational bottleneck for applications of MaxEnt/RelEnt IRL methods for large or continuous state-action spaces.

In contrast to that, in our approach state-dependent normalization factors are defined per each time step. Because we trade a path-dependent ”global” partition function for a local state-dependent factor , we do not need to rely on exact or approximate dynamic programming to calculate this factor. Our method is somewhat similar to the approach of Bourarias et. al. (as it also relies on the Relative Entropy minimization), but in our case both the reference distribution and normalization factor are defined on a single time step, and calculation of amounts to computing the integral (5). As we show below, this integral can be calculated analytically with a properly chosen distribution .

To this end, we propose to use a mixture discrete-continuous distribution for the reference (”prior”) action distribution :


where stands for the Dirac delta-function, and if and zero otherwise. The continuous component

is given by a spliced Gaussian distribution


where and

are probability density functions of two truncated normal distributions defined separately for small and large daily consumption levels,

and , respectively (in particular, they both are separately normalized to one). The mixing parameter is determined by the continuity condition at :


As this matching condition may involve large values of where the normal distribution would be exponentially small, in practice it is better to use it by taking logarithms of both sides:


The ”prior” mixing-spliced distribution (6), albeit represented in terms of simple distributions, leads to potentially quite complex dynamics that make intuitive sense and appear largely consistent with observed patterns of consumption. In particular, note that Eq.(7) indicates that large fluctuations are centered around a smaller mean value than the mean value of smaller fluctuations . Both a reduction of the mean upon breaching the remaining allowance barrier and a decrease of the mean of each component with time appear quite intuitive in the current context. As will be shown below, a ”posterior” distribution inherits these properties, while also further enriching the potential complexity of dynamics555in particular, it promotes a static mixing coefficient to a state- and time-dependent variable ..

The advantage of using the mixed-spliced reference distribution (6) as a reference distribution is that the state-dependent normalization constant can be evaluated exactly with this choice:




is the cumulative normal probability distribution.

Probabilities of -steps paths (where enumerates different user-paths) are obtained as products of single-step probabilities:


Here are cumulative feature counts along the observed path :


Therefore, the total path probability in our model is exponential in the total reward along a trajectory, as in the ”classical” MaxEnt IRL approach [5], while the pre-exponential factor is computed differently as we operate with one-step, rather than path-probabilities.

Parameters defining the exponential path probability distribution (12) can be estimated by the standard Maximum Likelihood Estimation (MLE) method. Assume we have historically observed single-cycle consumption paths, and assume these path probabilities are independent666A more complex case of co-dependencies between rewards for individual customers can be considered, but we will not pursue this approach here. Note that this specification formally enables calibration at the level of an individual customer, in which case would be equal to the number of consumption cycles observed for this user. However, in practice the feasibility of single-name calibration depends on finite-sample properties of the MLE, which will be discussed in Sect. 4.. The total likelihood of observing these data is


The negative log-likelihood is therefore, after omitting the term that does not depend on 777Note that still depends on , see Eq.(5)., and rescaling by ,


Given an initial guess for the optimal parameter , we can also consider a regularized version of the negative log-likelihood:


where is a regularization parameter, and or stand for the - and -norms, respectively. The regularization term can also be given a Bayesian interpretation as the contribution of a prior distribution on when the MLE estimation (14) is replaced by a Bayesian Maximum A-Posteriori (MAP) estimation.

As is well known, exponential models like (12) give rise to convex negative log-likelihood functions, therefore our final objective function (16) is convex in parameters (as can also be verified by a direct calculation), and therefore has a unique solution for any value of and . This ensures stability of the calibration procedure and a smooth evolution of estimated model parameters between individual customers or between groups of customers.

2.4 Computational aspects

The regularized negative log-likelihood function (16) can be minimized using a number of algorithms for convex optimization. If (i.e. no regularization is used), or , the objective function is differentiable, and gradient-based methods can be used to calibrate parameters . When and the -regularization is used, the objective function is non-differentiable at zero, which can be addressed by using the Orhant-Wise variant of the L-BFGS algorithm, as suggested in Ref. [8].

2.5 Possible extensions for different payment schemes

So far, we assumed a pricing scheme where a customer pays an upfront price in the beginning of a month, has an initial quota of ,, and pays a fixed price for a unit of consumption once the quota is spent before the end of the month. In practice, there may exist a number of modifications to such pricing scheme. First, some service providers may not allow for additional consumption beyond the quota , so that customers who breach it would only be given a minimally required level of service, for example a low speed access. As was mentioned above, this case can be treated in our framework by taking the limit in the formulae above.

Other service/pricing schemes would require further adjustments to the model. In particular, in addition to a ”main” monthly plan, customers might have access to different plan adjustments and extensions available once the monthly quota is exhausted. This could be handled in our framework by making the state dynamics fully stochastic, rather than locally deterministic as before:

where is an extra quota that can be added to the plan at a cost . As such irregular adjustments would probably be only made only a few (or zero) days in a month, would be equal zero at most days in a month, and therefore can again be modeled as a mixture of a delta-function at zero and a discrete (or continuous, depending on range of options offered by the service provider) distribution. A mixing weight of this distribution can depend on the current remaining quota , remaining days to the end of a payment period, and possibly some other additional factors. Simultaneously, the reward function (1) would need to be adjusted by subtracting an additional term :

While such extensions of the model are possible, we leave them for a future research, and concentrate on the basic setting presented above in our numerical experiments below.

3 Counterfactual simulations

3.1 Action probabilities

After the model parameters are estimated using the MLE method of Eq.(15) or (16), the model can be used for counterfactual simulations of total user rewards assuming that users adopt plans with different upfront premia , prices , and initial quota . To this end, note that given the daily consumption and the previous values , the next values are deterministic: . Therefore, in our model path probabilities are solely defined by action probabilities, and the probability density of different actions at time can be obtained from a one-step probability . Using Eqs.(1) and (4), this gives:


Using the explicit form of a mixture discrete-continuous prior distribution given by Eq.(6), we can express the ”posterior” distribution in the same form:


where the mixture weight becomes state- and time-dependent:


(here we used Eq.(10)), and the spliced Gaussian component is


where the weight can be obtained using Eqs.(17) and (10). After some algebra, this produces the following formula


where functions , are defined above in Eqs.(2.3). The ratio can be equivalently represented in the following form:


It can be checked by a direct calculation that Eq.(21) with the ratio given by Eq.(22) coincides with the formula for the weight that would be obtained from a continuity condition at if we started directly with Eq.(20). This would produce, similarly to Eq.(9),


The fact that two expressions (21) and (23) coincide means that the ”posterior” distribution is continuous at as long as the prior distribution is continuous there. Along with continuity at , the optimal (or ”posterior”) action distribution has the same mixing discrete-spliced Gaussian structure as the reference (”prior”) distribution

, while mixing weights, means and variances of the component distributions are changed. Such structure-preserving property of our model is similar in a sense to a structure-preservation property of conjugated priors in Bayesian analysis. Note that simulation from the spliced Gaussian distribution (


) is only slightly more involved than simulation from the standard Gaussian distribution. This involves first simulating a component of the spliced distribution, and then simulating a truncated normal random variable from this distribution. Different consumption paths are obtained by a repeated simulation from the mixing distribution (

18), along with deterministic updates of the state variables . Examples will be presented below in Sect. 4.

3.2 Total expected utility of a plan

Given a consumption plan for a particular service with the monthly fee , initial allowance and price (where ), the total expected utility at the start of the plan is


If a given customer picked plan among all possible plans, and we assume the customer acted rationally, this produces a set of inequalities


which equivalently can be represented as a set of inequalities for parameter :


Depending on the specification of a consumption plan, this relation can be used for either a verification (or improvement) of an estimate of obtained from the MLE procedure presented above, or alternatively as the only source for calibration of . In particular, some service providers do not permit any additional consumption beyond the plan limit. While this can formally be represented as a particular case of the formalism presented above in the limit , it also would mean that any consumption beyond a quota limit would not be present in data, and hence the price sensitivity parameter could not be learned from the MLE procedure. The only way to infer in such case would be to rely on inequalities (26) which provide a low bound for this parameter. Expected rewards that appear in the numerator of Eq.(26) should be calculated by simulation after other model parameters are estimated using the MLE.

3.3 Counterfactual simulation for promotion design

For the promotion design of possible upgrade offers (), one can rely on a counter-factual analysis of a future customer behavior upon different plan upgrade scenarios. We can simulate future consumption paths, and then again use Eq.(24) which we repeat here for convenience, to calculate the expected utility of future consumption:


Different consumption plans can therefore be compared quantitatively, and sorting them in the decreasing order in (with ) reveals their decreasing attractiveness to the customer in terms of their total expected utilities.

4 Numerical experiments

4.1 Simulation of consumptions paths

To test our model, we generate artificial data from the model by simulating paths of daily consumption from the mixing-spliced distribution Eq.(18) as described at the end of Sect. 3.1.

The simulation of daily consumption is illustrated in Fig. 1, while the resulting trajectories for remaining allowance are shown in Fig. 2, where we pick the following values of model parameters: , , . In addition, we set , and .

Note that consumption may vary quite substantially from one month to another (e.g. a customer can run out of the quota at about 80% of the time period, or can have a residual unused quota at the end of the month) purely due to the observational noise, even though the utility function stays the same.

Figure 1: Simulated daily consumption
Figure 2: Simulated remaining allowance

4.2 Finite-sample properties of MLE estimators

While the Maximum Likelihood Estimation (MLE) is known to provide asymptotically unbiased results for estimated model parameters, in practice we have to deal with data that have limited length of history at the level of individual customers. For example, the structural model of Ref. [4] was trained on 9 months of data for 1000 customers. While the number of customers to be included for analysis can potentially be increased by collecting more data, collecting long individual-level consumption histories might be more difficult due to a number of factors such as e.g. a customer mobility.

In view of such potential limitations with availability of long time series for service consumption, it is important to investigate finite-sample properties of the MLE estimators in the setting of our model. In particular, note that even if two customers have the same ”true” model parameters, their finite-sample MLE estimates would be in general different for these customers.

Therefore, ability of the model to differentiate between individual customers hinges upon the size of the bias and variance of its MLE estimator in realistically expected settings regarding the amount of data available for analysis. We note that the authors of Ref. [4] reported a substantial heterogeneity of estimated model parameters for their dataset of 9 months of observations for 1000 users, however they did not address the finite-sample properties of their estimators, thus leaving out a simplest interpretation of their results as due to ”observational noise” in their estimators that would be observed even for a perfectly homogeneous set of customers.

We have estimated the ”empirical” distribution of MLE estimators for our model by repeatedly sampling months of consumption history, which is done times, while keeping the model parameters fixed as per above. For each model parameter, we compute a histogram of its estimated values.

The results are presented in Figs. 3- 5, where we show the resulting histograms for and 1000 months of data, respectively, while keeping the number of experiments for all graphs. Note that for all parameters except

, the standard deviation of the MLE estimate for

is nearly equal its mean. This implies that two users with 10 months of daily observations can hardly be differentiated by the model unless their implied parameters differ by a factor of two or more.

This might cast some doubts on a model-implied customer heterogeneity suggested in a similar setting in Ref. [4], and suggests that some, if not all, of this heterogeneity can simply be explained by a finite-sample noise of a model estimation procedure, while all customers are actually undistinguishable from the model perspective.

On the other hand, one can see how both the bias and variance of MLE estimators decrease, as they should, with an increased span of the observation period from 10 user-months to 1000 user-months. These results suggest that in practice, the model should be calibrated using groups of customers with a similar consumption behavior. While the problem of finding such groups is outside of scope of this work, this task can be addressed using available techniques for clustering time series.

Figure 3: Distributions of MLE estimators for months of data
Figure 4: Distributions of MLE estimators for months of data
Figure 5: Distributions of MLE estimators for months of data

5 Summary

We have presented a very tractable version of Maximum Entropy Inverse Reinforcement Learning (IRL) for a dynamic consumer demand estimation, that can be applied technique for designing appropriate marketing strategies for new products and services. The same approach can be applied, upon proper modifications to similar problems in marketing and pricing of recurrent utility-like services such as cloud plans, internet plans, electricity and gas plans, etc. The model enables easy simulations, which is helpful for conducting counter-factual experiments. On the IRL/Machine Learning side, unlike most of other versions of the Maximum Entropy IRL, our model does not have to solve a Bellman optimality equation even once. The model estimation in our approach amounts to convex optimization in a low-dimensional space, which can be solved using a standard off-the-shelf optimization software. This is much easier computationally than structural models that typically rely on a Monte Carlo simulation for model parameter estimation.


  • [1] P. Chintagunta, T. Erdem, P. E. Rossi, and M. Wedel, ”Structural Modeling in Marketing: Review and Assessment”, Marketing Science, Vol. 25, no.6 (2006), pp. 604-616.
  • [2] J. Kober, J. A. Bagnell, and J. Peters, ”Reinforcement Learning in Robotics: a Survey”, International Journal of Robotic Research, vol. 32, no. 11 (2013), pp. 1238-1278.
  • [3] S. Liu, M. Araujo, E. Brunskill, R. Rosetti, J. Barros, and R. Krishnan, ”Undestanding Sequential Decisions via Inverse Reinforcement Learning”, 2013 IEEE 14th International Conference on Mobile Data Management.
  • [4] L. Xu, J.A. Smith, Y. Hu, Y. Cheng, and Y. Zhu, ”A Dynamic Structural Model for Heterogeneous Mobile Data Consumption and Promotion Design” (2015), working paper, available at
  • [5] Brian D. Ziebart et. al., ”Maximum Entropy Inverse Reinforcement Learning” (2008), AAAI, p. 1433-1438 (2008).
  • [6] B.D. Ziebart, J.A. Bagnell, and Anind K. Dey, ”The Principle of Maximum Causal Entropy for Estimating Interacting Processes”, IEEE Transactions on Information Theory, (2012).
  • [7] A. Boularias, J. Kober, and J. Peters, ”Relative Entropy Inverse Reinforcement Learning” (2011).
  • [8] , M. Kalakrishnan, P. Pastor, L. Righetti, and S. Schaal, ” Learning Objective Functions for Manipulations”, in International Conference on Robotics and Automation (ICRA), 2013.
  • [9] C. Finn, S. Levine, and P. Abbeel, ”Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization” (2016).