1 Introduction
The model-based approach to reinforcement learning consists of learning an internal model of the environment and planning with the learned model (Sutton & Barto, 1998). The main promise of the model-based approach is data efficiency: the ability to perform policy improvements with a relatively small number of environmental interactions.
Although the model-based approach is well understood in the tabular case (Kaelbling et al., 1996; Sutton & Barto, 1998), the extension to the approximate setting is difficult. Models usually have nonzero generalization error due to limited training samples. Moreover, the model learning problem can be unrealizable, leading to an imperfect model with irreducible error (Ross & Bagnell, 2012; Talvitie, 2014). It has also been shown that such small modeling errors can compound after multiple steps and degrade the policy learned using the model, a phenomenon sometimes referred to as compounding error (Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018).
One way of addressing this problem is to learn a model that is tailored to the specific planning algorithm we intend to use. That is, even though the model is imperfect, it is useful for the planning algorithm that is going to leverage it. To this end, Farahmand et al. (2017) proposed an objective function for model-based RL that captures the structure of the value function during model learning to ensure that the model is useful for Value Iteration. Learning a model using this loss, known as the value-aware model learning (VAML) loss, empirically improved upon a model learned using the maximum-likelihood objective, thus providing a promising direction for learning useful models in the approximate setting.
More specifically, VAML minimizes the maximum Bellman error given the learned model, the MDP dynamics, and an arbitrary space of value functions. As we will show, computing the Wasserstein metric involves a similar maximization problem, but over a space of Lipschitz functions. Under certain assumptions, we prove that the value function of an MDP is Lipschitz. Therefore, minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric.
2 Background
2.1 MDPs
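We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$. Here, $\mathcal{S}$ denotes a state space and $\mathcal{A}$ denotes an action set. The functions $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $T: \mathcal{S} \times \mathcal{A} \to \Pr(\mathcal{S})$ denote the reward and transition dynamics. Finally, $\gamma \in [0, 1)$ is the discount rate.
2.2 Lipschitz Continuity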
We make use of the notion of “smoothness” of a function as quantified below.
Definition 1.
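Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, each consisting of a space and a distance metric, a function $f: M_1 \to M_2$ is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as
$$K_{d_1,d_2}(f) := \sup_{s_1 \neq s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}, \tag{1}$$
is finite.
Equivalently, for a Lipschitz $f$,
$$\forall s_1, s_2 \in M_1: \quad d_2\big(f(s_1), f(s_2)\big) \le K_{d_1,d_2}(f)\, d_1(s_1, s_2).$$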
Note that the input and output of $f$ can generally be scalars, vectors, or probability distributions. A Lipschitz function $f$ is called a nonexpansion when $K_{d_1,d_2}(f) \le 1$ and a contraction when $K_{d_1,d_2}(f) < 1$. We also define Lipschitz continuity over a subset of inputs:
Definition 2.
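A function $f: M_1 \times \mathcal{A} \to M_2$ is uniformly Lipschitz continuous in $\mathcal{A}$ if
$$K^{\mathcal{A}}_{d_1,d_2}(f) := \sup_{a \in \mathcal{A}} \; \sup_{s_1 \neq s_2 \in M_1} \frac{d_2\big(f(s_1, a), f(s_2, a)\big)}{d_1(s_1, s_2)} \tag{2}$$
is finite.
Note that the metric $d_1$ is still defined only on $M_1$. Below we also present two useful lemmas.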
Lemma 1.
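(Composition Lemma) Define three metric spaces $(M_1, d_1)$, $(M_2, d_2)$, and $(M_3, d_3)$. Define Lipschitz functions $f: M_2 \to M_3$ and $g: M_1 \to M_2$ with constants $K_{d_2,d_3}(f)$ and $K_{d_1,d_2}(g)$. Then, $h := f \circ g: M_1 \to M_3$ is Lipschitz with constant $K_{d_1,d_3}(h) \le K_{d_2,d_3}(f)\, K_{d_1,d_2}(g)$.
Proof.
For any $s_1, s_2 \in M_1$, applying the Lipschitz property of $f$ and then of $g$ gives
$$d_3\big(f(g(s_1)), f(g(s_2))\big) \le K_{d_2,d_3}(f)\, d_2\big(g(s_1), g(s_2)\big) \le K_{d_2,d_3}(f)\, K_{d_1,d_2}(g)\, d_1(s_1, s_2).$$
Dividing by $d_1(s_1, s_2)$ and taking the supremum over $s_1 \neq s_2$ yields the claim. ∎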
Lemma 2.
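(Summation Lemma) Define two vector spaces $(M_1, \|\cdot\|)$ and $(M_2, \|\cdot\|)$. Define Lipschitz functions $f: M_1 \to M_2$ and $g: M_1 \to M_2$ with constants $K_{\|\cdot\|,\|\cdot\|}(f)$ and $K_{\|\cdot\|,\|\cdot\|}(g)$. Then, $h := f + g: M_1 \to M_2$ is Lipschitz with constant $K_{\|\cdot\|,\|\cdot\|}(h) \le K_{\|\cdot\|,\|\cdot\|}(f) + K_{\|\cdot\|,\|\cdot\|}(g)$.
Proof.
For any $s_1, s_2 \in M_1$, the triangle inequality gives
$$\|h(s_1) - h(s_2)\| \le \|f(s_1) - f(s_2)\| + \|g(s_1) - g(s_2)\| \le \big(K_{\|\cdot\|,\|\cdot\|}(f) + K_{\|\cdot\|,\|\cdot\|}(g)\big)\, \|s_1 - s_2\|.$$
Dividing by $\|s_1 - s_2\|$ and taking the supremum over $s_1 \neq s_2$ yields the claim. ∎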
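As a quick numerical illustration of Definition 1 and the two lemmas, the following minimal sketch estimates Lipschitz constants of scalar functions on a grid. The helper `lipschitz_estimate` and the test functions `f` and `g` are hypothetical choices made for this example and do not appear in the original text. A grid estimate is only a lower bound on the true constant, so the composition check uses the known true constant of `f`.

```python
import math
from itertools import combinations

def lipschitz_estimate(fn, points):
    """Largest slope |fn(x) - fn(y)| / |x - y| over all pairs of grid points.

    This lower-bounds the Lipschitz constant of Definition 1 when both
    metrics are the absolute difference on the real line.
    """
    return max(abs(fn(x) - fn(y)) / abs(x - y) for x, y in combinations(points, 2))

# Hypothetical test functions with known constants on the real line.
def f(x):
    return math.sin(2.0 * x)   # true Lipschitz constant: 2

def g(x):
    return 0.5 * abs(x)        # true Lipschitz constant: 0.5

K_F_TRUE = 2.0
grid = [-3.0 + 0.02 * i for i in range(301)]

k_f = lipschitz_estimate(f, grid)
k_g = lipschitz_estimate(g, grid)
k_comp = lipschitz_estimate(lambda x: f(g(x)), grid)   # h = f o g
k_sum = lipschitz_estimate(lambda x: f(x) + g(x), grid)  # h = f + g

print("composition:", k_comp, "<=", K_F_TRUE * k_g)
print("summation:  ", k_sum, "<=", k_f + k_g)
```

Both printed inequalities should hold up to floating-point error, mirroring the Composition and Summation Lemmas.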
2.3 Distance Between Distributions
We require a notion of difference between two distributions quantified below.
Definition 3.
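Given a metric space $(M, d)$ and the set $\mathbb{P}(M)$ of all probability measures on $M$, the Wasserstein metric (or the 1st Kantorovich metric) between two probability distributions $\mu_1$ and $\mu_2$ in $\mathbb{P}(M)$ is defined as
$$W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \iint j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1, \tag{3}$$
where $\Lambda$ denotes the collection of all joint distributions $j$ on $M \times M$ with marginals $\mu_1$ and $\mu_2$ (Vaserstein, 1969).
Wasserstein is linked to Lipschitz continuity using duality:
$$W(\mu_1, \mu_2) = \sup_{f:\, K_{d, d_{\mathbb{R}}}(f) \le 1} \int \big(f(s)\mu_1(s) - f(s)\mu_2(s)\big)\, ds, \tag{4}$$
where $d_{\mathbb{R}}$ denotes the absolute-difference metric on $\mathbb{R}$.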
This equivalence is known as Kantorovich-Rubinstein duality (Kantorovich & Rubinstein, 1958; Villani, 2008). Sometimes referred to as the “Earth Mover’s distance”, Wasserstein has recently become popular in machine learning, notably in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017). We also define the Kullback-Leibler divergence (simply KL) as an alternative measure of difference between two distributions:
$$KL(\mu_1 \,\|\, \mu_2) := \int \mu_1(s) \log \frac{\mu_1(s)}{\mu_2(s)}\, ds.$$
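To make the contrast between the two measures concrete, here is a minimal sketch for discrete distributions on a one-dimensional support. It assumes the metric $d(x, y) = |x - y|$, for which the Wasserstein metric reduces to the area between the two cumulative distribution functions. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
import math

def wasserstein_1d(p, q, xs):
    """Wasserstein metric between discrete distributions p and q on the
    sorted support xs, using the CDF-difference formula for d(x, y) = |x - y|."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (xs[i + 1] - xs[i])
    return total

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

xs = [0.0, 1.0, 2.0, 3.0]
mu = [0.7, 0.1, 0.1, 0.1]
near = [0.1, 0.7, 0.1, 0.1]   # mass 0.6 moved by one unit
far = [0.1, 0.1, 0.1, 0.7]    # the same mass moved by three units

w_near, w_far = wasserstein_1d(mu, near, xs), wasserstein_1d(mu, far, xs)
kl_near, kl_far = kl_divergence(mu, near), kl_divergence(mu, far)

print("Wasserstein:", w_near, w_far)
print("KL:         ", kl_near, kl_far)
```

In this example KL assigns the same value to both perturbations, while Wasserstein grows with the distance the mass is moved, because only Wasserstein takes the metric on the underlying space into account.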
3 Value-Aware Model Learning (VAML) Loss
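The basic idea behind VAML (Farahmand et al., 2017) is to learn a model tailored to the planning algorithm that intends to use it. Since Bellman equations (Bellman, 1957) are at the core of many RL algorithms (Sutton & Barto, 1998), we assume that the planner uses the following Bellman equation:
$$Q(s, a) = R(s, a) + \gamma \int T(s' \mid s, a)\, f\big(Q(s', \cdot)\big)\, ds',$$
where $f$ can generally be any arbitrary operator (Littman & Szepesvári, 1996) such as max. We also define:
$$v(s) := f\big(Q(s, \cdot)\big).$$
A good model $\hat{T}$ could then be thought of as the one that minimizes the error:
$$l(T, \hat{T})(s, a) := \Big| R(s, a) + \gamma \int T(s' \mid s, a)\, v(s')\, ds' - R(s, a) - \gamma \int \hat{T}(s' \mid s, a)\, v(s')\, ds' \Big| = \gamma\, \Big| \int \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, v(s')\, ds' \Big|.$$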
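Note that minimizing this objective requires access to the value function in the first place, but we can obviate this need by leveraging Hölder’s inequality:
$$l(T, \hat{T})(s, a) \le \gamma\, \big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1\, \| v \|_\infty.$$
Further, we can use Pinsker’s inequality to write:
$$\big\| T(\cdot \mid s, a) - \hat{T}(\cdot \mid s, a) \big\|_1 \le \sqrt{2\, KL\big(T(\cdot \mid s, a) \,\|\, \hat{T}(\cdot \mid s, a)\big)}.$$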
This justifies the use of maximum likelihood estimation for model learning, a common practice in model-based RL (Bagnell & Schneider, 2001; Abbeel et al., 2006; Agostini & Celaya, 2010), since maximum likelihood estimation is equivalent to empirical KL minimization.
However, there exists a major drawback with the KL objective, namely that it ignores the structure of the value function during model learning. As a simple example, if the value function is constant throughout the state space, any randomly chosen model will, in fact, yield zero Bellman error. Yet a model learning algorithm that ignores the structure of the value function can potentially require many samples to provide any guarantee about the performance of the learned policy.
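The chain of bounds above, and the constant-value-function example, can be checked numerically for a single state-action pair with a finite set of next states. The sketch below is only illustrative: the distributions, the value vector, and all function names are hypothetical, and the error is taken to be $\gamma$ times the absolute difference of expected next-state values.

```python
import math

def value_aware_error(t, t_hat, v, gamma):
    """gamma * |sum_s' (T(s') - T_hat(s')) v(s')| for one fixed (s, a) pair."""
    return gamma * abs(sum((p - q) * vi for p, q, vi in zip(t, t_hat, v)))

def l1_distance(t, t_hat):
    """L1 distance between two discrete next-state distributions."""
    return sum(abs(p - q) for p, q in zip(t, t_hat))

def kl_divergence(t, t_hat):
    """KL(T || T_hat) in nats; assumes t_hat[i] > 0 wherever t[i] > 0."""
    return sum(p * math.log(p / q) for p, q in zip(t, t_hat) if p > 0.0)

gamma = 0.9
t = [0.7, 0.2, 0.1]        # true next-state distribution (hypothetical)
t_hat = [0.5, 0.3, 0.2]    # learned model (hypothetical)
v = [1.0, 3.0, -2.0]       # value of each next state (hypothetical)

v_max = max(abs(vi) for vi in v)
error = value_aware_error(t, t_hat, v, gamma)
holder_bound = gamma * l1_distance(t, t_hat) * v_max
pinsker_bound = gamma * math.sqrt(2.0 * kl_divergence(t, t_hat)) * v_max

# With a constant value function the model error is invisible to the planner,
# even though the KL divergence between T and T_hat is strictly positive.
constant_error = value_aware_error(t, t_hat, [5.0, 5.0, 5.0], gamma)

print(error, "<=", holder_bound, "<=", pinsker_bound)
print("error under a constant value function:", constant_error)
```

The gap between `error` and the two bounds illustrates how loose the KL surrogate can be when the value function has structure that the bound ignores.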
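Consider the objective function $l(T, \hat{T})$, and notice again that $v$ itself is not known, so we cannot directly optimize for this objective. Farahmand et al. (2017) proposed to search for a model that results in the lowest error given all possible value functions belonging to a specific class $\mathcal{F}$:
$$L(T, \hat{T})(s, a) := \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, v(s')\, ds' \Big|^2. \tag{5}$$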
Note that minimizing this objective is shown to be tractable if, for example, $\mathcal{F}$ is restricted to the class of exponential functions. Observe that the VAML objective (5) is similar to the dual of Wasserstein (4); the main difference is the space of functions over which the supremum is taken. In the next section we show that, under certain conditions, even these two spaces coincide.
4 Lipschitz Generalized Value Iteration
We show that solving for a class of Bellman equations yields a Lipschitz value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) with arbitrary backup operators. We make use of the following lemmas.
Lemma 3.
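Given a nonexpansion $f: \mathcal{S} \to \mathbb{R}$:
$$K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}\Big(\int T(s' \mid s, a)\, f(s')\, ds'\Big) \le K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T).$$
Proof.
Starting from the definition, we write:
$$\begin{aligned}
K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}\Big(\int T(s' \mid s, a)\, f(s')\, ds'\Big)
&= \sup_{a} \sup_{s_1 \neq s_2} \frac{\Big| \int \big(T(s' \mid s_1, a) - T(s' \mid s_2, a)\big)\, f(s')\, ds' \Big|}{d_{\mathcal{S}}(s_1, s_2)} \\
&\le \sup_{a} \sup_{s_1 \neq s_2} \frac{\sup_{g:\, K_{d_{\mathcal{S}}, d_{\mathbb{R}}}(g) \le 1} \int \big(T(s' \mid s_1, a) - T(s' \mid s_2, a)\big)\, g(s')\, ds'}{d_{\mathcal{S}}(s_1, s_2)} \\
&= \sup_{a} \sup_{s_1 \neq s_2} \frac{W\big(T(\cdot \mid s_1, a), T(\cdot \mid s_2, a)\big)}{d_{\mathcal{S}}(s_1, s_2)} = K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T).
\end{aligned}$$
The inequality holds because both $f$ and $-f$ are nonexpansions, and the equality that follows is the Kantorovich-Rubinstein duality (4).
∎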
Lemma 4.
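The following operators are nonexpansions ($K_{\|\cdot\|_\infty, d_{\mathbb{R}}}(\cdot) \le 1$):
1. $\max(x)$
2. $\text{mean}(x)$
3. $\epsilon\text{-greedy}(x) := \epsilon\, \text{mean}(x) + (1 - \epsilon) \max(x)$
4. $\text{mm}_\beta(x) := \frac{1}{\beta} \log\Big(\frac{1}{n} \sum_{i=1}^{n} e^{\beta x_i}\Big)$
Proof.
The first two operators are shown to be nonexpansions by Littman & Szepesvári (1996). The third follows from the first two, since a convex combination of nonexpansions is a nonexpansion. For the fourth, see Asadi & Littman (2017), Nachum et al. (2017), and Neu et al. (2017). ∎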
We now present the main result of this paper.
Theorem.
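For any choice of backup operator $f$ outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by $\frac{K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)}$ if $\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T) < 1$.
Proof.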
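From Algorithm 1, in the $n$th round of GVI updates we have:
$$Q_{n+1}(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a)\, f\big(Q_n(s', \cdot)\big)\, ds'.$$
First observe that:
$$\begin{aligned}
K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_{n+1})
&\le K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) + \gamma\, K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}\Big(\int T(s' \mid s, a)\, f\big(Q_n(s', \cdot)\big)\, ds'\Big) && \text{(due to Summation Lemma (2))} \\
&\le K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) + \gamma\, K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\, K_{d_{\mathcal{S}}, d_{\mathbb{R}}}\big(f(Q_n(s, \cdot))\big) && \text{(due to Lemma (3))} \\
&\le K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) + \gamma\, K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\, K_{\|\cdot\|_\infty, d_{\mathbb{R}}}(f)\, K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_n) && \text{(due to Composition Lemma (1))} \\
&\le K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) + \gamma\, K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\, K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_n) && \text{(due to Lemma (4), the nonexpansion property of } f\text{)}
\end{aligned}$$
Here the second step applies Lemma 3 to the function $s \mapsto f(Q_n(s, \cdot))$ divided by its Lipschitz constant, which is a nonexpansion.
Equivalently:
$$K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_{n+1}) \le K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) \sum_{i=0}^{n} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\big)^i + \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\big)^{n+1} K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_0).$$
By computing the limit of both sides, we get:
$$\lim_{n \to \infty} K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_{n+1}) \le \lim_{n \to \infty} K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R) \sum_{i=0}^{n} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\big)^i + \lim_{n \to \infty} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\big)^{n+1} K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(Q_0) = \frac{K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)} + 0,$$
where we used the fact that
$$\lim_{n \to \infty} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)\big)^n = 0.$$
This concludes the proof. ∎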
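Now notice that as defined earlier:
$$v(s) := f\big(Q(s, \cdot)\big),$$
so as a relevant corollary of our theorem we get:
$$K_{d_{\mathcal{S}}, d_{\mathbb{R}}}(v) \le \frac{K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)} \quad \text{if } \gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T) < 1.$$
That is, solving for the fixed point of this general class of Bellman equations results in a Lipschitz state-value function.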
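The bound can be sanity-checked on a small example. The following sketch runs GVI with the max backup on a hypothetical five-state chain MDP whose states lie on the real line, and compares the Lipschitz constant of the resulting $Q$ with the ratio of the reward's Lipschitz constant to one minus $\gamma$ times the Wasserstein Lipschitz constant of the transition function. The MDP, the slip probability, and all function names are assumptions made for this illustration; they are not taken from Algorithm 1 or from the original experiments.

```python
from itertools import combinations

def wasserstein_1d(p, q, xs):
    """Wasserstein metric between discrete distributions on sorted support xs."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (xs[i + 1] - xs[i])
    return total

def uniform_lipschitz(table, xs, dist):
    """Uniform Lipschitz constant (Definition 2) of table[s][a] over states xs."""
    n_actions = len(table[0])
    return max(
        dist(table[i][a], table[j][a]) / abs(xs[i] - xs[j])
        for i, j in combinations(range(len(xs)), 2)
        for a in range(n_actions)
    )

def gvi(R, T, gamma, backup=max, n_iters=300):
    """Generalized Value Iteration with an arbitrary backup operator."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [backup(q_row) for q_row in Q]
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * v[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
    return Q

# Hypothetical chain MDP: actions move left or right, and slip (stay) w.p. 0.2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
n, gamma, slip = len(xs), 0.9, 0.2
R = [[-abs(x - 3.0)] * 2 for x in xs]   # reward peaks at x = 3 for both actions
T = []
for s in range(n):
    rows = []
    for target in (max(s - 1, 0), min(s + 1, n - 1)):
        row = [0.0] * n
        row[target] += 1.0 - slip
        row[s] += slip
        rows.append(row)
    T.append(rows)

def abs_diff(x, y):
    return abs(x - y)

k_R = uniform_lipschitz(R, xs, abs_diff)
k_T = uniform_lipschitz(T, xs, lambda p, q: wasserstein_1d(p, q, xs))
Q = gvi(R, T, gamma)
k_Q = uniform_lipschitz(Q, xs, abs_diff)
bound = k_R / (1.0 - gamma * k_T)

print("K(R) =", k_R, " K(T) =", k_T, " K(Q) =", k_Q, " bound =", bound)
```

Passing a different nonexpansion as `backup`, for example a function returning the mean of its input list, should leave the bound satisfied as well.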
5 Equivalence Between VAML and Wasserstein
We now show the main claim of the paper, namely that minimizing the VAML objective is the same as minimizing the Wasserstein metric.
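Consider again the VAML objective:
$$L(T, \hat{T})(s, a) = \sup_{v \in \mathcal{F}} \Big| \int \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, v(s')\, ds' \Big|^2,$$
where $\mathcal{F}$ can generally be any class of functions. From our theorem, however, the space of value functions should be restricted to Lipschitz functions. Moreover, it is easy to design an MDP and a policy such that a desired Lipschitz value function is attained.
This space $\mathcal{L}_C$ can then be defined as follows:
$$\mathcal{L}_C = \big\{ f : K_{d_{\mathcal{S}}, d_{\mathbb{R}}}(f) \le C \big\},$$
where
$$C = \frac{K^{\mathcal{A}}_{d_{\mathcal{S}}, d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}}, W}(T)}.$$
So we can rewrite the VAML objective $L$ as follows:
$$L(T, \hat{T})(s, a) = \sup_{f \in \mathcal{L}_C} \Big| \int f(s') \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, ds' \Big|^2 = \sup_{g \in \mathcal{L}_1} \Big| \int C g(s') \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, ds' \Big|^2 = C^2 \sup_{g \in \mathcal{L}_1} \Big| \int g(s') \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, ds' \Big|^2.$$
It is clear that a function $g$ that maximizes the Kantorovich-Rubinstein dual form:
$$\sup_{g \in \mathcal{L}_1} \int g(s') \big(\hat{T}(s' \mid s, a) - T(s' \mid s, a)\big)\, ds' =: W\big(\hat{T}(\cdot \mid s, a), T(\cdot \mid s, a)\big)$$
will also maximize:
$$\Big| \int g(s') \big(T(s' \mid s, a) - \hat{T}(s' \mid s, a)\big)\, ds' \Big|^2.$$
This is due to the fact that $g \in \mathcal{L}_1 \Rightarrow -g \in \mathcal{L}_1$, and so computing the absolute value or squaring the term will not change the $\arg\max$ in this case.
As a result:
$$L(T, \hat{T})(s, a) = \Big( C\, W\big(\hat{T}(\cdot \mid s, a), T(\cdot \mid s, a)\big) \Big)^2.$$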
This highlights a nice property of Wasserstein, namely that minimizing this metric yields a value-aware model.
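The equivalence can be illustrated on a finite one-dimensional state space, where a maximizer of the Kantorovich-Rubinstein dual can be written down explicitly: its slope between neighbouring states is $\pm 1$, with the sign determined by the gap between the two cumulative distributions. The sketch below is a hedged illustration under that one-dimensional assumption; the distributions, the constant `C`, and the function names are hypothetical.

```python
import random

def wasserstein_1d(p, q, xs):
    """Wasserstein metric between discrete distributions on sorted support xs."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (xs[i + 1] - xs[i])
    return total

def kr_witness(p, q, xs):
    """A 1-Lipschitz g on xs with sum_i g[i] * (p[i] - q[i]) = W(p, q)."""
    g, cdf_gap = [0.0], 0.0
    for i in range(len(xs) - 1):
        cdf_gap += p[i] - q[i]
        slope = -1.0 if cdf_gap > 0.0 else 1.0
        g.append(g[-1] + slope * (xs[i + 1] - xs[i]))
    return g

def pairing(values, p, q):
    """Discrete analogue of the integral of values(s') * (p(s') - q(s'))."""
    return sum(f * (pi - qi) for f, pi, qi in zip(values, p, q))

xs = [0.0, 1.0, 2.0, 3.0]
t_hat = [0.5, 0.5, 0.0, 0.0]   # learned model for a fixed (s, a)
t = [0.0, 0.0, 0.5, 0.5]       # true dynamics for the same (s, a)
C = 4.0                        # assumed Lipschitz bound on the value function

w = wasserstein_1d(t_hat, t, xs)
g = kr_witness(t_hat, t, xs)
dual_value = pairing(g, t_hat, t)
vaml_at_witness = pairing([C * gi for gi in g], t, t_hat) ** 2

# Randomly drawn C-Lipschitz functions should never beat the witness.
rng = random.Random(0)
best_random = 0.0
for _ in range(1000):
    f_vals = [0.0]
    for i in range(len(xs) - 1):
        f_vals.append(f_vals[-1] + rng.uniform(-C, C) * (xs[i + 1] - xs[i]))
    best_random = max(best_random, pairing(f_vals, t, t_hat) ** 2)

print("W =", w, " dual value =", dual_value)
print("VAML at witness =", vaml_at_witness, " (C * W)^2 =", (C * w) ** 2)
print("best random C-Lipschitz candidate =", best_random)
```

The random search is only a spot check rather than a proof, but it is consistent with the supremum over the $C$-Lipschitz class being attained at $C$ times the dual maximizer.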
6 Conclusion and Future Work
We showed that, under certain conditions, the value function of an MDP is Lipschitz. This result enabled us to draw a connection between value-aware model-based reinforcement learning and the Wasserstein metric.
We hypothesize that the value function is Lipschitz in a more general sense, and so further investigation of Lipschitz continuity of value functions should be interesting in its own right. The second interesting direction relates to the design of practical model-learning algorithms that can minimize the Wasserstein metric. Two promising approaches are the use of generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017) and approximations such as entropic regularization (Frogner et al., 2015). We leave these two directions for future work.
References
Abbeel et al. (2006) Abbeel, Pieter, Quigley, Morgan, and Ng, Andrew Y. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1–8. ACM, 2006.
Agostini & Celaya (2010) Agostini, Alejandro and Celaya, Enric. Reinforcement learning with a Gaussian mixture model. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pp. 1–8. IEEE, 2010.
Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
Asadi & Littman (2017) Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.
Asadi et al. (2018) Asadi, Kavosh, Misra, Dipendra, and Littman, Michael L. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.
Bagnell & Schneider (2001) Bagnell, J Andrew and Schneider, Jeff G. Autonomous helicopter control using reinforcement learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, volume 2, pp. 1615–1620. IEEE, 2001.
Bellemare et al. (2017) Bellemare, Marc G, Dabney, Will, and Munos, Rémi. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.
Bellman (1957) Bellman, Richard. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.
Farahmand et al. (2017) Farahmand, Amir-Massoud, Barreto, Andre, and Nikovski, Daniel. Value-Aware Loss Function for Model-based Reinforcement Learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1486–1494, 2017.
Frogner et al. (2015) Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061, 2015.
Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Kaelbling et al. (1996) Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. Reinforcement learning: A survey. J. Artif. Intell. Res., 4:237–285, 1996.
Kantorovich & Rubinstein (1958) Kantorovich, Leonid Vasilevich and Rubinstein, G Sh. On a space of completely additive functions. Vestnik Leningrad. Univ, 13(7):52–59, 1958.
Littman & Szepesvári (1996) Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310–318, 1996.
Nachum et al. (2017) Nachum, Ofir, Norouzi, Mohammad, Xu, Kelvin, and Schuurmans, Dale. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.
Neu et al. (2017) Neu, Gergely, Jonsson, Anders, and Gómez, Vicenç. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
Ross & Bagnell (2012) Ross, Stéphane and Bagnell, Drew. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
Talvitie (2014) Talvitie, Erik. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 2014, Quebec City, Quebec, Canada, July 23-27, 2014, pp. 780–789, 2014.
Vaserstein (1969) Vaserstein, Leonid Nisonovich. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.
Venkatraman et al. (2015) Venkatraman, Arun, Hebert, Martial, and Bagnell, J Andrew. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015.
Villani (2008) Villani, Cédric. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.