Abstract
We propose a new approach for solving a class of discrete decision-making problems under uncertainty with positive costs. Such problems arise in multiple and diverse fields such as engineering, economics, artificial intelligence and cognitive science, among others. Basically, an agent has to choose a single action or a series of actions from a set of options, without knowing their consequences for sure. Schematically, two main approaches have been followed: either the agent learns which option to choose in a given situation by trial and error, or the agent already has some knowledge of the possible consequences of its decisions; this knowledge is generally expressed as a conditional probability distribution. In the latter case, several optimal or suboptimal methods have been proposed to exploit this uncertain knowledge in various contexts. In this work, we follow a different approach, based on the geometric intuition of distance. More precisely, we define a goal-independent quasimetric structure on the state space, taking into account both the cost function and the transition probabilities. We then compare precision and computation time with classical approaches.
Introduction
It’s Friday evening, and you are in a hurry to get home after a hard day’s work. Several options are available. You can hail a taxi, but it’s costly and you’re worried about traffic jams, common at this time of day. Or you might go on foot, but it’s slow and tiring. Moreover, the weather forecast predicted rain, and of course you forgot your umbrella. In the end you decide to take the subway, but unfortunately, you have to wait half an hour for the train at the connecting station due to a technical incident.
Situations like this one are typical in everyday life. It is also undoubtedly a problem encountered in logistics and control. The initial state and the goal are known (precisely or according to a probability distribution). The agent has to make a series of decisions about the best transport means, taking into account both uncertainty and cost. This is what we call optimal control under uncertainty.
Note that the agent might also have an intuitive notion of some abstract distance: how far am I from home? To what extent will it be difficult or time-consuming to take a given path? The problem becomes even more difficult if you do not know precisely what state you are in. For instance, you might be caught in a traffic jam in a completely unknown neighborhood.
The problem that we propose to deal with in this paper can be viewed as sequential decision making, usually expressed as a Markov Decision Process (MDP) [Bellman1957, Howard1960, Puterman1994, Boutilier1999] or its extension to partially observable cases (POMDP) [Drake1962, Astrom1965]. Knowing the probability of switching from one state to another by performing a particular action, as well as the associated instantaneous cost, the aim is to define an optimal policy, either deterministic or probabilistic, which maps the state space to the action space in order to minimize the mean cumulative cost from the initial state to a goal (goal-oriented MDPs).
This class of problems is usually solved by dynamic programming, using the Value Iteration (VI) or Policy Iteration (PI) algorithms and their numerous refinements. Contrasting with this model-based approach, various learning algorithms have also been proposed to progressively build either a value function, a policy, or both, from trial to trial. Reinforcement learning is the most widely used, especially when the transition probabilities and cost function are unknown (model-free case), but it suffers from the same tractability problem [Sutton1998]. Moreover, one significant drawback of these approaches is that they do not take advantage of preliminary knowledge of the cost function and transition probabilities.

MDPs have generated a substantial amount of work in engineering, economics, artificial intelligence and neuroscience, among others. Indeed, in recent years, Optimal Feedback Control theory has become quite popular in explaining certain aspects of human motor behavior [Todorov2002, Todorov2004]. This kind of method results in feedback laws, which allow for closed-loop control.
However, aside from certain classes of problems with a convenient formulation, such as the Linear Quadratic case and its extensions [Stengel1986], or through linearization of the problem, achieved by adapting the immediate cost function [Todorov2009], the exact total optimal solution in the discrete case is intractable due to the curse of dimensionality [Bellman1957]. Thus, a lot of work in this field is devoted to finding approximate solutions and efficient methods for computing them.
Heuristic search methods try to speed up optimal probabilistic planning by considering only a subset of the state space (e.g. knowing the starting point and considering only reachable states). These algorithms can provide offline optimal solutions for the considered subspace [Barto1995, Hansen2001, Bonet2003].
Monte-Carlo planning methods that do not manipulate probabilities explicitly have also proven very successful for dealing with problems with large state spaces [Peret2004b, Kocsis2006].
Some methods try to reduce the dimensionality of the problem in order to avoid memory explosion, by mapping the state space to a smaller parameter space [Buffet2006, Kolobov2009] or decomposing it hierarchically [Hauskrecht1998, Dietterich1998, Barry2011].
Another family of approximation methods, which has recently proven very successful [Little2007], is "determinization". Indeed, transforming the probabilistic problem into a deterministic one optimizing another criterion allows the use of very efficient deterministic planners [Yoon2007, Yoon2008, TeichteilKonigsbuch2010].
What we propose here is rather different: we consider goal-independent distances between states. To compute the distance, we propose a kind of determinization of the problem using a one-step "mean cost per successful attempt" criterion for transitions, which can then be propagated by the triangle inequality. The obtained distance function thus confers a quasimetric structure on the state space, and can be viewed as a Value function between all pairs of states. These distances can then be used to compute an offline policy using a gradient-descent-like method.
We show that in spite of being formally suboptimal (except in the deterministic case and a particular case described below), this method exhibits several good properties. We demonstrate the convergence of the method and the possibility of computing distances using standard deterministic shortest-path algorithms. Comparison with the optimal solution is described for different classes of problems, with a particular look at problems with prisons. Prisons, or absorbing sets of states, have recently been shown to be difficult cases for state-of-the-art methods [Kolobov2012], and we show how our method naturally deals with these cases.
Materials and Methods
Quasimetric
Let us consider a dynamic system described by its state $x \in X$, and $u \in U$, the action applied at state $x$, leading to an associated instantaneous cost $c(x,u) \geq 0$. The dynamics can then be described by the Markov model:

$$P(X_{t+1} \mid X_t, U_t)$$

where the state of the system $X_t$ is a random variable defined by a probability distribution. Assuming stationary dynamics, a function $p$ exists, satisfying:

$$p(y \mid x, u) = P(X_{t+1} = y \mid X_t = x, U_t = u) \quad \forall t$$

This model enables us to capture uncertainties in the knowledge of the system's dynamics, and can be used in the Markov Decision Process (MDP) formalism. The aim is to find the optimal policy $\pi^* : X \to U$
allowing a goal state to be reached with minimum cumulative cost. The classic method of solving this is to use dynamic programming to build an optimal Value function $V$, minimizing the total expected cumulative cost, using the Bellman equation:

$$V(x) = \min_{u} \Big[ c(x,u) + \sum_{y \in X} p(y \mid x,u)\, V(y) \Big] \quad (1)$$

which can be used to specify an optimal control policy:

$$\pi^*(x) = \arg\min_{u} \Big[ c(x,u) + \sum_{y \in X} p(y \mid x,u)\, V(y) \Big] \quad (2)$$
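Equations (1) and (2) can be solved by Value Iteration. The following is a minimal sketch for a goal-oriented MDP with an absorbing, cost-free goal; the array layout (`p[u, x, y]` for transition probabilities, `c[x, u]` for costs) and the function name are our own illustrative assumptions, not notation from the paper.

```python
import numpy as np

def value_iteration(p, c, goal, n_iter=200):
    """Fixed-point iteration of the Bellman equation (Eq. 1).
    p[u, x, y] = P(y | x, u), c[x, u] = instantaneous cost (assumed layout)."""
    n_u, n_x, _ = p.shape
    V = np.zeros(n_x)
    for _ in range(n_iter):
        # Q[x, u] = c(x, u) + sum_y p(y | x, u) V(y)
        Q = c + np.einsum('uxy,y->xu', p, V)
        V = Q.min(axis=1)
        V[goal] = 0.0           # the goal state is absorbing and cost-free
    policy = Q.argmin(axis=1)   # Eq. 2: greedy policy with respect to V
    return V, policy
```

For instance, on a three-state chain where action 1 deterministically advances toward state 2 at unit cost, this returns V = [2, 1, 0] and a policy that always advances.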
In general, this method requires either a given goal state or a discount factor.
Here we propose a different approach: defining a goal-independent quasimetric structure on the state space, i.e. for each couple of states $(x,y)$ a distance function $d(x,y)$ reflecting a minimum cumulative cost.
This distance has to verify the following properties:

$$d(x,y) \geq 0, \qquad d(x,x) = 0 \quad \forall (x,y) \in X^2$$

together with the triangle inequality:

$$d(x,z) \leq d(x,y) + d(y,z)$$

Therefore, the resulting quasidistance function $d$ confers the property of a quasimetric space to $X$.
Notice that this metric need not be symmetric (in general $d(x,y) \neq d(y,x)$). It is in fact a somewhat natural property, e.g. climbing stairs is (usually) harder than going down.
By then choosing the cost function $c$, this distance can be computed iteratively (just like the Value function).
For a deterministic problem, we initialize with:

$$d_0(x,y) = \begin{cases} 0 & \text{if } x = y \\ \min_{u} \{ c(x,u) : f(x,u) = y \} & \text{if such a } u \text{ exists} \\ +\infty & \text{otherwise} \end{cases}$$

with $f$ the discrete dynamic model giving the next state $y = f(x,u)$ obtained by applying action $u$ in state $x$. Then we apply the recurrence:

$$d_{k+1}(x,y) = \min_{z \in X} \big[ d_k(x,z) + d_k(z,y) \big] \quad (3)$$
We can show that this recurrence is guaranteed to converge in finite time for a finite state-space problem.

Indeed, we first show by recurrence that for all $k$ and all $(x,y) \in X^2$:

$$d_k(x,y) \geq 0 \quad \text{and} \quad d_k(x,x) = 0$$

as $c(x,u) \geq 0$ implies $d_0(x,y) \geq 0$, and $d_0(x,x) = 0$ by definition; and if $d_k(x,y) \geq 0$ and $d_k(x,x) = 0$, then:

$$d_{k+1}(x,y) = \min_{z} \big[ d_k(x,z) + d_k(z,y) \big] \geq 0$$

and $d_{k+1}(x,x) \leq d_k(x,x) + d_k(x,x) = 0$, hence $d_{k+1}(x,x) = 0$.

Moreover, as:

$$d_{k+1}(x,y) \leq d_k(x,z) + d_k(z,y) \quad \forall z \in X$$

then in particular if we take $z = y$ we have $d_{k+1}(x,y) \leq d_k(x,y)$. Thus, for each couple $(x,y)$, $(d_k(x,y))_k$ is a decreasing monotone sequence bounded by $0$, and therefore converges. Furthermore, since costs are nonnegative, each $d_k(x,y)$ is attained by a sum of one-step distances along a path without repeated states, of which there are finitely many; the sequence thus takes values in a finite set and reaches its fixed point after a finite number of iterations.
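In practice, the recurrence (3) is a Floyd-Warshall-style all-pairs update: each pass doubles the admissible path length, so roughly $\log_2 |X|$ passes suffice. A minimal sketch (the function name and matrix layout are our own; `d0` holds the one-step distances, with 0 on the diagonal and `np.inf` where no single action connects two states):

```python
import numpy as np

def propagate(d0):
    """Iterate Eq. 3, d_{k+1}(x,y) = min_z [d_k(x,z) + d_k(z,y)],
    until the fixed point is reached."""
    d = d0.copy()
    while True:
        # d_new[x, y] = min over z of d[x, z] + d[z, y] (broadcast over z)
        d_new = np.min(d[:, :, None] + d[None, :, :], axis=1)
        if np.array_equal(d_new, d):   # fixed point: quasidistance reached
            return d
        d = d_new
```

Equivalently, since the one-step distances define a weighted graph, standard deterministic shortest-path algorithms (e.g. Dijkstra) compute the same fixed point more efficiently when distances to a single goal are needed.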
However, finding a way to initialize (more precisely, to define $d_1$) while taking uncertainty into account presents a difficulty in probabilistic cases, as we cannot use the cumulative expected cost as in the Bellman equation.
For example, we can choose:

$$d_1(x,y) = \min_{u} \frac{c(x,u)}{p(y \mid x,u)} \quad \text{for } x \neq y, \qquad d_1(x,x) = 0$$

for the first iteration, with $d_1$ as the one-step distance.
The quotient of cost over transition probability is chosen because it provides an estimate of the mean cost per successful attempt. If we attempt $n$ times the action $u$ in state $x$, the cost will be $n\,c(x,u)$ and the objective $y$ will be reached on average $n\,p(y \mid x,u)$ times. The mean cost per successful attempt is thus:

$$\frac{n\,c(x,u)}{n\,p(y \mid x,u)} = \frac{c(x,u)}{p(y \mid x,u)}$$

This choice of metric is therefore simple and fairly convenient. All the possible consequences of actions are clearly not taken into account here, inducing a huge computational gain but at the price of losing optimality. In fact, we are looking at the minimum over actions of the mean cost per successful attempt, which can be viewed as using the best mean cost while disregarding unsuccessful attempts, i.e. neglecting the probability of moving to an unwanted state.
In a onestep decision, this choice is a reasonable approximation of the optimal that takes both cost and probability into account.
This cost-probability quotient has been used before to determinize probabilistic dynamics and extract plans [Keyder2008, Barry2011, Kaelbling2011]. Here we generalize this method to construct an entire metric on the state space using the triangle inequality.
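The one-step determinization above is straightforward to compute. A minimal sketch, reusing the illustrative layout `p[u, x, y]`, `c[x, u]` and assuming strictly positive costs (so that no 0/0 arises):

```python
import numpy as np

def one_step_distance(p, c):
    """d_1(x, y) = min over u of c(x, u) / p(y | x, u), the mean cost
    per successful attempt; infinite when no action can reach y from x."""
    with np.errstate(divide='ignore'):
        # ratio[u, x, y] = c(x, u) / p(y | x, u); division by 0 gives inf
        ratio = c.T[:, :, None] / p
    d1 = ratio.min(axis=0)       # minimize over actions u
    np.fill_diagonal(d1, 0.0)    # d_1(x, x) = 0 by definition
    return d1
```

For example, an action that reaches the target with probability 0.5 at unit cost yields a one-step distance of 2: on average two attempts, each costing 1, are needed per success. The resulting matrix can then be propagated by the triangle inequality (Eq. 3) to obtain the full quasidistance.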
We also notice that, contrary to the dynamic programming approach, the quasimetric is not linked to a specific goal but instead provides a distance between any pair of states. Moreover, using this formalism, the instantaneous cost function is also totally goal independent and can more easily represent any objective physical quantity, such as consumed energy. This interesting property allows for much more adaptive control, since the goal can be changed without any recomputation. As shown in the following, it is even possible to replace the goal state by a probability distribution over states. Another interesting property of the quasidistance is that it has no local minima from the action point of view.
In fact, for any couple $(x,y)$, $(d_k(x,y))_k$ is a decreasing sequence of nonnegative numbers taking finitely many values (finite number of states), which therefore converges to a nonnegative number $d(x,y)$.
Note that if we multiply the cost function by any positive constant, the quasimetric is also multiplied by the same constant. This multiplication has no effect on the structure of the state space and leaves the optimal policy unchanged; we can therefore choose a convenient normalizing constant for the cost function.
Let $X_f(y)$ be the subset of $X$ associated with a goal $y$ such that:

$$X_f(y) = \{ x \in X : d(x,y) < +\infty \}$$

and let $X_i(y)$ be the subset of $X$ associated with the goal $y$ such that:

$$X_i(y) = \{ x \in X : d(x,y) = +\infty \}$$

The subset $X_f(y)$ is the set of states from which the goal $y$ can be reached in a finite time with a finite cost. Starting from $x \in X_i(y)$, the goal will never be reached, either because some step between $x$ and $y$ requires an action with an infinite cost, or because there is a transition probability equal to $0$.
Then the defined quasimetric admits no local minimum to a given goal $y$, in the sense that if $x \in X_f(y)$ is such that:

$$d(x,y) > 0$$

then there exists at least one state $z$, reachable from $x$ in one step with a finite one-step distance, such that:

$$d(z,y) < d(x,y)$$

A greedy descent of the quasidistance toward the goal therefore cannot get stuck in any state of $X_f(y)$ other than the goal itself.
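This no-local-minimum property suggests extracting a policy by descending the quasidistance. The sketch below is one plausible realization of the gradient-descent-like method mentioned above, not necessarily the exact construction used later in the paper; it reuses the illustrative layout `p[u, x, y]`, `c[x, u]` and assumes all states handled lie in $X_f(y)$ (finite distances to the goal):

```python
import numpy as np

def descend_policy(p, c, d, goal):
    """For each state, pick the action minimizing immediate cost plus
    the expected quasidistance of the successor state to the goal."""
    # expected_d[x, u] = sum_z p(z | x, u) d(z, goal)
    expected_d = np.einsum('uxz,z->xu', p, d[:, goal])
    return (c + expected_d).argmin(axis=1)
```

Because every non-goal state in $X_f(y)$ has a successor strictly closer to the goal, such a descent makes progress at each step without any goal-specific recomputation of the distance function.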
