1 Introduction
Road traffic, while essential to the proper functioning of a city, generates a number of nuisances, including pollution, noise, delays and accidents. It is the role of city managers, network administrators and urban planners to attempt to mitigate the negative impact of transportation by planning adequate infrastructure and policies. Most of the traffic is generated by individual travelers who seek to minimize their own travel costs without guidance from a system maximizing the overall welfare. It is thus a necessity to understand how users of the transportation system behave and choose their path in the network in order to provide planners with decision-aid tools to manage it.
This is the main motivation for the introduction of what is known in the transport demand modeling literature as the route choice problem, which seeks to predict and explain the choice of path of travelers in a network. All path choice models are based on the assumption that individuals behave rationally by minimizing a certain cost function. The models’ aim is to identify this cost function from a set of observed trajectories, which in turn makes it possible to predict the chosen paths for all origin-destination pairs.
It is desirable that path choice models satisfy several properties, such as i) scaling with the size of large urban networks in order to be efficiently used for real applications, ii) lending themselves well to behavioral interpretation, in order to, e.g., assess travelers’ value of time, and iii) yielding accurate predictions. In the transportation demand literature, the most common methodology is a probabilistic approach known as discrete choice modeling, which finds its origin in econometrics. It views path choice as a particular demand modeling problem, where the alternatives of the decision maker are the paths in the network. The principal issue with this modeling approach stems from the fact that the number of feasible paths in a real network cannot be enumerated, and we do not observe which paths are actually considered by individuals. As a result, discrete choice models based on paths currently used in the transportation literature may suffer from biased estimates and inaccurate predictions, as well as potentially long computational times.
This work is a tutorial on a modeling framework which in comparison meets the previously enumerated expectations. Dubbed “recursive choice models”, this methodology draws its efficiency from modeling the path choice problem as a parametric Markov decision process (MDP) and resorting to dynamic programming to solve its embedded shortest path problem. In this tutorial, we aim at i) facilitating understanding of the recursive model’s formulation by drawing links to related work in inverse optimization, and ii) comparing recursive models on the basis of desirable properties with the most well-known approach in the transportation demand literature, i.e., discrete choice models based on paths. We note here that this tutorial is addressed both to transportation demand researchers who are unfamiliar with recursive models and to researchers from the machine learning community who are keen to find out about state-of-the-art methods in the area of transportation science. For this purpose, we use general terminology and speak about path choice instead of route choice in the remainder of this paper.
This tutorial proposes to view this problem from a fresh perspective and makes several contributions. First, we give background and intuition on recursive models’ formulation and properties. Indeed, we contextualize discrete choice models as a probabilistic approach for what is in fact an inverse optimization problem. Through a brief overview of that literature, we motivate and throw light on the recursive formulation, which bears similarities to models for inverse reinforcement learning, but is also theoretically equivalent to a discrete choice model based on the set of all feasible paths. Second, we illustrate the advantages of the recursive model, namely consistent estimates and fast predictions, through several examples and discussions related to model estimation and prediction.
The remainder of this paper is organized as follows. In the following section, we provide a broader context with an overview of inverse shortest path problems. In Section 3, we present discrete choice models, which we liken to a probabilistic paradigm for solving inverse optimization problems with noisy data. In particular, we introduce recursive discrete choice models in Section 3.2, which provide consistent parameter estimates of the cost function. Section 4 provides an illustrative comparison of path choice probabilities under both the recursive model and a path-based discrete choice model. Section 5 discusses the issues related to the latter and demonstrates the advantages of recursive models through practical examples of both model estimation and prediction. Finally, Section 6 provides an outlook and concludes.
2 Context: from shortest path problems to path choice models
In this section, we frame the path choice problem as that of unveiling an unknown cost function from noisy shortest paths observations, which is an inverse optimization problem where the forward (inner) problem is a shortest path. We give some background on the literature on inverse optimization and we situate the problem this tutorial addresses within this context. We illustrate that there is a close connection between stochastic shortest path problems and the inverse problem with noisy data we are interested in. For the sake of clarity we start this section by introducing deterministic and stochastic shortest path problems before describing the related inverse problems.
2.1 Shortest path problems
Throughout this section, we consider a simple directed graph with a set of nodes and a set of arcs. Each arc is characterized by a tail node, where it starts, and a head node, where it ends. Arcs have an associated cost given by a cost function. A path is a sequence of arcs such that the head node of each arc is the tail node of the next.
2.1.1 The deterministic shortest path problem
The deterministic shortest path problem (DSP) in the graph is concerned with finding the path with minimum cost between an origin node and a destination node, where the cost of a path is defined by the sum of its arc costs. More often, methods developed in the literature are designed to solve the shortest path problem between a given origin and all possible destinations, or a given destination and all possible origins.
This combinatorial optimization problem has been amply studied in the literature. Its chief difficulty is the existence of a very large number of feasible paths between each node pair, which precludes proceeding by naive enumeration. The problem could be formulated and solved as a linear program; however, more efficient algorithms have been developed, relying on dynamic programming (DP). In general, DP is a methodology to solve optimization problems in dynamic (often discrete time) systems, where a decision (denoted action or control) must be taken in each state in order to minimize future additive costs over a certain time horizon (finite or infinite). The shortest path problem in the graph can be formulated as a DP problem by considering nodes as states and an arc choice as an action taken in a given state.
The Bellman principle of optimality at the core of deterministic problems states that for an optimal sequence of choices (in this case, arcs along the shortest path), each subsequence is also optimal. This makes it possible to decompose the problem and formulate a recursive expression for the optimal arc choice at a given node as well as the cost of the shortest path from that node to the destination node,
(1) 
where .
Solving (1) is however not straightforward in cyclic graphs. In this case, note that the problem is well-defined only when there are no negative cost cycles, otherwise there would be paths of cost −∞. Under this assumption, there is a shortest path which visits no node twice and hence contains at most one arc fewer than there are nodes in the graph. Bellman (1958) shows how to solve the shortest path problem by backwards induction as a deterministic finite state finite horizon optimal control problem. The DP algorithm is
(2) 
where , and is the length of the horizon. The value represents the cost of the shortest path from to using at most arcs, and in this sense is an upper bound on the cost of the shortest path. The cost of the shortest path from corresponds to . Note that the value of the shortest paths can also be found by label correction algorithms (see, e.g., Dijkstra, 1959, Floyd, 1962).
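As an illustration of the backward induction in (2), the following minimal Python sketch computes the cost of the shortest path from every node to a given destination. The toy network, its node names and arc costs, and the function name shortest_path_costs are hypothetical and chosen only for this example; the sketch assumes that there are no negative cost cycles.

```python
import math


def shortest_path_costs(arcs, nodes, destination):
    """Finite-horizon DP: cost-to-go to `destination` using at most k arcs."""
    # With zero arcs allowed, only the destination itself has a finite cost.
    value = {v: math.inf for v in nodes}
    value[destination] = 0.0
    # Without negative cost cycles, len(nodes) - 1 stages are sufficient.
    for _ in range(len(nodes) - 1):
        new_value = dict(value)
        for (tail, head), cost in arcs.items():
            if tail != destination:
                # Either keep the best path found so far or take this arc
                # and continue optimally from its head node.
                new_value[tail] = min(new_value[tail], cost + value[head])
        if new_value == value:  # no change: the recursion has converged
            break
        value = new_value
    return value


# Hypothetical toy network: arcs are keyed by (tail node, head node).
nodes = ["o", "a", "b", "d"]
arcs = {("o", "a"): 1.0, ("o", "b"): 4.0, ("a", "b"): 1.0,
        ("a", "d"): 5.0, ("b", "d"): 1.0, ("b", "a"): 1.0}
costs = shortest_path_costs(arcs, nodes, "d")
```

On this toy network the costs to the destination d are 3, 2, 1 and 0 for nodes o, a, b and d respectively. Note that the network contains the cycle a, b, a, which the finite horizon handles without special treatment.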
2.1.2 The stochastic shortest path problem
The stochastic shortest path problem (SSP), as defined by Bertsekas and Tsitsiklis (1991), is an extension of the previous problem which considers a discrete time dynamic system where a decision must be taken in each state and causes the system to move stochastically to a new state according to a transition probability distribution. This problem can be analyzed using the framework of Markov decision processes (MDPs), formally defined by:

A set of states and, for each state, a set of available actions.

The cost incurred by taking a given action in a given state and moving to the next state.

The transition probability from the current state to the next state when taking a given action.
An MDP models problems where an action must be taken in each state, with the aim of minimizing expected future discounted costs over a certain horizon. The SSP is a special case of MDP with infinite horizon, no discounting and a cost-free absorbing state, i.e., a state which is never left once reached and in which no further cost is incurred. The SSP is an infinite horizon problem since there is no upper limit on the number of arcs traversed. However, by assumption the absorbing state can be reached with probability 1 in finite time. The optimal solution of the SSP is not a path but a policy, which consists of a probability distribution over all possible actions in each state. When the optimal policy is followed, the path which is actually travelled is random but has minimum expected cost.
This stochastic problem can be solved with DP and the recursion which defines the optimal expected cost from any given state is given by the Bellman equation
(3) 
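where the optimal expected cost of the absorbing state is zero. The optimal expected cost, seen as a function of the state, is also known as the value function.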
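A minimal Python sketch of how the Bellman equation (3) can be solved by value iteration is given below. The two-action example, the data layout (each action maps to a list of possible outcomes) and the function name ssp_value_iteration are assumptions made for illustration only. The sketch returns a deterministic policy, which is a degenerate probability distribution over actions.

```python
def ssp_value_iteration(actions, absorbing, tol=1e-10, max_iter=10_000):
    """Value iteration for an undiscounted SSP with a cost-free absorbing state.

    `actions[s][a]` is a list of (next_state, probability, cost) outcomes.
    """
    value = {s: 0.0 for s in list(actions) + [absorbing]}

    def expected_cost(outcomes):
        # Expected immediate cost plus expected cost-to-go of the next state.
        return sum(p * (c + value[s2]) for s2, p, c in outcomes)

    for _ in range(max_iter):
        delta = 0.0
        for s, acts in actions.items():
            best = min(expected_cost(outcomes) for outcomes in acts.values())
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {s: min(acts, key=lambda a: expected_cost(acts[a]))
              for s, acts in actions.items()}
    return value, policy


# Hypothetical example: from state "s", the "safe" action reaches the goal at
# cost 3, while the "risky" action costs 1 per attempt and succeeds half the time.
actions = {"s": {"safe": [("goal", 1.0, 3.0)],
                 "risky": [("goal", 0.5, 1.0), ("s", 0.5, 1.0)]}}
value, policy = ssp_value_iteration(actions, "goal")
```

In this example the risky action has an expected cost of 2, which is lower than the cost of the safe action, although the number of steps actually taken is random.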
Note that the DSP is a particular degenerate case of the SSP where states are nodes of the graph, action costs are arc costs and state transition probabilities are deterministic, since the next state is always equal to the chosen successor node. Therefore, it is in fact a deterministic MDP.
This definition of SSPs provided by Bertsekas and Tsitsiklis (1991) is very general. It also encompasses variants of shortest path problems on random graphs (see, e.g., Polychronopoulos and Tsitsiklis, 1996), where arc costs are modeled as random variables, each being the sum of an average arc cost and a random error term. Arguably the most interesting variation of the problem is the one where the realization of the arc costs is learned at each intersection as the graph is traversed. Below, we make the additional assumption that the error terms are i.i.d. variables. In this case, by defining an action as an arc and a state as a network node together with a vector of learned realizations of the error term for all outgoing arcs from that node, (3) becomes
(4)
where transitions between states are entirely determined by the density of the error terms. This specific SSP will be of importance in Section 3.2.
2.2 Inverse shortest path problems
In shortest path problems it is assumed that the modeler has complete knowledge of the cost function (or of its distribution in the stochastic case), as well as of the state transitions. In contrast, inverse shortest path problems study the case where the cost function is unknown and must be inferred with the help of an additional source of information, in the form of observed optimal paths between some origin-destination (OD) pairs. This class of problems broadly belongs to inverse optimization, an extensively studied problem (see, e.g., Ahuja and Orlin, 2001) where the modeler seeks to infer the objective function (and sometimes the constraints) of a forward (inner) optimization problem based on a sample of optimal solutions.
Motivation for studying inverse problems can be drawn from several types of applications. Burton and Toint (1992) cite possible motivations to examine inverse shortest path problems, one of which is precisely the subject of this tutorial. One may view the underlying optimization problem as a model for rational human decision making and assume that the cost function represents the preferences of users traveling in a network (possibly a parametrized function of certain arc features). In this context, recovering the cost function makes it possible to analyze why individuals choose the observed routes and to gain understanding of network users’ behavior. Using the recovered cost function, the inner shortest path problem can be solved and in this sense yield predictions of path choices for unobserved OD pairs.
Related to inverse shortest path problems is the literature on inverse reinforcement learning (IRL) or imitation learning (Ng et al., 2000, Abbeel and Ng, 2004). The IRL problem is more general than the inverse shortest path problem, since it considers an underlying optimization problem generally formulated as an infinite horizon MDP. In this context, observations consist of optimal sequences of actions. Nevertheless, models for IRL have also been applied to the problem of recovering the cost function of network users (e.g. Ziebart et al., 2008). Such applications consider specific MDPs where an action is a choice of arc in a network, and the destination is an absorbing state, as in Section 2.1.2. Most applications of IRL on path choice however consider a deterministic MDP (state transition probabilities are degenerate) in contrast to the more general formulation in Section 2.1.2.
Inverse problems are in general underdetermined and may not possess a unique solution. In the inverse shortest path problem and in IRL, there may be an infinite number of ways to define the cost function such that observations form optimal solutions. Different modeling paradigms propose to solve this issue. They may be separated into two categories depending on the assumptions made on the presence of noise in the data, which distinguishes between deterministic and stochastic problems. The former considers that demonstrated behavior is optimal, while the latter makes the hypothesis that observed trajectories deviate from deterministic shortest paths. The second case is often studied when the data collecting process may have induced measurement errors or when the data is generated by a decision maker who exhibits seemingly inconsistent behavior. In the following sections, we review existing models for the above mentioned inverse problems.
2.2.1 Deterministic problems
Burton and Toint (1992) introduced the original deterministic inverse shortest path problem where observed trajectories are assumed to correspond to deterministic shortest paths. They do not assume that the cost function is parametrized by arc features and simply seek the value of the cost associated with each arc. They propose to provide uniqueness to the inverse problem by seeking the arc costs that are closest to a given estimation of costs based on a certain measure of distance, thus minimizing the distance between the recovered costs and this prior estimate. This implies that the modeler has an a priori knowledge of costs, which is reasonable in some applications. Burton and Toint (1992) provide seismic tomography as an example. Seismic waves are known to propagate according to the shortest path along the Earth’s crust, but the geological structure of the zone of study is typically not entirely known, although modelers have an estimate. Given measurements of earthquakes’ arrival times at different points in the “network”, the goal is then to predict movements of future earthquakes by recovering the actual transmission times of seismic waves.
Variants of this problem have been studied by Burton and Toint (1994), Burton et al. (1997), for instance considering the case where arc costs may be correlated or their values belong to a certain range. In Burton and Toint (1992) and following works, the ℓ2 norm is selected as a distance measure between initial and modified costs, and yields a quadratic programming formulation. In Zhang et al. (1995), the ℓ1 norm is assumed so that the problem can be modelled as a linear program and solved using column generation. In all the studies above, the inverse shortest path problem is modeled as a constrained optimization problem, with an exponential number of linear constraints to ensure that each observed path is shortest under the chosen cost function.
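To make the structure of these formulations concrete, a generic version of the deterministic inverse shortest path problem can be sketched as follows. The notation is an assumption introduced here for illustration: $c_a$ is the cost of arc $a$ to be recovered, $\hat{c}$ the a priori cost estimate, $\sigma_n$ the $n$-th observed path and $\Omega_n$ the set of all paths sharing its origin and destination.

```latex
\begin{equation*}
\begin{aligned}
\min_{c \,\geq\, 0} \quad & \lVert c - \hat{c} \rVert \\
\text{s.t.} \quad & \sum_{a \in \sigma_n} c_a \;\leq\; \sum_{a \in \sigma} c_a
  \qquad \forall \sigma \in \Omega_n, \quad n = 1, \dots, N.
\end{aligned}
\end{equation*}
```

Each constraint is linear in the arc costs, but there is one constraint per feasible path, which is why the number of constraints grows exponentially with the size of the network. The choice of norm in the objective determines whether the problem is a quadratic or a linear program.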
Bärmann et al. (2017) provide another example of an approach for inverse optimization with an application to learning the travel costs of network users, in particular subject to budget constraints. More precisely, they consider an inverse resource-constrained shortest path problem. Their approach does not recover an exact cost function, but provides a sequence of cost functions, one for each observation, which makes it possible to replicate demonstrated behavior. Their framework explicitly assumes the optimality of observations in order to infer these objective functions.
The deterministic problem has also notably been studied under the guise of inverse reinforcement learning in Ratliff et al. (2006), with an application to robot path planning. The objective of their work is to learn a cost function in order to teach a robot to imitate observed trajectories, which are assumed optimal. In contrast to the aforementioned works, no hypothesis is made regarding preliminary arc costs. However, the environment is considered to be described by features, such as elevation, slope or presence of vegetation, and the model seeks to obtain a mapping from features to costs by learning the weights associated with each feature. To obtain uniqueness of the solution, they cast the problem as one of maximum margin planning, i.e. choosing the parameters of the cost function that make observed trajectories better by a certain margin than any other path, while minimizing the norm of weights. This notion of distance between solutions is defined by a loss function to be determined by the modeler. Under the ℓ2 norm, this also results in a quadratic programming formulation with a number of constraints that depends on the number of state-action pairs, which consist of node-arc pairs in this context.
2.2.2 Stochastic problems
Inverse optimization problems with noisy data have been studied in, e.g., Aswani et al. (2018) or Chan et al. (2018). In addition to nonuniqueness, the problem which typically arises in this situation is that there may not be any nontrivial value of arc costs which makes the demonstrated paths optimal solutions of a DSP. If solutions do exist, they may be uninformative, such as the zero cost function. To solve this issue, the previous framework is extended by letting solutions be approximately optimal and measuring the amount of error. Accommodating noise thus requires estimating a model for the choice of path which “fits” the observed data as closely as possible, with various methods for measuring the fit, or loss, drawn from statistics. The chosen measure for the fit should uniquely define the solution.
Different points of view exist in the literature on achieving a good fit. Approaches grounded in machine learning make no assumptions on the underlying process that generated the data and merely focus on obtaining good predictive power with the simplest possible model while considering a large family of potential functions. In contrast, the statistical inference perspective on the problem considers that there exists an underlying true cost function with a known parametric form. The aim is to obtain parameter estimates that asymptotically converge to the real values, a property known as consistency. Several loss functions are conceivable to formulate a minimization problem, and often the choice of loss is directly related to assumptions made on the underlying model.
Examples for the former are the work of Keshavarz et al. (2011), who estimate cost functions in a flow network assuming an affine parametric shape, and the work of Bertsimas et al. (2015), who seek to infer cost functions in a network subject to congestion at equilibrium. The specificity of the latter is that the cost functions are endogenous, i.e., the cost of paths includes congestion costs. This makes it an inverse variational inequality problem with noisy data. In both works the proposed method is a heuristic and treats the process that generates the data as a black box. For example, in Bertsimas et al. (2015), nonparametric cost functions are considered and the problem is formulated as a constraint programming model which balances the objective of minimizing the norm of the cost function with that of maximizing the fit of the data. They follow the approach of measuring the loss of the model by the amount of slack required to accommodate equilibrium constraints, i.e. making observed solutions “optimal”.
In contrast, the latter category of models assumes that observed behavior deviates randomly from optimality according to a certain known probability distribution. This leads to a different modeling paradigm, in which a random term is added to the true cost function of the inner optimization problem, with the interpretation that each observation corresponds to an instantiation of the cost. This framework, described in, e.g., Nielsen and Jensen (2004) for general inverse problems, has received limited attention in the literature on inverse shortest path problems. One notable exception consists of path choice models based on the discrete choice modeling framework. In the next section, we review the literature surrounding this probabilistic modeling framework, which is at the heart of this tutorial.
3 Probabilistic models for path choice
The inverse shortest path problem with noisy data is motivated by the situation of individuals traveling in real networks, where the assumption that a cost function can account for all observed behavior is rarely valid. Probabilistic methods for this noisy problem assume that observed path choices randomly deviate from deterministic shortest paths. Discrete choice models are a specific type of probabilistic model grounded in econometrics, which provides the theoretical basis for behavioral interpretation.
One of the interpretations of the distributional assumption on the data is that travelers act rationally but observe additional factors impacting their path choice which vary among individuals and are unknown to the modeler. These factors are encompassed in a random term added to the cost function. Discrete choice models assume that the cost function is a parametrized function of several attributes. The only option available to the modeler, who knows the family of distributions for the random terms but does not observe their realization for a given individual, is to infer the probability that a given path is optimal.
The problem becomes akin to density estimation, i.e., recovering the parameters of a probability distribution over a set of paths. In this context, statistical consistency of the estimator is a desirable property. Yet there are several ways to define a probability distribution over paths, well-known in the discrete choice literature, which do not necessarily yield a consistent estimator. In the following sections, we elaborate on the above statement and describe two distinct discrete choice models for path choice.
Note that discrete choice models employ the terminology of utilities instead of costs and that we uphold this convention in the remainder of this tutorial. This implies a trivial change of the above formulations from minimization to maximization problems, with the utility function defined as the negative of the cost function.
3.1 Path-based models
Discrete choice models based on paths are the methodology embraced by most works on the topic (Prato, 2009, Frejinger, 2008). They assume that the utility of a path is a random variable, the sum of a deterministic utility and a random error term multiplied by a scale parameter, where the deterministic utility is parametrized by attributes of the network, such as travel time. Often the parametrization is linear in parameters, i.e., the deterministic utility is the inner product of a vector of parameters and a vector of attribute variables of the path. Measuring the fit of a discrete choice model to the data naturally leads to selecting the log-likelihood loss, which assesses the plausibility of observing the chosen trajectories under the current value of the parameters. Thus the problem is one of maximum likelihood estimation, i.e. finding the set of parameters which minimizes the log-likelihood loss or, equivalently, maximizes the likelihood.
The difficulty in specifying the probability of a given path is to identify the class of paths over which this probability should be defined. Since the very large number of feasible paths in a real network precludes enumeration, the immediate solution consists in choosing a subset of reasonable paths, assuming that all other paths have a null choice probability. This implies a two-step modeling framework, in which

Plausible paths are generated between the origin and destination of each observation by solving versions of the DSP, forming the choice set of that observation;

Parameters that maximize the probability of observed paths within the choice sets previously defined are estimated via maximum likelihood, i.e. by maximizing the likelihood function defined as
(5)
The distribution chosen for the error terms leads to different forms of discrete choice models with distinct path choice probabilities. The most well-known is the multinomial logit formulation, also known as the softmax in the machine learning community, which results from the assumption of i.i.d. extreme value type I distributed error terms. Other models exist, such as the probit model. In the logit case, we have
(6) 
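The logit probabilities and the likelihood over choice sets can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three-path choice set, the two attributes (travel time and number of turns), the parameter values and the function names are all hypothetical, and the deterministic utility is taken to be linear in parameters.

```python
import math


def path_logit_probabilities(utilities, scale=1.0):
    """Multinomial logit (softmax) over the paths of a single choice set."""
    # Subtracting the maximum utility avoids overflow in the exponentials.
    top = max(utilities.values())
    weights = {p: math.exp((v - top) / scale) for p, v in utilities.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def log_likelihood(observations, beta, scale=1.0):
    """Sum of the log-probabilities of the chosen paths in their choice sets."""
    ll = 0.0
    for chosen, choice_set in observations:
        # Linear-in-parameters deterministic utility of each path.
        utilities = {p: sum(b * x for b, x in zip(beta, attributes))
                     for p, attributes in choice_set.items()}
        ll += math.log(path_logit_probabilities(utilities, scale)[chosen])
    return ll


# Hypothetical choice set: each path is described by (travel time, turns).
choice_set = {"p1": (10.0, 2.0), "p2": (12.0, 0.0), "p3": (15.0, 1.0)}
observations = [("p1", choice_set), ("p2", choice_set)]
beta = (-0.5, -0.2)  # hypothetical parameter values

utilities = {p: beta[0] * t + beta[1] * k for p, (t, k) in choice_set.items()}
probs = path_logit_probabilities(utilities)
ll = log_likelihood(observations, beta)
```

Any path left out of the choice set implicitly receives a probability of zero, so the resulting probabilities depend on how the choice set was generated.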
There exists a vast array of methods to extract plausible paths from the network in order to generate the needed choice sets. Usually, it consists in assuming an a priori cost function and solving variants of the shortest path problem described in Section 2.1, until a large enough set of paths is obtained. Problematically, the preliminary value of costs used to define the probability distribution through the choice set is in general not equivalent to the true value of the utility which is ultimately sought. This discrepancy is what prevents the consistency of the resulting estimates.
Frejinger et al. (2009) designed a method to correct for the induced sampling bias, by adjusting the choice probability of a given path depending on parameter values and the choice set. The adjusted path choice probabilities include a correction term which accounts for the probability that the given choice set was selected under the current parameter values conditionally on the observed choice.
The method proposed by Frejinger et al. (2009) can be understood as assuming that the true distribution is based on the set consisting of all feasible paths, while resorting to sampling paths in order to estimate the parameters of the distribution in practice. The advantage is that it yields consistent parameter estimates. Nevertheless, since there is no means to compute the normalizing constant of the distribution save for the impractical enumeration of all paths, the estimated model still requires sampling choice sets for prediction, an issue we further discuss in Section 5.2.
Arguably, since these issues arise from the combinatorial size of the inner problem, one may expect that DP could provide a solution for the inverse problem as well. The literature on IRL supplies such an example with Ziebart et al. (2008), who model the path choice problem as an MDP and estimate a probabilistic model which normalizes over the global set of feasible paths. This is achieved without enumeration or sampling by viewing the path choice process as a sequence of action choices depending on a current state as in Section 2.1.2. In fact, this methodology is equivalent to the recursive discrete choice model, which has been developed independently and in parallel in the transportation research community. We introduce this model in the following section, which provides a method to consistently estimate parameters of the utility function without resorting to choice sets of paths.
3.2 Recursive models
In this section, we introduce the recursive model proposed by Fosgerau et al. (2013) for the choice of path within the discrete choice modeling framework. We also discuss the link to the IRL model by Ziebart et al. (2008).
3.2.1 Path choice as a deterministic MDP
Contrary to the previous section in which the problem is formulated in the high dimensional space of paths, the recursive choice model considers arcbased variables. Its formulation is explicitly based on the framework of Markov Decision Processes used to solve shortest path problems in Section 2.1.
In recursive models, network arcs correspond to states, while outgoing links at the head node of the current arc assume the role of available actions. For this purpose, we subsequently use distinct notation for arcs depending on whether they play the role of states or actions. Note that it would also be possible to select nodes as states; however, the arc-based formulation allows the deterministic utility of an action-state pair to depend on turn angles between two subsequent arcs. The destination is represented by a dummy link which is an absorbing state of the MDP, where no additional utility is gained. Finally, utilities are undiscounted and the action-state transition function is assumed to be degenerate, since the new state is simply the chosen arc. A path under this framework is a sequence of states, starting from an origin state and leading to the absorbing state representing the destination.
3.2.2 Parametric estimation of MDPs
Under this perspective, the inverse problem of recovering the utility function is a problem of parametric estimation of MDPs, as first described by Rust (1994). As in the previous section, the noise in the data is accounted for by assuming the presence of an i.i.d. random error term added to the deterministic utility. The utility thus becomes a random variable, the sum of the deterministic utility and the error term multiplied by its scale. It is assumed that the individual observes the realizations of the random variables at each step of the process and chooses the best action accordingly. From the point of view of the modeler, the individual’s behavior hence consists in solving a stochastic shortest path problem similarly to Section 2.1.2. In this context, the Bellman equation gives the optimal value function when the state consists of an arc and the realizations of the error terms,
(7) 
which is very similar to (4). We may however simplify this equation by taking the expectation of (7) with respect to the error terms and defining the expected value function of a state, which gives
(8) 
For simplicity and consistency with terminology in other works (Fosgerau et al., 2013, Mai et al., 2015), we nevertheless refer to (8) as the value function in this work.
The modeler does not observe the realized utilities and can only compute the probability that each given action is optimal. According to the modeler, the observed behavior of individuals follows a probability distribution over the set of actions which maximizes the expected utility in (8). As in Section 3.1, choice probabilities may take several forms depending on the distributional assumption for the error terms. Assuming an extreme value type I probability distribution, the probability of an individual choosing a certain action conditional on the state and the destination is given by the familiar multinomial logit formula:
(9) 
Given observations of sequences of actions (i.e., paths), the model can be estimated by maximum likelihood. This requires defining the probability of choosing an observed path, which can be expressed as the product of the corresponding action choice probabilities using (9). For an observed path, the path choice probability is given by
(10) 
where is the link , which can more simply be expressed as
(11) 
where is the sum of the link utilities of the path .
As a result, the likelihood of a set of path observations is defined as
(12) 
The expression in (12) does not depend on choice sets, in contrast to (5). However, the value function which appears in (9) must be computed in order to evaluate the likelihood. This suggests resorting to a two-step likelihood maximization algorithm, in which an inner loop solves the SSP and obtains the value function in (8) for the current value of the parameters $\beta$, while an outer loop searches over the values of $\beta$. Rust (1994) proposed such a method, known as the nested fixed point algorithm. The resulting parameter estimates are consistent.
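The two-loop structure described above can be sketched on a toy problem. The network, the synthetic path counts and the one-parameter utility (`beta` times arc length, unit scale) below are all assumptions made for illustration, and the outer loop uses a plain golden-section search rather than the optimizer of any particular implementation.

```python
import math

# Hypothetical acyclic network (tail, head, length) and synthetic path counts;
# neither comes from the paper. Arc utility is beta * length, scale mu = 1.
ARCS = [(1, 4, 2.0), (1, 4, 6.0), (1, 2, 1.0), (2, 4, 2.0), (2, 3, 2.0), (3, 4, 1.0)]
ORIGIN, DEST = 1, 4
OBSERVED = [(2.0, 66), (3.0, 24), (4.0, 9), (6.0, 1)]  # (path length, count)


def value_at_origin(beta):
    """Inner loop: value function for a given beta, by backward induction."""
    V = {DEST: 0.0}
    for node in (3, 2, 1):  # reverse topological order of the assumed network
        V[node] = math.log(sum(math.exp(beta * length + V[head])
                               for tail, head, length in ARCS if tail == node))
    return V[ORIGIN]


def log_likelihood(beta):
    """log P(path) = beta * length(path) - V(origin), summed over observations."""
    v0 = value_at_origin(beta)
    return sum(count * (beta * length - v0) for length, count in OBSERVED)


def maximize(f, lo, hi, tol=1e-8):
    """Outer loop: golden-section search for the maximizer of a concave function."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d  # the maximizer lies in [a, d]
        else:
            a = c  # the maximizer lies in [c, b]
    return (a + b) / 2.0


beta_hat = maximize(log_likelihood, -5.0, 0.0)
print(f"estimated beta: {beta_hat:.3f}")
```

Each evaluation of the likelihood triggers a fresh solution of the value function, which is the defining feature of the nested approach; no set of alternative paths is ever enumerated.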
The model for IRL proposed in Ziebart et al. (2008) bears another name but is equivalent to a recursive logit model, since they assume a maximum entropy (exponential family) distribution. The only difference lies in the method for estimating the model: Ziebart et al. (2008) approximate the value function in (8), whereas we show in the next section that it can conveniently be obtained by solving a system of linear equations (Fosgerau et al., 2013, Mai et al., 2016).
4 Illustrative examples
Section 3 presented two discrete choice models for the problem of estimating the utility function of travelers in a network: (i) path-based models and (ii) recursive models. Although the first is extensively used in practice, the second is superior because of its consistent estimator and accurate predictions without choice set generation.
This section has two purposes. First, we use small illustrative examples on toy networks to provide a clear understanding and intuition of the recursive model and the value function; second, we compare path probabilities obtained with both model formulations. Although both models can take several forms depending on distributional assumptions on the error terms, the focus of this tutorial is on the logit model. Therefore, in the following we refer to the recursive logit as the RL model and the path-based logit as the PL model.
Note that the MATLAB code used for these numerical examples is available online, together with a tutorial detailing how to use it, at http://intermodal.iro.umontreal.ca/software.html.
4.1 An acyclic network
The motivation for this example is to show that it is possible to obtain with the recursive formulation in (9) the same choice probabilities on paths as with the PL model in (6). For this illustrative purpose, it is meaningful to consider a given specification of the utility function. Hence we assume that path utilities are specified by an additive function of arc length, such that for each path $\sigma$ we have $v(\sigma) = -L(\sigma)$, where $L(\sigma)$ is the sum of the lengths of arcs contained in path $\sigma$.
The OD pair considered for the toy network displayed in Figure 1 is $(1, 4)$. The two dashed arcs represent dummy origin and destination links. There exist four possible paths from node 1 to node 4, of respective lengths 2, 3, 4 and 6. Under the logit model, it is easy to compute the choice probability of the shortest path for this OD pair, which goes directly from node 1 to node 4 with length 2. Assuming that the scale of the random term for this example is 1, we obtain at the denominator of the logit function in (6) the term $e^{-2} + e^{-3} + e^{-4} + e^{-6}$, and at the numerator the term $e^{-2}$; the choice probability is therefore equal to 0.6572. Table 1 similarly displays the choice probabilities of all other paths.
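The path-based computation above is short enough to check directly. The following is a minimal sketch that assumes only what the text states (path lengths 2, 3, 4 and 6, utility equal to minus the length, scale 1); the function name `path_logit` is ours.

```python
import math

# Path lengths reported in the text for the four paths between nodes 1 and 4.
lengths = [2.0, 3.0, 4.0, 6.0]
mu = 1.0  # scale of the error term, as assumed in the example


def path_logit(path_lengths, scale=1.0):
    """Logit probabilities over an enumerated set of paths with utility -length."""
    weights = [math.exp(-length / scale) for length in path_lengths]
    total = sum(weights)
    return [w / total for w in weights]


probs = path_logit(lengths, mu)
for length, p in zip(lengths, probs):
    print(f"length {length:g}: probability {p:.4f}")
```

The printed values can be compared with the PL column of Table 1.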
Let us now suppose that instead of choosing between the possible paths connecting origin and destination, the traveler builds the chosen path along the way through a series of consecutive link choices, as in the RL formulation. In each link choice situation, the alternatives to choose from are the outgoing links at the current node. We denote by $v(a|k)$ the utility of a link $a$ originating from link $k$. The choice probability of a path under the RL model is then equal to the product of the link choice probabilities in (9).
Path  Length  Path probability (PL)  Product of link probabilities (RL) 

2  0.6572  0.6572  
6  0.0120  0.0120  
3  0.2418  
4  0.0889 
Node  Value function 

In order to compute link choice probabilities, we need to compute the value function in (7). This equation can be rewritten as the logsum when $\varepsilon$ is assumed to be i.i.d. Extreme value type I distributed, as in the logit model:
$$V^d(k) = \mu \log \sum_{a \in A(k)} e^{(v(a|k) + V^d(a))/\mu}. \tag{13}$$
Since the specified utility of a link does not depend on the incoming arc, the value function is identical for all links with the same end node. It is thus more convenient to compute the value function at each of the 4 nodes (the value function at a link is then equivalent to the value function at the end node of that link). Below, we show how to compute the value function in this network and display the value for each node.
In this case, since the network is acyclic, it is possible to compute the value function by backwards induction. At the destination node 4, given that there is no utility to be gained, the value function is zero. Working our way backwards, we compute at node 3 the value function . At node 2, we have . Finally, at node 1 we obtain . The values for all nodes are summarized in Table 2.
Having computed the value function for this network, we may apply (9) to this example and we obtain the path probabilities in the last column of Table 1. We notice that they are identical to choice probabilities under the PL model. This is due to the property of the RL model of being formally equivalent to a discrete choice model over the full choice set of paths (Fosgerau et al., 2013). Therefore, the PL and the RL models are two strictly equivalent approaches when the set of all possible paths in the network can be enumerated.
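The backward induction and the equality between RL and PL path probabilities can be reproduced in a few lines. Figure 1 is not reproduced here, so the arc list below is a hypothetical topology chosen only so that the four paths have lengths 2, 3, 4 and 6; the value function is computed at nodes, as suggested in the text.

```python
import math

# Hypothetical arcs (tail, head, length); an assumption, not the paper's Figure 1.
ARCS = [(1, 4, 2.0), (1, 4, 6.0), (1, 2, 1.0), (2, 4, 2.0), (2, 3, 2.0), (3, 4, 1.0)]
DEST = 4
MU = 1.0


def value_function(arcs, dest, order, mu=1.0):
    """Backward induction: V(n) = mu * log(sum_a exp((-length_a + V(head_a)) / mu))."""
    V = {dest: 0.0}
    for node in order:  # reverse topological order, destination excluded
        V[node] = mu * math.log(sum(math.exp((-length + V[head]) / mu)
                                    for tail, head, length in arcs if tail == node))
    return V


def arc_probability(arc, V, mu=1.0):
    """Logit link choice probability, using the logsum as the denominator."""
    tail, head, length = arc
    return math.exp((-length + V[head] - V[tail]) / mu)


V = value_function(ARCS, DEST, order=[3, 2, 1], mu=MU)

# Each path is a list of arcs from node 1 to node 4.
paths = [[ARCS[0]], [ARCS[2], ARCS[3]], [ARCS[2], ARCS[4], ARCS[5]], [ARCS[1]]]
rl_probs = [math.prod(arc_probability(a, V, MU) for a in path) for path in paths]

# Path-based logit over the same four enumerated paths.
path_lengths = [sum(length for _, _, length in path) for path in paths]
weights = [math.exp(-length / MU) for length in path_lengths]
pl_probs = [w / sum(weights) for w in weights]

for length, rl, pl in zip(path_lengths, rl_probs, pl_probs):
    print(f"length {length:g}: RL {rl:.4f}, PL {pl:.4f}")
```

With all paths enumerable, the two columns coincide, and the value function at node 1 equals the logsum over the four path utilities.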
4.2 A cyclic network
Let us now consider a very similar network in Figure 2, with an added link between nodes 3 and 1. This network is no longer acyclic, and as a result there is in theory an infinite number of paths between nodes 1 and 4, when accounting for paths with loops.
The first consequence of dealing with a cyclic network is that the value function can no longer be computed by backwards induction starting from the destination, since the network admits no topological order. However, the value function is still well defined as the solution of the fixed point problem (7) and can be computed either by value iteration or, in the case of the recursive logit, as the solution of a system of linear equations. For the latter, notice that by taking the exponential of (13) and raising to the power $1/\mu$, we obtain
$$e^{V^d(k)/\mu} = \sum_{a \in A(k)} e^{v(a|k)/\mu}\, e^{V^d(a)/\mu}. \tag{14}$$
This is a linear system of equations if we solve for the variable $z_k = e^{V^d(k)/\mu}$. Doing so, we obtain the value function in Table 3.
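A minimal sketch of the linear-system approach follows. Figure 2 is not reproduced here, so the arc list (including a return arc from node 3 to node 1 of length 0.5) is an assumption; it was chosen because its path lengths match those listed in the tables. The system is written over nodes and solved with a small hand-written Gaussian elimination to keep the example dependency-free.

```python
import math

# Hypothetical cyclic network (tail, head, length); an assumption, not Figure 2.
ARCS = [(1, 4, 2.0), (1, 4, 6.0), (1, 2, 1.0), (2, 4, 2.0),
        (2, 3, 2.0), (3, 4, 1.0), (3, 1, 0.5)]
DEST = 4
MU = 1.0


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def value_function(arcs, dest, mu=1.0):
    """Solve (I - M) z = b with z_n = exp(V(n) / mu), then return V at every node."""
    nodes = sorted({tail for tail, _, _ in arcs} - {dest})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for tail, head, length in arcs:
        weight = math.exp(-length / mu)
        if head == dest:
            b[index[tail]] += weight  # z at the destination is exp(0) = 1
        else:
            A[index[tail]][index[head]] -= weight
    z = solve_linear_system(A, b)
    V = {node: mu * math.log(z[index[node]]) for node in nodes}
    V[dest] = 0.0
    return V


V = value_function(ARCS, DEST, MU)
p_shortest = math.exp((-2.0 + V[DEST] - V[1]) / MU)  # direct arc of length 2
print({node: round(value, 4) for node, value in V.items()})
print(f"probability of the shortest path: {p_shortest:.4f}")
```

The solution is meaningful only when the system admits a positive solution, which holds here because the loop carries enough disutility; the same code also handles the acyclic case.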
As in the previous example, having solved for the value function, we can easily compute the choice probabilities of different paths in this network as the product of link choice probabilities. As can be observed, the probabilities of the four paths used in the acyclic example no longer sum to 1 (rather to 0.9698), and neither do the probabilities once the additional paths displayed in Table 4 are included, which sum to 0.9965. This is because a cyclic network contains an infinite number of possible paths, and the RL model attributes a positive choice probability to each outgoing arc at an intersection. Hence, even paths with multiple cycles have a small probability of being chosen. We notice however that choosing a path with two or more cycles is extremely unlikely, with a probability of 0.0009 according to the model.
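The reported figures can be cross-checked with a short calculation. It rests on an assumption that cannot be verified without Figure 2: that every path with cycles consists of repeated traversals of a single loop through the origin node, followed by one of the four loop-free paths. Writing $r$ for the probability of completing that loop once, the Markov property makes successive loops independent, so that:

```latex
\begin{align*}
P(\text{no cycle}) &= 1 - r = 0.9698 \quad\Longrightarrow\quad r = 0.0302,\\
P(\text{exactly } n \text{ cycles}) &= r^{n}\,(1 - r),\\
P(\text{two or more cycles}) &= \sum_{n \ge 2} r^{n}\,(1 - r) = r^{2} \approx 0.0009 .
\end{align*}
```

Under that assumption, the 0.9698 and 0.0009 quoted in the text are mutually consistent.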
Node  Value function 

Path  Length  Product of link choice probabilities 

2  0.6374  
6  0.0117  
3  
4  
5.5  
9.5  
6.5 
This example illustrates that the RL model offers a convenient mathematical formulation for the choice of path in a cyclic network. In comparison, using the PL model for this network raises a well-known challenge. Indeed, the logit formula in (6) requires defining a finite choice set of alternative paths for the OD pair. Given that the possible paths cannot all be enumerated in this example, the modeler is compelled to make hypotheses on which subset of paths should have a nonzero choice probability. The resulting path choice probabilities will depend on the composition of the choice set. In reality, this issue is not related to cycles only. In large networks, the number of possible acyclic paths may also be too large to enumerate in practice. In the following section, we delve into the issues which may arise from the necessity to generate choice sets to define choice probabilities in path-based models.
5 An analysis of the advantages of recursive models compared to path-based models
The goal of this section is to highlight in more detail the advantages of recursive models and the issues related to path-based models. In this discussion, we use illustrative examples and we focus on two practical purposes of such models: i) estimating parameters from data of observed paths; ii) predicting choices from an estimated model. We focus on logit models for this comparison.
In practice, the PL model requires generating choice sets of paths for both purposes. There is an extensive literature on how to generate choice sets of paths, what characteristics choice sets should have, and what the impact is of selecting a restricted choice set prior to model estimation and prediction (Bekhor et al., 2006, Prato and Bekhor, 2007, Bovy, 2009, Bliemer and Bovy, 2008). The consensus in that literature is that it is advantageous to explicitly separate the procedures of generating path choice sets and modelling choice. Bekhor and Toledo (2005) argue that predicted paths from link-based models are behaviorally unrealistic as they may contain cycles. Bliemer and Taale (2006) claim that there are computational advantages to choice set generation in large networks.
On the contrary, Horowitz and Louviere (1995) indicate that it is possible to misspecify choice sets with problematic consequences and that choice sets provide no information on preferences besides what is already contained in the utility function, although their study does not investigate path choice. Frejinger et al. (2009), among others, empirically demonstrate that the definition of choice sets may affect parameter estimates. In this section, we offer additional arguments in the same direction. We exemplify complications related to choice sets which arise when using path-based models, and we demonstrate that recursive logit models do not display these issues.
5.1 An example of model estimation
Figure 3 displays a network with one OD pair connected by a set of feasible paths . We study estimation results for synthetic data of trajectories on this toy network. This data is generated by simulation assuming that the true utility specification is given by
where is the travel time on arc , and is a constant equal to 1, with and . Travel time for each link is given in Figure 3. The travelers are also assumed to consider every possible path in , such that any trajectory may be observed.
We compare the ability of the PL model versus that of the RL model to recover the true parameter values. To do so, we estimate four path-based models based on different choice sets. Table 5 displays the paths contained in each choice set, noting that the largest choice set contains all 15 paths and is therefore equivalent to the full set of feasible paths. For each observation, the chosen path is added to the choice set if not already present. The last column displays the utility of each path based on the given specification, obtained by summing the link utilities.
Path  Nodes  

1  y  y  y  
2  y  y  y  
3  y  y  y  
4  y  y  y  
5  y  y  y  
6  n  y  y  
7  n  y  y  
8  n  y  y  
9  n  n  y  
10  n  n  y  
11  n  n  y  
12  n  n  y  
13  n  n  n  
14  n  n  n  
15  n  n  n 
Results are shown in Table 6. In the cases where the choice set fails to include several relevant alternatives, the estimation algorithm for the PL model does not converge, and the parameter values obtained are significantly different from the true ones. The fact that the algorithm does not converge may seem counterintuitive at first, but it is in fact due to i) the lack of variance in the attributes of the paths in these choice sets, and ii) the omission from the choice set of paths 6 to 12, which are chosen relatively often in the data but are only added to the choice set when they correspond to the observed path. As a result, whenever such paths are present in the choice set, the data report that they are selected, which cannot be reconciled with the explanatory variables present in the utility specification. The only case where the PL model recovers the true parameter values based on a restricted choice set is with the largest restricted choice set, which contains all feasible paths but three. The estimates have slightly lower variance when all alternatives are included in the choice set, and the RL model obtains equivalent results (the slight difference may be due to different implementations of the optimization algorithm). In accordance with several other studies, we conclude from this experiment that the PL model may not recover the true utility function when the choice set fails to include several relevant alternatives.
Model  

3.35 (0.59)  (0.32)  
2.00 (0.37)  0.64 (0.10)  
(0.17)  (0.07)  
(0.16)  (0.07)  
(0.15)  (0.07) 
Certain studies argue that the assumption that users consider any feasible path is behaviorally unrealistic, and ask what would happen if the data instead reflected the possibility that users restrict their consideration set. In order to shed light on this question, we study a second sample of synthetic data, where observed trajectories include only paths 1 to 12, generated under the assumption that paths 13 to 15 are not considered by travelers due to their highly negative utility.
In Table 7, we show the estimation results of the same models on this new data. Although the RL model considers more paths than the true choice set, it still recovers the true parameter values. On the other hand, the PL models based on two of the restricted choice sets do not. This second experiment suggests that restricting the choice set without evidence regarding which alternatives are truly considered is potentially harmful, while considering a larger set including “irrelevant” alternatives does not interfere with estimation results in this case.
Model  

3.00 (0.47)  (0.26)  
3.00 (0.49)  0.93 (0.13)  
(0.17)  (0.07)  
(0.16)  (0.07)  
(0.15)  (0.07) 
Finally, we note that Frejinger et al. (2009) provide a method to correct parameter estimates of path-based models. However, while this leads to consistent estimates, there is no method to consistently predict path choice probabilities according to the estimated model. Indeed, as the next examples highlight, predictions vary significantly depending on the definition of the choice set.
5.2 Examples of prediction
In general, predicting choices from discrete choice models for a certain demand requires knowing the utility function and the choice sets of the decision makers, on which the probability distribution depends. This is in theory more complex when the utility function depends on socioeconomic characteristics of individuals. However, in the following, we make the assumption that the utility function is not individual specific and depends only on attributes of network links.
5.2.1 Link flows
Predicting link flows in the network is a typical application of path choice models, which is important in, e.g., stochastic user equilibrium models. Link flows represent the number of individuals (or other units) on each arc of the network that results from loading a certain OD demand.
Two methods exist to predict expected flows with the RL model, neither of which requires enumerating choice sets of paths. The first method consists in sampling paths link by link for each individual using the link choice probabilities in (9). The second method computes expected link flows without resorting to simulation. Using the Markov property of the model, Baillon and Cominetti (2008) proved that destination-specific link flows are obtained by solving the linear system
$$\left(I - (P^d)^\top\right) F^d = G^d, \tag{15}$$
where $P^d$ is the matrix of link choice probabilities in (9), $F^d$ is the vector of expected link flows and $G^d$ is the demand vector from all origins to destination $d$.
In this example, we predict link flows for the network in Figure 3, assuming a demand of 100 for the single OD pair and the same utility specification as in Section 5.1. We compare the link flows predicted by the RL model and by the three PL models based on the restricted choice sets of Table 5. In each case, the expected flow on a given path is equal to the fraction of the demand choosing that path according to the model.
For the RL model, link flows are obtained by solving (15). For the PL models, flows on paths are computed from the path choice probabilities over the paths of the corresponding choice set. Flows on links are then obtained by summing the flows on all paths traversing each link. Table 8 displays the flow on each link according to each model. As expected, the predicted flow varies greatly between path-based models depending on the choice set. As the choice set grows, predicted flows tend to move closer to the values forecast by the RL model. A particularity of the RL model is that it predicts nonzero flow on every link. However, the flow on links 7, 10 and 16, which belong only to paths with very small choice probabilities, is very close to zero.
In reality, it is difficult to judge which model predicts link flows better without being able to compare to observed link counts. However, a crucial remark is that in the absence of any information regarding which paths are truly considered by travelers, the predictions of the PL models depend arbitrarily on the choice set. The RL model, on the other hand, predicts according to the estimated probability distribution itself, with no dependence on a choice set. In addition, the RL model computes link flows very efficiently, as only one system of equations must be solved to obtain link flows for all OD pairs sharing the same destination. In contrast, the PL models require defining a choice set for each OD pair.
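The flow computation in (15) can be sketched on a small case. The network below is hypothetical (it is not the network of Figure 3, whose travel times are not reproduced here), with utility equal to minus arc length and unit scale. Because utilities do not depend on the incoming arc, the system is written over nodes rather than links, which is a simplification of the link-based formulation.

```python
import math

# Hypothetical network (tail, head, length); an assumption for illustration only.
ARCS = [(1, 4, 2.0), (1, 4, 6.0), (1, 2, 1.0), (2, 4, 2.0), (2, 3, 2.0), (3, 4, 1.0)]
NODES = [1, 2, 3, 4]
DEST = 4
DEMAND = {1: 100.0}  # demand from each origin to DEST


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


# Value function by backward induction (valid because this network is acyclic).
V = {DEST: 0.0}
for node in (3, 2, 1):
    V[node] = math.log(sum(math.exp(-length + V[head])
                           for tail, head, length in ARCS if tail == node))

# Arc choice probabilities and node-to-node transition matrix P.
arc_prob = [math.exp(-length + V[head] - V[tail]) for tail, head, length in ARCS]
index = {node: i for i, node in enumerate(NODES)}
n = len(NODES)
P = [[0.0] * n for _ in range(n)]
for (tail, head, _), p in zip(ARCS, arc_prob):
    P[index[tail]][index[head]] += p

# Solve (I - P^T) F = G for the expected flow F through each node.
A = [[(1.0 if i == j else 0.0) - P[j][i] for j in range(n)] for i in range(n)]
G = [DEMAND.get(node, 0.0) for node in NODES]
node_flow = solve_linear_system(A, G)

# The flow on an arc is the flow through its tail node times the arc probability.
arc_flow = [node_flow[index[tail]] * p for (tail, _, _), p in zip(ARCS, arc_prob)]
for (tail, head, length), flow in zip(ARCS, arc_flow):
    print(f"arc {tail}->{head} (length {length:g}): flow {flow:.2f}")
```

A single solve yields the flows for every origin sending demand to the same destination; adding entries to `DEMAND` is enough to load several origins at once.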
Link  Nodes  

1  oE  0.00  8.83  13.88  12.99 
2  oA  100.00  91.17  86.12  87.01 
3  AF  36.92  35.75  37.10  37.39 
4  AB  63.08  55.42  49.02  46.63 
5  BC  26.16  27.67  24.47  25.10 
6  BH  36.92  27.75  24.55  24.53 
7  CD  0.00  0.00  0.00  0.12 
8  Cd  0.00  8.00  7.07  6.77 
9  CI  26.16  19.67  17.40  18.21 
10  Dd  0.00  0.00  0.00  0.12 
11  EG  0.00  8.83  13.88  12.99 
12  FG  0.00  8.00  12.55  12.86 
13  FH  36.92  27.75  24.55  24.53 
14  GH  0.00  0.00  11.54  12.04 
15  Gd1  0.00  16.83  14.89  13.60 
16  Gd2  0.00  0.00  0.00  0.20 
17  HI  35.08  26.36  28.80  30.40 
18  Hd  38.76  29.14  31.84  30.70 
19  Id  61.24  46.03  46.20  48.60 
5.3 Accessibility measures
Accessibility measures are another example of information which can be computed from path choice models. The accessibility is a measure of the overall satisfaction of an individual for the available alternatives, i.e., the existing paths in a network for a given OD pair, and is formally defined as the maximum expected utility of the alternatives. According to the RL model, the accessibility is simply the value function at the origin in (13). In path-based models there is no notion of value function, and instead the accessibility depends on the generated choice set $C$,
$$\mu \log \sum_{\sigma \in C} e^{v(\sigma)/\mu}. \tag{16}$$
In the network of this example, accessibility measures are given in Table 9. This illustrates that the value of accessibility differs significantly depending on choice set composition, and that as more paths are added to the choice set, the value predicted by PL models converges to that predicted by the RL model, as asserted by Zimmermann et al. (2017). Obtaining a prediction of accessibility which is independent of choice sets is very useful, as it makes it possible to compare accessibility before and after network improvements (e.g., after links are added) without bias. When path-based models are used, reported accessibility measures may be incoherent, e.g., decreasing after network improvements, an issue dubbed the Valencia paradox in Nassir et al. (2014).
0.9592  0.6738  0.5512  0.5478 
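The dependence of the path-based accessibility on the choice set can be illustrated with a small sketch. It does not use the network of Figure 3; instead it reuses the path lengths of Section 4.1 (2, 3, 4 and 6) with utility equal to minus the length and unit scale, and both the nested choice sets and the arc list are hypothetical.

```python
import math

MU = 1.0


def logsum(path_lengths, mu=MU):
    """Path-based accessibility: mu * log(sum of exp(utility / mu)), utility = -length."""
    return mu * math.log(sum(math.exp(-length / mu) for length in path_lengths))


# Nested, hypothetical choice sets over paths of length 2, 3, 4 and 6.
choice_sets = [[2.0], [2.0, 3.0], [2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0]]
pl_accessibility = [logsum(c) for c in choice_sets]

# RL accessibility: value function at the origin of an assumed acyclic network
# (tail, head, length) whose four paths have exactly those lengths.
ARCS = [(1, 4, 2.0), (1, 4, 6.0), (1, 2, 1.0), (2, 4, 2.0), (2, 3, 2.0), (3, 4, 1.0)]
V = {4: 0.0}
for node in (3, 2, 1):  # reverse topological order
    V[node] = MU * math.log(sum(math.exp((-length + V[head]) / MU)
                                for tail, head, length in ARCS if tail == node))
rl_accessibility = V[1]

for c, value in zip(choice_sets, pl_accessibility):
    print(f"PL with {len(c)} path(s): {value:.4f}")
print(f"RL value function at origin: {rl_accessibility:.4f}")
```

In this sketch the path-based value changes with every path added and coincides with the RL value only once all four paths are included, whereas the RL value requires no choice set at all.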
6 Conclusion
This paper presented a tutorial on analyzing and predicting path choices in a network with recursive discrete choice models. The goal of path choice models is to identify the cost function representing users’ behavior, assuming that individuals act rationally by maximizing some kind of objective function when choosing a path in a network. Such models are useful to provide insights into the motivations and preferences of network users and to make aggregate predictions, for instance in the context of traffic equilibrium models.
In this tutorial, we presented the state-of-the-art methodology for this problem, namely recursive discrete choice models. This methodology is superior in many respects to the discrete choice models based on paths extensively used in the transportation demand modeling literature. This tutorial achieved two main contributions, which we describe below.
First, we provided a fresh and broader research context for this problem, which has traditionally been addressed mostly from the angle of econometrics in transportation. Namely, we drew links between discrete choice modeling and related work in inverse optimization and inverse reinforcement learning, which facilitates a greater understanding of the recursive models presented in this work. In particular, we contextualized discrete choice as a method for inverse optimization with noisy data, and showed that viewing the inner problem as a Markov decision process naturally yields the recursive formulation.
Second, we highlighted the advantages of recursive models through an illustrated comparison with the most widely used method in the literature, i.e., path-based discrete choice models. While we do not aim at discussing the validity of the behavioral assumptions of either model, we illustrated that recursive models are mathematically convenient, yielding consistent parameter estimates and predicting choices faster without choice set generation.
Acknowledgements
We are very grateful to Tien Mai for his work developing the code which was used for the examples in this tutorial. This work was partly funded by the Natural Sciences and Engineering Research Council of Canada, discovery grant 435678-2013.
References
 Abbeel and Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 Ahuja and Orlin (2001) Ahuja, R. K. and Orlin, J. B. Inverse optimization. Operations Research, 49(5):771–783, 2001.
 Aswani et al. (2018) Aswani, A., Shen, Z.-J., and Siddiq, A. Inverse optimization with noisy data. Operations Research, 2018.
 Baillon and Cominetti (2008) Baillon, J.-B. and Cominetti, R. Markovian traffic equilibrium. Mathematical Programming, 111(1–2):33–56, 2008.
 Bärmann et al. (2017) Bärmann, A., Pokutta, S., and Schneider, O. Emulating the expert: Inverse optimization through online learning. In International Conference on Machine Learning, pages 400–410, 2017.
 Bekhor and Toledo (2005) Bekhor, S. and Toledo, T. Investigating path-based solution algorithms to the stochastic user equilibrium problem. Transportation Research Part B: Methodological, 39(3):279–295, 2005.
 Bekhor et al. (2006) Bekhor, S., Ben-Akiva, M. E., and Ramming, M. S. Evaluation of choice set generation algorithms for route choice models. Annals of Operations Research, 144(1):235–247, 2006.
 Bellman (1958) Bellman, R. On a routing problem. Quarterly of applied mathematics, 16(1):87–90, 1958.
 Bertsekas and Tsitsiklis (1991) Bertsekas, D. P. and Tsitsiklis, J. N. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
 Bertsimas et al. (2015) Bertsimas, D., Gupta, V., and Paschalidis, I. C. Data-driven estimation in equilibrium using inverse optimization. Mathematical Programming, 153(2):595–633, 2015.
 Bliemer and Bovy (2008) Bliemer, M. and Bovy, P. Impact of route choice set on route choice probabilities. Transportation Research Record: Journal of the Transportation Research Board, (2076):10–19, 2008.
 Bliemer and Taale (2006) Bliemer, M. C. and Taale, H. Route generation and dynamic traffic assignment for large networks. In Proceedings of the First International Symposium on Dynamic Traffic Assignment, Leeds, UK, pages 90–99, 2006.
 Bovy (2009) Bovy, P. H. On modelling route choice sets in transportation networks: a synthesis. Transport reviews, 29(1):43–68, 2009.
 Burton and Toint (1994) Burton, D. and Toint, P. L. On the use of an inverse shortest paths algorithm for recovering linearly correlated costs. Mathematical Programming, 63(1–3):1–22, 1994.
 Burton et al. (1997) Burton, D., Pulleyblank, W., and Toint, P. L. The inverse shortest paths problem with upper bounds on shortest paths costs. In Network Optimization, pages 156–171. Springer, 1997.
 Burton and Toint (1992) Burton, D. and Toint, P. L. On an instance of the inverse shortest paths problem. Mathematical Programming, 53(1–3):45–61, 1992.
 Chan et al. (2018) Chan, T. C., Lee, T., and Terekhov, D. Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, 2018.
 Dijkstra (1959) Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.
 Floyd (1962) Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.
 Fosgerau et al. (2013) Fosgerau, M., Frejinger, E., and Karlström, A. A link based network route choice model with unrestricted choice set. Transportation Research Part B, 56:70–80, 2013.
 Frejinger (2008) Frejinger, E. Route choice analysis: data, models, algorithms and applications. PhD thesis, Ecole Polytechnique Federale de Lausanne, 2008.
 Frejinger et al. (2009) Frejinger, E., Bierlaire, M., and Ben-Akiva, M. Sampling of alternatives for route choice modeling. Transportation Research Part B: Methodological, 43(10):984–994, 2009.
 Horowitz and Louviere (1995) Horowitz, J. L. and Louviere, J. J. What is the role of consideration sets in choice modeling? International Journal of Research in Marketing, 12(1):39–54, 1995.
 Keshavarz et al. (2011) Keshavarz, A., Wang, Y., and Boyd, S. Imputing a convex objective function. In Intelligent Control (ISIC), 2011 IEEE International Symposium on, pages 613–619. IEEE, 2011.
 Mai et al. (2015) Mai, T., Fosgerau, M., and Frejinger, E. A nested recursive logit model for route choice analysis. Transportation Research Part B, 75:100–112, 2015.
 Mai et al. (2016) Mai, T., Bastin, F., and Frejinger, E. A decomposition method for estimating recursive logit based route choice models. EURO Journal on Transportation and Logistics, pages 1–23, 2016. doi: 10.1007/s13676-016-0102-3.
 Nassir et al. (2014) Nassir, N., Ziebarth, J., Sall, E., and Zorn, L. Choice set generation algorithm suitable for measuring route choice accessibility. Transportation Research Record, (2430):170–181, 2014.
 Ng et al. (2000) Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, page 2, 2000.
 Nielsen and Jensen (2004) Nielsen, T. D. and Jensen, F. V. Learning a decision maker’s utility function from (possibly) inconsistent behavior. Artificial Intelligence, 160(1–2):53–78, 2004.
 Polychronopoulos and Tsitsiklis (1996) Polychronopoulos, G. H. and Tsitsiklis, J. N. Stochastic shortest path problems with recourse. Networks: An International Journal, 27(2):133–143, 1996.
 Prato and Bekhor (2007) Prato, C. and Bekhor, S. Modeling route choice behavior: how relevant is the composition of choice set? Transportation Research Record: Journal of the Transportation Research Board, (2003):64–73, 2007.
 Prato (2009) Prato, C. G. Route choice modeling: past, present and future research directions. Journal of Choice Modelling, 2(1):65–100, 2009.
 Ratliff et al. (2006) Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
 Rust (1994) Rust, J. Structural estimation of markov decision processes. Handbook of econometrics, 4:3081–3143, 1994.
 Zhang et al. (1995) Zhang, J., Ma, Z., and Yang, C. A column generation method for inverse shortest path problems. Zeitschrift für Operations Research, 41(3):347–358, 1995.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
 Zimmermann et al. (2017) Zimmermann, M., Mai, T., and Frejinger, E. Bike route choice modeling using GPS data without choice sets of paths. Transportation Research Part C: Emerging Technologies, 75:183–196, 2017.