I Introduction
Given the millions of interconnected smartphones^{1} and in-vehicle sensors sold annually, it is promising to leverage the crowd for data sensing and sharing. Mobile social network applications constitute an important platform for traffic information sharing, helping users collect and share real-time sensor information about the driving conditions they experience on the traveled path (see [2]). Platforms inform new travelers of the paths they should take by aggregating information from other users who used these paths in the past and recommending the path with the least estimated current travel cost. For example, Waze uses a mobile social network platform for drivers to share traffic and road information. Another example is Google Maps, which uses real-time traffic data shared by hundreds of millions of people around the world to analyse traffic and road conditions [3].
^{1} Smartphones are equipped with various sensors, such as cameras, GPS and accelerometers, which enable mobile users to easily sense many real-time traffic conditions while they drive [1].
All these platforms estimate the current average cost of the alternative paths and suggest the least costly paths to their travelers. Obviously, a selfish user will follow such a myopic suggestion. But would she follow the suggestions of an optimal (social cost optimising in the long run) platform that frequently explores riskier paths in case these become superior over time? This incentive issue becomes even more important since our Price of Anarchy analysis (see Section IV) suggests that myopic platforms, whose recommendations users are likely to follow, can be arbitrarily bad in terms of efficiency compared to the optimal platform.
In this paper we illustrate the above issues by considering the simple but fundamental case of joint routing and learning in a context where users plan point-to-point trips by choosing between two paths P1 and P2. P2 has a fixed user driving cost, whereas P1 has driving conditions (e.g., visibility, ‘black ice’ segments, congestion) that alternate between a ‘good’ and a ‘bad’ state according to a two-state partially observable Markov chain with known transition probabilities, influencing the expected driving cost over the path. When in good (bad) condition, P1 has a lower (higher) expected cost than P2. By aggregating information about the actual cost experienced by users who traveled over P1, a mobile platform can estimate its current state and make the appropriate recommendation to future travelers. Selfish users deciding on their current trip would prefer P1 only if its current expected cost, conditioned on the available information, is less than the known cost of P2.
But there are additional reasons to explore P1 even if it momentarily looks worse on average than P2. An ‘altruistic’ user would take this ex ante costlier path in order to increase system information about P1. With a little luck, finding P1 in its good state will benefit future travelers, who will exploit this information. Hence a socially optimal platform would, at appropriate times, advise some of the users to take paths that are myopically suboptimal for them. Unfortunately, this is not a Nash equilibrium strategy for the system since, without appropriate incentives, selfish users will always choose the path with the least current expected cost. This results in exploring the stochastic path P1 less frequently than socially desired.
We show that the myopic routing strategy achieves a Price of Anarchy (PoA) that can be arbitrarily large compared to the case that users follow the recommendations of the optimal platform. We prove that by restricting information the incentives of the users become aligned with the incentives of the social planner: simply hide the information reported by the past travellers and recommend the socially optimal path choice to current travelers. Using a correlated equilibrium concept, we show that the equilibrium strategy of the users is to follow the recommendations of the optimal platform.
Numerous works study traffic estimation based on information sharing by travelers (e.g., [4], [5]). Our paper does not deal with the technical details of how to aggregate and process information or how to architect such systems. It provides a simple conceptual model for user incentive mechanism design when there are exploration-exploitation tradeoffs. In a different direction, exploration-exploitation in optimal decision making is well studied in classical multi-armed bandit problems where decisions are made centrally (e.g., [6]). In our model we have multiple bandits (corresponding to paths) but we cannot force the optimal sequence of choosing arms due to users’ selfishness. Each machine will be played in a myopic sense if full information is disclosed.
Incentive mechanism design for participation in crowdsensing platforms has been well studied recently (e.g., [7], [8], [9]). In our case participation is not an issue since we prove that users always gain by participating. As a parallel to the case of allocating tasks to agents, our goal is to incentivise agents to accept tasks that may not be optimal for them but create the best results for the rest of the community. Similar to our idea, there is some recent work in the economics literature ([10, 11]) on motivating the wisdom of the crowd. Yet [10] did not look at a dynamic Markov chain model for long-term forecasting, and [11] requires incentive payments (which are not possible for many traffic recommendation applications). Instead, we model and analyze a more interesting but complex partially observable Markov decision process (POMDP), and propose a payment-free incentive mechanism for the POMDP model. Further, we study the incentive compatibility of model-free reinforcement learning, which approximates the complex POMDP policy and is easy to implement in practice. Our main contributions are:

We formulate a joint routing and learning model for users making travel path choices. The POMDP model is simple but powerful enough to formulate some key problems in incentive compatible platform design. The optimal policy for recommending paths may prefer paths with higher average costs to exploit their low cost states. This policy serves as a benchmark for efficiency comparisons with other policies.

Although the optimal policy cannot be derived in closed form, we compute the Price of Anarchy (PoA) of myopic decision making by comparing to the optimum. If platforms (or users) minimise the short-term travel cost, the PoA is equal to 1/(1 − ρ), where ρ is the discount factor used in the optimal policy. This shows that myopic platforms, whose recommendations users are likely to follow, can be arbitrarily bad.

We consider the challenging case of ‘sophisticated’ users: such a user has full system information (i.e., the system parameters and the POMDP used to derive the optimal policy). If we allow such users to access the travel information collected by the platform from past travelers, the system with sophisticated users has an equilibrium that corresponds to using the myopic policy. Accordingly, we propose an information restriction mechanism such that the equilibrium is to follow the recommendations of the optimal policy, achieving PoA = 1.

In practice, an approximation of the optimal policy can be obtained via reinforcement learning. We consider the incentive compatibility of the platform using Q-learning. We numerically show that the more accurate the learning algorithm is, the ‘more’ incentive compatible the system with restricted information becomes. We further extend the two-path model to include more stochastic paths, and show that incentive compatibility is easier to ensure under our information restriction mechanism.
The rest of the paper is organized as follows. Section II introduces the network model and formulates the problem as a POMDP over a belief state about the paths. Section III presents the optimal platform design and Section IV presents two myopic platforms as comparison benchmarks. Section V shows the incentive mechanism design for myopic users. Section VI presents the model-free optimization technique of Q-learning and analyses the incentive compatibility issues, and Section VII extends the two-path model for examining users’ incentive compatibility. Section VIII concludes.
II System Model and Problem Formulation
As mentioned in the Introduction, we model the selfish behaviour of platform users. To make the problem nontrivial we consider the challenging case in which such users are ‘infinitely sophisticated’ in terms of analytical and computational capabilities and have full information about the system parameters and the platform algorithms. We would like to investigate the actions of such users and the corresponding social cost if i) there is no platform recommending an action, ii) the platform, besides recommending a path, also makes available the full information collected so far from other users, and iii) such information is hidden and only the current path recommendation is available. To make the above problem well defined we use as a benchmark the case of an optimal platform and then analyse what happens in the practical case of a platform that uses machine learning, in particular the Q-learning algorithm.
Our optimal platform makes routing decisions under uncertainty, capturing the fundamental tradeoff between exploring new possibilities and optimally exploiting the current information. To make the problem analytically tractable, we choose a network model that is simple but fundamental enough to capture the essential aspects of making such routing decisions.
II-A Network Model
We consider the simplest case where there are only two paths for our users to choose from: one with deterministic cost and another that alternates randomly between two states, each such state generating a different average cost. A platform user who travels along the stochastic path probes the path and experiences some actual cost, which is reported to the platform. The platform collects these cost reports into a path history and uses Bayesian inference to determine the probability that the path is in the high or low cost state. Though simple, this two-path network model captures the fundamental exploration-exploitation tradeoffs in making routing decisions, and makes users face the incentive problems we would like to analyse.^{2}
^{2} Note that our analysis can easily be extended to include multiple paths with deterministic costs in a larger network, by removing all the deterministic paths apart from the one with the smallest cost for routing consideration. Yet the analysis for multiple paths with time-varying costs is more involved, as we need to update and balance the belief states of all stochastic paths. Still, Section VII provides some interesting results for developing the optimal threshold-based policy and examining the incentive compatibility for users to follow the platform recommendations.
Our (road) network model, with a source node and a destination node, is shown in Fig. 1(a), with two paths from the set {P1, P2}. We consider an infinite discrete time horizon and assume that during each discrete epoch there is a single user of our platform who must travel from the source to the destination and must choose between paths P1 and P2. In this abstract model a trip takes a single epoch to complete.^{3}
^{3} We can easily extend the model to the case where a trip takes any fixed number of epochs.
We define the road condition experienced by a traveler on path P1 as a binary random variable:
is the event that a hazard occurs to the traveler (e.g., poor visibility, ‘black ice’ segments, congestion), i.e., driving on the path generates some positive fixed driving cost .

is the event that no hazard occurs to the traveler; without loss of generality we associate with this case a zero driving cost.
Users that drive on P1 observe the road condition and incur the corresponding cost, depending on whether or not a hazard occurs.
To capture the randomness of the road condition of P1, we assume that P1 alternates between two states and during as a Markov chain with transition probabilities as in Fig. 1(b), and in each such state is i.i.d. with a different distribution. In state the probability of incurring a hazard , whereas in state this probability is , where . Path P2 is always in a known cost state, generating cost such that .^{4} Since , corresponds to the high (expected) travel cost state, with average cost per traveler . Similarly, is the low travel cost state with average cost . Note that if P1 is in the high cost state, there is always some probability that a traveler incurs no hazard. Similarly, if P1 is in the low cost state, there is still some probability that a traveler incurs a hazard.
^{4} Otherwise, P2 will never be chosen due to its always higher cost than P1.
A user that travels on P1 observes . If we say that her observation is . A user travelling along P2 observes nothing about the condition of path P1, in which case we say her observation is (provides no information about P1 due to travel on P2). A user always shares her observation about with the platform. We denote the observation of a user that traveled at time by , where . The history of observations available to the platform by time corresponds to .
II-B Platform Information Model
We next introduce how the platform works. Given the history of observations , it determines the probability that the path is in state or using Bayesian inference. To avoid keeping an ever-increasing history of observations, we summarize the available information equivalently into a single belief state , the probability that path P1 is in state just before the travel of the user at time . We denote the platform’s initial belief state as .
To make our Bayesian inference precise, we need to define the refined sequence of events in our model from to . To do that we refine time and use as ‘micro’ time refinements around time (where ).

At time there is no event occurring; we just summarise our belief about P1’s state based on the previous history: compute the prior probability , i.e., the probability for P1 being in just before .
At time a user probes the paths by traveling and she supplies her trip observation . We use to update our posterior probability for the state of P1 being at time after the trip observation.
At time the Markov chain of the path state makes a transition.
In this model we consider that road conditions in P1 change on time scales slower than or equal to the time scale of user trip arrivals. Then two consecutive users do not see P1 in its steady state distribution, and hence the probability for depends on the history of the observations.
The belief state can be derived in a recursive way from the observations and . Let be the choice of path of the user who travels at time . Consider first that . If
, then by Bayes’ Theorem, the posterior probability that the cost state is
after time is(1) 
where we use the fact that the path state does not change during . Similarly, if , we obtain
(2) 
If , then and the posterior probability is the same as the prior probability, i.e., .
Given the posterior probability , we can finally compute the probability that P1 is in state at as
(3) 
Observe that a user offering positive information to the platform by travelling on P1 incurs an average cost of
(4) 
which might be more than the safe travel on P2 with fixed cost . This creates a tension between individual incentives and social optimality as we analyse next in the optimal platform design problem.
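The recursive belief update described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the platform's implementation: the names `p_high` and `p_low` (hazard probabilities in the high- and low-cost states), `q_hh` and `q_ll` (probabilities that the chain stays in the high- or low-cost state), `hazard_cost`, and the encoding of an observation as `1` (hazard), `0` (no hazard) or `None` (trip on P2) are all introduced here for illustration. The belief `x` is taken to be the probability that P1 is in its high-cost state.

```python
def posterior(x, hazard, p_high, p_low):
    """Bayes step after a trip on P1: belief in the high-cost state given the report."""
    like_high = p_high if hazard else 1.0 - p_high
    like_low = p_low if hazard else 1.0 - p_low
    evidence = x * like_high + (1.0 - x) * like_low
    if evidence == 0.0:
        # The report has zero probability under the current belief; keep the prior.
        return x
    return x * like_high / evidence


def predict(x_post, q_hh, q_ll):
    """Markov transition step: prior belief for the next epoch."""
    return x_post * q_hh + (1.0 - x_post) * (1.0 - q_ll)


def next_belief(x, obs, p_high, p_low, q_hh, q_ll):
    """One full epoch. obs is 1 (hazard), 0 (no hazard) or None (user took P2)."""
    x_post = x if obs is None else posterior(x, obs == 1, p_high, p_low)
    return predict(x_post, q_hh, q_ll)


def expected_p1_cost(x, p_high, p_low, hazard_cost):
    """Average cost of a trip on P1 when the belief is x."""
    return hazard_cost * (x * p_high + (1.0 - x) * p_low)
```

For instance, starting from a belief of 0.5 with `p_high = 0.8` and `p_low = 0.2`, a reported hazard moves the posterior to 0.8 before the Markov transition is applied, while a trip on P2 leaves the posterior equal to the prior.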
III The Optimal Platform by Solving POMDP
The optimal platform operation is modelled as a Markov decision process (MDP) where the state is our belief state , decisions correspond to path choices for travelers, and the cost function is the total discounted cost from travel. In fact, our problem can be seen as a partially observable Markov decision process (POMDP), and it is a standard solution method to reformulate it as an MDP over a belief state. Though this optimal design problem is notoriously difficult to solve, it provides a performance upper bound to evaluate i) myopic platforms and ii) model-free machine learning platforms.
A stationary routing policy is a function that specifies an action for each state at any time. Given the initial belief , the goal of the optimal platform is to find an optimal stationary policy to minimize the expected total discounted driving cost (social cost) over an infinite time horizon, i.e.,
(5) 
where is the discount factor over time and is either (4) or if the specified routing action is or . We refer to the minimum cost value solution of the Bellman equation (5) as the ‘value function’. According to our discussion of belief state updating in Section II-B, the specific optimality equation of our problem can be written as follows:
(6) 
For ease of reading, we denote by and the first and second terms in the minimum operator of (6), respectively. Hence, is the expected discounted cost starting from state if action is taken at the first time epoch and the optimal policy is followed thereafter. Once we determine the exact value function, the optimal policy can be obtained for any state as
(7) 
We can easily show that the optimal platform might recommend that users travel on P1 even when the expected travel cost in (4) is higher than that of P2 (i.e., when the myopic decision is P2), for the sake of future exploration benefit.
Although our analysis of the above POMDP and the corresponding incentive issues is possible for any set of parameters, to illustrate better our key ideas and results we choose a specific set of parameters as follows.
Assumption 1.
The Markov chain in Fig. 1(b) is symmetric with where , and the probabilities and are complementary, i.e., and where .
In the rest of the paper, we assume that Assumption 1 holds. Without it, the more general problem can still be analysed in a similar way and yields the same theoretical results.
Before solving (6) we can first prove that it has a unique solution by using the contraction mapping theorem. Note that the minimum operator in (6) is a contraction operator since . Furthermore, we prove by mathematical induction that the value function is a piecewise-linear concave function of the belief state . Besides, we show that is an increasing function of . Here we skip detailed proofs due to the page limit.
Proposition 2.
There exists a unique solution to the optimality equation (6) and it is a piecewise-linear, increasing and concave function of the belief state .
The proof is given in Appendix A. Although the existence of a solution to (6) is guaranteed, it is still difficult to solve it analytically. An intuitive conjecture^{5} about the optimal policy is that it is of threshold type.
^{5} Our POMDP model is similar (but not identical) to the well studied problem of searching for a moving object [12]. Proving the same conjecture for that problem still remains an open problem. This suggests that proving (or disproving) the threshold property for the optimal policy in our case can be extremely challenging. Yet, using extensive numerical analysis for a very fine grid of parameter values, we have observed that Conjecture 3 remains true.
Conjecture 3.
There exists a threshold value such that it is optimal to choose path P1 when the belief state is in , choose P2 when the belief state is in , and choose any of the two paths at .
We have been able to formally prove our conjecture for a restricted set of parameters as follows, by using the concavity of the value function .
Proposition 4.
If , the optimal policy is of threshold type.
The proof is given in Appendix B. We have the following corollary which directly follows from Proposition 4.
Corollary 5.
If or , the optimal policy is of threshold type.
For our experimental analysis we finely discretise the state space , use value iteration to compute the value function in (6), and finally compute the optimal policy at each given belief state by solving (7).
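The discretise-then-iterate procedure just described can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a symmetric two-state chain that stays in its current state with probability `q_stay`, takes the belief `x` to be the probability that P1 is in the high-cost state, uses linear interpolation between grid points, and breaks ties in favour of P1. All parameter names (`p_high`, `p_low`, `hazard_cost`, `safe_cost`, `rho`) are hypothetical.

```python
def interpolate(values, x):
    """Linearly interpolate values given on an even grid over [0, 1]."""
    n = len(values) - 1
    pos = min(max(x, 0.0), 1.0) * n
    i = min(int(pos), n - 1)
    w = pos - i
    return (1.0 - w) * values[i] + w * values[i + 1]


def value_iteration(p_high, p_low, q_stay, hazard_cost, safe_cost, rho,
                    n=200, tol=1e-9):
    """Return (grid, values, policy); policy[i] is 1 for P1 and 2 for P2."""
    grid = [i / n for i in range(n + 1)]
    values = [0.0] * (n + 1)

    def transition(x_post):
        # Symmetric chain: the state is kept with probability q_stay.
        return x_post * q_stay + (1.0 - x_post) * (1.0 - q_stay)

    while True:
        new_values, policy = [], []
        for x in grid:
            p_hazard = x * p_high + (1.0 - x) * p_low
            # Action P1: pay the expected hazard cost, then learn from the report.
            q1 = hazard_cost * p_hazard
            if p_hazard > 0.0:
                post = x * p_high / p_hazard
                q1 += rho * p_hazard * interpolate(values, transition(post))
            if p_hazard < 1.0:
                post = x * (1.0 - p_high) / (1.0 - p_hazard)
                q1 += rho * (1.0 - p_hazard) * interpolate(values, transition(post))
            # Action P2: pay the fixed cost; the belief only drifts with the chain.
            q2 = safe_cost + rho * interpolate(values, transition(x))
            new_values.append(min(q1, q2))
            policy.append(1 if q1 <= q2 else 2)
        gap = max(abs(a - b) for a, b in zip(new_values, values))
        values = new_values
        if gap < tol:
            return grid, values, policy
```

Because each sweep is a contraction with modulus `rho`, the loop terminates for any `rho` strictly below 1. The grid size `n` and tolerance `tol` trade accuracy for running time, and the returned `policy` list can be scanned for the belief at which the recommendation switches from P1 to P2.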
How much would society lose compared to the optimum if a myopic platform (always chooses the least current cost path) is in place? We see in the next section that this performance loss can be arbitrarily large.
IV Myopic Platforms and PoA
Myopic platforms such as Waze and Google Maps estimate the travel cost of different paths and suggest to users the path with the smallest cost. In this section we introduce two basic myopic platforms: a platform that does not use feedback information from users, and a platform that uses such feedback to update the current cost estimate. We analyse these platforms and characterise their performance gaps with respect to the optimal platform of Section III in terms of the price of anarchy (PoA). The large PoA values resulting from our analysis suggest that the optimal platform is definitely desirable, but such a platform is not incentive compatible. This motivates our incentive alignment proposal in the rest of the paper.
IV-A Myopic Platform without Information Sharing
In this case the platform uses long-run average path costs to make recommendations. For our specific model parameters, cost states and each have probability and the expected cost to travel through path P1 is . The routing policy of the myopic platform is straightforward. Let denote the routing policy without information sharing, then
which is independent of . Thus, either chooses path P1 all the time or path P2. We can now calculate the value function. If , always chooses path P2 to incur immediate cost to users over time, we have
If , always chooses P1. Given some initial probability about path P1 (assumed known to the platform), the value function satisfies
Similar to the proof of Proposition 2, we can prove the existence and uniqueness of . We can also prove by mathematical induction that is a linear function of . It follows that,
The price of anarchy of this myopic policy is defined as the ratio between the maximum expected total discounted cost incurred under this myopic policy and the minimum expected total discounted cost in (6), by searching over all possible network parameters. That is,
(8) 
Proposition 6.
Given and , the policy achieves an infinite price of anarchy, i.e., .
Sketch of Proof: Let us rescale costs so that . To determine the PoA, we purposely create a worst-case scenario where always chooses path P2 ( and ). Furthermore, let the initial P1 state be (i.e., ) and let the Markov chain change very slowly (). Then path P1 will remain in for a very long time. Since always chooses path P2, its cost value is a constant . Since there is zero average cost in state and the Markov chain is fully observable, the optimal policy will choose path P1 until a change of state occurs, i.e., until a nonzero cost is observed. But the time of such a transition can be made arbitrarily large since while our cost discount factor remains constant and equal to . A more formal argument in Appendix D can be used to prove that the price of anarchy of is infinity.
To prove Proposition 6, we can purposely create the worst case scenario with properly chosen initial state and costs and , where always chooses path P2 but the optimal policy chooses path P1 until a nonzero cost is observed. In this case, the expected cost of optimal policy can be made arbitrarily close to zero.
Even though can be arbitrarily worse than the optimal policy, users will still follow the platform recommendation under . Without any other information, sophisticated users can reproduce the calculations of the platform and hence will follow .
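The two constant policies discussed in this subsection (always P2, or always P1) are simple enough to evaluate in closed form. The sketch below is our own derivation under stated assumptions, not a formula taken from the paper: it assumes a symmetric chain that keeps its state with probability `q_stay`, complementary hazard probabilities (`p_high` in the high-cost state and `1 - p_high` in the low-cost state), a discount factor `rho`, and an initial probability `x` that P1 is in the high-cost state. Under these assumptions the always-P1 value is linear in `x`, and a truncated discounted sum is included as a numerical cross-check.

```python
def always_p2_value(safe_cost, rho):
    """Discounted cost of paying the fixed P2 cost in every epoch."""
    return safe_cost / (1.0 - rho)


def always_p1_value(x, p_high, q_stay, hazard_cost, rho):
    """Closed-form discounted cost of always taking P1 (linear in x)."""
    drift = 2.0 * q_stay - 1.0   # per-epoch decay of the deviation of P(high) from 1/2
    tilt = 2.0 * p_high - 1.0    # gap between the two hazard probabilities
    return hazard_cost * (0.5 / (1.0 - rho) + tilt * (x - 0.5) / (1.0 - rho * drift))


def always_p1_value_truncated(x, p_high, q_stay, hazard_cost, rho, horizon=2000):
    """Same quantity computed by summing discounted expected costs epoch by epoch."""
    total, prob_high = 0.0, x
    for t in range(horizon):
        p_hazard = prob_high * p_high + (1.0 - prob_high) * (1.0 - p_high)
        total += (rho ** t) * hazard_cost * p_hazard
        # Without feedback, the probability of the high-cost state follows the chain alone.
        prob_high = q_stay * prob_high + (1.0 - q_stay) * (1.0 - prob_high)
    return total
```

Comparing `always_p2_value` with `always_p1_value` at a given `x` shows how far apart the two constant policies can be; neither uses any traveler feedback, which is what allows the optimal policy to outperform them by an arbitrarily large factor.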
IV-B Myopic Platform with Information Sharing
We now consider a myopic platform where travelers share information online. The difference from the optimal platform is that here it chooses actions that myopically minimise immediate average costs. Given the current belief about P1, the immediate expected cost is for path P1 and for path P2. By equating the two costs and solving for the corresponding threshold belief state , we obtain . The myopic policy of this platform is
Note that users will follow the recommendation of the platform as their objectives are aligned.
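The myopic threshold obtained by equating the two immediate costs can be written out explicitly. The following sketch assumes hypothetical parameter names (`p_high`, `p_low` for the hazard probabilities in the high- and low-cost states, `hazard_cost` for the cost of a hazard on P1, `safe_cost` for the fixed cost of P2) and takes the belief `x` to be the probability that P1 is in the high-cost state. How ties are broken at the threshold is also an assumption made here.

```python
def myopic_threshold(p_high, p_low, hazard_cost, safe_cost):
    """Belief at which the expected P1 cost equals the fixed P2 cost."""
    # Solve hazard_cost * (x * p_high + (1 - x) * p_low) = safe_cost for x.
    return (safe_cost / hazard_cost - p_low) / (p_high - p_low)


def myopic_policy(x, p_high, p_low, hazard_cost, safe_cost):
    """Return 1 (take P1) if P1 is currently no costlier than P2, else 2."""
    expected_p1 = hazard_cost * (x * p_high + (1.0 - x) * p_low)
    return 1 if expected_p1 <= safe_cost else 2
```

With `p_high = 0.8`, `p_low = 0.2`, `hazard_cost = 1.0` and `safe_cost = 0.5`, the threshold is a belief of one half: below it the myopic platform recommends P1 and above it P2.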
Let be the cost value function under the myopic policy . Similar to (5), we obtain
(9) 
It is rather obvious that this myopic platform behaves the same way as the Nash equilibrium of a system that deploys the optimum platform but in which users have full information about the travel history, i.e., can reconstruct . This is because in the optimal platform users will still use the myopic policy of Section IV-B to choose paths, and hence the two systems will have the same sample paths on our probability space.
Fact 1.
On any sample path, the Nash equilibrium of the optimal platform with information sharing and selfish users is the same as the Nash equilibrium of the myopic platform (with information sharing and selfish users) using .
One can easily prove the intuitive result that is more conservative than in the sense that if prefers the risky path P1, then clearly should also prefer it since it obtains the additional/future benefit of learning the path state more accurately. Obviously, the reverse does not hold: if prefers P1, it does not imply that should also prefer P1. This is formally stated in the following proposition and will be used in our proof for incentive compatibility in Section V.
Proposition 7.
For any the optimal policy chooses path P1.
Proof.
Note that when ,
By the concavity of the value function ,
By combining the above two inequalities, we obtain when . This completes the proof. ∎
Note that if Conjecture 3 is true, a corollary is that .
Similar to (8) the price of anarchy of is defined as
Proposition 8.
Given and , the policy achieves .
Sketch of Proof: Let us rescale costs so that . Let the Markov chain be fully observable (i.e., ), and let it change very slowly (i.e., ). Let the initial probability be very small. Thus, with a very high probability, path P1 starts in state and remains in that state for a very long time thereafter. Now choose slightly smaller than so that chooses path P2 at the beginning. Without exploring path P1, the belief state will gradually increase with time and in turn continues choosing path P2 instead of exploring path P1. Hence, policy will always choose path P2, generating cost in every time epoch. But the optimal policy would take a little risk by exploring path P1 at the beginning to exclude the possibility that it is in state (which is highly improbable), and then keep exploiting the zero cost of state if this turns out to be the case. If the cost state turns out to be (which occurs with very low probability), we switch to path P2 thereafter, imitating . Hence exploring path P1 at the beginning generates a cost of , but from the second time epoch and for a very long time forward the cost under the optimal policy is either always (with prob. ) or (with prob. ). Simple calculations give the result as . The detailed proof can be found in Appendix C.
Similar to the proof idea of Proposition 6, we again purposely create the worst case, where always chooses path P2 but the optimal policy chooses path P1 until a nonzero cost is observed. But with information sharing, we cannot make the expected cost of the optimal policy arbitrarily close to zero. Thus, unlike , the PoA of is bounded for a fixed discount factor. This is because obtaining information from travelers allows the platform to significantly reduce the immediate cost. Without such information, the platform can make terrible routing decisions from the start. However, even with information sharing, the decision making of the platform can still be arbitrarily poor in the long term. The performance of the myopic platform becomes worse compared to the optimal policy as the discount factor increases and future costs become more important. As the discount factor approaches 1, the PoA approaches infinity, indicating a great performance loss due to the myopic nature of . As this performance loss can be huge, it is crucial to design incentive mechanisms that make the optimal policy incentive compatible.
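The arithmetic behind this worst-case construction can be reproduced numerically. The sketch below works in the idealised limit of the proof sketch: P1 is fully observable, the chain never leaves its initial state over the relevant horizon, the initial belief `x0` in the high-cost state is small, and the P2 cost is set equal to the expected P1 cost (the limit of "slightly smaller"). The names `x0`, `rho` and `hazard_cost` are assumptions introduced here, and the explore-once policy stands in for the optimal policy, so the computed ratio should be read as an illustration of the limiting value rather than an exact PoA.

```python
def worst_case_ratio(x0, rho, hazard_cost=1.0):
    """Cost ratio of 'always P2' to 'explore P1 once, then commit' in the slow-chain limit."""
    safe_cost = x0 * hazard_cost              # P2 cost at the myopic indifference point
    myopic_cost = safe_cost / (1.0 - rho)     # the myopic platform pays the P2 cost forever
    # Explore once: expected first-epoch cost x0 * hazard_cost; with probability x0
    # the bad state is revealed and the P2 cost is paid from the second epoch on.
    explore_cost = x0 * hazard_cost + rho * x0 * safe_cost / (1.0 - rho)
    return myopic_cost / explore_cost
```

Algebraically this ratio simplifies to `1 / (1 - rho + rho * x0)`, which tends to `1 / (1 - rho)` as `x0` goes to zero; with `rho = 0.9` the ratio approaches 10, and it grows without bound as `rho` approaches 1.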
V Information Restriction Mechanism for Incentive Compatibility of the Optimal Policy
To provide incentives for users to follow the recommendations of the optimal platform, we propose a novel information restriction mechanism. The idea is to hide from users the information collected by the platform from the previous travelers and supply only the path recommendation. This is equivalent to keeping private the information about the current value of the belief state that the optimal platform has constructed. Hence, a user knows only her current path recommendation, besides knowing the statistical properties of the paths and the platform algorithm.
We use the concept of correlated equilibrium (proposed by Robert Aumann [13]). In this model the platform provides a private signal to the players which then act in their best interest under information uncertainty. In our case the platform offers a private signal (its recommendation) and users decide to follow it or not. If no user would want to deviate from the recommendation assuming the others don’t deviate, we say all users following recommendations is a correlated equilibrium. The mechanism we propose here does not require the optimal policy to be of threshold type (Conjecture 3), and its incentive properties are just related to properties of the value function of the optimal policy.
The optimal policy always produces a partition of the belief state space into two sets , , where , is the set of belief states for which the optimal policy chooses action . Our signalling mechanism is defined as follows.
Definition 1.
Information Restriction Mechanism (IRM): The platform hides the history of observations (hence the belief state information ) from the users. It follows in (7) and recommends P1 when the belief state and P2 when .
IRM is incentive compatible if no user wants to deviate from her path recommendation unilaterally.
Next, we analyse the users’ actions (to follow the recommendation or not) in the correlated equilibrium under this mechanism. Although users have no knowledge of in real time, they are aware of the actual Markov chain model of the paths, the value of the parameters and the algorithm of the platform. They will reverse-engineer the platform recommendation to estimate the possible values of the actual belief state and, based on that, decide whether or not to follow the recommendation. More specifically, when the recommendation is P2, the user will infer that the current system state must be in , which implies that by Proposition 7. Note that the user benefits from choosing P1 for and P2 for . Thus, the user will follow the recommendation of P2. When the recommendation is P1, the user infers that the current system state must be in . We can prove that in the average sense she benefits by choosing P1, assuming the rest of the users do the same, and hence she will follow the recommendation of IRM. The incentive compatibility and the efficiency of IRM are formally stated in the next theorem.
Theorem 9.
Under IRM, all users following the optimal platform’s recommendation is a correlated equilibrium. Thus, our IRM achieves optimality and .
Proof.
Consider the point of view of a user at time who assumes that all the other users follow the optimal platform’s recommendation. Lacking any information about the history of the path state, and assuming that the system has already been operating for a very long time and the rest of the users follow the recommendation of the platform, her best estimate of the belief state is the stationary distribution under the optimal policy, which can then be conditioned on the recommendation for P1 or P2. To prove our result we do not need to evaluate this distribution analytically, but we need to establish certain properties of . To do that we use to evaluate the long-run undiscounted average cost that the system would incur if the platform uses the discounted cost optimal policy and users follow it.^{6} Then can be computed according to the stationary distribution .
^{6} Note that this is not the cost minimised by the platform; we only use it to establish a relation involving to be used later in the proof.
(10) 
where we used the claim that is less than (to be proved later). The formula above simply states that when in the average cost of a user following the recommendations is and when this cost is .
When the recommendation is P2, the user can reverse engineer the recommendation to infer that the current belief state must be in . According to Proposition 7, whenever , it follows that
(11) 
The user can compute the expected cost of travelling along P1 when the recommendation is P2 according to the stationary distribution of the belief state . By (11) it is larger than , that is,
Hence, the platform user will follow the recommendation to choose P2. When the recommendation is P1, the user infers that the current system state must be in . She will compute the expected cost of travelling along P1 according to . By using (V), this cost is smaller than since
Hence, each myopic user will follow the recommendation to choose P1.
Now we still need to prove our claim that (V) holds. Assume the initial distribution of the belief state is the stationary distribution . Then, the belief state at any time has the same probability distribution . Since policy is optimal for the total discounted cost minimization problem, the resulting optimal expected total discounted cost averaged over the initial state distribution is
Note that one can always choose path P2 at each time epoch and the resulting expected total discounted cost is
which must be larger than the optimal cost. Thus,
or . Thus, (V) holds and this completes the proof. ∎
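The last two steps of the argument can be written out compactly. The symbols below are assumed names introduced only for this sketch, since the original notation is not reproduced here: $\bar{C}$ is the long-run average cost under the optimal policy, $c_2$ the fixed cost of P2, $\rho\in(0,1)$ the discount factor, $\pi_1$ the stationary probability that P1 is recommended, and $m_1$ the expected cost of P1 conditional on that recommendation.

```latex
% Assumed notation (not the paper's): \bar{C}, c_2, \rho, \pi_1, m_1 as described above.
% Starting from the stationary distribution, every period has the same expected cost \bar{C}.
\begin{align*}
\sum_{t=0}^{\infty}\rho^{t}\,\bar{C} = \frac{\bar{C}}{1-\rho}
  &\le \sum_{t=0}^{\infty}\rho^{t}\,c_{2} = \frac{c_{2}}{1-\rho}
  &&\Longrightarrow\quad \bar{C}\le c_{2},\\
\bar{C} = \pi_{1}m_{1} + (1-\pi_{1})\,c_{2} &\le c_{2}
  &&\Longrightarrow\quad m_{1}\le c_{2}\qquad(\pi_{1}>0).
\end{align*}
```

The first line is the comparison with the "always choose P2" policy; the second is the decomposition of the average cost by recommendation, which gives the bound used for a user who is recommended P1.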
VI Reinforcement Learning Platform
In practice, it may be difficult to develop an exact POMDP model for analysing the routing policy, either because of the many unknown parameter values or because such a Markovian model may not be sensible. Hence we expect that platforms will resort to model-free reinforcement learning techniques such as Q-learning [14]. We want to obtain some insight into how reinforcement learning, which leads to suboptimal platforms, affects our mechanism results regarding incentive compatibility. In particular, we make the following conjecture, which we have been able to test with experiments.
Conjecture 10.
Under the IRM, as the machine learning algorithm becomes more efficient in reducing the average system cost, the range of system parameters for which the users follow the recommendations of the platform increases. In particular, in the case of Q-learning algorithms (see (VI)), as , IRM induces IC.
In simple terms, increasing platform efficiency combined with hiding information induces incentive compatibility in a wider range of systems. In this section we analyse the performance of such a learning platform and measure its performance loss from the optimal platform benchmark.
The classical Q-learning algorithm estimates the Q-value function in an online fashion and computes the optimal policy according to Q-values computed for all possible system states and actions. In this case, state records the latest observations (cost reports by the last travelers), where is a parameter of the learning algorithm. For each possible action in state the Q-value maps the state-action tuple to the anticipated cost, and the optimal action corresponding to the minimum Q-value is chosen. The platform updates the Q-values for each over time by learning from the path observations the actual costs that such actions generate in the given context.
We expect the performance of Q-learning to improve as increases, since the system makes decisions in a more detailed context. Another way to see this is that a larger allows for a better estimate of the correct value of the belief state that the POMDP-based optimal platform would like to use for its decisions. But larger values of come at the price of an exponential increase in the size of the state space (which is with each observation being ) and affect the time Q-learning needs to converge to its optimal choices. Our numerical results later suggest that a small such as already provides near-optimal performance. Next we describe the Q-learning algorithm adapted to our problem.
Given observation history before time , the platform takes action and incurs actual cost , and updates the observation vector from to . The Q-value is updated as:
(12)
where is the learning rate. It is known from [15] that Q-learning converges if each tuple is performed infinitely often and satisfies for each tuple ,
In our implementation we use where is the number of times that the platform observes and performs until time and (as suggested in [16]). We next show that Q-learning performs well when applied to the benchmark POMDP path model and then investigate incentive compatibility for users of this platform.
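The learning loop described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the cost model (P1 reports a cost of 0 or 1 with a state-dependent probability, P2 has fixed cost `c2`), every numeric default, the `EMPTY` marker for "no report", the epsilon-greedy exploration, and the polynomial learning rate `1 / n**omega` are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict

P1, P2 = 0, 1   # the two actions (paths)
EMPTY = -1      # assumed marker: no cost report because P2 was taken

def q_learning(k, steps, c2=0.4, rho=0.9, p_cost=(0.1, 0.9),
               stay=(0.9, 0.9), omega=0.7, eps=0.1, seed=0):
    """Tabular Q-learning whose state is the window of the last k observations.

    Assumptions (all hypothetical): hidden state 0 = good, 1 = bad;
    p_cost[s] is the probability that a P1 trip costs 1 in hidden state s;
    stay[s] is the probability that the hidden state does not change.
    Requires k >= 1.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)      # (window, action) -> estimated discounted cost
    visits = defaultdict(int)   # (window, action) -> number of updates so far
    hidden = 0
    window = (EMPTY,) * k
    for _ in range(steps):
        # Epsilon-greedy exploration keeps every tuple being visited.
        if rng.random() < eps:
            action = rng.choice((P1, P2))
        else:
            action = min((P1, P2), key=lambda a: Q[(window, a)])
        if action == P1:
            cost = 1.0 if rng.random() < p_cost[hidden] else 0.0
            obs = int(cost)
        else:
            cost, obs = c2, EMPTY
        next_window = window[1:] + (obs,)
        # Hidden two-state Markov transition of path P1.
        if rng.random() > stay[hidden]:
            hidden = 1 - hidden
        # Cost-minimising Q-learning update with a visit-count learning rate.
        visits[(window, action)] += 1
        alpha = 1.0 / visits[(window, action)] ** omega
        target = cost + rho * min(Q[(next_window, P1)], Q[(next_window, P2)])
        Q[(window, action)] += alpha * (target - Q[(window, action)])
        window = next_window
    return Q
```

A call such as `q_learning(k=2, steps=5000)` returns the learned table; the greedy policy for a window `y` is then the action with the smaller of `Q[(y, P1)]` and `Q[(y, P2)]`.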
VI-A Performance Analysis of Q-learning Platform
In this subsection we provide a methodology for calculating the parameters of Q-learning after it converges, and hence for solving for the path selection policy obtained by Q-learning. This allows us to compare this policy with the optimal policy of the POMDP and to formulate the incentive compatibility problem faced by the users of a Q-learning platform.
Using the results in [17] regarding the steady-state values of the parameters of the Q-learning algorithm, we obtain that our Q-learning algorithm converges with probability 1 to the solution for each of the following system of equations:
(13) 
Here, is the sequence of latest observations after the transition by appending the last observation (0, 1, or ) to the vector after removing its first element. , are the asymptotic probabilities that the underlying cost state is , , respectively, given that the sequence of latest observations is .
We can use (2) and (3) to compute from some initial state (assumed in our specific case or equal to the steady state distribution of the path Markov chain in general) and any sequence of observations . This defines (and ) in (VIA) for all possible values .
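The computation of a belief from an initial state and a sequence of observations can be illustrated with a short sketch. It does not reproduce (2) and (3); it is a generic Bayesian filter for a two-state chain under assumed names and an assumed observation model: `x` is the probability that P1 is in the bad state, an observation is a cost report of 0 or 1 (or `None` when nobody travelled P1), and `p_good`, `p_bad`, `q_gb`, `q_bg` are hypothetical parameters.

```python
def belief_update(x, obs, p_good=0.1, p_bad=0.9, q_gb=0.1, q_bg=0.1):
    """One filtering step for x = P(P1 is in the bad state).

    Assumed (hypothetical) model: a trip on P1 reports cost 1 with
    probability p_bad in the bad state and p_good in the good state;
    q_gb / q_bg are the good->bad / bad->good transition probabilities.
    obs is 0, 1, or None (no report).
    """
    if obs is None:
        posterior = x                      # nothing learned this period
    else:
        like_bad = p_bad if obs == 1 else 1.0 - p_bad
        like_good = p_good if obs == 1 else 1.0 - p_good
        posterior = x * like_bad / (x * like_bad + (1.0 - x) * like_good)
    # Propagate the posterior one step through the Markov chain.
    return posterior * (1.0 - q_bg) + (1.0 - posterior) * q_gb

def belief_from_window(window, x0, **params):
    """Apply belief_update along a sequence of observations, oldest first."""
    x = x0
    for obs in window:
        x = belief_update(x, obs, **params)
    return x
```

For instance, `belief_from_window((1, 1, 0), x0=0.5)` gives the belief implied by three reports starting from an uninformative prior; with no reports at all the belief drifts to the stationary probability `q_gb / (q_gb + q_bg)` of the assumed chain.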
Let be the solution to equation (VIA) with corresponding (asymptotic) policy . This takes the action with the minimum Qvalue, i.e.,
(14) 
where is the set of all possible latest observations and its size is . Clearly, (VIA) cannot be solved analytically and thus we obtain policy numerically using value iteration.
In Fig. 2, we plot the expected total discounted costs of policy for different values of and compare these costs to the optimal policy as functions of the initial belief state . Since Q-learning does not deal with belief states, we convert any initial into an appropriate initial state for Q-learning, by choosing the to make the value of most probable:
In Fig. 2 we first observe the curves for small values of and 2. When initial belief state is close to 0 or 1, the gap between and is more obvious. This is because if has few elements with or , it cannot approximate at the two extremes near 0 and 1. Since is a finite set, the corresponding values of are for all values of , not containing and . We further observe that as increases, the expected total discounted cost of becomes closer to the optimal cost.
We conclude that as increases, the Q-learning policy approximates the optimal policy more accurately. To see this, imagine that we run two versions of Q-learning, both using the belief state instead of the vector , by discretising (finely) to make the state space finite (since Q-learning operates over a finite set of states). In the first version we use Bayesian updates for . In the second version we directly construct from the vector of last observations. We expect that, as the state space of becomes finer, the first algorithm converges to the solution of the POMDP, while the second algorithm converges to the original Q-learning based on . Now as , the value of used in both algorithms will tend to be the same. Hence we expect that, as , Q-learning approaches the solution of the POMDP.
VI-B Information Restriction for Q-learning Platform
We continue with the analysis of user incentives for the Q-learning platform, as in the case of the optimal platform. Again we assume that users are sophisticated, have full information on how Q-learning works, and can reverse-engineer the Q-learning policy to decide whether to follow it or not. We define our IRM as before: the platform hides the history of user observations. At each time, when the history of latest observations is it recommends P1 and when it recommends P2, as dictated by . Here is the set of under which recommends action .
As before, knowing and assuming that all users follow it, a sophisticated user computes the asymptotic probability distribution of the last observation vector . Let be the expected cost of taking action P1 given ,
(15) 
If a user receives path recommendation P1, then she can infer that and the expected cost of travelling through path P1 is . This user will follow recommendation P1 if and only if
(16) 
Similarly, the user will follow recommendation P2 if and only if
(17) 
Therefore, given a fixed combination of system parameter values, the Q-learning platform is incentive compatible if and only if both (16) and (17) hold. An interesting question is to determine the range of parameters in the parameter space of the two-path model for which incentive compatibility may not hold.
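The incentive compatibility test can be sketched in a few lines. The exact forms of (16) and (17) are not reproduced here; the sketch assumes they compare the expected cost of P1, conditioned on the recommendation received, with the fixed cost of P2. The function names, the dictionary representation, and the variable `c2` are hypothetical.

```python
def conditional_p1_cost(stationary, policy, cost_p1, action):
    """Expected cost of P1 given that the platform recommended `action`.

    stationary: dict, observation window -> asymptotic probability
    policy:     dict, observation window -> "P1" or "P2"
    cost_p1:    dict, observation window -> expected cost of P1 given the window
    Returns None when `action` is never recommended.
    """
    mass = sum(p for y, p in stationary.items() if policy[y] == action)
    if mass == 0.0:
        return None
    total = sum(p * cost_p1[y] for y, p in stationary.items()
                if policy[y] == action)
    return total / mass

def is_incentive_compatible(stationary, policy, cost_p1, c2):
    """Assumed reading of (16)-(17): follow P1 if P1 looks no costlier than
    P2 after a P1 recommendation, and follow P2 if P1 looks no cheaper than
    P2 after a P2 recommendation."""
    after_p1 = conditional_p1_cost(stationary, policy, cost_p1, "P1")
    after_p2 = conditional_p1_cost(stationary, policy, cost_p1, "P2")
    follows_p1 = after_p1 is None or after_p1 <= c2
    follows_p2 = after_p2 is None or after_p2 >= c2
    return follows_p1 and follows_p2
```

Sweeping one model parameter while recomputing `stationary`, `policy` and `cost_p1` for each value would then trace out the region in which this check fails.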
Fig. 3(a) examines the incentive compatibility (IC) as a function of and . We let , , and , and by solving (16) and (17) we find the regime of all possible values in which the IC does not hold. We observe that IC does not hold in every instance. As increases, the interval of values of in which IC does not hold becomes smaller. We also examine the incentive compatibility regarding in Fig. 3(b) and regarding in Fig. 3(c). In Fig. 3(b), we set , , , and . In Fig. 3(c), we set , , , and . As increases, the interval of values of in which IC does not hold also becomes smaller. In all three subfigures, we observe that the regime in which the IC does not hold becomes trivial once .
One may wonder why this is the case. As increases, the Q-learning policy becomes a more accurate approximation of the optimal policy, and by Theorem 9 IC holds for the optimal policy over the whole range of system parameters under IRM. Thus, as the accuracy of the Q-learning policy increases, the information restriction mechanism should become ‘more’ incentive compatible, in the sense that the instances for which IC does not hold become rare.
VII Extension to a Multi-path Learning Model
In this section, we consider a more general network with three parallel paths where one more stochastic path P is added to our two-path model in Fig. 1(a). This new stochastic path P follows the same Markov model as path P1 in Fig. 1(b). Unlike our simple two-path model, we now need to update the belief states of both stochastic paths. Thus we use a belief state vector whose updating follows the Bayesian inference process as in Section II-B. We similarly denote the value function by with due to symmetry. Similar to (III), we define as the expected discounted cost starting from if action is taken at the first time epoch and the optimal policy is followed thereafter. can be similarly written down as follows:
Similar to (7), the optimality equation of our three-path model is:
(18) 
Similar to Proposition 2, we can prove (18) has a unique solution by using the contraction mapping theorem. As it is not in closed form, we compute the value function using standard numerical methods such as value iteration. We first discretise and partition the belief state space for equally into grids. In each iteration step, we directly evaluate the value function in each grid by solving (18). Once the value function is obtained, the platform computes the optimal policy for each given belief state. We plot the optimal policy for problem (18) in Fig. 4 where we let , and . We observe in Fig. 4 that when cost belief () is small the optimal policy uses stochastic path P1 (P), and when both and are large the optimal policy uses deterministic path P2, which is similar to Proposition 4. We also observe that the optimal policy is more complex and cannot be defined in terms of simple threshold rules.
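The discretise-and-iterate procedure can be illustrated on a simpler case. The sketch below is not a solver for (18): it runs value iteration on a one-dimensional belief grid for a single stochastic path against a fixed-cost path, with linear interpolation between grid points. The three-path case would iterate over a two-dimensional grid in the same way. The observation model and all parameter names and values (`c2`, `rho`, `p_good`, `p_bad`, `q_gb`, `q_bg`) are assumptions made for the example.

```python
def value_iteration(n_grid=101, c2=0.4, rho=0.9, p_good=0.1, p_bad=0.9,
                    q_gb=0.1, q_bg=0.1, tol=1e-8):
    """Value iteration over a grid of beliefs x = P(stochastic path is bad).

    Hypothetical model: a trip on the stochastic path costs 1 with
    probability p_bad (bad state) or p_good (good state); the other path
    always costs c2 and reveals nothing about the hidden state.
    """
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    V = [0.0] * n_grid

    def interp(x):
        # Linear interpolation of the current value function at belief x.
        pos = x * (n_grid - 1)
        lo = max(0, min(int(pos), n_grid - 2))
        w = pos - lo
        return (1.0 - w) * V[lo] + w * V[lo + 1]

    def propagate(post):
        # One step of the hidden two-state Markov chain.
        return post * (1.0 - q_bg) + (1.0 - post) * q_gb

    def q_values(x):
        pr1 = x * p_bad + (1.0 - x) * p_good   # probability of a cost-1 report
        x_if1 = propagate(x * p_bad / pr1) if pr1 > 0 else propagate(x)
        x_if0 = (propagate(x * (1.0 - p_bad) / (1.0 - pr1))
                 if pr1 < 1 else propagate(x))
        q1 = pr1 + rho * (pr1 * interp(x_if1) + (1.0 - pr1) * interp(x_if0))
        q2 = c2 + rho * interp(propagate(x))
        return q1, q2

    while True:
        new_V = [min(q_values(x)) for x in grid]
        diff = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if diff < tol:
            break
    policy = ["P1" if q1 <= q2 else "P2" for q1, q2 in map(q_values, grid)]
    return grid, V, policy
```

Because the Bellman operator is a contraction for `rho < 1` and linear interpolation does not expand differences, the loop terminates; a finer grid trades run time for accuracy.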