This paper develops an approach for allocating resources in multiagent systems with multiple tasks, where the success of the agents carrying out tasks depends stochastically on their ability to obtain a sequence of resources over time. We are particularly interested in situations where agents must independently optimize over their individual states, actions, and utilities, but must also solve a complex coordination problem with other agents in the usage of limited resources.
In particular, we are concerned with allocating resources in settings that involve a set of consumers, each of whom requires some subset of the available resources. The consumers each have a measure of health that they are trying to optimize, and this quantity is influenced stochastically by the resources they acquire and by time. (We use the term health here in a general sense to denote a single quantity over which an agent’s utility function, and hence its reward, is defined. Examples include the quality of a solution, the value of an outcome, or a patient's state of health.) Further, each consumer has a resource pathway that represents the partial ordering in which they need the resources. Consumers’ states evolve independently over time, and are dependent only through their need for shared resources. Rewards are independent, and the global reward is the sum of individual consumer rewards.
We formulate this problem as a factored multiagent Markov Decision Process (MMDP) with explicit features for each consumer’s state and resource utilization, and an explicit model of how each consumer’s state progresses stochastically over time dependent on obtained resources. The actions are the possible allocations of resources in each time step. For realistic numbers of consumers and resources, however, such an MMDP has a state and action space that precludes computation of an optimal policy. This paper addresses this problem and makes three contributions:
We develop an approximate distributed approach, where the full MMDP is broken into MDPs, one for each consumer. We call these consumer MDPs agents. Agents model the resources they expect to obtain using a probability distribution derived from average statistics of the other agents, and compute expected regret based on this distribution and on the known dynamics of their health state.
We propose an iterative auction-based mechanism for real-time resource allocation based on the agents’ individual expected regret values. The iterative nature of this process ensures a reasonable allocation at minimal computational cost.
We demonstrate the advantages of our approach in a cooperative healthcare domain with patients seeking doctors and equipment in order to improve their health states. We present averages of simulations using randomly generated agents from a reasonable prior distribution. We compare our coordinated MDP approach against an alternate planning algorithm intended for large-scale applications, a state-of-the-art Monte Carlo sampling based method for solving the full MMDP model known as UCT. We also compare to two simple but realistic heuristic approaches for allocating medical resources.
Our approach is particularly well suited to large, time-critical collaborative domains that require rapid responses to resource allocation demands, and we use a healthcare scenario throughout the paper to clarify our solution. We start by introducing the MMDP model and our distributed approach, followed by descriptions of the baseline methods we compare to. We then develop a set of realistic models for use in simulation, and show results across a range of problem sizes.
2 MDPs and Coordination
Our model is a factored MDP represented as a tuple of elements where is the number of consumers, the number of resources, and is the planning horizon. is a finite set of resource variables, each one representing the state of a single consumer’s resource utilizations, where is a set of variables representing consumer ’s utilization of resource . Each where is the set of possible resource utilizations (how much resource is being used). We model each resource as distinct (so multiple copies of a resource are modeled separately). is a set of variables measuring each consumer’s health, each of which is giving the different levels of health. We use to denote the complete set of state variables for consumer , and to denote the complete state for all consumers. Agent receives a reward of for transition from to , thus the multiagent system’s reward function is . The transition model is defined as , which denotes the probability of reaching joint state when in joint state , and is a set of permissible actions, one for each resource and each consumer representing all feasible allocations of resources (so the same resource cannot be allocated to two agents simultaneously). Resources are deterministic given the actions, and only one resource can be allocated to each consumer at a time. We assume a finite horizon undiscounted setting (this is realistic in healthcare scenarios, as health states do not warrant discounting).
The full MDP as described is an instance of a multiagent MDP (MMDP), and will be very challenging to solve optimally for reasonable numbers of consumers and resources. The total number of states is , and the number of actions is . We will show how to compute approximate (sample-based) solutions later in this paper, but first we show our approach to distributing this large MDP into smaller MDPs, and introduce our coordination mechanism for computing approximate allocations.
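To give a feel for the growth of the joint model, the following sketch counts joint states and feasible joint allocations under the constraints stated above. It is an illustration rather than the paper's own formula: the function names are hypothetical, the default of three health levels and three utilization levels per resource is an assumption, and the action count simply enumerates allocations in which each resource goes to at most one consumer and each consumer receives at most one resource.

```python
from math import comb, factorial


def joint_state_count(n_consumers, n_resources, n_health=3, n_util=3):
    """Number of joint states when each consumer has one health variable
    and one utilization variable per resource (assumed domain sizes)."""
    per_consumer = n_health * n_util ** n_resources
    return per_consumer ** n_consumers


def feasible_allocation_count(n_consumers, n_resources):
    """Number of allocations where each resource serves at most one consumer
    and each consumer gets at most one resource (partial matchings)."""
    limit = min(n_consumers, n_resources)
    return sum(
        comb(n_consumers, k) * comb(n_resources, k) * factorial(k)
        for k in range(limit + 1)
    )


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, joint_state_count(n, 4), feasible_allocation_count(n, 4))
```

Even with four resources, both counts grow quickly with the number of consumers, which is the motivation for splitting the joint model into per-consumer MDPs.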
We treat each consumer’s MDP as independent (an agent), an example of which is shown in Figure 1. We assume that the agents’ state spaces, resource utilizations, health states, transition and reward functions are independent. The agents are only dependent through their shared usage of resources: only feasible allocations are permitted as described above (agents cannot simultaneously share resources). Rewards are additive and each agent’s actions now become requests for resources as described below. We make two further assumptions. First, the reward function for each agent is dependent on the agent’s health, H, and is set to zero by a boolean factor at the end of resource acquisition (finishing the medical pathway by receiving all required resources). Second, the agent health (H) is conditionally independent of the agent action given the current resources and the previous health, and the agent actions only influence the resource allocation, since the agent can only influence health indirectly by bidding for resources. Thus, for each agent, the transition model factors as
The two factors are the probability of getting the next set of resources given the current health, resources, and action, which we will refer to as the resource obtention model, and a dynamic model for the agent’s health, which we will refer to as the health progression model.
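As a hedged reconstruction of the factorization described in words above, one consistent way to write it is shown below. The symbols $\Lambda$ and $\Omega$, and the choice to condition the health update on the newly obtained resources $R'$ rather than on $R$, are assumptions made for illustration, since the original notation is not recoverable from this text.

```latex
% Assumed notation: H = health, R = resource utilizations, a = agent action.
P(H', R' \mid H, R, a)
  \;=\;
  \underbrace{P(R' \mid H, R, a)}_{\Lambda \text{ (resource obtention)}}
  \;\cdot\;
  \underbrace{P(H' \mid H, R')}_{\Omega \text{ (health progression)}}
```

Under this reading, the action enters only through the resource factor, which matches the statement that an agent influences its health only indirectly, by bidding for resources.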
The health progression model is a property of a particular agent’s condition or task and can be estimated from global statistics about the nature of the conditions (e.g. diseases). It must be elicited from prior knowledge about diseases and treatments, and so forms part of a disease model that we henceforth assume is pre-defined (manually, or by learning based on historical statistics). On the other hand, the resource obtention model will be dependent on the current state of the multiagent system, and is a property of how we are setting up our resource allocation mechanism and the expected regret computations of each agent. For example, the probability of a single agent obtaining a resource will depend on (i) the number of other agents currently bidding for that resource and (ii) the agent’s model of health.
If using a single MDP for all agents as described at the start of this section, then resources would be deterministic given a joint allocation action. If modeled as a decentralized POMDP, the resources for each consumer would be conditioned on the unobservable states and actions of all the other consumers. In our model, we assume that the probability of obtaining a certain resource can be approximated reasonably well, either as a prior model based on the known distribution of diseases and the known requirements for treatments of each disease, or as a learned distribution based on simulated or real experiments.
In general, we can make no assumptions about further conditional independencies in the resource allocation factor. That is, the probability of obtaining a resource at time may depend stochastically on the set of resources at time . However, in many domains, there may be further independencies that can be encoded in the model. For example, in Figure 1, resource is conditionally independent of all resources where (for ) and for (for ), so the resources are ordered according to the (linear) medical pathway of this particular patient. We assume that the health progression factor can be specified for each agent independently of the other agents.
A policy for each individual MDP is a function that gives an action for an agent to take in each state. The policy can be obtained by computing a value function that is maximal for each state (i.e. satisfies the Bellman equation). For simplicity of notation, we remove agent indices and only show the indices for resources. Thus an individual agent’s value function is represented as:
The policy is then given by the actions at each state that are the arguments of the maximization in Equation 2.
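For reference, a finite-horizon, undiscounted Bellman recursion that is consistent with the description above can be sketched as follows. This is a reconstruction under assumed notation, not the paper's exact equation: $s = (H, R)$ is the agent's state, $r(s, s')$ is the transition reward, $T$ is the horizon, and the two probability factors are the resource obtention and health progression models.

```latex
% Assumed notation; V^{T} is the terminal value at the planning horizon.
V^{t}(H, R) \;=\; \max_{a} \sum_{R'} P(R' \mid H, R, a)
      \sum_{H'} P(H' \mid H, R')
      \Big[\, r\big((H,R),(H',R')\big) + V^{t+1}(H', R') \,\Big],
\qquad V^{T}(H, R) = 0 .
```

The policy at time $t$ then picks, in each state, an action attaining the maximum on the right-hand side.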
Agents compute their expected regret for not obtaining a given resource as follows. The expected value, for being in health state with resources at time , bidding for (denoted ) and receiving resource at time is:
where is the set of all resources except and and otherwise. The equivalent value for not receiving the resource, , is
Thus, the expected regret for not receiving resource when in with resources and taking action is:
We also refer to this as the expected benefit of receiving . It is important for agents in this setting to consider regret (or benefit) instead of value, as two agents may value a resource the same, but one might depend on it much more (e.g. have no other option). Value-based bids will fail to communicate this important information to the allocation mechanism.
Note that this estimate is optimistic, since the expected value assumes that the optimal policy can be followed after a single time step (which is untrue). This myopic approximation enables us to compute on-line allocations of resources in the complete multiagent problem, as described in the next section. In the following, we will use the notion of utilitarian social welfare by aggregating the total rewards amongst all agents as an evaluation measure.
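The regret computation above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not from the paper: the other resources are collapsed into a single "got the resource or not" flag, the health levels are named healthy, sick, and critical, and all numbers in `health_model`, `reward`, and `next_value` are made up. The function names are hypothetical.

```python
HEALTH = ("healthy", "sick", "critical")


def expected_value(health, got_resource, health_model, reward, next_value):
    """One-step lookahead: expectation over the next health level of the
    transition reward plus the (assumed known) value of the next state."""
    dist = health_model[(health, got_resource)]
    return sum(
        p * (reward[(health, nxt)] + next_value[(nxt, got_resource)])
        for nxt, p in zip(HEALTH, dist)
    )


def expected_regret(health, health_model, reward, next_value):
    """Value of receiving the resource minus value of not receiving it."""
    with_resource = expected_value(health, True, health_model, reward, next_value)
    without_resource = expected_value(health, False, health_model, reward, next_value)
    return with_resource - without_resource


# Hypothetical numbers for a patient who is currently sick.
health_model = {
    ("sick", True): (0.6, 0.3, 0.1),   # more mass on recovery with the resource
    ("sick", False): (0.1, 0.5, 0.4),  # more mass on decline without it
}
reward = {("sick", "healthy"): 1.0, ("sick", "sick"): -0.1, ("sick", "critical"): -1.0}
next_value = {
    ("healthy", True): 2.0, ("sick", True): 1.0, ("critical", True): -1.0,
    ("healthy", False): 1.5, ("sick", False): 0.5, ("critical", False): -2.0,
}

regret = expected_regret("sick", health_model, reward, next_value)
print(round(regret, 2))
```

Two agents with the same value for holding the resource can still produce very different regrets if their values without it differ, which is the information a value-based bid would hide.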
2.1 Coordination Mechanism
A coordination mechanism must aim to respect the health needs of the patients to maximize the overall utility. Each agent estimates its expected individual regret given its estimate of future resources and health (as given by the resource obtention and health progression models). The regret values of different agents are compared globally, and an allocation is sought that minimizes the global regret. While the final allocation decisions are made greedily in the action-selection phase, the reported expected values of regret (for bidding) consider future rewards.
To implement this allocation, we use an iterative auction-like procedure, in which each consumer bids on the resource with highest regret. The highest bidder gets the resource, and all other agents bid on their next highest regret resource. Agents can also resign, receive no resources for one time step, and try again in a future time step.
Consider a simplified scenario with 4 agents and 4 resources. We assume that agents require all four resources and that the expected benefits for receiving resources (or regrets for not receiving resources) based on their internal utility function have been calculated as illustrated in Table 1. The worst-case scenario would be when all the agents have attributed higher benefits to the same resources, so that their desire to acquire resources is in the same order of preference.
Agents first try to acquire the resource with highest benefit. In this scenario, all agents have associated the highest benefit to , however, only one () would be successful in getting it. All agents who have lost the previous auction will now bid for the resource with the second-highest benefit, and so on. In this case, agents , , all have attributed as their second highest. Our auction-based method gives a benefit of 22 (shown in bold in Table 1(a)). The optimal allocation has a benefit of 25 (one such allocation is marked with * in Table 1(a)).
Table 1(b) shows an average-case scenario. Again we assume that all agents require all the resources, but with more diverse preferences over the set of resources. Our method gets a benefit of compared to the optimal benefit of .
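The iterative auction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the benefit matrix is hypothetical (it is not Table 1), ties are broken by index, and an agent is assumed to resign for the time step when its best remaining benefit is not positive. A brute-force search over all one-to-one assignments gives the optimal total for comparison.

```python
from itertools import permutations


def iterative_auction(benefit):
    """benefit[a][r] is agent a's expected regret for not receiving resource r.
    Returns a dict mapping each winning agent to its resource."""
    free = set(range(len(benefit[0])))
    active = set(range(len(benefit)))
    allocation = {}
    while active and free:
        bids = {}
        for agent in sorted(active):
            # Each remaining agent bids on its highest-regret free resource.
            best = max(sorted(free), key=lambda r: benefit[agent][r])
            if benefit[agent][best] <= 0:
                continue  # assumed rule: resign for this time step
            bids.setdefault(best, []).append(agent)
        if not bids:
            break
        for resource, bidders in bids.items():
            # The highest bidder wins; losers bid again in the next round.
            winner = max(bidders, key=lambda a: benefit[a][resource])
            allocation[winner] = resource
            free.remove(resource)
            active.remove(winner)
    return allocation


def total_benefit(benefit, allocation):
    return sum(benefit[a][r] for a, r in allocation.items())


def optimal_benefit(benefit):
    """Brute force over one-to-one assignments (square matrices only)."""
    n = len(benefit)
    return max(sum(benefit[a][p[a]] for a in range(n)) for p in permutations(range(n)))


# Hypothetical worst-case style matrix: all agents rank resources similarly.
benefit = [
    [9, 7, 3, 1],
    [8, 6, 4, 2],
    [7, 5, 2, 1],
    [6, 4, 3, 2],
]
allocation = iterative_auction(benefit)
print(allocation, total_benefit(benefit, allocation), optimal_benefit(benefit))
```

On this matrix the auction reaches a total of 19 against an optimum of 20, a small gap of the same kind as in the worked example, while the auction needs only a handful of bidding rounds instead of a search over all assignments.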
3 Baseline Solution Methods
We will compare our algorithm to the result of a sample-based solution on the full MMDP described at the start of Section 2. UCT is a rollout-based Monte Carlo planning algorithm where the MDP is simulated to a certain horizon many times, and the average rewards gathered are used to select the best action to take next. To balance between exploration and exploitation, UCT chooses an action by modeling an independent multi-armed bandit problem, considering the number of times the current node and its chosen child node have been visited according to the UCB1 policy. In general, UCT can be considered an any-time algorithm and will converge to the optimal solution given sufficient time and memory. UCT has become the gold standard for Monte-Carlo based planning in Markov decision processes.
For rollouts, we use uniform random action selection from the set of permissible actions at each state. The permissible actions are the ones that do not cause any conflict over resource acquisition. The best action is then chosen based on the UCB1 policy. The amount of time UCT uses for rollouts is the timeout, and is a parameter that we must set carefully in our experiments, as it directly impacts the value of the sample-based solution. Although in some resource allocation settings lengthy decision periods would not have any impact on the efficiency of allocations, the time for making allocation decisions can be important in domains requiring urgent decisions, such as emergency departments and environments exposed to significant change. Delayed decisions for critical patients with acute conditions in emergency departments can have a large impact on the effectiveness of treatments. Moreover, fluctuations in demand may render the allocation solution useless by the time an optimal decision is computed, requiring the allocation decision to be recomputed. We will compare to UCT using a number of different realistic timeout settings.
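As an illustration of the action-selection step, the sketch below implements the standard UCB1 rule (mean reward plus an exploration bonus) over the children of a single node. It is not the UCT configuration used in the experiments: the function name, the data layout, and the default exploration constant are assumptions.

```python
import math


def ucb1_select(total_reward, visits, c=math.sqrt(2)):
    """Pick the index of the child action with the highest UCB1 score.
    total_reward[a] is the summed rollout reward of action a,
    visits[a] is how often action a has been tried at this node."""
    # Try every action once before applying the formula.
    for action, count in enumerate(visits):
        if count == 0:
            return action
    parent_visits = sum(visits)
    scores = [
        total_reward[a] / visits[a] + c * math.sqrt(math.log(parent_visits) / visits[a])
        for a in range(len(visits))
    ]
    return max(range(len(visits)), key=scores.__getitem__)


if __name__ == "__main__":
    # A rarely tried action gets a large exploration bonus.
    print(ucb1_select([60.0, 0.5], [100, 1]))
```

With the exploration constant set to zero the rule becomes purely greedy, which shows how the bonus term drives exploration of rarely visited actions; with many joint actions, most of a fixed timeout is spent on this exploration.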
3.2 Heuristic methods
We use three heuristic methods. In the first, only the agent’s level of criticality is considered (we call this “sickest first”). In the second, we use the reported regret values and only run one round of the auction-based allocation (so only one agent gets a resource at each time step: the agent with the biggest regret for not getting it). In the third, patients are treated in the order they arrive (first-come, first-served or FCFS - a traditional healthcare method).
4 Experiments and Results
We demonstrate our approach in simulations with realistic probabilistic models of different conditions (e.g. diseases) and health and resource dynamics distributions. The simulations use a random sampling of agent MDPs, drawn from a realistic prior distribution over these models. It is important to note that we are not simply defining a single patient MDP, but rather our results are averages over randomly drawn MDPs: each simulated patient is different in each simulation, but drawn from the same underlying distribution.
We make three main assumptions. First, we assume that task durations are identical (e.g. it always takes one unit of time to consume each resource). The second assumption is that each agent is only able to bid on a single resource at each bidding round (but each bidding round includes a sequence of bids to determine the action for each MDP). The third assumption is that all patients arrive at the same time.
4.1 Agent Setup
We assume that the health variable , and each resource variable . Patients all start (enter the hospital) with and, depending on the resources they acquire, their health state improves to healthy or degrades to the critical condition. We further define a function to encode the states of the health variables as for . We assume that there are possible conditions (diseases), each with a criticality level, a real number with being the most critical disease (makes the patient become sicker faster).
We first assume a multinomial distribution over the conditions drawn from a set , such that each patient has condition with probability . In the following, we assume conditions to be evenly distributed: , although in practice this distribution would reflect the current condition distribution in the population, community or hospital. Each condition has a condition profile that specifies a set of resources in a specific order that is derived from the clinical practice guidelines or the medical pathway, a distribution over health state progression models, , and a distribution over resource obtention models, .
The medical pathway can be specified either within the health progression model (by making any set of resources not on the pathway lead to non-progression of the health state), or within the resource obtention model (by making it impossible to get resource allocations outside the pathway). We choose the latter in these experiments, but in practice the pathway may need to be specified by a combination of both, particularly if there is non-determinism in the pathways (i.e. different pathways can be chosen with different predicted outcomes). We assume that pathways for all agents are a linear chain through the required resources for each condition.
For our experiments, we have built priors over the health progression and resource obtention models based on our prior knowledge of the health domain. We have made these priors reasonably realistic (they capture some of the main properties of this domain), and sufficiently non-specific to allow for a wide range of randomly drawn transition functions in the patient MDPs. In practice, these priors would be elicited from experts or learned from data.
Health state progression model: For each simulated agent, is drawn from a Dirichlet prior distribution over the three values of that puts more mass on the probability of healthier states (compared to the current health state) if the required resources are obtained, but more mass on the probability of sicker states if the disease is more critical. More precisely, define where is a triple of values over and . If all the required resources are in , then . If all required resources are either , or , then . Finally, if all the resources are needed, then . For all the other values of , i.e. the ones with partial resources needed, we define . Now for sampling purposes, we use these Dirichlet priors as parameters of multinomial distributions to sample the progression of health state. We have assumed similar progression of health over health states for all possible transitions based on . Thus,
where is the element of .
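The sampling scheme described above, drawing a health transition distribution from a Dirichlet prior and then sampling the next health level from it, can be sketched as follows. The concentration parameters here are hypothetical placeholders (the paper's values are not recoverable from this text), and the function names are illustrative only.

```python
import random

HEALTH_LEVELS = ("healthy", "sick", "critical")


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing
    independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_next_health(probs, rng):
    """Sample the next health level from the drawn multinomial parameters."""
    return rng.choices(HEALTH_LEVELS, weights=probs, k=1)[0]


rng = random.Random(0)
# Hypothetical prior: more mass on healthier states when resources are obtained.
alpha_with_resources = (6.0, 3.0, 1.0)
probs = sample_dirichlet(alpha_with_resources, rng)
print(probs, sample_next_health(probs, rng))
```

Each simulated patient gets its own draw of `probs`, so patients with the same condition profile still differ in their exact transition probabilities.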
Resource obtention model: For each simulated agent, is drawn from a Dirichlet prior distribution over the three values of that puts more mass on the probability of getting a resource if it is the next in the medical pathway, and if the patient is more sick (so their regret and bids will be larger, making it more likely they will get the resource). However, the probability mass shifts towards not getting a resource as gets larger (so the more agents in the system, the less likely it is to get a resource). Recall from above that this model is meant to summarize the joint actions of other agents, as would have been modeled in a full dec-POMDP solution. An adequate summary is important for good performance, and while we do not claim that the following prior is optimal, we believe it to be a good representation for these simulations. Ideally this function would be computed from the complete model directly, or learned from data. We define where is a triple of values over . We define for . If all resources in are either or , then . If the previous resource in the medical pathway is , then . Finally, if all resources are needed, then .
Reward function: is fixed for all the agents, and rewards agents for becoming healthy, but penalizes them for staying sick or going to the critical state. More precisely: for , , , and . Further, once a patient is healthy and has received all resources, they are discharged and receive no further reward.
We ran each of the benchmarks on a machine with 3.4GHz QuadCore AMD and 4GB RAM available. We compare our auction-based coordinated MDP approach with (AucMDP-RegIter) and without (AucMDP-Reg) iteration using the expected regret bidding mechanism. We also compare to a version where agents only bid their expected values, not regrets (AucMDP-Iter), FCFS, sickest-first, and sample-based (UCT). Each simulated patient is randomly assigned a condition profile and then an MDP model with parameters randomly drawn from the Dirichlet distributions defined above is assigned. 100 trials are done for each randomly drawn set of conditions and MDPs, and this is repeated 10 times. For the UCT results, we ran trials, also repeated times.
We present means and standard deviations over these simulations. We first present results with 4 total resource types and each agent requiring 4 resources based on randomly assigned condition profiles (Figure (a)). The y-axis is the average reward per patient gathered over an entire trial. We use a horizon that depends on the number of agents (), and UCT is given a 300 second timeout. The total computation time of the complete allocations for the AucMDP approach is less than 10 seconds for problems with 10 agents, and this computation time increases linearly with the number of agents and resources (as opposed to exponential growth in the MMDP case). We can see that the two AucMDP iterative approaches perform similarly, and outperform the heuristic approaches for . UCT is given sufficient time to outperform all other approaches.
Figure (b) shows the performance of our approach in a more realistic scenario with the timeout set to a maximum of 120 seconds for rollouts. As before, each agent requires 4 resources. When the number of agents increases beyond 8, UCT underperforms compared to AucMDP, providing a policy that performs as poorly as FCFS or sickest-first. This is mostly because the number of possible actions grows exponentially as more agents are added, and thus UCT requires significantly more rollouts in the action exploration phase. Figure (a) shows a further scaling to , again showing that our AucMDP approach outperforms the other methods for the larger problems. The number of joint actions also grows exponentially when the number of resources required by each agent is increased, since there are more individual options, but our AucMDP handles this well as a result of linear growth in the number of actions (Figure (b)).
As more resources are added into the system, the performance of approaches such as FCFS and sickest-first gets closer to that of our approach, because more diverse sets of resources are defined by condition profiles. Figure (a) shows that introducing more resources yields more diversity in resource requirements: the allocation problem becomes “easier” to solve (fewer conflicts of interest), whereas a smaller number of resources results in a harder allocation problem. Figure (b) shows results of further scaling our AucMDP approach to 50 agents each requiring 10 resources with 10 condition profiles.
5 Related Work and Conclusion
Our approach to coordinating MDPs contrasts with those of multiagent MDPs  and dec-MDPs  in finding exact solutions, which face complexity problems for large-scale problems such as ours . Instead, we offer an approximation method that collapses the state space of each agent down to only features that are available locally, and uses averaged effects of other agents for coordination. This is similar in spirit to  where effects of actions are estimated by agents (but without the central coordination, as in our work).
Our approach to resource allocation assumes additive utility independence, as in , and has state and action spaces decomposed into sets of features, with each feature relevant to only one subtask, but for cooperative settings, to maximize global utility. The use of auctions to coordinate local preferences through MDPs is also proposed in  where individual MDPs are submitted to a central decision maker to eventually solve the winner determination problem through a mixed integer linear program (MILP). However, this model only provides one-shot allocations and is not applicable to environments with dynamic agents or resources. Multiple allocation phases are addressed in, but the solution incurs greater communication overhead with full agent preferences being modeled. Both approaches require a full preference model of all agents and their MDPs to be submitted to the auctioneer, which increases the computation effort on the side of the auctioneer for solving an MMDP and requires complicated (and often large) communication overhead while raising privacy concerns. The work of  also addresses cooperative scenarios using auctions for allocating tasks to agents with fixed types and no individual preference models. However, we employ a multi-round mechanism to assign multiple resources to dynamic agents, with expected regret dictating winner determination.
The problem of medical resource allocation is perhaps best addressed to date by [17, 18], which also integrate a health-based utility function to address fairness based on the severity of health states. This model does not, however, consider temporal dependency when determining allocations, and our approach of considering future events provides a broader consideration of possible uncertainty. Markov decision processes have been used to model elective (non-emergency) patient scheduling in .
In all, our auction-based MDP approach addresses dynamic allocation of resources using multiagent stochastic planning, employing an auction mechanism to converge quickly with low communication cost. Our experiments demonstrate effectiveness in achieving global utility, using regret, for large-scale medical applications.
Future work includes exploring auction-coordinated POMDPs  to estimate resource demands, and learning resource models from data. We are also interested in studying combinatorial bidding mechanisms [7, 19], and bidding languages  in order to optimize allocations based on richer preferences. Online mechanisms and dynamic auctions  may also be of value to consider, to continue to explore changing environments.
We would like to thank the anonymous reviewers for their helpful comments.
- [1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
- [2] R.E. Bellman. Dynamic programming. Courier Dover Publications, 2003.
- [3] D.S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of operations research, 27(4):819–840, 2002.
- [4] Aurélie Beynier and Abdel-Illah Mouaddib. An iterative algorithm for solving constrained decentralized Markov decision processes. In Proceedings of AAAI, 2006.
- [5] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, pages 478–485, 1999.
- [6] D.B. Chalfin, S. Trzeciak, A. Likourezos, B.M. Baumann, R.P. Dellinger, et al. Impact of delayed transfer of critically ill patients from the emergency department to the intensive care unit*. Critical care medicine, 35(6):1477–1483, 2007.
- [7] P. Cramton, Y. Shoham, and R. Steinberg. Introduction to combinatorial auctions. MIT Press, 2006.
- [8] D.A. Dolgov and E.H. Durfee. Resource allocation among agents with MDP-induced preferences. Journal of Artificial Intelligence Research, 27(1):505–549, 2006.
- [9] C.V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22(1):143–174, 2004.
- [10] Thomas Keller and Patrick Eyerich. PROST: Probabilistic planning based on UCT. In Proc. ICAPS, 2012.
- [11] L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. Machine Learning: ECML 2006, pages 282–293, 2006.
- [12] S. Koenig, C. Tovey, X. Zheng, and I. Sungur. Sequential bundle-bid single-sale auction algorithms for decentralized control. In Proceedings of the international joint conference on artificial intelligence, pages 1359–1365, 2007.
- [13] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled Markov decision processes. In Proceedings AAAI, pages 165–172, 1998.
- [14] N. Nisan. Bidding and allocation in combinatorial auctions. In Proceedings of the 2nd ACM conference on Electronic commerce, pages 1–12. ACM, 2000.
- [15] L.G.N. Nunes, S.V. de Carvalho, and R.C.M. Rodrigues. Markov decision process applied to the control of hospital elective admissions. Artificial intelligence in medicine, 47(2):159–171, 2009.
- [16] Algorithmic Game Theory, ed. N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, pages 411–439, 2007.
- [17] T.O. Paulussen, N.R. Jennings, K.S. Decker, and A. Heinzl. Distributed patient scheduling in hospitals. In International Joint Conference on Artificial Intelligence, volume 18, pages 1224–1232. Citeseer, 2003.
- [18] T.O. Paulussen, A. Zoller, F. Rothlauf, A. Heinzl, L. Braubach, A. Pokahr, and W. Lamersdorf. Agent-based patient scheduling in hospitals. Multiagent Engineering, pages 255–275, 2006.
- [19] S.J. Rassenti, V.L. Smith, and R.L. Bulfin. A combinatorial auction mechanism for airport time slot allocation. The Bell Journal of Economics, pages 402–417, 1982.
- [20] J. Wu and E.H. Durfee. Sequential resource allocation in multiagent systems with uncertainties. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 114. ACM, 2007.