I Introduction
Natural disasters have always posed a critical threat to human beings, often accompanied by major loss of life and property damage. In recent years, we have witnessed more frequent and intense natural disasters all over the world. In 2017 alone, there were multiple devastating natural disasters, each resulting in hundreds of deaths; hurricanes, flooding, tornadoes, earthquakes and wildfires were all active keywords in 2017. An illustration of the distribution of weather-related disasters in a single year in the U.S. is presented in Figure 1. To mitigate the impacts of disasters, it is important to rapidly match the available rescue resources with the disaster victims who need help, in order to maximize the impact of the rescue effort with limited resources. A key challenge in disaster rescue is to balance the requests for help with the volunteers available to meet that demand.
The adverse impacts of a disaster can be substantially mitigated if, during the disaster, accurate information regarding the available volunteers can be gathered and victims’ locations can be determined in a timely manner, enabling a well-coordinated and efficient response. This is particularly apparent whenever there is a huge burst of requests for limited public resources. For example, when Hurricane Harvey made landfall on August 25, 2017, flooding parts of Houston, the 911 service was overwhelmed by thousands of calls from victims in a very short period. Since the phone line resource is limited, many phone calls did not get through, and victims turned to social media to plead for help, posting requests with their addresses. At the same time, many willing volunteers seeking to offer help during the disaster were left idle because no one knew where they should be sent. This case is illustrated in Figure 2, along with a sample distribution of victims and volunteers in Figure 3. In the case of a hurricane, a major challenge is that without coordination, multiple volunteers with boats may go to rescue the same victim while other victims wait extended times to be rescued. This mismatch between victims and volunteers represents an enormous waste of limited volunteer resources. It is therefore imperative to improve coordination among emergency services so that they can efficiently share information, allocate resources effectively, and offer guidance for optimal resource allocation.
The problem of resource coordination has drawn considerable attention in the computer science community, and several data mining frameworks have been developed to address it. Previous researchers have primarily focused on three approaches: supervised learning, adaptive methods, and optimization-based methods. Traditional supervised learning models demand a statistically large dataset in order to train a reliable model [2], for example by building regression models to predict needs and schedule resources accordingly [3, 4]. Unfortunately, due to the unique nature of resource management for disaster relief, it is generally impractical to model this with traditional supervised learning. Every disaster is unique, so it makes little sense to model one disaster relief problem using a dataset collected from other disasters; a realistic dataset for a disaster can only be obtained once it occurs. This means that traditional supervised learning is unable to solve the highly individual resource management problems associated with disaster relief efforts. Other researchers have developed adaptive methods [5, 6] and proposed adaptive systems [7] for resource allocation. However, a common limitation of the adaptive approach is that the parameters in adaptive models change slowly and hence converge slowly. An alternative is to model resource coordination as a simulation or optimization problem, but this requires repeated modelling and tuning whenever any of the external environmental parameters change.
Real world resource coordinating problems are very challenging for a number of reasons:

The sample size is small, especially in the early stages of the disaster, when there is almost no available data. Any decision-support system must therefore make decisions swiftly.

The real-world environment where the resource coordination actually happens is a highly complex system with multiple uncertainties. For instance, the locations of volunteers and victims change dynamically, and the rescue time for an arbitrary victim varies depending on factors such as traffic, road closures, and emergency medical care, many of which are also changing dynamically.

There is no well-defined objective function to model the scheduling problem for disasters, especially when victims need emergency care or collaborative rescue efforts.
The recent success achieved in applying machine learning to challenging decision-making domains [8, 9, 10] suggests that Reinforcement Learning (RL) is a promising method with considerable potential. So far, reinforcement learning has been successfully applied to problems such as optimizing deep neural networks with asynchronous gradient descent for controllers [11], playing Atari games [8], and learning control policies in a range of different environments with only very minimal prior knowledge [12, 9], among others. One appealing feature of reinforcement learning is that it can overcome many of the difficulties involved in building accurate models, which is usually formidable given the scale and complexity of real-world problems. Moreover, reinforcement learning does not require any prior knowledge of system behavior to learn optimal strategies, which means it can be used to model systems that include changes and/or uncertainties. Finally, reinforcement learning can be trained for objectives that are hard to optimize directly because of the lack of precise models: as long as reward signals correlated with the objective are available, a variety of goals can be incorporated by adopting different reinforcement rewards. In this paper, we aim to find an effective way to coordinate the efforts of volunteers and enable them to reach disaster victims as soon as possible. We have developed a novel heuristic multi-agent reinforcement learning-based framework to analyze tweets and identify volunteers and victims, along with their locations. Based on the information collected, a resource coordination system can then allocate the volunteer resources more efficiently. The resource coordination system is implemented using heuristic multi-agent reinforcement learning, since the unique characteristics of this approach address the dilemmas above. More specifically:

We build an efficient heuristic multi-agent reinforcement learning framework for large-scale disaster rescue work based on information gathered by mining social media data. This study is one of the first that specifically focuses on coordinating volunteers in disaster relief using reinforcement learning.

We propose the ResQ algorithm, which adapts dynamically as information about volunteers’ and victims’ situations arrives, and makes recommendations that minimize the total distance travelled by the volunteers while rescuing the maximum possible number of victims.

Our proposed new disaster relief framework bridges the gap when traditional emergency helplines such as 911 are overwhelmed, thus benefiting both the disaster victims and the non-governmental organizations seeking to help them.

Last but not least, our proposed ResQ algorithm significantly outperforms existing state-of-the-art methods while considerably reducing the required computation time. The effectiveness of the proposed method is validated using a Hurricane Harvey-related social media dataset collected in August 2017 for the Houston, Texas area.
II Related Work
Disaster Relief with Social Media.
The most recent survey [13] pointed out that the success of a disaster relief and response process relies on timely and accurate information regarding the status of the disaster, the surrounding environment, and the affected people. A large number of studies use social media data for disaster relief. Gao et al. [14] built a crowdsourcing platform to provide emergency services during the 2010 Haiti earthquake, such as handling food requests; they integrated the system with crisis maps to help organizations identify the locations where supplies were most needed. Zook et al. [15] demonstrated that information technologies were the key means through which individuals could contribute to relief efforts without being physically present in Haiti, showing how to make full use of volunteer resources by outsourcing tasks to remote volunteers. Ashktorab et al. [16] introduced a Twitter-mining tool that extracts practical information for disaster relief workers during natural disasters; their approach was validated with tweets collected from 12 different crises in the United States since 2016. The authors of [17] identified fifteen distinct disaster social media uses, ranging from preparing and receiving disaster warnings and detecting disasters before an event to (re)connecting community members following a disaster. Gao et al. [18] proposed a model to explore users’ check-in behaviors via location-based social networks, integrating users’ check-in history to build connections between historical records and predicted locations. Lu et al. [19] explored the underlying trends in positive and negative sentiment concerning disasters, along with geographically related sentiment, using Twitter data.
Multiagent Reinforcement Learning
Research on Multi-agent Reinforcement Learning (MARL) has proved to be very challenging. The exponential growth of the discrete state-action space makes it hard to iterate over that space, and the correlated returns of multiple agents make it difficult to maximize returns independently. Several MARL goals have been proposed to circumvent this problem. Hu and Wellman proposed a framework where agents maintain Q-functions over joint actions and perform updates based on the agents’ learning dynamics [20]. Powers and Shoham proposed to consider adaptation to the changing behaviors of the other agents [21]. Other researchers have proposed to consider both stability and adaptation at the same time [22, 23, 24]. Non-stationarity arises in MARL since all the agents in the system are learning simultaneously. Stability essentially means convergence to a stationary policy, whereas adaptation ensures that performance is maintained or improved when the other agents change their policies [25]. It has been found that convergence is required for stability, and rationality is the criterion for adaptation [23]. An alternative to rationality is the concept of no-regret, which prevents the learner from being ‘exploited’ by other agents [22]. Convergence to equilibria is a basic stability requirement [20]: agents’ strategies should eventually converge to a coordinated equilibrium. Nash equilibrium is commonly used in some scenarios, but concerns have been raised regarding its usefulness [26].
Resource Management
Studies of resource management problems appear in different fields, including real-time CPU scheduling [27, 28], energy-efficiency management of data centers for cloud computing [29, 30], bitrate adaptation in video streaming [31, 32], network congestion control [33], and so on. Tesauro et al. combined reinforcement learning and queuing models to learn resource valuation estimates, and their results showed significant performance improvements over a variety of initial model-based policies [34]. Xu et al. presented a unified reinforcement learning approach to automate the configuration of virtualized machines and the appliances running in them; their model achieved an optimal or near-optimal configuration setting in a few trial-and-error iterations [35]. Mao et al. built a resource management system with reinforcement learning; their initial results show that the deep-learning-based model outperforms heuristic models, adapts to different conditions, converges quickly, and learns strategies that are sensible in hindsight [36].

III Problem Formulation
We first introduce a formal definition of the rescue problem, then present the concept of multiagent reinforcement learning and discuss its applications to disaster relief, concluding by presenting our new methodology.
III-A Problem Formulation
Definition III.1 (Volunteering rescuer)
A volunteer rescuer (or volunteer) is a person who has access to potential rescue facilities (e.g., a boat) and is willing to help. We denote the set of all volunteers at time $t$ as $U_t = \{u_1, u_2, \dots, u_N\}$.
Definition III.2 (Victim)
A victim is an individual who is trapped or in trouble and needs to be rescued. We denote the set of all victims at time $t$ as $V_t = \{v_1, v_2, \dots, v_M\}$.
Definition III.3 (Rescue Task)
Let $U$ and $V$ denote the sets of volunteers and victims, respectively. These volunteers and victims are scattered across the areas affected by the disaster. A rescue task is to find a volunteer $u_i \in U$ to rescue a victim $v_j \in V$ who is trapped at another location. The ‘cost’ of such a rescue task is the total time it takes for volunteer $u_i$ to reach victim $v_j$ and convey them to a safe place.
For the purpose of this study, we assume that victims are taken to the nearest shelter after they have been rescued. We calculate the total time for a rescue task as $T_{ij} = t_{travel} + t_{load} + t_{shelter}$, where $t_{travel}$ is the travel time it takes for the volunteer to reach the victim (determined by the distance $d_{ij}$ between volunteer $u_i$ and victim $v_j$), $t_{load}$ is the time to load the victim(s) onto the boat, and $t_{shelter}$ is the time needed to carry them to the nearest shelter. Since the loading time and the time to shelter are constant across scheduling policies, we do not take them into consideration.
Definition III.4 (Rescue Scheduling Task)
Let $\Phi_t$ denote the set of assignments of victims to be rescued by volunteers at time $t$. Given a set of volunteers $U$ and a set of victims $V$, a rescue scheduling task is to find a set of sequential assignments of volunteers to rescue victims, such that all the victims are rescued with minimal total cost. The total cost for such a schedule is the total time spent rescuing all the victims.
Suppose the cost function is $C(\phi)$, where $\phi \in \Phi_t$ is an assignment of volunteers for rescuing. The cost function is interpreted as a “total rescuing time” and can be expressed in terms of the times $t_{ij}$, the cost/time for volunteer $u_i$ to rescue victim $v_j$. The rescue scheduling problem is to find the optimum assignment $\phi^*$ such that $C(\phi^*)$ is a minimum, that is, there is no assignment $\phi$ such that $C(\phi) < C(\phi^*)$.
An assignment may be written as an $N \times M$ matrix $X = (x_{ij})$, in which row $i$ lists, in order, the victims that volunteer $u_i$ will rescue at time $t$. Suppose there are $M$ victims to be rescued by $N$ volunteers. We can then represent the rescue scheduling result as a binary matrix where $x_{ij} = 1$ if volunteer $u_i$ is assigned to rescue victim $v_j$ and $x_{ij} = 0$ otherwise, for $i = 1, \dots, N$, $j = 1, \dots, M$.

In this case, a volunteer rescues one victim at a time, while a victim is rescued by at least one volunteer. The mathematical model for the volunteer-victim problem is defined as follows:

$\min \sum_{i=1}^{N} \sum_{j=1}^{M} d_{ij}\, x_{ij}$

subject to

$\sum_{j=1}^{M} x_{ij} \le 1 \;\; \forall i, \qquad \sum_{i=1}^{N} x_{ij} \ge 1 \;\; \forall j, \qquad x_{ij} \in \{0, 1\},$

where $d_{ij}$ is the distance from volunteer $i$ to victim $j$.
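The one-shot version of this scheduling objective is an instance of the classic assignment problem. As a minimal sketch (assuming each victim is matched to exactly one volunteer, and that there are at least as many volunteers as victims; function names are illustrative), small instances can be solved by brute force:

```python
from itertools import permutations

def best_assignment(d):
    """Brute-force the optimal assignment phi* for small instances.

    d[i][j] is the distance from volunteer i to victim j. Each victim
    gets exactly one volunteer; each volunteer takes at most one victim.
    Returns (assignment, total_cost), where assignment[j] is the
    volunteer index matched to victim j.
    """
    n_vol, n_vic = len(d), len(d[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_vol), n_vic):
        cost = sum(d[i][j] for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost
```

In practice a polynomial-time method such as the Hungarian algorithm would replace the factorial enumeration; the sketch only illustrates the objective and constraints.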
III-B ResQ: Heuristic Multi-agent Reinforcement Learning (MARL) in Rescue Scheduling
The setting of MARL
To tackle this rescue scheduling problem, we formulate it using a multi-agent reinforcement learning technique [37]. The agents are volunteers who are willing to rescue disaster victims, the victims represent the rewards, and the environment is the place where the disaster happened. This environment is represented as a square-grid world, and the agents move within this grid world to rescue the victims. In other words, this is a Markov game $G$ for $N$ agents, denoted by a tuple $G = (N, S, A, P, R, \gamma)$, where $N$, $S$, $A$, $P$, $R$, and $\gamma$ are the number of agents, the set of states, the joint action space, the transition probability function, the reward function, and the discount factor, respectively. These are defined as follows:

Agent: We consider a volunteer with a boat to be an agent. Although the number of unique heterogeneous agents is always $N$, the number of available agents changes over time $t$.

State $S$: The state of a volunteer at time $t$ in the rescue scheduling problem is defined as the grid location where he or she is located. We also maintain a global state at each time $t$, taking the spatial distributions of the available volunteers and victims as a global state $S_t$; the set of states is finite.

Action $A = A_1 \times \cdots \times A_{N_t}$: a joint action $a_t$ denotes the allocation strategy of all available volunteers at time $t$, where $N_t$ is the number of available agents at time $t$. The action space of an agent specifies all the possible directions of motion for the next iteration, giving a set of four discrete actions {up, down, left, right}: allocating the agent to one of the four neighboring grid cells. At time $t$, if the state and the action of an agent are given, then its state at time $t+1$ is determined. Furthermore, the action space of an agent depends on its location; agents located in corner cells have a smaller action space. A policy $\pi$ is defined as a sequence of actions, and $\pi^*$ is the optimal policy.

Discount Factor $\gamma$: The discount factor lies between 0 and 1 and quantifies the difference in importance between immediate rewards and future rewards.

Transition function $P$: The transition function describes the probabilities of moving between states. The state transition probability $P(S_{t+1} \mid S_t, a_t)$ gives the probability of transitioning to $S_{t+1}$ given that the joint action $a_t$ is taken in the current state $S_t$.

Reward function $R$: A reward in the rescue scheduling problem is defined as the feedback from the environment when a volunteer takes an action in a state. Each agent is associated with a reward function, and all agents in the same location have the same reward function. Each agent attempts to maximize its own expected discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$.
The goal of our disaster rescue problem is to find the optimal policy $\pi^*$ (a sequence of actions for the agents) that maximizes the total reward. The state value function $V^{\pi}(s)$ is introduced to evaluate the performance of different policies. $V^{\pi}(s)$ stands for the expected total discounted reward from the current state $s$ onwards under the policy $\pi$, which is equal to:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s\right]$  (1)
According to the Bellman optimality equation [38], we have

$V^{*}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]$  (2)
Since the volunteers have to explore the environment in order to find victims, they cannot observe the underlying state of the environment. We therefore treat this as a Partially Observable Markov Decision Process (POMDP) [39]. A POMDP extends the definition of a Markov Decision Process (MDP). It is defined by a set of states $S$ denoting the environment setting for all agents, together with a set of actions and a set of observations for each agent. The state transition function produces the next state when the agents take actions following their policies $\pi_i$. Each agent $i$ receives an observation $o_i$ correlated with the state, and obtains a reward $r_i$. Each agent aims to maximize the shared total expected return $\sum_{t=0}^{T} \gamma^{t} r_t$, where $\gamma$ is the discount factor and $T$ is the horizon.

Several reinforcement learning algorithms have been proposed to estimate the value of an action in various contexts, including Q-learning, SARSA, and policy gradient algorithms. Among them, the model-free Q-learning algorithm stands out for its simplicity. In Q-learning, the algorithm uses a Q-function $Q(s, a)$ to estimate the expected total reward of taking action $a$ in state $s$. Q-learning iteratively evaluates the optimal Q-value function using backups:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$  (3)
where $\alpha$ is the learning rate and the term in brackets is the temporal-difference (TD) error. Convergence to the optimal $Q^*$ is guaranteed in the tabular case provided there is sufficient state/action space exploration.
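The backup of Eq. (3) can be sketched in a few lines of tabular Q-learning (the state/action encodings and default hyperparameters below are illustrative, not the paper's exact settings):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup.

    Moves Q(s, a) toward the TD target r + gamma * max_a' Q(s', a');
    the bracketed difference is the temporal-difference error.
    """
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# The Q-table defaults to 0 for unseen state-action pairs.
Q = defaultdict(float)
```

With an empty table, a single backup after receiving reward 1.0 moves the entry to `alpha * 1.0`, matching the update rule term by term.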
The heuristic function
Q-learning requires a number of trials in order to learn and perform consistently, which increases the total time needed to generate a rescue plan. To address this problem, heuristic-based algorithms have been proposed, e.g., in robotic soccer [40]. For the current problem, we propose a heuristic-based Q-learning algorithm: ResQ. In our problem, the locations of volunteers and victims are estimated via the tweets’ geolocations, as described in Section IV-B. We then incorporate this information as a heuristic function in the learning process. When determining actions for volunteers, besides choosing the optimal Q-value as mentioned earlier, we also prioritize the actions that result in the shortest distance to the victims. The heuristic function is a mapping $H(s, a) \to \mathbb{R}$, where $s$ is the current state, $a$ is the action to be performed, and the output is a real number representing the distance from the volunteer to the victim. If, after performing action $a$ in state $s$, the agent is at row $r_a$ and column $c_a$ of the grid, and its goal is the victim positioned at row $r_v$ and column $c_v$, then the heuristic distance is calculated as:
$H(s, a) = |r_a - r_v| + |c_a - c_v|$  (4)
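Assuming the heuristic is the Manhattan grid distance between the post-action cell and the victim's cell, heuristic action selection can be sketched as below; the weighting `xi` and the tie-breaking order are our assumptions, not the paper's exact rule:

```python
def manhattan(pos, goal):
    """Grid distance from the agent's position after a move
    to the victim's cell (row/column differences)."""
    (r, c), (rv, cv) = pos, goal
    return abs(r - rv) + abs(c - cv)

# The four discrete actions and their row/column offsets.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def choose_action(Q, state, goal, xi=1.0):
    """Pick the action maximizing the Q-value minus the weighted
    heuristic distance to the victim (a common way to fold a
    heuristic into Q-learning's greedy step)."""
    r, c = state
    def score(a):
        dr, dc = MOVES[a]
        nxt = (r + dr, c + dc)
        return Q.get((state, a), 0.0) - xi * manhattan(nxt, goal)
    return max(MOVES, key=score)
```

With an untrained (empty) Q-table, the policy degenerates to pure distance greediness; as Q-values accumulate, they increasingly dominate the choice.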
IV Experiments
In this section, we present an experimental study using real data collected from Twitter to fully evaluate the performance of our proposed algorithm. We first introduce the dataset and data processing, then show how the identified volunteers, victims, and locations are mapped into the reinforcement learning environment. Finally, we evaluate the performance of the proposed ResQ algorithm and compare it against traditional search-based methods.
IV-A Datasets
Tweets were collected from Aug 23, 2017 to Sept 5, 2017 using the Twitter API, covering the whole course of Hurricane Harvey and its immediate aftermath. The raw data for each day includes about two million tweets, each of which has 36 attributes including, among other information, the location the tweet originated from, its geographic coordinates, and the user profile location. The raw data was cleaned by removing tweets that did not originate from the United States. Figure 4 shows a heat map of the Hurricane Harvey-related tweets from Aug 23 to Sept 5, 2017. Not surprisingly, the state of Texas has the largest total number of tweets: 173,315.
Total tweets: 25,945,502    Harvey tweets: 173,315
Volunteer tweets: 13,953    Victim tweets: 16,535
Classification performance (Precision, Recall, F-measure, Accuracy):

Harvey classification:
  Log. Regr.:  0.8     0.7273  0.7619  0.8646
  KNN:         1.0     0.2105  0.3478  0.7580
  CART:        1.0     0.6364  0.7778  0.8919
  SVM:         0.8947  0.9444  0.9189  0.9516

Victim classification:
  Log. Regr.:  0.8437  0.5510  0.6667  0.7127
  KNN:         1.0     0.8414  0.9139  0.9172
  CART:        1.0     0.9795  0.9896  0.9893
  SVM:         0.9146  0.9868  0.9493  0.9490

Volunteer classification:
  Log. Regr.:  0.9583  0.6216  0.7541  0.8170
  KNN:         1.0     0.6129  0.76    0.8248
  CART:        1.0     0.7567  0.8645  0.8902
  SVM:         0.9146  0.9868  0.9493  0.9490
IV-B Identification of Victims and Volunteers
To identify victims of Hurricane Harvey and the volunteers wishing to help save them from social media, we first designed a classifier to filter Harvey-related tweets from all the collected tweets. In this context, a Harvey tweet refers to a post talking about or related to Hurricane Harvey. Within the Harvey tweets, we further developed two classifiers to identify victim tweets and volunteer tweets. Here, victim tweets are those from victims (or their friends) requesting help, including retweets; volunteer tweets are those from volunteers who have boats and are willing to offer help. All three classifiers were implemented based on a Support Vector Machine (SVM). For every classifier, Harvey-related tweets were manually labelled, with a portion of the tweets used for training and the rest for testing. A five-fold cross-validation method was then applied to ensure the classification results were trustworthy. To obtain a reliable classification result, we compared Logistic Regression, K-Nearest Neighbor (KNN), CART, and SVM. Measurement criteria such as precision (positive predictive value), recall, F-measure, and accuracy were employed to measure the performance, as shown in Table II.

Victim and volunteer time series
To monitor the impact of Hurricane Harvey and the rescue activities, we tracked the victim and volunteer tweet time series from Aug 23 to Sept 5, 2017, as shown in Figure 5. Initially, when Hurricane Harvey formed into a tropical depression on Aug 23, it drew little attention on Twitter in the US. When Harvey made landfall near the Texas Gulf Coast on Aug 25, there was a burst of victim tweets. As the number of victims requesting help increased, the number of volunteers also rose sharply, peaking on Aug 28; victim tweets peaked on Aug 29. As Harvey moved away and system-wide rescue efforts proceeded, both victim and volunteer tweets dropped gradually. In general, the number of volunteer tweets was always lower than the number of victim tweets.
Geocoding
To identify the victim and volunteer locations, we designed a simple tool to extract the tweets’ locations. For tweets from GPS-enabled devices that included geographic coordinates, or tweets giving specific addresses, we used the address directly to locate the victims or volunteers. Otherwise, we combined alternative sources of information to infer the location, such as the self-reported location string in the user’s profile metadata, or by analyzing the tweet’s content. With the help of the World Gazetteer (http://archive.is/srm8P) database, we were able to look up location names and geographic coordinates.
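The fallback chain described above might be sketched as follows; the field names and the dictionary-style gazetteer interface are hypothetical, not the tool's actual schema:

```python
def resolve_location(tweet, gazetteer):
    """Best-effort location resolution for one tweet.

    Priority: device GPS coordinates > an explicit address in the
    tweet > the self-reported profile location looked up in a
    gazetteer. Returns (coordinates_or_None, source_tag).
    """
    if tweet.get("coordinates"):
        return tweet["coordinates"], "gps"
    if tweet.get("address"):
        # Address strings are resolved against the gazetteer here;
        # a real pipeline would use a street-level geocoder.
        return gazetteer.get(tweet["address"]), "address"
    profile = tweet.get("profile_location")
    if profile and profile in gazetteer:
        return gazetteer[profile], "profile"
    return None, "unresolved"
```

The source tag makes it easy to track how many locations came from each evidence tier, which matters when judging geocoding precision.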
IV-C Experiment Setting
We model the rescue scheduling problem using a heuristic multi-agent, fully cooperative reinforcement learning method. Multi-agent means that we use multiple agents to represent multiple volunteers; the number of agents depends on the number of volunteers identified in the volunteer tweet classification process for each day. We assume that the victims are immobile learning targets, since they are trapped. Because the volunteers aim to rescue all the victims as soon as possible, the goal of all agents is to reach all their targets at the lowest cost and maximize the total reward.
In the following sections, we describe how the disaster grid environment is identified and what actions volunteers can perform in the course of their rescuing activities.
IV-C1 The grid environment identification
The process of environment building is illustrated in Figure 6. In actual disaster relief operations, the whole city of Houston is the activity space for volunteers, and since a volunteer can move in any direction, the combination of positions and directions is infinite. According to our statistics, 95% of the requests for help during the hurricane came from a fixed downtown area. For simplicity, our model is based on a quasi-square area defined by the four position coordinates (29.422486, -95.874178), (30.154665, -95.874178), (30.154665, -95.069705) and (29.422486, -95.069705), shown in Figure 3. This square region has a width of 50 miles and is mapped onto a 25-by-25 grid, with each cell representing a 4-square-mile area in the real world. By applying this simple mapping to convert the actual map into a virtual grid, we transform the continuous real-world state space into a more manageable discrete state space, significantly reducing the state space complexity.
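The map-to-grid conversion follows directly from the bounding box above; the sketch below assumes points lying exactly on the upper edge are clamped into the last cell (longitudes are west of the meridian, hence negative):

```python
# Bounding box of the quasi-square Houston study area.
LAT_MIN, LAT_MAX = 29.422486, 30.154665
LON_MIN, LON_MAX = -95.874178, -95.069705
GRID = 25  # 25 x 25 cells, each roughly 2 miles x 2 miles (4 sq mi)

def to_cell(lat, lon):
    """Map a coordinate inside the bounding box to a (row, col)
    cell of the discrete grid used as the RL state space."""
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * GRID)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * GRID)
    # Clamp the top/right edges into the last cell.
    return min(row, GRID - 1), min(col, GRID - 1)
```

Every volunteer and victim position extracted from tweets would pass through such a function before entering the grid-world environment.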
The position coordinates of the victims and volunteers are extracted every hour following the process described in Figure 6. This hourly update keeps the system informed of the number of available volunteers that can be scheduled to rescue the remaining victims. For victim tweets that contain the victim’s address and phone number, such as “McLean & Fite St. Pearland, TX 832425****”, we extract the address and convert it to position coordinates. For volunteer tweets that do not include an address, we use the geocoding tool described in Section IV-B to extract geographical information from the raw tweet. This approximation is not precise but is reasonable: since volunteers move around to rescue victims, their exact starting address matters little. Every volunteer is then mapped into the grid according to his or her position coordinates.
IV-D Disaster Relief Coordination Performance
IV-D1 Baseline Models
We used the following classical search methods to compare their performances with that of our proposed technique:
Random walk
In this search policy, the agents randomly walk around the grid and rescue any victims they come across along the way. The behavior is totally random, with no other knowledge of the environment.
Greedy best first search
Greedy best-first search offers volunteers a heuristic estimate of the distance to the victims. Volunteers begin by rescuing the closest victims first and then move on to those farther away, sequentially.
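For a single volunteer, this greedy policy amounts to a nearest-victim-first ordering; a minimal sketch (using Manhattan distance on the grid, which is our assumption about the heuristic):

```python
def greedy_schedule(volunteer, victims):
    """Greedy best-first order: repeatedly rescue the closest
    remaining victim, measured by Manhattan distance on the grid."""
    pos, remaining, order = volunteer, list(victims), []
    while remaining:
        nxt = min(remaining,
                  key=lambda v: abs(v[0] - pos[0]) + abs(v[1] - pos[1]))
        remaining.remove(nxt)
        order.append(nxt)
        pos = nxt  # the volunteer continues from the rescue site
    return order
```

Note that this ordering ignores other volunteers entirely, which is exactly why the baseline's reward plateaus regardless of the rescue environment.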
Rule-based search
A rule-based search computes action rules by utilizing the probability of taking an action in a grid cell. The action with the highest probability is then selected as the next action. This probability is computed from the average rewards gained in the cells during training episodes controlled by the random walk algorithm. In particular, if $\bar{r}_t(g')$ is the average reward value at time $t$ of grid cell $g'$, and the volunteer takes action $a$ in order to move from grid cell $g$ to grid cell $g'$, the probability of taking action $a$ at grid cell $g$ is:

$p(a \mid g) = \dfrac{\bar{r}_t(g')}{\sum_{g'' \in \mathcal{N}(g)} \bar{r}_t(g'')}$  (5)

where $\mathcal{N}(g)$ denotes the cells reachable from $g$ in one step.
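One plausible reading of this rule, with move probabilities proportional to the neighboring cells' average rewards, can be sketched as follows; the uniform fallback for unexplored neighborhoods is our assumption:

```python
def action_probabilities(avg_reward, cell, moves):
    """Probability of each move, proportional to the average reward
    of the cell it reaches; falls back to uniform when no neighbor
    has accumulated any positive reward yet."""
    r, c = cell
    raw = {a: max(avg_reward.get((r + dr, c + dc), 0.0), 0.0)
           for a, (dr, dc) in moves.items()}
    total = sum(raw.values())
    if total == 0:
        return {a: 1.0 / len(moves) for a in moves}
    return {a: v / total for a, v in raw.items()}

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
```

At decision time the baseline would then pick the argmax of this distribution, per the description above.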
Value iteration
This algorithm works by dynamically updating a value table based on policy evaluation, as described in [41]. The allocation policy is then computed from the updated value table.
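A textbook value-iteration sweep over a small tabular MDP, of the kind this baseline relies on, might look like the following sketch (the transition/reward table format is illustrative):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Plain value iteration on a finite MDP.

    P[(s, a)] is a list of (prob, next_state) pairs; R[(s, a)] is the
    immediate reward. Sweeps Bellman backups until the largest value
    change falls below tol, then returns the value table.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[(s, a)] + gamma * sum(p * V[s2]
                       for p, s2 in P[(s, a)]) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Unlike Q-learning, this requires the full transition model up front, which is exactly what is unavailable in a live disaster.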
Reinforcement Learning
This is the traditional reinforcement learning technique, with no heuristic involved in action selection. It uses the same action, state, and reward settings as our proposed heuristic reinforcement learning approach.
IV-D2 Evaluation Metrics
We define an episode as the set of attempts made by all volunteers to successfully rescue all the victims. Hence, our key metrics for measuring the performance of rescue activities are the average episode time, average episode reward, average reward rate and average rescuing cost.
Average episode time
is the average total number of time steps required to rescue all the victims, over all executed episodes. Each time step corresponds to one step action (from one cell to another nearby cell) taken by all the available volunteers.
Average episode reward
is the average cumulative reward that all volunteers earn in each episode of rescuing.
Average reward rate
is the ratio between the average episode reward and the average episode time, representing the average reward that all volunteers earn in one time step. If there are $K$ episodes, and $R_k$ and $T_k$ represent the reward and total time steps for episode $k$, respectively, then the average reward rate $\bar{\rho}$ is defined as:

$\bar{\rho} = \dfrac{\sum_{k=1}^{K} R_k}{\sum_{k=1}^{K} T_k}$  (6)
Average rescuing cost
represents the total time-step cost to earn one unit of reward. This is the inverse of the reward rate:

$\bar{c} = \dfrac{\sum_{k=1}^{K} T_k}{\sum_{k=1}^{K} R_k}$  (7)
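Both metrics follow directly from per-episode totals; for example, an episode set totalling reward 167.2 over 17.4 time steps yields a rate of about 9.6 and a cost of about 0.104, matching the random-walk row of the results table below:

```python
def reward_rate(rewards, times):
    """Average reward earned per time step across all episodes."""
    return sum(rewards) / sum(times)

def rescuing_cost(rewards, times):
    """Time steps spent per unit of reward: the inverse ratio."""
    return sum(times) / sum(rewards)
```

Because both metrics pool totals before dividing, long episodes weigh more than short ones, which is the intended behavior for a ratio-of-averages.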
IV-D3 Results and Comparisons
In this work, the heuristic multi-agent reinforcement learning model for disaster relief is trained and evaluated in OpenAI Gym [42]. Unlike standard simulated reinforcement learning settings, our experimental environment is based on the real-world geographical positions of tweets. A volunteer is formulated as taking actions in an environment and receiving rewards and observations at every time step. The training of the agents stops once the volunteers’ policies converge. The main purpose is to minimize the amount of time needed to rescue all the victims in the target environment.
Method                   Time   Reward   Reward Rate   Rescuing Cost
Random Walk              17.4   167.2     9.6          0.104
Greedy B.F.S.             9.6   182.7    19.03         0.053
Rule-based               15.9   170.2    10.70         0.093
Value Iteration          14.1   173.7    12.32         0.081
Reinforcement Learning    5.4   172.0    31.85         0.031
Heuristic R.L. (ResQ)     5.0   189.7    37.9          0.026
For these experiments, we transform the geographical distribution of tweets into a grid and set up a centralized communication environment, consisting of N volunteers and M victims in a two-dimensional grid with discrete space and discrete time. The process of extracting geographical information about volunteers and victims is illustrated in Figure 6. Volunteers may take actions in the environment and communicate with the remote central server. They are assigned a penalty if they go off the grid and receive a reward when they reach the victims they are to rescue.
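The per-step reward logic described above can be sketched for a single volunteer as follows (the grid size, the penalty of -1, and the reward of +10 are illustrative assumptions; the paper's exact values may differ):

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step_volunteer(pos, action, victims, grid=(100, 100),
                   off_grid_penalty=-1.0, rescue_reward=10.0):
    """Move one volunteer by one cell and return (new_pos, reward)."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < grid[0] and 0 <= c < grid[1]):
        return pos, off_grid_penalty         # off the grid: penalize, stay put
    if (r, c) in victims:
        victims.discard((r, c))              # victim at this cell is rescued
        return (r, c), rescue_reward
    return (r, c), 0.0
```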
We compared the experimental performance of the proposed ResQ algorithm with Random walk, Greedy best-first search (B.F.S.), Rule-based search, Value iteration, and a traditional Reinforcement Learning method. Figure 7 presents each algorithm's performance over 2000 episodes (an episode being a path from an initial state to a terminal state). In Figure 7(a) and Figure 7(b), we compare the total reward and total time steps per episode for each strategy. ResQ quickly converges to a stable state after the first 24 episodes of training; once converged, it consistently outperforms all other approaches. The traditional reinforcement learning technique also performs well after convergence, but it takes much longer to converge (208 episodes in the current experiment) and its average reward over the entire period is lower than ResQ's. The greedy B.F.S. strategy performs consistently over time, shown as points around constant lines. This is not surprising: under this strategy the agents always head for the closest victims first, independent of other factors in the rescue environment. Overall, the reward of the greedy B.F.S. strategy is lower than ResQ's, although its time steps beat ResQ's during the latter's training phase. The Random walk approach yields both the lowest overall reward and the highest completion time per episode, with large variation across episodes. Rule-based search and Value iteration also fall well short of the proposed ResQ technique. Figure 7(c) and Figure 7(d) show the heatmap of the reward distribution and its corresponding box plot, respectively. During the 2000 testing episodes, ResQ earns a reward above 190 in most episodes, while the other methods reach this range significantly less often.
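As a point of reference, the greedy best-first baseline amounts to each volunteer independently targeting its nearest remaining victim (Manhattan distance on the grid is an assumed metric here). The sketch below also makes the coordination problem visible: uncoordinated volunteers may all pick the same victim.

```python
def greedy_targets(volunteers, victims):
    """Assign every volunteer the closest victim by Manhattan distance.

    Without coordination, several volunteers may choose the same victim --
    exactly the duplicated effort the scheduling policy is meant to avoid.
    """
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return {v: min(victims, key=lambda w: dist(v, w)) for v in volunteers}

# Two volunteers near the same victim both target it -- wasted effort:
targets = greedy_targets([(0, 0), (0, 2)], [(0, 1), (9, 9)])
# targets == {(0, 0): (0, 1), (0, 2): (0, 1)}
```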
Table III summarizes each algorithm's total time, total reward, reward rate, and rescuing cost. The results clearly show that ResQ achieves the best overall reward, the shortest completion time, the highest reward rate, and the lowest rescuing cost. In particular, the Greedy B.F.S. and Reinforcement Learning methods come close to the proposed method in reward and time, respectively. Nonetheless, the proposed heuristic reinforcement learning approach evidently outperforms both when the two metrics are considered simultaneously.
V Discussion
This paper presents a novel algorithm designed to develop a better response to victims' requests for assistance during disasters, along with a case study using Twitter data collected during Hurricane Harvey in 2017. This is one of the first attempts to formulate the large-scale disaster rescue problem as a feasible heuristic multi-agent reinforcement learning problem using massive social network data. With the proposed method, we can train classifiers to extract victim and volunteer information from tweets and transform the data for use in a reinforcement learning environment. Our key contribution is the design of a heuristic multi-agent reinforcement learning scheduling policy that simultaneously schedules multiple volunteers to rescue disaster victims quickly and effectively. The heuristic multi-agent reinforcement learning algorithm can respond to dynamic requests and achieve optimal performance over space and time. This approach helps match volunteers with victims for faster disaster relief and better use of limited public resources. The proposed framework for disaster exploration and relief recommendation is significant in that it provides a new disaster relief channel that can serve as a backup plan when traditional helplines are overloaded.
References
 [1] NOAA, U.S. 2017 Billion-Dollar Weather and Climate Disasters, 2017. [Online]. Available: https://www.ncdc.noaa.gov/billions/
 [2] X. Zhu, “Semi-supervised learning literature survey,” 2005.
 [3] Z. Gong, X. Gu, and J. Wilkes, “Press: Predictive elastic resource scaling for cloud systems,” in 2010 International Conference on Network and Service Management, Oct 2010, pp. 9–16.
 [4] S. Islam, J. Keung, K. Lee, and A. Liu, “Empirical prediction models for adaptive resource provisioning in the cloud,” Future Generation Computer Systems, 2012.
 [5] W. Song, Z. Xiao, Q. Chen, and H. Luo, “Adaptive resource provisioning for the cloud using online bin packing,” IEEE Transactions on Computers, 2014.
 [6] W. Iqbal, M. N. Dailey, D. Carrera, and P. Janecek, “Adaptive resource provisioning for read intensive multi-tier applications in the cloud,” Future Generation Computer Systems, 2011.
 [7] Y. Jiang, C.-S. Perng, T. Li, and R. Chang, “Asap: A self-adaptive prediction system for instant cloud resource demand provisioning,” in 2011 IEEE 11th International Conference on Data Mining, Dec 2011.
 [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” Computing Research Repository, 2013.
 [9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016.
 [10] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey of deep reinforcement learning,” CoRR, vol. abs/1708.05866, 2017. [Online]. Available: http://arxiv.org/abs/1708.05866
 [11] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” Computing Research Repository, 2016.
 [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015.
 [13] T. Nazer, G. Xue, Y. Ji, and H. Liu, “Intelligent disaster response via social media analysis: A survey,” ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 46–59, 2017.
 [14] H. Gao, G. Barbier, and R. Goolsby, “Harnessing the crowdsourcing power of social media for disaster relief,” IEEE Intelligent Systems, vol. 26, no. 3, pp. 10–14, 2011.
 [15] M. Zook, M. Graham, T. Shelton, and S. Gorman, “Volunteered geographic information and crowdsourcing disaster relief: A case study of the Haitian earthquake,” World Medical & Health Policy, vol. 2, no. 2, pp. 7–33, 2010. [Online]. Available: http://dx.doi.org/10.2202/1948-4682.1069
 [16] Z. Ashktorab, C. Brown, M. Nandi, and A. Culotta, “Tweedr: Mining twitter to inform disaster response,” in 11th Proceedings of the International Conference on Information Systems for Crisis Response and Management, University Park, Pennsylvania, USA, May 18–21, 2014. ISCRAM Association, 2014.
 [17] J. B. Houston, J. Hawthorne, M. F. Perreault, E. H. Park, M. Goldstein Hode, M. R. Halliwell, S. E. Turner McGowen, R. Davis, S. Vaid, J. A. McElderry et al., “Social media and disasters: a functional framework for social media use in disaster planning, response, and research,” Disasters, vol. 39, no. 1, pp. 1–22, 2015.
 [18] H. Gao, J. Tang, X. Hu, and H. Liu, “Exploring temporal effects for location recommendation on location-based social networks,” in RecSys, 2013.
 [19] Y. Lu, X. Hu, F. Wang, S. Kumar, H. Liu, and R. Maciejewski, “Visualizing social media sentiment in disaster scenarios,” in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 1211–1215.
 [20] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of Machine Learning Research, vol. 4, pp. 1039–1069, 2003.
 [21] R. Powers and Y. Shoham, “New criteria and a new algorithm for learning in multiagent systems,” in Advances in Neural Information Processing Systems 17, 2005.
 [22] M. Bowling, “Convergence and no-regret in multi-agent learning,” in Advances in Neural Information Processing Systems 17. MIT Press, 2005, pp. 209–216.
 [23] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI’01, 2001, pp. 1021–1026.
 [24] M. L. Littman, “Value-function reinforcement learning in markov games,” Cognitive Systems Research, vol. 2, no. 1, pp. 55–66, 2001.
 [25] L. Buşoniu, R. Babuška, and B. De Schutter, Multi-agent Reinforcement Learning: An Overview, 2010.
 [26] Y. Shoham, R. Powers, and T. Grenager, “If multi-agent learning is the answer, what is the question?” Artificial Intelligence, vol. 171, no. 7, pp. 365–377, 2007, Foundations of Multi-Agent Learning.
 [27] C. L. Liu and J. W. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, vol. 20, no. 1, pp. 46–61, Jan. 1973.
 [28] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings, “Real-time scheduling: The deadline-monotonic approach,” in Proceedings IEEE Workshop on Real-Time Operating Systems and Software, 1991, pp. 133–137.
 [29] A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing,” Future Gener. Comput. Syst., vol. 28, no. 5, pp. 755–768, May 2012.
 [30] A. Beloglazov and R. Buyya, “Energy efficient resource management in virtualized cloud data centers,” in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, ser. CCGRID ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 826–831.
 [31] J. Jiang, V. Sekar, and H. Zhang, “Improving fairness, efficiency, and stability in http-based adaptive video streaming with festive,” in Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, ser. CoNEXT ’12. New York, NY, USA: ACM, 2012, pp. 97–108.
 [32] T.-Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, “A buffer-based approach to rate adaptation: Evidence from a large video streaming service,” in Proceedings of the 2014 ACM Conference on SIGCOMM, ser. SIGCOMM ’14. New York, NY, USA: ACM, 2014, pp. 187–198.
 [33] D.-M. Chiu and R. Jain, “Analysis of the increase and decrease algorithms for congestion avoidance in computer networks,” Computer Networks and ISDN Systems, vol. 17, no. 1, pp. 1–14, 1989.
 [34] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani, “A hybrid reinforcement learning approach to autonomic resource allocation,” in Proceedings of ICAC ’06, 2006, pp. 65–73.
 [35] C.-Z. Xu, J. Rao, and X. Bu, “URL: A unified reinforcement learning approach for autonomic cloud management,” Journal of Parallel and Distributed Computing, vol. 72, no. 2, pp. 95–105, 2012.
 [36] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks, ser. HotNets ’16. New York, NY, USA: ACM, 2016, pp. 50–56.
 [37] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163.
 [38] R. Bellman, Dynamic Programming, 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
 [39] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1994.
 [40] R. A. Bianchi, C. H. Ribeiro, and A. H. R. Costa, “Heuristic selection of actions in multi-agent reinforcement learning,” in IJCAI, 2007, pp. 690–695.
 [41] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [42] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.