1 Introduction
Resource scheduling and optimization is a costly, challenging problem that affects almost every aspect of our lives. In healthcare, for example, patients with nonurgent needs who experience prolonged wait times have higher rates of treatment noncompliance and missed appointments [Kehle et al. 2011, Pizer and Prentice 2011]. In military engagements, the weapon-to-target assignment problem requires warfighters to deploy the minimal amount of resources in order to mitigate as many threats as possible while maximizing the duration of survival [Lee et al. 2003].
The problem of optimal task allocation and sequencing with upper- and lower-bound temporal constraints (i.e., deadlines and wait constraints) is NP-hard [Bertsimas and Weismantel 2005], and domain-independent approaches to real-world scheduling problems quickly become computationally intractable [Boese et al. 1994, Streeter and Smith 2006, Do and Kambhampati 2003]. However, human domain experts are able to learn from experience to develop strategies, heuristics and rules of thumb to effectively respond to these problems. The challenge we pose is to autonomously learn the strategies employed by these domain experts; this knowledge can be applied and disseminated more efficiently with such a model than with a “single-expert, single-apprentice” model.
Researchers have made significant progress toward capturing domain-expert knowledge from demonstration [Berry et al. 2011, Abbeel and Ng 2004, Konidaris et al. 2011b, Zheng et al. 2015, Odom and Natarajan 2015, Vogel et al. 2012, Ziebart et al. 2008]. In one recent work [Berry et al. 2011], an AI scheduling assistant called PTIME learned how users preferred to schedule events. PTIME was subsequently able to propose scheduling changes when new events occurred by solving an integer program. Two limitations to this work exist, however: PTIME requires users to explicitly rank their preferences about scheduling options to initialize the system, and also uses a complete solver that, in the worst-case scenario, must consider an exponential number of options.
Research focused on capturing domain knowledge based solely on user demonstration has led to the development of inverse reinforcement learning (IRL) [Abbeel and Ng 2004, Konidaris et al. 2011b, Zheng et al. 2015, Odom and Natarajan 2015, Vogel et al. 2012, Ziebart et al. 2008]. IRL serves the dual purpose of learning an unknown reward function for a given problem and learning a policy to optimize that reward function.
However, there are two primary drawbacks to IRL for scheduling problems: computational tractability and the need for an environment model. The classical apprenticeship learning algorithm, developed by Abbeel and Ng in 2004, requires repeated solving of a Markov decision process (MDP) until a convergence criterion is satisfied. However, enumerating a large state space, such as those common to large-scale scheduling problems involving hundreds of tasks and tens of agents, can quickly become computationally intractable due to memory limitations. Approximate dynamic programming approaches exist that essentially reformulate the problem as regression [Konidaris et al. 2011b, Mnih et al. 2015], but the amount of data required to regress over a large state space remains challenging, and MDP-based scheduling solutions exist only for simple problems [Wu et al. 2011, Wang and Usher 2005, Zhang and Dietterich 1995].
IRL also requires a model of the environment for training. At its most basic, reinforcement learning uses a Markovian transition matrix that describes the probability of transitioning from an initial state to a subsequent state when taking a given action. In order to address circumstances in which environmental dynamics are unknown or difficult to model within the constraints of a transition, researchers have developed Q-learning and its variants, which have had much recent success [Mnih et al. 2015]. However, these approaches require the ability to “practice,” or explore the state space by querying a black-box emulator to solicit information about how taking a given action in a specific state will change that state.
Another prior method involves directly learning a function that maps states to actions [Chernova and Veloso 2007, Terrell and Mutlu 2012, Huang and Mutlu 2014]. For example, Ramanujam and Balakrishnan trained a discrete-choice model using real data collected from air traffic controllers, and showed how this model can accurately predict the correct runway configuration for an airport [Ramanujam and Balakrishnan 2011]. Sammut et al. [Sammut et al. 1992] applied a decision tree model for an autopilot to learn to control an aircraft from expert demonstration. Action-driven learning techniques offer great promise for learning policies from expert demonstrators, but they have not been applied to complex scheduling problems. For these methods to succeed, the scheduling problem must be modeled in a way that allows for efficient computation of a scheduling policy.
In this paper, we propose a technique, which we call “apprenticeship scheduling,” to capture this domain knowledge in the form of a scheduling policy. Our objective is to learn scheduling policies through expert demonstration and validate that schedules produced by these policies are of comparable quality to those generated by human or synthetic experts. Our approach efficiently utilizes domain-expert demonstrations without the need to train with an environment emulator. Rather than explicitly modeling a reward function and relying upon dynamic programming or constraint solvers (which become computationally intractable for large-scale problems of interest), our objective is to use action-driven learning to extract the strategies of domain experts in order to efficiently schedule tasks.
The key to our approach is the use of pairwise comparisons between the actions taken (e.g., schedule agent $a$ to complete task $\tau_i$ at time $t$) and the set of actions not taken (e.g., unscheduled tasks at time $t$) to learn the relevant model parameters and scheduling policies demonstrated by the training examples. Our approach was inspired by cognitive studies of human decision-making, in which learning through comparisons (in particular, paired comparisons) was identified as a foundation of human multi-criteria decision-making [Saaty 2008, Lombrozo 2006]. Rather than explicitly query human experts about their preferences, our approach functions more like a human apprentice who learns by observing a sequence of actions performed by a demonstrator. Our approach automatically computes pairwise comparisons of the features describing the action taken at each moment in time relative to the corresponding set of actions not taken, producing sets of both positive and negative training examples. We formulate the apprenticeship scheduling problem as one of learning a pairwise preference model, and construct a classifier that is able to predict the rank of all possible actions and, in turn, predict which action the expert would ultimately take at each moment in time.
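The construction of positive and negative examples from pairwise comparisons can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the two-element feature vectors, the function names `pairwise_examples` and `predict_action`, the use of a plain feature difference, and the `score_pair` classifier interface are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]


def pairwise_examples(
    candidates: Dict[str, Features], chosen: str
) -> List[Tuple[Features, int]]:
    """Build pairwise examples for one decision point of a demonstration.

    For every action the demonstrator did NOT take, emit
      (chosen - other, 1): the chosen action is preferred over the other
      (other - chosen, 0): the other action is not preferred over the chosen
    """
    taken = candidates[chosen]
    examples: List[Tuple[Features, int]] = []
    for name, feats in candidates.items():
        if name == chosen:
            continue
        diff = [t - f for t, f in zip(taken, feats)]
        examples.append((diff, 1))
        examples.append(([-d for d in diff], 0))
    return examples


def predict_action(
    candidates: Dict[str, Features],
    score_pair: Callable[[Features], float],
) -> str:
    """Return the action with the highest summed pairwise preference score."""
    def total(name: str) -> float:
        return sum(
            score_pair([a - b for a, b in zip(candidates[name], other)])
            for other_name, other in candidates.items()
            if other_name != name
        )
    return max(candidates, key=total)


# Hypothetical decision point; features are (deadline slack, travel distance).
observation = {"task_a": [2.0, 5.0], "task_b": [8.0, 1.0], "task_c": [6.0, 4.0]}
examples = pairwise_examples(observation, chosen="task_a")
```

Any binary classifier trained on such examples could play the role of `score_pair`; summing its outputs over all comparisons yields a ranking of the candidate actions, and the top-ranked action is the prediction.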
We validated our approach using both a synthetic data set of solutions for a variety of scheduling problems and two real-world data sets of demonstrations by human experts solving a variant of the weapon-to-target assignment problem [Lee et al. 2003], known as anti-ship missile defense (ASMD), and a hospital resource allocation problem [Gombolay, Yang, et al. 2016]. The synthetic and real-world problem domains we used to empirically validate our approach represent two of the most challenging classes within the taxonomy established by Korsah et al. [Korsah et al. 2013].
The first problem we considered was the vehicle routing problem with time windows, temporal dependencies and resource constraints (VRPTW-TDR). Depending upon parameter selection, this family of problems encompasses the traveling salesman (Type 1), job-shop scheduling, multi-vehicle routing and multi-robot task allocation problems, among others. We found that apprenticeship scheduling accurately learns multifaceted heuristics that emulate the demonstrations of experts solving these problems. We observed that an apprenticeship scheduler trained on a small data set of 15 scheduling demonstrations selected the correct scheduling action with up to accuracy. We also empirically characterized the extent to which our method is robust to errors that humans, even experts, may commonly make. We found that our method is able to learn a high-quality representation of the demonstrator’s underlying heuristic from a “noisy” expert demonstrator that selects an incorrect action up to of the time.
Next, we observed that apprenticeship scheduling learned a policy for ASMD that outperformed the average ASMD domain expert for a statistically significant portion of problem scenarios () when trained on 15 perfect expert-generated schedules. Third, we trained a decision support tool to assist nurses in managing resources (including patient rooms, staff and equipment) in a Boston hospital. We found that of the high-quality recommendations generated by the apprentice scheduler were accepted by the nurses and doctors participating in the study.
In this work, we also introduce a new technique called Collaborative Optimization via Apprenticeship Scheduling (COVAS), which incorporates learning from human expert demonstration within an optimization framework to automatically and efficiently produce optimal solutions for challenging real-world scheduling problems. This technique applies apprenticeship scheduling to generate a favorable (if suboptimal) initial solution to a new scheduling problem. To guarantee that the generated schedule is serviceable, we augment the apprenticeship scheduler to solve a constraint satisfaction problem, ensuring that the execution of each scheduling commitment does not directly result in infeasibility for the new problem. COVAS uses this initial solution to provide a tight bound on the value of the optimal solution, substantially improving the efficiency of a branch-and-bound search for an optimal schedule.
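The benefit of seeding a branch-and-bound search with a feasible initial solution can be illustrated on a toy problem. The sketch below is not COVAS itself: it assumes a small task-to-agent assignment problem with a hypothetical cost matrix, and `greedy_policy` is only a stand-in for a learned apprenticeship scheduling policy.

```python
from typing import List, Optional, Set, Tuple


def greedy_policy(cost: List[List[float]]) -> Tuple[float, List[int]]:
    """Hypothetical stand-in for a learned policy: cheapest free agent per task."""
    used: Set[int] = set()
    assign: List[int] = []
    total = 0.0
    for row in cost:
        agent = min((a for a in range(len(row)) if a not in used),
                    key=lambda a: row[a])
        used.add(agent)
        assign.append(agent)
        total += row[agent]
    return total, assign


def branch_and_bound(
    cost: List[List[float]],
    incumbent: Optional[Tuple[float, List[int]]] = None,
) -> Tuple[float, List[int], int]:
    """Assign each task to a distinct agent at minimum total cost.

    `incumbent` is an optional (value, assignment) pair used as the initial
    upper bound. Returns (best value, best assignment, nodes expanded).
    """
    n = len(cost)
    best_val, best_assign = incumbent if incumbent else (float("inf"), [])
    nodes = 0

    def lower_bound(task: int, used: Set[int], so_far: float) -> float:
        # Optimistic completion: every remaining task gets its cheapest free agent.
        return so_far + sum(
            min(cost[t][a] for a in range(n) if a not in used)
            for t in range(task, n)
        )

    def search(task: int, used: Set[int], so_far: float, partial: List[int]) -> None:
        nonlocal best_val, best_assign, nodes
        nodes += 1
        if task == n:
            if so_far < best_val:
                best_val, best_assign = so_far, partial[:]
            return
        if lower_bound(task, used, so_far) >= best_val:
            return  # prune: this subtree cannot beat the incumbent
        for agent in range(n):
            if agent not in used:
                used.add(agent)
                partial.append(agent)
                search(task + 1, used, so_far + cost[task][agent], partial)
                partial.pop()
                used.remove(agent)

    search(0, set(), 0.0, [])
    return best_val, list(best_assign), nodes


cost = [[2, 1, 9], [9, 2, 9], [9, 9, 3]]   # hypothetical task-by-agent costs
seed = greedy_policy(cost)                  # feasible but suboptimal schedule
cold = branch_and_bound(cost)               # no initial bound
warm = branch_and_bound(cost, incumbent=seed)
```

Both searches return the same optimal value, while the seeded search never expands more nodes than the unseeded one, because the initial upper bound allows subtrees to be pruned from the very first node.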
We first presented the apprenticeship scheduling technique in a prior work [Gombolay, Jensen, et al. 2016], and also previously discussed an application of the technique to the hospital scheduling problem [Gombolay, Yang, et al. 2016]. This paper incorporates multiple extensions to these original works: First, we improve the performance of the original technique through the use of hyperparameter tuning. Second, we incorporate the data set acquired from the hospital domain in the previous study [Gombolay, Yang, et al. 2016] to validate apprenticeship scheduling using a second real-world data set consisting of scheduling decisions generated by hospital nurses. Third, we present COVAS, an algorithmic extension that enables human-machine collaborative optimization. COVAS leverages apprenticeship scheduling to optimally solve scheduling problems, whereas apprenticeship scheduling alone does not provide guarantees for solution quality. We report here that COVAS is able to leverage viable (but imperfect) human demonstrations to quickly produce globally optimal solutions. Fourth, we show that COVAS can transfer an apprenticeship scheduling policy learned for a small problem to optimally solve problems involving twice as many variables as those observed during any training demonstrations, and also produce an optimal solution an order of magnitude faster than mathematical optimization alone.
2 Background
In this section, we briefly review goal and policy learning, as well as methods for bridging machine learning (ML) and optimization. We also discuss the applicability and limitations of prior works related to learning through scheduling demonstration.
2.1 Goal Learning
Here, we review both IRL-based techniques and methods proposed for recommender and preference-learning systems within the realm of goal learning.
2.1.1 Inverse Reinforcement Learning
Learning from demonstration (LfD) is an active subfield of ML [Abbeel and Ng 2004, Berry et al. 2011, Ijspeert et al. 2002, Konidaris et al. 2011b, Zheng et al. 2015, Odom and Natarajan 2015, Terrell and Mutlu 2012, Thomaz and Breazeal 2006, Vogel et al. 2012, Ziebart et al. 2008]. Arguably, the most ubiquitous approach to LfD is inverse reinforcement learning, which is founded on a Markov decision process where:

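$S$ is a set of states.

$A$ is a set of actions.

$T: S \times A \times S \rightarrow [0, 1]$ is a transition function, where $T(s, a, s')$ is the probability of being in state $s'$ after executing action $a$ in state $s$.

$R: S \rightarrow \mathbb{R}$ ($R: S \times A \rightarrow \mathbb{R}$) is a reward function that takes the form of $R(s)$ or $R(s, a)$ depending upon whether the reward is assessed for being in a state or for taking a particular action within a state.

$\gamma \in [0, 1)$ is the discount factor for future rewards.
In a Markov decision process, the goal is to learn a policy $\pi: S \rightarrow A$ that dictates which action to take in each state in order to maximize the infinite-horizon expected reward starting in state $s$. This reward is defined by a value function, $V^{\pi}(s)$, as shown in Equation 1:
(1) $$V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, \pi, s_0 = s \right]$$
The value function satisfies the Bellman equation for all $s \in S$, as shown in Equation 2.
(2) $$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} T(s, \pi(s), s') \, V^{\pi}(s')$$
A policy $\pi$ is an optimal policy iff Equation 3 holds for all $s \in S$.
(3) $$\pi(s) \in \arg\max_{a \in A} \left[ R(s) + \gamma \sum_{s' \in S} T(s, a, s') \, V^{\pi}(s') \right]$$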
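As an illustration of the Bellman equation (Equation 2) and the optimality condition (Equation 3), the following minimal sketch evaluates a fixed policy on a tiny MDP and then checks whether that policy is greedy with respect to its own value function. The two-state MDP, its transition probabilities, its rewards, and the discount factor of 0.9 are all hypothetical values chosen for the example.

```python
GAMMA = 0.9
STATES = ["idle", "busy"]
ACTIONS = ["wait", "work"]
# T[s][a] maps each next state to its probability (hypothetical dynamics).
T = {
    "idle": {"wait": {"idle": 1.0}, "work": {"busy": 1.0}},
    "busy": {"wait": {"idle": 1.0}, "work": {"busy": 0.5, "idle": 0.5}},
}
R = {"idle": 0.0, "busy": 1.0}  # state-based reward R(s)


def backup(s, a, v):
    """One-step lookahead: reward plus discounted expected next-state value."""
    return R[s] + GAMMA * sum(p * v[s2] for s2, p in T[s][a].items())


def evaluate(policy, tol=1e-10):
    """Iterate the Bellman equation for a fixed policy until it converges."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: backup(s, policy[s], v) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


def greedy(v):
    """Return the policy that is greedy with respect to the value estimate v."""
    return {s: max(ACTIONS, key=lambda a: backup(s, a, v)) for s in STATES}


policy = {"idle": "work", "busy": "work"}
values = evaluate(policy)
is_optimal = greedy(values) == policy  # optimality check for this policy
```

Note that the dictionary `v` holds one entry per state, which is exactly the enumeration that becomes impractical when the state space is large.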
The problem of inverse reinforcement learning (IRL) is to take as input 1) a Markov decision process without a known reward function and 2) a set of expert demonstrations, and to then determine a reward function that produces the expert demonstrations. IRL has previously been successfully applied to autonomous driving [Abbeel and Ng 2004], aerobatic helicopter flight [Abbeel et al. 2007], urban navigation [Ziebart et al. 2008], spoken dialog systems [Chandramohan et al. 2011], and more. Researchers have also extended the capability of IRL algorithms to enable learning from operators with differing skill levels [Ramachandran and Amir 2007] and identification of operator subgoals [Michini and How 2012].
The computational bottleneck of IRL and dynamic programming, in general, is the size of the state space. Algorithms that solve the IRL problem [Lagoudakis and Parr 2003, Sutton et al. 1999, Tesauro 1995, Watkins and Dayan 1992] typically work by iteratively updating the estimate of the future expected reward of each state until convergence. However, for many problems of interest, the states are too numerous to hold in the memory of modern computers, and the time required for the expected future reward to converge can be impractical [Wu et al. 2011, Wang and Usher 2005, Zhang and Dietterich 1995].
Even if one approximately solves the RL problem [Konidaris et al. 2011b, Sutton et al. 1999], RL is still ill-suited for handling the temporal dependencies among tasks inherent in scheduling problems. Some researchers have attempted to extend the traditional Markov decision process to characterize temporal phenomena, but these techniques do not scale efficiently [Bradtke and Duff 1994, Das et al. 1999, Yu 2010]. The inherent challenge is that complex real-world scheduling problems are highly non-Markovian: the next state of the environment is dependent upon the history of actions taken to arrive at the current state and time. The few works that have addressed scheduling problems via RL assume models that are too restrictive: tasks must be periodic, occur with a regular frequency, and be independent, meaning there are no temporal dependencies between the tasks [Zhang and Dietterich 1995, Wu et al. 2011]. Even work [Aydin and Öztemel 2000] that relaxes the assumption of determinism and allows for tasks comprising predefined subtasks linked through precedence (as opposed to tasks representing atomic units of work) still does not consider wait, deadline, or resource-based constraints, nor does it consider problems in the XD complexity class [Korsah et al. 2013].
2.1.2 Recommender/Preference-Learning Systems
While not typically considered LfD, recommender systems are important within the field of goal learning. Recommender systems, which use collected information to predict a rating or degree of preference a consumer would give for an item (e.g., goods or services), have become ubiquitous during the Internet age, including services such as Netflix, which predicts which movies a viewer would want to watch [Koren et al. 2009]. These systems generally fall into one of two categories: collaborative filtering (CF) or content-based filtering (CB) [Park et al. 2012]. In essence, collaborative filtering is a technique through which an algorithm learns to predict content for a single user based upon his or her history and that of other users who share his or her interests. However, CF suffers from problems related to data sparsity and scalability [Park et al. 2012]. CB works by comparing content that the user has previously viewed with new content [Claypool et al. 1999, Herlocker et al. 2004, Sarwar et al. 2000]. The challenge of content-based filtering lies in the difficulty of measuring the similarities between two items; also, these systems can often overfit, only predicting content that is very similar to that which the user has previously used [Basu et al. 1998, Schafer et al. 2007]. Researchers have previously employed association rules [Cho et al. 2002], clustering [Lihua et al. 2005, Linoff and Berry 2004], decision trees [Kim et al. 2002], k-nearest neighbor algorithms [Kim et al. 2009], neural networks [Anders and Korn 1999, Ibnkahla 2000], link analysis [Cai et al. 2004], regression [Malhotra 2010], and general heuristic techniques [Park et al. 2012] to recommend content to users.
Ranking the relevance of Web pages is a key focus within systems that recommend suggested topics to users [Cao et al. 2007, Haveliwala 2002, Herbrich et al. 2000, Jin et al. 2008, Page et al. 1999, Pahikkala et al. 2007, Li et al. 2007, Valizadegan et al. 2009, Volkovs and Zemel 2009]. The seminal paper on Web page ranking by Page et al. initiated the computational study of page ranking with an algorithm, PageRank, which assesses the relevance of a page by determining the number of other pages that link to the page in question [Page et al. 1999]. Since that paper, many have focused on developing better models for recommending Web pages to users; these models can then be trained using various ML algorithms [Haveliwala 2002, Herbrich et al. 2000, Jin et al. 2008, Pahikkala et al. 2007].
There are three primary approaches to modeling the importance of a Web page: pointwise, pairwise, and listwise ranking. In pointwise ranking, the goal is to determine a score for a Web page via regression analysis, given features describing its contents [Li et al. 2007, Page et al. 1999]. Pairwise ranking is typically a classification problem in which the aim is to predict whether one page is more relevant than another, given a user’s query [Jin et al. 2008, Pahikkala et al. 2007]. More recent efforts have focused on listwise ranking, in which researchers develop loss functions based on entire lists of ranked Web pages, rather than individual pages or pairwise comparisons between pages [Cao et al. 2007, Valizadegan et al. 2009, Volkovs and Zemel 2009]. Our approach draws inspiration from the Web page pairwise ranking formulation in order to improve the tractability of learning scheduling policies from demonstration. We further discuss the relationship between prior work and our own approach in Section 3.2.
The recommender and preference-learning system most closely related to ours is that of Berry et al. [Berry et al. 2006, Berry et al. 2011], which focused specifically on scheduling applications. Their goal was to develop an autonomous scheduling assistant that learned the preferences of the user. Berry et al. produced a number of works over the course of a decade, culminating in the development of an automated scheduling assistant, called PTIME. The purpose of PTIME was to help human coworkers schedule meetings. Berry et al. incorporated extensive questionnaires to solicit the preferences of human workers regarding how they preferred to arrange their schedules. PTIME would take these preferences as input and map them to a mathematical objective function. When a new meeting needed to be arranged amongst the workers, PTIME would solve a mixed-integer mathematical program to determine the optimal time for this meeting to occur. However, after approximately a decade of work, the ultimate acceptance rate of PTIME’s suggestions was only . These authors conducted a retrospective analysis of their work and presented the following guidance for future researchers [Berry et al. 2011]:

“A personal assistant must build trust.”

“An assistive agent must aim to support, rather than replace, the user’s natural process.”
These tenets have served as an inspiration for our own work, and we believe all future works should begin with these key design principles.
Other works have outlined alternate approaches to elicitation and utilization of user preferences. De Grano et al. presented a method for optimizing scheduling shifts among nurses by soliciting nurses’ preferences via an auction process [Grano et al. 2009]. In particular, De Grano et al. used an iterative approach in which nurses first bid on the shifts they would prefer; their algorithm then matched nurses to shifts based on their collective bids. Next, the nurses viewed the results and adjusted their bids to push the algorithm toward a more preferable result. This process repeated over a number of iterations. The iterative approach was needed because nurses’ preferences were not independent: each nurse’s preferences would change according to the preferences of others. Further, it was not feasible for De Grano et al. to codify a rule set or learn a policy for each nurse [Grano et al. 2009].
Boutilier et al. and others [Boutilier et al. 2004, Boutilier et al. 1999, Öztürké et al. 2005] alternatively focused on modeling preferences as a set of ceteris paribus (all other things being equal) preference statements. In these works, researchers solicited preferences from users, typically in the form of binary comparisons. For example, consider the problem of determining which food and drink to serve a guest [Boutilier et al. 2004]. In this scenario, one may already know the following:

The guest prefers to drink red over white wine when eating a steak.

The guest prefers steak over chicken.

The guest prefers to drink white wine when eating chicken.
Determining the optimal food/drink pairing can be performed in polynomial time; however, identifying the relative optimality of two pairings is NP-complete [Boutilier et al. 2004].
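To illustrate why the optimal pairing is easy to find, the sketch below encodes the three preference statements as conditional preference tables and computes the best outcome with a single forward sweep over the variables, parents first. The dictionary encoding, the variable names, and the function `optimal_outcome` are assumptions made for this example, and the sweep presumes that the dependencies between variables are acyclic.

```python
# Conditional preference tables (hypothetical encoding of the statements above).
# Each variable lists its parent variables and, for every assignment to those
# parents, its values ordered from most to least preferred.
CP_NET = {
    "food": {"parents": (), "table": {(): ["steak", "chicken"]}},
    "wine": {
        "parents": ("food",),
        "table": {
            ("steak",): ["red", "white"],
            ("chicken",): ["white", "red"],
        },
    },
}
ORDER = ["food", "wine"]  # parents before children


def optimal_outcome(net, order):
    """Sweep forward once, giving each variable its best value given its parents."""
    outcome = {}
    for var in order:
        context = tuple(outcome[p] for p in net[var]["parents"])
        outcome[var] = net[var]["table"][context][0]
    return outcome


best = optimal_outcome(CP_NET, ORDER)
```

Each variable is visited exactly once, so the cost of the sweep grows linearly with the number of variables. Comparing two arbitrary pairings offers no such shortcut, which is consistent with the hardness result cited above.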
Other researchers have focused on developing techniques for efficiently incorporating preferences into constraint satisfaction problems [Dubois and Fortemps 1999, Lin et al. 2005, Rossi et al. 2009, Rudová and Murray 2002, Schiex et al. 1995, Soomer and Franx 2008]. A subset of this work has specifically addressed the unique challenges of solving such formulations for scheduling problems [Benton et al. 2012, Khatib et al. 2001, Minton et al. 1992, Morris et al. 2004, Peintner and Pollack 2004, Yorke-Smith et al. 2003, Rossi et al. 2006].
These methods, which are designed for scheduling problems, still suffer from issues with computational tractability. As mentioned previously, Berry et al. used a preference learning algorithm to codify an objective function, which could then be solved via mathematical optimization [Berry et al. 2006]. Similarly, Wilcox et al. used mathematical programming to maximize the incorporation of users’ scheduling preferences into the system [Wilcox and Shah 2012]. However, mathematical programming is not a tractable solution technique for many real-world scheduling problems [Bertsimas and Weismantel 2005], including the anti-ship missile defense and hospital resource allocation problems presented in this work. Solving these problems typically requires specification of domain-specific heuristics in order to focus the search space. In this work, we present a system designed to automatically learn a heuristic policy from expert demonstration, and then apply the heuristic in order to intelligently explore the search space, reducing computation time.
2.2 Policy Learning
One alternative approach to goal learning is policy learning, which focuses on learning a mapping from states to actions [Chernova and Veloso 2007, Huang and Mutlu 2014, Sammut et al. 1992, Ramanujam and Balakrishnan 2011]. This technique has been applied to learn cognitive decision-making tasks from human experts [Ramanujam and Balakrishnan 2011, Sammut et al. 1992, Silver et al. 2016, Inamura et al. 1999, Rybski and Voyles 1999], including an air traffic control task [Ramanujam and Balakrishnan 2011] and a piloting task [Sammut et al. 1992].
Ramanujam and Balakrishnan investigated learning a discrete-choice model for how air traffic controllers decide which runways to use for arriving and departing aircraft according to weather, arrival and departure demand, and other environmental factors. The authors trained a discrete-choice model on real data from air traffic controllers and showed how the model was able to accurately predict the correct runway configuration for the airport [Ramanujam and Balakrishnan 2011].
Sammut et al. applied a decision tree model to train an airplane’s autopilot from expert demonstration. Their approach generates a separate decision tree for each of the following control inputs: elevators, ailerons, flaps, and thrust. In their investigation, Sammut et al. noted that each pilot demonstrator could execute a planned flight path differently. These demonstrations could be in disagreement, thus making the learning problem significantly more difficult. To cope with the variance between pilot executions, the system learned a separate model for each pilot [Sammut et al. 1992].
Other systems learn policies through interaction and feedback, as well as demonstration, from the user [Baranes and Oudeyer 2013, Bullard et al. 2016, Chernova and Veloso 2008, Grollman and Jenkins 2008, Inamura et al. 1999, Konidaris et al. 2011a, Zeng and Kuipers 2016]. For example, Chernova and Veloso developed a Gaussian mixture model able to interactively learn from demonstration [Chernova and Veloso 2007]. Their algorithm first learns a reasonable policy for a given task (e.g., driving a car along a highway), then solicits user feedback by constructing scenarios involving a high level of uncertainty. Support vector machines are then applied to learn when an autonomous agent should request additional demonstrations [Chernova and Veloso 2008].
Policy learning is an important complement to goal or reward learning. While goal- and reward-learning approaches are able to capture high-level goals in order to produce quality schedules [Abbeel and Ng 2004, Berry et al. 2006], these methods are limited by their reliance on computational methods for exploring the search space to identify a high-quality schedule. IRL relies on dynamic programming, which requires state space enumeration, while approaches such as PTIME [Berry et al. 2006] rely upon mathematical programming. Policy learning, on the other hand, is well-suited to guiding exploration of a state space. With a function mapping states to actions, a system can construct a schedule by taking sequential scheduling actions (e.g., assigning a worker to a task at the present time). In this sense, a learned policy can serve as a type of domain-specific heuristic to intelligently guide a search within a large state space. However, we are unaware of any prior attempts to apply policy learning to the scheduling domain.
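The idea of constructing a schedule through sequential scheduling actions chosen by a state-to-action mapping can be sketched as follows. This is an illustrative toy rather than a method from the literature reviewed here: the task fields `duration` and `deadline`, the dispatch loop in `build_schedule`, and the `least_slack` scoring rule (a hand-written stand-in for a learned policy) are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Task = Dict[str, float]


def build_schedule(
    tasks: Dict[str, Task],
    workers: List[str],
    score: Callable[[Task, float], float],
) -> List[Tuple[str, str, float]]:
    """Construct a schedule by repeatedly applying a policy.

    At each step, the worker who becomes free first is assigned the
    unscheduled task with the highest policy score at that time.
    Returns a list of (worker, task name, start time) commitments.
    """
    free_at = {w: 0.0 for w in workers}
    remaining = dict(tasks)
    schedule: List[Tuple[str, str, float]] = []
    while remaining:
        worker = min(free_at, key=lambda w: free_at[w])
        now = free_at[worker]
        name = max(remaining, key=lambda n: score(remaining[n], now))
        task = remaining.pop(name)
        schedule.append((worker, name, now))
        free_at[worker] = now + task["duration"]
    return schedule


def least_slack(task: Task, now: float) -> float:
    """Hypothetical stand-in for a learned policy: prefer the least deadline slack."""
    return -(task["deadline"] - now - task["duration"])


tasks = {
    "t1": {"duration": 3.0, "deadline": 10.0},
    "t2": {"duration": 2.0, "deadline": 4.0},
    "t3": {"duration": 1.0, "deadline": 6.0},
}
plan = build_schedule(tasks, ["w1", "w2"], least_slack)
```

Each scheduling decision requires only one pass over the unscheduled tasks, so no enumeration of the state space is needed; the quality of the resulting schedule depends entirely on the scoring function.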
2.3 Blending Machine Learning and Optimization
Typically, reward and policy learning are limited by the quality of the relevant demonstrations. However, even if the demonstrations are high-quality, one cannot assume that demonstrators or their demonstrations will be optimal, or even uniformly suboptimal [Aleotti and Caselli 2006, Sammut et al. 1992]. As such, some have sought to directly model the suboptimality of demonstrations. For example, Zheng et al. cleverly extended the work of Ramachandran and Amir [Ramachandran and Amir 2007] to model the trustworthiness of the demonstrator within a softmax formulation transition function for reinforcement learning [Zheng et al. 2015], as shown in Equation 4. In this equation, $Q^{*}(s, a, R)$ is the expected reward for taking action $a$ in state $s$, assuming reward function $R$ with the associated optimal policy $\pi^{*}$:
(4) 
Through such a mechanism, it is possible to learn a policy that outperforms human demonstrators by inferring the intended goal rather than the demonstrated goal. Zheng et al. showed that their approach was better able to capture the groundtruth objective function from imperfect training data than Bayesian IRL [Ramachandran AmirRamachandran Amir2007], which does not include a trustworthiness parameter for demonstrations. They validated their approach using a synthetic data set in an experiment with the goal of identifying the best route through an urban domain. However, one limiting assumption from their work is that a system is able to accurately measure the trustworthiness of the demonstrations – especially the relative trustworthiness amongst the demonstrations.
AlphaGo is another well-known ML-optimization framework, recently developed to play Go, a turn-based strategy game [Silver, Huang, Maddison, Guez, Sifre, Van Den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, et al.Silver et al.2016]. At its core, AlphaGo is based on policy learning; it uses a Monte Carlo Tree Search (MCTS) that is guided by a neural network policy trained on a data set of 30 million examples of demonstrations by human Go experts. A policy is employed to initially explore the search tree, and two additional components are used to evaluate the quality of each branching point in the tree. The first component is a second policy, which is identical to the first except that the neural network includes fewer nodes. This smaller size enables the second policy to rapidly play the Go game to completion in order to predict a winner [Silver, Huang, Maddison, Guez, Sifre, Van Den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, et al.Silver et al.2016].
The second component of AlphaGo is a value function trained via Q-learning. The developers rewired and duplicated the initial policy to enable improvement through self-play. These duplicated, rewired policies would repeatedly play Go against one another and use a policy gradient approach, developed by Sutton et al. [Sutton, McAllester, Singh, Mansour, et al.Sutton et al.1999], to iteratively improve their policies; the developers then captured a data set of 30 million moves taken by these policies. They then used this data set to train a Q-learning algorithm to predict the expected value of taking a given action in a given state. Interestingly, the authors noted that these self-play policies actually performed worse than the original trained on actual human demonstrations, but did not have a cohesive theory for why this was the case. Nonetheless, AlphaGo serves as a key example of how policy learning, coupled with optimization techniques (e.g., Q-learning and policy gradient methods), can yield performance on strategy games that is superior to that of humans.
The learning-optimization system most related to our work is that developed by Banerjee et al., who considered a scheduling problem for aircraft carrier flight deck operations. The system repeatedly solved a scheduling problem wherein the variables remained the same (i.e., variables describing which workers performed which tasks and when), but the constraints relating the variables (e.g., temporal constraints between tasks) changed [Banerjee, Ono, Roy, WilliamsBanerjee et al.2011]. Using a mixed-integer linear program (MILP) formulation, they proposed an ML-optimization pipeline in which the system performed a branch-and-bound search over the integer variables, and used the prediction of a regression algorithm trained on examples of previously solved problems to provide a provable lower bound for the optimality of the current integer variable assignments. This approach relied upon the generation of a large database of solutions to train the regression algorithm; however, this generation requires the costly exercise of repeatedly solving a large set of MILPs, which can be intractable for large-scale scheduling problems.
3 Model for Apprenticeship Learning
In this section, we present a framework for learning, via expert demonstration, a scheduling policy that correctly determines which task to schedule as a function of task state.
3.1 Problem Domain
We intend for our apprenticeship learning model to address a variety of scheduling problem types. Korsah et al. provided a comprehensive taxonomy for classes of scheduling problems, which vary according to formulation of constraints, variables and objective or utility function [Korsah, Stentz, DiasKorsah et al.2013]. Within this taxonomy, there are four classes addressing interrelated utilities and constraints: No Dependencies (ND [Liu ShellLiu Shell2013]), In-Schedule Dependencies (ID [Brunet, Choi, HowBrunet et al.2008, Gombolay ShahGombolay Shah2015, Nunes GiniNunes Gini2015]), Cross-Schedule Dependencies (XD [Gombolay, Wilcox, ShahGombolay et al.2013]) and Complex Dependencies (CD [Jones, Dias, StentzJones et al.2011]).
The Korsah et al. taxonomy also delineates between tasks requiring one agent (“single-agent tasks” [SA]) and tasks requiring multiple agents (“multi-agent tasks” [MA]). Similarly, agents that perform one task at a time are “single-task agents” (ST), while agents capable of performing multiple tasks simultaneously are “multi-task agents” (MT). Lastly, the taxonomy distinguishes between “instantaneous assignment” (IA), in which all task and schedule commitments are made immediately, and “time-extended assignment” (TA), in which current and future commitments are planned.
In this work, we demonstrate our approach for two of the most difficult classes of scheduling problems defined within this taxonomy: XD [STSATA] and CD [MTMATA]. The first problem we consider is the VRPTWTDR, which is an XD [STSATA]class problem. We next consider two realworld problems within the moredifficult CD [MTMATA] class. The second problem (first realworld domain) is a variant of the weapontotarget assignment problem (WTA) [Lee, Su, LeeLee et al.2003], known as antiship missile defense (ASMD). The third problem (second realworld problem) we address is one of hospital resource allocation on a labor and delivery unit, wherein one nurse, called the “resource nurse,” is responsible for ensuring that the correct patient is in the correct type of room at the correct time, with the correct types of nurses present to care for those patients. The characteristics of the three problem domains we explore in evaluating the apprenticeship scheduling algorithm are shown in Table 1.
Problem Domain     | VRPTWTDR    | ASMD        | Hospital Resource Management
Describing Section | Section 4.1 | Section 4.2 | Section 4.3
Data Type          | Synthetic   | Real-world  | Real-world
Dependency Type    | XD          | CD          | CD
Agent Type         | ST          | MT          | MT
Task Type          | SA          | MA          | MA
Allocation Type    | TA          | TA          | TA
3.2 Technical Approach
Many approaches to learning via demonstration, e.g., IRL, are based on Markov models [Busoniu, Babuska, De SchutterBusoniu et al.2008, Barto MahadevanBarto Mahadevan2003, Konidaris BartoKonidaris Barto2007, PutermanPuterman2014]. Markov models, however, do not capture the temporal dependencies between states and are computationally intractable for large problem sizes. In order to determine which tasks to schedule at which times, we draw inspiration from the domain of Web page ranking [Page, Brin, Motwani, WinogradPage et al.1999], or predicting the most relevant Web page in response to a search query. One important component of page ranking is capturing how pages relate to one another as a graph with nodes (representing Web pages) and directed arcs (representing links between those pages) [Page, Brin, Motwani, WinogradPage et al.1999]. This connectivity is a suitable analogy for the complex temporal dependencies (precedence, wait and deadline constraints) relating tasks within a scheduling problem.
Recent approaches to page ranking have focused on pairwise and listwise models, which each have advantages over pointwise models [Valizadegan, Jin, Zhang, MaoValizadegan et al.2009]. In listwise ranking, the goal is to generate a ranked list of Web pages directly [Cao, Qin, Liu, Tsai, LiCao et al.2007, Valizadegan, Jin, Zhang, MaoValizadegan et al.2009, Volkovs ZemelVolkovs Zemel2009], while a pairwise approach determines ranking based on pairwise comparisons between individual pages [Jin, Valizadegan, LiJin et al.2008, Pahikkala, Tsivtsivadze, Airola, Boberg, SalakoskiPahikkala et al.2007]. We chose the pairwise formulation to model the problem of predicting the best task to schedule at a given time.
The pairwise model has key advantages over the listwise approach: First, classification algorithms (e.g., support vector machines) can be directly applied [Cao, Qin, Liu, Tsai, LiCao et al.2007]. Second, a pairwise approach is nonparametric, in that the cardinality of the input vector is not dependent upon the number of tasks (or actions) that can be performed at any instance. Third, training examples of pairwise comparisons in the data can be readily solicited. From a given observation during which a task was scheduled, we only know which task was most important, not the relative importance between all tasks. Thus, we create training examples based on pairwise comparisons between scheduled and unscheduled tasks. A pairwise approach is more natural because we lack the necessary context to determine the relative rank between two unscheduled tasks.
We formulate the apprenticeship scheduling problem as one of learning a pairwise preference model, as follows. Consider a set of observations. Each observation is a six-tuple consisting of the following: a set of feature vectors, one describing the state of each task; the task to be scheduled by the expert demonstrator at the current time step; the subset of agents allocated to that task from the set of all agents; the subset of resources allocated to that task from the set of all resources; and a set of context-specific, “task-independent” features that affect expert decision-making. The state feature vector for each task incorporates features that affect the selection of the task for execution and may represent, for example, the deadline, the earliest time at which the task is available, the duration of the task, which resource the task requires, etc. The task-independent feature vector represents global state features, such as the proportion of agents that are currently idle.
An agent is defined as an entity that processes tasks and possesses the following set of attributes: time-varying physical location, travel speed, and task-specific proficiency (i.e., two agents may require different amounts of time to execute the same task). A resource is defined as an object required to process a task and possesses the following attributes: time-invariant physical location, a finite number of agents that can utilize the resource at any one time, and a task-specific proficiency (i.e., one resource may allow a task to be completed at a faster rate than another). In the event that no task is scheduled at a given time step, the task, agent, and resource elements of the corresponding observation are null.
The goal is to learn a scheduling policy that selects a task to schedule at a selected time, to be processed by a given agent, as a function of the task and problem state encoded by the task-specific and task-independent feature vectors. Our formulation assumes at least one agent is required to process one task, with the assignment and scheduling of agents to tasks determined by the scheduler. The assignment of a resource to a task is assumed to be either preallocated based on the problem specification or assigned by the scheduler.
We assume that the cross product of the task-independent feature vector and the task-dependent feature vectors encodes sufficient information to make high-quality scheduling decisions. Modeling choices may affect the dimensionalities of these feature vectors. For example, in one formulation the state of a task may include a list of upper- and lower-bound temporal constraints between that task and all other tasks; alternatively, depending on the problem, a lower-dimensional representation of the same relevant information may simply include the latest possible time (i.e., the deadline) by which each task must start to satisfy the problem's temporal constraints.
We note that our approach relies upon the ability of domain experts to articulate an appropriate set of features for the given problem. We believe this to be a reasonable limitation. Results from prior work have indicated that domain experts are adept at describing the highlevel, contextual, and taskspecific features used in their decision making; however, it is more difficult for experts to describe how they reason about these features [Cheng, Wei, TsengCheng et al.2006, Raghavan, Madani, JonesRaghavan et al.2006]. In future work, we aim to extend our approach to include feature learning rather than relying upon experts to enumerate the important features they reason about in order to construct schedules.
Our learning approach deconstructs the problem into two steps: 1) For each agent, determine the candidate next task to schedule; and 2) For each candidate task, determine whether to schedule said task.
3.2.1 Learning Task Priorities
In order to learn to correctly assign the next task to schedule, we transform each observation into a new set of observations by performing pairwise comparisons between the scheduled task and the set of unscheduled tasks (Equations 5 and 6). Equation 5 creates a positive example for each observation in which a task was scheduled. This example consists of an input feature vector and a positive label. The input feature vector is computed as the element-wise difference between the feature vectors describing the scheduled task and an unscheduled task, concatenated with the high-level contextual feature vector. Equation 6 creates a set of negative examples, each with a negative label. For the input vector, we take the difference of the feature values between the unscheduled task and the scheduled task, concatenated with the high-level contextual feature vector.
We note that it is necessary to separate the task-independent features as pointwise terms so as to preserve their information. Consider an example task-independent feature representing the proportion of agents currently idle. If this feature were encoded in each task-specific feature vector, its value would be the same for every task, so its pairwise difference would be zero for every pair of tasks. Thus, for this information to be preserved for the learning algorithm, one must concatenate a separate vector of contextual features to the pairwise differences.
(5)  
(6) 
(7) 
Figure 1 is a graphical depiction of the process for automatically generating positive and negative training examples for each observation. For illustrative purposes, the graphic depicts the process considering two task-specific features, corresponding to the x- and y-axes, respectively.
In the left graphic, the first node represents the state of the scheduling domain at a given time step, mapped to the two-dimensional feature space. At this time step, the apprentice scheduler observes the demonstrator scheduling one task (denoted by the solid arrow). The apprentice scheduler also observes that the demonstrator chose not to schedule the two other available tasks at that time (denoted by the dashed arrows). After the scheduling and execution of the chosen task, the scheduling domain is observed to be in the state represented by the next node. The figure shows the process repeating at the following time step.
The right graphic depicts the generation of training examples. For each time step, the apprentice scheduler constructs positive and negative training examples through vector subtraction of task-dependent feature vectors. The red dotted lines depict the vector difference of the scheduled task’s feature vector and each unscheduled task’s feature vector; the resulting vectors are applied to construct negative training examples. The blue dotted lines depict the negative vector difference of the scheduled task’s feature vector and each unscheduled task’s feature vector; the resulting vectors are applied to construct positive training examples. Recall that a contextual “task-independent” feature vector is appended to each pairwise term in the formation of each training example. This procedure then repeats for each observation (i.e., each time step for each demonstrated schedule) and task. The value of this approach is that the learner does not need to explicitly solicit pairwise comparisons from the demonstrator; instead, the pairwise comparisons are derived automatically through observation of the expert demonstrator.
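The transformation of one observation into pairwise training examples, as described for Equations 5 and 6, can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `pairwise_examples`, the use of plain lists as feature vectors, the placement of the contextual features before the difference terms, and the use of 0 as the negative label are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Example = Tuple[List[float], int]  # (input vector, label)


def pairwise_examples(
    task_features: Dict[str, Sequence[float]],
    scheduled: Optional[str],
    context: Sequence[float],
) -> List[Example]:
    """Turn one observation into labeled pairwise training examples.

    task_features maps each available task to its task-specific features,
    scheduled is the task the demonstrator chose (None if the agent idled),
    and context holds the task-independent features.
    """
    if scheduled is None:  # nothing was scheduled, so there is nothing to compare
        return []
    chosen = task_features[scheduled]
    examples: List[Example] = []
    for name, other in task_features.items():
        if name == scheduled:
            continue
        diff = [c - o for c, o in zip(chosen, other)]
        # Positive example: context, then (scheduled - unscheduled).
        examples.append((list(context) + diff, 1))
        # Negative example: context, then (unscheduled - scheduled).
        examples.append((list(context) + [-d for d in diff], 0))
    return examples
```

Because each unscheduled task contributes exactly one positive and one negative example, the resulting training set is balanced by construction, and the contextual features survive unchanged because they are never differenced.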
3.2.2 Learning to Schedule or Idle
Given these observations and their associated features, we can train a classifier to predict whether it is better to schedule one task as the next task rather than another. With this pairwise classifier, we can determine which single task is the highest-priority task according to Equation 7 by determining which task has the highest cumulative priority in comparison to the other tasks. In this work, we train a single classifier to model the behavior of the set of all agents rather than train one for each agent. The classifier is a function of all features associated with the agents; as such, agents need not be interchangeable, and different sets of features may be associated with each agent.
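The selection of the highest-priority task by cumulative pairwise priority can be illustrated as follows. This is a sketch under stated assumptions: `prefers` is a hypothetical stand-in for the trained pairwise classifier and is assumed to return a score (for example, a probability) that the first task of a pair should be scheduled before the second; the function name and the feature layout (context first, then the difference) are likewise illustrative.

```python
from typing import Callable, Dict, List, Optional, Sequence


def highest_priority_task(
    task_features: Dict[str, Sequence[float]],
    context: Sequence[float],
    prefers: Callable[[List[float]], float],
) -> Optional[str]:
    """Return the task with the largest summed pairwise preference score."""
    best, best_score = None, float("-inf")
    for name, feats in task_features.items():
        score = 0.0
        for other, other_feats in task_features.items():
            if other == name:
                continue
            # Same input layout as the training examples: context + difference.
            x = list(context) + [a - b for a, b in zip(feats, other_feats)]
            score += prefers(x)
        if score > best_score:
            best, best_score = name, score
    return best
```

For instance, with a single task-specific feature holding each task's deadline and a toy `prefers` that favors the earlier deadline, the task with the smallest deadline wins every comparison and is returned.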
Next, we must learn to predict whether the highest-priority task should be scheduled or the agent should remain idle. To do so, we train a second classifier that predicts whether or not that task should be scheduled. Each observation in the observation set is either an example in which a task was scheduled or one in which no task was scheduled. To train this classifier, we construct a new set of examples according to Equation 8, which assigns positive labels to examples in which a task was scheduled and negative labels to examples in which no task was scheduled.
(8) 
Finally, we construct a scheduling algorithm to act as an apprentice scheduler (Algorithm 1). This algorithm takes as input the set of tasks; the set of agents; the temporal constraints (i.e., upper- and lower-bound temporal constraints) relating tasks in the problem; and the set of task pairs that require the same resources and therefore cannot be executed at the same time. Lines 1-2 iterate over each agent at each time step. (In the event that resource-to-task assignments are not predefined, the algorithm would also iterate over each resource that could be assigned.) In Line 3, the highest-priority task is determined for a particular agent. In Lines 4-5, that task is scheduled if the second classifier predicts that it should be scheduled at the current time.
Note that iteration over agents (Line 2) can be performed according to a specific ordering, or the system can alternatively learn a more general priority function to select and schedule the best agent-task-resource tuple. In the latter case, the features are mapped to agent-task-resource tuples rather than to tasks, which represent the atomic (i.e., lowest-level) job. For the synthetic evaluation, we use the original, task-based formulation. For the ASMD application, we use a tuple comprising the objective of mitigating a given missile during a given time step, the decoy to be deployed, and the physical location for that deployment. For the hospital domain evaluation, we use a tuple comprising the stage of labor for a given patient, the assigned nurse, and the room to which the patient is assigned. For convenience in notation, we refer to this tuple as a “scheduling action.” Finally, note that multiple agent-resource pairs can be assigned to a single task. The apprentice scheduler would first pick the best agent (or agent-resource pair) to assign to a task according to the metric. During the same time step (or a subsequent time step), another agent (or agent-resource pair) can be added. The algorithm will continue to add assignments to the task until the null assignment (i.e., no further changes to the current set of assignments) is the best option according to the learned priority function.
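The overall control flow of the apprentice scheduler (iterate over time steps and agents, pick the highest-priority task, then decide whether to schedule it or idle) can be sketched as below. This is a simplified illustration, not a transcription of Algorithm 1: `pick_task` and `should_schedule` are hypothetical stand-ins for the two learned classifiers, agents are assumed to be single-task, time is discretized into integer steps, and the temporal and shared-resource constraints are assumed to be enforced inside `pick_task`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Tuple[int, str, str]  # (time step, agent, task)


def apprentice_schedule(
    durations: Dict[str, int],
    agents: List[str],
    horizon: int,
    pick_task: Callable[[List[str], str, int], Optional[str]],
    should_schedule: Callable[[str, str, int], bool],
) -> List[Assignment]:
    """Build a schedule by repeatedly asking the learned policy what to do."""
    remaining = list(durations)
    busy_until = {agent: 0 for agent in agents}
    schedule: List[Assignment] = []
    for t in range(horizon):            # each time step
        for agent in agents:            # each agent, in a fixed order
            if not remaining:
                return schedule
            if busy_until[agent] > t:   # agent is still executing a task
                continue
            best = pick_task(remaining, agent, t)
            # Schedule the candidate only if the second predictor agrees;
            # otherwise the agent idles for this time step.
            if best is not None and should_schedule(best, agent, t):
                schedule.append((t, agent, best))
                remaining.remove(best)
                busy_until[agent] = t + durations[best]
    return schedule
```

Passing the two predictors as callables keeps the loop independent of the classifier family, which matches the observation that any standard classification technique can be used.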
Our model is a hybrid point- and pairwise formulation, which has several key benefits for learning to schedule from expert demonstration. First, we can directly apply standard classification techniques, such as a decision tree, support vector machine, logistic regression, or neural networks. Second, because this technique only considers two scheduling actions at a time, the model is nonparametric in the number of possible actions. Thus, the system can train on schedules with one number of agents and tasks, yet be applied to construct a schedule for a problem with different numbers of agents and tasks. Furthermore, it can even train on demonstrations from a heterogeneous data set of scheduling observations with differing numbers of agents and tasks. Third, the pairwise portion of the formulation provides structure for the learning problem. A formulation that simply concatenated the features of two or more scheduling actions would need to solve the more complex problem of learning the relationships between features and then how to use those relationships to predict the highest-priority scheduling action. Such a concatenation approach would suffer from the curse of dimensionality and require a very large training data set [Indyk MotwaniIndyk Motwani1998]. Note, however, that this method requires the designer to appropriately partition the features into pairwise and pointwise components such that the pairwise portion does not lose information by considering the differences between actions’ features. Fourth, the transformation of the observations into a pairwise model results in some features that are advantageous for learning from small data sets: the number of positive and negative training examples is balanced, given that the algorithm simultaneously creates one negative label for every positive label, and the observations are bootstrapped to create multiple examples for each time step, rather than the fewer examples a pointwise model would provide.
4 Data Sets
Here, we validate that schedules produced by the learned policies are of comparable quality to those generated by human or synthetic experts. To do so, we considered a synthetic data set from the XD [STSATA] class of problems and two realworld data sets from the CD [MTMATA] class of problems, as defined by Korsah et al. [Korsah, Stentz, DiasKorsah et al.2013]. We present each problem domain and describe the manner in which the data set of expert demonstrations for the domain was acquired.
4.1 Synthetic Data Set
For our first investigation, we generated a synthetic data set of scheduling problems in which agents were assigned a set of tasks. The tasks were related through precedence or wait constraints, as well as deadline constraints, which could be absolute (relative to the start of the schedule) or relative to another task’s initiation or completion time. Agents were required to access a set of shared resources to execute each task. Agents and tasks had defined starting locations, and task locations were static. Agents were only able to perform tasks when present at the corresponding task location, and each agent traveled at a constant speed between task locations. Task completion times were potentially nonuniform and agentspecific, as would be the case for heterogeneous agents. An agent that was incapable of performing a given task was assumed to have an infinite completion time for that task. The objective was to minimize the makespan or other timebased performance measures.
This problem definition spans a range of scheduling problems, including the traveling salesman, job-shop scheduling, multi-vehicle routing and multi-robot task allocation problems, among others. We describe this range as a vehicle routing problem with time windows, temporal dependencies, and resource constraints (VRPTWTDR), which falls within the XD [STSATA] class in the taxonomy by Korsah et al. [Korsah, Stentz, DiasKorsah et al.2013]: agents perform tasks sequentially (ST), each task requires one agent (SA), and commitments are made over time (TA).
To generate our synthetic data set, we developed a mock scheduling expert that applies one of a set of contextdependent rules based on the composition of the given scheduling problem. This behavior was based upon rules presented in prior work addressing these types of problems [Gombolay, Wilcox, ShahGombolay et al.2013, Gombolay ShahGombolay Shah2015, SolomonSolomon1987, Tan, Lee, Zhu, OuTan et al.2001]. Our objective was to show that our apprenticeship scheduling algorithm learns both contextdependent rules and how to identify the associated context for their correct application.
The mock scheduling expert functions as follows: First, the algorithm collects all alive and enabled tasks as defined by [Muscettola, Morris, TsamardinosMuscettola et al.1998]. Consider a pair of tasks, each with its own start and finish times, such that there is a wait constraint requiring the second task to start at least a specified number of time units after the first. A task is alive and enabled at the current time if every such wait constraint on it is satisfied.
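A check of this kind might look like the sketch below. The names `alive_and_enabled`, `wait`, and `finish` are hypothetical, and the sketch assumes that each wait is measured from the finish time of the preceding task and that a task whose predecessor has not yet finished is unavailable; the precise condition is the one given by the cited definition.

```python
from typing import Dict, Tuple


def alive_and_enabled(
    task: str,
    now: float,
    wait: Dict[Tuple[str, str], float],
    finish: Dict[str, float],
) -> bool:
    """Check whether every wait constraint on `task` is satisfied at `now`.

    wait maps (predecessor, successor) to the minimum gap between them;
    finish maps each completed task to its finish time.
    """
    for (pred, succ), gap in wait.items():
        if succ != task:
            continue
        # Assumption: the gap is counted from the predecessor's finish time.
        if pred not in finish or now < finish[pred] + gap:
            return False
    return True
```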
After task collection, the heuristic iterates over each agent to identify the highest-priority task to schedule for that agent. The algorithm determines which scheduling rule is most appropriate to apply for each agent. If agent speed is sufficiently slow (below a threshold speed in m/s), travel time will become the major bottleneck. If agents move quickly but utilize one or more resources heavily (above a threshold defined by some constant c), use of these resources can become the bottleneck. Otherwise, task durations and associated wait constraints are generally most important.
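The context-dependent choice among the three rules can be sketched as a simple dispatch. Both thresholds are left as parameters, and representing resource use as a per-resource utilization fraction is an assumption made only for this illustration, as are the function name and the returned labels.

```python
from typing import Sequence


def choose_rule(
    agent_speed: float,
    resource_utilization: Sequence[float],
    speed_threshold: float,
    utilization_threshold: float,
) -> str:
    """Pick which priority rule the mock expert applies for an agent."""
    if agent_speed < speed_threshold:
        return "travel"      # slow agents: travel time dominates
    if any(u > utilization_threshold for u in resource_utilization):
        return "resource"    # fast agents, heavily used resource
    return "temporal"        # otherwise deadlines and waits dominate
```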
If the algorithm identifies travel distance as the primary bottleneck, it chooses the next task by applying a priority rule well suited for vehicle routing that minimizes a weighted, linear combination of features [Gambardella, Éric Taillard, AgazziGambardella et al.1999, SolomonSolomon1987] comprising the distance between the agent and the candidate task and their angle relative to the origin. This rule is depicted in Equation 9, which is defined in terms of the location of the task, the location of the agent, the relative angle between the vector from the origin to the agent location and the vector from the origin to the task location, and two weighting constants:
(9) 
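One way to realize a rule of this kind is shown below. It is an illustrative sketch rather than a reproduction of Equation 9: the function names, the default weights, and the exact way distance and angle are combined are assumptions, chosen only to show a weighted linear combination being minimized over candidate tasks.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def routing_cost(agent: Point, task: Point,
                 w_dist: float = 1.0, w_angle: float = 1.0) -> float:
    """Weighted sum of agent-task distance and their angle about the origin."""
    distance = math.dist(agent, task)
    gap = abs(math.atan2(task[1], task[0]) - math.atan2(agent[1], agent[0]))
    angle = min(gap, 2 * math.pi - gap)  # wrap the angle into [0, pi]
    return w_dist * distance + w_angle * angle


def next_task_by_routing(agent: Point, tasks: Dict[str, Point]) -> str:
    """Choose the candidate task with the lowest routing cost."""
    return min(tasks, key=lambda name: routing_cost(agent, tasks[name]))
```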
If the algorithm identifies resource contention as the most important bottleneck, it employs a rule to mitigate resource contention in multi-robot, multi-resource problems based on prior work in scheduling for multi-robot teams [Gombolay, Wilcox, ShahGombolay et al.2013]. Specifically, the algorithm uses Equation 10 to select the high-priority task to schedule next, based on the deadline of each task and a weighting constant:
(10) 
If the algorithm decides that temporal requirements are the major bottleneck, it employs an Earliest Deadline First rule (Equation 11), which performs well across many scheduling domains [Chen AskinChen Askin2009, Gombolay, Wilcox, ShahGombolay et al.2013, Gombolay ShahGombolay Shah2015]:
(11) 
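An Earliest Deadline First selection is straightforward to express in code. The sketch below assumes the candidate tasks have already been filtered to those that are available, and that deadlines are stored in a dictionary; both the function name and the data layout are illustrative.

```python
from typing import Dict, Iterable, Optional


def earliest_deadline_first(
    candidates: Iterable[str],
    deadline: Dict[str, float],
) -> Optional[str]:
    """Return the candidate with the smallest deadline, or None if empty."""
    return min(candidates, key=lambda task: deadline[task], default=None)
```

Ties are broken in favor of the first candidate encountered, which is the behavior of Python's built-in `min`.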
After selecting the most important task, the algorithm determines whether the resource required for that task is idle and whether the agent is able to travel to the task location by the current time. If these constraints are satisfied, the heuristic schedules the task at that time. (An agent is able to reach a task if, for every task the agent has already completed, enough time has elapsed since that task finished for the agent to travel from its location to the new task's location at the agent’s speed.)
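The travel-feasibility condition in the parenthetical can be sketched as follows, assuming straight-line (Euclidean) travel at constant speed. The function name and the representation of completed tasks as (finish time, location) pairs are assumptions for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def can_reach(
    task_location: Point,
    now: float,
    speed: float,
    completed: List[Tuple[float, Point]],
) -> bool:
    """True if the agent could have traveled to the task from every
    previously completed task, given each task's finish time and location."""
    return all(
        now >= finish + math.dist(task_location, location) / speed
        for finish, location in completed
    )
```

An agent that has not yet completed any task trivially satisfies the condition, since `all` over an empty sequence is true.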
We constructed the synthetic data set for two homogeneous agents and 20 partially ordered tasks located within a 20 x 20 grid.
4.2 RealWorld Data Set: AntiShip Missile Defense
In ASMD, the goal is to protect one’s naval vessel against attacks by antiship missiles using “softkill weapons” (i.e., decoys) that mimic the qualities of a target in order to direct the missile away from its intended destination.
Developing tactics for softkill weapon coordination is highly difficult due to the relationship between missile behavior and softkill weapon characteristics. The control laws governing antiship missiles vary, and the captain must select the correct decoy types in order to counteract the associated antiship missiles. For example, a ship’s captain may deploy a decoy that emits a large amount of heat in order to cause an enemy heatseeking missile to fly toward the decoy rather than the ship. Also, an enemy missile may consider the spatial layout of all targets in order to select the nearest or furthest targets; in doing so, the missile may consider the magnitude of the radar reflectivity, radar emissions, and heat emissions, either separately or in various combinations.
Further, decoys have different financial costs and timing characteristics: Some decoys, such as unmanned aerial vehicles (UAVs), are able to function throughout the entirety of an engagement, while others, such as infrared (IR) flares, disappear after a certain time. As a result, a captain may be required to use multiple decoys in tandem in order to divert a single anti-ship missile, but may also be able to use a single decoy to defeat multiple missiles. There is a complex interplay between the types and locations of decoys relative to the control laws governing anti-ship missiles. For example, deployment of a particular decoy, while effective against one airborne enemy missile, may actually cause a second enemy missile that was previously homing in on a second decoy to now impact the ship.
The ASMD problem is characterized as the most complex class of scheduling problem according to the Korsah et al. taxonomy [Korsah, Stentz, DiasKorsah et al.2013]: CD [MTMATA]. The problem considers multi-task agents (MT) in the form of decoys, each of which can work to divert multiple missiles at the same time. The problem also incorporates multi-agent tasks (MA); a feasible solution may require the simultaneous use of multiple agents in order to complete an individual task. Further, time-extended agent allocation (TA) must be considered, given the potential future consequences of scheduling actions taken at the current moment. Finally, the ASMD problem falls within the CD class, because each task can be decomposed in a variety of ways (each with its own cost) in order to accomplish the same goal, with each decomposition affecting the value and feasibility of the decompositions of other tasks. The full specification of the mixed-integer linear program formulation for the ASMD problem is provided in Appendix A.
4.2.1 Data Collection
A real-world data set was collected, consisting of human demonstrators of various skill levels solving the anti-ship missile defense (ASMD) weapon-to-target assignment problem. Data was collected from domain experts playing a serious game, called Strike Group Defender (SGD), for ASMD training. (SGD was developed by Pipeworks Studio in Eugene, Oregon, USA.) Game scenarios involved five types of decoys and 10 types of threats. Threats were randomly generated for each played scenario, promoting the development of strategies that were robust to a varied distribution of scenarios. Each decoy had a specified effectiveness against each threat type.
Players attempted to deploy a set of decoys by using the correct types at the correct locations and times in order to distract incoming missiles. Threats were launched over time; a deployment that was effective at one time could become counterproductive later as new enemy missiles were launched.
Games were scored as follows: players received points each time a threat was neutralized and for each second a threat spent homing in on a decoy. Players lost points for each threat impact and for each second a threat spent homing in on the player’s ship. Each decoy deployment also cost points, with the amount depending upon decoy type.
The collected data set consisted of games played by human demonstrators across multiple threat configurations, or “levels.” From this set, we also separately analyzed 16 threat configurations such that each configuration included at least one human demonstration in which the ship was successfully protected from all enemy missiles. The games for these 16 configurations were played by a set of unique human demonstrators. The player cohort consisted of technical fellows and associates, as well as contractors at a federally funded research and development center (FFRDC), with expertise varying from “generally knowledgeable about the ASMD problem” to “domain experts” with professional experience or training in ASMD.
4.3 Real-World Data Set: Labor and Delivery
To further evaluate our approach, we applied our method to a second data set collected from a labor and delivery floor at a Boston hospital. In this domain, a “resource nurse” must solve a problem of task allocation and schedule optimization with stochasticity in the number and types of patients and the duration of tasks. Specifically, the resource nurse is responsible for ensuring that the correct patient is in the correct type of room at the correct time, with the correct types of nurses present to care for those patients. The functions of a resource nurse are to assign nurses to take care of labor patients; assign patients to labor beds, recovery room beds, operating rooms, antepartum ward beds or postpartum ward beds; assign scrub technicians to assist with surgeries in operating rooms; call in additional nurses if necessary; accelerate, delay or cancel scheduled inductions or cesarean sections; expedite active management of a patient in labor; and reassign roles among nurses.
When applying our apprenticeship scheduling method to the labor and delivery problem domain, a task $\tau_i$ represents the set of steps (subtasks) required to care for patient $i$, and each subtask $\tau_i^j$ is a given stage of labor for that patient. Stages of labor are related by stochastic lower-bound constraints, requiring the stages to progress sequentially. There are also stochastic time constraints relating the stages of labor, to account for the inability of resource nurses to perfectly control when a patient will move from one stage to the next. Arrivals of tasks (i.e., patients) are drawn from stochastic distributions. The model considers three types of patients: scheduled cesarean patients, scheduled induction patients and unscheduled patients. The applicable sets of temporal constraints depend upon patient type.
Labor nurses are modeled as agents with a finite capacity to process tasks in parallel, where each subtask requires a variable amount of this capacity. For example, a labor nurse may generally care for a maximum of two patients simultaneously. If the nurse is caring for a patient who is “full and pushing” (i.e., the cervix is fully dilated and the patient is actively trying to push out the baby) or in the operating room, he or she may only care for that patient.
Rooms on the labor floor (e.g., a labor room, an operating room, etc.) are modeled as resources, which process subtasks in series. Agent and resource assignments to subtasks are preemptable, meaning that the agent and resource assigned to care for any patient during any step in the care process may be changed over the course of executing that subtask.
In this formulation, ${}^{t}A^{a}_{\tau_i^j} \in \{0,1\}$ is a binary decision variable for assigning agent $a$ to subtask $\tau_i^j$ for time epoch $t$. The formulation also includes an integer decision variable for assigning a certain portion of the effort of agent $a$ to subtask $\tau_i^j$ for time epoch $t$; a binary decision variable for whether subtask $\tau_i^j$ is assigned a given resource for time epoch $t$; and a binary decision variable for whether task $\tau_i$ and its corresponding subtasks are to be completed. A further parameter specifies the effort required from any agent to work on $\tau_i^j$, and two variables give the start and finish times of $\tau_i^j$.
(Equations 12-21)
Equation 13 enforces that each subtask during each time epoch is assigned a single agent. Equation 14 ensures that each subtask receives a sufficient portion of the effort of its assigned agent during each epoch. Equation 15 ensures that no agent is oversubscribed. Equation 16 ensures that each subtask of each task that is to be completed is assigned one resource. Equation 17 ensures that each resource is assigned to only one subtask during each epoch. Equation 18 requires the duration of each subtask to be no greater than its upper bound and no less than its lower bound. Equation 19 requires that a subtask occur at least a specified number of units of time after a preceding subtask. Equation 20 requires that the duration between the start of one subtask and the finish of another be less than a specified bound. Equation 21 requires that a subtask finish before a specified number of units of time have expired since the start of the schedule.
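As an illustration of the assignment constraints, the sketch below checks a single time epoch of a candidate schedule against the conditions described for Equations 13, 15 and 17. It is a minimal sketch under stated assumptions, not the MILP itself: the tuple encoding of an epoch and the function name are hypothetical, and the temporal constraints (Equations 18-21) are not covered.

```python
from collections import defaultdict


def check_epoch(assignments, capacity):
    """Return a list of constraint violations for a single time epoch.

    assignments: iterable of (subtask, agent, effort, resource) tuples
                 (a hypothetical encoding of one epoch of a schedule).
    capacity:    dict mapping each agent to its integer processing capacity.
    """
    violations = []
    agents_of = defaultdict(set)  # subtask  -> agents assigned to it
    load = defaultdict(int)       # agent    -> total effort committed
    users_of = defaultdict(set)   # resource -> subtasks occupying it

    for subtask, agent, effort, resource in assignments:
        agents_of[subtask].add(agent)
        load[agent] += effort
        users_of[resource].add(subtask)

    for subtask, agents in agents_of.items():
        if len(agents) != 1:        # a single agent per subtask (cf. Equation 13)
            violations.append(f"{subtask}: {len(agents)} agents assigned")
    for agent, used in load.items():
        if used > capacity[agent]:  # agent not oversubscribed (cf. Equation 15)
            violations.append(f"{agent}: effort {used} exceeds capacity {capacity[agent]}")
    for resource, subtasks in users_of.items():
        if len(subtasks) > 1:       # one subtask per resource (cf. Equation 17)
            violations.append(f"{resource}: shared by {len(subtasks)} subtasks")
    return violations
```

For example, with a nurse capacity of two, a patient who is “full and pushing” can be encoded with an effort of two so that the same nurse cannot be assigned a second patient in that epoch. Subtasks with no agent at all do not appear in this encoding and would need a separate check.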
The functions of a resource nurse are to assign nurses to take care of labor patients and to assign patients to labor beds, recovery room beds, operating rooms, antepartum ward beds or postpartum ward beds. The resource nurse has substantial flexibility when assigning beds, and his or her decisions will depend upon the type of patient and the current status of the unit in question. He or she must also assign scrub technicians to assist with surgeries in operating rooms, and call in additional nurses if required. In the above formulation, staff assignments and room/ward assignments correspond to the agent-assignment and resource-assignment decision variables, respectively.
The resource nurse may accelerate, delay or cancel scheduled inductions or cesarean sections in the event that the floor is too busy. Resource nurses may also request expedited active management of a patient in labor. The timing of transitions between the various steps in the care process is described by the subtask start and finish times. The commitments to a patient (or that patient’s procedures) are represented by the binary variable indicating whether a task and its subtasks are to be completed.
The resource nurse may also reassign roles among nurses: For example, a resource nurse may pull a nurse from triage, or even care for patients herself if the floor is too busy. Or, if a patient’s condition is particularly acute (e.g., the patient has severe preeclampsia), the resource nurse may assign one-to-one nursing. The level of attentional resources a patient requires and the level a nurse has available correspond to the effort required by a subtask and the processing capacity of an agent, respectively. The resource nurse makes his or her decisions while considering current patient status, which is manually transcribed on a whiteboard, as shown in Figure 3.
The stochasticity of the problem arises from the uncertainty in the upper and lower bounds of the durations of each of the steps in caring for a patient; the number and types of patients; and the temporal constraints relating the start and finish of each step. These variables are a function of the resource and staff allocation variables, as well as patient task state, which includes information on patient type (i.e., presentation with scheduled induction, scheduled cesarean section, or acute unplanned anomaly), gestational age, gravida, parity, membrane status, anesthesia status, cervix status, time of last exam and the presence of any comorbidities.
The computational complexity of completely searching for a solution that satisfies the constraints in Equations 13-21 depends on the number of agents, each possessing an integer processing capacity; the number of tasks, each with its subtasks; the number of resources; and the integer-valued planning horizon. In practice, a labor floor involves multiple nurses (agents) who can each care for up to two patients at a time, rooms (resources) of varying types, multiple patients (tasks) at any one time, and a planning horizon spanning hours, yielding a worst-case complexity that is computationally intractable for exact methods without the assistance of informative search heuristics.
4.3.1 Data Collection
To collect data from resource nurses about their decisions, a high-fidelity simulation of a labor and delivery floor was developed, as depicted in Figure 4. We developed this simulation in collaboration with Beth Israel Deaconess Medical Center in Boston. The effort was part of a quality-improvement project at the hospital to develop training tools and involved a rigorous, year-long design and iteration process that included workshops with nurses, physicians, and medical students to ensure the tool accurately captured the role of a resource nurse. Parameters within the simulation (e.g., patient arrivals, timelines for labor progression) were drawn from medical textbooks and papers and modified through alpha and beta testing to ensure that the simulation closely mirrored the patient population and nurse experience at our partner hospital.
We invited expert resource nurses to play this simulation in order to collect a data set for training our apprenticeship scheduling algorithm. This data set was generated by seven resource nurses working with the simulation for a total of hours, simulating hours of elapsed time on a real labor floor and yielding a set of more than individual decisions.
5 Empirical Evaluation of Apprenticeship Scheduling
In this section, we evaluate our prototype for apprenticeship scheduling using synthetic and realworld data sets.
5.1 Synthetic Data Set
We trained our model using a decision tree, KNN classifier, logistic regression (logit) model, a support vector machine with a radial basis function kernel (SVM-RBF), and a neural network to learn the model’s two classifier functions. We randomly split the data into a training set and a held-out test set.
We defined the input features as follows: The high-level feature vector of the task set comprised the agents’ speed and the degree of resource contention. The task-specific feature vector comprised the task’s deadline, a binary indicator for whether or not the task’s precedence constraints had been satisfied, the number of other tasks sharing the given task’s resource, a binary indicator for whether or not the given task’s resource was available, the travel time remaining to reach the task location, the distance the agent would travel to reach the task, and the angular difference between the vector describing the location of the agent and the vector describing the position of the task relative to the agent.
We compared the performance of our pairwise approach with pointwise and naïve approaches. In the pointwise approach, each training example for selecting the highest-priority task described a single task and carried a binary label: the label was positive if that task was scheduled in the given observation, and negative otherwise. In the naïve approach, each example comprised an input vector that concatenated the high-level features of the task set with the task-specific features of every task; labels were given by the index of the task scheduled in the given observation.
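The difference between the pointwise and pairwise formulations is easiest to see in how training examples are built from one observation. The following Python sketch is an assumption-laden illustration rather than the exact encoding used in our experiments: the function names are hypothetical, and representing each pairwise comparison as a difference of task-specific feature vectors is one plausible choice made for this sketch.

```python
def pointwise_examples(task_set_features, task_features, scheduled):
    """One example per task: label 1 if that task was scheduled, else 0."""
    return [(task_set_features + feats, int(name == scheduled))
            for name, feats in task_features.items()]


def pairwise_examples(task_set_features, task_features, scheduled):
    """Two examples per (scheduled, unscheduled) pair of tasks.

    Assumption for this sketch: each comparison is encoded as the difference
    between the two tasks' feature vectors, in both orders.
    """
    chosen = task_features[scheduled]
    examples = []
    for name, feats in task_features.items():
        if name == scheduled:
            continue
        diff = [c - f for c, f in zip(chosen, feats)]
        examples.append((task_set_features + diff, 1))                # scheduled preferred
        examples.append((task_set_features + [-d for d in diff], 0))  # reverse ordering
    return examples
```

With three candidate tasks, the pointwise builder yields three examples (one positive), whereas the pairwise builder yields four balanced examples that expose how the scheduled task compares with each alternative.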
Figures 4(a)-4(b) depict the sensitivity (true positive rate) and specificity (true negative rate), respectively, of the model. We found that a pairwise model outperformed the pointwise and naïve approaches. Within the pairwise model, a decision tree yielded the best performance: The trained decision tree was able to identify the correct task and when to schedule that task of the time, and was able to accurately predict when no task should be scheduled of the time.
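For reference, the two metrics can be computed from binary labels and predictions as follows. This is a generic sketch of the standard definitions; the function name is our own.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # correctly predicted positives
    fn = sum(1 for t, p in pairs if t and not p)      # missed positives
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly predicted negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```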
To more fully understand the performance of a decision tree trained with a pairwise model as a function of the number and quality of training examples, we trained decision trees with the pairwise model using 15, 150, and 1,500 demonstrations. The sensitivity and specificity depicted in Figures 5(a) and 5(b) for 15 and 150 demonstrations represent the mean sensitivity and specificity of 10 models trained via random subsampling without replacement.
We also varied the quality of the training examples, assuming the demonstrator was operating under an epsilon-greedy approach: the demonstrator selected the correct task to schedule with a fixed probability, and selected another task from a uniform distribution otherwise. Our goal in this evaluation was to empirically investigate the impact of noisy demonstrations (i.e., those in which the demonstrator does not always select the “best” tasks) on the quality of the learned policy. There are a number of possible models for introducing such noise, including an epsilon-greedy approach or a softmax model. An epsilon-greedy approach is expected to produce lower-quality demonstrations compared with a noisy human demonstrator, since a human would be more likely to select the second- or third-best task when making an error than to select a task at random; this makes the LfD problem more difficult. While no model will perfectly imitate an imperfect human demonstrator, we selected an epsilon-greedy approach as a reasonably conservative method of introducing more noise than might be generated by an imperfect human demonstrator.
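The noise model can be sketched in a few lines of Python. The function and parameter names are hypothetical; `error_rate` plays the role of epsilon, and the best task is assumed to be known to the synthetic expert.

```python
import random


def noisy_choice(tasks, best, error_rate, rng):
    """Epsilon-greedy demonstrator.

    Returns `best` with probability 1 - error_rate; otherwise returns one of
    the other tasks, drawn uniformly at random.
    """
    others = [t for t in tasks if t != best]
    if others and rng.random() < error_rate:
        return rng.choice(others)
    return best
```

A softmax demonstrator would instead weight errors toward the second- or third-best task, which is why the uniform error model above is the more conservative of the two.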
Training a model from pairwise comparisons between the scheduled task and each unscheduled task produced a policy comparable to that of the synthetic expert. The decision tree model performed well due to the modal nature of the multifaceted scheduling heuristic. Note that this data set consisted of scheduling strategies with mixed discrete-continuous functional components; performance could potentially be improved upon in future work by combining decision trees with logistic regression. This hybrid learning approach has been successful in prior ML classification tasks [Landwehr et al. 2005] and can be readily applied to this apprenticeship scheduling framework. There is also an opportunity to improve performance through hyperparameter tuning (e.g., to select the minimum number of examples in each leaf of the decision tree). We leave comprehensive investigation of the relative benefits of a range of learning techniques for future work.
Note that the results presented in Figures 4(a)-5(b) were achieved without any hyperparameter tuning. For example, with the decision tree, we did not perform an inner cross-validation loop to estimate the minimum number of examples in each leaf to achieve the best performance. The purpose of this analysis was to show that, with our pairwise approach, the system can accurately learn expert heuristics from example. In the following section, we investigate how apprenticeship scheduling using a decision tree classifier can be improved upon via an inner cross-validation loop to tune the model’s hyperparameters.
5.1.1 Performance of Decision Tree with Hyperparameter Tuning
We performed our initial analysis, detailed above, to identify which techniques have inherent advantages that can be realized without extensive hyperparameter tuning. Our results indicate that the pairwise formulation for apprenticeship scheduling, in conjunction with a decision tree classifier, has advantages over alternative formulations for learning a highquality scheduling policy. Given evidence of this advantage, we further evaluated the potential of the pairwise formulation with hyperparameter tuning.
To improve the performance of the model, we manipulated the “leafiness” of the decision tree to find the best setting to increase the accuracy of the apprenticeship scheduler. Specifically, we varied the minimum number of training examples required in each leaf of the tree. As the minimum number required for each leaf decreases, the chance of overfitting to the data increases. Conversely, as the minimum number increases, the chance of not learning a helpful policy (underfitting) increases. To identify the best setting for generalization, we tested a set of candidate values for the minimum number of examples required for each leaf of the decision tree. If a candidate value exceeded the total number of examples, the setting was capped at the total number of examples available for training.
We performed five-fold cross-validation for each candidate value as follows: We trained an apprentice scheduler on four-fifths of the training data and tested on the remaining one-fifth, and recorded the average testing accuracy across the five folds. Then, we used the setting of the minimum number of examples required for each leaf that yielded the best accuracy during cross-validation to train a full apprenticeship scheduling model on all of the training data. Finally, we tested the full apprenticeship scheduling model on the portion of the total data reserved for testing. Thus, none of the data used to test the full model was used to estimate the best setting for the leafiness of the tree. We repeated this procedure 10 times, randomly subsampling the data and taking the average performance across the 10 trials.
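The tuning procedure can be sketched generically in Python. This is a minimal illustration, not our experimental code: `fit` and `accuracy` are hypothetical callbacks standing in for training and scoring a decision tree with a given minimum leaf size, and `test_fraction` is an assumed parameter because the actual split is not restated here.

```python
import random


def tune_and_test(data, candidates, fit, accuracy, k=5, test_fraction=0.2, seed=0):
    """Select a hyperparameter by k-fold cross-validation, then test once.

    fit(rows, setting) -> model and accuracy(model, rows) -> float are
    caller-supplied (hypothetical) callbacks. The held-out test rows are
    never used while choosing the setting.
    """
    rng = random.Random(seed)
    rows = list(data)
    rng.shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]

    # Partition the training indices into k disjoint validation folds.
    indices = list(range(len(train)))
    folds = [indices[i::k] for i in range(k)]

    def cv_accuracy(setting):
        scores = []
        for fold in folds:
            held_out = set(fold)
            fit_rows = [train[i] for i in indices if i not in held_out]
            val_rows = [train[i] for i in fold]
            scores.append(accuracy(fit(fit_rows, setting), val_rows))
        return sum(scores) / k

    best = max(candidates, key=cv_accuracy)
    final_model = fit(train, best)  # retrain on all of the training data
    return best, accuracy(final_model, test)
```

Repeating the call with different seeds and averaging the returned test accuracy corresponds to the repeated random subsampling described above.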
The sensitivity and specificity of the fully trained apprenticeship scheduling algorithm are depicted in Figures 6(a) and 6(b) for 1, 5, 15, and 150 scheduling demonstrations with homogeneous agents, and in Figures 7(a) and 7(b) for demonstrations with heterogeneous agents. As before, we also varied the quality of the training examples, assuming the demonstrator was operating under an epsilon-greedy approach, selecting the correct task to schedule with a fixed probability and selecting another task from a uniform distribution otherwise.
For both the homogeneous and heterogeneous cases, we found that the apprenticeship scheduling algorithm was able to average sensitivity and specificity either with five perfect schedules or 15 schedules generated by an operator making mistakes of the time. Hyperparameter tuning substantially increased the sensitivity of the model from to for five scheduling examples generated by an operator making mistakes of the time. (Recall that a schedule consists of allocating 20 tasks to two workers and sequencing those tasks in time.)
Through our synthetic evaluation, we have shown that our apprentice scheduling algorithm is able to learn to make sequential decisions that accurately emulate the decision making process of a mock expert. The apprenticeship scheduler model shows a robust ability to learn from sparse, noisy data. In the following sections, we investigate the ability of the apprentice scheduler to learn from scheduling demonstrations produced by experts performing realworld scheduling tasks.
5.2 Real-World Data Set: ASMD
We trained a decision tree with our pairwise scheduling model and tested its performance via leaveoneout crossvalidation involving 16 real demonstrations in which a player successfully protected the ship from all enemy missiles. Each demonstration originated from a unique threat scenario. Features for each decoy/missile pair (or null decoy deployment due to inaction) included indicators for whether a decoy had been placed such that a missile was successfully distracted by that decoy, whether a missile would be lured into hitting the ship due to decoy placement, or whether a missile would be unaffected by decoy placement.
Across all 16 scenarios, we compared the mean player score with the mean score achieved by our apprenticeship scheduling model, which was trained with only 15 examples of expert human demonstrations. We hypothesized that scores produced by the learned policy would be statistically significantly better than the scores achieved by the human demonstrators. The null hypothesis stated that the number of scenarios in which the apprenticeship scheduling model achieved superior performance would be less than or equal to the number of scenarios in which the mean score of the human demonstrators was superior to that of the apprenticeship scheduler. We fixed the significance level in advance, which bounds the risk of identifying a difference between the mean scores earned by the apprenticeship scheduler and the set of human performers when no such difference exists.
Results from a binomial test rejected the null hypothesis, indicating that the learned scheduling policy performed better than the human demonstrators in significantly more scenarios. In other words, the apprenticeship scheduler outperformed the average human player in a statistically significant majority of the presented missile defense scenarios. This promising result was achieved using a relatively small training set, and suggests that the learned policy can form the basis for a training tool to improve the average human player’s score.
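A one-sided binomial (sign) test of this kind can be computed directly. The sketch below is illustrative: the function name is our own, and the win count in the usage note is a hypothetical value chosen only to show the calculation, not the count observed in the study.

```python
from math import comb


def binomial_p_value(wins, n, p=0.5):
    """One-sided p-value P(X >= wins) for X ~ Binomial(n, p).

    Under the null hypothesis, the learned policy is no more likely than the
    human demonstrators to achieve the better score in a scenario (p = 0.5).
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(wins, n + 1))
```

For instance, a hypothetical outcome of 12 wins in 16 scenarios gives `binomial_p_value(12, 16)`, which equals 2517/65536, or roughly 0.038.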
5.3 Real-World Data Set: Labor and Delivery
Currently, nurse resource managers commonly operate without technological decision-making aids. As such, it is imprudent to introduce a fully autonomous solution for resource management, as doing so could have life-threatening consequences for practitioners unfamiliar with such automation. Rather, research has shown that a semi-autonomous system is preferable when integrating machines into human cognitive workflows [Kaber & Endsley 1997, Wickens et al. 2010]. Such a system would provide recommendations that a human supervisor could then accept or modify, and would be placed within the “4-6” range on Sheridan’s 10-point scale for levels of automation [Parasuraman et al. 2000].
We found it prudent to test our apprenticeship scheduling technique with the algorithm offering recommendations to labor nurses who would evaluate how acceptable they found the quality of each recommendation. Specifically, we wanted to test whether the algorithm was able to learn to differentiate between high and lowquality resource management decisions. If nurses accepted what the apprenticeship scheduler had learned to be highquality advice while rejecting what the scheduler had learned to be lowquality advice, we could be reasonably confident that the apprentice scheduler had captured the desired resource management policy.
The first step, then, was to train a decision tree using the pairwise scheduling model based on the data set described in Section 4.3.1 of resource nurses’ scheduling decisions. Recall that this data set consisted of the results of expert resource nurses playing the simulation for hours, simulating hours of elapsed time on a real labor floor, and yielding a data set of more than decisions.
Second, we invited 15 labor nurses, none of whom were among those involved in training the algorithm, to play the same simulation used to collect the data (Figure 4). However, instead of purely soliciting decisions from the player, the simulation used the apprenticeship scheduling policy to offer recommendations about how to manage patients. Specifically, whenever a new patient arrived in the simulated waiting room, the apprenticeship scheduler would offer advice recommending 1) which of six wards to admit that patient to, 2) which bed within that ward to place that patient, and 3) which nurse should care for that patient. Nurses would then either accept the advice, automatically implementing the decision, or reject the advice and implement their own decisions.
In order to generate highquality advice, the apprenticeship scheduler simply applied Equation 7. To generate lowquality advice, the apprenticeship scheduler applied Equation 22, which changes the maximization to a minimization, as follows:
(22) 
However, such a minimization could create a strawman counterpoint to the high-quality advice, demonstrating only that the apprenticeship scheduler learned at least hard constraints (e.g., “do not assign a patient to an occupied bed”) rather than a gradation over feasible actions (e.g., “assign a less-busy nurse to a new patient rather than a busier nurse”). As such, we also used the apprenticeship scheduler to generate low-quality but feasible advice by only considering actions that were feasible, as determined through a manually encoded schedulability test.
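The three kinds of advice can be sketched with a single selection routine. This is an illustrative sketch only: `pairwise_score` stands in for the learned pairwise classifier, `feasible` for the manually encoded schedulability test, and summing pairwise scores over all other actions is our reading of the maximization that is inverted for low-quality advice.

```python
def recommend(actions, pairwise_score, feasible=None, best=True):
    """Pick the highest- or lowest-ranked action by summed pairwise preference.

    pairwise_score(a, b): learned preference for action a over b (hypothetical stand-in).
    feasible(a):          optional schedulability test; None disables filtering.
    best:                 True for high-quality advice, False for low-quality advice.
    """
    pool = [a for a in actions if feasible is None or feasible(a)]
    if not pool:
        return None

    def total(a):
        return sum(pairwise_score(a, b) for b in actions if b != a)

    return max(pool, key=total) if best else min(pool, key=total)
```

Calling `recommend(..., best=False)` without a feasibility test tends to return actions that violate hard constraints, whereas passing `feasible` restricts the low-quality advice to options a nurse could actually carry out.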
For each of the 15 nurse players, we conducted two trials with the simulation offering advice. In one trial, the advice was highquality; in the other, the simulation offered lowquality advice randomly chosen to be lowquality but feasible or lowquality and infeasible. We hypothesized that nurses would accept advice during the highquality trials and reject advice during the lowquality trials (regardless of feasibility). Each simulation trial was randomly generated, with each player experiencing different scenarios with differing advice. On average, a nurse would receive recommendations per trial, resulting in a total of 256 recommendations across all nurses and trials.
The nurses accepted highquality advice 88.4% of the time (114 of 129 highquality recommendations), while rejecting lowquality advice 88.2% of the time (112 of 127 lowquality recommendations), indicating that the apprenticeship scheduling technique is able to learn a highquality model for resource management decision making in the context of labor and delivery. In other words, the apprenticeship scheduler was able to learn contextspecific strategies for hospital resource allocation and apply them to make reasonable suggestions about which tasks to perform and when.
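The reported rates follow directly from the counts. The short check below recomputes them, along with the overall rate at which nurses' responses agreed with the learned policy across all 256 recommendations (a derived figure that combines the two counts).

```python
accepted_high, total_high = 114, 129  # high-quality recommendations accepted / offered
rejected_low, total_low = 112, 127    # low-quality recommendations rejected / offered

accept_rate = 100 * accepted_high / total_high
reject_rate = 100 * rejected_low / total_low
agreement = 100 * (accepted_high + rejected_low) / (total_high + total_low)

print(f"accepted high-quality advice: {accept_rate:.1f}%")
print(f"rejected low-quality advice:  {reject_rate:.1f}%")
print(f"overall agreement:            {agreement:.1f}%")
```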
Anecdotally, some of the advice was not accepted for reasons that could be easily remedied: For example, upon initiation of the test, we were unaware that one room on the labor and delivery floor was unique in that it contained cardiac monitoring equipment. As such, the algorithm did not know to reason about that feature and sometimes offered a recommendation that was feasible but less preferable for patients with cardiac-related comorbidities. It was not until later that we learned from the nurses about this particular feature. Such findings motivate the need for active learning for improved feature solicitation in future work. We also note that inter-operator agreement among nurse demonstrators is unlikely to be 100%. For these reasons, we believe learning a policy that can generate advice validated to be correct nearly 90% of the time is a favorable result.
6 Model for Collaborative Optimization via Apprenticeship Scheduling
Apprenticeship scheduling is designed to simply emulate human expert scheduling decisions; in this work, we also use the apprenticeship scheduler in conjunction with optimization to automatically and efficiently produce optimal solutions to challenging realworld scheduling problems. Our approach, called Collaborative Optimization via Apprenticeship Scheduling (COVAS), involves applying apprenticeship scheduling to generate a favorable (if suboptimal) initial solution to a new scheduling problem. To guarantee that the generated schedule is serviceable, we augment the apprenticeship scheduler to solve a constraint satisfaction problem, ensuring that the execution of each scheduling commitment does not directly result in infeasibility for the new problem. COVAS uses this initial solution to provide a tight bound on the value of the optimal solution, substantially improving the efficiency of a branchandbound search for an optimal schedule.
We show that COVAS is able to leverage good (but imperfect) human demonstrations to quickly produce globally optimal solutions. We also report that COVAS can transfer an apprenticeship scheduling policy learned for a small problem to optimally solve problems with twice as many variables as any shown during training, and produce an optimal solution an order of magnitude faster than mathematical optimization alone. Here, we provide an overview of the COVAS architecture and present its two components: the policy learning and optimization routines.
6.1 COVAS Architecture
The system (Figure 9) takes as input a set of domain expert scheduling demonstrations (e.g., Gantt charts) that contains information describing which agents complete which tasks, when and where. These demonstrations are passed to an apprenticeship scheduling algorithm that learns a classifier to predict whether the demonstrator(s) would have chosen one scheduling action over another. Next, COVAS uses this classifier to construct a schedule for a new problem. The system creates an event-based simulation of this new problem and runs this simulation forward in time until all tasks have been completed. In order to complete tasks, COVAS uses the classifier at each moment in time to select the best scheduling action to take. We describe this process in detail in the next section. COVAS then provides this output as an initial seed solution to an optimization subroutine (i.e., a MILP solver). The initial solution produced by the apprenticeship scheduler improves the efficiency of a search by providing a bound on the objective function value of the optimal schedule. This bound informs a branch-and-bound search over the integer variables [Bertsimas & Weismantel 2005], enabling the search algorithm to prune areas of the search tree and focus its search on areas that can yield the optimal solution. After the algorithm has identified upper and lower bounds within some threshold of each other, COVAS returns the solutions that have proven optimal within that threshold. Thus, an operator can use COVAS as an anytime algorithm and terminate the optimization upon finding a solution that is acceptable within a provable bound.
6.2 Apprenticeship Scheduling Subroutine
In Section 3, we presented our apprenticeship scheduling algorithm, which is centered around learning a classifier to predict whether an expert would take one scheduling action over another. With this function, we can then predict which single action amongst a set of actions the expert would take by applying Equation 7. In this section, we build upon this formulation and integrate it into our collaborative optimization via apprenticeship scheduling framework.
As a subroutine within COVAS, the classifier is applied to obtain the initial solution to a new scheduling problem as follows: First, the user must instantiate a simulation of the scheduling domain; then, at each time step in the simulation, COVAS takes the scheduling action predicted by Equation 7 to be the action that the human demonstrators would take. This equation identifies the task with the highest importance marginalized over all other tasks. Unlike our original formulation in Section 3, each selected action is validated using a schedulability test (i.e., solving a constraint satisfaction problem) to ensure that direct application of that action does not violate the constraints of the new problem. For example, in anti-ship missile defense, one would check to ensure that the given action does not result in a suicidal deployment (i.e., the decoy directly causes a missile to impact the ship). This test must be fast, so as to make the benefit to feasibility and optimality in the resulting schedule worth the additional complexity. If, at a given time step, the selected action does not pass the schedulability test, COVAS applies Equation 7 to the remaining actions to consider the second-best action. If no action passes the schedulability test, no action is taken during that time step.
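The subroutine described above can be sketched as a simple simulation loop. This is a minimal sketch under simplifying assumptions, not the COVAS implementation: at most one action is taken per time step, `pairwise_score` is a hypothetical stand-in for the learned classifier, and `schedulable` is a hypothetical stand-in for the constraint satisfaction check.

```python
def build_seed_schedule(tasks, pairwise_score, schedulable, horizon):
    """Greedy seed construction with a schedulability fallback.

    At each time step, rank the remaining tasks by summed pairwise preference
    and commit the highest-ranked task that passes the schedulability test.
    If none passes, no action is taken at that step.
    Returns (schedule, unscheduled_tasks).
    """
    remaining = list(tasks)
    schedule = []  # list of (time_step, task) commitments
    for t in range(horizon):
        if not remaining:
            break
        ranked = sorted(
            remaining,
            key=lambda a: sum(pairwise_score(a, b) for b in remaining if b != a),
            reverse=True,
        )
        chosen = next((a for a in ranked if schedulable(a, t, schedule)), None)
        if chosen is not None:
            schedule.append((t, chosen))
            remaining.remove(chosen)
    return schedule, remaining
```

Any tasks left unscheduled when the horizon is reached are returned separately, so that a downstream optimizer can treat them as incomplete in the seed solution.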
While the schedulability test forces the apprenticeship scheduling algorithm to follow a subset of the full constraints in the MILP formulation, it is possible that the algorithm may not successfully complete all tasks. Here, we model tasks as optional and use the objective function to maximize the total number of tasks completed. In turn, constraints for a task that the apprenticeship scheduling algorithm did not satisfactorily complete can be turned off, with a corresponding penalty in the objective function score. Thus, an initial seed solution that has not completed all tasks (i.e., satisfied all constraints to complete the task) can still be helpful for seeding the MILP.
6.3 Optimization Subroutine
For optimization, we employ mathematical programming techniques to solve mixed-integer linear programs via branch-and-bound search. COVAS incorporates the solution produced by the apprenticeship scheduler to seed a mathematical programming solver with an initial solution, which is a built-in capability provided by many off-the-shelf, state-of-the-art MILP solvers, including CPLEX (IBM ILOG CPLEX Optimization Studio, http://www03.ibm.com/software/products/en/ibmilogcpleoptistud) and Gurobi (Gurobi Optimization, Inc., http://www.gurobi.com). This seed provides a tight bound on the objective function value of the optimal solution, which serves to cut the search space; these cuts allow COVAS to more quickly home in on the optimal solution. Furthermore, this approach allows COVAS to quickly achieve a bound on the optimality of the solution provided by the apprenticeship scheduling subroutine. In such a manner, an operator can determine whether the apprenticeship scheduling solution is acceptable or whether waiting for successive solutions from COVAS is warranted.
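The effect of a seed solution on branch-and-bound search can be illustrated on a toy problem. The sketch below is not the COVAS MILP and does not use any solver API: it is a self-contained depth-first branch and bound for a small 0/1 knapsack (a maximization problem chosen only for brevity), in which `seed_value` is the objective value of a known feasible solution used as the initial incumbent.

```python
def knapsack_bnb(values, weights, capacity, seed_value=0):
    """Depth-first branch and bound for 0/1 knapsack (weights must be positive).

    seed_value: objective value of a known feasible solution (the incumbent).
    Returns (best_value, nodes_explored).
    """
    # Sort by value density so the fractional relaxation can be filled greedily.
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    best = seed_value
    nodes = 0

    def bound(i, value, room):
        # LP relaxation: take remaining items greedily, the last one fractionally.
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def visit(i, value, room):
        nonlocal best, nodes
        nodes += 1
        best = max(best, value)
        if i == len(items) or bound(i, value, room) <= best:
            return  # prune: this subtree cannot improve on the incumbent
        v, w = items[i]
        if w <= room:
            visit(i + 1, value + v, room - w)  # branch: include item i
        visit(i + 1, value, room)              # branch: exclude item i

    visit(0, 0, capacity)
    return best, nodes
```

On the instance `values=[60, 100, 120]`, `weights=[10, 20, 30]`, `capacity=50`, the search returns the same optimum with and without a seed, but supplying a good incumbent lets it prune earlier and visit fewer nodes. A production solver applies the same principle with far stronger bounds and cuts.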
7 Results and Discussion
In this section, we empirically validate that COVAS is able to generate optimal solutions more efficiently than stateoftheart optimization techniques. We also analyze the sensitivity of the computational time COVAS required to find an optimal solution as a function of the quality of the scheduling policy learned by the apprenticeship scheduling algorithm.
7.1 Validation Against Expert Benchmark
As a baseline benchmark, we solve a pure MILP formulation (Appendix A, Equations 23-44) using Gurobi, which applies state-of-the-art techniques for heuristic upper bounds, cutting planes and LP relaxation lower bounds. We set the optimality threshold at . For the apprenticeship scheduling subroutine’s schedulability test, we apply Equations 36-37 as a constraint satisfaction check when testing the feasibility of action , given by applying Equation 7. With regard to tasks within the apprenticeship scheduler’s seed solution that are not satisfactorily completed, the MILP can leave those tasks incomplete to start by initially setting .
We trained COVAS’ apprenticeship scheduling algorithm on demonstrations of experts’ solutions to unique ASMD scenarios (save for one “hold-out” scenario) from the ASMD data set described in Section 4.2. We then tested COVAS on the hold-out scenario. We also applied a pure MILP benchmark on this scenario and compared the performance of COVAS to the benchmark. We generated one data point for each unique demonstrated scenario (i.e., leave-one-out cross-validation) to validate the benefit of COVAS.
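The leave-one-out protocol can be sketched as follows. The `train` and `evaluate` callables are hypothetical placeholders: the first stands in for fitting the apprenticeship scheduling policy, the second for running COVAS and the MILP benchmark on the hold-out scenario.

```python
def leave_one_out(scenarios, train, evaluate):
    """Produce one data point per scenario by holding it out in turn.

    scenarios: list of demonstrated scenarios.
    train:     hypothetical callable mapping a list of scenarios to a policy.
    evaluate:  hypothetical callable mapping (policy, held_out) to a result.
    """
    results = []
    for i, held_out in enumerate(scenarios):
        training_set = scenarios[:i] + scenarios[i + 1:]  # all but scenario i
        policy = train(training_set)
        results.append(evaluate(policy, held_out))
    return results


# Toy stand-ins: the "policy" is just the sum of the training scenarios.
points = leave_one_out([1, 2, 3], train=sum, evaluate=lambda p, h: (p, h))
```

Each scenario is used for testing exactly once and never appears in the training set that produced its policy.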
Figure 10 consists of two performance indicators: the total computation time required for the MILP benchmark and COVAS to solve for the optimal solution is depicted on the left; to the right is the computation time required for the benchmark and COVAS to identify a solution better than that provided by a human expert. This figure indicates that COVAS was not only able to improve overall optimization time, but that it also substantially improved computation time for solutions superior to those produced by human experts. The average improvements in computation time with COVAS were x the overall optimization time and x the expert-generated solutions.
Next, we evaluated COVAS’ ability to transfer prior learning to more challenging task sets. We trained on a level in the ASMD game in which a total of 10 missiles of varying types came from specific bearings at given times. We randomly generated a set of scenarios involving 15 and 20 missiles, with bearings and times randomly sampled with replication from the set of bearings used in the 10-missile scenario.
Figure 11 depicts the computation time required by COVAS and the MILP benchmark to identify the optimal solution for scenarios involving 10, 15 and 20 missiles. The average improvement to computation time with COVAS was x, x, and x, respectively, demonstrating that COVAS is able to efficiently leverage the solutions of human domain experts to quickly solve problems twice as large as those the demonstrator provided for training.
7.2 Sensitivity Analysis of COVAS to Apprenticeship Scheduler’s Learned Policy
Here, we assess the sensitivity of the computational time COVAS required to find an optimal solution as a function of the quality of the scheduling policy learned by the apprenticeship scheduling algorithm.
7.2.1 Sensitivity Analysis Design
We sought to understand how incorrect predictions generated by the apprenticeship scheduling algorithm’s classifier would affect COVAS’ computational efficiency. We considered three classes of mistakes that the apprenticeship scheduler could make when creating an initial schedule: two types of mistakes related to agent allocation (swapping tasks among agents and the misallocation of agents to a particular task), as well as task sequencing errors. We generated a synthetic dataset involving these three error classes as follows:

Allocation: Swapping: Select two tasks with uniform probability such that the agent assigned to the first task is different from the agent assigned to the second, and subsequently swap their assignments such that each agent now performs the task originally assigned to the other.

Allocation: Stealing: Select one task, , with uniform probability, where