In this paper, we show how to combine verification and learning techniques (model-based and model-free) to solve a scheduling problem featuring both hard and soft constraints. We investigate solutions to this problem from both a theoretical and a more pragmatic point of view. On the theoretical side, we show how safety guarantees (as understood in formal verification) can be combined with guarantees offered by the Probably Approximately Correct (PAC) learning framework. On the pragmatic side, we show how safety guarantees obtained from automatic synthesis can be combined with techniques based on deep Q-learning to offer a scalable and practical solution to the scheduling problem at hand.
The scheduling problem that we consider has been introduced in  and is defined as follows. A task system is composed of a set of preemptible tasks partitioned into a set of soft tasks and a set of hard tasks. Time is assumed to be discrete and measured e.g. in CPU ticks. Each task generates an infinite number of instances , called jobs, with
Jobs generated by both hard and soft tasks are equipped with deadlines, which are relative to the respective arrival times of the jobs in the system. The computation time requirements of the jobs follow a discrete probability distribution, and are unknown to the scheduler but upper bounded by their relative deadline. Jobs generated by hard tasks must complete before their respective deadlines. While this is not mandatory for jobs generated by soft tasks, deadline misses result in a penalty/cost. The tasks are assumed to be independent and generated stochastically: the occurrence of a new job of one task does not depend on the occurrences of jobs of other tasks, and both the inter-arrival time and the computation time of jobs are independent random variables. The scheduling problem consists in finding a scheduler, i.e. a function that associates, to all CPU ticks, a task that must run at that moment, in order to: (i) avoid deadline misses by hard tasks; and (ii) minimise the mean cost of deadline misses by soft tasks.
The semantics of the task system has been modelled using an MDP in , and the task there is to compute an optimal and safe scheduler. However, that work assumes that the distributions of all tasks are known a priori, which may be unrealistic. In the current paper, we investigate learning techniques in order to build algorithms that can schedule a set of hard and soft tasks safely and optimally when only the deadlines and the domains of the distributions describing the tasks of the system are known a priori, and not the exact distributions. We believe that this is a more realistic assumption. Our motivation was also to investigate the joint application of synthesis techniques coming from the field of formal verification and of learning techniques on an easily understandable, yet challenging, setting.
First, we consider model-based learning. We show that the distributions underlying a task system with only soft tasks are efficiently PAC learnable: by executing the task system for a polynomial number of steps, enough samples can be collected to infer $\varepsilon$-accurate approximations of the distributions with high probability (Thm. 3.1).
Second, we consider the general case of systems with both hard and soft tasks. Here, safe PAC learning is not always possible, and we identify two algorithmically-checkable sufficient conditions for task systems to be safely learnable (Thms. 3.2 and 3.3). These crucially depend on the underlying MDP being a single maximal end-component, as is the case in our setting (Lemma 2). Subsequently, we use robustness results on MDPs to compute or learn near-optimal safe strategies from the learnt models (Thm. 3.4). For the learning part, we apply shielded Q-learning in the sense of .
Third, in order to evaluate the relevance of the different algorithms defined in the paper, we present experiments with a prototype implementation. These empirically validate the efficient PAC guarantees proved in the theory we have developed. Unfortunately, the models that are learnt by the learning algorithms are often too large for the probabilistic model-checking tools. In contrast, the shielded deep Q-learning algorithm scales to larger examples: e.g. we learn safe scheduling strategies for systems with more than states. The experiments also show that a strategy that is learnt by assigning high costs to missing deadlines of hard tasks does not respect safety, even if the learning happens for a reasonably long period of time and the costs assigned for missing the deadlines of hard tasks are very high (cf. ).
The authors of  introduce the scheduling problem considered here but make the assumption that the underlying distributions of the tasks are known. We drop this assumption here and provide learning algorithms. In , the framework to combine safety via shielding and model-free reinforcement learning is introduced and applied to several examples using table-based Q-learning as well as deep RL. Our shielded version of deep Q-learning fits the framework of post-posed shielding of . In , shield synthesis is studied for long-run objective guarantees rather than safety requirements. Unlike our work, the transition probabilities on MDPs in both  and  are assumed to be known. We observe that  and  do not provide model-based learning and PAC guarantees. Moreover, neither considers scheduling problems.
A framework to mix reactive synthesis and model-based reinforcement learning for mean-payoff with PAC guarantees has been studied in . There, the learning algorithm estimates the probabilities on the transitions of the MDP. In our approach, we do not estimate these probabilities directly from the MDP, but learn probabilities for the individual tasks in the task system. The efficient PAC guarantees that we have obtained for the model-based part cannot be obtained from their framework. In , the combination of shielding with model-predictive control using MCTS has been introduced. However, that paper does not consider learning.
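We denote by $\mathbb{N}$ the set of natural numbers; by $\mathbb{Q}$, the set of rational numbers; and by $\mathbb{Q}_{\geq 0}$ the set of all non-negative rational numbers. Given $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$. Given a finite set $A$, a (rational) probability distribution over $A$ is a function $p \colon A \to [0,1] \cap \mathbb{Q}$ such that $\sum_{a \in A} p(a) = 1$. We call $A$ the domain of $p$, and denote it by $\mathsf{Dom}(p)$. We denote the set of probability distributions on $A$ by $\mathcal{D}(A)$. The support of the probability distribution $p$ on $A$ is $\mathsf{Supp}(p) = \{a \in A \mid p(a) > 0\}$. A distribution $p$ is called Dirac if $|\mathsf{Supp}(p)| = 1$. For a probability distribution $p$, the minimum probability assigned by $p$ to the elements in $\mathsf{Supp}(p)$ is $\pi^{p}_{\min} = \min_{a \in \mathsf{Supp}(p)} p(a)$. We say two distributions $p$ and $p'$ are structurally identical if $\mathsf{Supp}(p) = \mathsf{Supp}(p')$. Given two structurally identical distributions $p$ and $p'$, for $0 < \varepsilon < 1$, we say that $p$ is $\varepsilon$-close to $p'$, denoted $p \sim^{\varepsilon} p'$, if $\mathsf{Supp}(p) = \mathsf{Supp}(p')$, and for all $a \in \mathsf{Supp}(p)$, we have that $|p(a) - p'(a)| \leq \varepsilon$.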
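The notions above can be made concrete with a small sketch. It assumes that a distribution is stored as a dictionary from domain elements to exact rationals, and it reads closeness as "same domain, same support, and pointwise difference at most the tolerance"; the function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction


def support(p):
    """Elements of the domain that receive non-zero probability."""
    return {a for a, prob in p.items() if prob > 0}


def is_distribution(p):
    """A rational distribution: non-negative values summing to exactly 1."""
    return all(prob >= 0 for prob in p.values()) and sum(p.values()) == 1


def is_dirac(p):
    """A Dirac distribution puts all its mass on a single element."""
    return len(support(p)) == 1


def min_probability(p):
    """Smallest probability assigned to an element of the support."""
    return min(p[a] for a in support(p))


def is_eps_close(p, q, eps):
    """Assumed reading of closeness: identical domain and support,
    and probabilities that differ by at most eps on every element."""
    if p.keys() != q.keys() or support(p) != support(q):
        return False
    return all(abs(p[a] - q[a]) <= eps for a in p)


# Two distributions over the domain {1, 2, 3}; element 3 is outside the support.
p = {1: Fraction(1, 2), 2: Fraction(1, 2), 3: Fraction(0)}
q = {1: Fraction(2, 5), 2: Fraction(3, 5), 3: Fraction(0)}
```

Using `Fraction` keeps the sum-to-one check exact, which matches the rational probabilities used in the text.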
An instance of the scheduling problem studied in  consists of a task system , where are preemptible tasks partitioned into hard and soft tasks and respectively. The latter need to be scheduled on a single processor. Formally, the work of  relies on a probabilistic model for the computation times of the jobs and for the delay between the arrival of two successive jobs of the same task. For all , task is defined as a tuple , where: (i) is a discrete probability distribution on the (finitely many) possible computation times of the jobs generated by ; (ii) is the deadline of all jobs generated by which is relative to their arrival time; and (iii) is a discrete probability distribution on the (finitely many) possible inter-arrival times of the jobs generated by . We denote by the maximum probability appearing in the definition of , that is, across all the distributions and , for all . It is assumed that for all ; hence, at any point in time, there is at most one job per task in the system. Also note that when a new job of some task arrives at the system, the deadline for the previous job of this task is already over. The potential degradation in the quality when a soft task misses its deadline is modelled by a cost function that associates to each soft task a cost that is incurred every time a job of misses its deadline.
Given a task system with tasks, the structure of is where . We denote by and respectively the maximum computation time, and the maximum inter-arrival time of a task in . Formally, , and . Note that . We also let . We denote by the number of tasks in the task system . Consider two task systems , and , with , for all and . The two task systems and are said to be -close, denoted , if (i) , (ii) for all , we have , and (iii) for all , we have .
Markov decision processes
Let us now introduce Markov decision processes (MDPs), as they form the basis of the formal model of , which we recall later. A finite Markov decision process is a tuple , where: (i) is a finite set of actions; (ii) is a finite directed graph and is an edge-labelling function (we denote by the set of outgoing edges from vertex ); (iii) the set of vertices is partitioned into and ; (iv) the graph is bipartite i.e. , and the labelling function is s.t. if , and if ; and (v) assigns to each vertex a rational probability distribution on . For all edges , we let if , and otherwise. We further assume that, for all , for all , in : implies , i.e. an action identifies uniquely an outgoing edge. Given , and , we define . For all vertices , we denote by , the set of actions . The size of an MDP , denoted , is the sum of the number of vertices and the number of edges, that is, . An MDP is said to be structurally identical to another MDP if for all , we have that . For two structurally identical MDPs and with distribution assignment functions and respectively, we say that is -approximate to , denoted , if for all : .
An MDP can be interpreted as a game between two players: and , who own the vertices in and respectively. A play in an MDP is a path in its underlying graph . We say that a prefix of a play belongs to player , iff its last vertex is in . The set of prefixes that belong to player is denoted by . A play in the MDP is then obtained by the interaction of the two players as follows: if the current play prefix belongs to , she plays by picking an edge (or, equivalently, an action that labels a necessarily unique edge from ). Otherwise, when belongs to , the next edge is chosen randomly according to . In both cases, the play prefix is extended by and the game goes on ad infinitum.
A (deterministic) strategy of is a function , such that for all prefixes. A strategy is memoryless if for all finite prefixes and : implies . For memoryless strategies, we will abuse notation and assume that such strategies are of the form (i.e., the strategy associates the edge to play to the current vertex and not to the full prefix played so far). From now on, we will consider memoryless deterministic strategies unless otherwise stated. Let be an MDP, and let be a memoryless strategy. Then, assuming that plays according to , we can express the behaviour of as a Markov chain, where the probability distributions reflect the stochastic choices of (see  for the details).
An end-component (EC, for short) with , and is a sub-MDP of (for all , is a subset of the actions available to from ; and for all : ) that is strongly connected. A maximal EC (MEC) is an EC that is not included in any other EC.
MDP model for the scheduling problem
Given a system of tasks, we describe below the modelling of the scheduling problem by a finite MDP as it appears in . The two players and correspond to the Scheduler and the task generator (TaskGen) respectively. The vertices of the MDP correspond to the system states. Since there is at most one active job of each task at any time, we maintain, in all vertices, the following information about each task : (i) a distribution over the job’s possible remaining computation times (rct); (ii) the time up to its deadline; and (iii) a distribution over the possible times up to the next arrival of a new job of . We also tag vertices with either or to remember their respective owners and we have a special vertex that will be reached when a hard task misses a deadline. For a vertex , for , let be the tasks that have an active job in , and be the tasks that have missed a deadline for sure in .
Possible moves of Scheduler. The possible actions of Scheduler are to schedule an active task or to idle the CPU. We model this by having, from all vertices one transition labelled by some element from , or by (no job gets scheduled). Such transitions model the elapsing of one clock tick.
Possible moves of TaskGen. The moves of TaskGen consist in selecting, for each task one possible action out of four: either (i) nothing (); or (ii) to finish the current job without submitting a new one (); or (iii) to submit a new job while the previous one is already finished (); or (iv) to submit a new job and kill the previous one, in the case of a soft task (), which will incur a cost.
Expected mean-cost threshold synthesis
Let us first associate a value, called the mean-cost, to all plays in an MDP . First, for a prefix , we define (recall that when is an action). Then, for a play , we have . Observe that is a measurable function. A strategy is optimal for the mean-cost from some initial vertex if . Such an optimal strategy always exists, and it is well known that there is always one which is memoryless. Moreover, this problem can be solved in polynomial time through linear programming or in practice using value iteration (as implemented, for example, in the tool Storm ). We denote by the optimal value .
Given an MDP , an initial vertex , and a strategy , we define the set of possible outcomes in the Markov chain as the set of paths in s.t., for all , there is non-null probability to go from to in . Let denote the set of vertices visited in the set of possible outcomes .
Given with a set of vertices, and an initial vertex , and a set of so-called bad vertices, the safety synthesis problem is to decide whether has a strategy that ensures that only safe vertices are visited, i.e.: (in our scheduling problem, such vertices will model the situations where a hard task has never missed a deadline). If this is the case, we call such a strategy safe. The safety synthesis problem is decidable in polynomial time for MDPs. Indeed, since probabilities do not matter for this problem, the MDP can be regarded as a plain two-player game played on graphs (like in ), and the classical attractor algorithm can be used. We briefly describe the attractor algorithm below for completeness.
Given , let be its set of successors, and be its set of outgoing edges. We assume that for all : , i.e. there is no deadlock.
The algorithm consists of computing all the vertices from which cannot avoid reaching the unsafe vertices. To this end, the algorithm computes a sequence of sets of vertices defined as follows: (i) ; and (ii) for all : . That is, the sequence is initialised to the set of unsafe vertices. Then, the algorithm grows this set of vertices by adding: (i) vertices belonging to whose set of successors has been entirely identified as unsafe in a previous step; and (ii) vertices belonging to having at least one unsafe successor.
It is easy to check that this sequence converges after at most steps (the graph of the MDP being finite) and returns the set of vertices from which has no strategy to stay within . Hence, has a strategy to stay within from all vertices in which is s.t. (any successor of satisfying this criterion yields a safe strategy).
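The fixpoint computation described above can be sketched as follows. This is a minimal illustration, assuming a simplified graph given by a successor map rather than the labelled edges of the MDP; the names `attractor` and `safe_choices` and the toy graph are hypothetical.

```python
def attractor(vertices, player1, succ, bad):
    """Vertices from which player 1 cannot avoid reaching `bad`.

    vertices: list of all vertices
    player1:  set of vertices owned by the player that wants to stay safe
    succ:     dict mapping each vertex to a non-empty list of successors
    bad:      set of unsafe vertices
    """
    attr = set(bad)
    changed = True
    while changed:  # stops after at most len(vertices) productive rounds
        changed = False
        for v in vertices:
            if v in attr:
                continue
            if v in player1:
                # Player 1 loses only if every choice leads to the attractor.
                losing = all(s in attr for s in succ[v])
            else:
                # The other player needs a single unsafe successor.
                losing = any(s in attr for s in succ[v])
            if losing:
                attr.add(v)
                changed = True
    return attr


def safe_choices(vertices, player1, succ, bad):
    """For each safe player-1 vertex, the successors that stay safe."""
    attr = attractor(vertices, player1, succ, bad)
    return {v: [s for s in succ[v] if s not in attr]
            for v in vertices if v in player1 and v not in attr}


# Toy graph: "a" and "c" belong to player 1, "x" is the bad vertex.
vertices = ["a", "b", "c", "d", "x"]
player1 = {"a", "c"}
succ = {"a": ["b", "d"], "b": ["a", "c"], "c": ["x"], "d": ["a"], "x": ["x"]}
unsafe = attractor(vertices, player1, succ, {"x"})
choices = safe_choices(vertices, player1, succ, {"x"})
```

In the toy graph, `c` is forced into `x`, `b` may move to `c`, and `a` stays safe by always choosing `d`.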
Moreover, it is well known that, if a safe strategy exists, then there is in particular a memoryless safe strategy; so, from now on, we will consider safe strategies that are memoryless only. We say that a vertex is safe iff has a safe strategy from , and that an edge is safe iff there is a safe strategy s.t. . So, the safe edges from some node , denoted , correspond to the choices that can safely make from . The set of safe edges corresponds exactly to the set of safe actions that can make from . Then, we let the safe region of be the MDP obtained from by applying the following transformations: (i) remove from all unsafe edges; (ii) remove from all vertices and edges that are not reachable from . Note that is an MDP, since we have removed edges from Player one vertices only.
Most general safe scheduler
A task system is said to be schedulable for the hard tasks if Scheduler has a winning strategy to avoid in . This strategy corresponds to a scheduler that prevents hard tasks from ever missing a deadline. We say that a scheduler is the most general safe scheduler for the hard tasks if from all vertices of Scheduler, it allows all possible safe edges from . The existence of the most general safe scheduler is a direct consequence of the fact that the most general strategy exists for safety objectives.
Consider a system with one hard task s.t. and ; one soft task s.t. , , and ; and the cost function s.t. . Fig. 1 presents an excerpt of the MDP built from the set of tasks of Example 1. A distribution with support is denoted by . When is s.t. for some , we simply denote by . Vertices from and are depicted by rectangles and rounded rectangles respectively. Each vertex is labelled by on the top, and below.
A strategy to avoid missing a deadline of consists in first scheduling , then . One then reaches the left-hand part of the graph from which can avoid whatever does. Note that other safe strategies are possible: the first step of the algorithm in  is actually to compute all the safe nodes (i.e. those from which can ensure to avoid ), and then to look for an optimal one w.r.t. missed-deadline costs.
There are two optimal memoryless strategies, one in which Scheduler first chooses to execute , then ; and another where is scheduled for time unit, and then preempted to let execute. Since the time difference between the arrival of two consecutive jobs of the soft task is and the cost of missing a deadline is , for both of these optimal strategies, the soft task’s deadline is missed with probability over this time duration of , and hence the mean-cost is . Observe that another safe schedule, which is not optimal, is one in which only is granted CPU access and is never scheduled, thus giving a mean-cost of .
3 Model-Based Learning
We now investigate the case of model-based learning of task systems. First, we consider the simpler case of task systems with only soft tasks. We show that those systems are always efficiently PAC learnable. Second, we consider learning task systems with both hard and soft tasks. In that case, we study two conditions for learnability. The first condition allows us to identify task systems that are safely PAC learnable, i.e. learnable while enforcing safety for the hard tasks. The second condition is stronger and allows us to identify task systems that are safely and efficiently PAC learnable. Our learning algorithms rely on (safely) sampling the distributions underlying the behaviour of tasks.
We consider a setting in which we are given the structure of a task system to schedule. While the structure is known, the actual distributions that describe the behaviour of the tasks are unknown and need to be learnt to behave optimally or near optimally. The learning must be done only by observing the jobs that arrive over time. When the task system contains some hard tasks (), all deadlines of such tasks must be enforced.
For learning the inter-arrival time distribution of a task, a sample corresponds to observing the time difference between the arrivals of two consecutive jobs of that task. For learning the computation time distribution, a sample corresponds to observing the CPU time that a job of the task has been assigned up to completion. Thus, if a job does not finish execution before its deadline, we do not obtain a valid sample for the computation time. Given a class of task systems, we say:
the class is probably approximately correct (PAC) learnable if there is an algorithm such that for all task systems in this class, for all : given , the algorithm can execute the task system , and can compute such that , with probability at least .
the class is safely PAC learnable if it is PAC learnable, and can ensure safety for the hard tasks while computing .
the class is called (safely) efficiently PAC learnable if it is (safely) PAC learnable, and there exists a polynomial in the size of the task system, in , and in , s.t. can obtain enough samples and compute in a time which is bounded by .
Note that our notion of efficient PAC learning is slightly stronger than the definition used in classical PAC learning terminology  since we take into account the time that is needed to get samples and not only the number of samples needed. We will see later in this section why this distinction is important.
Learning discrete finite distributions by sampling
We analyse the number of samples needed to closely approximate a discrete distribution with high probability. Towards this, we first introduce Hoeffding’s inequality.
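Let $X_1, \ldots, X_m$ be independent random variables with domain bounded by the interval $[0,1]$, and let $\bar{X} = \frac{1}{m} \sum_{j=1}^{m} X_j$. For all $\varepsilon > 0$ the following hold: $\mathbb{P}\big(\bar{X} - \mathbb{E}[\bar{X}] \geq \varepsilon\big) \leq \exp(-2m\varepsilon^2)$ and $\mathbb{P}\big(|\bar{X} - \mathbb{E}[\bar{X}]| \geq \varepsilon\big) \leq 2\exp(-2m\varepsilon^2)$.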
To learn an unknown discrete distribution $p$ defined on a finite domain $\mathsf{Dom}(p)$, we collect i.i.d. samples from that distribution and infer a model of the distribution from those samples. Formally, given a sequence $\mathcal{S} = (s_j)_{j \in J}$ of samples drawn i.i.d. from the distribution $p$, we denote by $p(\mathcal{S})$ the function that maps every element $a \in \mathsf{Dom}(p)$ to its relative frequency in $\mathcal{S}$, i.e. to $|\{j \in J \mid s_j = a\}| / |J|$.
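As an illustration of the relative-frequency model, here is a minimal sketch. It assumes the samples are given as a Python list and the domain as an explicit collection; the function name `empirical_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def empirical_distribution(samples, domain):
    """Map every element of `domain` to its relative frequency in `samples`."""
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    # Elements of the domain that were never observed get frequency 0.
    return {a: Fraction(counts[a], len(samples)) for a in domain}


# Eight observed inter-arrival times over the known domain {2, 3, 5}.
observed = [2, 3, 2, 2, 5, 3, 2, 5]
model = empirical_distribution(observed, domain=[2, 3, 5])
```

Only the domain is needed in advance, which matches the setting where the structure of the task system is known but the probabilities are not.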
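The following lemma tells us that if the size of $\mathcal{S}$ is large enough then the model $p(\mathcal{S})$ is close to the actual $p$ with high probability.
For all finite discrete distributions $p$ with $|\mathsf{Dom}(p)| = r$, for all $\varepsilon, \gamma \in (0,1)$ such that $\varepsilon < \pi^{p}_{\min}$, if $\mathcal{S}$ is a sequence of at least $r \cdot \left\lceil \frac{\ln(2r) - \ln \gamma}{2\varepsilon^2} \right\rceil$ i.i.d. samples drawn from $p$, then $p \sim^{\varepsilon} p(\mathcal{S})$ with probability at least $1 - \gamma$.
For a distribution $p$, and an element $e$ in $\mathsf{Dom}(p)$, let $X^e_1, \ldots, X^e_m$ be independent and identically distributed Bernoulli random variables with $\mathbb{E}[X^e_j] = p(e)$ for $j \in [m]$. Recall that a Bernoulli random variable takes two values, $0$ and $1$. In our case, the value $1$ denotes witnessing the element $e$ in the domain of the distribution $p$. Thus we have $\mathbb{P}(X^e_j = 1) = p(e)$. Let $\bar{X}^e = \frac{1}{m} \sum_{j=1}^{m} X^e_j$. Here $m$ is the number of samples required to learn the probability of occurrence of the element $e$ of the support of the distribution.
By Hoeffding’s two-sided inequality, for the special case of Bernoulli random variables, we have $\mathbb{P}\big(|\bar{X}^e - p(e)| \geq \varepsilon\big) \leq 2\exp(-2m\varepsilon^2)$.
Now we want the probability of $|\bar{X}^e - p(e)| \geq \varepsilon$ to be at most $\gamma / r$ for each $e$, so that the probability of $|\bar{X}^e - p(e)| \geq \varepsilon$ for some element $e$ in the domain of the distribution is at most $\gamma$.
Thus we have $2\exp(-2m\varepsilon^2) \leq \gamma / r$, leading to $m \geq \frac{1}{2\varepsilon^2} \ln \frac{2r}{\gamma}$. Since there are $r$ elements in the domain, we need a total of at least $r \cdot m$ samples, and hence the result.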
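The counting argument in this proof can be turned into a small calculator. The sketch below assumes the per-element requirement is that the two-sided Hoeffding bound, 2·exp(−2·m·ε²), is at most γ/r for a domain of r elements, and that the total is r times the per-element count, as the proof states; the function names are hypothetical.

```python
import math


def samples_per_element(r, eps, gamma):
    """Smallest integer m with 2 * exp(-2 * m * eps**2) <= gamma / r."""
    return math.ceil(math.log(2 * r / gamma) / (2 * eps ** 2))


def total_samples(r, eps, gamma):
    """Total sample count used in the proof: r elements, m samples each."""
    return r * samples_per_element(r, eps, gamma)


# A domain with 3 elements, accuracy 0.1 and failure probability 0.05.
m = samples_per_element(3, 0.1, 0.05)
n = total_samples(3, 0.1, 0.05)
```

The count grows quadratically in 1/ε but only logarithmically in 1/γ, so demanding higher confidence is far cheaper than demanding higher accuracy.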
For all distributions $p$, we say that we ‘PAC learn’ the distribution if, for all $\varepsilon, \gamma \in (0,1)$ such that $\varepsilon < \pi^{p}_{\min}$, by drawing a sequence $\mathcal{S}$ of i.i.d. samples from $p$, we have $p \sim^{\varepsilon} p(\mathcal{S})$ with probability at least $1 - \gamma$. Informally, we also refer to this as learning a distribution with strong guarantees. We note that, given a task system, if we can learn the distributions corresponding to all of its tasks, and hence a model of the task system, such that each learnt distribution in the model is structurally identical to its corresponding distribution in the task system, then the corresponding MDPs are structurally identical.
Efficient PAC learning with soft tasks only
Let be a task system with soft tasks only, and let . We assume that for all distributions occurring in the models of the tasks in : . To learn a model which is -close to with probability at least , we apply Lemma 1 in the following algorithm:
for all tasks , repeat the following learning phase:
Always schedule task when a job of this task is active. Collect the samples of and of as observed. Collect enough samples to apply Lemma 1 and obtain the desired accuracy as fixed by and .
the models of inter-arrival time distribution and computation time distribution for task are and respectively.
It follows that task systems with only soft tasks are efficiently PAC learnable:
There exists a learning algorithm such that for all task systems with , for all , the algorithm learns a model such that with probability at least after executing for steps.
Using Lemma 1, given , for every distribution of the task system, a sequence of i.i.d. samples suffices to have with probability at least . Since in the task system , there are distributions, with probability at least , we have that the learnt model . Thus for , and using , we have that for each distribution, a sequence of samples suffices so that with probability at least .
Since samples for computation time distribution and inter-arrival time distribution for each soft task can be collected simultaneously, and observing each sample takes a maximum of time steps, and we collect samples for each soft task by scheduling one soft task after another, the result follows.
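To get a feel for the resulting bound, the following sketch estimates the number of execution steps. It rests on assumptions that are not taken from the theorem statement: the overall failure probability is split evenly over the two distributions of each task (a plain union bound), the per-distribution sample count follows the Hoeffding-based count of Lemma 1, and each sample costs at most the maximum inter-arrival time. All names are hypothetical.

```python
import math


def lemma1_samples(r, eps, gamma):
    """Sample count for one distribution over a domain of r elements."""
    per_element = math.ceil(math.log(2 * r / gamma) / (2 * eps ** 2))
    return r * per_element


def soft_only_budget(domain_sizes, max_inter_arrival, eps, gamma):
    """Upper bound on execution steps for a system with soft tasks only.

    domain_sizes: one (computation-time domain size, inter-arrival domain
                  size) pair per soft task
    """
    n_distributions = 2 * len(domain_sizes)
    gamma_each = gamma / n_distributions  # union bound (assumption)
    steps = 0
    for r_comp, r_arr in domain_sizes:
        # Both distributions of a task are sampled simultaneously, so the
        # larger of the two requirements decides how long the phase lasts.
        needed = max(lemma1_samples(r_comp, eps, gamma_each),
                     lemma1_samples(r_arr, eps, gamma_each))
        # Tasks are learnt one after another; each sample takes at most
        # max_inter_arrival ticks to observe.
        steps += needed * max_inter_arrival
    return gamma_each, steps


gamma_each, steps = soft_only_budget([(2, 2), (3, 2)], 5, eps=0.1, gamma=0.1)
```

The estimate is linear in the number of tasks and in the maximum inter-arrival time, which is what makes counting time, and not only samples, meaningful here.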
Safe learning in the presence of hard tasks
We now turn to the more challenging case of safely learning a task system with both hard and soft tasks (i.e. and ). The learning algorithm must ensure that all the jobs of hard tasks meet their deadlines while learning the task distributions. The algorithm for systems with only soft tasks is clearly not valid for that more general case.
From now on, we assume that the task system for which we want to safely learn a model is schedulable for the hard tasks (note that safety synthesis already identifies task systems that violate this condition). This is a necessary condition for safe learning, but it is not a sufficient one. Indeed, to apply Lemma 1, we need enough samples for all tasks .
First, we note that when executing any safe schedule for the hard tasks, we will observe enough samples for the hard tasks. Indeed, under a safe schedule for the hard tasks, any job of a hard task that enters the system will be executed to completion before its deadline (as the scheduler is safe). Then clearly, we observe the value of the inter-arrival time as well as the value of the computation time requested by all the jobs of hard tasks that enter the system. Unfortunately, this is not necessarily the case for soft tasks when they execute in the presence of hard tasks. As an example, consider a task system such that a job of a soft task with a computation time of and deadline arrives at a time when there already exists an active job of a hard task with remaining units of computation time and a remaining time of before the deadline. The job of the hard task needs to complete execution before its deadline and thus the job of the soft task cannot execute to completion before its deadline. Thus, considering as samples only those jobs that have been fully scheduled to completion and ignoring those that do not, would lead to samples that are biased towards smaller computation times, and would not allow us to draw conclusions about the real computation time distribution.
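The bias described above can be quantified on a toy instance. The sketch below uses made-up numbers, not those of the example in the text: the soft job needs 1 or 3 ticks with equal probability, and the CPU time left for it by the hard task is 2 or 3 ticks with equal probability, independently of the job. The function name is hypothetical.

```python
from fractions import Fraction


def observed_distribution(comp_time, budget):
    """Distribution of computation times among the jobs that complete.

    comp_time: dict value -> probability (true computation-time distribution)
    budget:    dict value -> probability of the CPU time left for the soft job
    A job completes iff its computation time is at most the budget.
    """
    joint = {c: sum(pc * pb for b, pb in budget.items() if c <= b)
             for c, pc in comp_time.items()}
    total = sum(joint.values())
    # Conditioning on completion renormalises the surviving mass.
    return {c: mass / total for c, mass in joint.items()}


true_dist = {1: Fraction(1, 2), 3: Fraction(1, 2)}
budget = {2: Fraction(1, 2), 3: Fraction(1, 2)}
biased = observed_distribution(true_dist, budget)
```

Keeping only completed jobs shifts the estimate from one half to two thirds for the short computation time, so no amount of additional sampling would recover the true distribution.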
We thus need stronger conditions than hard-task schedulability in order to learn the distributions of the soft tasks while ensuring safety. We develop two such conditions in the rest of this section. The first condition ensures PAC learnability but does not provide a polynomial-time guarantee on the learning time, while the second condition is stricter but ensures efficient PAC learnability.
PAC guarantees for safe learning
Our condition to ensure safe PAC learnability relies on properties of the safe region in the MDP associated to the task system . First, note that is guaranteed to be non-empty as the task system is guaranteed to be schedulable for its hard tasks by hypothesis. Our condition will exploit the following property of its structure:
Let be a task system and let be the safe region of its MDP. Then is a single MEC.
We first assume that the task system is schedulable. Otherwise, is empty and the Lemma is trivially true. Let and be the set of vertices and the set of edges of respectively. First, observe that, since we want to prove that the whole MDP corresponds to an MEC, we only need to show that its underlying graph is strongly connected. Indeed, since contains all vertices and edges from , it is necessarily maximal, and all choices of actions from any vertex will always lead to a vertex in .
In order to show the strongly connected property, we fix a vertex , and show that there exists a path in from to . Since all vertices in are, by construction of , reachable from the initial vertex , this entails that all vertices are also reachable from , hence, the graph is strongly connected.
Let us first assume that , i.e., is a vertex where Scheduler has to take a decision. Let be the path leading to , where all vertices belong to Scheduler, and all are vertices that belong to TaskGen.
Then, from path , we extract, for all tasks the sequence of actual inter-arrival times defined as follows: for all , is the time elapsed (in CPU ticks) between the arrival of the th job and the th job of task along (assuming the initial release occurring in the initial state is the -th release). In other words, letting , the th job of is released along on the transition between and . Observe thus that all tasks are in the same state in vertex and in vertex , i.e. the time to the deadline, and the probability distributions on the next arrival and computation times are the same in and . However, the vertices can be different for all the different tasks, since they depend on the sequence of job releases of along . Nevertheless, we claim that can be extended, by repeating the sequence of arrivals of all the tasks along , in order to reach a vertex where all tasks have just submitted a job (i.e. ). To this aim, we first extend, for all tasks , into , where ensures that the arrival of a occurs after .
For all , let denote , i.e. is the total number of CPU ticks needed to reach the first state after where task has just submitted a job (following the sequence of arrival defined above). Further, let . Now, let be a path in that respects the following properties:
is a prefix of ;
has a length of CPU ticks;
ends in a vertex ; and
for all tasks : submits a job at time along iff it submits a job at time along .
Observe that, in the definition of , we do not constrain the decisions of Scheduler after the prefix . First, let us explain why such a path exists. Observe that the sequence of task arrival times is legal, since it consists, for all tasks , in repeating times the sequence of inter-arrival times which is legal since it is extracted from path (remember that nothing that the Scheduler player does can restrict the times at which TaskGen introduces new jobs in the system). Then, since is schedulable, we have the guarantee that all vertices in have at least one outgoing edge. This is sufficient to ensure that indeed exists. Finally, we observe that visits (since is a prefix of ), and that the last vertex of is a vertex obtained just after all tasks have submitted a job, by construction. Thus , and we conclude that, from all which is reachable from , one can find a path in that leads back to .
This reasoning can be extended to account for the nodes : one can simply select any successor of , and apply the above reasoning from to find a path going back to .
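The proof reduces strong connectivity to two reachability facts: every vertex is reachable from the initial vertex, and the initial vertex is reachable from every vertex. A minimal sketch of the corresponding check on an explicit graph is given below; the successor-map encoding and the function names are assumptions made for the illustration.

```python
from collections import deque


def reachable(succ, start):
    """Set of vertices reachable from `start` by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def is_strongly_connected(succ, root):
    """True iff every vertex is reachable from `root` and can reach `root`."""
    vertices = set(succ) | {w for ws in succ.values() for w in ws}
    if reachable(succ, root) != vertices:
        return False
    # Reaching `root` from v in the graph is the same as reaching v from
    # `root` in the graph with all edges reversed.
    pred = {v: [] for v in vertices}
    for v, ws in succ.items():
        for w in ws:
            pred[w].append(v)
    return reachable(pred, root) == vertices


cycle = {0: [1], 1: [2], 2: [0]}
trap = {0: [1], 1: [1]}
```

Two searches suffice because any path between two vertices can be routed through the root.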
Good for sampling and learnability
The safe region of the task system is good for sampling if for all soft tasks , there exists a vertex such that:
a new job of task enters the system in ; and
there exists a strategy of Scheduler that is compatible with the set of safe schedules for the hard tasks so that from , under schedule , the new job associated to task is guaranteed to reach completion before its deadline, no matter how TaskGen behaves.
There is an algorithm that executes in polynomial time in the size of and which decides if is good for sampling. Also, remember that only the knowledge of the structure of the task system is needed to compute .
Given a task system that is good for sampling, given any , we safely learn a model which is -close to with probability at least (PAC guarantees) by applying the following algorithm:
Choose any safe strategy for the hard tasks, and apply this strategy until enough samples for each have been collected according to Lemma 1. The models for tasks are and respectively.
Then for each , apply the following phase:
from the current vertex , play uniformly at random among the safe edges in until reaching some (while playing uniformly at random, we reach some with probability ; this follows from the fact that there is a single MEC in the MDP by Lemma 2, and the existence of a is guaranteed by the hypothesis that is good for sampling)
from , apply the schedule as defined in the good for sampling condition. This way we are guaranteed to observe the computation time requested by the new job of task that entered the system in vertex , no matter how TaskGen behaves. At the completion of this job of task , we have collected a valid sample of task .
go back to until enough samples have been collected for task according to Lemma 1. The model for task is given by and .
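The sampling phase for one soft task can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the safe region is abstracted as a plain adjacency map `safe_edges`, the moves of TaskGen are folded into the random choice of the next vertex, and `run_job` is a hypothetical callback standing for "apply the dedicated schedule until the new job completes", which returns the observed computation time and the vertex where play resumes. The number of samples is passed in directly rather than derived from Lemma 1.

```python
import random
from collections import Counter

def random_walk_to_target(safe_edges, start, targets, rng, max_steps=10_000):
    """Follow uniformly random safe edges from `start` until a target vertex is hit."""
    vertex = start
    for _ in range(max_steps):
        if vertex in targets:
            return vertex
        vertex = rng.choice(safe_edges[vertex])
    raise RuntimeError("no target vertex reached within max_steps")

def collect_samples(safe_edges, start, targets, run_job, n_samples, rng):
    """Alternate random exploration and safe job execution; return the empirical distribution."""
    counts = Counter()
    vertex = start
    for _ in range(n_samples):
        # Phase 1: explore uniformly at random among safe edges.
        vertex = random_walk_to_target(safe_edges, vertex, targets, rng)
        # Phase 2: run the new job to completion and record its computation time.
        computation_time, vertex = run_job(vertex, rng)
        counts[computation_time] += 1
    return {time: count / n_samples for time, count in counts.items()}

# Toy instance (all names and values below are hypothetical).
toy_edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}

def toy_run_job(vertex, rng):
    # Pretend the job needs 1, 2 or 3 ticks and that play resumes in vertex "a".
    return rng.choice([1, 2, 3]), "a"

rng = random.Random(0)
model = collect_samples(toy_edges, "a", {"c"}, toy_run_job, 200, rng)
```

The `max_steps` guard is there only to keep the sketch from looping forever on a malformed graph; on a region with a single MEC the walk reaches a target vertex almost surely.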
The properties of the learning algorithm above are used to prove that:
There exists a learning algorithm such that for all task systems with a safe region that is good for sampling, for all , the algorithm learns a model such that with probability at least .
For the hard tasks, as mentioned above, we can learn the distributions by applying the safe strategy to collect enough samples for each .
We assume an order on the set of soft tasks. First for all for , since is good for sampling, we note that the set of vertices (as defined in the good for sampling condition) is non-empty. Recall from Lemma 2 that has a single MEC. Thus, from every vertex of , Scheduler reaches some with probability by playing uniformly at random, and hence can visit the vertices of infinitely often with probability . Now, given and , using Theorem 3.1, we can compute an , the number of samples corresponding to each distribution required for safe PAC learning of the task system. Since, by playing uniformly at random, Scheduler visits the vertices of infinitely often with probability , it is possible to visit these vertices at least times with arbitrarily high probability.
Also, after we safely PAC learn the distributions for task , since there is a single MEC in , there exists a uniform memoryless strategy to visit a vertex corresponding to task with probability . Hence the result.
In the algorithm above, to obtain one sample of a soft task, we need to reach a particular vertex from which we can safely schedule a new job for the task up to completion. As the underlying MDP can be large (exponential in the description of the task system), the time needed to obtain the next sample cannot be bounded by a polynomial. Hence, this algorithm does not guarantee efficient PAC learning. In the next paragraph, we develop a stronger condition that guarantees the existence of an efficient PAC learning algorithm.
Good for efficient sampling and efficient PAC learning
The safe region of the task system is good for efficient sampling if there exists which is bounded polynomially in the size of , and if, for all soft tasks the two following conditions hold:
let be the set of Scheduler vertices in . There is a non-empty subset of vertices from which there is a strategy for Scheduler to always schedule safely the set of tasks (i.e. all hard tasks and the task ); and
for all , there is a uniform memoryless strategy such that:
is compatible with the safe strategies (for the hard tasks) of ; and
when is executed from any , then the set is reached within steps. By Lemma 2, since has a single MEC, we have that is reachable from every .
Here again, the condition can be efficiently decided: there is a polynomial-time algorithm in the size of that decides if is good for efficient sampling.
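To make the step bound in this condition concrete, the following sketch computes, on an explicit game graph, the least number of steps within which Scheduler can force a visit to a target set whatever TaskGen does. It is a standard backward fixed-point computation (an attractor with ranks) and is not claimed to be the decision procedure referred to above. The assumptions are ours: the safe region is given as an adjacency map `successors`, `scheduler_vertices` lists the vertices where Scheduler moves (all other vertices belong to TaskGen), and "within steps" is read as a worst-case bound.

```python
import math

def worst_case_steps(successors, scheduler_vertices, targets):
    """Least number of steps within which Scheduler can force a visit to `targets`."""
    steps = {v: (0 if v in targets else math.inf) for v in successors}
    changed = True
    while changed:
        changed = False
        for vertex, succs in successors.items():
            if vertex in targets or not succs:
                continue
            values = [steps[s] for s in succs]
            # Scheduler picks the best successor, TaskGen the worst one.
            best = min(values) if vertex in scheduler_vertices else max(values)
            if 1 + best < steps[vertex]:
                steps[vertex] = 1 + best
                changed = True
    return steps

def within_bound(steps, vertices, bound):
    """Check that every vertex in `vertices` reaches the targets within `bound` steps."""
    return all(steps[v] <= bound for v in vertices)

# Toy game graph (hypothetical): "s*" and "t" are Scheduler vertices, "g*" are TaskGen vertices.
toy_successors = {
    "s0": ["g0", "g1"],
    "g0": ["t", "s1"],
    "g1": ["s0"],
    "s1": ["t"],
    "t": ["s0"],
}
toy_scheduler = {"s0", "s1", "t"}
toy_steps = worst_case_steps(toy_successors, toy_scheduler, {"t"})
```

On this toy graph, `within_bound(toy_steps, toy_scheduler, 3)` holds while a bound of 2 fails, since from `s0` the fastest forced route goes through `g0` and TaskGen may detour via `s1`.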
Given a task system that is good for efficient sampling, given , we safely and efficiently learn a model which is -close to with probability at least (efficient PAC guarantees) by applying the following algorithm:
Choose any safe strategy for the hard tasks, and apply this strategy until enough samples for each have been collected according to Lemma 1. The models for tasks are and respectively.
Then for each , apply the following phase:
from the current vertex , play