1 Introduction
In the field of computer science, it is considered challenging to tackle combinatorial optimization problems (COPs) that are computationally intractable. While numerous heuristic approaches have been studied to provide polynomial-time solutions, they often require in-depth knowledge of problem-specific features and customization whenever problem conditions change. Furthermore, several heuristic approaches to solving COPs, such as branching chu1998genetic and tabu search glover1989tabu, explore combinatorial search spaces extensively and thus scale poorly to large problems. Recently, deep learning techniques have proven their feasibility in addressing COPs, e.g., routing optimization
kool2018attention, task scheduling lee2020panda, and the knapsack problem gupointerkp. For deep learning-based COP approaches, it is challenging to build a training dataset with optimal labels, because for many COPs it is computationally infeasible to find exact solutions. Reinforcement learning (RL) is considered viable for such problems, as in neural architecture search zoph2016neural, device placement mirhoseini2017device, and games silver2017mastering, where collecting supervised labels is expensive or infeasible. As the RL action space of COPs can be intractably large (e.g., 100! possible solutions for ranking 100 items), it is undesirable to use a single probability distribution over the whole action space. Thus, a sequential structure, in which a probability distribution over the item to be selected next is iteratively calculated to represent a one-step action, becomes a feasible mechanism for establishing RL-based COP solvers, as has been recently studied in
bello2016neural; vinyals2019grandmaster. The sequential structure is effective for producing a permutation comparable to optimal solutions, but it often suffers from long inference time due to its iterative nature. Therefore, these approaches are not suitable for mission-critical applications with strict service level objectives and time constraints. For example, task placement in SoC devices necessitates fast inferences within a few milliseconds, but the inferences of a complex model with sequential processing often take a few seconds, so it is rarely feasible to employ deep learning-based task placement in SoC devices ykman2006fast; shojaei2009parameterized. In this paper, we present RLRD, an RL-to-rank distillation framework to address COPs, which enables low-latency inference in online system environments. To do so, we develop a novel ranking distillation method and focus on two COPs in which each problem instance can be treated as establishing the optimal policy for ranking items or making priority orders. Specifically, we employ a differentiable relaxation scheme for sorting and ranking operations blondel2020fast to expedite the direct optimization of ranking objectives. It is combined with a problem-specific objective to formulate a distillation loss that corrects the rankings of input items, thus enabling the robust distillation of the ranking policy from a sequential RL model to a non-iterative, score-based ranking model. Furthermore, we explore an efficient sampling technique with the Gumbel trick jang2016categorical; kool2019estimating on the scores generated by the distilled model to expedite the generation of sequence samples and improve model performance under an inference time limit.
Through experiments, we demonstrate that a model distilled by the framework achieves low-latency inference while maintaining performance comparable to its teacher. For example, compared to the high-performance teacher model with sequential RL, the distilled model makes inferences up to 65 times and 3.8 times faster for the knapsack and task scheduling problems, respectively, while showing only about 2.6% and 1.0% degradation in performance. The Gumbel trick-based sequence sampling improves the performance of distilled models (e.g., by 2% for the knapsack) efficiently with relatively small inference delay. The contributions of this paper are summarized as follows.

We present RLRD, an efficient learning-based COP framework that can solve various COPs, in which a low-latency COP model is enabled by differentiable ranking (DiffRank)-based distillation and can be boosted by Gumbel trick-based efficient sequence sampling.

We test the framework with well-known COPs in system areas such as priority-based task scheduling in real-time systems and the multidimensional knapsack problem in resource management, demonstrating the robustness of our framework under various problem conditions.
2 Our Approach
In this section, we describe the overall structure of the RLRD framework with its two building blocks: the deep RL-based COP model structure and the ranking distillation procedure.
Framework Structure
In general, retraining or fine-tuning is needed to adapt a deep learning-based COP model to varying conditions in production system requirements. The RLRD framework supports such model adaptation through knowledge distillation. As shown in Figure 1, (1) for a COP, a learning-to-rank (teacher) policy in the encoder-decoder model is first trained by sequential RL, and (2) it is then transferred through DiffRank-based distillation to a student model with non-iterative ranking operations, according to a given deployment configuration, e.g., requirements on low-latency inference or model size. For instance, a scheduling policy is established by RL to make inferences on the priority order of a set of tasks running on a real-time multiprocessor platform, and it is then distilled into a low-latency model that makes the same inferences under a stringent delay requirement.
Reinforcement Learning-to-Rank
Here, we describe the encoder-decoder structure of our RL-to-rank model for COPs and explain how to train it. Our teacher model is based on a widely adopted attentive structure kool2018attention. In our model representation, we consider learnable parameters (e.g., the weights of affine transformations) and often omit them for simplicity. In an RL-to-rank model, an encoder takes the features of items as input and produces embeddings for the items, and a decoder conducts ranking decisions iteratively on the embeddings, yielding a permutation of the items. This encoder-decoder model is trained end-to-end by RL.
COP Encoder.
In the encoder, the features of each item are first converted into a vector through a simple affine transformation. The resulting matrix of item vectors is then passed through the attention layers, where each attention layer consists of a multi-head attention layer (MHA) and a feed-forward network (FF). Each sublayer is computed with a skip connection and batch normalization (BN). For each layer, the embedding h_i of item i is updated by
(1) ĥ_i = BN(h_i + MHA_i(h_1, …, h_n)), h_i ← BN(ĥ_i + FF(ĥ_i))
where
(2) MHA_i(h_1, …, h_n) = Concat(head_1, …, head_M) W^O,
Concat is the concatenation of tensors, M is a fixed positive integer (the number of heads), and W^O is a learnable parameter. Each head is computed by the attention function AM, given by
(3) AM(Q, K, V) = softmax(QKᵀ / √d_k) V,
where Q, K, and V denote the layer-wise parameterized query, key, and value vaswani2017attention. The output of (1) at the final layer gives the embeddings of the input items, which are used as input to the decoder in the following.
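The encoder computation in (1)-(3) can be sketched in NumPy as follows. This is a minimal single-layer illustration with hypothetical dimensions and randomly initialized parameters, not the actual trained model (which stacks several layers and learns BN statistics):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # AM(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in (3)
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def mha(H, Wq, Wk, Wv, Wo):
    # Multi-head attention sublayer over item embeddings H (N x d), as in (2).
    # Wq, Wk, Wv are per-head projection lists; Wo mixes the concatenated heads.
    heads = [attention(H @ wq, H @ wk, H @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def batch_norm(H, eps=1e-5):
    # Normalize each feature over the N items (stand-in for learned BN).
    return (H - H.mean(0)) / np.sqrt(H.var(0) + eps)

def encoder_layer(H, params):
    # (1): skip connection + BN around MHA, then around a feed-forward net.
    Wq, Wk, Wv, Wo, W1, W2 = params
    H = batch_norm(H + mha(H, Wq, Wk, Wv, Wo))
    ff = np.maximum(H @ W1, 0.0) @ W2          # ReLU feed-forward
    return batch_norm(H + ff)

rng = np.random.default_rng(0)
N, d, n_heads, d_head, d_ff = 5, 8, 2, 4, 16   # hypothetical sizes
params = (
    [rng.normal(size=(d, d_head)) for _ in range(n_heads)],
    [rng.normal(size=(d, d_head)) for _ in range(n_heads)],
    [rng.normal(size=(d, d_head)) for _ in range(n_heads)],
    rng.normal(size=(n_heads * d_head, d)),
    rng.normal(size=(d, d_ff)),
    rng.normal(size=(d_ff, d)),
)
H = rng.normal(size=(N, d))                    # item vectors after the affine map
out = encoder_layer(H, params)
print(out.shape)                               # one embedding per item: (5, 8)
```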
Ranking Decoder.
With the embeddings of the N items from the encoder, the decoder sequentially selects items to obtain an N-sized permutation (π_1, …, π_N), whose distinct integers correspond to the indices of the items. That is, item π_1 is selected first and assigned the highest ranking (priority), item π_2 is assigned the second, and so on. Specifically, the decoder establishes a function to rank items stochastically,
(4) 
where each factor represents the probability that item π_k is assigned the k-th rank.
From an RL formulation perspective, in (4), the information about the items, including those already assigned rankings (π_1, …, π_{k−1}) up to timestep k, corresponds to the state, and selecting π_k corresponds to the action. That is, a state contains a partial solution over all permutations, and an action is a one-step inference that determines the next ranked item. Accordingly, the stochastic ranking function above can be rewritten as a parameterized policy for each timestep.
(5) 
This policy is learned from problem-specific reward signals. To establish such a policy via RL, we formulate a learnable score function of each item conditioned on the current state, which is used to estimate the policy, e.g.,
(6) 
where one term is the embedding of the item selected at the corresponding timestep, and the other is a global vector obtained by
(7) 
Note that the initial input is randomly initialized. To incorporate the alignment between the item embeddings and the global vector into the score, we use attention vaswani2017attention,
(8) 
for the query and key vectors, where the projection matrices are learnable parameters. Finally, we have the policy that calculates the probability that each remaining item is selected next upon the current state.
(9) 
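One decoding step of this policy can be sketched as follows. The score function here is a hypothetical stand-in for (6)-(8); only the masking-and-softmax mechanics of (9) are illustrated:

```python
import numpy as np

def decode_ranking(scores_fn, n_items, greedy=True, rng=None):
    # Iteratively build a permutation: at each step, score the items,
    # mask out those already ranked, and pick the next item from the
    # softmax distribution over the rest (cf. (9)).
    ranked, mask = [], np.zeros(n_items, dtype=bool)
    for _ in range(n_items):
        s = scores_fn(ranked)              # hypothetical state-dependent scores
        s = np.where(mask, -np.inf, s)     # already-ranked items get zero mass
        p = np.exp(s - s.max())
        p /= p.sum()
        i = int(np.argmax(p)) if greedy else int(rng.choice(n_items, p=p))
        ranked.append(i)
        mask[i] = True
    return ranked

# Toy score function: static scores that ignore the partial ranking.
scores = np.array([0.2, 1.5, -0.3, 0.9])
perm = decode_ranking(lambda ranked: scores, 4)
print(perm)  # greedy decoding yields items in descending score order: [1, 3, 0, 2]
```

Setting `greedy=False` with a `numpy` random generator gives the stochastic variant used for sampling-based models.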
Training.
To train the encoder-decoder end-to-end, we use the REINFORCE algorithm williams1992simple, which is effective for episodic tasks, e.g., problems formulated as ranking items. Suppose that for a problem of N items, we obtain an episode
(10) τ = (s_1, a_1, r_1, …, s_N, a_N, r_N)
acquired by policy π_θ, where s_t, a_t, and r_t are state, action, and reward samples. We set the goal of model training to maximize the expected total discounted reward,
(11) J(θ) = E_{τ∼π_θ}[ Σ_{t=1}^{N} γ^{t−1} r_t ]
where γ is a discount rate, and we use policy gradient ascent,
(12) θ ← θ + α Σ_{t=1}^{N} (G_t − b(s_t)) ∇_θ log π_θ(a_t | s_t).
Note that G_t = Σ_{t'=t}^{N} γ^{t'−t} r_{t'} is the return, α is a learning rate, and b(s_t) is a baseline used to accelerate the convergence of model training.
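The return and advantage computation used by this update can be sketched as follows, with toy reward and baseline values assumed purely for illustration:

```python
import numpy as np

def returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}: discounted return at each timestep,
    # computed by a single backward pass over the episode.
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# Toy 3-step episode: the gradient of each log pi(a_t|s_t) is weighted by
# the advantage (G_t - b_t); the baseline values b_t are assumed here to
# come from, e.g., a greedy-policy episode.
rewards  = [0.0, 0.0, 1.0]
baseline = [0.2, 0.3, 0.5]
G = returns(rewards, gamma=0.9)
advantages = [g - b for g, b in zip(G, baseline)]
print([round(g, 2) for g in G])     # [0.81, 0.9, 1.0]
```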
LearningtoRank Distillation
In the RLRD framework, the ranking decoder repeats its selection step N times to rank N items through its sequential structure. While the decoder structure is intended to extract the relational features of items that have not yet been selected, the high computational complexity of iterative decoding hinders the application of the framework to mission-critical systems. To enable fast inferences without significant degradation in model performance, we employ knowledge distillation from the RL-to-rank model with iterative decoding to a simpler model. Specifically, we use a non-iterative, score-based ranking model as the student in knowledge distillation, which takes the features of items as input and directly produces a score vector for the items as output. The score vector is then used to rank the items.
For N items, the RL-to-rank model produces a ranking vector as supervised label y, and through distillation, the student model learns to produce a score vector s maximizing the similarity between y and the corresponding ranking of s, rank(s). For example, given a score vector s = (0.1, 0.7, 0.4) for 3 items, sorting the scores in descending order yields rank(s) = (3, 1, 2). The ranking distillation loss is defined as
(13) 
where the evaluation metric is a differentiable measure of the similarity of two ranking vectors. We use mean squared error (MSE), because minimizing the MSE of two ranking vectors is equivalent to maximizing the Spearman-rho correlation of the two rankings y and rank(s).
Differentiable Approximated Ranking.
To distill with the loss in (13) using gradient descent, the ranking function rank needs to be differentiable with non-vanishing gradients. However, differentiating rank suffers from vanishing gradients because a slight shift of the score vector s does not usually change the corresponding ranking. Thus, we revise the loss in (13) using an approximated ranking function with nonzero gradients, in the same way as blondel2020fast.
Consider a score vector s ∈ R^N and a permutation σ, a bijection from {1, …, N} to itself. A descending sorted list of s is represented as
(14) s_{σ(1)} ≥ s_{σ(2)} ≥ ⋯ ≥ s_{σ(N)}.
Accordingly, the ranking function rank : R^N → {1, …, N}^N is formalized as
(15) rank(s) = σ^{−1},
where σ^{−1} is the inverse of σ, which is also a permutation. For example, consider s = (0.3, 0.9, 0.5). Its descending sort is (0.9, 0.5, 0.3), so we have σ = (2, 3, 1) and σ^{−1} = (3, 1, 2). Accordingly, we have rank(s) = (3, 1, 2).
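The rank function and its connection to the Spearman-rho correlation used by the MSE-based loss can be sketched as follows (1-based ranks):

```python
import numpy as np

def rank(s):
    # rank(s) = sigma^{-1}: rank(s)[i] is the 1-based position of item i
    # when the scores are sorted in descending order.
    sigma = np.argsort(-s)                 # descending sort permutation sigma
    inv = np.empty_like(sigma)
    inv[sigma] = np.arange(1, len(s) + 1)  # invert the permutation
    return inv

def spearman_rho(ra, rb):
    # Spearman's rho for two rankings without ties:
    # rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)), an affine decreasing function
    # of the squared rank differences -- so minimizing the MSE between two
    # ranking vectors maximizes their Spearman correlation.
    d = np.asarray(ra) - np.asarray(rb)
    n = len(d)
    return 1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1))

s = np.array([0.3, 0.9, 0.5])
print(rank(s))                                      # [3 1 2]
print(spearman_rho(rank(s), np.array([3, 1, 2])))   # 1.0
```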
To implement DiffRank, a function rank_ε is used, which approximates rank in a differentiable way with nonzero gradients, such as
(16) rank_ε(s) = argmin_{μ ∈ P} ‖μ + s/ε‖²,
where P is called a permutahedron, the convex hull generated by the permutations of (1, …, N). As explained in blondel2020fast, the function rank_ε converges to rank as ε → 0, while it always preserves the order of the scores. That is, given σ in (14) and r = rank_ε(s), we have r_{σ(1)} ≤ r_{σ(2)} ≤ ⋯ ≤ r_{σ(N)}.
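As a concrete illustration of a differentiable ranking relaxation, the sketch below uses a simple pairwise-sigmoid approximation rather than the permutahedron projection of blondel2020fast; it shares the key properties discussed here (nonzero gradients for ε > 0, convergence to the exact descending rank as ε → 0):

```python
import numpy as np

def soft_rank(s, eps=0.1):
    # Relaxed ranking via pairwise sigmoids: the soft rank of item i is
    # 1 + sum_{j != i} sigmoid((s_j - s_i) / eps). Each pairwise comparison
    # has a nonzero gradient for eps > 0; as eps -> 0 the sigmoids saturate
    # and the result approaches the exact descending rank.
    diff = (s[None, :] - s[:, None]) / eps
    sig = 1.0 / (1.0 + np.exp(-diff))
    np.fill_diagonal(sig, 0.0)             # no self-comparison
    return 1.0 + sig.sum(axis=1)

s = np.array([0.3, 0.9, 0.5])
print(np.round(soft_rank(s, eps=0.01), 3))   # ~[3. 1. 2.], close to rank(s)
print(np.round(soft_rank(s, eps=1.0), 3))    # softer, but order-preserving
```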
In addition, we also consider a problem-specific loss. For example, in the knapsack problem, an entire set of items can be partitioned into two groups, one of selected items and the other of unselected items. We can penalize the difference between the partitions obtained from the label y and from the output score vector s by a problem-specific function. Finally, the total loss is given by
(17) 
a weighted sum of the ranking distillation loss in (13) and the problem-specific loss. The overall distillation procedure is illustrated in Algorithm 1.
Here, we present the explicit non-vanishing gradient form of our ranking loss function; its proof can be found in Appendix A.
Proposition 1.
Fix . Let as in (16) and where . Let ,
(18) 
and . Then, we have
(19)  
(20) 
where @ denotes matrix multiplication, the square matrix shown has all entries equal to one, and the remaining symbol is a permutation. Here, for any matrix, the permuted matrix denotes the row and column permutation of that matrix according to the given permutation.
Efficient Sequence Sampling.
As explained, we use a score vector as in (14) to obtain its corresponding ranking vector deterministically. On the other hand, if we treat such a score vector as unnormalized log-probabilities of a categorical distribution over items, we can sample items from the distribution without replacement and obtain a random ranking vector for the items. Here, the condition of sampling without replacement specifies that the distribution is renormalized so that it sums to 1 each time an item is sampled. This N-times drawing and renormalization increases the inference time. Therefore, to compute rankings rapidly, we exploit the Gumbel trick gumbel1954maxima; maddison2014sampling.
Given a score vector s, consider the random variables
(21) G_k = s_k + Z_k, where Z_k ∼ Gumbel(0, 1) i.i.d.,
and suppose that the perturbed scores G_k are sorted in descending order as in (14). Note that the resulting ranks are random variables.
Theorem 1.
(Appendix A in kool2019estimating.) For each k,
(22) 
where the probability on the right-hand side is the categorical probability defined by the softmax of the scores s.
This sampling method reduces the complexity of obtaining each ranking vector instance from quadratic (sequentially sampling each of the N items from a categorical distribution) to log-linear (sorting a perturbed list) in the number of items N, improving the efficiency of our model significantly.
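The Gumbel trick-based sampling can be sketched as follows; perturbing the scores once and sorting replaces the N sequential draws:

```python
import numpy as np

def gumbel_rank_sample(s, rng):
    # Perturb each score with i.i.d. Gumbel(0, 1) noise and sort once:
    # the resulting permutation is distributed as sampling the N items
    # without replacement from softmax(s) -- O(N log N) instead of N
    # sequential categorical draws with renormalization.
    g = rng.gumbel(size=len(s))
    return [int(i) for i in np.argsort(-(s + g))]

rng = np.random.default_rng(0)
s = np.array([2.0, 0.5, 1.0, -1.0])   # unnormalized log-probabilities
samples = [gumbel_rank_sample(s, rng) for _ in range(5)]
print(samples[0])                     # one ranking sample: a permutation of 0..3
```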
3 Experiments
In this section, we evaluate our framework on the multidimensional knapsack problem (MDKP) and global fixed-priority task scheduling (GFPS) davis2016review. The problem details, including the RL formulation, data generation, and model training, can be found in Appendices B and C.
Multidimensional Knapsack Problem (MDKP)
In MDKP, given items with values and multidimensional weights, each item is either selected or not for a knapsack with multidimensional capacities, with the goal of maximizing the total value of the selected items.
For evaluation, we use the performance (the achieved value in a knapsack) of GLOP implemented in OR-tools ortools as a baseline. We compare several models in our framework. RL is the RL-to-rank teacher model, and RD is the distilled student model. RLS is the RL model variant with sampling, and RDG is the RD model variant with Gumbel trick-based sequence sampling. In RLS, the one-step action in (9) is conducted stochastically, while in RL, it is conducted greedily. For both RLS and RDG, we set the number of ranking samples to 30 for each item set and report the highest score among the samples. In addition, we test a Greedy method that exploits the mean ratio of item weight and value.
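A minimal sketch of such a greedy baseline is shown below, with hypothetical item data; the exact ratio definition used in our experiments may differ in detail:

```python
import numpy as np

def greedy_mdkp(values, weights, capacity):
    # Greedy baseline: rank items by value over mean capacity-normalized
    # weight, then select each item in that order if it still fits in
    # every weight dimension.
    ratio = values / (weights / capacity).mean(axis=1)
    chosen, remaining = [], capacity.astype(float).copy()
    for i in np.argsort(-ratio):
        if np.all(weights[i] <= remaining):
            chosen.append(int(i))
            remaining -= weights[i]
    return chosen, values[chosen].sum()

# Hypothetical instance: 4 items, 2-dimensional weights and capacities.
values   = np.array([10.0, 7.0, 8.0, 4.0])
weights  = np.array([[5, 4], [3, 2], [4, 5], [2, 2]], dtype=float)
capacity = np.array([8.0, 7.0])
chosen, total = greedy_mdkp(values, weights, capacity)
print(chosen, total)  # [1, 0] 17.0
```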
Model Performance.
Table 1 shows the performance in MDKP, where GAP denotes the performance ratio to the baseline GLOP and Time denotes the inference time.

RD shows performance comparable to its high-performance teacher RL, with an insignificant degradation of 2.6% on average over all cases. More importantly, RD achieves efficient, low-latency inferences, e.g., 23 and 65 times faster than RL for the N=50 and N=150 cases, respectively.

RDG outperforms RL by 0.3% on average and also achieves 4.4 and 20 times faster inferences than RL for the N=50 and N=150 cases, respectively. Moreover, RDG shows 2% higher performance than RD, while its inference time is increased by 3.7 times.

RLS shows 1.8% higher performance than the RL model. However, unlike RDG, the inference time of RLS increases linearly with the number of ranking samples (i.e., about a 30-times increase for 30 samples).

As N increases, all methods show longer inference times, but the increase for GLOP is much larger than for RL and RD. For example, as N increases from 50 to 150 in the zero-correlation case, the inference time of GLOP increases by 39 times, while RL and RD show 9.3 and 3.1 times increases, respectively.

The performance of Greedy degrades drastically in the high-correlation case (0.9). This is because the weight-value ratio of items becomes less useful when the correlation is high. Unlike Greedy, our models show stable performance in both the high- and low-correlation cases.
Priority Assignment Problem for GFPS
In GFPS, each task in a set of N periodic tasks is assigned a priority (an integer from 1 to N) to be scheduled. GFPS with a priority order (or a ranking list) schedules the m highest-priority tasks in each time slot upon a platform comprised of m processors, with the goal of not incurring any deadline violation of the periodic tasks.
For evaluation, we need to choose a schedulability test for GFPS that determines whether a task set is deemed schedulable by GFPS with a priority order. We target a schedulability test called RTA-LC Guan2009; Davis2011, which has been known to outperform the others in terms of covering schedulable task sets. We compare our models with Audsley's Optimal Priority Assignment (OPA) Audsley1991; Audsley2001 with the state-of-the-art OPA-compatible DA-LC test Davis2011, which is known to have the highest performance among heuristic algorithms. As in MDKP, we denote our models as RL, RLS, RD, and RDG. For both RLS and RDG, we limit the number of ranking samples to 10 for each task set.
Model Performance.
Table 2 shows the performance in terms of the schedulability ratio of GFPS with respect to different task set utilization settings on an m-processor platform with N tasks.

Our models all show better performance than OPA, indicating the superiority of the RLRD framework. The performance gap is relatively large in the intermediate utilization ranges, because those ranges provide more opportunities to optimize with a better strategy. For example, when m=8, N=64, and Util=6.3, RL and RD show 15.9% and 11.9% higher schedulability ratios than OPA, respectively, while when m=8, N=64, and Util=5.7, their gains are 4.9% and 4.6%, respectively.

The performance difference between RD and its teacher RL is about 1% on average, while the inference time of RD is decreased (improved) by 3.8 times. This clarifies the benefit of the ranking distillation.

As the utilization (Util) increases, the inference times of RLS and RDG become longer, due to multiple executions of the schedulability test up to the predefined limit (i.e., 10 times). On the other hand, the inference time of OPA decreases for large utilization; the loop of OPA terminates when a task cannot satisfy its deadline under the assumption that all other priority-unassigned tasks have higher priorities than that task.

RDG shows performance comparable to, and often slightly higher than, RLS. This is the opposite of the pattern in MDKP, where RLS achieves the best performance. While a direct comparison is not entirely valid due to the different sampling methods, we note the possibility that a distilled student can perform better than its teacher in some cases; similar patterns are observed in tangrankdistil; kim2016sequence.
Analysis on Distillation
Effects of Iterative Decoding.
To verify the feasibility of distillation from sequential RL to a score-based ranking model, we measure the difference between the outputs of iterative decoding and greedy sampling. When the decoder generates the ranking distribution at a timestep and takes an action as in (9), masking the component of the selected item and renormalizing the distribution yields a renormalized distribution. In addition, consider the probability distribution actually generated by the decoder at the next timestep; we compare these two distributions.
Figure 2 illustrates the difference of the distributions in terms of KL-divergence on three specific COPs: the previously explained MDKP and GFPS, as well as the Traveling Salesman Problem (TSP). As shown, MDKP and GFPS maintain a low divergence value, implying that the ranking result of decoding with many iterations can be approximated by decoding with no or few iterations. Unlike MDKP and GFPS, TSP shows a large divergence value, implying that many decoding iterations are required to obtain an optimal path. Indeed, in TSP, we obtain good performance with RL (e.g., 2% better than a heuristic method), but we hardly achieve comparable performance when we test RD. The experiments and performance for TSP can be found in Appendix D.
Effects of Distillation Loss.
To evaluate the effectiveness of DiffRank-based distillation, we implement other ranking metrics, such as the pairwise metric of ranknet and the listwise metric of listnet, and test them in the framework as distillation losses.
Tables 3 and 4 show the performance in MDKP and GFPS, respectively, achieved by the different distillation loss functions, where RD denotes our distilled model trained with the DiffRank-based distillation loss, and the performance of the other two is represented as a ratio to RD. Note that all use the same RL model as the teacher in this experiment.
As shown, RD achieves consistently better performance than the others in most cases. Unlike RD, the other methods commonly show data-dependent performance patterns. The pairwise method (with the pairwise distillation loss) achieves performance similar to or slightly lower than RD in MDKP but shows much lower performance than RD in GFPS. The listwise method shows the worst performance in many cases in both MDKP and GFPS, except for a few cases in MDKP. These results are consistent with the implication of Figure 2 that GFPS has a larger divergence than MDKP and is thus more difficult to distill, giving a larger performance gain to RD.
4 Related Work
Advanced deep neural networks combined with RL algorithms have shown the capability to address various COPs in a data-driven manner with less problem-specific customization. In bello2016neural, the pointer network was introduced to solve TSP and other geometric COPs, and in kool2018attention, a transformer model was incorporated for better generalization. Besides the pointer network, a temporal-difference-based model showed positive results on the job-shop problem zhangjobshop, and deep RL-based approaches such as Q-learning solvers were explored for the knapsack problem kpqlearning. Several attempts have also been made to address practical cases formulated as knapsack problems, e.g., maximizing user engagement under business constraints homepagerelevance; emailvolumeoptimize. Particularly, in the field of computer systems and resource management, there have been several works using deep RL to tackle system optimization under multiple, heterogeneous resource constraints in the form of COPs, e.g., cluster resource management mao2016resource; mao2018learning and compiler optimization chen2018tvm. While we leverage deep RL techniques to address COPs in the same vein as those prior works, we focus on efficient, low-latency COP models.
Ranking problems, such as prioritizing input items based on scores, have been studied in the fields of information retrieval and recommendation systems. A neural network-based rank optimizer using a pairwise loss function was first introduced in ranknet, and other ranking objective functions were developed to optimize relevant metrics with sophisticated network structures. For example, Bayesian Personalized Ranking rendle2012 is known to maximize the AUC score of given item rankings with labeled data. Although these approaches can bypass the non-differentiability of ranking operations, the optimization is limited to predefined objectives such as NDCG or AUC; thus, it is difficult to apply them to COPs because these objectives do not completely align with the COP objectives. To optimize arbitrary objectives involving non-differentiable operations such as ranking or sorting, several works focused on smoothing non-differentiable ranking operations grover2019stochastic; blondel2020fast. They are commonly intended to make arbitrary objectives differentiable by employing relaxed sorting operations.
Knowledge distillation based on the teacher-student paradigm has been an important topic in machine learning for building compressed models hinton2015distilling and has shown many successful practices in image classification pmlr-v139-touvron21a; kim2016sequence; jiaoetal2020tinybert. However, knowledge distillation to ranking models has not been fully studied. A ranking distillation for recommendation systems was introduced in tangrankdistil, and recently, a general distillation framework, RankDistil Reddi2021RankDistilKD, was presented with several loss functions and optimization schemes specific to top-ranking problems. These works exploited pairwise objectives and sampling-based heuristics to distill a ranking model, but rarely focused on arbitrary objectives and sequential models, which are required to address various COPs. The distillation of sequential models was investigated in several works kim2016sequence; jiaoetal2020tinybert. However, to the best of our knowledge, our work is the first to explore distillation from sequential RL models to score-based ranking models.
5 Conclusion
In this paper, we presented a distillation-based COP framework that achieves an efficient model with high performance. Through experiments, we demonstrated that it is feasible to distill the ranking policy of deep RL into a score-based ranking model without compromising performance, thereby enabling low-latency inference on COPs.
Our future work is to adapt our RL-based encoder-decoder model and distillation procedure to various COPs with different degrees of consistency between embeddings and decoding results, and to explore meta-learning for fast adaptation across different problem conditions.
Acknowledgement
We would like to thank anonymous reviewers for their valuable comments and suggestions.
This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) under Grant 2021000875 and 2021000900, by the ICT Creative Consilience Program supervised by the IITP under Grant IITP2020001821, by Kakao I Research Supporting Program, and by Samsung Electronics.
References
Appendix A Proof of proposition 1
Proof.
By the chain rule, we have
where the first part follows directly from elementary calculus. For the second part, we note that the approximated ranking is given by a projection onto the permutahedron generated by w. Applying the chain rule and Proposition 4 in blondel2020fast, we obtain
for some integers partitioning the coordinates and some permutation. This completes the proof. ∎
Here, we can analyze how the approximated ranking function behaves as we tune the hyperparameter ε. We first formulate the function in terms of an isotonic optimization problem,
(A.1) 
(A.2) 
It is known that the optimal solution is block-wise constant, and the optimal value of each term inside the ArgMin of (A.2) is the within-block average. Fix a vector s for a moment in the following.
If we take ε large, the values to be projected become more out-of-order, so the partition appearing in the pool adjacent violators (PAV) algorithm pavalgorithm becomes coarser (note that our objective in (A.1) is ordered decreasingly). This implies that the block diagonal matrix on the right-hand side of (20) in Proposition 1 becomes more uniform. When that right-hand side is multiplied by the reciprocal of ε in the same equation, the gradient becomes small but uniform.
On the other hand, if we take ε small, the subtraction of the block diagonal matrix from I has more zero entries. This is because the optimization has a much higher chance of having in-order solutions, so the block diagonal matrix in (20) tends to an identity matrix. In the training phase, the error term becomes small if the score s leads to correct predictions for the given ranking labels; accordingly, such scores are not engaged in the backpropagation.
In summary, we can control the behavior between "hard" rankings with exact gradients (small ε) and "soft" rankings with uniform, small gradients (large ε) by tuning the hyperparameter.
Appendix B Multidimensional Knapsack Problem (MDKP)
Problem Description
Given is an N-sized item set, where each item is represented by a value and multidimensional weights; the knapsack has capacities of the same dimensions.
In the RL formulation, a state at each timestep consists of the set of allocable items and the set of the other items. An item is allocable if and only if it has not yet been placed in the knapsack and adding it does not exceed the remaining capacity in any dimension. An action corresponds to selecting an allocable item and placing it in the knapsack. Considering the objective of MDKP, maximizing the total value of the selected items, a reward is given by
(B.1) 
When no allocable items remain, the episode terminates. Accordingly, the model is learned to establish a policy that maximizes the total reward in (11) and thus optimizes the total value achieved on a testing set, e.g.,
(B.2) 
where the sum is taken over a testing set of item sets, the indices denote the selected items, and the episode length is the number of selections in an episode. We set each episode to have a single item set of N items.
Dataset Generation
Given the number of items N, the multidimensional weights of each item are randomly generated from a uniform distribution over integers, and the item value is given by
(B.3) 
where a parameter manipulates the correlation between weights and values. The multidimensional capacities of a given knapsack are set accordingly. The ranges of parameter values for items are given in Table B.5.
Item Parameter  Values 

Num. of items  {50, 100, 150} 
Weights dim.  {3, 10, 15} 
Max weights  {200, 500} 
Correlation  {0, 0.9} 
As input to the encoder, each item is represented by a raw feature vector. Given a vector z, we denote its maximum, minimum, and mean by Max(z), Min(z), and Mean(z), respectively. For each item, we also use its utilization, the ratio of its weights to the capacities. These input raw features are listed in Table B.6.
Model Structure
In MDKP, we use an encoder structure for the student network in the RL-to-rank distillation. Specifically, each item is converted into an embedding through learnable parameters, giving an embedding matrix. A global representation of the embeddings is calculated by
(B.4) 
for an arbitrarily initialized learnable parameter. Then, the score of each item is defined by
(B.5) 
which is similar to (6) for a single timestep. The detailed hyperparameter settings for the teacher RL and distilled student models are given in Table B.7.
Model Training
In training with the REINFORCE algorithm, we leverage greedy algorithms to establish the baseline in (12). Suppose that by using a specific greedy algorithm that selects the item with the highest weight-value ratio at each timestep, say a greedy policy, we obtain a baseline episode. Then, the baseline is established upon that episode by
(B.6) 
for the discount factor γ.
For the problem-specific loss function in (17), consider the student network generating a score vector s. Then, if each item is selected according to the rankings based on the scores in s, we obtain a selection-indicator vector with
(B.7) 
Similarly, if each item is selected according to the rankings based on the supervised labels y from the teacher RL model, we obtain another indicator vector. Accordingly, we define the problem-specific loss such that
(B.8) 
which penalizes the difference between the selections made by the rankings based on the supervised labels and those based on the scores.
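A sketch of this penalty is shown below, with hypothetical weights and capacities; the squared-difference form is one plausible instantiation of (B.8):

```python
import numpy as np

def selection_vector(order, weights, capacity):
    # Simulate knapsack filling in the given ranking order; x[i] = 1 if
    # item i ends up selected, else 0.
    x, remaining = np.zeros(len(order)), capacity.astype(float).copy()
    for i in order:
        if np.all(weights[i] <= remaining):
            x[i] = 1.0
            remaining -= weights[i]
    return x

def partition_loss(teacher_rank, student_scores, weights, capacity):
    # Penalize disagreement between the item partitions (selected vs. not)
    # induced by the teacher ranking and by the student scores.
    t_order = np.argsort(teacher_rank)        # rank 1 goes first
    s_order = np.argsort(-student_scores)     # highest score goes first
    xt = selection_vector(t_order, weights, capacity)
    xs = selection_vector(s_order, weights, capacity)
    return ((xt - xs) ** 2).mean()

# Hypothetical 3-item instance with 2-dimensional weights.
weights  = np.array([[3.0, 2.0], [5.0, 4.0], [2.0, 2.0]])
capacity = np.array([6.0, 5.0])
teacher  = np.array([1, 3, 2])                # teacher ranks item 0 first
student  = np.array([0.2, 0.9, 0.4])          # student would pick item 1 first
print(partition_loss(teacher, student, weights, capacity))  # 1.0
```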
We implement our models using Python 3.6 and PyTorch 1.8, and train them on a system with an Intel(R) Core(TM) i9-10940X processor and an NVIDIA RTX 3090 GPU.
Comparison with an Optimal Solver
We also test SCIP, an optimal mixed integer programming solver implemented in OR-tools, and observe that its inference is slow, as shown in Table B.8. For example, SCIP achieves 8 times lower performance than ours (RD, RDG) under a 3-second time limit, and only about 5% higher performance than ours under a 600-second time limit. As such, SCIP shows an insignificant performance gain at the cost of inference times that are unacceptably slow for mission-critical environments.
Appendix C Global Fixed-Priority Scheduling (GFPS)
Problem Description
In GFPS, each task in a set of periodic tasks is assigned a distinct priority to be scheduled. That is, GFPS with a priority order (or a ranking list) schedules the highest-priority tasks in each time slot upon a platform of homogeneous processors, without incurring any deadline violation of the periodic tasks over time.
Specifically, given a task set where each task is specified by its period, deadline, and worst-case execution time (WCET), we formulate an RL procedure for GFPS as below. A state is set to have two task sets for each timestep, where the former contains the priority-assigned tasks and the latter contains the remaining tasks; the two sets partition the whole task set at every timestep. An action implies that the selected task is assigned the highest priority among the remaining tasks. A reward is calculated by
(C.1) 
where the generated priority order is evaluated by a given schedulability test Guan2009, which maps to 1 or 0 depending on whether or not the task set is schedulable by GFPS with that priority order.
With the RL components above, a model is learned to establish a policy in (9) that generates, for each task set, a priority order passing the schedulability test, if one exists. The overall performance is evaluated upon a set of task set samples.
(C.2) 
Note that we focus on the implicit-deadline problem, meaning that the period and deadline of each task are the same, under non-preemptive situations.
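To make the reward of (C.1) concrete, the sketch below simulates global fixed-priority scheduling over one hyperperiod and returns whether all deadlines are met. This is an illustrative stand-in only: it simulates a preemptive, synchronous-release, implicit-deadline system in discrete time, not the schedulability test of Guan2009 used in the paper.

```python
from math import gcd
from functools import reduce

def gfps_schedulable(tasks, priority, m):
    """Simulate GFPS on m processors over one hyperperiod.
    tasks: list of (period, wcet) with implicit deadlines (deadline = period).
    priority[i]: rank of task i (lower value = higher priority).
    Returns True iff no job misses its deadline in the simulation."""
    H = reduce(lambda a, b: a * b // gcd(a, b), [p for p, _ in tasks])
    remaining = [0] * len(tasks)              # unfinished work per task
    for t in range(H):
        for i, (p, c) in enumerate(tasks):
            if t % p == 0:                    # new job released
                if remaining[i] > 0:          # previous job missed its deadline
                    return False
                remaining[i] = c
        ready = sorted((i for i in range(len(tasks)) if remaining[i] > 0),
                       key=lambda i: priority[i])
        for i in ready[:m]:                   # run the m highest-priority jobs
            remaining[i] -= 1
    return all(r == 0 for r in remaining)     # last jobs must also finish
```

A priority order receives reward 1 exactly when such a test accepts it, which is what the RL policy is trained to achieve for as many task sets as possible.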
Dataset Generation
To generate task set samples, we exploit the Randfixedsum algorithm Emberson2010, which has been widely used in research on scheduling problems brandenburg2016global; gujarati2015multiprocessor. For each task set, we first configure the number of tasks and the total task set utilization. The Randfixedsum algorithm then randomly selects a utilization for each task such that the per-task utilizations sum to the total. The algorithm also generates a set of task parameter samples, each of which follows the rules in Table C.9, yielding values for the period, WCET, and deadline of each task under its predetermined utilization.
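The utilization-splitting step can be illustrated with the simpler UUniFast scheme, shown below as a stand-in: the paper uses Randfixedsum, which additionally keeps each per-task utilization within fixed bounds, so this sketch only conveys the "random utilizations with a fixed sum" idea. The `make_task` helper, with its hypothetical field names, then derives WCET and implicit deadline from a utilization and a period:

```python
import random

def uunifast(n, total_u, seed=0):
    """Draw n nonnegative task utilizations summing to total_u
    (UUniFast; a simplified stand-in for Randfixedsum)."""
    rng = random.Random(seed)
    utils, remaining = [], total_u
    for k in range(n - 1, 0, -1):
        next_remaining = remaining * rng.random() ** (1.0 / k)
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils

def make_task(u, period):
    """Derive WCET C = u * T and implicit deadline D = T from utilization u."""
    return {"T": period, "C": u * period, "D": period}
```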
Table C.9 summarizes the generation rule for each task parameter (period, WCET, and deadline); in particular, the deadline is set equal to the period for the implicit-deadline model.
The three properties (period, WCET, and deadline) are transformed into a vector representation of each task, as shown in Table C.10.
Model Structure
In GFPS, we use a linear network for the student model. Specifically, each task is transformed to its corresponding score by
(C.3) 
for some learnable parameters. Hyperparameter settings for the teacher RL and distilled student models are shown in Table C.11.
Appendix D Travelling Salesman Problem (TSP)
TSP is the problem of assigning visiting priorities to points so that the total distance to visit all points is minimized. Specifically, given a set of points, a state consists of two sets at each timestep: the set of priority-assigned points and the set of priority-unassigned points. An action corresponds to assigning the highest remaining rank to a point, so that it is visited at the corresponding timestep. We define a reward by
(D.1) 
where the episode yields a priority order of all points, and the function Dist calculates the total distance of visiting the points in that order.
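A minimal sketch of such a Dist function, assuming a closed tour that returns to the starting point (the standard TSP convention; whether (D.1) closes the tour is not recoverable from the text):

```python
from math import dist

def tour_length(points, order):
    """Total Euclidean distance of visiting all points in the given
    priority order and returning to the start."""
    n = len(order)
    return sum(dist(points[order[k]], points[order[(k + 1) % n]])
               for k in range(n))
```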
Here, we report the TSP performance of RL-based models, compared to a random insertion heuristic, in Table D.12. For this experiment on TSP, we use the open-source implementation of kool2018attention, which has an encoder-decoder structure with sequential processing, similar to ours.
Appendix E Limitations and Generality of Our Approach
We analyze the characteristics of various COPs in Figure 2 of the main manuscript, demonstrating that RLRD works well for MDKP and GFPS but not for TSP. Here, we provide a more detailed explanation by comparing TSP with sorting.
Consider a small list of numbers to sort and the corresponding unnormalized distribution obtained from them. Upon greedily sampling the item with the largest probability, a new set is obtained and has to be sorted. By masking the index of the sampled item, a new unnormalized probability distribution is obtained, and sorting is conducted by greedily sampling from it again. In other words, sorting can be perfectly solved by inductively performing greedy sampling without replacement, starting from one single distribution. On the other hand, in TSP, the probability distribution over items highly depends on the items chosen at previous steps, i.e., the context, as we described. In this sense, we consider TSP to be fully context-dependent, because no such single distribution exists for TSP, in contrast to sorting. If the student model performs greedy sampling without replacement from one distribution for a context-dependent problem, it might end up with suboptimal performance.
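The sorting argument can be made concrete with a short sketch: one fixed unnormalized distribution over the items suffices, and repeated greedy sampling with masking recovers the sorted (descending) order. All names here are illustrative.

```python
import math

def sort_by_greedy_sampling(xs):
    """Sorting as greedy sampling without replacement: build one fixed
    unnormalized distribution exp(x_i), then repeatedly take the
    highest-probability unmasked item and mask its index. Because exp is
    monotone, this yields the items in descending order; no re-computation
    of the distribution is needed, unlike a context-dependent problem
    such as TSP."""
    probs = [math.exp(x) for x in xs]     # single fixed distribution
    masked, order = set(), []
    for _ in xs:
        i = max((j for j in range(len(xs)) if j not in masked),
                key=lambda j: probs[j])
        order.append(xs[i])
        masked.add(i)
    return order
```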
Many COPs lie in between sorting (context-free) and TSP (fully context-dependent) in terms of the degree of context dependency, and we consider them a feasible problem space for RLRD. We summarize such characteristics of MDKP and GFPS, compared with TSP, in the table below.
Appendix F Choice of the Number of Samplings
In this section, we explain our default settings for the number of Gumbel samplings, which are set to 30 and 10 for the MDKP and GFPS problems, respectively.
Figure F.3 shows the performance and inference time in MDKP and GFPS with respect to the Gumbel sampling size. In MDKP, the performance gained by trading off inference time diminishes around 30 samples; likewise, in GFPS, it diminishes around 10 samples. These empirical results motivate our settings for the Gumbel trick-based sequence sampling. We also consider the inference time of the teacher RL models, and limit the sampling sizes so that the inference time of the distilled models remains much shorter than that of their respective teacher RL models.
Furthermore, we normalize the scores by
(F.1) 
before the Gumbel perturbation is added to the scores, as shown in (21). This is because the Gumbel trick might not generate any variation in rankings if the gaps between item scores are too large, while it might generate irrelevant random rankings if the gaps are too small.
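The normalize-then-perturb step can be sketched as below. The zero-mean, unit-variance normalization is our assumption about the exact form of (F.1); the Gumbel(0, 1) noise is the standard perturbation of the Gumbel trick:

```python
import math
import random

def gumbel_rankings(scores, k, seed=0):
    """Normalize scores to zero mean and unit variance, then draw k
    perturbed rankings by adding independent Gumbel(0,1) noise
    (G = -log(-log(U))) and sorting in descending order. Normalization
    keeps score gaps on a scale where the noise produces useful ranking
    variation rather than none (gaps too large) or pure noise (too small)."""
    rng = random.Random(seed)
    mu = sum(scores) / len(scores)
    sd = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    z = [(s - mu) / sd for s in scores]
    rankings = []
    for _ in range(k):
        g = [zi - math.log(-math.log(rng.random())) for zi in z]
        rankings.append(sorted(range(len(z)), key=lambda i: -g[i]))
    return rankings
```

Each sampled ranking is then evaluated against the problem objective, and the best of the k candidates is returned within the inference time budget.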