
An Efficient Combinatorial Optimization Model Using Learning-to-Rank Distillation

by   Honguk Woo, et al.
Kakao Corp.

Recently, deep reinforcement learning (RL) has proven its feasibility in solving combinatorial optimization problems (COPs). The learning-to-rank techniques have been studied in the field of information retrieval. While several COPs can be formulated as the prioritization of input items, as is common in the information retrieval, it has not been fully explored how the learning-to-rank techniques can be incorporated into deep RL for COPs. In this paper, we present the learning-to-rank distillation-based COP framework, where a high-performance ranking policy obtained by RL for a COP can be distilled into a non-iterative, simple model, thereby achieving a low-latency COP solver. Specifically, we employ the approximated ranking distillation to render a score-based ranking model learnable via gradient descent. Furthermore, we use the efficient sequence sampling to improve the inference performance with a limited delay. With the framework, we demonstrate that a distilled model not only achieves comparable performance to its respective, high-performance RL, but also provides several times faster inferences. We evaluate the framework with several COPs such as priority-based task scheduling and multidimensional knapsack, demonstrating the benefits of the framework in terms of inference latency and performance.



1 Introduction

In the field of computer science, it is considered challenging to tackle combinatorial optimization problems (COPs) that are computationally intractable. While numerous heuristic approaches have been studied to provide polynomial-time solutions, they often require in-depth knowledge of problem-specific features and customization upon changes in problem conditions. Furthermore, several heuristic approaches to solving COPs, such as branching (chu1998genetic) and tabu-search (glover1989tabu), explore combinatorial search spaces extensively, and are thus limited in large-scale problems.

Recently, deep learning techniques have proven their feasibility in addressing COPs, e.g., routing optimization kool2018attention, task scheduling lee2020panda, and the knapsack problem gupointerkp. For deep learning-based COP approaches, it is challenging to build a training dataset with optimal labels, because exact solutions to many COPs are computationally infeasible to find. Reinforcement learning (RL) is considered viable for such problems, as in neural architecture search zoph2016neural, device placement mirhoseini2017device, and games silver2017mastering, where collecting supervised labels is expensive or infeasible.

As the RL action space of COPs can be intractably large (e.g., $100!$ possible solutions for ranking 100-items), it is undesirable to use a single probability distribution on the whole action space. Thus, a sequential structure, in which a probability distribution of an item to be selected next is iteratively calculated to represent a one-step action, becomes a feasible mechanism to establish RL-based COP solvers, as has been recently studied in bello2016neural; vinyals2019grandmaster. The sequential structure is effective in producing a permutation comparable to optimal solutions, but it often suffers from long inference time due to its iterative nature. Therefore, these approaches are not suitable for mission-critical applications with strict service level objectives and time constraints. For example, task placement in SoC devices necessitates fast inferences within a few milliseconds, but inferences by a complex model with sequential processing often take a few seconds, so it is rarely feasible to employ deep learning-based task placement in SoC ykman2006fast; shojaei2009parameterized.

In this paper, we present RLRD, an RL-to-rank distillation framework to address COPs, which enables the low-latency inference in online system environments. To do so, we develop a novel ranking distillation method and focus on two COPs where each problem instance can be treated as establishing the optimal policy about ranking items or making priority orders. Specifically, we employ a differentiable relaxation scheme for sorting and ranking operations blondel2020fast to expedite direct optimization of ranking objectives. It is combined with a problem-specific objective to formulate a distillation loss that corrects the rankings of input items, thus enabling the robust distillation of the ranking policy from sequential RL to a non-iterative, score-based ranking model. Furthermore, we explore the efficient sampling technique with Gumbel trick jang2016categorical; kool2019estimating on the scores generated by the distilled model to expedite the generation of sequence samples and improve the model performance with an inference time limitation.

Through experiments, we demonstrate that a distilled model by the framework achieves the low-latency inference while maintaining comparable performance to its teacher. For example, compared to the high-performance teacher model with sequential RL, the distilled model makes inferences up to 65 times and 3.8 times faster, respectively for the knapsack and task scheduling problems, while it shows only about 2.6% and 1.0% degradation in performance, respectively. The Gumbel trick-based sequence sampling improves the performance of distilled models (e.g., 2% for the knapsack) efficiently with relatively small inference delay. The contributions of this paper are summarized as follows.

  • We present the learning-based efficient COP framework RLRD that can solve various COPs, in which a low-latency COP model is enabled by the differentiable ranking (DiffRank)-based distillation and it can be boosted by Gumbel trick-based efficient sequence sampling.

  • We test the framework with well-known COPs in system areas such as priority-based task scheduling in real-time systems and multidimensional knapsack in resource management, demonstrating the robustness of our framework approach to various problem conditions.

2 Our Approach

In this section, we describe the overall structure of the RLRD framework with two building blocks, the deep RL-based COP model structure, and the ranking distillation procedure.

Figure 1: Overview of RLRD framework

Framework Structure

In general, retraining or fine-tuning is needed to adapt a deep learning-based COP model to varying conditions on production system requirements. The RLRD framework supports such model adaptation through knowledge distillation. As shown in Figure 1, (1) for a COP, a learning-to-rank (teacher) policy in the encoder-decoder model is first trained by sequential RL, and (2) it is then transferred through the DiffRank-based distillation to a student model with non-iterative ranking operations, according to a given deployment configuration, e.g., requirements on low-latency inference or model size. For instance, a scheduling policy is established by RL to make inferences on the priority order of a set of tasks running on a real-time multiprocessor platform, and it is then distilled into a low-latency model that makes the same inferences under a stringent delay requirement.

Reinforcement Learning-to-Rank

Here, we describe the encoder-decoder structure of our RL-to-Rank model for COPs, and explain how to train it. Our teacher model is based on a widely adopted attentive structure kool2018attention. In our model representation, we consider parameters $\theta$ (e.g., $W_\theta x + b_\theta$ for the affine transformation of a vector $x$), and we often omit them for simplicity. In an RL-to-Rank model, an encoder takes the features of $N$-items as input, producing the embeddings for the $N$-items, and a decoder conducts ranking decisions iteratively on the embeddings, yielding a permutation for the $N$-items. This encoder-decoder model is trained end-to-end by RL.

COP Encoder.

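In the encoder, each item $x_i \in \mathbb{R}^d$ containing $d$ features is first converted into a vector $h_i^{(0)}$ through a simple affine transformation, $h_i^{(0)} = W x_i + b$. Then, for $N$-items, the matrix $H^{(0)} = [h_1^{(0)}, \dots, h_N^{(0)}]$ is passed into the $L$-attention layers, where each attention layer consists of a Multi-Head Attention layer (MHA) and a Feed Forward network (FF). Each sub-layer is computed with a skip connection and Batch Normalization (BN). For $l = 1, \dots, L$, the embeddings $H^{(l)}$ are updated by

$$H^{(l)} = \mathrm{BN}^{(l)}\big(X + \mathrm{FF}^{(l)}(X)\big), \qquad X = \mathrm{BN}^{(l)}\big(H^{(l-1)} + \mathrm{MHA}^{(l)}(H^{(l-1)})\big), \tag{1}$$

where

$$\mathrm{MHA}(X) = W_G\big(\mathrm{AM}_1(X) \odot \cdots \odot \mathrm{AM}_{d_h}(X)\big), \tag{2}$$

$\odot$ is the concatenation of tensors, $d_h$ is a fixed positive integer, and $W_G$ is a learnable parameter. AM is given by

$$\mathrm{AM}_j(X) = W_V(X)\,\mathrm{Softmax}\Big(\tfrac{1}{d_h}\, W_K(X)^{\top} W_Q(X)\Big), \tag{3}$$

where $W_Q$, $W_K$ and $W_V$ denote the layer-wise parameters for query, key and value vaswani2017attention. The output $H^{(L)}$ in (1) is the embedding for the input $N$-items, which is used as input to the decoder in the following.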

Ranking Decoder.

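With the embeddings $H^{(L)}$ for $N$-items from the encoder, the decoder sequentially selects items to obtain an $N$-sized permutation $\phi = [\phi_1, \phi_2, \dots, \phi_N]$, where the distinct integers $\phi_i \in \{1, 2, \dots, N\}$ correspond to the indices of the $N$-items. That is, item $\phi_1$ is selected first, so it is assigned the highest ranking (priority), item $\phi_2$ is assigned the second, and so on. Specifically, the decoder establishes a function to rank $N$-items stochastically,

$$P(\phi \mid H^{(L)}) = \prod_{t=1}^{N} P(\phi_t \mid \phi_1, \dots, \phi_{t-1}, H^{(L)}), \tag{4}$$

where $P(\phi_t \mid \phi_1, \dots, \phi_{t-1}, H^{(L)})$ represents the probability that item $\phi_t$ is assigned the $t$th rank.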

From an RL formulation perspective, in (4), the information about $N$-items ($H^{(L)}$) including ranking-assigned items [$\phi_1, \dots, \phi_{t-1}$] until $t$ corresponds to state $o_t$, and selecting $\phi_t$ corresponds to action $a_t$. That is, a state contains a partial solution over all permutations and an action is a one-step inference to determine the next ranked item. Accordingly, the stochastic ranking function above can be rewritten as a $\theta$-parameterized policy $\pi_\theta$ for each timestep $t$,

$$\pi_\theta(o_t, a_t) = P(a_t \mid o_t) = P(\phi_t \mid \phi_1, \dots, \phi_{t-1}, H^{(L)}). \tag{5}$$


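This policy is learned based on problem-specific reward signals. To establish such a policy via RL, we formulate a learnable score function $\mathrm{SCORE}(x_i)$ of item $x_i$ upon state $o_t$, which is used to estimate $P(a_t = i \mid o_t)$. The score is computed from the encoder output $H^{(L)}$, the embedding $l^{(t-1)}$ of the item ($x_{\phi_{t-1}}$) selected at timestep $t-1$, and a global vector $x^{(g)}$ obtained from them. Note that $l^{(0)}$ is randomly initialized. To incorporate the alignment between $l^{(t-1)}$ and $H^{(L)}$ in $x^{(g)}$, we use Attention vaswani2017attention for a query $q$ and vectors $Y = [y_1, \dots, y_N]$, with learnable parameters. Finally, we have the policy $\pi_\theta$ that calculates the ranking probability that the $i$th item is selected next upon state $o_t$,

$$\pi_\theta(o_t, a_t = i) = P(a_t = i \mid o_t) = \frac{\exp\big(\mathrm{SCORE}(x_i)\big)}{\sum_{k=1}^{N} \exp\big(\mathrm{SCORE}(x_k)\big)}. \tag{9}$$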
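The iterative decoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `score_fn` is a hypothetical stand-in for the learnable score function, and assigning a score of $-\infty$ to items that are already ranked is an assumption made here so that each item is selected exactly once.

```python
import numpy as np

def decode_ranking(score_fn, n_items, greedy=True, rng=None):
    """Sequentially build a permutation, selecting one item per timestep."""
    rng = np.random.default_rng(0) if rng is None else rng
    selected = []                                  # partial solution so far
    mask = np.zeros(n_items, dtype=bool)           # True for already-ranked items
    log_prob = 0.0
    for _ in range(n_items):
        scores = np.asarray(score_fn(selected), dtype=float)
        scores = np.where(mask, -np.inf, scores)   # assumed masking of ranked items
        probs = np.exp(scores - scores[~mask].max())
        probs /= probs.sum()                       # softmax over the remaining items
        if greedy:
            i = int(np.argmax(probs))
        else:
            i = int(rng.choice(n_items, p=probs))
        log_prob += float(np.log(probs[i]))
        selected.append(i)
        mask[i] = True
    return selected, log_prob
```

Each call to `score_fn` corresponds to one decoder step, so producing a full ranking costs $N$ forward passes; this is the latency that the distillation described later is meant to remove.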



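For end-to-end training of the encoder-decoder, we use the REINFORCE algorithm williams1992simple, which is effective for episodic tasks, e.g., problems formulated as ranking $N$-items. Suppose that for a problem of $N$-items, we obtain an episode

$$T = (o_1, a_1, r_1, \dots, o_N, a_N, r_N) \tag{10}$$

acquired by policy $\pi_\theta$, where $o$, $a$ and $r$ are state, action and reward samples. We set the goal of model training to maximize the expected total reward under $\pi_\theta$,

$$J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{N} \gamma^{t} r_t\Big], \tag{11}$$

where $\gamma$ is a discount rate, and use the policy gradient ascent

$$\theta \leftarrow \theta + \lambda \sum_{t=1}^{N} \nabla_\theta \log \pi_\theta(o_t, a_t)\,\big(G_t - b(t)\big). \tag{12}$$

Note that $G_t = \sum_{k=t}^{N} \gamma^{k-t} r_k$ is a return, $\lambda$ is a learning rate, and $b(t)$ is a baseline used to accelerate the convergence of model training.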
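As a small illustration of the quantities used in the REINFORCE update, the sketch below computes discounted returns and the baseline-corrected weights that multiply the log-probability gradients. The function names are hypothetical, and the choice of baseline is left to the caller since the text does not specify it here.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum over k >= t of gamma^(k-t) * r_k, computed right to left."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_weights(rewards, gamma, baseline):
    """Per-step weights (G_t - b(t)) that scale the gradient of log pi(o_t, a_t)."""
    return discounted_returns(rewards, gamma) - np.asarray(baseline, dtype=float)
```

For an episode whose only reward arrives at the last step, e.g. `discounted_returns([0, 0, 1], 0.5)`, earlier steps receive geometrically smaller returns.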

Learning-to-Rank Distillation

In the RLRD framework, the ranking decoder repeats the selection $N$ times to rank $N$-items through its sequential structure. While the decoder structure is intended to extract the relational features of items that have not been selected, the high computing complexity of iterative decoding makes it difficult to apply the framework to mission-critical systems. To enable fast inferences without significant degradation in model performance, we employ knowledge distillation from an RL-to-rank model with iterative decoding to a simpler model. Specifically, we use a non-iterative, score-based ranking model as a student in knowledge distillation, which takes the features of $N$-items as input and directly produces a score vector for the $N$-items as output. The score vector is used to rank the $N$-items.

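For $N$-items, the RL-to-rank model produces a ranking vector as the supervised label $\mathbf{y}$, and by distillation, the student model learns to produce a score vector $\mathbf{s}$ that maximizes the similarity between $\mathbf{y}$ and the corresponding ranking of $\mathbf{s}$, say $\mathrm{rank}(\mathbf{s})$. For example, given a score vector $\mathbf{s} = [2.4, 3.0, 1.3]$ for 3-items, it is sorted to $[3.0, 2.4, 1.3]$, so $\mathrm{rank}(\mathbf{s}) = [2, 1, 3]$. The ranking distillation loss is defined as

$$\mathcal{L}(\mathbf{y}, \mathbf{s}) = \mathcal{L}_R\big(\mathbf{y}, \mathrm{rank}(\mathbf{s})\big), \tag{13}$$

where $\mathcal{L}_R$ is a differentiable evaluation metric for the similarity of two ranking vectors. We use mean squared error (MSE) for $\mathcal{L}_R$, because minimizing the MSE of two ranking vectors is equivalent to maximizing the Spearman-rho correlation of the two rankings $\mathbf{y}$ and $\mathrm{rank}(\mathbf{s})$.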
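The equivalence between MSE and Spearman-rho can be checked numerically. For two tie-free rankings of $n$ items, the standard identity is $\rho = 1 - 6\,\mathrm{MSE}/(n^2 - 1)$, so a lower MSE always means a higher correlation. The sketch below compares this closed form with a direct Pearson correlation of the rank vectors; the function names are illustrative only.

```python
import numpy as np

def spearman_rho(y, r):
    """Spearman correlation as the Pearson correlation of two tie-free rank vectors."""
    y, r = np.asarray(y, dtype=float), np.asarray(r, dtype=float)
    return float(np.corrcoef(y, r)[0, 1])

def rho_from_mse(y, r):
    """Closed form for tie-free rankings: rho = 1 - 6 * MSE / (n^2 - 1)."""
    y, r = np.asarray(y, dtype=float), np.asarray(r, dtype=float)
    n = len(y)
    mse = np.mean((y - r) ** 2)
    return float(1.0 - 6.0 * mse / (n ** 2 - 1))
```

For instance, the rankings `[1, 2, 3, 4]` and `[2, 1, 4, 3]` have an MSE of 1, and both functions give a correlation of 0.6.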

Differentiable Approximated Ranking.

To distill with the loss in (13) using gradient descent, the ranking function rank needs to be differentiable with non-vanishing gradient. However, differentiating rank has a problem of vanishing gradient because a slight shift of score s does not usually affect the corresponding ranking. Thus, we revise the loss in (13) using an approximated ranking function having nonzero gradients in the same way of blondel2020fast.

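Consider a score vector $\mathbf{s} = [s_1, \dots, s_N] \in \mathbb{R}^N$ and an $N$-permutation $\sigma$, which is a bijection from $\{1, \dots, N\}$ to itself. A descending sorted list of $\mathbf{s}$ is represented as

$$\mathbf{s}_\sigma := [s_{\sigma(1)}, \dots, s_{\sigma(N)}], \tag{14}$$

where $s_{\sigma(1)} \geq \cdots \geq s_{\sigma(N)}$. Accordingly, the ranking function $\mathrm{rank}: \mathbb{R}^N \to \mathbb{R}^N$ is formalized as

$$\mathrm{rank}(\mathbf{s}) = [\sigma^{-1}(1), \dots, \sigma^{-1}(N)], \tag{15}$$

where $\sigma^{-1}$ is the inverse of $\sigma$, which is also a permutation. For example, consider $\mathbf{s} = [2.4, 3.0, 1.3]$. Its descending sort is $[3.0, 2.4, 1.3]$, so we have $\sigma(1) = 2$, $\sigma(2) = 1$, $\sigma(3) = 3$ and $\sigma^{-1}(1) = 2$, $\sigma^{-1}(2) = 1$, $\sigma^{-1}(3) = 3$. Accordingly, we have $\mathrm{rank}(\mathbf{s}) = [2, 1, 3]$.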
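The hard ranking function can be written directly as an inverse permutation. The sketch below uses illustrative scores `[2.4, 3.0, 1.3]` and also shows why the gradient vanishes: a small shift of one score leaves the output unchanged.

```python
import numpy as np

def hard_rank(s):
    """rank(s)[i] is the 1-based position of s[i] in the descending sort of s."""
    s = np.asarray(s, dtype=float)
    sigma = np.argsort(-s, kind="stable")       # sigma[j] = index of the j-th largest score
    rank = np.empty(len(s), dtype=int)
    rank[sigma] = np.arange(1, len(s) + 1)      # inverse permutation, 1-based
    return rank

# A slight shift of a score does not change the ranking, so the
# derivative of rank with respect to s is zero almost everywhere.
same = np.array_equal(hard_rank([2.4, 3.0, 1.3]), hard_rank([2.41, 3.0, 1.3]))
```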

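To implement DiffRank, a function $\widetilde{\mathrm{rk}}_\epsilon$ is used, which approximates rank in a differentiable way with nonzero gradients:

$$\widetilde{\mathrm{rk}}_\epsilon(\mathbf{s}) = \operatorname*{ArgMin}\Big\{ \tfrac{1}{2}\big\|\mathbf{x} + \tfrac{\mathbf{s}}{\epsilon}\big\|^2 \;\Big|\; \mathbf{x} \in \mathcal{P}(\tau) \Big\}. \tag{16}$$

Here $\mathcal{P}(\tau)$ is called a permutahedron, which is the convex hull generated by the permutations of $\tau = (N, N-1, \dots, 1)$, with $\epsilon > 0$. As explained in blondel2020fast, the function $\widetilde{\mathrm{rk}}_\epsilon$ converges to rank as $\epsilon \to 0$, while it always preserves the order of $\mathrm{rank}(\mathbf{s})$. That is, given $\mathbf{s}_\sigma$ in (14) and $\widetilde{\mathrm{rk}}_\epsilon(\mathbf{s}) = [\tilde{r}_1, \dots, \tilde{r}_N]$, we have $\tilde{r}_{\sigma(1)} \leq \cdots \leq \tilde{r}_{\sigma(N)}$.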
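A forward-pass sketch of such a soft ranking is given below. It follows the reduction in blondel2020fast, where the Euclidean projection onto the permutahedron is computed by sorting followed by an isotonic regression solved with pool-adjacent-violators. This is an illustrative NumPy version without gradients, not the authors' code, and the function names are assumptions.

```python
import numpy as np

def isotonic_nonincreasing(u):
    """Pool-adjacent-violators: argmin 0.5*||v - u||^2 subject to v[0] >= v[1] >= ..."""
    blocks = []                                   # each block holds [sum, count]
    for value in u:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean is below the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(c, t / c) for t, c in blocks])

def soft_rank(s, eps):
    """Project -s/eps onto the permutahedron generated by (N, N-1, ..., 1)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    z = -s / eps
    w = np.arange(n, 0, -1, dtype=float)          # (N, N-1, ..., 1)
    sigma = np.argsort(-z, kind="stable")         # sort z in descending order
    v = isotonic_nonincreasing(z[sigma] - w)
    out = np.empty(n)
    out[sigma] = z[sigma] - v                     # undo the sort
    return out
```

With a small `eps` the output coincides with the hard ranking, while a large `eps` pulls all entries toward their common mean but keeps their order, which is what yields nonzero gradients.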

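In addition, we also consider a problem-specific loss. For example, in the knapsack problem, an entire set of items can be partitioned into two groups, one for selected items and the other for unselected items. We can penalize the difference between the partitions obtained from the label $\mathbf{y}$ and the target output score $\mathbf{s}$ by a function $\mathcal{L}_P$. Finally, the total loss is given by

$$\mathcal{L}(\mathbf{y}, \mathbf{s}) = \beta\, \mathcal{L}_R\big(\mathbf{y}, \widetilde{\mathrm{rk}}_\epsilon(\mathbf{s})\big) + (1 - \beta)\, \mathcal{L}_P(\mathbf{y}, \mathbf{s}), \tag{17}$$

where $\beta \in [0, 1]$. The overall distillation procedure is illustrated in Algorithm 1.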

Here, we present the explicit nonvanishing gradient form of our ranking loss function; its proof can be found in Appendix A.

Proposition 1.

Fix . Let as in (16) and where . Let ,


and . Then, we have


where @ is a matrix multiplication, , is a square matrix whose entries are all with size , and is an -permutation. Here, for any matrix , denotes row and column permutation of according to .

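  Load (teacher) RL policy $\pi$
  Initialize parameter $\theta_s$ of student model
  Input: sample batch $\mathcal{B}$ and learning rate $\lambda_s$
  for each training iteration do
     $\nabla\theta_s \leftarrow 0$
     for itemset $I \in \mathcal{B}$ do
        Get rank $\mathbf{y}$ from $\pi$
        Get score $\mathbf{s}$ from target model
        Get approx. ranking of $\mathbf{s}$ using (16)
        Calculate loss $\mathcal{L}(\mathbf{y}, \mathbf{s})$ using (17)
        $\nabla\theta_s \leftarrow \nabla\theta_s + \nabla_{\theta_s} \mathcal{L}(\mathbf{y}, \mathbf{s})$
     end for
     $\theta_s \leftarrow \theta_s - \lambda_s \nabla\theta_s$
  end for
Algorithm 1 Learning-to-rank distillation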

Efficient Sequence Sampling.

As explained, we use a score vector in (14) to obtain its corresponding ranking vector deterministically. On the other hand, if we treat such a score vector as the un-normalized log-probabilities of a categorical distribution on $N$-items, we can randomly sample from the distribution using the score vector without replacement and obtain a ranking vector for the $N$-items. Here, the condition of without replacement specifies that the distribution is renormalized so that it sums up to 1 each time an item is sampled. This $N$-times drawing and normalization increases the inference time. Therefore, to compute rankings rapidly, we exploit the Gumbel trick gumbel1954maxima; maddison2014sampling.

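Given a score vector $\mathbf{s}$, consider the random variable

$$\mathbf{S} = \mathbf{s} + \mathbf{Z}, \qquad Z_i = -\log(-\log U_i), \quad U_i \sim \mathrm{Unif}(0, 1) \ \text{i.i.d.},$$

and suppose that $\mathbf{S}$ is sorted by $\sigma$, as in (14). Note that $\sigma(1), \dots, \sigma(N)$ are random variables.

Theorem 1.

(Appendix A in kool2019estimating.) For each $k \in \{1, \dots, N\}$,

$$P\big(\sigma(1) = i_1, \dots, \sigma(k) = i_k\big) = \prod_{j=1}^{k} \frac{\exp(s_{i_j})}{\sum_{l \in R_j} \exp(s_l)},$$

where $R_j = \{1, \dots, N\} \setminus \{i_1, \dots, i_{j-1}\}$.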

This sampling method reduces the complexity of obtaining each ranking vector instance from quadratic (sequentially sampling each of $N$-items from a categorical distribution) to log-linear (sorting a perturbed list) in the number of items $N$, improving the efficiency of our model significantly.
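The sampling step can be sketched in a few lines: perturb the scores with independent Gumbel noise and sort once per sample. This is an illustrative NumPy version; the function name and the fixed default seed are assumptions made for reproducibility.

```python
import numpy as np

def gumbel_rankings(s, n_samples, rng=None):
    """Sample rankings without replacement by sorting Gumbel-perturbed scores."""
    rng = np.random.default_rng(0) if rng is None else rng
    s = np.asarray(s, dtype=float)
    noise = rng.gumbel(size=(n_samples, len(s)))   # standard Gumbel, -log(-log(U))
    perturbed = s + noise                          # one perturbed score vector per sample
    return np.argsort(-perturbed, axis=1)          # row j lists item indices, best first
```

All samples are drawn from a single forward pass of the score model, so the extra cost per sample is one sort rather than one full decoding pass.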

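| $N$ | $k$ | $w$ | $\alpha$ | GLOP Gap | GLOP Time | Greedy Gap | Greedy Time | RL Gap | RL Time | RL-S Gap | RL-S Time | RD Gap | RD Time | RD-G Gap | RD-G Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 3 | 200 | 0 | 100% | 0.0060s | 97.9% | 0.0003s | 99.7% | 0.0706s | 98.8% | 2.0051s | 97.5% | 0.0029s | 100.1% | 0.0152s |
| 50 | 3 | 200 | 0.9 | 100% | 0.0063s | 87.6% | 0.0003s | 100.2% | 0.0675s | 104.5% | 1.8844s | 97.7% | 0.0030s | 102.2% | 0.0154s |
| 50 | 3 | 500 | 0 | 100% | 0.0066s | 97.8% | 0.0003s | 99.4% | 0.0686s | 99.6% | 1.9768s | 97.4% | 0.0029s | 99.7% | 0.0159s |
| 50 | 3 | 500 | 0.9 | 100% | 0.0064s | 81.4% | 0.0004s | 101.5% | 0.0687s | 105.4% | 1.8840s | 97.9% | 0.0030s | 101.5% | 0.0150s |
| 100 | 10 | 200 | 0 | 100% | 0.0950s | 101.3% | 0.0005s | 102.2% | 0.4444s | 101.9% | 12.4996s | 100.5% | 0.0060s | 102.7% | 0.0198s |
| 100 | 10 | 200 | 0.9 | 100% | 0.0435s | 93.4% | 0.0004s | 103.2% | 0.4443s | 106.9% | 12.3932s | 99.2% | 0.0086s | 102.6% | 0.0222s |
| 100 | 10 | 500 | 0 | 100% | 0.1046s | 100.9% | 0.0005s | 100.6% | 0.4363s | 101.1% | 12.3392s | 98.9% | 0.0088s | 101.6% | 0.0214s |
| 100 | 10 | 500 | 0.9 | 100% | 0.0436s | 90.2% | 0.0004s | 103.5% | 0.4381s | 107.1% | 12.3211s | 100.4% | 0.0059s | 104.6% | 0.0198s |
| 150 | 15 | 200 | 0 | 100% | 0.2494s | 102.0% | 0.0007s | 102.9% | 0.6370s | 102.8% | 17.8975s | 100.7% | 0.0090s | 102.5% | 0.0328s |
| 150 | 15 | 200 | 0.9 | 100% | 0.0885s | 96.7% | 0.0006s | 103.4% | 0.5380s | 106.4% | 16.2406s | 99.7% | 0.0090s | 102.3% | 0.0290s |
| 150 | 15 | 500 | 0 | 100% | 0.2497s | 101.8% | 0.0007s | 99.1% | 0.6488s | 100.1% | 19.4974s | 96.6% | 0.0093s | 98.6% | 0.0244s |
| 150 | 15 | 500 | 0.9 | 100% | 0.0425s | 92.5% | 0.0006s | 103.7% | 0.5289s | 107.0% | 14.0924s | 100.7% | 0.0088s | 103.9% | 0.0292s |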

Table 1: The Evaluation of MDKP. For each method, Gap denotes the performance (average achieved value) ratio of the method to GLOP, and Time denotes the average inference time for a problem instance. $N$ and $k$ denote the number of items and the size of knapsack resource dimensions, respectively. $w$ denotes the sampling range of item weight on [1, $w$], and $\alpha$ denotes the correlation of weight and value of items. The performance is averaged over a testing dataset of 500 item sets.

3 Experiments

In this section, we evaluate our framework with multidimensional knapsack problem (MDKP) and global fixed-priority task scheduling (GFPS) davis2016review. The problem details including RL formulation, data generation, and model training can be found in Appendix B and C.

Multidimensional Knapsack Problem (MDKP)

Given the values and $k$-dimensional weights of $N$-items, in MDKP, each item is either selected or not for a knapsack with $k$-dimensional capacities, with the goal of maximizing the total value of the selected items.

For evaluation, we use the performance (the achieved value in a knapsack) of GLOP implemented in the OR-tools ortools as a baseline. We compare several models in our framework. RL is the RL-to-rank teacher model, and RD is the distilled student model. RL-S is the RL model variant with sampling, and RD-G is the RD model variant with Gumbel trick-based sequence sampling. In RL-S, the one-step action in (9) is conducted stochastically, while in RL, it is conducted greedily. For both RL-S and RD-G, we set the number of ranking samples to 30 for each item set, and report the highest score among the samples. In addition, we test the Greedy method that exploits the mean ratio of item weight and value.
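To make the link between a ranking and a knapsack solution concrete, the sketch below turns an item ordering into a selection and keeps the best of several sampled orderings. The skip-if-it-does-not-fit rule is an assumption made for illustration; the actual decoding rule is specified in the paper's Appendix B and may differ. All names are hypothetical.

```python
import numpy as np

def knapsack_value_from_ranking(order, values, weights, capacities):
    """Add items in ranked order, skipping any item that would exceed a capacity."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)               # shape (N, k)
    remaining = np.asarray(capacities, dtype=float).copy()   # shape (k,)
    total = 0.0
    for i in order:
        if np.all(weights[i] <= remaining):
            remaining -= weights[i]
            total += values[i]
    return total

def best_of_samples(orders, values, weights, capacities):
    """Report the highest achieved value among the sampled rankings."""
    return max(
        knapsack_value_from_ranking(o, values, weights, capacities) for o in orders
    )
```

Evaluating one ordering is linear in the number of items, so scoring 30 sampled rankings adds little on top of the cost of producing them.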

Model Performance.

Table 1 shows the performance in MDKP, where Gap denotes the performance ratio to the baseline GLOP and Time denotes the inference time.

  • RD shows comparable performance to its respective high-performance teacher RL, with an insignificant average degradation of 2.6% over all the cases. More importantly, RD achieves efficient, low-latency inferences, e.g., 23 and 65 times faster inferences than RL for the $N$=50 and $N$=150 cases, respectively.

  • RD-G outperforms RL by 0.3% on average and also achieves 4.4 and 20 times faster inferences than RL for the $N$=50 and $N$=150 cases, respectively. Moreover, RD-G shows 2% higher performance than RD, while its inference time is increased by 3.7 times.

  • RL-S shows 1.8% higher performance than the RL model. However, unlike RD-G, the inference time of RL-S increases linearly with the number of ranking samples (i.e., about a 30-fold increase for 30 samples).

  • As $N$ increases, all methods show longer inference time, but the increase for GLOP is much larger than for RL and RD. For example, as $N$ increases from 50 to 150 when $\alpha$=0, the inference time of GLOP is increased by 39 times, while RL and RD show 9.3 and 3.1 times increments, respectively.

  • The performance of Greedy degrades drastically in the case of $\alpha$=0.9. This is because the weight-value ratio for items becomes less useful when the correlation is high. Unlike Greedy, our models show stable performance for both high and low correlation cases.
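The quoted speedups of RD over RL can be reproduced from the inference times in Table 1. The sketch below averages the four settings listed for each $N$ and takes the ratio; averaging this way is an assumption about how the reported factors were obtained, but it yields values of about 23 and 65.

```python
# Inference times in seconds copied from Table 1, four settings per N.
rl_time = {
    50: [0.0706, 0.0675, 0.0686, 0.0687],
    150: [0.6370, 0.5380, 0.6488, 0.5289],
}
rd_time = {
    50: [0.0029, 0.0030, 0.0029, 0.0030],
    150: [0.0090, 0.0090, 0.0093, 0.0088],
}

def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

# Ratio of mean RL time to mean RD time for each problem size.
speedup = {n: mean(rl_time[n]) / mean(rd_time[n]) for n in rl_time}
```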

Priority Assignment Problem for GFPS

| $m$ | $N$ | Util | OPA Ratio | OPA Time | RL Ratio | RL Time | RL-S Ratio | RL-S Time | RD Ratio | RD Time | RD-G Ratio | RD-G Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 32 | 3.0 | 78.1% | 0.3531s | 87.5% | 0.0616s | 89.4% | 0.0697s | 86.5% | 0.0145s | 90.1% | 0.0263s |
| 4 | 32 | 3.1 | 63.5% | 0.3592s | 74.8% | 0.0599s | 77.6% | 0.0840s | 73.8% | 0.0139s | 78.9% | 0.0390s |
| 4 | 32 | 3.2 | 44.9% | 0.3487s | 56.9% | 0.0623s | 60.1% | 0.0991s | 56.0% | 0.0140s | 61.2% | 0.0509s |
| 4 | 32 | 3.3 | 26.6% | 0.3528s | 35.7% | 0.0621s | 38.5% | 0.9123s | 35.8% | 0.0131s | 39.2% | 0.0620s |
| 6 | 48 | 4.4 | 84.2% | 0.4701s | 92.5% | 0.1021s | 94.3% | 0.1153s | 91.76% | 0.0298s | 94.3% | 0.0406s |
| 6 | 48 | 4.6 | 61.9% | 0.4600s | 78.4% | 0.1057s | 78.4% | 0.1508s | 74.6% | 0.0308s | 79.3% | 0.0728s |
| 6 | 48 | 4.8 | 33.2% | 0.4287s | 46.5% | 0.1082s | 50.4% | 0.1967s | 45.4% | 0.0290s | 50.8% | 0.1123s |
| 6 | 48 | 5.0 | 11.5% | 0.3888s | 15.7% | 0.1010s | 18.1% | 0.2474s | 18.0% | 0.0256s | 20.2% | 0.1615s |
| 8 | 64 | 5.7 | 92.9% | 0.6686s | 97.8% | 0.1437s | 98.6% | 0.1596s | 97.5% | 0.0502s | 98.5% | 0.0537s |
| 8 | 64 | 6.0 | 72.9% | 0.6460s | 86.5% | 0.1490s | 89.9% | 0.2043s | 85.0% | 0.0364s | 88.7% | 0.0907s |
| 8 | 64 | 6.3 | 37.6% | 0.5798s | 53.5% | 0.1559s | 57.5% | 0.2800s | 52.5% | 0.0509s | 57.7% | 0.1695s |
| 8 | 64 | 6.6 | 10.4% | 0.4806s | 15.1% | 0.1488s | 17.7% | 0.4093s | 17.0% | 0.0390s | 19.6% | 0.2715s |

Table 2: The Evaluation of GFPS. For each method, Ratio denotes the performance in terms of the schedulability ratio, and Time denotes the average inference time for a problem instance. $m$ and $N$ denote the number of processors and the number of tasks, respectively, i.e., scheduling $N$-tasks on an $m$-processor platform. Util denotes the task set utilization, i.e., the sum of the utilizations of the tasks in the set. The performance is averaged over a testing dataset of 5,000 task sets.

For a set of $N$-periodic tasks, in GFPS, each task is assigned a priority (an integer from 1 to $N$) to be scheduled. GFPS with a priority order (or a ranking list) can schedule the $m$ highest-priority tasks in each time slot upon a platform comprised of $m$-processors, with the goal of not incurring any deadline violation of the periodic tasks.

For evaluation, we need to choose a schedulability test for GFPS that determines whether a task set is deemed schedulable by GFPS with a priority order. We target a schedulability test called RTA-LC Guan2009; Davis2011, which is known to outperform the others in terms of covering schedulable task sets. We compare our models with Audsley's Optimal Priority Assignment (OPA) Audsley1991; Audsley2001 with the state-of-the-art OPA-compatible DA-LC test Davis2011, which is known to have the highest performance compared to other heuristic algorithms. As in MDKP, we denote our models as RL, RL-S, RD, and RD-G. For both RL-S and RD-G, we limit the number of ranking samples to 10 for each task set.

Model Performance.

Table 2 shows the performance in the schedulability ratio of GFPS with respect to different task set utilization settings on an $m$-processor platform with $N$-tasks.

  • Our models all show better performance than OPA, indicating the superiority of the RLRD framework. The performance gap is relatively large in the intermediate utilization ranges, because those ranges provide more opportunities to optimize with a better strategy. For example, when $m$=8, $N$=64 and Util=6.3, RL and RD show 15.9% and 14.9% higher schedulability ratio than OPA, respectively, while when $m$=8, $N$=64 and Util=5.7, their gain is 4.9% and 4.6%, respectively.

  • The performance difference between RD and its teacher RL is about 1% on average, while the inference time of RD is reduced by a factor of 3.8. This clarifies the benefit of the ranking distillation.

  • As the utilization (Util) increases, the inference time of RL-S and RD-G becomes longer, due to multiple executions of the schedulability test up to the predefined limit (i.e., 10 times). On the other hand, the inference time of OPA decreases for large utilization; the loop of OPA is terminated when a task cannot satisfy its deadline under the assumption that other priority-unassigned tasks have higher priorities than that task.

  • RD-G shows comparable performance to, and often achieves slightly higher performance than, RL-S. This is the opposite of the pattern in MDKP, where RL-S achieves the best performance. While a direct comparison is of limited validity due to the different sampling methods, we notice the possibility that a distilled student can perform better than its teacher in some cases, and similar patterns are observed in tangrankdistil; kim2016sequence.

Analysis on Distillation

Effects of Iterative Decoding.

To verify the feasibility of distillation from sequential RL to a score-based ranking model, we measure the difference between the outputs of iterative decoding and greedy sampling. Suppose the decoder generates the ranking distribution $P_t$ at timestep $t$ and takes action $a_t = i$ as in (9). By masking the $i$th component of the distribution and renormalizing it, we obtain a renormalized distribution $\bar{P}_t$. In addition, consider another probability distribution generated by the decoder at a later timestep.

Figure 2: The KL divergence at different timesteps and is used to estimate the difference of decoder driven ranking distribution () and renormalized ranking distribution (), where time intervals . Scales are normalized.

Figure 2 illustrates the difference of the distributions in terms of KL-divergence on three specific COPs: the previously explained MDKP and GFPS, as well as the Traveling Salesman Problem (TSP). As shown, MDKP and GFPS maintain a low divergence value, implying that the ranking result of decoding with many iterations can be approximated by decoding with no or few iterations. Unlike MDKP and GFPS, TSP shows a large divergence value. This implies that many decoding iterations are required to obtain an optimal path. Indeed, in TSP, we obtain good performance by RL (e.g., 2% better than a heuristic method), but we hardly achieve comparable performance to RL when we test RD. The experiment and performance for TSP can be found in Appendix D.
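The divergence measurement can be sketched as follows: mask the chosen item in one step's distribution, renormalize, and compare the result with the distribution the decoder actually produces later. The direction of the KL divergence and the function names are assumptions made for illustration.

```python
import numpy as np

def renormalize_after_action(p_t, action):
    """Mask the chosen item in a ranking distribution and renormalize the rest."""
    q = np.asarray(p_t, dtype=float).copy()
    q[action] = 0.0
    return q / q.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over entries where p > 0; eps guards against log of zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / (q[support] + eps))))
```

A divergence near zero means that a single set of scores, masked and renormalized, already predicts what the decoder would output at later steps, which is the situation in which a non-iterative student can imitate the teacher well.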

Effects of Distillation Loss.

To evaluate the effectiveness of DiffRank-based distillation, we implement other ranking metrics such as a pairwise metric in ranknet and a listwise metric in listnet and test them in the framework as a distillation loss.

Tables 3 and 4 show the performance in MDKP and GFPS, respectively, achieved by different distillation loss functions, where RD denotes our distilled model trained with the DiffRank-based distillation loss, and the performance of the other two is represented as the ratio to RD. Note that they all use the same RL model as a teacher in this experiment.

As shown, RD achieves consistently better performance than the others in most cases. Unlike RD, the other methods commonly show data-dependent performance patterns. The pairwise method (with pairwise distillation loss) achieves performance similar to or slightly lower than RD in MDKP but shows much lower performance than RD in GFPS. The listwise method shows the worst performance in many cases in both MDKP and GFPS, except for the cases of $\alpha$=0.9 in MDKP. These results are consistent with the implication in Figure 2 that GFPS has a larger divergence than MDKP and thus GFPS is more difficult to distill, giving a large performance gain to RD.
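For reference, the two alternative distillation losses can be sketched in their common textbook forms: a RankNet-style pairwise logistic loss and a ListNet-style top-one cross entropy. How the teacher ranking is turned into targets is not specified in the text; using the negated rank as the teacher score in the listwise loss is an assumption made here, as are the function names.

```python
import numpy as np

def pairwise_logistic_loss(s, y):
    """RankNet-style loss over all pairs (i, j) where the label ranks i above j."""
    s, y = np.asarray(s, dtype=float), np.asarray(y, dtype=float)
    i, j = np.where(y[:, None] < y[None, :])      # smaller rank value means higher priority
    return float(np.mean(np.log1p(np.exp(-(s[i] - s[j])))))

def _log_softmax(x):
    """Numerically stable log-softmax of a 1-D array."""
    x = x - x.max()
    return x - np.log(np.sum(np.exp(x)))

def listwise_top1_loss(s, y):
    """ListNet-style top-one cross entropy; -y as teacher scores is an assumption."""
    s, y = np.asarray(s, dtype=float), np.asarray(y, dtype=float)
    teacher = np.exp(_log_softmax(-y))
    return float(-np.sum(teacher * _log_softmax(s)))
```

Both losses are differentiable in the scores without any relaxation of the sort, but they optimize fixed surrogate objectives rather than the rank vector itself.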

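| $N$ | $k$ | $w$ | $\alpha$ | RD | Pairwise | Listwise |
|---|---|---|---|---|---|---|
| 50 | 5 | 200 | 0 | 100% | 99.4% | 69.8% |
| 50 | 5 | 200 | 0.9 | 100% | 99.1% | 98.5% |
| 100 | 10 | 200 | 0 | 100% | 98.8% | 80.1% |
| 100 | 10 | 200 | 0.9 | 100% | 100.1% | 99.2% |
| 150 | 15 | 200 | 0 | 100% | 95.4% | 86.8% |
| 150 | 15 | 200 | 0.9 | 100% | 99.6% | 99.6% |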

Table 3: Ranking Loss Comparison in MDKP.

| $m$ | $N$ | Util | RD | Pairwise | Listwise |
|---|---|---|---|---|---|
| 4 | 32 | 3.1 | 100% | 90.1% | 72.8% |
| 6 | 48 | 4.6 | 100% | 91.3% | 59.6% |
| 8 | 64 | 6.0 | 100% | 91.6% | 49.2% |

Table 4: Ranking Loss Comparison in GFPS.

4 Related Work

Advanced deep neural networks combined with RL algorithms have shown the capability to address various COPs in a data-driven manner with less problem-specific customization. In bello2016neural, the pointer network was introduced to solve TSP and other geometric COPs, and in kool2018attention, a transformer model was incorporated for more generalization. Besides the pointer network, a temporal difference based model showed positive results in the Job-Shop problem zhangjobshop, and deep RL-based approaches such as Q-learning solvers were explored for the knapsack problem kpqlearning. Several attempts have also been made to address practical cases formulated as the knapsack problem, e.g., maximizing user engagements under business constraints homepagerelevance; emailvolumeoptimize.

Particularly, in the field of computer systems and resource management, there have been several works using deep RL to tackle system optimization under multiple, heterogeneous resource constraints in the form of COPs, e.g., cluster resource management mao2016resource; mao2018learning, compiler optimization chen2018tvm. While we leverage deep RL techniques to address COPs in the same vein as those prior works, we focus on efficient, low-latency COP models.

The ranking problems such as prioritizing input items based on some scores have been studied in the field of information retrieval and recommendation systems. A neural network based rank optimizer using a pairwise loss function was first introduced in ranknet, and other ranking objective functions were developed to optimize relevant metrics with sophisticated network structures. For example, Bayesian Personalized Ranking rendle2012 is known to maximize the AUC score of given item rankings with labeled data. However, although these approaches can bypass the non-differentiability of ranking operations, the optimization is limited to some predefined objectives such as NDCG or AUC; thus, it is difficult to apply them to COPs because the objectives do not completely align with the COP objectives. To optimize arbitrary objectives involving nondifferentiable operations such as ranking or sorting, several works focused on smoothing nondifferentiable ranking operations grover2019stochastic; blondel2020fast. They are commonly intended to make arbitrary objectives differentiable by employing relaxed sorting operations.

Knowledge distillation based on the teacher-student paradigm has been an important topic in machine learning for building a compressed model hinton2015distilling, and it has shown many successful practices in image classification pmlr-v139-touvron21a and natural language processing kim2016sequence; jiao-etal-2020-tinybert. However, knowledge distillation to ranking models has not been fully studied. A ranking distillation for recommendation systems was introduced in tangrankdistil, and recently, a general distillation framework RankDistil Reddi2021RankDistilKD was presented with several loss functions and optimization schemes specific to top-$K$ ranking problems. These works exploited pairwise objectives and sampling based heuristics to distill a ranking model, but rarely focused on arbitrary objectives and sequential models, which are required to address various COPs. The distillation of sequential models was investigated in several works kim2016sequence; jiao-etal-2020-tinybert. However, to the best of our knowledge, our work is the first to explore the distillation from sequential RL models to score-based ranking models.

5 Conclusion

In this paper, we presented a distillation-based COP framework that yields an efficient, high-performance model. Through experiments, we demonstrated that it is feasible to distill the ranking policy of deep RL into a score-based ranking model without compromising performance, thereby enabling low-latency inference on COPs.

The direction of our future work is to adapt our RL-based encoder-decoder model and distillation procedure for various COPs with different consistency degrees between embeddings and decoding results and to explore meta learning for fast adaptation across different problem conditions.


We would like to thank anonymous reviewers for their valuable comments and suggestions.

This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) under Grant 2021-0-00875 and 2021-0-00900, by the ICT Creative Consilience Program supervised by the IITP under Grant IITP-2020-0-01821, by Kakao I Research Supporting Program, and by Samsung Electronics.


Appendix A Proof of proposition 1


By the chain rule, we have

where the first part follows directly from elementary calculus. For the second part, we note that where and where is a permutahedron generated by w. Applying the chain rule and Proposition 4 in blondel2020fast, we obtain

for some integers such that and some permutation . This completes the proof. ∎

Here, we can analyze how behaves as we tune the hyperparameter . We first formulate the function in terms of an isotonic optimization problem,


where and . It is known that for some . Note that the optimal solution of the th term inside the ArgMin of (A.2) is just . In the following, fix a vector s for the moment.

If we take , then the values become more out-of-order, meaning that . This implies that the partition of produced by the pool adjacent violators (PAV) algorithm pavalgorithm becomes coarser, with larger blocks (note that our objective in (A.1) is ordered in decreasing order). Consequently, the block diagonal matrix on the right-hand side of (20) in Proposition 1 becomes more uniform. When the right-hand side is multiplied by the reciprocal of in the same equation, the gradient becomes small but uniform.

On the other hand, if we take , the difference between I and the block diagonal matrix has more zero entries. This is because the optimization is much more likely to have in-order solutions, meaning that . Consequently, the block diagonal matrix in (20) tends toward the identity matrix. In the training phase, the error term becomes small if the score s leads to correct predictions for the given ranking labels. Accordingly, the th score does not take part in backpropagation.

In summary, we exploit -controllability to move between “hard” rankings with exact gradients and “soft” rankings with uniform, small gradients, for different settings from to .
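The behavior described above can be reproduced numerically with a small sketch. It assumes the soft-rank construction of blondel2020fast, in which the scores are scaled by the reciprocal of the hyperparameter, projected onto the permutahedron generated by (n, ..., 1), and the projection is computed through a decreasing isotonic regression solved with PAV. The names `eps`, `isotonic_decreasing`, and `soft_rank` are illustrative and do not come from our implementation.

```python
import numpy as np

def isotonic_decreasing(y):
    """PAV for min ||v - y||^2 subject to v[0] >= v[1] >= ... Returns (v, block sizes)."""
    blocks = []  # each block is [sum, count]
    for val in y:
        blocks.append([float(val), 1])
        # Pool adjacent blocks while the decreasing order is violated.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    v = []
    for s, c in blocks:
        v.extend([s / c] * c)
    return np.array(v), [c for _, c in blocks]

def soft_rank(theta, eps):
    """Soft ranks (largest score gets rank 1) and the PAV block sizes."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    z = -theta / eps                       # scaled input (assumed parameterization)
    w = np.arange(n, 0, -1, dtype=float)   # generator of the permutahedron
    order = np.argsort(-z)                 # indices sorting z in decreasing order
    s = z[order]
    v, sizes = isotonic_decreasing(s - w)
    ranks = np.empty(n)
    ranks[order] = s - v                   # projection, mapped back to input order
    return ranks, sizes

if __name__ == "__main__":
    for eps in (0.5, 100.0):
        ranks, sizes = soft_rank([3.0, 1.0, 2.0], eps)
        print(eps, ranks, sizes)
```

With a small `eps` every PAV block has size one and the output equals the hard ranks; with a large `eps` all entries are pooled into a single block and the ranks collapse toward their mean, which corresponds to the uniform, small gradient discussed above.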

Appendix B Multidimensional Knapsack Problem (MDKP)

Problem Description

Given an -sized item set where the th item is represented by , is a value and represent -dimensional weights. A knapsack has -dimensional capacities .

In the RL formulation, a state consists of and at timestep such that . Note that is a set of allocable items and is a set of the other items in . To define an allocable item, suppose the items are in the knapsack at time . Then, an item is allocable if and only if and . Then, an action corresponds to a selection to place in the knapsack. Considering that the objective of MDKP is to maximize the total value of selected items, a reward is given by


When no items remain to be selected, an episode terminates. Accordingly, a model is learned to establish policy that maximizes the total reward in (11) and thus optimizes the performance of total values achieved for a testing set, e.g.,


where is a testing set of item sets, represents the index of a selected item, and denotes the length of an episode. We set each episode to have a single item set with items.
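As an illustration of how a priority order translates into a knapsack solution, the following minimal sketch places items in ranking order and skips any item that no longer fits. It assumes that an item is allocable when it has not yet been placed and its weight does not exceed the remaining capacity in every dimension; the function name and the toy instance are hypothetical.

```python
import numpy as np

def knapsack_value_from_ranking(values, weights, capacities, ranking):
    """Total value and chosen items when items are placed in ranking order."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    remaining = np.asarray(capacities, dtype=float).copy()
    total, chosen = 0.0, []
    for i in ranking:
        # Assumed allocability test: the item fits in every weight dimension.
        if np.all(weights[i] <= remaining):
            remaining -= weights[i]
            total += float(values[i])
            chosen.append(int(i))
    return total, chosen

if __name__ == "__main__":
    values = [10.0, 7.0, 4.0]
    weights = [[4, 3], [3, 3], [2, 1]]   # two weight dimensions per item
    capacities = [6, 5]
    print(knapsack_value_from_ranking(values, weights, capacities, [0, 1, 2]))
    print(knapsack_value_from_ranking(values, weights, capacities, [1, 2, 0]))
```

Skipping non-fitting items while scanning the ranking is equivalent to repeatedly selecting the highest-ranked allocable item, so two different rankings of the same item set can yield different total values.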

Dataset Generation

Given a number of items , we set -dimensional weights of item to be randomly generated from a uniform distribution for integers, and the item value is given by


where manipulates the correlation of weights and values. For -dimensional capacities of a given knapsack, we set . A range of parameter values for items is given in Table B.5.

Item Parameter Values
Num. of items {50, 100, 150}
Weights dim. {3, 10, 15}
Max weights {200, 500}
Correlation {0, 0.9}
Table B.5: The generation range for items in MDKP

Through the encoder, input items are transformed to a raw feature vector representation. Given vector , we denote Max(z) = max, Min(z) = min, and Mean(z) = . For item , we denote its utilization as . These input raw features are listed in Table B.6.

Item representation Weight Utilization , , Value Utilization , , Min-Max Utilization , ,

Table B.6: Raw features for items in MDKP

Model Structure

In MDKP, we use an encoder structure for a student network in the RL-to-rank distillation. Specifically, item is converted to the embedding for some parameters and , giving a matrix . A global representation on the embeddings is calculated by


for arbitrarily initialized learnable parameter . Then, the score of item is defined by


which is similar to (6) for timestep . The detailed hyperparameter settings for the teacher RL and distilled student models are given in Table B.7.

| Hyperparameter | RL | Distilled |
|---|---|---|
| Batch size | 128 | 1 |
| Iterations | 250 | 10000 |
| Learning Rate | | |
| Optimizer | Adam | Adam |
| of DiffRank | - | |
| Embed. Dim. | 256 | - |
| Num. Att. Layer | 2 | - |
| Discount Factor | 1 | - |

Table B.7: Hyperparameter settings in MDKP

Model Training

In training with the REINFORCE algorithm, we leverage greedy algorithms to establish a baseline in (12). Suppose that by using a specific greedy algorithm that selects an item of the highest weight-value ratio at each timestep, say greedy policy , we obtain an episode . Then, the baseline is established upon the episode by


for discount factor .
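A minimal sketch of such a greedy baseline is given below. It rests on several assumptions that are not fixed by the text above: the ratio is taken as the item value divided by the sum of its weights, the per-step reward is the value of the placed item, and the baseline at each timestep is the discounted sum of the remaining rewards of the greedy episode. The names `greedy_episode` and `discounted_returns` are illustrative.

```python
import numpy as np

def greedy_episode(values, weights, capacities):
    """Rewards of an episode that always places the allocable item with the best ratio."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    remaining = np.asarray(capacities, dtype=float).copy()
    ratio = values / weights.sum(axis=1)      # assumed ratio: value per unit of summed weight
    unplaced = np.ones(len(values), dtype=bool)
    rewards = []
    while True:
        allocable = unplaced & np.all(weights <= remaining, axis=1)
        if not allocable.any():
            break                             # episode ends when nothing can be placed
        i = int(np.argmax(np.where(allocable, ratio, -np.inf)))
        remaining -= weights[i]
        unplaced[i] = False
        rewards.append(float(values[i]))      # assumed reward: value of the placed item
    return rewards

def discounted_returns(rewards, gamma=1.0):
    """Discounted sum of remaining rewards at each timestep."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

if __name__ == "__main__":
    rewards = greedy_episode([10.0, 7.0, 4.0], [[4, 3], [3, 3], [2, 1]], [6, 5])
    print(rewards, discounted_returns(rewards, gamma=1.0))
```

With a discount factor of 1, as listed in Table B.7, the baseline at the first timestep is simply the total value collected by the greedy policy.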

For the problem-specific loss function in (17), consider a student network generating score vectors . Then, if each item is selected by the rankings based on scores in s, we obtain a vector with


Similarly, if each item is selected by the rankings based on supervised labels y from the teacher RL model, we obtain where . Accordingly, we can have such that


that penalizes the difference on made by the rankings based on the supervised labels and the rankings based on the scores.

We implement our models using Python 3.6 and PyTorch 1.8, and train them on a system with an Intel(R) Core(TM) i9-10940X processor and an NVIDIA RTX 3090 GPU.

Comparison with an Optimal Solver

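| Time limit | SCIP Perf | SCIP Time | GLOP Perf | GLOP Time | RD Perf | RD Time | RD-G Perf | RD-G Time |
|---|---|---|---|---|---|---|---|---|
| 3s | 100% | 3s | 894% | 0.02s | 884% | 0.004s | 908% | 0.02s |
| 60s | 100% | 60s | 201% | 0.02s | 198% | 0.004s | 209% | 0.02s |
| 600s | 100% | 600s | 90.4% | 0.02s | 92.0% | 0.004s | 94.7% | 0.02s |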

Table B.8: Relative performance to SCIP

We test the mixed integer programming solver SCIP, as provided in OR-Tools, and observe that its inference is slow, as shown in Table B.8. For example, under a 3s time limit SCIP achieves more than 8 times lower performance than ours (RD, RD-G), while under a 600s time limit it achieves only about 5% higher performance than ours. The performance gain of SCIP is thus small relative to its inference time, which is unacceptably long for mission-critical environments.

Appendix C Global Fixed Priority Scheduling

Problem Description

For a set of -periodic tasks, in GFPS, each task is assigned a priority (an integer from to ) to be scheduled. That is, GFPS with a priority order (or a ranking list) is able to schedule the highest-priority tasks in each time slot upon a platform comprised of homogeneous processors without incurring any deadline violation of the periodic tasks over time.

Specifically, given an -sized task set where each task is specified by its period, deadline, and worst case execution time (WCET), we formulate an RL procedure for GFPS as below. A state is set to have two task sets and for each timestep , where the former specifies priority-assigned tasks and the latter specifies the other tasks in . Accordingly, holds for all . An action () implies that the task is assigned the highest priority in . A reward is calculated by


where is a priority order and is a given schedulability test Guan2009 which maps to 0 or 1 depending on whether or not a task set is schedulable by GFPS with .

With those RL components above, a model is learned to establish such a policy in (9) that generates a schedulable priority order for each , which passes , if that exists. The overall performance is evaluated upon a set of task set samples .


Note that we focus on an implicit deadline problem, meaning that period and deadline are the same, under non-preemptive situations.

Dataset Generation

To generate task set samples, we exploit the Randfixedsum algorithm Emberson2010, which has been used widely in research of scheduling problems brandenburg2016global; gujarati2015multiprocessor. For each task set, we first configure the number of tasks and total task set utilization . The Randfixedsum algorithm randomly selects utilization of for each task , i.e., . The algorithm also generates a set of task parameter samples each of which follows the rules in Table C.9, yielding values for period , WCET , and deadline under the predetermined utilization , where we have .

Task Parameter Generation Rule
 for implicit deadline
Table C.9: Generation rules for tasks in GFPS

Three properties and are transformed into a vector representation of each task in Table C.10.

Task representation Log scale representation Utilization Slack time

Table C.10: Raw features for tasks in GFPS

Model Structure

In GFPS, we use a linear network for a student model. Specifically, task is transformed to its corresponding score by


for some learnable parameter and . Hyperparameter settings for RL and distilled models are shown in Table C.11.

| Hyperparameter | RL | Distilled |
|---|---|---|
| #Train. Samples | 200K | 200K |
| Batch size | 256 | 20 |
| Epochs | 30 with early stop | 9 |
| Learning Rate | | |
| Optimizer | Adam | Adam |
| of DiffRank | - | |
| Embedded Dim. | 128 | - |
| Num. Att. Layer | 2 | - |
| Discount Factor | 1 | - |

Table C.11: Hyperparameter settings in GFPS

Appendix D Travelling Salesman Problem

TSP is a problem of assigning priorities to points to be visited so that the total distance to visit all points is minimized. Specifically, given an -sized set of points , a state consists of two sets and at each timestep , where is a set of priority-assigned points and is a set of priority-unassigned points. An action where corresponds to assigning the highest rank to point , so that it is visited at time . We define a reward by


where is a priority order of all points, and the function Dist calculates the total distance by .
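For concreteness, the total distance induced by a priority order can be computed as in the sketch below. It assumes Euclidean distances between consecutive points and, by default, a closed tour that returns to the first point; whether Dist closes the tour is an assumption here, and the function name `tour_length` is illustrative.

```python
import numpy as np

def tour_length(points, order, closed=True):
    """Total Euclidean distance when points are visited in the given priority order."""
    pts = np.asarray(points, dtype=float)[list(order)]
    if closed:
        nxt = np.roll(pts, -1, axis=0)   # successor of the last point is the first one
        cur = pts
    else:
        nxt, cur = pts[1:], pts[:-1]
    return float(np.linalg.norm(nxt - cur, axis=1).sum())

if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(tour_length(square, [0, 1, 2, 3]))   # perimeter of the unit square
    print(tour_length(square, [0, 2, 1, 3]))   # order that crosses both diagonals
```

On the unit square, visiting the corners in order gives a shorter tour than an order that crosses the diagonals, which is the kind of difference a reward based on this distance would have to capture.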

Here, we report the TSP performance of RL-based models compared to a random insertion heuristic in Table D.12. For this experiment on TSP, we use the open-source implementation of kool2018attention, which has an encoder-decoder structure with sequential processing, similar to ours.

| | Heuristics Distance | Heuristics Gap | Heuristics Time(s) | RL Distance | RL Gap | RL Time(s) |
|---|---|---|---|---|---|---|
| N=20 | 3.92 | - | 0.0019 | 3.84 | 97.9% | 0.4704 |
| N=50 | 6.01 | - | 0.0054 | 5.89 | 98.0% | 0.9709 |

Table D.12: Performance of TSP. A Gap is a ratio of distance to Heuristics.

Appendix E Limitations and Generality of Our Approach

We analyze the characteristics of various COPs in Figure 2 in the main manuscript, demonstrating that RLRD works well for MDKP and GFPS, but not for TSP. Here, we provide a more detailed explanation by comparing TSP with sorting.

Let be and consider the corresponding unnormalized distribution . Upon sampling which is the number with the largest probability in greedily from , a new set is obtained and has to be sorted. By masking the index of 4 in , new unnormalized probability distribution is obtained. Sorting is conducted by sampling greedily from . In other words, sorting can be perfectly solved by inductively doing greedy sampling without replacement, starting from one single distribution . On the other hand, in TSP, the probability distribution over items highly depends on the items chosen at previous steps, or contexts, as we described. In this sense, we consider TSP to be fully context-dependent, because no such single distribution exists for TSP in contrast to sorting. If the student model performs greedy sampling without replacement from one distribution for a context-dependent problem, it might end up with sub-optimal performance.
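The sorting case can be made concrete with a short sketch: one fixed score vector is used at every step, the largest unmasked entry is taken greedily, and its index is then masked. The scores below are made up for illustration and are used directly as the unnormalized distribution; the function name is hypothetical.

```python
import numpy as np

def greedy_without_replacement(scores):
    """Repeatedly take the arg max of one fixed score vector, masking chosen indices."""
    scores = np.asarray(scores, dtype=float)
    mask = np.zeros(len(scores), dtype=bool)
    order = []
    for _ in range(len(scores)):
        masked = np.where(mask, -np.inf, scores)   # already chosen items cannot be picked
        i = int(np.argmax(masked))
        order.append(i)
        mask[i] = True
    return order

if __name__ == "__main__":
    numbers = [2.0, 4.0, 1.0, 3.0]
    order = greedy_without_replacement(numbers)
    print(order, [numbers[i] for i in order])
```

Because the score vector never changes between steps, the resulting order is simply the descending sort of the inputs. A context-dependent problem such as TSP would instead require the scores to be recomputed after every selection.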

Many COPs lie between sorting (not context-dependent) and TSP (fully context-dependent) in terms of the degree of context dependency, and we consider them a feasible problem space for RLRD. The characteristics of MDKP and GFPS, compared with TSP, are summarized in the table below.

| COP | MDKP | GFPS | TSP |
|---|---|---|---|
| Context-Dependency | Low | Moderate | High |
| Performance of Greedy Baseline | High | Moderate | Low |

Table D.13: Characteristics of COPs

Appendix F Choice of the Number of Sampling

In this section, we explain our default settings on the number of Gumbel samplings, which are set to 30 and 10 for MDKP and GFPS problems, respectively.

Figure F.3: Overall performance and inference time with respect to Gumbel sampling sizes

Figure F.3 shows the performance and inference time in MDKP and GFPS with respect to the Gumbel sampling size. In MDKP, the performance gain obtained per unit of additional inference time diminishes at around 30 samples. Likewise, in GFPS, it diminishes at around 10 samples. These empirical results motivate our settings for the Gumbel trick-based sequence sampling. We also take the inference time of the teacher RL models into account, and limit the sampling sizes so that the inference time of the distilled models stays much shorter than that of their respective teacher RL models.

Furthermore, we normalize score by


before Gumbel perturbation is added to score s, shown in (21). This is because the Gumbel-trick might not generate any variation of rankings, if the gap between item scores is too big, and Gumbel perturbation might generate irrelevant random rankings, if the gap is too small.
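A minimal sketch of Gumbel trick-based sequence sampling is shown below. The normalization used here (zero mean and unit standard deviation) is an assumption standing in for the normalization above, and including the unperturbed ranking among the candidates is likewise a choice made for this sketch. The names `sample_rankings` and `best_of_samples` are illustrative.

```python
import numpy as np

def sample_rankings(scores, num_samples, rng):
    """Sample rankings by adding Gumbel noise to normalized scores and sorting."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.mean()) / (s.std() + 1e-8)          # assumed normalization of the scores
    noise = rng.gumbel(size=(num_samples, len(s)))  # standard Gumbel perturbation
    return np.argsort(-(s + noise), axis=1)         # one ranking per row, best first

def best_of_samples(scores, objective, num_samples=30, seed=0):
    """Return the candidate ranking with the highest objective value."""
    rng = np.random.default_rng(seed)
    greedy = np.argsort(-np.asarray(scores, dtype=float))
    candidates = [greedy] + list(sample_rankings(scores, num_samples, rng))
    return max(candidates, key=objective)

if __name__ == "__main__":
    scores = [0.1, 5.0, 2.0, 3.0]
    # Toy objective: prefer rankings that place item 2 as early as possible.
    objective = lambda order: -float(list(order).index(2))
    print(best_of_samples(scores, objective, num_samples=8))
```

If the scores were left unnormalized with very large gaps, the noise would almost never change the order and every sample would repeat the greedy ranking; with very small gaps, the samples would be close to uniformly random permutations.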