
Domain-specific Genetic Algorithm for Multi-tenant DNN Accelerator Scheduling

by   Sheng-Chun Kao, et al.

As Deep Learning continues to drive a variety of applications in datacenters and HPC, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on the problem of mapping layers from several DNNs simultaneously on an accelerator. Given the extremely large search space, we formulate the search as an optimization problem and develop a specialized genetic algorithm called G# with custom operators to enable structured sample-efficient exploration. We quantitatively compare G# with several common heuristics, state-of-the-art optimization methods, and reinforcement learning methods across different accelerator settings (large/small accelerators) and different sub-accelerator configurations (homogeneous/heterogeneous), and observe that G# can consistently find better solutions. Further, to enable real-time scheduling, we also demonstrate a method to generalize the learnt schedules and transfer them to the next batch of jobs, reducing schedule compute time to near zero.





1 Introduction

Accelerators for Deep Neural Network (DNN) models are commonplace today in datacenters. As AI workloads continue to drive up the demand for compute, there is a trend towards building large accelerators housing several sub-accelerators/arrays (summarized in Table 1). Key examples include MCM-based SIMBA [simba], wafer-scale Cerebras [cerebras], and scaled-out platforms [cloud_tpu, aimt]. Some recent studies have also explored heterogeneous multi-accelerator designs enabled via reconfiguration [planaria] or separate sub-accelerators [herald].

With the emergence of such platforms, enabling multi-tenancy is a natural use-case. This is driven by two trends. First, end applications (such as AR/VR [herald] or self-driving [chowdhuri2019multinet, xiao2020multimodal]) and cloud services (such as search [ncf, zhang2018towards] and translation [xlm, bart]) often rely on several DNN models internally. Second, running queries from several users simultaneously can help enhance throughput and meet SLA requirements. While there has been significant prior work on scheduling a single DNN model efficiently over one (or more distributed) accelerators [shen2017maximizing, suda2016throughput, cong_fpga, systolic_mapping, lu2017flexflow, stoutchinin2019optimally], scheduling multiple models simultaneously is relatively unexplored, and is the focus of this work. Some recent works on multi-tenant DNN accelerators such as Prema [prema], AI-MT [aimt] and Herald [herald] use heuristics (SJF, random, greedy) [aimt, prema, herald]. However, they often work under the premise of homogeneity in the underlying accelerator-modules [aimt, prema] or pre-defined heterogeneous sub-accelerators [herald]. In other words, the scheduling algorithms are heavily tied to the designed multi-module accelerator. While this might be reasonable for an edge accelerator running a known set of models [herald], enabling multi-tenancy in datacenters needs a scheduler that can work across evolving hardware platforms.

Figure 1: Multi-tenant accelerator with schedule optimizer.
Table 1: The comparisons of related works on multi-tenancy and multi-sub-accelerators.

In this work, we propose a multi-tenant DNN scheduler called G-SHARP (Genetic algorithm-based Scheduler for Heterogeneous Accelerator Platforms, or G#) for multi-module accelerators housing multiple homogeneous or heterogeneous sub-accelerators, as shown in Fig. 1. We break the multi-tenant schedule into two components: sub-accelerator selection and job prioritization. Sub-accelerator selection is where we assign each job a sub-accelerator to execute on; job prioritization is where we order the jobs that are assigned to a sub-accelerator. Each component creates an immense design space by itself. The full design space is the combinatorial product of both, which becomes as large as O( ) (discussed in Section 3). The constraint in this optimization problem is the memory and interconnect bandwidth shared across the sub-accelerators. Given a multi-tenant schedule, each sub-accelerator then schedules the job assigned to it via its own internal scheduler (which is the problem of mapping a DNN layer on a single PE array, for which several solutions exist [eyeriss_isca, timeloop]).

Compared to prior work on multi-tenant DNN scheduling [aimt, prema, herald], this work expands the scope of the problem-space in the following ways:

  • We optimize both job scheduling across sub-accelerators, and execution order of jobs on each of them, while prior works primarily focus on the former.

  • We target both homogeneous and heterogeneous DNN accelerator platforms.

  • We target a diverse spectrum of models across vision, language and recommendation which exhibit different bandwidth requirements.

Our solution, G#, includes the following novel features:

  • (1) An encoding format for the scheduling problem to formulate it as an optimization problem. This enables our scheduling framework (to be open-sourced upon acceptance of this work) to leverage any black-box optimizer [nevergrad], including G#.

  • (2) Several domain-specific operators to enable structured exploration of the large mapping space. This makes G# orders of magnitude faster and more sample-efficient than baseline optimization methods [ga, de, es_origin, cma, pso_paper, tbpsa, portfolio], baseline genetic algorithm (GA), and Reinforcement-learning (RL)-based methods [ppo2, a2c]. From our comprehensive experiments across 42 widely-used models across vision, language and recommendation running over several (simulated) hardware platforms, G# achieves 86.4x to 610.7x speedup over schedules generated by other methods.

  • (3) A method to generalize and transfer the scheduling knowledge to future jobs without re-running the scheduling algorithm. This enables G# to perform runtime scheduling without requiring any search, unlike other optimization methods. In this mode, we observe 10.2x speedup over top-performing heuristics with the same (near zero) search time.

2 Background

2.1 Characteristics of DNN Models

In this paper, we consider three kinds of applications that are common in DNN-based cloud services: Vision, deep recommendation system (Recom), and language model (Lang).

Vision. Most Vision models [Resnet, vgg, sandler2018mobilenetv2, szegedy2017inception, Alexnet] are dominated by convolution layers (2D/depth-wise/point-wise) (CONV), and many of them have a MultiLayer Perceptron (MLP) or fully connected layer (FC) at the end [Alexnet, vgg].

Recom. Recommendation models are dominated by either MLP, attention, or embedding lookup layers [deeprecsys, dlrm, widedeep, ncf]. With respect to the compute and HW cost, the MLPs and the attention layers are modeled as several FCs. We assume the embedding lookups are kept in the CPU host.

Lang. Language models are often dominated by MLPs and attention layers with a word embedding dimension. Their compute/HW cost is modeled by what we define as an FC-embedded (FC-E) layer, a fully connected layer with an embedding dimension. FC-Es are often used as hidden layers of RNNs or as MLPs and attention layers of transformer-based models [elmo, bert, gpt2, transformerxl].

Different applications and their dominant layer types are summarized in Fig. 2, where the dimension notations used throughout this paper are included.

Figure 2: The dominant layers of three applications and the dimension definitions of different layer types. CONV: K/C: size of output/input channels, Y/X: height/width of activations, R/S: height/width of weight. FC: O: Output nodes, I: Input nodes, FC-E: E: Embedding sizes.

2.2 Multi-tenant Acceleration Platform

2.2.1 Accelerator Architecture

As shown in Fig. 1, all sub-accelerators share the "system BW" via an interconnection network. We define system BW as the minimum of main memory (e.g., HBM/DRAM) BW and host-accelerator (e.g., PCIe) BW. The specific interconnection network architecture can vary depending on the target technology (on-chip [aimt] versus on-package [simba] versus wafer-scale [cerebras]), and the scheduler is agnostic to this. In this work, we target accelerators with both homogeneous and heterogeneous sub-accelerators. The motivation for heterogeneity among sub-accelerators comes from diverse dataflow preferences for different kinds of layers [maestro]. For instance, certain sub-accelerators could be optimized for convolutions [eyeriss_isca, nvdla] to service vision models, some for GEMM [tpu, nvidia_volta] to service NLP models, and some for embeddings to service recommendation models [hwang2020centaur].

2.2.2 Sub-Accelerator Architecture and Schedule

Each sub-accelerator in our system is a conventional DNN accelerator that is comprised of an array of Processing Elements (PE). Each PE has a MAC to compute partial sums, and local scratchpad (called SL in this paper) to store weights, activations, and partial sums. Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next tile of computation that will be mapped over the PEs and SLs. Networks-on-Chip (NoCs) are used to distribute operands from the SG to the SLs and write the outputs back to the SG.

Local Schedule. Given a DNN layer, each sub-accelerator employs what we call a local schedule (to distinguish it from the global schedule determined by G#). The local schedule (aka mapping [maestro]) determines the loop order, parallelism dimensions, and tile sizes for running the layer on the sub-accelerator. From the data movement perspective, a tile is the basic data movement unit from DRAM/HBM to the SG. The tile sizes are bounded by the size of the SG buffer. The SG is double-buffered [maestro, eyeriss_isscc] to try to hide the data-fetching latency of a tile from DRAM/HBM behind the compute latency. However, if the bandwidth to main memory is insufficient to hide the fetch latency (based on the bandwidth allocation determined by the scheduler), the accelerator will stall. The local schedule depends on the dataflow [eyeriss_isca, maestro] strategy of the accelerator. For instance, the NVDLA [nvdla] dataflow keeps weights in the outermost loop (i.e., weight-stationary) and schedules the weight tiles spatially over the array along the input and output channels for parallelism.

3 Problem Formulation

The focus of this paper is designing a scheduling optimization method for multi-tenant heterogeneous DNN accelerators. We formulate the scheduling problem as an optimization problem and discuss the details here. In Section 4, we provide specific details of our proposed algorithm.

3.1 Scheduler Overview

In this paper, we refer to a "job" at layer granularity (i.e., each DNN layer is a job instance). Fig. 1 shows an overview of the multi-tenant schedule optimizer. The CPU host packs independent jobs (layers) into a ready-batch and queries the optimizer for a schedule, making it a static scheduling problem, a procedure consistent with the assumption in many resource scheduling works [scarl, deeprm, delimitrou2014quasar]. The host's query consists of a description of a batch of jobs as shown in Fig. 1.

The optimizer takes the job query as input and outputs a schedule for the batch. The schedule consists of two key components:

  • Sub-accelerator selection: the assignment of each job to a specific sub-accelerator.

  • Job prioritization: execution order of a batch of jobs on a given sub-accelerator.

The schedule gets appended with the portion of the system bandwidth that is allocated to each sub-accelerator for each job (Section 3.6). The final schedule is leveraged by the DMA controller to manage data movement from the system memory to the sub-accelerator scratchpads. While the current batch is running, the host optimizes the schedule for subsequent batches.

Optimization Framework. In this work, we develop a technique to optimize for both components of the schedule simultaneously. The structure of our proposed schedule optimizer, called G#, is shown in Fig. 3. We follow a genetic algorithm flow. At a high level, in each time epoch, G# generates multiple valid schedules, evaluates them using a cost model, picks the best ones, and uses those to generate better schedules for the next time epoch. The optimization loop finishes when the target objective value (e.g., latency) converges and the constraint (e.g., bandwidth utilization) is met, or it can be stopped after a fixed number of time epochs (if there is a constraint on the amount of time available for scheduling). We provide more details about the G# algorithm in Section 4. Our proposed flow can also work with other optimization methods, as we show in Section 4.4.

We use previously proposed heuristics for both job priority and sub-accelerator selection, along with optimization methods that can solve for both together, as baselines in our evaluations (Table 5).

3.2 Search Space

The full search space of the proposed scheduling problem is the combination of the choices for sub-accelerator selection and job priority. In this paper, we assume a maximum batch size of 100 jobs, i.e., 100 parallel jobs to be offloaded. Assume we are scheduling for a platform with 4 sub-accelerators, with each sub-accelerator assigned approximately 100/4=25 jobs. The number of possible combinations is then extremely large. Therefore the sample efficiency (performance improvement per unit of sampling budget) of the optimization method, which determines the convergence rate, becomes a key factor.
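To give a feel for the scale, the following is a minimal sketch of one way to count schedules. It assumes that a schedule is fully described by which sub-accelerator each job goes to and by the order of jobs on each sub-accelerator, and that queues may be empty. The function name `count_schedules` is hypothetical, and this counting is an illustration rather than necessarily the formula used by the authors.

```python
from math import comb, factorial

def count_schedules(num_jobs: int, num_accels: int) -> int:
    """Count (assignment, per-accelerator order) pairs.

    Assumption: arrange all jobs in one sequence (num_jobs! ways), then cut
    the sequence into num_accels consecutive, possibly empty queues
    (stars-and-bars: C(num_jobs + num_accels - 1, num_accels - 1) ways).
    """
    return factorial(num_jobs) * comb(num_jobs + num_accels - 1, num_accels - 1)

total = count_schedules(100, 4)
print(f"roughly 10^{len(str(total)) - 1} candidate schedules")
```

Even under this simple counting, exhaustive search over a batch of 100 jobs is out of reach, which is why sample efficiency matters.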

3.3 Objective and Constraints

We measure the performance of a schedule by the makespan latency of the batched jobs, i.e., the time elapsed from the start of the jobs to the end of the last job to finish. Makespan latency is the most critical case (100th percentile) of the tail latency, which is the completion time of the 95th, 98th, 99th, or 100th percentile of the batched jobs. The objective is described below:

$$\min \; T_{\text{makespan}}, \qquad T_{\text{makespan}} = \max_{n} T_{A_n},$$

where $T_{A_n}$ is the runtime of $A_n$ (Sub-Accelerator n) to run all jobs assigned to it. The goal is to distribute the jobs so that the schedules of all working sub-accelerators are as compact as possible. However, the sub-accelerators interfere with each other's performance by competing for the system BW. The total accelerator BW usage at any time $t$ should not exceed the system BW constraint, shown below:

$$\sum_{n} BW_{A_n}(t) \le BW_{\text{sys}}, \quad \forall t.$$

Therefore, the schedule optimizer is not only tasked to schedule the sub-accelerator allocation and job order but also to generate a BW allocation schedule across the runtime.

Alternate Objectives. In this paper, we demonstrate our algorithm with the objective of minimizing the makespan latency of the batch of jobs, targeting maximum system throughput. However, other objectives such as system energy, system power, and energy-delay-product are all valid for different applications/scenarios and can be easily applied. We present a case study on latency-sensitive applications and batches with DNN layer dependencies in Section 5.7 and provide the formulated objective.

The makespan latency, which is the objective, could also be set as a "soft" constraint (a constraint that is allowed to be violated with a certain penalty) for applications with tight inference latency goals. The optimizer will approach the constraint as closely as possible. Once the constraint is met, the optimizer can terminate early.

Alternate Constraints. In addition to the system BW constraint, which is essential in our task, G# can support other constraints such as latency, energy, and power. The search time (optimizer runtime) can also be set as a constraint and serve as a stopping criterion for the algorithm.

Figure 3: The structure and flow of G#.

Set-up: Host sets up the optimizer with Accel. config., Constraint, and Objective.
Pre-process: Job Analyzer: Prepare the job analysis table. Init.: Initialize random genes by putting random values into the encoder. The output genes consist of two genomes (features) representing Accel. Sel. and Job Prio.
Optimization Loop: Evolution block: Genetic operators: Accel. Sel. and Job Prio., represented by two genomes, respectively, are evolved by genetic operators. Evaluation block: Decoder: Decode the genes into a description of the schedule in Fig. 4(a). Job BW allocator: Use the description to manage/allocate the BW to each accel. at each time step. Fitness: Extract and set the makespan latency as the fitness value. Select: Select the parents of the next generation.

Figure 4: (a) Schedule description from the decoder. (b) The BW and the corresponding sub-accels schedule from the jobs BW allocator.

3.4 Optimization Algorithm Flow

Set-up: At the start, the host sets up the optimizer by feeding in the configuration (number of PEs, local schedule) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., latency).

Pre-process: Job analyzer receives job descriptions from the host and prepares a job analysis table as shown in Fig. 3. Init creates random genes by putting random values into genetic encoder. The encoded genes represent two features: the schedule for sub-accelerator selection and job prioritizing (Section 4.2.2). The genes are sent into the optimization loop.

Optimization Loop:

  • Evolution block: Genetic operators. The genes representing sub-accelerator selection and job prioritizing are evolved by four designed genetic operators, described in Section 4.2.3.

  • Evaluation block: Decoder decodes genes into a schedule description as shown in Fig. 4(a). Job BW allocator takes in the schedule description and allocates the BW for each sub-accelerator. Fitness function extracts the objective and sets it as fitness value. Select function selects the individuals (i.e., schedules) with the highest fitness as the parents for the next generation.

This finishes one generation/epoch of the algorithm. (The generalized solution block is described later in Section 4.3.) The solution, which is a detailed schedule as shown in Fig. 4(b), is output to the host.
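The loop above can be sketched as a generic elitist evolutionary loop. This is a minimal illustration, not the authors' implementation: `evolve`, `make_child`, and `elite_ratio` are hypothetical names, the fitness is assumed to be a cost to minimize (such as makespan latency), and decoding and BW allocation are assumed to happen inside the `fitness` callable.

```python
import random

def evolve(init_population, fitness, make_child, generations,
           elite_ratio=0.1, rng=None):
    """Elitist GA loop: evaluate, select the best, breed the next generation.

    fitness(ind)          -> cost to minimize (e.g., makespan latency).
    make_child(a, b, rng) -> new individual bred from parents a and b.
    """
    rng = rng or random.Random()
    population = list(init_population)
    n_elite = max(1, int(len(population) * elite_ratio))
    for _ in range(generations):
        # Evaluation block: rank individuals by cost (lower is better).
        ranked = sorted(population, key=fitness)
        # Select: keep the best individuals as parents (and as survivors).
        parents = ranked[:n_elite]
        # Evolution block: genetic operators produce the rest of the population.
        children = [make_child(rng.choice(parents), rng.choice(parents), rng)
                    for _ in range(len(population) - n_elite)]
        population = parents + children
    return min(population, key=fitness)
```

Because the parents survive unchanged into the next generation, the best cost found never gets worse from one generation to the next.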

3.5 Job Analyzer

The job analyzer takes the job (layer) descriptions as input and estimates the no-stall latency and the required BW of each job on each sub-accelerator using a cost model (described below) to generate a job analysis table, as Fig. 3 shows. This table serves as a performance lookup table for the Job BW allocator (Section 3.6) within the optimization loop.

3.5.1 HW cost model for Sub-Accelerators

In G#, we leverage MAESTRO [maestro_web] as our underlying cost model for each sub-accelerator because of its ability to support diverse accelerator dataflows and configurations. (In this paper, we explore heterogeneity in terms of different specialized DNN accelerator configurations (PEs, buffer size, dataflows). However, G# is general enough that it could also consider generic architectures such as CPUs/GPUs/TPUs by plugging in their cost models.) MAESTRO supports most of the common DNN layers such as CONV, depth-wise CONV, and fully connected. Given a DNN layer, a HW resource configuration (PE, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.

3.5.2 Job Analysis Table

No-stall Latency. We define no-stall latency as the latency for running each job on each sub-accelerator, assuming it has sufficient memory bandwidth (i.e., not memory-bound). This value is computed by running each job through the MAESTRO cost-model for all sub-accelerators, which internally estimates the mapping (i.e., local-schedule). We assume double buffering at tile-granularity in the sub-accelerator to hide the tile fetch latency (except for the first tile).

No-stall Bandwidth. We define no-stall bandwidth as the minimum bandwidth a sub-accelerator requires to stay compute-bound rather than memory-bound. As described in Section 2.2.2, the local schedule for each sub-accelerator decides the basic data movement unit (i.e., tiles) and their movement pattern from main memory to the accelerator, and within the accelerator. A full compute of a layer (job) consists of multiple tiles. For example, the second layer of VGG16 [vgg] has the shape (K=64, C=64, Y=224, X=224, R=3, S=3). Assuming a tiling strategy that makes a tile of dimension (k=4, c=4, y=8, x=8, r=3, s=3), it leads to 200,704 (= 16 × 16 × 28 × 28) tiles, which divides the total data movement of the entire layer into about 200K small basic units. Based on the tile sizes and the compute time for each tile determined by MAESTRO, we estimate the BW needed to fetch the next tile while the current tile is computing.
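The tile count in the VGG16 example can be checked with a short sketch. The helper names `num_tiles` and `no_stall_bw` are hypothetical; the bandwidth helper assumes the simplest reading of the text, namely that the next tile must be fully fetched within the compute time of the current tile, and its inputs would come from a cost model such as MAESTRO rather than from this code.

```python
from math import ceil, prod

def num_tiles(layer: dict, tile: dict) -> int:
    """Number of tiles when each layer dimension is split by its tile size."""
    return prod(ceil(layer[d] / tile[d.lower()]) for d in layer)

def no_stall_bw(tile_bytes: float, tile_compute_time: float) -> float:
    """Assumed estimate: bytes of the next tile per unit of compute time."""
    return tile_bytes / tile_compute_time

vgg16_conv2 = {"K": 64, "C": 64, "Y": 224, "X": 224, "R": 3, "S": 3}
tile = {"k": 4, "c": 4, "y": 8, "x": 8, "r": 3, "s": 3}
print(num_tiles(vgg16_conv2, tile))  # 16 * 16 * 28 * 28 = 200704
```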

3.6 Jobs BW Allocator

The jobs BW allocator is the key module that enables the consideration of shared system BW. Receiving the decoded schedule shown in Fig. 4(a), the jobs BW allocator looks up those jobs' no-stall latency and required BW in the job analysis table (Section 3.5) and allocates the system BW to each sub-accelerator at each time frame using Algorithm 1. Briefly, at any time t it checks the array describing the (no-stall) BW request of each sub-accelerator. If the total request is larger than the system BW, it allocates the system BW in proportion to each sub-accelerator's BW request. Algorithm 1 outputs the detailed BW schedule for each sub-accelerator.

For example, from the output BW schedule in Fig. 4, we can tell, jobs J1 and J5 will be launched in Sub-accel-1 and Sub-accel-2, concurrently. Sub-accel-2 will be allocated more BW because it is running a more BW-intensive job (detail in Algorithm 1). When Sub-accel-2 finishes J5 and launches J3, the BW will be re-scheduled to reflect the change of live running jobs in the accelerators, where Sub-accel-1’s BW is reduced and reallocated to Sub-accel-2, as shown in Fig. 4. Finally Fig. 4(b) shows that when the total requesting BW is larger than the system BW (memory-bound), the allocator maximizes the BW usage by fully allocating them to each sub-accelerator. On the other hand, when the system BW is larger than requesting BW (compute-bound), there will be unutilized BW as shown at the right of Fig. 4(b).

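  Input: Schedule description
  Output: BW_alloc(t), t = 1, 2, …, T
  Get Lat, an array of no-stall latency for the parallel jobs at time t, t = 0
  Get BW_req, an array of required BW for the parallel jobs at time t, t = 0
  while Lat is not empty do
      if sum(BW_req) > BW_sys then
          BW_alloc(t) = BW_req × BW_sys / sum(BW_req)
      else
          BW_alloc(t) = BW_req
      end if
      runtime = min(Lat under BW_alloc(t))
      idx = argmin(Lat under BW_alloc(t))
      Fetch the next Lat and BW_req of sub-accelerator idx, compute the remaining latency of the other running jobs, insert into Lat and BW_req, and set t = t + runtime.
  end while
  The makespan latency T = t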
Algorithm 1 Job BW Allocator
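A minimal event-driven sketch of a proportional BW allocator in the spirit of Algorithm 1 is given below. It is not the authors' code: `allocate_bw` is a hypothetical name, and it assumes that a job granted a fraction of its requested BW progresses at that same fraction of its no-stall speed.

```python
from collections import deque

def allocate_bw(queues, sys_bw):
    """queues[i]: ordered list of (no_stall_latency, required_bw) for sub-accel i.

    Returns (makespan, schedule), where schedule holds
    (start_time, duration, per-accelerator BW allocation) entries.
    """
    pending = [deque(q) for q in queues]
    # current[i] = [remaining no-stall work, required BW], or None when idle
    current = [list(q.popleft()) if q else None for q in pending]
    t, schedule = 0.0, []
    while any(job is not None for job in current):
        req = [job[1] if job else 0.0 for job in current]
        total = sum(req)
        # Scale all requests down proportionally when they exceed the system BW.
        scale = min(1.0, sys_bw / total) if total > 0 else 1.0
        alloc = [r * scale for r in req]
        # Assumed slowdown model: progress rate equals granted/requested BW.
        finish = [job[0] / scale if job else float("inf") for job in current]
        step = min(finish)
        idx = finish.index(step)
        schedule.append((t, step, alloc))
        t += step
        for job in current:
            if job is not None:
                job[0] = max(0.0, job[0] - step * scale)
        # The finished sub-accelerator launches its next job, if any.
        current[idx] = list(pending[idx].popleft()) if pending[idx] else None
    return t, schedule
```

For instance, two sub-accelerators each running one job of no-stall latency 10 and required BW 4 finish at time 10 when the system BW is 8, and at time 20 when it is 4.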
Figure 5: (a) The genetic encoding and its decoding methods. Genetic operators: (b) mutation, (c) crossover-gen, (d) crossover-rg, and (e) crossover-accel. In (b-e), we show the genes of parents and children at the left and the decoded scheduling of mom and son at the right.

4 G# Optimization Algorithm

G# is a GA-based search technique. Its key differences from standard GA are that (i) it customizes the optimization algorithm's exploration momentum and mechanism (i.e., genetic operators in the GA context) for the target search space, and (ii) it provides knowledge transfer support.

4.1 Why GA?

Research shows GA reaches competitive performance with deep reinforcement learning [uberGA, openai_es] and on hyper-parameter optimization problems. STOKE [schkufza2013stochastic] and Tensor Comprehensions [vasilache2018tensor] use GA to search the space of DNN code optimization. From a search time perspective, GA is light and fast [uberGA, openai_es] compared to many optimization methods, since the optimization mechanism in GA uses simple operations (e.g., crossover and mutation). A key challenge with standard GA, however, is that it is not sample-efficient. We address this issue using our customized operators (Section 4.2.3).

Table 2: Terminology used in G# Algorithm.

4.2 G# Algorithm Details

4.2.1 Terminology and Basics of GA

We list the common GA terminology we use throughout the paper in Table 2, namely gene, genome, individual, and generation. The basic mechanism in GAs is to create a population of individuals in each generation. All individuals are evaluated and sorted based on their fitness. The best-performing individuals are used as parents to create a new population of individuals using genetic operators (Section 4.2.3).

In the context of this work, an individual is a complete scheduling strategy, a genome represents one aspect of the schedule (sub-accel. sel. or job prio.) of an individual, and each gene inside a genome represents the schedule decision for one job on either sub-accel. sel. or job prio. The goal of GA is to perturb genes (i.e., components of the schedule) and retain well-performing ones across generations.

4.2.2 Genetic encoding

The genetic encoding is the interface that bridges the evolution block with the evaluation block in Fig. 3. It describes the joint strategy of job prioritization and sub-accelerator selection, as shown in Fig. 5(a). There are two genomes per individual: the sub-accelerator ID genome and the job prioritizing genome. Each genome has N genes that correspond to the N jobs in the batch. In our evaluations, we assume the maximum job batch size to be 100. Therefore an individual has at most 200 genes. (Smaller batch sizes are also allowed; they lead to shorter genomes, and the algorithm converges faster.) The designed genetic encoding is general enough that it is not exclusive to the G# algorithm but can be used as the interface to other optimization methods as well (described in Section 4.4).

We describe the genetic encoding and genetic operators in G# using the walkthrough example in Fig. 5 assuming two sub-accelerators and a batch of five jobs.

Sub-accel. ID genome. Each gene describes the sub-accel ID for the corresponding job. For example, jobs J1 and J4 are assigned to sub-accel 1, and J2, J3, and J5 are assigned to sub-accel 2, as shown in the sub-accel selection part of the gene decoding in Fig. 5(a).

Job Prioritizing genome. Each gene describes the priority of the corresponding job. The priority value ranges from 0 to 1, where 0 is the highest priority. We order the jobs assigned to a given sub-accelerator by their priority values. For example, J1 runs before J4 in sub-accel 1, as shown in the job prioritizing part of the gene decoding in Fig. 5(a).
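The decoding of the two genomes into per-sub-accelerator job queues can be sketched as follows. The function name `decode` and the zero-based IDs are assumptions for illustration, and the priority values in the example are made up; only the resulting assignment (J1 and J4 on one sub-accelerator, J1 before J4) follows the walkthrough in the text.

```python
def decode(accel_genome, prio_genome, num_accels):
    """Turn the two genomes into one ordered job queue per sub-accelerator.

    accel_genome[j]: sub-accelerator ID (0-based) chosen for job j.
    prio_genome[j]:  priority of job j in [0, 1]; 0 is the highest priority.
    """
    schedule = [[] for _ in range(num_accels)]
    for job, accel in enumerate(accel_genome):
        schedule[accel].append(job)
    for jobs in schedule:
        jobs.sort(key=lambda j: prio_genome[j])  # smaller value runs earlier
    return schedule

# Jobs J1..J5 are indices 0..4; sub-accel 1 and 2 are IDs 0 and 1.
accel_genome = [0, 1, 1, 0, 1]
prio_genome = [0.1, 0.8, 0.3, 0.6, 0.2]   # hypothetical priority values
print(decode(accel_genome, prio_genome, 2))  # [[0, 3], [4, 2, 1]]
```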

4.2.3 Genetic operators

Standard GA Operators. The standard genetic operators in GA consist of mutation and crossover. The standard mutation operator randomly mutates some genes. The standard crossover operator samples a pivot point and exchanges the genes of the parents according to that pivot point. The sampling efficiency of the GA relies on how efficiently the genetic operators sample a high-quality next generation.

G# Operators. In G#, we inherit the standard mutation mechanism and design three specialized crossover operators. The different crossover operators are designed to preserve different dependencies among genes during exploration. They allow us to explore the scheduling problem in a more strategic manner. We describe the genetic operators next.

Mutation. During mutation, we randomly select multiple genes (according to the mutation rate) and mutate them to random values. Fig. 5(b) shows an example when mutating at the third and second genes of two genomes respectively. On the right side of the figure, it shows how the son’s genes/schedule are generated by the dad’s mutation. J3 is moved to sub-accel 1 because of the first mutation. J2 is moved to a higher priority in sub-accel 2 because of the second mutation. In our experiments, we use a mutation rate of 0.05.

Crossover-gen. This is a genome-wise crossover. First, we randomly sample a type of genome to cross over. Next, we randomly sample a pivot point and exchange the genes of the genomes. There are two benefits of genome-wise crossover. First, we keep the perturbation at the level of the genome, which potentially keeps the good characteristics of the other, untouched genomes, and is therefore more stable throughout the evolution. Second, we eliminate the order dependency of the genomes. The genomes independently represent their features, and their order provides no information (i.e., placing the Sub-accel Sel. genome first and the Job Prio. genome second does not make J5 of Sub-accel Sel. and J1 of Job Prio. strongly correlated, despite their being next to each other). Therefore, a genome-wise crossover, which operates on genomes independently, enables us to perturb the genes without unnecessary assumptions about the genome order. Crossover-gen is the major crossover function, for which we set the crossover rate to 0.9.

Fig. 5(c) shows an example in which we pick the second genome (Job Prio.) as the crossover region and the third location of the region as the pivot point. In terms of the schedule change after crossover, in the example, the orders of J4 and J5 in mom's schedule are passed to son's schedule.

Crossover-rg. This is a range crossover mechanism structured to preserve the dependency of genes across genomes. For example, in Fig. 5(a), the first and the sixth genes are dependent, since they both represent features of J1. We randomly pick a range of the genome (e.g., the 3rd to the 5th locations of each genome) and simultaneously cross over all the genes falling into the picked region from both genomes, so that the cross-genome dependency is preserved. In terms of the schedule change after crossover, the order and accel selection of J3, J4, and J5 are exchanged between two individuals. Crossover-rg has a crossover rate of 0.05.

Crossover-accel. This is a crossover method to preserve the dependency of job ordering within a sub-accelerator. We randomly select a sub-accelerator and pass the job ordering information of this sub-accelerator to the children. For example, in Fig. 5(e), we select sub-accel 2. Next, we check the Sub-accel Sel. genome of Mom, copy the genes related to sub-accel 2 (the first and second genes of both genomes in (e)), and paste them into son's genomes.

To increase load balancing, the original jobs assigned to sub-accel 2 in Son will be randomly mutated. Crossover-accel has crossover rate of 0.05.
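The four operators can be sketched as below on an individual represented as a pair `(accel_genome, prio_genome)`. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, each function returns a single child bred from `dad` with material from `mom`, and in `crossover_accel` the "random mutation" of the child's original jobs on the selected sub-accelerator is assumed to mean reassigning them to a random sub-accelerator. Genomes are assumed to hold at least two genes.

```python
import random

def mutate(ind, rate, num_accels, rng):
    """Replace each gene with a random value with probability `rate`."""
    accel, prio = list(ind[0]), list(ind[1])
    for j in range(len(accel)):
        if rng.random() < rate:
            accel[j] = rng.randrange(num_accels)
        if rng.random() < rate:
            prio[j] = rng.random()
    return accel, prio

def crossover_gen(mom, dad, rng):
    """Genome-wise: pick one genome and a pivot, take mom's genes after it."""
    child = [list(dad[0]), list(dad[1])]
    g = rng.randrange(2)
    pivot = rng.randrange(1, len(dad[0]))
    child[g][pivot:] = mom[g][pivot:]
    return child[0], child[1]

def crossover_rg(mom, dad, rng):
    """Range: copy the same job range from mom in both genomes at once."""
    n = len(dad[0])
    lo = rng.randrange(n)
    hi = rng.randrange(lo + 1, n + 1)
    child = [list(dad[0]), list(dad[1])]
    for g in range(2):
        child[g][lo:hi] = mom[g][lo:hi]
    return child[0], child[1]

def crossover_accel(mom, dad, num_accels, rng):
    """Accel-wise: inherit mom's jobs and priorities on one sub-accelerator."""
    a = rng.randrange(num_accels)
    accel, prio = list(dad[0]), list(dad[1])
    for j in range(len(accel)):
        if accel[j] == a:                 # child's original jobs on `a`
            accel[j] = rng.randrange(num_accels)  # assumed form of mutation
    for j, m in enumerate(mom[0]):
        if m == a:                        # paste mom's decisions for `a`
            accel[j], prio[j] = a, mom[1][j]
    return accel, prio
```

Because crossover-rg and crossover-accel always copy the sub-accelerator gene and the priority gene of a job together, the per-job dependency across the two genomes is kept intact.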

4.2.4 Hyper-parameter Tuning

The above-mentioned mutation rates, crossover rates, population sizes, and elite ratios are hyper-parameters in G#. We applied a hyper-parameter search via a Bayesian optimization framework [bergstra2013making] to select a set of hyper-parameters that makes G# achieve the highest performance across multiple workloads.

4.3 Knowledge Transfer of Learnt Schedule

In this section, we present the method we utilize to make the learnt knowledge transferable/generalizable, i.e., the learnt knowledge could be transferred to a different batch of jobs, as long as they fall within the same application types.

We add one additional feature extraction function at the start of the optimization process. The feature extraction function takes in the no-stall latency and requested (no-stall) BW of a job from the job analysis table (a row in Fig. 3). Next, our feature function simply calculates the mean latency of a job across different sub-accelerators as the output feature value. Finally, we rank the jobs by the feature value and record their ranking order. As an example, suppose the feature values (such as mean latency) of three jobs are 6.5K, 1.5K, and 1K respectively; their ranking orders are then 0, 1, and 2, assuming only three jobs in a batch. When the optimization process finishes, we learn two pieces of knowledge: (i) the schedule for the current batch, which we output, and (ii) the schedule expressed in terms of ranking orders, which is stored as generalized knowledge in the generalized solution block in Fig. 3. For example, the latter could describe putting the job with a given ranking order on sub-accel 1 with the highest priority, and so on.

In the transfer learning scenario, we fetch the stored ranking-order schedule from the generalized solution block in Fig. 3. When the next brand-new batch of jobs comes, we extract their features with the feature extraction function, rank them, and, according to the ranking value, order them into an initial schedule following the stored ranking-order schedule. The insight is that this preserves the scheduling strategy with respect to the relative distance of the jobs in the feature domain.
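The rank-then-reuse idea can be illustrated with a small sketch. It assumes both batches have the same number of jobs, that rank 0 goes to the largest mean latency (as in the 6.5K, 1.5K, 1K example), and that a schedule is a list of (sub-accel, priority) pairs per job. All function names are hypothetical.

```python
def rank_jobs(latency_table):
    """latency_table[j][a] is the no-stall latency of job j on sub-accel a.
    Returns the rank of each job by mean latency (0 = largest mean)."""
    features = [sum(row) / len(row) for row in latency_table]
    order = sorted(range(len(features)), key=lambda j: -features[j])
    ranks = [0] * len(features)
    for rank, job in enumerate(order):
        ranks[job] = rank
    return ranks

def generalize(schedule, ranks):
    """Re-index a learnt per-job schedule by ranking order."""
    return {ranks[j]: gene for j, gene in enumerate(schedule)}

def transfer(rank_schedule, new_ranks):
    """Build an initial schedule for a new batch from the stored knowledge."""
    return [rank_schedule[r] for r in new_ranks]

# Old batch: mean latencies are 6.5K, 1.5K, 1K, so the ranks are 0, 1, 2.
old_latency = [[6000, 7000], [1000, 2000], [500, 1500]]
learnt = [(0, 0.1), (1, 0.2), (1, 0.9)]        # (sub-accel, priority) per job
knowledge = generalize(learnt, rank_jobs(old_latency))

# New batch with different jobs; only their relative ranking matters.
new_latency = [[300, 500], [9000, 8000], [2000, 2200]]
initial = transfer(knowledge, rank_jobs(new_latency))
print(initial)
```

In this sketch the transferred schedule only serves as a starting point; any further epochs of optimization would refine it for the new batch.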

4.4 Leveraging Other Optimization Methods

G# is designed to be general enough to be compatible with other optimization methods. We use the same encoding scheme and algorithm flow, but replace the GA evolution block with other optimization operators. The objective is the same as for G#: to find a series of parameters (genes) that optimizes the fitness value. We evaluate a series of optimization methods, listed in Table 5 in Section 5. We include multiple population-based optimization methods [tbpsa, ga, de, es_origin, cma, pso_paper, portfolio, lhs, halton, hammersley] and two RL methods: A2C [a2c], which is used in [sung2020deepsocs, decima] for scheduling, and PPO2 [ppo2], which is one of the advanced policy-gradient methods that succeed in many fields and is used in [chen2019reinforcement, rummukainen2019practical] for resource scheduling.
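A minimal sketch of this plug-in structure, under several assumptions: the encoding is taken to be two genomes (sub-accelerator selection and job priority), the fitness is a simplified makespan that ignores BW contention, and plain random search stands in for whichever optimizer replaces the evolution block. The function names are hypothetical.

```python
import random

def makespan(individual, latency):
    """Decode (accel_sel, job_prio) and return the finish time of the
    slowest sub-accelerator. latency[j][a] is the latency of job j on
    sub-accel a. BW contention is ignored, so job order does not change
    the result in this simplified model."""
    accel_sel, job_prio = individual
    busy = {}
    for job in sorted(range(len(accel_sel)), key=lambda j: job_prio[j]):
        a = accel_sel[job]
        busy[a] = busy.get(a, 0.0) + latency[job][a]
    return max(busy.values())

def random_search(latency, num_accels, budget, rng):
    """Stand-in optimizer: sample `budget` individuals, keep the best one."""
    n = len(latency)
    best, best_fit = None, float("inf")
    for _ in range(budget):
        cand = ([rng.randrange(num_accels) for _ in range(n)],
                [rng.random() for _ in range(n)])
        fit = makespan(cand, latency)
        if fit < best_fit:
            best, best_fit = cand, fit
    return best, best_fit

latency = [[4.0, 6.0], [3.0, 2.0], [5.0, 5.0], [1.0, 2.0]]
best, best_fit = random_search(latency, num_accels=2, budget=200,
                               rng=random.Random(0))
print(best_fit, best)
```

Because the optimizer only sees the encoded individual and its fitness, swapping in another search method changes `random_search` and nothing else.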

App Type DNN Models
Vision AlexNet [Alexnet], DenseNet [densenet], GoogleNet [Googlenet], MnasNet [mnasnet], MobileNet-V2 [sandler2018mobilenetv2], ResNet50 [Resnet], ResNext50 [resnext], ResNet18 [Resnet], WideResNet50 [Resnet], ShuffleNet-V2 [zhang2018shufflenet], SqueezeNet [squeezenet], VGG16 [vgg]
Lang ALBERT [albert], Bart [bart], BERT [bert], CamemBERT [camembert], CTRL [ctrl_lang], DistilBERT [sanh2019distilbert], ELECTRA [electra], FlauBERT [flaubert], GPT2 [gpt2], GPT [gpt], LongFormer [longformer], MarianNMT [mariannmt], MobileBERT [sun2020mobilebert], Reformer [reformer], RetriBERT [retribert], RoBERTa [roberta], T5 [t5], TransformerXL [transformerxl], XLM [xlm], XLM-RoBERTa [xlmroberta], XLNet [xlnet]
Recom DIN [din], DIEN [dien], DLRM-MLPerf [mlperf], DLRM-large [dlrm_intel], DLRM-small [dlrm], WideDeep [widedeep], DLRM-RMC1 [deeprecsys], DLRM-RMC2 [deeprecsys]
Table 3: Evaluated DNNs in different applications.
Table 4: Platform setting of the experiments.
Table 5: Baseline heuristics and optimization methods. Green rows represent heuristics for job prioritizing or accel selection. A complete scheduling algorithm is the combination of the two, e.g., FCFS-OLB. HEFT (dark green) is a heuristic-based joint method. Blue rows represent baseline optimization methods, which jointly optimize both aspects.

5 Evaluations

5.1 Methodology

5.1.1 Target DNN Models

We consider three different kinds of applications: vision, language, and recommendations. The DNN models from each are shown in Table 3.

5.1.2 Job Categories

We present results with a system batch size of 100. As discussed earlier in Section 3.1, we assume that the host CPU dispatches a batch of independent jobs (i.e., DNN layers from various models) to the scheduler. For our evaluations, the batch of jobs is generated by sampling 100 different layers from the classes of DNN models discussed above. We categorize the jobs into Vision, Lang, Recom, and Mix (i.e., sampling from all three classes).

5.1.3 Accelerator Platforms

We consider two classes of accelerators: Small and Large. For each class, we consider homogeneous and heterogeneous accelerator settings with different PEs and mappings. We construct six different platform environments, as listed in Table 4. We model the platforms with MAESTRO [maestro_web].

Sub-Accelerator Dataflow Styles. For our evaluations, we pick two distinct dataflow styles for the heterogeneous sub-accelerators: a High Bandwidth dataflow style (HB) (inspired by NVDLA [nvdla]) and a relatively Low Bandwidth dataflow style (LB) (inspired by Eyeriss [eyeriss_isca]). The HB-style parallelizes across channel dimensions and shows high efficiency on the late layers of CNN-based (vision) models, while the LB-style parallelizes across activation dimensions and excels on the early layers of CNN-based models [maestro]. For Lang and Recom, we found the HB-style is more compute efficient but BW intensive, while the LB-style is less compute efficient (as Lang and Recom models do not have 2D activations) but also less BW demanding (Fig. 6). Therefore, we house both these sub-accelerators in a BW-constrained accelerator platform to act as a good test for our optimizer to learn and exploit their difference. G# is general enough to run with any heterogeneous combination of two or more accelerator styles.

Resources: PEs and Buffers. We uniformly set one dimension of the 2D PE array to 64 and scale the PE array size by increasing the other dimension. (Based on our observation, most of the popular models that we collected, especially language and recommendation ones, are manually designed to have tensor shapes formed by multiples of 64. Setting one dimension to 64, which aligns with the tensor shape, ensures a higher utilization rate.) We consider three kinds of PE configurations: 32×64 for the Small accelerator [zhu2019energy, fu2020soft, mseddi2019intelligent, du2020new, li2020intelligent] platform, and 64×64 and 128×64 for the Large accelerator. The dataflow strategy (discussed above) and target tile sizes determine the buffer sizes for both SL and SG [maestro].

System BW. We assume the accelerator platform executes at a frequency of 1GHz, and the inference data width is 1 byte per element. For the system BW, on the Small accelerator, we consider the BW to range from 1GB/s to 16GB/s, which is the range of DDR1-DDR4 BW [ddr_bw] and PCIe1.0 - PCIe3.0 [pcie_spec] BW; on the Large accelerator, we consider the BW to range from 1GB/s to 256GB/s, which is the range of DDR4-DDR5 [ddr5_spec] and HBM BW [hbm_spec] and PCIe3.0 - PCIe5.0 and upcoming PCIe6.0 BW [pcie_spec].

5.1.4 Baseline Heuristics and Optimization Methods

Table 5 lists our baseline heuristic and optimization methods.

Heuristics. A complete job schedule (Section 3) is created as a combination of two heuristics: the first for job prioritizing and the second for sub-accelerator selection. For job prioritizing, heuristics such as FCFS, SJF, and others have been found to be effective [aimt, gpu_fcfs, beisel2011cooperative, joo2014resource, leutenegger1990performance]. For sub-accelerator selection, Opportunistic Load Balancing (OLB) [olb] and Minimum Execution Time (MET) [armstrong1998relative, met] are two widely-used greedy methods for heterogeneous platforms. OLB greedily assigns the job to the available sub-accelerator. MET greedily assigns the job to the sub-accelerator that can execute it the fastest. The full scheduling strategy is the combination of both components. For example, a valid strategy could be FCFS-OLB, which uses FCFS for job prioritizing and OLB for sub-accelerator selection. We also consider a joint method, Heterogeneous Earliest-Finish-Time (HEFT) [heft, qlheft, eheft], as a baseline.
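How two heuristics compose into one scheduler can be sketched as below. The sketch makes simplifying assumptions that are not from the paper: OLB is modeled as picking the sub-accelerator that becomes free earliest, SJF ranks jobs by their shortest latency over all sub-accelerators, and BW contention is ignored. The function names are hypothetical.

```python
def fcfs(latency):
    """First come, first served: keep the arrival order."""
    return list(range(len(latency)))

def sjf(latency):
    """Shortest job first, using each job's best-case latency."""
    return sorted(range(len(latency)), key=lambda j: min(latency[j]))

def olb(job, free_at, latency):
    """Opportunistic load balancing: earliest-available sub-accelerator."""
    return min(range(len(free_at)), key=lambda a: free_at[a])

def met(job, free_at, latency):
    """Minimum execution time: fastest sub-accelerator for this job."""
    return min(range(len(free_at)), key=lambda a: latency[job][a])

def schedule(latency, num_accels, prioritize, select):
    """Combine a prioritizing heuristic with a selection heuristic.
    Returns the job-to-accelerator assignment and the makespan."""
    free_at = [0.0] * num_accels
    assignment = {}
    for job in prioritize(latency):
        a = select(job, free_at, latency)
        free_at[a] += latency[job][a]
        assignment[job] = a
    return assignment, max(free_at)

# latency[j][a]: latency of job j on sub-accelerator a (made-up numbers).
latency = [[4.0, 6.0], [3.0, 2.0], [5.0, 5.0], [1.0, 2.0]]
fcfs_olb = schedule(latency, 2, fcfs, olb)
sjf_met = schedule(latency, 2, sjf, met)
print(fcfs_olb)   # makespan 7.0
print(sjf_met)    # makespan 10.0
```

On this toy input MET piles three jobs onto the first sub-accelerator because it never looks at the current load, which is the kind of imbalance a joint search over both decisions can avoid.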

Optimization Methods and Settings. For all optimization methods, we use the G# encoding scheme presented in Section 4.2.2, but plug in implementations of optimization schemes from the nevergrad open-source package [nevergrad]. The specific hyper-parameter settings are listed in Table 5. For fair comparisons, all optimization methods are given a sampling budget of 10K data points.

5.1.5 G# Settings

For G#, we set the number of individuals in a generation to 100, and we find 100 generations (epochs) are enough for G# to converge in the listed experiment settings. Therefore, we set G# to run 100 epochs in all experiments, which means we have a sampling budget of 10K data points, just like the other optimization methods. We run the experiments on a desktop with an Intel i9-9820 CPU. G# takes about 0.25 seconds per epoch, and 25 seconds for a full optimization with 100 epochs.

5.1.6 Evaluation Metric

In all experiments, we plot the makespan latency for running an entire batch of jobs (which is effectively the reciprocal of the throughput of the current batch) across all four categories (Vision, Lang, Recom and Mix) on the platform under study, based on the schedule determined by the baseline and proposed methods after 100 epochs. For ease of comparisons across different scheduling methods, we concatenate the four independent latency numbers into a stacked bar and show the total latency.

Figure 6: (a) The average per-job (i.e., per-layer) no-stall latency and required BW for no-stall across different models on high (HB) and low (LB) bandwidth mapping styles, (b) average no-stall latency, and (c) average BW required for no-stall across all layers.
Figure 7: The experiment results on small accelerator with (a) S1 and (b) S2 setting, and on large accelerator on (c) S4 and (d) S5 setting.

5.2 Latency-BW Characteristics of DNNs

We start by showing the latency characteristics and bandwidth requirements of the DNN models from the three application classes when each runs by itself on the two separate dataflow styles (HB and LB). We show three of the models from each class and the average across all the models in that class in Fig. 6(a). The average values across all model layers across both accelerators are plotted in Fig. 6(b-c). In general, we can see that the per-job latency of the Vision models is higher because more compute is needed in the CONV-dominant models. However, CONV is generally less memory-bound than FC. The data also shows that Vision usually has the lowest BW requirement, and Recom has the largest.

Figure 8: Performance comparisons of different methods across different BWs on settings (a) S1, (b) S2, (c) S3, and (d) S4. The total latency is the sum of the methods’ latency on each of the four job categories: Vision, Lang, Recom, and Mix.
Figure 9: Jobs analysis of (a) the averaged per-job no-stall latency and (b) the averaged per-job required BW. Performance evaluation of G# on S3, S4, and S5 with different BW. The total latency is the sum of G#’s latency on each of the four job categories.
Figure 10: The experiment results on a scale-up platform in S6, BW=256.

5.3 Results on Small Accelerators

5.3.1 Homogeneous Accelerators

We examined the homogeneous setting on the Small accelerator with system BW=16 GB/s. Fig. 7(a) displays the total latency of the heuristics, optimization methods, and G# across all four job categories. It shows that G# outperforms both heuristic-based and state-of-the-art optimization methods. G# improves on the total latency of the best-performing heuristic, SJF-RR, by 30.8% and of the best-performing optimization method, PSO, by 30.4%.

5.3.2 Heterogeneous Accelerators

In the heterogeneous setting, we replace the dataflow of one of the sub-accelerators from HB-style to LB-style. We can observe that G#’s latency decreases in S2 (blue bar in Fig. 7(b)). From our observation of the resulting schedule, the latency decrease arises because G# learns to schedule the early layers of CNNs to the LB-style sub-accelerator and the others to HB-style ones, exploiting the different strengths of the different architectures. The heterogeneity makes scheduling a more complicated problem, and many methods experience serious performance degradation, as shown in Fig. 7(b). Overall, G# consistently performs better than all compared methods. On the Small accelerator across both S1 and S2 settings, G#’s latency is smaller than that of the compared methods on Vision, Language, Recom, and Mix by 5.5x, 17.4x, 297.0x, and 25.9x respectively, and on average 86.4x smaller.

5.4 Results on Large Accelerators

5.4.1 Comparisons with other methods

In the interest of space, we list the results of two settings, S4 (Hetero) and S5 (Hetero BigLittle), with the Large BW=256 (GB/s), in Fig. 7(c-d). Overall, G# reaches the best performance among all the methods in all listed scenarios. On the Large accelerator across both S4 and S5 settings, G#’s latency is smaller than that of the compared methods on Vision, Language, Recom, and Mix by 5.0x, 98.0x, 1233.0x, and 34.7x respectively, and on average 342.7x smaller. We also experiment on different settings with different system BW, and summarize each compared method’s total latency across the four job categories in Fig. 8. Across all scenarios, G# consistently finds better schedules.
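The reported averages appear consistent with a plain arithmetic mean over the four per-category ratios. The short check below assumes that reading; the variable names are introduced here for illustration.

```python
# Per-category latency ratios quoted in the text (Vision, Language, Recom, Mix).
small_ratios = [5.5, 17.4, 297.0, 25.9]     # Small accelerator, S1 and S2
large_ratios = [5.0, 98.0, 1233.0, 34.7]    # Large accelerator, S4 and S5

mean_small = sum(small_ratios) / len(small_ratios)
mean_large = sum(large_ratios) / len(large_ratios)
print(mean_small, mean_large)
```

Both means land within rounding distance of the quoted 86.4x and 342.7x, and both are dominated by the Recom category.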

5.4.2 Comparisons on different platform settings

In this experiment, we examine the performance change in different settings, S3 (Homog Big), S4 (Hetero Big), S5 (Hetero, BigLittle) of the Large accelerator.

Homog versus Hetero. The LB-style sub-accelerators usually have longer runtime but lower BW requirements than HB-style ones for Lang and Recom, as shown in Fig. 6(a). The jobs analysis in Fig. 9(a-b) reflects the fact that S4, in general, induces more no-stall latency but requires less BW than S3. Therefore, when BW is limited (BW=1), the hetero setting enables G# to leverage the difference in BW requirements among sub-accelerators to relax the BW contention. Thus S4 reaches better performance than S3 at BW=1 in Fig. 9(c). However, when the BW is mostly sufficient (BW=256), the performance reflects more of the behavior of the no-stall latency. Thus S3 reaches better performance.

Bigs versus BigLittle. We consider a platform with a smaller setting, BigLittle (S5). As expected, when the BW budget is sufficient (BW=256), BigLittle performs worse than both of the Bigs (S3, S4), as shown in Fig. 9(c), which can be verified by the jobs analysis in Fig. 9(b). However, BigLittle has a smaller BW requirement because of its smaller sub-accelerator size, as shown in Fig. 9(a). Therefore, as shown in Fig. 9(c), when the BW is limited (BW=1), BigLittle reaches the best performance with the least amount of resources. The results indicate that G# is able to exploit different characteristics of sub-accelerators, both their sizes and their mappings, to optimize its scheduling under different platforms and constraints.

Scale-up. We scale up the platform by doubling the number of sub-accelerators, which complicates the problem to the search space of O(). Fig. 10 shows G# consistently finds better solutions, whose latency value is on average 610.7x smaller.

Figure 11: Jobs analysis of (a) the averaged per-job no-stall latency and (b) the averaged per-job required BW of fixed and flexible PEs arrays. Performance evaluation of G# with fixed or flexible PEs array on (c) Vision and (d) Random.
Figure 12: The convergence curve of G# with three level of genetic operations: Mutation only, Mutation and Crossover-gen, and G# with all four genetic operators.
Figure 13: The visualization of found solution by FCFS-OLB and G#. (a)(c) shows the respective sub-accelerator allocations, and (b)(d) shows the respective BW allocations. (Mix, S5, BW=1).
Table 6: The performance of knowledge transfer on (a) Mix, setting S4, BW=1, and (b) the averaged performance across different applications and different settings under BW=1. All the values are normalized by the values of Trf-100-ep of each column. Raw (highlighted in orange) is the latency without learning or transfer. Trf-0-ep (highlighted in green) is a direct transfer. Trf-1-ep is a transfer with one epoch of re-training, and likewise for Trf-30-ep. Trf-100-ep (highlighted in blue) is a full training.
Figure 14: (a) The SAT convergence trace (Mix, S3, BW=1). (b) The definition of SAT.
Table 7: The comparisons of search time and the makespan speedup over heuristic methods. The statistics values are averaged across different settings (S1-S6) of different job categories. G-SHARP-transfer represents G-SHARP with transferred knowledge.

5.5 Flexible Accelerator

In this experiment, we consider accelerators where the PE array dimensions are configurable, such as FPGAs [brainwave], CGRA [maeri], or programmable accelerators [bang201714, yin2018141, zheng2019ultra].

Accelerator Configuration. We extend the settings of S1 (Small, fixed) and S3 (Large, fixed) to have flexible accelerators. The number of PEs in each sub-accelerator is fixed (the same as in Table 4). However, the shape of the 2D PE array is flexible; that is, we can configure the routing among the PEs. This enables the sub-accelerator to run various dataflows or mappings [maeri]. The maximum size of the SLs is fixed at 1KB in each PE, and that of the SGs at 2MB in each sub-accelerator.

Dataflow Strategy. We pick the dataflow strategy of the sub-accelerator to maximize the utilization of the PE array. To maximize the utilization, we align the PE array dimensions to factors of the parallelizing dimensions of the tile as much as possible. For example, suppose the parallelizing dimension of the tile is (2, 15), which is to be mapped over the y and x dimensions of a PE array with 16 PEs. The potential PE array shapes are 2×8 when aligning to a factor of the y dimension, or 3×5, 5×3, and 1×15 when aligning to a factor of the x dimension. We examine these combinations, evaluate their expected latency with the HW cost model, and pick the one with the lowest latency as our PE array configuration.
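The candidate shapes in this example can be reproduced with a short enumeration. The sketch assumes that only non-trivial factors (greater than 1) are used for alignment and that the other dimension is the largest size that still fits in the PE budget. `candidate_shapes` is a hypothetical helper, and the cost-model ranking step is not modeled.

```python
def factors(n):
    """All positive divisors of n."""
    return [f for f in range(1, n + 1) if n % f == 0]

def candidate_shapes(tile_dims, num_pes):
    """Return candidate (y, x) PE-array shapes for a tile whose
    parallelizing dimensions are tile_dims = (tile_y, tile_x)."""
    tile_y, tile_x = tile_dims
    shapes = []
    for fy in factors(tile_y):          # align the y dimension
        if 1 < fy <= num_pes:
            shapes.append((fy, num_pes // fy))
    for fx in factors(tile_x):          # align the x dimension
        if 1 < fx <= num_pes:
            shapes.append((num_pes // fx, fx))
    return shapes

shapes = candidate_shapes((2, 15), num_pes=16)
print(shapes)   # [(2, 8), (5, 3), (3, 5), (1, 15)]
```

Note that the x-aligned shapes use only 15 of the 16 PEs, so a cost model is still needed to decide whether exact alignment outweighs the idle PE.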

Target Jobs to Schedule. We evaluate Vision and Random jobs; Random jobs are similar to Mix jobs, but we randomly change the tensor shape of the layers to simulate the scenario that the accelerator is serving some customized DNNs or the DNNs generated by Neural Architecture Search whose shapes are dynamic  [nas, hsu2018monas, tan2019mnasnet].

Evaluations. From the performance analysis in Fig. 11(a-b), we can observe that for both Vision and Random jobs, flexible outperforms fixed in average per-job no-stall latency, owing to its ability to maximize the utilization rate of the PE array. However, it also incurs a higher BW requirement, because the flexible mapping we found maximizes the PE utilization rate, which also increases the amount of data to fetch per tile to keep the PEs busy.

Next, we evaluate G#’s ability to leverage this flexibility. In all scenarios in Fig. 11(c-d), flexible outperforms fixed. The results show that with flexible accelerators (ASIC or FPGA), we could further increase the expected platform performance without providing additional compute HW resources (PEs), simply by changing the shape of the PE array, and most importantly, that G# can learn to leverage this flexibility to reach better performance.

5.6 Deep Dive into G# Algorithm.

5.6.1 Ablation study of G#

Next, we show how different genetic operators affect the performance of G#. We construct three levels of algorithms: mutation only, mutation and crossover-gen, and G# with all four genetic operators. Since the mutation operator is the basic perturbation unit, we include mutation in all levels of algorithms. The key difference among the three levels is the convergence speed. Fig. 12 shows that adding Crossover-gen makes the search converge much faster and that adding all the designed operators further increases the convergence speed, showing the effectiveness of each operator.

5.6.2 Analysis of found solutions

We visualize one of the found schedules in Fig. 13, which corresponds to the high-level figure in Fig. 4(b). It shows that G# learns to distribute the BW-intensive layers (Recom, Lang) across the runtime to balance the BW requirement (Fig. 13(e-f)), compared with one of the widely-used heuristics, FCFS-OLB (Fig. 13(a-b)). We found that G# can highly utilize the BW and achieve better makespan latency.

5.6.3 Transferring Knowledge, i.e., Generalization.

In the experiments, we train G# on a random batch of jobs, Insts0. Then, we test, transfer, and re-train on the other four different batches of jobs. Table 6(a) shows that by directly applying transferred knowledge (Trf-0-ep), we could achieve 16x lower latency than the usual starting point, random initialization (Raw). By transferring the knowledge and retraining for one epoch (Trf-1-ep), we could already obtain 93% of the expected performance gain of a full training (Trf-100-ep). We execute the same experiment for different types of applications and for different settings (S1-S6) (Table 6(b)). We can observe that for BW-intensive applications, Lang and Recom, the knowledge of the schedule becomes more important, and therefore the performance gain from the direct transfer becomes significant. Overall, by direct transfer without training, G# can achieve 7.4x to 152x better performance than the usual starting points (Raws).

5.7 G# with alternate objective functions

We used the objective of makespan latency, targeting maximum system throughput, in the previous experiments. However, we could also target different scenarios such as (i) latency-sensitive applications, where the jobs have priorities and need to be completed as soon as possible, or (ii) batches with job dependencies, where some jobs need to be executed before others. In these scenarios, users give the system the target order of job finish times (owing to the latency sensitivity, priority, or dependency of the jobs). The objective function becomes optimizing the schedule to match the target job order. We define the performance metric of satisfaction (SAT) rate as shown in Fig. 14(b), in terms of the number of jobs and sub-accelerators and the actual and target orders of job finish times. In Fig. 14(a), we can see that we could boost the SAT rate from 24% to 96%. For jobs with dependencies, the non-satisfied jobs (jobs whose scheduled finish time is out of the desired order) would incur sub-accelerator stalls at runtime. However, G# makes a best effort to schedule job finish times in the desired order, which effectively minimizes the stall time.

5.8 Comparisons of schedule search time

Next, we discuss the search time of different methods. Top-performing optimization methods (Elite Opt., RLs, G#) can achieve 3.1-238.0x better makespan latency compared to heuristics, as shown in Table 7. However, they come with a search time overhead. Interestingly, G#’s genetic operators increase its sample efficiency, which makes its search-time-to-converge 626-657x better than that of Elite Opt. and RLs. G# also offers knowledge transfer, which allows us to get 10.2x better makespan performance than Elite (top-performing) heuristics with the same (near zero) search time.

6 Related Works

Multi-tenant Scheduling for DNN Accelerators. Most DNN schedulers (heuristics and ML-based) have focused on scheduling one DNN on one accelerator [shen2017maximizing, suda2016throughput, cong_fpga, systolic_mapping, lu2017flexflow, stoutchinin2019optimally]. Some recent works look into multi-DNN scheduling: Prophet [prophet] builds a runtime prediction model for multi-tenancy on GPU with FCFS scheduling. AI-MT [aimt] develops a heuristic for DNN job scheduling on a platform with multiple homogeneous systolic arrays. Prema [prema] explores preemptive multi-tenancy on an NPU with token-based SJF. Herald [herald] and Planaria [planaria] use manually designed scheduling for assigning jobs to sub-accelerators or reconfigurable PE arrays. A learning-based method, SCARL [scarl], utilizes RL to make a two-step action, job selection and machine selection, and demonstrates better performance than a widely-used heuristic, SJF. In this work, we optimize multi-DNN scheduling with our learning-based method and conduct a full comparison with the heuristics and the RL method used by previous works [aimt, prema, prophet, scarl].

Multi-tenant Scheduling for CPUs and GPUs. Multi-tenancy has been investigated for decades for multi-tasking on a single CPU and job ordering in CPU clusters [zaharia2009job, sherwani2004libra]. Heuristics such as FCFS are often used in GPUs [beisel2011cooperative, joo2014resource]. GAs are one of the most popular algorithms for the scheduling problem owing to their lightness and simplicity [correa1999scheduling, hou1994genetic, singh1996mapping, shroff1996genetic, wang1996genetic]. PSO [wu2010revised], CMA-ES [emadi2017task], and other optimization methods have also been used. Some works leverage RL for job ordering over clusters, such as DeepRM [deeprm], Decima [decima], and Thamsen et al. [thamsen2017scheduling]. However, they presume a unified abstraction of the underlying cluster, where the heterogeneity of the system is not considered. HEFT and its variants [heft, eheft, qlheft] consider multi-tenancy in heterogeneous systems; however, HEFT is a manually designed algorithm that is not optimized for DNN workloads.

7 Conclusion and Key Takeaways

This work presents a schedule optimizer for multi-tenant DNN accelerators. The key takeaways are as follows. (i) Heuristic and optimization methods have been used successfully for the design space of either job scheduling across sub-accelerators or prioritizing the allocated jobs inside one sub-accelerator. However, co-optimization of both is needed for upcoming platforms (Table 1). (ii) The search space for this co-optimization is extremely large (Section 3.2). The search sample-efficiency of baseline optimization methods, including widely-used RLs, is not sufficient to find optimized solutions (Fig. 7). (iii) We develop a scheduler called G# that customizes the optimization algorithm’s exploration momentum and mechanism (genetic operators in this work) for the target search space. We design three specialized crossover operators to preserve dependencies between the jobs and sub-accelerators during exploration (Section 4.2.3). Our search method yields faster searches (Table 7) with more optimized solutions (Fig. 7). G# also provides knowledge transfer, demonstrating better schedules than SOTA heuristics at zero search overhead. In the future, we plan to integrate G# into a compilation framework for real systems.