The proliferation of deep learning (DL) has been fueled, in part, by rapid growth in the size and complexity of deep neural networks (DNNs) (dean2012large; ying2018graph). This has spurred the rapid development of hardware (wang2016dlau; jouppi2017datacenter) and software (abadi2016tensorflow; paszke2017; cyphers2018intel) dedicated to deep learning workloads that seek to optimize critical performance metrics like throughput and power efficiency (mattson2020mlperf). A critical challenge for the compiler is mapping the tensors of a neural network’s computational graph to the memory units on the host hardware. Since different memory types trade off bandwidth and capacity differently, a sub-optimal mapping could significantly increase latency.
For DL inference, the computational graph is static and placement can be pre-planned instead of relying on online cache management (zhangoptimal; shi2019applying). However, this is especially challenging with DNNs due to the high-dimensional search space. For example, ResNet-50 (he2016deep) has 57 operational layers; mapping each activation and weight tensor to one of three memories (DRAM, LLC, and SRAM) represents $3^{2 \times 57} \approx 10^{54}$ possible decisions. Since optimizing this mapping is intractable with traditional approaches such as dynamic programming (bellman1954), current solutions primarily rely on manually-tuned heuristic rules encoded in a compiler.
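To make the size of this search space concrete, the short sketch below counts the possible memory maps for a workload, assuming (as described above) three memory choices and two tensors (weights and activations) per node. The function name is illustrative and not part of any released code.

```python
import math


def mapping_space_log10(num_nodes: int, num_memories: int = 3,
                        tensors_per_node: int = 2) -> float:
    """Return log10 of the number of distinct memory maps.

    Each node makes one choice per tensor, so the total count is
    num_memories ** (tensors_per_node * num_nodes).
    """
    return tensors_per_node * num_nodes * math.log10(num_memories)


if __name__ == "__main__":
    # ResNet-50 is described in the text as having 57 operational layers.
    print(f"ResNet-50: about 10^{mapping_space_log10(57):.1f} possible maps")
```

Working in log space keeps the count readable; the exact integer `3 ** (2 * 57)` is also cheap to compute in Python if needed.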
In this paper, we investigate if a machine learning based solution could address this problem in a scalable manner. We formulate the task as a Reinforcement Learning (RL) problem, where an agent performs actions to map each layer’s weights and activations to one of several memory caches on the chip (e.g. DRAM, LLC and SRAM).
In addition to the extremely large action space, the reward is end-to-end latency, a sparse and noisy learning signal that we demonstrate is unsuitable for purely gradient-based Deep RL algorithms. Instead, we contribute Evolutionary Graph RL (EGRL), an extension of CERL (khadka2019cerl), a population-based method that previously performed well in sparse-reward tasks by combining fast policy gradient (PG) learning with a stable evolutionary algorithm (EA). Since the action spaces explored in this paper are several orders of magnitude larger than the ones explored in CERL, we also needed a mechanism to improve the sample-efficiency of the slow EA component. Thus we introduce Boltzmann chromosomes: a set of fast, stateless policies that accelerate evolution by providing partially optimized solutions as anchors.
Further, we employ a graph neural network (GNN) (wu2020comprehensive; scarselli2008graph) to represent our policy. This allows our agent to natively process computational graphs representing deep learning workloads, enabling generalization over workloads of arbitrary size and connectivity. Figure 1 illustrates the high level overview of our method.
We demonstrate our solution on the Intel Neural Network Processor for Inference (NNP-I) (wechsler_nnpi), a deep learning accelerator with constraints on memory capacity, bandwidth and power. This is a key differentiator from prior works such as REGAL (regal), which assume infinite bandwidth and memory, assumptions that are not practical on real hardware. Additionally, we consider single-batch inference. While large batch sizes have greater computational efficiency (e.g., (bert_blog) on NNP-I), they are sub-optimal for a given inference example due to the latency associated with queuing up a batch. Therefore, single-batch inference is key to many time-critical applications (park2018deep) where an individual inference query needs to be processed in real-time.
Results on ResNet-50 (resnet), ResNet-101 (he2016deep) and BERT (devlin2018bert) show that EGRL significantly outperforms the chipset’s native compiler across all workloads, and exceeds the performance of a dynamic programming approach and a policy-gradient approach.
Specifically, the contributions of this work are:
A generalized GNN-based policy representation that can natively process deep learning workloads without the need for serialized, layer-dependent representations.
EGRL, a scalable population-based algorithm that can effectively train on sparse and noisy feedback from the host hardware.
An RL agent that trains directly on the hardware, with a feedback mechanism for constraint violation. Thus, we are able to directly deploy and test on hardware.
2 Background and Related Work
We consider a Markov Decision Process (MDP) setting defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, with a state space $\mathcal{S}$, a discrete action space $\mathcal{A}$, an unknown state transition probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ that maps a state $s_t$ at time $t$ and an action $a_t$ to the probability of a next state $s_{t+1}$, and a reward $r(s_t, a_t, s_{t+1})$ provided by the environment for a given state transition. We learn a policy $\pi$ that maximizes the expectation of the total episodic return from time-step $t$, $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $r_k$ is the reward at step $k$ and $\gamma \in [0, 1]$ is the discount factor. Policy Gradient (PG) methods re-frame this goal of maximizing the expected return as the minimization of a loss function $L(\theta)$, where $\theta$ encapsulates the agent parameters. A widely used method is Soft Actor-Critic (SAC) (haarnoja2018), a model-free algorithm developed for continuous high-dimensional settings. SAC uses an actor-critic architecture with separate networks for the policy and the Q-value function. A stochastic Gaussian policy enables it to use a maximum entropy objective (ziebart2008maximum) through which it demonstrates state-of-the-art results.
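As a small illustration of the episodic return defined above, the following sketch accumulates discounted rewards from the end of an episode backwards. It assumes the standard discounted-sum definition; the function name is hypothetical. With a single step per episode (the setting listed in Appendix B), the return reduces to that step's reward.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, r_0 + gamma * r_1 + gamma^2 * r_2 + ...

    Iterating backwards avoids computing explicit powers of gamma.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(discounted_return([2.5]))  # single-step episode
```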
Collaborative Evolutionary Reinforcement Learning (CERL) (khadka2018evolution; khadka2019cerl) combines Evolutionary Algorithms (EAs) (floreano2008; luders2017; fogel2006; spears1993) with PG. It diversifies exploration by allowing a population of EA policies to add data to a central replay buffer shared by a population of PG learners. Since the gradient-free EA directly optimizes for episode-wide return, it biases exploration, and implicitly the PG policies, towards states with higher long-term returns. Concurrently, PG policies are inserted into the EA population in order to provide better search anchors for EA. CERL’s integrated framework was shown to outperform its components (PG and EA) in isolation. We directly build on CERL because the memory mapping solution inherently relies on optimizing a very sparse feedback signal (e.g., latency) that is obtained at the end of an inference run through a workload.
Graph Neural Networks (GNNs) were first proposed as a recursive message passing framework with learnable parameters (gori2005new). Subsequent work relaxed the architecture to work as a generalization of convolutional networks resulting in a broader graph data structure (xu2015empirical; sukhbaatar2016learning). Pertinently, GNNs have been paired with RL for tackling fundamental combinatorial optimization problems such as Minimum Vertex Cover, Maximum Cut and the Traveling Salesman Problem (combi_graph; learning_heuritics).
Optimizing Hardware using Machine Learning: Deep RL was used in (mirhoseini_chip_placement) to learn subsystems placement to optimize power, performance and space. Similarly, AutoTVM (chen2018learning) employed learning to optimize low-level implementations of operators in tensor programs. On the same note, Placeto (addanki2018placeto) combines GNNs and RL to achieve effective device placements in distributed clusters.
A closely related work is REGAL (regal), which optimizes run-time and peak-memory via hardware placement. It utilizes a graph representation with a genetic algorithm (GA) guided by RL. The RL agent predicts the parameters of the GA, a form of indirect information transfer, while the GA directly optimizes the final strategy. In contrast, our RL and EA components each co-optimize the mapping strategies via direct information transfer (policy migration) and a shared replay buffer. REGAL assumes infinite bandwidth and memory, whereas we train and validate entirely on physical hardware, introducing specific mechanisms to incentivize compiler-valid mappings. This ensures that our solutions are performant under real-world operating conditions and closer to production use.
We formulate the hardware mapping problem as an MDP and use RL to train a solution. Figure 1 illustrates the high-level formulation of our workflow.
3.1 MDP Formulation
State: A major challenge in memory mapping is to find effective general representations for large and complex workloads. One approach is to convert the computational graph into sequential segments (amc; haq) that are fed to the RL policy one-by-one. A reward can then be computed at the end of each episode. While this sequential approach is simple, it struggles to generalize to workloads with varying depths and complexity. Furthermore, the serial nature of this approach limits the speed of learning that can be obtained by parallelization.
To address these limitations, we encapsulate the workload as a directed graph $\mathcal{G}$, whose nodes represent operational layers (e.g. convolution, pooling, etc.), and whose edges indicate the connectivity between layers. Each node’s features describe its operation as well as characteristics of the weights and activation tensors associated with it (e.g. byte-size of the weights, kernel size, etc.). A detailed description of the node features can be found in Appendix A. Since all the outgoing edges of a node denote the same output tensor, their associated information is encapsulated in their source node features, leaving the edges featureless.
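The graph encoding described above can be sketched as a small data structure: nodes carry per-layer features and edges are plain index pairs with no features. This is a minimal illustration only; the class names, the subset of features chosen, and the example numbers are all assumptions and do not reflect the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LayerNode:
    """One operational layer with a few of the Appendix A features."""
    op_type: str
    weight_bytes: int             # 0 if the layer has no weights
    ofm_shape: tuple              # output feature map size (x, y, z)
    kernel_size: tuple = (0, 0)   # 0 for non-convolution layers

    def feature_vector(self):
        x, y, z = self.ofm_shape
        return [self.weight_bytes, x, y, z, x * y * z, *self.kernel_size]


@dataclass
class WorkloadGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src, dst) pairs, featureless

    def add_node(self, node):
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def feature_matrix(self):
        return [node.feature_vector() for node in self.nodes]


def build_example():
    """Two-layer toy graph; the numbers are illustrative, not measured."""
    graph = WorkloadGraph()
    conv = graph.add_node(LayerNode("conv", 9408, (112, 112, 64), (7, 7)))
    pool = graph.add_node(LayerNode("pool", 0, (56, 56, 64)))
    graph.add_edge(conv, pool)
    return graph
```

Because the feature matrix has one row per node regardless of graph size, the same policy input format applies to workloads of any depth.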
Action: Given the graph representation $\mathcal{G}$ of the DL workload, we use a GNN policy $\pi$ to map it to a set of memory mapping decisions. Algorithm 1 details this process. The goal of $\pi$ is to maximize the improvement in a chosen performance metric. The agent takes two distinct sub-actions per node, each choosing one of three memory units (DRAM, LLC and SRAM): one for the weights and the other for the activations (Line 5). The agent’s complete memory map is sent to the compiler. If any of the mapping decisions cannot be executed on the hardware (i.e., invalid mapping), the compiler rectifies them and outputs a modified map that is fully executable (Line 6).
Reward: In a standard RL setting, one can generally constrain the action space to avoid invalid actions. However, in our problem setting, constraining the action explicitly requires reproducing the compiler’s logic for valid mappings, which would vary across hardware and compiler versions. In order to keep the RL algorithm independent of the compiler logic, we formulate separate reward domains for invalid and valid mappings. If the agent produces any invalid mapping that the compiler re-assigns, we do not execute an inference. Instead, we formulate a negative reward as a function of the re-assigned bytes ratio, to quantify the extent of the invalidity (Line 12 in Algorithm 1). This formulation allows us to avoid implementing the compiler logic explicitly - instead relying on a negative feedback that enables the agent to implicitly learn the rules for valid mappings.
When the agent does produce a fully valid mapping, we execute inference and compute a positive reward. This reward is a function of the agent performance score normalized by that of the native compiler (Line 10 in Algorithm 1). While the normalization is not necessary when training on a single workload, it allows for flexibility to concurrently train on multiple workloads that might have a wide range of scores. For our application, we maximize the reciprocal of latency.
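The two reward domains described above can be sketched as a single function. The exact functional forms are not given in the text, so this sketch assumes a linear penalty equal to the re-assigned bytes ratio for invalid maps, and a positive reward equal to the compiler's latency divided by the agent's latency (reciprocal latency normalized by the compiler's score) for valid maps. All names are hypothetical.

```python
def mapping_reward(reassigned_bytes, total_bytes,
                   agent_latency=None, compiler_latency=None):
    """Reward with separate domains for invalid and valid maps (assumed forms)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = reassigned_bytes / total_bytes
    if ratio > 0:
        # Invalid map: no inference is run; the penalty grows with the
        # share of bytes the compiler had to re-assign (assumed linear).
        return -ratio
    if agent_latency is None or compiler_latency is None:
        raise ValueError("latencies are required for a valid map")
    # Valid map: reciprocal latency, normalized by the compiler's score.
    return compiler_latency / agent_latency


if __name__ == "__main__":
    print(mapping_reward(50, 100))                     # invalid map
    print(mapping_reward(0, 100, agent_latency=2.0,
                         compiler_latency=3.0))        # valid map
```

Keeping the penalty bounded in [-1, 0) and the valid reward positive means any valid map scores above every invalid one, which matches the intent of the two-domain design.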
Our training algorithm, EGRL, builds on the CERL framework (khadka2019cerl) to tackle variable-sized, multi-discrete action settings. Figure 2 illustrates the high-level architecture of EGRL. It comprises a single PG learner (a GNN) and an EA population containing a mixture of GNN and Boltzmann policies, as detailed below. A round of rollouts is conducted to compute the fitness for each individual in the population. All data generated across the population is stored in the PG learner’s replay buffer. The population then goes through probabilistic selection, mutation and crossover operators commensurate with the fitnesses computed.
Concurrently, the PG learner updates its actor and critic by sampling from the shared replay buffer. It periodically migrates to the EA population as a form of information transfer. At any given time, the top-ranked policy in the EA population is chosen for deployment. A detailed description can be found in the Appendix, and a truncated codebase is available for reference at https://anonymous.4open.science/r/290d6d70-5324-4d13-8458-19de1dc6aeed.
GNN Policy: Given the large state space representing the operational layers of each workload, it was critical to develop a policy representation that could exploit the inherent dependencies between them. We implemented a Graph U-Net policy based on (GraphUNet). The Graph U-Net leverages bidirectional graph convolutions and graph attention operations to derive invariant intermediate node features. This representation afforded us a multidimensional action-space where the agent can simultaneously affect the memory mappings for all weights and activations of the workload. This enables our agent to generalize across workloads of varying sizes and diversity of operations.
Boltzmann Chromosome: Figure 3 illustrates the Boltzmann chromosome, an additional policy representation we introduced into the population based on the Boltzmann softmax operation (asadi2017alternative). Each Boltzmann chromosome is parameterized by a set of prior probabilities ($P$) and a temperature ($T$) for each node. To compute an action for each node, we sample from the Boltzmann softmax function using that node’s $P$ and $T$. In contrast to a GNN policy, which is parameterized by its weights and produces mappings following a feed-forward operation, the Boltzmann chromosome directly represents the mapping decision and its associated uncertainty. Thus it is significantly faster to compute and is an ideal embedding for a search-based EA method. The ratio of exploration to exploitation is controlled by the temperature parameter directly. A lower temperature favors decisions close to the prior $P$ while a higher temperature encourages exploration further from $P$. Crucially, $T$ is learned (via evolution) for each node independently, which allows for varying degrees of exploration-exploitation across different mapping decisions simultaneously.
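The sampling step for a single node can be sketched as follows. This is a minimal illustration of a Boltzmann softmax over a prior and a temperature, assuming the common form in which each probability is proportional to exp(prior / temperature); the function names and example values are hypothetical.

```python
import math
import random

MEMORIES = ("DRAM", "LLC", "SRAM")


def boltzmann_probs(prior, temperature):
    """Boltzmann softmax: p_i is proportional to exp(prior_i / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p / temperature for p in prior]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_action(prior, temperature, rng=random):
    """Sample a memory index for one tensor of one node."""
    probs = boltzmann_probs(prior, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    prior = [0.6, 0.3, 0.1]  # illustrative prior over DRAM, LLC, SRAM
    print(boltzmann_probs(prior, temperature=0.05))  # sharply peaked
    print(boltzmann_probs(prior, temperature=10.0))  # close to uniform
    print(MEMORIES[sample_action(prior, temperature=1.0)])
```

A chromosome for a whole workload would hold one such prior and temperature per node and sub-action, so computing a full map requires no forward pass through a network.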
Mixed Population: The EA population concurrently holds both GNN and Boltzmann policies. Crucially, all policies share data and benefit from the joint exploration. The PG-based GNN policy can directly leverage the states explored by the Boltzmann policy to compute gradients. Conversely, as shown in Figure 2, the Boltzmann policy’s prior $P$ is periodically seeded using the GNN policy’s posterior probability distribution, thus enabling it to directly bootstrap from the GNN population.
Policy Gradient Algorithm: We build on SAC (haarnoja2018), making modifications to tackle our large multi-discrete action space. Please refer to Appendix D for a detailed description.
Domain: We evaluated the performance of our agents on the Intel NNP-I hardware. For a given DNN workload, our agents controlled how its intermediate tensors are mapped to memory units on the chip. We then report the resulting latency as measured directly on the hardware. We conduct both training and testing entirely on the physical hardware.
Workloads Tested: We benchmarked on three popular workloads. ResNet-50 (he2016deep), with 57 nodes, is widely used for benchmarks such as MLPerf (reddi2019mlperf). ResNet-101, a deeper network with more nodes, allowed us to test for scale. Lastly, BERT (devlin2018bert), the largest of the three by node count, is a state-of-the-art natural language processing model. This allowed us to test for scale and generalization of our approach. Since the action space for a workload with $N$ nodes is $3^{2N}$, the action space for ResNet-50 alone has $3^{114} \approx 10^{54}$ elements, and those for ResNet-101 and BERT are larger still.
Metrics Reported: We define speedup as the relative improvement in latency achieved by the agent’s mapping versus that of the compiler. A score greater than $1$ indicates an improvement in latency while a score between $0$ and $1$ indicates a degradation. A score of $0$ indicates an invalid mapping. We conduct multiple independent statistical runs and report the mean and standard deviation. Further, we report all speedups against iterations, where an iteration refers to an inference process on the physical hardware. To ensure a fair comparison between population-based and single-policy methods, we count the iterations cumulatively across the population.
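The speedup metric and its aggregation across runs can be sketched in a few lines. This assumes speedup is the compiler's latency divided by the agent's latency, with invalid mappings scored as zero; the function names and sample values are illustrative.

```python
from statistics import mean, stdev


def speedup(compiler_latency, agent_latency, valid=True):
    """Latency speedup relative to the compiler; 0 marks an invalid map."""
    if not valid:
        return 0.0
    return compiler_latency / agent_latency


def summarize(speedups):
    """Mean and sample standard deviation over independent runs."""
    return mean(speedups), stdev(speedups)


if __name__ == "__main__":
    runs = [speedup(10.0, 8.0), speedup(10.0, 7.5), speedup(10.0, 8.5)]
    print(summarize(runs))
```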
Baseline: We use the Intel NNP-I’s default compiler as our baseline. The compiler consists of a collection of heuristic rules specific to the memory and compute capacity of the hardware and the nature of the specific workload. We also implement a number of learning and search based agents for comparison, detailed below:
Greedy Dynamic Programming (DP) agent, inspired by DP methods for optimization (andonov2000unbounded; bertsimas2004robust), makes layer-wise greedy decisions directly on the workload. Since we have $3$ memory choices for $2$ types of tensors, we have $9$ distinct decisions per node. The Greedy-DP agent tries all possible maps for the first node (keeping all other mappings static), and chooses the action that leads to the maximum reward. It repeats this for every single node in the workload. Once it reaches the end, it circles back to the first node and repeats the entire process, conducting several passes. The Greedy-DP agent essentially assumes conditional independence of mappings across the layers to reduce the solution space from $9^N$ to $9 \times N$, where $N$ is the number of layers. While the conditional independence is a fairly naïve assumption, running multiple passes through the graph produces a reasonable solution.
Evolutionary Algorithm (EA) agent ablates the policy gradient component within EGRL and uses only the evolutionary component to train the RL agent.
Policy Gradient (PG) agent ablates the evolutionary component of EGRL and tests the modified SAC-discrete algorithm in isolation.
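The Greedy-DP baseline described above amounts to a coordinate-wise search over nodes. The sketch below is a minimal version under stated assumptions: `reward_fn` stands in for a hardware measurement, the starting map places everything in DRAM (the initial mapping listed in Appendix B), and the toy reward is purely illustrative.

```python
from itertools import product

MEMORIES = ("DRAM", "LLC", "SRAM")
# One (weight_memory, activation_memory) pair per node: 9 choices.
NODE_ACTIONS = list(product(MEMORIES, repeat=2))


def greedy_dp(num_nodes, reward_fn, num_passes=2, initial=("DRAM", "DRAM")):
    """Sweep the nodes in order, keeping the best action for each in turn."""
    mapping = [initial] * num_nodes
    best = reward_fn(mapping)
    for _ in range(num_passes):
        for node in range(num_nodes):
            for action in NODE_ACTIONS:
                # Change only this node; all other decisions stay fixed.
                candidate = mapping[:node] + [action] + mapping[node + 1:]
                score = reward_fn(candidate)
                if score > best:
                    mapping, best = candidate, score
    return mapping, best


def toy_reward(mapping):
    """Stand-in for a latency measurement: counts tensors placed in SRAM."""
    return sum((w == "SRAM") + (a == "SRAM") for w, a in mapping)


if __name__ == "__main__":
    print(greedy_dp(4, toy_reward, num_passes=1))
```

Each pass costs 9 evaluations per node, which is what makes the method tractable; the price is that interactions between nodes are only captured through repeated passes.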
Figure 4 shows the relative speedup achieved for the various agents tested on the ResNet-50, ResNet-101 and BERT workloads. The speedups are reported relative to the compiler and are measured directly on the NNP-I hardware. Results demonstrate that EA and EGRL significantly improve upon the compiler consistently across all three workloads. Greedy-DP approaches baseline performance while the PG agent fails to reach it.
ResNet-50: EGRL and EA significantly outperform the baseline compiler as well as the other agents (final speedups are shown in Figure 4). Greedy-DP underperforms the compiler, and PG also fails to reach it.
ResNet-101: EGRL significantly outperforms the baseline compiler and all other agents, with EA coming second. This performance gap demonstrates the role played by the collaborative learning using the shared replay buffer in EGRL. While the PG learner fails to find full mapping solutions by itself, the partial solutions it finds carry vital information. The EA population in EGRL directly leverages this information to achieve better mapping solutions than what it could find by itself. Greedy-DP outperforms the compiler, while PG does not reach it.
BERT: EGRL and EA significantly outperform the compiler as well as the other agents. Greedy-DP greatly underperforms the compiler. This is unsurprising as BERT is comparatively much larger in size than the two ResNet models. Thus, the simplistic assumption of conditional independence amongst node-level actions made by Greedy-DP begins to falter when the number of nodes increases. PG fails to find good mappings.
Figure 5 reports the generalization performance of the GNN policy used in EGRL after being trained on BERT and ResNet-50. Here, the GNN policy is trained on one workload and performance is reported on other workloads without any fine-tuning. Results demonstrate that policies trained on either workload achieve decent zero-shot transfer to other workloads. We observe some intermediate drops in transfer performance through training, but the overall trend shows that the intermediate representation and knowledge encoded by the GNN policy transfers effectively to the other workloads. As training progresses, we see sharper dips and inconsistent transfer performance marked by larger variance. This is reminiscent of overfitting, where the GNN policy optimizes for the specifics of its training workload, degrading its ability to generalize.
5.2 Visualizing Memory Mappings
Figure 6 employs a UMAP embedding (mcinnes2018umap) to illustrate the differences between the mapping solutions found by the compiler and during different phases of training. For each workload, we collected its mappings twice: first when the agent’s mappings approximately reach the compiler’s speedup performance ($\approx 1$), denoted as compiler-competitive-mappings, and second when the agent reaches its best recorded speedup, denoted as best-mappings. Since these mappings represent a collection of discrete decisions per node, we represent them with a one-hot categorical expression and concatenate them across all nodes of the workload. Given this representation, we use the Jaccard distance (niwattanakul2013using) to compute the UMAP embedding. We also choose the neighbour size to balance between the global and local structure captured by the projection.
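The encoding and distance used for this visualization can be sketched as follows: each (weight, activation) decision is one-hot encoded, the codes are concatenated across nodes, and the Jaccard distance is computed between the resulting binary vectors. The function names are hypothetical; the pairwise distances would then be handed to a UMAP implementation.

```python
MEMORIES = ("DRAM", "LLC", "SRAM")


def one_hot_mapping(mapping):
    """Concatenate one-hot codes for every (weight, activation) decision."""
    bits = []
    for weight_mem, act_mem in mapping:
        for chosen in (weight_mem, act_mem):
            bits.extend(1 if chosen == mem else 0 for mem in MEMORIES)
    return bits


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the positions set to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


if __name__ == "__main__":
    map_a = [("DRAM", "SRAM"), ("LLC", "LLC")]
    map_b = [("DRAM", "SRAM"), ("SRAM", "LLC")]
    print(jaccard_distance(one_hot_mapping(map_a), one_hot_mapping(map_b)))
```

Two maps that agree on every decision have distance 0, and maps that disagree on every decision have distance 1, so the distance directly reflects the share of differing placements.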
Results show that compiler-competitive-mappings and best-mappings are well-separable across all three workloads. While we see some mixing, the general trend suggests strong separability between the two classes of mappings. Further, the compiler’s mapping also fell within the cluster of compiler-competitive-mappings across all three workloads. This suggests that the agents learn to mimic the compiler’s mappings at some point in their training. This is unsurprising as the reward we use to train the agents before they find valid mappings is based on differences with the compiler.
Interestingly, the intra-cluster spread for compiler-competitive-mappings is markedly higher than best-mappings across all three workloads. This indicates that the mappings associated with higher speedups are more self-similar than those that are less performant. This is unsurprising since the number of inferior mappings is higher than that of the superior ones.
5.2.1 Differences in Memory Mappings
Figure 7 provides some insights into the differences in mappings between the compiler and EGRL.
The transition matrices on top show a high-level view of how the distribution of tensors to the different memories shifted. Each row corresponds to a memory unit. The corresponding columns indicate how EGRL fractionally re-distributed tensors originally mapped to that unit into all available memories. At the bottom, we illustrate how each tensor in a workload was mapped by the compiler and by EGRL. Each band represents either a weight or an activation tensor.
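A transition matrix of the kind described above can be computed from two flat lists of per-tensor placements. This is an illustrative sketch with hypothetical names and toy inputs, not the code used to produce the figure.

```python
MEMORIES = ("DRAM", "LLC", "SRAM")


def transition_matrix(compiler_map, agent_map):
    """Row i, column j: fraction of tensors the compiler placed in memory i
    that the agent placed in memory j. Rows with no tensors stay all-zero."""
    index = {mem: i for i, mem in enumerate(MEMORIES)}
    counts = [[0] * len(MEMORIES) for _ in MEMORIES]
    for src, dst in zip(compiler_map, agent_map):
        counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


if __name__ == "__main__":
    compiler = ["DRAM", "DRAM", "LLC", "SRAM"]
    agent = ["SRAM", "DRAM", "LLC", "SRAM"]
    for mem, row in zip(MEMORIES, transition_matrix(compiler, agent)):
        print(mem, row)
```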
While it is difficult to interpret the mapping decisions reliably, these visualizations indicate that EGRL generally found maps that avoided the slower but higher-capacity DRAM. This difference is particularly prominent for the weight tensors. EGRL also favored contiguity, where tensors from neighboring layers generally got mapped to the same type of memory. Both are performant strategies to optimize latency, but they are not trivial to achieve using heuristics that need to trade off speed and capacity for a large number of tensors. EGRL’s graph-based global view of the workloads enables it to make more globally informed allocations than the sequential decision making of the compiler.
6 Discussion and Future Work
This paper introduced EGRL, a hybrid framework that pairs graph neural networks with population-based reinforcement learning to learn effective memory mapping solutions for large deep learning workloads. We train our policies end-to-end on the NNP-I chipset to ensure that the solutions are robust to the real-world constraints and uncertainties of the chip. Complementary to other approaches like compression (cheng2018model; kim2015compression), sparsification (venkatesh2017accelerating) and network pruning (han2015deep), EGRL keeps the workload unchanged and instead tackles how its tensors are mapped in hardware.
We show that EGRL scales effectively across varying sizes and operational types of DL workloads. Results show that EGRL outperforms several learning and search methods as well as the heuristic logic of the compiler. This scalability paves the way for learning-based agents to tackle other hardware mapping problems. Specifically, our future work will expand the action space of the EGRL agent to control other settings like batch size, ring frequencies, power efficiency and data decomposition.
7 Broader Impacts
We demonstrated the use of deep reinforcement learning in tackling the hardware mapping problem. Specifically, we showed that we can use GNNs and population-based reinforcement learning to achieve a 28-78% speedup in inference on prominent deep learning models for computer vision (ResNet-50 and ResNet-101) and natural language processing (BERT). These models are key participants in the ongoing widespread proliferation of deep learning in industrial and consumer applications. For instance, ResNet-based models are frequently used in enabling autonomous driving (chi2017deep; teichmann2018multinet) applications. Similarly, BERT is a key model used for real-world deployment of chatbots (bathija2020guided), document understanding (yang2019simple; adhikari2019docbert) and natural language processing (tenney2019bert). All these applications are time-critical as they involve interaction with a customer. Further, some, like autonomous driving, are additionally safety-critical as a fast perception engine is crucial for effective and safe driving. The ability to maintain low latency is thus critical for both safety and scalability of such applications. The solution we develop in our paper is an enabling technology towards this goal.
One limitation of our solution is that the decisions taken by the RL agent are difficult to explain and understand. A broader shift towards RL based optimization, while improving overall performance, could therefore lead to lower explainability of the resulting solution. We are encouraged by the growing research in explainability related to deep learning algorithms and reinforcement learning to address this issue in a meaningful way.
As it pertains to using RL to automate design, one potential undesired effect is that by optimizing for greater throughput speeds, one might inadvertently over-optimize to a given metric without considering other important factors in the application. In the case of optimizing hardware, the RL agent may suggest a design that significantly decreases the lifetime of the hardware by overloading certain parts, which could also impact overall reliability of the hardware. Similarly, software products exposed to automatic agents need to be robustly designed so that the agent cannot manipulate the software to cause undesired side effects. One example is that the agent directly changes the compiler software or the firmware on the hardware itself which may cause undesired downstream effects. Moreover, if the decisions taken by RL agent are difficult to explain, this could lead to significant challenges in finding and resolving issues for a variety of applications, and lead to lower confidence in the applicability and reliability of many deep learning based methods.
Appendix A Graph Embedding
Table 1 details the features we used for the node embedding. These features encapsulate information about the input and output tensors of the given operation, as well as summary information about future layers.
|Size in bytes of the weights if exist, 0 otherwise|
|Input feature map size on the x axis|
|Input feature map size on the y axis|
|Input feature map size on the z axis|
|Output feature map size on the x axis|
|Output feature map size on the y axis|
|Output feature map size on the z axis|
|Total size of the input feature map ($x \times y \times z$)|
|Total size of the output feature map ($x \times y \times z$)|
|Total number of operations after current|
|Total number of weights from current to the last node|
|Number of groups - Convolution related parameter, set to 0 otherwise|
|Kernel size on x axis - Convolution related parameter, set to 0 otherwise|
|Kernel size on y axis - Convolution related parameter, set to 0 otherwise|
|Stride size - Convolution related parameter, set to 0 otherwise|
|Padding size - Convolution related parameter, set to 0 otherwise|
|Dilation - Convolution related parameter, set to 0 otherwise|
|Input batch size|
Appendix B Hyperparameters
|Hyperparameter||Range explored||Value used|
|GNN hidden layer size||[32, 64, 128]||128|
|GNN output layer size||[32, 64, 128]||128|
|Number of GNN attention heads||4|
|# Steps per Episode||[1, 5, 10]||1|
|Initial mapping action||[’DRAM’]||’DRAM’|
|Reward for invalid mapping||[-10, -1]||-1|
|Discount Rate||[0.9, 0.97, 0.99]||0.99|
|EA population size||[10, 20]||20|
|PG Rollout size||[0, 1, 10]||1|
|Fraction of EA population that are Boltzmann||[0.1, 0.2, 0.5]||0.2|
|Total steps in the environment||[4000, 10000]||4000|
|Replay buffer size||||100000|
|Critic learning rate||[1e-3, 1e-4]||1e-3|
|Actor learning rate||[1e-3, 1e-4]||1e-3|
|Alpha (Entropy Coefficient)||[0.05, 0.1, 0.2]||0.05|
|Tau (Double-Q Network synchronization rate)||[1e-3]||1e-3|
|Batch size for PG||24||24|
|Reward scaling multiplier||5||5|
|Gradients steps per environment step||1||1|
Appendix C EGRL
EGRL incorporates EA’s population-based search with powerful gradient-based methods from DRL to expedite learning. In this work, we instantiate the EA population to use both the GNN encodings as well as a Boltzmann chromosome encoding to direct its search. Concurrently, we use a modified SAC (haarnoja2018) algorithm as our gradient-based technique in training the GNN policies. Algorithm 2 details the EGRL algorithm.
A general flow of the EGRL algorithm proceeds as follows: a mixed population of GNN policies and Boltzmann-based policies is initialized with random weights. In addition to the population, one additional actor network (referred to as the PG actor henceforth) is initialized alongside a critic network. The population is then evaluated in the environment by allowing it to control the memory mapping for the specified workload on the NNP-I hardware. A selection operator then selects a portion of the population for survival with probability commensurate with their relative performance. The population is then probabilistically perturbed through mutation and crossover operations to create the next generation. A select portion of top performers are preserved as elites and are shielded from the mutation step.
Shared Replay Buffer: Unlike a traditional evolutionary population, each individual (whether GNN or Boltzmann-based) stores its experience, defined by the tuple (current state, action, next state, reward), in a globally shared replay buffer. This is done for every interaction that takes place with the hardware to maximize data efficiency. The critic samples a random minibatch from this shared replay buffer and uses it to update its parameters using gradient descent. The critic is then used to train the PG actor using the sampled policy gradient.
The shared replay buffer is a key mechanism that enables the sharing of information across the varying learning methods. In contrast to a standard search method, which would extract the performance score and disregard the underlying data immediately, EGRL retains every interaction in the global buffer and engages the PG actor and critic to learn from them repeatedly using powerful gradient-based methods. This enables maximal information extraction from each individual experience, as interfacing with the hardware is an expensive operation.
Mixed Exploration: A noisy version of the PG actor, using a Gaussian noise generator, is used to generate additional experiences for the replay buffer. In contrast to the population of GNN actors, which explore through noise in their neural weights, the PG actor explores through noise in its action space. Boltzmann chromosomes tread the line in between: they explore in a parameter space that is more directly connected to the action space. Overall, the exploration techniques are complementary and collectively lead to an effective exploration of the solution space.
Migration: Periodically, the PG actor network’s weights are copied into the evolutionary population. This process enables the evolutionary framework to directly leverage the information learned through gradient descent. This process also serves to stabilize learning and make it more robust to deception. If the policy learned by the PG actor is favorable, it will be selected to survive and extend its influence to the population over subsequent generations. However, if it is bad, it will be selected against and discarded. This mechanism ensures that the flow of information from the PG actor to the evolutionary population is constructive.
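The evolutionary side of this loop (evaluate, store experience, preserve elites, select, mutate, and accept migrants) can be sketched in a deliberately simplified form. In this toy version individuals are memory maps (lists of integers) rather than policies, crossover is omitted, tournament selection is an assumed choice of selection operator, and the fitness function is a stand-in for a hardware measurement. None of the names come from the authors' code.

```python
import random


def evolve_generation(population, fitness_fn, replay_buffer, rng,
                      num_elites=1, mutation_rate=0.1, choices=(0, 1, 2)):
    """One generation: evaluate, log experience, keep elites, select, mutate."""
    scored = []
    for individual in population:
        fitness = fitness_fn(individual)
        # Every evaluation is stored so a gradient-based learner can reuse it.
        replay_buffer.append((tuple(individual), fitness))
        scored.append((fitness, individual))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    elites = [list(ind) for _, ind in scored[:num_elites]]  # not mutated

    offspring = []
    while len(elites) + len(offspring) < len(population):
        # Tournament selection: the fitter of two random individuals survives.
        first, second = rng.sample(scored, 2)
        parent = first[1] if first[0] >= second[0] else second[1]
        child = [rng.choice(choices) if rng.random() < mutation_rate else gene
                 for gene in parent]
        offspring.append(child)
    return elites + offspring


def migrate(population, pg_solution):
    """Copy the PG learner's solution over the last (non-elite) slot."""
    population[-1] = list(pg_solution)


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[rng.choice((0, 1, 2)) for _ in range(6)] for _ in range(8)]
    buffer = []
    for _ in range(20):
        pop = evolve_generation(pop, sum, buffer, rng)
    print(max(sum(p) for p in pop), len(buffer))
```

Because the elite is copied unchanged into the next generation, the best fitness in the population never decreases, while a migrated solution survives only if it competes well in later generations.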
Appendix D Policy Gradient modifications to SAC
Policy Gradient Algorithm: We build on SAC (haarnoja2018) to tackle our large multi-discrete action space. Since our policy is discrete, we compute the entropy of each node’s action distribution directly as
$$\mathcal{H}(\pi(\cdot \mid s)) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s).$$
We then average over all nodes to compute the overall entropy of the policy. Further, we use a noisy version of the one-hot encoded behavioral action to compute our Bellman update, taking the minimum of two heads from the Q-network, following (fujimoto2018addressing). The noisy action is computed by adding Gaussian noise, clipped to a bounded range, to the one-hot encoded action.
This noisy action smooths the value estimate towards similar state-action value estimates by the policy. It serves to make the policy smooth and addresses overfitting to the one-hot encoded behavioral output. The actor is trained using the sampled policy gradient.
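The two modifications described in this appendix, a directly computed entropy averaged over nodes and a noisy one-hot action, can be sketched as below. The noise scale and clipping bound are assumed placeholder values, and the function names are hypothetical.

```python
import math
import random


def policy_entropy(node_probs):
    """Mean categorical entropy over nodes; each row of node_probs sums to 1."""
    total = 0.0
    for probs in node_probs:
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(node_probs)


def noisy_one_hot(action, num_choices=3, sigma=0.1, clip=0.2, rng=random):
    """One-hot action plus Gaussian noise clipped to [-clip, clip].

    sigma and clip are assumed values chosen only for illustration.
    """
    noisy = []
    for i in range(num_choices):
        noise = max(-clip, min(clip, rng.gauss(0.0, sigma)))
        noisy.append((1.0 if i == action else 0.0) + noise)
    return noisy


if __name__ == "__main__":
    print(policy_entropy([[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]]))
    print(noisy_one_hot(1, rng=random.Random(0)))
```

Clipping keeps the perturbed vector close to the original one-hot code, so the critic is evaluated near, rather than far from, the action that was actually taken.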
Appendix E Boltzmann Chromosome
Figure 8 illustrates the operation of the Boltzmann chromosome for a particular action choice in one node. Parameters for the prior ($P_1$, $P_2$, $P_3$) and temperature ($T$) fully encode the chromosome’s policy. To compute an action, we first compute the probabilities by applying the Boltzmann softmax operation with the associated prior and temperature. The action is then sampled from this probability distribution. The choice of temperature directly modulates the exploration-exploitation knob of decision making. A higher temperature leads to a higher-entropy probability distribution, enabling higher exploration. In contrast, a lower temperature leads to lower entropy in the probability distribution, enabling exploitation of the prior information.
For the agent policy described in this paper, a Boltzmann chromosome solution comprises priors and temperature parameters for each node and action choice in the computational graph. Learning through seeding, mutation or crossover involves a direct update of these parameters. Importantly, these parameters are learned independently within the context of each node, allowing for varying degrees of exploration-exploitation across nodes. For instance, the agent could be very confident about mapping a specific node while being unsure about a different node of the same workload. This enables the agent to systematically balance the exploration-exploitation tradeoff at the resolution of individual node actions.
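The direct parameter updates mentioned above can be illustrated with seeding and mutation. In this sketch a chromosome is a dictionary of per-node priors and temperatures, simplified to one action choice per node; seeding copies a policy's action probabilities into the priors, and mutation adds Gaussian noise independently per node. The representation, names, and noise scale are assumptions made for illustration.

```python
import random


def seed_from_posterior(posterior, temperature=1.0):
    """Create a chromosome whose per-node priors copy a policy's probabilities."""
    return {
        "prior": [list(probs) for probs in posterior],
        "temperature": [temperature] * len(posterior),
    }


def mutate(chromosome, rng, sigma=0.05, min_temperature=1e-3):
    """Return a perturbed copy; each node's parameters change independently."""
    priors = [[p + rng.gauss(0.0, sigma) for p in node]
              for node in chromosome["prior"]]
    temps = [max(min_temperature, t + rng.gauss(0.0, sigma))
             for t in chromosome["temperature"]]
    return {"prior": priors, "temperature": temps}


if __name__ == "__main__":
    parent = seed_from_posterior([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
    child = mutate(parent, random.Random(1))
    print(child["temperature"])
```

Flooring the temperature at a small positive value keeps the softmax well defined after mutation.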